- Issue created by @ressa
- 🇺🇸United States xjm
I'm confused by the title: å and ø appear to be in the override file, while æ is not. Shouldn't the title mention æ instead, if that's what the issue is about?
However, isn't the underlying difference that å and ø are pronounced very differently from the English letters they most closely resemble, whereas the standard transliteration for æ in most languages, including English, is already ae? That's why the override file was needed for them.
In drupal/core/lib/Drupal/Component/Transliteration/data/x00.php, the standard transliterations of AE and ae for 0xC6 and 0xE6 already appear at the expected positions, at least as far as I understand how it works.
Can you provide manual testing steps to more clearly illustrate the bug, or a test case?
- 🇩🇰Denmark ressa Copenhagen
Thanks for looking at this @xjm. I had previously opened the transliteration mapping file core/lib/Drupal/Component/Transliteration/data/x00.php but didn't understand it. I now understand the structure after checking https://byte-tools.com/en/ascii/: 0xc0 is the first value on the line, 0xc1 is number two, and so on.
Like you write, for some reason, the standard English transliteration already defines Æ correctly on this line, at position #7, as 'AE' (0xC6):
0xC0 => 'A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I',
... and æ at position #7, as 'ae' (0xE6):
0xE0 => 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i',
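The row layout described above can be sketched in a few lines. This is a standalone Python illustration, not Drupal's PHP code: the `ROWS` dictionary and the `lookup` helper are hypothetical names, and only the two rows quoted above are reproduced.

```python
# Hypothetical miniature of two rows from a table laid out like x00.php.
# Each row starts at a code point ending in 0 and lists 16 replacements.
ROWS = {
    0xC0: ['A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C',
           'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I'],
    0xE0: ['a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c',
           'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i'],
}


def lookup(char):
    """Return the replacement for char, or None if its row is not in ROWS."""
    code = ord(char)
    if code > 0xFF:
        return None  # this sketch only covers the first 256 code points
    row_start = code & 0xF0  # 0xC6 -> 0xC0, the first code point of the row
    offset = code & 0x0F     # 0xC6 -> 6, i.e. position #7 counting from 1
    row = ROWS.get(row_start)
    return row[offset] if row is not None else None


print(lookup('Æ'))  # AE
print(lookup('æ'))  # ae
print(lookup('Å'))  # A
```

The zero-based offset 6 is the "position #7" mentioned above, which is why 0xC6 and 0xE6 land on 'AE' and 'ae'.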
I am not sure what the thought behind transliterating æ differently from å and ø is. The problem is that a Drupal site is quite often installed with English as its only language while the content is in reality Danish, Norwegian, Icelandic, Faroese or Swedish. In such cases, æ will be transliterated correctly, but not ø or å, which is inconsistent. It can also cause slight differences in URL aliases on multilingual sites, where you would prefer them to be consistent, as can be seen in, for example, the issue German letter "ä", "ü" in translation.
If the characters æ, ø and å were all defined in the generic transliteration data file, Danish, Norwegian, Icelandic, Faroese and Swedish web sites running on an English Drupal installation would not have to create a custom module, and URL aliases would be consistent between languages. I have created an MR.
I tried removing the da.php file but got weird test results, so I restored it.
Maybe mappings should also be added for Aa, Ae and Oe? I am not sure in which of the data files this needs to be done, though.
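The inconsistency described above can be illustrated with a small sketch of a generic table plus a per-language override. This is standalone Python, not Drupal's implementation: `GENERIC`, `OVERRIDES` and `transliterate` are hypothetical names, and the override values for 'da' are assumptions chosen for illustration rather than a copy of da.php.

```python
# Generic replacements, matching the x00.php rows quoted in this thread
# (Ø -> 'O' is taken from the expected strings in PhpTransliterationTest).
GENERIC = {0xC5: 'A', 0xC6: 'AE', 0xD8: 'O',
           0xE5: 'a', 0xE6: 'ae', 0xF8: 'o'}

# Assumed per-language overrides; the real da.php may differ.
OVERRIDES = {'da': {0xC5: 'Aa', 0xD8: 'Oe', 0xE5: 'aa', 0xF8: 'oe'}}


def transliterate(text, langcode='en'):
    """Replace non-ASCII characters, preferring the language override."""
    overrides = OVERRIDES.get(langcode, {})
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            out.append(ch)  # plain ASCII passes through unchanged
        else:
            out.append(overrides.get(code, GENERIC.get(code, '?')))
    return ''.join(out)


print(transliterate('blåbærgrød', 'en'))  # blabaergrod
print(transliterate('blåbærgrød', 'da'))  # blaabaergroed
```

With English as the only language, æ becomes "ae" but å and ø collapse to single letters, which is the mismatch in URL aliases described above.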
- 🇩🇰Denmark ressa Copenhagen
Adding some more context in the Issue Summary, in an attempt to make it clearer what the challenge is.
- 🇺🇸United States xjm
Thanks @ressa for the excellent IS update. (Just adding some punctuation; disregard my small changes.)
- 🇺🇸United States xjm
This is one of those things best reviewed with git diff --color-words, so that's what I did. Most of the changes look correct, but there are a couple I don't understand.
1. The changes to the unicode table look correct and reflect the summary.
2. Several tests change transliterations in the expected ways.
3. But this one seems to be transliterating å to a instead of aa. Kan du hjelpe meg å forstå hvorfor? [Can you help me understand why?] (See what I did there? Forgive any grammatical errors; I only had one semester of Norwegian.)
4. And then this I do not get at all in PhpTransliterationTest: we seem to be adding å instead of removing them?
I will try to understand better, but given that the scope has expanded to correcting the base table, let's ask for subsystem review here too.
Thanks a bunch!
- 🇩🇰Denmark ressa Copenhagen
Thanks for a fast response and thorough review @xjm, I really appreciate it!
Great idea using coloured diffs, they really help spot the changes. Og godt norsk efter kun et semester, jeg forstår det! (jeg håber du kan læse dansk) [And good Norwegian after only one semester, I understand it! (I hope you can read Danish)]
About item #3 in your list (modules/search/tests/src/Functional/SearchNodeDiacriticsTest.php): I could not figure out why the test broke. It is looking for the string "påŔťıçȉpǎǹţș", which seems to be present in the page. Since the test covers "diacritics in the search phrase", I figured I could simply change "å" to "a", because I mistakenly assumed "å" isn't a diacritic.
But I now realize that "å" is indeed also a diacritic, so I reverted the diacritics test in SearchNodeDiacriticsTest.php (#3 in your list) to its original state, as well as the "Test all characters in the Unicode" part of PhpTransliterationTest.php (#4 in your list), so these tests will now fail.
It is interesting that Æ is included in x00.php, but not Ø, since it is not a diacritic, as I understand it.
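One way to check which of these letters carry a diacritic in the Unicode sense is to look at their canonical decomposition. The following is a minimal Python sketch for illustration only (it is not part of Drupal, and `has_combining_mark` is a hypothetical helper name): å decomposes into "a" plus a combining ring, while æ and ø have no decomposition at all.

```python
import unicodedata


def has_combining_mark(char):
    """True if the canonical decomposition of char contains a combining mark."""
    decomposed = unicodedata.normalize('NFD', char)
    return any(unicodedata.combining(c) for c in decomposed)


for letter in 'åæø':
    print(letter, has_combining_mark(letter))
# å True
# æ False
# ø False
```

By this measure å is a base letter with a diacritic, whereas æ and ø are standalone letters, which matches the observation above.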
Great idea to get more eyes on this by requesting a subsystem maintainer review. I don't know whether the approach in the current MR is the correct method, or even whether it's a good idea to begin with. Maybe it has unintended consequences?
I have added some more things to consider in the Issue Summary, under "Remaining tasks". Thanks so far!
- 🇺🇸United States xjm
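(jeg håber du kan læse dansk) [I hope you can read Danish]
Ja! Jeg vet du har den poteten i halsen, men jeg kan lese dansk greit. [Yes! I know you have that potato in your throat, but I can read Danish just fine.]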
Test fail is:
Php Transliteration (Drupal\Tests\Component\Transliteration\PhpTransliteration)
 ✘ Remove diacritics with data set 0
   ┐
   ├ Failed asserting that two strings are equal.
   ┊ ---·Expected
   ┊ +++·Actual
   ┊ @@ @@
   ┊ -'AAAAAAÆCEEEEIIII'
   ┊ +'AAAAAÅÆCEEEEIIII'
   │
   │ /builds/issue/drupal-3525904/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:34
   ┴
 ✘ Remove diacritics with data set 1
   ┐
   ├ Failed asserting that two strings are equal.
   ┊ ---·Expected
   ┊ +++·Actual
   ┊ @@ @@
   ┊ -'ÐNOOOOO×OUUUUYÞß'
   ┊ +'ÐNOOOOO×ØUUUUYÞß'
   │
   │ /builds/issue/drupal-3525904/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:34
   ┴
 ✘ Remove diacritics with data set 2
   ┐
   ├ Failed asserting that two strings are equal.
   ┊ ---·Expected
   ┊ +++·Actual
   ┊ @@ @@
   ┊ -'aaaaaaæceeeeiiii'
   ┊ +'aaaaaåæceeeeiiii'
   │
   │ /builds/issue/drupal-3525904/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:34
   ┴
 ✘ Remove diacritics with data set 3
   ┐
   ├ Failed asserting that two strings are equal.
   ┊ ---·Expected
   ┊ +++·Actual
   ┊ @@ @@
   ┊ -'ðnooooo÷ouuuuyþy'
   ┊ +'ðnooooo÷øuuuuyþy'
   │
   │ /builds/issue/drupal-3525904/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:34
   ┴
So, yeah, I think we'll want @amateescu's guidance here on whether to change the test expected output, add additional test coverage, change code, or what. 🤷♀️
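The failures are consistent with a diacritic-removal step that only substitutes a character when its replacement is exactly one character long. That rule is an assumption inferred from Æ staying unchanged in the expected strings; it is not confirmed in this thread. The Python sketch below is illustrative only: `remove_diacritics`, `BEFORE` and `AFTER` are hypothetical names, and the two-letter values in `AFTER` are placeholders for whatever the MR actually uses.

```python
def remove_diacritics(text, table):
    """Swap in a replacement only when it is exactly one character long."""
    out = []
    for ch in text:
        replacement = table.get(ord(ch), ch)
        # Assumed rule: multi-character replacements leave the original as is.
        out.append(replacement if len(replacement) == 1 else ch)
    return ''.join(out)


# Base table before the change: Å and Ø map to single letters.
BEFORE = {0xC5: 'A', 0xC6: 'AE', 0xD8: 'O'}
# Hypothetical base table after the change: two-letter replacements.
AFTER = {0xC5: 'Aa', 0xC6: 'AE', 0xD8: 'Oe'}

print(remove_diacritics('ÅÆØ', BEFORE))  # AÆO
print(remove_diacritics('ÅÆØ', AFTER))   # ÅÆØ
```

Under that assumption, moving two-letter replacements into the base table makes Å and Ø behave like Æ, so they are left in place, which is the pattern in the four failing data sets.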
- 🇩🇰Denmark ressa Copenhagen
Ha, ja det synes svenskerne og nordmændene jo vi har :) [Ha, yes, that's what the Swedes and Norwegians think we have :)]
Yes, it's probably best to wait and see what @amateescu thinks before proceeding. Reviewing the updated MR can probably wait as well, since the current approach could be breaking a well-planned structure.