Danish characters ø and å are not transliterated correctly

Created on 21 May 2025, about 2 months ago

Problem/Motivation

When generating URL path aliases from nodes with Danish characters in the title, the characters 'å' and 'ø' are incorrectly transliterated to 'a' and 'o' instead of 'aa' and 'oe' when the node language is English. Danish language nodes work as expected.

A node with the title "1æ 2ø 3å 4Æ 5Ø 6Å" gets these paths, first in a node with language Danish, and then English:

/da/1ae-2oe-3aa-4ae-5oe-6aa
/en/1ae-2o-3a-4ae-5o-6a

The English language node should get this path (identical to the Danish node's path, except for the en prefix):

/en/1ae-2oe-3aa-4ae-5oe-6aa
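
The difference between the two paths can be reproduced with a minimal sketch, assuming transliteration consults a per-language override table before falling back to a generic table. The arrays below are cut-down, hypothetical stand-ins for x00.php and da.php, and sketch_alias() is an illustrative helper, not Drupal's or Pathauto's actual alias code:

```php
<?php

// Hypothetical, cut-down stand-ins for the generic table (x00.php) and the
// Danish overrides (da.php); only the six code points from the example title.
$generic = [
  0xC5 => 'A', 0xC6 => 'AE', 0xD8 => 'O',
  0xE5 => 'a', 0xE6 => 'ae', 0xF8 => 'o',
];
$overrides = [
  'da' => [0xC5 => 'Aa', 0xD8 => 'Oe', 0xE5 => 'aa', 0xF8 => 'oe'],
];

/**
 * Builds a crude alias: transliterate, then lowercase and hyphenate.
 */
function sketch_alias(string $title, string $langcode, array $generic, array $overrides): string {
  // Array union keeps the left-hand value, so language overrides win.
  $table = ($overrides[$langcode] ?? []) + $generic;
  $map = [];
  foreach ($table as $code => $replacement) {
    // Convert the code point to its UTF-8 character via a numeric entity.
    $map[html_entity_decode("&#$code;", ENT_NOQUOTES, 'UTF-8')] = $replacement;
  }
  return strtolower(str_replace(' ', '-', strtr($title, $map)));
}

$title = '1æ 2ø 3å 4Æ 5Ø 6Å';
echo '/da/', sketch_alias($title, 'da', $generic, $overrides), "\n";
echo '/en/', sketch_alias($title, 'en', $generic, $overrides), "\n";
```

With only a 'da' entry in the override table, the 'en' call falls through to the generic values for ø and å, which is the behaviour reported here.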

In #2895315: Danish characters are not translated correctly with transliteration, the transliteration file was renamed from dk.php to da.php (core/lib/Drupal/Component/Transliteration/data/da.php).

The Danish alphabet contains the special characters æ, ø and å. However, I noticed that æ is missing from the core/lib/Drupal/Component/Transliteration/data/da.php file:

<?php

/**
 * @file
 * Danish transliteration data for the PhpTransliteration class.
 */

$overrides['da'] = [
  0xC5 => 'Aa',
  0xD8 => 'Oe',
  0xE5 => 'aa',
  0xF8 => 'oe',
];

Should these two be added, for Æ and æ?

0xC6 => 'Ae',
0xE6 => 'ae',

From https://byte-tools.com/en/ascii/

Proposed resolution

  • Document a workaround, so that the letters ø and å are turned into oe and aa, in any language.
  • Set up transliteration in Drupal, so that these substitutions take place, in all languages, English, Danish, Spanish, etc.:
    æ > ae
    Æ > ae
    Ø > oe
    ø > oe
    Å > aa
    å > aa
    
  • Add Æ and æ in da.php?
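
One possible shape for the workaround in the first bullet is a small custom module that implements hook_transliteration_overrides_alter(), which core documents as receiving the override array for a single language, keyed by code point. This is a sketch under that assumption; "mymodule" is a hypothetical module name, and the behaviour should be verified against the core version in use:

```php
<?php

/**
 * Implements hook_transliteration_overrides_alter().
 *
 * "mymodule" is a hypothetical module name used for illustration.
 */
function mymodule_transliteration_overrides_alter(array &$overrides, string $langcode): void {
  // Apply the Danish-style replacements whatever the language is.
  // Å and Ø.
  $overrides[0xC5] = 'Aa';
  $overrides[0xD8] = 'Oe';
  // å and ø.
  $overrides[0xE5] = 'aa';
  $overrides[0xF8] = 'oe';
}

// Standalone demonstration: call the function directly with an empty table,
// as if no overrides existed for English.
$overrides = [];
mymodule_transliteration_overrides_alter($overrides, 'en');
print_r($overrides);
```

Because the function ignores $langcode, the same four replacements would apply to English, Danish, Spanish and any other language, which is what the second bullet asks for.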

Remaining tasks

User interface changes

API changes

Data model changes

🐛 Bug report
Status

Active

Version

11.0 🔥

Component

transliteration system

Created by

🇩🇰Denmark ressa Copenhagen

Comments & Activities

  • Issue created by @ressa
  • 🇩🇰Denmark ressa Copenhagen

    Adding workaround.

  • 🇺🇸United States xjm

    I'm confused by the title -- å and ø appear to be in the override file while æ is not. Shouldn't the title mention æ instead, if that's what the issue is about?

    However, isn't the underlying difference that å and ø are pronounced very differently from the English letters they most closely resemble, whereas the standard transliteration for æ in most languages, including English, is already ae? That's why the override file was needed for them.

    In drupal/core/lib/Drupal/Component/Transliteration/data/x00.php, the standard transliterations of AE and ae for 0xC6 and 0xE6 already appear at the expected positions, at least as far as I understand how it works.

    Can you provide manual testing steps to more clearly illustrate the bug, or a test case?

  • 🇩🇰Denmark ressa Copenhagen

    Thanks for looking at this @xjm. I had previously opened the transliteration mapping file core/lib/Drupal/Component/Transliteration/data/x00.php but didn't understand it. After checking https://byte-tools.com/en/ascii/ I now understand the structure: 0xC0 is the first value in a row, 0xC1 the second, and so on.

    As you write, the standard English transliteration for some reason already defines Æ correctly on this line, at position #7, as 'AE' (0xC6):

    0xC0 => 'A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I',

    ... and æ at position #7, as 'ae' (0xE6):

    0xE0 => 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i',
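
How the positions map to code points can be checked with a short sketch that relies only on standard PHP array behaviour: after an explicit integer key, the values that follow are numbered consecutively. The $base array is a hypothetical name holding just the two rows quoted above, not the full contents of x00.php:

```php
<?php

// Same shape as the two rows quoted above: one explicit key per row, then 15
// values that PHP numbers automatically (0xC1, 0xC2, ... and 0xE1, 0xE2, ...).
$base = [
  0xC0 => 'A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I',
  0xE0 => 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i',
];

// The 7th value in each row lands on 0xC6 (Æ) and 0xE6 (æ) respectively.
printf("0x%02X => %s\n", 0xC6, $base[0xC6]);
printf("0x%02X => %s\n", 0xE6, $base[0xE6]);
// The 6th value is the generic replacement for Å (0xC5) and å (0xE5).
printf("0x%02X => %s\n", 0xC5, $base[0xC5]);
printf("0x%02X => %s\n", 0xE5, $base[0xE5]);
```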

    I am not sure what the reasoning is behind transliterating æ differently from å and ø. The problem is that a Drupal site is quite often installed with English as its only language while the content is actually Danish, Norwegian, Icelandic, Faroese or Swedish. In such cases æ is transliterated correctly, but ø and å are not, which is inconsistent. It can also cause slight differences between URL aliases on multilingual sites, where you would prefer them to be consistent, as can be seen in, for example, the issue German letter "ä", "ü" in translation (Postponed: needs info).

    If the characters æ, ø and å were all defined in the generic transliteration data file, Danish, Norwegian, Icelandic, Faroese and Swedish language web sites running on an English Drupal installation would not need a custom module, and URL aliases would be consistent between languages. I have created an MR for this.

    I tried removing the da.php file but got weird test results, so I restored it.

    Maybe there should also be mapping added for Aa, Ae and Oe? I am not sure in which of the data files this needs to be done, though.

  • 🇩🇰Denmark ressa Copenhagen

    Adding some more context in the Issue Summary, in an attempt to make it clearer what the challenge is.

  • 🇩🇰Denmark ressa Copenhagen

    Adding x00.php lines with æ in Issue Summary.

  • 🇺🇸United States xjm

    Thanks @ressa for the excellent IS update. (Just adding some punctuation; disregard my small changes.)

  • 🇺🇸United States xjm

    This is one of those things best reviewed with git diff --color-words, so that's what I did. Most of the changes look correct, but there are a couple I don't understand.

     1. The changes to the Unicode table look correct and reflect the summary.
     2. Several tests change transliterations in the expected ways.
     3. But this one seems to be transliterating å to a instead of aa. Kan du hjelpe meg å forstå hvorfor? [Can you help me understand why?] (See what I did there? Forgive any grammatical errors; I only had one semester of Norwegian.)
     4. And then this I do not get at all in PhpTransliterationTest: we seem to be adding å instead of removing them?

    I will try to understand better, but given the scope has expanded to correcting the base table, let's ask for subsystem review here too.

    Thanks a bunch!

  • 🇩🇰Denmark ressa Copenhagen

    Thanks for a fast response and thorough review @xjm, I really appreciate it!

    Great idea using coloured diffs, they really help spot the changes. Og godt norsk efter kun et semester, jeg forstår det! (jeg håber du kan læse dansk) [And good Norwegian after only one semester, I understand it! (I hope you can read Danish)]

    About item #3 in your list (modules/search/tests/src/Functional/SearchNodeDiacriticsTest.php), I could not figure out why the test broke. It is looking for the string "påŔťıçȉpǎǹţș", which seems to be present in the page. Since the test covers "diacritics in the search phrase", I figured I could simply change "å" to "a", because I mistakenly assumed "å" isn't a diacritic.

    But I now realize that "å" is indeed also a diacritic, so I reverted the Diacritic test in SearchNodeDiacriticsTest.php (#3 in your list) back to its original state, as well as the "Test all characters in the Unicode" bit in PhpTransliterationTest.php (#4 in your list), so these tests will now fail.

    It is interesting that Æ is included in x00.php, but not Ø, since it is not a diacritic, as I understand it.

    Great idea to get some more eyes on this, by requesting a subsystem maintainer review, since I don't know if the approach in the current MR is the correct method, or even if it's a good idea to start with, maybe it has unintended consequences ...?

    I have added some more things to consider in the Issue Summary, under "Remaining tasks". Thanks so far!

  • 🇺🇸United States xjm

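    (jeg håber du kan læse dansk) [I hope you can read Danish]

    Ja! Jeg vet du har den poteten i halsen, men jeg kan lese dansk greit. [Yes! I know you have that potato in your throat, but I can read Danish fine.]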

    Test fail is:

    Php Transliteration (Drupal\Tests\Component\Transliteration\PhpTransliteration)
     ✘ Remove diacritics with data set 0
       ┐
       ├ Failed asserting that two strings are equal.
       ┊ ---·Expected
       ┊ +++·Actual
       ┊ @@ @@
       ┊ -'AAAAAAÆCEEEEIIII'
       ┊ +'AAAAAÅÆCEEEEIIII'
       │
       │ /builds/issue/drupal-3525904/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:34
       ┴
     ✘ Remove diacritics with data set 1
       ┐
       ├ Failed asserting that two strings are equal.
       ┊ ---·Expected
       ┊ +++·Actual
       ┊ @@ @@
       ┊ -'ÐNOOOOO×OUUUUYÞß'
       ┊ +'ÐNOOOOO×ØUUUUYÞß'
       │
       │ /builds/issue/drupal-3525904/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:34
       ┴
     ✘ Remove diacritics with data set 2
       ┐
       ├ Failed asserting that two strings are equal.
       ┊ ---·Expected
       ┊ +++·Actual
       ┊ @@ @@
       ┊ -'aaaaaaæceeeeiiii'
       ┊ +'aaaaaåæceeeeiiii'
       │
       │ /builds/issue/drupal-3525904/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:34
       ┴
     ✘ Remove diacritics with data set 3
       ┐
       ├ Failed asserting that two strings are equal.
       ┊ ---·Expected
       ┊ +++·Actual
       ┊ @@ @@
       ┊ -'ðnooooo÷ouuuuyþy'
       ┊ +'ðnooooo÷øuuuuyþy'
       │
       │ /builds/issue/drupal-3525904/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:34
       ┴
    

    So, yeah, I think we'll want @amateescu's guidance here on whether to change the test expected output, add additional test coverage, change code, or what. 🤷‍♀️

  • 🇩🇰Denmark ressa Copenhagen

    Ha, ja det synes svenskerne og nordmændene jo vi har :) [Ha, yes, that's what the Swedes and the Norwegians think we have :)]

    Yes, it's probably best to wait and see what @amateescu thinks before proceeding. Reviewing the updated MR can probably wait as well, since the current approach could be breaking a well-planned structure.
