Japanese characters are replaced by romanized Chinese readings in URL aliases

Created on 20 June 2017, over 7 years ago
Updated 8 May 2023, over 1 year ago

Whenever new content with a Japanese title is created, the generated URL alias contains romanized words based on the Chinese reading of the characters rather than the Japanese one.

Some explanation from our Japanese colleague:

First of all, please let me explain that Japanese uses three different sets of characters: Hiragana, Katakana and Kanji.

Hiragana and Katakana are used for the Japanese language only. Therefore Drupal generates the URL “almost” properly. I will explain later why I added “almost”. Hiragana is usually used for traditional Japanese words, and Katakana is used for foreign words.
Kanji originally came from China and evolved in Japan. Therefore the characters look very similar, but Japanese people can tell which ones are Japanese and which ones are Chinese.

For more details about Japanese please see http://www.how-ocr-works.com/languages/japanese-alphabet.html

Tip: In Korea, only one character set, called “Hangul”, is used for traditional words, foreign words and words of Chinese origin. Therefore, Drupal creates the URL properly.

It seems that Drupal confuses Japanese Kanji with the original Chinese characters.
The problem that occurs in Drupal is as follows.

雑草 and 版 are Kanji, and their pronunciations are as follows.

雑草 – JP “Zassou”, CN “Za-Cao”
版 – JP “Ban”, CN “Ban” <- These just happen to be the same. Most of the time, the pronunciation differs between Japanese and Chinese.
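
The difference between the two readings can be sketched as a lookup that prefers a language-specific word table and only falls back to a generic per-character table. This is a minimal Python illustration under stated assumptions, not Drupal's implementation: the function name `romanize`, both tables and their entries are hypothetical and cover only the characters discussed above.

```python
# Hypothetical tables for illustration only; they are not Drupal's real data.
# Generic per-character readings (Chinese-style, as in the reported output).
GENERIC_CHAR_TABLE = {"雑": "za", "草": "cao", "版": "ban"}
# Japanese readings, stored per word because a kanji reading depends on the word.
JAPANESE_WORD_TABLE = {"雑草": "zassou", "版": "ban"}


def romanize(text, langcode="en", unknown="?"):
    """Romanize text, preferring Japanese word readings when langcode is 'ja'."""
    words = JAPANESE_WORD_TABLE if langcode == "ja" else {}
    out = []
    i = 0
    while i < len(text):
        # Look for the longest language-specific word starting at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in words:
                match = text[i:j]
                break
        if match is not None:
            out.append(words[match])
            i += len(match)
        else:
            # Fall back to the generic per-character reading.
            out.append(GENERIC_CHAR_TABLE.get(text[i], unknown))
            i += 1
    return "".join(out)


print(romanize("雑草", "en"))  # zacao
print(romanize("雑草", "ja"))  # zassou
```

The word-level table is a deliberate choice in this sketch: the reading “zassou” cannot be built from independent per-character readings, so a per-character table alone would not be enough for Japanese.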

ポケットブックオンライン is Katakana, so Drupal can recognize that it is Japanese. The correct pronunciation is as follows.

ポケットブックオンライン – JP “Pokettobukkuonrain”, CN – N/A, as Katakana is only used in Japan

I thought Drupal was creating the URL properly for Katakana, but I found an error there too. That is why I added “almost” at the beginning.
Drupal does not recognize the geminate consonant, which is expressed by a small “ッ”. It looks quite similar, but you can see that “ッ” is smaller than the other Japanese characters. The large “ツ”, on the other hand, is pronounced “tu” (also romanized as “tsu”) and is used differently from the small “ッ”.
The correct pronunciation of ポケットブックオンライン is “Pokettobukkuonrain”. However, Drupal creates a URL like “poketutobutukuonkine”, which reads the text as ポケツトブツクオンライン.
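
The rule for the small “ッ” can be sketched as follows: the character produces no sound of its own and instead doubles the first consonant of the next syllable. This is a minimal Python illustration, not Drupal's code. The function name `romanize_katakana` and the table are hypothetical, the table covers only the first part of the example word, and the large “ツ” is romanized Hepburn-style as “tsu”.

```python
# Hypothetical table for illustration; it covers only the kana in "ポケットブック".
KATAKANA = {
    "ポ": "po", "ケ": "ke", "ト": "to", "ブ": "bu", "ク": "ku",
    "ツ": "tsu",  # large tsu, a normal syllable
}
SOKUON = "ッ"  # small tsu, marks a geminate (doubled) consonant
VOWELS = "aiueo"


def romanize_katakana(text, unknown="?"):
    """Romanize katakana, doubling the consonant that follows a small tsu."""
    out = []
    double_next = False
    for ch in text:
        if ch == SOKUON:
            # Emit nothing now; remember to double the next consonant.
            double_next = True
            continue
        roman = KATAKANA.get(ch, unknown)
        starts_with_consonant = (
            bool(roman) and roman[0].isalpha() and roman[0] not in VOWELS
        )
        if double_next and starts_with_consonant:
            roman = roman[0] + roman
        double_next = False
        out.append(roman)
    return "".join(out)


print(romanize_katakana("ポケットブック"))  # pokettobukku
print(romanize_katakana("ポケツトブツク"))  # poketsutobutsuku
```

The second call shows what happens when the small “ッ” is treated as the large “ツ”: every geminate consonant turns into an extra syllable.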

It might be too complicated… but our preferred URL for this page is…

zasso-pokettobukkuonrain-ban
↑Kanji ↑Katakana   ↑Kanji

Tip:
By the way, when you have discussions with Japanese coworkers, you will hear a very strong Japanese accent. That comes from Katakana, which is used for foreign words (as I explained at the beginning of this email).

“Pokettobukkuonrain” is actually,
“Poketto Bukku Onrain” which is…
“Pocket Book Online” as pronounced with a Japanese accent.

Therefore, the following URL can also be used.

zasso-pocket-book-online-ban
↑Kanji ↑English    ↑Kanji

In summary, Drupal’s URL auto-generator has the following problems.

1. It confuses Japanese Kanji with the original Chinese characters.
2. It cannot recognize the geminate consonant marker in Katakana (ッ) and Hiragana (っ).

Tip: Katakana is used for foreign words. Therefore, in some cases the original English expression is more suitable for the URL.

Please look into this issue.

Adding a doc file with the same explanation. We looked on drupal.org (D.o) but were not able to find any help along these lines. Could someone who understands Japanese please look into this?

🐛 Bug report

Status: Postponed: needs info
Version: 1.0
Component: Menus
Created by: vikas_jain (🇨🇦 Canada, Vancouver)


Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • 🇯🇵Japan tyler36 Osaka

    I can confirm this issue is present in Drupal 8, 9 and 10.

    I came across it today when migrating user data (machineName plugin), but I struggle with it whenever I enter Japanese in a label field, which then automatically generates a machine name.

    I believe it comes down to TransliterationInterface::transliterate($string, $langcode = 'en', $unknown_character = '?', $max_length = NULL) and, more specifically, Drupal\Component\Transliteration\PhpTransliteration::replace($code, $langcode, $unknown_character).

    Drupal assumes the Chinese reading of kanji even when Japanese is specified as the language. I have confirmed this on both an English-default and a Japanese-default site.

    $this->transliteration->transliterate('雑草', 'en', '_')
    "zacao"
    
    $this->transliteration->transliterate('雑草', 'ja', '_')
    "zacao"
    

    Likewise, katakana, a script used only for Japanese writing, also has problems, e.g. 'ポケットブックインライン'.

    Drupal 10 converts this to "hoke~tsutofu~tsukuinrain"; however, it should be "pokettobukkuinrain".

    1. It incorrectly handles the small "ッ", which should double the following consonant (tto, kku).
    2. It also drops diacritic marks: in katakana, ポ (po) is treated as ホ (ho).

    The diacritic problem exists in both katakana and hiragana. There are 25 characters with diacritic marks in each script, so 50 in total.
    Katakana: ホ (ho), ボ (bo), ポ (po) => all become "ho"
    Hiragana: ほ (ho), ぼ (bo), ぽ (po) => all become "ho"
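
The collapse of ボ and ポ into ホ can be reproduced with Unicode normalization: the marked kana decompose canonically into a base kana plus a combining voicing mark (U+3099 or U+309A), so any step that discards combining marks returns the unmarked base. The Python sketch below only illustrates that mechanism. Whether Drupal's transliteration loses the marks this way is not established in this thread, and the helper names and the small table are hypothetical.

```python
import unicodedata


def strip_combining_marks(text):
    """Decompose text (NFD) and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical table keyed on precomposed kana; it covers only the six examples.
KANA = {
    "ホ": "ho", "ボ": "bo", "ポ": "po",
    "ほ": "ho", "ぼ": "bo", "ぽ": "po",
}


def romanize_kana(text, unknown="?"):
    """Recompose text (NFC) first, so marked kana keep their own table entry."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(KANA.get(ch, unknown) for ch in composed)


print(strip_combining_marks("ホボポ"))  # ホホホ
print(romanize_kana("ホボポ"))          # hobopo
```

Keying the table on NFC-normalized characters keeps ホ, ボ and ポ distinct even when the input arrives in decomposed form.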

  • 🇨🇦Canada joseph.olstad

    @tyler36, which database type are you using: MySQL or MariaDB? Also, which collation does your Drupal database use? utf8mb4_general_ci is the one you should most likely be using. Installing Drupal on a database that uses utf8mb4_general_ci, or converting your existing database to it, might resolve your issues. Try a fresh install of Drupal on a database using utf8mb4_general_ci.

    What collation type are you currently using?

  • 🇨🇦Canada joseph.olstad

    Hmm, there is a core issue open for Drupal core.

  • 🇯🇵Japan tyler36 Osaka

    @joseph.olstad Thanks for the reply and link to parent issue.

    Database is MariaDB:10.4 using "utf8mb4_general_ci" collation.
