Japanese content is transliterated as Chinese

Created on 11 July 2018, about 7 years ago
Updated 3 May 2023, about 2 years ago

Problem/Motivation

Japanese content is transliterated as Chinese. The same characters are romanized differently in Japanese than in Chinese. For example, content containing 日本語 should be transliterated as "nihongo" for Japanese, but in Chinese it is "ribenyu". Drupal produces the second regardless of the language of the article.

Hiragana and katakana are transliterated correctly; only kanji seem to have this issue.

Proposed resolution

I am not completely sure what the ideal solution is. For background, Garrett Albright wrote this excellent outline of the problem back in 2013 on the Japan group's site: https://groups.drupal.org/node/377438

There have been efforts from other members of the community, and it seems the "betterer" solution revolves around the MeCab library, which is not trivial to install and is probably not available on shared hosting.

Here's Garrett's sandbox module for D8 that tries to solve this problem: https://www.drupal.org/sandbox/garrettalbright/2153499

Other efforts in D7, using the same MeCab library: https://www.drupal.org/sandbox/qchan/1324666 and an alternative one (kakasi) https://www.drupal.org/sandbox/qchan/1324644

Just for reference, here's a list of Unicode code points for kanji: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

Remaining tasks

Figure out what sensible support core can provide for Japanese transliteration.

🐛 Bug report
Status

Active

Version

10.1

Component
Transliteration 

Last updated 3 days ago

Created by

🇪🇸Spain pcambra Asturies


Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • 🇨🇦Canada joseph.olstad

    Would be nice to get a patch for this.

  • 🇯🇵Japan tyler36 Osaka

    I'm experiencing this issue on MariaDB 10.4 with utf8mb4_general_ci.

    It shows up when Drupal generates machine names, which do not resemble the Japanese labels they come from.
    I can reproduce it with drush repl, so it is probably not limited to the database.

    Results of some recent testing are in the related issue "Japanese Characters replaced by Chinese english words in URL aliases" (status: Postponed, needs info).

  • 🇨🇭Switzerland phma Basel, CH

    This won't help if you only want ASCII letters in your URLs. But if you prefer cleaner URLs and want to keep transliteration working for non-Japanese URLs, you can use something like this (it's important to turn off transliteration in settings when using this):

    /**
     * Deal with Japanese and other non-Latin characters in Pathauto aliases.
     *
     * Unicode ranges taken from here and converted to PHP:
     * @see https://gist.github.com/ryanmcgrath/982242
     *
     * Implements hook_pathauto_alias_alter().
     */
    function pathauto_cjk_pathauto_alias_alter(&$alias, array &$context) {
      // If the alias contains CJK characters, clean up punctuation but do not
      // transliterate because Japanese gets transliterated as Chinese.
      // @see https://www.drupal.org/project/drupal/issues/2984977
      if (preg_match('/[\x{3000}-\x{303F}]|[\x{3040}-\x{309F}]|[\x{30A0}-\x{30FF}]|[\x{FF00}-\x{FFEF}]|[\x{4E00}-\x{9FAF}]|[\x{2605}-\x{2606}]|[\x{2190}-\x{2195}]|\x{203B}/u', $alias)) {
        // Normalize widths: half-width katakana to full-width, and full-width
        // letters and digits to half-width.
        $alias = mb_convert_kana($alias, 'KVrn');
        // Replace punctuation with hyphens.
        $alias = preg_replace('/[\x{3000}-\x{303F}]|[\x{2605}-\x{2606}]|[\x{2190}-\x{2195}]|[\x{203B}\x{30FB}]|[  \t]/u', '-', $alias);
        // Replace remaining special characters with hyphens.
        $alias = preg_replace('/[^\p{Han}\p{Katakana}\p{Hiragana}\p{Latin}\d\/]+/u', '-', $alias);
      } else {
        // Transliterate the alias.
        $alias = \Drupal::transliteration()->transliterate($alias, $context['language'] ?? 'en');
      }
      $alias = \Drupal::service('pathauto.alias_cleaner')->cleanAlias($alias, $context['source'], $context['language']);
    }

    This might need more polishing and testing, so any suggestions and improvements are welcome.

  • 🇯🇵Japan tyler36 Osaka

    Thank you @phma.

    That's a useful snippet.

  • 🇯🇵Japan u7aro Japan

    We discussed this issue as part of [Drupal Contribution Day Japan 2025]( https://www.drupal.org/community/events/drupal-japan-contribution-day-ju... )

    Result:

    - We conducted a feasibility check using kuromoji.js.
    - We are making it possible to incorporate the feature as a contributed module.

    Members:

    - otofu https://www.drupal.org/u/otofu
    - u7aro https://www.drupal.org/u/u7aro
    - kazuko.murata https://www.drupal.org/u/kazukomurata
    - Tom Konda https://www.drupal.org/u/tom-konda
    - hagi https://www.drupal.org/u/hagi

  • 🇨🇦Canada Charlie ChX Negyesi 🍁Canada

    Please check the overrides already in core/lib/Drupal/Component/Transliteration/data, for example eo.php or kg.php, and provide a similar ja.php. Each file is a simple PHP array where the keys are Unicode code points and the values are transliterated strings.

  • 🇦🇺Australia jimmycann

    This was investigated as part of the DrupalSouth Melbourne 2025 contribution day in March 2025. Apologies: being new to the process, I neglected to comment on the issue at the time. I also attended the Japan Drupal contribution day and saw the progress on this issue, so I would like to add some context.

    https://www.drupal.org/u/nterbogt and I found the transliteration library and the related eo.php/kg.php files, but the library is not entirely compatible with the way the Japanese language works.

    For example, consider the array in kg.php:

    ```
    $overrides['kg'] = [
    0x41 => 'E',
    ...
    ```

    In this language `0x41` invariably corresponds to the letter `E`. That isn't true for Japanese, where a character can be transliterated in many ways but only one is correct in a given context.

    For example, `0x65E5` is the Unicode code point for `日`, which could be 'hi', 'bi' or 'nichi' depending on the characters next to it. Choosing the wrong one makes the result nonsensical, and there are many thousands of such cases.

    In other words, a simple array mapping Unicode code points to strings won't be sufficient for Japanese.
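
    To make the limitation concrete, here is a minimal, self-contained sketch. It compares a per-character table (the shape of an override file) with a greedy longest-match lookup against a word dictionary. Both tables and both function names are hypothetical and hand-written for illustration; they are not data or APIs from Drupal core, MeCab, or kuromoji.js.

    ```php
    <?php
    // Illustrative sketch only: both tables are tiny, hand-written, hypothetical
    // examples. They are not data from Drupal core, MeCab, or kuromoji.js.

    // Per-character table, in the spirit of an override file: every kanji gets
    // exactly one reading, whatever its neighbours are.
    $per_character = [
      '日' => 'nichi',
      '本' => 'hon',
      '語' => 'go',
      '曜' => 'you',
    ];

    // Word-level dictionary: the reading belongs to the whole word.
    $words = [
      '日本語' => 'nihongo',
      '日曜日' => 'nichiyoubi',
    ];

    /**
     * Transliterates one character at a time, ignoring context.
     */
    function transliterate_per_character(string $text, array $table): string {
      $output = '';
      foreach (mb_str_split($text) as $character) {
        // Characters missing from the table pass through unchanged.
        $output .= $table[$character] ?? $character;
      }
      return $output;
    }

    /**
     * Transliterates by greedy longest match against a word dictionary.
     *
     * Falls back to the per-character table when no dictionary word matches.
     */
    function transliterate_longest_match(string $text, array $words, array $table): string {
      $longest = $words ? max(array_map('mb_strlen', array_keys($words))) : 0;
      $length = mb_strlen($text);
      $output = '';
      $offset = 0;
      while ($offset < $length) {
        // Try the longest candidate first, then shrink it.
        for ($size = min($longest, $length - $offset); $size >= 1; $size--) {
          $candidate = mb_substr($text, $offset, $size);
          if (isset($words[$candidate])) {
            $output .= $words[$candidate];
            $offset += $size;
            continue 2;
          }
        }
        // No word matched at this offset: fall back to a single character.
        $character = mb_substr($text, $offset, 1);
        $output .= $table[$character] ?? $character;
        $offset++;
      }
      return $output;
    }

    // Prints "nichiyounichi": the last 日 is read wrongly.
    echo transliterate_per_character('日曜日', $per_character), PHP_EOL;
    // Prints "nichiyoubi": the word-level entry supplies the right reading.
    echo transliterate_longest_match('日曜日', $words, $per_character), PHP_EOL;
    ```

    Greedy longest match is only the simplest stand-in for context awareness. A real analyzer would need a full dictionary and a way to choose between competing segmentations, which is where the weight and complexity come from.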

    Porting [kuromoji.js](https://github.com/takuyaa/kuromoji.js/) to PHP and incorporating it into the Drupal core transliteration package is also likely out of scope for this project. It would add considerable weight because of its complexity, so it should be possible to include it optionally.

    I think https://www.drupal.org/u/u7aro and the team that investigated this during the Japan contribution day are on the right track: a contributed module that incorporates [kuromoji.js](https://github.com/takuyaa/kuromoji.js/) might be the best way to solve this for Japanese-language users.
