Japanese content is transliterated as Chinese

Created on 11 July 2018, over 6 years ago
Updated 24 October 2023, over 1 year ago

Problem/Motivation

Japanese content is transliterated as Chinese. The same characters are romanized differently in Japanese than in Chinese. For example, content containing 日本語 should be transliterated as "nihongo" for Japanese, but the Chinese reading is "ribenyu". Drupal produces the Chinese reading regardless of the language of the article.

Hiragana and katakana are transliterated correctly; only kanji seem to have this issue.
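Since only kanji are affected, it can help to check whether a given label contains any before trusting its transliteration. The following is a minimal sketch in plain PHP; the function names `contains_kanji()` and `contains_kana()` are hypothetical helpers made up for illustration and are not part of Drupal core. It relies on the Unicode script properties supported by PCRE.

```php
<?php

/**
 * Reports whether a string contains kanji (Han ideographs).
 *
 * Hypothetical helper for illustration only; not a Drupal API.
 */
function contains_kanji(string $text): bool {
  // \p{Han} matches Han ideographs, which Japanese and Chinese share.
  // This is why the script alone cannot tell the two languages apart.
  return preg_match('/\p{Han}/u', $text) === 1;
}

/**
 * Reports whether a string contains hiragana or katakana.
 *
 * Hypothetical helper for illustration only; not a Drupal API.
 */
function contains_kana(string $text): bool {
  return preg_match('/[\p{Hiragana}\p{Katakana}]/u', $text) === 1;
}
```

A label for which `contains_kanji()` returns TRUE on Japanese-language content is one where the transliterated output is likely to follow the Chinese reading.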

Proposed resolution

I am not completely sure what the ideal solution is. For more background, Garrett Albright wrote this excellent outline of the problem back in 2013 on the Japan group's site: https://groups.drupal.org/node/377438

Other members of the community have worked on this, and it seems the "betterer" solution revolves around the MeCab library, which is not trivial to install and is probably not available on shared hosting.

Here's Garrett's sandbox module for D8 that tries to solve this problem: https://www.drupal.org/sandbox/garrettalbright/2153499

Other efforts for D7: one using the same MeCab library, https://www.drupal.org/sandbox/qchan/1324666, and an alternative one using kakasi, https://www.drupal.org/sandbox/qchan/1324644

Just for reference, here's a Unicode list of kanji codes: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

Remaining tasks

Figure out what sensible support core can provide for Japanese transliteration.

🐛 Bug report
Status

Active

Version

11.0 🔥

Component
Transliteration 

Last updated 4 months ago

Created by

🇪🇸Spain pcambra Asturies

Comments & Activities

  • 🇨🇦Canada joseph.olstad

    Would be nice to get a patch for this.

  • 🇯🇵Japan tyler36 Osaka

    I am experiencing this issue using MariaDB 10.4 with utf8mb4_general_ci.

    I hit it when Drupal generates machine names, which do not resemble the Japanese-language labels.
    I can reproduce it with drush repl, so it is probably not limited to the database.

    Results of some recent testing are in 🐛 Japanese Characters replaced by Chinese english words in URL aliases (Postponed: needs info).

  • 🇨🇭Switzerland phma Basel, CH

    This won't help if you only want ASCII letters in your URLs. But if you prefer cleaner URLs and want to keep transliteration working for non-Japanese URLs, you can use something like this (it's important to turn off transliteration in settings when using this):

    /**
     * Deal with Japanese and other non-Latin characters in Pathauto aliases.
     *
     * Unicode ranges taken from here and converted to PHP:
     * @see https://gist.github.com/ryanmcgrath/982242
     *
     * Implements hook_pathauto_alias_alter().
     */
    function pathauto_cjk_pathauto_alias_alter(&$alias, array &$context) {
      // If the alias contains CJK characters, clean up punctuation but do not
      // transliterate because Japanese gets transliterated as Chinese.
      // @see https://www.drupal.org/project/drupal/issues/2984977
      if (preg_match('/[\x{3000}-\x{303F}]|[\x{3040}-\x{309F}]|[\x{30A0}-\x{30FF}]|[\x{FF00}-\x{FFEF}]|[\x{4E00}-\x{9FAF}]|[\x{2605}-\x{2606}]|[\x{2190}-\x{2195}]|\x{203B}/u', $alias)) {
        // Normalize character widths: half-width katakana become full-width,
        // full-width letters and digits become half-width.
        $alias = mb_convert_kana($alias, 'KVrn');
        // Replace punctuation with hyphens.
        $alias = preg_replace('/[\x{3000}-\x{303F}]|[\x{2605}-\x{2606}]|[\x{2190}-\x{2195}]|[\x{203B}\x{30FB}]|[  \t]/u', '-', $alias);
        // Replace remaining special characters with hyphens.
        $alias = preg_replace('/[^\p{Han}\p{Katakana}\p{Hiragana}\p{Latin}\d\/]+/u', '-', $alias);
      }
      else {
        // Transliterate the alias.
        $alias = \Drupal::transliteration()->transliterate($alias, $context['language'] ?? 'en');
      }
      $alias = \Drupal::service('pathauto.alias_cleaner')->cleanAlias($alias, $context['source'], $context['language']);
    }

    This might need more polishing and testing, so any suggestions and improvements are welcome.
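To experiment with the CJK branch of the snippet above outside of Drupal, a simplified standalone version can be sketched as follows. This is an approximation under stated assumptions, not the snippet's exact behavior: the function name `cjk_alias_cleanup()` is hypothetical, the explicit punctuation pass and the Pathauto alias cleaner call are left out, and leading or trailing hyphens are trimmed for readability. It needs the mbstring extension.

```php
<?php

/**
 * Simplified, standalone take on the CJK branch of the hook above.
 *
 * Hypothetical function for experimentation; not part of Pathauto or core.
 */
function cjk_alias_cleanup(string $alias): string {
  // Normalize widths: half-width katakana become full-width (K, V),
  // full-width letters (r) and digits (n) become half-width.
  $alias = mb_convert_kana($alias, 'KVrn', 'UTF-8');
  // Replace every run of characters that are not kanji, kana, Latin letters,
  // digits or slashes with a single hyphen.
  $alias = preg_replace('/[^\p{Han}\p{Katakana}\p{Hiragana}\p{Latin}\d\/]+/u', '-', $alias);
  // Assumption for this sketch: drop hyphens left at either end.
  return trim($alias, '-');
}

// Kanji and kana are kept as-is, and spaces turn into hyphens.
echo cjk_alias_cleanup('日本語 の テスト'), "\n";
```

Marks that sit between scripts, such as the prolonged sound mark in katakana words, may be treated differently depending on the PCRE version in use, so real titles are worth testing before relying on this.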

  • 🇯🇵Japan tyler36 Osaka

    Thank you @phma.

    That's a useful snippet.
