Japanese content is transliterated as Chinese

Comment over 2 years ago →
🇨🇦Canada joseph.olstad
Would be nice to get a patch for this.
Comment over 2 years ago →
🇯🇵Japan tyler36 Osaka
Experiencing this issue on MariaDB 10.4 with utf8mb4_general_ci.

Getting hit by this issue when Drupal generates machine names that do not resemble the Japanese-language labels.
Able to reproduce with drush repl, so the problem is probably not limited to the database.

Results of some recent testing are in the related issue "Japanese Characters replaced by Chinese english words in URL aliases" (Postponed: needs info).

🇨🇭Switzerland phma Basel, CH

This won't help if you only want ASCII letters in your URLs. But if you prefer cleaner URLs and want to keep transliteration working for non-Japanese URLs, you can use something like this (it's important to turn off transliteration in settings when using this):

/**
 * Deal with Japanese and other non-Latin characters in Pathauto aliases.
 *
 * Unicode ranges taken from here and converted to PHP:
 * @see https://gist.github.com/ryanmcgrath/982242
 *
 * Implements hook_pathauto_alias_alter().
 */
function pathauto_cjk_pathauto_alias_alter(&$alias, array &$context) {
  // If the alias contains CJK characters, clean up punctuation but do not
  // transliterate because Japanese gets transliterated as Chinese.
  // @see https://www.drupal.org/project/drupal/issues/2984977
  if (preg_match('/[\x{3000}-\x{303F}]|[\x{3040}-\x{309F}]|[\x{30A0}-\x{30FF}]|[\x{FF00}-\x{FFEF}]|[\x{4E00}-\x{9FAF}]|[\x{2605}-\x{2606}]|[\x{2190}-\x{2195}]|\x{203B}/u', $alias)) {
    // Normalize character widths: half-width katakana to full-width (K, V),
    // full-width letters and digits to half-width (r, n).
    $alias = mb_convert_kana($alias, 'KVrn');
    // Replace punctuation with hyphens.
    $alias = preg_replace('/[\x{3000}-\x{303F}]|[\x{2605}-\x{2606}]|[\x{2190}-\x{2195}]|[\x{203B}\x{30FB}]|[ 　\t]/u', '-', $alias);
    // Replace remaining special characters with hyphens.
    $alias = preg_replace('/[^\p{Han}\p{Katakana}\p{Hiragana}\p{Latin}\d\/]+/u', '-', $alias);
  }
  else {
    // Transliterate the alias.
    $alias = \Drupal::transliteration()->transliterate($alias, $context['language'] ?? 'en');
  }
  $alias = \Drupal::service('pathauto.alias_cleaner')->cleanAlias($alias, $context['source'], $context['language']);
}

This might need more polishing and testing, so any suggestions and improvements are welcome.
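
To see what the CJK branch of the hook does to a given string, the three cleanup steps can be tried outside Drupal. The following is only a sketch under assumptions: `clean_cjk_alias()` is a hypothetical name used for this example, the literal ideographic space in the second pattern is dropped because the `\x{3000}-\x{303F}` range already covers it, and the final Pathauto `cleanAlias()` call is left out because it needs a bootstrapped site.

```php
<?php
// Hypothetical standalone helper mirroring the CJK branch of the hook above.
function clean_cjk_alias(string $alias): string {
  // Half-width katakana to full-width; full-width letters/digits to ASCII.
  $alias = mb_convert_kana($alias, 'KVrn', 'UTF-8');
  // CJK punctuation (including the ideographic space U+3000), stars, arrows,
  // the reference mark, the middle dot, spaces and tabs become hyphens.
  $alias = preg_replace('/[\x{3000}-\x{303F}]|[\x{2605}-\x{2606}]|[\x{2190}-\x{2195}]|[\x{203B}\x{30FB}]|[ \t]/u', '-', $alias);
  // Any remaining run of characters outside the allowed set collapses to a
  // single hyphen.
  return preg_replace('/[^\p{Han}\p{Katakana}\p{Hiragana}\p{Latin}\d\/]+/u', '-', $alias);
}

// Input uses an ideographic space and full-width digits.
echo clean_cjk_alias('お知らせ　２０２４'), PHP_EOL; // prints お知らせ-2024
```

The Japanese characters survive untouched, while the full-width digits are narrowed and the ideographic space becomes a hyphen. Edge cases are worth testing on your own PHP build before relying on this.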

Comment almost 2 years ago →
🇯🇵Japan tyler36 Osaka
Thank you @phma.

That's a useful snippet.
Comment about 1 month ago →
🇯🇵Japan u7aro Japan
We discussed this issue as part of [Drupal Contribution Day Japan 2025](https://www.drupal.org/community/events/drupal-japan-contribution-day-ju...).

Result:

- We conducted a feasibility check using kuromoji.js.
- We are working on making the feature available as a contributed module.

Members:

- otofu https://www.drupal.org/u/otofu
- u7aro https://www.drupal.org/u/u7aro
- kazuko.murata https://www.drupal.org/u/kazukomurata
- Tom Konda https://www.drupal.org/u/tom-konda
- hagi https://www.drupal.org/u/hagi
Comment about 1 month ago →
🇨🇦Canada Charlie ChX Negyesi 🍁Canada
Please check the overrides already in core/lib/Drupal/Component/Transliteration/data, for example eo.php or kg.php, and provide a similar ja.php. The file is a simple PHP array where the keys are Unicode code points and the values are transliterated strings.
Comment about 1 month ago →
🇦🇺Australia jimmycann
This was investigated as part of the DrupalSouth Melbourne 2025 contribution day in March 2025. Apologies: being new to the process, I neglected to comment on the issue at the time. I also attended the Japan Drupal contribution day and was able to see the progress on this issue, so I would like to add some context.

Together with nterbogt (https://www.drupal.org/u/nterbogt), I found the transliteration library and the related eo.php/kg.php files, but we found that the library is not entirely compatible with the way the Japanese language works.

For example, compare the array in kg.php:

```
$overrides['kg'] = [
0x41 => 'E',
...
```

In this language, `0x41` invariably corresponds to the letter `E`. That isn't true for Japanese, where a character can be transliterated in many ways and only one of them is correct in a given context.

For example:

`0x65E5` is the Unicode code point for `日`, which could be 'hi', 'bi' or 'nichi' depending on the characters next to it. Choosing the wrong reading makes the result nonsensical, and there are many thousands of such cases.

In other words, a simple array mapping Unicode code points to strings won't be sufficient for Japanese.
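
To make the limitation concrete, here is a minimal sketch of a context-free, per-code-point lookup. It is not Drupal's actual transliteration code: the `$overrides` table and the `transliterate_per_code_point()` helper are hypothetical and exist only for this illustration, and the romanized readings are chosen for the example.

```php
<?php
// Hypothetical table in the spirit of eo.php/kg.php: each code point has
// exactly one fixed replacement. mb_ord('日') is 0x65E5.
$overrides = [
  mb_ord('日', 'UTF-8') => 'nichi',
  mb_ord('曜', 'UTF-8') => 'you',
];

// Hypothetical helper: replaces every character independently of its
// neighbours and leaves unmapped characters unchanged.
function transliterate_per_code_point(string $text, array $map): string {
  $out = '';
  foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
    $out .= $map[mb_ord($char, 'UTF-8')] ?? $char;
  }
  return $out;
}

// 日曜日 (Sunday) is read "nichiyoubi", but a per-character table cannot know
// that the second 日 is read "bi" in this word.
echo transliterate_per_code_point('日曜日', $overrides), PHP_EOL; // prints nichiyounichi
```

Whatever single value is stored for `0x65E5`, at least one of the two occurrences of `日` in this word comes out wrong. That is why a dictionary-based tokenizer, which looks at whole words, is being considered instead.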

Writing an equivalent port of [kuromoji.js](https://github.com/takuyaa/kuromoji.js/) in PHP and incorporating it into the Drupal core transliteration package is also likely out of scope for this project. It would also add considerable weight because of its complexity, so it should be something that can be optionally included.

I think u7aro (https://www.drupal.org/u/u7aro) and the team that investigated this during the Japan contribution day have found that a contributed module incorporating [kuromoji.js](https://github.com/takuyaa/kuromoji.js/) might be the best way to solve this for Japanese-language users.
