Expand the remove diacritics feature to other scripts

Created on 9 December 2019, almost 5 years ago
Updated 21 September 2023, about 1 year ago

Problem/Motivation

It is currently possible to find Éclair recipes by searching for Eclair. This only works for Latin letters; this issue is about expanding it to other scripts.

Proposed resolution

Accented characters in Unicode are precomposed characters. For example, É (official name: Latin Capital Letter E with Acute) decomposes into Latin Capital Letter E and Combining Acute Accent. The proposed resolution is to apply these decomposition rules to letters and then remove the combining marks. This is the way the ICU documentation recommends removing accents.
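The decompose-then-strip idea can be illustrated outside of PHP. The following is a minimal Python sketch, not the proposed implementation: it uses the standard unicodedata module instead of intl/ICU, the function name remove_diacritics is made up for this example, and removing only nonspacing marks (general category Mn) is an assumption about what "combining marks" should cover.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Decompose, drop nonspacing marks, then recompose what is left."""
    # NFD splits precomposed characters into a base letter plus marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: only nonspacing marks (category Mn) count as diacritics.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


e_acute = "\u00c9"  # É as a single precomposed code point
print(unicodedata.name(e_acute))  # LATIN CAPITAL LETTER E WITH ACUTE
print([unicodedata.name(ch) for ch in unicodedata.normalize("NFD", e_acute)])
print(remove_diacritics("\u00c9clair"))  # Eclair
```

Text that has no decomposable characters passes through unchanged, so the function can be applied to both the indexed text and the search keywords.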

The suggested userspace implementation was generated by a script. The script uses intl (the PHP wrapper for ICU) and attempts to remove accents from the letters among the first 8192 characters. Where this succeeds, the result is recorded in a simple associative array; where it does not, the letter has no decomposition rule. The array covers 827 characters. Use this array for a new remove diacritics class. The result, with ample doxygen, is a PHP file of less than 10K. See the ranges up to and including Greek Extended for the included scripts.
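A generator along these lines might look like the sketch below. It is written in Python with unicodedata rather than PHP with intl, so it is only an approximation of the script described above: the names strip_marks and build_map are hypothetical, and the number of entries it produces depends on the Unicode version bundled with the interpreter, so it need not match the count reported in this issue.

```python
import unicodedata


def strip_marks(ch: str) -> str:
    """Decompose a character and drop its nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)


def build_map(limit: int = 8192) -> dict:
    """Record every letter below `limit` that changes when marks are removed."""
    mapping = {}
    for code_point in range(limit):
        ch = chr(code_point)
        # Only letters (general categories starting with "L") are considered.
        if not unicodedata.category(ch).startswith("L"):
            continue
        stripped = strip_marks(ch)
        # Letters without a decomposition rule come back unchanged; skip them.
        if stripped and stripped != ch:
            mapping[ch] = stripped
    return mapping


diacritics_map = build_map()
print(len(diacritics_map))  # count varies with the Unicode data version

# Using the precomputed map needs no Unicode library at runtime.
table = str.maketrans(diacritics_map)
print("\u00c9clair".translate(table))
```

Precomputing the map is what lets the runtime code stay in userspace: the lookup is a plain array access and does not require the ICU extension on the site.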

Provide an alter hook in a core remove diacritics service.

Implement this alter hook in a new hidden module for backwards compatibility and install it in an update hook. This is necessary because, of the 288 characters the current implementation removes diacritics from, 40 are no longer handled. See this Linguistics StackExchange answer for why these characters have no decomposition rules. The current remove diacritics implementation uses the transliteration character database, which has been compiled from various sources instead of just using the standard. That is useful for its declared goal of machine name creation, but less useful when creating strings users actually need to interact with. Nonetheless, the alter hook is there if a site wants to do something else. If some rules become widespread in a language community, I hope there will eventually be a contrib module which provides a config entity utilizing this alter hook, and localize.drupal.org could provide these entities.
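How an alterable map could restore characters that have no decomposition rule is sketched below. This is a Python illustration of the pattern only, not Drupal's hook system: the class RemoveDiacritics, the callback backwards_compatibility_alter, and the choice of ø as an example are all assumptions made for this sketch (ø is used because it has no decomposition in Unicode; the issue does not list the 40 affected characters).

```python
from typing import Callable, Dict, Iterable

DiacriticsMap = Dict[str, str]
AlterCallback = Callable[[DiacriticsMap], None]


class RemoveDiacritics:
    """Hypothetical stand-in for the proposed service."""

    def __init__(self, base_map: DiacriticsMap,
                 alter_callbacks: Iterable[AlterCallback] = ()):
        # Copy the standard-derived map so callbacks cannot mutate the source.
        self._map = dict(base_map)
        # Each callback plays the role of one alter hook implementation.
        for alter in alter_callbacks:
            alter(self._map)
        self._table = str.maketrans(self._map)

    def remove(self, text: str) -> str:
        return text.translate(self._table)


def backwards_compatibility_alter(mapping: DiacriticsMap) -> None:
    """Add rules the standard does not provide (illustrative entries only)."""
    mapping.setdefault("\u00f8", "o")  # ø has no Unicode decomposition
    mapping.setdefault("\u00d8", "O")  # Ø has no Unicode decomposition


base_map = {"\u00c9": "E", "\u00e9": "e"}  # tiny excerpt of a generated map
plain = RemoveDiacritics(base_map)
compat = RemoveDiacritics(base_map, [backwards_compatibility_alter])

print(plain.remove("S\u00f8ren \u00c9"))   # ø is left alone
print(compat.remove("S\u00f8ren \u00c9"))  # ø is replaced by the alter rule
```

Keeping the extra rules in a separate callback means a site that does not want them only has to leave the compatibility module out, while the standard-derived map stays untouched.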

This module should probably be deprecated, but adding lifecycle: deprecated to a core module breaks a lot of tests and it is not documented how to do this. So this should be a follow-up, pending said documentation.

User interface changes

None.

API changes

TransliterationInterface::removeDiacritics is deprecated.

A new remove diacritics class is added to transliteration.

A new hook_remove_diacritics_map_alter is introduced to allow changing rules.

Data model changes

Release notes snippet

Original report

When removing diacritics, search_simplify() does not remove Arabic diacritics. How to test: add this text to any article: "السُّلَّامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُهُ" then search for "السلام". Original: https://ahmedspace.com/arabic-case-insensitive-in-database-systems-how-t...
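The test case from the report can be reproduced in isolation. The sketch below is Python rather than Drupal code and does not call search_simplify(); it only shows, under the assumption that Arabic vowel marks are treated as nonspacing marks to be removed, that stripping them makes the unvocalized query match. The helper name strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop nonspacing marks (category Mn), which includes Arabic harakat."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


article = "السُّلَّامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُهُ"
query = "السلام"

simplified = strip_marks(article)
print(query in article)     # substring match against the vocalized text
print(query in simplified)  # substring match after removing the marks
print(simplified)
```

Unlike É, these marks are already separate code points in the text, so a fix has to remove standalone combining marks as well as decompose precomposed letters.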

🐛 Bug report
Status

Needs work

Version

11.0 🔥

Component
Base 

Last updated about 5 hours ago

Created by

🇴🇲Oman omlx

  • Needs tests

    The change is currently missing an automated test that fails when run with the original code, and succeeds when the bug has been fixed.

  • Needs framework manager review

    It is used to alert the framework manager core committer(s) that an issue significantly impacts (or has the potential to impact) multiple subsystems or represents a significant change or addition in architecture or public APIs, and their signoff is needed (see the governance policy draft for more information). If an issue significantly impacts only one subsystem, use Needs subsystem maintainer review instead, and make sure the issue component is set to the correct subsystem.

  • Needs subsystem maintainer review

    It is used to alert the maintainer(s) of a particular core subsystem that an issue significantly impacts their subsystem, and their signoff is needed (see the governance policy draft for more information). Also, if you use this tag, make sure the issue component is set to the correct subsystem. If an issue significantly impacts more than one subsystem, use needs framework manager review instead.

  • Needs change record

    A change record needs to be drafted before an issue is committed. Note: Change records used to be called change notifications.
