Duplicate entry in search_index.PRIMARY due to special characters not being handled by disparate database versions

Created on 10 December 2024, 11 days ago

Problem/Motivation

In some cases, using a URL as a primary key causes duplicate entiries in the database.

Steps to reproduce

Add a bunch of URLs with a variety of special characters (especially wide characters and characters from other languages) to systems which use the URL as a primary key in the database rather than an auto-incrementing ID.

Proposed resolution

Stop using URLs as primary keys and switch to auto-incrementing IDs.

I was able to fix the issue in the redirect module here:

https://www.drupal.org/project/redirect/issues/3201335 🐛 Duplicate entry error for paths containing trailing spaces in redirect_404 table Active

However, we're also encountering it in the drupal core search module. I do not believe I should be messing around with that, personally.

Example:

The command "/var/www/vendor/bin/drush sql:query --yes --strict=0 --file=/tmp/c8ac25646a44415f9a365fb0097b3fba_20241210_143857.sql.gz --file-delete --uri=nyesd.docksal.site" failed.

Exit Code: 1(General error)

Working directory:

Output:
================

Error Output:
================

In SqlCommands.php line 183:

Query failed. Rerun with --debug to see any error message. ERROR 1062 (23000) at line 26031: Duplicate entry 'פאר-34531-und-node_search' for key 'search_index.PRIMARY'

🐛 Bug report
Status

Active

Version

10.3

Component

database system

Created by

🇺🇸United States loopy1492

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

  • Issue created by @loopy1492
  • Thank you for this bug report.

    Could you please provide more detailed steps to reproduce the issue? Specifically, it would be helpful to clarify what you mean by "upstream environment" and to elaborate on the "systems which use the URL as a primary key in the database rather than an auto-incrementing ID."

    For instance, are you referring to systems like Drupal?

  • 🇺🇸United States loopy1492

    In docroot/core/modules/search/search.install, the schema for search tables indicates the primary key as 'primary key' => ['word', 'sid', 'langcode', 'type'],. In work on a particular contributed module (see OP), it has proven helpful to switch this primary key to an auto-incrementing ID number instead to avoid mismatches in special characters.

    For example:

    /example/path/
    /example/path
    /example%2Fpath
    /example%20path/
    /example%20path
    /example path
    /example/path)
    /example/path)
    /example)/path
    /example)/path
    /∼example/path
    /~example/path

    I suspect they got into the search index by bots inserting junk into a field. I'm not completely sure.

  • 🇺🇸United States loopy1492

    I ran drush sql-query "SELECT * FROM search_index WHERE word COLLATE utf8mb4_unicode_ci = 'פאר';"

    The response was:

    פֿאַר	34531	und	node_search	0.882818
    פאר	34531	und	node_search	3.74549
    

    I am not certain whether there's another way to using utf8mb4_unicode_ci to generate the key when selecting a record or syncing the databse. I would guess an auto-incrementing int for the key instead would do the trick, however.

  • 🇺🇸United States loopy1492

    ### Find ligatures:
    `drush sql-query "SELECT * FROM search_index WHERE word LIKE '%fi%' OR word LIKE '%fl%' OR word LIKE '%ffi%' OR word LIKE '%ffl%' OR word LIKE '%ſt%' OR word LIKE '%st%';"`

    ### Quick fix:
    `drush sql-query "DELETE FROM search_index WHERE word LIKE '%fi%' OR word LIKE '%fl%' OR word LIKE '%ffi%' OR word LIKE '%ffl%' OR word LIKE '%ſt%' OR word LIKE '%st%';"`

    ### Better fix:
    Edit the content, then run the command above, then reindex.

    ### Actual fix:
    Stop using concatenated fields as a primary key.

  • 🇳🇿New Zealand quietone

    Changes are made on on 11.x (our main development branch) first, and are then back ported as needed according to the Core change policies .

  • 🇺🇸United States loopy1492

    It seems that Urdu has a really tough one -- diacritical marks. They completely change the meaning of the word in some cases:

    drush sql-query "SELECT * FROM search_index WHERE word COLLATE utf8mb4_unicode_ci = 'پُر';"
    پر 34531 und node_search 0.824901
    پُر 34531 und node_search 0.252723

  • 🇺🇸United States loopy1492

    I definitely believe the best path forward here is to change the primary key of the search_index and search_total tables to an automatic int field instead of using the word/sid combo.

Production build 0.71.5 2024