- Issue created by @loopy1492
Thank you for this bug report.
Could you please provide more detailed steps to reproduce the issue? Specifically, it would be helpful to clarify what you mean by "upstream environment" and to elaborate on the "systems which use the URL as a primary key in the database rather than an auto-incrementing ID."
For instance, are you referring to systems like Drupal?
- 🇺🇸United States loopy1492
In docroot/core/modules/search/search.install, the schema for search tables indicates the primary key as
'primary key' => ['word', 'sid', 'langcode', 'type'],
. In work on a particular contributed module (see OP), it has proven helpful to switch this primary key to an auto-incrementing ID number instead to avoid mismatches in special characters.For example:
/example/path/
/example/path
/example%2Fpath
/example%20path/
/example%20path
/example path
/example/path)
/example/path)
/example)/path
/example)/path
/∼example/path
/~example/pathI suspect they got into the search index by bots inserting junk into a field. I'm not completely sure.
- 🇺🇸United States loopy1492
I ran
drush sql-query "SELECT * FROM search_index WHERE word COLLATE utf8mb4_unicode_ci = 'פאר';"
The response was:
פֿאַר 34531 und node_search 0.882818 פאר 34531 und node_search 3.74549
I am not certain whether there's another way to using utf8mb4_unicode_ci to generate the key when selecting a record or syncing the databse. I would guess an auto-incrementing int for the key instead would do the trick, however.
- 🇺🇸United States loopy1492
### Find ligatures:
`drush sql-query "SELECT * FROM search_index WHERE word LIKE '%fi%' OR word LIKE '%fl%' OR word LIKE '%ffi%' OR word LIKE '%ffl%' OR word LIKE '%ſt%' OR word LIKE '%st%';"`### Quick fix:
`drush sql-query "DELETE FROM search_index WHERE word LIKE '%fi%' OR word LIKE '%fl%' OR word LIKE '%ffi%' OR word LIKE '%ffl%' OR word LIKE '%ſt%' OR word LIKE '%st%';"`### Better fix:
Edit the content, then run the command above, then reindex.### Actual fix:
Stop using concatenated fields as a primary key. - 🇳🇿New Zealand quietone
Changes are made on on 11.x (our main development branch) first, and are then back ported as needed according to the Core change policies → .
- 🇺🇸United States loopy1492
It seems that Urdu has a really tough one -- diacritical marks. They completely change the meaning of the word in some cases:
drush sql-query "SELECT * FROM search_index WHERE word COLLATE utf8mb4_unicode_ci = 'پُر';"
پر 34531 und node_search 0.824901
پُر 34531 und node_search 0.252723 - 🇺🇸United States loopy1492
I definitely believe the best path forward here is to change the primary key of the search_index and search_total tables to an automatic int field instead of using the word/sid combo.