- Issue created by @gaddman
From what I can tell the indexed entries are almost unique, but the trailing space is causing problems. So there are two entries:
1. 'abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw': the 49-char word
2. 'abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw ': the 49-char word plus a space (it would have been a phrase consisting of the 49-char word plus a space plus the following word but is trimmed to 50 chars).
These are being treated as identical in the MariaDB database:MariaDB [pantheon]> show full columns from search_api_db_default_text; +------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+ | Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment | +------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+ | item_id | varchar(150) | utf8_general_ci | NO | PRI | NULL | | select,insert,update,references | | | field_name | varchar(191) | utf8mb4_bin | NO | PRI | NULL | | select,insert,update,references | The name of the field in which the token appears, or a base-64 encoded sha-256 hash of the field | | word | varchar(50) | utf8mb4_bin | NO | PRI | NULL | | select,insert,update,references | The text of the indexed token | | score | int(10) unsigned | NULL | NO | | 0 | | select,insert,update,references | The score associated with this token | +------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+ 4 rows in set (0.002 sec) MariaDB [pantheon]> insert into search_api_db_default_text VALUES('entity:node/213748:en','rendered_item','abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw',1000); Query OK, 1 row affected (0.001 sec) MariaDB [pantheon]> insert into search_api_db_default_text VALUES('entity:node/213748:en','rendered_item','abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw ',1000); ERROR 1062 (23000): Duplicate entry 'entity:node/213748:en-rendered_item-abcdefjklmnopqrstuvwxyzab...' for key 'PRIMARY'
MariaDB version: 10.4.17-MariaDB
From my read of the docs on varchar and collation this is expected behaviour with the configured
utf8mb4_bin
collation.Issue #3199355 → introduced a fix to strip trailing spaces to avoid these errors but it excludes type=='text'. See /modules/search_api_db/src/DatabaseCompatibility/MySql.php. Not sure why it excludes text?
- Status changed to Needs review
about 1 year ago 1:57pm 29 October 2023 - last update
about 1 year ago 545 pass - last update
about 1 year ago 545 pass - 🇦🇹Austria drunken monkey Vienna, Austria
Thanks a lot for reporting this problem!
I could easily replicate and fix it. And while the problem is only present in MySQL, I guess adding a token that is just a word plus a single trailing space doesn’t make sense for any of the other DBMSs, either. Still, applying thertrim()
also to text fields probably makes sense, too, if MySQL is incapable of treating those correctly.From my read of the docs on varchar and collation this is expected behaviour with the configured
utf8mb4_bin
collation.Which is all you need to know about MySQL. Seriously, wtf?
Awesomely quick patch! Tested and works fine, thanks a heap.
This is a bit of a nitpick+tangent, but your comment here got me thinking:
In case... ...the bigram wouldn't fit into the maximum token length, there is no bigram to add.
The code is checking the length of the prev_word, not the bigram itself. No big deal, but that got me looking into what happens if the bigram itself is too long, and I see it gets truncated further down, eg the text abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvwxyz is stored as the phrase
abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvw
. It's a different scenario and may not cause any problems, but is there any value in storing a truncated bigram or should they just be discarded? I did some brief testing and don't get any results when searching for abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvw but I do get results for abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvwxyz, with and without quotes, so I guess it still works somehow?- last update
about 1 year ago 537 pass, 2 fail - last update
about 1 year ago 545 pass - last update
about 1 year ago 544 pass - last update
about 1 year ago 545 pass -
drunken monkey →
committed d0ddc698 on 8.x-1.x
Issue #3397017 by drunken monkey, gaddman: Fixed MySQL error in edge...
-
drunken monkey →
committed d0ddc698 on 8.x-1.x
- Status changed to Fixed
about 1 year ago 2:23pm 11 November 2023 - 🇦🇹Austria drunken monkey Vienna, Austria
Thanks for reporting back, good to hear it works for you.
Merged. Thanks again!And yes, overlong bigrams are already handled “properly” (i.e., as well as possible) in
\Drupal\search_api_db\Plugin\search_api\backend\Database::splitKeys()
. Automatically closed - issue fixed for 2 weeks with no activity.