SQL duplicate entry with bigram

Issue created by @gaddman

From what I can tell the indexed entries are almost unique, but the trailing space is causing problems. So there are two entries:
1. 'abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw': the 49-char word
2. 'abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw ': the 49-char word plus a space (it would have been a phrase consisting of the 49-char word plus a space plus the following word but is trimmed to 50 chars).
These are being treated as identical in the MariaDB database:

MariaDB [pantheon]> show full columns from search_api_db_default_text;
+------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+
| Field      | Type             | Collation       | Null | Key | Default | Extra | Privileges                      | Comment                                                                                          |
+------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+
| item_id    | varchar(150)     | utf8_general_ci | NO   | PRI | NULL    |       | select,insert,update,references |                                                                                                  |
| field_name | varchar(191)     | utf8mb4_bin     | NO   | PRI | NULL    |       | select,insert,update,references | The name of the field in which the token appears, or a base-64 encoded sha-256 hash of the field |
| word       | varchar(50)      | utf8mb4_bin     | NO   | PRI | NULL    |       | select,insert,update,references | The text of the indexed token                                                                    |
| score      | int(10) unsigned | NULL            | NO   |     | 0       |       | select,insert,update,references | The score associated with this token                                                             |
+------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+
4 rows in set (0.002 sec)

MariaDB [pantheon]> insert into search_api_db_default_text VALUES('entity:node/213748:en','rendered_item','abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw',1000);                                          Query OK, 1 row affected (0.001 sec)

MariaDB [pantheon]> insert into search_api_db_default_text VALUES('entity:node/213748:en','rendered_item','abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw ',1000);
ERROR 1062 (23000): Duplicate entry 'entity:node/213748:en-rendered_item-abcdefjklmnopqrstuvwxyzab...' for key 'PRIMARY'

MariaDB version: 10.4.17-MariaDB

From my read of the docs on varchar and collation this is expected behaviour with the configured utf8mb4_bin collation.

Comment over 1 year ago →
gaddman
Comment over 1 year ago →
gaddman
Issue #3199355 → introduced a fix to strip trailing spaces to avoid these errors but it excludes type=='text'. See /modules/search_api_db/src/DatabaseCompatibility/MySql.php. Not sure why it excludes text?
Status changed to Needs review over 1 year ago1:57pm 29 October 2023
Open in Jenkins → Open on Drupal.org →
Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
last update over 1 year ago
545 pass
Open in Jenkins → Open on Drupal.org →
Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
last update over 1 year ago
545 pass
Comment over 1 year ago →
🇦🇹Austria drunken monkey Vienna, Austria
Thanks a lot for reporting this problem!
I could easily replicate and fix it. And while the problem is only present in MySQL, I guess adding a token that is just a word plus a single trailing space doesn’t make sense for any of the other DBMSs, either. Still, applying the rtrim() also to text fields probably makes sense, too, if MySQL is incapable of treating those correctly.

From my read of the docs on varchar and collation this is expected behaviour with the configured utf8mb4_bin collation.

Which is all you need to know about MySQL. Seriously, wtf?
Comment over 1 year ago →
gaddman
Awesomely quick patch! Tested and works fine, thanks a heap.

This is a bit of a nitpick+tangent, but your comment here got me thinking:

In case... ...the bigram wouldn't fit into the maximum token length, there is no bigram to add.

The code is checking the length of the prev_word, not the bigram itself. No big deal, but that got me looking into what happens if the bigram itself is too long, and I see it gets truncated further down, eg the text abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvwxyz is stored as the phrase abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvw. It's a different scenario and may not cause any problems, but is there any value in storing a truncated bigram or should they just be discarded? I did some brief testing and don't get any results when searching for abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvw but I do get results for abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvwxyz, with and without quotes, so I guess it still works somehow?
Open in Jenkins → Open on Drupal.org →
Core: 10.1.x + Environment: PHP 8.2 & MySQL 8
last update over 1 year ago
537 pass, 2 fail
Open in Jenkins → Open on Drupal.org →
Core: 9.5.5 + Environment: PHP 8.1 & pgsql-10.12
last update over 1 year ago
545 pass
Open in Jenkins → Open on Drupal.org →
Core: 10.1.x + Environment: PHP 8.2 & MySQL 8
last update over 1 year ago
544 pass
Open in Jenkins → Open on Drupal.org →
Core: 9.5.5 + Environment: PHP 8.1 & pgsql-10.12
last update over 1 year ago
545 pass
Comment over 1 year ago →
System Message

drunken monkey → committed d0ddc698 on 8.x-1.x
Issue #3397017 by drunken monkey, gaddman: Fixed MySQL error in edge...
Status changed to Fixed over 1 year ago2:23pm 11 November 2023
Comment over 1 year ago →
🇦🇹Austria drunken monkey Vienna, Austria
Thanks for reporting back, good to hear it works for you.
Merged. Thanks again!

And yes, overlong bigrams are already handled “properly” (i.e., as well as possible) in \Drupal\search_api_db\Plugin\search_api\backend\Database::splitKeys().
Comment over 1 year ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.

SQL duplicate entry with bigram

Problem/Motivation

Steps to reproduce

Comments & Activities