SQL duplicate entry with bigram

Created on 26 October 2023, about 1 year ago
Updated 11 November 2023, about 1 year ago

Problem/Motivation

Since upgrading from 8.x-1.29 and enabling bigram phrase indexing, if a node has a 49 character word in it (TOKEN_LENGTH_MAX-1) then the following error is triggered (truncated):
SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'entity:node/213748:en-rendered_item-abcdefghijklmnopqrstuvwxy...' for key 'PRIMARY': INSERT INTO "search_api_db_default_text" ("item_id", "field_name", "word", "score") VALUES
With the duplicate entry being two identical entries with the 49 character word.

Steps to reproduce

  • Enable bigram phrase indexing.
  • Add rendered HTML output field.
  • Create a node with a 49 character word somewhere in it (can be the only word).

If TOKEN_LENGTH_MAX is lowered from 50 to 49 then a 48 character word will trigger the error (and a longer word will be fine).

🐛 Bug report
Status

Fixed

Version

1.30

Component

Database backend

Created by

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

  • Issue created by @gaddman
  • From what I can tell the indexed entries are almost unique, but the trailing space is causing problems. So there are two entries:
    1. 'abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw': the 49-char word
    2. 'abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw ': the 49-char word plus a space (it would have been a phrase consisting of the 49-char word plus a space plus the following word but is trimmed to 50 chars).
    These are being treated as identical in the MariaDB database:

    MariaDB [pantheon]> show full columns from search_api_db_default_text;
    +------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+
    | Field      | Type             | Collation       | Null | Key | Default | Extra | Privileges                      | Comment                                                                                          |
    +------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+
    | item_id    | varchar(150)     | utf8_general_ci | NO   | PRI | NULL    |       | select,insert,update,references |                                                                                                  |
    | field_name | varchar(191)     | utf8mb4_bin     | NO   | PRI | NULL    |       | select,insert,update,references | The name of the field in which the token appears, or a base-64 encoded sha-256 hash of the field |
    | word       | varchar(50)      | utf8mb4_bin     | NO   | PRI | NULL    |       | select,insert,update,references | The text of the indexed token                                                                    |
    | score      | int(10) unsigned | NULL            | NO   |     | 0       |       | select,insert,update,references | The score associated with this token                                                             |
    +------------+------------------+-----------------+------+-----+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+
    4 rows in set (0.002 sec)
    
    MariaDB [pantheon]> insert into search_api_db_default_text VALUES('entity:node/213748:en','rendered_item','abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw',1000);                                          Query OK, 1 row affected (0.001 sec)
    
    MariaDB [pantheon]> insert into search_api_db_default_text VALUES('entity:node/213748:en','rendered_item','abcdefjklmnopqrstuvwxyzabcdefjklmnopqrstuvw ',1000);
    ERROR 1062 (23000): Duplicate entry 'entity:node/213748:en-rendered_item-abcdefjklmnopqrstuvwxyzab...' for key 'PRIMARY'
    

    MariaDB version: 10.4.17-MariaDB

    From my read of the docs on varchar and collation this is expected behaviour with the configured utf8mb4_bin collation.

  • Issue #3199355 introduced a fix to strip trailing spaces to avoid these errors but it excludes type=='text'. See /modules/search_api_db/src/DatabaseCompatibility/MySql.php. Not sure why it excludes text?

  • Status changed to Needs review about 1 year ago
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update about 1 year ago
    545 pass
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update about 1 year ago
    545 pass
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Thanks a lot for reporting this problem!
    I could easily replicate and fix it. And while the problem is only present in MySQL, I guess adding a token that is just a word plus a single trailing space doesn’t make sense for any of the other DBMSs, either. Still, applying the rtrim() also to text fields probably makes sense, too, if MySQL is incapable of treating those correctly.

    From my read of the docs on varchar and collation this is expected behaviour with the configured utf8mb4_bin collation.

    Which is all you need to know about MySQL. Seriously, wtf?

  • Awesomely quick patch! Tested and works fine, thanks a heap.

    This is a bit of a nitpick+tangent, but your comment here got me thinking:

    In case... ...the bigram wouldn't fit into the maximum token length, there is no bigram to add.

    The code is checking the length of the prev_word, not the bigram itself. No big deal, but that got me looking into what happens if the bigram itself is too long, and I see it gets truncated further down, eg the text abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvwxyz is stored as the phrase abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvw. It's a different scenario and may not cause any problems, but is there any value in storing a truncated bigram or should they just be discarded? I did some brief testing and don't get any results when searching for abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvw but I do get results for abcdefjhijklmnopqrstuvwxyzabcdefjhijklmnopqrs tuvwxyz, with and without quotes, so I guess it still works somehow?

  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.x + Environment: PHP 8.2 & MySQL 8
    last update about 1 year ago
    537 pass, 2 fail
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.5 + Environment: PHP 8.1 & pgsql-10.12
    last update about 1 year ago
    545 pass
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.x + Environment: PHP 8.2 & MySQL 8
    last update about 1 year ago
    544 pass
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.5 + Environment: PHP 8.1 & pgsql-10.12
    last update about 1 year ago
    545 pass
  • Status changed to Fixed about 1 year ago
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Thanks for reporting back, good to hear it works for you.
    Merged. Thanks again!

    And yes, overlong bigrams are already handled “properly” (i.e., as well as possible) in \Drupal\search_api_db\Plugin\search_api\backend\Database::splitKeys().

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024