Item splitter processor to avoid record size limit

Created on 4 January 2022
Updated 16 January 2024

Problem/Motivation

  • Algolia has a maximum size limit for a single record. search_api_algolia currently provides a truncate option that truncates strings to 10,000 characters in an effort to stay under the limit. This is not ideal because the truncated data is never indexed.

Steps to reproduce

Proposed resolution

  • In an effort to solve this problem I created an Algolia Item Splitter processor (provided in the patch). Please see the Remaining tasks/Known issues below before using it. I'm hoping someone smarter than me will take this and expand on it, addressing those issues.
  • The processor allows setting a maximum character limit for all fields on which it is enabled (string fields only). If a field value exceeds the limit, the value is split into smaller pieces using the code Algolia provides here (a sketch follows this list). Then during the indexing process:
    • For each split that was created, a new SplitItem is created (a new class that extends search_api Item class).
    • An objectID is set on the SplitItem. Unlike an Item, the objectID is not the same as the search_api itemID. It is the itemID plus the field machine name plus the split number.
    • The field value is set to empty for the original Item.
    • For each indexing batch process, the Items are indexed first (sent off to Algolia using saveObjects()), and then the SplitItems.
  • In order to avoid duplicate records in Algolia results, we have to set an attributeForDistinct in the config for the index. I'm personally using url but you can use whatever you want. Algolia then combines all the records with the same url into one (as far as the end user can see).
  • We also need to make sure that search_api_algolia's truncate option is turned off in the server settings. The splitter is meant to be an alternative approach.
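
To make the mechanics concrete, here is a minimal sketch of the splitting step described above. The helper names and the exact Item/SplitItem API calls are illustrative assumptions, not the patch's literal code; the actual patch reuses Algolia's own splitting snippet.

```php
<?php

use Drupal\search_api\Item\ItemInterface;

/**
 * Splits a long field value into chunks below the configured limit.
 *
 * Simplified stand-in for the splitting code Algolia provides; note it
 * cuts at exact character counts and can break words mid-way.
 */
function chunk_field_value(string $value, int $limit): array {
  return str_split($value, $limit);
}

/**
 * Builds split chunks for one enabled field on one item (hypothetical).
 */
function build_splits(ItemInterface $item, string $field_id, int $limit): array {
  $value = $item->getField($field_id)->getValues()[0] ?? '';
  $splits = [];
  foreach (chunk_field_value((string) $value, $limit) as $i => $chunk) {
    // objectID = search_api item ID + field machine name + split number.
    $object_id = $item->getId() . ':' . $field_id . ':' . $i;
    $splits[$object_id] = $chunk;
  }
  // The original item's field is emptied; each chunk becomes a SplitItem
  // that is sent to Algolia via saveObjects() after the regular items.
  $item->getField($field_id)->setValues(['']);
  return $splits;
}
```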

How to test the patch:

  • Apply patch.
  • Go to your Algolia search_api server and make sure that truncation is disabled.
  • Go to an Algolia search_api index. Click the Processors tab.
  • Enable the Algolia Item Splitter processor.
    • Enable the processor only on fields that are likely to be long, such as the body field or rendered HTML.
    • If you enable it on rendered HTML, you should also enable the HTML trimmer processor on the rendered HTML field.
  • Set the character limit. For testing purposes, set it low enough that it is guaranteed to cause splits to be created.
  • Identify a node that will be indexed and that has a field value long enough to be split at the character limit you set.
  • Clear and reindex.
  • Go to your Algolia dashboard.
  • In the Browse tab, search for the node you identified. You should see that there are multiple records for the same node. The value of the field you enabled the processor for should be split amongst the records.
  • In the Algolia dashboard, go to Configuration > Deduplication and Grouping. Set Distinct to true, and set the Attribute for Distinct to a field guaranteed to have a unique value, such as the url or nid. (A programmatic equivalent is sketched after these steps.)
  • Go back to the Browse tab and search for the node again. You should only see one record. The field that was split will not show the entire field value. This is apparently another quirk of Algolia. I do not know what happens when you try to render the split field on the frontend.
  • In the Browse tab, search for something that appears in the split field value that does not appear on the one record you can see. You should find that even though you can't see the entire field value, the record is still returned in search results, meaning that the entire value of the field is being searched.
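
The Distinct settings from the dashboard steps above can also be applied programmatically. A minimal sketch using the Algolia PHP client (v2/v3 API; the credentials and index name are placeholders):

```php
<?php

use Algolia\AlgoliaSearch\SearchClient;

// Placeholder credentials; substitute your application ID and admin API key.
$client = SearchClient::create('YourApplicationID', 'YourAdminAPIKey');
$index = $client->initIndex('your_search_api_index');

// Equivalent of Configuration > Deduplication and Grouping in the dashboard:
// collapse records that share the same url into a single result.
$index->setSettings([
  'distinct' => TRUE,
  'attributeForDistinct' => 'url',
]);
```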

Remaining tasks

Known issues

  • When the index is cleared by clicking "Clear all indexed data" at /admin/config/search/search-api/index/index_name, the splits are not removed from the Algolia index. This is because the clear process works by looping through each search_api Item and sending it to Algolia's deleteObjects() method, which takes an array of itemIDs/objectIDs. Since search_api doesn't know about SplitItems, these are excluded.
    • I'm sure this is solvable. We would need a table that keeps track of the objectIDs of the currently indexed SplitItems. Then, when the index is cleared, send them to deleteObjects() and remove them from the table (a rough sketch follows this list).
    • Until this is solved, you may want to periodically clear the index from the Algolia side and then reindex on the Drupal side.
  • SplitItems bypass some of the multilingual code in \Drupal\search_api_algolia\Plugin\search_api\backend\SearchApiAlgoliaBackend::indexItems that Items go through.
  • The number of indexed items that search_api reports at /admin/config/search/search-api/index/index_name will not match the number of records in Algolia once splits have been created. Since search_api has no concept of additional items being created during the indexing process, this may simply be working as designed.
  • The batch process will index more than expected/intended if a split has been created. For example, each batch may be 50 items, but with the addition of splits, many more may be sent to Algolia during that batch. It's unclear whether this could cause the batch to time out. The documentation for saveObjects() says "To ensure good performance, saveObjects automatically splits your records into batches of 1,000 objects", so I think it's unlikely to cause an issue on the Algolia side.
  • All other fields/attributes are included on the split item/object in Algolia. In some ways this is nice because you can use any attribute as the attributeForDistinct, and it will work. However, it would be smarter to only include the split field/attribute and the attribute that is set as attributeForDistinct. In theory these are all that are needed on the split records.
  • The code to split the field value was taken directly from Algolia. That said, I don't think it's very smart. For example, I don't think it avoids splitting in the middle of a word.
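
For the index-clearing issue above, one possible shape of the fix is the tracking table suggested in that bullet: record each SplitItem objectID at index time, then delete the tracked objects from Algolia when the index is cleared. A rough sketch, assuming a hypothetical search_api_algolia_split table (not part of the patch):

```php
<?php

// At index time: remember each split's objectID (hypothetical schema:
// object_id, index_id, item_id).
\Drupal::database()->merge('search_api_algolia_split')
  ->keys(['object_id' => $object_id])
  ->fields(['index_id' => $index_id, 'item_id' => $item_id])
  ->execute();

// On "Clear all indexed data": delete the tracked splits from Algolia
// ($algolia_index is the initialized Algolia index), then forget them.
$object_ids = \Drupal::database()
  ->select('search_api_algolia_split', 's')
  ->fields('s', ['object_id'])
  ->condition('index_id', $index_id)
  ->execute()
  ->fetchCol();

if ($object_ids) {
  $algolia_index->deleteObjects($object_ids);
  \Drupal::database()->delete('search_api_algolia_split')
    ->condition('index_id', $index_id)
    ->execute();
}
```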

User interface changes

  • Adds an Algolia Item Splitter processor, available under the Processors tab of an index.

API changes

Data model changes

✨ Feature request
Status

Active

Version

3.0

Component

Code

Created by

🇺🇸United States maskedjellybean Portland, OR

Comments & Activities

  • 🇲🇾Malaysia jonloh

    Tried the patch, but unfortunately it does not work well in a multilingual setup.

  • 🇮🇳India Akhil Babu Chengannur

    Thanks for the patch. I have created a new patch with a few changes.

    • When records are split, the current patch removes the split value from the original record and adds it to the split records. Instead, the new patch adds the first split value to the original record and the subsequent values to the split records.
    • It adds a new field ‘parent_record’ to all records to filter out all splits associated with a record. The original record will have ‘self’ as the value in this field, and split records will have ‘node_id:language_code’ as the value. This field is used to delete all splits associated with a record when a node is modified/deleted. It will also help distinguish between the original record and its splits if you are building the search UI in JS. The ‘parent_record’ field must be configured as a filterable attribute in the Algolia dashboard for this to work (see the sketch after this comment).
    • Works with multilingual content.
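
    A rough illustration of the parent_record scheme described in this comment. Only the field name and its values come from the comment; the surrounding record-building code and variables are hypothetical:

    ```php
    <?php

    // $base_attributes: the other indexed attributes, prepared elsewhere.
    // The original record keeps the first chunk and marks itself as parent.
    $records[] = [
      'objectID' => $item_id,
      $field_id => $chunks[0],
      'parent_record' => 'self',
    ] + $base_attributes;

    // Subsequent chunks become split records pointing back to the parent.
    foreach (array_slice($chunks, 1) as $i => $chunk) {
      $records[] = [
        'objectID' => $item_id . ':' . $field_id . ':' . ($i + 1),
        $field_id => $chunk,
        'parent_record' => $node_id . ':' . $langcode,
      ] + $base_attributes;
    }

    // parent_record must be filterable so splits can be removed when the
    // node is updated or deleted (v2/v3 Algolia PHP client API).
    $index->setSettings([
      'attributesForFaceting' => ['filterOnly(parent_record)'],
    ]);
    $index->deleteBy(['filters' => 'parent_record:"123:en"']);
    ```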
  • 🇺🇸United States maskedjellybean Portland, OR

    Thank you for carrying this forward! Sadly I no longer have an Algolia project to work with so I can't test the new patch out.
