Item splitter processor to avoid record size limit

Created on 4 January 2022, almost 3 years ago
Updated 24 May 2023, over 1 year ago

Problem/Motivation

  • Algolia has a maximum size limit for a single record. search_api_algolia currently provides a truncate option which truncates strings to 10000 characters in an effort to avoid hitting the limit. This is not ideal because it results in data not being indexed.

Steps to reproduce

Proposed resolution

  • In an effort to solve this problem I created an Algolia Item Splitter processor (provided in patch). Please see the Remaining tasks/Known issues below before using. I'm hoping someone smarter than me will take this and expand on it, addressing these issues.
  • The processor allows setting a maximum character limit for all fields on which it is enabled (only allowed on strings). If a field value has more characters than the limit, the field value is split into smaller pieces using the code Algolia provides here (a rough sketch of the idea follows this list). Then during the indexing process:
    • For each split that was created, a new SplitItem is created (a new class that extends search_api Item class).
    • An objectID is set on the SplitItem. Unlike an Item, the objectID is not the same as the search_api itemID. It is the itemID plus the field machine name plus the split number.
    • The field value is set to empty for the original Item.
    • For each indexing batch process, the Items are indexed first (sent off to Algolia using saveObjects()), and then the SplitItems.
  • In order to avoid duplicate records in Algolia results, we have to set an attributeForDistinct in the config for the index. I'm personally using url but you can use whatever you want. Algolia then combines all the records with the same url into one (as far as the end user can see).
  • We also need to make sure that search_api_algolia's truncate option is turned off in the server settings. The splitter is meant to be an alternative approach.
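
A minimal sketch of the splitting idea described above. This is not the patch code (the patch reuses the snippet Algolia provides); the function name, the 10000-character default, and the objectID separator are all illustrative.

```php
<?php

/**
 * Splits a long field value into chunks of at most $limit characters,
 * breaking on whitespace so words stay intact.
 */
function split_field_value(string $value, int $limit = 10000): array {
  $chunks = [];
  $current = '';
  foreach (preg_split('/\s+/', trim($value)) as $word) {
    $candidate = ($current === '') ? $word : $current . ' ' . $word;
    if (mb_strlen($candidate) > $limit && $current !== '') {
      $chunks[] = $current;
      $current = $word;
    }
    else {
      $current = $candidate;
    }
  }
  if ($current !== '') {
    $chunks[] = $current;
  }
  return $chunks;
}

// objectID for each SplitItem = itemID + field machine name + split number.
// The separator used here is only an example.
$item_id = 'entity:node/118:en';
$long_body = str_repeat('Lorem ipsum dolor sit amet. ', 1000);
foreach (split_field_value($long_body) as $i => $chunk) {
  $object_id = $item_id . '__body__' . $i;
  // Each $chunk would become its own record in Algolia under $object_id.
}
```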

How to test patch:

  • Apply patch.
  • Go to your Algolia search_api server and make sure that truncation is disabled.
  • Go to an Algolia search_api index. Click the Processors tab.
  • Enable the Algolia Item Splitter processor.
    • Enable the processor only on fields that are likely to be long, such as the body field or rendered HTML.
    • If you enable on rendered HTML, you should also enable the HTML trimmer processor on the rendered HTML field.
  • Set the character limit. For testing purposes, set it low enough that it is guaranteed to cause splits to be created.
  • Identify a node that will be indexed that has a field value that will be split based on the character limit you set.
  • Clear and reindex.
  • Go to your Algolia dashboard.
  • In the Browse tab, search for the node you identified. You should see that there are multiple records for the same node. The value of the field you enabled the processor for should be split amongst the records.
  • In the Algolia dashboard, go to Configuration > Deduplication and Grouping. Set Distinct to true. Set the Attribute for Distinct to a field guaranteed to have a unique value, such as the url or nid. (The same settings can also be applied through the Algolia API; see the sketch after this list.)
  • Go back to the Browse tab and search for the node again. You should only see one record. The field that was split will not show the entire field value. This is apparently another quirk of Algolia. I do not know what happens when you try to render the split field on the frontend.
  • In the Browse tab, search for something that appears in the split field value that does not appear on the one record you can see. You should find that even though you can't see the entire field value, the record is still returned in search results, meaning that the entire value of the field is being searched.
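
For reference, the Deduplication and Grouping settings can also be applied through the Algolia PHP client instead of the dashboard. This is only a sketch: the application ID, API key, index name, and the choice of url as the distinct attribute are placeholders, and it assumes the v2/v3-style client API.

```php
<?php

use Algolia\AlgoliaSearch\SearchClient;

// Placeholders: use your own credentials and index name.
$client = SearchClient::create('YourApplicationID', 'YourAdminAPIKey');
$index = $client->initIndex('my_search_api_index');

$index->setSettings([
  // Collapse records sharing the same attributeForDistinct value into one result.
  'distinct' => TRUE,
  // Must be an attribute with a unique value per node, e.g. url or nid.
  'attributeForDistinct' => 'url',
]);
```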

Remaining tasks

Known issues

  • When the index is cleared by clicking "Clear all indexed data" at /admin/config/search/search-api/index/index_name , the splits are not removed from the Algolia index. This is because the clear process works by looping through each search_api Item and sending it to Algolia's deleteObjects() method which takes an array of itemIDs/objectIDs. Since search_api doesn't know about SplitItems, these are excluded.
    • I'm sure this is solvable. We would need a table that keeps track of all the objectIDs for the currently indexed SplitItems. Then when the index is cleared, send them to deleteObjects() and remove them from the table (see the sketch after this list).
    • Until this is solved, you may want to periodically clear the index from the Algolia side and then reindex on the Drupal side.
  • SplitItems bypass some of the multilingual code in \Drupal\search_api_algolia\Plugin\search_api\backend\SearchApiAlgoliaBackend::indexItems that Items go through.
  • The number of indexed items that search_api reports at /admin/config/search/search-api/index/index_name will not reflect the number of indexed items in Algolia if a split has been created. Because search_api probably does not support the concept of more items being created during the index process, perhaps this works as expected?
  • The batch process will index more than expected/intended if a split has been created. For example, each batch may be 50 items, but with the addition of splits, many more may be sent to Algolia during that batch. Unclear if this could cause the batch to timeout. The documentation for saveObjects() says "To ensure good performance, saveObjects automatically splits your records into batches of 1,000 objects" so I think it's unlikely to cause an issue on the Algolia side.
  • All other fields/attributes are included on the split item/object in Algolia. In some ways this is nice because you can use any attribute as the attributeForDistinct, and it will work. However, it would be smarter to only include the split field/attribute and the attribute that is set as attributeForDistinct. In theory these are all that are needed on the split records.
  • The code to split the field value was taken directly from Algolia. That said, I don't think it's very smart. For example I don't think it knows not to trim/split in the middle of a word.
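
A rough sketch of the clean-up idea from the first known issue above (tracking SplitItem objectIDs and deleting them when the index is cleared). The table name search_api_algolia_split and the variables are hypothetical; this is not part of the patch.

```php
<?php

// When a SplitItem is sent to Algolia, remember its objectID in a custom
// tracking table (hypothetical name: search_api_algolia_split).
$search_api_index_id = 'my_index';
$object_id = 'entity:node/118:en__body__1';

\Drupal::database()->merge('search_api_algolia_split')
  ->keys(['object_id' => $object_id])
  ->fields(['index_id' => $search_api_index_id])
  ->execute();

// When the index is cleared, delete the tracked splits from Algolia as well,
// then forget them. $algolia_index stands for the initialized Algolia index
// object used by the backend.
$object_ids = \Drupal::database()->select('search_api_algolia_split', 's')
  ->fields('s', ['object_id'])
  ->condition('index_id', $search_api_index_id)
  ->execute()
  ->fetchCol();

if ($object_ids) {
  $algolia_index->deleteObjects($object_ids);
  \Drupal::database()->delete('search_api_algolia_split')
    ->condition('index_id', $search_api_index_id)
    ->execute();
}
```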

User interface changes

  • Creates an Algolia Item Splitter processor, available under the Processors tab of an index.

API changes

Data model changes

Feature request
Status

Active

Version

3.0

Component

Code

Created by

🇺🇸United States maskedjellybean Portland, OR


Merge Requests

Comments & Activities


  • 🇲🇾Malaysia jonloh

    Tried the patch, but unfortunately this does not work well in a multilingual setup.

  • 🇮🇳India Akhil Babu Chengannur

    Thanks for the patch. I have created a new patch with a few changes.

    • When records are split, the current patch removes the split value from the original record and adds it to the split records. Instead, the new patch adds the first split value to the original record and the subsequent values to the split records.
    • It adds a new field ‘parent_record’ to all records to filter out all splits associated with a record. The original record will have ‘self’ as the value in this field, and split records will have ‘node_id:language_code’ as the value. This field is used to delete all splits associated with a record when a node is modified/deleted. It will also help distinguish between the original record and its splits if you are building the search UI using JS. The ‘parent_record’ field should be configured as a filter from the Algolia dashboard for this to work. (See the sketch after this list for the resulting record shape.)
    • Works with multilingual content.
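
Illustrative shape of the resulting records under this approach (not actual patch output; the field values and the split objectID format are made up for the example):

```php
<?php

// Original record: keeps the first chunk and marks itself as 'self'.
$original_record = [
  'objectID' => 'entity:node/118:en',
  'parent_record' => 'self',
  'body' => 'First chunk of the long body…',
  // ... all other attributes ...
];

// Split record: carries a later chunk and points back to its parent via
// 'node_id:language_code', so all splits can be filtered out and deleted
// when the node is updated or removed.
$split_record = [
  'objectID' => 'entity:node/118:en-split-1',
  'parent_record' => '118:en',
  'body' => 'Second chunk of the long body…',
];
```
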
  • 🇺🇸United States maskedjellybean Portland, OR

    Thank you for carrying this forward! Sadly I no longer have an Algolia project to work with so I can't test the new patch out.

  • Status changed to Postponed: needs info 5 months ago
  • 🇮🇳India nikunjkotecha India, Gujarat, Rajkot

    This is good. I'm not convinced, though, that we should index huge objects in Algolia. Can we have some real use case to help understand the need for this?

  • 🇺🇸United States maskedjellybean Portland, OR

    The use case is if you want to index more than 10000 characters in one record. :-)

    Algolia offers the ability to split records in order to get around their character count limitation, so it would be great if search_api_algolia leveraged this ability.

    Site builders/developers may potentially not realize their records are being truncated. When a record is truncated, search does not cover the entire record because only part of it is indexed. This means worse search results without any indication of why.

  • 🇺🇸United States kevinb623

    This patch is working wonderfully to properly index and discover lengthy pages on a content-rich website we manage.

    Only suggestion is to update ItemSplitter.php line 68 to use isset() to reduce PHP warnings related to unknown and null array keys.

    Very nice work!

  • 🇬🇧United Kingdom reecemarsland

    Our use case is indexing PDF files attached to content and we need the PDF content to be searchable.

  • 🇧🇪Belgium Den Tweed

    Same as in #17, our use case is making attached documents searchable.

    I've worked further on patch #11 and changed the following:

    • Fixed the warning for getSplitsForItem(); the whole method could be reduced to a simple ?? statement (see the sketch after this list)
    • Removed the getDataTypeHelper() and setDataTypeHelper() overrides as they aren't changed from the parent class
    • Moved the code from processFieldValue() to process() and removed the string type check. As far as I understand, the 'String' data type is for shorter field values (e.g. title, URL, etc.) and should already be shorter than the limit in most cases. It's 'Text' (aka Fulltext) that we need most here, imo, but in general anything that is considered string characters. This is already covered by the shouldProcess() method (which has an is_string() check), which is the condition for calling process() (which in turn calls processFieldValue()). Since process() is an empty method, there's no need to override the processFieldValue() code
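
A guess at what the ?? simplification mentioned above looks like; the actual method and property names come from the patch, so treat this as illustrative only.

```php
<?php

// The splits recorded per item, keyed by search_api item ID (stand-in data).
$splits = [
  'entity:node/118:en' => ['chunk one', 'chunk two'],
];

$item_id = 'entity:node/119:en';

// Before: an isset() check (or none at all, triggering a PHP warning) plus a
// separate return for the missing case. After: a single null-coalescing
// expression that falls back to an empty array.
$splits_for_item = $splits[$item_id] ?? [];
```
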
  • Status changed to Needs review about 1 month ago
  • 🇧🇪Belgium dieterholvoet Brussels

    I'm not convinced, though, that we should index huge objects in Algolia. Can we have some real use case to help understand the need for this?

    We hit this limit regularly on projects, e.g. when indexing long text fields or paragraphs for search. This is a very valid use case.

  • 🇧🇪Belgium dieterholvoet Brussels

    I started an MR based on the latest patch. I'm sometimes still getting the following error, even with the patch applied:

    Record at the position 46 objectID=entity:node/118:bg-split-processed_2-1 is too big size=15808/10000 bytes. Please have a look at https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-you...

    I'll do some debugging.

  • Pipeline finished with Success
    about 1 month ago
    Total: 133s
    #344540
  • Pipeline finished with Success
    about 1 month ago
    Total: 134s
    #344689
  • 🇧🇪Belgium dieterholvoet Brussels

    I can't figure out the problem. That project might have been using an outdated patch, I updated it and I'll wait and see if the issue happens again.

  • Merge request !29: Resolve #3256840 "Smarter splitter" (Open), created by dieterholvoet
  • Pipeline finished with Success
    24 days ago
    Total: 136s
    #354386
  • 🇧🇪Belgium dieterholvoet Brussels

    The existing splitter doesn't work consistently for me. Splitting up all enabled fields at a fixed number of characters works quite well if you only have one very big body field. If you have multiple fields with a lot of content, splitting at a fixed number of characters still risks creating records that are too big, unless you set the character limit to a low value.

    That's why I decided to rewrite everything and come up with a smarter splitter. Instead of splitting all enabled fields at a fixed number of characters, my splitter fills up records until the limit dictated by Algolia (usually 10K bytes) before it starts splitting text into multiple records. This makes it practically impossible to create records that are too big, and it's a lot more efficient: it will only create as many records as necessary.

    Most of the logic was moved from the field processor to the search backend code, right before the record is sent to Algolia, in order to be able to calculate the record sizes as efficiently as possible with all base fields included.
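
A minimal sketch of this "fill up to the limit" approach, assuming the long text is added word by word until the JSON-encoded record reaches the size limit. It is not the MR code; the function name, the objectID suffix, and the use of json_encode() to approximate Algolia's record size are assumptions.

```php
<?php

/**
 * Builds one or more records from a base record plus one long text field,
 * keeping each record under $max_bytes (measured on the JSON encoding).
 * $base_record contains every attribute except the long text field.
 */
function build_split_records(array $base_record, string $field_name, string $long_text, int $max_bytes = 10000): array {
  $records = [];
  $current = $base_record + [$field_name => ''];

  foreach (preg_split('/\s+/', trim($long_text)) as $word) {
    $candidate = $current;
    $candidate[$field_name] = ltrim($candidate[$field_name] . ' ' . $word);

    if (strlen(json_encode($candidate)) > $max_bytes && $current[$field_name] !== '') {
      // The current record is as full as it can get: flush it and start a
      // new split that repeats the base attributes and continues the text.
      $records[] = $current;
      $current = $base_record + [$field_name => $word];
      $current['objectID'] = $base_record['objectID'] . '-split-' . count($records);
    }
    else {
      $current = $candidate;
    }
  }

  $records[] = $current;
  return $records;
}

$records = build_split_records(
  ['objectID' => 'entity:node/118:en', 'url' => '/node/118', 'title' => 'Example'],
  'body',
  str_repeat('lorem ipsum dolor ', 5000)
);
// Only as many records as needed are created; they can then all be sent to
// Algolia together with saveObjects().
```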

  • 🇧🇪Belgium dieterholvoet Brussels

    I also improved the documentation of the processor, warning users to set things up correctly on the Algolia side. I also changed it so that the truncate option is automatically disabled for an index when splitting is enabled.

  • Pipeline finished with Success
    24 days ago
    Total: 165s
    #354429
  • Pipeline finished with Success
    19 days ago
    Total: 135s
    #358714
  • 🇧🇪Belgium dieterholvoet Brussels

    I cleaned up the issue description. About the known issues previously listed:

    When the index is cleared by clicking "Clear all indexed data" at /admin/config/search/search-api/index/index_name , the splits are not removed from the Algolia index. This is because the clear process works by looping through each search_api Item and sending it to Algolia's deleteObjects() method which takes an array of itemIDs/objectIDs. Since search_api doesn't know about SplitItems, these are excluded.

    This is not true. When clicking that button, deleteAllIndexItems() is triggered, which clears the whole index instead of specific objects. I'll remove this from the known issues.

    SplitItems bypass some of the multilingual code in \Drupal\search_api_algolia\Plugin\search_api\backend\SearchApiAlgoliaBackend::indexItems that Items go through.

    This is not the case in my implementation. Removing from known issues.

    The number of indexed items that search_api reports at /admin/config/search/search-api/index/index_name will not reflect the number of indexed items in Algolia if a split has been created. Because search_api probably does not support the concept of more items being created during the index process, perhaps this works as expected?

    I would also say this works as expected. The split items are an implementation detail of this specific backend and don't need to be listed in the UI. When you display the index on dashboard.algolia.com, split items are also not listed or counted separately. Removing from known issues.

    The batch process will index more than expected/intended if a split has been created. For example, each batch may be 50 items, but with the addition of splits, many more may be sent to Algolia during that batch. Unclear if this could cause the batch to timeout. The documentation for saveObjects() says "To ensure good performance, saveObjects automatically splits your records into batches of 1,000 objects" so I think it's unlikely to cause an issue on the Algolia side.

    This is not the case in my implementation since splits are indexed together with regular objects. Removing from known issues.

    All other fields/attributes are included on the split item/object in Algolia. In some ways this is nice because you can use any attribute as the attributeForDistinct, and it will work. However, it would be smarter to only include the split field/attribute and the attribute that is set as attributeForDistinct. In theory these are all that are needed on the split records.

    This is not true. Algolia doesn't merge the contents of split items. When searching an Algolia index and multiple split items match the query, the split item that matches best is returned to the user. This means that all non-split attributes need to be present on all split objects. Removing from known issues.

    The code to split the field value was taken directly from Algolia. That said, I don't think it's very smart. For example I don't think it knows not to trim/split in the middle of a word.

    This is not true. It splits on spaces, so it shouldn't break words. The current code looks plenty smart to me. Removing from known issues.

  • 🇧🇪Belgium dieterholvoet Brussels

    dieterholvoet changed the visibility of the branch 3256840-item-splitter-processor to hidden.
