Item splitter processor to avoid record size limit

Open on Drupal.org →

Created on 4 January 2022, over 3 years ago

Updated 24 May 2023, about 2 years ago

Problem/Motivation

Algolia has a maximum size limit for a single record. search_api_algolia currently provides a truncate option which truncates strings to 10000 characters in an effort to avoid hitting the limit. This is not ideal because it results in data not being indexed.

Steps to reproduce

Proposed resolution

In an effort to solve this problem I created an Algolia Item Splitter processor (provided in patch). Please see the Remaining tasks/Known issues below before using. I'm hoping someone smarter than me will take this and expand on it, addressing these issues.
The processor allows setting a maximum character limit for all fields on which it is enabled (only allowed on strings). If a field value has more characters than the limit, the field value is split into smaller pieces using the code Algolia provides here. Then during the indexing process:
- For each split that was created, a new SplitItem is created (a new class that extends search_api Item class).
- An objectID is set on the SplitItem. Unlike an Item, the objectID is not the same as the search_api itemID. It is the itemID plus the field machine name plus the split number.
- The field value is set to empty for the original Item.
- For each indexing batch process, the Items are indexed first (sent off to Algolia using saveObjects()), and then the SplitItems.
In order to avoid duplicate records in Algolia results, we have to set an attributeForDistinct in the config for the index. I'm personally using url but you can use whatever you want. Algolia then combines all the records with the same url into one (as far as the end user can see).
We also need to make sure that search_api_algolia's truncate option is turned off in the server settings. The splitter is meant to be an alternative approach.

How to test patch:

Apply patch.
Go to your Algolia search_api server and make sure that truncation is disabled.
Go to an Algolia search_api index. Click the Processors tab.
Enable the Algolia Item Splitter processor.
- Enable the processor only on fields that are likely to be long, such as the body field or rendered HTML.
- If you enable on rendered HTML, you should also enable the HTML trimmer processor on the rendered HTML field.
Set the character limit. For testing purposes, set it low enough that it is guaranteed to cause splits to be created.
Identify a node that will be indexed that has a field value that will be split based on the character limit you set.
Clear and reindex.
Go to your Algolia dashboard.
In the Browse tab, search for the node you identified. You should see that there are multiple records for the same node. The value of the field you enabled the processor for should be split amongst the records.
In the Algolia dashboard, go to Configuration > Deduplication and Grouping. Set Distinct to true. Set the Attribute for Distinct to a field guaranteed to have a unique value, such as the url or nid.
Go back to the Browse tab and search for the node again. You should only see one record. The field that was split will not show the entire field value. This is apparently another quirk of Algolia. I do not know what happens when you try to render the split field on the frontend.
In the Browse tab, search for something that appears in the split field value that does not appear on the one record you can see. You should find that even though you can't see the entire field value, the record is still returned in search results, meaning that the entire value of the field is being searched.

Remaining tasks

Known issues

When the index is cleared by clicking "Clear all indexed data" at /admin/config/search/search-api/index/index_name , the splits are not removed from the Algolia index. This is because the clear process works by looping through each search_api Item and sending it to Algolia's deleteObjects() method which takes an array of itemIDs/objectIDs. Since search_api doesn't know about SplitItems, these are excluded.
- I'm sure this is solvable. We would need a table that keeps track of all the objectIDs for the currently indexed SplitItems. Then when the index is cleared, send them to deleteObjects() and remove them from the table.
- Until this solved, you may want to periodically clear the index from the Algolia side, and then reindex on the Drupal side.
SplitItems bypass some of the multilingual code in \Drupal\search_api_algolia\Plugin\search_api\backend\SearchApiAlgoliaBackend::indexItems that Items go through.
The number of indexed items that search_api reports at /admin/config/search/search-api/index/index_name will not reflect the number of indexed items in Algolia if a split has been created. Because search_api probably does not support the concept of more items being created during the index process, perhaps this works as expected?
The batch process will index more than expected/intended if a split has been created. For example, each batch may be 50 items, but with the addition of splits, many more may be sent to Algolia during that batch. Unclear if this could cause the batch to timeout. The documentation for saveObjects() says "To ensure good performance, saveObjects automatically splits your records into batches of 1,000 objects" so I think it's unlikely to cause an issue on the Algolia side.
All other fields/attributes are included on the split item/object in Algolia. In some ways this is nice because you can use any attribute as the attributeForDistinct, and it will work. However, it would be smarter to only include the split field/attribute and the attribute that is set as attributeForDistinct. In theory these are all that are needed on the split records.
The code to split the field value was taken directly from Algolia. That said, I don't think it's very smart. For example I don't think it knows not to trim/split in the middle of a word.

User interface changes

Creates an Algolia Item Splitter processor available under Processors tab of index.

API changes

Data model changes

✨ Feature request

Status

Active

Version

3.0

Component

Code

Created by

🇺🇸United States maskedjellybean Portland, OR

Live updates comments and jobs are added and updated live.

Incomplete comments

Sign in to follow issues

Merge Requests

!29Item splitter processor to avoid record size limit
Merged
🇧🇪Belgium dieterholvoet
updated 5 months ago
!28Item splitter processor to avoid record size limit
Open
🇧🇪Belgium dieterholvoet
updated 8 months ago

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

Comment about 2 years ago →
🇲🇾Malaysia jonloh
Tried the patch, but unfortunately this does not work well in Multi-lingual setup.
Comment over 1 year ago →
🇮🇳India Akhil Babu Chengannur
Thanks for the patch. I have created a new patch with few changes.

When records are splitted, the current patch removes the splitted value from original record and adds it to splitted records. Instead, the new patch will add the first splitted value to original record and subsequent values to split records.

It adds a new field ‘parent_record’ in all records to filter out all splits associated with a record. The original record will have ‘self’ as the value in this field, and split records will have ‘node_id:language_code’ as the value. This field is used to delete all splits associated with a record when a node is modified/deleted. It will also help distinguish between the original record and splits if you are building the search UI using JS. The ‘parent_record’ field should be configured as a filter from the Algolia dashboard for this to work.

Works with multilingual content.
Comment over 1 year ago →
🇮🇳India Akhil Babu Chengannur
Comment over 1 year ago →
🇺🇸United States maskedjellybean Portland, OR
Thank you for carrying this forward! Sadly I no longer have an Algolia project to work with so I can't test the new patch out.
Status changed to Postponed: needs info about 1 year ago8:42am 21 July 2024
Comment about 1 year ago →
🇮🇳India nikunjkotecha India, Gujarat, Rajkot
This is good. I am not convinced though that we should index huge objects in Algolia, can we have some real use case to help understand the need for this?
Comment about 1 year ago →
🇺🇸United States maskedjellybean Portland, OR
The use case is if you want to index more than 10000 characters in one record. :-)

Algolia offers the ability to split records in order to get around their character count limitation, so it would be great if search_api_algolia leveraged this ability.

Potentially site builders/developers may not realize their records are being truncated. When it is truncated search does not search the entire record because only part of it is indexed. This means worse search results without any indication why.
Comment 12 months ago →
🇺🇸United States kevinb623
This patch is working wonderfully to properly index and discover lengthy pages on a content rich website we manage.

Only suggestion is to update ItemSplitter.php line 68 to use isset() to reduce PHP warnings related to unknown and null array keys.

Very nice work!
Comment 11 months ago →
🇬🇧United Kingdom reecemarsland
Our use case is indexing PDF files attached to content and we need the PDF content to be searchable.
Comment 10 months ago →
🇧🇪Belgium Den Tweed
Same as in #17 our use case is making attached documents searchable

I've worked further on patch #11 and changed following:

Fixed warning for getSplitsForItem(), the whole method could be reduced to a simple ?? statement

Removed the getDataTypeHelper() and setDataTypeHelper() overrides as they aren't changed from the parent class

Moved the code from processFieldValue() to process() and removed the string type check. As far as I understand 'String' as dataType is for shorter field values (e.g. Title, url, etc...) and should already be shorter than the limit in most cases. It's 'Text' (aka Fulltext) we need the most here imo, but in general anything that is considered string characters. This is already covered by the shouldProcess() method (has an is_string() check) which is the condition to call process() (which in turn calls processFieldValue()). As process is an empty method there's no need to override the processFieldValue() code
Status changed to Needs review 8 months ago10:57am 20 November 2024
Comment 8 months ago →
🇧🇪Belgium dieterholvoet Brussels
I am not convinced though that we should index huge objects in Algolia, can we have some real use case to help understand the need for this?

We hit this limit regularly on projects, when e.g. indexing long text fields or paragraphs for search. This is a very valid use case.
Merge request !28Add search_api_algolia-3256840-18.patch → (Open) created by dieterholvoet
Comment 8 months ago →
🇧🇪Belgium dieterholvoet Brussels
I started a MR based on the latest patch. I'm sometimes still getting the following error, even with the patch applied:

Record at the position 46 objectID=entity:node/118:bg-split-processed_2-1 is too big size=15808/10000 bytes. Please have a look at https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-you...

I'll do some debugging.
Comment 8 months ago →
🇧🇪Belgium dieterholvoet Brussels
Pipeline finished with Success
8 months ago
Total: 133s
#344540
Pipeline finished with Success
8 months ago
Total: 134s
#344689
Comment 8 months ago →
🇧🇪Belgium dieterholvoet Brussels
I can't figure out the problem. That project might have been using an outdated patch, I updated it and I'll wait and see if the issue happens again.
Merge request !29Resolve #3256840 "Smarter splitter" → (Merged) created by dieterholvoet
Pipeline finished with Success
8 months ago
Total: 136s
#354386
Comment 8 months ago →
🇧🇪Belgium dieterholvoet Brussels
The existing splitter doesn't work consistently for me. Splitting up all enabled fields on a fixed amount of characters works quite well if you only have one very big body field. If you have multiple fields with a lot of content, splitting on a fixed amount of characters still has the risk of creating records that are too big, unless you set the amount of characters to a low value.

That's why I decided to rewrite everything and to come up with a smarter splitter. Instead of splitting all enabled fields on a fixed amount of characters, my splitter fills up records until the limit which is dictated by Algolia (usually 10K bytes), before it starts splitting text into multiple records. This makes it practically impossible to create records that are too big + it's a lot more efficient, it will only create as much records as necessary.

Most of the logic was moved from the field processor to the search backend code, right before the record is sent to Algolia, in order to be able to calculate the record sizes as efficiently as possible with all base fields included.
Comment 8 months ago →
🇧🇪Belgium dieterholvoet Brussels
I also improved documentation of the processor, warning users to set up things correctly on the Algolia side. I also changed it so the truncate option is automatically disabled for an index when splitting is enabled.
Pipeline finished with Success
8 months ago
Total: 165s
#354429
Pipeline finished with Success
8 months ago
Total: 135s
#358714
Comment 8 months ago →
🇧🇪Belgium dieterholvoet Brussels
I cleaned up the issue description. About the known issues previously listed:

When the index is cleared by clicking "Clear all indexed data" at /admin/config/search/search-api/index/index_name , the splits are not removed from the Algolia index. This is because the clear process works by looping through each search_api Item and sending it to Algolia's deleteObjects() method which takes an array of itemIDs/objectIDs. Since search_api doesn't know about SplitItems, these are excluded.

This is not true. When clicking that button, deleteAllIndexItems() is triggered, which clears the whole index instead of specific objects. I'll remove this from the known issues.

SplitItems bypass some of the multilingual code in \Drupal\search_api_algolia\Plugin\search_api\backend\SearchApiAlgoliaBackend::indexItems that Items go through.

This is not the case in my implementation. Removing from known issues.

The number of indexed items that search_api reports at /admin/config/search/search-api/index/index_name will not reflect the number of indexed items in Algolia if a split has been created. Because search_api probably does not support the concept of more items being created during the index process, perhaps this works as expected?

I would also say this works as expected. The splitted items are an implementation detail of this specific backend and are not necessary to be listed in the UI. When you display the index on dashboard.algolia.com, splitted items are also not listed or counted separately. Removing from known issues.

The batch process will index more than expected/intended if a split has been created. For example, each batch may be 50 items, but with the addition of splits, many more may be sent to Algolia during that batch. Unclear if this could cause the batch to timeout. The documentation for saveObjects() says "To ensure good performance, saveObjects automatically splits your records into batches of 1,000 objects" so I think it's unlikely to cause an issue on the Algolia side.

This is not the case in my implementation since splits are indexed together with regular objects. Removing from known issues.

All other fields/attributes are included on the split item/object in Algolia. In some ways this is nice because you can use any attribute as the attributeForDistinct, and it will work. However, it would be smarter to only include the split field/attribute and the attribute that is set as attributeForDistinct. In theory these are all that are needed on the split records.

This is not true. Algolia doesn't merge the contents of splitted items. When searching an Algolia index and multiple splitted items match the query, the splitted item that matches the most will be returned to the user. This means that all non-splitted attributes need to be present on all splitted objects. Removing from known issues.

The code to split the field value was taken directly from Algolia. That said, I don't think it's very smart. For example I don't think it knows not to trim/split in the middle of a word.

This is not true. It splits on spaces, so it shouldn't break words. The current code looks plenty smart to me. Removing from known issues.
Comment 8 months ago →
🇧🇪Belgium dieterholvoet Brussels
dieterholvoet → changed the visibility of the branch 3256840-item-splitter-processor to hidden.
First commit to issue fork.
Pipeline finished with Success
5 months ago
Total: 166s
#427735
Comment 5 months ago →
🇺🇸United States josh.stewart Lexington, KY
We ended up using the following patch after we ran into some issues. Might not be 100% perfect but we hit some snags with Korean characters and the splitting actually malforming them because of their bytes length. Hopefully this helps someone. Previous work on the MR worked great as far as I can tell other than that issue.
Comment 5 months ago →
🇧🇪Belgium dieterholvoet Brussels
@josh.stewart I added the changes from your patch to the MR. What do you mean by 'Might not be 100% perfect', anything specific to look out for?
Pipeline finished with Success
5 months ago
Total: 145s
#430440
Pipeline finished with Success
5 months ago
Total: 154s
#430442
Comment 5 months ago →
🇺🇸United States josh.stewart Lexington, KY
@dieterholvoet I was able to index 70k records without issue after the updates so the comment about not 100% perfect was maybe just a lack of confidence around testing edge cases. But it's working really well for us at the moment.
Comment 5 months ago →
🇧🇪Belgium dieterholvoet Brussels
Good to hear, thanks for the help!
Comment 5 months ago →
🇺🇸United States josh.stewart Lexington, KY
Ran into a warning message with the Annotation parser dealing with the link in the description.

[error] Doctrine\Common\Annotations\AnnotationException while computing Views data for index Main: [Syntax Error] Expected Doctrine\Common\Annotations\DocLexer::T_CLOSE_PARENTHESIS, got 'https' at position 234 in class Drupal\search_api_algolia\Plugin\search_api\processor\ItemSplitter. in Doctrine\Common\Annotations\AnnotationException::syntaxError() (line 28 of /app/vendor/doctrine/annotations/lib/Doctrine/Common/Annotations/AnnotationException.php).

So this file is to update it to remove the html in there.
Comment 5 months ago →
🇧🇪Belgium dieterholvoet Brussels
@josh.stewart could you please add those changes to the MR? Patches aren't being used anymore on Drupal.org. Thanks!
Comment 5 months ago →
🇺🇸United States jordan.caldwell
jordan.caldwell → made their first commit to this issue’s fork.
Pipeline finished with Success
5 months ago
Total: 160s
#436936
Comment 5 months ago →
🇺🇸United States jordan.caldwell
I ran into the Annotation parser issue as well. I've pushed an update to the MR to resolve it.
Pipeline finished with Skipped
5 months ago
#437489
Comment 5 months ago →
System Message

nikunjkotecha → committed 9faf6305 on 3.0.x authored by dieterholvoet →
Issue #3256840: Item splitter processor to avoid record size limit
Comment 5 months ago →
🇮🇳India nikunjkotecha India, Gujarat, Rajkot
Thanks everyone for hard work, based on last few comments I can say it is already tested in multiple projects so merging it.
Comment 5 months ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.

contrib.social Blog FAQ Discussions

Production build 0.71.5 2024