- 🇲🇾Malaysia jonloh
Tried the patch, but unfortunately it does not work well in a multilingual setup.
- 🇮🇳India Akhil Babu Chengannur
Thanks for the patch. I have created a new patch with a few changes.
- When records are split, the current patch removes the split value from the original record and adds it to the split records. Instead, the new patch adds the first split value to the original record and subsequent values to the split records.
- It adds a new field ‘parent_record’ to all records to filter out all splits associated with a record. The original record will have ‘self’ as the value in this field, and split records will have ‘node_id:language_code’ as the value (see the example after this list). This field is used to delete all splits associated with a record when a node is modified/deleted. It will also help distinguish between the original record and its splits if you are building the search UI using JS. The ‘parent_record’ field should be configured as a filter in the Algolia dashboard for this to work.
- Works with multilingual content.
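To illustrate, here's roughly how the records might look (hypothetical values; the exact objectID format may differ from what the patch generates):
```php
// Original record: keeps the first split value and marks itself as the
// parent via the special 'self' value.
$original = [
  'objectID' => 'entity:node/118:en',
  'parent_record' => 'self',
  'body' => 'First chunk of the long body text…',
];

// Split record: holds a subsequent chunk and points back to its parent
// using the 'node_id:language_code' pattern.
$split = [
  'objectID' => 'entity:node/118:en-split-1',
  'parent_record' => '118:en',
  'body' => 'Second chunk of the long body text…',
];
```
Filtering on parent_record then lets you fetch or delete every split belonging to a node in one query.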
- 🇺🇸United States maskedjellybean Portland, OR
Thank you for carrying this forward! Sadly I no longer have an Algolia project to work with so I can't test the new patch out.
- Status changed to Postponed: needs info
8:42am 21 July 2024
- 🇮🇳India nikunjkotecha India, Gujarat, Rajkot
This is good. I am not convinced, though, that we should index huge objects in Algolia. Can we have some real use case to help understand the need for this?
- 🇺🇸United States maskedjellybean Portland, OR
The use case is if you want to index more than 10000 characters in one record. :-)
Algolia offers the ability to split records in order to get around their character count limitation, so it would be great if search_api_algolia leveraged this ability.
Potentially, site builders/developers may not realize their records are being truncated. When a record is truncated, search does not cover the entire record because only part of it is indexed. This means worse search results without any indication of why.
- 🇺🇸United States kevinb623
This patch is working wonderfully to properly index and discover lengthy pages on a content-rich website we manage.
My only suggestion is to update ItemSplitter.php line 68 to use isset() to reduce PHP warnings related to unknown and null array keys.
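Something along these lines, purely illustrative since the variable names around line 68 will differ in the actual patch:
```php
// Before: reading the array key directly triggers "Undefined array key"
// (or "Trying to access array offset on value of type null") warnings.
$splits = $existing_splits[$item_id];

// After: guard the lookup with isset() and fall back to an empty array.
$splits = isset($existing_splits[$item_id]) ? $existing_splits[$item_id] : [];
```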
Very nice work!
- 🇬🇧United Kingdom reecemarsland
Our use case is indexing PDF files attached to content and we need the PDF content to be searchable.
- 🇧🇪Belgium Den Tweed
Same as in #17, our use case is making attached documents searchable.
I've worked further on patch #11 and changed the following:
- Fixed the warning for getSplitsForItem(); the whole method could be reduced to a simple ?? statement (see the sketch after this list)
- Removed the getDataTypeHelper() and setDataTypeHelper() overrides as they aren't changed from the parent class
- Moved the code from processFieldValue() to process() and removed the string type check. As far as I understand, the 'String' data type is for shorter field values (e.g. title, URL, etc.) and should already be shorter than the limit in most cases. It's 'Text' (aka Fulltext) that we need the most here imo, but in general anything that is considered string content. This is already covered by the shouldProcess() method (which has an is_string() check), which is the condition for calling process() (which in turn calls processFieldValue()). As process() is an empty method, there's no need to override the processFieldValue() code.
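For reference, the reduced method looks roughly like this (a sketch; the name of the property holding the splits is an assumption on my part):
```php
/**
 * Returns the splits recorded for an item, or an empty array if none.
 */
protected function getSplitsForItem(string $item_id): array {
  // The null coalescing operator replaces the isset() check and avoids
  // the "undefined array key" warning in one expression.
  return $this->splits[$item_id] ?? [];
}
```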
- Status changed to Needs review
10:57am 20 November 2024
- 🇧🇪Belgium dieterholvoet Brussels
I am not convinced, though, that we should index huge objects in Algolia. Can we have some real use case to help understand the need for this?
We hit this limit regularly on projects, e.g. when indexing long text fields or paragraphs for search. This is a very valid use case.
- 🇧🇪Belgium dieterholvoet Brussels
I started a MR based on the latest patch. I'm sometimes still getting the following error, even with the patch applied:
Record at the position 46 objectID=entity:node/118:bg-split-processed_2-1 is too big size=15808/10000 bytes. Please have a look at https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-you...
I'll do some debugging.
- 🇧🇪Belgium dieterholvoet Brussels
I can't figure out the problem. That project might have been using an outdated patch, I updated it and I'll wait and see if the issue happens again.
- 🇧🇪Belgium dieterholvoet Brussels
The existing splitter doesn't work consistently for me. Splitting all enabled fields at a fixed number of characters works quite well if you only have one very big body field. If you have multiple fields with a lot of content, splitting at a fixed number of characters still risks creating records that are too big, unless you set the character count to a low value.
That's why I decided to rewrite everything and come up with a smarter splitter. Instead of splitting all enabled fields at a fixed number of characters, my splitter fills up records to the limit dictated by Algolia (usually 10K bytes) before it starts splitting text into multiple records. This makes it practically impossible to create records that are too big, and it's a lot more efficient: it only creates as many records as necessary (see the sketch below).
Most of the logic was moved from the field processor to the search backend code, right before the record is sent to Algolia, in order to be able to calculate the record sizes as efficiently as possible with all base fields included.
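Roughly, the idea looks like this (a simplified sketch, not the actual MR code; the real implementation works on the full record right before it is handed to the Algolia client):
```php
/**
 * Splits a long text value into chunks so that each resulting record,
 * together with its base fields, stays under the Algolia size limit.
 */
function split_text_for_record(array $base_record, string $text, int $limit_bytes = 10000): array {
  // Bytes already taken up by the base record (all non-split attributes).
  $base_size = strlen(json_encode($base_record));
  $available = max(1, $limit_bytes - $base_size);

  $chunks = [];
  $current = '';
  // Split on spaces so words are never broken in the middle.
  foreach (explode(' ', $text) as $word) {
    $candidate = $current === '' ? $word : $current . ' ' . $word;
    if (strlen($candidate) > $available && $current !== '') {
      $chunks[] = $current;
      $current = $word;
    }
    else {
      $current = $candidate;
    }
  }
  if ($current !== '') {
    $chunks[] = $current;
  }
  return $chunks;
}
```
Each chunk then becomes its own record with the base fields copied onto it, so no single record can exceed the limit (unless one word plus the base fields already does).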
- 🇧🇪Belgium dieterholvoet Brussels
I also improved the documentation of the processor, warning users to set things up correctly on the Algolia side. I also changed it so that the truncate option is automatically disabled for an index when splitting is enabled.
- 🇧🇪Belgium dieterholvoet Brussels
I cleaned up the issue description. About the known issues previously listed:
When the index is cleared by clicking "Clear all indexed data" at /admin/config/search/search-api/index/index_name, the splits are not removed from the Algolia index. This is because the clear process works by looping through each search_api Item and sending it to Algolia's deleteObjects() method which takes an array of itemIDs/objectIDs. Since search_api doesn't know about SplitItems, these are excluded.
This is not true. When clicking that button, deleteAllIndexItems() is triggered, which clears the whole index instead of specific objects. I'll remove this from the known issues.
SplitItems bypass some of the multilingual code in \Drupal\search_api_algolia\Plugin\search_api\backend\SearchApiAlgoliaBackend::indexItems that Items go through.
This is not the case in my implementation. Removing from known issues.
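To make the clearing behaviour above concrete, here's the difference when using the Algolia PHP client directly (a sketch; the backend wraps these calls, and the objectID shown is just an example):
```php
// Deleting specific objects only removes the listed objectIDs, so split
// records with their own objectIDs would be left behind.
$index->deleteObjects(['entity:node/118:en']);

// "Clear all indexed data" clears the whole index instead, splits
// included, which is what deleteAllIndexItems() relies on.
$index->clearObjects();
```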
The number of indexed items that search_api reports at /admin/config/search/search-api/index/index_name will not reflect the number of indexed items in Algolia if a split has been created. Because search_api probably does not support the concept of more items being created during the index process, perhaps this works as expected?
I would also say this works as expected. The split items are an implementation detail of this specific backend and don't need to be listed in the UI. When you display the index on dashboard.algolia.com, split items are also not listed or counted separately. Removing from known issues.
The batch process will index more than expected/intended if a split has been created. For example, each batch may be 50 items, but with the addition of splits, many more may be sent to Algolia during that batch. Unclear if this could cause the batch to timeout. The documentation for saveObjects() says "To ensure good performance, saveObjects automatically splits your records into batches of 1,000 objects" so I think it's unlikely to cause an issue on the Algolia side.
This is not the case in my implementation since splits are indexed together with regular objects. Removing from known issues.
All other fields/attributes are included on the split item/object in Algolia. In some ways this is nice because you can use any attribute as the attributeForDistinct, and it will work. However, it would be smarter to only include the split field/attribute and the attribute that is set as attributeForDistinct. In theory these are all that are needed on the split records.
This is not true. Algolia doesn't merge the contents of split items. When searching an Algolia index and multiple split items match the query, the split item that matches best will be returned to the user. This means that all non-split attributes need to be present on all split objects. Removing from known issues.
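If you want only one hit per original item in the search UI, Algolia's distinct feature handles the deduplication; a sketch, assuming an attribute that is shared by the original record and all of its splits (here a hypothetical parent_id):
```php
// With distinct enabled, Algolia returns only the best-matching record
// per parent_id value instead of every split that matched the query.
$index->setSettings([
  'attributeForDistinct' => 'parent_id',
  'distinct' => TRUE,
]);
```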
The code to split the field value was taken directly from Algolia. That said, I don't think it's very smart. For example I don't think it knows not to trim/split in the middle of a word.
This is not true. It splits on spaces, so it shouldn't break words. The current code looks plenty smart to me. Removing from known issues.
- 🇧🇪Belgium dieterholvoet Brussels
dieterholvoet → changed the visibility of the branch 3256840-item-splitter-processor to hidden.