- Issue created by @seogow
- 🇬🇧United Kingdom scott_euser
This is great, nice to be able to handle giant single nodes, e.g. guides or books perhaps. Added some comments to the MR, but the one general thing is that I think we probably need to make this opt-in, given there are some less configurable hosts widely used in the Drupal community, like Pantheon, that run cron very infrequently.
- Assigned to seogow
- 🇬🇧United Kingdom scott_euser
I gave this a try - oof, it adds a lot of scenarios and complexity - but I get why it's needed if you have really long content (e.g. a full book, or indexing big attachments).
Some findings:
- Index status shows completely indexed even if not all chunks are completed.
- The status message in the progress bar when clicking 'Index now' in the UI always shows 'finished processing all chunks for item X', even at earlier percentages when it's not finished.
- If you re-save an item before all queued items are processed by cron, orphaned items are left in the queue. The easiest way to mimic this without cron is clicking 'Index now', abandoning it partway through, then making changes to the content item being indexed so it gets re-queued.
- 🇬🇧United Kingdom scott_euser
I think we need some sort of warning if there are unprocessed chunks, right below the 'fully indexed' status, and some sort of button to allow the user to finish processing the chunks (and of course clear out old orphaned chunks if they relate to an old state of a content item which has since been re-queued).
E.g. maybe here in the UI:
- 🇬🇧United Kingdom seogow
@scott_euser - you are 100% correct. The problem we have now is that it works for most scenarios (under 10-12 chunks), but fails completely for long nodes (and slow, local embedding LLMs).
The solution I coded is imperfect as well - it bypasses batch-chunking of a long entity on CRUD when immediate indexing is allowed. That potentially not only creates orphaned chunks, but can actually fail to chunk and index long entities.
I suggest we implement a robust solution:
- Now, an entity is marked as indexed when its child chunks are added to the queue. It should be the opposite - it should be marked as indexed only when all the child chunks are processed.
- For the AI Search index, we replace the Index status page with a page reflecting chunking, not entities. The user will then see unprocessed entities and - if any - an unfinished entity, including the number of unembedded chunks.
- We set up robust handling of indexing for entity CRUD, where we show a message on failure offering to index manually on the index page or to add the item to a cron queue. If the user doesn't react, it will be added to the queue anyway (that way the re-indexing request is actually recorded and will eventually be processed).
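For the CRUD fallback in the last point, a rough sketch of the idea (the indexer service and queue name here are made up for illustration; only the core queue and messenger APIs are real):

```php
<?php

/**
 * Sketch only: hypothetical entity-update handling for AI Search.
 * 'mymodule.indexer' and 'mymodule_reindex' are illustrative names,
 * not the module's actual services.
 */
function mymodule_entity_update(\Drupal\Core\Entity\EntityInterface $entity) {
  try {
    // Attempt immediate chunking and indexing.
    \Drupal::service('mymodule.indexer')->indexNow($entity);
  }
  catch (\Exception $e) {
    // On failure, record the request in a cron queue so it is never lost.
    \Drupal::queue('mymodule_reindex')->createItem([
      'entity_type' => $entity->getEntityTypeId(),
      'entity_id' => $entity->id(),
    ]);
    \Drupal::messenger()->addWarning(t('Indexing was deferred; the item has been added to the cron queue.'));
  }
}
```

The key design point is that the queue item is created before the user is asked anything, so the re-indexing request is recorded even if they never react.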
- 🇬🇧United Kingdom scott_euser
1. Now, an entity is marked as indexed when its child chunks are added to the queue. It should be the opposite - it should be marked as indexed only when all the child chunks are processed.
That might be quite difficult to change, as I expect that is heavily controlled by the Search API module, but I have not looked. It sounds ideal though.
2. For the AI Search index, we replace the Index status page with a page reflecting chunking, not entities. The user will then see unprocessed entities and - if any - an unfinished entity, including the number of unembedded chunks.
Maybe as well as, rather than instead of? It's still useful for a user to see how many of their content items in total are done (and maybe that is what they care about more). We already have the `ai_search_preprocess_search_api_index()` function, which is changing `$variables`, so we could add the chunk progress after `$variables['index_progress']` perhaps?
3. We set up robust handling of indexing for entity CRUD, where we show a message on failure offering to index manually on the index page or to add the item to a cron queue. If the user doesn't react, it will be added to the queue anyway (that way the re-indexing request is actually recorded and will eventually be processed).
Note that this is actually facilitated by Search API. There is the `Drupal\search_api\Plugin\search_api\tracker\Basic` plugin, which is the default tracker. I haven't properly explored how complicated it would be to leverage that, or whether it is the best route - so take this with a grain of salt.
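On point 2, a rough sketch of what extending that preprocess could look like (the `chunk_progress` key and the hard-coded counts are placeholders - the real numbers would come from whatever chunk state the MR ends up tracking):

```php
<?php

/**
 * Sketch only: extending ai_search_preprocess_search_api_index() to
 * expose chunk progress alongside $variables['index_progress'].
 * The 'chunk_progress' key and the count source are hypothetical.
 */
function ai_search_preprocess_search_api_index(array &$variables) {
  // ... existing logic that populates $variables['index_progress'] ...

  // Hypothetical: pull counts from wherever chunk state is tracked.
  $total = 120;   // Total chunks recorded for this index.
  $embedded = 97; // Chunks already embedded.
  $variables['chunk_progress'] = t('@done of @total chunks embedded', [
    '@done' => $embedded,
    '@total' => $total,
  ]);
}
```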
- 🇬🇧United Kingdom seogow
I have opened a new MR for an improved version of the chunking. This one incorporates a new table for chunks, allowing control over what has been indexed and when.
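For readers following along, a chunk-tracking table like that could be declared roughly as follows (table and column names here are illustrative guesses, not the actual schema from the MR):

```php
<?php

/**
 * Implements hook_schema().
 *
 * Sketch only: an illustrative chunk-tracking table. Names and fields
 * are guesses, not the MR's actual schema.
 */
function ai_search_schema() {
  return [
    'ai_search_chunks' => [
      'description' => 'Tracks embedding status per chunk of an indexed item.',
      'fields' => [
        'chunk_id' => ['type' => 'serial', 'not null' => TRUE],
        'index_id' => ['type' => 'varchar', 'length' => 128, 'not null' => TRUE],
        'item_id' => ['type' => 'varchar', 'length' => 255, 'not null' => TRUE],
        // 0 = pending, 1 = embedded.
        'embedded' => ['type' => 'int', 'size' => 'tiny', 'not null' => TRUE, 'default' => 0],
        // Timestamp of the last change, for detecting stale/orphaned chunks.
        'changed' => ['type' => 'int', 'not null' => TRUE, 'default' => 0],
      ],
      'primary key' => ['chunk_id'],
      'indexes' => [
        'item' => ['index_id', 'item_id'],
      ],
    ],
  ];
}
```

With a `changed` column, chunks belonging to an old revision of a re-queued item can be identified and cleaned up, addressing the orphaned-chunk scenario above.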
- 🇩🇪Germany marcus_johansson
It took some time to understand everything and I've added some comments. I don't think any single comment is a must-change, so feel free to change things if you like.
I will test it later today.
If @scott_euser has time to check, that would be great, but from my point of view it can be merged as is, or with minor modifications if wanted, as soon as it's tested and working.
- 🇬🇧United Kingdom scott_euser
scott_euser → changed the visibility of the branch 3487487-improve-ai-search-datasource to hidden.
- Status changed to Needs review
5:31am 25 April 2025 - 🇬🇧United Kingdom scott_euser
scott_euser → changed the visibility of the branch 1.0.x to hidden.
- 🇬🇧United Kingdom scott_euser
Thanks! This is working well - I was able to complete the task without issue and to index huge content items; I used extremely long Wikipedia articles on old HMS/RMS ships. So I think the fundamentals are good and we are getting into the finer details now. Nice work!
- 🇪🇸Spain gxleano Cáceres
I am testing the patch in AI 1.1 and it is working as expected. Using the following embedding strategy:
- Max chunk size: 500 tokens
- Minimum chunk overlap: 100 tokens
- Contextual content percentage: 25%
It has reduced the indexing time a lot.
What else would we need to move it to RTBC?
- 🇪🇸Spain gxleano Cáceres
After testing the changes, I’ve identified two important points:
- Indexing now takes approximately twice as long compared to the previous version.
- The number of items indexed depends on the value set in the "batch" option. For example, if it's set to 5, indexing stops after 5 items. In my opinion, this is not an optimal solution - indexing should work as usual by default, while batching should be handled in the background without needing to re-run it after each batch finishes.
See evidence:
I find the behaviour of the processed-items progress bar quite misleading. In the default version, it's clear when the index begins processing content, what exactly it's processing, and when it finishes. However, in the current version, the process is more complicated and it's harder to understand what's actually happening.
- 🇬🇧United Kingdom seogow
@gxleano the behaviour you describe is not the expected behaviour and has to be a regression bug - are you happy to take over, or do you want me to go and fix it? It must/did work as expected - a 'batch' is the number of entities processed during one PHP call (i.e. subject to max_execution_time). The only difference is that now a new DB table records chunks (a fast operation) and subsequently a new Drupal batch (a sub-batch) is created for the embedding API calls (slow operations). That way it doesn't matter how many chunks an entity is split into (i.e. how many API calls are required to process it), because each call is an individual PHP run.
As you can see, the behaviour of batch progress in the GUI should not change (except that you see the sub-batch while an entity is processed).
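To illustrate the intended shape (function and service names below are made up; only `batch_set()` and the batch operation structure are core APIs): the outer batch records chunks quickly, then each embedding call runs as its own batch operation, so every slow API request gets a fresh PHP execution window.

```php
<?php

/**
 * Sketch only: one batch operation per embedding API call, so each
 * slow request runs in its own PHP execution. Names are hypothetical.
 */
function mymodule_build_embedding_batch(array $chunk_ids): void {
  $operations = [];
  foreach ($chunk_ids as $chunk_id) {
    // One operation per chunk: each runs in a separate PHP request,
    // so max_execution_time applies per API call, not per entity.
    $operations[] = ['mymodule_embed_chunk', [$chunk_id]];
  }
  batch_set([
    'title' => t('Embedding chunks'),
    'operations' => $operations,
    'finished' => 'mymodule_embed_finished',
  ]);
}

function mymodule_embed_chunk(int $chunk_id, array &$context): void {
  // Slow operation: a single embedding API call for one chunk.
  \Drupal::service('mymodule.embedder')->embed($chunk_id);
  $context['message'] = t('Embedded chunk @id', ['@id' => $chunk_id]);
}
```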
My time is severely limited right now, so I'm happy to hand over.