- Issue created by @seogow
- 🇬🇧United Kingdom scott_euser
This is great, nice to be able to handle giant single nodes, e.g. guides or books perhaps. I added some comments to the MR, but the only general thing is that I think we probably need to make this opt-in, given that some less configurable hosts widely used in the Drupal community, like Pantheon, have very infrequent cron runs.
- Assigned to seogow
- 🇬🇧United Kingdom scott_euser
I gave this a try. Oof, it adds a lot of scenarios and complexity, but I get why it's needed if you have really long content (e.g. a full book, or indexing big attachments).
Some findings:
- The index status shows as completely indexed even if not all chunks are completed
- The status message in the progress bar when clicking 'Index now' in the UI always shows finished processing all chunks for item X, even at earlier percentages when it's not finished
- If you resave an item before all queued items are processed by cron, orphaned items are left in the queue. The easiest way to mimic this without cron is clicking 'Index now', abandoning it partway through, then making changes to the content item being indexed so it gets requeued.
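To make the orphaned-items case concrete, here is a minimal, language-neutral sketch in Python (not Drupal or Search API code). It assumes each queued chunk job carries a hypothetical `generation` counter that is bumped whenever the item is re-queued, so a worker can discard jobs that belong to an older state of the item. The names `ChunkJob`, `ChunkQueue`, `enqueue_item` and `process` are made up for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkJob:
    item_id: str
    generation: int   # hypothetical counter, bumped on every re-queue
    chunk_index: int


class ChunkQueue:
    """Toy queue that can recognise chunk jobs left over from an old save."""

    def __init__(self):
        self.jobs = deque()
        self.current_generation = {}

    def enqueue_item(self, item_id, chunk_count):
        # Re-queuing an item starts a new generation; older jobs become stale.
        generation = self.current_generation.get(item_id, 0) + 1
        self.current_generation[item_id] = generation
        for index in range(chunk_count):
            self.jobs.append(ChunkJob(item_id, generation, index))

    def process(self, limit):
        """Process up to `limit` jobs; return (embedded, discarded) lists."""
        embedded, discarded = [], []
        for _ in range(min(limit, len(self.jobs))):
            job = self.jobs.popleft()
            if job.generation != self.current_generation[job.item_id]:
                discarded.append(job)   # orphan from an earlier state
            else:
                embedded.append(job)    # a real worker would embed here
        return embedded, discarded


queue = ChunkQueue()
queue.enqueue_item("node/1", chunk_count=4)
queue.process(limit=2)                       # 'Index now' abandoned partway
queue.enqueue_item("node/1", chunk_count=3)  # content edited and re-saved
embedded, discarded = queue.process(limit=10)
```

In this run the two leftover jobs from the first save are discarded rather than embedded, and only the three chunks of the current state are processed.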
- 🇬🇧United Kingdom scott_euser
I think we need some sort of warning right below the 'fully indexed' status if there are unprocessed chunks, and some sort of button to allow the user to finish processing the chunks (and of course clear out old orphaned chunks if they relate to an old state of a content item which has since been re-queued).
E.g. maybe here in the UI:
- 🇬🇧United Kingdom seogow
@scott_euser - you are 100% correct. The problem we have now is that it works for most scenarios (under 10-12 chunks), but fails completely for long nodes (and for slow, local embedding LLMs).
The solution I coded is imperfect too: it bypasses batch-chunking of a long Entity on CRUD when immediate indexing is allowed. That not only potentially creates orphaned chunks, but actually fails to chunk and index long Entities.
I suggest we implement a robust solution:
- Currently, an Entity is marked as indexed when its child chunks are added to the queue. It should be the opposite: it should be marked as indexed only when all the child chunks are processed.
- For the AI Search index, we replace the Index status page with a page reflecting chunking, not Entities. The user will then see unprocessed Entities and, if any, an unfinished Entity, including the number of unembedded chunks.
- We set up robust handling of indexing for Entity CRUD: on failure we show a message offering to index manually on the index page or to add the Entity to a cron queue. If the user doesn't react, it is added to the queue anyway (that way the re-indexing request is actually recorded and will be processed eventually).
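The first two points can be illustrated with a small, language-neutral Python sketch (not Drupal or Search API code). It assumes a hypothetical per-Entity record of how many chunks exist and which have been embedded, so an Entity only counts as indexed once nothing is pending, and a status summary can report unembedded chunk counts. `EntityChunks`, `ChunkTracker` and their methods are illustrative names only.

```python
from dataclasses import dataclass, field


@dataclass
class EntityChunks:
    total: int
    embedded: set = field(default_factory=set)

    @property
    def pending(self):
        return self.total - len(self.embedded)

    @property
    def indexed(self):
        # Indexed only when every child chunk has been processed.
        return self.pending == 0


class ChunkTracker:
    """Toy stand-in for a chunk-level tracking table."""

    def __init__(self):
        self.entities = {}

    def queue_entity(self, entity_id, chunk_count):
        # Queuing resets progress; the Entity is not marked indexed here.
        self.entities[entity_id] = EntityChunks(total=chunk_count)

    def mark_embedded(self, entity_id, chunk_index):
        self.entities[entity_id].embedded.add(chunk_index)

    def status(self):
        indexed = sorted(e for e, c in self.entities.items() if c.indexed)
        unfinished = {
            e: c.pending for e, c in self.entities.items() if not c.indexed
        }
        return {"indexed": indexed, "unfinished": unfinished}


tracker = ChunkTracker()
tracker.queue_entity("node/1", chunk_count=2)
tracker.queue_entity("node/2", chunk_count=12)
for i in range(2):
    tracker.mark_embedded("node/1", i)
for i in range(5):
    tracker.mark_embedded("node/2", i)
summary = tracker.status()
```

Here `summary` lists `node/1` as indexed and reports `node/2` as unfinished with 7 chunks still to embed, which is the kind of information a chunk-aware status page could show.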
- 🇬🇧United Kingdom scott_euser
1. Currently, an Entity is marked as indexed when its child chunks are added to the queue. It should be the opposite: it should be marked as indexed only when all the child chunks are processed.
That might be quite difficult to change, as I expect it is heavily controlled by the Search API module, but I have not looked. It sounds ideal though.
2. For the AI Search index, we replace the Index status page with a page reflecting chunking, not Entities. The user will then see unprocessed Entities and, if any, an unfinished Entity, including the number of unembedded chunks.
Maybe as well as, rather than instead of? It's still useful for a user to see how many of their content items in total are done (and that may be what they care about more). We already have the `ai_search_preprocess_search_api_index()` function, which changes `$variables`, so we could add it after `$variables['index_progress']` perhaps?
3. We set up robust handling of indexing for Entity CRUD: on failure we show a message offering to index manually on the index page or to add the Entity to a cron queue. If the user doesn't react, it is added to the queue anyway (that way the re-indexing request is actually recorded and will be processed eventually).
Note that this is actually facilitated by Search API. There is the `Drupal\search_api\Plugin\search_api\tracker\Basic` plugin, which is the default tracker. I haven't properly explored how complicated it would be to leverage that or whether it is the best route, though, so take this with a grain of salt.
- 🇬🇧United Kingdom seogow
I have opened a new MR for an improved version of the chunking. This one incorporates a new table for chunks, allowing control over what has been indexed and when.
- 🇩🇪Germany marcus_johansson
It took some time to understand everything and I've added some comments. I don't think any single comment is a must-change, so feel free to change things only if you like.
I will test it later today.
If @scott_euser has time to check, it would be great, but from my point of view it can be merged as is or with minor modifications if wanted, as soon as it's tested and working.