Improve AI Search Module Indexing to Handle Long-Running Chunk Embedding Processes

Created on 14 November 2024

Problem/Motivation

When using the AI Search module, which leverages the Search API for indexing content, we encounter a significant issue with large items that require extensive processing. The module needs to chunk content and generate embeddings for each chunk. However, this process can be time-consuming, especially for large items that result in hundreds of chunks. As a result, the indexing process can exceed the max_execution_time limit set in PHP, causing timeouts during Cron runs or user-initiated batch processes.

Proposed resolution

  1. Utilize the Drupal Batch API with Sandbox Mechanism:
    • Incremental Processing: Break down the indexing of large items into smaller, manageable chunks that can be processed incrementally.
    • Sandbox State Preservation: Use the $context['sandbox'] variable to maintain the state between successive batch operations, allowing the process to continue across multiple HTTP requests without exceeding execution time limits.
  2. Implement Idempotent Processing:
    • Chunk Tracking: Create a tracking system (e.g., a custom database table) to record which chunks have been processed. This ensures that if the process is interrupted, it can resume without duplicating work.
    • Duplicate Prevention: Before processing a chunk, check if it has already been processed to avoid redundancy.
  3. Provide User Feedback:
    • Progress Indicators: Update batch messages to inform administrators about the current progress of the indexing operation.
  4. Adjust Return Values Appropriately:
    • Accurate Reporting: Return the IDs of items being processed to indicate to the Search API that indexing is underway, while managing actual completion status internally.
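
The sandbox-based approach above could look roughly like the following Drupal batch callback. This is a minimal sketch, not the module's actual code: the table name `ai_search_chunk_status` and the helpers `ai_search_chunk_item()` / `ai_search_embed_chunk()` are hypothetical stand-ins.

```php
<?php

/**
 * Batch operation: embed one chunk of a large item per invocation.
 *
 * Sketch only. The 'ai_search_chunk_status' table and the
 * ai_search_chunk_item() / ai_search_embed_chunk() helpers are
 * hypothetical, used here to illustrate the sandbox pattern.
 */
function ai_search_index_item_batch(string $item_id, array &$context): void {
  if (!isset($context['sandbox']['chunks'])) {
    // First pass: chunk the item once and record the total workload.
    $context['sandbox']['chunks'] = ai_search_chunk_item($item_id);
    $context['sandbox']['total'] = count($context['sandbox']['chunks']);
    $context['sandbox']['progress'] = 0;
  }

  $database = \Drupal::database();
  $chunk = $context['sandbox']['chunks'][$context['sandbox']['progress']];

  // Idempotency: skip chunks already recorded as embedded, so an
  // interrupted run resumes without duplicating work.
  $already_done = $database->select('ai_search_chunk_status', 'c')
    ->fields('c', ['chunk_id'])
    ->condition('chunk_id', $chunk['id'])
    ->execute()
    ->fetchField();

  if (!$already_done) {
    ai_search_embed_chunk($chunk);
    $database->insert('ai_search_chunk_status')
      ->fields([
        'chunk_id' => $chunk['id'],
        'item_id' => $item_id,
        'created' => \Drupal::time()->getRequestTime(),
      ])
      ->execute();
  }

  $context['sandbox']['progress']++;

  // Progress feedback shown in the batch UI.
  $context['message'] = t('Embedded @done of @total chunks for item @item.', [
    '@done' => $context['sandbox']['progress'],
    '@total' => $context['sandbox']['total'],
    '@item' => $item_id,
  ]);

  // A value below 1 tells the Batch API to invoke this callback again.
  $context['finished'] = $context['sandbox']['progress'] / $context['sandbox']['total'];
}
```

Returning a `finished` value below 1 is what lets the work span multiple requests: the Batch API re-enters the callback with `$context['sandbox']` intact, so each pass stays well under `max_execution_time`.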
📌 Task
Status

Needs work

Version

1.0

Component

AI Search

Created by

🇬🇧United Kingdom seogow


Merge Requests

Comments & Activities

  • Issue created by @seogow
  • 🇬🇧United Kingdom seogow
  • Merge request !283 "Split Queue item to manageable batches" → (Closed) created by seogow
  • 🇬🇧United Kingdom scott_euser

    This is great, nice to be able to handle giant single nodes, e.g. guides or books perhaps. I added some comments to the MR, but the one general thing is that I think we probably need to make this opt-in, given there are some less configurable hosts widely used in the Drupal community, like Pantheon, that have very infrequent cron.

  • Pipeline finished with Failed
    2 months ago
    Total: 331s
    #402699
  • Pipeline finished with Success
    2 months ago
    Total: 383s
    #402697
  • Pipeline finished with Failed
    2 months ago
    Total: 384s
    #402698
  • Assigned to seogow
  • 🇬🇧United Kingdom scott_euser

    I gave this a try; ooph, it adds a lot of scenarios and complexity, but I get why it's needed if you have really long content (e.g. a full book, or indexing big attachments).

    Some findings:

    1. The index status shows completely indexed even when not all chunks have been completed.
    2. The status message in the progress bar when clicking 'Index now' in the UI always shows 'finished processing all chunks for item X', even at earlier percentages when it isn't finished.
    3. If you re-save an item before all queued items are processed by cron, orphaned items remain in the queue. The easiest way to mimic this without cron is to click 'Index now', abandon it partway through, then change the content item being indexed so it gets re-queued.
  • 🇬🇧United Kingdom scott_euser

    I think we need some sort of warning right below the 'fully indexed' status when there are unprocessed chunks, and some sort of button to let the user finish processing the chunks (and of course clear out old orphaned chunks if they relate to an old state of a content item that has since been re-queued).

    E.g. maybe here in the UI:

  • 🇬🇧United Kingdom seogow

    @scott_euser - you are 100% correct. The problem is that it currently works for most scenarios (under 10-12 chunks) but fails completely for long nodes (and slow, local embedding LLMs).

    The solution I coded is imperfect as well: it bypasses batch-chunking of a long Entity on CRUD when immediate indexing is allowed. That not only potentially creates orphaned chunks, but can fail to chunk and index long Entities at all.

    I suggest we implement a robust solution:

    1. Currently, an Entity is marked as indexed when its child chunks are added to the queue. It should be the opposite: it should be marked as indexed only when all the child chunks have been processed.
    2. For the AI Search index, we replace the Index status page with a page reflecting chunking, not Entities. The user will then see unprocessed Entities and, if any, an unfinished Entity, including the number of unembedded chunks.
    3. We set up robust handling of indexing for Entity CRUD: on failure we show a message offering to index manually on the index page or to add the item to the Cron queue. If the user doesn't react, it gets added to the queue anyway (that way the re-indexing request is actually recorded and will be processed eventually).
  • 🇬🇧United Kingdom scott_euser

    1. Currently, an Entity is marked as indexed when its child chunks are added to the queue. It should be the opposite: it should be marked as indexed only when all the child chunks have been processed.

    That might be quite difficult to change, as I expect that is heavily controlled by the Search API module, but I have not looked. It sounds ideal though.

    2. For the AI Search index, we replace the Index status page with a page reflecting chunking, not Entities. The user will then see unprocessed Entities and, if any, an unfinished Entity, including the number of unembedded chunks.

    Maybe as well as, rather than instead of? It's still useful for a user to see how many of their content items in total are done (and maybe that's what they care about more). We already have the `ai_search_preprocess_search_api_index()` function, which changes $variables, so we could add after $variables['index_progress'] perhaps?
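    A sketch of that idea, assuming a hypothetical `ai_search_chunks` tracking table and an illustrative `chunk_progress` variable name (only the preprocess function and `$variables['index_progress']` come from the existing module):

```php
<?php

/**
 * Implements hook_preprocess_search_api_index() (sketch).
 *
 * The 'ai_search_chunks' table and the 'chunk_progress' variable are
 * hypothetical; they illustrate surfacing chunk-level progress next to
 * the existing entity-level index_progress.
 */
function ai_search_preprocess_search_api_index(array &$variables): void {
  // ... existing alterations to $variables ...

  $pending = \Drupal::database()->select('ai_search_chunks', 'c')
    ->condition('index_id', $variables['index']->id())
    ->condition('embedded', 0)
    ->countQuery()
    ->execute()
    ->fetchField();

  if ($pending > 0) {
    // Rendered by the template alongside $variables['index_progress'].
    $variables['chunk_progress'] = t('@count chunks awaiting embedding.', [
      '@count' => $pending,
    ]);
  }
}
```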

    3. We set up robust handling of indexing for Entity CRUD: on failure we show a message offering to index manually on the index page or to add the item to the Cron queue. If the user doesn't react, it gets added to the queue anyway (that way the re-indexing request is actually recorded and will be processed eventually).

    Note that this is actually facilitated by Search API: there is the Drupal\search_api\Plugin\search_api\tracker\Basic plugin, which is the default tracker. I haven't properly explored how complicated it would be to leverage that, or whether it is the best route, so take this with a grain of salt.

  • Merge request !525 "Resolve #3487487 'Improve ai search table'" → (Open) created by seogow
  • 🇬🇧United Kingdom seogow

    I have opened a new MR for an improved version of the chunking. This one incorporates a new table for chunks, allowing control over what has been indexed and when.
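    For illustration, a chunk-tracking table along those lines might be declared via hook_schema(); the column names here are guesses, not the MR's actual schema:

```php
<?php

/**
 * Implements hook_schema() (sketch; column names are illustrative).
 */
function ai_search_schema(): array {
  $schema['ai_search_chunks'] = [
    'description' => 'Tracks the embedding state of individual content chunks.',
    'fields' => [
      'chunk_id' => ['type' => 'varchar', 'length' => 255, 'not null' => TRUE],
      'item_id' => ['type' => 'varchar', 'length' => 255, 'not null' => TRUE],
      // 0 = pending, 1 = embedded; lets the indexer resume and lets the
      // UI report unembedded chunks per item.
      'embedded' => ['type' => 'int', 'size' => 'tiny', 'not null' => TRUE, 'default' => 0],
      'changed' => ['type' => 'int', 'not null' => TRUE, 'default' => 0],
    ],
    'primary key' => ['chunk_id'],
    'indexes' => [
      'item' => ['item_id'],
    ],
  ];
  return $schema;
}
```

    A `changed` timestamp like this would also make it possible to detect orphaned chunks left over from an earlier revision of a re-queued item.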

  • 🇩🇪Germany marcus_johansson

    It took some time to understand everything, and I've added some comments. I don't think any single comment is a must-change, so feel free to change things if you like.

    I will test it later today.

    If @scott_euser has time to check, that would be great, but from my point of view it can be merged as is, or with minor modifications if wanted, as soon as it's tested and working.
