Improve AI Search Module Indexing to Handle Long-Running Chunk Embedding Processes

Created on 14 November 2024

Problem/Motivation

When using the AI Search module, which leverages the Search API for indexing content, we encounter a significant issue with large items that require extensive processing. The module needs to chunk content and generate embeddings for each chunk. However, this process can be time-consuming, especially for large items that result in hundreds of chunks. As a result, the indexing process can exceed the max_execution_time limit set in PHP, causing timeouts during Cron runs or user-initiated batch processes.

Proposed resolution

  1. Utilize the Drupal Batch API with Sandbox Mechanism:
    • Incremental Processing: Break down the indexing of large items into smaller, manageable chunks that can be processed incrementally.
    • Sandbox State Preservation: Use the $context['sandbox'] variable to maintain state between successive batch operations, allowing the process to continue across multiple HTTP requests without exceeding execution time limits (a minimal sketch follows this list).
  2. Implement Idempotent Processing:
    • Chunk Tracking: Create a tracking system (e.g., a custom database table) to record which chunks have been processed. This ensures that if the process is interrupted, it can resume without duplicating work.
    • Duplicate Prevention: Before processing a chunk, check if it has already been processed to avoid redundancy.
  3. Provide User Feedback:
    • Progress Indicators: Update batch messages to inform administrators about the current progress of the indexing operation.
  4. Adjust Return Values Appropriately:
    • Accurate Reporting: Return the IDs of items being processed to indicate to the Search API that indexing is underway, while managing actual completion status internally.
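
A minimal sketch of points 1 and 2, assuming a hypothetical batch callback, a placeholder 'ai_search_chunk_tracker' tracking table, and a placeholder 'ai.embedding_generator' service (these names are illustrative, not existing module APIs); the number of chunks processed per pass is likewise illustrative:

<?php

/**
 * Batch operation callback: embed one item's chunks incrementally.
 *
 * Illustrative only: the 'ai_search_chunk_tracker' table and the
 * 'ai.embedding_generator' service are placeholder names.
 */
function mymodule_embed_item_chunks(string $item_id, array $chunks, array &$context): void {
  // Initialise sandbox state on the first pass.
  if (!isset($context['sandbox']['pointer'])) {
    $context['sandbox']['pointer'] = 0;
    $context['sandbox']['total'] = count($chunks);
  }

  $connection = \Drupal::database();
  $per_pass = 10;
  $end = min($context['sandbox']['pointer'] + $per_pass, $context['sandbox']['total']);

  for ($i = $context['sandbox']['pointer']; $i < $end; $i++) {
    // Idempotency check: skip chunks already recorded as processed.
    $done = $connection->select('ai_search_chunk_tracker', 't')
      ->fields('t', ['chunk_index'])
      ->condition('item_id', $item_id)
      ->condition('chunk_index', $i)
      ->execute()
      ->fetchField();
    if ($done !== FALSE) {
      continue;
    }

    // Generate and store the embedding for this chunk (placeholder service).
    \Drupal::service('ai.embedding_generator')->embed($chunks[$i]);

    // Record the chunk so an interrupted run can resume without rework.
    $connection->insert('ai_search_chunk_tracker')
      ->fields(['item_id' => $item_id, 'chunk_index' => $i])
      ->execute();
  }

  $context['sandbox']['pointer'] = $end;
  $context['message'] = t('Embedded @done of @total chunks for @item', [
    '@done' => $end,
    '@total' => $context['sandbox']['total'],
    '@item' => $item_id,
  ]);

  // A value below 1 tells the Batch API to call this operation again.
  $context['finished'] = $context['sandbox']['total'] > 0
    ? $context['sandbox']['pointer'] / $context['sandbox']['total']
    : 1;
}

Because $context['finished'] stays below 1 until every chunk is done, the Batch API keeps re-invoking the callback in fresh requests, so each pass stays well under max_execution_time; the running $context['message'] also covers the progress feedback in point 3.
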
📌 Task
Status

Needs work

Version

1.0

Component

AI Search

Created by

🇬🇧United Kingdom seogow


Merge Requests

Comments & Activities

  • Issue created by @seogow
  • 🇬🇧United Kingdom seogow
  • Merge request !283: Split Queue item to manageable batches → (Open) created by seogow
  • 🇬🇧United Kingdom scott_euser

    This is great, nice to be able to handle giant single nodes, e.g. guides or books perhaps. Added some comments to the MR, but the only general thing is I think we probably need to make this opt-in, given there are some less configurable hosts well used in the Drupal community, like Pantheon, that have very infrequent cron runs.

  • Pipeline finished with Failed
    8 days ago
    Total: 331s
    #402699
  • Pipeline finished with Success
    8 days ago
    Total: 383s
    #402697
  • Pipeline finished with Failed
    8 days ago
    Total: 384s
    #402698