Improve AI Search Module Indexing to Handle Long-Running Chunk Embedding Processes

Created on 14 November 2024

Problem/Motivation

The AI Search module indexes content through the Search API. For each item, it must chunk the content and generate an embedding for every chunk. For large items that produce hundreds of chunks, this can take long enough to exceed PHP's max_execution_time limit, causing timeouts during cron runs or user-initiated batch processes.

Proposed resolution

  1. Utilize the Drupal Batch API with Sandbox Mechanism:
    • Incremental Processing: Split the indexing of a large item into smaller units of work (for example, a limited number of chunks per batch pass) that are processed incrementally.
    • Sandbox State Preservation: Use the $context['sandbox'] variable to maintain the state between successive batch operations, allowing the process to continue across multiple HTTP requests without exceeding execution time limits.
  2. Implement Idempotent Processing:
    • Chunk Tracking: Create a tracking system (e.g., a custom database table) to record which chunks have been processed. This ensures that if the process is interrupted, it can resume without duplicating work.
    • Duplicate Prevention: Before processing a chunk, check if it has already been processed to avoid redundancy.
  3. Provide User Feedback:
    • Progress Indicators: Update batch messages to inform administrators about the current progress of the indexing operation.
  4. Adjust Return Values Appropriately:
    • Accurate Reporting: Return the IDs of items being processed to indicate to the Search API that indexing is underway, while managing actual completion status internally.
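
Points 1 to 3 above can be sketched roughly as follows. This is a language-neutral illustration written in Python, not Drupal code: the `context` dictionary only imitates the shape of the Batch API's `$context` (with `sandbox`, `finished` and `message` keys), the `processed` dictionary stands in for the proposed tracking table, and `process_item_pass`, `run_batch`, `embed` and `chunks_per_pass` are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the tracking table: (item ID, chunk index) -> embedding.
ProcessedStore = Dict[Tuple[str, int], List[float]]


def process_item_pass(
    item_id: str,
    chunks: List[str],
    context: dict,
    processed: ProcessedStore,
    embed: Callable[[str], List[float]],
    chunks_per_pass: int = 2,
) -> None:
    """Run one batch pass: embed at most `chunks_per_pass` chunks, then return."""
    sandbox = context.setdefault("sandbox", {})
    if "position" not in sandbox:
        # First pass for this item: initialise the state kept between passes.
        sandbox["position"] = 0
        sandbox["total"] = len(chunks)

    start = sandbox["position"]
    end = min(start + chunks_per_pass, sandbox["total"])
    for index in range(start, end):
        key = (item_id, index)
        if key in processed:
            # Idempotency check: this chunk was embedded before an interruption.
            continue
        processed[key] = embed(chunks[index])

    sandbox["position"] = end
    total = sandbox["total"]
    # A value below 1.0 asks the batch runner to call this operation again.
    context["finished"] = 1.0 if total == 0 else end / total
    context["message"] = f"Embedded {end} of {total} chunks for item {item_id}"


def run_batch(
    item_id: str,
    chunks: List[str],
    processed: ProcessedStore,
    embed: Callable[[str], List[float]],
    chunks_per_pass: int = 2,
) -> int:
    """Imitate a batch runner: repeat passes until the operation reports completion."""
    context = {"sandbox": {}, "finished": 0.0}
    passes = 0
    while context["finished"] < 1.0:
        process_item_pass(item_id, chunks, context, processed, embed, chunks_per_pass)
        passes += 1
    return passes
```

Because each pass handles only a bounded number of chunks, no single request has to embed the whole item. If a run is interrupted and restarted from the beginning, chunks already recorded in `processed` are skipped instead of being embedded again. How the completion status in point 4 is reported back to the Search API is left out of this sketch.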

📌 Task

Status

Needs work

Version

1.0

Component

AI Search

Created by

🇬🇧 United Kingdom seogow
