Re-indexing on node save adds new chunks instead of replacing existing chunks

Created on 23 January 2025

Problem/Motivation

When a node is re-saved in a Drupal 11 installation with the AI Search index set to “Index items immediately,” additional embedding chunks are appended to the Milvus vector database instead of replacing the existing chunks. As a result, the index grows incorrectly over time, accumulating stale chunks for the same content.

Steps to reproduce

  1. Set up a fresh Drupal 11 environment with the following modules enabled and configured:
    • AI Module
    • Search API Module
    • AI Search Module
    • OpenAI Provider Module
    • Milvus VDB Provider Module
    • Key Module
  2. Create an Article node:
    • Title: “Long article”
    • Body: A long text (~5000 words or more), starting with the word “Test1”
  3. Create a Search API server with:
    • Name: Milvus VDB
    • Backend: AI Search
    • Embeddings Engine: OpenAI Small
    • Tokenizer chat counting model: OpenAI gpt-3.5-turbo
    • Vector Database: Milvus DB
    • Database name: default
    • Collection: test_long
    • Similarity Metric: Cosine
    • Advanced Embeddings Strategy Configuration:
      • Strategy for breaking content into smaller chunks for indexing: Enriched Embedding Strategy
      • Maximum chunk size allowed when breaking up larger content: 1000
      • Minimum chunk overlap for ‘Main Content’: 100
      • Contextual content maximum percentage: 30%
  4. Configure the Search API Index:
    • Index name: long_text_index
    • Datasources: Content → Article only
    • Languages: English only
    • Server: Milvus VDB
    • Default Tracker Settings:
      • Index order: Same
    • Index options:
      • Index items immediately: enabled
      • Track changes in referenced entities: enabled
      • Cron batch size: 5
    • Add fields:
      • Body | Fulltext | Main content
      • Title | String | Contextual content
  5. Index the content:
    • Go to /admin/config/search/search-api/index/long_text_index
    • Click “Index now”
  6. Check the Approx Count in Milvus Attu, database 'default', collection 'test_long':
    • Verify you have ~12 chunks (the actual count may vary by content length).
  7. Edit the same node:
    • Change the first word in the Body field from “Test1” to “Test2”.
    • Save the node.
  8. Check the Approx Count again in Milvus Attu, database 'default', collection 'test_long':
    • Observe that additional chunks (~3) are added instead of the original chunks being replaced.
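As a rough sanity check on the ~12 chunks seen in step 6, the chunk count can be estimated from the settings in step 3. This is only a back-of-envelope sketch: the ~7,000-token body length for a ~5,000-word article and the assumption that the 30% contextual allowance leaves roughly 700 tokens of main content per chunk are guesses, not measured values, and the real Enriched Embedding Strategy splits on content boundaries rather than fixed windows.

```python
import math

def estimate_chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Rough estimate of fixed-size chunks with overlap.

    After the first chunk, each subsequent chunk advances by
    (chunk_size - overlap) tokens. Real chunkers split on sentence
    or paragraph boundaries, so actual counts will vary.
    """
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

# Assumed: ~7,000 tokens of body text, ~700 tokens of main content
# per chunk (1000-token chunks minus the ~30% contextual allowance),
# 100-token overlap. This lands in the same ballpark as the ~12
# chunks observed in Milvus Attu.
print(estimate_chunk_count(7000, 700, 100))  # -> 12
```

The point of the estimate is only that ~12 chunks is a plausible total for this configuration, so the +3 chunks observed after re-saving cannot be explained by content growth.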

Actual Result

After editing and re-saving the node, additional chunks are appended to the collection, causing the “Approx Count” in Milvus Attu to increase, for example from 12 to 15 chunks.

Expected Result

The indexing process on save should replace the existing chunks. Because the overall token count did not change, the total number of chunks in Milvus should remain the same.
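The difference between the observed and expected behavior can be illustrated with a toy in-memory "collection". This is a sketch only: the entity ID format, chunk contents, and method names are invented, and no real Milvus or Drupal API is used. Note also that the report observes only ~3 extra chunks rather than a full duplicate set, which the toy model deliberately ignores; the principle is the same, since any append without a prior delete leaves stale rows behind.

```python
class ToyCollection:
    """Minimal stand-in for a vector collection storing chunk rows."""

    def __init__(self):
        self.rows = []  # each row: (entity_id, chunk_text)

    def index_append_only(self, entity_id, chunks):
        # Observed (buggy) behavior: old chunks for the entity stay in place.
        self.rows.extend((entity_id, c) for c in chunks)

    def index_replace(self, entity_id, chunks):
        # Expected behavior: delete the entity's old chunks, then insert.
        self.rows = [r for r in self.rows if r[0] != entity_id]
        self.rows.extend((entity_id, c) for c in chunks)

buggy, fixed = ToyCollection(), ToyCollection()
first_save = [f"chunk {i}" for i in range(12)]
second_save = [f"chunk {i} (edited)" for i in range(12)]

# Index once, then re-index after the edit, on both collections.
buggy.index_append_only("node:1:en", first_save)
buggy.index_append_only("node:1:en", second_save)
fixed.index_replace("node:1:en", first_save)
fixed.index_replace("node:1:en", second_save)

print(len(buggy.rows), len(fixed.rows))  # -> 24 12
```

With delete-then-insert, re-saving the node any number of times leaves the chunk count stable, which is the behavior this issue expects.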

Notes

  1. This issue is only observed when “Index items immediately” is enabled. Cron indexing and manual reindexing at /admin/config/search/search-api/index/long_text_index work as expected.
  2. This issue is not caused by a timeout (e.g. PHP's max_execution_time): a node producing ~12 chunks was indexed in one go when indexing was triggered by cron, direct PHP code, or batch.

Proposed resolution

  1. First, the issue itself needs to be fixed, as it might also affect how chunking works in edge cases.
  2. Once fixed, the option “Index items immediately” should, in my opinion, be disabled for vector databases: although in this case the cause is clearly a bug and not a timeout, there will be situations where indexing directly on save fails because of max_execution_time, and that will corrupt the index.
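Resolution point 2 could be enforced with a small guard when saving index options. The sketch below is purely hypothetical: the backend plugin ID, the option keys, and the function name are invented for illustration and do not correspond to real Search API or AI Search APIs.

```python
# Assumed backend plugin IDs that index into a vector database.
VECTOR_BACKENDS = {"ai_search"}

def sanitize_index_options(backend_id: str, options: dict) -> dict:
    """Return index options with 'index_directly' cleared for vector
    backends, so a slow embedding call or a PHP timeout during a node
    save cannot leave the index in a partially written state.

    All names here are illustrative, not real module configuration keys.
    """
    if backend_id in VECTOR_BACKENDS and options.get("index_directly"):
        options = {**options, "index_directly": False}
    return options

print(sanitize_index_options("ai_search", {"index_directly": True, "cron_limit": 5}))
```

A softer variant of the same idea would be to keep the option but show a warning on the index form when the server's backend is a vector database.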
🐛 Bug report
Status

Active

Version

1.0

Component

AI Search

Created by

🇬🇧United Kingdom seogow


