Re-indexing on node save adds new chunks instead of replacing existing chunks

Created on 23 January 2025

Problem/Motivation

When a node is re-saved in a Drupal 11 installation with the AI Search index set to “Index items immediately,” additional embedding chunks are appended to the Milvus vector database instead of replacing the existing chunks. As a result, the index grows incorrectly over time, accumulating stale chunks for the same content.

Steps to reproduce

  1. Set up a fresh Drupal 11 environment with the following modules enabled and configured:
    • AI Module
    • Search API Module
    • AI Search Module
    • OpenAI Provider Module
    • Milvus VDB Provider Module
    • Key Module
  2. Create an Article node:
    • Title: “Long article”
    • Body: A long text (~5000 words or more), starting with the word “Test1”
  3. Create a Search API server with:
    • Name: Milvus VDB
    • Backend: AI Search
    • Embeddings Engine: OpenAI Small
    • Tokenizer chat counting model: OpenAI gpt-3.5-turbo
    • Vector Database: Milvus DB
    • Database name: default
    • Collection: test_long
    • Similarity Metric: Cosine
    • Advanced Embeddings Strategy Configuration:
      • Strategy for breaking content into smaller chunks for indexing: Enriched Embedding Strategy
      • Maximum chunk size allowed when breaking up larger content: 1000
      • Minimum chunk overlap for ‘Main Content’: 100
      • Contextual content maximum percentage: 30%
  4. Configure the Search API Index:
    • Index name: long_text_index
    • Datasources: Content → Article only
    • Languages: English only
    • Server: Milvus VDB
    • Default Tracker Settings:
      • Index order: Same
    • Index options:
      • Index items immediately: enabled
      • Track changes in referenced entities: enabled
      • Cron batch size: 5
    • Add fields:
      • Body | Fulltext | Main content
      • Title | String | Contextual content
  5. Index the content:
    • Go to /admin/config/search/search-api/index/long_text_index
    • Click “Index now”
  6. Check the Approx Count in Milvus Attu, database 'default', collection 'test_long':
    • Verify you have ~12 chunks (the actual count may vary by content length).
  7. Edit the same node:
    • Change the first word in the Body field from “Test1” to “Test2”.
    • Save the node.
  8. Check the Approx Count again in Milvus Attu, database 'default', collection 'test_long':
    • Observe that additional chunks (~3) are added instead of the original chunks being replaced.
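As a rough sanity check on the ~12 chunks seen in step 6, the chunk count can be estimated from the settings in step 3. This is only a back-of-envelope sketch: the ~7,000-token body length for a ~5,000-word article and the assumption that the 30% contextual allowance leaves roughly 700 tokens of main content per chunk are guesses, not measured values, and the real Enriched Embedding Strategy splits on content boundaries rather than fixed windows.

```python
import math

def estimate_chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Rough estimate of fixed-size chunks with overlap.

    After the first chunk, each subsequent chunk advances by
    (chunk_size - overlap) tokens. Real chunkers split on sentence
    or paragraph boundaries, so actual counts will vary.
    """
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

# Assumed: ~7,000 tokens of body text, ~700 tokens of main content
# per chunk (1000-token chunks minus the ~30% contextual allowance),
# 100-token overlap. This lands in the same ballpark as the ~12
# chunks observed in Milvus Attu.
print(estimate_chunk_count(7000, 700, 100))  # -> 12
```

The point of the estimate is only that ~12 chunks is a plausible total for this configuration, so the +3 chunks observed after re-saving cannot be explained by content growth.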

Actual Result

After editing and re-saving the node, additional chunks are appended to the collection, causing the “Approx Count” in Milvus Attu to increase, for example from 12 to 15 chunks.

Expected Result

The indexing process on save should replace the existing chunks. Because the overall token count did not change, the total number of chunks in Milvus should remain the same.
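The difference between the observed and expected behavior can be illustrated with a toy in-memory "collection". This is a sketch only: the entity ID format, chunk contents, and method names are invented, and no real Milvus or Drupal API is used. Note also that the report observes only ~3 extra chunks rather than a full duplicate set, which the toy model deliberately ignores; the principle is the same, since any append without a prior delete leaves stale rows behind.

```python
class ToyCollection:
    """Minimal stand-in for a vector collection storing chunk rows."""

    def __init__(self):
        self.rows = []  # each row: (entity_id, chunk_text)

    def index_append_only(self, entity_id, chunks):
        # Observed (buggy) behavior: old chunks for the entity stay in place.
        self.rows.extend((entity_id, c) for c in chunks)

    def index_replace(self, entity_id, chunks):
        # Expected behavior: delete the entity's old chunks, then insert.
        self.rows = [r for r in self.rows if r[0] != entity_id]
        self.rows.extend((entity_id, c) for c in chunks)

buggy, fixed = ToyCollection(), ToyCollection()
first_save = [f"chunk {i}" for i in range(12)]
second_save = [f"chunk {i} (edited)" for i in range(12)]

# Index once, then re-index after the edit, on both collections.
buggy.index_append_only("node:1:en", first_save)
buggy.index_append_only("node:1:en", second_save)
fixed.index_replace("node:1:en", first_save)
fixed.index_replace("node:1:en", second_save)

print(len(buggy.rows), len(fixed.rows))  # -> 24 12
```

With delete-then-insert, re-saving the node any number of times leaves the chunk count stable, which is the behavior this issue expects.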

Notes

  1. This issue is only observed when “Index items immediately” is enabled. Cron indexing and manual reindexing at /admin/config/search/search-api/index/long_text_index work as expected.
  2. This issue is not caused by a timeout (e.g. PHP's max_execution_time): a node producing ~12 chunks was indexed in one go when indexing was triggered by cron, direct PHP code, or batch.

Proposed resolution

  1. First, the issue itself needs to be fixed, as it might also affect how chunking works in edge cases.
  2. Once fixed, the option “Index items immediately” should, in my opinion, be disabled for vector databases: although in this case the cause is clearly a bug and not a timeout, there will be situations where indexing directly on save fails because of max_execution_time, and that will corrupt the index.
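Resolution point 2 could be enforced with a small guard when saving index options. The sketch below is purely hypothetical: the backend plugin ID, the option keys, and the function name are invented for illustration and do not correspond to real Search API or AI Search APIs.

```python
# Assumed backend plugin IDs that index into a vector database.
VECTOR_BACKENDS = {"ai_search"}

def sanitize_index_options(backend_id: str, options: dict) -> dict:
    """Return index options with 'index_directly' cleared for vector
    backends, so a slow embedding call or a PHP timeout during a node
    save cannot leave the index in a partially written state.

    All names here are illustrative, not real module configuration keys.
    """
    if backend_id in VECTOR_BACKENDS and options.get("index_directly"):
        options = {**options, "index_directly": False}
    return options

print(sanitize_index_options("ai_search", {"index_directly": True, "cron_limit": 5}))
```

A softer variant of the same idea would be to keep the option but show a warning on the index form when the server's backend is a vector database.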
🐛 Bug report
Status

Active

Version

1.0

Component

AI Search

Created by

🇬🇧United Kingdom seogow


