Chunked embeddings should have a unique ID

Created on 28 August 2024, 24 days ago
Updated 3 September 2024, 18 days ago

Problem/Motivation

There is no unique ID for the individual chunks that are indexed. This means that back-ends that require an ID on creation have to generate the ID themselves. Given that the embeddings are generated by AI Search, it seems reasonable for the ID to be generated by AI Search too, for consistency and to reduce the effort required of provider implementations.

EmbeddingStrategyInterface::getEmbedding already generates an ID for the embedding, which is then stored in drupal_long_id for indexing. However, both implementations in the AI module use the Search API Item ID directly, which is both non-unique and duplicated by the drupal_entity_id metadata field.

Steps to reproduce

Index data and check the index to see whether unique IDs are being created for each chunk.

Proposed resolution

  • EmbeddingStrategyInterface::getEmbedding implementations should be updated to provide a unique identifier for the chunk. This should be a combination of the Search API Item ID and a unique suffix (e.g. delta).
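The proposed ID shape can be sketched as follows. This is a minimal illustration of combining the Search API Item ID with a delta suffix; the function name and separator are assumptions for the example, not the module's actual implementation.

```php
<?php

// Illustrative only: build a unique chunk ID by suffixing the Search API
// Item ID with the chunk's delta. The helper name and the ':' separator
// are assumptions, not the ai_search module's actual API.
function build_chunk_id(string $search_api_item_id, int $delta): string {
  return $search_api_item_id . ':' . $delta;
}

// Example: the third chunk of an indexed node.
$id = build_chunk_id('entity:node/42:en', 2);
// $id is now 'entity:node/42:en:2', unique per chunk but still
// prefixed by the (non-unique) Search API Item ID.
```

Because the Search API Item ID remains a prefix of every chunk ID, all chunks belonging to one entity can still be found (and deleted) by prefix.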

Remaining tasks

  • Implement the changes
  • Issue a change notice
  • Decide whether we should provide more structured responses to help enforce the correct behaviour, e.g. a DTO or phpdoc array shapes. The function currently returns an array with no documentation or enforcement of what it should contain, which could lead to errors in embedding implementations.
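The "more structured responses" task above could, for instance, take the shape of a small value object. This is a hypothetical sketch; the class name and properties are assumptions made for illustration, not part of the module's API.

```php
<?php

// Hypothetical DTO illustrating the "structured responses" idea: instead of
// a free-form array, an embedding strategy would return objects whose keys
// are enforced by the type system. Names are illustrative only.
final class ChunkedEmbedding {

  public function __construct(
    public readonly string $id,
    public readonly array $vector,
    public readonly array $metadata,
  ) {}

}

// A backend consuming this knows exactly which fields exist.
$chunk = new ChunkedEmbedding(
  'entity:node/42:en:0',
  [0.1, 0.2, 0.3],
  ['drupal_entity_id' => 'entity:node/42:en'],
);
```

A phpdoc array shape (e.g. `@return array{id: string, vector: float[], metadata: array}`) would be a lighter-weight alternative that avoids a BC break in the return type.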

User interface changes

None.

API changes

drupal_long_id, returned from EmbeddingStrategyInterface::getEmbedding, will now need to be a unique ID.

Data model changes

drupal_long_id metadata will have a unique value.

Original report by vivek panicker

When creating the item to index data, a unique ID is no longer created for it, as was done earlier in the Search API AI module.
AI module code:

Search API AI module code:

Vector Database services like Pinecone require that the indexed item should have a unique ID.

I believe if we maintain the same ID, this will cause data in the index to be overridden.

📌 Task

Status: Needs review
Version: 1.0
Component: Code
Created by: 🇮🇳India Vivek Panicker (Kolkata)



Comments & Activities

  • Issue created by @Vivek Panicker
  • Status changed to Needs review 24 days ago
  • Pipeline finished with Success
    24 days ago
    Total: 308s
    #267027
  • Issue was unassigned.
  • Status changed to Closed: won't fix 23 days ago
  • 🇮🇳India Vivek Panicker Kolkata

    As discussed with the maintainers Scott Euser and Seogow, the alteration should not be made in the core AI module, but should be made in the Pinecone Provider itself.
    Hence closing this issue.

  • 🇬🇧United Kingdom yautja_cetanu

    Vivek, do you think you could present some exploration of the benefits of putting something that handles chunk IDs into the VDB abstraction layer itself versus putting it in the provider?

    If it's in the layer, it means all modules can interact with the "Chunk ID" in a consistent manner. What use cases would there be for that? If it's in the provider itself, it means the abstraction layer can't decide how to create IDs and has to rely on each DB's internal approach to it. Does that matter?

  • 🇬🇧United Kingdom yautja_cetanu

    Ok, let's keep this closed.

    So either way we have to do something unique for each DB. Milvus does not allow custom IDs; Pinecone requires them.
    So we pick Milvus as the default (by accident of history) and make the Pinecone provider create its own thing.
    The next provider, like Weaviate, can copy and paste code from Pinecone if it needs the same thing, and if this happens often enough we should maybe provide a function for creating chunk IDs in the abstraction layer.

    We can't even think of any reason why any module would need a consistent method for chunk IDs. Maybe appending entity IDs into the chunk IDs? But that seems useless. Maybe if I had a log system that stored the chunk IDs instead of the contents of the chunks, I would want to keep those IDs consistent, but every time you reindex or do anything funky the IDs are going to be messed up.

    As Scott says "I think we can over time make traits or base classes for similar vdb providers when they share the same functionality to avoid them repeating each other"

  • Assigned to scott_euser
  • Status changed to Needs work 23 days ago
  • 🇬🇧United Kingdom scott_euser

    Thanks for raising this and working on it @vivek-panicker!

    Okay, will explore this a bit more. It could be that we re-open this with a pivot in the goal and an updated issue summary, instead making it possible for the VDB provider to set the ID that Search API then stores locally.

    My current thinking is that we can probably pass $data by reference at https://git.drupalcode.org/project/ai/-/blob/1.0.x/modules/ai_search/src/Plugin/search_api/backend/SearchApiAiSearchBackend.php?ref_type=heads#L319 so that the drupal_long_id can be modified by the VDB provider. Need to check if there are similar issues elsewhere, but at a glance I think not.

    It would mean any existing VDB provider also needs to update insertIntoCollection() with the & for the reference.

    But then Pinecone can have a method like setUniqueId(), which can start in the Pinecone plugin but eventually move to a trait if we get Weaviate in, which appears to have the same requirement noted here that the ID must be unique for each chunk: https://weaviate.io/developers/weaviate/api/rest#tag/objects/POST/objects

  • 🇬🇧United Kingdom andrewbelcher

    I've had a look through both the old Search API AI code (which was actually somewhat broken) and the new code...

    For providers that need a provided ID, I would suggest that it is added in their implementation of AiVdbProviderInterface::insertIntoCollection. I think this should be done on a provider-by-provider basis (as was previously decided).

    Nothing in AI Search makes use of those IDs or needs to track those IDs. Deletions are done by querying all items with the corresponding Search API Item ID (stored in metadata).

    In terms of generating IDs, AI Search already generates a couple IDs used in metadata:

    • drupal_entity_id: Search API Item ID (non unique)
    • drupal_long_id: Comes from the embedding, but is currently identical to above.

    I think the second of those should actually be a unique ID for the embedding item, which would then make it suitable for Pinecone (and others) to use in AiVdbProviderInterface::insertIntoCollection to set a unique and reliable ID.

    This ID is not actually used anywhere, so whilst technically a BC break, I think that would be fine to do.
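The suggestion above, that providers needing IDs consume the unique drupal_long_id inside their insertIntoCollection() implementation, can be sketched roughly as below. The item data shape and helper name are assumptions based on this discussion, not the actual AiVdbProviderInterface.

```php
<?php

// Illustrative sketch: a provider that requires vector IDs (e.g. Pinecone)
// could read the now-unique drupal_long_id from each item's metadata inside
// its insertIntoCollection() implementation. The array shape is assumed.
function extract_vector_ids(array $items): array {
  $ids = [];
  foreach ($items as $item) {
    // Use the unique drupal_long_id as the vector store's primary key.
    $ids[] = $item['metadata']['drupal_long_id'];
  }
  return $ids;
}

$ids = extract_vector_ids([
  ['metadata' => ['drupal_long_id' => 'entity:node/42:en:0']],
  ['metadata' => ['drupal_long_id' => 'entity:node/42:en:1']],
]);
// $ids holds one unique ID per chunk, ready to pass to the vector store.
```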

  • 🇬🇧United Kingdom andrewbelcher

    Sorry scott_euser, a bit of a cross post there, and I've also updated the issue summary.

    My current thinking is that we can probably pass $data by reference to at https://git.drupalcode.org/project/ai/-/blob/1.0.x/modules/ai_search/src... so that the drupal_long_id can be modified by the VDB Provider. Need to check if there are similar issues elsewhere, but at a glance I think not.

    I don't think we should do this, for a couple of reasons:

    1. Passing by reference is a pattern that I think should generally be avoided, as it can lead to unexpected behaviour and hard-to-track-down errors.
    2. AI Search doesn't actually do anything with drupal_long_id, so even if it was modified by a provider, that would be fine.
    3. Providers don't have as much context to decide what a suitable ID is, e.g. they don't have a delta. The embedding strategy (which I've suggested should handle generating the ID) does have this context.
  • 🇬🇧United Kingdom scott_euser

    Hmmm, we don't have a record of the chunk IDs stored if we don't return (or pass by reference) anything. But yes, I can see that drupal_entity_id is used for that and not drupal_long_id. So actually the Search API item corresponds to the entity, not the chunk (i.e. what gets recorded in search_api_item).

    It's not really the insert I am worried about. I'll explore whether enough info is stored for e.g. Pinecone/Weaviate to query for all chunks related to that entity ID in order to delete them. Hopefully they both have sufficient query methods to find and delete all chunks that match the entity ID.

    Then the drupal_long_id also won't matter in Pinecone (as long as it's unique).

  • 🇬🇧United Kingdom scott_euser

    Okay, it seems Pinecone's free tier cannot do Delete with a filter (only paid plans), but it can do List by ID prefix and then Delete by the actual IDs, so it should be doable without knowing the chunk IDs within Drupal. We just need to make sure that the prefix matches the locally stored index ID to be able to find and delete them. I think that's anyway what you are suggesting in the issue summary. It actually seems that is what SearchApiAiSearchBackend::deleteItems() is doing (so that is what Milvus is doing, so I guess Milvus has the same limitation, but I haven't checked).

    In any case, it should be fine.
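The "list by ID prefix, then delete by exact ID" pattern described above can be sketched as follows. The helper is illustrative only; a real provider would obtain the ID list from Pinecone's REST API rather than an in-memory array.

```php
<?php

// Illustrative sketch of the deletion pattern for stores without
// delete-by-filter: list all IDs sharing the entity's Search API Item ID
// as a prefix, then delete those exact IDs. Names are assumptions.
function chunk_ids_for_entity(array $all_ids, string $entity_id): array {
  // Chunk IDs are assumed to be "<entity id>:<delta>", so matching on the
  // "<entity id>:" prefix finds every chunk of that entity and nothing else.
  return array_values(array_filter(
    $all_ids,
    fn (string $id): bool => str_starts_with($id, $entity_id . ':'),
  ));
}

$ids = chunk_ids_for_entity(
  ['entity:node/42:en:0', 'entity:node/42:en:1', 'entity:node/7:en:0'],
  'entity:node/42:en',
);
// Only the two chunks of node 42 remain; these can then be deleted by ID.
```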

  • 🇬🇧United Kingdom scott_euser

    Updated issue summary with what specifically needs to be done. Will comment on Pinecone one to add that to the list of things to deal with.

  • Pipeline finished with Canceled
    23 days ago
    Total: 117s
    #268194
  • Pipeline finished with Failed
    23 days ago
    Total: 240s
    #268196
  • Pipeline finished with Success
    23 days ago
    Total: 248s
    #268203
  • Issue was unassigned.
  • Status changed to Needs review 18 days ago
  • 🇬🇧United Kingdom scott_euser

    Updated the code to enforce a consistent structure in each chunk. I am not 100% sure about the return phpdoc in case someone needs additional keys; I guess this would prevent that, but we can deal with it if/when the problem arises. I did also add a validation method to check the format of the ID for consistency, as I expect we'll have problems if we rely on it being ID + delta.

    I need to test this more thoroughly still so not ready for merging, but marking as Needs Review in case anyone else wants to check it as well.
