- Issue created by @Vivek Panicker
- Merge request !53Issue #3470652: Made ID unique for every chunk. → (Merged) created by Vivek Panicker
- Status changed to Needs review
5 months ago 10:52am 28 August 2024 - Issue was unassigned.
- Status changed to Closed: won't fix
5 months ago 5:56pm 28 August 2024 - 🇮🇳India Vivek Panicker Kolkata
As discussed with the maintainers Scott Euser and Seogow, the alteration should not be made in the core AI module, but should be made in the Pinecone Provider itself.
Hence closing this issue. - 🇬🇧United Kingdom yautja_cetanu
Vivek. Do you think you could present some exploration of the benefits of putting something that handles chunk IDs into the VDB abstraction layer itself or putting it in the provider?
If its in the Layer it means all modules can interact with the "Chunk ID" in a consistent manner. What use cases would there be for that? If its in the provider itself then it means that the VDB can't decide how to create IDs and has to rely on each Db's internal approach to it. Does that matter?
- 🇬🇧United Kingdom yautja_cetanu
Ok lets keep this closed.
ok do either way we have to do something unique for each DB. Milvus does not allow for custom IDs, Pinecone requires them.
so we pick, milvus being default by accident of history and make the pinecone provider create its own thing
The next provider made like Weviate, if it needs the same thing can copy and paste code from Pinecone and if this happens enough we maybe should provide a function for creating chunk ids in the abstraction layer.We can't even think of any reason why any module would need some consistent method for chunk IDs. Maybe appending entity ids into the chunk IDs? But that seems useless. Maybe if I had a log system that stored the chunk IDs instead of the contents of chunk I want to keep those IDs consistent but every time you reindex or do anything funky the IDs are going to be messed.
As Scott says "I think we can over time make traits or base classes for similar vdb providers when they share the same functionality to avoid them repeating each other"
- Assigned to scott_euser
- Status changed to Needs work
5 months ago 6:49pm 28 August 2024 - 🇬🇧United Kingdom scott_euser
Thanks for raising this and working on it @vivek-panicker!
Okay will explore this a bit more. It could be we re-open with a pivot to the goal and updated issue summary, instead making it possible for the VDB provider to set the ID that Search API then stores in its local.
My current thinking is that we can probably pass $data by reference to
at https://git.drupalcode.org/project/ai/-/blob/1.0.x/modules/ai_search/src/Plugin/search_api/backend/SearchApiAiSearchBackend.php?ref_type=heads#L319 so that the drupal_long_id can be modified by the VDB Provider. Need to check if there are similar issues elsewhere, but at a glance I think not.
It would mean any VDB provider that exists needs to also update insertIntoCollection() with the & for the reference.
But then Pinecone can have a method like setUniqueId() which can start in Pinecone plugin but eventually move to a trait if we get Weviate in which appears to have the same issue as noted here with ID must be required and unique for each chunk: https://weaviate.io/developers/weaviate/api/rest#tag/objects/POST/objects
- 🇬🇧United Kingdom andrewbelcher
I've had a look through both the old search api AI code (which was actually somewhat broken) and the new code...
For providers than need a provided ID, I would suggest that is added in their implementation of
AiVdbProviderInterface::insertIntoCollection
. I think this should be done on a provider by provider basis (as was previously decided).Nothing in AI Search makes use of those IDs or needs to track those IDs. Deletions are done by querying all items with the corresponding Search API Item ID (stored in metadata).
In terms of generating IDs, AI Search already generates a couple IDs used in metadata:
drupal_entity_id
: Search API Item ID (non unique)drupal_long_id
: Comes from the embedding, but is currently identical to above.
I think the second of those should actually be a unique ID for the embedding item, which would then make it suitable for Pinecone (and others) to use in
AiVdbProviderInterface::insertIntoCollection
to set a unique and reliable ID.This ID is not actually used anywhere, so whilst technically a BC break, I think that would be fine to do.
- 🇬🇧United Kingdom andrewbelcher
Sorry scott_euser, a bit of a cross post there, and I've also updated the issue summary.
My current thinking is that we can probably pass $data by reference to at https://git.drupalcode.org/project/ai/-/blob/1.0.x/modules/ai_search/src... so that the drupal_long_id can be modified by the VDB Provider. Need to check if there are similar issues elsewhere, but at a glance I think not.
I don't think we should do this, for a couple of reasons:
- Passing by reference I think is a pattern that should be generally avoided, as it can lead to various unexpected behaviour and hard to track down errors.
- AI Search doesn't actually do anything with
drupal_long_id
, so even if it was modified by a provider, that would be fine. - Providers don't have as much context to decide what is a suitable ID, e.g. it doesn't have an delta. The embedding (which I've suggested handle generating the ID) does have this context.
- 🇬🇧United Kingdom scott_euser
Hmmm we don't have a record of the ID chunks stored if we don't return (or pass by reference) anything. But yes I can see that drupal_entity_id is used for that and not drupal_long_id . So actually search api item corresponds to the entity, not the chunk (ie, what gets recorded in search_api_item)
Its not really the insert I am worried about. I'll explore to see if that's enough info to be stored for e.g. Pinecone/Weviate to query to find all chunks related to that entity id to be able to delete. Hopefully they both have sufficient query methods to be able to find and delete all chunks that match the entity ID
Then the drupal_long_id also won't matter in Pinecone either (as long as its unique)
- 🇬🇧United Kingdom scott_euser
Okay seems Pinecone free cannot do Delete with filter (only paid plans), but it can do List by ID prefix and then Delete by actual IDs, so it should be doable I think to not know the chunk IDs within Drupal. We just need to make sure that the prefix matches the locally stored index ID to be able to find and delete them. I think that's anyways what you are suggesting in the issue summary. It actually seems that is what SearchApiAiSearchBackend::deleteItems() is doing (so what Milvus is doing, so I guess Milvus has same limitation but haven't checked).
In any case, it should be fine
- 🇬🇧United Kingdom scott_euser
Updated issue summary with what specifically needs to be done. Will comment on Pinecone one to add that to the list of things to deal with.
- Issue was unassigned.
- Status changed to Needs review
5 months ago 4:35am 3 September 2024 - 🇬🇧United Kingdom scott_euser
Updated the code to enforce a consistent structure in each chunk. I am not 100% sure on the return phpdoc in case someone has a need for additional keys, I guess this would stop that, but we can deal with that if/when problem arises. I did add a validation method too though to actually check the format of the ID for consistency too as I expect we'll have problems if we rely on it to be ID + delta.
I need to test this more thoroughly still so not ready for merging, but marking as Needs Review in case anyone else wants to check it as well.
- 🇬🇧United Kingdom scott_euser
scott_euser → changed the visibility of the branch 3470652-chunked-embeddings-unique-id to hidden.
-
scott_euser →
committed 46559d4a on 1.0.x authored by
vivek panicker →
Issue #3470652 by scott_euser, vivek panicker, andrewbelcher: Chunked...
-
scott_euser →
committed 46559d4a on 1.0.x authored by
vivek panicker →
-
scott_euser →
committed 46559d4a on mock-stream-response authored by
vivek panicker →
Issue #3470652 by scott_euser, vivek panicker, andrewbelcher: Chunked...
-
scott_euser →
committed 46559d4a on mock-stream-response authored by
vivek panicker →
Automatically closed - issue fixed for 2 weeks with no activity.