Chunked embeddings should have a unique ID

Created on 28 August 2024, 24 days ago
Updated 3 September 2024, 18 days ago

Problem/Motivation

There is no unique ID for the individual chunks that are indexed. This means that back-ends that require an ID on creation have to generate the ID themselves. Given that the embeddings are generated by AI Search, it seems reasonable for the ID to be generated by AI Search too, for consistency and to reduce the effort required of provider implementations.

EmbeddingStrategyInterface::getEmbedding already generates an ID for the embedding, which is then stored in drupal_long_id for indexing. However, both implementations in the AI module use the Search API Item ID directly, which is both non-unique and duplicated by the drupal_entity_id metadata field.

Steps to reproduce

Index data and check the index to see whether unique IDs are being created for each chunk.

Proposed resolution

  • EmbeddingStrategyInterface::getEmbedding implementations should be updated to provide a unique identifier for the chunk. This should be a combination of the Search API Item ID and a unique suffix (e.g. delta).
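The proposed ID shape can be sketched as follows. This is a minimal illustration of combining the Search API Item ID with a delta suffix; the function name and separator are assumptions for the example, not the module's actual implementation.

```php
<?php

// Illustrative only: build a unique chunk ID by suffixing the Search API
// Item ID with the chunk's delta. The helper name and the ':' separator
// are assumptions, not the ai_search module's actual API.
function build_chunk_id(string $search_api_item_id, int $delta): string {
  return $search_api_item_id . ':' . $delta;
}

// Example: the third chunk of an indexed node.
$id = build_chunk_id('entity:node/42:en', 2);
// $id is now 'entity:node/42:en:2', unique per chunk but still
// prefixed by the (non-unique) Search API Item ID.
```

Because the Search API Item ID remains a prefix of every chunk ID, all chunks belonging to one entity can still be found (and deleted) by prefix.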

Remaining tasks

  • Implement the changes
  • Issue a change notice
  • Decide whether we should provide more structured responses to help enforce the correct behaviour, e.g. a DTO or phpdoc array shapes. The function currently returns an array with no documentation or enforcement of what it should contain, which could lead to errors in embedding implementations.
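The "more structured responses" task above could, for instance, take the shape of a small value object. This is a hypothetical sketch; the class name and properties are assumptions made for illustration, not part of the module's API.

```php
<?php

// Hypothetical DTO illustrating the "structured responses" idea: instead of
// a free-form array, an embedding strategy would return objects whose keys
// are enforced by the type system. Names are illustrative only.
final class ChunkedEmbedding {

  public function __construct(
    public readonly string $id,
    public readonly array $vector,
    public readonly array $metadata,
  ) {}

}

// A backend consuming this knows exactly which fields exist.
$chunk = new ChunkedEmbedding(
  'entity:node/42:en:0',
  [0.1, 0.2, 0.3],
  ['drupal_entity_id' => 'entity:node/42:en'],
);
```

A phpdoc array shape (e.g. `@return array{id: string, vector: float[], metadata: array}`) would be a lighter-weight alternative that avoids a BC break in the return type.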

User interface changes

None.

API changes

drupal_long_id, returned from EmbeddingStrategyInterface::getEmbedding, will now need to be a unique ID.

Data model changes

drupal_long_id metadata will have a unique value.

Original report by vivek panicker

When creating the item to index data, a unique ID is no longer created for it, as was done earlier in the Search API AI module.
AI module code:

Search API AI module code:

Vector Database services like Pinecone require that the indexed item should have a unique ID.

I believe if we maintain the same ID, this will cause data in the index to be overridden.

📌 Task

Status: Needs review
Version: 1.0
Component: Code
Created by: 🇮🇳India Vivek Panicker (Kolkata)



Comments & Activities

  • Issue created by @Vivek Panicker
  • Status changed to Needs review 24 days ago
  • Pipeline finished with Success
    24 days ago
    Total: 308s
    #267027
  • Issue was unassigned.
  • Status changed to Closed: won't fix 23 days ago
  • 🇮🇳India Vivek Panicker Kolkata

    As discussed with the maintainers Scott Euser and Seogow, the alteration should not be made in the core AI module, but should be made in the Pinecone Provider itself.
    Hence closing this issue.

  • 🇬🇧United Kingdom yautja_cetanu

    Vivek, do you think you could present some exploration of the benefits of putting something that handles chunk IDs into the VDB abstraction layer itself versus putting it in the provider?

    If it's in the layer, it means all modules can interact with the "Chunk ID" in a consistent manner. What use cases would there be for that? If it's in the provider itself, it means the abstraction layer can't decide how to create IDs and has to rely on each DB's internal approach to it. Does that matter?

  • 🇬🇧United Kingdom yautja_cetanu

    Ok, let's keep this closed.

    So either way we have to do something unique for each DB. Milvus does not allow custom IDs; Pinecone requires them.
    So we pick Milvus as the default (by accident of history) and make the Pinecone provider create its own thing.
    The next provider, like Weaviate, can copy and paste code from Pinecone if it needs the same thing, and if this happens often enough we should maybe provide a function for creating chunk IDs in the abstraction layer.

    We can't even think of any reason why any module would need a consistent method for chunk IDs. Maybe appending entity IDs into the chunk IDs? But that seems useless. Maybe if I had a log system that stored the chunk IDs instead of the contents of the chunks, I would want to keep those IDs consistent, but every time you reindex or do anything funky the IDs are going to be messed up.

    As Scott says "I think we can over time make traits or base classes for similar vdb providers when they share the same functionality to avoid them repeating each other"

  • Assigned to scott_euser
  • Status changed to Needs work 23 days ago
  • 🇬🇧United Kingdom scott_euser

    Thanks for raising this and working on it @vivek-panicker!

    Okay, will explore this a bit more. It could be that we re-open this with a pivot in the goal and an updated issue summary, instead making it possible for the VDB provider to set the ID that Search API then stores locally.

    My current thinking is that we can probably pass $data by reference at https://git.drupalcode.org/project/ai/-/blob/1.0.x/modules/ai_search/src/Plugin/search_api/backend/SearchApiAiSearchBackend.php?ref_type=heads#L319 so that the drupal_long_id can be modified by the VDB provider. Need to check if there are similar issues elsewhere, but at a glance I think not.

    It would mean any existing VDB provider also needs to update insertIntoCollection() with the & for the reference.

    But then Pinecone can have a method like setUniqueId(), which can start in the Pinecone plugin but eventually move to a trait if we get Weaviate in, which appears to have the same requirement noted here that the ID must be unique for each chunk: https://weaviate.io/developers/weaviate/api/rest#tag/objects/POST/objects

  • 🇬🇧United Kingdom andrewbelcher

    I've had a look through both the old Search API AI code (which was actually somewhat broken) and the new code...

    For providers that need a provided ID, I would suggest that it is added in their implementation of AiVdbProviderInterface::insertIntoCollection. I think this should be done on a provider-by-provider basis (as was previously decided).

    Nothing in AI Search makes use of those IDs or needs to track those IDs. Deletions are done by querying all items with the corresponding Search API Item ID (stored in metadata).

    In terms of generating IDs, AI Search already generates a couple IDs used in metadata:

    • drupal_entity_id: Search API Item ID (non unique)
    • drupal_long_id: Comes from the embedding, but is currently identical to above.

    I think the second of those should actually be a unique ID for the embedding item, which would then make it suitable for Pinecone (and others) to use in AiVdbProviderInterface::insertIntoCollection to set a unique and reliable ID.

    This ID is not actually used anywhere, so whilst technically a BC break, I think that would be fine to do.
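The suggestion above, that providers needing IDs consume the unique drupal_long_id inside their insertIntoCollection() implementation, can be sketched roughly as below. The item data shape and helper name are assumptions based on this discussion, not the actual AiVdbProviderInterface.

```php
<?php

// Illustrative sketch: a provider that requires vector IDs (e.g. Pinecone)
// could read the now-unique drupal_long_id from each item's metadata inside
// its insertIntoCollection() implementation. The array shape is assumed.
function extract_vector_ids(array $items): array {
  $ids = [];
  foreach ($items as $item) {
    // Use the unique drupal_long_id as the vector store's primary key.
    $ids[] = $item['metadata']['drupal_long_id'];
  }
  return $ids;
}

$ids = extract_vector_ids([
  ['metadata' => ['drupal_long_id' => 'entity:node/42:en:0']],
  ['metadata' => ['drupal_long_id' => 'entity:node/42:en:1']],
]);
// $ids holds one unique ID per chunk, ready to pass to the vector store.
```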

  • 🇬🇧United Kingdom andrewbelcher

    Sorry scott_euser, a bit of a cross post there, and I've also updated the issue summary.

    My current thinking is that we can probably pass $data by reference to at https://git.drupalcode.org/project/ai/-/blob/1.0.x/modules/ai_search/src... so that the drupal_long_id can be modified by the VDB Provider. Need to check if there are similar issues elsewhere, but at a glance I think not.

    I don't think we should do this, for a couple of reasons:

    1. Passing by reference is a pattern that I think should generally be avoided, as it can lead to unexpected behaviour and hard-to-track-down errors.
    2. AI Search doesn't actually do anything with drupal_long_id, so even if it was modified by a provider, that would be fine.
    3. Providers don't have as much context to decide what a suitable ID is, e.g. they don't have a delta. The embedding strategy (which I've suggested should handle generating the ID) does have this context.
  • 🇬🇧United Kingdom scott_euser

    Hmmm, we don't have a record of the chunk IDs stored if we don't return (or pass by reference) anything. But yes, I can see that drupal_entity_id is used for that and not drupal_long_id. So actually the Search API item corresponds to the entity, not the chunk (i.e. what gets recorded in search_api_item).

    It's not really the insert I am worried about. I'll explore whether enough info is stored for e.g. Pinecone/Weaviate to query for all chunks related to that entity ID in order to delete them. Hopefully they both have sufficient query methods to find and delete all chunks that match the entity ID.

    Then the drupal_long_id also won't matter in Pinecone (as long as it's unique).

  • 🇬🇧United Kingdom scott_euser

    Okay, it seems Pinecone's free tier cannot do Delete with a filter (only paid plans), but it can do List by ID prefix and then Delete by the actual IDs, so it should be doable without knowing the chunk IDs within Drupal. We just need to make sure that the prefix matches the locally stored index ID to be able to find and delete them. I think that's anyway what you are suggesting in the issue summary. It actually seems that is what SearchApiAiSearchBackend::deleteItems() is doing (so that is what Milvus is doing, so I guess Milvus has the same limitation, but I haven't checked).

    In any case, it should be fine.
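The "list by ID prefix, then delete by exact ID" pattern described above can be sketched as follows. The helper is illustrative only; a real provider would obtain the ID list from Pinecone's REST API rather than an in-memory array.

```php
<?php

// Illustrative sketch of the deletion pattern for stores without
// delete-by-filter: list all IDs sharing the entity's Search API Item ID
// as a prefix, then delete those exact IDs. Names are assumptions.
function chunk_ids_for_entity(array $all_ids, string $entity_id): array {
  // Chunk IDs are assumed to be "<entity id>:<delta>", so matching on the
  // "<entity id>:" prefix finds every chunk of that entity and nothing else.
  return array_values(array_filter(
    $all_ids,
    fn (string $id): bool => str_starts_with($id, $entity_id . ':'),
  ));
}

$ids = chunk_ids_for_entity(
  ['entity:node/42:en:0', 'entity:node/42:en:1', 'entity:node/7:en:0'],
  'entity:node/42:en',
);
// Only the two chunks of node 42 remain; these can then be deleted by ID.
```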

  • 🇬🇧United Kingdom scott_euser

    Updated issue summary with what specifically needs to be done. Will comment on Pinecone one to add that to the list of things to deal with.

  • Pipeline finished with Canceled
    23 days ago
    Total: 117s
    #268194
  • Pipeline finished with Failed
    23 days ago
    Total: 240s
    #268196
  • Pipeline finished with Success
    23 days ago
    Total: 248s
    #268203
  • Issue was unassigned.
  • Status changed to Needs review 18 days ago
  • 🇬🇧United Kingdom scott_euser

    Updated the code to enforce a consistent structure in each chunk. I am not 100% sure about the return phpdoc in case someone needs additional keys; I guess this would prevent that, but we can deal with it if/when the problem arises. I did also add a validation method to check the format of the ID for consistency, as I expect we'll have problems if we rely on it being ID + delta.

    I need to test this more thoroughly still so not ready for merging, but marking as Needs Review in case anyone else wants to check it as well.
