Implement Milvus as a vector storage backend

Created on 23 November 2023
Updated 1 June 2024

Problem/Motivation

We currently only use Pinecone as a backend vector store. But it is proprietary, and for data-security reasons it may be a problem, especially in the EU.

Steps to reproduce

Proposed resolution

Implement Milvus as an open source alternative: https://milvus.io/api-reference/restful/v2.2.x/About.md
(Could also look at a newer version of Apache Solr when it supports vectors of size 1536 instead of the current 1024.)

Remaining tasks

User interface changes

API changes

Data model changes

📌 Task
Status

Needs review

Version

1.0

Component

Code

Created by

🇬🇧United Kingdom scott_euser


Merge Requests

Comments & Activities

  • Issue created by @scott_euser
  • Status changed to Needs work 12 months ago
  • 🇬🇧United Kingdom scott_euser

    Corresponding issue in OpenAI: https://www.drupal.org/project/openai/issues/3402579 📌 Create a Milvus.io plugin for openai_embeddings module Active
    Merge request https://git.drupalcode.org/issue/search_api_ai-3403577/-/tree/3403577-im... WIP but mostly there

  • Merge request !12 Resolve #3403577 "Implement milvus as" → (Open) created by scott_euser
  • 🇬🇧United Kingdom scott_euser

    Updated to match https://www.drupal.org/project/openai/issues/3400915 📌 Update VectorClientInterface to allow for varying parameters Fixed; retested wiping the index, re-indexing content, and verifying that the index has the expected content chunks.

  • Issue was unassigned.
  • Status changed to Needs review 12 months ago
  • 🇬🇧United Kingdom andrewbelcher

    @scott_euser thanks for this!

    I think rather than swapping out Pinecone for Milvus as the only supported backend, it would be better to use an interface that can be implemented, and then make the places that use the index capable of working with any backend that implements the interface. This will also pave the way for Solr to be added into the mix once it supports the appropriate vector depth.

    I think there is a wider issue of lots of modules that overlap. OpenAI now has a VectorClientInterface that overlaps a lot with what Search API AI is aiming to do. I think leveraging Search API to achieve the tracking/indexing makes more sense than OpenAI doing that itself. It's also agnostic to OpenAI itself, as you may use an alternative method for embedding.

    My suggestion would be that we:

    • Have a separate, lightweight library for the interfaces. This will make it easier for modules such as Search API SOLR to support AI without a full module dependency. This could be done at a later date when SOLR is ready if we don't want the additional overhead right now.
    • Abstract the relevant methods required for AI specific tasks into an interface. The Pinecone and Milvus backends would then implement that.
    • Where checking which indexes are possible to use, we check for the interface rather than the specific backend.
    • OpenAI could then make use of Search API AI to handle the vector repositories etc., and provide an embedding plugin for Search API AI to be configured with (i.e. #3361507 could again provide an interface, and OpenAI can then have its implementation).

    I see this PR also fixes some compatibility issues with free Pinecone / OpenAI's updates. I believe those are the same as the ones in #3403561? I'll try and get that reviewed and merged in the next few days; that can keep this issue a bit cleaner!

  • 🇬🇧United Kingdom scott_euser

    Have a separate, lightweight library for the interfaces. This will make it easier for modules such as Search API SOLR to support AI without a full module dependency. This could be done at a later date when SOLR is ready if we don't want the additional overhead right now.

    Abstract the relevant methods required for AI specific tasks into an interface. The Pinecone and Milvus backends would then implement that.

    How about something like this to keep it simple at first:

    - search_api_ai
      - src
        - SearchApiAiBackendInterface.php (interface)
        - SearchApiAiBackendBase.php (abstract class implementing the interface)

    Here the interface defines the required methods (used for the assertions, for example) and the abstract class provides the shared code. Perhaps that is what you meant, though for now I could not see the benefit of putting the interface into a separate module.
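
    As a rough sketch of that pairing (all names here are illustrative placeholders, not the actual Search API AI API), the interface would define the backend contract and the abstract base would hold shared code such as configuration handling:

    ```php
    <?php

    // Hypothetical sketch only: interface, class, and method names are
    // placeholders for discussion, not the final module API.
    interface SearchApiAiBackendInterface {

      /**
       * Upserts embedding vectors for the given item chunks.
       */
      public function upsertVectors(string $index_id, array $vectors): void;

      /**
       * Queries the backend for the nearest vectors to the given one.
       */
      public function queryVectors(string $index_id, array $vector, int $top_k = 10): array;

      /**
       * Deletes all vectors for an index (e.g. when wiping/re-indexing).
       */
      public function deleteIndex(string $index_id): void;

    }

    // Shared code lives in the base class; Pinecone and Milvus backends
    // would extend it and implement only the transport-specific details.
    abstract class SearchApiAiBackendBase implements SearchApiAiBackendInterface {

      public function __construct(protected array $configuration) {}

    }
    ```

    Consumers would then type-hint against SearchApiAiBackendInterface and never care whether the concrete backend speaks to Pinecone, Milvus, or something else.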

    The interface and abstract class could perhaps live in a sub-module that Milvus, Pinecone, and others depend on. That would allow it to move to an external module if you do eventually want that, without slowing down initial development by attempting to keep three unrelated modules in sync (since these sub-modules already depend on the OpenAI module).

    Where checking which indexes are possible to use, we check for the interface rather than the specific backend.

    I believe for this you are referring to SearchForm.php. I reverted the assert and changed it to check for either (though of course it needs a further update once we have an interface). I believe this relies on the following issue getting solved first, as Pinecone, Milvus, and others do not return a consistent result format:
    https://www.drupal.org/project/openai/issues/3404210 📌 Update return values of vector client plugins to return consistent results Active
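
    The idea of checking for the interface rather than a concrete backend can be sketched like this (self-contained illustration; the interface name is assumed from the proposal above, and the backend classes are stand-ins):

    ```php
    <?php

    // Illustrative only: names are assumed, not the module's actual classes.
    interface SearchApiAiBackendInterface {}

    class PineconeBackend implements SearchApiAiBackendInterface {}
    class MilvusBackend implements SearchApiAiBackendInterface {}
    class DatabaseBackend {}

    // Keep only the backends that implement the AI interface, instead of
    // hard-coding an assert against one specific backend class.
    $backends = [new PineconeBackend(), new MilvusBackend(), new DatabaseBackend()];
    $usable = array_filter($backends, fn ($b) => $b instanceof SearchApiAiBackendInterface);

    echo count($usable); // 2
    ```

    With that shape, adding a new vector backend requires no change to the form code at all, only a new class implementing the interface.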

    I see this PR also fixes some compatibility issues with free Pinecode / OpenAI's updates. I believe those are the same that are in #3403561? I'll try and get that reviewed and merged in the next few days, that can keep this issue a bit cleaner!

    Apologies for that! I separated this out fully so it can be reviewed/tackled independently. Whichever you approach first I can then merge any eventual changes to HEAD into the other merge request.

    I think there is a wider issue of lots of modules that overlap. OpenAI now has a VectorClientInterface that overlaps a lot with what Search API AI is aiming to do. I think leveraging Search API to achieve the tracking/indexing makes more sense than OpenAI doing that itself. It's also agnostic to OpenAI itself, as you may use an alternative method for embedding.

    VectorClientInterface did actually already exist, but Kevin said he hit a blocker, so he added the HttpClient for Pinecone straight away with the intent to move it behind the interface; the interface was left as a skeleton, with the Pinecone plugin nominally implementing it but not actually working. That is now fixed, but point taken that, sitting within the OpenAI module, it means the Pinecone and Milvus clients require you to enable the OpenAI module. Similarly, moving it here means using the Pinecone and Milvus clients would require Search API (and, at the moment, also the OpenAI module, given it is also a dependency; i.e. not really better off, just shifting the problem over). So the eventual right scenario would probably be something like this:

    • Vector Databases module: Provides interfaces/abstract classes to handle shared functionality unrelated to Search API and OpenAI
    • OpenAI Embeddings module: Implements the interfaces/abstract classes to use OpenAI as the generator of the embeddings
    • Milvus module (depends on the Vector Databases module only): Provides a Milvus HTTP client, extending the wider PHP community's client class in a Drupal way
    • Pinecone module (depends on the Vector Databases module only): Provides a Pinecone HTTP client, extending the wider PHP community's client class in a Drupal way
    • Search API AI: Provides interfaces/abstract classes for connecting Search API to a vector database
    • Search API Milvus: Provides the connection between Milvus and Search API AI
    • Search API Pinecone: Provides the connection between Pinecone and Search API AI

    Very rough, but essentially not locking in a site to OpenAI, Search API, or any specific Vector Database.

    But given where we are now, with Search API AI depending on OpenAI, and the Search API AI Pinecone sub-module using the OpenAI module's Pinecone HTTP client, two modules working together at least means more rapid progress. We could certainly speak to Kevin about whether he's open to the OpenAI Embeddings module no longer queueing items and removing the queue worker, making the OpenAI Embeddings module rely on Search API as the mechanism for processing entities.

  • 🇬🇧United Kingdom andrewbelcher

    @scott_euser apologies for the silence. Could you take a look at 📌 Add decouple Milvus support (Needs review)? We've put effort into pluggable backends, and that issue adds Milvus support.

  • 🇭🇺Hungary asrob

    @scott_euser @andrewbelcher,

    I've installed search_api_ai 1.0.x-dev successfully, added a Milvus backend (using Zilliz's free instance), and indexed items. So it seems to work well; however, I've encountered a bug. ( https://www.drupal.org/project/search_api_ai/issues/3451719 🐛 Could not load indexed items using Milvus Active )
