Implement Milvus as a vector storage backend

Created on 23 November 2023, about 1 year ago
Updated 1 June 2024, 8 months ago

Problem/Motivation

We currently only use Pinecone as a backend vector store. However, it is proprietary, which may raise data security concerns, especially in the EU.

Steps to reproduce

Proposed resolution

Implement Milvus as an open source alternative: https://milvus.io/api-reference/restful/v2.2.x/About.md
(We could also look at a newer version of Apache Solr once it supports vectors of size 1536 instead of the current 1024.)

Remaining tasks

User interface changes

API changes

Data model changes

📌 Task
Status

Needs review

Version

1.0

Component

Code

Created by

🇬🇧United Kingdom scott_euser


Merge Requests

Comments & Activities

  • Issue created by @scott_euser
  • Status changed to Needs work about 1 year ago
  • 🇬🇧United Kingdom scott_euser

    Corresponding issue in OpenAI: https://www.drupal.org/project/openai/issues/3402579 📌 Create a Milvus.io plugin for openai_embeddings module Active
    Merge request https://git.drupalcode.org/issue/search_api_ai-3403577/-/tree/3403577-im... WIP but mostly there

  • Merge request !12Resolve #3403577 "Implement milvus as" → (Open) created by scott_euser
  • 🇬🇧United Kingdom scott_euser

    Updated to match https://www.drupal.org/project/openai/issues/3400915 📌 Update VectorClientInterface to allow for varying parameters Fixed , retested wiping index, re-indexing content, and verifying that the index has the expected content chunks.

  • Issue was unassigned.
  • Status changed to Needs review about 1 year ago
  • 🇬🇧United Kingdom andrewbelcher

    @scott_euser thanks for this!

    I think rather than swapping out Pinecone for Milvus as the only supported backend, it would be better to define an interface that backends can implement, and then make the places that use the index capable of working with any backend that implements it. This will also pave the way for SOLR to be added into the mix once it supports the appropriate vector depth.

    I think there is a wider issue of lots of modules that overlap. OpenAI now has a VectorClientInterface that overlaps a lot with what Search API AI is aiming to do. I think leveraging Search API to achieve the tracking/indexing makes more sense than OpenAI doing that itself. It's also agnostic to OpenAI itself, as you may use an alternative method for embedding.

    My suggestion would be that we:

    • Have a separate, lightweight library for the interfaces. This will make it easier for modules such as Search API SOLR to support AI without a full module dependency. This could be done at a later date when SOLR is ready if we don't want the additional overhead right now.
    • Abstract the relevant methods required for AI specific tasks into an interface. The Pinecone and Milvus backends would then implement that.
    • Where checking which indexes are possible to use, we check for the interface rather than the specific backend.
    • OpenAI could then make use of Search API AI to handle the vector repositories etc, and provide an embedding plugin for Search API AI to be configured with (i.e. #3361507 could again provide an interface, OpenAI can then have its implementation).

    I see this PR also fixes some compatibility issues with free Pinecone / OpenAI's updates. I believe those are the same that are in #3403561? I'll try and get that reviewed and merged in the next few days, that can keep this issue a bit cleaner!

  • 🇬🇧United Kingdom scott_euser

    Have a separate, lightweight library for the interfaces. This will make it easier for modules such as Search API SOLR to support AI without a full module dependency. This could be done at a later date when SOLR is ready if we don't want the additional overhead right now.

    Abstract the relevant methods required for AI specific tasks into an interface. The Pinecone and Milvus backends would then implement that.

    How about something like this to keep it simple at first:

    - search_api_ai
      - src
        - interface class SearchApiAiBackendInterface.php
        - abstract class SearchApiAiBackendBase.php implements interface

    Here the interface defines the required methods (used for the assert, for example) and the abstract class provides the shared code. Perhaps that is what you meant, though I could not see the benefit, at least for now, of putting the interface into a separate module.
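    A minimal sketch of how that interface and abstract class could fit together, and how a form could check for the interface rather than a specific backend. The method names (upsertVectors, queryVector, assertDimension), the InMemoryBackend stand-in, and the supportsAiSearch helper are all hypothetical and only there for illustration; they are not the module's actual API.

```php
<?php

// Hypothetical sketch: every method name here is an illustrative assumption.
interface SearchApiAiBackendInterface {

  /** Inserts or updates embeddings, keyed by chunk ID. */
  public function upsertVectors(array $vectors): void;

  /** Returns up to $limit matches, each as ['id' => string, 'score' => float]. */
  public function queryVector(array $embedding, int $limit): array;

}

abstract class SearchApiAiBackendBase implements SearchApiAiBackendInterface {

  /** Shared code: rejects embeddings with an unexpected number of dimensions. */
  protected function assertDimension(array $embedding, int $expected): void {
    if (count($embedding) !== $expected) {
      throw new InvalidArgumentException(
        sprintf('Expected %d dimensions, got %d.', $expected, count($embedding))
      );
    }
  }

}

// In-memory stand-in for a real Milvus or Pinecone backend.
final class InMemoryBackend extends SearchApiAiBackendBase {

  private array $vectors = [];

  public function __construct(private int $dimension) {}

  public function upsertVectors(array $vectors): void {
    foreach ($vectors as $id => $embedding) {
      $this->assertDimension($embedding, $this->dimension);
      $this->vectors[$id] = $embedding;
    }
  }

  public function queryVector(array $embedding, int $limit): array {
    $this->assertDimension($embedding, $this->dimension);
    $results = [];
    foreach ($this->vectors as $id => $stored) {
      // Dot product used as a simple similarity score.
      $score = 0.0;
      foreach ($stored as $i => $value) {
        $score += $value * $embedding[$i];
      }
      $results[] = ['id' => (string) $id, 'score' => $score];
    }
    // Highest score first.
    usort($results, fn(array $a, array $b): int => $b['score'] <=> $a['score']);
    return array_slice($results, 0, $limit);
  }

}

// Check for the interface rather than for a specific backend class.
function supportsAiSearch(object $backend): bool {
  return $backend instanceof SearchApiAiBackendInterface;
}
```

    With this shape, SearchForm.php would only need the instanceof check, and any new backend that extends the base class would pass it without further changes.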

    The interface and abstract class could perhaps be in a sub-module as well that milvus, pinecone, and others depend on. That would allow it to move to the external module if you do eventually want that, without slowing down initial development attempting to keep three unrelated modules in sync (since these sub-modules already depend on the OpenAI module).

    Where checking which indexes are possible to use, we check for the interface rather than the specific backend.

    I believe for this you are referring to SearchForm.php. I reverted the assert and changed it to check for either backend (though of course it needs a further update once we have an interface). I believe this relies on the following issue getting solved first, as Pinecone, Milvus, and others do not return a consistent result format:
    https://www.drupal.org/project/openai/issues/3404210 📌 Update return values of vector client plugins to return consistent results Active
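    To illustrate what a consistent result format could look like, here is a rough sketch that maps two response shapes onto one list of id/score rows. The response shapes are assumptions on my part (Pinecone returning a 'matches' list with 'score', Milvus REST returning a 'data' list with 'distance'), and normalizeVectorResults is a hypothetical helper, not something either module provides. Note that a Milvus distance may mean "lower is better" depending on the metric, so a real implementation would need to account for that.

```php
<?php

// Hypothetical helper. Assumed response shapes (not confirmed in this thread):
//   Pinecone:    ['matches' => [['id' => ..., 'score' => ...], ...]]
//   Milvus REST: ['data' => [['id' => ..., 'distance' => ...], ...]]
function normalizeVectorResults(string $client, array $response): array {
  $rows = match ($client) {
    'pinecone' => $response['matches'] ?? [],
    'milvus' => $response['data'] ?? [],
    default => throw new InvalidArgumentException("Unknown vector client: $client"),
  };
  $scoreKey = $client === 'pinecone' ? 'score' : 'distance';

  // Every row comes back in the same shape, whichever client produced it.
  return array_map(
    static fn(array $row): array => [
      'id' => (string) $row['id'],
      'score' => (float) $row[$scoreKey],
    ],
    $rows
  );
}
```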

    I see this PR also fixes some compatibility issues with free Pinecone / OpenAI's updates. I believe those are the same that are in #3403561? I'll try and get that reviewed and merged in the next few days, that can keep this issue a bit cleaner!

    Apologies for that! I separated this out fully so it can be reviewed/tackled independently. Whichever you approach first I can then merge any eventual changes to HEAD into the other merge request.

    I think there is a wider issue of lots of modules that overlap. OpenAI now has a VectorClientInterface that overlaps a lot with what Search API AI is aiming to do. I think leveraging Search API to achieve the tracking/indexing makes more sense than OpenAI doing that itself. It's also agnostic to OpenAI itself, as you may use an alternative method for embedding.

    VectorClientInterface did actually already exist, but Kevin said he hit a blocker, so he added the HttpClient for Pinecone straight away with the intent to move it to the interface later. The interface was left as a skeleton, and the Pinecone plugin implementing it did not actually work. That is now fixed, but point taken that with it sitting within the OpenAI module, the Pinecone and Milvus clients require you to enable the OpenAI module. Similarly, moving it here would mean that using the Pinecone and Milvus clients requires Search API (and, at the moment, also the OpenAI module, given it's also a dependency), i.e. not really better off, just shifting the problem over. So the eventual right scenario would probably be something like this:

    • Vector Databases module: Provides interfaces/abstract classes to handle shared functionality unrelated to Search API and OpenAI
    • OpenAI Embeddings module: Implements interface/abstract classes to use OpenAI as the generator of the embedding
    • Milvus module depends on Vector Database module only: Provides Milvus http client extending php wider community class in a Drupal way
    • Pinecone module depends on Vector Database module only: Provides Pinecone http client extending php wider community class in a Drupal way
    • Search API AI: Provides interface abstract classes for connecting Search API to Vector Database
    • Search API Milvus: Provides connection between Milvus and Search API AI
    • Search API Pinecone: Provides connection between Pinecone and Search API AI

    Very rough, but essentially not locking in a site to OpenAI, Search API, or any specific Vector Database.

    But given where we are now with Search API AI depending on OpenAI, and Search API AI Pinecone sub-module using OpenAI module's Pinecone http client, 2 modules working together at least means more rapid progress. We could certainly speak to Kevin about whether he's open to not having OpenAI Embeddings module queue items and removing the queue worker, forcing the OpenAI Embedding module to rely on Search API as the mechanism for processing entities.

  • 🇬🇧United Kingdom andrewbelcher

    @scott_euser apologies for the silence. Could you take a look at 📌 Add decouple Milvus support Needs review ? We've put effort into pluggable backends and that issue adds Milvus support.

  • 🇭🇺Hungary asrob Hungary 🇭🇺 🇪🇺

    @scott_euser @andrewbelcher,

    I've installed search_api_ai 1.0.x-dev successfully, added a Milvus backend (using Zilliz's free instance) and indexed items. So, it seems it works well, however I've encountered a bug. ( https://www.drupal.org/project/search_api_ai/issues/3451719 🐛 Could not load indexed items using Milvus Active )
