Dense Vector Search

Created on 31 October 2023
Updated 16 May 2024

Hi,
there is growing interest in having vector search (embeddings) in AI applications.
Solr 9 supports "Dense Vector Search".
What would it take to support vector search in this module?
Thanks

Feature request
Status

Active

Version

4.0

Component

Code

Created by

🇬🇷Greece pinkonomy


Comments & Activities

  • Issue created by @pinkonomy
  • 🇩🇪Germany mkalkbrenner

    It requires someone who contributes the required patches and tests.
    Or a sponsor, if I should work on it.

    https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-searc...
    https://sease.io/2022/01/apache-solr-neural-search.html

  • 🇨🇳China fishfree

    +1 please~

  • 🇺🇸United States eojthebrave Minneapolis, MN

    I’ve done some exploration of Solr’s dense vector fields as a possible way to do neural search, and things like RAG. And while I was able to get something working with Search API Solr, I think it’s not a great fit for Drupal until Solr (Lucene) implements some kind of multi-valued dense vector field.

    Right now a dense vector field can only store a single vector of up to 1024 dimensions. But in most neural search and RAG implementations that involve large bodies of text (like a Drupal node), it’s best practice to chunk your text into sentences or paragraphs and generate vectors for those chunks, rather than one vector for the entire article. As of Solr 9.1, if you chunk your node into sentences you would need either a field per sentence on the Solr document, or a Solr document per sentence. Both of these approaches seem … not awesome. Search API very much wants to create a Solr document per node.
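    The chunk-per-sentence idea can be sketched like this (a minimal, language-agnostic Python illustration of chunking with overlap, not part of any module; a real pipeline would use a proper sentence tokenizer rather than a regex):

```python
import re

def chunk_text(text, max_sentences=3, overlap=1):
    """Split text into overlapping sentence chunks, one embedding per chunk.

    Assumes overlap < max_sentences. The overlap keeps context that would
    otherwise be lost at chunk boundaries.
    """
    # Naive sentence split on terminal punctuation; real pipelines use NLP tokenizers.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    step = max_sentences - overlap
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(' '.join(sentences[i:i + max_sentences]))
        if i + max_sentences >= len(sentences):
            break
    return chunks

# Each returned chunk would then be sent to the embedding service separately.
print(chunk_text("A. B. C. D. E.", max_sentences=3, overlap=1))
```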

    There’s ongoing work to add multi-valued vector fields to Lucene, but it’s not ready yet - https://github.com/apache/lucene/issues/12313 - Once that’s done, however, I think this could work well.

    It’s potentially still useful today if your use case is vectorizing text that is a paragraph or less. Or things like creating an embedding for an image to make images searchable.

    I created this sandbox module as a proof of concept: https://www.drupal.org/sandbox/eojthebrave/3444194

    Here are my notes from trying to put together a POC, for future me or whoever else they might be useful for.

    • requires Solr 9+ which added dense vector field support
    • add the dense vector field type
      • must be configurable, since you need to be able to set the dimensions depending on the vector embedding implementation (e.g. 384 for sentence-transformers/all-MiniLM-L6-v2), and you probably also want to be able to configure the similarity type (cosine vs. dot_product).
    • at index time, convert a field’s text to vectors, and store in the field using a processor plugin similar to the current date range field implementation
      • clean, and chunk, text to send to the thing that transforms it to a vector (maybe chunking and cleaning should be part of the job of the thing that creates vectors?)
      • must be pluggable (openai v. hugging face v. word2vec, etc.)
      • must be configurable - e.g. what hugging face model to use, what chunk size, chunk overlap, etc.
      • This could be a new plugin type that creates embeddings
    • at query time use a !knn query
      • Right now the module supports lexical search via edismax, etc. But we need to allow for a K-Nearest Neighbors search q = {!knn ...}.
      • how does this interact with the normal dismax query?
      • do we implement a new parse_mode plugin? Basically, at some point we need to convert the search query string provided by the user (probably pretty much verbatim) into a vector.
        • This can’t be done in a contrib module, because \Drupal\search_api_solr\Utility\Utility::flattenKeys will throw an error when parse_mode_id isn’t in the hard-coded list of parse modes the module supports.
        • If there’s a new parse mode then SearchApiSolrBackend can use that to detect what kind of query to use.
        • It’s possible to combine the results of both lexical and vector searches into a sort of hybrid search - https://sease.io/2023/12/hybrid-search-with-apache-solr.html
        • Maybe modify \Drupal\search_api_solr\Plugin\search_api\backend\SearchApiSolrBackend::search and add an if/else that checks the parse_mode plugin used, and create a knn search for a new Vector parse_mode. Maybe something like what is currently done with More Like This queries would work? Tag a query as ‘knn’ and then add a getKnnQuery() method to SearchApiSolrBackend.
      • A contrib module could subscribe to PostConvertedQueryEvent events and do something like $solarium_query->setQuery($knn_query); but that’s inefficient, as it would just negate all the work the search_api_solr module already did to build the edismax query.
    • maybe add the choice of vector embedding plugin, and its configuration, at the index level, because it’s used both when indexing and when searching, and you have to use the same one for both. You need to use the same logic to create vectors for the index that you use to vectorize the search query if you want any decent results.
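    For reference, the `{!knn ...}` query mentioned in the notes above takes roughly this shape per the Solr 9 documentation (the field name `vector_384` is illustrative; the bracketed literal is the embedded search phrase, truncated here for brevity):

```text
q={!knn f=vector_384 topK=10}[0.012, -0.334, 0.051, ...]
```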
  • 🇺🇸United States DiegoPino

    hi @mkalkbrenner, our project has a need for this and I’m willing to give this a try, but to align with your roadmap I need some pointers.
    First, a bit of background: we already have tons of external (Drupal and non-Drupal) supporting code, and some good experience altering/acting on events in this wonderful module to use custom Solr types, custom data sources, JOINs, etc., e.g. the way we alter highlighting, allowing us to use fields that are driven by external Solr plugins that require different query arguments.

    1.- So, from the perspective of an actual implementation: first, we need to put the data in :)
    Because the dense vector types are pre-set with a fixed comparison algorithm and a fixed vector size per type, we are right now defining 4 types, with vector sizes of 384 (BERT/text embeddings), 512 (Apple Vision image fingerprint), 576 (YOLO embeddings) and 1024 (MobileNet embeddings). I believe a 384 one should be sufficient as part of a release, and anyone else could then extend this by providing their own.
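    One of these pre-set types could be sketched in a Solr 9 managed-schema roughly like this (type and field names are illustrative, not from the module):

```xml
<fieldType name="knn_vector_384" class="solr.DenseVectorField"
           vectorDimension="384" similarityFunction="cosine"/>
<field name="vector_384" type="knn_vector_384"
       indexed="true" stored="true" multiValued="false"/>
```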

    The first issue is the mismatch between cardinality and field generation. A vector, when passed from PHP to Solr, is an array (so multivalued, with a fixed size based on the field type config), but it always goes into a single-valued field in Solr (multiValued=false); the dynamic field generation in \Drupal\search_api_solr\Entity\SolrFieldType::getDynamicFields is blind to this need.
    The question is (or what would you suggest):
    - Add a new field type config setting, e.g. alongside $this->custom_code a $this->cardinality, allowing a field type to "ask" for no dynamic fields outside of what its type allows (in the case of a vector, of course, single-valued only). This could be useful for future types/other fields driven by custom Solr plugins that have that need. It could also be a full Solr field settings override, where a field type could "ask" to control completely via config how the field is generated.
    - OR a fixed method like getSpellcheckField() (e.g. getDenseVectorField()) that targets dense vectors specifically
    - OR an event that allows any external module to alter the dynamic fields (delegating the actual support and extra configs to anyone willing to write an event subscriber)

    Second issue: let’s say we now have a dynamic, single-valued field for one of these custom field types, and I want to setValue for the field.
    The data type at the PHP level will be an array (multivalued), mismatching the data type at the backend. So the question is:
    - Do we need a new @SearchApiDataType that allows a vector? Any other workarounds?

    I think the question of how one generates/populates the vectors, both at index time and at query time, is beyond a first implementation in this module. We, for example, have a Docker container that processes images and generates a custom datasource populated with this data (and NLP, hOCR). But that will vary a lot between users. Some might want to add this type of field as a processor.

    At query time:
    Our hack for custom queries has been to set edismax dynamically via a custom Views filter and add a custom option to the query (edismax because it is the current parser that alters the least/is the least opinionated of them all). Then we intercept everything in a PostConvertedQueryEvent subscriber, check if the given option was passed, and if so remove the edismax component from the Solarium query and add all our custom logic. This allowed us in the past to do subqueries, JOINs, etc. But for an official implementation, I wonder if having a custom parse plugin would be ideal. The only issue I see with that (and Views integration) is that it would have to interact with normal filters/facets but use them as a pre-filter in a !knn query. Solr also recommends three different options: pre-filtering, re-ranking, and a "must" compound query. And this custom parser makes no sense used in an exposed filter in a Views. Ideas?
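    For reference, those three query-time options look roughly like this as raw Solr request parameters (a sketch based on the Solr 9 reference guide and the Sease hybrid search article linked earlier; field names and filter values are illustrative, vectors truncated):

```text
# 1. Pre-filtering: with knn as the main query, fq clauses act as
#    pre-filters by default since Solr 9.1
q={!knn f=vector_384 topK=10}[0.012, -0.334, ...]
fq=ss_type:article

# 2. Re-ranking: run a lexical query, then re-score the top docs by vector
q={!edismax qf=tm_title}solr neural search
rq={!rerank reRankQuery=$rqq reRankDocs=50 reRankWeight=2}
rqq={!knn f=vector_384 topK=10}[0.012, -0.334, ...]

# 3. Compound query: combine lexical and vector clauses with the bool parser
#    (should/should for hybrid scoring; must/should for the "must" variant)
q={!bool should=$lexicalQuery should=$vectorQuery}
lexicalQuery={!edismax qf=tm_title}solr neural search
vectorQuery={!knn f=vector_384 topK=10}[0.012, -0.334, ...]
```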

    That is what I have so far. I think the issue is not really coding this (testing might be a challenge, but your current tests are excellent; most of what I have learned about this module comes from reading your tests) but knowing what is worth tapping into, and to what degree this module needs to cover everything itself or just provide the basics plus the flexibility to override some things.

    Thanks

  • 🇩🇪Germany mkalkbrenner

    Thanks for all these investigations. In order to be able to discuss the best approach, I need to dive into dense vector searches myself first.
    I already had a lot of comments in mind when reading your posts, but I want to avoid replying too quickly.

    I suggest focusing on producing the vectors first. How should we do that in Drupal? How do we leverage an external service?
    Maybe we can take https://www.drupal.org/project/search_api_clir as an example. It is able to index machine translations created by external services.

  • 🇺🇸United States DiegoPino

    @mkalkbrenner thanks for your quick reply.

    Our way of producing vectors (embedding extraction) is surely not the standard way. We have a chainable and configurable post-processor plugin system for our custom types of fields/data that runs as a set of "extractors", from OCR, to file transforms, to vectors; in this case they are pushed into a background processing queue, then injected into custom datasources. The number of moving parts is kind of huge, and it does not feel like the type of project you would want to mimic for this.

    But, going back to the idea of plugins. I believe that people (users and devs) using your module would be more comfortable with the existing Search API processor idea. Since indexing already happens (most of the time, at least) via cron or via drush, the overhead of calling an external service (well, in our case it is external to Drupal, but not external in the sense of a commercial API) would not be huge. I mean, we enqueue and have workers for everything, but that is a choice. So why an extra plugin on top of just a new processor?

    Because you want to reuse the "processing/remote API call -> return as vector" logic at query time too. A Views filter would need to be able to call the same logic used to index a certain vector, using the same API. Vectors are opinionated; a vector generated by X won’t make any sense in relation to one generated by Y. Also, vector dimension is key here: fixed, never variable. And lastly, depending on the comparison algorithm, you might want to provide a normalized unit vector so you can use the faster dot_product instead of cosine (which, again, is a fixed setting for the field type).
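    The dot_product shortcut mentioned here is easy to verify: once vectors are normalized to unit length, their dot product equals their cosine similarity, so the cheaper comparison can be used (a plain-Python sketch, independent of any module):

```python
import math

def normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
# For unit vectors the division by the norms is a no-op,
# so dot() of the normalized vectors equals cosine() of the originals.
assert abs(dot(normalize(a), normalize(b)) - cosine(a, b)) < 1e-12
```

This is why indexes built from pre-normalized vectors can be configured with the cheaper dot_product similarity without changing the ranking.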

    So, to sum up (my two cents): a processor (e.g. like the aggregated fields one, or the rendered entity one) that takes as config another type of very opinionated plugin. These plugins would have standard methods (but opinionated internal logic) to call APIs using an input (in this case the same one a normal processor would have) and return a vector (array), plus fixed annotations with vector size, etc. That way, devs can write their own plugins that talk to/understand/provide the needed logic, which will vary a LOT for each remote service, and can also plug the same logic (which needs to be available outside of the processor itself) in at query time to transform the input into a vector.

    I see what you are doing on search_api_clir and it is very interesting.

  • 🇬🇧United Kingdom scott_euser

    I like that idea of making it pluggable - I maintain the OpenAI Embeddings sub-module of OpenAI and would be happy to make a plugin there to implement one route to generating the vectors.

    The base module has a method here to do that: https://git.drupalcode.org/project/openai/-/blob/1.0.x/src/OpenAIApi.php... (which I am currently expanding to cover multiple numbers of dimensions and multiple text embedding models: "Support new embeddings models and dimensions (3-small, 3-large)", needs review).

    Worth noting that any change to how the vectors for search are generated then requires regeneration of all vectors in the index, e.g. changing which text embedding model is used. So such a plugin could have an interface method like getConfigurationHash() to let the plugin decide which attributes of its config should trigger a "needs reindexing" warning.
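    A getConfigurationHash() along these lines could simply hash the reindex-relevant settings; which keys count is up to each plugin (a Python sketch of the idea, with hypothetical config keys):

```python
import hashlib
import json

def configuration_hash(config, relevant_keys=("model", "dimensions", "chunk_size")):
    """Hash only the settings whose change invalidates existing vectors.

    The keys listed here are hypothetical; each embeddings plugin would
    declare which of its own settings should trigger a "needs reindexing"
    warning when the hash changes.
    """
    relevant = {k: config[k] for k in relevant_keys if k in config}
    payload = json.dumps(relevant, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Changing an irrelevant setting (the API key) keeps the hash stable,
# so no reindex warning is raised; changing the model would alter it.
old = configuration_hash({"model": "text-embedding-3-small", "dimensions": 1536, "api_key": "x"})
new = configuration_hash({"model": "text-embedding-3-small", "dimensions": 1536, "api_key": "y"})
assert old == new
```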

  • 🇬🇧United Kingdom scott_euser

    To note, there is also https://www.drupal.org/project/search_api_ai which handles the embedding Solr field type and vector generation via OpenAI. It stores via a Pinecone plugin backend. It could also be an option to leverage that, so you can focus on the storage only here. I would be happy to make a Solr plugin to connect to whatever gets done here in this issue, but I also appreciate that it then gets complicated for the site builder, having 2 non-required dependencies to set things up...

  • 🇩🇪Germany mkalkbrenner

    I think we should adopt the approach of https://www.drupal.org/project/tmgmt and https://www.drupal.org/project/search_api_clir .

    TMGMT offers a centralized component for translations and provides the plugin infrastructure. There are plugins for DeepL, Google Translate, etc.
    Search API CLIR leverages this service to get machine translations at index time.

    So there should be a centralized service (module) that can provide vectors for text fields. It should provide a plugin infrastructure to connect to different services (remote or locally).

    Search API Solr should then leverage that service to get the vectors.

    Having a vectors field on entities doesn't help as we need to build vectors from the search phrase, right?

    So it has to be a "real time" service with its own caching, somewhat similar to search_api_clir.

  • 🇬🇧United Kingdom scott_euser

    Yep, using the full rendered entity in a given view mode to get the vectors for an entity would be great. Part of the problem is chunking it up right so as not to lose too much. I’ve seen concepts where the item is first summarised to a shorter length and the vectors are then created from that, but for a very long node, for example, there starts to be a risk of information loss. Would what you are thinking of allow for creating multiple sets of vectors associated with a single entity?

    Definitely, vectors need to be created on demand from search terms. I am only saying they need to be created using the exact same method the index was created with, or the search fails (or at least it does with the Pinecone, Milvus, etc. backends).

    Whether or not you structure the vector creation as a separate module, the only thing I would say is that it would be a shame to ignore all the effort put into the existing range of modules available (see https://www.drupal.org/project/artificial_intelligence_initiative/issues... ). A plugin system would allow existing modules to be leveraged rather than duplicating efforts. FWIW, I don’t think you are necessarily disagreeing with that, but I wanted to put it out there for your consideration.
