Created on 31 October 2023, about 1 year ago

Hi,
there is growing interest in using vector search (embeddings) in AI applications.
Solr 9 supports "Dense Vector search".
What would be needed to support vector search in this module?
Thanks

Feature request
Status

Active

Version

4.3

Component

Code

Created by

🇬🇷Greece pinkonomy


Comments & Activities

  • Issue created by @pinkonomy
  • 🇩🇪Germany mkalkbrenner 🇩🇪

    It requires someone who contributes the required patches and tests.
    Or a sponsor, if I should work on it.

    https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-searc...
    https://sease.io/2022/01/apache-solr-neural-search.html

  • 🇺🇸United States eojthebrave Minneapolis, MN

    I’ve done some exploration of Solr’s dense vector fields as a possible way to do neural searching, and things like RAG. And while I was able to get something working with Search API Solr, I think that it’s not a great fit for Drupal until Solr (Lucene) implements some kind of multi-value dense vector fields.

    Right now a dense vector field can only store a single vector of up to 1024 dimensions. But in most neural search and RAG implementations that involve large bodies of text (like a Drupal node), it’s best practice to chunk your text into sentences or paragraphs and generate vectors for those chunks rather than one vector for the entire article. As of Solr 9.1, if you chunk your node into sentences you would need either a field per sentence on the Solr document, or a Solr document per sentence. Both of these approaches seem … not awesome. Search API very much wants to create one Solr document per node.
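
    To make the "one Solr document per chunk" option concrete, here is a minimal sketch. It assumes a naive regex sentence splitter, and the `parent_id` and `chunk_text` field names are hypothetical, not something Search API Solr defines.

```python
import re
from typing import Dict, List


def chunk_sentences(text: str, max_sentences: int = 2, overlap: int = 1) -> List[str]:
    """Group sentences into chunks that overlap by `overlap` sentences."""
    if overlap >= max_sentences:
        raise ValueError("overlap must be smaller than max_sentences")
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max_sentences - overlap
    chunks: List[str] = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # the last sentence is already covered
    return chunks


def chunk_documents(node_id: str, text: str) -> List[Dict[str, str]]:
    """Build one document per chunk; each would later get its own vector."""
    return [
        # Hypothetical field names, for illustration only.
        {"id": f"{node_id}-chunk-{i}", "parent_id": node_id, "chunk_text": chunk}
        for i, chunk in enumerate(chunk_sentences(text))
    ]
```

    Each chunk document keeps a pointer back to the node, which is exactly the part that clashes with Search API's one-document-per-node model.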

    There’s ongoing work to add the multi-value vector fields to Lucene but it’s not ready yet - https://github.com/apache/lucene/issues/12313 - Once that’s done however I think this could work well.

    It’s potentially still useful today if your use case is vectorizing text that is a paragraph or less. Or things like creating an embedding for an image to make images searchable.

    I created this sandbox module as a proof of concept: https://www.drupal.org/sandbox/eojthebrave/3444194

    Here are my notes from trying to put together a POC, for future me or whoever might find them useful.

    • requires Solr 9+ which added dense vector field support
    • add the dense vector field type
      • must be configurable since you need to be able to set the dimensions depending on the vector embedding implementation (e.g. 384 for sentence-transformers/all-MiniLM-L6-v2), and you probably also want to be able to configure similarity type cosine vs. dot_product.
    • at index time, convert a field’s text to vectors, and store in the field using a processor plugin similar to the current date range field implementation
      • clean, and chunk, text to send to the thing that transforms it to a vector (maybe chunking and cleaning should be part of the job of the thing that creates vectors?)
      • must be pluggable (openai v. hugging face v. word2vec, etc.)
      • must be configurable - e.g. what hugging face model to use, what chunk size, chunk overlap, etc.
      • This could be a new plugin type that creates embeddings
    • at query time use a !knn query
      • Right now the module supports lexical search via edismax, etc. But we need to allow for a K-Nearest Neighbors search q = {!knn ...}.
      • how does this interact with the normal dismax query?
      • do we implement a new parse_mode plugin? Basically, at some point we need to convert the search query string provided by the user (probably pretty much verbatim) into a vector.
        • This can’t be done in a contrib module because \Drupal\search_api_solr\Utility\Utility::flattenKeys will throw an error when parse_mode_id isn’t in the hard-coded list of parse modes the module supports.
        • If there’s a new parse mode then SearchApiSolrBackend can use that to detect what kind of query to use.
        • It’s possible to combine the results of both lexical and vector searches into a sort of hybrid search - https://sease.io/2023/12/hybrid-search-with-apache-solr.html
        • Maybe modify \Drupal\search_api_solr\Plugin\search_api\backend\SearchApiSolrBackend::search and add in an if/else that checks the parse_mode plugin used and creates a knn search for a new Vector parse_mode. Maybe something like is currently done with more like this queries would work? Tag a query as ‘knn’ and then add a getKnnQuery() method to SearchApiSolrBackend
      • A contrib module could subscribe to PostConvertedQueryEvent events and do something like $solarium_query->setQuery($knn_query); But that’s inefficient as it would just negate all the work the search_api_solr module did to build the edismax query already.
    • maybe add the choice of vector embedding creation plugin and configuration for it at the index level because it’s used both when indexing, and when searching, and you have to use the same one for both. You need to use the same logic to create vectors for the index that you use to vectorize the search query if you want any decent results.
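
    The last note (use the same embedding logic at index time and at query time) can be sketched roughly like this. Everything here is an assumption for illustration: `EmbeddingProvider`, `ToyHashEmbedder` and the helper functions are hypothetical names, and the toy embedder only stands in for a real model.

```python
from typing import Dict, List, Protocol


class EmbeddingProvider(Protocol):
    """Hypothetical plugin contract shared by indexing and searching."""
    dimensions: int

    def embed(self, text: str) -> List[float]: ...


class ToyHashEmbedder:
    """Deterministic stand-in for a real model (OpenAI, Hugging Face, ...)."""
    dimensions = 4

    def embed(self, text: str) -> List[float]:
        vector = [0.0] * self.dimensions
        for word in text.lower().split():
            # Bucket each word by the sum of its character codes.
            vector[sum(ord(c) for c in word) % self.dimensions] += 1.0
        return vector


def knn_query(field: str, vector: List[float], top_k: int = 10) -> str:
    """Format a {!knn ...} query string for a dense vector field."""
    values = ", ".join(repr(float(v)) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


def build_index_doc(provider: EmbeddingProvider, doc_id: str, text: str, field: str) -> Dict:
    """Index time: store the vector produced by the configured provider."""
    return {"id": doc_id, field: provider.embed(text)}


def build_search_query(provider: EmbeddingProvider, user_input: str, field: str) -> str:
    """Query time: vectorize the user's input with the same provider."""
    return knn_query(field, provider.embed(user_input), top_k=10)
```

    Because both paths go through one provider instance, configuring it at the index level keeps the index and the query vectors comparable.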
  • 🇺🇸United States diegopino

    hi @mkalkbrenner, our project has a need for this and I'm willing to give this a try but, to align with your roadmap I need some pointers.
    First a bit of background. We already have tons of external (Drupal and non-Drupal) supporting code and some good experience altering/acting on events in this wonderful module to use custom Solr types, custom data sources, JOINs, etc. For example, the way we alter highlighting allows us to use fields that are driven by external Solr plugins that require different query arguments.

    1.- So, from the perspective of actual implementation, First we need to put the data in :)
    Because the Dense Vector Types are pre-set with a fixed comparison algorithm and a fixed vector Size per type we are right now defining 4 types with vector sizes of 384(Bert/Text embeddings), 512 (Apple Vision Image Fingerprint), 576 (Yolo Embeddings) and 1024 (mobileNet Embeddings). I believe as part of a release a 384 one should be sufficient and anyone else could then extend providing their own.

    The first issue is the mismatch between cardinality and field generation. A vector, when passed from PHP to Solr, is an array (so multivalued, with a fixed size based on the field type config), but it always goes into a single-valued field in Solr (multiValued=false). The dynamic field generation in \Drupal\search_api_solr\Entity\SolrFieldType::getDynamicFields is blind to this need.
    Question is (or what would you suggest)
    - Add a new Field type Config setting e.g like $this->custom_code $this->cardinality, allowing a Field Type to "ask" for no Dynamic Fields outside of what its type allows (in the case of a Vector of course Single Valued only). This could be useful for future types/other fields driven by custom solr plugins that have that need. Could be also directly a full Solr field settings override. Where a Field Type could "ask" for handling how the field is generated completely via a config
    - OR a fixed method like getSpellcheckField() (e.g getDenseVectorField() that targets specifically Dense Vectors)
    - OR an event that allows any external module to alter the dynamics fields (delegating the actual support and extra configs to anyone willing to write an event subscriber)

    Second issue: Let's say we have now a dynamic, single valued field for one of these custom field type. And I want to setValue for the field.
    The datatype at the PHP level will be an array (multivalued), mismatching the data type at the backend. So question is
    - Do we need a new @SearchApiDataType ? that allows a Vector. Any other work arounds?
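
    To illustrate the mismatch, here is a small sketch of how a vector could be matched to a fixed-dimension field type before it is sent as one single-valued field. The dimensions are the four mentioned above; the field type names are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical field type names; only the dimensions come from this comment.
VECTOR_FIELD_TYPES: Dict[str, int] = {
    "knn_vector_384": 384,
    "knn_vector_512": 512,
    "knn_vector_576": 576,
    "knn_vector_1024": 1024,
}


def field_type_for(vector: Sequence[float]) -> str:
    """Pick the field type whose fixed dimension matches the vector length."""
    for name, dimensions in VECTOR_FIELD_TYPES.items():
        if dimensions == len(vector):
            return name
    raise ValueError(f"no dense vector field type for {len(vector)} dimensions")


def to_solr_value(vector: Sequence[float]) -> List[float]:
    """The array is the value of ONE single-valued field, not many values."""
    return [float(v) for v in vector]
```

    The point is that the array is a single value from Solr's perspective, so a multivalued dynamic field would be the wrong target.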

    I think how one generates/populates the vectors, both at index time and at query time, is beyond a first implementation in this module. We, for example, have a Docker container that processes images and generates a custom datasource populated with this data (and NLP, HOCR). But that will vary a lot between users. Some might want to add this type of field via a processor.

    At query time:
    Our hack for custom queries has been to set EDISMAX dynamically via a custom Views filter and add a custom option to the query. We use EDISMAX because it is the least opinionated of the current parsers and alters the query the least. We then intercept everything in a PostConvertedQueryEvent subscriber and check whether the option was passed; if so, we remove the edismax component from the Solarium query and add all our custom logic. This has allowed us to do subqueries, JOINs, etc. in the past. But for an official implementation, I wonder if a custom parse mode plugin would be ideal. The only issue I see with that (and with Views integration) is that it would have to interact with normal filters/facets but use them as pre-filters in a !knn query. Solr also recommends three different options: pre-filtering, re-ranking, and a "must" compound query. And this custom parser makes no sense in an exposed filter in a View. Ideas?

    That is what I have so far. I think the issue is not really coding this (testing might be a challenge but then your current tests are excellent, most of what I have learned from this module is reading your tests) but knowing what is worth tapping in, to what degree this module needs to cover all, or just allow the flexibility to override some things and provide the basics.

    Thanks

  • 🇩🇪Germany mkalkbrenner 🇩🇪

    Thanks for all these investigations. In order to be able to discuss the best approach, I need to dive into dense vector searches myself first.
    I already had a lot of comments in mind when reading your posts, but I want to avoid replying too quickly.

    I suggest focusing on producing the vectors first. How should we do that in Drupal? How do we leverage an external service?
    Maybe we can take https://www.drupal.org/project/search_api_clir as an example. It is able to index machine translations created by external services.

  • 🇺🇸United States diegopino

    @mkalkbrenner thanks for your quick reply.

    Our way of producing vectors (embedding extraction) is for sure not the standard way. We have a chainable and configurable post-processor plugin system for our custom types of fields/data that runs as a set of "extractors" (from OCR, to file transforms, to vectors in this case) whose work is pushed into a background processing queue and then injected into custom datasources. The number of moving parts is kind of huge, and it does not feel like the type of project you would want to mimic for this.

    But, going back to the idea of plugins: I believe that people (users and devs) using your module would be more comfortable with the existing Search API processor idea. Since indexing already happens (most of the time at least) via cron or via drush, the overhead of calling an external service (in our case it is external to Drupal, but not external in the sense of a commercial API) would not be huge. I mean, we enqueue and have workers for everything, but that is a choice. Why an extra plugin on top of just a new processor?

    Because you want to reuse the "processing/remote API call -> return as vector" logic at query time too. So a Views filter would need to be able to call the same logic, through the same API, that was used to index a given vector. Vectors are opinionated: a vector generated by X won't make any sense in relation to one generated by Y. Vector dimension is also key: it is fixed, never variable. Lastly, depending on the comparison algorithm, you might want to provide a normalized unit vector so you can use the faster dot_product instead of cosine (which again is a fixed setting on the field type).
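
    As a quick sanity check of the unit-vector point, here is a plain-Python sketch (no Solr involved) showing that the dot product of two normalized vectors equals their cosine similarity. The function names are just for illustration.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vector]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed on the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization the cheaper dot product gives the same score.
same_score = math.isclose(dot(normalize(a), normalize(b)), cosine(a, b))
```

    So if the embedding plugin normalizes both at index time and at query time, a field type configured with dot_product should rank the same way cosine would.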

    So, to sum up (my 25 cents): a post processor (e.g. like the aggregated field one, or the entity renderer one) that takes another, very opinionated type of plugin as config. These plugins would have standard methods (but opinionated internal logic) to call APIs with an input (the same one a normal processor would have) and return a vector (array), plus fixed annotations with vector size, etc. That way devs can write their own plugins that provide the needed logic, which will vary a LOT for each remote service, and reuse the same logic (which needs to be available outside of the processor itself) at query time to transform the input into a vector.

    I see what you are doing on search_api_clir and it is very interesting.

  • 🇬🇧United Kingdom scott_euser

    I like that idea of pluggable - I maintain the OpenAI Embeddings sub-module of OpenAI and would be happy to make a Plugin there to implement one route to generating the vectors.

    The base module has a method here to do that https://git.drupalcode.org/project/openai/-/blob/1.0.x/src/OpenAIApi.php... (which I am currently expanding to cover multiple numbers of dimensions and multiple text embedding models; see the issue "Support new embeddings models and dimensions (3-small, 3-large)", status: Needs review).

    Worth noting that any change to generating the vectors for search then requires regeneration of all vectors in the index. E.g. Change which text embedding model is used. So such a Plugin could have an interface method like getConfigurationHash() to let the Plugin decide which attributes of its config should trigger a needs reindexing warning.
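
    A rough sketch of what such a hash could look like, written in Python rather than PHP for brevity. Only the `getConfigurationHash()` idea comes from this comment; the function names, config keys and values below are assumptions.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def configuration_hash(config: Dict[str, Any], relevant_keys: Iterable[str]) -> str:
    """Hash only the settings that change the generated vectors."""
    relevant = {key: config.get(key) for key in sorted(relevant_keys)}
    # Canonical JSON so the same settings always give the same hash.
    payload = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_reindex(stored_hash: str, config: Dict[str, Any], relevant_keys: Iterable[str]) -> bool:
    """True when the stored hash no longer matches the current config."""
    return stored_hash != configuration_hash(config, relevant_keys)


# Example values only; the plugin decides which keys are relevant.
RELEVANT_KEYS = ("model", "dimensions")
config = {"model": "text-embedding-3-small", "dimensions": 1536, "api_timeout": 30}
stored = configuration_hash(config, RELEVANT_KEYS)
```

    With this shape, changing `api_timeout` would not trigger a warning, while switching the model or the dimensions would.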

  • 🇬🇧United Kingdom scott_euser

    To note, there is also https://www.drupal.org/project/search_api_ai which handles the Embedding solr field type and vector generation via OpenAI. It stores via a Pinecone plugin backend. It could also be an option to leverage that so you can focus on the storage only here. I would be happy to make a SOLR plugin to connect to whatever gets done here in this issue, but I also appreciate that it then gets complicated for the site builder, having 2 non-required dependencies to set things up...

  • 🇩🇪Germany mkalkbrenner 🇩🇪

    I think we should adopt the approach of https://www.drupal.org/project/tmgmt and https://www.drupal.org/project/search_api_clir .

    TMGMT offers a centralized component for translations and provides the plugin infrastructure. There're plugins for DeepL, Google Translate, etc ...
    Search API CLIR leverages this service to get machine translations at index time.

    So there should be a centralized service (module) that can provide vectors for text fields. It should provide a plugin infrastructure to connect to different services (remote or locally).

    Search API Solr should then leverage that service to get the vectors.

    Having a vectors field on entities doesn't help as we need to build vectors from the search phrase, right?

    So it has to be a "real time" service with its own caching, somewhat similar to search_api_clir.
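
    A minimal sketch of what "real time with its own caching" could mean, assuming an in-memory cache keyed by model and text. `CachedEmbedder` is a hypothetical name, and a real service would presumably use a persistent cache backend instead of a dict.

```python
import hashlib
from typing import Callable, Dict, List


class CachedEmbedder:
    """Wraps any embedding callable and remembers results per model + text."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model_id: str):
        self._embed_fn = embed_fn
        self._model_id = model_id
        self._cache: Dict[str, List[float]] = {}
        self.misses = 0  # number of real (expensive) embedding calls

    def _key(self, text: str) -> str:
        # Include the model id: vectors from different models never mix.
        return hashlib.sha256(f"{self._model_id}\n{text}".encode("utf-8")).hexdigest()

    def embed(self, text: str) -> List[float]:
        normalized = " ".join(text.split())  # collapse whitespace only
        key = self._key(normalized)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(normalized)
        return list(self._cache[key])  # copy so callers cannot mutate the cache
```

    The same wrapper would serve both the indexing processor and the search phrase, so repeated searches for the same phrase would only hit the external service once.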

  • 🇬🇧United Kingdom scott_euser

    Yep, using the full rendered entity in a given view mode to get the vectors for an entity would be great. Part of the problem is chunking it up right so as not to lose too much. I've seen concepts where the item is first summarised to a shorter length and the vectors are then created from the summary, but for a very long item (e.g. a node) there starts to be a risk of info loss. Would what you are thinking of allow for creating multiple sets of vectors associated with a single entity?

    Definitely vectors need to be created on demand from search terms. I am only saying they need to be created using the exact same method as the index is created with or the search fails (or at least it does with pinecone, milvus, etc backends).

    As for whether to structure the vector creation as a separate module, the only thing I would say is that it would be a shame to ignore all the effort put into the existing range of modules available (see https://www.drupal.org/project/artificial_intelligence_initiative/issues... ). A plugin system would allow existing modules to be leveraged rather than duplicating efforts. FWIW I don't think you are necessarily disagreeing with that, but I wanted to put it out there for your consideration.

  • 🇩🇪Germany opensolr

    We also did research on this, and I agree that Solr itself should implement a way to generate vectors at both indexing and query time.
    I think Solr will have a filter available for these tasks soon, perhaps with Solr version 10.

    At this point, we'd have to create the vectors (embeddings) at indexing and query time, in order to have a similarity-type AI search.

    At Opensolr, we have a built-in Web Crawler, in which we implemented a Python script that uses BERT to generate vectors based on the document's (page's) title, description and a chunk of the document's text, and then inserts those vectors into Solr.

    Here's the python script we used:

    import argparse
    import torch
    from transformers import BertModel, BertTokenizer
    
    def generate_embedding(text, model, tokenizer):
        # Tokenize input text and convert to tensor
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
        
        # Generate embeddings using BERT
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Use the embeddings from the [CLS] token
        embeddings = outputs.last_hidden_state[:, 0, :].squeeze().tolist()
        
        # bert-base-uncased yields 768-dimensional vectors, which matches
        # vectorDimension="768" in the schema; no truncation is applied here.
        return embeddings
    
    def main(input_file, output_file):
        # Load pre-trained BERT model and tokenizer from Hugging Face
        model_name = "bert-base-uncased"
        tokenizer = BertTokenizer.from_pretrained(model_name)
        model = BertModel.from_pretrained(model_name)
    
        # Read the input file
        with open(input_file, 'r') as file:
            lines = file.readlines()
    
        # Prepare list to store embeddings
        embeddings_list = []
    
        # Generate embeddings for each line in the file
        for line in lines:
            line = line.strip()
            if line:
                embedding = generate_embedding(line, model, tokenizer)
                embeddings_list.append(embedding)
    
        # Write embeddings to the output file
        with open(output_file, 'w') as out_file:
            for embedding in embeddings_list:
                out_file.write(f"{embedding}\n")
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Generate BERT embeddings for a text corpus and save them to a file.")
        parser.add_argument("input_file", type=str, help="Path to the input text file.")
        parser.add_argument("output_file", type=str, help="Path to the output file to save embeddings.")
    
        args = parser.parse_args()
        main(args.input_file, args.output_file)
    

    This will take a corpus.txt file as the input, and output an out.txt file (both file names passed as parameters to this script), and you can use the content of out.txt to index that into Solr's vector field.

    Here's what needs to be added to Solr's schema.xml file:

        <!--vector search - AI-->
        <field     name="embeddings" type="vector"   indexed="true" stored="true" required="false" multiValued="false" />
        <fieldType name="vector"     class="solr.DenseVectorField" vectorDimension="768" similarityFunction="cosine" />
    

    Once you have that in your schema.xml you can use the above python script to generate the vectors, and insert them into this embeddings field.
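
    For the "insert them into this embeddings field" step, a minimal sketch of turning out.txt lines into update documents could look like this. It assumes each line is a plain list of floats as written by the script above (which also parses as JSON), and the `doc-N` id scheme is hypothetical; sending the payload to Solr's update handler is left out.

```python
import json
from typing import Dict, Iterable, List


def embeddings_to_docs(
    lines: Iterable[str],
    id_prefix: str = "doc",      # hypothetical id scheme
    field: str = "embeddings",   # field name from the schema snippet
    dimensions: int = 768,       # vectorDimension from the schema snippet
) -> List[Dict]:
    """Turn one-vector-per-line text into a list of Solr-style documents."""
    docs: List[Dict] = []
    non_empty = (line for line in lines if line.strip())
    for index, line in enumerate(non_empty):
        vector = json.loads(line)
        if len(vector) != dimensions:
            raise ValueError(f"line {index}: expected {dimensions} values, got {len(vector)}")
        docs.append({"id": f"{id_prefix}-{index}", field: vector})
    return docs


def to_update_payload(docs: List[Dict]) -> str:
    """Serialize the documents as a JSON array for a JSON update request."""
    return json.dumps(docs)
```

    In practice the ids would have to match the documents the vectors were generated from, which this sketch does not attempt.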

    Once you have your vectors inserted for each of the documents, you must now generate vectors for the user query (at query time).
    So, if you search for something like: "keyword A keyword B", you would need to run the above python script, in order to generate vectors on those keywords, and the query sent to Solr should look something like this (this creates a UNION between results generated by the vector search (AI), and the keyword search):

    q = {!bool should=$lexicalQuery should=$vectorQuery}&
    lexicalQuery = {!type=edismax qf=field1^10 field2^6....}SEARCH_QUERY&
    vectorQuery = {!knn f=embeddings topK=10}[VECTORS GENERATED BY THE PYTHON SCRIPT AT QUERY TIME]
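
    The same request can be assembled programmatically. This is only a sketch: `hybrid_params` is a hypothetical helper, the `qf` fields are the placeholders from above, and quoting the `qf` value is my assumption because it contains a space.

```python
from typing import Dict, Sequence
from urllib.parse import urlencode


def hybrid_params(
    user_query: str,
    vector: Sequence[float],
    qf: str = "field1^10 field2^6",  # placeholder fields and boosts
    field: str = "embeddings",
    top_k: int = 10,
) -> Dict[str, str]:
    """Build the three request parameters of the lexical + kNN bool query."""
    values = ", ".join(repr(float(v)) for v in vector)
    return {
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        # Assumption: the multi-word qf value is quoted inside local params.
        "lexicalQuery": f"{{!type=edismax qf='{qf}'}}{user_query}",
        "vectorQuery": f"{{!knn f={field} topK={top_k}}}[{values}]",
    }


params = hybrid_params("keyword A keyword B", [0.1, 0.2])
query_string = urlencode(params)  # URL-encoded, ready to append to /select?
```

    The vector passed in has to come from the same model that produced the indexed embeddings, as discussed earlier in the thread.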
    

    One will see, however, that doing this with that Python script, or perhaps an API call to OpenAI, is not really feasible, since generating those vectors is a very expensive operation (both in terms of CPU consumption and in terms of actual money spent on OpenAI :) )

    But... again, Solr may just bring up a filter that we can apply both at query time and at indexing time, that will do all of this for us.
    With all this pressure, I believe it's only a matter of time before this should happen.

    And then...
    Thinking about all this, as far as search is concerned, I think that the Drupal Search API Solr module could use some improvements and fine-tuning.
    But other than that, going through all this trouble of generating the vectors like this, just to get a similarity match, in my opinion it will be a juice that will not be worth the squeeze. :)
    At least not yet.

  • 🇳🇱Netherlands valgibson

    I have made a lot of progress on indexing vector embeddings in my Solr index and am also able to query them. I would like to pre-filter my docs, but haven't succeeded in doing that yet.

  • 🇬🇧United Kingdom scott_euser

    Slight tangent, but to note a couple things in the AI module -> AI Search sub-module:

    1. You can use multiple providers to generate embeddings, some are paid like OpenAI, but you can also use a local model like Ollama based for example
    2. It handles the tokenization/chunking there, but I suppose you'd want a single embedding here, so you would have to use the Average Pooling Strategy, which generates a single vector embedding from many chunks (the Embedding Strategies are a plugin system)
    3. For those looking for something now, we do have 'Boost' plugins as processors for Search API that essentially let you boost contents from a non-SOLR index, combining those results with SOLR or database: https://git.drupalcode.org/project/ai/-/tree/1.0.x/modules/ai_search/src.... That gets around the problem described in #11 because a single entity results in many chunks (defined as you'd like, with chunks, contextual content repeated, and filterable metadata), but the combined search means you get the full power of filtering, etc. with the relevance benefits of vector search
    4. Would be happy to discuss refactors if its helpful to be able to leverage some of the code in there, particularly while we are still in alpha on it
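
    For readers unfamiliar with average pooling (point 2), a generic sketch of the idea follows. This is the textbook element-wise mean, not necessarily how the AI module's Average Pooling Strategy is implemented.

```python
from typing import List, Sequence


def average_pool(chunk_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Collapse many chunk embeddings into one by element-wise averaging."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    dimensions = len(chunk_vectors[0])
    if any(len(vector) != dimensions for vector in chunk_vectors):
        raise ValueError("all chunk vectors must have the same dimension")
    count = len(chunk_vectors)
    return [sum(vector[i] for vector in chunk_vectors) / count for i in range(dimensions)]


pooled = average_pool([[1.0, 2.0], [3.0, 4.0]])
```

    The result fits a single-valued dense vector field, at the cost of blurring what the individual chunks were about.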

    I'm running a talk/discussion on the AI Search sub-module of the AI module on October 17th https://www.drupal.org/community/events/drupal-ai-meetup-2024-10-17 where I'll go into more detail + would welcome discussion and ideas.

  • 🇳🇱Netherlands valgibson

    Nice to know! Oh, and I succeeded in pre-filtering (I was using the wrong Solr version). Pre-filtering is only available from Solr 9.6 onward.
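
    For anyone trying the same, a sketch of how such a request could be built. It assumes the `preFilter` local parameter of the knn query parser as I understand it from the Solr reference guide, so please check the guide for your version; `knnFilter` is just a parameter name picked for this example.

```python
from typing import Dict, Sequence


def knn_with_prefilter(
    field: str,
    vector: Sequence[float],
    pre_filter: str,
    top_k: int = 10,
) -> Dict[str, str]:
    """Build request params for a kNN query restricted by a pre-filter."""
    values = ", ".join(repr(float(v)) for v in vector)
    return {
        # Assumption: preFilter is supported by the knn parser (Solr 9.6+).
        "q": f"{{!knn f={field} topK={top_k} preFilter=$knnFilter}}[{values}]",
        # The filter lives in its own parameter and is referenced via $knnFilter.
        "knnFilter": pre_filter,
    }


params = knn_with_prefilter("embeddings", [0.1, 0.2], "type:article")
```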

  • 🇬🇷Greece pinkonomy

    Does this work with MySQL or does it need a vector database (e.g. Pgvector from Postgresql)?
