Can anyone re-check this? The Search API Opensolr module does seem to be a real contribution though...
We also did some research on this, and I agree: Solr would need to implement a way to generate vectors at both indexing and query time.
I think Solr will have a filter available for these tasks soon, perhaps in Solr 10.
For now, we have to create the vectors (embeddings) ourselves at indexing and query time in order to get a similarity-based AI search.
At Opensolr, we have a built-in Web Crawler, in which we implemented a Python script that uses a BERT model to generate vectors based on the document's (page's) title, description and a chunk of the document's text, and then inserts those vectors into Solr.
Here's the Python script we used:
import argparse
import torch
from transformers import BertModel, BertTokenizer

def generate_embedding(text, model, tokenizer):
    # Tokenize the input text and convert it to tensors
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    # Generate embeddings using BERT (no gradients needed for inference)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the embedding from the [CLS] token (768 dimensions for bert-base-uncased)
    embeddings = outputs.last_hidden_state[:, 0, :].squeeze().tolist()
    return embeddings

def main(input_file, output_file):
    # Load the pre-trained BERT model and tokenizer from Hugging Face
    model_name = "bert-base-uncased"
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    # Read the input file
    with open(input_file, 'r') as file:
        lines = file.readlines()
    # Generate an embedding for each non-empty line in the file
    embeddings_list = []
    for line in lines:
        line = line.strip()
        if line:
            embedding = generate_embedding(line, model, tokenizer)
            embeddings_list.append(embedding)
    # Write the embeddings to the output file, one per line
    with open(output_file, 'w') as out_file:
        for embedding in embeddings_list:
            out_file.write(f"{embedding}\n")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate 768-dimensional BERT embeddings for a text corpus and save them to a file.")
    parser.add_argument("input_file", type=str, help="Path to the input text file.")
    parser.add_argument("output_file", type=str, help="Path to the output file to save embeddings.")
    args = parser.parse_args()
    main(args.input_file, args.output_file)
This takes a corpus.txt file as input and writes an out.txt file (both file names are passed as parameters to the script), and you can then index the contents of out.txt into Solr's vector field.
Here's what needs to be added to Solr's schema.xml file:
<!--vector search - AI-->
<field name="embeddings" type="vector" indexed="true" stored="true" required="false" multiValued="false" />
<fieldType name="vector" class="solr.DenseVectorField" vectorDimension="768" similarityFunction="cosine" />
Once you have that in your schema.xml, you can use the above Python script to generate the vectors and insert them into this embeddings field.
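As a rough sketch of that indexing step, the embeddings can be sent to Solr as a JSON update payload. Note that the "id" and "title" fields, the core name "mycore", and the Solr URL below are all hypothetical placeholders; only the "embeddings" field name comes from the schema.xml snippet above:

```python
import json

def build_update_doc(doc_id, title, embedding):
    # One Solr JSON-update document carrying the dense vector.
    # "id" and "title" are hypothetical example fields; "embeddings"
    # matches the DenseVectorField defined in schema.xml above.
    return {"id": doc_id, "title": title, "embeddings": embedding}

# 768 dummy values standing in for one line of the script's out.txt
docs = [build_update_doc("page-1", "Example page", [0.1] * 768)]
payload = json.dumps(docs)

# The payload can then be POSTed to Solr's update handler, e.g.:
#   requests.post("http://localhost:8983/solr/mycore/update?commit=true",
#                 data=payload,
#                 headers={"Content-Type": "application/json"})
```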
Once you have your vectors inserted for each of the documents, you must now generate vectors for the user query (at query time).
So, if you search for something like "keyword A keyword B", you need to run the above Python script to generate vectors for those keywords, and the query sent to Solr should look something like this (it creates a UNION between the results of the vector (AI) search and the keyword search):
q = {!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery = {!type=edismax qf=field1^10 field2^6 ...}SEARCH_QUERY&
vectorQuery = {!knn f=embeddings topK=10}[VECTORS GENERATED BY THE PYTHON SCRIPT AT QUERY TIME]
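Those three parameters can be assembled in Python before sending them to Solr's select handler. This is just an illustrative sketch; the field names and boosts in the default qf value are placeholders for your own fields:

```python
def build_hybrid_params(user_query, query_vector, qf="title^10 content^6", top_k=10):
    # Assemble the boolean union of the lexical and knn queries shown above.
    # Solr's knn parser expects the vector as a bracketed, comma-separated
    # list of floats.
    vector_str = "[" + ", ".join(str(v) for v in query_vector) + "]"
    return {
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": f"{{!type=edismax qf={qf}}}{user_query}",
        "vectorQuery": f"{{!knn f=embeddings topK={top_k}}}{vector_str}",
    }

# A short dummy vector for illustration; a real query vector would have
# 768 dimensions to match the schema.
params = build_hybrid_params("keyword A keyword B", [0.12, -0.03, 0.5])
# Send with e.g.:
#   requests.get("http://localhost:8983/solr/mycore/select", params=params)
```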
You will find, however, that doing this with that Python script, or with an API call to OpenAI, is not really feasible, since generating those vectors is a very expensive operation (both in terms of CPU consumption and in terms of actual money spent on OpenAI :) ).
But... again, Solr may yet ship a filter that we can apply both at query time and at indexing time, and that will do all of this for us.
With all this pressure, I believe it's only a matter of time before that happens.
And then...
Thinking about all this, as far as search is concerned, I think the Drupal Search API Solr module could use some improvements and fine-tuning.
But other than that, going through all the trouble of generating the vectors like this, just to get a similarity match, is in my opinion a juice that won't be worth the squeeze. :)
At least not yet.