Can anyone re-check this? The Search API Opensolr module does seem to be a real contribution though...
We also did some research on this, and I agree: Solr would need to implement a way to generate vectors at both indexing and query time.
I think Solr will have a filter available for these tasks soon, perhaps in Solr 10.
For now, we have to create the vectors (embeddings) ourselves at indexing and query time in order to get a similarity-based AI search.
At Opensolr, we have a built-in Web Crawler, in which we implemented a Python script that uses a BERT model to generate vectors based on the document's (page's) title, description and a chunk of the document's text, and then inserts those vectors into Solr.
Here's the Python script we used:
import argparse
import torch
from transformers import BertModel, BertTokenizer

def generate_embedding(text, model, tokenizer):
    # Tokenize the input text and convert it to tensors
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    # Generate embeddings using BERT (no gradients needed for inference)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the embedding from the [CLS] token (768 dimensions for bert-base-uncased)
    embeddings = outputs.last_hidden_state[:, 0, :].squeeze().tolist()
    return embeddings

def main(input_file, output_file):
    # Load the pre-trained BERT model and tokenizer from Hugging Face
    model_name = "bert-base-uncased"
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    # Read the input file
    with open(input_file, 'r') as file:
        lines = file.readlines()
    # Generate an embedding for each non-empty line in the file
    embeddings_list = []
    for line in lines:
        line = line.strip()
        if line:
            embedding = generate_embedding(line, model, tokenizer)
            embeddings_list.append(embedding)
    # Write the embeddings to the output file, one per line
    with open(output_file, 'w') as out_file:
        for embedding in embeddings_list:
            out_file.write(f"{embedding}\n")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate 768-dimensional BERT embeddings for a text corpus and save them to a file.")
    parser.add_argument("input_file", type=str, help="Path to the input text file.")
    parser.add_argument("output_file", type=str, help="Path to the output file to save embeddings.")
    args = parser.parse_args()
    main(args.input_file, args.output_file)
This takes a corpus.txt file as input and writes an out.txt file (both file names are passed as parameters to the script), and you can then index the contents of out.txt into Solr's vector field.
Here's what needs to be added to Solr's schema.xml file:
<!--vector search - AI-->
<field name="embeddings" type="vector" indexed="true" stored="true" required="false" multiValued="false" />
<fieldType name="vector" class="solr.DenseVectorField" vectorDimension="768" similarityFunction="cosine" />
Once you have that in your schema.xml, you can use the above Python script to generate the vectors and insert them into this embeddings field.
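As a rough sketch of that indexing step, the embeddings can be sent to Solr as a JSON update payload. Note that the "id" and "title" fields, the core name "mycore", and the Solr URL below are all hypothetical placeholders; only the "embeddings" field name comes from the schema.xml snippet above:

```python
import json

def build_update_doc(doc_id, title, embedding):
    # One Solr JSON-update document carrying the dense vector.
    # "id" and "title" are hypothetical example fields; "embeddings"
    # matches the DenseVectorField defined in schema.xml above.
    return {"id": doc_id, "title": title, "embeddings": embedding}

# 768 dummy values standing in for one line of the script's out.txt
docs = [build_update_doc("page-1", "Example page", [0.1] * 768)]
payload = json.dumps(docs)

# The payload can then be POSTed to Solr's update handler, e.g.:
#   requests.post("http://localhost:8983/solr/mycore/update?commit=true",
#                 data=payload,
#                 headers={"Content-Type": "application/json"})
```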
Once you have your vectors inserted for each of the documents, you must now generate vectors for the user query (at query time).
So, if you search for something like "keyword A keyword B", you need to run the above Python script to generate vectors for those keywords, and the query sent to Solr should look something like this (it creates a UNION between the results of the vector (AI) search and the keyword search):
q = {!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery = {!type=edismax qf=field1^10 field2^6 ...}SEARCH_QUERY&
vectorQuery = {!knn f=embeddings topK=10}[VECTORS GENERATED BY THE PYTHON SCRIPT AT QUERY TIME]
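Those three parameters can be assembled in Python before sending them to Solr's select handler. This is just an illustrative sketch; the field names and boosts in the default qf value are placeholders for your own fields:

```python
def build_hybrid_params(user_query, query_vector, qf="title^10 content^6", top_k=10):
    # Assemble the boolean union of the lexical and knn queries shown above.
    # Solr's knn parser expects the vector as a bracketed, comma-separated
    # list of floats.
    vector_str = "[" + ", ".join(str(v) for v in query_vector) + "]"
    return {
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": f"{{!type=edismax qf={qf}}}{user_query}",
        "vectorQuery": f"{{!knn f=embeddings topK={top_k}}}{vector_str}",
    }

# A short dummy vector for illustration; a real query vector would have
# 768 dimensions to match the schema.
params = build_hybrid_params("keyword A keyword B", [0.12, -0.03, 0.5])
# Send with e.g.:
#   requests.get("http://localhost:8983/solr/mycore/select", params=params)
```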
You will find, however, that doing this with that Python script, or with an API call to OpenAI, is not really feasible, since generating those vectors is a very expensive operation (both in terms of CPU consumption and in terms of actual money spent on OpenAI :) ).
But... again, Solr may yet ship a filter that we can apply both at query time and at indexing time, and that will do all of this for us.
With all this pressure, I believe it's only a matter of time before that happens.
And then...
Thinking about all this, as far as search is concerned, I think the Drupal Search API Solr module could use some improvements and fine-tuning.
But other than that, going through all the trouble of generating the vectors like this, just to get a similarity match, is in my opinion a juice that won't be worth the squeeze. :)
At least not yet.