Generate string vectors using OpenAI embeddings

Created on 31 January 2023
Updated 8 February 2023

Problem/Motivation

I'm just spitballing here, but this is something that might prove very useful. I have been researching how to do OpenAI embeddings. Since all the content in a Drupal site is in the database, it seems like we could create a vector table that holds the vectors for each page. Then, using Views, we could get the vector for the search input and do a vector search, using cosine similarity or dot product, to determine the results.

Steps to reproduce

https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

1. Create a vector table with two fields: entity_id and vectors (see the schema sketch after this list)
2. Run an embedding against the node/paragraph content fields. Store the vectors in the vector table with associated entity_id.
3. Create a custom views filter that will:
a. retrieve the vectors for the search term
b. run a mathematical comparison (cosine similarity, dot product, etc.) of the search vector against the vector table
c. return a text extract for the highest result(s) with a link to the entity
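
A minimal sketch of the step 1 schema, assuming serialized-blob storage and a hypothetical openai_vectors table (all names here are placeholders, not settled API):

    /**
     * Implements hook_schema().
     *
     * Sketch of the proposed vector table: one row per embedded entity,
     * with the vector stored as a serialized blob.
     */
    function openai_embeddings_schema() {
      $schema['openai_vectors'] = [
        'description' => 'Stores OpenAI embedding vectors per entity.',
        'fields' => [
          'entity_id' => [
            'type' => 'int',
            'unsigned' => TRUE,
            'not null' => TRUE,
            'description' => 'The id of the embedded node/paragraph.',
          ],
          'vectors' => [
            'type' => 'blob',
            'size' => 'big',
            'not null' => TRUE,
            'description' => 'Serialized array of floats from the embeddings API.',
          ],
        ],
        'primary key' => ['entity_id'],
      ];
      return $schema;
    }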

Proposed resolution

A submodule of this OpenAI / GPT module which will create and maintain the vector table as well as programmatically create the custom views filter for doing embedded searches.

Remaining tasks

Creating and maintaining the vector table should be relatively simple. It could be done during cron and, of course, during node add/update/delete operations.

The big problem will be determining the mathematical function for the vector search, given all the possible filter conditions: contains, does not contain, equals, less than, and so on.

User interface changes

I don't see any user interface changes. All of the OpenAI calls would be invisible to the end user.

API changes

Clearly we would be adding OpenAI API embedding calls: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

Data model changes

The only real data model change is that we will do searches on vectors instead of keywords.

I'd like to figure out a way to do this with Search API / Search API Attachments as well.

Just an idea.

✨ Feature request
Status

Fixed

Version

1.0

Component

Code

Created by

🇺🇸 United States SomebodySysop

Comments & Activities

  • Issue created by @SomebodySysop
  • 🇺🇸 United States kevinquillen

    This is actually a good idea worth looking into. It is this sort of application of OpenAI that I am interested in: augmenting Drupal in assistive ways.

  • 🇺🇸 United States kevinquillen

    Looking at the API, it does a POST request:

    https://platform.openai.com/docs/api-reference/embeddings/create

    My question: how does that work exactly? What does that mean, for example, for a site's content? I suppose all published content is public, generally speaking, but is there a risk for content that is access protected, unpublished, etc.? Can people create their own private models? If you submit to this API, is your content then in a public model?

  • 🇺🇸 United States SomebodySysop

    Looking at the API, it does a POST request:

    https://platform.openai.com/docs/api-reference/embeddings/create

    My question: how does that work exactly?

    What we are talking about is text vectorization: a numeric representation of the text. https://www.deepset.ai/blog/what-is-text-vectorization-in-nlp. Essentially, it's a way to make it easier for machines, specifically language models, to understand the text.

    So, on a site we have a page with content. If it's a node, that content is stored in the Drupal database with a node id. If it's a paragraph, that paragraph is stored in the Drupal database with a paragraph id. When we run an embedding, we essentially create a vector representation of that page and store it in the database, referring back to its source id (i.e., the node/paragraph text that was vectorized).

    So, for example, when a page is created or updated, we run the API to get the vectors for the page content, then store those vectors with the entity_id of the added/updated page. That table row is what we use to do vector searches on page content (as opposed to SQL keyword searches). We actually run a mathematical function to compare the search vector with the stored vectors, rather than a database search.
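
    Roughly, that save-time flow might look like this sketch (the function, table, and 'openai.client' service names are all assumptions, and the client call mirrors the openai-php usage later in this thread):

      /**
       * Implements hook_ENTITY_TYPE_update() for node entities.
       *
       * Sketch: embed the body text on save and upsert the vector into
       * the hypothetical openai_vectors table.
       */
      function openai_embeddings_node_update(\Drupal\node\NodeInterface $node) {
        // Plain-text source for the embedding.
        $text = strip_tags((string) $node->get('body')->value);

        // Assumption: the parent module exposes its OpenAI client as a service.
        $client = \Drupal::service('openai.client');
        $response = $client->embeddings()->create([
          'model' => 'text-embedding-ada-002',
          'input' => $text,
        ]);
        $vector = $response->toArray()['data'][0]['embedding'];

        // Upsert keyed by entity id so re-saving refreshes the row.
        \Drupal::database()->merge('openai_vectors')
          ->key('entity_id', $node->id())
          ->fields(['vectors' => serialize($vector)])
          ->execute();
      }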

    What does that mean, for example, for a site's content?

    I don't see any change. End users have access to site pages based upon their access level. The vector content is never seen nor accessible by the user. It is only used for searching.

    I suppose all published content is public, generally speaking, but is there a risk for content that is access protected, unpublished, etc.?

    No. The page content and its access remain the same. The embedding process makes API calls to the OpenAI embedding model to retrieve the vectors. That information is not stored in the model. The model simply "vectorizes" the content and sends back the vectors. The page data is never in a position, as far as I can tell, to be seen or accessed outside of this process. Same for the vector information -- unless OpenAI is storing this information to do model training; that, I don't know. So far I've not been able to find anything on embedding security.

    Update: I actually asked the ChatGPT3 module this question, and these are the responses I received: https://sharegpt.com/c/TF7wLOy

    Can people create their own private models? If you submit to this API, is your content then in a public model?

    At this point, there aren't any custom embedding models, at least as far as I know. The available models are listed here: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

    I know that people can create their own private models in fine-tuning, but that's not what I am suggesting here. I am suggesting that we use OpenAI embeddings to vectorize our content in order to perform vector searches that no longer rely on keyword matches to locate relevant data. This is one of the best explanations I have seen on the subject: https://www.youtube.com/watch?v=xzHhZh7F25I&ab_channel=PartTimeLarry

    For larger sites it might run up costs, so perhaps we start by limiting it to a small set of nodes, especially during development.

    To be clear, I am not suggesting this process as a core feature. I would restrict it to an optional submodule of this OpenAI / GPT module. Therefore users who want this enhanced search feature on their data would certainly be willing to pay for the usage.

    Which means that the submodule would require the user's API key. It would also give them the option of selecting the public embedding model they wish to use:

    On this page, I only see one model: https://openai.com/api/pricing/

    Ada $0.0004 / 1K tokens

  • 🇺🇸 United States kevinquillen

    Thanks for the explanation, just wanted to throw some questions out there. I have an idea now.

    Would you be able to contribute to the forked branch? https://git.drupalcode.org/issue/openai-3337774/-/tree/views-embedding

  • 🇺🇸 United States kevinquillen

    To be clear, I am not suggesting this process as a core feature. I would restrict it to an optional submodule of this OpenAI / GPT module. Therefore users who want this enhanced search feature on their data would certainly be willing to pay for the usage.

    Yep, that's the goal for most of these features. I have created an openai_search module in the forked branch. Open to 'better' naming scheme(s).

    Looking through the docs, they combined a dataset into a single row per record in their example:

    https://github.com/openai/openai-cookbook/blob/main/examples/Obtain_data...

    In this case, should it only be entities like nodes, and not ones like Paragraphs? One reason I ask is that Paragraphs, last I checked, are not deleted, for example. Typically there are no such things as orphaned paragraphs either; they (the active ones) are attached to a host. Should the node be loaded into a specific view mode that is just a textual representation of the node content, and then fed to the API?

  • 🇺🇸 United States SomebodySysop

    I actually didn't know that about paragraphs. I know you can migrate them without being attached to nodes, but not that they aren't deleted when their parent nodes are deleted.

    I mentioned nodes and paragraphs because those are the primary entities I am familiar with when working with site text content. Depending upon the application, a site content type could contain one, several, or no body text fields. In my sites, I usually place the content in paragraph body text fields. Other folks use nodes. So, our module would have to accommodate this.

    For example, this field:

    machine name: field_content
    field type: text (formatted, long)
    entity type: paragraph
    entity id: paragraph id
    parent entity: parent entity type (typically a node)
    parent entity id: parent entity's id

    Will be stored in the vector table like this:

    vector entity type: node/paragraph
    vector id: node/paragraph id
    vectors: the actual vector array (stored as serialized blob, as per my understanding)

    I'm not sure if we need the parent entity info. It depends upon how the search is structured -- how we retrieve the target entities for search. In SQL, we would join the vector table on the vector table id = paragraph/node id.

    But, we don't care about the ids until AFTER the search (or in our case, mathematical function) is completed.

    We run a dot product or cosine calculation between the search vector and all rows in the vector table. Then we take the highest matches (1, 2, 3, whatever the user decides), and THEN we retrieve the node/paragraph entities from those results based upon the node/paragraph ids.
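
    In code, that ranking step might look something like this sketch (the table name and the cosine_similarity() helper are placeholders for whatever we settle on):

      // Score every stored vector against the query vector, then keep
      // the top N. cosine_similarity() is a placeholder helper.
      $query_vector = [/* floats returned for the search term */];
      $scores = [];

      $rows = \Drupal::database()
        ->query('SELECT entity_id, vectors FROM {openai_vectors}');
      foreach ($rows as $row) {
        $scores[$row->entity_id] = cosine_similarity($query_vector, unserialize($row->vectors));
      }

      // Highest similarity first; take however many the user decides.
      arsort($scores);
      $top = array_slice($scores, 0, 3, TRUE);

      // Only now do we care about the ids: load the matching entities.
      $entities = \Drupal\node\Entity\Node::loadMultiple(array_keys($top));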

    Should the node be loaded into a specific view mode that is just a textual representation of the node content? Then feed that to the API?

    Good question. Right now, in the Drupal field content tables, the text fields contain the raw HTML text content. The question is: Do we strip the HTML tags from the content before we embed it? Logically, it would seem like we should in order to get the best semantic search results, but it would take someone with more AI embedding knowledge than me to determine that.
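
    If we do strip it, the pre-processing could be as simple as this sketch (whether it actually improves relevance is the open question above):

      // Strip tags and collapse whitespace before sending the text to
      // the embeddings endpoint, so only the semantic content is vectorized.
      $html = $node->get('body')->value;
      $plain = html_entity_decode(strip_tags((string) $html), ENT_QUOTES);
      $plain = trim(preg_replace('/\s+/', ' ', $plain));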

  • 🇺🇸 United States SomebodySysop

    So, right now, with the model we are discussing, our results will bring back full pages. I was thinking through how we might bring back an excerpt instead. Then, thinking beyond that, to using something similar to a chatbot interface.

    Check out this video:

    https://youtu.be/fODk-alDqWw

    What the guy does here is:

    • Creates embeddings for his content. This is the vectorization we've been discussing.
    • Gets the search vector and runs cosine similarity against his stored vectors.
    • Gets the 3 highest results. We talked about this as well.
    • Takes the highest result and then sends it as context for the same question to an OpenAI completion model.
    • The completion model looks at the question within the context and returns an informed answer.

    So now the Drupal user can get ChatGPT3 responses to queries on his/her content without having to fine-tune an OpenAI model.
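
    Those last two steps might look roughly like this with the openai-php client ($client is an initialized client instance; the prompt wording and model choice here are my assumptions, not his code):

      // $context is the text of the highest-scoring embedded page and
      // $question is the user's original query.
      $response = $client->completions()->create([
        'model' => 'text-davinci-003',
        'prompt' => "Answer the question using only the context below.\n\n"
          . "Context:\n$context\n\nQuestion: $question\nAnswer:",
        'max_tokens' => 256,
        'temperature' => 0,
      ]);

      $answer = $response->toArray()['choices'][0]['text'];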

    All done through normal Drupal database transactions and APIs to OpenAI.

    Add audio-to-text and text-to-audio, and now we have a full Drupal bot.

  • 🇺🇸 United States kevinquillen

    stored as serialized blob

    Would this have a performance impact during search at all? I was looking to see if Drupal 10 had support for JSON storage types in the database but could not confirm that yet.

    In either case, yeah, I think the storage will look something like:

    - id (entry id)
    - entity_id
    - entity_type
    - bundle (unsure if needed)
    - field_name (what the vector was generated from)
    - embedding (the vectors)
    - data (raw response from OpenAI)

    with db indexes on type, bundle, field_name. Ex. a node with 4 text (plain or long) fields would have 4 entries.

    The administrative interface also needs a way for a user to opt in which entity types and their bundles to run this on. Would we also need to allow specifying which text fields to run on? Maybe we do them all initially and make opting fields in an enhancement later.

  • 🇺🇸 United States SomebodySysop

    Would this have a performance impact during search at all?

    Compared to current keyword searches? Probably so. We've got to do the API call to get the search vector then the mathematical computation on the vector table -- and possibly a 2nd API call to get a response with context. We won't know just how much of a noticeable impact until we start testing with real data.

    with db indexes on type, bundle, field_name. Ex. a node with 4 text (plain or long) fields would have 4 entries. What about field deltas? Do we need to track ordering?

    Yes. As for field deltas, I don't know that they matter in our scenario.

    As for ordering, remember that the vector computation (dot product, cosine similarity, etc.) will return vector records that we will order according to their value relative to the search vector.

    I think we need to think of it from the standpoint of what we need to get back to the entity we are actually going to display to the end-user. Let's take my case, where all my text content resides in paragraphs.

    When we embed the content, we use the field_content_value (text) from the paragraph__field_content table. Now paragraph entity 11294 has a vector record and id.

    When we search, and paragraph vector 11294 ends up the highest result, we use the paragraphs_item_field_data table to point us to the parent node that we will link users to in the search results.
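
    In code, resolving the winning paragraph back to its host node could be as simple as this sketch (using the Paragraphs API rather than a direct table join):

      // The top-scoring row pointed at paragraph 11294; walk up to the
      // host entity so we can link the user to the parent node.
      $paragraph = \Drupal\paragraphs\Entity\Paragraph::load(11294);
      $parent = $paragraph->getParentEntity();
      if ($parent && $parent->getEntityTypeId() === 'node') {
        $url = $parent->toUrl()->toString();
      }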

    The administrative interface also needs a way for a user to opt in which entity types and their bundles to run this on. Would we also need to allow specifying which text fields to run on? Maybe we do them all initially and make opting fields in an enhancement later.

    In any event, I say let the end-user decide which fields to use with the warning that the more fields that are used, the greater the potential latency.

  • 🇺🇸 United States kevinquillen

    So I started with a basic node type that has a title and a body. For the title, I named it 'Test Article'. The embedding returned for that was 1536 numbers - should it be that many?

  • 🇺🇸 United States SomebodySysop

    Yes. The vector arrays will be very long; that model returns 1536 dimensions per input.

  • @kevinquillen opened merge request.
  • 🇺🇸 United States kevinquillen

    Gotcha. I've put some rough work in here so far:

    https://git.drupalcode.org/issue/openai-3337774/-/tree/views-embedding/m...

    I also added the field delta, not for any other reason than to be able to use a db merge query. I wasn't sure of a more efficient way to ensure the data in the table is not duplicated and stays fresh. Thinking ahead to something like Paragraphs.

    Would it also be possible for the module to suggest taxonomy terms (either existing or not) to classify the text data?

  • 🇺🇸 United States SomebodySysop

    Would it also be possible for the module to suggest taxonomy terms (either existing or not) to classify the text data?

    Yes. Embed and store the taxonomy terms in the same way, then run a dot product / cosine similarity function with the search vector. Same principle: pick the highest results.
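
    Sketching that idea (assuming the storage has grown the entity_type column discussed above, that term labels were embedded into the same table keyed by term id, and that cosine_similarity() is still a placeholder helper):

      // Compare the content's vector against stored term-label vectors
      // and suggest the closest terms.
      $scores = [];
      $rows = \Drupal::database()->query(
        'SELECT entity_id, vectors FROM {openai_vectors} WHERE entity_type = :type',
        [':type' => 'taxonomy_term']
      );
      foreach ($rows as $row) {
        $scores[$row->entity_id] = cosine_similarity($content_vector, unserialize($row->vectors));
      }
      arsort($scores);
      $suggested = \Drupal\taxonomy\Entity\Term::loadMultiple(array_keys(array_slice($scores, 0, 5, TRUE)));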

    By the way, I'm not sure which is better: dot product, cosine similarity, or some other function. Most of the examples I've seen use cosine, but someone knowledgeable recommended dot product to me. Guess we'll see.

  • 🇺🇸 United States kevinquillen

    Since we are building for Drupal 10, it looks like we are now permitted to define schema types as JSON (updated database versions):

          'embedding' => [
            'type' => 'blob',
            'mysql_type' => 'json',
            'pgsql_type' => 'jsonb',
            'size' => 'big',
            'description' => 'The embedding values from OpenAI.',
          ],
          'data' => [
            'type' => 'blob',
            'mysql_type' => 'json',
            'pgsql_type' => 'jsonb',
            'not null' => FALSE,
            'size' => 'big',
            'description' => 'Full response returned from OpenAI.',
          ],
    

    This means values are now stored as JSON documents in the database.

    This should prove fruitful when performing lookups or searches vs. storing as serialized text. This may also mean having to maintain one query service for MySQL and one for PostgreSQL (they vary slightly in this area), but I think it's still worth doing.

  • 🇺🇸 United States kevinquillen

    It may be useful to add an "Embeddings" tab on a node/entity too and list out records in a table display so an admin can see what they have at a glance. This may be useful for monitoring token usage among other things.

  • 🇺🇸 United States kevinquillen

    Here is another example, looking for a specific value (or it could be a set of values):

    SELECT JSON_EXISTS('{"data":[0.004663495]}', '$.data') FROM openai_embeddings oe

  • 🇺🇸 United States kevinquillen

    A couple of things here.

    Calculating cosine similarity in MySQL seems to require creating MySQL stored procedures (pretty much every example I saw did). Many (Drupal) hosts disallow this. The next thing I looked for was a PHP implementation. I adapted this into Drupal:

    https://github.com/mlwmlw/php-cosine-similarity

    To test this, I wrote some example code:

      // $client is the OpenAI PHP client and $sim is the similarity
      // helper adapted from mlwmlw/php-cosine-similarity.
      $result = \Drupal::database()->query('SELECT embedding FROM {openai_embeddings} WHERE id = 1 LIMIT 1');

      // Pull the previously stored vector out of the JSON column.
      foreach ($result as $item) {
        $value = json_decode($item->embedding);
        $compare = $value->data;
      }

      // Embed the new input string for comparison.
      $response = $client->embeddings()->create([
        'model' => 'text-embedding-ada-002',
        'input' => "Creating another test article for embedding comparison",
      ]);

      $embeddings = $response->toArray();

      // Score the new vector against the stored one, per the adapted
      // library's API.
      $dot = $sim::dot($embeddings["data"][0]["embedding"]);
      $score = $sim::cosine($embeddings["data"][0]["embedding"], $compare, $dot);

    I am not entirely sure I am doing that correctly. The value of $compare is the vector generated earlier from the "Test Article" string. The result of ::cosine was 0.49932475388291.

    To retrieve the most relevant documents we use the cosine similarity between the embedding vectors of the query and each document, and return the highest scored documents.

    I guess the issue I see here is how quickly every vector in the database can be iterated over in order to do this. I am imagining cases where there are thousands of entries; each one has to be pulled and run through this function and then sorted highest to lowest, for example.

    I can see they recommend a few engines for vector searching; the only one I recognize is Redis.

  • 🇺🇸 United States SomebodySysop

    This means values are now stored as JSON documents in the database:
    This should prove fruitful when performing lookups or searches vs storing as serialized text.

    The first examples I got from ChatGPT were SQL computations:

    WITH search_vector AS (
      SELECT <search_term_vector_column> AS search_vector
    ),
    vectors AS (
      SELECT <vector_column> AS vector
    )
    SELECT dot_product(search_vector.search_vector, vectors.vector)
    FROM search_vector, vectors

    You're right. This streamlines the process of computing similarities.

    Calculating cosine similarity in MySQL seems to require (pretty much every example I saw) creating MySQL stored procedures. Many (Drupal) hosts disallow this.

    This is definitely outside my knowledge base, but I did get this from an expert on the subject a few weeks ago:

    You don’t need to only use the cosine similarity function. There are many different ways to process the vectors including:

    • Cosine Similarity (same as the dot product for unit vectors like these embeddings)
    • Dot Product
    • Euclidean Distance
    • Manhattan Distance.
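
    For reference, plain-PHP versions of those four measures might look like this (a sketch, not the library code discussed elsewhere in this thread):

      function dot_product(array $a, array $b): float {
        return array_sum(array_map(fn($x, $y) => $x * $y, $a, $b));
      }

      function cosine_similarity(array $a, array $b): float {
        // For unit-length vectors (like these embeddings) this reduces
        // to the plain dot product.
        return dot_product($a, $b) / (sqrt(dot_product($a, $a)) * sqrt(dot_product($b, $b)));
      }

      function euclidean_distance(array $a, array $b): float {
        return sqrt(array_sum(array_map(fn($x, $y) => ($x - $y) ** 2, $a, $b)));
      }

      function manhattan_distance(array $a, array $b): float {
        return array_sum(array_map(fn($x, $y) => abs($x - $y), $a, $b));
      }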

    This is a question for ChatGPT.

  • 🇺🇸 United States SomebodySysop

    I just checked with ChatGPT:

    Unfortunately, the current version of MySQL does not have native support for vector comparison. However, you can compare two vectors by converting the BLOB data type to a string and using string comparison functions such as SUBSTRING or LOCATE to compare the vectors.

  • 🇺🇸 United States kevinquillen

    OpenAI themselves seem to say that the cosine similarity is good enough:

    We recommend cosine similarity. The choice of distance function typically doesn’t matter much.

    Cosine similarity can be computed slightly faster using just a dot product
    Cosine similarity and Euclidean distance will result in the identical rankings

    A little outside my wheelhouse there, though.

  • 🇺🇸 United States kevinquillen

    Hmmm... I better slow down.

    OpenAI\Exceptions\ErrorException: Rate limit reached for default-global-with-image-limits in organization org-7arCWdLwAtl6atQZmkmJcOPf on requests per min. Limit: 60.000000 / min. Current: 70.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method. in OpenAI\Transporters\HttpTransporter->requestObject()

  • 🇺🇸 United States SomebodySysop

    OpenAI\Exceptions\ErrorException: Rate limit reached for default-global-with-image-limits in organization org-xxxx on requests per min. Limit: 60.000000 / min. Current: 70.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method. in OpenAI\Transporters\HttpTransporter->requestObject()

    Yeah, I just watched a video on this recently. I can't remember how the guy worked around it.

    But, this is something end-users will need to be aware of as well. This search capability is not something they will want to randomly make public to just anybody -- unless they are a fairly large organization that can afford the tokens.

    Cosine similarity can be computed slightly faster using just a dot product
    Cosine similarity and Euclidean distance will result in the identical rankings

    I can verify this. Early on I tested cosine similarity and dot product, and the results were pretty similar. Too bad we can't do it in SQL!

  • Assigned to kevinquillen
  • 🇺🇸 United States kevinquillen

    Given those recent items, I think we should split this into newer tickets.

    I think, right now, we have set the foundation. We can store embeddings from text-based fields on any entity (the user entity type is blocked until we can write code to ensure things like passwords or PII are not included). From here, figuring out how to compare them effectively with broad technology (PHP or MySQL) may take some time and effort.

  • 🇺🇸 United States kevinquillen

    OpenAI states that the model we are using will have 1536 output dimensions.

    See:

    for examples... we need to identify a valid, working PHP library to help compare these vectors, or determine whether a MySQL stored procedure can do the same. Most of the library examples I see are for Python. If we cannot locate one, we may have to make our own.

    One of the libs I linked above for example: https://github.com/yooper/php-text-analysis/blob/master/src/Comparisons/...

    This says it is a cosine similarity comparison, but it does not use two vectors of floats. I can't find much else out there. If we do make our own implementation, it will need a good battery of unit tests to ensure accuracy.

  • 🇺🇸 United States kevinquillen

    Merging current effort.

  • Status changed to Fixed
  • 🇺🇸 United States kevinquillen

    Based on our previous comments and #29, I think we should create individual tickets that are the required pieces to get to the end goal.

  • 🇺🇸 United States SomebodySysop

    From here, figuring out how to compare them effectively with broad technology (PHP or MySQL) may take some time and effort.

    Did you see these?

    Anyway, great work! Way more than I anticipated.

    Based on our previous comments and #29, I think we should create individual tickets that are the required pieces to get to the end goal.

    How do we do this? Since you now have the data collection methodology working, I'd like to add file attachments to that. I've used the SolrClient library in the past to make calls directly to the Solr engine. My idea would be to get the Solr file dataset vectors and add them to the same vector table used for nodes and paragraphs. Then searches would retrieve not only site content but also attachment (pdf, docx, txt, etc.) content. I can work on that.

  • 🇺🇸 United States kevinquillen

    Yeah, I noted the first one (https://github.com/mlwmlw/php-cosine-similarity) and questions here:

    https://www.drupal.org/project/openai/issues/3339686 📌 Integrate a service or library to compare vectors of floats Active

    The second one looks like it has not been touched in years, which may not be an issue if it 'works'. Will check through the shared GH repos.

  • 🇺🇸 United States kevinquillen

    In my head, I am figuring that once we figure out the cosine similarity / score generation, this opens up Views, taxonomy, etc. features.

  • 🇺🇸 United States kevinquillen

    Yeah - that's what the parent module uses to communicate with OpenAI, and it provides all the functionality we are leveraging to date.

    I think, ultimately, we are looking at something like this now that I know a little bit more:

    https://www.drupal.org/project/openai/issues/3339686#comment-14911421 📌 Integrate a service or library to compare vectors of floats Active

  • 🇺🇸 United States kevinquillen

    Hop on over to this issue, I think we are on the right track: https://www.drupal.org/project/openai/issues/3339686#comment-14911959 📌 Integrate a service or library to compare vectors of floats Active

  • Automatically closed - issue fixed for 2 weeks with no activity.
