Generate string vectors using OpenAI embeddings

Created on 31 January 2023
Updated 8 February 2023

Problem/Motivation

I'm just spitballing here, but this is something that might prove very useful. I have been researching how to do OpenAI embeddings. Since all the content in a Drupal site is in the database, it seems like we could create a vector table that holds the vectors for each page. Then, using Views, we could get the vector for the search input and do a vector search, using cosine similarity or dot product, to determine the results.

Steps to reproduce

https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

1. Create a vector table with two fields: entity_id and vectors (see the schema sketch after this list)
2. Run an embedding against the node/paragraph content fields. Store the vectors in the vector table with associated entity_id.
3. Create a custom views filter that will:
a. retrieve the vectors for the search term
b. run a mathematical comparison (cosine similarity, dot product, etc.) of the search vector against the vector table
c. return a text extract for the highest result(s) with a link to the entity
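
A minimal sketch of the step 1 schema, assuming serialized-blob storage and a hypothetical openai_vectors table (all names here are placeholders, not settled API):

    /**
     * Implements hook_schema().
     *
     * Sketch of the proposed vector table: one row per embedded entity,
     * with the vector stored as a serialized blob.
     */
    function openai_embeddings_schema() {
      $schema['openai_vectors'] = [
        'description' => 'Stores OpenAI embedding vectors per entity.',
        'fields' => [
          'entity_id' => [
            'type' => 'int',
            'unsigned' => TRUE,
            'not null' => TRUE,
            'description' => 'The id of the embedded node/paragraph.',
          ],
          'vectors' => [
            'type' => 'blob',
            'size' => 'big',
            'not null' => TRUE,
            'description' => 'Serialized array of floats from the embeddings API.',
          ],
        ],
        'primary key' => ['entity_id'],
      ];
      return $schema;
    }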

Proposed resolution

A submodule of this OpenAI / GPT module which will create and maintain the vector table as well as programmatically create the custom views filter for doing embedded searches.

Remaining tasks

Creating and maintaining the vector table should be relatively simple. It could be done during cron and, of course, during node add/update/delete operations.

The big problem will be determining the mathematical function for the vector search, given all the possible filter conditions: contains, does not contain, equals, less than, and so on.

User interface changes

I don't see any user interface changes. All of the OpenAI calls would be invisible to the end user.

API changes

Clearly we would be adding OpenAI API embedding calls: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

Data model changes

The only real data model change is that we will do searches on vectors instead of keywords.

I'd like to figure out a way to do this with Search API / Search API Attachments as well.

Just an idea.

✨ Feature request
Status

Fixed

Version

1.0

Component

Code

Created by

🇺🇸 United States SomebodySysop

Comments & Activities

  • Issue created by @SomebodySysop
  • 🇺🇸 United States kevinquillen

    This is actually a good idea worth looking into. It is this sort of application of OpenAI that I am interested in: augmenting Drupal in assistive ways.

  • 🇺🇸 United States kevinquillen

    Looking at the API, it does a POST request:

    https://platform.openai.com/docs/api-reference/embeddings/create

    My question: how does that work exactly? What does that mean, for example, for a site's content? I suppose all published content is public, generally speaking, but is there a risk for content that is access protected, unpublished, etc.? Can people create their own private models? If you submit to this API, is your content then in a public model?

  • 🇺🇸 United States SomebodySysop

    Looking at the API, it does a POST request:

    https://platform.openai.com/docs/api-reference/embeddings/create

    My question: how does that work exactly?

    What we are talking about is text vectorization: a numeric representation of the text. https://www.deepset.ai/blog/what-is-text-vectorization-in-nlp. Essentially, it's a way to make it easier for machines, specifically language models, to understand the text.

    So, on a site we have a page with content. If it's a node, that content is stored in the Drupal database with a node id. If it's a paragraph, that paragraph is stored in the Drupal database with a paragraph id. When we run an embedding, we essentially create a vector representation of that page and store it in the database, referring back to its source id (i.e., the node/paragraph text that was vectorized).

    So, for example, when a page is created or updated, we run the API to get the vectors for the page content, then store those vectors with the entity_id of the added/updated page. That table row is what we use to do vector searches on page content (as opposed to SQL keyword searches). We actually run a mathematical function to compare the search vector with the stored vectors, rather than a database search.
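
    Roughly, that save-time flow might look like this sketch (the function, table, and 'openai.client' service names are all assumptions, and the client call mirrors the openai-php usage later in this thread):

      /**
       * Implements hook_ENTITY_TYPE_update() for node entities.
       *
       * Sketch: embed the body text on save and upsert the vector into
       * the hypothetical openai_vectors table.
       */
      function openai_embeddings_node_update(\Drupal\node\NodeInterface $node) {
        // Plain-text source for the embedding.
        $text = strip_tags((string) $node->get('body')->value);

        // Assumption: the parent module exposes its OpenAI client as a service.
        $client = \Drupal::service('openai.client');
        $response = $client->embeddings()->create([
          'model' => 'text-embedding-ada-002',
          'input' => $text,
        ]);
        $vector = $response->toArray()['data'][0]['embedding'];

        // Upsert keyed by entity id so re-saving refreshes the row.
        \Drupal::database()->merge('openai_vectors')
          ->key('entity_id', $node->id())
          ->fields(['vectors' => serialize($vector)])
          ->execute();
      }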

    What does that mean, for example, for a site's content?

    I don't see any change. End users have access to site pages based upon their access level. The vector content is never seen nor accessible by the user. It is only used for searching.

    I suppose all published content is public, generally speaking, but is there a risk for content that is access protected, unpublished, etc.?

    No. The page content and its access remain the same. The embedding process makes API calls to the OpenAI embedding model to retrieve the vectors. That information is not stored in the model. The model simply "vectorizes" the content and sends back the vectors. The page data is never in a position, as far as I can tell, to be seen or accessed outside of this process. Same for the vector information -- unless OpenAI is storing this information to do model training; that, I don't know. So far I've not been able to find anything on embedding security.

    Update: I actually asked the ChatGPT3 module this question, and these are the responses I received: https://sharegpt.com/c/TF7wLOy

    Can people create their own private models? If you submit to this API, is your content then in a public model?

    At this point, there aren't any custom embedding models, at least as far as I know. The available models are listed here: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

    I know that people can create their own private models in fine-tuning, but that's not what I am suggesting here. I am suggesting that we use OpenAI embeddings to vectorize our content in order to perform vector searches that no longer rely on keyword matches to locate relevant data. This is one of the best explanations I have seen on the subject: https://www.youtube.com/watch?v=xzHhZh7F25I&ab_channel=PartTimeLarry

    For larger sites it might run up costs, so perhaps we start by limiting it to a small set of nodes, especially during development.

    To be clear, I am not suggesting this process as a core feature. I would restrict it to an optional submodule of this OpenAI / GPT module. Therefore users who want this enhanced search feature on their data would certainly be willing to pay for the usage.

    Which means that the submodule would require the user's API key. It would also give them the option of selecting the public embedding model they wish to use:

    On this page, I only see one model: https://openai.com/api/pricing/

    Ada $0.0004 / 1K tokens

  • 🇺🇸 United States kevinquillen

    Thanks for the explanation, just wanted to throw some questions out there. I have an idea now.

    Would you be able to contribute to the forked branch? https://git.drupalcode.org/issue/openai-3337774/-/tree/views-embedding

  • 🇺🇸 United States kevinquillen

    To be clear, I am not suggesting this process as a core feature. I would restrict it to an optional submodule of this OpenAI / GPT module. Therefore users who want this enhanced search feature on their data would certainly be willing to pay for the usage.

    Yep, that's the goal for most of these features. I have created an openai_search module in the forked branch. Open to 'better' naming scheme(s).

    Looking through the docs, they combined a dataset into a single row per record in their example:

    https://github.com/openai/openai-cookbook/blob/main/examples/Obtain_data...

    In this case, should it only be entities like nodes, and not ones like Paragraphs? One reason I ask is that Paragraphs, last I checked, are not deleted, for example. Typically there are no such things as orphaned paragraphs either; they (the active ones) are attached to a host. Should the node be loaded into a specific view mode that is just a textual representation of the node content, and then fed to the API?

  • 🇺🇸 United States SomebodySysop

    I actually didn't know that about paragraphs. I know you can migrate them without being attached to nodes, but not that they aren't deleted when their parent nodes are deleted.

    I mentioned nodes and paragraphs because those are the primary entities I am familiar with when working with site text content. Depending upon the application, a site content type could contain one, several, or no body text fields. In my sites, I usually place the content in paragraph body text fields. Other folks use nodes. So, our module would have to accommodate this.

    For example, this field:

    machine name: field_content
    field type: text (formatted, long)
    entity type: paragraph
    entity id: paragraph id
    parent entity: parent entity type (typically a node)
    parent entity id: parent entity's id

    Will be stored in the vector table like this:

    vector entity type: node/paragraph
    vector id: node/paragraph id
    vectors: the actual vector array (stored as serialized blob, as per my understanding)

    I'm not sure if we need the parent entity info. It depends upon how the search is structured -- how we retrieve the target entities for search. In SQL, we would join the vector table on the vector table id = paragraph/node id.

    But, we don't care about the ids until AFTER the search (or in our case, mathematical function) is completed.

    We run a dot product or cosine calculation between the search vector and all rows in the vector table. Then we take the highest matches (1, 2, 3, whatever the user decides), and THEN we retrieve the node/paragraph entities from those results based upon the node/paragraph ids.
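
    In code, that ranking step might look something like this sketch (the table name and the cosine_similarity() helper are placeholders for whatever we settle on):

      // Score every stored vector against the query vector, then keep
      // the top N. cosine_similarity() is a placeholder helper.
      $query_vector = [/* floats returned for the search term */];
      $scores = [];

      $rows = \Drupal::database()
        ->query('SELECT entity_id, vectors FROM {openai_vectors}');
      foreach ($rows as $row) {
        $scores[$row->entity_id] = cosine_similarity($query_vector, unserialize($row->vectors));
      }

      // Highest similarity first; take however many the user decides.
      arsort($scores);
      $top = array_slice($scores, 0, 3, TRUE);

      // Only now do we care about the ids: load the matching entities.
      $entities = \Drupal\node\Entity\Node::loadMultiple(array_keys($top));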

    Should the node be loaded into a specific view mode that is just a textual representation of the node content? Then feed that to the API?

    Good question. Right now, in the Drupal field content tables, the text fields contain the raw HTML text content. The question is: Do we strip the HTML tags from the content before we embed it? Logically, it would seem like we should in order to get the best semantic search results, but it would take someone with more AI embedding knowledge than me to determine that.
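
    If we do strip it, the pre-processing could be as simple as this sketch (whether it actually improves relevance is the open question above):

      // Strip tags and collapse whitespace before sending the text to
      // the embeddings endpoint, so only the semantic content is vectorized.
      $html = $node->get('body')->value;
      $plain = html_entity_decode(strip_tags((string) $html), ENT_QUOTES);
      $plain = trim(preg_replace('/\s+/', ' ', $plain));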

  • 🇺🇸 United States SomebodySysop

    So, right now, with the model we are discussing, our results will bring back full pages. I was thinking through how we might bring back an excerpt instead. Then, thinking beyond that, to using something similar to a chatbot interface.

    Check out this video:

    https://youtu.be/fODk-alDqWw

    What the guy does here is:

    • Creates embeddings for his content. This is the vectorization we've been discussing.
    • Gets the search vector and runs cosine similarity against his stored vectors.
    • Gets the 3 highest results. We talked about this as well.
    • Takes the highest result and then sends it as context for the same question to an OpenAI completion model.
    • The completion model looks at the question within the context and returns an informed answer.

    So now the Drupal user can get ChatGPT3 responses to queries on his/her content without having to fine-tune an OpenAI model.
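
    Those last two steps might look roughly like this with the openai-php client ($client is an initialized client instance; the prompt wording and model choice here are my assumptions, not his code):

      // $context is the text of the highest-scoring embedded page and
      // $question is the user's original query.
      $response = $client->completions()->create([
        'model' => 'text-davinci-003',
        'prompt' => "Answer the question using only the context below.\n\n"
          . "Context:\n$context\n\nQuestion: $question\nAnswer:",
        'max_tokens' => 256,
        'temperature' => 0,
      ]);

      $answer = $response->toArray()['choices'][0]['text'];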

    All done through normal Drupal database transactions and APIs to OpenAI.

    Add audio-to-text and text-to-audio, and now we have a full Drupal bot.

  • 🇺🇸 United States kevinquillen

    stored as serialized blob

    Would this have a performance impact during search at all? I was looking to see if Drupal 10 had support for JSON storage types in the database but could not confirm that yet.

    In either case, yeah, I think the storage will look something like:

    - id (entry id)
    - entity_id
    - entity_type
    - bundle (unsure if needed)
    - field_name (what the vector was generated from)
    - embedding (the vectors)
    - data (raw response from OpenAI)

    with db indexes on type, bundle, field_name. Ex. a node with 4 text (plain or long) fields would have 4 entries.

    The administrative interface also needs a way for a user to opt in which entity types and their bundles to run this on. Would we also need to allow specifying which text fields to run on? Maybe we do them all initially and make opting fields in an enhancement later.

  • 🇺🇸 United States SomebodySysop

    Would this have a performance impact during search at all?

    Compared to current keyword searches? Probably so. We've got to do the API call to get the search vector then the mathematical computation on the vector table -- and possibly a 2nd API call to get a response with context. We won't know just how much of a noticeable impact until we start testing with real data.

    with db indexes on type, bundle, field_name. Ex. a node with 4 text (plain or long) fields would have 4 entries. What about field deltas? Do we need to track ordering?

    Yes. As for field deltas, I don't know that they matter in our scenario.

    As for ordering, remember that the vector computation (dot product, cosine similarity, etc.) will return vector records that we will order according to their value relative to the search vector.

    I think we need to think of it from the standpoint of what we need to get back to the entity we are actually going to display to the end-user. Let's take my case, where all my text content resides in paragraphs.

    When we embed the content, we use the field_content_value (text) from the paragraph__field_content table. Now paragraph entity 11294 has a vector record and id.

    When we search, and paragraph vector 11294 ends up the highest result, we use the paragraphs_item_field_data table to point us to the parent node that we will link users to in the search results.
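
    In code, resolving the winning paragraph back to its host node could be as simple as this sketch (using the Paragraphs API rather than a direct table join):

      // The top-scoring row pointed at paragraph 11294; walk up to the
      // host entity so we can link the user to the parent node.
      $paragraph = \Drupal\paragraphs\Entity\Paragraph::load(11294);
      $parent = $paragraph->getParentEntity();
      if ($parent && $parent->getEntityTypeId() === 'node') {
        $url = $parent->toUrl()->toString();
      }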

    The administrative interface also needs a way for a user to opt in which entity types and their bundles to run this on. Would we also need to allow specifying which text fields to run on? Maybe we do them all initially and make opting fields in an enhancement later.

    In any event, I say let the end-user decide which fields to use with the warning that the more fields that are used, the greater the potential latency.

  • 🇺🇸 United States kevinquillen

    So I started with a basic node type that has a title and a body. For the title, I named it 'Test Article'. The embedding returned for that was 1536 numbers - should it be that many?

  • 🇺🇸 United States SomebodySysop

    Yes. The vector arrays will be very long; that model returns 1536 dimensions per input.

  • @kevinquillen opened merge request.
  • 🇺🇸 United States kevinquillen

    Gotcha. I've put some rough work in here so far:

    https://git.drupalcode.org/issue/openai-3337774/-/tree/views-embedding/m...

    I also added the field delta, not for any other reason than to be able to use a db merge query. I wasn't sure of a more efficient way to ensure the data in the table is not duplicated and stays fresh. Thinking ahead to something like Paragraphs.

    Would it also be possible for the module to suggest taxonomy terms (either existing or not) to classify the text data?

  • 🇺🇸 United States SomebodySysop

    Would it also be possible for the module to suggest taxonomy terms (either existing or not) to classify the text data?

    Yes. Embed and store the taxonomy terms in the same way, then run a dot product / cosine similarity function with the search vector. Same principle: pick the highest results.
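
    Sketching that idea (assuming the storage has grown the entity_type column discussed above, that term labels were embedded into the same table keyed by term id, and that cosine_similarity() is still a placeholder helper):

      // Compare the content's vector against stored term-label vectors
      // and suggest the closest terms.
      $scores = [];
      $rows = \Drupal::database()->query(
        'SELECT entity_id, vectors FROM {openai_vectors} WHERE entity_type = :type',
        [':type' => 'taxonomy_term']
      );
      foreach ($rows as $row) {
        $scores[$row->entity_id] = cosine_similarity($content_vector, unserialize($row->vectors));
      }
      arsort($scores);
      $suggested = \Drupal\taxonomy\Entity\Term::loadMultiple(array_keys(array_slice($scores, 0, 5, TRUE)));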

    By the way, I'm not sure which is better: dot product, cosine similarity, or some other function. Most of the examples I've seen use cosine, but someone knowledgeable recommended dot product to me. Guess we'll see.

  • 🇺🇸 United States kevinquillen

    Since we are building for Drupal 10, it looks like we are now permitted to define schema types as JSON (updated database versions):

          'embedding' => [
            'type' => 'blob',
            'mysql_type' => 'json',
            'pgsql_type' => 'jsonb',
            'size' => 'big',
            'description' => 'The embedding values from OpenAI.',
          ],
          'data' => [
            'type' => 'blob',
            'mysql_type' => 'json',
            'pgsql_type' => 'jsonb',
            'not null' => FALSE,
            'size' => 'big',
            'description' => 'Full response returned from OpenAI.',
          ],
    

    This means values are now stored as JSON documents in the database.

    This should prove fruitful when performing lookups or searches vs. storing as serialized text. This may also mean having to maintain one query service for MySQL and one for PostgreSQL (they vary slightly in this area), but I think it's still worth doing.

  • 🇺🇸 United States kevinquillen

    It may be useful to add an "Embeddings" tab on a node/entity too and list out records in a table display so an admin can see what they have at a glance. This may be useful for monitoring token usage among other things.

  • 🇺🇸 United States kevinquillen

    Here is another example, looking for a specific value (or it could be a set of values):

    SELECT JSON_EXISTS('{"data":[0.004663495]}', '$.data') FROM openai_embeddings oe

  • 🇺🇸 United States kevinquillen

    A couple of things here.

    Calculating cosine similarity in MySQL seems to require creating MySQL stored procedures (pretty much every example I saw did). Many (Drupal) hosts disallow this. The next thing I looked for was a PHP implementation. I adapted this into Drupal:

    https://github.com/mlwmlw/php-cosine-similarity

    To test this, I wrote some example code:

      // $client is the OpenAI PHP client and $sim is the similarity
      // helper adapted from mlwmlw/php-cosine-similarity.
      $result = \Drupal::database()->query('SELECT embedding FROM {openai_embeddings} WHERE id = 1 LIMIT 1');

      // Pull the previously stored vector out of the JSON column.
      foreach ($result as $item) {
        $value = json_decode($item->embedding);
        $compare = $value->data;
      }

      // Embed the new input string for comparison.
      $response = $client->embeddings()->create([
        'model' => 'text-embedding-ada-002',
        'input' => "Creating another test article for embedding comparison",
      ]);

      $embeddings = $response->toArray();

      // Score the new vector against the stored one, per the adapted
      // library's API.
      $dot = $sim::dot($embeddings["data"][0]["embedding"]);
      $score = $sim::cosine($embeddings["data"][0]["embedding"], $compare, $dot);

    I am not entirely sure I am doing that correctly. The value of $compare is the vector generated earlier from the "Test Article" string. The result of ::cosine was 0.49932475388291.

    To retrieve the most relevant documents we use the cosine similarity between the embedding vectors of the query and each document, and return the highest scored documents.

    I guess the issue I see here is how quickly every vector in the database can be iterated over in order to do this. I am imagining cases where there are thousands of entries; each one has to be pulled and run through this function and then sorted highest to lowest, for example.

    I can see they recommend a few engines for vector searching; the only one I recognize is Redis.

  • 🇺🇸 United States SomebodySysop

    This means values are now stored as JSON documents in the database:
    This should prove fruitful when performing lookups or searches vs storing as serialized text.

    The first examples I got from ChatGPT were SQL computations:

    WITH search_vector AS (
      SELECT <search_term_vector_column> AS search_vector
    ),
    vectors AS (
      SELECT <vector_column> AS vector
    )
    SELECT dot_product(search_vector.search_vector, vectors.vector)
    FROM search_vector, vectors

    You're right. This streamlines the process of computing similarities.

    Calculating cosine similarity in MySQL seems to require (pretty much every example I saw) creating MySQL stored procedures. Many (Drupal) hosts disallow this.

    This is definitely outside my knowledge base, but I did get this from an expert on the subject a few weeks ago:

    You don’t need to only use the cosine similarity function. There are many different ways to process the vectors including:

    • Cosine Similarity (same as the dot product for unit vectors like these embeddings)
    • Dot Product
    • Euclidean Distance
    • Manhattan Distance.
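
    For reference, plain-PHP versions of those four measures might look like this (a sketch, not the library code discussed elsewhere in this thread):

      function dot_product(array $a, array $b): float {
        return array_sum(array_map(fn($x, $y) => $x * $y, $a, $b));
      }

      function cosine_similarity(array $a, array $b): float {
        // For unit-length vectors (like these embeddings) this reduces
        // to the plain dot product.
        return dot_product($a, $b) / (sqrt(dot_product($a, $a)) * sqrt(dot_product($b, $b)));
      }

      function euclidean_distance(array $a, array $b): float {
        return sqrt(array_sum(array_map(fn($x, $y) => ($x - $y) ** 2, $a, $b)));
      }

      function manhattan_distance(array $a, array $b): float {
        return array_sum(array_map(fn($x, $y) => abs($x - $y), $a, $b));
      }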

    This is a question for ChatGPT.

  • 🇺🇸 United States SomebodySysop

    I just checked with ChatGPT:

    Unfortunately, the current version of MySQL does not have native support for vector comparison. However, you can compare two vectors by converting the BLOB data type to a string and using string comparison functions such as SUBSTRING or LOCATE to compare the vectors.

  • 🇺🇸 United States kevinquillen

    OpenAI themselves seem to say that the cosine similarity is good enough:

    We recommend cosine similarity. The choice of distance function typically doesn’t matter much.

    Cosine similarity can be computed slightly faster using just a dot product
    Cosine similarity and Euclidean distance will result in the identical rankings

    A little outside my wheelhouse there, though.

  • 🇺🇸 United States kevinquillen

    Hmmm... I better slow down.

    OpenAI\Exceptions\ErrorException: Rate limit reached for default-global-with-image-limits in organization org-7arCWdLwAtl6atQZmkmJcOPf on requests per min. Limit: 60.000000 / min. Current: 70.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method. in OpenAI\Transporters\HttpTransporter->requestObject()

  • 🇺🇸 United States SomebodySysop

    OpenAI\Exceptions\ErrorException: Rate limit reached for default-global-with-image-limits in organization org-xxxx on requests per min. Limit: 60.000000 / min. Current: 70.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method. in OpenAI\Transporters\HttpTransporter->requestObject()

    Yeah, I just watched a video on this recently. I can't remember how the guy worked around it.

    But, this is something end-users will need to be aware of as well. This search capability is not something they will want to randomly make public to just anybody -- unless they are a fairly large organization that can afford the tokens.

    Cosine similarity can be computed slightly faster using just a dot product
    Cosine similarity and Euclidean distance will result in the identical rankings

    I can verify this. Early on I tested cosine similarity and dot product, and the results were pretty similar. Too bad we can't do it in SQL!

  • Assigned to kevinquillen
  • 🇺🇸 United States kevinquillen

    Given those recent items, I think we should split this into newer tickets.

    I think, right now, we have set the foundation. We can store embeddings from text-based fields on any entity (the user entity type is blocked until we can write code to ensure things like passwords or PII are not included). From here, figuring out how to compare them effectively with broad technology (PHP or MySQL) may take some time and effort.

  • 🇺🇸 United States kevinquillen

    OpenAI states that the model we are using will have 1536 output dimensions.

    See:

    for examples... we need to identify a valid, working PHP library to help compare these vectors, or determine whether a MySQL stored procedure can do the same. Most of the library examples I see are for Python. If we cannot locate one, we may have to make our own.

    One of the libs I linked above for example: https://github.com/yooper/php-text-analysis/blob/master/src/Comparisons/...

    This says it is a cosine similarity comparison, but it does not use two vectors of floats. I can't find much else out there. If we do make our own implementation, it will need a good battery of unit tests to ensure accuracy.

  • 🇺🇸 United States kevinquillen

    Merging current effort.

  • Status changed to Fixed
  • 🇺🇸 United States kevinquillen

    Based on our previous comments and #29, I think we should create individual tickets that are the required pieces to get to the end goal.

  • 🇺🇸 United States SomebodySysop

    From here, figuring out how to compare them effectively with broad technology (PHP or MySQL) may take some time and effort.

    Did you see these?

    Anyway, great work! Way more than I anticipated.

    Based on our previous comments and #29, I think we should create individual tickets that are the required pieces to get to the end goal.

    How do we do this? Since you now have the data collection methodology working, I'd like to add file attachments to that. I've used the SolrClient library in the past to make calls directly to the Solr engine. My idea would be to get the Solr file dataset vectors and add them to the same vector table used for nodes and paragraphs. Then searches would retrieve not only site content but also attachment (pdf, docx, txt, etc.) content. I can work on that.

  • 🇺🇸 United States kevinquillen

    Yeah, I noted the first one (https://github.com/mlwmlw/php-cosine-similarity) and questions here:

    https://www.drupal.org/project/openai/issues/3339686 📌 Integrate a service or library to compare vectors of floats Active

    The second one looks like it has not been touched in years, which may not be an issue if it 'works'. Will check through the shared GH repos.

  • 🇺🇸 United States kevinquillen

    In my head, I am figuring that once we figure out the cosine similarity / score generation, this opens up Views, taxonomy, etc. features.

  • 🇺🇸 United States kevinquillen

    Yeah - that's what the parent module uses to communicate with OpenAI, and it provides all the functionality we are leveraging to date.

    I think, ultimately, we are looking at something like this now that I know a little bit more:

    https://www.drupal.org/project/openai/issues/3339686#comment-14911421 📌 Integrate a service or library to compare vectors of floats Active

  • 🇺🇸 United States kevinquillen

    Hop on over to this issue, I think we are on the right track: https://www.drupal.org/project/openai/issues/3339686#comment-14911959 📌 Integrate a service or library to compare vectors of floats Active

  • Automatically closed - issue fixed for 2 weeks with no activity.
