The entity should be inserted as one string and not several entries

Created on 10 November 2023, about 1 year ago
Updated 17 April 2024, 8 months ago

Problem/Motivation

There is a TODO in the code, "The entity should be inserted as one string and not several entries", in EmbeddingQueueWorker. What's the idea there? Instead of one embedding per field, would we render the entity as a whole and insert it as a single embedding?

Suggested tasks:

  1. Review EmbeddingQueueWorker::processItem() method
  2. Instead of looping through fields, consider renderPlain() of the entity itself (e.g. see here)
  3. Reduce metadata to remove field_name and field_delta
  4. Update ::generateUniqueId() to also be entity level rather than field level
  5. Remove ::getFieldTypes()
  6. Update database schema (see openai_embeddings_schema())
  7. Re-index to Pinecone
  8. Check that SearchForm works correctly still
  9. Add an update hook that just returns a warning message like "Indexed content has been switched from indexing per field to indexing the node as a whole. This will result in lower embeddings usage in, for example, Pinecone; however, you must wipe your index and re-index to Pinecone to take advantage of this."
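
Tasks 2 to 4 could look roughly like the following. This is a language-agnostic sketch in Python rather than the module's PHP; `to_plain_text()`, `generate_unique_id()`, `build_record()` and the ID format are hypothetical names chosen for illustration. It only shows the intended shape: one record per entity, with metadata reduced to entity type and entity ID.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(rendered_html):
    """Stand-in for rendering the whole entity and stripping markup."""
    parser = _TextExtractor()
    parser.feed(rendered_html)
    parser.close()
    # Collapse runs of whitespace so the embedded string is compact.
    return " ".join(" ".join(parser.parts).split())


def generate_unique_id(entity_type, entity_id):
    """Entity-level ID (hypothetical format): no field name or delta."""
    return f"entity:{entity_type}:{entity_id}"


def build_record(entity_type, entity_id, rendered_html):
    """One record per entity instead of one record per field value."""
    return {
        "id": generate_unique_id(entity_type, entity_id),
        "text": to_plain_text(rendered_html),
        # Metadata no longer carries field_name or field_delta.
        "metadata": {"entity_type": entity_type, "entity_id": entity_id},
    }


record = build_record("node", 42, "<h1>Title</h1><p>Body  text</p>")
```

Because the ID no longer depends on a field, re-indexing the same entity would overwrite its single vector rather than add new ones.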

Important: Branch off of the branch at https://www.drupal.org/project/openai/issues/3400627 ("Finish VectorClient plugin code", status: Active), as working from anywhere else would result in merge conflicts (i.e. get that branch locally, open a merge request here, pull the changes from that branch into the merge request here, then start working on this).

Steps to reproduce

N/A

Proposed resolution

Decide how to move forward with this

Remaining tasks

TBD

User interface changes

TBD

API changes

TBD

Data model changes

TBD

📌 Task
Status

Active

Version

1.0

Component

OpenAI Embeddings

Created by

🇬🇧 United Kingdom scott_euser


Comments & Activities

  • Issue created by @scott_euser
  • 🇺🇸 United States kevinquillen

    I think because I couldn't determine the best path for supporting several services. Pinecone uses namespaces to split data into smaller buckets, for example. I also wound up thinking maybe you don't need that level of splitting, and just stored one vector.

  • 🇬🇧 United Kingdom scott_euser

    I see, yeah I guess also costs will be lower if 1 rendered node = 1 embedding. Anyways things should become more clear as more vector client plugins are built I expect.

  • 🇬🇧 United Kingdom scott_euser

    Updated task description - a colleague of mine may start working on this (I can do the initial review), so I have tried to provide clearer steps for how we can approach this.

    I am not sure if we should delete search indexed content and re-index as part of a batch update hook, what do you think @kevinquillen? I believe the functionality will continue to work as the SearchForm does not actually care about the field level responses and uses the entity type and entity ID only anyways, so I am suggesting in the task description we have an empty update hook outputting a warning message with advice on re-indexing.
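
To illustrate why search could keep working with either indexing strategy, here is a minimal Python sketch that collapses vector matches down to entities. It assumes each match carries a `score` and a `metadata` dict with `entity_type` and `entity_id`; that shape and the `collapse_to_entities()` helper are hypothetical and not taken from SearchForm itself.

```python
def collapse_to_entities(matches):
    """Keep the best score per (entity_type, entity_id), highest first.

    Extra metadata such as field_name or field_delta is ignored, so
    per-field and per-entity records are handled the same way.
    """
    best = {}
    for match in matches:
        meta = match["metadata"]
        key = (meta["entity_type"], meta["entity_id"])
        if key not in best or match["score"] > best[key]:
            best[key] = match["score"]
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Old-style per-field matches and a new-style per-entity match, mixed.
matches = [
    {"score": 0.71, "metadata": {"entity_type": "node", "entity_id": 1, "field_name": "body"}},
    {"score": 0.85, "metadata": {"entity_type": "node", "entity_id": 1, "field_name": "title"}},
    {"score": 0.80, "metadata": {"entity_type": "node", "entity_id": 2}},
]
results = collapse_to_entities(matches)
```

With per-field records still in the index, an entity can match several times, which is one reason wiping and re-indexing is still the advice.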

  • 🇬🇧 United Kingdom ben.bastow

    Going to have a look and start working on the issue

  • 🇬🇧 United Kingdom ben.bastow

    I'm going to start working on this issue

  • 🇧🇪 Belgium mpp

    "I also wound up thinking maybe you don't need that level of splitting, and just stored one vector."

    While having a single embedding for one document has some benefits (e.g. cheaper, faster), it comes with some challenges:
    - document size may be too large to create a single embedding, hence chunking is needed
    - you could lose meaning/semantics

    Depending on the context & the type of content, you may want to have a different embedding strategy. For instance paragraphs or sentence embeddings may be useful.
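
As a rough illustration of the chunking mentioned above, the following Python sketch splits a long text into overlapping word-based chunks. The `chunk_words()` helper and its default sizes are hypothetical; real embedding models limit input by tokens rather than words, so a tokenizer-aware splitter (or paragraph/sentence splitting) would be needed in practice.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that meaning spanning
    a chunk boundary is not lost entirely.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, max_words=4, overlap=1)
```

Each chunk would then get its own embedding, which trades the lower cost of one vector per entity for better coverage of long documents.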

  • 🇬🇧 United Kingdom scott_euser

    Also, the latest releases (e.g. from OpenAI) support higher (and lower) numbers of dimensions per embedding, which can help handle different use cases; see "Support new embeddings models and dimensions (3-small, 3-large)" (Needs review) - work in progress.

    It's probably never going to be possible to have a one-size-fits-all approach here. A sensible default with options for developers to extend/do something custom/use external module for embedding (like search_api_ai) all help.

    Perhaps we should either mark as outdated, or change the issue summary to reflect a different default if a consensus forms around something other than the current state.
