- Issue created by @scott_euser
- 🇺🇸United States kevinquillen
I think it's because I couldn't determine the best path for supporting several services. Pinecone uses namespaces to split data into smaller buckets, for example. I also wound up thinking maybe you don't need that level of splitting, and just stored one vector.
- 🇬🇧United Kingdom scott_euser
I see, yeah I guess costs will also be lower if 1 rendered node = 1 embedding. Anyway, things should become clearer as more vector client plugins are built, I expect.
- 🇬🇧United Kingdom scott_euser
Updated task description - a colleague of mine may start working on this (I can do initial review), so I have tried to provide clearer steps for how we can approach this.
I am not sure whether we should delete search-indexed content and re-index it as part of a batch update hook; what do you think, @kevinquillen? I believe the functionality will continue to work, since the SearchForm does not actually care about the field-level responses and only uses the entity type and entity ID anyway. So in the task description I am suggesting an empty update hook that outputs a warning message with advice on re-indexing.
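For illustration, the empty-update-hook idea could look roughly like the sketch below. This is an assumption about how it might be written, not the actual patch: the module name (`mymodule`) and update number are placeholders, and the message wording is invented.

```php
<?php

/**
 * Warn site builders that embeddings should be regenerated.
 *
 * Sketch only: module name and update number are hypothetical.
 * Nothing is deleted here, because lookups rely only on entity type
 * and entity ID and therefore keep working against the old index.
 */
function mymodule_update_9001() {
  // A string returned from hook_update_N() is shown to the user
  // after the update runs.
  return t('The embeddings storage has changed. Existing search results will continue to work, but you should clear and re-index your Search API indexes to regenerate embeddings.');
}
```

Returning the message (rather than deleting and re-indexing in a batch) keeps the update hook cheap and leaves the timing of the re-index to the site owner.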
- 🇬🇧United Kingdom ben.bastow
Going to have a look and start working on the issue
- 🇬🇧United Kingdom ben.bastow
I'm going to start working on this issue
- 🇧🇪Belgium mpp
"I also wound up thinking maybe you don't need that level of splitting, and just stored one vector."
While having a single embedding for one document has some benefits (e.g. cheaper, faster), it comes with some challenges:
- document size may be too large to create a single embedding, hence chunking is needed
- you could lose meaning/semantics

Depending on the context and the type of content, you may want a different embedding strategy. For instance, paragraph or sentence embeddings may be useful.
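To make the chunking point concrete, here is a minimal, framework-agnostic sketch of splitting a long document into overlapping windows so each chunk can be embedded separately. The function name, window size, and overlap are illustrative assumptions; a real strategy would likely split on sentence or paragraph boundaries instead of raw character counts.

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows for per-chunk embedding.

    Illustrative sketch only: real chunkers usually respect sentence or
    paragraph boundaries so that semantics are not cut mid-thought.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    step = max_chars - overlap  # advance leaves `overlap` chars of context
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks
```

Each chunk would then be embedded and stored individually, which is where the namespace/bucket question from earlier in the thread comes back in.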
- 🇬🇧United Kingdom scott_euser
Also, the latest OpenAI releases, for example, support higher (and lower) numbers of dimensions per embedding, which can help handle different use cases: ✨ Support new embeddings models and dimensions (3-small, 3-large) Needs review - work in progress.
It's probably never going to be possible to have a one-size-fits-all approach here. A sensible default with options for developers to extend/do something custom/use external module for embedding (like search_api_ai) all help.
Perhaps we should either mark as outdated, or change the issue summary to reflect a different default if a consensus forms around something other than the current state.