- Issue created by @marcus_johansson
- Assigned to seogow
- 🇬🇧United Kingdom yautja_cetanu
Assigning to seogow to take a look once he's explored a working Search API AI module.
- 🇬🇧United Kingdom scott_euser
How about when you set up the Search API Index you select:
- Vector Storage plugin - i.e. what is currently in the open_ai_embeddings module
- Pinecone
- Etc
- Content Chunking plugin - new, selectable when configuring Search API (a rough interface sketch follows this list)
- Minimum chunks - index the entire content with as few chunks as possible https://git.drupalcode.org/project/search_api_ai/-/blob/1.0.x/src/TextCh...
- Chunk per field - index each field separately, chunking within that if needed https://git.drupalcode.org/project/openai/-/blob/1.0.x/modules/openai_em...
- View mode chunks - select one or more view modes, each creating a set of chunks
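As a rough illustration of how such a chunker could be expressed as a plugin, a minimal sketch follows; the interface name and method signatures are hypothetical, not the module's actual API:

```php
<?php

/**
 * Hypothetical contract for a Content Chunking plugin (illustration only).
 */
interface ContentChunkerInterface {

  /**
   * Splits an indexed item's text into chunks ready for embedding.
   *
   * @param string $item_id
   *   The Search API item ID, e.g. "entity:node/1000:en".
   * @param string[] $field_values
   *   Rendered text values keyed by field machine name.
   *
   * @return string[]
   *   Text chunks keyed by chunk ID, e.g. "chunk_per_field:field_body:1".
   */
  public function chunk(string $item_id, array $field_values): array;

  /**
   * Returns the metadata stored alongside every chunk (described below).
   */
  public function getMetaData(string $item_id): array;

}
```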
Each chunking plugin has a getMetaData() method, returning a keyed array.
So the metadata for each might look like the following.
Added to each chunk:
- minimum_chunks <--- or whichever chunker
- search_api_index_id:my_index_name
- search_api_datasource:entity:node
- search_api_item_id:entity:node/1000:en
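A minimal sketch of what such a getMetaData() implementation could return, assuming the hypothetical interface above (class and key names are illustrative only):

```php
<?php

/**
 * Hypothetical "minimum chunks" chunker showing the metadata listed above.
 */
class MinimumChunksChunker implements ContentChunkerInterface {

  public function __construct(
    protected string $indexId,
    protected string $datasourceId,
  ) {}

  public function chunk(string $item_id, array $field_values): array {
    // Concatenate all fields and split into as few chunks as possible.
    // A real implementation would respect the embedding model's max input size.
    $chunks = [];
    foreach (str_split(implode("\n\n", $field_values), 8000) as $delta => $piece) {
      $chunks["minimum_chunks:plain_text:$delta"] = $piece;
    }
    return $chunks;
  }

  public function getMetaData(string $item_id): array {
    return [
      'chunker' => 'minimum_chunks',
      'search_api_index_id' => $this->indexId,
      'search_api_datasource' => $this->datasourceId,
      'search_api_item_id' => $item_id,
    ];
  }

}
```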
Minimum chunks, 3 example chunks:
- minimum_chunks:plain_text:0
- minimum_chunks:plain_text:1
- minimum_chunks:plain_text:2
Chunk per field, 5 example chunks:
- chunk_per_field:title:0
- chunk_per_field:field_body:0
- chunk_per_field:field_body:1
- chunk_per_field:field_body:2
- chunk_per_field:field_additional_info:0
View mode chunks, 5 example chunks:
- view_mode:teaser:0
- view_mode:summary_info:0
- view_mode:summary_info:1
- view_mode:bio_info:0
- view_mode:bio_info:1
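All three schemes above follow the same chunker:source:delta pattern, so a chunk ID could be composed with something as simple as the following (the helper name is hypothetical):

```php
<?php

// Illustrative helper: chunker plugin ID + source (field, view mode, or
// "plain_text") + chunk delta, matching the example IDs listed above.
function build_chunk_id(string $chunker_id, string $source, int $delta): string {
  return "$chunker_id:$source:$delta";
}

assert(build_chunk_id('minimum_chunks', 'plain_text', 2) === 'minimum_chunks:plain_text:2');
assert(build_chunk_id('chunk_per_field', 'field_body', 1) === 'chunk_per_field:field_body:1');
assert(build_chunk_id('view_mode', 'teaser', 0) === 'view_mode:teaser:0');
```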
Regarding this:
"2. Add the embeddings rule as a variable for the chunking process, since this holds important data like max input size and is also where the embeddings call is triggered."
This would just be the vector storage plugin plus the configured number of dimensions for the chosen LLM (like I was adding in "Support new embeddings models and dimensions (3-small, 3-large)").
So, when configuring search api, the site builder must:
- Choose their dimensions from the options available in the LLM
- Choose their vector storage
- Choose their chunker
If any of these change, flag the index for re-indexing.
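A sketch of that re-indexing check, assuming the three choices are stored as plain configuration values (function and key names are hypothetical, not the module's actual API):

```php
<?php

/**
 * Illustrative check: if the dimensions, vector storage, or chunker change,
 * the stored vectors no longer match and the index should be re-indexed.
 */
function embeddings_config_requires_reindex(array $old, array $new): bool {
  foreach (['dimensions', 'vector_storage', 'chunker'] as $key) {
    if (($old[$key] ?? NULL) !== ($new[$key] ?? NULL)) {
      return TRUE;
    }
  }
  return FALSE;
}

// Example: switching from 1536 to 3072 dimensions forces a re-index.
$old = ['dimensions' => 1536, 'vector_storage' => 'pinecone', 'chunker' => 'minimum_chunks'];
$new = ['dimensions' => 3072, 'vector_storage' => 'pinecone', 'chunker' => 'minimum_chunks'];
var_dump(embeddings_config_requires_reindex($old, $new)); // bool(true)
```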
- 🇬🇧United Kingdom seogow
I am starting this right now. The idea is to be in line with the rest of the LLM APIs supported by the AI module. Each Embedding API must either support a default, set by the AI module abstraction, or provide its own defaults, which the module gracefully uses if that API is selected.
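As a rough sketch of that defaults fallback (names are hypothetical, not the AI module's actual API): provider-specific defaults win, and anything the provider does not declare falls back to the abstraction's defaults.

```php
<?php

// Illustrative only: merge an embedding provider's own defaults over the
// defaults defined by the AI module abstraction. PHP's array union operator
// keeps left-hand keys, so provider values take precedence.
function resolve_embedding_defaults(array $provider_defaults, array $abstraction_defaults): array {
  return $provider_defaults + $abstraction_defaults;
}

$abstraction_defaults = ['dimensions' => 1536, 'max_input' => 8191];
$provider_defaults = ['dimensions' => 3072];
print_r(resolve_embedding_defaults($provider_defaults, $abstraction_defaults));
// ['dimensions' => 3072, 'max_input' => 8191]
```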
In the development version of Search API AI for the AI module, I do not plan to have any GUI settings for chunking. I plan for it to work out of the box as a plug-and-play replacement for Solr and DB search. For an index, this means that once it is created, it cannot be changed, though re-indexing may be triggered as usual.
I am happy to take more feature requests once the Alpha is out, but I do not want to chase too many birds at the same time.
I hope the above makes sense.
- Status changed to Fixed
- 🇬🇧United Kingdom scott_euser
We now have embedding strategies, which anyone can extend to do something more advanced or custom. Going to mark this as fixed to give credit.
Automatically closed - issue fixed for 2 weeks with no activity.