Add pre-chunk modification method in Search API AI

Created on 10 June 2024
Updated 24 June 2024

Problem/Motivation

By default, Search API AI does very naive RAG: it simply splits content into chunks of a fixed size, and currently the only technique used is an overlapping sliding window, which is itself hard-coded.

The possibilities when chunking are endless, so building a one-size-fits-all GUI module is impossible, but we should at least open up the most common use cases. Furthermore, it should be possible to use code to inject and manage complex chunking strategies.
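For reference, the current hard-coded behaviour amounts to an overlapping sliding window. A minimal, language-agnostic sketch of that technique (in Python for illustration; the module itself is PHP, and the function name and parameters here are hypothetical, not the module's API):

```python
def sliding_window_chunks(text, chunk_size=500, overlap=100):
    """Split text into fixed-size chunks where each chunk shares
    `overlap` characters with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        # Each chunk starts `step` characters after the previous one,
        # so consecutive chunks overlap by `overlap` characters.
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Text shorter than `chunk_size` yields a single chunk; the final chunk may be shorter than `chunk_size`.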

Proposed resolution

1. Add the whole entity being indexed as a variable for the chunking process, meaning that the user should be able to use any field/field value while indexing.
2. Add the embeddings rule as a variable for the chunking process, since this holds important data like max input size and is also where the embeddings call is triggered.
3. Add the possibility for a module to completely take over the whole chunking procedure. Should this be events, a Search API preprocessor, or both?

API changes

This needs to be set in the Embeddings Search API Data Type.

✨ Feature request
Status

Active

Version

1.0

Component

AI Search

Created by

πŸ‡©πŸ‡ͺGermany Marcus_Johansson


Comments & Activities

  • Issue created by @Marcus_Johansson
  • Assigned to seogow
  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    Assigning to seogow to take a look at when he's explored a working search api ai module.

  • πŸ‡¬πŸ‡§United Kingdom scott_euser

How about, when you set up the Search API Index, you select:

    1. Vector Storage plugin - ie, what is currently in open_ai_embeddings module
      1. Pinecone
      2. Etc
    2. Content Chunking plugin - new, selectable when configuring search api
      1. Minimum chunks - index the entire content with as few chunks as possible https://git.drupalcode.org/project/search_api_ai/-/blob/1.0.x/src/TextCh...
      2. Chunk per field - index each field separately, chunking within that if needed https://git.drupalcode.org/project/openai/-/blob/1.0.x/modules/openai_em...
      3. View mode chunks - select one or more view modes, each creating a set of chunks

Each chunking plugin has a getMetaData() method, returning a keyed array.

So the metadata for each might look like:

Added to each chunk:

    • minimum_chunks <--- or whichever chunker
    • search_api_index_id:my_index_name
    • search_api_datasource:entity:node
    • search_api_item_id:entity:node/1000:en



    Minimum chunks, 3 example chunks

    • minimum_chunks:plain_text:0
    • minimum_chunks:plain_text:1
    • minimum_chunks:plain_text:2



    Chunk per field, 5 example chunks

    • chunk_per_field:title:0
    • chunk_per_field:field_body:0
    • chunk_per_field:field_body:1
    • chunk_per_field:field_body:2
    • chunk_per_field:field_additional_info:0



    View mode chunks, 5 example chunks

    • view_mode:teaser:0
    • view_mode:summary_info:0
    • view_mode:summary_info:1
    • view_mode:bio_info:0
    • view_mode:bio_info:1
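The ID schemes above all follow the same plugin:component:index pattern, numbering chunks per component. A hypothetical sketch of a helper producing them (Python for illustration; the plugin and field names are taken from the examples above, not from any existing API):

```python
def chunk_ids(plugin_id, components):
    """Build chunk IDs of the form plugin:component:index.

    `components` maps a component name (field, view mode, ...) to the
    number of chunks that component produced; chunk numbering restarts
    at 0 for each component, as in the examples above."""
    ids = []
    for component, count in components.items():
        for i in range(count):
            ids.append(f"{plugin_id}:{component}:{i}")
    return ids
```

For instance, `chunk_ids("chunk_per_field", {"title": 1, "field_body": 3, "field_additional_info": 1})` reproduces the five "chunk per field" example IDs listed above.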



    Regarding this:

    2. Add the embeddings rule as a variable for the chunking process, since this holds important data like max input size and is also where the embeddings call is triggered.

This would just be the vector storage plugin + the configured number of dimensions for the chosen LLM (like I was adding in ✨ Support new embeddings models and dimensions (3-small, 3-large), status: Needs review).

    So, when configuring search api, the site builder must:

    1. Choose their dimensions from the options available in the LLM
    2. Choose their vector storage
    3. Choose their chunker

    If any of these changes, flag for re-indexing.
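The "flag for re-indexing on change" rule boils down to comparing the three embedding-relevant settings before and after a config save. A minimal sketch, assuming the settings are stored as a plain mapping (the key names are illustrative, not an existing schema):

```python
def needs_reindex(old_config, new_config,
                  keys=("dimensions", "vector_storage", "chunker")):
    """Return True when any embedding-relevant setting changed,
    meaning existing vectors are stale and the index must be rebuilt."""
    return any(old_config.get(k) != new_config.get(k) for k in keys)
```

Unrelated settings (e.g. a label change) would not trigger a rebuild, since only the listed keys are compared.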

  • πŸ‡¬πŸ‡§United Kingdom scott_euser
  • πŸ‡¬πŸ‡§United Kingdom seogow

    I am starting this right now. The idea is to be in line with the rest of the LLM APIs supported by the AI module. Each Embedding API must either support a default, set by the AI module abstraction, or provide its own defaults, which the module gracefully uses if that API is selected.

    In the development version of the Search API AI for the AI module, I do not plan to have any GUI settings for chunking. I plan for it to work out of the box as a plug-and-play replacement for the Solr and DB search. For an index, this means that once it is created, it cannot be changed, though re-indexing may be triggered as usual.

I am happy to take more feature requests once the Alpha is out, but I do not want to chase too many birds at the same time.

    I hope the above makes sense.
