Add pre-chunk modification method in Search API AI

Created on 10 June 2024
Updated 24 June 2024

Problem/Motivation

By default, Search API AI does very naive RAG: it simply splits content into chunks of a fixed size, and currently the only technique used is an overlapping sliding window, which is itself hard-coded.

The possibilities when chunking are endless, so building a one-size-fits-all GUI module is impossible, but we should at least open up the most common use cases. Furthermore, it should be possible to use code to inject and manage complex chunking strategies.
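For reference, the current hard-coded behaviour amounts to an overlapping sliding window. A minimal, language-agnostic sketch of that technique (in Python for illustration; the module itself is PHP, and the function name and parameters here are hypothetical, not the module's API):

```python
def sliding_window_chunks(text, chunk_size=500, overlap=100):
    """Split text into fixed-size chunks where each chunk shares
    `overlap` characters with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        # Each chunk starts `step` characters after the previous one,
        # so consecutive chunks overlap by `overlap` characters.
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Text shorter than `chunk_size` yields a single chunk; the final chunk may be shorter than `chunk_size`.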

Proposed resolution

1. Add the whole entity being indexed as a variable for the chunking process, meaning that the user should be able to use any field/field value while indexing.
2. Add the embeddings rule as a variable for the chunking process, since this holds important data like max input size and is also where the embeddings call is triggered.
3. Add the possibility for a module to completely take over the whole chunking procedure. Should this be events, a Search API preprocessor, or both?

API changes

This needs to be set in the Embeddings Search API Data Type.

✨ Feature request
Status

Active

Version

1.0

Component

AI Search

Created by

πŸ‡©πŸ‡ͺGermany Marcus_Johansson


Comments & Activities

  • Issue created by @Marcus_Johansson
  • Assigned to seogow
  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    Assigning to seogow to take a look at when he's explored a working search api ai module.

  • πŸ‡¬πŸ‡§United Kingdom scott_euser

How about, when you set up the Search API Index, you select:

    1. Vector Storage plugin - ie, what is currently in open_ai_embeddings module
      1. Pinecone
      2. Etc
    2. Content Chunking plugin - new, selectable when configuring search api
      1. Minimum chunks - index the entire content with as few chunks as possible https://git.drupalcode.org/project/search_api_ai/-/blob/1.0.x/src/TextCh...
      2. Chunk per field - index each field separately, chunking within that if needed https://git.drupalcode.org/project/openai/-/blob/1.0.x/modules/openai_em...
      3. View mode chunks - select one or more view modes, each creating a set of chunks

Each chunking plugin has a getMetaData() method, returning a keyed array.

So the metadata for each might look like:

Added to each chunk:

    • minimum_chunks <--- or whichever chunker
    • search_api_index_id:my_index_name
    • search_api_datasource:entity:node
    • search_api_item_id:entity:node/1000:en



    Minimum chunks, 3 example chunks

    • minimum_chunks:plain_text:0
    • minimum_chunks:plain_text:1
    • minimum_chunks:plain_text:2



    Chunk per field, 5 example chunks

    • chunk_per_field:title:0
    • chunk_per_field:field_body:0
    • chunk_per_field:field_body:1
    • chunk_per_field:field_body:2
    • chunk_per_field:field_additional_info:0



    View mode chunks, 5 example chunks

    • view_mode:teaser:0
    • view_mode:summary_info:0
    • view_mode:summary_info:1
    • view_mode:bio_info:0
    • view_mode:bio_info:1
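The ID schemes above all follow the same plugin:component:index pattern, numbering chunks per component. A hypothetical sketch of a helper producing them (Python for illustration; the plugin and field names are taken from the examples above, not from any existing API):

```python
def chunk_ids(plugin_id, components):
    """Build chunk IDs of the form plugin:component:index.

    `components` maps a component name (field, view mode, ...) to the
    number of chunks that component produced; chunk numbering restarts
    at 0 for each component, as in the examples above."""
    ids = []
    for component, count in components.items():
        for i in range(count):
            ids.append(f"{plugin_id}:{component}:{i}")
    return ids
```

For instance, `chunk_ids("chunk_per_field", {"title": 1, "field_body": 3, "field_additional_info": 1})` reproduces the five "chunk per field" example IDs listed above.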



    Regarding this:

    2. Add the embeddings rule as a variable for the chunking process, since this holds important data like max input size and is also where the embeddings call is triggered.

This would just be the vector storage plugin + the configured number of dimensions for the chosen LLM (like I was adding in ✨ Support new embeddings models and dimensions (3-small, 3-large), status: Needs review).

    So, when configuring search api, the site builder must:

    1. Choose their dimensions from the options available in the LLM
    2. Choose their vector storage
    3. Choose their chunker

    If any of these changes, flag for re-indexing.
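The "flag for re-indexing on change" rule boils down to comparing the three embedding-relevant settings before and after a config save. A minimal sketch, assuming the settings are stored as a plain mapping (the key names are illustrative, not an existing schema):

```python
def needs_reindex(old_config, new_config,
                  keys=("dimensions", "vector_storage", "chunker")):
    """Return True when any embedding-relevant setting changed,
    meaning existing vectors are stale and the index must be rebuilt."""
    return any(old_config.get(k) != new_config.get(k) for k in keys)
```

Unrelated settings (e.g. a label change) would not trigger a rebuild, since only the listed keys are compared.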

  • πŸ‡¬πŸ‡§United Kingdom scott_euser
  • πŸ‡¬πŸ‡§United Kingdom seogow

    I am starting this right now. The idea is to be in line with the rest of the LLM APIs supported by the AI module. Each Embedding API must either support a default, set by the AI module abstraction, or provide its own defaults, which the module gracefully uses if that API is selected.

    In the development version of the Search API AI for the AI module, I do not plan to have any GUI settings for chunking. I plan for it to work out of the box as a plug-and-play replacement for the Solr and DB search. For an index, this means that once it is created, it cannot be changed, though re-indexing may be triggered as usual.

I am happy to take more feature requests once the Alpha is out, but I do not want to chase too many birds at the same time.

    I hope the above makes sense.
