Add Advanced RAG pre chunk modifier with Token support

Created on 30 May 2024

Problem/Motivation

To do advanced RAG, as compared to naive RAG, one of the components is storing the embeddings in a semantically accurate way.

Take this example: you have a content type for poems, with an author, the title of the poem, and the poem itself.

The poem will be chunked into pieces, and you can find context in these chunks, but not in the title or the author.

This means that if you index Sonnet 18 by William Shakespeare and A Boat Beneath a Sunny Sky by Lewis Carroll, and you ask "Could you give me a sentence from a poem by Shakespeare that describes warmth?", you would most likely get back a sentence from Lewis Carroll as the closest embedding, with a low score.

If you instead stored every embedding like this, it would find the correct one:

Author: William Shakespeare
Poem Title: Sonnet 18
Poem Part: 
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date.
Sometime too hot the eye of heaven shines,

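The idea above can be sketched as a small helper (the function and field names here are illustrative, not the module's API): each chunk is stored with its labelled metadata prepended, so an embedding query combining "Shakespeare" and "warmth" lands on the right poem.

```python
def build_chunk_text(metadata: dict[str, str], chunk: str) -> str:
    """Render a chunk with its labelled metadata prepended."""
    lines = [f"{label}: {value}" for label, value in metadata.items()]
    lines.append("Poem Part: ")
    lines.append(chunk)
    return "\n".join(lines)

text = build_chunk_text(
    {"Author": "William Shakespeare", "Poem Title": "Sonnet 18"},
    "Shall I compare thee to a summer's day?",
)
```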
The same applies if, for instance, you want to reference a timestamp in a video or a page in a PDF.

You should also have natural breakpoints that decide when a chunk ends: if you want to index PDFs per page, then chunking them per page makes sense.

Having the possibility to set the maximum chunk size (based on some interface where the embeddings engine declares its minimum and maximum) is also important; different types of content need different concentrations of text.

Having the option to choose how many characters should be prepended and appended to each chunk is also important; in some cases, when we have post-query modification as well, 0 is good enough.
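One way to picture the pre/post overlap described above, as a rough sketch (splitting here is on plain character windows; a real implementation would split on tokens or sentences):

```python
def chunk_with_overlap(text: str, max_size: int, prepend: int, append: int) -> list[str]:
    """Split text into windows of max_size characters, then widen each
    window with `prepend` characters from the previous chunk and
    `append` characters from the next one. With prepend = append = 0
    the chunks are disjoint, which can be enough when post-query
    modification is in place.
    """
    chunks = []
    for start in range(0, len(text), max_size):
        lo = max(0, start - prepend)
        hi = min(len(text), start + max_size + append)
        chunks.append(text[lo:hi])
    return chunks
```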

Proposed resolution

Add a Search API chunk processor (if possible, otherwise a custom form) where you can set up how the embedding should be stored using tokens.

Add a config field for when the chunk should split; it should also accept tokens and offer selectable actions like "changes", or even a regex mode.

Add a config for how big the chunks should be (within the embedding engine's limitations).
Add a config for how many characters to prepend from the previous chunk.
Add a config for how many characters to append from the next chunk.
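The token-driven setup could look roughly like this; the token names (`[node:author]` etc.) are hypothetical stand-ins for whatever Drupal tokens the processor would expose:

```python
# Hypothetical per-index template for how each chunk's embedding text
# is assembled from tokens.
TEMPLATE = "Author: [node:author]\nPoem Title: [node:title]\nPoem Part: \n[chunk:value]"

def render_tokens(template: str, values: dict[str, str]) -> str:
    """Naive token replacement: swap each [token] for its value."""
    for token, value in values.items():
        template = template.replace(f"[{token}]", value)
    return template

embedding_text = render_tokens(TEMPLATE, {
    "node:author": "Lewis Carroll",
    "node:title": "A Boat Beneath a Sunny Sky",
    "chunk:value": "A boat beneath a sunny sky,",
})
```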

📌 Task
Status

Active

Version

1.0

Component

Code

Created by

🇩🇪 Germany marcus_johansson


Comments & Activities

  • Issue created by @marcus_johansson
  • Issue was unassigned.
🇬🇧 United Kingdom andrewbelcher

    So I think there are a couple of things in place already that help with this:

    1. You can index multiple fields in the embeddings, so the author etc. can already be in there. However, at the moment it will just blindly use the value, without any context from the label. That might be a worthwhile option (i.e. prefix with the label). I think ensuring chunking doesn't split up small fields would also be wanted.
    2. You can index the rendered entity in the embeddings, so you can set up an "embedding" display mode, then use entity view displays to set up the exact rendering you want, including/excluding labels as appropriate.
  • πŸ‡©πŸ‡ͺGermany marcus_johansson

    Ah, #1 will work in all cases I can think of if we can get the field label and newlines in there. Newlines shouldn't affect embedding performance in modern models, but having that output returned with newlines helps the LLM better understand what is metadata.

    We should still figure out some way where it's possible to split the chunks on some desired custom attribute (pages, anchor links, timecodes, depending on the content) to have context-aware chunking. Also, managing chunk size and overlap size would still be great to do.
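    The custom-attribute splitting mentioned here could be sketched as a regex breakpoint; the form-feed character standing in for a PDF page boundary is just an assumed example:

```python
import re

def split_on_breakpoint(text: str, pattern: str = r"\f") -> list[str]:
    """Context-aware chunking: split on a custom breakpoint pattern
    (a page marker, an anchor, a timecode line, ...) instead of a
    fixed character count, and drop empty pieces."""
    return [part for part in re.split(pattern, text) if part.strip()]

pages = split_on_breakpoint("Page one text\fPage two text")
```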
