Add Advanced RAG pre chunk modifier with Token support

Created on 30 May 2024

Problem/Motivation

To do advanced RAG, as compared to naive RAG, one of the components is storing the embeddings in a semantically accurate way.

Take this example: you have a content type for poems, with an author, the title of the poem, and the poem itself.

The poem will be chunked into pieces, and you can find context in these chunks, but not in the title or the author.

This means that if you index Sonnet 18 by William Shakespeare and A Boat Beneath a Sunny Sky by Lewis Carroll, and you ask "Could you give me a sentence from a poem by Shakespeare that describes warmth?", you would most likely get back a sentence from Lewis Carroll as the closest embedding, with a low score.

If you instead stored every embedding like this, it would find the correct one:

Author: William Shakespeare
Poem Title: Sonnet 18
Poem Part: 
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date.
Sometime too hot the eye of heaven shines,

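The idea above can be sketched as a small helper (the function and field names here are illustrative, not the module's API): each chunk is stored with its labelled metadata prepended, so an embedding query combining "Shakespeare" and "warmth" lands on the right poem.

```python
def build_chunk_text(metadata: dict[str, str], chunk: str) -> str:
    """Render a chunk with its labelled metadata prepended."""
    lines = [f"{label}: {value}" for label, value in metadata.items()]
    lines.append("Poem Part: ")
    lines.append(chunk)
    return "\n".join(lines)

text = build_chunk_text(
    {"Author": "William Shakespeare", "Poem Title": "Sonnet 18"},
    "Shall I compare thee to a summer's day?",
)
```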
The same applies if, for instance, you want to reference a timestamp in a video or a page in a PDF.

You should also have natural breakpoints that decide when a chunk ends: if you want to index PDFs per page, then chunking them per page makes sense.

Having the possibility to set the maximum chunk size (based on some interface where the embeddings engine declares its minimum and maximum) is also important; different types of content need different concentrations of text.

Having the option to choose how many characters should be prepended and appended to each chunk is also important; in some cases, when we have post-query modification as well, 0 is good enough.
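One way to picture the pre/post overlap described above, as a rough sketch (splitting here is on plain character windows; a real implementation would split on tokens or sentences):

```python
def chunk_with_overlap(text: str, max_size: int, prepend: int, append: int) -> list[str]:
    """Split text into windows of max_size characters, then widen each
    window with `prepend` characters from the previous chunk and
    `append` characters from the next one. With prepend = append = 0
    the chunks are disjoint, which can be enough when post-query
    modification is in place.
    """
    chunks = []
    for start in range(0, len(text), max_size):
        lo = max(0, start - prepend)
        hi = min(len(text), start + max_size + append)
        chunks.append(text[lo:hi])
    return chunks
```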

Proposed resolution

Add a Search API chunk processor (if possible, otherwise a custom form) where you can set up how the embedding should be stored using tokens.

Add a config field for when the chunk should split; it should also accept tokens and offer selectable actions like "changes", or even a regex mode.

Add a config for how big the chunks should be (within the embedding engine's limitations).
Add a config for how many characters to prepend from the previous chunk.
Add a config for how many characters to append from the next chunk.
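The token-driven setup could look roughly like this; the token names (`[node:author]` etc.) are hypothetical stand-ins for whatever Drupal tokens the processor would expose:

```python
# Hypothetical per-index template for how each chunk's embedding text
# is assembled from tokens.
TEMPLATE = "Author: [node:author]\nPoem Title: [node:title]\nPoem Part: \n[chunk:value]"

def render_tokens(template: str, values: dict[str, str]) -> str:
    """Naive token replacement: swap each [token] for its value."""
    for token, value in values.items():
        template = template.replace(f"[{token}]", value)
    return template

embedding_text = render_tokens(TEMPLATE, {
    "node:author": "Lewis Carroll",
    "node:title": "A Boat Beneath a Sunny Sky",
    "chunk:value": "A boat beneath a sunny sky,",
})
```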

📌 Task
Status

Active

Version

1.0

Component

Code

Created by

🇩🇪 Germany marcus_johansson


Comments & Activities

  • Issue created by @marcus_johansson
  • Issue was unassigned.
🇬🇧 United Kingdom andrewbelcher

    So I think there are a couple of things in place already that help with this:

    1. You can index multiple fields in the embeddings, so the author etc. can already be in there. However, at the moment it will just blindly use the value, without any context from the label. That might be a worthwhile option (i.e. prefix with the label). I think ensuring chunking doesn't split up small fields would also be wanted.
    2. You can index the rendered entity in the embeddings, so you can set up an "embedding" display mode, then use entity view displays to set up the exact rendering you want, including/excluding labels as appropriate.
  • πŸ‡©πŸ‡ͺGermany marcus_johansson

    Ah, #1 will work in all cases I can think of if we can get the field label and newlines in there. Newlines shouldn't affect embedding performance in modern models, but having that output returned with newlines helps the LLM better understand what is metadata.

    We should still figure out some way where it's possible to split the chunks on some desired custom attribute (pages, anchor links, timecodes, depending on the content) to have context-aware chunking. Also, managing chunk size and overlap size would still be great to do.
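    The custom-attribute splitting mentioned here could be sketched as a regex breakpoint; the form-feed character standing in for a PDF page boundary is just an assumed example:

```python
import re

def split_on_breakpoint(text: str, pattern: str = r"\f") -> list[str]:
    """Context-aware chunking: split on a custom breakpoint pattern
    (a page marker, an anchor, a timecode line, ...) instead of a
    fixed character count, and drop empty pieces."""
    return [part for part in re.split(pattern, text) if part.strip()]

pages = split_on_breakpoint("Page one text\fPage two text")
```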
