What is the recommended approach for denormalizing items before indexing?

Created on 11 February 2025

For the AI search integration (vector databases), we have a concept of "Chunking", which is where we break content into smaller chunks to facilitate use within AI context limits etc.

We were handling this at the point of indexing, where each item was represented by multiple documents in the vector database. However, this causes problems for large items. Each chunk needs to be sent to an LLM for embedding, and trying to denormalize at this point means everything has to be handled at once, resulting in hitting time limits etc.

A few thoughts I'd had so far:

  1. Using events to modify the tracking: Not sure how viable this is - I couldn't see an obvious way to cover all the points that would need a list of item IDs for a given content ID. It also requires us to handle chunking at the point of tracking, which may not be ideal.
  2. Using custom tracking: This seems viable, but again requires handling the chunking at the point of tracking.
  3. Using a custom data source: Again, this seems pretty viable: have a separate entity (or even just a table) in between, which decouples chunking from content updates. The chunks then become the actual data source, and the item IDs become nice and straightforward. However, it makes the configuration UI a bit more confusing.
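To make option 3 concrete, one possible shape for the chunk item IDs (the format and helper names below are hypothetical, not an existing schema): each row of the intermediate table maps a content ID to a stable chunk ID such as `5:en:3`, and the custom datasource simply hands those IDs to the index.

```python
def chunk_item_id(entity_id: int, langcode: str, chunk_index: int) -> str:
    """Build a stable item ID for one chunk, e.g. "5:en:3"
    (entity ID, language code, chunk index)."""
    return f"{entity_id}:{langcode}:{chunk_index}"


def parse_chunk_item_id(item_id: str) -> tuple[int, str, int]:
    """Inverse of chunk_item_id: recover the entity, language and index."""
    entity_id, langcode, chunk_index = item_id.split(":")
    return int(entity_id), langcode, int(chunk_index)
```

Because the IDs encode the parent entity, invalidating or reloading all chunks for one piece of content is a simple prefix lookup on the intermediate table.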
💬 Support request

Status: Active
Version: 1.0
Component: General code
Created by: 🇬🇧United Kingdom andrewbelcher


Comments & Activities

  • Issue created by @andrewbelcher
  • 🇦🇹Austria drunken monkey Vienna, Austria

    I’m pretty sure creating a custom datasource is the correct and most promising way to implement this. The datasource plugin defines what constitutes an “item” in the index, and if your items aren’t entities but just parts of them, it seems you’ll need a custom datasource plugin for that. Otherwise you’re almost guaranteed to run into conflicts, with some parts of the code assuming item IDs refer to whole entities (or, rather, translations) while other parts are changed to handle just entity chunks.

    I think adapting the configuration UI so it is as clear as possible is a relatively simple task in this context. For example, instead of providing a new datasource you could override the plugin class used by the entity:* datasources and then, if you need both the normal and this special “chunked” functionality, provide an option that switches between indexing whole entities and chunking them.

    One potential pitfall is that it sounds like editing a node to make the body text longer/shorter might change the number of chunks this node will be split into? That would mean issuing trackItemsInserted()/trackItemsDeleted() calls for those chunks in the update hook, which could be a bit confusing. But nothing that can’t be handled, I’d say, as long as you’re aware of it.
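    The bookkeeping described above can be sketched in a language-agnostic way (illustrative Python, not the Search API's PHP API): when an edit changes an entity's chunk count, compare the old and new counts to decide which chunk indices to report as inserted and which as deleted.

    ```python
    def chunk_tracking_delta(
        old_count: int, new_count: int
    ) -> tuple[list[int], list[int]]:
        """Return (inserted_indices, deleted_indices) when an entity's
        chunk count changes from old_count to new_count.

        The inserted list would feed trackItemsInserted() and the deleted
        list trackItemsDeleted() in an update hook (hypothetical mapping).
        """
        if new_count > old_count:
            return list(range(old_count, new_count)), []
        return [], list(range(new_count, old_count))
    ```

    The surviving chunks (indices below both counts) are simply reported as updated, so only the tail of the chunk list ever churns.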

  • Automatically closed - issue fixed for 2 weeks with no activity.
