Add way to "intern" large field item values to reduce database size by 10x to 100x for sites with many entity revisions and/or languages

Created on 20 August 2024
Updated 11 September 2024

Problem/Motivation

This idea was initially proposed as part of ✨ JSON-based data storage proposal for component-based page building Active but is split out into this separate issue per discussion in that issue.

Suppose you have a site with long-form articles, where $node->body->value averages 5KB (~800 words with no formatting, or fewer words if some of those bytes are taken up by HTML tags and attributes). If a given node has:

  • 20 language translations
  • 50 revisions per translation = 1000 total revisions

This adds up to 5KB * 20 * 1000 = 100MB of body field data for just that one node. If you have 1000 such articles, you now have 100GB of body field data alone.

Note that this scales with the square of the number of translations: if you revise one language, you generally also revise the other translations to match, and each of those revisions stores its own copy of the data for every language.

Meanwhile, a lot of this data is duplicated. At a minimum, each revision is storing a copy of the data for every language, including the 19 languages not changed by that revision.

But it's not only the duplication of the non-active translations; consider also the case of a multi-valued field. Suppose instead of a 5KB single-item body field, you have a multi-valued field with 10 items each containing 500 bytes, and for the majority of content edits (new revisions), you only change the value of 1 of those items. Now, in addition to each revision duplicating the field values of non-active translations, it's also duplicating the values of the 9 unedited items.

In other words, for a node with X field items and Y translations, a new revision where only 1 field item in 1 language is modified ends up storing X*Y field item values, where all but one are duplicates from other revisions.

Proposed resolution

Use an approach similar to string interning or hash consing, where instead of storing the large field value in the field item record, you store a hash of it there, and have a separate table that maps the hash value to the large value.
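As a rough sketch of the mechanics (the function names, the field_interned_values table, and the use of xxh128 here are illustrative assumptions, not a proposed API):

    <?php

    // Hash-consing sketch: the field table stores only a short hash, and a
    // separate lookup table stores each distinct large value exactly once,
    // no matter how many revisions or translations reference it.
    function intern_value(\PDO $db, string $value): string {
      // xxh128 (PHP 8.1+) is used for illustration; the hash choice is discussed in the comments.
      $hash = hash('xxh128', $value);

      // Only store the value if this hash hasn't been seen before.
      $seen = $db->prepare('SELECT 1 FROM field_interned_values WHERE hash = ?');
      $seen->execute([$hash]);
      if (!$seen->fetchColumn()) {
        $db->prepare('INSERT INTO field_interned_values (hash, value) VALUES (?, ?)')
          ->execute([$hash, $value]);
      }

      // The caller stores this hash in the field item record instead of the value.
      return $hash;
    }

    function resolve_value(\PDO $db, string $hash): string {
      $stmt = $db->prepare('SELECT value FROM field_interned_values WHERE hash = ?');
      $stmt->execute([$hash]);
      return (string) $stmt->fetchColumn();
    }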

The question then is how to decide when to apply this interning logic and when not to.

  • Option 1: Make it the responsibility of the field type. For example, in addition to our current text_long field type, we could add a text_long_interned field type. And similarly for other field types that can hold large values, like field_union fields and json fields.
  • Option 2: Make it similar to a field storage setting, but one that's managed by the field system rather than by each field type.

Data model changes

Whether done as new field types or as a new field storage setting, it would be opt-in, so no changes to existing sites. However, it's not yet clear whether we can integrate this into Views and EntityQuery in a fully encapsulated way, or whether the decision of whether to intern or not would leak into various other code.

Remaining tasks

Discuss and/or prototype.

✨ Feature request
Status

Needs review

Version

11.0 🔥

Component
Field →


Created by

🇺🇸United States effulgentsia

  • Needs tests



Merge Requests

Comments & Activities

  • Issue created by @effulgentsia
  • 🇺🇸United States effulgentsia

    When I first posted this issue, I didn't know how feasible option 2 (implementing it at a central but opt-in level) would be, but having thought a bit more on it, I think it's quite feasible.

    The key is deciding at what granularity to intern:

    • At the FieldItemList level (meaning store a single hash of the full list)
    • At the FieldItem level (a hash per field item)
    • At the property level (e.g., for a text_long field, intern the value property but not the format property)

    Of these, I think that:

    • At the FieldItem level is the most valuable.
    • At the FieldItemList level would provide significant added value so long as it's in addition to, not instead of, the FieldItem level.
    • I don't think there's much value in interning at the property level, and this would also be the most difficult to implement centrally.

    Given this, here's a rough outline of a possible implementation:

    Let's assume that $node->body is opted into interning via some configuration that's similar to a field storage setting. Then we want to modify the entity storage/schema handler(s) in the following way:

    • Instead of adding the body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
    • We add two other tables: node_revision_interned_item__body and node_revision_interned_item_list__body.
    • The columns for node_revision_interned_item__body would be: entity_id, hash, body_value, body_summary, and body_format.
    • The columns for node_revision_interned_item_list__body would be: entity_id, list_hash, delta, item_hash.

    Now let's say $node->body has two items in it and it's time to save that. What we would write to the database is:

    • For each item, a record in node_revision_interned_item__body mapping the hash of [value, summary, format] to those values.
    • Let's say the hash of item 0 is xxxxxx and the hash of item 1 is yyyyyy; we then calculate a single hash for ['xxxxxx', 'yyyyyy']. Let's say that's zzzzzz. Assuming this is for node 100, what we'd store in node_revision_interned_item_list__body is:
      100, 'zzzzzz', 0, 'xxxxxx'
      100, 'zzzzzz', 1, 'yyyyyy'
      
    • Finally, in the node_revision__body table, we only store a single delta with 'zzzzzz' for the body_hash.

    The entity_id column in the two interned tables isn't strictly necessary, since the hash is enough to uniquely map to the underlying values, but I'm suggesting it to make garbage collection easier. For example, it means when an entity is deleted, we can safely delete all of its interned records. Or, if we want to do more fine-grained garbage collection, we can do bulk/batch operations on a subset of entities at a time.
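    To make that write path concrete, here's a rough sketch (plain PHP, not actual storage handler code; the example values are made up) of the records produced when saving that two-item $node->body:

      <?php

      // Hypothetical two-item body field on node 100; the values are made up.
      $entity_id = 100;
      $items = [
        0 => ['value' => '<p>First item</p>', 'summary' => '', 'format' => 'basic_html'],
        1 => ['value' => '<p>Second item</p>', 'summary' => '', 'format' => 'basic_html'],
      ];

      // One record per item in node_revision_interned_item__body, keyed by the
      // hash of that item's values.
      $item_rows = [];
      $item_hashes = [];
      foreach ($items as $delta => $item) {
        $hash = hash('xxh128', serialize($item));
        $item_hashes[$delta] = $hash;
        $item_rows[] = [
          'entity_id' => $entity_id,
          'hash' => $hash,
          'body_value' => $item['value'],
          'body_summary' => $item['summary'],
          'body_format' => $item['format'],
        ];
      }

      // One list-level hash over the ordered item hashes, with one record per
      // delta in node_revision_interned_item_list__body.
      $list_hash = hash('xxh128', implode(':', $item_hashes));
      $list_rows = [];
      foreach ($item_hashes as $delta => $item_hash) {
        $list_rows[] = [
          'entity_id' => $entity_id,
          'list_hash' => $list_hash,
          'delta' => $delta,
          'item_hash' => $item_hash,
        ];
      }

      // node_revision__body then stores only a single delta whose body_hash is $list_hash.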

  • 🇬🇧United Kingdom longwave UK

    I think this could be simpler if we do this at FieldItem level?

    • Instead of adding the body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
    • We add another table, node_revision__body__interned, with columns body_hash, body_value, body_summary, and body_format.

    At insert/update time we just calculate the hash, create the interned storage row if it doesn't exist already, and store the hash in the revision table.

    Garbage collection can still be done by deleting rows from the interned storage table where the hash does not appear in the revision table; this can be done out-of-band (on cron) because, apart from disk space, it doesn't matter if we have orphaned hashes.
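    For example, the cron-time cleanup could be a single query along these lines (a sketch only, assuming the table and column names above):

      <?php

      // Hypothetical out-of-band garbage collection: remove interned rows whose
      // hash is no longer referenced by any revision. Orphans only cost disk
      // space, so this doesn't need to run on every save.
      \Drupal::database()->query('
        DELETE FROM {node_revision__body__interned}
        WHERE body_hash NOT IN (
          SELECT body_hash FROM {node_revision__body}
        )
      ');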

    I had a thought that this could even be more generalised by having a single interned storage table per field type, which might provide slightly better effective compression ratios in some cases where the same data may appear in multiple fields, but the tradeoff there is that garbage collection is much more difficult (we would have to consider hashes from multiple tables), and I don't think the savings will be worth it.

    Are we also assuming that this will only be used for revision storage? ie. node__body will not use the interned strings?

    This technique could also be useful in a related but slightly different way for the message column in the watchdog table, which can also grow quite large on some sites.

  • 🇺🇸United States effulgentsia

    The MR is looking great! I'm thrilled to be seeing this starting to come to life.

    I think these are the key differences between #3 (what the MR is implementing) and what I suggested in #2:

    • Name the table node_revision__body__interned instead of node_revision_interned_item__body.
      +1 to this. Much better name :)
    • Don't include an entity_id column in node_revision__body__interned.
      The hypothesis being that we can do efficient-enough garbage collection without it. +1 to this; that seems like a reasonable hypothesis.
    • Intern only at the FieldItem level and not add a 2nd level of interning for the FieldItemList level.
      Perhaps this is an okay place to start. My thinking in #2 for wanting to intern at both levels is for something like 📌 [PP-1] Consider not storing the ComponentTreeStructure data type as a JSON blob Active. If we store that as a multi-valued field, each item wouldn't be very large, but there could be a lot of items. And you might have many revision edits where all you're doing is changing prop values and not changing the layout at all. So in that case having each revision only store a single hash instead of a hash per component seems beneficial. But, this is less critical than the FieldItem level for prop values (and non-XB use cases), so I'd be fine with punting the FieldItemList level to a followup.

    Are we also assuming that this will only be used for revision storage? ie. node__body will not use the interned strings?

    Yeah, perhaps I'm wrong, but I'm not foreseeing it having much benefit for the non-revision tables. Where we know duplication happens a lot is across revisions. I don't think duplication across nodes is a significant contributor to database size. And by not doing it for the non-revision tables, it means less disruption to existing Views, EntityQuery, etc. code, except where those things query the revision tables.

    This technique could also be useful in a related but slightly different way for the message column in the watchdog table, which can also grow quite large on some sites.

    Oh that's a cool idea! Separate issue please, but that would be neat if it would help there. Do you think it would make much impact if it's only that column, but not also the variables column?

  • 🇺🇸United States nicxvan

    I think there may be a mistake in your math, or I'm missing something.

    5KB * 20 * 1000 = 100MB
    You already multiplied by the 20 languages to get 1000. This means your body size should be 5KB * 1000, which is 5MB, and 1000 articles is 5GB.

    Still massive, for no particular benefit, but I think the scale is off by an order of magnitude.

  • 🇺🇸United States effulgentsia

    Thanks for checking the math! I appreciate having more eyes on this.

    You already multiplied the 20 languages by 50 revisions to get 1000. Why are you multiplying by 20 again?

    Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision. For example, say you create a brand new node and translate it into 20 languages. Conceptually you've only made one revision per language, but Drupal stores it in the database as 20 revisions, and the 20th revision has records for all 20 languages. If you now edit the node to make a small change (say add another sentence to $node->body), and then edit every translation to make that same change for each language, Drupal will have stored an additional 400 records, because each translation's change gets stored as a new revision, and each of those new revisions copies over the records for every language.

    So if you then make 49 more changes, and translate each of those changes into each language, you'll end up with 1000 revisions and 20,000 records.
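    Put as plain arithmetic (restating the same numbers as the issue summary, nothing new):

      <?php

      $languages = 20;
      $changes = 50;                       // content changes, each translated into every language
      $revisions = $changes * $languages;  // 1,000 revisions in total
      $records = $revisions * $languages;  // 20,000 field records: every revision stores every language
      $bytes = $records * 5 * 1024;        // ~100MB of body data for one node at 5KB per value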

  • 🇬🇧United Kingdom longwave UK

    Some more thoughts on this while I slept on it:

    1. Wondering now if this should be attached to a field type instead of a field storage setting - this really only makes sense for blob or longtext type fields; the hash is 64 chars so the data probably needs to be an order of magnitude larger for this technique to make sense.
    2. If this is to be a field storage setting, is it somehow worth abstracting out the underlying field storage for "base fields", "configurable fields", "interned fields" into plugins and then allowing a storage plugin selection per field? This will need some rules - base fields can only have cardinality 1, interned fields should only be available for fields that use blob/longtext columns, etc, but it would provide more choice to the site builder. Implementing this is probably out of scope here but it is still worth discussing as it may still shape the solution here.
    3. Unsure which hash method to choose, or if it matters that much. I chose SHA256 because we have precedent elsewhere in core and I remember from somewhere that, while shorter, MD5 or SHA1 are discouraged in some cases even if not used as a cryptographic hash.
  • 🇺🇸United States nicxvan

    Ah, I was missing that detail. This will be a great change.

    Will this have any impact on entity reference revisions? Paragraphs already get weird with translations and revisions.

  • 🇬🇧United Kingdom longwave UK

    @nicxvan this should be suitable for any large field in any entity type that stores duplicated data so it will help reduce database size with Paragraphs as well as Experience Builder.

  • 🇬🇧United Kingdom joachim

    > Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision

    What's the reason for the translation/revision system doing this? Is it documented anywhere?

  • 🇺🇸United States effulgentsia

    Not sure if/where it's documented, but I think the reason is so that every revision is complete. For example, to be able to visit en/node/1/revisions/1/view for any language and any revision ID and be able to display that by only loading the data for that revision, regardless of whether the language you're viewing it in matches the language that was edited by that revision.

  • 🇺🇸United States nicxvan

    Great to hear this will help there too!

  • 🇬🇧United Kingdom joachim

    > Not sure if/where it's documented, but I think the reason is so that every revision is complete

    Instead of fiddling at the storage level, we could say that when you load revision N, if it doesn't have a particular language at that revision, you get the largest revision ID of that language < N.

    Though I don't actually think it's very useful for editors to have the system the way it is. If you're the content editor working in German, for instance, your view of your translation is littered with pointless identical revisions. Why should you care if your colleague in the French office made 6 edits to the French translation?

  • 🇬🇧United Kingdom catch

    Hashing: xxHash is probably what we should use these days - non-cryptographic but collision resistant.

    On the overall reason why the storage is like this: like most things it evolved over time and happened somewhat by default. We started off with translations in different entities altogether (D6) and the tnid column. Drupal 7 added language to field tables but without an entity translation UI or support for base fields in core. Drupal 8 added the rest.

    When we load and save an entity, we load and save the entire entity, not a slice of it. We would have had to have implemented per-language revision storage on top of all of the other things to make entity translation work and I don't think it really came up.

    Instead of fiddling at the storage level, we could say that when you load revision N, if it doesn't have a particular language at that revision, you get the largest revision ID of that language < N

    This would need to work for entity revision queries and views too, including revision queries across different languages. It's likely to result in more complex and hard to index queries than what's being discussed here. e.g. here we're looking at an extra join on fields that are rarely included in an entity query. I think selective revision saving/loading would require MAX(vid) for every language or something. Imagine a condition on a field value in any language, or querying with both a translatable field value and a non-translatable field value as conditions. Maybe it could be simplified if the latest revision holds all languages, but then there is workspaces too.

    An advantage of the approach here is it's isolated to field types that are rarely queried, and it would work for monolingual sites where the body is large and the only change is publishing/unpublishing or adding a term reference or similar.

  • 🇬🇧United Kingdom longwave UK

    Back to my thoughts in #8:

    1. While it makes sense to only do this for blob/text fields, I don't think it makes sense to defer storage layout to the field type. I started prototyping this route but it turns out to be even more hacky. I discussed this with @effulgentsia: while this probably only applies to core text fields, JSON fields in contrib, and perhaps a small number of others, there is no benefit to it being a separate field type.
    2. @effulgentsia also reminded me that base fields currently apply to all bundles in an entity. While we could allow fields on the base table to be NULL for bundles that don't use the field, and I've never really understood why e.g. a boolean field of cardinality 1 requires an entirely separate table for its storage, I think this requires more discussion and is out of scope here - if I add test coverage then at least we can refactor at will later if we decide to make field storage pluggable in some way.
    3. xxHash looks good, Symfony uses the 128 bit version of xxHash in places, so let's follow suit here.
  • Status changed to Needs review 18 days ago
  • 🇬🇧United Kingdom longwave UK

    Some test failures around installed config remain, but while I sort those out and add new test coverage I'm marking this as NR to see if anyone has any early comments on the approach so far.

    Will also need an upgrade path for existing config if we go this route.

  • 🇺🇸United States nicxvan

    Is this trading runtime efficiency for storage efficiency?

    I'm concerned that hashing revisions makes saving and loading fields slower for the majority of sites. It also seems it will make querying revisions more complex.

    I have a client with 10 languages and many nodes. When I shared this with them, they raised concerns about the performance trade-off, since CPU is more expensive than storage.

  • 🇺🇸United States effulgentsia

    It's opt-in: a per-field configuration that defaults to false. So on each site you'll be able to choose which fields, if any, to apply it to.

    For fields that opt into it, it adds an indexed JOIN to loading and querying revisions.

    For fields that opt into it, it adds having to compute a hash for every field item for every translation when writing a revision (saving an entity). It's possible that this could be expensive (and we might want to consider future optimizations to track which items/translations are dirty and need re-hashing), but xxHash is very fast (so dirty tracking might not be needed).

  • 🇺🇸United States effulgentsia

    it will help reduce database size with Paragraphs as well as Experience Builder

    Just to set expectations, this might help some with Paragraphs but probably not as much as it will help with Experience Builder. For Paragraphs, it will help with individual fields that have long content (if those fields opt into it, per #19), such as text_long fields that content editors populate with a lot of text. However, for some sites, Paragraphs don't necessarily have a ton of content in any one field, they just have a lot of fields, and in those cases, the work in this issue won't help much with that.

    Because Experience Builder will combine values for all fields within a component into a single JSON field or Field Union field, this issue will help with the "lots of fields with small content" use case as well as the "single fields with large content" use case.

  • 🇺🇸United States effulgentsia

    xxHash looks good, Symfony uses the 128 bit version of xxHash in places, so let's follow suit here

    I agree with 128 bit rather than 64 bit. https://en.wikipedia.org/wiki/Birthday_attack has a table that shows that with 64 bits, you'd have more than a 1 in a million chance of collision once your table grows above 6 million records.

    However, instead of a 32 character hex string, we could reduce it to a 24 character base64 string.
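    For example (hash('xxh128', ...) is available as of PHP 8.1; the sample string is arbitrary):

      <?php

      $value = str_repeat('Lorem ipsum dolor sit amet. ', 200);  // ~5KB of body text

      // 128-bit xxHash as a 32-character hex string.
      $hex = hash('xxh128', $value);

      // The same 16 bytes base64-encoded: 24 characters (22 plus '==' padding).
      $base64 = base64_encode(hash('xxh128', $value, true));

      var_dump(strlen($hex));     // int(32)
      var_dump(strlen($base64));  // int(24)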

  • 🇺🇸United States nicxvan

    It's opt-in: a per-field configuration that defaults to false. So on each site you'll be able to choose which fields, if any, to apply it to.

    That resolves our concerns!

    I missed: https://git.drupalcode.org/project/drupal/-/merge_requests/9417/diffs#83...
