Add way to "intern" large field item values to reduce database size by 10x to 100x for sites with many entity revisions and/or languages

Created on 20 August 2024, 4 months ago
Updated 11 September 2024, 3 months ago

Problem/Motivation

This idea was initially proposed as part of ✨ JSON-based data storage proposal for component-based page building Active, but has been split out into this separate issue per discussion there.

Suppose you have a site with long-form articles, where $node->body->value averages 5KB (~800 words with no formatting, or fewer words if some of those bytes are taken up by HTML tags and attributes). If a given node has:

  • 20 language translations
  • 50 revisions per translation = 1000 total revisions

This adds up to 5KB * 20 * 1000 = 100MB (5KB per record, 20 language records stored per revision, 1000 revisions) of body field data just for that one node. If you have 1000 such articles, you now have 100GB of body field data.

Note that this scales with the square of the number of translations: if you revise content in one language, you generally also revise the other translations to match, and each of those revisions stores its own copy of the data for every language.

Meanwhile, a lot of this data is duplicated. At a minimum, each revision is storing a copy of the data for every language, including the 19 languages not changed by that revision.

The duplication isn't limited to the non-active translations; consider also the case of a multi-valued field. Suppose that instead of a 5KB single-item body field, you have a multi-valued field with 10 items of 500 bytes each, and that for the majority of content edits (new revisions) you only change the value of 1 of those items. Now, in addition to duplicating the field values of the non-active translations, each revision also duplicates the values of the 9 unedited items.

In other words, for a node with X field items and Y translations, a new revision where only 1 field item in 1 language is modified ends up storing X*Y field item values, where all but one are duplicates from other revisions.

Proposed resolution

Use an approach similar to string interning or hash consing, where instead of storing the large field value in the field item record, you store a hash of it there, and have a separate table that maps the hash value to the large value.
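
As a rough sketch of the write path (the table name, column names, and hash algorithm below are placeholders only, not a decided schema):

    // Serialize the large field item value and derive a fixed-size fingerprint.
    $serialized = serialize($item->getValue());
    $hash = hash('sha256', $serialized);

    // Ensure the hash => value mapping exists exactly once (idempotent upsert).
    \Drupal::database()->merge('field_interned_values')
      ->key('value_hash', $hash)
      ->fields(['value_data' => $serialized])
      ->execute();

    // The field item record then stores only the short $hash; reads join back
    // through the mapping table to recover the full value.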

The question then is how to decide when to apply this interning logic and when not to.

  • Option 1: Make it the responsibility of the field type. For example, in addition to our current text_long field type, we could add a text_long_interned field type. And similarly for other field types that can hold large values, like field_union fields and json fields.
  • Option 2: Make it similar to a field storage setting, but one that's managed by the field system rather than by each field type.

Data model changes

Whether done as new field types or as a new field storage setting, it would be opt-in, so no changes to existing sites. However, it's not yet clear whether we can integrate this into Views and EntityQuery in a fully encapsulated way, or whether the decision of whether to intern or not would leak into various other code.

Remaining tasks

Discuss and/or prototype.

✨ Feature request
Status

Needs review

Version

11.0 🔥

Component
Field →

Last updated 35 minutes ago

Created by

🇺🇸United States effulgentsia

  • Needs tests

Merge Requests

Comments & Activities

  • Issue created by @effulgentsia
  • 🇺🇸United States effulgentsia

    When I first posted this issue, I didn't know how feasible option 2 (implementing it at a central but opt-in level) would be, but having thought a bit more on it, I think it's quite feasible.

    The key is deciding at what granularity to intern:

    • At the FieldItemList level (meaning store a single hash of the full list)
    • At the FieldItem level (a hash per field item)
    • At the property level (e.g., for a text_long field, intern the value property but not the format property)

    Of these, I think that:

    • At the FieldItem level is the most valuable.
    • At the FieldItemList level would provide significant added value so long as it's in addition to, not instead of, the FieldItem level.
    • I don't think there's much value in interning at the property level, and it would also be the most difficult to implement centrally.

    Given this, here's a rough outline of a possible implementation:

    Let's assume that $node->body is opted into interning via some configuration that's similar to a field storage setting. Then we want to modify the entity storage/schema handler(s) in the following way:

    • Instead of adding the body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
    • We add two other tables: node_revision_interned_item__body and node_revision_interned_item_list__body (a rough schema sketch is at the end of this comment).
    • The columns for node_revision_interned_item__body would be: entity_id, hash, body_value, body_summary, and body_format.
    • The columns for node_revision_interned_item_list__body would be: entity_id, list_hash, delta, item_hash.

    Now let's say $node->body has two items in it and it's time to save that. What we would write to the database is:

    • For each item, a record in node_revision_interned_item__body mapping the hash of [value, summary, format] to those values.
    • Let's say the hash of item 0 is xxxxxx and the hash of item 1 is yyyyyy; we then calculate a single hash for ['xxxxxx', 'yyyyyy']. Let's say that's zzzzzz. Assuming this is for node 100, what we'd store in node_revision_interned_item_list__body is:
      100, 'zzzzzz', 0, 'xxxxxx'
      100, 'zzzzzz', 1, 'yyyyyy'
      
    • Finally, in the node_revision__body table, we only store a single delta with 'zzzzzz' for the body_hash.

    The entity_id column in the two interned tables isn't strictly necessary, since the hash is enough to uniquely map to the underlying values, but I'm suggesting it to make garbage collection easier. For example, it means when an entity is deleted, we can safely delete all of its interned records. Or, if we want to do more fine-grained garbage collection, we can do bulk/batch operations on a subset of entities at a time.
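
    For reference, here's a rough Schema API sketch of the two tables described above (column types, lengths, and keys are guesses, just to make the shape concrete):

      $schema['node_revision_interned_item__body'] = [
        'fields' => [
          'entity_id' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
          'hash' => ['type' => 'varchar_ascii', 'length' => 64, 'not null' => TRUE],
          'body_value' => ['type' => 'text', 'size' => 'big'],
          'body_summary' => ['type' => 'text', 'size' => 'big'],
          'body_format' => ['type' => 'varchar_ascii', 'length' => 255],
        ],
        'primary key' => ['hash'],
        'indexes' => ['entity_id' => ['entity_id']],
      ];
      $schema['node_revision_interned_item_list__body'] = [
        'fields' => [
          'entity_id' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
          'list_hash' => ['type' => 'varchar_ascii', 'length' => 64, 'not null' => TRUE],
          'delta' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
          'item_hash' => ['type' => 'varchar_ascii', 'length' => 64, 'not null' => TRUE],
        ],
        'primary key' => ['list_hash', 'delta'],
        'indexes' => ['entity_id' => ['entity_id']],
      ];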

  • 🇬🇧United Kingdom longwave UK

    I think this could be simpler if we do this at FieldItem level?

    • Instead of adding the body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
    • We add another table, node_revision__body__interned, with columns body_hash, body_value, body_summary, and body_format.

    At insert/update time we just calculate the hash, create the interned storage row if it doesn't exist already, and store the hash in the revision table.

    Garbage collection can still be done by deleting rows from the interned storage table where the hash does not appear in the revision table; this can be done out-of-band (on cron) because, apart from disk space, it doesn't matter if we have orphaned hashes.
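
    Roughly, that cron cleanup could be a query along these lines (a sketch only, using the table and column names from above):

      // Hashes still referenced by at least one stored revision.
      $in_use = \Drupal::database()->select('node_revision__body', 'r')
        ->fields('r', ['body_hash'])
        ->distinct();

      // Drop interned rows whose hash no longer appears in any revision.
      \Drupal::database()->delete('node_revision__body__interned')
        ->condition('body_hash', $in_use, 'NOT IN')
        ->execute();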

    I had a thought that this could even be more generalised by having a single interned storage table per field type, which might provide slightly better effective compression ratios in some cases where the same data may appear in multiple fields, but the tradeoff there is that garbage collection is much more difficult (we would have to consider hashes from multiple tables), and I don't think the savings will be worth it.

    Are we also assuming that this will only be used for revision storage? I.e. node__body will not use the interned strings?

    This technique could also be useful in a related but slightly different way for the message column in the watchdog table, which can also grow quite large on some sites.

  • 🇺🇸United States effulgentsia

    The MR is looking great! I'm thrilled to be seeing this starting to come to life.

    I think these are the key differences between #3 (what the MR is implementing) and what I suggested in #2:

    • Name the table node_revision__body__interned instead of node_revision_interned_item__body.
      +1 to this. Much better name :)
    • Don't include an entity_id column in node_revision__body__interned.
      The hypothesis being that we can do efficient enough garbage collection without it. +1 to this; that seems like a reasonable hypothesis.
    • Intern only at the FieldItem level and not add a 2nd level of interning for the FieldItemList level.
      Perhaps this is an okay place to start. My thinking in #2 for wanting to intern at both levels is for something like 📌 [PP-1] Consider not storing the ComponentTreeStructure data type as a JSON blob Active. If we store that as a multi-valued field, each item wouldn't be very large, but there could be a lot of items. And you might have many revision edits where all you're doing is changing prop values and not changing the layout at all. So in that case having each revision only store a single hash instead of a hash per component seems beneficial. But, this is less critical than the FieldItem level for prop values (and non-XB use cases), so I'd be fine with punting the FieldItemList level to a followup.

    Are we also assuming that this will only be used for revision storage? I.e. node__body will not use the interned strings?

    Yeah, perhaps I'm wrong but I'm not foreseeing it having much benefit for the non-revision tables. Where we know duplication happens a lot is across revisions. I don't think duplication across nodes is a significant contributor to database size. And by not doing it for the non-revision tables, it means less disruption to existing Views, EntityQuery, etc. code, except where those things query the revision tables.

    This technique could also be useful in a related but slightly different way for the message column in the watchdog table, which can also grow quite large on some sites.

    Oh that's a cool idea! Separate issue please, but that would be neat if it would help there. Do you think it would make much impact if it's only that column, but not also the variables column?

  • 🇺🇸United States nicxvan

    I think there may be a mistake in your math, or I'm missing something.

    5KB * 20 * 1000 = 100MB
    You already multiplied by the 20 languages to get 1000. This means your body size should be 5KB * 1000, which is 5MB, and 1000 articles is 5GB.

    Still massive, for no particular benefit, but I think the scale is off by an order of magnitude.

  • 🇺🇸United States effulgentsia

    Thanks for checking the math! I appreciate having more eyes on this.

    You already multiplied the 20 languages by 50 revisions to get 1000. Why are you multiplying by 20 again?

    Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision. For example, say you create a brand new node and translate it into 20 languages. Conceptually you've only made one revision per language, but Drupal stores it in the database as 20 revisions, and the 20th revision has records for all 20 languages. If you now edit the node to make a small change (say add another sentence to $node->body), and then edit every translation to make that same change for each language, Drupal will have stored an additional 400 records, because each translation's change gets stored as a new revision, and each of those new revisions copies over the records for every language.

    So if you make 48 more such changes (49 rounds of edits in total after the initial creation), each translated into every language, you'll end up with 1000 revisions and 20,000 records.
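
    To make that concrete (plain PHP arithmetic, using the numbers from the issue summary):

      $languages = 20;
      $changes_per_language = 50;    // initial creation + 49 later rounds of edits
      $bytes_per_record = 5 * 1024;  // ~5KB of body text per field record

      // Each change to each translation creates a new revision...
      $revisions = $languages * $changes_per_language;  // 1,000 revisions
      // ...and every revision stores a body record for every language.
      $records = $revisions * $languages;               // 20,000 records
      // ~98MB for this one node, i.e. the ~100MB figure in the issue summary.
      $megabytes = round($records * $bytes_per_record / (1024 * 1024));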

  • 🇬🇧United Kingdom longwave UK

    Some more thoughts on this after sleeping on it:

    1. Wondering now if this should be attached to a field type instead of a field storage setting - this really only makes sense for blob or longtext type fields; the hash is 64 chars so the data probably needs to be an order of magnitude larger for this technique to make sense.
    2. If this is to be a field storage setting, is it somehow worth abstracting out the underlying field storage for "base fields", "configurable fields", "interned fields" into plugins and then allowing a storage plugin selection per field? This will need some rules - base fields can only have cardinality 1, interned fields should only be available for fields that use blob/longtext columns, etc, but it would provide more choice to the site builder. Implementing this is probably out of scope here but it is still worth discussing as it may still shape the solution here.
    3. Unsure which hash method to choose, or if it matters that much. I chose SHA256 because we have precedent elsewhere in core and I remember from somewhere that, while shorter, MD5 or SHA1 are discouraged in some cases even if not used as a cryptographic hash.
  • 🇺🇸United States nicxvan

    Ah, I was missing that detail. This will be a great change.

    Will this have any impact on entity reference revisions? Paragraphs already get weird with translations and revisions.

  • 🇬🇧United Kingdom longwave UK

    @nicxvan this should be suitable for any large field in any entity type that stores duplicated data so it will help reduce database size with Paragraphs as well as Experience Builder.

  • 🇬🇧United Kingdom joachim

    > Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision

    What's the reason for the translation/revision system doing this? Is it documented anywhere?

  • 🇺🇸United States effulgentsia

    Not sure if/where it's documented, but I think the reason is so that every revision is complete. For example, to be able to visit en/node/1/revisions/1/view for any language and any revision ID and display it by only loading the data for that revision, regardless of whether the language you're viewing it in matches the language that was edited by that revision.

  • 🇺🇸United States nicxvan

    Great to hear this will help there too!

  • 🇬🇧United Kingdom joachim

    > Not sure if/where it's documented, but I think the reason is so that every revision is complete

    Instead of fiddling at the storage level, we could say that when you load revision N, if it doesn't have a particular language at that revision, you get the largest revision ID of that language < N.

    Though I don't actually think it's very useful for editors to have the system the way it is. If you're the content editor working in German, for instance, your view of your translation is littered with pointless identical revisions. Why should you care if your colleague in the French office made 6 edits to the French translation?

  • 🇬🇧United Kingdom catch

    Hashing: xxHash is probably what we should use these days - non-cryptographic but collision resistant.

    On the overall reason why the storage is like this, like most things it evolved over time and happened somewhat by default. We started off with translations in different entities altogether (D6) and the tnid column. Drupal 7 added language to field tables but without an entity translation UI or support for base fields in core. Drupal 8 added the rest.

    When we load and save an entity, we load and save the entire entity, not a slice of it. We would have had to have implemented per-language revision storage on top of all of the other things to make entity translation work and I don't think it really came up.

    Instead of fiddling at the storage level, we could say that when you load revision N, if it doesn't have a particular language at that revision, you get the largest revision ID of that language

    This would need to work for entity revision queries and views too, including revision queries across different languages. It's likely to result in more complex and hard-to-index queries than what's being discussed here; e.g. here we're looking at an extra join on fields that are rarely included in an entity query. I think selective revision saving/loading would require MAX(vid) for every language or something. Imagine a condition on a field value in any language, or querying with both a translatable field value and a non-translatable field value as conditions. Maybe it could be simplified if the latest revision holds all languages, but then there is Workspaces too.

    An advantage of the approach here is it's isolated to field types that are rarely queried, and it would work for monolingual sites where the body is large and the only change is publishing/unpublishing or adding a term reference or similar.

  • 🇬🇧United Kingdom longwave UK

    Back to my thoughts in #8:

    1. While it makes sense to only do this for blob/text fields, I don't think it makes sense to defer storage layout to the field type. I started prototyping this route but it turns out to be even more hacky. Discussed with @effulgentsia, and while this probably only applies to core text fields, JSON fields in contrib, and perhaps a small number of others, there is no benefit to it being a separate field type.
    2. @effulgentsia also reminded me that base fields currently apply to all bundles in an entity. While we could allow fields on the base table to be NULL for bundles that don't use the field, and I've never really understood why e.g. a boolean field of cardinality 1 requires an entirely separate table for its storage, I think this requires more discussion and is out of scope here - if I add test coverage then at least we can refactor at will later if we decide to make field storage pluggable in some way.
    3. xxHash looks good, Symfony uses the 128 bit version of xxHash in places, so let's follow suit here.
  • Status changed to Needs review 3 months ago
  • 🇬🇧United Kingdom longwave UK

    There are some test failures around installed config, but while I sort those out and add new test coverage, I'm marking this as NR to see if anyone has any early comments on the approach so far.

    Will also need an upgrade path for existing config if we go this route.

  • 🇺🇸United States nicxvan

    Is this trading runtime efficiency for storage efficiency?

    I'm concerned this makes saving and loading fields slower for the majority of sites, because revisions have to be hashed. It also seems it will make querying revisions more complex.

    I have a client with 10 languages and many nodes. When I shared this with them, they raised concerns about the performance trade-off, since CPU is more expensive than storage.

  • 🇺🇸United States effulgentsia

    It's opt-in: a per-field configuration that defaults to false. So on each site you'll be able to choose which fields, if any, to apply it to.

    For fields that opt into it, it adds an indexed JOIN to loading and querying revisions.
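
    For illustration, that revision-load path would look roughly like the following (a sketch; the table and column names follow the earlier comments and aren't final):

      $query = \Drupal::database()->select('node_revision__body', 'r');
      // The extra (indexed) join that interning adds to revision loads and queries.
      $query->join('node_revision__body__interned', 'i', 'i.body_hash = r.body_hash');
      $query->fields('r', ['entity_id', 'revision_id', 'langcode', 'delta'])
        ->fields('i', ['body_value', 'body_summary', 'body_format'])
        ->condition('r.revision_id', $revision_id);
      $rows = $query->execute()->fetchAll();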

    For fields that opt into it, it adds having to compute a hash for every field item for every translation when writing a revision (saving an entity). It's possible that this could be expensive (and we might want to consider future optimizations to track which items/translations are dirty and need re-hashing), but xxHash is very fast (so dirty tracking might not be needed).

  • 🇺🇸United States effulgentsia

    it will help reduce database size with Paragraphs as well as Experience Builder

    Just to set expectations, this might help some with Paragraphs but probably not as much as it will help with Experience Builder. For Paragraphs, it will help with individual fields that have long content (if those fields opt into it, per #19), such as text_long fields that content editors populate with a lot of text. However, on some sites Paragraphs don't necessarily have a ton of content in any one field; they just have a lot of fields, and in those cases the work in this issue won't help much.

    Because Experience Builder will combine values for all fields within a component into a single JSON field or Field Union field, this issue will help with the "lots of fields with small content" use case as well as the "single fields with large content" use case.

  • 🇺🇸United States effulgentsia

    xxHash looks good, Symfony uses the 128 bit version of xxHash in places, so let's follow suit here

    I agree with 128 bit rather than 64 bit. https://en.wikipedia.org/wiki/Birthday_attack has a table that shows that with 64 bits, you'd have more than a 1 in a million chance of collision once your table grows above 6 million records.

    However, instead of a 32-character hex string, we could reduce it to a 24-character base64 string.
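
    For example (PHP 8.1+ ships the xxh128 algorithm in ext/hash; exactly what gets serialized and hashed is still open):

      // 128-bit xxHash of the serialized field item, as 16 raw bytes.
      $raw = hash('xxh128', serialize($item->getValue()), TRUE);

      // 24 base64 characters (including '==' padding) instead of 32 hex characters.
      $hash = base64_encode($raw);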

  • 🇺🇸United States nicxvan

    It's opt-in: a per-field configuration that defaults to false. So on each site you'll be able to choose which fields, if any, to apply it to.

    That resolves our concerns!

    I missed: https://git.drupalcode.org/project/drupal/-/merge_requests/9417/diffs#83...

  • 🇺🇸United States smustgrave

    Seems the MR needs a manual rebase.
