- Issue created by @effulgentsia
- 🇺🇸 United States effulgentsia
When I first posted this issue, I didn't know how feasible option 2 (implementing it at a central but opt-in level) would be, but having thought a bit more on it, I think it's quite feasible.
The key is deciding at what granularity to intern:
- At the FieldItemList level (meaning store a single hash of the full list)
- At the FieldItem level (a hash per field item)
- At the property level (e.g., for a text_long field, intern the value property but not the format property)
Of these, I think that:
- At the FieldItem level is the most valuable.
- At the FieldItemList level would provide significant added value so long as it's in addition to, not instead of, the FieldItem level.
- I don't think there's much value in interning at the property level, and it would also be the most difficult to implement centrally.
Given this, here's a rough outline of a possible implementation:
Let's assume that $node->body is opted into interning via some configuration that's similar to a field storage setting. Then we want to modify the entity storage/schema handler(s) in the following way (see the schema sketch after this list):
- Instead of adding the body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
- We add two other tables: node_revision_interned_item__body and node_revision_interned_item_list__body.
- The columns for node_revision_interned_item__body would be: entity_id, hash, body_value, body_summary, and body_format.
- The columns for node_revision_interned_item_list__body would be: entity_id, list_hash, delta, and item_hash.
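A minimal sketch of what those two tables could look like as Drupal schema arrays. This is only to make the column layout above concrete; the column types, lengths, and keys are assumptions, not something decided in this issue:

```php
<?php

// Hypothetical schema for the per-item interned table.
$schema['node_revision_interned_item__body'] = [
  'fields' => [
    'entity_id' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'hash' => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
    'body_value' => ['type' => 'text', 'size' => 'big'],
    'body_summary' => ['type' => 'text', 'size' => 'big'],
    'body_format' => ['type' => 'varchar_ascii', 'length' => 255],
  ],
  'primary key' => ['hash'],
  'indexes' => ['entity_id' => ['entity_id']],
];

// Hypothetical schema for the per-item-list interned table.
$schema['node_revision_interned_item_list__body'] = [
  'fields' => [
    'entity_id' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'list_hash' => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
    'delta' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'item_hash' => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
  ],
  'primary key' => ['list_hash', 'delta'],
  'indexes' => ['entity_id' => ['entity_id']],
];
```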
Now let's say $node->body has two items in it and it's time to save that. What we would write to the database is (rough hashing sketch after this list):
- For each item, a record in node_revision_interned_item__body mapping the hash of [value, summary, format] to those values.
- Let's say the hash of item 0 is xxxxxx and the hash of item 1 is yyyyyy; then we calculate a single hash for ['xxxxxx', 'yyyyyy']. Let's say that's zzzzzz. Assuming this is for node 100, what we'd store in node_revision_interned_item_list__body is:
  100, 'zzzzzz', 0, 'xxxxxx'
  100, 'zzzzzz', 1, 'yyyyyy'
- Finally, in the node_revision__body table, we only store a single delta with 'zzzzzz' for the body_hash.
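Here is one way the two hash levels could be derived at save time. The use of xxh128 (available in PHP 8.1+) and json_encode() as the canonical serialization is an assumption for illustration; the thread only settles on a hash algorithm later:

```php
<?php

// Hypothetical: two body items about to be saved on a revision.
$items = [
  ['value' => '<p>First item</p>', 'summary' => '', 'format' => 'basic_html'],
  ['value' => '<p>Second item</p>', 'summary' => '', 'format' => 'basic_html'],
];

// One hash per field item, covering all of its stored properties.
$item_hashes = [];
foreach ($items as $delta => $item) {
  $item_hashes[$delta] = hash('xxh128', json_encode($item));
}

// One hash for the ordered list of item hashes; this is what the
// revision table would store as body_hash.
$list_hash = hash('xxh128', json_encode($item_hashes));
```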
The entity_id column in the two interned tables isn't strictly necessary, since the hash is enough to uniquely map to the underlying values, but I'm suggesting it to make garbage collection easier. For example, it means that when an entity is deleted, we can safely delete all of its interned records. Or, if we want to do more fine-grained garbage collection, we can do bulk/batch operations on a subset of entities at a time.
- 🇬🇧 United Kingdom longwave UK
I think this could be simpler if we do this at FieldItem level?
- Instead of adding the body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
- We add another table, node_revision__body__interned, with columns body_hash, body_value, body_summary, and body_format.
At insert/update time we just calculate the hash, create the interned storage row if it doesn't exist already, and store the hash in the revision table.
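As a rough sketch of that write path using Drupal's database API, with the table and column names proposed above (the hash algorithm and serialization are placeholders, not necessarily what the MR does):

```php
<?php

// Hypothetical write path for a single field item.
$item = ['value' => '<p>Body text</p>', 'summary' => '', 'format' => 'basic_html'];
$hash = hash('xxh128', json_encode($item));

// Create the interned row if it doesn't exist yet; reuse it otherwise.
\Drupal::database()->merge('node_revision__body__interned')
  ->keys(['body_hash' => $hash])
  ->fields([
    'body_value' => $item['value'],
    'body_summary' => $item['summary'],
    'body_format' => $item['format'],
  ])
  ->execute();

// The revision field table then only stores $hash in its body_hash column.
```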
Garbage collection can still be done by deleting rows from the interned storage table where the hash does not appear in the revision table; this can be done out-of-band (on cron) because, apart from disk space, it doesn't matter if we have orphaned hashes.
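A sketch of that cron-time cleanup, again assuming the hypothetical table names from this comment (Drupal's delete query accepts a select query as the value of a NOT IN condition):

```php
<?php

// Hypothetical garbage collection: remove interned rows whose hash is no
// longer referenced by any revision. Orphans are harmless until this runs.
$connection = \Drupal::database();

$used_hashes = $connection->select('node_revision__body', 'r')
  ->fields('r', ['body_hash']);

$connection->delete('node_revision__body__interned')
  ->condition('body_hash', $used_hashes, 'NOT IN')
  ->execute();
```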
I had a thought that this could even be more generalised by having a single interned storage table per field type, which might provide slightly better effective compression ratios in some cases where the same data may appear in multiple fields, but the tradeoff there is that garbage collection is much more difficult (we would have to consider hashes from multiple tables), and I don't think the savings will be worth it.
Are we also assuming that this will only be used for revision storage? ie. node__body will not use the interned strings?
This technique could also be useful in a related but slightly different way for the message column in the watchdog table, which can also grow quite large on some sites.
- 🇺🇸 United States effulgentsia
The MR is looking great! I'm thrilled to be seeing this starting to come to life.
I think these are the key differences between #3 (what the MR is implementing) and what I suggested in #2:
- Name the table node_revision__body__interned instead of node_revision_interned_item__body.
  +1 to this. Much better name :)
- Don't include an entity_id column in node_revision__body__interned.
  The hypothesis here being that we can do efficient enough garbage collection without it. +1 to this; that seems like a reasonable hypothesis.
- Intern only at the FieldItem level and not add a 2nd level of interning for the FieldItemList level.
  Perhaps this is an okay place to start. My thinking in #2 for wanting to intern at both levels is for something like [PP-1] Consider not storing the ComponentTreeStructure data type as a JSON blob. If we store that as a multi-valued field, each item wouldn't be very large, but there could be a lot of items. And you might have many revision edits where all you're doing is changing prop values and not changing the layout at all. So in that case, having each revision only store a single hash instead of a hash per component seems beneficial. But this is less critical than the FieldItem level for prop values (and non-XB use cases), so I'd be fine with punting the FieldItemList level to a followup.
Are we also assuming that this will only be used for revision storage? ie. node__body will not use the interned strings?
Yeah, perhaps I'm wrong but I'm not foreseeing it having much benefit for the non-revision tables. Where we know duplication happens a lot is across revisions. I don't think duplication across nodes is a significant contributor to database size. And by not doing it for the non-revision tables, it means less disruption to existing Views, EntityQuery, etc. code, except where those things query the revision tables.
This technique could also be useful in a related but slightly different way for the message column in the watchdog table, which can also grow quite large on some sites.
Oh that's a cool idea! Separate issue please, but that would be neat if it would help there. Do you think it would make much impact if it's only that column, but not also the variables column?
- 🇺🇸 United States nicxvan
I think there may be a mistake in your math, or I'm missing something.
5KB * 20 * 1000 = 100MB
You already multiplied by the 20 languages to get 1000. This means your body size should be 5KB * 1000, which is 5MB, and 1000 articles is 5GB. Still massive, for no particular benefit, but I think the scale is off by an order of magnitude.
- 🇺🇸 United States effulgentsia
Thanks for checking the math! I appreciate having more eyes on this.
You already multiplied the 20 languages by 50 revisions to get 1000. Why are you multiplying by 20 again?
Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision. For example, say you create a brand new node and translate it into 20 languages. Conceptually you've only made one revision per language, but Drupal stores it in the database as 20 revisions, and the 20th revision has records for all 20 languages. If you now edit the node to make a small change (say, add another sentence to $node->body), and then edit every translation to make that same change for each language, Drupal will have stored an additional 400 records, because each translation's change gets stored as a new revision, and each of those new revisions copies over the records for every language.
So if you then make 49 more changes, and translate each of those changes into each language, you'll end up with 1000 revisions and 20,000 records.
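Just to make that compounding explicit, a tiny worked version of the example numbers above (50 changes, 20 languages):

```php
<?php

$languages = 20;
$changes = 50; // Each change is then re-made in every translation.

$revisions = $changes * $languages;        // 1000 revisions
$field_records = $revisions * $languages;  // 20,000 rows in the revision field table
```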
- 🇬🇧 United Kingdom longwave UK
Some more thoughts on this while I slept on it:
- Wondering now if this should be attached to a field type instead of a field storage setting - this really only makes sense for blob or longtext type fields; the hash is 64 chars so the data probably needs to be an order of magnitude larger for this technique to make sense.
- If this is to be a field storage setting, is it somehow worth abstracting out the underlying field storage for "base fields", "configurable fields", "interned fields" into plugins and then allowing a storage plugin selection per field? This will need some rules - base fields can only have cardinality 1, interned fields should only be available for fields that use blob/longtext columns, etc, but it would provide more choice to the site builder. Implementing this is probably out of scope here but it is still worth discussing as it may still shape the solution here.
- Unsure which hash method to choose, or if it matters that much. I chose SHA256 because we have precedent elsewhere in core and I remember from somewhere that, while shorter, MD5 or SHA1 are discouraged in some cases even if not used as a cryptographic hash.
- 🇺🇸 United States nicxvan
Ah, I was missing that detail. This will be a great change.
Will this have any impact on entity reference revisions? Paragraphs already get weird with translations and revisions.
- 🇬🇧 United Kingdom longwave UK
@nicxvan this should be suitable for any large field in any entity type that stores duplicated data so it will help reduce database size with Paragraphs as well as Experience Builder.
- 🇬🇧 United Kingdom joachim
> Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision
What's the reason for the translation/revision system doing this? is it documented anywhere?
- 🇺🇸 United States effulgentsia
Not sure if/where it's documented, but I think the reason is so that every revision is complete. For example, to be able to visit en/node/1/revisions/1/view for any language and any revision ID and be able to display that by only loading the data for that revision, regardless of whether the language you're viewing it in matches the language that was edited by that revision.
- 🇬🇧 United Kingdom joachim
> Not sure if/where it's documented, but I think the reason is so that every revision is complete
Instead of fiddling at the storage level, we could say that when you load revision N, if it doesn't have a particular language at that revision, you get the largest revision ID of that language < N.
Though I don't actually think it's very useful for editors to have the system the way it is. If you're the content editor working in German, for instance, your view of your translation is littered with pointless identical revisions. Why should you care if your colleague in the French office made 6 edits to the French translation?
- 🇬🇧 United Kingdom catch
Hashing: xxHash is probably what we should use these days - non cryptographic but collision resistant.
On the overall reason why the storage is like this: like most things it evolved over time and happened somewhat by default. We started off with translations in different entities altogether (D6) and the tnid column. Drupal 7 added language to field tables but without an entity translation UI or support for base fields in core. Drupal 8 added the rest.
When we load and save an entity, we load and save the entire entity, not a slice of it. We would have had to have implemented per-language revision storage on top of all of the other things to make entity translation work and I don't think it really came up.
Instead of fiddling at the storage level, we could say that when you load revision N, if it doesn't have a particular language at that revision, you get the largest revision ID of that language < N
This would need to work for entity revision queries and views too, including revision queries across different languages. It's likely to result in more complex and hard to index queries than what's being discussed here. e.g. here we're looking at an extra join on fields that are rarely included in an entity query. I think selective revision saving/loading would require MAX(vid) for every language or something. Imagine a condition on a field value in any language, or querying with both a translatable field value and a non-translatable field value as conditions. Maybe it could be simplified if the latest revision holds all languages, but then there is workspaces too.
An advantage of the approach here is it's isolated to field types that are rarely queried, and it would work for monolingual sites where the body is large and the only change is publishing/unpublishing or adding a term reference or similar.
- 🇬🇧 United Kingdom longwave UK
Back to my thoughts in #8:
- While it makes sense to only do this for blob/text fields I don't think it makes sense to defer storage layout to the field type. I started prototyping this route but it turns out to be even more hacky. Discussed with @effulgentsia and while this probably only applies to core text fields, JSON fields in contrib and perhaps a small number of others, there is no benefit to it being a separate field type.
- @effulgentsia also reminded me that base fields currently apply to all bundles in an entity. While we could allow fields on the base table to be NULL for bundles that don't use the field, and I've never really understood why e.g. a boolean field of cardinality 1 requires an entirely separate table for its storage, I think this requires more discussion and is out of scope here - if I add test coverage then at least we can refactor at will later if we decide to make field storage pluggable in some way.
- xxHash looks good, Symfony uses the 128 bit version of xxHash in places, so let's follow suit here.
- Status changed to Needs review
4 months ago 10:15pm 10 September 2024 - π¬π§United Kingdom longwave UK
Some test failures around installed config, but while I sort those out and add new test coverage, marking as NR to see if anyone has any early comments on the approach so far.
Will also need an upgrade path for existing config if we go this route.
- 🇺🇸 United States nicxvan
Is this trading runtime efficiency for storage efficiency?
I'm concerned this makes saving and loading fields slower for the majority of sites, because revisions need to be hashed. It also seems it will make querying revisions more complex.
I have a client with 10 languages and many nodes. When I shared this with them they raised concerns about the performance trade-off, since CPU is more expensive than storage.
- 🇺🇸 United States effulgentsia
It's opt-in: a per-field configuration that defaults to false. So on each site you'll be able to choose which fields, if any, to apply it to.
For fields that opt into it, it adds an indexed JOIN to loading and querying revisions.
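For illustration, that join might look roughly like this with Drupal's query builder, reusing the hypothetical table names from earlier comments (the actual storage handler code in the MR may be structured differently):

```php
<?php

// Hypothetical: fetch the interned body values for one node revision.
$query = \Drupal::database()->select('node_revision__body', 'r');
$query->leftJoin('node_revision__body__interned', 'i', 'i.body_hash = r.body_hash');
$query
  ->fields('i', ['body_value', 'body_summary', 'body_format'])
  ->condition('r.entity_id', 100)
  ->condition('r.revision_id', 2345);
$values = $query->execute()->fetchAll();
```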
For fields that opt into it, it adds having to compute a hash for every field item for every translation when writing a revision (saving an entity). It's possible that this could be expensive (and we might want to consider future optimizations to track which items/translations are dirty and need re-hashing), but xxHash is very fast (so dirty tracking might not be needed).
- 🇺🇸 United States effulgentsia
it will help reduce database size with Paragraphs as well as Experience Builder
Just to set expectations, this might help some with Paragraphs but probably not as much as it will help with Experience Builder. For Paragraphs, it will help with individual fields that have long content (if those fields opt into it, per #19), such as text_long fields that content editors populate with a lot of text. However, for some sites, Paragraphs don't necessarily have a ton of content in any one field, they just have a lot of fields, and in those cases, the work in this issue won't help much with that.
Because Experience Builder will combine values for all fields within a component into a single JSON field or Field Union field, this issue will help with the "lots of fields with small content" use case as well as the "single fields with large content" use case.
- 🇺🇸 United States effulgentsia
xxHash looks good, Symfony uses the 128 bit version of xxHash in places, so let's follow suit here
I agree with 128 bit rather than 64 bit. https://en.wikipedia.org/wiki/Birthday_attack has a table that shows that with 64 bits, you'd have more than a 1 in a million chance of collision once your table grows above 6 million records.
However, instead of a 32 character hex string, we could reduce it to a 24 character base64 string.
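A minimal sketch of that size difference, assuming PHP 8.1+'s xxh128 support in hash() (whether the stored column ends up hex or base64 is still an open question here):

```php
<?php

// Illustrative only: hex vs base64 representations of a 128-bit xxHash.
$data = 'some serialized field item';

$hex = hash('xxh128', $data);        // 32 hex characters
$raw = hash('xxh128', $data, TRUE);  // 16 raw bytes
$b64 = base64_encode($raw);          // 24 characters, including '==' padding
```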
- 🇺🇸 United States nicxvan
It's opt-in: a per-field configuration that defaults to false. So on each site you'll be able to choose which fields, if any, to apply it to.
That resolves our concerns!
I missed: https://git.drupalcode.org/project/drupal/-/merge_requests/9417/diffs#83...