- 🇫🇷France andypost
The issue is duplicated by 🐛 ChainedFastBackend invalidates all items when cache tags are invalidated Fixed
- 🇬🇧United Kingdom catch
📌 Optimize last_write_timestamp writes in ChainedFastBackend Active landed recently which means less writes to the consistent backend when multiple items are written. Reading back over this issue, it's similar-ish to what I suggested in #37 back in 2016, although at that time I hadn't appreciated how many invalidations we do within a single request on cold caches.
📌 Add a grace period to ChainedFastBackend when last_write_timestamp is ahead Active will reduce that even further.
Neither of these addresses the fact that frequent writes to the consistent backend invalidate the fast cache items themselves, but they should massively reduce lock contention (e.g. thousands of writes reduced to dozens immediately after a cache clear).
Related to the cache collector discussion, 🐛 Race conditions/bad lock logic in CacheCollector/State Active is open to address race conditions there, that might make some things here simpler if/when that lands (that issue is not simple though).
I think this should still stay open for exploring more approaches.
- 🇬🇧United Kingdom catch
Thinking about this more while working on 📌 Add a grace period to ChainedFastBackend when last_write_timestamp is ahead Active .
For chained fast, we use the $last_write_timestamp to invalidate every fast cache item created before the most recent write. This is so that the cache remains consistent between multiple web heads, and between cli and web heads too - let's call those 'nodes'. It's unfortunate that it's the same word as Drupal nodes, but my other idea was 'instances', which sounds like class instances.
If we can identify which node a particular consistent cache item was written on, then we can check whether it was written on the current node or not. If it was, then as long as we always write through or refresh the local cache item, we can ignore our own node's $last_write_timestamp.
Something like the following (a rough sketch of the get path follows the list):
- in the local cache - APCu - keep a special key which is a random UUID for the 'node'.
- store last_write_timestamp per-node, so an array like [$uuid1 => $timestamp, $uuid2 => $timestamp2].
- whenever we write to the consistent cache, we add the node uuid to the cache item:
['chained_fast_uuid' => $uuid, 'data' => $data]
- we have to add this in on set and remove it on get for the caller, same structure for the fast cache. At the same time, we either have to write the item to the fast backend too, or at minimum delete it. This would undo one of the changes in 📌 Add a grace period to ChainedFastBackend when last_write_timestamp is ahead Active, so it's a fine balance.
- when we retrieve an item from the fast cache, we check if the node uuid matches the node uuid we're on; if it does, then we ignore $last_write_timestamp only for the current node.
- we also have to compare against the max($timestamps) of all the other nodes though, just in case they've written more recently than the item was written.
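As a rough sketch, the body of a get()-style method could look something like this. $this->nodeUuid (the random UUID kept in APCu for this node), getLastWriteTimestamps() (returning the per-node array) and the wrapped item structure are all hypothetical names for the idea above, not the current ChainedFastBackend API:

```php
// Sketch of a get path that only honours other nodes' last write timestamps
// when the fast cache item was written on this node.
$fast_item = $this->fastBackend->get($cid);
if ($fast_item) {
  $wrapper = $fast_item->data;
  // Hypothetical per-node timestamps, e.g. [$uuid_a => 1718000000.123, ...].
  $timestamps = $this->getLastWriteTimestamps();
  if (($wrapper['chained_fast_uuid'] ?? NULL) === $this->nodeUuid) {
    // Written on this node: skip our own timestamp, but still respect writes
    // that happened on any other node since this item was cached.
    unset($timestamps[$this->nodeUuid]);
  }
  $threshold = $timestamps ? max($timestamps) : 0;
  if ($fast_item->created > $threshold) {
    // Unwrap the data before returning it to the caller.
    $fast_item->data = $wrapper['data'];
    return $fast_item;
  }
}
// Otherwise fall through to the consistent backend as usual.
```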
On a single web-head set-up, with infrequent cache writes from the command line, this would allow $last_write_timestamp to be almost entirely ignored for web requests, which would mean writing an item to the chained fast backend would no longer effectively invalidate the whole fast backend.
With two web heads, if they both write to the persistent cache in the same second, then it wouldn't help much, but if one web head happens to write to state a minute later, then it would avoid invalidating all the other fast cache items written in the last minute.
With multiple web heads this would likely be no more effective than the current situation, but it also probably wouldn't make it any worse.
I think the big question is whether the better hit rate here would be worth diluting the grace period approach in 📌 Add a grace period to ChainedFastBackend when last_write_timestamp is ahead Active .
We would need to bring back the write-through on save and delete, or at least delete the APCu item if that's better for locking and space usage; we couldn't just ignore it as is done in that issue.
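To make that concrete, here is a rough sketch of the set path with the write-through restored. It mirrors the existing set() signature, but $this->nodeUuid, the wrapped data structure, and a markAsOutdated() that only bumps this node's timestamp are all hypothetical:

```php
// Sketch only: not the current ChainedFastBackend::set() implementation.
public function set($cid, $data, $expire = Cache::PERMANENT, array $tags = []) {
  // Tag the item with the node it was written on.
  $wrapper = [
    'chained_fast_uuid' => $this->nodeUuid,
    'data' => $data,
  ];
  $this->consistentBackend->set($cid, $wrapper, $expire, $tags);
  // Write through (or at minimum delete) so this node's fast backend never
  // holds a copy that predates its own write.
  $this->fastBackend->set($cid, $wrapper, $expire, $tags);
  // Hypothetical: only bump this node's entry in the per-node timestamp array.
  $this->markAsOutdated($this->nodeUuid);
}
```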
- 🇧🇪Belgium kristiaanvandeneynde Antwerp, Belgium
- when we retrieve an item from the fast cache, we check if the node uuid matches the node uuid we're on; if it does, then we ignore $last_write_timestamp only for the current node.
Imagine this set-up (with write-through on set):
- Node A writes item X to the persistent cache, we flag item X with the UUID for node A
- Node A immediately writes item X to the fast cache, also flagged with the UUID for node A
- We keep a last write timestamp per node
Now, item X should remain valid indefinitely unless its tags get invalidated, its max age expires, or it gets manually flushed. The first two scenarios are no troublemakers, as both the fast and persistent cache can deal with tags and max ages just fine. What has me worried is the manual clear.
If I trigger a manual clear on node B, that will clear the persistent cache used by all nodes, but only the fast cache of node B (unless otherwise set up). Now node B has to recalculate item X and store it in the persistent cache as originating from node B's UUID. Cool.
But node A's fast cache hasn't been cleared and it still contains item X as originating from node A. In the meantime, node A's last write timestamp has not been updated. So now node A will happily keep serving a stale cache item because it has no clue that the underlying persistent item now originates from node B. With the current system, this cannot happen as we have one timestamp to rule them all.
This can be cured by making sure that markAsOutdated(), when called from invalidate calls, also updates the last write timestamp of all nodes. That would, however, defeat the purpose of having a timestamp per node unless we differentiate between writes and invalidations when calling markAsOutdated(), something like the sketch below.
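Purely to illustrate the distinction (core's markAsOutdated() takes no arguments and stores a single float; getLastWriteTimestamps() and $this->nodeUuid are the same made-up helpers as above):

```php
// Illustrative only: a plain write bumps just this node's timestamp, while an
// invalidation/clear bumps every known node, so no node keeps serving items
// written before the clear.
protected function markAsOutdated(bool $invalidation = FALSE): void {
  $now = round(microtime(TRUE), 3);
  $timestamps = $this->getLastWriteTimestamps();
  if ($invalidation) {
    // Reset every node that has ever written to this bin.
    $timestamps = array_fill_keys(array_keys($timestamps), $now);
  }
  // Always record (or refresh) our own node.
  $timestamps[$this->nodeUuid] = $now;
  $this->consistentBackend->set(self::LAST_WRITE_TIMESTAMP_PREFIX . $this->bin, $timestamps);
}
```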
- we also have to compare against the max($timestamps) of all the other nodes though, just in case they've written more recently than the item was written.
Maybe that was your way around the above, but then I wonder if we still gain anything. Checking if any other node recently wrote something seems to undo what we're trying to achieve here. It sounds a lot like one centralized timestamp, with extra steps.
I think the big question is whether the better hit rate here would be worth diluting the grace period approach in #3526080: Reduce write contention to the fast backend in ChainedFastBackend.
We would need to bring back the write-through on save and delete, or at least delete the APCu item if that's better for locking and space usage; we couldn't just ignore it as is done in that issue.
The more I think about this, the more I'm wondering whether we should just pick one road and go down that one. Trying to combine the grace period from the other issue (which drops write-throughs in the process) with this one seems rather incompatible.
And given my thoughts above I'm not entirely sure how much benefit we could gain here. But that might very well be because I'm not fully picturing what you had in mind yet with the timestamps per node.
- 🇬🇧United Kingdom catch
Yeah the gain would mostly be in a single web head + cli situation. So mostly local environments and smaller hosting setups. As soon as you get to multiple web heads it's going to be marginal.
- First commit to issue fork.