- Issue created by @catch
- 🇺🇸United States luke.leber Pennsylvania
Whenever I see a call to wait on a lock I get a chill up my spine.
From a cursory glance, this change could result in a ton of processes spinning/waiting to acquire a lock under heavy load, couldn't it?
I think this one should be subject to some load testing on a larger site.
- 🇬🇧United Kingdom catch
Whenever I see a call to wait on a lock I get a chill up my spine.
I agree with this but I think we might not need the lock wait here, just to more aggressively clear/invalidate/set the cache even when we can't acquire the lock. Pushed a commit that tries this.
- 🇬🇧United Kingdom catch
The lock name was a red herring but I discovered some real race conditions.
https://git.drupalcode.org/project/drupal/-/jobs/3827879 runs a frequently failing test that depends on writing to state 2500 times, and is green with the MR.
https://git.drupalcode.org/project/drupal/-/pipelines/380036 is a full test run with only nightwatch failures, which this won't help with.
- 🇺🇸United States dww
Thanks for this! Great work tracking this down.
I added some suggestions to the MR, mostly for trivial nits and a typo. Hate to slow down progress on a critical, but I really think we need comments about why we're not releasing the locks we're acquiring. I don't think this is RTBC if all the MR threads aren't resolved.
- 🇬🇧United Kingdom catch
I'm going to spend a bit more time trying to figure out if we can do away with holding the lock. If we can't, it should at least result in a better comment, possibly a @todo and a follow-up issue to have another try; if we can release the lock, then great, we don't need to explain why we can't.
- 🇬🇧United Kingdom catch
Alright it looks like that theory was correct.
The remaining race condition was something like the following:
We lock cache writes, but not cache reads. CacheCollector uses the cache created timestamp (which is stored with millisecond precision) to detect if the cache item it read at the beginning of the request is the same as the one it reads again at the end of the request.
Because we only lock writes, it's possible for two processes to acquire the lock, write the cache, and release the lock within the same millisecond. For the usual CacheCollector case this is a non-issue, because we are generally caching things which are expensive to build and don't really get written to (like the runtime theme registry), so all of these things happening within a single millisecond doesn't really come up. It's a problem here because state is cheap to build and also gets individual keys written to, especially with the database on a ramdisk on the same server etc., like GitLab pipelines.
If State::set() writes a cache item in the same millisecond as another cache item is set, this breaks the race condition detection in ::updateCache(). Another request can read the newly outdated cache item, get to the end of the request, compare with the State::set() one, see them as 'the same' and then write back stale data.
Holding the lock longer was preventing the situation where another process writes back to the cache item immediately after State::set() has written to it, having read a different cache item created in the same millisecond as the call to State::set().
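To make that failure mode concrete, here's a minimal, self-contained sketch. The cache_write() helper and the array-based "bin" are invented stand-ins, not the real CacheCollector or cache backend code; the point is only how a millisecond-precision created timestamp fails to distinguish two writes that land in the same millisecond:

```php
<?php

// Hypothetical stand-in for a cache backend write: stores the data together
// with a millisecond-precision "created" timestamp, as cache backends do.
function cache_write(array &$bin, string $cid, array $data): void {
  $bin[$cid] = ['data' => $data, 'created' => round(microtime(TRUE), 3)];
}

$bin = [];

// The cache item as it exists when request A starts; A reads and remembers it.
cache_write($bin, 'state', ['key' => 'old']);
$seen_by_a = $bin['state'];

// Request B (e.g. State::set()) writes a new item. This sketch runs fast
// enough that both writes land in the same millisecond.
cache_write($bin, 'state', ['key' => 'new']);

// At the end of its request, A compares created timestamps to detect a
// concurrent write. Same millisecond => the item looks unchanged, so A
// writes back the stale data it collected before B ran.
if ($bin['state']['created'] === $seen_by_a['created']) {
  cache_write($bin, 'state', $seen_by_a['data']);
}

var_dump($bin['state']['data']); // ['key' => 'old'] - stale data won.
```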
Instead of that, we can usleep() for 10 milliseconds before and after writing, and try a bit harder to acquire the lock. This still results in 2500 passes for the test, but it doesn't involve either the full lock->wait() call or holding the lock that were necessary in earlier iterations.
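Roughly, that mitigation looks like the following sketch (invented helper names, not the actual MR code): pad the write with ~10ms sleeps so it can't share a millisecond with a neighbouring write, and retry the lock a few times instead of doing a blocking lock wait.

```php
<?php

// Try the lock a handful of times with short back-offs instead of a blocking
// $lock->wait(); $try_acquire is a placeholder for the real lock acquire.
function acquire_with_retries(callable $try_acquire, int $attempts = 5): bool {
  for ($i = 0; $i < $attempts; $i++) {
    if ($try_acquire()) {
      return TRUE;
    }
    usleep(10000); // 10ms back-off between attempts.
  }
  return FALSE;
}

// Padding the cache write with 10ms sleeps pushes it into its own millisecond,
// so the created-timestamp comparison can tell it apart from other writes.
function padded_cache_write(callable $cache_set, array $data): void {
  usleep(10000);
  $cache_set($data);
  usleep(10000);
}

// Example usage with trivial stand-ins for the real lock and cache backend.
$locked = acquire_with_retries(fn() => TRUE);
padded_cache_write(fn(array $data) => print_r($data), ['key' => 'value']);
```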
There are still other theoretical race conditions here - e.g. if two servers have a time offset, and write cache items in the 'same' millisecond even though they are actually different due to the offset. We could potentially protect against this in CacheCollector by hashing the cache data every time it's read, and then comparing the hash when we write back to the cache - but this would introduce hashing overhead to every cache collector read which is not great. Or maybe we could store the hash with the cache item and compare that, which would mean changing the cache data structure a bit, but that might work and allow us to remove the usleep() calls then.
Going to try that approach in the sandbox MR and see if it's straightforward enough.
- 🇳🇱Netherlands spokje
*cough*
Testbot really doesn't like it...
*cough*
- 🇬🇧United Kingdom catch
OK, trying to use a hash of the data doesn't work, or at least not yet. It also requires changing the structure of the cache item, which means that even if it did work it would probably be minor-only.
I'll open a follow-up for that.
https://git.drupalcode.org/project/drupal/-/merge_requests/10704 is the MR and approach to review.
- 🇬🇧United Kingdom catch
Here's the follow-up: Try to use a non-timestamp approach for CacheCollector race condition detection (Active).
- 🇬🇧United Kingdom catch
Just kicked off a third run of https://git.drupalcode.org/project/drupal/-/merge_requests/10705/pipelines which will give us 7500 passing tests in a row assuming it also comes back green.
- 🇳🇱Netherlands spokje
Hating to be that guy, but hey, I _am_ that guy...
I'm very doubtful if 7500 runs will fit into the current cutoff time of 1 hour.
- 🇬🇧United Kingdom catch
Well that's interesting, 2 runs out of 7500 failed https://git.drupalcode.org/project/drupal/-/jobs/3831692/viewer
Looks like this happened when the overall time for the test went up to 30-50+ seconds from 6-10s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 9s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 9s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 9s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 9s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 9s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 18s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 23s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 24s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 26s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 26s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 30s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 36s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 35s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 35s
\Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksT 0 passes 37s 1 fails
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 44s
\Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksT 0 passes 48s 1 fails
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 62s
Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe 1 passes 60s
Might be a sign of the lock failing to be acquired or similar.
Don't think this necessarily invalidates the fix here since it's still a significant improvement, but might show we need to lock wait for longer and/or explore the hash idea further in the other issue.
- 🇬🇧United Kingdom catch
Another 1/2500 fail on https://git.drupalcode.org/project/drupal/-/jobs/3832573/viewer so at 14997/15000 passes now.
- 🇬🇧United Kingdom catch
Hmm four fails on https://git.drupalcode.org/project/drupal/-/jobs/3849058 out of 2500
- 🇬🇧United Kingdom catch
Famous last words because we're down to 4/15000 odds so false negatives are increasingly likely, but might have figured out the last race condition.
In earlier iterations of the MR, I was holding the lock acquired in State::set() and not releasing it. This was out of paranoia, so I changed it to release after it was noted in review. I now think there was a good reason for it after all.
In CacheCollector::updateCache(), we use a new cache item as a tombstone record for the following case:
1. Process A - cold cache
2. Process B - invalidates the cache (e.g. a state write).
3. Process A writes the cache back at end of request, with stale data from before process B wrote its data.
To avoid this, Process B writes a new cache item, and then process A sees that a brand new cache item has been written when none existed before, and throws its data away instead.
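As a rough illustration of the tombstone check (hypothetical code with an array standing in for the cache bin, not the real CacheCollector::updateCache()):

```php
<?php

$bin = [];

// 1. Process A starts with a cold cache and remembers that fact.
$a_saw_item = isset($bin['state']);

// 2. Process B invalidates/sets state and writes a brand new cache item
//    (the tombstone), so later writers can tell that something changed.
$bin['state'] = ['data' => ['key' => 'fresh'], 'created' => microtime(TRUE)];

// 3. Process A reaches the end of its request holding data collected before
//    B ran. A new item where none existed before means: discard it.
$stale_data = ['key' => 'stale'];
if (!$a_saw_item && isset($bin['state'])) {
  // Throw the stale data away; B's item is newer than anything A has.
}
else {
  $bin['state'] = ['data' => $stale_data, 'created' => microtime(TRUE)];
}

var_dump($bin['state']['data']); // ['key' => 'fresh'] - B's write survives.
```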
The remaining race condition looked like this:
1. Process A - cold cache
2. Process B - state write and cache write.
3. Process B - end of request - deletes the cache item (fails to acquire a lock, or any other reason).
4. Process A - because the cache item is deleted, doesn't see it, so writes the stale data to the cache.
Holding the lock does the following:
1. Process A - cold cache
2. Process B - state write and cache write
3. Process B - end of request - still has the lock, so writes to cache again instead of deleting.
4. Process A - finds the new cache item so throws its data away
- 🇪🇸Spain penyaskito Seville, Spain 🇪🇸, UTC+2 🇪🇺
If this helps, as someone completely new to this issue, the code comments look as clear as I think they can be to understand the complex issue and the fix.
- 🇬🇧United Kingdom catch
10,000 passes so far with holding the lock, but probably need 20,000 passes or more to prove the negative given it failed 3/15000 without it.
https://git.drupalcode.org/project/drupal/-/jobs/3850861
https://git.drupalcode.org/project/drupal/-/jobs/3850954
https://git.drupalcode.org/project/drupal/-/jobs/3851133
https://git.drupalcode.org/project/drupal/-/jobs/3851201
- 🇺🇸United States nicxvan
I think this is getting pretty close. Do we want a branch now with just the cache changes?
I'm trying to figure out how to check the probabilities, it's been a while since I took statistics.
- 🇺🇸United States nicxvan
Ok, using the Poisson distribution, the odds that something that happens 3/15000 of the time happens 0 times in 22500 runs is 1.1%.
https://www.wolframalpha.com/input?i2d=true&i=Divide%5B%5C%2840%29Power%...
Now this isn't perfect because it assumes that the 3/15000 rate is constant, and obviously we're affecting the thing driving it, but we can't really do better that I can see.
Just for others following along, the formula is (e^(-λ) (λ^x))/x!
e is Euler's number
λ is the rate: 3 divided by 15000, multiplied by 22500 = 4.5
x is the number of failures expected, which in this case is 0
Plug those numbers in and you get 1.1%.
I interpret that to mean that if the rate were still 3/15000, we'd have to run a set of 22500 tests about 100 times to get a run that had 0 failures. That feels definitive to me.
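For anyone who wants to sanity-check the arithmetic, a quick PHP snippet using only the numbers from this comment:

```php
<?php

// λ = expected failures in 22500 runs at an assumed rate of 3 per 15000 runs.
$lambda = (3 / 15000) * 22500; // 4.5

// Poisson: P(X = 0) = e^(-λ) * λ^0 / 0! = e^(-λ).
$p_zero = exp(-$lambda);

printf("P(0 failures in 22500 runs) = %.4f\n", $p_zero); // ~0.0111, i.e. 1.1%
```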
- 🇳🇱Netherlands spokje
Just my EUR 0.02: I can see an almost always considerable rise in the failure rate when it's busier on Drupal CI.
All runs here have taken place in the Christmas/New Year's period, in which it's very quiet around here. If possible, I would like to see a few runs from Monday January 6th onward, when normal business is restored, and see if that affects the numbers drastically.
If that would be too long of a wait, I'm OK with this as-is; if stuff starts breaking down (more) on Monday we can act then as needed.
- 🇬🇧United Kingdom catch
Couple of general points, mostly answering questions I asked myself while working on this:
This problem is fairly unique to state's use of cache collector. Most cache collector implementations just build a 'runtime' cache from a discovery cache (smaller array from a massive array), so they are not changing values in the underlying data at all.
The most likely problem this change could cause would be finding badly behaved contrib or custom code that writes to state on every request or something, but that would already cause problems with state caching; the lock wait would make those problems a bit worse, but writing to state on every request would be pretty extreme.
With a normal/occasional state write, while the lock is being held, other processes will need to do some extra key/value queries, and won't be able to write back to the cache, but this is relatively cheap (about the same as before we had state caching in the first place). So that bit seems fine.
While adding all the logic, I wondered if state should re-implement the parts of cache collector it needs instead of making cache collector more complicated, but:
- overall we'd probably end up with the same amount of complexity, but in two places instead of one
- the changes to cache collector make it more robust for contrib/custom use-cases doing similar things
- even if state moved to 'immediate write' with no end-of-request lazy writing at all, it would still have to lock both state writes and cache writes.
There is one possible optimization we can do here, which I realised while writing the above:
When we write to the cache item in State::set(), if we've acquired a lock, include the other state keys that are already loaded too; that will make it more likely that subsequent requests get a cache hit (including while the lock is being held). Not going to have time to work on that for at least a few hours though.
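A very rough sketch of what that optimisation could look like (invented function name, and the state keys are just example values; this isn't in the MR):

```php
<?php

// Hypothetical: merge the freshly set key into the keys this request has
// already loaded, and write them all as one cache item while the lock is held.
function build_state_cache_item(array $already_loaded, string $key, $value): array {
  $data = $already_loaded;
  $data[$key] = $value;
  return $data;
}

// E.g. a request that already loaded two state keys and then sets a third.
print_r(build_state_cache_item(
  ['system.cron_last' => 1736160000, 'system.css_js_query_string' => 'abc123'],
  'my_module.some_flag',
  TRUE,
));
```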
@nicxvan https://git.drupalcode.org/project/drupal/-/merge_requests/10704 is the committable/reviewable branch, it should be up-to-date (except for that extra optimisation, if it works out).
- 🇬🇧United Kingdom catch
@nicxvan thanks for the stats research, that's pretty encouraging.
I've pushed a revert of [random test failure] LayoutBuilderBlocksTest::testBlockPlaceholder failing (Active) to the 2500 test runs branch (https://git.drupalcode.org/project/drupal/-/merge_requests/10749) so that it's possible to keep hitting the 'run pipelines' button on there a bit more next week if this isn't committed before then. That will also give us another 2500 runs in the meantime.
- 🇬🇧United Kingdom catch
Little trip down memory lane:
#20114: race condition in variable_set in bootstrap.inc → and #973436: Overzealous locking in variable_initialize() → amongst others.
- 🇧🇪Belgium wim leers Ghent 🇧🇪🇪🇺
Arrived here via [random test failure] EditorSecurityTest::testEditorXssFilterOverride (Active).
@catch's #13 + #30 + #36 + #38 are 🤯
@nicxvan: it's been >10 years since I saw somebody in a software engineering context mention the Poisson distribution, and it reminds me of one of the very few books I still have lying around from studying Computer Science: https://www.taylorfrancis.com/books/mono/10.1201/9781420011425/probabili...
@spokje in #35: +1 for increased CI failure rates when CI infra is under load.
- 🇬🇧United Kingdom catch
Since it's Monday, started kicking off pipelines again. Will edit this comment to add them when I remember to kick another one off (insert xkcd compiling image).
https://git.drupalcode.org/project/drupal/-/pipelines/387144
- 🇬🇧United Kingdom catch
I think we can skip holding the lock after all.
Short explanation of why:
Performance tests were failing in the latest MR; this is because the CLI was holding the lock and preventing the site-under-test from acquiring it (annoying).
Trying to work around this made me realise we're mainly holding the lock to avoid ::updateCache() throwing away the tombstone record when it can't acquire a lock.
Instead of that, if we've acquired a lock, we can completely disable the end of request logic in state, and release the lock. That means as soon as the state item is written other requests can start to build the cache again.
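In sketch form (invented callables standing in for the real lock and cache services, just to show the shape of the change described above):

```php
<?php

// Hypothetical sketch: with the lock held, write the cache immediately,
// cancel the end-of-request lazy write, and release the lock right away so
// other requests can start rebuilding the cache.
function state_set_cache(callable $lock_acquire, callable $lock_release, callable $cache_set, callable $cancel_lazy_write, array $data): void {
  if ($lock_acquire('state_cache_collector')) {
    $cache_set('state', $data);
    // Nothing left for ::updateCache() to do at the end of the request.
    $cancel_lazy_write();
    // Released straight away, so other requests can rebuild the cache.
    $lock_release('state_cache_collector');
    return;
  }
  // Lock not acquired: leave the existing, more defensive invalidation path
  // to sort things out at the end of the request.
}
```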
Obviously, I now need to check that this still fixes the race condition, so another 10k+ test runs to kick off on the test-only MR - will start that once the main fix MR is green again.
- 🇬🇧United Kingdom catch
Implemented that. One functional js test failure which I think is unrelated. Updated the test only MR, first 2500 runs came back green, will kick off more over the next 24 hours.
https://git.drupalcode.org/project/drupal/-/pipelines/387799
https://git.drupalcode.org/project/drupal/-/pipelines/387820
- 🇬🇧United Kingdom catch
12.5k runs without a failure so far, which is pretty encouraging, so I think this genuinely is ready for review again now. I'll kick off some more pipelines tomorrow.
- 🇳🇱Netherlands spokje
Great work, I suppose this also takes care of random failures in tests that actually need to use State for testing, like
\Drupal\Tests\system\Functional\System\CronRunTest::testCronUI?
This is an example where we can't just switch to KeyValue, since State is an integral part of running cron.
- 🇬🇧United Kingdom catch
@Spokje yes it should fix cases like that as well.
- 🇬🇧United Kingdom catch
Several fails on: https://git.drupalcode.org/project/drupal/-/jobs/3923493
- 🇬🇧United Kingdom catch
Theory on the above: the runtime doubling could be from not acquiring a lock and having to wait for a second - the runtime also doubled when the test itself was holding the lock for the duration of the test (although that prevented any new caching at all, so fixed the race condition).
So... trying to be more aggressive when we fail to acquire the lock.
- 🇬🇧United Kingdom catch
Unrelated fail on https://git.drupalcode.org/project/drupal/-/jobs/3925298, otherwise green.
- 🇬🇧United Kingdom catch
The simplification led to a behaviour change for the path alias cache, but it's very small and still works.
I think it is probably possible to make updateCache() work for what State::set() wants to do, but I've already spent a lot of time on this issue and need a break, so I think we should defer that to a follow-up - it might reduce some code and hopefully make things even more robust, but it will be as tricky as this issue to ensure we've caught all the cases. Would rather have things correct, which hopefully the current state of the MR gets us to.
Because a few things have changed again, going to run more pipelines..
https://git.drupalcode.org/project/drupal/-/pipelines/390206
- 🇬🇧United Kingdom catch
- 🇬🇧United Kingdom alexpott 🇪🇺
dww → credited alexpott →.
- 🇺🇸United States dww
(Sorry for the assigned/unassigned noise -- accidentally clicked something in the MR review UI).
Resolved the opened threads, since @catch has already addressed all the feedback.
Opened a few trivial suggestions for some nits, and to rename a test method that no longer does what the old name says.
I would RTBC, but I think the suggestions are worth applying first.
Initial pass at saving credit:
@catch (duh)
@alexpott for MR reviews + suggestions
@Spokje, @nicxvan, @godotislate, @luke.leber, and myself for MR and issue reviews
Almost there!
Thanks,
-Derek
p.s. restoring the Bug Smash Initiative tag, which I think @catch accidentally clobbered in #13 with a stale form submit.
- 🇬🇧United Kingdom catch
Tracked down the very specific race condition with the test (well I think I did) which was still resulting in some failures occasionally - about 1/2500.
I think it is:
1. test requests a page
2. page writes to the state cache collector
3. test sets state
4. test requests a page.
Because the test doesn't use WaitTerminateTestTrait, if #2 is holding the lock while #3 happens, we go into the !$lock_acquired branch in state.
But the implementation so far was concentrating on the request that hits the $lock_acquired branch eventually winning out when it reaches the end of the request (more or less eventual consistency). In the test, there is no 'end of the request', there is just the end of the test, and the page we hit immediately after trying to set state can still get the wrong value in the meantime. The answer is to even more aggressively write to the cache when we can't acquire the lock. It still doesn't work if the underlying request takes longer than a second to complete, because then both the pre- and post- lock acquire attempt cache writes can still be overwritten. We could wait longer than a second, but on a real request the end of request logic is more likely to kick in later and clean everything up.
Bit annoyed that I didn't look at the test in more detail earlier because that's obvious in retrospect, but was trying to focus on the stampede condition that real sites will hit vs functional test weirdness.
Fresh set of runs:
https://git.drupalcode.org/project/drupal/-/pipelines/392873
https://git.drupalcode.org/project/drupal/-/pipelines/393001
- godotislate
Is "Needs work" correct as current status, or is this ready to be reviewed again?
- 🇬🇧United Kingdom catch
@godotislate it's not actually fixing the race condition at the moment so very much needs work - although review of the approach would probably still be useful.