Race conditions/bad lock logic in CacheCollector/State

Created on 27 December 2024

Problem/Motivation

__CLASS__ gives the name of the class it is written in, not the class of the current object when called from a subclass, so we are creating a lock with a very non-specific ID in CacheCollector.
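
To illustrate the pitfall, here is an analogous sketch in Python (hypothetical names, not Drupal's actual code): PHP's __CLASS__ behaves like hardcoding the defining class's name, while PHP's static::class behaves like looking up the runtime class.

```python
class CacheCollector:
    """Base class; both lock-name styles shown for comparison."""

    def lock_name_buggy(self):
        # Like PHP's __CLASS__: always names the defining class, so every
        # subclass shares one overly generic lock ID.
        return CacheCollector.__name__ + ':lock'

    def lock_name_fixed(self):
        # Like PHP's static::class: names the runtime class, giving each
        # subclass its own specific lock ID.
        return type(self).__name__ + ':lock'


class State(CacheCollector):
    """Stand-in for a CacheCollector subclass such as State."""


print(State().lock_name_buggy())  # CacheCollector:lock
print(State().lock_name_fixed())  # State:lock
```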

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

Introduced terminology

API changes

Data model changes

Release notes snippet

πŸ› Bug report
Status

Active

Version

11.0 πŸ”₯

Component

base system

Created by

πŸ‡¬πŸ‡§United Kingdom catch


Merge Requests

Comments & Activities

  • Issue created by @catch
  • πŸ‡¬πŸ‡§United Kingdom catch
  • Merge request !10704 "Resolve #3496257 'Race conditions/bad lock'" (Open) created by catch
  • πŸ‡¬πŸ‡§United Kingdom catch
  • Merge request !10705 "Draft: Resolve #3496257 '2500'" (Open) created by catch
  • πŸ‡ΊπŸ‡ΈUnited States luke.leber Pennsylvania

    Whenever I see a call to wait on a lock I get a chill up my spine.

    From a cursory glance, this change could result in a ton of processes spinning/waiting to acquire a lock under heavy load, couldn't it?

    I think this one should be subject to some load testing on a larger site.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Whenever I see a call to wait on a lock I get a chill up my spine.

    I agree with this, but I think we might not need the lock wait here; instead we can more aggressively clear/invalidate/set the cache even when we can't acquire the lock. Pushed a commit that tries this.

  • πŸ‡¬πŸ‡§United Kingdom catch

    The lock name was a red herring but I discovered some real race conditions.

    https://git.drupalcode.org/project/drupal/-/jobs/3827879 runs a frequently failing test that depends on writing to state 2500 times, and is green with the MR.

    https://git.drupalcode.org/project/drupal/-/pipelines/380036 is a full test run with only nightwatch failures, which this won't help with.

  • πŸ‡¬πŸ‡§United Kingdom catch
  • πŸ‡ΊπŸ‡ΈUnited States dww

    Thanks for this! Great work tracking this down.

    I added some suggestions to the MR, mostly for trivial nits and a typo. Hate to slow down progress on a critical, but I really think we need comments about why we're not releasing the locks we're acquiring. I don't think this is RTBC if all the MR threads aren't resolved.

  • Pipeline finished with Canceled
    24 days ago
    Total: 227s
    #380290
  • πŸ‡¬πŸ‡§United Kingdom catch

    I'm going to spend a bit more time trying to figure out whether we can do away with holding the lock. If we can't, that should result in a better comment, and possibly a @todo plus a follow-up issue to have another try. If we can release the lock, then great: we don't need to explain why we can't.

  • Pipeline finished with Canceled
    24 days ago
    Total: 278s
    #380292
  • Pipeline finished with Canceled
    24 days ago
    Total: 806s
    #380299
  • Pipeline finished with Success
    24 days ago
    Total: 871s
    #380307
  • Pipeline finished with Failed
    24 days ago
    Total: 53s
    #380310
  • Pipeline finished with Canceled
    24 days ago
    Total: 842s
    #380315
  • Pipeline finished with Canceled
    24 days ago
    Total: 865s
    #380322
  • Pipeline finished with Success
    24 days ago
    Total: 868s
    #380329
  • Pipeline finished with Success
    24 days ago
    Total: 1504s
    #380325
  • πŸ‡¬πŸ‡§United Kingdom catch

    Alright it looks like that theory was correct.

    The remaining race condition was something like as follows:

    We lock cache writes, but not cache reads. CacheCollector uses the cache created timestamp (which is stored with millisecond precision) to detect if the cache item it read at the beginning of the request is the same as the one it reads again at the end of the request.

    Because we only lock writes, it's possible for two processes to acquire the lock, write the cache, and release the lock within the same millisecond. For the usual CacheCollector case this is a non-issue, because we are generally caching things which are expensive to build and rarely written to (like the runtime theme registry), so all of this happening within a single millisecond does not really come up. It's a problem here because state is cheap to build and also gets individual keys written to it, especially with the database on a ramdisk on the same server, as in GitLab pipelines.

    If State::set() writes a cache item in the same millisecond as another cache item is set, this breaks the race condition detection in ::updateCache(). Another request can read the newly outdated cache item, get to the end of the request, compare with the State::set() one, see them as 'the same' and then write back stale data.

    Holding the lock longer was preventing the situation where another process writes back to the cache item immediately after State::set() has written to it, having read a different cache item created in the same millisecond as the call to State::set().

    Instead of that, we can usleep() for 10 milliseconds before and after writing, and try a bit harder to acquire the lock. This still results in 2500 passes for the test, but it doesn't involve either the full lock->wait() call or holding the lock that were necessary in earlier iterations.
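
    The shape of that mitigation can be sketched as follows (a Python sketch, not Drupal's PHP implementation; guarded_write, tries and pause_s are hypothetical names):

```python
import time
import threading

def guarded_write(lock, write, tries=3, pause_s=0.010):
    """Sleep around a cache write and retry the lock a few times,
    instead of a full lock wait or holding the lock for the request."""
    time.sleep(pause_s)  # the usleep() before writing: no earlier write
                         # can share this write's millisecond
    for _ in range(tries):  # "try a bit harder" to acquire the lock
        if lock.acquire(blocking=False):
            try:
                write()
            finally:
                lock.release()
            break
        time.sleep(pause_s)
    time.sleep(pause_s)  # the usleep() after writing
```

    The point of the sleeps is that the write's creation timestamp can no longer collide, at millisecond precision, with a write from another process that raced through the same code path.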

    There are still other theoretical race conditions here - e.g. if two servers have a time offset, and write cache items in the 'same' millisecond even though they are actually different due to the offset. We could potentially protect against this in CacheCollector by hashing the cache data every time it's read, and then comparing the hash when we write back to the cache - but this would introduce hashing overhead to every cache collector read which is not great. Or maybe we could store the hash with the cache item and compare that, which would mean changing the cache data structure a bit, but that might work and allow us to remove the usleep() calls then.

    Going to try that approach in the sandbox MR and see if it's straightforward enough.

  • πŸ‡³πŸ‡±Netherlands spokje

    *cough*
    Testbot really doesn't like it...
    *cough*

  • πŸ‡¬πŸ‡§United Kingdom catch

    OK, trying to use a hash of the data doesn't work, or at least not yet. It also requires changing the structure of the cache item, which means that even if it did work it would probably be minor-only.

    I'll open a follow-up for that.

    https://git.drupalcode.org/project/drupal/-/merge_requests/10704 is the MR and approach to review.

  • πŸ‡¬πŸ‡§United Kingdom catch

    catch β†’ changed the visibility of the branch 3496257-2500 to hidden.

  • πŸ‡¬πŸ‡§United Kingdom catch

    catch β†’ changed the visibility of the branch 3496257-2500 to active.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Just kicked off a third run of https://git.drupalcode.org/project/drupal/-/merge_requests/10705/pipelines which will give us 7500 passing tests in a row assuming it also comes back green.

  • πŸ‡³πŸ‡±Netherlands spokje

    Hating to be that guy, but hey, I _am_ that guy...

    I'm very doubtful that 7500 runs will fit into the current cutoff time of 1 hour.

  • πŸ‡¬πŸ‡§United Kingdom catch

    @spokje it's 2500 runs * 3, should be OK.

  • Pipeline finished with Failed
    24 days ago
    Total: 1091s
    #380418
  • πŸ‡¬πŸ‡§United Kingdom catch

    Well, that's interesting: 2 runs out of 7500 failed https://git.drupalcode.org/project/drupal/-/jobs/3831692/viewer

    Looks like this happened when the overall time for the test went up to 30-50+ seconds from 6-10s:

    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   18s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   23s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   24s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   26s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   26s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   30s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   36s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   35s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   35s                                      
    \Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksT   0 passes   37s   1 fails                            
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   44s                                      
    \Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksT   0 passes   48s   1 fails                            
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   62s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   60s     
    

    Might be a sign of the lock failing to be acquired or similar.

    Don't think this necessarily invalidates the fix here since it's still a significant improvement, but might show we need to lock wait for longer and/or explore the hash idea further in the other issue.

  • Pipeline finished with Success
    24 days ago
    Total: 966s
    #380433
  • Pipeline finished with Success
    24 days ago
    Total: 862s
    #380454
  • πŸ‡¬πŸ‡§United Kingdom catch

    Pass rate of 12248/12500 now.

  • Pipeline finished with Failed
    24 days ago
    Total: 1191s
    #380500
  • πŸ‡¬πŸ‡§United Kingdom catch

    Another 1/2500 fail on https://git.drupalcode.org/project/drupal/-/jobs/3832573/viewer so at 14997/15000 passes now.

  • πŸ‡¬πŸ‡§United Kingdom catch

    catch β†’ changed the visibility of the branch 3496257-2500 to hidden.

  • Merge request !10748Draft: run tests 1750 times β†’ (Open) created by catch
  • πŸ‡¬πŸ‡§United Kingdom catch

    catch β†’ changed the visibility of the branch 3496257-1750-again to hidden.

  • Merge request !10749Draft: Run tests 2500 times β†’ (Open) created by catch
  • πŸ‡¬πŸ‡§United Kingdom catch
  • πŸ‡¬πŸ‡§United Kingdom catch

    Famous last words because we're down to 4/15000 odds so false negatives are increasingly likely, but might have figured out the last race condition.

    In earlier iterations of the MR, I was holding the lock acquired in State::set() and not releasing it. This was out of paranoia, so I changed it to release after it was noted in review. I now think there was a good reason for it after all.

    In CacheCollector::updateCache(), we use a new cache item as a tombstone record for the following case:

    1. Process A - cold cache

    2. Process B - invalidates the cache (e.g. a state write).

    3. Process A writes the cache back at end of request, with stale data from before process B wrote its data.

    To avoid this, Process B writes a new cache item, and then process A sees that a brand new cache item has been written when none existed before, and throws its data away instead.

    The remaining race condition looked like this:

    1. Process A - cold cache
    2. Process B - state write and cache write.
    3. Process B - end of request - deletes the cache item (fails to acquire a lock, or for any other reason).
    4. Process A - because the cache item is deleted, doesn't see it, so writes the stale data to the cache.

    Holding the lock does the following:

    1. Process A - cold cache
    2. Process B - state write and cache write
    3. Process B - end of request - still has the lock, so writes to cache again instead of deleting.
    4. Process A - finds the new cache item so throws its data away
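
    The tombstone comparison can be sketched like this (illustrative Python with hypothetical names; Drupal's real logic lives in CacheCollector::updateCache()):

```python
cache = {}  # shared cache bin: key -> {'created': ..., 'data': ...}

def invalidating_write(key, data, created):
    # Process B: on invalidation, write a brand-new cache item (the
    # tombstone) rather than just deleting, so late writers can tell
    # that the world has changed.
    cache[key] = {'created': created, 'data': data}

def end_of_request_write_back(key, created_at_start, data, now):
    # Process A: only write back if the item is still the one it saw at
    # the start of the request (None means the cache was cold).
    current = cache.get(key)
    if current is not None and current['created'] != created_at_start:
        return False  # a newer item exists: throw our stale data away
    cache[key] = {'created': now, 'data': data}
    return True

# Tombstone works: B writes a fresh item, so A discards its stale data.
invalidating_write('state', 'new', created=2.0)
assert end_of_request_write_back('state', None, 'stale', now=3.0) is False

# The remaining race: if B's end-of-request logic deletes the item
# instead, A no longer sees the tombstone and writes stale data back.
del cache['state']
assert end_of_request_write_back('state', None, 'stale', now=3.0) is True
```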

  • πŸ‡ͺπŸ‡ΈSpain penyaskito Seville πŸ’ƒ, Spain πŸ‡ͺπŸ‡Έ, UTC+2 πŸ‡ͺπŸ‡Ί

    If this helps, as someone completely new to this issue, the code comments look as clear as I think they can be to understand the complex issue and the fix.

  • πŸ‡¬πŸ‡§United Kingdom catch
  • πŸ‡ΊπŸ‡ΈUnited States nicxvan

    I think this is getting pretty close. Do we want a branch now with just the cache changes?

    I'm trying to figure out how to check the probabilities, it's been a while since I took statistics.

  • πŸ‡ΊπŸ‡ΈUnited States nicxvan

    OK, using the Poisson distribution, the odds that something that happens 3/15000 times happens 0 times in 22500 runs is 1.1%.
    https://www.wolframalpha.com/input?i2d=true&i=Divide%5B%5C%2840%29Power%...

    Now this isn't perfect because it assumes that the 3/15000 rate is constant and obviously we're affecting the thing driving it, but we can't really do better that I can see.

    Just for others following along, the formula is (e^(-Ξ») (Ξ»^x))/x!
    e is Euler's number
    Ξ» is the rate: 3 divided by 15000, multiplied by 22500 = 4.5
    x is the number of failures expected, which in this case is 0

    Plug those numbers in and you get 1.1%.
    I interpret that to mean that if the rate really were 3/15000, we'd have to run a set of 22500 tests about 100 times to get a run with 0 failures.
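
    The arithmetic above can be verified directly in Python:

```python
import math

# Poisson P(X = 0) with rate lambda = (3 / 15000) * 22500 = 4.5
lam = (3 / 15000) * 22500
p_zero = math.exp(-lam) * lam**0 / math.factorial(0)
print(f"{p_zero:.1%}")  # 1.1%
```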

    That feels definitive to me.

  • πŸ‡³πŸ‡±Netherlands spokje

    Just my EUR0.02: I can see an almost always considerable rise in the failure rate when Drupal CI is busier.
    All runs here have taken place in the Christmas/New Years period, in which it's very quiet around here.

    If possible, I would like to see a few runs from monday January 6th onward, when normal business is restored and see if that affects the numbers drastically.

    If that would be too long a wait, I'm OK with this as-is; if stuff starts breaking down (more) on Monday we can act then as needed.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Couple of general points, mostly answering questions I asked myself while working on this:

    This problem is fairly unique to state's use of cache collector. Most cache collector implementations just build a 'runtime' cache from a discovery cache (smaller array from a massive array), so they are not changing values in the underlying data at all.

    The most likely problem this change could cause would be exposing badly behaved contrib or custom code that writes to state on every request or something, but that would already cause problems with state caching; the lock wait would make those problems a bit worse, but writing to state every request would be pretty extreme.

    With a normal/occasional state write, while the lock is being held, other processes will need to do some extra key/value queries, and won't be able to write back to the cache, but this is relatively cheap (about the same as before we had state caching in the first place). So that bit seems fine.

    While adding all the logic, I wondered if state should instead re-implement the parts of cache collector it needs instead of making it more complicated, but:

    - overall we'd probably end up with the same amount of complexity, but in two places instead of one
    - the changes to cache collector make it more robust for contrib/custom use-cases doing similar things
    - even if state moved to 'immediate write' with no end of request lazy-writing at all, it would still have to lock both state writes and cache writes.

    There is one possible optimization we can do here which I realised writing the above:

    When we write to the cache item in State::set(), if we've acquired a lock, include the other state keys that are already loaded too. That will make it more likely that subsequent requests get a cache hit (including while the lock is being held). Not going to have time to work on that for at least a few hours though.

    @nicxvan https://git.drupalcode.org/project/drupal/-/merge_requests/10704 is the committable/reviewable branch, it should be up-to-date (except for if that extra optimisation works).

  • πŸ‡¬πŸ‡§United Kingdom catch

    @nicxvan thanks for the stats research, that's pretty encouraging.

    I've pushed a revert of πŸ› [random test failure] LayoutBuilderBlocksTest::testBlockPlaceholder failing Active to the 2500 test runs branch (https://git.drupalcode.org/project/drupal/-/merge_requests/10749) so that it's possible to keep hitting the 'run pipelines' button on there a bit more next week if this isn't committed before then. That will also give us another 2500 runs in the meantime.

  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    Arrived here via πŸ› [random test failure] EditorSecurityTest::testEditorXssFilterOverride Active .

    @catch's #13 + #30 + #36 + #38 are 🀯

    @nicxvan: it's been >10 years since I saw somebody in a software engineering context mention Poisson's distribution and it reminds me of one of the very few books I still have lying around from studying Computer Science: https://www.taylorfrancis.com/books/mono/10.1201/9781420011425/probabili... πŸ€“πŸ˜…

    @spokje in #35: +1 for increased CI failure rates when CI infra is under load.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Since it's Monday, started kicking off pipelines again. Will edit this comment to add them when I remember to kick another one off (insert xkcd compiling image).

    https://git.drupalcode.org/project/drupal/-/pipelines/387144

  • πŸ‡¬πŸ‡§United Kingdom catch

    I think we can skip holding the lock after all.

    Short explanation of why:

    Performance tests were failing in the latest MR. This is because the CLI was holding the lock and preventing the site-under-test from acquiring it (annoying).

    Trying to work around this made me realise we're mainly holding the lock to avoid ::updateCache() throwing away the tombstone record when it can't acquire a lock.

    Instead of that, if we've acquired a lock, we can completely disable the end of request logic in state, and release the lock. That means as soon as the state item is written other requests can start to build the cache again.

    Obviously, now need to check that this still fixes the race condition so another 10k+ test runs to kick off on the test-only MR - will start that once the main fix MR is green again.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Implemented that. One functional js test failure which I think is unrelated. Updated the test only MR, first 2500 runs came back green, will kick off more over the next 24 hours.

    https://git.drupalcode.org/project/drupal/-/pipelines/387799
    https://git.drupalcode.org/project/drupal/-/pipelines/387820

  • πŸ‡¬πŸ‡§United Kingdom catch

    Too much status change.

  • Pipeline finished with Success
    14 days ago
    Total: 882s
    #387846
  • Pipeline finished with Success
    14 days ago
    Total: 898s
    #387861
  • πŸ‡¬πŸ‡§United Kingdom catch

    12.5k runs without a failure so far, which is pretty encouraging, so I think this genuinely is ready for review again now. I'll kick off some more pipelines tomorrow.

  • Pipeline finished with Success
    14 days ago
    Total: 880s
    #387885
  • πŸ‡³πŸ‡±Netherlands spokje

    Great work, I suppose this also takes care of random failures in tests that actually need to use State for testing like \Drupal\Tests\system\Functional\System\CronRunTest::testCronUI?

    This is an example where we can't just switch to KeyValue, since State is an integral part of running cron.

  • πŸ‡¬πŸ‡§United Kingdom catch

    @Spokje yes it should fix cases like that as well.

  • Pipeline finished with Failed
    14 days ago
    Total: 911s
    #388250
  • Pipeline finished with Success
    14 days ago
    Total: 833s
    #388309
  • πŸ‡¬πŸ‡§United Kingdom catch

    Theory on the above: the runtime doubling could be failing to acquire a lock and having to wait for a second. The runtime doubled when the test itself was holding the lock for the duration of the test (although that prevented any new caching at all, so it fixed the race condition).

    So... trying to be more aggressive when we fail to acquire the lock.

  • Pipeline finished with Failed
    14 days ago
    Total: 906s
    #388406
  • πŸ‡¬πŸ‡§United Kingdom catch
  • Pipeline finished with Success
    14 days ago
    Total: 1550s
    #388387
  • Pipeline finished with Success
    14 days ago
    Total: 928s
    #388431
  • Pipeline finished with Success
    14 days ago
    Total: 822s
    #388648
  • Pipeline finished with Failed
    14 days ago
    Total: 885s
    #388646
  • Pipeline finished with Success
    13 days ago
    Total: 882s
    #388807
  • πŸ‡¬πŸ‡§United Kingdom catch

    The simplification led to a behaviour change for the path alias cache, but it's very small and still works.

    I think it is probably possible to make updateCache() work for what State::set() wants to do, but I've already spent a lot of time on this issue and need a break, so I think we should defer that to a follow-up. It might reduce some code and hopefully make things even more robust, but it will be as tricky as this issue to ensure we've caught all the cases. I would rather have things correct, which hopefully the current state of the MR gets us to.

    Because a few things have changed again, going to run more pipelines..

    https://git.drupalcode.org/project/drupal/-/pipelines/390206

  • Pipeline finished with Failed
    11 days ago
    Total: 1064s
    #391267
  • Pipeline finished with Failed
    11 days ago
    #391292
  • πŸ‡¬πŸ‡§United Kingdom alexpott πŸ‡ͺπŸ‡ΊπŸŒ
  • πŸ‡ΊπŸ‡ΈUnited States dww

    (Sorry for the assigned/unassigned noise -- accidentally clicked something in the MR review UI).

    Resolved the opened threads, since @catch has already addressed all the feedback.

    Opened a few trivial suggestions for some nits, and to rename a test method that no longer does what the old name says.

    I would RTBC, but I think the suggestions are worth applying, first. πŸ˜…

    Initial pass at saving credit:
    @catch (duh)
    @alexpott for MR reviews + suggestions
    @Spokje, @nicxvan, @godotislate, @luke.leber, and myself for MR and issue reviews

    Almost there!

    Thanks,
    -Derek

    p.s. restoring the Bug Smash Initiative tag, which I think @catch accidentally clobbered in #13 with a stale form submit.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Tracked down the very specific race condition with the test (well I think I did) which was still resulting in some failures occasionally - about 1/2500.

    I think it is:

    1. Test requests a page.
    2. Page writes to the state cache collector.
    3. Test sets state.
    4. Test requests a page.

    Because the test doesn't use WaitTerminateTestTrait, if #2 is holding the lock while #3 happens, we go into the !$lock_acquired branch in state.

    But the implementation so far was concentrating on the request that hits the $lock_acquired branch eventually winning out when it reaches the end of the request (more or less eventual consistency). In the test, there is no 'end of the request', there is just the end of the test, and the page we hit immediately after trying to set state can still get the wrong value in the meantime. The answer is to write to the cache even more aggressively when we can't acquire the lock. It still doesn't work if the underlying request takes longer than a second to complete, because then both the pre- and post-lock-attempt cache writes can still be overwritten. We could wait longer than a second, but on a real request the end-of-request logic is more likely to kick in later and clean everything up.

    Bit annoyed that I didn't look at the test in more detail earlier because that's obvious in retrospect, but was trying to focus on the stampede condition that real sites will hit vs functional test weirdness.

    Fresh set of runs:

    https://git.drupalcode.org/project/drupal/-/pipelines/392873
    https://git.drupalcode.org/project/drupal/-/pipelines/393001

  • Pipeline finished with Failed
    9 days ago
    Total: 1080s
    #393466
  • Is "Needs work" correct as current status, or is this ready to be reviewed again?

  • πŸ‡¬πŸ‡§United Kingdom catch

    @godotislate it's not actually fixing the race condition at the moment so very much needs work - although review of the approach would probably still be useful.
