Race conditions/bad lock logic in CacheCollector/State

Created on 27 December 2024

Problem/Motivation

__CLASS__ gives the name of the class it is written in, not the class of the current object when called from a subclass, so we are creating a lock with a very non-specific ID in CacheCollector.
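
To illustrate the pitfall, here is an analogous sketch in Python (hypothetical names, not Drupal's actual code): PHP's __CLASS__ behaves like hardcoding the defining class's name, while PHP's static::class behaves like looking up the runtime class.

```python
class CacheCollector:
    """Base class; both lock-name styles shown for comparison."""

    def lock_name_buggy(self):
        # Like PHP's __CLASS__: always names the defining class, so every
        # subclass shares one overly generic lock ID.
        return CacheCollector.__name__ + ':lock'

    def lock_name_fixed(self):
        # Like PHP's static::class: names the runtime class, giving each
        # subclass its own specific lock ID.
        return type(self).__name__ + ':lock'


class State(CacheCollector):
    """Stand-in for a CacheCollector subclass such as State."""


print(State().lock_name_buggy())  # CacheCollector:lock
print(State().lock_name_fixed())  # State:lock
```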

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

Introduced terminology

API changes

Data model changes

Release notes snippet

πŸ› Bug report
Status

Active

Version

11.0 πŸ”₯

Component

base system

Created by

πŸ‡¬πŸ‡§United Kingdom catch


Merge Requests

Comments & Activities

  • Issue created by @catch
  • πŸ‡¬πŸ‡§United Kingdom catch
  • Merge request !10704 "Resolve #3496257 'Race conditions/bad lock'" (Open) created by catch
  • πŸ‡¬πŸ‡§United Kingdom catch
  • Merge request !10705 "Draft: Resolve #3496257 '2500'" (Open) created by catch
  • πŸ‡ΊπŸ‡ΈUnited States luke.leber Pennsylvania

    Whenever I see a call to wait on a lock I get a chill up my spine.

    From a cursory glance, this change could result in a ton of processes spinning/waiting to acquire a lock under heavy load, couldn't it?

    I think this one should be subject to some load testing on a larger site.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Whenever I see a call to wait on a lock I get a chill up my spine.

    I agree with this, but I think we might not need the lock wait here; instead we can more aggressively clear/invalidate/set the cache even when we can't acquire the lock. Pushed a commit that tries this.

  • πŸ‡¬πŸ‡§United Kingdom catch

    The lock name was a red herring but I discovered some real race conditions.

    https://git.drupalcode.org/project/drupal/-/jobs/3827879 runs a frequently failing test that depends on writing to state 2500 times, and is green with the MR.

    https://git.drupalcode.org/project/drupal/-/pipelines/380036 is a full test run with only nightwatch failures, which this won't help with.

  • πŸ‡¬πŸ‡§United Kingdom catch
  • πŸ‡ΊπŸ‡ΈUnited States dww

    Thanks for this! Great work tracking this down.

    I added some suggestions to the MR, mostly for trivial nits and a typo. Hate to slow down progress on a critical, but I really think we need comments about why we're not releasing the locks we're acquiring. I don't think this is RTBC if all the MR threads aren't resolved.

  • Pipeline finished with Canceled
    24 days ago
    Total: 227s
    #380290
  • πŸ‡¬πŸ‡§United Kingdom catch

    I'm going to spend a bit more time trying to figure out whether we can do away with holding the lock. If we can't, that should result in a better comment, and possibly a @todo plus a follow-up issue to have another try. If we can release the lock, then great: we don't need to explain why we can't.

  • Pipeline finished with Canceled
    24 days ago
    Total: 278s
    #380292
  • Pipeline finished with Canceled
    24 days ago
    Total: 806s
    #380299
  • Pipeline finished with Success
    24 days ago
    Total: 871s
    #380307
  • Pipeline finished with Failed
    24 days ago
    Total: 53s
    #380310
  • Pipeline finished with Canceled
    24 days ago
    Total: 842s
    #380315
  • Pipeline finished with Canceled
    24 days ago
    Total: 865s
    #380322
  • Pipeline finished with Success
    24 days ago
    Total: 868s
    #380329
  • Pipeline finished with Success
    24 days ago
    Total: 1504s
    #380325
  • πŸ‡¬πŸ‡§United Kingdom catch

    Alright it looks like that theory was correct.

    The remaining race condition was something like as follows:

    We lock cache writes, but not cache reads. CacheCollector uses the cache created timestamp (which is stored with millisecond precision) to detect if the cache item it read at the beginning of the request is the same as the one it reads again at the end of the request.

    Because we only lock writes, it's possible for two processes to acquire the lock, write the cache, and release the lock within the same millisecond. For the usual CacheCollector case this is a non-issue, because we are generally caching things which are expensive to build and rarely written to (like the runtime theme registry), so all of this happening within a single millisecond does not really come up. It's a problem here because state is cheap to build and also gets individual keys written to it, especially with the database on a ramdisk on the same server, as in GitLab pipelines.

    If State::set() writes a cache item in the same millisecond as another cache item is set, this breaks the race condition detection in ::updateCache(). Another request can read the newly outdated cache item, get to the end of the request, compare with the State::set() one, see them as 'the same' and then write back stale data.

    Holding the lock longer was preventing the situation where another process writes back to the cache item immediately after State::set() has written to it, having read a different cache item created in the same millisecond as the call to State::set().

    Instead of that, we can usleep() for 10 milliseconds before and after writing, and try a bit harder to acquire the lock. This still results in 2500 passes for the test, but it doesn't involve either the full lock->wait() call or holding the lock that were necessary in earlier iterations.
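
    The shape of that mitigation can be sketched as follows (a Python sketch, not Drupal's PHP implementation; guarded_write, tries and pause_s are hypothetical names):

```python
import time
import threading

def guarded_write(lock, write, tries=3, pause_s=0.010):
    """Sleep around a cache write and retry the lock a few times,
    instead of a full lock wait or holding the lock for the request."""
    time.sleep(pause_s)  # the usleep() before writing: no earlier write
                         # can share this write's millisecond
    for _ in range(tries):  # "try a bit harder" to acquire the lock
        if lock.acquire(blocking=False):
            try:
                write()
            finally:
                lock.release()
            break
        time.sleep(pause_s)
    time.sleep(pause_s)  # the usleep() after writing
```

    The point of the sleeps is that the write's creation timestamp can no longer collide, at millisecond precision, with a write from another process that raced through the same code path.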

    There are still other theoretical race conditions here - e.g. if two servers have a time offset, and write cache items in the 'same' millisecond even though they are actually different due to the offset. We could potentially protect against this in CacheCollector by hashing the cache data every time it's read, and then comparing the hash when we write back to the cache - but this would introduce hashing overhead to every cache collector read which is not great. Or maybe we could store the hash with the cache item and compare that, which would mean changing the cache data structure a bit, but that might work and allow us to remove the usleep() calls then.

    Going to try that approach in the sandbox MR and see if it's straightforward enough.

  • πŸ‡³πŸ‡±Netherlands spokje

    *cough*
    Testbot really doesn't like it...
    *cough*

  • πŸ‡¬πŸ‡§United Kingdom catch

    OK, trying to use a hash of the data doesn't work, or at least not yet. It also requires changing the structure of the cache item, which means that even if it did work it would probably be minor-only.

    I'll open a follow-up for that.

    https://git.drupalcode.org/project/drupal/-/merge_requests/10704 is the MR and approach to review.

  • πŸ‡¬πŸ‡§United Kingdom catch

    catch β†’ changed the visibility of the branch 3496257-2500 to hidden.

  • πŸ‡¬πŸ‡§United Kingdom catch

    catch β†’ changed the visibility of the branch 3496257-2500 to active.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Just kicked off a third run of https://git.drupalcode.org/project/drupal/-/merge_requests/10705/pipelines which will give us 7500 passing tests in a row assuming it also comes back green.

  • πŸ‡³πŸ‡±Netherlands spokje

    Hating to be that guy, but hey, I _am_ that guy...

    I'm very doubtful that 7500 runs will fit into the current cutoff time of 1 hour.

  • πŸ‡¬πŸ‡§United Kingdom catch

    @spokje it's 2500 runs * 3, should be OK.

  • Pipeline finished with Failed
    24 days ago
    Total: 1091s
    #380418
  • πŸ‡¬πŸ‡§United Kingdom catch

    Well, that's interesting: 2 runs out of 7500 failed https://git.drupalcode.org/project/drupal/-/jobs/3831692/viewer

    Looks like this happened when the overall time for the test went up to 30-50+ seconds from 6-10s:

    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes    9s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   18s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   23s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   24s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   26s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   26s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   30s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   36s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   35s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   35s                                      
    \Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksT   0 passes   37s   1 fails                            
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   44s                                      
    \Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksT   0 passes   48s   1 fails                            
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   62s                                      
    Drupal\Tests\layout_builder\Functional\LayoutBuilderBlocksTe   1 passes   60s     
    

    Might be a sign of the lock failing to be acquired or similar.

    Don't think this necessarily invalidates the fix here since it's still a significant improvement, but might show we need to lock wait for longer and/or explore the hash idea further in the other issue.

  • Pipeline finished with Success
    24 days ago
    Total: 966s
    #380433
  • Pipeline finished with Success
    24 days ago
    Total: 862s
    #380454
  • πŸ‡¬πŸ‡§United Kingdom catch

    Pass rate of 12248/12500 now.

  • Pipeline finished with Failed
    24 days ago
    Total: 1191s
    #380500
  • πŸ‡¬πŸ‡§United Kingdom catch

    Another 1/2500 fail on https://git.drupalcode.org/project/drupal/-/jobs/3832573/viewer so at 14997/15000 passes now.

  • πŸ‡¬πŸ‡§United Kingdom catch

    catch β†’ changed the visibility of the branch 3496257-2500 to hidden.

  • Merge request !10748Draft: run tests 1750 times β†’ (Open) created by catch
  • πŸ‡¬πŸ‡§United Kingdom catch

    catch β†’ changed the visibility of the branch 3496257-1750-again to hidden.

  • Merge request !10749Draft: Run tests 2500 times β†’ (Open) created by catch
  • πŸ‡¬πŸ‡§United Kingdom catch
  • πŸ‡¬πŸ‡§United Kingdom catch

    Famous last words because we're down to 4/15000 odds so false negatives are increasingly likely, but might have figured out the last race condition.

    In earlier iterations of the MR, I was holding the lock acquired in State::set() and not releasing it. This was out of paranoia, so I changed it to release after it was noted in review. I now think there was a good reason for it after all.

    In CacheCollector::updateCache(), we use a new cache item as a tombstone record for the following case:

    1. Process A - cold cache

    2. Process B - invalidates the cache (e.g. a state write).

    3. Process A writes the cache back at end of request, with stale data from before process B wrote its data.

    To avoid this, Process B writes a new cache item, and then process A sees that a brand new cache item has been written when none existed before, and throws its data away instead.

    The remaining race condition looked like this:

    1. Process A - cold cache
    2. Process B - state write and cache write.
    3. Process B - end of request - deletes the cache item (fails to acquire a lock, or for any other reason).
    4. Process A - because the cache item is deleted, doesn't see it, so writes the stale data to the cache.

    Holding the lock does the following:

    1. Process A - cold cache
    2. Process B - state write and cache write
    3. Process B - end of request - still has the lock, so writes to cache again instead of deleting.
    4. Process A - finds the new cache item so throws its data away
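
    The tombstone comparison can be sketched like this (illustrative Python with hypothetical names; Drupal's real logic lives in CacheCollector::updateCache()):

```python
cache = {}  # shared cache bin: key -> {'created': ..., 'data': ...}

def invalidating_write(key, data, created):
    # Process B: on invalidation, write a brand-new cache item (the
    # tombstone) rather than just deleting, so late writers can tell
    # that the world has changed.
    cache[key] = {'created': created, 'data': data}

def end_of_request_write_back(key, created_at_start, data, now):
    # Process A: only write back if the item is still the one it saw at
    # the start of the request (None means the cache was cold).
    current = cache.get(key)
    if current is not None and current['created'] != created_at_start:
        return False  # a newer item exists: throw our stale data away
    cache[key] = {'created': now, 'data': data}
    return True

# Tombstone works: B writes a fresh item, so A discards its stale data.
invalidating_write('state', 'new', created=2.0)
assert end_of_request_write_back('state', None, 'stale', now=3.0) is False

# The remaining race: if B's end-of-request logic deletes the item
# instead, A no longer sees the tombstone and writes stale data back.
del cache['state']
assert end_of_request_write_back('state', None, 'stale', now=3.0) is True
```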

  • πŸ‡ͺπŸ‡ΈSpain penyaskito Seville πŸ’ƒ, Spain πŸ‡ͺπŸ‡Έ, UTC+2 πŸ‡ͺπŸ‡Ί

    If this helps, as someone completely new to this issue, the code comments look as clear as I think they can be to understand the complex issue and the fix.

  • πŸ‡¬πŸ‡§United Kingdom catch
  • πŸ‡ΊπŸ‡ΈUnited States nicxvan

    I think this is getting pretty close. Do we want a branch now with just the cache changes?

    I'm trying to figure out how to check the probabilities, it's been a while since I took statistics.

  • πŸ‡ΊπŸ‡ΈUnited States nicxvan

    OK, using the Poisson distribution, the odds that something that happens 3/15000 times happens 0 times in 22500 runs is 1.1%.
    https://www.wolframalpha.com/input?i2d=true&i=Divide%5B%5C%2840%29Power%...

    Now this isn't perfect because it assumes that the 3/15000 rate is constant and obviously we're affecting the thing driving it, but we can't really do better that I can see.

    Just for others following along, the formula is (e^(-Ξ») (Ξ»^x))/x!
    e is Euler's number
    Ξ» is the rate: 3 divided by 15000, multiplied by 22500 = 4.5
    x is the number of failures expected, which in this case is 0

    Plug those numbers in and you get 1.1%.
    I interpret that to mean that if the rate really were 3/15000, we'd have to run a set of 22500 tests about 100 times to get a run with 0 failures.
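
    The arithmetic above can be verified directly in Python:

```python
import math

# Poisson P(X = 0) with rate lambda = (3 / 15000) * 22500 = 4.5
lam = (3 / 15000) * 22500
p_zero = math.exp(-lam) * lam**0 / math.factorial(0)
print(f"{p_zero:.1%}")  # 1.1%
```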

    That feels definitive to me.

  • πŸ‡³πŸ‡±Netherlands spokje

    Just my EUR0.02: I can see an almost always considerable rise in the failure rate when Drupal CI is busier.
    All runs here have taken place in the Christmas/New Years period, in which it's very quiet around here.

    If possible, I would like to see a few runs from monday January 6th onward, when normal business is restored and see if that affects the numbers drastically.

    If that would be too long a wait, I'm OK with this as-is; if stuff starts breaking down (more) on Monday we can act then as needed.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Couple of general points, mostly answering questions I asked myself while working on this:

    This problem is fairly unique to state's use of cache collector. Most cache collector implementations just build a 'runtime' cache from a discovery cache (smaller array from a massive array), so they are not changing values in the underlying data at all.

    The most likely problem this change could cause would be exposing badly behaved contrib or custom code that writes to state on every request or something, but that would already cause problems with state caching; the lock wait would make those problems a bit worse, but writing to state every request would be pretty extreme.

    With a normal/occasional state write, while the lock is being held, other processes will need to do some extra key/value queries, and won't be able to write back to the cache, but this is relatively cheap (about the same as before we had state caching in the first place). So that bit seems fine.

    While adding all the logic, I wondered if state should instead re-implement the parts of cache collector it needs instead of making it more complicated, but:

    - overall we'd probably end up with the same amount of complexity, but in two places instead of one
    - the changes to cache collector make it more robust for contrib/custom use-cases doing similar things
    - even if state moved to 'immediate write' with no end of request lazy-writing at all, it would still have to lock both state writes and cache writes.

    There is one possible optimization we can do here which I realised writing the above:

    When we write to the cache item in State::set(), if we've acquired a lock, include the other state keys that are already loaded too. That will make it more likely that subsequent requests get a cache hit (including while the lock is being held). Not going to have time to work on that for at least a few hours though.

    @nicxvan https://git.drupalcode.org/project/drupal/-/merge_requests/10704 is the committable/reviewable branch, it should be up-to-date (except for if that extra optimisation works).

  • πŸ‡¬πŸ‡§United Kingdom catch

    @nicxvan thanks for the stats research, that's pretty encouraging.

    I've pushed a revert of πŸ› [random test failure] LayoutBuilderBlocksTest::testBlockPlaceholder failing Active to the 2500 test runs branch (https://git.drupalcode.org/project/drupal/-/merge_requests/10749) so that it's possible to keep hitting the 'run pipelines' button on there a bit more next week if this isn't committed before then. That will also give us another 2500 runs in the meantime.

  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    Arrived here via πŸ› [random test failure] EditorSecurityTest::testEditorXssFilterOverride Active .

    @catch's #13 + #30 + #36 + #38 are 🀯

    @nicxvan: it's been >10 years since I saw somebody in a software engineering context mention Poisson's distribution and it reminds me of one of the very few books I still have lying around from studying Computer Science: https://www.taylorfrancis.com/books/mono/10.1201/9781420011425/probabili... πŸ€“πŸ˜…

    @spokje in #35: +1 for increased CI failure rates when CI infra is under load.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Since it's Monday, started kicking off pipelines again. Will edit this comment to add them when I remember to kick another one off (insert xkcd compiling image).

    https://git.drupalcode.org/project/drupal/-/pipelines/387144

  • πŸ‡¬πŸ‡§United Kingdom catch

    I think we can skip holding the lock after all.

    Short explanation of why:

    Performance tests were failing in the latest MR. This is because the CLI was holding the lock and preventing the site-under-test from acquiring it (annoying).

    Trying to work around this made me realise we're mainly holding the lock to avoid ::updateCache() throwing away the tombstone record when it can't acquire a lock.

    Instead of that, if we've acquired a lock, we can completely disable the end of request logic in state, and release the lock. That means as soon as the state item is written other requests can start to build the cache again.

    Obviously, now need to check that this still fixes the race condition so another 10k+ test runs to kick off on the test-only MR - will start that once the main fix MR is green again.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Implemented that. One functional js test failure which I think is unrelated. Updated the test only MR, first 2500 runs came back green, will kick off more over the next 24 hours.

    https://git.drupalcode.org/project/drupal/-/pipelines/387799
    https://git.drupalcode.org/project/drupal/-/pipelines/387820

  • πŸ‡¬πŸ‡§United Kingdom catch

    Too much status change.

  • Pipeline finished with Success
    14 days ago
    Total: 882s
    #387846
  • Pipeline finished with Success
    14 days ago
    Total: 898s
    #387861
  • πŸ‡¬πŸ‡§United Kingdom catch

    12.5k runs without a failure so far, which is pretty encouraging, so I think this genuinely is ready for review again now. I'll kick off some more pipelines tomorrow.

  • Pipeline finished with Success
    14 days ago
    Total: 880s
    #387885
  • πŸ‡³πŸ‡±Netherlands spokje

    Great work, I suppose this also takes care of random failures in tests that actually need to use State for testing like \Drupal\Tests\system\Functional\System\CronRunTest::testCronUI?

    This is an example where we can't just switch to KeyValue, since State is an integral part of running cron.

  • πŸ‡¬πŸ‡§United Kingdom catch

    @Spokje yes it should fix cases like that as well.

  • Pipeline finished with Failed
    14 days ago
    Total: 911s
    #388250
  • Pipeline finished with Success
    14 days ago
    Total: 833s
    #388309
  • πŸ‡¬πŸ‡§United Kingdom catch

    Theory on the above: the runtime doubling could be failing to acquire a lock and having to wait for a second. The runtime doubled when the test itself was holding the lock for the duration of the test (although that prevented any new caching at all, so it fixed the race condition).

    So... trying to be more aggressive when we fail to acquire the lock.

  • Pipeline finished with Failed
    14 days ago
    Total: 906s
    #388406
  • πŸ‡¬πŸ‡§United Kingdom catch
  • Pipeline finished with Success
    14 days ago
    Total: 1550s
    #388387
  • Pipeline finished with Success
    14 days ago
    Total: 928s
    #388431
  • Pipeline finished with Success
    14 days ago
    Total: 822s
    #388648
  • Pipeline finished with Failed
    14 days ago
    Total: 885s
    #388646
  • Pipeline finished with Success
    13 days ago
    Total: 882s
    #388807
  • πŸ‡¬πŸ‡§United Kingdom catch

    The simplification led to a behaviour change for the path alias cache, but it's very small and still works.

    I think it is probably possible to make updateCache() work for what State::set() wants to do, but I've already spent a lot of time on this issue and need a break, so I think we should defer that to a follow-up. It might reduce some code and hopefully make things even more robust, but it will be as tricky as this issue to ensure we've caught all the cases. I would rather have things correct, which hopefully the current state of the MR gets us to.

    Because a few things have changed again, going to run more pipelines..

    https://git.drupalcode.org/project/drupal/-/pipelines/390206

  • Pipeline finished with Failed
    11 days ago
    Total: 1064s
    #391267
  • Pipeline finished with Failed
    11 days ago
    #391292
  • πŸ‡¬πŸ‡§United Kingdom alexpott πŸ‡ͺπŸ‡ΊπŸŒ
  • πŸ‡ΊπŸ‡ΈUnited States dww

    (Sorry for the assigned/unassigned noise -- accidentally clicked something in the MR review UI).

    Resolved the opened threads, since @catch has already addressed all the feedback.

    Opened a few trivial suggestions for some nits, and to rename a test method that no longer does what the old name says.

    I would RTBC, but I think the suggestions are worth applying, first. πŸ˜…

    Initial pass at saving credit:
    @catch (duh)
    @alexpott for MR reviews + suggestions
    @Spokje, @nicxvan, @godotislate, @luke.leber, and myself for MR and issue reviews

    Almost there!

    Thanks,
    -Derek

    p.s. restoring the Bug Smash Initiative tag, which I think @catch accidentally clobbered in #13 with a stale form submit.

  • πŸ‡¬πŸ‡§United Kingdom catch

    Tracked down the very specific race condition with the test (well I think I did) which was still resulting in some failures occasionally - about 1/2500.

    I think it is:

    1. Test requests a page.
    2. Page writes to the state cache collector.
    3. Test sets state.
    4. Test requests a page.

    Because the test doesn't use WaitTerminateTestTrait, if #2 is holding the lock while #3 happens, we go into the !$lock_acquired branch in state.

    But the implementation so far was concentrating on the request that hits the $lock_acquired branch eventually winning out when it reaches the end of the request (more or less eventual consistency). In the test, there is no 'end of the request', there is just the end of the test, and the page we hit immediately after trying to set state can still get the wrong value in the meantime. The answer is to write to the cache even more aggressively when we can't acquire the lock. It still doesn't work if the underlying request takes longer than a second to complete, because then both the pre- and post-lock-attempt cache writes can still be overwritten. We could wait longer than a second, but on a real request the end-of-request logic is more likely to kick in later and clean everything up.

    Bit annoyed that I didn't look at the test in more detail earlier because that's obvious in retrospect, but was trying to focus on the stampede condition that real sites will hit vs functional test weirdness.

    Fresh set of runs:

    https://git.drupalcode.org/project/drupal/-/pipelines/392873
    https://git.drupalcode.org/project/drupal/-/pipelines/393001

  • Pipeline finished with Failed
    9 days ago
    Total: 1080s
    #393466
  • Is "Needs work" correct as current status, or is this ready to be reviewed again?

  • πŸ‡¬πŸ‡§United Kingdom catch

    @godotislate it's not actually fixing the race condition at the moment so very much needs work - although review of the approach would probably still be useful.
