[random test failures] Race condition in state when individual keys are set with an empty cache

Created on 4 April 2024, about 2 months ago
Updated 23 April 2024, about 1 month ago

Problem/Motivation

Various test modules use the state API to track things between the tested site and the test, after 📌 Use cache collector for state Needs review this creates race conditions, in combination with 🐛 Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration Fixed - because the state cache item is written as a post response task.

The race condition is as follows:

1. drupalGet(), state cache miss.
2. Test class, calls State:set(), which calls invalidate, but there's nothing to invalidate yet
3. request from #1 is still running end of request tasks, we get to updateCache() in CacheCollector, there was no cache at the beginning of the request, and there's still no cache now, so it just writes to the cache.
4. Test class calls drupalGet(), and this now gets a cache without whatever happened in #2.

State has an explicit race condition prevention mechanism when a request begins with a cache item, writes to it at the end of the request, but it's been upated in-between. However, it doesn't handle the case when there is a cache miss at the beginning of the request and the cache is invalidated again before it tries to write.

This case would be extremely unlikely on a real site - you usually either have an empty cache after an invalidation, in which case it's safe to write, or you have a non-empty cache before an invalidation, in which case we already handle it, but the 'clean' environment of the test means we're reproducing it in a fraction of test runs.

We can prevent this by writing a cache item from State::set(), and in CacheCollector, not writing to the cache, if it was empty at the beginning of the request and has been written before the end of the request. Most CacheCollector implementations don't need this protection because they are caching discovery and similar - which can be invalidated, but not individually written to.

Steps to reproduce

Proposed resolution

For test randomness, either using TestWaitTerminate in these tests, or switching from state to key/value would prevent the random test failures.

However, it's also possible to add race condition protection to CacheCollector + state.MR 7336 attempts to do this and is the MR to review.

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

🐛 Bug report
Status

Fixed

Version

10.3

Component
PHPUnit 

Last updated about 8 hours ago

Created by

🇬🇧United Kingdom catch

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @catch
  • First commit to issue fork.
  • Merge request !7334Draft: test repeat → (Open) created by Berdir
  • Pipeline finished with Success
    about 2 months ago
    Total: 1133s
    #137839
  • Pipeline finished with Canceled
    about 2 months ago
    Total: 927s
    #137843
  • Pipeline finished with Canceled
    about 2 months ago
    Total: 1941s
    #137860
  • Pipeline finished with Canceled
    about 2 months ago
    Total: 859s
    #137907
  • Pipeline finished with Success
    about 2 months ago
    Total: 628s
    #137938
  • Merge request !7338Draft: usleep(300) → (Open) created by catch
  • Pipeline finished with Success
    about 2 months ago
    Total: 625s
    #137964
  • Pipeline finished with Failed
    about 2 months ago
    Total: 171s
    #137975
  • Pipeline finished with Success
    about 2 months ago
    #137977
  • Pipeline finished with Failed
    about 2 months ago
    Total: 632s
    #138008
  • Pipeline finished with Failed
    about 2 months ago
    Total: 580s
    #138466
  • Status changed to Needs review about 2 months ago
  • 🇬🇧United Kingdom catch

    I need to back out the test changes here, they're just for the repeated test run job so that we only run one test method 100 times instead of the entire test, but would be great to get reviews - and leaving it in for now in case we want to re-run that job even more times.

  • Status changed to Needs work about 2 months ago
  • 🇬🇧United Kingdom longwave UK

    Looks good to me - the fixes make sense and the comments definitely help explain what's going on. As was discussed in Slack this was originally a concern with the cache collector but never was an issue in practice until tests started hitting cold caches occasionally in this way. For me the evidence in #6 is enough to commit this, so NW to remove the test changes.

  • Status changed to Needs review about 2 months ago
  • 🇬🇧United Kingdom catch

    Yeah we covered the warm (but invalid) cache situation with the cache collector, which is the main case, but cold + invalid just never came up or seemed realistic. Additionally you would never hit this with a discovery cache (which is most cache collector use cases), because it's unlikely to get invalidated twice in the same request or anything, needs to be extremely high write. I'm not sure a site would even run into this with state because fully cold cache then immediate write situation seems unlikely there too, but we're clearly hitting it in tests.

    Rebased to drop the test modifications.

  • Status changed to RTBC about 2 months ago
  • 🇬🇧United Kingdom longwave UK

    MR!7336 looks great.

  • 🇬🇧United Kingdom catch

    Started off as a task but definitely a bugfix now.

  • 🇬🇧United Kingdom catch

    Had one more look over this and realised the comment in State::set() was slightly out of date and could be clearer. Updated that but leaving RTBC since it's comment-only.

  • 🇬🇧United Kingdom alexpott 🇪🇺🌍

    alexpott changed the visibility of the branch 3438424-random-test-failures to hidden.

  • 🇬🇧United Kingdom alexpott 🇪🇺🌍

    alexpott changed the visibility of the branch 3438424-usleep to hidden.

  • Status changed to Needs work about 2 months ago
  • 🇬🇧United Kingdom alexpott 🇪🇺🌍

    The latest changes have the MR failing tests... see https://git.drupalcode.org/project/drupal/-/merge_requests/7336/pipelines

  • Status changed to RTBC about 2 months ago
  • 🇬🇧United Kingdom catch

    Managed to commit to the wrong branch... dropped from the MR now.

  • Pipeline finished with Failed
    about 2 months ago
    Total: 638s
    #139601
  • Status changed to Needs work about 2 months ago
  • 🇬🇧United Kingdom alexpott 🇪🇺🌍

    Still failing tests...

    core/tests/Drupal/Tests/Core/Cache/CacheCollectorTest.php
    core/profiles/standard/tests/src/FunctionalJavascript/StandardPerformanceTest.php

    I think having the MR link in the issue messes up d.o so it puts the failing test link in the wrong place... I've removed that from the issue summary so hopefully this is easier to see.

  • Pipeline finished with Failed
    about 2 months ago
    Total: 637s
    #139966
  • Pipeline finished with Failed
    about 2 months ago
    Total: 640s
    #139991
  • Pipeline finished with Failed
    about 2 months ago
    Total: 575s
    #140018
  • Pipeline finished with Failed
    about 2 months ago
    Total: 606s
    #140025
  • 🇬🇧United Kingdom catch

    OK now I can see the test failures, that's helpful.

    The unit test failure is very handy, I wasn't sure how to test the race condition here, but turns out we already had a unit test for a cache item being set in the middle of the request. I've split that one method into two, so it now tests both the warm cache situation (already covered but the test wasn't explicitly testing this) and cold cache situation (what we're fixing here and what the test coverage was actually testing but with different expectations).

  • Pipeline finished with Canceled
    about 2 months ago
    #140032
  • Pipeline finished with Failed
    about 2 months ago
    Total: 573s
    #140039
  • Pipeline finished with Canceled
    about 2 months ago
    Total: 544s
    #140044
  • Pipeline finished with Failed
    about 2 months ago
    Total: 573s
    #140049
  • Pipeline finished with Running
    about 2 months ago
    #140054
  • Status changed to Needs review about 2 months ago
  • 🇬🇧United Kingdom catch

    Performance tests are green now - it's one extra state cache set + the related key/value queries collected during that request, not surprising since we're being more conservative about writing to the cache on completely cold starts now.

  • Status changed to RTBC about 2 months ago
  • 🇬🇧United Kingdom alexpott 🇪🇺🌍

    Given catch's changes are to a test to fix and improved comments going to rtbc and commit. Hopefully this will reduce some of the random fails due to the state cache collector from occurring.

  • Pipeline finished with Success
    about 2 months ago
    Total: 1031s
    #141810
  • 🇬🇧United Kingdom alexpott 🇪🇺🌍

    Committed 3a3e618 and pushed to 11.x. Thanks!
    Committed af46c48 and pushed to 10.3.x. Thanks!

    • alexpott committed af46c48e on 10.3.x
      Issue #3438424 by catch, Berdir, alexpott, longwave: [random test...
  • Status changed to Fixed about 2 months ago
    • alexpott committed 3a3e6186 on 11.x
      Issue #3438424 by catch, Berdir, alexpott, longwave: [random test...
  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.69.0 2024