[random test failure] Random failure in PathWorkspacesTest

Created on 19 July 2023, 11 months ago
Updated 21 July 2023, 11 months ago

Problem/Motivation

Random fail in PathWorkspacesTest:

PHPUnit 9.6.8 by Sebastian Bergmann and contributors.

Testing Drupal\Tests\workspaces\Functional\PathWorkspacesTest
F..                                                                 3 / 3 (100%)

Time: 00:42.107, Memory: 4.00 MB

There was 1 failure:

1) Drupal\Tests\workspaces\Functional\PathWorkspacesTest::testPathAliases
Failed asserting that a boolean is not empty.

/var/www/html/vendor/phpunit/phpunit/src/Framework/Constraint/Constraint.php:122
/var/www/html/vendor/phpunit/phpunit/src/Framework/Constraint/Constraint.php:55
/var/www/html/core/modules/workspaces/tests/src/Functional/PathWorkspacesTest.php:109
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:728

FAILURES!
Tests: 3, Assertions: 126, Failures: 1.

Only affects 11.x but across PHP versions and database drivers.

https://www.drupal.org/pift-ci-job/2719595
https://www.drupal.org/pift-ci-job/2718910
https://www.drupal.org/pift-ci-job/2719593
https://www.drupal.org/pift-ci-job/2719016
https://www.drupal.org/pift-ci-job/2719804

Discussed briefly in Slack, 🐛 Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration Fixed is possibly a suspect issue.

Steps to reproduce

?

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

🐛 Bug report
Status

Fixed

Version

11.0 🔥

Component
Workspaces 

Last updated 16 days ago

No maintainer
Created by

🇬🇧United Kingdom longwave UK

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

  • Issue created by @longwave
  • Status changed to Needs review 11 months ago
  • last update 11 months ago
    3 pass
  • 🇬🇧United Kingdom longwave UK

    Running the failing test 100x to see if this triggers it.

  • 🇬🇧United Kingdom longwave UK
  • last update 11 months ago
    3 pass
  • 🇬🇧United Kingdom longwave UK

    Trying again with 1500x.

  • 25:24
    59:06
    Running
  • 🇬🇧United Kingdom longwave UK

    1500 passes suggests it is somehow an interaction with another test. Trying all functional tests 50x at @Spokje's suggestion.

  • 🇳🇱Netherlands Spokje

    I was trying to run all tests from @group path, but core/scripts/run-tests.sh doesn't seem to be handle a "simple" thing like --group?

    I see support for module, class, directory but not the standard phpunit --group foo?

  • 🇬🇧United Kingdom longwave UK

    The default argument is a group or comma-separated list of groups, so you just do core/scripts/run-tests.sh path. That is why the DrupalCI argument is testgroups as well, it's a hack that you can use it to pass other switches.

  • Status changed to Active 11 months ago
  • 🇳🇱Netherlands Spokje

    Ah, crap.

    I knew that once, many moons ago.
    Thanks @longwave

    I think here's you're (rather big) fail-canary-in-a-coal-mine to start with.

  • last update 11 months ago
    29 pass, 1 fail
  • 🇨🇭Switzerland znerol

    Is there any chance that the test generates a node with id() != 1 under some circumstances?

    I do not understand any of the workspaces stuff. But from skimming the test it looks like the assertions aren't really consistent. In the first part there is a drupalGet which constructs the path using $node->id() while in the assertions following it looks like the path is hard-coded to preload-paths:/node/1.

      /**
       * Tests path aliases with workspaces.
       */
      public function testPathAliases() {
        // Create a published node in Live, without an alias.
        $node = $this->drupalCreateNode([
          'type' => 'article',
          'status' => TRUE,
        ]);
    
    [...]
    
        $edit = [
          'path[0][alias]' => '/' . $this->randomMachineName(),
        ];
        $this->drupalGet('node/' . $node->id() . '/edit');
        $this->submitForm($edit, 'Save');
    
    [...]
    
        // Publish the workspace and check that the alias can be accessed in Live.
        $stage->publish();
        $this->assertAccessiblePaths([$path]);
        $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:/node/1'));
    
    
  • 🇬🇧United Kingdom longwave UK

    It shouldn't be able to, as each test is supposed to be independent, and should run the same each time, unless we explicitly say that something is random.

  • 🇳🇱Netherlands Spokje

    n=50, but it seems 🐛 Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration Fixed is a very likely suspect.

    This patch was brought to you by the Coal Mine Canary Libreation Front

  • last update 11 months ago
    29 pass
  • 🇨🇭Switzerland znerol

    That hunk maybe? In that case we'd need a better way to wait for termination to have terminated in the child site.

        // Cacheable normalizations are written after the response is flushed to
        // the client; give the server a chance to complete this work.
        sleep(1);
    
  • 🇳🇱Netherlands Spokje

    Yes please!

    For me personally sleep(1); is a very undeterministic way of determining something is finished.

  • 🇬🇧United Kingdom longwave UK

    I think you might be right. In PathAliasTest for example we had to add:

    +    // The \Drupal\path_alias\AliasWhitelist service performs cache clears after
    +    // Drupal has flushed the response to the client; wait for this to finish.
    +    sleep(1);
         $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:' . $edit['path[0][value]']), 'Cache entry was created.');
    

    In PathWorkspacesTest the failure is also looking for this cache key:

        $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:/node/1'));
    

    So the easy fix is to add sleep(1); to PathWorkspacesTest, but is this the symptom of something worse? Has 🐛 Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration Fixed introduced the possibility of cache coherency bugs or race conditions, if we need to sleep in tests?

  • 🇨🇭Switzerland znerol

    Symfony docs about kernel termination state that

    you will trigger the kernel.terminate event where you can perform certain actions that you may have delayed in order to return the response as quickly as possible to the client (e.g. sending emails).

    and further (probably needs an update now):

    at the moment, only the PHP FPM server API is able to send a response to the client while the server's PHP process still performs some tasks. With all other server APIs, listeners to kernel.terminate are still executed, but the response is not sent to the client until they are all completed.

    Side note: I guess the symfony docs need an update after upstream upstream PR 46931.

    After reading the docs, I feel that the terminate event should be used for stuff which also could be performed in a background task (a queue or cron). Not sure whether the usage of the terminate event in Drupal core and contrib falls into this category.

    On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.

  • last update 11 months ago
    29,826 pass
  • @znerol opened merge request.
  • 🇨🇭Switzerland znerol

    Checking MR together with patch from #9

  • last update 11 months ago
    29 pass, 1 fail
  • last update 11 months ago
    29,827 pass
  • 🇨🇭Switzerland znerol

    Moving around sleeps.

  • last update 11 months ago
    29 pass
  • 🇫🇷France andypost

    The test fails mostly every time in 🐛 Fix deprecated overloaded function usage in PHP 8.3 Fixed

  • Open in Jenkins → Open on Drupal.org →
    Environment: PHP 8.1 & sqlite-3.27
    last update 11 months ago
    29 pass, 2 fail
  • Open in Jenkins → Open on Drupal.org →
    Environment: PHP 8.2 & pgsql-14.1
    last update 11 months ago
    29 pass
  • Open in Jenkins → Open on Drupal.org →
    Environment: PHP 8.1 & MySQL 5.7
    last update 11 months ago
    29 pass
  • Status changed to Needs review 11 months ago
  • 🇨🇭Switzerland znerol

    The sqlite test in #19 failed with General error: 5 database is locked. Is that a result of running tests concurrently?

  • Open in Jenkins → Open on Drupal.org →
    Environment: PHP 8.1 & sqlite-3.27
    last update 11 months ago
    29 pass
  • 🇬🇧United Kingdom longwave UK

    SQLite is prone to database locks, it is not designed for the sorts of heavy loads and high concurrency that we have in test runs. I think that is OK to ignore here, I triggered a retest to see if it goes away.

  • 🇬🇧United Kingdom catch

    On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.

    Yeah I think the way we use it is fine and the problems it's causing are test-specific. Add a way to delay executions in test runner until terminate event completed in the child site Fixed is a good idea.

  • 🇳🇱Netherlands Spokje

    Now I hate to be _that_ guy, but hey, I _am_ that guy:

    The normal routine to prove a random failure is fixed is to run the failing patch and the patch with the fix at the same time, whilst the latter has to have ~8000 - 10.000 failure free runs to prove it's credibility.

    Seeing that the current 50x run takes around 10 minutes and that concurrent tests are, at least in my experience, getting wonky/throwing unpredictable unrelated randoms around the 1 hour mark, I think we need a 300x run-patch, which we then trigger 8000/300 ~ 27 times.

    The best way for that is using the "Custom parameters" option when creating a test-run.
    After that first time, it's "just" a matter of browser-back for 27x and pressing the submit with the same chosen options (Those tend to stick after the 2nd time browser-back in my, far too big, experience).

    We could also try to bring the number of different testable classes in the "canary"/failing patch down, so we can have more runs of that.
    But that's very likely to take (much) more time then blindly monkey-bashing the browser-back button.

    Also: Big yay! for Add a way to delay executions in test runner until terminate event completed in the child site Fixed

    • longwave committed 2dad249a on 11.x
      Issue #3375584 by znerol, longwave, Spokje: [random test failure] Random...
  • Status changed to Fixed 11 months ago
  • 🇬🇧United Kingdom longwave UK

    Discussed this with @Spokje in Slack and @znerol in person at Drupal Dev Days. This is affecting contributions at DDD as otherwise green patches are hitting it, so I am making the pragmatic call to commit this from NR. The fix appears to be the correct one given that we needed similar sleeps elsewhere, and this should unblock developers here; it won't make anything worse if it is the wrong fix, and we have the followup being actively worked on to handle this all in a better way.

    Committed and pushed 2dad249ac0 to 11.x. Thanks!

  • 🇬🇧United Kingdom catch

    +1 on the commit from NR, I guess this can be a posthumous RTBC.

    @Spokje that has turned out to be very important on issues where we're unskipping skipped random failures, but I think for 'quick fix' issues like this where the issue is actively failing, it's a bit less of a concern since the commit is a lot less likely to re-introduce a random failure even if it doesn't 100% fix one.

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.69.0 2024