[random test failure] Random failure in PathWorkspacesTest

Issue created by @longwave
Status changed to Needs review about 2 years ago2:18pm 19 July 2023
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
last update about 2 years ago
3 pass
Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
Running the failing test 100x to see if this triggers it.
Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
Comment about 2 years ago →
🇨🇭Switzerland znerol
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
last update about 2 years ago
3 pass
Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
Trying again with 1500x.
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
38:49
12:31
Running
Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
1500 passes suggests it is somehow an interaction with another test. Trying all functional tests 50x at @Spokje's suggestion.
Comment about 2 years ago →
🇳🇱Netherlands spokje
I was trying to run all tests from @group path, but core/scripts/run-tests.sh doesn't seem to be handle a "simple" thing like --group?

I see support for module, class, directory but not the standard phpunit --group foo?
Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
The default argument is a group or comma-separated list of groups, so you just do core/scripts/run-tests.sh path. That is why the DrupalCI argument is testgroups as well, it's a hack that you can use it to pass other switches.
Status changed to Active about 2 years ago12:56pm 20 July 2023
Comment about 2 years ago →
🇳🇱Netherlands spokje
Ah, crap.

I knew that once, many moons ago.
Thanks @longwave

I think here's you're (rather big) fail-canary-in-a-coal-mine to start with.
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
last update about 2 years ago
29 pass, 1 fail

🇨🇭Switzerland znerol

Is there any chance that the test generates a node with id() != 1 under some circumstances?

I do not understand any of the workspaces stuff. But from skimming the test it looks like the assertions aren't really consistent. In the first part there is a drupalGet which constructs the path using $node->id() while in the assertions following it looks like the path is hard-coded to preload-paths:/node/1.

  /**
   * Tests path aliases with workspaces.
   */
  public function testPathAliases() {
    // Create a published node in Live, without an alias.
    $node = $this->drupalCreateNode([
      'type' => 'article',
      'status' => TRUE,
    ]);

[...]

    $edit = [
      'path[0][alias]' => '/' . $this->randomMachineName(),
    ];
    $this->drupalGet('node/' . $node->id() . '/edit');
    $this->submitForm($edit, 'Save');

[...]

    // Publish the workspace and check that the alias can be accessed in Live.
    $stage->publish();
    $this->assertAccessiblePaths([$path]);
    $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:/node/1'));

Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
It shouldn't be able to, as each test is supposed to be independent, and should run the same each time, unless we explicitly say that something is random.
Comment about 2 years ago →
🇳🇱Netherlands spokje
n=50, but it seems 🐛 Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration Fixed is a very likely suspect.

This patch was brought to you by the Coal Mine Canary Libreation Front
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
last update about 2 years ago
29 pass
Comment about 2 years ago →
🇨🇭Switzerland znerol
That hunk maybe? In that case we'd need a better way to wait for termination to have terminated in the child site.

// Cacheable normalizations are written after the response is flushed to // the client; give the server a chance to complete this work. sleep(1);
Comment about 2 years ago →
🇳🇱Netherlands spokje
Yes please!

For me personally sleep(1); is a very undeterministic way of determining something is finished.
Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
I think you might be right. In PathAliasTest for example we had to add:

+ // The \Drupal\path_alias\AliasWhitelist service performs cache clears after + // Drupal has flushed the response to the client; wait for this to finish. + sleep(1); $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:' . $edit['path[0][value]']), 'Cache entry was created.');
In PathWorkspacesTest the failure is also looking for this cache key:

$this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:/node/1'));
So the easy fix is to add sleep(1); to PathWorkspacesTest, but is this the symptom of something worse? Has 🐛 Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration Fixed introduced the possibility of cache coherency bugs or race conditions, if we need to sleep in tests?
Comment about 2 years ago →
🇨🇭Switzerland znerol
Symfony docs about kernel termination state that

you will trigger the kernel.terminate event where you can perform certain actions that you may have delayed in order to return the response as quickly as possible to the client (e.g. sending emails).

and further (probably needs an update now):

at the moment, only the PHP FPM server API is able to send a response to the client while the server's PHP process still performs some tasks. With all other server APIs, listeners to kernel.terminate are still executed, but the response is not sent to the client until they are all completed.

Side note: I guess the symfony docs need an update after upstream upstream PR 46931.

After reading the docs, I feel that the terminate event should be used for stuff which also could be performed in a background task (a queue or cron). Not sure whether the usage of the terminate event in Drupal core and contrib falls into this category.

On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
last update about 2 years ago
29,826 pass
@znerol opened merge request.
Comment about 2 years ago →
🇨🇭Switzerland znerol
Checking MR together with patch from #9
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
last update about 2 years ago
29 pass, 1 fail
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
last update about 2 years ago
29,827 pass
Comment about 2 years ago →
🇨🇭Switzerland znerol
Moving around sleeps.
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & MySQL 8
last update about 2 years ago
29 pass
Comment about 2 years ago →
🇫🇷France andypost
The test fails mostly every time in 🐛 Fix deprecated overloaded function usage in PHP 8.3 Fixed
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.1 & sqlite-3.27
last update about 2 years ago
29 pass, 2 fail
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.2 & pgsql-14.1
last update about 2 years ago
29 pass
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.1 & MySQL 5.7
last update about 2 years ago
29 pass
Comment about 2 years ago →
🇨🇭Switzerland znerol
Filed a follow-up ✨ Add a way to delay executions in test runner until terminate event completed in the child site Fixed
Status changed to Needs review about 2 years ago9:01am 21 July 2023
Comment about 2 years ago →
🇨🇭Switzerland znerol
The sqlite test in #19 failed with General error: 5 database is locked. Is that a result of running tests concurrently?
Open in Jenkins → Open on Drupal.org →
Environment: PHP 8.1 & sqlite-3.27
last update about 2 years ago
29 pass
Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
SQLite is prone to database locks, it is not designed for the sorts of heavy loads and high concurrency that we have in test runs. I think that is OK to ignore here, I triggered a retest to see if it goes away.
Comment about 2 years ago →
🇬🇧United Kingdom catch
On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.

Yeah I think the way we use it is fine and the problems it's causing are test-specific. ✨ Add a way to delay executions in test runner until terminate event completed in the child site Fixed is a good idea.
Comment about 2 years ago →
🇳🇱Netherlands spokje
Now I hate to be _that_ guy, but hey, I _am_ that guy:

The normal routine to prove a random failure is fixed is to run the failing patch and the patch with the fix at the same time, whilst the latter has to have ~8000 - 10.000 failure free runs to prove it's credibility.

Seeing that the current 50x run takes around 10 minutes and that concurrent tests are, at least in my experience, getting wonky/throwing unpredictable unrelated randoms around the 1 hour mark, I think we need a 300x run-patch, which we then trigger 8000/300 ~ 27 times.

The best way for that is using the "Custom parameters" option when creating a test-run.
After that first time, it's "just" a matter of browser-back for 27x and pressing the submit with the same chosen options (Those tend to stick after the 2nd time browser-back in my, far too big, experience).

We could also try to bring the number of different testable classes in the "canary"/failing patch down, so we can have more runs of that.
But that's very likely to take (much) more time then blindly monkey-bashing the browser-back button.

Also: Big yay! for ✨ Add a way to delay executions in test runner until terminate event completed in the child site Fixed

Comment about 2 years ago →

System Message

longwave → committed 2dad249a on 11.x

Issue #3375584 by znerol, longwave, Spokje: [random test failure] Random...

Status changed to Fixed about 2 years ago12:36pm 21 July 2023
Comment about 2 years ago →
🇬🇧United Kingdom longwave UK
Discussed this with @Spokje in Slack and @znerol in person at Drupal Dev Days. This is affecting contributions at DDD as otherwise green patches are hitting it, so I am making the pragmatic call to commit this from NR. The fix appears to be the correct one given that we needed similar sleeps elsewhere, and this should unblock developers here; it won't make anything worse if it is the wrong fix, and we have the followup being actively worked on to handle this all in a better way.

Committed and pushed 2dad249ac0 to 11.x. Thanks!
Comment about 2 years ago →
🇬🇧United Kingdom catch
+1 on the commit from NR, I guess this can be a posthumous RTBC.

@Spokje that has turned out to be very important on issues where we're unskipping skipped random failures, but I think for 'quick fix' issues like this where the issue is actively failing, it's a bit less of a concern since the commit is a lot less likely to re-introduce a random failure even if it doesn't 100% fix one.
Comment almost 2 years ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.

[random test failure] Random failure in PathWorkspacesTest

Problem/Motivation

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Comments & Activities