- Issue created by @longwave
- Status changed to Needs review
over 1 year ago 2:18pm 19 July 2023 - last update
over 1 year ago 3 pass - 🇬🇧United Kingdom longwave UK
Running the failing test 100x to see if this triggers it.
- last update
over 1 year ago 3 pass 45:18 19:00 Running- 🇬🇧United Kingdom longwave UK
1500 passes suggests it is somehow an interaction with another test. Trying all functional tests 50x at @Spokje's suggestion.
- 🇳🇱Netherlands spokje
I was trying to run all tests from
@group path
, butcore/scripts/run-tests.sh
doesn't seem to be handle a "simple" thing like--group
?I see support for module, class, directory but not the standard
phpunit --group foo
? - 🇬🇧United Kingdom longwave UK
The default argument is a group or comma-separated list of groups, so you just do
core/scripts/run-tests.sh path
. That is why the DrupalCI argument istestgroups
as well, it's a hack that you can use it to pass other switches. - Status changed to Active
over 1 year ago 12:56pm 20 July 2023 - 🇳🇱Netherlands spokje
Ah, crap.
I knew that once, many moons ago.
Thanks @longwaveI think here's you're (rather big) fail-canary-in-a-coal-mine to start with.
- last update
over 1 year ago 29 pass, 1 fail - 🇨🇭Switzerland znerol
Is there any chance that the test generates a node with
id() != 1
under some circumstances?I do not understand any of the workspaces stuff. But from skimming the test it looks like the assertions aren't really consistent. In the first part there is a
drupalGet
which constructs the path using$node->id()
while in the assertions following it looks like the path is hard-coded topreload-paths:/node/1
./** * Tests path aliases with workspaces. */ public function testPathAliases() { // Create a published node in Live, without an alias. $node = $this->drupalCreateNode([ 'type' => 'article', 'status' => TRUE, ]); [...] $edit = [ 'path[0][alias]' => '/' . $this->randomMachineName(), ]; $this->drupalGet('node/' . $node->id() . '/edit'); $this->submitForm($edit, 'Save'); [...] // Publish the workspace and check that the alias can be accessed in Live. $stage->publish(); $this->assertAccessiblePaths([$path]); $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:/node/1'));
- 🇬🇧United Kingdom longwave UK
It shouldn't be able to, as each test is supposed to be independent, and should run the same each time, unless we explicitly say that something is random.
- 🇳🇱Netherlands spokje
n=50, but it seems 🐛 Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration Fixed is a very likely suspect.
This patch was brought to you by the Coal Mine Canary Libreation Front
- last update
over 1 year ago 29 pass - 🇨🇭Switzerland znerol
That hunk maybe? In that case we'd need a better way to wait for termination to have terminated in the child site.
// Cacheable normalizations are written after the response is flushed to // the client; give the server a chance to complete this work. sleep(1);
- 🇳🇱Netherlands spokje
Yes please!
For me personally
sleep(1);
is a very undeterministic way of determining something is finished. - 🇬🇧United Kingdom longwave UK
I think you might be right. In PathAliasTest for example we had to add:
+ // The \Drupal\path_alias\AliasWhitelist service performs cache clears after + // Drupal has flushed the response to the client; wait for this to finish. + sleep(1); $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:' . $edit['path[0][value]']), 'Cache entry was created.');
In PathWorkspacesTest the failure is also looking for this cache key:
$this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:/node/1'));
So the easy fix is to add
sleep(1);
to PathWorkspacesTest, but is this the symptom of something worse? Has 🐛 Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration Fixed introduced the possibility of cache coherency bugs or race conditions, if we need to sleep in tests? - 🇨🇭Switzerland znerol
Symfony docs about kernel termination state that
you will trigger the kernel.terminate event where you can perform certain actions that you may have delayed in order to return the response as quickly as possible to the client (e.g. sending emails).
and further (probably needs an update now):
at the moment, only the PHP FPM server API is able to send a response to the client while the server's PHP process still performs some tasks. With all other server APIs, listeners to kernel.terminate are still executed, but the response is not sent to the client until they are all completed.
Side note: I guess the symfony docs need an update after upstream upstream PR 46931.
After reading the docs, I feel that the
terminate
event should be used for stuff which also could be performed in a background task (a queue or cron). Not sure whether the usage of the terminate event in Drupal core and contrib falls into this category.On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.
- last update
over 1 year ago 29,826 pass - @znerol opened merge request.
- last update
over 1 year ago 29 pass, 1 fail - last update
over 1 year ago 29,827 pass - last update
over 1 year ago 29 pass - 🇫🇷France andypost
The test fails mostly every time in 🐛 Fix deprecated overloaded function usage in PHP 8.3 Fixed
- last update
over 1 year ago 29 pass, 2 fail - last update
over 1 year ago 29 pass - last update
over 1 year ago 29 pass - 🇨🇭Switzerland znerol
Filed a follow-up ✨ Add a way to delay executions in test runner until terminate event completed in the child site Fixed
- Status changed to Needs review
over 1 year ago 9:01am 21 July 2023 - last update
over 1 year ago 29 pass - 🇬🇧United Kingdom longwave UK
SQLite is prone to database locks, it is not designed for the sorts of heavy loads and high concurrency that we have in test runs. I think that is OK to ignore here, I triggered a retest to see if it goes away.
- 🇬🇧United Kingdom catch
On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.
Yeah I think the way we use it is fine and the problems it's causing are test-specific. ✨ Add a way to delay executions in test runner until terminate event completed in the child site Fixed is a good idea.
- 🇳🇱Netherlands spokje
Now I hate to be _that_ guy, but hey, I _am_ that guy:
The normal routine to prove a random failure is fixed is to run the failing patch and the patch with the fix at the same time, whilst the latter has to have ~8000 - 10.000 failure free runs to prove it's credibility.
Seeing that the current 50x run takes around 10 minutes and that concurrent tests are, at least in my experience, getting wonky/throwing unpredictable unrelated randoms around the 1 hour mark, I think we need a 300x run-patch, which we then trigger 8000/300 ~ 27 times.
The best way for that is using the "Custom parameters" option when creating a test-run.
After that first time, it's "just" a matter of browser-back for 27x and pressing the submit with the same chosen options (Those tend to stick after the 2nd time browser-back in my, far too big, experience).We could also try to bring the number of different testable classes in the "canary"/failing patch down, so we can have more runs of that.
But that's very likely to take (much) more time then blindly monkey-bashing the browser-back button.Also: Big yay! for ✨ Add a way to delay executions in test runner until terminate event completed in the child site Fixed
-
longwave →
committed 2dad249a on 11.x
Issue #3375584 by znerol, longwave, Spokje: [random test failure] Random...
-
longwave →
committed 2dad249a on 11.x
- Status changed to Fixed
over 1 year ago 12:36pm 21 July 2023 - 🇬🇧United Kingdom longwave UK
Discussed this with @Spokje in Slack and @znerol in person at Drupal Dev Days. This is affecting contributions at DDD as otherwise green patches are hitting it, so I am making the pragmatic call to commit this from NR. The fix appears to be the correct one given that we needed similar sleeps elsewhere, and this should unblock developers here; it won't make anything worse if it is the wrong fix, and we have the followup being actively worked on to handle this all in a better way.
Committed and pushed 2dad249ac0 to 11.x. Thanks!
- 🇬🇧United Kingdom catch
+1 on the commit from NR, I guess this can be a posthumous RTBC.
@Spokje that has turned out to be very important on issues where we're unskipping skipped random failures, but I think for 'quick fix' issues like this where the issue is actively failing, it's a bit less of a concern since the commit is a lot less likely to re-introduce a random failure even if it doesn't 100% fix one.
Automatically closed - issue fixed for 2 weeks with no activity.