- Issue created by @catch
- Merge request !9054When tests are run in parallel, sort all the not 'slow' tests by the number of test methods. → (Open) created by catch
- Status changed to Needs work
6 months ago 11:13am 4 August 2024 - 🇬🇧United Kingdom catch
I think I've figured out why we needed to double the number of test jobs for functional javascript jobs after the selenium driver change.
We set the maximum number of browser sessions to 5 in pipeline.yml:
SE_NODE_MAX_SESSIONS: "5"
But concurrency is 15:
CONCURRENCY: 15
Let's say we have Test 1 with 15 methods.
The first five tests start running, they all install Drupal, then claim a browser session.
Then test six starts running, Drupal installs before a browser session is requested (and the nine tests after that), all well and good so far. Then it requests a browser session by making a curl request to selenium. Selenium holds this request and doesn't return until a session is available.
Test 1 method 1 finishes, and clears the browser session, freeing it up. It's then claimed by Test 6 method 1 which has been waiting for it.
Now Test 1 method 2 doesn't have a browser session to itself, so it gets stuck waiting for selenium.
This way, the first 15 tests are fighting each other to get hold of one of five browser sessions, and a reasonable amount of time is spent making API requests to selenium via curl which take several seconds to return - basically sleep().
This has two effects:
1. Threads are sitting there waiting for curl instead of either doing Drupal things or making browser requests.
2.Very slow tests which are already very slow (i.e. they envelope the entire test run), become even slower because there is essentially a sleep(rand(0,30)) between each test method.
With this MR, tests with lots of test methods are ordered first, and that exacerbates the problem because there's more starting and stopping and waiting for browser sessions within the concurrency limits, whereas when the tests are more mixed up it happens less often.
Setting the max sessions and max concurrency both to 5 nearly works, but it that means concurrency is very, very low
Setting max sessions and concurrency to 10 nearly works, except there are some outlying tests that can suddenly take 5-7 minutes seemingly at random.
I was able to reproduce this locally by setting max sessions to 2 then running views functional js tests via run-tests.sh with concurrency 10 + this logging:
public function start() { $time = microtime(TRUE); try { $this->wdSession = $this->webDriver->session($this->browserName, $this->desiredCapabilities); } catch (\Exception $e) { throw new DriverException('Could not open connection: '.$e->getMessage(), 0, $e); } file_put_contents('/tmp/time.txt', ceil(microtime(TRUE) - $time) . "\n", FILE_APPEND); $this->started = true; $this->applyTimeouts(); }
It starts off as '1' and then oscillates between 1-10.
The problem seems to be not the MAX_SESSIONS as such, but the fact that we fully close sessions in ::tearDown() instead of resetting them. This has to effect of closing the browser window entirely, releasing it, then opening a new one.
If instead we reset the session in ::tearDown() (and behat handles closing sessions when the PHP process ends), then one test class will 'hold' a browser for the full duration of the test class, so it's impossible for it to be released to a later-starting test.
Still need to work out what the optimum max sessions vs. concurrency is, but pretty confident this race condition/contention issue exists.
- 🇬🇧United Kingdom catch
Pretty certain there's a race condition/contention issue in HEAD functional js tests with selenium sessions now.
The default SE_SESSION_RETRY_INTERVAL is 5 seconds.
BrowserTestBase installs Drupal before asking for a session.
If there are 16 tests running concurrently, the first test takes 1 minute to install Drupal, and the next 15 tests take 10 seconds to install Drupal, then those 15 tests will request a browser session before the first one. Some of our functional js tests finish in about 7-10 seconds, so you can have a lot of tests starting and finishing within the 5 second poll delay. That means occasionally, a slow-install test starts, then spends a lot of time waiting for a session. Even if the first request for a session is correctly handled FIFO by the selenium session queue, each test method closes the session then requsts a new one, which means a test with several slow-ish methods can perpetually be put at the back of the session queue. And if there are say 15 test methods and each waits even 5 seconds for a session, that's more than a minute of wall time waiting for a session.
This explains the vastly different timings for individual tests between gitlab jobs when contention issues occur - an unlucky test class can spend a lot of time doing nothing.
The default selenium session request timeout is 300 seconds, so we hardly ever see an actual timeout, but I was able to reproduce one here: https://git.drupalcode.org/project/drupal/-/jobs/2334397
The current high parallel rate works around this - because not that many tests are executed per job, eventually selenium sessions get freed up for any test methods still waiting, but it'll still be happening with four jobs which can be seen for example here: https://git.drupalcode.org/project/drupal/-/jobs/2333840
I've tried to reduce the timeout to 10 seconds, but I'm not sure that's taking effect on the k8s stack yet - it might need the node to cycle out first to actually reset it?
- SE_NODE_MAX_SESSIONS: "5" + SE_NODE_MAX_SESSIONS: "16" + SE_SESSION_RETRY_INTERVAL: "1" + SE_SESSION_REQUEST_TIMEOUT: "10"
I started to look at requesting/starting a browser session before we install Drupal so that sessions are requested FIFO, but there's major problems with that. We start and stop the browser every test run, so there is still a window where another test can 'steal' our browser session. The only way to reserve a session for the duration of a test class would be to make the browser session static, which is major surgery to browser tests.
Also starting a browser session before installing Drupal is inefficient - we'd have a browser sitting there while we install and don't need it.
So instead I'm increasing the maximum sessions to approximately double the concurrency, this means there should always be a 'spare' browser session when we start each test method, although this doesn't seem to be 100% foolproof. Because concurrency is much lower than the maximum number of sessions, the actual number of sessions open at any one time shouldn't hit the maximum.
https://git.drupalcode.org/project/drupal/-/jobs/2337348 shows a single test method that's usually fast, taking about 6 minutes, which probably just slipped under the timeout.
And here's a successful job with parallel: 2 on functional js tests with the test run successfully finishing in under 3m30s each.
https://git.drupalcode.org/project/drupal/-/jobs/2337506
https://git.drupalcode.org/project/drupal/-/jobs/2337507 - 🇬🇧United Kingdom catch
The pipeline.yml + BrowserTestBase changes now in a standalone MR on 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review .
- Status changed to Postponed
6 months ago 9:38am 5 August 2024 - 🇬🇧United Kingdom catch
Postponed on 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review now.
https://git.drupalcode.org/project/drupal/-/merge_requests/9054 is the change to review. It adds the sorting logic, removes @group #slow from a lot of functional and functional js tests where it's no longer needed, adds it to several kernel tests which still need it.
https://git.drupalcode.org/project/drupal/-/merge_requests/9064#note_353021 includes this issue, 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review and 📌 Show test run time by class in run-tests.sh output Fixed so it's possible to see what the effect is a bit easier.
- 🇬🇧United Kingdom catch
Went on massive diversion here debugging 📌 Show test run time by class in run-tests.sh output Fixed but MR is ready to review now. Still needs the other issue to land to for it all to work though.
- Status changed to Needs work
6 months ago 4:14pm 6 August 2024 - Status changed to Needs review
6 months ago 11:22pm 6 August 2024 - 🇬🇧United Kingdom catch
Rebased the MR, this is now soft-blocked on 🐛 MigrateNoMigrateDrupalTest fails with missing classes in certain situations Fixed though, although incorporated the commit from that issue here to make sure we get a green run with it applied.
- 🇬🇧United Kingdom catch
Took the commit from 🐛 MigrateNoMigrateDrupalTest fails with missing classes in certain situations Fixed out of the MR and it passed, so while something seems to be going on with that test, it's not directly triggered by this MR. So not postponed after all and actually ready for review now.
What it does:
We already split tests into @group #slow plus everything else.For each of those arrays of tests, orders them by number of public methods descending. Number of methods corresponds to test time in 95% of cases.
Creates an array of 'bins' for each parallel test job.
Loops over the sorted tests, first the slow ones, then the rest, and adds them one by one to each bin.
This results in bins that have more or less the same number of slow tests (which we already do in HEAD), but now with the number of test methods evenly distributed between each bin (more or less) and tests order from slow to fast and more methods to less methods.
This gives us the same or better test performance as adding @group #slow to tests with lots of methods, but without all the manual management of that, which was getting impractical and has diminishing returns. Now we only need to add @group #slow to a test if it's disproportionately slow compared to tests with the same number of methods.
The MR removes @group #slow from dozens of tests because it's no longer needed, and it's necessary to do that here to show there's no regression.
Adds @group #slow to a handful of tests that are disproportionately slow but were hidden/less relevant with the previous approach.
Test timings show we still get < 6 minute test runs with this approach, which proves it works. it's not a dramatic improvement over the current situation, it's more about automating the logic without introducing a regression - but it will unblock future pipeline improvements too.
Occasionally seeing some outlier test runs that take longer, but I think that might be different runner specs or similar, the main thing is that the 'floor' of test timings stays the same.
- 🇺🇸United States smustgrave
Not sure I'm a good resoruce to review but which MR should be reviewed?
- 🇬🇧United Kingdom catch
@smustgrave the MR that's not a draft is the one to review. The draft combines this with some other changes (some of which have now been committed already) which made it easier to see the overall effect on test times.
- 🇫🇷France andypost
Is there a way to get metrics/stats from runner per class/method then it will be easier to sort tests and detect regressions
- 🇬🇧United Kingdom catch
@andypost there is this issue 📌 Add an automated way for slow tests to fail drupalci builds Needs work but because tests are split across multiple jobs, downloading the artifacts from each job, collating them, matching them to tests and allowing for new/removed tests etc. all seems pretty complicated. The approach here is very simple and ass can be seen from the MR, it works well.
- 🇬🇧United Kingdom catch
Hid the draft MR and rebased now that a couple of other pipeline improvements have landed.
- 🇬🇧United Kingdom catch
Bumping to major, this unblocks two things:
1. Sub-five minute test runs when combined with a handful of other issues.
2. Significant resource request reductions for every gitlab pipeline 📌 [PP-2] Reduce CPU requirements for core gitlab pipelines Postponed .
- First commit to issue fork.
- 🇬🇧United Kingdom catch
Had another looks at kernel tests - was able to remove @group #slow from several, and also adjusted the pipeline to be 4 * 8 CPUS instead of 3 * 16 CPUs, so an overall reduction in CPU requests from 48 to 32 for kernel tests. In my sandbox branch all finish in less than a couple of minutes: https://git.drupalcode.org/project/drupal/-/jobs/2530170 / https://git.drupalcode.org/project/drupal/-/jobs/2530171 / https://git.drupalcode.org/project/drupal/-/jobs/2530172 / https://git.drupalcode.org/project/drupal/-/jobs/2530173
This is faster than the 3 * 16 CPU jobs in HEAD https://git.drupalcode.org/project/drupal/-/jobs/2529021 / https://git.drupalcode.org/project/drupal/-/jobs/2529022 / https://git.drupalcode.org/project/drupal/-/jobs/2529023 which are between 2-3.5 minutes on the last daily run.
I think we can make the CPU request changes here because kernel tests don't rely on the container rebuild/installer changes that 📌 [PP-2] Reduce CPU requirements for core gitlab pipelines Postponed is currently postponed on.
- 🇺🇸United States smustgrave
So if I'm reading https://git.drupalcode.org/project/drupal/-/merge_requests/9054/pipeline... correctly is it adding almost 4 minutes?
- 🇬🇧United Kingdom catch
@smustgrave no the most recent pipeline when you posted your comment had failed on a github error and had to be re-run, so the wall time is artificially long. There is a lot of time outside running individual tests when pipelines run, like 'waiting for pod' etc. as well as reruns, so the only way to get somewhat reliable numbers about what's happening in terms of test runs themselves is to look at each individual job and ctrl-f for 'test run duration'.
However even with test run duration, it very much varies by what hardware the job runs on, whether (and which) other jobs are running alongside it etc. so you get occasional spikes where a single job will take twice as long, but then the next time, it goes back down again. It would be nice to diagnose and get rid of those spikes too, but this issue is more of a prerequisite to that than fixing it.
So to see whether something's an improvement, neutral, or regression, when there's a fair number of pipelines to look at (as there is with this issue because it took a while to come together), the best way to figure out the impact is to look at the shortest time for a full pipeline (if it's comparable HEAD timings) and/or check the individual job 'test run duration' across 2-3 pipelines to try to figure out a floor/average.
Did a rebase here (which include some test performance improvements committed in the past 24 hours) and got a 5m15s pipeline https://git.drupalcode.org/project/drupal/-/pipelines/262725 with most jobs coming in under 3 minutes.
On a fast run, https://git.drupalcode.org/project/drupal/-/pipelines/262730/builds shows how long each job as a whole took at a glance.
- Status changed to RTBC
5 months ago 2:05pm 23 August 2024 - 🇺🇸United States smustgrave
So based on the info shared I believe this is good. May need another check but going to mark since @catch has gotten these tests to run in a few minutes.
- 🇬🇧United Kingdom catch
Created a follow-up here that would add more complexity to the ordering logic but automate more of this 📌 Include a check for data providers when ordering by method count Active , this would help with optimizing pipelines/reducing requirements for contrib #3469828: Lower default concurrency → .
- 🇫🇷France nod_ Lille
Review the code, looks reasonable as well as the explanations in this issue. Most of the optimization issues are in, as well as the method visibility fixes.
RTBC+1 - Status changed to Downport
5 months ago 10:12am 27 August 2024 - Status changed to Fixed
4 months ago 6:59am 21 September 2024 - 🇬🇧United Kingdom catch
Cherry-picked, minus the changes to one file, to 10.4.x, thanks!
Automatically closed - issue fixed for 2 weeks with no activity.