Order tests by number of public methods to optimize gitlab job times

Created on 3 August 2024, 5 months ago
Updated 21 September 2024, 3 months ago

Problem/Motivation

run-tests.sh supports @group #slow which ensures that tests in that group start at the beginning of a run.
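
For reference, the annotation looks like this on a test class (the class and group names here are purely illustrative):

    /**
     * Tests something disproportionately slow, e.g. a full profile install.
     *
     * @group example
     * @group #slow
     */
    class ExampleLongRunningTest extends BrowserTestBase {
    }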

This works extremely well for very long-running tests. For example, if a test takes 3 minutes and the total time to run all tests with concurrency is 3 minutes, the test can start at the very beginning and finish alongside the rest. If it instead started after two and a half minutes, the run would take five and a half minutes.

However, it struggles in a different situation: a test that takes, say, 1 minute but starts 10 seconds before what would otherwise be the end of the test run. This can still extend the total time of a run by 50 seconds (which can be 20% of a job), but if we add @group #slow to every test over, say, 30 seconds, it undermines using it for the really long tests, because the concurrency limit can no longer start all of the slow tests at the beginning of the job - some would have to wait until other slow tests have finished.

There is an existing issue to use PHPUnit logs to order all of the tests from slowest to fastest and then distribute them between the test runs appropriately - in theory this would lead to perfectly balanced job loads every time. However, having worked on the phpstan/cspell artifact caching issue, I expect that collecting, parsing, and collating the artifacts across parallel jobs would be very complex to implement.

A simpler way to approximate the test run time is to use the number of public methods on a test class - in general, a test with 20 public methods will take much longer than a test with 1 method. If a one-method test is nonetheless slow (because it installs Umami, or uses a data provider with 150 test cases), we can still mark it @group #slow.
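
As a rough sketch of the heuristic (illustrative only - countTestMethods() is a hypothetical helper, not what run-tests.sh actually ships):

    // Approximate a test class's runtime by counting its public test methods.
    function countTestMethods(string $class): int {
      $count = 0;
      $methods = (new \ReflectionClass($class))->getMethods(\ReflectionMethod::IS_PUBLIC);
      foreach ($methods as $method) {
        if (str_starts_with($method->getName(), 'test')) {
          $count++;
        }
      }
      return $count;
    }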

There are two ways we can do this:

1. Between test jobs: sort all the tests by number of methods descending, then distribute them one by one between the jobs. This way each job gets approximately the same number of test methods in total, which in turn approximates total test run time - meaning that jobs of the same type should finish at close to the same time.

2. Within a test job, first run the slow tests (ordered by number of methods descending, since that will generally result in slower tests starting first), then run the rest of the tests, also ordered by methods descending, so that the shortest, fastest tests start last.
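
A minimal sketch of both steps combined (hypothetical helper and parameter names, not the actual run-tests.sh implementation - it takes arrays of class => public method count, slow group first):

    // Deal tests into one bin per parallel job so each bin gets roughly the
    // same total number of test methods.
    function distributeTests(array $slowCounts, array $otherCounts, int $jobs): array {
      $bins = array_fill(0, $jobs, []);
      $i = 0;
      foreach ([$slowCounts, $otherCounts] as $counts) {
        // Most methods first; round-robin preserves this order within each bin.
        arsort($counts);
        foreach (array_keys($counts) as $class) {
          $bins[$i++ % $jobs][] = $class;
        }
      }
      // Each bin now lists slow tests first, then the rest, with each group
      // ordered from most methods to fewest.
      return $bins;
    }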

By doing both of these, we're able to remove @group #slow from dozens and dozens of tests, and the test jobs in general finish around the same time, with the slower runs finishing earlier than they do in HEAD.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

Introduced terminology

API changes

Data model changes

Release notes snippet

📌 Task
Status

Fixed

Version

10.4

Component
PHPUnit 

Last updated about 16 hours ago

Created by

🇬🇧United Kingdom catch


Merge Requests

Comments & Activities

  • Issue created by @catch
  • Pipeline finished with Failed
    5 months ago
    Total: 4036s
    #242421
  • Merge request !9064 Draft: Resolve #3465602 "With times" → (Open) created by catch
  • Pipeline finished with Success
    5 months ago
    Total: 562s
    #243411
  • Pipeline finished with Success
    5 months ago
    Total: 551s
    #243419
  • Status changed to Needs work 5 months ago
  • 🇬🇧United Kingdom catch
  • 🇬🇧United Kingdom catch

    I think I've figured out why we needed to double the number of test jobs for functional javascript jobs after the selenium driver change.

    We set the maximum number of browser sessions to 5 in pipeline.yml:

    SE_NODE_MAX_SESSIONS: "5"
    

    But concurrency is 15:

    CONCURRENCY: 15
    

    Let's say we have Test 1 with 15 methods.

    The first five tests start running; they all install Drupal, then claim a browser session.

    Then test six starts running (and likewise the nine tests after it); Drupal installs before a browser session is requested, all well and good so far. Then it requests a browser session by making a curl request to selenium. Selenium holds this request and doesn't return until a session is available.

    Test 1 method 1 finishes, and clears the browser session, freeing it up. It's then claimed by Test 6 method 1 which has been waiting for it.

    Now Test 1 method 2 doesn't have a browser session to itself, so it gets stuck waiting for selenium.

    This way, the first 15 tests are fighting each other to get hold of one of five browser sessions, and a reasonable amount of time is spent making API requests to selenium via curl which take several seconds to return - basically sleep().

    This has two effects:

    1. Threads are sitting there waiting for curl instead of either doing Drupal things or making browser requests.

    2. Tests which are already very slow (i.e. they envelop the entire test run) become even slower, because there is essentially a sleep(rand(0, 30)) between each test method.

    With this MR, tests with lots of test methods are ordered first, and that exacerbates the problem because there's more starting and stopping and waiting for browser sessions within the concurrency limits, whereas when the tests are more mixed up it happens less often.

    Setting max sessions and concurrency both to 5 nearly works, but that means concurrency is very, very low.

    Setting max sessions and concurrency to 10 nearly works, except there are some outlying tests that can suddenly take 5-7 minutes seemingly at random.

    I was able to reproduce this locally by setting max sessions to 2 then running views functional js tests via run-tests.sh with concurrency 10 + this logging:

    public function start()
    {
        // Time how long it takes to acquire a Selenium browser session.
        $time = microtime(TRUE);
        try {
            $this->wdSession = $this->webDriver->session($this->browserName, $this->desiredCapabilities);
        } catch (\Exception $e) {
            throw new DriverException('Could not open connection: '.$e->getMessage(), 0, $e);
        }
        // Log the seconds (rounded up) spent waiting for a session.
        file_put_contents('/tmp/time.txt', ceil(microtime(TRUE) - $time) . "\n", FILE_APPEND);

        $this->started = true;

        $this->applyTimeouts();
    }

    The logged wait time starts off at 1 second and then oscillates between 1 and 10 seconds.
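
    For reference, the local reproduction was along these lines (illustrative invocation - the URL, database and exact flags will vary by environment):

        php core/scripts/run-tests.sh --sqlite /tmp/test.sqlite \
          --url http://localhost --concurrency 10 \
          --types PHPUnit-FunctionalJavascript --module views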

    The problem seems to be not the MAX_SESSIONS as such, but the fact that we fully close sessions in ::tearDown() instead of resetting them. This has the effect of closing the browser window entirely, releasing the session, then opening a new one.

    If instead we reset the session in ::tearDown() (Behat handles closing sessions when the PHP process ends), then one test class will 'hold' a browser for its full duration, so it's impossible for the session to be released to a later-starting test.
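
    A minimal sketch of the reset-instead-of-stop idea using Mink's session API (an assumption of how it could look, not the exact change - that lives in the standalone MR mentioned later in the thread):

        protected function tearDown(): void {
          // Reset sessions (clearing cookies, history and window state) rather
          // than stopping them, so this test class keeps hold of its browser
          // session instead of releasing it to a competing test. Behat/Mink
          // stops any remaining sessions when the PHP process ends.
          $this->mink->resetSessions();
          parent::tearDown();
        }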

    Still need to work out what the optimum max sessions vs. concurrency is, but pretty confident this race condition/contention issue exists.

  • 🇬🇧United Kingdom catch

    Pretty certain there's a race condition/contention issue in HEAD functional js tests with selenium sessions now.

    The default SE_SESSION_RETRY_INTERVAL is 5 seconds.

    BrowserTestBase installs Drupal before asking for a session.

    If there are 16 tests running concurrently, the first test takes 1 minute to install Drupal, and the next 15 tests take 10 seconds to install Drupal, then those 15 tests will request a browser session before the first one. Some of our functional js tests finish in about 7-10 seconds, so you can have a lot of tests starting and finishing within the 5 second poll delay. That means occasionally a slow-install test starts, then spends a lot of time waiting for a session. Even if the first request for a session is correctly handled FIFO by the selenium session queue, each test method closes the session then requests a new one, which means a test with several slow-ish methods can perpetually be put at the back of the session queue. And if there are say 15 test methods and each waits even 5 seconds for a session, that's more than a minute of wall time spent waiting for sessions.

    This explains the vastly different timings for individual tests between gitlab jobs when contention issues occur - an unlucky test class can spend a lot of time doing nothing.

    The default selenium session request timeout is 300 seconds, so we hardly ever see an actual timeout, but I was able to reproduce one here: https://git.drupalcode.org/project/drupal/-/jobs/2334397

    The current high parallel rate works around this - because not that many tests are executed per job, eventually selenium sessions get freed up for any test methods still waiting, but it'll still be happening with four jobs, as can be seen for example here: https://git.drupalcode.org/project/drupal/-/jobs/2333840

    I've tried to reduce the timeout to 10 seconds, but I'm not sure that's taking effect on the k8s stack yet - it might need the node to cycle out first to actually reset it?

    -    SE_NODE_MAX_SESSIONS: "5"
    +    SE_NODE_MAX_SESSIONS: "16"
    +    SE_SESSION_RETRY_INTERVAL: "1"
    +    SE_SESSION_REQUEST_TIMEOUT: "10"
    

    I started to look at requesting/starting a browser session before we install Drupal so that sessions are requested FIFO, but there are major problems with that. We start and stop the browser every test run, so there is still a window where another test can 'steal' our browser session. The only way to reserve a session for the duration of a test class would be to make the browser session static, which is major surgery on browser tests.

    Also starting a browser session before installing Drupal is inefficient - we'd have a browser sitting there while we install and don't need it.

    So instead I'm increasing the maximum sessions to approximately double the concurrency; this means there should always be a 'spare' browser session when we start each test method, although this doesn't seem to be 100% foolproof. Because concurrency is much lower than the maximum number of sessions, the actual number of sessions open at any one time shouldn't hit the maximum.

    https://git.drupalcode.org/project/drupal/-/jobs/2337348 shows a single test method that's usually fast, taking about 6 minutes, which probably just slipped under the timeout.

    And here are successful jobs with parallel: 2 on functional js tests, with each test run finishing in under 3m30s.

    https://git.drupalcode.org/project/drupal/-/jobs/2337506
    https://git.drupalcode.org/project/drupal/-/jobs/2337507

  • 🇬🇧United Kingdom catch

    The pipeline.yml + BrowserTestBase changes are now in a standalone MR on 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review .

  • Status changed to Postponed 5 months ago
  • 🇬🇧United Kingdom catch

    Postponed on 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review now.

    https://git.drupalcode.org/project/drupal/-/merge_requests/9054 is the change to review. It adds the sorting logic, removes @group #slow from a lot of functional and functional js tests where it's no longer needed, adds it to several kernel tests which still need it.

    https://git.drupalcode.org/project/drupal/-/merge_requests/9064#note_353021 includes this issue, 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review and 📌 Show test run time by class in run-tests.sh output Fixed , so it's possible to see the overall effect more easily.

  • Pipeline finished with Success
    5 months ago
    Total: 554s
    #244268
  • Pipeline finished with Success
    5 months ago
    Total: 455s
    #244398
  • Pipeline finished with Success
    5 months ago
    Total: 578s
    #244407
  • Pipeline finished with Success
    5 months ago
    Total: 602s
    #245110
  • Pipeline finished with Failed
    5 months ago
    Total: 361s
    #245288
  • Pipeline finished with Failed
    5 months ago
    Total: 482s
    #245299
  • Pipeline finished with Failed
    5 months ago
    Total: 347s
    #245307
  • Pipeline finished with Failed
    5 months ago
    #245318
  • Pipeline finished with Success
    5 months ago
    Total: 466s
    #245337
  • Pipeline finished with Success
    5 months ago
    #245357
  • 🇬🇧United Kingdom catch

    Went on a massive diversion here debugging 📌 Show test run time by class in run-tests.sh output Fixed , but the MR is ready to review now. It still needs the other issue to land for it all to work, though.

  • Pipeline finished with Success
    5 months ago
    Total: 1001s
    #245739
  • Status changed to Needs work 5 months ago
  • 🇫🇷France andypost

    Blocker committed.

  • Status changed to Needs review 5 months ago
  • 🇬🇧United Kingdom catch

    Rebased the MR. This is now soft-blocked on 🐛 MigrateNoMigrateDrupalTest fails with missing classes in certain situations Fixed , although I've incorporated the commit from that issue here to make sure we get a green run with it applied.

  • Pipeline finished with Success
    5 months ago
    Total: 388s
    #246231
  • Pipeline finished with Success
    5 months ago
    #246249
  • Pipeline finished with Failed
    5 months ago
    Total: 462s
    #246272
  • Pipeline finished with Success
    5 months ago
    Total: 447s
    #246289
  • 🇬🇧United Kingdom catch

    Took the commit from 🐛 MigrateNoMigrateDrupalTest fails with missing classes in certain situations Fixed out of the MR and it passed, so while something seems to be going on with that test, it's not directly triggered by this MR. So not postponed after all and actually ready for review now.

    What it does:
    We already split tests into @group #slow plus everything else.

    For each of those arrays of tests, orders them by number of public methods descending. Number of methods corresponds to test time in 95% of cases.

    Creates an array of 'bins' for each parallel test job.

    Loops over the sorted tests, first the slow ones, then the rest, and adds them one by one to each bin.

    This results in bins that have more or less the same number of slow tests (which we already do in HEAD), but now with the number of test methods evenly distributed between the bins (more or less), and tests ordered from slow to fast and from more methods to fewer.

    This gives us the same or better test performance as adding @group #slow to tests with lots of methods, but without all the manual management of that, which was getting impractical and has diminishing returns. Now we only need to add @group #slow to a test if it's disproportionately slow compared to tests with the same number of methods.

    The MR removes @group #slow from dozens of tests because it's no longer needed, and it's necessary to do that here to show there's no regression.

    Adds @group #slow to a handful of tests that are disproportionately slow but were hidden/less relevant with the previous approach.

    Test timings show we still get < 6 minute test runs with this approach, which shows it works. It's not a dramatic improvement over the current situation; it's more about automating the logic without introducing a regression - but it will unblock future pipeline improvements too.

    Occasionally seeing some outlier test runs that take longer, but I think that might be down to different runner specs or similar; the main thing is that the 'floor' of test timings stays the same.

  • Pipeline finished with Failed
    5 months ago
    #247292
  • Pipeline finished with Success
    5 months ago
    Total: 502s
    #247296
  • Pipeline finished with Failed
    5 months ago
    Total: 113187s
    #247355
  • 🇬🇧United Kingdom catch
  • Pipeline finished with Failed
    4 months ago
    Total: 221s
    #251240
  • Pipeline finished with Failed
    4 months ago
    Total: 488s
    #251245
  • Pipeline finished with Failed
    4 months ago
    #252604
  • 🇺🇸United States smustgrave

    Not sure I'm a good resource to review, but which MR should be reviewed?

  • 🇬🇧United Kingdom catch

    @smustgrave the MR that's not a draft is the one to review. The draft combines this with some other changes (some of which have now been committed already) which made it easier to see the overall effect on test times.

  • 🇫🇷France andypost

    Is there a way to get metrics/stats from the runner per class/method? Then it would be easier to sort tests and detect regressions.

  • 🇬🇧United Kingdom catch

    @andypost there is this issue 📌 Add an automated way for slow tests to fail drupalci builds Needs work , but because tests are split across multiple jobs, downloading the artifacts from each job, collating them, matching them to tests, and allowing for new/removed tests etc. all seems pretty complicated. The approach here is very simple and, as can be seen from the MR, it works well.

  • 🇬🇧United Kingdom catch

    catch changed the visibility of the branch 3465602-with-times to hidden.

  • 🇬🇧United Kingdom catch

    Hid the draft MR and rebased now that a couple of other pipeline improvements have landed.

  • Pipeline finished with Success
    4 months ago
    Total: 391s
    #255441
  • Pipeline finished with Success
    4 months ago
    Total: 976s
    #255452
  • Pipeline finished with Success
    4 months ago
    Total: 476s
    #255460
  • Pipeline finished with Canceled
    4 months ago
    Total: 79s
    #258021
  • Pipeline finished with Success
    4 months ago
    Total: 280s
    #258022
  • Pipeline finished with Canceled
    4 months ago
    Total: 367s
    #258033
  • Pipeline finished with Success
    4 months ago
    Total: 406s
    #258038
  • Pipeline finished with Failed
    4 months ago
    #258042
  • Pipeline finished with Success
    4 months ago
    Total: 339s
    #261351
  • 🇬🇧United Kingdom catch

    Bumping to major, this unblocks two things:

    1. Sub-five minute test runs when combined with a handful of other issues.

    2. Significant resource request reductions for every gitlab pipeline 📌 [PP-2] Reduce CPU requirements for core gitlab pipelines Postponed .

  • First commit to issue fork.
  • 🇬🇧United Kingdom catch

    Had another look at kernel tests - was able to remove @group #slow from several, and also adjusted the pipeline to use 4 * 8 CPUs instead of 3 * 16 CPUs, an overall reduction in CPU requests from 48 to 32 for kernel tests. In my sandbox branch all jobs finish in less than a couple of minutes: https://git.drupalcode.org/project/drupal/-/jobs/2530170 / https://git.drupalcode.org/project/drupal/-/jobs/2530171 / https://git.drupalcode.org/project/drupal/-/jobs/2530172 / https://git.drupalcode.org/project/drupal/-/jobs/2530173

    This is faster than the 3 * 16 CPU jobs in HEAD https://git.drupalcode.org/project/drupal/-/jobs/2529021 / https://git.drupalcode.org/project/drupal/-/jobs/2529022 / https://git.drupalcode.org/project/drupal/-/jobs/2529023 which are between 2-3.5 minutes on the last daily run.

    I think we can make the CPU request changes here because kernel tests don't rely on the container rebuild/installer changes that 📌 [PP-2] Reduce CPU requirements for core gitlab pipelines Postponed is currently postponed on.

  • Pipeline finished with Success
    4 months ago
    #262438
  • 🇺🇸United States smustgrave

    So if I'm reading https://git.drupalcode.org/project/drupal/-/merge_requests/9054/pipeline... correctly, is it adding almost 4 minutes?

  • Pipeline finished with Failed
    4 months ago
    Total: 416s
    #262716
  • Pipeline finished with Success
    4 months ago
    Total: 321s
    #262725
  • 🇬🇧United Kingdom catch

    @smustgrave no, the most recent pipeline when you posted your comment had failed on a github error and had to be re-run, so the wall time is artificially long. There is a lot of time outside running individual tests when pipelines run ('waiting for pod' etc., as well as reruns), so the only way to get somewhat reliable numbers about the test runs themselves is to look at each individual job and ctrl-f for 'test run duration'.

    However even with test run duration, it very much varies by what hardware the job runs on, whether (and which) other jobs are running alongside it etc. so you get occasional spikes where a single job will take twice as long, but then the next time, it goes back down again. It would be nice to diagnose and get rid of those spikes too, but this issue is more of a prerequisite to that than fixing it.

    So to see whether something's an improvement, neutral, or a regression when there's a fair number of pipelines to look at (as there is with this issue because it took a while to come together), the best way to figure out the impact is to look at the shortest time for a full pipeline (where there are comparable HEAD timings) and/or check the individual job 'test run duration' across 2-3 pipelines to figure out a floor/average.

    Did a rebase here (which includes some test performance improvements committed in the past 24 hours) and got a 5m15s pipeline https://git.drupalcode.org/project/drupal/-/pipelines/262725 with most jobs coming in under 3 minutes.

    On a fast run, https://git.drupalcode.org/project/drupal/-/pipelines/262730/builds shows how long each job as a whole took at a glance.

  • Status changed to RTBC 4 months ago
  • 🇺🇸United States smustgrave

    So based on the info shared I believe this is good. May need another check but going to mark since @catch has gotten these tests to run in a few minutes.

  • 🇬🇧United Kingdom catch

    Created a follow-up that would add more complexity to the ordering logic but automate more of this: 📌 Include a check for data providers when ordering by method count Active . This would help with optimizing pipelines/reducing requirements for contrib: #3469828: Lower default concurrency .

  • 🇫🇷France nod_ Lille

    Reviewed the code; it looks reasonable, as do the explanations in this issue. Most of the optimization issues are in, as well as the method visibility fixes.
    RTBC+1

  • 🇫🇷France nod_ Lille

    let's try it on 11.x

    Committed 276df7e and pushed to 11.x. Thanks!
    I'll let you see how far you'll want to backport. It can be cherry-picked to 11.0.x, but 10.4.x (and below, I assume) needs a dedicated MR.

    • nod_ committed 276df7ec on 11.x
      Issue #3465602 by catch, quietone, smustgrave: Order tests by number of...
  • Status changed to Downport 4 months ago
  • 🇫🇷France nod_ Lille
  • Status changed to Fixed 3 months ago
  • 🇬🇧United Kingdom catch

    Cherry-picked, minus the changes to one file, to 10.4.x, thanks!

    • catch committed 34b2b4ed on 10.4.x
      Issue #3465602 by catch, quietone, smustgrave, nod_: Order tests by...
  • Automatically closed - issue fixed for 2 weeks with no activity.
