Order tests by number of public methods to optimize gitlab job times

Created on 3 August 2024, 6 months ago
Updated 21 September 2024, 4 months ago

Problem/Motivation

run-tests.sh supports @group #slow which ensures that tests in that group start at the beginning of a run.

This works extremely well for very long running tests - i.e. if a test takes 3 minutes, and the total time to run all tests with concurrency is 3 minutes, it means the test can start at the very beginning, and finish alongside the rest. If it started after two and a half minutes, the run would take five and a half minutes instead.

However, it struggles a different situation - where you have a test that takes say 1 minute, and start 10s what would otherwise be the end of the test run. This can still extend the total time of a test run by 50 seconds (which can be 20% of a job), but if we add @group #slow to all the tests over say 30s, it undermines using it for the really long tests, because concurrency won't handle running all of the slow tests at the beginning of the job any more - some would have to start after other slow tests have finished.

There is an issue to use phpunit logs to order all of the tests by slowest to fastest, then distribute them between the test runs appropriately - this would theoretically lead to perfectly balanced job loads each time. However having worked on the phpstan/cspell artifact caching issue, collecting, parsing and collating the artifacts for parallel jobs would be potentially very complex to implement.

A simpler way to approximate the test run time is to use the number of public methods on a test - in general, a test with 20 public methods will take much longer than a test with 1 method. If a test with 1 method takes longer (because it installs Umami, or uses a data provider with 150 test cases), we can still use @group #slow.

There are two ways we can do this:

1. Between test jobs, sort all the tests by number of methods descending, then distribute them one by one between the jobs, this way the jobs get approximately the same number of test methods in total, which then further approximates to total test run time - meaning that jobs of the same type should finish close to the same time.

2. Within a test job, first run the slow tests (ordered by number of methods descending, since that will generally result in slower tests starting first), then run the rest of the tests, also ordered by methods descending, so that the shortest, fastest tests start last.

By doing both of these, we're able to remove @group #slow from dozens and dozens of tests, and the test jobs in general finish around the same time, with the slower runs finishing earlier than they do in HEAD.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

Introduced terminology

API changes

Data model changes

Release notes snippet

📌 Task
Status

Fixed

Version

10.4

Component
PHPUnit 

Last updated about 9 hours ago

Created by

🇬🇧United Kingdom catch

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @catch
  • Pipeline finished with Failed
    6 months ago
    Total: 4036s
    #242421
  • Merge request !9064Draft: Resolve #3465602 "With times" → (Open) created by catch
  • Pipeline finished with Success
    6 months ago
    Total: 562s
    #243411
  • Pipeline finished with Success
    6 months ago
    Total: 551s
    #243419
  • Status changed to Needs work 6 months ago
  • 🇬🇧United Kingdom catch
  • 🇬🇧United Kingdom catch

    I think I've figured out why we needed to double the number of test jobs for functional javascript jobs after the selenium driver change.

    We set the maximum number of browser sessions to 5 in pipeline.yml:

    SE_NODE_MAX_SESSIONS: "5"
    

    But concurrency is 15:

    CONCURRENCY: 15
    

    Let's say we have Test 1 with 15 methods.

    The first five tests start running, they all install Drupal, then claim a browser session.

    Then test six starts running, Drupal installs before a browser session is requested (and the nine tests after that), all well and good so far. Then it requests a browser session by making a curl request to selenium. Selenium holds this request and doesn't return until a session is available.

    Test 1 method 1 finishes, and clears the browser session, freeing it up. It's then claimed by Test 6 method 1 which has been waiting for it.

    Now Test 1 method 2 doesn't have a browser session to itself, so it gets stuck waiting for selenium.

    This way, the first 15 tests are fighting each other to get hold of one of five browser sessions, and a reasonable amount of time is spent making API requests to selenium via curl which take several seconds to return - basically sleep().

    This has two effects:

    1. Threads are sitting there waiting for curl instead of either doing Drupal things or making browser requests.

    2.Very slow tests which are already very slow (i.e. they envelope the entire test run), become even slower because there is essentially a sleep(rand(0,30)) between each test method.

    With this MR, tests with lots of test methods are ordered first, and that exacerbates the problem because there's more starting and stopping and waiting for browser sessions within the concurrency limits, whereas when the tests are more mixed up it happens less often.

    Setting the max sessions and max concurrency both to 5 nearly works, but it that means concurrency is very, very low

    Setting max sessions and concurrency to 10 nearly works, except there are some outlying tests that can suddenly take 5-7 minutes seemingly at random.

    I was able to reproduce this locally by setting max sessions to 2 then running views functional js tests via run-tests.sh with concurrency 10 + this logging:

        public function start()
        {
            $time = microtime(TRUE);
            try {
                $this->wdSession = $this->webDriver->session($this->browserName, $this->desiredCapabilities);
            } catch (\Exception $e) {
                throw new DriverException('Could not open connection: '.$e->getMessage(), 0, $e);
            }
            file_put_contents('/tmp/time.txt', ceil(microtime(TRUE) - $time) . "\n", FILE_APPEND);
    
            $this->started = true;
    
            $this->applyTimeouts();
        }
    
    

    It starts off as '1' and then oscillates between 1-10.

    The problem seems to be not the MAX_SESSIONS as such, but the fact that we fully close sessions in ::tearDown() instead of resetting them. This has to effect of closing the browser window entirely, releasing it, then opening a new one.

    If instead we reset the session in ::tearDown() (and behat handles closing sessions when the PHP process ends), then one test class will 'hold' a browser for the full duration of the test class, so it's impossible for it to be released to a later-starting test.

    Still need to work out what the optimum max sessions vs. concurrency is, but pretty confident this race condition/contention issue exists.

  • 🇬🇧United Kingdom catch

    Pretty certain there's a race condition/contention issue in HEAD functional js tests with selenium sessions now.

    The default SE_SESSION_RETRY_INTERVAL is 5 seconds.

    BrowserTestBase installs Drupal before asking for a session.

    If there are 16 tests running concurrently, the first test takes 1 minute to install Drupal, and the next 15 tests take 10 seconds to install Drupal, then those 15 tests will request a browser session before the first one. Some of our functional js tests finish in about 7-10 seconds, so you can have a lot of tests starting and finishing within the 5 second poll delay. That means occasionally, a slow-install test starts, then spends a lot of time waiting for a session. Even if the first request for a session is correctly handled FIFO by the selenium session queue, each test method closes the session then requsts a new one, which means a test with several slow-ish methods can perpetually be put at the back of the session queue. And if there are say 15 test methods and each waits even 5 seconds for a session, that's more than a minute of wall time waiting for a session.

    This explains the vastly different timings for individual tests between gitlab jobs when contention issues occur - an unlucky test class can spend a lot of time doing nothing.

    The default selenium session request timeout is 300 seconds, so we hardly ever see an actual timeout, but I was able to reproduce one here: https://git.drupalcode.org/project/drupal/-/jobs/2334397

    The current high parallel rate works around this - because not that many tests are executed per job, eventually selenium sessions get freed up for any test methods still waiting, but it'll still be happening with four jobs which can be seen for example here: https://git.drupalcode.org/project/drupal/-/jobs/2333840

    I've tried to reduce the timeout to 10 seconds, but I'm not sure that's taking effect on the k8s stack yet - it might need the node to cycle out first to actually reset it?

    -    SE_NODE_MAX_SESSIONS: "5"
    +    SE_NODE_MAX_SESSIONS: "16"
    +    SE_SESSION_RETRY_INTERVAL: "1"
    +    SE_SESSION_REQUEST_TIMEOUT: "10"
    

    I started to look at requesting/starting a browser session before we install Drupal so that sessions are requested FIFO, but there's major problems with that. We start and stop the browser every test run, so there is still a window where another test can 'steal' our browser session. The only way to reserve a session for the duration of a test class would be to make the browser session static, which is major surgery to browser tests.

    Also starting a browser session before installing Drupal is inefficient - we'd have a browser sitting there while we install and don't need it.

    So instead I'm increasing the maximum sessions to approximately double the concurrency, this means there should always be a 'spare' browser session when we start each test method, although this doesn't seem to be 100% foolproof. Because concurrency is much lower than the maximum number of sessions, the actual number of sessions open at any one time shouldn't hit the maximum.

    https://git.drupalcode.org/project/drupal/-/jobs/2337348 shows a single test method that's usually fast, taking about 6 minutes, which probably just slipped under the timeout.

    And here's a successful job with parallel: 2 on functional js tests with the test run successfully finishing in under 3m30s each.

    https://git.drupalcode.org/project/drupal/-/jobs/2337506
    https://git.drupalcode.org/project/drupal/-/jobs/2337507

  • 🇬🇧United Kingdom catch

    The pipeline.yml + BrowserTestBase changes now in a standalone MR on 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review .

  • Status changed to Postponed 6 months ago
  • 🇬🇧United Kingdom catch

    Postponed on 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review now.

    https://git.drupalcode.org/project/drupal/-/merge_requests/9054 is the change to review. It adds the sorting logic, removes @group #slow from a lot of functional and functional js tests where it's no longer needed, adds it to several kernel tests which still need it.

    https://git.drupalcode.org/project/drupal/-/merge_requests/9064#note_353021 includes this issue, 📌 Fix selenium performance/stampede issues in gitlab config and BrowserTestBase Needs review and 📌 Show test run time by class in run-tests.sh output Fixed so it's possible to see what the effect is a bit easier.

  • Pipeline finished with Success
    6 months ago
    Total: 554s
    #244268
  • Pipeline finished with Success
    6 months ago
    Total: 455s
    #244398
  • Pipeline finished with Success
    6 months ago
    Total: 578s
    #244407
  • Pipeline finished with Success
    6 months ago
    Total: 602s
    #245110
  • Pipeline finished with Failed
    6 months ago
    Total: 361s
    #245288
  • Pipeline finished with Failed
    6 months ago
    Total: 482s
    #245299
  • Pipeline finished with Failed
    6 months ago
    Total: 347s
    #245307
  • Pipeline finished with Failed
    6 months ago
    #245318
  • Pipeline finished with Success
    6 months ago
    Total: 466s
    #245337
  • Pipeline finished with Success
    6 months ago
    #245357
  • 🇬🇧United Kingdom catch

    Went on massive diversion here debugging 📌 Show test run time by class in run-tests.sh output Fixed but MR is ready to review now. Still needs the other issue to land to for it all to work though.

  • Pipeline finished with Success
    6 months ago
    Total: 1001s
    #245739
  • Status changed to Needs work 6 months ago
  • 🇫🇷France andypost

    blocker commited

  • Status changed to Needs review 6 months ago
  • 🇬🇧United Kingdom catch

    Rebased the MR, this is now soft-blocked on 🐛 MigrateNoMigrateDrupalTest fails with missing classes in certain situations Fixed though, although incorporated the commit from that issue here to make sure we get a green run with it applied.

  • Pipeline finished with Success
    6 months ago
    Total: 388s
    #246231
  • Pipeline finished with Success
    6 months ago
    #246249
  • Pipeline finished with Failed
    6 months ago
    Total: 462s
    #246272
  • Pipeline finished with Success
    6 months ago
    Total: 447s
    #246289
  • 🇬🇧United Kingdom catch

    Took the commit from 🐛 MigrateNoMigrateDrupalTest fails with missing classes in certain situations Fixed out of the MR and it passed, so while something seems to be going on with that test, it's not directly triggered by this MR. So not postponed after all and actually ready for review now.

    What it does:
    We already split tests into @group #slow plus everything else.

    For each of those arrays of tests, orders them by number of public methods descending. Number of methods corresponds to test time in 95% of cases.

    Creates an array of 'bins' for each parallel test job.

    Loops over the sorted tests, first the slow ones, then the rest, and adds them one by one to each bin.

    This results in bins that have more or less the same number of slow tests (which we already do in HEAD), but now with the number of test methods evenly distributed between each bin (more or less) and tests order from slow to fast and more methods to less methods.

    This gives us the same or better test performance as adding @group #slow to tests with lots of methods, but without all the manual management of that, which was getting impractical and has diminishing returns. Now we only need to add @group #slow to a test if it's disproportionately slow compared to tests with the same number of methods.

    The MR removes @group #slow from dozens of tests because it's no longer needed, and it's necessary to do that here to show there's no regression.

    Adds @group #slow to a handful of tests that are disproportionately slow but were hidden/less relevant with the previous approach.

    Test timings show we still get < 6 minute test runs with this approach, which proves it works. it's not a dramatic improvement over the current situation, it's more about automating the logic without introducing a regression - but it will unblock future pipeline improvements too.

    Occasionally seeing some outlier test runs that take longer, but I think that might be different runner specs or similar, the main thing is that the 'floor' of test timings stays the same.

  • Pipeline finished with Failed
    6 months ago
    #247292
  • Pipeline finished with Success
    6 months ago
    Total: 502s
    #247296
  • Pipeline finished with Failed
    6 months ago
    Total: 113187s
    #247355
  • 🇬🇧United Kingdom catch
  • Pipeline finished with Failed
    5 months ago
    Total: 221s
    #251240
  • Pipeline finished with Failed
    5 months ago
    Total: 488s
    #251245
  • Pipeline finished with Failed
    5 months ago
    #252604
  • 🇺🇸United States smustgrave

    Not sure I'm a good resoruce to review but which MR should be reviewed?

  • 🇬🇧United Kingdom catch

    @smustgrave the MR that's not a draft is the one to review. The draft combines this with some other changes (some of which have now been committed already) which made it easier to see the overall effect on test times.

  • 🇫🇷France andypost

    Is there a way to get metrics/stats from runner per class/method then it will be easier to sort tests and detect regressions

  • 🇬🇧United Kingdom catch

    @andypost there is this issue 📌 Add an automated way for slow tests to fail drupalci builds Needs work but because tests are split across multiple jobs, downloading the artifacts from each job, collating them, matching them to tests and allowing for new/removed tests etc. all seems pretty complicated. The approach here is very simple and ass can be seen from the MR, it works well.

  • 🇬🇧United Kingdom catch

    catch changed the visibility of the branch 3465602-with-times to hidden.

  • 🇬🇧United Kingdom catch

    Hid the draft MR and rebased now that a couple of other pipeline improvements have landed.

  • Pipeline finished with Success
    5 months ago
    Total: 391s
    #255441
  • Pipeline finished with Success
    5 months ago
    Total: 976s
    #255452
  • Pipeline finished with Success
    5 months ago
    Total: 476s
    #255460
  • Pipeline finished with Canceled
    5 months ago
    Total: 79s
    #258021
  • Pipeline finished with Success
    5 months ago
    Total: 280s
    #258022
  • Pipeline finished with Canceled
    5 months ago
    Total: 367s
    #258033
  • Pipeline finished with Success
    5 months ago
    Total: 406s
    #258038
  • Pipeline finished with Failed
    5 months ago
    #258042
  • Pipeline finished with Success
    5 months ago
    Total: 339s
    #261351
  • 🇬🇧United Kingdom catch

    Bumping to major, this unblocks two things:

    1. Sub-five minute test runs when combined with a handful of other issues.

    2. Significant resource request reductions for every gitlab pipeline 📌 [PP-2] Reduce CPU requirements for core gitlab pipelines Postponed .

  • First commit to issue fork.
  • 🇬🇧United Kingdom catch

    Had another looks at kernel tests - was able to remove @group #slow from several, and also adjusted the pipeline to be 4 * 8 CPUS instead of 3 * 16 CPUs, so an overall reduction in CPU requests from 48 to 32 for kernel tests. In my sandbox branch all finish in less than a couple of minutes: https://git.drupalcode.org/project/drupal/-/jobs/2530170 / https://git.drupalcode.org/project/drupal/-/jobs/2530171 / https://git.drupalcode.org/project/drupal/-/jobs/2530172 / https://git.drupalcode.org/project/drupal/-/jobs/2530173

    This is faster than the 3 * 16 CPU jobs in HEAD https://git.drupalcode.org/project/drupal/-/jobs/2529021 / https://git.drupalcode.org/project/drupal/-/jobs/2529022 / https://git.drupalcode.org/project/drupal/-/jobs/2529023 which are between 2-3.5 minutes on the last daily run.

    I think we can make the CPU request changes here because kernel tests don't rely on the container rebuild/installer changes that 📌 [PP-2] Reduce CPU requirements for core gitlab pipelines Postponed is currently postponed on.

  • Pipeline finished with Success
    5 months ago
    #262438
  • 🇺🇸United States smustgrave

    So if I'm reading https://git.drupalcode.org/project/drupal/-/merge_requests/9054/pipeline... correctly is it adding almost 4 minutes?

  • Pipeline finished with Failed
    5 months ago
    Total: 416s
    #262716
  • Pipeline finished with Success
    5 months ago
    Total: 321s
    #262725
  • 🇬🇧United Kingdom catch

    @smustgrave no the most recent pipeline when you posted your comment had failed on a github error and had to be re-run, so the wall time is artificially long. There is a lot of time outside running individual tests when pipelines run, like 'waiting for pod' etc. as well as reruns, so the only way to get somewhat reliable numbers about what's happening in terms of test runs themselves is to look at each individual job and ctrl-f for 'test run duration'.

    However even with test run duration, it very much varies by what hardware the job runs on, whether (and which) other jobs are running alongside it etc. so you get occasional spikes where a single job will take twice as long, but then the next time, it goes back down again. It would be nice to diagnose and get rid of those spikes too, but this issue is more of a prerequisite to that than fixing it.

    So to see whether something's an improvement, neutral, or regression, when there's a fair number of pipelines to look at (as there is with this issue because it took a while to come together), the best way to figure out the impact is to look at the shortest time for a full pipeline (if it's comparable HEAD timings) and/or check the individual job 'test run duration' across 2-3 pipelines to try to figure out a floor/average.

    Did a rebase here (which include some test performance improvements committed in the past 24 hours) and got a 5m15s pipeline https://git.drupalcode.org/project/drupal/-/pipelines/262725 with most jobs coming in under 3 minutes.

    On a fast run, https://git.drupalcode.org/project/drupal/-/pipelines/262730/builds shows how long each job as a whole took at a glance.

  • Status changed to RTBC 5 months ago
  • 🇺🇸United States smustgrave

    So based on the info shared I believe this is good. May need another check but going to mark since @catch has gotten these tests to run in a few minutes.

  • 🇬🇧United Kingdom catch

    Created a follow-up here that would add more complexity to the ordering logic but automate more of this 📌 Include a check for data providers when ordering by method count Active , this would help with optimizing pipelines/reducing requirements for contrib #3469828: Lower default concurrency .

  • 🇫🇷France nod_ Lille

    Review the code, looks reasonable as well as the explanations in this issue. Most of the optimization issues are in, as well as the method visibility fixes.
    RTBC+1

  • 🇫🇷France nod_ Lille

    let's try it on 11.x

    Committed 276df7e and pushed to 11.x. Thanks!
    I'll let you see how far you'll want to backport. It can be cherry picked to 11.0.x. But 10.4.x (and below I assume) needs a dedicated MR

    • nod_ committed 276df7ec on 11.x
      Issue #3465602 by catch, quietone, smustgrave: Order tests by number of...
  • Status changed to Downport 5 months ago
  • 🇫🇷France nod_ Lille
  • Status changed to Fixed 4 months ago
  • 🇬🇧United Kingdom catch

    Cherry-picked, minus the changes to one file, to 10.4.x, thanks!

    • catch committed 34b2b4ed on 10.4.x
      Issue #3465602 by catch, quietone, smustgrave, nod_: Order tests by...
  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024