Reduce CPU requirements for core gitlab pipelines

Created on 22 August 2024, 4 months ago
Updated 21 September 2024, 3 months ago

Problem/Motivation

Once 📌 Order tests by number of public methods to optimize gitlab job times Fixed lands, we will have tests distributed fairly evenly between test jobs on the GitLab pipeline.

I've also been experimenting in a sandbox branch with 📌 Add the ability to install multiple modules and only do a single container rebuild to ModuleInstaller Active and ✨ Use one-time login link instead of user login form in BrowserTestBase tests Fixed + various individual test performance issues combined to see what the absolute potential floor of test run time is.

With all of those applied, the best run I've managed is 4m55s which is down from the current floor of about 5m30s.

https://git.drupalcode.org/project/drupal/-/pipelines/261246

However, that time was achieved with much lower overall CPU requirements than the current pipelines. This issue is to extract those changes from the sandbox MR. It will depend on some of the other issues landing first, so that it is not a regression against the current state (at least in terms of wall time).


📌 Task
Status

Fixed

Version

10.4 ✨

Component
PHPUnit  →

Last updated about 17 hours ago

Created by

🇬🇧United Kingdom catch


Merge Requests

Comments & Activities

  • Issue created by @catch
  • 🇬🇧United Kingdom catch
  • Merge request !9302 Lower CPU requests for pipeline jobs. → (Open) created by catch
  • Status changed to Postponed 4 months ago
  • 🇬🇧United Kingdom catch

    What this does:

    1. Lowers the CPU request from 24 to 16 for most jobs. The theory behind this is that the total CPUs per machine is more likely to be a multiple of 16 than of 24, so theoretically we can fit more jobs on a lower number of machines (or on 16 CPU machines if such a machine exists). I don't fully (or even much) understand the relationship between CPU requests, kubernetes and AWS instances, so this might be flawed, but in general lower and simpler numbers also seem better.

    2. Lowers the concurrency of a couple of jobs quite a lot, especially functional tests where I am pretty sure the concurrency in HEAD is leading to CPU contention and hence slower rather than faster test runs. This is made possible by 📌 Order tests by number of public methods to optimize gitlab job times Fixed which removes @group #slow from the vast majority of tests, relying on a better ordering algorithm instead.

    3. Increases the parallelism for functional js and functional tests by 1 each. This is because in theory most test runs (in the sandbox branch with various changes applied) can finish within about 2m30s, but we still have a lot of individual tests over 2 minutes each. With lower concurrency, those long running jobs are spread out enough we don't run two slow tests end to end. I'm pretty sure there is potential to bring this lower by continuing to optimise some of these slower tests, but it also gives us a bit of headroom when we add new coverage.
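
The packing theory in point 1 can be illustrated with a minimal sketch. The machine sizes below are hypothetical (the thread does not say which instance types the runners use), and the helper `jobs_per_machine` is made up for illustration. It also ignores the CPU that kubernetes reserves on each node, so real numbers would be somewhat lower.

```python
def jobs_per_machine(machine_cpus, cpu_request):
    """Return (jobs that fit, CPUs left idle) for a single machine."""
    fit = machine_cpus // cpu_request
    return fit, machine_cpus - fit * cpu_request


# Hypothetical machine sizes, not taken from the issue.
for machine_cpus in (16, 32, 64, 96):
    for cpu_request in (24, 16):
        fit, idle = jobs_per_machine(machine_cpus, cpu_request)
        print(f"{machine_cpus} CPUs, request {cpu_request}: "
              f"{fit} job(s), {idle} CPU(s) idle")
```

Under these assumptions a request of 16 leaves no CPUs idle on any of the listed sizes, while a request of 24 only packs cleanly on the 96 CPU machine.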

    If we look at the jobs, we can see that the overall CPU requirement is reduced dramatically:

    Functional JS:
    Before: 2 * 24 = 48
    After: 3 * 16 = 48

    Functional:
    Before: 7 * 24 = 168
    After: 8 * 16 = 128

    W3 legacy:
    Before: 1 * 24 = 24
    After: 1 * 16 = 16

    So an overall reduction of 48 CPUs, with potential scope to reduce further.
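
As a quick check of the arithmetic above, here is a small sketch that recomputes the totals from the job counts and CPU requests listed in this comment. The dictionary layout and the helper name `total_cpus` are illustrative only and do not come from the pipeline configuration.

```python
# (parallel jobs, CPU request per job), before and after the change,
# using the figures quoted in the comment.
JOBS = {
    "Functional JS": {"before": (2, 24), "after": (3, 16)},
    "Functional": {"before": (7, 24), "after": (8, 16)},
    "W3 legacy": {"before": (1, 24), "after": (1, 16)},
}


def total_cpus(state):
    """Sum jobs * CPU request over all job groups for 'before' or 'after'."""
    return sum(count * cpus
               for count, cpus in (job[state] for job in JOBS.values()))


before = total_cpus("before")
after = total_cpus("after")
print(f"before={before} after={after} saving={before - after}")
```

The Functional JS total is unchanged, so the saving comes from the Functional jobs (40) and the W3 legacy job (8).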

  • 🇬🇧United Kingdom catch

    Found an extra 16 CPU requests to drop on 📌 Order tests by number of public methods to optimize gitlab job times Fixed which brings the total to 64 here.

  • Status changed to Needs review 3 months ago
  • 🇬🇧United Kingdom catch

    Just rebased and the full pipeline took six minutes: https://git.drupalcode.org/project/drupal/-/pipelines/287574

    Our best runtime is currently around 5m30s. Given the amount of variation between runs, six minutes seems in range considering the overall CPU saving here. Moving to needs review.

  • Status changed to RTBC 3 months ago
  • 🇪🇸Spain fjgarlin

    The changes look good and so does your maths in #4. Pipelines are also happy. RTBC.

  • Status changed to Downport 3 months ago
  • 🇬🇧United Kingdom longwave UK

    Committed 8877452 and pushed to 11.x. Thanks!

    Patch doesn't apply to 11.0.x and below, we don't run as many tests there, but this feels like a good candidate for backport if it reduces costs for the DA?

  • 🇬🇧United Kingdom catch

    I'm not sure it's worth backporting to 11.0.x but it probably is worth backporting to 10.4.x since that will then carry forward to the next 10.x branches which will have daily test runs for another couple of years.

    10.4.x does not have all of the test performance improvements in 11.x, but I'm sure that tests would still finish in 7-8 minutes or less with these changes, and we don't run that many MR pipelines against 10.4.x (compared to on-commit/scheduled runs).

    Also if my uninformed kubernetes theories are correct, it might help recycling/re-use of test runners since they'll be more consistent sizes between the branches?

    So moving back there. If there's a problem with 10.4.x and the changes here, we'll find out from the backport pipeline hopefully.

  • 🇬🇧United Kingdom catch

    catch → changed the visibility of the branch 3469687-pp-2-reduce-cpu to hidden.

  • Merge request !9561 10.4.x: reduce CPU requirements for gitlab jobs → (Closed) created by catch
  • Pipeline finished with Success
    3 months ago
    Total: 378s
    #288832
  • Status changed to Fixed 3 months ago
  • 🇬🇧United Kingdom catch

    Backport pipeline finished in 6 minutes and 12 seconds. https://git.drupalcode.org/project/drupal/-/pipelines/288832

    Since the backport itself was trivial, going to go ahead and commit here.

    • catch → committed 6dc60ef8 on 10.4.x
      Issue #3469687 by catch, fjgarlin, longwave: Reduce CPU requirements for...
  • Automatically closed - issue fixed for 2 weeks with no activity.
