- Issue created by @catch
- Status changed to Postponed
4 months ago 12:24pm 22 August 2024 - 🇬🇧United Kingdom catch
What this does:
1. Lowers the CPU request from 24 to 16 for most jobs. The theory is that the total CPUs per machine is more likely to be a multiple of 16 than of 24, so we can fit more jobs on fewer machines (or on 16 CPU machines, if such a machine exists). I don't fully (or even much) understand the relationship between CPU requests, Kubernetes and AWS instances, so this might be flawed, but in general lower and simpler numbers seem better.
2. Lowers the concurrency of a couple of jobs quite a lot, especially functional tests where I am pretty sure the concurrency in HEAD is leading to CPU contention and hence slower rather than faster test runs. This is made possible by 📌 Order tests by number of public methods to optimize gitlab job times Fixed which removes @group #slow from the vast majority of tests, relying on a better ordering algorithm instead.
3. Increases the parallelism for functional js and functional tests by 1 each. This is because in theory most test runs (in the sandbox branch with various changes applied) can finish within about 2m30s, but we still have a lot of individual tests over 2 minutes each. With lower concurrency, those long-running jobs are spread out enough that we don't run two slow tests back to back. I'm pretty sure there is potential to bring this lower by continuing to optimise some of these slower tests, but it also gives us a bit of headroom when we add new coverage.
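A rough sketch of the packing theory in point 1, in Python. The node sizes are hypothetical (the actual runner instance types are not stated in this issue), and it ignores the CPU that Kubernetes reserves on each node for system daemons, so treat it as an illustration of the "multiple of 16 vs. 24" idea rather than a model of the real cluster.

```python
# Hypothetical node sizes in CPUs; the real instance types are an assumption.
NODE_CPUS = [16, 32, 48, 64, 96]


def packing(node_cpus, request):
    """Return (jobs that fit on one node, CPUs left unrequested)."""
    return divmod(node_cpus, request)


for cpus in NODE_CPUS:
    for request in (24, 16):
        jobs, idle = packing(cpus, request)
        print(f"{cpus:>3} CPU node, request {request}: {jobs} job(s), {idle} CPU(s) idle")
```

Under these assumptions a request of 16 never leaves CPUs unrequested on any of the listed sizes, while a request of 24 does on the 16, 32 and 64 CPU nodes.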
If we look at the jobs, we can see that the overall CPU requirement is reduced dramatically:
Functional JS:
Before: 2 * 24 = 48
After: 3 * 16 = 48
Functional:
Before: 7 * 24 = 168
After: 8 * 16 = 128
W3 legacy:
Before: 1 * 24 = 24
After: 1 * 16 = 16
So an overall reduction of 48 CPUs, with potential scope to reduce further.
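The arithmetic above can be checked with a short Python sketch. Each entry is (parallel jobs, CPU request per job) taken from the figures in this comment; the dictionary and function names are only illustrative.

```python
# (parallel jobs, CPU request per job), before and after this change.
JOBS = {
    "Functional JS": {"before": (2, 24), "after": (3, 16)},
    "Functional": {"before": (7, 24), "after": (8, 16)},
    "W3 legacy": {"before": (1, 24), "after": (1, 16)},
}


def total_request(when):
    """Sum parallel * cpu_request over all listed jobs for 'before' or 'after'."""
    return sum(parallel * cpus for parallel, cpus in (job[when] for job in JOBS.values()))


before = total_request("before")
after = total_request("after")
print(f"before={before} after={after} saved={before - after}")
```

This gives 240 CPUs requested before and 192 after, which matches the reduction of 48.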
- 🇬🇧United Kingdom catch
Found an extra 16 CPU requests to drop on 📌 Order tests by number of public methods to optimize gitlab job times Fixed which brings the total to 64 here.
- Status changed to Needs review
3 months ago 7:35pm 19 September 2024 - 🇬🇧United Kingdom catch
Just rebased and the full pipeline took six minutes: https://git.drupalcode.org/project/drupal/-/pipelines/287574
Our best runtime is currently around 5m30s. Given the amount of variation between runs, six minutes seems in range, especially considering the overall CPU saving here. Moving to needs review.
- Status changed to RTBC
3 months ago 8:31am 20 September 2024 - 🇪🇸Spain fjgarlin
The changes look good and so does your maths in #4. Pipelines are also happy. RTBC.
-
longwave →
committed 88774524 on 11.x
Issue #3469687 by catch, fjgarlin: Reduce CPU requirements for core...
- Status changed to Downport
3 months ago 10:35am 20 September 2024 - 🇬🇧United Kingdom catch
I'm not sure it's worth backporting to 11.0.x but it probably is worth backporting to 10.4.x since that will then carry forward to the next 10.x branches which will have daily test runs for another couple of years.
10.4.x does not have all of the test performance improvements in 11.x, but I'm sure that tests would still finish in 7-8 minutes or less with these changes, and we don't run that many MR pipelines against 10.4.x (compared to on-commit/scheduled runs).
Also, if my uninformed Kubernetes theories are correct, it might help recycling/re-use of test runners, since they'll be more consistent sizes between the branches?
So moving back there. If there's a problem with 10.4.x and the changes here, we'll find out from the backport pipeline hopefully.
- Status changed to Fixed
3 months ago 7:27am 21 September 2024 - 🇬🇧United Kingdom catch
Backport pipeline finished in 6 minutes and 12 seconds. https://git.drupalcode.org/project/drupal/-/pipelines/288832
Since the backport itself was trivial, going to go ahead and commit here.
Automatically closed - issue fixed for 2 weeks with no activity.