Lower default concurrency

Created on 23 August 2024, 9 months ago

Problem/Motivation

Follow-up from #3450701: Add a variable for run-tests.sh concurrency → and 📌 Optimize test order when --directory is used Needs review . And based on findings from 📌 [PP-2] Reduce CPU requirements for core gitlab pipelines Postponed .

When contrib tests are run with concurrency, they're run in 'test type alphabetical order', which means that functional and functional js tests run first, then kernel tests, then unit tests.

This means that if a module has one functional test that takes 30 seconds, and 3 kernel tests that take 10 seconds each, the kernel tests can all be run one after the other while the functional test is going.

Currently, the concurrency variable is set quite high (32), and the CPU request is set quite low (2).

If a module has exactly 32 tests, with a mixture of functional, kernel, and unit tests, this means that all of those tests would start at the same time, maximising the CPU usage at the beginning of the job, and then the kernel and unit tests will finish within 1-5 seconds each, and then the functional tests will keep running for another 20-30 seconds afterwards at effectively 1-2 concurrency.

If a module has a larger number of functional js and functional tests, then running them all together at the same time could result in them all collectively running slower and resulting in the job taking overall longer to run.

This is going to be very hard to test and verify, but if a module with a moderately large test base finishes in the same or less time after the change, then it's not doing any harm.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

📌 Task
Status

Active

Component

gitlab-ci

Created by

🇬🇧United Kingdom catch

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @catch
  • Merge request !250Resolve #3469828 "Concurrency" → (Open) created by catch
  • Pipeline finished with Failed
    9 months ago
    Total: 51s
    #262194
  • Status changed to Needs review 9 months ago
  • 🇬🇧United Kingdom catch

    The pipeline fails due to expected versions which I think might be HEAD and I'm not able to trigger a child pipeline (probably due to that, or maybe permissions?).

  • Pipeline finished with Failed
    9 months ago
    Total: 142s
    #262195
  • 🇬🇧United Kingdom catch

    First phpunit run with no real changes (just a comment) was 5 min 26 sec. https://git.drupalcode.org/issue/paragraphs-3469830/-/jobs/2527933

    Realised 10.3/10.4 doesn't have test timings in run-tests.sh, so adding opt in next major to see the longest running tests.

  • Status changed to RTBC 9 months ago
  • 🇪🇸Spain fjgarlin

    Triggered downstream pipelines https://git.drupalcode.org/issue/gitlab_templates-3469828/-/pipelines/26...

    Based on the mentioned tickets where there is plenty of investigation I think this change makes sense.

  • 🇬🇧United Kingdom jonathan1055

    I don't know if this is useful, but here is a test on Scheduler using MR250. The PHPunit job is using _concurrent=1 and there are five branches, run in a parallel matrix. The js and kernel tests are separated out already so the benefit might not be seen.
    https://git.drupalcode.org/project/scheduler/-/pipelines/264154

    Here's the same branch, but using ref 'main'
    https://git.drupalcode.org/project/scheduler/-/pipelines/264160

  • Status changed to Needs work 9 months ago
  • 🇬🇧United Kingdom catch

    hmm that is useful, it actually makes it slower.

    e.g .https://git.drupalcode.org/project/scheduler/-/jobs/2546041 vs https://git.drupalcode.org/project/scheduler/-/jobs/2546071

    There is every possibility that the target -> main job was able to use spare CPUs rather than just 2, which means the 32 concurrency might be fine in those scenarios, but ideally we want the results of this to be neutral when runs aren't CPU constrained and faster when they aren't. If it fails to take advantage of extra CPUs when they're available, it's not really an improvement.

    Currently when --directory is passed we order by test suite which results in

    functional js -> functional -> kernel -> unit. Because scheduler already divides by test suite as you point out, there is no 'running quick tests at the end' effect.

    Something like the most optimal ordering would need to expand on the logic in 📌 Order tests by number of public methods to optimize gitlab job times Fixed , so that we first order by test suite, then order by number of methods.

    However for example https://git.drupalcode.org/project/scheduler/-/blob/2.x/tests/src/Functi... uses a dataProvider which isn't taken into account by the logic there.

    We can't expect all contrib authors to add @group #slow manually to their tests, and don't want to significantly regress existing performance, so should try to automate this a bit more I think.

    I just opened 📌 Include a check for data providers when ordering by method count Active against core which would try to account for data providers and run them earlier.

    The next step after that would be implementing the test ordering by type and method logic when --directory is used.

  • Pipeline finished with Success
    21 days ago
    Total: 50s
    #487443
  • 🇬🇧United Kingdom jonathan1055

    Rebased and fixed conflicts. I presume we do still want to work on this?
    📌 Switch to using run-tests.sh by default Active is also being looked at again.

  • 🇪🇸Spain fjgarlin

    Thanks for rebasing.

    Yeah we can investigate, but we will merge only if we find it makes a real difference, which based on #6 / #7, we couldn't really prove.

  • 🇬🇧United Kingdom catch

    I think we might want to postpone this on 📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead Active which orders tests slowest-first more effectively than the current logic in run-tests.sh does, but I think it's worth trying to get done to see if we can get the best combination of pipeline wall time vs. CPU time for contrib tests.

  • 🇪🇸Spain fjgarlin

    Postponing based on the previous comment.

  • 🇬🇧United Kingdom jonathan1055

    I had already done some testing before you postponed this issue, so I might as well post the results here. I had made a change at some time since the tests in #6 and #7 above, so that running _PHPUNIT_TESTGROUPS: --all only runs one run job not the matrix, so we get a proper comparision now.

    A baseline using '--all' to run eveything in one job, without this MR
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5133960
    phpunit test run duration: 11 min 25 sec
    Overall job took 12 mins 16 sec

    Then with MR250
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5135182
    phpunit 22 min 32 sec. Job took 24 mins 32 secs
    re-run phpunit 19 mins 38 sec. Job 20 mins 32

    The outcome is similar to before, with this particular project changing concurrency from 32 to 4 significantly increases the run times.

  • 🇬🇧United Kingdom jonathan1055

    Revert unintended status change

  • 🇪🇸Spain fjgarlin

    Thanks for putting the results of the testing. They are really useful.

  • 🇬🇧United Kingdom catch

    With https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5135182 specifically there are several tests that take more than 300s each and one that is over 700s. To improve run times below 11 minutes, those tests would have to be split up or otherwise refactored to take less time per test class.

    However, it also looks like if these four tests were explicitly marked with @group #slow

    SchedulerFieldsDisplayTest
    SchedulerHooksTest
    SchedulerPermissionsTest 
    SchedulerRulesEventsTest
    

    then it might get the runtimes with lower concurrency down closer to 11 minutes. Or we might need to move the concurrency a bit higher to something like 8 too.

  • 🇬🇧United Kingdom jonathan1055

    Thanks for the info. I marked those four tests with @group #slow and it was quicker. Initial run 17 min 16 sec, and re-run 20 min 36 sec, so it does appear faster with this.
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5196407#L482

    Then I changed concurency from 4 (as in the MR) to 8 and got runtimes of 7 min 54 sec, 19 min 38 sec and 14 min 50 sec (I guess plenty of other factors are affecting the runtimes, hence why I did three)
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5196975#L482

  • 🇬🇧United Kingdom catch

    Thanks for testing, this is encouraging.

    I think the next tweak would be to mark this one with @group #slow

    SchedulerNonEnabledTypeTest
    

    Because that one finished last in the run. With eight concurrency, that still leaves three spare processes to run the other tests from the start of the job. Theoretically might get under 7m58 then, but also at 7m58 this looks like it's up to 4 minutes faster than HEAD?

    Once 📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead Active lands in 11.2, it would be great to see what a result looks like with @grouip #slow against that issue.

    And then once we've got a baseline against that issue, it would be interesting to see if reverting the @group #slow from scheduler still results in faster test runs (because the default ordering should in some cases lead to the same results).

  • 🇬🇧United Kingdom jonathan1055

    With SchedulerNonEnabledTypeTest also marked #slow, the times are 13 mins 50, 11 mins 48 and 13 mins 45
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5198370#L482

  • 🇬🇧United Kingdom catch

    Looks like SchedulerRequiredTest could also use @group #slow.

    (sorry for the back and forth, this is what it was like trying to get core test runs down too - constant whack-a-mole).

  • 🇬🇧United Kingdom jonathan1055

    sorry for the back and forth, this is what it was like trying to get core ...

    No problem at all, I'm pleased you are giving good feedback and ideas.

    With #slow added in SchedulerRequiredTest - 9 min 53 sec, 7 min 2 sec and 6 min 18 sec. Each re-run was faster, but I'm sure that is not salient.
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5198987

  • 🇬🇧United Kingdom catch

    In https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5198987 it looks like SchedulerRequiredTest might not be marked as #slow yet (it starts in the middle of the job still), is that definitely the right link?

  • 🇬🇧United Kingdom jonathan1055

    Sorry, yes you are right. Not sure what happened there, the tests were run with the correct change, so the results were right, I just had the wrong link. Here is the first of that run, and the two re-runs next door. 9 min 53 sec, 7 min 2 sec and 6 min 18
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5206786

  • 🇬🇧United Kingdom catch

    Thank you!

    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5206786 looks pretty good.

    SchedulerTestLegacyHooks still overhangs, and SchedulerMultilingualTest nearly overhangs but does not (e.g. SchedulerLightweightCronTest finishes in-between).

    Since SchedulerLightWeightCronTest starts sixth from last and takes 30s, I think it's likely that the last few finishing tests all finish within a few seconds of each other.

    As soon as we use @group #slow for more than 8 tests, one of them won't start until another has, and then you're back to square one, we even considered @group #really_slow in core for that but didn't go that far yet. So if we tagged SchedulerTestLegacyHooks as #slow it might save 10-30s still, but it starts to get towards diminishing returns at this point.

    So for me this is showing that we can get good pipeline runtimes with 8 concurrency, but it will probably require manual tweaking for projects with several long-running tests and more than 8 tests in total.

    Next thing is to see if 📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead Active gets us the same or better results with less manual tweaking.

    Explicitly marking this with PP-1 on that issue, I think it could make things worse for some projects if we make the switch before that lands. But once it lands, hopefully we can make everything a bit more efficient.

  • 🇬🇧United Kingdom jonathan1055

    Six tests have @group #slow and we have _CONCURRENCY_THREADS: 8. So shall I mark those two you identified (LegacyHooks and Multilingual) with #slow, to make sure that they get run by the last two threads, and don't get picked together? Or is the tactic to always have more concurrency than #slow tests? I will try it anyway, to see what that shows us.

    By the way, is there any way to widen the summatry output? The test classes are truncated at 60 chars in the sumary and three of them show as identical Drupal\Tests\scheduler_rules_integration\Functional\Schedule, and two as Drupal\Tests\scheduler\FunctionalJavascript\SchedulerJavascr
    I was just doing a little bit of data collection and analysis, fiding average runtimes per test, but cannot programatically get one-to-one numbers when the test classes are not shown as distinct.

  • 🇬🇧United Kingdom catch

    Or is the tactic to always have more concurrency than #slow tests? I will try it anyway, to see what that shows us.

    Exactly the same number is fine, so if the 8 slowest tests are all tagged that's great - as soon as one finishes, a process if freed up for a faster test to start.

    Once you get to 9 @group #slow tests with 8 concurrency, the risk is that the 9th slow test is the slowest one - say a test that takes 5 minutes, and it's been displaced by a test that takes 45s that runs first, now the test run is going to be at least 5m 45s. If the 9th test takes 90 seconds, then it'd be 90s + 45s which is usually fine, but this is where it gets into diminishing returns very quickly. Unfortunately speaking from experience here...

    So it's best to tag the minimum number of tests as @group #slow to avoid overhangs, and let ordering (improved by the discovery MR) take care of the rest. It may be the case with scheduler that the ideal minimum number of tests to tag is exactly 8 :) at least until the discovery MR lands and changes things again.

  • 🇬🇧United Kingdom catch

    By the way, is there any way to widen the summatry output? T

    I think something like this just landed in 11.2, iirc it truncates from the right instead of the left, or something like that now,so next minor might pick that up already.

  • 🇬🇧United Kingdom jonathan1055

    Here are the visual results from the sets of tests when we started with 4 #slow, then added the 5th and the 6th. "Total minutes" is the combined time of all tests, which is obviously much larger than the elapsed time, but I just added that for interest.

    I have not done any more test runs with the two (7th and 8th) you mentioned in comment 23. There is some variation in the positions. The javascript and rules integration tests seem to have the most variability, looking the colours.

  • 🇬🇧United Kingdom catch

    📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead Active is in 11.2 now, so while this can't be committed yet, I think it would be possible to test scheduler without @group #slow and see how it gets on with next minor, ideally both with and without the change here. if it's neutral or better, then this might be ready to go (with concurrency of 6 or 8, probably not 4) when 11.2.0 is tagged.

Production build 0.71.5 2024