- Issue created by @catch
- Status changed to Needs review
9 months ago 12:43am 23 August 2024 - 🇬🇧United Kingdom catch
The pipeline fails due to expected versions which I think might be HEAD and I'm not able to trigger a child pipeline (probably due to that, or maybe permissions?).
- 🇬🇧United Kingdom catch
First phpunit run with no real changes (just a comment) was 5 min 26 sec. https://git.drupalcode.org/issue/paragraphs-3469830/-/jobs/2527933
Realised 10.3/10.4 doesn't have test timings in run-tests.sh, so adding opt in next major to see the longest running tests.
- Status changed to RTBC
9 months ago 8:43am 23 August 2024 - 🇪🇸Spain fjgarlin
Triggered downstream pipelines https://git.drupalcode.org/issue/gitlab_templates-3469828/-/pipelines/26...
Based on the mentioned tickets where there is plenty of investigation I think this change makes sense.
- 🇬🇧United Kingdom jonathan1055
I don't know if this is useful, but here is a test on Scheduler using MR250. The PHPunit job is using _concurrent=1 and there are five branches, run in a parallel matrix. The js and kernel tests are separated out already so the benefit might not be seen.
https://git.drupalcode.org/project/scheduler/-/pipelines/264154Here's the same branch, but using ref 'main'
https://git.drupalcode.org/project/scheduler/-/pipelines/264160 - Status changed to Needs work
9 months ago 12:00am 26 August 2024 - 🇬🇧United Kingdom catch
hmm that is useful, it actually makes it slower.
e.g .https://git.drupalcode.org/project/scheduler/-/jobs/2546041 vs https://git.drupalcode.org/project/scheduler/-/jobs/2546071
There is every possibility that the target -> main job was able to use spare CPUs rather than just 2, which means the 32 concurrency might be fine in those scenarios, but ideally we want the results of this to be neutral when runs aren't CPU constrained and faster when they aren't. If it fails to take advantage of extra CPUs when they're available, it's not really an improvement.
Currently when --directory is passed we order by test suite which results in
functional js -> functional -> kernel -> unit. Because scheduler already divides by test suite as you point out, there is no 'running quick tests at the end' effect.
Something like the most optimal ordering would need to expand on the logic in 📌 Order tests by number of public methods to optimize gitlab job times Fixed , so that we first order by test suite, then order by number of methods.
However for example https://git.drupalcode.org/project/scheduler/-/blob/2.x/tests/src/Functi... uses a dataProvider which isn't taken into account by the logic there.
We can't expect all contrib authors to add @group #slow manually to their tests, and don't want to significantly regress existing performance, so should try to automate this a bit more I think.
I just opened 📌 Include a check for data providers when ordering by method count Active against core which would try to account for data providers and run them earlier.
The next step after that would be implementing the test ordering by type and method logic when --directory is used.
- 🇬🇧United Kingdom jonathan1055
Rebased and fixed conflicts. I presume we do still want to work on this?
📌 Switch to using run-tests.sh by default Active is also being looked at again. - 🇪🇸Spain fjgarlin
Thanks for rebasing.
Yeah we can investigate, but we will merge only if we find it makes a real difference, which based on #6 / #7, we couldn't really prove.
- 🇬🇧United Kingdom catch
I think we might want to postpone this on 📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead Active which orders tests slowest-first more effectively than the current logic in run-tests.sh does, but I think it's worth trying to get done to see if we can get the best combination of pipeline wall time vs. CPU time for contrib tests.
- 🇬🇧United Kingdom jonathan1055
I had already done some testing before you postponed this issue, so I might as well post the results here. I had made a change at some time since the tests in #6 and #7 above, so that running
_PHPUNIT_TESTGROUPS: --all
only runs one run job not the matrix, so we get a proper comparision now.A baseline using '--all' to run eveything in one job, without this MR
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5133960
phpunit test run duration: 11 min 25 sec
Overall job took 12 mins 16 secThen with MR250
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5135182
phpunit 22 min 32 sec. Job took 24 mins 32 secs
re-run phpunit 19 mins 38 sec. Job 20 mins 32The outcome is similar to before, with this particular project changing concurrency from 32 to 4 significantly increases the run times.
- 🇪🇸Spain fjgarlin
Thanks for putting the results of the testing. They are really useful.
- 🇬🇧United Kingdom catch
With https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5135182 specifically there are several tests that take more than 300s each and one that is over 700s. To improve run times below 11 minutes, those tests would have to be split up or otherwise refactored to take less time per test class.
However, it also looks like if these four tests were explicitly marked with @group #slow
SchedulerFieldsDisplayTest SchedulerHooksTest SchedulerPermissionsTest SchedulerRulesEventsTest
then it might get the runtimes with lower concurrency down closer to 11 minutes. Or we might need to move the concurrency a bit higher to something like 8 too.
- 🇬🇧United Kingdom jonathan1055
Thanks for the info. I marked those four tests with
@group #slow
and it was quicker. Initial run 17 min 16 sec, and re-run 20 min 36 sec, so it does appear faster with this.
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5196407#L482Then I changed concurency from 4 (as in the MR) to 8 and got runtimes of 7 min 54 sec, 19 min 38 sec and 14 min 50 sec (I guess plenty of other factors are affecting the runtimes, hence why I did three)
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5196975#L482 - 🇬🇧United Kingdom catch
Thanks for testing, this is encouraging.
I think the next tweak would be to mark this one with @group #slow
SchedulerNonEnabledTypeTest
Because that one finished last in the run. With eight concurrency, that still leaves three spare processes to run the other tests from the start of the job. Theoretically might get under 7m58 then, but also at 7m58 this looks like it's up to 4 minutes faster than HEAD?
Once 📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead Active lands in 11.2, it would be great to see what a result looks like with @grouip #slow against that issue.
And then once we've got a baseline against that issue, it would be interesting to see if reverting the @group #slow from scheduler still results in faster test runs (because the default ordering should in some cases lead to the same results).
- 🇬🇧United Kingdom jonathan1055
With
SchedulerNonEnabledTypeTest
also marked #slow, the times are 13 mins 50, 11 mins 48 and 13 mins 45
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5198370#L482 - 🇬🇧United Kingdom catch
Looks like
SchedulerRequiredTest
could also use @group #slow.(sorry for the back and forth, this is what it was like trying to get core test runs down too - constant whack-a-mole).
- 🇬🇧United Kingdom jonathan1055
sorry for the back and forth, this is what it was like trying to get core ...
No problem at all, I'm pleased you are giving good feedback and ideas.
With #slow added in
SchedulerRequiredTest
- 9 min 53 sec, 7 min 2 sec and 6 min 18 sec. Each re-run was faster, but I'm sure that is not salient.
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5198987 - 🇬🇧United Kingdom catch
In https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5198987 it looks like SchedulerRequiredTest might not be marked as #slow yet (it starts in the middle of the job still), is that definitely the right link?
- 🇬🇧United Kingdom jonathan1055
Sorry, yes you are right. Not sure what happened there, the tests were run with the correct change, so the results were right, I just had the wrong link. Here is the first of that run, and the two re-runs next door. 9 min 53 sec, 7 min 2 sec and 6 min 18
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5206786 - 🇬🇧United Kingdom catch
Thank you!
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/5206786 looks pretty good.
SchedulerTestLegacyHooks still overhangs, and SchedulerMultilingualTest nearly overhangs but does not (e.g. SchedulerLightweightCronTest finishes in-between).
Since SchedulerLightWeightCronTest starts sixth from last and takes 30s, I think it's likely that the last few finishing tests all finish within a few seconds of each other.
As soon as we use @group #slow for more than 8 tests, one of them won't start until another has, and then you're back to square one, we even considered @group #really_slow in core for that but didn't go that far yet. So if we tagged SchedulerTestLegacyHooks as #slow it might save 10-30s still, but it starts to get towards diminishing returns at this point.
So for me this is showing that we can get good pipeline runtimes with 8 concurrency, but it will probably require manual tweaking for projects with several long-running tests and more than 8 tests in total.
Next thing is to see if 📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead Active gets us the same or better results with less manual tweaking.
Explicitly marking this with PP-1 on that issue, I think it could make things worse for some projects if we make the switch before that lands. But once it lands, hopefully we can make everything a bit more efficient.
- 🇬🇧United Kingdom jonathan1055
Six tests have
@group #slow
and we have_CONCURRENCY_THREADS: 8
. So shall I mark those two you identified (LegacyHooks and Multilingual) with #slow, to make sure that they get run by the last two threads, and don't get picked together? Or is the tactic to always have more concurrency than #slow tests? I will try it anyway, to see what that shows us.By the way, is there any way to widen the summatry output? The test classes are truncated at 60 chars in the sumary and three of them show as identical
Drupal\Tests\scheduler_rules_integration\Functional\Schedule
, and two asDrupal\Tests\scheduler\FunctionalJavascript\SchedulerJavascr
I was just doing a little bit of data collection and analysis, fiding average runtimes per test, but cannot programatically get one-to-one numbers when the test classes are not shown as distinct. - 🇬🇧United Kingdom catch
Or is the tactic to always have more concurrency than #slow tests? I will try it anyway, to see what that shows us.
Exactly the same number is fine, so if the 8 slowest tests are all tagged that's great - as soon as one finishes, a process if freed up for a faster test to start.
Once you get to 9 @group #slow tests with 8 concurrency, the risk is that the 9th slow test is the slowest one - say a test that takes 5 minutes, and it's been displaced by a test that takes 45s that runs first, now the test run is going to be at least 5m 45s. If the 9th test takes 90 seconds, then it'd be 90s + 45s which is usually fine, but this is where it gets into diminishing returns very quickly. Unfortunately speaking from experience here...
So it's best to tag the minimum number of tests as @group #slow to avoid overhangs, and let ordering (improved by the discovery MR) take care of the rest. It may be the case with scheduler that the ideal minimum number of tests to tag is exactly 8 :) at least until the discovery MR lands and changes things again.
- 🇬🇧United Kingdom catch
By the way, is there any way to widen the summatry output? T
I think something like this just landed in 11.2, iirc it truncates from the right instead of the left, or something like that now,so next minor might pick that up already.
- 🇬🇧United Kingdom jonathan1055
Here are the visual results from the sets of tests when we started with 4 #slow, then added the 5th and the 6th. "Total minutes" is the combined time of all tests, which is obviously much larger than the elapsed time, but I just added that for interest.
I have not done any more test runs with the two (7th and 8th) you mentioned in comment 23. There is some variation in the positions. The javascript and rules integration tests seem to have the most variability, looking the colours.
- 🇬🇧United Kingdom catch
📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead Active is in 11.2 now, so while this can't be committed yet, I think it would be possible to test scheduler without @group #slow and see how it gets on with next minor, ideally both with and without the change here. if it's neutral or better, then this might be ready to go (with concurrency of 6 or 8, probably not 4) when 11.2.0 is tagged.