Improve CI pipeline runtime

Created on 28 February 2025, 9 months ago

Overview

Our full CI pipeline currently takes about 26-30 minutes for a normal run, and times of up to 38 minutes have been seen.

Drupal core has a much larger number of tests and jobs, and completes a full run in about 5-6 minutes.

Proposed resolution

Use some of the techniques learned in Drupal core to improve the pipeline time here.

User interface changes

None

📌 Task
Status

Active

Version

0.0

Component

Miscellaneous

Created by

🇬🇧 United Kingdom longwave UK

Merge Requests

Comments & Activities

  • Issue created by @longwave
  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    Yes, please! 🙏

    When we got started, it was 5–10 minutes! We wrote a lot of code and tests… 😅

  • 🇦🇺 Australia larowlan 🇦🇺.au GMT+10

    Related issue saves 3 mins on PHPUnit

  • Pipeline finished with Failed
    7 months ago
    Total: 802s
    #480852
  • Pipeline finished with Canceled
    7 months ago
    Total: 435s
    #480868
  • Pipeline finished with Failed
    7 months ago
    Total: 788s
    #480877
  • 🇦🇺 Australia larowlan 🇦🇺.au GMT+10

    MR switches us to use run-tests.sh and makes PHPUnit tests finish in <6 mins instead of 24.
    It also makes the Cypress e2e tests run in parallel (4 containers), meaning they finish in 10 mins instead of 20.
    Total pipeline time goes from 30+ mins to 13.

    Next step would be to build a Cypress container and avoid paying 2+ mins of setup on every job where we install the dependencies each time.

    It doesn't look like we have per-project container repos, so that would be an MR to our general CI containers repo.
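
One way a four-container split of the Cypress specs could work is sketched below. This is an illustration only: the thread does not show how the MR divides the specs. It assumes GitLab's `parallel:` keyword, which gives each job a 1-based `CI_NODE_INDEX` and a `CI_NODE_TOTAL`. The helper `specs_for_node` and the spec file names are hypothetical.

```python
import os


def specs_for_node(specs, node_index, node_total):
    """Return the spec files one parallel job should run.

    node_index is 1-based, matching GitLab's CI_NODE_INDEX.
    Sorting first means every job sees the same order, so the
    slices never overlap and together cover every spec.
    """
    ordered = sorted(specs)
    return [s for i, s in enumerate(ordered) if i % node_total == node_index - 1]


if __name__ == "__main__":
    # Hypothetical spec names, for illustration only.
    all_specs = ["a.cy.js", "b.cy.js", "c.cy.js", "d.cy.js", "e.cy.js"]
    index = int(os.environ.get("CI_NODE_INDEX", "1"))
    total = int(os.environ.get("CI_NODE_TOTAL", "1"))
    print(specs_for_node(all_specs, index, total))
```

A round-robin split like this balances the number of specs per container, not their duration, so one container can still end up with the slowest specs.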

  • Pipeline finished with Success
    7 months ago
    Total: 5122s
    #480860
  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    🥳🥳🥳🥳

    Less than 10 LoC changes to the CI setup! The GitLab template must have improved quite a bit since I last looked at it then 🚀 This was definitely not possible when I set the CI up originally. Wonderful 😊

    I don't think we need to add @group to test base classes though? NW for just that. Trivial to fix!

  • 🇬🇧 United Kingdom catch

    Could probably squeeze a few more seconds off by marking the following with @group #slow. They're the three slowest tests, but they don't start first.

    ApiLayoutControllerPatchTest
    ApiLayoutControllerPostTest
    ComponentTreeItemTest
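
Why the start order matters can be shown with a small simulation. This is a sketch under simplifying assumptions: a fixed pool of workers, each test handed to whichever worker frees up first, and made-up durations. It is not how run-tests.sh is implemented.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time when tests start in the given order.

    Each test goes to whichever worker frees up first, which is
    roughly how a fixed-size pool of concurrent processes behaves.
    """
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for d in durations:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + d)
    return max(finish_times)


# Made-up durations in seconds: many quick tests and one slow one.
quick = [10] * 8
slow = [60]

slow_last = makespan(quick + slow, workers=4)
slow_first = makespan(slow + quick, workers=4)
print(slow_last, slow_first)  # 80 60
```

With the slow test queued last, it only begins once the quick tests have occupied every worker. Queued first, it runs alongside them and the total is bounded by the slow test itself.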

  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    @catch: ❤️

  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    I don't think we need to add @group to test base classes though? NW for just that. Trivial to fix!

    Apparently core does this too, didn't realize! All good then, just trying @catch's #10 now :)

  • 🇪🇸 Spain fjgarlin

    The GitLab template must have improved quite a bit since I last looked at it

    We've been busy 🙂

    Another possible improvement could be to only run one of the DBs in the matrix during MRs and make the other jobs manually triggered. I guess most aren't dealing with different database nuances. Then we can leave the main project branch and/or scheduled runs to check all databases.

    There is an error in the pipeline, but I don't think it's related to the changes.

  • 🇬🇧 United Kingdom catch

    For core, other database types are manual on MRs. We run a selection on commit, and a bigger selection in a scheduled job: daily on the development branches and weekly on the release branches (as of this morning, due to the same screenshot that revived this issue). Very occasionally someone does something database related, everyone forgets to run the manual jobs, and we commit a regression and have to revert. The on-commit job catches those immediately, though, unless it's also ignored for days, which I can only remember happening once in the past few years.

  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    OMG!

    Drupal\Tests\experience_builder\Kernel\EcosystemSupport\Fiel   1 passes    4s                                      
    Drupal\Tests\experience_builder\Kernel\Plugin\ExperienceBuil  21 passes   94s                                      
    Drupal\Tests\experience_builder\Kernel\Config\ContentTemplat   3 passes   13s                                      
    

    Source: https://git.drupalcode.org/project/experience_builder/-/jobs/5053302

    Apparently run-tests.sh now lists how long the test took!!! 🥳

    Took @catch's advice in #10. The 3 he highlighted take ~200 seconds. But tests are running with --concurrency 32. Played 2 minutes in TextMate and I now have a sorted list. I'll mark the slowest 21 as "slow": those are the ones that take >=100 seconds. That leaves room to spot new slow ones later and mark them slow too. Imagine the entire PHPUnit test suite running in ~100 seconds!
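
The sorting step described above was done by hand in an editor. A scripted equivalent might look like the sketch below. It assumes the log lines follow the `<test> <N> passes <N>s` shape quoted earlier in this comment, and handles only that shape. The function name `slowest_tests` is hypothetical, and the sample threshold of 10 seconds is chosen only so the three quoted lines produce output (the comment uses 100 seconds).

```python
import re

# Matches lines such as:
#   Drupal\Tests\...\Fiel   1 passes    4s
# Only the "N passes" form quoted in the thread is handled here.
LINE = re.compile(r"^(?P<test>\S+)\s+(?P<passes>\d+)\s+passes\s+(?P<seconds>\d+)s$")


def slowest_tests(log_text, threshold_seconds):
    """Return (seconds, test) pairs at or above the threshold, slowest first."""
    found = []
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if match:
            found.append((int(match["seconds"]), match["test"]))
    return sorted((item for item in found if item[0] >= threshold_seconds), reverse=True)


# The three (truncated) lines quoted in the thread.
sample = r"""
Drupal\Tests\experience_builder\Kernel\EcosystemSupport\Fiel   1 passes    4s
Drupal\Tests\experience_builder\Kernel\Plugin\ExperienceBuil  21 passes   94s
Drupal\Tests\experience_builder\Kernel\Config\ContentTemplat   3 passes   13s
"""

for seconds, test in slowest_tests(sample, threshold_seconds=10):
    print(f"{seconds:>4}s  {test}")
```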

  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    #13 + #14: I've been wanting to do something like that for months!

    We introduced the multi-DB testing in 📌 PHPUnit SQLite CI job (Needs review), for 📌 Prevent modules from being uninstalled if they provide field types used in an Experience Builder field (Fixed), which uses the DB-specific JSON_EXTRACT stuff. But if everything works out as we hope in 📌 Calculate field and component dependencies on save and store them in an easy to retrieve format (Active), we could just drop that altogether. 🤞

    I didn't expect this issue to appear overnight though, so will spend a few minutes trying to get that working 🤓

  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    @larowlan in #8:

    MR switches us to use run-tests.sh and makes PHPUnit tests finish in <6 mins instead of 24.

    to be precise:

    Test run duration: 3 min 45 sec
    

    (total CI job duration: 5 minutes 16 seconds)
    Source: https://git.drupalcode.org/project/experience_builder/-/jobs/5053302

    @catch's #10 + my #15
    👇

    Test run duration: 2 min 48 sec


    (total CI job duration: 5 minutes 13 seconds)
    Source: https://git.drupalcode.org/project/experience_builder/-/jobs/5055493 (at a time when CI infra was not yet under full load)

    Shaved off another minute, the rest is just CI infra overhead AFAIK.
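
The figures quoted above can be broken down with a few lines of arithmetic. This sketch uses only the four durations from this comment. What fills the non-test portion of each job is not shown in the thread.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds


# Figures quoted in the comment above: (test run duration, total job duration).
before = (to_seconds(3, 45), to_seconds(5, 16))
after = (to_seconds(2, 48), to_seconds(5, 13))

for label, (tests, job) in (("before", before), ("after", after)):
    overhead = job - tests
    print(f"{label}: tests {tests}s, job {job}s, non-test overhead {overhead}s")
```

The test run drops by 57 seconds (225s to 168s) while the whole job drops by only 3 seconds (316s to 313s), so the non-test portion differed between the two runs: 91s in the first and 145s in the second.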

  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    Turns out we were using _PHPUNIT_CONCURRENT: 1 until 📌 Unit tests for PropExpressions to go to/from string representation + single StructuredDataPropExpression::from(…) method (Fixed)!

    Specifically, I removed it because it was a work-around for 📌 Prepare for PHPUnit 10 (Fixed).

    Back then (10 months ago!) it didn't make a material difference: the test suite was much smaller.

  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    Interestingly, the MySQL-powered PHPUnit CI job was notably faster:

    Not sure why that is. Rerunning both to see if it's reproducible. I'll make the default whichever of the 2 is fastest for @fjgarlin & @catch's proposal in #13 + #14.

  • 🇪🇸 Spain fjgarlin

    Yeah, on a bigger set of tests it makes a big difference to run tests with run-tests.sh. Amazing progress so far!!

  • Pipeline finished with Failed
    7 months ago
    Total: 1169s
    #481051
  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    It also makes the Cypress e2e tests run in parallel (4 containers), meaning they finish in 10 mins instead of 20.

    This is great, but I just saw global-region.cy.js crash hard:

    We detected that the Electron Renderer process just crashed.
    We have failed the current spec but will continue running the next spec.
    This can happen for a number of different reasons.
    …
    

    Source: https://git.drupalcode.org/project/experience_builder/-/jobs/5055503

    I've not seen that happen on that particular test … ever? 🤔

  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    Yay, seems like the AI assistance to achieve #13 + #14 worked. I bet there's a nicer way to achieve this in GitLab CI, but this will do for here.

    https://www.drupal.org/project/gitlab_templates will hopefully some day lead the way and make this simple for every contrib module :)

    Also, #21 didn't occur again, so I see only reasons to land this right now, to make XB development a bit smoother! 🚀

  • Pipeline finished with Skipped
    7 months ago
    #481114
  • Pipeline finished with Failed
    7 months ago
    Total: 637s
    #481108
  • 🇧🇪 Belgium wim leers Ghent 🇧🇪🇪🇺

    Next up: ✨ Would you accept a Cypress container (Active). @fjgarlin, any thoughts on that? 😇

  • 🇪🇸 Spain fjgarlin

    I made a comment on that other issue. That'll defo shave more minutes.

  • 🇪🇸 Spain fjgarlin

    Re your last commit on this issue's MR, you could have maybe played with the variable CI_PIPELINE_SOURCE, so it does not trigger on MR but it does on push to the main branch. You could have also set the "when: manual" so the jobs can be triggered manually.
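
The policy discussed in #13, #14 and this comment can be written down as a small decision function. This is a sketch of the rule only, not the MR's actual configuration. The `pipeline_source` strings mirror GitLab's `CI_PIPELINE_SOURCE` values. The function name `job_mode`, the choice of `mysql` as the default database, and the `sqlite` alternative are assumptions for illustration.

```python
def job_mode(pipeline_source, database, default_database="mysql"):
    """Decide how a per-database test job should behave.

    pipeline_source mirrors GitLab's CI_PIPELINE_SOURCE values.
    Returns "auto" (runs by itself) or "manual" (needs a click).
    """
    if pipeline_source == "merge_request_event":
        # On MRs only the default database runs automatically.
        return "auto" if database == default_database else "manual"
    # Pushes to the main branch and scheduled runs check every database.
    return "auto"


for source in ("merge_request_event", "push", "schedule"):
    for db in ("mysql", "sqlite"):
        print(source, db, job_mode(source, db))
```

In GitLab CI itself, a rule of this kind would typically be expressed with `rules:` entries that test `$CI_PIPELINE_SOURCE` and set `when: manual`.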

  • 🇬🇧 United Kingdom catch

    EDIT: The difference in #19 is not consistent. The second runs both took ~3 minutes. Presumably varies with CI infra load.

    IME the GitLab CI runners alternate between at least two instance types, and one is about a third faster than the other, I'm guessing due to CPU speed. This is mostly conjecture. It can also happen that sometimes a job gets a machine to itself, and sometimes it's sharing it with other heavy tests, e.g. 32 CPUs go further if the machine has more CPUs available.

    I'm slightly surprised that this needed so many tests to be marked with @group #slow, but possibly 📌 Order tests by number of public methods to optimize gitlab job times (Fixed) and 📌 Include a check for data providers when ordering by method count (Active) only work with parallel > 1; I would need to look again. Ideally that would 'just work' for contrib. 📌 Deprecate TestDiscovery test file scanning, use PHPUnit API instead (Active) will clean up some of the hacks, but we might want to check what's going on when there's no CI parallelism to make sure the logic still runs.

  • 🇬🇧 United Kingdom catch

    Might have figured out the answer to #28, opened 📌 Optimize test ordering when --group argument is passed to phpunit (Active).

  • Status changed to Fixed 6 months ago
  • Automatically closed - issue fixed for 2 weeks with no activity.

  • Pipeline finished with Canceling
    about 1 month ago
    #621483
  • Pipeline finished with Success
    about 1 month ago
    Total: 615s
    #621626