Random HTTP timeouts for GitLab CI jobs

Created on 6 May 2024, 9 months ago
Updated 20 September 2024, 4 months ago

Problem/Motivation

For three consecutive weeks, my weekly jobs have failed to download Composer dependencies.

https://git.drupalcode.org/project/tfa/-/jobs/1522954
https://git.drupalcode.org/project/tfa/-/jobs/1460015
https://git.drupalcode.org/project/tfa/-/jobs/1391716

    Failed to download composer/installers from dist: curl error 28 while downloading https://api.github.com/repos/composer/installers/zipball/c29dc4b93137acb82734f672c37e029dfbd95b35: Failed to connect to api.github.com port 443 after 10001 ms: Timeout was reached
    Now trying to download from source

It also appears these failures may be one of the causes of the random 413 Request Entity Too Large errors in G.D.O. artifact archiving:
https://git.drupalcode.org/project/smart_date/-/jobs/1639062#L670
https://git.drupalcode.org/issue/s3fs-3447227/-/jobs/1695614#L384

A caching proxy would reduce the risk of “random” download failures and make D.O. a more respectful user of third-party services.

An alternative could be to deploy #3387117: Enable distributed caching in GitLab Runner. That may incur larger storage costs, but the benefit is that the cache is per project and can be controlled more precisely by project owners. Issue forks would see limited benefit, as each fork would have its own cache.

Steps to reproduce

Proposed resolution

  • Provide a caching proxy server for http/https requests
  • Use Satis to mirror the projects that Drupal Core requires into our own package server
  • Use https://github.com/gmta/velocita-proxy as a composer plugin (I'm not sure what advantages this has over a standard caching proxy server)
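  • Share a Composer cache through a shared volume mount (security concerns)
  • Use a runner-local cache (unlikely the same module will run on the same runner in the future)
  • Require every maintainer to obtain an OAuth login for GitHub (will not fix merge request pipelines)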

Remaining tasks

User interface changes

API changes

Data model changes

🐛 Bug report
Status

Active

Component

GitLab

Created by

🇺🇸United States cmlara

Comments & Activities

  • Issue created by @cmlara
  • 🇫🇷France andypost

    The caching proxy in GitLab is still in beta and moving slowly, but it has caching for proxied https://docs.gitlab.com/ee/user/packages/composer_repository/index.html#...

    Personally, I'd rather add a local runner cache at least; a distributed cache has its own downsides with download/unpack issues.

    OTOH the runners run on Kubernetes, so we could use its approach: https://docs.gitlab.com/runner/executors/kubernetes/#using-the-cache-wit...

  • 🇺🇸United States cmlara

    Personally, I'd rather add a local runner cache at least; a distributed cache has its own downsides with download/unpack issues.

    I'm not sure local cache would be sufficient. The odds of obtaining the same runner where the local cache is present are at most ((1/current_fleet_size) * probability_last_runner_is_still_in_fleet).

    We would need INFRA to give us those figures before we can determine viability.
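To make the bound above concrete, here is a minimal sketch that simply evaluates the expression from this comment. The fleet sizes and the 0.8 survival probability are hypothetical placeholders, not measurements from the Drupal infrastructure.

```python
def same_runner_probability(fleet_size: int, p_still_in_fleet: float) -> float:
    """Upper bound on the chance that a job lands on the runner holding its local cache.

    Implements (1 / current_fleet_size) * probability_last_runner_is_still_in_fleet.
    """
    if fleet_size < 1:
        raise ValueError("fleet_size must be at least 1")
    if not 0.0 <= p_still_in_fleet <= 1.0:
        raise ValueError("p_still_in_fleet must be a probability between 0 and 1")
    return (1.0 / fleet_size) * p_still_in_fleet


# Hypothetical inputs: real values would have to come from INFRA.
for fleet_size in (5, 20, 50):
    bound = same_runner_probability(fleet_size, 0.8)
    print(f"fleet of {fleet_size}: at most {bound:.3f}")
```

Even with a generous survival probability, the bound shrinks quickly as the fleet grows, which is the concern raised here about a purely local cache.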

    OTOH the runners run on Kubernetes, so we could use its approach: https://docs.gitlab.com/runner/executors/kubernetes/#using-the-cache-wit...

    If I'm reading this correctly, that is part of GitLab's cache system, correct? If so, it is indeed another option for configuring how the cache data is shared internally, and a solution for #3387117: Enable distributed caching in GitLab Runner.

    The caching proxy in GitLab is still in beta and moving slowly, but it has caching for proxied

    Ah, I had missed that GitLab was looking into adding Composer to their package registry. Maybe that is a good long-term goal; however, a generic caching proxy (Squid or any other alternative) could bridge the gap so that we are not waiting on a solution that does not yet exist.

  • 🇫🇷France andypost

    I mean that, to share a cache between projects, you can mount a hostPath volume (https://docs.gitlab.com/runner/executors/kubernetes/#hostpath-volume) or even a cloud volume, and define extra environment variables for Composer and Yarn.

  • 🇺🇸United States cmlara

    Re-titling to allow for the additional options based on suggestions. I'm increasing the priority: with the DA's rollout (through gitlab_templates) of additional testing for all modules, we are more likely to encounter this. Additionally, we are still encouraging modules to adopt GitLab CI, making the problem likely to get worse in the future.

    Recent testing leads to the opinion that we are likely triggering silent GitHub rate limits. This should not be surprising, as we are aggressively and abusively connecting to remote providers.

    A recent case where one job succeeded while another failed in the same pipeline (during the DA rollout enabling Next Major testing for all contrib modules) allowed us to dig into the runners and determine that the jobs occurred on different runners. Max Whitehead pulled the pod scheduling logs.

    We determined that, around the time of my failure:

    • 30 seconds prior, 3 composer jobs had executed; 2 succeeded and 1 failed early in its process.
    • My job failed midway through. (At the same time, another job succeeded on a different node, implying it was neither a GitHub issue nor a connectivity issue.)
    • 2 seconds after my job started, another composer job started and succeeded (this is within the margin of timing error, so we may both have been downloading at the same time).
    • Approximately 800 seconds later, another composer job succeeded.

    Regarding the solution in #4 of using a shared volume mount: initial testing shows that, in a default Composer configuration, an attacker could inject a malicious package (such as PHPUnit) into the local cache and so exploit build jobs.

    Other suggestions to date:

    • Use Satis to mirror the projects that Drupal Core requires into our own package server
    • Use https://github.com/gmta/velocita-proxy as a composer plugin (I'm not sure what advantages this has over a standard caching proxy server)
    • Share a Composer cache through a shared volume mount (security concerns)
    • Use a runner-local cache (unlikely the same module will run on the same runner in the future)
    • Provide a caching proxy server
    • Require every maintainer to obtain an OAuth login for GitHub (will not fix merge request pipelines)

    Whatever we do, we should also deploy a solution for npm.

    Other technically valid, but not recommended, temporary workarounds:

    • Increase the composer stage's CPU request to the majority of a node's CPU (more than half of, but less than all of, the smallest runner) to reduce the risk of a single node triggering blocking through multiple connections.
    • Require every maintainer to provide their own runners
  • 🇺🇸United States cmlara

    This time let's change the title and the priority....

    Also updated IS to show additional proposals.

  • 🇬🇧United Kingdom catch

    A possible way to mitigate this would be to add a core pipeline schedule that runs composer install against all core branches with a custom-configured composer cache directory and stores the contents of the cache directory as an artefact. Then, in both core and contrib pipelines, before running composer install, retrieve the artefact via the GitLab API to pre-populate the directory. This would mean that most packages are pulled from the cache instead of from GitHub, but it doesn't help with npm unless we're able to figure out a similar trick for that.

    Because the API requests would only pull the artefact from actual core branches, there should be minimal risk of cache pollution by bad jobs (i.e. we'd have to commit something bad to core). However, it would be a massive hack and relatively tricky to implement, and some kind of mirror seems like a better option if that's doable.
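As a rough sketch of the "retrieve the artefact via the GitLab API" step, the following builds the download URL for the latest successful artefact of a named job on a branch and unpacks an archive into a cache directory. It assumes GitLab's documented `jobs/artifacts/:ref_name/download` endpoint; the API root, the project path `project/drupal`, the branch `11.x` and the job name `composer-cache` are illustrative assumptions, not an agreed design.

```python
import io
import zipfile
from pathlib import Path
from urllib.parse import quote, urlencode

# Assumed API root for git.drupalcode.org.
GITLAB_API = "https://git.drupalcode.org/api/v4"


def artifact_download_url(project_path: str, ref: str, job_name: str) -> str:
    """Build the URL for the artefact archive of the latest successful job on a ref."""
    project_id = quote(project_path, safe="")  # "project/drupal" -> "project%2Fdrupal"
    ref_name = quote(ref, safe="")
    query = urlencode({"job": job_name})
    return f"{GITLAB_API}/projects/{project_id}/jobs/artifacts/{ref_name}/download?{query}"


def unpack_cache(archive_bytes: bytes, cache_dir: Path) -> list:
    """Extract a downloaded artefact zip into the composer cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        archive.extractall(cache_dir)
        return archive.namelist()


# Hypothetical project, branch and job name.
print(artifact_download_url("project/drupal", "11.x", "composer-cache"))
```

Because the ref is fixed to a real core branch, a contrib or fork pipeline can only read what a core job published, which is the pollution argument made above.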

  • 🇬🇧United Kingdom catch

    Updating the issue title to make it clearer this is about github requests and not distributed caching in general. Also since this regularly results in pipeline failures rather than just being a performance issue, changing to a bug report.

  • 🇺🇸United States cmlara

    GitHub is our major offender currently. Any solution we implement ideally works for everything.

    We don’t want to abuse the Packagist metadata server or the npm/yarn metadata servers either, as they too may start blocking us.

  • 🇫🇷France andypost

    So COMPOSER_CACHE_DIR can be set in the job definition and its contents stored as artifacts

  • 🇳🇱Netherlands bbrala Netherlands

    Well, project_analysis has been failing every week for the last few weeks with composer failures.

    https://git.drupalcode.org/project/project_analysis/-/jobs/2679682

    It runs 10 concurrent jobs which all do a setup of core at their start, so it's relatively heavy in this regard.

  • 🇳🇱Netherlands bbrala Netherlands

    @andypost COMPOSER_CACHE_DIR might not work unless you use a custom directory. Artifacts need to be in the project dir to be included.
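A small sketch of the constraint described here: a cache directory can only be archived as an artifact if it sits inside the project directory. The paths below are hypothetical examples; in a real job the project directory would come from GitLab's `CI_PROJECT_DIR` variable.

```python
from pathlib import PurePosixPath


def is_archivable(cache_dir: str, project_dir: str) -> bool:
    """Return True when cache_dir lies inside project_dir.

    This is a purely lexical check: ".." segments and symlinks are not resolved.
    """
    return PurePosixPath(cache_dir).is_relative_to(PurePosixPath(project_dir))


# Hypothetical paths, for illustration only.
project_dir = "/builds/project/tfa"
print(is_archivable("/root/.cache/composer", project_dir))                # False
print(is_archivable("/builds/project/tfa/.composer-cache", project_dir))  # True
```

Under this assumption, COMPOSER_CACHE_DIR would need to point at something like the second path for its contents to be picked up.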

  • 🇬🇧United Kingdom catch

    Yes I had the same problem with cspell caching etc. it would definitely need to be a custom directory.

  • 🇦🇺Australia elc

    A caching Squid (or other) proxy configured through the standard Composer/Linux environment variables would immediately and hugely reduce the number of requests going directly to GitHub. The proxy would be a stand-alone server sitting between the runner's Composer and GitHub, and any other resources downloaded over http/https. Offloading to a web proxy avoids the poisoning attacks that come from injecting something with the same name into a shared directory: caching is done by URL, not by project name.
    https://getcomposer.org/doc/faqs/how-to-use-composer-behind-a-proxy.md
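To illustrate the "caching is done by URL and not project name" point, here is a toy in-memory model. It is not how Squid is implemented; `UrlKeyedCache` and `fake_fetch` are hypothetical names, and a real proxy would also handle expiry, headers and storage limits.

```python
import hashlib
from typing import Callable, Dict


def cache_key(url: str) -> str:
    """Key a cached response by its full URL rather than by a package name."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


class UrlKeyedCache:
    """Toy stand-in for a caching proxy's lookup table."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.upstream_requests = 0

    def get(self, url: str, fetch: Callable[[str], bytes]) -> bytes:
        key = cache_key(url)
        if key not in self._store:
            # Cache miss: only now does a request go upstream.
            self.upstream_requests += 1
            self._store[key] = fetch(url)
        return self._store[key]


def fake_fetch(url: str) -> bytes:
    # Hypothetical upstream; a real proxy would perform the HTTP request here.
    return f"payload for {url}".encode("utf-8")


cache = UrlKeyedCache()
dist_url = (
    "https://api.github.com/repos/composer/installers/zipball/"
    "c29dc4b93137acb82734f672c37e029dfbd95b35"
)
for _ in range(3):
    cache.get(dist_url, fake_fetch)
print(cache.upstream_requests)  # 1: only the first request reaches upstream
```

Because the key is derived from the upstream URL, a job cannot plant an entry under a package name of its choosing; it can only cause the proxy to fetch whatever that URL actually serves.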

    It could also be a good long-term solution even if GitLab does add proxy support, as it would still operate as a separate cached store of downloads, which appear to be retrieved thousands of times a day. The proxy would need to be tuned to cache even the largest blobs involved in the runner setup.

    Flush the proxy in off hours; there is certainly a consistent time of day when these failures happen, which is when you northerners are running your jobs.

    WordPress Playground had to do something similar with https://github-proxy.com/

  • 🇺🇸United States drumm NY, US

    Re-titling to describe the problem instead of the solution.

    Adding the actual error message to the issue summary. The error message that was mentioned, 413 Request Entity Too Large, is mostly unrelated. The retry with git would leave .git directories, which could certainly lead to excessively large artifacts.

    Being slow to respond to requests, causing us to reach a timeout, is not GitHub’s usual method of rate limiting, as far as I know. This could be something more on our end. Regardless, making fewer requests is a good idea.

    Since this is a timeout - has anyone tried increasing the curl timeout for a specific project’s jobs?

    Use Satis to mirror the projects that Drupal Core requires into our own package server

    For automatic updates, we have a Satis mirror of the drupal/* namespace from Packagist.org. It is relatively new, and needs testing. You can add this to composer.json:

        "repositories": [
            {
                "type": "composer",
                "url": "https://packagist-signed.drupalcode.org"
            }
        ]

    Since this is new, it should be tested on individual projects first.

  • 🇺🇸United States cmlara

    Since this is a timeout - has anyone tried increasing the curl timeout for a specific project’s jobs?

    I have not, considering that at a quick glance I saw a timeout after more than 4 minutes. 4 minutes seems more than long enough to obtain small files.

    For automatic updates, we have a Satis mirror of the drupal/* namespace from Packagist.org.

    Our main issue is not the drupal/* packages; it is everything else that is needed as a dependency (Symfony, composer, behat, phpstan, etc.). Yes, the Drupal core packages add a few more requests, but they are a small factor compared to all the extra packages.

    At a quick glance, it does not look like that repository has those files. Would it be easy to add all the dependent packages so we can test?

  • 🇪🇸Spain fjgarlin

    What's weird is that it is timing out after 10001 ms, but the templates set it to a different value: https://git.drupalcode.org/project/gitlab_templates/-/blob/main/scripts/...
    'process-timeout' => 36000,

  • 🇪🇸Spain fjgarlin

    Not a fix to the problem, just a related issue as this affects CI tests. 📌 Ignore all git files in artifacts Active

  • 🇺🇸United States cmlara

    What's weird is that it is timing out after 10001 ms, but the templates set it to a different value

    The process timeout controls how long Composer allows a process to run. The errors we see come from curl/git, which will not use the process-timeout parameter, especially if they hit hard stops early.

    The fact that we see such random times implies to me that maybe some data makes it through slowly in some cases (in fairness, this leans a bit towards drumm's statement that it might be a k8s internal network issue) before failing, and that it is possibly a retry attempt (either in the download system or at the network packet layer) that eventually reaches the limits.
