- Issue created by @cmlara
- 🇫🇷France andypost
The caching proxy in GitLab is still beta and moving slowly, but it does have caching for proxied packages: https://docs.gitlab.com/ee/user/packages/composer_repository/index.html#...
Personally I'd rather add at least a local runner cache; distributed caching has its own downsides of download/unpack issues.
OTOH the runners run on Kubernetes, so we could use its approach: https://docs.gitlab.com/runner/executors/kubernetes/#using-the-cache-wit...
- 🇺🇸United States cmlara
Personally I'd rather add at least a local runner cache; distributed caching has its own downsides of download/unpack issues.
I'm not sure a local cache would be sufficient. The odds of obtaining the same runner where the local cache is present are at most ((1/current_fleet_size) * probability_last_runner_is_still_in_fleet).
We would need INFRA to give us those statistical points for us to determine viability.
OTOH the runners run on Kubernetes, so we could use its approach: https://docs.gitlab.com/runner/executors/kubernetes/#using-the-cache-wit...
If I'm reading correctly, that is part of GitLab's cache system, correct? If so, it is indeed another option for configuring how the cache data is shared internally, and a solution for #3387117: Enable distributed caching in GitLab Runner →
The caching proxy in GitLab is still beta and moving slowly, but it does have caching for proxied packages
Ah, I had missed that GitLab was looking into adding Composer to their package registry. Maybe that is a good long-term goal; however, a generic caching proxy (squid or any other alternative) could bridge us so that we are not waiting on a solution that does not yet exist.
- 🇫🇷France andypost
I mean, to share the cache between projects you can mount a hostPath volume (https://docs.gitlab.com/runner/executors/kubernetes/#hostpath-volume) or even a cloud volume, and define extra environment variables for Composer and Yarn.
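A minimal job-side sketch of that idea, assuming the runner admin mounts a shared volume at /shared-cache; the path and variable values are illustrative, not an existing setup:

    variables:
      # Point both package managers at the shared mount provided by the runner.
      COMPOSER_CACHE_DIR: /shared-cache/composer
      YARN_CACHE_FOLDER: /shared-cache/yarn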
- 🇺🇸United States cmlara
Re-titling to allow this issue to cover additional options based on the suggestions. I'm increasing the priority because, with the DA's rollout (through gitlab_templates) of additional testing for all modules, we are more likely to encounter this. Additionally, we are still encouraging modules to adopt GitLab CI, making the problem likely to get worse in the future.
Recent testing leads to the opinion that we are likely triggering silent GitHub rate limits. This should not be surprising to us, as we are aggressively and abusively connecting to remote providers.
A recent pipeline in which one job succeeded while another failed (the DA rollout enabling Next Major testing for all contrib modules) allowed us to dig into the runners and determine that the jobs occurred on different runners. Max Whitehead pulled the pod scheduling logs.
We determined that at the time of my failure:
- 30 seconds prior, 3 composer jobs had executed: 2 succeeded and one failed early in its process.
- My job failed midway through. (At the same time another job succeeded on a different node, implying it was not a GitHub issue nor a connectivity issue.)
- 2 seconds after my job started, another composer job started and succeeded (this is close enough in time that we may have both been downloading at the same time).
- Approximately 800 seconds later another composer job succeeded.
Regarding the solution in #4 of using a shared volume mount: initial testing shows that in a default Composer configuration an attacker could inject a malicious package (such as PHPUnit) into the local cache, exploiting build jobs.
Other suggestions to date:
- Use Satis to mirror the projects that Drupal Core requires into our own package server
- Use https://github.com/gmta/velocita-proxy as a composer plugin (I'm not sure what advantages this has over a standard caching proxy server)
- Use a shared volume mount for the Composer cache (security concerns)
- Rely on the local runner cache (unlikely the same module will run on the same runner in the future)
- Provide a caching proxy server
- Require every maintainer to obtain an OAuth login for GitHub (will not fix merge request pipelines)
Whatever we do, we should deploy a solution for NPM as well.
Other technically valid but not recommended temporary workarounds:
- Increase the composer stage to use the majority of CPU (more than half but less than the smallest runner) to reduce the risk of a node triggering blocking from multiple connections (see the sketch after this list).
- Require every maintainer to provide their own runners
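A hedged sketch of the first workaround above, assuming the job is named composer as in gitlab_templates and the runner configuration permits CPU request overwrites; the value is only an example:

    composer:
      variables:
        # Request most of a node's CPU so fewer composer jobs share one node and its egress IP.
        # Only honoured if the runner config allows KUBERNETES_CPU_REQUEST to be overwritten.
        KUBERNETES_CPU_REQUEST: "2"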
- 🇺🇸United States cmlara
This time let's change the title and the priority....
Also updated the issue summary to show the additional proposals.
- 🇬🇧United Kingdom catch
A possible way to mitigate this would be to add a core pipeline schedule that runs composer install against all core branches with a custom-configured composer cache directory, and stores the contents of the cache directory as an artefact. Then in both core and contrib pipelines, before running composer install, retrieve the artefact via the GitLab API to pre-populate the directory. This would mean that most packages are pulled from the cache instead of from GitHub, but it doesn't help with npm unless we're able to figure out a similar trick for that.
Because the API requests would only pull the artefact from actual core branches, there should be minimal risk of cache pollution by bad jobs (i.e. we'd have to commit something bad to core). However it would be a massive hack, relatively tricky to implement, and some kind of mirror seems like a better option if that's doable.
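A rough sketch of what that could look like; the job name, branch, and project ID placeholder are all hypothetical, and whether the artefact can be fetched anonymously (or needs a token) would need to be confirmed:

    # Scheduled job on core: warm the Composer cache and publish it as an artefact.
    warm-composer-cache:
      rules:
        - if: $CI_PIPELINE_SOURCE == "schedule"
      variables:
        COMPOSER_CACHE_DIR: $CI_PROJECT_DIR/.composer-cache
      script:
        - composer install
      artifacts:
        paths:
          - .composer-cache/
        expire_in: 1 week

    # In core/contrib pipelines: pre-populate the same directory before composer install.
    .composer-prewarm:
      variables:
        COMPOSER_CACHE_DIR: $CI_PROJECT_DIR/.composer-cache
      before_script:
        # <core-project-id> is a placeholder for drupal/core's numeric project ID.
        - curl --fail --location --output /tmp/composer-cache.zip "https://git.drupalcode.org/api/v4/projects/<core-project-id>/jobs/artifacts/11.x/download?job=warm-composer-cache" || true
        - unzip -o /tmp/composer-cache.zip -d $CI_PROJECT_DIR || true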
- 🇬🇧United Kingdom catch
Updating the issue title to make it clearer this is about github requests and not distributed caching in general. Also since this regularly results in pipeline failures rather than just being a performance issue, changing to a bug report.
- 🇺🇸United States cmlara
GitHub is our major offender currently. Any solution we implement should ideally work for everything.
We don’t want to abuse the Packagist metadata server or the npm/yarn metadata servers either, as they too may start blocking us.
- 🇫🇷France andypost
So COMPOSER_CACHE_DIR can be set in the job definition and used as artifacts.
- 🇳🇱Netherlands bbrala Netherlands
Well, project_analysis has been failing every week for the last few weeks with composer failures.
https://git.drupalcode.org/project/project_analysis/-/jobs/2679682
It runs 10 concurrent jobs which all do a setup of core at their start, so it's relatively heavy in this regard.
- 🇳🇱Netherlands bbrala Netherlands
@andypost COMPOSER_CACHE_DIR might not work unless you use a custom directory. Artifacts need to be in the project directory to be included.
- 🇬🇧United Kingdom catch
Yes, I had the same problem with cspell caching etc.; it would definitely need to be a custom directory.
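A minimal sketch under that constraint; the job name and directory are illustrative:

    my-composer-job:
      variables:
        # Keep the cache inside the project directory so it can be collected as an artefact.
        COMPOSER_CACHE_DIR: $CI_PROJECT_DIR/.composer-cache
      script:
        - composer install
      artifacts:
        paths:
          - .composer-cache/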
- 🇦🇺Australia elc
A caching squid (or other) proxy using the standard composer/Linux environment variables would make an immediate and huge difference to the number of requests going directly to GitHub. The proxy would be a stand-alone server operating between the runners' composer and GitHub and any other http/https downloaded resources. Offloading to a web proxy also avoids the poisoning attacks of injecting something with the same name into a shared directory, since caching is done by URL and not by project name.
https://getcomposer.org/doc/faqs/how-to-use-composer-behind-a-proxy.md
It could also be a good long-term solution even if GitLab does add proxy support, as it would still operate as a separate cached store of downloads which appear to be retrieved thousands of times a day. The proxy would need to be tuned to cache even the largest blobs involved in the runner setup.
Flush the proxy in off hours; there is certainly a consistent time of day when these failures happen, which is when you northerners are running your jobs.
WordPress Playground had to do something similar with https://github-proxy.com/
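A minimal sketch of the job side, assuming such a proxy existed; the hostname and port are made up. One caveat: for https downloads to actually be cached rather than just tunnelled via CONNECT, the proxy would need to terminate TLS or the tools would need to be pointed at an explicit mirror.

    variables:
      # Standard proxy variables honoured by composer, curl, git, wget, etc.
      HTTP_PROXY: "http://ci-cache-proxy.internal:3128"
      HTTPS_PROXY: "http://ci-cache-proxy.internal:3128"
      # Keep Drupal's own infrastructure direct.
      NO_PROXY: "git.drupalcode.org,drupal.org"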
- 🇺🇸United States drumm NY, US
Re-titling to describe the problem instead of the solution.
Adding the actual error message to the issue summary. The error message that was mentioned, 413 Request Entity Too Large, is mostly unrelated. The retry with git would leave .git directories, which could certainly lead to excessively large artifacts.
Being slow to respond to requests, causing us to reach a timeout, is not GitHub’s usual method of rate limiting, as far as I know. This could be something more on our end. Regardless, making fewer requests is a good idea.
Since this is a timeout - has anyone tried increasing the curl timeout for a specific project’s jobs?
Use Satis to mirror the projects that Drupal Core requires into our own package server
For automatic updates, we have a Satis mirror of the drupal/* namespace from Packagist.org. It is relatively new, and needs testing. You can add to composer.json:
    "repositories": [
        { "type": "composer", "url": "https://packagist-signed.drupalcode.org" }
    ]
Since this is new, it should be tested on individual projects first.
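A sketch of how an individual project might try it from CI without committing the change; the job name is illustrative and this assumes the mirror serves the packages that job needs:

    test-signed-mirror:
      before_script:
        # Add the mirror as an additional Composer repository for this job only.
        - composer config repositories.packagist-signed composer https://packagist-signed.drupalcode.org
      script:
        - composer install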
- 🇺🇸United States cmlara
Since this is a timeout - has anyone tried increasing the curl timeout for a specific project’s jobs?
Considering that on a quick glance I saw a timeout at over 4 minutes, I have not. 4 minutes seems more than long enough to obtain small files.
For automatic updates, we have a Satis mirror of the drupal/* namespace from Packagist.org.
Our main issue is not the drupal/* packages; it is everything else that is needed as a dependency (Symfony, composer, behat, phpstan, etc.). Yes, the Drupal core packages add a few more requests; however, they are a small factor when compared to all the extra packages.
At a quick glance it does not look like that repository has those files. Would it be easy to add all the dependent packages so we can test?
- 🇪🇸Spain fjgarlin
What's weird is that it is timing out after 10001 ms, but the templates set it to a different value: https://git.drupalcode.org/project/gitlab_templates/-/blob/main/scripts/...
'process-timeout' => 36000,
- 🇪🇸Spain fjgarlin
Not a fix to the problem, just a related issue as this affects CI tests. 📌 Ignore all git files in artifacts Active
- 🇺🇸United States cmlara
What's weird is that it is timing out after 10001 ms, but the templates set it to a different value
The process timeout controls how long Composer allows a process to run; the errors we see are from curl/git and will not use the process-timeout parameter, especially if they hit hard stops early.
The fact that we see such random times implies to me that maybe some data makes it through slowly in some cases (in fairness this leans a bit towards drumm's statement that it might be k8s internal network issues) before failing, and it is possibly a retry attempt (either by the download system or at the network packet layer) that eventually reaches the limits.
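One low-effort experiment that might help narrow this down, assuming Composer 2 (COMPOSER_MAX_PARALLEL_HTTP is, as far as I know, a standard Composer 2 environment variable; the job name and value are illustrative): limit parallel downloads in a single project's job and raise verbosity to see whether the failures track connection bursts.

    composer-diagnose:
      variables:
        # Reduce the burst of simultaneous downloads from one node (Composer 2 defaults to 12).
        COMPOSER_MAX_PARALLEL_HTTP: "4"
      script:
        # -vvv shows which URL and transport step actually hits the timeout.
        - composer install -vvv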