Problem/Motivation
GitLab CI is not always stable. Several kinds of failure are transient and could be retried by GitLab itself when they occur, which would reduce the manual work of re-running failed jobs.
Steps to reproduce
Some examples of reported failures: https://drupal.slack.com/archives/CGKLP028K/p1695811936776989
Proposed resolution
We should set retry on jobs based on specific reasons:
https://docs.gitlab.com/ee/ci/yaml/#retrywhen
retry:
  max: 2
  when:
    - reason
    - reason
GitLab supports many retry reasons. I think we SHOULD implement the following:
unknown_failure
: Retry when the failure reason is unknown.
api_failure
: Retry on API failure.
stuck_or_timeout_failure
: Retry when the job got stuck or timed out.
runner_system_failure
: Retry if there is a runner system failure (for example, job setup failed).
scheduler_failure
: Retry if the scheduler failed to assign the job to a runner.
The following we SHOULD NOT implement:
always
: Retry on any failure (default).
script_failure
: Retry when the script failed, or when the runner failed to pull the Docker image (for the docker, docker+machine, and kubernetes executors).
runner_unsupported
: Retry if the runner is unsupported
stale_schedule
: Retry if a delayed job could not be executed
job_execution_timeout
: Retry if the script exceeded the maximum execution time set for the job
archived_failure
: Retry if the job is archived and can't be run
unmet_prerequisites
: Retry if the job failed to complete prerequisite tasks
data_integrity_failure
: Retry if there is a structural integrity problem detected.
So the code should be:
retry:
  max: 2
  when:
    - unknown_failure
    - api_failure
    - stuck_or_timeout_failure
    - runner_system_failure
    - scheduler_failure
This part should be added to all jobs.
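One possible way to apply this to every job without repeating it is GitLab's top-level `default` keyword, which accepts `retry`. This is only a sketch and assumes the templates do not already define a `default` section that would conflict. A job that sets its own `retry` would replace this default entirely, because the two are not merged.

```yaml
# Sketch: set the retry policy once for all jobs in the pipeline.
# Assumption: no existing `default:` block in the template conflicts with this.
default:
  retry:
    max: 2  # GitLab allows 0, 1, or 2 retries
    when:
      - unknown_failure
      - api_failure
      - stuck_or_timeout_failure
      - runner_system_failure
      - scheduler_failure
```

Because `script_failure` is not in the list, a genuinely failing test run would still fail on the first attempt and would not be retried.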
Remaining tasks
- Decide whether this is the correct list
- Create a follow-up for the contrib templates
- Implement
User interface changes
N.a.
API changes
N.a.
Data model changes
N.a.
Release notes snippet
N.a.