GitLab should retry jobs that fail for reasons other than test failures.

Created on 29 September 2023, 9 months ago
Updated 4 October 2023, 9 months ago

Problem/Motivation

GitLab CI is not always stable. Several of the resulting errors could be retried automatically by GitLab itself, which would reduce the manual work of re-running a job that failed for a reason unrelated to the tests.

Steps to reproduce

Some examples of reported failures: https://drupal.slack.com/archives/CGKLP028K/p1695811936776989

Proposed resolution

We should set retry on jobs based on specific reasons:

https://docs.gitlab.com/ee/ci/yaml/#retrywhen

retry:
  max: 2
  when:
    - reason
    - reason
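For contrast, here is a minimal sketch of the two forms that retry accepts. The job names example-any-failure and example-selective are hypothetical and the scripts are placeholders, not jobs from the actual pipeline.

```yaml
# Hypothetical job: the short form retries on every failure type,
# including genuine test failures, which is not the goal of this issue.
example-any-failure:
  script:
    - echo "run tests"
  retry: 2

# Hypothetical job: the long form limits retries to the listed reasons.
example-selective:
  script:
    - echo "run tests"
  retry:
    max: 2  # 2 is the highest value GitLab accepts for max
    when:
      - runner_system_failure
```

The long form is what makes it possible to retry infrastructure failures while still failing fast on real test failures.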

GitLab supports many failure reasons. I think we SHOULD implement the following:

unknown_failure: Retry when the failure reason is unknown.
api_failure: Retry on API failure.
stuck_or_timeout_failure: Retry when the job got stuck or timed out.
runner_system_failure: Retry if there is a runner system failure (for example, job setup failed).
scheduler_failure: Retry if the scheduler failed to assign the job to a runner.

The following we SHOULD NOT implement:
always: Retry on any failure (default).
script_failure: Retry when the script failed, or when the runner failed to pull the Docker image (for the docker, docker+machine, and kubernetes executors).
runner_unsupported: Retry if the runner is unsupported
stale_schedule: Retry if a delayed job could not be executed
job_execution_timeout: Retry if the script exceeded the maximum execution time set for the job
archived_failure: Retry if the job is archived and can’t be run
unmet_prerequisites: Retry if the job failed to complete prerequisite tasks
data_integrity_failure: Retry if there is a structural integrity problem detected.

So the configuration should be:

retry:
  max: 2
  when:
    - unknown_failure
    - api_failure
    - stuck_or_timeout_failure
    - runner_system_failure
    - scheduler_failure

This part should be added to all jobs.
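
One possible way to apply this to every job without repeating it is GitLab's top-level default keyword, which accepts retry. The following is a minimal sketch, assuming no job opts out of defaults through inherit and no job defines its own retry. It is not necessarily how the change was implemented.

```yaml
# Sketch only: a pipeline-wide retry policy in .gitlab-ci.yml.
# Every job that does not set its own retry picks this up.
default:
  retry:
    max: 2
    when:
      - unknown_failure
      - api_failure
      - stuck_or_timeout_failure
      - runner_system_failure
      - scheduler_failure
```

A job that sets its own retry replaces this default for that job rather than merging with it, so any existing per-job retry settings would need to be reviewed.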

Remaining tasks

  1. Decide if this is the correct list
  2. Create a follow-up for contrib templates
  3. Implement

User interface changes

N.a.

API changes

N.a.

Data model changes

N.a.

Release notes snippet

N.a.

Task
Status

Fixed

Version

10.1

Component
Other

Created by

bbrala (Netherlands)
