GitLab should retry jobs that fail outside test failures.


Created on 29 September 2023, over 1 year ago

Updated 4 October 2023, over 1 year ago

Problem/Motivation

GitLab's stability is sometimes questionable. Several of the resulting errors could simply be retried by GitLab itself when they occur, which would reduce the manual work needed when a job fails for a reason unrelated to the tests.

Steps to reproduce

Some examples of reported failures: https://drupal.slack.com/archives/CGKLP028K/p1695811936776989

Proposed resolution

We should set retry on jobs based on specific reasons:

https://docs.gitlab.com/ee/ci/yaml/#retrywhen

retry:
  max: 2
  when:
    - reason
    - reason

GitLab supports many retry reasons. I think we SHOULD implement the following:

unknown_failure: Retry when the failure reason is unknown.
api_failure: Retry on API failure.
stuck_or_timeout_failure: Retry when the job got stuck or timed out.
runner_system_failure: Retry if there is a runner system failure (for example, job setup failed).
scheduler_failure: Retry if the scheduler failed to assign the job to a runner.

The following we SHOULD NOT implement:
always: Retry on any failure (default).
script_failure: Retry when the script failed, or when the runner failed to pull the Docker image (for docker, docker+machine, kubernetes executors).
runner_unsupported: Retry if the runner is unsupported
stale_schedule: Retry if a delayed job could not be executed
job_execution_timeout: Retry if the script exceeded the maximum execution time set for the job
archived_failure: Retry if the job is archived and can’t be run
unmet_prerequisites: Retry if the job failed to complete prerequisite tasks
data_integrity_failure: Retry if there is a structural integrity problem detected.

So the code should be:

retry:
  max: 2
  when:
    - unknown_failure
    - api_failure
    - stuck_or_timeout_failure
    - runner_system_failure
    - scheduler_failure

This part should be added to all jobs.
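A minimal sketch of how the block could be shared across jobs, assuming the pipeline already groups common settings behind a YAML anchor (a comment on this issue mentions a default-job-settings alias). The job name `example-job` and its script line are hypothetical and only illustrate the merge:

```yaml
# Shared settings, defined once as a hidden job with an anchor.
.default-job-settings: &default-job-settings
  retry:
    max: 2            # GitLab allows at most 2 retries
    when:
      - unknown_failure
      - api_failure
      - stuck_or_timeout_failure
      - runner_system_failure
      - scheduler_failure

# Hypothetical job: merges the shared settings, then adds its own keys.
example-job:
  <<: *default-job-settings
  script:
    - echo "run the tests here"
```

Because script_failure is not in the list, a genuine test failure still fails the job immediately instead of being retried.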

Remaining tasks

Decide if this is the correct list
Create a follow-up for contrib templates
Implement

User interface changes

N.a.

API changes

N.a.

Data model changes

N.a.

Release notes snippet

N.a.

📌 Task

Status

Fixed

Version

10.1 ✨

Component

Last updated 7 days ago

Created by

🇳🇱Netherlands bbrala Netherlands


gitlabci-core


Comments & Activities

Issue created by @bbrala
Comment over 1 year ago →
🇳🇱Netherlands bbrala Netherlands
First commit to issue fork.
Environment: PHP 8.2 & MySQL 8
last update over 1 year ago
30,379 pass
@longwave opened merge request.
Status changed to Needs review over 1 year ago, 10:15am 4 October 2023
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
Just ran into some system failures at the end of two test jobs, let's see how this gets on.
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
We should also consider using the default keyword in place of the default-job-settings alias, but that's probably for another issue.
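For reference, a rough sketch of what the `default` keyword approach could look like. This is not what was committed in this issue; it assumes every job should inherit the same retry rules, and the job name `example-job` is hypothetical:

```yaml
# Pipeline-wide defaults: applied to every job that does not
# define its own retry setting.
default:
  retry:
    max: 2
    when:
      - unknown_failure
      - api_failure
      - stuck_or_timeout_failure
      - runner_system_failure
      - scheduler_failure

# Hypothetical job: inherits retry from default without an explicit merge.
example-job:
  script:
    - echo "run the tests here"
```

A job that sets its own retry key replaces the default rather than merging with it, so any per-job override would need to repeat the full list.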
Status changed to RTBC over 1 year ago, 1:57pm 4 October 2023
Comment over 1 year ago →
🇳🇱Netherlands bbrala Netherlands
Reviewed whether all jobs are affected by the supplied changes, and they are. I agree that a follow-up to use 'default' could be good, but that has no place in this issue.

RTBC
Status changed to Fixed over 1 year ago, 3:01pm 4 October 2023
Comment over 1 year ago →
🇬🇧United Kingdom catch
Committed/pushed to 11.x and cherry-picked to 10.1.x, thanks!

Comment over 1 year ago →

catch → committed 79c39b81 on 10.1.x

Issue #3390658 by longwave, bbrala: GitLab should retry jobs that fail...

Comment over 1 year ago →

catch → committed bbb91304 on 11.x

Issue #3390658 by longwave, bbrala: GitLab should retry jobs that fail...

Comment over 1 year ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.
