Database service not starting in some CI runs.

Created on 7 February 2024, 5 months ago
Updated 23 May 2024, about 1 month ago

Problem/Motivation

We've been investigating for quite a while cases where the DB service is not available in core and contrib jobs. It's hard to reproduce because re-running the jobs usually fixes the issue, but setting CI_DEBUG_SERVICES gives us extra debug information that can be useful.

That was the case here, as reported by @dww in https://www.drupal.org/project/gitlab_templates/issues/3414252#comment-1... →

Okay, here's a real failure from a job with CI_DEBUG_SERVICES enabled 🎉

https://git.drupalcode.org/project/address/-/jobs/752891

Logs are full of this:

[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:53.851198940Z netcat: connect to localhost (127.0.0.1) port 3306 (tcp) failed: Connection refused
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:53.851225416Z netcat: connect to localhost (::1) port 3306 (tcp) failed: Cannot assign requested address

Here's the real culprit:

[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451301665Z 2024-02-05T21:59:23.451125Z 0 [ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451322612Z 2024-02-05T21:59:23.451154Z 0 [Note] InnoDB: You can disable Linux Native AIO by setting innodb_use_native_aio = 0 in my.cnf
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451325887Z 2024-02-05T21:59:23.451266Z 0 [ERROR] InnoDB: Cannot initialize AIO sub-system
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451328497Z 2024-02-05T21:59:23.451274Z 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451331115Z 2024-02-05T21:59:23.451281Z 0 [ERROR] Plugin 'InnoDB' init function returned error.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451333559Z 2024-02-05T21:59:23.451285Z 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451338953Z 2024-02-05T21:59:23.451289Z 0 [ERROR] Failed to initialize builtin plugins.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451341452Z 2024-02-05T21:59:23.451292Z 0 [ERROR] Aborting
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451344050Z 
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451346775Z 2024-02-05T21:59:23.451304Z 0 [Note] Binlog end
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451465428Z 2024-02-05T21:59:23.451358Z 0 [Note] Shutting down plugin 'CSV'
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.453778406Z 2024-02-05T21:59:23.453665Z 0 [Note] /usr/sbin/mysqld: Shutdown complete

However, it's not clear why that is happening, just from these logs. Wonder if there's other output being saved somewhere that might be useful. Hopefully @fjgarlin has a chance to review this and knows where to look for the underlying problem.

I followed up on slack here: https://drupal.slack.com/archives/CGKLP028K/p1707211277736559?thread_ts=...

It seems that the quickest workaround is to add some configuration to the my.cnf file. We might need a bigger and more robust fix somewhere else, but it's not clear where or what yet, so we should address the issue here if possible.

It seems to be happening on mysql-5.7 but I'd probably do it for the other mysql versions too.

Steps to reproduce

Can be duplicated on a local system.

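# Start 15 MySQL containers, 30 seconds apart, until the host runs out of AIO slots.
for i in $(seq 1 15); do
    # Output is discarded; remove the redirect to see the InnoDB errors on the console.
    docker run --rm --name "resource_exhaust_$i" drupalci/mysql-5.7:production > /dev/null 2>&1 &
    sleep 30
done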
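
Because the loop discards container output, a failed start is easy to miss. A minimal sketch of one way to check for it, assuming each container's output was redirected to a log file instead of /dev/null. The `aio_failed` helper and the `resource_exhaust_<n>.log` file names are hypothetical and not part of the original script:

```shell
# Hypothetical helper: succeeds if the given log file contains the InnoDB
# AIO start-up failure quoted in this issue.
aio_failed() {
    grep -qF 'io_setup() failed with EAGAIN' "$1"
}

# Usage, assuming logs were saved as resource_exhaust_<n>.log:
for log in resource_exhaust_*.log; do
    [ -e "$log" ] || continue   # skip when no log files are present
    if aio_failed "$log"; then
        echo "AIO failure in $log"
    fi
done
```

Matching on the fixed string keeps the check independent of the timestamps and service prefixes that surround the message in the CI logs.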

Proposed resolution

Change my.cnf files for the images with a fix for that situation.
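
The InnoDB log output quoted above names the relevant setting itself: `innodb_use_native_aio = 0`. A minimal sketch of such an override follows. The `write_aio_override` helper, the file name `disable-aio.cnf`, and the target directory are illustrative assumptions, not the paths used by the actual images:

```shell
# Hypothetical helper: write a config fragment that disables Linux native AIO
# for InnoDB into the directory given as $1.
write_aio_override() {
    cat > "$1/disable-aio.cnf" <<'EOF'
[mysqld]
innodb_use_native_aio = 0
EOF
}

# Usage: write the fragment to a scratch directory and print it.
conf_dir=$(mktemp -d)
write_aio_override "$conf_dir"
cat "$conf_dir/disable-aio.cnf"
```

The server only reads such a fragment if its directory is included from the main configuration (for example through an `!includedir` directive), so the real location depends on how each image lays out its my.cnf.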

Remaining tasks

MR

User interface changes

API changes

Data model changes

📌 Task
Status

Fixed

Component

PHP Containers

Created by

🇪🇸 Spain fjgarlin


Merge Requests

Comments & Activities

  • Issue created by @fjgarlin
  • 🇺🇸 United States cmlara

    Added a bash script that allows duplicating the problem locally, outside of GitLab. Nothing is special about GitLab as it relates to this issue; the only variable is the k8s node's fs.aio-max-nr value. Just periodically follow the logs of each container (or remove the null redirect so the logs go to the console) to see if the error occurs.

    On a non-tuned laptop with several other containers running (including 3 MariaDB containers), I reached the 7th container (10 database containers in total) before I saw the error.

    One should be able to simulate higher concurrency by reducing sysctl fs.aio-max-nr (each container will then take up a higher percentage of the limit).

    Going by the first results I could pull up on Slack (from March 2023), the largest runner in the fleet at the time was an m4.10xlarge, which has 40 vCPUs.

    Assuming I understand the reservation system correctly, assuming the standard (gitlab_templates) configuration of 2-CPU reservations, and assuming an absolute worst case where all pods are in the PHPUnit stage (where we have the build, helper, php, and mysql containers, along with any other containers a contrib project may have added), 20 SQL container instances may be running on a single physical k8s node (host/worker) at a time, along with the other ancillary containers that consume AIO slots.

    That gives a (rough) minimum target for how many instances need to be able to launch for this to be considered 'resolved'.
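
The estimate of 20 SQL containers per node follows from dividing the node's vCPUs by the per-pod CPU reservation, with one SQL container per pod. A small sketch of that arithmetic, where `max_pods_per_node` is a hypothetical helper name and the inputs are the figures given in the comment above:

```shell
# Upper bound on pods (and so on SQL containers, one per pod) per node,
# given the node's vCPU count ($1) and the CPU reservation per pod ($2).
max_pods_per_node() {
    echo $(( $1 / $2 ))
}

max_pods_per_node 40 2   # m4.10xlarge with 2-CPU reservations: prints 20
```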

  • First commit to issue fork.
  • Merge request !41 Fix IO problems in mysql-like databases → (Merged) created by dimitriskr
  • Status changed to Needs review 4 months ago
  • 🇺🇸 United States cmlara
    Installing MariaDB/MySQL system tables in '/var/lib/mysql/' ...
    io_setup(8192) returned -11
    2024-02-13 21:02:08 0 [Warning] InnoDB: Linux Native AIO disabled.
    

    Config change was loaded into the mysql config for mariadb

    Was able to run the script above (when set to the mariadb-10.6:dev test image) with all 15 containers starting.

  • 🇺🇸 United States nnewton

    We are starting to hit this on core gitlabci as we are trying to consolidate runs on nodes. I would suggest we globally disable AIO for these containers (mysql/mariadb). There are solutions on the node side, but they are ugly and won't be portable between testing environments.

    On our larger nodes we can reproduce this fairly consistently while watching aio-nr.

    1 Job - 1 Node

    root@runner-s4yvuuu9g-project-78834-concurrent-0-hpbibdd6:/var/www/html# sysctl -a 2> /dev/null | grep fs.aio
    fs.aio-max-nr = 65536
    fs.aio-nr = 8805
    

    4 Jobs - 1 Node

    root@runner-s4yvuuu9g-project-78834-concurrent-3-qg6kwxrj:/var/www/html# sysctl -a 2> /dev/null  | grep aio
    fs.aio-max-nr = 65536
    fs.aio-nr = 35220
    

    And if we push 8 jobs to double that, the 8th will fail with:

    [ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts.
    [service:drupalci/mysql-5.7-database] 2024-04-17T21:58:10.746455040Z 2024-04-17T21:58:10.746096Z 0 [Note] InnoDB: You can disable Linux Native AIO by setting innodb_use_native_aio = 0 in my.cnf
    [service:drupalci/mysql-5.7-database] 2024-04-17T21:58:10.746456647Z 2024-04-17T21:58:10.746156Z 0 [ERROR] InnoDB: Cannot initialize AIO sub-system
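
The sysctl readings above are consistent with each job consuming about 8805 AIO slots, which explains why the 8th job is the one that fails. A rough sketch of that arithmetic, assuming usage scales linearly per job and ignoring any other AIO consumers on the node (`jobs_that_fit` is a hypothetical helper name):

```shell
# How many concurrent jobs fit under fs.aio-max-nr ($1) if each job
# consumes roughly $2 AIO slots (integer division rounds down).
jobs_that_fit() {
    echo $(( $1 / $2 ))
}

per_job=8805                      # fs.aio-nr observed with 1 job on the node
echo $(( 4 * per_job ))           # 35220, matching the 4-job reading
jobs_that_fit 65536 "$per_job"    # 7, so an 8th job would exceed the limit
```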
    
  • 🇫🇷 France andypost

    Maybe we just need to increase this value to fs.aio-max-nr=200000, as most distros do?

  • 🇺🇸 United States nnewton

    Which distros set this to something other than 65536? Debian/RHEL/AL2 all seem to use the default of 65536. Either way, our (and everyone else's) EKS/AL2 based clusters will have this set to 65536. Modifying this would require a custom launch template, or marking this sysctl as unsafe but allowed at the kubelet level. I would advise changing this at the container level, as that is a far cleaner solution and would resolve this portably between clusters.

  • Merge request !43 Disable aio in mariadb and mysql images. → (Merged) created by fjgarlin
  • 🇪🇸 Spain fjgarlin

    Based on #9 I added the same setting to all other mariadb and MySQL images: https://git.drupalcode.org/project/drupalci_environments/-/merge_request...

  • 🇺🇸 United States cmlara

    Either way, our (and everyone else's) EKS/AL2 based clusters will have this set to 65536.

    The key point is that this is a default value, not necessarily what everyone runs with.

    Changing these values to match the purpose of the environment is to be expected. Defaults are just that, defaults; a cluster manager is expected to manage their cluster to meet the needs of the design.

    Changing the Drupal images is a start, but it does nothing for projects that don't run the DrupalCI images (few if any at the moment), and not all containers offer an easy environment variable to disable this feature (for example, I couldn't find one in the wodby or Docker Hub mariadb images). Few may use these right now, but it's not impossible that the gitlab_templates project could move away from the drupalci images if justification is provided.

    In my opinion it makes a lot of sense for D.O. infra to tune the environment for the community's expected use.

  • 🇪🇸 Spain fjgarlin

    I agree that it might make sense to change at infra level, but I also think that if we have a quick win available within the images that we are using right now (ie: this MR), we should go ahead and do it.

  • 🇺🇸 United States nnewton

    The defaults discussion came up because someone suggested that distros were changing the default, which they are not.

    Obviously we change numerous default settings in drupal-infra. As I mentioned in my previous comment, this setting is very difficult to change in a manageable and secure way on an EKS cluster in our config management, and we won't be doing so currently. We are working desperately to reduce maintenance overhead, and this would increase it for no clear advantage (if people start using external images en masse, enough that 8 would be co-scheduled on a node, we can address that then).

    If this change is not merged, what we will do for the moment is limit per-node concurrency, not change the setting. This is why I suggested the change: it would stabilize the runs without requiring per-node concurrency limits. Changing this setting is not currently an option. We may be able to revisit it in the future.

  • 🇪🇸 Spain fjgarlin

    fjgarlin → changed the visibility of the branch 3419805-fix-io-problems to hidden.

  • 🇫🇷 France andypost

    I merged to dev; let's see if all images are built: https://git.drupalcode.org/project/drupalci_environments/-/jobs/1408250

  • 🇫🇷 France andypost

    The current build system requires changes in the Dockerfile to rebuild automatically, so I updated the last commit to dev with https://git.drupalcode.org/project/drupalci_environments/-/commit/cad6b4...

  • 🇫🇷 France andypost

    Tuned the outdated repos, and now all images are pushed.

    Ref: https://git.drupalcode.org/project/drupalci_environments/-/jobs/1418214

    The images only need to install netcat-traditional and psmisc, so I disabled all other repos via sed.

  • 🇪🇸 Spain fjgarlin

    Tested on core D11 (📌 [ignore] Test dev images, Closed: works as designed) and core D7 (📌 [ignore] Test dev images, Closed: works as designed) with the :dev images, and everything seems correct.

  • Status changed to Fixed 2 months ago
  • 🇫🇷 France andypost

    Thanks to everyone involved; production images are published.

  • Automatically closed - issue fixed for 2 weeks with no activity.
