- Issue created by @fjgarlin
- 🇺🇸 United States cmlara
Added a bash script that allows reproducing this locally, outside of GitLab. Nothing about GitLab is special as it relates to this issue; the only variable is the K8s node's fs.aio-max-nr value. Just periodically follow the logs of each container (or remove the null redirect so the logs output to the console) to see if the error occurs.
On a non-tuned laptop with several other containers running (including 3 MariaDB containers) I managed to reach the 7th container (10 database containers total) before I saw the error.
One should be able to simulate higher concurrency by reducing sysctl fs.aio-max-nr (each container will then take up a higher percentage of the max limit).
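To watch how close a host is to the limit while the script runs, something like the following could be used. This is a minimal sketch that assumes a Linux host exposing the two sysctls as files under /proc/sys/fs; the function names are hypothetical and not part of the script mentioned above.

```python
from pathlib import Path


def read_aio_usage(proc_fs_dir="/proc/sys/fs"):
    """Return (aio_nr, aio_max_nr) as integers.

    Assumes the kernel exposes fs.aio-nr and fs.aio-max-nr as the files
    aio-nr and aio-max-nr inside proc_fs_dir.
    """
    base = Path(proc_fs_dir)
    aio_nr = int((base / "aio-nr").read_text().strip())
    aio_max_nr = int((base / "aio-max-nr").read_text().strip())
    return aio_nr, aio_max_nr


def usage_percent(aio_nr, aio_max_nr):
    """Share of the AIO slot limit currently in use, in percent."""
    return 100.0 * aio_nr / aio_max_nr
```

Calling read_aio_usage() before and after each container starts shows how many slots that container took; lowering fs.aio-max-nr makes the same container count consume a larger share of the limit.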
Using the first results I could pull up on Slack (from March 2023), the largest runner in the fleet at the time was an m4.10xlarge, which has 40 vCPUs.
Assuming I understand the reservation system correctly, assuming the standard (gitlab_template) configuration of 2 CPU reservations, and assuming an absolute worst-case scenario where all pods are in the PHPUnit stage (where we have the build, helper, php, and mysql containers, along with any other containers a contrib project may have added), 20 SQL container instances may be running on a single physical K8s node (host/worker) at a time, along with the other ancillary containers that consume AIO slots.
That gives a (rough) minimal target for how many instances need to be able to launch for this to be considered 'resolved'.
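The estimate above is simple division; as a sketch, using only the numbers from this comment (40 vCPUs, 2 CPUs reserved per pod, one SQL container per pod) and a hypothetical helper name:

```python
def worst_case_sql_containers(node_vcpus, cpu_reservation_per_pod,
                              sql_containers_per_pod=1):
    """Upper bound on SQL containers co-scheduled on one node.

    Assumes CPU reservation is the only scheduling constraint and that
    every pod on the node is in the PHPUnit stage.
    """
    pods_per_node = node_vcpus // cpu_reservation_per_pod
    return pods_per_node * sql_containers_per_pod


# m4.10xlarge (40 vCPUs) with 2 CPU reservations per pod
print(worst_case_sql_containers(40, 2))  # 20
```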
- First commit to issue fork.
- 🇫🇷 France andypost
Merged, and opened a core MR to test it: https://git.drupalcode.org/project/drupal/-/merge_requests/6586
- Status changed to Needs review
8:45pm 13 February 2024
- 🇺🇸 United States cmlara
Installing MariaDB/MySQL system tables in '/var/lib/mysql/' ...
io_setup(8192) returned -11
2024-02-13 21:02:08 0 [Warning] InnoDB: Linux Native AIO disabled.
The config change was loaded into the MySQL config for MariaDB.
I was able to run the script above (when set to the mariadb-10.6:dev test image) with all 15 containers starting.
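When following the container logs, the messages quoted in this thread can serve as failure signatures. The sketch below is an assumption about how one might scan saved log output for them; the pattern list only covers the exact messages reported in this issue, and the function name is hypothetical.

```python
import re

# Signatures taken from the log excerpts quoted in this issue.
AIO_FAILURE_PATTERNS = [
    re.compile(r"io_setup\(\d*\) (returned -11|failed with EAGAIN)"),
    re.compile(r"Linux Native AIO disabled"),
    re.compile(r"Cannot initialize AIO sub-system"),
]


def find_aio_failures(log_lines):
    """Return the log lines that match any known AIO failure signature."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(p.search(line) for p in AIO_FAILURE_PATTERNS)
    ]
```

An empty result for every container is what a successful run of the script should look like.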
- 🇺🇸 United States nnewton
We are starting to hit this on core gitlabci as we are trying to consolidate runs on nodes. I would suggest we globally disable AIO for these containers (mysql/mariadb). There are solutions on the node side, but they are ugly and won't be portable between testing environments.
On our larger nodes we can reproduce this fairly consistently while watching aio-nr.
1 Job - 1 Node
root@runner-s4yvuuu9g-project-78834-concurrent-0-hpbibdd6:/var/www/html# sysctl -a 2> /dev/null | grep fs.aio
fs.aio-max-nr = 65536
fs.aio-nr = 8805
4 Jobs - 1 Node
root@runner-s4yvuuu9g-project-78834-concurrent-3-qg6kwxrj:/var/www/html# sysctl -a 2> /dev/null | grep aio
fs.aio-max-nr = 65536
fs.aio-nr = 35220
And if we push 8 jobs to double that, the 8th will fail with:
[ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts.
[service:drupalci/mysql-5.7-database] 2024-04-17T21:58:10.746455040Z 2024-04-17T21:58:10.746096Z 0 [Note] InnoDB: You can disable Linux Native AIO by setting innodb_use_native_aio = 0 in my.cnf
[service:drupalci/mysql-5.7-database] 2024-04-17T21:58:10.746456647Z 2024-04-17T21:58:10.746156Z 0 [ERROR] InnoDB: Cannot initialize AIO sub-system
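The numbers above line up with a simple capacity calculation. As a rough sketch, assuming each job consumes the same number of AIO slots as the single-job measurement and nothing else on the node uses AIO (the helper name is hypothetical):

```python
AIO_MAX_NR = 65536        # fs.aio-max-nr reported on the node
AIO_NR_ONE_JOB = 8805     # fs.aio-nr measured with 1 job
AIO_NR_FOUR_JOBS = 35220  # fs.aio-nr measured with 4 jobs


def max_jobs(aio_max_nr, slots_per_job):
    """Number of whole jobs that fit under the AIO slot limit."""
    return aio_max_nr // slots_per_job


# Usage scales linearly in the two measurements: 4 * 8805 == 35220.
assert AIO_NR_ONE_JOB * 4 == AIO_NR_FOUR_JOBS

print(max_jobs(AIO_MAX_NR, AIO_NR_ONE_JOB))  # 7
```

Seven jobs need 61635 slots and fit; an eighth would need 70440, which exceeds 65536 and matches the 8th job failing.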
- 🇫🇷 France andypost
Maybe we just need to increase this value
fs.aio-max-nr=200000
as most distros do?
- 🇺🇸 United States nnewton
Which distros have this set to something other than 65536? Debian/RHEL/AL2 all seem to have this set to the default of 65536. Either way, our (and everyone else's) EKS/AL2-based clusters will have this set to 65536. Modifying this would require a custom launch template or marking this sysctl as unsafe but allowed at the kubelet level. I would advise this be changed at the container level, as that is a far cleaner solution and would resolve this portably between clusters.
- 🇪🇸 Spain fjgarlin
Based on #9 I added the same setting to all other mariadb and MySQL images: https://git.drupalcode.org/project/drupalci_environments/-/merge_request...
- 🇺🇸 United States cmlara
"Either way, our (and everyone elses) EKS/AL2 based clusters will have this set to 65536."
The key point is that this is a default value, not necessarily what everyone runs with.
Changing these to match the purpose of the environment is to be expected. Defaults are just that, defaults; a cluster manager is expected to manage their cluster to meet the needs of the design.
Changing the Drupal images is a start; however, that does nothing for projects that don't run the DrupalCI images (few if any at the moment), and not all containers offer an easy environment variable to disable this feature (for example, I couldn't find one in the wodby or Docker Hub mariadb images). Few may use these right now, however it's not impossible that the gitlab_templates project could move away from drupalci images if justification is provided.
It makes a lot of sense, in my opinion, for D.O. infra to tweak the environment to perform as the community is expected to use it.
- 🇪🇸 Spain fjgarlin
I agree that it might make sense to change at infra level, but I also think that if we have a quick win available within the images that we are using right now (ie: this MR), we should go ahead and do it.
- 🇺🇸 United States nnewton
The defaults discussion was due to someone suggesting that distros were changing the default, which they are not.
Obviously we change numerous default settings in drupal-infra. As I mentioned in my previous comment, this setting is very difficult to change in a manageable/secure way on an EKS cluster in our config management and we won't be doing so currently. We are working desperately to reduce maintenance overhead and this would increase it for no clear advantage (if people start using external images en masse, enough that 8 would be co-scheduled on a node, we can address that then).
If this change is not merged, what we will do at the moment is limit per-node concurrency, not change the setting. This is why I suggested the change: it would stabilize the runs and not require per-node concurrency limits. Changing this setting is not currently an option. We may be able to revisit it in the future.
- 🇪🇸 Spain fjgarlin
fjgarlin changed the visibility of the branch 3419805-fix-io-problems to hidden.
- andypost committed 12da3346 on dev, authored by fjgarlin
Issue #3419805 by fjgarlin, cmlara, nnewton: Disable aio in mariadb and...
- 🇫🇷 France andypost
I merged to dev; let's see if all images build: https://git.drupalcode.org/project/drupalci_environments/-/jobs/1408250
- andypost committed cad6b497 on dev, authored by fjgarlin
Issue #3419805 by fjgarlin, andypost, cmlara, nnewton: Disable aio in...
- 🇫🇷 France andypost
The current build system requires changes in the Dockerfile to automatically rebuild, so I updated the last commit to dev with https://git.drupalcode.org/project/drupalci_environments/-/commit/cad6b4...
- andypost committed 1f233b8a on dev, authored by fjgarlin
Issue #3419805 by fjgarlin, andypost, cmlara, nnewton: Disable aio in...
- andypost committed 993ad0a1 on dev, authored by fjgarlin
Issue #3419805 by fjgarlin, andypost, cmlara, nnewton: Disable aio in...
- 🇫🇷 France andypost
Tuned the outdated repos and now all images are pushed,
ref https://git.drupalcode.org/project/drupalci_environments/-/jobs/1418214
The images just need to install netcat-traditional and psmisc, so I disabled all other repos via sed.
- 🇪🇸 Spain fjgarlin
Tested on core D11 ([ignore] Test dev images, Closed: works as designed) and core D7 ([ignore] Test dev images, Closed: works as designed) with the :dev images, and everything seems correct.
- andypost committed 88466756 on production, authored by fjgarlin
Issue #3419805 by fjgarlin, andypost, cmlara, nnewton: Disable aio in...
- Status changed to Fixed
4:43pm 25 April 2024
- 🇫🇷 France andypost
Thanks to everyone involved, the production images are published.
- 🇫🇷 France andypost
btw, just got this for D7: https://git.drupalcode.org/issue/drupal-3443234/-/jobs/1432748
- Automatically closed - issue fixed for 2 weeks with no activity.
- 🇫🇷 France andypost
Testing a new approach: https://git.drupalcode.org/project/drupalci_environments/-/commit/de898f...