Add CI_DEBUG_SERVICES to investigate DB failures

Created on 12 January 2024, over 1 year ago

Clone from πŸ› Investigate random failures in nightwatch and phpunit Needs work for contrib.

Problem/Motivation

It's been reported both in core and contrib that some test jobs fail due to DB not being available (?)

Core
https://git.drupalcode.org/issue/drupal-1959304/-/jobs/573143

Contrib
https://git.drupalcode.org/project/cdn/-/pipelines/75330

Steps to reproduce

Proposed resolution

Add CI_DEBUG_SERVICES=true to .gitlab-ci.yml so we can get detailed information on what's happening in the database from a test that will fail.
Revert the change as soon as we find one job that will fail due to this

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

✨ Feature request
Status

Active

Component

gitlab-ci

Created by

πŸ‡ͺπŸ‡ΈSpain fjgarlin

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @fjgarlin
  • Merge request !107Issue #3414252 Add debug. β†’ (Closed) created by fjgarlin
  • Status changed to Needs review over 1 year ago
  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    I think that's all that's needed. Please review.

  • πŸ‡ΊπŸ‡ΈUnited States cmlara

    Enabling CI_DEBUG_SERVICES may result in masked variables being revealed. When CI_DEBUG_SERVICES is enabled, service container logs and the CI job’s logs are streamed to the job’s trace log concurrently, which makes it possible for a service container log to be inserted inside a job’s masked log. This would thwart the variable masking mechanism and result in the masked variable being revealed.

    https://docs.gitlab.com/ee/ci/services/

    From a security operations standpoint this probably shouldn’t be merged into the mainline and only used as an optional opt in.

    We would IMHO be better off finding a project with a high number of fails (I suggest we ask acquia_migrate if they would let us use an issue fork for testing) and have that project test with the variables rather than making this globally enabled.

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    The above suggestion would be my preferred choice too.

    If we can find a project that runs into the issue we could just do an MR against the project instead of rolling this to all contrib.

  • πŸ‡ΊπŸ‡ΈUnited States dww

    I was seeing it with address (which I maintain) and inline_entity_form (which I might soon be another maintainer of). I’m happy to have an MR for address (at least) to try testing for this. I can warn bojanz if we think it’s gonna be a ton of issue noise…

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    That would be great. It's just adding the following to the ".gitlab-ci.yml" variables section:

    CI_DEBUG_SERVICES: "true"
    

    Then it'd be a matter of running pipelines there multiple times to see if we can find a failed run, and that one should give us more information about what's going on.

  • πŸ‡ΊπŸ‡ΈUnited States dww

    Just got hit by this again:

    https://git.drupalcode.org/project/address/-/pipelines/88194
    https://git.drupalcode.org/project/address/-/pipelines/88195
    https://git.drupalcode.org/project/address/-/pipelines/88196

    But then I remembered I needed to turn on this extra variable, pushed that to our 2.x branch, and now the pipeline is failing in a totally different, unrelated way:

    https://git.drupalcode.org/project/address/-/pipelines/88205

    Instead of failing to connect to the DB, it's now complaining about 5 cases of this:

    1) Drupal\Tests\address\FunctionalJavascript\AddressDefaultWidgetTest::testCountries
    TypeError: Behat\Mink\Element\TraversableElement::findButton(): Argument #1 ($locator) must be of type string, Drupal\Core\StringTranslation\TranslatableMarkup given, called in /builds/project/address/web/core/tests/Drupal/Tests/WebAssert.php on line 143
    /builds/project/address/vendor/behat/mink/src/Element/TraversableElement.php:97
    /builds/project/address/web/core/tests/Drupal/Tests/WebAssert.php:143
    /builds/project/address/web/core/tests/Drupal/Tests/UiHelperTrait.php:78
    /builds/project/address/tests/src/FunctionalJavascript/AddressDefaultWidgetTest.php:193
    /builds/project/address/vendor/phpunit/phpunit/src/Framework/TestResult.php:728
    

    None of that fails locally, and the previous commit I pushed had nothing to do with the tests. I have no idea why these are failing like this now. πŸ˜… But we don't want t() in those tests, anyway, so I opened πŸ“Œ Remove t() from tests that are not testing translation Fixed to clean that out. Hopefully the pipelines on that MR are now green. TBD.

  • πŸ‡ΊπŸ‡ΈUnited States dww

    Okay, here's a real failure from a job with CI_DEBUG_SERVICES enabled πŸŽ‰

    https://git.drupalcode.org/project/address/-/jobs/752891

    Logs are full of this:

    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:53.851198940Z netcat: connect to localhost (127.0.0.1) port 3306 (tcp) failed: Connection refused
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:53.851225416Z netcat: connect to localhost (::1) port 3306 (tcp) failed: Cannot assign requested address
    

    Here's the real culprit:

    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451301665Z 2024-02-05T21:59:23.451125Z 0 [ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts.
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451322612Z 2024-02-05T21:59:23.451154Z 0 [Note] InnoDB: You can disable Linux Native AIO by setting innodb_use_native_aio = 0 in my.cnf
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451325887Z 2024-02-05T21:59:23.451266Z 0 [ERROR] InnoDB: Cannot initialize AIO sub-system
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451328497Z 2024-02-05T21:59:23.451274Z 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451331115Z 2024-02-05T21:59:23.451281Z 0 [ERROR] Plugin 'InnoDB' init function returned error.
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451333559Z 2024-02-05T21:59:23.451285Z 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451338953Z 2024-02-05T21:59:23.451289Z 0 [ERROR] Failed to initialize builtin plugins.
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451341452Z 2024-02-05T21:59:23.451292Z 0 [ERROR] Aborting
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451344050Z 
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451346775Z 2024-02-05T21:59:23.451304Z 0 [Note] Binlog end
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451465428Z 2024-02-05T21:59:23.451358Z 0 [Note] Shutting down plugin 'CSV'
    [service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.453778406Z 2024-02-05T21:59:23.453665Z 0 [Note] /usr/sbin/mysqld: Shutdown complete
    

    However, it's not clear why that is happening, just from these logs. Wonder if there's other output being saved somewhere that might be useful. Hopefully @fjgarlin has a chance to review this and knows where to look for the underlying problem.

  • πŸ‡ΊπŸ‡ΈUnited States cmlara

    https://stackoverflow.com/questions/45467988/docker-container-wont-start...

    Sounds like hitting sysctl limits on the docker HOST.

    I haven't confirmed the value however I've seen one comment that says the LInux default of 65536 is only acceptable for a few MYSQLD's at a time on a single machine.

    That would explain why we saw it less occurance when acquia_migrate increased their CPU reservation count, it would be less likely all the instances would be on the same runner.

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Thanks for the logs @dww!

    I've pinged @dimkritsotakis in slack (https://drupal.slack.com/archives/CGKLP028K/p1707211277736559?thread_ts=...) as he's more on top of these issues than I am.

  • Status changed to Closed: won't fix about 1 year ago
  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Given that we can add the variables to any MR or project and that we seem to have identified the cause of the issues (which can be fixed in "drupalci_environments"), I am going to close this issue as I don't think we can do anything from this project.

    If I am wrong thinking that, please reopen.

  • πŸ‡¬πŸ‡·Greece dimitriskr

    https://hub.docker.com/r/drupalci/mariadb-10.6 includes a fix for the IO problem.

    Whoever has this problem in contrib, can you change to that database and see if it still occurs?

  • Status changed to Active about 1 year ago
  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    I think we should manually test #14.

    @fjgarlin & @dimitriskr, can either of you provide guidance on how to do that? πŸ™

    (I'm getting hit by this pretty much daily in https://www.drupal.org/project/acquia_migrate β†’ .)

  • πŸ‡¬πŸ‡·Greece dimitriskr

    You can override the target_db_type and version variables

    automatic_updates did this here and haven't heard any complaints for almost 2 weeks. That will use MariaDB 10.6 in your tests

  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    @dimitriskr Thanks! Tried that, but it caused:

    InvalidArgumentException: There is no database connection so no tests can be run. You must provide a SIMPLETEST_DB environment variable to run PHPUnit based functional tests outside of run-tests.sh.
    

    πŸ˜…

  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    Core and contrib are slightly different. What's your MR @wim-leers?

  • Status changed to Closed: won't fix about 1 year ago
  • πŸ‡ͺπŸ‡ΈSpain fjgarlin

    I think I'm closing this issue as this is not a feature that we will want to push to all contrib.

    People arriving here will find out that they can add CI_DEBUG_SERVICES: "true" to their own templates, inside the variables section, to get additional debugging, but this should be done on a need basis only.

  • πŸ‡ΊπŸ‡¦Ukraine Taran2L Lviv

    btw, I think when CI_DEBUG_SERVICES: "true", access to the job logs is protected, for example I cannot access logs of this pipeline https://git.drupalcode.org/issue/range-3437346/-/pipelines/154088

    Whenever I want to see any details about any job I'm getting

    The current user is not authorized to access the job log.

  • πŸ‡¬πŸ‡§United Kingdom jonathan1055

    @taran2l Did you have any more to add regarding #20 and the not having access to the logs when CI_DEBUG_SERVICES is on? Is it only maintainers of the project that can read the log? We could add a line about this in our documentation page about this.

    Also MR107 can be closed.

Production build 0.71.5 2024