Faulty permanent config cache has been set to the cache backend on failed sql server connection

Created on 6 November 2020, about 4 years ago
Updated 27 November 2023, about 1 year ago

Problem/Motivation

in core/lib/Drupal/Core/Config/DatabaseStorage.php

public function readMultiple(array $names) {
    $list = [];
    try {
      $list = $this->connection->query('SELECT name, data FROM {' . $this->connection->escapeTable($this->table) . '} WHERE collection = :collection AND name IN ( :names[] )', [':collection' => $this->collection, ':names[]' => $names], $this->options)->fetchAllKeyed();
      foreach ($list as &$data) {
        $data = $this->decode($data);
      }
    }
    catch (\Exception $e) {
      // If we attempt a read without actually having the database or the table
      // available, just return an empty array so the caller can handle it.
    }
    return $list;
  }

if there is any connection failure, exception will be catched, and an empty value will be returned.
According to the comment, the caller should handle it

while in
core/lib/Drupal/Core/Config/CachedStorage.php -> readMultiple

      $list = $this->storage->readMultiple($names_to_get);
      // Cache configuration objects that were loaded from the storage, cache
      // missing configuration objects as an explicit FALSE.
      $items = [];
      foreach ($names_to_get as $name) {
        $data = isset($list[$name]) ? $list[$name] : FALSE;
        $data_to_return[$name] = $data;
        $items[$cache_keys_map[$name]] = ['data' => $data];
      }
      $this->cache->setMultiple($items);

looks like it just set the FALSE to the cache backend

This is here because config could be fetched before the database is setup and it should succeed - and tables might only be created for some config upon first write - so lack of a table is not a failure.

However the catch is not just catching table not exists failures but also connection failures. So it can happen that the connection is lost and instead of an exception the code assumes no table and therefore gives a successful empty response even if there is configuration data in the database. This then gets injected into cache so even once the connection to the database is restored, the configuration is always returned empty

Steps to reproduce

Can be reproduced in a test by creating an invalid table with missing columns that will force an exception when calling the method. (Note: due to SQLite quoting this won’t work with a SQLite database)

Proposed resolution

Do not set empty config to the cache backend when there is an exception on getting the data, unless that exception is because the table does not exist. Instead, throw the exception to end the request.

We can detect table not exists failures by subsequently checking for the table in the exception. If it is not there we have confirmed it is safe to return empty. If it is indeed there then we can assume the query failed in some other way that needs rethrowing. In a connection loss instance this check for table will itself detect the loss of connection and throw (e.g. MySQL has gone away)

MR used

MR 2839 = 11.x

Remaining tasks

Review

User interface changes

NA

API changes

NA

Data model changes

NA

Release notes snippet

NA

🐛 Bug report
Status

Fixed

Version

10.1

Component
Configuration 

Last updated 3 days ago

Created by

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • 🇺🇸United States smustgrave

    This issue is being reviewed by the kind folks in Slack, #needs-review-queue-initiative. We are working to keep the size of Needs Review queue [2700+ issues] to around 400 (1 month or less), following Review a patch or merge request as a guide.

    Reviewing MR #2839. Hiding files to avoid confusion.

    Code wise everything looks fine

    But when I ran the tests locally without the fix to DataStorage.php they still passed, when they should fail to show the bug.

    Thanks!

  • 🇬🇧United Kingdom Driskell

    @smustgrave

    I just ran the tests without the fix and received a failure. I then applied the fix and it worked again.
    However, from a hunch, I hooked up SQLite as the test database, and yes it did not fail!

    Bit of debugging later...

    https://www.sqlite.org/lang_keywords.html
    > If a keyword in double quotes (ex: "key" or "glob") is used in a context where it cannot be resolved to an identifier but where a string literal is allowed, then the token is understood to be a string literal instead of an identifier.

    The query that is meant to fail because of a missing column actually now resolves to a "does this string equal this string" and is accepted as a string expression!
    > SELECT 1 FROM "test57957719"."config" WHERE "collection" = :collection AND "name" = :name LIMIT 0, 1

    Will have a think about a way to make the test fail properly on SQLite as well.

  • 🇬🇧United Kingdom Driskell

    OK, tracked the quoting to double quotes back to https://www.drupal.org/project/drupal/issues/3126658
    It seems not to have taken into account SQLite.
    I checked and the ANSI quoting in Pgsql and MySQL should be fine as that forces double quote as identifier and it will never be string literal.
    In SQLLite it actually is string literal if the column does not exist.

    I don't know but... I am wondering if the above needs to be a separate issue as that's in a different module. So I'll get a ticket raised.
    I've noticed there's some duplication of tests in the MR anyway so I'll push a cleaned up version - and I'll set the other ticket for SQLLite as blocking this one.

  • Status changed to Postponed almost 2 years ago
  • Status changed to Needs review almost 2 years ago
  • 🇬🇧United Kingdom Driskell

    @smustgrave

    I'm guessing the impact to all SQLite queries of a change to that would prevent this from being a bug fix in 9.5.

    However, perhaps there just needs to be an acceptance that this test won't function properly with SQLite until the other issue is fixed so maybe there's a decision to be made here - in that the test will properly work on MySQL so the automated tests will work fine. It only will cause a problem for running locally with SQLite. Just I feel this patch has saved me many issues and is hugely beneficial - our incidents relating to cache corruption in this area are eliminated so far.

    So perhaps the SQLite issue does not postpone this? So I'm U-turning as usual and returning to Needs review to get some thoughts.

  • 🇫🇷France andypost

    The workaround looks incomplete tableExists() making a query on exception, but when other process creating table it could be wrong assumption

  • 🇬🇧United Kingdom Driskell

    @andypost

    At the moment it fails silently and corrupts cache during normal day to day running. The scenario of a writer creating the table and causing a valid read to then throw exception is true, but I think that would only occur once, when the table doesn't yet exist, and only in the perfect race between read and the very first ever write to the table, and it would recover completely on the second request and never happen again. Unfortunately it's difficult to quantify the real impact of that scenario but it seems to be small, and much smaller than the current issue's impact.

    Technically speaking I think the correct fix is to detect the specific error of the table not existing. However, this might be complex across the different drivers and versions supported to detect the specific error message, and would require driver changes. If that was opted into it would be better I think as a post-task after this to get those driver implementation for throwing a specific exception for table missing, for use in this code, so we're never handling arbitrary exceptions and only handling the exception we intend to: Table does not exist. Arguably that would be easier to test too and wouldn't need the quoting change for SQLite.

    What are your thoughts?

  • 🇬🇧United Kingdom Driskell

    Not seen it before but found an issue that was adding the table exception checks that would also resolve this: https://www.drupal.org/project/drupal/issues/2371709 📌 Move the on-demand-table creation into the database API Needs work

  • 🇬🇧United Kingdom catch

    The workaround looks incomplete tableExists() making a query on exception, but when other process creating table it could be wrong assumption

    This is fine - the worst that happens is a cache miss, and it's how we handle it in other subsystems.

  • Status changed to Needs work almost 2 years ago
  • The Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

    Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.

    Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

  • 🇬🇧United Kingdom longwave UK

    I have a busy site that is experiencing occasional strange bugs where fields suddenly no longer exist (causing subsequent fatal errors in different places each time), but this is fixed on cache rebuild, and I am wondering if this is the culprit.

    I think the fix here looks good for the problem at hand but I wonder if there is a way of refactoring this so we explicitly create the table separately and then the service itself doesn't have to be concerned with catching exceptions at all; it can just pass them up to the caller. This should be a followup, though.]

  • Merge request !532211.x 3181013 faulty permanent config → (Closed) created by ericgsmith
  • Status changed to Needs review about 1 year ago
  • 🇳🇿New Zealand ericgsmith

    I suspect we also hit this issue recently. The symptoms were very similar to #37 - config status showed a list of config as in sync only - clearing the cache went back to showing no differences.

    Fixed CS issue reported in #36 and opened new MR to 11.x - setting back to needs review.

  • Status changed to Needs work about 1 year ago
  • 🇳🇿New Zealand ericgsmith

    Ah - setting back to needs work. As was indicated in #26 the test is failing on SQLite.

    Have triggered the additional pipelines for PostgreSQL and SQLite and it indeed is still failing form SQLite https://git.drupalcode.org/issue/drupal-3181013/-/pipelines/4703

  • 🇬🇧United Kingdom Driskell

    I am putting this back into Needs Review because a failing test is impossible for SQLite due to a known issue (see #3349286). The test system online is MySQL so reproduces it correctly. You cannot test this locally with a SQLite database, it won't fail.

    As for the other issues, #2371709 should be a follow on, it's a bit more involved. And hopefully at some point #3349286 can be looked at to fix the SQLite issue. But I don't feel either of these block this fix. The first improved it. The second is purely relating to local tests.

  • Status changed to Needs review about 1 year ago
  • 🇳🇿New Zealand ericgsmith

    Having also been meaning to come back to say we can almost certainly confirm this issue is what is responsible for the config shown as overridden.

    Further investigation showed multiple errors in the logs that mysql had gone away over about 1 minute. We can also see higher than expected numbers of evictions in memcache during this period from a combination of high traffic and our memcache cache bin size being lower than what the site needed. Immediately after the mysql gone away logs we start seeing warning of missing fields that imply the field does not exist, and as mentioned in #40 this was visible via drush config status and resolved after a cache clear.

    We are now running this patch in production now along side increasing the cache bin size of memcache.

  • 🇬🇧United Kingdom Driskell

    I've updated the MR to include a skip for SQLLite and linked it to the issue about the double quoting that prevents the test from working.

  • 🇬🇧United Kingdom Driskell

    Sorry - that last pushed commit was to fix the URL to be correct issue

  • Status changed to Needs work about 1 year ago
  • The Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

    This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

    Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

  • Status changed to Needs review about 1 year ago
  • 🇬🇧United Kingdom Driskell

    Resolve review queue bot issue

  • Status changed to Needs work about 1 year ago
  • 🇺🇸United States smustgrave

    Can the 9.5.x MR be closed since D9 is EOL please

    The other MR seems to have some test failures potentially.

  • 🇬🇧United Kingdom Driskell

    @smustgrave I rebased it to 11.x and updated the MR. Not sure how to update the other one so I kept my existing.
    It skips the test for SQLite since it will fail due to the linked issue. But it successfully fails without the fix on MariaDB / MySQL and passes after the patch.

  • Status changed to Needs review about 1 year ago
  • 🇬🇧United Kingdom Driskell

    Not sure the test failures are related.

  • Status changed to Needs work about 1 year ago
  • 🇺🇸United States smustgrave

    Glad switching targets didn't cause issues, usually changing that causes issues.

    Checking the issue summary

    I am not sure the exact fix yet, but in my opinion, we should not set a permanent faulty cache to the cache backend when there is an exception on getting the data

    Isn't a full solution, can the proposed solution be documented helps reviewers/committers.

    Added missing sections too, left TBD for sections that may apply but not familiar with issue to know myself. If not applicable just change to NA.

  • Status changed to Needs review about 1 year ago
  • 🇬🇧United Kingdom Driskell

    @smustgrave I updated issue summary. It was proposed resolution that was taken so I just dropped the not sure bit and I’ve updated other parts to reflect final status. Thanks for checking

  • 🇬🇧United Kingdom Driskell

    Added more detailed description of problem and solution to IS

  • Status changed to RTBC about 1 year ago
  • 🇺🇸United States smustgrave

    Thanks issue summary reads much better now!

  • Open in Jenkins → Open on Drupal.org →
    Environment: PHP 8.1 & MySQL 5.7
    last update about 1 year ago
    30,675 pass, 2 fail
    • catch committed d8b92d57 on 10.1.x
      Issue #3181013 by Driskell, alexpott, ericgsmith, Pan Lee, smustgrave,...
    • catch committed 562a13e2 on 10.2.x
      Issue #3181013 by Driskell, alexpott, ericgsmith, Pan Lee, smustgrave,...
    • catch committed 201fd312 on 11.x
      Issue #3181013 by Driskell, alexpott, ericgsmith, Pan Lee, smustgrave,...
  • Status changed to Fixed about 1 year ago
  • 🇬🇧United Kingdom catch

    Committed/pushed to 11.x, cherry-picked to 10.2.x and 10.1.x, thanks!

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024