Queue confusion on replicated databases (auto_increment_offset)

Created on 14 November 2019, over 5 years ago
Updated 25 December 2021, over 3 years ago

We've been seeing an issue in our production environment where cache tags are not being cleared from the Varnish server when using the Late Runtime Processor.

For example, when the tags to be cleared are as follows:
entity_types node:21604 node_list content_moderation_state:2339 content_moderation_state_list

on our production environment, it would reliably clear these tags instead:
entity_types content_moderation_state:2339 node_list content_moderation_state:2339 content_moderation_state_list

So the node:21604 tag is missing, replaced by a duplicate of the content_moderation_state:2339 tag.

This environment is running on a Galera database cluster. One of the tricks necessary for multi-master MySQL replication is to change the behaviour of AUTO_INCREMENT:

From https://mariadb.com/kb/en/library/mariadb-galera-cluster-known-limitations/

Do not rely on auto-increment values to be sequential. Galera uses a mechanism based on autoincrement increment to produce unique non-conflicting sequences, so on every single node the sequence will have gaps.
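
In practice, on an N-node cluster each node hands out every Nth value, starting from its own offset. A minimal sketch of the resulting sequences (the node count and offsets below are hypothetical, purely for illustration):

    // Hypothetical 3-node cluster: auto_increment_increment = 3 and
    // auto_increment_offset = the node's number (1, 2 or 3).
    $increment = 3;
    foreach ([1, 2, 3] as $offset) {
      $ids = [];
      for ($n = 0; $n < 5; $n++) {
        // The nth AUTO_INCREMENT value generated on this node.
        $ids[] = $offset + $n * $increment;
      }
      echo "node $offset: " . implode(', ', $ids) . PHP_EOL;
    }
    // node 1: 1, 4, 7, 10, 13
    // node 2: 2, 5, 8, 11, 14
    // node 3: 3, 6, 9, 12, 15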

In the method DatabaseQueue::createItemMultiple, there is an assumption that the IDs generated by the database will be sequential:

      // A multiple row-insert doesn't give back all the individual IDs, so
      // calculate them back by applying subtraction.
      for ($i = 1; $i <= count($records); $i++) {
        $item_ids[] = $id;
        $id++;
      }

This causes the database IDs recorded in the TxBuffer to be incorrect, so the wrong object gets loaded from the buffer in QueueService::claim(), leading to the tag confusion described above.
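
A minimal sketch of the failure mode, using hypothetical IDs and assuming auto_increment_increment = 3 on the node handling the insert:

    // The multi-row INSERT reports only the first generated ID, e.g. 100,
    // and createItemMultiple() reconstructs the rest by adding 1 per row.
    $id = 100;
    $assumed_ids = [];
    for ($i = 1; $i <= 3; $i++) {
      $assumed_ids[] = $id;
      $id++;
    }
    // Assumed IDs:                      [100, 101, 102]
    // IDs actually written (step of 3): [100, 103, 106]
    // The TxBuffer records the assumed IDs, so when QueueService::claim()
    // resolves a claimed item back to its buffered object it can pick the
    // wrong one -- which is how node:21604 ends up swapped for a duplicate
    // content_moderation_state:2339 in the example above.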

🐛 Bug report

Status: Needs work
Version: 3.0
Component: Code
Created by: 🇬🇧 grahamC (Oxford, UK)

Merge Requests

Comments & Activities

  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 5.6
    last update over 1 year ago
    Patch Failed to Apply
  • 🇺🇸United States bpizzillo

    Re-rolling after PHPCS changes.

  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 5.6
    last update over 1 year ago
    124 pass, 89 fail
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 5.7
    last update over 1 year ago
    580 pass, 8 fail
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.5 + Environment: PHP 7.4 & MySQL 5.7
    last update over 1 year ago
    580 pass, 8 fail
  • Status changed to Needs review over 1 year ago
  • Status changed to Needs work over 1 year ago
  • 🇺🇸United States bpizzillo

    Previous patch hardcoded the table name and used $this->connection->query. For some reason this broke a large number of tests. Using static::TABLE_NAME and $this->connection->insert does not seem to throw errors. Let's see if we can get a good build.

  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 5.7
    last update over 1 year ago
    610 pass
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 5.6
    last update over 1 year ago
    124 pass, 89 fail
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 7.4 & MySQL 5.7
    last update over 1 year ago
    610 pass
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 5.7
    last update over 1 year ago
    610 pass
  • 🇺🇸United States bpizzillo

    Accidentally left in old use statement from previous patch.

  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 7.4 & MySQL 5.7
    last update over 1 year ago
    610 pass
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 5.7
    last update over 1 year ago
    610 pass
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 5.7
    last update over 1 year ago
    610 pass
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.5 + Environment: PHP 7.4 & MySQL 5.7
    last update over 1 year ago
    610 pass
  • Status changed to Needs review over 1 year ago
  • Status changed to RTBC over 1 year ago
  • 🇩🇪Germany marvin_B8

    Looks pretty good so far, I just noticed the following things:

    - Unsetting the transaction should not be necessary (https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21Entity%21...).
    - The logic for the ID could be reversed.

    So that it looks something like this:

      public function createItemMultiple(array $items) {
        // Wrap these in our own loop so we get the right ID values.
        $ids = [];
        $now = time();
        $transaction = $this->connection->startTransaction();
        try {
          foreach ($items as $data) {
            $id = $this->connection->insert(static::TABLE_NAME)
              ->fields([
                'data' => serialize($data),
                'created' => $now,
              ])->execute();
    
            if (empty($id)) {
              // This will trigger a rollback of the transaction.
              throw new \Exception('Inserted item did not return an ID');
            }
    
            $ids[] = (int) $id;
          }
        }
        catch (\Exception $e) {
          $transaction->rollBack();
          return FALSE;
        }
    
        return $ids;
      }
    

    And if it's a performance problem, we can also think about putting the whole thing in a different queue.

  • 🇺🇸United States davidwbarratt

    I accidentally created a duplicate of this issue at 🐛 Late Runtime Processor purges incorrect tags Active. This problem seems exceptionally bad with SQLite.

  • 🇺🇸United States davidwbarratt

    I tested the patch and it does not fix the issue I had in 🐛 Late Runtime Processor purges incorrect tags Active so I'm re-opening it.

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    @davidwbarratt If the patch does not fix the bug described by this issue, the status should be changed.

  • Status changed to Needs work 9 months ago
  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    Let's create a merge request, now that patches are no longer tested.

  • 🇺🇸United States davidwbarratt

    To clarify, the patch clearly fixes a problem, but I'm not sure whether it's the problem I'm having in 🐛 Late Runtime Processor purges incorrect tags Active, or whether that's the same problem described in this issue. From the comment thread it looks like the patch does fix the issue for some folks, so it's possible there is more than one problem and the patch isn't a comprehensive solution.

  • 🇺🇸United States bkline Virginia

    @davidwbarratt

    This problem seems exceptionally bad with SQLite.

    That would seem to imply that you are able to reproduce the unwanted behavior with this patch on databases other than SQLite (just not as frequently). Can you confirm that this is true?

  • 🇺🇸United States davidwbarratt

    I added repro steps to 🐛 Late Runtime Processor purges incorrect tags Active. It happens every time I use the Late Runtime Processor; if I use the drush or cron processor the issue goes away.

    I didn't test it on any other database, but I am using SQLite. I assumed that since the problem happens consistently it could be a clue, but it might not be.

    I figured this out because I was writing a Cloudflare Worker that receives the purge requests and I noticed it was missing a tag. Upon further investigation, the tag isn't being sent.

  • 🇺🇸United States davidwbarratt

    I did try switching from mod_php to php-fpm. I thought it might help, since the former holds on to the response and perhaps I was hitting a timeout or something, but it didn't make a difference.

  • Merge request !47: Add to return array → (Open) created by davidwbarratt
  • 🇺🇸United States davidwbarratt

    I have fixed it!

    [DEBUG] [purge] [2024-08-23T22:32:11] purger_cloudflare_worker_0b7da3a883: Received Purge Request for node_list, node_list:article, node:49 | uid: 1 | request-uri: http://localhost:8888/node/49/edit?destination=%2Fadmin%2Fcontent | refer: http://localhost:8888/node/49/edit?destination=/admin/content | ip:  127.0.0.1 | link:
    [DEBUG] [purge] [2024-08-23T22:32:11] purger_cloudflare_worker_0b7da3a883: Executing purge request for node_list, node_list:article, node:49 | uid: 1 | request-uri: http://localhost:8888/node/49/edit?destination=%2Fadmin%2Fcontent | refer: http://localhost:8888/node/49/edit?destination=/admin/content | ip:  127.0.0.1 | link:
    

    The code basically relies on the indexing of the array to always match. Instead of making that assumption, we'll just use whatever is passed in and ensure the indexing matches up.
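
    A rough sketch of that idea (not the actual merge request code): key the returned IDs by the keys of the incoming $items array, assuming a single-row createItem() that returns the new ID or FALSE on failure:

      public function createItemMultiple(array $items) {
        // Sketch only: keep the caller's keys so items and their queue IDs
        // cannot drift out of alignment.
        $item_ids = [];
        foreach ($items as $key => $data) {
          $id = $this->createItem($data);
          if ($id === FALSE) {
            return FALSE;
          }
          $item_ids[$key] = $id;
        }
        return $item_ids;
      }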

  • Status changed to Needs review 9 months ago
  • 🇺🇸United States davidwbarratt

    davidwbarratt changed the visibility of the branch 3094343-queue-confusion-on-replicated-databases to hidden.

  • 🇺🇸United States davidwbarratt

    davidwbarratt changed the visibility of the branch 3094343-queue-confusion-on-replicated-databases to active.

  • Pipeline finished with Failed
    9 months ago
    Total: 2998s
    #263183
  • Pipeline finished with Success
    8 months ago
    Total: 1336s
    #263761
  • First commit to issue fork.
  • The code basically relies on the indexing of the array to always match. Instead of making that assumption, we'll just use whatever is passed in and ensure the indexing matches up.

    While it's certainly safer to ensure the keys align, looking at the code in QueueService::commitAdding(), I can't see how the keys would be anything other than sequential 0-indexed, since the array_chunk() call does not preserve keys. Nevertheless, if it indeed addresses an issue, then the keys need to be preserved in all implementations of QueueInterface::createItemMultiple(), so I've made the respective changes to MemoryQueue and QueueBase.

    I looked at how to write some tests for this, but I'm not sure it's possible to change the autoincrement setting of the database in tests, and otherwise mocking the database service is a huge lift. Leaving at Needs tests for now.
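
    For reference, the array_chunk() behaviour mentioned above; its third parameter, $preserve_keys, defaults to FALSE:

      $chunks = array_chunk(['a' => 1, 'b' => 2, 'c' => 3], 2);
      // [[0 => 1, 1 => 2], [0 => 3]] -- chunk keys are renumbered from 0.
      $chunks = array_chunk(['a' => 1, 'b' => 2, 'c' => 3], 2, TRUE);
      // [['a' => 1, 'b' => 2], ['c' => 3]] -- original keys are kept.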

  • Pipeline finished with Success
    about 2 months ago
    Total: 1297s
    #445818
  • One note about this solution: since the single multi-value insert query is replaced by multiple insert queries, the performance gain from creating multiple items at once is lost. I don't have an alternative solution for this. I thought about doing a select query to get the last count($items) IDs right after the insert query, but to do that we would need to be sure that there were no other insertions in between, which would mean using a lock and makes things much more complicated.
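
    For illustration only, a rough sketch of that lock-based idea (hypothetical, not proposed code; it assumes Drupal's lock service is enough to serialize all writers to the queue table, that the lock name is free to choose, and that the ID column is item_id as in core's queue schema), which shows why it gets complicated:

      // Hypothetical: keep the single multi-value INSERT, but serialize
      // writers and read the generated IDs back afterwards.
      if (!\Drupal::lock()->acquire('purge_queue_insert', 5)) {
        return FALSE;
      }
      try {
        $query = $this->connection->insert(static::TABLE_NAME)
          ->fields(['data', 'created']);
        foreach ($items as $data) {
          $query->values([serialize($data), time()]);
        }
        $query->execute();
        // Read back the last count($items) IDs; only safe because the lock
        // guarantees nothing else inserted in between.
        $item_ids = $this->connection->select(static::TABLE_NAME, 'q')
          ->fields('q', ['item_id'])
          ->orderBy('q.item_id', 'DESC')
          ->range(0, count($items))
          ->execute()
          ->fetchCol();
        return array_map('intval', array_reverse($item_ids));
      }
      finally {
        \Drupal::lock()->release('purge_queue_insert');
      }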
