Loading entity during save will cache old version

Created on 17 September 2024, 7 months ago

Problem/Motivation

When an entity is being saved, a well-timed load of the same entity can cause a stale version of it to be written to the persistent entity cache.

This can (and likely will) cause data loss.

Several existing tickets appear to describe the behaviour through which I ran into this issue, and a more realistic scenario might involve something like messages with tokens that will eventually trigger this bug.

Steps to reproduce

Complete version using user entity as example

  1. Create a new user; it should get ID 2 on a vanilla standard installation
  2. Have one terminal open running drush scr bugsaver.php
  3. Have a few terminals open running drush scr bugloader.php - even one instance will eventually hit the perfect timing, but it's significantly faster to run 3 to 6 instances
  4. Once the saver script prints out the mismatch, feel free to close the loader scripts
  5. Open the edit form for the new user as admin - the user name is now old
  6. (Option A) To "fix" the entity, clear caches (drush cr) & refresh the page
  7. (Option B) Add a role to the user and save - the new user name is now fully gone

Sample snippet for saving

// bugsaver.php
$entityId = 2;
$entityType = 'user';
$field = 'name';

$entityTypeManager = \Drupal::entityTypeManager();
$uuidGenerator = \Drupal::service('uuid');

$entityStorage = $entityTypeManager->getStorage($entityType);
$entity = $entityStorage->load($entityId);
$current = $match = $uuidGenerator->generate();

$entity->set($field, $current)->save();

while (TRUE) {
  $entityStorage = $entityTypeManager->getStorage($entityType);
  // save() resets the storage's static cache for this entity, so this
  // load goes through the persistent entity cache (or the database on a miss).
  $entity = $entityStorage->load($entityId);

  $current = $entity->get($field)->value;
  if ($current != $match) {
    echo "uh oh. {$current} != {$match}" . PHP_EOL;
    exit;
  }

  $match = $uuidGenerator->generate();
  $entity->set($field, $match)->save();
}

Sample snippet for loading

// bugloader.php
$entityId = 2;
$entityType = 'user';

$entityTypeManager = \Drupal::entityTypeManager();
while (TRUE) {
  $entityStorage = $entityTypeManager->getStorage($entityType);
  // Reset the static cache so every iteration goes through the
  // persistent entity cache (and hits the database on a miss).
  $entityStorage->resetCache([$entityId]);
  $entity = $entityStorage->load($entityId);
}

Proposed resolution

TBD

  • Semaphore to prevent entity being loaded while it is being saved
  • State that prevents entity being cached while it is being saved
  • ????
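
As a rough illustration of the semaphore idea, a saver could hold a lock across the save while loaders wait for it before writing the persistent entity cache. This is only a sketch using Drupal's lock service (\Drupal\Core\Lock\LockBackendInterface): the lock name, $cid and $entity variables are assumptions, and it narrows the race window rather than fully eliminating it (a save could still begin right after wait() returns).

// Hypothetical sketch only - $cid and $entity come from surrounding code.
$lock = \Drupal::lock();
$lockName = 'entity_save:user:2';

// Saver side: hold the lock for the duration of the save.
if ($lock->acquire($lockName, 5.0)) {
  try {
    $entity->save();
  }
  finally {
    $lock->release($lockName);
  }
}

// Loader side: before writing a freshly loaded entity to the persistent
// cache, wait for any in-flight save on this entity to finish.
if (!$lock->lockMayBeAvailable($lockName)) {
  $lock->wait($lockName);
}
\Drupal::cache('entity')->set($cid, $entity);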

Remaining tasks

TBD

User interface changes

TBD

Introduced terminology

TBD

API changes

TBD

Data model changes

TBD

Release notes snippet

TBD

🐛 Bug report
Status

Active

Version

10.3

Component
Entity 

Last updated 4 days ago

Created by

🇫🇮Finland dropa


Comments & Activities

  • Issue created by @dropa
  • 🇫🇮Finland dropa

    Seems to reproduce with latest vanilla as well

  • 🇺🇸United States luke.leber Pennsylvania

    This sounds a lot like a very rare and hard to reproduce issue we've seen on our production Acquia sites. In our case, this manifested as the incorrect revision being marked as the default revision, and showing stale information to end-users on the front-end.

    Following, and will try to reproduce in stage! Thanks for filing this issue! It could help to explain the "that can't happen" thing that certainly happens 🤣!

  • 🇫🇮Finland dropa

    In our case, this manifested as the incorrect revision being marked as the default revision

    This is exactly how it showed up in our case as well, but after digging (far) deeper into it, it turned out that it doesn't matter whether the entity supports revisions or not.

    showing stale information to end-users on the front-end.

    I'm sure the majority of us have been there and just cleared the cache without thinking twice. "Luckily" in our case the outcome was more site-breaking, affecting entities that are not manually editable. Looking back with what I know now, I'm sure I've run into the same issue before.

  • 🇬🇧United Kingdom catch

    I think this is a 'known' issue in the following case:

    1. Empty entity cache for the entity.

    2. Process A loads the entity, gets a cache miss.

    3. Process B saves the same entity and clears the cache.

    4. Process A writes to the cache.

    You would think that process A would be able to write to the cache before process B is simultaneously able to load and save the entity, but there are probably at least two scenarios that could make this more likely.

    Scenario 1:

    Taking the steps above, we can add a step zero.
    0. Process B, a long running drush process, loads the entity (cache hit or miss doesn't matter).

    This means that step 3 is only required to modify and save the entity, not load it first, because that already happened in step zero.

    --
    Scenario 2:

    Something happens during entity loading (loading other entities in hook_entity_load(), querying from custom tables etc., just having a lot of fields) that makes it take a long time. e.g. if there is 50ms or 100ms between getting a cache miss and writing back to the cache, that allows another process to save in the meantime.

    There's at least a couple of ways to mitigate this:

    1. Add some kind of lock/validation. For example, when revisions are enabled we could check that the default revision ID in the database is the same as the one we're about to write a cache item for. But there are problems with this: extra work during an operation that needs to be fast, and still a potential race condition between running that query and writing the cache item, just a shorter one.
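
    A minimal sketch of that validation for a node; the base table, $cid and surrounding variables are illustrative assumptions, not a worked-out design:

    // Hypothetical guard before writing the entity cache item.
    $loadedRevisionId = $entity->getLoadedRevisionId();
    $currentRevisionId = \Drupal::database()
      ->select('node', 'n')
      ->fields('n', ['vid'])
      ->condition('nid', $entity->id())
      ->execute()
      ->fetchField();

    // Only cache when no save happened while we were loading. A shorter
    // race window remains between this query and the set() below.
    if ((string) $currentRevisionId === (string) $loadedRevisionId) {
      \Drupal::cache('entity')->set($cid, $entity);
    }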

    2. Modify entity caching so that when we save an entity, instead of deleting the cache item, we overwrite it with one where $cache->data === FALSE - a tombstone record.

    Entity cache gets would need to be modified to treat $cache->data === FALSE as a miss.

    Entity cache sets would need to be modified to first get from cache, and if the timestamp is newer than REQUEST_TIME or a stored timestamp of when we attempted the cache get, discard the cache write (whether the content of the cache is FALSE or the entity).

    In terms of performance this means switching a cache delete for a cache set on save (should be neutral), and adding a cache get just before cache sets (extra work but fast). However in the stampede situation that we're talking about, being able to skip one or two cache writes probably isn't a bad thing either (especially with the database cache).
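
    The tombstone logic can be illustrated without Drupal at all; this plain-PHP sketch (all names invented for illustration) keeps a created timestamp per item, writes FALSE instead of deleting on save, and discards any set() whose cache miss happened before the tombstone was written:

    class TombstoneCache {
      private array $items = [];

      public function get(string $cid): mixed {
        $item = $this->items[$cid] ?? NULL;
        // Treat both "missing" and "tombstone" as a miss.
        if ($item === NULL || $item['data'] === FALSE) {
          return NULL;
        }
        return $item['data'];
      }

      // On entity save: overwrite instead of delete.
      public function tombstone(string $cid, float $now): void {
        $this->items[$cid] = ['data' => FALSE, 'created' => $now];
      }

      // $missedAt is when the caller got its cache miss.
      public function set(string $cid, mixed $data, float $missedAt, float $now): bool {
        $existing = $this->items[$cid] ?? NULL;
        if ($existing !== NULL && $existing['created'] > $missedAt) {
          return FALSE; // A save tombstoned the item after our miss: discard.
        }
        $this->items[$cid] = ['data' => $data, 'created' => $now];
        return TRUE;
      }
    }

    // Replaying the race from the issue summary:
    $cache = new TombstoneCache();
    // t=1.0: process A gets a miss and starts loading from the database.
    $missedAt = 1.0;
    // t=2.0: process B saves the entity and tombstones the cache item.
    $cache->tombstone('user:2', 2.0);
    // t=3.0: process A tries to cache the now-stale entity: discarded,
    // and the tombstone still reads as a miss.
    $written = $cache->set('user:2', 'stale name', $missedAt, 3.0);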

    3. Add a cache backend decorator that keeps track of cache IDs (in a class property) and then invalidates them at the end of the request, as well as always returning FALSE for them until then. This could either be in addition to the invalidation during save or instead of it. We already did something similar for cache tags in #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock. We'd probably need to move the delayed cache tag invalidation to the end of the request too, to match.
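
    A bare-bones sketch of such a decorator; illustrative only: just two of CacheBackendInterface's methods are shown (a real decorator must implement the whole interface), and the end-of-request wiring, e.g. a KernelEvents::TERMINATE subscriber, is assumed:

    use Drupal\Core\Cache\CacheBackendInterface;

    class DelayedInvalidationCacheDecorator {
      /** @var bool[] Keyed by cid, for cids invalidated during this request. */
      private array $pending = [];

      public function __construct(private CacheBackendInterface $inner) {}

      public function get($cid, $allow_invalid = FALSE) {
        // Once invalidated in this request, always report a miss.
        return isset($this->pending[$cid]) ? FALSE : $this->inner->get($cid, $allow_invalid);
      }

      public function delete($cid) {
        // Remember the cid; here we also delete immediately ("in addition
        // to" the invalidation during save), but this inner call could be
        // skipped ("instead of").
        $this->pending[$cid] = TRUE;
        $this->inner->delete($cid);
      }

      // Hypothetical: called at the end of the request.
      public function flushPendingInvalidations(): void {
        $this->inner->deleteMultiple(array_keys($this->pending));
        $this->pending = [];
      }
    }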

    This would result in the following sequence:

    1. Empty entity cache for the entity.

    2. Process A loads the entity, gets a cache miss.

    3. Process B saves the same entity and clears the cache.

    4. Process A writes to the cache.

    5. Process B invalidates the cache backend/tags at the end of the request.

    I think the tombstone record could be the best option - it could only go wrong if the 'empty' cache item gets invalidated or cycled out in the window between it being created and the cache set, but otherwise it should completely eliminate the race condition without adding a lot of overhead.
