New feature: option to use SCAN command instead of KEYS on cache wildcard deletions

Created on 10 February 2017
Updated 28 February 2023

Problem description

This module uses the Redis KEYS command to perform cache wildcard deletions. As the docs on the KEYS command state, it should not be used in production environments. On a Redis instance holding roughly 1.5M keys, we are experiencing website freezes on cache_menu bin flushes with key counts ranging from 500K to 800K. The same happens when flushing other cache bins, for example during database updates. Redis blocks while KEYS fetches the matching keys for deletion, and the entire site slows to a crawl, or even halts completely if the freeze lasts long enough (we've seen freezes lasting from 20 to 40 seconds).

It's also the first item under the @todo section in lib/Redis/Cache/Base.php :)
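For context, the KEYS-based deletion amounts to something like the following (a simplified sketch of the blocking pattern, not the module's actual script; passing the pattern as the first script key is an assumption):

```lua
-- Simplified sketch of a KEYS-based wildcard deletion.
-- KEYS enumerates every matching key in one blocking call,
-- which is what causes the freezes described above.
local keys = redis.call('KEYS', KEYS[1])
for _, key in ipairs(keys) do
  redis.call('DEL', key)
end
return #keys
```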

Solution

The alternative is to use the SCAN command, which returns a cursor that you call repeatedly to iterate over the matching keys. Done this way, Redis keeps accepting commands and does not block. However, it has some differences:

  • The operation is not atomic: SCAN provides a cursor, and the eval'ed script loops until no keys are left. It can take slightly longer under heavy load, but Redis no longer freezes until it returns from enumerating keys.
  • To be replication-safe, it requires switching from script replication to script *effects* replication, which is available in Redis >= 3.2. More info in Antirez's post.
  • Warning: this method might still cause trouble with script runs lasting longer than lua-time-limit (which defaults to 5s), since Redis answers 'BUSY' to clients past that limit. This is still being worked out.
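The cursor-driven loop can be sketched like this (a minimal illustration of the technique, not the patch itself; it assumes effects replication is available and that the pattern arrives as the first script key):

```lua
-- Minimal sketch of a SCAN-based wildcard deletion.
-- Switch to effects replication so the non-deterministic SCAN
-- order does not break script replication (Redis >= 3.2).
redis.replicate_commands()
local cursor = '0'
local deleted = 0
repeat
  -- Each SCAN call returns quickly, so Redis keeps serving
  -- other clients between iterations.
  local res = redis.call('SCAN', cursor, 'MATCH', KEYS[1], 'COUNT', 100)
  cursor = res[1]
  for _, key in ipairs(res[2]) do
    deleted = deleted + redis.call('DEL', key)
  end
until cursor == '0'
return deleted
```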

Implementation

It has been implemented alongside the current behaviour to preserve compatibility: the module works as before by default, unless you explicitly turn this feature on.
Also, if you accidentally turn the feature on on a server that does not support effects replication (< 3.2), it is simply ignored, since the script always checks the server version first and falls back to the normal behaviour.
These are the main changes that add the feature:

  • Add the feature toggle option in the configuration page.
  • Add a new parameter to the Lua script which uses the new SCAN code for deletion.
  • Add the SCAN on/off parameter to the eval() call in the cache backend implementation, according to the configuration.

The Lua script has grown to accommodate the following new capabilities:

  • Checks the second parameter (ARGV[1]) to decide whether to enable SCAN.
  • Checks for server support before switching to effects replication with the redis.replicate_commands() call.
  • Loops performing SCAN/DEL until no more keys match.
  • Supports specifying SCAN's COUNT parameter as ARGV[2], which defaults to 100. You can change this value from the configuration page.
  • Outputs information and events to Redis' logs. You can silence them by raising the server's minimum log level. Doing all logging at the LOG_DEBUG level would render it useless on high-volume setups, since the messages would be lost among lots of items per second. Information sent to the log:
    • Script invocation started, with the requested keys argument (notice).
    • If SCAN mode is requested, say so, along with the server's returned version string and the COUNT parameter (verbose).
    • If SCAN mode is requested but not supported, emit a warning message (warning).
    • If using the KEYS command, say so before and after calling KEYS to allow timing (notice).
    • A finished message with the mode used (SCAN/KEYS) and the count of deleted keys (notice).
  • Minor: use the _ variable-name convention instead of i for unused variables.
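Putting those points together, the patched script's control flow is roughly as follows (an illustrative sketch only: argument positions, log messages and the version check are assumptions, not the patch verbatim):

```lua
-- Illustrative sketch of the script's control flow; not the patch itself.
local pattern  = KEYS[1]                   -- assumed: pattern passed as first key
local use_scan = ARGV[1] == '1'            -- feature toggle from config
local count    = tonumber(ARGV[2]) or 100  -- SCAN COUNT, defaults to 100

redis.log(redis.LOG_NOTICE, 'Wildcard deletion started for ' .. pattern)

if use_scan then
  -- Effects replication (and so a safe SCAN loop) needs Redis >= 3.2.
  local version = string.match(redis.call('INFO', 'server'),
                               'redis_version:([%d%.]+)') or '0.0'
  if redis.replicate_commands and tonumber(string.match(version, '^%d+%.%d+')) >= 3.2 then
    redis.replicate_commands()
    redis.log(redis.LOG_VERBOSE, 'SCAN mode, server ' .. version .. ', COUNT ' .. count)
  else
    redis.log(redis.LOG_WARNING, 'SCAN requested but unsupported on ' .. version .. ', using KEYS')
    use_scan = false
  end
end

local deleted = 0
if use_scan then
  local cursor = '0'
  repeat
    local res = redis.call('SCAN', cursor, 'MATCH', pattern, 'COUNT', count)
    cursor = res[1]
    for _, key in ipairs(res[2]) do
      deleted = deleted + redis.call('DEL', key)
    end
  until cursor == '0'
else
  redis.log(redis.LOG_NOTICE, 'Calling KEYS for ' .. pattern)
  local keys = redis.call('KEYS', pattern)
  redis.log(redis.LOG_NOTICE, 'KEYS returned ' .. #keys .. ' keys')
  for _, key in ipairs(keys) do
    deleted = deleted + redis.call('DEL', key)
  end
end

redis.log(redis.LOG_NOTICE, 'Finished in ' .. (use_scan and 'SCAN' or 'KEYS')
  .. ' mode, deleted ' .. deleted .. ' keys')
return deleted
```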
Feature request
Status

Needs review

Version

3.0

Component

Code

Created by

amontero (Barcelona, Spain 🇪🇸)


Comments & Activities


It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • So I want to put out there that the patch in #21 dramatically helped us, for the most part.

    I've reviewed how many keys we've had open, and I've seen us in the 2.5 million key range. Our system (which, as an FYI, was an AWS ElastiCache managed service, not something running on localhost) was always locking up with Redis warnings until the flush completed whenever we flushed the Drupal cache. This solution helped us for the most part.

    We found that even after upping the number of keys deleted and scanned, it was taking a long time and consuming a ton of memory -- more than our typical server needs. We had to add some lines so that deleteByPrefixUsingScan could extend its allocated memory, and call drupal_set_time_limit(0) so that it had unlimited time to delete everything.

    With this the site never became unusable even when we were flushing cache. Overall this was an improvement for us.

  • UPDATE: Even after applying this update, editing something in Views and saving the View seemed to cause a total Redis meltdown. I'm not sure if Views doesn't trigger this code path, or what.
