Centralize/optimize stampede protection/locking (aka work while we sleep)

Created on 29 May 2015, over 10 years ago

Updated 17 August 2023, about 2 years ago

Problem/Motivation

Drupal's completely cold cache performance has several problems:

We have multiple registries/collections that have to be rebuilt from either YAML or annotation parsing + hooks/events before any HTML request and most REST requests can be served successfully. For example: the container, router, theme registry, element info, and plugins. Each of these can take hundreds of milliseconds, or in some cases several seconds.
On actual sites, there will be common page elements like menus, footers etc. which at least one page must build from scratch before any page can be served and the response sent. Asset aggregates have the same problem; see 🐛 "Stampedes and cold cache performance issues with css/js aggregation" (Fixed).

In Drupal 6/7 this has two manifestations:

In earlier releases of core, and in many contrib modules, we would get stampedes with multiple requests rebuilding exactly the same information.
Because of this we added the lock API, so that one process builds the expensive cache item while the rest sleep() until it's there.

However, it is still very easy for sites to run into situations such as the following:
On low traffic sites, operations like enabling a module, changing the default theme, submitting a view, can take multiple seconds to complete.

On high traffic sites, in addition to the above, events such as code deployments can result in sites becoming unresponsive for up to a minute, as every incoming request is held waiting for 5-6 expensive cache items to be rebuilt sequentially, and may then have to build expensive page elements after that. While this is going on, Apache clients build up, since none of them can send a response and close the connection. Drupal's high memory usage means the number of Apache clients generally needs to be kept quite low per server, so reaching max-clients (or having to configure Varnish to limit connections so it doesn't get reached) is very common.

Proposed resolution

Speaking to Fabianx yesterday, he had an idea about parallel processing of blocks using PHP 5.5 generators and APC locks: get a list of blocks; if a block is cached, serve it; if it is not cached, try to acquire a lock (in APCu) and rebuild it; if the lock can't be acquired, move on to the next block and come back for it later, on the assumption that another process might have built it by then. We didn't discuss this approach for other kinds of caches, but I think it's equally or more applicable to them. I later discussed cold cache performance with effulgentsia, and whether there was a way to make things more robust while individual cache items remain expensive to build, and then thought of this.
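The block idea can be sketched with a generator roughly as follows. This is only an illustration, not code from Fabianx or from core: render_blocks(), the ArrayObject used as a cache, the $try_lock and $build callables and the $max_passes fallback are all hypothetical stand-ins, and a real version would use the cache backend and APCu locks instead.

```php
<?php

// Hypothetical sketch: yields each block as soon as it is available.
// $cache is an in-memory stand-in for a cache bin, $try_lock a stand-in for
// a non-blocking lock, $build a stand-in for the expensive block build.
function render_blocks(array $block_ids, ArrayObject $cache, callable $try_lock, callable $build, $max_passes = 3) {
  $pending = $block_ids;
  for ($pass = 0; $pass < $max_passes && $pending; $pass++) {
    $deferred = [];
    foreach ($pending as $id) {
      if (isset($cache[$id])) {
        // Cached: serve it.
        yield $id => $cache[$id];
      }
      elseif ($try_lock($id)) {
        // Not cached and we got the lock: rebuild it ourselves.
        $cache[$id] = $build($id);
        yield $id => $cache[$id];
      }
      else {
        // Someone else is building it: move on and come back later.
        $deferred[] = $id;
      }
    }
    $pending = $deferred;
  }
  // Assumption of this sketch: after $max_passes, build without the lock.
  foreach ($pending as $id) {
    yield $id => $build($id);
  }
}

$cache = new ArrayObject(['menu' => 'cached menu']);
$try_lock = function ($id) use ($cache) {
  if ($id === 'footer') {
    // Simulate another process that holds the lock and finishes the build
    // before we come back to this block.
    $cache[$id] = 'footer (built elsewhere)';
    return FALSE;
  }
  return TRUE;
};
$build = function ($id) {
  return "built $id";
};

$served = [];
foreach (render_blocks(['menu', 'footer', 'sidebar'], $cache, $try_lock, $build) as $id => $html) {
  $served[$id] = $html;
}
```

In this run the footer is skipped on the first pass, the sidebar is built in the meantime, and the footer is then served from cache on the second pass.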

For the central cache items, we'd use the regular lock API (since we know these are global caches and the number of locks will be very small), and there's no need for generators: we just need a list of services to iterate over.

While the router, theme registry, element info cache generally get accessed in a particular sequential order (at least on the same site), there are not really interdependencies between them - so the order they get built in shouldn't matter except for the ordering of our bootstrap, and particular interdependencies when the version of a cache depends on previous steps in the request.

The idea would be:

1. Any 'important and expensive' service such as routing, theme registry, element info cache implements an interface and tags itself (or we add adapters for this purpose).

Something like this:

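<?php
interface StampedeRebuildInterface {

  // Whether the cached data is missing and has to be rebuilt.
  public function isRebuildNeeded();

  // Tries to acquire the rebuild lock without waiting for it.
  public function acquireLock();

  // Rebuilds the data and stores it.
  public function doRebuild();
}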

2. We add a stampede.protection.rebuild service (better names welcome) that iterates over the tagged services, checks whether each one needs rebuilding, tries to acquire a lock, rebuilds if it can, and moves on to the next if it can't.
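A minimal sketch of what such a service could look like, assuming the interface above. StampedeProtectionRebuilder and FakeRebuildable are hypothetical names made up for this example, the tagged services are passed in as a plain array rather than collected from the container, and lock release is left out because the proposed interface doesn't cover it.

```php
<?php

// Interface as proposed in the issue summary.
interface StampedeRebuildInterface {
  public function isRebuildNeeded();
  public function acquireLock();
  public function doRebuild();
}

// Hypothetical stand-in for a tagged service. A real one would wrap the
// router, theme registry etc. and use the lock API.
class FakeRebuildable implements StampedeRebuildInterface {
  public $built = FALSE;
  private $lockAvailable;

  public function __construct($lockAvailable) {
    $this->lockAvailable = $lockAvailable;
  }

  public function isRebuildNeeded() {
    return !$this->built;
  }

  public function acquireLock() {
    return $this->lockAvailable;
  }

  public function doRebuild() {
    $this->built = TRUE;
  }
}

// Hypothetical implementation of the stampede.protection.rebuild service.
class StampedeProtectionRebuilder {
  private $services;

  public function __construct(array $services) {
    $this->services = $services;
  }

  // Rebuilds every service whose lock can be acquired. Returns the names of
  // the services that were skipped because the lock was held elsewhere.
  public function rebuildAvailable() {
    $skipped = [];
    foreach ($this->services as $name => $service) {
      if (!$service->isRebuildNeeded()) {
        continue;
      }
      if ($service->acquireLock()) {
        $service->doRebuild();
      }
      else {
        $skipped[] = $name;
      }
    }
    return $skipped;
  }
}

$router = new FakeRebuildable(FALSE);         // Lock held by another process.
$theme_registry = new FakeRebuildable(TRUE);  // Lock is free.
$rebuilder = new StampedeProtectionRebuilder([
  'router' => $router,
  'theme_registry' => $theme_registry,
]);
$skipped = $rebuilder->rebuildAvailable();
```

Returning the skipped names lets the caller decide whether to fall back to the normal lock-and-wait path for those items.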

Examples of how this would change things.

Let's say we have three items (router, theme registry, element info) and each takes 3 seconds. In reality we have more things and they can take from 100ms to 10 seconds depending on server/site.

Before:


Process 1:
lock_acquire('router') -> router_rebuild() -> lock_acquire('theme_registry') -> theme_registry_rebuild() -> lock_acquire('element_info') -> element_info_rebuild()
3 + 3 + 3 = 9 seconds

Process 2:
!lock_acquire('router') -> lock_wait('router') -> !lock_acquire('theme_registry') -> lock_wait('theme_registry') -> !lock_acquire('element_info') -> lock_wait('element_info')
3 + 3 + 3 = 9 seconds

Process 3:
lock_wait('router') -> lock_wait('theme_registry') -> lock_wait('element_info'), same as process 2
3 + 3 + 3 = 9 seconds

After:

Process 1:
lock_acquire('router') -> router_rebuild() -> HIT theme_registry -> HIT element_info
3 + 0 + 0 = 3 seconds

Process 2:
!lock_acquire('router') -> lock_acquire('theme_registry') -> theme_registry_rebuild() -> HIT router -> HIT element_info
0 + 3 + 0 = 3 seconds

Process 3:
!lock_acquire('router') -> !lock_acquire('theme_registry') -> lock_acquire('element_info') -> element_info_rebuild() -> HIT router -> HIT theme_registry
0 + 0 + 3 = 3 seconds
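The arithmetic above can be reproduced with a small model. This is a rough sketch under idealised assumptions (no lock overhead, each unbuilt item is picked up by whichever process is free first, and a process is done once the slowest build has finished); before_seconds() and after_seconds() are hypothetical helpers written for this example.

```php
<?php

// Before: every process steps through the items in order, either building
// each one or sleeping until it exists, so each process sees the full sum.
function before_seconds(array $durations) {
  return array_sum($durations);
}

// After: items are spread over the available processes. The wall time is
// the load of the busiest process.
function after_seconds(array $durations, $processes) {
  $busy_until = array_fill(0, $processes, 0);
  foreach ($durations as $seconds) {
    // The next unbuilt item goes to whichever process is free first.
    $i = array_search(min($busy_until), $busy_until, TRUE);
    $busy_until[$i] += $seconds;
  }
  return max($busy_until);
}

$durations = ['router' => 3, 'theme_registry' => 3, 'element_info' => 3];

$before = before_seconds($durations);
$after_three = after_seconds($durations, 3);
$after_two = after_seconds($durations, 2);
```

With three processes this gives the 9 seconds versus 3 seconds from the example; with only two concurrent processes the model gives 6 seconds, so the gain depends on how many requests arrive during the rebuild.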

One limitation is that the services can't rely on the request. For example, for the theme registry we can't know the active theme until we have a route. However, we can build the theme registry for the default theme, and in the process a theme-independent cache item gets stored (theme_registry:build:modules), so this still lets us do the bulk of the work in a request-agnostic way. All we're doing is replacing time the process would otherwise spend sleeping with rebuilding caches that will be needed later in the same request, and in other requests that are coming in.

Remaining tasks

User interface changes

API changes

(It is possible without)

📌 Task

Status

Closed: duplicate

Version

9.5

Component

Base →

Last updated about 1 month ago

Maintained by
🇺🇸United States @effulgentsia
🇬🇧United Kingdom @catch
🇬🇧United Kingdom @alexpott
🇦🇺Australia @larowlan
🇺🇸United States @bnjmnm
🇫🇷France @nod_
🇪🇸Spain @ckrina
🇬🇧United Kingdom @justafish

Created by

🇬🇧United Kingdom catch


Performance
It affects performance. It is often combined with the Needs profiling tag.

scalability


Comment about 2 years ago →
🇬🇧United Kingdom catch
Marking as a duplicate of 📌 "Add a cache prewarm API and use it to distribute cache rebuids after cache clears / during stampedes" (Needs work), which implements almost exactly the same thing as the issue summary here, eight years later.
