Page cache creates vast amounts of unneeded cache data

Created on 5 November 2018, over 6 years ago
Updated 19 January 2024, over 1 year ago

We noticed on several of our sites that the page_cache table was steadily growing to hundreds of megabytes, containing thousands of rows. We didn't understand why, as these sites were not very content heavy and should be easily cacheable. The example site we used for debugging contains 589 pages and no authenticated users, so we expected to see 589 page_cache entries. Perhaps a few more caused by views pagers, exposed filters and so on, but certainly no more than 1,000.

Problem 1:
After some investigation we noticed that a new page_cache entry was created for every unique URL, regardless of whether or not that URL actually produces different output. This is triggered simply by adding query parameters to the URL.
As an example I created a simple bash script that requests URLs with a variable id (http://example.tld/?id=x) and loops x from 1 to 10,000. Sure enough, the page cache generated 10,000 cache entries, resulting in 1.5GB of MySQL data in the page_cache table.
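
For illustration, roughly what that loop does (sketched here in PHP rather than bash; example.tld is just a placeholder for the site under test):

  <?php
  // Request the same page 10,000 times, varying only an unused query
  // parameter. Each request ends up as a separate row in cache_page.
  for ($id = 1; $id <= 10000; $id++) {
    file_get_contents('http://example.tld/?id=' . $id);
  }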

This should not happen, as the cache contexts for the response on this page do not contain the "url.query_args" context. So the cache system should know that the "id" query parameter does not change the output, and it should not cause more than one entry in the cache_page table.
Is this normal behaviour? I could find some related issues, but no clear description as to why this happens, nor a solution:
💬 Dynamic cache does not respect query parameters (closed: works as designed)
#2062463: allow page cache cid to be alterable
#2662196: Cache route by Uri and not just Query+Path
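
For context: as far as I can tell, the Internal Page Cache keys its entries on the full request URI (query string included) rather than on the response's cache contexts, which would explain the behaviour. A rough paraphrase of the middleware's cache ID logic (simplified from memory, not an exact quote of core; details may differ between versions):

  use Symfony\Component\HttpFoundation\Request;

  // Simplified sketch: the cache ID is the full URI plus the request format.
  // Because the query string is part of the URI, /?id=1 and /?id=2 get their
  // own cache_page rows even though the response declares no url.query_args
  // cache context.
  function example_page_cache_cid(Request $request): string {
    return implode(':', [
      $request->getSchemeAndHttpHost() . $request->getRequestUri(),
      $request->getRequestFormat(NULL),
    ]);
  }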

I think this could potentially be exploited to overload websites, or even to crash the MySQL database or other caching software.

Problem 2:
If we were to fix problem 1, a second problem would still pop up. Our example site has some modules enabled, some configuration, some blocks, some real content, ... In particular, a search block is placed in the header of our site on all pages. That block renders a form, and somehow that seems to add "url.query_args" to the cache contexts of all pages. So even if the page_cache cache ID only took into account the query arguments listed in the cache contexts, this behaviour would still pull in the entire query string, causing every query parameter to generate a new page_cache entry.
So we did some searching and found that Drupal core adds "url.query_args" in a few places where it should not be needed at all. See the listing below of what I could find.

Why is this implemented like this?
Probably lots of contrib modules already "rely" on this "bazooka" behaviour...?

core/lib/Drupal/Core/Form/FormBuilder.php:

  /**
   * #lazy_builder callback; renders a form action URL.
   *
   * @return array
   *   A renderable array representing the form action.
   */
  public function renderPlaceholderFormAction() {
    return [
      '#type' => 'markup',
      '#markup' => $this->buildFormAction(),
      '#cache' => ['contexts' => ['url.path', 'url.query_args']],
    ];
  }

modules/user/src/Form/UserPasswordForm.php:

  public function buildForm(array $form, FormStateInterface $form_state) {
    ...
  
    $form['#cache']['contexts'][] = 'url.query_args';

    return $form;
  }

modules/views/views.theme.inc:

function template_preprocess_views_mini_pager(&$variables) {
  ...

  // This is based on the entire current query string. We need to ensure
  // cacheability is affected accordingly.
  $variables['#cache']['contexts'][] = 'url.query_args';

modules/views/src/Plugin/views/style/Table.php:

  public function getCacheContexts() {
    $contexts = [];

    foreach ($this->options['info'] as $field_id => $info) {
      if (!empty($info['sortable'])) {
        // The rendered link needs to play well with any other query parameter
        // used on the page, like pager and exposed filter.
        $contexts[] = 'url.query_args';
        break;
      }
    }

    return $contexts;
  }
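
For what it's worth, the url.query_args cache context can be parameterized per query argument (url.query_args:NAME), so at least some of these spots could declare only the parameters they actually read. A hypothetical narrowing of the table sort handler above, for illustration only (whether this covers every case is exactly the question):

  public function getCacheContexts() {
    $contexts = [];

    foreach ($this->options['info'] as $field_id => $info) {
      if (!empty($info['sortable'])) {
        // Hypothetical: vary only on the query parameters the sortable table
        // actually reads ('order' and 'sort') instead of the whole query
        // string, so unrelated parameters no longer multiply cache entries.
        $contexts[] = 'url.query_args:order';
        $contexts[] = 'url.query_args:sort';
        break;
      }
    }

    return $contexts;
  }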
🐛 Bug report
Status: Active
Version: 11.0 🔥
Component: Cache
Last updated: 3 days ago
Created by: 🇧🇪Belgium weseze


Comments & Activities


  • 🇩🇰Denmark ressa Copenhagen

    A workaround that "fixed" the issue for me is disabling the cache for views with large amounts of data.

    I tried this (I am using Facets module, which requires this) but the View is still getting cached, and the cache_page table is getting views entries, which it should not.

    The only method to stop the view from getting cached is to disable caching for everything under Performance (/admin/config/development/performance), setting "Caching | Browser and proxy cache maximum age" to <no caching>, which is less than ideal ...

    I tried searching for drupal views "Caching:None" ignored but found no hints.

  • 🇫🇷France prudloff Lille

    We also got bitten by this on a website where bot requests massively expanded the size of the SQL database.
    The database_cache_max_rows setting helps with this, but it is not enforced in real time.

    Our current solution to this is (see the settings sketch after this list):

    • switch to a Redis cache and set a memory limit
    • add a rate limit on the server
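
    For reference, a settings.php sketch of both knobs (values are only examples; the bin name for the cache_page table is assumed to be 'page', and the actual Redis memory cap lives in redis.conf via maxmemory / maxmemory-policy):

      // Cap the number of rows per database cache bin. Note: this is only
      // enforced when cron runs, as discussed in the following comments.
      $settings['database_cache_max_rows']['default'] = 5000;
      $settings['database_cache_max_rows']['bins']['page'] = 5000;

      // Or move caches to Redis (redis contrib module) so eviction is handled
      // by the cache backend instead of the SQL database.
      $settings['redis.connection']['interface'] = 'PhpRedis';
      $settings['redis.connection']['host'] = '127.0.0.1';
      $settings['cache']['default'] = 'cache.backend.redis';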
  • 🇺🇸United States gpotter

    Like @prudloff, we ran into the same issue.

    The basic example is any controller that returns a render array. If a bot hits that route with unique query strings, each request creates a new cache_page record; if that bot hammers on the page consistently, you end up with tens or even hundreds of gigabytes of data, because the entire page is stored even though the content is the same. It is namely the Internal Page Cache module that is causing the issue.

    "database_cache_max_rows" doesnt work like prudloff mentioned because those max rows are only restricted on a cron run. So by the time the cron runs the site dies because there are too many records to restrict and delete old records.

    We had similar solution ideas: a Redis cache, or rate limits. Rate limiting is a somewhat concerning solution, as it could potentially rate limit a legitimate web crawler.

    The primary issue on our server is that the site would die on cron runs, so we clear caches nightly before the cron run. A cache clear runs at a good speed compared to the DB row constraint applied by cron, probably because a cache clear is a simple, quick truncate of the cache tables.

  • 🇨🇭Switzerland gagarine

    This issue has been publicly known for years and still poses a serious risk.

    In my opinion, the core problem is that adding irrelevant query parameters still returns a valid page. Instead, Drupal should return a 404 response and avoid caching the page when “fake” query parameters are injected.

    As it stands, adding arbitrary query strings (e.g., ?test, ?foo=bar) bypasses the page cache and triggers full page generation. An attacker can exploit this to cause a denial-of-service (DoS) by flooding the site with requests using unique query parameters, quickly overwhelming the backend.

    What's worse is that Drupal treats these URLs as legitimate and includes them, fake query strings and all, in the HTML it returns. The content is therefore not identical to the page requested at the normal URL.

    This should absolutely be treated as a critical security issue.

  • If this is critical, are you saying that #2526150: Database cache bins allow unlimited growth: cache DB tables of gigabytes! does not work, or that it is ineffective?

    I completely understand that even with row limiting, an attacker can essentially replace useful cached objects with useless ones, and that Drupal could do better here. But can those outcomes render a site unusable or cause database deadlocks? Possibly.

  • 🇬🇧United Kingdom catch

    @cilefen

    If this is critical, are you saying that #2526150: Database cache bins allow unlimited growth: cache DB tables of gigabytes! does not work, or that it is ineffective?

    It's known to be ineffective because it relies on cron trimming the table size. This means that if the table grows to ten times the row limit in between cron runs, you not only have tables growing very large in between cron runs, but the cron logic also has to delete a lot of rows. That doesn't mean the logic is 'bad' as such; it's just not a replacement for a 'real' cache backend like Redis or Memcache, which can prune items more efficiently, both in the pruning itself and in deciding which items to prune.

    @gagarine

    In my opinion, the core problem is that adding irrelevant query parameters still returns a valid page. Instead, Drupal should return a 400 (Bad Request) response

    This would require an 'allowed query parameters' API. But that API would have to validate both the parameter names and the values. And it couldn't stop at the names, because e.g. ?page is a valid parameter whose value has an integer range of 0 to PHP_MAX_INT (or some arbitrary configured limit). And unless it could be reliably implemented to validate the query parameters on a route-specific basis (which may not be the controller; it could be any block, JavaScript, 'destination' handling for a form, etc.), it would still allow more than enough combinations of query parameters to create a very, very large number of cache entries. It would require more knowledge, but it wouldn't prevent anything. And it wouldn't be possible to issue a 400 until Drupal 12, because loads of contrib modules would break.

    One possibility here is to just completely prevent caching if there are any query parameters in the URL; those requests would have to rely on dynamic_page_cache instead.
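
    As an illustration of that last option, a site can already opt individual requests out of the Internal Page Cache with a custom request policy (sketch only; 'mymodule' is a placeholder and the class would need to be registered as a service tagged page_cache_request_policy):

      namespace Drupal\mymodule\PageCache;

      use Drupal\Core\PageCache\RequestPolicyInterface;
      use Symfony\Component\HttpFoundation\Request;

      /**
       * Skips the page cache for any request that carries a query string,
       * leaving those requests to dynamic_page_cache instead.
       */
      class DenyQueryStrings implements RequestPolicyInterface {

        public function check(Request $request) {
          if ($request->getQueryString() !== NULL) {
            return static::DENY;
          }
          // NULL defers the decision to the other registered policies.
          return NULL;
        }

      }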

  • 🇬🇧United Kingdom catch

    The MR will fail at least one method in PageCacheTest, if not elsewhere, because that test method relies on caching pages with query parameters. It would have to be switched to using key value or something.

  • 🇨🇭Switzerland gagarine

    This would require an 'allowed query parameters' API. But that API would have to validate both the parameter names and the values. And it couldn't just do that, because ?pager is a valid value and would have an integer range of 0-PHP_MAX_INT (or some arbitrary configured value).

    Could the value be validated by the module that says it needs these parameters? The validation may require a couple of DB queries, but it could be much lighter than building a full page.

    Where are the query strings included in the HTML returned by the page?

    The full URL is added to the form action and to the page's JS.
    If this were not the case, we could use a checksum to avoid duplication in the cache. But adding a parameter would still allow the cache to be bypassed.

  • 🇬🇧United Kingdom catch

    Could the value be validated by the module that says it needs these parameters? The validation may require a couple of DB queries, but it could be much lighter than building a full page.

    For views, which is the most likely way to end up with a pager query parameter, you would need to build and execute the entire view to validate the parameter and find out how many possible pages there might be; at that point you have pretty much built the page already.

    The full URL is added to the form action and to the page's JS.
    If this were not the case, we could use a checksum to avoid duplication in the cache. But adding a parameter would still allow the cache to be bypassed.

    That's by design: query parameters can be used to pre-fill form values and other things that affect how the form is built, and also for the ?destination parameter.

    📌 Allow both AJAX and non-AJAX forms to POST to dedicated URLs (postponed) might allow the form action to be consistent each time, but that issue has had very little progress in the past nine years because it will be incredibly hard to implement.
