Page cache creates vast amounts of unneeded cache data

Created on 5 November 2018, over 6 years ago
Updated 19 January 2024, over 1 year ago

We noticed on several of our sites that the page_cache table was steadily growing to hundreds of megabytes, containing thousands of rows. We didn't understand why, as these sites were not very content heavy and should be easily cacheable. The example site we used for debugging contains 589 pages and no authenticated users, so we expected to see 589 page_cache entries. Perhaps a few more caused by views pagers, exposed filters and so on, but certainly no more than 1,000.

Problem 1:
After some investigation we noticed that a new page_cache entry was created for every unique URL, regardless of whether or not that URL actually produces different output. This is triggered simply by adding query parameters to the URL.
As an example I created a simple bash script that requests URLs with a variable id (http://example.tld/?id=x) and loops x from 1 to 10,000. Sure enough, the page cache generated 10,000 cache entries, resulting in 1.5GB of MySQL data in the page_cache table.
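
For illustration, roughly what that loop does (sketched here in PHP rather than bash; example.tld is just a placeholder for the site under test):

  <?php
  // Request the same page 10,000 times, varying only an unused query
  // parameter. Each request ends up as a separate row in cache_page.
  for ($id = 1; $id <= 10000; $id++) {
    file_get_contents('http://example.tld/?id=' . $id);
  }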

This should not happen, as the cache contexts for the response on this page do not contain the "url.query_args" context. So the cache system should know that the "id" query parameter does not change the output, and it should not cause more than one entry in the cache_page table.
Is this normal behaviour? I could find some related issues, but no clear description as to why this happens, nor a solution:
💬 Dynamic cache does not respect query parameters (closed: works as designed)
#2062463: allow page cache cid to be alterable
#2662196: Cache route by Uri and not just Query+Path
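
For context: as far as I can tell, the Internal Page Cache keys its entries on the full request URI (query string included) rather than on the response's cache contexts, which would explain the behaviour. A rough paraphrase of the middleware's cache ID logic (simplified from memory, not an exact quote of core; details may differ between versions):

  use Symfony\Component\HttpFoundation\Request;

  // Simplified sketch: the cache ID is the full URI plus the request format.
  // Because the query string is part of the URI, /?id=1 and /?id=2 get their
  // own cache_page rows even though the response declares no url.query_args
  // cache context.
  function example_page_cache_cid(Request $request): string {
    return implode(':', [
      $request->getSchemeAndHttpHost() . $request->getRequestUri(),
      $request->getRequestFormat(NULL),
    ]);
  }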

I think this could potentially be exploited to overload websites, or even to crash the MySQL database or other caching software.

Problem 2:
If we were to fix problem 1, a second problem would still pop up. Our example site has some modules enabled, some configuration, some blocks, some real content, ... In particular, a search block is placed in the header of our site on all pages. That block renders a form, and somehow that seems to add "url.query_args" to the cache contexts of all pages. So even if the page_cache cache ID only took into account the query arguments listed in the cache contexts, this behaviour would still pull in the entire query string, causing every query parameter to generate a new page_cache entry.
So we did some searching and found that Drupal core adds "url.query_args" in a few places where it should not be needed at all. See the listing below of what I could find.

Why is this implemented like this?
Probably lots of contrib modules already "rely" on this "bazooka" behaviour...?

core/lib/Drupal/Core/Form/FormBuilder.php:

  /**
   * #lazy_builder callback; renders a form action URL.
   *
   * @return array
   *   A renderable array representing the form action.
   */
  public function renderPlaceholderFormAction() {
    return [
      '#type' => 'markup',
      '#markup' => $this->buildFormAction(),
      '#cache' => ['contexts' => ['url.path', 'url.query_args']],
    ];
  }

modules/user/src/Form/UserPasswordForm.php:

  public function buildForm(array $form, FormStateInterface $form_state) {
    ...
  
    $form['#cache']['contexts'][] = 'url.query_args';

    return $form;
  }

modules/views/views.theme.inc:

function template_preprocess_views_mini_pager(&$variables) {
  ...

  // This is based on the entire current query string. We need to ensure
  // cacheability is affected accordingly.
  $variables['#cache']['contexts'][] = 'url.query_args';

modules/views/src/Plugin/views/style/Table.php:

  public function getCacheContexts() {
    $contexts = [];

    foreach ($this->options['info'] as $field_id => $info) {
      if (!empty($info['sortable'])) {
        // The rendered link needs to play well with any other query parameter
        // used on the page, like pager and exposed filter.
        $contexts[] = 'url.query_args';
        break;
      }
    }

    return $contexts;
  }
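
For what it's worth, the url.query_args cache context can be parameterized per query argument (url.query_args:NAME), so at least some of these spots could declare only the parameters they actually read. A hypothetical narrowing of the table sort handler above, for illustration only (whether this covers every case is exactly the question):

  public function getCacheContexts() {
    $contexts = [];

    foreach ($this->options['info'] as $field_id => $info) {
      if (!empty($info['sortable'])) {
        // Hypothetical: vary only on the query parameters the sortable table
        // actually reads ('order' and 'sort') instead of the whole query
        // string, so unrelated parameters no longer multiply cache entries.
        $contexts[] = 'url.query_args:order';
        $contexts[] = 'url.query_args:sort';
        break;
      }
    }

    return $contexts;
  }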
🐛 Bug report
Status: Active
Version: 11.0 🔥
Component: Cache
Last updated: 3 days ago
Created by: 🇧🇪Belgium weseze


Comments & Activities


  • 🇩🇰Denmark ressa Copenhagen

    A workaround that "fixed" the issue for me is disabling the cache for views with large amounts of data.

    I tried this (I am using Facets module, which requires this) but the View is still getting cached, and the cache_page table is getting views entries, which it should not.

    The only method to stop the view from getting cached is to disable caching for everything under Performance (/admin/config/development/performance), setting "Caching | Browser and proxy cache maximum age" to <no caching>, which is less than ideal ...

    I tried searching for drupal views "Caching:None" ignored but found no hints.

  • 🇫🇷France prudloff Lille

    We also got bitten by this on a website where bot requests massively expanded the size of the SQL database.
    The database_cache_max_rows setting helps with this, but it is not enforced in real time.

    Our current solution to this is (see the settings sketch after this list):

    • switch to a Redis cache and set a memory limit
    • add a rate limit on the server
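
    For reference, a settings.php sketch of both knobs (values are only examples; the bin name for the cache_page table is assumed to be 'page', and the actual Redis memory cap lives in redis.conf via maxmemory / maxmemory-policy):

      // Cap the number of rows per database cache bin. Note: this is only
      // enforced when cron runs, as discussed in the following comments.
      $settings['database_cache_max_rows']['default'] = 5000;
      $settings['database_cache_max_rows']['bins']['page'] = 5000;

      // Or move caches to Redis (redis contrib module) so eviction is handled
      // by the cache backend instead of the SQL database.
      $settings['redis.connection']['interface'] = 'PhpRedis';
      $settings['redis.connection']['host'] = '127.0.0.1';
      $settings['cache']['default'] = 'cache.backend.redis';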
  • 🇺🇸United States gpotter

    Like @prudloff, we ran into the same issue.

    The basic example is any controller that returns a render array. If a bot hits that route with unique query strings, each request creates a new cache_page record; if that bot hammers on the page consistently, you end up with tens or even hundreds of gigabytes of data, because the entire page is stored even though the content is the same. It is namely the Internal Page Cache module that is causing the issue.

    "database_cache_max_rows" doesnt work like prudloff mentioned because those max rows are only restricted on a cron run. So by the time the cron runs the site dies because there are too many records to restrict and delete old records.

    We had similar solution ideas: a Redis cache, or rate limits. Rate limiting is a somewhat concerning solution, as it could potentially rate limit a legitimate web crawler.

    The primary issue on our server is that the site would die on cron runs, so we clear caches nightly before the cron run. A cache clear runs at a good speed compared to the DB row constraint applied by cron, probably because a cache clear is a simple, quick truncate of the cache tables.

  • 🇨🇭Switzerland gagarine

    This issue has been publicly known for years and still poses a serious risk.

    In my opinion, the core problem is that adding irrelevant query parameters still returns a valid page. Instead, Drupal should return a 404 response and avoid caching the page when “fake” query parameters are injected.

    As it stands, adding arbitrary query strings (e.g., ?test, ?foo=bar) bypasses the page cache and triggers full page generation. An attacker can exploit this to cause a denial-of-service (DoS) by flooding the site with requests using unique query parameters, quickly overwhelming the backend.

    What's worse is that Drupal treats these URLs as legitimate and includes them, fake query strings and all, in the HTML it returns. The content is therefore not identical to the page requested at the normal URL.

    This should absolutely be treated as a critical security issue.

  • If this is critical, are you saying that #2526150: Database cache bins allow unlimited growth: cache DB tables of gigabytes! does not work, or that it is ineffective?

    I completely understand that even with row limiting, an attacker can essentially replace useful cached objects with useless ones, and that Drupal could do better here. But can those outcomes render a site unusable or cause database deadlocks? Possibly.

  • 🇬🇧United Kingdom catch

    @cilefen

    If this is critical, are you saying that #2526150: Database cache bins allow unlimited growth: cache DB tables of gigabytes! does not work, or that it is ineffective?

    It's known to be ineffective because it relies on cron trimming the table size. This means that if the table grows to ten times the row limit in between cron runs, you not only have tables growing very large in between cron runs, but the cron logic also has to delete a lot of rows. That doesn't mean the logic is 'bad' as such; it's just not a replacement for a 'real' cache backend like Redis or Memcache, which can prune items more efficiently, both in the pruning itself and in deciding which items to prune.

    @gagarine

    In my opinion, the core problem is that adding irrelevant query parameters still returns a valid page. Instead, Drupal should return a 400 (Bad Request) response

    This would require an 'allowed query parameters' API. But that API would have to validate both the parameter names and the values. And it couldn't stop at the names, because e.g. ?page is a valid parameter whose value has an integer range of 0 to PHP_MAX_INT (or some arbitrary configured limit). And unless it could be reliably implemented to validate the query parameters on a route-specific basis (which may not be the controller; it could be any block, JavaScript, 'destination' handling for a form, etc.), it would still allow more than enough combinations of query parameters to create a very, very large number of cache entries. It would require more knowledge, but it wouldn't prevent anything. And it wouldn't be possible to issue a 400 until Drupal 12, because loads of contrib modules would break.

    One possibility here is to just completely prevent caching if there are any query parameters in the URL; those requests would have to rely on dynamic_page_cache instead.
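
    As an illustration of that last option, a site can already opt individual requests out of the Internal Page Cache with a custom request policy (sketch only; 'mymodule' is a placeholder and the class would need to be registered as a service tagged page_cache_request_policy):

      namespace Drupal\mymodule\PageCache;

      use Drupal\Core\PageCache\RequestPolicyInterface;
      use Symfony\Component\HttpFoundation\Request;

      /**
       * Skips the page cache for any request that carries a query string,
       * leaving those requests to dynamic_page_cache instead.
       */
      class DenyQueryStrings implements RequestPolicyInterface {

        public function check(Request $request) {
          if ($request->getQueryString() !== NULL) {
            return static::DENY;
          }
          // NULL defers the decision to the other registered policies.
          return NULL;
        }

      }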

  • 🇬🇧United Kingdom catch

    The MR will fail at least one method in PageCacheTest, if not elsewhere, because that test method relies on caching pages with query parameters. It would have to be switched to using key value or something.

  • 🇨🇭Switzerland gagarine

    This would require an 'allowed query parameters' API. But that API would have to validate both the parameter names and the values. And it couldn't just do that, because ?pager is a valid value and would have an integer range of 0-PHP_MAX_INT (or some arbitrary configured value).

    Could the value be validated by the module that says it needs these parameters? The validation may require a couple of DB queries, but it could be much lighter than building a full page.

    Where are the query strings included in the HTML returned by the page?

    The full URL is added to the form action and to the page's JS.
    If this were not the case, we could use a checksum to avoid duplication in the cache. But adding a parameter would still allow the cache to be bypassed.

  • 🇬🇧United Kingdom catch

    Could the value be validated by the module that says it needs these parameters? The validation may require a couple of DB queries, but it could be much lighter than building a full page.

    For views, which is the most likely way to end up with a pager query parameter, you would need to build and execute the entire view to validate the parameter and find out how many possible pages there might be; at that point you have pretty much built the page already.

    The full URL is added to the form action and to the page's JS.
    If this were not the case, we could use a checksum to avoid duplication in the cache. But adding a parameter would still allow the cache to be bypassed.

    That's by design: query parameters can be used to pre-fill form values and other things that affect how the form is built, and also for the ?destination parameter.

    📌 Allow both AJAX and non-AJAX forms to POST to dedicated URLs (postponed) might allow the form action to be consistent each time, but that issue has had very little progress in the past nine years because it will be incredibly hard to implement.
