Merging cache metadata is slow with thousands of tags added.

Created on 19 October 2023, over 1 year ago

Problem/Motivation

The situation where I see merging take a long time is huge GraphQL requests (thousands of entities, using https://www.drupal.org/project/graphql ).

When GraphQL resolves fields, resolvers can add cache metadata to the object that holds the metadata for the whole request, and many tags end up being added multiple times.
Every call to addCacheTags performs a merge of these tags, so once a few thousand tags have accumulated, each newly added batch of tags triggers array_unique and array_merge over thousands of values.

In one of my tests, Cache::mergeTags was called 71,503 times, with an average of 2,099 tags per call.

I would like to hear whether Drupal itself could benefit from optimizing the merge; I'm not sure if this is only a GraphQL issue, or just my edge case.

The attached patch will certainly fail tests; it is included only in case someone would like to do quick profiling.
In my case, it reduces this query https://pastebin.com/ULmCsr3L from 12-13s to 4-5s (without Xdebug/profiler; the profiler doesn't show the same relative difference).
In the patch I've also changed \Drupal\Core\Access\AccessResult::inheritCacheability: the Cache::mergeMaxAges call was redundant, since addCacheableDependency already performs it. That seems like a safe change and could be split into a separate issue, although it doesn't change much in terms of performance.

Steps to reproduce

  • Create a CacheableMetadata object
  • Keep adding tags to it
  • Profile

Proposed resolution

Change addCacheTags()/addCacheContexts() so that the merge is only performed when needed, for example only when getCacheTags()/getCacheContexts() is called; see the sketch below.
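
A minimal sketch of the idea, with hypothetical names (LazyCacheableMetadata and $tagsDirty are made up for illustration, not existing core API): adds are cheap appends, and array_unique only runs on read, when new tags have arrived since the last read.

    // Hypothetical object for illustration only; core's CacheableMetadata
    // and the cacheability traits would need equivalent changes.
    class LazyCacheableMetadata {

      /** @var string[] May contain duplicates between reads. */
      protected array $cacheTags = [];

      /** @var bool TRUE when tags were added since the last deduplication. */
      protected bool $tagsDirty = FALSE;

      public function addCacheTags(array $cache_tags): static {
        // Cheap append: no array_unique() on every call.
        array_push($this->cacheTags, ...$cache_tags);
        $this->tagsDirty = TRUE;
        return $this;
      }

      public function getCacheTags(): array {
        // Deduplicate once, on read, instead of on every merge.
        if ($this->tagsDirty) {
          $this->cacheTags = array_values(array_unique($this->cacheTags));
          $this->tagsDirty = FALSE;
        }
        return $this->cacheTags;
      }

    }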

Remaining tasks

User interface changes

None

API changes

TBD

Data model changes

-

Release notes snippet

-

📌 Task
Status

Active

Version

11.0 🔥

Component
Cache 

Last updated 1 day ago

Created by

@reinfate

Merge Requests

Comments & Activities

  • Issue created by @reinfate
  • last update over 1 year ago
    Custom Commands Failed
  • last update over 1 year ago
    Build Successful
  • Status changed to Needs review over 1 year ago
  • Status changed to Needs work over 1 year ago
  • 🇺🇸United States smustgrave

    Seems to have caused test failures.

  • Pipeline finished with Failed
    2 months ago
    Total: 138s
    #342068
  • So, another attempt at this.

    Here is a simple reproduction:

        // Build 20,000 render array items, each with its own cache tag.
        $markups = array_fill(0, 20000, 'markups');
        $build = [];
        foreach ($markups as $i => $markup) {
          $build[] = [
            '#markup' => $markup,
            '#cache' => [
              'tags' => [(string) $i],
            ],
          ];
        }
        $render = \Drupal::service('renderer')->render($build);
        return new Response($render);
    

    Rendering this with #cache takes about 5-8 times longer on my machine than without it, mainly because array_unique is called 20,000 times on an array that grows from 1 tag to 20,000 tags.

    Changing how the public methods of Cache and the cacheable objects work (to avoid calling array_unique on every cache tag merge) will inevitably fail tests and could break some user code.
    That said, I've checked that every cache backend in core runs another array_unique before writing to the cache, so I would guess that even if we removed the array_unique call from the \Drupal\Core\Cache\Cache merge methods, everything should still work.

    To be safe, though, I've added additional mergeFast methods to the Cache-related classes and used them in the Renderer.
    The attached screenshot shows the profiler output isolated to the \Drupal::service('renderer')->render($build); call from the example.

    Let's see if this fails tests.
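
    A rough sketch of the mergeFast idea (my guess at the shape, not the actual MR code): merge without deduplicating, and let the backends' existing array_unique call strip duplicates at write time.

        // Hypothetical standalone version of such a merge, for illustration.
        function merge_tags_fast(array ...$cache_tags): array {
          // Plain array_merge(), no array_unique(): duplicates are tolerated
          // in memory and stripped once when the backend writes the item.
          return array_merge(...$cache_tags);
        }

        $tags = merge_tags_fast(['node:1', 'node:2'], ['node:2', 'user:3']);
        // ['node:1', 'node:2', 'node:2', 'user:3'] (duplicates kept)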

  • Pipeline finished with Failed
    2 months ago
    Total: 660s
    #342078
  • Pipeline finished with Failed
    2 months ago
    Total: 507s
    #342119
  • So it seems only \Drupal\Tests\views\Kernel\Handler\FieldRenderedEntityTest::testRenderedEntityWithoutAndWithField failed.
    The problem with this test is that it checks that the #cache tags and contexts are exact arrays, in \Drupal\Tests\views\Kernel\Handler\FieldRenderedEntityTest::assertCacheabilityMetadata. But these render arrays don't have #cache['keys'], so the updated renderer does not remove duplicates from the cache metadata (it assumes the element will not be cached anyway).
    So we can either modify the test, or modify Renderer/RenderContext to apply the cacheability to the element when the RenderContext size is 1 (which indicates it is the top-most element of the current render() call).

    Moving to needs review, since I would like feedback on this whole issue before proceeding further.

  • 🇨🇭Switzerland berdir Switzerland

    First, I don't think the cache tags system was ever meant to handle thousands of them. Merging is one thing, but it also implies that checking them has to do queries with thousands of placeholders, which generally doesn't seem like a great idea.

    At that scale, it might be better to just switch to list cache tags?
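
    For example, instead of one tag per listed entity, a listing can depend on the entity type's list cache tag (this sketch uses the real EntityTypeInterface::getListCacheTags() API; $metadata stands for any refinable cacheability object):

        // Instead of ['node:1', 'node:2', ...] with thousands of entries:
        $node_type = \Drupal::entityTypeManager()->getDefinition('node');
        // Typically ['node_list']: one tag, invalidated whenever any node is
        // created, updated or deleted, at the cost of coarser invalidation.
        $metadata->addCacheTags($node_type->getListCacheTags());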

    That's even more true for cache contexts: there should really only be a handful of those. I was just thinking about the time I once accidentally implemented a cache context with high variation. The problem with contexts is that they are sticky: if you, for example, add a cache context parametrized by node ID to the breadcrumb block, over time you will accumulate thousands of unique cache contexts there.

    That said, I've definitely seen considerable overhead from merging cacheability too, but it also tends to be exaggerated by xhprof/blackfire, as those tools add a lot of overhead to code with many small but fast function calls, for example on the permissions page. So looking into improving this is definitely worth a try.

  • 🇬🇧United Kingdom catch

    That's even more true for cache contexts: there should really only be a handful of those. I was just thinking about the time I once accidentally implemented a cache context with high variation. The problem with contexts is that they are sticky: if you, for example, add a cache context parametrized by node ID to the breadcrumb block, over time you will accumulate thousands of unique cache contexts there.

    I think this is fixed in VariationCache now, although I can't remember exactly how it was done...

  • 🇧🇪Belgium kristiaanvandeneynde Antwerp, Belgium

    Okay, so let's start with the positives. I like how you:

    1. Managed to profile this all the way down to the merging of cache tags
    2. Considered that a render array without cache keys won't be cached, so we can skip some processing there

    The second point deserves separate consideration, to see if we can't optimize further for the parts of render arrays that won't be cached.

    Having said that, there's some more light to be shed on the issue at hand. Let's get the obvious out of the way first: If you ever have a render array or CacheRedirect with thousands of cache contexts, you are doing something wrong as your whole page would not only become uncacheable, but actually slower.

    Likewise, having thousands of cache tags is not going to be doing you any favors either and at that point you're probably better off thinking of other approaches. So it makes absolute sense for the current Cache class to call array_unique the way it does, because the expectation was never that it would be called thousands of times on large data sets during a single request.

    Now, this leaves one use case we might want to address where you still run into the issue you have here: You are adding only a handful of cache tags, but you're adding them over and over and over again. To me, this is a red flag and something that probably doesn't happen all that often.

    I'm inclined to say we shouldn't fix this in core, but rather have GraphQL account for the fact that it manipulates cacheable metadata thousands of times during a request, and perhaps optimize for that. This could be as simple as a custom CacheableMetadata object that incorporates the logic from the MR and, at the very end of "adding fields", adds that object's metadata to the actual response once.

    So the MR in its current shape doesn't quite cut it for me, personally. I'd rather see us revisit what we do with cacheable metadata during the render process, as demonstrated in the MR, but I'm not sure it would be such a performance gain. A great first step would be to use the PerformanceTestTrait to gather metrics on what currently happens, and then see how much it improves if we change the renderer.

    Keep in mind that the renderer is a crucial part of Drupal, so we shouldn't be changing it lightly without enough evidence to back up the necessity of the change. We should also make sure we have ample tests covering us when we do make changes.

    Finally, we could also consider changing RefinableCacheableDependencyTrait and CacheableDependencyTrait so that array_unique is only called on a get, and only when new tags have been added since the last get. That would probably be a net gain across the board, but it requires refactoring how the two traits interact, as they would need to share some state.

    This avenue is worth exploring and I would definitely support that change, but not until we have the profiling described above in place, to see if it even makes sense to carry out.

    The code we currently have in StandardPerformanceTest looks like this:

        $recorded_queries = $performance_data->getQueries();
        $this->assertSame($expected_queries, $recorded_queries);
        $this->assertSame(34, $performance_data->getQueryCount());
        $this->assertSame(124, $performance_data->getCacheGetCount());
        $this->assertSame(45, $performance_data->getCacheSetCount());
        $this->assertSame(0, $performance_data->getCacheDeleteCount());
        $this->assertSame(36, $performance_data->getCacheTagChecksumCount());
        $this->assertSame(43, $performance_data->getCacheTagIsValidCount());
        $this->assertSame(0, $performance_data->getCacheTagInvalidationCount());
    

    We should probably add some tracking to the traits mentioned above and gather metrics on those too. It might not cover every cache tag manipulation, since we're not guaranteed that everyone uses those traits, but most classes do.
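
    One throwaway way to get those numbers (purely illustrative, not core code) is a global counter that the trait methods increment while profiling:

        // Illustrative debugging hack: counters the trait methods could
        // bump temporarily, then dump at the end of the request.
        final class CacheMergeStats {

          public static int $addCacheTagsCalls = 0;
          public static int $arrayUniqueCalls = 0;

          public static function report(): string {
            return sprintf('addCacheTags(): %d calls, array_unique(): %d calls',
              self::$addCacheTagsCalls, self::$arrayUniqueCalls);
          }

        }

        // Temporarily, inside RefinableCacheableDependencyTrait::addCacheTags():
        //   CacheMergeStats::$addCacheTagsCalls++;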

  • 🇧🇪Belgium kristiaanvandeneynde Antwerp, Belgium

    To clarify: I don't think we need permanent coverage of the adding and getting of cache tags; that would be rather pointless and hard to pull off, as we can't decorate a trait. But for the sake of this issue, we could hack into the traits, add some coverage, and see how many times we call array_unique there.

    Then we can create a second MR with both the coverage and the fix to see the difference, and finally an MR with just the fix.
