Separate cache operations from database queries in OpenTelemetry and assertions

Open on Drupal.org →

Created on 23 October 2023, almost 2 years ago

Updated 6 February 2024, over 1 year ago

Problem/Motivation

PerformanceTestBase currently relies on parsing database queries to get cache operations for assertions, and sends all database queries to OpenTelemetry. This was OK as an initial implementation but has a couple of drawbacks:

1. Because several core cache bins use chained fast, whether or not a cache get is recorded depends on apcu hits or misses. An apcu hit or miss depends on when the apcu entry was added vs the last write timestamp, and that is inherently variable because it depends on which unix second something falls into.

This turned out to be the reason for variable cache and database counts in performance tests, so this issue allows those to use assertSame() which will both prevent random test failures and also be a lot stricter when real changes occur.

2. dozens or hundreds of database queries in OpenTelemetry/Grafana Tempo is hard to scan, if we show cache operations distinctly, it's easier to look for database queries, cache gets, cache sets in the list and see what's going on.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

📌 Task

Status

Fixed

Version

11.0 🔥

Component

Last updated 12 days ago

Maintained by
🇬🇧United Kingdom @alexpott
🇨🇭Switzerland @berdir
🇩🇪Germany @dawehner
🇮🇹Italy @mondrake

Created by

🇬🇧United Kingdom catch

Live updates comments and jobs are added and updated live.

gander

Performance
It affects performance. It is often combined with the Needs profiling tag.

Sign in to follow issues

Merge Requests

!6204Separate cache operations from database queries in OpenTelemetry and assertions
Open
🇬🇧United Kingdom catch
updated over 1 year ago

Comments & Activities

Issue created by @catch
Comment over 1 year ago →
🇫🇷France andypost
Next kind of measure could be serialisation
Comment over 1 year ago →
🇬🇧United Kingdom catch
Assuming 📌 Allow assertions on the number of database queries run during tests RTBC and 📌 Add database query spans to otel traces Needs review land before this one, we should handle both assertions and open telemetry spans here.

Implementation something like:

1. Add a decorator for all cache backends (webprofiler module has one we can borrow from) which times cache operations to the performance_test module that's already in core.

2. Replace the database query handling for cache operations with the information we collect directly from the cache API - we can throw away the database queries since they're duplicates.

3. Add the cache operations to open telemetry (again in place of the database queries for cache operations) so they can be browsed there - cid, bin and maybe backtrace would all be useful here.
Comment over 1 year ago →
🇬🇧United Kingdom catch
Some of this happened in 📌 Allow assertions on the number of database queries run during tests RTBC but we should decorate the cache services (and cache tags, and key value) instead of looking at database query backtraces so it works generically and can also be separated in Open Telemetry traces.
Merge request !6204Initial implementation for recording cache operations via a decorator. → (Open) created by catch
Status changed to Needs work over 1 year ago5:18pm 16 January 2024
Comment over 1 year ago →
🇬🇧United Kingdom catch
Not there yet but made a start.
Comment over 1 year ago →
🇬🇧United Kingdom catch
Opened a follow-up for cache tags 📌 Collect cache tags checksum queries via a decorator Postponed . I think we should separate them out via the same mechanism that we've been doing for cache operations before that issue (i.e. checking the query caller), because a decorator can't decorate the code that actually runs the query since it's buried in the checksum implementation, we'd need to subclass rather than decorate, which is not necessarily cleaner.

MR is green now, but I had to adjust a lot of cache assertions, not sure if that's because we've fixed bugs in the tests or introduced them yet.
Comment over 1 year ago →
🇬🇧United Kingdom catch
but I had to adjust a lot of cache assertions, not sure if that's because we've fixed bugs in the tests or introduced them yet.

OK figured it out. Previously we were watching only database queries, now we're watching cache operations, this includes chained fast, which includes apcu hits from chained fast, which don't hit the database - so we're seeing things we couldn't previously. This also might explain some of variation in cache get recording we've seen, if it depends whether things are in apcu or not.

I think it would be a good idea to add the cache backend class to the spans, so it's more obvious which are from the database backend and which are from chained fast, but that could potentially be a follow-up. We could also detect cache hits and misses now (because we have access to the return values in the decorator), but not sure if that's useful when we already have gets vs. sets vs. deletes.
Status changed to Needs review over 1 year ago10:38pm 16 January 2024
Comment over 1 year ago →
🇬🇧United Kingdom catch
Might see if I can remove the ranges from some cache assertions, not sure if that's possible or not but it might be, otherwise could use review.
Comment over 1 year ago →
🇧🇪Belgium kristiaanvandeneynde Antwerp, Belgium
I love the idea of CacheBackendDecorator. In a similar fashion, we could even create a second decorator that tracks the cache hit ratio by comparing $cids_copy and $cids. Obviously not the goal of this decorator, but a nice thought nonetheless.

I'm a bit puzzled as to why some cache counts go down. I'd expect more logging to mean all of the counts go up? #8 seems to point to more hits across the board too because we now also find cache operations that don't hit the DB, yet in testAnonymous some cache hit assertions are adjusted downwards.

Also left 2 nitpicks, leaving at NR because they're so tiny
Comment over 1 year ago →
🇬🇧United Kingdom catch
I love the idea of CacheBackendDecorator

It is somewhat stolen from https://www.drupal.org/project/webprofiler → although the only bit that was really re-usable more or less as-is was the factory decorator.

we could even create a second decorator that tracks the cache hit ratio by comparing $cids_copy and $cids.

I think we could probably add that to this one? Not necessarily in this issue though.

I'm a bit puzzled as to why some cache counts go down.

Yeah me too - going to OpenTelemetry-ify the standard performance test and see if I can compare before/after. There is some inherent variability to these tests because they also request css and js aggregates and image derivatives which are Drupal bootstraps until the files exist, which makes them more realistic but less repeatable..
Comment over 1 year ago →
🇬🇧United Kingdom catch
I'm a bit puzzled as to why some cache counts go down.

I wondered if by recording chained fast gets that might get rid of the variability altogether, because it wouldn't matter whether things are in apcu or not, or what the last clear timestamp is compared to which millisecond they were written in... and that seems to be the case! So have switched all of standard performance test over to assertSame(). Will see if it's possible to tighten the query counts next.
Comment over 1 year ago →
🇬🇧United Kingdom catch
This has fixed 📌 Investigate variable database query counts in performance tests Active . Bumping to major because the shift from ranges to assertSame() makes everything much more robust both for false positives and false negatives.
Comment over 1 year ago →
🇬🇧United Kingdom catch
Comment over 1 year ago →
🇬🇧United Kingdom catch
Comment over 1 year ago →
🇬🇧United Kingdom catch
Comment over 1 year ago →
🇬🇧United Kingdom catch
Adding a screenshot of how the cache spans look in grafana tempo after this change.
Comment over 1 year ago →
🇸🇮Slovenia slashrsm
Comment over 1 year ago →
🇧🇪Belgium kristiaanvandeneynde Antwerp, Belgium
Going from random ranges to exact comparisons is huge so for that alone this looks great.

As far as I am concerned, this is RTBC, but I'll hold off in case @catch wants to add more sugar to this. So feel free to RTBC yourself if you feel this is complete. I am disregarding the Nightwatch test fail as I assume that one is random, but please do double-check by running said pipeline again.

I would like to see a follow-up where we mull whether we want to compare $cids_copy to $cids to determine the cache hit ratio, but compared to the benefit of the work being done here I would regard that feature as minor or normal at best.
Pipeline finished with Failed
over 1 year ago
#81257
Pipeline finished with Success
over 1 year ago
#81260
Comment over 1 year ago →
🇳🇱Netherlands kingdutch
catch → credited Kingdutch → .
Comment over 1 year ago →
🇬🇧United Kingdom catch
Opened 📌 Separate single and multiple cache operations in performance test assertions Active and 📌 Add cache hit and miss detection to performance tests Active following a slack discussion with @kristiaanvandeneynde and @KingDutch.

I adjusted the decorator so that every method calls the respective decorated method instead of re-using the *multiple methods, this should be more robust for different cache implementations and opens up counting single and multiple operations separately. Also opened 📌 Add minimal backtrace to cache spans in performance tests Active for adding backtrace info to spans, which should be simple but this MR is getting big enough and blocks several other issues now.
Status changed to RTBC over 1 year ago11:54am 23 January 2024
Comment over 1 year ago →
🇧🇪Belgium kristiaanvandeneynde Antwerp, Belgium
Pipelines go green, code looks good and I agree we can tackle the other remarks in the follow-ups. I would recommend to use strict type checking in the in_array() calls.

So RTBC, but please fix the three in_array() calls upon commit.
Pipeline finished with Failed
over 1 year ago
#81309
Pipeline finished with Success
over 1 year ago
#81373
Comment over 1 year ago →
🇬🇧United Kingdom alexpott 🇪🇺🌍
Committed 9c61822 and pushed to 11.x. Thanks!
Status changed to Fixed over 1 year ago4:30pm 23 January 2024

Comment over 1 year ago →

alexpott → committed 9c618223 on 11.x

Issue #3396196 by catch, kristiaanvandeneynde, Kingdutch: Separate cache...

Comment over 1 year ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.

contrib.social Blog FAQ Discussions

Production build 0.71.5 2024