ResourceObjectNormalizer::getNormalization can result in max-age drift when different sets of fields are requested

Created on 30 April 2024, 7 months ago

Problem/Motivation

We recently had a 3rd party team start using one of our APIs. Almost immediately we started seeing data issues on both the app and our website that we'd never seen before. These data issues seemed to stem from things being cached for too long. After a huge effort from multiple developers we tracked it down to max-age drift in several areas of the site. The most noteable being the jsonapi_normalizations cache bin.

We have an opening_hours computed field which caches the result "until midnight". This is represented by the number of seconds until midnight as a max age. This worked perfectly until this other team started using our API. We started seeing things being cached until random times just after midnight, such as 00:03, 1:23, etc.

I have tracked this issue down to ResourceObjectNormalizer::getNormalization and subsequently ResourceObjectNormalizationCacher::set

What happens is:
1. Application A requests jsonapi/node/store with fields title,opening_hours - this generates a max age of seconds "until midnight"
2. Drupal caches the normalized data with an expiry of midnight (converted to UTC of course) in the jsonapi_normalizations cache bin.
3. Application B requests jsonapi/node/store with fields title,path,opening_hours
4. ResourceObjectNormalizer::getNormalization retrieves the cache entry from step 2. That cache entry contains the normalized data, and also includes the max age until midnight that was generated at the Application A requested the data
5. ResourceObjectNormalizer::getNormalization sees a new field has been requested, serializes it, and enters this code block in order to recache the data
6. On terminate, we enter ResourceObjectNormalizationCacher::set here and bubble all the cacheable metadata from the normalized fields. Again, this includes the max age generated in step 1.
7. We cache the new normalized data with the same number of seconds from step 1, except we are now in the future and "now" + that max age is after midnight!

Steps to reproduce

See above.

Proposed resolution

Potentially - instead of bubbling max ages, translate the expire value of the cache entry itself back into a max age and set that?

Remaining tasks

Figure out the fix
Tests
Fix
Review
Commit

Release notes snippet

πŸ› Bug report
Status

Active

Version

11.0 πŸ”₯

Component
JSON APIΒ  β†’

Last updated about 16 hours ago

Created by

πŸ‡¦πŸ‡ΊAustralia acbramley

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @acbramley
  • πŸ‡¦πŸ‡ΊAustralia acbramley
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10
  • πŸ‡¦πŸ‡ΊAustralia acbramley
  • πŸ‡¦πŸ‡ΊAustralia acbramley

    πŸ“Œ Add #cache['downstream-ttl'] to force expiration after a certain time and fix #cache['max-age'] logic by adding #cache['age'] Active reports this exact issue in #11

    I've confirmed that the drifted max age bubbles up to DPC resulting in the entire page being cached for too long as well.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Hmm, interesting bug. Was just talking about something similar with a colleague earlier this week.

    Not sure is this is truly jsonapi or more caching. I'll come back to this issue soon.

  • πŸ‡¦πŸ‡ΊAustralia acbramley

    Not sure is this is truly jsonapi or more caching

    True it could exist in other areas, I'm not sure if other parts of the caching system cache things like this though. The issue really stems from that fact that we get a cached item (the normalized fields) which already contain max ages, add stuff on to that same cache entry, and then re cache it with a mix of new and existing data.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Ok interesting, fair enough. I'll put in on my list and will probably start to see if i can create some failing tests.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    We have ran into this issue, and my colleague @casey has found this problem happening in other parts of the caching system also. We ran into this because we use max-age in combination with a time for anonimous visitors to have access to a private file and we noticed that cache was not invalidated a tthe right time. This was also a combination of different caches that stack their max-age and therefor ending up with caches that don't invalidate at the right time. This gets even worse, when certain caches invalidate or get purged because of cache size this can actually result in issues around an invalid state which results in WSOD's. We've seen this in cron and certain very big sites.

    As we talked about this this morning, we think we might need a META issue around this, where max-age should consider expires and be recalculated if a later cache uses earlier caches. This would probably fix a lot of issues.

    I hope i can make a good writeup soon so we can work on this. We can probably fix this in steps though and start calculating max-age based on bubbled max-age and expires.

    Caching is hard lol

    adding some related issues also

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Add parent issue, since me and casey think we can solve this globally.

  • Merge request !8154Set proper expire headers, based on #3447821 β†’ (Closed) created by bbrala
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Since solving the main issue is a web of tasks and unsolved questions (it seems), i've extracted the jsonapi part in a mr. Would love to get some feedback on this approach, so gonna post it here, might even be able to do a local fix here, without touching the others.

    The code should apply max-age properly this way based on the time of the request. We use this pattern from the parent issue, which we succesfully use in production (see this comment β†’ in #2352009: [pp-3] Bubbling of elements' max-age to the page's headers and the page cache β†’ ).

  • Status changed to Needs review 6 months ago
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands
  • Pipeline finished with Success
    6 months ago
    Total: 571s
    #179604
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Code looks good, is there a way we can add a test for this behaviour?

  • πŸ‡¦πŸ‡ΊAustralia acbramley

    I will test this with our previous use case

  • πŸ‡³πŸ‡±Netherlands casey
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    I'll think up a test Friday with casey. Didn't get there tonight.

    Adding casey since we worked on this together.

  • Pipeline finished with Success
    6 months ago
    Total: 511s
    #179779
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Note:

    Probably use ResourceObjectNormalizerCacherTest and use a dataprovider to create different set's of fields and check variationcache for the proper max age.

    We can use the request stack getCurrectRequest to set the time in the request and see if invalidation is ok. That test should then fail without the code.

    Think that should work.

  • Pipeline finished with Failed
    6 months ago
    Total: 196s
    #180965
  • Pipeline finished with Failed
    6 months ago
    Total: 215s
    #180971
  • Pipeline finished with Canceled
    6 months ago
    Total: 181s
    #180974
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Added a test that tests the max age result in a normalization. It tests default max-age, then little later since last request, and lastly after it expires. I also added the code suggestion from casey earlier.

    Test only run fails: https://git.drupalcode.org/issue/drupal-3444257/-/jobs/1675965 yay.

    Think this is covering it quite well. And only like 0,7 seconds of testing.

    Al ready for review again

  • Pipeline finished with Failed
    6 months ago
    Total: 766s
    #180979
  • Pipeline finished with Failed
    6 months ago
    Total: 509s
    #180993
  • Pipeline finished with Canceled
    6 months ago
    Total: 136s
    #181036
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Warning was missing 'account' from the context for field serialization. Fixed right now by passing 'account' => NULL, but also added a followup to fix the warning at its root: πŸ“Œ Calling normalize without account context generates a warning. Active .

    The created timestamp can be a float, so we need to floor the expires value so that we can calculate a proper max_age. We use floor because if it
    expires on timestamp 1720000.5 we need to make sure it expires on timestamp 1720001.

    Locally this fixed the tests, think all is ok now for real ;)

  • Pipeline finished with Success
    6 months ago
    Total: 1904s
    #181039
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    We talked this over some more during lunch lol ;p

    Expires is a bad idea. This is create time of the cache, this could mean the cache was requested, but created seconds later if the request is slow. Since everything everywhere uses request time for caches, this is wrong and will result in silly bugs if we are not carefull.

    When looking through core i did notice this code in Drupal\views\Plugin\views\cache\CachePluginBase

    
      public function cacheGet($type) {
        $cutoff = $this->cacheExpire($type);
        switch ($type) {
          case 'query':
            // Not supported currently, but this is certainly where we'd put it.
            return FALSE;
    
          case 'results':
            // Values to set: $view->result, $view->total_rows, $view->execute_time,
            // $view->current_page.
            if ($cache = \Drupal::cache($this->resultsBin)->get($this->generateResultsKey())) {
              if (!$cutoff || $cache->created > $cutoff) {
    

    That is kinda scary, unrelated, but why would that use created, and not when it expires...

    Anyways, unrelated ;)

  • Pipeline finished with Success
    6 months ago
    Total: 657s
    #181181
  • Pipeline finished with Canceled
    6 months ago
    Total: 102s
    #181881
  • Pipeline finished with Failed
    6 months ago
    Total: 189s
    #181885
  • Pipeline finished with Success
    6 months ago
    Total: 530s
    #181895
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Added test for specifically 0 seconds.
    Ended up being convinved to change the loop to caseys suggestion.

    Do wonder if the fact that REQUEST_TIME+1600 gived a max-age of 0 is correct. Shouldn't that be a fresh record with 800?

  • Pipeline finished with Success
    6 months ago
    Total: 657s
    #183044
  • First commit to issue fork.
  • Merge request !8431Test only fix β†’ (Open) created by mxr576
  • πŸ‡­πŸ‡ΊHungary mxr576 Hungary

    I did not have a project where I would suffer from the described problem, but based on the issue description and the attached test I think the described problem is addressed.
    I had some nitpicks and also spin up a test-only branch just to confirm that the test fails.

  • Pipeline finished with Failed
    5 months ago
    Total: 191s
    #200995
  • Status changed to Needs work 5 months ago
  • The Needs Review Queue Bot β†’ tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

    This does not mean that the patch necessarily needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

    Consult the Drupal Contributor Guide β†’ to find step-by-step guides for working with issues.

  • πŸ‡­πŸ‡ΊHungary mxr576 Hungary

    mxr576 β†’ changed the visibility of the branch 3444257-test.only to hidden.

  • Pipeline finished with Success
    5 months ago
    Total: 612s
    #201854
  • Pipeline finished with Canceled
    5 months ago
    Total: 75s
    #201873
  • Pipeline finished with Success
    5 months ago
    Total: 605s
    #201874
  • Status changed to RTBC 5 months ago
  • πŸ‡­πŸ‡ΊHungary mxr576 Hungary

    Fixed the things that I have spotted, RTBC from my part.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    <3

  • πŸ‡³πŸ‡ΏNew Zealand quietone

    I read the IS, comments and the MR (not a code review). All questions are answered. The only thing I spotted was that the proposed resolution is a question.

    Leaving at RTBC

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Updates a few small things in IS, thanks

  • πŸ‡¦πŸ‡ΊAustralia acbramley

    Great job guys, this is a good one to get in!

  • Status changed to Needs review 4 months ago
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Left a question on the MR

    Also keen to see this go in, but the test makes me think we're left with a performance regression in place of the bug.

  • Status changed to RTBC 4 months ago
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Hmm, digging into my memory i think the following reasoning was important here. I've gone through this with @casey to see if what you say is true. We've updated the comments in the tests and adjusted the test with 1600 to 801 to hopefully be more clear on the intention of the tests.

    To awnser your comment: Cache entries track expire time, the cache record is valid in the second it invalidates. See cache backends:

    $cache->valid = $cache->expire == Cache::PERMANENT || $cache->expire >= $this->time->getRequestTime();
    

    The cache should be renewed the next second after expiring. A max-age of 0 means the cache is still valid but will not be next second. Invaliding the cache is still just done in the cache backend through the expire data. We are not changing anything on that layer. The only thing the max-age does is tell the layers in front of drupal (nginx, a CDN or whatever) it is about to expire.

    Should we disable the cache on max-age 0 by returning FALSE in the ResourceObjectNormalizationCacher::get method we should pretty much get the same data because the cache has not expired yet.

    Whats important to note is that the test only tests the ResourceObjectNormalizationCacher and its logic around max-age. We are not testing the cache backend and handling of expire data. We assume that is correct. (although i didn't find any tests around the expire=0 and handling of that case).

  • Pipeline finished with Canceled
    4 months ago
    Total: 79s
    #231030
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Also rebased. :x

  • Pipeline finished with Failed
    4 months ago
    Total: 481s
    #231031
  • Status changed to Needs work 4 months ago
  • πŸ‡³πŸ‡ΏNew Zealand quietone

    Left some comments in the MR and their is a failing test :-(.

  • Pipeline finished with Canceled
    4 months ago
    Total: 156s
    #240593
  • Pipeline finished with Success
    4 months ago
    Total: 477s
    #240594
  • Status changed to Needs review 4 months ago
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Added comments to the two parts you gave feedback on quietone.

    Also fixed the test, i updated the the comment to the test based on a deepdive, and accidentally updated the tested value. Sorry about that.

    See https://git.drupalcode.org/project/drupal/-/merge_requests/8154/diffs?co...

  • Status changed to RTBC 3 months ago
  • πŸ‡ΊπŸ‡ΈUnited States smustgrave

    Believe feedback has been addressed on this one.

  • Pipeline finished with Success
    3 months ago
    Total: 530s
    #249135
  • πŸ‡³πŸ‡ΏNew Zealand quietone

    @bbrala, thanks for the comments.

    I rebased the MR and there were no conflicts. Then I removed an @see to the issue because we don't link to the core issues in the code base, And I also changed the format of the comments to follow the recommendations in the coding standards. See https://www.drupal.org/docs/develop/standards/php/api-documentation-and-... β†’ .

    Since the changes did not change the code and only reformatted comments, I am leaving the RTBC.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Bit embarrassing I'm wrongly formatting a comment.

    Thanks :)

  • πŸ‡¬πŸ‡§United Kingdom catch

    I found the comment a bit hard to read and suggested on possible change, although not solid enough to use gitlab suggestions for it.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Tried to make the comment a bit more understandable. It is still hard imo, but more clean in many ways :)

    RTBC since its only a small comment change.

    • catch β†’ committed 5670bba3 on 10.3.x
      Issue #3444257 by bbrala, mxr576, quietone, acbramley, larowlan, casey,...
    • catch β†’ committed c2ebccf8 on 10.4.x
      Issue #3444257 by bbrala, mxr576, quietone, acbramley, larowlan, casey,...
    • catch β†’ committed c9b15c8c on 11.0.x
      Issue #3444257 by bbrala, mxr576, quietone, acbramley, larowlan, casey,...
    • catch β†’ committed 9bfcb369 on 11.x
      Issue #3444257 by bbrala, mxr576, quietone, acbramley, larowlan, casey,...
  • πŸ‡¬πŸ‡§United Kingdom catch

    That looks better, definitely can't improve on it at least!

    Committed/pushed to 11.x and cherry-picked back through to 10.3.x, thanks!

  • πŸ‡¬πŸ‡§United Kingdom catch
  • Pipeline finished with Failed
    about 1 month ago
    Total: 483s
    #303908
  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    Wow! 😳

    Seems like this was wrong ever since 2019, where I worked on this in #2819335: Resource (entity) normalization should use partial caching β†’ ? AFAICT the big refactor in πŸ“Œ Refactor ResourceObjectNormalizationCacher to use VariationCache Fixed didn't change this.

    Looks like the entire "D8 cacheability" effort back in 2013–2015 overlooked the general problem: πŸ› Stacked caches result in max-age drift in caches Active 😬

  • Status changed to Fixed 25 days ago
  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024