ResourceObjectNormalizer::getNormalization can result in max-age drift when different sets of fields are requested

Created on 30 April 2024, about 2 months ago
Updated 18 June 2024, 10 days ago

Problem/Motivation

We recently had a 3rd party team start using one of our APIs. Almost immediately we started seeing data issues on both the app and our website that we'd never seen before. These data issues seemed to stem from things being cached for too long. After a huge effort from multiple developers we tracked it down to max-age drift in several areas of the site. The most noteable being the jsonapi_normalizations cache bin.

We have an opening_hours computed field which caches the result "until midnight". This is represented by the number of seconds until midnight as a max age. This worked perfectly until this other team started using our API. We started seeing things being cached until random times just after midnight, such as 00:03, 1:23, etc.

I have tracked this issue down to ResourceObjectNormalizer::getNormalization and subsequently ResourceObjectNormalizationCacher::set

What happens is:
1. Application A requests jsonapi/node/store with fields title, opening_hours - this generates a max age of seconds "until midnight"
2. Drupal caches the normalized data with an expiry of midnight (converted to UTC of course) in the jsonapi_normalizations cache bin.
3. Application B requests jsonapi/node/store with fields title, path, opening_hours
4. ResourceObjectNormalizer::getNormalization retrieves the cache entry from step 2. That cache entry contains the normalized data, and also includes the max age until midnight that was generated at the Application A requested the data
5. ResourceObjectNormalizer::getNormalization sees a new field has been requested, serializes it, and enters this code block in order to recache the data
6. On terminate, we enter ResourceObjectNormalizationCacher::set here and bubble all the cacheable metadata from the normalized fields. Again, this includes the max age generated in step 1.
7. We cache the new normalized data with the same number of seconds from step 1, except we are now in the future and "now" + that max age is after midnight!

Steps to reproduce

See above.

Proposed resolution

Potentially - instead of bubbling max ages, translate the expire value of the cache entry itself back into a max age and set that?

Remaining tasks

Figure out the fix
Tests
Fix
Review
Commit

Release notes snippet

πŸ› Bug report
Status

RTBC

Version

11.0 πŸ”₯

Component
JSON APIΒ  β†’

Last updated about 7 hours ago

Created by

πŸ‡¦πŸ‡ΊAustralia acbramley

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @acbramley
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10
  • πŸ‡¦πŸ‡ΊAustralia acbramley

    πŸ“Œ Add #cache['downstream-ttl'] to force expiration after a certain time and fix #cache['max-age'] logic by adding #cache['age'] Active reports this exact issue in #11

    I've confirmed that the drifted max age bubbles up to DPC resulting in the entire page being cached for too long as well.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Hmm, interesting bug. Was just talking about something similar with a colleague earlier this week.

    Not sure is this is truly jsonapi or more caching. I'll come back to this issue soon.

  • πŸ‡¦πŸ‡ΊAustralia acbramley

    Not sure is this is truly jsonapi or more caching

    True it could exist in other areas, I'm not sure if other parts of the caching system cache things like this though. The issue really stems from that fact that we get a cached item (the normalized fields) which already contain max ages, add stuff on to that same cache entry, and then re cache it with a mix of new and existing data.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Ok interesting, fair enough. I'll put in on my list and will probably start to see if i can create some failing tests.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    We have ran into this issue, and my colleague @casey has found this problem happening in other parts of the caching system also. We ran into this because we use max-age in combination with a time for anonimous visitors to have access to a private file and we noticed that cache was not invalidated a tthe right time. This was also a combination of different caches that stack their max-age and therefor ending up with caches that don't invalidate at the right time. This gets even worse, when certain caches invalidate or get purged because of cache size this can actually result in issues around an invalid state which results in WSOD's. We've seen this in cron and certain very big sites.

    As we talked about this this morning, we think we might need a META issue around this, where max-age should consider expires and be recalculated if a later cache uses earlier caches. This would probably fix a lot of issues.

    I hope i can make a good writeup soon so we can work on this. We can probably fix this in steps though and start calculating max-age based on bubbled max-age and expires.

    Caching is hard lol

    adding some related issues also

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Add parent issue, since me and casey think we can solve this globally.

  • Merge request !8154Set proper expire headers, based on #3447821 β†’ (Open) created by bbrala
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Since solving the main issue is a web of tasks and unsolved questions (it seems), i've extracted the jsonapi part in a mr. Would love to get some feedback on this approach, so gonna post it here, might even be able to do a local fix here, without touching the others.

    The code should apply max-age properly this way based on the time of the request. We use this pattern from the parent issue, which we succesfully use in production (see this comment β†’ in #2352009: [pp-3] Bubbling of elements' max-age to the page's headers and the page cache β†’ ).

  • Status changed to Needs review about 1 month ago
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands
  • Pipeline finished with Success
    about 1 month ago
    Total: 571s
    #179604
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Code looks good, is there a way we can add a test for this behaviour?

  • πŸ‡¦πŸ‡ΊAustralia acbramley

    I will test this with our previous use case

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    I'll think up a test Friday with casey. Didn't get there tonight.

    Adding casey since we worked on this together.

  • Pipeline finished with Success
    about 1 month ago
    Total: 511s
    #179779
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Note:

    Probably use ResourceObjectNormalizerCacherTest and use a dataprovider to create different set's of fields and check variationcache for the proper max age.

    We can use the request stack getCurrectRequest to set the time in the request and see if invalidation is ok. That test should then fail without the code.

    Think that should work.

  • Pipeline finished with Failed
    about 1 month ago
    Total: 196s
    #180965
  • Pipeline finished with Failed
    about 1 month ago
    Total: 215s
    #180971
  • Pipeline finished with Canceled
    about 1 month ago
    Total: 181s
    #180974
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Added a test that tests the max age result in a normalization. It tests default max-age, then little later since last request, and lastly after it expires. I also added the code suggestion from casey earlier.

    Test only run fails: https://git.drupalcode.org/issue/drupal-3444257/-/jobs/1675965 yay.

    Think this is covering it quite well. And only like 0,7 seconds of testing.

    Al ready for review again

  • Pipeline finished with Failed
    about 1 month ago
    Total: 766s
    #180979
  • Pipeline finished with Failed
    about 1 month ago
    Total: 509s
    #180993
  • Pipeline finished with Canceled
    about 1 month ago
    Total: 136s
    #181036
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Warning was missing 'account' from the context for field serialization. Fixed right now by passing 'account' => NULL, but also added a followup to fix the warning at its root: πŸ“Œ Calling normalize without account context generates a warning. Active .

    The created timestamp can be a float, so we need to floor the expires value so that we can calculate a proper max_age. We use floor because if it
    expires on timestamp 1720000.5 we need to make sure it expires on timestamp 1720001.

    Locally this fixed the tests, think all is ok now for real ;)

  • Pipeline finished with Success
    about 1 month ago
    Total: 1904s
    #181039
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    We talked this over some more during lunch lol ;p

    Expires is a bad idea. This is create time of the cache, this could mean the cache was requested, but created seconds later if the request is slow. Since everything everywhere uses request time for caches, this is wrong and will result in silly bugs if we are not carefull.

    When looking through core i did notice this code in Drupal\views\Plugin\views\cache\CachePluginBase

    
      public function cacheGet($type) {
        $cutoff = $this->cacheExpire($type);
        switch ($type) {
          case 'query':
            // Not supported currently, but this is certainly where we'd put it.
            return FALSE;
    
          case 'results':
            // Values to set: $view->result, $view->total_rows, $view->execute_time,
            // $view->current_page.
            if ($cache = \Drupal::cache($this->resultsBin)->get($this->generateResultsKey())) {
              if (!$cutoff || $cache->created > $cutoff) {
    

    That is kinda scary, unrelated, but why would that use created, and not when it expires...

    Anyways, unrelated ;)

  • Pipeline finished with Success
    about 1 month ago
    Total: 657s
    #181181
  • Pipeline finished with Canceled
    about 1 month ago
    Total: 102s
    #181881
  • Pipeline finished with Failed
    about 1 month ago
    Total: 189s
    #181885
  • Pipeline finished with Success
    about 1 month ago
    Total: 530s
    #181895
  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    Added test for specifically 0 seconds.
    Ended up being convinved to change the loop to caseys suggestion.

    Do wonder if the fact that REQUEST_TIME+1600 gived a max-age of 0 is correct. Shouldn't that be a fresh record with 800?

  • Pipeline finished with Success
    about 1 month ago
    Total: 657s
    #183044
  • First commit to issue fork.
  • Merge request !8431Test only fix β†’ (Open) created by mxr576
  • πŸ‡­πŸ‡ΊHungary mxr576 Hungary

    I did not have a project where I would suffer from the described problem, but based on the issue description and the attached test I think the described problem is addressed.
    I had some nitpicks and also spin up a test-only branch just to confirm that the test fails.

  • Pipeline finished with Failed
    10 days ago
    Total: 191s
    #200995
  • Status changed to Needs work 10 days ago
  • The Needs Review Queue Bot β†’ tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

    This does not mean that the patch necessarily needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

    Consult the Drupal Contributor Guide β†’ to find step-by-step guides for working with issues.

  • πŸ‡­πŸ‡ΊHungary mxr576 Hungary

    mxr576 β†’ changed the visibility of the branch 3444257-test.only to hidden.

  • Pipeline finished with Success
    10 days ago
    Total: 612s
    #201854
  • Pipeline finished with Canceled
    10 days ago
    Total: 75s
    #201873
  • Pipeline finished with Success
    10 days ago
    Total: 605s
    #201874
  • Status changed to RTBC 10 days ago
  • πŸ‡­πŸ‡ΊHungary mxr576 Hungary

    Fixed the things that I have spotted, RTBC from my part.

  • πŸ‡³πŸ‡±Netherlands bbrala Netherlands

    <3

Production build 0.69.0 2024