Implement path prefix based prefix list as a cache collector to reduce redirect lookups

Created on 2 January 2025, 20 days ago

Problem/Motivation

See 🌱 [meta] Try to reduce database queries from redirect lookups Active

Steps to reproduce

Proposed resolution

I just had an idea for a 4th option that I think should work quite well. Dynamic include/exclude detection based on the first part of the path with a cache collector. If you don't have redirects on system paths such as /node/* then it would quickly converge on option two, where it only runs on what would be 404s. Maybe the check on URL's that only consist of a single part. That cache entry could grow somewhat large, especially combined with weird crawlers/bots. Could try to process some access logs to see how many unique first parts there typically are.

Remaining tasks

User interface changes

API changes

Data model changes

📌 Task
Status

Active

Version

1.0

Component

Code

Created by

🇨🇭Switzerland berdir Switzerland

Live updates comments and jobs are added and updated live.
  • Performance

    It affects performance. It is often combined with the Needs profiling tag.

Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @berdir
  • Merge request !131Add RedirectPrefixList implementation → (Open) created by berdir
  • 🇨🇭Switzerland berdir Switzerland

    Implemented an initial version and seems to be working reasonably well.

    TODO:
    * Optimize cache invalidation. Right now I just used redirect_list, but on systems that add a lot of redirects, this would be frequently invalidated. I think it should be possible to check the source path against the prefix list on save and invalidate only if a relevant change is detected.
    * Still want to do some checks against access logs on real sites to get an idea on how many
    * Probably an opt-out settings flag for sites using the alias patch or where request patterns cause a huge amount of variations.

  • 🇨🇭Switzerland berdir Switzerland

    I analyzed the last 100k/400k entries in php.access.log in a single fairly large production project.

    For the last 100k, I found 116 unique prefixes after resolving aliases and ignoring those that are only one part. In our case, 80k of those are node and this project does have node/... prefixed redirects, so won't be a huge help. About 59 of those prefixes don't have prefixes, but because most of the frequent prefixes do, it would only save around 1800 out of those 100'000 queries.

    For the last 400k, the list grows to 159. still quite manageable, but it would be easy enough to add a hard cut off and not store more patterns, seems like a fair assumption that after 500 unique patterns or something, we're not going to find many frequently requested ones anymore.

    Also checked a few different projects, but most actually have at least one node/something redirect for various reasons. So this is for many projects actually not as beneficial as I thought.

    I found one projects where I got a success rate (no redirects on that prefix) of 83% out of 100k redirects, with a small tweak (checking for redirects that start with "$prefix/", because that actually has a redirect on /node). On that project I also had access to ~1M log entries, parsing that resulted in 400 prefixes (and 11'000 unique single-component urls). The requests per prefix quickly go below 10 (out of 1M), so I think it's a good idea to cut off this prefix list after 100 or so to save on cache size.

    So, quite interesting for *some* projects, but many likely won't benefit (much) from this.

    (I may have spent quite a bit more time on this analysis than I expected)

  • 🇬🇧United Kingdom catch

    Right now I just used redirect_list, but on systems that add a lot of redirects, this would be frequently invalidated. I think it should be possible to check the source path against the prefix list on save and invalidate only if a relevant change is detected.

    Could implement ::set() on the cache collector class itself, and merge the prefix from the new redirect into the existing cache item then save it (if it's not already in there). Probably easiest to ::get() first and only change anything if that returns FALSE.

  • 🇨🇭Switzerland berdir Switzerland

    Yes, set() is a good idea, much better than just invalidate.

    This also needs tests, our existing tests pass, but they aren't really good examples, mostly just use URL's that aren't really a path. An actual performance test that lets us inspect if there were redirect queries would be neat.

  • 🇬🇧United Kingdom catch

    Was going to suggest a performance test, to my knowledge it would be the first ever performance test for a contrib module (outside core or Drupal CMS) - could hit one page where the redirect query is expected, then another page where it's not?

  • 🇨🇭Switzerland berdir Switzerland

    Yes, that's exactly the idea. My plan is to filter and only assert on redirect queries. Anything else seems way too much pain to keep up with and support various different minor and major core versions.

    Also considering to add one to redis, but similar problems there. I could for example assert that there are no semaphore queries, but we're already testing that in the existing full web test that simply asserts that no semaphore, queue, cache, .. tables exist at the end.

  • 🇬🇧United Kingdom catch

    My plan is to filter and only assert on redirect queries.

    Yeah should be able to do array_diff() or array_intersect() on a specific list of redirect queries, so it only fails if they're there or missing.
Production build 0.71.5 2024