[meta] Try to reduce database queries from redirect lookups

Issue created by @catch
Comment 7 months ago →
🇨🇭Switzerland berdir Switzerland
As I mentioned on slack, yes, performance is a concern, especially combined with the runtime overhead as well, things like repeated inbound path processing, access checks and so on.

Lots of people also want additional features like alias redirects (which would conflict with option 3 but I pretty much veto'd that), wildcard redirects which would obviously make this much worse as well.

The problem is that you can have redirects on valid system paths, and there are use cases for that, like redirecting requests to valid pages to somewhere else (one-off rabbithole use cases, where using yet another module for checks like that seems like overkill).

I just had an idea for a 4th option that I think should work quite well. Dynamic include/exclude detection based on the first part of the path with a cache collector. If you don't have redirects on system paths such as /node/* then it would quickly converge on option two, where it only runs on what would be 404s. Maybe the check on URL's that only consist of a single part. That cache entry could grow somewhat large, especially combined with weird crawlers/bots. Could try to process some access logs to see how many unique first parts there typically are.

Or alternatively, do that based on the route name. Would be a longer list, and we'd have to mess with the event priority which could break things, but then we could exclude node edit routes for example even if you have a node/ID redirect. But I definitely have some concerns around caching for that because that would require to run after dynamic page cache, so sites that allow query string based redirects as well would need to always vary dynamic page cache by the full query string.
Comment 7 months ago →
🇬🇧United Kingdom catch
The problem is that you can have redirects on valid system paths, and there are use cases for that, like redirecting requests to valid pages to somewhere else (one-off rabbithole use cases, where using yet another module for checks like that seems like overkill).

This would still be compatible with option #3 - we allow aliasing over system paths (whether that's advisable is a very old core issue).

I just had an idea for a 4th option that I think should work quite well. Dynamic include/exclude detection based on the first part of the path with a cache collector. If you don't have redirects on system paths such as /node/* then it would quickly converge on option two, where it only runs on what would be 404s. Maybe the check on URL's that only consist of a single part. That cache entry could grow somewhat large, especially combined with weird crawlers/bots. Could try to process some access logs to see how many unique first parts there typically are.

Core has a dynamically built path alias allow list that works pretty much like this - see AliasPrefixList. Could it be built from the first-path-part of all of the stored redirects?
Comment 7 months ago →
🇨🇭Switzerland berdir Switzerland
> This would still be compatible with option #3 - we allow aliasing over system paths (whether that's advisable is a very old core issue).

Yes, but I have considerable concerns over maintaining that. There's constant interaction between aliases and redirects. New aliases delete redirects, redirects are created when aliases change. And redirect has a number of partially configurable cases where redirect lookups are skipped. such as it not being a GET request, not having access to the route, maintenance mode, admin route and so on. We couldn't limit the alias to not apply then but would need to fall back to the original path then in that case.

Lots of edge cases to figure out.

> Core has a dynamically built path alias allow list that works pretty much like this

Yes, that's what my idea is based on, I remember when we rewrote that.

> Could it be built from the first-path-part of all of the stored redirects?

Possibly, but sites can have vast amounts of redirects. I just checked two examples. One site has about 10k redirects with 357 unique first parts based on SELECT DISTINCT SUBSTRING_INDEX(redirect_source__path, '/', 1) FROM redirect. That's doable, although I suspect that we might need to do backend specific queries for this.

The other site is more fun, huge amount of content that went through URL alias pattern changes and migrations. 200k+ redirects, with 6360 unique prefixes, all kinds of nonsense and about 5000 of those are redirects that is only a single part, because older iterations of this site had aliases like that. Quite easy to imagine a site that has tens of thousands of such single-component redirects. The vast majority of those probably hasn't been used in years, if ever. (redirect doesn't have the auto-purge feature from D7 anymore).

That's why I'm concerned about aggregating them all, I'm sure there are plenty of edge cases out there that would cause such a lookup to explode.

So instead, similar to AliasPrefixList, but without the restriction on route path prefixes, we build up the list as we go, doing one query to check if this prefix has any redirects for each new one we encounter. could even add some kind of hard limit that we don't grow this list over 1k entries or whatever and/or restrict it to (inbound processed) paths with at least 2 components.
Comment 7 months ago →
🇨🇭Switzerland berdir Switzerland
One thing to keep in mind is that the AiasPrefixList thing is on path to alias lookups, so we expect many lookups for a single page and this allows us to optimize away a possibly large amount of alias lookups (e.g. all /admin/ menu links for logged in users).

This would only be for a single redirect query, and contrary to what I wrote above, I'm not 100% sure I want to run this on the inbound processed path or before. I'll want to do some profiling to see how expensive that is, but it would be a _lot_ more efficient when done on the system path. Most sites use mostly nodes, some a few other entity types and system paths, like products checkout.

If we get a request for /foo/bar, we resolve it through the alias to /node/12 and then check the node prefix against our prefix list. Missing, so we do a query for any redirect starting with "node", get none. At this point, we do not have to run a single redirect lookup query anymore on any alias for a node. That should be true for most sites. And even if not, it still allows to rule out all admin pages, asset aggregation, autocomplete and other frequently accessed pages.
Comment 7 months ago →
🇨🇭Switzerland berdir Switzerland
Did a bit of profiling on a dynamic page cache hit, so that RedirectRequestSubscriber makes up a considerable chunk of the total time.

It was about 8% in my case with blackfire. The query itself barely registered on my local setup with ddev, but that could be quite different on a large redirect table and with the database having network overhead. In an extreme case, possibly even whether it's about having to do any queries and database connection at all, but that would require an alternative session backend.

And those 8% are fairly misleading, because it's mostly stuff that would otherwise happen elsewhere later on. Such as a permission check and state (for the maintenance state check, I'm not 100% sure if that's really needed) as well as the alias lookup.

IMHO, the good thing about that is that it should be OK to run the prefix list on the system path.

Unrelated to this issue but interesting: I did profile on top of 🐛 Feature or Bug? - The path-alias preloading is not used if the "Enforce clean and canonical URLs" option is enabled Active , simply because I still was on that branch. And even on a dynamic page cache *hit*, I'm seeing the preload alias query. because of \Drupal\redirect\EventSubscriber\RouteNormalizerRequestSubscriber::onKernelRequestRedirect(), as that generates the URL for the current path. The new approach in ✨ Route normalizer: Global Redirect in core Needs work ] avoids this by bypassing the AliasManager and going directly against the storage. But this also bypasses the whitelist and static caching, I'm not sure that's really the correct thing to do?
Comment 6 months ago →
🇬🇧United Kingdom catch
If we get a request for /foo/bar, we resolve it through the alias to /node/12 and then check the node prefix against our prefix list. Missing, so we do a query for any redirect starting with "node", get none.

This sounds efficient both in terms of the prefix list storage and avoiding queries, but would it not make it impossible for someone to manually create an alias from foo/bar to /bar/baz then when there's already an alias? Not that I think people should be able to do that, I would rather have less database queries.
Comment 6 months ago →
🇨🇭Switzerland berdir Switzerland
That's currently not supported. This inbound processing already happens, the only question is if the prefix ist is checked before or after that. See 🐛 Redirects from aliased paths aren't triggered Needs review , it would make that issue definitely not possible to implement though.
Comment 6 months ago →
🇬🇧United Kingdom catch
it would make that issue definitely not possible to implement though.

That seems fine to me - people can just delete the aliases then create the redirects.
Comment 6 months ago →
🇨🇭Switzerland berdir Switzerland
> That seems fine to me - people can just delete the aliases then create the redirects.

Well, people get confused about this *a lot*. We see frequent support requests in our projects because people don't understand this. But I've also suggested an alternative approach for this a long time ago (which is to remove the alias automatically as the redirect is added).

I'm OK with that too. Just suspect that the 120 followers on that issue indicate that _many_ sites use patches like that and they will not be happy about this change. But they can always disable it in that patch.*

[meta] Try to reduce database queries from redirect lookups

Problem/Motivation

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Comments & Activities