Reintroduce "Crawl-delay" in robots.txt

Created: 18 December 2016
Updated: 1 May 2024

Problem/Motivation

The Crawl-delay setting was removed from Drupal 8's default robots.txt after having been in place for a long time in previous versions. As a result, the site places no limit on crawl rate and crawler bots decide it entirely by themselves. In practice there have been cases where poorly behaving bots kept hammering a site with many requests per second from many remote addresses long after it was clear the site was struggling, leaving it unresponsive or consuming a significant amount of resources for very little benefit.

We should revisit the removal of Crawl-delay from Drupal 8's default robots.txt and discuss reinstating it. The decision to remove it wasn't thought through, and the arguments made to support it may not stand up to scrutiny. Experience shows that there are real cases where having a default delay is useful. Hass also raised this issue when the change was being considered for Drupal 7 →. There are many bots around that can't be trusted to crawl at a sensible rate but that do respect Crawl-delay. Below are the arguments made when deciding to remove Crawl-delay and an examination of each of them.

These arguments were presented as reasons for the removal:

1. Google doesn't support Crawl-delay anymore thus making it obsolete
2. Google recommends not adjusting GoogleBot's crawl rate through Google Webmaster Tools
3. Lowering crawl rate for Google will make your search ranking drop
4. Server capacity has changed
5. Caching mechanisms have changed
6. Crawler "intelligence" has changed
7. Bing calls a 10 second Crawl-delay value "extremely slow"
8. robots.txt is supposed to be customised

Examination of the arguments


1. Google doesn't support Crawl-delay anymore thus making it obsolete

There has been no recent change in Google's behavior regarding Crawl-delay; it has ignored the directive for a long time. Here's Google's Matt Cutts in 2006 stating they ignore it, and here's a forum thread from 2004 mentioning GoogleBot doesn't support it, which appears to predate the earlier decision to include the directive in Drupal's default robots.txt. As far as I can tell, GoogleBot may never have supported Crawl-delay, so this point is not all that relevant to the discussion.

The reason for Google's lack of support is also not that they consider it obsolete. Here's what Matt Cutts gave as the reason in an interview:

And, the reason that Google doesn't support crawl-delay is because way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and, that means you get to crawl one page every other day or something like that.


2. Google recommends not adjusting GoogleBot's crawl rate through Google Webmaster Tools

Google's recommendation not to change the crawl rate is not meant as a general rule applied to all bots; it isn't even talking about Crawl-delay. Google may be smart enough not to flood your site with requests when it can't handle them, but you can't assume on that basis that the same is true for every single crawler bot out there. In a previous issue it was already estimated → that even the 10 second Crawl-delay value that has been the Drupal default for years should not be a problem unless your site is fairly large: as that issue showed, crawling 5,000 pages with a 10 second delay still takes just under 14 hours. If your site changes at a rate sufficient for that to become a problem, it seems to me that you're only risking some of the less prominent pages being out of date due to delays in crawling them.
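For reference, the arithmetic behind that estimate (a minimal calculation assuming one request per delay interval and ignoring response times):

  5,000 pages × 10 seconds per page = 50,000 seconds ≈ 13.9 hours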


3. Lowering crawl rate for Google will make your search ranking drop

Again, this doesn't even apply to Google, since Google ignores Crawl-delay, and even if it did, no evidence was presented for it. The mentioned case where a site's Google SERP ranking had dropped actually sounded like a situation where Crawl-delay might have helped (had Google supported it), if the client site was being penalised for poor response times.


4. Server capacity has changed

Server capacity has not changed relative to the resources available to search engines and other crawlers. You can still be heavily outmatched by a crawler, and it's typically much cheaper to fire off HTTP requests in rapid succession than it is to serve them. Having the capacity also doesn't mean you want to spend it on serving a bunch of poorly behaving bots that provide little to no benefit to you.


5. Caching mechanisms have changed

Drupal 8's caching mechanisms may have taken some important steps forward, but I don't think that really changes the overall picture much. Crawlers are likely to find the parts of your site that are not cached and cause more load than a typical visitor does.


6. Crawler "intelligence" has changed

The claim that crawler "intelligence" has improved requires some evidence. There are definitely "dumb" bots still out there, and it's not obvious that any improvement applies across the board.

The important thing to realize here is that we are not talking only about reputable crawlers like Bing, we are talking about all of them. It seems unlikely that every bot out there even wants to behave well by default or is sophisticated enough to do so. Even GoogleBot has gotten confused by faceted search interfaces and crawled an essentially infinite number of pages very quickly on an otherwise small site, and other bots (for example AhrefsBot and SemrushBot) have brought sites down repeatedly until they were blocked, when no Crawl-delay was in place. You can't just dismiss this with "get a better server" either, because a crawler that does no rate limiting, or does it very poorly, can leave a site that would otherwise see only moderate traffic suddenly handling a huge number of expensive requests from someone with far more resources than you. Even if that is manageable, it may not be something you're particularly interested in spending your resources on. Creating specific blocks for these kinds of bots doesn't seem like the best strategy either, since the list is constantly changing, while Crawl-delay takes care of all past, present and future ones with (in my opinion) very little downside. I'd much rather have exceptions for the few bots we really care about, if there's enough concern to warrant that (see the sketch below).
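To make that approach concrete, here is a minimal robots.txt sketch: a blanket Crawl-delay as the default, with an exception for a specific crawler we care about. The bot name and the 2 second value are illustrative assumptions, not a proposed default. Crawlers follow the most specific User-agent group that matches them, so bingbot would read only its own block (and Google ignores Crawl-delay entirely, as noted above).

  # Default group: any crawler without a more specific group waits between requests
  User-agent: *
  Crawl-delay: 10

  # Exception for a crawler we care about and trust to pace itself (illustrative value)
  User-agent: bingbot
  Crawl-delay: 2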


7. Bing calls a 10 second Crawl-delay value "extremely slow"

This doesn't seem like an argument for removal. It could be an argument for reducing the value (at least for Bing), and maybe that is something that should indeed be done. However, just going by a label they apply, with no reasoning attached, isn't particularly convincing. Crawl-delay doesn't seem to be considered a SERP ranking killer in SEO circles, and rschwab likewise stated he hadn't found any claims that it would kill rankings when rejecting the change for Drupal 6 →.


8. robots.txt is supposed to be customised

Customisation can remove a Crawl-delay just as easily as add one. Drupal's responsibility is simply to provide safe and sensible defaults, and those defaults should probably be geared towards smaller sites, since they are the least likely to change these settings.

Proposed resolution

Add a Crawl-delay line back to Drupal's robots.txt.
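For concreteness, the change would amount to something like the following in core's default robots.txt, inside the existing User-agent: * group (its Disallow rules are omitted here). The 10 second value matches the pre-Drupal 8 default mentioned above, though the exact number is open for discussion:

  User-agent: *
  Crawl-delay: 10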

Remaining tasks

Discuss.

Issue type: Task
Status: Active
Version: 11.0
Component: Other
Created by: Antti J. Salminen (Finland)


Comments & Activities


  • jwilson3 (Ecuador)

    Even GoogleBot has gotten confused by faceted search interfaces and crawl a basically infinite number of pages very fast on an otherwise small site and other bots (for example AHRefsBot and SemRushbot) have brought sites down repeatedly until they get blocked when there is no Crawl-delay in place.

    We're seeing this exact case right now. Is there an issue for it somewhere on D.o to address this with other mechanisms built into facets/search_api?
