Not always effective, unfortunately

Created on 13 February 2025, 4 months ago

This module seemed to work for me in the past, as it improved the situation, but the crawlers seem to be getting harder to target.

I see floods of requests from different IPs, with user agents containing random strings, for very similar existing pages (facet search), and this is bringing my server down.

It seems that this module does not limit these requests, even when I set regular_traffic and regular_traffic_asn to very low limits (interval: 600, requests: 2).

I know it works in principle, because I lock myself out with such a setting. But obviously, the crawlers use a different IP and a different user agent for every request.
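
For reference, this is roughly what I have in settings.php (the wrapper key and exact structure are written from memory, so treat it as a sketch rather than the module's documented format):

// NOTE: the 'crawler_rate_limit.settings' key name below is from memory;
// check the module's README for the exact format.
$settings['crawler_rate_limit.settings'] = [
  // Per-client limit (IP + user agent): at most 2 requests per 600 seconds.
  'regular_traffic' => [
    'interval' => 600,
    'requests' => 2,
  ],
  // The same very low limit per ASN.
  'regular_traffic_asn' => [
    'interval' => 600,
    'requests' => 2,
  ],
];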

Is there a log which shows if any limits by this module have been triggered?
Is there an option to further limit requests, e.g. disregard the User-Agent and only block by IP?

💬 Support request
Status

Active

Version

3.0

Component

Miscellaneous

Created by

🇦🇹 Austria alexh


Comments & Activities

  • Issue created by @alexh
  • 🇺🇸 United States bburg Washington D.C.

    I agree with the OP; this module feels like "throwing spaghetti at the problem". I use it because it's there, but I have no metrics on how many requests it has blocked, and I'm not able to adjust the settings on the fly, as everything is hard-coded in settings.php. It would be great to see these things.

    And yes, I'm seeing a trend on my own sites of bots getting caught up in endless combinations of facet links. The solution for me was to block requests that contain more than a certain number of facet query parameters. For example, a faceted search URL might look like this:

    /search?f[0]=filter0&f[1]=filter1&f[2]=filter2&f[3]=filter3

    If you block requests containing "f[3]" via WAF rules, e.g. in Cloudflare, you can stop a large amount of this traffic while still allowing normal human traffic to use some of the faceted search feature. I'm working on a list of mitigation approaches to this problem, but this was the one that addressed the last issue I was having with it.
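
    If you don't have a WAF in front of the site, a rough application-level equivalent (just a sketch, not something any of these modules provide) is to bail out very early in settings.php, before Drupal does any real work. Adjust the parameter name and threshold to match your own facet URLs:

    // Sketch only: reject requests that carry more than three facet
    // parameters (f[0] through f[3] and beyond) before Drupal builds the page.
    if (isset($_GET['f']) && is_array($_GET['f']) && count($_GET['f']) > 3) {
      header($_SERVER['SERVER_PROTOCOL'] . ' 403 Forbidden');
      exit;
    }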

    Other things I'm using as well: the Perimeter module to block probing for vulnerable URL paths (not that I'm worried about these attempts finding a vulnerability, but I was seeing a lot of traffic like this, which also serves uncached pages), Fast 404 (to make rendering 404 pages less resource-intensive), and Antibot and Honeypot to block spam form submissions.

  • 🇷🇸 Serbia vaish

    @alexh, sorry for the late reply. I was away due to personal reasons. I would like to try and help you resolve your issue if you are still interested.

    I have personal experience with bots endlessly crawling faceted search result pages, going through all possible combinations of facets. In my case, though, the user agents were commonly recognized as bots and handled by the bot_traffic limits. They only occasionally crossed the bot rate limits I had set, and only then were they blocked.

    If I understand correctly, in your case the user agent strings are spoofing browsers and, on top of that, adding random strings to make them appear unique. Unique UA strings combined with multiple IP addresses make it very difficult for the regular_traffic limit to ever be reached. That's where rate limiting by ASN should be able to help.
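
    To illustrate the difference, here is a conceptual sketch (not the module's actual code) of why per-client counters stay empty while a per-ASN counter fills up:

    <?php
    // Conceptual sketch, not Crawler Rate Limit's implementation: 100 requests
    // that each spoof a "new" IP and user agent but all come from the same
    // network (ASN 64500 is a made-up example number).
    $clientCounters = [];
    $asnCounters = [];
    for ($i = 0; $i < 100; $i++) {
      $ip = "198.51.100.$i";                           // documentation-range addresses
      $ua = 'Mozilla/5.0 ' . bin2hex(random_bytes(4)); // random junk suffix
      $asn = 64500;

      $clientKey = $ip . '|' . $ua;
      $clientCounters[$clientKey] = ($clientCounters[$clientKey] ?? 0) + 1;
      $asnCounters[$asn] = ($asnCounters[$asn] ?? 0) + 1;
    }
    // Every per-client counter sits at 1, so a limit of 2 per interval never
    // triggers; the single per-ASN counter is at 100 and far over it.
    echo 'max per-client count: ' . max($clientCounters) . "\n"; // 1
    echo 'per-ASN count: ' . $asnCounters[64500] . "\n";         // 100

    With traffic like that, no realistic per-client limit will ever trigger, which is exactly the gap the ASN-based limit is meant to cover.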

    I can think of several reasons why the regular_traffic_asn limit is not being reached:

    • There is an issue with the module configuration.
    • The ASN database is outdated and doesn't contain entries for the IP addresses you are trying to rate limit.
    • The IP addresses you would like to rate limit belong to multiple different ASNs.

    You also mentioned that these requests are bringing your server down. If that's the case, the problem may lie outside of Drupal and Crawler Rate Limit. When the server receives more requests than it can handle, some requests have to wait until they get processed. At some point the server won't be able to keep up and will start returning 500 errors before Drupal and Crawler Rate Limit ever get a chance to handle the request.

    Is there a log which shows if any limits by this module have been triggered?

    Crawler Rate Limit does not perform any logging. That's by design: the goal of the module is to have a minimal effect on performance. However, you can inspect your server logs and search for requests with response code 429.
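
    For example, a quick-and-dirty script like this (just a sketch, nothing the module ships with) tallies 429 responses per client IP from a combined-format access log:

    <?php
    // Sketch: count HTTP 429 responses per client IP in a combined-format
    // access log. Usage: php count429.php /var/log/apache2/access.log
    $counts = [];
    foreach (file($argv[1]) as $line) {
      // combined format: IP ident user [time] "request" status size "referer" "agent"
      if (preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /', $line, $m) && $m[2] === '429') {
        $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
      }
    }
    arsort($counts);
    foreach (array_slice($counts, 0, 20, TRUE) as $ip => $n) {
      printf("%6d  %s\n", $n, $ip);
    }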

    Note that you can review the status of the module on Drupal's Status Report page (/admin/reports/status). There you should be able to see whether the module is configured correctly and enabled.

    I suggest you start by reviewing the Status Report page. Taking a screenshot of the Crawler Rate Limit section and posting it here might also be a good idea.

    Is there an option to further limit requests, e.g. disregard the User-Agent and only block by IP?

    Rate limiting by IP only is not available.

    Note that, at the moment, I'm working on a new feature which will allow you to block (as opposed to rate limit) requests by ASN. You will, of course, need to analyze and understand your website traffic to make sure that blocking a whole ASN won't block genuine visitors. Again, your web server logs should give you all the info you need for that.

  • 🇷🇸 Serbia vaish

    I have no metrics on how many requests it's blocked, and I'm not able to adjust the settings on the fly as everything is hard-coded in settings.php. It would be great to see these things.

    You can analyze your server logs and look for requests with response code 429. Server logs, in general, are the source of all the info you need to configure the module optimally and monitor its effectiveness.

    Everything is hard-coded in settings.php by design: the goal of this module is to have a minimal effect on performance.
