- Issue created by @alexh
- 🇺🇸United States bburg Washington D.C.
I agree with OP: this module feels like "throwing spaghetti at the problem". I use it because it's there, but I have no metrics on how many requests it has blocked, and I'm not able to adjust the settings on the fly, as everything is hard-coded in settings.php. It would be great to see these things.
And yes, I'm seeing a trend on my own sites of bots getting caught up in endless combinations of facet links. My solution was to block requests containing more than a certain number of facet query parameters. For example, a faceted search URL might look like this:
/search?f[0]=filter0&f[1]=filter1&f[2]=filter2&f[3]=filter3
If you block requests containing "f[3]" via WAF rules, such as in Cloudflare, you can stop a large amount of this traffic while still allowing normal human visitors to use some of the faceted search feature, as in the sketch below. I'm working on a list of mitigation approaches for this problem, but this was the one that addressed the last issue I was having.
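In Cloudflare, for example, this can be a custom WAF rule with the Block action and an expression along the lines of `http.request.uri.query contains "f[3]"` (depending on the client, the brackets may arrive URL-encoded as `f%5B3%5D`, so you may need to match both forms). If you are not behind a WAF, a rough application-level sketch of the same idea can sit near the top of settings.php; the parameter name `f` and the threshold of three facets come from the example URL above, so adjust them for your site:

```php
// settings.php: reject requests carrying a fourth (or later) facet
// parameter, i.e. anything from f[3] upward, before Drupal does any
// heavy lifting. A sketch, not a drop-in rule.
if (isset($_GET['f']) && is_array($_GET['f']) && count($_GET['f']) > 3) {
  header(($_SERVER['SERVER_PROTOCOL'] ?? 'HTTP/1.1') . ' 403 Forbidden');
  exit;
}
```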
I'm also using the Perimeter module to block probing for vulnerable URL paths. Not that I'm worried about these attempts finding a vulnerability, but I was seeing a lot of this traffic, and it also serves un-cached pages. In addition, I use Fast 404 (to make rendering 404 pages less resource-intensive), plus Antibot and Honeypot to block spam form submissions.
- 🇷🇸Serbia vaish
@alexh, sorry for the late reply; I was away for personal reasons. I would like to try to help you resolve your issue if you are still interested.
I have personal experience with bots endlessly crawling faceted search result pages, going through all possible combinations of facets. In my case, though, the User-Agent strings were commonly recognized as bots and handled by the `bot_traffic` limits. Only occasionally would they cross the bot rate limits I set, and only then would they be blocked.

If I understand correctly, in your case the User-Agent strings are spoofing browsers and, on top of that, adding random strings to make them appear unique. Unique UA strings combined with multiple IP addresses make it very difficult for the `regular_traffic` limit to ever be reached. That's where rate limiting by ASN should be able to help.

I can think of several reasons for the `regular_traffic_asn` limit not being reached:
- There is an issue with the module configuration.
- The ASN database is outdated and doesn't contain entries for the IP addresses you are trying to rate limit.
- The IP addresses you would like to rate limit belong to multiple different ASNs (the sketch after this list shows one way to check).
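To rule out the last two possibilities, you can look up the ASN for a handful of the offending IP addresses yourself. Here is a minimal sketch using the `geoip2/geoip2` PHP library with a MaxMind GeoLite2 ASN database; the database path and IP addresses are illustrative, so substitute your own:

```php
<?php

// asn-lookup.php: check which ASN some offending IP addresses belong to.
// Requires `composer require geoip2/geoip2` and a GeoLite2-ASN.mmdb file.
require 'vendor/autoload.php';

use GeoIp2\Database\Reader;
use GeoIp2\Exception\AddressNotFoundException;

$reader = new Reader('/usr/share/GeoIP/GeoLite2-ASN.mmdb');

// Replace with IPs taken from your access log.
foreach (['203.0.113.7', '198.51.100.23'] as $ip) {
  try {
    $record = $reader->asn($ip);
    printf("%s => AS%d (%s)\n", $ip, $record->autonomousSystemNumber, $record->autonomousSystemOrganization);
  }
  catch (AddressNotFoundException $e) {
    // No entry for this IP: a hint that the database may be outdated.
    printf("%s => not in database\n", $ip);
  }
}
```

If the offending IPs map onto many different ASNs, the per-ASN counters get diluted in the same way per-IP counters do, and the `regular_traffic_asn` limit may never trip for any single ASN.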
You also mentioned that these requests are bringing your server down. If that's the case, the problem may lie outside of Drupal and Crawler Rate Limit. When a server receives more requests than it can handle, some requests have to wait until they get processed. At some point the server won't be able to keep up, and it will start returning 500 errors before Drupal and Crawler Rate Limit ever get a chance to handle the request.
> Is there a log which shows if any limits by this module have been triggered?
Crawler Rate Limit does not perform any logging. That's by design: the goal of the module is to have minimal effect on performance. However, you can inspect your server logs and search for requests with response code 429, as in the sketch below.
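For example, a quick way to gauge how often rate limiting kicks in is to count 429 responses in the access log. A rough sketch in PHP, assuming a combined log format in which the status code is the ninth space-separated field; adjust the log path and field index for your server:

```php
<?php

// count-429.php: rough count of rate-limited requests in an access log.
// Assumes combined log format (status code at index 8 when splitting
// naively on spaces); request lines with unusual spacing can skew this.
$count = 0;
foreach (new SplFileObject('/var/log/nginx/access.log') as $line) {
  $fields = explode(' ', trim((string) $line));
  if (($fields[8] ?? NULL) === '429') {
    $count++;
  }
}
echo "429 responses: {$count}\n";
```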
Note that you can review the status of the module on Drupal's Status Report page (`/admin/reports/status`). There you should be able to see whether the module is enabled and configured correctly. I suggest you start by reviewing the Status Report page. Taking a screenshot of the Crawler Rate Limit section and posting it here might also be a good idea.
> Is there an option to further limit requests, e.g. disregard the User-Agent and only block by IP?
Rate limiting by IP alone is not available.
Note that, at the moment, I'm working on a new feature which will allow you to block (as opposed to rate-limiting) requests by ASN. You will, of course, need to analyze and understand your website traffic in order to make sure that blocking a whole ASN won't block genuine visitor traffic. Again, your web server logs should give you all the info you may need for that.
- 🇷🇸Serbia vaish
> I have no metrics on how many requests it's blocked, and I'm not able to adjust the settings on the fly as everything is hard-coded in settings.php. It would be great to see these things.
You can analyze your server logs and look for requests with response code 429 (see the sketch in my previous comment). Server logs, in general, are the source of all the information you need to configure the module optimally and to monitor its effectiveness.
Everything is hard-coded in settings.php by design; the goal of this module is to have minimal effect on performance, and reading configuration straight from settings.php avoids config-storage lookups on every request.
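To illustrate the shape of the configuration: everything is passed through Drupal's `$settings` array. The three limit names below are the ones discussed in this thread; the surrounding key names and values are placeholders, so consult the module's README for the authoritative structure:

```php
// settings.php: illustrative sketch only. See the Crawler Rate Limit
// README for the real settings structure; key names and values here
// are assumptions, apart from the three limit names from this thread.
$settings['crawler_rate_limit.settings'] = [
  // Allowance for requests whose User-Agent is recognized as a bot.
  'bot_traffic' => ['requests' => 100, 'interval' => 600],
  // Per-IP allowance for traffic that does not identify as a bot.
  'regular_traffic' => ['requests' => 300, 'interval' => 600],
  // Per-ASN allowance, for crawlers that rotate IPs within one network.
  'regular_traffic_asn' => ['requests' => 1000, 'interval' => 600],
];
```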