- Issue created by @chrisolof
- Status changed to Needs review
10:37pm 5 June 2024 - First commit to issue fork.
We just encountered a crawler that did not identify as a bot (it pretended to be human traffic) and rapidly crawled our site from hundreds of unique IPs and user-agent strings. Because of the distributed nature of this crawl, the bot was able to bypass both our bot and regular-traffic request limits: no single "visitor" exceeded our established limits, but combined the crawler was far over them.
Further analysis revealed that all traffic from this crawler was coming in under a single autonomous system number (ASN), identifying the network of the cloud-computing platform the crawler was running on. To keep distributed crawlers like this from completely bypassing all limits (and slowing down our site), I would like the optional ability to rate-limit regular traffic at the ASN level. Clearly a limit at this level would need to account for the shared nature of any given ASN, but I feel a sane limit here could really help.
N/a.
Open up the ability to rate-limit regular traffic at the ASN level, e.g.:
$settings['crawler_rate_limit.settings']['regular_traffic'] = [
  // Existing per-visitor limit.
  'interval' => 600,
  'requests' => 300,
  // Proposed per-ASN limit.
  'asn_interval' => 600,
  'asn_requests' => 800,
];
When enabled, use a library like GeoIP2-php to resolve the ASN for each requesting regular-traffic IP and enforce the ASN-level rate limits. Cache the IP->ASN mapping so the lookup only runs on the first request from a given IP.
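To illustrate the proposed behavior, here is a minimal, language-agnostic sketch of the IP->ASN cache plus a fixed-window per-ASN counter. In the module itself the lookup would go through GeoIP2-php (against a GeoLite2-ASN style database) and storage would use Drupal's cache/flood facilities; all names below, and the stubbed `lookup_asn()`, are illustrative assumptions, not module code.

```python
import time

# Proposed settings, mirroring 'asn_interval' / 'asn_requests' above.
ASN_INTERVAL = 600   # Window length in seconds.
ASN_REQUESTS = 800   # Max requests per ASN per window.

_ip_asn_cache = {}   # IP -> ASN, so the lookup runs once per IP.
_asn_windows = {}    # ASN -> (window_start, request_count).

def lookup_asn(ip):
    """Stand-in for a GeoIP2 ASN database lookup (assumption).

    Real code would query an ASN database; here every IP maps to
    one ASN to simulate a distributed crawl from a single network.
    """
    return 64496

def asn_for_ip(ip):
    # Only hit the (expensive) lookup on the first request per IP.
    if ip not in _ip_asn_cache:
        _ip_asn_cache[ip] = lookup_asn(ip)
    return _ip_asn_cache[ip]

def allow_request(ip, now=None):
    """Return True if this request stays within the ASN-level limit."""
    now = time.time() if now is None else now
    asn = asn_for_ip(ip)
    start, count = _asn_windows.get(asn, (now, 0))
    if now - start >= ASN_INTERVAL:
        start, count = now, 0  # Window expired: start a fresh one.
    count += 1
    _asn_windows[asn] = (start, count)
    return count <= ASN_REQUESTS
```

Because the counter is keyed on the ASN rather than the IP, hundreds of distinct IPs from the same network share one budget, which is exactly what defeats the distributed crawl described above.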
Discuss, patch, review.
N/a.
N/a.
N/a.
Needs review
3.0
Code