Rate-limit by ASN

Created on 17 May 2024
Updated 14 June 2024

Problem/Motivation

We recently encountered a crawler that did not identify itself as a bot (it posed as human traffic) and rapidly crawled our site from hundreds of unique IP addresses and user agent strings. Because the crawl was distributed, it bypassed both our bot and regular-traffic request limits: no single "visitor" exceeded our established limits, but in aggregate the crawler was far over them.

Further analysis revealed that all of the crawler's traffic came from a single autonomous system number (ASN), which identified the network of the cloud-computing platform the crawler was running on. To keep distributed crawlers like this from bypassing all limits (and slowing down our site), I would like an optional ability to rate-limit regular traffic at the ASN level. A limit at this level would need to account for the fact that many unrelated visitors share any given ASN, but I think a sane limit here could really help.

Steps to reproduce

N/a.

Proposed resolution

Add the ability to rate-limit regular traffic at the ASN level. E.g.:

$settings['crawler_rate_limit.settings']['regular_traffic'] = [
  'interval' => 600,
  'requests' => 300,
  'asn_interval' => 600,
  'asn_requests' => 800,
];

When enabled, use a tool like GeoIP2-php to obtain the ASN for the IP address of each regular-traffic request and enforce ASN-level rate limits. Cache the IP-to-ASN mapping so the lookup is only needed on the first request from a given IP.
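The lookup-and-cache flow described above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the module's code: the `AsnRateLimiter` class, its `allow()` method, and the fixed-window counting strategy are all hypothetical, and the ASN lookup is injected as a closure so the example runs without a GeoIP database. The example IPs and ASN come from documentation-reserved ranges.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of ASN-level rate limiting (fixed-window counter).
 * None of these names exist in the crawler_rate_limit module.
 */
final class AsnRateLimiter
{
    /** @var array<string, int|null> Cached IP -> ASN results. */
    private array $asnCache = [];

    /** @var array<int, array{start: int, count: int}> Window state per ASN. */
    private array $windows = [];

    public function __construct(
        private \Closure $lookup, // fn(string $ip): ?int
        private int $interval,    // window length in seconds ('asn_interval')
        private int $requests     // allowed requests per window ('asn_requests')
    ) {
    }

    /** Returns true if the request from $ip at time $now is within the ASN limit. */
    public function allow(string $ip, int $now): bool
    {
        // Resolve the ASN only on the first request from a given IP.
        if (!array_key_exists($ip, $this->asnCache)) {
            $this->asnCache[$ip] = ($this->lookup)($ip);
        }
        $asn = $this->asnCache[$ip];
        if ($asn === null) {
            return true; // Assumption: IPs with no known ASN are not ASN-limited.
        }

        // Start a fresh window if none exists or the current one has expired.
        $window = $this->windows[$asn] ?? ['start' => $now, 'count' => 0];
        if ($now - $window['start'] >= $this->interval) {
            $window = ['start' => $now, 'count' => 0];
        }
        $window['count']++;
        $this->windows[$asn] = $window;

        return $window['count'] <= $this->requests;
    }
}

// With GeoIP2-php and an ASN database, the lookup might instead be:
//   $reader = new \GeoIp2\Database\Reader('/path/to/GeoLite2-ASN.mmdb');
//   $lookup = fn (string $ip): ?int => $reader->asn($ip)->autonomousSystemNumber;
// (Reader::asn() throws for addresses missing from the database, so real
// code would need to catch that.)

// Stub lookup: every 203.0.113.x address maps to the same example ASN.
$lookups = 0;
$lookup = function (string $ip) use (&$lookups): ?int {
    $lookups++;
    return str_starts_with($ip, '203.0.113.') ? 64500 : null;
};

$limiter = new AsnRateLimiter($lookup, 600, 800);

// 1000 requests spread over 200 IPs in one ASN, all within the same window.
$allowed = 0;
for ($i = 0; $i < 1000; $i++) {
    if ($limiter->allow('203.0.113.' . ($i % 200), 1000)) {
        $allowed++;
    }
}

printf("allowed=%d lookups=%d\n", $allowed, $lookups); // allowed=800 lookups=200
```

Each of the 200 IPs makes only five requests, well under a per-visitor limit of 300, yet the ASN-level counter cuts the combined traffic off at 800. In a real deployment, both the IP-to-ASN cache and the counters would have to live in storage shared across requests rather than in PHP arrays.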

Remaining tasks

Discuss, patch, review.

User interface changes

N/a.

API changes

N/a.

Data model changes

N/a.

✨ Feature request
Status

Needs review

Version

3.0

Component

Code

Created by

🇺🇸 United States chrisolof
