Rate-limit by ASN

Created on 17 May 2024, 6 months ago
Updated 24 July 2024, 3 months ago

Problem/Motivation

We recently encountered a crawler that did not identify itself as a bot (it posed as human traffic) and rapidly crawled our site from hundreds of unique IPs and user agent strings. Because the crawl was distributed, it bypassed both our bot and regular-traffic request limits: no single "visitor" exceeded our established limits, but the crawler's combined traffic was far over them.

Further analysis revealed that all traffic from this crawler came from a single autonomous system number (ASN), the one identifying the network of the cloud-computing platform the crawler was running on. To keep distributed crawlers like this from bypassing all limits (and slowing down our site), I would like the optional ability to rate-limit regular traffic at the ASN level. A limit at this level would need to account for the fact that many unrelated visitors share any given ASN, but I think a sensibly generous limit could really help.

Steps to reproduce

N/a.

Proposed resolution

Open up the ability to rate-limit regular traffic at the ASN level. E.g.:

$settings['crawler_rate_limit.settings']['regular_traffic'] = [
  'interval' => 600,
  'requests' => 300,
  'asn_interval' => 600,
  'asn_requests' => 800,
];

When enabled, use a tool like GeoIP2-php to look up the ASN of each regular-traffic request's IP and enforce ASN-level rate limits. Cache the IP->ASN mapping so the lookup is only necessary on the first request from a given IP.
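
The idea above can be sketched roughly as follows. This is a minimal in-memory illustration, not the module's implementation: the class name `AsnRateLimiter`, the helper `geoIp2AsnLookup()`, and the fixed-window counting scheme are all assumptions made for the example, and a real deployment would presumably keep counters in a shared backend such as memcached rather than a PHP array. The only GeoIP2 calls used are `GeoIp2\Database\Reader::asn()` and the `autonomousSystemNumber` property of its result.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical fixed-window limiter keyed by ASN (illustration only).
 */
final class AsnRateLimiter
{
    /** @var array<string, int|null> IP => ASN cache (null means "unknown"). */
    private array $asnByIp = [];

    /** @var array<string, int> "ASN:window index" => request count. */
    private array $counts = [];

    /** @var callable Maps an IP string to an ASN, or null if unknown. */
    private $lookup;

    public function __construct(callable $lookup, private int $interval, private int $requests)
    {
        $this->lookup = $lookup;
    }

    /**
     * Records one request and reports whether it is within the ASN limit.
     */
    public function allow(string $ip, int $now): bool
    {
        // Resolve the ASN only on the first request seen from this IP.
        if (!array_key_exists($ip, $this->asnByIp)) {
            $this->asnByIp[$ip] = ($this->lookup)($ip);
        }
        $asn = $this->asnByIp[$ip];
        if ($asn === null) {
            // No ASN known for this IP: no ASN-level limit is applied.
            return true;
        }

        // Requests in the same interval-sized window share one counter.
        // Old windows are never pruned in this sketch.
        $key = $asn . ':' . intdiv($now, $this->interval);
        $this->counts[$key] = ($this->counts[$key] ?? 0) + 1;

        return $this->counts[$key] <= $this->requests;
    }
}

/**
 * Hypothetical lookup backed by geoip2/geoip2 and an ASN .mmdb file.
 */
function geoIp2AsnLookup(string $databasePath): callable
{
    $reader = new \GeoIp2\Database\Reader($databasePath);

    return static function (string $ip) use ($reader): ?int {
        try {
            return $reader->asn($ip)->autonomousSystemNumber;
        } catch (\GeoIp2\Exception\AddressNotFoundException $e) {
            // IP is not present in the database.
            return null;
        }
    };
}
```

With the example values from the settings above, this would be wired up as `new AsnRateLimiter(geoIp2AsnLookup($pathToAsnDatabase), 600, 800)`, where `$pathToAsnDatabase` is a placeholder, and `allow($ip, time())` would be called once per regular-traffic request. A fixed window is the simplest scheme to show; it permits short bursts at window boundaries, which a sliding window would smooth out.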

Remaining tasks

Discuss, patch, review.

User interface changes

N/a.

API changes

N/a.

Data model changes

N/a.

✨ Feature request
Status

Fixed

Version

3.0

Component

Code

Created by

🇺🇸 United States chrisolof


Comments & Activities

  • Issue created by @chrisolof
  • Merge request !2 Resolve #3447955 "Rate limit by asn" → (Merged) created by chrisolof
  • Status changed to Needs review 5 months ago
  • 🇺🇸 United States chrisolof
  • First commit to issue fork.
  • 🇺🇸 United States chrisolof

    Notes from testing this MR against real traffic for about two weeks now:

    This is currently performing very well / as designed against real traffic. The addition of the ASN lookup on those requests where it is actually necessary (requester not openly identifying as bot and not already blocked at the visitor-level) is so fast, even without the optional C extension, that it is imperceptible to our end-users. On the other hand, rate-limiting all crawlers, including those that horizontally spread out across multiple IPs and/or user agent strings, is a perceptible performance boost for our end-users.

  • 🇺🇸 United States darrell_ulm

    This looks good, although I'm getting this message now:
    Missing dependencies: In order to rate-limit regular traffic at the ASN-level, you need to install the GeoIP2 PHP API.
    I did download a test ASN database, GeoLite2-ASN-Test.mmdb.

    Full message is:

    CRAWLER RATE LIMIT Enabled

    • Configured to use memcached backend.
    • Rate limiting bot/crawler requests at the bot/crawler-level. 100 requests allowed per bot/crawler over a 600-second interval.
    • Rate limiting regular traffic requests at the visitor-level. 200 requests allowed per visitor over a 1400-second interval.
    • Rate limiting regular traffic requests at the ASN-level. 600 requests allowed per ASN over a 600-second interval.
    • Issue(s) detected that prevent rate limiter from functioning. In order to prevent fatal errors rate limiting has been disabled. You must fix all the errors or disable the Crawler Rate Limit.
    • Missing dependencies: In order to rate-limit regular traffic at the ASN-level, you need to install the GeoIP2 PHP API.
  • 🇷🇸 Serbia vaish

    Darrell, there are two steps to complete before you can use ASN-level rate limiting. You have already downloaded the ASN database. What's left is to install the PHP package geoip2/geoip2; that is what the error message you got is about. You can install it via Composer, as usual:

    composer require geoip2/geoip2

    Please, let me know if you run into any other issues with this feature. I'm about to merge this MR but feel free to open a follow up issue if you find any bugs.
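
    The two prerequisites described above can be checked with a small script before enabling the feature. This is only a sketch: the function name `asnRateLimitProblems()` and its messages are made up for illustration and are not the module's own status check.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical pre-flight check for ASN-level rate limiting.
 *
 * @return string[] Human-readable problems; empty if none were found.
 */
function asnRateLimitProblems(string $databasePath): array
{
    $problems = [];

    // The Reader class is provided by the geoip2/geoip2 Composer package.
    if (!class_exists('GeoIp2\Database\Reader')) {
        $problems[] = 'geoip2/geoip2 is not installed (composer require geoip2/geoip2).';
    }

    // The ASN database file must exist and be readable by PHP.
    if (!is_readable($databasePath)) {
        $problems[] = "ASN database is not readable at {$databasePath}.";
    }

    return $problems;
}
```

    Run it after loading Composer's autoloader, passing the path to the downloaded .mmdb file; an empty array means both prerequisites appear to be in place.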

  • Pipeline finished with Skipped
    4 months ago
    #220908
  • Status changed to Fixed 4 months ago
  • 🇷🇸 Serbia vaish

    Thanks @chrisolof. Everything works great. I just made a few minor tweaks.

  • 🇺🇸 United States darrell_ulm

    That makes sense, thank you. I'll give it another try from the dev branch.

  • Automatically closed - issue fixed for 2 weeks with no activity.
