Block User Agents

Created on 2 June 2025, about 2 months ago

Problem/Motivation

It would be nice to have a configurable list of user agent substrings to block. I just saw a lot of requests from a user agent containing "HTTrack", which seems to be a "website copier" tool. It's generating a lot of requests.
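
A minimal sketch of the kind of substring check requested here, in Python purely for illustration (it is not code from this module). The names `is_blocked` and `BLOCKED_SUBSTRINGS` are hypothetical, and "HTTrack" is the only entry taken from this report; the sample user agent strings are made up for the example.

```python
# Hypothetical blocklist; "HTTrack" is the only entry taken from the report.
BLOCKED_SUBSTRINGS = ["HTTrack"]


def is_blocked(user_agent, blocked=BLOCKED_SUBSTRINGS):
    """Return True if the User-Agent value contains any blocked substring.

    Matching is case-insensitive. A missing header (None) is treated as an
    empty string, so it is not blocked by this check.
    """
    ua = (user_agent or "").lower()
    # Skip empty entries, which would otherwise match every user agent.
    return any(s.lower() in ua for s in blocked if s)


print(is_blocked("Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"))  # True
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Firefox/128.0"))  # False
```

A plain substring match is easy to configure but trivial to evade, since the client controls the header, so it mainly helps against tools that announce themselves.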

✨ Feature request
Status

Active

Version

1.0

Component

Code

Created by

🇺🇸 United States bburg Washington D.C.


Comments & Activities

  • Issue created by @bburg
  • 🇩🇰 Denmark ressa Copenhagen

    That would be an interesting feature. Since HTTrack is a scraper, though, adding it might mean this project should consider expanding its scope and changing its name to https://www.drupal.org/project/bot_blocker → ? Scrapers can cause a lot of extra traffic, which can be a strain even for websites without facets.

  • 🇺🇸 United States bburg Washington D.C.

    I do like the idea of using a more general namespace for the module. I do think it's important to keep a separation of concerns, though. I'll keep this issue active until I or someone else creates "bot_blocker".

  • 🇩🇰 Denmark ressa Copenhagen

    Sounds great, and thanks for all your work with facets and agents already here.

    About blocking scrapers, one method could be a rule on the number of hits over a certain period (maybe five minutes?), with the option to block an IP if a threshold of requested URLs is exceeded. The reason I thought of a more generalized "hits per time period" rule is that I have a website where five or six facets selected by a human is to be expected. Intense pounding by a bot is problematic mostly because of the rapid requests, not the number of facets.
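
The "hits per time period" rule suggested above could be sketched as a sliding-window counter per IP. This is only an illustration in Python under stated assumptions: the class name `HitLimiter`, the default of 100 hits, and the in-memory storage are all hypothetical, and the five-minute window is the value floated in the comment. A real Drupal implementation would need shared storage across requests.

```python
import time
from collections import defaultdict, deque


class HitLimiter:
    """Allow at most max_hits requests per IP within a sliding time window."""

    def __init__(self, max_hits=100, window_seconds=300):
        # Assumed defaults: 100 hits per 5 minutes (300 seconds).
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed hits

    def allow(self, ip, now=None):
        """Record a hit and return True, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False
        hits.append(now)
        return True


limiter = HitLimiter(max_hits=3, window_seconds=300)
# 203.0.113.7 is a documentation-range address used as a placeholder.
print([limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3)])
# [True, True, True, False]
```

Because the rule counts requests rather than facets, a human clicking through five or six facets stays well under the threshold, while a rapid burst from one address is cut off until older hits age out of the window.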

  • 🇩🇰 Denmark ressa Copenhagen

    In the meantime, I created a doc page for the module and added some general tips and tricks: https://www.drupal.org/docs/extending-drupal/contributed-modules/contrib... →

    Perhaps a Documentation → link can be added under "Resources" on the project page?

  • 🇺🇸 United States bburg Washington D.C.

    That seems like good info, but it is not specific to this module per se, and it presumes you have more control over your server than you might on a host like Pantheon or Acquia.

  • 🇩🇰 Denmark ressa Copenhagen

    You're right @bburg. My thinking was that the module's scope might eventually be expanded from facets to cover regular bots and crawlers more broadly, and that in the meantime these tips could be relevant.

    I moved the page to Security in Drupal → under https://www.drupal.org/docs/administering-a-drupal-site/security-in-drup... → .

    About the audience, even the cheapest $5 a month shared hosting has command line access, so it should be doable. (Reading /var/log/apache2/access.log may require sudo permissions, though.)

    Also, Drupal web sites hosted on Acquia or Pantheon are in the minority, I would think. Actually, I am surprised to hear that ... is there no command line access on Acquia or Pantheon hosting?

  • 🇺🇸 United States bburg Washington D.C.

    My agency uses Pantheon mostly, but we work with a variety of hosting options for sites. Pantheon does not allow SSH access, and logs are stored in separate app containers. Acquia allows SSH, but that doesn't allow much more than running drush commands, and logs are similarly stored somewhere else. It's a bit of a complicated situation, and I don't want to make assumptions in documentation about what people are running on.

  • 🇩🇰 Denmark ressa Copenhagen

    Thanks for the fast response. I didn't realize their hosting environments were still that limited ... But, like I wrote, only a minority of Drupal sites are hosted on these two, and there are plenty of cheap managed hosts offering SSH, like the ones listed here:

    https://b2evolution.net/web-hosting/ssh-hosting-secure-shell-access.php
    https://whoishostingthis.com/best-web-hosting/ssh-access/

    After all, running a Drupal site in 2025 requires Composer and is too hard without Drush, so for the majority of Drupal sites SSH access is likely available, either on a VPS or on managed hosting with SSH. I haven't tried it, but cPanel, for example, offers a CLI via the browser.

  • 🇩🇰 Denmark ressa Copenhagen

    In the end, a generic module solution in the shape of a Drupal "Bad Bot Blocker" module would of course be best, since everyone could use that -- like I suggested -- maybe based around a rule for maximum number of hits over a certain period, before getting blocked?

    I found an article and script that do this via a firewall, and I added them to the doc page under Automated bot blocking via firewall → . That is of course a more advanced solution, limited by the capabilities of the server and the technical skills of the Drupal administrator.

  • 🇺🇸 United States bburg Washington D.C.

    Hello,

    I just wanted to let you know that I've created a new module at Bot Blocker → , which handles blocking requests with specific substrings in user agents (i.e. crawling tools), and very old versions of browsers as identified by their user agent strings. I'm going to mark this issue as "Closed (won't fix)", and we can discuss any feature enhancements over in that queue.
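
To illustrate the idea of identifying very old browsers from their user agent strings, here is a rough Python sketch. It is not the Bot Blocker module's implementation: the function name `is_outdated`, the two browser tokens it looks for, and the minimum versions in `MIN_MAJOR` are all assumptions chosen for the example.

```python
import re

# Hypothetical minimum major versions; real thresholds are a site-specific choice.
MIN_MAJOR = {"Chrome": 100, "Firefox": 100}

# Matches tokens such as "Chrome/41" or "Firefox/128" and captures the major version.
_TOKEN = re.compile(r"\b(Chrome|Firefox)/(\d+)")


def is_outdated(user_agent, minimums=MIN_MAJOR):
    """Return True if a known browser token reports a major version below its minimum."""
    for name, major in _TOKEN.findall(user_agent or ""):
        # Browsers without a configured minimum are never treated as outdated.
        if int(major) < minimums.get(name, 0):
            return True
    return False


old_chrome = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/41.0.2272.96 Safari/537.36")
print(is_outdated(old_chrome))  # True
print(is_outdated("Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"))  # False
```

User agents without a recognized token, such as command-line clients, pass this check, so it complements a substring blocklist rather than replacing it.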

  • 🇩🇰 Denmark ressa Copenhagen

    Thanks @bburg, yes I just saw it, it's great news! Though you could argue that this issue got "Fixed", only it was by creating the other module :)
