Protection from Bots

Created on 14 November 2024

Problem/Motivation

Is it possible to have configurable bot protection to prevent bots from sending requests directly to the search URL? I have Honeypot and Antibot installed, which protect the form but don't prevent queries from going directly to the search URL with the key parameter. This is likely an issue with any Drupal search, but since Vertex AI searches come at a cost, bots spamming the search results URL can drive up the bill.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

✨ Feature request
Status

Active

Version

1.3

Component

Miscellaneous

Created by

🇺🇸United States Christian DeLoach


Merge Requests

Comments & Activities

  • Issue created by @Christian DeLoach
  • 🇺🇸United States SamLerner

    I like this idea. I'm actually looking into this problem for some sites I'm managing, as search bots are causing millions of additional hits each month on search results pages.

    My first attempt to solve this was to add a <meta name="robots" content="noindex,nofollow"/> tag on the URL for the search page. This seems like something easy to add as an option for the Drupal search page, if in fact it's a working solution.
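
    For illustration, a minimal sketch of how such an option could attach the tag, assuming a hypothetical custom module "mymodule" and an assumed route name (not this module's actual code):

    ```php
    <?php
    // Sketch only: "mymodule" and the route name are assumptions.

    /**
     * Implements hook_page_attachments().
     */
    function mymodule_page_attachments(array &$attachments) {
      // Adjust to the actual route of your search results page.
      if (\Drupal::routeMatch()->getRouteName() === 'mymodule.search_page') {
        $attachments['#attached']['html_head'][] = [
          [
            '#tag' => 'meta',
            '#attributes' => [
              'name' => 'robots',
              'content' => 'noindex,nofollow',
            ],
          ],
          'mymodule_search_robots',
        ];
      }
    }
    ```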

    @christian-deloach did you have any specific configuration options in mind?

  • 🇺🇸United States Christian DeLoach

    Thank you @samlerner.

    By default, Drupal core's robots.txt already includes "Disallow: /search/", which asks search engines not to crawl /search/ paths. Adding the robots meta tag may be redundant unless a crawler ignores robots.txt or the site doesn't ship Drupal's default robots.txt file.

    But neither the robots.txt file nor the robots meta tag will prevent malicious bots from sending queries directly to the search application.

    My thought is to add an option that checks whether the search query came from the search form by passing a token. This would obviously "break" how Drupal search currently works, since Drupal does not require search requests to come from the form, so the option should be an add-on that is disabled by default. But I suspect most sites running the Vertex AI Search module would benefit, unless the site is already protected from bots.
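
    A rough sketch of that token check, with a hypothetical "tok" query argument and seed value; Drupal's CSRF token service could handle generation and validation:

    ```php
    <?php
    // Sketch only: the "tok" argument, seed value, and function name are
    // assumptions, not the module's actual code.

    use Symfony\Component\HttpFoundation\Request;
    use Symfony\Component\HttpKernel\Exception\AccessDeniedHttpException;

    function _mymodule_require_search_token(Request $request): void {
      if ($request->query->has('keys')) {
        $token = (string) $request->query->get('tok', '');
        // Validates tokens generated earlier with
        // \Drupal::csrfToken()->get('vertex_search_form') in the form.
        if (!\Drupal::csrfToken()->validate($token, 'vertex_search_form')) {
          throw new AccessDeniedHttpException();
        }
      }
    }
    ```

    One caveat: CSRF tokens for anonymous users depend on a session, so heavily cached anonymous traffic would need a different token scheme; another reason for the option to be off by default.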

    As a quick fix, I was considering setting up my server to redirect any request to the /search path without the "searchPage" argument to the default search page. The "searchPage" argument appears to be added by the Vertex AI Search module to the search redirect URL. However, the "searchPage" argument is not added when submitting the search form from the Search Form Block; it's only added when submitting the form from the search page itself. Of course, it's not a robust way to block bots, but the malicious bots hitting my site aren't going through the form at all; they send their queries directly to the search URL (e.g. /search?keys=foobar).
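
    For reference, that quick fix could look like this as Apache rewrite rules (a sketch under the assumptions above, not a robust defense):

    ```apacheconf
    # Sketch: redirect /search requests that carry "keys" but lack the
    # "searchPage" argument back to the bare search page.
    RewriteEngine On
    RewriteCond %{QUERY_STRING} (^|&)keys= [NC]
    RewriteCond %{QUERY_STRING} !(^|&)searchPage= [NC]
    RewriteRule ^search$ /search? [R=302,L]
    ```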

  • 🇺🇸United States SamLerner

    I see what you're saying. Another idea would be to use flood control, similar to what Acquia Search is using:

    https://git.drupalcode.org/project/acquia_search/-/blob/3.1.x/src/EventS...

    That wouldn't block bots from using the search path, but it could keep things from getting out of hand.
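
    Drupal core's flood service supports exactly this pattern. A minimal sketch, with an illustrative event name and limits rather than the actual implementation:

    ```php
    <?php
    // Sketch only: event name, limits, and function are illustrative.

    use Drupal\Core\Flood\FloodInterface;

    function _mymodule_search_allowed(FloodInterface $flood): bool {
      $event = 'vertex_ai_search.query'; // Hypothetical event name.
      $threshold = 50;                   // Max searches per window.
      $window = 3600;                    // Window length in seconds.

      // isAllowed() counts prior register() calls for this event, keyed
      // to the current client IP by default.
      if (!$flood->isAllowed($event, $threshold, $window)) {
        return FALSE;
      }
      // Record this request against the client's counter.
      $flood->register($event, $window);
      return TRUE;
    }
    ```

    The calling code would skip the billable Vertex AI request (and render a message) whenever this returns FALSE.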

  • Merge request !23: Adds flood control to searches. (Merged) created by SamLerner
  • 🇺🇸United States tzura

    timozura made their first commit to this issue’s fork.

  • 🇺🇸United States tzura

    There is an MR that needs review. It adds flood control to the Vertex search service when flood control is enabled in a search page's configuration.

    To configure, edit the Vertex search page, check the box to enable flood control, and set the threshold, window, and message values. Perform enough searches to hit your threshold, and no more Vertex searches will be performed until the window closes.

  • First commit to issue fork.
  • 🇺🇸United States tzura

    @christian-deloach we added flood control functionality. I'll update the docs soon, but the 1.5.0-beta5 release adds flood control options (threshold, time window, message) to the search page configuration. It's disabled by default, but if you get a chance to test it out, let me know how it goes. Closing this issue for now, though; hoping it helps ward off the bots.

  • 🇺🇸United States Christian DeLoach

    This is a great step towards adding bot protection! It may be slightly redundant with Google Cloud's own "Search requests per minute" quota, but the new configuration gives site owners control over their own timespan. A per-minute quota caps bursts but not sustained traffic: even if my current Google Cloud threshold is 10 requests per minute, which is potentially too low, a bot can still hit the search 600 times in 60 minutes. With the new configuration, I can instead set a threshold like 100 requests per hour. Of course, it all depends on the site's traffic and how often the search is used.

    Just a thought: should the response return a different HTTP status code when the flood control limit has been reached? Something like 429 Too Many Requests, or 403 Forbidden if we don't want to convey why the request was unsuccessful.
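
    For what it's worth, Symfony ships a dedicated exception for that status code; a sketch with illustrative values:

    ```php
    <?php
    // Sketch only: event name and limits are illustrative. The first
    // constructor argument becomes the Retry-After response header.

    use Symfony\Component\HttpKernel\Exception\TooManyRequestsHttpException;

    $window = 3600; // Flood window in seconds, reused as Retry-After.
    if (!\Drupal::flood()->isAllowed('vertex_ai_search.query', 50, $window)) {
      throw new TooManyRequestsHttpException($window, 'Too many search requests.');
    }
    ```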

    Other enhancement considerations:

    • Disable flood control for administrators, authenticated users, or specific roles.
    • Log flood control errors in the Drupal log.
  • 🇺🇸United States SamLerner

    Thanks @christian-deloach, I like these suggestions. Thoughts on them:

    1. I like the idea of returning a 429 Too Many Requests response; that's what the Acquia Search module does. It's useful to be informative, and I don't think it poses much of a security risk, as the flood protection is already working. Maybe we make it an option, in addition to 403 Forbidden and 500 Internal Server Error.
    2. Disabling the control by role sounds interesting, but what's the use case? Not sure why we'd want anybody able to flood the search.
    3. Logging the errors is intriguing, but I'm wondering if that could overload the database when using dblog. If we do this, we should make it optional and warn about the potential database issue (see the sketch after this list).
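
    A sketch of what that optional logging could look like, behind a hypothetical opt-in setting:

    ```php
    <?php
    // Sketch only: the setting is hypothetical; off by default so an
    // attack doesn't also flood the dblog table.

    $log_blocked_searches = FALSE; // Hypothetical opt-in setting.
    if ($log_blocked_searches) {
      \Drupal::logger('vertex_ai_search')->warning(
        'Flood control blocked a search from @ip.',
        ['@ip' => \Drupal::request()->getClientIp()]
      );
    }
    ```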

    These should all be new feature request issues, we can discuss further there.

  • 🇺🇸United States Christian DeLoach

    Thank you @samlerner. The intent behind disabling flood control by role wasn't to allow humans to flood the search but to prevent flood control from blocking visitors we know are likely human and whose intent is unlikely to be malicious.

    It was just a thought. Considering most sites don't have authenticated users, the effort to implement may outweigh the benefit.
