Disallow AI bots by default in robots.txt

Created on 1 November 2023
Updated 28 June 2024

Problem/Motivation

Users should be protected from AI bot scraping by default. If they want to allow it, they can opt in after the fact by editing robots.txt or by using modules like RobotsTxt. This would protect users and teams who are unaware of AI bot scraping, who don't want it, or who don't realise they need to take action themselves.

OpenAI's GPTBot documentation describes the crawler as follows:

Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.

I think this is a sensible change; it also avoids placing full trust in ChatGPT or Google Bard not to ingest things, or to correctly treat sensitive content as sensitive.

Proposed resolution

Add the following to the default robots.txt to block the main AI training crawlers: OpenAI's GPTBot (ChatGPT), Common Crawl's CCbot, Anthropic's anthropic-ai and Claude-Web, and Google-Extended:

User-agent: GPTBot
Disallow: /

User-agent: CCbot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

It may also be relevant to add an ai.txt by default for a broader disallow (is this an actual growing standard, or only a proposal?).

See also: https://site.spawning.ai/spawning-ai-txt

✨ Feature request
Status

Active

Version

11.0

Component
Component
Base

Created by

πŸ‡ΊπŸ‡ΈUnited States kevinquillen


Comments & Activities

  • Issue created by @kevinquillen
  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    Moving to core queue since RobotsTxt module uses the default robots.txt file to create initial configuration.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen
  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    This would be the first crawler blocked by default in robots.txt.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    Here?

    https://git.drupalcode.org/project/drupal/-/blob/11.x/robots.txt?ref_typ...

    Because, as I understand it, you'd need to declare the disallow again for each named bot to prevent it from crawling content paths, where the default currently only covers assets and admin paths (see the sketch after the quoted header below):

    # This file is to prevent the crawling and indexing of certain parts
    # of your site by web crawlers and spiders run by sites like Yahoo!
    # and Google. By telling these "robots" where not to go on your site,
    # you save bandwidth and server resources.
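
    To illustrate the point, a minimal sketch (not the full core file; the wildcard paths are just examples of the kind of defaults it ships): a named user-agent group does not inherit the rules of the * group, so a bot-specific group has to carry its own Disallow lines:

    # Existing defaults apply only to crawlers matching the wildcard group.
    User-agent: *
    Disallow: /admin/
    Disallow: /core/

    # A named group replaces the wildcard rules for that bot, so a full
    # block has to be declared explicitly for it.
    User-agent: GPTBot
    Disallow: /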

  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    I don't understand what #5 is communicating (to me?). If it is to me, then maybe my comment wasn't clear.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    I thought you were saying that GPTBot is blocked already from scraping a Drupal site by default, however I am not seeing that.

  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    That explains the confusion. That's not what I was saying.

    This issue would introduce, for the first time, blocking a specific crawler by default. By that I mean, Drupal AFAIK has never done that and it comes with some downsides. Notably, there are many crawlers. Doing this could open the door to other requests to add hated crawlers.

    So all I'm saying is that there is another decision implicit in this issue: whether to block specific crawlers.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    @kevinquillen why? What does this achieve (long term)?
    Is there an article that goes in depth on this, weighing both sides of the argument?
    This would be a big decision and needs solid reasons.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    https://searchengineland.com/more-popular-websites-blocking-gptbot-432531

    https://www.kevin-indig.com/most-sites-will-block-chat-gpt/

    https://searchengineland.com/websites-blocking-gptbot-431183

    It's not a knock against AI. I am simply saying that a CMS should likely have this by default to protect users from their content entering models either prematurely or at all. This all largely entered the public consciousness in the last 12 months, so it's brand new to most people. But I can see a point down the road where some sites find their content parroted back by ChatGPT models without consent, and IMO it would look bad on Drupal to have assumed users would just know to add this post-install. This is different from search engine crawlers, which have been around a long time. I think it should be something where users say "okay, I am ready to allow GPTBot now, remove it" or not at all.

    I can't see far enough down the road yet, but it seems like once something is in an LLM, it's permanent. Depending on the type of site, you may not want this at all, but at the same time it may not even occur to people until it's too late to do anything about it.

    NPR, for example: https://www.npr.org/robots.txt

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Is there a wider list we should be considering, rather than just one AI-based crawler?
    I agree with @cilefen that we shouldn't just pick one.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    Reuters blocks even more. Is there a standard list somewhere that anyone knows of?

    User-agent: PiplBot
    Disallow: /
    
    User-agent: CCbot
    Disallow: /
    
    User-agent: anthropic-ai
    Disallow: /
    
    User-agent: Claude-Web
    Disallow: /
    
    User-agent: Google-Extended
    Disallow: /
    
  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen
  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    I read the articles.

    Summary:

    • Websites block AI out of financial interest.
    • There is a massive trend of top (i.e. profitable) websites blocking AI.

    Conclusion:

    • This is slowing AI progress (maybe a good thing?).
    • This is slowing AI usefulness (probably very bad for users), i.e. good for business, bad for users.

    I tend to agree it can be blocked by default, because:

    1. It's a massive trend anyway.
    2. As Kevin has said, it's not possible to remove the data once it's in the model. So data can end up in a model before people have had enough time to consent to sharing it (if bots are not blocked by default). People should be given time to decide.

    Counter-arguments?

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen
  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen
  • πŸ‡±πŸ‡ΉLithuania mindaugasd
  • πŸ‡¬πŸ‡§United Kingdom scott_euser

    At my agency (Soapbox) we work with Think Tanks and other organisations that generally work to positively influence policy and decision making. We have consulted with a number of them on their thoughts since the Robots.txt blocking was added as an option. The general feeling we gather is that they want their information used to better inform the end user, rather than blocking the information and having AI bots produce a less accurate result. There are concerns about citations and credits and prioritisation of research driven and informed answers of course, but the general idea is that they prefer to allow AI bots.

    There is a danger that this goes unnoticed if it is not front and centre in a site builder's installation steps, resulting in an eventual large number of Drupal websites no longer contributing to the quality of the responses that AI bots give. I'm not sure if there are any statistics, but I would expect, e.g., a higher percentage of Drupal sites to come from organisations that influence policy and decision making, compared to possibly lower-quality content from WordPress sites that may be more prone to individuals' opinions; so there is a possible danger of reducing AI bot response quality.

    Not against having to opt in to tracking for our clients, of course, but I suppose this needs a general Drupal policy consideration (if we do not already have one).

  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    My organization is similarly-minded. All of its published work is for everyone. I imagine scholars here would consider bad or missing citations as an oversight on the user’s end rather than a reason to block anyone. But that is only what I think. I will ask around.

  • πŸ‡ΊπŸ‡ΈUnited States mindbet

    To answer @cilefen

    The proposal above seems analogous to Drupal's blocking of Google FLoC:

    https://www.drupal.org/project/drupal/issues/3209628

    FLoC has since gone away, and so have the blocking headers:

    https://www.drupal.org/project/drupal/issues/3260401

    That said, my preference is that blocking AI bots shouldn't be in core;
    it should be in contrib for those who want to opt out.

    Perhaps I am still on a sugar high from drinking the AI Kool-Aid but I think
    the benefits of LLMs will greatly outweigh the risks.

    AI bots reading content should be considered transformative use.

    Imagine multiple super intelligences reading the medical literature,
    coming up with new treatments and models -- but wait -- a content troll company
    like Elsevier prevents this with exorbitant fees or a complete blockade.

    For your consideration, an essay from Benedict Evans:

    Generative AI and intellectual property

    https://www.ben-evans.com/benedictevans/2023/8/27/generative-ai-ad-intel...

  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    I've confirmed in a technical staff committee meeting that blocking this crawler from accessing public information would be perceived as an anti-pattern by my organization. Unattributed citations are the LLM user's responsibility.

  • πŸ‡ΉπŸ‡ΉTrinidad and Tobago frazras

    I see this as an opinionated and somewhat intrusive one-sided stance on artificial intelligence as a whole.

    The majority of public-facing websites exist to share content, not to make a profit. Most of the people with issues about their content ending up in a model are in the minority: the powerful, profit-seeking content creators who will suffer with the progress of technology, as they already have with the internet killing paper.
    Although not apples to apples, this is like blocking the Internet Archive bot from accessing your site because you might post content you want removed in the future.

    Besides, setting up weak barriers like this will create a market for subversion, with people using lesser-known crawlers to bypass the blocklist. With open-source models like LLaMA and the many others close to GPT-4 in power, these non-commercial options will be just as capable as the popular ones, and hacks will be used to employ them. You will end up chasing a constantly growing list of bots.

    I say still include the directives, but commented out, for those who care about their content being indexed by AI; they should have the resources to uncomment them if they really need to (see the sketch below).
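
    A sketch of what that opt-in-to-blocking approach might look like in the shipped robots.txt, reusing the user agents proposed above (directives present but commented out, so site owners uncomment the groups they want):

    # Uncomment the following groups to opt out of AI training crawlers.
    # User-agent: GPTBot
    # Disallow: /
    #
    # User-agent: Google-Extended
    # Disallow: /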

  • πŸ‡°πŸ‡¬Kyrgyzstan sahaj

    Unless a customer requires otherwise, I'm totally fine feeding the AI.
    But what really got on my nerves is that the Anthropic/Claude bots in particular do not follow robots.txt instructions.
    More than that, they send so many dull, repeated requests that I had to spend nearly a full day blocking what can almost be considered a DDoS attack.
    I won't go further into the ethics and trust values that Anthropic claims to uphold, in order to keep this thread nice.

  • πŸ‡¬πŸ‡§United Kingdom scott_euser

    Yeah, we've had a couple of sites where we've had to temporarily block ClaudeBot as well (then remove the block), as it was leading to degraded performance for visitors; agreed, it sometimes feels DoS-attack-like. Beyond ClaudeBot, the rest seem to be better throttled at the source, which seems to reflect the sentiment on Reddit/X on the topic. A minimal example of such a temporary block is sketched below.
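
    For reference, a temporary block of that kind could look like the following (a minimal sketch, assuming the crawler identifies itself as ClaudeBot and honours robots.txt; crawlers that ignore robots.txt would need to be blocked at the web server or CDN level instead):

    # Temporary block while the crawler is causing excessive load.
    # Remove this group once traffic returns to normal.
    User-agent: ClaudeBot
    Disallow: /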

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    We recently had to block an AI crawler on a site as well, because it was effectively taking the site down.
