Disallow AI bots by default in robots.txt

Created on 1 November 2023
Updated 28 June 2024

Problem/Motivation

Users should be protected from AI bot scraping by default. If they want to allow it, they can opt in after the fact by editing robots.txt or by using modules like RobotsTxt. This would protect users and teams who are unaware of AI bot scraping, who don't want it, or who don't realise they need to take action themselves.

OpenAI's GPTBot documentation describes the crawler as follows:

Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.

I think this is a sensible change; it also avoids placing full trust in ChatGPT or Google Bard not to ingest things, or to correctly treat sensitive content as sensitive.

Proposed resolution

Add the following to the default robots.txt to block the main AI training crawlers: OpenAI's GPTBot (ChatGPT), Common Crawl's CCbot, Anthropic's anthropic-ai and Claude-Web, and Google-Extended:

User-agent: GPTBot
Disallow: /

User-agent: CCbot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

It may also be relevant to add an ai.txt by default for a broader disallow (is this an actual growing standard, or only a proposal?).

See also: https://site.spawning.ai/spawning-ai-txt

✨ Feature request
Status

Active

Version

11.0

Component
Component
Base

Created by

πŸ‡ΊπŸ‡ΈUnited States kevinquillen


Comments & Activities

  • Issue created by @kevinquillen
  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    Moving to core queue since RobotsTxt module uses the default robots.txt file to create initial configuration.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen
  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    This would be the first crawler blocked by default in robots.txt.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    Here?

    https://git.drupalcode.org/project/drupal/-/blob/11.x/robots.txt?ref_typ...

    Because, as I understand it, you'd need to declare the disallow again for each named bot to prevent it from crawling content paths, where the default currently only covers assets and admin paths (see the sketch after the quoted header below):

    # This file is to prevent the crawling and indexing of certain parts
    # of your site by web crawlers and spiders run by sites like Yahoo!
    # and Google. By telling these "robots" where not to go on your site,
    # you save bandwidth and server resources.
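
    To illustrate the point, a minimal sketch (not the full core file; the wildcard paths are just examples of the kind of defaults it ships): a named user-agent group does not inherit the rules of the * group, so a bot-specific group has to carry its own Disallow lines:

    # Existing defaults apply only to crawlers matching the wildcard group.
    User-agent: *
    Disallow: /admin/
    Disallow: /core/

    # A named group replaces the wildcard rules for that bot, so a full
    # block has to be declared explicitly for it.
    User-agent: GPTBot
    Disallow: /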

  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    I don't understand what #5 is communicating (to me?). If it is to me, then maybe my comment wasn't clear.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    I thought you were saying that GPTBot is blocked already from scraping a Drupal site by default, however I am not seeing that.

  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    That explains the confusion. That's not what I was saying.

    This issue would introduce, for the first time, blocking a specific crawler by default. By that I mean, Drupal AFAIK has never done that and it comes with some downsides. Notably, there are many crawlers. Doing this could open the door to other requests to add hated crawlers.

    So all I'm saying is that there is another decision implicit in this issue: whether to block specific crawlers.

  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    @kevinquillen why? What does this achieve (long term)?
    Is there an article that goes in depth on this, weighing both sides of the argument?
    This would be a big decision and needs solid reasons.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    https://searchengineland.com/more-popular-websites-blocking-gptbot-432531

    https://www.kevin-indig.com/most-sites-will-block-chat-gpt/

    https://searchengineland.com/websites-blocking-gptbot-431183

    It's not a knock against AI. I am simply saying that a CMS should likely have this by default to protect users from their content entering models either prematurely or at all. This all largely entered the public consciousness in the last 12 months, so it's brand new to most people. But I can see a point down the road where some sites find their content parroted back by ChatGPT models without consent, and IMO it would look bad on Drupal to have assumed users would just know to add this post-install. This is different from search engine crawlers, which have been around a long time. I think it should be something where users say "okay, I am ready to allow GPTBot now, remove it" or not at all.

    I can't see far enough down the road yet, but it seems like once something is in an LLM, it's permanent. Depending on the type of site, you may not want this at all, but at the same time it may not even occur to people until it's too late to do anything about it.

    NPR, for example: https://www.npr.org/robots.txt

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Is there a wider list we should be considering, rather than just one AI-based crawler?
    I agree with @cilefen that we shouldn't just pick one.

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    Reuters blocks even more. Is there a standard list somewhere that anyone knows of?

    User-agent: PiplBot
    Disallow: /
    
    User-agent: CCbot
    Disallow: /
    
    User-agent: anthropic-ai
    Disallow: /
    
    User-agent: Claude-Web
    Disallow: /
    
    User-agent: Google-Extended
    Disallow: /
    
  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen
  • πŸ‡±πŸ‡ΉLithuania mindaugasd

    I read the articles.

    Summary:

    • Websites block AI out of financial interest.
    • There is a massive trend of top (i.e. profitable) websites blocking AI.

    Conclusion:

    • This is slowing AI progress (maybe a good thing?).
    • This is slowing AI usefulness (probably very bad for users), i.e. good for business, bad for users.

    I tend to agree it can be blocked by default, because:

    1. It's a massive trend anyway.
    2. As Kevin has said, it's not possible to remove the data once it's in the model. So data can end up in a model before people have had enough time to consent to sharing it (if bots are not blocked by default). People should be given time to decide.

    Counter-arguments?

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen
  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen
  • πŸ‡±πŸ‡ΉLithuania mindaugasd
  • πŸ‡¬πŸ‡§United Kingdom scott_euser

    At my agency (Soapbox) we work with Think Tanks and other organisations that generally work to positively influence policy and decision making. We have consulted with a number of them on their thoughts since the Robots.txt blocking was added as an option. The general feeling we gather is that they want their information used to better inform the end user, rather than blocking the information and having AI bots produce a less accurate result. There are concerns about citations and credits and prioritisation of research driven and informed answers of course, but the general idea is that they prefer to allow AI bots.

    There is a danger that this goes unnoticed if it is not front and centre in a site builder's installation steps, resulting in an eventual large number of Drupal websites no longer contributing to the quality of the responses that AI bots give. I'm not sure if there are any statistics, but I would expect, e.g., a higher percentage of Drupal sites to come from organisations that influence policy and decision making, compared to possibly lower-quality content from WordPress sites that may be more prone to individuals' opinions; so there is a possible danger of reducing AI bot response quality.

    Not against having to opt in to tracking for our clients, of course, but I suppose this needs a general Drupal policy consideration (if we do not already have one).

  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    My organization is similarly-minded. All of its published work is for everyone. I imagine scholars here would consider bad or missing citations as an oversight on the user’s end rather than a reason to block anyone. But that is only what I think. I will ask around.

  • πŸ‡ΊπŸ‡ΈUnited States mindbet

    To answer @cilefen

    The proposal above seems analogous to Drupal's blocking of Google FLoC:

    https://www.drupal.org/project/drupal/issues/3209628

    FLoC has since gone away, and so have the blocking headers:

    https://www.drupal.org/project/drupal/issues/3260401

    That said, my preference is that blocking AI bots shouldn't be in core;
    it should be in contrib for those who want to opt out.

    Perhaps I am still on a sugar high from drinking the AI Kool-Aid but I think
    the benefits of LLMs will greatly outweigh the risks.

    AI bots reading content should be considered transformative use.

    Imagine multiple super intelligences reading the medical literature,
    coming up with new treatments and models -- but wait -- a content troll company
    like Elsevier prevents this with exorbitant fees or a complete blockade.

    For your consideration, an essay from Benedict Evans:

    Generative AI and intellectual property

    https://www.ben-evans.com/benedictevans/2023/8/27/generative-ai-ad-intel...

  • πŸ‡ΊπŸ‡ΈUnited States cilefen

    I've confirmed in a technical staff committee meeting that blocking this crawler from accessing public information would be perceived as an anti-pattern by my organization. Unattributed citations are the LLM user's responsibility.

  • πŸ‡ΉπŸ‡ΉTrinidad and Tobago frazras

    I see this as an opinionated and somewhat intrusive one-sided stance on artificial intelligence as a whole.

    The majority of public-facing websites exist to share content, not to make a profit. Most of the people with issues about their content ending up in a model are in the minority: the powerful, profit-seeking content creators who will suffer with the progress of technology, as they already have with the internet killing paper.
    Although not apples to apples, this is like blocking the Internet Archive bot from accessing your site because you might post content you want removed in the future.

    Besides, setting up weak barriers like this will create a market for subversion, with people using lesser-known crawlers to bypass the blocklist. With open-source models like LLaMA and the many others close to GPT-4 in power, these non-commercial options will be just as capable as the popular ones, and hacks will be used to employ them. You will end up chasing a constantly growing list of bots.

    I say still include the directives, but commented out, for those who care about their content being indexed by AI; they should have the resources to uncomment them if they really need to (see the sketch below).
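
    A sketch of what that opt-in-to-blocking approach might look like in the shipped robots.txt, reusing the user agents proposed above (directives present but commented out, so site owners uncomment the groups they want):

    # Uncomment the following groups to opt out of AI training crawlers.
    # User-agent: GPTBot
    # Disallow: /
    #
    # User-agent: Google-Extended
    # Disallow: /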

  • πŸ‡°πŸ‡¬Kyrgyzstan sahaj

    Unless a customer requires otherwise, I'm totally fine feeding the AI.
    But what really got on my nerves is that the Anthropic/Claude bots in particular do not follow robots.txt instructions.
    More than that, they send so many dull, repeated requests that I had to spend nearly a full day blocking what can almost be considered a DDoS attack.
    I won't go further into the ethics and trust values that Anthropic claims to uphold, in order to keep this thread nice.

  • πŸ‡¬πŸ‡§United Kingdom scott_euser

    Yeah, we've had a couple of sites where we've had to temporarily block ClaudeBot as well (then remove the block), as it was leading to degraded performance for visitors; agreed, it sometimes feels DoS-attack-like. Beyond ClaudeBot, the rest seem to be better throttled at the source, which seems to reflect the sentiment on Reddit/X on the topic. A minimal example of such a temporary block is sketched below.
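
    For reference, a temporary block of that kind could look like the following (a minimal sketch, assuming the crawler identifies itself as ClaudeBot and honours robots.txt; crawlers that ignore robots.txt would need to be blocked at the web server or CDN level instead):

    # Temporary block while the crawler is causing excessive load.
    # Remove this group once traffic returns to normal.
    User-agent: ClaudeBot
    Disallow: /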

  • πŸ‡ΊπŸ‡ΈUnited States kevinquillen

    We recently had to block an AI crawler on a site as well, because it was effectively taking the site down.
