Strengthening Drupal: Protecting Content from AI Scrapers and Bots

Created on 27 June 2024, about 1 year ago

Problem/Motivation

Trivia: The 'Great Wall of China' was built to protect Chinese farmers' crops from Mongolian raiders. This metaphorically captures the scraping issue faced by Drupal CMS users (webmasters, individuals, and corporations).

On June 26, 2024, one website served 9.5K pages in 120 minutes from two sources: the US and Germany. These are clearly data scrapers, not human readers.

Each page takes about 260 minutes to write collaboratively, excluding moderation and editing. At that rate, roughly 41,000 hours (2,470,000 minutes) of work was scraped in just 2 hours.
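For anyone who wants to check the figures, here is the arithmetic as a short Python snippet. It uses only the numbers quoted above; the variable names are just for illustration.

```python
# Figures quoted in the issue text above.
pages_served = 9_500        # "9.5K pages"
window_minutes = 120        # served within 120 minutes
minutes_per_page = 260      # approximate writing time per page

total_minutes = pages_served * minutes_per_page
total_hours = total_minutes / 60
pages_per_minute = pages_served / window_minutes

print(f"{total_minutes:,} minutes of writing")       # 2,470,000
print(f"{total_hours:,.0f} hours of writing")        # 41,167
print(f"{pages_per_minute:.1f} pages per minute")    # 79.2
```

Roughly 79 page views per minute, sustained for two hours, is well beyond what a human reader would generate.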

This is convenient for scrapers but extremely detrimental to content creators. Website reputation is built one page at a time and remains relevant in search results for only a short period.

To rank in the top 10 search results, we employ metatags, Schema.org, URL-friendly paths, tokens, and more. A well-crafted content page waits for a crawl and validation, which can take days or even months in search systems, but it is often scraped and updated before that process completes.

I've observed that glossary and archive views are prime targets for updates, but XML sitemaps reveal everything despite our efforts to optimize them for search engines.

Proposed resolution

Drupal is a powerful content management system, but with the rise of AI, there's an increasing need to protect content from bots and automated scripts that scrape and update data. To address this, Drupal must incorporate mechanisms to slow down and limit page views from a single IP address, such as by triggering a captcha.
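To make the idea concrete, here is a minimal sketch of per-IP throttling with a sliding window. It is written in Python purely for illustration and is not Drupal code. The class name `SlidingWindowLimiter`, the `check` method, the thresholds, and the "challenge" result (standing in for "show a captcha") are all hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Count recent page views per client IP and flag clients over a limit.

    Hypothetical illustration only; none of these names exist in Drupal.
    """

    def __init__(self, max_views=60, window_seconds=60.0, clock=time.monotonic):
        self.max_views = max_views            # allowed views per window (assumed value)
        self.window_seconds = window_seconds  # window length (assumed value)
        self.clock = clock                    # injectable clock, useful for testing
        self._views = defaultdict(deque)      # ip -> timestamps of allowed views

    def check(self, ip):
        """Register one page view and return "allow" or "challenge"."""
        now = self.clock()
        views = self._views[ip]
        # Forget views that have fallen out of the window.
        while views and now - views[0] >= self.window_seconds:
            views.popleft()
        if len(views) >= self.max_views:
            # Over the limit: the site would show a captcha instead of the page.
            return "challenge"
        views.append(now)
        return "allow"


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_views=3, window_seconds=10.0)
    for _ in range(4):
        print(limiter.check("203.0.113.7"))  # the fourth view is challenged
```

In this sketch, challenged requests are not counted toward the window, so a client is let through again once its earlier views age out. A real implementation would also need shared storage across web servers and a reliable way to determine the client IP behind proxies or a CDN.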

Websites are built by humans for humans, requiring significant time, creativity, and expertise to create valuable resources. The goal is not to enhance AI intelligence but to foster communities and assist people. Therefore, integrating robust bot mitigation strategies into Drupal's core will attract more serious users to the platform, ensuring the content is protected and the community thrives.

Thank you for your time.

Jo.

✨ Feature request

Status

Active

Version

11.0 🔥

Component

Other

Last updated 5 minutes ago

Created by

JoAMoS


Comments & Activities

Issue created by @JoAMoS
Comment about 1 year ago
cilefen
I am moving this to the ideas queue because not everyone is in agreement on this matter and because implementing this is significantly more technically difficult than the related issue "Disallow AI bots by default in robots.txt" (status: Active). It may be the case that using CDNs is intrinsically more effective at this.
Comment 4 months ago
🇳🇿 New Zealand quietone
The Ideas project is being deprecated. This issue is moved to the Drupal project. Check that the selected component is correct. Also, add the relevant tags, especially any 'needs manager review' tags.
