Problem/Motivation
Trivia: the Great Wall of China was built to protect Chinese farmers and their crops from Mongolian raiders. That metaphor captures the scraping problem faced by Drupal CMS users: webmasters, individuals, and corporations alike.
On June 26, 2024, one website served 9.5K pages in 120 minutes to just two sources, one in the US and one in Germany. That traffic pattern is clearly data scrapers, not human readers.
Each page takes about 260 minutes to write collaboratively, excluding moderation and editing. At that rate, 9,500 pages represent roughly 2,470,000 minutes, or about 41,000 hours, of work scraped in just 2 hours.
This is convenient for scrapers but extremely detrimental to content creators. Website reputation is built one page at a time and remains relevant in search results for only a short period.
To rank in the top 10 search results, we employ meta tags, Schema.org markup, search-friendly URL paths, tokens, and more. A well-crafted content page then waits for crawling and validation, which can take days or even months in search systems, yet scrapers often copy and update it before that process completes.
I've observed that glossary and archive views are prime targets for scrapers checking for updates, and XML sitemaps reveal everything, despite all our efforts to optimize them for search engines.
Proposed resolution
Drupal is a powerful content management system, but with the rise of AI there is a growing need to protect content from bots and automated scripts that scrape and update data. To address this, Drupal should ship with mechanisms that slow down and limit page views from a single IP address, for example by rate limiting requests or triggering a CAPTCHA challenge.
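As a rough illustration of the kind of mechanism proposed here, the sketch below throttles page views per client IP using the flood service that already ships with Drupal core. The module name (scrape_limit), the thresholds (100 page views per IP per 10 minutes), and the plain 429 response are assumptions for the example; a real implementation could instead hand the request off to a CAPTCHA challenge.

```php
<?php

namespace Drupal\scrape_limit\EventSubscriber;

use Drupal\Core\Flood\FloodInterface;
use Symfony\Component\EventDispatcher\EventSubscriberInterface;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\HttpKernel\Event\RequestEvent;
use Symfony\Component\HttpKernel\KernelEvents;

/**
 * Throttles page views per client IP using core's flood service.
 *
 * Hypothetical sketch for a "scrape_limit" module; limits are examples only.
 */
class ScrapeLimitSubscriber implements EventSubscriberInterface {

  public function __construct(protected FloodInterface $flood) {}

  public static function getSubscribedEvents(): array {
    // Run early on every incoming request.
    return [KernelEvents::REQUEST => ['onRequest', 100]];
  }

  public function onRequest(RequestEvent $event): void {
    if (!$event->isMainRequest()) {
      return;
    }
    $ip = $event->getRequest()->getClientIp();
    // Example limit: 100 page views per IP within a 600-second window.
    if (!$this->flood->isAllowed('scrape_limit.page_view', 100, 600, $ip)) {
      // A real module could redirect to a CAPTCHA challenge here instead.
      $event->setResponse(new Response('Too many requests.', 429));
      return;
    }
    // Record this request against the IP's flood counter.
    $this->flood->register('scrape_limit.page_view', 600, $ip);
  }

}
```

The subscriber would be registered in the module's *.services.yml with the event_subscriber tag, and the thresholds could be exposed as site configuration so each site can tune them.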
Websites are built by humans for humans, requiring significant time, creativity, and expertise to create valuable resources. The goal is not to enhance AI intelligence but to foster communities and assist people. Therefore, integrating robust bot mitigation strategies into Drupal's core will attract more serious users to the platform, ensuring the content is protected and the community thrives.
Thank you for your time.
Jo.