Despite robots.txt, excluded files/folders are indexed

Created on 1 November 2023, about 1 year ago

Problem/Motivation

I recently noticed that certain files and folders have been indexed on Google and also on Bing. I use the standard robots.txt from Drupal, which I have only extended with a sitemap.xml entry for the production environment.

In detail, Drupal's robots.txt is not complete, which can be checked with Google's robots.txt validator (Webmaster Tools). Drupal does not require a trailing slash (/) at the end of a URL. If a trailing slash is included, Drupal simply removes it, which causes a redirect.

The current Drupal robots.txt simply lacks rules stating that files and folders should be ignored both with and without a trailing slash. If both variants are not listed, the unlisted one stays open to crawlers.

Here is an example: https://domain.TLD/admin/ is blocked by definition in robots.txt. But Drupal removes the trailing slash, giving https://domain.TLD/admin. This URL has no matching rule in robots.txt and is therefore indexed by search engines.
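The mismatch in the example can be reproduced with a small sketch. It uses Python's standard-library `urllib.robotparser`, which applies plain prefix matching and only approximates what Google or Bing actually do. The helper `is_allowed` and the two rule sets `SLASH_ONLY` and `BOTH_VARIANTS` are hypothetical names made up for this illustration. The second rule set is one possible extension, not the robots.txt that Drupal ships.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# Rule as described in the report: only the variant with a trailing slash.
SLASH_ONLY = """\
User-agent: *
Disallow: /admin/
"""

# Assumed extension (illustration only): list both variants explicitly.
BOTH_VARIANTS = """\
User-agent: *
Disallow: /admin
Disallow: /admin/
"""

if __name__ == "__main__":
    for url in ("https://domain.TLD/admin/", "https://domain.TLD/admin"):
        print(
            url,
            "slash-only allowed:", is_allowed(SLASH_ONLY, url),
            "both-variants allowed:", is_allowed(BOTH_VARIANTS, url),
        )
```

Under this parser, `/admin` stays fetchable when only `/admin/` is listed. Note that with prefix matching a rule like `Disallow: /admin` also covers every other path beginning with `/admin`, so the exact rules should be checked against the site's real paths.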

Most site owners surely adjust robots.txt to their own needs, but I think the default robots.txt shipped with Drupal should already have the correct structure to avoid the problems described above, especially since the extension is quick to make.

Furthermore, relevant files such as CHANGELOG.txt, INSTALL.txt, MAINTAINERS.txt, UPDATE.txt, and USAGE.txt should be excluded so that they do not end up in the index unnecessarily.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

📌 Task
Status

Active

Version

10.1 ✨

Component
Base

Last updated about 5 hours ago

Created by

🇩🇪 Germany zcht


Comments & Activities
