- Issue created by @pfrenssen
- Status changed to Needs review
over 1 year ago 4:16pm 25 September 2023
- last update
over 1 year ago 30,208 pass
- Status changed to RTBC
over 1 year ago 4:23pm 25 September 2023
- Status changed to Needs work
over 1 year ago 11:44pm 25 September 2023
- 🇺🇸 United States xjm
Good find, thanks!
That said, there are also still a lot of README.txt files in various places. Search with:
[ayrton:maintainer | Mon 18:41:50] $ find ./ -name README.txt | grep -v "vendor" | grep -v "node_modules"
So, I think we need to add the README.md entry, but not remove the README.txt one.
We can have a followup to consider converting other README.txt files to .md.
Thanks!
Does /README.txt match all readme files? I think it matches only the one in the document root.
- Status changed to Needs review
over 1 year ago 5:47pm 26 September 2023
- 🇺🇸 United States xjm
Whoops, you may be right. If so, among all the various READMEs, some are excluded by other rules (like the /core/ rule). The files that are still allowed to be indexed are:
.//modules/README.txt
.//themes/README.txt
.//composer/Metapackage/README.txt
.//composer/Plugin/VendorHardening/README.txt
.//composer/Template/README.txt
.//sites/README.txt
I tried reading http://www.robotstxt.org/robotstxt.html but it isn't really clear one way or another on this point. Not really clear from Google either.
Maybe someone can just quickly test it to confirm?
I tested those with the official Google Webmaster Tools robots.txt tool.
Those files are not blocked by Drupal's current robots.txt file.
Should we re-scope this one?
- 🇧🇬 Bulgaria pfrenssen Sofia
I found a better technical description of the robots.txt format in the W3C HTML 4 specification: https://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1
It is extremely limited. No wildcards or relative paths are supported. This is basically the full spec:
The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
So we should probably just maintain a list of the README files that core ships with.
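The prefix rule quoted above can be checked with a few lines of Python. This is only a minimal sketch of that rule, not how any particular crawler is implemented: `is_disallowed` is a hypothetical helper, and the rule list is an assumed two-entry subset chosen for illustration, not the full robots.txt that core ships. The paths are taken from the list earlier in this thread.

```python
def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True if path starts with any Disallow value (plain prefix match)."""
    # An empty Disallow value means "nothing is disallowed", so skip it.
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


# Assumed subset of rules, for illustration only.
rules = ["/README.txt", "/core/"]

paths = [
    "/README.txt",
    "/core/README.txt",
    "/modules/README.txt",
    "/themes/README.txt",
    "/sites/README.txt",
]

blocked = {path: is_disallowed(path, rules) for path in paths}
for path, is_blocked in blocked.items():
    print(f"{path}: {'blocked' if is_blocked else 'still indexable'}")

# The example from the quoted specification text:
help_prefix = is_disallowed("/help.html", ["/help"])      # prefix "/help" covers it
help_dir = is_disallowed("/help.html", ["/help/"])        # prefix "/help/" does not
```

Under this reading, a rule of `/README.txt` only covers the file in the document root, which is why an explicit list of README paths is needed.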
- last update
over 1 year ago 30,296 pass, 3 fail
- 🇸🇰 Slovakia poker10
I think the list looks good, thanks!
I reviewed the files and there can be one more README.txt generated in the sites/default/files/config_ZZZ/sync directory (see SiteSettingsForm::createRandomConfigDirectory()). But I am not sure whether we can also target this file with something like this: sites/default/files/*/sync/README.txt.
- The last submitted patch, 9: 3389611-9.patch, failed testing.
- last update
over 1 year ago 30,363 pass
- 🇧🇬 Bulgaria pfrenssen Sofia
@poker10, unfortunately we can't target dynamic paths in robots.txt and we wouldn't want to disclose the location of the config folder.
This patch also updates the copy of robots.txt intended for scaffolding.
- 🇸🇰 Slovakia poker10
If no wildcards are supported, how is this existing robots.txt entry supposed to work?
Disallow: /*/media/oembed
I proposed the suggestion with the config directory based on this existing entry. It was added by: #3271222: Include Disallow Oembed media links in the robots.txt file for better Drupal SEO
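For context on why such an entry can work in practice: the original format has no wildcards, but some crawlers treat `*` as "any sequence of characters" and a trailing `$` as "end of URL". Google documents this behaviour and RFC 9309 describes it as well. Below is a rough sketch of that interpretation, not any crawler's actual code. `wildcard_rule_matches` is a hypothetical helper, and the `/en/...` URL and the leading slash on the config rule are assumptions made for illustration.

```python
import re


def wildcard_rule_matches(rule: str, path: str) -> bool:
    """Match path against a rule where '*' is a wildcard and a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start of the path, which keeps prefix semantics.
    return re.match(pattern, path) is not None


oembed_rule = "/*/media/oembed"
url = "/en/media/oembed?url=x"  # invented example URL

wildcard_result = wildcard_rule_matches(oembed_rule, url)  # wildcard-aware reading
prefix_result = url.startswith(oembed_rule)                # strict prefix reading
print(wildcard_result, prefix_result)

# The config directory suggestion, read the same way (assumed leading slash):
config_rule = "/sites/default/files/*/sync/README.txt"
config_result = wildcard_rule_matches(
    config_rule, "/sites/default/files/config_ZZZ/sync/README.txt"
)
print(config_result)
```

A crawler that only implements strict prefix matching would ignore both rules, so entries like these are a best-effort measure for crawlers that support the extension.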
- 🇺🇸 United States xjm
Huh, very interesting.
That said, I think @pfrenssen's approach is workable. No need to try "supported in practice" wildcards when the list is a fairly short and consistent one.
- Status changed to RTBC
over 1 year ago 5:21pm 2 October 2023
- 🇸🇰 Slovakia poker10
So if there is no concern about leaving the README in the config directory (sites/default/files/*/sync/README.txt) unblocked, then I think this could be RTBC.
- last update
over 1 year ago 30,362 pass
- 🇺🇸 United States xjm
@poker10 Maybe we could add a followup about that case? If the site is using private files, access should already be denied, but a site might not be using them. There's a fairly strongly worded .htaccess in the config sync directory also. So I think any additional rule would be a hardening. It could be discussed in a separate issue.
So we need two followups:
- One to discuss converting other READMEs to Markdown.
- One to discuss whether we should add additional defense-in-depth around the config sync README.
Thanks everyone!
- 🇸🇰 Slovakia poker10
@xjm Yes, that sounds good, thanks.
I have created the follow-ups:
- Change all README.txt files to README.md (Needs work)
- Disallow the config sync directory README.txt by the robots.txt (Active)
I have also discovered that we still have references to the old README.txt in the root directory in some places (after it was renamed), so I created one more issue as well: Change references to README.txt in root directory (RTBC)
- Status changed to Fixed
over 1 year ago 2:33pm 5 October 2023
- alexpott committed 505a6460 on 11.x
Issue #3389611 by pfrenssen, cilefen, xjm, poker10: Update robots.txt...
- Automatically closed - issue fixed for 2 weeks with no activity.
- Status changed to Fixed
about 1 year ago 7:31am 1 December 2023
- 🇳🇿 New Zealand quietone
Manual testing was completed in #7. Therefore removing the tag.