Whitelist the Internet Archive so it can archive Drupal documentation

Created on 15 March 2023, almost 2 years ago
Updated 10 April 2023, over 1 year ago

Documentation location/URL

https://www.drupal.org/docs/develop/standards/php/api-documentation-and-...

It really applies to all the documentation pages.

Problem/Motivation

It can be useful to preserve noteworthy documentation, like the documentation standards used by major open source projects, and the Internet Archive (http://web.archive.org/) aims to archive pages so they can be easily referred to, even if the URL later changes (or even disappears). However, the Drupal servers seem to be set up to block attempts to archive the documentation pages – see for instance http://web.archive.org/web/20230309142152/https://www.drupal.org/docs/de....

Proposed resolution

I'd propose that, if possible, the Internet Archive be whitelisted so it can archive Drupal documentation.

Edited to add: The issue seems to arise because www.drupal.org is blacklisting particular user-agent strings. It appears to blacklist (for instance) Google's crawlers (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and the Internet Archive (Mozilla/5.0 (compatible; archive.org_bot/3.3.0 +https://archive.org/details/archive.org_bot)). However, a completely made-up string like Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html) is accepted.
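
For anyone who wants to reproduce this, below is a minimal sketch using only the Python standard library. The documentation URL is a placeholder for whichever page you want to test, and the expectation that the two crawler strings are rejected reflects the tests described in this issue rather than any documented policy.

    import urllib.error
    import urllib.request

    # Placeholder: substitute any www.drupal.org documentation page you want to test.
    URL = "https://www.drupal.org/docs/develop/standards"

    USER_AGENTS = {
        "made-up bot": "Mozilla/5.0 (compatible; totally-random-program/99.3; "
                       "+http://www.mydomain.com/bot.html)",
        "Googlebot":   "Mozilla/5.0 (compatible; Googlebot/2.1; "
                       "+http://www.google.com/bot.html)",
        "archive.org": "Mozilla/5.0 (compatible; archive.org_bot/3.3.0 "
                       "+https://archive.org/details/archive.org_bot)",
    }

    for label, user_agent in USER_AGENTS.items():
        request = urllib.request.Request(URL, headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(request) as response:
                print(f"{label}: HTTP {response.status}")  # 200 observed for the made-up string
        except urllib.error.HTTPError as error:
            print(f"{label}: HTTP {error.code}")  # 403 observed for the two crawler strings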

📌 Task
Status

Closed: works as designed

Version

3.0

Component

Code

Created by

phlummox

Comments & Activities

  • Issue created by @phlummox
  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    That happens because a specific JavaScript file must be loaded and executed. When the browser does not have JavaScript enabled, or an extension blocks that file, you see that page.
    I take it that this is a measure against spammers. I have seen that page when I do not disable uBlock Origin and visit an administrative page.

    If anything can be done, it would need to be done in the module that loads that JavaScript file.

  • 🇮🇹Italy apaderno Brescia, 🇮🇹
  • Hi Alberto.

    That happens because a specific JavaScript file must be loaded and executed.

    That's not correct. One can easily demonstrate this by retrieving the URL https://www.drupal.org/docs/develop/standards/php/api-documentation-and-... with a tool such as wget or cURL, neither of which executes JavaScript. You will find, in both cases, that the page works perfectly well – the expected content is retrieved, and no 4XX status code is given. (A short sketch of this check appears at the end of this comment.)

    Similarly, if you disable JavaScript in your browser using an extension like uBlock Origin, that will make no difference at all – I have tried this, and the correct page content is still displayed.

    What does seem to make a difference is the "User-Agent" header set by the client.

    Plenty of user-agent strings seem to work with no issue. Running cURL with the user-agent string set to mimic (for example) a Kindle or a PlayStation, or even with the completely made-up string Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html), results in a 200 HTTP status code and successful page retrieval.

    However, the www.drupal.org server for some reason seems to be blacklisting user-agents which match (for instance) Google's crawlers (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and the Internet Archive (Mozilla/5.0 (compatible; archive.org_bot/3.3.0 +https://archive.org/details/archive.org_bot)); if cURL is set to supply those strings in the User-Agent header, drupal.org will respond with a 403 "Forbidden" status code.

    I conclude that the www.drupal.org maintainers have decided to actively blacklist particular user-agents, rather than this issue having anything to do with JavaScript or cookies. This seems like fairly bizarre behaviour to me: the content is freely available under a Creative Commons license, so why deliberately make your site less easy to find on Google and less easy to archive? Nevertheless, if that's the approach they've decided to take, so be it.
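
    A minimal sketch of that no-JavaScript check, using only the Python standard library (the URL is a placeholder and the browser-style user-agent string is just illustrative):

        import urllib.request

        # Placeholder: substitute the documentation page you want to check.
        URL = "https://www.drupal.org/docs/develop/standards"
        # An ordinary desktop-browser user-agent string (illustrative).
        USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

        request = urllib.request.Request(URL, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            body = response.read()
            # This request involves no JavaScript engine and no cookie jar, so a
            # 200 status here means the page itself is served without either.
            print(f"HTTP {response.status}, {len(body)} bytes retrieved")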

  • I currently don't seem to be able to comment, but this diagnosis seems incorrect. I'll attempt to re-post a comment shortly.

  • Regarding the cause of the issue: what does seem to make a difference is the "User-Agent" header set by the client.

    Plenty of user-agent strings seem to work with no issue. Running cURL with the user-agent string set to mimic (for example) a Kindle or a PlayStation, or even with the completely made-up string Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html), results in a 200 HTTP status code and successful page retrieval.

    However, the www.drupal.org server for some reason seems to be blacklisting user-agents which match (for instance) Google's crawlers (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and the Internet Archive (Mozilla/5.0 (compatible; archive.org_bot/3.3.0 +https://archive.org/details/archive.org_bot)); if cURL is set to supply those strings in the User-Agent header, drupal.org will respond with a 403 "Forbidden" status code.

    I conclude that the www.drupal.org maintainers have decided to actively blacklist particular user-agents, rather than this issue having anything to do with JavaScript or cookies. This seems like fairly bizarre behaviour to me: the content is freely available under a Creative Commons license, so why deliberately make your site less easy to find on Google and less easy to archive? Nevertheless, if that's the approach they've decided to take, so be it.

  • As a further comment - I've identified that the issue seems to hinge on what User-Agent string is supplied by the client, but am having trouble posting comments here. I'll update further when I can.

  • It seems the issue arises from the string passed by a client in the User-Agent header. Many strings – including completely made-up ones like Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html) – are accepted by the server and an HTTP 200 response is returned.

    User-agent strings matching (from the tests I've made) Google's and the Internet Archive's crawlers, however, are blocked, and a 403 "Forbidden" HTTP status code is returned.

  • For reference – the issue seems to arise from the "User-Agent" string. Experimenting with cURL, the www.drupal.org server for some reason seems to be set to blacklist user-agents which match (for instance) Google's crawlers and the Internet Archive. If cURL is set to supply those strings in the User-Agent header, drupal.org will respond with a 403 status code.

  • Status changed to Postponed: needs info over 1 year ago
  • 🇺🇸United States drumm NY, US

    We do limited blocking by user agent string, that’s only effective for particularly bad bots. We also block when a user agent string is spoofed. For example, we allow Google crawling; but if you spoof their user agent from an IP that is not known to be used by Google indexing, that’s suspicious and may be blocked.

    The tool we use for this is already set up to explicitly allowlist the Archive.org bot. Most requests that superficially look like the Archive.org bot by their user agent string are indeed allowlisted as a known good bot. I expect something has changed, and the way Archive.org now behaves no longer matches what the tool can detect. I've contacted the vendor providing our abuse mitigation service with some specific examples.

    Do you know if Archive.org publishes an updated list of IPs used? Since user agents are easily spoofable, allowlisting based on the UA alone is not great.
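
    For context on the spoofing check mentioned above, below is a sketch of the reverse-plus-forward DNS verification that Google documents for confirming Googlebot traffic. The exact checks performed by drupal.org's vendor are not described in this issue, and the example IP address is only illustrative.

        import socket

        def looks_like_real_googlebot(ip_address: str) -> bool:
            """Reverse-resolve the IP, check the hostname, then confirm with a forward lookup."""
            try:
                hostname, _, _ = socket.gethostbyaddr(ip_address)  # reverse DNS
            except socket.herror:
                return False
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            try:
                resolved = socket.gethostbyname(hostname)  # confirming forward lookup
            except socket.gaierror:
                return False
            return resolved == ip_address

        # Illustrative only: a request claiming to be Googlebot but arriving from an
        # unrelated address (203.0.113.10 is in a reserved documentation range) fails.
        print(looks_like_real_googlebot("203.0.113.10"))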

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    From their site, I gather that pages with JavaScript elements are often hard to archive; if the JavaScript code then needs to contact the originating server in order to work, it will fail when archived.
    Is the code that shows the page I mentioned in comment #3, or that prevents that page from being shown, always loaded?

  • 🇺🇸United States drumm NY, US

    Since the Archive.org bot is allowlisted, it does not need to be aware of JavaScript, assuming the vendor’s tool is properly detecting the requests from the Archive.org bot.

    Logged in confirmed users should also be allowlisted, but that would be a separate issue, as the detection for those requests is very different.

  • 🇺🇸United States drumm NY, US

    I made another request to “Save page now” on Archive.org to collect more-detailed log messages. The Internet Archive is actually using my browser’s user agent as it makes the request from an IP address they own. So allowlisting their documented user agent would be completely ineffective since they are forging it to pretend to be my browser.

  • Status changed to Closed: works as designed over 1 year ago
  • 🇺🇸United States drumm NY, US

    Our vendor will not be adjusting their Archive.org detection, since the Archive.org bot is not following its documented behavior. I recommend contacting Archive.org to see whether their “Save page now” functionality can use their documented user agent instead of forging it, or whether they can update their documentation.
