Whitelist the Internet Archive so it can archive Drupal documentation

Created on 15 March 2023, almost 2 years ago
Updated 10 April 2023, over 1 year ago

Documentation location/URL

https://www.drupal.org/docs/develop/standards/php/api-documentation-and-...

It really applies to all the documentation pages.

Problem/Motivation

It can be useful to preserve noteworthy documentation, like the documentation standards used by major open source projects, and the Internet Archive (http://web.archive.org/) aims to archive pages so they can be easily referred to, even if the URL later changes (or even disappears). However, the Drupal servers seem to be set up to block attempts to archive the documentation pages – see for instance http://web.archive.org/web/20230309142152/https://www.drupal.org/docs/de....

Proposed resolution

I'd propose that, if possible, the Internet Archive be whitelisted so it can archive Drupal documentation.

Edited to add: The issue seems to arise because www.drupal.org is blacklisting particular user-agent strings. It appears to blacklist (for instance) Google's crawlers (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and the Internet Archive (Mozilla/5.0 (compatible; archive.org_bot/3.3.0 +https://archive.org/details/archive.org_bot)). However, a completely made-up string like Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html) is accepted.
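
For anyone who wants to reproduce this, below is a minimal sketch using only the Python standard library. The documentation URL is a placeholder for whichever page you want to test, and the expectation that the two crawler strings are rejected reflects the tests described in this issue rather than any documented policy.

    import urllib.error
    import urllib.request

    # Placeholder: substitute any www.drupal.org documentation page you want to test.
    URL = "https://www.drupal.org/docs/develop/standards"

    USER_AGENTS = {
        "made-up bot": "Mozilla/5.0 (compatible; totally-random-program/99.3; "
                       "+http://www.mydomain.com/bot.html)",
        "Googlebot":   "Mozilla/5.0 (compatible; Googlebot/2.1; "
                       "+http://www.google.com/bot.html)",
        "archive.org": "Mozilla/5.0 (compatible; archive.org_bot/3.3.0 "
                       "+https://archive.org/details/archive.org_bot)",
    }

    for label, user_agent in USER_AGENTS.items():
        request = urllib.request.Request(URL, headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(request) as response:
                print(f"{label}: HTTP {response.status}")  # 200 observed for the made-up string
        except urllib.error.HTTPError as error:
            print(f"{label}: HTTP {error.code}")  # 403 observed for the two crawler strings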

📌 Task
Status

Closed: works as designed

Version

3.0

Component

Code

Created by

phlummox

Comments & Activities

  • Issue created by @phlummox
  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    That happens because a specific JavaScript file must be loaded and executed. When the browser does not have JavaScript enabled, or an extension blocks that file, you see that page.
    I take it that this is a measure against spammers. I have seen that page when I do not disable uBlock Origin and visit an administrative page.

    If anything can be done, it would need to be done in the module that loads that JavaScript file.

  • 🇮🇹Italy apaderno Brescia, 🇮🇹
  • Hi Alberto.

    That happens because a specific JavaScript file must be loaded and executed.

    That's not correct. One can easily demonstrate this by retrieving the URL https://www.drupal.org/docs/develop/standards/php/api-documentation-and-... with a tool such as wget or cURL, neither of which executes JavaScript. You will find, in both cases, that the page works perfectly well – the expected content is retrieved, and no 4XX status code is given. (A short sketch of this check appears at the end of this comment.)

    Similarly, if you disable JavaScript in your browser using an extension like uBlock Origin, that will make no difference at all – I have tried this, and the correct page content is still displayed.

    What does seem to make a difference is the "User-Agent" header set by the client.

    Plenty of user-agent strings seem to work with no issue. Running cURL with the user-agent string set to mimic (for example) a Kindle or a PlayStation, or even with the completely made-up string Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html), results in a 200 HTTP status code and successful page retrieval.

    However, the www.drupal.org server for some reason seems to be blacklisting user-agents which match (for instance) Google's crawlers (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and the Internet Archive (Mozilla/5.0 (compatible; archive.org_bot/3.3.0 +https://archive.org/details/archive.org_bot)); if cURL is set to supply those strings in the User-Agent header, drupal.org will respond with a 403 "Forbidden" status code.

    I conclude that the www.drupal.org maintainers have decided to actively blacklist particular user-agents, rather than this issue having anything to do with JavaScript or cookies. This seems like fairly bizarre behaviour to me: the content is freely available under a Creative Commons license, so why deliberately make your site less easy to find on Google and less easy to archive? Nevertheless, if that's the approach they've decided to take, so be it.
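
    A minimal sketch of that no-JavaScript check, using only the Python standard library (the URL is a placeholder and the browser-style user-agent string is just illustrative):

        import urllib.request

        # Placeholder: substitute the documentation page you want to check.
        URL = "https://www.drupal.org/docs/develop/standards"
        # An ordinary desktop-browser user-agent string (illustrative).
        USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

        request = urllib.request.Request(URL, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            body = response.read()
            # This request involves no JavaScript engine and no cookie jar, so a
            # 200 status here means the page itself is served without either.
            print(f"HTTP {response.status}, {len(body)} bytes retrieved")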

  • I currently don't seem to be able to comment, but this diagnosis seems incorrect. I'll attempt to re-post a comment shortly.

  • Regarding the cause of the issue: what does seem to make a difference is the "User-Agent" header set by the client.

    Plenty of user-agent strings seem to work with no issue. Running cURL with the user-agent string set to mimic (for example) a Kindle or a PlayStation, or even with the completely made-up string Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html), results in a 200 HTTP status code and successful page retrieval.

    However, the www.drupal.org server for some reason seems to be blacklisting user-agents which match (for instance) Google's crawlers (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and the Internet Archive (Mozilla/5.0 (compatible; archive.org_bot/3.3.0 +https://archive.org/details/archive.org_bot)); if cURL is set to supply those strings in the User-Agent header, drupal.org will respond with a 403 "Forbidden" status code.

    I conclude that the www.drupal.org maintainers have decided to actively blacklist particular user-agents, rather than this issue having anything to do with JavaScript or cookies. This seems like fairly bizarre behaviour to me: the content is freely available under a Creative Commons license, so why deliberately make your site less easy to find on Google and less easy to archive? Nevertheless, if that's the approach they've decided to take, so be it.

  • As a further comment - I've identified that the issue seems to hinge on what User-Agent string is supplied by the client, but am having trouble posting comments here. I'll update further when I can.

  • It seems the issue arises from the string passed by a client in the User-Agent header. Many strings – including completely made-up ones like Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html) – are accepted by the server and an HTTP 200 response is returned.

    User-agent strings matching (from the tests I've made) Google's and the Internet Archive's crawlers, however, are blocked, and a 403 "Forbidden" HTTP status code is returned.

  • For reference – the issue seems to arise from the "User-Agent" string. Experimenting with cURL, the www.drupal.org server for some reason seems to be set to blacklist user-agents which match (for instance) Google's crawlers and the Internet Archive. If cURL is set to supply those strings in the User-Agent header, drupal.org will respond with a 403 status code.

  • Status changed to Postponed: needs info over 1 year ago
  • 🇺🇸United States drumm NY, US

    We do limited blocking by user agent string, that’s only effective for particularly bad bots. We also block when a user agent string is spoofed. For example, we allow Google crawling; but if you spoof their user agent from an IP that is not known to be used by Google indexing, that’s suspicious and may be blocked.

    The tool we use for this is already set up to explicitly allowlist the Archive.org bot. Most requests that superficially look like the Archive.org bot by their user agent string are indeed allowlisted as a known good bot. I expect something has changed, and the way Archive.org now behaves no longer matches what the tool can detect. I've contacted the vendor providing our abuse mitigation service with some specific examples.

    Do you know if Archive.org publishes an updated list of IPs used? Since user agents are easily spoofable, allowlisting based on the UA alone is not great.
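
    For context on the spoofing check mentioned above, below is a sketch of the reverse-plus-forward DNS verification that Google documents for confirming Googlebot traffic. The exact checks performed by drupal.org's vendor are not described in this issue, and the example IP address is only illustrative.

        import socket

        def looks_like_real_googlebot(ip_address: str) -> bool:
            """Reverse-resolve the IP, check the hostname, then confirm with a forward lookup."""
            try:
                hostname, _, _ = socket.gethostbyaddr(ip_address)  # reverse DNS
            except socket.herror:
                return False
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            try:
                resolved = socket.gethostbyname(hostname)  # confirming forward lookup
            except socket.gaierror:
                return False
            return resolved == ip_address

        # Illustrative only: a request claiming to be Googlebot but arriving from an
        # unrelated address (203.0.113.10 is in a reserved documentation range) fails.
        print(looks_like_real_googlebot("203.0.113.10"))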

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    From their site, I gather that pages with JavaScript elements are often hard to archive; if the JavaScript code then needs to contact the originating server in order to work, it will fail when archived.
    Is the code that shows the page I mentioned in comment #3, or that prevents that page from being shown, always loaded?

  • 🇺🇸United States drumm NY, US

    Since the Archive.org bot is allowlisted, it does not need to be aware of JavaScript, assuming the vendor’s tool is properly detecting the requests from the Archive.org bot.

    Logged in confirmed users should also be allowlisted, but that would be a separate issue, as the detection for those requests is very different.

  • 🇺🇸United States drumm NY, US

    I made another request to “Save page now” on Archive.org to collect more-detailed log messages. The Internet Archive is actually using my browser’s user agent as it makes the request from an IP address they own. So allowlisting their documented user agent would be completely ineffective since they are forging it to pretend to be my browser.

  • Status changed to Closed: works as designed over 1 year ago
  • 🇺🇸United States drumm NY, US

    Our vendor will not be adjusting their Archive.org detection, since the Archive.org bot is not following its documented behavior. I recommend contacting Archive.org to see whether their “Save page now” functionality can use their documented user agent instead of forging it, or whether they can update their documentation.
