For reference – the issue seems to arise from the "User-Agent" string. Experimenting with cURL suggests that the www.drupal.org server is, for some reason, configured to blacklist user-agents which match (for instance) Google's crawlers and the Internet Archive: if cURL is set to supply those strings in the User-Agent header, drupal.org responds with a 403 status code.
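A minimal reproduction with cURL looks something like this (I'm using the site's front page as a stand-in for the affected documentation URL; the flags simply print the HTTP status code and discard the response body):

curl -sS -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "https://www.drupal.org/"

With that User-Agent the command printed 403 for me, while the same command without the -A option (so cURL sends its default user-agent) printed 200. The exact behaviour may of course change if the server configuration is updated.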
As a further comment - I've identified that the issue seems to hinge on what User-Agent string is supplied by the client, but am having trouble posting comments here. I'll update further when I can.
I currently don't seem to be able to comment, but this diagnosis seems incorrect. I'll attempt to re-post shortly.
Hi Alberto.
"That happens because a specific JavaScript file must be loaded and executed."
That's not correct. One can easily demonstrate this by retrieving the URL https://www.drupal.org/docs/develop/standards/php/api-documentation-and-... with a tool such as wget or cURL, neither of which executes JavaScript. You will find, in both cases, that the page works perfectly well: the expected content is retrieved, and no 4XX status code is returned.
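For example, either of the following commands retrieves the page and reports a 200 status, with no JavaScript executed at any point (substitute the documentation URL above; I'm only using the front page as a placeholder here):

wget -S -O /dev/null "https://www.drupal.org/"
curl -sS -o /dev/null -w "%{http_code}\n" "https://www.drupal.org/"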
Similarly, if you disable JavaScript in your browser using an extension like uBlock Origin, that will make no difference at all – I have tried this, and the correct page content is still displayed.
What does seem to make a difference is the "User-Agent" header set by the client.
Plenty of user-agent strings seem to work with no issue: running cURL with the user-agent string set to mimic (for example) a Kindle or a PlayStation, or even set to the completely made-up string Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html), results in a 200 HTTP status code and successful page retrieval.
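As a concrete illustration, using that made-up string (and the same placeholder URL as above):

curl -sS -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0 (compatible; totally-random-program/99.3; +http://www.mydomain.com/bot.html)" "https://www.drupal.org/"

printed 200 in my tests.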
However, the www.drupal.org server for some reason seems to be blacklisting user-agents which match (for instance) Google's crawlers (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) and the Internet Archive (Mozilla/5.0 (compatible; archive.org_bot/3.3.0 +https://archive.org/details/archive.org_bot)); if cURL is set to supply those strings in the User-Agent header, drupal.org will respond with a 403 "Forbidden" status code.
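For instance, using the Internet Archive string above (same placeholder URL as before):

curl -sS -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0 (compatible; archive.org_bot/3.3.0 +https://archive.org/details/archive.org_bot)" "https://www.drupal.org/"

printed 403 for me, and swapping in the Googlebot string gave the same result.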
I conclude that the www.drupal.org maintainers have decided to actively blacklist particular user-agents, and that the issue has nothing to do with JavaScript or cookies. This seems like fairly bizarre behaviour to me: the content is freely available under a Creative Commons license, so why deliberately make your site harder to find on Google and harder to archive? Nevertheless, if that's the approach they've decided to take, so be it.