Disallow crawling paths under /node by default in robots.txt

Created on 26 January 2025, about 2 months ago

Problem/Motivation

  1. Most sites use Pathauto. By reported installs in January 2025, it runs on roughly 71% of sites reporting Drupal core usage:

    Drupal core: 723,408
    Pathauto:    514,780

    From https://www.drupal.org/project/usage

  2. Getting paths such as /node/100 indexed instead of the human readable URL alias /my-alias is bad for SEO ...

Therefore, it makes sense to disallow all paths under /node from getting crawled by default.

There may be reasons why a site wants to allow paths under /node to be crawled, but such sites are the minority, and they can edit robots.txt to allow this with https://www.drupal.org/project/robotstxt.

Steps to reproduce

See in search engines that paths such as /node/100 are getting indexed, instead of the intended human readable URL alias such as /my-alias, harming SEO.

Proposed resolution

Disallow all paths under /node from getting crawled by default.
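
As a rough illustration of the intended effect (not the actual content of the merge request), the sketch below checks a hypothetical robots.txt rule with Python's standard urllib.robotparser. The rule `Disallow: /node/` and the helper name `is_crawlable` are assumptions made for this example; the paths /node/100 and /my-alias come from the issue summary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set, assumed for illustration only; the real
# merge request may word the rule differently.
ROBOTS_TXT = """\
User-agent: *
Disallow: /node/
"""


def is_crawlable(path, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if `agent` may crawl `path` under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(is_crawlable("/node/100"))  # False: matches the /node/ prefix
print(is_crawlable("/my-alias"))  # True: no rule matches the alias
```

Rules are matched by path prefix, so with a trailing slash the rule covers /node/100 but not /node itself, while `Disallow: /node` without the slash would also match unrelated paths such as /node-archive. Which form fits best is a choice for the merge request.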

Remaining tasks

Update the robots.txt file

User interface changes

none

API changes

none

Data model changes

none

Release notes snippet

TBD

📌 Task
Status

Active

Version

11.0 🔥

Component

base system

Created by

🇩🇰 Denmark ressa Copenhagen

  • Needs backport to D7

    After being applied to the 8.x branch, it should be considered for backport to the 7.x branch. Note: This tag should generally remain even after the backport has been written, approved, and committed.

Merge Requests

Comments & Activities

  • Issue created by @ressa
  • Merge request !11008 "Disallow crawling paths under /node" (Open) created by ressa
  • 🇩🇰 Denmark ressa Copenhagen
  • 🇩🇰 Denmark ressa Copenhagen
  • 🇩🇰 Denmark ressa Copenhagen
  • 🇩🇰 Denmark ressa Copenhagen
  • Pipeline finished with Failed
    about 2 months ago
    Total: 554s
    #406438
  • 🇩🇰 Denmark ressa Copenhagen

    Add Workaround in Issue Summary.

  • Although Google recommends descriptive URLs, there is nothing "wrong" with /node paths.

    Because using Redirect to maintain canonical URLs, that is, <link rel="canonical" />, works, I wonder whether it is ideal not to index /node paths by default.

    These are just my first thoughts after a few minutes' consideration.

  • 🇩🇰 Denmark ressa Copenhagen

    Although Google recommends descriptive URLs, there is nothing "wrong" with /node paths.

    That's totally true, but not the focus here. I outlined the problem in the issue Summary:

    Getting paths such as /node/100 indexed instead of the human readable URL alias /my-alias is bad for SEO ...

    The point is that, in most cases, you want the human readable path indexed in the first place, not the node/NID path -- even if it gets redirected.

    I have updated the Issue Summary to make this point clearer.

  • Also note:

    If you have multiple pages that have the same information, try setting up a redirect from non-preferred URLs to a URL that best represents that information. If you can't redirect, use the rel="canonical" link element instead. But again, don't worry too much about this; search engines can generally figure this out for you on their own most of the time.

    https://developers.google.com/search/docs/fundamentals/seo-starter-guide...

  • 🇩🇰 Denmark ressa Copenhagen

    Sure, and the Redirect module can take care of that, as far as I see, should a /node/100 path get exposed and indexed by mistake ...

    Or do you have another point with sharing that sentence?

    Again, the aim of this MR is to get the correct alias indexed on the first attempt, by blocking /node/100 from getting indexed in the first place.

  • In my experience that article is correct: search engines respect Core's rel="canonical", with or without Redirect. I am trying to understand the downside of having a /node URL indexed to which the author later adds a path alias, which the search engines then accept.

    Does the opposite ever happen?

  • 🇩🇰 Denmark ressa Copenhagen

    I don't know, I created this issue because /node/100 paths got indexed, for some reason.

    But what would be the downside to doing this?

  • I don't know, really. It just seems a strong default not to index /node. It's a bit of a singular case but this very website would be largely un-indexed with that default.

  • Actually that's not completely true because of the /issues auto path.

  • 🇩🇰 Denmark ressa Copenhagen

    I do appreciate getting the tires of the MR kicked, don't get me wrong!

    I just think that in the majority of new installations, you do not want node/100 paths indexed. And if that's true, we should make the preferred behaviour the default, non?

  • 🇺🇸 United States smustgrave

    Sorry if I'm just repeating something. But I'm thinking of a novice site builder (mom and pop shop). If you don't have pathauto and don't manually set the alias then the URL will be node/123. Think we can all agree that's bad practice and standards. But idk how to vote for this one lol. Robots.txt should be part of core maybe?

  • 🇩🇰 Denmark ressa Copenhagen

    Heh, there has been some debate, but I think the gist of it was condensed into the last sentence of comment #16 -- that the majority would benefit from this.

    Also, this change would probably not cause any big problems, but rather a theoretical challenge, for a select few.
