Instruct search engines to ignore HTML export with noindex directive

Created on 16 October 2024

Problem/Motivation

Google and other search engines can index the "Printer friendly" HTML export of book pages, at URLs like /book/export/html/65. My feeling is that the export is not really meant to be discoverable by search engines; rather, it's a service to visitors who'd like to access that format.

Steps to reproduce

Using Google Search Console, you will often find that Google is discovering and indexing these /book/export/html/* pages.

Even if you add Disallow: /book/export/html/ to your robots.txt, Google and other search engines will still discover the "Printer friendly" pages, because they are linked from their less printer-friendly counterparts; robots.txt blocks crawling, not discovery via links. In Google Search Console, these pages may be listed as "Indexed, though blocked by robots.txt".
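For reference, the robots.txt rule described above would look like this (note that it only prevents crawling, not indexing of URLs Google has already discovered through links):

```
User-agent: *
Disallow: /book/export/html/
```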

Via Google Search Central:

Warning: Don't use a robots.txt file as a means to hide your web pages (including PDFs and other text-based formats supported by Google) from Google search results.

If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.

Proposed resolution

Add <meta name="robots" content="noindex"> to the <head></head> of /templates/book-export-html.html.twig.

Individual sites can make this improvement by modifying book-export-html.html.twig in their own theme, of course, but it makes a lot of sense to me to provide this markup by default.
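As a rough sketch of the proposed change, assuming the stock structure of book-export-html.html.twig (the exact variables such as {{ title }} and {{ contents }} may differ between module versions), the template's head would gain one line:

```twig
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>{{ title }}</title>
    {# Ask search engines not to index the printer-friendly export. #}
    <meta name="robots" content="noindex">
  </head>
  <body>
    {{ contents }}
  </body>
</html>
```

A site that wants this behavior today can copy the module's template into its own theme and add the same meta tag there; the proposal is simply to ship it by default.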

Remaining tasks

User interface changes

API changes

Data model changes

✨ Feature request
Status

Active

Version

2.0

Component

Code

Created by

πŸ‡ΊπŸ‡ΈUnited States jenna.tollerson Atlanta, Georgia, USA
