Problem/Motivation
Google and other search engines can index the "Printer friendly" HTML export of book pages, with URLs like /book/export/html/65. My feeling is that this export is not really meant to be discoverable by search engines; rather, it is a service to visitors who would like to access that format.
Steps to reproduce
By using Google Search Console, you often find that Google is discovering and indexing these /book/export/html/* pages.
Even if you add Disallow: /book/export/html/ to your robots.txt, Google and other search engines will still discover the "Printer friendly" pages, because they are linked from their less printer-friendly counterparts; robots.txt cannot prevent linked pages from being discovered. In Google Search Console, these pages may be listed as "Indexed, though blocked by robots.txt".
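For reference, such a rule would sit in a robots.txt stanza like the following (the User-agent line is illustrative):

    User-agent: *
    Disallow: /book/export/html/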
Via Google Search Central:
Warning: Don't use a robots.txt file as a means to hide your web pages (including PDFs and other text-based formats supported by Google) from Google search results.
If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.
Proposed resolution
Add <meta name="robots" content="noindex"> to the <head></head> of /templates/book-export-html.html.twig.
Individual sites can make this improvement by overriding book-export-html.html.twig in their own theme, of course, but it makes a lot of sense to me to provide this markup by default.
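As a sketch, the change could look like the following; everything except the added meta tag is illustrative and may not match the actual template's markup or variables:

    {# Sketch of templates/book-export-html.html.twig. Variables and
       surrounding markup are illustrative, not the module's actual code. #}
    <!DOCTYPE html>
    <html>
      <head>
        <title>{{ title }}</title>
        {# Keep the printer-friendly export out of search engine indexes. #}
        <meta name="robots" content="noindex">
      </head>
      <body>
        {{ contents }}
      </body>
    </html>

Note that for the noindex directive to take effect, the export pages must remain crawlable: a robots.txt Disallow on the same path would stop search engines from ever reading the tag, which is exactly the trap described in the Google Search Central warning quoted above.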
Remaining tasks
User interface changes
API changes
Data model changes