[10.2 regression] RSS feeds invalid due to  

Created on 18 December 2023, 11 months ago
Updated 6 March 2024, 8 months ago

Problem/Motivation

Since upgrading to Drupal 10.2, my RSS feeds have been failing to validate due to   non-breaking space tokens.

Validation is failing on: https://validator.w3.org/feed/check.cgi with the following error:

This feed does not validate.

    line 13, column 57: XML parsing error: <unknown>:13:57: undefined entity [help]

        &lt;span&gt;Drupal Advent Calendar day 18 - The Mentoring&nbsp;Initiative&lt ...
                                                                 ^

I use the Typogrify module, which inserts &nbsp; in certain conditions, which makes this problem worse, but there are lots of reasons it might be necessary to have non-breaking spaces in content.

I have looked at the list of changes in 10.2, but haven't spotted any that obviously point to a cause for this.

Steps to reproduce

On both 10.1 and 10.2 test sites, I created a node with "Hello", a blank line, and "Goodbye" using CKEditor, and published the article on both.

I could not test local sites with the above validator, but when I open the 10.1 rss.xml file in Firefox, it displays as valid XML. On opening the 10.2 file, Firefox displays the following:

XML Parsing Error: undefined entity
Location: file:///home/james/Downloads/rss.xml
Line Number 20, Column 258:

Under 10.1, the RSS feed contains the following:

&lt;p&gt;Hello&lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;p&gt;Goodbye&lt;/p&gt;

However under 10.2 the feed contains:

&lt;p&gt;Hello&lt;/p&gt;&lt;p&gt;&nbsp;&lt;/p&gt;&lt;p&gt;Goodbye&lt;/p&gt;

The first one contains a space character, while the second contains the &nbsp; token.

Proposed resolution

Identify if this behaviour is caused by a change in 10.2, and whether the new behaviour is intentional. Consider whether the previous behaviour should be reinstated.

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

🐛 Bug report
Status

Fixed

Version

10.2

Component
Views 

Last updated about 12 hours ago

Created by

🇮🇪Ireland lostcarpark

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @lostcarpark
  • 🇮🇪Ireland lostcarpark

    Note: I was unsure what component to log against. I picked views because as far as I understand, RSS feeds come from views. Please move if there is a more appropriate component.

  • Could you add some words to the issue summary detailing how you know that the feed is invalid?

    How would we go about duplicating as close as possible, this setup?

  • 🇮🇪Ireland lostcarpark

    Thanks @cilefen, I have added additional detail.

    Both sites were close to vanilla setups of Drupal, although they hadn't been set up exclusively for this test.

  • 🇺🇸United States mfb San Francisco

    I can reproduce this on a fresh install of the 11.x dev branch:

    1. Add the body field to the RSS display for articles.
    2. Create an article, setting the body source (using ckeditor source button) to Hello &nbsp; Goodbye.

    Upon output in RSS format, the &nbsp; is not encoded as &amp;nbsp;, as it should be. Ampersands (just like > and <) need to be encoded when output in an XML element. It's a bit surprising if this isn't happening, as typically an XML writing library would take care of this for you, but I don't know how Drupal writes XML these days...

  • 🇺🇸United States mfb San Francisco

    Ah, I see, it looks like an XML writing library isn't used here, the XML is assembled by the theme layer as if it were HTML markup, so it's not too surprising to find such a bug.

  • 🇺🇸United States mfb San Francisco

    In my debugging, this appears to be caused by Drupal\Core\EventSubscriber\RssResponseRelativeUrlFilter, which calls Drupal\Component\Utility\Html::transformRootRelativeUrlsToAbsolute() on the description element. Haven't (yet) looked further into this code to see what the problem/solution might be.

  • Merge request !5870Issue #3409587: RSS feeds invalid due to &nbsp; → (Closed) created by mfb
  • Status changed to Needs review 11 months ago
  • 🇺🇸United States mfb San Francisco

    Tldr: There is a "quirk" in PHP where setting the nodeValue property (which is not a thing in the spec) results in invalid XML. Instead what one should do (according to the spec) is create a new text node and append it - in this case the text will be properly encoded and the XML is valid.

  • 🇺🇸United States mfb San Francisco

    For simplicity, I added &nbsp; - which should appear in the feed as &amp;nbsp; - to the existing test for this class

  • 🇳🇱Netherlands Lendude Amsterdam

    Can we track down what changed here in 10.2? Neither files touched by the current fix were changed recently, it would be nice to track down what caused this so we can maybe prevent other side effect and make sure we are on the right track with fixing this here and not just band-aiding a much bigger issue

  • 🇺🇸United States mfb San Francisco

    @Lendude this bug was pre-existing, but it was hidden by the fact that prior to 🐛 Upgrade filter system to HTML5 Fixed , the filter system would replace &nbsp; with a literal non-breaking space. After 🐛 Upgrade filter system to HTML5 Fixed , &nbsp; is preserved, thus triggering this bug.

    I didn't do a super deep dive on the older code, but I would guess that &nbsp; was previously decoded because core used to interpret HTML as XML (for various historical reasons), and in XML, &nbsp; is not valid; it would be correct to decode it to a literal non-breaking space. In Drupal 10.2.0, HTML is finally interpreted as HTML, so &nbsp; can be preserved.

  • 🇮🇪Ireland lostcarpark

    Thanks for working on this so quickly, @mfb.

    I have applied the patch from MR !5870 to my 10.2 dev site and verified that the RSS is now validating correctly, and the output I initially reported is now getting output as:

    &lt;p&gt;Hello&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;&lt;p&gt;Goodbye&lt;/p&gt;

    This problem will potentially affect all 10.2 sites using RSS, so I would suggest we should try to get this into the next bugfix release.

    However, I wonder should we be looking at introducing an actual XML serializer in the longer term, such as Symfony\Component\Serializer\Encoder\XmlEncoder;?

  • 🇺🇸United States mfb San Francisco

    @lostcarpark Well ironically this bug is related not to how RSS is rendered using templates, but to usage of DOM to modify the RSS, which you'd think should just work, but has some quirks (as does SimpleXML, iirc).

  • 🇬🇧United Kingdom longwave UK

    @Lendude pinged me about this issue in Slack as I was the primary contributor to the HTML5 filter upgrade issue.

    I think @mfb is correct in #10/#15 and we have uncovered an existing bug that was lurking but not triggered until the HTML5 upgrade was completed.

    It is not clear to me the reasoning in #13 though, I am not sure why &nbsp; was previously treated as a normal space instead of a non-breaking one - these are different characters and should be treated as such. Unless perhaps it was being output as an actual non-breaking space character (which looks the same on screen) instead of the entity? The HTML5 serializer is more strict now about outputting entities instead of UTF-8 encoded characters:

    Drupal 10.1:

    > print \Drupal\Component\Utility\Html::normalize("<p>\xc2\xa0</p>");
    <p> </p>⏎
    

    Drupal 10.2:

    > print \Drupal\Component\Utility\Html::normalize("<p>\xc2\xa0</p>");
    <p>&nbsp;</p>⏎
    
  • 🇺🇸United States mfb San Francisco

    not sure why   was previously treated as a normal space instead of a non-breaking one

    There is no "normal space" involved here. In previous versions the saveXML() call would decode &nbsp; to a literal/actual non-breaking space. Which worked fine. Now that XML isn't involved, all the HTML entities which aren't valid in XML can be preserved.

  • Status changed to RTBC 11 months ago
  • 🇬🇧United Kingdom longwave UK

    Yeah, that makes sense, in which case this is RTBC.

  • 🇮🇪Ireland lostcarpark

    I have applied to my live site, and I'm happy to verify that the RSS feed is validating correctly with the patch.

  • 🇬🇧United Kingdom opsdemon

    We hit the same issue today after upgrading to 10.2.0. All our newsletters (based on RSS feeds) were broken.

    Applied the fix to one site and the feed is now validating again (so all the RSS items now appear in mailchimp).

    However the problem seems bigger than that. The description fields for the RSS items now contain unconverted HTML escape codes (such as "’", "—", "&") that weren't there before.

    Please can this also be fixed?

  • 🇺🇸United States mfb San Francisco

    @opsdemon would you be able to write a failing test or provide exact steps to reproduce your issue? I tried adding those HTML entities into the source of an article, and also disabled ckeditor and added it directly. Then looked at rss.xml and it looks correct. Also parsed the RSS and verified it spits out the correct HTML via

    $rss = simplexml_load_file('http://localhost/rss.xml');
    foreach ($rss->channel->item as $item) {
       echo $item->description;
    }
    
  • 🇬🇧United Kingdom opsdemon

    @mfb Thanks for looking into this issue.

    Unfortunately I don't have a test or reproducible steps.

    If it helps, here is a link to an rss feed that has the issue:
    https://teamrelated.com/newsletter/rss.xml
    (Search for "#8217" to see the first item with a HTML entity that has the problem.)

    I did some more investigation and found the following:
    - Checked the newsletter for the previous week and the HTML entities were not appearing. The only change we've made this week was to upgrade from 10.1.7 to 10.2.0.
    - The offending HTML entities in the rss feed are appearing because they were present in the body of the source article (eg. <p>) and have then been in the rss xml file.
    - I then took out the patch from #12 and the double-escaping in no longer present (but the rss feed does not validate with mailchimp).

    In summary, it looks like the issue with the HTML entitles is a side effect of the patch:
    - With the patch: rss file validates with mailchimp but the feed contains unconverted HTML entitles.
    - Without the patch: rss file items look normal, but the feed file does not validate with mailchimp.

    Here follows an example line from the rss file with and without the patch.

    without patch:
    &lt;p&gt;&lt;p&gt;Workforce management is the process of effectively managing and optimizing a company&amp;#8217;s workforce to meet business goals and objectives. It involves a range of activities,

    with patch: (double-escaped)
    &lt;p&gt;&amp;lt;p&amp;gt;Workforce management is the process of effectively managing and optimizing a company&amp;amp;#8217;s workforce to meet business goals and objectives. It involves a range of activities,

    Please let me know if there's anything else I can do to help.

  • 🇺🇸United States mfb San Francisco

    @opsdemon &lt;p&gt;&amp;lt;p&amp;gt; is a mix of single- and double-encoding. &lt;p&gt; is correctly single-encoded, while &amp;lt;p&amp;gt; is double-encoded. Wonder if some code or configuration on your site causes this? Are you able to reproduce on a clean install of drupal core?

  • 🇬🇧United Kingdom opsdemon

    @mfb Sorry, I don't have a clean install that reproduces this.

    The items in the rss feed come from a view. Specifically, the description for each rss item comes from "{{body}}".

    The body of the rss item in my previous example is:
    <p>Workforce management is the process of effectively managing and optimizing a company&#8217;s workforce to meet business goals and objectives. It involves a range of activities, including staffing, scheduling, time and attendance tracking, performance management, and more. With the help of workforce management metrics and key performance indicators (KPIs), businesses can gain valuable insights into their workforce and make informed decisions to improve operational efficiency.</p>

    The Text format is "Raw HTML".

    The line in the patch that causes the issue is:
    $node->replaceChild($rss_dom->createTextNode(Html::transformRootRelativeUrlsToAbsolute($html_markup, $request->getSchemeAndHttpHost())), $node->firstChild);

    Does that help?

  • 🇺🇸United States mfb San Francisco

    Does that help?

    @opsdemon A little, but feels like a wild goose chase when I don't have all the steps to reproduce from a clean install :) Some ideas... Does your "raw HTML" text format escape the content? If so, then that would explain it. Does changing your rewrite template to

    {{ body|raw }}

    help at all? If so, that would imply that Twig is escaping the content.

  • 🇬🇧United Kingdom opsdemon

    I checked the "Raw HTML" text format and it has all filters disabled, so it shouldn't be escaping the content.

    I tried changing the rewrite text to "body | raw" and it made no difference.

  • 🇮🇪Ireland lostcarpark

    @opsdemon, Just wondering could the initial &lt;p&gt; be coming from the Drupal template, and the double escaped &amp;lt;p&amp;gt; be coming from your external feed? Are you certain the content isn't escaped before it gets into Drupal, and therefore Drupal is applying its escaping to already escaped content?

    The current patch is working perfectly for me, but my content all originates inside Drupal, mostly from CKEditor.

  • 🇬🇧United Kingdom opsdemon

    No, the body example I sent in a previous message actually came from drupal. The HTML is all unescaped.

    <p>Workforce management is the process of effectively managing and optimizing a company&#8217;s workforce to meet business goals and objectives. It involves a range of activities, including staffing, scheduling, time and attendance tracking, performance management, and more. With the help of workforce management metrics and key performance indicators (KPIs), businesses can gain valuable insights into their workforce and make informed decisions to improve operational efficiency.</p>

    The content is coming into drupal unescaped. The content has been coming into drupal using the Feeds module since 2019.

    As I said, we don't use an editor. That's why the text format doesn't change the body at all.

  • 🇬🇧United Kingdom opsdemon

    Attaching screenshot of the body.

  • 🇺🇸United States mfb San Francisco

    @opsdemon If you look at the preview of the feed in Views, does it show that the RSS is already incorrect at that point (before the filter alters it)?

    Note it'd be *really* helpful if you could provide steps to reproduce from a clean install. Otherwise, I have no idea if you have e.g. a custom template or other custom code that causes this. Presumably would not be impossible, if you create the content type, content, text format, view, etc. that is causing the problem.

  • 🇬🇧United Kingdom opsdemon

    The preview of the view looks fine. Here is the example line from the view preview:

    &lt;p&gt;&lt;p&gt;Workforce management is the process of effectively managing and optimizing a company&amp;#8217;s workforce to meet business goals and objectives. It involves a range of activities,

    Does this mean that the escaping problem is happening AFTER the view is prepared?

    Would it help if I gave you direct access to a test system where the problem exists?

  • 🇺🇸United States mfb San Francisco

    @opsdemon Yes apparently you have something altering the output after views generates the feed? But I don't know what would cause that. I can create a custom route that spits out that description in an RSS feed, which gets processed by RssResponseRelativeUrlFilter, which does repair the p tags (it does not want nested p tags, so it immediately closes the first p tag, and closes the second p tag as well), but doesn't HTML-encode part of it as you are seeing.

  • 🇬🇧United Kingdom opsdemon

    As far as I'm aware (and I'm the only developer), there is nothing altering the output after the view generates. It should just be core code.

    Let me know if you need access to the test instance.

  • 🇺🇸United States mfb San Francisco

    @opsdemon If there is nothing else responsible for this (such as a custom theme, etc.) then I think what you will need to do is put some debugging code inside of core/lib/Drupal/Core/EventSubscriber/RssResponseRelativeUrlFilter.php to analyze what it's doing. How does the RSS document change when it runs? Is there perhaps something invalid about the markup in the description element that is responsible?

  • 🇬🇧United Kingdom opsdemon

    I'm not familiar with RssResponseRelativeUrlFilter.php. so let me know what debugging you need and I'd be happy to test it.

    I had a look at the .theme file and couldn't find anything related to rss.

    The rss feed has been in use for several years, so I'm not aware of any invalid markup. I sent you the link to the rss.xml earlier in case you could spot anything.

  • 🇺🇸United States mfb San Francisco

    @opsdemon In the RssResponseRelativeUrlFilter::transformRootRelativeUrlsToAbsolute() method, you could compare the value of the initial $rss_markup input to the method's return value to see what changes.

    You could also compare the value of each $html_markup to the value of Html::transformRootRelativeUrlsToAbsolute($html_markup, $request->getSchemeAndHttpHost()).

    Perhaps there is something about $html_markup that causes it to be transformed in the way you're seeing.

  • 🇬🇧United Kingdom opsdemon

    @mfb Sorry, I'm not sure my knowledge of the module (or my PHP coding) is up to that. :(

    I'd be happy to drop in a debug version of the php file and test it for you, or grant you access to the test instance so that you could do it yourself.

  • 🇺🇸United States mfb San Francisco

    @opsdemon In that case why don't you just add return; as the first line of RssResponseRelativeUrlFilter::onResponse() so it returns early without processing the feed thru transformRootRelativeUrlsToAbsolute(). Then look at the feed to see what the original untransformed version is.

  • 🇬🇧United Kingdom opsdemon

    @mfb Tried adding the return to onResponse(), then dropped cache, and it made no difference.

    Both the view preview and the rss.xml look the same as before.

  • 🇺🇸United States mfb San Francisco

    @opsdemon ok it sounds like your view has already generated the incorrect feed; the issue isn't caused by Drupal/Core/EventSubscriber/RssResponseRelativeUrlFilter processing the feed. I think you (or someone) will have to dive into debugging all the code involved in rendering the feed (Views, etc.) to see why you're getting that output.

  • 🇬🇧United Kingdom opsdemon

    @mfb If the issue isn't caused by RssResponseRelativeUrlFilter.php then why does reverting the patch in that file correct the HTML formatting issue?

    Also, why does view preview look correct if the view has already generated the incorrect feed?

  • 🇺🇸United States mfb San Francisco

    @opsdemon according to what you said in #39, the RSS is the same if RssResponseRelativeUrlFilter does not transform it. So my interpretation is that something else must be generating incorrect RSS before it gets to that point. You or anyone who reproduces the issue will have to do more debugging to figure it out.

  • 🇬🇧United Kingdom opsdemon

    @mfb I'm not sure where to go next with this. We're stuck with an upgrade issue and no idea where to look for the fix or even who to speak to next.

    It occurs to me that the previous code in RssResponseRelativeUrlFilter was masking an underlying issue, and when you fixed it the underlying issue resurfaced.

    Out of interest, why is the rss view preview formatted correctly, but the actual view output in rss.xml is incorrect? Aren't they meant to be the same?

  • 🇺🇸United States mfb San Francisco

    @opsdemon If I use Views module to create an intentionally double-encoded RSS feed, then without this patch, the Views preview shows the double-encoded feed, but rss.xml may be "fixed" by RssResponseRelativeUrlFilter (i.e. yes, it can mask the double-encoding by unencoding the ampersands, although the feed may also be invalid depending what it unencoded). With the patch, I see the double-encoded feed in both the Views preview and rss.xml. Is this what you're seeing? If not, good to know, but not much I can do about it without steps to reproduce.

  • 🇬🇧United Kingdom opsdemon

    @mfb There is no "intentionally" double-encoded RSS feed.

    The input article is not encoded (example already provided) and the view rewriting only uses simple, unencoded HTML markup.

    If I get some time over the holiday I'll try to reproduce using a vanilla 10.2.0 install. Would that help?

  • 🇺🇸United States mfb San Francisco

    @opsdemon Yes, you (or someone) needs to reproduce your issue on a fresh install. Theoretically one can upload their config yml/database dump/custom code/etc. here, but better to reproduce on a fresh install to get the minimal steps to reproduce, rather than maximum steps to reproduce :)

  • 🇬🇧United Kingdom opsdemon

    @mfb Some good news at last!

    I did some more digging into the RSS view setup and found that the rewrite results text was actually coming from a twig template (the config in the Text field in the UI was being overidden). Hence, when I tried your suggestion in #25 to use "{{ body|raw }}" for the body expression by updating the rewrite results through the UI it had no effect.

    I went back and tried #25 again by using "{{ body|raw }}" in the twig template and it fixed the problem. The output in rss.xml is now formatted correctly (and the double-encoding has gone).

    Thank you for your help in resolving this issue.

    We're now just waiting for your patch to be released to resolve the original RSS issue permanently. Hopefully it gets included in the next core release.

  • 🇺🇸United States mfb San Francisco

    @opsdemon \o/ great that we understand your situation.

    RssResponseRelativeUrlFilter dates from 2016, so I'm thinking perhaps a change record is needed, if there are any other sites out there that were mistakenly doing an extra round of HTML encoding on their RSS item description, and relying on RssResponseRelativeUrlFilter to transform it back. They can now remove that extra round, although luckily if they don't, there shouldn't be a security issue, just ugly HTML showing up unexpectedly.

  • 🇮🇪Ireland lostcarpark

    @opsdemon, great that you've found cause of your problem!

    @mfb, fantastic that we're getting to the bottom of this. Let me know if I can help get it over the line.

    I was beginning to worry that this was going to turn into one of those issues that hangs around for years generating hundreds of comments.

  • 🇬🇧United Kingdom longwave UK

    I think this qualifies as major as it breaks a feature with no workaround, also tagging as a regression. Will try to get this included in the next patch release.

  • Status changed to Fixed 10 months ago
  • 🇬🇧United Kingdom alexpott 🇪🇺🌍

    Committed and pushed b024332680 to 11.x and a5369ac460 to 10.2.x. Thanks!

    • alexpott committed a5369ac4 on 10.2.x
      Issue #3409587 by mfb, opsdemon, lostcarpark, longwave, cilefen, Lendude...
    • alexpott committed b0243326 on 11.x
      Issue #3409587 by mfb, opsdemon, lostcarpark, longwave, cilefen, Lendude...
  • Status changed to RTBC 10 months ago
  • 🇮🇪Ireland lostcarpark

    Thanks for getting this fixed!

  • Status changed to Fixed 10 months ago
  • 🇮🇪Ireland lostcarpark

    Sorry, just meant to post thank you comment. Didn't realise I was on the edit page. Putting things back as they were.

  • Automatically closed - issue fixed for 2 weeks with no activity.

  • 🇫🇮Finland heikkiy Oulu

    We are running 10.2.2 and we still seem to be experiencing this issue after upgrading our site from 10.1.x to 10.2.2. Looking at our source code we seem to have https://git.drupalcode.org/project/drupal/-/commit/a5369ac460 already applied but when using a text field as rss feed source, it converts empty spaces as &nbsp in the rss feed source.

    The RSS feed was working before the update but now we get a validation error The entity "nbsp" was referenced, but not declared.

    When looking at the RSS feed source, I can see that empty spaces are encoded as   like so:

    <description>Lorem ipsum.&nbsp;</description>
    
  • 🇺🇸United States mfb San Francisco

    @HeikkiY I'm pretty sure this might have only been released in 10.2.3 because 10.2.2 was a security release

  • 🇺🇸United States mfb San Francisco

    Seems like this issue is missing from the 10.2.3 release notes, however. Maybe someone could add it in retroactively...

  • 🇫🇮Finland heikkiy Oulu

    @mfb

    Yes, you are absolutely correct. Sorry for the confusion, I had a failing patch against 10.2.3 and that lead to the codebase already seemingly having the fix in it.

    And yes, it does seem to be missing from the release notes and would be good to be there since it's a regression. We didn't catch it either during our D10.2 testing but only in production after third party contacted the client that the RSS feed is broken.

  • 🇨🇦Canada OMD

    running 10.2.3 here and still getting this issue. Was it supposed to be fixed in 10.2.3?

  • 🇺🇸United States mfb San Francisco

    @OMD yes this was released in 10.2.3 - if you're still seeing an issue you should provide steps to reproduce so folks can look into it.

  • 🇨🇦Canada OMD

    Ok, so steps to reproduce are much the same as before. I have a feed with an RSS description that contains an ampersand "Training & Events" - this is the field in the Feed:Style options setting section in views.

    Checking the feed output against https://validator.w3.org it finds the line and highlights it as an undefined entity: "description should not contain HTML: &" and this is printed in the feed "Training &amp; Events" in place of the ampersand.

  • 🇨🇦Canada OMD

    Also noticing the pubDate for the feed, that used to be at the top of the feed, is no longer included. Is this by design?

  • 🇨🇦Canada OMD

    Sorry, please disregard #64. I had added a date at the top of the feed using a views template.

  • 🇺🇸United States mfb San Francisco

    @OMD, I believe I was able to reproduce what you're talking about. What I see at https://validator.w3.org/feed/#validate_by_input is:

    This feed is valid, but interoperability with the widest range of feed readers could be improved by implementing the following recommendations.

    line 6, column 42: description should not contain HTML: &amp;

    <description>Training &amp;amp; Events</description>

    In other words, the feed is valid, but there is a recommendation that Views should simply output <description>Training &amp; Events</description> rather than double-encoding it as <description>Training &amp;amp; Events</description>

    You'll have to discuss that in another (new or existing) issue - this issue is about feeds being actually invalid and unparseable due to &nbsp;

  • 🇺🇸United States mfb San Francisco

    @OMD's followup bug report is at: 🐛 Channel description of RSS feeds is double-escaped Needs review

  • 🇺🇸United States mfb San Francisco

    mfb changed the visibility of the branch 3409587-rss-feeds-invalid to active.

  • 🇺🇸United States mfb San Francisco

    mfb changed the visibility of the branch 3409587-rss-feeds-invalid to active.

Production build 0.71.5 2024