Html::serialize adds unwanted/duplicate xml:* attributes

Created on 2 June 2021
Updated 6 November 2023
Problem/Motivation

The \Drupal\Component\Utility\Html::serialize method uses \DOMDocument::saveXML instead of \DOMDocument::saveHTML to turn a \DOMDocument object back into an HTML string. HTML and XML are closely related, but they do not serialize identically, and in some cases this causes issues.

For example, when the following HTML is passed to \Drupal\Component\Utility\Html::normalize (which calls the serialize method):

<p>The Dutch word for example is <span lang="nl">voorbeeld</span></p>

The output is:

<p>The Dutch word for example is <span lang="nl" xml:lang="nl">voorbeeld</span></p>

For some reason, a new xml:lang attribute was added.

That alone is not a big problem. However, if this output is later passed to Html::normalize a second time (for example, by two text filters that both use Html::normalize), we get the following output:

<p>The Dutch word for example is <span lang="nl" xml:lang="nl" xml:lang="nl">voorbeeld</span></p>

As you can see, xml:lang now appears twice, which is invalid HTML. This looks like a bug in PHP or in libxml, but if we use saveHTML instead of saveXML, the problem is fixed (no xml:lang attributes are added).
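A minimal sketch of the double round trip outside Drupal, for comparing the two serializers side by side. The load_fragment and serialize_body helpers are hypothetical stand-ins for Html::load and Html::serialize, not Drupal's actual code, and the XHTML doctype wrapper is an assumption about how the fragment is loaded. The exact saveXML output may vary with the PHP and libxml versions, so no expected output is shown.

```php
<?php
// Hypothetical stand-in for Html::load(): wraps a fragment in a full document.
// Assumption: the fragment is wrapped in an XHTML doctype before parsing.
function load_fragment(string $html): DOMDocument {
  $wrapper = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    . '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    . '</head><body>' . $html . '</body></html>';
  // Collect parser warnings internally instead of emitting them.
  libxml_use_internal_errors(true);
  $dom = new DOMDocument();
  $dom->loadHTML($wrapper);
  libxml_clear_errors();
  return $dom;
}

// Hypothetical stand-in for Html::serialize(): dumps the children of <body>,
// using either saveHTML() or saveXML() depending on $as_html.
function serialize_body(DOMDocument $dom, bool $as_html): string {
  $out = '';
  $body = $dom->getElementsByTagName('body')->item(0);
  if ($body !== null) {
    foreach ($body->childNodes as $node) {
      $out .= $as_html ? $dom->saveHTML($node) : $dom->saveXML($node);
    }
  }
  return $out;
}

$input = '<p>The Dutch word for example is <span lang="nl">voorbeeld</span></p>';

// Run the fragment through two load/serialize passes with each serializer.
foreach ([false, true] as $as_html) {
  $once = serialize_body(load_fragment($input), $as_html);
  $twice = serialize_body(load_fragment($once), $as_html);
  echo ($as_html ? 'saveHTML' : 'saveXML'), ":\n  ", $once, "\n  ", $twice, "\n";
}
```

Comparing the two pairs of lines shows whether any xml:lang attributes appear after the first pass and whether they accumulate after the second.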

The big question is: why use saveXML when a dedicated saveHTML method is available?

Steps to reproduce

The issue can easily be reproduced in the default Umami example profile:

  1. Install Drupal with the Umami profile
  2. Create a new basic page
  3. Fill the Body field with the following HTML:
    <p>The Dutch word for example is <span lang="nl">voorbeeld</span>.</p>
  4. Make sure the Basic HTML format is selected
  5. Save the page

Now, if you look at the source of the page, you see the output is:
<p>The Dutch word for example is <span lang="nl" xml:lang="nl" xml:lang="nl">voorbeeld</span>.</p>

This is because in the Basic HTML format, multiple filters are enabled that use the Html::serialize method:

  • Align images
  • Caption images
  • Restrict images to this site
  • Track images uploaded via a Text Editor
  • Embed media

Proposed resolution

I think it is a better option to use \DOMDocument::saveHTML instead of \DOMDocument::saveXML in \Drupal\Component\Utility\Html::serialize.

I am not sure how large the impact of this change would be.
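One difference that can be checked in isolation is how void elements are written. The sketch below uses plain \DOMDocument with a made-up fragment (not taken from Drupal) and prints the same node with both methods. It illustrates only one kind of output change and is not a full impact assessment.

```php
<?php
// Illustration only: the fragment below is a hypothetical example.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>one<br>two</p></body></html>');
libxml_clear_errors();

$p = $dom->getElementsByTagName('p')->item(0);

// XML serialization writes the void element in self-closing form.
$xml = $dom->saveXML($p);
// HTML serialization writes it as a bare start tag.
$html = $dom->saveHTML($p);

echo $xml, "\n", $html, "\n";
```

Any filter or test that compares the serialized string against markup containing self-closing tags would presumably need to be reviewed if the serializer changes.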

Remaining tasks

  • Add tests

Release notes snippet

Use \DOMDocument::saveHTML instead of \DOMDocument::saveXML in Html::serialize

🐛 Bug report
Status

Closed: outdated

Version

11.0 🔥

Component
Render 

Last updated 3 days ago

Created by

🇳🇱Netherlands BryanDeNijs

Issue tags

  • Needs framework manager review
