Drupal\Component\Utility\Html::normalize() leaves messy </body></html> in certain situations

Created on 30 March 2009, over 15 years ago
Updated 16 May 2024, 5 months ago

Problem/Motivation

When the input HTML ends in the middle of an attribute, Drupal\Component\Utility\Html::normalize() (in D7, it's _filter_htmlcorrector()) leaves a messy </body></html> in the resulting HTML. For example:

<p>Here <img alt="ao

This will produce output like:

You can reproduce on Drupal 7 or 8 by following these steps:

Drupal 8

  1. Install Drupal 8.3.1 with the standard profile
  2. Go to /admin/structure/types/manage/article/display/teaser and configure the "Body" field to trim at 20 characters
  3. Go to /admin/config/content/formats/manage/basic_html and both (a) enable the "Correct faulty and chopped off HTML" filter and (b) disable the "Restrict images to this site" filter (only necessary for the example HTML, not necessary to trigger the bug)
  4. Go to /node/add/article and use this HTML as the body (be sure to click the "Source" button in the WYSIWYG toolbar before pasting it, otherwise you're adding text not HTML):
    Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png"  /> it is
    
  5. Go to /node and observe output like in the screenshot

Drupal 7

  1. Install Drupal 7.54 with the standard profile
  2. Go to /admin/structure/types/manage/article/display/teaser and configure the "Body" field to trim at 20 characters
  3. Go to /admin/config/content/formats/filtered_html and add <img> to the "Allowed HTML tags" (under "Limit allowed HTML tags")
  4. Go to /node/add/article and use this HTML as the body:
    Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png"  /> it is
    
  5. Go to /node and observe output like in the screenshot
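
Both sets of steps rely on the 20-character trim cutting the sample body inside the alt attribute. A minimal sketch of that cut, assuming the teaser trim behaves like a plain character truncation (a simplification; Drupal's actual summary trimming also looks for sentence and paragraph boundaries):

```php
<?php
// Sample body from the steps above.
$body = 'Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" '
  . 'src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png"  /> it is';

// Assumption: a plain 20-character cut stands in for the teaser trim.
// substr() is enough here because the sample is ASCII only.
$trimmed = substr($body, 0, 20);

echo $trimmed, PHP_EOL; // Here <img alt="aoeun

// A single double quote remains, so the alt attribute is left open.
echo substr_count($trimmed, '"'), PHP_EOL; // 1
```

The trimmed string ends inside alt="...", which is the kind of input that triggers the bug.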

Proposed resolution

Remove the potentially-misinterpreted </body></html> closing tags that \Drupal\Component\Utility\Html::load() adds to the end of the text before parsing into the DOM, as they are not required for the HTML soup to be successfully parsed.
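
To illustrate why the closing tags can be misinterpreted, here is a string-level sketch. wrap_fragment() and text_after_open_attribute() are hypothetical helpers written for this example, and the wrapper is a simplified stand-in for what Html::load() builds (the real one also includes a doctype and a head). It only shows what text follows the unterminated attribute; how libxml2 then handles that text is not covered.

```php
<?php
/**
 * Hypothetical, simplified stand-in for the wrapping step in Html::load().
 */
function wrap_fragment(string $html, bool $withClosingTags): string {
  $document = '<html><body>' . $html;
  return $withClosingTags ? $document . '</body></html>' : $document;
}

/**
 * Returns everything after the last opening alt=" in the document, i.e. the
 * text a parser still has to read while inside the unterminated attribute.
 */
function text_after_open_attribute(string $document): string {
  return substr($document, strrpos($document, 'alt="') + strlen('alt="'));
}

$fragment = '<p>Here <img alt="ao';

// With the closing tags appended, they sit inside the open attribute.
echo text_after_open_attribute(wrap_fragment($fragment, true)), PHP_EOL;  // ao</body></html>

// Without them, only the original scrap is left.
echo text_after_open_attribute(wrap_fragment($fragment, false)), PHP_EOL; // ao
```

With the closing tags left off, there is nothing after the fragment for the parser to mistake for attribute content.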

By the way, \Drupal\views\Plugin\views\field\FieldPluginBase::trimText() has some code to avoid this exact problem when trimming fields:

      // Remove scraps of HTML entities from the end of a strings
      $value = rtrim(preg_replace('/(?:<(?!.+>)|&(?!.+;)).*$/us', '', $value));
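
For reference, a small sketch of what that expression does to the inputs from this issue. strip_trailing_scraps() is a hypothetical wrapper name used only for this example; the expression inside it is the one quoted above.

```php
<?php
/**
 * Illustrative helper (hypothetical name) wrapping the quoted expression.
 *
 * Removes everything from the first "<" that has no ">" after it, or the
 * first "&" that has no ";" after it, through the end of the string.
 */
function strip_trailing_scraps(string $value): string {
  return rtrim(preg_replace('/(?:<(?!.+>)|&(?!.+;)).*$/us', '', $value));
}

// Tag cut off in the middle of an attribute.
echo strip_trailing_scraps('<p>Here <img alt="ao'), PHP_EOL; // <p>Here

// Shortened version of the example from the original summary.
echo strip_trailing_scraps('Nulla sed risus <a class="sample" href="http://www.example.com/partial/path'), PHP_EOL; // Nulla sed risus

// Entity cut off before its semicolon.
echo strip_trailing_scraps('Fish &amp; chips &am'), PHP_EOL; // Fish &amp; chips
```

Note that this only drops the trailing scrap. The unclosed <p> in the first example is still there, so the result still needs to go through the HTML corrector.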

Remaining tasks

  1. Consider backporting patch to Drupal 7 and Drupal 10

User interface changes

None.

API changes

None.

Data model changes

None.

Original summary

_filter_htmlcorrector() leaves in any fragmentary tags passed to it, which can break the rest of the page. This is most likely to happen when using the "Field can contain HTML" filter in the Views module, but it can also occur any time a developer trims a string that contains HTML and passes it to this function.

Example (trimmed to 250 chars):

Lorem ipsum dolor sit amet, consectetur adipiscing elit. <strong>Aliquam posuere enim</strong>. Sed ultrices semper tortor. Pellentesque cenim consectetur. Nulla sed risus eu ipsum venenatis <a class="sample" href="http://www.example.com/partial/path

Output is identical to input, breaking any HTML that follows on the page. Ideal output would be:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. <strong>Aliquam posuere enim</strong>. Sed ultrices semper tortor. Pellentesque cenim consectetur. Nulla sed risus eu ipsum venenatis 

Patch attached.

🐛 Bug report
Status

Fixed

Version

11.0 🔥

Component
Filter 

Last updated about 6 hours ago

No maintainer
Created by

🇺🇸United States greenbeans
