[Meta] PHP DOM (libxml2) misinterprets HTML5

Created on 6 November 2011, over 12 years ago
Updated 11 November 2023, 8 months ago

Problem/Motivation

The 'htmlcorrector' and 'html' filters and the testing system need HTML parsing and a valid DOM to work with. This is done by the libxml2 library bundled with PHP, which cleans the HTML and transforms it into a DOM. libxml2 assumes all HTML is HTML4 and corrects it using HTML4 rules. Because Drupal will be based on HTML5, typical HTML5 tags and constructions are flagged as invalid, and markup is added or stripped.
A small test example: <span lang="en"> is transformed to <span lang="en" xml:lang="en">.

Proposed resolution

Chosen solution

The consensus is that the best solution is to use html5-php (a rewrite of html5lib). The library already implements almost all the functionality that we need. One problem with the library was found and has since been fixed. The one piece of functionality still missing is this: when the DOMDocument has a default namespace, it cannot be queried with XPath unless that namespace is registered under a prefix. The patch contains two helper classes that add this functionality.
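The default-namespace limitation can be reproduced with PHP's built-in DOM classes alone. The following is a minimal sketch, not the helper classes from the patch: the sample markup and the prefix "x" are chosen for illustration only, and the document is loaded with loadXML() to stand in for a DOMDocument that carries the XHTML namespace.

```php
<?php
// Sample document whose root element declares a default namespace
// (illustrative markup, not taken from the patch).
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// In XPath 1.0 an unprefixed name only matches elements in no namespace,
// so this query finds nothing in a namespaced document.
$unprefixed = $xpath->query('//p');

// Registering the same namespace URI under an arbitrary prefix ("x" here)
// makes the elements reachable.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//x:p');

echo $unprefixed->length, "\n"; // prints 0
echo $prefixed->length, "\n";   // prints 1
```

Any code that runs unprefixed XPath queries against such a document would need either a registered prefix in every expression or a wrapper that handles the registration, which is presumably the role of the helper classes mentioned above.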

Remaining tasks

🐛 Upgrade tests to HTML5 (Needs review)
#2667340: Usage of field_prefix and container-inline creates invalid markup. →
🐛 Upgrade filter system to HTML5 (Fixed)

@todo: an issue needs to be created, or an existing issue linked from this one, to convert the filter module to use masterminds/html5.

User interface changes

None

API changes

Change in behavior of filter_dom_load, the html-filter and html-corrector

Blocked Issues

#1277290: Use a proper HTML parser for every core filter →

Related Issues

PHP Bug #60021 DOMDocument errors on HTML5 tags
Issue that replaced Drupal's own (faulty) function with libxml for the HTML corrector filter in 2009: #374441: Refactor Drupal HTML corrector (PHP5) →
#725260: Use PHP's Tidy component for the clean-up HTML filter →

Original report

The HTML corrector filter is currently based on XHTML, but Drupal 8 should output HTML5.
At the moment
<span lang="en"> is transformed to <span lang="en" xml:lang="en">
(see issue #1328768: attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss →)

Two possible causes:
The function filter_dom_load currently wraps the text in an XHTML 1.0 Strict document: @$dom_document->loadHTML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');

The function filter_dom_serialize uses $dom_document->saveXML
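To check the round trip by hand, a rough sketch along these lines may help. It reuses the XHTML wrapper quoted above, but the sample input and the serialization loop are assumptions made for illustration and are not the actual filter_dom_serialize implementation. The exact output, including whether xml:lang is added, depends on the libxml2 version in use.

```php
<?php
// Hypothetical sample input; any fragment with a lang attribute would do.
$text = '<span lang="en">Hello</span>';

// Collect libxml parse warnings instead of printing them
// (plays the role of the @ operator in the quoted call).
libxml_use_internal_errors(true);

$dom_document = new DOMDocument();
// Same XHTML 1.0 Strict wrapper as in the quoted filter_dom_load call.
$dom_document->loadHTML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
libxml_clear_errors();

// Serialize each child of <body> with saveXML(); this loop is an
// assumption about how the body content is written back out.
$body = $dom_document->getElementsByTagName('body')->item(0);
$output = '';
foreach ($body->childNodes as $node) {
  $output .= $dom_document->saveXML($node);
}

// Inspect the result to see how the lang attribute was serialized.
echo $output, "\n";
```

Comparing $output with $text shows which attributes the XHTML-based load and XML-based save add or change on a given installation.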

🌱 Plan
Status

Fixed

Version

10.2 ✨

Component
Filter →

Last updated 1 day ago

No maintainer
Created by

🇳🇱 Netherlands Hanno

  • html5

    Implements and supports the use of HTML5.

  • Needs issue summary update

    Issue summaries save everyone time if they are kept up-to-date. See Update issue summary task instructions.
