Xss::filter() does not handle HTML tags inside attribute values

Created on 12 August 2021, almost 4 years ago

Updated 17 February 2023, over 2 years ago

Problem/Motivation

Initially reported by @lauriii in "Upgrade filter system to HTML5": HTML5 allows unescaped less-than and greater-than characters in attribute values, e.g.

<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />

Xss::filter() does not handle this:

>>> use \Drupal\Component\Utility\Xss;

>>> Xss::filter('<img src="llama.jpg" data-caption="Loquacious llama!" />', ['img', 'em']);
=> "<img src="llama.jpg" data-caption="Loquacious llama!" />"

>>> Xss::filter('<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />', ['img', 'em']);
=> "<img src="llama.jpg">Loquacious llama!</em>" /&gt;"

In other words, when an attribute value contains a tag (or even just a >), the output is mangled, and part of the attribute value may end up in the HTML body instead.

Xss::filter() uses two regular expressions to try to extract tags from HTML:

      <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string

This trivially matches anything that looks like a tag, but [^>]* stops at the first >, so it does not handle attribute values that contain >.
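To make the failure concrete, here is a minimal sketch that runs only the alternative quoted above, in isolation. The full pattern in Xss::filter() has further alternatives that are not reproduced here, and $pattern and $html are illustrative names rather than anything from core.

```php
<?php
// Only the "<[^>]*(>|$)" alternative, wrapped in % delimiters (assumption:
// the other alternatives of the real pattern do not change this result).
$pattern = '%<[^>]*(>|$)%';
$html = '<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />';

preg_match($pattern, $html, $matches);

// [^>]* cannot cross a '>', so the match ends at the '>' of the embedded <em>.
echo $matches[0], PHP_EOL;
// prints: <img src="llama.jpg" data-caption="<em>
```

Everything after that first > is then handled as separate text and tags, which is consistent with the mangled output shown earlier.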

    if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9\-]+)\s*([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {

Similarly, this seems unable to handle attributes that contain >, since the attribute list is captured with ([^>]*).
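The same check can be run against the second pattern. This is a sketch that calls preg_match() directly with the pattern quoted above. It does not call Xss::filter(), and the variable names are illustrative.

```php
<?php
// The pattern exactly as quoted from the preg_match() call above.
$pattern = '%^<\s*(/\s*)?([a-zA-Z0-9\-]+)\s*([^>]*)>?|(<!--.*?-->)$%';
$tag = '<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />';

preg_match($pattern, $tag, $matches);

// Group 2 is the element name, group 3 the raw attribute list.
echo $matches[2], PHP_EOL;
// prints: img
echo $matches[3], PHP_EOL;
// prints: src="llama.jpg" data-caption="<em
```

The captured attribute list ends in an unterminated quoted value, so whatever parses the attributes afterwards never sees the complete data-caption value.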

Steps to reproduce

Proposed resolution

Remaining tasks

Determine whether regex is sufficient to filter HTML in this way: https://stackoverflow.com/a/1732454
Improve the regex to handle attributes that contain tag characters, or replace Xss::filter() with something more robust.
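As a starting point for the second task, one possible direction is a tag pattern that consumes quoted attribute values whole. The pattern below is hypothetical: it is not taken from core, has not been reviewed for security, and only addresses tag extraction, not attribute parsing. Whether any regex is sufficient here is exactly the open question in the first task.

```php
<?php
// Hypothetical quote-aware alternative to "<[^>]*(>|$)": a double- or
// single-quoted run is matched as a unit, so a '>' inside quotes does not
// end the tag.
$quoteAware = '%<(?:[^>"\']|"[^"]*"|\'[^\']*\')*(?:>|$)%';
$html = '<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />';

preg_match($quoteAware, $html, $matches);

echo $matches[0], PHP_EOL;
// prints: <img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />
```

Input with an unbalanced quote still falls back to stopping at an earlier >, so malformed markup would need separate consideration.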