FilterHtml data loss when iframe and/or textarea is allowed

Issue created by @luke.leber
Merge request !5919 "Toss an iframe in the initial configuration to throw monkey wrenches around." → (Closed) created by luke.leber
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
Republishing after discussion with @Luke.Leber and confirmation that there is no security issue here but a critical data loss bug that can be handled in public.
Comment over 1 year ago →
🇺🇸United States dslatkin
I discovered this issue on our own sites, too. I shared a bit of this on Slack: it seems to be caused by the getHTMLRestrictions implementation in core/modules/filter/src/Plugin/Filter/FilterHtml.php. The $dom = Html::load($html); line uses the new HTML5 parser to convert the "allowed HTML tags" list into a deeply nested DOM data structure, since the list has no closing tags. It then walks over the data structure and extracts tag and attribute names. Since iframes are not permitted to have any child content, the new parser drops any elements that would have been inside the iframe.

I found a couple short-term workarounds that worked:

1. Move the <iframe> tag to the end of the list, so it doesn't have any child content when it gets parsed by Html::load, and avoid adding new tags after the iframe until the issue is fixed.

2. Close the <iframe> tag in the list so it becomes something like <iframe></iframe>. Drupal doesn't seem to encourage closing tags here, so using this workaround might cause a future issue.

Also, I'm not entirely sure, but I could imagine there being more elements than just the iframe that drop the child content. It's just that the iframe is one of the most common use cases with the HTML filter.
Comment over 1 year ago →
🇺🇸United States luke.leber Pennsylvania
FYI - <textarea> has the same effect here as <iframe> does.

That's the only other HTML5 element that I've found to be problematic. Like my comment in the test, it might be worth taking a "kitchen sink" approach with test coverage, given this is largely governed by a third party library nowadays.

Cheers -- thanks for facilitating everything today.
Comment over 1 year ago →
🇺🇸United States luke.leber Pennsylvania
Update issue title / summary.

🇬🇧United Kingdom longwave UK

Looking at Masterminds\HTML5\Elements there are bitmasks for tags that have certain features:

    // From section 8.1.2: "script", "style"
    // From 8.2.5.4.7 ("in body" insertion mode): "noembed"
    // From 8.4 "style", "xmp", "iframe", "noembed", "noframes"
    /**
     * Indicates the contained text should be processed as raw text.
     */
    const TEXT_RAW = 2;

    // From section 8.1.2: "textarea", "title"
    /**
     * Indicates the contained text should be processed as RCDATA.
     */
    const TEXT_RCDATA = 4;

    /**
     * Indicates that the text inside is plaintext (pre).
     */
    const TEXT_PLAINTEXT = 32;

Guessing all these tags will therefore be affected.
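
As a rough illustration of that guess, here is a minimal sketch of how such bitmask flags could be used to spot the tags in an allowed list whose content is not parsed as markup. The three flag values are copied from the constants quoted above; the ELEMENT_FLAGS table and the text_mode_tags() helper are hypothetical and exist only for this example (the real library keeps its own, much larger table).

```php
<?php

// Flag values copied from the Masterminds\HTML5\Elements constants quoted above.
const TEXT_RAW = 2;
const TEXT_RCDATA = 4;
const TEXT_PLAINTEXT = 32;

// Hypothetical, abbreviated flag table for illustration only. The tag names
// come from the comments quoted above; everything else is an assumption.
const ELEMENT_FLAGS = [
  'a' => 0,
  'p' => 0,
  'iframe' => TEXT_RAW,
  'script' => TEXT_RAW,
  'style' => TEXT_RAW,
  'textarea' => TEXT_RCDATA,
  'title' => TEXT_RCDATA,
];

/**
 * Returns the tags from $tags whose content would be read as text, not markup.
 */
function text_mode_tags(array $tags): array {
  $mask = TEXT_RAW | TEXT_RCDATA | TEXT_PLAINTEXT;
  return array_values(array_filter(
    $tags,
    // A tag is flagged when any of the three text-mode bits is set.
    fn (string $tag): bool => ((ELEMENT_FLAGS[$tag] ?? 0) & $mask) !== 0
  ));
}

$risky = text_mode_tags(['a', 'iframe', 'p', 'textarea']);
// $risky is ['iframe', 'textarea'].
```

Any tag reported this way would be a candidate for swallowing the tags that follow it in an unclosed "allowed HTML tags" list.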

Comment over 1 year ago →
🇧🇪Belgium wim leers Ghent 🇧🇪🇪🇺
Thinking that instead of trying to process the allowed list as HTML, we should just use regex instead. Normally regex is insufficient for parsing HTML, but this isn't really HTML anyway, it's just a list of strings that look like HTML tags.

+1

Surprisingly related: the CKEditor 5 module's \Drupal\ckeditor5\HTMLRestrictions::fromString() reuses the parsing that FilterHtml does, precisely because it already was historically brittle:

    …
    // Reuse the parsing logic from FilterHtml::getHTMLRestrictions().
    $configuration = ['settings' => ['allowed_html' => $elements_string]];
    $filter = new FilterHtml($configuration, 'filter_html', ['provider' => 'filter']);
    $allowed_elements = $filter->getHTMLRestrictions()['allowed'];
    …

This also means there's a HUGE amount of implicit test coverage, because HTMLRestrictions has >1500 LoC of test coverage, since it's so crucial for the CKEditor 4 → 5 upgrade path (as well as for providing detailed validation errors and guidance in the admin UI): \Drupal\Tests\ckeditor5\Unit\HTMLRestrictionsTest.

My point is: we can change the parsing logic and be very confident that if it passes tests, that it works fine 😄
Status changed to Needs review over 1 year ago, 4:12pm 22 December 2023
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
Switched FilterHtml::getHTMLRestrictions() to use regexes instead of HTML DOM to parse the allowed tags and attributes.

Not sure I have covered all possible combinations with the regexes, I might have missed some allowed characters in tags or attribute names, and if someone has done something really weird with attribute values (for example, using >) then I think the regexes might fail. But I think this is a good start.
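
For readers who want a feel for the regex approach, here is a minimal sketch. It is not the code from the merge request: the parse_allowed_html() name, the simplified return shape (tag name mapped to a list of attribute names) and the character classes are all assumptions made for illustration.

```php
<?php

/**
 * Extracts tag and attribute names from an "allowed HTML tags" string.
 *
 * Illustrative only. Known gaps: a ">" inside an attribute value ends the
 * tag match early, and unquoted attribute values are misread as extra
 * attribute names.
 */
function parse_allowed_html(string $allowed_html): array {
  $restrictions = [];
  // Match each "<tag ...>" chunk; closing tags such as "</iframe>" are skipped.
  preg_match_all('/<([a-z][a-z0-9-]*)([^>]*)>/i', $allowed_html, $tags, PREG_SET_ORDER);
  foreach ($tags as $tag) {
    $attributes = [];
    // Match attribute names, each optionally followed by a quoted value.
    preg_match_all('/([a-z][a-z0-9_*-]*)(?:=(?:"[^"]*"|\'[^\']*\'))?/i', $tag[2], $attrs, PREG_SET_ORDER);
    foreach ($attrs as $attr) {
      $attributes[] = strtolower($attr[1]);
    }
    $restrictions[strtolower($tag[1])] = $attributes;
  }
  return $restrictions;
}

$result = parse_allowed_html(
  '<a href hreflang> <iframe src width height> <p class="text-align-left text-align-right"> <textarea>'
);
// Tags listed after <iframe> are kept, because nothing is parsed as nested HTML.
```

Because the string is never turned into a DOM, the position of <iframe> or <textarea> in the list no longer matters, which is the point of the approach; the cost is the set of edge cases noted in the docblock.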
Status changed to Needs work over 1 year ago, 4:31pm 22 December 2023
Comment over 1 year ago →
🇧🇪Belgium wim leers Ghent 🇧🇪🇪🇺
As predicted: lots of test failures in \Drupal\Tests\ckeditor5\Unit\HTMLRestrictionsTest as well as things extensively relying on it such as SmartDefaultSettingsTest 🤓
Merge request !5942 Resolve #3410303 "Custom tokenizer" → (Closed) created by longwave
Status changed to Needs review over 1 year ago, 4:45pm 22 December 2023
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
Opened an alternative approach in MR!5942 that modifies the way the HTML5 parser works, instead of trying to use regex. This feels simpler but relies a bit on the internals of the HTML5 library. Not sure which is better.
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
Also modified MR!5919 to be more lenient in the characters it accepts.
Status changed to RTBC over 1 year ago, 5:31pm 22 December 2023
Comment over 1 year ago →
🇧🇪Belgium wim leers Ghent 🇧🇪🇪🇺
That looks magnificent! 🤩

(And very nice test coverage too 🦙😜)
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
In case it gets lost in the noise above: I'm not sure which of the two approaches is best, but both are passing tests, so that's for another committer to decide, I guess.
Comment over 1 year ago →
🇺🇸United States alfattal Minnesota
Wim Leers → Which one?!
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
Personally I prefer the custom tokenizer one, as the HTML5 parser is more robust than regex and will handle edge cases that might exist in real world configurations that the regex might choke on in some way - while our test coverage is good it can't hope to cover every strange thing that someone might put in their filter config.
Comment over 1 year ago →
🇺🇸United States adrianm6254
I tried both patches MR!5919 & MR!5942 and they both worked fine.

For now I will keep MR!5942 applied.
Comment over 1 year ago →
🇺🇸United States alfattal Minnesota
MR!5942 fixed the issue for me. Great work, thank you all.
Comment over 1 year ago →
🇳🇿New Zealand quietone
I'm triaging RTBC issues. I read the issue summary and the comments.

I did update the proposed resolution to include that there are 2 MRs here with different approaches and related details.

Leaving at RTBC.
Comment over 1 year ago →
🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10
larowlan → changed the visibility of the branch 11.x to hidden.
Comment over 1 year ago →
System Message
larowlan → closed merge request !5919
Comment over 1 year ago →
🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10
I agree with #19 that the custom tokenizer looks like the better approach

Looking a bit further into the masterminds/html5 internals to understand it a bit better
Comment over 1 year ago →
🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10
As this is a critical, not going to hold things up - but I think we need a follow-up for explicit coverage for \Drupal\filter\Plugin\Filter\FilterHtml::getHTMLRestrictions which is a public API.

I will file that.

Updating issue credits
Comment over 1 year ago →
System Message

larowlan → committed cb6d0184 on 10.2.x
Issue #3410303 by longwave, Luke.Leber, Wim Leers, quietone, dslatkin:...

Comment over 1 year ago →

System Message

larowlan → committed 3ae37397 on 11.x

Issue #3410303 by longwave, Luke.Leber, Wim Leers, quietone, dslatkin:...

Status changed to Fixed over 1 year ago, 10:32pm 1 January 2024
Comment over 1 year ago →
🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10
Committed to 11.x and backported to 10.2.x

Filing that follow-up now
Comment over 1 year ago →
🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10
📌 Add test coverage for \Drupal\filter\Plugin\Filter\FilterHtml::getHTMLRestrictions Active
Comment over 1 year ago →
System Message
larowlan → closed merge request !5942
Comment over 1 year ago →
🇧🇪Belgium wim leers Ghent 🇧🇪🇪🇺
Sorry for not having been more explicit in #16 — I was definitely referring to the new class($scanner, $events) extends Tokenizer one, which was also committed 👍
Comment over 1 year ago →
xem8vfdh
I've been told my separate report may be an instance of this same issue. I've detailed my experience/symptoms here: https://www.drupal.org/project/drupal/issues/3412164#comment-15382735 (🐛 "upgrade from 10.1.7 to 10.2.0 removes HTML tags from many pages", Postponed: needs info)

The situation affecting my site as a result of upgrading from 10.1.7 to 10.2.0, which seems similar to this, is indeed critical. As outlined in my notes, the database appears uncorrupted by the upgrade, but the rendered output is corrupted, and once someone edits/saves a page, they persist the corruption to the database.
Comment over 1 year ago →
xem8vfdh
given the very bad nature of this bug, I think it should be added to the "Known Issues" section of the 10.2.0 release notes: https://www.drupal.org/project/drupal/releases/10.2.0 →

In fact, users should be advised to not upgrade to 10.2.0 and wait for 10.2.1
Comment over 1 year ago →
xem8vfdh
following @larowlan's wise advice, I tested this commit (as a patch using cweagans/composer-patches) and my initial testing suggests that this patch does indeed fix my problem, after rolling back to 10.1.7 and rerunning the 10.2.0 upgrade. I will report back if I encounter any issues. Thank you all for fixing this!
Comment over 1 year ago →
🇧🇪Belgium wim leers Ghent 🇧🇪🇪🇺
Adding @xeM8VfDh's issue from #34 as a related issue.

Glad to read in #36 that this indeed fixed it 😊
Comment over 1 year ago →
🇬🇧United Kingdom longwave UK
I've added this issue to the known issues list at https://www.drupal.org/project/drupal/releases/10.2.0 →
Comment over 1 year ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.
