- Issue created by @luke.leber
- Merge request !5919Toss an iframe in the initial configuration to throw monkey wrenches around. β (Closed) created by luke.leber
- π¬π§United Kingdom longwave UK
Republishing after discussion with @Luke.Leber and confirmation that there is no security issue here but a critical data loss bug that can be handled in public.
- πΊπΈUnited States dslatkin
I discovered this issue on our own sites, too. I shared a bit of this on Slack, it seems to be caused in the
core/modules/filter/src/Plugin/Filter/FilterHtml.php
file'sgetHTMLRestrictions
implementation. The$dom = Html::load($html);
line uses the new HTML5 parser to convert the "allowed HTML tags" list into a deeply nested DOM data structure since the list has no closing tags. It then walks over the data structure and extracts tags and attribute names. Since iframes are not permitted to have any child content, the new parser drops any elements that would have been inside the iframe.I found a couple short-term workarounds that worked:
- Move the
<iframe>
tag to the end of the list, so it doesn't have any child content when it gets parsed byHtml::load
and avoid adding new tags that come after the iframe until the issue is fixed. - Close the
<iframe>
tag in the list so it becomes something like<iframe></iframe>
. Drupal doesn't seem to encourage closing tags here, so using this workaround might cause a future issue.
Also, I'm not entirely sure, but I could imagine there being more elements than just the iframe that drop the child content. It's just that the iframe is one of the most common use cases with the HTML filter.
- Move the
- πΊπΈUnited States dslatkin
I discovered this issue on our own sites, too. I shared a bit of this on Slack, it seems to be caused in the
core/modules/filter/src/Plugin/Filter/FilterHtml.php
file'sgetHTMLRestrictions
implementation. The$dom = Html::load($html);
line uses the new HTML5 parser to convert the "allowed HTML tags" list into a deeply nested DOM data structure since the list has no closing tags. It then walks over the data structure and extracts tags and attribute names. Since iframes are not permitted to have any child content, the new parser drops any elements that would have been inside the iframe.I found a couple short-term workarounds that worked:
- Move the
<iframe>
tag to the end of the list, so it doesn't have any child content when it gets parsed byHtml::load
and avoid adding new tags that come after the iframe until the issue is fixed. - Close the
<iframe>
tag in the list so it becomes something like<iframe></iframe>
. Drupal doesn't seem to encourage closing tags here, so using this workaround might cause a future issue.
Also, I'm not entirely sure, but I could imagine there being more elements than just the iframe that drop the child content. It's just that the iframe is one of the most common use cases with the HTML filter.
- Move the
- πΊπΈUnited States luke.leber Pennsylvania
FYI -
<textarea>
has the same effect here as<iframe>
does.That's the only other HTML5 element that I've found to be problematic. Like my comment in the test, it might be worth taking a "kitchen sink" approach with test coverage, given this is largely governed by a third party library nowadays.
Cheers -- thanks for facilitating everything today.
- πΊπΈUnited States luke.leber Pennsylvania
Update issue title / summary.
- π¬π§United Kingdom longwave UK
Looking at
Masterminds\HTML5\Elements
there are bitmasks for tags that have certain features:// From section 8.1.2: "script", "style" // From 8.2.5.4.7 ("in body" insertion mode): "noembed" // From 8.4 "style", "xmp", "iframe", "noembed", "noframes" /** * Indicates the contained text should be processed as raw text. */ const TEXT_RAW = 2; // From section 8.1.2: "textarea", "title" /** * Indicates the contained text should be processed as RCDATA. */ const TEXT_RCDATA = 4; /** * Indicates that the text inside is plaintext (pre). */ const TEXT_PLAINTEXT = 32;
Guessing all these tags will therefore be affected.
- π§πͺBelgium wim leers Ghent π§πͺπͺπΊ
Thinking that instead of trying to process the allowed list as HTML, we should just use regex instead. Normally regex is insufficient for parsing HTML, but this isn't really HTML anyway, it's just a list of strings that look like HTML tags.
+1
Surprisingly related: the CKEditor 5 module's
\Drupal\ckeditor5\HTMLRestrictions::fromString()
reuses the parsing thatFilterHtml
does, precisely because it already was historically brittle:β¦ // Reuse the parsing logic from FilterHtml::getHTMLRestrictions(). $configuration = ['settings' => ['allowed_html' => $elements_string]]; $filter = new FilterHtml($configuration, 'filter_html', ['provider' => 'filter']); $allowed_elements = $filter->getHTMLRestrictions()['allowed']; β¦
This also means there's a HUGE amount of implicit test coverage, because
HtmlRestrictions
has >1500 LoC of test coverage since it's so crucial for the CKEditor 4 β 5 upgrade path (as well as providing detailed validation errors and guidance in the admin UI):\Drupal\Tests\ckeditor5\Unit\HTMLRestrictionsTest
.My point is: we can change the parsing logic and be very confident that if it passes tests, that it works fine π
- Status changed to Needs review
11 months ago 4:12pm 22 December 2023 - π¬π§United Kingdom longwave UK
Switched
FilterHtml::getHTMLRestrictions()
to use regexes instead of HTML DOM to parse the allowed tags and attributes.Not sure I have covered all possible combinations with the regexes, I might have missed some allowed characters in tags or attribute names, and if someone has done something really weird with attribute values (for example, using
>
) then I think the regexes might fail. But I think this is a good start. - Status changed to Needs work
11 months ago 4:31pm 22 December 2023 - π§πͺBelgium wim leers Ghent π§πͺπͺπΊ
As predicted: lots of test failures in
\Drupal\Tests\ckeditor5\Unit\HTMLRestrictionsTest
as well as things extensively relying on it such asSmartDefaultSettingsTest
π€ - Status changed to Needs review
11 months ago 4:45pm 22 December 2023 - π¬π§United Kingdom longwave UK
Opened an alternative approach in MR!5942 that modifies the way the HTML5 parser works, instead of trying to use regex. This feels simpler but relies a bit on the internals of the HTML5 library. Not sure which is better.
- π¬π§United Kingdom longwave UK
Also modified MR!5919 to be more lenient in the characters it accepts.
- Status changed to RTBC
11 months ago 5:31pm 22 December 2023 - π§πͺBelgium wim leers Ghent π§πͺπͺπΊ
That looks magnificent! π€©
(And very nice test coverage too π¦π)
- π¬π§United Kingdom longwave UK
In case it gets lost in the noise above, not sure which of the two approaches are best, but both are passing tests, so for another committer to decide I guess.
- π¬π§United Kingdom longwave UK
Personally I prefer the custom tokenizer one, as the HTML5 parser is more robust than regex and will handle edge cases that might exist in real world configurations that the regex might choke on in some way - while our test coverage is good it can't hope to cover every strange thing that someone might put in their filter config.
- πΊπΈUnited States adrianm6254
I tried both patches MR!5919 & MR!5942 and they both worked fine.
For now I will keep MR!5942 applied.
- π³πΏNew Zealand quietone
I'm triaging RTBC issues β . I read the IS and the comments.
I did update the proposed resolution to include that there are 2 MRs here with different approaches and related details.
Leaving at RTBC.
- π¦πΊAustralia larowlan π¦πΊπ.au GMT+10
larowlan β changed the visibility of the branch 11.x to hidden.
- π¦πΊAustralia larowlan π¦πΊπ.au GMT+10
As this is a critical, not going to hold things up - but I think we need a follow-up for explicit coverage for
\Drupal\filter\Plugin\Filter\FilterHtml::getHTMLRestrictions
which is a public API.I will file that.
Updating issue credits
-
larowlan β
committed cb6d0184 on 10.2.x
Issue #3410303 by longwave, Luke.Leber, Wim Leers, quietone, dslatkin:...
-
larowlan β
committed cb6d0184 on 10.2.x
-
larowlan β
committed 3ae37397 on 11.x
Issue #3410303 by longwave, Luke.Leber, Wim Leers, quietone, dslatkin:...
-
larowlan β
committed 3ae37397 on 11.x
- Status changed to Fixed
11 months ago 10:32pm 1 January 2024 - π¦πΊAustralia larowlan π¦πΊπ.au GMT+10
Committed to 11.x and backported to 10.2.x
Filing that follow-up now
- π¦πΊAustralia larowlan π¦πΊπ.au GMT+10
I've been told me separate report may be an instance of this same issue. I've detailed my experience/symptoms here: https://www.drupal.org/project/drupal/issues/3412164#comment-15382735 π upgrade from 10.1.7 to 10.2.0 removes HTML tags from many pages Postponed: needs info
The situation affecting my site as a result of upgrading from 10.1.7 to 10.2.0, which seems similar to this, is indeed critical. As outlined in my notes, the database appears uncorrupted by the upgrade, but the rendered output is corrupted, and once someone edits/saves a page, they persist the corruption to the database.
given the very bad nature of this bug, I think it should be added to the "Known Issues" section of the 10.2.0 release notes: https://www.drupal.org/project/drupal/releases/10.2.0 β
In fact, users should be advised to not upgrade to 10.2.0 and wait for 10.2.1
following @larowlan's wise advice, I tested this commit (as a patch using
cweagans/composer-patches
) and my initial testing suggests that this patch does indeed fix my problem, after rolling back to 10.1.7 and rerunning the 10.2.0 upgrade. I will report back if I encounter any issues. Thank you all for fixing this!- π§πͺBelgium wim leers Ghent π§πͺπͺπΊ
- π¬π§United Kingdom longwave UK
I've added this issue to the known issues list at https://www.drupal.org/project/drupal/releases/10.2.0 β
Automatically closed - issue fixed for 2 weeks with no activity.