FilterHtml data loss when iframe and/or textarea is allowed

Created on 21 December 2023, 11 months ago
Updated 18 January 2024, 10 months ago

Problem/Motivation

Since πŸ› Upgrade filter system to HTML5 Fixed landed, it seems that any text format that allows <iframe> or <textarea> within the html filter can lead to data loss.

Steps to reproduce

Configure a text format to allow:
<em> <strong> <cite> <blockquote cite> <code> <li> <dl> <dt> <dd> <h4 id class> <h5 id class> <br> <img src alt width height class> <table class> <caption class> <tbody> <thead> <tfoot> <th> <td> <tr> <iframe title src> <div class> <p class> <h2 id class> <h3 id class> <span id class> <a rel accesskey target title name id href hreflang class data-cta-track-sync data-cta-track-async data-cta-description data-cta-placement> <ul type class> <ol start type class> <hr>

Add some content with an <a> tag.

Notice that the <a> will not be rendered on the front-end, and is not persisted into the database.

I'm filing this as critical because existing content being re-saved can lead to data loss.

Proposed resolution

There are 2 MR with different approaches.
MR!5919
MR!5942 uses a custom tokenizer. This is the preferred solution of longwave, #19

@Wim Leers set this to RTBC but did not specify which MR was preferred. #16

Remaining tasks

TBD

User interface changes

TBD

API changes

TBD

Data model changes

TBD

Release notes snippet

TBD

πŸ› Bug report
Status

Fixed

Version

10.2 ✨

Component
FilterΒ  β†’

Last updated about 13 hours ago

No maintainer
Created by

πŸ‡ΊπŸ‡ΈUnited States luke.leber Pennsylvania

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • Issue created by @luke.leber
  • πŸ‡¬πŸ‡§United Kingdom longwave UK

    Republishing after discussion with @Luke.Leber and confirmation that there is no security issue here but a critical data loss bug that can be handled in public.

  • πŸ‡ΊπŸ‡ΈUnited States dslatkin

    I discovered this issue on our own sites, too. I shared a bit of this on Slack, it seems to be caused in the core/modules/filter/src/Plugin/Filter/FilterHtml.php file's getHTMLRestrictions implementation. The $dom = Html::load($html); line uses the new HTML5 parser to convert the "allowed HTML tags" list into a deeply nested DOM data structure since the list has no closing tags. It then walks over the data structure and extracts tags and attribute names. Since iframes are not permitted to have any child content, the new parser drops any elements that would have been inside the iframe.

    I found a couple short-term workarounds that worked:

    1. Move the <iframe> tag to the end of the list, so it doesn't have any child content when it gets parsed by Html::load and avoid adding new tags that come after the iframe until the issue is fixed.
    2. Close the <iframe> tag in the list so it becomes something like <iframe></iframe>. Drupal doesn't seem to encourage closing tags here, so using this workaround might cause a future issue.

    Also, I'm not entirely sure, but I could imagine there being more elements than just the iframe that drop the child content. It's just that the iframe is one of the most common use cases with the HTML filter.

  • πŸ‡ΊπŸ‡ΈUnited States dslatkin

    I discovered this issue on our own sites, too. I shared a bit of this on Slack, it seems to be caused in the core/modules/filter/src/Plugin/Filter/FilterHtml.php file's getHTMLRestrictions implementation. The $dom = Html::load($html); line uses the new HTML5 parser to convert the "allowed HTML tags" list into a deeply nested DOM data structure since the list has no closing tags. It then walks over the data structure and extracts tags and attribute names. Since iframes are not permitted to have any child content, the new parser drops any elements that would have been inside the iframe.

    I found a couple short-term workarounds that worked:

    1. Move the <iframe> tag to the end of the list, so it doesn't have any child content when it gets parsed by Html::load and avoid adding new tags that come after the iframe until the issue is fixed.
    2. Close the <iframe> tag in the list so it becomes something like <iframe></iframe>. Drupal doesn't seem to encourage closing tags here, so using this workaround might cause a future issue.

    Also, I'm not entirely sure, but I could imagine there being more elements than just the iframe that drop the child content. It's just that the iframe is one of the most common use cases with the HTML filter.

  • πŸ‡ΊπŸ‡ΈUnited States luke.leber Pennsylvania

    FYI - <textarea> has the same effect here as <iframe> does.

    That's the only other HTML5 element that I've found to be problematic. Like my comment in the test, it might be worth taking a "kitchen sink" approach with test coverage, given this is largely governed by a third party library nowadays.

    Cheers -- thanks for facilitating everything today.

  • πŸ‡ΊπŸ‡ΈUnited States luke.leber Pennsylvania

    Update issue title / summary.

  • πŸ‡¬πŸ‡§United Kingdom longwave UK

    Looking at Masterminds\HTML5\Elements there are bitmasks for tags that have certain features:

        // From section 8.1.2: "script", "style"
        // From 8.2.5.4.7 ("in body" insertion mode): "noembed"
        // From 8.4 "style", "xmp", "iframe", "noembed", "noframes"
        /**
         * Indicates the contained text should be processed as raw text.
         */
        const TEXT_RAW = 2;
    
        // From section 8.1.2: "textarea", "title"
        /**
         * Indicates the contained text should be processed as RCDATA.
         */
        const TEXT_RCDATA = 4;
    
        /**
         * Indicates that the text inside is plaintext (pre).
         */
        const TEXT_PLAINTEXT = 32;
    

    Guessing all these tags will therefore be affected.

  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    Thinking that instead of trying to process the allowed list as HTML, we should just use regex instead. Normally regex is insufficient for parsing HTML, but this isn't really HTML anyway, it's just a list of strings that look like HTML tags.

    +1

    Surprisingly related: the CKEditor 5 module's \Drupal\ckeditor5\HTMLRestrictions::fromString() reuses the parsing that FilterHtml does, precisely because it already was historically brittle:

    …
    
        // Reuse the parsing logic from FilterHtml::getHTMLRestrictions().
        $configuration = ['settings' => ['allowed_html' => $elements_string]];
        $filter = new FilterHtml($configuration, 'filter_html', ['provider' => 'filter']);
        $allowed_elements = $filter->getHTMLRestrictions()['allowed'];
    
    …
    

    This also means there's a HUGE amount of implicit test coverage, because HtmlRestrictions has >1500 LoC of test coverage since it's so crucial for the CKEditor 4 β†’ 5 upgrade path (as well as providing detailed validation errors and guidance in the admin UI): \Drupal\Tests\ckeditor5\Unit\HTMLRestrictionsTest.

    My point is: we can change the parsing logic and be very confident that if it passes tests, that it works fine πŸ˜„

  • Status changed to Needs review 11 months ago
  • πŸ‡¬πŸ‡§United Kingdom longwave UK

    Switched FilterHtml::getHTMLRestrictions() to use regexes instead of HTML DOM to parse the allowed tags and attributes.

    Not sure I have covered all possible combinations with the regexes, I might have missed some allowed characters in tags or attribute names, and if someone has done something really weird with attribute values (for example, using >) then I think the regexes might fail. But I think this is a good start.

  • Status changed to Needs work 11 months ago
  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    As predicted: lots of test failures in \Drupal\Tests\ckeditor5\Unit\HTMLRestrictionsTest as well as things extensively relying on it such as SmartDefaultSettingsTest πŸ€“

  • Merge request !5942Resolve #3410303 "Custom tokenizer" β†’ (Closed) created by longwave
  • Status changed to Needs review 11 months ago
  • πŸ‡¬πŸ‡§United Kingdom longwave UK

    Opened an alternative approach in MR!5942 that modifies the way the HTML5 parser works, instead of trying to use regex. This feels simpler but relies a bit on the internals of the HTML5 library. Not sure which is better.

  • πŸ‡¬πŸ‡§United Kingdom longwave UK

    Also modified MR!5919 to be more lenient in the characters it accepts.

  • Status changed to RTBC 11 months ago
  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    That looks magnificent! 🀩

    (And very nice test coverage too πŸ¦™πŸ˜œ)

  • πŸ‡¬πŸ‡§United Kingdom longwave UK

    In case it gets lost in the noise above, not sure which of the two approaches are best, but both are passing tests, so for another committer to decide I guess.

  • πŸ‡ΊπŸ‡ΈUnited States alfattal Minnesota

    Wim Leers β†’ Which one?!

  • πŸ‡¬πŸ‡§United Kingdom longwave UK

    Personally I prefer the custom tokenizer one, as the HTML5 parser is more robust than regex and will handle edge cases that might exist in real world configurations that the regex might choke on in some way - while our test coverage is good it can't hope to cover every strange thing that someone might put in their filter config.

  • πŸ‡ΊπŸ‡ΈUnited States adrianm6254

    I tried both patches MR!5919 & MR!5942 and they both worked fine.

    For now I will keep MR!5942 applied.

  • πŸ‡ΊπŸ‡ΈUnited States alfattal Minnesota

    MR!5942 fixed the issue for me. Great work, thank you all.

  • πŸ‡³πŸ‡ΏNew Zealand quietone

    I'm triaging RTBC issues β†’ . I read the IS and the comments.

    I did update the proposed resolution to include that there are 2 MRs here with different approaches and related details.

    Leaving at RTBC.

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    larowlan β†’ changed the visibility of the branch 11.x to hidden.

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    I agree with #19 that the custom tokenizer looks like the better approach

    Looking a bit further into the masterminds/html5 internals to understand it a bit better

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    As this is a critical, not going to hold things up - but I think we need a follow-up for explicit coverage for \Drupal\filter\Plugin\Filter\FilterHtml::getHTMLRestrictions which is a public API.

    I will file that.

    Updating issue credits

    • larowlan β†’ committed cb6d0184 on 10.2.x
      Issue #3410303 by longwave, Luke.Leber, Wim Leers, quietone, dslatkin:...
    • larowlan β†’ committed 3ae37397 on 11.x
      Issue #3410303 by longwave, Luke.Leber, Wim Leers, quietone, dslatkin:...
  • Status changed to Fixed 11 months ago
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Committed to 11.x and backported to 10.2.x

    Filing that follow-up now

  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    Sorry for not having been more explicit in #16 β€” I was definitely referring to the new class($scanner, $events) extends Tokenizer one, which was also committed πŸ‘

  • I've been told me separate report may be an instance of this same issue. I've detailed my experience/symptoms here: https://www.drupal.org/project/drupal/issues/3412164#comment-15382735 πŸ› upgrade from 10.1.7 to 10.2.0 removes HTML tags from many pages Postponed: needs info

    The situation affecting my site as a result of upgrading from 10.1.7 to 10.2.0, which seems similar to this, is indeed critical. As outlined in my notes, the database appears uncorrupted by the upgrade, but the rendered output is corrupted, and once someone edits/saves a page, they persist the corruption to the database.

  • given the very bad nature of this bug, I think it should be added to the "Known Issues" section of the 10.2.0 release notes: https://www.drupal.org/project/drupal/releases/10.2.0 β†’

    In fact, users should be advised to not upgrade to 10.2.0 and wait for 10.2.1

  • following @larowlan's wise advice, I tested this commit (as a patch using cweagans/composer-patches) and my initial testing suggests that this patch does indeed fix my problem, after rolling back to 10.1.7 and rerunning the 10.2.0 upgrade. I will report back if I encounter any issues. Thank you all for fixing this!

  • πŸ‡§πŸ‡ͺBelgium wim leers Ghent πŸ‡§πŸ‡ͺπŸ‡ͺπŸ‡Ί

    Adding @xeM8VfDh's issue from #34 as a related issue.

    Glad to read in #36 that this indeed fixed it 😊

  • πŸ‡¬πŸ‡§United Kingdom longwave UK

    I've added this issue to the known issues list at https://www.drupal.org/project/drupal/releases/10.2.0 β†’

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024