[PP-upstream] Serialize function strips accents

Created on 22 January 2024, 5 months ago
Updated 7 May 2024, about 2 months ago

Upgrading from 10.1 to 10.2(.2) changes the serialize function (Drupal/Component/Utility/Html.php).

With this new version, accented characters seem to be removed.
Ex.
<a href="https://www.mywebsite.com/services/identite">Identit&eacute;</a>

returns
<a href="https://www.mywebsite.com/services/identite">Identit</a>

🐛 Bug report
Status

Postponed

Version

11.0 🔥

Component
Base 

Last updated 40 minutes ago

Created by

🇨🇭Switzerland Mistrae

Comments & Activities

  • Issue created by @Mistrae
  • 🇨🇭Switzerland Mistrae

    I created a patch to revert the latest change, in case anyone needs it fixed before a better solution can be found.

  • Status changed to Postponed: needs info 5 months ago
  • 🇬🇧United Kingdom longwave UK

    &eacute; is normalised to é but should not be stripped:

    > \Drupal\Component\Utility\Html::normalize('<a href="https://www.mywebsite.com/services/identite">Identit&eacute;</a>');
    = "<a href="https://www.mywebsite.com/services/identite">Identité</a>"
    

    Can you provide an example similar to the above that fails?

  • 🇺🇸United States cilefen
  • 🇨🇭Switzerland Mistrae

    @longwave, serialize not normalize.

    Ex.

    DOMDocument with:

    <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE html>
        <html>
          <body>
            <a href="https://www.mywebsite.com/services/identite">Identit&eacute;</a>
          </body>
        </html>

    Run Html::serialize and get:
    <a href="https://www.mywebsite.com/services/identite">Identit</a>

  • 🇬🇧United Kingdom longwave UK

    Well, normalize() just calls load() then serialize(). Can you give a full code snippet that fails please?

  • 🇨🇭Switzerland Mistrae

    Here is the full code that can recreate the error:

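    // Load a fragment containing an accented character.
    $html_dom = \Drupal\Component\Utility\Html::load(\Drupal\Core\Render\Markup::create('Identité'));
    $body = $html_dom->getElementsByTagName('body');
    $node = $body->item(0);
    $child = $node->childNodes->item(0);
    $text = $child->textContent;
    // Entity-encode the text ("é" becomes "&eacute;").
    $text = htmlentities($text, ENT_QUOTES, 'UTF-8');
    // Wrap the encoded text in a new <a> element and swap it in.
    $element = $html_dom->createElement('a', $text);
    $node->replaceChild($element, $child);
    // The accented character is missing from the serialized output.
    \Drupal\Component\Utility\Html::serialize($html_dom);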
  • 🇬🇧United Kingdom longwave UK
    > $dom = new DOMDocument(); $dom->loadHTML('<?xml version="1.0" encoding="UTF-8" ?>Identité');
    = true

    > \Drupal\Component\Utility\Html::serialize($dom);
    = "Identité"

  • 🇨🇭Switzerland Mistrae

    If I input the text directly, yes, it works. Maybe htmlentities() doesn't work with the new function.

  • 🇬🇧United Kingdom longwave UK
    $text = htmlentities($text, ENT_QUOTES, 'UTF-8');
    

    This is the problem. If you remove this line, the issue goes away.

  • 🇬🇧United Kingdom longwave UK

    This might be an upstream bug in \Masterminds\HTML5\Serializer\Traverser::node().

    In this case we have injected an entity reference directly into the DOM: $node->nodeType is XML_ENTITY_REF_NODE, but the switch statement does not handle this case.
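
The diagnosis above can be checked without Drupal or html5-php. The following is a minimal sketch using only PHP's bundled DOM extension; it assumes a libxml2 build where createElement() parses an unescaped "&name;" in its value argument into an entity reference node, which is the behaviour described in this thread. It lists the child nodes that end up inside the new element, so you can see whether the accent lives in a separate node that a serializer would have to handle explicitly.

```php
<?php
// Sketch only: plain ext-dom, no Drupal or html5-php involved.
$doc = new DOMDocument('1.0', 'UTF-8');

// htmlentities() turns "é" into the named entity "&eacute;".
$encoded = htmlentities('Identité', ENT_QUOTES, 'UTF-8');

// createElement() does not escape its value argument, so "&eacute;" is
// treated as an entity reference rather than kept as literal text.
$a = $doc->createElement('a', $encoded);
$doc->appendChild($a);

// Collect [nodeType, nodeName] for every child of the new <a> element.
$children = [];
foreach ($a->childNodes as $child) {
    $children[] = [$child->nodeType, $child->nodeName];
}

// Report which children are entity reference nodes.
foreach ($children as [$type, $name]) {
    $kind = $type === XML_ENTITY_REF_NODE ? 'entity reference' : 'other';
    echo $name, ' (', $kind, ')', PHP_EOL;
}
```

If an entity reference node shows up here, any serializer that only handles element, text, and similar node types will silently skip it, which would match the truncated "Identit" output reported above.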

  • 🇨🇭Switzerland Mistrae

    OK, thanks. Just to be clear, does that mean that since 10.2 we cannot use htmlentities with serialize and this will be considered "won't fix", or should something be done here?

  • Status changed to Postponed 5 months ago
  • 🇬🇧United Kingdom longwave UK

    Thanks for reporting! I have reported this upstream at https://github.com/Masterminds/html5-php/issues/244 with a slightly modified example; let's wait to see what the maintainer there has to say. If they decline to fix it, we can still override in Drupal and serialize entity references correctly.

  • 🇮🇳India gaurav.kapoor

    On one of our websites, we use Smart Trim and wrap the generated summary in a link (to the respective node). Special characters such as the German umlauts and eszett (ä, ö, ü and ß) then do not show up in the generated trimmed text. The patch from #3 resolved the issue.

  • 🇧🇪Belgium weseze

    The patch from #3 can cause the contextual links placeholder to be rendered incorrectly, so that it replaces portions of your content instead of just the contextual placeholder div element.
    You should not use this patch.

    Instead, modules should fix their implementations and not use HTML entity encoding/decoding.
    I just encountered this issue with the linked_field module. See 🐛 Special characters are stripped (Needs review).
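
As an illustration of what "not using HTML entity encoding" could look like in a module, here is a minimal sketch that builds the link with createTextNode() instead of passing an htmlentities()-encoded string to createElement(). The helper name linkText() is hypothetical and not part of Drupal or any module mentioned in this thread; the sketch uses only PHP's bundled DOM extension.

```php
<?php
// Sketch only: plain ext-dom; linkText() is a hypothetical helper name.
function linkText(DOMDocument $doc, string $href, string $text): DOMElement
{
    $a = $doc->createElement('a');
    $a->setAttribute('href', $href);
    // createTextNode() stores the string as literal text, so escaping is
    // left to the serializer and no htmlentities() call is needed first.
    $a->appendChild($doc->createTextNode($text));
    return $a;
}

$doc = new DOMDocument('1.0', 'UTF-8');
$link = linkText($doc, 'https://www.mywebsite.com/services/identite', 'Identité');
$doc->appendChild($link);
```

Because the accented character stays inside an ordinary text node, the element contains no entity reference node for a serializer to skip.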
