decode_entities() call can cause problems by reverting encoded HTML entities

Created on 23 April 2014
Updated 22 May 2024

We have several nodes with HTML code samples in them, using &lt; and &gt; for the tags so that the HTML tags are displayed as a code sample and not rendered. For example:

<pre class="code">
 &lt;div data-role="page"&gt;
        &lt;div data-role="header" data-position="fixed"&gt;
            &lt;h1&gt;Stairs game&lt;/h1&gt;
        &lt;/div&gt;

        &lt;div data-role="content"&gt;
           &lt;canvas id="c" &gt;&lt; /canvas&gt;
           &lt;audio id="soundEfx" src="gameover.mp3" style="display: none;"&gt;&lt;/audio &gt;
       &lt;audio id="game_id" src="Game.mp3" style="display: none;"&gt;&lt; /audio&gt;
           &lt;audio id="jump_id" src="jumping.mp3" style="display: none;"&gt;&lt;/audio &gt;
        &lt;/div&gt;

        &lt;div data-role="footer" data-position="fixed"&gt;
          &lt;/div&gt; 
</pre>

When the document is uploaded to Lingotek, and downloaded through the API, this code sample remains intact and unchanged; I added logging in various places in the Lingotek module to confirm this. The problem happens right when the value is saved to the database: in lingotek_process_entity_xml(), decode_entities() is called on the text before passing it to lingotek_unfilter_placeholders(). This converts &gt; and &lt; back to > and < and saves those to the DB, like this:

<pre class="code">
 <div data-role="page">
        <div data-role="header" data-position="fixed">
            <h1>Stairs game</h1>
        </div>

        <div data-role="content">
           <canvas id="c" >< /canvas>
           <audio id="soundEfx" src="gameover.mp3" style="display: none;"></audio >
       <audio id="game_id" src="Game.mp3" style="display: none;">< /audio>
           <audio id="jump_id" src="jumping.mp3" style="display: none;"></audio >
        </div>

        <div data-role="footer" data-position="fixed">
          </div> 
</pre>

This becomes a problem when viewing the node. When viewing the original node, those entities are displayed as < and > by the browser, so the HTML code sample is displayed as desired. But, when viewing the translation of the node, these HTML tags are rendered, which makes the node look pretty funky and defeats the purpose.
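
The effect described above can be reproduced outside Drupal. The following is only a minimal sketch in Python, not the Lingotek module's PHP code: `html.unescape` stands in for `decode_entities()`, and the two `save_*` helpers are hypothetical names invented for the illustration.

```python
import html

# Stored markup for a code sample: the tags inside <pre> are entity-encoded
# so that a browser shows them as text instead of rendering them.
original = '<pre class="code">&lt;h1&gt;Stairs game&lt;/h1&gt;</pre>'


def save_with_decode(text: str) -> str:
    """Hypothetical save path that decodes entities before storing."""
    return html.unescape(text)


def save_without_decode(text: str) -> str:
    """Hypothetical save path that stores the text exactly as received."""
    return text


decoded = save_with_decode(original)
untouched = save_without_decode(original)

# After decoding, the <h1> is real markup and a browser would render it.
print(decoded)
# Without decoding, the entities survive and the sample displays as code.
print(untouched)
```

Under these assumptions, the decoded value is what ends up in the translated node, which is why the translation renders the tags while the original node displays them as text.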

I've tracked that decode_entities() call back to the origin of lingotek.api.inc, in a Sept 2011 code restructure. I wonder if anyone even remembers at this point: does decode_entities() serve a purpose here? I removed it locally and didn't see any problems, but I haven't tested it thoroughly.

It's worth noting that we have pre tags set to ignore in the secondary configuration under Advanced Content Parsing; I'm not sure if this is relevant, but it sure seems like it could be.

  pre:
    ruleTypes: [EXCLUDE]
    idAttributes: [id]
πŸ› Bug report
Status

Needs work

Version

5.4

Component

Code

Created by

πŸ‡ΊπŸ‡ΈUnited States BrockBoland

Comments & Activities

  • I faced a similar scenario while downloading a translation I had uploaded from a Drupal site (10.2.x); the Lingotek version we have is 4.1.2. Here are the steps I followed:

    - Go to /admin/lingotek/manage/node

    - Select an item from the node list

    - Request language (e.g. Spanish) translation

    The original node's WYSIWYG text has characters like "<" and ">", among others, that don't need the encode function. The issue seems to be resolved by removing the html_entity_decode() calls from a couple of places. Here's the patch I used to fix the problem we experienced. Let me know your thoughts on it, and whether we need to create a new issue for it.
