Refactor Text Processing to Use mb_encode_numericentity for HTML Encoding Compatibility

Created on 16 July 2024, 5 months ago
Updated 31 July 2024, 5 months ago

Problem/Motivation

We currently pass text through htmlspecialchars_decode(iconv()) before loading it into the DOM. In certain edge cases, however, this call:

$dom->loadHTML(htmlspecialchars_decode(iconv('UTF-8', 'ISO-8859-1', htmlentities($text, ENT_COMPAT, 'UTF-8')), ENT_QUOTES));

can result in data loss when there is an encoding mismatch, for example when $text contains characters that iconv cannot convert to ISO-8859-1.

Proposed resolution

For PHP 8.2, the accepted solution for this appears to be mb_encode_numericentity():

$dom->loadHTML(mb_encode_numericentity($text, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));

Let's switch it to use that instead.
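
A minimal sketch of how the replacement call could be wrapped, assuming the mbstring and dom extensions are enabled. The helper name load_html_utf8() and the sample string are hypothetical and are not taken from the module; the libxml error handling is likewise an assumption added for illustration.

```php
<?php
// Hypothetical helper (not part of the module): loads a UTF-8 HTML fragment
// into a DOMDocument without routing the text through iconv().
function load_html_utf8(string $text): DOMDocument {
  // Convert every code point from U+0080 to U+10FFFF into a decimal numeric
  // entity. ASCII, including the markup itself, passes through unchanged.
  // Conversion map format: [start, end, offset, mask].
  $encoded = mb_encode_numericentity($text, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');

  $dom = new DOMDocument();
  // Assumption: collect libxml parse warnings internally instead of emitting
  // them, then restore the previous setting.
  $previous = libxml_use_internal_errors(TRUE);
  $dom->loadHTML($encoded);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return $dom;
}

// Usage: the sample mixes a Latin-1 character, CJK text, and the euro sign.
$dom = load_html_utf8('<p>Café 日本語 €</p>');
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

Because the input handed to loadHTML() is plain ASCII after encoding, the parser does not have to guess the document's character set, which is what makes the iconv() round trip unnecessary.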

🐛 Bug report
Status

Fixed

Version

1.0

Component

Code

Created by

🇺🇸 United States j-barnes

