UTF-8 character handling in meta tag tidy()

Created on 12 June 2025, about 1 month ago

Problem/Motivation

JSON API endpoints return HTTP 500 errors with "Malformed UTF-8 characters, possibly incorrectly encoded" when serving content that contains French accented characters (à, é, è, ç, etc.). The error occurs specifically in the Symfony JsonEncoder during response serialization.

Root Cause: PHP's PCRE functions (preg_replace, preg_match, etc.) are not UTF-8 aware by default. Without the u (unicode) modifier, they treat every byte of the subject as a separate character rather than as part of a multibyte UTF-8 sequence, so a pattern can match and replace a single byte in the middle of a character and leave the string with invalid encoding.
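
The byte-versus-character distinction can be seen with a small sketch. The helper count_regex_chars() is hypothetical and not part of the module; it only counts how many times `.` matches the subject with and without the u modifier.

```php
<?php
// Hypothetical helper (not from the module): counts how many "characters"
// PCRE sees in $value, with or without the u modifier.
function count_regex_chars(string $value, bool $unicode): int {
  $pattern = $unicode ? '/./su' : '/./s';
  // preg_match_all() returns the number of full pattern matches.
  return (int) preg_match_all($pattern, $value);
}

$value = "\xC3\xA0"; // UTF-8 encoding of 'à'

// Byte mode: each of the two bytes is matched on its own.
$as_bytes = count_regex_chars($value, FALSE);
// UTF-8 mode: the two bytes are matched together as one code point.
$as_chars = count_regex_chars($value, TRUE);
```

Here $as_bytes is 2 and $as_chars is 1, which is why a byte-mode pattern is able to touch half of an accented character.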

Impact:

  • JSON API endpoints become inaccessible for content with accented characters
  • Affects multilingual sites, particularly French content
  • More prevalent on macOS development environments due to filesystem NFD/NFC normalization differences
  • Breaks frontend applications consuming JSON API data

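The UTF-8 sequence for 'à' (\xC3\xA0) gets corrupted when processed by preg_replace('/\s+/', ' ', $value) without the unicode modifier. In byte mode the trailing byte \xA0 can match \s (it is the non-breaking space in Latin-1, and whether it matches depends on the PCRE character tables in use), so it is replaced with a space while the lead byte \xC3 is left orphaned. The string is then invalid UTF-8, and json_encode() fails during JSON API response generation.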
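
A minimal sketch of the failure and of the kind of fix described here, assuming the whitespace cleanup is the only step involved. The function tidy_whitespace() is hypothetical and stands in for the module's tidy() logic, which is not shown in this report. Because the byte-mode behaviour depends on the environment, the corrupted string is built by hand rather than produced by preg_replace().

```php
<?php
// Hypothetical stand-in for the whitespace cleanup in tidy(); the real
// method is not shown in this report.
function tidy_whitespace(string $value): string {
  // The u modifier makes PCRE treat pattern and subject as UTF-8, so a
  // multibyte character is never split into separate bytes.
  $result = preg_replace('/\s+/u', ' ', $value);
  // preg_replace() returns NULL on error (for example, when the subject is
  // not valid UTF-8); fall back to the untouched input in that case.
  return $result ?? $value;
}

// Hand-built example of the corrupted result: the lead byte \xC3 is
// followed by a plain space instead of its continuation byte.
$corrupted = "Voil\xC3 ";
// preg_match() with the u modifier returns FALSE on invalid UTF-8.
$is_valid = preg_match('//u', $corrupted) === 1;
// json_encode() returns FALSE for the corrupted string.
$encoded = json_encode($corrupted);
$error = json_last_error_msg();

// With the u modifier the accented character survives the cleanup.
$clean = tidy_whitespace("Voil\xC3\xA0   le \n texte");
```

In this sketch $error holds the same "Malformed UTF-8 characters, possibly incorrectly encoded" message reported above, and $clean is "Voilà le texte". Returning the input unchanged when preg_replace() fails is an assumption made for the example; a real fix might log or sanitize instead.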

A patch will follow.

🐛 Bug report

Status: Active
Version: 2.1
Component: Code
Created by: jchatard (🇫🇷 France)


