Consider checking incoming string for UTF-8

Created on 10 April 2024, 8 months ago

Problem/Motivation

It appears that TrimWhitespace's use of the "u" regular expression modifier can sometimes produce null values for content.

TrimWhitespace uses four preg_replace() calls with the "u" modifier to parse incoming text.

  protected function processFieldValue(&$value, $type) {
    if (!$this->getDataTypeHelper()->isTextType($type, ['text', 'string'])) {
      return $value;
    }

    $preserve = $value;

    $value = str_replace(" ", '', $value);

    // Remove multiple spaces.
    $value = preg_replace('/( {2,})+/imu', ' ', $value);

    // Remove spaces before punctuation.
    $value = preg_replace('/\s+([!?.,])/imu', "$1", $value);

    // Remove any space at the start of a string.
    $value = preg_replace('/^\s+/imu', '', $value);

    // Remove any non-printable characters.
    $value = preg_replace('/[[:^print:]]/imu', '', $value);

    $value = trim($value);
  }

When $value is a string that isn't valid UTF-8, preg_replace() with the "u" modifier returns null, so $value ends up null.

Steps to reproduce

I'm not sure exactly how to rig up a test for this, but whenever you process content that isn't UTF-8 encoded, the TrimWhitespace filter turns every provided value into null.
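One possible way to reproduce the behavior outside the module is sketched below. It assumes a Latin-1 encoded input; the sample string is made up for illustration, and the pattern is the "Remove multiple spaces" one from the processor above.

```php
<?php
// Hypothetical input: "café  au lait" encoded as ISO-8859-1. The bare 0xE9
// byte followed by a space is not a valid UTF-8 sequence.
$value = "caf\xE9  au lait";

// Same pattern the processor uses to collapse runs of spaces.
$result = preg_replace('/( {2,})+/imu', ' ', $value);

// With the "u" modifier, an invalid UTF-8 subject makes preg_replace()
// fail and return null instead of a string.
var_dump($result);
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

The first var_dump() should print NULL and the second bool(true), which would confirm that the null comes from PCRE rejecting the subject rather than from the replacement logic itself.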

Proposed resolution

While I don't understand how I am delivering non-UTF-8 text to indexing, I don't think I've done anything particularly strange to get here.

I wonder if the module should check the encoding of the incoming string and only use the "u" modifier when the string is valid UTF-8.
https://www.php.net/manual/en/function.mb-detect-encoding.php
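A minimal sketch of what such a check could look like follows. The function name trim_whitespace_guarded() is hypothetical and not part of the module, and the sketch assumes the mbstring extension is available. It uses mb_check_encoding() rather than mb_detect_encoding(), since the former validates the bytes while the latter only guesses an encoding. The non-printable-character step is left out here because, without the "u" modifier, it would operate on individual bytes rather than characters.

```php
<?php
/**
 * Hypothetical helper (not the module's code): applies the whitespace
 * patterns with the "u" modifier only when the input is valid UTF-8.
 */
function trim_whitespace_guarded(string $value): string {
  // Fall back to byte-oriented matching for input that is not valid UTF-8.
  $modifiers = mb_check_encoding($value, 'UTF-8') ? 'imu' : 'im';

  $patterns = [
    // Collapse runs of spaces.
    '/ {2,}/' . $modifiers => ' ',
    // Remove whitespace before punctuation.
    '/\s+([!?.,])/' . $modifiers => '$1',
    // Remove whitespace at the start of a line.
    '/^\s+/' . $modifiers => '',
  ];

  foreach ($patterns as $pattern => $replacement) {
    $result = preg_replace($pattern, $replacement, $value);
    // preg_replace() returns null on error; keep the previous value then.
    if ($result !== null) {
      $value = $result;
    }
  }

  return trim($value);
}
```

For example, trim_whitespace_guarded("caf\xE9  au lait ,ok") would return "caf\xE9 au lait,ok" with the Latin-1 byte intact instead of null. Whether the module should instead convert such input to UTF-8 before indexing is a separate design question.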

Remaining tasks

Get consensus about this fix.

πŸ› Bug report
Status

Active

Version

1.0

Component

Code

Created by

🇺🇸 United States cosmicdreams Minneapolis/St. Paul
