- Issue created by @cosmicdreams
It appears that TrimWhitespace's use of the "u" regular expression modifier can sometimes produce NULL values for content.
TrimWhitespace makes four preg_replace() calls that use the "u" modifier to process incoming text.
protected function processFieldValue(&$value, $type) {
  if (!$this->getDataTypeHelper()->isTextType($type, ['text', 'string'])) {
    return $value;
  }
  $preserve = $value;
  $value = str_replace(" ", '', $value);
  // Remove multiple spaces.
  $value = preg_replace('/( {2,})+/imu', ' ', $value);
  // Remove spaces before punctuation.
  $value = preg_replace('/\s+([!?.,])/imu', "$1", $value);
  // Remove any space at the start of a string.
  $value = preg_replace('/^\s+/imu', '', $value);
  // Remove any non-printable characters.
  $value = preg_replace('/[[:^print:]]/imu', '', $value);
  $value = trim($value);
}
When $value is a string that isn't valid UTF-8, preg_replace() with the "u" modifier returns NULL instead of the processed string.
I'm not sure exactly how to rig up a test for this, but whenever you process content that isn't UTF-8 encoded, the TrimWhitespace filter turns all provided values into NULL.
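One possible way to reproduce the behaviour outside the module is sketched below. It only exercises the first of the four patterns, and the Latin-1 sample string is an assumption chosen for illustration, not content taken from a real index.

```php
<?php
// "\xE9" is "é" in ISO-8859-1. Followed by a space it is not a valid
// UTF-8 sequence, so this string stands in for non-UTF-8 field content.
$latin1 = "caf\xE9  au lait";

// With the "u" modifier PCRE validates the subject as UTF-8 before matching.
$result = preg_replace('/( {2,})+/imu', ' ', $latin1);
$error = preg_last_error();

var_dump($result);                         // NULL
var_dump($error === PREG_BAD_UTF8_ERROR);  // bool(true)

// Without the "u" modifier the same subject is processed byte by byte.
$bytes = preg_replace('/( {2,})+/im', ' ', $latin1);
var_dump($bytes === "caf\xE9 au lait");    // bool(true)
```

Checking preg_last_error() right after the call is what distinguishes "bad UTF-8" from other PCRE failures such as hitting the backtrack limit.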
While I don't understand how I am delivering non-UTF-8 text to indexing, I don't think I've done anything particularly strange to get here.
I wonder if your module should check the incoming encoding of the string and only use the "u" modifier when the string is valid UTF-8.
https://www.php.net/manual/en/function.mb-detect-encoding.php
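A rough sketch of what such a guard could look like, offered as a starting point for discussion rather than a patch. The function name trim_whitespace_guarded() is hypothetical and not part of the module. It uses mb_check_encoding() (which validates the bytes) instead of mb_detect_encoding() (which guesses among candidate encodings), and it assumes the mbstring extension is available. It covers only the four preg_replace() calls.

```php
<?php
/**
 * Hypothetical helper, not part of the module: applies the whitespace
 * replacements and adds the "u" modifier only for valid UTF-8 input.
 */
function trim_whitespace_guarded(string $value): string {
  // Validate the bytes rather than guessing an encoding.
  $is_utf8 = mb_check_encoding($value, 'UTF-8');
  $flags = $is_utf8 ? 'imu' : 'im';

  // preg_replace() returns NULL on failure; "??" keeps the previous value.
  // Remove multiple spaces.
  $value = preg_replace('/( {2,})+/' . $flags, ' ', $value) ?? $value;
  // Remove spaces before punctuation.
  $value = preg_replace('/\s+([!?.,])/' . $flags, '$1', $value) ?? $value;
  // Remove any space at the start of a string.
  $value = preg_replace('/^\s+/' . $flags, '', $value) ?? $value;

  // Assumption: in byte mode [[:^print:]] would also match bytes above
  // 0x7F and strip legacy-encoded letters, so this step runs only for UTF-8.
  if ($is_utf8) {
    $value = preg_replace('/[[:^print:]]/u', '', $value) ?? $value;
  }

  return trim($value);
}

// Usage: valid UTF-8 and Latin-1 input both come back as strings.
var_dump(trim_whitespace_guarded("  Hello  world !"));
var_dump(trim_whitespace_guarded("caf\xE9  au lait ."));
```

Whether non-UTF-8 input should be passed through in byte mode like this, converted to UTF-8 first, or left unprocessed is exactly the kind of thing that needs consensus.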
Get consensus about this fix.
Active
1.0
Code