When using "HTML filter" processor, turning on "Index title attribute" or "Index alt attribute" causes HTML to be indexed after all

Created on 27 July 2022, over 2 years ago
Updated 4 June 2023, over 1 year ago

Setup

  • Solr version: 8
  • Drupal Core version: 9.3.15
  • Search API version: 8.x-1.21
  • Search API Solr version: 4.26
  • Configured Solr Connector: Basic Auth

Problem/Motivation

I got a note from my 3rd-party Solr provider that, after months of using very little bandwidth, my little site had started exceeding its bandwidth limits. Digging into it with his help, I found that some editors had been pasting images into nodes' body fields; these had been getting indexed as HTML like
<p><img alt="" src="...A VERY LONG STRING ENSUES..." /></p>

I figured turning on the "HTML Filter" processor should make that stop happening, so that my site and the Solr provider stop passing that very long string back and forth with every index/search. However, I found that all HTML still got through to the index whenever I turned on either the "Index title attribute" or "Index alt attribute" option of that processor. Only by leaving both unchecked did I actually succeed in stripping HTML from the indexed node body (see the tm_x3b_en_body fields in the screengrabs from my Solr GUI).

FWIW, I think the issue is just with the regexes in Drupal\search_api\Plugin\search_api\processor\HtmlFilter::processFieldValue(), specifically here and here.

Steps to reproduce

Turn on the "HTML Filter" processor at /admin/config/search/search-api/index/YOUR INDEX NAME/processors, with "Index title attribute" or "Index alt attribute" checked.
Reindex.
Look at the contents of your Solr index. One way: open your Solr GUI client, choose "Query", and submit the default search. You will see HTML in your indexed nodes' body fields.
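
If you would rather check from a script than from the Solr GUI, something along these lines might work. This is only a rough Python sketch, not part of Search API: the base URL and core name are placeholders for your own setup, `select_url` and `docs_with_html` are hypothetical helper names, and the field name is the one from my index. It builds a standard `/select` query URL and flags documents whose field values still contain something that looks like an HTML tag.

```python
import re
from urllib.parse import urlencode

# Rough heuristic for "this value still contains markup": an opening or
# closing tag such as <p>, </p> or <img ... />.
TAG_RE = re.compile(r"<[a-zA-Z/][^>]*>")


def select_url(base_url, core, field, rows=10):
    """Build a Solr /select URL that returns `id` plus one field as JSON.

    base_url and core are placeholders for your own Solr instance.
    """
    params = {"q": "*:*", "fl": f"id,{field}", "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def docs_with_html(docs, field):
    """Return the ids of documents whose `field` still contains HTML tags.

    `docs` is the list found under response -> docs in Solr's JSON output.
    Field values may be a single string or a list of strings.
    """
    flagged = []
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]
        if any(TAG_RE.search(value) for value in values):
            flagged.append(doc.get("id"))
    return flagged
```

Fetch the URL from `select_url(...)` with whatever HTTP client and credentials your provider requires, then pass the `response.docs` list from the JSON to `docs_with_html(docs, "tm_x3b_en_body")`.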

Proposed resolution

Make the above-mentioned regexes more flexible, to accommodate all possible arrangements of HTML elements that might have alt or title attributes. Alternatively, if this functionality proves not worth this effort, remove these options from the processor.
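
To illustrate what "accommodate all possible arrangements" could mean, here is a minimal sketch of a parser-based alternative to regexes. It is written in Python with the standard library's `html.parser` purely for illustration; it is not the HtmlFilter code, and a real patch would have to be PHP inside the processor. The names `TextWithAttributes` and `strip_html` are made up for this example. The idea is that a parser hands over attributes regardless of their order or quoting, so the alt/title text can be kept while every tag (and the long `src` value) is dropped.

```python
from html.parser import HTMLParser


class TextWithAttributes(HTMLParser):
    """Collect text content plus the values of selected attributes."""

    def __init__(self, keep_attrs=("alt", "title")):
        super().__init__(convert_charrefs=True)
        self.keep_attrs = set(keep_attrs)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased,
        # and their order in the markup does not matter here.
        for name, value in attrs:
            if name in self.keep_attrs and value:
                self.parts.append(value)

    def handle_data(self, data):
        # Keep non-empty text nodes; the tags themselves are never stored.
        text = data.strip()
        if text:
            self.parts.append(text)


def strip_html(markup, keep_attrs=("alt", "title")):
    """Return the text of `markup` with tags removed.

    Values of attributes listed in keep_attrs are kept as plain text.
    """
    parser = TextWithAttributes(keep_attrs)
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)
```

With this approach, a pasted image like the one above, with an empty alt and a very long `src`, contributes nothing to the indexed text, while an image with a real alt or title contributes only that text. Passing `keep_attrs=()` corresponds to leaving both options unchecked.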

Remaining tasks

πŸ› Bug report
Status

Postponed: needs info

Version

1.21

Component

Plugins

Created by

🇺🇸 United States bdimaggio Boston, MA

Comments & Activities

  • 🇧🇪 Belgium økse

    I stumbled across the exact same problem when a Solr container kept dropping out, even after a memory increase.

    An image dropped into a body field, in combination with the "Index title attribute" and "Index alt attribute" options, caused the HTML not to be stripped.

    Turning off both the title and alt attribute options resolved the issue.

    This is only a workaround, though; the underlying problem should still be addressed.

  • Status changed to Postponed: needs info over 1 year ago
  • 🇦🇹 Austria drunken monkey Vienna, Austria

    Thanks a lot for reporting this problem, and sorry it took me so long to get back to you!
    This sounds a lot like 🐛 "PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor" (status: Fixed), which has just been fixed. Could you therefore please give the latest dev version of the module a try and see whether the problem still persists?