Setup
- Solr version: 8
- Drupal Core version: 9.3.15
- Search API version: 8.x-1.21
- Search API Solr version: 4.26
- Configured Solr Connector: Basic Auth
Problem/Motivation
I got a note from my 3rd-party Solr provider that, after months of using very little bandwidth, my little site had started exceeding its bandwidth limits. Digging into it with his help, I found that some editors had been pasting images into nodes' body fields; these had been getting indexed as HTML like
<p><img alt="" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAADRYAAAXACAYAAACZf/GpAAAMamlDQ1BJQ0MgU...A VERY LONG STRING ENSUES..." /></p>
I figured turning on the "HTML Filter" processor should make that stop happening, so that my site and the Solr provider stop passing that very long string back and forth with every index/search. However, I found that all HTML still got through to the index whenever I turned on either the "Index title attribute" or the "Index alt attribute" option of that processor; only by leaving both unchecked did I actually succeed in stripping HTML from the indexed node body. See below, in the tm_x3b_en_body fields in the screengrabs from my Solr GUI.
FWIW, I think the issue is just with the regexes in Drupal\search_api\Plugin\search_api\processor\HtmlFilter::processFieldValue(), specifically here and here.
Steps to reproduce
Turn on the "HTML Filter" processor at /admin/config/search/search-api/index/YOUR INDEX NAME/processors, and check "Index title attribute" and/or "Index alt attribute" in its settings.
Reindex.
In some way, look at the contents of your Solr index. One way: open your Solr GUI client and choose "Query", and just submit the default search. You will see HTML in your indexed nodes' body fields.
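If you would rather check from a script than from the Solr GUI, the last step can be sketched roughly as below. This is only an illustration: the base URL and core name are placeholders, the field name is the one from my index, and the helper names are made up for this example. It builds a standard Solr select URL and flags documents whose field still contains something that looks like an HTML tag.

```python
import re
from urllib.parse import urlencode

# Rough heuristic for "this value still contains markup"; not a full HTML check.
TAG_PATTERN = re.compile(r"<[a-zA-Z/][^>]*>")


def build_select_url(solr_base_url, core, field, rows=5):
    """Build a match-all select URL that returns only the id and the given field.

    solr_base_url and core are placeholders for your own Solr endpoint.
    """
    params = {"q": "*:*", "fl": f"id,{field}", "rows": rows}
    return f"{solr_base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def docs_with_markup(docs, field):
    """Return the ids of documents whose field still looks like it contains HTML.

    docs is the list found under response -> docs in Solr's JSON response.
    """
    flagged = []
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]  # single-valued fields come back as a plain string
        if any(TAG_PATTERN.search(value) for value in values):
            flagged.append(doc.get("id"))
    return flagged
```

Fetch the URL with whatever HTTP client you like (mine needs Basic Auth), then pass the decoded `response.docs` list to `docs_with_markup()`; a non-empty result means HTML is still reaching the index.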
Proposed resolution
Make the above-mentioned regexes more flexible, to accommodate all possible arrangements of HTML elements that might have alt or title attributes. Alternatively, if this functionality proves not worth this effort, remove these options from the processor.
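To illustrate what I mean by "more flexible", here is a minimal sketch of one alternative: instead of regexes, walk the markup with a parser, keep the text and any non-empty alt/title values, and drop everything else (including the base64 src). It is written in Python with the standard library only to show the idea; it is not the Search API code, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AltTitleTextExtractor(HTMLParser):
    """Collects text content plus selected attribute values; discards all markup."""

    def __init__(self, index_alt=True, index_title=True):
        super().__init__(convert_charrefs=True)
        # Attribute names whose values should be kept as indexable text.
        self.wanted = set()
        if index_alt:
            self.wanted.add("alt")
        if index_title:
            self.wanted.add("title")
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        for name, value in attrs:
            if name in self.wanted and value and value.strip():
                self.parts.append(value.strip())

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def strip_html_keep_attributes(html, index_alt=True, index_title=True):
    """Return plain text for indexing, optionally including alt/title values."""
    parser = AltTitleTextExtractor(index_alt, index_title)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

With this approach the position of alt or title within the tag, the quoting style, and the presence of other attributes no longer matter, because attribute parsing is left to the parser rather than to a pattern.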
Remaining tasks