Add option to skip highlighted term processing

Created on 18 August 2020, over 4 years ago
Updated 11 May 2022, almost 3 years ago

Problem/Motivation

Search API strips the highlighting returned by the server and then reapplies it itself, which can cause problems, specifically with the Solr backend.

Issues can arise:

  • with apostrophes or similar characters, which are treated as word breaks so that the pieces on either side become individual words: https://git.drupalcode.org/project/search_api/-/blob/8.x-1.x/src/Plugin/... . For example, the term "children's" is split into ["children", "s"].
  • with quoted search terms. For example, "children's toys" is split into ["children", "s", "toys"], so a quoted query that should highlight "childrens toys" as a phrase instead highlights "children", "s", and "toys" separately.
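
The splitting described in the list above can be illustrated with a minimal sketch. This is not Search API's actual code: the function name `split_into_keywords` and the "split on every non-alphanumeric character" rule are assumptions chosen only to reproduce the behaviour reported here.

```python
import re


def split_into_keywords(query: str) -> list[str]:
    """Split a query on every run of non-alphanumeric characters.

    Hypothetical rule for illustration: apostrophes and quotes are
    treated as word breaks, just like spaces.
    """
    return [word for word in re.split(r"[^0-9A-Za-z]+", query) if word]


print(split_into_keywords("children's"))           # ['children', 's']
print(split_into_keywords('"children\'s toys"'))   # ['children', 's', 'toys']
```

Under this rule the quotes around the phrase are lost as well, so nothing downstream can tell a quoted query from an unquoted one.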

I'm not sure that there is a good way for Search API to know if the query is quoted or how the search engine is handling something like apostrophes.

Steps to reproduce

1. Install Search API 8.x-1.x, Search API Solr 4.x, and Solr server 8.4
2. Create a Solr server with "Retrieve highlighted snippets" selected
3. Create an index that includes a fulltext field for body content and enable the Highlight processor with "Create Excerpt" selected and the following advanced options:

4. Create test content whose fulltext fields include apostrophes and the terms "toys" and "son's toys"
5. Create a search page in Views that uses the "Multiple words" parse mode for the fulltext fields
6. Search for terms that include apostrophes, e.g. "son's toys"
7. See results in which "s" is highlighted, as well as words such as "reason" that contain "son":

8. Note that the highlight returned from Solr looks something like this:

damaged in that move. Is reasonably supported. … Matter of: Andrews Road Homes, … (which also contained a music holder and mouthpiece) with his [HIGHLIGHT]sons[/HIGHLIGHT] [HIGHLIGHT]toys[/HIGHLIGHT] in item 48. Second, the member provided a standard form statement …
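
To show why the final output differs from what Solr returned, here is a rough sketch of a strip-and-reapply step. It is an illustration only: the helper names, the `<strong>` prefix and suffix, and the assumption that keywords are matched inside longer words are all hypothetical, not taken from Search API.

```python
import re

PREFIX, SUFFIX = "<strong>", "</strong>"  # assumed highlight prefix/suffix


def strip_server_highlights(snippet: str) -> str:
    """Drop the [HIGHLIGHT] markers that came back from the server."""
    return snippet.replace("[HIGHLIGHT]", "").replace("[/HIGHLIGHT]", "")


def rehighlight(text: str, keywords: list[str]) -> str:
    """Re-apply highlighting by keyword, matching inside longer words."""
    # Longest keywords first so "son" wins over "s" where both could match.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: f"{PREFIX}{m.group(0)}{SUFFIX}", text)


snippet = ("Is reasonably supported with his "
           "[HIGHLIGHT]sons[/HIGHLIGHT] [HIGHLIGHT]toys[/HIGHLIGHT] in item 48.")
plain = strip_server_highlights(snippet)
# "son's toys" split on the apostrophe yields these keywords.
print(rehighlight(plain, ["son", "s", "toys"]))
```

In this sketch the word "reasonably" ends up with "son" highlighted and every stray "s" is wrapped too, even though the server marked only "sons" and "toys".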

Proposed resolution

As noted above, this happens because the highlighted search terms are removed and re-added, and characters like apostrophes are treated as word breaks.

The solution in the patch adds a new setting for the Highlight processor, "Keep highlighted terms from server". When enabled, the highlighted terms returned from the server are not removed. Instead, they are kept as is, with the highlight prefix and suffix.
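
The effect of such a setting could look roughly like the following sketch. It is not the patch itself: the function `prepare_excerpt`, the flag name `keep_server_highlights`, and the `<strong>` defaults are hypothetical stand-ins for the "Keep highlighted terms from server" option and the configured prefix and suffix.

```python
def prepare_excerpt(snippet: str, keep_server_highlights: bool,
                    prefix: str = "<strong>", suffix: str = "</strong>") -> str:
    """Turn a server snippet into excerpt text.

    With the (hypothetical) flag enabled, the server's markers are swapped
    for the configured prefix/suffix and nothing else is touched. With it
    disabled, the markers are stripped, leaving plain text that a later
    keyword-based pass would highlight again.
    """
    if keep_server_highlights:
        return snippet.replace("[HIGHLIGHT]", prefix).replace("[/HIGHLIGHT]", suffix)
    return snippet.replace("[HIGHLIGHT]", "").replace("[/HIGHLIGHT]", "")


snippet = "with his [HIGHLIGHT]sons[/HIGHLIGHT] [HIGHLIGHT]toys[/HIGHLIGHT] in item 48"
print(prepare_excerpt(snippet, keep_server_highlights=True))
print(prepare_excerpt(snippet, keep_server_highlights=False))
```

Keeping the server's markers means the search engine's own analysis (stemming, apostrophe handling, phrase matching) decides what is highlighted, which is the point of the proposed option.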

Remaining tasks

1. See if maintainers are interested in pursuing this approach
2. Refine the implementation per maintainers' feedback
3. Write tests

πŸ› Bug report
Status

Needs work

Version

1.0

Component

Plugins

Created by

🇺🇸 United States acouch

  • Needs tests

    The change is currently missing an automated test that fails when run with the original code, and succeeds when the bug has been fixed.
