Add support for phrase searching

Issue created by @smokris
Issue was unassigned.
Status changed to Needs review over 2 years ago6:58pm 4 March 2023
Comment over 2 years ago →
🇺🇸United States smokris Athens, Ohio, USA
Patch attached.
Comment over 2 years ago →
System Message
The last submitted patch, 2: 3345999-bigram.patch, failed testing. View results →
Comment over 2 years ago →
🇺🇸United States smokris Athens, Ohio, USA
Revised patch to address failure in testAutocompletion.
Comment over 2 years ago →
🇺🇸United States smokris Athens, Ohio, USA
Rebased patch.
Comment over 2 years ago →
🇦🇹Austria drunken monkey Vienna, Austria
Thanks a lot, this all sounds reasonable to me.
Had to think a bit about the cleanNumericString() special case, but yes: When a numeric string becomes empty after stripping minuses and zeroes, then I guess 0 really is the only possibly correct token, it should never be empty.
I just added a reference to the cleanNumericString() method into its test method, as I try to always do that when accessing something via reflection. That way, it’s still possible to find all usages using an IDE.

Now, I think only this is missing:

Yes, optional makes sense to me. Should it be enabled or disabled by default?

Hm, since it will bloat the database quite a bit, I guess we’d better keep it disabled by default? Or just disabled for existing indexes, but enabled for new ones, since in the latter case the users have a higher chance of seeing the option?
What do you think? Will users appreciate it more if phrase search “just works” (as most people would expect), or if we use their database storage sparingly? (Now that I write it like that, it more sounds like the former to me … Please just pick the option that makes most sense to you, and we’ll hope for the best.)
Comment over 2 years ago →
🇺🇸United States tedfordgif
Although Solr is the primary backend we use, I find myself considering the db backend for specific use cases when Solr is not available, sometimes on large sites. I like the option of not touching existing indexes, and enabling it by default for new indexes with a sufficiently noticeable warning. Thanks for your work on this!
Status changed to Needs work over 2 years ago9:38pm 26 March 2023
Comment over 2 years ago →
🇦🇹Austria drunken monkey Vienna, Austria
NW for the option.
Are you going to work on that, smokris? Otherwise, I can also put it on my todo list, but no guarantees on when I’ll get to it.
Status changed to Needs review over 2 years ago6:20pm 29 March 2023
Comment over 2 years ago →
🇺🇸United States smokris Athens, Ohio, USA
Will users appreciate it more if phrase search “just works” (as most people would expect), or if we use their database storage sparingly?

For new installations, enabling it by default sounds great.

For existing installations, if this change will be included in an upcoming 1.x minor-version release, I think it might be disruptive/surprising if it were automatically enabled and suddenly the site's index size grew significantly. (Though the upcoming 2.0 major-version release ( #3126115: Create a 2.0.0 release → ) might be a suitable time to automatically enable it for existing installations too?)

I've attached an updated patch that includes a first draft of the UI for toggling support for phrase indexing/searching, and a screenshot of it with the new option highlighted in pink. It's enabled by default for new installations, and for existing installations I added a hook_update_N implementation that disables it.
Open in Jenkins → Open on Drupal.org →
Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
last update about 2 years ago
535 pass
Comment about 2 years ago →
🇦🇹Austria drunken monkey Vienna, Austria
Thanks a lot for your continued work, looks great!

+++ b/modules/search_api_db/src/Plugin/search_api/backend/Database.php @@ -580,6 +592,11 @@ + 'info' => ucfirst($this->configuration['phrase']),
This would fail to be translated on non-English websites. However, in this specific case, it should be fine to just wrap the expression in $this->t() (since we those values are already passed as literals in another place).

Otherwise, this patch is already very good, there were only a few small details I changed. Please see the attached revision.

Most notably, I changed the description for the “Phrase indexing” option to be split into one description for the setting itself, and separate descriptions for the individual options. It now reads as follows:

Phrase indexing
Choose how phrase queries (queries with double quotes) should be handled. Better matching comes at the price of slower indexing and larger database size.

Disabled: Double-quoted phrases in search queries will be expanded into individual words, so searches will include results that contain those keywords but not necessarily consecutively and not necessarily in the same order. (This is the standard behavior of version 8.x-1.29 and earlier.)

Bigram: Pairs of consecutive words will be included in the index, so double-quoted searches will only include results that contain those keywords consecutively and in the same order. Warning: this will increase the database index size by about 5x, and will increase indexing time by about 2x.

I’m not convinced that this is the best we can do, but it’s an improvement. Unfortunately, I’m rather bad at writing such descriptions in a way that make them understandable to laypeople.
This would be an alternative (just for the two options) – feedback appreciated on which you think would be better:

<ul> <li>Disabled: This will ignore quoted phrase in search keywords and just look for the individual words separately. For instance, "blue house" would match any item that contains the words “blue” and “house” somewhere in its indexed text, no matter where and in what order.</li> <li>Bigram: This treats quoted phrases in keywords like it is generally expected, matching only items that contain those words in the exact same order, consecutively. For instance, "blue house" would match only items that contain the word “blue” immediately followed by the word “house”. Please make sure that the associated increase in database size (about 5x) and indexing time (about 2x) is acceptable for your site.</li> </ul>
Any combination of the two, or completely new suggestions, are of course also very welcome!
Comment about 2 years ago →
🇩🇰Denmark ressa Copenhagen
This would be an awesome feature in Search API. And I agree, it's something the users expect, so thanks a lot for working on it!

I just tried the latest patch, and it works perfectly.

+1 for enabling it by default in new installations.

About the wording, it is difficult to formulate these concepts, but I think you succeeded with the #2 version, which makes sense to me. Here is a new suggestion, with some slight tweaks:

Disabled: Ignore quotes in searches and just look for the individual words separately. For instance, "blue house" matches any item that contains the words "blue" and "house" somewhere in its indexed text, no matter where and in what order.

Bigram: Treat a quoted phrase in a search like it is generally expected [...] (the rest is great!)
Comment about 2 years ago →
🇺🇸United States aharown07
I have applied the patch to a dev site with good success in results so far, with Bigram enabled.
The test site has just over 23,000 items indexed.

Seems to work really well with facets and everything as expected.

To echo ressa, thanks for your work on this. I hope it makes it into a future release. Also makes sense to me to enable by default in installs.
Open in Jenkins → Open on Drupal.org →
Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
last update about 2 years ago
535 pass
Comment about 2 years ago →
🇦🇹Austria drunken monkey Vienna, Austria
Thanks a lot for both your feedback!
Sounds very good so far. I’m just gonna leave it open a few more days to maybe get another review, but then I thinwe can merge this.

Attached is a patch with the adapted descriptions suggested in #15, thanks a lot for those as well!
Comment about 2 years ago →
System Message

drunken monkey → committed a834e7b3 on 8.x-1.x authored by smokris →
Issue #3345999 by smokris, drunken monkey, ressa: Added support for...
Status changed to Fixed about 2 years ago5:25pm 27 May 2023
Comment about 2 years ago →
🇦🇹Austria drunken monkey Vienna, Austria
OK, merged.
Thanks again, everyone – especially, thanks a lot, @smokris!
Comment about 2 years ago →
🇩🇰Denmark ressa Copenhagen
So cool, this is a great feature to have. Thanks @drunken monkey and @smokris!
Comment about 2 years ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.
Status changed to Fixed almost 2 years ago2:11pm 21 September 2023
Comment almost 2 years ago →
🇺🇸United States ja09
Excellent feature, thanks everyone. One possible addition that could be very useful, in addition to Bigram, would be "Required". This would default to phrase searching but without the user needing to enter the quotes. This would be useful for specific/technical searches, names of places/locations, names of people, leading zeros, etc.

Not sure about the performance ramifications of doing this and if another index would be needed. Seems like it could be done within the Bigram index.
Comment over 1 year ago →
🇨🇭Switzerland berdir Switzerland
The release notes are a bit confusing, they say the new feature is disabled by default, but that's not correct, for new servers, it's enabled by default, only existing servers/sites have it disabled through the update function.

That matters with default config in install profiles for example, if you don't explicitly provide the setting, it will default to bigram.

Add support for phrase searching

Problem/Motivation

Proposed resolution

Remaining tasks

Comments & Activities