Add support for phrase searching

Created on 4 March 2023, over 1 year ago
Updated 11 November 2023, 11 months ago

Problem/Motivation

Users expect that searching for a phrase surrounded by double quotes should only show results that include the quoted words consecutively in the specified order.

Currently search_api_db expands double-quoted phrases into individual words, so searching for a double-quoted phrase includes results that contain those keywords but not necessarily consecutively and not necessarily in the specified order, leading to poor search result relevance.

Previously this was mentioned in Fulltext search + phrase search Fixed , but that issue is closed with the advice to use Apache Solr. Nevertheless, phrase searching can be important to users of smaller sites that don't have the resources to maintain an Apache Solr instance, so I think it could be helpful to include phrase searching in search_api_db.

Proposed resolution

The attached patch adds a simple phrase matching implementation:

  • When indexing documents, it includes bigrams (pairs of consecutive words, a.k.a. biwords) in the index.
  • When searching, it preserves double-quoted phrases instead of expanding them into individual words. Since the index contains bigrams, and not longer phrases (N-grams), it splits phrases with more than 2 words into multiple ANDed 2-word phrase groupings.

This new feature is optional, toggled by a new option in the database backend configuration form:

Caveats:

  • This bigram-based implementation only considers whether each individual pair of double-quoted words is consecutive, not whether an entire 3+ word phrase is consecutive. For example, it converts the query "malleable logarithmic casing" to "malleable logarithmic" AND "logarithmic casing". Even so, this still provides a substantial improvement in search result relevance compared to not having any phrase search capability.
  • Including bigrams increases the size of the index and increases indexing time. (On the plus side, it doesn't appreciably affect search performance.) I tested on a local copy of a site consisting of about 6,500 hand-written English articles (about 55 MB storage according to information_schema.tables):
    • Baseline:
      • search_api_db_content_search_text has about 1.6 million rows (about 210 MB storage according to information_schema.tables)
      • 88 seconds to reindex entire site
    • With this patch:
      • search_api_db_content_search_text has about 4.8 million rows (about 1.1 GB storage according to information_schema.tables)
      • 190 seconds to reindex entire site

    Since search_api_db is recommended for smaller sites, probably having much less content than the aforementioned test site, I wouldn't expect the index storage/time increases to have a significant impact on the recommended use case for search_api_db.

Remaining tasks

  • Decide whether this is an acceptable solution
  • Review and commit patch
Feature request
Status

Fixed

Version

1.0

Component

Database backend

Created by

🇺🇸United States smokris Athens, Ohio, USA

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • Issue created by @smokris
  • Issue was unassigned.
  • Status changed to Needs review over 1 year ago
  • 🇺🇸United States smokris Athens, Ohio, USA

    Patch attached.

  • 🇺🇸United States smokris Athens, Ohio, USA

    Revised patch to address failure in testAutocompletion.

  • 🇺🇸United States smokris Athens, Ohio, USA

    Rebased patch.

  • 🇦🇹Austria drunken monkey Vienna, Austria

    Thanks a lot, this all sounds reasonable to me.
    Had to think a bit about the cleanNumericString() special case, but yes: When a numeric string becomes empty after stripping minuses and zeroes, then I guess 0 really is the only possibly correct token, it should never be empty.
    I just added a reference to the cleanNumericString() method into its test method, as I try to always do that when accessing something via reflection. That way, it’s still possible to find all usages using an IDE.

    Now, I think only this is missing:

    Yes, optional makes sense to me. Should it be enabled or disabled by default?

    Hm, since it will bloat the database quite a bit, I guess we’d better keep it disabled by default? Or just disabled for existing indexes, but enabled for new ones, since in the latter case the users have a higher chance of seeing the option?
    What do you think? Will users appreciate it more if phrase search “just works” (as most people would expect), or if we use their database storage sparingly? (Now that I write it like that, it more sounds like the former to me … Please just pick the option that makes most sense to you, and we’ll hope for the best.)

  • 🇺🇸United States tedfordgif

    Although Solr is the primary backend we use, I find myself considering the db backend for specific use cases when Solr is not available, sometimes on large sites. I like the option of not touching existing indexes, and enabling it by default for new indexes with a sufficiently noticeable warning. Thanks for your work on this!

  • Status changed to Needs work over 1 year ago
  • 🇦🇹Austria drunken monkey Vienna, Austria

    NW for the option.
    Are you going to work on that, smokris? Otherwise, I can also put it on my todo list, but no guarantees on when I’ll get to it.

  • Status changed to Needs review over 1 year ago
  • 🇺🇸United States smokris Athens, Ohio, USA

    Will users appreciate it more if phrase search “just works” (as most people would expect), or if we use their database storage sparingly?

    For new installations, enabling it by default sounds great.

    For existing installations, if this change will be included in an upcoming 1.x minor-version release, I think it might be disruptive/surprising if it were automatically enabled and suddenly the site's index size grew significantly. (Though the upcoming 2.0 major-version release ( #3126115: Create a 2.0.0 release ) might be a suitable time to automatically enable it for existing installations too?)

    I've attached an updated patch that includes a first draft of the UI for toggling support for phrase indexing/searching, and a screenshot of it with the new option highlighted in pink. It's enabled by default for new installations, and for existing installations I added a hook_update_N implementation that disables it.

  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update over 1 year ago
    535 pass
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Thanks a lot for your continued work, looks great!

    +++ b/modules/search_api_db/src/Plugin/search_api/backend/Database.php
    @@ -580,6 +592,11 @@
    +      'info' => ucfirst($this->configuration['phrase']),
    

    This would fail to be translated on non-English websites. However, in this specific case, it should be fine to just wrap the expression in $this->t() (since we those values are already passed as literals in another place).

    Otherwise, this patch is already very good, there were only a few small details I changed. Please see the attached revision.

    Most notably, I changed the description for the “Phrase indexing” option to be split into one description for the setting itself, and separate descriptions for the individual options. It now reads as follows:

    Phrase indexing
    Choose how phrase queries (queries with double quotes) should be handled. Better matching comes at the price of slower indexing and larger database size.

    • Disabled: Double-quoted phrases in search queries will be expanded into individual words, so searches will include results that contain those keywords but not necessarily consecutively and not necessarily in the same order. (This is the standard behavior of version 8.x-1.29 and earlier.)
    • Bigram: Pairs of consecutive words will be included in the index, so double-quoted searches will only include results that contain those keywords consecutively and in the same order. Warning: this will increase the database index size by about 5x, and will increase indexing time by about 2x.

    I’m not convinced that this is the best we can do, but it’s an improvement. Unfortunately, I’m rather bad at writing such descriptions in a way that make them understandable to laypeople.
    This would be an alternative (just for the two options) – feedback appreciated on which you think would be better:

    <ul>
      <li><strong>Disabled:</strong> This will ignore quoted phrase in search keywords and just look for the individual words separately. For instance, <em>"blue house"</em> would match any item that contains the words “blue” and “house” somewhere in its indexed text, no matter where and in what order.</li>
      <li><strong>Bigram:</strong> This treats quoted phrases in keywords like it is generally expected, matching only items that contain those words in the exact same order, consecutively. For instance, <em>"blue house"</em> would match only items that contain the word “blue” immediately followed by the word “house”. Please make sure that the associated increase in database size (about 5x) and indexing time (about 2x) is acceptable for your site.</li>
    </ul>
    

    Any combination of the two, or completely new suggestions, are of course also very welcome!

  • 🇩🇰Denmark ressa Copenhagen

    This would be an awesome feature in Search API. And I agree, it's something the users expect, so thanks a lot for working on it!

    I just tried the latest patch, and it works perfectly.

    +1 for enabling it by default in new installations.

    About the wording, it is difficult to formulate these concepts, but I think you succeeded with the #2 version, which makes sense to me. Here is a new suggestion, with some slight tweaks:

    Disabled: Ignore quotes in searches and just look for the individual words separately. For instance, "blue house" matches any item that contains the words "blue" and "house" somewhere in its indexed text, no matter where and in what order.

    Bigram: Treat a quoted phrase in a search like it is generally expected [...] (the rest is great!)

  • 🇺🇸United States aharown07

    I have applied the patch to a dev site with good success in results so far, with Bigram enabled.
    The test site has just over 23,000 items indexed.

    Seems to work really well with facets and everything as expected.

    To echo ressa, thanks for your work on this. I hope it makes it into a future release. Also makes sense to me to enable by default in installs.

  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update over 1 year ago
    535 pass
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Thanks a lot for both your feedback!
    Sounds very good so far. I’m just gonna leave it open a few more days to maybe get another review, but then I thinwe can merge this.

    Attached is a patch with the adapted descriptions suggested in #15, thanks a lot for those as well!

  • Status changed to Fixed over 1 year ago
  • 🇦🇹Austria drunken monkey Vienna, Austria

    OK, merged.
    Thanks again, everyone – especially, thanks a lot, @smokris!

  • 🇩🇰Denmark ressa Copenhagen

    So cool, this is a great feature to have. Thanks @drunken monkey and @smokris!

  • Automatically closed - issue fixed for 2 weeks with no activity.

  • Status changed to Fixed about 1 year ago
  • 🇺🇸United States ja09

    Excellent feature, thanks everyone. One possible addition that could be very useful, in addition to Bigram, would be "Required". This would default to phrase searching but without the user needing to enter the quotes. This would be useful for specific/technical searches, names of places/locations, names of people, leading zeros, etc.

    Not sure about the performance ramifications of doing this and if another index would be needed. Seems like it could be done within the Bigram index.

  • 🇨🇭Switzerland berdir Switzerland

    The release notes are a bit confusing, they say the new feature is disabled by default, but that's not correct, for new servers, it's enabled by default, only existing servers/sites have it disabled through the update function.

    That matters with default config in install profiles for example, if you don't explicitly provide the setting, it will default to bigram.

Production build 0.71.5 2024