Add Advanced Boolean Operators Parse Mode with complex query support

Created on 20 July 2025, about 2 months ago

Problem/Motivation

The current Search API module provides basic parse modes (Terms, Phrase, Direct) but lacks support for complex boolean search queries.
Examples:

  • Complex boolean operators: cats AND dogs, cats OR dogs, cats NOT dogs
  • Parenthetical grouping: (cats OR dogs) AND (fish NOT turtles)
  • Mixed expressions: "exact phrase" AND (term1 OR term2) NOT excluded
  • Multi-language queries: Proper Unicode support for international content

Steps to reproduce

Right now it is a new feature.

Proposed resolution

Implement a new BooleanAdvanced parse mode with the following features:

  • Boolean Operators: Support for AND, OR, NOT with proper precedence.
  • Parenthetical Grouping: Unlimited nesting depth for complex expressions.
  • Quoted Phrases: Exact phrase matching within boolean expressions.
  • Operator Precedence: Logical evaluation order matching user expectations.

Remaining tasks

In pretty early stage, a basic manual test has been done using the search_api_db module, but might be useful for a potential use case related with πŸ’¬ Potential risks using "Direct query" parse mode with views? Active

✨ Feature request
Status

Active

Version

1.0

Component

Plugins

Created by

πŸ‡ͺπŸ‡ΈSpain tuwebo

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @tuwebo
  • Merge request !250Issue #3537045: Initial commit β†’ (Open) created by tuwebo
  • Pipeline finished with Failed
    about 2 months ago
    Total: 355s
    #552511
  • Pipeline finished with Failed
    20 days ago
    Total: 355s
    #578000
  • Pipeline finished with Success
    20 days ago
    Total: 321s
    #578041
  • Pipeline finished with Success
    19 days ago
    Total: 394s
    #578406
  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    I've updated the code, and added a Unit test which I'll need to review since Terms parse mode is removing the token "-", and I don't know yet if I should do the same thing.
    In the other hand I am facing an issue regarding the logic behind Database::createKeysQuery() flow for matching = β€œwords” with a tree like:
    (foob OR bar) AND (buz OR qux)
    which will not happen for a tree like:
    (foob OR bar) OR (buz OR qux)

    So I am thinking about changing this part of the code to take into account that use cases:

    if (!$match_parts) {
      if ($subs > $word_count) {
        // We have nested groups. Count unioned rows (one per matched group).
        $db_query->having('COUNT(*) >= ' . $var, [$var => $subs]);
      }
      elseif ($mul_words) {
        // Plain multi-word, no nesting: preserve 'distinct words' semantics.
        $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]);
      }
      else {
        $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]);
      }
    }
  • Pipeline finished with Success
    16 days ago
    Total: 362s
    #580906
  • Pipeline finished with Success
    13 days ago
    Total: 350s
    #583129
  • Pipeline finished with Success
    13 days ago
    Total: 344s
    #583179
  • Pipeline finished with Success
    13 days ago
    Total: 439s
    #583199
  • Pipeline finished with Success
    13 days ago
    Total: 349s
    #583267
  • πŸ‡¦πŸ‡ΉAustria drunken monkey Vienna, Austria

    drunken monkey β†’ made their first commit to this issue’s fork.

  • πŸ‡¦πŸ‡ΉAustria drunken monkey Vienna, Austria

    Wow, thanks a lot for this, great job! Would be great to add support for such a more powerful parse mode, this has been sorely missed since the beginning of the module.

    I left some minor comments in the MR, most importantly we need a regression test for the DB backend bug. (Or split it into a separate issue, since it’s technically unrelated, but probably not worth it – should be simple enough if you know which parsed key structure led to the problem.)

    The only other matter is that of the name: I think we should avoid anything with β€œBoolean” in it – while it is instantly recognizable for anyone familiar with search engines, I think it will leave the average user utterly confused.
    How about something like β€œKeywords with operators” as the label? For the class names and internal plugin ID we could just use something like Complex, though, or keep Boolean (though I would remove the Advanced part).
    I’m very open for other suggestions, though, I know I’m pretty bad with naming in general. (We can of course also adapt the user-facing label later, but the class names would then be stuck.)

    In any case, thanks a lot again!

  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    Wow, thank you very much for the review.
    Naming and caching is the most difficult thing ;)
    I'll change the names as you suggested. I am working on the tests for the regression and also to make sure that the parser works for different inputs.
    For the database regression, there were two use cases, but I'll make it clear in the tests. Just leaving it here in case I don't have time this week for complete the task.

    (foo OR bar) AND (baz OR fux)

    if (!$match_parts) {
              if ($mul_words) {
                $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]);
              }
              else {
                $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]);
              }
            }

    Count unioned rows (one per matched group) for avoiding counting in non existing [t].[word].

    if (!$match_parts) {
              if ($subs > $word_count) {
                $db_query->having('COUNT(*) >= ' . $var, [$var => $subs]);
              }
              elseif ($mul_words) {
                $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]);
              }
              else {
                $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]);
              }

    NOT (foo AND (bar OR fux))

          if ($neg_nested) {
            $db_query->fields('t', ['item_id', 'word']);
          }
          elseif ($neg) {
            $db_query->fields('t', ['item_id']);
          }

    Ensure negated word queries select consistent fields preventing UNION column mismatches.

          if ($neg_nested) {
            $db_query->fields('t', ['item_id', 'word']);
          }
          elseif ($neg) {
            $db_query->fields('t', ['item_id', 'word']);
          }
  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    Working on it, I've realized about another issue, I am looking into it for this use case:
    foo AND (NOT bar OR (baz NOT qux))

    While it seems that we are taking into account per negation $nested_query for a "NOT IN" to not have more than one column, the missing piece is doing the same for $old_query, since it might still carries score or word.
    In the $negated branch:

    if (isset($old_query)) {
            $condition->condition('t.item_id', $old_query, 'NOT IN');
          }
    

    Ensure NOT IN subquery returns exactly one column:

      if (isset($old_query)) {
            $old_query = $this->database->select($old_query, 't')
              ->fields('t', ['item_id']);
            $condition->condition('t.item_id', $old_query, 'NOT IN');
          }
    
  • Pipeline finished with Failed
    6 days ago
    Total: 346s
    #589701
  • Pipeline finished with Failed
    6 days ago
    Total: 416s
    #589726
  • Pipeline finished with Running
    5 days ago
    #590199
  • Pipeline finished with Failed
    5 days ago
    Total: 1072s
    #590210
  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    Moving the issue to needs review, I've added some comments on the MR.

  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    Back to need work, two tests failing that I was not aware of.

  • Pipeline finished with Failed
    5 days ago
    Total: 1055s
    #590319
  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    Moving to Needs Review, I think the only failing test is not related with this issue. Before keep moving forwards (if necessary), it is better some review to make sure that I am not messing up too much.

  • πŸ‡¦πŸ‡ΉAustria drunken monkey Vienna, Austria

    Once again, thanks a lot for your work, looks great! I commented (a lot) on the MR but I think everything is resolved now.
    If the pipelines are green and you are happy with my changes then I think we can merge.

    (I was not entirely sure whether we still needed the unit test if the kernel test already covers the exact same thing but probably still good to have something that executes more quickly. In any case, the resource cost is negligible, and I just reused the same data provider method to avoid the duplicate code (in case we ever need to add additional test cases).

  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    Reviewing the tests results for user's input: NOT (foo AND (bar OR "baz qux"))

    Dataset (indexed docs)

    Matching across title + body. "baz qux" is an exact phrase.

    1. Doc 1 β€” title: foo who bar going body: We are in fux their baz
    2. Doc 2 β€” title: bar project body: Nothing else mentioned here.
    3. Doc 3 β€” title: cogito ergo sum body: ""
    4. Doc 4 β€” title: pos body: ""
    5. Doc 5 β€” title: neg body: ""
    6. Doc 6 β€” title: quoted pos with -minus/ body: ""
    7. Doc 7 β€” title: quoted neg body: ""
    8. Doc 8 β€” title: η₯žε₯ˆε·ηœŒγ€€ι€£ζΊ body: ""
    9. Doc 9 β€” title: buz body: ""
    10. Doc 10 β€” title: qux body: ""
    11. Doc 11 β€” title: baz qux body: ""
    12. Doc 12 β€” title: core grunt body: ""
    13. Doc 13 β€” title: foo body: body has bar
    14. Doc 14 β€” title: baz body: body contains qux quux phrase
    15. Doc 15 β€” title: bar body: fux here
    16. Doc 16 β€” title: foo buz body: ""

    Query 1

    Input: foo AND (bar OR "baz qux")

    Expected matches: Doc 1, Doc 13

    • Doc 1 – has foo and bar.
    • Doc 13 – has foo and bar.

    Expected non-matches: Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16

    • Doc 11 – has the phrase "baz qux" but no foo.
    • Doc 16 – has foo but neither bar nor the phrase.
    • Others – lack foo or both bar and the phrase.

    Query 2

    Input: NOT (foo AND (bar OR "baz qux"))

    De Morgan’s: NOT (A AND (B OR C)) β‡’ (NOT A) OR ((NOT B) AND (NOT C))

    Expected matches: Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16

    • Docs without foo are included (2,3,4,5,6,7,8,9,10,12,14,15).
    • Doc 11 – has the phrase but no foo β‡’ included.
    • Doc 16 – has foo but neither bar nor the phrase β‡’ included.

    Expected non-matches: Doc 1, 13

    • Doc 1 – has foo and bar β‡’ excluded.
    • Doc 13 – has foo and bar β‡’ excluded.
  • Pipeline finished with Success
    3 days ago
    Total: 359s
    #592167
  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    Clarifying the test use case nested negation and grouping for user's input:

    foo AND (NOT bar OR (baz NOT "qux quux"))

    Dataset (indexed docs)

    Matching across title + body. "qux quux" is an exact phrase.

    1. Doc 1 β€” title: foo who bar going body: We are in fux their baz
    2. Doc 2 β€” title: bar project body: Nothing else mentioned here.
    3. Doc 3 β€” title: cogito ergo sum body: ""
    4. Doc 4 β€” title: pos body: ""
    5. Doc 5 β€” title: neg body: ""
    6. Doc 6 β€” title: quoted pos with -minus body: ""
    7. Doc 7 β€” title: quoted neg body: ""
    8. Doc 8 β€” title: η₯žε₯ˆε·ηœŒγ€€ι€£ζΊ body: ""
    9. Doc 9 β€” title: buz body: ""
    10. Doc 10 β€” title: qux body: ""
    11. Doc 11 β€” title: baz qux body: ""
    12. Doc 12 β€” title: core grunt body: ""
    13. Doc 13 β€” title: foo body: body has bar
    14. Doc 14 β€” title: baz body: body contains qux quux phrase
    15. Doc 15 β€” title: bar body: fux here
    16. Doc 16 β€” title: foo buz body: ""

    Expected interpretation

    The query requires foo AND one of the following:

    • NOT bar, or
    • baz AND NOT the exact phrase "qux quux".

    Expected matches

    Doc 1, Doc 16

    • Doc 1 β€” has foo; NOT bar is false (it has bar), but it also has baz and does not have the phrase "qux quux" β‡’ the second branch of the OR is true.
    • Doc 16 β€” has foo; does not have bar β‡’ NOT bar branch is true.

    Expected non-matches

    Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

    • Docs 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15 β€” lack foo, so they fail the outer foo AND (…) .
    • Doc 13 β€” has foo but also has bar (so NOT bar is false) and lacks baz (so baz AND NOT "qux quux" is false) β‡’ inner OR fails.

    Note: (baz NOT "qux quux") means baz AND NOT "qux quux" (no De Morgan transformation is applied here).

  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    A search_api Database::createKeysQuery flowchart (to be reviewed, I've just vive coded it)

    flowchart TD
      A[Start createKeysQuery] --> B{Are keys an array}
      B -- No --> B1[Wrap as AND with single key]
      B -- Yes --> C[Init flags and buckets]
    
      B1 --> C
      C --> D[Partition keys into words nested negated]
      D --> E[Compute counts and flags]
    
      E --> F{Has words}
      F -- No --> J
      F -- Yes --> G[Create base select on text table]
    
      G --> G1{Choose selected columns}
      G1 -- neg nested or overall neg --> G1a[Select item_id and word]
      G1 -- partial or prefix --> G1b[Select item_id then sum score later]
      G1 -- not nested fast path --> G1c[Select item_id and score]
      G1 -- otherwise --> G1d[Select item_id score word]
    
      G1a --> H
      G1b --> H
      G1c --> H
      G1d --> H
    
      H{Matching mode}
      H -- exact --> H1[Filter t.word IN words]
      H -- partial or prefix --> H2[Build OR of t.word LIKE add CASE aliases group by]
    
      H1 --> I[Filter field_name IN fulltext fields]
      H2 --> I
    
      I --> J{Has positive nested groups}
      J -- No --> L
      J -- Yes --> K[For each nested recurse add constants if needed union]
    
      K --> L{Have db_query and not fast path}
      L -- No --> N
      L -- Yes --> M[Nest subquery select item_id then if not neg add sum score and group by then if AND and subs greater than 1 add having rules]
    
      M --> N{Has negated groups}
      N -- No --> Q
      N -- Yes --> O{Need new base none built or conjunction is OR}
      O -- Yes --> O1[If prior query exists save as old query then new base on index table select item_id then if not neg add const score and distinct]
      O -- No --> P
    
      O1 --> P
      P{Negation conjunction}
      P -- AND --> P1[Attach NOT IN conditions]
      P -- OR --> P2[Create OR group and attach]
    
      P1 --> R
      P2 --> R
    
      R[For each negated group recurse ensure nested returns one column item_id then add NOT IN condition]
      R --> S{Old query set}
      S -- Yes --> S1[Project old query to select item_id then add IN condition]
      S -- No --> T
    
      S1 --> T
      T{neg nested}
      T -- Yes --> U[Final projection select item_id only]
      T -- No --> V[Return db query]
    
      U --> V
  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    A search_api Database::createKeysQuery flowchart focusing on negated groups (to be reviewed, I've just vive coded it). Attached image.

    flowchart TD
      A[Any negated groups] --> B{Need new base no db query or OR}
      B -- Yes --> C[If prior query exists save as old query]
      C --> D[Create base on index table select item id]
      D --> E[If not overall negation add constant score and distinct]
      B -- No --> F[Use existing query]
      E --> G{Conjunction type}
      F --> G
      G -- AND --> H[Attach NOT IN conditions to base]
      G -- OR --> I[Create OR group and attach]
      H --> J[For each negated group build nested query ensure single column item id add NOT IN]
      I --> J
      J --> K{Old query set}
      K -- Yes --> L[Project old query to item id then add IN condition]
      K -- No --> M[No action]
      L --> N[Return query]
      M --> N[Return query]
  • πŸ‡ͺπŸ‡ΈSpain tuwebo

    I really appreciate your review, thank you so much for the fast responses, everything seems Ok for me. I've added some documentation in my previous comments and review a couple of "false negatives" tests. I think it is working fine and all tests are validating now.

  • πŸ‡ͺπŸ‡ΈSpain tuwebo
  • πŸ‡ͺπŸ‡ΈSpain tuwebo
Production build 0.71.5 2024