Add Advanced Boolean Operators Parse Mode with complex query support

Issue created by @tuwebo
Merge request !250Issue #3537045: Initial commit → (Merged) created by tuwebo
Pipeline finished with Failed
2 months ago
Total: 355s
#552511
Pipeline finished with Failed
about 1 month ago
Total: 355s
#578000
Pipeline finished with Success
about 1 month ago
Total: 321s
#578041
Pipeline finished with Success
about 1 month ago
Total: 394s
#578406
Comment about 1 month ago →
🇪🇸Spain tuwebo
I've updated the code, and added a Unit test which I'll need to review since Terms parse mode is removing the token "-", and I don't know yet if I should do the same thing.
In the other hand I am facing an issue regarding the logic behind Database::createKeysQuery() flow for matching = “words” with a tree like:
(foob OR bar) AND (buz OR qux)
which will not happen for a tree like:
(foob OR bar) OR (buz OR qux)

So I am thinking about changing this part of the code to take into account that use cases:

if (!$match_parts) { if ($subs > $word_count) { // We have nested groups. Count unioned rows (one per matched group). $db_query->having('COUNT(*) >= ' . $var, [$var => $subs]); } elseif ($mul_words) { // Plain multi-word, no nesting: preserve 'distinct words' semantics. $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]); } else { $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]); } }
Pipeline finished with Success
about 1 month ago
Total: 362s
#580906
Pipeline finished with Success
about 1 month ago
Total: 350s
#583129
Pipeline finished with Success
about 1 month ago
Total: 344s
#583179
Pipeline finished with Success
about 1 month ago
Total: 439s
#583199
Pipeline finished with Success
about 1 month ago
Total: 349s
#583267
Comment about 1 month ago →
🇪🇸Spain tuwebo
Comment about 1 month ago →
🇦🇹Austria drunken monkey Vienna, Austria
drunken monkey → made their first commit to this issue’s fork.
Comment about 1 month ago →
🇦🇹Austria drunken monkey Vienna, Austria
Wow, thanks a lot for this, great job! Would be great to add support for such a more powerful parse mode, this has been sorely missed since the beginning of the module.

I left some minor comments in the MR, most importantly we need a regression test for the DB backend bug. (Or split it into a separate issue, since it’s technically unrelated, but probably not worth it – should be simple enough if you know which parsed key structure led to the problem.)

The only other matter is that of the name: I think we should avoid anything with “Boolean” in it – while it is instantly recognizable for anyone familiar with search engines, I think it will leave the average user utterly confused.
How about something like “Keywords with operators” as the label? For the class names and internal plugin ID we could just use something like Complex, though, or keep Boolean (though I would remove the Advanced part).
I’m very open for other suggestions, though, I know I’m pretty bad with naming in general. (We can of course also adapt the user-facing label later, but the class names would then be stuck.)

In any case, thanks a lot again!

🇪🇸Spain tuwebo

Wow, thank you very much for the review.
Naming and caching is the most difficult thing ;)
I'll change the names as you suggested. I am working on the tests for the regression and also to make sure that the parser works for different inputs.
For the database regression, there were two use cases, but I'll make it clear in the tests. Just leaving it here in case I don't have time this week for complete the task.

(foo OR bar) AND (baz OR fux)

if (!$match_parts) {
          if ($mul_words) {
            $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]);
          }
          else {
            $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]);
          }
        }

Count unioned rows (one per matched group) for avoiding counting in non existing [t].[word].

if (!$match_parts) {
          if ($subs > $word_count) {
            $db_query->having('COUNT(*) >= ' . $var, [$var => $subs]);
          }
          elseif ($mul_words) {
            $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]);
          }
          else {
            $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]);
          }

NOT (foo AND (bar OR fux))

      if ($neg_nested) {
        $db_query->fields('t', ['item_id', 'word']);
      }
      elseif ($neg) {
        $db_query->fields('t', ['item_id']);
      }

Ensure negated word queries select consistent fields preventing UNION column mismatches.

      if ($neg_nested) {
        $db_query->fields('t', ['item_id', 'word']);
      }
      elseif ($neg) {
        $db_query->fields('t', ['item_id', 'word']);
      }

Comment 29 days ago →
🇪🇸Spain tuwebo
Working on it, I've realized about another issue, I am looking into it for this use case:
foo AND (NOT bar OR (baz NOT qux))

While it seems that we are taking into account per negation $nested_query for a "NOT IN" to not have more than one column, the missing piece is doing the same for $old_query, since it might still carries score or word.
In the $negated branch:

if (isset($old_query)) { $condition->condition('t.item_id', $old_query, 'NOT IN'); }
Ensure NOT IN subquery returns exactly one column:

if (isset($old_query)) { $old_query = $this->database->select($old_query, 't') ->fields('t', ['item_id']); $condition->condition('t.item_id', $old_query, 'NOT IN'); }
Pipeline finished with Failed
28 days ago
Total: 346s
#589701
Pipeline finished with Failed
28 days ago
Total: 416s
#589726
Pipeline finished with Running
28 days ago
#590199
Pipeline finished with Failed
28 days ago
Total: 1072s
#590210
Comment 28 days ago →
🇪🇸Spain tuwebo
Moving the issue to needs review, I've added some comments on the MR.
Comment 28 days ago →
🇪🇸Spain tuwebo
Back to need work, two tests failing that I was not aware of.
Pipeline finished with Failed
28 days ago
Total: 1055s
#590319
Comment 27 days ago →
🇪🇸Spain tuwebo
Moving to Needs Review, I think the only failing test is not related with this issue. Before keep moving forwards (if necessary), it is better some review to make sure that I am not messing up too much.
Comment 26 days ago →
🇦🇹Austria drunken monkey Vienna, Austria
Once again, thanks a lot for your work, looks great! I commented (a lot) on the MR but I think everything is resolved now.
If the pipelines are green and you are happy with my changes then I think we can merge.

(I was not entirely sure whether we still needed the unit test if the kernel test already covers the exact same thing but probably still good to have something that executes more quickly. In any case, the resource cost is negligible, and I just reused the same data provider method to avoid the duplicate code (in case we ever need to add additional test cases).
Comment 25 days ago →
🇪🇸Spain tuwebo
Reviewing the tests results for user's input: NOT (foo AND (bar OR "baz qux"))

Dataset (indexed docs)

Matching across title + body. "baz qux" is an exact phrase.

Doc 1 — title: foo who bar going body: We are in fux their baz

Doc 2 — title: bar project body: Nothing else mentioned here.

Doc 3 — title: cogito ergo sum body: ""

Doc 4 — title: pos body: ""

Doc 5 — title: neg body: ""

Doc 6 — title: quoted pos with -minus/ body: ""

Doc 7 — title: quoted neg body: ""

Doc 8 — title: 神奈川県　連携 body: ""

Doc 9 — title: buz body: ""

Doc 10 — title: qux body: ""

Doc 11 — title: baz qux body: ""

Doc 12 — title: core grunt body: ""

Doc 13 — title: foo body: body has bar

Doc 14 — title: baz body: body contains qux quux phrase

Doc 15 — title: bar body: fux here

Doc 16 — title: foo buz body: ""

Query 1

Input: foo AND (bar OR "baz qux")

Expected matches: Doc 1, Doc 13

Doc 1 – has foo and bar.

Doc 13 – has foo and bar.

Expected non-matches: Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16

Doc 11 – has the phrase "baz qux" but no foo.

Doc 16 – has foo but neither bar nor the phrase.

Others – lack foo or both bar and the phrase.

Query 2

Input: NOT (foo AND (bar OR "baz qux"))

De Morgan’s: NOT (A AND (B OR C)) ⇒ (NOT A) OR ((NOT B) AND (NOT C))

Expected matches: Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16

Docs without foo are included (2,3,4,5,6,7,8,9,10,12,14,15).

Doc 11 – has the phrase but no foo ⇒ included.

Doc 16 – has foo but neither bar nor the phrase ⇒ included.

Expected non-matches: Doc 1, 13

Doc 1 – has foo and bar ⇒ excluded.

Doc 13 – has foo and bar ⇒ excluded.
Pipeline finished with Success
25 days ago
Total: 359s
#592167
Comment 25 days ago →
🇪🇸Spain tuwebo
Clarifying the test use case nested negation and grouping for user's input:

foo AND (NOT bar OR (baz NOT "qux quux"))

Dataset (indexed docs)

Matching across title + body. "qux quux" is an exact phrase.

Doc 1 — title: foo who bar going body: We are in fux their baz

Doc 2 — title: bar project body: Nothing else mentioned here.

Doc 3 — title: cogito ergo sum body: ""

Doc 4 — title: pos body: ""

Doc 5 — title: neg body: ""

Doc 6 — title: quoted pos with -minus body: ""

Doc 7 — title: quoted neg body: ""

Doc 8 — title: 神奈川県　連携 body: ""

Doc 9 — title: buz body: ""

Doc 10 — title: qux body: ""

Doc 11 — title: baz qux body: ""

Doc 12 — title: core grunt body: ""

Doc 13 — title: foo body: body has bar

Doc 14 — title: baz body: body contains qux quux phrase

Doc 15 — title: bar body: fux here

Doc 16 — title: foo buz body: ""

Expected interpretation

The query requires foo AND one of the following:

NOT bar, or

baz AND NOT the exact phrase "qux quux".

Expected matches

Doc 1, Doc 16

Doc 1 — has foo; NOT bar is false (it has bar), but it also has baz and does not have the phrase "qux quux" ⇒ the second branch of the OR is true.

Doc 16 — has foo; does not have bar ⇒ NOT bar branch is true.

Expected non-matches

Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

Docs 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15 — lack foo, so they fail the outer foo AND (…) .

Doc 13 — has foo but also has bar (so NOT bar is false) and lacks baz (so baz AND NOT "qux quux" is false) ⇒ inner OR fails.

Note: (baz NOT "qux quux") means baz AND NOT "qux quux" (no De Morgan transformation is applied here).

Comment 25 days ago →

🇪🇸Spain tuwebo

A search_api Database::createKeysQuery flowchart (to be reviewed, I've just vive coded it)

flowchart TD
  A[Start createKeysQuery] --> B{Are keys an array}
  B -- No --> B1[Wrap as AND with single key]
  B -- Yes --> C[Init flags and buckets]

  B1 --> C
  C --> D[Partition keys into words nested negated]
  D --> E[Compute counts and flags]

  E --> F{Has words}
  F -- No --> J
  F -- Yes --> G[Create base select on text table]

  G --> G1{Choose selected columns}
  G1 -- neg nested or overall neg --> G1a[Select item_id and word]
  G1 -- partial or prefix --> G1b[Select item_id then sum score later]
  G1 -- not nested fast path --> G1c[Select item_id and score]
  G1 -- otherwise --> G1d[Select item_id score word]

  G1a --> H
  G1b --> H
  G1c --> H
  G1d --> H

  H{Matching mode}
  H -- exact --> H1[Filter t.word IN words]
  H -- partial or prefix --> H2[Build OR of t.word LIKE add CASE aliases group by]

  H1 --> I[Filter field_name IN fulltext fields]
  H2 --> I

  I --> J{Has positive nested groups}
  J -- No --> L
  J -- Yes --> K[For each nested recurse add constants if needed union]

  K --> L{Have db_query and not fast path}
  L -- No --> N
  L -- Yes --> M[Nest subquery select item_id then if not neg add sum score and group by then if AND and subs greater than 1 add having rules]

  M --> N{Has negated groups}
  N -- No --> Q
  N -- Yes --> O{Need new base none built or conjunction is OR}
  O -- Yes --> O1[If prior query exists save as old query then new base on index table select item_id then if not neg add const score and distinct]
  O -- No --> P

  O1 --> P
  P{Negation conjunction}
  P -- AND --> P1[Attach NOT IN conditions]
  P -- OR --> P2[Create OR group and attach]

  P1 --> R
  P2 --> R

  R[For each negated group recurse ensure nested returns one column item_id then add NOT IN condition]
  R --> S{Old query set}
  S -- Yes --> S1[Project old query to select item_id then add IN condition]
  S -- No --> T

  S1 --> T
  T{neg nested}
  T -- Yes --> U[Final projection select item_id only]
  T -- No --> V[Return db query]

  U --> V

Comment 25 days ago →

🇪🇸Spain tuwebo

A search_api Database::createKeysQuery flowchart focusing on negated groups (to be reviewed, I've just vive coded it). Attached image.

flowchart TD
  A[Any negated groups] --> B{Need new base no db query or OR}
  B -- Yes --> C[If prior query exists save as old query]
  C --> D[Create base on index table select item id]
  D --> E[If not overall negation add constant score and distinct]
  B -- No --> F[Use existing query]
  E --> G{Conjunction type}
  F --> G
  G -- AND --> H[Attach NOT IN conditions to base]
  G -- OR --> I[Create OR group and attach]
  H --> J[For each negated group build nested query ensure single column item id add NOT IN]
  I --> J
  J --> K{Old query set}
  K -- Yes --> L[Project old query to item id then add IN condition]
  K -- No --> M[No action]
  L --> N[Return query]
  M --> N[Return query]

Comment 25 days ago →
🇪🇸Spain tuwebo
I really appreciate your review, thank you so much for the fast responses, everything seems Ok for me. I've added some documentation in my previous comments and review a couple of "false negatives" tests. I think it is working fine and all tests are validating now.
Comment 25 days ago →
🇪🇸Spain tuwebo
Comment 25 days ago →
🇪🇸Spain tuwebo

Add Advanced Boolean Operators Parse Mode with complex query support

Problem/Motivation

Steps to reproduce

Proposed resolution

Remaining tasks

Merge Requests

!250Add Advanced Boolean Operators Parse Mode with complex query support
Merged

Comments & Activities

Dataset (indexed docs)

Query 1

Query 2

Dataset (indexed docs)

Expected interpretation

Expected matches

Expected non-matches

Add Advanced Boolean Operators Parse Mode with complex query support

Problem/Motivation

Steps to reproduce

Proposed resolution

Remaining tasks

Merge Requests

!250Add Advanced Boolean Operators Parse Mode with complex query supportMerged

Comments & Activities

Dataset (indexed docs)

Query 1

Query 2

Dataset (indexed docs)

Expected interpretation

Expected matches

Expected non-matches

!250Add Advanced Boolean Operators Parse Mode with complex query support
Merged