- Issue created by @tuwebo
- πͺπΈSpain tuwebo
I've updated the code, and added a Unit test which I'll need to review since Terms parse mode is removing the token "-", and I don't know yet if I should do the same thing.
In the other hand I am facing an issue regarding the logic behind Database::createKeysQuery() flow for matching = βwordsβ with a tree like:
(foob OR bar) AND (buz OR qux)
which will not happen for a tree like:
(foob OR bar) OR (buz OR qux)
So I am thinking about changing this part of the code to take into account that use cases:
if (!$match_parts) { if ($subs > $word_count) { // We have nested groups. Count unioned rows (one per matched group). $db_query->having('COUNT(*) >= ' . $var, [$var => $subs]); } elseif ($mul_words) { // Plain multi-word, no nesting: preserve 'distinct words' semantics. $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]); } else { $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]); } }
- π¦πΉAustria drunken monkey Vienna, Austria
drunken monkey β made their first commit to this issueβs fork.
- π¦πΉAustria drunken monkey Vienna, Austria
Wow, thanks a lot for this, great job! Would be great to add support for such a more powerful parse mode, this has been sorely missed since the beginning of the module.
I left some minor comments in the MR, most importantly we need a regression test for the DB backend bug. (Or split it into a separate issue, since itβs technically unrelated, but probably not worth it β should be simple enough if you know which parsed key structure led to the problem.)
The only other matter is that of the name: I think we should avoid anything with βBooleanβ in it β while it is instantly recognizable for anyone familiar with search engines, I think it will leave the average user utterly confused.
How about something like βKeywords with operatorsβ as the label? For the class names and internal plugin ID we could just use something likeComplex
, though, or keepBoolean
(though I would remove theAdvanced
part).
Iβm very open for other suggestions, though, I know Iβm pretty bad with naming in general. (We can of course also adapt the user-facing label later, but the class names would then be stuck.)In any case, thanks a lot again!
- πͺπΈSpain tuwebo
Wow, thank you very much for the review.
Naming and caching is the most difficult thing ;)
I'll change the names as you suggested. I am working on the tests for the regression and also to make sure that the parser works for different inputs.
For the database regression, there were two use cases, but I'll make it clear in the tests. Just leaving it here in case I don't have time this week for complete the task.(foo OR bar) AND (baz OR fux)
if (!$match_parts) { if ($mul_words) { $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]); } else { $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]); } }
Count unioned rows (one per matched group) for avoiding counting in non existing [t].[word].
if (!$match_parts) { if ($subs > $word_count) { $db_query->having('COUNT(*) >= ' . $var, [$var => $subs]); } elseif ($mul_words) { $db_query->having('COUNT(DISTINCT [t].[word]) >= ' . $var, [$var => $subs]); } else { $db_query->having('COUNT([t].[word]) >= ' . $var, [$var => $subs]); }
NOT (foo AND (bar OR fux))
if ($neg_nested) { $db_query->fields('t', ['item_id', 'word']); } elseif ($neg) { $db_query->fields('t', ['item_id']); }
Ensure negated word queries select consistent fields preventing UNION column mismatches.
if ($neg_nested) { $db_query->fields('t', ['item_id', 'word']); } elseif ($neg) { $db_query->fields('t', ['item_id', 'word']); }
- πͺπΈSpain tuwebo
Working on it, I've realized about another issue, I am looking into it for this use case:
foo AND (NOT bar OR (baz NOT qux))
While it seems that we are taking into account per negation $nested_query for a "NOT IN" to not have more than one column, the missing piece is doing the same for $old_query, since it might still carries score or word.
In the $negated branch:if (isset($old_query)) { $condition->condition('t.item_id', $old_query, 'NOT IN'); }
Ensure NOT IN subquery returns exactly one column:
if (isset($old_query)) { $old_query = $this->database->select($old_query, 't') ->fields('t', ['item_id']); $condition->condition('t.item_id', $old_query, 'NOT IN'); }
- πͺπΈSpain tuwebo
Moving the issue to needs review, I've added some comments on the MR.
- πͺπΈSpain tuwebo
Back to need work, two tests failing that I was not aware of.
- πͺπΈSpain tuwebo
Moving to Needs Review, I think the only failing test is not related with this issue. Before keep moving forwards (if necessary), it is better some review to make sure that I am not messing up too much.
- π¦πΉAustria drunken monkey Vienna, Austria
Once again, thanks a lot for your work, looks great! I commented (a lot) on the MR but I think everything is resolved now.
If the pipelines are green and you are happy with my changes then I think we can merge.(I was not entirely sure whether we still needed the unit test if the kernel test already covers the exact same thing but probably still good to have something that executes more quickly. In any case, the resource cost is negligible, and I just reused the same data provider method to avoid the duplicate code (in case we ever need to add additional test cases).
- πͺπΈSpain tuwebo
Reviewing the tests results for user's input: NOT (foo AND (bar OR "baz qux"))
Dataset (indexed docs)
Matching across title + body.
"baz qux"
is an exact phrase.- Doc 1 β title:
foo who bar going
body:We are in fux their baz
- Doc 2 β title:
bar project
body:Nothing else mentioned here.
- Doc 3 β title:
cogito ergo sum
body:""
- Doc 4 β title:
pos
body:""
- Doc 5 β title:
neg
body:""
- Doc 6 β title:
quoted pos with -minus
/ body:""
- Doc 7 β title:
quoted neg
body:""
- Doc 8 β title:
η₯ε₯ε·ηγι£ζΊ
body:""
- Doc 9 β title:
buz
body:""
- Doc 10 β title:
qux
body:""
- Doc 11 β title:
baz qux
body:""
- Doc 12 β title:
core grunt
body:""
- Doc 13 β title:
foo
body:body has bar
- Doc 14 β title:
baz
body:body contains qux quux phrase
- Doc 15 β title:
bar
body:fux here
- Doc 16 β title:
foo buz
body:""
Query 1
Input:
foo AND (bar OR "baz qux")
Expected matches: Doc 1, Doc 13
- Doc 1 β has
foo
andbar
. - Doc 13 β has
foo
andbar
.
Expected non-matches: Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16
- Doc 11 β has the phrase
"baz qux"
but nofoo
. - Doc 16 β has
foo
but neitherbar
nor the phrase. - Others β lack
foo
or bothbar
and the phrase.
Query 2
Input:
NOT (foo AND (bar OR "baz qux"))
De Morganβs:
NOT (A AND (B OR C))
β(NOT A) OR ((NOT B) AND (NOT C))
Expected matches: Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16
- Docs without
foo
are included (2,3,4,5,6,7,8,9,10,12,14,15). - Doc 11 β has the phrase but no
foo
β included. - Doc 16 β has
foo
but neitherbar
nor the phrase β included.
Expected non-matches: Doc 1, 13
- Doc 1 β has
foo
andbar
β excluded. - Doc 13 β has
foo
andbar
β excluded.
- Doc 1 β title:
- πͺπΈSpain tuwebo
Clarifying the test use case nested negation and grouping for user's input:
foo AND (NOT bar OR (baz NOT "qux quux"))
Dataset (indexed docs)
Matching across title + body.
"qux quux"
is an exact phrase.- Doc 1 β title:
foo who bar going
body:We are in fux their baz
- Doc 2 β title:
bar project
body:Nothing else mentioned here.
- Doc 3 β title:
cogito ergo sum
body:""
- Doc 4 β title:
pos
body:""
- Doc 5 β title:
neg
body:""
- Doc 6 β title:
quoted pos with -minus
body:""
- Doc 7 β title:
quoted neg
body:""
- Doc 8 β title:
η₯ε₯ε·ηγι£ζΊ
body:""
- Doc 9 β title:
buz
body:""
- Doc 10 β title:
qux
body:""
- Doc 11 β title:
baz qux
body:""
- Doc 12 β title:
core grunt
body:""
- Doc 13 β title:
foo
body:body has bar
- Doc 14 β title:
baz
body:body contains qux quux phrase
- Doc 15 β title:
bar
body:fux here
- Doc 16 β title:
foo buz
body:""
Expected interpretation
The query requires foo AND one of the following:
- NOT bar, or
- baz AND NOT the exact phrase
"qux quux"
.
Expected matches
Doc 1, Doc 16
- Doc 1 β has
foo
;NOT bar
is false (it hasbar
), but it also hasbaz
and does not have the phrase"qux quux"
β the second branch of the OR is true. - Doc 16 β has
foo
; does not havebar
βNOT bar
branch is true.
Expected non-matches
Doc 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
- Docs 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15 β lack
foo
, so they fail the outerfoo AND (β¦)
. - Doc 13 β has
foo
but also hasbar
(soNOT bar
is false) and lacksbaz
(sobaz AND NOT "qux quux"
is false) β inner OR fails.
Note:
(baz NOT "qux quux")
meansbaz
ANDNOT "qux quux"
(no De Morgan transformation is applied here). - Doc 1 β title:
- πͺπΈSpain tuwebo
A search_api Database::createKeysQuery flowchart (to be reviewed, I've just vive coded it)
flowchart TD A[Start createKeysQuery] --> B{Are keys an array} B -- No --> B1[Wrap as AND with single key] B -- Yes --> C[Init flags and buckets] B1 --> C C --> D[Partition keys into words nested negated] D --> E[Compute counts and flags] E --> F{Has words} F -- No --> J F -- Yes --> G[Create base select on text table] G --> G1{Choose selected columns} G1 -- neg nested or overall neg --> G1a[Select item_id and word] G1 -- partial or prefix --> G1b[Select item_id then sum score later] G1 -- not nested fast path --> G1c[Select item_id and score] G1 -- otherwise --> G1d[Select item_id score word] G1a --> H G1b --> H G1c --> H G1d --> H H{Matching mode} H -- exact --> H1[Filter t.word IN words] H -- partial or prefix --> H2[Build OR of t.word LIKE add CASE aliases group by] H1 --> I[Filter field_name IN fulltext fields] H2 --> I I --> J{Has positive nested groups} J -- No --> L J -- Yes --> K[For each nested recurse add constants if needed union] K --> L{Have db_query and not fast path} L -- No --> N L -- Yes --> M[Nest subquery select item_id then if not neg add sum score and group by then if AND and subs greater than 1 add having rules] M --> N{Has negated groups} N -- No --> Q N -- Yes --> O{Need new base none built or conjunction is OR} O -- Yes --> O1[If prior query exists save as old query then new base on index table select item_id then if not neg add const score and distinct] O -- No --> P O1 --> P P{Negation conjunction} P -- AND --> P1[Attach NOT IN conditions] P -- OR --> P2[Create OR group and attach] P1 --> R P2 --> R R[For each negated group recurse ensure nested returns one column item_id then add NOT IN condition] R --> S{Old query set} S -- Yes --> S1[Project old query to select item_id then add IN condition] S -- No --> T S1 --> T T{neg nested} T -- Yes --> U[Final projection select item_id only] T -- No --> V[Return db query] U --> V
- πͺπΈSpain tuwebo
A search_api Database::createKeysQuery flowchart focusing on negated groups (to be reviewed, I've just vive coded it). Attached image.
flowchart TD A[Any negated groups] --> B{Need new base no db query or OR} B -- Yes --> C[If prior query exists save as old query] C --> D[Create base on index table select item id] D --> E[If not overall negation add constant score and distinct] B -- No --> F[Use existing query] E --> G{Conjunction type} F --> G G -- AND --> H[Attach NOT IN conditions to base] G -- OR --> I[Create OR group and attach] H --> J[For each negated group build nested query ensure single column item id add NOT IN] I --> J J --> K{Old query set} K -- Yes --> L[Project old query to item id then add IN condition] K -- No --> M[No action] L --> N[Return query] M --> N[Return query]
- πͺπΈSpain tuwebo
I really appreciate your review, thank you so much for the fast responses, everything seems Ok for me. I've added some documentation in my previous comments and review a couple of "false negatives" tests. I think it is working fine and all tests are validating now.