DeleteQuery for orphaned child documents might run into "too many boolean clauses"

Created on 26 July 2022, over 2 years ago
Updated 15 November 2023, about 1 year ago

Hello!

We have a Drupal 9 website with around 1.4 million documents in its index. Indexing the site is slow. When we checked the code, we found that during reindexing the module calls the deletebyquery function, which takes a long time on the Solr servers (roughly 60 to 90 seconds).

The problem with deletebyquery is that it scans all documents, and the fields within them, to find the documents that need to be deleted. When the number of documents is high, this request takes more time and resources to process. In our case there are 1.4 million documents, so the impact on the indexing process is large.
By comparison, deletebyid is much faster. The search_api_solr module uses both deletebyquery and deletebyid in SearchApiSolrBackend.php.
Ref: https://git.drupalcode.org/project/search_api_solr/-/blob/4.x/src/Plugin...

We propose that, instead of calling both functions, the module first checks whether the document has child documents, collects their ids, and calls deletebyid to delete them.
Also, sometimes _root_ doesn't have any document ids, so before passing _root_ we need to check whether it has any documents.
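
The proposal above could be sketched roughly as follows. This is only an illustration in Python, not the module's PHP code: `child_lookup_params` and `delete_by_id_payload` are hypothetical helper names, and the sketch assumes the child ids are looked up through the `_root_` field and then deleted with Solr's JSON update message `{"delete": [...]}`. No request is sent; the functions only build the parameters and the payload.

```python
import json
from typing import Iterable, Optional


def child_lookup_params(parent_id: str, rows: int = 1000) -> dict:
    """Build select parameters that list the ids of documents whose _root_ is parent_id.

    Hypothetical helper: the rows default is an arbitrary choice for this sketch.
    """
    # Escape backslashes and double quotes so the id is safe inside a quoted phrase.
    escaped = parent_id.replace("\\", "\\\\").replace('"', '\\"')
    return {"q": f'_root_:"{escaped}"', "fl": "id", "rows": rows}


def delete_by_id_payload(ids: Iterable[str]) -> Optional[str]:
    """Build a delete-by-id update message, or None when there is nothing to delete.

    Returning None is the "check first" step: the caller skips the request
    entirely instead of sending an empty delete.
    """
    # Drop empty values and duplicates while keeping the original order.
    unique = list(dict.fromkeys(i for i in ids if i))
    if not unique:
        return None
    return json.dumps({"delete": unique})
```

A caller would run the lookup first, pass the returned ids to `delete_by_id_payload`, and only post to the update handler when the result is not `None`.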

Thank you!!

🐛 Bug report
Status

Fixed

Version

4.0

Component

Code

Created by

🇮🇳 India mpotdar


Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • 🇩🇪 Germany mkalkbrenner 🇩🇪
  • Status changed to Fixed over 1 year ago
  • 🇩🇪 Germany mkalkbrenner 🇩🇪
  • Automatically closed - issue fixed for 2 weeks with no activity.

  • Status changed to Fixed about 1 year ago
  • 🇺🇸 United States pdcarto

    I'm not sure that this actually fixed the problem, or possibly I'm seeing a different problem. I see a `deleteItems` task in `search_api_task` with 1680 ids. Solr fails with a "too many boolean clauses" message.

    In my case, a parent object is being deleted (a pdf file), spawning the deletion of the indexed hocr text for each of its 1680 children (pages).

    I tried editing `maxBooleanClauses`, setting it to a very big number (default is 1024). Initially I edited and re-installed `solrconfig_query.xml` and restarted solr, which had no impact. Then I found search_api_solr's `search_api_solr.solr_cache.cache_queryresult_default_7_0_0` configuration and changed it there, which again had no impact.

    It seems to me that there is one problem here with two possible solutions:

    1. Figure out how to make solr use and honor the `maxBooleanClauses` setting.
    2. Actually chunk the solr queries (notwithstanding the changes in 8e3cf13e, solr doesn't seem to actually be splitting the huge number of booleans into separate queries)
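
A minimal sketch of what option 2 could look like, written in Python for illustration only. It is not the code from 8e3cf13e or from search_api_solr: `chunked` and `root_delete_queries` are hypothetical names, the batch size of 500 is an arbitrary value chosen to stay below the 1024 default, and it assumes the children are deleted with one `_root_:(... OR ...)` query per batch.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` ids."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def root_delete_queries(ids: Sequence[str], max_clauses: int = 500) -> List[str]:
    """Build one delete query per batch so no single query exceeds max_clauses terms.

    Assumption for this sketch: each id becomes one boolean clause.
    """
    queries = []
    for batch in chunked(ids, max_clauses):
        terms = " OR ".join(quote(item_id) for item_id in batch)
        queries.append(f"_root_:({terms})")
    return queries
```

With the 1680 ids from the task above and a batch size of 500, this produces four queries (500, 500, 500 and 180 terms), each of which would be sent as its own delete request.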