Massively improve batch performance by reducing the number of queries

Created on 29 January 2025

Problem/Motivation

In \Drupal\entity_usage\EntityUsageBatchManager::doBulkRevisionable and \Drupal\entity_usage\EntityUsageBatchManager::doBulkNonRevisionable we select the entity revision IDs on every iteration of the batch. On massive tables these queries turn out to be the bulk of the time spent. Let's just add a massive array of IDs to the batch context instead of repeating the queries.

Proposed resolution

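Load the relevant entity (revision) IDs once, store them in the batch context, and work through that list on each batch iteration instead of re-querying. A minimal sketch of the idea, not the code from the merge request - the function name, entity type, and chunk size are illustrative:

  <?php
  // Batch operation callback: query the revision IDs once, stash them in
  // the batch sandbox, and process a slice per iteration.
  function entity_usage_bulk_sketch(array &$context): void {
    $storage = \Drupal::entityTypeManager()->getStorage('paragraph');

    if (!isset($context['sandbox']['revision_ids'])) {
      // One query up front instead of one per batch iteration.
      $context['sandbox']['revision_ids'] = array_keys($storage->getQuery()
        ->allRevisions()
        ->accessCheck(FALSE)
        ->execute());
      $context['sandbox']['total'] = count($context['sandbox']['revision_ids']);
    }

    // Process the next chunk of pre-loaded IDs; no query needed here.
    foreach (array_splice($context['sandbox']['revision_ids'], 0, 50) as $revision_id) {
      $revision = $storage->loadRevision($revision_id);
      // ... record usage for this revision ...
    }

    $remaining = count($context['sandbox']['revision_ids']);
    $context['finished'] = $context['sandbox']['total']
      ? 1 - $remaining / $context['sandbox']['total']
      : 1;
  }
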
Remaining tasks

User interface changes

None

API changes

None

Data model changes

None

πŸ“Œ Task
Status

Active

Version

2.0

Component

Code

Created by

πŸ‡¬πŸ‡§United Kingdom alexpott πŸ‡ͺπŸ‡ΊπŸŒ

Comments & Activities

  • Issue created by @alexpott
  • Merge request !105 Improved batch processing (Merged) created by alexpott
  • Pipeline finished with Failed
    24 days ago
    Total: 441s
    #409244
  • Pipeline finished with Success
    24 days ago
    Total: 357s
    #409247
  • πŸ‡¬πŸ‡§United Kingdom alexpott πŸ‡ͺπŸ‡ΊπŸŒ

    This has a massive impact - processing 2,000 paragraph revisions on the site went from 50+ seconds to 1.3 seconds. There are 2,000,000+ paragraph revisions on the site.

  • πŸ‡©πŸ‡ͺGermany chr.fritsch πŸ‡©πŸ‡ͺπŸ‡ͺπŸ‡ΊπŸŒ

    Just incredible @alexpott πŸ₯³

  • πŸ‡¬πŸ‡§United Kingdom alexpott πŸ‡ͺπŸ‡ΊπŸŒ

    So we need to be careful here: 4 million integers in a PHP array take up about 64 MB, and 1 million takes about 16 MB. We should assume that Drush is being run in an environment with 512 MB of memory, so I think we should load 500,000 in one go and maybe make it configurable.
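
    A rough way to sanity-check that estimate, assuming a packed PHP array (a plain list of integers; hashed arrays such as keyed entity query results cost more per element):

      <?php
      // Each element of a packed PHP array is a 16-byte zval, so one
      // million integers should come in around 16 MB (~64 MB for 4 million).
      $before = memory_get_usage();
      $ids = range(1, 1000000);
      $after = memory_get_usage();
      printf("%.1f MB for 1,000,000 integers\n", ($after - $before) / 1024 / 1024);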

  • Pipeline finished with Success
    23 days ago
    Total: 216s
    #409796
  • Pipeline finished with Canceled
    23 days ago
    Total: 74s
    #409798
  • Pipeline finished with Success
    23 days ago
    Total: 249s
    #409800
  • Pipeline finished with Success
    23 days ago
    Total: 216s
    #409823
  • πŸ‡¬πŸ‡§United Kingdom alexpott πŸ‡ͺπŸ‡ΊπŸŒ
  • First commit to issue fork.
  • Pipeline finished with Skipped
    22 days ago
    #410956
  • πŸ‡ͺπŸ‡ΈSpain marcoscano Barcelona, Spain

    This is indeed a great idea. Thanks for contributing! πŸ‘

      /**
       * The number of IDs to load when in bulk mode.
       */
      const BULK_ID_LOAD = 1000000;
    

    @alexpott in your testing, did you see a significant impact from changing this value? I did a quick check locally and it does seem to have a meaningful effect on how fast the batch processes, so I am wondering if it makes sense to make this configurable. I don't think we need to expose this in the UI, but a settings value that people could define/override per environment (falling back to the constant) makes sense to me. In any case, we can make that improvement in a follow-up.

    For now this looks good to go from me. Thanks again!
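
    One possible shape for that settings override, as a sketch (the key 'entity_usage_bulk_id_load' is a hypothetical name, not an existing API):

      <?php
      use Drupal\Core\Site\Settings;
      use Drupal\entity_usage\EntityUsageBatchManager;

      // In settings.php: $settings['entity_usage_bulk_id_load'] = 500000;
      // Fall back to the class constant when no override is set.
      $bulk_id_load = Settings::get('entity_usage_bulk_id_load', EntityUsageBatchManager::BULK_ID_LOAD);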

  • πŸ‡¬πŸ‡§United Kingdom alexpott πŸ‡ͺπŸ‡ΊπŸŒ

    I selected defaults that worked in an environment with millions of revisions and only 512 MB available to PHP, and ran Drush to rebuild the table. I agree that these could be configurable - it'd be nice if it were both config and a Drush option, because then you could tweak it when rebuilding via the UI as well as on the command line. Feel free to open an issue about that - I'll get round to it, but I'm not sure when as it won't be a priority :)

  • πŸ‡¬πŸ‡§United Kingdom alexpott πŸ‡ͺπŸ‡ΊπŸŒ

    @marcoscano I created the follow-up - see 📌 Make batch performance constants configurable (Active)

  • Automatically closed - issue fixed for 2 weeks with no activity.
