Processing items gets progressively slower as it proceeds

Created on 16 April 2024, 7 months ago
Updated 1 August 2024, 4 months ago

Problem/Motivation

I was running a script on a batch of 16,021 items.

My terminal session had disconnected from our PR environment running on Tugboat. I opened another terminal instance via the browser and ran a script with codit-batch-operations via Drush using the following:
`drush codit-batch-operations:run {MyScriptName}`

It restarted where it had left off, at 4,576/16,021. It was running very slowly, with several seconds between each item showing as processed.

  • I thought it might be my browser (Safari), so I tried Chrome and had the same issue.
  • Then, I thought it might be my computer (for whatever reason), and I restarted. The issue persisted.
  • Then, I rebuilt the Tugboat environment and started it from the beginning.

At first, it was going well, and I started timing different periods to see how long it took to process 1% of the total. Here's how the numbers looked:

  • 5 - 6%: 2:30 min
  • 9 - 10%: 4:37 min
  • 11 - 12%: 5:24 min

After 12%, I stopped it, assured that I'd only encounter the same issue again.

Steps to reproduce

  1. Get a large number of items to process, at least 2,000
  2. Process them
  3. Note how long it takes to process each 1% increment for the first 10% (see the timing sketch after this list)
  4. Confirm that each successive increment takes longer
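
A rough way to capture those timings from inside a script (a minimal sketch; $items, the per-1% math, and the printf output are placeholders, not part of codit_batch_operations):

    // Hypothetical timing loop: report elapsed time for each 1% of items.
    $total = count($items);
    $onePercent = (int) ceil($total / 100);
    $start = microtime(TRUE);

    foreach (array_values($items) as $i => $item) {
      // ... process the item here ...
      if (($i + 1) % $onePercent === 0) {
        $elapsed = microtime(TRUE) - $start;
        printf("%d%% complete, last 1%% took %.1f seconds\n", intdiv($i + 1, $onePercent), $elapsed);
        $start = microtime(TRUE);
      }
    }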

Proposed resolution

Cap logs at 1,000 items and start another log of the same run, distinguished by a consistent nomenclature.

Currently, log IDs and log runs are synonymous.

For example: admin/config/development/batch_operations/log/6 is the 6th run of any script by Codit: Batch Operations

I think we need to make it so runs have an ID number, but each log for that run is distinguished by a second number; a rough sketch of the scheme follows the examples below.

For example:
- admin/config/development/batch_operations/log/6_1 would be the 6th run of any script, but only the first 1,000 items.
- admin/config/development/batch_operations/log/6_2 would be the 6th run, too, but would consist of the next 1,000 items.
- etc.
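
A sketch of how that segmented log key could be derived (the function name, parameters, and the 1,000 cap are illustrative assumptions, not existing module code):

    // Hypothetical helper: derive a "runId_segment" log key, capping each log at 1,000 items.
    function buildLogKey(int $runId, int $itemNumber, int $cap = 1000): string {
      // Items 1-1000 land in segment 1, items 1001-2000 in segment 2, and so on.
      $segment = (int) ceil($itemNumber / $cap);
      return $runId . '_' . $segment;
    }

    // buildLogKey(6, 500)  => "6_1"
    // buildLogKey(6, 1001) => "6_2"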

πŸ› Bug report
Status

Active

Version

1.0

Component

Code

Created by

πŸ‡ΊπŸ‡ΈUnited States skyriter


Comments & Activities

  • Issue created by @skyriter
  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    > It had disconnected to our PR environment running on Tugboat. I opened up another terminal instance via the browser and ran a script with codit-batch-operations via drush

    Doesn't that result in two batch operations running at the same time, or did you confirm that the first had stopped?

    The time may be related to bulk operations: the log will get bigger and bigger, and it constantly has to save it... but it may be something else, because when I used one of the test scripts to do nothing except log 10,000 things, I did not see it get significantly slower.

    It may be something goofy with the migration class? Does using the original migration before converting to CBO also slow down?

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    I just tried it by altering TestDo10Things to do 20,000, so it logged a lot of things, and I saw no degradation of speed. That tells me it is something that is happening in your specific processOne().
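
    For reference, a no-op test along those lines might look roughly like this (a sketch only; the exact processOne() signature should be checked against the module's base class, and the message is just an example):

    public function processOne(string $key, mixed $item, array &$sandbox): string {
      // No entity loads or saves: only return a message that gets logged,
      // so any slowdown seen here would come from the logging alone.
      return "Processed item {$key} (no-op).";
    }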

  • πŸ‡ΊπŸ‡ΈUnited States skyriter

    When I try to rerun it, the codit code returns a message to delete the running instance first, so I did that.

    But the time I ran it after rebuilding the environment, I did not rerun it, and it still slowed down.

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    > When I try to rerun it, the codit code returns a message to delete the running instance first, so I did that.

    That delete command that it tells you to use just deletes the record of it running. It does not actually stop the thing. I am not sure how it would stop it... that is a song for another time :)

  • πŸ‡ΊπŸ‡ΈUnited States skyriter

    Because we are pushing changed data to a queue to send to an API, we found that a significant source of slowdown was a deduplication process that makes sure we don't add duplicate data to the queue. Turning off that process, which we can do from a config form on our site, made a huge difference.

    Running this locally in DDEV, before I turned the deduping off, I was getting the following performance as the script processed ~16,000 items (broken out by percentage band of completion):

    • 4 - 5%: 5:11 minutes (at around 30 row/min)
    • 7 - 8%: 7:02 minutes (at around 25 row/min)
    • 8 - 9%: 7:19 minutes (at around 25 row/min)
    • 9 - 10%: 8:28 minutes (at around 17 row/min)

    After I turned off deduping, I saw this performance:

    • 38 - 39%: 1:00 minute
    • 39 - 40%: 1:00 minute
    • 40 - 41%: 1:00 minute
    • 61 - 62%: 1:45 minutes
    • 62 - 63%: 1:49 minutes

    Note that we're still seeing a slowdown, but not nearly as extreme. The changed content is still being added to our queue to push to an API, as it would be in production. If we didn't have content that did so, we would probably see better performance.

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    I do not believe it is the size of the log that is slowing things down. I did a test with 20,000 log items (but not doing any actual node manipulation) and there was little to no increase in time.

    I think you may be running into other Drupal processes that run post node save and accumulate (for example: search indexing).

    The test would be to comment out the actual node saves in your script, and see how it compares.
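
    A sketch of that experiment, assuming a processOne() that loads and saves nodes (the signature, field name, and messages are placeholders, not your actual script):

    public function processOne(string $key, mixed $item, array &$sandbox): string {
      $node = \Drupal\node\Entity\Node::load($item);
      if ($node === NULL) {
        return "Node {$key} not found.";
      }
      $node->set('field_example', 'new value');
      // Temporarily skip the save to see whether post-save processes
      // (search indexing, queueing, etc.) are what slows things down.
      // $node->save();
      return "Would have saved node {$key}.";
    }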

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    I just created a test file with long, multi-sentence text entries, one per line. It contains 100,000 of them. The file size is only 18.6 MB.

    I do not think the size of the log is the main limiting factor we are seeing here, though I think it warrants adding another test migration that creates a long log and inserts time entries so we can keep an eye on it.

  • πŸ‡ΊπŸ‡ΈUnited States skyriter

    After more research into the codit_batch_operations script behavior and that of a similar previously-written custom script that does the same thing, I have to concur. I see no evidence that it has anything to do with logging. Based on conversations @swirt and I have had privately, I'm convinced by his initial supposition that it was about other processes associated with saving the nodes that begin to stack up.

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    I just added a 10,000 log item test that addresses the timing due to the growing log.
    ✨ Add long log test script with time checks Fixed
    General conclusion: it is not the logging process that is the time vampire. The logging is pretty performant.

    Can we close this ticket?

  • πŸ‡ΊπŸ‡ΈUnited States skyriter

    Even though I agree we've ruled out logging, the slowdown is still a legitimate phenomenon, at least when running the script on my content. I suspect this will be experienced by others with mature platforms. I would like to keep this ticket open to investigate a possible mitigation.

    If it turns out that some combination of sleep() and resuming after a certain number of items shortens the overall execution time significantly, we could add a feature request issue to add config in the UI to pause after a certain number of items and set the length of the pause. We could even provide some guidance while leaving the option as an advanced setting, perhaps.

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    I think having a default pause that occurs for all BatchOperations is a bit of overkill or at least opinionated for those who may have scripts that don't need it.

    There is a possibility of tying it to batch size in some way, or of adding an optional callback in the script that defines the cooldown period (defaults to 0).

    I am taking off the 'critical' label as I don't think this is a blocker for anyone. If someone needs their script to have a cooldown, it could be added to their processOne() simply enough.
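
    For instance, a script-specific cooldown could look something like this inside processOne() (the counter key, threshold, and sleep length are arbitrary placeholders):

    public function processOne(string $key, mixed $item, array &$sandbox): string {
      // ... do the real work for this item here ...

      // Hypothetical cooldown: pause for 30 seconds after every 500 items.
      $sandbox['processed'] = ($sandbox['processed'] ?? 0) + 1;
      if ($sandbox['processed'] % 500 === 0) {
        sleep(30);
      }
      return "Processed {$key}.";
    }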

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    I am thinking something along the lines of getBatchSize() returning either an int, like it does now, or optionally an array with a size element and a cooldown period.

    So it would continue to support this format

    public function getBatchSize() {
      return 5;
    }
    

    but it would also support

    public function getBatchSize() {
      return [
        // The number of items in the batch.
        'size' => 5,
        // The number of seconds to sleep between batches.
        'cooldown' => 60,
      ];
    }
    

    The additional logic could all be worked into addBatchSeparator(), which is already called at the right time, and getBatchSize(), which would now need to see what kind of data it is dealing with and respond accordingly.
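
    A rough sketch of how that could be handled (the helper name and defaults are assumptions, not committed code):

    // Hypothetical helper: normalize getBatchSize()'s return value, which may
    // be an int or an array, into a consistent size/cooldown pair.
    protected function getBatchSettings(): array {
      $raw = $this->getBatchSize();
      if (is_array($raw)) {
        return [
          'size' => (int) ($raw['size'] ?? 100),
          'cooldown' => (int) ($raw['cooldown'] ?? 0),
        ];
      }
      return ['size' => (int) $raw, 'cooldown' => 0];
    }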

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    Skyriter and I discussed this on a call.

    He found that sleep() paused the process, but when processing resumed after sleep(), it was still slow. It did not speed up the way it does when the process is actually killed and restarted. This leads me to think that sleep() does not allow post-processing (like search indexing...) to run, because the sleeping process is still technically holding the resources and preventing other processes from running, so they stack up.

    What we discussed as an option is that we could connect the batch size to how much it does before calling it quits, picking up the next batch when cron runs. This would likely solve the problem, but it would add uncertainty about when the next batch would run and when the entire BatchOperation would be considered complete.

    If this is the chosen path, there are two issues that would be blocking it:
    1. ✨ Add an option to run a BatchOperation via cron Active
    2. ✨ Add base fields to BatchOpLog to save items to process, and last item processed Active

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    In pondering this some more, I think this occurs under the following conditions:

    • Really large size (thousands of entity saves)
    • Site with processes that take place post entity save (indexing, queueing...)
    • Restricted to methods that don't use a proper batch. Proper batch calling, as opposed to simulated batch calling, would be:
      • run BatchOperation with cron
      • run BatchOperation with drush command

    Might be resolvable by ✨ Consider a queue approach Active

  • πŸ‡ΊπŸ‡ΈUnited States swirt Florida

    An entity save might also benefit from isSyncing being set to TRUE.
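
    For a content entity that supports syncing, that would look something like this (a sketch; $node stands in for whatever entity the script saves):

    // Mark the save as a sync so hooks and subscribers that check isSyncing()
    // can skip expensive post-save work.
    $node->setSyncing(TRUE);
    $node->save();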
