- Issue created by @skyriter
- 🇺🇸United States swirt Florida
> It had disconnected from our PR environment running on Tugboat. I opened up another terminal instance via the browser and ran a script with codit-batch-operations via drush
Doesn't that result in two batch operations running at the same time, or did you confirm that the first had stopped?
The time may be related to bulk operations: the log gets bigger and bigger, and it constantly has to be saved... but it may be something else, because when I used one of the test scripts to do nothing but log 10,000 things, I did not see it get significantly slower.
Could it be something goofy with the migration class? Does running the original migration, from before it was converted to CBO, also slow down?
- 🇺🇸United States swirt Florida
I just tried it by altering TestDo10Things to do 20,000, so it logged a lot of things, and I saw no degradation of speed. That tells me it is something happening in your specific processOne().
- 🇺🇸United States skyriter
When I try to rerun it, the codit code returns a message telling me to delete the running instance first, so I did that.
But the time I ran it right after the environment rebuild, I did not rerun anything, and it slowed down nevertheless.
- 🇺🇸United States swirt Florida
> When I try to rerun it, the codit code returns a message telling me to delete the running instance first, so I did that.
That delete command that it tells you to use just deletes the record of it running. It does not actually stop the thing. I am not sure how it would stop it... that is a song for another time :)
- 🇺🇸United States skyriter
Because we are pushing changed data to a queue to send to an API, we found that a source of significant slowdown was a deduplication process that makes sure we don't add duplicate data to the queue. Turning off that process, which we control from a config form on our site, made a huge difference.
Running this locally in DDEV, before I turned the deduping off, I was getting the following performance as the script processed ~16,000 items (broken out by percentage bands of completion):
- 4 - 5%: 5:11 minutes (at around 30 rows/min)
- 7 - 8%: 7:02 minutes (at around 25 rows/min)
- 8 - 9%: 7:19 minutes (at around 25 rows/min)
- 9 - 10%: 8:28 minutes (at around 17 rows/min)
After I turned off deduping, I saw this performance:
- 38 - 39%: 1:00 minute
- 39 - 40%: 1:00 minute
- 40 - 41%: 1:00 minute
- 61 - 62%: 1:45 minutes
- 62 - 63%: 1:49 minutes
Note that we're still seeing a slowdown, just not nearly as extreme. The changed content is still being added to our queue to push to an API, as it would be in production. If we didn't have content that did so, we would probably see better performance.
- 🇺🇸United States swirt Florida
I do not believe it is the size of the log that is slowing things down. I did a test with 20,000 log items (but not doing any actual node manipulation) and there was little to no increase in time.
I think you may be running into other Drupal processes that run post node save and accumulate (for example: search indexing).
The test would be to comment out the actual node saves in your script and see how it compares.
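A hedged sketch of that test (the processOne() signature here is an assumption and may not match your version of codit_batch_operations; the field name is made up for illustration):

```php
public function processOne(string $key, mixed $item, array &$sandbox): string {
  // Load and alter the node exactly as the real script does.
  $node = \Drupal\node\Entity\Node::load($item);
  $node->set('field_example', 'updated value');

  // Skip the save for the test run. If the slowdown disappears, the
  // cost lives in post-save processes (search indexing, queue workers,
  // etc.) rather than in the alteration or in the logging.
  // $node->save();

  return "Dry-ran node {$key}.";
}
```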
- 🇺🇸United States swirt Florida
I just created a test file with long, multi-sentence text entries per line. It contains 100,000 of them. The file size is only 18.6 MB.
I do not think the size of the log is the main limiting factor we are seeing here. Though I think it warrants adding another test migration to the tests, one that creates a long log and inserts time entries so we can keep an eye on it.
- 🇺🇸United States skyriter
After more research into the codit_batch_operations script behavior and that of a similar previously-written custom script that does the same thing, I have to concur. I see no evidence that it has anything to do with logging. Based on conversations @swirt and I have had privately, I'm convinced by his initial supposition that it was about other processes associated with saving the nodes that begin to stack up.
- 🇺🇸United States swirt Florida
I just added a 10,000 log item test that addresses the timing due to the growing log.
✨ Add long log test script with time checks Fixed
General conclusion: it is not the logging process that is the time vampire. The logging is pretty performant. Can we close this ticket?
- 🇺🇸United States skyriter
Even though I agree we've ruled out logging, the slowdown is still a legitimate phenomenon, at least when running the script on my content. I suspect this will be experienced by others with mature platforms. I would like to keep this ticket open to investigate a possible mitigation.
If it turns out that some combination of sleep() and resuming after a certain number of items significantly shortens the overall execution time, we could add a feature request issue to add config in the UI to pause after a certain number of items and to set the length of the pause. We could even provide some guidance while leaving the option as an advanced setting, perhaps.
- 🇺🇸United States swirt Florida
I think having a default pause that occurs for all BatchOperations is a bit of overkill, or at least opinionated for those who may have scripts that don't need it.
There is a possibility of tying it to batch size in some way, or of an optional callback in the script that defines the cooldown period (defaulting to 0).
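A hypothetical shape for that callback (the method name and default are illustrative only; nothing like it exists in the module yet):

```php
/**
 * Hypothetical optional cooldown callback on a BatchOperations script.
 */
public function getCooldownSeconds(): int {
  // Seconds to sleep between batches; 0 means no cooldown.
  return 0;
}
```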
I am taking off the 'critical' label as I don't think this is a blocker for anyone. If someone needs their script to have a cooldown, it could be added to their processOne() simply enough.
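A minimal sketch of doing it inside processOne(), assuming a hypothetical counter property on the script class (the signature is again an assumption):

```php
public function processOne(string $key, mixed $item, array &$sandbox): string {
  // ... the real per-item work goes here ...

  // Hypothetical cooldown: every 500 items, pause for 30 seconds in
  // the hope of letting accumulated post-save work drain.
  $this->itemsSinceCooldown++;
  if ($this->itemsSinceCooldown >= 500) {
    sleep(30);
    $this->itemsSinceCooldown = 0;
  }

  return "Processed {$key}.";
}
```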
- 🇺🇸United States swirt Florida
I am thinking something along the lines of getBatchSize() returning either an int, like it does now, or optionally an array with a size element and a cooldown period.
So it would continue to support this format:

```php
public function getBatchSize() {
  return 5;
}
```
but it would also support:

```php
public function getBatchSize() {
  return [
    // The number of items in the batch.
    'size' => 5,
    // The number of seconds to sleep between batches.
    'cooldown' => 60,
  ];
}
```
The additional logic could all be worked into addBatchSeparator(), which is already called at the right time, and getBatchSize(), which would now need to see what kind of data it is dealing with and respond accordingly.
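A sketch of how the consuming code might normalize the two return shapes (the helper name is hypothetical, not the module's actual internals):

```php
/**
 * Normalizes getBatchSize()'s int-or-array return (hypothetical helper).
 */
protected function normalizeBatchSettings($script): array {
  $raw = $script->getBatchSize();
  if (is_array($raw)) {
    return [
      'size' => (int) ($raw['size'] ?? 1),
      'cooldown' => (int) ($raw['cooldown'] ?? 0),
    ];
  }
  // A plain int keeps the existing behavior: no cooldown.
  return ['size' => (int) $raw, 'cooldown' => 0];
}
```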
- 🇺🇸United States swirt Florida
Skyriter and I discussed this on a call.
He found that sleep() paused the process, but when processing resumed after the sleep(), it was still slow. It did not speed up the way it does when the process is actually killed and restarted. This leads me to think that sleep() does not allow post-processing (like search indexing...) to run, because the sleeping request is still technically holding the resources and preventing other processes from running, so they stack up.
What we discussed as an option is connecting the batch size to how much the script does before calling it quits and picking up the next batch when cron runs. This would likely solve the problem, but it would add uncertainty about when the next batch would run and when the entire BatchOperation would be considered complete.
If this is the chosen path, there are two issues that would be blocking it:
1. ✨ Add an option to run a BatchOperation via cron Active
2. ✨ Add base fields to BatchOpLog to save items to process, and last item processed Active
- 🇺🇸United States swirt Florida
In pondering this some more, I think this happens under the following conditions:
- Really large size (thousands of entity saves)
- Site with processes that take place post entity save (indexing, queueing...)
- Restricted to run methods that use a simulated batch rather than a proper Drupal batch:
  - running the BatchOperation with cron
  - running the BatchOperation with a drush command
Might be resolvable by ✨ Consider a queue approach Active
- 🇺🇸United States swirt Florida
An entity save might also benefit from isSyncing being set to TRUE.
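For illustration (a minimal sketch: the hook and module name are hypothetical, and this only pays off where the post-save code actually checks the flag):

```php
// In the script's per-item processing: flag the save as syncing so
// downstream code can detect it.
$node->setSyncing(TRUE);
$node->save();

/**
 * Implements hook_ENTITY_TYPE_update() for node (hypothetical module).
 */
function mymodule_node_update(\Drupal\node\NodeInterface $node): void {
  if ($node->isSyncing()) {
    // Skip the expensive post-save reaction for flagged bulk saves.
    return;
  }
  // ... normal reaction, e.g. enqueue the change for the API push ...
}
```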