Ability to resume a queue operation from where it failed

Created on 26 October 2023, 11 months ago
Updated 21 November 2023, 10 months ago

Problem/Motivation

If a queue operation (orange-dam:queue-content-by-type or orange-dam:queue-content-all, for example) fails for any reason partway through, the next call to that operation starts over from the beginning.

Example

drush orange-dam:queue-content-by-type --type=Photograph
21,000 Photographs found
....
...
eight hours of queueing
...
...
19,000/21,000 queued...
FAILED! (exception/out of memory/container or session aborted)

We have to begin again at the beginning.

Steps to reproduce

1. Start a queue operation and let it queue some items.
2. At some point, interrupt the process, for example by pressing CTRL+C.
3. Start the queue operation again. It starts queueing from the beginning again.

This also happens with the --limit parameter. If I queue 1000 Photographs using --limit, and then queue another 1000 Photographs using --limit, those will be the same 1000 photographs.
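
To make the --limit behaviour concrete, here is a minimal Python sketch. It is not the module's code. It assumes the command fetches results in a fixed order and always starts at the first result; `fetch` and its `offset` argument are hypothetical names used only for illustration.

```python
def fetch(items, limit, offset=0):
    """Return up to `limit` items starting at `offset` (hypothetical helper)."""
    return items[offset:offset + limit]


# Stand-in for the 21,000 Photographs in the example above.
photographs = [f"photo-{n}" for n in range(1, 21001)]

# Two runs with only a limit start at the same place both times.
first_run = fetch(photographs, limit=1000)
second_run = fetch(photographs, limit=1000)
assert first_run == second_run  # the same 1,000 items are queued twice

# With a starting offset, the second run picks up where the first stopped.
resumed_run = fetch(photographs, limit=1000, offset=len(first_run))
assert resumed_run[0] == "photo-1001"
```

A limit without a remembered starting point can only ever cap the run; it cannot advance it.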

Proposed resolution

Crude, brittle, but maybe good enough:

1. Count the number of your content items in the queue table.
2. Divide that number by the items per page constant.
3. Resume your queue on the resulting page.

Advantages:
a. Easy to implement and may be good enough for most use cases.
b. No need to reset a counter if/when you want to re-queue everything.

Disadvantages:
a. If items are being migrated in the background, you may still end up queueing all or most of your items from the start.
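
The three steps above amount to one integer division. The Python sketch below is illustrative only: `ITEMS_PER_PAGE = 50` is an assumed value (the real constant lives in the module), `resume_page` is a hypothetical name, and pages are assumed to be numbered from zero.

```python
ITEMS_PER_PAGE = 50  # assumed value, for illustration only


def resume_page(queued_count, items_per_page=ITEMS_PER_PAGE):
    """Return the zero-based page on which to resume queueing.

    Integer division rounds down, so a partially queued page is fetched
    again rather than skipped.
    """
    if items_per_page <= 0:
        raise ValueError("items_per_page must be positive")
    return queued_count // items_per_page


# 19,000 items already in the queue table: pages 0-379 are complete.
assert resume_page(19000) == 380

# 19,020 items: page 380 was only partly queued, so it is repeated.
assert resume_page(19020) == 380

# The disadvantage: if a background migration has already consumed
# 18,500 of those items (assuming processed items leave the queue table),
# the count says to resume near the start.
assert resume_page(19000 - 18500) == 10
```

Rounding down errs on the side of re-queueing up to one page of items, which fits the "good enough" spirit of this approach.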

More sophisticated, but also problematic:

1. Store a page counter state (or date of last item queued) for each queue, and use that page number or date to restart the queue next time.

Advantage:
1. No matter what is going on elsewhere (migrations, item deletions etc), you still go back to the page/date you were last on.

Disadvantage:
1. If a queue fails, the page or date is set in state, and nobody notices that the queue failed, then on the next invocation a person might expect to start from page 1 and not notice that the queue is starting from page 72. We would also need a way to erase the stored page/date, such as drush orange-dam:queue-content-by-type --type=Photograph --init
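
A rough Python sketch of the stored-counter idea follows, assuming one-based pages and one counter per queue. Everything here is hypothetical: the `PageState` class, the JSON file, and the key format are illustration only, and `--init` is the flag proposed above, not an existing option. In a Drupal module the same idea would more likely sit on top of the State API than a file.

```python
import json
import tempfile
from pathlib import Path


class PageState:
    """Remember the last completed page for each queue in a JSON file."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        # A missing file simply means nothing has been stored yet.
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def next_page(self, queue_key):
        """Page to start on: one after the last completed page, else 1."""
        return self._load().get(queue_key, 0) + 1

    def mark_done(self, queue_key, page):
        """Record that `page` was fully queued for this queue."""
        state = self._load()
        state[queue_key] = page
        self.path.write_text(json.dumps(state))

    def reset(self, queue_key):
        """What a flag such as --init would do: forget the stored page."""
        state = self._load()
        state.pop(queue_key, None)
        self.path.write_text(json.dumps(state))


with tempfile.TemporaryDirectory() as tmp:
    state = PageState(Path(tmp) / "queue_state.json")
    key = "queue-content-by-type:Photograph"  # one counter per queue/type
    assert state.next_page(key) == 1   # nothing stored: start at page 1
    state.mark_done(key, 71)           # the run failed after page 71
    assert state.next_page(key) == 72  # the next run resumes at page 72
    state.reset(key)
    assert state.next_page(key) == 1   # after a reset, back to page 1
```

Keying the counter by command and type keeps different queue requests from overwriting each other's position. Printing the starting page at the beginning of every run would be one way to reduce the surprise described in the disadvantage.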

Overall, I am pretty comfortable with the quick and dirty queue table count approach, since it suits the specific use case for this feature request: I am queueing items for migration, something fails, and I want to resume. For this hands-on use case, the crude solution is fine and has the fewest hidden downsides.

But there might be a good way to store pages/date state without the side effects, in which case, that might be the better approach.

✨ Feature request

Status: Fixed
Version: 1.0
Component: Code
Created by: 🇺🇸 United States apotek


Comments & Activities

  • Issue created by @apotek
  • 🇺🇸 United States apotek
  • 🇺🇸 United States adamzimmermann

    Thank you for creating this issue and providing such a detailed request.

    I want to acknowledge that I 100% see the desire for this and the value, but I have two concerns off-hand.

    1. The complexity around managing the state of the current page/the page that should be resumed from and keeping that separate for different types of requests. This is solvable in my mind though.
    2. The issue around the result set changing between the initial request and the next request, which might completely alter the items in a given page of results depending on the sorting being used. This is potentially solvable if new items are only ever added to later pages, but that is something we need to determine based upon the sorting algorithm being used.

    We will have to think about this more when we get to this.
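
The second concern above, pages shifting when the result set changes, depends on sort order. The Python sketch below illustrates it under assumed conditions: items are identified by a number that grows with creation time, and `paginate` is a hypothetical helper. Which sort order the module actually uses is exactly the open question in the comment.

```python
def paginate(items, per_page):
    """Split a list into consecutive pages of `per_page` items."""
    return [items[i:i + per_page] for i in range(0, len(items), per_page)]


existing = [1, 2, 3, 4, 5, 6]  # ids in creation order
new_items = [7, 8]             # created between the two requests

# Oldest first: new items land on later pages, earlier pages are unchanged.
before_asc = paginate(sorted(existing), 2)
after_asc = paginate(sorted(existing + new_items), 2)
assert after_asc[:len(before_asc)] == before_asc

# Newest first: new items land on page 0 and push everything else back.
before_desc = paginate(sorted(existing, reverse=True), 2)
after_desc = paginate(sorted(existing + new_items, reverse=True), 2)
assert after_desc[0] != before_desc[0]

# Resuming at page index 2 after the shift re-reads what used to be page 1.
assert after_desc[2] == before_desc[1]
```

In this newest-first case additions cause overlap rather than gaps. Deletions would shift pages the other way and could cause items to be skipped.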

  • 🇺🇸 United States apotek

    Quoting @adamzimmermann:

        The issue around the result set changing between the initial request and the next request, which might completely alter the items in a given page of results depending on the sorting being used. This is potentially solvable if new items are only ever added to later pages, but that is something we need to determine based upon the sorting algorithm being used.

    Given the choice between re-queueing 1 million items from the start whenever the process is interrupted and possibly having some paging overlap or interstitial changes (which could be queued later with a date-based query), I would rather re-queue and re-migrate a few hundred items than a million. I think even a "sloppy" resume is worth it. My $0.02
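
The date-based catch-up mentioned in the comment above could look roughly like this Python sketch. The item structure, the `modified` field name, and the `changed_since` helper are all assumptions made for illustration; the source does not say how such a query would be expressed.

```python
from datetime import datetime


def changed_since(items, cutoff):
    """Return ids of items modified after `cutoff` (a catch-up pass)."""
    return [item["id"] for item in items if item["modified"] > cutoff]


# Moment at which the interrupted queue run began (illustrative value).
run_started = datetime(2023, 10, 26, 9, 0)

items = [
    {"id": "a", "modified": datetime(2023, 10, 25, 12, 0)},
    {"id": "b", "modified": datetime(2023, 10, 26, 10, 30)},
    {"id": "c", "modified": datetime(2023, 10, 27, 8, 15)},
]

# Only items touched after the run started need to be queued again.
assert changed_since(items, run_started) == ["b", "c"]
```

A pass like this would pick up anything that changed while a "sloppy" resume was in progress, so small paging inaccuracies would not leave permanent gaps.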

  • 🇺🇸 United States adamzimmermann

    Quoting @apotek:

        I think, given the option of having to re-queue 1 million items from the start, if the process is interrupted, versus possibly having some paging overlap or possible interstitial changes (which could be queued later with a date-based query), I would prefer having to re-queue and re-migrate a few hundred over having to requeue and remigrate a million. I think even a "sloppy" resume is worth it. My $0.02

    When you put it that way, it makes my concern seem like less of an issue haha.

    I have an MR coming. I don't think it's ready for merging or final review, but it could be an interim solution that we use via patch and continue to improve on as time permits. I'm open to feedback on the approach here.

  • @adamzimmermann opened merge request.
  • Status changed to RTBC 11 months ago
  • 🇺🇸 United States markdorison

    The MR LGTM

  • Status changed to Active 11 months ago
  • 🇺🇸 United States adamzimmermann

    The initial MR was merged, but I would like to keep this open for adding the final polish/fixes as the work I did was more of a prototype of the functionality and not feature complete.

    @apotek has noted some of those changes in his comments above.

  • Status changed to Fixed 10 months ago
  • 🇺🇸 United States apotek

    The core need in this issue is fixed. For further refinement of user output, I have opened https://www.drupal.org/project/orange_dam/issues/3402985, "Improve output when queuing with the --page= parameter" (Active).

  • Automatically closed - issue fixed for 2 weeks with no activity.
