When one job item fails, the whole job fails, resulting in lost time and credits

Created on 2 December 2024

Problem/Motivation

Hi all,

I've been using ai_tmgmt quite extensively over the past few days, following the great improvements made recently. One issue keeps coming up: when I start translating multiple entities into a number of languages, the process regularly fails for some reason (moderation, unspecified causes) on just one of the many job items contained in the job. When this happens, all of the successful translations are lost. I've wasted a lot of time and a lot of AI tokens this way. The process may run for 30 minutes while the expenses chart in the OpenAI backend keeps growing, and then, just before the job finishes, an error appears and I'm left with nothing but an ocean of blue "in progress" icons. I then have to manually delete all the jobs and job items and start the process again in smaller chunks for it to ultimately succeed.

I understand that TMGMT itself is perhaps geared more towards manual translation flows, where the current logic doesn't have much of a downside. For AI translation workflows, however, there is a real need for a mechanism that stores each translation at the job item level as soon as it comes in, and that still allows those translations to be auto-accepted if the overall job fails (which is very common on larger jobs). I think that, just as with the "auto-accept" option, this AI-specific scenario should be accommodated on the ai_tmgmt side rather than waiting for a change in the underlying TMGMT module.
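
To make the request concrete, here is a minimal sketch of the per-item persistence idea, written in Python purely for illustration. It is not ai_tmgmt or TMGMT code: `JobItem`, `ItemState`, `run_job`, and the `translate` and `save` callbacks are all hypothetical names, and the real modules are PHP with their own entity and state handling. The point is only that each item is saved as soon as its translation arrives, so one failing item no longer discards the others.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class ItemState(Enum):
    # Hypothetical states; TMGMT defines its own job item states.
    PENDING = "pending"
    TRANSLATED = "translated"
    ACCEPTED = "accepted"
    FAILED = "failed"


@dataclass
class JobItem:
    item_id: int
    source_text: str
    state: ItemState = ItemState.PENDING
    translation: str = ""
    error: str = ""


def run_job(
    items: List[JobItem],
    translate: Callable[[str], str],
    save: Callable[[JobItem], None],
    auto_accept: bool = True,
) -> List[JobItem]:
    """Translate items one by one, saving each result immediately.

    Returns the items that failed, so they can be retried or cleaned up.
    """
    failed: List[JobItem] = []
    for item in items:
        try:
            item.translation = translate(item.source_text)
            item.state = ItemState.ACCEPTED if auto_accept else ItemState.TRANSLATED
        except Exception as exc:
            # A failure is recorded on this item only; earlier results stay saved.
            item.state = ItemState.FAILED
            item.error = str(exc)
            failed.append(item)
        # Persist after every item rather than once at the end of the job.
        save(item)
    return failed


# Usage with a stand-in translator that rejects one item.
store: Dict[int, JobItem] = {}


def fake_translate(text: str) -> str:
    if "blocked" in text:
        raise RuntimeError("moderation")
    return text.upper()


def save_to_store(item: JobItem) -> None:
    store[item.item_id] = item


items = [JobItem(1, "hello"), JobItem(2, "blocked text"), JobItem(3, "world")]
failed = run_job(items, fake_translate, save_to_store)
```

In this sketch, items 1 and 3 end up saved and accepted even though item 2 fails, and the returned list identifies what still needs attention.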

Would this be possible to fix?

Category: Feature request
Status: Active
Version: 1.0
Component: Code
Created by: 🇹🇭 Thailand AlfTheCat

Comments & Activities

  • Issue created by @AlfTheCat
  • First commit to issue fork.
  • Merge request !5 Resolve #3490979 "Job partial save" → (Open) created by scott_euser
  • 🇬🇧United Kingdom scott_euser

    Makes sense! I have not had time to test this at all (not even to open the code and check for fatal errors), but hopefully it gives a starting point for you or someone else. Maybe it 'just works' (though I am doubtful).

  • 🇹🇭Thailand AlfTheCat

    Awesome, thanks! I'll test this tomorrow and report back.

  • 🇹🇭Thailand AlfTheCat

    Maybe it 'just works'

    Maybe it actually does! I ran a test, and the translations are now stored one by one while the job is running. If I interrupt the process, all the finished translations are still there. A massive win for large translation sets. Seems perfect!

    One last thing to consider is whether to automatically purge the job items that are left behind when the main job is stuck, or to restart them somehow. Currently, deleting stuck jobs and items requires either two custom VBO views to do it in bulk, or site admins going into every individual job item, clicking "abort", waiting for a page reload, and then hitting "delete".

  • 🇬🇧United Kingdom scott_euser

    Nice, thanks for checking! It seems worth merging this, then, and creating a follow-up in the TMGMT core module for restarting/retrying failed machine translation jobs. Do you agree?

    Thanks!