Make the translator able to translate longer HTML content

Created on 2 September 2023, 10 months ago
Updated 17 September 2023, 9 months ago

Problem/Motivation

When translating content, the body field commonly contains HTML, which can cause OpenAI API calls to exceed the model's token limit.

Proposed resolution

First, before sending the OpenAI request, split the content into smaller chunks. The chunk length can be defined by the user according to the model and server environment.

Second, implement the Batch API to make the requests to OpenAI.
This avoids sending one single long request and hitting a timeout error.
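The chunking step can be sketched as follows. This is an illustrative Python sketch, not the module's actual PHP implementation: `approx_token_count` is a hypothetical stand-in for a real tokenizer (the module uses the TokenizerX library), and the segment packing shows the general idea of keeping each request under a user-defined token budget.

```python
def approx_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer (e.g. TokenizerX/tiktoken):
    # assume roughly one token per four characters.
    return max(1, len(text) // 4)

def split_into_chunks(segments: list[str], max_chunk_tokens: int) -> list[str]:
    """Greedily pack translatable segments into chunks whose
    approximate token count stays within max_chunk_tokens."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for segment in segments:
        tokens = approx_token_count(segment)
        # Start a new chunk when adding this segment would exceed the budget.
        if current and current_tokens + tokens > max_chunk_tokens:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(segment)
        current_tokens += tokens
    if current:
        chunks.append("".join(current))
    return chunks

# Each chunk is then sent as its own OpenAI request (driven by the Batch API),
# instead of one oversized request for the whole body field.
segments = ["<p>First paragraph.</p>", "<p>Second paragraph.</p>",
            "<p>Third paragraph.</p>"]
chunks = split_into_chunks(segments, max_chunk_tokens=12)
```

The "Max chunk tokens" provider setting proposed below would supply the `max_chunk_tokens` budget.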

User interface changes

Add a new field for "Max chunk tokens" in the provider settings

✨ Feature request
Status

Fixed

Version

1.0

Component

Code

Created by

πŸ‡ΉπŸ‡ΌTaiwan amourow


Comments & Activities

  • Issue created by @amourow
  • @amourow opened merge request.
  • Status changed to Needs review 10 months ago
  • πŸ‡ΉπŸ‡ΌTaiwan amourow

    Made changes to make the chunk processing and Batch API work.

  • πŸ‡ΉπŸ‡­Thailand AlfTheCat

    This is great to have!

    For me the patch didn't work, the log shows:

    Error: Class "Rajentrivedi\TokenizerX\TokenizerX" not found in Drupal\tmgmt_openai\Plugin\tmgmt\Translator\OpenAiTranslator->countTokens() (line 211 of /var/www/XXXX/modules/contrib/tmgmt_openai/src/Plugin/tmgmt/Translator/OpenAiTranslator.php).

    Hope this helps.

  • πŸ‡ΉπŸ‡ΌTaiwan amourow

    @AlfTheCat

The additional library used to count tokens is added in this module's composer.json.

    Applying the patch via the root composer.json won't install it, because during the composer update the additional vendor dependency is not yet in this module's composer.json.

    There are two ways to fix this:

    1. Run composer require rajentrivedi/tokenizer-x in the Drupal project to make sure it is installed.
    2. Since the maintainer merged the MR, you can try requiring this module at version 1.x-dev.

    It should work for you; please let me know how it goes.

  • @amourow opened merge request.
  • Status changed to Fixed 9 months ago
  • πŸ‡ΉπŸ‡ΌTaiwan amourow

Marking this fixed since the MR was merged.

  • πŸ‡ΉπŸ‡­Thailand AlfTheCat

@amourow Thanks for the info! Option 2 didn't work for me; composer reports that it can't find a dev release of this module, and I don't see one under "all releases" on the project page either.

    Option 1 worked, though, and I'm no longer getting errors. However, I don't see an option in the UI corresponding to the proposed solution:

    "First, before sending the OpenAI request, split content into smaller chunks. The length of chunks can be defined by user according to the model and server environment."

    If that's by design then I suppose it works :)

    Thanks again!

  • πŸ‡ΉπŸ‡ΌTaiwan amourow

    @AlfTheCat

I recently realized that option 2 doesn't work for this module, because the maintainer doesn't have that branch.
    Let's open an issue for that.

    There is indeed one additional field, "Maximum chunk tokens", in the provider settings.
    It defines the maximum number of tokens per chunk when sending text in the OpenAI API request.

    Also, I found that the update here may create two batch runners, so I made another patch to fix that issue.
    Can you also try including the patch from πŸ› Reduce redundant batch runner (Needs review) along with this one?

  • πŸ‡ΉπŸ‡­Thailand AlfTheCat

    Awesome, thanks, I found that setting :)
