Make the translator able to translate longer HTML content

Created on 2 September 2023, about 1 year ago
Updated 17 September 2023, about 1 year ago

Problem/Motivation

When translating content, the body field commonly contains HTML, which can cause OpenAI requests to exceed the model's token limit.

Proposed resolution

First, before sending the OpenAI request, split the content into smaller chunks. The chunk length can be defined by the user according to the model and server environment.
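The chunking step can be sketched as follows. This is a minimal illustration in Python, not the module's actual PHP code; the function names and the rough 4-characters-per-token heuristic are assumptions (the module uses the rajentrivedi/tokenizer-x library for real token counts):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real implementation would use a proper tokenizer library.
    return max(1, len(text) // 4)

def split_into_chunks(segments: list[str], max_chunk_tokens: int) -> list[str]:
    """Greedily pack HTML segments into chunks that stay under the token limit."""
    chunks: list[str] = []
    current = ""
    for segment in segments:
        candidate = current + segment
        if current and estimate_tokens(candidate) > max_chunk_tokens:
            # Adding this segment would exceed the limit: close the
            # current chunk and start a new one with this segment.
            chunks.append(current)
            current = segment
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

segments = ["<p>First paragraph.</p>", "<p>Second paragraph.</p>", "<p>Third.</p>"]
print(split_into_chunks(segments, max_chunk_tokens=12))
```

Note that a single segment larger than the limit is still emitted as its own chunk; splitting inside an HTML element would risk producing invalid markup.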

Second, use the Batch API to make the requests to OpenAI.
This avoids sending one single long request and triggering a timeout error.
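The overall flow — translate each chunk in its own request, then reassemble the result — can be illustrated with this hedged sketch; `translate_chunk` is a stand-in for the real OpenAI call (an assumption, not the module's API), tagged locally so the example runs without network access:

```python
def translate_chunk(chunk: str) -> str:
    # Placeholder for one OpenAI API call per batch operation.
    # Here we just tag the chunk instead of calling the API.
    return f"[translated]{chunk}"

def translate_in_batches(chunks: list[str]) -> str:
    # Each chunk becomes one batch operation, so no single request
    # carries the full (possibly very long) body field.
    translated = [translate_chunk(chunk) for chunk in chunks]
    return "".join(translated)

print(translate_in_batches(["<p>Hello</p>", "<p>World</p>"]))
```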

User interface changes

Add a new "Maximum chunk tokens" field to the provider settings.

Feature request
Status

Fixed

Version

1.0

Component

Code

Created by

🇹🇼Taiwan amourow


Comments & Activities

  • Issue created by @amourow
  • @amourow opened merge request.
  • Status changed to Needs review about 1 year ago
  • 🇹🇼Taiwan amourow

    Made changes to make the chunk processing and Batch API work.

  • 🇹🇭Thailand AlfTheCat

    This is great to have!

    For me the patch didn't work, the log shows:

    Error: Class "Rajentrivedi\TokenizerX\TokenizerX" not found in Drupal\tmgmt_openai\Plugin\tmgmt\Translator\OpenAiTranslator->countTokens() (line 211 of /var/www/XXXX/modules/contrib/tmgmt_openai/src/Plugin/tmgmt/Translator/OpenAiTranslator.php).

    Hope this helps.

  • 🇹🇼Taiwan amourow

    @AlfTheCat

because the additional library used to count tokens is declared in this module's composer.json.

Applying the patch via the root composer.json won't install it, because during the composer update the additional vendor package is not yet listed in this module's composer.json.

There are two ways to fix it:

    1. Run composer require rajentrivedi/tokenizer-x in the Drupal project to make sure the library is installed.
    2. Since the maintainer merged the MR, you can try requiring this module at version 1.x-dev.

    It should work for you; please let me know how it goes.

  • @amourow opened merge request.
  • Status changed to Fixed about 1 year ago
  • 🇹🇼Taiwan amourow

Marking this fixed since the MR was merged.

  • Status changed to Fixed about 1 year ago
  • 🇹🇭Thailand AlfTheCat

    @amourow Thanks for the info! Option 2 didn't work for me, composer reports that it can't find a dev release of this module. I don't see it under "all releases" on the project page either.

    Option 1 worked though, and I'm no longer getting errors. However, I don't see the option in the UI that the proposed solution describes:

    "First, before sending the OpenAI request, split content into smaller chunks. The length of chunks can be defined by user according to the model and server environment."

    If that's by design then I suppose it works :)

    Thanks again!

  • 🇹🇼Taiwan amourow

    @AlfTheCat

I recently realized that option 2 doesn't work for this module, because the maintainer doesn't have that branch.
    Let's create a separate issue for that.

There is indeed an additional field, "Maximum chunk tokens", in the provider settings.
    It defines the maximum number of tokens per chunk when sending text in OpenAI API requests.

Also, I found that the update here may create two batch runners, so I made another patch to fix that issue.
    Could you also try including the patch from 🐛 Reduce redundant batch runner Active along with this one?

  • 🇹🇭Thailand AlfTheCat

    Awesome, thanks, I found that setting :)
