Chunk single text field if it is too large

Created on 18 March 2024, over 1 year ago

Updated 9 April 2024, over 1 year ago

Problem/Motivation

The DeeplTranslator module currently cannot chunk large HTML content from a single field for translation based on the content's structure (e.g., paragraph or div tags). This limitation shows up when translating content that exceeds the API's size limit. Our particular use case is landing pages built with DXPR Builder. If the page is long, with many sections, there is a lot of text as well as a significant amount of HTML. We wouldn't want to strip all the HTML because some attributes (alt, title, etc.) need translation as well.

Steps to reproduce

Insert HTML content exceeding 30kb into a text field in Drupal. I'm attaching an example text of 100kb.
Use the TMGMT with DeeplTranslator plugin to translate the content.
Error: Drupal\tmgmt\TMGMTException: DeepL API service returned following error: Request Entity Too Large.

Proposed resolution

Introduce a configurable chunking mechanism within the DeeplTranslator module that leverages the PHP DOMDocument class for parsing HTML content. This mechanism should allow for flexible configuration to enable chunking based on different HTML structures (e.g., paragraphs, divs with specific classes, or first-level child elements). By parsing the field content into a DOM structure and operating on individual DOM-nodes, similar to how the DXPR Builder module processes HTML, this approach can accommodate various use cases and content structures, ensuring that translations preserve the integrity and context of the original HTML content.
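To make the idea concrete, here is a minimal sketch of structure-aware chunking on first-level child elements. It is written in Python with the standard-library `html.parser` purely for illustration; the module itself would use PHP's DOMDocument, and the names `TopLevelSplitter`, `chunk_html` and `max_bytes` are hypothetical and not part of tmgmt_deepl or DXPR Builder. It assumes reasonably well-formed HTML with lowercase tag names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TopLevelSplitter(HTMLParser):
    """Collects the source text of each first-level node of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.nodes = []   # finished first-level nodes
        self._buf = []    # pieces of the node currently being built
        self._depth = 0   # nesting depth of open elements

    def _emit(self, text):
        self._buf.append(text)
        if self._depth == 0:  # back at the top level: the node is complete
            self.nodes.append("".join(self._buf))
            self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._depth += 1
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._depth > 0:
            self._depth -= 1
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def finish(self):
        """Flush anything left over (e.g. an unclosed element) and return the nodes."""
        if self._buf:
            self.nodes.append("".join(self._buf))
            self._buf = []
        return self.nodes


def chunk_html(html, max_bytes):
    """Group first-level nodes into chunks of at most max_bytes UTF-8 bytes.

    A single node larger than max_bytes is kept whole in its own chunk.
    """
    splitter = TopLevelSplitter()
    splitter.feed(html)
    splitter.close()
    chunks, current, size = [], [], 0
    for node in splitter.finish():
        node_size = len(node.encode("utf-8"))
        if current and size + node_size > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(node)
        size += node_size
    if current:
        chunks.append("".join(current))
    return chunks
```

Because every chunk is made of whole first-level nodes, joining the chunks gives back the original markup, and no chunk ends in the middle of an element. A node that is larger than the limit on its own would still need a finer-grained rule (for example, recursing into its children).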

In DXPR Builder you can find example implementations of DOMDocument to parse, modify, and piece together again HTML from a Drupal field: https://git.drupalcode.org/project/dxpr_builder/-/blob/2.1.x/src/Service...

I'm not sure if it should operate with a setting or an API. It would be nice if DXPR Builder itself could maintain the code with the chunking configuration through an API. But if there is interest in this feature from the community we should develop it here, or in TMGMT (probably infeasible).

Possible simpler solution

It would be much simpler if we just split the string without being aware of the HTML structure. Maybe DeepL doesn't care about valid HTML and it will work fine. It would save us from a lot of potential trouble if we don't have to use the DOMDocument class.
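For comparison, a structure-unaware split could look like the sketch below (Python for illustration only; `split_naive` and `max_chars` are hypothetical names). It splits on character count, whereas the API limit applies to request size, so a real implementation would have to account for bytes as well.

```python
def split_naive(text, max_chars):
    """Split text into pieces of at most max_chars characters, ignoring markup."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


pieces = split_naive('<p title="Hello">Hello world</p>', 10)
# The first piece ends inside the opening tag's title attribute.
```

The pieces join back to the original string, but individual pieces can end in the middle of a tag or attribute. Whether DeepL tolerates such fragments is exactly the open question here and would need testing.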

I also just noticed that DeepL doesn't translate alt and title attribute values, which the Google Cloud AI translator does. DeepL's HTML handling needs more investigation.

Possible scope elaboration

If it turns out we cannot make DeepL translate alt and title attributes without extracting those texts and putting them in markup that will be translated, we need to implement DOMNode support for a broader scope of functionality. In that case it could make more sense to first implement DOMNode parsing to prepare attribute values for translation, and to build chunking of DOMNodes based on text length on top of that. We can then also evaluate whether extracting all text for translation is a good choice: we would send only the text strings, marked up in clean HTML (no classes or other attributes), and possibly avoid the need for chunking altogether in 99% of DXPR Builder use cases.

We can then monitor occurrence of the above error with product analytics to see if there is still a need for chunking.
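The attribute extraction described above could start from something like the following sketch. It is Python with the standard-library `html.parser`, used only to illustrate the idea; in the module this would be done with PHP DOMNodes. `AttributeTextCollector` and `extract_attribute_text` are hypothetical names, and writing the translated values back into the markup is not shown.

```python
from html.parser import HTMLParser


class AttributeTextCollector(HTMLParser):
    """Collects non-empty values of selected attributes (alt and title by default)."""

    def __init__(self, names=("alt", "title")):
        super().__init__()
        self.names = set(names)
        self.found = []  # (tag, attribute, value) in document order

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        for name, value in attrs:
            if name in self.names and value:
                self.found.append((tag, name, value))


def extract_attribute_text(html):
    """Return the translatable attribute values found in an HTML fragment."""
    collector = AttributeTextCollector()
    collector.feed(html)
    collector.close()
    return collector.found
```

The collected values could then be sent for translation as plain text or wrapped in clean markup, alongside the element text.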

Remaining tasks

Analyze the current module implementation to identify where enhancements can be integrated.
Design and implement a flexible chunking mechanism that uses DOMDocument for HTML parsing and chunking.
Ensure the chunking mechanism can be configured to handle different HTML structures based on module or site needs.
Test the new functionality with diverse HTML content to validate its effectiveness and flexibility.
Update documentation to include guidance on configuring and using the new chunking features.

User interface changes

If the business logic for chunking is integrated into tmgmt_deepl, we add configuration settings for the chunking. If we only provide an API that facilitates splitting a single field translation into multiple batch jobs, there are no user interface changes.

API changes

See above

Data model changes

No changes to the data model are expected, as the enhancements primarily involve processing techniques for existing content.

✨ Feature request

Status

Closed: won't fix

Version

2.2

Component

Code

Created by

🇳🇱Netherlands jurriaanroelofs

Comments & Activities

Issue created by @jurriaanroelofs
Comment over 1 year ago →
🇩🇪Germany SteffenR Germany
Hi JurriaanRoelofs,

Thanks for the issue and the well defined requirements.
I'll take a look at it after my easter holiday.

Usually we have had no need to handle such large amounts of data/text within a single field. The module's internal batch processing already chunks the input data, but only to split up requests. It doesn't handle the DOM chunking you are looking for right now.
Status changed to Closed: won't fix over 1 year ago, 8:34am 9 April 2024
Comment over 1 year ago →
🇩🇪Germany SteffenR Germany
@JurriaanRoelofs I just had a call with DeepL about this issue and they don't recommend chunking such large text volumes. If we do so, the quality of the translations could get worse, since context is missing.

Therefore I'll set the status to Closed (won't fix), since this limitation is imposed by the API and will not be changed (for now)...