Allow overcoming the character limit, translate only text

Created on 8 June 2020
Updated 10 April 2023

Microsoft Translator has a 5,000 character limit, which can be overcome by splitting longer text into sentences and assembling text chunks shorter than the limit.
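
A minimal sketch of that idea (a hypothetical helper, not the actual patch code; the naive sentence regex is only for illustration):

    <?php

    /**
     * Splits text into chunks no longer than $limit characters, breaking on
     * sentence boundaries so each chunk stays within the translator's limit.
     */
    function example_chunk_text(string $text, int $limit = 5000): array {
      // Naive split on terminal punctuation followed by whitespace.
      $sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
      $chunks = [];
      $current = '';
      foreach ($sentences as $sentence) {
        if ($current !== '' && mb_strlen($current) + mb_strlen($sentence) + 1 > $limit) {
          // Adding this sentence would exceed the limit: start a new chunk.
          $chunks[] = $current;
          $current = $sentence;
        }
        else {
          $current = $current === '' ? $sentence : $current . ' ' . $sentence;
        }
      }
      if ($current !== '') {
        $chunks[] = $current;
      }
      return $chunks;
    }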

Also, the module currently sends everything for translation, including HTML tags, which doesn't make much sense. I guess this should've been fixed in tmgmt, but I fixed it here.
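
And a rough sketch of the "translate only text" part, i.e. running the translator over text nodes only and leaving the markup alone (again a hypothetical helper assuming UTF-8 input, not what the patch literally does):

    <?php

    /**
     * Translates only the text nodes of an HTML fragment; $translate is any
     * callable that takes a plain string and returns its translation.
     */
    function example_translate_text_nodes(string $html, callable $translate): string {
      $doc = new \DOMDocument();
      // Wrap the fragment so it parses even without a single root element.
      $doc->loadHTML('<?xml encoding="utf-8"?><div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);
      $xpath = new \DOMXPath($doc);
      foreach ($xpath->query('//div//text()') as $node) {
        if (trim($node->nodeValue) !== '') {
          $node->nodeValue = $translate($node->nodeValue);
        }
      }
      // Return only the inner HTML of the wrapper div.
      $wrapper = $doc->getElementsByTagName('div')->item(0);
      $result = '';
      foreach ($wrapper->childNodes as $child) {
        $result .= $doc->saveHTML($child);
      }
      return $result;
    }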

As the module maintainers have long been inactive, here's the patch combo from the composer.json we apply on our project to make the module more usable.

    "require": {
        "drupal/tmgmt_microsoft": "1.x-dev#0a6397d77f84d30faf4026324ab829067b356b34"
    },
    "extra": {
        "patches": {
            "drupal/tmgmt_microsoft": {
                "Azure Datamarket update, fix php error after entity creation": "./patches/2840876-81.patch",
                "Support category in request query per language": "./patches/category-support-3025309-2.patch",
                "Support long strings and HTML markup": "./patches/tmgmt_microsoft-html-chunks.patch"
            }
        }
    }

The other patches can be found via the issue IDs in their file names, e.g. drupal.org/node/2840876

I hope the module maintainers will return at some point and work on it; in the meantime, anyone who needs this can just add the above to their composer.json along with the corresponding patch files.

📌 Task

Status: Needs work
Version: 1.0
Component: Code
Created by: 🇵🇱 Graber (Poland)

Comments & Activities


  • 🇨🇦 Shiraz Dindar (Sooke, BC, Canada)

    Thanks for this patch Graber!

    This works great for me for chunking translations that are over the character limit.

    I did have to re-roll it to work on the latest dev release.

    I also took out the "translate only text" part of this patch -- it was throwing errors, and in any case, I was just interested in the chunking.

    So attached is the patch rerolled by myself, which only does chunking.

    Anyone reading this should also check out my other patch, 🐛 "increase max character limit to 50000" (marked Fixed), which increases the character limit to 50,000 (from 5,000), as that alone may take care of your needs. In my case we have translations over 50,000 characters, so chunking is needed.

  • 🇨🇦 Shiraz Dindar (Sooke, BC, Canada)

    Please note that, even with chunking in place, I have run into certain translations that fail with a "429 Too Many Requests" response from the API. This is described at https://towardsdatascience.com/advanced-guide-avoiding-max-character-lim....

    I've tried sleeping between the chunk translation submissions, but even 2-second sleeps weren't enough; I presume that's because the limit is per minute (it's not super well documented; the above link is the best I could find). See the retry sketch at the end of this comment.

    To be sure, this is separate from the character limitation.

    I'm not sure if the account tier makes a difference here. I'll update this task as I find out more.
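
    For what it's worth, one way to soften this is to retry each chunk submission with an increasing delay instead of a fixed sleep. A rough sketch (the "exception with code 429" convention is my assumption, not how the module actually reports it):

        <?php

        /**
         * Retries a chunk submission on HTTP 429, backing off exponentially.
         * $submit is any callable that sends one chunk and throws an
         * exception carrying code 429 when the API rate-limits us.
         */
        function example_submit_with_backoff(callable $submit, string $chunk, int $max_attempts = 5) {
          $delay = 2;
          for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
            try {
              return $submit($chunk);
            }
            catch (\RuntimeException $e) {
              if ($e->getCode() !== 429 || $attempt === $max_attempts) {
                throw $e;
              }
              // Rate-limited: wait, then try again with a doubled delay.
              sleep($delay);
              $delay *= 2;
            }
          }
        }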

  • 🇨🇦 Shiraz Dindar (Sooke, BC, Canada)

    1. The 429 too many requests error I mentioned above is no longer occurring. I think it was just a temporary thing, not a real issue.

    2. I found that some of the nodes I was submitting for chunked translation were failing because the previous patch threw an exception whenever a single sentence exceeded the character limit (since that made for an untranslatable chunk). The over-limit sentences in question were actually base64-embedded images in the text of the field I was translating. I've updated the patch so these no longer cause a failure: over-limit "sentences" are not submitted for translation but are still kept in place (see the sketch at the end of this comment). This way base64 images are still included in the translated node and nothing fails.

    3. Further to #2, the regex that was being used to split text into sentences was failing on text containing base64-encoded images (i.e. it was not splitting it correctly). I played around with several tweaks to the regex but couldn't get it to work satisfactorily, so instead I found a PHP library on GitHub designed specifically for splitting text into sentences. It did a couple of extra things that caused their own issues, though, so I forked that repo with the changes needed to make it work. So, for anyone who happens to be reading this (I suspect in the future this *will* be needed), to get this patch to work you will also need to add these lines to your project's composer.json:

    In the repositories section:

        {
            "type": "git",
            "url": "https://github.com/kanopi/php-sentence"
        }

    In the require section:

            "vanderlee/php-sentence": "dev-do-not-clean-unicode-and-do-not-replace-floats",
    

    Hoping this helps someone out!
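
    To make points 2 and 3 concrete, here's a rough sketch of the splitting step: over-limit "sentences" (typically base64 images) get flagged so the chunker keeps them in place but never submits them to the API. The Sentence class name follows the upstream library's README; double-check it against the fork:

        <?php

        use Vanderlee\Sentence\Sentence;

        // Example input: field text that may contain very long base64 "sentences".
        $text = 'First sentence. Second sentence. <img src="data:image/png;base64,...">';

        $splitter = new Sentence();
        $items = [];
        foreach ($splitter->split($text) as $sentence) {
          $items[] = [
            'text' => $sentence,
            // Anything longer than the API limit is kept but not translated.
            'translate' => mb_strlen($sentence) <= 5000,
          ];
        }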

  • heddn (Nicaragua)

    Adding new hard external dependencies is going to be a harder sell than optional dependencies. Could this be re-worked into something that checks whether the php-sentence codebase is available and, if so, uses it?
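
    A sketch of what that optional check could look like (the class name is taken from the upstream library and may differ in the fork; the fallback regex is only an example):

        <?php

        /**
         * Splits text into sentences, using php-sentence when it is
         * installed and a simple regex split otherwise.
         */
        function example_split_sentences(string $text): array {
          if (class_exists('Vanderlee\Sentence\Sentence')) {
            $splitter = new \Vanderlee\Sentence\Sentence();
            return $splitter->split($text);
          }
          return preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
        }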
