Error cropping Japanese

Created on 1 April 2020, about 4 years ago
Updated 16 June 2023, about 1 year ago

Steps:
1. Create a view and add the 'body' field to the view's fields. See settings.png for the field settings.
2. Add a node with a body value like the one below:
‐ADTによる単独治療と比べ、ADTにエンザルタミドを併用することで転移および死亡するリスクが71%減少 -  アステラス製薬株式会社(本社:東京、代表取締役社長CEO:安川 健司、以下「アステラス製薬」)は、Pfizer Inc(本社:ニューヨーク州、以下「Pfizer社」)と共同で開発・商業化を進めている経口アンドロゲン受容体シグナル伝達阻害剤であるXTANDI®(一般名:エンザルタミド*1)について、非転移性去勢前立腺がん患者を対象にアンドロゲン除去療法(androgen deprivation therapyADT)とエンザルタミドを投与した群と、ADT単独治療群を比較した第III相PROSPERの試験結果が、6月28日発刊のNew England Journal of Medicineに掲載されましたので、お知らせします。  本試験において

The code preg_match('/[\.,:;\?!…]$/', $domnode->nodeValue) matches when it should not, and then $domnode->nodeValue = substr($domnode->nodeValue, 0, -1); is executed.

Because substr operates on bytes, and a Japanese character is 3 bytes in UTF-8, the trailing 'て' ends up partially truncated, which causes encoding errors.
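A minimal sketch of the failure mode described above (illustrative, not the module's actual code): without the /u modifier, PCRE treats the '…' inside the character class as three separate bytes (0xE2 0x80 0xA6), so the class can match the final byte of an unrelated multibyte character such as 'て' (0xE3 0x81 0xA6). The byte-oriented substr() then strips a single byte, leaving invalid UTF-8. Adding /u and using mb_substr() keeps everything at the character level.

```php
<?php

// Text ending in 'て' with no trailing punctuation at all.
$text = '本試験において';

// Without /u the class matches the final *byte* 0xA6 -- a false positive.
var_dump(preg_match('/[\.,:;\?!…]$/', $text));   // int(1) -- wrong
// With /u the class matches whole code points -- no match, as expected.
var_dump(preg_match('/[\.,:;\?!…]$/u', $text));  // int(0) -- correct

// Even when punctuation really is present, trim characters, not bytes:
$withEllipsis = '本試験において…';
// substr() removes one byte of the 3-byte '…', corrupting the string.
var_dump(substr($withEllipsis, 0, -1) === '本試験において');     // bool(false)
// mb_substr() removes the whole '…' cleanly.
var_dump(mb_substr($withEllipsis, 0, -1) === '本試験において');  // bool(true)
```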

🐛 Bug report
Status

Needs work

Version

2.0

Component

Code

Created by

🇨🇳China simbaw


Comments & Activities


  • 🇺🇸United States ultimike Florida, USA

    This patch breaks the current testTruncateWords() unit test.

    So, I think we need to figure out why it is breaking that test, as well as add some additional data (Japanese characters would make sense) to the two data providers in the existing unit tests.

    -mike

  • 🇺🇸United States ultimike Florida, USA

    I have reached out to some folks who might be able to provide some sample test content for us - fingers crossed.

    -mike

  • 🇬🇧United Kingdom rajeevk London

    Hello Mike,

    Here are a few examples in the Hindi language.

    1. एक भूरे लोमड़ी ने आलसी काले कुत्ते पर छलॉँग लगाया। कुत्ता चिल्लाया। और फिर काला कुत्ता उठा और चलता बना।
    2. एक भूरे लोमड़ी ने आलसी काले कुत्ते पर छलॉँग लगाया। और फिर
    3. एक भूरे लोमड़ी ने आलसी काले कुत्ते
    4. एक भूरे लोमड़ी ने आलसी काले कुत्ते पर छलॉँग लगाया

    I hope it helps. Thanks

  • 🇺🇸United States ultimike Florida, USA

    @rajeevk - thank you so much!

    -mike

  • 🇮🇪Ireland lostcarpark

    Here's an unusual one for you: Ogham, an old script for writing Irish and Celtic languages.

    It can be written either vertically (bottom to top) or horizontally (left to right). Its Unicode spec is here: https://www.unicode.org/charts/PDF/U1680.pdf

    Characters are written along a line, so an unusual aspect is the space character,   (0x1680). Smart Trim doesn't currently recognise this as a space.

    Phrases can be wrapped in ᚛᚜ (0x169b and 0x169c), so ideally if a string is trimmed after an opening ᚛, it should be closed with a matching ᚜, but I don't think this is a deal breaker. However, if two phrases follow each other without a break ᚜᚛, they ought to be treated as having an invisible space between them (I think most writers of Ogham would put a normal space (0x20) between phrases).

    The sample below is a verse of a poem about rainbows.

    1. ᚛ᚇᚓᚐᚏᚌ ᚐᚌᚒᚄ ᚌᚂᚐᚄ᚜ ᚛ᚌᚑᚏᚋ ᚐᚌᚒᚄ ᚁᚒᚔ᚜ ᚛ᚃᚓᚐᚉᚆ ᚄᚐ ᚄᚚᚓᚔᚏ᚜ ᚛ᚐᚅ ᚁᚑᚌᚆᚐ ᚁᚐᚔᚄᚈᚔ᚜
    2. ᚛ᚇᚓᚐᚏᚌ ᚐᚌᚒᚄ ᚌᚂᚐᚄ᚜ ᚛ᚌᚑᚏᚋ ᚐᚌᚒᚄ ᚁᚒᚔ᚜ ᚛ᚃᚓᚐᚉᚆ ᚄᚐ ᚄᚚᚓᚔᚏ᚜
    3. ᚛ᚇᚓᚐᚏᚌ ᚐᚌᚒᚄ ᚌᚂᚐᚄ᚜ ᚛ᚌᚑᚏᚋ ᚐᚌᚒᚄ ᚁᚒᚔ᚜ ᚛ᚃᚓᚐᚉᚆ᚜
    4. ᚛ᚇᚓᚐᚏᚌ ᚐᚌᚒᚄ ᚌᚂᚐᚄ᚜ ᚛ᚌᚑᚏᚋ ᚐᚌᚒᚄ ᚁᚒᚔ᚜ ᚛ᚃᚓᚐᚉᚆ ᚄᚐ ᚄᚚᚓᚔᚏ᚜

    If you ignore the rule about closing phrases, 3 becomes:
    3. ᚛ᚇᚓᚐᚏᚌ ᚐᚌᚒᚄ ᚌᚂᚐᚄ᚜ ᚛ᚌᚑᚏᚋ ᚐᚌᚒᚄ ᚁᚒᚔ᚜ ᚛ᚃᚓᚐᚉᚆ

  • 🇺🇸United States ultimike Florida, USA

    Trimming to the end of the sentence following a number of words (case 4 from my comment 20 above) is part of the issue "Trim paragraph containing word count" (Closed: duplicate) and will have to wait for that issue to be committed.

    -mike

  • 🇺🇸United States ultimike Florida, USA

    @lostcarpark - the Ogham text passes the "50 characters" test, but both versions of the "7 words" test fail (with and without the final ᚜). Thoughts?

    @rajeevk my original instructions were flawed (sorry) - for the "50 characters" test, I forgot to add the second sentence to my instructions: "The sample content trimmed to 50 characters. If the end of 50 characters ends in the middle of a word, then drop the last word fragment." Would you mind updating the expected Hindi text for the "50 characters" test? Thank you!

    -mike

  • 🇮🇪Ireland lostcarpark

    @ultimike, I think the problem is the list of characters TruncateHTML.php considers to be spaces, which appears to be on line 204.

    I think the Ogham space character (0x1680) would need to be added to that list.

    Here is a page about Unicode space characters: https://jkorpela.fi/chars/spaces.html

  • 🇺🇸United States ultimike Florida, USA

    @lostcarpark - thanks for the info - that led me to this:

    const SPACES = "/[\\p{Z}]+/u";

    I created a new constant to define the pattern for spaces using \p{Z}, which covers all Unicode separators - testTruncateWords() now passes with both Ogham and Hindi.

    I've created a patch, but I'm waiting for updated Hindi text as well as test data from several other non-Western languages (the patch will currently fail testTruncateChars()). I'll update the issue fork when we have something that passes. For now, I just wanted to provide the patch for informational use.

    -mike
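A short sketch of the separator pattern described in the comment above (illustrative; only the SPACES constant itself is quoted from the comment). \p{Z} matches any character in the Unicode "Separator" category, which includes the ordinary ASCII space as well as the Ogham space mark U+1680, so a \p{Z}-based split handles both scripts:

```php
<?php

// The Unicode-separator pattern from the comment above.
const SPACES = "/[\\p{Z}]+/u";

// "\u{1680}" is the Ogham space mark; 0x20 is a normal space.
$words = preg_split(SPACES, "foo\u{1680}bar baz");
print_r($words);  // ['foo', 'bar', 'baz'] -- both separators are recognized
```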

  • I tested with a paragraph of Chinese. It seems to work only when I set 'Trim units' to "Characters", not "Words". Here are some test results:

    Test example: 兔子吹嘘自己跑得有多快。 他在嘲笑乌龟这么慢。 令兔子大吃一惊的是,乌龟向他发起了一场比赛。 兔子觉得这是个好笑话,就接受了挑战。 狐狸是比赛的裁判。 比赛开始时,兔子跑在乌龟前面,正如大家所想的那样.

    1. Trimmed to 50 characters, it should stop here: 兔子吹嘘自己跑得有多快。 他在嘲笑乌龟这么慢。 令兔子大吃一惊的是,乌龟向他发起了一场比赛。 兔子觉得这是个好笑话 -- but perhaps because it counts punctuation as characters, it stops at the closest sentence end before 50 characters, here: 兔子吹嘘自己跑得有多快。 他在嘲笑乌龟这么慢。 令兔子大吃一惊的是,乌龟向他发起了一场比赛。

    2. Trimmed to 7 words, it should stop here: 兔子吹嘘自己跑, but it displayed the whole paragraph. Not working with words!
    3. Trimmed to 7 characters, it should stop here: 兔子吹嘘自己跑. It did.
    4. Trimmed to the end of the sentence following the first 7 characters, with the suffix "想的那样" added, it should look like this: "兔子吹嘘自己跑想的那样". It did.

  • 🇯🇵Japan hodota

    Hi,

    This is Japanese 50 characters,

    Type A

    新型コロナウイルス対策に便乗した各省庁の予算獲得が繰り返されている。日本経済新聞が調べたところ、コロ

    Type B

    この問題を巡っては近鉄グループホールディングス(GHD)が、KNT-CTホールディングス(HD)社長の米田昭正氏が6

    thanks,

    Kazu Hodota

  • 🇺🇸United States ultimike Florida, USA

    @pilot - thanks so much.

    Is this Chinese (Mandarin)?

    For the 50-character trim, yes, it does count punctuation, so it works 😀

    For the 7-word trim, is each character considered a word?

    Thanks,
    -mike

  • 🇺🇸United States ultimike Florida, USA

    @hodota - thanks for the sample above, but I also need the correct answers when trimming the sample you provided to:

    • The sample content trimmed to 50 characters. If the end of 50 characters ends in the middle of a word, then drop the last word fragment.
    • The sample content trimmed to 7 words.
    • The sample content trimmed to the end of the sentence following the first 7 words.

    (see my comment 20 above).

    thanks,
    -mike

  • 🇺🇸United States ultimike Florida, USA

    I tried using ChatGPT to get the correct values of the trimmed Japanese and Chinese strings provided above by @hodota and @pilot3, but as I am not familiar with either language, I have no idea if ChatGPT gave me the proper answers.

    I did update the not-yet-passing tests with the ChatGPT values, but I think we need some additional non-AI help 👍🏼

    Updated patch attached - again, not ready for testing.

    -mike

  • @ultimike -- Yes, those are Chinese characters (Hanzi). (Mandarin and Cantonese are almost the same in written style; the writing is based on characters rather than alphabet letters.)
    A Chinese character can be a word on its own, or multiple Chinese characters can combine into one word. For testing purposes, I think each Chinese character can be considered a word.

  • 🇮🇪Ireland lostcarpark

    Just curious, I was vaguely aware that Chinese characters can combine to form a word (though words will have far fewer characters than in alphabetic writing systems), but is there a way to tell from the Unicode if a character represents a whole word, or is part of a multi-character word? Presumably if you split a word made of multiple characters, it will change the meaning.

  • @lostcarpark -- There are many two-character words and four-character words (phrases) in Chinese. Sometimes two words with similar meanings are combined for formal use; sometimes two words with completely different meanings combine to create a new meaning. It is hard to understand sentences made only of single-character words. I'm not really sure of a good way to split words within a sentence; I think it makes sense to split a paragraph by punctuation. Hope this helps.

  • 🇯🇵Japan hodota

    Hi,

    Here is a Japanese language sample for test #20.

    For example:

    Sample content: The quick brown fox jumped over the lazy dog. Then the dog got up and walked away.
    Trimmed to 50 characters: The quick brown fox jumped over the lazy dog. Then
    Trimmed to 7 words: The quick brown fox jumped over the
    Trimmed to end of sentence after 7th word: The quick brown fox jumped over the lazy dog.

    Sample content: (65 characters in Japanese)
    サンプルコンテンツの内容です。動きの素早い、茶色の狐は、怠け者の犬を飛び越しました。 すると、犬は起き上がり、歩き去って行きました。

    Trimmed to 50 characters: (50 characters in Japanese)
    サンプルコンテンツの内容です。動きの素早い、茶色の狐は、怠け者の犬を飛び越しました。 すると、犬は起き

    Trimmed to 7 words: (7 words in Japanese)
    サンプルコンテンツの内容です。動き

    Trimmed to end of sentence after 7th word:
    サンプルコンテンツの内容です。動きの素早い、茶色の狐は、怠け者の犬を飛び越しました。

    Also, I used this website for the Japanese word analytics, and I attached a page screenshot.
    https://tool.konisimple.net/text/hinshi_keitaiso

  • 🇺🇸United States ultimike Florida, USA

    @pilot - thanks so much for your help on this, very insightful.

    I figured out how to do this, and it's going to require a bigger change to the module than I previously suspected. It appears that we'll have to use a different regex pattern to separate non-Latin words (and to find sentence endings).

    Specifically, for Chinese (CJK), we'll need to use the Unicode "script" \p{Han}. Similarly, for Japanese characters, we'll need to use the Hiragana and/or Katakana Unicode script (see https://www.regular-expressions.info/unicode.html for more details on Unicode scripts).

    Currently, Smart Trim's regex patterns are hard-coded for Latin (western) languages. I'm thinking that in order to support non-Latin languages, we'll need to modify the regex pattern based either on a new formatter config option or automatically based on the page's HTML tag's xml:lang attribute (I'm guessing).

    Even so, I banged away at regex patterns trying to trim Chinese (Han) and Japanese (Hiragana) words, with no luck.

    I'm not sure where that leaves us with this issue, but I think someone with expertise in this area (regular expressions with non-Latin languages) might be very helpful.

    -mike
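To make the Unicode-script idea from the comment above concrete, here is a hedged sketch (not the module's actual code) showing how \p{Han}, \p{Hiragana}, and \p{Katakana} pick out the different scripts in a mixed string; the sample string is illustrative:

```php
<?php

// Mixed CJK ideographs, katakana, hiragana, and Latin.
$mixed = '新型コロナ対策のtest';

preg_match_all('/\p{Han}/u', $mixed, $han);        // CJK ideographs
preg_match_all('/\p{Katakana}/u', $mixed, $kata);  // katakana syllabary
preg_match_all('/\p{Hiragana}/u', $mixed, $hira);  // hiragana syllabary

print_r($han[0]);   // 新, 型, 対, 策
print_r($kata[0]);  // コ, ロ, ナ
print_r($hira[0]);  // の
```

Script properties identify which writing system each character belongs to, but as the comment notes, they do not by themselves find word boundaries; that still requires segmentation logic on top.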

  • @ultimike - You are welcome! Thanks for putting your time and effort into this!

    As for Chinese, it is better to trim by characters, not words. A word can be a combination of multiple Chinese characters, and it is hard to separate words without understanding the whole sentence; the meaning can be totally different.

    For instance, when there is a requirement to write an 800-word article, the count is done in Chinese characters.

  • 🇺🇸United States Amirez Houston, TX

    @ultimike I was wondering if I could ask you a favor. Would you mind adding a new automated test for the Persian (Farsi) language, which is written right-to-left?

    For example:

    1. Sample content:
      سام پس از نیایش با گروھی به سوی کوه البرز رفت. سیمرغ از فراز کوه سام و گروه او را دید و دانست که در پی کودک آمده‌اند.
    2. Trimmed to 50 characters:
      سام پس از نیایش با گروھی به سوی کوه البرز رفت. سیم
    3. Trimmed to 7 words:
      سام پس از نیایش با گروھی به
    4. Trimmed to end of sentence after 7th word:
      سام پس از نیایش با گروھی به سوی کوه البرز رفت.

    Also, there is no CKEditor button for this text field to make the content start from the right, which I think is causing the period "." to appear at the start of the sentence in examples 1 and 4.

    Thanks

  • 🇮🇳India mohit_aghera Rajkot

    Adding a few Gujarati sentences.
    Gujarati:
    252 chars (51 words)
    ઝડપી ભૂરું શિયાળ આળસુ કૂતરા પર કૂદી પડ્યું. કુતરા એ રાડ નાખી અને પછી કૂતરો ઊભો થઈને ચાલ્યો ગયો.

    Trimmed to 50 characters:
    ઝડપી ભૂરું શિયાળ આળસુ કૂતરા પર કૂદી પડ્યું. કુતરા એ રાડ નાખી અને પછી કૂતરો

    Trimmed to 7 words:
    ઝડપી ભૂરું શિયાળ આળસુ કૂતરા પર કૂદી => Test case passing

    Trimmed to end of sentence after 7th word:
    ઝડપી ભૂરું શિયાળ આળસુ કૂતરા પર કૂદી પડ્યું.

    I tried to run the test cases on local and it passed for the 7 word use case.

    Regarding 50 character trim failures:
    It seems that Google and other online character counters count extra characters for diacritics.
    For example:
    ઝડપી is 3 letters, but Google counts 4 because of the diacritic added to the last character "પી".
    I am not sure if this is happening with other languages as well.

    I noticed the same failures happening with Hindi in #22

  • 🇺🇸United States ultimike Florida, USA

    Thank you @pilot3, @Amirez, and @mohit_aghera!

    -mike

  • 🇨🇦Canada phjou Vancouver/Paris 🇨🇦 🇪🇺

    I encountered a similar issue with the okina character from the Hawaiian alphabet, which is counted as 2 characters instead of one.

    For me, the issue is that the formatter is using strlen and not mb_strlen.
    strlen returns the size of the string in bytes, not in characters.

    Patch #19 solves the issue for me.
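A small sketch of the byte-vs-character counting difference described in the comment above (the word is illustrative, not from the module's tests). The okina (ʻ, U+02BB) is a single character but two bytes in UTF-8, so strlen() overcounts it while mb_strlen() counts code points:

```php
<?php

// "Hawaiʻi": 7 characters, but the okina (U+02BB) occupies 2 bytes.
$word = "Hawai\u{02BB}i";

echo strlen($word), "\n";     // 8 -- byte count
echo mb_strlen($word), "\n";  // 7 -- character count
```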

  • 🇺🇸United States ultimike Florida, USA

    @phjou would you be so kind as to provide us with some test data that we can add to this module for the Hawaiian alphabet? See my comment 20 above.

    thanks,
    -mike

  • 🇨🇦Canada phjou Vancouver/Paris 🇨🇦 🇪🇺

    Not speaking Hawaiian, but I added one case for the okina; hopefully that's good enough for you.

  • 🇨🇦Canada phjou Vancouver/Paris 🇨🇦 🇪🇺

    Forgot to attach the diff
