Silence the warnings: An overlong word (more than 50 characters) ...

Created on 23 July 2019
Updated 13 February 2023

I would like to remove the warning message from my logs:

An overlong word (more than 50 characters) was encountered while indexing: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Since database search servers currently cannot index words of more than 50 characters, the word was truncated for indexing. If this should not be a single word, please make sure the "Tokenizer" processor is enabled and configured correctly for index Website search.

I tried enabling and configuring the Tokenizer processor, but it broke my search.

I created a simple patch for this.

💬 Support request

Status: Needs work
Version: 1.28
Component: Plugins
Created by: 🇦🇺Australia Mirelyj


Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • I got this problem after setting up Search API (v8.x-1.28) on my Drupal 9.5.3 site. I'm using the pre-processors: Ignore Case, Transliteration, HTML Filter, Tokenizer, Stemmer. I'm not using the Ignore Characters processor, but I see that the Tokenizer itself has an "Ignore Characters" parameter. This is set by default to ignore the three characters "._-". I changed this, setting it to empty.

    This has pretty much solved the problem for me. I now get a much smaller number of "overlong word" errors, and these are almost entirely from strings of characters which are not words, so they don't need to be indexed. So, the solution seems simple.

    I am wondering why nobody has mentioned the Tokenizer's "Ignore characters" parameter before. Is this a feature that has been added recently?

    It's a shame that the documentation on the processors doesn't go into detail about how the Tokenizer works. It would be helpful if people could better understand how best to configure it. And what's the rationale behind ignoring the characters "._-"? Are there any downsides to not ignoring these characters?
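
    To make the cause concrete, here is a minimal plain-PHP sketch (not the module's actual code) of what the default setting does to a URL- or filename-style string, using the 50-character limit from the warning above:

        <?php
        $slug = 'this-is-a-long-file_name.which.is.over-the.50-character-word-length_limit';

        // Default Tokenizer setting: ".", "_" and "-" are ignored (deleted), so the
        // whole string collapses into a single token longer than 50 characters.
        $joined = str_replace(['.', '_', '-'], '', $slug);
        var_dump(strlen($joined) > 50); // bool(true) – this is what triggers the warning.

        // With the "Ignore characters" field emptied, those characters fall back to
        // the whitespace list, and the string is split into short, indexable tokens.
        $tokens = preg_split('/[\s._-]+/', $slug);
        print_r($tokens); // ["this", "is", "a", "long", "file", "name", ...]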

  • 🇦🇹Austria drunken monkey Vienna, Austria

    It's a shame that the documentation on the processors doesn't go into detail about how the Tokenizer works. It would be helpful if people could better understand how best to configure it. And what's the rationale behind ignoring the characters "._-"? Are there any downsides to not ignoring these characters?

    Good point, thanks for the suggestion. I edited the page to provide a little bit more detail about how the Tokenizer works – hope that helps.

    Regarding your specific question: I think this mostly has to do with the fact that the Database backend currently doesn’t support phrase queries, so if you split a word into two tokens, the fact that they appeared right next to each other is effectively lost. Thus, the only way to make sure that a search for “rag-tag” doesn’t match a text that just contains “rag” and “tag” somewhere is to index it as a single token – either by ignoring the dash, or by indexing it as part of the word. And since dashes are optional a lot of the time in English (e.g., in “re-introduce”/“reintroduce”), ignoring them seemed like the better option. The same applies to dots and underscores. (Note that dots used as normal punctuation, at the end of a sentence, can safely be ignored, as they are followed by a space anyway.)
    I hope this helps you understand the reasoning behind that default. In any case, if you see better results when clearing that option, then by all means do so.

    The option was added in September 2020 (see #3253986: Tokenizer doesn't allow no ignore characters), so indeed it didn’t exist when this issue was initially posted.
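
    To illustrate the phrase-query point, a plain-PHP sketch (illustrative only, not how the Database backend is implemented): without phrase support, a document is effectively just a bag of tokens, so splitting “rag-tag” at the dash makes it match documents that merely contain both words somewhere.

        <?php
        // A word-level index with no phrase support: the document is just a set of tokens.
        $document = ['the', 'rag', 'lay', 'next', 'to', 'the', 'tag'];

        // If the dash splits the word, the query "rag-tag" becomes two independent tokens
        // and matches the document above even though "rag-tag" never occurs in it.
        $split_query = ['rag', 'tag'];
        var_dump(array_diff($split_query, $document) === []); // bool(true) – a false positive.

        // If the dash is ignored instead, both query and index use the single token
        // "ragtag", so only texts that actually contain that word can match.
        $joined_query = ['ragtag'];
        var_dump(array_diff($joined_query, $document) === []); // bool(false)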

  • @drunken monkey: thanks, that makes things much clearer. So, I see there are 3 basic options when deciding what to do with characters like "-", "." and "_".

    1. Ignore them - in which case we are liable to get "overlong word" errors if the text contains strings like "this-is-a-long-file_name.which.is.over-the.50-character-word-length_limit". Shorter strings like "smith-and-jones" will be indexed OK, but as "smithandjones", and so a search for "smith" won't match. A word like "re-introduce" will get indexed as "reintroduce", and will match a search for "re-introduce" or "reintroduce", but not a search for "introduce".

    2. Don't ignore them, and treat them as whitespace. In this case, "smith-and-jones" is indexed as 3 separate words, and so a search for "smith" will match. But "re-introduce" will also be treated as separate words, and so a search for "reintroduce" will not match. (A search for "re-introduce" will match, as will a search for "introduce".)

    3. Don't ignore them, and don't treat them as whitespace. In this case, "smith-and-jones" is indexed as-is, so, again, a search for "smith" will not match. A search for "reintroduce" won't match "re-introduce".

    I see that the list of default whitespace characters is very long, and does include "-._", so by removing these characters from the Tokenizer's "Ignore Characters" option, they will instead be treated as whitespace, as in (2). (The sketch below runs each of the three options on these examples.)
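
    A small PHP sketch of the three options (purely illustrative; the function below only mimics the Tokenizer's "Ignore characters" and "Whitespace characters" settings):

        <?php
        function tokenize(string $text, string $ignored, string $spaces): array {
          // Characters in $ignored are deleted; characters in $spaces split the text.
          $chars = $ignored === '' ? [] : str_split($ignored);
          $text = str_replace($chars, '', $text);
          $pattern = '/[\s' . preg_quote($spaces, '/') . ']+/';
          return array_values(array_filter(preg_split($pattern, $text)));
        }

        // 1. Ignore "-": one joined token, so a search for "smith" will not match.
        print_r(tokenize('smith-and-jones', '-', ''));  // ["smithandjones"]
        // 2. Treat "-" as whitespace: separate tokens, so "smith" matches but "reintroduce" does not.
        print_r(tokenize('re-introduce', '', '-'));     // ["re", "introduce"]
        // 3. Neither ignore nor split: the dash stays inside the token.
        print_r(tokenize('smith-and-jones', '', ''));   // ["smith-and-jones"]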

  • 🇬🇧United Kingdom sittard

    We are using the pre-processors in the following order: Ignore Case, Transliteration, HTML Filter, Tokenizer, Ignore Characters, Stemmer and Type Specific Boosting.

    We are seeing the following error: An overlong word (more than 50 characters) was encountered while indexing: tsukaketonaruyounacijidenazhuangkuangkafurokuramunishengrirumareteirukotokabiyaotesu.

    On investigation we noticed this error happening because the “Transliteration” processor converts Chinese/Japanese characters into romanized words, producing the long words.

    When we disable transliteration, most of the errors disappear, apart from a few words that are converted from links. However, disabling the transliterator might cause search issues for other languages with special characters (such as Polish or Spanish), as searching for those words without the special characters will not work. For example, a search string that contains the letter “o” will not match special characters in strings such as “ò”.

    Please can you advise?
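
    For reference, a rough sketch of what is happening under the hood (using core's transliteration service directly; the exact romanization shown is only approximate):

        <?php
        // Drupal core's transliteration service – the same one the "Transliteration"
        // processor relies on – maps each CJK character to a romanized chunk.
        $romanized = \Drupal::transliteration()->transliterate('誕生日に生まれている', 'ja');

        // No spaces are inserted between the chunks, so the whole sentence comes back
        // as one long run of ASCII letters (roughly "danshengrinishengmareteiru" here),
        // which the Tokenizer cannot split and the backend then reports as overlong.
        var_dump($romanized);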

  • 🇦🇹Austria drunken monkey Vienna, Austria

    Try putting the “Transliteration” processor after “Tokenizer”, and maybe check whether switching the “Simple CJK handling” setting of the “Tokenizer” processor on or off helps in your use case.
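
    As far as I understand it, “Simple CJK handling” works like the corresponding option in Drupal core's search module: a run of CJK characters is split into overlapping sequences of “minimum word size” characters, so that substring searches can still match. A rough sketch of that idea (illustrative only, not the processor's actual code):

        <?php
        function simple_cjk(string $text, int $min_word_size = 3): array {
          // Split the CJK run into overlapping windows of $min_word_size characters.
          $chars = mb_str_split($text);
          $tokens = [];
          for ($i = 0; $i <= count($chars) - $min_word_size; $i++) {
            $tokens[] = implode('', array_slice($chars, $i, $min_word_size));
          }
          return $tokens;
        }

        print_r(simple_cjk('誕生日おめでとう'));
        // ["誕生日", "生日お", "日おめ", "おめで", "めでと", "でとう"]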

    For transliteration, we simply use the transliteration service provided by Drupal core. I know very little about Asian scripts so cannot really offer any more advice or provide suggestions for solving this within our module. It seems like, ideally, transliteration would insert spaces between separate words, but I don’t know if that’s really possible (that is, if it is simple enough to spot which characters form words).

  • 🇦🇹Austria maxilein

    Thank you all.
    I spent days on that problem... and could not figure out where the problems came from...

    Comment #23 made the options very clear.

    It turns out we have lots of links in the body/rendered item which, with the default settings, are converted to overlong words:

    https://www...../resources/blog/create-custom-content-type-programmatica...
    becomes
    createcustomcontenttypeprogrammaticallyusingconfigurationapidrupal8

    So the solution is very simple.

    I chose

    2. Don't ignore them, and treat them as whitespace. In this case, "smith-and-jones" is indexed as 3 separate words, and so a search for "smith" will match. But "re-introduce" will also be treated as separate words, and so a search for "reintroduce" will not match. (A search for "re-introduce" will match, as will a search for "introduce".)

    See the screenshot for the entire setting (which is basically the default after installing Search API).

  • 🇦🇹Austria maxilein

    Please, what is that patch for?

  • 🇱🇻Latvia mr.valters Georgia

    This patch is to remove overlong words (longer than 50 characters). In 99% of cases these are links (not visible in the page), so visitors will never see them in the results.

  • 🇺🇸United States Taiger Bend, Oregon

    The code-based fix did not work for my situation. It was the configuration change mentioned in #26 that worked for me.

  • 🇺🇸United States joegl

    Thanks #23 and #26 for the informative comments. I was able to implement the solution in #26 (the second option outlined in #23) while leaving the Ignore Characters processor enabled for characters other than the troublemakers: "._-".

    In the Ignore Characters configuration:
    - Remove the "." from the Strip by Regular Expression field (change it to "['¿¡!?,:;]")
    - Under the "Strip by character property" settings, uncheck the "Punctuation, Dash Characters" box (see https://en.wikipedia.org/wiki/Unicode_character_property for info here).
    - If you're also worried about the underscores "_", uncheck the "Punctuation, connector" box as well.
    This essentially keeps the "." and "-" intact (and "_" if you want).

    Then as described in the Tokenizer configuration:
    - Remove "._-" from the Ignore characters field (leave it blank)
    - Add "._-" to the Whitespace characters field
