Silence the warnings: An overlong word (more than 50 characters) ...

Created on 23 July 2019, almost 5 years ago
Updated 21 February 2024, 4 months ago

I would like to remove the warning message from my logs:

An overlong word (more than 50 characters) was encountered while indexing: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Since database search servers currently cannot index words of more than 50 characters, the word was truncated for indexing. If this should not be a single word, please make sure the "Tokenizer" processor is enabled and configured correctly for index Website search.

I tried enabling and configuring the Tokenizer processor, but it broke my search.

I created a simple patch for this.

💬 Support request
Status

Needs work

Version

1.28

Component

Plugins

Created by

🇦🇺Australia Mirelyj


Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • I got this problem after setting up Search API (v8.x-1.28) on my Drupal 9.5.3 site. I'm using the pre-processors: Ignore Case, Transliteration, HTML Filter, Tokenizer, Stemmer. I'm not using the Ignore Characters processor, but I see that the Tokenizer itself has an "Ignore Characters" parameter. This is set by default to ignore the three characters "._-". I changed this, setting it to empty.

    This has pretty much solved the problem for me. I now get a much smaller number of "overlong word" errors, and these are almost entirely from strings of characters which are not words, so don't need to be indexed. So, the solution seems simple.

    I am wondering why nobody has mentioned the "Ignore characters" parameter in the Tokenizer before. Is this a recently added feature?

    It's a shame that the documentation on the processors doesn't go into detail about how the Tokenizer works. It would be helpful if people could better understand how best to configure it. And what's the rationale behind ignoring the characters "._-"? Are there any downsides to not ignoring these characters?

  • 🇦🇹Austria drunken monkey Vienna, Austria

    It's a shame that the documentation on the processors doesn't go into detail about how the Tokenizer works. It would be helpful if people could better understand how best to configure it. And what's the rationale behind ignoring the characters "._-"? Are there any downsides to not ignoring these characters?

    Good point, thanks for the suggestion. I edited the page to provide a little bit more detail about how the Tokenizer works – hope that helps.

    Regarding your specific question: I think this mostly has to do with the fact that the Database backend currently doesn’t support phrase queries, so if you split a word into two tokens, the fact that they appeared right next to each other is effectively lost. Thus, the only way to make sure that a search for “rag-tag” doesn’t match a text that just contains “rag” and “tag” somewhere is to index it as a single token – either by ignoring the dash, or by indexing it. And since dashes are optional a lot of the time in English (e.g., in “re-introduce”/“reintroduce”), ignoring them seemed like the better option. Similar for dots and underscores. (Note that dots used as normal punctuation, at the end of a sentence, can safely be ignored as they are followed by a space anyways.)
    I hope this helps you understand the reasoning behind that default. In any case, if you see better results when clearing that option, then by all means do so.

    The option was added in September 2020 (see #3253986: Tokenizer doesn't allow no ignore characters), so indeed it didn’t exist when this issue was initially posted.
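
    To make the reasoning concrete, here is a minimal plain-PHP sketch (an illustration only, not code from the module) of why splitting "rag-tag" into two tokens loses the phrase information:

      <?php

      // Hypothetical token lists: a word-based index records which words occur,
      // but (without phrase query support) not that they occurred next to each other.
      $doc_with_phrase = ['a', 'rag', 'tag', 'band'];
      $doc_without = ['the', 'rag', 'was', 'lying', 'next', 'to', 'the', 'tag'];

      $query = ['rag', 'tag'];

      foreach (['doc_with_phrase' => $doc_with_phrase, 'doc_without' => $doc_without] as $name => $tokens) {
        // A plain AND match over words cannot tell the two documents apart.
        $matches = array_diff($query, $tokens) === [];
        echo "$name matches: " . var_export($matches, TRUE) . "\n";
      }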

  • @drunken monkey: thanks, that makes things much clearer. So, I see there are 3 basic options when deciding what to do with characters like "-", "." and "_".

    1. Ignore them - in which case we are liable to get "overlong word" errors if the text contains strings like "this-is-a-long-file_name.which.is.over-the.50-character-word-length_limit". Shorter strings like "smith-and-jones" will be indexed OK, but as "smithandjones", and so a search for "smith" won't match. A word like "re-introduce" will get indexed as "reintroduce", and will match a search for "re-introduce" or "reintroduce", but not a search for "introduce".

    2. Don't ignore them, and treat them as whitespace. In this case, "smith-and-jones" is indexed as 3 separate words, and so a search for "smith" will match. But "re-introduce" will also be treated as separate words, and so a search for "reintroduce" will not match. (A search for "re-introduce" will match, as will a search for "introduce".)

    3. Don't ignore them, and don't treat them as whitespace. In this case, "smith-and-jones" is indexed as-is, so, again, a search for "smith" will not match. A search for "reintroduce" won't match "re-introduce".

    I see that the list of default whitespace characters is very long, and does include "-._", so by removing these characters from the Tokenizer's "Ignore Characters" option, they will instead be treated as whitespace, as in (2). (See the sketch below for a rough comparison of options 1 and 2.)
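
    As a rough plain-PHP sketch of options 1 and 2 (an illustration only, not the Tokenizer's actual implementation):

      <?php

      $text = 'smith-and-jones re-introduce';

      // Option 1: ignore "-", "." and "_" (strip them, then split on whitespace).
      $option1 = preg_split('/\s+/', str_replace(['-', '.', '_'], '', $text), -1, PREG_SPLIT_NO_EMPTY);
      print_r($option1); // ['smithandjones', 'reintroduce']

      // Option 2: treat "-", "." and "_" as whitespace (split on them as well).
      $option2 = preg_split('/[\s\-._]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
      print_r($option2); // ['smith', 'and', 'jones', 're', 'introduce']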

  • 🇬🇧United Kingdom sittard

    We are using the pre-processors in the following order: Ignore Case, Transliteration, HTML Filter, Tokenizer, Ignore Characters, Stemmer and Type Specific Boosting.

    We are seeing the following error: An overlong word (more than 50 characters) was encountered while indexing: tsukaketonaruyounacijidenazhuangkuangkafurokuramunishengrirumareteirukotokabiyaotesu.

    On investigation we noticed this error happens because of the “Transliteration” processor, which converts Chinese/Japanese characters into romanized words, causing the long words.

    When disabling transliteration, most of the errors disappear apart from a few words that are converted from links. However, disabling the transliterator might cause search issues for other languages with special characters (such as Polish or Spanish), as searching for those words without the special characters will not work. For example, a search string that contains the letter “o” will not match special characters in strings such as “ò”.

    Please can you advise?

  • 🇦🇹Austria drunken monkey Vienna, Austria

    Try putting the “Transliteration” processor after “Tokenizer”, and maybe try whether switching the “Simple CJK handling” setting of the “Tokenizer” processor on or off helps in your use case.

    For transliteration, we simply use the transliteration service provided by Drupal core. I know very little about Asian scripts so cannot really offer any more advice or provide suggestions for solving this within our module. It seems like, ideally, transliteration would insert spaces between separate words, but I don’t know if that’s really possible (that is, if it is simple enough to spot which characters form words).
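
    If it helps, this is roughly what the core service does; a small sketch assuming a Drupal 9/10 site (the exact romanized output depends on core's transliteration data):

      <?php

      // Drupal core's transliteration service, which the "Transliteration"
      // processor relies on. It maps characters to ASCII but does not insert
      // word boundaries.
      /** @var \Drupal\Component\Transliteration\TransliterationInterface $transliteration */
      $transliteration = \Drupal::service('transliteration');

      // Accented Latin characters become plain ASCII, which is why a search
      // for "o" can match "ò" while the processor is enabled.
      echo $transliteration->transliterate('Kraków', 'pl') . "\n"; // "Krakow"

      // CJK text contains no spaces, so the romanized result is one long
      // string that the Tokenizer then reports as an overlong word.
      echo $transliteration->transliterate('誕生日おめでとうございます') . "\n";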

  • 🇦🇹Austria maxilein

    Thank you all.
    I spent days on that problem... and could not figure out where the problems came from...

    #23 above made the options very clear.

    Turns out we have lots of links in the body/rendered item which, with the default settings, are converted to overlong words:

    https://www...../resources/blog/create-custom-content-type-programmatica...
    becomes
    createcustomcontenttypeprogrammaticallyusingconfigurationapidrupal8

    So the solution is very simple.

    I chose

    2. Don't ignore them, and treat them as whitespace. In this case, "smith-and-jones" is indexed as 3 separate words, and so a search for "smith" will match. But "re-introduce" will also be treated as separate words, and so a search for "reintroduce" will not match. (A search for "re-introduce" will match, as will a search for "introduce".)

    See the screenshot for the entire setting (which is basically the default after installing Search API).
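
    To make the failure mode concrete, a quick plain-PHP sketch using the path segments from the link above (illustration only):

      <?php

      $path = 'create-custom-content-type-programmatically-using-configuration-api-drupal-8';

      // With "-", "." and "_" on the ignore list, the whole path collapses
      // into a single token.
      $token = str_replace(['-', '.', '_'], '', $path);
      echo $token . ' (' . strlen($token) . " characters)\n"; // well over the 50-character limit

      // With the characters treated as whitespace instead, each segment is a
      // short, separately indexed word.
      print_r(preg_split('/[-._]+/', $path, -1, PREG_SPLIT_NO_EMPTY));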

  • 🇦🇹Austria maxilein

    Please, what is that patch for?

  • 🇱🇻Latvia mr.valters Georgia

    This patch is to remove overlong words (longer than 50 characters). In 99% of cases these are links (not visible on the page), so visitors will never see them in the results.

  • 🇺🇸United States Taiger Bend, Oregon

    The code-based fix did not work for my situation. It was the configuration change mentioned in #26 that worked for me.
