- Issue created by @achap
- Status changed to Postponed: needs info
almost 2 years ago 1:07am 14 March 2023 - Australia kim.pepper Sydney, Australia
Seems like a reasonable change. Are you able to submit a PR?
- Assigned to achap
- Australia kim.pepper Sydney, Australia
Can you take a look at "Add a search_as_you_type data type" (Fixed) to see if that is a better fit for your case?
- achap
Thanks for putting that together. From what I'm seeing, it actually has the same issue as the original edge n-gram implementation, i.e. it's highlighting the entire word rather than the n-grams themselves. Not sure why that is, based on the docs: https://opensearch.org/docs/latest/search-plugins/searching-data/highlight/
- @achap opened merge request.
- Status changed to Needs review
almost 2 years ago 4:44am 23 March 2023 - achap
Switching from filter to tokenizer is working for me with Edge N-gram filters. I guess the two plugins can co-exist?
- Status changed to Postponed: needs info
almost 2 years ago 10:12pm 23 March 2023 - Australia kim.pepper Sydney, Australia
Yeah they can both exist.
I wonder if you can get the same results with search_as_you_type by just playing with the highlighter options? https://www.elastic.co/guide/en/elasticsearch/reference/current/highligh...
- achap
I have previously played around with those settings on the edge n-gram field, before using a custom tokenizer, and they didn't appear to do anything, but I haven't had a chance to try them yet for search_as_you_type. I imagine it's caused by the same issue, i.e. search_as_you_type is probably using the standard tokenizer, which splits tokens on word boundaries rather than on individual characters.
This SO question appears to solve it in the same way for the search_as_you_type implementation (implementing an edge n-gram tokenizer) https://stackoverflow.com/questions/59677406/how-do-i-get-elasticsearch-to-highlight-a-partial-word-from-a-search-as-you-type
The docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html#analysis-tokenizers say that a tokenizer is, among other things, responsible for:
- Order or position of each term (used for phrase and word proximity queries)
- Start and end character offsets of the original word which the term represents (used for highlighting search snippets).
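For reference, the difference between the filter approach and the tokenizer approach can be sketched as index analysis settings. This is only an illustration, written as a Python dict so it can be dumped to JSON. The `example_*` names and the gram sizes are assumptions made up for this sketch, not the module's actual configuration.

```python
import json

# Hypothetical analysis settings; all "example_*" names and the gram sizes
# are illustrative assumptions, not what the module ships.
analysis_settings = {
    "analysis": {
        "filter": {
            # Token filter: runs after the tokenizer, so (as far as I can
            # tell) each n-gram keeps the offsets of the whole word.
            "example_edge_ngram_filter": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 10,
            }
        },
        "tokenizer": {
            # Tokenizer: emits the n-grams itself, so each n-gram gets its
            # own start and end offsets.
            "example_edge_ngram_tokenizer": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 10,
                "token_chars": ["letter", "digit"],
            }
        },
        "analyzer": {
            "example_filter_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "example_edge_ngram_filter"],
            },
            "example_tokenizer_analyzer": {
                "type": "custom",
                "tokenizer": "example_edge_ngram_tokenizer",
                "filter": ["lowercase"],
            },
        },
    }
}

print(json.dumps(analysis_settings, indent=2))
```

The only structural difference is where the `edge_ngram` step sits: as a filter after the `standard` tokenizer, or as the tokenizer itself.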
If I analyze a title field that is using the custom edge ngram tokenizer I get the following token information for the sentence "This is a title":
{ "tokens" : [
  { "token" : "t", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
  { "token" : "th", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 1 },
  { "token" : "thi", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 2 },
  { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 3 },
  { "token" : "i", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 4 },
  { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 5 },
  { "token" : "a", "start_offset" : 8, "end_offset" : 9, "type" : "word", "position" : 6 },
  { "token" : "t", "start_offset" : 10, "end_offset" : 11, "type" : "word", "position" : 7 },
  { "token" : "ti", "start_offset" : 10, "end_offset" : 12, "type" : "word", "position" : 8 },
  { "token" : "tit", "start_offset" : 10, "end_offset" : 13, "type" : "word", "position" : 9 },
  { "token" : "titl", "start_offset" : 10, "end_offset" : 14, "type" : "word", "position" : 10 },
  { "token" : "title", "start_offset" : 10, "end_offset" : 15, "type" : "word", "position" : 11 }
] }
If I analyze a search_as_you_type field I get the following information:
{ "tokens" : [
  { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
  { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
  { "token" : "a", "start_offset" : 8, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 },
  { "token" : "title", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 }
] }
So if the offset information is used for highlighting that explains why only the edge_ngram_tokenizer is working as expected.
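To make the offset argument concrete, here is a small pure-Python sketch. It is a toy model under my own assumptions, not how Elasticsearch or OpenSearch implement highlighting: `edge_ngrams` and `highlight` are hypothetical helpers that mimic prefix tokens carrying either their own offsets (tokenizer-style) or the offsets of the whole word (filter-style), then wrap the matched offsets in `<em>` tags.

```python
import re

def edge_ngrams(text, per_gram_offsets, min_gram=1, max_gram=10):
    """Return (token, start_offset, end_offset) tuples for every word prefix.

    per_gram_offsets=True  -> each prefix ends where the prefix ends
                              (tokenizer-style offsets).
    per_gram_offsets=False -> each prefix keeps the whole word's offsets
                              (filter-style offsets).
    """
    tokens = []
    for word in re.finditer(r"[A-Za-z0-9]+", text):
        longest = min(max_gram, len(word.group()))
        for size in range(min_gram, longest + 1):
            end = word.start() + size if per_gram_offsets else word.end()
            tokens.append((word.group()[:size].lower(), word.start(), end))
    return tokens

def highlight(text, tokens, query):
    """Wrap the span of the first token equal to the query in <em> tags."""
    for token, start, end in tokens:
        if token == query.lower():
            return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]
    return text  # no matching token, nothing to highlight

title = "This is a title"
print(highlight(title, edge_ngrams(title, per_gram_offsets=True), "tit"))
# This is a <em>tit</em>le
print(highlight(title, edge_ngrams(title, per_gram_offsets=False), "tit"))
# This is a <em>title</em>
```

Same tokens, same query; only the stored offsets differ, and that alone decides whether the partial word or the whole word gets wrapped.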
- Australia kim.pepper Sydney, Australia
OK. Makes sense. Now we just need to decide whether highlighting whole words or tokens should be the default.
- achap
Sorry for not replying, got a bit sidetracked :D I've been using this patch in production without issues for a while now. In terms of which one should be the default, I guess something to consider is index size and performance. I don't have any hard data to back this up, but I'd guess that tokenizing every character is a lot more expensive than tokenizing every word. So, because of that and to preserve backwards compatibility, maybe it makes sense to keep the filter as the default and add the tokenizer as a new plugin?
- Australia kim.pepper Sydney, Australia
I'm inclined to push people towards the search_as_you_type approach rather than getting into specific tokenizers and analyzers etc. If people want to build their own custom solutions they can do that.
- achap
No worries I will move this patch into our own codebase :)
- Status changed to Closed: won't fix
over 1 year ago 1:00am 6 July 2023 - Australia kim.pepper Sydney, Australia
OK cool. I'll close this for now then.