Undefined language is always used for most of the Fulltext typed fields in Solr, which can reduce the accuracy of search results because of accents

Created on 18 July 2024
Updated 21 July 2024

Setup

  • Solr version: 8.11.3
  • Drupal Core version: 10.3.1
  • Search API version: 8.x-1.35
  • Search API Solr version: 4.3.4
  • Configured Solr Connector: Standard

Issue

Let me describe our developer experience when building a search page for multiple languages:

  • install a site with 2 languages, for example: English, Hungarian
  • make a node type translatable
  • add an article node with title in English: "Hungary field"
  • add a translation for this node in Hungarian: "Hungária mező"
  • install Search API and Search API Solr
  • create a new index with Nodes, and add the node Title field as "Fulltext (ngram)"
  • run the indexing
  • build a view with a fulltext search filter so we can test
  • open this view in Hungarian
  • notice that searches for "hungária", "hungaria", "hun" and "mező" all display results
  • try to search for "mezo" now
  • there are no results for some reason
  • change in the index the type of Title to regular "Fulltext"
  • run the indexing
  • notice that now "hungária", "hungaria", "mező" and even "mezo" give results. For regular "Fulltext" there are no results for "hun", since that type only matches full words.

I looked into what could be happening and found that when I use the "Fulltext" type, the field in Solr becomes "tm_X3b_hu_title" with type "text_hu", which is defined in search_api_solr/config/optional/search_api_solr.solr_field_type.text_hu_7_0_0.yml:

  analyzers:
    -
      type: index
      charFilters:
        -
          class: solr.MappingCharFilterFactory
          mapping: accents_hu.txt

That means for "Fulltext", the accents are removed, so when someone searches for "mezo", then "mező" will be a valid result.
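For reference, Solr's MappingCharFilterFactory mapping files use a "source" => "target" syntax. I haven't reproduced accents_hu.txt here, but the entries that make this folding work look something like:

  "á" => "a"
  "ő" => "o"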

But when I change the type to "Fulltext (ngram)", it becomes "tcngramm_X3b_hu_title" with type "text_ngram". And according to search_api_solr/config/install/search_api_solr.solr_field_type.text_ngram_und_7_0_0.yml:

  analyzers:
    -
      type: index
      charFilters:
        -
          class: solr.MappingCharFilterFactory
          mapping: accents_und.txt

That means for "Fulltext (ngram)", only the accents defined in accents_und.txt are removed, so when someone searches for "mezo", "mező" will not be a valid result, since the "ő -> o" mapping is missing from accents_und.txt. The earlier "hungaria" search worked only because "á -> a" is currently in accents_und.txt.
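The behavior above can be sketched with a small Python simulation of the analysis chain (an illustration, not Solr's actual code): a MappingCharFilter-style fold runs before n-gram tokenization, and a query term only matches if it appears among the indexed grams. The mapping tables below are assumptions modeled on accents_und.txt and accents_hu.txt.

```python
# Minimal sketch of Solr's char filter + ngram analysis, for illustration only.

def fold(text, mapping):
    """Apply a MappingCharFilter-style character mapping after lowercasing."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())

def ngrams(text, lo=3, hi=10):
    """Emit all n-grams between lo and hi characters, like an NGramTokenizer."""
    return {text[i:i + n] for n in range(lo, hi + 1) for i in range(len(text) - n + 1)}

ACCENTS_UND = {"á": "a"}             # assumed: "ő" -> "o" is missing, like accents_und.txt
ACCENTS_HU = {"á": "a", "ő": "o"}    # assumed: the Hungarian table covers "ő"

indexed = ngrams(fold("mező", ACCENTS_UND))
print("mezo" in indexed)     # False: "ő" was not folded, so the index only holds "mező" grams

indexed_hu = ngrams(fold("mező", ACCENTS_HU))
print("mezo" in indexed_hu)  # True: accent folding makes the query and indexed grams match
```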

I saw that most languages define language-specific versions mostly for these field types:
- text_LANGCODE
- text_unstemmed_LANGCODE
- text_phonetic_LANGCODE
- collated_LANGCODE

That means if someone wants to use "Fulltext (ngram)", "Fulltext (ngramstring)", "Fulltext (edge)", "Fulltext (edgestring)" etc., they will hit the same problem: because a language-specific type is missing, accents or stopwords can degrade search for native speakers of that language.
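If the module filled in the missing types instead, a language-specific ngram type could reuse the existing Hungarian mapping. A hypothetical text_ngram_hu definition (the name follows the pattern above; this is a sketch, not an existing config) would differ from text_ngram_und mainly in its charFilters:

  analyzers:
    -
      type: index
      charFilters:
        -
          class: solr.MappingCharFilterFactory
          mapping: accents_hu.txt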

The question is: do we have to fill in the missing types for each language manually, or should we add more accents to accents_und.txt?
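To illustrate the second option: assuming accents_und.txt uses the standard MappingCharFilter syntax, covering the Hungarian long vowels would mean adding entries like these (my suggestion, not an existing patch):

  "ő" => "o"
  "ű" => "u"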

Feature request
Status

Active

Version

4.0

Component

Code

Created by

🇸🇰Slovakia kaszarobert


Comments & Activities
