All fields being indexed as arrays

Created on 12 April 2022, about 2 years ago
Updated 18 June 2024, 10 days ago

It seems that this module indexes every field as an array, regardless of the type specified in the mappings.

So a string becomes a text array, an integer becomes an integer array, and so on.

Not sure if this was inherited from ES connector, but it creates a weird situation as the values conveyed in the mappings don't correspond to reality.

Here is an example:

{
  "_source" : {
    "industry_sector_name" : [
      "Education & Training"
    ],
    "nid" : [
      45574
    ],
    "opportunity_type_name" : [
      "Virtual Experience"
    ],
    "overview" : [
      """Lorem ipsum"""
    ],
    "parent_employer_advertiser_name" : [
      "Example"
    ],
    "study_field_name" : [
      "Engineering & Mathematics"
    ],
    "title" : [
      "Example virtual experience (REM)"
    ],
    "search_api_id" : [
      "entity:node/45574:en"
    ],
    "search_api_datasource" : [
      "entity:node"
    ],
    "search_api_language" : [
      "en"
    ]
  }
}

I'm not sure whether there is a good reason for this, but it feels strange, especially coming from Solr Search API, where those mappings are respected.

Thoughts?

🐛 Bug report
Status: Needs work
Version: 3.0
Component: Code
Created by: 🇦🇺 Australia kyuubi

Comments & Activities

  • 🇦🇺 Australia kim.pepper 🏄‍♂️🇦🇺 Sydney, Australia

    My understanding is all fields are multi-value fields, even if there is just one value. There is no array data type.

  • Status changed to Postponed: needs info over 1 year ago
  • 🇦🇺 Australia kim.pepper 🏄‍♂️🇦🇺 Sydney, Australia

    Postponing unless someone has more input on my previous comment.

  • Status changed to Active over 1 year ago
  • 🇦🇺 Australia tallytarik

    I've just run into this trying to use the Neural Search plugin with a custom knn_vector field and ingest pipeline. That field type only supports a single value, and throws an error when it's passed an array. This is what happens at the moment because the input text field (title in my case) is indexed as an array:

    [error] failed to parse field [title_embedding] of type [knn_vector] in document with id 'entity:node/12345:en'. Preview of field's value: '{knn=[...]}'. Current token (START_OBJECT) not numeric, can not use numeric value accessors

    I've hacked together a change to IndexParamBuilder::buildFieldValues() to return the value as a string instead of an array, and can confirm it now works. Something like the patch in the linked issue, "Source Fields in Elasticsearch Index are arrays" (RTBC), might be the way to go: check the field cardinality for each field, and if it's 1, process and return the first (and only) value rather than an array. I'm pretty new to OpenSearch so not 100% across if there could be other impacts of that change, though.

  • achap 🇦🇺

    Just wanted to say I had the exact same issue as tallytarik. For the most part, everything being an array did not affect anything, except when implementing a knn_vector field. It throws the same error for a multi-value field. I used the IndexParamsEvent to alter the field to be single-value and it works.

  • 🇦🇺 Australia kim.pepper 🏄‍♂️🇦🇺 Sydney, Australia

    I'm happy to consider this in the next major release branch. I think it would be a BC break, and I'm not sure whether an upgrade path would be needed.

  • Status changed to Needs work 10 days ago
  • 🇦🇺 Australia kim.pepper 🏄‍♂️🇦🇺 Sydney, Australia

    Spent some time looking at the patch and how this could be implemented here. The code that checks whether a field is a list or not is quite complex, and seems to indicate a lack of trust in the TypedData definition's isList() method.

    If we were to proceed, I would expect we would need pretty decent Kernel test coverage to ensure indexing and querying work as expected.
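
For anyone following along, the cardinality-based idea raised in the comments above can be sketched roughly as follows. This is only an illustration in Python, not the module's code and not the linked patch: `build_field_values`, its two arguments, and the cardinality lookup are all hypothetical stand-ins for whatever the module actually knows about each field. It assumes a cardinality of 1 means single-value and -1 means unlimited, as in Drupal field storage.

```python
from typing import Any, Mapping, Sequence


def build_field_values(
    values: Mapping[str, Sequence[Any]],
    cardinality: Mapping[str, int],
) -> dict[str, Any]:
    """Unwrap single-value fields; keep everything else as a list.

    Hypothetical helper: `values` maps a field name to its extracted
    values, `cardinality` maps a field name to its maximum number of
    values (1 = single value, -1 = unlimited).
    """
    document: dict[str, Any] = {}
    for name, field_values in values.items():
        items = list(field_values)
        if cardinality.get(name, -1) == 1:
            # Single-value field: emit a scalar, or None when empty.
            document[name] = items[0] if items else None
        else:
            # Multi-value or unknown cardinality: keep the current
            # behaviour and index the values as an array.
            document[name] = items
    return document


doc = build_field_values(
    {
        "nid": [45574],
        "title": ["Example virtual experience (REM)"],
        "study_field_name": ["Engineering & Mathematics"],
    },
    {"nid": 1, "title": 1, "study_field_name": -1},
)
```

With this sketch, `nid` and `title` come out as scalars while `study_field_name` stays an array. Fields with unknown cardinality deliberately fall back to the array form, so existing indexes would only change for fields known to be single-value. That would still be a behaviour change for those fields.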
