The whole index gets cleared when any change in the search index configuration is imported

Created on 17 March 2024, 8 months ago
Updated 26 August 2024, 3 months ago

Problem/Motivation

Currently, any change to the search index configuration clears the whole index. This is similar to the search_api_opensearch issue "The whole index gets cleared/deleted when any change in the search index configuration is imported/synced" (Needs work).

But, according to the ElasticSearch 8.12 Guide, in the REST APIs section, under Index API -> Update mapping API, you can do the following without rebuilding the index...

  1. Add new fields to an index
  2. Add new properties to an existing field
  3. Add multi-fields (i.e.: index the same field in different ways)
  4. Change supported mapping parameters for an existing field
  5. Rename a field by creating an alias field (but note that Search API doesn't track renames — they appear as one field being deleted and a new field being created — so we don't have to handle this case)

... you only need to rebuild the index when...

  1. Changing the mapping (i.e.: Search API Data Type) of an existing field
  2. Deleting/removing a field (the docs don't describe how to do this, but according to @mparker17's search team lead, it also requires us to rebuild the index)
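
As a rough illustration of the "add new fields" case above, the request sent to the Update mapping API can be sketched as follows. This only builds the request and does not send it; the helper name build_put_mapping_request, the index name foo, and the field summary are hypothetical and are not taken from the module.

```python
import json


def build_put_mapping_request(index_name, new_fields):
    """Return (method, path, body) for adding fields to an existing index.

    new_fields maps a field name to its Elasticsearch field definition.
    The helper name and arguments are hypothetical, for illustration only.
    """
    body = {"properties": dict(new_fields)}
    return "PUT", f"/{index_name}/_mapping", json.dumps(body, sort_keys=True)


# Hypothetical example: add a "summary" text field to the index "foo".
method, path, body = build_put_mapping_request(
    "foo", {"summary": {"type": "text"}}
)
print(method, path)
print(body)
```

Because the existing documents stay in place, a request like this is the kind of change that should not need the index to be cleared.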

(Note that clearing the index on even a slight index config change is also the behavior on the elasticsearch_connector-8.x-7.x branch.)

Steps to reproduce

  1. Create an index
  2. Index content
  3. Export the search index configuration and make a minor change, such as changing a label
  4. Import the search index configuration; all your indexed content will be cleared

Proposed resolution

In #4 and #5, we decided to tackle this problem in multiple phases, with this ticket being Phase 1 and "Support Aliases API and zero downtime mapping updates" (Active) being Phase 2.

In this ticket, we're going to change the behavior so that we only clear the search index when:

  1. Fields are deleted from the index
  2. Field mappings (i.e.: Search API Data Type) have changed
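
The two conditions above can be sketched as a small decision function. This is a minimal sketch, assuming each index's fields can be reduced to a mapping of field ID to Search API data type; the function name index_must_be_cleared and the sample fields are hypothetical, and the module's actual implementation may differ.

```python
def index_must_be_cleared(old_fields, new_fields):
    """Decide whether a configuration change requires clearing the index.

    Both arguments map a field ID to its data type (hypothetical shape).
    Returns True only when a field was deleted or its data type changed.
    """
    for field_id, old_type in old_fields.items():
        if field_id not in new_fields:
            return True  # Condition 1: the field was deleted.
        if new_fields[field_id] != old_type:
            return True  # Condition 2: the field's data type changed.
    return False  # Added fields and non-field changes (e.g. labels) are safe.


old = {"title": "text", "created": "date"}
print(index_must_be_cleared(old, {"title": "text", "created": "date"}))
print(index_must_be_cleared(old, {**old, "summary": "text"}))
print(index_must_be_cleared(old, {"title": "text"}))
print(index_must_be_cleared(old, {"title": "string", "created": "date"}))
```

A label change never reaches the field comparison, so it falls through to the "no clear needed" result, which is the behavior the steps to reproduce are meant to exercise.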

We also need to write tests to ensure that changes made through the UI and by importing configuration work as expected.

Remaining tasks

  1. Review and feedback
  2. RTBC and feedback
  3. Commit

User interface changes

None.

API changes

None.

Data model changes

None.

📌 Task

Status: Needs review
Version: 8.0
Component: Code
Created by: 🇫🇮Finland sokru


Comments & Activities

  • Issue created by @sokru
  • 🇫🇮Finland sokru

    Lowering the priority since the feature is the same as in 8.x-7.x.

  • 🇨🇦Canada mparker17 UTC-4

    According to the ElasticSearch 8.12 Guide, in the REST APIs section, under Index API -> Update mapping API, it sounds like:

    1. You can add new fields to an index anytime
    2. You can add new properties to an existing field anytime
    3. You can add multi-fields (i.e.: index the same field in different ways) anytime
    4. You can change mapping parameters for an existing field anytime
    5. You have to reindex to change the mapping of an existing field
    6. To rename a field, the suggested way to do so is to create an alias field, which you can do anytime
    7. (the docs don't describe how to delete/remove a field, but checking with our client's search team lead, that also requires reindexing)

    To quote from that page in the section about chang[ing] the mapping of an existing field, "If you need to change the mapping of a field in other indices, create a new index with the correct mapping and reindex your data into that index." (similar to what's described for OpenSearch in #3285438-15: The whole index gets cleared/deleted when any change in the search index configuration is imported/synced).

    So it looks like we're in a similar situation to the Search API OpenSearch maintainers.

  • 🇨🇦Canada mparker17 UTC-4

    Thinking about how to implement Phase 2 (i.e.: rebuild the index)...

    In #3285438-9: The whole index gets cleared/deleted when any change in the search index configuration is imported/synced, @longwave suggests using a blue/green deployment method. That is to say, for each Search API Index defined in Drupal's configuration (e.g.: machine name foo), we would need to work with (at least) 2 ElasticSearch Indexes (e.g.: foo_blue and foo_green). We'd initially pick one of them to be "active" (e.g.: foo_green), create it, and work with it normally. Later, if there was a configuration change that required us to re-index, then we would...

    1. create the other index (e.g.: foo_blue) with the changed configuration
    2. reindex the old index (foo_green) to the new one (foo_blue)
    3. set the new index (foo_blue) as the "active" index
    4. delete the old index (foo_green)

    To signify which is the "active" index, we should create at least 1 Index Alias that points to the currently-"active" index.

    Aside: ElasticSearch allows you to create readable Index Aliases that point to 1..* Indexes; but only allows you to create writeable Index Aliases that point to 1 Index. The simplest approach for elasticsearch_connector might be to create 1 Index Alias (e.g.: foo, named after the Search API Index) that points to the active Index; but if that doesn't work, we might have to create 2 (foo_read, and foo_write).
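
    The four steps and the single-alias variant above could be sketched as an ordered list of REST requests. This is only a sketch under the assumptions in this comment (one alias foo, indexes foo_green and foo_blue); the function name plan_blue_green_swap and the sample mapping are hypothetical, and nothing here is sent to a cluster.

```python
def plan_blue_green_swap(alias, old_index, new_index, new_mappings):
    """Return the ordered (method, path, body) requests for one swap.

    Hypothetical helper: it only describes the requests, it sends nothing.
    """
    return [
        # 1. Create the other index with the changed configuration.
        ("PUT", f"/{new_index}", {"mappings": new_mappings}),
        # 2. Reindex the old index into the new one.
        ("POST", "/_reindex", {
            "source": {"index": old_index},
            "dest": {"index": new_index},
        }),
        # 3. Repoint the alias; both actions are sent in a single request.
        ("POST", "/_aliases", {"actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]}),
        # 4. Delete the old index.
        ("DELETE", f"/{old_index}", None),
    ]


# Hypothetical example: a field changed from text to integer.
plan = plan_blue_green_swap(
    "foo", "foo_green", "foo_blue",
    {"properties": {"rating": {"type": "integer"}}},
)
for method, path, body in plan:
    print(method, path, body)
```

    With 2 aliases (foo_read and foo_write), step 3 would presumably carry a remove/add pair for each alias instead of one.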

    While I think a blue/green method is a reasonable approach, I can see it being a source of confusion: admins/DevOps might (reasonably) ask questions like...

    1. "When I created 1 Search API Index, I expected 1 ElasticSearch Index, so why did I get 2-3 things: 1-2 ElasticSearch Index Aliases, and an ElasticSearch Index; and why don?"
    2. "Why the sudden spike in CPU / disk usage when I change a text field to an integer in the index settings?"
    3. "Why were there 2 indexes for a short time?" (especially if the reindex operation resulted in a spike in the disk/memory/CPU monitoring logs)
    4. "Why did the old index that I was referring to by name get deleted?"

    ... so if we take this approach, then we need to be pretty clear about what to expect, and how to interact with it (e.g.: if you need to read/write to ElasticSearch directly, read/write to the Index Alias, not the Index directly) in the Search API Index UI, in the module's README, and/or in other documentation.

    I'd be interested in hearing feedback from other maintainers about whether they think this is the right approach, and whether we should create 1 or 2 aliases.

  • 🇫🇮Finland sokru

    I agree with the Phase 2 approach. One small detail to cover is the need to also set the index to read-only during step 2, "reindex the old index (foo_green) to the new one (foo_blue)".
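
    One way to do this might be to set the index.blocks.write setting on the old index before the reindex starts. The following is a minimal sketch under that assumption; the helper name write_block_request is hypothetical, and it only builds the request without sending it.

```python
def write_block_request(index_name, blocked=True):
    """Return (method, path, body) to toggle the write block on an index.

    Hypothetical helper. Setting index.blocks.write to True rejects writes
    to the index while still allowing reads, so searches keep working.
    """
    return "PUT", f"/{index_name}/_settings", {"index.blocks.write": blocked}


# Block writes on the old index just before reindexing into the new one.
print(write_block_request("foo_green"))
```

    Since the old index is deleted at the end of the swap, the block would only need to be lifted again if the reindex fails and the old index stays in use.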

    I think we should use this issue to cover only Phase 1 and use "Support Aliases API and zero downtime mapping updates" (Active) for Phase 2.

  • 🇨🇦Canada mparker17 UTC-4

    Sounds great! Thank you, @sokru!

    I have started writing tests, but they're still in the early stages of implementation. I will push them to this issue soon, then start on implementing phase 1.

  • Status changed to Needs work 8 months ago
  • 🇨🇦Canada mparker17 UTC-4

    Still working on this; but I've made some good progress.

  • Status changed to Needs review 3 months ago
  • 🇨🇦Canada mparker17 UTC-4

    When I copy the changes to .gitlab-ci.yml from "Support for Search API Spellcheck" (Active), i.e.: disabling _PHPUNIT_CONCURRENT and increasing the memory resources for Elasticsearch, the tests pass.

    Reviewing our goals from the issue summary and comments, I think this is ready for review now.

    I will update the issue summary to better reflect the new scope of this ticket in relation to "Support Aliases API and zero downtime mapping updates" (Active).

  • 🇨🇦Canada mparker17 UTC-4

    Updated issue summary
