Support for Configuring Synonyms

Created on 24 May 2024, about 1 month ago
Updated 9 June 2024, 23 days ago

Problem/Motivation

We have a number of synonyms we need to create for our search (e.g. ldpe for low density polyethylene and vice versa). So far we have not been able to make things work as we would expect, nor have we found any specific documentation. At this point we are considering this to be a support request, however we expect it might convert to a documentation task. We would be glad to contribute back with our findings on this as we discover our solution.

It's also possible it might have something to do with this issue https://www.drupal.org/project/search_api_pantheon/issues/3424724#commen... 🐛 Search API Schema reverts back to 4.2 after deploys or randomly Active

The Solr schema on Pantheon (all environments) is 4.2.10, with search_api_pantheon 8.1.8, search_api 8.x-1.34, and search_api_solr 4.3.3.

Our configuration includes the following.

  • Our new content is being indexed on creation.
  • We have edited the English Text Field listed on SolrFieldType Configuration (is this the right place?) to include the following lines:
    ldpe, "low density polyethylene"
    "low density polyethylene", ldpe
    lowdensitypolyethylene, ldpe
    ldpe, lowdensitypolyethylene
     
  • Our changes are saved in the Pantheon synonyms_en.txt and included with the config.zip
  • Search does not find any results for "ldpe", and the query analysis in the field analysis tool displays only "ldpe", even though the other values exist in the synonyms file

    💬 Support request
    Status

    Active

    Version

    8.1

    Component

    Documentation

    Created by

    🇺🇸 United States bsnodgrass


    Comments & Activities

    • Issue created by @bsnodgrass
    • 🇺🇸 United States bsnodgrass
    • 🇺🇸 United States dorficus

      I've been digging into this and I've found some very interesting things:

      I am using Lando for local dev with the Pantheon recipe and I've tried a couple of things.

      1. Creating a new Solr server with custom config automatically added
      2. "Posting" custom config to default Pantheon server

      Here are some strange findings:

      • When developing locally, using a custom server and index, I'm able to get synonyms to work.
      • When developing locally, using the Pantheon server and index, synonyms work with custom config.
      • When testing on Pantheon with default config, synonyms do not work, nor should they.
      • When testing on Pantheon with custom "posted" config, synonyms do not work, but they should.

      Here's where it gets strange:

      When testing on local, both custom and Pantheon servers, the testing at

      admin/config/search/search-api/server/pantheon_solr8/solr-admin/field-analysis
      

      revealed the following:

      However, testing the same way on Pantheon with a Pantheon server revealed this:

      The most interesting part is that the following happens, which I believe is related:

      I checked all of my schema files, and the tokenizer should definitely be the StandardTokenizer:

      <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100" storeOffsetsWithPositions="true">
        <analyzer type="index">
          <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
          <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
          <filter class="solr.LengthFilterFactory" min="2" max="100"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SnowballPorterFilterFactory" protected="protwords_en.txt" language="English"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.SynonymGraphFilterFactory" ignoreCase="true" synonyms="synonyms_en.txt" expand="true"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
          <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="0" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
          <filter class="solr.LengthFilterFactory" min="2" max="100"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SnowballPorterFilterFactory" protected="protwords_en.txt" language="English"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
      </fieldType>
      

      The part of note there is:

      <tokenizer class="solr.StandardTokenizerFactory"/>

      What this is telling me, in addition to the issues mentioned on https://www.drupal.org/project/search_api_pantheon/issues/3424724#commen... 🐛 Search API Schema reverts back to 4.2 after deploys or randomly Active, is that there is another config file hiding somewhere, which we do not have access to edit, that is overriding some of the customizations. This is also evidenced by seeing Schema 4.3.3 on local and 4.2.0 on Pantheon.

      Without knowing if this is indeed the case, it's difficult to determine what the next steps to correct this are.
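The schema check described in this comment can also be scripted. The following is a minimal sketch, assuming a copy of the field type is available as an XML string; it uses a trimmed version of the text_en definition quoted above, and the helper name summarize_analyzers is hypothetical. It reports which tokenizer each analyzer declares and whether a synonym filter is configured. It only inspects the file you give it, so it cannot reveal a different schema that the platform may be serving.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the text_en field type quoted in the thread
# (most filters removed for brevity).
SCHEMA_FRAGMENT = """
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" ignoreCase="true" synonyms="synonyms_en.txt" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
"""


def summarize_analyzers(field_type_xml):
    """Map each analyzer type to its tokenizer class and synonyms file (or None)."""
    root = ET.fromstring(field_type_xml)
    summary = {}
    for analyzer in root.findall("analyzer"):
        tokenizer = analyzer.find("tokenizer")
        synonyms = None
        for flt in analyzer.findall("filter"):
            # Matches SynonymGraphFilterFactory and the older SynonymFilterFactory.
            if "Synonym" in flt.get("class", ""):
                synonyms = flt.get("synonyms")
        summary[analyzer.get("type")] = {
            "tokenizer": tokenizer.get("class") if tokenizer is not None else None,
            "synonyms": synonyms,
        }
    return summary


if __name__ == "__main__":
    for kind, info in summarize_analyzers(SCHEMA_FRAGMENT).items():
        print(kind, info)
```

Running the same function against the file downloaded from each environment gives a quick way to compare what local and Pantheon actually declare.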

    • 🇵🇭 Philippines danreb

      @dorficus You are correct. The default Solr schema on the Pantheon platform is set to 4.2.10, and reposting of the Solr schema on Pantheon is currently broken (Sticky Solr Schema bug).

      If you want your custom config to take effect, what you need to do right now is open a ticket and have the CSE or the platform engineers repost the config for you in the affected environments.

    • 🇺🇸 United States bsnodgrass

      @danreb I've created a support ticket requesting assistance with making this happen, or instructions as to how we can post the config.zip ourselves.

      Initially we would like to post the schema changes on transmfg.build multidev to confirm our issue is fixed.

      Following that, we will be making a number of changes on transmfg.build and having them applied to all our environments.

    • 🇺🇸 United States dorficus

      @bsnodgrass and @danreb I have verified that after the ticket with Pantheon, we are now using the correct schema in the multidev. After verifying this, I also verified that the correct tokenizer and synonym filters were working correctly.

      It still seems that for the "core" config which defines the schema, there will need to be Pantheon intervention on all environments to get it up to date, however we are able to post our own config using drush sapps, assuming that our custom config is available to the Drupal site.

      I have included the config in a folder in the docroot of the project, so the command to post the config ends up being drush sapps pantheon_solr8 /code/solr/custom_config/.

      Key steps in this process are:

      1. Pushing the config to the platform via git
      2. Posting the config using the above command
      3. Reloading the Solr server core: admin/config/search/search-api/server/pantheon_solr8/solr-admin/reload-core
      4. Reindexing the content after the core has reloaded

      Once that was done, I was able to verify that the files were correct at admin/config/search/search-api/server/pantheon_solr8/files.
      I was also able to test the queries vs. index values using the Field Analysis tool: admin/config/search/search-api/server/pantheon_solr8/solr-admin/field-analysis
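Where the Solr core is directly reachable (for example in a local setup), Solr's own /analysis/field request handler offers a comparable check from outside Drupal. The following is a minimal sketch that only builds the request URL; the base URL and core name are hypothetical placeholders, and the function name field_analysis_url is invented for illustration.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_type, index_value, query_value):
    """Build a URL for Solr's /analysis/field request handler."""
    params = {
        "analysis.fieldtype": field_type,    # e.g. the text_en type quoted earlier
        "analysis.fieldvalue": index_value,  # text as it would be indexed
        "analysis.query": query_value,       # text as it would be searched
        "analysis.showmatch": "true",        # flag index tokens that match the query
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local endpoint and core name; adjust to your environment.
    print(field_analysis_url(
        "http://localhost:8983/solr", "drupal", "text_en",
        "ldpe", "low density polyethylene",
    ))
```

The JSON response lists the tokens produced at each stage of the index and query analyzers, which makes it easier to see whether the synonym filter expanded the query.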

      Of note, synonyms containing whitespace do not work as written. Underscores did not seem to correct this either. In our use case of "ldpe" being returned for "low density polyethylene", Solr read the latter as three separate tokens rather than a single phrase. Within our synonyms_en.txt file I had set the terms to be interchangeable (ldpe, low density polyethylene), however the query "low density polyethylene" against a field value of "ldpe" did not return results.

      To fix this, I escaped the whitespaces with low\ density\ polyethylene, redid the above steps to update the config, and the results appeared as expected.
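For a longer synonym list, the escaping described above can be generated rather than typed by hand. This is a small sketch that follows the workaround reported in this comment; the function names are hypothetical, and whether escaping is required may depend on the schema and query parser settings in use.

```python
def escape_term(term):
    """Backslash-escape spaces so a multi-word phrase is written as one entry."""
    return term.strip().replace(" ", "\\ ")


def synonym_line(terms):
    """Build one comma-separated line of equivalent terms for synonyms_en.txt."""
    return ", ".join(escape_term(t) for t in terms)


if __name__ == "__main__":
    # prints: ldpe, low\ density\ polyethylene
    print(synonym_line(["ldpe", "low density polyethylene"]))
```

With expand="true" on the synonym filter, as in the schema quoted earlier, comma-separated terms on one line are treated as equivalents, so a single line per group should be enough.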

      Thank you for your help on this, @danreb.
