Support multisite index in Search API DB ?

Created on 14 November 2024, about 1 month ago

Problem/Motivation

Howdy!

We'd like to index nodes from multiple sites into the same database (using search_api_db):

We managed to make multiple sites talk to the same database & tables by updating the table key_value where column collection is search_api_db.indexes (there seems to be no way to configure this in the admin UI).

However we are facing a limitation: the primary key of the Search API DB tables (item_id) does not take into account the site name. It is only made of the entity type, entity ID and language (e.g. entity:node/3:en). As a result if we index a node which ID is 123 from site1, and then index another node which ID is also 123 from site2, then the latter will remove the former from the index.

Possible solution

At first glance it looks like we may be able to tweak the value of the primary key (item_id) by extending the datasource ContentEntity (method getItemId()).

Would this be the recommended way to implement this functionality?

Does Search API DB provides a native support for indexing multiple sites into the same database? This seems to be supported by Search API Solr ( documentation ).

Thanks!

💬 Support request
Status

Active

Version

1.30

Component

Database backend

Created by

🇨🇦Canada fengtan Montreal, Canada

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @fengtan
  • 🇨🇦Canada fengtan Montreal, Canada

    Since this is supported by search_api_solr , I had a look at how it is implemented there.

    search_api_solr's unique key is id (source) ; this is different from search_api_db which primary key is item_id:

    • id is a concatenation of the site hash, index ID and node ID ; because it depends on the site hash it supports indexing multiple sites into the same Solr index (source). The site hash is computed from the site's base URL (source).
    • item_id which does not take into account the site the content comes from (sample value of an item_id: entity:node/3:en).

    I am tempted to come up with a similar mechanism for search_api_db and try to make a patch. That being said, because the table names are all stored in the {key_value} table (collection search_api_db.indexes), this means this functionality would assume all sites share the same key_value record. The site administrator may have to make sure this record is identical on all sites.

  • 🇨🇦Canada fengtan Montreal, Canada

    Here is a patch that adds a new column site_hash to all tables, and includes that column into primary keys. This is similar to what Search API Solr does.

    Documentation on how to make multiple sites index into the same tables using this functionality is included in README.md in the patch. This requires to synchronize the value of search_api_db.indexes (table key_value) across all sites manually, which is a bit awkward but seems to work. A possible improvement would be to add an option to automatically set this up via the admin UI.

    The patch does not include a mechanism to search across multiple sites (we do not need this ourselves) but that is an improvement that could also be implemented.

    I will port this patch to 8.x-1.x-dev too.

  • 🇨🇦Canada fengtan Montreal, Canada

    ... and here is a re-roll for 8.x-1.x-dev.

    I am going to move this to "Needs review" just to know whether this breaks automated tests and get some feedback on whether there is a desire to support this feature. I am guessing the manual steps to synchronize search_api_db.indexes on multisite sites should be automated, but before we implement that I'd like to know if I am going into the right direction.

    Thanks!

  • Pipeline finished with Failed
    30 days ago
    Total: 330s
    #347328
  • Pipeline finished with Failed
    25 days ago
    Total: 331s
    #352450
  • Pipeline finished with Failed
    24 days ago
    Total: 342s
    #353553
  • Pipeline finished with Canceled
    24 days ago
    Total: 128s
    #353635
  • Pipeline finished with Success
    24 days ago
    Total: 458s
    #353639
  • 🇨🇦Canada fengtan Montreal, Canada

    Automated tests were failing in !192.
    I have fixed them by removing all the search-specific conditions for now. The patch now just takes care of the indexing part.
    Updated the patch file accordingly.

  • 🇦🇹Austria drunken monkey Vienna, Austria

    Thanks for opening this issue and already providing a working implementation.

    I think the main question here is how many people would actually use this functionality. If there is enough interest, I think the other problems (see below) wouldn’t be that hard to resolve, but I agree that there’s no need to work on them yet if we don’t know if this will actually get merged. It does add some complexity, and would still require some work to get right, so I wouldn’t want to merge this just for a few users.

    For reference, some remaining tasks I’d see for this issue:

    • Either automating the key-value change or making it unnecessary.
    • Correctly handling searches on a multi-site index one way or the other: either only displaying results from the current site or being able to correctly handle results from other sites. (We might just want to make this an option.)
    • Maybe instead of adding a site_hash column just changing item_id to include the site hash, as done in the Solr module.
    • Adding test coverage (also for the update hook).
    • I also added two detailed comments to the MR.
Production build 0.71.5 2024