Add support for periodically indexing arbitrary pages (views, contact forms, …)

Created on 16 October 2023, about 1 year ago
Updated 20 November 2023, about 1 year ago

Problem/Motivation

There have been multiple questions about indexing non-entities (views, etc) — see this issue's "Related issues" field in the sidebar.

It would be nice for search_api to have built-in support for that.

Proposed resolution

Add a datasource that takes a list of paths on the current site, and periodically fetches them (as the anonymous user) and indexes them.

I've attached a WIP patch.

Remaining tasks

  • Is this general idea suitable for inclusion in search_api?
  • — solved; I needed to subclass ComplexDataDefinitionBase and reimplement its getPropertyDefinitions method
  • — solved; in search_api_cron I added $index->getTrackerInstance()->trackAllItemsUpdated('rendered_page')
  • When the site-builder updates the list of page paths, search_api should call the datasource's getItemIds method to get an updated list of items that need to be indexed. It doesn't automatically do this, and I haven't yet figured out how to convince it to do so. In the submitConfigurationForm method, I tried adding $this->index->rebuildTracker();, but that prevents the new form values from being saved.
  • Add tests.
Feature request
Status

Needs work

Version

1.0

Component

Plugins

Created by

🇺🇸United States smokris Athens, Ohio, USA

  • Needs tests

    The change is currently missing an automated test that fails when run with the original code, and succeeds when the bug has been fixed.


Comments & Activities

  • Issue created by @smokris
  • 🇺🇸United States smokris Athens, Ohio, USA

    Early WIP patch attached.

  • Status changed to Needs review about 1 year ago
  • Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update about 1 year ago
    540 pass, 2 fail
  • 🇺🇸United States smokris Athens, Ohio, USA

    Updated patch attached.

  • 🇺🇸United States smokris Athens, Ohio, USA

    (Add another remaining task.)

  • Status changed to Needs work about 1 year ago
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Great feature request, thanks!
    There was already an issue for this (indexing arbitrary pages) a long time ago, but I cannot find it myself anymore. I think the main problem we were thinking about back then was security – it’s pretty much impossible to index reliable access information for arbitrary pages, so access checking doesn’t really work. (Or could at least only work as postprocessing for search results.)
    However, if the admin enters the paths to index themselves, and we include a big warning to only include publicly available pages there (or, more accurately, pages that can be accessed by everyone that will be able to access the search results pages), I guess that should be fine.
    Would still be interesting to see what we discussed there, but I guess that can’t be helped if I can’t find the issue anymore.

    What I was wondering about is whether sending an HTTP request is really the best way to obtain the page contents? Seems that, at the very least, this would depend on the site’s theme. Most should, at this point, be “nice” and put all main content into <main>, but I cannot believe this is universal. We might at least need to make that XPath query configurable.
    Executing the request internally would have seemed a more natural choice to me, but I guess we can see in the Rendered Item processor what a myriad of edge cases you run into there, when trying to render something in an unexpected context, so maybe the HTTP request really is the way to go.
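
    To make the idea of a configurable XPath query concrete, here is a minimal sketch in plain PHP. It is not taken from the attached patch: the function name extract_page_text() and the default query '//main' are assumptions for illustration, and the HTML is presumed to have been fetched already.

```php
<?php

/**
 * Extracts plain text from fetched HTML using a configurable XPath query.
 *
 * Hypothetical helper for illustration; not part of search_api or the patch.
 */
function extract_page_text(string $html, string $xpath_query = '//main'): string {
  $document = new DOMDocument();
  // Collect libxml warnings instead of printing them: the underlying HTML
  // parser may complain about HTML5 elements such as <main>.
  $previous = libxml_use_internal_errors(TRUE);
  // The XML declaration prefix is a common workaround that makes loadHTML()
  // treat the input as UTF-8.
  $document->loadHTML('<?xml encoding="UTF-8">' . $html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $nodes = (new DOMXPath($document))->query($xpath_query);
  if ($nodes === FALSE) {
    // The query itself was invalid.
    return '';
  }

  $parts = [];
  foreach ($nodes as $node) {
    // Collapse runs of whitespace so the indexed text stays compact.
    $parts[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
  }
  // Drop empty matches and join the rest, one match per line.
  return implode("\n", array_filter($parts, 'strlen'));
}
```

    A site whose theme does not use <main> could then pass a different query, for example extract_page_text($html, '//div[@id="content"]').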

    Another sticky point here is that, in its current form, this would also need to send those HTTP requests every time some of the pages are displayed as search results, which is of course unacceptable. (You could get around this using Solr, or some other backend that returns the indexed fields, but if we want to add this to the module the behavior must also be acceptable when using the database backend.) So, we might need to cache the indexed values (probably in a new cache bin), and clear that cache every time the pages are reindexed. (That way, we could also avoid reindexing if the values haven’t changed – e.g., with some hash of the contents.) Or, we could take the title value from the menu item and only use the HTTP request for viewItem(), so users would need to use the “Rendered item” processor if they want the HTML contents.
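
    The "hash of the contents" idea could look roughly like the following sketch. It is plain PHP under stated assumptions: find_changed_paths() is a hypothetical name, and the $hashes array is only a stand-in for whatever storage (such as the new cache bin suggested above) would hold the hashes between runs.

```php
<?php

/**
 * Returns the paths whose fetched content differs from the stored hash.
 *
 * Hypothetical helper for illustration; not part of search_api or the patch.
 *
 * @param array $fetched
 *   Freshly fetched page contents, keyed by path.
 * @param array $hashes
 *   Hashes from the previous run, keyed by path. Updated in place; stands in
 *   for persistent storage such as a cache bin.
 *
 * @return string[]
 *   The paths that are new or whose content changed.
 */
function find_changed_paths(array $fetched, array &$hashes): array {
  $changed = [];
  foreach ($fetched as $path => $content) {
    $hash = hash('sha256', $content);
    // A missing entry counts as changed, so new paths are indexed once.
    if (($hashes[$path] ?? NULL) !== $hash) {
      $hashes[$path] = $hash;
      $changed[] = $path;
    }
  }
  return $changed;
}
```

    Only the returned paths would then need to be marked for reindexing, for instance with $index->trackItemsUpdated(), instead of marking every page on each cron run.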

    Anyways, yes, I do think this is, in principle, fit to be included in the Search API. Please just tell me when it’s ready to review. I might also post about it somewhere to attract other testers, to make sure this works well for as many sites as possible.

    When the site-builder updates the list of page paths, search_api should call the datasource's getItemIds method to get an updated list of items that need to be indexed. It doesn't automatically do this, and I haven't yet figured out how to convince it to do so. In the submitConfigurationForm method, I tried adding $this->index->rebuildTracker();, but that prevents the new form values from being saved.

    Maybe compute the difference yourself in the form submit method and then manually call $index->trackItemsInserted()/$index->trackItemsDeleted() as appropriate?
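
    A minimal sketch of that difference computation, in plain PHP so that it runs outside Drupal. The function name diff_tracked_paths() is hypothetical, and it assumes the datasource uses the configured paths directly as item IDs, which the patch may not do.

```php
<?php

/**
 * Computes which paths were added to and removed from the configured list.
 *
 * Hypothetical helper for illustration; not part of search_api or the patch.
 *
 * @param string[] $old_paths
 *   The paths stored in the configuration before the form was submitted.
 * @param string[] $new_paths
 *   The paths submitted with the form.
 *
 * @return array
 *   An array with the keys 'inserted' and 'deleted', each a list of paths.
 */
function diff_tracked_paths(array $old_paths, array $new_paths): array {
  return [
    // array_values() reindexes the results, since array_diff() keeps keys.
    'inserted' => array_values(array_diff($new_paths, $old_paths)),
    'deleted' => array_values(array_diff($old_paths, $new_paths)),
  ];
}
```

    In submitConfigurationForm, the two lists could then be handed to $index->trackItemsInserted() and $index->trackItemsDeleted() together with the datasource ID (the patch appears to use 'rendered_page'), which avoids rebuilding the whole tracker.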

    In any case, thanks again for working on this!

  • 🇬🇧United Kingdom aesuk

    Would this eventually cover the indexing of content in a header or footer of a view?

    Global Text Area is a field where most people describe/summarise the contents of the view.

    And so there are many use cases where that needs to be indexed. I see people searching for those keywords in our analytics. But for various reasons that content is not on the nodes themselves, only in the header of the view.

    E.g., if I searched loosely for running shoes, sure, I would want to see all the shoes in the search, but I would actually want the main category page to show up in first place on that search.

  • 🇦🇹Austria drunken monkey Vienna, Austria

    @aesuk: If the view is included in the <main> section of the page, then yes, that would be included. (As would the view’s contents.)

  • 🇬🇧United Kingdom aesuk

    ok thanks... we can make sure our templates have that in place
