indexing external entities - tracking locks up

Created on 14 September 2023, 10 months ago
Updated 3 May 2024, about 2 months ago

Problem/Motivation

Whether using search_api_solr with a Solr index or search_api with a database index, I am always unable to complete tracking. Search API reports:
Not all items have been tracked for this index. This means the displayed index status is incomplete and not all items will currently be indexed.


Mitigation/Solution outlined here:

#3387321-7: indexing external entities - tracking locks up →

What would likely solve this is a working pager with an item limit per page. However, with the JSON array configuration, the patches mentioned for multilingual support, and the versions I used, I was so far unable to get a paging approach to work, and it was not mentioned in the patch author's documentation.
Maybe this situation has changed with the latest release?

Steps to reproduce

My original plan was to use external_entities exactly as described by @sylus in comment #4. However, despite having Solr-indexed content and following the steps described here →, no matter what I do, I have so far been unable to see results generated using the Solr-indexed values only. I haven't given up yet, but it's been a grind: three days learning this ecosystem of modules, getting some results, but hitting a wall.

Video tutorial (I pretty much did this, using Views to provide the JSON data on the remote site, and I added the author's patches):
https://www.youtube.com/watch?v=tHGc6AdLzs4

Corresponding PDF: https://www.drupaleurope.org/sites/default/files/slides/2018-09/How%20to...

I originally tried tagged releases, then switched to these; similar issues either way.

Releases used, all updated:

    "drupal/search_api": "dev-1.x",
    "drupal/search_api_autocomplete": "dev-1.x#925a16b1",
    "drupal/search_api_page": "^1.0",
    "drupal/search_api_solr": "dev-4.x",
    "drupal/external_entities": "^2.0@alpha",

With:
Drupal 9.5.x
PHP 8.1
Solr 8.11

Patches used:

    "drupal/external_entities": {
      "3376591 - Field mappings form section doesn't show a form element": "https://git.drupalcode.org/project/external_entities/-/merge_requests/25.patch",
      "3376604 - missing @return in docs": "https://www.drupal.org/files/issues/2023-08-10/3376604-9.patch",
      "2998391 - signel item from remote drupal views rest is wrapped in an array by default, add option to handle this": "https://www.drupal.org/files/issues/2019-07-03/2988391-3-views_rest_support.patch"
    },

The issue I am now focusing on is why the "Tracking" procedure doesn't complete. It still allows indexing despite tracking not completing (there's an error; I'll debug that again after changing approaches a few times). With that said, my views of the indexed data aren't producing results.

I wonder if there's a recent regression in external_entities or search_api/search_api_solr that is causing problems retrieving content from Solr.

I've indexed, confirmed the content is indexed, re-indexed many times, reviewed the configuration, and re-consulted the documentation.

Three days so far on this.

I keep running into this persistent message:
Not all items have been tracked for this index. This means the displayed index status is incomplete and not all items will currently be indexed.

Proposed resolution

TBD

Remaining tasks

TBD

User interface changes

TBD

API changes

TBD

Data model changes

TBD

📌 Task
Status

Needs review

Version

2.0

Component

Documentation

Created by

🇨🇦Canada joseph.olstad


Comments & Activities

  • Issue created by @joseph.olstad
  • 🇨🇦Canada joseph.olstad

    There may be something else at play here; I may have to spin up a vanilla Drupal to test. The content is being indexed, and I have confirmed it is present. With that said, I'm not sure what the tracking function does or why it doesn't complete with either a Solr index or a database index. search_api provides the tracking function, but with external entities it never completes tracking and consumes enormous CPU resources.
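
    For reference, a quick way to inspect the tracker state (a sketch, assuming a hypothetical index machine name of my_index; method names taken from Search API's TrackerInterface — run it via drush php:eval or similar):

        // Sketch: dump Search API tracker counts for a hypothetical index
        // "my_index". The tracker records which items still need indexing;
        // if the total never reaches the source's item count, tracking did
        // not complete.
        $index = \Drupal\search_api\Entity\Index::load('my_index');
        $tracker = $index->getTrackerInstance();
        var_dump(
          $tracker->getTotalItemsCount(),
          $tracker->getIndexedItemsCount(),
          $tracker->getRemainingItemsCount()
        );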

  • 🇨🇦Canada joseph.olstad

    Attempting to create a Solr document index.

    following this doc
    https://www.drupal.org/docs/8/modules/search-api-solr/search-api-solr-ho... →

    I'll review the steps; meanwhile, this is the error I hit:

    The website encountered an unexpected error. Please try again later.
    TypeError: call_user_func(): Argument #1 ($callback) must be a valid callback, no array or string given in call_user_func() (line 275 of core/lib/Drupal/Core/Render/Element/MachineName.php).
    
    call_user_func() (Line: 275)
    Drupal\Core\Render\Element\MachineName::validateMachineName()
    call_user_func_array() (Line: 282)
    Drupal\Core\Form\FormValidator->doValidateForm() (Line: 238)
    Drupal\Core\Form\FormValidator->doValidateForm() (Line: 238)
    Drupal\Core\Form\FormValidator->doValidateForm() (Line: 238)
    Drupal\Core\Form\FormValidator->doValidateForm() (Line: 118)
    Drupal\Core\Form\FormValidator->validateForm() (Line: 591)
    Drupal\Core\Form\FormBuilder->processForm() (Line: 323)
    Drupal\Core\Form\FormBuilder->buildForm() (Line: 73)
    Drupal\Core\Controller\FormController->getContentResult() (Line: 39)
    Drupal\layout_builder\Controller\LayoutBuilderHtmlEntityFormController->getContentResult()
    call_user_func_array() (Line: 123)
    Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->Drupal\Core\EventSubscriber\{closure}() (Line: 580)
    Drupal\Core\Render\Renderer->executeInRenderContext() (Line: 124)
    Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->wrapControllerExecutionInRenderContext() (Line: 97)
    Drupal\Core\EventSubscriber\EarlyRenderingControllerWrapperSubscriber->Drupal\Core\EventSubscriber\{closure}() (Line: 169)
    Symfony\Component\HttpKernel\HttpKernel->handleRaw() (Line: 81)
    Symfony\Component\HttpKernel\HttpKernel->handle() (Line: 58)
    Drupal\Core\StackMiddleware\Session->handle() (Line: 48)
    Drupal\Core\StackMiddleware\KernelPreHandle->handle() (Line: 106)
    Drupal\page_cache\StackMiddleware\PageCache->pass() (Line: 85)
    Drupal\page_cache\StackMiddleware\PageCache->handle() (Line: 48)
    Drupal\Core\StackMiddleware\ReverseProxyMiddleware->handle() (Line: 51)
    Drupal\Core\StackMiddleware\NegotiationMiddleware->handle() (Line: 23)
    Stack\StackedHttpKernel->handle() (Line: 718)
    Drupal\Core\DrupalKernel->handle() (Line: 19)
    
  • 🇨🇦Canada joseph.olstad

    Still evaluating; however, this might actually be fixed by patch #11 here:
    #2998391-11: Add option to handle json wrapped in array - Repatch for version 8.x-2.0-alpha2 →

  • 🇨🇦Canada joseph.olstad

    OK, I've figured out a way to mitigate this: you have to go into your Search API server settings and crank up all of the timeout settings.

    Increase these default values:

    Query timeout: from 5 to 30
    Index timeout: from 5 to 30
    Optimize timeout: from 10 to 60
    Finalize timeout: from 30 to 120
    Commit within: from 1000 to 30000

    After I did this, the tracking in search_api was able to handle whatever external_entities was throwing at it, building tracking information for approximately 300 external entities.
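
    If you want that change deployable rather than clicked through the UI, the same values can be overridden from settings.php. A minimal sketch, assuming a hypothetical server machine name of my_solr_server; verify the exact key structure against your exported search_api.server.*.yml before relying on it:

        // In settings.php — hypothetical override; the server machine name
        // "my_solr_server" is a placeholder. In exports I've seen, the
        // connector timeouts live under backend_config.connector_config,
        // while commit_within sits at the backend_config level.
        $server = 'search_api.server.my_solr_server';
        $config[$server]['backend_config']['connector_config']['timeout'] = 30;           // Query timeout.
        $config[$server]['backend_config']['connector_config']['index_timeout'] = 30;     // Index timeout.
        $config[$server]['backend_config']['connector_config']['optimize_timeout'] = 60;  // Optimize timeout.
        $config[$server]['backend_config']['connector_config']['finalize_timeout'] = 120; // Finalize timeout.
        $config[$server]['backend_config']['commit_within'] = 30000;                      // Commit within (ms).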

  • Status changed to Needs review 5 months ago
  • 🇳🇱Netherlands pefferen

    Paging is indeed very important to take into account, as mentioned in #2. The fact that increasing the timeouts helps gives me an indication that this could be connected, as more than the requested number of records are fetched from the API.
    The other day I had this issue when I wanted to re-track the items in Search API. When I configured the correct pagination properties, the issue was solved.

  • 🇨🇦Canada joseph.olstad

    For sure, paging should help. I wasn't sure how to set that up properly; is there some documentation to this effect?
    How does this affect the mapping?

    I configured my JSON endpoint similarly to this:
    https://youtu.be/tHGc6AdLzs4

    and used an array patch for "map fields wrapped in object curly braces".

  • 🇨🇦Canada joseph.olstad

    OK, I reviewed a bit more: the rebuild of tracking is timing out.

    I set up a pager (see the sketch at the end of this comment):
    page variable: page
    offset variable: offset
    items per page variable: items_per_page

    I set the view in Drupal to 50 items per page, exposed the option to change it, and enabled the full pager.

    I'll review this again soon.

    ✨ Add option to handle json wrapped in array Active

    The external data source is Drupal 8; the consumer is Drupal 10 with the external_entities module.
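
    With the pager variables above, a single page request to the source view would look something like this (a sketch; the host and path are placeholders, only the parameter names come from the pager setup):

        // Hypothetical paged request to the source view. Only the parameter
        // names (page, offset, items_per_page) come from the configuration
        // described above; the URL itself is made up.
        $params = [
          'page' => 0,            // First page.
          'offset' => 0,          // No extra offset.
          'items_per_page' => 50, // Matches the view's 50 items per page.
        ];
        $url = 'https://source.example.com/api/my-view?' . http_build_query($params);
        // => https://source.example.com/api/my-view?page=0&offset=0&items_per_page=50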

  • 🇨🇦Canada joseph.olstad

    For some reason the page configuration didn't work for me using the mentioned approaches with the JSON array support patch.

  • 🇮🇳India manikandank03 Tamil Nadu

    I am also facing the exact same issue while using this beta version on Drupal 10.2.5; the search index rebuild is not working anymore.

  • 🇨🇦Canada joseph.olstad

    Increasing the timeouts helped, but it isn't the solution. Getting the paging to work would be a first step, but I'm using the array approach mentioned in:
    ✨ Add option to handle json wrapped in array Active

  • 🇳🇱Netherlands pefferen

    @josepholstad, did you try debugging the storage plugin to see what it is trying to fetch? It can be useful to test the queries sent to the server, for example in Postman or the browser, to see what parameters you need to configure. Hope this helps.
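
    For example, from the consuming site you can replay the request with Drupal's built-in Guzzle client (a sketch with a placeholder URL; run it via drush php:eval or similar):

        // Sketch: fetch one page of the remote view and dump the decoded
        // JSON, to confirm the pager parameters are honored server-side.
        // The URL is a placeholder.
        $response = \Drupal::httpClient()->get(
          'https://source.example.com/api/my-view',
          ['query' => ['page' => 0, 'items_per_page' => 50]]
        );
        $items = json_decode((string) $response->getBody(), TRUE);
        var_dump(count($items)); // Should be at most 50 if paging works.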

  • 🇨🇦Canada joseph.olstad

    It was fetching all 260 entities in one shot, with no paging; I was unable to get the paging configuration to work with the JSON array approach provided by:
    ✨ Add option to handle json wrapped in array Active

    I was hoping to eventually find some more documentation on connecting an external Drupal view to an external_entities client with a pager, or on alternative approaches.

    The language patch is also required; otherwise advanced text processing is disabled by Solr. Solr does not handle an undefined language in a very elegant way. We have more than one language, but this also affects English.

  • 🇨🇦Canada joseph.olstad

    If the batch were 20 external entities at a time, it wouldn't time out. My endpoint has 260 entities.

    It is working in a production environment, but the tracking times out and fails. My production environment has limited memory and a throttled CPU and resources.

    The endpoint sits behind Varnish, so the 260 JSON records are actually served from the Varnish cache; with that said, it's a heavy load without a working page size.

  • 🇨🇦Canada joseph.olstad

    Before I start debugging the storage plugin, it'd be nice to figure out why the paging setup wasn't working with the array approach.
