pulled dependencies don't participate in the batch, risk of PHP execution timeouts

Created on 1 August 2025

Problem/Motivation

When the ImportService imports a set of entities or a channel, it creates a batch with a single operation, which has the URL for the JSONAPI data on the server -- either for the whole channel, or with the list of UUIDs as a filter.

The Batch API then runs this; the operation callback, ImportBatchHelper::importUrlBatch(), fetches the JSONAPI URL on its first run and stores the whole of the data in the batch sandbox.

On subsequent runs, this data is sliced through in ImportBatchHelper::importUrlBatch(), according to the batch size parameter.

The problem is that if a particular entity also imports other entities -- from references, embedded entities, etc. -- and those in turn import further entities, this is all done in a single iteration of the batch. This can cause PHP execution timeouts, because one iteration can quite easily take over 10 minutes.
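As a rough sketch, the current batch shape looks something like this (parameter names are illustrative; the real operation arguments are built in ImportService):

```php
<?php
// Sketch (not the exact module code): the whole pull is a single batch
// operation, parameterised only by the JSONAPI URL. Batch API then
// invokes the callback repeatedly, and it slices through the fetched
// data in the sandbox until it reports completion.
$batch = [
  'title' => t('Importing entities'),
  'operations' => [
    [
      // One operation for the whole channel / UUID-filtered URL.
      [ImportBatchHelper::class, 'importUrlBatch'],
      [$url],
    ],
  ],
];
batch_set($batch);
```

Because there is only one operation, everything discovered during an entity's import (references, embedded entities) has to happen inside that one callback invocation.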

Steps to reproduce

Proposed resolution

We need to find a way to have nested entities also participate in the batch.

I wonder whether, now that Drupal uses a queue for batches, we can manipulate that queue directly to add items.

However, the nested entities need to be imported before the parent, so that IDs can be translated correctly. I actually wrote something that does this way back in the Drupal 6 days: https://git.drupalcode.org/project/transport/-/blob/6.x-1.x/transport.co... -- it has a concept of storing an entity with dependencies in a paused state, and returning to it once the nested entities have been imported.

Given we have the RuntimeImportContext we're passing around, we could put things in that.
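A minimal sketch of what that could look like (the methods shown here are hypothetical additions, not the current RuntimeImportContext API): the context tracks parents paused on their dependencies, and a parent becomes ready once all of its recorded dependencies have been imported.

```php
<?php
// Hypothetical additions to RuntimeImportContext: a registry of parent
// entities held in a paused state until their dependencies are imported.
class RuntimeImportContext {

  /**
   * Parents paused on dependencies: parent UUID => outstanding dep UUIDs.
   *
   * @var array
   */
  protected array $pausedEntities = [];

  /**
   * Records that $uuid cannot be saved until its dependencies exist.
   */
  public function pauseEntity(string $uuid, array $dependency_uuids): void {
    $this->pausedEntities[$uuid] = $dependency_uuids;
  }

  /**
   * Removes an imported entity from every paused parent's waiting list.
   */
  public function markDependencyImported(string $uuid): void {
    foreach ($this->pausedEntities as $parent_uuid => $dependency_uuids) {
      $this->pausedEntities[$parent_uuid] = array_diff($dependency_uuids, [$uuid]);
    }
  }

  /**
   * A parent is ready when it has no outstanding dependencies.
   */
  public function isReady(string $uuid): bool {
    return empty($this->pausedEntities[$uuid] ?? []);
  }

}
```

This mirrors the paused-state idea from the Drupal 6 transport module linked above, but kept in the per-run context object rather than in storage.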

Remaining tasks

User interface changes

API changes

Data model changes

🐛 Bug report
Status

Active

Version

4.0

Component

Entity Share Client

Created by

🇬🇧United Kingdom joachim


Comments & Activities

  • Issue created by @joachim
  • 🇬🇧United Kingdom joachim

    I think I can see a way to make this work, with some provisos...

    Suppose when an entity is being run through the import processor pipeline by ImportService, some of the processors (such as links, reference fields, embedded links) find it has links/nested entities. Let's call the main entity A1, and currently the list of entities to process is this, where the * indicates the current item being processed:

    > *A1, X, Y, Z

    The ImportService collects the list of dependencies. Let's call them B1, B2, B3. It then manipulates the batch queue, so it looks like this:

    > *A1, B1, B2, B3, A1, X, Y, Z

    It then stops processing A1, allowing the batch/queue system to advance to B1.

    If B1 has links/nested entities C1, C2, it does the same, so the queue is:

    > *B1, C1, C2, B1, B2, B3, A1, X, Y, Z

    The idea is that we stop processing the parent entity, import the dependencies first, then re-encounter the parent entity in the queue.

    We obviously still need to respect $entitiesMarkedForImport, which prevents infinite loops of link following.
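Setting the Batch API aside for a moment, the queue manipulation described above can be sketched as plain array handling (function name hypothetical):

```php
<?php
/**
 * Sketch of the queue transformation described above.
 *
 * When the item at the front of the queue has dependencies, the
 * dependencies are inserted in front of a re-queued copy of the item
 * itself, so processing advances through the dependencies and then
 * re-encounters the parent.
 */
function requeueWithDependencies(array $queue, array $dependencies): array {
  // $queue[0] is the item currently being processed (e.g. A1); we stop
  // processing it now and let the queue advance to its first dependency.
  $current = $queue[0];
  $rest = array_slice($queue, 1);
  return array_merge([$current], $dependencies, [$current], $rest);
}
```

Applied once for A1 and once again for B1, this reproduces the two queue states quoted above.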

    The provisos we need for this all to work are:

1. We need to be able to add items to the front of the batch queue. I don't know how yet. We can override the queue class in the batch definition, though we may not need to: the docs for the two queue classes used by the Batch API say they use FIFO ordering, which is what we want. The other problem is how to access the queue, but since we can set the queue name in the batch definition, we can maybe just reach into it with the Queue API.

    2. We need to split up our batch operations to be atomic. Currently, both whole-channel and selected UUID pulls pass a URL to the batch operation callback, and that fetches the whole JSONAPI data and then slices through it in repeated calls to the same operation callback:

        if (empty($context['sandbox'])) {
          $response = $import_service->jsonApiRequest('GET', $url);
          $json = Json::decode((string) $response->getBody());
          $entity_list_data = EntityShareUtility::prepareData($json['data']);
          $context['sandbox']['entity_list_data'] = $entity_list_data;

          $context['sandbox']['progress'] = 0;
          $context['sandbox']['max'] = count($entity_list_data);
          $context['sandbox']['batch_size'] = \Drupal::getContainer()->getParameter('entity_share_client.batch_size');
        }
        if (!isset($context['results']['imported_entity_ids'])) {
          $context['results']['imported_entity_ids'] = [];
        }

        $sub_data = array_slice($context['sandbox']['entity_list_data'], $context['sandbox']['progress'], $context['sandbox']['batch_size']);

    This won't allow what we want to do, as we potentially need to insert new operations between existing items. We need to do only one entity per operation, and I think for clarity and simplicity we should remove the entity list data handling and give each batch operation its own JSONAPI URL.
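A hedged sketch of that atomic alternative (the individual-resource URL construction is illustrative only): one operation per entity, each with its own JSONAPI URL, so dependency operations can later be spliced in between them.

```php
<?php
// Sketch (assumed names): build one batch operation per entity instead
// of one operation that slices a shared list. Each operation carries
// only the JSONAPI URL for a single entity.
$operations = [];
foreach ($uuids as $uuid) {
  // e.g. https://example.com/jsonapi/node/article/<uuid>;
  // how the individual URL is built is an assumption here.
  $individual_url = $resource_url . '/' . $uuid;
  $operations[] = [
    [ImportBatchHelper::class, 'importUrlBatch'],
    [$individual_url],
  ];
}
batch_set([
  'title' => t('Importing entities'),
  'operations' => $operations,
]);
```

With one entity per operation there is no sandbox slicing left to do, and the queue of operations becomes a faithful model of the entity list.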

    3. Tests! PullKernelTestBase currently has very naive handling of running the batch. This will need to be improved.

  • 🇬🇧United Kingdom joachim

    Hacking extra items into the batch is possible, like this:

        // Inside a batch operation callback:
        $batch = &batch_get();

        // Get the name for the batch queue.
        // @see _batch_populate_queue()
        // We expect there is only one set.
        $queue_name = 'drupal_batch:' . $batch['id'] . ':' . 0;
        $queue_factory = \Drupal::service('queue');

        /** @var \Drupal\Core\Queue\QueueInterface $queue */
        $queue = $queue_factory->get($queue_name);
        // A batch queue item is a [callable, arguments] pair.
        $queue->createItem([static::class . '::operation', [99]]);

        // Hack the count and the total for our batch set, so that
        // _batch_process() processes the additional queue item.
        $batch['sets'][0]['total']++;
        $batch['sets'][0]['count']++;
    
  • 🇬🇧United Kingdom joachim

    Or alternatively....

    We still split the initial batch into one operation per requested entity.

    But instead of making the batch operation parameter a URL or some entity data, we introduce an entity pull request value object.

    This gets passed through the processor plugins, and if there are references, those get added to it. This creates a stack of requests in the EPR.

    The import service can then mark the EPR as being incomplete.

    The batch operation callback can then report to the Batch API that the operation is not finished, so it keeps getting called until the EPR's stack is empty.
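A minimal sketch of such a value object (class and method names are hypothetical): processor plugins push discovered references onto a stack, and the batch callback keeps the operation alive until the stack drains.

```php
<?php
/**
 * Hypothetical entity pull request (EPR) value object.
 *
 * Holds a stack of pending JSONAPI URLs. Processor plugins push
 * discovered references; the batch operation callback keeps reporting
 * itself unfinished until the stack is empty.
 */
class EntityPullRequest {

  /**
   * Stack of JSONAPI URLs still to be pulled.
   *
   * @var string[]
   */
  protected array $stack = [];

  public function __construct(string $initial_url) {
    $this->stack[] = $initial_url;
  }

  /**
   * Pushes a reference discovered during processing.
   */
  public function push(string $url): void {
    $this->stack[] = $url;
  }

  /**
   * Pops the next URL to pull; LIFO, so references come before parents.
   */
  public function pop(): ?string {
    return array_pop($this->stack);
  }

  public function isComplete(): bool {
    return empty($this->stack);
  }

}
```

The batch operation callback would then set `$context['finished'] = $epr->isComplete() ? 1 : 0;` -- the Batch API re-invokes any operation that reports a value below 1.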
