pulled dependencies don't participate in the batch, risk of PHP execution timeouts

Created on 1 August 2025

Problem/Motivation

When the ImportService imports a set of entities or a channel, it creates a batch with a single operation which has the URL for the JSONAPI data on the server -- either the whole channel, or with the list of UUIDs as a filter.

The Batch API then runs this, and the operation callback, ImportBatchHelper::importUrlBatch(), fetches the JSONAPI URL on its first run and stores the whole of the data in the batch sandbox.

This data is then sliced through in ImportBatchHelper::importUrlBatch() according to the batch size parameter.

The problem is that if a particular entity also imports other entities -- from references, embedded entities, etc. -- and those in turn import more entities, this is all done in a single iteration of the batch. This can cause PHP execution timeouts, because it can quite easily take over 10 minutes.
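For orientation, the general shape of that batch is something like the following (a minimal sketch, not the module's actual code; the operation arguments and the `$url` variable are illustrative):

    // One batch, one operation: the callback receives the JSONAPI URL (the
    // whole channel, or filtered by a list of UUIDs) and handles everything
    // else itself via the batch sandbox.
    $batch = [
      'title' => t('Importing entities'),
      'operations' => [
        [[ImportBatchHelper::class, 'importUrlBatch'], [$url]],
      ],
    ];
    batch_set($batch);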

Steps to reproduce

Proposed resolution

We need to find a way to have nested entities also participate in the batch.

I wonder whether, with Drupal now using a queue for batches, we can manipulate that queue directly to add items.

However, the nested entities need to be imported before the parent, so that IDs can be translated correctly. I actually wrote something that does this way back in the Drupal 6 days: https://git.drupalcode.org/project/transport/-/blob/6.x-1.x/transport.co... -- it has a concept of storing an entity with dependencies in a paused state, and returning to it once the nested entities have been imported.

Given we have the RuntimeImportContext we're passing around, we could put things in that.

Remaining tasks

User interface changes

API changes

Data model changes

๐Ÿ› Bug report
Status

Active

Version

4.0

Component

Entity Share Client

Created by

🇬🇧 United Kingdom joachim


Comments & Activities

  • Issue created by @joachim
  • 🇬🇧 United Kingdom joachim

    I think I can see a way to make this work, with some provisos...

    Suppose when an entity is being run through the import processor pipeline by ImportService, some of the processors (such as links, reference fields, embedded links) find it has links/nested entities. Let's call the main entity A1, and currently the list of entities to process is this, where the * indicates the current item being processed:

    > *A1, X, Y, Z

    The ImportService collects the list of dependencies. Let's call them B1, B2, B3. It then manipulates the batch queue, so it looks like this:

    > *A1, B1, B2, B3, A1, X, Y, Z

    It then stops processing A1, allowing the batch/queue system to advance to B1.

    If B1 has links/nested entities C1, C2, it does the same, so the queue is:

    > *B1, C1, C2, B1, B2, B3, A1, X, Y, Z

    The idea is that we stop processing the parent entity, import the dependencies first, then re-encounter the parent entity in the queue.

    We obviously still need to respect $entitiesMarkedForImport, which prevents infinite loops of link following.

    The provisos we need for this all to work are:

    1. We need to be able to add items to the front of the batch queue. I don't know how yet. We can override the queue class in the batch definition, though we may not need to, as the two queue classes used by BatchAPI say they use FIFO ordering in the docs -- and that's what we want. The other problem is how to access the queue, but since we can set the queue name in the batch definition, we can maybe just reach into it with the queue API.

    2. We need to split up our batch operations to be atomic. Currently, both whole-channel and selected UUID pulls pass a URL to the batch operation callback, and that fetches the whole JSONAPI data and then slices through it in repeated calls to the same operation callback:

        if (empty($context['sandbox'])) {
          $response = $import_service->jsonApiRequest('GET', $url);
          $json = Json::decode((string) $response->getBody());
          $entity_list_data = EntityShareUtility::prepareData($json['data']);
          $context['sandbox']['entity_list_data'] = $entity_list_data;
    
          $context['sandbox']['progress'] = 0;
          $context['sandbox']['max'] = count($entity_list_data);
          $context['sandbox']['batch_size'] = \Drupal::getContainer()->getParameter('entity_share_client.batch_size');
        }
        if (!isset($context['results']['imported_entity_ids'])) {
          $context['results']['imported_entity_ids'] = [];
        }
    
        $sub_data = array_slice($context['sandbox']['entity_list_data'], $context['sandbox']['progress'], $context['sandbox']['batch_size']);
    

    This won't allow what we want to do, as we potentially want to insert new items between existing ones. We need to do only one entity per operation, and I think for clarity and simplicity we need to remove the entity list data stuff and give each batch operation its own JSONAPI URL (see the sketch after this list).

    3. Tests! PullKernelTestBase currently has very naive handling of running the batch. This will need to be improved.
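    For contrast, here is a minimal sketch of the "one entity per operation" shape mentioned in point 2 (the callback name and arguments are hypothetical):

        // Hypothetical: one batch operation per entity, each with its own
        // JSONAPI individual-entity URL, so that operations for dependencies
        // can later be slotted in between them.
        $operations = [];
        foreach ($entity_urls as $url) {
          $operations[] = [[ImportBatchHelper::class, 'importEntityBatch'], [$url]];
        }
        batch_set([
          'title' => t('Importing entities'),
          'operations' => $operations,
        ]);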

  • 🇬🇧 United Kingdom joachim

    Hacking extra items into the batch is possible, like this:

        // Inside a batch operation callback:
        $batch = &batch_get();

        // Get the name for the batch queue.
        // @see _batch_populate_queue()
        // We expect there is only one set.
        $queue_name = 'drupal_batch:' . $batch['id'] . ':' . 0;
        $queueFactory = \Drupal::service('queue');

        /** @var \Drupal\Core\Queue\QueueInterface $queue */
        $queue = $queueFactory->get($queue_name);
        $queue->createItem(
          [static::class . '::operation', [99]],
        );
        // Hack the count and the total for our batch set, so that
        // _batch_process() processes the additional queue item.
        $batch['sets'][0]['total']++;
        $batch['sets'][0]['count']++;
    
  • 🇬🇧 United Kingdom joachim

    Or alternatively....

    We still split the initial batch into one operation per requested entity.

    But instead of making the batch operation parameter a URL or some entity data, we introduce an entity pull request value object.

    This gets passed through the processor plugins, and if there are references, those get added to it. This creates a stack of requests in the EPR.

    The import service can then mark the EPR as being incomplete.

    The batch operation callback can then report to BatchAPI that it is incomplete, and so it keeps getting called until the EPR's stack is all done.
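    As a rough sketch of that shape (the EntityPullRequest class, the helper class and their methods are all hypothetical, and the EPR is kept in the batch sandbox so its state persists between passes of the same operation):

        // Hypothetical value object holding a stack of JSONAPI URLs still to pull.
        final class EntityPullRequest {

          /**
           * @var string[]
           */
          protected array $stack = [];

          public function push(string $url): void {
            // Dependencies go on top of the stack, so they are pulled first.
            array_unshift($this->stack, $url);
          }

          public function pop(): ?string {
            return array_shift($this->stack);
          }

          public function isComplete(): bool {
            return empty($this->stack);
          }

        }

        // Hypothetical batch operation callback: handle one URL per pass and
        // report incomplete to Batch API while the EPR still has work to do.
        final class PullRequestBatchHelper {

          public static function importPullRequestBatch(EntityPullRequest $epr, array &$context): void {
            if (empty($context['sandbox']['epr'])) {
              $context['sandbox']['epr'] = $epr;
            }
            $epr = $context['sandbox']['epr'];
            $url = $epr->pop();
            // ... fetch and import $url here; processor plugins would push()
            // any references they discover onto the EPR ...
            $context['finished'] = $epr->isComplete() ? 1 : 0;
          }

        }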

  • 🇮🇱 Israel amitaibu

    > we can maybe just reach into it with the queue API.

    I think that's the way to go. The advantage is shifting the concept from "I process it right now", to:

    1. Gather the dependencies in order, and queue them in order.
    2. Now they can be processed one by one, however long that takes.
    3. If something changes in the entities while they are still being processed, we can re-queue and re-process them.

    This means that debugging and logging become slightly easier, as you'd be dealing with one entity at a time, rather than a big tree of entities.

    I would also recommend relying on https://www.drupal.org/project/advancedqueue. Having completed queue items stay forever (or for a long period) is great for auditing. Also, the queue items are entities and thus fieldable, which allows adding metadata on top of them.
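    For illustration, a job type plugin for Advanced Queue could look roughly like this (a sketch only: it assumes Advanced Queue's JobTypeBase/Job/JobResult API, and the plugin ID, namespace, payload keys and import call are all hypothetical):

        namespace Drupal\my_module\Plugin\AdvancedQueue\JobType;

        use Drupal\advancedqueue\Job;
        use Drupal\advancedqueue\JobResult;
        use Drupal\advancedqueue\Plugin\AdvancedQueue\JobType\JobTypeBase;

        /**
         * Pulls a single entity per job.
         *
         * @AdvancedQueueJobType(
         *   id = "entity_share_pull_entity",
         *   label = @Translation("Entity Share: pull one entity"),
         * )
         */
        class PullEntity extends JobTypeBase {

          /**
           * {@inheritdoc}
           */
          public function process(Job $job) {
            $payload = $job->getPayload();
            try {
              // Hypothetical method: import a single entity from its JSONAPI
              // URL. The real ImportService API differs.
              \Drupal::service('entity_share_client.import_service')
                ->importFromUrl($payload['url']);
              return JobResult::success();
            }
            catch (\Exception $e) {
              // Report failure so the queue can retry later.
              return JobResult::failure($e->getMessage());
            }
          }

        }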

  • 🇬🇧 United Kingdom joachim

    Thanks for your comment!

    At first I didn't follow what you meant at all, as I was thinking, surely the gathering of dependencies and the fetching of data are completely inseparable, since you have to make the JSONAPI request to know what the dependencies are!

    But then I had a lightbulb moment, and yes, I think doing it the way you suggest is going to be better than the route I was going down (which involved an ImportRequest object which would persist in the batch operation, and which would handle the recursion, and so could keep telling the current batch operation that it wasn't done yet).

    What we need to do then is this: we have 2 or 3 lists of operations to work through:

    1. optional, only if importing a channel rather than a list of UUIDs: populate a queue of 1 item per entity to import. So the batch operation works as it currently does: advances the pager and reports it's incomplete until the pager has reached the end. But instead of fetching entities, it merely puts each single entity into a list.

    2. work the list of entities to import. For each entity, analyse dependencies. Any dependencies get added to the END of the current queue. Again, this is a single batch operation, which reports that it's incomplete while queue items remain.

    3. work the list of entities to import, this time doing the actual processing and saving of data. BUT it has to work the list BACKWARDS -- doing the dependencies first.

    This is where I'm not sure whether queues are the right thing here. Because we work the same list in both steps 2 and 3 -- almost?!? I'm wondering whether we'd be better off using a custom table, which we don't destroy when we take items from it, so that we can work the same items in both steps 2 and 3.

    One consideration is dependencies we see several times. We don't need step 2 to process them multiple times, BUT if a 2nd dependent sees them, we need them to move after that dependent in the list. E.g. we have A, A1, A2, B and B reports A1 as a dependency also. Then we need the order to be: A, A2, B, A1, because step 3 needs A1 to be imported before both A *and* B are.

    But if we naively do that, then we're potentially moving A1 to be BEFORE its own dependencies! Suppose we had: A, A1, A1A, A2, B. If we move A1 to the end, then it's getting imported before A1A, which is no good. Urgh, dependencies!

    I wonder if the naive approach is just to REPROCESS A1's dependencies because it's now at the end of the queue, so we'd end up with: A, A2, B, A1, A1A. But that's going to cause repeat processing of A1's data!

  • 🇬🇧 United Kingdom joachim

    Hmm yes, queue is maybe wrong, because we keep relying on order. What we actually have is a hierarchy of data dependency -- albeit one with multiple parents, and **potentially circularity too**.

    I wonder if with a custom data table instead, we could store the hierarchy data in a field, and then rely on that for our ordering.

  • 🇮🇱 Israel amitaibu

    That's an interesting challenge!

    Short recap

    • We should move away from one big batch and break the work into small tasks.
    • Tasks can fail. We need safe retries for the same action.

    Possible flow

    Assuming we start with one "parent" entity:

    1. Put entity ID 1 into a queue named items_to_fetch_info.
    2. Get entity ID 1 from the queue and call JSON:API to fetch the entity and all related references. At this stage we only care about entity types and IDs. No processing yet.
    3. Build the tree of related entities and flatten it into a list ordered bottom-up. Grandchildren first, then children, then the parent. (See the sketch after this list.)
    4. Queue that ordered list into a process_items queue.
    5. Store a parent reference on each queue item, so items know which item they depend on. This can live in the payload or in a dedicated field if you extend the queue item entity.
    6. Process one entity at a time. If it fails, let the queue retry later with backoff.
    7. Once the parent entity is processed, the whole unit is done.
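    As a sketch of step 3, here is a hypothetical helper that flattens the tree bottom-up (assuming the tree is an adjacency list of UUID => child UUIDs; the visited set guards against shared or circular references):

        /**
         * Flattens a dependency tree bottom-up: grandchildren, children, parent.
         *
         * @param string $uuid
         *   The UUID to start from (the "parent").
         * @param array $tree
         *   Adjacency list: UUID => array of child UUIDs.
         * @param array $visited
         *   Internal: UUIDs already emitted, to handle shared/circular references.
         *
         * @return string[]
         *   UUIDs in processing order, dependencies first.
         */
        function flatten_bottom_up(string $uuid, array $tree, array &$visited = []): array {
          if (isset($visited[$uuid])) {
            return [];
          }
          $visited[$uuid] = TRUE;
          $order = [];
          foreach ($tree[$uuid] ?? [] as $child_uuid) {
            $order = array_merge($order, flatten_bottom_up($child_uuid, $tree, $visited));
          }
          // Post-order: the entity itself comes after all of its dependencies.
          $order[] = $uuid;
          return $order;
        }

        // Example: A depends on A1 and A2, A1 depends on A1A.
        // flatten_bottom_up('A', ['A' => ['A1', 'A2'], 'A1' => ['A1A']])
        // returns ['A1A', 'A1', 'A2', 'A'].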

    Advanced Queue advantage

    We can query queue items with EntityQuery and apply your own logic. Example: do not fetch a child for processing until its parent is completed.

  • 🇬🇧 United Kingdom joachim

    I was halfway through writing this when you posted your comment, so I'm going to post it anyway, then consider your points above.

    I'm definitely leaning against a queue, because I think we need to store hierarchy data.

    I've had a brief look at various ways to store hierarchy data in SQL. Unlike most of the examples I see, we have a need for multiple parentage, since one entity can have multiple dependencies, and one dependency can have multiple parents (consider two nodes which both use the same paragraph library item, or two nodes which both reference the same taxonomy term). We can also have circular references (two nodes which link to each other in the body text).

    We don't really have performance requirements, as we're not aiming for speed. We would like it to be fairly simple to change the graph, as while we gather dependencies we'll be constantly changing the tree structure.

    A closure table seems like the best thing for this (see https://www.slideshare.net/slideshow/models-for-hierarchical-data/4179181), with a column in the closure table for the distance.
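    For reference, a hypothetical closure table schema (the module and table names here are made up) might look like:

        /**
         * Implements hook_schema().
         */
        function mymodule_schema() {
          $schema['entity_share_dependency_closure'] = [
            'description' => 'Closure table of import dependencies.',
            'fields' => [
              'ancestor' => [
                'description' => 'UUID of the depending (ancestor) entity.',
                'type' => 'varchar_ascii',
                'length' => 128,
                'not null' => TRUE,
              ],
              'descendant' => [
                'description' => 'UUID of the dependency (descendant) entity.',
                'type' => 'varchar_ascii',
                'length' => 128,
                'not null' => TRUE,
              ],
              'distance' => [
                'description' => 'Number of edges between ancestor and descendant.',
                'type' => 'int',
                'unsigned' => TRUE,
                'not null' => TRUE,
                'default' => 0,
              ],
            ],
            'primary key' => ['ancestor', 'descendant'],
          ];
          return $schema;
        }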

    Even without any circularity though, ordering is going to be a complex thing because of the multiple hierarchy:

    e.g. A requests A1, A2. A1 requests A1A, A1B. A1A requests C. A2 also requests C. How do we know when to do C, A1A, A2?

    And potential circularity throws this all up in the air anyway.

    I'm thinking we need to do a *decent* attempt at ordering the tree -- if A is our initial entity, we order its dependencies by their distance from A, in the hope that will be good enough for getting dependencies in first.

    But each time we save an entity, we check the tree for its OWN dependencies, and check whether they are all saved already or not. If not, then we save this entity anyway, skipping the unsatisfied dependencies, and mark it as 'draft' in our metadata.

    We then have a 4th phase: after ALL entities have at least been saved in draft, and therefore have a local ID, we re-save all the draft entities (ideally without causing a 2nd revision) so that their reference field values are finalised.

    So the 4 phases are:

    1. Use the pull request parameters to populate a table of metadata about the entities to pull, 1 row per entity. This takes care of expanding either an entire channel, or a list of UUIDs. This phase can store JSONAPI data in the metadata if it has it available, so it doesn't need to be fetched from the remote again.

    2. Work through this table, fetching the data if necessary. Add an item to the table for each dependency, and work those too until all dependencies are done. Store the dependency hierarchy.

    3. Using the dependency hierarchy for a 'best guess' at order, save the data. If an entity doesn't have all its dependencies ready, mark it in our metadata to say it's in draft.

    4. Work through all the drafts and re-save them to get all the local IDs in place.

    Each phase will need a batch API operation which will re-run until it deems itself to be complete.
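    As a sketch of how the batch definition could tie the four phases together (the PullBatchHelper class and its callbacks are hypothetical):

        $batch = [
          'title' => t('Pulling entities'),
          'operations' => [
            // Each callback only sets $context['finished'] = 1 once its phase
            // is complete, so Batch API keeps re-calling it until then.
            [[PullBatchHelper::class, 'expandRequest'], []],
            [[PullBatchHelper::class, 'gatherDependencies'], []],
            [[PullBatchHelper::class, 'saveEntities'], []],
            [[PullBatchHelper::class, 'resaveDrafts'], []],
          ],
        ];
        batch_set($batch);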

  • 🇮🇱 Israel amitaibu

    > I was halfway through writing this when you posted your comment, so I'm going to post it anyway, then consider your points above.

    haha, no worries :)

    > This phase can store JSONAPI data in the metadata if it has it available, so it doesn't need to be fetched from the remote again.

    👍
