Add an entity iterator to load entities in chunks

Created on 30 September 2015, over 8 years ago
Updated 24 October 2023, 7 months ago

Problem/Motivation

If you need to load a lot of entities and iterate over them (pretty common), your memory usage could easily go through the roof.

For example, consider updating all nodes for some reason in a post_update hook. You want to load them all and save them, but loading all 1,000,000 nodes will eventually lead to an out-of-memory error on the machine. One common strategy to solve this is to load them in batches and clear the static memory cache in between.

This could also affect core directly: we may need to update content (or config) entities in the upgrade path, but have no way of knowing how many items a site will have. This could cause problems and easily exceed the PHP memory requirements we set to run Drupal.

Proposed resolution

Introduce an EntityIterator class that uses a generator to break the IDs to load into chunks. Then load just a set amount each time, clearing the static cache before each load so it only contains the items currently being iterated over.

We can get clever and use a generator. Note: We clear the memory cache using $memory_cache->deleteAll() so all referenced entities are removed from the cache as well.
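A minimal sketch of what such a class could look like (the class name comes from this issue; the chunk size default and exact shape are illustrative, while loadMultiple() and deleteAll() are the existing storage and memory-cache APIs):

```php
<?php

namespace Drupal\Core\Entity;

use Drupal\Core\Cache\MemoryCache\MemoryCacheInterface;

/**
 * Iterates over entities in chunks to keep memory usage bounded.
 */
class ChunkedIterator implements \IteratorAggregate {

  public function __construct(
    protected EntityStorageInterface $entityStorage,
    protected MemoryCacheInterface $memoryCache,
    protected array $ids,
    protected int $chunkSize = 50,
  ) {}

  public function getIterator(): \Generator {
    foreach (array_chunk($this->ids, $this->chunkSize) as $chunk) {
      // Drop all statically cached entities, including referenced ones,
      // so only the current chunk is held in memory.
      $this->memoryCache->deleteAll();
      yield from $this->entityStorage->loadMultiple($chunk);
    }
  }

}
```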

Remaining tasks

User interface changes

N/A - Developer only

API changes

Addition of new Iterator. Example usage:

 $iterator = new ChunkedIterator($entity_storage, \Drupal::service('entity.memory_cache'), $all_ids);
 foreach ($iterator as $entity) {
   // Process the entity.
 }

Data model changes

N/A

šŸ“Œ Task
Status

Needs work

Version

11.0

Component
Entity


Created by

šŸ‡¬šŸ‡§United Kingdom damiankloip

  • Performance

    It affects performance. It is often combined with the Needs profiling tag.

  • Needs change record

    A change record needs to be drafted before an issue is committed. Note: Change records used to be called change notifications.


Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • šŸ‡«šŸ‡®Finland lauriii Finland

    This should address the test failure.

  • šŸ‡ØšŸ‡³China jungle Chongqing, China
    +++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
    @@ -0,0 +1,56 @@
    +   * @param \Drupal\Core\Cache\MemoryCache\MemoryCacheInterface $memoryCache
    +   *   The memory cache.
    ...
    +  public function __construct(protected EntityStorageInterface $entityStorage, protected MemoryCacheInterface $memoryCache, array $ids, protected int $chunkSize = 50) {
    
Can't call array_values() on any iterable.
Don't think iterables are necessarily countable?

1. Comments on the MR were resolved, array $ids replaced iterable $ids, +1 to this change.

    2. Read most of the comments here. The cache was discussed a lot. As a dev to use the iterator, I really do not want to care about cache, as long as it works.

    I would use it like the following:

<?php
// For example, the entity type is node.
$iterator = new ChunkedIterator('node', $ids, $chunk_size);
foreach ($iterator as $loaded_entities) {
  // Do something with $loaded_entities.
}
?>

Instead of passing in $entity_storage, giving $entity_type is simpler. We can default $memoryCache to \Drupal::service('entity.memory_cache').

    Suggested change:

public function __construct(protected string $entity_type, array $ids, protected int $chunkSize = 50, protected ?MemoryCacheInterface $memoryCache = NULL, protected ?EntityStorageInterface $entityStorage = NULL) {
  if ($memoryCache === NULL) {
    $this->memoryCache = \Drupal::service('entity.memory_cache');
  }
  if ($entityStorage === NULL) {
    $this->entityStorage = \Drupal::entityTypeManager()->getStorage($entity_type);
  }
}

3. Needs work, for the missing change record at least.

  • Status changed to Needs work over 1 year ago
  • šŸ‡ØšŸ‡³China jungle Chongqing, China
  • šŸ‡©šŸ‡ŖGermany donquixote

    Good idea overall in this issue!

    Btw, sometimes we want to not just save memory, but also split the operation into multiple requests.
    That is, we want to pause somewhere and resume later.
    I think this mostly applies for content entities, probably never for config entities.

    An obvious way to do this would be paging. But this would break if entities are added or removed between requests.

    Instead, for entities with auto-increment ids, a solution can be to load them in reverse order of id, and remember the last processed id (or next id to be processed) between requests. This guarantees that all entities will be processed that existed when the operation was started.

    To do this, the generator method could accept an additional parameter to indicate a starting point (min/inf or max/sup id).
    Or it could be passed through an additional method (setter or wither).
    Or there could be a factory for the entity iterator.
    Or it could be something we add in the entity query classes.
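A rough sketch of the resumable, reverse-id approach with the existing entity query API (the node entity type is illustrative; $last_id would be persisted between requests, e.g. in the batch sandbox):

```php
<?php

// Load the next chunk of node IDs below the last processed ID,
// in descending order, so entities created mid-operation are
// never visited and deletions cannot shift the window.
$ids = \Drupal::entityQuery('node')
  ->accessCheck(FALSE)
  ->condition('nid', $last_id, '<')
  ->sort('nid', 'DESC')
  ->range(0, 50)
  ->execute();
```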

    ---

    Currently the proposed class can not easily be injected as a dependency or registered as a service, because the ids iterator has to be provided at construction time.
    Passing the ids iterator directly to the generator method would work, but then the class is no longer an IteratorAggregate, so no longer works in foreach() directly.
    Another option is to start with an empty list of ids, and use a wither method (= immutable setter that returns a modified clone) to set the ids. So the initial object would be technically complete, but useless, without the ids being added.
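Such a wither could look like this (a sketch; the method name is illustrative):

```php
public function withIds(array $ids): static {
  $clone = clone $this;
  $clone->ids = $ids;
  return $clone;
}
```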

    Also, for a generator, we don't _need_ to implement `\IteratorAggregate`, we can also have a generator method with a more custom signature. Of course then you can't just foreach($object as ..) it, you have to do foreach ($object->generatorMethod()).

  • šŸ‡«šŸ‡·France andypost

    what's left here except CR?

  • šŸ‡¬šŸ‡§United Kingdom joachim

I still think this needs to be tied closely to EntityQuery - arrays of IDs aren't scalable either.

    Also, instantiating the class needs to be made easier -- having to pass in a cache object isn't good DX.

  • It seems like @donquixote and @joachim are bringing out some valid points:

    - Naming
    - Developer experience
    - Genericness

    Generally I agree that the experience is not great. It feels like having some form of factory for that seems a good idea.
    A few places for that are:

    - Entity storage class (adding even more to those classes seems hard)
    - Entity repository (feels like a nice place) (createEntityIterable(iterable $ids): EntityIterable)
    - Tie it to an entity query, like a method ->executeAndLoadIterable()

My general gut feeling is that the entity repository is a good place: it balances flexibility with discoverability. We could then add something on the entity query as a follow-up on top of that.

  • šŸ‡©šŸ‡ŖGermany donquixote

    Generally I agree that the experience is not great. It feels like having some form of factory for that seems a good idea.

    We can use static methods to avoid polluting other interfaces.
    We just pass in the required services or objects to the static methods.
    I will try to come up with something.

  • šŸ‡©šŸ‡ŖGermany donquixote

    Correcting myself:

    Currently the proposed class can not easily be injected as a dependency or registered as a service, because the ids iterator has to be provided at construction time.

Actually this is not the case if the `iterable $ids` is an array or IteratorAggregate instead of an Iterator.
    Maybe we should be more strict about the parameter.
    We also then need something that can give us this list of ids, unless we want to use an array of ids always. This could be done in a follow-up.

  • So you are saying for now it would be better to:

- open a follow-up to accept an iterable
- accept only an array for now?

    Iā€™m excited to see what you come up with on the factory/initialisation side of things.

  • šŸ‡©šŸ‡ŖGermany donquixote

    I wonder: Can we restrict this functionality to content entities with integer ids?
    Usually for config entities it is fine to load them all at once.

    This would mean we can provide more restrictive @param and @return docs.
    E.g. @param list<int> $ids and @return \Iterator<int, ContentEntityInterface>

  • šŸ‡ØšŸ‡­Switzerland Berdir Switzerland

    Content entities are not limited to int IDs, you can have strings as well there.

  • šŸ‡©šŸ‡ŖGermany donquixote

    Content entities are not limited to int IDs, you can have strings as well there.

    Alright, good to know!
Then we might as well support all entities, including config; we would not gain anything from restricting it to just content entities.

    ------------------

    To another point (@joachim in #106, #108).

    Yes, my point is that when you load an array of LOTS of entity IDs, you can exhaust memory too. IIRC it was about 1-2 million.

    I talked with @dawehner whether we only need to limit the entity objects loaded at once, or also the entity ids.

    With millions of ids, the memory usage will still be survivable, though not negligible.
    The ids are keyed by revision id, so this is not a list but a numeric associative array.
With memory_get_usage() compared before/after, I find that an array filled with $array[$i * 2] = $i for $i from 0 to 999,999 takes up 33,554,496 bytes on one system, and 41,943,120 bytes on another.
    Besides the memory impact there is also a time cost, because Statement::fetchAllKeyed() uses a foreach() to fill the values.
    In addition, we have to assume there is memory usage in the database engine. I don't know if the entire result set will be in sql memory at once, though.

    I think we want to support the following cases:

    • A long-running operation that really wants to process all entities in one go. It wants to avoid having all entity objects in memory; a long list of ids is usually ok, unless we accumulate copies of this list in logs or other places if things go wrong repeatedly.
      It could still be good to load the ids in chunks of 1,000 or 10,000, just to be respectful to other operations that need memory.
    • A batch operation where in each step we process a limited number of entities. This could either be a fixed number, or the batch step could be timeboxed.
      I think most batch operations I have seen pre-compute the list of ids before starting the batch. Usually this means they will be all in memory at one point.
      Personally I think it is more elegant and scales better to use a query that can be resumed at a given offset, instead of relying on pre-computed ids. Here a chunking of ids can be useful, if the calling code does not already specify a range limit.
    • An operation that only needs the first n entities, and does not care about the rest.
      Normally this will already put a range limit on the entity query.
      But what if the end is determined dynamically? E.g. list products until the accumulated price reaches a given limit. Here chunking of ids can be useful, especially if we run this more than once in a request, for whichever reason. But it seems quite rare.

    So, it seems that, in most cases, it is ok or survivable to load all ids at once.
    But the chunking or incremental loading of ids is probably easy enough to do it anyway, when there is not already a range limit. There is a benefit at least in some cases.

    To do this properly, we should provide incremental loading of ids directly in the entity query.
    We cannot add this to the existing query interface, but we can add an extended interface for this purpose.
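One BC-friendly shape for such an extension (interface and method names are hypothetical, not part of core):

```php
<?php

namespace Drupal\Core\Entity\Query;

/**
 * Optional extension for entity queries that can stream result IDs.
 */
interface IterableQueryInterface extends QueryInterface {

  /**
   * Executes the query, yielding the result IDs in chunks.
   *
   * @return \Iterator<int, array>
   *   An iterator of ID arrays, each holding at most $chunkSize items.
   */
  public function iterateIdChunks(int $chunkSize = 1000): \Iterator;

}
```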

  • šŸ‡¬šŸ‡§United Kingdom joachim

    > So, it seems that, in most cases, it is ok or survivable to load all ids at once.

    That's not been my experience working on migrations -- see šŸ› Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale Postponed .

    $array = range(1, 4_000_000);
    

    consumes 70MB. An array up to 6M crashes my (admittedly small) 128MB memory limit.

    > To do this properly, we should provide incremental loading of ids directly in the entity query.
    > We cannot add this to the existing query interface, but we can add an extended interface for this purpose.

    What's the reason we can't add this to the query interface?

  • šŸ‡©šŸ‡ŖGermany donquixote

    consumes 70MB. An array up to 6M crashes my (admittedly small) 128MB memory limit.

    That's small indeed. And 6M is a lot. But both still in a realistic range.
    So this shows we really need to load the ids in increments or in chunks.

    What's the reason we can't add this to the query interface?

    If we add a method to an interface, this breaks existing implementations in 3rd party code.
    The BC-friendly way is to create a new interface extending the existing one, and make it optional to implement.

  • šŸ‡«šŸ‡·France andypost

    It could use SplQueue to get more optimized results

PS: there is also https://pecl.php.net/package/ds; see the notes at https://medium.com/@rtheunissen/efficient-data-structures-for-php-7-9dda...

  • šŸ‡©šŸ‡ŖGermany donquixote

    Here is a POC/preview branch with:
    - Proposed changes to the current branch
    - A new commit to add iterator functionality to the entity query, to iterate over ids.

    https://git.drupalcode.org/issue/drupal-2577417/-/compare/11.x...2577417...

    We can have a look at this, but I think we should split out the entity query ids iterator to a separate issue.

  • šŸ‡©šŸ‡ŖGermany donquixote

    A new commit to add iterator functionality to the entity query, to iterate over ids.

The test I added here is a big copy-paste, where ->execute() is replaced with ->getIterator() + iterator_to_array().
    We should do something more sophisticated to avoid the repetition. We should cover the iterator and array functionality with the same test code. For now I just wanted to prove that it works.

    The preview PR turns the sql entity query into an IteratorAggregate.
IteratorAggregate subtly implies a stateless object, where iterating over the values repeatedly produces the same result.
At first I was a bit skeptical whether this is the case, but it seems that it is.

  • šŸ‡¬šŸ‡§United Kingdom catch

    If we add a method to an interface, this breaks existing implementations in 3rd party code.

    We have a base class in this case, so it's allowable in a minor release with a change record.

    abstract class QueryBase implements QueryInterface 
    