Add an entity iterator to load entities in chunks

Created on 30 September 2015, over 8 years ago
Updated 24 October 2023, 7 months ago

Problem/Motivation

If you need to load a lot of entities and iterate over them (pretty common), your memory usage could easily go through the roof.

For example, consider updating all nodes for some reason in a post_update hook. You want to load them all and save them, but loading all 1,000,000 nodes will eventually lead to an out-of-memory error on the machine. One common strategy to solve this is to load them in batches and clear the static memory cache in between.

This could also affect core directly: we may need to update content (or config) entities in the upgrade path, but have no way of knowing how many items a site will have. This could cause problems and easily exceed the PHP memory requirements we set to run Drupal.

Proposed resolution

Introduce an EntityIterator class that uses a generator to break the IDs to load into chunks. Then load just a set amount each time, clearing the static cache before each load so it only contains the items currently being iterated over.

We can get clever and use a generator. Note: We clear the memory cache using $memory_cache->deleteAll() so all referenced entities are removed from the cache as well.
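A minimal sketch of what such a class could look like (the class name comes from this issue; the chunk size default and exact shape are illustrative, while loadMultiple() and deleteAll() are the existing storage and memory-cache APIs):

```php
<?php

namespace Drupal\Core\Entity;

use Drupal\Core\Cache\MemoryCache\MemoryCacheInterface;

/**
 * Iterates over entities in chunks to keep memory usage bounded.
 */
class ChunkedIterator implements \IteratorAggregate {

  public function __construct(
    protected EntityStorageInterface $entityStorage,
    protected MemoryCacheInterface $memoryCache,
    protected array $ids,
    protected int $chunkSize = 50,
  ) {}

  public function getIterator(): \Generator {
    foreach (array_chunk($this->ids, $this->chunkSize) as $chunk) {
      // Drop all statically cached entities, including referenced ones,
      // so only the current chunk is held in memory.
      $this->memoryCache->deleteAll();
      yield from $this->entityStorage->loadMultiple($chunk);
    }
  }

}
```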

Remaining tasks

User interface changes

N/A - Developer only

API changes

Addition of new Iterator. Example usage:

 $iterator = new ChunkedIterator($entity_storage, \Drupal::service('entity.memory_cache'), $all_ids);
 foreach ($iterator as $entity) {
   // Process the entity.
 }

Data model changes

N/A

šŸ“Œ Task
Status

Needs work

Version

11.0

Component
Entity


Created by

šŸ‡¬šŸ‡§United Kingdom damiankloip

  • Performance

    It affects performance. It is often combined with the Needs profiling tag.

  • Needs change record

    A change record needs to be drafted before an issue is committed. Note: Change records used to be called change notifications.


Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • šŸ‡«šŸ‡®Finland lauriii Finland

    This should address the test failure.

  • šŸ‡ØšŸ‡³China jungle Chongqing, China
    +++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
    @@ -0,0 +1,56 @@
    +   * @param \Drupal\Core\Cache\MemoryCache\MemoryCacheInterface $memoryCache
    +   *   The memory cache.
    ...
    +  public function __construct(protected EntityStorageInterface $entityStorage, protected MemoryCacheInterface $memoryCache, array $ids, protected int $chunkSize = 50) {
    
Can't call array_values() on any iterable.
Don't think iterables are necessarily countable?

1. Comments on the MR were resolved, array $ids replaced iterable $ids, +1 to this change.

    2. Read most of the comments here. The cache was discussed a lot. As a dev to use the iterator, I really do not want to care about cache, as long as it works.

    I would use it like the following:

<?php
// For example, the entity type is node.
$iterator = new ChunkedIterator('node', $ids, $chunk_size);
foreach ($iterator as $loaded_entities) {
  // Do something with $loaded_entities.
}
?>

Instead of passing in $entity_storage, giving $entity_type is simpler. We can default $memoryCache to \Drupal::service('entity.memory_cache').

    Suggested change:

public function __construct(protected string $entity_type, array $ids, protected int $chunkSize = 50, protected ?MemoryCacheInterface $memoryCache = NULL, protected ?EntityStorageInterface $entityStorage = NULL) {
  if ($memoryCache === NULL) {
    $this->memoryCache = \Drupal::service('entity.memory_cache');
  }
  if ($entityStorage === NULL) {
    $this->entityStorage = \Drupal::entityTypeManager()->getStorage($entity_type);
  }
}

3. Needs work, for the missing change record at least.

  • Status changed to Needs work over 1 year ago
  • šŸ‡ØšŸ‡³China jungle Chongqing, China
  • šŸ‡©šŸ‡ŖGermany donquixote

    Good idea overall in this issue!

    Btw, sometimes we want to not just save memory, but also split the operation into multiple requests.
    That is, we want to pause somewhere and resume later.
    I think this mostly applies for content entities, probably never for config entities.

    An obvious way to do this would be paging. But this would break if entities are added or removed between requests.

    Instead, for entities with auto-increment ids, a solution can be to load them in reverse order of id, and remember the last processed id (or next id to be processed) between requests. This guarantees that all entities will be processed that existed when the operation was started.

    To do this, the generator method could accept an additional parameter to indicate a starting point (min/inf or max/sup id).
    Or it could be passed through an additional method (setter or wither).
    Or there could be a factory for the entity iterator.
    Or it could be something we add in the entity query classes.
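A rough sketch of the resumable, reverse-id approach with the existing entity query API (the node entity type is illustrative; $last_id would be persisted between requests, e.g. in the batch sandbox):

```php
<?php

// Load the next chunk of node IDs below the last processed ID,
// in descending order, so entities created mid-operation are
// never visited and deletions cannot shift the window.
$ids = \Drupal::entityQuery('node')
  ->accessCheck(FALSE)
  ->condition('nid', $last_id, '<')
  ->sort('nid', 'DESC')
  ->range(0, 50)
  ->execute();
```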

    ---

    Currently the proposed class can not easily be injected as a dependency or registered as a service, because the ids iterator has to be provided at construction time.
    Passing the ids iterator directly to the generator method would work, but then the class is no longer an IteratorAggregate, so no longer works in foreach() directly.
    Another option is to start with an empty list of ids, and use a wither method (= immutable setter that returns a modified clone) to set the ids. So the initial object would be technically complete, but useless, without the ids being added.
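Such a wither could look like this (a sketch; the method name is illustrative):

```php
public function withIds(array $ids): static {
  $clone = clone $this;
  $clone->ids = $ids;
  return $clone;
}
```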

    Also, for a generator, we don't _need_ to implement `\IteratorAggregate`, we can also have a generator method with a more custom signature. Of course then you can't just foreach($object as ..) it, you have to do foreach ($object->generatorMethod()).

  • šŸ‡«šŸ‡·France andypost

    what's left here except CR?

  • šŸ‡¬šŸ‡§United Kingdom joachim

I still think this needs to be tied closely to EntityQuery - arrays of IDs aren't scalable either.

    Also, instantiating the class needs to be made easier -- having to pass in a cache object isn't good DX.

  • It seems like @donquixote and @joachim are bringing out some valid points:

    - Naming
    - Developer experience
    - Genericness

    Generally I agree that the experience is not great. It feels like having some form of factory for that seems a good idea.
    A few places for that are:

    - Entity storage class (adding even more to those classes seems hard)
    - Entity repository (feels like a nice place) (createEntityIterable(iterable $ids): EntityIterable)
    - Tie it to an entity query, like a method ->executeAndLoadIterable()

My general gut feeling is that the entity repository is a good place: it balances flexibility with discoverability. We could then add something on the entity query as a follow-up on top of that.

  • šŸ‡©šŸ‡ŖGermany donquixote

    Generally I agree that the experience is not great. It feels like having some form of factory for that seems a good idea.

    We can use static methods to avoid polluting other interfaces.
    We just pass in the required services or objects to the static methods.
    I will try to come up with something.

  • šŸ‡©šŸ‡ŖGermany donquixote

    Correcting myself:

    Currently the proposed class can not easily be injected as a dependency or registered as a service, because the ids iterator has to be provided at construction time.

Actually this is not the case if the `iterable $ids` is an array or IteratorAggregate instead of an Iterator.
    Maybe we should be more strict about the parameter.
    We also then need something that can give us this list of ids, unless we want to use an array of ids always. This could be done in a follow-up.

  • So you are saying for now it would be better to:

- open a follow-up to accept an iterable
- accept only an array for now?

    Iā€™m excited to see what you come up with on the factory/initialisation side of things.

  • šŸ‡©šŸ‡ŖGermany donquixote

    I wonder: Can we restrict this functionality to content entities with integer ids?
    Usually for config entities it is fine to load them all at once.

    This would mean we can provide more restrictive @param and @return docs.
    E.g. @param list<int> $ids and @return \Iterator<int, ContentEntityInterface>

  • šŸ‡ØšŸ‡­Switzerland Berdir Switzerland

    Content entities are not limited to int IDs, you can have strings as well there.

  • šŸ‡©šŸ‡ŖGermany donquixote

    Content entities are not limited to int IDs, you can have strings as well there.

    Alright, good to know!
Then we might as well support all entities, including config; we would not gain anything from restricting it to just content entities.

    ------------------

    To another point (@joachim in #106, #108).

    Yes, my point is that when you load an array of LOTS of entity IDs, you can exhaust memory too. IIRC it was about 1-2 million.

    I talked with @dawehner whether we only need to limit the entity objects loaded at once, or also the entity ids.

    With millions of ids, the memory usage will still be survivable, though not negligible.
    The ids are keyed by revision id, so this is not a list but a numeric associative array.
With memory_get_usage() compared before/after, I find that an array filled with $array[$i * 2] = $i for $i from 0 to 999,999 takes up 33,554,496 bytes on one system, and 41,943,120 bytes on another.
    Besides the memory impact there is also a time cost, because Statement::fetchAllKeyed() uses a foreach() to fill the values.
    In addition, we have to assume there is memory usage in the database engine. I don't know if the entire result set will be in sql memory at once, though.

    I think we want to support the following cases:

    • A long-running operation that really wants to process all entities in one go. It wants to avoid having all entity objects in memory; a long list of ids is usually ok, unless we accumulate copies of this list in logs or other places if things go wrong repeatedly.
      It could still be good to load the ids in chunks of 1,000 or 10,000, just to be respectful to other operations that need memory.
    • A batch operation where in each step we process a limited number of entities. This could either be a fixed number, or the batch step could be timeboxed.
      I think most batch operations I have seen pre-compute the list of ids before starting the batch. Usually this means they will be all in memory at one point.
      Personally I think it is more elegant and scales better to use a query that can be resumed at a given offset, instead of relying on pre-computed ids. Here a chunking of ids can be useful, if the calling code does not already specify a range limit.
    • An operation that only needs the first n entities, and does not care about the rest.
      Normally this will already put a range limit on the entity query.
      But what if the end is determined dynamically? E.g. list products until the accumulated price reaches a given limit. Here chunking of ids can be useful, especially if we run this more than once in a request, for whichever reason. But it seems quite rare.

    So, it seems that, in most cases, it is ok or survivable to load all ids at once.
    But the chunking or incremental loading of ids is probably easy enough to do it anyway, when there is not already a range limit. There is a benefit at least in some cases.

    To do this properly, we should provide incremental loading of ids directly in the entity query.
    We cannot add this to the existing query interface, but we can add an extended interface for this purpose.
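One BC-friendly shape for such an extension (interface and method names are hypothetical, not part of core):

```php
<?php

namespace Drupal\Core\Entity\Query;

/**
 * Optional extension for entity queries that can stream result IDs.
 */
interface IterableQueryInterface extends QueryInterface {

  /**
   * Executes the query, yielding the result IDs in chunks.
   *
   * @return \Iterator<int, array>
   *   An iterator of ID arrays, each holding at most $chunkSize items.
   */
  public function iterateIdChunks(int $chunkSize = 1000): \Iterator;

}
```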

  • šŸ‡¬šŸ‡§United Kingdom joachim

    > So, it seems that, in most cases, it is ok or survivable to load all ids at once.

    That's not been my experience working on migrations -- see šŸ› Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale Postponed .

    $array = range(1, 4_000_000);
    

    consumes 70MB. An array up to 6M crashes my (admittedly small) 128MB memory limit.

    > To do this properly, we should provide incremental loading of ids directly in the entity query.
    > We cannot add this to the existing query interface, but we can add an extended interface for this purpose.

    What's the reason we can't add this to the query interface?

  • šŸ‡©šŸ‡ŖGermany donquixote

    consumes 70MB. An array up to 6M crashes my (admittedly small) 128MB memory limit.

    That's small indeed. And 6M is a lot. But both still in a realistic range.
    So this shows we really need to load the ids in increments or in chunks.

    What's the reason we can't add this to the query interface?

    If we add a method to an interface, this breaks existing implementations in 3rd party code.
    The BC-friendly way is to create a new interface extending the existing one, and make it optional to implement.

  • šŸ‡«šŸ‡·France andypost

    It could use SplQueue to get more optimized results

PS: there is also https://pecl.php.net/package/ds; see the notes at https://medium.com/@rtheunissen/efficient-data-structures-for-php-7-9dda...

  • šŸ‡©šŸ‡ŖGermany donquixote

    Here is a POC/preview branch with:
    - Proposed changes to the current branch
    - A new commit to add iterator functionality to the entity query, to iterate over ids.

    https://git.drupalcode.org/issue/drupal-2577417/-/compare/11.x...2577417...

    We can have a look at this, but I think we should split out the entity query ids iterator to a separate issue.

  • šŸ‡©šŸ‡ŖGermany donquixote

    A new commit to add iterator functionality to the entity query, to iterate over ids.

The test I added here is a big copy-paste, where ->execute() is replaced with ->getIterator() + iterator_to_array().
    We should do something more sophisticated to avoid the repetition. We should cover the iterator and array functionality with the same test code. For now I just wanted to prove that it works.

    The preview PR turns the sql entity query into an IteratorAggregate.
IteratorAggregate subtly implies a stateless object, where iterating over the values repeatedly produces the same result.
At first I was a bit skeptical whether this is the case, but it seems that it is.

  • šŸ‡¬šŸ‡§United Kingdom catch

    If we add a method to an interface, this breaks existing implementations in 3rd party code.

    We have a base class in this case, so it's allowable in a minor release with a change record.

    abstract class QueryBase implements QueryInterface 
    