Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale

Created on 11 July 2020, over 4 years ago
Updated 14 April 2023, over 1 year ago

This is useful when migrating a large number of source records with Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity.

Currently, the more source records a migration using this plugin has, the more memory is consumed during migrate-import, because the source eagerly loads everything up front.

How to reproduce: use an entity bundle with lots of fields and lots of records (e.g. nodes).

// Helper: dump current and peak memory usage.
$printMem = function () {
  $usage = memory_get_usage();
  $peak = memory_get_peak_usage();
  var_dump([
    'memory_get_usage()' => $usage,
    'simple usage' => number_format(round($usage / 1024 / 1024)) . ' MB',
    'memory_get_peak_usage()' => $peak,
    'simple peak' => number_format(round($peak / 1024 / 1024)) . ' MB',
  ]);
};

$array = [];
for ($i = 0; $i < 1000000; $i++) { $array[] = $i; }
$printMem();
unset($array);
$printMem();
// 50 MB usage, 82 MB peak usage.
// Note: ~82 MB is the application bootstrap on the Drush CLI; run these snippets via drush php or drupal shell.

$node = \Drupal\node\Entity\Node::load(8878);
$printMem();
// 52 MB usage, 82 MB peak usage.

use Drupal\migrate\MigrateExecutable;
use Drupal\migrate\MigrateMessage;
$migrationMachineName = 'XXX';
$migration = \Drupal::service('plugin.manager.migration')->createInstance($migrationMachineName);
// Mark previously imported rows so they are re-imported.
$migration->getIdMap()->prepareUpdate();
$executable = new MigrateExecutable($migration, new MigrateMessage());
$executable->import();
$printMem();
// 439 MB usage, 439 MB peak usage.
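
For illustration, the kind of batching this issue asks for would fetch source entity IDs in fixed-size windows instead of all at once. The following is only a rough sketch of that idea, not the plugin's actual code; the node entity type, the sort on nid and the batch size of 50 are assumptions.

// Rough sketch of windowed ID fetching; not the actual ContentEntity plugin code.
$batch_size = 50;
$batch = 0;
do {
  $ids = \Drupal::entityQuery('node')
    ->accessCheck(FALSE)
    // A stable sort keeps the windows from overlapping or skipping rows.
    ->sort('nid')
    ->range($batch * $batch_size, $batch_size)
    ->execute();
  foreach ($ids as $id) {
    // Hand each ID (or the loaded entity) over to row processing here.
  }
  $batch++;
} while (count($ids) === $batch_size);
// Peak memory now depends on the window size, not on the total number of records.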
🐛 Bug report

Status: Postponed
Version: 10.1
Component: Migration
Last updated: 2 days ago
Created by: 🇺🇸United States chriscalip (Chicago, IL)

  • Needs tests

    The change is currently missing an automated test that fails when run with the original code, and succeeds when the bug has been fixed.

Comments & Activities

  • The Needs Review Queue Bot tested this issue. It either no longer applies to Drupal core, or fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

    Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.

    Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

  • 🇮🇳India pooja saraah Chennai

    Fixed failed commands on #21
    Attached patch against Drupal 10.1.x

  • 🇺🇸United States smustgrave

    Seems the retest aborted also. I don't think there is a build problem but could be wrong.

    Am moving to NW (Needs work) for the tests.

  • 🇨🇭Switzerland berdir Switzerland

    The while (TRUE) here seems wrong: you don't need a while with yield, otherwise it will run forever, which is why all the CI runs time out.

    Also, we can add a loadMultiple(), because loading 50 nodes at once will result in ~50x fewer queries.
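
    For illustration, the ~50x reduction comes from replacing per-entity loads with one loadMultiple() call per window of IDs. A minimal sketch, assuming the node storage, a window of 50, and that $ids already holds the source entity IDs (this is not code from the MR):

      $storage = \Drupal::entityTypeManager()->getStorage('node');
      // One multi-load per window of 50 IDs instead of 50 single loads.
      for ($i = 0; $i < count($ids); $i += 50) {
        foreach ($storage->loadMultiple(array_slice($ids, $i, 50)) as $entity) {
          // Process / yield each loaded entity here.
        }
      }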

  • 🇬🇧United Kingdom joachim
    +++ b/core/modules/migrate_drupal/src/Plugin/migrate/source/ContentEntity.php
    @@ -165,31 +177,59 @@ public function __toString() {
    +    while (TRUE) {
    +      $query = $this->query();
    +      if (isset($this->batchSize) && $this->batchSize > 0) {
    +        // Run the query in batches, to prevent large source sizes exhausting memory.
    +        $query->range($current_batch * $this->batchSize, $this->batchSize);
    +      }
    +      $ids = $query->execute();
    +
    +      // End the loop when we run out of source entities.
    +      if (empty($ids)) {
    +        break;
    +      }
    +
    

    We need to end the loop when $ids is empty, but we can only determine what $ids is once we've run a query.

    It's not something we can put inside the condition for the while().

    Hence why I went for the while (TRUE) { break; } construction.

    What do you suggest instead?

    BTW this issue started with a MR, please could we stick to that and not post patches?

  • 🇨🇭Switzerland berdir Switzerland

    You are right. This would be much nicer as a do { ... } while (!empty($ids)); though.

    That said, after reading it again, I realized the problem: it only works if you do have a batch size, because without a batch size it will not add a range, and then you will run the same query over and over again and $ids will never be empty. You could do a count($ids) < $batch_size || !$batch_size check or so.
    
    The thing with batch size is that it's quite tricky. Rerunning the query can give you different results; it is, for example, fundamentally incompatible with the map table join that you're also working on. The map table join will no longer return the rows that are already in the map, so the batch offset would then also skip not-yet-processed results, causing you to skip $batch_size * $n rows on every query execution. Additionally, the query currently has no sort, so some databases will not guarantee a stable order, again risking that only a random subset is processed, so you'd need to add a sort. That's why \Drupal\migrate\Plugin\migrate\source\SqlBase::initializeIterator() is so complicated: it handles all those different features and can only use some combinations of them.
    
    So you have to pick between using the map join to speed up reruns (and then you can do multiple batches of 100k/1M or so entities) and a batch size anyway.
    
    IMHO, using loadMultiple() and the memory reset will have a much bigger impact on memory usage, although a large enough data set will eventually make you run into a hard limit as well, but most people will hit the limit much earlier due to loaded entities. Based on a blackfire run against a data set of 420k source entities, Drupal\Core\Entity\Query\Sql\Query::result() uses about 250MB for that. I did some tests with range(0,1000000) which uses only 33MB, but the query gives us strings which clearly use up much more space.
    
    What do you think about focusing on that first, possibly in a new issue? I'm confident that we can get that into core much more easily, as there shouldn't be any negative side effects or possible bugs, unlike batching. Maybe we need a test with more entities than the chunk size, but that's about it. We could maybe use Settings::get('entity_update_batch_size', 50), or a different setting name, so we don't have to create 50 entities in a test.

    For those tests, I compared array_chunk(), array_splice() and array_slice(). array_chunk() more than doubles memory usage as it creates many smaller arrays; array_splice() is extremely slow for this use case (getting the first N entries from the array), probably because it reorders the whole array on every call; array_slice() with an incrementing offset looks like the clear winner:

      $ids = range(1, 1000003);
    
      $i = 0;
      while ($ids_chunk = array_slice($ids, $i, 50)) {
        $i += 50;
      }
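
    Putting those pieces together, the array_slice() loop above combined with loadMultiple() and a memory reset could look roughly like this (illustrative sketch only; the node entity type and the per-chunk resetCache() call are assumptions):

      $storage = \Drupal::entityTypeManager()->getStorage('node');
      $ids = \Drupal::entityQuery('node')->accessCheck(FALSE)->execute();

      $i = 0;
      while ($ids_chunk = array_slice($ids, $i, 50)) {
        $i += 50;
        foreach ($storage->loadMultiple($ids_chunk) as $entity) {
          // Process / yield the entity here.
        }
        // Drop the chunk from the entity memory cache so usage stays flat.
        $storage->resetCache($ids_chunk);
      }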
    
  • 🇨🇭Switzerland mathilde_dumond

    I created a related issue and uploaded a patch that runs the query in one go, but loads the entities in batches and clears the memory cache regularly. It may not be enough for all situations, but it already helps with memory and performance: 📌 Use loadMultiple() and reset memory cache in ContentEntity source (Needs work).

  • 🇬🇧United Kingdom joachim

    It's been a while since I've worked on all this, and my current migration project is a low-budget affair where I'm doing little more than rerolling patches, so I don't remember all the details.

    > IMHO, using loadMultiple() and the memory reset will have a much bigger impact on memory usage, although a large enough data set will eventually make you run into a hard limit as well, but most people will hit the limit much earlier due to loaded entities

    I hit the memory limit from just getting an array of entity IDs from the query on a migration that was 6M source entities. That sounds like a lot, but it was just a customer portal website. It's not an outlandish use case.

    > Rerunning the query can give you different results, it is for example fundamentally incompatible with the map table join that you're also working on. The map table join will no longer return the results that already match and then the batch would also skip non-processed results, causing you to skip $batch_size * $n on every query execution.

    Took me a while to get my head round that, but yes, I see what you mean.

    What's strange is that I definitely used batch_size AND the map join (with the patch from 🐛 ContentEntity migration source doesn't consider the migration map (Needs work)) and I don't remember having problems with it. Maybe there was some query caching going on in MySQL? Or maybe my map join patch wasn't working properly?

    At any rate, that's something that could be figured out: don't advance the LIMIT part of the query if the source uses a map join (see the sketch below).

    The map join is important too, as without it, resuming a migration after a crash takes a horrendous amount of time as it checks everything it's already migrated.

    > Additionally, the query currently has no sort, so some databases will not guarantee a stable sort order, again risking to only process a random subset, so you'd need to add a sort.

    Migrate SQL source query isn't ordered (Active) :)

    > What do you think about focusing on that first, possibly in a new issue. I'm confident that we can get that into core much more easily as there shouldn't be any negative side effects/possible bugs unlike batch.

    You mean doing a separate issue which just handles clearing the entity storage static cache, and postponing this issue to be worked on after the new one? Yes, makes sense to split this up into simpler issues. Looks like @mathilde_dumond has already created it -- thanks! :)
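
    For illustration, the "don't advance the LIMIT when the map join is in use" idea could look roughly like this (purely a sketch; $build_source_query(), $use_map_join and $batch_size are assumed names):

      $current_batch = 0;
      do {
        // With the map join, already-migrated rows drop out of the result set,
        // so the window should always start at offset 0; without the join, the
        // offset has to advance as usual.
        $offset = $use_map_join ? 0 : $current_batch * $batch_size;
        $ids = $build_source_query()->range($offset, $batch_size)->execute();
        foreach ($ids as $id) {
          // Process the row; with the map join it will not reappear next time.
        }
        $current_batch++;
      } while (count($ids) === $batch_size);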

  • Status changed to Postponed over 1 year ago
  • 🇬🇧United Kingdom joachim

    In which case, this is postponed on 📌 Use loadMultiple() and reset memory cache in ContentEntity source (Needs work).
