Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale

Created on 11 July 2020, over 4 years ago
Updated 14 April 2023, over 1 year ago

This is useful when migrating a large number of source records with Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity.

Currently, the more source records a migration using this plugin has, the more memory is consumed during migrate-import, because the source eagerly loads everything up front.

How to reproduce: use an entity bundle with lots of fields and lots of records (e.g. nodes).

// Helper: dump current and peak memory usage.
$printMem = function () {
  $usage = memory_get_usage();
  $peak = memory_get_peak_usage();
  var_dump([
    'memory_get_usage()' => $usage,
    'simple usage' => number_format(round($usage / 1024 / 1024)) . ' MB',
    'memory_get_peak_usage()' => $peak,
    'simple peak' => number_format(round($peak / 1024 / 1024)) . ' MB',
  ]);
};

$array = [];
for ($i = 0; $i < 1000000; $i++) { $array[] = $i; }
$printMem();
unset($array);
$printMem();
// 50 MB usage, 82 MB peak usage.
// Note: ~82 MB is the application bootstrap on the Drush CLI; run these snippets via drush php or drupal shell.

$node = \Drupal\node\Entity\Node::load(8878);
$printMem();
// 52 MB usage, 82 MB peak usage.

use Drupal\migrate\MigrateExecutable;
use Drupal\migrate\MigrateMessage;
$migrationMachineName = 'XXX';
$migration = \Drupal::service('plugin.manager.migration')->createInstance($migrationMachineName);
// Mark previously imported rows so they are re-imported.
$migration->getIdMap()->prepareUpdate();
$executable = new MigrateExecutable($migration, new MigrateMessage());
$executable->import();
$printMem();
// 439 MB usage, 439 MB peak usage.
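
For illustration, the kind of batching this issue asks for would fetch source entity IDs in fixed-size windows instead of all at once. The following is only a rough sketch of that idea, not the plugin's actual code; the node entity type, the sort on nid and the batch size of 50 are assumptions.

// Rough sketch of windowed ID fetching; not the actual ContentEntity plugin code.
$batch_size = 50;
$batch = 0;
do {
  $ids = \Drupal::entityQuery('node')
    ->accessCheck(FALSE)
    // A stable sort keeps the windows from overlapping or skipping rows.
    ->sort('nid')
    ->range($batch * $batch_size, $batch_size)
    ->execute();
  foreach ($ids as $id) {
    // Hand each ID (or the loaded entity) over to row processing here.
  }
  $batch++;
} while (count($ids) === $batch_size);
// Peak memory now depends on the window size, not on the total number of records.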
🐛 Bug report

Status: Postponed
Version: 10.1
Component: Migration
Last updated: 2 days ago
Created by: 🇺🇸United States chriscalip (Chicago, IL)

  • Needs tests

    The change is currently missing an automated test that fails when run with the original code, and succeeds when the bug has been fixed.

Comments & Activities

  • The Needs Review Queue Bot tested this issue. It either no longer applies to Drupal core, or fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

    Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.

    Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

  • 🇮🇳India pooja saraah Chennai

    Fixed failed commands on #21
    Attached patch against Drupal 10.1.x

  • 🇺🇸United States smustgrave

    Seems the retest aborted also. I don't think there is a build problem but could be wrong.

    Am moving to NW (Needs work) for the tests.

  • 🇨🇭Switzerland berdir Switzerland

    The while (TRUE) here seems wrong: you don't need a while with yield, otherwise it will run forever, which is why all the CI runs time out.

    Also, we can add a loadMultiple(), because loading 50 nodes at once will result in ~50x fewer queries.
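
    For illustration, the ~50x reduction comes from replacing per-entity loads with one loadMultiple() call per window of IDs. A minimal sketch, assuming the node storage, a window of 50, and that $ids already holds the source entity IDs (this is not code from the MR):

      $storage = \Drupal::entityTypeManager()->getStorage('node');
      // One multi-load per window of 50 IDs instead of 50 single loads.
      for ($i = 0; $i < count($ids); $i += 50) {
        foreach ($storage->loadMultiple(array_slice($ids, $i, 50)) as $entity) {
          // Process / yield each loaded entity here.
        }
      }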

  • 🇬🇧United Kingdom joachim
    +++ b/core/modules/migrate_drupal/src/Plugin/migrate/source/ContentEntity.php
    @@ -165,31 +177,59 @@ public function __toString() {
    +    while (TRUE) {
    +      $query = $this->query();
    +      if (isset($this->batchSize) && $this->batchSize > 0) {
    +        // Run the query in batches, to prevent large source sizes exhausting memory.
    +        $query->range($current_batch * $this->batchSize, $this->batchSize);
    +      }
    +      $ids = $query->execute();
    +
    +      // End the loop when we run out of source entities.
    +      if (empty($ids)) {
    +        break;
    +      }
    +
    

    We need to end the loop when $ids is empty, but we can only determine what $ids is once we've run a query.

    It's not something we can put inside the condition for the while().

    Hence why I went for the while (TRUE) { break; } construction.

    What do you suggest instead?

    BTW this issue started with a MR, please could we stick to that and not post patches?

  • 🇨🇭Switzerland berdir Switzerland

    You are right. This would be much nicer as a do { ... } while (!empty($ids)); though.

    That said, after reading it again, I realized the problem: it only works if you do have a batch size, because without a batch size it will not add a range, and then you will run the same query over and over again and $ids will never be empty. You could do a count($ids) < $batch_size || !$batch_size check or so.
    
    The thing with batch size is that it's quite tricky. Rerunning the query can give you different results; it is, for example, fundamentally incompatible with the map table join that you're also working on. The map table join will no longer return the rows that are already in the map, so the batch offset would then also skip not-yet-processed results, causing you to skip $batch_size * $n rows on every query execution. Additionally, the query currently has no sort, so some databases will not guarantee a stable order, again risking that only a random subset is processed, so you'd need to add a sort. That's why \Drupal\migrate\Plugin\migrate\source\SqlBase::initializeIterator() is so complicated: it handles all those different features and can only use some combinations of them.
    
    So you have to pick between using the map join to speed up reruns (and then you can do multiple batches of 100k/1M or so entities) and a batch size anyway.
    
    IMHO, using loadMultiple() and the memory reset will have a much bigger impact on memory usage, although a large enough data set will eventually make you run into a hard limit as well, but most people will hit the limit much earlier due to loaded entities. Based on a blackfire run against a data set of 420k source entities, Drupal\Core\Entity\Query\Sql\Query::result() uses about 250MB for that. I did some tests with range(0,1000000) which uses only 33MB, but the query gives us strings which clearly use up much more space.
    
    What do you think about focusing on that first, possibly in a new issue? I'm confident that we can get that into core much more easily, as there shouldn't be any negative side effects or possible bugs, unlike batching. Maybe we need a test with more entities than the chunk size, but that's about it. We could maybe use Settings::get('entity_update_batch_size', 50), or a different setting name, so we don't have to create 50 entities in a test.

    For those tests, I compared array_chunk(), array_splice() and array_slice(). array_chunk() more than doubles memory usage as it creates many smaller arrays; array_splice() is extremely slow for this use case (getting the first N entries from the array), probably because it reorders the whole array on every call; array_slice() with an incrementing offset looks like the clear winner:

      $ids = range(1, 1000003);
    
      $i = 0;
      while ($ids_chunk = array_slice($ids, $i, 50)) {
        $i += 50;
      }
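
    Putting those pieces together, the array_slice() loop above combined with loadMultiple() and a memory reset could look roughly like this (illustrative sketch only; the node entity type and the per-chunk resetCache() call are assumptions):

      $storage = \Drupal::entityTypeManager()->getStorage('node');
      $ids = \Drupal::entityQuery('node')->accessCheck(FALSE)->execute();

      $i = 0;
      while ($ids_chunk = array_slice($ids, $i, 50)) {
        $i += 50;
        foreach ($storage->loadMultiple($ids_chunk) as $entity) {
          // Process / yield the entity here.
        }
        // Drop the chunk from the entity memory cache so usage stays flat.
        $storage->resetCache($ids_chunk);
      }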
    
  • 🇨🇭Switzerland mathilde_dumond

    I created a related issue and uploaded a patch that runs the query in one go, but loads the entities in batches and clears the memory cache regularly. It may not be enough for all situations, but it already helps with memory and performance: 📌 Use loadMultiple() and reset memory cache in ContentEntity source (Needs work).

  • 🇬🇧United Kingdom joachim

    It's been a while since I've worked on all this, and my current migration project is a low-budget affair where I'm doing little more than rerolling patches, so I don't remember all the details.

    > IMHO, using loadMultiple() and the memory reset will have a much bigger impact on memory usage, although a large enough data set will eventually make you run into a hard limit as well, but most people will hit the limit much earlier due to loaded entities

    I hit the memory limit from just getting an array of entity IDs from the query on a migration that was 6M source entities. That sounds like a lot, but it was just a customer portal website. It's not an outlandish use case.

    > Rerunning the query can give you different results, it is for example fundamentally incompatible with the map table join that you're also working on. The map table join will no longer return the results that already match and then the batch would also skip non-processed results, causing you to skip $batch_size * $n on every query execution.

    Took me a while to get my head round that, but yes, I see what you mean.

    What's strange is that I definitely used batch_size AND the map join (with the patch from 🐛 ContentEntity migration source doesn't consider the migration map (Needs work)) and I don't remember having problems with it. Maybe there was some query caching going on in MySQL? Or maybe my map join patch wasn't working properly?

    At any rate, that's something that could be figured out: don't advance the LIMIT part of the query if the source uses a map join (see the sketch below).

    The map join is important too, as without it, resuming a migration after a crash takes a horrendous amount of time as it checks everything it's already migrated.

    > Additionally, the query currently has no sort, so some databases will not guarantee a stable sort order, again risking to only process a random subset, so you'd need to add a sort.

    Migrate SQL source query isn't ordered (Active) :)

    > What do you think about focusing on that first, possibly in a new issue. I'm confident that we can get that into core much more easily as there shouldn't be any negative side effects/possible bugs unlike batch.

    You mean doing a separate issue which just handles clearing the entity storage static cache, and postponing this issue to be worked on after the new one? Yes, makes sense to split this up into simpler issues. Looks like @mathilde_dumond has already created it -- thanks! :)
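
    For illustration, the "don't advance the LIMIT when the map join is in use" idea could look roughly like this (purely a sketch; $build_source_query(), $use_map_join and $batch_size are assumed names):

      $current_batch = 0;
      do {
        // With the map join, already-migrated rows drop out of the result set,
        // so the window should always start at offset 0; without the join, the
        // offset has to advance as usual.
        $offset = $use_map_join ? 0 : $current_batch * $batch_size;
        $ids = $build_source_query()->range($offset, $batch_size)->execute();
        foreach ($ids as $id) {
          // Process the row; with the map join it will not reappear next time.
        }
        $current_batch++;
      } while (count($ids) === $batch_size);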

  • Status changed to Postponed over 1 year ago
  • 🇬🇧United Kingdom joachim

    In which case, this is postponed on 📌 Use loadMultiple() and reset memory cache in ContentEntity source (Needs work).
