drush migratation with --sync has suboptimal performance (v6)

Created on 30 July 2023, over 1 year ago

Problem/Motivation

When using the --sync option during migrations, they get exponentially slower with the number of items to process.

One of my migrations has 35K items and the first phase during MigrateEvents::PRE_IMPORT takes a whole 20 minutes before even starting the migration itself (which takes even longer).

After profiling and investigating, I think the way migrate_tools keeps track of all source rows in migrate_tools_migrate_prepare_row is suboptimal and could be greatly improved.

Currently, it relies on States to make a list of IDs grow. Which implies loading some serialized data from the database, adding something to it, and re-serializing it to the database. For each row. It means that for each row, the amount of data to serialize and unserialize gets bigger and bigger. Here is where the main culprit is:

function migrate_tools_migrate_prepare_row(Row $row, MigrateSourceInterface $source, MigrationInterface $migration) {
  if (!empty($migration->syncSource)) {
    // Keep track of all source rows here [...]
    $source_id_values = \Drupal::state()->get('migrate_tools_sync', []);
    $source_id_values[] = $row->getSourceIdValues();
    \Drupal::state()->set('migrate_tools_sync', $source_id_values);
  }
}

After tweaking how this tracking works, I got it down from 20 minutes to less than 45 seconds in time.

Also as a bonus, I'm pretty sure the current system using states doesn't allow to run safely several migrations in parallel, so I've added the migration_id to the dedicated table so each migration has its own separated set of tracked IDs.

Steps to reproduce

Have a big migration to run (with tens of thousands of entries) and run it with the --sync option. Illustration:

drush migrate:import YOUR_MIGRATION --sync

Proposed resolution

Rework the way the module keeps track of rows' source IDs in a dedicated database table, so adding new entries doesn't get slower and slower as the migration progresses.

User interface changes

None.

API changes

None.

Data model changes

Use a dedicated database table for source IDs tracking.

-- --

I'm planning on creating a MR shortly. I just need to figure out how to use the new MR system.

๐Ÿ› Bug report
Status

Active

Version

6.0

Component

Code

Created by

๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @pacproduct
  • Pipeline finished with Failed
    over 1 year ago
    #15057
  • Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    Not currently mergeable.
  • @pacproduct opened merge request.
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    22 pass, 2 fail
  • @pacproduct opened merge request.
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    22 pass, 2 fail
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    I'm not sure what makes tests fail. Seems like the new database table is not found? When I install the module locally, it does get created correctly though.

    What am I missing? Any help for fixing the CI would be appreciated :)

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    36 pass
  • Status changed to Needs review over 1 year ago
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    Yay! Tests pass :)
    Switching status to "needs review".

  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    Just added a commit.

    I noticed that the migrate_tools_migrate_prepare_row hook is running during the PRE_IMPORT phase, and also during the IMPORT phase. That was duplicating all SyncID entries although we need them during the PRE_IMPORT phase. So I added a little something so IDs get recorded during the PRE_IMPORT phase only. Then, they stay available in database during the actual migration phase just in case another module would need them, and they get cleaned during the POST_IMPORT phase.

    Also added some missing comments.

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    36 pass
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & sqlite-3.27
    last update over 1 year ago
    36 pass
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    36 pass
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    36 pass
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance simon georges Rouen, France
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance denix

    dear all,
    we are following closely this patch because when working with large datasets the operation is incredibly expensive in terms of performances. We are running tests on our side and we can only appreciate better performances and no drawbacks for the time been.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States Jorge Navarro

    I tested it in a large migration and indeed, migration time become a fraction, looking forward to see it merged!

  • Status changed to RTBC over 1 year ago
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance simon georges Rouen, France

    Considering several people have reported it working without drawbacks, let's try to move it forward ;-)

  • ๐Ÿ‡ซ๐Ÿ‡ทFrance MaxPah

    Just for information, with this patch (who is working fine) there is a log on reports/status page :

    READ-COMMITTED
    For this to work correctly, all tables must have a primary key. The following table(s) do not have a primary key: migrate_tools_sync_source_ids. See the setting MySQL transaction isolation level page for more information.

    I think it's just this line missing in hook_schema declaration:

    'primary key' => ['migration_id'],

  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    @MaxPah I cannot reproduce your issue. Also, migration_id cannot be used as primary key as this column doesn't contain unique entries.

    We could add an arbitrary serial 'id' as unique identifier and as primary key. Although it won't be of any use, I guess it's good practice anyway.

    I'll rework that shortly.

  • Pipeline finished with Failed
    over 1 year ago
    #25417
  • Pipeline finished with Failed
    over 1 year ago
    #25418
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    36 pass
  • Status changed to Needs review over 1 year ago
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance simon georges Rouen, France
  • Pipeline finished with Success
    about 1 year ago
    Total: 230s
    #73645
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    Because branch `6.0.x` recently changed, I had to rebase my MR and resolve conflicts. However, now the MR itself doesn't apply anymore to version `6.0.2` of this module.

    So here is an attached patch specfically for version `6.0.2`, so people can keep applying it easily if needed.

  • ๐Ÿ‡จ๐Ÿ‡ฆCanada emixaam

    Successfully tested Merge request 44 with migrate_tools 6.0.4 and migrate_plus 6.0.2, on Drupal 10.2.3. On a local environment, I tested removing 10000 items from the source of a custom migration and executing ยดdrush mim --syncยด. The items were correctly removed, and the time to complete the sync import was much faster than without the patch.

    Was there more thoughts given into the discussion #mr44-note197623 about using MigrateTools as a service vs moving to a dedicated service? I think it would make sense but I'm new on the issue.

  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    @emixaam I'm not sure what the best approach would be.

    If the issue is that the service name is weird, I guess we could:
    - Rename the class and its corresponding service. That could impact some code elsewhere in the project, I did not check.
    - As the services currently declared by this module don't always match the name of their corresponding classes, another approach would be to rename the service only but not the class itself (e.g. "migrate_tools.helper").

    If the issue is that the source code is not well-enough organized, that would require a bit more rework but maybe we could:
    -Move all these functions into a separate service, but I'm not sure how to call it: these are utility helper functions for tracking source IDs during the first phase of a migration, insuring the module knows which entries to delete. But there's already a service with a name suggesting it handles that sync task (among other things): "migrate_tools.migration_sync" (Drupal\migrate_tools\EventSubscriber\MigrationImportSync).
    - Therefore, another consideration would be to completely remove class Drupal\migrate_tools\MigrateTools (which initially contained a small static function buildIdList) and put everything into the existing service "migrate_tools.migration_sync"

    I feel like the decision is up to the maintainer on what the best approach for their module is at this point, but what do you guys think?

  • ๐Ÿ‡ง๐Ÿ‡ชBelgium BramDriesen Belgium ๐Ÿ‡ง๐Ÿ‡ช

    Uploading a static patch from the MR since the patch of #18 No longer applies to the latest release (6.0.4)

  • State service uses a cache collector for performance since Drupal 10.3 https://www.drupal.org/node/3177901 โ†’

    Though on existing sites this feature needs to be enabled explicitly through the settings flag $settings['state_cache'] = TRUE;, this is the default behavior for new sites (in default.settings.php), and in Drupal 11+ the setting is removed and state cache is permanently turned on.

    There are 2 queries presented in the change record to check current state usage and then decide to opt-in/out state cache depending on results.

    Sites with more than 100 entries and/or a total value size/length of 100,000 should review their specific entries...

    Size/length can easily exceed 100,000 on thousands migration items. Setting the issue priority to Major...

  • ๐Ÿ‡ซ๐Ÿ‡ฎFinland vermario

    Agree with the previous poster, this will become a problem in Drupal11 for sure. Sounds like it would be great to push this change forward.

  • Pipeline finished with Failed
    4 months ago
    Total: 231s
    #292124
  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom scott_euser

    Thanks very much for this! We have tested this out quite thoroughly as we were similarly hitting an issue syncing roughly 40,000 entities from a CRM twice per hour and the sync was hanging. We have been running this for several days without issue as we figured its fairly safe since we could always remove the patch and re-sync the content the ~20 minute way.

    We also did notice the issue of multiple migrate --sync's also having this problem:

    Also as a bonus, I'm pretty sure the current system using states doesn't allow to run safely several migrations in parallel, so I've added the migration_id to the dedicated table so each migration has its own separated set of tracked IDs.

    Which is sorted by this MR.

    The only thing I can suggest further is a subsequent follow-up where we could add a cron job just to check if there are any closed migrate IDs in the new table that still have data in the table itself. Essentially the code without this patch was causing our database to pile up each time in the key-value whenever it timed out, with orphan key-value data since the clean-up phase was never reached. While this is far more performant, its I suppose still possible some failure leaves orphans in the table if the clean-up step in this MR is never reached (ie, ::cleanSyncData()). But definitely a follow-up thing IMO.

  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom scott_euser
  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom scott_euser

    Cleaned up patches so its clear what we tested (ie, the visible MR)

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States hestenet Portland, OR ๐Ÿ‡บ๐Ÿ‡ธ

    This would be helpful for Drupal.org's site migrations and the time period that we have Modern Drupal and D7 running in parallel.

  • ๐Ÿ‡ช๐Ÿ‡ธSpain fjgarlin

    For context, our user migration is +2 million records, so holding that number of IDs in memory is not an option.

  • First commit to issue fork.
  • Pipeline finished with Skipped
    3 months ago
    #330343
  • heddn Nicaragua

    Thanks everyone for your contributions.

  • Automatically closed - issue fixed for 2 weeks with no activity.

  • Status changed to Fixed 27 days ago
  • ๐Ÿ‡บ๐Ÿ‡ธUnited States loopy1492

    I just pulled our database from prod which is running migrate_tools version 6.0.4, then ran updates locally from version 6.0.4 to version 6.0.5 (according to composer.lock). `drush updb` is throwing:

    > [notice] Update started: migrate_tools_update_10000
    > [error] Table 'migrate_tools_sync_source_ids' already exists.
    > [error] Update failed: migrate_tools_update_10000

    The only thing I could think of is that it was set to -dev recently, but that is not the case.

  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom scott_euser

    You could have had the patch installed ages ago too, it existed over a year ago I believe.

    Anyways probably you could open a follow-up issue to add a check if table not exists into the update hook.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States loopy1492

    I was able to sync the database from one environment to the other upstream and get a successful updb. I'm not entirely sure what the deal is with syncing down to local, but I'm guessing it's something with my provider or my computer. Or possibly disparate MySQL versions, though that's a pie in the sky guess. Thanks for your response.

  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom mjpa

    I came across this issue after upgrading to 10.4 and looking into the state cache flag.

    Our state values are all relatively small, except migrate_tools_sync that's pretty large.

    We've updated the module and ran updates etc which has created the new table - all fine there.

    However, should the old state entry still exist? Should it not be removed as part of the update hook?

    I can't find any code leftover that would remove it so it looks like it'll persist forever unless we delete it ourselves.

  • heddn Nicaragua

    Can you open a follow-up to remove the old state entry? We should get rid of it.

  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom mjpa

    Yep, no problem. I've created ๐Ÿ“Œ Remove the old migrate_tools_sync state entry Active

Production build 0.71.5 2024