drush migration with '--sync' option has suboptimal performance

Created on 30 July 2023, 11 months ago
Updated 19 June 2024, 6 days ago

Problem/Motivation

When using the --sync option, migrations slow down dramatically as the number of items to process grows: the cost of handling each row increases with the number of rows already processed.

One of my migrations has 35K items, and the first phase during MigrateEvents::PRE_IMPORT takes a full 20 minutes before the migration itself even starts (and the migration then takes even longer).

After profiling and investigating, I think the way migrate_tools keeps track of all source rows in migrate_tools_migrate_prepare_row is suboptimal and could be greatly improved.

Currently, it relies on the State API to grow a list of IDs, which means loading the serialized data from the database, appending one entry, and re-serializing the whole thing back to the database, for each row. So for each row, the amount of data to serialize and unserialize gets bigger and bigger. Here is the main culprit:

function migrate_tools_migrate_prepare_row(Row $row, MigrateSourceInterface $source, MigrationInterface $migration) {
  if (!empty($migration->syncSource)) {
    // Keep track of all source rows here [...]
    $source_id_values = \Drupal::state()->get('migrate_tools_sync', []);
    $source_id_values[] = $row->getSourceIdValues();
    \Drupal::state()->set('migrate_tools_sync', $source_id_values);
  }
}

After tweaking how this tracking works, I got that phase down from 20 minutes to less than 45 seconds.

Also, as a bonus: I'm pretty sure the current State-based system doesn't allow several migrations to run safely in parallel, so I've added the migration_id to the dedicated table so that each migration has its own separate set of tracked IDs.

Steps to reproduce

Have a big migration to run (with tens of thousands of entries) and run it with the --sync option. Illustration:

drush migrate:import YOUR_MIGRATION --sync

Proposed resolution

Rework the way the module keeps track of rows' source IDs by using a dedicated database table, so that adding new entries doesn't get slower and slower as the migration progresses.
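
Roughly, the idea is something like this (a sketch only; the table and column names below are illustrative and may differ from the actual MR):

function migrate_tools_migrate_prepare_row(Row $row, MigrateSourceInterface $source, MigrationInterface $migration) {
  if (!empty($migration->syncSource)) {
    // Insert one small record per row instead of rewriting an ever-growing
    // serialized blob in state, so the cost per row stays constant.
    \Drupal::database()->insert('migrate_tools_sync_source_ids')
      ->fields([
        'migration_id' => $migration->id(),
        'source_ids' => serialize($row->getSourceIdValues()),
      ])
      ->execute();
  }
}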

User interface changes

None.

API changes

None.

Data model changes

Use a dedicated database table for tracking source IDs.
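
For illustration, such a table could be declared along these lines in migrate_tools.install (field names, types and lengths here are assumptions, not necessarily what the MR ends up using):

/**
 * Implements hook_schema().
 */
function migrate_tools_schema() {
  $schema['migrate_tools_sync_source_ids'] = [
    'description' => 'Tracks source ID values collected during --sync migrations.',
    'fields' => [
      'migration_id' => [
        'type' => 'varchar_ascii',
        'length' => 192,
        'not null' => TRUE,
        'description' => 'ID of the migration the tracked row belongs to.',
      ],
      'source_ids' => [
        'type' => 'blob',
        'size' => 'big',
        'not null' => TRUE,
        'description' => 'Serialized source ID values of one row.',
      ],
    ],
    'indexes' => [
      'migration_id' => ['migration_id'],
    ],
  ];
  return $schema;
}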

-- --

I'm planning on creating an MR shortly. I just need to figure out how to use the new MR system.

πŸ› Bug report
Status

Needs review

Version

6.0

Component

Code

Created by

πŸ‡«πŸ‡·France pacproduct



Comments & Activities

  • Issue created by @pacproduct
  • Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    Not currently mergeable.
  • @pacproduct opened merge request.
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    22 pass, 2 fail
  • @pacproduct opened merge request.
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    22 pass, 2 fail
  • πŸ‡«πŸ‡·France pacproduct

    I'm not sure what makes the tests fail. It seems like the new database table is not found? When I install the module locally, it does get created correctly, though.

    What am I missing? Any help for fixing the CI would be appreciated :)

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    36 pass
  • Status changed to Needs review 11 months ago
  • πŸ‡«πŸ‡·France pacproduct

    Yay! Tests pass :)
    Switching status to "needs review".

  • πŸ‡«πŸ‡·France pacproduct

    Just added a commit.

    I noticed that the migrate_tools_migrate_prepare_row hook runs during the PRE_IMPORT phase and also during the IMPORT phase, which was duplicating all sync ID entries even though we only need them during the PRE_IMPORT phase. So I added a little something so that IDs get recorded during the PRE_IMPORT phase only. They then stay available in the database during the actual migration phase, in case another module needs them, and they get cleaned up during the POST_IMPORT phase (see the illustrative snippet below).

    Also added some missing comments.
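
    Purely as an illustration of that last step (names are assumptions; the MR may do this differently), the POST_IMPORT handling could simply delete the entries tracked for the finished migration:

    // Hypothetical cleanup during MigrateEvents::POST_IMPORT (sketch only):
    // drop the IDs tracked for this migration once the sync pass is finished.
    $migration_id = $event->getMigration()->id();
    \Drupal::database()->delete('migrate_tools_sync_source_ids')
      ->condition('migration_id', $migration_id)
      ->execute();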

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    36 pass
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & sqlite-3.27
    last update 11 months ago
    36 pass
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    36 pass
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    36 pass
  • πŸ‡«πŸ‡·France Simon Georges Rouen, France
  • πŸ‡«πŸ‡·France denix

    Dear all,
    We are following this patch closely because the operation is incredibly expensive in terms of performance when working with large datasets. We are running tests on our side, and so far we see only better performance and no drawbacks.

  • πŸ‡ΊπŸ‡ΈUnited States Jorge Navarro

    I tested it in a large migration and indeed the migration time became a fraction of what it was. Looking forward to seeing it merged!

  • Status changed to RTBC 10 months ago
  • πŸ‡«πŸ‡·France Simon Georges Rouen, France

    Considering several people have reported it working without drawbacks, let's try to move it forward ;-)

  • πŸ‡«πŸ‡·France MaxPah

    Just for information: with this patch (which is working fine), there is a warning on the reports/status page:

    READ-COMMITTED
    For this to work correctly, all tables must have a primary key. The following table(s) do not have a primary key: migrate_tools_sync_source_ids. See the setting MySQL transaction isolation level page for more information.

    I think it's just this line missing from the hook_schema() declaration:

    'primary key' => ['migration_id'],

  • πŸ‡«πŸ‡·France pacproduct

    @MaxPah I cannot reproduce your issue. Also, migration_id cannot be used as the primary key, as this column doesn't contain unique values.

    We could add an arbitrary serial 'id' column as a unique identifier and primary key. Although it won't be of much use, I guess it's good practice anyway (see the sketch below).

    I'll rework that shortly.
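
    For illustration (not necessarily the exact change that will land in the MR), the schema could gain a serial column used only as the primary key:

    // Sketch: inside migrate_tools_schema(), add an auto-increment column and
    // declare it as the primary key of the tracking table.
    $schema['migrate_tools_sync_source_ids']['fields']['id'] = [
      'type' => 'serial',
      'unsigned' => TRUE,
      'not null' => TRUE,
      'description' => 'Arbitrary auto-increment ID, used only as the primary key.',
    ];
    $schema['migrate_tools_sync_source_ids']['primary key'] = ['id'];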

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 9 months ago
    36 pass
  • Status changed to Needs review 9 months ago
  • πŸ‡«πŸ‡·France Simon Georges Rouen, France
  • πŸ‡«πŸ‡·France pacproduct

    Because branch `6.0.x` recently changed, I had to rebase my MR and resolve conflicts. However, the MR itself no longer applies to version `6.0.2` of this module.

    So here is an attached patch specifically for version `6.0.2`, so people can keep applying it easily if needed.

  • πŸ‡¨πŸ‡¦Canada emixaam

    Successfully tested merge request 44 with migrate_tools 6.0.4 and migrate_plus 6.0.2, on Drupal 10.2.3. On a local environment, I tested removing 10,000 items from the source of a custom migration and executing `drush mim --sync`. The items were correctly removed, and the sync import completed much faster than without the patch.

    Has any more thought been given to the discussion in #mr44-note197623 about using MigrateTools as a service vs. moving to a dedicated service? I think it would make sense, but I'm new to the issue.

  • πŸ‡«πŸ‡·France pacproduct

    @emixaam I'm not sure what the best approach would be.

    If the issue is that the service name is weird, I guess we could:
    - Rename the class and its corresponding service. That could impact some code elsewhere in the project; I did not check.
    - As the services currently declared by this module don't always match the names of their corresponding classes, another approach would be to rename only the service, not the class itself (e.g. "migrate_tools.helper").

    If the issue is that the source code is not well enough organized, that would require a bit more rework, but maybe we could:
    - Move all these functions into a separate service, though I'm not sure what to call it: these are utility helper functions for tracking source IDs during the first phase of a migration, ensuring the module knows which entries to delete. But there's already a service whose name suggests it handles that sync task (among other things): "migrate_tools.migration_sync" (Drupal\migrate_tools\EventSubscriber\MigrationImportSync).
    - Therefore, another option would be to completely remove the class Drupal\migrate_tools\MigrateTools (which initially contained a small static function buildIdList) and put everything into the existing "migrate_tools.migration_sync" service.

    I feel like the decision about the best approach for the module is up to the maintainer at this point, but what do you all think?

  • πŸ‡§πŸ‡ͺBelgium BramDriesen Belgium πŸ‡§πŸ‡ͺ

    Uploading a static patch from the MR since the patch from #18 no longer applies to the latest release (6.0.4).

  • The State service uses a cache collector for performance since Drupal 10.3: https://www.drupal.org/node/3177901 β†’

    On existing sites this feature needs to be enabled explicitly through the settings flag $settings['state_cache'] = TRUE;, but it is the default behavior for new sites (in default.settings.php), and in Drupal 11+ the setting is removed and the state cache is permanently turned on.

    The change record presents two queries to check current state usage and then decide whether to opt in to or out of the state cache depending on the results (see the rough illustration below).

    Sites with more than 100 entries and/or a total value size/length of 100,000 should review their specific entries...

    The size/length can easily exceed 100,000 with thousands of migration items. Setting the issue priority to Major.
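
    For reference, the general idea of those checks (a rough equivalent, not the exact queries from the change record) is to look at how many state entries exist and how much data they hold in total; state values live in the key_value table under the 'state' collection:

    // Rough illustration: count state entries and their total serialized size.
    $usage = \Drupal::database()->query(
      "SELECT COUNT(*) AS entries, SUM(LENGTH(value)) AS total_size FROM {key_value} WHERE collection = 'state'"
    )->fetchAssoc();
    // With --sync tracking kept in state, a single 'migrate_tools_sync' entry
    // holding tens of thousands of serialized IDs can easily dominate total_size.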
