drush migration with '--sync' option has suboptimal performance

Created on 30 July 2023, 11 months ago
Updated 19 June 2024, 6 days ago

Problem/Motivation

When using the --sync option, migrations slow down dramatically as the number of items to process grows: the cost of handling each row increases with the number of rows already processed.

One of my migrations has 35K items, and the first phase during MigrateEvents::PRE_IMPORT takes a full 20 minutes before the migration itself even starts (and the migration then takes even longer).

After profiling and investigating, I think the way migrate_tools keeps track of all source rows in migrate_tools_migrate_prepare_row is suboptimal and could be greatly improved.

Currently, it relies on the State API to grow a list of IDs, which means loading the serialized data from the database, appending one entry, and re-serializing the whole thing back to the database, for each row. So for each row, the amount of data to serialize and unserialize gets bigger and bigger. Here is the main culprit:

function migrate_tools_migrate_prepare_row(Row $row, MigrateSourceInterface $source, MigrationInterface $migration) {
  if (!empty($migration->syncSource)) {
    // Keep track of all source rows here [...]
    $source_id_values = \Drupal::state()->get('migrate_tools_sync', []);
    $source_id_values[] = $row->getSourceIdValues();
    \Drupal::state()->set('migrate_tools_sync', $source_id_values);
  }
}

After tweaking how this tracking works, I got that phase down from 20 minutes to less than 45 seconds.

Also, as a bonus: I'm pretty sure the current State-based system doesn't allow several migrations to run safely in parallel, so I've added the migration_id to the dedicated table so that each migration has its own separate set of tracked IDs.

Steps to reproduce

Have a big migration to run (with tens of thousands of entries) and run it with the --sync option. Illustration:

drush migrate:import YOUR_MIGRATION --sync

Proposed resolution

Rework the way the module keeps track of rows' source IDs by using a dedicated database table, so that adding new entries doesn't get slower and slower as the migration progresses.
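
Roughly, the idea is something like this (a sketch only; the table and column names below are illustrative and may differ from the actual MR):

function migrate_tools_migrate_prepare_row(Row $row, MigrateSourceInterface $source, MigrationInterface $migration) {
  if (!empty($migration->syncSource)) {
    // Insert one small record per row instead of rewriting an ever-growing
    // serialized blob in state, so the cost per row stays constant.
    \Drupal::database()->insert('migrate_tools_sync_source_ids')
      ->fields([
        'migration_id' => $migration->id(),
        'source_ids' => serialize($row->getSourceIdValues()),
      ])
      ->execute();
  }
}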

User interface changes

None.

API changes

None.

Data model changes

Use a dedicated database table for tracking source IDs.
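
For illustration, such a table could be declared along these lines in migrate_tools.install (field names, types and lengths here are assumptions, not necessarily what the MR ends up using):

/**
 * Implements hook_schema().
 */
function migrate_tools_schema() {
  $schema['migrate_tools_sync_source_ids'] = [
    'description' => 'Tracks source ID values collected during --sync migrations.',
    'fields' => [
      'migration_id' => [
        'type' => 'varchar_ascii',
        'length' => 192,
        'not null' => TRUE,
        'description' => 'ID of the migration the tracked row belongs to.',
      ],
      'source_ids' => [
        'type' => 'blob',
        'size' => 'big',
        'not null' => TRUE,
        'description' => 'Serialized source ID values of one row.',
      ],
    ],
    'indexes' => [
      'migration_id' => ['migration_id'],
    ],
  ];
  return $schema;
}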

-- --

I'm planning on creating an MR shortly. I just need to figure out how to use the new MR system.

πŸ› Bug report
Status

Needs review

Version

6.0

Component

Code

Created by

πŸ‡«πŸ‡·France pacproduct



Comments & Activities

  • Issue created by @pacproduct
  • Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    Not currently mergeable.
  • @pacproduct opened merge request.
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    22 pass, 2 fail
  • @pacproduct opened merge request.
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    22 pass, 2 fail
  • πŸ‡«πŸ‡·France pacproduct

    I'm not sure what makes the tests fail. It seems like the new database table is not found? When I install the module locally, it does get created correctly, though.

    What am I missing? Any help for fixing the CI would be appreciated :)

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    36 pass
  • Status changed to Needs review 11 months ago
  • πŸ‡«πŸ‡·France pacproduct

    Yay! Tests pass :)
    Switching status to "needs review".

  • πŸ‡«πŸ‡·France pacproduct

    Just added a commit.

    I noticed that the migrate_tools_migrate_prepare_row hook runs during the PRE_IMPORT phase and also during the IMPORT phase, which was duplicating all sync ID entries even though we only need them during the PRE_IMPORT phase. So I added a little something so that IDs get recorded during the PRE_IMPORT phase only. They then stay available in the database during the actual migration phase, in case another module needs them, and they get cleaned up during the POST_IMPORT phase (see the illustrative snippet below).

    Also added some missing comments.
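
    Purely as an illustration of that last step (names are assumptions; the MR may do this differently), the POST_IMPORT handling could simply delete the entries tracked for the finished migration:

    // Hypothetical cleanup during MigrateEvents::POST_IMPORT (sketch only):
    // drop the IDs tracked for this migration once the sync pass is finished.
    $migration_id = $event->getMigration()->id();
    \Drupal::database()->delete('migrate_tools_sync_source_ids')
      ->condition('migration_id', $migration_id)
      ->execute();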

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    36 pass
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & sqlite-3.27
    last update 11 months ago
    36 pass
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    36 pass
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 11 months ago
    36 pass
  • πŸ‡«πŸ‡·France Simon Georges Rouen, France
  • πŸ‡«πŸ‡·France denix

    Dear all,
    We are following this patch closely because the operation is incredibly expensive in terms of performance when working with large datasets. We are running tests on our side, and so far we see only better performance and no drawbacks.

  • πŸ‡ΊπŸ‡ΈUnited States Jorge Navarro

    I tested it in a large migration and indeed the migration time became a fraction of what it was. Looking forward to seeing it merged!

  • Status changed to RTBC 10 months ago
  • πŸ‡«πŸ‡·France Simon Georges Rouen, France

    Considering several people have reported it working without drawbacks, let's try to move it forward ;-)

  • πŸ‡«πŸ‡·France MaxPah

    Just for information: with this patch (which is working fine), there is a warning on the reports/status page:

    READ-COMMITTED
    For this to work correctly, all tables must have a primary key. The following table(s) do not have a primary key: migrate_tools_sync_source_ids. See the setting MySQL transaction isolation level page for more information.

    I think it's just this line missing from the hook_schema() declaration:

    'primary key' => ['migration_id'],

  • πŸ‡«πŸ‡·France pacproduct

    @MaxPah I cannot reproduce your issue. Also, migration_id cannot be used as the primary key, as this column doesn't contain unique values.

    We could add an arbitrary serial 'id' column as a unique identifier and primary key. Although it won't be of much use, I guess it's good practice anyway (see the sketch below).

    I'll rework that shortly.
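
    For illustration (not necessarily the exact change that will land in the MR), the schema could gain a serial column used only as the primary key:

    // Sketch: inside migrate_tools_schema(), add an auto-increment column and
    // declare it as the primary key of the tracking table.
    $schema['migrate_tools_sync_source_ids']['fields']['id'] = [
      'type' => 'serial',
      'unsigned' => TRUE,
      'not null' => TRUE,
      'description' => 'Arbitrary auto-increment ID, used only as the primary key.',
    ];
    $schema['migrate_tools_sync_source_ids']['primary key'] = ['id'];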

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 9 months ago
    36 pass
  • Status changed to Needs review 9 months ago
  • πŸ‡«πŸ‡·France Simon Georges Rouen, France
  • πŸ‡«πŸ‡·France pacproduct

    Because branch `6.0.x` recently changed, I had to rebase my MR and resolve conflicts. However, the MR itself no longer applies to version `6.0.2` of this module.

    So here is an attached patch specifically for version `6.0.2`, so people can keep applying it easily if needed.

  • πŸ‡¨πŸ‡¦Canada emixaam

    Successfully tested merge request 44 with migrate_tools 6.0.4 and migrate_plus 6.0.2, on Drupal 10.2.3. On a local environment, I tested removing 10,000 items from the source of a custom migration and executing `drush mim --sync`. The items were correctly removed, and the sync import completed much faster than without the patch.

    Has any more thought been given to the discussion in #mr44-note197623 about using MigrateTools as a service vs. moving to a dedicated service? I think it would make sense, but I'm new to the issue.

  • πŸ‡«πŸ‡·France pacproduct

    @emixaam I'm not sure what the best approach would be.

    If the issue is that the service name is weird, I guess we could:
    - Rename the class and its corresponding service. That could impact some code elsewhere in the project; I did not check.
    - As the services currently declared by this module don't always match the names of their corresponding classes, another approach would be to rename only the service, not the class itself (e.g. "migrate_tools.helper").

    If the issue is that the source code is not well enough organized, that would require a bit more rework, but maybe we could:
    - Move all these functions into a separate service, though I'm not sure what to call it: these are utility helper functions for tracking source IDs during the first phase of a migration, ensuring the module knows which entries to delete. But there's already a service whose name suggests it handles that sync task (among other things): "migrate_tools.migration_sync" (Drupal\migrate_tools\EventSubscriber\MigrationImportSync).
    - Therefore, another option would be to completely remove the class Drupal\migrate_tools\MigrateTools (which initially contained a small static function buildIdList) and put everything into the existing "migrate_tools.migration_sync" service.

    I feel like the decision about the best approach for the module is up to the maintainer at this point, but what do you all think?

  • πŸ‡§πŸ‡ͺBelgium BramDriesen Belgium πŸ‡§πŸ‡ͺ

    Uploading a static patch from the MR since the patch from #18 no longer applies to the latest release (6.0.4).

  • The State service uses a cache collector for performance since Drupal 10.3: https://www.drupal.org/node/3177901 β†’

    On existing sites this feature needs to be enabled explicitly through the settings flag $settings['state_cache'] = TRUE;, but it is the default behavior for new sites (in default.settings.php), and in Drupal 11+ the setting is removed and the state cache is permanently turned on.

    The change record presents two queries to check current state usage and then decide whether to opt in to or out of the state cache depending on the results (see the rough illustration below).

    Sites with more than 100 entries and/or a total value size/length of 100,000 should review their specific entries...

    The size/length can easily exceed 100,000 with thousands of migration items. Setting the issue priority to Major.
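
    For reference, the general idea of those checks (a rough equivalent, not the exact queries from the change record) is to look at how many state entries exist and how much data they hold in total; state values live in the key_value table under the 'state' collection:

    // Rough illustration: count state entries and their total serialized size.
    $usage = \Drupal::database()->query(
      "SELECT COUNT(*) AS entries, SUM(LENGTH(value)) AS total_size FROM {key_value} WHERE collection = 'state'"
    )->fetchAssoc();
    // With --sync tracking kept in state, a single 'migrate_tools_sync' entry
    // holding tens of thousands of serialized IDs can easily dominate total_size.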
