sync might be too strict during id comparison; rolls back everything

Created on 3 January 2020, over 4 years ago
Updated 3 December 2023, 7 months ago

I was noticing with a migrate_plus Url migration that the sync option was rolling back everything each import when the source had not changed. It's because the in_array id check (https://git.drupalcode.org/project/migrate_tools/blob/8.x-4.x/src/EventS...) is pretty fragile. In my case, $row->getSourceIdValues() was returning:

[ 'item_id' => INT ]

while $id_map->currentSource() was returning

[ 'item_id' => STRING ]

which causes a false negative for the strict in_array check.

I don't think this is the fault of migrate_tools, but I think it can be prevented. Incoming patch hashes IDs to JSON before comparison, which fixes my use case but might be inefficient.

๐Ÿ› Bug report
Status

Needs review

Version

6.0

Component

Code

Created by

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • ๐Ÿ‡ฉ๐Ÿ‡ชGermany berliner

    As mentioned by @byrond, this seems to have been fixed in 6.0.x by this commit in ๐Ÿ› sync option doesnt work with track_changes Fixed .

  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    Version 6.0.1 does seem to fix the issue as long as the ids' "type" property is set correctly.

    In my case migrate_tools was rolling back all my items because the source ID was an integer, and I had defined:

      ids:
        source_id:
          type: string
          max_length: 100
    

    Fixing my configuration to type "integer" fixed everything:

      ids:
        source_id:
          type: integer
    

    I overlooked it in the first place, so I'm wondering if this subtlety is documented somewhere?

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.0 & MySQL 5.7
    last update about 1 year ago
    Patch Failed to Apply
  • ๐Ÿ‡ฎ๐Ÿ‡นItaly robertom

    Hi, sorry for my bad english.

    This issue is still valid.

    I have many d7 to d9 migrations suffering from this because, for example, d7_taxonomy_term_entity_translation states that entity_id will be integer, but the data coming from the db returns it as a string.

    The approach used in patch #4/#17 might be incorrect because, if the id is an md5, the conversion might assume it's a numeric value when it doesn't contain letters and the strict in_array check will fail.

    Attached a "workaround" that enforce the type of $source_id_values as stated by $source->getIds()

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.0 & MySQL 5.7
    last update about 1 year ago
    Patch Failed to Apply
  • ๐Ÿ‡ฎ๐Ÿ‡นItaly robertom

    sorry, there is no point in checking repeatedly...

    new patch attached

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update about 1 year ago
    36 pass
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update about 1 year ago
    29 pass, 3 fail
  • ๐Ÿ‡ฎ๐Ÿ‡นItaly robertom

    Even with patch #21 I get some unwanted rollbacks when i run a migration group with the --group=... option.

    The problem is due to the way the data is stored in the state variable. We need to add the migration id as the key of the array.

    Attached a proposed patch and interdiff with #21

  • The last submitted patch, 22: sync-id-hash-3104268-22.patch, failed testing. View results โ†’
    - codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update about 1 year ago
    36 pass
  • First commit to issue fork.
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update about 1 year ago
    29 pass, 3 fail
  • @jamsilver opened merge request.
  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom jamsilver

    I had a situation with a unit tests that failed because it executed the same migration twice in the same page request. It wasn't enough for migrate_tools_sync state to be reset in the constructor.

    I've created a Merge Request that adds the following changes to the patch in 22 (interdiff attached):

    1. Reset migrate_tools_sync at the point of scanning through source rows to collect the IDs. This fixes my unit test
    2. Factor that bit of code to a new method for readability.

    I've also raised this as a Merge Request, rather than as a patch, since that is the workflow in place on this ticket originally. I raised a new one because things seem to have moved along quite a lot since the original was raised.

    Tests fail, but the first error I'm seeing looks unrelated to my change. Is there a problem with the tests at the moment?

  • ๐Ÿ‡ท๐Ÿ‡ดRomania bbu23

    Unfortunately the MR from #26 cannot be applied to the latest version of the module.

  • ๐Ÿ‡จ๐Ÿ‡ฆCanada jigarius Montrรฉal

    While testing this behavior, I realized that it makes many, many write queries on the key_value table. This in turn results in huge MySQL bin logs which makes the server run out of memory at some point. Is anyone else facing this issue? Is it safe to say that there is an efficiency/memory optimization issue with the --sync option?

  • Status changed to Needs work 11 months ago
  • heddn Nicaragua

    I'm afraid that the change in #26 to move the set into the point of scanning might have lead to #28. Can we revert that change please?

  • First commit to issue fork.
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 8 months ago
    36 pass
  • Status changed to Needs review 8 months ago
  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom scott_euser

    I updated the MR with a merge from 6.0.x and created a separate issue https://www.drupal.org/project/migrate_tools/issues/3396130 ๐Ÿ› Switch GitLab CI configuration to the template developed by the DA Needs review for the pipeline failure

  • ๐Ÿ‡ท๐Ÿ‡ดRomania bbu23

    Thank you, tested the patch from MR !40. It seems to have fixed the rolling back everything situation for a migration that uses track_changes. Though the migration is much slower now, quite a difference. Track changes works pretty fast when not used in combination with sync. I'm not sure if this can be improved or not.

  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    @bbu23 Could you be facing the issue referenced in https://www.drupal.org/project/migrate_tools/issues/3378047 ๐Ÿ› drush migration with '--sync' option has suboptimal performance Needs review ?

    More generally, I have the feeling that this thread should take into consideration the other thread I posted above, because the way migrate_tools_migrate_prepare_row currently works gets exponentially slower with time and should be fixed first, in my opinion.

  • ๐Ÿ‡ท๐Ÿ‡ดRomania bbu23

    @pacproduct Yes, thank you, that could be it! One of my migrations took hours with sync option on.

    So what is your suggestion? Should I try the patch from the referenced issue that has a dedicated new table to sync or do I need to combine both patches?
    Also, is your comment #19 from current issue still valid? I feel that it's a bit strange to have to manually specify a type for each migration. Most of the migrations have the ID coming in as string, that means modifications everywhere. Thanks

  • ๐Ÿ‡ซ๐Ÿ‡ทFrance pacproduct

    @bbu23 I believe my comment in #19 is still valid, but the best is to try it to be sure.

    I feel like the global tendency in programming is to favor strict typing more and more as it tends to reduce issues and bugs.
    Thus personally, I'm kind of okay with the idea of configuring what is the type of data you're expecting from the source.
    Although maybe `migrate_tools` could warn you when source IDs type do not match the configured type so you know something is dubious, instead of converting it silently which causes it to rollback everything later...

    Everyone's situation isn't the same, and there might be cases where having a strict source ID typing wouldn't fit one's need. From the top of my mind, the only case that could be problematic that I can think of is if the source system were to return a mix of different types as the primary ID (a mix of strings and ints, for instance). In that particular case, having a "loose type" for source IDs could be useful.
    Hard for me to say which approach is better...

    -- --

    With regards to your case: If setting properly the source ID types on your migration configurations solves your rollback problem without patching `migrate_tools`, it feels like a quick and healthy fix in my opinion. In which case, you may apply the patch from the referenced issue if it fixes your performance problem.
    Otherwise, you may have to merge the 2 patches together. Not sure how tricky this is going to be. YMMV.

Production build 0.69.0 2024