ContentEntity migration source doesn't consider the migration map

Created on 17 December 2020, almost 4 years ago
Updated 8 June 2023, over 1 year ago

Problem/Motivation

The migration system keeps track of what has been migrated already from a source by writing a record for each migrated source row to a map table.

The SqlBase source plugin base class makes use of the map table by doing an SQL JOIN to the map table, so that the source query is filtered to only those source records that haven't already been imported.

This means that if you do an incremental migration, the migration process doesn't have to go through lots of source records that have already been imported, because they are simply eliminated from the query result.
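The effect of the map join can be sketched outside Drupal. The following is a minimal illustration in Python with SQLite, not the actual SqlBase code: the table names `source` and `migrate_map`, the column `sourceid1`, and the function `pending_ids()` are assumptions made for the example, and the sketch only handles rows that have no map record at all.

```python
import sqlite3

# Hypothetical schema: a source table and a map table keyed by source ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE migrate_map (sourceid1 INTEGER PRIMARY KEY);
    INSERT INTO source VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    -- Rows 1 and 2 have already been imported.
    INSERT INTO migrate_map VALUES (1), (2);
""")


def pending_ids(conn):
    """Return source IDs that have no record in the map table."""
    rows = conn.execute("""
        SELECT s.id
        FROM source s
        LEFT JOIN migrate_map m ON m.sourceid1 = s.id
        WHERE m.sourceid1 IS NULL
        ORDER BY s.id
    """)
    return [row[0] for row in rows]


print(pending_ids(conn))  # [3, 4]
```

Already-imported rows never leave the database, so the migration only iterates over rows 3 and 4.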

However, the ContentEntity migration source, which provides entities from the current Drupal site as the source rows, doesn't consider the map.

This means that if you do an incremental migration, or do your migration in batches, either for performance or during development, ALL the entities that have already been migrated are iterated over, loaded, and checked against the map.

This makes incremental migrations very slow, as they have to go over all the already migrated entities before they get to entities that need to be migrated.
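To make the cost difference concrete, here is a rough Python sketch that counts how many entity loads each approach needs. The functions `iterate_all()` and `iterate_filtered()` are hypothetical stand-ins for the two behaviours, and the numbers are invented for illustration.

```python
def iterate_all(entity_ids, migrated):
    """Load every entity, then check it against the map (current behaviour)."""
    loads = 0
    pending = []
    for entity_id in entity_ids:
        loads += 1  # Stands in for a full entity load.
        if entity_id not in migrated:
            pending.append(entity_id)
    return pending, loads


def iterate_filtered(entity_ids, migrated):
    """Filter against the map first, then load only what is left."""
    # The comprehension stands in for a query joined to the map table.
    pending = [i for i in entity_ids if i not in migrated]
    return pending, len(pending)


entity_ids = range(1, 10001)        # 10,000 source entities.
migrated = set(range(1, 9901))      # 9,900 of them already migrated.

print(iterate_all(entity_ids, migrated)[1])       # 10000
print(iterate_filtered(entity_ids, migrated)[1])  # 100
```

Both approaches find the same 100 pending entities; only the number of loads differs.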

Steps to reproduce

Run an incremental migration that uses the ContentEntity source with lots of source records (10k or so), most of which have already been migrated.

Proposed resolution

Move the addMapJoin() method to a new MapJoinTrait.

Use the new trait in both SqlBase and ContentEntity.

In ContentEntity, get the SQL query from the source entity query, and JOIN to the map table.

Remaining tasks

User interface changes

None.

API changes

None.

Data model changes

None.

Release notes snippet

TBD

🐛 Bug report
Status

Needs work

Version

11.0 🔥

Component
Migration

Last updated 3 days ago

Created by

🇬🇧 United Kingdom joachim

Comments & Activities

  • @joachim opened merge request.
  • 🇬🇧 United Kingdom joachim

    Rebased a new branch on 10.1.

  • Status changed to Needs work over 1 year ago
  • The Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

    Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.

    Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

  • First commit to issue fork.
  • 🇬🇧 United Kingdom joachim

    It looks like "✨ Allow EntityQuery to be converted to the underlying SQL query" (Needs work) is not going to get in.

    We can maybe do this by decorating the entity query factory instead, the way workspaces module does it?

  • 🇬🇧 United Kingdom joachim

    Unfortunately, doing it the way Workspaces module does it isn't going to be possible, because of the way Workspaces module does it: "🐛 Workspace QueryFactory alters queries in a way that's not compatible with any other module doing the same" (Active).

  • 🇬🇧 United Kingdom joachim

    New plan is to use a query tag and hook_query_alter().

  • Status changed to Needs review over 1 year ago
  • last update over 1 year ago
    Custom Commands Failed
  • 🇨🇭 Switzerland berdir

    Restarting this with an isolated patch that uses an alter hook which calls back to the source plugin. I've also dropped the trait: there are just two instances, and both the return value and the optional argument are only needed for SqlBase, which makes this more complicated. Refactoring SqlBase at the same time also seems to cause some test failures. And as shown further down, entity queries bring other things to consider.

    Test fails show that this has never been used/tested on an entity with translations, as it caused the query to explode due to the missing table alias. That seemed easy enough to fix at first and I was able to get ContentEntitySource green without major issues.

    But then I tried to write a test that actually verifies that the join works, and the combination of how source plugins work, translations, and entity queries caused my brain to melt.

    This turned out to be _hard_. Entity queries essentially ignore the fact that such a thing as translations exists: they match on any data table row they can and just aggregate multiple matches back to the ids in the result array. At first I thought I couldn't get it to work with included translations, but that turned out to be the easier version. The tricky part was when they were not included: then I needed to ensure that only the default translation is queried, or the query still found translations not matching the map table. The test now verifies 40 (!) permutations of included translations, revisions (which don't really do much yet, but will be a fun one to support once revisions as a source actually work; the current state makes no sense to me at all), map join settings, and every possible migrate map combination with two languages.

    The test verifies both the query result (which essentially returns only the ids of entities that still have at least one translation to migrate) and the returned row count (which is then filtered down to the specific translations and always considers the map status).
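The translation problem described above can be reproduced in miniature. This is only a sketch in Python with SQLite under assumed names: the tables `entity_data` and `migrate_map`, the columns `sourceid1`, `sourceid2` and `default_langcode`, and the helpers `build_db()` and `pending_entity_ids()` are all hypothetical and do not mirror the actual patch or the entity query internals.

```python
import sqlite3


def build_db(map_rows):
    """Create a hypothetical entity data table and map table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity_data (id INTEGER, langcode TEXT, default_langcode INTEGER);
        CREATE TABLE migrate_map (sourceid1 INTEGER, sourceid2 TEXT);
        -- Entities 1 and 2 have an English default and a German translation;
        -- entity 3 exists in English only.
        INSERT INTO entity_data VALUES
            (1, 'en', 1), (1, 'de', 0),
            (2, 'en', 1), (2, 'de', 0),
            (3, 'en', 1);
    """)
    conn.executemany("INSERT INTO migrate_map VALUES (?, ?)", map_rows)
    return conn


def pending_entity_ids(conn, default_only):
    """Return ids of entities with at least one data row missing from the map."""
    sql = """
        SELECT DISTINCT d.id
        FROM entity_data d
        LEFT JOIN migrate_map m
            ON m.sourceid1 = d.id AND m.sourceid2 = d.langcode
        WHERE m.sourceid1 IS NULL
    """
    if default_only:
        # Without this, unmapped translation rows make migrated entities reappear.
        sql += " AND d.default_langcode = 1"
    sql += " ORDER BY d.id"
    return [row[0] for row in conn.execute(sql)]


# Translations not included: only default translations are in the map.
without_translations = build_db([(1, "en"), (2, "en")])
print(pending_entity_ids(without_translations, default_only=False))  # [1, 2, 3]
print(pending_entity_ids(without_translations, default_only=True))   # [3]

# Translations included: entity 2 still has its German translation to migrate.
with_translations = build_db([(1, "en"), (1, "de"), (2, "en")])
print(pending_entity_ids(with_translations, default_only=False))     # [2, 3]
```

In the first scenario the unrestricted join wrongly reports entities 1 and 2 as pending, because their German rows have no map record. Restricting to the default translation leaves only entity 3.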

  • last update over 1 year ago
    Custom Commands Failed
  • 🇨🇭 Switzerland berdir

    I like you too phpcs.

  • last update over 1 year ago
    29,441 pass
  • 🇨🇭 Switzerland berdir

    Hm, missed cspell, kinda annoying, using the same comment as in SqlBase for that.

  • 🇬🇧 United Kingdom joachim

    > Entity queries essentially ignore the fact that such a thing as translations exists, match on any data table row they can and will just aggregate multiple matches back to the ids in the result array

    Is that related to my comment here: #2942948: Add allTranslations() and allTranslations()->count to entity query?

  • last update over 1 year ago
    Custom Commands Failed
  • 🇨🇭 Switzerland berdir

    It's related to that, yes. But my change here seems to work fine according to the tests.

    We have been running our migration for a few days now (yeah, it's slow) and it's been working well from what we've seen so far. We have noticed one bug though: the join is also applied to the count query, so the total is essentially just the number of remaining articles. I think that doesn't really affect the actual migration, but migrate status is confused by it.

    The fix seems easy enough, just need to skip the method when there are no fields.
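The guard described here might look roughly like the following Python sketch. `SketchQuery` and `add_map_join()` are hypothetical stand-ins rather than Drupal classes or functions, and the sketch assumes, as the comment above suggests, that a count query can be recognised by having no fields.

```python
from dataclasses import dataclass, field


@dataclass
class SketchQuery:
    """Hypothetical stand-in for a select query; not a Drupal class."""
    fields: list = field(default_factory=list)
    joins: list = field(default_factory=list)


def add_map_join(query):
    """Add the map join unless the query selects no fields.

    A query without fields is assumed here to be the count query, which
    should keep reporting the full total.
    """
    if not query.fields:
        return query
    query.joins.append("migrate_map")
    return query


row_query = add_map_join(SketchQuery(fields=["nid"]))
count_query = add_map_join(SketchQuery())

print(row_query.joins)    # ['migrate_map']
print(count_query.joins)  # []
```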

  • Status changed to Needs work over 1 year ago
  • The Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

    This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

    Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.
