Batch import performance and scalability

Created on 5 September 2024
Updated 9 September 2024

Problem/Motivation

After successfully exporting all 16k nodes (with references) from one instance of my site, I am now testing the import process. I am testing the import of 16,262 nodes, for a total of 73,757 referenced entities. When I trigger an import through the UI, it times out after 300 seconds.

From my test in the parent issue (comment #35, Use Batch API for export/import), I was able to import 118k entities and the batch import process started after ~10 seconds.

After reviewing my patch #41 from the parent ticket (Use Batch API for export/import) against what is now in the 2.1.x branch, I noticed a few issues:

  • In my patch #41 I moved all processing to within the batch operations for scalability. The 2.1.x branch calls the prepareForImport() method, which decodes all of the files before the batch operation is initialized, leading to timeouts.
  • 2.1.x stores all files to import in a variable within the class scope. Since the class is then added to the database as part of the queue item, this leads to a large increase in database size. My patch #41 stored this data outside of the class using keyValueStorage. More information about this issue is in comment #33 of the parent ticket.

📌 Task

Status: Needs review
Version: 2.1
Component: Code
Created by: 🇨🇦 smulvih2 (Canada)

Comments & Activities

  • Issue created by @smulvih2
  • 🇩🇪 mkalkbrenner (Germany)

    > In my patch #41 I moved all processing to within the batch operations for scalability. The 2.1.x branch calls the prepareForImport() method, which decodes all of the files before the batch operation is initialized, leading to timeouts.

    I started with your implementation, but the import of path aliases and of entity references based on entity IDs (not UUIDs) was broken.
    I didn't see any solution other than re-introducing that initial scan. (I reduced its load, but we can't skip it.)

    I also re-introduced the "second run" for entities to get that fixed.

    The issue happened when you export ID 17 and ID 23, where entity ID 23 references ID 17 without having the UUIDs in the export.
    Then you import that into a different site where the max ID is 100.

    Both entities to be imported will get new IDs, and the order of the import is not guaranteed.
    prepareForImport() builds a mapping, based on UUIDs, recording that the old ID 23 references the old ID 17.

    Now in the first batch run, ID 23 will be imported and become ID 112. At this stage it references the existing ID 17 of the target system, which is a completely different entity.
    Later in the batch the old ID 17 gets imported and becomes ID 144.

    Now in a second run, the new entity ID 112 (23 in the old system) will be updated and the reference will be changed from 17 to 144.
    ID 144 itself needs no second run, as it was detected to have no references.

    If you have an idea how to solve that without initially scanning all files, I would be happy to hear your ideas.

    Meanwhile I implemented normalizers to work around some issues. So one possible solution might be to add the missing UUIDs to the export files to avoid the algorithm we have had for some years now.
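
    To illustrate the two runs with a standalone sketch (plain arrays and made-up values, not the module's code):

        <?php

        // Exported entities 17 and 23, where 23 references 17; the target
        // site already has IDs up to 100.
        $export = [
          17 => ['uuid' => 'uuid-17', 'references' => []],
          23 => ['uuid' => 'uuid-23', 'references' => [17]],
        ];

        // Initial scan (what prepareForImport() provides): old ID => UUID.
        $uuidByOldId = array_map(fn (array $e): string => $e['uuid'], $export);

        // First run: import in arbitrary order and record the new ID per UUID.
        $nextId = 101;
        $newIdByUuid = [];
        foreach ($export as $entity) {
          $newIdByUuid[$entity['uuid']] = $nextId++;
        }

        // Second run: only now can references be rewritten, because every
        // referenced entity exists and has its final ID on the target site.
        foreach ($export as $oldId => $entity) {
          foreach ($entity['references'] as $oldTargetId) {
            printf("Old entity %d (now %d): reference %d becomes %d\n",
              $oldId, $newIdByUuid[$entity['uuid']],
              $oldTargetId, $newIdByUuid[$uuidByOldId[$oldTargetId]]);
          }
        }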

    > 2.1.x stores all files to import in a variable within the class scope. Since the class is then added to the database as part of the queue item, this leads to a large increase in database size. My patch #41 stored this data outside of the class using keyValueStorage. More information about this issue is in comment #33 of the parent ticket.

    I know, and 2.1.x is still beta. As you might have noticed, I ran into an issue with the batch queue garbage collection, so I had to add our own queue. That queue is now the right place to manage the key-value store and to include it in the garbage collection.
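
    For illustration only, the decoded data could live in a key-value collection instead of object properties, so queue items stay small (the collection name and keys below are assumptions, not the module's actual storage layout):

        <?php

        $file_uuid = '5aa5e62b-9f95-419c-8119-38c7b8e08a41';
        $decoded_entity_data = ['entity_type' => 'node', 'values' => []];

        // Park the decoded content outside the class so it is not serialized
        // into every queue/batch item written to the database.
        $store = \Drupal::keyValue('default_content_deploy_import');
        $store->set($file_uuid, $decoded_entity_data);

        // Later, inside a batch/queue callback, fetch it by key ...
        $decoded = $store->get($file_uuid);

        // ... and drop the whole collection when the import (or the queue's
        // garbage collection) is done.
        $store->deleteAll();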

  • 🇩🇪 mkalkbrenner (Germany)

    One more important note!

    I added a feature to preserve the entity IDs. Of course, it can only be used if you can ensure that there are no ID collisions with the target system, for example if you prepare content on system A and deploy it to systems B and C where nobody creates such content.

    In this case, the initial decoding of the files is skipped, and the algorithm described above is skipped as well.
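
    A rough illustration of that precondition (a hypothetical helper, not part of the module): before preserving IDs, check that none of the exported IDs already exist on the target.

        <?php

        // Sketch: refuse ID preservation if any exported node ID already
        // exists on the target site.
        function exportedIdsCollide(array $exported_nids): bool {
          $existing = \Drupal::entityTypeManager()
            ->getStorage('node')
            ->loadMultiple($exported_nids);
          return !empty($existing);
        }

        // e.g. exportedIdsCollide([17, 23]) returning FALSE means the
        // original IDs can safely be kept.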

  • 🇩🇪 mkalkbrenner (Germany)

    Take a look at updateInternalLinks(). That functionality was broken with your approach.

  • 🇩🇪 mkalkbrenner (Germany)

    Not all export and import options are available via Drush and the UI yet. That still needs to be done.

  • 🇩🇪 mkalkbrenner (Germany)

    This is an example of an exported field of type link whose target node IDs need to be corrected:

     "field_target_link": [
            {
                "uri": "internal:\/node\/2633",
                "title": "",
                "options": [],
                "lang": "en"
            },
            {
                "uri": "internal:\/node\/2633",
                "title": "",
                "options": [],
                "lang": "de"
            },
            {
                "uri": "internal:\/node\/2633",
                "title": "",
                "options": [],
                "lang": "fr"
            }
        ],
    

    And this is another example using the entity syntax:

        "field_link": [
            {
                "uri": "entity:node\/2612",
                "title": "Back to the demo devices overview",
                "options": {
                    "attributes": {
                        "class": [
                            "btn",
                            "btn-secondary"
                        ]
                    }
                },
                "lang": "en"
            },
            {
                "uri": "entity:node\/2612",
                "title": "Zur\u00fcck zur Demoger\u00e4te\u00fcbersicht",
                "options": {
                    "attributes": {
                        "class": [
                            "btn",
                            "btn-secondary"
                        ]
                    }
                },
                "lang": "de"
            },
            {
                "uri": "entity:node\/2612",
                "title": "Retour \u00e0 l'aper\u00e7u des appareils de d\u00e9monstration",
                "options": {
                    "attributes": {
                        "class": [
                            "btn",
                            "btn-secondary"
                        ]
                    }
                },
                "lang": "fr"
            }
        ],
    
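
    To illustrate the kind of correction this needs (a sketch with an assumed old-ID-to-new-ID map, not the module's updateInternalLinks() implementation), both URI styles can be rewritten like this:

        <?php

        // Rewrite "internal:/node/OLD" and "entity:node/OLD" link URIs to the
        // node IDs assigned on the target site.
        function rewriteNodeLinkUri(string $uri, array $new_id_by_old_id): string {
          return preg_replace_callback(
            '#^(internal:/node/|entity:node/)(\d+)$#',
            static function (array $m) use ($new_id_by_old_id): string {
              $old = (int) $m[2];
              return $m[1] . ($new_id_by_old_id[$old] ?? $old);
            },
            $uri
          );
        }

        // Example: old node 2633 became node 112 on the target.
        echo rewriteNodeLinkUri('internal:/node/2633', [2633 => 112]);
        // -> internal:/node/112
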
  • Assigned to mkalkbrenner
  • 🇩🇪 mkalkbrenner (Germany)

    I'm working on it.

  • Merge request !16 (Import performance) created by mkalkbrenner (Merged)
  • Status changed to Needs review
  • 🇩🇪 mkalkbrenner (Germany)

    I created a first draft.
    The ID correction now happens when decoding the file, based on metadata like this added to the JSON export:

       "_dcd_metadata": {
            "uuids": {
                "node": {
                    "2657": "5aa5e62b-9f95-419c-8119-38c7b8e08a41"
                }
            },
            "export_timestamp": 1725620599
        }
     

    The mapping from the old entity ID to the UUID is now included in the exported file.

    Nevertheless, the second correction run is still required, because the targeted entity might not exist yet during the first run if it only gets imported later in that run.

    I think that this implementation will give us a boost, but it means a BC break: old exports can't be imported correctly.
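
    Conceptually (a sketch, not the actual decoder), that metadata lets the importer resolve an old node ID to a UUID and then to whatever ID the node has on the target site:

        <?php

        // Sketch: resolve an exported node ID via the _dcd_metadata UUID map.
        function resolveTargetNid(int $old_nid, array $dcd_metadata): ?int {
          $uuid = $dcd_metadata['uuids']['node'][(string) $old_nid] ?? NULL;
          if ($uuid === NULL) {
            return NULL;
          }
          $node = \Drupal::service('entity.repository')
            ->loadEntityByUuid('node', $uuid);
          return $node ? (int) $node->id() : NULL;
        }

        // e.g. resolveTargetNid(2657, $data['_dcd_metadata']) returns the nid
        // the node received on the target site, or NULL if it has not been
        // imported yet, which is why the second correction run still exists.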

  • 🇨🇦 smulvih2 (Canada)

    @mkalkbrenner yes, I noticed the garbage collection you added, this is great! It's important to keep the queues running efficiently to prevent them from growing out of control. And thanks for providing the example field_target_link output, I see what you mean now. Although my site is complex, it mostly uses entity reference fields and references/links within the body field (processed text), so it can rely on UUIDs. The link field, however, requires entity IDs to be corrected to work after import.

    I think that to make this scalable we will need to run decodeFile()/addToImport() as their own batch operations that run before the actual importFile() operations. If we can make it so that only the scan() method runs before the operations start, this will scale much better; the scan() method alone can take around 10 seconds for ~100k entities. Also, if we can remove variables that store large amounts of data, like $files, from the class, this will minimize the amount of data written to the database. I calculated >2TB of data based on ~100k entities. Writing this data to the database takes longer, also contributing to timeouts, as it happens before the batch operations start. The impact grows at an accelerating rate, since each file added to the import creates an additional batch operation (database entry) and slightly increases the size of all batch operations.
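
    As a rough sketch of that structure (the class and method names are assumptions, not the module's API), the decode work would become its own batch operations ahead of the import operations, with only scan() running up front:

        <?php

        use Drupal\Core\Batch\BatchBuilder;

        $batch = new BatchBuilder();
        $batch->setTitle(t('Importing default content'));

        // Decode/prepare each file as its own operation ...
        foreach ($scanned_files as $file) {
          $batch->addOperation([ImporterBatch::class, 'decodeFile'], [$file]);
        }
        // ... then import each file in later operations.
        foreach ($scanned_files as $file) {
          $batch->addOperation([ImporterBatch::class, 'importFile'], [$file]);
        }

        batch_set($batch->toArray());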

  • 🇨🇦 smulvih2 (Canada)

    Ahh, actually I see you are using static::class now instead of $this for batch operations. This means the batch operations will be called without an instance of the class, so maybe this removes $files from the queue item in the database. I will verify this.

  • 🇨🇦 smulvih2 (Canada)

    OK, so using static::class does not include the class variables, so the new default_content_deploy_queue queue items are nice and small, and my first point has actually been addressed.
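
    In other words (illustrative only):

        <?php

        // An object callback serializes the whole object, including large
        // properties such as $this->files, into the stored batch/queue item:
        $batch->addOperation([$this, 'importFile'], [$file]);

        // A static callback stores only the class name and method name, so
        // the queue item stays small:
        $batch->addOperation([static::class, 'importFile'], [$file]);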

  • 🇨🇦 smulvih2 (Canada)

    I am now able to get the import batches started through the UI! I see it now writes twice as many queue items to account for the additional processing of entity IDs. The issue I am seeing now is that it doesn't move past the first import item; it just hangs there and never progresses. I'm not seeing any errors in the logs.

  • 🇩🇪 mkalkbrenner (Germany)

    > The issue I am seeing now is that it doesn't move past the first import item; it just hangs there and never progresses. I'm not seeing any errors in the logs.

    I faced that, too. That commit fixed it:
    https://git.drupalcode.org/project/default_content_deploy/-/merge_reques...

  • Automatically closed - issue fixed for 2 weeks with no activity.
