Use Batch API for export/import

Comment about 2 years ago →
🇨🇦Canada smulvih2 Canada 🍁
I need batch for export/import as well. I have a site with 2k nodes, and I am not able to export through the UI or using Drush, I either get timeouts or memory issue. I still have 2 large migrations for this site and will have >10k nodes by the time it's ready for hand-off.

I wrote a simple drush command that batches export, given a content_type and bundle. I am able to use this to export all 2k nodes - https://gist.github.com/smulvih2/c3a406ecca47bf344fd2b36804c7d927

Batching import is not as straightforward. My workaround for now is to use drush dcdi and add the following to settings.php:

if (PHP_SAPI === 'cli') { ini_set('memory_limit', '-1'); }
With this I am able to import all 2k nodes. It would be nice to have progress updates during the import process so you can gauge time to completion.
Comment almost 2 years ago →
🇨🇦Canada smulvih2 Canada 🍁
I had some time on Friday to look into this, was able to get import working with batch through drush and UI. Still needs some more testing and to get export working with batch as well. Posting my patch here to capture my progress, will continue working on this when I have spare time.

🇨🇦Canada smulvih2 Canada 🍁

Just tested #6 on a migration I am running to see how the new batch process would handle a real-life scenario. It seems to work as expected.

Here is the contents of my export:

file: 1
media: 1
node: 90
taxonomy_term: 88
Total JSON files: 180

I am calling the new importBatch() method programmatically on a custom form submit handler, like this:

$import_dir = \Drupal::service('extension.list.module')->getPath('cra_default_content') . '/content';
$importer = \Drupal::service('default_content_deploy.importer');

$importer->setForceOverride(TRUE);
$importer->setFolder($import_dir);
$importer->prepareForImport();
$importer->importBatch();

This gives me the batch progress bar and correctly shows 180 items being processed. After the import, I get all 90 nodes with translations. The term reference fields all work as expected. Even the links to other imported nodes within the body field work as expected.

Here are the patches I have in my project:

"drupal/default_content_deploy": {
    "3302464 - remove entities not supporting UUIDs": "https://www.drupal.org/files/issues/2022-08-08/dcd-remove-entities-without-uuid-support-3302464-2.patch",
    "3349952 - Export processed text references": "https://www.drupal.org/files/issues/2023-03-23/dcd-export-processed-text-references-3349952-2.patch",
    "3357503 - Allow users to configure which referenced entities to export": "https://www.drupal.org/files/issues/2023-05-01/default_content_deploy-referenced_entities_config-3357503-2.patch",
    "3102222 - batch import": "https://www.drupal.org/files/issues/2023-09-18/dcd-import-batch-api-3102222-6.patch"
},

Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
After testing the batch import for DCD a bit more, I think we need to add a batch process for the prepareForImport() method as well. With a lot of JSON files in the export directory, decoding all of these files can cause timeouts.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
With patch #6 I was getting warnings in the dblog. New patch fixes this and now batch import works without any dblog errors/warnings.
Status changed to Needs review over 1 year ago8:30am 22 November 2023
Comment over 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
Using batches would be good improvement. But we need to review the patch carefully.
The current process runs three times across the content to deal with things like path aliases. So the context that needs to be shared between the batches is important.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Agreed, although so far it seems to be working well for things like entity references, links to other nodes, etc. Will make sure to test this with path aliases like you suggest. Also need to implement batch for export since currently I am using a custom drush command to get past the export limitation.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Got batch export working for all three modes (default, reference, all). Adding patch here to capture changes, but will push changes to a PR to make it easier to review.

Remaining tasks:

Performance - exporting with references can load the same entity multiple times. Each batch item exports a single entity with any of its' referenced entities. If the same term is referenced on multiple nodes, the JSON file for the term is updated multiple times. We can add an array to store processed entities to avoid duplication.

There is some duplication of code between exportBatchDefault and exportBatchWithReferences, could extract these into new method(s).

Add progress indicator to drush command, so batch output shows in CLI

Test export/import with complex data
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
As I suspected in #8, the prepareForImport() method will cause timeouts on large data sets due to the amount of processing that occurs per JSON file. The new patch moves prepareForImport() into it's own batch process, that then passes the data to the existing batch process that imports the entities. Tested this on a large data set and works well.
Comment over 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
Thanks for the patch, I'll review it ...
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Removed a line used for testing.
Merge request !7Move export/import into batch processes → (Open) created by smulvih2
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Created a PR to make it easier to review the changes, inline with #15.
Status changed to Needs work over 1 year ago12:32am 16 January 2024
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
One issue with my last patch in passing data from first batch process prepareForImport() to importBatch(). Will need to figure out a different solution for this before this is ready for review.
Status changed to Needs review over 1 year ago6:51pm 19 January 2024
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Ok, got the import() process working with batch, it's solid now. Updated patch attached, will update PR shortly after and add comments.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Status changed to Needs work over 1 year ago1:52pm 22 January 2024
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
This patch needs a re-roll after recent changes pushed to 2.0.x-dev branch. Will work on this over the next few days.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Re-rolled patch to apply against latest 2.0.x-dev branch.
Status changed to Needs review over 1 year ago11:09am 29 January 2024
Comment over 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
Status changed to Needs work over 1 year ago10:38am 30 January 2024
Comment over 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
The patch doesn't apply anymore
Status changed to Needs review over 1 year ago8:34pm 12 February 2024
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Re-rolled patch against latest 2.0.x-dev branch. Tested all 3 export modes through the UI as well as Drush. Also tested import through both UI and Drush. Seems to work well. Made a slight adjustment to the --text_dependencies Drush flag so it takes the UI config value if not specified in the Drush command.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Updated MR to align with changes in latest patch (#25) - https://git.drupalcode.org/project/default_content_deploy/-/merge_requests/7/diffs
Status changed to Needs work over 1 year ago2:26pm 13 February 2024
Comment over 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
After updating my containers to PHP8.3 (from PHP8.1) I am getting this error when importing content using this patch:

Deprecated function: Creation of dynamic property Drupal\default_content_deploy\Importer::$context is deprecated in Drupal\default_content_deploy\Importer->import() (line 255 of modules/contrib/default_content_deploy/src/Importer.php).

Adding this to the top of the Importer class fixes the deprecation error:

/** * The batch context. * * @var array */ protected $context;
New patch tested on PHP8.3 and fixes the issue.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
I'm running a large export of > 16k nodes, with references, for a total of > 118k entities. Reviewing patch #28 I found a redundant call to load the entity for a second time in the exportBatchDefault() method. New patch attached removes this second call and hopefully will speed things up a bit.
Comment over 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
+++ b/src/Importer.php @@ -291,93 +369,98 @@ class Importer { - // All entities with entity references will be imported two times to ensure - // that all entity references are present and valid. Path aliases will be - // imported last to have a chance to rewrite them to the new ids of newly - // created entities. - for ($i = 0; $i <= 2; $i++) {
I just can repeat what I commented on the MR.
The current patch removed an essential feature:
All entities with entity references will be imported two times to ensure that all entity references are present and valid. Path aliases will be imported last to have a chance to rewrite them to the new ids of newly created entities.

This strategy is essential to correct references via IDs (not everything works with UUIDs yet).
Especially path aliases are special and break using the proposed patch.
The patch only works for the content in total.
But exporting from A and importing into B breaks as he ID collisions in references aren't corrected anymore.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
@mkalkbrenner, yes I can reproduce the issue with path_alias entities. I was not running into this before as path_aliases are not exported with reference since they have a relationship from

path_alias -> node

instead of node -> path_alias. Now that I specifically export path_alias entities as well, I can reproduce. I will make sure to account for this in the batch import in subsequent patches, but for now I am looking at how to make the export/import process scalable.

I was able to successfully export all 118,000 entities with the batch export, but I am having issues with the import at this scale. The issue is how I pass the $file to each batch operation, which is the contents of the JSON file in question. The issue with passing the file to the batch operation is that it then writes the file contents to the queue table in the database. Of the 118k entities, 50k are serialized files like images, pdf files, etc... Writing the actual file contents to the database exploded the database size and would timeout before even starting the batch operation, or run out of disk space.

I will need to rewrite the importer class for this to work, and instead of passing the file contents to the batch operations, I will just pass a pointer to the file in the filesystem. Then each batch operation can load the file from the pointer. The I will combine the decodeFile() and importEntity() batch operations into one method, reducing the amount of batches by 50% (2 per JSON file). If I ensure path_alias entities are imported last, then I can probably get them working by just swapping the entity_ids.

See my comments from slack below for records (2024-04-16):

Need to make an update to my DCD patch, instead of passing the file contents to the batch operation, and subsequently storing the JSON files in the database, I am going to pass a pointer to the file in the filesystem. Then load the file directly in the batch operation. This is going to be needed in a production scenario to prevent the database from swelling when imports are ran.

And look at combining the decodeFile() and importEntity() batch operations into one operation that does both, which will reduce the number of batches by 50%

This was the full rewrite of the importer class I was hoping to avoid, but now it looks like it's needed
Comment over 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
A quick explanation about the old algorithm to fix references via ID:

First run:
Import new and updated entities, but skip path aliases. Store old ID and new ID of newly created entities in a mapping array. Store a list of newly created entities that reference other entities via IDs in another array (NEW).

Second run:
Run updates on all entities stored in array (NEW) and correct the reference IDs according to the mapping array. Still skip the path aliases.

Third run:
Import path aliases. In case of new aliases, adjust the referenced entity IDs according to the mapping array.

I think that this algorithm could be kept. The first batch has to create a second batch of newly created entities that reference other entities via IDs and skip path aliases. For path aliases it has to create the third batch.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
@mkalkbrenner thanks for the info! Will refer to this when fixing for path_aliases.

So I found the major source of my database exploding on import. DCD does a filesystem scan and stores a pointer to each file in the Importer object. This object is then added to each batch item in the database. This means each batch item in the database would have all 118k pointers. I was able to use KeyValueStorage to store this data outside of the object, and now the queue entries look reasonable for each batch item:

a:2:{i:0;a:2:{i:0;O:38:"Drupal\\default_content_deploy\\Importer":10:{s:46:"\000Drupal\\default_content_deploy\\Importer\000folder";s:14:"../content/dcd";s:52:"\000Drupal\\default_content_deploy\\Importer\000dataToImport";a:0:{}s:16:"\000*\000forceOverride";b:0;s:23:"\000*\000discoveredReferences";a:0:{}s:20:"\000*\000oldEntityIdLookup";a:0:{}s:17:"\000*\000entityIdLookup";a:0:{}s:11:"\000*\000newUuids";a:0:{}s:10:"\000*\000context";a:1:{s:7:"sandbox";a:2:{s:8:"progress";i:0;s:5:"total";i:22;}}s:14:"\000*\000_serviceIds";a:11:{s:13:"deployManager";s:30:"default_content_deploy.manager";s:16:"tempStoreFactory";s:17:"tempstore.private";s:16:"entityRepository";s:17:"entity.repository";s:5:"cache";s:13:"cache.default";s:10:"serializer";s:10:"serializer";s:17:"entityTypeManager";s:19:"entity_type.manager";s:11:"linkManager";s:16:"hal.link_manager";s:15:"accountSwitcher";s:16:"account_switcher";s:8:"exporter";s:31:"default_content_deploy.exporter";s:8:"database";s:8:"database";s:15:"eventDispatcher";s:16:"event_dispatcher";}s:18:"\000*\000_entityStorages";a:0:{}}i:1;s:11:"processFile";}i:1;a:4:{i:0;s:41:"cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json";i:1;s:61:"../content/dcd/node/cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json";i:2;i:22;i:3;i:22;}}

With this new path, I combined both batch operation callbacks into one callback, so we have 50% less batch operations with this method.

Rough numbers, if I have 118,000 entities to import, then all 118,000 entities would be referenced in each of the 118,000 batch operations. Since we had 2 operations per batch that would be x2. So 118k x 118k x2 = 27.87 billion entries. This is one such entry:

0 => [ 'name' => 'cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json', 'uri' => '../content/dcd/node/cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json' ],
If we say this string is 92 bytes, then this would be approx. 2.3 TB of storage required for this.

Now we only have 118,000 batch operations (not times two), and the total data that is stored per operation in the DB is about 1234 bytes. So 1234 bytes times 118,000 is approx. 145MB.

So the import of 118k entities would increase the DB size by about 145MB instead of 2+ TB. This should also significantly reduce the time it takes to write the batch operations to the database when batch_set() is called (before the progress bar is shown).

Uploading patch here and will test the import against the 118k entities and report back.
Comment over 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
If we (optionally) use igbinary to serialize this array, I expect that we'll save 80% of this memory.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Update: With the latest patch #33 I was able to trigger an import through the UI of 118k entities and within 10 seconds it started to process the entities where before it would take too long and timeout. I am now importing the entities using nohup and drush. Definitely need to think of a few ways to optimize the export/import process, will take a look at igbinary for this!
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Coming back to the export, I originally exported 118k entities using the "All content" option which worked perfectly. Also tested "Content of the entity type" option and it worked well. i am now testing "Content with references" and running into an problem. When exporting with references, the queue database table would get populated as expected, then during the first batch operation it would spin until timed out. The issue ends up being getEntityReferencesRecursive(). It looks at references in the body field, and spiders out to include hundreds of nodes.

This is where the patch in #3357503 ✨ Allow users to configure which referenced entities to export Needs review comes in handy. I can exclude nodes from reference calculations. For example, I want to export all nodes of type page, with references. If I include nodes as part of the reference calculation, the first batch operation could be massive and timeout. With nodes excluded, I would get all pages still, but they would be spread out across all batch operations and not all included in one.

I have updated the patch to include #3357503, and also apply this filter directly to getEntityProcessedTextDependencies() to avoid loading entities that are later filtered out anyways.

Also need #3435979 🐛 Export misses translated reference fields Needs review included to make sure media items on French translations are included in the export.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Ok another major improvement to the "export with references". I ran into a node that has 208 entity references (lots of media and files). This was causing memory issues, it was hitting 800MB and then would die. This is because of the current approach to exporting with references.

The current approach is to get all referenced entities recursively into an array, then loop over that array and serialize into another array, then loop over the second array and write the JSON files. With entities that have large amounts of references this starts to kill the memory.

My new approach is to get all referenced entities recursively into an array, then loop over that array to serialize and write to the file system at the same time. This means there is no second large array storing all entities and their serialized content. I tested this against the node with 804 references and it works great, memory usage doesn't go over 50MB. This node even has a PDF file of 28MB being serialized with better_normalizers enabled.

The patch attached provides this update to the exporter class. It removes a method, and simplifies the code, which is always nice :)
Status changed to Needs review over 1 year ago2:12pm 6 May 2024
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
I've made some major improvements on the importer class. Using a dataset of 2k nodes (12k+ entities) it was initially taking me 80min to import the full 12K entities and 60min to import them when all entities have already been imported (skipped). These numbers were even larger once I hooked up the methods for internal link replacement updateInternalLinks() and updateTargetRevisionId().

I did a full review of the importer class to fix all performance related issue. Please find below a list of improvements:

Minimize the data being stored in the queue table for each batch operation. I originally passed data like processed UUIDs and entity_id => UUID mappings through the class itself, which was causingdelay in starting the batch process as well as made the database grow significantly. Then this information was being passed using KeyValueStorage on each batch operation. The final solution now uses $context to pass this data to each subsequent batch operation, significantly speeding up the import process.

Added a setting called "Support old content", which enables the updateInternalLinks() method. This option loops over each JSON file to look for uri field names and does a str_starts_with to see if internal: or entity: exist in the JSON file. This is not needed with newer sites since links now use UUIDs and entities can be embedded with <drupal-entity> elements. Disabling this option sped up the import process.

There was duplication of $this->serializer->decode($this->exporter->getSerializedContent($entity), 'hal_json');. It was first being called in importEntity() to determine of there is a diff, then if there was a diff the preAddToImport() method was called and performed the comparison again. Now with this new patch this comparison is only done once, which significantly improved performance.

I also made changes to support path_alias entities. To accommodation this without needing to loop over the entities multiple times, I ensure path_alias entities are processed last so that the corresponding entities being referenced in the path field already exist and their entity_ids are available in $context.

I did a complete review of the importer class and fixed things like doc comments, inline comments, and general coding practices.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
I added output for the exporter class, similar to the importer class. It output how many entities of each type were exported. When I did this, I found a performance issue with export with references. I had 229 taxonomy_terms in total, but the count at the end of the export process was showing thousands. To fix this, I added the already processed entities to the batch $context, so I could check if it has already been exported and skip. This prevented thousands of writes to already written files. Also added batch output for export so when running in Drush you get an idea of what is happening, same as the importer. Updated patch attached.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
I created the related issue for file_entity → . Batch export would fail if a file entity is referencing an image/file that was removed from the file system. Was seeing this on a full site export, with old test data. Suggest updating file_entity to latest 2.0-rc6 to avoid this issue.
Comment over 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
Added error handling when export with references is used and the selected options produce no results.
Assigned to mkalkbrenner
Status changed to Needs work about 1 year ago3:22pm 3 July 2024
Comment about 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
I spent a lot of time on an in depth review. Some features like "export changes since" are broken. I started to rewrite the patch to fix these issues.
Comment about 1 year ago →
🇳🇴Norway eiriksm Norway
With the amount of code this touches, I feel it would first make sense to have in place the most basic test coverage of these forms, which can then be extended when reviewing and implementing the refactoring here.

This is why I opened 📌 Create tests for import and export forms Needs review which is now NR and has these very basic tests at least. A place to start?
Comment about 1 year ago →
🇨🇦Canada smulvih2 Canada 🍁
@kalkbrenner that is one feature I didn't test (export changes since), so glad you are doing an in-depth review of this! I do have this patch on my project and it is working great for moving 10k+ entities from environment to environment, so for the main use-case of just exporting/importing it is working great!

@eiriksm thanks for adding these tests!!!
Comment about 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
Unfortunately I found different issues and can't agree that it is working great. At least not with the content I could use for testing. I had to restructure the code and after two days of work, the export seems to work now with the different modes using drush.
But it is still much slower than without batches.
But don't get me wrong, the existing patch is a great starting point.
I will now take care of the import next week. And maybe we could talk about performance later as soon as the content itself is correct.
Status changed to Needs review about 1 year ago2:43pm 10 July 2024
Comment about 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
Here's a first draft of my rewrite.
Comment about 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
OK, a small mistake in the import. Here's a fix.
Comment about 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
oh, I uploaded the previous patch again, now ...
Comment about 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
mkalkbrenner → changed the visibility of the branch 3102222-integrate-with-batch to hidden.
Merge request !14Issue #3102222 by smulvih2, mkalkbrenner: Use Batch API for export/import → (Merged) created by mkalkbrenner
Comment about 1 year ago →
System Message

mkalkbrenner → committed 4ee6cc34 on 2.1.x
Issue #3102222 by smulvih2, mkalkbrenner: Use Batch API for export/...
Status changed to Fixed about 1 year ago11:57am 12 July 2024
Comment about 1 year ago →
🇩🇪Germany mkalkbrenner 🇩🇪
Comment about 1 year ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.

Use Batch API for export/import

Merge Requests

!14Use Batch API for export/import
Merged

!7Use Batch API for export/import
Open

Comments & Activities

Use Batch API for export/import

Merge Requests

!14Use Batch API for export/importMerged

!7Use Batch API for export/importOpen

Comments & Activities

!14Use Batch API for export/import
Merged

!7Use Batch API for export/import
Open