Use Batch API for export/import

Created on 19 December 2019
Updated 7 May 2024

We are using this in Demo Framework with the Scenario module, and it's great! However, we often have to install quite a few modules and large amounts of content, which risks timing out. It would be great if we could "throttle" this using the Batch API.

Is it possible to integrate Batch API to optionally allow us to import content?

@rlnorthcutt
________________________________________________________________________________________________________________________________

We have a requirement to provide content deployments across environments. We looked into other options, such as content_deploy and workspaces (core), but determined that default_content_deploy was the best option, needing only minimal changes to support our needs.

There are a few issues with the current DCD implementation with this in mind:

  • It doesn't use the Batch API, which limits the amount of data that can be exported/imported.
  • It works well for entity reference fields, but not for exporting entity references in text fields (linkit, entity_embed, media, etc.).
  • It doesn't give fine-grained control over which types of entities to export as dependencies.
  • It errors when exporting entities that do not support UUIDs.

With these issues solved, DCD would be perfectly suited to support content deployment setups.

@smulvih2

✨ Feature request
Status

Needs review

Version

2.0

Component

Code

Created by

πŸ‡ΊπŸ‡ΈUnited States rlnorthcutt Austin, TX


Merge Requests

Comments & Activities


  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    I need batch for export/import as well. I have a site with 2k nodes, and I am not able to export through the UI or using Drush; I either get timeouts or memory issues. I still have 2 large migrations for this site and will have >10k nodes by the time it's ready for hand-off.

    I wrote a simple Drush command that batches the export, given a content type and bundle. I am able to use this to export all 2k nodes - https://gist.github.com/smulvih2/c3a406ecca47bf344fd2b36804c7d927

    Batching import is not as straightforward. My workaround for now is to use drush dcdi and add the following to settings.php:

    if (PHP_SAPI === 'cli') {
      ini_set('memory_limit', '-1');
    }

    With this I am able to import all 2k nodes. It would be nice to have progress updates during the import process so you can gauge time to completion.
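
    For reference, here is a rough sketch of the chunking idea behind that kind of command (this is not the gist code; the call into the DCD exporter is hypothetical, and only the batching scaffolding uses core APIs):

    use Drupal\Core\Batch\BatchBuilder;

    /**
     * Builds a batch that exports nodes of one bundle in chunks of 50.
     */
    function mymodule_build_export_batch(string $bundle): void {
      $nids = \Drupal::entityQuery('node')
        ->condition('type', $bundle)
        ->accessCheck(FALSE)
        ->execute();

      $batch = new BatchBuilder();
      $batch->setTitle(t('Exporting @bundle nodes', ['@bundle' => $bundle]));
      foreach (array_chunk($nids, 50) as $chunk) {
        $batch->addOperation('mymodule_export_chunk', [$chunk]);
      }
      // Run with batch_process() in a form context, or via Drush's batch
      // runner when called from a command.
      batch_set($batch->toArray());
    }

    /**
     * Batch operation: exports one chunk of node IDs.
     */
    function mymodule_export_chunk(array $nids, array &$context): void {
      foreach ($nids as $nid) {
        // Hypothetical call; the real DCD exporter API may differ.
        \Drupal::service('default_content_deploy.exporter')->exportEntity('node', $nid);
      }
      $context['results']['exported'] = ($context['results']['exported'] ?? 0) + count($nids);
    }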

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    I had some time on Friday to look into this and was able to get import working with batch through Drush and the UI. It still needs more testing, and export needs batch support as well. Posting my patch here to capture my progress; I will continue working on this when I have spare time.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Just tested #6 on a migration I am running to see how the new batch process would handle a real-life scenario. It seems to work as expected.

    Here is the contents of my export:

    • file: 1
    • media: 1
    • node: 90
    • taxonomy_term: 88
    • Total JSON files: 180

    I am calling the new importBatch() method programmatically on a custom form submit handler, like this:

    // Path to the exported content shipped with the custom module.
    $import_dir = \Drupal::service('extension.list.module')->getPath('cra_default_content') . '/content';
    $importer = \Drupal::service('default_content_deploy.importer');

    // Force overwriting of entities that already exist on the target site.
    $importer->setForceOverride(TRUE);
    $importer->setFolder($import_dir);
    // Scan the folder and decode the exported JSON files, then run the batch import.
    $importer->prepareForImport();
    $importer->importBatch();

    This gives me the batch progress bar and correctly shows 180 items being processed. After the import, I get all 90 nodes with translations. The term reference fields all work as expected. Even the links to other imported nodes within the body field work as expected.

    Here are the patches I have in my project:

    "drupal/default_content_deploy": {
        "3302464 - remove entities not supporting UUIDs": "https://www.drupal.org/files/issues/2022-08-08/dcd-remove-entities-without-uuid-support-3302464-2.patch",
        "3349952 - Export processed text references": "https://www.drupal.org/files/issues/2023-03-23/dcd-export-processed-text-references-3349952-2.patch",
        "3357503 - Allow users to configure which referenced entities to export": "https://www.drupal.org/files/issues/2023-05-01/default_content_deploy-referenced_entities_config-3357503-2.patch",
        "3102222 - batch import": "https://www.drupal.org/files/issues/2023-09-18/dcd-import-batch-api-3102222-6.patch"
    },
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    After testing the batch import for DCD a bit more, I think we need to add a batch process for the prepareForImport() method as well. With a lot of JSON files in the export directory, decoding all of these files can cause timeouts.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    With patch #6 I was getting warnings in the dblog. The new patch fixes this, and batch import now works without any dblog errors/warnings.

  • Status changed to Needs review 7 months ago
  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ

    Using batches would be a good improvement, but we need to review the patch carefully.
    The current process runs over the content three times to deal with things like path aliases, so the context that needs to be shared between the batches is important.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Agreed, although so far it seems to be working well for things like entity references, links to other nodes, etc. I will make sure to test this with path aliases as you suggest. I also need to implement batch for export, since currently I am using a custom Drush command to get past the export limitation.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Got batch export working for all three modes (default, reference, all). Adding patch here to capture changes, but will push changes to a PR to make it easier to review.

    Remaining tasks:

    • Performance - exporting with references can load the same entity multiple times. Each batch item exports a single entity with any of its referenced entities. If the same term is referenced on multiple nodes, the JSON file for the term is updated multiple times. We can add an array to store processed entities to avoid duplication.
    • There is some duplication of code between exportBatchDefault and exportBatchWithReferences; this could be extracted into new method(s).
    • Add progress indicator to drush command, so batch output shows in CLI
    • Test export/import with complex data
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    As I suspected in #8, the prepareForImport() method will cause timeouts on large data sets due to the amount of processing that occurs per JSON file. The new patch moves prepareForImport() into its own batch process, which then passes the data to the existing batch process that imports the entities. Tested this on a large data set and it works well.

  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ

    Thanks for the patch, I'll review it ...

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Removed a line used for testing.

  • Merge request !7Move export/import into batch processes β†’ (Open) created by smulvih2
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Created a PR to make it easier to review the changes, in line with #15.

  • Status changed to Needs work 5 months ago
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    There is one issue with my last patch: passing data from the first batch process, prepareForImport(), to importBatch(). I will need to figure out a different solution for this before it is ready for review.

  • Status changed to Needs review 5 months ago
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    OK, I got the import() process working with batch; it's solid now. Updated patch attached; I will update the PR shortly and add comments.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁
  • Status changed to Needs work 5 months ago
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    This patch needs a re-roll after recent changes were pushed to the 2.0.x-dev branch. I will work on this over the next few days.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Re-rolled patch to apply against latest 2.0.x-dev branch.

  • Status changed to Needs review 5 months ago
  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ
  • Status changed to Needs work 5 months ago
  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ

    The patch doesn't apply anymore.

  • Status changed to Needs review 5 months ago
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Re-rolled patch against latest 2.0.x-dev branch. Tested all 3 export modes through the UI as well as Drush. Also tested import through both UI and Drush. Seems to work well. Made a slight adjustment to the --text_dependencies Drush flag so it takes the UI config value if not specified in the Drush command.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Updated MR to align with changes in latest patch (#25) - https://git.drupalcode.org/project/default_content_deploy/-/merge_requests/7/diffs

  • Status changed to Needs work 4 months ago
  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    After updating my containers to PHP 8.3 (from PHP 8.1) I am getting this error when importing content using this patch:

    Deprecated function: Creation of dynamic property Drupal\default_content_deploy\Importer::$context is deprecated in Drupal\default_content_deploy\Importer->import() (line 255 of modules/contrib/default_content_deploy/src/Importer.php).

    Adding this to the top of the Importer class fixes the deprecation error:

      /**
       * The batch context.
       *
       * @var array
       */
      protected $context;

    The new patch was tested on PHP 8.3 and fixes the issue.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    I'm running a large export of >16k nodes, with references, for a total of >118k entities. While reviewing patch #28, I found a redundant call that loads the entity a second time in the exportBatchDefault() method. The new patch attached removes this second call and will hopefully speed things up a bit.

  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ
    +++ b/src/Importer.php
    @@ -291,93 +369,98 @@ class Importer {
    -    // All entities with entity references will be imported two times to ensure
    -    // that all entity references are present and valid. Path aliases will be
    -    // imported last to have a chance to rewrite them to the new ids of newly
    -    // created entities.
    -    for ($i = 0; $i <= 2; $i++) {
    

    I can just repeat what I commented on the MR.
    The current patch removed an essential feature:
    All entities with entity references will be imported two times to ensure that all entity references are present and valid. Path aliases will be imported last to have a chance to rewrite them to the new ids of newly created entities.

    This strategy is essential to correct references via IDs (not everything works with UUIDs yet).
    Especially path aliases are special and break using the proposed patch.
    The patch only works for the content in total.
    But exporting from A and importing into B breaks, as the ID collisions in references aren't corrected anymore.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    @mkalkbrenner, yes I can reproduce the issue with path_alias entities. I was not running into this before, as path aliases are not exported with references since they have a relationship from

    path_alias -> node

    instead of node -> path_alias. Now that I specifically export path_alias entities as well, I can reproduce. I will make sure to account for this in the batch import in subsequent patches, but for now I am looking at how to make the export/import process scalable.

    I was able to successfully export all 118,000 entities with the batch export, but I am having issues with the import at this scale. The issue is how I pass $file, the contents of the JSON file in question, to each batch operation: the Batch API then writes the file contents to the queue table in the database. Of the 118k entities, 50k are serialized files like images, PDF files, etc. Writing the actual file contents to the database exploded the database size and would time out before even starting the batch operation, or run out of disk space.

    I will need to rewrite the importer class for this to work; instead of passing the file contents to the batch operations, I will just pass a pointer to the file in the filesystem. Then each batch operation can load the file from the pointer. Then I will combine the decodeFile() and importEntity() batch operations into one method, reducing the number of batches by 50% (currently 2 per JSON file). If I ensure path_alias entities are imported last, then I can probably get them working by just swapping the entity_ids.

    See my comments from slack below for records (2024-04-16):

    Need to make an update to my DCD patch: instead of passing the file contents to the batch operation, and subsequently storing the JSON files in the database, I am going to pass a pointer to the file in the filesystem, then load the file directly in the batch operation. This is going to be needed in a production scenario to prevent the database from swelling when imports are run.

    And look at combining the decodeFile() and importEntity() batch operations into one operation that does both, which will reduce the number of batches by 50%

    This was the full rewrite of the importer class I was hoping to avoid, but now it looks like it's needed.
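
    Roughly, the combined operation could look like this (the importFromSerializedData() entry point is a placeholder, not the final patch code):

    /**
     * Batch operation: loads, decodes and imports a single exported JSON file.
     */
    function dcd_batch_process_file(string $uri, array &$context): void {
      // Only the file path travels through the queue table, not its contents.
      $contents = file_get_contents($uri);
      if ($contents === FALSE) {
        $context['results']['skipped'][] = $uri;
        return;
      }

      $importer = \Drupal::service('default_content_deploy.importer');
      // Hypothetical single entry point combining decodeFile() + importEntity().
      $importer->importFromSerializedData($contents);

      $context['results']['imported'] = ($context['results']['imported'] ?? 0) + 1;
    }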

  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ

    A quick explanation of the old algorithm to fix references via ID:

    First run:
    Import new and updated entities, but skip path aliases. Store old ID and new ID of newly created entities in a mapping array. Store a list of newly created entities that reference other entities via IDs in another array (NEW).

    Second run:
    Run updates on all entities stored in array (NEW) and correct the reference IDs according to the mapping array. Still skip the path aliases.

    Third run:
    Import path aliases. In case of new aliases, adjust the referenced entity IDs according to the mapping array.

    I think that this algorithm could be kept. The first batch has to create a second batch of newly created entities that reference other entities via IDs and skip path aliases. For path aliases it has to create the third batch.
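
    Expressed as batch plumbing, the three passes could be chained like this (all operation callbacks here are placeholders; only the control flow mirrors the algorithm):

    function dcd_build_three_pass_batch(array $files): void {
      $operations = [];

      // Pass 1: import everything except path aliases; record old ID => new ID
      // mappings and collect entities that reference others via IDs.
      foreach ($files as $uri) {
        $operations[] = ['dcd_pass1_import_skip_aliases', [$uri]];
      }

      // Pass 2: re-save the collected entities and rewrite their ID-based
      // references using the mapping built in pass 1.
      $operations[] = ['dcd_pass2_fix_id_references', []];

      // Pass 3: import path aliases last, mapping referenced entity IDs to the
      // newly created ones.
      $operations[] = ['dcd_pass3_import_path_aliases', []];

      batch_set([
        'title' => t('Importing default content'),
        'operations' => $operations,
      ]);
    }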

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    @mkalkbrenner thanks for the info! Will refer to this when fixing for path_aliases.

    So I found the major source of my database exploding on import. DCD does a filesystem scan and stores a pointer to each file in the Importer object. This object is then added to each batch item in the database. This means each batch item in the database would have all 118k pointers. I was able to use KeyValueStorage to store this data outside of the object, and now the queue entries look reasonable for each batch item:

    a:2:{i:0;a:2:{i:0;O:38:"Drupal\\default_content_deploy\\Importer":10:{s:46:"\000Drupal\\default_content_deploy\\Importer\000folder";s:14:"../content/dcd";s:52:"\000Drupal\\default_content_deploy\\Importer\000dataToImport";a:0:{}s:16:"\000*\000forceOverride";b:0;s:23:"\000*\000discoveredReferences";a:0:{}s:20:"\000*\000oldEntityIdLookup";a:0:{}s:17:"\000*\000entityIdLookup";a:0:{}s:11:"\000*\000newUuids";a:0:{}s:10:"\000*\000context";a:1:{s:7:"sandbox";a:2:{s:8:"progress";i:0;s:5:"total";i:22;}}s:14:"\000*\000_serviceIds";a:11:{s:13:"deployManager";s:30:"default_content_deploy.manager";s:16:"tempStoreFactory";s:17:"tempstore.private";s:16:"entityRepository";s:17:"entity.repository";s:5:"cache";s:13:"cache.default";s:10:"serializer";s:10:"serializer";s:17:"entityTypeManager";s:19:"entity_type.manager";s:11:"linkManager";s:16:"hal.link_manager";s:15:"accountSwitcher";s:16:"account_switcher";s:8:"exporter";s:31:"default_content_deploy.exporter";s:8:"database";s:8:"database";s:15:"eventDispatcher";s:16:"event_dispatcher";}s:18:"\000*\000_entityStorages";a:0:{}}i:1;s:11:"processFile";}i:1;a:4:{i:0;s:41:"cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json";i:1;s:61:"../content/dcd/node/cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json";i:2;i:22;i:3;i:22;}}

    With this new path, I combined both batch operation callbacks into one callback, so we have 50% fewer batch operations with this method.
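
    For the record, the KeyValue part boils down to something like this (the collection name is just an example):

    // Before batch_set(): store the full file list once, outside the
    // serialized Importer object, so each queued batch item stays small.
    $store = \Drupal::keyValue('default_content_deploy.import');
    $store->set('files', $files);

    // Inside a batch operation: read it back instead of carrying it around.
    $files = $store->get('files', []);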

    Rough numbers: if I have 118,000 entities to import, then all 118,000 file pointers would be referenced in each of the 118,000 batch operations. Since we had 2 operations per JSON file, that would be x2. So 118k x 118k x 2 = 27.87 billion entries. This is one such entry:

    0 => [
      'name' => 'cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json',
      'uri' => '../content/dcd/node/cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json'
    ],
    

    If we say this string is 92 bytes, then this would be approx. 2.3 TB of storage required for this.

    Now we only have 118,000 batch operations (not times two), and the total data that is stored per operation in the DB is about 1234 bytes. So 1234 bytes times 118,000 is approx. 145MB.

    So the import of 118k entities would increase the DB size by about 145MB instead of 2+ TB. This should also significantly reduce the time it takes to write the batch operations to the database when batch_set() is called (before the progress bar is shown).

    Uploading patch here and will test the import against the 118k entities and report back.

  • πŸ‡©πŸ‡ͺGermany mkalkbrenner πŸ‡©πŸ‡ͺ

    If we (optionally) use igbinary to serialize this array, I expect that we'll save 80% of this memory.
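
    A quick way to compare the two serializers on a sample queue item, assuming the igbinary extension is installed (the file names are placeholders taken from the entry above):

    $item = [
      'name' => 'cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json',
      'uri' => '../content/dcd/node/cc1b90d7-7d81-47ca-b0b6-fd5c068a55e4.json',
    ];
    printf("native: %d bytes, igbinary: %d bytes\n",
      strlen(serialize($item)),
      strlen(igbinary_serialize($item)));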

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Update: With the latest patch #33 I was able to trigger an import of 118k entities through the UI, and within 10 seconds it started processing the entities, where before it would take too long and time out. I am now importing the entities using nohup and drush. Definitely need to think of a few ways to optimize the export/import process; will take a look at igbinary for this!

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Coming back to the export, I originally exported 118k entities using the "All content" option, which worked perfectly. I also tested the "Content of the entity type" option and it worked well. I am now testing "Content with references" and running into a problem. When exporting with references, the queue database table would get populated as expected, then during the first batch operation it would spin until it timed out. The issue ends up being getEntityReferencesRecursive(). It looks at references in the body field and spiders out to include hundreds of nodes.

    This is where the patch in #3357503 ✨ Allow users to configure which referenced entities to export comes in handy. I can exclude nodes from reference calculations. For example, I want to export all nodes of type page, with references. If I include nodes as part of the reference calculation, the first batch operation could be massive and time out. With nodes excluded, I would still get all pages, but they would be spread out across all batch operations and not all included in one.

    I have updated the patch to include #3357503, and also applied this filter directly to getEntityProcessedTextDependencies() to avoid loading entities that are later filtered out anyway.

    Also need #3435979 🐛 Export misses translated reference fields included to make sure media items on French translations are included in the export.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    OK, another major improvement to "export with references". I ran into a node that has 208 entity references (lots of media and files). This was causing memory issues; it was hitting 800MB and then dying. This is because of the current approach to exporting with references.

    The current approach is to get all referenced entities recursively into an array, then loop over that array and serialize each entity into another array, then loop over the second array and write the JSON files. With entities that have large numbers of references, this starts to exhaust memory.

    My new approach is to get all referenced entities recursively into an array, then loop over that array to serialize and write to the file system at the same time. This means there is no second large array storing all entities and their serialized content. I tested this against the node with 804 references and it works great; memory usage doesn't go over 50MB. This node even has a 28MB PDF file being serialized with better_normalizers enabled.
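
    The streaming loop is essentially the following (the reference discovery call and directory layout are illustrative, not the actual exporter code):

    function dcd_stream_export_references(\Drupal\Core\Entity\ContentEntityInterface $entity, string $folder): void {
      $serializer = \Drupal::service('serializer');

      // Stand-in for DCD's recursive reference discovery.
      foreach (dcd_collect_references_recursive($entity) as $referenced) {
        $json = $serializer->serialize($referenced, 'hal_json');
        $dir = $folder . '/' . $referenced->getEntityTypeId();
        if (!is_dir($dir)) {
          mkdir($dir, 0775, TRUE);
        }
        // Write each entity to disk immediately; nothing accumulates in memory.
        file_put_contents($dir . '/' . $referenced->uuid() . '.json', $json);
        unset($json);
      }
    }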

    The patch attached provides this update to the exporter class. It removes a method, and simplifies the code, which is always nice :)

  • Status changed to Needs review about 2 months ago
  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    I've made some major improvements to the importer class. Using a dataset of 2k nodes (12k+ entities), it was initially taking me 80 minutes to import the full 12k entities and 60 minutes to import them when all entities had already been imported (skipped). These numbers got even larger once I hooked up the methods for internal link replacement, updateInternalLinks() and updateTargetRevisionId().

    I did a full review of the importer class to fix all performance-related issues. Please find below a list of improvements:

    • Minimize the data being stored in the queue table for each batch operation. I originally passed data like processed UUIDs and entity_id => UUID mappings through the class itself, which was causing delays in starting the batch process and made the database grow significantly. This information was then passed using KeyValueStorage on each batch operation. The final solution uses $context to pass this data to each subsequent batch operation, significantly speeding up the import process (see the sketch at the end of this comment).
    • Added a setting called "Support old content", which enables the updateInternalLinks() method. This option loops over each JSON file to look for uri field names and does a str_starts_with to see if internal: or entity: exist in the JSON file. This is not needed with newer sites since links now use UUIDs and entities can be embedded with <drupal-entity> elements. Disabling this option sped up the import process.
    • There was duplication of $this->serializer->decode($this->exporter->getSerializedContent($entity), 'hal_json');. It was first called in importEntity() to determine if there is a diff; then, if there was a diff, the preAddToImport() method was called and performed the comparison again. With this new patch the comparison is only done once, which significantly improved performance.

    I also made changes to support path_alias entities. To accommodate this without needing to loop over the entities multiple times, I ensure path_alias entities are processed last, so that the corresponding entities referenced in the path field already exist and their entity_ids are available in $context.

    I did a complete review of the importer class and fixed things like doc comments, inline comments, and general coding practices.
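
    A minimal sketch of the $context approach from the first bullet above (key names are illustrative, not the actual patch code):

    /**
     * Batch operation: imports one file and records lookups for later operations.
     */
    function dcd_import_operation(string $uri, array &$context): void {
      // $context['results'] persists across all operations in the same batch.
      $context['results']['uuid_to_id'] ??= [];
      $context['results']['processed'] = ($context['results']['processed'] ?? 0) + 1;

      // ... import the entity from $uri (omitted), then record its new ID so
      // later operations (e.g. the path_alias imports processed last) can
      // resolve references:
      // $context['results']['uuid_to_id'][$uuid] = $entity->id();
    }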

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    I added output for the exporter class, similar to the importer class. It outputs how many entities of each type were exported. When I did this, I found a performance issue with export with references. I had 229 taxonomy_terms in total, but the count at the end of the export process was showing thousands. To fix this, I added the already processed entities to the batch $context, so I can check whether an entity has already been exported and skip it. This prevented thousands of writes to already written files. I also added batch output for export, so when running in Drush you get an idea of what is happening, the same as the importer. Updated patch attached.
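
    The duplicate-export guard is essentially this check against the shared batch $context (key names are illustrative):

    function dcd_export_operation(\Drupal\Core\Entity\ContentEntityInterface $entity, array &$context): void {
      $key = $entity->getEntityTypeId() . ':' . $entity->uuid();
      if (isset($context['results']['exported'][$key])) {
        // Already written by an earlier batch operation; skip the file write.
        return;
      }
      // ... serialize the entity and write its JSON file here (omitted) ...
      $context['results']['exported'][$key] = TRUE;
      // Per-type counts for the summary output mentioned above.
      $type = $entity->getEntityTypeId();
      $context['results']['count'][$type] = ($context['results']['count'][$type] ?? 0) + 1;
    }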

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    I created a related issue for file_entity. Batch export would fail if a file entity references an image/file that was removed from the file system. I was seeing this on a full site export with old test data. I suggest updating file_entity to the latest 2.0-rc6 to avoid this issue.

  • πŸ‡¨πŸ‡¦Canada smulvih2 Canada 🍁

    Added error handling when export with references is used and the selected options produce no results.
