Timeout when processing large files

Created on 20 October 2022
Updated on 30 May 2023

Problem/Motivation

When processing large files via the admin interface, the system times out. Data Pipelines does use the Batch API to process files, but each operation processes the entire file in one pass instead of breaking it into chunks, processing n records at a time, and handing control back to the Batch API.

Steps to reproduce

On a Data Pipelines dataset, upload a large file to be processed via the admin interface. We are seeing the timeout with an 11 MB CSV file.

Proposed resolution

Instead of looping through the entire file, process only 20-50 records at a time so that the Batch API can take back control and re-initiate the connection, preventing the timeout.
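
As a rough illustration of the standard Drupal Batch API pattern this resolution describes (not the module's actual code; the `example_*` helpers and the chunk size are hypothetical): track progress in `$context['sandbox']`, process a small slice per invocation, and let the Batch API re-dispatch until `$context['finished']` reaches 1.

```php
<?php

/**
 * Batch operation callback: process a dataset in small chunks.
 */
function example_process_dataset(int $dataset_id, array &$context): void {
  // Hypothetical chunk size; the issue suggests 20-50 records per pass.
  $chunk_size = 50;

  if (!isset($context['sandbox']['offset'])) {
    $context['sandbox']['offset'] = 0;
    // Stand-in for however the total record count is obtained.
    $context['sandbox']['total'] = example_load_record_count($dataset_id);
  }

  // Process only one chunk, then hand control back to the Batch API.
  $records = example_load_records($dataset_id, $context['sandbox']['offset'], $chunk_size);
  foreach ($records as $record) {
    example_index_record($record);
  }
  $context['sandbox']['offset'] += count($records);

  // A value below 1 tells the Batch API to call this operation again in a
  // fresh request, which is what prevents the PHP timeout.
  $context['finished'] = $context['sandbox']['total'] > 0
    ? $context['sandbox']['offset'] / $context['sandbox']['total']
    : 1;
}
```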

Workaround

For the time being, Drush can be used to process these large files via the `data-pipelines:reindex` command.

πŸ› Bug report
Status

Fixed

Version

1.0

Component

Code

Created by

πŸ‡¦πŸ‡ΊAustralia gordon Melbourne

Comments & Activities


  • Test run (Core 9.5.x, PHP 8.1 & MySQL 8): Patch Failed to Apply
  • πŸ‡»πŸ‡³Vietnam tannguyenhn

    I created a patch to process a limited amount of the dataset per request.

  • Test run (Core 10.1.x, PHP 8.1 & MySQL 8): Patch Failed to Apply
  • Test run (Core 9.5.x, PHP 8.1 & MySQL 8): Patch Failed to Apply
  • Test run (Core 10.1.x, PHP 8.1 & MySQL 8): Patch Failed to Apply
  • Test run (Core 9.5.x, PHP 8.1 & MySQL 8): 47 pass, 4 fail
  • πŸ‡»πŸ‡³Vietnam tannguyenhn

    New patch for 1.x-dev.

  • Test run (Core 10.1.x, PHP 8.1 & MySQL 8): 47 pass, 4 fail
  • Status changed to Needs work about 1 year ago
  • Test run (Core 10.1.x, PHP 8.1 & MySQL 8): 57 pass
  • Status changed to Needs review about 1 year ago
  • Test run (Core 9.5.x, PHP 8.1 & MySQL 8): Patch Failed to Apply
  • πŸ‡΅πŸ‡°Pakistan jibran Sydney, Australia

    I have moved the queuing bug to 🐛 Fix the queuing logic for indexing (Closed: duplicate) and created 📌 Do not loop over the generator in Dataset::data() instead use the validate date iterator (Closed: duplicate) for the generator-looping issue.

    For 🐛 Fix the queuing logic for indexing, let's move the patch from #23 there.
    For 📌 Do not loop over the generator in Dataset::data() instead use the validate date iterator, let's move the patch from #15 there.

  • πŸ‡΅πŸ‡°Pakistan jibran Sydney, Australia

    We'll come back to this once the child issues are fixed and see if this is still a bug.

  • πŸ‡΅πŸ‡°Pakistan jibran Sydney, Australia
  • Test run (Core 9.5.x, PHP 8.1 & MySQL 8): 62 pass
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Here's a patch that combines both approaches: it retains the queue and limits the number of times we seek over the data or call ->extractDataFromSource.

    It works locally for large datasets and for those with tricky validation.

    No interdiff as it's a redo from scratch, copying ideas from both approaches.

    With this patch we:

    * queue items in chunks
    * process the queue
    * validation happens during processing
    * invalid items are skipped

    Open to making the chunk size configurable per @tannguyenhn's patch if that's desired. A rough sketch of the chunked queueing follows.
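
    As a sketch of that flow (the queue name, `$dataset_rows`, and `$chunk_size` are assumed for illustration, not taken from the patch):

    ```php
    // Buffer rows from the dataset iterator and enqueue them in chunks, so
    // no single request walks the whole file. A QueueWorker then claims one
    // chunk per invocation, validating rows and skipping invalid ones.
    $queue = \Drupal::queue('data_pipelines_example');
    $buffer = [];
    foreach ($dataset_rows as $row) {
      $buffer[] = $row;
      if (count($buffer) >= $chunk_size) {
        $queue->createItem($buffer);
        $buffer = [];
      }
    }
    // Flush the final partial chunk.
    if ($buffer) {
      $queue->createItem($buffer);
    }
    ```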

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    We could even make the chunk size a field on the dataset, so smaller datasets (less complex rows) could use a large number for speed, and big datasets (lots of data in each row) could use a smaller number to cap memory overhead. A sketch of such a field is below.
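
    As a hedged sketch of that idea (the field name, default, and display options are illustrative; the allowed values mirror the update hook quoted later in this thread):

    ```php
    use Drupal\Core\Field\BaseFieldDefinition;

    // A list_integer base field on the Dataset entity letting editors pick
    // how many records are queued per chunk.
    $fields['chunk_size'] = BaseFieldDefinition::create('list_integer')
      ->setLabel(t('Chunk size'))
      ->setDescription(t('Records queued per chunk; smaller values reduce memory overhead.'))
      ->setDefaultValue(1000)
      ->setSetting('allowed_values', [
        10 => 10,
        100 => 100,
        1000 => 1000,
        10000 => 10000,
        100000 => 100000,
      ])
      ->setDisplayOptions('form', [
        'type' => 'options_select',
        'weight' => 10,
      ]);
    ```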

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10
    1. +++ b/src/Entity/DatasetInterface.php
      @@ -16,8 +19,12 @@ interface DatasetInterface extends ContentEntityInterface {
      +  const VALIDATION_QUEUE_PREFIX = 'data_pipelines_validate:';
      

      this is dead code and can be removed

    2. +++ b/src/Plugin/QueueWorker/ProcessValidation.php
      @@ -0,0 +1,75 @@
      +
      +declare(strict_types=1);
      +
      +namespace Drupal\data_pipelines\Plugin\QueueWorker;
      +
      +use Drupal\Core\Entity\EntityTypeManagerInterface;
      +use Drupal\Core\Plugin\ContainerFactoryPluginInterface;
      +use Drupal\Core\Queue\QueueWorkerBase;
      +use Drupal\data_pipelines\DatasetData;
      +use Drupal\data_pipelines\Entity\DatasetInterface;
      +use Symfony\Component\DependencyInjection\ContainerInterface;
      +use Symfony\Component\Validator\ConstraintViolation;
      +
      +/**
      + * Defines a class for processing validation.
      + *
      + * @QueueWorker(
      + *   id = "data_pipelines_validate",
      + *   title = @Translation("Validation"),
      + *   deriver =
      + *   \Drupal\data_pipelines\Plugin\Derivative\DatasetQueueWorkerDeriver::class,
      + * )
      + */
      +final class ProcessValidation extends QueueWorkerBase implements ContainerFactoryPluginInterface {
      

      this is dead code now and can be removed

    3. +++ b/src/TransformValidDataIterator.php
      @@ -0,0 +1,67 @@
      +      $this->position++;
      

      We probably need an implementation of ::rewind here to reset the pointer; I'll add a test and a fix for that.

  • πŸ‡΅πŸ‡°Pakistan jibran Sydney, Australia

    RE #29: I feel like we need a default chunk-size value on the entity, which the user can change when editing the entity.

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    We probably need an implementation of ::rewind here to reset the pointer, I'll add a test and a fix for that

    This isn't needed because you can't rewind a generator (the inner iterator), so we're good.
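
    For reference, a minimal demonstration of why ::rewind can't help here:

    ```php
    // PHP generators cannot be rewound once iteration has advanced.
    $generator = (static function () {
      yield 1;
      yield 2;
    })();

    $generator->next();
    // Throws Exception: "Cannot rewind a generator that was already run".
    $generator->rewind();
    ```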

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Worth noting that the use of queues here means that if, for whatever reason, the batch process dies part way through indexing, you can just use `drush queue-run` with the queue name to pick up where it left off.

  • Test run (Core 9.5.x, PHP 8.1 & MySQL 8): 63 pass
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    I love that the test suite runs in <1 min locally for this project.

    Addressed #31 and #30.

  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Screenshot of new field

  • πŸ‡΅πŸ‡°Pakistan jibran Sydney, Australia

    The patch looks really good overall. Just some minor feedback.

    1. +++ b/data_pipelines.install
      @@ -266,3 +266,31 @@ function data_pipelines_update_8013() {
      +      10 => 10,
      +      100 => 100,
      +      1000 => 1000,
      +      10000 => 10000,
      +      100000 => 100000,
      

      Can we add ',' on the values side for sanity? :D

    2. +++ b/src/Entity/Dataset.php
      @@ -223,13 +250,71 @@ class Dataset extends ContentEntityBase implements DatasetInterface {
      +  public function queueProcessing(QueueInterface $queue, int $from = 0, int $limit = 1000): int {
      

      This is too convoluted for my understanding; can we please add more docs around the logic?

    3. +++ b/src/Entity/Dataset.php
      @@ -223,13 +250,71 @@ class Dataset extends ContentEntityBase implements DatasetInterface {
      +      foreach ($destinations as $destination_id => $destination) {
      

      Do we have a test to check it works with multiple destinations?

    4. +++ b/src/Entity/Dataset.php
      @@ -248,15 +334,12 @@ class Dataset extends ContentEntityBase implements DatasetInterface {
      +  public function getDataIterator(int $seek_from = 0): TransformValidDataIterator {
      

      How is seeking working? Can we add a comment

    5. +++ b/src/Form/DatasetBatchOperations.php
      @@ -108,50 +86,24 @@ class DatasetBatchOperations implements ContainerInjectionInterface {
      +    $indexed = $dataset->queueProcessing(\Drupal::queue(DatasetInterface::QUEUE_NAME_PREFIX . $dataset->id()), $context['sandbox']['from'], $chunk_size);
      

      Let's use the dataset machine name instead of the ID for queues. Do we need a follow-up for that?

  • πŸ‡¦πŸ‡ΊAustralia mortim07

    @larowlan Great work, looks solid! Just a few comments.

    @@ -223,13 +250,71 @@ class Dataset extends ContentEntityBase implements DatasetInterface {
    +    $destinations = [];
    +    $chunk_sizes = [];
    +    $buffers = [];
    +    foreach ($this->get('destinations') as $destination) {
    +      if (!$entity = $destination->entity) {
    +        continue;
    +      }
    

    You can just use the method `getDestinations()`.

    @@ -105,35 +103,23 @@ final class DestinationWorker extends QueueWorkerBase implements ContainerFactor
    -      $this->logger->error("Unknown destination request of type %type", ['%type' => get_class($data)]);
    +    $result = $data->getOperation()->process($destinationPlugin, $dataset, $data->getChunk());
    

    Just pass the data and let the operation determine if it needs the chunk.

    @@ -0,0 +1,67 @@
    +    $errors = $this->pipeline->validate($this->getInnerIterator()->current());
    +    if ($errors->count() > 0) {
    +      foreach ($errors as $validation_error) {
    +        assert($validation_error instanceof ConstraintViolation);
    +        $this->dataset->addLogMessage(sprintf('Validation error in row %d: %s', $this->position, (string) $validation_error->getMessage()), FALSE);
    +      }
    +      $this->dataset->save();
    +      return FALSE;
    +    }
    

    Is it possible to encapsulate this within the valid method as part of the filter iterator?

  • Test run (Core 9.5.x, PHP 8.1 & MySQL 8): 64 pass
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Addressed #36 and ##3, except for the comment about valid vs accept; see the sketch below. I looked into the PHP docs for iterators, and if we return FALSE from ::valid then the iterator stops (it's like EOF), which isn't what we want.
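
    To illustrate the distinction (the class and data are hypothetical, not from the patch):

    ```php
    // \FilterIterator::accept() decides per-item inclusion, whereas
    // Iterator::valid() signals end-of-iteration: returning FALSE from
    // valid() stops the loop entirely (like EOF), while returning FALSE
    // from accept() merely skips the current item.
    final class SkipEmptyRows extends \FilterIterator {

      public function accept(): bool {
        $row = $this->getInnerIterator()->current();
        return is_array($row) && $row !== [];
      }

    }

    $rows = new \ArrayIterator([['a' => 1], [], ['a' => 2]]);
    foreach (new SkipEmptyRows($rows) as $row) {
      // Yields only the two non-empty rows; iteration continues past the
      // skipped row instead of terminating at it.
    }
    ```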

    Also fixed a fail on D10

  • Test run (Core 10.1.x, PHP 8.1 & MySQL 8): 64 pass
  • Status changed to RTBC about 1 year ago
  • πŸ‡΅πŸ‡°Pakistan jibran Sydney, Australia

    The docs update looks good. Happy to mark it RTBC.

  • Status changed to Fixed about 1 year ago
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Committed to 1.x - thanks!

  • Automatically closed - issue fixed for 2 weeks with no activity.
