Any plan to integrate Box/spout ?

Created on 20 April 2022, almost 3 years ago
Updated 29 May 2023, over 1 year ago

Problem/Motivation

Exporting large files on a project, I run into memory problem even by using batch operations

Steps to reproduce

Export in XSLX a file of 6K+ lines / 50+ columns

Proposed resolution

Reading on stackoverflow about this issue, I found information about Box/spout which seems a good fit for such a task

Any plan to integrate it as a possible fallback to PHPSpreadsheet ?

cf.:
https://stackoverflow.com/questions/50731704/building-very-large-spreads...
https://opensource.box.com/spout/

Feature request
Status

Active

Version

1.0

Component

Code

Created by

🇨🇦Canada kiwad

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • 🇳🇱Netherlands theemstra

    There are some limitations to box/sprout that would remove some functionality from the module that some people might be using, including "no automatic column widths, no column number/string formatting)" as per the first link you provided.
    We could provide the option to choose one of the engines, with the tradeoffs of each library.

  • 🇨🇦Canada kiwad

    Not sure box/spout is still maintained, project is archived on github

    Maybe fastexcelwriter ?
    https://packagist.org/packages/avadim/fast-excel-writer

  • 🇮🇹Italy thebetti

    there is any update?

  • heddn Nicaragua

    phpoffice/phpspreadsheet now has a 2.x release. Is it any faster?

  • Status changed to Closed: outdated 5 months ago
  • 🇳🇿New Zealand ericgsmith

    Re opening for clarification - is there a reason why this is closed as outdated vs closed as won't fix?

    I've done a quick test locally and it looks like v2 phpspreadsheet is still considerably slower than the fast-excel-writer mentioned in #5

    I did a quick test with about 6000 rows of data and phpspreadsheet is taking 17 seconds to write the file, where as fast-excel-writer takes less than a second. In my test, the file produced from fast-excel-writer is double the size of the one produced by phpspreadsheet - so I assume there is some additional optimizing by phpspreadsheet that contributes to the additional time. Still - I think that result warrants further clarification.

    My quick test results for anybody interested - I used our existing csv export which is around 6000 lines.

    test-speed.php:

    <?php
    
    include 'vendor/autoload.php';
    
    $csvFile = fopen('examplefile.csv', 'r');
    
    $spreadsheet = new PhpOffice\PhpSpreadsheet\Spreadsheet();
    $spreadsheet->setActiveSheetIndex(0);
    $sheet = $spreadsheet->getActiveSheet();
    
    $rowIndex = 1;
    while (($row = fgetcsv($csvFile)) !== FALSE) {
      $columnIndex = 1;
      foreach ($row as $value) {
        $sheet->setCellValue([$columnIndex, $rowIndex], $value);
        $columnIndex++;
      }
      $rowIndex++;
    }
    
    fclose($csvFile);
    
    $writer = PhpOffice\PhpSpreadsheet\IOFactory::createWriter($spreadsheet, 'Xlsx');
    $writer->save('test-speed.xlsx');
    

    Run twice:

    time php scripts/test-speed.php 
    
    real	0m17.630s
    user	0m17.584s
    sys	0m0.044s
    
    time php scripts/test-speed.php 
    
    real	0m17.392s
    user	0m17.332s
    sys	0m0.056s
    
    

    Same data with the alternative library:

    <?php
    
    include 'vendor/autoload.php';
    
    $csvFile = fopen('examplefile.csv', 'r');
    
    $spreadsheet = avadim\FastExcelWriter\Excel::create(['Sheet1'], ['auto_convert_number' => true]);
    $sheet = $spreadsheet->sheet();
    
    while (($row = fgetcsv($csvFile)) !== FALSE) {
      $sheet->writeRow($row);
    }
    
    fclose($csvFile);
    
    $spreadsheet->save('test-speed2.xlsx');
    

    Huge improvement:

    time php scripts/test-speed2.php 
    
    real	0m0.893s
    user	0m0.863s
    sys	0m0.028s
    
    time php scripts/test-speed2.php 
    
    real	0m0.920s
    user	0m0.891s
    sys	0m0.028s
    
    
  • 🇳🇿New Zealand ericgsmith

    One thing I missed is there is a popular and maintained fork of the originally mentioned Box/spout - https://github.com/openspout/openspout

    While both are maintained openspout is more widely used.

    I've tested this as well and its also looking good:

    <?php
    
    include 'vendor/autoload.php';
    
    $csvFile = fopen('examplefile.csv', 'r');
    
    $writer = new \OpenSpout\Writer\XLSX\Writer();
    $writer->openToFile('test-speed3.xlsx');
    
    while (($row = fgetcsv($csvFile)) !== FALSE) {
      $cells = \OpenSpout\Common\Entity\Row::fromValues($row);
      $writer->addRow($cells);
    }
    
    fclose($csvFile);
    
    $writer->close();
    
    print memory_get_peak_usage();
    

    And the output:

    time php scripts/test-speed3.php 
    1269864
    real	0m1.283s
    user	0m1.233s
    sys	0m0.036s
    

    What I have not done is a feature comparison of some of the views style plugin options.

    Both openspout and fast-excel-writer treat reading and writing as separate things - where as with php spreadsheet you can open and write to the same spreadsheet. So if things like batch processing are concerned either option would need to do something similar to what views_data_export is doing here - https://git.drupalcode.org/project/views_data_export/-/blob/8.x-1.x/src/...

    Open spout has docs for this - https://github.com/openspout/openspout/blob/4.x/docs/guides/1-add-data-e... where as for the fast read / write libraries it wasn't so clear from a first look how to ensure all the right info gets copied over.

    I'm not sure how relevant that is - for us we are using this module to provide the serializer and views_data_export for the views plugin as batch processing was needed - I haven't looked into how / if batches are handed with this modules views plugins.

  • 🇳🇿New Zealand ericgsmith

    Ok, final bit of research.

    I updated my test scripts to load the existing file and then loop the CSV again to add the rows again.

    PHP spreadsheet:

    time php scripts/test-speed.php 
    147031000
    real	1m5.979s
    user	1m5.735s
    sys	0m0.188s
    

    Openspout

    time php scripts/test-speed3.php 
    1271096
    real	0m7.810s
    user	0m7.716s
    sys	0m0.081s
    

    This definitely enough for me to take this back to the client to try get more hours to work on this. My current thinking would be as a sub module that can override the xls_serialization.encoder.xlsx service - with the caveat that the full functionality may not be available with the open spout library. My use case does not currently require these features as we use the serializer with views data export.

  • Pipeline finished with Success
    4 months ago
    Total: 269s
    #296485
  • Pipeline finished with Success
    4 months ago
    Total: 228s
    #296542
  • 🇳🇿New Zealand ericgsmith

    I have added a draft MR with a proof of concept for this libary.

    If using with views_data_export I have added a related issue and MR to 📌 [PP1] Considering switching to another library to batch XLSX files because of performance on large files Postponed

    The approaches are a bit rough - the main aim here is for me to verify that this library is going to be performant and applicable for use on an existing project where phpspreadsheet is problematic.

    Early feedback always welcome for any interested parties - not setting for review yet as I need to do some more testing, and there are todo's related to the investigating functionality gaps between the 2 libraries - its possible some of these options are supported I just haven't investigated yet.

  • 🇨🇦Canada kiwad

    Early feedback :

    Tested an XLSX export that wasn't working in phpspreadsheet due to lack of memory

    1. Applied MR https://git.drupalcode.org/project/xls_serialization/-/merge_requests/20...
    2. composer require openspout/openspout
    3. Switched config

    The export now works

  • 🇳🇿New Zealand ericgsmith

    Thanks for taking the time to test @kiwad - I appreciate it.

    I've also been doing some additional testing and am keen to tidy up this MR into something that could make it into this module.

    I'm not so keen on the current implementation of extending the encoder and replacing it with the factory - while that worked there only some commonality between the 2 and I feel it doesn't make it easy to track feature parity. Long term maintenance wide it also means the extending class is fragile and subject to breaking if the xls class changes / introduces new features.

    One thought I had was refactoring the implementation to use an adapter pattern - e.g if we had:

    interface SpreadsheetAdapterInterface {
    
      public function addRow(array $row);
      public function setColumnsAutoSize();
      public function setRowsAutoHeight();
      public function setMetaData(array $metadata);
    }
    

    We could do something like:

        try {
          $adapter = \Drupal::service('xls_serialization.adapter_factory')->get($this->xlsFormat);
          // Set headers.
          $this->setHeaders($adapter, $data, $context);
          // Set the data.
          $this->setData($adapter, $data);
          // Set the width of every column with data in it to AutoSize.
          if (empty($config->get('xls_serialization_autosize'))) {
            $adapter->setColumnsAutoSize();
          }
    
          // Set rows to auto-height.
          $adapter->setRowsAutoHeight();
    

    etc in the Xls class and split out the phpspreadsheet logic and openspreadsheet logic to their own files with using a shared adapter interface.
    One of the challenging this for this view config options for applying styling, while possible to add styles in openspout it needs to be done at the time the row is added to the spreadsheet - not after as is currently done for the phpspreadsheet implementation.

    This would leave things like formatValue, validateWorksheetTitle, extractHeaders in the XLS class and library specific calls in the adapters.

    This would be a bigger refactoring of the module - so keen to get feedback on this.

  • 🇳🇿New Zealand ericgsmith

    I have pushed a work in progress on the idea I had in #15 to try split the library specific code out - its a bit rough and not tested yet, end of day here so will pick back up next week. Assigning to me since I did not finish.

  • Pipeline finished with Failed
    4 months ago
    Total: 223s
    #300421
  • Pipeline finished with Failed
    4 months ago
    Total: 253s
    #304683
  • Pipeline finished with Failed
    4 months ago
    Total: 212s
    #304703
  • Pipeline finished with Success
    4 months ago
    Total: 163s
    #304714
  • 🇳🇿New Zealand ericgsmith

    The encode unit test is passing with open spout - but I'm having more doubts around how best to structure this - I'm reversing from my previous thought that the adapter pattern is cleanest, as there is so little commonality in terms of supported features and the way the libraries implement those features (e.g we could support styling in openspout but we would have to set the styles at the time we write the row rather than at the end).

    I also don't know if its worth trying to chase all the features - if people need the full features they probably doing fine with the old library, maybe reverting back to a completely separate service would be easier since the openspout code will be super minimal.

  • I think that makes sense. Then the old service and library could be deprecated, right?

  • Pipeline finished with Success
    3 months ago
    Total: 263s
    #306698
  • Pipeline finished with Success
    3 months ago
    #306701
  • 🇳🇿New Zealand ericgsmith

    Closed initial MR attempt - opened a new MR with a much more simpler approach to just override the service in a sub module.

    I think doing less initially is better to get this committed - this will give users a clear option to go for performance instead of additional features.

    Some features the phpspreadsheet option provides are technically possible with openspout, but I think it would make sense to do these as follow ups if people need them. In my case, we don't use any of them so have no issue going for the simpler library. Not all of them are supported and so I don't think chasing feature parity with the slower library is worth the effort.

    I also added a hook requirements check - I think its fair that this module only recommends openspout and does not add it as a full dependency.

    Ready for a proper review now.

  • Pipeline finished with Canceled
    3 months ago
    #313096
  • 🇳🇿New Zealand ericgsmith

    MR is now testing on Drupal 11 by default - which this module does not have a release for - so ran manual pipeline against Drupal 10 - https://git.drupalcode.org/issue/xls_serialization-3276294/-/pipelines/3...

  • Pipeline finished with Success
    3 months ago
    Total: 134s
    #313099
  • 🇳🇿New Zealand RoSk0 Wellington

    Looks really good! Thanks Eric!

    There are couple code style issues that could be fixed to make it perfect.

  • Status changed to RTBC 3 months ago
  • The patch file for v1.5.0

  • Pipeline finished with Success
    2 months ago
    Total: 199s
    #336876
  • 🇳🇿New Zealand ericgsmith

    Doesn't rebase cleanly against latest changes in 8.x-1.x branch - will take a look soon

  • 🇳🇿New Zealand ericgsmith

    Rebased against 2.0.x - minor changes made to constructor to get the commit to apply cleanly.

    Hiding old MR and branch against 8.x-1.x

    I note there is not 2.x or 2.1.x branch - as this is a new feature it would need to go into a 2.1.x release at this point no branch existis for this.

  • Pipeline finished with Success
    about 2 months ago
    Total: 161s
    #348781
  • Pipeline finished with Success
    about 2 months ago
    Total: 142s
    #348782
  • 🇫🇷France mably

    Hi @ericgsmith, I just created the 2.1.x branch, you can update your MR.

  • Pipeline finished with Success
    10 days ago
    Total: 151s
    #393834
  • Status changed to Needs review 10 days ago
  • 🇳🇿New Zealand ericgsmith

    Thanks @mably - I've rebased the MR and updated target to 2.1.x

  • Pipeline finished with Success
    10 days ago
    Total: 146s
    #393835
  • 🇫🇷France mably

    Let's merge this so people can start playing with it.

Production build 0.71.5 2024