Allow to use the spreadsheet iterator as a migrate_plus data parser

Created on 20 July 2020, over 4 years ago
Updated 20 August 2024, 4 months ago

Problem/Motivation

For some use cases it could be useful to be able to migrate data directly from a URL (XLS file stored on an external server for example).
The migrate_plus module easily allows to fetch data from URLs instead of local files and provide a data_parser plugin mecanism to read this data.

Proposed resolution

Create a migrate_plus data_parser plugin to be able to read files from a URL instead of a file.

✨ Feature request
Status

Needs review

Version

2.1

Component

Code

Created by

πŸ‡«πŸ‡·France duaelfr Montpellier, France

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • πŸ‡§πŸ‡ͺBelgium matthijs

    Added missing return types for Migrate Plus 6.x compatibility

  • πŸ‡ΊπŸ‡¦Ukraine vlad.dancer Kyiv

    Thank you @duaelfr and @Matthijs for this plugin. We've used for a couple of years. It worked great.
    But recently migrated to Drupal 10 and also changed some data import logic.
    We needed "id-less" data import. Well, actually migrate can't do that. But we decided to use
    "row_index_column" field, provided by this module, and xls file_name as ids.

    Here is our migration config:

    source:
      plugin: custom_source_extends_url_source
      data_fetcher_plugin: file
      data_parser_plugin: custom_data_parser_extends_migrate_spreadsheet
      track_changes: true
      worksheet: 'Sheet1'
      origin: A2
      header_row: 1
      item_selector: null
      // Filled up on runtime
      urls: []
      columns: []
      fields: []
    
      # Actual ids for migrate source plugin will be next:
      # - file_name
      # - spreadsheet_row_index
      ids:
        spreadsheet_row_index:
          type: string
      row_index_column: 'spreadsheet_row_index'
    

    And then we faced a number of issues with data parser plugin.
    I'm not sure the roots of them, maybe updated migrate/migrate_plus/drupal apis.

    But here are how we ended up using data parser plugin and fixed issues.

    @claudiu.cristea was right about that file in openSourceUrl() can come not only from remote, but from local fs.

    Here how delt with it:

    protected function openSourceUrl($url): bool {
        $fetcher = $this->getDataFetcherPlugin();
    
        if ($fetcher->getPluginId() === 'file') {
          $this->configuration['file'] = $url;
        }
        else {
          // Why we want to read file two times?
          /** @var \GuzzleHttp\Psr7\Response $response */
          $response = $this->getDataFetcherPlugin()->getResponse($url);
    
          preg_match('/filename="(.*?)"/', $response->getHeader('Content-Disposition')[0] ?? '', $matches);
          $this->configuration['file'] = $this->fileSystem->saveData($response->getBody(), 'temporary://' . $matches[1], FileSystemInterface::EXISTS_RENAME);
        }
    
        $this->initializeIterator();
        return TRUE;
      }
    

    Next thing. In our case checking for an empty current item in valid() was not enought. We got infinite loop.
    So ended up with checking row_index_column and rows count:

    public function valid(): bool {
        $currentItem = $this->currentItem;
        if (!$currentItem) {
          return FALSE;
        }
        if (!parent::valid()) {
          return FALSE;
        }
    
        // Check iterator for current/max index.
        // Fix issue with lost last record.
        $rowIndexCol = $this->configuration['row_index_column'] ?? NULL;
        if ($rowIndexCol) {
          $index = $currentItem[$rowIndexCol];
          $max = $this->spreadsheetIterator->getRowsCount();
    
          if ($index <= $max) {
            return TRUE;
          }
        }
        // Fallback to default check, which doesn't work for us,
        // but should work for others if they use "row_index_column" feature.
        else {
          return $this->spreadsheetIterator->valid();
        }
    
        return FALSE;
      }
    

    BTW: there is also php notice for undefined row_index_column array key in SpreadsheetIterator class but I'll report this issue separately and link it here.

  • Pipeline finished with Success
    4 months ago
    Total: 138s
    #259036
Production build 0.71.5 2024