Support paging through multiple requests

Created on 24 December 2015, over 8 years ago
Updated 18 April 2024, 2 months ago

Follow-up to #2623012: Implement interfaces and base classes for URL-based sources โ†’

The stand-alone JSON plugin had some support for following links from one request to the next where a feed is paginated (remnants still exist in comments). Get this working in general in the fetcher plugins.

Remaining tasks

๐Ÿ“Œ Task
Status

RTBC

Version

6.0

Component

Plugins

Created by

๐Ÿ‡บ๐Ÿ‡ธUnited States mikeryan Murphysboro, IL, USA

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • ๐Ÿ‡ซ๐Ÿ‡ทFrance slayne40

    For pager type urls, verify if the selector return an url.

  • ๐Ÿ‡ซ๐Ÿ‡ฎFinland kekkis Pirkkala

    I believe the patches starting from #112 contain an error when using the 'paginator' type and when the $selector_data is scalar. The comparison operator between $num_items and $selector_data has shifted from '==' to '!=' for an unknown reason. How this manifests is that only the first page ever gets processed.

    In this new patch I try to explain the thinking in the comment a bit better, plus of course fix the operator.

    Leaving as Needs work since there still are no tests.

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.5 + Environment: PHP 8.1 & MySQL 8
    last update about 1 year ago
    Patch Failed to Apply
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update about 1 year ago
    185 pass, 2 fail
  • ๐Ÿ‡บ๐Ÿ‡ธUnited States sassafrass

    I am testing using the latest patch. In my yaml, I am using the paginator type because the urls provided in the JSON endpoints are not valid for my use case.

      urls:
        - http://services.baltimorecountymd.gov/api/hub/pets/pets?status=Adoptable
        - http://services.baltimorecountymd.gov/api/hub/pets/pets?status=Lost
      pager:
        type: paginator
        selector: /metaData/
        paginator_type: page_number
        default_num_items: 10
        page_key: page
        size_key: recordsPerPage
    

    The paginated urls are generated as expected. However, it never stops generating urls. Urls being generated are valid but have no records. For example see: https://services.baltimorecountymd.gov/api/hub/pets/pets?status=Adoptabl....

    My particular use case has non-scalar $selector_data but the JSON is not in format expected by this segment of code, which always evaluates to true.

         else {
                // If we have an array of rows
                if (count($selector_data) > 0) {
                  $next_urls[] = Url::fromUri($path['path'], [
                    'query' => $path['query'],
                    'fragment' => $path['fragment'],
                  ])->toString();
                }
              }
            }
    
  • ๐Ÿ‡บ๐Ÿ‡ธUnited States jcandan

    Re-roll #119 for latest 6.0.x-dev.

  • Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 9 months ago
    Not currently mergeable.
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 9 months ago
    189 pass, 2 fail
  • ๐Ÿ‡ป๐Ÿ‡ณVietnam chuyenlv

    Re-roll #121 for 6.0.x, Drupal 10 and PHP 8.1

  • ๐Ÿ‡ซ๐Ÿ‡ฎFinland kekkis Pirkkala

    The return type of \Drupal\migrate_plus\Plugin\migrate_plus\data_parser\Json::getSourceData cannot be set to array reliably since it might return NULL (per its own code) or any scalar (from json_decode). This is my suggestion how to fix it, even though using mixed might do as well.

  • Status changed to Needs review 8 months ago
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 8 months ago
    189 pass, 1 fail
  • ๐Ÿ‡บ๐Ÿ‡ธUnited States jacobbell84

    Comment 100 introduced a logic change to the getSourceData function in the Json data parser plugin that I believe wasn't intended.

         // Otherwise, we're using xpath-like selectors.
         $selectors = explode('/', trim($item_selector, '/'));
         $return = $this->sourceData;
         foreach ($selectors as $selector) {
    +      if (!isset($return[$selector])) {
    +        return NULL;
    +      }
           if (!empty($selector) || $selector === '0') {
             $return = $return[$selector];
           }
    

    With that change, it's no longer possible to use the Json data parser without setting an item selector. This is because the 'explode' method will return an array with one empty element if an empty item selector is passed to it, which causes the new check to fail 100% of the time and returns null. I've changed that logic to what's below, which I believe solves that problem and keeps the original intended functionality.

        // Otherwise, we're using xpath-like selectors.
        $return = $this->sourceData;
        if (!empty($item_selector)) {
          $selectors = explode('/', trim($item_selector, '/'));
          foreach ($selectors as $selector) {
            if (!isset($return[$selector])) {
              return NULL;
            }
            $return = $return[$selector];
          }
        } elseif (sizeOf($return) === 0) {
          return NULL;
        }
    
  • The last submitted patch, 125: migrate_plus-support-paging-2640516-125.patch, failed testing. View results โ†’
    - codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

  • Status changed to Needs work 8 months ago
  • ๐Ÿ‡บ๐Ÿ‡ธUnited States nadavoid

    Is there a way to use this patch for the paging scenario I have? The things in particular that I'm not sure how to solve:

    1. The token for the next page is unpredictable; it needs to be read from the current page
    2. The identifier for the next page is only a token, not a full URL

    My JSON source has this structure:

    {
      "response": {
        "docs": [
          {"one", "two", "three"}
        ]
      },
      "count": "329",
      "nextPageToken": "qwk9"
    }
    

    The URL of the next page is the same as the current page, with the change of a URL param: &pageToken=qwk9

    I'm thinking that I'll need to update the patch or extend it using custom code, but wanted to check here first. If updating the patch is there an existing pager type that would be best to model this on? These are the pager types that I see recognized in the patch so far:

    • urls
    • cursor
    • page
    • paginator
  • ๐Ÿ‡บ๐Ÿ‡ธUnited States nadavoid

    Updating issue summary.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States nadavoid

    Commenting with what I understand to be the purpose of some of the pager options for the JSON plugin.

    selector: The path to the JSON item identifying the next page.

    type

    • urls: selector points to one or more URLs.
    • cursor: selector is just a portion of the next page URL. Set `key` to the URL parameter key that should be passed.
    • page: selector is a current page number, and expected to increment: 1, 2, 3, etc.
    • paginator: not sure

    key: Parameter key in the URL, will be given a value that identifies the page to request. If key is set to "page", then the resulting generated URL will contain "?page=[value-from-current-json]"

    More parameters to document:

    size_key
    page_key
    selector_max
    default_num_items
    paginator_type

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States jacobbell84

    Fixing some incorrect variable names (A couple spots were still using $this->itemSelector instead of $item_selector) and fixing another edge case where the item_selector is blank.

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update 7 months ago
    187 pass, 3 fail
  • ๐Ÿ‡ช๐Ÿ‡ธSpain guiu.rocafort.ferrer Barcelona

    guiu.rocafort.ferrer โ†’ made their first commit to this issueโ€™s fork.

  • Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    Not currently mergeable.
  • Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    Not currently mergeable.
  • Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    Not currently mergeable.
  • Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    Not currently mergeable.
  • Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    Not currently mergeable.
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    195 pass
  • ๐Ÿ‡ช๐Ÿ‡ธSpain guiu.rocafort.ferrer Barcelona

    I updated the issue fork with the latest 6.0.x branch, so it is mergeable now. I also made a few changes so the Gitlab CI is available and runs de tests, and also fixed some of the tests.

    Some of the changes i made, which might be arguable, are:

    • When the item_selector does not exist, the getSourceData returns now an empty array instead of null. This way, the plugin acts as there is no rows to be imported and does nothing.
    • When the item_selector is NULL or it is not defined, it is set to '' by default. In that case i believe the getSourceData method should return the whole sourceData contents.
    • One of the tests failed because the $item_selector parameter for getSourceData was set to type string, but it was passing an integer value. Inside the function, it checked if the parameter is_numeric, to apply the backwards compatibility for depth selection. I changed the method definition to accept also integer values.

    My plan is to next write some tests for the pager functionality, and go from there to fix potential issues and make sure it works ok, before merging into the main branch.

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    Unable to generate test groups
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    Unable to generate test groups
  • ๐Ÿ‡ช๐Ÿ‡ธSpain guiu.rocafort.ferrer Barcelona

    The last commit adds test cases for the pager types urls, cursor, and page. The tests for paginator are still missing thought.

    I also expanded a little bit the documentation on the pager types here: https://www.drupal.org/docs/8/api/migrate-api/migrate-source-plugins/mig... โ†’

  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    CI aborted
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    207 pass
  • Status changed to Needs review 6 months ago
  • Open in Jenkins โ†’ Open on Drupal.org โ†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 6 months ago
    Patch Failed to Apply
  • ๐Ÿ‡ช๐Ÿ‡ธSpain guiu.rocafort.ferrer Barcelona

    I am changing the status to "Needs review", since i wrote tests for all the pager types, and also updated the documentation to reflect the new functionality.

    Documentation: https://www.drupal.org/docs/8/api/migrate-api/migrate-source-plugins/mig... โ†’

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States nadavoid

    The issue fork has merge conflicts against the main branch, with the recent update of migrate_plus to 6.0.2. The latest patches also no longer apply.

    @guiu.rocafort.ferrer Could you merge the latest into your issue fork and resolve conflicts? I'll be happy to test after that, since I'm actively using this, and will continue to for quite a while.

  • ๐Ÿ‡ช๐Ÿ‡ธSpain guiu.rocafort.ferrer Barcelona

    Hi @nadavoid, the merge request should now be good to go.

  • Status changed to RTBC 5 months ago
  • ๐Ÿ‡บ๐Ÿ‡ธUnited States nadavoid

    Thanks for the quick update @guiu.rocafort.ferrer! I really appreciate it. I've successfully applied the patch from the MR on 6.0.2 and confirmed that the cursor pagination still works in my installation.

    The code looks good good to me, and the tests seem comprehensive. So I'm marking this RTBC. Others who have been deeper in this issue are of course also welcome to review. I'm strongly in favor of merging soon, and if adjustments are needed or bugs are found, they can be handled in smaller followup issues.

    I'm also uploading the current version of the patch so that people have something stable to use in their builds today.

  • ๐Ÿ‡ช๐Ÿ‡ธSpain guiu.rocafort.ferrer Barcelona
  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom longwave UK

    I'm using this with the paginator pager type and while it works well, it is repeating HTTP requests - and my source is relatively slow to respond, so this is slowing down my migrations.

    If I add logging to Http::getResponse() I see that requests are made up to three times for the same URL:

    get https://[redacted]?size=500... done
    get https://[redacted]?size=500&offset=500... done
    get https://[redacted]?size=500... done
    get https://[redacted]?size=500&offset=500... done
    get https://[redacted]?size=500&offset=1000... done
    get https://[redacted]?size=500&offset=500... done
    get https://[redacted]?size=500&offset=1000... done
    get https://[redacted]?size=500&offset=1500... done
    get https://[redacted]?size=500&offset=1000... done
    

    I hate to say this at this stage but I wonder if the different pagers need to be their own type of plugin? That would mean they could be more easily extended for different cases.

  • ๐Ÿ‡ช๐Ÿ‡ธSpain guiu.rocafort.ferrer Barcelona

    Hi @longwave, I understand your concerns about the performance issues, and the redundant http requests, i do believe too that this is not optimal, and there is room for improve.

    This issue have been opened since 2015, and it has been difficult to push it forward and make it to merge with the development branch, so i am worried that addressing those issues would delay even more. I personally have a few sites that make use of the patch for a while now, and i believe some other people might be in the same situation.

    So i believe it would be better to merge the issue into development, and then create a follow-up issue to improve the performance situation.
    What do the other think about this ?

  • ๐Ÿ‡ฌ๐Ÿ‡งUnited Kingdom longwave UK

    That decision is up to the Migrate Plus maintainers, I think it would be fine to defer to followups given this fixes some immediate issues and both removing the additional requests and refactoring to use plugins could be done separately.

  • ๐Ÿ‡ฉ๐Ÿ‡ฐDenmark googletorp

    I've used this in the past and compared to the alternative (which is to fetch everything in one go) this is a major improvement. This is also an opt-in feature, so while making a lot of requests can cause problems, it's something the developers using this can determine if it's a problem or not. From what I understand if you're not using the feature we won't have any performance change.

    Maybe it's better to get it in, solve a lot of people's problems and get real life feedback on which performance issues would make sense to address.

    I think this is RTBC.

Production build 0.69.0 2024