Investigate Revising the Update Functionality

Created on 23 September 2020
Updated 8 January 2024

Problem/Motivation

Mark Triggs recently posted on Slack:

yep, all_ids and modified_since is the same thing the archivesspace indexer uses to get records out as quickly as possible. Fetches the set of modified IDs using that endpoint, then breaks them into chunks of 25 and does a GET on /repositories/n/archival_objects?id_set=1,2,3,4,... to get back sets of records in batches of 25 (and actually spreads the batches across multiple threads to improve throughput)

the trick with the API is that you can get 25 records back in about the same time as you can get one, so using the id_set parameter on your GETs is the trick to good performance there
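The batching strategy described above can be sketched as two small helpers: split the modified IDs into chunks of 25, then build a batched `id_set` GET for each chunk. The `id_set` parameter comes from the quote; the helper names, the chunk size default, and the URL shape are illustrative assumptions, not code from the module.

```python
def chunk(ids, size=25):
    """Split a list of record IDs into fixed-size batches."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def id_set_url(repo_id, record_type, batch):
    """Build a batched GET path using the id_set parameter,
    e.g. /repositories/2/archival_objects?id_set=1,2,3 (shape assumed)."""
    return f"/repositories/{repo_id}/{record_type}?id_set={','.join(map(str, batch))}"
```

Each resulting URL fetches a whole batch in roughly the time a single-record GET would take, which is the performance trick Mark describes; the batches could then be handed to a thread pool.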

The modified_since parameter is not mentioned in the API documentation, so I didn't know it existed, although I suspected something like it should.

In any case, we currently use the update endpoint. We should investigate whether this previously undocumented parameter lets us simplify our code. We may be able to ditch nearly all of the update code and instead rely on supplying the modified_since parameter to the source iterator. We could still use batching, but focus on using the modified_since + limit parameters.
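A minimal sketch of what a modified_since-driven source iterator might look like. The `modified_since`, `page`, and `page_size` parameters and the `results`/`last_page` fields follow the ArchivesSpace paginated list endpoints as I understand them, but should be verified against the API; the injected `fetch` callable is a hypothetical stand-in for whatever HTTP client the module uses, so this runs without a network.

```python
def iterate_modified(fetch, repo_id, record_type, modified_since, page_size=25):
    """Yield records of one type modified since a Unix timestamp.

    `fetch(endpoint, params)` returns the decoded JSON response for a GET;
    it is injected so this sketch stays self-contained and testable.
    """
    page = 1
    while True:
        result = fetch(
            f"/repositories/{repo_id}/{record_type}",
            {"modified_since": modified_since,
             "page": page, "page_size": page_size},
        )
        yield from result["results"]
        if page >= result["last_page"]:
            break
        page += 1
```

If this works, the update command might reduce to running this iterator per record type from the stored checkpoint timestamp, rather than going through the search endpoint.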

Anyway, something to look into.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

πŸ“Œ Task
Status

Active

Version

1.0

Component

Code

Created by

πŸ‡ΊπŸ‡ΈUnited States seth.e.shaw


Comments & Activities


  • πŸ‡ΊπŸ‡ΈUnited States cao89

    I've started looking at this, mainly because the update command (which uses the search endpoint) isn't working for me, while modified_since seems to work fine.

    One thing I'm getting hung up on is how you would build the batches. The search endpoint returns one set of results for all types. If you switch to using modified_since, wouldn't you have to make a request for each type? At that point I'm not sure how you incorporate 'max-pages'. There's no endpoint that uses modified_since but returns results across all types, correct?

    If I'm understanding correctly, batches are being processed and tracked by 'last_user_mtime'. So if I run a batch on the first 10 pages, those pages are sorted by user_mtime, and when the batched update finishes up to max_pages, 'archivesspace.latest_user_mtime' is updated so you don't reprocess the ones you've already processed. But that wouldn't work across multiple types using the iterator, right (because user_mtime wouldn't be sequential across all types anymore)?
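    One conceivable answer to the cross-type ordering question: make one modified_since request per type and merge the per-type streams by user_mtime, so the combined sequence stays globally sorted and a single latest_user_mtime checkpoint remains valid. This is only a sketch of that idea, not existing module code; the record shape (dicts with a user_mtime key) and the assumption that each per-type stream is already sorted by user_mtime are mine.

    ```python
    import heapq

    def merge_by_mtime(streams):
        """Merge several per-type record streams into one sequence ordered
        by user_mtime. Each input stream must already be sorted by
        user_mtime (ascending) for the merge to be correct."""
        return heapq.merge(*streams, key=lambda rec: rec["user_mtime"])
    ```

    With the merged stream, stopping after max_pages worth of records and recording the last user_mtime seen would behave the same way the single search-endpoint stream does today.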
