Error responses are stored when using the Download migration process

Created on 20 February 2020, almost 5 years ago
Updated 8 October 2024, 3 months ago

Following on from the work done in the parent issue addressing error handling.

The Download process doesn't currently handle any clean up in terms of deleting the destination file when the error occurs.

For example, downloading a http://example.com/example.jpg file which 404s will throw an exception, but will keep the destination example.jpg file on disk and have it contain the raw HTML for the 404 response, which might not be expected behaviour.

This is especially hard to work with when using the file_exists: 'use existing' option.

After discussion with Framwork managers, we should delete the file on a 400 response, and there is no need to preserve current behavior, as saving a 404 response to a jpg file is a bug.

Additional notes. There's currently some differing opinions on how to proceed here. I currently see three options.

  1. Do nothing, potentially add a log message or warning if we detect a non-200 response while downloading a file.
    • Benefits: it's easy.
    • Downside: you may have files in your filesystem named my_file.jpg that contain the body of a 404 or 403 response.
  2. Delete the file on non-200 responses.
    • Benefits: No error messages stored in files with an image/document extension after the migration completes.
    • Downside: downloading and then deleting a file from the filesystem. The file will have to exist in the public file system, for at least a short while. Could we accidentally delete a file we don't want to? could we potentially overwrite a good file with a bad one and then delete it? What are the edge cases to handle?
  3. Download files to the temporary file system, verify integrity, then move to the final destination.
    • Benefits: safest option, could eliminate files with bad responses from ever being in the filesystem, could also do integrity checks against the file to prevent malicious scripts or XSS attacks from being migrated into the filesystem. Currently migrated files are only as safe as your file source.
    • Downsides: potential slowdown of file migrations (which already can take a long time for large filesystems), particularly if the temp directory and the public directory are on different filesystems, which is often likely to be the case.
πŸ› Bug report
Status

Needs work

Version

11.0 πŸ”₯

Component

migration system

Created by

Live updates comments and jobs are added and updated live.
  • Needs framework manager review

    It is used to alert the framework manager core committer(s) that an issue significantly impacts (or has the potential to impact) multiple subsystems or represents a significant change or addition in architecture or public APIs, and their signoff is needed (see the governance policy draft for more information). If an issue significantly impacts only one subsystem, use Needs subsystem maintainer review instead, and make sure the issue component is set to the correct subsystem.

Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • First commit to issue fork.
  • @benjifisher opened merge request.
  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    benjifisher β†’ changed the visibility of the branch 3114887-cleanup-failed-downloads--rebase to hidden.

  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    benjifisher β†’ changed the visibility of the branch 9.4.x to hidden.

  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    I rebased @slucero's branch on the current 11.x and made a new MR.

  • πŸ‡³πŸ‡ΏNew Zealand xurizaemon Ōtepoti, Aotearoa 🏝

    I support the "delete on error" approach in MR 9785.

    Discussed above in 19 is the idea of using a HEAD request to check the file's existence. (I realise I'm responding to a Slack thread from two years ago here, but wanted to put this on record in case the solution turns back from the current MR approach.)

    Working recently with Migrate in Islandora and large file ingests (>1GB assets) recently, there's a case where an initial HEAD will not prevent this error: if the file URL is correct and the HEAD is a 200, then retrieval fails for eg timeout or resource limits (server returns a 200, but the content length is incorrect). My recollection is that Guzzle does correctly throw an exception and terminate the stream in this case.

    The current behaviour is that the terminated stream will be written as a partial download when the request exits. This MR looks like it would correctly ensure the failed download is removed.

Production build 0.71.5 2024