Missing temporary files in load balanced environments

Created on 27 September 2017
Updated 29 August 2023

Problem/Motivation

When an import needs multiple cron runs to complete and the HTTP Fetcher is used, Feeds uses the temporary directory to temporarily store the file that is being imported.

On some servers, the default configured temporary directory is not static: it can change per request. This can, for example, be the case on Acquia and Pantheon. This is problematic because if the temporary directory changes between two import cycles, Feeds loses track of the file being imported and therefore cannot finish the import.

Proposed resolution

Do not use the temporary directory to temporarily store a file to import; use private://feeds/in_progress instead. The D7 version of Feeds already does this.

The first step, however, is adding test coverage for importing data using multiple calls/requests. Right now, there is only test coverage for importing data in one go. Tests for this case have already been written for the D7 version of Feeds, so it is probably a good idea to port these over. They can be found in the class 'FeedsFileHTTPTestCase'. The following methods from that class would be the most interesting to port:

  1. ::setUpMultipleCronRuns()
  2. ::getInProgressFile()
  3. ::testImportSourceWithMultipleCronRuns() - ported in #3044983: Add test coverage for importing using multiple cron runs
  4. ::testAbortImportWhenTemporaryFileIsDeleted()

Additionally, no warning seems to be logged when a file that is in the process of being imported suddenly goes missing.

Workaround

A workaround is to set a different path for the temporary directory, one that is static. In settings.php:
$config['system.file']['path']['temporary'] = '/your/temporary/directory';
Be sure to change '/your/temporary/directory' to the directory of your choice.
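
Note: the $config override above applies to Drupal 8.7 and earlier. Since Drupal 8.8, the temporary directory is no longer part of the system.file config and must be set as a setting instead:

  $settings['file_temp_path'] = '/your/temporary/directory';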

Remaining tasks

  1. Write test coverage for aborting an import because of a missing temporary file.
  2. Log a warning when the file that is in the process of being imported no longer exists.
  3. Move the temporary file to private://feeds/in_progress.
  4. Write test coverage for removing the temporary file after import (this task could also be handled in 📌 Auto-purge old temporary feed downloads [Fixed]).
  5. Add a setting for the in progress directory called "in_progress_dir". Make it part of a config object called "feeds.settings".

Original report by bburg

I'm running several sites using the Pantheon platform on an elite plan. These sites use Feeds to import content from a central site. They also run in a load-balanced setup: multiple instances of Drupal running on separate servers with a shared database.

The problem is that on cron, the Feeds module will download the feed source (an XML document from the central site) and store it in the temporary directory. It appears that the parser isn't triggered immediately; instead, the task is added to the queue. The queue entry references the temporary file by its full path. On Pantheon, the temporary directory may have an inconsistent path, e.g. /srv/bindings/580b80cb8cb344dfa6f2c1059d896835/tmp. The hashed number is, I believe, something called the "Pantheon Binding", which I assume is a reference to the current container, but I'm not certain. So, on subsequent cron runs, the queue processor may end up trying to access a temporary file that no longer exists, as the Pantheon Binding has changed since the job was saved to the queue table. This results in an accumulation of queue items that point to non-existent temporary files and will never be removed, since the queue job doesn't remove them if the temporary file is missing. This leads to a lot of errors like this one, which also seem to cause performance issues during cron:

RuntimeException: File /srv/bindings/c1e9936fd5b347d7ae98b1226d7724f0/tmp/feeds_http_fetcherELdIb4 does not exist. in Drupal\feeds\Result\FetcherResult->checkFile() (line 53 of /srv/bindings/580b80cb8cb344dfa6f2c1059d896835/code/web/modules/contrib/feeds/src/Result/FetcherResult.php).

I'm using a snapshot of the dev branch from August 2016, so if this problem has already been addressed in a previous issue, my apologies. Upgrading to a newer snapshot seems to lead to several errors...

Pantheon support has suggested that we change the temporary files directory to point to a path on the shared, networked file system, under sites/default/files/tmp, which they have documented in "Temporary File Management with Multiple Application Containers: Using Valhalla, Pantheon's Networked Filesystem". I'll link their support team to this ticket as well so they can comment and clarify the issue. Our next scheduled deployment window isn't until early October, so I won't be able to test that suggestion in production until then.

Also, to address the stuck queue items, I think the only option is to delete all queue items whose name matches "feeds_feed_parse%". So adding an update hook with the following:

  db_delete('queue')
    ->condition('name', 'feeds_feed_parse%', 'like')
    ->execute();

(db_delete() is deprecated, but I don't know the new method yet.) This might be something to address in this module. Also, for the issue itself: might it be better to store the file reference in the queue item as a stream wrapper URI instead? E.g. temporary://feeds_http_fetcherELdIb4.
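
For reference, a sketch of the same cleanup using the non-deprecated database API (the update hook name is hypothetical):

  /**
   * Deletes stale Feeds queue items that reference missing temporary files.
   */
  function mymodule_update_8001() {
    // \Drupal::database()->delete() replaces the deprecated db_delete().
    \Drupal::database()->delete('queue')
      ->condition('name', 'feeds_feed_parse%', 'LIKE')
      ->execute();
  }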

I'm not sure if supporting such environments is really in the scope of this project (which is why I've filed this as a support request). I'm really filing this issue here to improve the searchability of the problem in general, in case other people are running into it.

📌 Task
Status

Fixed

Version

3.0

Component

Code

Created by

🇺🇸United States bburg Washington D.C.


Comments & Activities


  • Status changed to Needs work almost 2 years ago
  • Status changed to Needs review almost 2 years ago
  • 🇳🇱Netherlands megachriz

    Setting this back to "Needs review" because I think that the issue reported here needs to be fixed in another issue, possibly 🐛 FetcherResult::checkFile() throws \RuntimeException, which is never caught [Needs review].

  • 🇺🇸United States smustgrave

    We are using Redis for our queue, so maybe the bad values are in there. Clearing the queue table doesn't do anything for us.

  • 🇳🇱Netherlands megachriz

    @smustgrave
    Hm, that's unfortunate. Since 8.x-3.0-beta3, Feeds automatically cleans up tasks for a specific feed from the queue table whenever that feed finishes an import or when the feed gets unlocked. But if that queue is external, Feeds cannot clean up tasks from that queue. The queue API has no method to clear a specific task from a queue; it only provides a method to clear a whole queue. The Feeds queues are currently defined per feed type. Perhaps each feed should get its own queue? That could potentially lead to a large number of queues; I'm not sure if that's a problem.
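
    To illustrate the limitation with the core queue API (a sketch; the queue name is hypothetical, as Feeds queue names are derived from the feed type):

      // A single item can only be deleted after claiming it first.
      $queue = \Drupal::queue('feeds_feed_parse:my_feed_type');
      if ($item = $queue->claimItem()) {
        $queue->deleteItem($item);
      }
      // Beyond that, the API can only drop the entire queue.
      $queue->deleteQueue();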

  • 🇺🇸United States smustgrave

    So I have to dig more, but if the Feeds module is querying the queue table directly, that seems wrong to me. If it used the queue service, then clearing a queue would also clear it from whatever backend is in use.

    We have experienced many times that Drupal's default queue will fail and cause performance issues.

  • 🇳🇱Netherlands megachriz

    @smustgrave
    We should probably discuss the queue thing further in another issue, but see the methods hasQueueTasks() and clearQueueTasks() in FeedImportHandler. hasQueueTasks() is used by FeedsLockBackend to determine if an import still appears to be active. clearQueueTasks() is called when finishing an import or when unlocking a feed. When unlocking a feed, it is important that any active imports for that feed get killed, and that is now done by removing specific tasks from the queue table.

    The code:

      /**
       * Checks if there are still tasks on the feeds queue.
       *
       * @param \Drupal\feeds\FeedInterface $feed
       *   The feed to look for in the queue.
       *
       * @return bool
       *   True if there are still tasks on the queue. False otherwise.
       */
      public function hasQueueTasks(FeedInterface $feed): bool {
        // First check if the queue table exists.
        if (!$this->database->schema()->tableExists('queue')) {
          // No queue table exists, so no tasks can exist on it.
          return FALSE;
        }
    
        $result = $this->database->select('queue')
          ->fields('queue', [])
          ->condition('data', 'a:3:{i:0;i:' . $feed->id() . ';%', 'LIKE')
          ->countQuery()
          ->execute()
          ->fetchField();
    
        return $result > 0;
      }
    
      /**
       * Removes all queue tasks for the given feed.
       *
       * @param \Drupal\feeds\FeedInterface $feed
       *   The feed for which to remove queue tasks.
       */
      public function clearQueueTasks(FeedInterface $feed): void {
        if (!$this->database->schema()->tableExists('queue')) {
          return;
        }
        $this->database->delete('queue')
          ->condition('data', 'a:3:{i:0;i:' . $feed->id() . ';%', 'LIKE')
          ->execute();
      }
    

    I didn't see another way to clean up stale queue tasks. The only other option is to remove the whole queue, but that would kill imports of other feeds of the same feed type as well. Queues are currently defined per feed type, not per feed.

  • Tested the MR and also experienced the RuntimeException error with /tmp/feeds_http_fetcher not existing in our load-balanced environment.

    Still testing, but I wonder if it has something to do with our site also using s3fs for a remote file system, which is quite useful for load-balanced environments. I noticed the tempnam() method in the FeedsFileSystemBase class was using fileSystem->realpath() to get the path. The Drupal API docs say regarding realpath():

    "Resolves the absolute filepath of a local URI or filepath. The use of this method is discouraged, because it does not work for remote URIs. Except in rare cases, URIs should not be manually resolved. Only use this function if you know that the stream wrapper in the URI uses the local file system, and you need to pass an absolute path to a function that is incompatible with stream URIs.

    https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21File%21Fi...

  • 🇳🇱Netherlands megachriz

    @rdivince
    Thanks for helping with testing! I've no experience with s3fs. Can you check what happens with that filesystem if you try to write a file the way FeedsFileSystemBase::tempnam() does? So, test that separately, outside of the Feeds context?

    also experienced the RuntimeException error with /tmp/feeds_http_fetcher not existing in our load-balanced environment

    That's probably because there are still queue items left on the queue. So in order to check if the solution here works for future imports, try to empty these Feeds queues first. One way of doing that is by installing the Queue UI module. It can also be done with the Drush command drush queue:delete, but then you need to know the exact queue name, which you can see with drush queue:list.

  • 🇳🇱Netherlands megachriz

    @rdivince
    I tried to find a way to test the patch with an S3 filesystem. It looks like I would need an account for Amazon's Simple Storage Service (S3) for that. I tried to create an account there, but I got stuck at the step asking for credit card details; in the Netherlands it is not self-evident to have one. So I'm tempted to commit this without the S3 check.

  • 🇺🇸United States smustgrave

    FYI, after I cleared my Redis queues and applied this patch across all our environments, we are no longer seeing the problem.

    We are using Azure with a shared file mount. Our prod and stage scale up to multiple instances as well.

  • @MegaChriz thanks for the info! I was able to clear the queue with the Drush commands provided, but it did not fix the import errors. I tested file writing with tempnam() outside of the Feeds context but did not have any luck. However, I did get the MR to work as follows:

    In FeedsFileSystemBase::tempnam(), instead of writing a temporary file with tempnam(), I wrote a file with \Drupal::service('file.repository')->writeData(). This method should support local and remote stream wrappers. Then I got additional error information: the getAsync() call in HttpFetcher::get() appears to open a file in "w+" mode by default, which S3 (and perhaps other remote stream wrappers) do not seem to allow: https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/s3-stream-wra...

    You can upload larger files by streaming data using fopen() and a "w", "x", or "a" stream access mode. The Amazon S3 stream wrapper does not support simultaneous read and write streams (e.g. "r+", "w+", etc). This is because the HTTP protocol doesn't allow simultaneous reading and writing.

    First opening the stream through GuzzleHttp\Psr7\Utils::tryFopen() with "w" mode and passing that file stream as the sink to getAsync() worked perfectly. I'm not sure if other use cases require "w+".

    While these tweaks to the MR successfully resolved my errors with s3fs, I know mine is just one of many use cases and you'll know those better than me. So I've attached my tweaks as a patch and would like to hear your thoughts.
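
    A minimal sketch of the approach described above (the $client, $url, and $destination_uri variables are assumptions for illustration):

      use GuzzleHttp\Psr7\Utils;

      // Open the destination in "w" mode instead of "w+", since remote
      // stream wrappers such as S3 do not allow simultaneous read/write.
      $sink = Utils::tryFopen($destination_uri, 'w');
      // Pass the stream resource as the sink for the async request.
      $client->getAsync($url, ['sink' => $sink])->wait();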

  • Status changed to RTBC over 1 year ago
  • 🇺🇦Ukraine pingwin4eg Zaporizhia 🇺🇦

    The fix from the merge request works - the default private directory (private://feeds/in_progress/#) is used instead of the temporary one.

    For those who still see the error with the tmp file, also check other fetcher classes. For example, in our client's project there is also the feeds_http_auth_fetcher module with its own fetcher, plus a custom one, and existing feed entities use all three. So I had to change those too, similar to the HTTP fetcher.

    • MegaChriz committed 4062e01b on 8.x-3.x
      Issue #2912130 by Fant0m, MegaChriz, artematem, n4r3n, merauluka, bburg...
  • Status changed to Fixed over 1 year ago
  • 🇳🇱Netherlands megachriz

    @rdivince
    I had been giving Feeds a low priority for the past month, so I guess that for your patch I'll need to dive deeper to find out how remote file systems work. Because of that, I'm merging the changes of this issue now and moving your patch to a follow-up issue. Perhaps we can find a way to add test coverage for remote stream wrappers too?

  • Automatically closed - issue fixed for 2 weeks with no activity.

  • Status changed to Fixed over 1 year ago
  • 🇺🇸United States bkosborne New Jersey, USA

    Glad to see this landed! Thanks to everyone who contributed.

    @MegaChriz any idea when you might cut a new beta that includes this?

  • 🇳🇱Netherlands megachriz

    @bkosborne
    I hope within a month from now. There is a follow-up issue that I would like to include in the next release as well: 🐛 FetcherResult::checkFile() throws \RuntimeException, which is never caught [Needs review]. While the fix for this issue only aims to prevent new imports from ending abruptly, that other issue should help clean up existing imports that are (still) stuck in a loop as a result of the issue reported here.

  • 🇩🇪Germany c-logemann Frankfurt/M, Germany

    I understand this new feature for fixing load-balancing issues. But these new tmp files inside the private folder can also break some backup strategies. So where is the "configurable" part?

    Put http downloaded files in a configurable directory and added service to interact with this directory - to fix issue with missing temporary files in load balanced environments.

  • 🇪🇨Ecuador jwilson3

    Use the in_progress_dir config element in the Feeds module's feeds.settings.yml config file.

    The default location is now private://feeds/in_progress, assuming you've configured a private files directory; otherwise, it is public://feeds/in_progress.

    Review the commit from above for additional context:

    https://git.drupalcode.org/project/feeds/-/commit/4062e01bb8471254d9daa5...
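
    For example, a settings.php override might look like this (a sketch; the config key follows the feeds.settings object described above):

      $config['feeds.settings']['in_progress_dir'] = 'private://feeds/in_progress';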

  • πŸ‡©πŸ‡ͺGermany c-logemann Frankfurt/M, Germany

    @jwilson3 Thanks for the description. So it's a good candidate to set in settings.php.

    In many situations I think configuration via settings.php is enough, especially for very special functionality. And this feature is such a special thing to solve load-balancing problems, but it can raise problems for non-advanced admins. So maybe it would be fine to make this feature and a setting for it more visible. But it's just an idea.

    For our customers we will just exclude this folder via backup scripts. We already do this with other temporary folders, like image styles, rendered CSS, etc.
