in_progress filesystem grows indefinitely

Created on 5 July 2023, over 1 year ago
Updated 3 June 2024, 6 months ago

Problem/Motivation

There is a site running Feeds 3.0-beta4 with several feeds that we import regularly.
Recently the daily backups started to fail, and this turned out to be the root cause:

- files/private/feeds/in_progress/138 (83,832 files / 4.9 GB)
- files/private/feeds/in_progress/158 (259,113 files / 15 GB)
- files/private/feeds/in_progress/144 (260,301 files / 15 GB)
- files/private/feeds/in_progress/137 (263,489 files / 15 GB)

Steps to reproduce

Configure feeds with a frequent refresh rate; the in_progress filesystem should then grow with each run, I assume.

Proposed resolution

Define some kind of retention policy for the in_progress filesystem and clean up data that has outlived it.
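
For illustration only, a minimal sketch of such a retention cleanup as a hook_cron() implementation in a hypothetical custom module (the module name and the 7-day threshold are assumptions; the directory path is the one from the report above). Note that the fix that was eventually committed takes a different approach; see the comments below.

    /**
     * Implements hook_cron().
     *
     * Hypothetical sketch: remove in_progress subdirectories that have
     * not been modified for 7 days.
     */
    function mymodule_cron() {
      $file_system = \Drupal::service('file_system');
      // The directory reported to grow indefinitely.
      $directory = 'private://feeds/in_progress';
      if (!is_dir($directory)) {
        return;
      }
      $threshold = \Drupal::time()->getRequestTime() - 7 * 24 * 60 * 60;
      foreach (scandir($directory) as $entry) {
        if ($entry === '.' || $entry === '..') {
          continue;
        }
        $path = $directory . '/' . $entry;
        if (is_dir($path) && filemtime($path) < $threshold) {
          $file_system->deleteRecursive($path);
        }
      }
    }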

Remaining tasks

Implement the change.

User interface changes

Admin config settings for this behavior.

API changes

None.

Data model changes

None.

Test scenario

  • Create a feed type using the fetcher “Download” (HTTP fetcher).
  • Create a feed and provide a URL that does not exist.
  • Run the import in background or configure periodic import.

Watch what happens on the file system each time you run cron; the snippet below is one way to track the growth.
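
For example, you could count the files and bytes under the in_progress directory after each cron run. A quick sketch, e.g. runnable via drush php:eval; the path is the one from the report above:

    // Count files and total size under the in_progress directory.
    $count = 0;
    $bytes = 0;
    $iterator = new \RecursiveIteratorIterator(
      new \RecursiveDirectoryIterator('private://feeds/in_progress', \FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
      if ($file->isFile()) {
        $count++;
        $bytes += $file->getSize();
      }
    }
    print $count . ' files, ' . $bytes . ' bytes' . PHP_EOL;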

🐛 Bug report
Status

Fixed

Component

Code

Created by

🇭🇺 aron novak (Hungary, Budapest)



Comments & Activities

  • Issue created by @aron novak
  • 🇭🇺 aron novak (Hungary, Budapest)

    Note that not all feeds were affected: only 4 of ~20. However, those alone generated ~50 GB of disk usage.

  • Status changed to Postponed: needs info over 1 year ago
  • 🇳🇱 megachriz

    Hm, after an import the file from which content was imported should get cleaned up. See the following:

    FeedsExecutable::finish():

    $fetcher_result->cleanUp();
    $feed->finishImport();
    

    HttpFetcherResult::cleanUp():

    /**
     * {@inheritdoc}
     */
    public function cleanUp() {
      if ($this->filePath) {
        $this->fileSystem->unlink($this->filePath);
      }
    }
    

    Feed::finishImport() calls Feed::unlock(), and in that method there is:

    \Drupal::service('feeds.file_system.in_progress')->removeFiles((string) $this->id());
    

    FeedsFileSystemBase::removeFiles():

    /**
     * {@inheritdoc}
     */
    public function removeFiles(string $subdirectory): bool {
      $directory = $this->getFeedsDirectory() . '/' . $subdirectory;
      if (!is_dir($directory)) {
        // Directory does not exist. Abort.
        return FALSE;
      }
      $result = $this->fileSystem->deleteRecursive($directory);
      return $result ? TRUE : FALSE;
    }
    

    Perhaps there is a case that I overlooked in which the file cleanup doesn't happen? I need more specific steps to reproduce. If possible, a test that demonstrates the issue would be nice.

  • 🇭🇺 aron novak (Hungary, Budapest)

    I will try to come up with a test. What I have concluded so far is that the problematic feeds have URLs that return a 404 error, not HTTP 200.

  • 🇳🇱 megachriz

    Feed URLs that return a 404: that is useful information! With that I'm able to reproduce the issue. I see that the issue exists only on cron runs, not when importing through the UI.
    I also see that every time I run cron, one more downloaded file gets added. This makes me think that this issue is closely related to 🐛 FetcherResult::checkFile() throws \RuntimeException, which is never caught (Needs review), also because it is about a RuntimeException. That issue is currently somewhat stuck on the question: when should an import be aborted, and when should it be retried at a later time?

  • Status changed to Active over 1 year ago
  • 🇺🇸 bkosborne (New Jersey, USA)
  • 🇳🇱 megachriz

    A possible fix for this issue has been added to 🐛 FetcherResult::checkFile() throws \RuntimeException, which is never caught (Needs review). The test from here has been moved over there as well.

    @Aron Novak
    @bkosborne
    Do you want to review the changes here? Perhaps you also have an opinion on the risk of imports getting aborted on temporary failures.
    Personally, I think that a never-ending import with files piling up is more serious than an import getting aborted on a timeout or a similar issue.

  • Status changed to Needs review 11 months ago
  • 🇳🇱 megachriz

    Because I think that 🐛 FetcherResult::checkFile() throws \RuntimeException, which is never caught (Needs review) requires some more thought, I've come up with a simpler fix for this issue: when a RequestException occurs in HttpFetcher::get(), delete the temporarily created file immediately. This does not stop Feeds from trying to fetch over and over again, but it does keep the file system from growing indefinitely.
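
    A minimal sketch of the shape of that fix (not the committed code verbatim), assuming the fetcher streams the download into a temporary file via Guzzle's 'sink' option:

    try {
      // Guzzle writes the response body to the temporary file at $sink.
      $response = $this->client->getAsync($url, ['sink' => $sink])->wait();
    }
    catch (RequestException $e) {
      // The request failed (for example with a 404), so remove the
      // temporary file instead of leaving it behind in in_progress.
      if (file_exists($sink)) {
        $this->fileSystem->unlink($sink);
      }
      throw $e;
    }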

    The fact that Feeds tries to fetch over and over again should be addressed in 🐛 FetcherResult::checkFile() throws \RuntimeException, which is never caught (Needs review). The thing there that requires some more thought is in which situations Feeds should retry and in which it should abort the import.

    Leaving the proposed fix here for review for a few weeks. I plan to commit it sometime in January, should there be no feedback.

  • Status changed to Fixed 6 months ago
  • 🇳🇱 megachriz

    Okay, I merged this and retested the scenario. When a resource that Feeds tries to fetch returns a 404, Feeds will keep trying to fetch it on every cron run. But with the changes from this issue, it no longer stores a file on the file system.

    It has been a while for me, but I think that the other issue (Feeds endlessly trying to fetch a resource) should be handled in 🐛 FetcherResult::checkFile() throws \RuntimeException, which is never caught (Needs review).

    And with this change in, we can plan a new Feeds release: the last Feeds release with Drupal 9 support.

  • 🇳🇱 megachriz

    I see that the reported issue can technically still happen in the case of a 304, but it seems to happen only when fetching is done 'stand-alone', separate from the import process. That will be fixed in 🐛 Delete Empty Files Created If Empty/Error Exception is Returned (Needs review).

  • Automatically closed - issue fixed for 2 weeks with no activity.
