Improve performance when scanning a directory's files' metadata

Created on 26 April 2024
Updated 3 May 2024

Problem/Motivation

I'm currently working on a Drupal 10 site which combines this module (s3fs) and imce.

For legacy reasons, all media live in a single directory, which contains ~20K images.

Because of how it is implemented in imce, this causes very slow scanning of directories. Here is what imce does (in \Drupal\imce\Imce::scanDir):

    if (!$opendir = opendir($diruri)) {
      return $content + ['error' => TRUE];
    }
    [...]
    while (($filename = readdir($opendir)) !== FALSE) {
      [...]
      $fileuri = $uriprefix . $filename;
      $is_dir = is_dir($fileuri);
      if ($is_dir ? !$browse_subfolders : !$browse_files) {
        continue;
      }
      [...]
    }
    closedir($opendir);

My understanding is that \Drupal\s3fs\StreamWrapper\S3fsStream::dir_opendir is pretty smart: it preloads all entries' names in one go by loading all sub-entries from the database.

This makes the while loop above quite efficient, as \Drupal\s3fs\StreamWrapper\S3fsStream::dir_readdir only uses data already loaded in RAM.

However, each loop iteration calls is_dir(), which ends up in a call to \Drupal\s3fs\StreamWrapper\S3fsStream::stat.

In our particular case, that is what causes one database query per file when scanning a directory.
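
For context on the mechanism: PHP routes is_dir() on a custom scheme through the stream wrapper's url_stat() method, which is how every loop iteration reaches S3fsStream::stat. A minimal standalone sketch; the demo:// scheme and class are invented for illustration:

    // Invented wrapper, registered on a made-up scheme, showing how
    // is_dir() lands in url_stat().
    class DemoWrapper {

      /** @var resource|null Context property PHP expects on wrappers. */
      public $context;

      public function url_stat(string $uri, int $flags) {
        // In s3fs, this is the point where each file triggers a
        // database lookup when its metadata is not already cached.
        echo "url_stat() called for $uri\n";
        return FALSE;
      }

    }

    stream_wrapper_register('demo', 'DemoWrapper');
    is_dir('demo://bucket/file.jpg'); // Prints: url_stat() called for demo://bucket/file.jpg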

Steps to reproduce

Scan a directory containing thousands of files, and test whether each entry is a directory or not, as illustrated by the code above.

Proposed resolution

I was wondering if it would be a good idea to offer a new configuration option for s3fs that preloads all files' metadata when a call is made to \Drupal\s3fs\StreamWrapper\S3fsStream::dir_opendir.

The idea is that S3fsStream::dir_opendir would not only load file names into RAM as it does today, but also all files' metadata, so that subsequent calls to S3fsStream::stat do not need to query the database (within the same web request).
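
To make the proposal concrete, here is a rough sketch, not the actual s3fs code; fetchChildRows() and formatStatArray() are invented placeholders for however s3fs queries its metadata table and builds stat arrays:

    class S3fsStream {

      /**
       * Per-request metadata cache, keyed by URI. Static so it is
       * shared across the wrapper instances PHP creates for each
       * separate file operation.
       */
      protected static $preloadedMetadata = [];

      public function dir_opendir($uri, $options) {
        // Today only the child names are loaded. Proposed: fetch the
        // full metadata rows in the same query and keep them around.
        foreach ($this->fetchChildRows($uri) as $row) {
          static::$preloadedMetadata[$row['uri']] = $row;
        }
        // ... existing logic storing the name list for dir_readdir() ...
        return TRUE;
      }

      protected function stat($uri, $flags) {
        // Serve from the preloaded cache when possible ...
        if (isset(static::$preloadedMetadata[$uri])) {
          return $this->formatStatArray(static::$preloadedMetadata[$uri]);
        }
        // ... otherwise fall back to the existing per-file query.
      }

    }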

What are your thoughts?

User interface changes

Add a checkbox on the settings screen so this behaviour can be toggled, because I suspect preloading could cause RAM issues for some specific setups; it should perhaps stay off by default, but I'm unsure.
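
For illustration, the toggle could be a standard Form API checkbox inside the module's ConfigFormBase-derived settings form; the preload_dir_metadata key is an invented name:

    // Hypothetical addition to the s3fs settings form.
    $form['preload_dir_metadata'] = [
      '#type' => 'checkbox',
      '#title' => $this->t('Preload file metadata when listing directories'),
      '#description' => $this->t('Loads metadata for every entry of a directory in a single query. May increase memory usage for very large directories.'),
      '#default_value' => $config->get('preload_dir_metadata'),
    ];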

API changes

None.

Data model changes

None, I think.

✨ Feature request
Status

Active

Version

4.0

Component

Code

Created by

🇫🇷 France pacproduct


Comments & Activities

  • Issue created by @pacproduct
  • 🇺🇸 United States cmlara

    Fundamentally, our metadata caching is currently designed to solve this by relying on the Drupal Cache system.

    Are you currently using a memory cache service (memcached or redis)? The default in Drupal is SQL.

    There are some workflows where a memory-based cache service might not help (e.g. a directory scanned only once a month). However, if you are iterating over the folder frequently (and your cache bin is large enough), a significant amount of this data could stay in memory over multiple requests.
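
    For reference, pointing the cache at memcache is a settings.php change once the memcache contrib module is installed (the server address below is an example):

      // settings.php -- route the default cache backend to memcache.
      $settings['memcache']['servers'] = ['127.0.0.1:11211' => 'default'];
      $settings['cache']['default'] = 'cache.backend.memcache';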

    You might need to patch s3fs.module to change the cache bin (using "default" was likely not the greatest choice; IIRC we have already fixed that in 4.x).
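
    To illustrate the bin point: which backend serves the metadata depends on the bin name passed to the cache factory, and a single bin can be remapped in settings.php (the "s3fs" bin name here is an assumption):

      // Module code reads from a dedicated bin instead of "default".
      $cache = \Drupal::cache('s3fs');

      // settings.php -- send just that bin to a memory backend.
      $settings['cache']['bins']['s3fs'] = 'cache.backend.memcache';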

    Without looking too deeply, I can see a few concerns that may pose an issue and need to be investigated. While Drupal Core re-uses a StreamWrapper instance for the entire request, PHP constructs a new instance for almost every separate file operation that isn't part of the same file descriptor (i.e. unless you used fopen() or equivalent and passed around an FD). I suspect this means stat() is being called on multiple instances here, which would require a globally static cache that, for operational reasons, can't be cleared (multiple dir_opendir() calls at different locations would complicate this further), or it could in some workflows quickly negate any benefits.
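
    A quick standalone demonstration of that instance behaviour (scheme and class invented for the demo); only static state survives across separate file operations:

      class CountingWrapper {

        public $context;

        private static $instances = 0;

        public function __construct() {
          // PHP constructs a fresh wrapper per stream operation.
          self::$instances++;
        }

        public function url_stat($uri, $flags) {
          echo 'instances created so far: ' . self::$instances . "\n";
          return FALSE;
        }

      }

      stream_wrapper_register('count', 'CountingWrapper');
      is_dir('count://a'); // instances created so far: 1
      is_dir('count://b'); // instances created so far: 2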

    Moving to 4.x as a feature request. Not closing, as it is an interesting thought on how to possibly make the database access more efficient; however, without further research I'm not sure how feasible this is, and I would initially be looking for feedback on how your cache service is set up, as that may be the better solution to this request.

  • 🇫🇷 France pacproduct

    Hi @cmlara and thanks for your quick feedback :)

    We're using the default cache backend at the moment, which is the database indeed.

    When profiling what the imce module does, I end up with the following call map (sorry, it's in French; the columns read: Incl., Self, Distance, Call(s), Function called).

    We can clearly see that the database gets hit 19K+ times (the number of files), and I believe you're right: that happens via Drupal\Core\Cache\DatabaseBackend. Using another cache system is an interesting lead; I'll investigate. Thanks.

  • Status changed to Postponed: needs info about 2 months ago
  • 🇺🇸 United States cmlara

    Setting Postponed-NMI while awaiting feedback on the impact of changing the cache service.

  • 🇫🇷 France pacproduct

    Hi again! :)

    I configured memcache on our project as the cache backend for all bins, and here is what I can observe, with "hot" caches in both cases:

    • Time to scan my directory without memcache: 8 seconds
    • Time to scan my directory with memcache: 3 seconds

    So, using a proper cache backend does improve things a lot. In my case, though, these are the times to load a single picture on the site, so it is still too slow; but I think the remaining optimizations need to happen on IMCE's end (see related issue: www.drupal.org/project/imce/issues/3443768 🐛 Poor performance when downloading files from a non-standard filesystem, Active).

    For information, I ran the test again while profiling with Xdebug. Response times are much worse while profiling, as expected, but the performance difference with/without memcache can still be seen:

    Without memcache: 28 seconds total.

    With memcache: 19 seconds total.

    ----

    I guess s3fs could still preload all metadata when opening a directory as a "nice to have", but considering the results above and your explanations, I think the main culprit on our project isn't s3fs but the way it gets used by IMCE, plus the fact that the website we're migrating has 20K files in a single directory, which doesn't sound like a good practice to begin with.

    Thanks a lot for your help @cmlara :)
    I'll let you decide whether this issue should stay open for a future improvement of s3fs or not.
    Cheers.

  • Status changed to Active about 2 months ago
  • 🇺🇸 United States cmlara

    High-level read of the linked issue:
    That does indeed sound like an issue in IMCE if they are processing unrelated content for a single file during hook_file_download(). That is the sort of performance fault that may not show up on local storage yet will show up on network storage.
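
    For context, hook_file_download() runs once per delivered file, so any heavyweight work there multiplies with traffic. A skeletal reminder of the hook (module name and access helper invented):

      /**
       * Implements hook_file_download().
       */
      function mymodule_file_download($uri) {
        // Anything expensive here (e.g. scanning the whole parent
        // directory) executes on every single file download.
        if (!mymodule_user_may_access($uri)) {
          return -1; // Deny access.
        }
        // Returning headers lets the download proceed.
        return ['Content-Type' => 'image/jpeg'];
      }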

    The 62% decrease in time is an impressive demonstration of what a good cache can deliver.

    I don't personally see 20k files as a problem on its own. A 3-8 second time would normally not be as significant if this were a backend operation. I wonder how often stat()-ing all files in a directory is done for frontend-facing operations.

    On top of that, if a program did need all of this data for frontend usage, it should be prepared to start paging the data.

    There will always be some performance limitations we can't overcome. That said, I don't want to say this is impossible; however, I don't see it being a high priority at the moment either, as this may be fixable by more efficient filesystem usage.

    Setting to minor. If anyone can come up with a reliable method for this, it would be worth considering; however, at the moment I'm not going to spend a significant amount of effort looking into this until less

    Side note: we will drop that additional 14% for preventCrossSchemeAccess() in 4.x as we will not have that as a security concern.

    This does make me wonder whether we should (since we have GitLab CI now) run regular profiling as part of the build process.
