- Issue created by @pacproduct
- 🇺🇸United States cmlara
Fundamentally, our metadata caching is currently designed to solve this by relying on the Drupal Cache system.
Are you currently using a memory cache service (memcached or redis)? The default in Drupal is SQL.
There are some workflows where a memory-based cache service might not help (e.g. a directory only scanned once a month); however, if you are iterating over the folder frequently (and your cache bin is large enough), a significant amount of this data could stay in memory across multiple requests.
You might need to patch s3fs.module to change the cache bin (it was likely not the greatest choice to use "default"; IIRC we have already fixed that in 4.x).
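For reference, lookups through a dedicated bin would look roughly like the sketch below. The bin name 's3fs', the cache ID format, and the helper/loader are illustrative assumptions, not the module's actual code:

```php
<?php

use Drupal\Core\Cache\CacheBackendInterface;

// Hypothetical sketch of metadata lookups going through a dedicated cache
// bin instead of the shared "default" bin.
//
// A dedicated bin would be declared in the module's *.services.yml, e.g.:
//   cache.s3fs:
//     class: Drupal\Core\Cache\CacheBackendInterface
//     tags:
//       - { name: cache.bin }
//     factory: ['@cache_factory', 'get']
//     arguments: [s3fs]
function example_s3fs_get_metadata(string $uri, callable $loader): array {
  $cache = \Drupal::cache('s3fs');
  $cid = 's3fs:metadata:' . $uri;

  // Serve from the dedicated bin when possible.
  if ($cached = $cache->get($cid)) {
    return $cached->data;
  }

  // Cache miss: load from the database (or S3) and keep the entry until
  // the bin is explicitly invalidated.
  $metadata = $loader($uri);
  $cache->set($cid, $metadata, CacheBackendInterface::CACHE_PERMANENT);
  return $metadata;
}
```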
Without looking too deeply, I can see a few concerns that may pose an issue and need to be investigated. While Drupal core re-uses a StreamWrapper instance for the entire request, PHP constructs a new instance for almost every separate file operation that isn't part of the same file descriptor (i.e. if you did not use fopen() or equivalent and pass around an FD). I suspect this means stat() in this case is being called across multiple instances, which would require a globally static cache that, for operational reasons, can't be cleared (one could also run dir_opendir() at multiple locations, complicating the issue), or that could in some workflows quickly negate any benefits.
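To illustrate that concern, here is a small standalone demo (nothing from s3fs, names are purely illustrative) showing that PHP creates a fresh wrapper object for each standalone file operation, so a cache held in object properties would not survive from one stat() to the next:

```php
<?php

class DemoWrapper {

  /** @var resource|null Stream context, required by the wrapper API. */
  public $context;

  private static int $instances = 0;

  public function __construct() {
    self::$instances++;
  }

  // Minimal url_stat() so file_exists() succeeds; a real wrapper would
  // look the metadata up in its backend here.
  public function url_stat(string $path, int $flags) {
    $stat = array_fill_keys(
      ['dev', 'ino', 'mode', 'nlink', 'uid', 'gid', 'rdev', 'size',
       'atime', 'mtime', 'ctime', 'blksize', 'blocks'], 0);
    $stat['mode'] = 0100644;
    // Provide both named and numeric keys, as stat() does.
    return $stat + array_values($stat);
  }

  public static function instanceCount(): int {
    return self::$instances;
  }

}

stream_wrapper_register('demo', DemoWrapper::class);

// Three standalone operations end up on three separate wrapper instances.
file_exists('demo://a.txt');
file_exists('demo://b.txt');
file_exists('demo://c.txt');

echo DemoWrapper::instanceCount() . PHP_EOL; // 3
```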
Moving to 4.x as a feature request. Not closing, as it is an interesting thought on how to possibly make the database access more efficient; however, without further research I'm not sure how feasible this is, and I would initially be looking for feedback on how your cache service is set up, as that may be the better solution to this request.
- 🇫🇷France pacproduct
Hi @cmlara, and thanks for your quick feedback :)
We're using the default cache backend at the moment, which is the database indeed.
When profiling what the imce module does, I end up with the following call map (sorry, it's in French; the columns read: Incl., Self, Distance, Call(s), Function called). We can clearly see that the database gets hit 19K+ times (the number of files), and I believe you're right: that is via Drupal\Core\Cache\DatabaseBackend, so using another cache system is an interesting lead. I'll investigate. Thanks.
- Status changed to Postponed: needs info (2 May 2024)
- 🇺🇸United States cmlara
Setting Postponed-NMI, awaiting feedback on the impact of the cache-service change.
- 🇫🇷France pacproduct
Hi again! :)
I configured memcache on our project as the cache backend for all bins and here is what I can observe, with "hot" caches in both cases:
- Time to scan my directory without memcache: 8 seconds
- Time to scan my directory with memcache: 3 seconds
So, using a proper cache backend does improve things a lot. In my case though, these are the times to load a single picture on the site, so it is still too slow, but I think the remaining optimizations need to be done on IMCE's end (see related issue: www.drupal.org/project/imce/issues/3443768 🐛 Poor performance when downloading files from a non-standard filesystem).
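For anyone wanting to reproduce this, the settings.php configuration looks roughly like the sketch below (this assumes the contrib memcache module; the server address is an example and your exact settings may differ):

```php
<?php

// Point the memcache module at the memcached server(s).
$settings['memcache']['servers'] = ['127.0.0.1:11211' => 'default'];

// Route every cache bin to memcache (what I did for this test)...
$settings['cache']['default'] = 'cache.backend.memcache';

// ...or only specific bins, keeping the rest in the database.
// $settings['cache']['bins']['data'] = 'cache.backend.memcache';
```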
For reference, I ran the test again while profiling with Xdebug. Response times are much worse under profiling, as expected, but the performance difference with/without memcache is still visible:
Without memcache: 28 seconds total.
With memcache: 19 seconds total.
----
I guess s3fs could still preload all metadata when opening a directory as a "nice to have", but considering the results above and your explanations, I think the main culprit on our project isn't s3fs but the way it gets used by IMCE, plus the fact that the website we're migrating has 20K files in a single directory, which doesn't sound like good practice to begin with.
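Something like the following rough sketch is what I had in mind (class, method, column and table names are illustrative only, not s3fs's actual code):

```php
<?php

/**
 * Rough sketch of the "preload on directory open" idea.
 */
class PreloadingStreamWrapper {

  /**
   * Per-request metadata cache keyed by URI, static so it is shared by
   * every wrapper instance PHP creates during the request.
   */
  protected static array $metadataCache = [];

  public function dir_opendir($uri, $options) {
    $db = \Drupal::database();
    $prefix = rtrim($uri, '/') . '/';

    // One query for the whole directory instead of one lookup per file.
    $rows = $db->select('s3fs_file', 'f')
      ->fields('f')
      ->condition('uri', $db->escapeLike($prefix) . '%', 'LIKE')
      ->execute()
      ->fetchAllAssoc('uri', \PDO::FETCH_ASSOC);
    self::$metadataCache += $rows;

    // ... build the directory listing for dir_readdir() from $rows ...
    return TRUE;
  }

  public function url_stat($uri, $flags) {
    if (isset(self::$metadataCache[$uri])) {
      // Served from the preloaded data: no per-file database round trip.
      return $this->metadataToStat(self::$metadataCache[$uri]);
    }
    // Fall back to the regular per-file lookup.
    // ...
  }

  protected function metadataToStat(array $metadata): array {
    // Map the cached row to a stat() array (simplified).
    $stat = array_fill_keys(
      ['dev', 'ino', 'mode', 'nlink', 'uid', 'gid', 'rdev', 'size',
       'atime', 'mtime', 'ctime', 'blksize', 'blocks'], 0);
    $stat['size'] = (int) ($metadata['filesize'] ?? 0);
    $stat['mtime'] = (int) ($metadata['timestamp'] ?? 0);
    $stat['mode'] = !empty($metadata['dir']) ? 040755 : 0100644;
    return $stat + array_values($stat);
  }

}
```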
Thanks a lot for your help @cmlara :)
I'll let you decide whether this issue should stay open for a future improvement of s3fs or not.
Cheers.
- Status changed to Active (3 May 2024)
- 🇺🇸United States cmlara
High-level read of the linked issue:
That does indeed sound like an issue in IMCE if they are processing unrelated content for a single file during hook_file_download(). That is the sort of performance fault that may not show up on local storage yet will show up on network storage. The 62% decrease in time is impressive and shows what a good cache can deliver.
I don't personally see 20k files as a problem on its own. A 3-8 second time would normally not be as significant if this were a backend operation. I wonder how often stat()-ing every file in a directory is done for frontend-facing operations.
On top of that, if a program did need all of this data for frontend usage, it should be prepared to page the data.
There will always be some performance limitations we can't overcome. That said, I don't want to say this is impossible; however, I don't see it as a high priority at the moment, as it may be fixable through more efficient filesystem usage.
Setting to minor. If anyone can come up with a reliable method for this, it would be worth considering; at the moment, however, I'm not going to spend a significant amount of effort looking into it.
Side note: we will drop that additional 14% for preventCrossSchemeAccess() in 4.x, as we will no longer have that security concern.
This does make me wonder if we should (since we have GitLab CI now) run regular profiling as part of the build process.