Search API Attachments does not appear to support S3FS

Created on 18 January 2023, about 2 years ago

Problem/Motivation

Drupal 9.5.0. Search API 8.x-1.28. S3FS 8.x-3.1.

Testing whether Search API Attachments will index files on a site that uses the AWS S3 file system. So far, the answer appears to be no.

Steps to reproduce

Install Search API and Search API Attachments (configured to use the Solr extractor).

Index a site with several nodes that contain file attachments. Both the node text and the PDF attachment text are indexed.

Install the S3 File System module (https://www.drupal.org/project/s3fs) and configure it to use AWS S3 as the file system.

Add new nodes with PDF file attachments.

Re-index the site. Text from the new nodes is indexed, but text from the newly added PDFs is not.

The files that are indexed still use the local filesystem, not the configured S3 filesystem.

Note that the file datasource links in the search results DO point to the S3 filesystem, but the content of files that are NOT on the local filesystem does not appear to be indexed.

Proposed resolution

If Search API Attachments does not already support indexing attachments stored on an S3FS file system, modify it so that it does.

Remaining tasks

It is possible that the limitation is in the extractor. If the Solr extractor can only read from the local hard drive, that would make it nearly impossible for Search API Attachments to accommodate this change.

User interface changes

I don't think any user interface changes would be required.

API changes

I've looked at the code and it appears the file that would need to be modified is FilesExtractor.php. So far, I've not been able to figure out where in the code that change should happen.

Data model changes

The only data model change I can think of is that the extractor reads and extracts files from the S3 file system instead of the local file system if S3 is enabled.

Support request
Status

Active

Version

1.0

Component

Code

Created by

United States somebodysysop

Comments & Activities

  • United States Chris Burge

    @SomebodySysop are you seeing anything in the logs?

    Just looking through the code, TextExtractorPluginBase::getRealpath() is of interest to me:

      /**
       * Helper method to get the real path from an uri.
       *
       * @param string $uri
       *   The URI of the file, e.g. public://directory/file.jpg.
       *
       * @return mixed
       *   The real path to the file if it is a local file. An URL otherwise.
       */
      public function getRealpath($uri) {
        $wrapper = $this->streamWrapperManager->getViaUri($uri);
        if($wrapper != FALSE){
          $scheme = $this->streamWrapperManager->getScheme($uri);
          $local_wrappers = $this->streamWrapperManager->getWrappers(StreamWrapperInterface::LOCAL);
          if (in_array($scheme, array_keys($local_wrappers))) {
            return $wrapper->realpath();
          }
          else {
            return $wrapper->getExternalUrl();
          }
        }
      }
    

    I'm wondering if the method is failing to return a usable value here.

  • United States somebodysysop

    Thanks for the response. I have not looked at it in over a year since I don't know enough about the module to develop a patch. It was my hope that someone with much more knowledge would stumble upon this and figure out a solution.

  • This patch will resolve the issue, but please ensure that your site's domain has access to the S3 bucket.

  • New Zealand ericgsmith

    We've been using this module with s3fs for a long time with no additional patches or code changes needed, but from memory, Solr needs to be configured to allow remote streaming.

  • New Zealand ericgsmith

    Went back to have a look at the project we were using for this.

    Originally, when using Solr 8.x, we had enableRemoteStreaming set to true through some custom request dispatcher config, something like:

    search_api_solr.solr_request_dispatcher.request_dispatcher_remote_streaming.yml:

    uuid: ....
    langcode: en
    status: true
    id: request_dispatcher_remote_streaming
    label: 'Remote Streaming'
    minimum_solr_version: 7.0.0
    environments: {  }
    recommended: true
    request_dispatcher:
      name: requestParsers
      enableRemoteStreaming: true
      multipartUploadLimitInKB: -1
      formdataUploadLimitInKB: -1
      addHttpRequestToContext: true
    

    In later Solr versions this changed to being enabled by an environment variable, so now we just have:

    SOLR_OPTS: "-Dsolr.enableRemoteStreaming=true"
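
    For anyone running Solr in a container, the variable above might be wired in roughly like this. This is only a sketch: the `solr` service name and the `solr:9` image tag are assumptions for illustration and are not taken from the project described here; only the `SOLR_OPTS` value comes from the line above.

    ```yaml
    # Hypothetical docker-compose fragment; service name and image tag are assumed.
    services:
      solr:
        image: solr:9
        environment:
          # Lets Solr fetch documents from remote URLs (for example, S3 external URLs).
          SOLR_OPTS: "-Dsolr.enableRemoteStreaming=true"
    ```

    Remote streaming lets Solr fetch arbitrary URLs, so it is probably worth restricting network access to the Solr instance when enabling it.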
    

    But before I fill you with false hope: we were using the S3FS module, but with the public file takeover, meaning the bucket is publicly accessible and the external URL is used. I haven't tested with the non-public wrapper.
