No results from file_extractor extracted data in Drupal 10.1

Created on 5 September 2023, about 1 year ago
Updated 6 September 2023, about 1 year ago

Problem/Motivation

This is a pretty strange issue.
This happens only with Drupal 10.1 and search_api_meilisearch (the version does not seem to matter; I use 1.0.x-dev): when I index file extracts as full-text searchable fields and then try to retrieve them, nothing comes back in my search view.

Switching to the search_api DB backend works.
Using Meilisearch's dev dashboard also correctly retrieves data, so the indexing also works.
If I run Meilisearch in debug mode, I even see the request coming in correctly and results also getting sent back. They just don't appear in my search view.
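
To repeat this check outside of Drupal, a query can be sent straight to Meilisearch's search endpoint. This is only a minimal sketch: the URL, the index UID `my_index` and the `MEILI_*` variable names are assumptions, not values from this setup, and `search_payload` does no JSON escaping of the search term.

```shell
# Hypothetical defaults; adjust to your own Meilisearch instance.
MEILI_URL="${MEILI_URL:-http://localhost:7700}"
MEILI_INDEX="${MEILI_INDEX:-my_index}"
MEILI_KEY="${MEILI_KEY:-}"

# Build the JSON body for a search request (no escaping: keep the term simple).
search_payload() {
  printf '{"q":"%s","limit":3}' "$1"
}

# POST the query to /indexes/<uid>/search and print the raw JSON response.
meili_search() {
  if [ -n "$MEILI_KEY" ]; then
    curl -s -X POST "$MEILI_URL/indexes/$MEILI_INDEX/search" \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $MEILI_KEY" \
      --data "$(search_payload "$1")"
  else
    curl -s -X POST "$MEILI_URL/indexes/$MEILI_INDEX/search" \
      -H 'Content-Type: application/json' \
      --data "$(search_payload "$1")"
  fi
}
```

Calling `meili_search someword` with a word that occurs in one of the PDFs should return hits that include the extracted text. If it does while the view stays empty, the data is lost somewhere between the backend response and the view.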

On Drupal 9.5, it works as expected.

I am unsure where to start debugging this, or whether this is a problem in this module at all (but I think so, as the DB backend works).

Steps to reproduce

configpath=set/this/to/siteconfig/path
apt install poppler-utils  # or however your distribution provides this
composer require drupal/file_extractor:4.1.1
drush pm:enable file_extractor
cat << EOF > $configpath/file_extractor.settings.yml
extraction_method: pdftotext_extractor
extraction_method_settings:
  pdftotext_path: pdftotext
extraction_settings:
  extractable:
    excluded_extensions: 'aif art avi bmp gif ico mov oga ogv png psd ra ram rgb flv'
    max_filesize: '0'
    exclude_private: true
  extraction_result:
    number_first_bytes: '1 MB'
EOF
drush cim
  1. Create content with some PDF attachments.
  2. Create an index containing the content title and the file_extractor extracted text for the attachment fields. Uploading files only as "media" (not attached to a node) and indexing only media may also work.
  3. Index data.
  4. Check in Meilisearch that the extracted data has been created and indexed correctly.
  5. Create a new search view building upon the newly created index.
  6. Add content title and extracted text as fields.
  7. Search for something: only the content title is returned, not the text extract.
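
Step 4 can also be checked from the command line by listing a few documents from the index and looking for the extract field. The sketch below makes several assumptions: `extracted_file` is a hypothetical field name (the real one depends on how the field was named in the index), and the URL and index UID are placeholders.

```shell
# Hypothetical defaults; adjust to your own setup.
MEILI_URL="${MEILI_URL:-http://localhost:7700}"
MEILI_INDEX="${MEILI_INDEX:-my_index}"
MEILI_KEY="${MEILI_KEY:-}"
EXTRACT_FIELD="${EXTRACT_FIELD:-extracted_file}"

# GET a path from Meilisearch, adding the API key only when one is set.
meili_get() {
  if [ -n "$MEILI_KEY" ]; then
    curl -s -H "Authorization: Bearer $MEILI_KEY" "$MEILI_URL$1"
  else
    curl -s "$MEILI_URL$1"
  fi
}

# Succeed if the JSON on stdin contains the given key (a plain text match,
# not a real JSON parse).
json_has_key() {
  grep -q "\"$1\"[[:space:]]*:"
}

# Fetch a handful of documents and report whether the extract field is there.
check_extracts() {
  if meili_get "/indexes/$MEILI_INDEX/documents?limit=5" | json_has_key "$EXTRACT_FIELD"; then
    echo "found $EXTRACT_FIELD"
  else
    echo "missing $EXTRACT_FIELD"
    return 1
  fi
}
```

Running `check_extracts` after indexing gives a quick yes/no answer before building the view in step 5.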

Can anyone reproduce this?
I tried to reduce this to a handful of config files for easy drush cim'ing, but I got lost in dependencies.
I hope the steps above are clear enough; otherwise please get back to me and I will see whether I can improve them!
Thanks a lot in advance!

🐛 Bug report
Status

Closed: works as designed

Version

1.0

Component

Code

Created by

🇦🇹Austria tgoeg

Comments & Activities

  • Issue created by @tgoeg
  • 🇸🇮Slovenia DeaOm

    I followed the steps mentioned but could not reproduce the issue. Tested with D10.1.1, the search_api_meilisearch dev version (git checkout, not composer install), file_extractor 4.1.1, and poppler-utils for the PDF-to-text conversion. I get the title and the extracted file field displayed in Drupal. I also checked Meilisearch, and it works properly there too. I'm not familiar enough with file_extractor to know what the issue could be, given that, as you said, it works with the DB server.
    Do you see any errors, maybe in console or in the logs?
    Will leave this open, so anybody else can also try and reproduce the issue.

  • Status changed to Closed: works as designed about 1 year ago
  • 🇦🇹Austria tgoeg

    I drilled down deeper on this.

    I have a working setup now as well. If I had followed the instructions above myself, it would have worked out as well :-)

    The problem might stem from the ID mappings (now made obsolete by 📌 Remove entity id mapping config Fixed ) and the fact that I share indexes between multiple instances (D10.1 and D9.5) for quicker testing (the indexes are pretty huge and I don't want to wait for indexing during testing). However, I still don't understand why only some fields get displayed and others do not.

    What seems to have fixed it:

    • Updated to current dev-1.0.x
    • Deleted index in search_api config (re-indexing would not fix it and this is something I will add to another ticket; stems from 🐛 Remove the possibility to add field with machine name id Fixed as it wanted to create the column "id" another time)
    • Re-imported config (to recreate index and deleted views that got dropped together with the index)
    • Re-indexed nodes
    • Profit

    I guess this can be closed.
    The lesson might be that sharing indexes is not yet a good idea, even though it mostly seems to work. Maybe it fully works once all instances use a version that includes the fix from 📌 Remove entity id mapping config Fixed .
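
The recovery steps listed above can be sketched as a small script. It rests on a few assumptions: `my_index` is a hypothetical index machine name, the index itself is deleted by hand in the search_api admin UI between the two functions (as described in the comment), and the `run` helper with its `DRY_RUN` switch is purely illustrative.

```shell
# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
  echo "+ $*"
  [ -n "${DRY_RUN:-}" ] || "$@"
}

# Step 1: update the module to the current 1.0.x dev snapshot.
update_module() {
  run composer require 'drupal/search_api_meilisearch:1.0.x-dev@dev'
}

# Steps 3 and 4: re-import config (recreates the index and views), then re-index.
# $1 is the index machine name (hypothetical example: my_index).
reimport_and_reindex() {
  run drush cim -y &&
  run drush search-api:index "$1"
}
```

A dry run with `DRY_RUN=1 update_module` and `DRY_RUN=1 reimport_and_reindex my_index` only prints the commands, which makes it easy to review them before touching a real site.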
