Unable to index a PDF attachment

Created on 19 September 2023, about 1 year ago
Updated 20 September 2023, about 1 year ago

Problem/Motivation

I have a fully-working Search API set up and now I am attempting to add Search API Attachments in order to be able to search the text in PDF documents.

Local set up

Drupal Version 9.5.10
Web server: Apache/2.4.33 (Win64) OpenSSL/1.0.2u mod_fcgid/2.3.9
Database: MySQL 5.7.24
PHP: 8.1.0

Installation

I have installed:

  • Search API Attachments (9.0.2) using Composer:
    composer require 'drupal/search_api_attachments:^9.0'
  • pdftotext using Composer: composer require ottosmops/pdftotext

I have tested the pdftotext PHP extractor and received confirmation that it works:
"Extracted data: The extraction seems working! Yay! Congratulations! "

My simple scenario

I have an Entity reference field called 'field_download_restricted' which references a Media type 'Document', and that in turn points to a local PDF file. All of that works perfectly well with respect to the rest of the site functionality.

I have configured my Search API index correctly, with field_download_restricted set to the Fulltext type, and I initiated a re-index.
All appears to be well in terms of indexing, i.e. there are no immediate complaints and a report message says "All items were indexed successfully", but my View, which does a full-text search on several fields including field_download_restricted, does not find what I am looking for in the PDFs.

I'm pretty sure that I have set everything up correctly Search API-wise.

When I look in the Watchdog log, I see the following two errors logged during indexing:

Error 1: Deprecated function

Location	http://bit-by-bit/batch?_format=json&id=2535&op=do
Referrer  http://bit-by-bit/batch?id=2535&op=start
Deprecated function: mb_strcut(): Passing null to parameter #1 ($string) of type string is deprecated in Drupal\search_api_attachments\Plugin\search_api\processor\FilesExtractor->limitBytes() (line 412 of <...>Sites\bit-by-bit.org\public_html\modules\contrib\search_api_attachments\src\Plugin\search_api\processor\FilesExtractor.php)
#0 <...>Sites\bit-by-bit.org\public_html\core\includes\bootstrap.inc(347): _drupal_error_handler_real(8192, 'mb_strcut(): Pa...', '<...>...', 412)
#1 [internal function]: _drupal_error_handler(8192, 'mb_strcut(): Pa...', '<...>...', 412)
#2 <...>Sites\bit-by-bit.org\public_html\modules\contrib\search_api_attachments\src\Plugin\search_api\processor\FilesExtractor.php(412): mb_strcut(NULL, 0, 1048576.0)
#3 <...>Sites\bit-by-bit.org\public_html\modules\contrib\search_api_attachments\src\Plugin\search_api\processor\FilesExtractor.php(296): Drupal\search_api_attachments\Plugin\search_api\processor\FilesExtractor->limitBytes(NULL)
#4 <...>Sites\bit-by-bit.org\public_html\modules\contrib\search_api_attachments\src\Plugin\search_api\processor\FilesExtractor.php(253): Drupal\search_api_attachments\Plugin\search_api\processor\FilesExtractor->extractOrGetFromCache(Object(Drupal\node\Entity\Node), Object(Drupal\file\Entity\File), Object(Drupal\search_api_attachments\Plugin\search_api_attachments\PdftotextExtractor))
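
The trace shows limitBytes() receiving NULL, which suggests (though the log alone does not confirm it) that the extractor returned no text for at least one file. A minimal sketch for checking this outside Drupal follows. It assumes a POSIX-style shell (for example Git Bash or WSL on Windows) and the pdftotext binary on the PATH; the function name extracted_bytes and the file name document.pdf are hypothetical.

```shell
# Hypothetical helper: counts the non-whitespace bytes read from standard input.
# A result of 0 means the text that was piped in is effectively empty.
extracted_bytes() {
  n=$(tr -d '[:space:]' | wc -c)
  # Arithmetic expansion strips the padding some wc implementations add.
  echo "$(( $n ))"
}

# Example usage (assumes pdftotext is installed; "-" sends the text to stdout):
#   pdftotext document.pdf - | extracted_bytes
```

A count of 0 for a PDF that visibly contains text would point at the extraction step (an image-only, scanned PDF is one possible cause) rather than at the Search API configuration.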

Error 2: An overlong word

Location	http://bit-by-bit/batch?_format=json&id=2535&op=do
Referrer  http://bit-by-bit/batch?id=2535&op=start
An overlong word (more than 50 characters) was encountered while indexing: xttirrsgrigadtnaoeetlneoergnmociltraefewlyaoitgntrhupiheadelnohesolotfsnsraawbupdiiinunsn1sdat.
Since database search servers currently cannot index words of more than 50 characters, the word was truncated for indexing. If this should not be a single word, please make sure the "Tokenizer" processor is enabled and configured correctly for index External resources.

I have no idea where that offending overly long word comes from...
xttirrsgrigadtnaoeetlneoergnmociltraefewlyaoitgntrhupiheadelnohesolotfsnsraawbupdiiinunsn1sdat
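
One way to find out whether such a word comes from the PDF text itself is to scan the extracted text for tokens above the 50-character limit quoted in the log message. This is only a sketch: it assumes a POSIX-style shell (for example Git Bash or WSL on Windows) and the pdftotext binary on the PATH, it splits on whitespace only and so does not reproduce the Search API Tokenizer, and the names find_overlong_words and document.pdf are hypothetical.

```shell
# Hypothetical helper: prints every whitespace-separated token on standard
# input that is longer than the given limit (default 50, the limit quoted
# in the log message).
find_overlong_words() {
  tr -s '[:space:]' '\n' | awk -v max="${1:-50}" 'length($0) > max'
}

# Example usage (assumes pdftotext is installed; "-" sends the text to stdout):
#   pdftotext document.pdf - | find_overlong_words 50
```

If the word shows up in this output, it is present in the extracted text; if nothing is printed, it is more likely being produced at a later stage.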

RE: "...please make sure the "Tokenizer" processor is enabled and configured correctly for index External resources."

I have enabled the Tokenizer processor and re-indexed, but both error messages still appear on every re-index.

If anyone can offer any pointers as to what I might try next, I'd be very grateful.

Thank you

💬 Support request
Status

Active

Version

9.0

Component

Miscellaneous

Created by

🇬🇧United Kingdom SirClickALot Somerset

Comments & Activities

  • Issue created by @SirClickALot
  • 🇫🇷France izus

    Hi,
    Two things to try, I think:

    1) Try with a simple PDF file (one with only a few words), just to rule out a data problem (overlong word...).
    2) Make sure the view is configured to search in the media field. In particular, there is an example in the README, "SIMPLE USAGE EXAMPLE 2":
    https://git.drupalcode.org/project/search_api_attachments/-/blob/9.0.2/R...

  • 🇬🇧United Kingdom SirClickALot Somerset

    Hi @izus,

    Thank you for the quick reply.

    I did actually base my experiments on exactly that example.

    Since then I have deleted ALL content and the associated (PDF) files and created a single new item whose PDF contains only some very basic text, as you suggest.

    I have re-added the 'saa_' field to my index...

    I have the processor enabled...

    I have used the view to search for some words (one at a time) which I know exist in the simple PDF, but no results are returned at all.

    It's a mystery!

    I have made sure again that the 'saa_' field is searched in my Fulltext view (none are selected so ALL are used)...

    I have rebuilt the index again; this time there are no errors or warnings, which is interesting...
