Problem/Motivation
I have a fully working Search API setup and am now trying to add Search API Attachments so that I can search the text in PDF documents.
Local setup
Drupal Version 9.5.10
Web server: Apache/2.4.33 (Win64) OpenSSL/1.0.2u mod_fcgid/2.3.9
Database: MySQL 5.7.24
PHP: 8.1.0
Installation
I have installed:
- Search API Attachments (9.02) using Composer:
composer require 'drupal/search_api_attachments:^9.0'
- pdftotext using Composer:
composer require ottosmops/pdftotext
I have tested the pdftotext PHP extractor and received confirmation that it works:
"Extracted data: The extraction seems working! Yay! Congratulations!"
My simple scenario
I have an Entity reference field called 'field_download_restricted' which references a Media type 'Document', and that in turn points to a local PDF file. All of that works perfectly fine with respect to the rest of the site functionality.
I have configured my Search API index correctly with the field_download_restricted set to Fulltext type and I initiate a re-index.
All appears to be well in terms of indexing, i.e. there are no immediate complaints and a report message says "All items were indexed successfully", but my View, which does a full-text search on several fields including field_download_restricted, does not find what I am looking for in the PDFs.
I'm pretty sure that I have set everything up correctly Search API-wise.
When I look in the Watchdog log, I see the following two errors logged during indexing:
Error 1: Deprecated function
Location http://bit-by-bit/batch?_format=json&id=2535&op=do
Referrer http://bit-by-bit/batch?id=2535&op=start
Deprecated function: mb_strcut(): Passing null to parameter #1 ($string) of type string is deprecated in Drupal\search_api_attachments\Plugin\search_api\processor\FilesExtractor->limitBytes() (line 412 of <...>Sites\bit-by-bit.org\public_html\modules\contrib\search_api_attachments\src\Plugin\search_api\processor\FilesExtractor.php)
#0 <...>Sites\bit-by-bit.org\public_html\core\includes\bootstrap.inc(347): _drupal_error_handler_real(8192, 'mb_strcut(): Pa...', '<...>...', 412)
#1 [internal function]: _drupal_error_handler(8192, 'mb_strcut(): Pa...', '<...>...', 412)
#2 <...>Sites\bit-by-bit.org\public_html\modules\contrib\search_api_attachments\src\Plugin\search_api\processor\FilesExtractor.php(412): mb_strcut(NULL, 0, 1048576.0)
#3 <...>Sites\bit-by-bit.org\public_html\modules\contrib\search_api_attachments\src\Plugin\search_api\processor\FilesExtractor.php(296): Drupal\search_api_attachments\Plugin\search_api\processor\FilesExtractor->limitBytes(NULL)
#4 <...>Sites\bit-by-bit.org\public_html\modules\contrib\search_api_attachments\src\Plugin\search_api\processor\FilesExtractor.php(253): Drupal\search_api_attachments\Plugin\search_api\processor\FilesExtractor->extractOrGetFromCache(Object(Drupal\node\Entity\Node), Object(Drupal\file\Entity\File), Object(Drupal\search_api_attachments\Plugin\search_api_attachments\PdftotextExtractor))
Error 2: An overlong word
Location http://bit-by-bit/batch?_format=json&id=2535&op=do
Referrer http://bit-by-bit/batch?id=2535&op=start
An overlong word (more than 50 characters) was encountered while indexing: xttirrsgrigadtnaoeetlneoergnmociltraefewlyaoitgntrhupiheadelnohesolotfsnsraawbupdiiinunsn1sdat.
Since database search servers currently cannot index words of more than 50 characters, the word was truncated for indexing. If this should not be a single word, please make sure the "Tokenizer" processor is enabled and configured correctly for index External resources.
I have no idea where that offending overly long word comes from...
xttirrsgrigadtnaoeetlneoergnmociltraefewlyaoitgntrhupiheadelnohesolotfsnsraawbupdiiinunsn1sdat
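One way to narrow this down might be to scan the text that the extractor produces for tokens over the 50-character limit. The following is only a minimal sketch in Python, assuming the extracted text is available as a plain string; the function name find_overlong_words and the sample text are hypothetical and not part of Search API or Search API Attachments.

```python
MAX_WORD_LENGTH = 50  # limit quoted in the "overlong word" log message


def find_overlong_words(text, limit=MAX_WORD_LENGTH):
    """Return whitespace-separated tokens longer than `limit` characters."""
    return [word for word in text.split() if len(word) > limit]


if __name__ == "__main__":
    # Hypothetical sample standing in for text extracted from a PDF.
    sample = "normal words and " + "x" * 60 + " more normal words"
    for word in find_overlong_words(sample):
        print(len(word), word)
```

If a scan like this reports such a token for one particular PDF, that would suggest the long string is present in the extracted text itself rather than being produced by the index configuration.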
RE: "...please make sure the "Tokenizer" processor is enabled and configured correctly for index External resources."
I have enabled the Tokenizer processor and re-indexed, but both error messages still persist.
If anyone can offer any pointers as to what I might try next, I'd be very grateful.
Thank you