Add a plugin for Unstructured.io

Issue created by @robertoperuzzo
Comment 4 months ago →
🇺🇸United States adanielyan
Thank you for this! I hope this will get merged to the dev branch soon.
Merge request !42: Issue #3519494 by robertoperuzzo: Add the plugin for Unstructured.io → (Open), created by robertoperuzzo
Pipeline finished with Failed
3 months ago
Total: 266s
#531657
Comment 3 months ago →
🇮🇹Italy robertoperuzzo 🇮🇹 Tezze sul Brenta, VI
Hi @adanielyan and @izus, I was wondering if we could use the Batch API to test the extractors on settings submission, in order to avoid a timeout. During my tests using the Unstructured.io API locally, I encountered a timeout error a couple of times.

What do you think? Would it be compatible with the other kinds of extractors?
Pipeline finished with Success
3 months ago
Total: 187s
#536527
Pipeline finished with Success
3 months ago
Total: 139s
#536545
Comment 3 months ago →
🇮🇹Italy robertoperuzzo 🇮🇹 Tezze sul Brenta, VI
Comment 3 months ago →
🇺🇸United States j-barnes
@robertoperuzzo – thanks for the work on this!
Our team was looking for a way to leverage search_api_attachments with Unstructured to clean up the text for our RAG search, and this is the perfect solution. We have tons of legacy content that hasn’t been OCR’d, so this works well for that.

I did run into a few issues and have a couple of wish-list items:

Chunking elements – the form values don’t persist after save (adding those fields to submitConfigurationForm() fixed it for us).

Chunking settings not applied – the options aren’t being included in the payload request, so they don’t appear to take effect yet.

Large files time out – anything over ~1 MB (we have one that’s 1.1 MB) hits a DelayedRequeueException loop. Increasing the Guzzle timeouts solved it locally; exposing these as configurable options would be great.

```php
use GuzzleHttp\RequestOptions;

$options = [
  RequestOptions::HEADERS => [
    'Accept' => 'application/json',
    'unstructured-api-key' => $api_key?->getKeyValue() ?? '',
  ],
  RequestOptions::MULTIPART => [
    [
      'name' => 'files',
      'contents' => $file_resource,
      'filename' => $file->getFilename(),
      'headers' => [
        'Content-Type' => $file_mime_type,
      ],
    ],
    [
      'name' => 'strategy',
      'contents' => 'ocr_only',
    ],
  ],
  // Raised timeouts (in seconds) so large files can finish processing.
  RequestOptions::TIMEOUT => 300,
  RequestOptions::CONNECT_TIMEOUT => 30,
  RequestOptions::READ_TIMEOUT => 300,
];
```

Expose strategy choices – it would be awesome to have a dropdown for extraction strategy (e.g., ocr_only, hi_res, etc.) so we can pick the best option for non-OCR’d PDFs.
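
To make the payload and timeout items above more concrete, here is a minimal sketch of how saved settings could be turned into extra multipart fields and Guzzle timeout options. It is not the code from the merge request: both function names and the configuration keys are hypothetical, and the field names (`chunking_strategy`, `max_characters`, `new_after_n_chars`, `overlap`) are assumed to match the Unstructured API parameters, so they should be checked against its documentation. Plain string keys stand in for the `RequestOptions` constants so the snippet runs without Guzzle.

```php
<?php

declare(strict_types=1);

/**
 * Builds extra multipart form fields from the extractor configuration.
 *
 * Hypothetical helper: the configuration keys and the allow-list below are
 * assumptions, not part of the merge request.
 */
function example_unstructured_multipart_fields(array $config): array {
  // Assumed Unstructured API field names; verify them before relying on this.
  $allowed = [
    'strategy',
    'chunking_strategy',
    'max_characters',
    'new_after_n_chars',
    'overlap',
  ];
  $parts = [];
  foreach ($allowed as $name) {
    $value = $config[$name] ?? NULL;
    // Skip unset or empty values so the API falls back to its own defaults.
    if ($value === NULL || $value === '') {
      continue;
    }
    // Multipart 'contents' is sent as a string here.
    $parts[] = ['name' => $name, 'contents' => (string) $value];
  }
  return $parts;
}

/**
 * Returns timeout options in seconds, using fallbacks when nothing is set.
 *
 * The keys match the values of Guzzle's RequestOptions::TIMEOUT,
 * RequestOptions::CONNECT_TIMEOUT and RequestOptions::READ_TIMEOUT.
 */
function example_unstructured_timeouts(array $config): array {
  return [
    'timeout' => (float) ($config['timeout'] ?? 300),
    'connect_timeout' => (float) ($config['connect_timeout'] ?? 30),
    'read_timeout' => (float) ($config['read_timeout'] ?? 300),
  ];
}
```

The returned fields could be appended after the `files` part of the multipart array, and the timeout array merged into the request options, so that whatever is saved in the form actually reaches the request.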

Anyway, great work on this. I hope we can get this merged soon. I'm attaching the PDF we have been having issues with, in case you need it for testing.
Comment 3 months ago →
🇺🇸United States j-barnes
Issue was unassigned.
Status changed to Needs work 8 days ago, 7:47am 25 September 2025

