Indexing PDF files with Vragen.ai

Created on 28 April 2025, 3 months ago

Problem/Motivation

Right now it is not possible to index PDF files that are stored as media items. I believe this would be very valuable for sites that have lots of PDFs uploaded, because it would give Vragen.ai more context from these files.

Steps to reproduce

Enable the media entity for indexing.
Select the right bundle to index (in my case 'documents').
Track the items to be indexed.
Start the index. Media items won't be handled properly, as we cannot render the content of a PDF file.

Proposed resolution

My proposed solution is to detect whenever a media item with MIME type 'application/pdf' is being indexed. We can then construct a URL to the file and pass it to Vragen.ai. Vragen.ai can then crawl the provided URL, render the PDF to text, and index it. This way we do not pass the entire PDF in an HTTP request, and we do not have to render the PDF to HTML on the Drupal side.
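
As a rough illustration of the detection and URL step, here is a minimal sketch in Python. It is not the module's code (which would be PHP); `MediaItem`, `is_pdf`, `build_file_url`, and the example base URL are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from urllib.parse import quote

PDF_MIME_TYPE = "application/pdf"


@dataclass
class MediaItem:
    """Hypothetical stand-in for a media entity and its source file."""
    media_id: int
    bundle: str
    mime_type: str
    file_path: str  # assumed to be relative to the public files directory


def is_pdf(item: MediaItem) -> bool:
    """Return True when the item's MIME type is application/pdf.

    Any parameters after ';' (for example a charset) are ignored and the
    comparison is case-insensitive.
    """
    return item.mime_type.split(";")[0].strip().lower() == PDF_MIME_TYPE


def build_file_url(base_url: str, item: MediaItem) -> str:
    """Build an absolute URL to the file that a crawler could fetch.

    quote() percent-encodes spaces and other unsafe characters but leaves
    '/' alone, so directory separators in the path survive.
    """
    return base_url.rstrip("/") + "/" + quote(item.file_path.lstrip("/"))
```

With these helpers, the indexing code could branch on `is_pdf(item)` and send only the result of `build_file_url(...)` instead of rendered content. Whether the URL is publicly reachable by the crawler is a separate concern this sketch does not address.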

User interface changes

None I can think of.

API changes

When a PDF media item is detected, the module will send a new document request without content. The API will soon be equipped to handle this.
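
To make "a document request without content" concrete, here is a small Python sketch of what such a payload might look like. The field names (`id`, `title`, `url`, `content`) and the function names are assumptions for illustration only; the actual Vragen.ai request format is not described in this issue.

```python
import json
from typing import Optional


def build_document_request(
    document_id: str,
    title: str,
    url: str,
    content: Optional[str] = None,
) -> dict:
    """Build a document request body (hypothetical field names).

    When content is None, the 'content' key is left out entirely, which
    signals that the receiving side should fetch the URL itself.
    """
    payload = {"id": document_id, "title": title, "url": url}
    if content is not None:
        payload["content"] = content
    return payload


def encode_request(payload: dict) -> str:
    """Serialize the payload as JSON for the HTTP request body."""
    return json.dumps(payload)
```

Omitting the key, rather than sending an empty string, keeps "no content, please crawl the URL" distinguishable from "this document is genuinely empty". That is a design choice of this sketch, not something the issue specifies.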

✨ Feature request

Status

Active

Version

2.0

Component

Code

Merge Requests

!2 Indexing PDF files with Vragen.ai
Merged
Unnamed author
updated 3 months ago

Comments & Activities

Issue created by @JelleGlebbeek
Comment 3 months ago →
JelleGlebbeek
Merge request !2 "feat: Add support for indexing pdf's through the media module" (Merged) created by Unnamed author
Comment 3 months ago →
JelleGlebbeek
Pipeline finished with Success
3 months ago
Total: 144s
#483819
Comment 3 months ago →
🇳🇱Netherlands bbrala Netherlands
Thanks!

Left a comment, also phpcs is not passing :)
Pipeline finished with Success
3 months ago
Total: 247s
#484451
Pipeline finished with Success
3 months ago
Total: 139s
#484456
Comment 3 months ago →
🇳🇱Netherlands bbrala Netherlands
Ty and merged. No new release though, do you need one?
Comment 3 months ago →
JelleGlebbeek
I think a new release is a good idea, especially as there is also a bug with reading settings on beta3.
I'd also rather use a beta version than the dev branch.

Thanks!
