Indexing PDF files with Vragen.ai

Created on 28 April 2025

Problem/Motivation

It is currently not possible to index PDF files that are stored as media items. Supporting this would be very valuable for sites with many uploaded PDFs, because it would give Vragen.ai the context contained in those files.

Steps to reproduce

  1. Enable the media entity for indexing
  2. Select the right bundle to index (in my case 'documents').
  3. Track the items to be indexed
  4. Start indexing. The media items are not handled properly, because the content of a PDF file cannot be rendered.

Proposed resolution

My proposed solution is to detect when a media item with MIME type 'application/pdf' is being indexed, construct a URL to the file, and pass that URL to Vragen.ai. Vragen.ai can then crawl the provided URL, render the PDF to text, and index it. This way we do not pass the entire PDF in an HTTP request, and we do not have to render the PDF to HTML on the Drupal side.
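
To illustrate the detection step, here is a minimal, language-neutral sketch in Python (the module itself would implement this in Drupal). The `MediaItem` class, its fields, and the `pdf_url_for` helper are hypothetical names used only for illustration; they are not part of Drupal or of the Vragen.ai module.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin

PDF_MIME_TYPE = "application/pdf"


@dataclass
class MediaItem:
    """Hypothetical stand-in for a media entity and its source file."""
    media_id: int
    mime_type: str
    file_path: str  # site-relative path to the uploaded file


def pdf_url_for(item: MediaItem, base_url: str) -> Optional[str]:
    """Return an absolute URL for PDF media items, or None for anything else."""
    # Compare case-insensitively, since MIME types are not case-sensitive.
    if item.mime_type.strip().lower() != PDF_MIME_TYPE:
        return None
    # Build the absolute URL that the remote service would crawl.
    return urljoin(base_url, item.file_path)
```

Only items for which this returns a URL would take the new code path; all other media items would keep being indexed as they are today.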

User interface changes

None I can think of.

API changes

When a PDF media item is detected, the module will send a new document request without content. The API will soon be equipped to handle this.
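
As a rough sketch of what such a request body could look like, the Python below builds a JSON payload that carries a URL and omits the content. The actual Vragen.ai request schema is not described in this issue, so the field names `id`, `title`, `url`, and `content` are assumptions, as is the `build_document_request` helper.

```python
import json
from typing import Optional


def build_document_request(document_id: str, title: str, url: str,
                           content: Optional[str] = None) -> str:
    """Serialize a document request; field names are assumed, not confirmed."""
    payload = {"id": document_id, "title": title, "url": url}
    # For PDF media items no content is sent; the service is expected to
    # fetch the URL and extract the text itself.
    if content is not None:
        payload["content"] = content
    return json.dumps(payload, sort_keys=True)
```

Leaving the content key out entirely, instead of sending an empty string, lets the receiving side distinguish "fetch this URL" from "this document is empty".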

Feature request
Status

Active

Version

2.0

Component

Code
