Extracting text from PDF in a media field for a prompt?

Created on 6 June 2024, 8 months ago
Updated 8 June 2024, 8 months ago

Problem/Motivation

My node includes a media field with a document that's often a PDF file. In other situations, I'm able to use a token like this to extract the text of the PDF:

[node:field_file:entity:field_media_document:entity:file_extractor_extracted_file]

But it's not working here.

Is there another token that might be better? Or maybe an entirely different approach?

TIA.

💬 Support request
Status

Active

Version

1.0

Component

Code

Created by

🇺🇸 United States bogdog400


Comments & Activities

  • Issue created by @bogdog400
  • 🇩🇪 Germany marcus_johansson

    When I have time to resolve https://www.drupal.org/project/ai_interpolator/issues/3446245 (📌 Add inputter plugins, Active), it should be possible to create a token-based field inputter that could do what you want. The problem is that it's highly complex, and if you enter the wrong token (say, a string) it needs error handling for that.

    Another option is to look inside child entities for, in this case, file fields. This would of course come with comical side effects, like the user profile image showing up as a choice for image input on most entity types. Deltas are also a problem: if you have a node that has paragraphs that have media that have images, how would you choose which one to pick? Let me think about that, or if you or anyone else has suggestions for solving it, let me know. Maybe AI prompt engineering could solve it, though I guess that's more for text prompts?
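To make the delta concern above concrete, here is a minimal sketch in Python. It is not Drupal code and not the AI Interpolator API: the `Entity` class, the `find_files` helper, and all field and file names are hypothetical stand-ins. It only illustrates how a recursive search through child entities can surface several candidates, including an unrelated profile picture.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Entity:
    """Hypothetical stand-in for an entity; not the real Drupal Entity API."""
    type: str
    # Field name -> list of values, one per delta.
    # A value is either a child Entity or a file name (str).
    fields: Dict[str, List[object]] = field(default_factory=dict)


def find_files(entity: Entity, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, file_name) for every file reachable through child entities."""
    prefix = prefix or entity.type
    for name, values in entity.fields.items():
        for delta, value in enumerate(values):
            path = f"{prefix}:{name}:{delta}"
            if isinstance(value, Entity):
                # Descend into the referenced child entity.
                yield from find_files(value, path)
            else:
                yield path, str(value)


# Assumed example data: a node whose author has a picture and whose
# single paragraph references two media items, each with one image.
author = Entity("user", {"user_picture": ["avatar.png"]})
media_a = Entity("media", {"field_media_image": ["hero.jpg"]})
media_b = Entity("media", {"field_media_image": ["diagram.png"]})
paragraph = Entity("paragraph", {"field_media": [media_a, media_b]})
node = Entity("node", {"uid": [author], "field_paragraphs": [paragraph]})

candidates = list(find_files(node))
for path, file_name in candidates:
    print(path, "->", file_name)
```

In this sketch the search returns three candidates, one of which is the author's picture, so some rule (a fixed delta, a user choice, or a filter on entity type) would still be needed to pick the intended file.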

  • 🇺🇸 United States bogdog400

    Okay. Well, let me know when you're able to tackle this problem.

    Or do some of the AIs take PDFs directly?

  • 🇩🇪 Germany marcus_johansson

    Ah, sorry - the issue might actually be that file_extractor did not trigger. I will check whether I can install it and try it out, to see if there is a bug in the Token integration.

    For PDF to text in the AI Interpolator there is:
    * https://www.drupal.org/project/ai_interpolator_convertapi - cheap, with quality similar to Tika.
    * https://www.drupal.org/project/unstructured - can be self-hosted, is better than anything else, and can also handle XLSX, JPG, PNG, DOCX, etc. Really awesome product.
