ImageExtractor support for relative asset paths

Created on 18 February 2025, 7 months ago

Problem/Motivation

The current ImageExtractor only matches absolute URLs, but the markup can also contain relative asset paths. I suggest we add an extra regex for relative assets.

Steps to reproduce

1. Set up a pseudo field on a node type to populate HTML via either of the Crawler plugins
2. Set up an image field to extract images
3. Only images with absolute URLs are matched; images with relative paths are skipped
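
The behaviour in the steps above can be illustrated with a minimal sketch. It is written in Python for brevity, and both patterns are hypothetical stand-ins: the actual regex used by the ImageExtractor plugin is not shown in this issue. The sketch assumes an absolute-only pattern and compares it with a broader one that accepts any src value.

```python
import re

# Hypothetical pattern: only accepts absolute http(s) URLs in img src attributes.
ABSOLUTE_IMG = re.compile(r'<img[^>]+src=["\'](https?://[^"\']+)["\']', re.IGNORECASE)

# Hypothetical broader pattern: accepts any non-empty src value.
ANY_IMG = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)

# Sample markup with one absolute and two relative image paths (illustrative only).
html = (
    '<img src="https://example.com/a.png">'
    '<img alt="b" src="/sites/default/files/b.jpg">'
    '<img src="images/c.gif">'
)

absolute_only = ABSOLUTE_IMG.findall(html)  # misses the two relative paths
all_sources = ANY_IMG.findall(html)         # picks up all three images
```

With the absolute-only pattern, the root-relative and document-relative images are silently dropped, which matches what step 3 describes.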

Proposed resolution

Add support for relative assets within the ImageExtractor plugin.

Remaining tasks

Create changes, review and test

User interface changes

Additional configuration option in entity field config

API changes

N/A

Data model changes

Feature request
Status

Active

Version

1.0

Component

Code

Created by

🇧🇪Belgium baikho Antwerp, BE

Comments & Activities

  • Issue created by @baikho
  • 🇧🇪Belgium baikho Antwerp, BE

    I believe this could be considered a bug

  • 🇫🇮Finland merilainen

    I think this is a feature request, because it's possible to use an intermediate field called something like "Source HTML with absolute img urls" where AI can be instructed to change any relative URLs to absolute with the following prompt:

    I will give you a page HTML source as the context and I want you to find all relative img tags, replace them in the source with prefix https://boonstoppelverf.nl/ domain so that the url will become absolute. Return the rest of the source HTML as is. If the context is empty, do not do anything.


    Context: {{ raw_context }}

    It's not ideal and wastes tokens, but it works.

  • 🇫🇮Finland merilainen

    Here is a patch which adds support for relative asset paths for both File and Image extractors.

    This is implemented by providing a "domain" configuration which can be set when relative paths should be processed. Leaving it empty will skip processing of relative paths.

    Other changes:
    - Dynamic extensions support for ImageExtractor (copied from FileExtractor)
    - Use extraFormFields instead of extraAdvancedFormFields in both extractors (makes configurable parts more visible)
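
The "domain" behaviour described in the comment above could look roughly like the following sketch. It is written in Python rather than the module's PHP, and it is not taken from the merge request: the function name `resolve_src` and the example.com domain are assumptions made purely for illustration. Absolute URLs pass through unchanged, relative paths are joined to the configured domain, and an empty domain means relative paths are skipped.

```python
from urllib.parse import urljoin, urlparse


def resolve_src(src, domain=""):
    """Return an absolute URL for src, or None if the path should be skipped.

    Hypothetical helper: `domain` stands in for the "domain" configuration
    value mentioned in the comment above.
    """
    if urlparse(src).scheme in ("http", "https"):
        return src  # already absolute, nothing to do
    if not domain:
        return None  # empty domain: relative paths are not processed
    # Normalise the trailing slash so both "/a.png" and "a.png" resolve
    # against the site root.
    return urljoin(domain.rstrip("/") + "/", src)


resolved = resolve_src("/sites/default/files/b.jpg", "https://example.com")
skipped = resolve_src("/sites/default/files/b.jpg")
```

Keeping the empty-domain case as a skip rather than an error preserves the existing absolute-only behaviour for sites that do not set the option.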

  • Merge request !5 "Support for relative asset paths" (Open) created by merilainen
  • Pipeline finished with Success
    21 days ago
    Total: 128s
    #582914