Sanitize HTML before sending to AI to lessen token size

Created on 27 August 2025, about 1 month ago

Problem/Motivation

During development, the logged prompts showed large token counts. Sending that much markup increases potential cost and may also lead to inaccurate results.

Benefits of sanitizing and reducing the markup sent in AI prompts

  • Security processing: Strips out potential XSS or other injection vectors from the external sites being migrated.
  • Better prompt quality and fewer "AI hallucinations": Removes unnecessary or confusing HTML markup so the LLM focuses on meaningful content.
  • Fewer tokens: Cleaning out extra markup decreases the number of tokens required, lowering costs and improving the efficiency of LLM processing.

Steps to reproduce

Run the current slice1 migration. The complete HTML of the external page is sent to the AI.

Proposed resolution

Allow the processing options to be set under an html_processor configuration for the AI plugin.

ai:
   html_processor:
   ....

A few examples of the configuration options would be added to the ai_migration_example migration yml.

Allow for isolating content by container identifier(s)

The container would be specified as a string or an array. This allows specific regions of the HTML to be targeted and sent to the AI after processing. For arrays, each container filter would be processed in turn and its result appended to the content string sent to the AI.
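To illustrate the string-or-array behaviour described above, here is a minimal Python sketch using only the standard library. It is not the module's implementation (that lives in PHP in the merge request). It assumes a "container identifier" means an element id and that the markup is reasonably well formed; the names ContainerExtractor and isolate_containers are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class ContainerExtractor(HTMLParser):
    """Collects the markup of the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the matched container
        self.done = False   # True once the container has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # Outside the container: only react to the first matching id.
            if self.done or dict(attrs).get("id") != self.target_id:
                return
            if tag in VOID:
                self.done = True
        self.parts.append(self.get_starttag_text())
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied only when inside.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID:
            return
        self.parts.append(f"</{tag}>")
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append(f"&#{name};")


def isolate_containers(html, containers):
    """Accepts one id (string) or several (list); results are appended in order."""
    ids = [containers] if isinstance(containers, str) else list(containers)
    chunks = []
    for target_id in ids:
        parser = ContainerExtractor(target_id)
        parser.feed(html)
        parser.close()
        chunks.append("".join(parser.parts))
    return "\n".join(chunk for chunk in chunks if chunk)
```

Calling isolate_containers(page, "main") returns only the markup of the element with id "main", while passing a list such as ["main", "sidebar"] concatenates each match in the order given. HTML comments are dropped as a side effect, and markup with unclosed tags would throw off the depth counting in this sketch.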

HTML Sanitizer Component

Use the Symfony HTML Sanitizer Component for the heavy lifting of sanitizing the HTML content, giving the developer full access to the options the component provides. It is based on the HTML Sanitizer W3C Standard Proposal.

https://symfony.com/packages/HTML%20Sanitizer
https://github.com/symfony/html-sanitizer
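The Symfony component is a PHP library, so the following is not its API. It is only a rough Python sketch of the allowlist idea behind this kind of sanitizing: keep a small set of elements and attributes, drop script and style blocks together with their contents, and unwrap everything else. The allowlist and the names AllowlistSanitizer and sanitize are hypothetical, and the sketch does not validate URL schemes, so it should not be treated as a security boundary.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: element name -> attributes that may be kept.
ALLOWED = {
    "h1": set(), "h2": set(), "p": set(), "ul": set(), "li": set(),
    "strong": set(), "em": set(), "a": {"href"},
}
# Elements removed together with everything inside them.
DROP_WITH_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # the tag is unwrapped; its text content is still kept
        kept = []
        for name, value in attrs:
            if name in ALLOWED[tag]:
                safe_value = escape(value or "", quote=True)
                kept.append(f' {name}="{safe_value}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so escape it again on the way out.
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Wrapper elements such as div, along with class, style, and event-handler attributes, disappear from the output, which is where most of the token savings in a typical page would come from.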

Allow for stripping by regex

This gives the developer further flexibility to strip specific content that cannot be targeted with the sanitizer.
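As a rough illustration of regex-based stripping, the Python sketch below applies a list of patterns in order and then collapses the whitespace left behind. The patterns and the name strip_by_regex are hypothetical examples; the issue does not say which patterns the module ships with or how they are configured.

```python
import re

# Hypothetical patterns; each match is removed from the markup.
STRIP_PATTERNS = [
    r"<!--.*?-->",                      # HTML comments
    r"<script\b[^>]*>.*?</script\s*>",  # inline scripts and their bodies
    r'\s+data-[\w-]+="[^"]*"',          # double-quoted data-* attributes
]


def strip_by_regex(html, patterns=STRIP_PATTERNS):
    """Removes every match of each pattern, then collapses whitespace runs."""
    for pattern in patterns:
        html = re.sub(pattern, "", html, flags=re.DOTALL | re.IGNORECASE)
    # Stripping tends to leave blank lines and indentation behind.
    return re.sub(r"\s{2,}", " ", html).strip()
```

Regular expressions do not understand HTML nesting, so this approach fits best as a final pass over content that has already been isolated and sanitized.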

✨ Feature request
Status

Active

Version

1.0

Component

Code

Created by

🇺🇸 United States webbywe

Comments & Activities

  • Issue created by @webbywe
  • Merge request !15 Resolve #3543236 "Sanitize html before" → (Merged) created by webbywe
  • Pipeline finished with Success
    about 1 month ago
    Total: 240s
    #582686
  • Pipeline finished with Success
    about 1 month ago
    Total: 180s
    #582705
  • Pipeline finished with Success
    about 1 month ago
    Total: 164s
    #582717
  • 🇺🇸 United States majorrobot

    Thanks for this feature, @webbywe!

    I reviewed the MR and tested locally. Everything worked locally.

    I left a few comments and questions on the MR -- nothing big. Thank you again!