HTML-encoded query strings do not get parsed correctly

Created on 30 January 2025, 25 days ago

Problem/Motivation

Currently, if a link within the page contains multiple query parameters (i.e. at least one preceded by an ampersand), the RelToAbs processor does not account for the fact that these ampersands are HTML-encoded as &amp;.

As a result, this hyperlink:
<a href="/path/to/page?hello=world&amp;foo=bar&amp;frodo=baggins">My link</a>
Becoming:
<a href="https://example.com/path/to/page?hello=world&amp;amp%3Bfoo=bar&amp;amp%3Bfrodo=baggins">My link</a>

This happens because the regular expression in the processor does not expect the URLs found on the page to contain HTML-encoded characters. The matched URL is passed as-is to the Url::fromUserInput() method, which likewise expects a URL that is not HTML-encoded and parses the query string assuming that the delimiter is & rather than &amp;. The leftover "amp;" text then becomes part of the next query string key, and its semicolon is URL-encoded (as %3B) to make the key URL-safe.
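The mis-parse described above can be reproduced outside Drupal. The following is a minimal sketch in plain PHP, not the module's or Drupal's actual code: splitQuery() is a hypothetical helper that stands in for any parser expecting a raw (not HTML-encoded) query string.

```php
<?php
// Hypothetical helper (not part of Drupal): split a query string on "&"
// into key => value pairs, as a parser expecting a raw URL would.
function splitQuery(string $query): array {
  $pairs = [];
  foreach (explode('&', $query) as $part) {
    [$key, $value] = array_pad(explode('=', $part, 2), 2, '');
    $pairs[$key] = $value;
  }
  return $pairs;
}

$encoded = 'hello=world&amp;foo=bar&amp;frodo=baggins';

// Without decoding, "amp;" stays glued to the front of the next key:
// the keys are "hello", "amp;foo" and "amp;frodo".
$wrong = splitQuery($encoded);

// Rebuilding the query percent-encodes the stray ";" as %3B:
// 'hello=world&amp%3Bfoo=bar&amp%3Bfrodo=baggins'
$rebuilt = http_build_query($wrong, '', '&');

// Decoding the entities first yields the intended keys:
// "hello", "foo" and "frodo".
$right = splitQuery(html_entity_decode($encoded));
```

The rebuilt string matches the broken query string seen on the front end, which suggests the decoding step is what is missing rather than anything in the URL generation itself.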

Steps to reproduce

  1. Enable the RelToAbs processor on your active text format.
  2. Add a hyperlink to the body with multiple query parameters in its URL, e.g.: /path/to/page?hello=world&foo=bar&frodo=baggins
  3. Save the page and view it on the front-end of your site.
  4. Observe that the hyperlink now has extra instances of "amp%3B" in its query string, e.g.: /path/to/page?hello=world&amp%3Bfoo=bar&amp%3Bfrodo=baggins

Proposed resolution

We should assume that the URLs being processed contain HTML-encoded characters, given that this processor runs over HTML. For each URL found, we should therefore HTML-decode it first so that Drupal's Url::fromUserInput() method can handle it correctly, and then HTML-encode the resulting URL again before injecting it back into the source.

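// Requires: use Drupal\Component\Utility\Html; use Drupal\Core\Url;
$resultText = preg_replace_callback('/(href|background|src)=["\']([\/#][^"\']*)["\']/', function ($matches) {
  // Collapse repeated slashes, then HTML-decode so "&amp;" becomes a plain "&".
  $url = Html::decodeEntities(preg_replace('/\/{2,}/', '/', $matches[2]));
  try {
    $url = Url::fromUserInput($url)->setAbsolute()->toString();
  }
  catch (\InvalidArgumentException $e) {
    // On failure, log the error and fall through with the decoded, still-relative URL.
    $this->logger->error($e->getMessage());
  }
  // Re-encode the URL before writing it back into the HTML attribute.
  return $matches[1] . '="' . Html::escape($url) . '"';
}, $text);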
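The decode, process, re-encode order proposed above can also be checked in isolation. This is a rough sketch in plain PHP under stated assumptions: absolutizeHref() and its $base parameter are hypothetical stand-ins, and prefixing the base replaces the Url::fromUserInput() call purely for illustration.

```php
<?php
// Hypothetical stand-in for the processor callback (not the module's code).
// $base is an assumed site root; Drupal would derive it from the request.
function absolutizeHref(string $href, string $base): string {
  // 1. Undo the HTML encoding so "&amp;" becomes a real "&" delimiter.
  $raw = html_entity_decode($href, ENT_QUOTES | ENT_HTML5);
  // 2. Work on the raw URL (here: simply prefix the base).
  $absolute = rtrim($base, '/') . $raw;
  // 3. Re-encode for safe insertion back into an HTML attribute.
  return htmlspecialchars($absolute, ENT_QUOTES | ENT_HTML5);
}

$out = absolutizeHref(
  '/path/to/page?hello=world&amp;foo=bar&amp;frodo=baggins',
  'https://example.com'
);
// $out is 'https://example.com/path/to/page?hello=world&amp;foo=bar&amp;frodo=baggins'
```

Each ampersand is encoded exactly once in the result, which is the behaviour the proposed change aims for.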
🐛 Bug report
Status

Active

Version

2.2

Component

Code

Created by

🇬🇧United Kingdom SoulReceiver
