- Issue created by @SoulReceiver
- Merge request !4: Issue #3503423: Applied HTML decoding and HTML encoding around the URL conversion (Open), created by Unnamed author
Currently, if a link on the page has a query string with multiple parameters (i.e. at least one preceded by an ampersand), the RelToAbs processor does not account for those ampersands being HTML-encoded as &amp;.
As a result, this hyperlink:
<a href="/path/to/page?hello=world&amp;foo=bar&amp;frodo=baggins">My link</a>
Becoming:
<a href="https://example.com/path/to/page?hello=world&amp%3Bfoo=bar&amp%3Bfrodo=baggins">My link</a>
This happens because the regular expression in the processor does not expect the URLs it finds on the page to contain HTML-encoded characters. The matched URL is passed as-is to the Url::fromUserInput() method, which likewise expects a URL that is not HTML-encoded and parses the query string with & as the delimiter rather than &amp;. The stray "amp" text and its semicolon therefore become part of the query string key, and the semicolon is URL-encoded to keep the key URL-safe.
/path/to/page?hello=world&amp;foo=bar&amp;frodo=baggins
/path/to/page?hello=world&amp%3Bfoo=bar&amp%3Bfrodo=baggins
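The mangling can be reproduced outside Drupal. The following is a minimal sketch in plain PHP, assuming that parse_str() and http_build_query() are a fair stand-in for the query handling behind Url::fromUserInput(); it is not the module's actual code path.

```php
<?php
// Query string exactly as it appears inside the HTML attribute (still HTML-encoded).
$rawQuery = 'hello=world&amp;foo=bar&amp;frodo=baggins';

// parse_str() splits on "&" only, so "amp;" stays glued to the following key.
parse_str($rawQuery, $params);
// Keys are now: 'hello', 'amp;foo', 'amp;frodo'.

// Rebuilding the query URL-encodes the ";" inside each key as %3B.
$rebuilt = http_build_query($params, '', '&');
echo $rebuilt, PHP_EOL;
```

The rebuilt string carries the same "amp%3B" fragments that show up in the broken link.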
We should assume that the URLs being processed contain HTML-encoded characters, given that this processor runs over HTML. Therefore, each URL we find should first be HTML-decoded so that Drupal's Url::fromUserInput() method can handle it correctly, and the resulting URL should then be HTML-encoded again before it is injected back into the source.
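$resultText = preg_replace_callback('/(href|background|src)=["\']([\/#][^"\']*)["\']/', function ($matches) {
  // Collapse repeated slashes, then HTML-decode so "&amp;" becomes a plain "&".
  $url = Html::decodeEntities(preg_replace('/\/{2,}/', '/', $matches[2]));
  try {
    $url = Url::fromUserInput($url)->setAbsolute()->toString();
  }
  catch (\InvalidArgumentException $e) {
    // On failure, log the error and fall through with the decoded relative URL.
    $this->logger->error($e->getMessage());
  }
  // HTML-encode the result again before writing it back into the attribute.
  return $matches[1] . '="' . Html::escape($url) . '"';
}, $text);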
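The decode, convert, re-encode round trip can also be checked without a Drupal bootstrap. This is a rough sketch only: rel_to_abs_attribute() is a hypothetical helper, html_entity_decode() and htmlspecialchars() stand in for Html::decodeEntities() and Html::escape(), and simple concatenation with an assumed base URL stands in for Url::fromUserInput()->setAbsolute()->toString().

```php
<?php
// Hypothetical helper mirroring the proposed order of operations.
function rel_to_abs_attribute(string $encodedHref, string $base): string {
    // 1. Undo HTML encoding so "&amp;" becomes a plain "&" delimiter.
    $decoded = html_entity_decode($encodedHref, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // 2. Stand-in for the absolute URL conversion (assumption: simple prefixing).
    $absolute = rtrim($base, '/') . $decoded;
    // 3. Re-encode so the value is safe inside an HTML attribute.
    return htmlspecialchars($absolute, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$result = rel_to_abs_attribute(
    '/path/to/page?hello=world&amp;foo=bar&amp;frodo=baggins',
    'https://example.com'
);
echo $result, PHP_EOL;
```

With this ordering the ampersands come back as &amp; in the attribute, and no "amp%3B" fragments end up in the query keys.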
Active
2.2
Code