Detect and strip base URL from pasted URLs to increase matching hits and support multilingual

Created on 29 August 2019, about 5 years ago
Updated 13 August 2024, 3 months ago

Problem/Motivation

We have a persistent issue where content editors will link to other pages on their site by copying and pasting entire URLs into the LinkIt dialog. For example, they will copy and paste https://sample.com/some/page and link to that, rather than typing in the title "Some Page" and waiting for the matcher to return a result.

As a result, we get lots of absolute URLs in links throughout the site, which is especially problematic if the editors are using a temporary domain (like https://sample-new.princeton.edu), since all those links will break once the temporary domain goes away and the site they're working on goes live.

Proposed resolution

1. Strip the base URL from pasted URLs into the Linkit autocomplete.
2. Check if the URL is a path alias to some other path, and then use that system path instead.

Then the LinkIt filter would convert the system path back to alias on render.

Remaining tasks

  • Support stripping base urls in frontend (when using the CKEditor 5 link widget or the field widget)
    • Support unmatched urls (No content suggestions found. This URL will be used as is.)
  • Support stripping base urls in backend (when bypassing the frontend widget)
    • Support urls matched as entities (in EntityMatcher)
    • Support external links missing a protocol (in ExternalMatcher)
    • Support urls to the front page (in FrontPageMatcher)
  • Support base urls other than the current url (e.g. in case of headless systems or pre-production environments). Ideally this would be configurable, like the Pathologic module does

Implications

This is pretty complicated to do because we cannot just swap out the system path, we also need to add the matcher attributes.

Feature request
Status

Needs work

Version

7.0

Component

Code

Created by

🇺🇸United States bkosborne New Jersey, USA

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • 🇨🇦Canada joseph.olstad

    patch appears to need a reroll for PHP 8.1 compatibility

  • 🇨🇦Canada joseph.olstad

    reroll for 6.0.x

  • Status changed to Needs review over 1 year ago
  • The last submitted patch, 59: linkit6x-3078075-59.patch, failed testing. View results
    - codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

  • 🇺🇸United States mark_fullmer Tucson

    It appears that the NodeMatcherTest failed in #59 because Kernel tests still need to be declared with 'public' visibility (in contrast to Functional/FunctionalJavascript tests)?

    Here's a re-roll that changes the visibility back from 'protected' to 'public'.

  • 🇨🇦Canada joseph.olstad

    Thanks @mark_fullmer , 👍

  • 🇺🇸United States mark_fullmer Tucson
  • 🇺🇸United States mark_fullmer Tucson

    Re-reading the original stated motivation for this issue -- namely, that pasted absolute URLs are *rendered* and if the base URL changes, those URLs will be wrong -- I want to ask whether that part of the motivation would more universally be solved in the context of text format fields by using the pathologic module, which specifically supports stripping matching base URLs from content entered in text editors.

    That said, there still are two reasons that Linkit should maybe also do this:
    1. The Linkit field widget/formatter would not be able to make use of Pathologic to fix this.
    2. There is utility in having the base URL stripping happen during the autocomplete process: content editors will get confirmation that the URL they are pasting is, in fact the entity they are trying to link to, since it will show in the autocomplete suggestions.

    So: I still think I'm in support of doing this, but others coming to this issue should also consider the pathologic module for solving the similar goal in other content in text format fields.

  • 🇨🇦Canada joseph.olstad

    Hi Mark , in our case the patch you rolled works perfectly and is desired. we have one Drupal build that is used by 100+ installations, each with the same domain, there is a different base path for all 100 of these. With all due respect, pathologic is an inferior solution that would require us to have 100 additional configurations for our build.

  • Assigned to mmarler
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.0.7 + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    53 pass, 4 fail
  • 🇨🇦Canada sylus

    Just a quick re-roll of #65

  • The last submitted patch, 68: linkit6x-3078075-68.patch, failed testing. View results
    - codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

  • Status changed to Needs work about 1 year ago
  • 🇨🇦Canada joseph.olstad

    possible solution to failed test

    Line 41 of
    tests/src/Functional/LinkitBrowserTestBase.php
    change this from:

      protected static $modules = ['linkit', 'linkit_test', 'block'];
    

    To this:

      protected static $modules = ['linkit', 'linkit_test', 'block', 'pathauto'];
    
  • 🇳🇱Netherlands Lendude Amsterdam

    Only recently updated to Linkit 6 so we were still on patch #13, client wasn't happy with the UI changes that got added so made a local patch without them.

    Won't derail the thread here, so not uploading, just something to take into consideration before this would be merged. I'm sure the warning is useful for some users but not for ours, so I think it should be made optional at the very least.

  • 🇳🇱Netherlands idebr

    @joseph.olstad Can you elaborate why this issue needs user interface changes in #16? The patch in #13 converts absolute URLs without bothering the user.

  • 🇨🇦Canada joseph.olstad

    @idebr

    Patch #68 includes enhancements over #13

    1. Include support for the base path
    2. include support for access unpublished token
    3. Notice message

    I created a video demo here:
    download and view the video for a demo.
    #3078075-17: Detect and strip base URL from pasted URLs to increase matching hits and support multilingual

  • Issue was unassigned.
  • 🇺🇸United States mark_fullmer Tucson
  • 🇺🇸United States moshe weitzman Boston, MA

    One thing that this patch doesn't handle (I think) is that editing is sometimes done on a different domain than viewing (for security). For example, the author is editing on edit.mass.gov but pastes in a link to www.mass.gov/node/n. Ideally this code would recognize www.mass.gov as an additional base_url and this recognize the pasted link as internal. So, give sites a way to provide a list of domains, not just the one that the author is using.

  • 🇨🇦Canada smulvih2 Canada 🍁

    Re-roll of #68 after 3212820 📌 Upcast linkit profile in linkit.autocomplete route Fixed was merged into the 6.0.x branch.

  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update about 1 year ago
    Composer require failure
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.0.11 + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    45 pass, 8 fail
  • 🇨🇦Canada smulvih2 Canada 🍁

    Found an issue with my last patch in #76, which seems to be an issue with some of the previous patches as well. This issue was in AutcompleteController.php, where $this->linkitProfile is not defined.

    I tested this new patch against the proposed resolution in the ticket description and can confirm that using an absolute URL gets handled correctly, and the content then uses the system path instead of the alias.

  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update about 1 year ago
    Composer require failure
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.0.11 + Environment: PHP 8.2 & pgsql-13.5
    last update about 1 year ago
    53 pass, 4 fail
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.4 + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    53 pass, 4 fail
  • First commit to issue fork.
  • 🇧🇪Belgium dieterholvoet Brussels

    @joseph.olstad Can you elaborate why this issue needs user interface changes in #16? The patch in #13 converts absolute URLs without bothering the user.

    I agree, the bright red message is unnecessary. I did notice that the message and the disabling of the submit button doesn't even work in CKEditor 5 because CSS selectors changed, so that's another reason to leave it out. I'll start a MR based on the last part and remove that part of the code.

  • 🇧🇪Belgium dieterholvoet Brussels

    One oddity I discovered is that when using the Linkit widget, absolute node URLs are resolved to their node/xxx paths. After saving the form and reloading the page, that internal path is again converted to the path alias including the language prefix. Can't we change it so we resolve absolute urls to the aliased path instead of the internal path? Or is that a separate issue?

  • Merge request !35Resolve #3078075 "Detect and strip" → (Open) created by dieterholvoet
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.4 + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    Composer require failure
  • 🇨🇦Canada joseph.olstad

    @DieterHolvoet

    We do not want to use the alias path in the storage.

    The alias can be changed, the node id does not change.

    Rebuilding cache and loading the rendered page generates the aliases, Please undo your related commits to the MR.

    The end result is that people see the up-to-date alias as the intended functionality is that the alias is generated and cached in the rendered output of the page.

  • 🇧🇪Belgium dieterholvoet Brussels

    No, I do not mean we need to store the alias in storage. I mean that in the Linkit field widget, the alias is usually displayed to the user instead of the internal path. After submitting the form it's converted to the internal path. The inconsistency is that now, when resolving an absolute url, the internal path is displayed to the user instead of the alias.

    The changes in the MR are unrelated to this, they are related to my comment in #79.

  • 🇨🇦Canada joseph.olstad

    @DieterHolvoet,
    Your change also removed access_unpublished support for the hashkey processing and it also dissolves support for base path , so if your site is using a prefix path the absolute url won't be recognized using your merge request.

    With that said, this issue is now several years old, any small improvement would be good, can always re-patch on top of whatever the maintainer decides to do.

    Quite a lot of work went into this, you basically undid three different capabilities, two of which are actually pretty important.

  • 🇨🇦Canada joseph.olstad

    var baseUrl = drupalSettings.base_url;
    This base url is actually pretty useful and important to many implementations of Drupal.

  • 🇧🇪Belgium dieterholvoet Brussels

    Your change also removed access_unpublished support for the hashkey processing

    Any logic added in linkit.autocomplete.js has to be written separately for CKEditor 5, which isn't done yet. So either we also add support for that to the CKEditor 5 Linkit implementation, or we remove it here and deal with it in a follow up. Like you said, this issue has been open for years so it would be nice to be able to move this forward.

    and it also dissolves support for base path , so if your site is using a prefix path the absolute url won't be recognized using your merge request.

    By prefix path, do you mean language prefixes? Because those certainly still work, I tested it with the Linkit field widget and the CKEditor 5 implementation.

  • 🇨🇦Canada joseph.olstad

    @DieterHolvoet

    No, this is not refering to the language prefix although that is also included in the base url so deals with it. Multiple drupal sites can be run under the same domain or drupal sites could be under different domains.

    if we want to support all of Drupal possibilities we should include the base_url as was done previously above before the knockout.

    Example
    https://example.localhost/multisite-a/node/1
    if using language prefix would look like:
    https://example.localhost/multisite-a/en/node/1
    https://example.localhost/multisite-a/fr/node/1

    Multisite-b
    https://example.localhost/multisite-b/node/1
    if using language prefix would look like:
    https://example.localhost/multisite-b/en/node/1
    https://example.localhost/multisite-b/fr/node/1

    In your vhost you'd have an alias entry
    in your base project you'd have a symlink to match the alias entry

    Some do a lot more and make a unique vanity prefix per language, that is covered also by the base_url

    Without the base_url logic in, the absolute url is not recognized and the utility of this solution is diminished.

    For example, one of my clients has over 170 drupal instances using the same domain, each drupal site is on a different server and they're distinguished not by domains, but by the prefix it'self.

    We're writing enterprise solutions and by using base_url it ensures that the solution works for EVERYONE!

    We're not just wanting this to work for your site that's managed by domains only, but every possible way to use drupal.

  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.4 + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    Composer require failure
  • 🇧🇪Belgium dieterholvoet Brussels

    Well, the code you wrote is only being used for CKEditor 4 and it's deprecated , Drupal support to be removed by the end of 2023, so I don't think it makes sense to keep this in the MR. If you want that base_url logic it has to be rewritten in a way that's compatible with CKEditor 5 and the field widget. Maybe the find-replace can happen in a backend controller instead of in the frontend?

    Anyway, if you insist on keeping that logic in the MR, I guess this issue will stay on Needs work until someone has time to re-implement it.

  • 🇧🇪Belgium dieterholvoet Brussels

    Did some more testing and all code in the if ($this->moduleHandler->moduleExists('language')) block isn't necessary either. You're converting an aliased path to an internal path, but that's not necessary: Url::fromUserInput() is able to parse aliased paths just fine. Tested using CKEditor 5 and the field widget.

  • Open in Jenkins → Open on Drupal.org →
    Core: 10.1.4 + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    Composer require failure
  • 🇧🇪Belgium dieterholvoet Brussels

    Just came across a potential issue: the current state of this patch makes it impossible to link to a specific entity translation which isn't the current language. It's a niche feature, but something that shouldn't be made impossible IMO. Maybe it would make more sense that if an editor pastes an URL containing a language prefix, the field will link to the entity translation in that specific language instead of the current language?

  • 🇨🇦Canada joseph.olstad

    Our distribution (wxt) is using patch #77. For my review of MR #35 see comment #87 and #85, same review applies to the head of the MR #35

  • 🇺🇸United States bkosborne New Jersey, USA

    I initially created this issue over 4 years ago. I think the current MR, while an improvement, does not go far enough to address the concerns I had when I created the issue:

    1. It only deals with the EntityMatcher. What if someone pastes in a link to a webform, file, a views page, or a path that's handled by a custom route?
    2. Even if every matcher was accounted for, someone can simply ignore the matcher results. In fact, I think most people that are pasting in an absolute URL will ignore the matcher results. If you have a URL you're pasting in, you already know the page you're linking to. You don't need a matcher to find it for you based on some partial page title or something you typed/pasted in. These people will just proceed without clicking any results from the autocomplete list

    Again, I think the MR here is good and should be kept and merged in as it's a UX improvement, but I think we need a different approach here to really solve the problem. I don't want content editors pasting in absolute links to their own site, full stop. I help maintain a platform that hosts >1,000 sites, and those sites are often given temporary production URLs to work in until the site goes "live" with the real domain. We need a way to prevent those temporary domains from spreading into content.

    With CKEditor 4, I solved the problem using this code:

    
    /**
     * Implements hook_form_FORM_ID_alter().
     *
     * Alter the CKEditor link dialog.
     */
    function MY_MODULE_form_editor_link_dialog_alter(&$form, FormStateInterface $formState) {
      $form['#validate'][] = '_editor_link_dialog_url_validate';
    }
    
    /**
     * Custom validate for linkit dialog.
     *
     * Remove base URL from a link.
     */
    function _editor_link_dialog_url_validate(&$form, FormStateInterface $formState) {
      // Simply strip base URL if found in the link path to force the path to
      // be relative. Ideally we'd do more and check if the link is a path alias
      // and swap it with the system path, but it's complicated.
      // See https://www.drupal.org/project/linkit/issues/3078075.
      $attrs = $formState->getValue('attributes');
      if (isset($attrs['href'])) {
        $convertedLink = _change_url_to_relative($attrs['href']);
        if ($convertedLink !== $attrs['href']) {
          $attrs['href'] = $convertedLink;
          $formState->setValue('attributes', $attrs);
        }
      }
    }
    

    This worked completely outside of the LinkIt context. Unfortunately, this doesn't work with CKEditor 5, which doesn't use Drupal's form API for building the link widget interfaces. This feels like something that should be solved in the CKEditor 5 plugin directly, where it strips out the base URL when the link dialog is closed. For the field widget, we can likewise have a form validation callback that does it there.

    Anyone have thoughts on that?

  • 🇺🇸United States bkosborne New Jersey, USA

    Actually, the more I think of it, the more I realize that my problem should indeed be solved like this:

    1. A custom CKEditor 5 plugin that detects when a link is inserted via the Link dialog (which the LinkIt module simply extends) and strips the base URL from there
    2. A modification to the LinkIt field widget to add the validation function to remove the base URL

    The first point I realized after seeing that the LinkIt CKEditor 5 integration is just extending the main Link plugin from CKEditor.

  • 🇨🇦Canada joseph.olstad

    @bkosborne patch #68 cleans up the FQDN for an absolute if it matches the current host, this gives us matcher results afaik. You should try patch #68. The Merge Request is some rogue changes that were made by someone ignoring what I was talking about.

  • 🇧🇪Belgium dieterholvoet Brussels

    @joseph.olstad if you actually read my comments, you would know that those changes you did and I reverted only worked for CKEditor 4, not 5. So they won't help @bkosborne. I didn't ignore what you talked about, like I said I removed these CKEditor 4 related changes for now, until someone is able to re-implement them in a way that's compatible with both 4 and 5 (or just 5, because 4 is not even supported anymore by core).

  • 🇨🇦Canada joseph.olstad

    ck4 is still being used with D10.2+ in large numbers , in fact I published this module during Drupalcon and it went from 0 installs to now over 7600 installs
    https://www.drupal.org/project/ckeditor4_codemirror

    But ya, ck5 obviously not ready yet.

  • 🇺🇸United States mark_fullmer Tucson

    ck4 is still being used with D10.2+ in large numbers

    Weighing in as the maintainer of the Linkit module, I would prefer to see work for this initially to be scoped to CKEditor 5 coverage, and not dependent on CKEditor 4 coverage. I'm not going to put any roadblocks into continued support for CKEditor 4 from the community, but like to treat that as a "backport" of sorts.

  • 🇺🇸United States mark_fullmer Tucson

    We need a way to prevent those temporary domains from spreading into content. (from comment #92 Detect and strip base URL from pasted URLs to increase matching hits and support multilingual Needs work

    The Pathologic module supports this, albeit at the page rendering stage. I am inferring that the ask here is that the base URL of the existing site be stripped from the data saved to the database, right? If so, I think the big question is *when* this should be triggered. On keyup (i.e., immediately)? On click (i.e., when the user moves to save the link)?

    The attached patch and screencast demonstrate the most aggressive approach -- on keyup (this GIF demonstrates that the keyup will trigger both on manual keyboard entry and copy-paste from clipboard):

    If we were to go this route...
    1. I think it would need to be an opt-in option, since some sites, such as decoupled architecture, might not want this behavior.
    2. Maybe the baseUrl needs to be configurable, rather than automatically detected, along the lines of how Pathologic provides multiple baseUrl matchers?
    3. For the Linkit *suggestion* to work in conjunction with this without requiring a user selecting the matching autocomplete option, Do not require selection of autocomplete if only one match (pressing 'Submit' is sufficient) RTBC would need to land.

  • 🇨🇦Canada joseph.olstad

    @mark_fullmer there's a reason why drupalSettings.base_url is used. A lot of work went into this to ensure that it works with all the ways Drupal can be installed, configured and used, including with vhosts based on a path prefix.

    I've commented many times above explaining the various use cases.

  • 🇺🇸United States mark_fullmer Tucson

    there's a reason why drupalSettings.base_url is used

    Thanks for this! I'm on board with using drupalSettings.base_url, and had seen this discussed in the comments. The patch in #98 ("If we were to go this route...") is only meant to be a proof of concept for the alternate implementation proposed by bkosbone (part 1, #93).

    If we end up going that route, there are a number of things that will need to be modified about #98.

  • 🇺🇸United States mark_fullmer Tucson
  • 🇧🇪Belgium matthiaso Leuven

    This patch (tested with 3078075-77) converts all characters to lowercase, which breaks some links, e.g., Google Drive documents.

  • 🇨🇦Canada joseph.olstad

    @MatthiasO here's a patch for the patch you can try.

    looks like to increase hits by using case insensitive we need to check first if the path is matching the current site.

    This patch disables the insensitive optimisation completely. It should be improved but in a pinch will handle drive.google.com links.

    So, this patch should be applied AFTER patch 77 is applied.

    so apply 77 then this patch.

  • Pipeline finished with Failed
    5 months ago
    #213988
  • Pipeline finished with Failed
    5 months ago
    Total: 253s
    #213989
  • 🇧🇪Belgium dieterholvoet Brussels

    I rebased the MR against the 7.x branch.

    I also looked into the lowercasing of paths. I get the intention, and correct me if I'm wrong, but I don't think it's necessary to do it manually. Linkit's entity matcher uses an entity query to search for entities and both entity queries and database queries are case insensitive by default in Drupal (source). I removed the code related to lowercasing incoming paths from the MR.

  • Pipeline finished with Failed
    5 months ago
    Total: 268s
    #214002
  • Pipeline finished with Canceled
    5 months ago
    Total: 71s
    #214833
  • Pipeline finished with Failed
    5 months ago
    Total: 275s
    #214834
  • Pipeline finished with Failed
    5 months ago
    Total: 286s
    #214844
  • 🇧🇪Belgium dieterholvoet Brussels

    I added a 'Remaining tasks' section to the issue summary in an attempt to create an overview what still needs to be done here. I think the biggest thing would be to create a (configurable) list of base urls and use that both in backend and in frontend to strip entered urls, instead of only using the current base url (global $base_url, which is the same as \Drupal::request()->getSchemeAndHttpHost()).

  • 🇳🇿New Zealand stewest Wellington

    The change in #104 https://www.drupal.org/project/linkit/issues/3078075#comment-15665863 Detect and strip base URL from pasted URLs to increase matching hits and support multilingual Needs work which was undoing the removal of the lowercase function means that urls like youtube playlists when created as an external link (not media) will have their url lowercased which break the external link.

    I'm not sure if this should be another issue created specifically for this, but that code change here, would affect this outcome.

    So instead of
    https://www.youtube.com/playlist?list=PLpeDXSh4nHjThszrEoBfJ8sY2NYif0k6b
    we end up with
    https://www.youtube.com/playlist?list=plpedxsh4nhjthszreobfj8sy2nyif0k6b

  • 🇧🇪Belgium dieterholvoet Brussels

    You're right, I think adding back mb_strtolower was accidental. I'll remove it again.

  • Pipeline finished with Failed
    3 months ago
    Total: 259s
    #252577
Production build 0.71.5 2024