Provide 'Check' button for running link checker, provide optional host and port substitution to check links behind a reverse proxy and/or docker environment

Created on 10 May 2021, over 3 years ago
Updated 20 April 2024, 7 months ago

Problem/Motivation

Want to use linkchecker behind a locked down server environment yet our docker is using a non-standard port internally for http /https , provide a form option for host and port substitution to be used in combination with a hosts file entry.
also, sort of not related, found a 999 error code from linkedin that crashes guzzle. Easy fix is to ignore linkedin.com by adding it to the ignore.

Steps to reproduce

cut off internet access , no internet access or a dns limitation so use the patch β†’ and configure host and port substitution in the linkchecker settings.

Sort of related solution for non-standard status code must be an integer value between 1xx and 5xx.
add linkedin.com to the ignore list , it returns a 999 code which throws an exception.

In ProcessBase.php line 172:
[InvalidArgumentException]
Output is empty.

Other Scenario 999 error:
- Create a "Check links" button in "LinkCheckerAdminSettingForm.php" by implementing the function "submitCheck"
$this->checkerBatch->batch();
- Set the "External only" in "What type of links should be checked" dropdown selection on /admin/config/content/linkchecker
- Click "Save Configuration"
- Click "Clear link data and analyze content for links" button
- Click "Check links" button

The error below is thrown in the web browser after the "Check link" process has run for a while.

An AJAX HTTP error occurred.
HTTP Result Code: 500
Debugging information follows.
Path: /en/batch?id=303&op=do_nojs&op=do
StatusText: 500 Service unavailable (with message)

Proposed resolution

See patch
also, add ignore for linkedin.com as it throws a 999 error causing a guzzle exception
ResponseText: The website encountered an unexpected error. Please try again later.InvalidArgumentException: Status code must be an integer value between 1xx and 5xx. in GuzzleHttp\Psr7\Response->assertStatusCodeRange() (line 152 of vendor/guzzlehttp/psr7/src/Response.php).

https://github.com/guzzle/guzzle/issues/2751
Status code must be an integer value between 1xx and 5xx

https://github.com/guzzle/psr7/pull/334
Allow more status codes

Remaining tasks

review patch β†’ and ignore linkedin.com

User interface changes

adds form settings for port and host replacement

API changes

works around limitations of some server environments using a host replace and port substitution (for docker environments or reverse proxy environments serving on a non-standard http https port).

Data model changes

n/a

πŸ› Bug report
Status

Needs work

Version

2.0

Component

Code

Created by

πŸ‡¨πŸ‡¦Canada jamesyao

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • πŸ‡ΊπŸ‡ΈUnited States mkindred

    I was hoping to find a way to override the host for docker containers, so I tried the patches here. The latest #22 no longer applies cleanly, so I reworked it so that it does.

    But I'm a little concerned about a few things with the patch.

    1. Seems like a patch should solve one problem, but this issue is a combination of several unrelated problems:
      1. environment host override
      2. check button
      3. exclusions (not in the current patch)
      4. and content published status
    2. I thought the logic used for the overridehosturl was unclear: use overridehosturl only if (host is set and host== basepath)... why not just override if overrideurl is set? (I might be missing something here that's not applicable in my environment.)

    But I've always been a bit confused by linkchecker's config: seems like base 'path' includes hostname, and that has always perplexed me. I think the config descriptions could use some updating.

    So #23 works for me locally, but I don't think I'll apply it for production. Perhaps the maintainers can chime in... the hostname override is a very good thing.

  • πŸ‡¨πŸ‡¦Canada jamesyao

    Since the client requested "searching publishing contents only" functionality, the linkchecker_index and linkcker_link tables need to be refreshed when the Cron job is executed.
    I updated the logic to clean all records in the two tables if the searching publishing contents only is requested.

  • πŸ‡©πŸ‡ͺGermany gngn

    I think the handling of 'scanning published contents' in LinkExtractorBatch::processEntities() starting in #17 assumes that the scanned entity is a node - but it could be any entity type (e.g. paragraph).

    I tried to illustrate the problem with "COMMENT":

        $entityTypes = $this->getEntityTypesToProcess();
        [...]
        // COMMENT: naming this $publishedNodesOnly - setting name use "content"
        $publishedNodesOnly = $this->linkcheckerSetting->get('search_published_contents_only');
    
        // COMMENT:looping through different entity types
        foreach ($entityTypes as $entityTypeData) {
          [...]
          // COMMENT: storage for current entity type
          $storage = $this->entityTypeManager->getStorage($entityType->id());
          foreach ($ids as $id) {
            // COMMENT: $id is ID of current entity type (and we're loading it)
            $entity = $storage->load($id);
            if ($entity instanceof FieldableEntityInterface) {
              // COMMENT: $entity is a FieldableEntity
              if ($publishedNodesOnly) {
                // COMMENT: $entity->id() is the same as $id (since we loaded $entity with $id) ...
                $nid = $entity->id();
                // COMMENT: ... but now we call it node-ID - and try to load this node!
                // That is only correct if current entity type really is node - but it could be anything.
                $node = $this->entityTypeManager->getStorage('node')->load($nid);
                [...]
              }
            else {
                // Handle $entity like before this patch
                [..]
              }
            }
            [..]
          }
          [..]
        }
    

    Or am I getting something wrong?

  • Assigned to joseph.olstad
  • πŸ‡¨πŸ‡¦Canada joseph.olstad

    I'm going to be doing a fairly serious cycle for a refresher/review on this in the coming days/weeks.

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 9 months ago
    86 pass
  • πŸ‡¨πŸ‡¦Canada joseph.olstad

    Evaluating this patch currently.

  • πŸ‡¨πŸ‡¦Canada joseph.olstad

    interdiff for 27 to 24

  • Status changed to Active 9 months ago
  • πŸ‡¨πŸ‡¦Canada joseph.olstad

    patch 27 didn't work out, working on a new one.

  • πŸ‡¨πŸ‡¦Canada joseph.olstad

    New patch, reduced the scope to docker and reverse proxy support, removed the exclusion of published content as there's another patch for this that is available in another issue:
    ✨ Improve performance of LinkExtractorBatch & allow to skip unpublished content Needs review

  • Status changed to Needs review 9 months ago
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 9.5.x + Environment: PHP 8.1 & MySQL 8
    last update 9 months ago
    86 pass
  • Issue was unassigned.
  • πŸ‡¨πŸ‡¦Canada joelpittet Vancouver

    Can the scope just be "Provide 'Check links' button for running link checker"?

    You could solve the other part with pathologic β†’ or something like that?

    In the latest patch @joseph.olstad there is a couple lines doing isset() followed by !empty() but !empty() does an implied isset(). Further the variable exists in all cases so you can just do if (!$var)

    So

    // for non-standard environment (docker & ubuntu) on the security issue
        if ((isset($uri['host']) && $uri['host'] == $basepath) && (isset($overridehosturl) && !empty($overridehosturl))) {
          if (isset($overridehostport) && !empty($overridehostport)) {
            $replacement = $overridehosturl . ':' . $overridehostport;
          }
          else {
            $replacement = $overridehosturl;
          }
          $host_path = $uri['scheme'] . '://' . $uri['host'];
          $url = str_replace($host_path, $replacement, $url);
        }

    becomes

        // For non-standard environment (docker & ubuntu) on the security issue.
        if ((isset($uri['host']) && $uri['host'] == $basepath) && $overridehosturl) {
          $replacement = $overridehostport ? $overridehosturl . ':' . $overridehostport : $overridehosturl;
          $host_path = $uri['scheme'] . '://' . $uri['host'];
          $url = str_replace($host_path, $replacement, $url);
        }
    

    Those variable names should be snake case too... $override_host_port

    My preference would be to split the task up, easier for one to get accepted before the other.

  • Status changed to Needs work 7 months ago
  • πŸ‡¨πŸ‡¦Canada joelpittet Vancouver
  • πŸ‡¨πŸ‡¦Canada joseph.olstad

    I would suggest to remove the "provide check button" as it's already provided in another issue/patch.

    This issue should focus on "Optional host and port substitution to check links behind a reverse proxy and/or docker environment"

    I'd prefer to keep this optional functionality inside of linkchecker not requiring some third party complex module like pathologic which I doubt will even work in this case.

Production build 0.71.5 2024