[D7] preg_split in _filter_url breaks for long html tags

Created on 12 November 2020, about 4 years ago
Updated 13 March 2023, over 1 year ago

Problem/Motivation

N.B.: Backport issue from πŸ› preg_split in _filter_url breaks for long html tags Fixed
Having a really long HTML tag (e.g. < img > with src:data/image...) makes preg_split fail and return false. On PHP 7.4 this throws a warning ("Warning: count(): Parameter must be an array or an object that implements Countable in _filter_url() (line 535 of core/modules/filter/filter.module).") and makes the field render empty, on PHP 8.0 this throws a fatal error.

Steps to reproduce

1) Have a text format that has Convert URLs into links enabled
2) Using that text format, add a node with content like this: https://gist.github.com/kporras07/618b3bf4cd77ff57fcd5034262220e99
3) Visit the node
4) You will get the warning and empty node or the fatal error depending on your PHP version

Proposed resolution

Backport from πŸ› preg_split in _filter_url breaks for long html tags Fixed

πŸ› Bug report
Status

Needs work

Version

7.0 ⚰️

Component
FilterΒ  β†’

Last updated 3 days ago

No maintainer
Created by

πŸ‡·πŸ‡ΊRussia DD 85

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • πŸ‡ΈπŸ‡°Slovakia poker10

    D10 issue was committed. Therefore changing the status to NW, so that the D7 patch can be checked if it has all changes from the D10 patch and can be moved to NR. Thanks!

  • Status changed to Needs review over 1 year ago
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Ready for another round of reviews, I backported any relevant changes from the final D9/10 committed codes.

  • Status changed to Needs work over 1 year ago
  • πŸ‡ΊπŸ‡ΈUnited States neclimdul Houston, TX

    Again, not D7 specific. You can continue the back port but this needs to be fixed in the latest release.

  • Status changed to Needs review over 1 year ago
  • πŸ‡ΊπŸ‡ΈUnited States neclimdul Houston, TX

    ooof sorry I see #21 now sorry. Annoyed at the duplicate issue... sorry for the noise.

  • πŸ‡¨πŸ‡¦Canada joseph.olstad

    Great work @stefanos!

  • Status changed to Needs work over 1 year ago
  • πŸ‡ΈπŸ‡°Slovakia poker10

    @stefanos.petrakis Thanks for working on this! I have reviewed the merge request and added some comments.

  • Status changed to Needs review over 1 year ago
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Thanks @poker10, this is ready for another round.

  • πŸ‡ΈπŸ‡°Slovakia poker10

    Thanks, the changes looks good!

    However I am a bit concerned by the testbot on PHP 8.2 / PostgreSQL 13.5, as this combination is ending with a fatal error, see: https://www.drupal.org/pift-ci-job/2626124 β†’ . Not sure if these are some random errors (I have tried 5 runs, see the history), but merge requests on other D7 issues on this PHP 8.2 / PostgreSQL 13.5 combination are green (see: https://www.drupal.org/pift-ci-job/2626080 β†’ ). And also Drupal 7 core is green either. So probably there is some problem related to this patch. It would probably be a good idea to take a look at it.

    Other than that, it seems like a straight backport of the D10 patch, so once this testbot issue are solved, I think this could be RTBC. Thanks!

  • Status changed to Needs work over 1 year ago
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Let's clear this, I am gonna investigate it right now.

  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Found it, it was the pcre.backtrack_limit messing with preg_replace calls that the pgsql layer was issuing. A little refactoring made lots of sense to keep that effect restricted.

    Waiting for the tests to set this back to NR.

  • Issue was unassigned.
  • Status changed to Needs review over 1 year ago
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland
  • πŸ‡ΈπŸ‡°Slovakia poker10

    Thanks, the PostgreSQL problem seems to be solved. But hmm, actually all tests are green except the PHP 7.2 / MySQL 5.6, see: https://www.drupal.org/pift-ci-job/2626322 β†’

  • Status changed to Needs work over 1 year ago
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Let's have a look at this one too.

  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Unfortunately, this needs to stay at "Needs Work" for the time.

    The test introduced here is based on the idea of breaking the preg_* function using pcre.backtrack_limit; this idea as @poker10 observed does not work for any PHP version smaller than 7.3. From PHP7.3 and higher all preg_* functions return bool(FALSE) on failures.

    This breaks the new tests due to this part:

        // Split at all tags; ensures that no tags or attributes are processed.
        $chunks = is_null($text) ? array('') : preg_split('/(<.+?>)/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    

    PHP7.2 (and 7,1 and 7,0): preg_split splits the string partially and $chunks is therefore not FALSE; later on the _filter_url() returns a result that differs from the input text.

    PHP5.6 (and 5.3 and probably all 5,*): preg_split does not split, but returns the whole string and here $chunks is not FALSE either; later on the _filter_url() returns the whole string as a result by coincidence keeps the tests from breaking.

    Ideas for extending the tests to support 5.x and 7,0,1,2 are welcome, I have to think about this some.

  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    FYI: I tried the different PHP versions here => https://3v4l.org/mhmnn#v5.6.40

    using this snippet:

    <?php
    // example code
    
    ini_set('pcre.backtrack_limit', 1);
    $text = "<p>www.test.com</p>";
    $chunks = is_null($text) ? array('') : preg_split('/(<.+?>)/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    var_dump($chunks);
    
  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    I started to look into this compatibility puzzle. As I was browsing the PHP docs for preg_split() and preg_replace_callback(), I noticed something that looks useful: preg_last_error(). I think it will be more reliable to ask PHP whether there was an error instead of trying to guess based on the result.

    Following the example of Comment #35, I tested the following snippet:

    $text = '<p>www.test.com</p>';
    
    echo 'pcre.backtrack_limit: ' . ini_get('pcre.backtrack_limit') . PHP_EOL;
    $chunks = preg_split('/(&lt;.+?&gt;)/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    echo 'error: ' . preg_last_error() . PHP_EOL;
    is_array($chunks) ? print_r($chunks) : var_dump($chunks);
    
    ini_set('pcre.backtrack_limit', 1);
    echo 'pcre.backtrack_limit: ' . ini_get('pcre.backtrack_limit') . PHP_EOL;
    $chunks = preg_split('/(&lt;.+?&gt;)/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    echo 'error: ' . preg_last_error() . PHP_EOL;
    is_array($chunks) ? print_r($chunks) : var_dump($chunks);
    

    For example, you can see the result using PHP 7.2 at https://3v4l.org/peFdO#v7.2.34. (How do you get results in multiple PHP versions at once?) I tested all the PHP versions tested in #29. (I tested the latest patch version available for each.) As predicted in #34, the last line varied, but I always got 0 and 2 for preg_last_error(): that is, PREG_NO_ERROR and PREG_BACKTRACK_LIMIT_ERROR.

    I think it makes sense to replace a line like

        $chunks = is_null($text) ? array('') : preg_split('/(<.+

    )/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    ?>

    with

    I think it makes sense to replace a line like

        $chunks = !preg_last_error() ? array('') : preg_split('/(<.+

    )/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    ?>

  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    Oops, I pasted my comment into the Issue summary field. Let me try to restore it.

  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    Despite the last few lines of Comment #36, I think it will be clearer to use a few more lines. First, move $saved_text = $text; inside the loop. Then, add something like this after each call to a preg_ function:

        if (preg_last_error()) {
          $text = $saved_text;
          continue;
        }
    

    In particular, use that instead of putting most of the loop inside if ($chunks !== FALSE) { ... }.

    I made a similar suggestion in #3239472-56: preg_split in _filter_url breaks for long html tags β†’ and the following comment, and in πŸ“Œ Reduce complexity in _filter_url() Active . In Comment #2 on that issue, @alexpott recommends completely refactoring the code in D10, so I think we do not have to worry here about keeping the code consitent between D7 and D10.

  • πŸ‡¬πŸ‡§United Kingdom RhiP

    I've just done a reroll. Rushed, but sharing in case it helps others. Drupal 7.98

  • last update over 1 year ago
    Patch Failed to Apply
  • last update over 1 year ago
    Patch Failed to Apply
  • last update over 1 year ago
    Patch Failed to Apply
  • last update over 1 year ago
    Patch Failed to Apply
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Environment: PHP 7.4 & PostgreSQL 9.5
    last update over 1 year ago
    Patch Failed to Apply
  • last update over 1 year ago
    Patch Failed to Apply
  • last update over 1 year ago
    Patch Failed to Apply
  • last update over 1 year ago
    Patch Failed to Apply
  • last update over 1 year ago
    Patch Failed to Apply
  • last update over 1 year ago
    Patch Failed to Apply
  • last update about 1 year ago
    2,161 pass
  • last update about 1 year ago
    2,161 pass
  • Status changed to RTBC 8 months ago
  • πŸ‡¦πŸ‡ΉAustria klausi πŸ‡¦πŸ‡Ή Vienna

    Thanks, pull request looks good to me!

    I don't see anything else missing here, so we can go ahead with this?

  • Status changed to Needs work 8 months ago
  • πŸ‡ΈπŸ‡°Slovakia poker10

    I think the current MR was causing failures on some PHP/SQL combinations, see: https://git.drupalcode.org/project/drupal/-/pipelines/39895

    @benjifisher proposed a solution which could fix this (in #37/#38), but if I see correctly, this was not implemented yet.

  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Picking this up for work following #37 and #38 (in solidarity to folks at DCon :-) )

  • Pipeline finished with Failed
    about 2 months ago
    Total: 204s
    #294373
  • Pipeline finished with Success
    about 2 months ago
    Total: 1578s
    #294430
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Back to NR, tests are green for all targets
    https://git.drupalcode.org/issue/drupal-3182166/-/pipelines/294430

  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    Stefanos:

    Thanks for getting back to this. The current patch is a lot easier to review, since it does not re-indent a large block of code. I think you just made one mistake while making that change, so back to NW.

    Comments #36 and #38 were a while ago, but re-reading them now, I think what I had in mind was

        _filter_url_escape_comments('', TRUE);
        $text = preg_replace_callback('`<!--(.*?)-->`s', '_filter_url_escape_comments', $text);
        if (preg_last_error()) {
          $text = $saved_text;
          continue;
        }
    
        // Split at all tags; ensures that no tags or attributes are processed.
        $chunks = preg_split('/(<.+

    )/is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (preg_last_error()) {
    $text = $saved_text;
    continue;
    }
    ?>

    I wrote,

    ... add something like this after each call to a preg_ function:

    (emphasis added).

  • Pipeline finished with Canceled
    about 2 months ago
    Total: 385s
    #295341
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Hey Benji

    Regarding:

    ... add something like this after each call to a preg_ function:

    This is now done, via a bit of refactoring, using exception throwing.
    Now, all preg_* functions inside _filter_url() will cause reverting to the saved text and moving on to the next iteration.

    This should be good to move back into NR once the tests are green.
    I am off for the weekend, so, please take charge of the codes if that will help finalizing this issue.

  • Pipeline finished with Success
    about 2 months ago
    Total: 649s
    #295344
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Tests are green, back to NR

  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Back to needs work, added one more change, rerunning the tests

  • Pipeline finished with Success
    about 2 months ago
    #295362
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Ready for review

  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    See the MR comment: there is still one place where the current version checks `is_null($text)` unnecessarily. Back to NW for that.

    I guess there are arguments in favor of using exceptions here:

    1. Exceptions are appropriate for dealing with error conditions, such as when the pcre.backtrace_limit parameter is exceeded.
    2. There is only one place in the code that restores $text = $saved_text.

    On the other hand,

    1. Exceptions are normally used with complex call stacks, not when interrupting a single loop.
    2. It is best to keep try blocks small, and the current version puts about 60 lines of code in that block.

    I will not insist, but I prefer using $text = $saved_text; break; instead of throwing and catching exceptions.

  • Pipeline finished with Success
    about 2 months ago
    Total: 649s
    #297947
  • Pipeline finished with Success
    about 2 months ago
    Total: 419s
    #298011
  • Pipeline finished with Success
    about 2 months ago
    Total: 789s
    #298036
  • πŸ‡¨πŸ‡­Switzerland stefanos.petrakis@gmail.com Biel, Switzerland

    Applied the handy suggestion and then some more.

    Thanks for the breakdown regarding Exceptions. And for keeping my brain rolling. :-)
    I attempted that \Exception-based refactoring (35c3a164a90184ad3ba0e53520aaa8ba428e6901) for two reasons, aiming to "ease" the code's state:

    1. So that there is only one place in the code that restores $text = $saved_text.
    2. So that continue-ing to the next iteration of the root loop from inside deeply-nested loops can be done in a more economic way (than any other I could think of). E.g.
      
        foreach ($tasks as $task => $pattern) {
            ...
            for ($i = 0; $i < count($chunks); $i++) {
                ...
                if ($open_tag == '') {
                  ...
                  // This is where we would want the root iteration to abort and continue to the next one,
                  // if a preg_ failure occurs
                  $chunks[$i] = preg_replace_callback($pattern, $task, $chunks[$i]);
      
      

    But, reading the docs again, I figured this is exactly the case for `continue DEPTH`.
    So I replaced \Exception with continue and removed some more redundant code defending.

    I find the changeset is now as effective and compact as it can be and the new tests capture what the original issue detected as a problem.

    Back to NR (tests are green).

  • Pipeline finished with Success
    about 2 months ago
    Total: 513s
    #298055
  • πŸ‡ΊπŸ‡ΈUnited States benjifisher Boston area

    Code review

    I find the changeset is now as effective and compact as it can be ...

    I like that it is compact! That makes it much easier to review.

    reading the docs again, I figured this is exactly the case for continue DEPTH.

    I think I have used continue 2 once or twice. It does not come up very often. Another option is goto, but the xkcd comic on that page is a very relevant warning.

    I searched the _filter_url() code for "preg_", and I see that you have tested preg_last_error() after most preg_*() functions. The exceptions are preg_match() in the else clause of the inner for loop, applied to $chunks[$i] rather than the full $text. Those are probably safe.

    The one place where you needed continue 2 is in the if clause of that inner loop, also applied to $chunks[$i]. I think that is also safe, but the extra check does not hurt.

    Testing

    PHP 7.4

    I followed the "Steps to reproduce" in the issue summary. With 7.x, I got two messages. (The issue summary mentions only the first one.)

    • Warning: count(): Parameter must be an array or an object that implements Countable in _filter_url() (line 1535 of /var/www/html/modules/filter/filter.module).
    • Notice: Trying to access array offset on value of type bool in _filter_url() (line 1541 of /var/www/html/modules/filter/filter.module).

    As expected, the errors are repeated if I clear caches and reload the page. The error comes from rendering the page, not from saving the node.

    I switched to the feature branch. Clearing caches and reloading the page, there are no messages, but also no content. Then I edited the text format, enabling only the "Convert URLs into links" filter. Clear caches, reload: the image (from the gist in the issue summary) appears with no errors. Switching back to 7.x, I still see the two messages and no content.

    PHP 8.0

    With 7.x, I get a WSOD with the message

    Error
    The website encountered an unexpected error. Please try again later.
    Error message
    TypeError: count(): Argument #1 ($value) must be of type Countable|array, bool given in _filter_url() (line 1535 of /var/www/html/modules/filter/filter.module).

    Switching to the feature branch, the page loads normally.

Production build 0.71.5 2024