Adding nbsp; removing + better prep_replace pattern

Created on 13 April 2021, over 3 years ago
Updated 18 July 2023, over 1 year ago

Hi Dom,
I'm ended up to your module (thanks!), and I modified some code in order to better fit my needs: an   removal + a better preg_replace pattern.


namespace Drupal\filter_empty_tags\Plugin\Filter;

use Drupal\filter\Annotation\Filter;
use Drupal\filter\FilterProcessResult;
use Drupal\filter\Plugin\FilterBase;

/**
 * @Filter(
 *   id = "filter_empty_tags",
 *   title = @Translation("Filter Empty Tags"),
 *   description = @Translation("Recursively remove empty tags."),
 *   type = Drupal\filter\Plugin\FilterInterface::TYPE_HTML_RESTRICTOR,
 * )
 */
class FilterEmptyTags extends FilterBase {

  public function process($text, $langcode) {

    //** Return if string not given or empty.
    if (!is_string ($text) || trim ($text) == '')
      return new FilterProcessResult($text);

    //** Remove  .
    $text = str_replace( ' ', ' ', $text );
    //** Recursive empty HTML tags.
    $text = preg_replace("/<([^<\/>]*)([^<\/>]*)>([\s]*?|(?R))<\/\1>/imsU", '', $text);

    return new FilterProcessResult($text);
  }

}

This remove the &nbsp; and then <p></p> as well as <p class=โ€somethingโ€></p>.

I think it is more complete and assure a better clean.
What about?

Bye,
M.

๐Ÿ“Œ Task
Status

Fixed

Version

1.0

Component

Code

Created by

๐Ÿ‡ฎ๐Ÿ‡นItaly bazzmann

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States danflanagan8 St. Louis, US

    Hi! Thanks for this module and the proposed change in this issue. I started by writing a bunch of test cases and found that this updated approach got a bit closer to what I was looking for, but it didn't get all the way there for me.

    I'm posting an updated approach that handles  ,
    (etc), and more complicated recursive emptiness. I don't think the test coverage is configured to be run here, but locally I can assure you that everything passes.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States danflanagan8 St. Louis, US

    FWIW, here's are the failures with the current filter in the module:

    Testing Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest
    .......FF..FFFFFF......                                           23 / 23 (100%)
    
    Time: 1.39 seconds, Memory: 4.00 MB
    
    There were 8 failures:
    
    1) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #7 ('<p class="something"></p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p class="something"></p>'
    
    
    2) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #8 ('<p id="whatever" class="somet..."></p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p id="whatever" class="something"></p>'
    
    
    3) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #11 ('<p><br/></p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p><br/></p>'
    
    
    4) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #12 ('<p>&nbsp;</p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p>&nbsp;</p>'
    
    
    5) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #13 ('<p>&nbsp;&nbsp;</p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p>&nbsp;&nbsp;</p>'
    
    
    6) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #14 ('<p> <b> </b> </p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p>  </p>'
    
    
    7) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #15 ('<p>&nbsp;<b>&nbsp;</b>&nbsp;</p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p>&nbsp;<b>&nbsp;</b>&nbsp;</p>'
    
    
    8) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #16 ('<drupal-media data-caption="b...media>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<drupal-media data-caption="baz" data-entity-type="media" data-entity-uuid="abc123"></drupal-media>'
    

    And below are the failures with the proposal in the IS.

    Testing Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest
    ...........F..FF.F..F..                                           23 / 23 (100%)
    
    Time: 1.39 seconds, Memory: 4.00 MB
    
    There were 5 failures:
    
    1) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #11 ('<p><br/></p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p><br/></p>'
    
    
    2) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #14 ('<p> <b> </b> </p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p>  </p>'
    
    
    3) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterEmptyTags with data set #15 ('<p>&nbsp;<b>&nbsp;</b>&nbsp;</p>', '')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -''
    +'<p>  </p>'
    
    
    4) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterControlStrings with data set #0 ('<h2>This is a control string<...&nbsp;')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -'<h2>This is a control string</h2><div>This string should <b><em>not</em></b> be altered in ANY&nbsp;WAY by the filter. </div> <hr><p class="some-class"><><><></p>&nbsp;'
    +'<h2>This is a control string</h2><div>This string should <b><em>not</em></b> be altered in ANY WAY by the filter. </div> <hr><p class="some-class"><><><></p> '
    
    
    5) Drupal\Tests\filter_empty_tags\Unit\FilterEmptyTagsTest::testFilterControlStrings with data set #3 ('<h1> <h2> hi <p><div>yes</div...&nbsp;')
    Failed asserting that two strings are identical.
    --- Expected
    +++ Actual
    @@ @@
    -'<h1> <h2> hi <p><div>yes</div></p><br></h2>&nbsp;'
    +'<h1> <h2> hi <p><div>yes</div></p><br></h2> '
    
    
  • ๐Ÿ‡บ๐Ÿ‡ธUnited States danflanagan8 St. Louis, US

    Hm, I realized that the filter in #4 would always strip an iframe, which is no good! Here's an update that declares a list of tags to never consider empty. This includes iframe as well as drupal-media and drupal-entity, which are nice to protect. This should make the weight of this filter less important.

    I couldn't figure out how to do this with a preg_replace, so I changed course to use a preg_replace_all and then we loop through the matches.

    Test coverage is even better in this patch and everything passes locally.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States danflanagan8 St. Louis, US

    TIL \s does not match the literal (not html entity) version of the non-breaking space. This latest patch addresses that and adds a couple test cases why not.

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States danflanagan8 St. Louis, US

    I have a nutty site with script tags in wysywigs. I need to update the list of never-empty tags to include script tags. I also realized button should be allowed and then alphabetized the list. Ideally this list would be configurable. Maybe a goal for 2.x.

  • Status changed to RTBC over 1 year ago
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance Dom.

    Hi !

    Thanks @danflanagan8 for your patch submission. It includes tests as I can see and this is really quality work.
    I am very sorry for (huge) response time, and I wish you would forgive me.

    As of today, I will change a little something in your patch: I guess this is just an encoding issue, but opening your patch in the browser shows a "ร‚" character extra that kills for french langage on both those lines:

    '/<([^<\/>)]*)\b[^>]*>((\s|ร‚ |&nbsp;|<br>|<br\/>|<br \/>)*?)<\/\1>/imsU',
    

    ['<p>ร‚ </p>', ''],

    Therefore let me just remove that characters (that does not show on the review plugin surprisingly).

    • Dom. โ†’ committed 96abb134 on 1.x
      Issue #3208610 by danflanagan8: Adding nbsp; removing + better...
  • Status changed to Fixed over 1 year ago
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance Dom.

    Will be included in release 2.0.0.
    Also, I am adding up a follow-up for configuring which tags should be configurable as skippable: โœจ Make "doNotConsiderEmpty" tags configurable. Fixed

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States danflanagan8 St. Louis, US

    Hi Dom.,
    That strangely encoded character was supposed to be the unicode non-breaking space. In my manual testing it is important to have that character as part of the regex.

    In my text editor it shows up as [NBSP]. I copy/pasted from some online resource to get the literal unicode character.

    I'm not very knowledgeable about encoding though to be honest. We need to somehow get the regex to recognize a literal unicode nbsp.

    Looks like maybe we could use \xc2\xa0/?

    https://stackoverflow.com/questions/40724543/how-to-replace-decoded-non-...

  • Status changed to Needs work over 1 year ago
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance Dom.

    That shows to me as : |ร‚ | which is the letter A with the ^ followed by a normal space.
    I therefore tried that is a div on front and indeed it is left as is.

    May I suggest to try using |\u00A0| which is the unicode character \u00A0 that I believe to correspond to nbsp ?
    https://www.fileformat.info/info/unicode/char/00a0/index.htm

  • ๐Ÿ‡บ๐Ÿ‡ธUnited States danflanagan8 St. Louis, US

    I'm learning a lot here!

    I tried using \u00A0 and got this error:

    preg_match_all(): Compilation failed: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u at offset 34

    I swtiched to \xc2\xa0 and that appears to have worked.

    I'm working on the 2.x issue now. That's probably where the focus should be.

    • Dom. โ†’ committed 986243bb on 2.x
      Issue #3208610 by danflanagan8, Dom.: Adding nbsp; removing + better...
  • Status changed to Fixed over 1 year ago
  • ๐Ÿ‡ซ๐Ÿ‡ทFrance Dom.

    I replaced the non-breakable character with \xc2\xa0.
    Added a test which mixes nbsp, space, and return char.

    • Dom. โ†’ committed 986243bb on 1.x
      Issue #3208610 by danflanagan8, Dom.: Adding nbsp; removing + better...
  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024