Performance issues after fixing 'PHP 8.1 preg_replace()' issue

Created on 20 September 2023, about 1 year ago
Updated 25 November 2023, 10 months ago

Problem/Motivation

I am using Search API on a site with PHP 8.2. I noticed the log messages described in 🐛 PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor Fixed , so I applied the patch from that issue. After applying the patch, I noticed that re-indexing the site is terribly slow. PHP is using 100% of CPU, and the batch process takes many minutes to process sample data (fewer than 100 nodes).

I get the same results applying the patch from #3347610 to the 8.x-1.29 release or using the 8.x-1.x-dev version.

Steps to reproduce

For local testing, I used DDev and followed the instructions at https://github.com/ddev/ddev-drupal9-solr to set up a Solr container.

  1. Install Drupal 10.1.5 with the Standard profile.
  2. Install Drush and search_api_solr with Composer.
  3. Enable search_api_solr (and its dependencies). Enable the Content datasource and the Body field (type fulltext).
  4. Enable the HTML filter for the Body field, with default options.
  5. For convenience of copy/paste, disable the WYSIWYG editor for the Basic HTML text format.
  6. Copy https://en.wikipedia.org/wiki/Dog (attached, about 676K) into the Body field of a node. Save.
  7. Clear and build the index: drush search-api:clear && time drush search-api:index.

Proposed resolution

Work-around: Disable indexing of title and alt attributes in the HTML processor. See Comments #3, #6.

Remaining tasks

🐛 Bug report
Status

Fixed

Version

1.0

Component

Plugins

Created by

🇺🇸United States benjifisher Boston area

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

  • Issue created by @benjifisher
  • 🇺🇸United States benjifisher Boston area

    According to XHProf, 99.9% of the CPU time comes from mb_strpos() and mb_substr(). Both are called hundreds of thousands of times from HtmlFilter::handleAttributes(), which is called 24 times. I think this confirms that #3347610 is responsible.

    My site also uses search_api_attachments, and there are several files being indexed. It seems that those are all successfully indexed, so the problem is with some of the regular text fields, but I may be misreading it.

    The text is not very long:

    MariaDB [db]> SELECT LENGTH(field_body_value) AS strlen FROM node__field_body ORDER BY strlen DESC LIMIT 0,5;
    +--------+
    | strlen |
    +--------+
    | 782574 |
    | 726969 |
    | 678612 |
    | 638747 |
    | 598372 |
    +--------+
    5 rows in set (0.001 sec)
    
  • Status changed to Postponed: needs info about 1 year ago
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Oh, that seems bad. Thanks a lot for reporting this!
    782 kB of text does seem like a lot, but such a degradation of indexing performance is of course still not acceptable. Especially just for the sake of fixing such a small problem.
    First, could you please confirm for which the “HTML filter” processor is enabled? Is it really only those that contain HTML?
    Then, when removing all fulltext fields from the index except for “Body”, does the problem still occur? And does it occur with all items (with large enough body fields), or just for specific ones (maybe with faulty or particularly complex HTML)?

    It seems tricky to try and optimize our current processing code, as I don’t see any obvious flaws. The best solution might be to add an option to switch between our previous and our current implementation – hardly ideal, though, I admit. (Who is going to notice that when they see that indexing takes ages?)

    What you can personally do to work around this is disable indexing of the “alt” and “title” attributes. That should circumvent the whole problematic method, I think.

  • Status changed to Active 11 months ago
  • 🇺🇸United States benjifisher Boston area

    @drunken monkey:

    Thanks for the reply. I am sorry I did not reply earlier.

    There is a lot going on (custom code and other) with my current site, and we may have been applying the HTML filter to non-HTML fields. So I tested the problem with a fresh install of Drupal and a single node. I just added steps to reproduce in the issue summary. I hope I did not leave anything out.

    For the record:

    $ ddev drush status
    Drupal version   : 10.1.5                                     
    Site URI         : https://drupal.ddev.site                   
    DB driver        : mysql                                      
    DB hostname      : db                                         
    DB port          : 3306                                       
    DB username      : db                                         
    DB name          : db                                         
    Database         : Connected                                  
    Drupal bootstrap : Successful                                 
    Default theme    : olivero                                    
    Admin theme      : claro                                      
    PHP binary       : /usr/bin/php8.1                            
    PHP config       : /etc/php/8.1/cli/php.ini                   
    PHP OS           : Linux                                      
    PHP version      : 8.1.23                                     
    Drush script     : /var/www/html/vendor/bin/drush             
    Drush version    : 12.4.0.0                                   
    Drush temp       : /tmp                                       
    Drush configs    : /var/www/html/vendor/drush/drush/drush.yml 
    Install profile  : standard                                   
    Drupal root      : /var/www/html                              
    Site path        : sites/default                              
    Files, Public    : sites/default/files                        
    Files, Private   : /var/www/html/sites/default/files/private  
    Files, Temp      : /tmp         
    

    Following the steps I provided, I get

    $ ddev drush search-api:clear && time ddev drush search-api:index
     [success] test was successfully cleared.
     [success] Found 1 items to index for test. Indexing all items.
     [success] Indexing a maximum number of 1 items (50 items per batch run) for the index 'test'.
    >  [notice] Successfully indexed 1 item on test.
    >  [notice] Message: Successfully indexed 1 item.
    > 
    
    real	1m6.977s
    user	0m0.080s
    sys	0m0.000s
    

    It takes less than a second if I disable both "Index title attribute" and "Index alt attribute" in the settings for the HTML filter. With either one enabled, I get about the same (more than a minute).

    If I open another terminal window and run top in the container, then it shows that PHP is using 100% of a CPU.

    If I reverse the patch from 🐛 PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor Fixed , then again it takes less than a second.

  • 🇺🇸United States benjifisher Boston area

    In Comment #2, I wrote,

    According to XHProf, 99.9% of the CPU time comes from mb_strpos() and mb_substr(). Both are called hundreds of thousands of times from HtmlFilter::handleAttributes(), which is called 24 times. I think this confirms that #3347610 is responsible.

    There are over 4,000 tags in my sample file. If mb_strpos() and mb_substr() are each called at least twice for each tag, and I have a few dozen files, that explains the "hundreds of thousands". I think my content just has too many tags for this code to handle.

    From #3:

    What you can personally do to work around this is disable indexing of the “alt” and “title” attributes. That should circumvent the whole problematic method, I think.

    I may do that for now. I do not think I will miss title attributes, but I would like to be able to index alt text from images.

    I wonder if parsing the HTML would have better performance. You can get a \DOMDocument object with Drupal\Component\Utility\Html::load($text).

  • 🇷🇪Réunion Martygraphie Saint-Denis (Réunion)

    Hi @benjifisher,
    Thank you very much for your analysis, I've encountered the same problem as you.
    Deactivating ALT / TITLE attributes solved the performance problems on my side.
    Marc

  • 🇺🇸United States benjifisher Boston area

    Let' add the work-around to the issue summary.

  • 🇫🇷France Chris64 France

    And what about to come back to the original in 🐛 PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor Fixed and try an other change to solve the null parameter problem? Like,

    -        $text = preg_replace('/<[-a-z_]+[^>]*["\s]alt\s*=\s*("([^"]+)"|\'([^\']+)\')[^>]*>/i', ' <img>$2$3</img> ', $text);
    +        $text = preg_replace('/<[-a-z_]+[^>]*["\s]alt\s*=\s*("([^"]*)"|\'([^\']*)\')[^>]*>/i', ' <img>$2??$3</img> ', $text);
  • Status changed to Needs review 11 months ago
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update 11 months ago
    515 pass, 3 fail
  • Open in Jenkins → Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update 11 months ago
    516 pass, 1 fail
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Thanks for the detailed information, @benjifisher, that was very helpful! So was your suggestion of using \Drupal\Component\Utility\Html::load(), that indeed seems to be the best solution.
    Patch attached, even including a regression test (our first for a performance issue), please test/review!

  • Status changed to Needs work 11 months ago
  • 🇭🇺Hungary balazswmann

    This problem affects me as well. Thank you @benjifisher very much for the investigation and root cause analysis. I run into this in the past could of days after a `8.x-1.27` => `8.x-1.30` update. I also realized that indexing got very very slow. In my case it takes more than 1 minute to index just 50 items but previously (with `8.x-1.27`) I didn't have problems.

    I also use the HTML filter with the default settings and I need to index documents including around a couple of hundreds to several thousands of HTML tags (per document!).

    I applied the patch from #9: let a user specify if her blog entries should be votable ( https://www.drupal.org/project/search_api/issues/3388678#comment-15294371 🐛 Performance issues after fixing 'PHP 8.1 preg_replace()' issue Needs review ) on `8.x-1.30` and it solved the problem for me. Now indexing is fast enough again. Thank you @drunken monkey!

  • 🇺🇸United States benjifisher Boston area

    @drunken monkey:

    I will try applying the patch in my current project, although Comments #8 and #12 are already positive.

    The code looks right to me, In fact, I often say that using DOMDocument is simpler and more reliable than "simpler" methods of processing HTML, and this is a great example of that. This is the first example I have seen where it is also more performant.

    I checked the docs for Html::load() and Html::serialize(), and the underlying DOMDocument methods loadHTML() and saveHTML(). None of them mention any exceptions, so I do not think you need any error handling. In fact, Html::load() explicitly suppresses warnings by calling @$dom->loadHTML(...).

    I am not sure what is going on with the failing test. The foreach loop should be empty, so the code should just convert to DOMDocument and back to string. It is not a no-op: it should "fix" invalid HTML, and a little test shows that it strips the < character, turning a < b into a b (two spaces). But the test results show just a. Maybe different versions of PHP, or the parsing library behind all this, account for the difference: a different version might "fix" the invalid HTML to a <b> and then "fix" that to a.

    I do not see any test coverage for processing both alt and title. I think that $xpath->query() will do the right thing and list <img src="..." alt="some text" title="other text"> only once, but I would like to verify that with a test.

  • 🇺🇸United States benjifisher Boston area

    As I said in the previous comment, Html::load() followed by Html::serialize() is not a no-op. It "normalizes" the HTML. That means the current patch will change what gets sent to be indexed. I do not think I would notice the difference, but in some use cases it might have an effect. I think we should add a change record and mention it in the release notes, so I am adding the tag for a CR.

  • 🇧🇪Belgium herved

    I tested patch #9 on 2 projects locally and I can also see a nice performance gain.
    Note that I don't have the actual files (pdf, docs, etc) locally so the impact on prod must be much more significant.
    Command: drush cr && drush sapi-c && time drush sapi-i

    project 1:
    - without patch: 5m7,417s
    - with patch: 3m31,457s

    project 2:
    - without patch: 5m24,580s
    - with patch: 1m46,645s

    Also, switching to DOM parsing instead of regex makes total sense.
    I'm not too worried about #14 but it's a good point.
    So, +1 thanks!

  • Status changed to Needs review 11 months ago
  • Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update 11 months ago
    Waiting for branch to pass
  • Open on Drupal.org →
    Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27
    last update 11 months ago
    Waiting for branch to pass
  • 🇦🇹Austria drunken monkey Vienna, Austria

    I do not see any test coverage for processing both alt and title. I think that $xpath-&gt;query() will do the right thing and list &lt;img src="..." alt="some text" title="other text"&gt; only once, but I would like to verify that with a test.

    Good point, thanks! Included that in the attached patch.

    I also removed the a < b test, as that is apparently not reliable anymore. (It passed fine for me locally.)

    Regarding the change notice: I really think that this change is much too small to warrant a change record. Such tiny implementation details change with bug fixes all the time.

    Thanks again for your continued work on this!

  • Open in Jenkins → Open on Drupal.org →
    Core: 10.0.11 + Environment: PHP 8.2 & sqlite-3.34
    last update 11 months ago
    508 pass, 3 fail
  • Open in Jenkins → Open on Drupal.org →
    Core: 10.0.11 + Environment: PHP 8.2 & sqlite-3.34
    last update 11 months ago
    545 pass
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Could someone confirm the latest patch worked for them and set this RTBC? Would like to get this committed and create a new release ASAP, so as few people as possible will run into this.

  • 🇳🇱Netherlands RicardoPeters

    @drunken monkey I'm not sure but since 1.30 should cover D9.3 (if I read it correctly) shouldn't the code be compliant with php => 7.4. Not really sure how this should be handled with EOL php versions. We ran into an error applying the patch on a php 7.4 server hence the question. I can supply the patch with the fix, but not sure if that would be right?

  • 🇦🇹Austria drunken monkey Vienna, Austria

    @drunken monkey I'm not sure but since 1.30 should cover D9.3 (if I read it correctly) shouldn't the code be compliant with php => 7.4. Not really sure how this should be handled with EOL php versions. We ran into an error applying the patch on a php 7.4 server hence the question. I can supply the patch with the fix, but not sure if that would be right?

    Hm, yes, that’s food for thought, I guess. Both Drupal 9 and PHP 7 are now EOL, so we definitely don’t need to support them anymore. However, version 1.30 was supposed to be the last one supporting the two, so it would make sense to have a quick “patch release” with fixes for regressions support them as well. Otherwise, people still on Drupal 9 and/or PHP 7 would have no option for resolving those regressions except downgrading this module to 1.29 again.
    Putting this thought into practice, though, would be rather tricky since I already merged 📌 Drop support for Drupal 9 Fixed . Therefore, the current dev version of this module is already incompatible with Drupal 9 and PHP 7. Having release 1.31 still support them would require some Git/release acrobatics that I’m not sure I want to attempt. Especially since it could mean problems for other users who do keep their installations up to date.

    In conclusion, I don’t think I want to attempt to support outdated Drupal and PHP versions with release 1.31, as I’ve already gone ahead and dropped that support. If you cannot update Drupal and/or PHP for some reason (and you certainly should!), please downgrade this module to at least the commit before 🐛 PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor Fixed .

  • Status changed to RTBC 10 months ago
  • 🇳🇱Netherlands RicardoPeters

    Therefore, the current dev version of this module is already incompatible with Drupal 9 and PHP 7. Having release 1.31 still support them would require some Git/release acrobatics that I’m not sure I want to attempt. Especially since it could mean problems for other users who do keep their installations up to date.

    I certainly agree that is definately not worth the effort, the downgrade to 1.29 shouldn't be that hard if you run into this and you don't have the ability to upgrade Drupal/PHP.

    In conclusion, I don’t think I want to attempt to support outdated Drupal and PHP versions with release 1.31, as I’ve already gone ahead and dropped that support. If you cannot update Drupal and/or PHP for some reason (and you certainly should!), please downgrade this module to at least the commit before #3347610: PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor.

    Should we put this somewhere in the changelog or leave it with this issue?

    I ran the patch above on an 8.1 instance with the same issue and found the performance issues had resolved.

    So in this case RTBC, thanks for your time and effort!

  • If the support for D9 is dropped. The module page should also be updated. At the moment it still shows support for ^9.3 || ^10.0.

    When it comes to the issue, testing shows improved performance. RTBC +1.

  • Status changed to Fixed 10 months ago
  • 🇦🇹Austria drunken monkey Vienna, Austria

    Good to hear, thanks for reporting back!
    Merged. Thanks again!

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024