- Issue created by @benjifisher
- 🇺🇸United States benjifisher Boston area
According to XHProf, 99.9% of the CPU time comes from
mb_strpos()
andmb_substr()
. Both are called hundreds of thousands of times fromHtmlFilter::handleAttributes()
, which is called 24 times. I think this confirms that #3347610 is responsible.My site also uses
search_api_attachments
, and there are several files being indexed. It seems that those are all successfully indexed, so the problem is with some of the regular text fields, but I may be misreading it.The text is not very long:
MariaDB [db]> SELECT LENGTH(field_body_value) AS strlen FROM node__field_body ORDER BY strlen DESC LIMIT 0,5; +--------+ | strlen | +--------+ | 782574 | | 726969 | | 678612 | | 638747 | | 598372 | +--------+ 5 rows in set (0.001 sec)
- Status changed to Postponed: needs info
over 1 year ago 10:04am 24 September 2023 - 🇦🇹Austria drunken monkey Vienna, Austria
Oh, that seems bad. Thanks a lot for reporting this!
782 kB of text does seem like a lot, but such a degradation of indexing performance is of course still not acceptable. Especially just for the sake of fixing such a small problem.
First, could you please confirm for which the “HTML filter” processor is enabled? Is it really only those that contain HTML?
Then, when removing all fulltext fields from the index except for “Body”, does the problem still occur? And does it occur with all items (with large enough body fields), or just for specific ones (maybe with faulty or particularly complex HTML)?It seems tricky to try and optimize our current processing code, as I don’t see any obvious flaws. The best solution might be to add an option to switch between our previous and our current implementation – hardly ideal, though, I admit. (Who is going to notice that when they see that indexing takes ages?)
What you can personally do to work around this is disable indexing of the “alt” and “title” attributes. That should circumvent the whole problematic method, I think.
- Status changed to Active
about 1 year ago 7:11pm 24 October 2023 - 🇺🇸United States benjifisher Boston area
@drunken monkey:
Thanks for the reply. I am sorry I did not reply earlier.
There is a lot going on (custom code and other) with my current site, and we may have been applying the HTML filter to non-HTML fields. So I tested the problem with a fresh install of Drupal and a single node. I just added steps to reproduce in the issue summary. I hope I did not leave anything out.
For the record:
$ ddev drush status Drupal version : 10.1.5 Site URI : https://drupal.ddev.site DB driver : mysql DB hostname : db DB port : 3306 DB username : db DB name : db Database : Connected Drupal bootstrap : Successful Default theme : olivero Admin theme : claro PHP binary : /usr/bin/php8.1 PHP config : /etc/php/8.1/cli/php.ini PHP OS : Linux PHP version : 8.1.23 Drush script : /var/www/html/vendor/bin/drush Drush version : 12.4.0.0 Drush temp : /tmp Drush configs : /var/www/html/vendor/drush/drush/drush.yml Install profile : standard Drupal root : /var/www/html Site path : sites/default Files, Public : sites/default/files Files, Private : /var/www/html/sites/default/files/private Files, Temp : /tmp
Following the steps I provided, I get
$ ddev drush search-api:clear && time ddev drush search-api:index [success] test was successfully cleared. [success] Found 1 items to index for test. Indexing all items. [success] Indexing a maximum number of 1 items (50 items per batch run) for the index 'test'. > [notice] Successfully indexed 1 item on test. > [notice] Message: Successfully indexed 1 item. > real 1m6.977s user 0m0.080s sys 0m0.000s
It takes less than a second if I disable both "Index title attribute" and "Index alt attribute" in the settings for the HTML filter. With either one enabled, I get about the same (more than a minute).
If I open another terminal window and run
top
in the container, then it shows that PHP is using 100% of a CPU.If I reverse the patch from 🐛 PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor Fixed , then again it takes less than a second.
- 🇺🇸United States benjifisher Boston area
In Comment #2, I wrote,
According to XHProf, 99.9% of the CPU time comes from mb_strpos() and mb_substr(). Both are called hundreds of thousands of times from HtmlFilter::handleAttributes(), which is called 24 times. I think this confirms that #3347610 is responsible.
There are over 4,000 tags in my sample file. If
mb_strpos()
andmb_substr()
are each called at least twice for each tag, and I have a few dozen files, that explains the "hundreds of thousands". I think my content just has too many tags for this code to handle.From #3:
What you can personally do to work around this is disable indexing of the “alt” and “title” attributes. That should circumvent the whole problematic method, I think.
I may do that for now. I do not think I will miss
title
attributes, but I would like to be able to index alt text from images.I wonder if parsing the HTML would have better performance. You can get a
\DOMDocument
object withDrupal\Component\Utility\Html::load($text)
. - 🇷🇪Réunion Martygraphie Saint-Denis (Réunion)
Hi @benjifisher,
Thank you very much for your analysis, I've encountered the same problem as you.
Deactivating ALT / TITLE attributes solved the performance problems on my side.
Marc - 🇺🇸United States benjifisher Boston area
Let' add the work-around to the issue summary.
- 🇫🇷France Chris64 France
And what about to come back to the original in 🐛 PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor Fixed and try an other change to solve the null parameter problem? Like,
- $text = preg_replace('/<[-a-z_]+[^>]*["\s]alt\s*=\s*("([^"]+)"|\'([^\']+)\')[^>]*>/i', ' <img>$2$3</img> ', $text); + $text = preg_replace('/<[-a-z_]+[^>]*["\s]alt\s*=\s*("([^"]*)"|\'([^\']*)\')[^>]*>/i', ' <img>$2??$3</img> ', $text);
- Status changed to Needs review
about 1 year ago 11:18am 29 October 2023 - last update
about 1 year ago 515 pass, 3 fail - last update
about 1 year ago 516 pass, 1 fail - 🇦🇹Austria drunken monkey Vienna, Austria
Thanks for the detailed information, @benjifisher, that was very helpful! So was your suggestion of using
\Drupal\Component\Utility\Html::load()
, that indeed seems to be the best solution.
Patch attached, even including a regression test (our first for a performance issue), please test/review! The last submitted patch, 9: 3388678-9--fix_html_filter_performance--tests_only.patch, failed testing. View results →
The last submitted patch, 9: 3388678-9--fix_html_filter_performance.patch, failed testing. View results →
- Status changed to Needs work
about 1 year ago 12:08pm 31 October 2023 - 🇭🇺Hungary balazswmann
This problem affects me as well. Thank you @benjifisher very much for the investigation and root cause analysis. I run into this in the past could of days after a `8.x-1.27` => `8.x-1.30` update. I also realized that indexing got very very slow. In my case it takes more than 1 minute to index just 50 items but previously (with `8.x-1.27`) I didn't have problems.
I also use the HTML filter with the default settings and I need to index documents including around a couple of hundreds to several thousands of HTML tags (per document!).
I applied the patch from #9: let a user specify if her blog entries should be votable → ( https://www.drupal.org/project/search_api/issues/3388678#comment-15294371 🐛 Performance issues after fixing 'PHP 8.1 preg_replace()' issue Needs review ) on `8.x-1.30` and it solved the problem for me. Now indexing is fast enough again. Thank you @drunken monkey!
- 🇺🇸United States benjifisher Boston area
@drunken monkey:
I will try applying the patch in my current project, although Comments #8 and #12 are already positive.
The code looks right to me, In fact, I often say that using DOMDocument is simpler and more reliable than "simpler" methods of processing HTML, and this is a great example of that. This is the first example I have seen where it is also more performant.
I checked the docs for
Html::load()
andHtml::serialize()
, and the underlyingDOMDocument
methodsloadHTML()
andsaveHTML()
. None of them mention any exceptions, so I do not think you need any error handling. In fact,Html::load()
explicitly suppresses warnings by calling@$dom->loadHTML(...)
.I am not sure what is going on with the failing test. The
foreach
loop should be empty, so the code should just convert toDOMDocument
and back tostring
. It is not a no-op: it should "fix" invalid HTML, and a little test shows that it strips the<
character, turninga < b
intoa b
(two spaces). But the test results show justa
. Maybe different versions of PHP, or the parsing library behind all this, account for the difference: a different version might "fix" the invalid HTML toa <b>
and then "fix" that toa
.I do not see any test coverage for processing both
alt
andtitle
. I think that$xpath->query()
will do the right thing and list<img src="..." alt="some text" title="other text">
only once, but I would like to verify that with a test. - 🇺🇸United States benjifisher Boston area
As I said in the previous comment,
Html::load()
followed byHtml::serialize()
is not a no-op. It "normalizes" the HTML. That means the current patch will change what gets sent to be indexed. I do not think I would notice the difference, but in some use cases it might have an effect. I think we should add a change record and mention it in the release notes, so I am adding the tag for a CR. - 🇧🇪Belgium herved
I tested patch #9 on 2 projects locally and I can also see a nice performance gain.
Note that I don't have the actual files (pdf, docs, etc) locally so the impact on prod must be much more significant.
Command:drush cr && drush sapi-c && time drush sapi-i
project 1:
- without patch: 5m7,417s
- with patch: 3m31,457sproject 2:
- without patch: 5m24,580s
- with patch: 1m46,645sAlso, switching to DOM parsing instead of regex makes total sense.
I'm not too worried about #14 but it's a good point.
So, +1 thanks! - Status changed to Needs review
about 1 year ago 3:12pm 11 November 2023 - Open on Drupal.org →Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27last update
about 1 year ago Waiting for branch to pass - Open on Drupal.org →Core: 9.5.x + Environment: PHP 8.1 & sqlite-3.27last update
about 1 year ago Waiting for branch to pass - 🇦🇹Austria drunken monkey Vienna, Austria
I do not see any test coverage for processing both
alt
andtitle
. I think that$xpath->query()
will do the right thing and list<img src="..." alt="some text" title="other text">
only once, but I would like to verify that with a test.Good point, thanks! Included that in the attached patch.
I also removed the
a < b
test, as that is apparently not reliable anymore. (It passed fine for me locally.)Regarding the change notice: I really think that this change is much too small to warrant a change record. Such tiny implementation details change with bug fixes all the time.
Thanks again for your continued work on this!
- last update
about 1 year ago 508 pass, 3 fail - last update
about 1 year ago 545 pass The last submitted patch, 16: 3388678-16--fix_html_filter_performance--tests_only.patch, failed testing. View results →
- 🇦🇹Austria drunken monkey Vienna, Austria
Could someone confirm the latest patch worked for them and set this RTBC? Would like to get this committed and create a new release ASAP, so as few people as possible will run into this.
- 🇳🇱Netherlands ricardopeters
@drunken monkey I'm not sure but since 1.30 should cover D9.3 (if I read it correctly) shouldn't the code be compliant with php => 7.4. Not really sure how this should be handled with EOL php versions. We ran into an error applying the patch on a php 7.4 server hence the question. I can supply the patch with the fix, but not sure if that would be right?
- 🇦🇹Austria drunken monkey Vienna, Austria
@drunken monkey I'm not sure but since 1.30 should cover D9.3 (if I read it correctly) shouldn't the code be compliant with php => 7.4. Not really sure how this should be handled with EOL php versions. We ran into an error applying the patch on a php 7.4 server hence the question. I can supply the patch with the fix, but not sure if that would be right?
Hm, yes, that’s food for thought, I guess. Both Drupal 9 and PHP 7 are now EOL, so we definitely don’t need to support them anymore. However, version 1.30 was supposed to be the last one supporting the two, so it would make sense to have a quick “patch release” with fixes for regressions support them as well. Otherwise, people still on Drupal 9 and/or PHP 7 would have no option for resolving those regressions except downgrading this module to 1.29 again.
Putting this thought into practice, though, would be rather tricky since I already merged 📌 Drop support for Drupal 9 Fixed . Therefore, the current dev version of this module is already incompatible with Drupal 9 and PHP 7. Having release 1.31 still support them would require some Git/release acrobatics that I’m not sure I want to attempt. Especially since it could mean problems for other users who do keep their installations up to date.In conclusion, I don’t think I want to attempt to support outdated Drupal and PHP versions with release 1.31, as I’ve already gone ahead and dropped that support. If you cannot update Drupal and/or PHP for some reason (and you certainly should!), please downgrade this module to at least the commit before 🐛 PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor Fixed .
- Status changed to RTBC
about 1 year ago 1:52pm 23 November 2023 - 🇳🇱Netherlands ricardopeters
Therefore, the current dev version of this module is already incompatible with Drupal 9 and PHP 7. Having release 1.31 still support them would require some Git/release acrobatics that I’m not sure I want to attempt. Especially since it could mean problems for other users who do keep their installations up to date.
I certainly agree that is definately not worth the effort, the downgrade to 1.29 shouldn't be that hard if you run into this and you don't have the ability to upgrade Drupal/PHP.
In conclusion, I don’t think I want to attempt to support outdated Drupal and PHP versions with release 1.31, as I’ve already gone ahead and dropped that support. If you cannot update Drupal and/or PHP for some reason (and you certainly should!), please downgrade this module to at least the commit before #3347610: PHP 8.1 preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated HtmlFilter processor.
Should we put this somewhere in the changelog or leave it with this issue?
I ran the patch above on an 8.1 instance with the same issue and found the performance issues had resolved.
So in this case RTBC, thanks for your time and effort!
If the support for D9 is dropped. The module page should also be updated. At the moment it still shows support for ^9.3 || ^10.0.
When it comes to the issue, testing shows improved performance. RTBC +1.
-
drunken monkey →
committed e970f06f on 8.x-1.x
Issue #3388678 by drunken monkey, benjifisher: Fixed performance...
-
drunken monkey →
committed e970f06f on 8.x-1.x
- Status changed to Fixed
about 1 year ago 3:33pm 25 November 2023 - 🇦🇹Austria drunken monkey Vienna, Austria
Good to hear, thanks for reporting back!
Merged. Thanks again! Automatically closed - issue fixed for 2 weeks with no activity.