- Issue created by @luke.leber
- First commit to issue fork.
- First commit to issue fork.
- Status changed to Needs review
3 months ago 5:43am 10 September 2024 - 🇮🇳India chandu7929 Pune
I have verified the attached patch which fixes the issue. Needs review.
"patches": { "drupal/acquia_search": { "Acquia Search can lock threads excessively leading to denial of service on cache clear": "https://www.drupal.org/files/issues/2024-09-10/acquia_serach-3466306.patch" } }
- Status changed to Needs work
3 months ago 9:54pm 10 September 2024 - 🇺🇸United States luke.leber Pennsylvania
I'm not sure that any refactoring effort that still utilizes locks to make multiple request processes wait is a real solution to this problem.
There are two things that can potentially help:
1. Greatly reduce lock time (a naive approach here could simply cause WSOD conditions for website visitors, so this probably isn't a viable solution.
2. Gracefully handle scenarios where there is *a single* process waiting for a response from the Acquia infrastructure and degrade the search experience until that process succeeds.
It's unclear how Acquia manages community contributions to their modules, but these are my recommendations FWIW.
Setting back to NW. If that's not welcomed in this context, let me know and I won't interfere further.
Thanks.
- 🇺🇸United States luke.leber Pennsylvania
See #8 for an example of what's meant by the comments in #7.
- 🇺🇸United States luke.leber Pennsylvania
You're 100% correct in that the critical section needs to be wrapped in a
try
block for safety here, @benjifisher. Thanks for catching that! - 🇺🇸United States luke.leber Pennsylvania
Our TAM suggested that the engineers on this issue might want to connect in order to iron out easier steps to reproduce the issue, so that a potential fix can be confirmed.
In order to test this issue, make sure the following preconditions are met:
- You will need four or five different authenticated users in anonymous sessions (attempting to test this with a single user will be thwarted by PHP session locking!). Incognito tabs can work for this.
- A patch to mock a slow response from Acquia's search service (follows).
diff --git a/src/AcquiaSearchApiClient.php b/src/AcquiaSearchApiClient.php index 4bb268e..bc34604 100644 --- a/src/AcquiaSearchApiClient.php +++ b/src/AcquiaSearchApiClient.php @@ -211,6 +211,19 @@ class AcquiaSearchApiClient { * Response array or FALSE. */ private function searchRequest(string $path, string $query_string) { + + // !!! DO NOT COMMIT THIS !!! + // Simulate a slow response from Acquia's search service. + // + // This is an exaggerated time, but makes it easier to see the underlying + // problem. This simulates a blocking HTTP call that is hanging. + sleep(30); + + // We don't actually want to call the real search service, since the goal + // of what we're doing is to simulate a worst case scenario without any + // real world conditions to contend with. + return FALSE; + $subscription_data = $this->subscription->getSubscription(); // Return no results if there is no subscription data. if (!$subscription_data || !$this->subscription->isActive()) {
Once signed into all of the various user accounts and the simulated enabling criteria patch is in place...
- Flush the drupal cache
- Visit /admin/config/search/search-api/server/acquia_search_server on one account. This will be the process that obtains the lock. It will hang for 30 seconds before releasing its lock and failing gracefully.
- At the same time that the first request is executing, fire off an additional request from each different incognito window to /admin/config/search/search-api/server/acquia_search_server. This simulates a backlog of requests that would ordinarily be generated by users on a production site.
-
Note that *all* of the processes will be blocked until either they either...
- Obtain the lock, spins for an additional 30 seconds, and fails gracefully
- - or - fails to acquire the lock and fails with an exception
In either scenario, the processes will just hang for excessive amounts of time. On a real production site, there are a finite number of php processes available. Eventually resources will be exhausted and end-users will be affected.
If more information, or a live demonstration is desired, please reach out to me and we can set up a call.
Thanks.
- First commit to issue fork.
- Status changed to Needs review
3 months ago 4:11am 18 September 2024 - 🇺🇸United States japerry KVUO
I also saw the logic issue with the FALSE cache set and retrieving the cache. I also noticed when using the snippet above (returning false on the search request) that the 'getAvailableCores' method gets called multiple times, meaning the API query to get search indexes was getting called multiple times. (which is probably where the lock issue is coming from)
I've made a few changes to PR41, but would love a review before it gets committed!
- 🇺🇸United States luke.leber Pennsylvania
Hey, Jakob.
I'll take a look and see if we can apply the patch for tomorrow's deployment and gather some feedback from the ones generally affected by slowdowns on deployments.
I'm very confident that this will make a world of difference for our staff especially.
Thanks!
- Status changed to Needs work
3 months ago 3:01pm 19 September 2024 - 🇺🇸United States luke.leber PennsylvaniaI'm pleased to report having patched !41 into prod and did not receive any locking exceptions on deploy this morning.
Setting back to NW for test fails, but !41 seems to do the trick, functionally.
- Status changed to RTBC
3 months ago 9:41pm 19 September 2024 - 🇺🇸United States luke.leber Pennsylvania
That's a clever use of drupal_static.
Everything looks good to me here. Our team reviewed the functional bits yesterday fairly comprehensively in a code review.
Thanks!
- 🇺🇸United States japerry KVUO
Perfect! yah tests look good (both manual and auto) Committed.
- Status changed to Fixed
3 months ago 11:36pm 19 September 2024 - 🇺🇸United States japerry KVUO
Note, I'll have to make some changes to get a new fix for 3.2... which should be out as a full release soon.
- 🇺🇸United States luke.leber Pennsylvania
Issue credit should be given to Benji Fisher too, as he performed a critical review.
- 🇺🇸United States japerry KVUO
Oh he didn't show up on the issue queue but I see it in the tracker. Done! Also, @luke.leber, if you get a chance, can you test against Acquia Search 3.2.0-rc1? Automated tests are passing, but we don't have great insight into the various configurations customers have. Once thats done, we can make 3.2.0 and not have to maintain two branches.
- 🇺🇸United States luke.leber Pennsylvania
I can ask our search specialist to take a look at the changes and give things a test drive. We have quite a bit of other stuff going on, but there may be down extra hours.
Can we get a "cut a 3.2.0 issue" spun up so he can report the feedback on it?
Thanks
Automatically closed - issue fixed for 2 weeks with no activity.