Acquia Search can lock threads excessively leading to denial of service on cache clear

Created on 6 August 2024, 4 months ago
Updated 20 September 2024, 2 months ago

Problem/Motivation

This is a public follow-up for https://security.drupal.org/node/180370 following the 2 week waiting period for both maintainers and security team members to change the issue status away from "Needs public followup".

Acquia Search has a potential to become a contributor to temporary denial of service conditions on cache clear due to excessive locking, which can lead to exhaustion of PHP processes. On cold caches, multiple blocking HTTP requests are made from the Drupal application to Acquia hosted services, each of which have nondeterministic timing in their responses. Acquia Search in particular has the potential to lock a PHP process for up to 90 seconds.

In \Drupal\acquia_search\AcquiaSearchApiClient::getSearchIndexes we see the following code:

    $lock = 'acquia_search_get_search_indexes';
    if (!$this->lock->acquire($lock)) {
      $this->lock->wait($lock);
      // Throw an exception after X amount of seconds.
      if (!$this->lock->acquire($lock)) {
        throw new \Exception("Couldn't acquire lock for $lock in less than 30 seconds.");
      }
    }

Picking this apart, the first call to $this->lock->acquire($lock)) has the potential to block the process for up to 30 seconds. Next, $this->lock->wait($lock) has the potential to block the process for an additional 30 seconds. Finally, the second call to $this->lock->acquire($lock) can block for another 30 seconds. In total, it seems that this code has the latent potential to halt a PHP process for up to 90 seconds.

We have seen this exception thrown recently on deployments right after drush cr steps.

Steps to reproduce

With a large site under heavy load, flush caches. Sometimes, witness that PHP processes are exhausted and that Couldn't acquire lock for acquia_search_get_search_indexes in less than 30 seconds. exceptions appear in the error logs.

Proposed resolution

Greatly reduce lock timing from the default 30 seconds to something more sane. What exactly is the API performance timing of the Acquia services that the HTTP calls are made to? Responses in 1 second? 3 seconds? Surely 30 second responses aren't to be expected from the Acquia services.

- or -

Only lock on one process and on subsequent requests, return immediately, perhaps making a call to \Drupal::messenger() with a user-friendly error message. I.e. "Acquia Search is not available right now, please try again in a few seconds."

Remaining tasks

Discuss, open MR, commit, profit.

User interface changes

Perhaps a new end-user facing error message.

API changes

TBD

Data model changes

None

🐛 Bug report
Status

Fixed

Version

3.1

Component

Code

Created by

🇺🇸United States luke.leber Pennsylvania

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @luke.leber
  • First commit to issue fork.
  • Pipeline finished with Failed
    3 months ago
    Total: 183s
    #273258
  • Pipeline finished with Failed
    3 months ago
    Total: 182s
    #273311
  • Pipeline finished with Failed
    3 months ago
    Total: 190s
    #273317
  • First commit to issue fork.
  • Pipeline finished with Failed
    2 months ago
    #278845
  • Status changed to Needs review 2 months ago
  • 🇮🇳India chandu7929 Pune

    I have verified the attached patch which fixes the issue. Needs review.

    "patches": {
                "drupal/acquia_search": {
                    "Acquia Search can lock threads excessively leading to denial of service on cache clear": "https://www.drupal.org/files/issues/2024-09-10/acquia_serach-3466306.patch"
                }
            }
    
  • Status changed to Needs work 2 months ago
  • 🇺🇸United States luke.leber Pennsylvania

    I'm not sure that any refactoring effort that still utilizes locks to make multiple request processes wait is a real solution to this problem.

    There are two things that can potentially help:

    1. Greatly reduce lock time (a naive approach here could simply cause WSOD conditions for website visitors, so this probably isn't a viable solution.

    2. Gracefully handle scenarios where there is *a single* process waiting for a response from the Acquia infrastructure and degrade the search experience until that process succeeds.

    It's unclear how Acquia manages community contributions to their modules, but these are my recommendations FWIW.

    Setting back to NW. If that's not welcomed in this context, let me know and I won't interfere further.

    Thanks.

  • Merge request !41PoC example of not locking on >1 process. → (Open) created by luke.leber
  • 🇺🇸United States luke.leber Pennsylvania

    See #8 for an example of what's meant by the comments in #7.

  • Pipeline finished with Failed
    2 months ago
    Total: 313s
    #279646
  • Pipeline finished with Failed
    2 months ago
    Total: 472s
    #279677
  • Pipeline finished with Failed
    2 months ago
    Total: 269s
    #280509
  • 🇺🇸United States luke.leber Pennsylvania

    You're 100% correct in that the critical section needs to be wrapped in a try block for safety here, @benjifisher. Thanks for catching that!

  • Pipeline finished with Failed
    2 months ago
    Total: 482s
    #282725
  • 🇺🇸United States luke.leber Pennsylvania

    Our TAM suggested that the engineers on this issue might want to connect in order to iron out easier steps to reproduce the issue, so that a potential fix can be confirmed.

    In order to test this issue, make sure the following preconditions are met:

    1. You will need four or five different authenticated users in anonymous sessions (attempting to test this with a single user will be thwarted by PHP session locking!). Incognito tabs can work for this.
    2. A patch to mock a slow response from Acquia's search service (follows).
      diff --git a/src/AcquiaSearchApiClient.php b/src/AcquiaSearchApiClient.php
      index 4bb268e..bc34604 100644
      --- a/src/AcquiaSearchApiClient.php
      +++ b/src/AcquiaSearchApiClient.php
      @@ -211,6 +211,19 @@ class AcquiaSearchApiClient {
          *   Response array or FALSE.
          */
         private function searchRequest(string $path, string $query_string) {
      +
      +    // !!! DO NOT COMMIT THIS !!!
      +    // Simulate a slow response from Acquia's search service.
      +    //
      +    // This is an exaggerated time, but makes it easier to see the underlying
      +    // problem.  This simulates a blocking HTTP call that is hanging.
      +    sleep(30);
      +
      +    // We don't actually want to call the real search service, since the goal
      +    // of what we're doing is to simulate a worst case scenario without any
      +    // real world conditions to contend with.
      +    return FALSE;
      +
           $subscription_data = $this->subscription->getSubscription();
           // Return no results if there is no subscription data.
           if (!$subscription_data || !$this->subscription->isActive()) {
          

    Once signed into all of the various user accounts and the simulated enabling criteria patch is in place...

    1. Flush the drupal cache
    2. Visit /admin/config/search/search-api/server/acquia_search_server on one account. This will be the process that obtains the lock. It will hang for 30 seconds before releasing its lock and failing gracefully.
    3. At the same time that the first request is executing, fire off an additional request from each different incognito window to /admin/config/search/search-api/server/acquia_search_server. This simulates a backlog of requests that would ordinarily be generated by users on a production site.
    4. Note that *all* of the processes will be blocked until either they either...

      1. Obtain the lock, spins for an additional 30 seconds, and fails gracefully
      2. - or - fails to acquire the lock and fails with an exception

      In either scenario, the processes will just hang for excessive amounts of time. On a real production site, there are a finite number of php processes available. Eventually resources will be exhausted and end-users will be affected.

    If more information, or a live demonstration is desired, please reach out to me and we can set up a call.

    Thanks.

  • First commit to issue fork.
  • Status changed to Needs review 2 months ago
  • 🇺🇸United States japerry KVUO

    I also saw the logic issue with the FALSE cache set and retrieving the cache. I also noticed when using the snippet above (returning false on the search request) that the 'getAvailableCores' method gets called multiple times, meaning the API query to get search indexes was getting called multiple times. (which is probably where the lock issue is coming from)

    I've made a few changes to PR41, but would love a review before it gets committed!

  • 🇺🇸United States luke.leber Pennsylvania

    Hey, Jakob.

    I'll take a look and see if we can apply the patch for tomorrow's deployment and gather some feedback from the ones generally affected by slowdowns on deployments.

    I'm very confident that this will make a world of difference for our staff especially.

    Thanks!

  • Pipeline finished with Failed
    2 months ago
    Total: 179s
    #286593
  • Status changed to Needs work 2 months ago
  • 🇺🇸United States luke.leber Pennsylvania
    I'm pleased to report having patched !41 into prod and did not receive any locking exceptions on deploy this morning.

    Setting back to NW for test fails, but !41 seems to do the trick, functionally.

  • 🇺🇸United States japerry KVUO

    japerry changed the visibility of the branch 3466306-couldnt-acquire-lock-fix to hidden.

  • Status changed to RTBC 2 months ago
  • 🇺🇸United States luke.leber Pennsylvania

    That's a clever use of drupal_static.

    Everything looks good to me here. Our team reviewed the functional bits yesterday fairly comprehensively in a code review.

    Thanks!

  • Pipeline finished with Skipped
    2 months ago
    #287745
  • Pipeline finished with Skipped
    2 months ago
    #287747
  • Pipeline finished with Skipped
    2 months ago
    #287748
    • japerry committed d0fbe51e on 3.1.x
      Issue #3466306 by luke.leber, japerry: Acquia Search can lock threads...
  • 🇺🇸United States japerry KVUO

    Perfect! yah tests look good (both manual and auto) Committed.

  • Status changed to Fixed 2 months ago
  • 🇺🇸United States japerry KVUO

    Note, I'll have to make some changes to get a new fix for 3.2... which should be out as a full release soon.

  • 🇺🇸United States luke.leber Pennsylvania

    Issue credit should be given to Benji Fisher too, as he performed a critical review.

    • japerry committed a9b4d525 on 3.2.x
      Issue #3466306 by luke.leber, japerry: Acquia Search can lock threads...
  • 🇺🇸United States benjifisher Boston area
  • 🇺🇸United States japerry KVUO

    Oh he didn't show up on the issue queue but I see it in the tracker. Done! Also, @luke.leber, if you get a chance, can you test against Acquia Search 3.2.0-rc1? Automated tests are passing, but we don't have great insight into the various configurations customers have. Once thats done, we can make 3.2.0 and not have to maintain two branches.

  • 🇺🇸United States luke.leber Pennsylvania

    I can ask our search specialist to take a look at the changes and give things a test drive. We have quite a bit of other stuff going on, but there may be down extra hours.

    Can we get a "cut a 3.2.0 issue" spun up so he can report the feedback on it?

    Thanks

  • 🇺🇸United States japerry KVUO

    Done! 🌱 Acquia Search 3.2.0 release Active

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024