Fix selenium performance/stampede issues in gitlab config and BrowserTestBase

Created on 23 July 2024, about 2 months ago
Updated 20 August 2024, 18 days ago

Problem/Motivation

📌 Use selenium/standalone-chrome instead of our chromedriver image Needs work swaps Drupal's tests to use selenium provided images for browser testing. In order to maintain test performance we had to increase the number of parallel runners from 2 to 4.

Notes copied from #3465602-6: Order tests by number of public methods to optimize gitlab job times
There's a race condition/contention issue in HEAD functional js tests with selenium sessions now.

The default SE_SESSION_RETRY_INTERVAL is 5 seconds.

BrowserTestBase installs Drupal before asking for a session.

So for example if there are 16 tests running concurrently, the first test takes 1 minute to install (Umami) per method, and the next 15 tests take 10 seconds to install (testing profile)

Those 15 tests will request a browser session before the first one. Some of our functional js tests finish in about 7-10 seconds, so you can have a lot of tests starting and finishing within the 5s retry.

Even if the first request for a session is correctly handled FIFO by the selenium session queue, each test method closes the session then requests a new one, which means a test with several slow-ish methods can perpetually be put at the back of the session queue. i.e. by the time the first 1 minute test has finished, there could be 10 requests in the session queue, then it will have to wait at least 5 seconds but probably more for a session to be available again.

If each method waits even 5 seconds for a session and there are say 6 methods at 30s each, that's more an extra 30s for a three minute test class.

If it's more like 10 or 15 seconds each time, then it can add several minutes, which is what we're seeing sometimes and I was able to force in a handful of pipeline runs on this issue.

This explains the vastly different timings for individual tests between gitlab jobs when contention issues occur - i.e. a test that normally finishes in a minute can suddenly take six minutes - but it's not that the test as such is taking longer but that it's waiting for a selenium session to be available potentially multiple times for several seconds each time.

The default selenium session request timeout is 300 seconds, so we hardly ever see an actual timeout, but I was able to reproduce one here: https://git.drupalcode.org/project/drupal/-/jobs/2334397

The current high parallel rate works around this - because not that many tests are executed per job, eventually selenium sessions get freed up for any test methods still waiting, but it'll still be happening occasionally with four jobs which can be seen for example here: https://git.drupalcode.org/project/drupal/-/jobs/2333840 - it's easier to reproduce with the other changes in this issue because it pushes multiple-method tests to the beginning of the run together.

I've tried to reduce the timeout to 10 seconds, but I'm not sure that's taking effect on the k8s stack yet - it might need the node to cycle out first to actually reset it? At least I've not been able to trigger a 10 second timeout, only the 300s one.

-    SE_NODE_MAX_SESSIONS: "5"
+    SE_NODE_MAX_SESSIONS: "16"
+    SE_SESSION_RETRY_INTERVAL: "1"
+    SE_SESSION_REQUEST_TIMEOUT: "10"

You would think that setting the maximum sessions to exactly the same as concurrency would fix things, but I still saw the same kinds of side effects with that. I can't find anything conclusive, but my guess is that when you tell selenium to close a session, it returns before the session is actually closed - so you can still end up with more concurrent tests running than available browser sessions. There's no particular reason Selenium should make you wait until the browser session is actually closed before the API call returns, but if that is indeed the case we need to account for it.

I started to look at requesting/starting a browser session before we install Drupal so that sessions are requested FIFO, but there's major problems if we were to try to do that. We start and stop the browser every test run, so it doesn't solve the problem of stopping the session the waiting for the next one between each test method.

The only way to reserve a session for the duration of a test class would be to make the browser session static, which is major surgery to browser tests and could introduce state issues between test methods potentially.

Also starting a browser session before installing Drupal is inefficient - we'd have a browser session sitting there while we install and don't need it.

So instead I'm increasing the maximum sessions to approximately double the concurrency, this means there should always be a 'spare' browser session when we start each test method, although this doesn't seem to be 100% foolproof. Because concurrency is much lower than the maximum number of sessions, the actual number of sessions open at any one time shouldn't actually hit the maximum, but it allows slack for the time to actually close a session.

Proposed resolution

TBD

Remaining tasks

Investigate if we can make chnages to the selenium switches / chrome options / environment to improve the performance.

User interface changes

None

API changes

None

Data model changes

None

Release notes snippet

TBD

🐛 Bug report
Status

Fixed

Version

11.0 🔥

Component
PHPUnit 

Last updated about 12 hours ago

Created by

🇬🇧United Kingdom alexpott 🇪🇺🌍

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Production build 0.71.5 2024