Adopt the Revolt event loop for async task orchestration

Created on 16 October 2023, about 1 year ago

Problem/Motivation

My opinions are formed by wrangling PHP into serving subscriptions from a Drupal back-end and from having more than a handful of legitimate needs to bootstrap Drupal in a place where Drupal was not the owner of the process.

I believe strongly that Drupal should adopt the Revolt event-loop.

Why an event loop?

We should adopt an event loop because, as we make more and more of Drupal able to be async (e.g. finishing the database layer; perhaps adding file access; possibly making external HTTP requests for an aggregated JSON API), managing all those interactions becomes increasingly complex. There are tried and tested lower-level PHP extensions that have spent many years solving this problem; which extensions are available will differ between environments, and supporting them all is a challenge. Event loops like Revolt are built to work across these different implementations and automatically select the right one.
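
For illustration, the snippet below simply asks Revolt which driver it selected (a trivial example, not Drupal code); getDriver() is part of Revolt's public API:

    use Revolt\EventLoop;

    // Revolt picks the best available driver the first time the loop is used:
    // ext-ev, ext-event or ext-uv if installed, otherwise a stream_select()
    // based fallback that works everywhere.
    echo get_class(EventLoop::getDriver()), PHP_EOL;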

In our initial implementations we manually implemented a Fiber loop, but managing tasks on such a loop can quickly become cumbersome, and custom loops provide no extension points for contrib. Additionally, mistakes are easy to make when implementing this ourselves. For example, user ReINFaTe on the Drupal Slack pointed out that "all fiber loops should call sleep() if every fiber waits for something. Without sleep, while all code waits, the loop will use 100% CPU just to keep checking all the fibers."
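
To make that failure mode concrete, here is a rough sketch (illustrative only, not actual core code; runFibers() is a made-up helper) of such a manual loop, including the sleep that is easy to forget:

    // Illustrative only, not core code: a hand-rolled loop over a set of
    // Fibers of the kind described above. runFibers() is a made-up helper.
    function runFibers(array $fibers): void {
      foreach ($fibers as $fiber) {
        $fiber->start();
      }
      while ($fibers) {
        foreach ($fibers as $key => $fiber) {
          if ($fiber->isTerminated()) {
            unset($fibers[$key]);
          }
          elseif ($fiber->isSuspended()) {
            // Each resume lets the fiber check whether whatever it is waiting
            // on is ready; if not, it suspends itself again straight away.
            $fiber->resume();
          }
        }
        // The easy-to-miss detail: without this the loop re-checks the fibers
        // at 100% CPU while all of them are simply waiting on I/O.
        usleep(500);
      }
    }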

Why Revolt specifically?

We should specifically adopt the Revolt event loop because, just as the Drupal community works towards common goals, the async PHP community has clearly worked towards a common goal. ReactPHP and AmPHP were the largest options (and the only options if you ignore Swoole, which requires a custom PHP extension) in the ecosystem, and they have joined forces to create a single re-usable event loop. The goal of Revolt is to be an event loop only and to provide only primitives for interacting with it and scheduling work, leaving the creation of higher-level concepts to other libraries. AmPHP has now adopted Revolt as its event loop and Revolt itself provides a ReactPHP adapter by implementing ReactPHP's EventLoop interface.

This means that adopting Revolt as the event loop will immediately allow any ReactPHP or AmPHP code to be used in Drupal projects (even if that's outside of core).
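
As a flavour of the primitives Revolt provides, here is a trivial (non-Drupal) example that schedules two timers on the shared loop:

    use Revolt\EventLoop;

    // Two timers scheduled on the global Revolt loop; both callbacks are
    // managed by whichever driver Revolt selected.
    EventLoop::delay(1.0, fn () => print "one second passed" . PHP_EOL);
    EventLoop::delay(2.0, fn () => print "two seconds passed" . PHP_EOL);

    // Run the loop until no referenced callbacks remain.
    EventLoop::run();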

Revolt's Pedigree

The event loop will play a core part in how Drupal schedules its asynchronous tasks and as such will become an important part of Drupal. This means that it's important to know that Revolt will be around for a long time and that it's solidly built.

Revolt is maintained by Aaron Piotrowski, Niklas Keller, and Saif Eddin Gmati. Niklas and Aaron are also maintainers of AmPHP (together with Bob Weinand) and are the authors of the Fibers RFC, so I'd argue they have some knowledge of how to use Fibers.

Async Primitives

Whether Drupal should also adopt an "async tools" library on top of it is a discussion for a separate issue. The most important accelerator for asynchronous tasks in Drupal is the adoption of the Revolt event loop, because it suddenly provides an entire async ecosystem for contrib to use.

Drupal's optional tasks on an event loop

There are two important building blocks that Revolt offers here:

  1. Cancellation of callbacks
  2. Referenced/unreferenced callbacks

1) We could build some primitives in Drupal that say "Attach this optional task to this request" and, at the end of the request, cancel all those tasks that are optional.

2) However, when the event loop is used within a process like PHP-FPM serving a single request, there is an easier way. Revolt allows callbacks to be either "referenced" or "unreferenced". A referenced callback will keep the event loop alive (this is the default). An unreferenced callback will not keep the event loop alive and will allow the process to exit once the referenced work is done. This means that those background tasks can be registered as unreferenced callbacks (sketched below). Then, once the things needed for the request are done and the clean-up tasks have completed, the event loop will shut down, throwing away those optional tasks that were not yet completed.
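
A rough sketch of option 2 using Revolt's actual API (the closures here are placeholders, not a proposed Drupal API):

    use Revolt\EventLoop;

    // Work the response depends on: a normal (referenced) callback keeps the
    // event loop alive until it has run.
    EventLoop::queue(static function (): void {
      // ... work that must finish before the request is done.
    });

    // An optional background task, e.g. prewarming a cache. Unreferencing it
    // means the loop will not stay alive just for this callback.
    $optionalId = EventLoop::delay(5.0, static function (): void {
      // ... best-effort work.
    });
    EventLoop::unreference($optionalId);

    // Returns as soon as all referenced work has completed; the unreferenced
    // delay is simply thrown away if it hasn't fired yet.
    EventLoop::run();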

Proposed resolution

Adopt the Revolt event loop. This should happen in index.php, outside of what would be considered Drupal's Runtime ( ✨ Use symfony/runtime for less bespoke bootstrap/compatibility with varied runtime environments Active ). This ensures that applications that have different lifecycles (e.g. Drush) can control the starting and stopping of the event loop themselves and decide when Drupal might need to be bootstrapped as part of a longer running process.
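
A minimal sketch of what that could look like in index.php, assuming today's DrupalKernel front controller flow stays otherwise unchanged; wrapping it in EventLoop::queue()/EventLoop::run() is the new part and only one possible shape:

    use Drupal\Core\DrupalKernel;
    use Revolt\EventLoop;
    use Symfony\Component\HttpFoundation\Request;

    $autoloader = require_once 'autoload.php';

    // Queue the regular request handling on the event loop instead of
    // calling it directly.
    EventLoop::queue(static function () use ($autoloader): void {
      $request = Request::createFromGlobals();
      $kernel = DrupalKernel::createFromRequest($request, $autoloader, 'prod');
      $response = $kernel->handle($request);
      $response->send();
      $kernel->terminate($request, $response);
    });

    // Drive the loop until all referenced callbacks have completed. Other
    // runtimes (e.g. Drush) would call EventLoop::run() themselves.
    EventLoop::run();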

Remaining tasks

  • Get buy-in from the core committers/maintainers
  • Should this be considered a new subsystem?
  • Update index.php
  • Update the already introduced manual Fiber loops and rewrite them to use the new Revolt loop

Useful reading

Since Fibers and async programming may be new to some, below is a list of useful reading:

User interface changes

API changes

Data model changes

Release notes snippet

🌱 Plan
Status

Active

Version

11.0 🔥

Component
Base →

Last updated about 3 hours ago

Created by

🇳🇱 Netherlands kingdutch


Comments & Activities

  • Issue created by @kingdutch
  • 🇳🇱 Netherlands kingdutch
  • 🇬🇧 United Kingdom catch

    We should add a dependency evaluation to the issue summary, it's almost there already; just the release cycle and security policy, I think: https://www.drupal.org/about/core/policies/core-dependency-policies/depe... →

    I still have not fully grasped the benefit of the central event loop vs. for example adding a trait to cover the current raw Fibers implementation (which could then handle sleeping, suspending to any parent Fiber in one place etc. centrally, but would still keep the management of each loop local). For me at least, it would be useful to see a conversion of the manual Fibers loops in core, we also have some examples of suspension in the cache prewarming/stampede issue (if not actual async anywhere yet).

  • 🇳🇱 Netherlands kingdutch

    I've filled out the dependency evaluation.

    I still have not fully grasped the benefit of the central event loop vs. for example adding a trait to cover the current raw Fibers implementation (which could then handle sleeping, suspending to any parent Fiber in one place etc. centrally, but would still keep the management of each loop local). For me at least, it would be useful to see a conversion of the manual Fibers loops in core, we also have some examples of suspension in the cache prewarming/stampede issue (if not actual async anywhere yet).

    I emailed Aaron with this question and he replied with

    Drupal absolutely should use the Revolt event loop. The entire reason Revolt exists is to avoid fragmentation of the event loop component among PHP libraries which want to run asynchronous tasks. The event loop essentially becomes a part of the runtime – you cannot mix multiple event loops in the same application because only one can be running at a time. We talk a bit more about this at https://revolt.run/fundamentals.

    Using a proprietary loop which schedules fibers would make Drupal incompatible with any library using a different fiber scheduler – i.e. any library using Revolt.

    AMPHP might also be useful for some of the primitives it provides, such as Futures and Cancellations, as well as some of the lower-level helper libraries like amphp/pipeline. Note though using AMPHP would be completely optional – Drupal could implement its own promises/futures, etc. and still be compatible with AMPHP so long as it was using Revolt to schedule events.

    Revolt is flexible and un-opinionated, making it easy to create new fibers, use timers, and wait for I/O. Check out the docs at https://revolt.run and let me know if I can provide any additional examples or assistance.

    The relevant part of that fundamentals document (in case it changes and someone is reading this in 2025 👋):

    Every application making use of cooperative multitasking can only have one scheduler. It doesn't make sense to have two event loops running at the same time, as they would just have to schedule each other in a busy waiting manner, wasting CPU cycles.

    Revolt provides global access to the scheduler using methods on the Revolt\EventLoop class. On the first use of the class, it will automatically create the best available driver. Revolt\EventLoop::setDriver() can be used to set a custom driver.

    To add to that personally I think with the initial Fiber code that was added we've already seen some challenges with CPU spinlocking. To me this feels very much like a problem where the initial case is trivial and as we adopt Fibers more we'll find more of these edge cases (oh we'd like to just let the system sleep until I/O comes back if there's nothing else to do). We'd then be solving exactly the problems that Revolt has already solved, but doing so in a way not compatible with other async code in the ecosystem.

  • 🇬🇧 United Kingdom longwave UK

    No current security policy published.

    Can we ask the maintainers if they are willing to publish a security policy? Given that this is a low level runtime dependency it seems quite important that if there is a security issue the maintainers are prepared to fix it within a reasonable timescale.

  • 🇳🇱 Netherlands kingdutch

    I've opened an issue with the request: https://github.com/revoltphp/event-loop/issues/87

  • 🇬🇧 United Kingdom catch

    To add to that personally I think with the initial Fiber code that was added we've already seen some challenges with CPU spinlocking. To me this feels very much like a problem where the initial case is trivial and as we adopt Fibers more we'll find more of these edge cases (oh we'd like to just let the system sleep until I/O comes back if there's nothing else to do). We'd then be solving exactly the problems that Revolt has already solved, but doing so in a way not compatible with other async code in the ecosystem.

    I think we could solve some of that by moving the individual loops to use a helper class so there's less repetition. If that was the only reason I'm not sure it would be worth it, but the interoperability arguments here are quite strong, so that is pushing me over from neutral/on the fence towards pro-adoption of Revolt at the moment.

  • 🇳🇱 Netherlands kingdutch

    Updated the remaining tasks. I've created a child issue to add the dependency to the composer.json: 📌 Add revoltphp/event-loop dependency to core Active, which now also contains the dependency evaluation.

  • 🇳🇱 Netherlands kingdutch

    Updated the issue summary with the remaining tasks to show tasks in progress. At least with the current proposed implementations it appears no work for PHPUnit is needed. If tests want to test something specifically that doesn't block the main thread at some point then they'll have to run EventLoop::run() in the test themselves.

  • 🇷🇺 Russia Chi

    It's not clear how Drupal will benefit from this.

    This command can give a very rough estimation of how much request time we could potentially save. Compare real and user time.

    $ time php index.php > /dev/null
    
    real    0m0.242s
    user    0m0.152s
    sys     0m0.055s
    

    Yes, there are lots of IO operations but not many of them can be taken out of the main code flow because subsequent steps typically depend on previous ones. I believe the earlier mentioned cases (Cache Prewarm and Big Pipe) won't give a big performance gain. They need benchmarks to prove their usefulness.

  • 🇷🇺 Russia Chi

    Yes, there are lots of IO operations

    I guess those were mostly file operations performed by the Composer autoloader, because in CLI Opcache and APCu do not make much sense. In a real HTTP request served by FPM the difference between real and user will be much smaller.

  • 🇬🇧 United Kingdom catch

    @chi are those numbers for core or for a real site that you're working on? If for a real site, was it with warm or cold caches? How many entities are there? What sorts of things are on the front page?

    If it was for core, try loading a page immediately after a cache clear on a relatively complex site - e.g. with lots of content and a handful of views on the front page.

    There are already some performance numbers on the cache prewarm issue.

  • 🇷🇺 Russia Chi

    Re #13. It was a custom project with a few million entities. However, the front page is just a user login form without extra blocks.

    Here are the results with a brand new D11 installation.
    Cold cache

    real    0m0.638s
    user    0m0.346s
    sys     0m0.103s
    

    Warmed cache

    $ time php index.php > /dev/null
    
    real    0m0.123s
    user    0m0.086s
    sys     0m0.030s
    

    Though the standard profile does render many things on the front page.

  • 🇷🇺 Russia Chi

    @catch Those numbers are not for measuring site performance. The point was to check how many IO bottlenecks we potentially have. And again, checking it in the CLI SAPI is not correct.

    As for benchmarking, testing just the front page may not be sufficient. I think Drupal needs a comprehensive set of load tests that can help to figure out performance gains and track performance regressions.

    I created a few K6 scenarios for testing the performance of a Drupal site. I wonder if Drupal core could implement something similar as part of the CI workflow.
    https://github.com/Chi-teck/k6-umami

  • 🇬🇧 United Kingdom catch

    As for benchmarking, testing just the front page may not be sufficient.

    A page with just a login form is not going to benefit from this.

    The sort of page that will benefit is a dashboard-y page like https://www.drupal.org/dashboard →

    A landing page with various views.

    A content page with a 'related articles' block at the bottom etc.

    If you have a page with 2-3 views and/or 2-3 entity queries that is where it gets interesting.

    Let's say you have three slow views queries that take ~500ms each. If they are all executed async, then instead of 1500ms executed one by one, the linear time spent executing the queries could go down to ~500ms. Additionally, other CPU intensive tasks (like rendering results) could be happening while waiting for all three queries to come back.

    Or if you have 5 entity queries that are 50 ms each, then instead of 250ms one by one, it could be 50ms (or 60ms more realistically given some time to execute each and collect the results) executing in parallel. And again other things can be happening while waiting for them to come back.

    If you have just one related articles block that does a views query using similar by terms module taking about 40ms, then that can still be executed async, and for example your footer menu, or social sharing buttons, or whatever other blocks could be rendered while waiting.

    On pages like this, unless everything is a cache hit, Drupal is both i/o and CPU bound, but we can do both database query execution and CPU-intensive tasks in parallel once we have async database queries implemented.
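
    To make the arithmetic concrete, here is a tiny simulation using only Revolt primitives. EventLoop::delay() stands in for an async database query here, and the three 500ms timings are made up for illustration:

    use Revolt\EventLoop;

    $queryTimes = [0.5, 0.5, 0.5]; // three ~500ms "views queries"
    $results = [];
    $suspension = EventLoop::getSuspension();

    $start = microtime(TRUE);
    foreach ($queryTimes as $i => $seconds) {
      // delay() stands in for issuing an async query and being woken up when
      // its result is ready.
      EventLoop::delay($seconds, static function () use (&$results, $i, $suspension, $queryTimes): void {
        $results[$i] = "result $i";
        if (count($results) === count($queryTimes)) {
          // All queries are back: resume the code waiting below.
          $suspension->resume($results);
        }
      });
    }

    $all = $suspension->suspend();
    // Prints ~500ms rather than ~1500ms, because the three waits overlap.
    printf("%d queries finished after %.0fms\n", count($all), (microtime(TRUE) - $start) * 1000);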

  • 🇷🇺 Russia Chi

    Additionally, other CPU intensive tasks (like rendering results) could be happening while waiting for all three queries to come back.

    That's unclear. How do we render results without having those results yet?

    Drupal is both i/o and CPU bound, but we can do both database query execution and CPU-intensive tasks in parallel once we have async database queries implemented.

    With an async DB driver in place we can run queries in parallel. But is that really possible for CPU tasks like rendering?

  • 🇬🇧 United Kingdom catch

    That's unclear. How do we render results without having those results yet?

    Let's say you have a landing page and it has 10 blocks on it - like a newspaper front page with hero image, breaking news, regional, business, sport all that kind of thing.

    Let's say the listing query for five blocks takes 30ms, and for five blocks it's 60 milliseconds.

    Rendering the results of each block (loading entities and entity view) takes 30 milliseconds each.

    5 * 30ms +
    5 * 60ms +
    10 * 30ms
    = 750ms

    I'm deliberately using relatively short query times here to make the amount of i/o we have to play with fairly conservative.

    We have the 10 blocks in a list somewhere, so each one fires off its listing query and, immediately after the query is sent, says to the event loop 'do something else while I'm waiting for the query to come back'.

    For each of the 10 blocks, this takes 1ms to go around and fire off each query.

    Then we get back to the first block, it's been 10ms since we left it, and the query hasn't come back yet. Because we've got nothing to do, the event loop can either try to prewarm a cache somewhere or do nothing for a while; it sleeps 0.5ms each iteration if there's nothing to do. Let's assume it does nothing useful for 20ms and just usleeps.

    20ms later, the first query comes back, and we immediately load the relevant entities and render the block. This does not happen async as such, but it happens before we check if the other nine listing queries have come back.

    This has now taken a total of 60ms wall time to render one block. 30ms to send and receive the initial query, and 30ms to render.

    By now, because all of the other async queries we issued take either 30ms or 60ms to return, when we move onto the other nine blocks, all of the query results are sitting there waiting.

    Each block takes 30ms to render, this does not and cannot happen async, we just immediately render each block.

    So now the entire process has taken 30ms for the initial queries to fire and the waiting time with nothing to do + 10 * 30ms to render each block sequentially = 330ms. More than twice as fast.

    In a more realistic situation, there is likely to be a much more variable distribution of query times.

    So let's say nine queries are 30-60ms but one query takes 250ms and this is the fifth query to run.

    In this case, blocks 1-4 and 6-10 might return their results and render first, then finally that result comes back and we render the block.

    This could still end up taking only 330ms for the entire process to complete, because 250ms + 30ms < 330ms and we can be rendering all the other blocks while waiting for the slow one to come back.

  • 🇷🇺 Russia Chi

    Re: #18. We can start rendering blocks with fast queries while waiting for query results from blocks with slow queries? Is that correct?

  • 🇷🇺 Russia Chi

    Anyway, for sites where SQL queries are not a bottleneck the only solution would be having additional PHP processes.

    I can think of a few options here:

    1. Spawning new processes with proc_open (Symfony Process). For example, for the landing pages you described those could be very simple PHP scripts that do just one thing: render a single block.

    2. Rendering blocks with PHP-FPM through hollodotme/fast-cgi-client. That should be faster than CLI workers as it is compatible with Opcache and APCu. It supports async requests as well, and FPM has some options to control the number of workers in the pool.

    3. Having a few workers connected to a queue. They should listen for jobs (block rendering) and reply instantly. That means the Drupal DB based queue is not quite suitable as it would require frequent polling.

    4. The most extreme option: having a multi-threaded socket server written in PHP with Drupal bootstrapped. It should be able to handle multiple connections simultaneously. So when a Drupal site needs to render a dozen heavy blocks it just sends tasks to that server to render the blocks in parallel.

    I suppose none of that can be fully implemented in Drupal core. However, it could provide some async API to allow delegating block rendering to external systems.

  • 🇬🇧 United Kingdom catch

    Re #19 yes exactly.

    A persistent queue worker (Drupal as a Daemon) is also a possibility, yes. To make that possible we'd need to resolve a lot of issues in core that make it hard to run, and yes, the actual implementation probably couldn't live in core, but we could make it easier. There are some issues around tracking this stuff.

  • 🇷🇺 Russia Chi

    Re 18: There are a couple of things that can potentially break that calculation.

    1. Database locks
    When the cache is empty the request will likely trigger lots of cache updates that may potentially cause race conditions. Especially when the same cache items need to be updated in different blocks.

    2. Building vs rendering
    Those terms are often misused. Building means creating a render array while rendering is creating an HTML presentation of the content (string or Markup object). The problem here is that blocks typically rely on lazy builders, pre_render callbacks, theme functions etc. That stuff delegates the rendering process to the theme layer.
    Consider this render array. It costs the CPU nothing to produce such content. The main work will happen later when Drupal is rendering the content. And that means that async orchestration has to cover theming as well.

    $build['content'] = [
      '#theme' => 'example',
    ];
    
  • 🇬🇧 United Kingdom catch

    Building means creating a render array while rendering is creating an HTML presentation of the content (string or Markup object). The problem here is that blocks typically rely on lazy builders, pre_render callbacks, theme functions etc.

    With BigPipe, each block placeholder is rendered to HTML independently (and then rendered as an inline AjaxResponse which then replaces the placeholder), so the bit that is controlled (currently by Fibers, eventually by Revolt) incorporates both building and rendering.

    Building also generally includes loading entities - e.g. entity query, then load entities, then call view on the entities. So even if the actual rendering happens later, there are things to do in-between querying the entities and returning a render array for them.

    When the cache is empty the request will likely trigger lots of cache updates that may potentially cause race conditions. Especially when the same cache items need to be updated in different blocks.

    This would all happen in the non-async database connection, so I'm not sure why you think it would be different?

  • 🇷🇺 Russia Chi

    I created a module to test the options described in #20. It builds a sort of landing page with blocks that can be slowed down.

    https://github.com/Chi-teck/sample_catalog

    Results are quite interesting though predictable.

    When blocks are too slow it doesn't matter which way you are doing parallel processing. The results are always quite good. I've managed to get a 12x boost using 12 CPU cores. However, when blocks are relatively fast, the cost of spawning new processes becomes significant. In that case the best results were achieved with a "co-pilot" server powered by Road Runner. It demonstrated its usefulness even when building each block takes about 5-10 ms.

    Overall, landing pages are not the only use case for this. For instance, I have a project with a very heavy API endpoint for a collection of entities. Each item in that collection is personalized and is frequently updated, so caching is not possible. Building items in parallel using one of the above-mentioned options could potentially improve API performance a great deal.

  • 🇷🇺 Russia Chi

    This would all happen in the non-async database connection

    I meant a single HTTP request without any concurrency. In that case, with a non-async database connection, all queries happen sequentially. So no locks are expected.

  • 🇬🇧 United Kingdom catch

    However, when blocks are relatively fast, the cost of spawning new processes becomes significant.

    Core will not spawn any new processes; it will be necessary to create a new database connection to run an async query (see ✨ [PP-1] Async database query + fiber support Active ), but everything happens in a single process. This is what Fibers allow for compared to the previous approaches of ReactPHP and AmPHP. I think it will probably be possible to add async processing via additional processes in contrib, but I have not really thought that far ahead yet.

  • 🇫🇷 France andypost

    Today I ran into an OpenTelemetry warning, which is caused by the trick required to propagate context into a newly created Fiber.

    User warning: Access to not initialized OpenTelemetry context in fiber (id: 8909), automatic forking not supported, must attach initial fiber context manually in OpenTelemetry\Context\FiberBoundContextStorage::triggerNotInitializedFiberContextWarning() (line 74 of /var/www/html/vendor/open-telemetry/context/FiberBoundContextStorage.php).

    So having some predictable API to auto-instrument core is a good point to keep in mind when adopting this: https://github.com/opentelemetry-php/context/blob/main/README.md#fiber-s...

  • 🇷🇺 Russia Chi

    Still trying to comprehend how this event loop will work with the new MySQL driver ( 📌 [PP-1] Create the database driver for MySQLi Postponed ). As I understand it, revolt/event-loop is based on streams; stream_select is essentially the backbone of its async operations. That means the DB driver should be implemented through PHP streams like amphp/mysql.
    Did I miss something?

  • 🇫🇷 France andypost

    There's only one way, mysqli::poll(), and streams are now everywhere in PHP.

  • 🇳🇱 Netherlands kingdutch

    Catch already did a great job explaining in text. If anyone comes across this and is looking for an explanation that includes visuals then I recommend watching the talk I gave at DrupalCon, which attempts to explain the scenarios in which the Revolt event loop will help us now and in the future: https://www.youtube.com/watch?v=tfppKrK1zGU

    In the past week I've also been a guest on the Talking Drupal podcast where I did my best to answer similar questions that may be asked slightly differently and help it click: https://talkingdrupal.com/474

    The question about the Async Database is a good question. The event loop can indeed use streams directly (as e.g. amphp/mysql does). However, with the primitives that the library provides it's also possible to do it in a looping manner. For example:

    
    use Revolt\EventLoop;

    function drupalAsyncDbHandler(....) {
      // Start a query that requires polling.
      mysqli::startSomething(...);

      $suspension = EventLoop::getSuspension();

      // Check our database connection whenever nothing else is happening.
      $callbackId = EventLoop::repeat(0, function ($callbackId) use ($suspension) {
        // Ensure only one instance of this callback runs at a time.
        // Not needed if we're 100% sure that the rest of this function is synchronous.
        EventLoop::disable($callbackId);

        $ready = mysqli::poll(...);
        if ($ready > 0) {
          // Fetch the result (omitted here) and continue the code that's
          // waiting for us with the query result.
          $suspension->resume($result);
          return;
        }

        // We're not done, so re-enable the callback to poll again on the next
        // iteration. Eat the error if the repeat was cancelled in the meantime.
        // That could happen if we cancel the request and no longer need the
        // result, for example.
        // Until: https://github.com/revoltphp/event-loop/issues/91.
        try {
          EventLoop::enable($callbackId);
        }
        catch (EventLoop\InvalidCallbackError $e) {}
      });

      // Wait for the result to have been fetched from the database.
      $result = $suspension->suspend();
      EventLoop::cancel($callbackId);
      return $result;
    }
    

    If you're dealing with Revolt primitives directly then you'll have to think about the async states so that your calling code doesn't have to. For contrib there's the option of pulling in a lower-level library of their choice (e.g. ReactPHP, AmPHP or something new) to do this for them.

    For the above snippet I modified one of my examples from the Revolt playground, which attempts to demonstrate some of the scenarios you might currently find in Drupal (using Fibers) or other scenarios that have been discussed that we might need: https://github.com/Kingdutch/revolt-playground/
