Author: Wu Feixiang

A drawback of Tokio: asynchronous cancellation does not propagate

Recently I ran into a bug in our project: the receiver of a Tokio channel was dropped for no apparent reason, causing send to return an error.

The bug can be reproduced with the following changes to examples/web_api.rs in the Hyper source tree:

diff --git a/Cargo.toml b/Cargo.toml
index 862c20f9..694b8855 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -73,7 +73,7 @@ pnet_datalink = "0.27.2"

 [features]

 # Nothing by default
-default = []
+default = ["full"]
 
 # Easily turn it all on
 full = [
diff --git a/examples/web_api.rs b/examples/web_api.rs
index 5226249b..6de7f682 100644
--- a/examples/web_api.rs
+++ b/examples/web_api.rs
@@ -56,6 +56,12 @@ async fn api_post_response(req: Request<Body>) -> Result<Response<Body>> {

 async fn api_get_response() -> Result<Response<Body>> {
     let data = vec!["foo", "bar"];
+    let (tx, rx) = tokio::sync::oneshot::channel::<()>();
+    tokio::spawn(async move {
+        tokio::time::sleep(std::time::Duration::from_secs(5)).await;
+        tx.send(()).unwrap();
+    });
+    rx.await.unwrap();
     let res = match serde_json::to_string(&data) {
         Ok(json) => Response::builder()
             .header(header::CONTENT_TYPE, "application/json")

async fn api_get_response is the asynchronous function in which Hyper handles the HTTP request. It spawns a task to do some time-consuming work, which we simulate here with a 5-second sleep; finally, the task sends its result back to async fn api_get_response through the channel. If the client closes the connection before the server responds, Hyper drops the async fn api_get_response future to cancel it, so rx is dropped and the subsequent send fails.
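The same failure can be shown in a self-contained sketch without Hyper (my own minimal repro, not part of the Hyper example): dropping the handler future drops rx, so the detached task's send returns an error.

use std::time::Duration;

#[tokio::main]
async fn main() {
    let (tx, rx) = tokio::sync::oneshot::channel::<()>();
    // stands in for the request-handler future that Hyper would drop
    let handler = async move { rx.await.unwrap() };
    drop(handler); // simulate Hyper cancelling the request handler
    let task = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_millis(100)).await;
        // the receiver died together with the handler, so send fails
        assert!(tx.send(()).is_err());
    });
    task.await.unwrap();
}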

The problem: cancellation does not propagate

A flood of malicious requests that disconnect immediately

The problem this causes: the client has already disconnected before the response, yet the server's spawned coroutine is still querying the database, and this is very easy to exploit to attack the server. For example, someone maliciously sends 100,000 requests that take a long time to process and cancels each request immediately after sending it; the server wastes resources processing requests that have already been cancelled.

If the client disconnects, the cancellation should propagate from the root node down to all associated asynchronous tasks/futures handling the request.

Otherwise, the client is long gone while the server keeps querying a pile of databases and consuming a lot of resources to produce data it can no longer send back to the client.
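One way to get this propagation by hand is a cancellation token. A minimal sketch, assuming the tokio-util crate's CancellationToken (this is not something Hyper or Tokio does for you automatically, and slow_db_query is a hypothetical placeholder):

use tokio_util::sync::CancellationToken;

async fn handler() {
    let token = CancellationToken::new();
    let child = token.child_token();
    let task = tokio::spawn(async move {
        tokio::select! {
            // the parent was dropped/cancelled: stop the expensive work
            _ = child.cancelled() => {}
            _ = slow_db_query() => {}
        }
    });
    // if this handler future is dropped, the guard cancels every child token
    let _guard = token.drop_guard();
    let _ = task.await;
}

async fn slow_db_query() { /* hypothetical placeholder */ }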

Systemctl stop timeout

For example, a web server process uses a libc::signal callback so that the process can receive a graceful-shutdown signal.

Normally the process receives the signal and propagates the cancellation to every coroutine, but some stubborn coroutines live on for a long time (for example, polling tasks that loop over a sleep).
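A minimal sketch of the difference (the names wait_for_sigterm, stubborn_poller, cooperative_poller and the broadcast-channel wiring are my own assumptions, using Tokio's signal API instead of a raw libc::signal handler):

use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};

// Fan the SIGTERM out to every coroutine through a broadcast channel.
async fn wait_for_sigterm(shutdown_tx: tokio::sync::broadcast::Sender<()>) {
    let mut sigterm = signal(SignalKind::terminate()).unwrap();
    sigterm.recv().await;
    let _ = shutdown_tx.send(());
}

// A "stubborn" poller: it never checks the shutdown signal, so graceful
// shutdown can't reach it and systemd eventually escalates to SIGKILL.
async fn stubborn_poller() {
    loop {
        poll_once().await;
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}

// A cooperative poller: races each sleep against the shutdown broadcast.
async fn cooperative_poller(mut shutdown: tokio::sync::broadcast::Receiver<()>) {
    loop {
        poll_once().await;
        tokio::select! {
            _ = tokio::time::sleep(Duration::from_secs(60)) => {}
            _ = shutdown.recv() => break, // exit promptly on graceful shutdown
        }
    }
}

async fn poll_once() { /* hypothetical placeholder */ }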

A libc::signal callback does not and cannot handle SIGKILL, so no custom resource-reclamation logic can run at that point.

Dec 18 10:39:21 ww systemd[715]: Stopping graph...
Dec 18 10:39:21 ww atlasd[1518986]: 2021-12-18 10:39:21.588323 INFO atlasd: signal SIGTERM received, stopping this daemon server
Dec 18 10:39:21 ww atlasd[1518986]: 2021-12-18 10:39:21.588408 INFO server::graph: prepare to stop graph server
Dec 18 10:39:21 ww atlasd[1518986]: 2021-12-18 10:39:21.588744 INFO start_prometheus_exporter{ip=0.0.0.0 port=19100 instance_kind=Graph}:prometheus_exporter{accept}: common::metrics::prome>
Dec 18 10:39:21 ww atlasd[1518986]: 2021-12-18 10:39:21.588830 INFO web::server: graceful shutdown web server
Dec 18 10:40:51 ww systemd[715]: graph.service: State 'stop-sigterm' timed out. Killing.
Dec 18 10:40:51 ww systemd[715]: graph.service: Killing process 1518986 (atlasd) with signal SIGKILL.
Dec 18 10:40:51 ww systemd[715]: graph.service: Killing process 1518988 (tokio-runtime-w) with signal SIGKILL.
Dec 18 10:40:51 ww systemd[715]: graph.service: Killing process 1518989 (tokio-runtime-w) with signal SIGKILL.
Dec 18 10:40:51 ww systemd[715]: graph.service: Killing process 1518993 (tokio-runtime-w) with signal SIGKILL.
Dec 18 10:40:51 ww systemd[715]: graph.service: Killing process 1519000 (tokio-runtime-w) with signal SIGKILL.
Dec 18 10:40:51 ww systemd[715]: graph.service: Killing process 1519002 (tokio-runtime-w) with signal SIGKILL.
Dec 18 10:40:51 ww systemd[715]: graph.service: Killing process 1519007 (tokio-runtime-w) with signal SIGKILL.
Dec 18 10:40:51 ww systemd[715]: graph.service: Main process exited, code=killed, status=9/KILL
Dec 18 10:40:51 ww systemd[715]: graph.service: Failed with result 'timeout'.
Dec 18 10:40:51 ww systemd[715]: Stopped graph.

Pruning the Future tree?

The future of async fn api_get_response can be abstracted as the root node, and the spawned future as its child.

I assumed that, because the client disconnected, Hyper would cut off the root future of the RPC request handler together with all of its leaf nodes, but Tokio has no such API or design.

I wish for a constraint like scoped threads, so that the future spawned inside async fn api_get_response could not outlive it. That would remove a lot of mental burden: no need to worry about the child future holding the parent's resources after the parent is dropped.

It sounds a bit like Scoped Future

Discussion in the Monoio group chat

Wu Feixiang: Suppose Hyper takes 15 seconds to process an HTTP (RPC) request and the handler function calls tokio::spawn. If the client disconnects before the request is processed, Hyper cancels the HTTP server's async fn handler for the current request. But Rust's async spawn has a big problem: the tokio::spawn inside the async fn handler is not cancelled; it keeps executing, and after 15 seconds it tries to send to the channel of the outer async fn api_get_response().

Shuai Xu: So everyone wants the ability of structured concurrency. I want `async fn api_get_response()` to hold the join handle returned by tokio::spawn, and when async fn api_get_response() is dropped, cancel all of the spawn's child futures. It seems very hard for Rust to implement right now.

Shuai Xu: Yes, and the most important thing is that AsyncDrop can only be noticed and handled by the task itself. The community has proposed an AsyncDrop abstraction with the help of linear types: https://aidancully.blogspot.com/2021/12/linear-types-can-help.html

Wu Feixiang: The parent future is cancelled and drops resources such as the channel, so the child future ends up sending data to the parent future's already-dropped receiver.

JZ: The compiler can't check this situation. smol tasks have to be explicitly detach()ed to run in the background, while tokio and async-std both detach on drop; this cancellation problem is not easy to solve. If you use smol, you can join the task and rx, which should be ok. (The glommio runtime also requires an explicit detach.)

Zhang Handong: @Wu Feixiang the compiler can't check this; it looks like a code-design problem.

JZ: You can search for "Rust Async Cancellation"; it should be by Joshua, and it discusses an interim approach to cancellation.

How to make Hyper's cancellation propagate

let (tx, rx) = tokio::sync::oneshot::channel::<()>();
let task_handle = tokio::spawn(async move {
    dbg!("task start");
    tokio::time::sleep(std::time::Duration::from_secs(5)).await;
    dbg!("task finished, prepare send back to parent future Node");
    tx.send(()).unwrap();
});

The tx.send(()).unwrap(); line will panic.
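The reason: tokio's oneshot Sender::send returns Err(value) once the Receiver has been dropped, and .unwrap() turns that into a panic. A hedged alternative (my suggestion, not the example's code) is to treat the error as "the request was cancelled":

// instead of tx.send(()).unwrap():
if tx.send(()).is_err() {
    // the receiver (the request handler's future) was dropped,
    // i.e. the request was cancelled; there is nobody to reply to
}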

To get the tokio::spawn task aborted early when the request is cancelled, more boilerplate code has to be introduced:

struct ReqHandlerContext {
    task_handle: tokio::task::JoinHandle<()>,
    // rx: tokio::sync::oneshot::Receiver<()>
}

impl Drop for ReqHandlerContext {
    fn drop(&mut self) {
        dbg!("fn api_get_response() ReqHandlerContext in drop");
        self.task_handle.abort();
    }
}

let ctx = ReqHandlerContext {
    task_handle,
};

When the client disconnects prematurely, Hyper logs as follows

Listening on http://127.0.0.1:1337
[examples/web_api.rs:60] "task start" = "task start"
[examples/web_api.rs:72] "fn api_get_response() ReqHandlerContext in drop" = "fn api_get_response() ReqHandlerContext in drop"

As you can see, the spawned task was cancelled before its dbg!("task finished, ...") line could execute, exactly as expected.

You must ask your superior for approval before writing a spawn

To avoid the many kinds of bugs caused by mishandling tokio::spawn, the use of spawn must be strictly limited in code review.

Our company requires that before introducing a new tokio::spawn, you apply to your superior and document in the comments the reason for the spawn, how long the spawned task lives, and which external resources it captures.

It also requires explicitly storing the spawn's JoinHandle and actively calling handle.abort() in Drop.
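That rule can be packaged into a small reusable guard. A minimal sketch (my own generalization of the ReqHandlerContext pattern above, not an existing Tokio API):

struct AbortOnDrop<T>(tokio::task::JoinHandle<T>);

impl<T> Drop for AbortOnDrop<T> {
    fn drop(&mut self) {
        // dropping the guard aborts the task instead of detaching it
        self.0.abort();
    }
}

// usage: the spawned task now dies together with the request handler
// let _guard = AbortOnDrop(tokio::spawn(async { /* ... */ }));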

spawn is not a silver bullet; besides spawn there are other ways to write code that does not block execution. Think carefully before using spawn; one alternative is sketched below.
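For example, if the concurrency stays inside the request future, Hyper's cancellation reaches it automatically and no abort-on-drop plumbing is needed. A sketch (slow_db_query and other_work are hypothetical placeholders):

// both branches run concurrently, but they are still owned by this
// future, so dropping it cancels them too: no spawn, no JoinHandle
async fn api_get_response_without_spawn() {
    let (_a, _b) = tokio::join!(slow_db_query(), other_work());
}

async fn slow_db_query() { /* hypothetical placeholder */ }
async fn other_work() { /* hypothetical placeholder */ }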

Some thoughts on Tokio’s problems

tokio::task::JoinHandle

A JoinHandle detaches the associated task when it is dropped

Personally I can't get used to this. Unlike glommio's coroutine detach or libpthread's pthread_detach for threads, the caller doesn't get to decide whether a detach is needed, which is really inconvenient.
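A short demo of the behavior being criticized (the detach-on-drop semantics are Tokio's documented behavior; the demo itself is mine):

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        tokio::time::sleep(std::time::Duration::from_secs(1)).await;
        println!("still running: dropping the handle detached me, not aborted me");
    });
    drop(handle); // detach-on-drop: the task keeps running in the background
    tokio::time::sleep(std::time::Duration::from_secs(2)).await; // it still prints
}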

pin Future to CPU core

Tokio lacks a pin-to-core concept like glommio's. Something similar to libpthread's pthread_setaffinity_np? Tokio doesn't seem to have that either.

This is a limitation of Tokio being cross-platform: some APIs that are available on Linux may not exist on Windows/macOS.

Could Tokio pin a future to one CPU core to avoid the context-switching overhead of executing across multiple cores? That seems to conflict with Tokio's work-stealing design, which balances load across all cores: Tokio doesn't want one core struggling while the other cores look on.
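A common workaround is to pin OS threads yourself and run one single-threaded runtime per pinned thread, approximating glommio's thread-per-core model. A sketch assuming the third-party core_affinity crate (this is not a Tokio API):

use std::thread;

fn spawn_pinned(core: core_affinity::CoreId) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // set this thread's CPU affinity (sched_setaffinity on Linux)
        core_affinity::set_for_current(core);
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .unwrap();
        rt.block_on(async {
            // futures driven here never migrate to another core
        });
    })
}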

Single-threaded runtime can be faster in some cases

In our project's benchmarks, some modules performed better on Tokio's single-threaded runtime, so you can't blindly believe that the multi-threaded runtime beats the single-threaded one: synchronizing data across multiple CPU cores is expensive. Whether to use the single-threaded or the multi-threaded runtime needs case-by-case analysis.

(But tokio::spawn's future signature is the same on the single-threaded and multi-threaded runtimes... the single-threaded one doesn't even drop the Send constraint...)
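To actually shed the Send bound on the current-thread runtime you have to go through spawn_local. A minimal sketch of the point:

use std::rc::Rc;
use tokio::task::LocalSet;

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();
    let local = LocalSet::new();
    let _task = local.spawn_local(async {
        let not_send = Rc::new(1); // Rc is !Send: fine with spawn_local
        // tokio::spawn(async move { drop(not_send) }) would not compile,
        // because tokio::spawn requires Send even on this runtime
        drop(not_send);
    });
    rt.block_on(local); // drive the LocalSet until its tasks finish
}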


Finally, imagine designing a runtime purely for Linux, ignoring macOS/Windows compatibility, and squeezing out performance with the many Linux-only APIs. You might well find that Rust's built-in Future is inadequate or unusable, and even end up creating your own runtime-specific Future (just as many C++ libraries create their own futures).