1 Introduction

In the previous article, “Talking about TCP Long Connections and Heartbeats,” we covered TCP’s KeepAlive and what it means to design heartbeats at the application layer, but we didn’t go into the details of designing long-connection heartbeats. In fact, designing a good heartbeat mechanism is not easy. Several RPC frameworks I am familiar with have very different heartbeat mechanisms. In this article, I will discuss how to design an elegant heartbeat mechanism, mainly around Dubbo’s existing solution and an improved one.

2 Preliminary knowledge

Since we will later dig into the source code, some details of service governance frameworks need to be explained in advance for everyone’s understanding.

2.1 How does the client know that a request failed?

High-performance RPC frameworks almost always choose Netty as their communication layer, and the efficiency of non-blocking communication needs no elaboration from me. However, because of this non-blocking characteristic, sending and receiving data are asynchronous processes. So when the server misbehaves or the network has problems, the client receives no response at all. How, then, can we judge that an RPC call has failed?

Myth 1: Aren’t Dubbo calls synchronous by default?

Dubbo is asynchronous at the communication layer; it presents an illusion of synchronization to the user by blocking internally to wait for the response, turning the asynchronous call into a synchronous one.
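To make the idea concrete, here is a minimal sketch of sync-over-async (my own illustration, not Dubbo’s code): the calling thread parks on a future until a response arrives or the wait times out.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SyncOverAsync {
    Object call(CompletableFuture<Object> pendingResponse, int timeoutMillis) throws Exception {
        try {
            // This get() is the "internal block wait": the transport stays
            // asynchronous, but the caller experiences a synchronous call.
            return pendingResponse.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new RuntimeException("RPC call timed out after " + timeoutMillis + " ms", e);
        }
    }
}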

Myth 2: Isn’t ChannelFuture.isSuccess all I need to determine whether the request succeeded?

Note that a successful writeAndFlush does not mean the request has been received by the peer. A return value of true only guarantees that the data has been written into the local network buffer, not that it has reached the other side.
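A small illustration of the point (a sketch with assumed names, applicable to any Netty channel): the listener below only learns about the local write, nothing about the peer.

import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;

final class WriteListenerExample {
    static void send(Channel channel, Object request) {
        channel.writeAndFlush(request).addListener((ChannelFutureListener) f -> {
            if (f.isSuccess()) {
                // Data reached the local socket buffer;
                // this says nothing about whether the peer received it.
            } else {
                // The local write itself failed, e.g. the channel is closed.
                f.cause().printStackTrace();
            }
        });
    }
}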

With these two misconceptions out of the way, let’s return to the title of this section: how does a client know a request failed? The correct logic is that the client acts on a failure response it receives. Wait, didn’t we just say that in failure scenarios the server returns no response? Exactly: since the server will not return one, the client has to manufacture it itself.

A common design is that when a client initiates an RPC request, it sets a timeout client_timeout and starts a timer that fires after client_timeout:

  • Remove the timer when a normal response is received.
  • If the timer is still there when the countdown expires, the request is considered timed out, and a failure response is constructed and delivered to the client.

Timeout determination logic in Dubbo:

public static DefaultFuture newFuture(Channel channel, Request request, int timeout) {
    final DefaultFuture future = new DefaultFuture(channel, request, timeout);
    // timeout check
    timeoutCheck(future);
    return future;
}
private static void timeoutCheck(DefaultFuture future) {
    TimeoutCheckTask task = new TimeoutCheckTask(future);
    TIME_OUT_TIMER.newTimeout(task, future.getTimeout(), TimeUnit.MILLISECONDS);
}
private static class TimeoutCheckTask implements TimerTask {
    private DefaultFuture future;
    TimeoutCheckTask(DefaultFuture future) {
        this.future = future;
    }
    @Override
    public void run(Timeout timeout) {
        if (future == null || future.isDone()) {
            return;
        }
        // create exception response.
        Response timeoutResponse = new Response(future.getId());
        // set timeout status.
        timeoutResponse.setStatus(future.isSent() ? Response.SERVER_TIMEOUT : Response.CLIENT_TIMEOUT);
        timeoutResponse.setErrorMessage(future.getTimeoutMessage(true));
        // handle response.
        DefaultFuture.received(future.getChannel(), timeoutResponse);
    }
}

The classes involved in the main logic are DubboInvoker, HeaderExchangeChannel, and DefaultFuture. The code above reveals a detail: no matter what kind of call it is, it passes through the timeout timer. A timeout means the call failed, and a failed RPC call always results in the client receiving a failure response.

2.2 Heartbeat detection requires fault tolerance

Network communication should always account for the worst case. A single heartbeat failure cannot be treated as a connection failure; only after multiple consecutive heartbeat failures should appropriate measures be taken.

2.3 Heartbeat detection should be idle detection, not busy detection

The opposite of busy detection is idle detection. The original purpose of the heartbeat is to ensure the availability of the connection, so that disconnection and reconnection measures can be taken in time. If frequent RPC calls are already flowing on a channel, we should not burden it with extra heartbeat packets. The heartbeat should put the umbrella away on sunny days and hand it out on rainy days: act only when the channel is idle.

3 Dubbo’s existing scheme

The source code in this article corresponds to Dubbo 2.7.x. Now incubated by Apache, this version’s heartbeat mechanism has been enhanced.

Having introduced some basic concepts, let’s take a look at how Dubbo designs the application layer heartbeat. Dubbo’s heartbeat is bidirectional: the client sends heartbeats to the server, and vice versa.

3.1 Creating a timer when establishing a connection

public class HeaderExchangeClient implements ExchangeClient {
    private int heartbeat;
    private int heartbeatTimeout;
    private HashedWheelTimer heartbeatTimer;

    public HeaderExchangeClient(Client client, boolean needHeartbeat) {
        this.client = client;
        this.channel = new HeaderExchangeChannel(client);
        this.heartbeat = client.getUrl().getParameter(Constants.HEARTBEAT_KEY,
                dubbo != null && dubbo.startsWith("1.0.") ? Constants.DEFAULT_HEARTBEAT : 0);
        this.heartbeatTimeout = client.getUrl().getParameter(Constants.HEARTBEAT_TIMEOUT_KEY, heartbeat * 3);
        if (needHeartbeat) { // <1>
            long tickDuration = calculateLeastDuration(heartbeat);
            heartbeatTimer = new HashedWheelTimer(new NamedThreadFactory("dubbo-client-heartbeat", true),
                    tickDuration, TimeUnit.MILLISECONDS, Constants.TICKS_PER_WHEEL); // <2>
            startHeartbeatTimer();
        }
    }
}

<1> The heartbeat detection timer is enabled by default

<2> Create a HashedWheelTimer to drive heartbeat detection. This is Netty’s classic time wheel timer implementation.

It is not only the client (HeaderExchangeClient) that starts a timer; the server (HeaderExchangeServer) also starts one. Since the server’s logic is almost identical to the client’s, I will not paste the server code here.

Earlier versions of Dubbo used a scheduled thread pool (schedule) for this; it was replaced with HashedWheelTimer in 2.7.x.
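For readers unfamiliar with the time wheel, here is a minimal, self-contained sketch of the HashedWheelTimer API (a one-shot task here; Dubbo’s tasks re-arm themselves to run periodically):

import io.netty.util.HashedWheelTimer;
import io.netty.util.Timer;
import java.util.concurrent.TimeUnit;

public final class WheelTimerDemo {
    public static void main(String[] args) {
        // Defaults: 100 ms per tick, 512 slots per wheel revolution.
        Timer timer = new HashedWheelTimer();
        timer.newTimeout(timeout ->
                System.out.println("fired after roughly 60 seconds"),
                60, TimeUnit.SECONDS);
    }
}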

3.2 Enabling Two Scheduled Tasks

private void startHeartbeatTimer() {
    long heartbeatTick = calculateLeastDuration(heartbeat);
    long heartbeatTimeoutTick = calculateLeastDuration(heartbeatTimeout);
    HeartbeatTimerTask heartBeatTimerTask = new HeartbeatTimerTask(cp, heartbeatTick, heartbeat); // <1>
    ReconnectTimerTask reconnectTimerTask = new ReconnectTimerTask(cp, heartbeatTimeoutTick, heartbeatTimeout); // <2>

    heartbeatTimer.newTimeout(heartBeatTimerTask, heartbeatTick, TimeUnit.MILLISECONDS);
    heartbeatTimer.newTimeout(reconnectTimerTask, heartbeatTimeoutTick, TimeUnit.MILLISECONDS);
}

Dubbo starts two timed tasks in startHeartbeatTimer: HeartbeatTimerTask and ReconnectTimerTask.

<1> HeartbeatTimerTask periodically sends heartbeat requests

<2> ReconnectTimerTask handles the reconnection and disconnection logic after heartbeat failures

As for the remaining code in this method, which is also an important part of this article’s analysis, I’ll keep you in suspense and come back to it later.

3.3 Scheduled Task 1: Sending heartbeat Requests

HeartbeatTimerTask#doTask:

protected void doTask(Channel channel) {
    Long lastRead = lastRead(channel);
    Long lastWrite = lastWrite(channel);
    if ((lastRead != null && now() - lastRead > heartbeat)
            || (lastWrite != null && now() - lastWrite > heartbeat)) {
        Request req = new Request();
        req.setVersion(Version.getProtocolVersion());
        req.setTwoWay(true);
        req.setEvent(Request.HEARTBEAT_EVENT);
        channel.send(req);
    }
}

As mentioned earlier, Dubbo adopts a bidirectional heartbeat design: the server sends heartbeats to the client, and the client sends heartbeats to the server. The receiving side updates the lastRead field and the sending side updates the lastWrite field. When the heartbeat interval has elapsed with no traffic, a heartbeat request is sent to the peer. By updating these two fields, we send heartbeat packets only when the connection is idle, consistent with the principle established earlier.

Note: lastRead and lastWrite are updated not only by heartbeat requests but also by normal requests on the same channel. This corresponds to the idle detection mechanism in our preliminary knowledge.
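A hedged sketch of that bookkeeping in Netty flavour (handler and attribute names are my own; Dubbo does the equivalent in its own channel handlers): every inbound or outbound message, heartbeat or not, refreshes the two timestamps, so a busy channel never looks idle.

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.util.AttributeKey;

final class TimestampHandler extends ChannelDuplexHandler {
    static final AttributeKey<Long> LAST_READ = AttributeKey.valueOf("lastRead");
    static final AttributeKey<Long> LAST_WRITE = AttributeKey.valueOf("lastWrite");

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        // Any inbound message counts as a "read", heartbeats included.
        ctx.channel().attr(LAST_READ).set(System.currentTimeMillis());
        ctx.fireChannelRead(msg);
    }

    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
        // Any outbound message counts as a "write".
        ctx.channel().attr(LAST_WRITE).set(System.currentTimeMillis());
        ctx.write(msg, promise);
    }
}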

3.4 Scheduled Task 2: Reconnection and disconnection

Now let’s look at what the reconnection and disconnection timer does.

protected void doTask(Channel channel) {
    Long lastRead = lastRead(channel);
    Long now = now();
    if (lastRead != null && now - lastRead > heartbeatTimeout) {
        if (channel instanceof Client) {
            ((Client) channel).reconnect();
        } else {
            channel.close();
        }
    }
}

The second timer handles the connection according to whether it sits on the client or the server side: when the time since the last read exceeds the heartbeat timeout, the client chooses to reconnect while the server simply disconnects. This is reasonable when you consider that client calls depend strongly on an available connection, whereas the server can afford to wait for the client to re-establish one.

The observant will notice that the name ReconnectTimerTask is not quite accurate, since the task handles both reconnection and disconnection logic.

3.5 Inaccurate timing

Someone once reported in a Dubbo issue that the timing was not accurate. Let’s see what was going on.

The default heartbeat interval in Dubbo is 60 seconds. Imagine the following timeline:

  • At second 0, the heartbeat check finds the connection active
  • At second 1, the connection actually drops
  • At second 60, the heartbeat check finds the connection inactive

Because of this time-window problem, a dead connection cannot be detected promptly; in the worst case the delay is a full heartbeat cycle.

To solve this problem, let’s go back to the startHeartbeatTimer() method above.

long heartbeatTick = calculateLeastDuration(heartbeat);
long heartbeatTimeoutTick = calculateLeastDuration(heartbeatTimeout);

Here calculateLeastDuration computes a tick duration from the heartbeat interval and from the timeout respectively: in effect it divides each value by 3 to shrink it, and the result is passed on (as the tick duration in the HashedWheelTimer constructor above, and as the task delay below).
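For reference, a hedged reconstruction of calculateLeastDuration from Dubbo 2.7.x (constant names from memory; HEARTBEAT_CHECK_TICK is 3, with a floor so the tick never becomes zero):

private long calculateLeastDuration(int time) {
    if (time / Constants.HEARTBEAT_CHECK_TICK <= 0) {
        // Guard against a zero or negative tick for very small intervals.
        return Constants.LEAST_HEARTBEAT_DURATION;
    } else {
        return time / Constants.HEARTBEAT_CHECK_TICK;
    }
}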

heartbeatTimer.newTimeout(heartBeatTimerTask, heartbeatTick, TimeUnit.MILLISECONDS);
heartbeatTimer.newTimeout(reconnectTimerTask, heartbeatTimeoutTick, TimeUnit.MILLISECONDS);

The tick determines how frequently the scheduled tasks run. By shrinking the detection interval, we raise the probability of discovering a dead connection in time: the former worst case of 60s becomes 20s. The frequency could be pushed even higher, but resource consumption has to be weighed against it.

The inaccurate-timing problem affects both of Dubbo’s scheduled tasks, so the tick treatment is applied to both. In fact, all timed detection logic suffers from similar issues.

3.6 Dubbo Heartbeat Summary

For each connection, Dubbo starts two timers on both the client and the server: one periodically sends heartbeat requests, and the other periodically reconnects or disconnects. Both run at one third of the detection interval. The heartbeat-sending task sends a heartbeat packet to the peer when the connection is idle; the reconnect/disconnect task checks whether lastRead has gone without updates for longer than the timeout, in which case the client reconnects and the server disconnects.

Before we judge whether the solution is good or not, let’s look at how the improvement is designed.

4 Dubbo improvement scheme

We can actually implement the heartbeat mechanism more elegantly, and I’ll start this section by introducing a new heartbeat mechanism.

4.1 Introducing IdleStateHandler

Netty provides natural support for idle-connection detection: IdleStateHandler makes it easy to implement idle detection logic.

public IdleStateHandler(
        long readerIdleTime, long writerIdleTime, long allIdleTime,
        TimeUnit unit) {}
  • readerIdleTime: read timeout
  • writerIdleTime: write timeout
  • allIdleTime: timeout when neither a read nor a write has occurred

Based on the timeout parameters you set, IdleStateHandler periodically checks how long it has been since channelRead and write were last called. Once an IdleStateHandler has been added to a pipeline, the resulting IdleStateEvent can be detected in the userEventTriggered method of any of the pipeline’s handlers.

@Override
public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof IdleStateEvent) {
        //do something
    }
    ctx.fireUserEventTriggered(evt);
}

Why introduce IdleStateHandler at all? When you see its combination of idle detection and timing, doesn’t it seem tailor-made for a heartbeat mechanism? Indeed, many service governance frameworks choose IdleStateHandler to implement their heartbeats.

Internally, IdleStateHandler uses eventLoop.schedule(task) to implement its timed tasks. Using the channel’s own event loop thread has the added benefit of keeping everything thread-safe.
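A sketch of what that looks like from inside a handler (names are illustrative, not IdleStateHandler’s actual code): the scheduled task runs on the same event loop thread that invokes the handler’s callbacks, so handler state needs no extra locking.

import io.netty.channel.ChannelHandlerContext;
import java.util.concurrent.TimeUnit;

final class IdleCheckSketch {
    void scheduleIdleCheck(ChannelHandlerContext ctx, long delaySeconds) {
        // ctx.executor() is the channel's event loop.
        ctx.executor().schedule(() -> {
            // Inspect last read/write timestamps, fire an idle event if
            // the channel has been quiet, then re-arm the next check.
        }, delaySeconds, TimeUnit.SECONDS);
    }
}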

4.2 Client and Server Configuration

The first step is to add IdleStateHandler to the pipeline.

Client:

bootstrap.handler(new ChannelInitializer<NioSocketChannel>() {
    @Override
    protected void initChannel(NioSocketChannel ch) throws Exception {
        ch.pipeline().addLast("clientIdleHandler", new IdleStateHandler(60, 0, 0));
    }
});

Server:

serverBootstrap.childHandler(new ChannelInitializer<NioSocketChannel>() {
    @Override
    protected void initChannel(NioSocketChannel ch) throws Exception {
        ch.pipeline().addLast("serverIdleHandler", new IdleStateHandler(0, 0, 200));
    }
});

The client sets a read timeout of 60s, and the server sets an all-idle (read plus write) timeout of 200s. This raises two questions:

  1. Why are the timeout periods configured on the client and server inconsistent?
  2. Why does the client detect read timeouts while the server detects read and write timeouts?

4.3 Idle Timeout Logic – Client

The handling logic for idle timeouts differs between the client and the server. Let’s start with the client side.

@Override
public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof IdleStateEvent) {
        // send heartbeat
        sendHeartBeat();
    } else {
        super.userEventTriggered(ctx, evt);
    }
}

When an idle timeout is detected, the action taken is to send a heartbeat packet to the server. How is it sent, and how is the response handled? The pseudocode is as follows.

public void sendHeartBeat() {
    Invocation invocation = new Invocation();
    invocation.setInvocationType(InvocationType.HEART_BEAT);
    channel.writeAndFlush(invocation).addListener(new CallbackFuture() {
        @Override
        public void callback(Future future) {
            RPCResult result = future.get();
            // Timeout or write failure
            if (result.isError()) {
                channel.addFailedHeartBeatTimes();
                if (channel.getFailedHeartBeatTimes() >= channel.getMaxHeartBeatFailedTimes()) {
                    channel.reconnect();
                }
            } else {
                channel.clearHeartBeatFailedTimes();
            }
        }
    });
}

The behavior is not complicated: construct a heartbeat packet, send it to the server, and handle the response:

  • On a successful response, clear the failure mark.
  • On a failed response, increment the heartbeat-failure count by 1; once the count exceeds the configured threshold, re-establish the connection.

And not only heartbeats: a successful response to any normal request also clears the mark.

4.4 Idle Timeout Logic – Server

@Override
public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof IdleStateEvent) {
        channel.close();
    } else {
        super.userEventTriggered(ctx, evt);
    }
}

The way the server handles idle connections is very simple and crude: close the connection.

4.5 Summary of the improved heartbeat scheme

  1. Why are the timeout periods configured on the client and server inconsistent?

    Because the client has retry logic, it only considers the connection broken after n consecutive heartbeat failures, while the server disconnects directly, so the server is given a somewhat longer window. 60 × 3 < 200 reflects this. It also shows that although both sides are able to disconnect, the client, having initiated the connection, has the greater say in closing it.

  2. Why does the client detect read timeouts while the server detects read and write timeouts?

    Because the heartbeat is initiated by the client, a full round trip looks like: the client sends a ping, the server receives it, the server sends back a pong, and the client receives it. If any leg of that link breaks, the connection is unusable. So on the client side only read detection (waiting for the pong) is meaningful, while on the server side both read (the incoming ping) and write (the outgoing pong) matter.

You were the one who took the initiative to chase, so you should also be the one who takes the initiative to break up.

Implementing the heartbeat mechanism with IdleStateHandler is, one could say, very elegant. With the help of Netty’s idle detection, the client maintains a one-way heartbeat; after three consecutive failed heartbeat responses, it closes the connection and hands reconnection off to an asynchronous thread, since reconnection is essentially the client’s job. The server, after a long idle period, closes the connection to avoid pointless resource waste.

5 Comparison of heartbeat design schemes

A comparison of Dubbo’s existing scheme and the improved scheme:

  • Main design: the existing scheme starts two timed tasks; the improved scheme relies on IdleStateHandler, which uses eventLoop.schedule underneath.
  • Heartbeat direction: bidirectional in the existing scheme; one-way (client -> server) in the improved scheme.
  • Heartbeat failure determination: the existing scheme updates a timestamp mark on every successful heartbeat and periodically scans it with a timer, declaring failure once the heartbeat timeout is exceeded; the improved scheme counts consecutive heartbeat failures and declares failure once the count exceeds the threshold.
  • Scalability: the existing scheme’s custom timer adapts easily to Dubbo’s other communication layer implementations such as Mina and Grizzly; in the improved scheme each communication layer implements its own heartbeat, with no shared heartbeat abstraction.
  • Design complexity: the existing scheme has high coding complexity, a large amount of code, and is hard to maintain; the improved scheme needs little code and is easy to maintain.

I privately consulted Yu Chao (Flash), who is in charge of long connections at Meituan-Dianping, and found that the heartbeat scheme Meituan uses is almost identical to the improved scheme above. I believe this scheme is the standard implementation.

6 Suggested changes to Dubbo

Given that Dubbo has some other communication layer implementations, the existing logic of timing heartbeat delivery can be retained.

  • Suggested change 1:

The bidirectional heartbeat design is unnecessary. While staying compatible with the existing logic, let the client send a one-way heartbeat when the connection is idle, and let the server periodically check connection availability. Keep the timings as close as possible to: client timeout × 3 ≈ server timeout.

  • Suggested change 2:

Remove the scheduled reconnect/disconnect task. Dubbo can already tell whether a heartbeat request received a response, so it can borrow from the improved scheme: maintain a per-connection mark counting heartbeat failures, let any successful response clear the mark, and have the client initiate a reconnect after n consecutive heartbeat failures (see the sketch below). This saves one unnecessary timer; any polling scheme is inelegant.
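A minimal sketch of that per-connection mark (class name and threshold are my own assumptions, not Dubbo code):

import java.util.concurrent.atomic.AtomicInteger;

final class HeartbeatFailureMark {
    private static final int MAX_FAILURES = 3; // assumed threshold n

    private final AtomicInteger failures = new AtomicInteger();

    // Any successful response, heartbeat or normal call, clears the mark.
    void onSuccessfulResponse() {
        failures.set(0);
    }

    // Returns true when the caller should trigger a reconnect.
    boolean onHeartbeatFailure() {
        return failures.incrementAndGet() >= MAX_FAILURES;
    }
}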

Finally, on the topic of scalability. I would actually suggest leaving timers entirely to the Netty layer, using IdleStateHandler throughout, and letting the other communication layer components implement their own idle detection logic; Dubbo’s compatibility with Mina and Grizzly gave me pause. But ask yourself: in 2019, how many people still use Mina or Grizzly? Holding back optimizations for the mainstream because of features that will likely never be used is certainly not a good thing. Abstraction, features, and extensibility are not better the more you have: the human resources of an open source project are limited, and so is the comprehension of the framework’s users; a design that solves most people’s problems is a good design. Oh, I see, it’s just that I can’t use Mina or Grizzly and am too lazy to learn them.

Welcome to follow my WeChat official account, “Kirito technology sharing”: any questions about this article will be answered there, along with more Java-related technical content.