The kind of monitoring chart shown above will look familiar to anyone doing server-side development. In day-to-day operations, "service timeout" is probably one of the most frequently reported problems.

In microservice architectures in particular, a single request may travel along a long chain of service invocations before a result is returned. When a service timeout occurs, developers often have to analyze not only the performance of their own system but also that of the services it depends on, which is why timeouts are harder to troubleshoot than service errors or abnormal call volumes.

Starting from a real production incident, this article systematically explains how to correctly understand and set RPC interface timeouts under a microservice architecture, so that we can take a more global view when developing server-side interfaces. The content is divided into four parts:

1. Start with a production incident caused by an RPC interface timeout

2. How does the RPC framework implement timeouts?

3. What problem does the timeout setting solve?

4. How to set a reasonable timeout?


01 Start with a production incident

The incident happened in the recommendation module on the home page of an e-commerce APP. One day at noon, we suddenly received user feedback: apart from the banner and navigation area at the top of the APP home page, the recommendation module below had become a blank page (the recommendation module takes up two-thirds of the home page and shows a list of products recommended in real time by an algorithm based on user interests).

The business scenario above can be understood with the help of the following call chain:

1. The APP sends an HTTP request to the service gateway

2. The service gateway makes an RPC call to the recommendation service to obtain the list of recommended products

3. If step 2 fails, the gateway degrades: it makes an RPC call to the product ranking service to obtain a list of hot products as a fallback

4. If step 3 also fails, it degrades again and reads the hot product list directly from the Redis cache

At first glance, degradation strategies for both dependent services are in place. In theory, even if the recommendation service or the product ranking service fails, the server should still be able to return data to the APP. Yet the recommendation module on the APP did go blank, so the degradation strategies clearly did not take effect. The troubleshooting process is described in detail below.


1. Troubleshooting process

Step 1: Packet capture on the APP side showed that the HTTP request had timed out (the timeout was set to 5 seconds).

Step 2: The service gateway logs showed a large number of timeouts on the RPC calls to the recommendation service (the timeout was set to 3 seconds).

Step 3: The recommendation service logs showed that its Dubbo thread pool was exhausted.

These three steps basically located the problem in the recommendation service. Further investigation showed that the Redis cluster the recommendation service depends on had become unavailable, which caused the timeouts and in turn exhausted the thread pool. The detailed cause is not discussed here, as it is not relevant to the topic of this article.


2. Why the degradation strategy failed

Now for the second question: when the calls to the recommendation service failed, why did the service gateway's degradation strategy not take effect? In theory, shouldn't it have degraded to the product ranking service as a fallback?

The root cause was finally tracked down: the APP's timeout for calling the service gateway was 5 seconds, while the gateway's timeout for calling the recommendation service was 3 seconds, with 3 timeout retries configured on top of that. So by the time the second attempt (the first retry) against the recommendation service had also timed out, the APP's HTTP request had already exceeded its 5-second limit, and none of the gateway's degradation strategies ever got the chance to run.
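A quick back-of-the-envelope check makes the failure obvious. The sketch below is illustrative only: the constants come from the numbers quoted above, and it assumes the three retries are in addition to the first attempt.

// A minimal sanity check of the timing, using the numbers from this incident.
// The constants are illustrative, not taken from any real configuration file.
public class TimeoutBudgetCheck {
    public static void main(String[] args) {
        long appTimeoutMs = 5_000;   // APP -> service gateway HTTP timeout
        long rpcTimeoutMs = 3_000;   // gateway -> recommendation service RPC timeout
        int retries = 3;             // retries in addition to the first attempt

        // Worst case: every attempt burns the full RPC timeout before failing.
        long worstCaseMs = rpcTimeoutMs * (1 + retries);
        System.out.printf("Worst-case gateway time: %d ms, APP budget: %d ms%n",
                worstCaseMs, appTimeoutMs);

        // Even the first retry alone pushes the total to 6,000 ms, past the APP's
        // 5,000 ms budget, so the degradation path never gets a chance to run.
    }
}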

3. Solutions

1. Change the timeout for the service gateway's calls to the recommendation service to 800 ms (the recommendation service's TP99 is about 540 ms), and change the number of timeout retries to 2

2. Change the timeout for the service gateway's calls to the product ranking service to 600 ms (the ranking service's TP99 is about 400 ms), and change the number of timeout retries to 2

Setting timeouts and retry counts requires weighing the time consumed by every dependent service along the whole invocation chain, whether each service is a core service, and many other factors. I won't expand on that here; the approach is explained in detail in section 04.
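To make this concrete, here is a minimal sketch of what the fixed consumer-side settings could look like using Dubbo's API configuration (XML or annotation configuration would work just as well). The two service interfaces are placeholders invented for the example, not the real ones.

import com.alibaba.dubbo.config.ReferenceConfig;

// Sketch of the consumer-side settings used in the fix; the two interfaces
// below are made-up placeholders for the real services.
public class GatewayReferencesSketch {

    interface RecommendService { }
    interface RankingService { }

    void configure() {
        // Gateway -> recommendation service: 800 ms timeout (TP99 ~ 540 ms), 2 retries
        ReferenceConfig<RecommendService> recommend = new ReferenceConfig<>();
        recommend.setInterface(RecommendService.class);
        recommend.setTimeout(800);
        recommend.setRetries(2);

        // Gateway -> product ranking service: 600 ms timeout (TP99 ~ 400 ms), 2 retries
        ReferenceConfig<RankingService> ranking = new ReferenceConfig<>();
        ranking.setInterface(RankingService.class);
        ranking.setTimeout(600);
        ranking.setRetries(2);
    }
}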


02 How does the RPC framework implement timeouts?

Only by understanding how the RPC framework implements timeouts can we configure them properly. Whether it is Dubbo, Spring Cloud, or a home-grown microservice framework (such as JD.com's JSF), the timeout mechanism works in much the same way. Below, the Dubbo 2.8.4 source code is used as an example to look at the concrete implementation.

Those familiar with Dubbo know that the timeout can be configured in two places: the provider side and the consumer side. The provider-side timeout acts as the default fallback configuration: as long as the provider sets a timeout, consumers do not need to set one, because the value is passed to the consumer side through the registry. On the one hand this simplifies configuration; on the other, since the provider knows the performance of its own interfaces best, having the provider set the timeout is also reasonable.

Dubbo supports very fine-grained timeout settings: method level, interface level, and global. If multiple levels are configured at the same time, the priority from highest to lowest is: consumer method level > provider method level > consumer interface level > provider interface level > consumer global > provider global.
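As a rough illustration of these granularities, the sketch below uses Dubbo's API configuration with a made-up ProductService interface and findProducts method. With all three values set, the consumer's method-level timeout of 800 ms is the one that takes effect for findProducts.

import java.util.Collections;

import com.alibaba.dubbo.config.MethodConfig;
import com.alibaba.dubbo.config.ReferenceConfig;
import com.alibaba.dubbo.config.ServiceConfig;

// Illustration of provider-side vs consumer-side and interface-level vs
// method-level timeouts; ProductService and findProducts are made-up names.
public class TimeoutGranularitySketch {

    interface ProductService { }

    void configure(ProductService impl) {
        // Provider side, interface level: pushed to consumers through the registry
        // as the default when the consumer configures nothing.
        ServiceConfig<ProductService> service = new ServiceConfig<>();
        service.setInterface(ProductService.class);
        service.setRef(impl);
        service.setTimeout(3000);

        // Consumer side, interface level.
        ReferenceConfig<ProductService> reference = new ReferenceConfig<>();
        reference.setInterface(ProductService.class);
        reference.setTimeout(2000);

        // Consumer side, method level: highest priority, overrides everything above.
        MethodConfig findProducts = new MethodConfig();
        findProducts.setName("findProducts");
        findProducts.setTimeout(800);
        reference.setMethods(Collections.singletonList(findProducts));
    }
}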

Let's start with the provider-side timeout handling logic in the source code:

public class TimeoutFilter implements Filter {

    public TimeoutFilter() {
    }

    public Result invoke(...) throws RpcException {
        // Execute the real invocation and measure how long it takes
        long start = System.currentTimeMillis();
        Result result = invoker.invoke(invocation);
        long elapsed = System.currentTimeMillis() - start;
        // If the call took longer than the configured timeout, only print a log
        if (invoker.getUrl() != null && elapsed > timeout) {
            logger.warn("invoke timeout...");
        }
        return result;
    }
}

As you can see, even if the call times out, the provider only prints a WARN log. In other words, the provider-side timeout setting does not affect the actual invocation: the full processing logic is executed regardless of whether the timeout is exceeded.

Now take a look at the timeout handling logic on the consumer side:

public class FailoverClusterInvoker {

    public Result doInvoke(...) {
        ...
        for (int i = 0; i < retryTimes; ++i) {
            ...
            try {
                Result result = invoker.invoke(invocation);
                return result;
            } catch (RpcException e) {
                // Business exceptions are thrown directly and are not retried
                if (e.isBiz()) {
                    throw e;
                }
                le = e;
            } catch (Throwable e) {
                le = new RpcException(...);
            } finally {
                ...
            }
        }

        throw new RpcException("...");
    }
}

FailoverCluster is the cluster's default fault-tolerance strategy: when a call fails, it switches to another provider and tries again. Looking at the doInvoke method, when a call fails it first checks whether the exception is a business exception; if so, it stops retrying, otherwise it keeps retrying until the retry limit is reached.

Following invoker.invoke further down, you can see that after the request is sent, the result is obtained via the Future's get method:

public Object get(int timeout) {
    if (timeout <= 0) {
        timeout = 1000;
    }

    if (!isDone()) {
        long start = System.currentTimeMillis();
        this.lock.lock();

        try {
            // Loop until the response arrives or the timeout is exceeded
            while (!isDone()) {
                done.await((long) timeout, TimeUnit.MILLISECONDS);
                long elapsed = System.currentTimeMillis() - start;
                if (isDone() || elapsed > (long) timeout) {
                    break;
                }
            }
        } catch (InterruptedException var8) {
            throw new RuntimeException(var8);
        } finally {
            this.lock.unlock();
        }

        if (!isDone()) {
            // No response within the timeout: throw a TimeoutException
            throw new TimeoutException(...);
        }
    }

    return returnFromResponse();
}

Timing starts as soon as the method is entered, and if no result is returned within the specified timeout, a TimeoutException is thrown. So the consumer-side timeout logic is controlled by two parameters: the timeout and the number of retries. Failures such as network exceptions and response timeouts are retried until the retry limit is reached.


03 What problem does the timeout setting solve?

What problem is the RPC framework's timeout-and-retry mechanism actually designed to solve? From the macro perspective of a microservice architecture, it provides framework-level fault tolerance that keeps service links stable. How should we think about it at the micro level? The following concrete cases make it clearer:

1. If the consumer sets no timeout when calling the provider, the consumer's response time will be at least as long as the provider's. When the provider's performance deteriorates, the consumer's performance suffers with it, because the consumer has to wait indefinitely for the provider to return. If the whole invocation chain passes through services A, B, C, and D, a slowdown in D alone propagates upward through C and B to A, eventually causing the whole chain to time out or even collapse. A timeout is therefore necessary.

2. Suppose the consumer is the core product service and the provider is the non-core review service. When the review service runs into performance problems, the product service can accept not returning review data in order to guarantee that it can keep serving requests. In this case a timeout must be set, so that once the review service exceeds that threshold, the product service does not have to keep waiting for it.

3. The provider may time out because of network jitter or a temporarily overloaded machine. If the consumer simply gives up after the timeout, business may be lost in some scenarios (for example, an inventory interface timing out causes an order to fail). Temporary jitter can often be ridden out with a single retry after the timeout, so a retry mechanism is necessary as well.

Introducing timeouts and retries is not all upside, though; they also have side effects, which are among the most important, and most easily overlooked, considerations when developing RPC interfaces:

1. Repeated requests: the provider may have already finished executing while the consumer, due to network jitter, judges the call to have timed out. In that case the retry mechanism produces duplicate requests and may create dirty data, so the provider must make the interface idempotent.

2. Increased load on the consumer: if the provider is not suffering temporary jitter but has a genuine performance problem, retries will not succeed and will only increase the consumer's average response time. For example, if the provider's average response time is normally 1 second and the consumer sets the timeout to 1.5 seconds with 2 retries, a single request can take 3 seconds, dragging down the consumer's overall throughput. If the consumer is itself a high-QPS service, this can trigger a chain reaction and even an avalanche.

3. Retry storms: suppose an invocation chain passes through four services and the bottom service D starts timing out. Each upstream service then retries; with the retry count set to 3, B faces 3 times its normal traffic, C 9 times, and D 27 times, and the whole service cluster may avalanche as a result.
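A tiny sketch of that amplification, using the same assumption as above that each hop ends up making three attempts against the layer below it; the numbers are illustrative only.

// Rough estimate of retry amplification on a chain A -> B -> C -> D when the
// bottom service D times out; "3 attempts per hop" matches the numbers above.
public class RetryStormEstimate {
    public static void main(String[] args) {
        int attemptsPerHop = 3;
        String[] downstream = {"B", "C", "D"};

        long multiplier = 1;
        for (String service : downstream) {
            multiplier *= attemptsPerHop;
            System.out.printf("Traffic reaching %s: %dx normal%n", service, multiplier);
        }
        // Prints 3x, 9x, 27x: every extra layer multiplies the load on the one below it.
    }
}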


04 How to set a reasonable timeout?

Once you understand how the RPC framework implements timeouts and what side effects they can bring, you can set them along the following lines:

1. Before setting the caller's timeout, first find out the dependent service's TP99 response time (if the dependency's performance fluctuates a lot, look at TP95 instead); the caller's timeout can then be set roughly 50% above that value

2. If the RPC framework supports multi-granularity timeouts, the global timeout should be slightly longer than the longest interface-level timeout, each interface-level timeout slightly longer than the longest method-level timeout under that interface, and each method-level timeout slightly longer than the method's actual execution time

3. Distinguish between calls that can be retried and calls that cannot. If an interface is not idempotent, retries must not be configured. Note that read interfaces are naturally idempotent, while write interfaces can rely on a business document ID, or on a unique ID generated by the caller and passed to the provider, to filter out duplicate requests and keep dirty data from being introduced (see the sketch after this list)

4. If the RPC framework supports provider-side timeout settings, configure them on the provider as well, following the previous three rules. That way the configuration stays reasonable even when the consumer sets nothing, which reduces risk

5. If the availability requirement is not that high (for example, an internal back-office system), you can skip automatic timeout retries and retry manually instead, which keeps the interface implementation simpler and easier to maintain

6. The more retries, the higher the availability and the lower the business loss, but also the greater the performance risk. Decide how many retries are appropriate on this basis (generally 2, at most 3)

7. If the caller is a high-QPS service, degradation and circuit-breaking strategies for the timeout case must be considered (for example, if more than 10% of requests fail, stop retrying and trip the circuit breaker: switch to another service, fall back to an asynchronous MQ mechanism, or serve the caller's cached data)
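To illustrate the idempotency idea from point 3, here is a minimal sketch of a provider-side guard keyed on a caller-supplied unique request ID. Everything in it is hypothetical: the class and method names are made up, and a real implementation would typically deduplicate with Redis SETNX or a unique database constraint rather than an in-memory set.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical provider-side guard: the caller passes a unique request ID
// (for example an order number) and replayed retries are silently ignored.
public class IdempotentOrderService {

    private final Set<String> processedRequestIds = ConcurrentHashMap.newKeySet();

    public boolean createOrder(String requestId, String productId, int quantity) {
        // First writer wins; a retried duplicate returns without creating dirty data.
        if (!processedRequestIds.add(requestId)) {
            return true; // already handled, report success to the retrying caller
        }
        // ... actual order-creation logic goes here ...
        return true;
    }
}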

Finally, a brief summary:

The timeout setting of RPC interfaces looks simple, but there is a lot behind it. It involves many technical issues (such as interface idempotency, service degradation and circuit breaking, performance evaluation and optimization) and also requires judging the necessity from a business perspective. Hopefully this gives you a more global view when developing RPC interfaces.


About the author: master's degree from a 985 university, former Amazon engineer, now a technical director at 58.com

Welcome to follow my personal WeChat official account: IT Career Advancement