Preface
Because the network is unreliable, requests fail from time to time. The usual remedy is a retry mechanism: when a request fails, send it again so that it eventually succeeds, improving the stability of the service.
Risk of retry
In practice, though, retries should not be added lightly, because they often introduce new risks. Too many retries, for example, put extra pressure on the called service and magnify the original problem.
As shown in the figure below, service A calls service B, and service B in turn calls service C or service D depending on the request. When service C fails and becomes unavailable, every request from B to C times out, while service D is still healthy. But the heavy retrying in service A quickly drives up the load on service B and can exhaust it (for example by filling its connection pool). At that point the branch that calls service D becomes unavailable as well, because B is so overloaded with retried requests that it cannot process anything else.

There is also the case where the target service itself is fine, but high latency, jitter, or packet loss on the network causes the request (or its response) to time out. If the client retries at that point, the receiver is likely to see several identical requests, so the server also needs idempotent handling to guarantee a consistent result across duplicate requests.
If retrying is that risky, should we simply not retry at all and let a failure stay a failure?
Retrying failures at different stages
Whether to retry depends on why the current request failed; it is never a blanket yes or no. The network is complex, the path is long, and the appropriate retry policy differs from protocol to protocol.
Retry over HTTP
A basic HTTP request consists of the following phases:
- DNS resolution
- TCP three-way handshake
- Sending and receiving data with the peer
During DNS resolution, if the domain name does not exist or has no DNS record, no host addresses can be resolved and the request cannot even be sent. Retrying here is meaningless, so there is no need to retry.
The same goes for the TCP handshake phase: if the connection cannot even be established in the very first step of the request, the host is most likely unavailable, and retrying is pointless.
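As a rough sketch (not taken from any particular library), these early failures surface as well-known Java exception types that a client can simply treat as non-retryable:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.UnknownHostException;

public final class ConnectionPhasePolicy {
    // Sketch: failures that happen before any request data is sent
    // are generally not worth retrying.
    public static boolean shouldRetry(IOException failure) {
        if (failure instanceof UnknownHostException) {
            return false; // DNS could not resolve the host at all
        }
        if (failure instanceof ConnectException) {
            return false; // TCP handshake failed; the host is most likely down
        }
        return true; // failures after the connection is up need more context
    }
}
```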
After surviving the DNS and the handshake, it’s time to send and receive data. Once a failure occurs at this point, there are more factors to consider when trying again.
Consider the case shown in the figure below: network congestion delays the data for a long time, but the server eventually receives the complete packet and starts processing it, while the client has already abandoned the request because of a timeout. If the client retries, the server receives and processes the same request twice, which can have serious consequences.
So a request that has been "sent successfully" is not a good candidate for retrying. The question is: how do you know whether it was sent successfully? Is Socket.write successful as long as it reports no error? Is SocketChannel.write successful once the buffer is drained?
An application-layer socket write only copies data into the kernel's send buffer (SND buffer); the operating system gives no guarantee about when, or whether, that data actually goes out on the network. Blocking versus non-blocking only affects the write call itself: a blocking write waits when the SND buffer is too full to accept more data, a non-blocking one returns immediately.
So even if socket.write succeeds and the application's buffer has been fully written out, that alone does not tell you whether the peer actually received the message.
Now consider another case: by the time the data is sent, the peer has already closed the socket, so an RST is returned:
This is a good case for retrying: the server has not started processing the request, so resending it over a new connection only improves availability and adds no extra burden.
The HTTP protocol also has some semantic conventions for Request Method:
| GET | POST | PUT | DELETE |
|---|---|---|---|
| List the URIs of the resources in the group and, optionally, the details of each resource. | Create/append a new resource in the group; usually returns the URL of the new resource. | Replace the current group of resources with the given set. | Delete the entire group of resources. |
| Safe (and idempotent) | Non-idempotent | Idempotent | Idempotent |
PUT/DELETE are idempotent operations, so even if the same request is processed more than once, no duplicate data results. POST is different: its semantics are create/append, a non-idempotent operation.
Back to the retry problem above: if the request was sent successfully but the response timed out, and the API method is DELETE, a retry can be considered because DELETE is semantically idempotent; the same goes for the other semantically idempotent methods.
POST is not an option because it is semantically non-idempotent, and retries are likely to result in repeated processing of requests
But… is everything really that tidy? How many APIs strictly adhere to these semantics? Relying on semantic conventions alone is therefore not safe; you need to know whether the server-side interface is actually idempotent before you can consider retrying.
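As a minimal sketch of deciding by method semantics (and, as just noted, this is only safe when the server really honours those semantics):

```java
import java.util.Set;

public final class IdempotentMethods {
    // Sketch: only semantically idempotent methods are retry candidates.
    private static final Set<String> IDEMPOTENT =
            Set.of("GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE");

    public static boolean mayRetry(String httpMethod) {
        return IDEMPOTENT.contains(httpMethod.toUpperCase());
    }
}
```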
Retry in HTTPS
HTTPS has been around for many years, but only recently has it become truly ubiquitous. Browsers now flag sites that have not upgraded as "not secure", and Web APIs exposed to the public network are essentially all HTTPS-based.
With HTTPS, the retry policy changes again. The figure above shows the HTTPS handshake process: after the TCP connection is established, the SSL/TLS handshake runs, the peer's certificate is verified, and temporary symmetric keys are negotiated.
If a failure occurs during the SSL handshake, such as an expired certificate or an untrusted certificate, there is no need to retry. Because this problem is not transient, it will fail for a long time once it occurs, and retry will fail as well.
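In code, such failures show up as SSLException (and its subclasses), so a client can simply fail fast on them; a sketch:

```java
import java.io.IOException;
import javax.net.ssl.SSLException;

public final class SslRetryPolicy {
    // Sketch: SSL/TLS handshake failures (expired or untrusted certificates,
    // protocol mismatches) are not transient, so never retry them.
    public static boolean shouldRetry(IOException failure) {
        return !(failure instanceof SSLException);
    }
}
```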
Retry mechanism in RPC framework
Having covered retry considerations under the HTTP(S) protocol, let's look at how mainstream libraries handle retries and see whether the mechanisms in popular open source projects are "reasonable" enough.
Apache HttpClient retry mechanism (V4.x)
Apache HttpClient is one of the most popular HTTP libraries in Java. The JDK ships only a very basic HTTP client, too bare-bones to use directly; Apache HttpClient (Apache HC) fills this gap with a powerful, easy-to-use SDK whose components are all customizable.
Apache HC's default retry strategy lives in org.apache.http.impl.client.DefaultHttpRequestRetryHandler; let's look at the implementation (some less important code omitted):
// Return true to retry the request, false to give up
@Override
public boolean retryRequest(
        final IOException exception,
        final int executionCount,
        final HttpContext context) {
    // Check whether the number of attempts has reached the upper limit
    if (executionCount > this.retryCount) {
        // Do not retry if over max retry count
        return false;
    }
    // Exceptions in this set are never retried
    if (this.nonRetriableClasses.contains(exception.getClass())) {
        return false;
    }
    final HttpClientContext clientContext = HttpClientContext.adapt(context);
    final HttpRequest request = clientContext.getRequest();
    // Retry if the request method is considered idempotent
    if (handleAsIdempotent(request)) {
        return true;
    }
    // Retry if the request has not been fully sent, or if retrying
    // already-sent requests is explicitly enabled
    if (!clientContext.isRequestSent() || this.requestSentRetryEnabled) {
        return true;
    }
    // Otherwise do not retry
    return false;
}
A brief summary of Apache HC retry policy:
- Check whether the number of retries exceeds the maximum number (3 by default)
- Determine which exceptions do not need retries
- UnknownHostException – Failed to find the host
- ConnectException – TCP handshake failed
- SSLException – SSL handshake failed
- InterruptedIOException (ConnectTimeoutException/SocketTimeoutException) – connect timeout or socket read timeout (which can roughly be treated as a response timeout)
- Check if it is an idempotent request. Idempotent requests can be retried
- Check whether the request packet is sent successfully. If the request packet is not sent successfully, retry
- Retries are issued immediately, with no interval between attempts
The default retry policy in Apache HC matches the "reasonable" policy described in the previous section almost exactly. This mainstream open source project really is excellent: high quality, designed to the standards, and its source code makes rewarding study material.
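For reference, a minimal sketch of wiring this handler into an HttpClient 4.x instance (the retry count and the requestSentRetryEnabled flag here are illustrative values):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClients;

public class RetryClientExample {
    public static void main(String[] args) {
        // At most 3 retries; false = do not retry requests whose body has
        // already been sent (the cautious behaviour discussed above)
        CloseableHttpClient client = HttpClients.custom()
                .setRetryHandler(new DefaultHttpRequestRetryHandler(3, false))
                .build();
        // ... execute requests with client, then close it
    }
}
```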
Retry mechanism for Dubbo (v2.6.x)
Dubbo's retry mechanism lives in com.alibaba.dubbo.rpc.cluster.support.FailoverClusterInvoker (from 2.7 onward the package is renamed to org.apache.dubbo):
public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    List<Invoker<T>> copyinvokers = invokers;
    // Configured retry count (default 2) + 1 = total number of invocations
    int len = getUrl().getMethodParameter(invocation.getMethodName(), Constants.RETRIES_KEY, Constants.DEFAULT_RETRIES) + 1;
    RpcException le = null; // last exception
    List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyinvokers.size()); // already-tried invokers
    Set<String> providers = new HashSet<String>(len);
    for (int i = 0; i < len; i++) {
        // Failover: select the next provider, skipping those already tried
        Invoker<T> invoker = select(loadbalance, invocation, copyinvokers, invoked);
        invoked.add(invoker);
        RpcContext.getContext().setInvokers((List) invoked);
        try {
            Result result = invoker.invoke(invocation);
            if (le != null && logger.isWarnEnabled()) {
                logger.warn("Although retry the method " + invocation.getMethodName()
                        + " in the service " + getInterface().getName()
                        + " was successful by the provider " + invoker.getUrl().getAddress()
                        + ", but there have been failed providers " + providers
                        + " (" + providers.size() + "/" + copyinvokers.size()
                        + ") from the registry " + directory.getUrl().getAddress()
                        + " on the consumer " + NetUtils.getLocalHost()
                        + " using the dubbo version " + Version.getVersion()
                        + ". Last error is: " + le.getMessage(), le);
            }
            return result;
        } catch (RpcException e) {
            // A biz exception is thrown as-is and never retried;
            // non-biz RpcExceptions are recorded and trigger the next attempt
            if (e.isBiz()) {
                throw e;
            }
            le = e;
        } catch (Throwable e) {
            le = new RpcException(e.getMessage(), e);
        } finally {
            providers.add(invoker.getUrl().getAddress());
        }
    }
    // All attempts failed: surface the last exception
    // (the real code wraps it in a new RpcException with a detailed message)
    throw le;
}
As the code shows, only RpcExceptions that are not of the biz type trigger retries. We could keep parsing the code to see exactly which scenarios raise them, but rather than pasting more code, here is the answer directly.
A quick summary of the retry strategy in Dubbo:
- By default there are 3 invocations in total (the original request plus 2 retries); retries only happen when this total is greater than 1
- The default cluster strategy is Failover, so a retry never goes back to the node that just failed; the next node is picked from the available nodes after routing and load balancing
- TCP handshake timeout triggers retry
- A response timeout triggers a retry
- If a response cannot be matched to its request (for example because of a corrupted packet), the corresponding Future eventually times out, and that timeout triggers a retry
- Exceptions thrown by the provider itself count as successful calls and are not retried
Dubbo's retry strategy is somewhat aggressive, nowhere near as cautious as Apache HC's. When using Dubbo, configure retries carefully to avoid retrying against services that are not idempotent; if your provider is not idempotent, it is best to set the retry count to 0.
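For example, in Dubbo 2.6.x the retry count can be set per reference. A sketch (OrderService is a placeholder interface; the XML equivalent is `<dubbo:reference retries="0" />`):

```java
import com.alibaba.dubbo.config.annotation.Reference;

public class OrderClient {
    interface OrderService { String createOrder(String payload); } // placeholder service interface

    // retries = 0 disables failover retries for this reference,
    // so a failed call fails fast instead of hitting another provider
    @Reference(retries = 0)
    private OrderService orderService;
}
```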
Feign’s Retry mechanism (V11.1)
Feign is a simple Java HTTP client and the RPC framework recommended in Spring Cloud. Although Feign is also an HTTP client, it is quite different from libraries such as Apache HC.
Besides the JDK's built-in HTTP client, Feign also supports HTTP libraries such as Apache HC, Google HTTP Client, and OkHttp, and it adds an encoder/decoder abstraction on top. So it is not really a basic HTTP client, but more of an "HTTP toolkit", or a base abstraction for RPC.
What about Feign's retry strategy? That is a harder question, because there are several cases to distinguish, and the behaviour differs depending on which client Feign is configured with.
First, Feign has a built-in retry policy, as shown in the figure below. Feign retries are performed outside of the HttpClient call, and each retry is preceded by a certain interval.
By default, the maximum number of attempts (including the first request) is five. Before each retry Feign sleeps for an interval that grows with the number of attempts; the interval is calculated as follows:
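A sketch of that calculation, based on Feign's built-in Retryer.Default (where period defaults to 100 ms, maxPeriod to 1000 ms, and attempt starts at 1):

```java
public final class FeignBackoffSketch {
    // Sketch of Retryer.Default's backoff: the interval grows by a factor
    // of 1.5 per attempt and is capped at maxPeriod.
    static long nextMaxInterval(long period, long maxPeriod, int attempt) {
        long interval = (long) (period * Math.pow(1.5, attempt - 1));
        return interval > maxPeriod ? maxPeriod : interval;
    }
}
```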
As shown in the figure below, the larger the number of retries, the longer the retry interval
Note, however, that this retry sits outside the HttpClient call. That is not a problem if you stick with Feign's default JDK HTTP client, which is simple and has no retry mechanism of its own; Feign's retry alone is enough.
It is a different story with a third-party HTTP client (such as Apache HC), which often has its own internal retry mechanism.
If the third-party HttpClient retries and Feign retries as well, you effectively get two layers of retries, and the total number of attempts multiplies.
For example, Apache HC defaults to 3 retries, as described earlier, and Feign defaults to 5 attempts, so in the worst case there could be up to 15 requests.
And that is only Feign in its basic usage. Things get more complicated in Spring Cloud with load balancers such as Ribbon; how Feign behaves there is worth exploring on its own.
Conclusion
Retrying may look simple, but safe and stable retries require weighing many factors. Whether to retry, and how many times, should be decided from the current business scenario and context, not bolted on at the drop of a hat: brute-force retries tend to magnify problems and lead to more serious consequences.
If you are not sure it is safe to retry, don't: disable retries in these frameworks. Failing fast is better than letting the problem escalate.
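As a sketch of what "disable retry" looks like in the libraries discussed above (Dubbo's retries = 0 was shown earlier; PingApi below is a placeholder Feign interface):

```java
import feign.Feign;
import feign.RequestLine;
import feign.Retryer;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class DisableRetryExamples {
    interface PingApi {
        @RequestLine("GET /ping")
        String ping();
    }

    public static void main(String[] args) {
        // Apache HttpClient 4.x: switch off the built-in retry handler entirely
        CloseableHttpClient client = HttpClients.custom()
                .disableAutomaticRetries()
                .build();

        // Feign: replace the default Retryer with one that never retries
        PingApi api = Feign.builder()
                .retryer(Retryer.NEVER_RETRY)
                .target(PingApi.class, "http://example.com");
    }
}
```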
References
- How to retry gracefully – InfoQ
- Apache Dubbo – Github
- Feign – Github
- Apache HttpClient – Github