
Preface

In network requests, the network is unreliable and requests often fail. The usual solution is to add a retry mechanism: after a request fails, send it again, trying to ensure it eventually succeeds and thereby improving the stability of the service.

The risks of retrying

However, retrying is not something to do lightly, because it often introduces greater risk. Excessive retries, for example, can put even more pressure on the called service and amplify an existing problem.

As shown in the figure below, service A calls service B, which, depending on the requested data, calls service C and service D. If service C fails and becomes unavailable, every request from service B to service C will time out, even though service D is still healthy. Because service A retries heavily, the load on service B rises rapidly and its resources (such as its connection pool) are exhausted very quickly. Now the branch that calls service D becomes unavailable as well, because service B is overwhelmed with retry requests and can no longer process anything.

Even if the service itself is available, heavy latency, jitter, or packet loss in the network can cause the request to time out on its way to the target service or on its way back. If the client retries at that point, the receiver will most likely get multiple identical requests. The server therefore also needs idempotent handling, to guarantee that multiple identical requests produce the same result.
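
To make "idempotent handling" concrete, here is a minimal server-side sketch; the request-ID scheme and all names are illustrative, not from any particular framework:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative server-side idempotency: the client attaches a unique request ID,
// and duplicate requests get the cached result instead of being processed again.
public class IdempotentHandler {
    private final ConcurrentMap<String, String> processed = new ConcurrentHashMap<>();

    public String handle(String requestId, String payload) {
        // computeIfAbsent runs the business logic at most once per request ID
        return processed.computeIfAbsent(requestId, id -> process(payload));
    }

    private String process(String payload) {
        return "result-of-" + payload; // placeholder for real business logic
    }
}
```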

So if retrying is that risky, should we simply never retry, and just let a failure be a failure?

Retrying failures at different stages

Whether to retry should depend on the cause of the current failure, not on a blanket yes-or-no rule. Networks are complex, request paths are long, and different protocols call for different retry strategies.

Retry under the HTTP protocol

A basic HTTP request consists of the following phases (sketched in code right after this list):

  1. DNS resolution
  2. TCP three-way handshake
  3. Send the request & receive the response from the peer

During the DNS resolution phase, if the domain name does not exist or has no DNS record, it cannot be resolved to a list of host addresses and the request cannot be made at all. Retrying at this point is pointless, so there is no need to retry.

During the TCP handshake phase, if the target service is unavailable, retrying is also pointless: when the very first step of the request, the handshake, fails, the host is almost certainly unreachable.
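
A minimal sketch of this classification, covering only the connection-phase failures discussed above:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.UnknownHostException;

public class FailureClassifier {
    // Connection-phase failures that retrying cannot fix
    static boolean isWorthRetrying(IOException e) {
        if (e instanceof UnknownHostException) {
            return false; // DNS could not resolve the host; a retry resolves nothing
        }
        if (e instanceof ConnectException) {
            return false; // TCP handshake refused/failed; the host is most likely down
        }
        return true; // other I/O errors may be transient
    }
}
```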

After DNS and the handshake, we finally reach the stage of sending and receiving data. Once a failure occurs at this stage, there are even more factors to consider when deciding whether to retry.

Consider the case shown in the figure below: due to network congestion or similar causes, the data takes too long to reach the server. The server eventually receives the complete message and starts processing the request, but by then the client has already abandoned the request because of a timeout. If the client opens a new TCP connection and retries at this moment, the server will receive the same request message twice and process it twice, which can have serious consequences.

So if the request has already been sent successfully, retrying is not appropriate.

The question is: how do we know whether the request was sent successfully? Is Socket.write successful as long as it throws no error? After SocketChannel.write, if the buffer has been fully drained, does that mean success?

It is not that simple. A socket write at the application layer only copies data into the kernel's send buffer (SND buffer); there is no guarantee of when the operating system will actually push that data onto the network. Blocking versus non-blocking also only describes the socket.write call itself: a blocking write blocks when the SND buffer is full and no more data can be copied into it.

However, we can roughly assume that if socket.write succeeds and the application-layer buffer has been fully drained, the request has been sent successfully.
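
A sketch of that "roughly sent" heuristic with Java NIO; the comments note what a successful write actually guarantees:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class SendHeuristic {
    // Drain the application-layer buffer into the kernel SND buffer.
    static void writeFully(SocketChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer); // may write 0 bytes on a non-blocking channel
        }
        // The buffer is empty, so we *roughly* treat the request as sent --
        // but the OS may still be holding the bytes in its SND buffer,
        // and nothing here proves the peer ever received them.
    }
}
```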

Now let’s look at another case, in which the peer closes the socket directly while the data is being sent and returns an RST flag:

This case is a good candidate for retrying. Since the server has not yet processed the request, retrying (reconnecting and resending the request) only improves availability and adds no extra burden.


The HTTP protocol also defines semantic conventions for request methods:

| Method | Semantics | Safe / Idempotent |
| --- | --- | --- |
| GET | Lists the URIs and, optionally, other details of the resources in a collection | Safe and idempotent |
| POST | Creates/appends a new resource in the collection; often returns the URL of the new resource | Neither |
| PUT | Replaces the current collection with a given set of resources | Idempotent |
| DELETE | Deletes the whole collection | Idempotent |

PUT and DELETE are idempotent, so the same request message can be processed repeatedly without duplicating data or similar side effects. POST is not: its semantics are create/append, making it a non-idempotent request type.

Back to the retry problem above: if the request message was sent successfully but the response timed out, and the requested API method is DELETE, a retry can be considered, because DELETE is semantically idempotent. The same goes for GET and PUT: semantically idempotent methods are candidates for retrying.

POST, however, cannot be retried this way: it is semantically non-idempotent, and a retry is likely to get the request processed twice.

But... is reality really that tidy? How many APIs actually implement their semantics correctly? Relying on semantic conventions alone is therefore not safe: you must know for certain that the server-side interface is idempotent before you can even consider retrying.
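
Putting the two conditions together, a retry gate might look like the following sketch; the helper and flag names are made up for illustration:

```java
import java.util.Set;

public class RetryGate {
    // Methods RFC 7231 defines as idempotent; whether your API honors that is another matter
    private static final Set<String> IDEMPOTENT =
            Set.of("GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE");

    // serverIsIdempotent is knowledge you must obtain out of band --
    // the method name alone is not proof.
    static boolean mayRetryAfterFullSend(String method, boolean serverIsIdempotent) {
        return IDEMPOTENT.contains(method.toUpperCase()) && serverIsIdempotent;
    }
}
```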

Retry under HTTPS

After years on the sidelines, HTTPS has finally become ubiquitous over the last few years. Sites that have not upgraded are flagged as "not secure" by browsers, and almost all Web APIs exposed to the public Internet now use HTTPS.

In HTTPS, the retry policy changes somewhat:

The diagram above shows the HTTPS handshake process. After the TCP connection is established, the SSL handshake is performed: the peer's certificate is verified, a temporary symmetric key is generated, and so on.

If a failure occurs during the SSL handshake phase, such as an expired or untrusted certificate, there is no point in retrying at all. Problems like these are not transient; once they occur they are long-lived failures, and a retry will fail just the same.
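
A small sketch of failing fast on SSL errors; expired.badssl.com is a public test endpoint with a deliberately expired certificate:

```java
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLException;

public class SslFailFast {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://expired.badssl.com/");
        try {
            HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
            conn.getResponseCode();
        } catch (SSLException e) {
            // An expired or untrusted certificate will not fix itself on the
            // next attempt -- fail fast instead of retrying.
            System.err.println("SSL handshake failed, not retrying: " + e.getMessage());
        }
    }
}
```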

Retry mechanisms in mainstream network libraries & RPC frameworks

Having covered the retry considerations under the HTTP(S) protocol, let's look at how the mainstream networking libraries and RPC frameworks handle retries, and check whether these major open source projects implement the "reasonable" mechanism described above.

Apache HttpClient’s retry mechanism (v4.x)

Apache HttpClient is one of the most popular HTTP libraries in backend Java. The JDK does provide a basic HTTP SDK, but it is too bare-bones to use directly. Apache HttpClient (Apache HC) fills this gap with an SDK that is powerful, easy to use, and customizable in every component.

Apache HC's default retry strategy is implemented in org.apache.http.impl.client.DefaultHttpRequestRetryHandler. Let's look at the implementation (with some unimportant code omitted):

```java
// Returns true to indicate that a retry is required
@Override
public boolean retryRequest(
        final IOException exception,
        final int executionCount,
        final HttpContext context) {
    if (executionCount > this.retryCount) {
        // Do not retry if over max retry count
        return false;
    }
    // Exceptions for which a retry is pointless
    if (this.nonRetriableClasses.contains(exception.getClass())) {
        return false;
    }
    final HttpClientContext clientContext = HttpClientContext.adapt(context);
    final HttpRequest request = clientContext.getRequest();
    if (handleAsIdempotent(request)) {
        // Retry if the request is considered idempotent
        return true;
    }
    if (!clientContext.isRequestSent() || this.requestSentRetryEnabled) {
        // Retry if the request has not been sent fully or
        // if it's OK to retry methods that have been sent
        return true;
    }
    // otherwise do not retry
    return false;
}
```

A brief summary of Apache HC’s retry policy:

  1. Check whether the retry count has exceeded the maximum (default 3); if so, do not retry
  2. Check for exceptions that should never be retried:

    1. UnknownHostException – the host cannot be resolved
    2. ConnectException – the TCP handshake failed
    3. SSLException – the SSL handshake failed
    4. InterruptedIOException (ConnectTimeoutException/SocketTimeoutException) – handshake timeout or socket read timeout (which can roughly be treated as a response timeout)
  3. An idempotent request can be retried
  4. Check whether the request message has been sent; if it has not been fully sent, it can be retried
  5. Retries are issued immediately, with no interval between attempts

The default retry policy in Apache HC turns out to match the "reasonable" policy we laid out in the previous section almost exactly. This mainstream open source project really is excellent: the quality is high and the design follows the standards, so using its source code as study material pays off handsomely.
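
For reference, a sketch of wiring the handler in explicitly when building a client; these arguments reproduce the default behavior (3 retries, no retry once a non-idempotent request has been fully sent):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClients;

public class HcRetryConfig {
    public static CloseableHttpClient client() {
        // retryCount = 3, requestSentRetryEnabled = false
        return HttpClients.custom()
                .setRetryHandler(new DefaultHttpRequestRetryHandler(3, false))
                .build();
    }
}
```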

Dubbo’s retry mechanism (v2.6.x)

Dubbo's retry mechanism lives in com.alibaba.dubbo.rpc.cluster.support.FailoverClusterInvoker (from 2.7 onward, the package was renamed to org.apache.dubbo):

```java
public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers,
                       LoadBalance loadbalance) throws RpcException {
    List<Invoker<T>> copyinvokers = invokers;
    checkInvokers(copyinvokers, invocation);
    // Read the configured retry count; total attempts = retries + 1 (the first call)
    int len = getUrl().getMethodParameter(invocation.getMethodName(),
            Constants.RETRIES_KEY, Constants.DEFAULT_RETRIES) + 1;
    if (len <= 0) {
        len = 1;
    }
    RpcException le = null; // last exception
    List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyinvokers.size());
    Set<String> providers = new HashSet<String>(len);
    for (int i = 0; i < len; i++) {
        // (re-listing of available invokers before each retry omitted)
        // Failover: select the next invoker, skipping nodes already tried
        Invoker<T> invoker = select(loadbalance, invocation, copyinvokers, invoked);
        invoked.add(invoker);
        RpcContext.getContext().setInvokers((List) invoked);
        try {
            Result result = invoker.invoke(invocation);
            if (le != null && logger.isWarnEnabled()) {
                logger.warn("Although retry the method " + invocation.getMethodName()
                        + " in the service " + getInterface().getName()
                        + " was successful by the provider " + invoker.getUrl().getAddress()
                        + ", but there have been failed providers " + providers
                        + ". Last error is: " + le.getMessage(), le);
            }
            return result;
        } catch (RpcException e) {
            if (e.isBiz()) { // business exceptions are thrown directly, no retry
                throw e;
            }
            le = e;
        } catch (Throwable e) {
            le = new RpcException(e.getMessage(), e);
        } finally {
            providers.add(invoker.getUrl().getAddress());
        }
    }
    // All attempts failed
    throw new RpcException("Failed to invoke the method " + invocation.getMethodName()
            + ". Tried " + len + " times.", le);
}
```

As the code shows, only RpcExceptions that are not business ("biz") exceptions trigger a retry. We could keep digging through the code to see exactly which scenarios produce those... but let's skip the code-reading and jump straight to the answer!

To summarize the retry policy in Dubbo:

  1. By default a request is attempted at most 3 times (including the first call); retry only kicks in when the configured total is greater than 1
  2. The default strategy is Failover, so a retry never goes back to the current node; it moves on to the next one (chosen from the available nodes via routing and load balancing)
  3. A TCP handshake timeout triggers a retry
  4. A response timeout triggers a retry
  5. A corrupted message, or any other error that prevents the response from being matched to its request, also makes the Future time out, which in turn triggers a retry
  6. Exceptions returned by the server (such as those thrown by the provider's code) count as successful responses and are not retried

Dubbo's retry policy is somewhat aggressive, not as cautious as Apache HC's... So when using Dubbo, configure retries carefully to avoid retrying against services that do not support idempotency. If your provider is not idempotent, it is best to set the retry count to 0.
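
A minimal sketch with the 2.6.x configuration API; DemoService is an illustrative interface, and the XML equivalent is `<dubbo:reference interface="..." retries="0" />`:

```java
import com.alibaba.dubbo.config.ReferenceConfig;

interface DemoService { // illustrative provider interface
    String hello(String name);
}

public class NoRetryReference {
    public static ReferenceConfig<DemoService> build() {
        ReferenceConfig<DemoService> reference = new ReferenceConfig<>();
        reference.setInterface(DemoService.class);
        reference.setRetries(0); // 0 retries: only the first call is ever made
        return reference;
    }
}
```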

Feign’s retry mechanism (v11.1)

Feign is an HTTP client written in plain Java, and it is the recommended RPC framework in Spring Cloud. Although Feign is also an HTTP client, it is very different from a library like Apache HC.

Besides the JDK's built-in HTTP client, Feign's Client abstraction also supports Apache HC, Google HTTP Client, OkHttp, and other HTTP libraries.

It also provides Encoder/Decoder abstractions... so it is not really a low-level HTTP client, but more of an "HTTP toolkit", or a basic abstraction for RPC.

So what is Feign's retry policy? That is hard to answer in one line, because there are many cases to distinguish, and the policy differs depending on which underlying client a Feign Client is built on.

First, Feign has a built-in retry policy of its own. As shown in the figure below, Feign retries outside of the call into the HttpClient, and waits for an interval before each retry.

Under the default configuration, a request is attempted at most 5 times (including the first). Before each retry there is a sleep interval, and the interval grows as the retry count grows. It is computed by the formula:

$$\text{retryInterval} = \text{initialInterval (default 100ms)} \times 1.5^{\,\text{retryAttempt}-1}$$

As the figure below shows, the larger the number of retries, the longer the wait before the next attempt: 100ms, 150ms, 225ms, about 337ms, and so on. (In Feign's Retryer.Default the interval is additionally capped at a configurable maxPeriod, which defaults to 1 second.)

Keep in mind, though, that this retry wraps the call into the HttpClient. If you only use the built-in JDK HTTP client, that is not a problem: the JDK client is simple and has no retry mechanism of its own, so Feign's retry is the only one in play.
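
For illustration, a minimal Feign client with the default Retryer spelled out explicitly; PingApi and the URL are made up:

```java
import feign.Feign;
import feign.RequestLine;
import feign.Retryer;

public class FeignRetryDemo {
    interface PingApi {
        @RequestLine("GET /ping")
        String ping();
    }

    public static void main(String[] args) {
        PingApi api = Feign.builder()
                // period = 100ms, maxPeriod = 1s, maxAttempts = 5 -- the library defaults
                .retryer(new Retryer.Default(100, 1000, 5))
                .target(PingApi.class, "http://localhost:8080");
        System.out.println(api.ping());
    }
}
```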

That is not the case with a third-party HTTP client such as Apache HC, which usually comes with its own retry mechanism.

If the third-party HttpClient retries and Feign retries on top of it, the retries stack into two levels, and the total number of attempts becomes the product of the two retry counts.

For example, with Apache HC retrying 3 times by default and Feign attempting 5 times by default as described above, a single failing request can balloon to 15 attempts in the worst case.
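
If you do stack them, one safe approach is to keep retries in exactly one layer. A sketch that disables both (reusing the PingApi interface from the previous example), after which you would deliberately re-enable only the layer you want:

```java
import feign.Feign;
import feign.Retryer;
import feign.httpclient.ApacheHttpClient;
import org.apache.http.impl.client.HttpClients;

public class SingleLayerRetry {
    public static FeignRetryDemo.PingApi build() {
        return Feign.builder()
                .client(new ApacheHttpClient(HttpClients.custom()
                        .disableAutomaticRetries() // no retries inside Apache HC
                        .build()))
                .retryer(Retryer.NEVER_RETRY)      // no retries inside Feign
                .target(FeignRetryDemo.PingApi.class, "http://localhost:8080");
    }
}
```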

And this is only Feign's retry behavior in basic usage. If you add a load balancer such as Ribbon in Spring Cloud, things get even more complicated. That is beyond the scope of this article, but it is worth digging into how Feign is configured in Spring Cloud.

Conclusion

Retrying may look simple, but there are many factors to weigh if you want to retry safely. Always consider the current business scenario and its context when deciding whether to retry and how many times, instead of bolting on a retry mechanism on a whim; brute-force retries tend to magnify the problem and bring more serious consequences.

If you are not sure whether a retry is safe, do not retry: disable retries in these frameworks. Failing fast beats amplifying the problem.

References

  • How to retry gracefully – InfoQ
  • Apache Dubbo – GitHub
  • Feign – GitHub
  • Apache HttpClient – GitHub
