
Preface

Because the network is unreliable, network requests frequently fail. The usual answer to this problem is a retry mechanism: when a request fails, issue it again to maximize the chance of success and thereby improve the stability of the service.

The risks of retrying

But most people are reluctant to retry casually, because retrying often introduces greater risks. For example, excessive retries put extra pressure on the called service and amplify the original problem.

As shown in the figure below, service A calls service B, and service B calls service C or service D depending on the request data. Now suppose service C fails and becomes unavailable: every request that service B sends to service C times out, while service D remains available. Because service A retries heavily, the load on service B rises rapidly until its capacity (for example, its connection pool) is exhausted. At that point the branch that calls service D also becomes unavailable, because service B is saturated with retried requests and cannot accept any more.

(figure: retries from service A exhaust service B's capacity and take down the branch to service D)

Another case: the service itself is available, but heavy latency, jitter, or packet loss in the network causes the request to reach the target service late, or the response to arrive back at the caller after its timeout. If the client then retries, the server is likely to receive the same request multiple times. The server therefore needs idempotent processing so that handling the same request several times produces a consistent result.

(figure: a timed-out request plus a retry delivers the same request to the server twice)
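One common way to get that idempotence is to deduplicate requests by an id that the client attaches and reuses on every retry. Below is a minimal, hypothetical sketch (IdempotentHandler, requestId and doBusiness are illustrative names, not from any library); a real system would also need cache expiry and persistence.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of server-side idempotent handling: a retried request with the
// same requestId returns the cached result instead of being processed a second time.
class IdempotentHandler {

    private final Map<String, String> resultCache = new ConcurrentHashMap<>();

    // requestId must be generated once by the client and reused on every retry
    String handle(String requestId, String payload) {
        // computeIfAbsent runs the business logic at most once per requestId
        return resultCache.computeIfAbsent(requestId, id -> doBusiness(payload));
    }

    private String doBusiness(String payload) {
        // ... real processing, e.g. creating an order
        return "processed:" + payload;
    }
}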

So if retrying is risky, should we simply never retry, and just let every failure fail outright?

Retrying failures at different stages

Whether to retry depends on the cause of the failure; it cannot be a blanket decision to always retry or never retry. The network is complex, the call chain is long, and different protocols call for different retry strategies.

Retrying under the HTTP protocol

A basic HTTP request will include the following stages:

  1. DNS resolution
  2. TCP three-way handshake
  3. Sending & receiving data with the peer

In the DNS resolution phase, if the domain name does not exist or has no DNS record, the host address list cannot be resolved and the request cannot even be initiated. Retrying at this point is pointless, so there is no need to retry.

In the TCP handshake phase, if the handshake itself fails, there is little point in retrying either: failing at the very first step of the request strongly suggests that the target host is unavailable.

After the DNS and handshake stages comes the stage of actually sending and receiving data. Once a failure happens here, deciding whether to retry involves more factors.

Consider the situation shown in the figure below. Due to network congestion or similar causes, the data takes too long to reach the server; the server eventually receives the complete message and starts processing the request, but by then the client has already timed out and abandoned it. If the client now creates a new TCP connection and retries, the server receives the same request message twice, and processing it twice may have serious consequences.

So a request whose message has already been sent successfully is not suitable for retrying.

(figure: the request arrives late, the client times out and retries, and the server processes the same request twice)

The question then is: how do I know the message was sent successfully? Is a Socket write successful as long as it does not throw? Is it successful when, after SocketChannel.write, the buffer is empty?

It is not that simple. An application-layer socket write only copies data into the kernel's send (SND) buffer; when the operating system actually puts that data on the network is not guaranteed. Blocking versus non-blocking only describes the write call itself: blocking happens when the SND buffer is full and no more data can be written into it.

Still, as a rough approximation, we can treat "the write succeeded and the application-layer buffer is empty" as "the message has been sent."
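As a rough illustration of that approximation with NIO (a sketch only; the class and method names are made up, not from any library):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// "Sent" here only means the bytes left the application-layer buffer and entered the
// kernel SND buffer; it says nothing about whether the peer has actually received them.
class RoughSendCheck {

    static boolean roughlySent(SocketChannel channel, ByteBuffer request) throws IOException {
        while (request.hasRemaining()) {
            int written = channel.write(request);
            if (written == 0) {
                // Non-blocking socket with a full kernel SND buffer: nothing more can be
                // written right now; the caller should wait for OP_WRITE and try again.
                return false;
            }
        }
        // The application-layer buffer is drained; we *treat* this as "sent successfully".
        return true;
    }
}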

Now consider another situation: while the data is being sent, the peer closes the socket outright and returns an RST:

(figure: the peer closes the connection and responds with RST while the request is being sent)

This case is suitable for retrying. Since the server has not started processing the request, retrying (rebuilding the connection and resending the request) only improves availability and adds no extra burden.


In the HTTP protocol, there are some semantic conventions for Request Method:

| Method | Semantics (for a collection URI) | Idempotency |
| --- | --- | --- |
| GET | Lists the URIs of the resources in the collection, optionally with details of each resource | Safe (stronger than idempotent) |
| POST | Creates/appends a new resource in the collection; often returns the URL of the new resource | Non-idempotent |
| PUT | Replaces the entire collection with the given set of resources | Idempotent |
| DELETE | Deletes the entire collection of resources | Idempotent |

PUT and DELETE are idempotent operations, so even if the same request message is processed more than once, it will not cause problems such as duplicated data. POST is different: its semantics are create/append, which is non-idempotent.

Now back to the retry problem above. If the request message was sent successfully but the response timed out, and the request method is DELETE, a retry can be considered, because DELETE is semantically idempotent; the same goes for GET and PUT, which are also semantically idempotent.

POST, however, must not be retried this way: it is semantically non-idempotent, so retrying is likely to cause the request to be processed repeatedly. A retry decision based purely on these method semantics could look like the sketch below.
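A minimal sketch (not from any library; class and method names are made up for illustration) that decides whether a timed-out request may be retried based only on the method's idempotency semantics:

import java.util.Set;

// Decide retryability after a response timeout purely from the HTTP method semantics.
final class MethodRetryPolicy {

    private static final Set<String> IDEMPOTENT_METHODS =
            Set.of("GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE");

    // true = the method is semantically idempotent, so a retry after a timeout is allowed
    static boolean mayRetryAfterTimeout(String httpMethod) {
        return IDEMPOTENT_METHODS.contains(httpMethod.toUpperCase());
    }
}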

But is everything really that tidy? How many APIs actually observe these semantics strictly? Relying on semantic conventions alone is fragile; you must know for certain whether the server-side interface is idempotent before you can even consider retrying.

Retrying under HTTPS

HTTPS has existed for many years and has finally become widespread recently; sites that have not upgraded are flagged as "not secure" by browsers. These days, web APIs exposed on the public internet are essentially all HTTPS.

With HTTPS, the retry strategy changes a little:

(figure: the HTTPS handshake process)

The figure above shows the HTTPS handshake: after the TCP connection is established, an SSL/TLS handshake is performed first, the peer certificate is verified, and a temporary symmetric key is negotiated.

If a failure occurs during the SSL handshake, such as an expired or untrusted certificate, there is no point in retrying at all: such problems are not transient, they keep failing for a long time once they appear, and a retry will fail too.

Retry mechanisms in mainstream network libraries & RPC frameworks

Having covered the considerations for retrying under HTTP(S), let's now look at how mainstream libraries handle retries, and see whether the mechanisms in these popular open-source projects are "reasonable."

Apache HttpClient's retry mechanism (v4.x)

Apache HttpClient is the most mainstream HTTP library in server-side Java. The JDK does ship a basic HTTP client, but it is... too basic to use directly. Apache HttpClient (Apache HC for short) fills that gap: it is powerful, easy to use, and every component can be customized.

Apache HC's default retry strategy is org.apache.http.impl.client.DefaultHttpRequestRetryHandler. Let's look at the implementation first (some unimportant code is omitted):

// Returning true means a retry is needed; false means no retry
@Override
public boolean retryRequest(
    final IOException exception,
    final int executionCount,
    final HttpContext context) {
    
    // Check whether the retry count has reached the limit
    if (executionCount > this.retryCount) {
        // Do not retry if over max retry count
        return false;
    }
    // Exceptions that should never be retried
    if (this.nonRetriableClasses.contains(exception.getClass())) {
        return false;
    } 
    // Check whether the request is idempotent
    if (handleAsIdempotent(request)) {
        // Retry if the request is considered idempotent
        return true;
    }
    // Check whether the request message has already been sent
    if (!clientContext.isRequestSent() || this.requestSentRetryEnabled) {
        // Retry if the request has not been sent fully or
        // if it's OK to retry methods that have been sent
        return true;
    }
    // otherwise do not retry
    return false;
}

Briefly summarize the retry strategy of Apache HC:

  1. Check whether the retry count has exceeded the maximum (3 by default); if so, do not retry
  2. Certain exceptions are never retried:

    1. UnknownHostException - the host could not be found
    2. ConnectException - the TCP handshake failed
    3. SSLException - the SSL handshake failed
    4. InterruptedIOException (ConnectTimeoutException/SocketTimeoutException) - handshake timeout or socket read timeout (which can roughly be treated as a response timeout)
  3. Check whether the request is idempotent; idempotent requests may be retried
  4. Check whether the request message has already been sent; if it has not been fully sent, a retry is allowed
  5. Retries are issued immediately, with no interval between attempts

Apache HC's default retry strategy lines up exactly with the "reasonable" strategy described in the previous section. Mainstream open-source projects of this caliber really are excellent and carefully designed, and their source code makes great learning material.
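If you want to be explicit about this behavior (or tune it), the handler can be set on the client builder. A small sketch against HttpClient 4.5.x, with the factory class name being illustrative:

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClients;

// At most 3 retries; requestSentRetryEnabled = false means a request whose message has
// already been sent will NOT be retried.
class SafeHttpClientFactory {

    static CloseableHttpClient create() {
        return HttpClients.custom()
                .setRetryHandler(new DefaultHttpRequestRetryHandler(3, false))
                .build();
    }
}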

Dubbo's retry mechanism (v2.6.x)

Dubbo's retry mechanism lives in com.alibaba.dubbo.rpc.cluster.support.FailoverClusterInvoker (the package was renamed to org.apache.dubbo in 2.7). The excerpt below omits some code, including the declarations of le, invoked, and copyinvokers:

public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    // Get the configured number of attempts: the retry count plus 1 for the initial call
    int len = getUrl().getMethodParameter(invocation.getMethodName(), Constants.RETRIES_KEY, Constants.DEFAULT_RETRIES) + 1;
    Set<String> providers = new HashSet<String>(len);
    for (int i = 0; i < len; i++) {
        Invoker<T> invoker = select(loadbalance, invocation, copyinvokers, invoked);
        invoked.add(invoker);
        RpcContext.getContext().setInvokers((List) invoked);
        try {
            Result result = invoker.invoke(invocation);
            if (le != null && logger.isWarnEnabled()) {
                logger.warn("Although retry the method " + invocation.getMethodName()
                            + " in the service " + getInterface().getName()
                            + " was successful by the provider " + invoker.getUrl().getAddress()
                            + ", but there have been failed providers " + providers
                            + " (" + providers.size() + "/" + copyinvokers.size()
                            + ") from the registry " + directory.getUrl().getAddress()
                            + " on the consumer " + NetUtils.getLocalHost()
                            + " using the dubbo version " + Version.getVersion() + ". Last error is: "
                            + le.getMessage(), le);
            }
            return result;
        } catch (RpcException e) {
            // Biz-type exceptions are rethrown and not retried; any non-biz RpcException triggers a retry
            if (e.isBiz()) { // biz exception.
                throw e;
            }
            le = e;
        } catch (Throwable e) {
            le = new RpcException(e.getMessage(), e);
        } finally {
            providers.add(invoker.getUrl().getAddress());
        }
    }
}

As the code shows, only an RpcException that is not a biz exception triggers a retry. We could keep digging through the code to see which scenarios produce such exceptions... but let's skip the code and jump straight to the answer.

Briefly summarize the retry strategy in Dubbo:

  1. The default number of attempts is 3 (including the first request), and a retry is only triggered when this value is greater than 1
  2. The default cluster strategy is Failover, so a retry does not hit the current node again but goes to the next one (available nodes -> load balancing -> routing)
  3. A TCP handshake timeout triggers a retry
  4. A response timeout triggers a retry
  5. A corrupted message or other error that prevents the response from being matched to its request also makes the Future time out, and that timeout triggers a retry
  6. An exception returned by the server (for example one thrown by the provider) counts as a successful call and is not retried

Dubbo's retry strategy is rather aggressive, nowhere near as cautious as Apache HC's. When using Dubbo, configure retries carefully so that non-idempotent services are never retried; if your provider is not idempotent, it is best to set the retry count to 0.
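With Dubbo 2.6.x annotation-based configuration, that might look like the sketch below (OrderService is an invented interface; the equivalent XML form <dubbo:reference retries="0"/> should also work):

import com.alibaba.dubbo.config.annotation.Reference;

// OrderService is an illustrative interface, not a real API
interface OrderService {
    String createOrder(String payload);
}

public class OrderClient {

    // retries = 0 disables Failover retries: only the initial invocation is attempted,
    // so a non-idempotent provider is never called twice for the same request
    @Reference(retries = 0, timeout = 3000)
    private OrderService orderService;
}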

Feign's retry mechanism (v11.1)

Feign is a simple Java HTTP client, and it is also the RPC framework recommended in Spring Cloud. Although Feign is an HTTP client too, it is quite different from libraries like Apache HC.

Below is a diagram of Feign's core structure. As it shows, besides the JDK's built-in HTTP client, Feign's client layer also supports Apache HC, Google HTTP Client, OkHttp, and other HTTP libraries.

(figure: Feign's core structure)

It also abstracts encoders and decoders... so it cannot really be called a basic HTTP client; perhaps call it an "HTTP toolkit", or a basic abstraction for RPC?

What is Feign's retry strategy? That question is surprisingly hard to answer, because several cases must be distinguished: the retry behavior differs depending on which Feign client is used.

First, Feign has its own built-in retry strategy. As the figure below shows, Feign's retry wraps around the call to the HTTP client, and there is a delay before each retry.

(figure: Feign's retry loop wraps the underlying HTTP client call)

With the default configuration, the maximum number of attempts is 5 (including the first one), and a sleep is inserted before each retry. The interval grows with the number of retries and is calculated as:

$$\text{retry interval} = \text{base interval (default 100 ms)} \times 1.5^{\,n-1}$$

where $n$ is the current retry attempt, starting at 1.

As the figure below shows, the more retries there have been, the longer the next interval becomes.

(figure: the retry interval growing with the retry count)
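To make the growth concrete, here is a tiny standalone calculation of the first few intervals under the default base interval (RetryIntervalDemo is just an illustration; as I understand Feign's defaults, the interval is also capped at a configurable maximum of 1 s):

// Compute the first few retry intervals from the formula above.
public class RetryIntervalDemo {
    public static void main(String[] args) {
        long period = 100; // base interval in ms
        for (int attempt = 1; attempt <= 4; attempt++) {
            long interval = (long) (period * Math.pow(1.5, attempt - 1));
            System.out.println("retry #" + attempt + " sleeps " + interval + " ms");
        }
        // prints: 100, 150, 225, 337 ms
    }
}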

But this retry happens outside the HTTP client. If you stick with Feign's default, the JDK HTTP client, there is no problem: the JDK client is very simple and has no retry mechanism of its own, so Feign's retry is the only one in effect.

However, when Feign is paired with a third-party HTTP client such as Apache HC, things change, because the third-party client often has its own internal retry mechanism.

If the third-party client retries and Feign also retries, the two layers multiply: with N attempts in Feign and M in the client, the worst case is N × M attempts.

For example, with Apache HC, the default covered earlier is 3 retries, and Feign defaults to 5 attempts; in the worst case that adds up to 15 requests.

(figure: Feign's retries multiplied by Apache HC's retries)

And that is only Feign's retry behavior in basic usage. Under Spring Cloud, with a load balancer such as Ribbon in the mix, things get even more complicated; this article will not go into that, but if you are interested, take a look at the Spring Cloud Feign configuration.
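To keep retries in exactly one layer, Feign's Retryer can be tuned or disabled when building the client. A sketch against Feign 11.x (FeignClientFactory and the parameter values are just an example):

import feign.Feign;
import feign.Retryer;

// Either configure Feign's Retryer explicitly, or turn it off with Retryer.NEVER_RETRY
// and let the underlying HTTP client (if any) handle retries.
public class FeignClientFactory {

    static <T> T build(Class<T> apiType, String baseUrl) {
        return Feign.builder()
                // Retryer.Default(period, maxPeriod, maxAttempts): 100 ms base interval,
                // 1 s cap, at most 3 attempts including the first one
                .retryer(new Retryer.Default(100, 1000, 3))
                // .retryer(Retryer.NEVER_RETRY)  // alternative: disable Feign-level retries
                .target(apiType, baseUrl);
    }
}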

Summary

Retrying looks simple, but retrying safely and stably requires weighing many factors. Whether to retry, and how many times, should be decided from the current business scenario and the available context, not on a whim; brute-force retries usually just amplify the problem and lead to more serious consequences.

If you are not sure a retry is safe, do not retry: disable the frameworks' retries. Failing fast is better than letting the problem snowball.


