Analysis of timeout and retry

Preface

Apart from the null pointer exception, the timeout is probably the exception we know best. Timeouts show up at every layer of a system: the access layer, the service layer, the database layer, and so on. A timeout is often paired with a retry, because in some cases, such as network jitter, a retry can succeed. Retries usually have an upper limit as well: if the program itself is at fault, no number of retries will help, and further attempts only waste resources.

Why set a timeout

The timeouts developers set most often are database timeouts, cache timeouts, middleware client timeouts, HttpClient timeouts, and sometimes business-level timeouts. Why set them? Without a timeout, a single request that gets no response can leave the entire call chain waiting for a long time, and if enough such requests pile up, the whole system can grind to a halt. Throwing a timeout exception is a way to cut losses early. Most of these settings revolve around network timeouts, and network timeouts come down to Socket timeout settings.

Socket timeout

Socket is the most basic class for network communication. Communication breaks down into two steps:

Establish a connection: a connection must be established before any messages are read or written; the connection timeout (ConnectTimeout) applies to this phase;
Read and write: the two parties actually exchange data; the read/write timeout (ReadTimeout) applies to this phase;

Connection timeout

The connect method of Socket accepts a connection timeout:

public void connect(SocketAddress endpoint) throws IOException
public void connect(SocketAddress endpoint, int timeout) throws IOException

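If timeout is not set, it defaults to 0, which in theory means no time limit. In testing, however, a connection attempt still failed after about 21 seconds by default; this limit most likely comes from the operating system's TCP stack rather than from Java.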

Various exceptions may be thrown when establishing a connection, such as:

ProtocolException: There is an error in the basic protocol, such as a TCP error;
```
java.net.ProtocolException: Protocol error
```
ConnectException: remote connection refused (for example, no process is listening on the remote address/port);
```
java.net.ConnectException: Connection refused
```

SocketTimeoutException: a timeout occurred on a socket read or accept; a connection timeout is reported with the same exception;

java.net.SocketTimeoutException: connect timed out
java.net.SocketTimeoutException: Read timed out

UnknownHostException: indicates that the IP address of the host cannot be determined;
```
java.net.UnknownHostException: localhost1
```
NoRouteToHostException: an error occurred while connecting to the remote address and port; typically the remote host cannot be reached because a firewall intervenes or an intermediate router is down;
```
java.net.NoRouteToHostException: Host unreachable
java.net.NoRouteToHostException: Address not available
```

SocketException: An error occurred when creating or accessing the socket;

java.net.SocketException: Socket closed
java.net.SocketException: connect failed

Here we focus on SocketTimeoutException. Since Connection refused also appears often, a brief comparison of the two follows.

Connect timed out

Locally, you can simply try to connect to an IP address that is not reachable:

Socket socket = new Socket();
SocketAddress endpoint = new InetSocketAddress("111.1.1.1", 8080);
socket.connect(endpoint, 2000); // connection timeout: 2 seconds

The connection attempt fails with the following error:

java.net.SocketTimeoutException: connect timed out
    at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)

Connection refused

For a local test, use a 127.x.x.x loopback address with a port that nothing is listening on; the connection attempt fails with the following error:

java.net.ConnectException: Connection refused: connect
    at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)

Comparison

Connection refused: the route from the local client to the target IP address is fine, but no process is listening on the target port, so the server rejects the connection. Addresses beginning with 127 are reserved for loopback, that is, communication between processes on the same host, so the packets never leave the machine and routing always works;
Connect timed out: a timeout can have many more causes, for example the server cannot be pinged, a firewall drops the request packets, or the network has intermittent problems;
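
The difference between the two failures can be checked locally. The following is a minimal sketch rather than the original test code: the class name ConnectProbe and its Outcome enum are made up for illustration, and it assumes the loopback interface behaves normally, so that a closed port is rejected immediately instead of silently dropping packets.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ConnectProbe {
    enum Outcome { CONNECTED, TIMED_OUT, REFUSED, OTHER_ERROR }

    // Tries to open a TCP connection and classifies the result.
    static Outcome probe(InetAddress host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return Outcome.CONNECTED;
        } catch (SocketTimeoutException e) {
            // No answer within timeoutMs (packets dropped, host unreachable, ...).
            return Outcome.TIMED_OUT;
        } catch (ConnectException e) {
            // Usually "Connection refused": host reachable, nothing listening on the port.
            return Outcome.REFUSED;
        } catch (IOException e) {
            return Outcome.OTHER_ERROR;
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        int port;
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            port = server.getLocalPort();
            System.out.println("listening: " + probe(loopback, port, 2000));
        }
        // The listener is closed now, so the same port should reject connections.
        System.out.println("closed:    " + probe(loopback, port, 2000));
    }
}
```

Passing a port of 0 to ServerSocket lets the operating system pick a free port, which keeps the sketch independent of any fixed port number.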

Read and write timeout

Socket's setSoTimeout sets the read/write timeout (strictly speaking, it limits how long a blocking read waits). If it is not set, the default is 0, which means no time limit. As a simple simulation, let the server delay its business processing by 10 seconds while the client sets this timeout to 1 second:

Socket socket = new Socket();
SocketAddress endpoint = new InetSocketAddress("127.0.0.1", 8189);
socket.connect(endpoint, 2000); // connection timeout: 2 seconds
socket.setSoTimeout(1000); // read timeout: 1 second

InputStream inStream = socket.getInputStream();
inStream.read(); // blocking read

Because the server's delay exceeds the timeout set by the client, the read fails with the following error:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:224)
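
The read timeout can also be reproduced without a separate server process. The sketch below uses a hypothetical class, ReadTimeoutDemo: it opens a listener on an ephemeral loopback port, lets the peer accept the connection but never write, and reports whether read() gave up with SocketTimeoutException. The timeout must be greater than 0, since 0 would block indefinitely.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {
    // Returns true if read() gave up with SocketTimeoutException after soTimeoutMs.
    static boolean readTimesOut(int soTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            client.connect(new InetSocketAddress(loopback, server.getLocalPort()), 2000);
            client.setSoTimeout(soTimeoutMs); // applies to blocking reads on this socket
            try (Socket silentPeer = server.accept()) {
                // The peer has accepted the connection but never writes anything.
                client.getInputStream().read();
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```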

NIO timeout

The above covers timeout configuration for the traditional Socket. The NIO SocketChannel also has timeouts. NIO offers a blocking mode and a non-blocking mode: blocking mode corresponds to the traditional Socket, while non-blocking mode provides no timeout setting.

Blocking mode

SocketChannel client = SocketChannel.open();
// blocking mode
client.configureBlocking(true);
InetSocketAddress endpoint = new InetSocketAddress("128.5.50.12", 8888);
client.socket().connect(endpoint, 1000);

In blocking mode, client.socket() returns the Socket associated with the SocketChannel, and a connection timeout can be set on it. When the timeout expires, the following error is reported:

java.net.SocketTimeoutException
    at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118)

Non-blocking mode

SocketChannel client = SocketChannel.open();
// non-blocking mode
client.configureBlocking(false);
// register with the selector
Selector selector = Selector.open();
client.register(selector, SelectionKey.OP_CONNECT);
InetSocketAddress endpoint = new InetSocketAddress("127.0.0.1", 8888);
client.connect(endpoint);

Simulating the same two situations (connection timeout and connection refused) produces the following errors:

// connection timed out
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

// connection refused
java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
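
Since non-blocking mode has no timeout setting of its own, one common approach is to bound the wait with Selector.select(timeout). The sketch below illustrates that idea under stated assumptions: NioConnectTimeout and its connect helper are hypothetical names, and a real client would normally share one selector across many channels rather than open one per connection.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.SocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

class NioConnectTimeout {
    // Returns true once connected, false if nothing happened within timeoutMs (must be > 0).
    // A refused connection surfaces as ConnectException from connect() or finishConnect().
    static boolean connect(SocketChannel channel, SocketAddress address, long timeoutMs)
            throws IOException {
        channel.configureBlocking(false);
        if (channel.connect(address)) {
            return true; // connected immediately, which can happen on loopback
        }
        try (Selector selector = Selector.open()) {
            channel.register(selector, SelectionKey.OP_CONNECT);
            if (selector.select(timeoutMs) == 0) {
                return false; // the caller decides whether to close the channel or retry
            }
            return channel.finishConnect();
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             SocketChannel channel = SocketChannel.open()) {
            SocketAddress address = new InetSocketAddress(loopback, server.getLocalPort());
            System.out.println("connected: " + connect(channel, address, 1000));
        }
    }
}
```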

Common timeouts

Once Socket timeouts are understood, the other network-related timeouts follow naturally. Common network timeout settings include database client timeouts, cache client timeouts, RPC client timeouts, HttpClient timeouts, and gateway-layer timeouts. All of these are set from the client's perspective; servers such as web containers handle timeouts on their side too. Besides network-related timeouts, there are also business timeouts. Each is introduced below.

Network timeout

This part focuses on client-side timeout settings; for the server side, it focuses on the web container.

Database client timeout

Taking MySQL as an example, the simplest way to set timeouts is to append parameters to the URL:

jdbc:mysql://localhost:3306/ds0?connectTimeout=2000&socketTimeout=200

connectTimeout: the connection timeout, in milliseconds;

socketTimeout: the read/write timeout, in milliseconds;

Besides the timeouts offered by the database driver itself, we usually work through an ORM framework such as MyBatis, and these frameworks provide their own timeout settings:

 <setting name="defaultStatementTimeout" value="25"/>

defaultStatementTimeout: the number of seconds the database driver waits for a response from the database.

Cache client timeout

Taking Redis with the Jedis client as an example, the timeout can be configured when creating a connection:

public Jedis(final String host, final int port, final int timeout)

Only one timeout is configured here; the connection timeout and the read/write timeout share the same value, as the source code of Connection shows:

public void connect() {
        if (!isConnected()) {
            try {
                socket = new Socket();
                socket.setReuseAddress(true);
                socket.setKeepAlive(true);
                socket.setTcpNoDelay(true);
                socket.setSoLinger(true, 0);
                // timeout used as the connection timeout
                socket.connect(new InetSocketAddress(host, port), timeout);
                // timeout used as the read/write timeout
                socket.setSoTimeout(timeout);
                outputStream = new RedisOutputStream(socket.getOutputStream());
                inputStream = new RedisInputStream(socket.getInputStream());
            } catch (IOException ex) {
                throw new JedisConnectionException(ex);
            }
        }
    }

RPC client timeout

Taking Dubbo as an example, the timeout can be configured directly in the XML:

<dubbo:consumer timeout="" >

The default is 1000 ms. As an RPC framework, Dubbo relies on a communication framework such as Netty at the bottom layer, but it implements its own timeout mechanism on top of a Future; see DefaultFuture, part of which is shown below:

 // internal lock
 private final Lock lock = new ReentrantLock();
 private final Condition done = lock.newCondition();
 // throw TimeoutException if no response arrives within the given time
 public Object get(int timeout) throws RemotingException {
        if (timeout <= 0) {
            timeout = Constants.DEFAULT_TIMEOUT;
        }
        if (!isDone()) {
            long start = System.currentTimeMillis();
            lock.lock();
            try {
                while (!isDone()) {
                    done.await(timeout, TimeUnit.MILLISECONDS);
                    if (isDone() || System.currentTimeMillis() - start > timeout) {
                        break;
                    }
                }
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            } finally {
                lock.unlock();
            }
            if (!isDone()) {
                throw new TimeoutException(sent > 0, channel, getTimeoutMessage(false));
            }
        }
        return returnFromResponse();
    }
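
The wait-with-deadline idea behind this excerpt can be sketched with only the JDK. TimedResponseFuture below is a hypothetical, simplified class and not Dubbo's implementation: it uses awaitNanos so that the remaining time keeps shrinking across wakeups, and it throws java.util.concurrent.TimeoutException instead of Dubbo's own exception type.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class TimedResponseFuture<T> {
    private final Lock lock = new ReentrantLock();
    private final Condition done = lock.newCondition();
    private T value;
    private boolean completed;

    // Called by the thread that receives the response.
    public void complete(T response) {
        lock.lock();
        try {
            value = response;
            completed = true;
            done.signalAll(); // wake up every thread waiting in get()
        } finally {
            lock.unlock();
        }
    }

    // Waits at most timeoutMs for the response, then throws TimeoutException.
    public T get(long timeoutMs) throws InterruptedException, TimeoutException {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (!completed) {
                if (remainingNanos <= 0) {
                    throw new TimeoutException("no response within " + timeoutMs + " ms");
                }
                // awaitNanos returns the time still left, so early wakeups do not extend the wait.
                remainingNanos = done.awaitNanos(remainingNanos);
            }
            return value;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        TimedResponseFuture<String> future = new TimedResponseFuture<>();
        new Thread(() -> future.complete("pong")).start();
        System.out.println(future.get(1000));
    }
}
```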

HttpClient timeout

HttpClient is arguably the most widely used HTTP client. Timeouts are set through RequestConfig:

RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(2000).setConnectTimeout(1000)
                .setConnectionRequestTimeout(3000).build();

The three timeout periods that can be configured are:

socketTimeout: the read/write timeout once the connection has been established;
connectTimeout: the timeout for establishing the connection;
connectionRequestTimeout: the timeout for obtaining a connection from the connection manager;

Gateway layer timeout

Taking the widely used Nginx as an example: when it forwards requests as a proxy, Nginx is effectively a client from the point of view of the back-end web server, so it also needs timeouts for connecting, reading and writing:

server {
        listen 80;
        server_name localhost;
        location / {
           # timeout settings
           proxy_connect_timeout 2s;
           proxy_read_timeout 2s;
           proxy_send_timeout 2s;
           
           # retry settings
           proxy_next_upstream error timeout;
           proxy_next_upstream_tries 5;
           proxy_next_upstream_timeout 5;
        }
    }

Related timeout configuration:

proxy_connect_timeout: the timeout period for establishing a connection with the back-end server, the default is 60s;
proxy_read_timeout: The timeout time for reading the response from the back-end server, the default is 60s;
proxy_send_timeout: The timeout period for sending a request to the back-end server, the default is 60s;

As a proxy server, Nginx also provides a retry mechanism. For upstream servers, multiple servers are often configured to achieve load balancing. The relevant configuration is as follows:

proxy_next_upstream: the conditions under which the request is passed to the next back-end server; the default is error timeout;
proxy_next_upstream_tries: the maximum number of tries; the default 0 means no limit;
proxy_next_upstream_timeout: the maximum total time allowed for retrying; the default 0 means no time limit;

Server timeout

All of the cases above take the client's perspective, which is also where developers configure timeouts most often. Timeouts can be configured on the server side too, for example in the web container Tomcat or in Nginx as described above. Here are the timeout settings of Tomcat:

<Connector connectionTimeout="20000" socket.soTimeout="20000" asyncTimeout="20000" disableUploadTimeout="false" connectionUploadTimeout="20000" keepAliveTimeout="20000" />

connectionTimeout: how long the connector waits, after accepting a connection, for the request URI line to arrive; if it does not arrive in time, the connection times out;
socket.soTimeout: the timeout for reading request data from the client; defaults to the value of connectionTimeout;
asyncTimeout: the timeout for asynchronous requests;
disableUploadTimeout and connectionUploadTimeout: connectionUploadTimeout is the timeout used during file uploads; it only takes effect when disableUploadTimeout is set to false;
keepAliveTimeout: how long an HTTP keep-alive connection waits for the next request;

For more settings, see the Tomcat 8.5 configuration reference.

Business timeout

Nearly all the middleware we use provides timeout settings, but some business scenarios need their own timeout handling. For example, a function calls several services, each with its own timeout, while the function as a whole also has a total timeout. In that case we can borrow Dubbo's approach and use a Future to enforce the overall timeout.
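
One way to enforce such a total timeout, sketched here with the JDK's ExecutorService and Future, is to fix a deadline up front and give each Future.get only the time that is still left. TotalTimeoutDemo, callAll and the two sample tasks are hypothetical; real service calls would take their place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TotalTimeoutDemo {
    // Runs all tasks in parallel and waits at most totalTimeoutMs for all of them together.
    static <T> List<T> callAll(ExecutorService pool, List<Callable<T>> tasks, long totalTimeoutMs)
            throws InterruptedException, ExecutionException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalTimeoutMs);
        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(pool.submit(task)); // all calls start right away
        }
        try {
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                // Each wait only gets what is left of the shared budget.
                long remainingNanos = Math.max(deadline - System.nanoTime(), 0);
                results.add(future.get(remainingNanos, TimeUnit.NANOSECONDS));
            }
            return results;
        } finally {
            for (Future<T> future : futures) {
                future.cancel(true); // stops unfinished calls; no effect on completed ones
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            tasks.add(() -> "user"); // stands in for a fast service call
            tasks.add(() -> { Thread.sleep(200); return "orders"; }); // a slower one
            System.out.println(callAll(pool, tasks, 1000));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

If the budget runs out, the caller sees a single TimeoutException for the whole function, regardless of which individual call was slow.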

Retry

Retries usually go hand in hand with timeouts: a timeout may be a temporary failure caused by some transient condition, so a retry may succeed. In fact, many systems that provide load balancing, such as Nginx-style gateways, RPC frameworks and MQ systems, retry not only on timeout but on any exception. Let's look at how various systems implement retry.

RPC retry

An RPC system generally has a registry, and a service provider runs on several nodes, so when one server node fails the consumer picks another. Dubbo, for example, provides the fault-tolerance class FailoverClusterInvoker, which by default retries twice on failure. The retry is implemented with a for loop:

 for (int i = 0; i < len; i++) {
     try{
         // pick a provider through load balancing
         Invoker<T> invoker = select(loadbalance, invocation, copyinvokers, invoked);
         // invoke it
         Result result = invoker.invoke(invocation);
     } catch (Throwable e) {
         // an exception does not end the loop
        le = new RpcException(e.getMessage(), e);
    }
 }

Catching the exception inside a for loop like this is a simpler way to implement retry than retrying from within the catch clause.
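
The same loop-and-catch pattern works outside Dubbo. The following is a generic sketch, not Dubbo code: RetryLoop and callWithRetry are hypothetical names, and maxAttempts counts the first call as well as the retries.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class RetryLoop {
    // Calls the action up to maxAttempts times and returns the first successful result.
    static <T> T callWithRetry(Callable<T> action, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure and move on to the next attempt
            }
        }
        throw last; // every attempt failed: report the most recent error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated flaky call: fails twice, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("transient failure " + calls[0]);
            }
            return "ok";
        }, 3);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```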

MQ retry

Many messaging systems, such as ActiveMQ, RocketMQ and Kafka, provide retry mechanisms.

In ActiveMQ, the rollback method of the ActiveMQMessageConsumer class provides a redelivery mechanism; the maximum number of redeliveries is DEFAULT_MAXIMUM_REDELIVERIES=6.

With RocketMQ, when message volume is high and the network fluctuates, retries are quite likely. setRetryTimesWhenSendFailed on the Producer sets the number of automatic retries for synchronous sends; the default is 2.

Gateway retry

As a load balancer, a gateway has retry as one of its core functions. It usually also has a health check mechanism that removes faulty back-end nodes ahead of time, which reduces the need to retry, since a retry itself costs time. The Nginx retry settings were covered in the previous section and are not repeated here.

HttpClient retry

HttpClient has a built-in retry mechanism, implemented in the class RetryExec. The default number of retries is 3. Part of the code:

for (int execCount = 1;; execCount++) {
     try {
        return this.requestExecutor.execute(route, request, context, execAware);
     } catch (final IOException ex) {
         // check whether this exception should be retried
     }
}

A retry happens only when an IOException occurs;
No retry happens for these four exceptions: InterruptedIOException, UnknownHostException, ConnectException, SSLException;

Note that SocketTimeoutException extends InterruptedIOException, so a timeout is not retried.
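
The rule described above can be expressed as a small predicate. This is a simplified sketch, not HttpClient's actual handler: RetryDecision and shouldRetry are hypothetical names, and any further checks a real handler may apply (for example whether the request is idempotent) are left out.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLException;

class RetryDecision {
    // The four exception types listed above as "do not retry".
    private static final List<Class<? extends IOException>> NON_RETRIABLE = Arrays.asList(
            InterruptedIOException.class, UnknownHostException.class,
            ConnectException.class, SSLException.class);

    // executionCount is the number of attempts made so far, starting at 1.
    static boolean shouldRetry(IOException ex, int executionCount, int maxRetries) {
        if (executionCount > maxRetries) {
            return false; // retry budget used up
        }
        for (Class<? extends IOException> type : NON_RETRIABLE) {
            if (type.isInstance(ex)) {
                return false; // also matches subclasses such as SocketTimeoutException
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(new SocketTimeoutException("Read timed out"), 1, 3));
        System.out.println(shouldRetry(new IOException("Connection reset"), 1, 3));
    }
}
```

Because the check uses isInstance rather than an exact class comparison, a read timeout falls under InterruptedIOException and is rejected, which matches the behavior noted in the text.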

Timer retry

I once had to send notifications to external systems. Real-time delivery was not critical, and many of the external systems were not very stable and could go into maintenance at any moment, so I implemented retries with a database plus a timer. Each notification record stores its next retry time (the interval grows with each retry). The timer periodically looks for records whose next retry time has arrived and resends them: on success the status is updated; on failure the next retry time is updated and the retry count is incremented. A maximum number of retries is set as well.
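
The scheme can be sketched as follows. Everything here is illustrative rather than the original implementation: an in-memory list stands in for the database table, NotifyRetryJob, NotifyRecord, MAX_RETRIES and BASE_DELAY_MS are assumed names and values, and the sender predicate represents the call to the external system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class NotifyRetryJob {
    enum Status { PENDING, SUCCESS, FAILED }

    static class NotifyRecord {
        final String payload;
        Status status = Status.PENDING;
        int retryCount = 0;
        long nextRetryAtMs;

        NotifyRecord(String payload, long firstAttemptAtMs) {
            this.payload = payload;
            this.nextRetryAtMs = firstAttemptAtMs;
        }
    }

    static final int MAX_RETRIES = 5;         // assumed upper limit
    static final long BASE_DELAY_MS = 60_000; // assumed base interval

    // One timer tick: process every pending record whose retry time has arrived.
    static void runOnce(List<NotifyRecord> records, long nowMs, Predicate<NotifyRecord> sender) {
        for (NotifyRecord record : records) {
            if (record.status != Status.PENDING || record.nextRetryAtMs > nowMs) {
                continue; // already finished, or not due yet
            }
            if (sender.test(record)) {
                record.status = Status.SUCCESS;
            } else if (++record.retryCount >= MAX_RETRIES) {
                record.status = Status.FAILED; // give up after the maximum number of retries
            } else {
                // The delay grows with each failed attempt.
                record.nextRetryAtMs = nowMs + record.retryCount * BASE_DELAY_MS;
            }
        }
    }

    public static void main(String[] args) {
        List<NotifyRecord> records = new ArrayList<>();
        records.add(new NotifyRecord("order-1 paid", 0L));
        runOnce(records, 0L, r -> false);        // first attempt fails
        runOnce(records, BASE_DELAY_MS, r -> true); // retry succeeds once it is due
        System.out.println(records.get(0).status + ", retries=" + records.get(0).retryCount);
    }
}
```

In a real job, runOnce would typically be driven by a scheduler such as ScheduledExecutorService, with the current time passed in on each tick.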

Note

Retries must also take into account whether the request is a query or an update. Retrying a query several times does not change the result; an update must be idempotent before it can be retried safely.
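
One common way to make an update idempotent is to attach a unique request ID and apply each ID at most once. The sketch below shows the idea with a hypothetical IdempotentAccount class that keeps the IDs in memory; a real system would persist them, for example under a unique key in the database.

```java
import java.util.HashMap;
import java.util.Map;

class IdempotentAccount {
    // Result of every request that has already been applied, keyed by request ID.
    private final Map<String, Long> appliedRequests = new HashMap<>();
    private long balance;

    // Applies the deposit once per requestId; a retried request gets the stored result back.
    public synchronized long deposit(String requestId, long amount) {
        Long previousResult = appliedRequests.get(requestId);
        if (previousResult != null) {
            return previousResult; // duplicate delivery: do not apply again
        }
        balance += amount;
        appliedRequests.put(requestId, balance);
        return balance;
    }

    public synchronized long balance() {
        return balance;
    }

    public static void main(String[] args) {
        IdempotentAccount account = new IdempotentAccount();
        account.deposit("req-1", 100);
        account.deposit("req-1", 100); // retry of the same request
        System.out.println(account.balance());
    }
}
```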

Summary

Sensible timeout and retry settings are one of the prerequisites for a highly available system. Too many failures are caused by badly chosen timeouts, so they deserve attention during development. It also pays to read the source code of middleware: many problems already have an answer there, and Dubbo, for example, is a good reference.
