Author: vivo Internet Server Team - Zhi Guangquan
HttpClient is the HTTP tool Java programmers use most often. Its connection management simplifies development and makes connection reuse more efficient. Under normal circumstances HttpClient manages connections well, but when concurrency is high, message bodies are large, and the network is also unstable, how do we make sure connections are still used efficiently, and how much room is there for optimization?
1. Problem phenomenon
On X month X (Beijing time), monitoring of the browser information-flow service reported anomalies, mainly in the following three respects:
1. From a certain point in time, cloud monitoring showed that the circuit breakers of some HTTP interfaces had opened, and the problem machine could be identified from the detailed list:
2. The Hystrix circuit-breaker management page of the PaaS platform further confirmed that the circuit breakers for all HTTP interface calls on the problem machine had tripped:
3. The log center showed a large number of exceptions raised while getting connections from the HTTP connection pool: org.apache.http.impl.execchain.RequestAbortedException: Request aborted.
2. Problem location
Taken together, these three symptoms suggested a problem with TCP connection management on the problem machine, possibly caused by the virtual machine or the physical host. After we talked to the operations and system teams, neither the virtual machine nor the physical host showed any obvious abnormality. As a first response we asked operations to restart the problem machine, and the online problem went away.
2.1 Temporary solution
A few days later the same symptoms appeared on several other machines. At that point we could be fairly sure the service itself was at fault. Since the problem was related to TCP connections, we asked operations to set up a job on the problem machine that reports the distribution of TCP connection states:
netstat -ant|awk '/^tcp/ {++S[$NF]} END {for(a in S) print (a,S[a])}'
The result is as follows:
The output shows that the problem machine had close to 200 connections in the CLOSE_WAIT state (the service's HTTP connection pool has a maximum of 250 connections), so the direct cause was almost certainly an excess of CLOSE_WAIT connections. Following the principle of fixing the online problem first, we raised the connection pool size to 500 and had operations restart the machine, which resolved the problem for the time being.
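For readers who prefer to post-process the netstat output in application code, the same tally can be sketched in Java. This is only an illustration: the class name TcpStateCounter and the sample lines are hypothetical and are not taken from the incident.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Tallies TCP connection states from `netstat -ant` style output lines. */
final class TcpStateCounter {

    /** Counts the last whitespace-separated field of every line that starts with "tcp". */
    static Map<String, Integer> countStates(List<String> netstatLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            if (!line.startsWith("tcp")) {
                continue; // skip headers and non-TCP sockets, like /^tcp/ in awk
            }
            String[] fields = line.trim().split("\\s+");
            String state = fields[fields.length - 1]; // same role as $NF in awk
            counts.merge(state, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not output captured from the incident.
        List<String> sample = Arrays.asList(
            "Proto Recv-Q Send-Q Local Address Foreign Address State",
            "tcp 0 0 10.0.0.1:41000 10.0.0.2:443 ESTABLISHED",
            "tcp 1 0 10.0.0.1:41001 10.0.0.3:443 CLOSE_WAIT",
            "tcp 1 0 10.0.0.1:41002 10.0.0.3:443 CLOSE_WAIT");
        System.out.println(countStates(sample)); // {CLOSE_WAIT=2, ESTABLISHED=1}
    }
}
```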
2.2 Reason Analysis
Enlarging the connection pool only solved the online problem temporarily; the underlying cause was still unknown. In past experience, connections that cannot be released normally are almost always the result of misuse by developers who do not close the connection promptly after using it. That idea was quickly ruled out, though, for an obvious reason: the service had been running online for about a week with no release in between, and given the browser's traffic volume, if connections were simply not being closed after use, a pool of 250 connections would have been exhausted in less than a minute. The connections must therefore have been left unreleased by some abnormal scenario. We focused on the service interfaces launched recently, especially those with large payloads and long response times, and eventually narrowed the search down to one specific page-optimization interface. We first checked the IP and port pairs of the connections in the CLOSE_WAIT state to identify the remote server's IP address:
netstat -tulnap | grep CLOSE_WAIT
The partner confirmed that all of the target IPs belonged to them, which is consistent with our speculation.
2.3 TCP packet capture
While locating the problem, I also asked colleagues in operations to capture the TCP packets. The capture showed that it was indeed the client (the browser service's server) that did not finish the four-way close after the peer's FIN, so the connection teardown never completed and the client stayed in the CLOSE_WAIT state. The packet sizes also matched the suspected interface.
In order to facilitate your understanding, I found a picture from the Internet, you can use it as a reference:
CLOSE_WAIT is the state of the side that is closed passively. If the server closes the connection first, CLOSE_WAIT appears on the client, and vice versa.
Normally, if the client does not close the stream (the TCP stream socket) promptly after an HTTP request completes, the server actively sends a FIN to close the connection once its timeout expires. Because the client never closes its side, the connection stays in the CLOSE_WAIT state, and if that happened on every request the connection pool would be exhausted very quickly.
Therefore, the situation we encountered today (the number of connections in the CLOSE_WAIT state is slowly increasing every day) is more like a connection that is not closed due to an abnormal scenario.
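The passive-close behaviour described above can be reproduced locally with plain JDK sockets. The sketch below is not the service's code: the class name CloseWaitDemo is hypothetical, and it uses a loopback server that closes the connection immediately, so the client sees end of stream while its own socket is still open. Between that read and the client's own close(), the client socket sits in CLOSE_WAIT.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Local demonstration of how a socket ends up in CLOSE_WAIT. */
final class CloseWaitDemo {

    /**
     * Connects to a loopback server that closes the connection right away and
     * returns what the client's read() reports once the server's FIN has arrived.
     */
    static int readAfterPeerClose() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {

            accepted.close(); // the "server" side closes first and sends FIN

            // read() blocks until the FIN arrives, then returns -1 (end of stream).
            int result = client.getInputStream().read();

            // The client has not called close() yet, so the kernel keeps its socket
            // in CLOSE_WAIT until try-with-resources closes it on the way out.
            System.out.println("client closed locally? " + client.isClosed());
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read() returned " + readAfterPeerClose());
    }
}
```

If the client held on to such a socket instead of closing it, the connection would show up in the netstat output as CLOSE_WAIT for as long as the process kept it open.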
2.4 Independent Connection Pool
In order not to affect other business scenarios and prevent systemic risks, we first independently managed the problem interface connection pool.
2.5 In-depth analysis
With the question from 2.3 in mind, let's take a closer look at the business calling code:
try {
    httpResponse = HttpsClientUtil.getHttpClient().execute(request);
    HttpEntity httpEntity = httpResponse.getEntity();
    is = httpEntity.getContent();
} catch (Exception e) {
    log.error("");
} finally {
    IOUtils.closeQuietly(is);
    IOUtils.closeQuietly(httpResponse);
}
This code has an obvious problem: it closes not only the response stream (IOUtils.closeQuietly(is)) but also the whole connection (IOUtils.closeQuietly(httpResponse)), so the connection can never be reused. That makes things even more puzzling: if the connection is closed manually every time, why are so many connections still in the CLOSE_WAIT state?
If the business calling code is not the problem, the cause must lie in something particular about this interface. Packet capture analysis showed one obvious characteristic: the interface returns large responses, about 500 KB on average. The most likely explanation is that the large payload triggers some kind of exception that prevents the connection from being reused or released.
2.6 Source code analysis
Before starting the analysis, we need one piece of background: HTTP long (persistent) connections and short connections. With a long connection, the connection is reused for multiple data transfers after it is established; with a short connection, a new connection has to be established before each transfer.
The packet capture showed Connection: keep-alive in the interface's response headers, so the code analysis can focus on how HttpClient manages long connections.
2.6.1 Connection pool initialization
Initialization method:
Entering the PoolingHttpClientConnectionManager class, there is an overloaded constructor that contains the connection lifetime parameter:
Reading further down:
This is the end of the manager's construction method. It is not difficult to find that the validityDeadline will be assigned to the expiry variable. Then we will see where the HttpClient uses the expiry parameter;
Usually, some policy parameters are initialized when the instance object is constructed. At this time, we need to look at the method of constructing the HttpClient instance to find the answer:
This method includes a series of initialization operations, including building a connection pool, setting the maximum number of connections for the connection pool, specifying a reuse strategy and a long connection strategy, etc. Here we also notice that HttpClient creates an asynchronous thread to monitor and clean up idle connections.
Of course, the premise is that you have turned on the configuration of automatically cleaning up idle connections, which is turned off by default.
Then we saw the specific implementation of HttpClient closing idle connections, which contains what we want to see:
At this point we can draw the first conclusion: when initializing the connection pool, we can set validityDeadline by using the PoolingHttpClientConnectionManager constructor that takes a time-to-live parameter, which changes how HttpClient manages long connections.
2.6.2 Execution method entry
First find the execution entry method: org.apache.http.impl.execchain.MainClientExec.execute, and see the implementation of keepalive related code:
Let's take a look at the default strategy:
Because the intermediate call chain is relatively simple, the individual calls are not reproduced here. The conclusion is that, for a long connection with no validity period specified, HttpClient sets the validity period to permanent (Long.MAX_VALUE).
Based on the above analysis, we can draw the final conclusion:
HttpClient manages the validity period of long connections by controlling newExpiry and validityDeadline, and for long connections without a specified connection validity period, the validity period is set to permanent.
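The interplay of newExpiry and validityDeadline can be sketched as a small standalone model. It is a simplification written for this article, not HttpClient's source: the class name ExpiryModel and its method signatures are hypothetical, and a non-positive keep-alive value stands for "no validity period specified".

```java
import java.util.concurrent.TimeUnit;

/** Simplified model of how a pooled connection's expiry is derived. */
final class ExpiryModel {

    /** Deadline fixed at creation time from the pool's time-to-live; no TTL means "never". */
    static long validityDeadline(long createdMillis, long timeToLive, TimeUnit unit) {
        return timeToLive > 0 ? createdMillis + unit.toMillis(timeToLive) : Long.MAX_VALUE;
    }

    /**
     * Expiry recomputed each time the connection goes back to the pool.
     * keepAliveMillis <= 0 models a long connection with no validity period specified.
     */
    static long expiry(long releasedMillis, long keepAliveMillis, long validityDeadline) {
        long newExpiry = keepAliveMillis > 0 ? releasedMillis + keepAliveMillis : Long.MAX_VALUE;
        return Math.min(newExpiry, validityDeadline);
    }

    public static void main(String[] args) {
        long now = 1_000L;

        // No TTL and no keep-alive duration: the connection never expires.
        long forever = expiry(now, -1, validityDeadline(now, -1, TimeUnit.SECONDS));
        System.out.println(forever == Long.MAX_VALUE); // true

        // A 60-second TTL caps the expiry even without a keep-alive duration.
        long capped = expiry(now, -1, validityDeadline(now, 60, TimeUnit.SECONDS));
        System.out.println(capped); // 61000
    }
}
```

The second case is the one that matters for this incident: once a time-to-live is configured, the permanent validity period can no longer win the Math.min comparison.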
At this point we can make a bold guess: the long connection's validity period is permanent, and because of some exception the long connection is not closed in time, so it lives on forever and can be neither reused nor released. (This was only a guess based on the symptoms. It turned out not to be entirely correct, but it did speed up our troubleshooting.)
Based on this, we can also manage long connections by changing these two parameters:
After this simple change went online, the number of connections in the CLOSE_WAIT state stopped growing, and the online problem was fully resolved.
But at this point everyone probably has the same question: as a widely used open source framework, is HttpClient really this crude in managing long connections? Can a single failed call really bring down the whole scheduling mechanism with no way of recovering by itself?
With these questions in mind, I went through the HttpClient source code in detail once more.
3. About HttpClient
3.1 Preface
Before starting the analysis, let's briefly introduce the following core classes:
- [PoolingHttpClientConnectionManager] : The connection pool manager. It manages connections and the connection pool, encapsulates connection creation, state transitions, and pool operations, and is the entry point for working with connections and the pool;
- [CPool] : The concrete connection pool implementation. Connections and the pool are implemented in CPool and the abstract class AbstractConnPool, which is the focus of the analysis;
- [CPoolEntry] : The wrapper class for a single connection, holding its basic properties and operations, such as connection id, creation time, and validity period;
- [HttpClientBuilder] : The builder of HttpClient; focus on the build method;
- [MainClientExec] : The class that executes client requests and the entry point of execution; focus on the execute method;
- [ConnectionHolder] : Mainly wraps the methods for releasing a connection, built on top of PoolingHttpClientConnectionManager.
3.2 Two connections
- Maximum number of connections (maxTotal)
- Maximum number of single route connections (maxPerRoute)
- The maximum number of connections, as the name implies, is the maximum number of connections the connection pool is allowed to create;
- The maximum number of single route connections can be understood as the maximum number of connections allowed for the same domain name, and the sum of all maxPerRoute cannot exceed maxTotal.
Take the browser as an example: it connects to both Toutiao and Yiyi. To isolate the two businesses so they do not affect each other, you can set maxTotal to 500 and defaultMaxPerRoute to 400. Toutiao has far more service interfaces than Yiyi, so defaultMaxPerRoute has to be sized for the party with the larger call volume.
3.3 Three Timeouts
- connectionRequestTimeout
- connectTimeout
- socketTimeout
- [connectionRequestTimeout] : the timeout for obtaining a connection from the connection pool;
- [connectTimeout] : the timeout for the client to establish a connection with the server; when it expires, a ConnectTimeoutException is thrown;
- [socketTimeout] : the maximum interval between data packets during transmission after the client and server have established a connection; if it is exceeded, a SocketTimeoutException is thrown.
Note that this timeout does not bound the completion of the whole data transfer, only the interval between two received packets, which is the root cause of many strange online problems.
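The difference between "interval between packets" and "total transfer time" is easy to check with plain JDK sockets, whose SO_TIMEOUT behaves the same way. The sketch below is an illustration with hypothetical names (SocketTimeoutDemo, slowTransferMillis) and made-up timings: a loopback server dribbles out one byte every 50 ms, and a client with a 400 ms read timeout still receives the whole response even though the transfer takes longer than 400 ms in total.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Shows that a socket read timeout limits each wait for data, not the whole transfer. */
final class SocketTimeoutDemo {

    /**
     * A loopback server sends `chunks` bytes, pausing gapMillis before each one.
     * The client reads with SO_TIMEOUT = soTimeoutMillis and returns the elapsed millis.
     * A read that waits longer than the timeout surfaces as an UncheckedIOException.
     */
    static long slowTransferMillis(int chunks, long gapMillis, int soTimeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread sender = new Thread(() -> {
                try (Socket s = server.accept(); OutputStream out = s.getOutputStream()) {
                    for (int i = 0; i < chunks; i++) {
                        Thread.sleep(gapMillis); // slow server: a pause before every byte
                        out.write(i);
                        out.flush();
                    }
                } catch (IOException | InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            });
            sender.start();

            long start = System.nanoTime();
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                client.setSoTimeout(soTimeoutMillis); // applies to every single read()
                InputStream in = client.getInputStream();
                while (in.read() != -1) {
                    // keep reading until the server closes the stream
                }
            }
            sender.join();
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 12 bytes with a 50 ms gap take at least 600 ms, yet a 400 ms timeout never fires.
        System.out.println(slowTransferMillis(12, 50, 400) + " ms, no timeout");
    }
}
```

A request can therefore run far longer than socketTimeout as long as the server keeps sending something within each interval, which is why a separate overall deadline is needed when total latency matters.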
3.4 Four containers
- free
- leased
- pending
- available
- [free] : The container of idle connections, that is, connections that have not been established yet. In theory freeSize = maxTotal - leasedSize - availableSize (HttpClient has no such container; it is introduced here only to make the description easier).
- [leased] : The container of leased connections. After a connection is created, it moves from the free container to the leased container. A connection can also be leased directly from the available container, and once the lease succeeds it is placed in the leased container; this kind of reuse is one of the most important capabilities of a connection pool.
- [pending] : The container of threads waiting for a connection. It is only used to block threads while they wait for a connection to be released and is not mentioned again below. If you are interested, see the implementation code; it is related to connectionRequestTimeout.
- [available] : The container of reusable connections, usually moved here directly from the leased container. For a long connection, the connection is placed in the available list once communication completes. Most connection management and release logic revolves around this container.
Note: Because connections are limited by both maxTotal and maxPerRoute, a container name without a prefix below refers to the whole pool, while r.xxxx refers to the container size for a single route.
Composition of maxTotal
3.5 Connection Generation and Management
1. Loop over the available container to obtain a connection. If a connection has not expired (judged by the expiry field mentioned above), remove it from the available container, add it to the leased container, and return it;
2. If step 1 yields no usable connection, check whether r.available + r.leased is greater than maxPerRoute, which in effect checks whether the route still has a free slot. If it is greater, release the excess connections (r.available + r.leased - maxPerRoute) so that the real number of connections stays within maxPerRoute. Why can r.leased + r.available exceed maxPerRoute? Although the whole state transition is performed under a lock, the transition is not atomic, and some abnormal scenarios leave the state incorrect for a short time. So maxPerRoute is only a theoretical maximum, and the number of connections actually created can briefly exceed it;
3. If the actual number of connections (r.leased + r.available) is less than maxPerRoute and maxTotal > leased: when free > 0, create a new connection; when free = 0, close the earliest created connection in the available container and then create a new one. This looks confusing, but it simply means capacity from the free container is used first, and only when none is left is a connection in the available container released to make room;
4. If there is still no usable connection after the steps above, the thread can only wait for up to connectionRequestTimeout, or until a signal from another thread ends the wait.
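The decision made while obtaining a connection can be condensed into a small pure function. This is a simplified model of the description above, not HttpClient's implementation: the names LeaseDecision, Action, and decide are hypothetical, routeAvailable is assumed to count only connections that have not expired, and the trimming of excess connections in the second step is left out.

```java
/** Simplified model of the lease decision described in the steps above. */
final class LeaseDecision {

    enum Action { REUSE_AVAILABLE, CREATE_NEW, EVICT_OLDEST_AVAILABLE_THEN_CREATE, WAIT }

    /**
     * routeAvailable / routeLeased: idle and in-use connections of the requested route.
     * totalAvailable / totalLeased: the same counts across the whole pool.
     */
    static Action decide(int routeAvailable, int routeLeased, int maxPerRoute,
                         int totalAvailable, int totalLeased, int maxTotal) {
        if (routeAvailable > 0) {
            return Action.REUSE_AVAILABLE; // reuse an idle connection of this route
        }
        boolean routeHasRoom = routeLeased + routeAvailable < maxPerRoute;
        boolean poolHasRoom = totalLeased < maxTotal;
        if (routeHasRoom && poolHasRoom) {
            // free = maxTotal - leased - available, as defined for the "free" container
            int free = maxTotal - totalLeased - totalAvailable;
            return free > 0 ? Action.CREATE_NEW : Action.EVICT_OLDEST_AVAILABLE_THEN_CREATE;
        }
        return Action.WAIT; // block for up to connectionRequestTimeout
    }

    public static void main(String[] args) {
        System.out.println(decide(1, 0, 2, 1, 0, 4)); // REUSE_AVAILABLE
        System.out.println(decide(0, 1, 2, 0, 1, 4)); // CREATE_NEW
        System.out.println(decide(0, 1, 2, 3, 1, 4)); // EVICT_OLDEST_AVAILABLE_THEN_CREATE
        System.out.println(decide(0, 2, 2, 0, 2, 4)); // WAIT
    }
}
```

The third call is the interesting one: the route has room, but other routes' idle connections have filled the pool, so one of them has to be closed before a new connection can be created.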
3.6 Release of the connection
1. If it is a long connection (reusable), remove the connection from the leased container, add it to the head of the available container, and set its validity period (expiry);
2. If it is a short connection (not reusable), close the connection directly and remove it from the leased container. The connection is now released and counts as part of the free container;
3. Finally, wake up the threads waiting in step 4 of "Connection Generation and Management".
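The release path can be modelled in a few lines as well. The sketch below is a toy model under stated assumptions, not the library's code: connections are represented by plain string ids, ReleaseModel and its methods are hypothetical names, and closing a non-reusable connection is reduced to dropping its id.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Simplified model of returning a connection to the pool. */
final class ReleaseModel {
    final Set<String> leased = new HashSet<>();
    final Deque<String> available = new ArrayDeque<>();

    /** Marks a connection as handed out to a caller. */
    synchronized void lease(String connId) {
        available.remove(connId);
        leased.add(connId);
    }

    /** Returns true if the connection was kept for reuse, false if it was dropped. */
    synchronized boolean release(String connId, boolean reusable) {
        leased.remove(connId);
        if (reusable) {
            available.addFirst(connId); // long connection: goes to the head of available
        }
        // A short connection is simply dropped here; a real pool would close its socket.
        notifyAll(); // wake up threads blocked while waiting for a connection
        return reusable;
    }

    public static void main(String[] args) {
        ReleaseModel pool = new ReleaseModel();
        pool.lease("c1");
        pool.lease("c2");
        pool.release("c1", true);   // reusable
        pool.release("c2", false);  // not reusable
        System.out.println("available=" + pool.available + " leased=" + pool.leased);
        // prints: available=[c1] leased=[]
    }
}
```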
Having analyzed the whole process and understood how HttpClient manages connections, the problem we ran into becomes clearer:
Under normal circumstances, although a long connection is established, we close it manually in the finally block. This triggers step 2 in "Release of the connection", and the connection is closed directly. So nothing goes wrong in the normal case, but the long connection never actually delivers its benefit;
The problem can therefore only arise in some abnormal scenario in which the long connection is not closed in time. Combined with the initial analysis that the server actively closed the connection, it most likely occurs when the connection is dropped because of a timeout. Going back to the org.apache.http.impl.execchain.MainClientExec class, we find these lines of code:
connHolder.releaseConnection() corresponds to step 1 in "Release of the connection". At this point the connection is only put into the available container, and its validity period is permanent;
In return new HttpResponseProxy(response, null), the ConnectionHolder passed is null. Given how IOUtils.closeQuietly(httpResponse) is implemented, the connection is not closed in time; it stays in the available container permanently in the CLOSE_WAIT state and cannot be reused;
According to step 3 of "Connection Generation and Management", when the free container is empty HttpClient can actively release connections in the available container, so even a connection parked there permanently should in theory never be impossible to release;
However, combined with step 4 of "Connection Generation and Management", when the free container is empty, obtaining a connection from the pool has to wait for a connection in the available container to be released. The whole process is single-threaded and extremely inefficient, so it inevitably causes congestion and eventually produces a large number of timeouts while waiting for a connection, which matches what we saw online.
4. Summary
- A connection pool has two main functions: connection management and connection reuse. When using a connection pool, close only the current data stream rather than the connection on every call, unless your target addresses are completely random;
- Set maxTotal and maxPerRoute carefully. A sensible allocation achieves business isolation, but if you cannot estimate the values accurately, set them to the same value for now or use two independent HttpClient instances;
- Always set a validity period for long connections by using the PoolingHttpClientConnectionManager(60, TimeUnit.SECONDS) constructor, especially under heavy call volume, to prevent unpredictable problems;
- You can clean up idle connections regularly by setting evictIdleConnections(5, TimeUnit.SECONDS). This helps most when the HTTP interface has short response times and high concurrency: cleaning up idle connections promptly avoids discovering that a connection has expired, and having to close it, only at the moment a connection is requested from the pool, which improves interface performance to some extent.
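What a periodic cleanup pass does to the available container can be sketched with a deterministic model. This is an illustration only, not the library's eviction thread: IdleEvictionModel, Entry, and evict are hypothetical names, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Simplified model of one cleanup pass over the idle (available) connections. */
final class IdleEvictionModel {

    /** An idle connection with the time it was last returned and its expiry. */
    static final class Entry {
        final String id;
        final long lastUsedMillis;
        final long expiryMillis;

        Entry(String id, long lastUsedMillis, long expiryMillis) {
            this.id = id;
            this.lastUsedMillis = lastUsedMillis;
            this.expiryMillis = expiryMillis;
        }
    }

    /** Removes entries that are expired or idle for maxIdleMillis or longer; returns their ids. */
    static List<String> evict(List<Entry> available, long nowMillis, long maxIdleMillis) {
        List<String> closed = new ArrayList<>();
        for (Iterator<Entry> it = available.iterator(); it.hasNext();) {
            Entry e = it.next();
            boolean expired = nowMillis >= e.expiryMillis;
            boolean idleTooLong = nowMillis - e.lastUsedMillis >= maxIdleMillis;
            if (expired || idleTooLong) {
                it.remove(); // a real pool would also close the underlying socket here
                closed.add(e.id);
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        List<Entry> available = new ArrayList<>();
        available.add(new Entry("a", 0L, Long.MAX_VALUE));     // never expires, idle for 10 s
        available.add(new Entry("b", 9_000L, Long.MAX_VALUE)); // never expires, idle for 1 s
        available.add(new Entry("c", 9_500L, 9_800L));         // already expired
        System.out.println(evict(available, 10_000L, 5_000L)); // [a, c]
    }
}
```

Entry "a" is the case from this incident: with a permanent validity period it would never be removed by the expiry check, and only the idle-time limit gets rid of it.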
5. Write at the end
As the most widely used Java-based HTTP client framework, HttpClient has, in the author's opinion, two obvious shortcomings:
- It provides neither an entry point for monitoring connection state nor an extension point through which the connection life cycle can be influenced dynamically from outside. Once a problem occurs online, it can be fatal;
- Connections are obtained under a synchronized lock, which becomes a performance bottleneck under high concurrency, and long-connection management has its own problems: a small mistake can create a large number of abnormal long connections that cannot be released in time, causing a systemic failure.