Background
A customer has deployed DAP and needs to stress test it before going live. A dedicated stress test environment is available, and the target is 1000 concurrent requests. The team uses JMeter as the stress test tool. The system architecture is very simple: Nginx -> Tomcat. The following problems came up during the stress test:
- Stress testing the DAP process submission interface fails
- Deploying an interface that only returns the request parameters, with no business logic, and stress testing it also fails
- Bypassing Nginx and stress testing Tomcat directly passes
- Stress testing Nginx directly, against the Nginx welcome page, fails
Nginx had also been tuned extensively. Nearly every adjustable parameter was tried, but the stress test still failed. In short, whenever a request passes through Nginx, the stress test fails, so the problem appears to lie with Nginx.
Environment
- Nginx: 16 CPUs, 64 GB memory
- Nginx configuration (only the more important parameters are listed here)
worker_processes 16;
worker_connections 1000;
keepalive_timeout 60;
These parameters were adjusted several times with little effect on the results. Of course, an extreme setting, such as worker_processes set to 1, is certain to produce errors.
- Tomcat: 16 CPUs, 64 GB memory
This hardware and configuration should be more than enough for Nginx to handle 1000 concurrent connections.
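As a rough sanity check on that claim, the connection ceiling implied by the configuration can be worked out by hand. The sketch below is only an estimate: the function name is made up for illustration, and halving the limit for the proxy case assumes each proxied client also holds one upstream connection, since worker_connections counts all connections a worker opens and not only client connections.

```python
def max_client_connections(worker_processes: int,
                           worker_connections: int,
                           reverse_proxy: bool) -> int:
    """Rough upper bound on simultaneous client connections for Nginx.

    Hypothetical helper for illustration; real limits also depend on
    file descriptor limits and other OS settings.
    """
    total = worker_processes * worker_connections
    # Assumption: when proxying, each client connection is paired with
    # one upstream connection, so only about half the slots serve clients.
    return total // 2 if reverse_proxy else total


# Values taken from the configuration listed above.
print(max_client_connections(16, 1000, reverse_proxy=False))  # 16000
print(max_client_connections(16, 1000, reverse_proxy=True))   # 8000
```

Either figure is well above the target of 1000, which is consistent with the observation that adjusting these parameters barely changed the result.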
Troubleshooting
First, compare the following two test results.
- Stress test of a simple Tomcat interface: concurrency 1000, 50 loops, 50,000 requests in total
Stress test result:
No request returned an error.
- Stress test of the Nginx welcome page: concurrency 1000, 50 loops, 50,000 requests in total
The error rate is 0.31%. It is not high, but it never goes away. A result in which Nginx performs worse than Tomcat under load is hard to accept. Nginx is widely regarded as one of the strongest load balancing products. One is a load balancer and the other is an application server: what matters most for a load balancer is concurrency, and what matters most for an application server is business processing. If Tomcat really surpassed Nginx in concurrency, it would overturn what many people believe. So something else must be wrong.
The failed requests in the Nginx stress test fall mainly into the following categories:
- java.net.ConnectException: Connection refused: connect
- java.net.SocketException: Connection reset
- java.net.SocketException: Unexpected end of file from server
- org.apache.http.NoHttpResponseException: failed to respond
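When there are many failed samples, it helps to count them per exception type before reading individual messages. The following is a minimal sketch: the function name and the regular expression are illustrative, and the sample lines simply repeat the four messages listed above, not real log output.

```python
import re
from collections import Counter

# Matches a fully qualified Java exception name such as
# java.net.SocketException (lowercase packages, then a class name
# ending in Exception or Error).
EXCEPTION_RE = re.compile(
    r"\b((?:[a-z_]\w*\.)+[A-Z]\w*(?:Exception|Error))\b"
)


def tally_exceptions(lines):
    """Count how often each exception class appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative input only: one line per error type from the list above.
sample = [
    "java.net.ConnectException: Connection refused: connect",
    "java.net.SocketException: Connection reset",
    "java.net.SocketException: Unexpected end of file from server",
    "org.apache.http.NoHttpResponseException: failed to respond",
]

for name, count in tally_exceptions(sample).most_common():
    print(name, count)
```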
These are essentially all network errors. To rule out the network as the cause, two further tests were run.
Running locally
I ran Nginx and Tomcat on my own computer (a Mac) and stress tested them with JMeter, also at a concurrency of 1000, but the results were not good.
Despite the high error rate, neither Nginx nor Tomcat reported any error on the server side. What does this tell us? It suggests that the errors are raised at the kernel level, that is, the failed requests never reached the server processes at all.
Running on a Linux host
JMeter runs on Linux and also has a command line mode, so one option is to find a host on the internal network, on the same network segment as the servers under test, and run the test from there. The steps are as follows:
- Upload the JMeter installation package to the Linux stress test machine and unzip it
- Configure the test plan locally, save it as a jmx file, and upload the jmx file to the Linux stress test machine
- Run the following command to start the stress test in command line mode
apache-jmeter-5.4.1/bin/jmeter.sh -n -t plan.jmx -l plan.jtl
- Download the result file plan.jtl to the local machine, then click Browse in a Summary Report to open the file
The error rate is 0. Combined with the result of the local run, this suggests that the problem is related to the environment of the stress test machine.
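The error rate can also be read from the result file without opening the JMeter GUI. The sketch below assumes that plan.jtl was written in JMeter's CSV format with a header row that includes a success column; if the output format was changed to XML or the header was disabled, it will not work as written. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def summarize_jtl(stream):
    """Return (total, failed, error_rate_percent) for a CSV-format JTL.

    Assumes a header row with a "success" column whose value is the
    text "true" for passed samples.
    """
    total = 0
    failed = 0
    for row in csv.DictReader(stream):
        total += 1
        if row["success"].strip().lower() != "true":
            failed += 1
    rate = failed / total * 100 if total else 0.0
    return total, failed, rate


# Made-up sample rows with a reduced set of columns, for demonstration.
sample = io.StringIO(
    "timeStamp,elapsed,label,responseCode,success\n"
    "1620000000000,12,welcome,200,true\n"
    "1620000000005,15,welcome,200,true\n"
    "1620000000009,3,welcome,Non HTTP response code,false\n"
    "1620000000011,11,welcome,200,true\n"
)
print(summarize_jtl(sample))  # (4, 1, 25.0)
```

For a real run, the same function can be applied to the file itself, for example with `open("plan.jtl", newline="")` as the stream.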
Stress test machine
The stress test machine used so far was a Windows 7 host. If the stress test machine is the cause, why does testing Tomcat work while testing Nginx fails? To find out how the two tests differ, we can capture packets with Wireshark and compare the network traffic of the two runs.
- Packet capture during the Nginx stress test: a large number of RST packets appear, and they keep coming in rapid succession
An RST packet is sent when a connection is reset, that is, aborted abruptly instead of being closed with the normal FIN handshake.
- Packet capture during the Tomcat stress test: RST packets also appear, but noticeably more slowly and in much smaller numbers.
In other words, the Nginx stress test creates and closes connections constantly, while the Tomcat test creates and closes them far less often. The following command shows the connection state on Windows:
netstat -a | find /i /c "TIME_WAIT"
This command counts the connections currently in the TIME_WAIT state. TIME_WAIT means that the side that closed the connection first is waiting out a timeout before the connection is fully released; until then, the local port it occupies is not handed out again. If a large number of TIME_WAIT connections pile up in a short period, the port numbers can easily run out, new connections cannot be created, and requests fail.
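The same count can be taken for every TCP state at once by parsing netstat output, which makes it easier to compare two runs. This is a minimal sketch: the function name is made up, the sample text uses invented addresses, and it assumes the usual netstat layout in which each TCP line contains one of the state names listed in the code.

```python
from collections import Counter

# State names as printed by netstat on Windows and on Linux.
TCP_STATES = {
    "LISTENING", "LISTEN", "ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT",
    "SYN_SENT", "SYN_RECEIVED", "SYN_RECV", "FIN_WAIT_1", "FIN_WAIT_2",
    "FIN_WAIT1", "FIN_WAIT2", "LAST_ACK", "CLOSING", "CLOSED",
}


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state in netstat text output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].upper().startswith("TCP"):
            continue  # skip headers, UDP lines, and blank lines
        for field in fields[1:]:
            if field in TCP_STATES:
                counts[field] += 1
                break
    return counts


# Invented sample in the Windows "netstat -an" layout.
sample = """
  Proto  Local Address          Foreign Address        State
  TCP    192.168.1.10:50001     192.168.1.20:80        TIME_WAIT
  TCP    192.168.1.10:50002     192.168.1.20:80        TIME_WAIT
  TCP    192.168.1.10:50003     192.168.1.20:80        ESTABLISHED
  UDP    0.0.0.0:5353           *:*
"""
print(count_tcp_states(sample)["TIME_WAIT"])  # 2
```

To use it on a live machine, the text can come from `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`, sampled every few seconds while the test runs.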
- Number of TIME_WAIT connections during the Tomcat stress test
The TIME_WAIT count, which represents the connections waiting to be fully closed, grows slowly and stops growing at around two thousand.
- Number of TIME_WAIT connections during the Nginx stress test
Here the count grows rapidly; the total exceeds 20,000 and peaks at about 40,000. So the difference between the two stress tests is the number of TIME_WAIT connections. Why does this happen? The following is my personal guess.
Nginx uses the non-blocking epoll model. Tomcat uses a thread model with one thread per request, which is a blocking model. Nginx can therefore "swallow" a large number of connections in a short time, while requests to Tomcat have to queue. That would explain the pattern above: against Nginx a large number of connections are created in a short time, whereas against Tomcat the number of connections grows slowly.
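Whatever the cause of the faster connection churn, its effect on the load generator can be estimated with simple arithmetic: if every request opens a new connection and each closed connection holds its port for the TIME_WAIT period, the ports in use grow at the connection rate until the pool is empty. The sketch below uses placeholder numbers (port pool size, TIME_WAIT duration, connection rates), not measurements from this test, and the function name is made up.

```python
from typing import Optional


def seconds_until_port_exhaustion(ephemeral_ports: int,
                                  time_wait_seconds: float,
                                  new_connections_per_second: float
                                  ) -> Optional[float]:
    """Estimate when the client runs out of local ports.

    Simplified model: every connection is closed by the client right
    away and then occupies one port for time_wait_seconds. Returns None
    when the steady-state demand fits within the port pool.
    """
    steady_state_ports = new_connections_per_second * time_wait_seconds
    if steady_state_ports <= ephemeral_ports:
        return None  # ports are released as fast as they are consumed
    # No port is released before the pool runs dry, so the pool drains
    # at the raw connection rate.
    return ephemeral_ports / new_connections_per_second


# Placeholder values: 50,000 usable ports, 120 s TIME_WAIT.
print(seconds_until_port_exhaustion(50_000, 120, 1_000))  # 50.0
print(seconds_until_port_exhaustion(50_000, 120, 200))    # None
```

Under this model, a slow connection rate like the one seen against Tomcat never drains the pool, while a fast one like the one seen against Nginx drains it within seconds to minutes.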
Conclusion
Windows is not a suitable stress test machine for high-concurrency tests, because such a test creates a large number of connections in a short time. As a desktop operating system, Windows does not cope with this as well as Linux. In addition, factors such as network fluctuations easily cause errors and distort the test results. Tuning some parameters can reduce the error rate but does not solve the underlying problem. It is recommended to use a Linux host on the same network segment as the servers under test as the stress test machine. The test plan can be configured on Windows and the results can be analyzed on Windows, but the test itself should be run on the Linux host.