Author: Fan Yong
What is a stress test
Stress testing is a testing method used to evaluate system stability. It is usually performed beyond the system's normal operating range in order to examine its functional limits and expose hidden risks.
Stress testing is mainly used to determine the carrying capacity of a server, including its user capacity, that is, how many users can use the system at the same time without degrading service quality or throughput. In addition, endurance (fatigue) tests can reveal stability problems, such as whether the connections in a connection pool are exhausted, memory is exhausted, or the thread pool is exhausted; these problems can usually only be found and located through long-running fatigue tests.
Why stress test
The purpose of a stress test is to measure machine performance (single-machine QPS and TPS) by simulating the behavior of real users, and from that to calculate how many machines the system needs in order to support a specified number of users (for example, one million). The target value therefore has to be set before the test starts; it should be neither too small nor too large, and should be based on a reasonable estimate of expected business growth. A stress test is a rehearsal, performed before going live, for the number of users that may be reached in the future. Based on the results, either the application's performance is optimized or enough machines are provisioned to guarantee the user experience. Stress testing can also verify the stability of the application system during transaction peaks, surface problems that may occur, and expose weak links in the application system so that they can be strengthened in a targeted way.
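As a purely illustrative calculation (the numbers are assumptions, not from the original text): if the estimated peak load is 10,000 TPS and a single machine sustains 500 TPS in the stress test, then roughly 10,000 / 500 = 20 machines are needed, plus some headroom for redundancy and future growth.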
Stress test classification
These kinds of tests can be interleaved. Generally, the endurance test is scheduled after the performance indicators of the stress test have reached their targets.
Explanation of stress test terms
Common stress testing tools
ab
ApacheBench (ab for short) is a web stress testing tool that ships with the Apache HTTP server. ab is a command-line tool with very low local resource requirements for generating load. With a single ab command, many concurrent requests can be issued to simulate multiple visitors accessing a URL at the same time, so it can be used to test the load pressure a target server can bear. In general, ab is small and simple, quick to learn, and provides the basic performance indicators you need, but it has no graphical output and offers no monitoring.
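For example (an illustrative invocation; the URL is a placeholder), a command of the following form sends 1,000 requests at a concurrency of 100 and prints throughput and latency statistics:
ab -n 1000 -c 100 http://127.0.0.1:8080/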
JMeter
Apache JMeter is a Java-based stress testing tool developed by the Apache Software Foundation. It was originally designed for testing web applications but has since expanded to other testing areas.
JMeter can also do functional/regression testing of your application by creating scripts with assertions to verify that your program returns the results you expect.
JMeter is extremely powerful; its usage is not covered here for now. You can consult the related documentation (the tutorial documents recommended in the references).
LoadRunner
LoadRunner is a performance testing tool from HP (originally Mercury). It is very powerful and is used by many enterprise customers. For details, please refer to the official website.
Alibaba Cloud PTS
Performance Testing Service (PTS) is a performance testing tool. It supports launching stress test tasks on demand, can generate traffic of millions of concurrent users and tens of millions of TPS, and is 100% compatible with JMeter. Features such as scenario orchestration, API debugging, traffic customization, and traffic recording make it possible to create business stress test scripts quickly, accurately simulate different levels of user access to the business system, and help the business quickly improve system performance and stability.
As a performance testing tool used by Alibaba for many years, PTS has the following features:
- O&M-free and out of the box. SaaS-based load generation, supporting self-initiated traffic of up to millions of concurrent users and tens of millions of TPS.
- Multi-protocol support: HTTP/1.1, HTTP/2, JDBC, MQTT, Kafka, RocketMQ, Redis, WebSocket, RTMP, HLS, TCP, UDP, Spring Cloud, Dubbo, gRPC, and other mainstream protocols.
- Traffic customization: customization of global load-generation regions, carrier traffic, and IPv6 traffic.
- Stable and secure. Alibaba's self-developed engine, refined through many years of Double Eleven scenarios, with support for stress testing inside VPC networks.
- A one-stop performance stress testing solution. Complex stress test scenarios can be built with zero coding, covering the complete stress testing life cycle: scenario construction, stress model configuration, load generation, problem analysis and localization, and report output.
- 100% compatible with open-source JMeter.
- Provides a safe, non-invasive solution for write stress testing in the production environment.
Comparison of stress testing tools
How to choose a stress testing tool
There is no best tool in this world, only the most suitable one. Among the thousands of tools available, the most important thing is to choose the one that fits your own scenario. In practice, the following steps can help you choose and use a tool:
- Determine the performance stress testing goals: these may come from project plans, business requirements, and so on.
- Determine the performance stress testing environment: to get the most out of the test, the environment should be as consistent as possible with the production environment.
- Determine the pass criteria: based on the testing goals and the chosen environment, formulate the pass criteria; if the environment differs from production, the criteria should be relaxed accordingly.
- Design the performance stress test: arrange the test links, construct the test data, and simulate the real request paths and request load as closely as possible.
- Execute the performance stress test: use the stress testing tool to carry out the test as designed.
- Analyze the result report: analyze and interpret the report to determine whether the test achieved the expected goals; if not, analyze the reasons based on the report.
It can be seen from the steps above that a successful performance stress test involves multiple links, from scenario design to load generation to analysis, none of which can be omitted. As the saying goes, to do a good job you must first sharpen your tools: a suitable performance testing tool means we can complete a reasonable stress test in the shortest possible time and get twice the result with half the effort.
Java Application Performance Troubleshooting Guide
Problem categories
Performance problems come in all shapes and sizes, so it is necessary to abstract and categorize them. Here we classify performance problems along two dimensions: the first is the resource dimension and the second is the frequency dimension.
Problems in the resource dimension: high CPU, improper memory usage, network overload.
Problems in the frequency dimension: transactions that are persistently slow, and transactions that are occasionally slow.
Each type of problem has a corresponding approach; using the wrong methods or tools makes it impossible to troubleshoot and locate problems quickly and accurately.
Locating and tuning performance problems found in stress tests is a technical job that requires a combination of abilities: personal technical skill, experience, sometimes a bit of intuition and inspiration, and also a certain amount of communication skill, because the person locating the problem is often not the person who discovered it, and it takes continuous communication to find clues. The technical knowledge involved goes far beyond the programming language itself; it also requires solid fundamentals such as operating system principles, networking, compiler principles, and the JVM, and not just a superficial understanding but a real command of them. For example, TCP/IP must be mastered in depth, and for the JVM you must deeply understand the memory layout, the memory model, and the GC algorithms. This is why some junior and intermediate engineers are dumbfounded when they encounter performance problems and do not know where to start. With solid technical fundamentals, some hands-on experience, and a set of methods of your own, you will not be confused when problems arise: you can quickly clear away the fog and find the crux of the problem.
In this article the author also presents some typical performance problems located and investigated in actual work. Each case introduces the background of the problem, the symptoms reported by front-line staff together with their preliminary investigation conclusions, the symptoms observed after the author intervened, and then, with the help of some common problem-locating tools, the whole process of finding the problem and its root cause.
Analytical thinking framework
When encountering a performance problem, we must first define and classify it based on the various symptoms and some simple tools, and then do further localization and analysis. You can refer to the decision diagram summarized by the author in Figure 1. It is a summary of the performance localization and tuning process from recent ToB projects in the financial industry; it is not necessarily suitable for every problem, but it at least covers the troubleshooting process for the performance problems encountered in recent projects. In the next chapter we expand on each type of problem and attach some real, classic cases. These cases are real and fairly representative, and in many of them the customers themselves had spent a long time without finding the root cause. GC problems are not analyzed in depth in this article; the author will write a dedicated article on GC problems in the future.
Memory overflow
Memory overflow problems can be further divided, roughly by how often they occur, into heap memory overflow, stack memory overflow, Metaspace memory overflow, and native memory overflow. Each overflow situation is analyzed in detail below.
- Heap memory overflow
I believe everyone has encountered this kind of problem at one time or another. The root cause is that the heap memory requested by the application exceeds the value set by the -Xmx parameter, which leaves the JVM basically unusable. As shown in Figure 2, the sample code simulates a heap memory overflow, with the heap size set to 1MB at run time. The result is shown in Figure 3: an OutOfMemoryError is thrown, and the corresponding message is "Java heap space", indicating that the overflowing area is the heap.
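Since the code in Figure 2 is not reproduced here, the following is a minimal sketch of this kind of simulation (an assumption about what such demo code typically looks like, run with -Xmx1m):

```java
import java.util.ArrayList;
import java.util.List;

// Run with: java -Xmx1m HeapOom
public class HeapOom {
    public static void main(String[] args) {
        List<byte[]> holder = new ArrayList<>();
        while (true) {
            // Keep strong references so nothing can be collected; eventually
            // "java.lang.OutOfMemoryError: Java heap space" is thrown.
            holder.add(new byte[64 * 1024]);
        }
    }
}
```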
- Stack memory overflow
This kind of problem is mainly caused by an excessively deep method call chain, an incorrect recursive call, or an improper -Xss setting. As shown in Figure 4, a simple infinite recursion is enough to cause a stack memory overflow; the error is shown in Figure 5, where a StackOverflowError is thrown. The -Xss parameter sets the maximum stack size of each thread; the default in JDK 8 is 1MB, and under normal circumstances this parameter does not need to be modified. If the method call depth really is very deep and the default 1MB is not enough, then you need to increase the -Xss parameter.
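The code in Figure 4 is not reproduced here either; a minimal sketch of such an infinite recursion (assumed demo code) is:

```java
// Each call adds a stack frame with no termination condition,
// until java.lang.StackOverflowError is thrown.
public class StackOom {
    private static long depth = 0;

    private static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError at depth " + depth);
        }
    }
}
```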
- Native memory overflow
This kind of overflow occurs when the JVM uses off-heap (native) memory and exceeds the maximum memory available to the process, or when the off-heap memory exceeds the value specified by the -XX:MaxDirectMemorySize parameter. As shown in Figure 6, the MaxDirectMemorySize parameter needs to be configured; without it, the problem is hard to simulate, because the author's machine is a 64-bit machine and the amount of off-heap memory available is enormous. The result of running the program is shown in Figure 7. The exception thrown is also OutOfMemoryError, similar to a heap overflow, but the message is "Direct buffer memory", which is different from the message for a heap overflow. Pay special attention to this message; it is very important for locating the problem accurately.
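A minimal sketch of this kind of simulation (assumed demo code, run with -XX:MaxDirectMemorySize=10m) is:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with: java -XX:MaxDirectMemorySize=10m DirectOom
public class DirectOom {
    public static void main(String[] args) {
        List<ByteBuffer> holder = new ArrayList<>();
        while (true) {
            // Direct buffers live outside the Java heap; holding references to them
            // eventually triggers "java.lang.OutOfMemoryError: Direct buffer memory".
            holder.add(ByteBuffer.allocateDirect(1024 * 1024));
        }
    }
}
```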
- Metaspace memory overflow
Metaspace was introduced in JDK 8; in earlier versions the corresponding area was called the permanent generation (PermGen), and its purpose is roughly the same. Simulating a Metaspace overflow is very simple: as shown in Figure 8, classes are dynamically generated and loaded into the JVM through cglib, and this class metadata is stored in Metaspace. To reproduce the problem quickly, MaxMetaspaceSize is set to 10MB. The execution result is shown in Figure 9: an OutOfMemoryError is still thrown, but the message becomes "Metaspace".
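A minimal sketch of this kind of simulation (assumed demo code using the cglib Enhancer API; cglib must be on the classpath, and the program is run with -XX:MaxMetaspaceSize=10m) is:

```java
import net.sf.cglib.proxy.Enhancer;
import net.sf.cglib.proxy.MethodInterceptor;

// Run with: java -XX:MaxMetaspaceSize=10m MetaspaceOom
public class MetaspaceOom {
    static class Target {}

    public static void main(String[] args) {
        while (true) {
            Enhancer enhancer = new Enhancer();
            enhancer.setSuperclass(Target.class);
            // Disable the cache so a brand-new proxy class is generated on every
            // iteration, filling Metaspace until
            // "java.lang.OutOfMemoryError: Metaspace" is thrown.
            enhancer.setUseCache(false);
            enhancer.setCallback((MethodInterceptor) (obj, method, argv, proxy) ->
                    proxy.invokeSuper(obj, argv));
            enhancer.create();
        }
    }
}
```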
These four are the most common types of JVM memory overflow. If you know the cause of each one, you can locate it quickly and accurately. The following is an analysis of some real, classic cases the author has encountered.
- Case: heap memory overflow
This kind of problem is also easy to investigate, provided that when the heap overflows, the heap is automatically dumped to a file, or a heap dump is triggered with the jmap command (or a thread dump with kill -3) during the stress test. Then a heap analysis tool such as IBM HeapAnalyzer can be used to find out which objects occupy the most memory, and from there the cause of the problem.
If you want the heap to be dumped automatically when an OOM occurs, add the following parameters to the startup parameters:
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/usr/local/oom
To obtain a thread dump or heap dump manually, use the kill -3 command (which prints a thread dump) or the jstack and jmap commands.
jstack -l pid > stackinfo: this command dumps the thread information to a text file; download the file locally and analyze it with a thread dump analysis tool such as IBM Thread and Monitor Dump Analyzer.
jmap -dump:format=b,file=./jmap.hprof pid: this command dumps the heap memory to the file jmap.hprof in the current directory; download it locally and analyze it with a heap analysis tool such as IBM HeapAnalyzer. According to the 80/20 rule, identifying the objects that consume the most memory solves 80% of the problems.
Figure 10 comes from a real case. The symptoms were as follows: after the stress test started, everything was normal for the first ten minutes or so, but after about ten minutes TPS gradually dropped until the client could no longer even establish TCP connections. The customer initially thought there was a problem with the Linux network stack parameters on the server side that prevented TCP connections from being established. The evidence given was that there were a large number of connections in the TIME_WAIT state on the server, and the customer asked to tune the Linux kernel network parameters to reduce the number of TIME_WAIT connections. What is TIME_WAIT? Here we have to bring out the time-honored TCP state machine diagram, shown in Figure 11; against this diagram you can trace the ins and outs of TIME_WAIT. TIME_WAIT mainly occurs on the side that actively closes the connection; of course, if both sides close the connection at the same time, both will enter the TIME_WAIT state. In the four-way handshake that closes a connection, the final ACK is sent by the side that actively closes. If this final ACK is lost, the other side retransmits its final FIN, so the actively closing side must keep state information so that it can resend the final ACK. If this state information is not kept, it will respond to the retransmitted FIN with an RST, which the peer interprets as an error (in Java a "connection reset" SocketException is thrown). Therefore, in order to terminate a TCP full-duplex connection cleanly, the loss of any of the four segments in the termination sequence must be handled, and the side that actively closes must keep the state information and enter the TIME_WAIT state.
Figure 10 Real heap memory overflow case 1
Figure 11 TCP state machine
Following the information provided by the customer, I checked the stress test client: the HTTP protocol was used, keep-alive was enabled, and a connection pool was used to interact with the server, so in theory there should not have been so many TIME_WAIT connections on the server. One possibility was that TPS was relatively high when the client started the test and occupied a large number of connections; after performance degraded, those connections sat idle and did not exchange keep-alives with the server in time, so the server actively closed them. At this point, however, this was only a guess.
To locate the problem more accurately, I decided to go to the front-line site and look at the situation myself. When TPS dropped severely, preliminary checks with commands such as top and vmstat showed that CPU usage was not particularly high, around 70%, but the memory occupied by the JVM was almost at the value configured by the -Xmx parameter. I then used the jstat -gcutil -h10 pid 5s 100 command to check the GC situation, and the data was clearly abnormal: the old generation was close to 100% full, 7 full GCs were performed within 5 seconds, and the eden area was at 100%; because the old generation was full, young-generation GC had stalled, which is obviously abnormal. While the JVM was still alive, I quickly executed jmap -dump:format=b,file=./jmap.hprof pid to take a snapshot of the whole heap, a full 5GB. After downloading it, I analyzed the heap file with IBM HeapAnalyzer; the result is shown in Figure 10. After some searching, a certain object was found to account for a very large proportion of the heap, about 98%. I continued to track the objects holding it and finally located the problem: a certain resource was being acquired but never released. After the fix, the problem was completely resolved; an 8-hour endurance test found no further problems and TPS remained stable throughout.
Figure 12 Statistical analysis of GC situation
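For reference, in the output of jstat -gcutil the E and O columns show the utilization of the eden space and the old generation as percentages, and YGC/FGC are the cumulative counts of young and full GC events; the abnormal pattern described above corresponds to O close to 100, E stuck at 100, and FGC climbing rapidly.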
Finally, let's look at why there were so many TIME_WAIT connections. It is consistent with the initial guess: a large number of idle connections were actively closed by the server, so a large number of connections ended up in the TIME_WAIT state.
High CPU
- Case
A banking customer in the financial industry found during a stress test that TPS was extremely low and the transaction response time was close to an astonishing 30 seconds, far below the target. The service response times are shown in Figure 23, taken from the tracer log recorded by the application; the numbers are not encouraging. The application is built on SOFA and deployed in a container on the proprietary cloud, with a container spec of 4C8G, and it uses the OceanBase database. While transactions were slow, the customer used the top and vmstat commands to check OS metrics in the corresponding container and found that memory usage was normal but CPU was close to 100%. A thread dump obtained with the jstack command, shown in Figure 22, revealed a large number of threads stuck on acquiring a database connection, and the application log also contained a large number of errors about failing to obtain a DB connection. This led the customer to believe that the connection pool did not have enough connections, so they kept increasing the MaxActive parameter (the DB connection pool is Druid). After increasing the parameter there was no improvement in performance, and the failure to obtain connections remained. After about two weeks of investigation without any substantial progress, the customer turned to Alibaba GTS for help.
The author happened to be at the customer site and stepped in to locate the performance problem. After communicating with the customer and reviewing the historical troubleshooting records, and based on past experience, it was clear that this problem was definitely not caused by an insufficient maximum number of connections in the pool, because the customer had already raised MaxActive to a terrifying 500 and the problem remained. Some useful information can still be seen in Figure 22: the number of threads in the WAITING state was as high as 908 and the number in the RUNNABLE state was as high as 295, both alarming numbers. With so many threads runnable, the CPU was busy switching thread contexts, spinning furiously without actually doing much meaningful work. On inquiry, it turned out that the customer had raised the SOFA business thread pool size to 1,000; the default is 200.
Figure 22 Thread stuck in acquiring connection in DB connection pool
Figure 23 Screenshot of slow transaction
It can basically be concluded that the customer had fallen into the dilemma of "treating the head when the head aches and the foot when the foot hurts", treating the symptoms rather than the root cause. Further communication with the customer confirmed this. At first SOFA reported that its thread pool was full, so the customer kept increasing the maximum number of SOFA business threads, eventually to 1,000, with no obvious improvement in performance. Then errors appeared about being unable to obtain database connections, which the customer took to mean the number of database connections was insufficient, so the Druid MaxActive parameter was increased. In the end, no matter how the parameters were adjusted, performance did not improve, and memory was even close to bursting: as shown in Figure 24, business DO objects filled up the heap, and the customer then suspected a memory leak. For this class of problems (the database connection pool is exhausted, acquiring a connection from the pool times out, or the thread pool is exhausted), as long as the parameters are set within a reasonable range, nine times out of ten the root cause is simply that the transaction itself is too slow. Further investigation finally located the cause: a certain SQL statement together with some internal processing made the transaction slow. After the fix, TPS returned to normal. Finally, the maximum thread pool size and the DB connection pool parameters were set back to the values recommended in best practice, the stress test was run again, TPS stayed at a normal level, and the problem was finally solved.
Figure 24 Memory fills up with business domain objects
Although this case comes down to the typical combination of high CPU and persistently slow transactions, as the case shows, it is easy when locating and tuning to fall into the trap of treating symptoms rather than the root cause, and easy to be misled by surface phenomena. How do you clear away the clouds and see the moon? In the author's view it is 5 parts experience, 1 part inspiration and luck, and 4 parts persistent analysis. What if you have no experience? Then settle down and analyze the relevant performance artifacts, whether thread dump files, JFR recordings, or performance information collected by other tools; do not let any clue slip by, and finally ask experienced experts for help in troubleshooting and resolving the problem.
- Use JMC+JFR to locate problems
If the extremely slow transactions only occur occasionally, here is a relatively simple and very practical method: use JMC+JFR (refer to the linked documentation for usage). However, the JMX and JFR features must be enabled before use, which requires modifying the startup parameters; the specific parameters are listed below and should not be carried into production. In addition, if the corresponding port on the container's host is exposed on the same port as jmxremote.port (32433 in the example below), you can also use the JConsole or JVisualVM tools to observe the state of the JVM in real time, which is not described in detail here.
-Dcom.sun.management.jmxremote.port=32433
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-XX:+UnlockCommercialFeatures -XX:+FlightRecorder
Let's take an actual JFR recording as an example.
First, to enable the JMX and JFR features, add the JMX and JFR enabling parameters described above to the startup parameters, and then execute the following command inside the container. After execution it displays "Started recording pid. The result will be written to xxxx", which means recording has started; start the stress test at this point. The duration in the command below is 90 seconds, meaning the recording stops automatically after 90 seconds. After recording, download the file locally and analyze it with the JMC tool; if you do not have this tool, you can also use IDEA for the analysis.
jcmd pid JFR.start name=test duration=90s filename=output.jfr
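If you need to write the recording to a file before the duration expires, or stop it early, jcmd also provides JFR.dump and JFR.stop, for example (a usage sketch reusing the recording name above):
jcmd pid JFR.dump name=test filename=partial.jfr
jcmd pid JFR.stop name=test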
By analyzing the flame graph (refer to the link for how to read a flame graph), we can see which methods consume the most time, which makes analyzing the problem much easier.
You can also view the call tree to see where the time is mainly being spent.
JMC tool download address: JDK Mission Control (JMC) 8 Downloads (oracle.com)
Finally, let me introduce one more tool: Arthas, open-sourced by Alibaba, which is also a powerful tool for performance analysis and localization. Its usage is not covered here; you can refer to the Arthas official website.
- How to locate threads and methods with high CPU consumption
First find the PID of the Java process, then execute top -H -p pid to find the most CPU-consuming threads, as shown in the figure below. Then use printf "%x\n" 17880 to convert the thread ID to hexadecimal, and finally search the jstack thread dump for this hexadecimal value to find the thread that is occupying the most CPU.
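As a worked illustration using the thread ID above: printf "%x\n" 17880 prints 45d8, and in the jstack output each thread line carries its native thread ID in the form nid=0x45d8, so the offending thread and its stack can be found with a search such as:
jstack pid > stackinfo
grep -A 20 "nid=0x45d8" stackinfo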
Other problem cases
When this type of problem occurs, the JVM looks as calm as still water: CPU and memory usage are at normal levels, yet transactions are slow. In this situation, dump the threads several times during the slow transaction, or use JFR to record a period of JVM execution. The most likely cause of this type of problem is that most threads are stuck on some IO operation or blocked on a lock. Two real cases follow.
- Case 1
A leading financial insurance customer had a transaction that responded very slowly, often taking more than 10 seconds. The application was deployed in a public cloud container with a 2C4G spec, and the database was OceanBase. The problem could be reproduced every time. The distributed tracing tool could only narrow the slowness down to a certain service and could not pinpoint which method was stuck. During the slow transaction, checking the OS with the top and vmstat commands showed CPU and memory resources at normal levels, so it was necessary to look at the state of the threads during the transaction. The transaction thread was dumped while the transaction was running slowly, as shown in Figure 29, and the method the corresponding thread was stuck on could be located: in this case the thread was stuck in the socket-read phase, that is, stuck reading from the database. If this method still does not help, packet capture can also be used to locate the problem.
Figure 29 Example of a hung transaction
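As a side note (not shown in the original figure), a thread blocked reading from the database over a socket typically shows a frame such as java.net.SocketInputStream.socketRead0(Native Method) near the top of its stack in the thread dump, which is a strong hint that the thread is waiting on the network or the database rather than burning CPU.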
- Case 2
During a stress test, a banking customer in the financial industry found that TPS would not rise, staying below 10, and that the response time was extremely high. After a period of training, enablement, and practice, the customer already had some ability to locate performance problems on their own. Their feedback was that SQL execution time, CPU, and memory usage were all normal. The customer printed a thread dump file and found that most threads were stuck on a distributed lock implemented with RedissonLock, as shown in Figure 30. It turned out that the customer was not using the distributed lock properly; after the problem was fixed, TPS increased 20-fold.
Figure 30 Example of problems caused by improper use of distributed locks
These two cases are actually not complicated and are easy to investigate; the point is to reiterate the overall idea and method for troubleshooting such problems. If transactions are slow while resource usage is normal, you can locate the problem by analyzing thread dump files or JFR recordings. Such problems are generally caused by IO bottlenecks or blocking on locks.
Summary
Problems come in thousands of forms, but as long as you cultivate deep enough fundamentals, form your own set of troubleshooting ideas and methods, and back them up with a set of tools for solving problems, then with your accumulated experience and the occasional flash of inspiration, I believe every problem can be solved.
For more discussion, you are welcome to join the PTS user DingTalk group: 11774967.
In addition, PTS has recently upgraded its sales model, and the price of the basic edition has dropped by 50%: the 50,000-concurrency package costs only 199, removing the trouble of operating and maintaining your own stress testing platform. There is also a 0.99 trial edition for new users and a VPC stress testing exclusive edition. You are welcome to purchase!
Click here to go to the official website to see more!