Author: Zhiyun
Why do we need to locate problems during stress testing?
Performance Testing Service (PTS) is a SaaS stress testing platform with powerful distributed stress testing capabilities. It can simulate the real business scenarios of a large number of users and comprehensively verify the performance, capacity and stability of business sites.
While continuously measuring the load level of the server under test, we can see fairly comprehensive stress test indicators, such as QPS, RT, and TPS, in the stress test view or the stress test report. These indicators alone, however, do not let us quickly locate the specific problem on the server. For example, the error information center of the whole scenario shows the response body of the interface for a given error code, but the report alone does not show which downstream link failed or what the error stack is, and these are exactly what users care about.
With the help of problem diagnosis, we can clarify the upstream and downstream calls of the interface under test. From the link view, we can also see the components that the entire link passes through: message components (Kafka, RocketMQ, etc.), caches (Redis, MongoDB, etc.), databases (MySQL, Oracle, etc.), and RPC calls (Feign, Dubbo, HttpClient, etc.). For example, if an interface returns an abnormal status code or another error, the call chain shows whether the problem lies in an RPC call or in a database read or write, together with the corresponding error stack. Based on this information, it is relatively clear where the problem is located.
Basic introduction and core advantages of problem diagnosis
Basic introduction
When it comes to problem diagnosis, users mainly want to know whether connecting to it requires a series of changes to application code, whether complex configuration is needed, and so on. The problem diagnosis provided by PTS is based on JavaAgent and does not require any modification of business code on the user side. For Tomcat-based deployments, users only need to add some necessary parameters to the startup script. For Kubernetes users, it is enough to add some necessary annotations to the YAML configuration file. For link collection rules, PTS provides a default configuration, which users can change according to their own needs.
When PTS with integrated problem diagnosis runs a stress test, a TraceId is generated on the stress testing engine side for each request, and the upstream and downstream links involved in the request are associated with that TraceId. Starting from the request as the entry point, users can see the complete call chain involved until the request ends. Problem diagnosis also generates an application topology view for the call chain, so that users can clearly see the call relationships between applications.
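The TraceId association described above can be illustrated with a minimal sketch. It is not the PTS implementation: the header name X-Trace-Id, the in-memory span store, and the call relationships between the petstore applications are all assumptions made for illustration.

```python
import uuid
from collections import defaultdict

# Hypothetical header name; the article does not say which header PTS uses.
TRACE_HEADER = "X-Trace-Id"

# In-memory stand-in for the span store that a real probe would report to.
SPANS = defaultdict(list)


def new_trace_id() -> str:
    """Generate one TraceId per request on the load-generator side."""
    return uuid.uuid4().hex


def handle(service: str, headers: dict, downstream: list) -> None:
    """Record a span for `service`, then call each downstream service
    with the same headers so the TraceId travels along the chain."""
    trace_id = headers[TRACE_HEADER]
    SPANS[trace_id].append(service)
    for child, grandchildren in downstream:
        handle(child, dict(headers), grandchildren)


def call_chain(trace_id: str) -> list:
    """Return the services that handled the request, in call order."""
    return list(SPANS[trace_id])


trace_id = new_trace_id()
# Assumed topology: web calls user and order; order calls cart.
topology = [("petstore-user", []), ("petstore-order", [("petstore-cart", [])])]
handle("petstore-web", {TRACE_HEADER: trace_id}, topology)
print(call_chain(trace_id))
```

Because every span carries the same TraceId, looking up that one identifier is enough to rebuild the whole chain for a request.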
For abnormal interfaces, the call chain shows the corresponding error causes, and users can troubleshoot and optimize server-side problems according to the specific error stack. During the stress test, users can view the call chain of a specified request in real time. After the stress test is completed, problems can also be traced back from the stress test report.
Core advantages
1. Zero code intrusion: for Java services, users can connect the problem diagnosis probe without modifying business code.
2. High integration: stress testing, monitoring, and problem diagnosis are integrated in the same console, which keeps the cost of understanding and operating them relatively low.
3. Full monitoring indicators: in addition to the basic monitoring indicators of the stress test, interface-level, machine-level, and application-level monitoring is provided for each service.
4. Low threshold: connecting the problem diagnosis probe requires only a few simple configuration parameters. The probe also supports features such as multi-protocol mocking and full-link stress testing.
Getting started with problem diagnosis
The basic workflow for connecting to problem diagnosis is as follows:
Connect the probe and check whether the connection is successful
First, sort out the applications involved in the stress test scenario and, for each of them, connect the problem diagnosis probe by following the steps in the [Problem Diagnosis] -> [Probe Access [1]] document. You can check whether an application's probe is connected in the application configuration of the PTS console, or on any of the application monitoring, interface monitoring, and machine monitoring pages. The stress test scenario in this demonstration involves five applications: petstore-web, petstore-user, petstore-order, petstore-catalog, and petstore-cart. Taking application monitoring as an example: in the PTS console, click [Problem Diagnosis] -> [Application Monitoring [2]] and select the Region and Namespace you configured. If all the applications involved in the stress test scenario appear on this page, the applications are connected successfully.
Turn on the problem diagnosis switch in the stress test scenario
Then create a stress test scenario in [Stress Test Center] -> [Create Scene [3]] in the PTS console. You can choose a PTS scenario, a JMeter scenario, and so on. Here we take the PTS scenario as an example. Because this demonstration is mainly about verifying the problem diagnosis capability, you need to turn on the problem diagnosis switch in [Advanced Settings] of the scenario configuration. For the specific monitoring and collection rules, PTS pushes a default configuration in which the collection switch is turned on and the sampling rate is set to 1/1,000. Users can also customize these settings according to their own needs.
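To get a feel for what a 1/1,000 sampling rate means, here is a minimal sketch of hash-based trace sampling. It is an illustration only: the article does not describe how the PTS probe decides which requests to sample, and the function name and trace ID format are hypothetical.

```python
import hashlib

SAMPLE_DENOMINATOR = 1000  # 1/1,000, the default rate mentioned above


def is_sampled(trace_id: str, denominator: int = SAMPLE_DENOMINATOR) -> bool:
    """Keep a trace when its hash falls into one bucket out of `denominator`.

    Hashing the trace ID makes the decision deterministic, so every service
    that sees the same ID reaches the same keep/drop decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % denominator
    return bucket == 0


# Hypothetical trace IDs, used only to show the effect of the rate.
trace_ids = [f"trace-{i}" for i in range(100_000)]
kept = sum(is_sampled(t) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

At this rate only about one request in a thousand carries a full call chain, so a low-traffic interface may have few or no sampled chains. Raising the rate gives more chains at the cost of more collection overhead.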
Start stress test and view application monitoring
After completing the above steps, the stress test scenario is able to diagnose problems. Once the stress test has started, you can go to application monitoring, interface monitoring, or machine monitoring and select the service you care about to view its monitoring data. Here we take application monitoring [2] as an example. The steps for the other types of monitoring are similar. We select the petstore-user service to view application monitoring, as shown in the following figure:
After the stress test is completed, view the error information of the whole scene
After the stress test is over, we check the problems of the server under test from the stress test report of the corresponding scenario. The specific steps are: PTS console -> [Stress Test Center] -> [Report List [4]], then select the corresponding stress test report. The overview page shows the information of the whole scenario, as shown in the following figure:
Select probe sampling to view the specific call chain situation
Click [View Sampling Log], and select "Probe Sampling" as the sampling type to filter out the call chain collected by the problem diagnosis probe, as shown in the following figure:
View the specific error stack information of the call chain to locate the server-side problem
After filtering out the call chains collected by the probe, you can analyze the call chain of the problematic interface. For example, the product list interface returns status code 500. Click View Details to see the specific reason, as shown in the following figure:
From the call stack, you can see the specific cause of the error, so that the server code can be optimized and repaired. You can also view the calls between services and the database usage through the application topology view and the database view. Here is an example of the application topology view, as shown in the following figure:
Summary of common error codes in stress test reports
Problem diagnosis error code summary
The common error codes in the problem diagnosis call chain are summarized as follows:
- java.lang.NullPointerException: a null pointer on the server side. Check the server-side code according to the error stack in the call chain.
- com.microsoft.sqlserver.jdbc.SQLServerException: a SQL statement on the server side reported an error. Check the server-side SQL syntax according to the stack information collected in the call chain.
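When an error stack copied from the call chain is long, the innermost "Caused by:" entry is usually the one to start from. The sketch below extracts it from stack trace text. It is a small helper for illustration, not part of PTS. The sample trace is made up, and the pattern only recognizes class names ending in Exception or Error.

```python
import re

# A Java stack trace prints the outermost exception first and each nested
# cause on a line starting with "Caused by:"; the last one is the root cause.
HEADER = re.compile(r"^(?:Caused by: )?([\w.$]+(?:Exception|Error))(?:: (.*))?$")


def root_cause(stack_trace: str):
    """Return (exception_class, message) of the innermost exception, or None."""
    cause = None
    for line in stack_trace.splitlines():
        match = HEADER.match(line.strip())
        if match:
            cause = (match.group(1), match.group(2) or "")
    return cause


# Made-up trace for illustration; class and method names are hypothetical.
sample = """\
org.example.ServiceException: failed to list products
\tat org.example.CatalogController.list(CatalogController.java:42)
Caused by: java.lang.NullPointerException
\tat org.example.CatalogService.load(CatalogService.java:17)
\t... 12 more
"""
print(root_cause(sample))
```

The frame right below the root cause line points to the server-side code to inspect first.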
Stress test report error code summary
The common errors in the stress test report are listed here. The relevant error information can be found in the error information of the whole scenario, as follows:
- class java.net.SocketTimeoutException:null indicates that the request timed out while waiting for a response or in the middle of a read (idle). Please check the server's health status and whether the timeout configured for the API in the PTS stress test is reasonable. There may also be a bottleneck in the server's processing capacity.
- class java.net.ConnectException:null indicates that the request failed or was rejected by the remote end (the end under test) while establishing a TCP connection. Please check the server health, or whether there is a bottleneck in the network connection layer.
- class java.util.concurrent.TimeoutException:null indicates that the request timed out, most likely while establishing a TCP connection with the remote end (the end under test). Please check the server health, or whether there is a bottleneck in the network connection layer.
- class org.apache.http.ConnectionClosedException:Connection closed indicates that the connection was closed abnormally because the server actively closed it.
- class java.io.IOException:Connection reset by peer indicates that the connection was reset. If SLB is used, please check whether there is any problem with the SLB configuration.
- class org.apache.http.ConnectionClosedException:Connection closed unexpectedly indicates that the connection was closed before all data had been received. The server may not have responded in time, or debugging or the stress test may have been terminated early.
- class java.lang.RuntimeException:java.net.UnknownHostException indicates that the domain name cannot be resolved. Please check whether the domain name has been registered normally and can be resolved, and whether domain name binding has been configured for any unregistered domain name.
- class org.apache.hc.core5.http.ProtocolException:Header 'key: value' is illegal for HTTP/2 messages indicates that the server prefers the HTTP/2 protocol and the scenario is configured with a header that HTTP/2 does not support. Please retry after removing the corresponding header. Common headers not supported by HTTP/2 are: Connection, Keep-Alive, Proxy-Connection, Transfer-Encoding, Host, Upgrade.
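For the last error in the list, the headers to remove can be found before retrying with a small check such as the sketch below. It only uses the header names listed above. The function name, the sample headers, and the host are hypothetical, and the check is done by hand outside PTS rather than through any PTS API.

```python
# Headers listed above as not supported by HTTP/2; compared case-insensitively.
HTTP2_ILLEGAL_HEADERS = {
    "connection", "keep-alive", "proxy-connection",
    "transfer-encoding", "host", "upgrade",
}


def split_headers(headers: dict):
    """Split headers into (kept, removed) for a scenario that targets an HTTP/2 server."""
    kept, removed = {}, {}
    for name, value in headers.items():
        target = removed if name.strip().lower() in HTTP2_ILLEGAL_HEADERS else kept
        target[name] = value
    return kept, removed


scene_headers = {
    "Content-Type": "application/json",
    "Connection": "keep-alive",
    "Host": "petstore.example.com",  # hypothetical host
}
kept, removed = split_headers(scene_headers)
print("keep:", kept)
print("remove before retrying:", sorted(removed))
```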
Related Links
[1] Probe Access
https://pts.console.aliyun.com/#/diagnosis/probeAccess/pts
[2] Application Monitoring
https://pts.console.aliyun.com/#/ahas/appList?type=Summary
[3] Create a scene
https://pts.console.aliyun.com/#/create/scene
[4] Report list