With the full-link stress test solution, tuning delivered at least a 70% performance improvement in non-encrypted scenarios and a 10% improvement in encrypted scenarios, with a substantial further improvement once the MGS expansion is completed. The tuning results far exceeded expectations.
Business background
As the mobile development industry enters an era of competing for existing users, the load capacity of the overall App architecture and the optimization of each link have gradually become a focus for developers.
Stress testing is the main way to achieve this. A stress test can generally be used to:
- Test the load bottleneck of the back-end business;
- Evaluate the overall architecture performance;
- Determine the peak load at which the business remains stable;
- Find the weak points among the nodes;
- Optimize system resources;
- Avoid the short board (weakest link) effect;
- Provide operations with an accurate user load figure as evidence, to avoid the poor user experience caused by sudden traffic when activities or new applications are launched.
Today, we introduce the principle and implementation path of the full-link stress test solution.
Full-link stress test and its principle
We can usually estimate capacity with the simple formula load performance = single-machine performance × total number of machines. In real scenarios, however, a request passes through many business nodes, such as DNS, the gateway, and the database, and any of them may become the bottleneck of overall business performance. The actual service capacity may therefore differ considerably from the estimate.
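A minimal sketch of why the simple formula can mislead: on a serial request path, throughput is bounded by the slowest node, not by the application servers alone. The node names and capacity figures below are made up for illustration only.

```python
def naive_capacity(single_machine_tps: float, machine_count: int) -> float:
    """Estimate from the formula: single-machine performance * machine count."""
    return single_machine_tps * machine_count


def link_capacity(node_capacities: dict[str, float]) -> tuple[str, float]:
    """Return the bottleneck node and its capacity for a serial request path."""
    bottleneck = min(node_capacities, key=node_capacities.get)
    return bottleneck, node_capacities[bottleneck]


# Hypothetical per-node capacities (requests per second) along one request path.
nodes = {
    "dns": 50_000.0,
    "gateway": 3_000.0,
    "app_servers": naive_capacity(single_machine_tps=1_000.0, machine_count=8),
    "database": 4_500.0,
}

bottleneck_name, bottleneck_tps = link_capacity(nodes)
print(f"naive estimate: {nodes['app_servers']:.0f} TPS")
print(f"end-to-end limit: {bottleneck_tps:.0f} TPS at {bottleneck_name}")
```

In this made-up example the formula predicts 8000 TPS, while the link as a whole cannot exceed what the gateway can carry.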
Users typically run server performance stress tests in production environments with tools such as LoadRunner. In mPaaS applications, however, this approach runs into difficulties such as complex deployment, the inability to pass through the MGS gateway, and high cost.
To solve these pain points, the mPaaS team provides an MGS full-link stress test solution based on the requirements of multiple customers.
The biggest difference from previous test solutions is the perspective. The full-link stress test takes the client's point of view: the entire server-side link is treated as a black box, and real requests and responses serve as the entry point and the basis for evaluation. It simulates real business requests, real data traffic, and real user habits to produce evaluation results that are as realistic as possible.
Sorting out the link
A standard data link generally follows this model:
In the full-link stress test, we treat the overall server implementation as a black box, so we need to focus on the first half of the link. The key points can be summarized as:
1. Constructing the client request;
2. Sending the client request through the MGS gateway;
3. Parsing the response returned by the MGS gateway and handling it correctly on the client;
4. Building a high-concurrency cluster of client requests.
Sorting these out further, we can summarize the following difficulties.
Difficulty 1 Client request construction
The mPaaS mobile gateway RPC communication is a standardized interface built on the HTTP protocol. While reusing the HTTP request standard, it defines its own data exchange format, in which the Header and the Body play distinct roles. Roughly speaking, the Operation-Type in the Header identifies the actual API to call, and the Body is encapsulated according to the rules and then forwarded.
In this step, we use JMeter as the implementation tool. JMeter's flexible scripting can closely simulate the real requests of the client.
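To make the Header and Body split concrete, here is a minimal sketch that builds (but does not send) such a request with the Python standard library. Only the Operation-Type header comes from the description above. The gateway URL, the operation name, the Content-Type, and the JSON-array body wrapping are assumptions for illustration, and the other gateway headers that a real request needs are omitted.

```python
import json
import urllib.request

# Hypothetical values: take the real ones from the configuration file
# downloaded from the console.
GATEWAY_URL = "https://gateway.example.com/rpc"
OPERATION_TYPE = "com.example.demo.queryBalance"


def build_rpc_request(operation_type: str, params: dict) -> urllib.request.Request:
    """Build a POST request whose Operation-Type header names the target API."""
    # Assumption: the business parameters are wrapped in a JSON array.
    body = json.dumps([params]).encode("utf-8")
    headers = {
        "Operation-Type": operation_type,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(GATEWAY_URL, data=body, headers=headers, method="POST")


request = build_rpc_request(OPERATION_TYPE, {"accountId": "demo-001"})
print(request.get_method(), request.full_url)
# urllib normalizes header capitalization; HTTP header names are case-insensitive.
print(request.header_items())
```

In the actual solution, the same construction is done in the JMeter script rather than in Python.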
Difficulty 2 Data encryption and decryption
The data encryption scheme specific to mPaaS mobile gateway RPC requests is the more complicated part of constructing a request. Existing client-side test tools do not cover this capability, so testers often turn off signature verification and encryption on the gateway server in order to run the stress test.
The hidden danger of this approach is that the computational load that encryption and decryption place on the gateway server cannot be estimated.
According to experience, different encryption and decryption algorithm configurations affect gateway throughput by 20% to 40%. To address this, the financial line SRE team custom-developed the JMeter plug-in MGSJMeterExt based on the user's production environment. The plug-in re-implements the encryption and decryption of the request body, so that the stress test script can include the encryption part.
Difficulty 3 Request signature construction
The signature verification mechanism of mPaaS mobile gateway RPC requests is also specific to the gateway. As with data encryption and decryption, no existing client-side tool covers this capability, so testers often disable signature verification on the interface for testing. With the help of MGSJMeterExt, the message can be signed correctly in JMeter and pass verification on the server.
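The general idea of request signing can be sketched as follows. This is a generic keyed-hash illustration, not the mPaaS signing algorithm, which MGSJMeterExt implements for the actual gateway. The secret, the choice of fields, their order, and the separator are all assumptions.

```python
import hashlib
import hmac


def sign_message(secret: bytes, operation_type: str, body: bytes, timestamp: str) -> str:
    """Return a hex HMAC-SHA256 over fields that both client and server can see."""
    # Assumption: fields are joined with "&" in a fixed order agreed by both sides.
    message = b"&".join([operation_type.encode("utf-8"), body, timestamp.encode("utf-8")])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_message(secret: bytes, operation_type: str, body: bytes,
                   timestamp: str, signature: str) -> bool:
    """Recompute the signature on the server side and compare in constant time."""
    expected = sign_message(secret, operation_type, body, timestamp)
    return hmac.compare_digest(expected, signature)


secret = b"demo-secret"  # hypothetical shared secret
signature = sign_message(secret, "com.example.demo.queryBalance", b"[]", "1700000000")
print(verify_message(secret, "com.example.demo.queryBalance", b"[]", "1700000000", signature))
print(verify_message(secret, "com.example.demo.queryBalance", b"[1]", "1700000000", signature))
```

The second call fails verification because the body was changed after signing. This is why a stress test script cannot replay captured requests with modified parameters unless it can also produce a valid signature.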
Difficulty 4 Stress test cluster environment deployment
A stress test needs real traffic entry points and real concurrency to produce realistic results. However, building the stress test environment yourself and deploying a cluster is costly and becomes an unnecessary expense.
Therefore, we recommend that users use Alibaba Cloud PTS as the stress test platform. Compared with other solutions, it is easy to deploy, supports JMeter scripts, and generates real traffic. It also provides users with more detailed stress test reports.
Overview
The model above can be summarized as the following structure:
Full-link plan and implementation
Part1 Preliminary preparation and research
The goal of the early stage is to provide preparation and data support for the actual stress test, and to establish its targets and overall direction.
1.1 Target and data preparation
1. Customers need to clarify the objectives of the stress test. Based on those objectives and on previous operational data, they should identify the specific business categories involved, the likely user behavior patterns, and the relative weight of each pattern in the overall business.
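The relative weights collected in this step later decide how the simulated traffic is split across business scenarios. A minimal sketch, assuming made-up scenario names and weights:

```python
import random
from collections import Counter

# Hypothetical behavior weights, as they might be derived from past operational data.
SCENARIO_WEIGHTS = {
    "login": 10,
    "home_refresh": 50,
    "query_balance": 30,
    "transfer": 10,
}


def pick_scenarios(count: int, seed: int = 0) -> list[str]:
    """Draw `count` scenarios so that their frequencies follow the weights."""
    rng = random.Random(seed)
    names = list(SCENARIO_WEIGHTS)
    weights = list(SCENARIO_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=count)


mix = Counter(pick_scenarios(10_000))
for name, hits in mix.most_common():
    print(f"{name}: {hits / 10_000:.1%}")
```

In a JMeter script, the same split is typically configured on the test plan itself rather than coded by hand.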
1.2 Client preparation
1. Based on the business goals, the client side needs to sort out the interfaces and data flows involved in the client implementation. For example, check whether there are prerequisite steps such as login, and mandatory steps such as a home page refresh. Collect the real requests and responses for these steps, for example through packet capture, and determine the values that count as meeting expectations.
2. Depending on the business structure, this preparation can also be completed from the server-side interfaces.
1.3 Server preparation
1. Based on the interface statistics from 1.2, set up data mocks (baffles) on the server side so that test data does not pollute the real database.
2. Because the mPaaS full-link stress test treats the server as a black box, the performance indicators of each server-side service need to be monitored to serve as a basis for later server tuning.
1.4 MGSJMeterExt plug-in preparation
Since MGSJMeterExt needs to be customized and developed according to the actual gateway environment, users are required to provide the following data:
1. Environment data related to the workspace
2. Encryption algorithm and public key
Q&A
Q: How is the stress test script implemented?
A: Our team of experts and on-site engineers will provide training on writing stress test scripts for a simple scenario. Actual scenarios may involve multiple business links, such as obtaining the login token and other clearly defined prerequisite steps. Scripts for such complex business scenarios need to be completed by customers themselves with the assistance of the Ali expert team.
Q: Why is it called full-link?
A: Although the stress test script is implemented from client logic, it simulates real data requests and also confirms whether the server's response meets expectations, so it covers the complete data link and every node on it.
Q: How are link metrics instrumented?
A: The stress test solution treats the system as a black box. Using the PTS indicators, the request parameters, and the response rate, it verifies the rate of responses that meet expectations in order to confirm, from the user's perspective, the load the whole architecture can carry. As for back-end indicators, server architectures differ widely between customers, and the corresponding service providers can generally provide monitoring solutions, so mPaaS does not need to handle them.
Q: Why use PTS?
A: What the mPaaS team provides is the MGS communication solution and assistance in writing the scripts. Using PTS is not mandatory: any JMeter cluster deployment environment will do, and PTS resources must be purchased by the user. However, based on multiple cases, the mPaaS team has found PTS to be more cost-effective and able to provide a stress test environment closer to expectations, together with a complete stress test report. We therefore recommend that users use PTS for stress testing.
Q: Are there detailed standards for the performance indicators that a given specification, such as 2c4g or 4c8g, should achieve?
A: The purpose of a stress test is precisely to clarify the performance indicators achievable with given system resources. Server-side architectures differ, actual business flows pass through different nodes, and performance varies enormously between environments. Stress testing exists to establish the real figures and to evaluate the actual resource and time consumption of each node.
Part2 JMeter development and script modification
Having summarized the points specific to the MGS communication solution, we need to implement them in JMeter.
2.1 Header transformation
In the Header, we need to pay attention to the following points:
1. The MGS gateway protocol relies on some Header fields, so it is necessary to ensure that the gateway parameters are complete.
2. Some parameters are fixed values and can be hard-coded. For the related configuration, refer to the configuration file downloaded from the console.
3. If the business depends on other Header fields, such as cookies, they can be added directly. The MGS gateway does not filter header information.
2.2 URL transformation
In the URL, we need to pay attention to the following points:
1. The URL should point to the MGS gateway, not the actual business server. For the related configuration, refer to the configuration file downloaded from the console.
2. At present, all requests to the MGS gateway are POST. If the business interface is a GET request, MGS converts the request to GET when forwarding it, but the communication with MGS is still POST.
3. If there is no special requirement for the Body part, the suggested setting is as shown in the figure.
2.3 Request transformation
In Request, we need to pay attention to the following points:
1. The encryption and signing here depend on the MGSJMeterExt file, which needs to be referenced.
2. Under normal circumstances, only the //config part needs to be modified.
3. The remaining parts are generally common to all scripts. They mainly implement encryption and signature verification and do not need to be modified.
2.4 Response transformation
In Response, we need to pay attention to the following points:
1. The performance of the load generator must be taken into account here so that it does not affect the evaluation of the server. Therefore, unless the data is needed for later steps or for judging the result, no processing needs to be written here.
2. If you do have such requirements, the secondary processing of Response parameters can be done here.
Part3 Actual stress test
The general steps can be summarized as:
3.1 PTS and script performance tuning
Alibaba Cloud Performance Testing Service (PTS) provides convenient and fast cloud stress testing capabilities. In this stress test, PTS is used to generate load from the Internet.
Interestingly, encryption and decryption put computational pressure not only on the gateway but also on the load generator. Therefore, once the first version of the plug-in and the stress test script had been implemented, we first ran a "stress test" against the load generator itself.
First round of basic testing
PTS load generator configuration:
1. PTS single-IP unit configuration
2. Concurrency of 500 (the maximum for a single machine)
3. Fixed-pressure traffic model
4. Stress test duration of two minutes
The resulting stress test report shows that TPS is not high, yet the returned RT value is not high either:
Next, observing the load generator shows that its CPU usage stayed relatively high, so there is reason to suspect that the encryption computation significantly limits the load the generator can produce.
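A rough sanity check for this kind of suspicion can be based on Little's Law (concurrency ≈ TPS × average RT). If the measured TPS is far below what the configured concurrency and the measured RT would allow, the threads are spending their time somewhere outside the measured request, for example preparing and encrypting the next request. The numbers below are illustrative only and are not taken from the stress test report.

```python
def expected_tps(concurrency: int, avg_rt_seconds: float) -> float:
    """Little's Law: throughput = concurrency / average response time."""
    if avg_rt_seconds <= 0:
        raise ValueError("average response time must be positive")
    return concurrency / avg_rt_seconds


def generator_efficiency(measured_tps: float, concurrency: int, avg_rt_seconds: float) -> float:
    """Ratio of measured TPS to what the concurrency and RT would allow."""
    return measured_tps / expected_tps(concurrency, avg_rt_seconds)


# Illustrative numbers only: 500 threads, 50 ms average RT, 2000 TPS measured.
ratio = generator_efficiency(measured_tps=2_000.0, concurrency=500, avg_rt_seconds=0.05)
print(f"load generator delivers {ratio:.0%} of the theoretical throughput")
```

A low ratio together with high CPU usage on the load generator points at the generator, not the server, as the limiting factor.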
Caching the encryption results for repeated content greatly reduces the computation. To avoid memory problems caused by the cache, its size is capped.
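The idea of a size-capped cache can be sketched as follows. This is an illustration in Python, not the plug-in's code: the encryption function is a stand-in (a hash), and the cache size of 1024 is an arbitrary assumption. Caching is only valid when identical plaintext may legitimately produce identical ciphertext.

```python
import hashlib
from functools import lru_cache

encrypt_calls = 0  # counts how often the expensive path really runs


@lru_cache(maxsize=1024)  # the upper bound keeps memory use predictable
def encrypt_body(plaintext: bytes) -> bytes:
    """Stand-in for the costly encryption step (hypothetical, not the real algorithm)."""
    global encrypt_calls
    encrypt_calls += 1
    return hashlib.sha256(plaintext).digest()


# The same request body is sent 1000 times, as in a fixed-payload stress test.
for _ in range(1_000):
    encrypt_body(b'{"accountId": "demo-001"}')

print(encrypt_calls)              # the expensive step ran only once
print(encrypt_body.cache_info())  # hits, misses, maxsize, currsize
```

When the cache is full, the least recently used entry is evicted, so a test with many distinct payloads degrades gracefully instead of exhausting memory.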
Second round of testing
The configuration is exactly the same as in the first round; only the encryption plug-in is replaced with the optimized version. The resulting test report shows that scenario TPS improved by 75%:
The CPU usage of the load generator also shows an obvious improvement.
Third round of testing
After the exploration of the first round and the optimization of the second, the third round uses two load generators running a full-load stress test to observe the results:
The results show that the stress test script and the orchestration process meet expectations, so the formal PTS cloud stress test can be run in the customer's production environment.
3.2 Stress test in the production environment
At the start of the formal stress test, several rounds of small-scale stress tests were run to check whether the back-end systems behaved as expected. The following problems were found:
Problem 1: Nginx traffic forwarding is uneven
The MGS container logs showed that some containers never received any requests. Investigation found three causes:
1) The Nginx forwarding configuration in the DMZ zone was missing one MGS container IP;
2) The network policy from the DMZ zone to each MGS container IP had not been opened for access;
3) The Nginx forwarding rule was set to ip_hash, so in a test with a single source IP, traffic could only be forwarded to one container.
After the IP list was corrected, network access was opened, and the forwarding rule was changed, the problem was resolved.
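The third cause is easy to reproduce in a small simulation. The sketch below uses a simplified source-IP hash (the real Nginx ip_hash keys on the first three octets of an IPv4 address); the backend names and the client IP are hypothetical.

```python
import itertools
import zlib
from collections import Counter

BACKENDS = ["mgs-1", "mgs-2", "mgs-3", "mgs-4"]  # hypothetical container names


def pick_by_ip_hash(client_ip: str) -> str:
    """Simplified source-IP hashing: the same IP always maps to the same backend."""
    return BACKENDS[zlib.crc32(client_ip.encode("ascii")) % len(BACKENDS)]


def spread(picker, client_ips) -> Counter:
    """Count how many requests each backend receives."""
    return Counter(picker(ip) for ip in client_ips)


single_source = ["203.0.113.10"] * 1_000  # all requests come from one load generator IP

round_robin = itertools.cycle(BACKENDS)
print("ip hash:    ", spread(pick_by_ip_hash, single_source))
print("round robin:", spread(lambda _ip: next(round_robin), single_source))
```

With source-IP hashing, all 1000 requests land on a single backend, while round robin spreads them evenly. In production, where clients come from many IPs, the hash-based rule would not show this skew, which is why the issue only surfaces in a single-source test.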
Problem 2: The base CPU load of a specific MGS container is too high
Preliminary testing found that one MGS container (mpaasgw-7) had a CPU load of 25% when idle, which was not as expected.
Logging in to the container revealed a JPS process consuming a lot of CPU. We suspect it was not released properly after being invoked during the earlier debugging phase. Killing the JPS process solved the problem, and the container was restarted to avoid other issues.
Note: JPS (Java Virtual Machine Process Status Tool) is a command provided by Java that lists all Java processes. See: https://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html.
Problem 3: The CoreWatch monitoring platform cannot be accessed
The CoreWatch console could not be accessed, and the browser reported a 502 error. After the CoreWatch container was restarted, the page could be opened but stayed in the loading state.
The request to http://corewatch. _*_.com/xflush/env.js remained in the pending state. Investigation found an error in the monitoring configuration of the ALB instance, and the problem was solved after it was corrected.
3.3 Production environment stress test & summary
After all the problems in 3.2 were solved, the system was ready for the stress test. The formal stress test covers the "encrypted" scenario and the "non-encrypted" scenario separately.
Because production data cannot be disclosed, only some of the problems encountered are given below as examples.
"Encrypted" test case
1. During the stress test, TPS stopped increasing at around 500 concurrent connections, which suggests that a bottleneck had been reached.
2. The load on the MGS gateway containers showed that the overall CPU load had reached its limit.
3. The CPU load of the MCUBE container during the same period was healthy, and other performance indicators (IO, network, etc.) were also healthy.
4. From the above, the main performance bottleneck in the encrypted scenario is the MGS gateway. Based on experience and process analysis, the pressure comes mainly from the intensive computation of message encryption and decryption. Resolving this bottleneck requires scaling out the MGS containers.
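The point at which TPS stops increasing can also be located mechanically from the report data. A minimal sketch, assuming a 5% relative gain as the threshold for "no longer growing"; the (concurrency, TPS) pairs are made up for illustration and are not production data.

```python
from typing import Optional


def find_plateau(samples: list[tuple[int, float]], min_gain: float = 0.05) -> Optional[int]:
    """Return the first concurrency level after which TPS grows by less than min_gain."""
    for (prev_conc, prev_tps), (_, next_tps) in zip(samples, samples[1:]):
        if prev_tps > 0 and (next_tps - prev_tps) / prev_tps < min_gain:
            return prev_conc
    return None


# Made-up (concurrency, TPS) pairs, ordered by increasing concurrency.
samples = [
    (100, 800.0),
    (200, 1_550.0),
    (300, 2_250.0),
    (500, 3_000.0),
    (700, 3_050.0),
    (900, 3_040.0),
]
print(find_plateau(samples))  # 500
```

Once the plateau is located, the resource metrics of each node at that concurrency level show which component is saturated, as in the analysis above.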
"No encryption" test
1. TPS stopped growing when concurrency reached about 1000. Normally this indicates that the system's capacity limit has been reached.
2. Unlike in the encrypted scenario, the overall CPU load of the MGS gateway containers was not high at this point.
3. Meanwhile, the network team reported that during the stress test the number of TCP sessions from the Internet to the DMZ zone was 3 to 4 times the number from the DMZ zone to the intranet zone, and that the CPU pressure on the firewall of the intranet segment was relatively high.
4. Taken together, these three observations suggested a network-level bottleneck. On-site investigation found that Nginx in the DMZ zone did not keep persistent connections when forwarding to the intranet. We modified the Nginx configuration to add keepalive 1000 and then started the second round of testing.
About the keepalive parameter: by default, Nginx uses short connections (HTTP/1.0) to access the backend. For each new request, Nginx opens a new port to establish a connection with the backend and closes the connection once the backend has finished processing. The keepalive parameter sets the number of persistent connections that Nginx caches to the backend servers. When a new request arrives, an existing TCP connection can be reused directly, which reduces the cost of establishing TCP connections. See: http://nginx.org/en/docs/http/ngx_http_upstream_module.html.
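A configuration fragment along these lines illustrates the change. The upstream name, server addresses, and ports are hypothetical. According to the Nginx documentation, upstream keepalive for HTTP also requires HTTP/1.1 and a cleared Connection header on the proxied request, so both settings are included.

```nginx
upstream mgs_gateway {
    # Hypothetical MGS container addresses.
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;

    # Keep up to 1000 idle connections to the upstream servers
    # in the cache of each worker process.
    keepalive 1000;
}

server {
    listen 80;

    location / {
        proxy_pass http://mgs_gateway;
        # Both settings are needed for upstream keepalive to take effect.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

Note that the keepalive value limits idle cached connections per worker process, not the total number of connections Nginx may open to the backend.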
Summary
After the above problems were optimized, performance improved by at least 70% in the non-encrypted scenario and by 10% in the encrypted scenario, and a substantial further improvement can be achieved once the MGS expansion is completed. The tuning results far exceeded expectations.
The author of this article: Alibaba Cloud mPaaS TAM team (Wang Zekang, Beimo, Donglei, Rongyang)