1. Background

Following the success of large online promotions such as "Double 11" and "618", more and more Internet companies run marketing campaigns at irregular intervals to stimulate consumption and increase revenue. Behind every such shopping festival, however, provisioning the right amount of computing resources has become a recurring problem for developers. In addition, according to Gartner statistics, under the influence of the epidemic a growing number of companies have accelerated the migration of key business modules from local clouds to public clouds to improve the stability and disaster tolerance of their services. How to effectively evaluate and plan the capacity of key resources such as computing power, compute engines, and bandwidth has become a technical challenge in cloud scenarios.

In response, the Alibaba Cloud Database Autonomy Service (DAS) team has launched an intelligent stress testing service. It targets computing resource evaluation for large promotions, capacity planning for migrating offline resources to the cloud, and database selection problems such as cross-engine migration. DAS (Database Autonomy Service) is a cloud service that uses machine learning and expert experience to make databases self-aware, self-repairing, self-optimizing, self-operating, and self-securing. It helps users eliminate the complexity of database management and the service failures caused by manual operations, and effectively ensures the stability, security, and efficiency of database services. The solution architecture is shown in Figure 1.

2. Components of intelligent stress testing

Stress testing is a test method for establishing the stability of a system. It is usually performed beyond the system's normal operating range to probe its functional limits and hidden risks. In the traditional sense, stress testing a network server means continuously applying "pressure" to it in order to find bottlenecks or unacceptable performance points and thus determine the maximum service level the system can provide. In the database scenario, stress testing usually refers to performance testing: by continuously increasing the number and concurrency of SQL statements executed on the database server, it checks whether a database of a given specification can serve requests continuously and stably. Decisions are then made based on the results, such as adjusting the database specification, changing the deployment pattern, or optimizing business SQL. In general, a stress test involves three key parts: test data preparation, traffic replay, and result analysis, as shown in Figure 2.

Figure 2 Key components of intelligent stress testing

Stress test data: In the database scenario, traffic consists of SQL statements, but the statements alone are not enough. The actual data distribution and the table indexes both affect execution time. Stress test data therefore includes the table structure, the table data, the indexes, and the executed SQL statements. In addition, in some scenarios with strict security requirements, only the table structure may be reused and the raw data cannot be used for stress testing. For this case, we propose an algorithm that intelligently generates simulated data matching the original data distribution for replay.

Traffic replay technology: In traditional performance stress testing, SQL statements are not constrained to follow the concurrency and execution order of the original traffic, so the test behaves quite differently from the original business traffic. As a result, a single database resource evaluation task usually requires several stress test runs whose performance results are averaged before resources can be evaluated. This takes a lot of testing time and requires database experience, so it is usually done by a DBA. To address this, DAS improves the individual stress test run: stress test idempotence technology ensures that the performance of the replay is similar to that of the original business traffic, so multiple replays are not needed. This greatly shortens resource evaluation and lowers the experience required to stress test a database.

Analysis of stress test results: Effective result analysis helps users choose resource specifications sensibly and discover hidden risks during traffic replay. Key database performance parameters, comparisons of key performance indicators, and SQL optimization suggestions help users understand resource differences and potential optimization points, and support subsequent decisions.

3. Inside intelligent stress testing technology

3.1. Intelligent data generation technology

For database performance stress testing, the industry offers many open source tools, such as Sysbench, mysqlslap, and tpcc. These tools generate SQL traffic by opening a large number of concurrent database connections and running certain query statements, simulating heavy business use of a database. However, performance under such simulated workloads usually differs considerably from actual business performance, so simulated stress tests cannot meet the needs of computing resource evaluation. Stress testing with real data from the business database has therefore become a basic requirement for resource evaluation. Alibaba Cloud database users can easily obtain the required data through the SQL audit function. For users whose databases run off the cloud or are self-built on Alibaba Cloud ECS, historical table data and traffic data are harder to obtain, and in some scenarios with strict data security requirements neither the raw data nor the SQL traffic may be used at all.

At present, in the single-table query scenario, we use intelligent data generation to produce data that conforms to the distribution of the business data and can be used for stress testing and resource evaluation. The algorithm assumes that we know a set of SQL templates and the execution metrics that correspond to them, such as RT, rows_sent, and rows_affected. The goal is to instantiate these templates into SQL statements that yield similar execution metrics when run against the target table (we assume that SQL statements from the same template use the same execution plan). As shown in Figure 3, we need to search for parameters a and b that instantiate the SQL template so that it returns exactly 1 row on the given data.

Figure 3 SQL template

When searching for SQL parameters for point queries and point updates, the primary key or a unique key can be used directly. When the number of returned or updated rows is greater than 1, we use sampling-based cardinality estimation to estimate the number of rows the instantiated SQL will return or update, and search for the template parameters accordingly.
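The parameter search for the multi-row case can be illustrated with a minimal sketch. It is not the DAS algorithm: the range template `col BETWEEN a AND b`, the table name `t`, the helper names, and the synthetic column are all assumptions made for illustration. The idea is to sort a random sample of the column, scale sample counts up to the table size to estimate cardinality, and pick a and b so that the estimate matches the target row count.

```python
import bisect
import random


def estimate_rows(sorted_sample, table_rows, a, b):
    """Estimate how many table rows satisfy a <= col <= b from a sorted sample."""
    lo = bisect.bisect_left(sorted_sample, a)
    hi = bisect.bisect_right(sorted_sample, b)
    return (hi - lo) * table_rows / len(sorted_sample)


def search_range_params(sorted_sample, table_rows, target_rows, rng):
    """Pick (a, b) so that the estimated row count is close to target_rows."""
    scale = table_rows / len(sorted_sample)
    # Number of sampled values the range should cover (at least one).
    width = min(len(sorted_sample), max(1, round(target_rows / scale)))
    start = rng.randrange(len(sorted_sample) - width + 1)
    return sorted_sample[start], sorted_sample[start + width - 1]


rng = random.Random(7)
# Hypothetical stand-in for the real column values of the target table.
column = [rng.randint(0, 1_000_000) for _ in range(200_000)]
sample = sorted(rng.sample(column, 2_000))

a, b = search_range_params(sample, len(column), target_rows=5_000, rng=rng)
estimated = estimate_rows(sample, len(column), a, b)
actual = sum(1 for v in column if a <= v <= b)
print(f"SELECT * FROM t WHERE col BETWEEN {a} AND {b}")
print(f"estimated rows: {estimated:.0f}, actual rows: {actual}")
```

The estimate is only as good as the sample, so the actual row count will deviate from the target by some sampling error. A larger sample narrows that gap at the cost of more reads on the source table.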

Figure 4 shows a stress test with generated traffic for a DingTalk read-write workload during the morning peak. The generated-traffic test and the real business perform similarly on multiple metrics, which indicates that the generated data can effectively simulate real online data.

Figure 4 Stress test results based on generated data

3.2. Stress test idempotence technology

Once the data is prepared, replaying traffic effectively and repeatably is the other core technology of intelligent stress testing. Existing open source tools can generate SQL traffic by combining many concurrent database connections with certain query statements, simulating heavy business use of a database. However, with a realistic business model that has some data skew, a more serious problem appears: if the same model and the same data are tested several times on RDS MySQL, the performance curves of the runs may not match. For example, the first run may hit a particular piece of data at time A, while the second run is likely to hit it at time B, which greatly interferes with problem analysis, as shown in Figure 5. Although the two curves show a similar level of pressure, their jitter patterns are completely inconsistent, which hinders analysis.

Figure 5 Results of running the same test model twice on the same database instance

To address this, we propose the concept of stress test idempotence: no matter how many times the same test is run, the generated SQL is exactly the same. In the idempotent case, the SQL text generated at each point in time is identical (assuming identical database processing capability), and the execution order of all SQL statements is the same throughout the stress test task. At present, this consistency is guaranteed within each thread; for performance reasons and given actual needs, strong consistency across threads is not enforced.

With idempotence in place, DAS intelligent stress testing produces consistent results for the scenario described above, as shown in Figure 6.

Figure 6 Results of running the same intelligent stress test twice on the same database instance

Stress test idempotence is achieved in three areas: the generation logic of stress test threads, the total number of requests, and the eventual consistency of writes. First, the sequence of random numbers within each thread is the same on every run, while differing between threads. Second, the number of requests in each thread is kept constant, which fixes the total number of requests. Third, custom primary keys and agreed update ranges avoid conflicts between auto-increment primary keys and updates, ensuring that the data is eventually consistent after the stress test ends.
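These three ideas can be sketched in a few lines. This is an illustrative sketch, not the DAS implementation: the table `t`, the statement mix, the seed derivation, and all function names are assumptions, and existing rows are assumed to have ids from 0 to `existing_rows - 1`. Each thread gets its own deterministic random stream, issues a fixed number of requests, inserts into its own primary key range, and updates only its own slice of rows.

```python
import random
import zlib


def thread_seed(task_seed, thread_id):
    """Stable per-thread seed: identical across runs, distinct across threads."""
    return zlib.crc32(f"{task_seed}:{thread_id}".encode())


def generate_thread_sql(task_seed, thread_id, thread_count, requests_per_thread, existing_rows):
    """Return the exact SQL sequence that one worker thread issues."""
    rng = random.Random(thread_seed(task_seed, thread_id))
    # Agreed update range: each thread only updates its own slice of existing rows.
    slice_size = existing_rows // thread_count
    update_lo = thread_id * slice_size
    # Custom primary keys: each thread inserts into its own disjoint id range
    # instead of relying on auto-increment values.
    next_id = existing_rows + thread_id * requests_per_thread
    statements = []
    for _ in range(requests_per_thread):  # fixed request count per thread
        roll = rng.random()
        if roll < 0.1:
            statements.append(f"INSERT INTO t (id, v) VALUES ({next_id}, {rng.randrange(100)})")
            next_id += 1
        elif roll < 0.3:
            row = update_lo + rng.randrange(slice_size)
            statements.append(f"UPDATE t SET v = v + 1 WHERE id = {row}")
        else:
            statements.append(f"SELECT v FROM t WHERE id = {rng.randrange(existing_rows)}")
    return statements


def generate_run(task_seed=42, thread_count=4, requests_per_thread=1000, existing_rows=100_000):
    """Generate the per-thread SQL sequences for one complete stress test run."""
    return [
        generate_thread_sql(task_seed, tid, thread_count, requests_per_thread, existing_rows)
        for tid in range(thread_count)
    ]


first_run = generate_run()
second_run = generate_run()
print(first_run == second_run)         # compares the SQL of two separate runs
print(first_run[0] == first_run[1])    # compares two threads within one run
print(sum(len(s) for s in first_run))  # total number of requests in a run
```

Because every thread writes only to rows it owns, the final table contents do not depend on how the threads interleave, even though ordering across threads is not enforced.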

4. Product implementation

4.1. Product workflow

Having covered the components of intelligent stress testing and their core technologies, let's look at how DAS turns them into a product. The intelligent stress testing process is divided into four stages: preparation, SQL processing, replay, and effect evaluation, as shown in Figure 7.

Figure 7 Intelligent stress testing product workflow

The preparation stage sets up the machine environment for the stress test: purchasing ECS machines, preparing the target instance, and configuring the runtime environment on the ECS machines. DAS intelligent stress testing can select a suitable ECS machine and configure its runtime environment automatically based on the QPS of the test traffic and the replay duration, and it also lets users run the test on their own machines. For the target instance, DAS can prepare one through RDS backup and restore or DTS synchronization, and users may also specify an instance of their own.

The SQL processing stage prepares the full SQL detail data for the stress test in advance. It preprocesses the SQL data obtained from SQL insight details or generated by the intelligent algorithm, including prepared statement deduplication, log removal, and merging of transaction statements.
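The source does not describe how these preprocessing steps are implemented, so the following is only a rough sketch of two of them under assumed inputs. It takes captured records as hypothetical `(session_id, sql)` pairs in log order, drops repeated `PREPARE` statements within a session, and merges the statements of each transaction into a single replay unit. Log removal is left out because its meaning is not specified.

```python
def preprocess(records):
    """Turn (session_id, sql) records in log order into replay units.

    Each unit is (session_id, [sql, ...]); a transaction becomes one unit.
    """
    seen_prepares = set()  # (session_id, statement text) already emitted
    open_txn = {}          # session_id -> statements of the open transaction
    units = []
    for session_id, sql in records:
        text = sql.strip().rstrip(";")
        head = text.upper()
        if head.startswith("PREPARE "):
            key = (session_id, text)
            if key in seen_prepares:
                continue  # duplicate prepared statement in this session
            seen_prepares.add(key)
        if head in ("BEGIN", "START TRANSACTION"):
            open_txn[session_id] = [text]
        elif session_id in open_txn:
            open_txn[session_id].append(text)
            if head in ("COMMIT", "ROLLBACK"):
                units.append((session_id, open_txn.pop(session_id)))
        else:
            units.append((session_id, [text]))
    # Transactions still open at the end of the capture are kept as they are.
    units.extend(open_txn.items())
    return units


# Hypothetical captured traffic from two sessions.
records = [
    (1, "PREPARE s FROM 'SELECT 1'"),
    (1, "PREPARE s FROM 'SELECT 1'"),
    (1, "BEGIN"),
    (2, "SELECT 2"),
    (1, "UPDATE t SET v = 1 WHERE id = 1"),
    (1, "COMMIT;"),
]
units = preprocess(records)
for session_id, statements in units:
    print(session_id, statements)
```

Keeping a transaction together as one unit lets the replay stage issue its statements on the same connection and in the original order.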

The replay stage replays the traffic using stress test idempotence technology and provides real-time database performance data and the load of the stress testing machine, so that users can follow the progress. In this stage, DAS also combines its intelligent parameter tuning algorithm with stress testing, so users can tune parameters as part of a test. The algorithm itself will be covered in a later article.

The effect evaluation stage interprets the metrics collected during the stress test. DAS compares the performance parameters and key performance indicators commonly used in business tuning to help users make resource evaluation decisions. For problems found during the test, such as slow SQL and locks, DAS also provides improvement suggestions and solutions to help users optimize their business.

4.2. Product use

Users can access the feature from the "Smart Pressure Test" menu on the left side of the DAS console, as shown in Figure 8. DAS currently supports stress testing for RDS MySQL and PolarDB MySQL; support for other relational database engines is under development.

Figure 8 Intelligent stress testing interface

After the stress test ends, users can open the task details to compare the performance data and key parameters of the target instance and the source instance, as shown in Figure 9.

Figure 9 Comparison of results after the stress test

4.3. Product billing

Currently, the DAS intelligent stress testing feature is not charged separately. ECS and RDS instances created during a stress test are billed at the pay-as-you-go rates published on the official website for each product, with no additional service fee. As mentioned above, stress testing relies on the full SQL detail data of the source instance or on the corresponding basic table structure data, so the only requirement is that DAS Professional Edition be enabled on the source instance.

4.4. Customer case

Since its launch in 2020, the DAS intelligent stress testing service has mainly served leading cloud customers. It has served nearly 100 customers in scenarios including cloud migration resource evaluation, promotion evaluation, engine switching evaluation, and database operation verification.

5. Future Planning

Next, intelligent stress testing will support more database engines until it covers all relational database engines on the cloud. It will also stay close to customers' real business problems, integrating tightly with scenarios such as cloud migration, resource evaluation, and engine recommendation to provide stress test evaluation recommendations and reports, and working with enterprise customers to build database capacity planning capabilities for large-scale scenarios.


