
Overview

This article shares what the Youdao Premium Courses test team has tried and learned over the past year and a half while working to keep the system stable during major promotions. Topics include how to accurately estimate user traffic at the moment a sale opens, how to verify and regression-test services after performance optimization, and how to get away from running stress tests late at night. After continuous testing and optimization, all Premium Courses services have shown good stability and reliability under transaction volumes of several billion.

Author / Youdao Premium Courses Test Development Team

Editor / Ein

Background

Just as e-commerce platforms have the 618 and Double 11 promotions, online education platforms have two key dates each year: the April renewal campaign for the summer and autumn terms, and the October renewal campaign for the winter and spring terms. The product and R&D teams focus on the sales strategy and sales targets, and provide support in areas such as renewal tools, data accounting, and process integration. As testers, beyond ensuring that features work, we need to keep the system highly available when traffic suddenly surges to more than ten times its normal level. We do this through full-link stress testing and by rehearsing degradation, rate limiting, circuit breaking, monitoring, and emergency plans. The aim is for the system to get through the traffic peak without incident and, if something does go wrong, to detect, locate, handle, and recover from it quickly.

Concrete practice

Overall goal

In one sentence: ensure that the whole system stays stable at the moment the sale opens and throughout the event.

This breaks down into the following points:

  1. Capacity planning: Based on the overall traffic and business goals, estimate the capacity each subsystem must provide. Then, guided by the stress test results, scale out and optimize the system so that it can handle the expected business load.
  2. Flow control and degradation: The system must stop traffic from exceeding the capacity it can support, throttle traffic beyond the designed limit, and degrade overloaded or abnormal services promptly. The main thing to verify here is that the rate limiting and degradation strategies are reasonable and actually work.
  3. Monitoring: Test whether the existing monitoring can reliably detect and expose problems, so that issues trigger early warnings and can be found and handled early.
  4. Rehearsal of emergency plans: Rehearse the problems the system may face, using disaster simulations such as failures of basic services or data center outages, to check how the system behaves and to prepare reasonable responses.
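
To make point 2 concrete, here is a minimal sketch of rate limiting combined with degradation. It uses a token bucket, which is one common rate limiting technique and not necessarily the one used in our services. The names `TokenBucket` and `handle_request` and the parameters are illustrative assumptions.

```python
import time


class TokenBucket:
    """Token-bucket limiter: allows bursts up to `capacity`, refills `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_request(limiter, primary, fallback):
    """Serve from `primary` while within the limit; otherwise return the degraded `fallback`."""
    if not limiter.allow():
        return fallback()
    try:
        return primary()
    except Exception:
        # Degrade instead of propagating the failure to callers.
        return fallback()
```

A stress test of such a strategy would check two things: that requests beyond the limit really do receive the degraded response, and that the limit itself sits below the capacity measured for the service.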

Process


Figure ① Basic test process


Figure ② Problem discovery and location

Solutions to common problems

Determining the stress test model

The model has two main parts: paths and indicators.

Path: the route users take through the product during the actual event. On the service side, this translates into the order in which interfaces are called and whether those calls are serial or parallel. We derive it mainly by obtaining the SOP from the product team, capturing packets while following it, and converting the capture into an interface list.

Indicators: the proportion and timing of the various operations in each scenario. From a testing perspective, these are per-interface figures such as QPS and RT.
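
As a rough sketch of how such indicators could be derived from historical access logs, the function below computes peak QPS and a 95th-percentile RT per interface, then scales the peak by an estimated growth factor. The record format, the growth factor, and the function name `build_indicators` are assumptions for illustration, not the format of our actual logs or tools.

```python
import math
from collections import Counter, defaultdict


def build_indicators(records, growth_factor=1.0):
    """records: iterable of (epoch_second, path, latency_ms) tuples (hypothetical log format).

    Returns {path: {"peak_qps": ..., "target_qps": ..., "p95_rt_ms": ...}}.
    """
    per_second = defaultdict(Counter)   # path -> {second: request count}
    latencies = defaultdict(list)       # path -> [latency_ms, ...]
    for second, path, latency_ms in records:
        per_second[path][int(second)] += 1
        latencies[path].append(latency_ms)

    indicators = {}
    for path, counts in per_second.items():
        peak = max(counts.values())
        ordered = sorted(latencies[path])
        # Nearest-rank 95th percentile.
        rank = max(1, math.ceil(0.95 * len(ordered)))
        indicators[path] = {
            "peak_qps": peak,
            # Target for this campaign: historical peak scaled by expected growth.
            "target_qps": math.ceil(peak * growth_factor),
            "p95_rt_ms": ordered[rank - 1],
        }
    return indicators
```

Because every interface that appears in the logs gets an entry, an interface that nobody remembered to list still shows up in the result.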

  • The accuracy of the model largely determines whether the stress test succeeds. Last year our model missed the course withdrawal interface, which turned out to have a serious performance problem; it triggered a cascading failure across the system with significant impact. To obtain an accurate model, we follow the principle of "take it from reality, use it for reality": we pull the actual call data from the previous renewal campaign, clean and organize it, and combine it with this campaign's estimated volume and SOP to quantify the indicators for the stress test. See the following figure:
  • Our self-developed tools automate log analysis, fill in missing entries in the interface list, and determine the stress test scenarios. Part of the processing flow is shown below. (Note: npt is short for Hangyan's stress testing platform.)
  • Converting the SOP into interface paths. The traditional approach is to capture packets, then manually filter, compare, and organize the results, and finally sync any interface changes to the stress testing platform by hand. This work is tedious and repetitive. We turned it into a web tool: uploading a capture file is enough to see which interfaces were added or removed in each scenario. The tool also supports scenario-level interface list maintenance, an interface blacklist, and one-click import of interfaces into the stress testing platform. The result looks like this:
  • The final result: in both the interface list and the request volume, the traffic simulated in our stress test closely matched the actual traffic.
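
The capture-to-interface-list step can be sketched roughly as follows. This is an illustration only, not the code of our web tool: it assumes the capture is exported in HAR format (a JSON layout with `log.entries[].request`), and the function names and blacklist prefixes are hypothetical.

```python
import json
from urllib.parse import urlsplit


def interfaces_from_har(har_text, blacklist=()):
    """Extract the set of (method, path) pairs from a HAR capture, skipping blacklisted path prefixes."""
    har = json.loads(har_text)
    found = set()
    for entry in har["log"]["entries"]:
        request = entry["request"]
        path = urlsplit(request["url"]).path   # drop host and query string
        if any(path.startswith(prefix) for prefix in blacklist):
            continue
        found.add((request["method"].upper(), path))
    return found


def diff_interfaces(captured, existing):
    """Return the interfaces added to and removed from a scenario, relative to the existing list."""
    return {
        "added": sorted(captured - existing),
        "removed": sorted(existing - captured),
    }
```

The "added" list is what would be imported into the stress testing platform, while the "removed" list flags interfaces that may no longer belong in the scenario.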

Data construction

Data is a prerequisite for running a stress test. For example, we need a virtual user file in a specific format to supply request parameters. As another example, some users only count as valid if they hold certain course permissions, so those permissions must be granted to a batch of users in advance. Our main solution is to build web tools for batch operations on our self-developed platform, such as granting course permissions and issuing coupons. These tools run tasks on multiple threads so that valid users can be prepared in a relatively short time, including the database updates and Redis cache refreshes involved. The tools are reusable, which reduces the cost of constructing data.
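
The multi-threaded batch preparation can be sketched as below. This is a simplified illustration, not our platform's code: `grant_permission` is a hypothetical callable standing in for the real operation (database update, cache refresh, coupon issue), and the single-column CSV layout of the user file is an assumption.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def prepare_users(user_ids, grant_permission, max_workers=8):
    """Apply `grant_permission(user_id)` to every user in parallel.

    Returns (succeeded_ids, failed), where failed maps user_id -> error text.
    """
    def attempt(user_id):
        try:
            grant_permission(user_id)
            return user_id, None
        except Exception as exc:
            # Record the failure and continue so one bad user does not stop the batch.
            return user_id, str(exc)

    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order.
        for user_id, error in pool.map(attempt, user_ids):
            if error is None:
                succeeded.append(user_id)
            else:
                failed[user_id] = error
    return succeeded, failed


def render_user_file(user_ids):
    """Render the virtual-user parameter file as CSV text (the layout is an assumption)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["user_id"])
    for user_id in user_ids:
        writer.writerow([user_id])
    return buffer.getvalue()
```

Only the users that were prepared successfully go into the parameter file, so the stress test does not send requests for users that would be rejected as invalid.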

Environment

Time for stress testing was tight, so to ensure reliable results we initially ran the tests directly against the production environment. To limit the business impact, testing could only take place in the early hours after midnight, which stretched out the test cycle and wore out the people involved. After discussion, the development and operations teams helped set up a dedicated stress test environment: services are deployed as independent instances, middleware such as Redis and Kafka is isolated by adding a prefix to its data, and the core component MySQL runs as a separate instance. To address the insufficient data volume and the data cleanup problem in the test environment's MySQL, we developed a one-click MySQL synchronization and rollback tool. The process is shown in the figure:

Result: 80% of problems in the core e-commerce business can now be found and verified in the test environment, with no need to stay up late testing in production.
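
The prefix-based isolation of middleware data can be illustrated with a small wrapper. This is a sketch under assumptions, not our actual setup: `client` is any object exposing `get(key)` and `set(key, value)`, the `pt_` prefix is a made-up example, and `InMemoryClient` exists only to demonstrate the wrapper without a real cache.

```python
class PrefixedStore:
    """Wrap a key-value client so that every key is transparently namespaced."""

    def __init__(self, client, prefix):
        self._client = client
        self._prefix = prefix

    def _key(self, key):
        return f"{self._prefix}{key}"

    def get(self, key):
        return self._client.get(self._key(key))

    def set(self, key, value):
        return self._client.set(self._key(key), value)


def prefixed_topic(topic, prefix):
    """Derive the stress-test topic name from the regular topic name."""
    return f"{prefix}{topic}"


class InMemoryClient:
    """Stand-in for a real cache client, used only to demonstrate the wrapper."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        return True
```

With this pattern, service code keeps using the same logical keys and topic names, while stress test traffic reads and writes a separate namespace that can be cleaned up afterwards.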

Functional regression testing

  • Performance optimization is an unavoidable part of stress testing, so quickly verifying that an optimized interface still behaves correctly is essential to keeping performance testing moving. Many scenarios call for this kind of functional regression testing, for example changes to an interface's own logic (such as switching from single queries to batch queries) or changes to the data source (such as moving queries from ES to Doris). Manual regression testing and our existing automated interface regression tests both have limitations here, so we introduced a traffic analysis + diff approach, which is efficient and gives high coverage.
  • The core goals are a large volume of comparison data, fast comparison, and simple operation. The workflow is as follows:
  • For data comparison, many third-party libraries meet our requirements, such as DeepDiff, difflib, json-diff, and json_tools, each with its own focus. DeepDiff computes deep differences between dictionaries, iterables, strings, and other objects, recursively searching for all changes, and it works on any data that can be loaded into Python objects, such as parsed JSON or XML. Because its feature set is relatively complete and meets our requirements, we ultimately chose DeepDiff.
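
To show the idea behind the diff step, here is a simplified recursive comparison written with the standard library only. It is a stand-in to illustrate the technique, not DeepDiff's API and not our tool's code; the function name, the path notation, and the `ignore` parameter are assumptions.

```python
_MISSING = object()   # marks a key or index that exists on only one side


def deep_diff(old, new, path="root", ignore=()):
    """Recursively compare two JSON-like values.

    Returns a list of (kind, path, old_value, new_value) tuples.
    Paths listed in `ignore` (for example timestamps or trace ids) are skipped.
    """
    if path in ignore:
        return []
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(old.keys() | new.keys()):
            diffs += deep_diff(old.get(key, _MISSING), new.get(key, _MISSING),
                               f"{path}.{key}", ignore)
        return diffs
    if isinstance(old, list) and isinstance(new, list):
        diffs = []
        for index in range(max(len(old), len(new))):
            left = old[index] if index < len(old) else _MISSING
            right = new[index] if index < len(new) else _MISSING
            diffs += deep_diff(left, right, f"{path}[{index}]", ignore)
        return diffs
    if old is _MISSING:
        return [("added", path, None, new)]
    if new is _MISSING:
        return [("removed", path, old, None)]
    if old != new:
        return [("changed", path, old, new)]
    return []
```

In a regression run, `old` would be the response recorded from the interface before the optimization and `new` the response from the optimized version for the same replayed request; an empty result means the two responses match outside the ignored fields.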

Summary

Benefits

  1. Over multiple rounds of stress testing, we found more than 20 performance problems and optimization items. When the sale actually opened, the system performed well and feedback from the business side was very positive.
  2. The stability of the whole system improved significantly, and the day-to-day failure rate dropped significantly.
  3. Together we produced a comprehensive stress testing operations manual to guide stress testing for future large-scale events.

Future plans

  1. Routine stress testing → We will work with development and operations to deploy a dedicated stress test environment and establish routine stress testing procedures and standards, so that performance issues are found and fixed before new changes go live rather than in a last-minute scramble.
  2. Unattended stress testing → Currently, every full-link test requires the testers running it and the R&D engineers to watch production monitoring and alerts in real time in order to find and locate problems promptly. In the future we hope to hook our service and interface alerting systems into the stress test so that problems are detected and reports are generated automatically.
  3. More usable tools that developers can use to run regression tests themselves → Some of the tasks described above still involve a lot of manual work. We are continuing tool development, including adding comparison statistics and result display to the diff tool and building a web interface for traffic replay.
  4. Performance bottleneck analysis tools → When performance problems occur today, locating them usually depends on R&D engineers troubleshooting manually. We will investigate whether an APM system can perform some preliminary analysis.
  5. Integrate the tools and make them available to other business lines.

-END-

