
1. Why use traffic recording and playback?

1.1 vivo business status

In recent years, vivo's Internet business has been growing rapidly. Because vivo phone shipments have long ranked among the highest in China, the accumulated user base is very large. vivo phones therefore ship with many built-in applications, such as the browser, short video, live streaming, news feed, and app store, all of which are complex, high-concurrency systems that face users directly. These user-facing systems have very high user-experience requirements, so assuring their service quality is a top priority.

1.2 Testing Pain Points

As our business grows in size and complexity, issues and challenges arise. One of the major problems we face is this: when a business system is iteratively upgraded or even refactored, how do we ensure the original behavior remains correct?

Simple business systems can be covered by conventional automated testing tools plus manual testing. For complex systems, however, regression testing becomes a difficult undertaking. Take our recommendation system as an example: it serves dozens of recommendation scenarios. How can we modify one scenario without affecting the others?

Previously we solved this by writing automated test cases, but hand-written test cases have many pain points:

  • Test cases are hard to write, test data is hard to construct, and real user behavior is hard to simulate.
  • Some code logic is hard to verify through test scripts; for example, a script that sends a message cannot verify that the message content is correct.
  • Manually constructed cases can hardly cover every scenario of the system, so cases are easily missed.
  • As deployment complexity grows, the cost of maintaining test environments also rises.

Given the low efficiency of regression testing during the iteration of these complex business systems, we kept exploring alternatives.

1.3 Scheme exploration

Based on the characteristics of the vivo Internet systems, we researched a number of industry solutions and drew up the following requirements: the new solution should be simple and efficient, easy for users to pick up without much background knowledge; the cost of onboarding a service should be low enough that regression testing can start quickly; and the solution should be general and extensible enough to adapt to evolving system architectures.

Referring to the technical solutions of several leading Internet companies, we found that traffic recording and playback is a very good fit. Many industry leaders have achieved good results and real value with this technique, which gave us both reference points and confidence. We therefore explored and implemented traffic recording and playback in depth; the result is our Moonlight Treasure Box platform.

2. What is traffic recording and playback?

Before introducing our specific practice, let me briefly explain what traffic recording and playback is.

Traffic recording and playback verifies the logical correctness of code by copying real production traffic (recording) and issuing simulated requests in a test environment (playback). By collecting online traffic, replaying it in the test environment, and comparing the result of the entry call and each sub-call one by one, we can find out whether the interface code has problems.

Using this mechanism for regression testing has many advantages. First, replacing hand-written test cases with recorded traffic is simple and efficient, and quickly yields a rich case set. Second, replaying online traffic faithfully simulates real user behavior and avoids the deviations of manually written cases. In addition, comparing recorded and replayed data object by object verifies system logic at a deeper, finer granularity. Finally, recorded traffic needs no maintenance and can be reused at any time, which is very convenient.

3. Moonlight Treasure Box Platform

The mechanism of traffic recording and playback looks excellent in theory, but it is not easy to implement, and many problems must be solved along the way. The following sections describe how traffic recording and playback is implemented in the vivo Internet systems, the problems we encountered, and how we solved them.

3.1 The underlying architecture

The vivo Moonlight Treasure Box platform draws on the experience of the open-source Jvm-Sandbox-Repeater project and was built as a second-stage development on top of it. The platform consists of two modules, the server and the Java agent. The overall architecture is shown in the figure below.

3.1.1 Business Architecture

The following figure shows the overall business architecture of our server. The server can be divided into modules such as task management, data management, coverage analysis, configuration management, and monitoring and alerting.

  • The task management module manages users' recording and playback tasks, including task start/stop, progress, and status;
  • The data management module manages the traffic data users record and replay, as well as the analysis data derived from it;
  • The coverage analysis module computes users' regression-coverage metrics;
  • The configuration management module maintains system-wide and per-application parameters;
  • The monitoring module analyzes the agent's performance metrics in each dimension.

In addition, there are auxiliary modules such as message notification.

3.1.2 Agent Architecture

The following figure is the overall architecture diagram of the agent module, which is the core of the recording and playback process. The agent is implemented on top of bytecode instrumentation and comprises a four-layer structure:

  • The bottom layer is the base container layer, a standard Java agent implementation;
  • Above the container layer is the dependency layer, which introduces the third-party resources we need and provides bytecode instrumentation, class-loading isolation, and class-metadata management;
  • Above the dependency layer is the basic capability layer, which implements atomic functions such as recording/playback plug-in management, data management, data comparison, sub-call mocking, runtime monitoring, and configuration loading;
  • The top layer is the business logic layer, which combines the basic capabilities into complete business units. Besides traffic recording and playback, Moonlight Box currently also supports functions such as dependency analysis and data mocking.

3.2 The startup process of Moonlight Box

To start a recording or playback task, the most important step is to deliver our agent to the designated business machine without intrusion and attach it to the business process automatically.

The startup process of Moonlight Box is shown in the figure below. The user first configures a recording/playback task on the Moonlight Box platform. Once configured, the configuration is stored, and the startup script together with the vivo-repeater-agent package is delivered to the user-specified machines via VCS (vivo's self-developed job scheduling platform). The shell script is then executed and launches the sandbox, attaching the agent to the target JVM. The agent then creates a jvm-sandbox inside the target JVM via reflection, and the sandbox loads multiple modules through SPI.

The most important of these is the vivo-repeater module, which in turn loads multiple plug-ins through SPI. These plug-ins enhance code in the target JVM via ASM to implement bytecode instrumentation; recording and playback then use the enhanced code points to intercept, deliver, and store traffic.

With this startup flow, a user can complete the complex traffic recording and playback workflow by configuring only a small amount of information on the console. The detailed recording and playback processes are described below.

3.3 Traffic recording process

The figure below shows the traffic recording process. The call chain of one piece of traffic consists of an entry call and several sub-calls; recording is the process of binding the entry call and its sub-calls into one complete call record via a unique ID. Moonlight Treasure Box finds suitable code points (the key entries and exits) for the entry call and the sub-calls, enhances those points with bytecode instrumentation to intercept the calls, records each call's input parameters and return value, and generates a recording ID according to the call type (such as Dubbo or HTTP). When the call completes, Moonlight Treasure Box assembles the call records of the whole traffic, performs desensitization, serialization, and other processing, and finally encrypts the record and sends it to the server for storage.
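To make the binding concrete, here is a minimal sketch of what one recorded traffic might look like as a data structure: an entry invocation plus its sub-invocations, tied together by a generated recording ID. The class and field names are illustrative, not the platform's actual model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Illustrative shape of one recorded traffic: an entry call plus its
 *  sub-calls, bound together by a unique recording ID. */
class RecordedTraffic {
    final String recordId;
    final String entryType;          // call type, e.g. "http" or "dubbo"
    final Object entryArgs;
    final Object entryResponse;
    // each sub-call stored as {identity, args, result}
    final List<Object[]> subCalls = new ArrayList<>();

    RecordedTraffic(String entryType, Object args, Object response) {
        // the recording ID is derived from the call type plus a unique suffix
        this.recordId = entryType + "-" + UUID.randomUUID();
        this.entryType = entryType;
        this.entryArgs = args;
        this.entryResponse = response;
    }

    void addSubCall(String identity, Object args, Object result) {
        subCalls.add(new Object[]{identity, args, result});
    }
}
```

At playback time, a structure like this supplies both the entry request to re-issue and the sub-call results to mock.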

Recording is a fairly complicated process, and along the way we hit a number of pitfalls. Below I will pick a few of the more important problems and share them.

3.3.1 Difficulty 1: Full GC

In the early days, an internal vivo system experienced Full GC when using Moonlight Box. Analysis showed that the recorded interface made heavy use of Guava calls, so the recorded request traffic was too large and triggered Full GC. Before the recording of one interface's traffic completes, all recorded data stays in memory; once the traffic or the number of sub-calls is too large, frequent Full GC easily follows. In addition, some high-concurrency systems expose many interfaces, and recording multiple high-concurrency interfaces at the same time creates performance pressure. We therefore optimized Moonlight Treasure Box's performance as follows:

  • Strictly limit the number of concurrent recordings and the number of sub-calls per piece of traffic;
  • Monitor the recording process and degrade on anomalies;
  • Merge identical sub-call recordings to reduce the sub-call count;
  • Monitor the recording cache occupancy in real time and degrade promptly once it crosses the warning line.
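The first and last of these guard rails can be sketched as a small admission check: a recording starts only if the concurrency cap has headroom and the cache is below its warning line, and is otherwise skipped (degraded) rather than queued. This is an illustrative sketch under assumed names, not the platform's actual code.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the recording guard rails described above. */
class RecordingGuard {
    private final Semaphore concurrentRecordings;   // cap on in-flight recordings
    private final long cacheWarningBytes;           // degrade above this watermark
    private final AtomicLong cacheUsedBytes = new AtomicLong();

    RecordingGuard(int maxConcurrent, long cacheWarningBytes) {
        this.concurrentRecordings = new Semaphore(maxConcurrent);
        this.cacheWarningBytes = cacheWarningBytes;
    }

    /** Returns true if a new recording may start; false means skip (degrade). */
    boolean tryBeginRecording() {
        if (cacheUsedBytes.get() >= cacheWarningBytes) {
            return false;                           // over the warning line
        }
        return concurrentRecordings.tryAcquire();   // refuse rather than queue
    }

    void endRecording(long bytesReleased) {
        cacheUsedBytes.addAndGet(-bytesReleased);
        concurrentRecordings.release();
    }

    void addCacheUsage(long bytes) {
        cacheUsedBytes.addAndGet(bytes);
    }
}
```

The key design choice is `tryAcquire()` instead of `acquire()`: under pressure, the agent drops recordings instead of blocking business threads.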

After continuous optimization, the recording process became very stable, and Full GC caused by excessive traffic no longer occurs.

3.3.2 Difficulty 2: Call Chain Stitching

Traffic recording and playback relies on an identifier carried in the thread context. Many vivo systems have custom business thread pools or use third-party frameworks that bring their own thread pools (such as Hystrix), which causes the identifier to be lost and the call chain to break.

Moonlight Treasure Box initially relied on the basic capability of Jvm-Sandbox-Repeater: when no thread pool is involved, the recording identifier is stored in a ThreadLocal and the whole call chain can be stitched together. For thread pools, we first used our own agent to automatically enhance Java thread pools and propagate the recording/playback ID transparently, but this conflicted with the thread-pool enhancement done by the company's call-chain agent and crashed the JVM, so we had to abandon that approach.

In the end we cooperated with the company's call-chain team and used the call chain's Tracer context to carry the recording identifier. Both sides made a certain amount of changes; in particular, the two agents adjusted the positions of their HTTP and Dubbo instrumentation points. We have not yet solved identifier propagation for thread-pool frameworks such as ForkJoinPool, and will continue to add support for them in the future.
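The underlying idea, whether done by bytecode enhancement or by the Tracer context, is capture-and-restore: snapshot the identifier on the submitting thread and restore it inside the worker. A minimal sketch of that pattern, with illustrative names (this is the general technique, not vivo's actual implementation):

```java
/** Sketch: carry a recording ID across a thread-pool boundary by capturing it
 *  at submit time and restoring it inside the worker. Names are illustrative. */
class TraceContext {
    private static final ThreadLocal<String> RECORD_ID = new ThreadLocal<>();

    static void set(String id) { RECORD_ID.set(id); }
    static String get() { return RECORD_ID.get(); }

    /** Decorate a task so the submitter's ID is visible in the worker thread. */
    static Runnable wrap(Runnable task) {
        final String captured = RECORD_ID.get();    // snapshot on the caller thread
        return () -> {
            String previous = RECORD_ID.get();      // save the worker's own value
            RECORD_ID.set(captured);
            try {
                task.run();
            } finally {
                RECORD_ID.set(previous);            // never leak across pooled tasks
            }
        };
    }
}
```

The `finally` restore matters with pooled threads: without it, a stale recording ID would leak into unrelated tasks reusing the same worker.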

3.3.3 Difficulty 3: Data Security

The third difficulty is how to keep recorded traffic secure. Many systems carry sensitive data. We therefore apply configurable desensitization to recorded data: users configure on the Moonlight Box platform which fields must be desensitized, and the agent masks those fields in memory during recording, ensuring data security in transit and at rest. In addition, Moonlight Treasure Box strictly controls who may view traffic details, preventing cross-project data queries.
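A minimal sketch of such a config-driven masking pass is shown below. The field names and the `***` placeholder are assumptions for illustration; the real platform works on serialized invocation data rather than a flat map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative field-level masking pass; field names come from user config. */
class Desensitizer {
    private final Set<String> sensitiveFields;

    Desensitizer(Set<String> sensitiveFields) {
        this.sensitiveFields = sensitiveFields;
    }

    /** Mask configured fields before the recording leaves the JVM. */
    Map<String, Object> mask(Map<String, Object> record) {
        Map<String, Object> out = new HashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            out.put(e.getKey(),
                    sensitiveFields.contains(e.getKey()) ? "***" : e.getValue());
        }
        return out;
    }
}
```

Because masking happens in agent memory before serialization, sensitive values never reach the wire or the server's storage in clear form.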

3.3.4 Difficulty 4: Deduplication of Traffic

The fourth difficulty is traffic deduplication. Business teams sometimes record a large amount of identical traffic, which makes subsequent playback slow and troubleshooting inefficient. We therefore looked for ways to minimize duplicate traffic while preserving interface coverage. The current solution deduplicates based on the traffic's input parameters and its execution call stack: during recording, the agent deduplicates according to the configured rules so that the traffic stored in the database is unique. In some scenarios this mechanism greatly reduces the volume of recorded traffic and improves its usage efficiency.
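The dedup idea can be sketched as deriving a key from the entry parameters plus the execution call stack, and keeping only the first traffic per key. This is a simplified illustration (a real implementation would use a stable serialized form and a bounded store, not an unbounded in-memory set):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Sketch of the dedup key: hash of entry parameters plus execution call stack. */
class TrafficDeduplicator {
    private final Set<String> seen = new HashSet<>();

    static String key(String params, List<String> callStack) {
        // same params + same stack shape => same key => considered duplicate
        return Integer.toHexString(Objects.hash(params, callStack));
    }

    /** Returns true only for the first occurrence of a (params, stack) shape. */
    boolean shouldStore(String params, List<String> callStack) {
        return seen.add(key(params, callStack));
    }
}
```

Including the call stack in the key is what preserves coverage: two requests with identical parameters but different execution paths are still both kept.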

3.4 Traffic playback process

The following figure shows the traffic playback process. Playback takes the entry call of a recorded traffic, invokes the iterated system again, and verifies the correctness of the system logic. Unlike recording, playback mocks all external calls and never actually touches the database. During playback, the input parameters of each replayed sub-call are compared with those of the recorded sub-call: if they differ, the replayed traffic is blocked; if they match, the recorded sub-call's result is returned as the mock. When playback completes it also produces a response, which we compare against the originally recorded response. From this comparison, together with the sub-call comparisons, we can judge the correctness of the system under test.
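The mock-or-block rule for sub-calls can be sketched as a lookup table keyed by the sub-call identity and its arguments: a match returns the recorded result, a mismatch blocks the replay. Names and the string-keyed lookup are illustrative simplifications.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal mock table for playback: a sub-call is answered from the recording
 *  only when its parameters match what was recorded; otherwise it is blocked. */
class SubCallMock {
    private final Map<String, Object> recorded = new HashMap<>();

    void record(String identity, String args, Object result) {
        recorded.put(identity + "#" + args, result);
    }

    /** Returns the recorded result, or throws to block the mismatched replay. */
    Object replay(String identity, String args) {
        Object result = recorded.get(identity + "#" + args);
        if (result == null) {
            throw new IllegalStateException("sub-call mismatch: " + identity);
        }
        return result;
    }
}
```

Blocking on mismatch is deliberate: a sub-call whose parameters diverge from the recording already signals a behavior change, so returning a stale mock would only hide the difference.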

Playback is the more complicated half, because recording and playback generally run against different versions of the system in different environments, and the differences can be large. Handled poorly, the playback success rate stays low. Initially, applications newly connected to Moonlight Treasure Box had a fairly low success rate; after long-term optimization and fine-grained operation, the playback success rate kept rising. Below are some of the difficulties we encountered and our coping strategies.

3.4.1 Difficulty 1: Time Difference

The first difficulty is the impact of time differences. Some business logic depends on time, and because recording and playback happen at different times, many time-dependent scenarios failed on playback. After some research into this problem, we ended up aligning the playback time with the recording time.

For the native method System.currentTimeMillis(), the agent dynamically rewrites the bytecode of the calling method body, proxying business calls to a time-acquisition method predefined by the platform, which guarantees the time substitution; the Date class is handled in the same way and is also easy. Non-native time APIs such as JDK 8's LocalDateTime are simpler still: their time methods can be mocked directly. With these mechanisms, the time-difference problem in business logic is essentially eliminated, along with the playback failures it caused.
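The effect of the rewrite can be sketched as redirecting business code from System.currentTimeMillis() to a platform-controlled clock that playback pins to the recorded timestamp. The class below is an illustrative stand-in for that redirection target, not the platform's actual API.

```java
import java.util.function.LongSupplier;

/** Sketch of time alignment: instrumented business code calls
 *  currentTimeMillis() here instead of System.currentTimeMillis();
 *  during playback the supplier is pinned to the recorded timestamp. */
class ReplayClock {
    private static volatile LongSupplier source = System::currentTimeMillis;

    static long currentTimeMillis() { return source.getAsLong(); }

    /** During playback, freeze the clock at the recording-time value. */
    static void pin(long recordedMillis) { source = () -> recordedMillis; }

    /** Restore the real clock when playback finishes. */
    static void reset() { source = System::currentTimeMillis; }
}
```

Because `new Date()` and similar APIs bottom out in the same millisecond source, redirecting that one call site covers most time-dependent logic.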

3.4.2 Difficulty 2: System Noise Reduction

The second difficulty is system noise. Many systems contain common noise fields such as traceId and sequenceId, and these fields are another cause of playback failure. At first, every newly connected service had to check such fields one by one, which was inefficient. Later, Moonlight Box added noise-field configuration at the global, application, and interface levels: common noise fields are handled once by the global configuration, and a connecting service only needs to configure its own specific fields. Noise-reduction configuration greatly improved onboarding efficiency.
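Conceptually, noise reduction means excluding the configured fields from the recorded-vs-replayed comparison. A minimal sketch over flat maps (real responses are nested objects, and field paths rather than plain names would be configured):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Compare recorded vs. replayed responses while ignoring configured noise
 *  fields such as traceId; names here are illustrative. */
class NoiseAwareComparator {
    private final Set<String> noiseFields;

    NoiseAwareComparator(Set<String> noiseFields) {
        this.noiseFields = noiseFields;
    }

    boolean same(Map<String, Object> recorded, Map<String, Object> replayed) {
        Set<String> keys = new HashSet<>(recorded.keySet());
        keys.addAll(replayed.keySet());     // a field missing on one side also counts
        for (String k : keys) {
            if (noiseFields.contains(k)) continue;               // skip noise
            if (!Objects.equals(recorded.get(k), replayed.get(k))) return false;
        }
        return true;
    }
}
```

Layering the configuration (global, application, interface) just determines which field set is handed to a comparator like this for a given interface.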

3.4.3 Difficulty 3: Unification of Environment

The third difficulty is environment differences. In the vivo Internet systems, recording generally happens in the production environment while playback happens in the test or pre-release environment. At first, many playbacks failed because the environments were inconsistent, which dragged down the overall success rate. To address this, Moonlight Box records a snapshot of the online environment configuration at recording time and automatically substitutes it for the offline configuration during playback, which keeps the configuration-center data consistent. In addition, for configuration data that lives in system memory, Moonlight Box provides an interface for synchronizing the in-memory data. These measures essentially guarantee consistency between the online and offline environments, and greatly reduce playback failures caused by configuration.
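The configuration-alignment step reduces to an overlay: at playback time, values snapshotted from the online config center override the local environment's values, while untouched local keys survive. A simplified sketch with assumed names:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the config-alignment step: during offline playback, values
 *  recorded from the online config center override the local environment. */
class ConfigAligner {
    static Map<String, String> align(Map<String, String> offline,
                                     Map<String, String> recordedOnline) {
        Map<String, String> merged = new HashMap<>(offline);
        merged.putAll(recordedOnline);   // the online snapshot wins on conflicts
        return merged;
    }
}
```

Keys present only offline (local database pools, test endpoints) are preserved, so the replayed system still runs against test infrastructure while seeing production business switches.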

3.4.4 Difficulty 4: Subcall Matching

The fourth difficulty is sub-call matching. The matching strategy we specified at the beginning could not handle complex business scenarios: traffic frequently failed to match, or matched incorrectly, making playback hard to pass. We later assigned a different matching strategy to each sub-call type: cache calls match on the cache key; HTTP calls match on the URI; Dubbo calls match on the interface, method name, parameter types, and so on. When multiple identical sub-calls still match, we additionally compare the call stack and the input parameters, combining those two dimensions to find the most likely match. These refined strategies improved the matching success rate.
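The per-type strategy can be sketched as a small factory that builds a match key from different components depending on the sub-call type. The key formats are illustrative; the point is that each protocol gets its own notion of identity.

```java
/** Strategy sketch: build a match key per sub-call type, as described above. */
class MatchKeyFactory {
    static String keyFor(String type, String... parts) {
        switch (type) {
            case "cache":
                return "cache:" + parts[0];                  // match on cache key
            case "http":
                return "http:" + parts[0];                   // match on URI
            case "dubbo":
                // match on interface#method#paramTypes
                return "dubbo:" + String.join("#", parts);
            default:
                return type + ":" + String.join("#", parts);
        }
    }
}
```

When several recorded sub-calls share the same key, a second pass over the call stack and input parameters (as in the deduplication key earlier) would pick the closest candidate.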

3.4.5 Difficulty Five: Troubleshooting

The fifth difficulty is troubleshooting. Recording and playback are complicated processes, and analyzing a problem caused by an agent running on a business machine is very hard. To raise troubleshooting efficiency, we support several aids:

1) A playback-analysis call-link diagram, which is explained in detail below;

2) Detailed output of the task startup command and parameters. With these, we can easily start and reproduce an online recording/playback task locally, which speeds up investigation;

3) One-click local installation of the agent. After modifying the agent code locally, we can install the new agent into a remote test environment with one click.

Beyond these, we have built many other efficiency tools, which will not be covered here.

3.5 Rich protocol support

vivo has many kinds of businesses, and different businesses use different technology stacks; connecting each system to our platform requires adapting the corresponding plug-ins. Through continuous improvement of these plug-ins, we now support the dozens of plug-ins shown below, basically covering all common middleware.

3.6 Other Features of Moonlight Box Platform

3.6.1 Visual call link

At first, playback failures could only be investigated by the platform developers, and the whole process was time-consuming and laborious. In response, we built some visual operation tools, one of which is the call-link analysis diagram. We trace and record the recording and playback process in detail and use the call-link diagram to help users analyze the execution flow. When a problem occurs, users can clearly see the exact location of the anomaly and its root cause, which improves troubleshooting efficiency.

3.6.2 Regression code coverage

One advantage of Moonlight Box is that recorded traffic naturally yields high coverage. But how do we verify that the replayed traffic really covers all the business system's scenarios, so that users can release with confidence after using Moonlight Treasure Box?

To answer this, Moonlight Box provides regression code coverage statistics. We use the internal CoCo-Server platform to compute the system's full and incremental code coverage. To isolate the coverage produced by traffic playback, we currently have to call an interface that clears the coverage data in machine memory before playback, which can conflict with other concurrent traffic; once the CoCo-Server platform finishes traffic coloring to distinguish traffic sources, this will no longer be a concern.

3.6.3 Scheduled recording and playback

Although operating traffic recording and playback is simple, it is still tedious for business users who run it frequently, especially when a release touches many systems, and recording and replaying several systems at once is inefficient. To improve this, Moonlight Treasure Box lets users define scheduled recording and playback tasks. These timed tasks record and replay in batches periodically, reducing manual effort and improving the platform experience.

3.7 Other applications of Moonlight Box

Beyond automated testing, we have explored applications in other areas. The first is load testing: users can derive a load-test model from traffic recorded through the Moonlight Box platform. The second is problem localization: replaying online problems offline on the Moonlight Treasure Box platform helps testers and developers reproduce the failure scene. The last is security analysis: routinely recording test-environment traffic supplies security engineers with traffic material for identifying security risks in business systems.

4. Core Indicators

Connecting to the Moonlight Treasure Box platform is very simple; an initial integration can basically be completed within 10 minutes. In less than a year since launch, nearly 200 business systems have been connected, many of them among the most core applications in the vivo Internet portfolio. More than 10,000 recordings and playbacks have been completed over the past year. With Moonlight Treasure Box, the platform has discovered dozens of online problems across different businesses in advance, effectively reducing the number of online incidents. In many scenarios, the traffic recording and playback features improve testers' and developers' efficiency by more than 80 percent, which on the whole has exceeded our expectations.

5. Future Planning

Our future planning focuses on two aspects: functionality, and open-source collaboration.

5.1 Functional planning

At present we have completed the platform's basic functions, but problems such as usage efficiency remain. Going forward, we will focus on the following two directions:

1) Precise testing: avoid replaying the full recorded data set every time, further shortening playback time. Precise testing analyzes the changed code to determine its impact scope, then filters the traffic to replay accordingly, which narrows the playback surface.

2) Integration with CI/CD in the vivo Internet systems: when a business system is released to the pre-release environment, recording and playback tasks are triggered automatically. In this way, risks can be identified before the system goes online, and user efficiency improves.

5.2 Open source co-creation

Open source is the direction of future software development, and we have always been beneficiaries of it. We also want to participate actively in open-source projects and give back to the community. We joined the open-source project https://github.com/alibaba/jvm-sandbox-repeater and became core contributors to the community; in the first phase we contributed five important plug-ins that the community lacked. Going forward, we plan to continue contributing some of Moonlight Treasure Box's core capabilities back to the community according to our roadmap.

Author: vivo Internet Server Team - Liu YanJiang, Xu Weiteng

This article is based on Liu YanJiang's live talk at the 2021 vivo Developer Conference. Reply [2021VDC] on the official account to obtain materials from the Internet technology track.

