
Author: Den Yat

Taobao is an aircraft-carrier-scale application used by hundreds of millions of users every day, so keeping the client stable is our primary goal. To that end we set a 5-15-60 target: when a problem triggers an alarm, respond within 5 minutes, locate it within 15 minutes, and recover within 60 minutes. The existing troubleshooting system, however, cannot meet this target well. The main reasons are as follows:

Monitoring stage

  • Alerts are aggregated from Crash stacks and exception messages, which is neither precise nor sensitive enough;
  • After an exception is detected, the client's behavior is limited: it only reports the exception information and cannot provide more useful data;
  • Most problems in the Taobao app are related to online changes, yet there is no monitoring of change quality.

Investigation stage

  • The exception information reported by monitoring is insufficient, so troubleshooting relies on logs;
  • Logs are not uploaded automatically when a crash or exception occurs; they must be retrieved manually, and cannot be retrieved while the user is offline;
  • After getting the logs:
  1. Logs lack classification and standards; they are messy, and logs from other modules are hard to understand;
  2. Logs lack scene information, so the user's operations around the exception cannot be fully reproduced;
  3. Logs lack lifecycle events, so the overall running state of the app cannot be grasped;
  4. Upstream and downstream log information from different modules cannot be correlated into a complete link;
  5. The existing log visualization tools are weak and do not improve troubleshooting efficiency;
  • Troubleshooting relies on manual analysis, which is inefficient, and the experience gained from recurring problems is not accumulated;
  • Data is not correlated across systems, so it has to be gathered from multiple platforms.

Diagnostic system upgrade ideas

To address these problems, we redesigned the entire wireless operations, troubleshooting, and diagnosis system. In the new architecture we introduced the concept of scenes. In the past, exceptions on the client were treated as independent events, and there was no way to do finer-grained processing or data collection for different exceptions. With scenes, a scene can be a combination of an exception and multiple conditions, and each scene can be configured separately, making the collected exception information richer and more accurate.

At the same time, we redefined client-side exception data. It mainly consists of standard Log data, Trace data that records the full call link, Metric data for runtime indicators, and snapshot data captured at the moment an exception occurs. The platform can use this data for monitoring and alerting, and can also analyze it visually. To accommodate business differences, the platform provides plug-in capabilities for analyzing the data. With this semantic information, the platform can make a preliminary diagnosis of the problem.

So the next goals to achieve are:

  • Implement scene-based monitoring and operations on the client;
  • Upgrade the log system, integrate LOG, TRACE, and METRIC data, and provide richer and more accurate troubleshooting information;
  • Integrate the data of the high-availability systems, and provide standardized interfaces and a platform for troubleshooting;
  • Support platform empowerment through plug-ins, let businesses customize diagnostic plug-ins, and connect the diagnostic system with other platforms;
  • Have the platform produce diagnosis results from the diagnostic information, moving toward automation and intelligence;
  • Based on the diagnosis results, provide solutions or raise improvement requirements, forming a closed loop of requirement -> development -> release -> monitoring -> troubleshooting -> diagnosis -> repair -> requirement.

Log system upgrade

At present, analyzing runtime logs is still the main method of client-side troubleshooting, and as mentioned above our logs themselves have problems. So our first step was to upgrade the log system. (Before this, we had already improved the basic capabilities of logging itself, such as write performance, compression ratio, upload success rate, and a log data dashboard.)

To improve troubleshooting efficiency, we started from the log content and formulated a standard client-side log protocol. A standardized protocol helps us build log visualization and automated log analysis on the platform side. From the Log, Trace, and Metric perspectives, and based on the actual situation of the Taobao app, we divide the existing logs into the following categories (a sketch of the protocol follows the list):

  • CodeLog: compatible with the original, relatively unstructured logs;
  • PageLog: records page jumps on the client, so that logs can be split by page when troubleshooting;
  • EventLog: records client-side events such as foreground/background switches, network status, configuration changes, exceptions, and click events;
  • MetricLog: records runtime indicator data such as memory, CPU, and business metrics;
  • SpanLog: full-link log data that connects independent points into a chain, defining unified standards for performance measurement and anomaly detection. An OpenTrace-based solution connects to the server, forming an end-to-end full-link troubleshooting mechanism.
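As a reference only, here is a minimal Kotlin sketch of what such a standardized log model could look like. The type names mirror the categories above, but the field layout is an assumption for illustration, not TLOG's actual protocol.

```kotlin
// Minimal sketch of a standardized log entry (field layout is assumed).
enum class LogType { CODE, PAGE, EVENT, METRIC, SPAN }

data class StandardLog(
    val type: LogType,               // which of the categories above the entry belongs to
    val module: String,              // owning module, used for grouping on the platform
    val timestamp: Long,             // epoch millis when the entry was written
    val traceId: String? = null,     // set on SpanLog entries to stitch the full link
    val fields: Map<String, String>  // structured payload instead of free-form text
)

// Example: a PageLog entry recording a page jump.
val pageJump = StandardLog(
    type = LogType.PAGE,
    module = "Navigation",
    timestamp = System.currentTimeMillis(),
    fields = mapOf("from" to "Home", "to" to "ItemDetail")
)
```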


With this data, and with the help of the log visualization platform on the platform side, client-side behavior can be replayed quickly. Thanks to the standardization of logs, the platform can also analyze them and quickly highlight abnormal nodes.

Client-side diagnostic upgrade

The client is the source of the entire diagnostic system: all exception data and runtime information are collected and reported by various tools on the client. The main client-side tools at present are:

  • APM: collects runtime and performance information on the client and reports it to the server;
  • TLOG: collects logs while the client is running. Log files are stored locally, and the server issues retrieval instructions when they are needed;
  • UT: the client-side event-tracking tool; a lot of business exception information is reported and alerted through UT;
  • Exception monitoring: represented by the Crash SDK, which mainly collects crash information on the client, together with SDKs that collect exceptions and user feedback;
  • Troubleshooting tools: memory detection, jank detection, white-screen detection, and so on. They are not classified under APM because they run with sampling, and some are even disabled by default.

As you can see, there are already many mature tools on the client, yet missing data and similar problems are still common when troubleshooting. The main reasons are that, on the one hand, the data on the platform side is scattered and there is no unified query interface; on the other hand, the client never integrated the data from these tools, and when an exception occurred the tools barely interacted with each other, so data went missing.

To solve these problems, we introduced a new diagnostic SDK and a staining SDK on the client. Their main functions are as follows (a rough sketch of these responsibilities follows the list):

  • Integrate with existing tools to collect client-side runtime data, and write this data into TLOG according to the standard log protocol;
  • Monitor change information on the client and generate the corresponding staining marks;
  • Monitor exceptions on the client, generate snapshot information (including runtime information and change information) when an exception occurs, and report it to the server;
  • Report diagnostic data per scene: collect and report data in specific scenes according to rules configured by the server;
  • Support directed diagnosis: invoke the corresponding troubleshooting tool to collect data according to configuration issued by the server;
  • Support real-time log upload for online debugging of specific users and devices.
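Read as a single facade, these responsibilities could look like the Kotlin sketch below; the names and signatures are assumptions, not the real SDK's API.

```kotlin
// Illustrative facade for the diagnostic SDK responsibilities listed above.
// All names and signatures are assumptions for the sake of the sketch.
interface DiagnosisSdk {
    /** Write runtime data from existing tools (APM, UT, ...) as standard protocol logs. */
    fun record(logType: String, module: String, fields: Map<String, String>)

    /** Called by the staining SDK when a change takes effect on this device. */
    fun onChangeApplied(stainingMark: String)

    /** Called by exception SDKs; returns the snapshot id to attach to their own report. */
    fun onException(exceptionType: String, detail: Map<String, String>): String

    /** Apply scene rules delivered by the server via PUSH or PULL. */
    fun updateSceneRules(rulesJson: String)

    /** Start a real-time log session for a specific device or user when requested. */
    fun startRealtimeUpload(sessionId: String)
}
```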

Exception snapshot

Client-side exceptions mainly include crashes, business exceptions, and user feedback. When an exception occurs, the format, content, channel, and destination platform of the reported data all differ, and adding any new data requires changes both on the client and on the corresponding platform. So we hooked the exception-monitoring SDKs on the client and provided an exception-notification interface for the business. When the diagnostic SDK receives an exception, it generates a data snapshot from the runtime information it has collected. Each snapshot has a unique snapshotID; we only need to pass this ID to the corresponding SDK, so the changes to existing SDKs are minimal.
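A minimal Kotlin sketch of this flow, under the assumption that runtime information and staining marks are supplied by callbacks (the field names are illustrative, not the real snapshot format):

```kotlin
import java.util.UUID

// Sketch of the snapshot flow described above; field names are assumptions.
data class Snapshot(
    val snapshotId: String,                // unique id shared with the exception report
    val createdAt: Long,
    val runtimeInfo: Map<String, String>,  // memory/CPU/page info collected on the spot
    val stainingMarks: List<String>        // marks of the changes in effect on this device
)

class SnapshotCollector(
    private val runtimeInfoProvider: () -> Map<String, String>,
    private val stainingMarkProvider: () -> List<String>,
    private val uploader: (Snapshot) -> Unit
) {
    /**
     * Called from the exception-notification interface. Returns the snapshotID,
     * so the Crash/feedback SDK only has to attach one extra field to its report.
     */
    fun onException(): String {
        val snapshot = Snapshot(
            snapshotId = UUID.randomUUID().toString(),
            createdAt = System.currentTimeMillis(),
            runtimeInfo = runtimeInfoProvider(),
            stainingMarks = stainingMarkProvider()
        )
        uploader(snapshot)          // report the snapshot to the diagnosis platform
        return snapshot.snapshotId  // the exception SDK carries only this id
    }
}
```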


Snapshot data will become richer as client-side capabilities grow. The collected snapshots are uploaded to the diagnosis platform, and the snapshotID can be used to correlate data across platforms. The diagnosis platform can then analyze the exception based on the snapshot and log information and give a preliminary diagnosis.

Change monitoring

Most problems in the Taobao app are caused by online changes. The existing monitoring and troubleshooting system does not specifically monitor changes; it mainly alerts on exception counts and trends, which lags behind and makes it hard to catch problems quickly during the ramp-up stage. There is also no unified control standard for releasing changes. We therefore introduced the staining SDK on the client to collect change data, working with the change-diagnosis platform of wireless operations to monitor change releases, so that they can be rolled out gradually, observed, and rolled back.

Current client-side changes include general configuration changes (the Orange platform), AB-test changes, and business-specific changes (touchstone, security, Neologism, etc.). The staining SDK integrates with these SDKs on the client; after collecting the change data, it generates and reports the corresponding staining marks. Working with TLOG and the diagnostic SDK, it records these changes and attaches the marks to the exception information when an exception occurs. The platform side also connects to the various release platforms and the high-availability data platform, and makes release decisions based on the data reported by the client.

Staining mark

A change is essentially a process in which the server delivers data to the client and the client consumes it.

So for a change, we define:

  • Change type: distinguishes the kind of change, such as configuration change (Orange), test change (ABTest), business A change, business B change, etc.;
  • Configuration type: there may be multiple configurations under one change type. For example, for Orange changes, each configuration has a namespace that distinguishes it;
  • Version information: identifies a specific release of a configuration. Not every change has explicit version information; any unique identifier of a particular release or configuration can serve as the version. For example, each Orange configuration release has a version, and each ABTest release has a publishID.

With this information we can generate a unique staining mark for each change. By reporting the mark we can count how many times a change has taken effect; by carrying the mark in snapshot information we can compute the crash rate and the amount of user feedback associated with the change, and compare them against the high-availability dashboard data to monitor the quality of the change. The business can also carry the staining mark in network requests, so that anomalies of the related interfaces can be counted as well.
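For illustration, here is a Kotlin sketch of deriving such a mark from the three fields defined above; the `type:config:version` layout is an assumption, not the actual encoding.

```kotlin
// Sketch of a unique staining mark built from the three fields above.
// The "type:config:version" layout is an assumption for illustration.
data class ChangeIdentity(
    val changeType: String,   // e.g. "orange", "abtest", "businessA"
    val configType: String,   // e.g. an Orange namespace or an ABTest experiment
    val version: String       // e.g. an Orange config version or an ABTest publishID
) {
    /** A stable mark identifying this specific release of this configuration. */
    fun stainingMark(): String = "$changeType:$configType:$version"
}

// Example: an Orange configuration release. The mark is reported when the change
// takes effect and carried in exception snapshots.
val mark = ChangeIdentity("orange", "search_config", "20240101.3").stainingMark()
```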

Grayscale definition

For the staining SDK, we want the grayscale period to be observable so that problems are found early and problems caused by changes never reach the full online audience. We therefore divide the release process into three stages:

  • Preparation period: prepare the change data, verify in pre-release, submit for approval, and publish online. At this stage the type, configuration, and version of the change can be determined;
  • Gray period: deliver the configuration change to a subset of users. Our staining monitoring mainly runs at this stage, reporting staining marks and attaching them to exception snapshots. The platform processes the data from this stage to produce grayscale metrics;
  • Full-release period: when the grayscale metrics meet the bar, the change enters full release and the configuration is pushed to all eligible users.

Data reporting control

Because the number of mobile users is so large, continuing to report effect data after a change has been fully released would put great pressure on the server, and continuing to mark exception information would add little value. So we need a mechanism on the client to control the reporting of staining data.

For general changes such as Orange and ABTest we made individual adaptations, so reporting can be controlled via black/white lists, sampling, or release status according to the experiment ID and configuration namespace. For custom changes, however, the controllable conditions vary widely, and fine-grained control would require understanding each specific change. So we defined some general controls: a gray flag, a sampling rate, and an expiration time.


This information is delivered to the client as configuration files. The client does not need to care about how these conditions are set; they are configured on the platform side, which connects to the release platform and the high-availability platform and makes decisions based on the reported data. At present, the decision to report is mainly based on the gray flag plus the timeout in the configuration.
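A minimal sketch of that client-side decision, assuming the three general controls above are delivered as a per-change config (the config shape is illustrative):

```kotlin
import kotlin.random.Random

// Sketch of the reporting decision based on the gray flag, sampling rate and expiry.
// The config shape is an assumption for illustration.
data class StainingReportConfig(
    val grayFlag: Boolean,      // true while the change is still in its gray period
    val sampleRate: Double,     // fraction of devices that should report, in 0.0..1.0
    val expireAtMillis: Long    // reporting stops after this timestamp regardless
)

fun shouldReportStaining(
    config: StainingReportConfig,
    now: Long = System.currentTimeMillis()
): Boolean =
    config.grayFlag &&
    now < config.expireAtMillis &&
    Random.nextDouble() < config.sampleRate
```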

Release access control

After the client reports the effect counts, exception staining, and other data, the server can monitor the change based on them, determining from the number of related crashes, the amount of related user feedback, the grayscale duration, and so on whether the change meets the bar for full release.

At the same time, the crash and user-feedback information related to the change is listed, to help judge whether this release carries risk.
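A sketch of what such a gate could look like; the metric names and thresholds below are invented for illustration and are not the actual criteria used by the platform.

```kotlin
// Sketch of a platform-side gate for promoting a change from gray to full release.
// Metric names and thresholds are assumptions, not the real criteria.
data class GrayMetrics(
    val effectiveDevices: Long,   // devices on which the change took effect
    val relatedCrashes: Long,     // crashes whose snapshots carry this staining mark
    val relatedFeedback: Long,    // user feedback items linked to this change
    val grayDurationHours: Long
)

fun canPromoteToFullRelease(m: GrayMetrics): Boolean {
    val crashRate = if (m.effectiveDevices == 0L) 0.0
                    else m.relatedCrashes.toDouble() / m.effectiveDevices
    return m.grayDurationHours >= 24 &&    // minimum observation window
        m.effectiveDevices >= 10_000 &&    // enough gray traffic to be meaningful
        crashRate < 1e-4 &&                // crash rate attributable to the change
        m.relatedFeedback < 10             // little feedback tied to the change
}
```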

Businesses such as Orange configuration changes, AB-test changes, item detail, and order placement have already been onboarded. The results so far are good: four online faults have been successfully avoided.

Scene-based reporting

Scene-based data reporting is an important capability of the diagnosis system. In the past, when an alarm fired we manually pulled the relevant data from the client, and different exceptions needed different data, often requiring multiple operations across multiple platforms. Data acquisition therefore lagged, and the overall troubleshooting time was uncontrollable. So, once the client had the new log standard, exception snapshot collection, and change staining in place, we introduced scene-based reporting.

For example, with the existing approach, when an online exception alarm fires we usually start by examining the reported exception information. Because that information is insufficient, we often have to pull TLOG for further investigation, but TLOG retrieval depends on the user being online, which makes the whole investigation and localization take a very long time.

With scenes, when the platform detects that the count of some exception is about to reach the alarm threshold, it can automatically generate a scene rule configuration, select a group of users, and deliver it to the clients. When the same exception occurs on a client, the scene engine collects and reports the required data, so that by the time the alarm threshold is reached the platform already has enough data for analysis and localization.

Scene rules

The scene engine executes the scene rules issued by the server. A scene rule mainly consists of three parts (a sketch of a complete rule follows the Action list below):

Trigger

A trigger can be a behavior or an event on the client. Instead of reporting data only on crashes or business exceptions, we expanded the set of triggering moments:

  • Crash exception
  • User screenshot feedback
  • Network exception (mtop errors, network errors, etc.)
  • Page exception (white screen, abnormal display)
  • System exception (excessive memory usage, excessive CPU usage, fast battery drain, overheating, jank)
  • Business exception (business error codes, logic errors, etc.)
  • Startup (usually used for directed diagnosis)

Condition

Condition evaluation is the core of scene-based uploading. With scene conditions, we can classify exception types more precisely along multiple dimensions. The conditions fall roughly into:

  • Basic conditions: match and filter by device information, user information, version information, and so on;
  • Status conditions: mainly runtime information such as network status, the current page, and the memory level;
  • Specific conditions: conditions that differ per scene. For example, when a crash occurs, matching can be based on the exception type, stack, and other information; when a business exception occurs, matching may be based on the error code and the network error.

Action

When a rule on the client is triggered and all of its conditions are met, the specified actions are executed, so that different data can be collected for different scenes. The main client-side actions at present are:

  • Upload TLOG logs
  • Upload snapshot information
  • Upload memory information
  • Work with other troubleshooting tools to collect and report data according to the issued parameters
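Putting the three parts together, a scene rule might look like the Kotlin sketch below; the field names and string values are assumptions, not the real rule schema.

```kotlin
// Illustrative scene rule combining trigger, conditions and actions as described above.
// Field names and values are assumptions, not the real rule schema.
data class SceneRule(
    val ruleId: String,
    val trigger: String,                          // e.g. "crash", "screenshot_feedback", "mtop_error"
    val basicConditions: Map<String, String>,     // device / user / version filters
    val statusConditions: Map<String, String>,    // network status, current page, memory level, ...
    val specificConditions: Map<String, String>,  // e.g. exception type, stack keyword, error code
    val actions: List<String>,                    // e.g. "upload_tlog", "upload_snapshot", "upload_memory"
    val reportQuota: Int                          // the platform stops the scene once it has enough data
)

// Example: collect TLOG and snapshots for a specific crash on one app version.
val crashScene = SceneRule(
    ruleId = "scene-20240101-001",
    trigger = "crash",
    basicConditions = mapOf("appVersion" to "10.29.0"),
    statusConditions = mapOf("network" to "wifi"),
    specificConditions = mapOf("exceptionType" to "NullPointerException"),
    actions = listOf("upload_tlog", "upload_snapshot"),
    reportQuota = 200
)
```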

Scene release

A scene management platform has been built on the platform side, where scenes and conditions can be configured easily, with standard release, review, and grayscale procedures. Through PUSH + PULL, the client can obtain scene rules in a timely manner.

Both the platform and the client also support targeted delivery of configurations: by specifying basic conditions, scenes can be delivered to specific devices and users.

Data flow control

When a scene rule is matched on the client, various exception data is reported, but too large a volume of data would put significant pressure on server-side storage. We therefore apply traffic control to the data reported by scenes.

From a troubleshooting point of view, only a handful of related logs and data points may be needed for the same problem. So when a scene is created, a reporting threshold is specified; once the platform has collected enough data, the scene is stopped and the client is notified to take the rule offline.

To ensure that frequent uploads of diagnostic data do not affect normal use, the client also has its own rate-limiting strategies, such as reporting only on Wi-Fi, limiting the number of rules executed per day, limiting the amount of data uploaded per day, and enforcing a minimum interval between uploads.
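A rough Kotlin sketch of those client-side checks, with illustrative limits (the actual values and the daily reset of the counters are assumed and omitted):

```kotlin
// Sketch of client-side upload throttling: Wi-Fi only, daily rule/data budgets,
// minimum upload interval. Limits are illustrative assumptions; the daily reset
// of the counters is omitted for brevity.
class UploadThrottle(
    private val maxRulesPerDay: Int = 5,
    private val maxBytesPerDay: Long = 10L * 1024 * 1024,
    private val minIntervalMillis: Long = 10L * 60 * 1000
) {
    private var rulesToday = 0
    private var bytesToday = 0L
    private var lastUploadAt = 0L

    fun canUpload(onWifi: Boolean, payloadBytes: Long, now: Long): Boolean =
        onWifi &&
        rulesToday < maxRulesPerDay &&
        bytesToday + payloadBytes <= maxBytesPerDay &&
        now - lastUploadAt >= minIntervalMillis

    fun recordUpload(payloadBytes: Long, now: Long) {
        rulesToday++
        bytesToday += payloadBytes
        lastUploadAt = now
    }
}
```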

Custom scene

At present our triggers cover common scenarios, and the reported data is obtained from high-availability tools, but a business may have its own exception monitoring system. So we also provide interfaces for the business to call, letting it use our capabilities of delivering scenes, evaluating rules, and obtaining runtime data to help diagnose its own problems. The business can define its own trigger timing and trigger conditions. In the future we can also add custom actions, so that the business can report its own data per scene.
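A small sketch of what such a business-facing interface could look like; the names are assumptions rather than the real SDK's API.

```kotlin
// Illustrative business-facing interface for custom scenes; names are assumptions.
interface CustomSceneApi {
    /**
     * The business reports its own exception as a trigger; the scene engine then
     * evaluates any matching scene rules and their conditions.
     */
    fun triggerBusinessException(bizType: String, errorCode: String, extras: Map<String, String>)

    /**
     * The business registers a predicate that the scene engine evaluates as a
     * specific condition whenever a rule references [conditionKey].
     */
    fun registerCondition(conditionKey: String, predicate: () -> Boolean)
}
```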

Directed diagnosis

Besides TLOG, there are other troubleshooting tools such as memory, performance, and jank tools, which are useful for specific problems. However, they currently run online either sampled or disabled by default, and the existing configuration cannot target specific devices or users, which means the information available when troubleshooting is often incomplete. So we reused the scene system's targeted-delivery capability, made small modifications to the existing tools, and now work with them to collect and report exception data.

Future outlook

Client-side diagnostic capabilities are still under active construction. We are iterating on pain points in the existing troubleshooting process, such as real-time logs, remote debugging, and more complete full-link and exception data. We should also recognize that client-side diagnosis is no longer about simply stacking tools to collect data; in the future, data needs to be collected and processed in a more targeted and refined way.

As diagnostic data improves, the next challenge is how the platform uses it to locate root causes, analyze impact, and accumulate diagnosis results. The client should not be positioned only as a data collector: relying on the diagnostic knowledge accumulated on the platform, combined with on-device intelligence, it can diagnose problems on the client itself; and together with client-side capabilities such as degradation and dynamic patching, it can self-heal after an exception occurs.
