Dewu App ANR monitoring platform design

Before Dewu, all online ANR problems were collected and displayed on the Bugly platform. In the experience, it was found that whether it is the effectiveness of the information provided in solving the problem or the aggregation of ANR logs Doesn't meet the needs of the actual scene of Dewu. Therefore, after referring to the ANR monitoring and governance solutions shared by major manufacturers in the industry, we started to build our own internal ANR monitoring platform. This article mainly introduces the realization of the ANR monitoring platform part, and the collection of relevant information on the end will not be explained.

Processing of ANR information

Before proceeding with the data processing of the ANR log, first sort out the information content currently collected and reported by the ANR, mainly including the following information

ProcessErrorStateInfo
tomstone information
Stack sampling information of the main thread
Main thread Looper Message historically scheduled Message information, currently processed Message, unprocessed Message queue
App status information
- App front and back status
- Application running time
- ...

The following describes the processing of the above-mentioned key information.

ProcessErrorStateInfo parsing

ProcessErrorStateInfo is obtained through the getProcessesInErrorState() function of ActivityManager, and it mainly contains the following information

processName
pid
uid
tag (If ANR occurs on system components such as Activity, Service, etc., the tag will be set to the corresponding component name)
shortMsg (short description of ANR message)
longMsg (Details of ANR)

The main concern here is the information of longMsg. Through regular matching, the following key information can be parsed from longMsg

ANR component
Reason for ANR trigger

tomstone information analysis

For the information contained in the Android tombstone file, please refer to tombstone in debuggerd. Currently, Dewu uses iQiyi's open source xcrash to collect ANR tomsbtone information. The collected tombstone file mainly contains the following information

Crash related meta information
- Crash type (anr, java, native)
- Application start time
- Crash time
- App version number
- system version
- ABI
- pid, tid, crash thread name
Virtual machine GC related information
On-site information of each thread during ANR
- Thread current call stack
- Thread priority prio
- Thread state
- Lock information held or waiting
- Thread nice value, utm, stm and other information
logcat information
open files
memory info
....

The analysis of the tombstone information of xcrash is also directly implemented using the code built in xcrash.

Analysis of Looper Message backtracking information

LooperMessage backtracking refers to the history of message records (also including unprocessed message records) of the main thread of the system, Looper, classified from the processing status of Messages, including three types of Messages, 1. Processed Messages 2. Currently processed messages Message 3. Unprocessed Message;

For the processed Message, the time spent executing the Message (wallTime) and the execution time allocated by the CPU (cpuTime) are collected. Based on these two times, you can roughly judge the scheduling of the main thread cpu, if there is a difference between the two The difference is large (cpuTime only accounts for less than 30% of wallTime), considering that there may be a serious thread scheduling preemption problem.

For unprocessed Messages, it is mainly used to observe the current Message scheduling delay, whether there is too much message backlog and other cases.

Analysis of stack sampling information (method trace)

When the App is running, Dewu will collect the current execution stack of the main thread at regular intervals (50ms). A total of 10S~20S thread call stacks will be saved. When an ANR occurs, the collected stack information will be along with the ANR information. Reported.

By processing the collected stack in the background, the more time-consuming functions can be analyzed in advance. Here is a brief description of the analysis and processing methods for function sampling data

First, the collected stack is traversed and analyzed, and converted into the form of NodeTree according to the calling relationship of the function. During the traversal process, the time consumption of each function is recorded
Starting from the root node, traverse the MethodNodeTree hierarchically to find all functions that take longer than a certain threshold
Filter out some whitelist functions, such as com.android.internal.os.ZygoteInit.main, android.app.ActivityThread.main

Part of the code

public class MethodNode {
     private List<MethodNode> children = new ArrayList<>();

    //method cost time
    private int cost;
    
    private String fullMethodName;

    private JavaStackFrameSnapshot javaStackFrameSnapshot;
    
}

Aggregation of ANR logs

In the APM platform, for issue-type events, a very important processing link is the process of data aggregation. The ANR logs reported on the end also need to be aggregated, so that ANR problems can be better classified, the severity of ANRs occurring online for different problems can be evaluated, and centralized management can be carried out according to the severity.

Aggregation strategy

In the production environment, there are many reasons for the occurrence of ANR in equipment. If it is simply classified, it can be divided into the following situations

Slow function execution on the main thread causes
- For example, the main thread executes heavy code, such as sorting large amounts of data
- Perform IO operations on the main thread
Looper message scheduling exception
Lock wait, deadlock
The system load is high and the CPU scheduling of the current process is affected
System binder service is abnormal
.... Other abnormal conditions

In view of the ANR problem that may be caused by slow functions, through the "stack information analysis" link in the previous section, all in-app functions that take more than a certain threshold have been analyzed. Under certain conditions (main thread CPU scheduling is normal), you can follow time-consuming function to aggregate .

For other cases, we first simply ANR component + ANR trigger reason 161460d1177a97, first aggregate according to this relatively coarse granularity, and then analyze. If a certain rule is found, The platform can support some specific rules , custom aggregation capabilities for data aggregation.

Platform experience

This chapter shares the experience design of DEWU on platform functions for ANR issues. The focus is mainly on how to display the existing data and information to help developers better discover and locate problems.

For each reported ANR log, in addition to some general information (such as userId, sessionId, deviceOs, etc.), the focus is on the display of contextual information related to the ANR issue.

List of Questions

First, after the aggregation process, the ANR logs are divided into ANR ISSUEs, and each ISSUE has its own processing status and trend for problem management and problem trend follow-up.

In the application as a whole, a trend chart of the corresponding dimension is also provided to observe the current online ANR situation.

Question details

On the question list page, you can roughly understand how many questions there are in the current ANR, the trend and processing status of each question. When you click on each question, you will jump to the details page of the "aggregation question", which is more complete in the aggregation algorithm. In the case of (in actual scenarios, it is difficult to achieve the diversity of current causes of ANR), the details of each problem are ANR caused by a specific cause, or an ANR that occurs on a specific application component.

On the problem details page, you can also display the occurrence trend of the specific problem and the status of the problem handling, but more importantly, what key element information is provided by the platform to help developers locate ANR problems.

Log details

In each ANR ISSUE detail page, each ISSUE is a collection of logs reported by ANR, that is, ISSUE and Log, which is a one-to-many relationship. On the problem details page, you can view all the abnormal logs reported under the problem. , When you click on each exception log, you can view the detailed information of the log. In the log details section, the ANR platform displays all the information analyzed in the previous section, including the main thread Message traceback information, method trace information, and CPU usage of each process, etc., in a more friendly visual way.

Display of ANR log basic information

ANR triggered components
Reason for ANR trigger
The front and back status of the application when the ANR occurs
Application running time
....

CPU usage information of each process for a period of time after ANR

This is mainly to display the CPU usage information of each process collected on the end, and the usage rate is divided into user mode and kernel mode.

Here are mainly two aspects of information, whether there is any abnormality in the cpu usage of the current process, and the distribution in user mode and kernel mode. Another aspect is to check the cpu usage rate of other processes, and consider whether it is possible that the cpu usage rate of other processes may be too high, which may affect the current process.

On-site information display of each thread during ANR

Thread scene information is mainly to check whether the main thread is likely to deadlock, and the running status of other threads, whether there may be too many other work scenes, and the current thread scheduling is affected.

Display of Looper message backtracking information

The looper message backtracking can see the distribution and processing of the message from the perspective of the main thread processing the Message. In addition to the time-consuming function, the cause of ANR may also be due to a bottleneck in the scheduling of the main thread CPU, or too many messages sent to Looper, and many other reasons.

logcat information

From the logcat information, you can assist in the verification and obtain some information before and after the occurrence of ANR, and sometimes the problem can also be located from the exception log in the logcat.

Function execution flame chart (Flame Chart) display

The flame graph can view the execution of the application function from a global perspective, and more intuitively analyze the function that may cause anr.

Ideas for using ANR monitoring platform to troubleshoot problems

Simply put, through the existing platform, ANR problems can be analyzed from the following aspects

From the front-end and back-end status of the ANR, the application running time, the process cpu usage rate, and the difference between the cpuTime and wallTime of the Message in the Message Trace, there is probably an assessment of the current application's cpu scheduling capability.
Analyze from the flame graph whether there is an obvious time-consuming business code
From MessageTrace, see if there is any obvious abnormal message processing (a single message is processed too long, the number of messages is too much, etc.)
From the CPU usage
- See if there is any abnormality in the proportion of user and kernel usage in the application process. If the kernel usage is high, analyze whether a large number of system calls have occurred. If the usage rate of the user layer is high, it means that there are more heavy tasks on the business side. implement.
- Check whether the kswapd0 cpu usage rate is high. If the cpu usage rate is relatively high, it means that the system memory resources are insufficient. At this time, the system is undergoing frequent memory swaps (exchange of memory space and disk space), occupying the system's cpu and io resources. Therefore, the performance of the whole machine will be affected.
- Check the cpu usage rate of systemServer, whether it is because the cpu usage rate of the system service is too high, causing the current process to be unable to allocate enough time slices
Troubleshoot problems from logcat
- Observe the gc-related information from logcat, whether it is possible that the ANR caused by the execution of blocking gc is due to a serious shortage of memory. Here, you can also combine the online memory collection information to report the memory usage information before and after the ANR occurs.
- Check kernerl-related logs to see if there are insufficient system-related resources such as lomemorykiller iowait

Part of the problem governance sharing

This section is to share some ANR problems that have been located using the ANR platform after the ANR platform is launched.

Calling the toString function call of the Kotlin Function object results in ANR

When investigating ANR logs, I found that there are multiple ANR logs, Kotlin's KfunctionImpl.toString is relatively time-consuming, the flame diagram case is as follows

So the Kotlin function was tested offline, and the scene was restored with reference to the business code in the flame diagram in the demo

It turned out that a simple toString call took 118ms in the demo. After a series of variable tests, it is found that after the introduction of kotlin and the introduction of the kotlin-reflection library, it takes a very serious time to toString function of the Kotlin Function object. Therefore, pay attention to the business code, do not directly or indirectly trigger the toString operation on the Kotlin function object.

The main thread executes View.getDrawingCache for bitmap conversion

From the queue, it is found that a large number of messages have been blocked, and in the previous Mesage processing record, it is found that there is a 900ms message processing.

Turning to the corresponding method trace, it is found that the stutter is mainly caused by the drawingCache obtained from the Webview in the main thread.

When testing offline, the time-consuming function may not be so time-consuming under normal circumstances, so it is not yet certain how the problem is caused, but from the perspective of code implementation, it should not be executed in the main thread. Time operation, so the operation can be moved into asynchronous thread execution.

ANR problems caused by SP waiting tasks

After the ANR platform went online, it was found that a large amount of data in the currently collected use cases was caused by the need to wait for the completion of all asynchronous SP write operations in the onPause, onStop and other related processes of the system components

Dewu’s current business has used MMKV to replace SP. The scenarios that mainly use SP are all connected third-party SDKs, and these SDKs generally do not need to write "asynchronous task data write completion" for this part of the data between component jumps. As a result, Dewu has performed a HOOK on the mechanism of this part of the system. After this part of the HOOK function is launched, the ANR rate has dropped significantly.

Follow-up optimization iteration direction

Judging from the current governance of the problems after the ANR platform is launched, most of the problems that are easy to solve are analyzed by analyzing the obviously time-consuming business code from the flame graph. In the actual online scene, there are still some In a specific case, the existing log information cannot clearly give the root cause of the occurrence of ANR. Later, according to each case, other aspects of monitoring capabilities will be strengthened to assist in locating the problem.

End side

Enhancement of other contextual information collection

For example, it is found online that some time-consuming functions of ANR problems are stuck in a simple File.exist function call. For such problems, it may be necessary to combine some monitoring and hooks of the native layer on the file io operation to assist in locating the problem.

Method Trace performance optimization on the end

At present, the method of DEWU APP on the terminal is to call getStackTrace at the timing of the child thread through the Thread object of the main thread to collect. Although the execution of this function is in an asynchronous thread, it is acquired in a non-current thread at the bottom of the virtual machine. When the thread stack, the corresponding thread will be suspended first, and then resume after the acquisition is complete. Therefore, if the frequency of grabbing the stack for the main thread is more frequent, it will inevitably affect the execution of the normal function of the main thread.

The requirement of our MethodTrace is to try our best to ensure that it only generates minimal runtime overhead to avoid affecting application performance for monitoring. Although the current sampling scheme has a minimum sampling interval of 50s, it has little impact on performance. In Facebook's profilo open source library, there is a set of asynchronous non-blocking access to the thread stack, but it is currently only open source compatible with Android 9 and below. 's watermelon video shared another 161460d1178595 Java Method Trace , but the solution ultimately requires a suspend thread, which also has a certain impact on performance, but it is also a reference direction for implementation.

Platform test

MessageTrace experience optimization

At present, Looper Message Trace function and Method Trace (function execution flame graph) function are two relatively independent functions. When we find that a certain message in the Message Trace function is time-consuming, under normal circumstances, we hope to be able to directly see the specific message. What has been done, so an optimization that can be done in the platform test is that when a Message is clicked, the function stack fragment of the corresponding time is automatically intercepted according to the start and end time of the processing of the Message, and the information is displayed.

Anr time distinction

The Message information and function sampling information collected on the end actually include anr, because there is a certain time difference from anr triggering to application capture to collecting information, so from the perspective of problem analysis, this time point needs to be done An obvious distinction.

Reference

Text/Knight-ZXW

Pay attention to the material technology, be the most fashionable technical person!