This article was first published on the WeChat public account "Shopee Technical Team".

Summary

Aiming at the "lifelong enemies" of client development, jank and ANR, this article analyzes the system message queue mechanism and the common causes of jank and ANR in depth, and introduces how the monitoring tool LooperMonitor, together with MDAP's intelligent aggregation and visual dashboards, provides business teams with more accurate and easier-to-use analysis capabilities.

This is the third article in the MDAP series. Previous articles:

Foreword

Jank and ANR: a topic that every client developer cares deeply about, and one that cannot be avoided.

Jank manifests as dropped frames, choppy scrolling, and slow response to the user's touch events. When a very severe freeze occurs, the app may even pop up an "Application Not Responding" dialog, telling the user that the current app is not responding.

Besides hurting the user experience, for an e-commerce platform jank and ANR directly affect transaction volume during peak ordering periods, causing real revenue loss. It is fair to say that jank and ANR are the lifelong enemies of client developers.

However, jank and ANR problems, especially ANR, are hard to analyze and solve. The main reason is that ANR is triggered when the main thread is so busy that key system messages cannot be executed in time. There are many possible reasons for the main thread to be busy, and the system's ANR thresholds are relatively long, 5s at minimum. During that window the device's CPU may be under pressure, or the main thread may execute several time-consuming messages; each of these scenarios may be one of "the few snowflakes that cause the avalanche".

The mainstream monitoring SDKs in the industry basically monitor the ANR signal and then grab the thread stack and the system ANR log at the scene of the ANR. Grabbing the stack at this point is an after-the-fact strategy: apart from a few very obvious causes, such as a thread deadlock or extremely time-consuming current business logic, it can do nothing about more obscure and complex causes. This "post-event strategy" often leads to a large number of "invalid stacks" in the reported ANR data, for example the classic MessageQueue.nativePollOnce:

Many stacks aggregate at MessageQueue.nativePollOnce. Is it because the main thread called nativePollOnce, got blocked in the JNI layer, and was never woken up? Relying only on such stack data, we cannot find a lead for analysis, and these ANR problems are hard to solve.

To solve these pain points, the ShopeeFood team and the Shopee Engineering Infrastructure team implemented a new monitoring tool, LooperMonitor, based on in-depth research into the system message queue mechanism and the common causes of jank and ANR. Combined with ANR reporting, and with the help of the intelligent aggregation and visual dashboards of the multi-dimension analysis platform MDAP (Multi-dimension-analysis-platform), it aims to provide business teams with more accurate and easier-to-use analysis capabilities.

1. How jank and ANR are generated

Before formally introducing the LooperMonitor scheme, it is worth asking: why is the ANR on-site stack captured by the traditional scheme inaccurate? To answer this question, we first need to understand how jank and ANR are generated.

We know that Android's application layer operates based on the event loop model of Looper+MessageQueue .

Looper is a message poller that keeps fetching messages in MessageQueue and executing them.

The messages here include UI drawing messages, scheduling messages of the four major components of the system, and messages constructed by the business itself through Handler .

The message responsible for UI drawing is doFrame. Choreographer applies for and listens to the hardware-layer vsync (vertical sync) signal, wraps the UI drawing tasks into a doFrame message, and posts that message to the main thread message queue to be executed at the appropriate frame drawing time.

Inside the doFrame callback there are several callback queues; the more important ones are input_queue, animation_queue, and traversal_queue, which handle touch events, animation events, and UI drawing (traversal) events respectively.

When a doFrame message is executed, the events in the above three queues are processed: the app responds to the user's touch, the UI advances by one frame, and one interaction is completed.

When a doFrame message is executed, it also notifies Choreographer to apply for the next vsync signal. In this way the UI drawing tasks are chained together, as shown below:

If a new doFrame message can be executed every vsync interval, the device is running at full frame.

However, a doFrame message may fail to be executed within a vsync interval: for example, the current doFrame itself takes longer than one vsync interval, so the next vsync is not applied for in time; or the main thread Looper is occupied by other time-consuming tasks before the doFrame message gets its turn.

Once doFrame cannot be executed in time, the user perceives a dropped frame. The larger the interval, the more obvious the dropped frames and the more janky the app feels.
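
As an aside, this chained doFrame scheduling is also what simple frame-drop detectors build on. Below is an illustrative sketch (not part of LooperMonitor) that uses the public Choreographer.FrameCallback to measure the interval between consecutive frames; it must run on a thread that owns a Looper, typically the main thread, and the 32ms threshold is just an example for a 60Hz display:

Choreographer.getInstance().postFrameCallback(new Choreographer.FrameCallback() {
    private long lastFrameTimeNanos;

    @Override
    public void doFrame(long frameTimeNanos) {
        if (lastFrameTimeNanos != 0) {
            long intervalMs = (frameTimeNanos - lastFrameTimeNanos) / 1_000_000;
            // at ~60Hz one vsync interval is ~16.7ms; a much larger gap means frames were dropped
            if (intervalMs > 32) {
                // record a jank frame here
            }
        }
        lastFrameTimeNanos = frameTimeNanos;
        Choreographer.getInstance().postFrameCallback(this); // keep observing the next frame
    }
});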

For ANR, the principle is similar. Take the system creating a Service as an example: before the target Service is started, ActiveServices.realStartServiceLocked() eventually calls the scheduleServiceTimeoutLocked() method. At this point the system plants a "bomb", and depending on whether the service is in the foreground or background the bomb detonates after a different delay; for a foreground service it is 20s.

 ActiveServices.java

void scheduleServiceTimeoutLocked(ProcessRecord proc) {
    ……
    Message msg = mAm.mHandler.obtainMessage(
            ActivityManagerService.SERVICE_TIMEOUT_MSG);
    msg.obj = proc;
    mAm.mHandler.sendMessageDelayed(msg,
            proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
}
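
For reference, the two timeout constants used above are defined in AOSP's ActiveServices roughly as follows (exact values and comments may differ between Android versions):

// ActiveServices.java (AOSP)
// How long the system waits for a foreground service to finish starting/executing.
static final int SERVICE_TIMEOUT = 20 * 1000;
// Background services get 10x the tolerance.
static final int SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10;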

It is represented by the diagram as follows:

Through a series of system calls, the inner class ActivityThread$H of ActivityThread eventually receives a CREATE_SERVICE message. When CREATE_SERVICE is executed, the Service instance is created and its onCreate() method is called; once onCreate() completes, ActiveServices is notified to remove this timeout "detonation" message.

 ActiveServices.java

private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
        boolean finishing) {
        ……
        mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
        ……
}

If, for whatever reason, this CREATE_SERVICE cannot be executed within 20s, the SERVICE_TIMEOUT_MSG message is executed and an ANR is generated.

The system's "tolerance" and ANR performance after detonation of information are also different for different events. The specific definitions are shown in the table:

In conclusion:

Jank occurs because a doFrame message cannot be executed within a vsync interval; ANR occurs because a key system message, or an Input event, cannot finish executing within a system-defined timeout threshold.

Essentially, they are two manifestations of the same problem, differing only in severity. When an ANR occurs on a device, a very severe freeze usually occurs as well.

2. Why is the stack inaccurate?

From the previous analysis, the cause of both jank and ANR is that certain types of messages are not executed within a specified threshold; this threshold can be long or short.

For jank monitoring, our threshold interval is not very long. It may vary with the business scenario and product complexity, but it is generally around a few hundred milliseconds. The monitoring threshold of ANR is longer: for a user-perceivable ANR, the shortest is 5s and the longest is 10s.

When jank or ANR is detected, it actually means that within the past few hundred milliseconds (or 5s, or 10s) there were abnormally time-consuming tasks in the main thread's message queue. Capturing the stack at this moment only means that we have captured the stack at the scene where the jank or ANR was detected.

For jank, a few hundred milliseconds is not long, so the stack captured at this moment has a high probability of hitting the abnormally time-consuming task, but it is only "probable".

For ANR, it is much harder to hit. After all, 5s or 10s have passed and the main thread has executed too many tasks; within those 5s or 10s, every time-consuming task may have prevented key system messages from being executed in time. If we grab the stack only when the ANR signal arrives, there is a high probability that those scenes will be missed.

The ANR signal is generated asynchronously. When the stack is collected upon receiving the ANR signal, besides possibly capturing the business stack of the message currently being executed, it may also capture the main thread MessageQueue in the middle of "fetching the next message". The method that fetches a message, MessageQueue.next(), is called extremely frequently, so the probability of hitting it is very high; this is why so many ANR reports end up on this stack.

Once we understand why the stack is inaccurate, the way to solve it becomes very clear.

3. MDAP-LooperMonitor

In fact, it can be seen from the above analysis that the cause of ANR and jank is an abnormal message in the main thread Looper.

At the same time, the longest threshold for a perceivable ANR with a prompt is 10s. If some mechanism can record the scheduling history of all messages on the main thread over the past 10s, keep the key data the business side needs, and handle performance and memory usage well, then when a problem occurs we can report the scheduling data within the monitoring threshold:

  • For example, when jank occurs, report the scheduling data of the past 500ms (which can be flexibly adjusted on the configuration platform);
  • When ANR occurs, report the past 10s of scheduling data plus the unscheduled Pending messages.

In this way, developers effectively gain the ability to "see through the past and the future", which should be enough to solve most jank and ANR problems.

MDAP's LooperMonitor is a monitoring solution based on this idea. Internally we have also done a lot of technical optimization to make the solution practical, such as finding better-performing monitoring entry points, using object reuse to reduce GC pressure, using a multi-instance single-thread model to reduce synchronization overhead, and using small-message aggregation and rolling eviction strategies to reduce memory usage. Details follow.

3.1 Monitoring entrance

Finding the scheduling entry of the main thread Looper is the core of this monitoring scheme. If we can insert the code we want at the starting and ending points of the scheduling, we can get the execution time of the current msg.

Only the key parts of the Looper source code are shown below:

 for (;;) {
            Message msg = queue.next(); // might block
            ...
            // This must be in a local variable, in case a UI event sets the logger
            final Printer logging = me.mLogging;
            if (logging != null) {
                logging.println(">>>>> Dispatching to " + msg.target + " " +
                        msg.callback + ": " + msg.what);
            }
            try {
                msg.target.dispatchMessage(msg);
                if (observer != null) {
                    observer.messageDispatched(token, msg);
                }
                dispatchEnd = needEndTime ? SystemClock.uptimeMillis() : 0;
            } catch (Exception exception) {
                throw exception;
            } finally {
            }
            
            if (logging != null) {
                logging.println("<<<<< Finished to " + msg.target + " " + msg.callback);
            } 
            ...
        }

At present, most APM SDKs set a Printer object and, in the println callback, check whether the string starts with >>>>> Dispatching to or <<<<< Finished to to determine the start and end of each message, then insert statistics code at those two points to obtain the message execution time.
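
A minimal sketch of this Printer-based approach is shown below; the 300ms threshold is a hypothetical value, and a production SDK would also have to preserve any Printer already set by other SDKs:

Looper.getMainLooper().setMessageLogging(new Printer() {
    private long dispatchStartMs;

    @Override
    public void println(String x) {
        if (x.startsWith(">>>>> Dispatching to ")) {
            dispatchStartMs = SystemClock.uptimeMillis();               // message starts executing
        } else if (x.startsWith("<<<<< Finished to ")) {
            long costMs = SystemClock.uptimeMillis() - dispatchStartMs; // message finished
            if (costMs > 300) {
                // record/report the slow message; target/callback/what are embedded in the string
            }
        }
    }
});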

If there is an unusually time-consuming message, in order to locate the abnormal point, it is usually necessary to obtain some additional data:

  1. The target of the current message: it comes from a Handler. If the business side handles messages in its own Handler's handleMessage, we need to know where this Handler is.
  2. The callback of the current message: when the business side posts a Runnable, we need the callback to locate the business code.
  3. The type of system message: if the message is handled by ActivityThread$H, we need to know which type of system message is currently being scheduled, so as to filter out the scheduling information of the four major components that can cause ANR.

The scheme of setting a Printer can meet requirements 1 and 2, but it cannot obtain the system message type, i.e. requirement 3 cannot be met.

For requirement 3, the scheduled message type can be obtained by hooking the system's ActivityThread$H and reading the message's what field.
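
A hedged sketch of this idea is shown below: read ActivityThread$H's integer constants by reflection and build a what-to-name mapping so that a dispatched system message can be classified. This assumes reflection on these hidden fields is still permitted on the target API level, and the numeric values differ across Android versions:

// Build a msg.what -> name mapping (e.g. CREATE_SERVICE) from ActivityThread$H via reflection.
static Map<Integer, String> loadSystemMessageNames() {
    Map<Integer, String> names = new HashMap<>();
    try {
        Class<?> h = Class.forName("android.app.ActivityThread$H");
        for (Field f : h.getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) && f.getType() == int.class) {
                f.setAccessible(true);
                names.put(f.getInt(null), f.getName()); // values are version-dependent
            }
        }
    } catch (Exception ignored) {
        // reflection may be restricted; leave system messages unclassified in that case
    }
    return names;
}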

Is there a better solution than setting Printer ?

While reading the source code, we found that for Android API ≥ 28, Looper adds an Observer interface whose corresponding methods are called back before and after a message is dispatched, with the message object passed in as a parameter. If we can replace it with our own implementation, it looks like a more efficient approach that does not generate a large number of temporary String objects.

See the source code below:

 for (;;) {
            Message msg = queue.next(); // might block
            ...
            // Make sure the observer won't change while processing a transaction.
            final Observer observer = sObserver;
            ...
            if (observer != null) {
                token = observer.messageDispatchStarting();
            }
            try {
                msg.target.dispatchMessage(msg);
                if (observer != null) {
                    observer.messageDispatched(token, msg);
                }
            } catch (Exception exception) {
                if (observer != null) {
                    observer.dispatchingThrewException(token, msg, exception);
                }
                throw exception;
            } finally {
               
            }
            ...
        }

However, this class is a hidden API: its class definition cannot be obtained directly, so the interface cannot be implemented on the app side; at the same time it is on the reflection blocklist, so an implementation cannot be set on Looper via reflection either.

For the reflection limitation, meta-reflection can be used to break through it; in short, the reflective call is made with the system's identity. From API ≥ 31 the system also restricts meta-reflection calls, and we have made compatibility handling for this so that the solution is still available on devices with API ≥ 31.

For the class definition that cannot be obtained, you can define your own Stub interface and rewrite the type information of the Stub interface at compile time, so that it passes both compile-time and runtime class verification.
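
Purely as an illustration (MDAP uses a compile-time Stub interface rather than a dynamic proxy), the sketch below combines the two ideas: a dynamic Proxy stands in for the hidden Looper$Observer interface, and meta-reflection (obtaining getDeclaredMethod itself via reflection so that the hidden-API check sees a system caller) is used to reach the hidden static Looper#setObserver method. Both Looper$Observer and setObserver are hidden AOSP APIs, so treat this as a best-effort sketch rather than stable code:

Class<?> observerClass = Class.forName("android.os.Looper$Observer");

// Stand-in implementation of the hidden Observer interface via a dynamic proxy.
Object observer = Proxy.newProxyInstance(
        observerClass.getClassLoader(), new Class<?>[]{observerClass},
        (proxy, method, args) -> {
            // messageDispatchStarting / messageDispatched / dispatchingThrewException land here;
            // for the latter two, the dispatched Message is available in args.
            return null; // token returned from messageDispatchStarting
        });

// Meta-reflection: fetch getDeclaredMethod reflectively so the lookup appears to come from the system.
Method getDeclaredMethod = Class.class.getDeclaredMethod(
        "getDeclaredMethod", String.class, Class[].class);
Method setObserver = (Method) getDeclaredMethod.invoke(
        Looper.class, "setObserver", new Class<?>[]{observerClass});
setObserver.setAccessible(true);
setObserver.invoke(null, observer); // hidden static method: Looper#setObserver(Observer)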

Comparing the above two implementations, it is not hard to see that with the Printer scheme commonly used in the industry, as Looper's for loop keeps running, strings are concatenated frequently and a large number of temporary objects are generated, which may increase the app's GC frequency and cause memory churn.

By bypassing the reflection limitation on newer versions and instrumenting at compile time, MDAP implements a better-performing LooperObserver scheme. For compatibility and availability on older versions, the ability to set a Printer is retained; when the LooperObserver scheme fails, it degrades to the Printer scheme.

3.2 Supporting multiple Loopers

For Looper monitoring, we hope to make it a general capability that can monitor the Looper of any thread, because the Looper model is not only used by the Android main thread but is also widely used in ReactNative, for example NativeModulesMessageQueueThread and JSMessageQueueThread. For apps whose main technology stack is RN, abnormal execution on these two threads' Loopers will also cause the app to freeze.

When monitoring multiple Loopers, the monitoring module is bound to be called from multiple threads. To reduce the context-switching overhead caused by frequent locking, we adopt a multi-instance + single-thread model: a corresponding LooperRecorder object is created for each monitored Looper to handle its data collection, and each LooperRecorder is looked up by the thread ID of its Looper, so the internals of each Recorder instance can stay lock-free.
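
A minimal sketch of this lookup is shown below; LooperRecorder is a hypothetical name standing in for MDAP's internal recorder type:

// One recorder per monitored Looper, keyed by its thread id; each recorder is only ever
// mutated from its own Looper thread, so its internals can remain lock-free.
final class LooperRecorders {
    private static final ConcurrentHashMap<Long, LooperRecorder> RECORDERS = new ConcurrentHashMap<>();

    static LooperRecorder of(Looper looper) {
        long tid = looper.getThread().getId();
        return RECORDERS.computeIfAbsent(tid, id -> new LooperRecorder(looper)); // hypothetical ctor
    }
}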

3.3 Information Collection

Now that we have the monitoring entry, how do we collect the appropriate data for different messages?

3.3.1 Message Classification

In fact, the business side cares about different things for messages in different time-cost ranges.

For example, in most scenarios messages are small messages that take less than 30ms to execute. Such messages have little impact on ANR, so to save memory we only record a total elapsed time, an end timestamp, and the data of the last small message, which is enough to string the message queue together.

For ordinary messages, record the target, callback, and what fields; combined with online symbol restoration, this is basically enough to locate the business logic.

For time-consuming messages, for example those over 200ms, knowing only the target and callback locates the position inside the Handler, but the specific logic executed there is still unknown; the business side would rather have a stack.

For system messages, in addition to the above basic data, it is also necessary to know which type of system messages are currently scheduled, so as to filter out the scheduling information of the four major components that will cause ANR.

In addition, when an ANR occurs, the Pending messages that have not yet been scheduled also need to be obtained. For Pending messages, besides recording the target, what, and callback fields, we also record how far past their target execution time they are, so that when the ANR occurs we can search the message queue and find the "bomb" that triggered it.
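
One illustrative way to obtain this, not necessarily MDAP's implementation, is the public Looper#dump API, which prints the pending MessageQueue together with each message's remaining delay, enough to spot the overdue "bomb":

// Dump the pending main-thread MessageQueue when an ANR is detected.
StringBuilder pending = new StringBuilder();
Looper.getMainLooper().dump(line -> pending.append(line).append('\n'), "pending> ");
// The output contains one line per queued Message with (roughly) its remaining "when" delta,
// callback, target and what; an overdue CREATE_SERVICE shows up with a negative delta.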

3.3.2 Stack Acquisition

For stack collection of time-consuming messages, there are mainly two common schemes:

Option 1: at compile time, insert statistics code at the start and end of each function.

The advantage of this scheme is that the exact time cost of each method is known; in addition, if a unique ID is generated for each function at compile time, the stack data can be stored in a very small footprint with the help of an ID-to-function-name mapping table.

However, this scheme has some impact on package size and runtime performance, and it requires the integrating party to adopt the instrumentation plug-in.

Option 2: use a sub-thread for timeout detection and stack capture.

The advantage of this solution is that it is simple to implement and does not depend on whether the integrating party adopts the plug-in.

The disadvantage is that the captured stack is larger than with the instrumentation scheme, and if the frequency is not controlled well it may have a greater impact on app performance.

For scheme 1, MDAP has implemented a high-performance online function tracing scheme.

Scheme 2 is the scheme LooperMonitor currently adopts. We achieve efficient asynchronous detection and collection through the Looper's non-blocking suspension capability, and filter out the IDLE state: if no message is currently being executed, the target thread's stack will not be collected, which prevents a large number of nativePollOnce stacks from being collected.

For time-consuming messages, the detection threshold grows linearly, and the sampling frequency is controlled as much as possible to reduce the performance impact on online apps.
Serialization of the collected stack data is deferred until the jank or ANR is actually reported.
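
A simplified sketch of this timeout-detection idea follows; the names are hypothetical, and LooperMonitor's real implementation additionally handles the linear threshold growth and frequency control mentioned above:

// A watchdog Handler on a background HandlerThread: when a message starts dispatching, a delayed
// check is scheduled; if the same message is still running when the check fires, the target
// thread's stack is captured. IDLE periods never schedule a capture.
class StackSampler {
    private final Handler watchdog;               // bound to a background HandlerThread's Looper
    private volatile long dispatchingMsgId = -1;  // -1 means the monitored Looper is IDLE

    StackSampler(Looper watchdogLooper) {
        this.watchdog = new Handler(watchdogLooper);
    }

    void onDispatchStart(final long msgId, final Thread target, long thresholdMs) {
        dispatchingMsgId = msgId;
        watchdog.postDelayed(() -> {
            if (dispatchingMsgId == msgId) {                    // still stuck in the same message
                StackTraceElement[] stack = target.getStackTrace();
                // attach `stack` to the current message's record here
            }
        }, thresholdMs);
    }

    void onDispatchEnd() {
        dispatchingMsgId = -1;                                  // back to IDLE, nothing to capture
    }
}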

In the future we will consider combining the two solutions: if the integrating party adopts the plug-in and enables the trace function, the simplified stack data will be obtained from the trace component first, further reducing the memory and performance cost of stack capture for the app.

3.4 Memory usage

App memory is a very precious resource. LooperMonitor must keep the app's memory usage controllable and as small as possible, so that it has real value when running online.

We use a range of means to achieve this.

3.4.1 Object pool reuse

Object pooling is well suited to scenarios where objects are frequently created and destroyed. It can greatly reduce the memory pressure caused by object creation and helps reduce memory churn. The Android system itself uses object pools in many places, for example when creating Message and MotionEvent (touch event) objects.
Inside Message there is an sPool; when Message.obtain() is called, it first tries to return an object from sPool. Because the entire Android upper layer is built on the Looper and Message message-loop mechanism, Message objects are created and destroyed very frequently, which makes them very suitable for object pooling.

LooperMonitor's message execution record object, MessageInfo, fits this scenario as well.

Testing shows that the MessageInfo queue mostly uses only about 100 records; by giving the MessageInfo pool a smaller size of 50, new object allocation can be reduced by about 50% in most scenarios.
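
A minimal sketch of such a pool, modeled on Message.sPool, is shown below; the field names such as occurTime are illustrative:

// Intrusive free-list pool for MessageInfo records, capped at 50 entries.
final class MessageInfo {
    private static final int MAX_POOL_SIZE = 50;
    private static MessageInfo sPool;
    private static int sPoolSize;

    MessageInfo next;     // free-list link, only meaningful while pooled
    long occurTime;       // when the message finished
    long wallDuration;
    long cpuDuration;
    String target;

    static synchronized MessageInfo obtain() {
        if (sPool != null) {
            MessageInfo m = sPool;
            sPool = m.next;
            m.next = null;
            sPoolSize--;
            return m;
        }
        return new MessageInfo();
    }

    static synchronized void recycle(MessageInfo m) {
        m.occurTime = 0; m.wallDuration = 0; m.cpuDuration = 0; m.target = null;
        if (sPoolSize < MAX_POOL_SIZE) {
            m.next = sPool;
            sPool = m;
            sPoolSize++;
        }
    }
}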

3.4.2 Aggregating small messages + rolling eviction of old values

Small messages that are not time-consuming have a limited impact on ANR. To avoid occupying precious memory, we aggregate consecutive small messages into one record, keeping the callback and target of the last small message, the total elapsed time, and the message occurTime, which is used to string the whole message Record queue together.

The maximum size of the message Record queue is currently set to 500, and old values are evicted by Size + occurTime: in short, records beyond the queue limit, and records whose occurrence time is older than a configured threshold such as 10s, are rolled out.
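
A sketch of this eviction rule, reusing the MessageInfo sketch above:

// Evict from the head while the queue is over its 500-record cap or the oldest record has fallen
// outside the retention window (e.g. 10s); evicted records go back to the object pool.
static void evictOldRecords(ArrayDeque<MessageInfo> records, long nowMs, long retentionMs) {
    while (records.size() > 500
            || (!records.isEmpty() && nowMs - records.peekFirst().occurTime > retentionMs)) {
        MessageInfo.recycle(records.pollFirst());
    }
}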

Calculation shows that even in the theoretical worst case (no message hits aggregation, so the queue is filled with small messages), a queue length of 500 can still record 8s of data.

500 is the theoretical worst case; measurements of the online app are generally between 100 and 200 Records.

Different messages have slightly different recorded data, and the memory usage is not exactly the same.

The overall memory overview is as follows:

When the online Trace function is enabled later, the memory footprint of fat messages will be further reduced.

For different levels of main-thread busyness, here are approximate scenario and memory usage statistics:

It can be seen that the solution keeps memory usage basically controllable.

3.5 Data dashboards

Looper monitoring data is reported to MDAP's analysis platform together with the ANR and jank monitoring data, as supplementary data for jank and ANR reports.

3.5.1 MDAP Looper dashboard

Looper's H5 analytics dashboard is as follows:

The color bar at the top is the timeline of historically scheduled messages. The length of each color block varies with how long the message took; from left to right runs from the oldest to the newest message data, and the rightmost record is the on-site data at the moment the JANK (ANR/jank) occurred. Different message types are distinguished by different colors, the timeline can be zoomed horizontally, and clicking each message shows its detailed data.

To make offline development and debugging easier for R&D students, we also provide a Debug Looper monitoring board that can monitor scheduling information in real time; click through to the details page to view the details of each scheduling record.

3.5.2 Real-time Looper monitoring floating window

Looper's real-time monitoring floating window uses bar graphs of different colors to represent the time-consuming status of each message. The floating window is updated in real time. Different message types have different colors.

  • The abscissa represents the elapsed time, in milliseconds.
  • The ordinate represents the message index, from top to bottom, from the newest to the oldest message.
  • Click the arrow in the upper right corner to capture the current snapshot and enter the details dashboard page to view it.

3.5.3 Debug Details Dashboard

The data on the detail dashboard is the same as on the MDAP platform dashboard; only the style differs slightly. It is mainly divided into two parts.

1) The upper part is "Pending message data"

The abscissa represents the index of the Pending message in MessageQueue , from left to right, from the head to the tail of the queue.

The ordinate represents the time point difference between the "target execution time point" of the Pending message and the "time point when the freeze/ANR occurs".

  • If it is a positive number, it may be a message put into the message queue by sendMessageAtTime() , and it has not reached the target execution time point. The value represents the difference from the target execution time point.
  • If it is a negative number, the message has already timed out, and the value indicates how long it has timed out. For example, for an ANR triggered by a CREATE_SERVICE message timeout, you will see a blue system message in the Pending message list whose Pending duration is shown as <-10000ms.

2) The lower part is "historical scheduling data"

  • The abscissa represents the message time, in milliseconds.
  • The ordinate represents the message index, and from top to bottom represents the latest to the oldest message, which is consistent with the Looper real-time monitoring floating window.
  • Use different color bars to distinguish different messages, click to view detailed data.

This concludes the basic introduction to the Looper monitoring tool. Below are some concrete cases to share.

4. Case sharing

Case 1

Here is an online System.log log:

 2021-11-12 11:26:19.996 1657-11869/? E/ActivityManager: ANR in com.***.***.***
    PID: 11093
    Reason: Input dispatching timed out (*****.**CheckInActivity_ (server) is not responding. Waited 8014ms for MotionEvent(action=DOWN))
    Load: 3.89 / 2.63 / 2.44
    CPU usage from 0ms to 6012ms later (2021-11-12 11:26:13.938 to 2021-11-12 11:26:19.949):
      53% 1657/system_server: 32% user + 21% kernel / faults: 14664 minor 589 major
      52% 11093/com.***.***.***: 37% user + 14% kernel / faults: 17832 minor 34 major
      19% 25179/cnss_diag: 11% user + 7.6% kernel / faults: 7 minor
      12% 853/surfaceflinger: 4.1% user + 8.4% kernel / faults: 503 minor 7 major
      10% 499/logd: 4.3% user + 6.4% kernel / faults: 22 minor
      0% 1227/media.swcodec: 0% user + 0% kernel / faults: 19869 minor 68 major
      10% 10580/com.miui.securitycenter: 5.8% user + 4.4% kernel / faults: 4022 minor 3 major
      9.2% 9835/adbd: 3.4% user + 5.8% kernel / faults: 14 minor
      7.9% 3307/com.android.systemui: 5.4% user + 2.4% kernel / faults: 1546 minor 2 major
      7.6% 23896/com.google.android.gms.persistent: 4.4% user + 3.1% kernel / faults: 3190 minor 29 major
      0% 1135/media.codec: 0% user + 0% kernel / faults: 8815 minor 66 major
      3.6% 27550/com.miui.powerkeeper: 2.4% user + 1.1% kernel / faults: 3265 minor 104 major
      3.1% 149/kswapd0: 0% user + 3.1% kernel
      2.8% 773/android.hardware.wifi@1.0-service: 2.1% user + 0.6% kernel / faults: 262 minor
      2.8% 3685/com.android.phone: 1.4% user + 1.3% kernel / faults: 2965 minor 202 major
      2.6% 10542/kworker/u16:8: 0% user + 2.6% kernel
      ...
      0.4% 90/system: 0% user + 0.4% kernel

Here is a brief explanation of how to read the System.log:

1) Look at the ANR type and confirm the scope of analysis

First find the cause of the ANR, mainly by looking at the Reason line. Different ANR types have different Reason contents: for example, Input dispatching timed out indicates the ANR was caused by an Input dispatch timeout, while executing service com.xx.xxx indicates it was caused by a Service timeout. The Reason line locates the ANR type and therefore the timeout threshold; for example, this ANR was caused by Input, whose timeout is 5s, which further narrows the time range for analysis.

2) Look at the CPU LOAD Average data

Load: 3.89 / 2.63 / 2.44 indicates the average number of processes that are using or waiting to use the CPU (in the run queue or ready queue) over the past 1, 5, and 15 minutes (for the precise definition, see Wikipedia).

Under common Linux practice, a single-core CPU usually takes 0.7 as the watershed, i.e. below 0.7 is an ideal load, and for a multi-core CPU the threshold is multiplied by the number of cores. However, modern Android devices generally apply intelligent scheduling strategies to processes, such as automatically freezing background processes or allocating them only a small amount of CPU time, so a high Load does not necessarily mean the device has hit a performance bottleneck. Some devices keep a Load average of 40+ yet still respond smoothly with low CPU usage. For this reason we cannot simply define an absolute judgment formula for Load average.

When analyzing an ANR problem, what matters more is whether the value shows an obvious worsening trend. For example, the device in this System.log uses an 8-core CPU (4 big cores and 4 little cores). Overall there is no obvious anomaly in the CPU load; there is a slight worsening trend, but it is not serious.

3) Look at the process CPU usage

"CPU usage from 0ms to 6012ms later": the word "later" means this CPU usage data was collected after the ANR occurred; besides "later" you may also see "before", which indicates the data was collected before the ANR occurred.

Note: the system does not guarantee exactly when the CPU usage is collected, so we need to read such logs accordingly. For example, if the CPU data was collected after the ANR, it is normal for system_server's CPU usage to be high, because system_server needs to perform a dump operation at that time.

In the log, the process information will be printed according to the CPU usage from high to low, including CPU usage, process ID, process name, user and kernel CPU ratio, and number of page faults.

If the kernel CPU ratio is high, it means the application side is making a large number of system calls; in most scenarios such system calls come from file I/O operations.

From the log, when this ANR occurred the application process accounted for 52% of the CPU, far higher than the other processes, so the ANR bottleneck should have nothing to do with other processes in the system. Meanwhile the user/kernel ratio shows no obvious anomaly, so the main process should not have an I/O exception.

4) Look at the number of page faults in the App process

The process's minor and major values are minor page faults and major page faults: a minor fault means the page is already in memory and only needs mapping, while a major fault means the page has to be read back from storage. Too many page faults mean memory is being paged frequently, which significantly slows down memory allocation and GC.

5) Look at the kswapd0 process

kswapd0 is the kernel process that manages virtual memory. When the device's RAM is nearly exhausted, kswapd0 is responsible for writing part of the app processes' memory out to the swap area. When this process's CPU occupancy is high, it usually indicates that the device is in a low-memory state.

Overall, the above System.log gives a rough picture of the device before and after the ANR: the CPU load did not surge, kswapd0 is at only 3.1% so there is no low-memory scenario, and the application process is at 52%, with no obvious I/O anomaly compared with the device's other processes.

At this point, we need to go one step further and use the Looper monitoring dashboard to view the Looper data reported with this ANR.

When the ANR occurred, the message currently being scheduled had already taken more than 8s. Looking at its Wall and CPU durations, the WallDuration is 8369ms but the CPU time is 0, which means the main thread got no time slice during this period and was probably blocked, either doing I/O or waiting for a lock.

Further locate the corresponding business logic through the stack, and find that there is indeed a deadlock.

Through the Looper dashboard it can be seen that the root cause of this ANR is that the message currently being scheduled is severely time-consuming; fixing that bug also fixes the ANR.

The next scenario is a bit more obscure:

Case 2

 ANR in **.***.*******
    PID: 5111
    Reason: executing service **.***.*******/com.***.*****.******Service
    Load: 32.79 / 32.69 / 33.14
    CPU usage from 0ms to 12126ms later (2021-12-12 20:22:05.251 to 2021-12-12 20:22:17.377):
      67% 5111/**.***.********: 52% user + 14% kernel / faults: 2186 minor 1 major
      55% 1309/system_server: 36% user + 19% kernel / faults: 2084 minor 3 major
      14% 24540/com.google.android.gms.persistent: 12% user + 2.4% kernel / faults: 29 minor 1 major
      13% 706/adbd: 2.5% user + 10% kernel
      12% 634/surfaceflinger: 8.2% user + 4.5% kernel / faults: 93 minor
      7.3% 2433/glgps4774: 4.4% user + 2.8% kernel / faults: 5 minor
      2.2% 636/powerlogd: 1.7% user + 0.4% kernel
      2.2% 1874/com.oppo.launcher: 1.4% user + 0.7% kernel / faults: 7267 minor
      0.3% 150/kswapd0: 0% user + 0.3% kernel
      0% 32322/com.google.android.gms.ui: 0% user + 0% kernel / faults: 1021 minor 22 major

First look at the System.log:
executing service **.***.*******/com.***.*****.******Service indicates that this ANR was caused by a Service timeout.

The CPU load is normal, the app's main process shows no obviously abnormal CPU usage, kswapd0 is at 0.3%, and the device is not in a low-memory scenario. Most likely this ANR was not caused by device load or insufficient RAM, so we focus on the Looper report of the app process.

Through the ANR dashboard we can see that the message currently being scheduled on the main thread spent a long time in registerSensorManager, a seriously time-consuming message; it is also the main-thread stack captured by the ANR monitoring SDK at report time. If the business side only saw this data, it might conclude that fixing this time-consuming scenario would solve the ANR. But is that the actual situation?

Looking at the Pending message board at the top, you can see that the Pending queue is already full and a large number of messages have not been executed in time.

Click the blue system message on the left to view the details:

A CREATE_SERVICE message has been pending for more than 10s; it is the "bomb" that detonated this ANR.

But the currently executing message only took 1.7s, which is certainly not the whole reason. Continue to look at the historical scheduling data.

Going back a further 10s, we find two seriously time-consuming messages in the history, with a total execution time of more than 6s. These two messages are the biggest cause of this ANR; only when the business side sees and fixes these issues together can this ANR be completely resolved.

Case 3

This data is quite representative: there are a lot of red (time-consuming) messages in the scheduling history of the main thread Looper.

The CPU time of each time-consuming message differs greatly from its Wall time: some messages have an actual CPU time of only 1ms but a WallDuration of 220ms, which means that during this period the main thread could hardly grab a CPU time slice and a lot of time was wasted on task switching. The guess is that the device's CPU is overloaded, or that other threads are preempting too much CPU from the main thread.

Continue to analyze the system.log log:

 12-03 19:01:47.821 1309-1328/? E/ActivityManager: ANR in com.***.******
    PID: 5323
    Reason: executing service *******/com.***.*****.service.**AppTrackingService
    Load: 46.81 / 40.09 / 41.53
    CPU usage from 0ms to 12496ms later (2021-12-03 19:01:35.306 to 2021-12-03 19:01:47.801):
      116% 5323/com.***.****** 64% user + 52% kernel / faults: 11566 minor 11 major
      45% 1309/system_server: 26% user + 18% kernel / faults: 1830 minor 4 major
      18% 634/surfaceflinger: 12% user + 5.8% kernel / faults: 49 minor
      10% 590/android.hardware.graphics.composer@2.2-service: 5% user + 5.6% kernel
      6.7% 2433/glgps4774: 4.3% user + 2.4% kernel / faults: 55 minor
      6.4% 630/audioserver: 5.7% user + 0.6% kernel / faults: 54 minor 1 major
      3% 1756/com.huawei.lbs: 2.1% user + 0.8% kernel / faults: 1 minor
      2.7% 4811/kworker/0:0: 0% user + 2.7% kernel
      2.6% 598/vendor.huawei.hardware.audio@4.0-service: 0.7% user + 1.9% kernel / faults: 4 minor
      1.9% 150/kswapd0: 0% user + 1.9% kernel
      1.2% 4542/kworker/u17:0: 0% user + 1.2% kernel
      1.1% 1849/com.android.phone: 0.6% user + 0.4% kernel / faults: 1 minor
      1.1% 30529/kworker/u16:0: 0% user + 1.1% kernel

It can be seen that the device's overall load is trending upward but is not severe (8-core CPU). However, the app process's CPU usage is as high as 116%, with the kernel part accounting for 52%, basically half of the usage, which indicates that the main process is making a large number of syscalls.

Check out the log file further:

 46% 5323/com.***.******: 61% user + 23% kernel / faults: 247 minor
        44% 5698/realm_thread-io-4: 4% user + 40% kernel
        11% 5776/OkHttp TaskRunn: 5.8% user + 5.8% kernel
        8.7% 6055/Thread-764: 5.8% user + 2.9% kernel
        5.8% 5345/Jit thread pool: 0% user + 5.8% kernel
        3.8% 5699/realm_thread-io-5: 5.8% user + 0% kernel
        2.9% 5645/realm_thread-io-1: 2.9% user + 0% kernel
        2.9% 5704/DefaultExecutor: 0% user + 2.9% kernel
        2.9% 5724/pool-35-thread-: 2.9% user + 0% kernel

It turns out that the realm_thread-io-4 thread occupies 44% of the CPU and OkHttp TaskRunn occupies 11%; together these two sub-threads account for 55% of the CPU usage, and realm_thread-io-4's kernel ratio is as high as 40%.

The case can basically be closed here, and further analysis of the business logic confirms the problem: an abnormal realm_thread-io-4 thread was doing a large amount of I/O and, for various reasons, its scheduling priority was relatively high, so it seized a large amount of the main thread's CPU time, which triggered this ANR.

5. Conclusion

The above cases are typical. As you may have noticed, these ANR problems, which would otherwise be quite complicated and obscure, become very intuitive with LooperMonitor's data collection and visualization capabilities. This is the practical value we hope LooperMonitor brings to business-side students.

Of course, the tool is only an aid; locating and solving problems ultimately still depends on human experience and analysis. Online ANR problems come in all shapes and sizes, and LooperMonitor's data alone cannot cover every scenario. MDAP is still intensively developing more capabilities to enrich jank and ANR monitoring and reporting and to provide more detailed and complete data for analyzing and locating problems. Existing capabilities include Java and Native stack dumps and thread deadlock monitoring; thread monitoring will also be launched soon to provide clearer live thread data when an ANR occurs.

Author of this article

Xiaqiu, Android development engineer, from the ShopeeFood team.

