"The release is live!!!" The good news came from the back-end group. But half an hour later...
The bot in the mobile team's group chat sent an alert:

Our daily crash rate normally stays at around 0.06%, and it had suddenly jumped eightfold!!! Time to get up and hunt for the bug [tears].
We opened U-APM right away, checked the crash statistics for the past hour, and looked at the top-ranked bug:

It turned out to be a null pointer exception. Let’s take a look at the user behavior statistics before the crash:

The user behavior log quickly pointed us to the page that crashed. We called the API with Postman and inspected the response: a field of type String was coming back as null. The client had just switched to Gson and did not handle this case, so it crashed. (The bean declares a default value, but Gson populates the object through reflection, which bypasses Kotlin's default values.) We first asked the back-end team for an emergency fix so that production stopped erroring, and then made the client tolerant of null by registering a custom Gson TypeAdapter. With U-APM's monitoring alerts plus bug tracking, it took ten minutes from receiving the alert to fixing the bug, which kept the impact from spreading.
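The client-side fix can be sketched roughly as follows. This is a minimal sketch, not our production code: it is written in Java for brevity (the same adapter can be registered from Kotlin), the `UserProfile` bean and its `nickname` field are made up for illustration, and mapping null to an empty string is an assumed policy. It requires the Gson library on the classpath.

```java
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;

/** Reads a JSON null as "" so String fields never end up null after parsing. */
final class NullToEmptyStringAdapter extends TypeAdapter<String> {
    @Override
    public String read(JsonReader in) throws IOException {
        if (in.peek() == JsonToken.NULL) {
            in.nextNull();   // consume the null token before returning
            return "";
        }
        return in.nextString();
    }

    @Override
    public void write(JsonWriter out, String value) throws IOException {
        out.value(value);    // JsonWriter writes a JSON null for a null value
    }
}

/** Hypothetical bean; the field name is made up for illustration. */
final class UserProfile {
    String nickname = "guest";   // kept only when the key is absent from the JSON
}

final class SafeGson {
    /** Builds a Gson instance that applies the adapter to every String field. */
    static Gson create() {
        return new GsonBuilder()
                .registerTypeAdapter(String.class, new NullToEmptyStringAdapter())
                .create();
    }
}
```

With this in place, `SafeGson.create().fromJson("{\"nickname\":null}", UserProfile.class)` yields an object whose `nickname` is an empty string rather than null.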

Our use of U-APM has gone through the following stages:

Beginner stage

Case 1: The stack trace does not point to our own code, so we locate the bug by analyzing patterns in user behavior data

We once ran into a bug during development with the following stack trace:

Not a single line of the stack trace came from our own code. We googled the error log, and most results said it happens when too much data is passed in a Bundle, so we had no choice but to review all the Bundle-passing code in the project, to no avail! Later we discovered that Umeng U-APM lets you view the user behavior log leading up to a crash. Looking at it, we found that the errors all came from WebView-related pages. Because the project interacts with H5 in many places, we still could not tell which page was responsible.
However, after analyzing more user behavior logs, we found two patterns:

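1 There was a gap of several hours between the page the user was on when the crash happened and the page visited before it.

2 The longer this gap, the larger the value xxx in the crash log's "data parcel size xxx bytes".
From this we can draw two conclusions:
1 The user sent the app to the background after the previous session and crashed several hours later when bringing it back to the foreground. (Nobody stays on one page for hours.)
2 Something must be accumulating while the app is in the background, making the parcel size keep growing. That something could be polling.
With these two conclusions in mind, we checked the H5 code and found that there was indeed a polling request: the H5 page called into native code, which made the network request with OkHttp. The results of requests made while the app was in the background were probably being held in a parcel, and when the app returned to the foreground the parcel size exceeded what a Bundle can carry, which caused the crash. We changed the code to stop polling once the app goes to the background, and the bug disappeared in the new version [success].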
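The idea behind the fix, polling only while the app is in the foreground, can be sketched as below. This is a simplified illustration in plain Java, not our actual code: the `ForegroundPoller` class and its `onForeground`/`onBackground` methods are hypothetical names, and wiring them to the real Android lifecycle callbacks is left out.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Runs a task periodically, but only while the app is in the foreground. */
final class ForegroundPoller {
    // Daemon thread so an idle poller never keeps the process alive.
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "foreground-poller");
                t.setDaemon(true);
                return t;
            });
    private final Runnable task;
    private final long periodMillis;
    private ScheduledFuture<?> handle;   // null while polling is stopped

    ForegroundPoller(Runnable task, long periodMillis) {
        this.task = task;
        this.periodMillis = periodMillis;
    }

    /** Call when the app enters the foreground (hypothetical lifecycle hook). */
    synchronized void onForeground() {
        if (handle == null) {
            handle = executor.scheduleAtFixedRate(
                    task, 0, periodMillis, TimeUnit.MILLISECONDS);
        }
    }

    /** Call when the app goes to the background so no more results pile up. */
    synchronized void onBackground() {
        if (handle != null) {
            handle.cancel(false);   // let an in-flight run finish, schedule no more
            handle = null;
        }
    }

    synchronized boolean isPolling() {
        return handle != null;
    }

    /** Releases the worker thread when the poller is no longer needed. */
    void shutdown() {
        executor.shutdownNow();
    }
}
```

Both methods are idempotent, so calling `onBackground()` twice in a row, or `onForeground()` on an already running poller, does no harm.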

Case 2: Locating bugs on non-mainstream device models with cloud real devices
During a routine check of the bugs collected on Umeng, we found an unusual bug that only appeared on vivo phones running older system versions:

Most of our test devices were bought within the past two years, so where were we supposed to find a phone like that [bursting into tears]? Then we noticed that Umeng U-APM offers cloud real devices. A search turned up a vivo phone running a 6.0 system. We installed the package and tested it, and it crashed as expected.

After googling, we found that others online had run into a similar problem: problem link. We then contacted vivo technical support and learned that it was a bug in the vivo system. We applied the fix vivo suggested, verified it on the cloud real device, and shipped it!

Advanced use

1 Add custom fields
Through custom fields we record the user id, the user's more detailed actions, the user's network requests, and other information leading up to the crash, and use it to help locate the problem. Below is a screenshot of the custom fields we recorded for one bug:
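To illustrate the kind of information worth collecting, here is a small sketch of a breadcrumb buffer that keeps the most recent user actions and renders them as one custom field string. It is independent of any SDK: the `Breadcrumbs` class, the field format, and the sample action names are all assumptions, and handing the resulting string to U-APM is left to whatever custom-field call the SDK provides.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Keeps the last N user actions so they can be attached to a crash report. */
final class Breadcrumbs {
    private final int capacity;
    private final Deque<String> entries = new ArrayDeque<>();

    Breadcrumbs(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Records one action, dropping the oldest entry once the buffer is full. */
    synchronized void record(String action) {
        if (entries.size() == capacity) {
            entries.removeFirst();
        }
        entries.addLast(action);
    }

    /** Renders the field text: user id first, then actions from oldest to newest. */
    synchronized String toCustomField(String userId) {
        return "userId=" + userId + "; actions=" + String.join(" > ", entries);
    }
}
```

Bounding the buffer keeps the field small, which matters because it is built on every report and travels with the crash log.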

2 Set up monitoring alerts
We use U-APM's monitoring feature to create monitoring and alert rules, and use a WebHook to push alerts to our WeCom (enterprise WeChat) group promptly (such as the notification at the beginning of this article).
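For readers who want to reason about such a rule outside the console, the sketch below shows the arithmetic of a "crash rate is N times the baseline" check and the JSON body of a text message for a WeCom group bot. It only illustrates the idea: in our setup the rule lives in U-APM itself, the `CrashRateAlert` class and its method names are hypothetical, and rates are given in percent.

```java
/** Illustrative alert rule plus a WeCom text-message body. */
final class CrashRateAlert {
    /** True when the current rate is at least `factor` times the baseline. */
    static boolean shouldAlert(double currentPercent, double baselinePercent, double factor) {
        return baselinePercent > 0 && currentPercent >= baselinePercent * factor;
    }

    /** Builds a text-message body in the shape WeCom group bots accept. */
    static String weComTextPayload(String content) {
        // Minimal escaping for the characters that would break the JSON string.
        String escaped = content
                .replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n");
        return "{\"msgtype\":\"text\",\"text\":{\"content\":\"" + escaped + "\"}}";
    }
}
```

With a baseline of 0.06% and a factor of 8, the rule fires once the crash rate reaches 0.48%.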

3 Add custom exceptions
Through custom exceptions we report caught exceptions and non-crash anomalies that affect usability. Although this kind of anomaly could simply be reported to Umeng as an event, reporting it as an exception means it can trigger alert rules (alerts can monitor custom exceptions). The picture below shows an anomaly we monitor that is not a crash but is serious for us:
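The pattern of turning a caught failure into a reported custom exception might look like the sketch below. The `ExceptionSink` interface is a hypothetical stand-in for whatever custom-exception call the APM SDK exposes, and `BusinessAnomaly` and the "blank payment page" scenario in the usage note are invented for illustration.

```java
/** Hypothetical stand-in for the SDK call that uploads a custom exception. */
interface ExceptionSink {
    void report(Throwable error, String type);
}

/** A non-crash problem that is still serious enough to deserve an alert. */
final class BusinessAnomaly extends RuntimeException {
    BusinessAnomaly(String message) {
        super(message);
    }
}

final class Guarded {
    /**
     * Runs the action. A runtime failure is reported under the given type
     * instead of crashing the app; the return value tells the caller whether
     * the action succeeded.
     */
    static boolean run(Runnable action, ExceptionSink sink, String type) {
        try {
            action.run();
            return true;
        } catch (RuntimeException e) {
            sink.report(e, type);
            return false;
        }
    }
}
```

A call such as `Guarded.run(() -> { throw new BusinessAnomaly("blank payment page"); }, sink, "Payment")` returns false and hands the exception to the sink, where a type like "Payment" gives alert rules something to count.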

Future plans

U-APM + ARMS + EMAS hotfix

Since our back end also uses Alibaba Cloud's ARMS monitoring, we plan to link front-end and back-end monitoring when locating bugs. For example, we can take the time and user id from a front-end error log (viewed manually or fetched through the U-APM OpenAPI) and analyze them together with the back-end ARMS logs. For a bug like the one at the beginning of this article, that would let us find the faulty spot in the back-end response sooner and locate the bug faster.
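The correlation step itself is simple once both sides are available as data. The following sketch filters back-end log entries by user id and a time window ending at the crash. Everything here is assumed for illustration: the `BackendLog` shape does not reflect the real ARMS record format, and fetching the data from U-APM or ARMS is out of scope.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical shape of one back-end log entry; real ARMS records will differ. */
final class BackendLog {
    final String userId;
    final long timestampMillis;
    final String message;

    BackendLog(String userId, long timestampMillis, String message) {
        this.userId = userId;
        this.timestampMillis = timestampMillis;
        this.message = message;
    }
}

final class LogCorrelator {
    /** Returns the user's back-end entries logged at most windowMillis before the crash. */
    static List<BackendLog> before(List<BackendLog> logs, String userId,
                                   long crashTimeMillis, long windowMillis) {
        List<BackendLog> hits = new ArrayList<>();
        for (BackendLog log : logs) {
            long age = crashTimeMillis - log.timestampMillis;
            if (log.userId.equals(userId) && age >= 0 && age <= windowMillis) {
                hits.add(log);
            }
        }
        return hits;
    }
}
```

In practice the window would need some slack for clock differences between the device and the servers.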
At the same time, we plan to integrate the mobile hotfix feature of Alibaba Cloud EMAS, so that fixes can be delivered directly without publishing a new release.

A small tip:

Because Umeng's automatic page statistics hook iOS lifecycle methods, developers who use Aspects for AOP and also need to hook lifecycle methods must write their hook code after Umeng is initialized. Otherwise the multiple hooks conflict and the following exception is reported:

As long as Umeng is initialized first, everything works.
Finally, thanks to Umeng + U-APM for providing such a practical and easy-to-use platform, which helps us ship stable, high-performance applications.

Author: Guo Donghao

