
The crash rate is a key indicator of an app's quality. Effective crash governance reduces the user-experience damage, and even user churn, that application crashes cause. This article describes the work that took the crash rate of the Dewu Android app from 8 out of 1,000 down to 3 out of 10,000, walking through the evolution, by time and stage, along the following directions:

  • Crash prevention
  • Crash monitoring and alerting
  • Crash downgrade and fallback
  • Crash troubleshooting and locating
  • Crash fixing

Stage 1 (Stone Age)

Crash information collection, metric establishment, and a simple crash distribution process

  1. Collect crash information via the third-party platform Bugly and establish crash metrics.
  2. Check Bugly's crash issues daily and after each release, and identify the code author from the stack trace.
  3. Crash tables are compiled manually and sent to the group chat and by email. The general processing flow is as follows.

With the above approach, we had crash information and crash metrics to refer to.

What left a deep impression, however, is that working overtime to handle crashes after every grayscale release was the norm, and many crashes could not be investigated for lack of information. Every Sunday the crashes had to be reviewed and the table data compiled entry by entry. Both the quality of online grayscale versions and the accuracy of crash statistics were constantly put to the test.

Stage 2 (Bronze Age)

Grayscale fuse mechanism (crash alerting)

To guarantee the quality of grayscale versions, a grayscale circuit-breaker ("fuse") mechanism was added.

  1. Upgrade the app-update SDK.
  2. A hook into the crash SDK's monitoring reports the same crash data to SLS (Alibaba Cloud Log Service); when crashes reach a threshold, the grayscale release is halted automatically.
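The fuse itself can be sketched as a counter that trips at a threshold; in practice the trigger would likely be a crash *rate* measured against active grayscale devices, and all names here are illustrative rather than the actual Dewu SDK:

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Minimal sketch of a grayscale "fuse": count crash reports for the
 * current grayscale version and trip the fuse once a threshold is hit.
 * Class and method names are assumptions for illustration.
 */
public class GrayscaleFuse {
    private final int crashThreshold;
    private final AtomicInteger crashCount = new AtomicInteger();
    private volatile boolean tripped = false;

    public GrayscaleFuse(int crashThreshold) {
        this.crashThreshold = crashThreshold;
    }

    /** Called from the crash SDK hook each time a crash is reported. */
    public void onCrashReported() {
        if (crashCount.incrementAndGet() >= crashThreshold && !tripped) {
            tripped = true;
            stopGrayscaleRelease();
        }
    }

    public boolean isTripped() {
        return tripped;
    }

    /** Placeholder: in production this would call the release-platform API. */
    private void stopGrayscaleRelease() {
        System.out.println("Grayscale release halted: crash threshold reached");
    }
}
```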

Log-file SDK & logging conventions (crash troubleshooting)

To obtain more information at crash time, local log-file recording was added.

  1. Add a logging SDK.
  2. Actively upload log files in scenarios such as user feedback and crashes.
  3. Standardize log printing:
  • VERBOSE and DEBUG logs are printed only to the console (for debugging convenience)
  • INFO, WARN and ERROR logs are also written to the text log file (key operations, locating online problems)
  • BUG-level records are reported to both Bugly and Alibaba Cloud (important errors are reported promptly; in test environments an exception is thrown to expose the problem)
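The routing convention above can be sketched as follows; the enum and sink names are illustrative, not the actual log SDK's API:

```java
/**
 * Sketch of the logging convention: VERBOSE/DEBUG go only to the
 * console, INFO/WARN/ERROR are also written to a text log file, and
 * BUG-level records are additionally reported to the crash platforms.
 */
public class LogRouter {
    public enum Level { VERBOSE, DEBUG, INFO, WARN, ERROR, BUG }

    public interface Sink { void write(Level level, String msg); }

    private final Sink console, file, remote;

    public LogRouter(Sink console, Sink file, Sink remote) {
        this.console = console;
        this.file = file;
        this.remote = remote;
    }

    public void log(Level level, String msg) {
        console.write(level, msg);                      // everything hits the console
        if (level.ordinal() >= Level.INFO.ordinal()) {
            file.write(level, msg);                     // INFO and above persisted to file
        }
        if (level == Level.BUG) {
            remote.write(level, msg);                   // BUG reported to Bugly / Alibaba Cloud
        }
    }
}
```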

Automated analysis and statistics (crash statistics)

A crash-handling mechanism based on Bugly was established:

  1. Add new team members to the Bugly ID mapping table when they join.
  2. After each grayscale release, track Bugly's crash issues, record the data for the top problems, and dispatch them to the responsible owners. After handling a problem, the owner adds a short description and marks it.
  3. This dispatching could not be automated: someone had to inspect the cause and confirm who committed the code before changing the crash status to "in progress" and assigning it, although colleagues could also proactively claim crash issues in Bugly.
  4. After the bugs were dispatched, a script generated a crash-statistics table and filled it into a document.

Stage 3 (Iron Age)

App-review crash notification (crash alerting)

We found a crash scenario in which the crash occurred before the crash SDK had been initialized, so it never appeared on the crash platform, yet crash complaints showed up in app-market reviews. We therefore set up real-time monitoring of crash-related reviews in the four major app markets to detect such problems in time.

Configuration Center (crash fallback)

To reduce online problems, every new feature needs a configuration switch for gradual rollout and for rolling the feature back.

Simple process
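Such a rollout/rollback switch can be sketched as a gate that buckets devices by a stable ID; the names are assumptions, not the actual Dewu config SDK:

```java
/**
 * Illustrative sketch of a config-center feature gate: a feature is
 * enabled for a percentage of devices (bucketed by a stable device id)
 * and can be force-disabled remotely as a rollback.
 */
public class FeatureGate {
    private volatile int rolloutPercent;   // 0..100, pushed from the config center
    private volatile boolean killSwitch;   // remote rollback: disable for everyone

    public FeatureGate(int rolloutPercent) {
        this.rolloutPercent = rolloutPercent;
    }

    /** Applied when a new config is pushed from the server. */
    public void update(int rolloutPercent, boolean killSwitch) {
        this.rolloutPercent = rolloutPercent;
        this.killSwitch = killSwitch;
    }

    /** Stable bucketing: the same device always lands in the same bucket. */
    public boolean isEnabled(String deviceId) {
        if (killSwitch) return false;
        int bucket = Math.floorMod(deviceId.hashCode(), 100);
        return bucket < rolloutPercent;
    }
}
```

Bucketing on a stable device ID (rather than a random draw per call) keeps each user's experience consistent as the rollout percentage grows.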

Event-tracking reporting SDK (crash investigation)

Local log retrieval has poor real-time performance and a low success rate, and cannot support multi-dimensional statistics on problem data. The event-tracking reporting module complements it: it surfaces online problems promptly and produces problem metrics.

Supplementary reporting of exception data (crash investigation)

BPM SDK

On top of the event-tracking SDK, a business exception management SDK was built, adding business-module tagging and flow control.

Stack traces missing key context

  1. In the first stack, for example, we want to know the specific URL that caused the error.
  2. In the second, we need to know which Activity was being started, and with what parameters, to cause the crash.

With bytecode-manipulation frameworks such as AspectJ and ASM, it is easy to hook these call sites and capture the information we want.
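The collection side of such a hook can be sketched as a breadcrumb recorder that instrumented call sites report into; the instrumentation itself (the ASM/AspectJ rewriting) is not shown, and all names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of a breadcrumb recorder: bytecode instrumentation would
 * redirect calls like request(url) or startActivity(...) through this
 * class, so the last N operations can be attached to a crash report
 * as extra context. The recorder itself is plain Java.
 */
public class CrashBreadcrumbs {
    private static final int MAX = 20;
    private static final Deque<String> RECENT = new ArrayDeque<>();

    /** Called from an instrumented network layer before each request. */
    public static synchronized void onRequest(String url) {
        add("request: " + url);
    }

    /** Called from an instrumented startActivity with target and params. */
    public static synchronized void onStartActivity(String activity, String params) {
        add("startActivity: " + activity + " params=" + params);
    }

    private static void add(String entry) {
        if (RECENT.size() == MAX) RECENT.removeFirst();  // keep only the last N
        RECENT.addLast(entry);
    }

    /** Attached to the crash report when a crash occurs. */
    public static synchronized String dump() {
        return String.join("\n", RECENT);
    }
}
```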

ART OOM supplementary information reporting

  1. Process memory status: collect the current process's memory state and analyze /proc/self/smaps.
  2. Thread problems: collect the maximum thread count and use Thread.getAllStackTraces() to snapshot the current thread stacks, aggregated by thread name and stack; thread renaming is done via bytecode instrumentation at packaging time.
  3. FD problems: aggregate and output file-descriptor information.
  4. Image problems: hook image loading to capture the source URL and size of each loaded image; dump the memory usage of the BitmapPool; dump the URLs and sizes of the images on the current screen.
  5. hprof analysis: for some grayscale devices on Wi-Fi whose memory exceeds a threshold, upload hprof files for analysis.
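Of the checks above, the thread snapshot is plain JDK code. A minimal sketch, assuming a simple name-normalization rule (an assumption for illustration) to bucket pool threads together:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch of the thread-diagnostics step: snapshot all live threads via
 * Thread.getAllStackTraces() and aggregate counts by normalized thread
 * name, so a thread-leaking pool stands out in the crash report.
 */
public class ThreadDiagnostics {
    /** Collapse "pool-3-thread-17"-style names into one bucket. */
    static String normalize(String name) {
        return name.replaceAll("\\d+", "#");
    }

    public static Map<String, Integer> countByName() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(normalize(t.getName()), 1, Integer::sum);
        }
        return counts;
    }
}
```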

Third-party library problem handling (crash protection)

The du_aspect module downgrades crashes from certain third-party libraries to exception reports by wrapping the calls in try/catch via bytecode instrumentation.
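The guarded-call shape that such instrumentation rewrites call sites into can be sketched as follows (illustrative names, not the actual du_aspect API):

```java
import java.util.concurrent.Callable;

/**
 * Sketch of the downgrade idea: instead of letting a third-party call
 * crash the app, an instrumented call site invokes a guard that swallows
 * the Throwable and reports it as a non-fatal exception.
 */
public class SafeCall {
    public interface Reporter { void report(Throwable t); }

    /** What an instrumented call site is rewritten to invoke. */
    public static <T> T run(Callable<T> call, T fallback, Reporter reporter) {
        try {
            return call.call();
        } catch (Throwable t) {
            reporter.report(t);   // downgraded: reported instead of crashing
            return fallback;
        }
    }
}
```

The trade-off is deliberate: swallowing a Throwable is only safe where the fallback value leaves the app in a usable state, which is why the article applies it selectively to third-party libraries rather than everywhere.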

Upgrading to 64-bit .so libraries (crash governance)

After the supplementary OOM reporting went live, we found a particularly large proportion of crashes where the virtual memory size (vmsize) had hit its ceiling. After upgrading to 64-bit, that proportion dropped sharply.

Stage 4 (Steam Age)

Exception event-tracking platform (crash investigation)

Exception event tracking was moved onto a platform, and both the tracking process and its management were standardized.

Create the tracking point on the platform -> generate code -> add the tracking point manually in code -> after release, devices that hit it report the exception event.

File retrieval (crash troubleshooting)

The file-retrieval SDK supports issuing retrieval tasks for any file path inside the app. Usage scenarios:

  1. Supplementary retrieval of exception tombstone files.
  2. Retrieving the independent logs of business lines such as live streaming and audio/video.

Crash Security Center (crash data support, crash downgrade)

Crash downgrade fallback

Analysis tools (crash troubleshooting)

Stack de-obfuscation tool

  1. De-obfuscate the stack trace.
  2. For each stack frame, show the version of the component it belongs to, the component owner, and whether the code was recently modified, to speed up problem location.
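The class-name half of de-obfuscation is a small exercise in parsing the R8/ProGuard mapping file; real tooling (retrace) also maps methods and line numbers. A minimal sketch:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of stack de-obfuscation for class names: parse
 * R8/ProGuard mapping lines of the form "original.Name -> obf:" into an
 * obfuscated-to-original map, then look up class names from the stack.
 */
public class MappingDeobfuscator {
    private final Map<String, String> obfToOriginal = new HashMap<>();

    public void parseLine(String line) {
        // Class-mapping lines are not indented and end with ':'
        if (!line.startsWith(" ") && line.endsWith(":") && line.contains(" -> ")) {
            String[] parts = line.substring(0, line.length() - 1).split(" -> ");
            obfToOriginal.put(parts[1], parts[0]);
        }
    }

    public String deobfuscateClass(String obfName) {
        return obfToOriginal.getOrDefault(obfName, obfName);
    }
}
```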

Multi-dimensional log analysis

  • Quickly view exception logs by device ID and user ID
  • Developers can click through to the detailed log to troubleshoot problems
  • Testers can view the problem context and quickly share the problem stack with the responsible developer

Associated OSS files

  1. View the local log files after a crash
  2. View the hprof file for ART OOM scenarios

Self-built crash platform (crash alerting, crash dispatching, crash statistics, crash-process optimization)

Optimized the crash-handling workflow.

Text/zjy

Follow Dewu Tech, and let's walk toward the cloud of technology together.

