
The crash rate is a key indicator of an app's quality. Effective crash governance reduces the user-experience damage, and even user churn, that application crashes cause. This article describes the work that took the crash rate of the Dewu Android app from 8 out of 1,000 down to 3 out of 10,000, walking through the evolution, by time and stage, along the following directions:

  • Crash prevention
  • Crash monitoring and alerting
  • Crash downgrade and fallback
  • Crash troubleshooting and locating
  • Crash fixing

Stage 1 (Stone Age)

Crash information collection, metric establishment, and a simple crash distribution process

  1. Collect crash information via the third-party platform Bugly and establish crash metrics.
  2. Check Bugly's crash issues daily and after each release, and identify the code author from the stack trace.
  3. Crash tables are compiled manually and sent to the group chat and by email. The general processing flow is as follows.

With the above approach, we had crash information and crash metrics to refer to.

What left a deep impression, however, is that working overtime to handle crashes after every grayscale release was the norm, and many crashes could not be investigated for lack of information. Every Sunday the crashes had to be reviewed and the table data compiled entry by entry. Both the quality of online grayscale versions and the accuracy of crash statistics were constantly put to the test.

Stage 2 (Bronze Age)

Grayscale fuse mechanism (crash alerting)

To guarantee the quality of grayscale versions, a grayscale circuit-breaker ("fuse") mechanism was added.

  1. Upgrade the app-update SDK.
  2. A hook into the crash SDK's monitoring reports the same crash data to SLS (Alibaba Cloud Log Service); when crashes reach a threshold, the grayscale release is halted automatically.
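The fuse itself can be sketched as a counter that trips at a threshold; in practice the trigger would likely be a crash *rate* measured against active grayscale devices, and all names here are illustrative rather than the actual Dewu SDK:

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Minimal sketch of a grayscale "fuse": count crash reports for the
 * current grayscale version and trip the fuse once a threshold is hit.
 * Class and method names are assumptions for illustration.
 */
public class GrayscaleFuse {
    private final int crashThreshold;
    private final AtomicInteger crashCount = new AtomicInteger();
    private volatile boolean tripped = false;

    public GrayscaleFuse(int crashThreshold) {
        this.crashThreshold = crashThreshold;
    }

    /** Called from the crash SDK hook each time a crash is reported. */
    public void onCrashReported() {
        if (crashCount.incrementAndGet() >= crashThreshold && !tripped) {
            tripped = true;
            stopGrayscaleRelease();
        }
    }

    public boolean isTripped() {
        return tripped;
    }

    /** Placeholder: in production this would call the release-platform API. */
    private void stopGrayscaleRelease() {
        System.out.println("Grayscale release halted: crash threshold reached");
    }
}
```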

Log-file SDK & logging conventions (crash troubleshooting)

To obtain more information at crash time, local log-file recording was added.

  1. Add a logging SDK.
  2. Actively upload log files in scenarios such as user feedback and crashes.
  3. Standardize log printing:
  • VERBOSE and DEBUG logs are printed only to the console (for debugging convenience)
  • INFO, WARN and ERROR logs are also written to the text log file (key operations, locating online problems)
  • BUG-level records are reported to both Bugly and Alibaba Cloud (important errors are reported promptly; in test environments an exception is thrown to expose the problem)
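The routing convention above can be sketched as follows; the enum and sink names are illustrative, not the actual log SDK's API:

```java
/**
 * Sketch of the logging convention: VERBOSE/DEBUG go only to the
 * console, INFO/WARN/ERROR are also written to a text log file, and
 * BUG-level records are additionally reported to the crash platforms.
 */
public class LogRouter {
    public enum Level { VERBOSE, DEBUG, INFO, WARN, ERROR, BUG }

    public interface Sink { void write(Level level, String msg); }

    private final Sink console, file, remote;

    public LogRouter(Sink console, Sink file, Sink remote) {
        this.console = console;
        this.file = file;
        this.remote = remote;
    }

    public void log(Level level, String msg) {
        console.write(level, msg);                      // everything hits the console
        if (level.ordinal() >= Level.INFO.ordinal()) {
            file.write(level, msg);                     // INFO and above persisted to file
        }
        if (level == Level.BUG) {
            remote.write(level, msg);                   // BUG reported to Bugly / Alibaba Cloud
        }
    }
}
```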

Automated analysis and statistics (crash statistics)

A crash-handling mechanism based on Bugly was established:

  1. Add new team members to the Bugly ID mapping table when they join.
  2. After each grayscale release, track Bugly's crash issues, record the data for the top problems, and dispatch them to the responsible owners. After handling a problem, the owner adds a short description and marks it.
  3. This dispatching could not be automated: someone had to inspect the cause and confirm who committed the code before changing the crash status to "in progress" and assigning it, although colleagues could also proactively claim crash issues in Bugly.
  4. After the bugs were dispatched, a script generated a crash-statistics table and filled it into a document.

Stage 3 (Iron Age)

App-review crash notification (crash alerting)

We found a crash scenario in which the crash occurred before the crash SDK had been initialized, so it never appeared on the crash platform, yet crash complaints showed up in app-market reviews. We therefore set up real-time monitoring of crash-related reviews in the four major app markets to detect such problems in time.

Configuration Center (crash fallback)

To reduce online problems, every new feature needs a configuration switch for gradual rollout and for rolling the feature back.

Simple process
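Such a rollout/rollback switch can be sketched as a gate that buckets devices by a stable ID; the names are assumptions, not the actual Dewu config SDK:

```java
/**
 * Illustrative sketch of a config-center feature gate: a feature is
 * enabled for a percentage of devices (bucketed by a stable device id)
 * and can be force-disabled remotely as a rollback.
 */
public class FeatureGate {
    private volatile int rolloutPercent;   // 0..100, pushed from the config center
    private volatile boolean killSwitch;   // remote rollback: disable for everyone

    public FeatureGate(int rolloutPercent) {
        this.rolloutPercent = rolloutPercent;
    }

    /** Applied when a new config is pushed from the server. */
    public void update(int rolloutPercent, boolean killSwitch) {
        this.rolloutPercent = rolloutPercent;
        this.killSwitch = killSwitch;
    }

    /** Stable bucketing: the same device always lands in the same bucket. */
    public boolean isEnabled(String deviceId) {
        if (killSwitch) return false;
        int bucket = Math.floorMod(deviceId.hashCode(), 100);
        return bucket < rolloutPercent;
    }
}
```

Bucketing on a stable device ID (rather than a random draw per call) keeps each user's experience consistent as the rollout percentage grows.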

Event-tracking reporting SDK (crash investigation)

Local log retrieval has poor real-time performance and a low success rate, and cannot support multi-dimensional statistics on problem data. The event-tracking reporting module complements it: it surfaces online problems promptly and produces problem metrics.

Supplementary reporting of exception data (crash investigation)

BPM SDK

On top of the event-tracking SDK, a business exception management SDK was built, adding business-module tagging and flow control.

Stack traces missing key context

  1. In the first stack, for example, we want to know the specific URL that caused the error.
  2. In the second, we need to know which Activity was being started, and with what parameters, to cause the crash.

With bytecode-manipulation frameworks such as AspectJ and ASM, it is easy to hook these call sites and capture the information we want.
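The collection side of such a hook can be sketched as a breadcrumb recorder that instrumented call sites report into; the instrumentation itself (the ASM/AspectJ rewriting) is not shown, and all names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of a breadcrumb recorder: bytecode instrumentation would
 * redirect calls like request(url) or startActivity(...) through this
 * class, so the last N operations can be attached to a crash report
 * as extra context. The recorder itself is plain Java.
 */
public class CrashBreadcrumbs {
    private static final int MAX = 20;
    private static final Deque<String> RECENT = new ArrayDeque<>();

    /** Called from an instrumented network layer before each request. */
    public static synchronized void onRequest(String url) {
        add("request: " + url);
    }

    /** Called from an instrumented startActivity with target and params. */
    public static synchronized void onStartActivity(String activity, String params) {
        add("startActivity: " + activity + " params=" + params);
    }

    private static void add(String entry) {
        if (RECENT.size() == MAX) RECENT.removeFirst();  // keep only the last N
        RECENT.addLast(entry);
    }

    /** Attached to the crash report when a crash occurs. */
    public static synchronized String dump() {
        return String.join("\n", RECENT);
    }
}
```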

ART OOM supplementary information reporting

  1. Process memory status: collect the current process's memory state and analyze /proc/self/smaps.
  2. Thread problems: collect the maximum thread count and use Thread.getAllStackTraces() to snapshot the current thread stacks, aggregated by thread name and stack; thread renaming is done via bytecode instrumentation at packaging time.
  3. FD problems: aggregate and output file-descriptor information.
  4. Image problems: hook image loading to capture the source URL and size of each loaded image; dump the memory usage of the BitmapPool; dump the URLs and sizes of the images on the current screen.
  5. hprof analysis: for some grayscale devices on Wi-Fi whose memory exceeds a threshold, upload hprof files for analysis.
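Of the checks above, the thread snapshot is plain JDK code. A minimal sketch, assuming a simple name-normalization rule (an assumption for illustration) to bucket pool threads together:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch of the thread-diagnostics step: snapshot all live threads via
 * Thread.getAllStackTraces() and aggregate counts by normalized thread
 * name, so a thread-leaking pool stands out in the crash report.
 */
public class ThreadDiagnostics {
    /** Collapse "pool-3-thread-17"-style names into one bucket. */
    static String normalize(String name) {
        return name.replaceAll("\\d+", "#");
    }

    public static Map<String, Integer> countByName() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(normalize(t.getName()), 1, Integer::sum);
        }
        return counts;
    }
}
```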

Third-party library problem handling (crash protection)

The du_aspect module downgrades crashes from certain third-party libraries to exception reports by wrapping the calls in try/catch via bytecode instrumentation.
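The guarded-call shape that such instrumentation rewrites call sites into can be sketched as follows (illustrative names, not the actual du_aspect API):

```java
import java.util.concurrent.Callable;

/**
 * Sketch of the downgrade idea: instead of letting a third-party call
 * crash the app, an instrumented call site invokes a guard that swallows
 * the Throwable and reports it as a non-fatal exception.
 */
public class SafeCall {
    public interface Reporter { void report(Throwable t); }

    /** What an instrumented call site is rewritten to invoke. */
    public static <T> T run(Callable<T> call, T fallback, Reporter reporter) {
        try {
            return call.call();
        } catch (Throwable t) {
            reporter.report(t);   // downgraded: reported instead of crashing
            return fallback;
        }
    }
}
```

The trade-off is deliberate: swallowing a Throwable is only safe where the fallback value leaves the app in a usable state, which is why the article applies it selectively to third-party libraries rather than everywhere.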

Upgrading to 64-bit .so libraries (crash governance)

After the supplementary OOM reporting went live, we found a particularly large proportion of crashes where the virtual memory size (vmsize) had hit its ceiling. After upgrading to 64-bit, that proportion dropped sharply.

Stage 4 (Steam Age)

Exception event-tracking platform (crash investigation)

Exception event tracking was moved onto a platform, and both the tracking process and its management were standardized.

Create the tracking point on the platform -> generate code -> add the tracking point manually in code -> after release, devices that hit it report the exception event.

File retrieval (crash troubleshooting)

The file-retrieval SDK supports issuing retrieval tasks for any file path inside the app. Usage scenarios:

  1. Supplementary retrieval of exception tombstone files.
  2. Retrieving the independent logs of business lines such as live streaming and audio/video.

Crash Security Center (crash data support, crash downgrade)

Crash downgrade fallback

Analysis tools (crash troubleshooting)

Stack de-obfuscation tool

  1. De-obfuscate the stack trace.
  2. For each stack frame, show the version of the component it belongs to, the component owner, and whether the code was recently modified, to speed up problem location.
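The class-name half of de-obfuscation is a small exercise in parsing the R8/ProGuard mapping file; real tooling (retrace) also maps methods and line numbers. A minimal sketch:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of stack de-obfuscation for class names: parse
 * R8/ProGuard mapping lines of the form "original.Name -> obf:" into an
 * obfuscated-to-original map, then look up class names from the stack.
 */
public class MappingDeobfuscator {
    private final Map<String, String> obfToOriginal = new HashMap<>();

    public void parseLine(String line) {
        // Class-mapping lines are not indented and end with ':'
        if (!line.startsWith(" ") && line.endsWith(":") && line.contains(" -> ")) {
            String[] parts = line.substring(0, line.length() - 1).split(" -> ");
            obfToOriginal.put(parts[1], parts[0]);
        }
    }

    public String deobfuscateClass(String obfName) {
        return obfToOriginal.getOrDefault(obfName, obfName);
    }
}
```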

Multi-dimensional log analysis

  • Quickly view exception logs by device ID and user ID
  • Developers can click through to the detailed log to troubleshoot problems
  • Testers can view the problem context and quickly share the problem stack with the responsible developer

Associated OSS files

  1. View the local log files after a crash
  2. View the hprof file for ART OOM scenarios

Self-built crash platform (crash alerting, crash dispatching, crash statistics, crash-process optimization)

Optimized the crash-handling workflow.

Text/zjy

Follow Dewu Tech, and let's walk toward the cloud of technology together.

