Author: Qin Jingchao (Non-Taiwan)
Safe Production
First, we define "client safety production" as a series of measures and activities taken to prevent experience-related accidents during the client's R&D life cycle.
To this end, the Taobao client has established a complete set of standardized processes and platforms for "R&D, construction, release, and emergency".
Figure 1 Safety production architecture diagram
Taobao client safety production is divided into four stages: the R&D period, the construction period, the release period, and the emergency state. Along the way, the platform accumulates development-process data and reproduces online and offline anomaly data in order to improve code quality and development capability. That data in turn feeds back into the platform, which improves the R&D environment for developers and safeguards the online user experience.
- R&D period: The stage in which developers implement requirements, usually within a single module. The focus is on the quality of the module itself. Here the safety production platform offers developers a one-stop workflow covering requirement management, code branch management, unit test management, Code Review, and test request and approval;
- Construction period: The stage in which developers submit code that has passed testing to the integration area for integration testing. Here the safety production platform relies on quality checkpoints, package size analysis, and product verification to ensure that integrated modules meet the integration standards (no duplicate resource files or code, no high-risk privacy API calls, no leftover debugging code, no unreasonable component exports, and so on), and uses up-front risk analysis to keep risky code from being integrated;
- Release period: The stage in which the complete app, configuration changes, and offline activities are released (gray release, then formal release) after passing testing. Attention at this stage goes to the app's stability data, performance data, business data, and public opinion data (an end-to-cloud monitoring solution) to ensure that the released app meets users' experience requirements;
- Emergency state: The first three stages exist mainly to keep risks from reaching production. The emergency state is entered when the online app can no longer support normal use. Fluctuations in core online indicators trigger timely alarms; an emergency team is then quickly assembled over DingTalk to handle the problem, analyze it and its underlying causes, and propose a fix. Online risk is resolved quickly through contingency plans, rollbacks, downgrades, and similar measures, so that the risk does not escalate into a failure.
In addition, to keep the online app highly available, the Taobao client architecture team has set up a "client-side daily guarantee" team. It is mainly responsible for version watch duty, big-promotion guarantee, emergency handling, and postmortem-driven optimization. The team continually uncovers problems in daily work, summarizes and reflects on them, optimizes processes, and improves the R&D environment, and it keeps automating, digitizing, and platformizing steps that still require manual intervention. This improves the R&D experience and efficiency across the "R&D, construction, release, and emergency" phases and frees people from repetitive, inefficient work. Process data is used to keep refining the safety production platform, which supports the healthy and sustainable development of upper-level businesses. Developers gain more time and energy for higher-level R&D projects and a stronger sense of accomplishment.
R&D period
The R&D period is mainly aimed at developers. The platform provides code coverage reporting for unit tests: unit test coverage for core middleware modules must reach 80% or above. Core changes require two code reviews (CR), one of them by the team lead (TL), and must be reviewed by someone at the technical-expert level or above.
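The rules above lend themselves to an automated check. The following Python sketch is only an illustration under stated assumptions: the `MergeRequest` record, its field names, and the violation messages are hypothetical and do not describe the actual platform, and the "technical expert or above" rule is left out because the article does not say how reviewer levels are represented.

```python
from dataclasses import dataclass, field

# From the text: core middleware modules need at least 80% unit test coverage.
COVERAGE_THRESHOLD = 0.80


@dataclass
class MergeRequest:
    """Hypothetical record of one change submitted during the R&D period."""
    is_core_module: bool
    line_coverage: float                            # 0.0 to 1.0
    reviewers: list = field(default_factory=list)   # names of people who approved
    team_lead: str = ""                             # TL responsible for the module


def check_rd_gate(mr: MergeRequest) -> list:
    """Return a list of violations; an empty list means the change may proceed."""
    problems = []
    if not mr.is_core_module:
        return problems  # the rules quoted in the text only concern core code
    if mr.line_coverage < COVERAGE_THRESHOLD:
        problems.append("coverage below 80%")
    if len(set(mr.reviewers)) < 2:
        problems.append("core change needs two reviewers")
    if mr.team_lead not in mr.reviewers:
        problems.append("core change needs TL review")
    return problems
```

A change to a core module with 85% coverage and approvals from two people, one of whom is the TL, yields an empty list; anything else yields one message per violated rule.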
Construction period
Quality gate
As Taobao's business keeps expanding, online problems occur from time to time. Taobao's scenarios are very rich, and local testing, code review, Monkey testing, and even gray release cannot cover all of them. Yet once a problem reaches production, its cost rises dramatically.
An analysis of historical online problems showed that a considerable share of them could have been found in advance through static code analysis, binary product analysis, and similar methods. So why not use technical means to catch such problems early and block them from slipping into production? For example:
- Same-named category methods conflict, causing many functional anomalies in Mobile Taobao;
- @{} literal initialization without a nil check;
- An OC block captures a C++ this pointer, causing a use-after-free;
- objc_msgSend is used to send alloc, causing memory leaks;
- Some system APIs are no longer safe, such as vm_remap;
- Component export;
- Thread leaks;
- ......
These issues can be hard to locate once they are online, but static analysis and similar techniques can stop them at the code stage. This is why the client quality gate platform was created. It integrates the existing problem-scanning tools and rules of the Mobile Taobao client and combines them with the open gate-access platform designed for DevSecOps, forming a complete capability for discovering, managing, and driving the resolution of problems offline. Integrating the gates reduces online problems.
The technical solution is based on Android Lint + SpotBugs + Clang Static Analyzer on Android and OCLint + Clang Static Analyzer on iOS, with improvements for platform-specific and Taobao-specific issues to meet Taobao's technical requirements (for example, scanning for uses of native thread interfaces to assist the migration of Taobao's overall thread architecture).
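To make the first example in the list concrete, here is a minimal Python sketch of a same-named category method check. It assumes that some earlier step has already extracted `(class, category, selector)` triples from the build products; that extraction step, the function name, and the sample names are hypothetical and are not the platform's actual rules.

```python
from collections import defaultdict


def find_category_conflicts(methods):
    """Find selectors that more than one category adds to the same class.

    methods: iterable of (class_name, category_name, selector) tuples,
             assumed to come from an earlier scan of the binary.
    Returns {(class_name, selector): [category names, sorted]} for conflicts only.
    """
    owners = defaultdict(set)
    for class_name, category_name, selector in methods:
        owners[(class_name, selector)].add(category_name)
    return {
        key: sorted(categories)
        for key, categories in owners.items()
        if len(categories) > 1
    }
```

When two categories define the same selector on one class, which implementation wins at runtime is not something the code can rely on, so a gate can reasonably treat every entry in the returned dictionary as a blocking finding.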
Package size
Package size is a very important performance indicator for the client. From the user's perspective, given the same features and services, users tend to choose the app with the smaller installation package: it can be downloaded and updated with less traffic, which raises download and upgrade rates to some extent. From a technical perspective, every file in the installation package is a candidate for slimming, and each file type needs its own slimming approach. Slimming is therefore a large project that spans many technical areas.
On Android, the techniques include image compression (TinyPNG, WebP), merging duplicate resources, shrinkResources strict mode, package splitting, ProGuard, ARSC slimming, removal of unused code (found through code instrumentation analysis), retiring unused business features, remotely delivered .so files, and detection of debug information in .so files.
On iOS, the techniques include image compression (TinyPNG, WebP), compiler optimization (not exporting symbols, -Oz, LTO), selectorRef-based removal of unused resources, elimination of duplicate code, retiring unused business features, shared dynamic library technology (<iOS9), and ld linker compression.
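One of the simpler techniques listed above, merging duplicate resources, starts with finding files whose contents are byte-for-byte identical. The Python sketch below illustrates that first step only; the function names and the in-memory `files` mapping are assumptions made for the example, and a real tool would walk the unpacked installation package instead.

```python
import hashlib
from collections import defaultdict


def find_duplicate_resources(files):
    """Group resource paths whose contents are identical.

    files: dict mapping path -> bytes content (assumed already read from the package).
    Returns a sorted list of groups, each a sorted list of paths with equal content.
    """
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    return sorted(sorted(paths) for paths in by_digest.values() if len(paths) > 1)


def wasted_bytes(files, groups):
    """Bytes that could be saved by keeping one copy per duplicate group."""
    return sum(len(files[group[0]]) * (len(group) - 1) for group in groups)
```

Hashing contents rather than comparing file names catches copies that were renamed when one module borrowed a resource from another.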
Product verification
Product verification is the last step before the app is released. It analyzes the specific differences between this release and the previous one to ensure that this release is correct. It mainly covers: core code change analysis (startup, CrashSDK, monitoring SDK), which draws attention to risks that core code changes may introduce; component export analysis, which prevents unnecessarily exported components from being attacked externally; and signature verification, which prevents signing errors from keeping the app off the app stores; among other checks.
Release period
Monitoring and alerting
Taobao attaches great importance to stability and performance for Mobile Taobao users. By defining high-availability metrics, governing stability and performance, and building automation and data platforms, it has developed a systematic solution and platform, EMAS-MOTU, that comprehensively improves the stability and performance of Mobile Taobao.
Change control
Mobile Taobao is an app that runs operational campaigns at high frequency. We found that some failures are strongly correlated with changes (campaign launches, configuration launches, and so on). Taobao therefore built a change management and control platform, whose main function is to monitor and analyze the correlation between changes and the abnormal data (crashes, ANRs, jank, leaks, etc.) found by the analysis platform.
The core idea of the change management platform is to generate a unique change ID for each change. When a change is delivered, its change ID is added to the set of change IDs attached to monitoring information. Every monitoring report then carries all of these change IDs, so the service can perform cluster analysis on them, use correlation to determine which change IDs are responsible for a given problem cluster, and put specific changes under observation or roll them back to prevent risk escalation.
Precisely tagging ("dyeing") gray-release data with the related changes makes it possible to control gray and full releases, block abnormal releases in time, and avoid release-induced failures. It is also a core means of improving release efficiency.
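The change ID idea can be illustrated with a small Python sketch. The article does not describe the actual correlation method, so the "lift" ratio used here (how much more common a change ID is inside a problem cluster than across all reports) is a stand-in chosen for the example, and the report structure and function name are hypothetical.

```python
from collections import Counter


def rank_suspect_changes(reports, cluster):
    """Rank change IDs by how over-represented they are in one problem cluster.

    reports: list of dicts {"cluster": str or None, "change_ids": set of str};
             "cluster" is the problem cluster a report was assigned to, if any.
    Returns [(change_id, lift)] sorted by lift, highest first.
    """
    in_cluster = [r for r in reports if r["cluster"] == cluster]
    if not in_cluster:
        return []
    all_counts = Counter(cid for r in reports for cid in r["change_ids"])
    cluster_counts = Counter(cid for r in in_cluster for cid in r["change_ids"])
    ranked = []
    for change_id, count in cluster_counts.items():
        share_in_cluster = count / len(in_cluster)
        share_overall = all_counts[change_id] / len(reports)
        ranked.append((change_id, share_in_cluster / share_overall))
    # Highest lift first; ties broken by change ID for a stable order.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

A change whose ID appears in every report of a crash cluster but in only half of all reports gets a lift of 2.0 and rises to the top of the list as a candidate for observation or rollback.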
Emergency
Locating
Tracing, metrics, logging
As client functionality keeps being refined, modular, cross-team collaborative development has become the standard way to build clients. It has greatly improved delivery and deployment efficiency, but behind such a modular, cross-team architecture the demands of operations, maintenance, and diagnosis have grown more and more complex. To keep up with the client's growing functional requirements, a standardized, user-centric DevSecOps diagnosis and analysis system is needed, covering tracing, metrics, and logging.
- Tracing: records information within the scope of a client behavior. It handles information scoped to a single request, and all data and metadata are bound to a single transaction in the system, for example the path from a user entering a page to the data being rendered. It is a powerful tool for troubleshooting client problems;
- Metrics: record aggregatable data. Each metric is atomic: a logical gauge, a counter, or a histogram over a time period. For example, the current depth of a queue can be defined as a gauge that is updated on every write or read; the number of incoming HTTP requests can be defined as a counter that simply accumulates; and request execution time can be defined as a histogram that is updated and summarized over a specified time slice. On the client, examples are the number of network requests initiated and the number of responses received correctly. Metrics are a powerful tool for measuring the macro-level quality of our business;
- Logging: records discrete events, for example application debugging or error information. It is the basis for diagnosing problems.
Based on the design principles of tracing, metrics, and logging, Taobao built the full-log platform TLog on top of OpenTracing. Through a funnel model and a comparison model, developers can compare data horizontally to quickly find their own performance bottlenecks, narrow the scope, and investigate more efficiently.
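The three kinds of signal differ mainly in what they keep. The toy Python class below is a sketch for illustration only: it is neither the OpenTracing API nor TLog, and every name in it (`Telemetry`, `incr`, `observe`, `span`, the bucket bounds) is an assumption made for the example.

```python
import time
from collections import Counter
from contextlib import contextmanager


class Telemetry:
    """Toy in-memory collector showing tracing, metrics, and logging side by side."""

    def __init__(self):
        self.counters = Counter()  # metrics: accumulating counters
        self.histograms = {}       # metrics: name -> list of bucket counts
        self.spans = []            # tracing: finished spans, bound to a trace ID
        self.logs = []             # logging: discrete events

    def incr(self, name, n=1):
        self.counters[name] += n

    def observe(self, name, value_ms, bounds=(100, 500, 1000)):
        # bounds are upper edges in milliseconds; the last bucket is the overflow.
        buckets = self.histograms.setdefault(name, [0] * (len(bounds) + 1))
        index = sum(1 for bound in bounds if value_ms > bound)
        buckets[index] += 1

    def log(self, level, message):
        self.logs.append((level, message))

    @contextmanager
    def span(self, name, trace_id):
        start = time.monotonic()
        try:
            yield
        finally:
            self.spans.append({
                "trace_id": trace_id,
                "name": name,
                "duration_s": time.monotonic() - start,
            })
```

Wrapping a page load in `span(...)` ties everything that happens inside it to one trace ID, while the counters and histograms only keep aggregates and the log list keeps individual events.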
Panoramic positioning
Mobile Taobao is an app that runs operational campaigns at high frequency. We found that some failures are strongly correlated with changes (campaign launches, configuration launches, and so on). Taobao therefore built a panoramic positioning platform, whose main function is to monitor and analyze online changes. Its core idea is that when an online risk occurs, the platform actively collects online changes and presents them along a timeline. Developers can use the platform to quickly review online changes, locate and check the correlation between the risk and each change, and observe or roll back potentially risky changes until the risk is eliminated.
Panoramic positioning and change control are similar in that both monitor and analyze online changes. The difference is that panoramic positioning mainly analyzes changes after a risk has occurred (the business keeps changing, and there is no guarantee that every change is connected to change control), whereas change control marks changes before a risk occurs and analyzes them after it occurs. Panoramic positioning is therefore a supplement to and a safety net for change control.
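The "present changes along a timeline" idea can be sketched in a few lines of Python. This is an illustration only: the function name, the two-hour default window, and the `(timestamp, description)` representation of a change are assumptions, not details of the actual platform.

```python
from datetime import datetime, timedelta


def changes_before_incident(changes, incident_time, window=timedelta(hours=2)):
    """Return the changes that went online shortly before an incident.

    changes: list of (timestamp, description) tuples collected from all sources.
    Only changes inside the window ending at incident_time are kept,
    ordered from most recent to oldest, since recent changes are checked first.
    """
    start = incident_time - window
    recent = [c for c in changes if start <= c[0] <= incident_time]
    return sorted(recent, key=lambda c: c[0], reverse=True)
```

Because the function only needs a timestamp and a description, it also works for changes that were never registered with change control, which is the gap panoramic positioning is meant to cover.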
Recovery
Recovery means "recovering from online problems in which code does not behave as the project expects, or is not robust enough and causes runtime crashes or exceptions in the app". It is an important way for Taobao to handle online emergencies. Taobao currently uses different recovery strategies for different online scenarios; the main ones are downgrades, contingency plans, and safe mode.
Downgrade
In Taobao's complex ecosystem, resource-intensive businesses pile up during big promotions, causing lag, a marked drop in experience, sharply rising memory usage, and a soaring crash rate. Since 2018 we have therefore been downgrading resource-heavy and high-risk businesses along multiple dimensions according to device performance. The goal is a tiered user experience: "the ultimate experience on high-end devices, smoothness first on low-end devices, and fast downgrade for urgent problems". (As time passes, older hardware and software can no longer support every new technology and new service, so trade-offs are needed to give each device the best experience it can deliver.)
Based on the differing hardware and software characteristics of devices, and using the Listwise-SmartScorer model, Mobile Taobao built a dynamic device scoring algorithm for the client that offers both three tiers (high, medium, low) and a 0-100 score (0 means the device outperforms 0% of mainstream devices on the market; 100 means it outperforms 100% of them).
What is device scoring? From a machine learning perspective, it could be a classification problem: devices fall into three categories (high, medium, low) and we need to tell them apart. It could also be a regression problem: each device has an absolute score and we need to fit it. Either way, the device score is defined as an absolute value. In practice, however, we tend to say that the "iPhone X" is faster than the "iPhone 8" rather than that the "iPhone X" scores 90 and the "iPhone 8" scores 70. In other words, device scores are relative, and because devices wear out over time, the scores are also dynamic. For these reasons we define device scoring as a ranking problem.
On top of device scoring, a unified downgrade platform was built. Businesses can choose which devices to launch on, either by tier ("high, medium, low") or by score ("0-100").
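Treating scoring as a ranking problem means the score falls out of a device's position in the ranking. The Python sketch below assumes that some ranking model has already ordered the devices from slowest to fastest; it does not implement Listwise-SmartScorer, and the tier cut-offs (40 and 80), the function names, and the device names in the usage note are hypothetical.

```python
def percentile_scores(ranked_devices):
    """Turn a ranking into 0-100 scores.

    ranked_devices: device names ordered from slowest to fastest
                    (assumed to be the output of a ranking model).
    The slowest device gets 0, the fastest gets 100.
    """
    n = len(ranked_devices)
    if n == 1:
        return {ranked_devices[0]: 100.0}
    return {name: 100.0 * i / (n - 1) for i, name in enumerate(ranked_devices)}


def tier(score, low_max=40.0, high_min=80.0):
    """Map a 0-100 score to a tier; the cut-offs are illustrative only."""
    if score >= high_min:
        return "high"
    if score > low_max:
        return "medium"
    return "low"


def should_enable(feature_min_score, device_score):
    """A business declares the minimum score it needs; weaker devices are downgraded."""
    return device_score >= feature_min_score
```

For five devices ranked `["old-phone", "mid-phone", "iPhone 8", "iPhone X", "flagship"]`, the scores are 0, 25, 50, 75, and 100, so a feature that declares a minimum score of 60 is enabled on the last two and downgraded on the rest. Re-running the ranking as devices age changes the scores without any business having to change its threshold.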
Plan
What is a plan? It is an emergency response prepared in advance, based on evaluation, analysis, or experience, for the category and degree of impact of potential emergencies. Plans reduce both predictable and unpredictable risk and limit losses. Most of the risks we face today come from changes of various kinds. In addition, Alibaba's most important scenario, the big promotion, puts pressure on both systems and business. Plans are divided into advance plans and emergency plans.
Advance plan: also known as a scheduled plan. It anticipates system and business conditions during a big promotion. To soften the impact of the promotion's business peak, actions such as cache warm-up, machine restarts, limited degradation, disk cleanup, or taking businesses offline are carried out ahead of time. These generally have no impact on the business, or an impact that is controllable.
Emergency plan: emergency measures for situations that may arise unexpectedly, such as abnormal traffic beyond expectations, dependencies timing out or becoming unavailable, or the system itself becoming unexpectedly unavailable. These measures generally hurt the business and may lead to customer complaints, financial losses, and the like, so they require the corresponding technical and business details to be spelled out and must be executed with care. For newly added changes, Code Review requires that the code be "gray-releasable, monitorable, and rollbackable" (the "three axes" of stability), which ensures that the code has an emergency plan and can be rolled back quickly when it poses a risk online.
The difference between a plan and a downgrade is that a plan applies the same strategy to all devices, whereas a downgrade applies different strategies to different devices.
Safe mode
Safe mode covers recovery from crashes in the startup phase, before the network can be used normally. Because the network is not yet initialized, configuration cannot be downloaded and take effect, so Mobile Taobao developed a safe mode: after the same crash occurs several times in a row, the app is forced into safe mode (a lightweight sub-process on Android, dedicated safe-mode code on iOS). Safe mode restores the program to its initial state by clearing previously persisted data, triggers a configuration download if necessary, and thereby provides the guarantees needed for the main process to start normally. For example, if corrupted persistent data makes the app crash on every launch, safe mode covers the blind spot before configuration can be delivered and applied, so that users are not left with uninstalling and reinstalling the app as their only fix.
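The trigger for safe mode can be sketched as a counter that survives process death. The Python example below is a simplification under stated assumptions: it counts consecutive launches that never finished starting up rather than matching the same crash signature, and the class name, the JSON state file, and the threshold of 3 are all hypothetical (the article gives no number).

```python
import json
import os


class StartupGuard:
    """Counts consecutive launches that never reached 'startup finished'."""

    def __init__(self, state_path, threshold=3):
        self.state_path = state_path
        self.threshold = threshold  # hypothetical value, not from the article

    def _read(self):
        try:
            with open(self.state_path) as f:
                return int(json.load(f).get("unfinished_launches", 0))
        except (OSError, ValueError, AttributeError, TypeError):
            return 0  # missing or corrupt state must never block startup

    def _write(self, count):
        with open(self.state_path, "w") as f:
            json.dump({"unfinished_launches": count}, f)

    def on_launch(self):
        """Call first thing at process start. Returns True if safe mode should run."""
        previous = self._read()
        self._write(previous + 1)  # assume this launch will fail until proven otherwise
        return previous >= self.threshold

    def on_startup_finished(self):
        """Call once the app has started normally; resets the counter."""
        self._write(0)
```

The counter is incremented before any other startup work and reset only after startup succeeds, so a crash anywhere in between is recorded without needing a crash handler. When `on_launch()` returns True, the app would clear persisted data and fetch configuration before starting the main process.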
Summary
Client safety production is a standardized, automated, data-driven platform built on Taobao's relatively complete underlying infrastructure. The technical points covered in this article draw on lessons from historical issues, summarized and implemented by many development, product, and operations practitioners inside and outside Alibaba. Thanks to everyone who has contributed their effort and dedication to safety production, and thanks also to the developers, product staff, and operations staff outside Alibaba whose work enriches client technology and improves the user experience. The article is mostly about how to think through and solve problems at different stages of an app's life cycle. I hope it is helpful to everyone.
References
[1] Mobile R&D platform EMAS: https://www.aliyun.com/product/emas
[2] OpenTracing: https://github.com/opentracing
[3] Safe mode: App startup protection practice: https://juejin.cn/post/6844903437948157959