Full-link observability architecture and key technologies for the mobile domain

Author: Liu Changqing (holding water)

This article introduces the full-link technology concept that the Mobile Taobao team originally brought to the mobile domain. The article is about 12,000 words and takes roughly 15 minutes to read. Readers will come away with ideas for optimizing experience in mobile technology, along with what the team has accumulated and practiced in developing software-defined experience.

Challenges in the Existing App Architecture

Since the All in Wireless initiative began in 2013, Alibaba Group's mobile technology has developed for more than ten years and has gone through several key stages:

In the first stage, to solve the pain points of large-scale concurrent business development, the Atlas architecture was defined (a containerized framework that supports component decoupling, dynamic delivery, and more);
In the second stage, ACCS (Taobao's wireless full-duplex, low-latency, high-security channel service) was built, adding long-connection, duplex, encrypted networking to complete the end-to-end mobile service capability and catch up with the industry;
In the third stage, dynamic R&D frameworks such as Weex and mini programs were built around business characteristics, and mobile technology entered a period of dynamic, cross-platform development.

In the mid-to-late stage, the Alibaba Mobile team mechanism connected the BUs and built capabilities jointly. Since then, the mobile infrastructure has basically taken shape, and each field has accumulated several sets of capabilities that can be reused. The App has settled into a three-layer structure: upper-level business, an intermediate R&D framework or container layer, and basic capabilities. As the owner of the wireless client infrastructure, our team used to focus on building the group's basic mobile capabilities. In recent years, the team has focused on performance optimization in Taobao business scenarios and, through experience-optimization projects, has analyzed the App architecture and related call links horizontally. From this work, we find that the group's Apps generally share the following problems:

(Figure 1 Taobao App Architecture Challenge)

Inefficient operation and maintenance troubleshooting: First, in the monitoring phase, most problems are either not monitored or the reported information cannot support effective analysis, so troubleshooting falls back on logs. Second, logs are often missing: they are not uploaded automatically when an exception occurs and must be retrieved manually, which fails if the user is offline. Even when logs are retrieved, they are often hard to understand. For links that involve the server, the server side has its own problem: Eagle Eye logs are kept for only 5 minutes. After one such round, half a day has usually passed;
Incomplete end-to-end tracing: In a complete business link, traffic passes through multiple layers end to end. Taking an order as an example, a request triggered on the client is processed by several client modules, crosses an unstable mobile network, and triggers N back-end application calls once it reaches the server. Which of these calls affects the order transaction? Which steps slow down the whole process? When a request has not returned, is it a server-side problem or a network problem? If the full-link performance of each call is not clearly defined, the problems at each layer are not fully exposed. In addition, calls on the client are naturally asynchronous, which makes measuring each stage and stitching the whole link together a major challenge. At present, the client layers have no unified calling specification and no topology, so the call link cannot be reconstructed and end-to-end tracing is not possible;
Lack of a unified measurement standard for optimization: In the past, each R&D framework defined its performance measurements in its own closed loop. Whether client-native or cross-platform, each collected a general technical measurement from a technical perspective. This naturally leads to large differences in how each business implements and reports performance, and the measurements are generally not close to what users actually perceive, so online data struggles to reflect the real situation and trends. Over the long term, Taobao's experience has kept degrading, and each year it has relied on campaign-style optimization rather than routine, continuous maintenance;
The cost of process-based enablement for mobile PaaS: After a large number of SDK components are delivered to the group's BUs and the basic capabilities are embedded in different App host environments, the problems above appear again. For engineers in each BU, the infrastructure is even more of a black box, so troubleshooting is harder when a problem involves infrastructure. There are also no tools for self-diagnosis, so when a problem occurs they can only come to us for consultation, and the volume of such requests is considerable.

The above are our thoughts, from the perspective of App architecture, on the current client's shortcomings in operation and maintenance troubleshooting, measurement and monitoring, and full-link optimization. They also set the direction of our follow-up work.

Observability System

The evolution from monitoring to observability

Observability is a set of ideas that imposes no specific requirements on technical implementation. The point is to adopt the idea and apply it to business iteration and problem insight. Traditional operation and maintenance may give us only top-level alarms and an overview of anomalies. When an error needs to be located more deeply, people are typically pulled into a chat group, the characteristics of the problem are searched for manually, and sometimes the developer of one module ends up analyzing the dependencies of every module. Handling a problem usually involves more than three roles (business, testing, development, architecture, platform, and so on).

Compared with traditional monitoring, observability links data together organically so that it forms better connections, helping us observe how the system runs and locate and solve problems quickly. "Monitoring tells us which parts of the system are working; observability tells us why a part is not working." Figure 2 below illustrates the relationship between the two. Monitoring focuses on presenting the macro dashboard, while observability includes the scope of traditional monitoring.

(Figure 2 The relationship between monitoring and observability)

As the figure shows, the core is to observe the output of each module, the key calls, the dependencies, and so on, and to judge the overall working status from these outputs. The industry usually summarizes these key points as Traces, Logs, and Metrics.

Observability key data

(Figure 3 Observability key data)

Combining the definitions of Traces, Logs, and Metrics with the current state of Taobao, I interpret them as follows:

Logs: Built on the existing TLOG (wireless end-to-end logging system) log channel, logs record events generated while the App is running and output produced during program execution. They can describe the running state of the system in detail, such as page jumps, request logs, and global CPU and memory usage. Most logs are not correlated with one another. With the introduction of structured call-chain logs, logs in call-chain scenarios can be converted into Traces once they are structured, which supports single-device troubleshooting;
Metrics: Aggregated values, usually with various dimensions and indicators, used for the macro dashboard. They lack the detail needed to locate a problem;
Traces: The most standardized form of call log. In addition to defining the parent-child relationship between calls (usually through TraceID and SpanID), a trace records details of the operation such as its service, method, attributes, status, and duration. Traces can replace some of the functions of Logs. In the long run, the Metrics of each module and method can be obtained by aggregating Traces, but the log volume is large and storage is costly.
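
The claim that Metrics can be derived by aggregating Traces can be illustrated with a minimal sketch. The Span fields, module names, and the aggregate function below are hypothetical and only mirror the concepts named above (TraceID, SpanID, status, duration); they are not the TLOG or Falco schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Span:
    trace_id: str       # shared by every span in one call chain
    span_id: str        # identifies this call within the chain
    module: str         # hypothetical module name, e.g. "mtop"
    duration_ms: float
    ok: bool


def aggregate(spans):
    """Roll individual spans up into per-module metrics."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.module].append(span)
    metrics = {}
    for module, items in grouped.items():
        metrics[module] = {
            "count": len(items),
            "avg_ms": mean(s.duration_ms for s in items),
            "error_rate": sum(not s.ok for s in items) / len(items),
        }
    return metrics


spans = [
    Span("t1", "0", "mtop", 120.0, True),
    Span("t1", "0.1", "image", 80.0, True),
    Span("t2", "0", "mtop", 280.0, False),
]
metrics = aggregate(spans)
print(metrics["mtop"])
```

The sketch also shows the trade-off mentioned above: the aggregated dictionary is small, while the list of spans it was computed from grows with every call.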

Full Link Observability Architecture

The observability concepts above have been put into practice on the back end, but the mobile domain has its own characteristics and status quo, which raise the following problems:

Calling specification: Unlike the cloud, the client side is fully asynchronous, with an extremely rich set of asynchronous APIs and no unified calling specification;
Multiple technical domains: There are many R&D frameworks whose capabilities are black boxes to the outside, and connecting them carries many hidden costs;
Device-cloud differences: The client side consists of a massive number of distributed devices, so the challenges of observability differ fundamentally from those on the server side. On the server, logs and metrics can be fully reported and handled by a single system. On the client the situation is very different, which is why the client-side event-tracking system and the logging system are separate. The client side must balance single-device troubleshooting across massive devices against defining indicator trends over big data;
Device-cloud correlation: End to end, the device and the cloud have always been disconnected. From the device side, how can we better perceive back-end status and correlate with it? For example, how can we keep extending serverRT (back-end request processing time) coverage from IDC (Internet Data Center) to CDN, and how can the back end be made aware of the device-side full-link identifier?

Therefore, we need to define the full link of the mobile technology domain around these problems and establish domain-level analysis capabilities and sound evaluation standards. Only then can we gain deeper insight into mobile problems and continue to serve the group's Apps and cross-domain issues well in troubleshooting and performance measurement.

(Figure 4 Definition of the full-link observable architecture)

Data layer: Defines the indicator specification and collection scheme, and reports data based on OpenTracing (a distributed tracing specification);
Domain layer: Evolves from problem discovery to problem location, builds a continuous performance optimization system, and accumulates technology upgrades;
Platform layer: Accumulates comparisons from the group and competitor perspectives, combines online and offline indicators, introduces the device manufacturers' perspective, and drives App performance improvement;
Business layer: Connects end to end from a full-link perspective. Besides client engineers, it can also serve cross-domain R&D personnel who work with different technology stacks.

Looking back, we set the goal of the full-link observability project as "build a full-link observable system, improve performance, drive business experience improvement, and improve problem location efficiency". The following chapters explain the practice at each layer.

Mobile OpenTracing observable architecture

Full link composition

(Figure 5 End-to-end situation, detailed scene layered diagram)

The existing end-to-end link is long, and the client side hosts a variety of R&D frameworks and capabilities. Although the back-end call link is clear, from a full-link perspective it is not connected to the client side. Take a user browsing the product detail flow as an example: opening the first screen triggers a particular call sequence across three modules, Ultron, MTOP (wireless gateway), and DX. Each module has its own processing steps, and each stage has its own duration and status (success, failure, and so on). When the user then scrolls, the modules are called in a different sequence. In different scenarios these elements can be combined arbitrarily, so the full link must be defined along several dimensions based on the user's actual scenario:

Scene definition: One user operation is a scene. A click and a scroll are separate scenes, and a scene can also be a combination of several single scenes;
Capability layering: Within a scene there are business, framework, container, and request calls, and each domain can be layered accordingly;
Stage definition: Each layer has its own stages. For example, the framework class has 4 local stages, and the request class can include the back-end server processing stage;
User journey: A journey consists of several scenes.

The full link decomposes one complex large call into a limited number of structured small calls, from which various cases can be derived:

A full link composed of "single scene + single stage";
A full link composed of "single scene + several layers + several stages";
A full link composed of "several scenes + several layers + several stages";
... ...

Falco - a model based on OpenTracing

To support the industry-standard Logs + Metrics + Tracing model, the full link adopts the OpenTracing distributed call specification and builds a second-level model on top of the client architecture described above (hereinafter referred to as Falco).

The OpenTracing specification is the basis of Falco's model and is not repeated here. For the complete specification, see the OpenTracing overview at https://opentracing.io/docs/overview/ . Falco defines the call-chain tracing model for the client-side domain. Its main table structure is as follows:

(Figure 6 Falco Data Sheet Model)

Span public header: The yellow part, corresponding to the basic Span properties in the OpenTracing specification;
scene: Corresponds to the baggage part of OpenTracing. It is passed down transparently from the root span and stores the business scene. The naming rule is "business ID_behavior". For example, the first screen of the detail page is ProductDetail_FirstScreen, and a detail refresh is ProductDetail_Refresh;
layer: Corresponds to the Tags part of OpenTracing and defines the concept of a layer. There are currently three: business layer, container layer, and capability layer. A module that processes business logic belongs to the business layer, named business. A module that provides a view container, such as DX or Weex, belongs to the container & framework layer, named frameworkContainer. A module that provides only an atomic capability, such as mtop or the image library, belongs to the capability layer, named ability. Layers make it possible to compare the performance of different modules that sit in the same layer and provide the same capability;
stages: Corresponds to the Tags part of OpenTracing and indicates the stages contained in one module call. Each layer is divided into key stages based on its domain model, so that different modules in the same layer have a consistent basis for comparison. For example, DX and TNode can be compared on preprocessing time, parsing time, and rendering time. The preprocessing stage, for instance, is named preProcessStart; stage names can also be customized;
module: Corresponds to the Tags part of OpenTracing and is more of a logical module, such as DX, mtop, the image library, or the network library;
Logs: Corresponds to the Logs part of OpenTracing. Logs are recorded only to TLog and are not output to UT tracking points.
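
To make the mapping between the table fields and OpenTracing concepts concrete, here is a minimal sketch of what one such record could look like. The class name FalcoSpan, the field layout, and the child helper are assumptions for illustration; only the field names scene, layer, stages, module, and logs, and the three layer names, come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

LAYERS = {"business", "frameworkContainer", "ability"}  # layer names from the text


@dataclass
class FalcoSpan:
    # Public header: basic span properties in the OpenTracing sense.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation_name: str
    # Baggage: inherited from the root span by every descendant.
    scene: str
    # Tags.
    layer: str
    module: str
    stages: Dict[str, float] = field(default_factory=dict)  # stage name -> timestamp (ms)
    # Logs: free-form entries attached to this span.
    logs: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer}")

    def child(self, span_id, operation_name, layer, module):
        """Create a child span; the scene is passed down like baggage."""
        return FalcoSpan(self.trace_id, span_id, self.span_id,
                         operation_name, self.scene, layer, module)


root = FalcoSpan("trace-1", "0", None, "openDetail",
                 "ProductDetail_FirstScreen", "business", "detail")
render = root.child("0.1", "render", "frameworkContainer", "DX")
render.stages["preProcessStart"] = 12.0
```

Because scene travels with every child span, a query on ProductDetail_FirstScreen can collect all layers and modules that took part in that scene without joining on the root span.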

Falco - key points

(Figure 7 Falco key implementation)

Client-side traceID: Generated to be unique, fast to produce, extensible, readable, and short;
Call and restore abstraction: The traceID and the multi-level span serial number are passed along transparently to make upstream and downstream relationships explicit;
End-to-end correlation: The core problem is connecting to the cloud. The device-side ID is passed transparently to the server, and the server stores its mapping to the Eagle Eye ID. The access layer returns the Eagle Eye ID, so the client-side full-link model also holds it. With this two-way mapping, we can tell whether an unreturned request failed in the network phase, never reached the access layer, or was not answered by the business service. Familiar but coarse-grained network problems thus become definable and explainable;
Layered measurement: The core purpose is to give different modules in the same layer a consistent basis for comparison and to support horizontal performance comparison after a framework upgrade. The idea is to abstract the client domain model. Taking the framework class as an example, although the frameworks differ, key steps such as invocation and parsing are the same, so they can be abstracted into standard stages. Other classes are handled similarly;
Structured tracking points: First, columnar storage is used, which favors aggregation and compression over large data sets and reduces data volume. Second, business + scene + stages are stored in a single table, which makes associated queries convenient;
Falco-based accumulation of domain problems: This includes key definitions of complex problems, clue-style logs for tracking problems, and tracking points for special needs. All domain-problem information is structured and stored in Falco, and domain engineers can keep building analysis capabilities on the accumulated domain information. Only when effective data supply and domain interpretation are integrated can deeper problems be defined and solved.
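
The traceID and multi-level span serial number can be sketched as follows. The article does not give the actual ID format, so the format in new_trace_id (device prefix, timestamp, random suffix) is purely an assumption chosen to be short and readable, and SpanNumbering and parent_of are hypothetical helpers. The dotted numbering follows the "0.1" style of spanID that appears in the troubleshooting example later in the article.

```python
import itertools
import time
import uuid


def new_trace_id(device_id: str) -> str:
    """Hypothetical format: device prefix + hex millisecond timestamp + random suffix."""
    return f"{device_id[:8]}-{int(time.time() * 1000):x}-{uuid.uuid4().hex[:6]}"


class SpanNumbering:
    """Hand out multi-level serial numbers such as 0, 0.1, 0.2, 0.1.1."""

    def __init__(self):
        self._counters = {}

    def root(self) -> str:
        return "0"

    def child_of(self, parent: str) -> str:
        counter = self._counters.setdefault(parent, itertools.count(1))
        return f"{parent}.{next(counter)}"


def parent_of(span_id: str):
    """Recover the upstream span from the serial number alone."""
    head, _, _ = span_id.rpartition(".")
    return head or None


numbering = SpanNumbering()
first = numbering.child_of(numbering.root())
second = numbering.child_of(numbering.root())
nested = numbering.child_of(first)
print(new_trace_id("device1234abcd"), first, second, nested)
```

With this kind of numbering, the call topology can be rebuilt from the serial numbers alone, which is one way to read the "call and restore" point above.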

(Figure 8 Falco Domain Problem Model)

Falco-based operation and maintenance practice

Operation and maintenance covers a very wide scope. It centers on the key processes of problem discovery, problem intake, location analysis, and problem repair, and ranges from indicator observation and alarming across massive devices to single-device troubleshooting and log analysis. Everyone knows these things need to be done, and each process involves building many capabilities, but in practice it is hard to implement well and the parties involved do not always accept the results. The Taobao client has long had problems with indicator accuracy and inefficient log retrieval. Take APM performance indicators as an example: in the past, many of the Taobao App's indicators were inaccurate, business teams did not accept them, and they could not guide actual optimization. This chapter shares the Taobao App's optimization practices for indicator accuracy and log retrieval efficiency.

(Figure 9 The problem reverses the user line and the operation and maintenance system)

Macro Indicator System

Taking the cross-team client performance campaign as an opportunity, APM began an upgrade centered on user experience. The core work covers startup, external links, and visual and interactive indicators in various business scenarios. The question is how to bring the endpoint of each indicator closer to what users actually perceive. The work mainly includes the following:

Upgrade of the 8060 algorithm: Extract and count visually meaningful elements (such as images and text) and exclude elements that users cannot perceive (blank controls, background images). For example, a view visual specification is defined so that custom controls such as the image library and skeleton placeholders (fishbone diagrams) can be marked;
H5 domain: Support visual and interactive measurement of UC page elements and the backtracking algorithm of the front-end JSTracker (event tracking framework), and connect with the H5 page visual algorithm;
In-depth and complex scenarios: Define visual specifications for custom frameworks, connect and calibrate R&D frameworks such as Flutter and TNode (dynamic R&D framework), with each R&D framework implementing the 8060 algorithm itself;
External link domain: Align the H5 page measurement standard and redefine negative actions such as leaving after arriving through an external link.

Taking startup as an example, the calibrated APM measurement includes the stage in which images are displayed on screen. Although the measured value has increased, it better matches what the business side asks for.

(Figure 10 Startup data trend after calibration)

Taking external links as an example, after H5 was connected, the value under the new measurement standard also rose, but it better matches what users perceive.

(Figure 11 External link data before and after calibration)

Through this campaign, the visual indicators of several R&D frameworks were connected and calibrated.

Single-device troubleshooting system

Troubleshooting is still built on TLOG. Here we focus only on the problems users meet in the key steps of log reporting, log analysis, and location and diagnosis (no logs, unreadable logs, difficulty locating the problem, and so on), and on the work done in the operation and maintenance troubleshooting system to improve problem location efficiency.

(Figure 12 The core function of single-machine troubleshooting and problem location)

Improving the log upload success rate: Log supply during troubleshooting is ensured in several ways. First, a built-in log upload capability can be triggered at multiple points in core scenarios or on problem feedback, such as user feedback and online exceptions in new features, which raises the log arrival rate. Second, TLOG was upgraded, with optimizations to the sharding strategy, retries, and log management, to solve the log upload timeliness problems that users had reported in the past. Finally, various kinds of exception information are collected as snapshots and reported through the MTOP link as a bypass to help reconstruct the scene;
Improving log location efficiency: First, classify logs, for example into page logs and full-link logs, to support fast filtering. Next, connect the full-link call topology of each scenario so that the node where the problem occurs can be seen quickly and the problem can be routed and handled quickly. Finally, for structured diagnosis of problems such as errors, slowness, and UI jank, the principle is to hand the interpretation of domain problems to the domain itself. For example, the jank category includes APM frozen frames, ANR, and main-thread stalls; the business category includes request failures, request RT greater than xx, and white-screen pages. Connecting the capabilities of each domain improves the ability to diagnose and locate problems quickly;
Building full-link tracing capability: Eagle Eye (the distributed tracing system used in Alibaba's back end) has many connected services and a large log volume, so log sampling is unavoidable. For calls that are not sampled, the log is cached for only 5 minutes, so Eagle Eye must be notified within those 5 minutes to retain it longer. In the first stage, the back-end parsing service extracts the Eagle Eye ID of the call chain and notifies the Eagle Eye service to store the corresponding trace log, which is kept for 3 days after a successful notification. In the second stage, the Eagle Eye storage step is moved earlier. In the third stage, similar to scene tracing, the Eagle Eye trace logs of core scenes are obtained and stored on the Ferris Wheel platform where possible. The first stage is live and can link to the Eagle Eye platform. Because it generally takes about 5 minutes from when a problem occurs to when investigation begins, the success rate is not high. The second and third stages, now in planning and development, are needed to raise it further;
Building platform capabilities: These are based on parsing the client-side full-link logs. For visualization, the full-link log content is displayed in structured form, which makes it easy to spot exceptions on individual nodes quickly. For diagnosis, problems such as abnormal durations, interface errors, and abnormal data sizes are identified quickly.
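
The principle of handing the interpretation of domain problems to the domain can be sketched as a small rule registry, where each domain registers its own diagnosis rules and the platform only runs them over parsed full-link log entries. Everything here is hypothetical: the log entry shape, the diagnoser decorator, the rule names, and the 2000 ms threshold are illustrative choices, not the platform's actual interface.

```python
from typing import Callable, Dict, List, Optional

Span = Dict[str, object]   # a parsed full-link log entry (hypothetical shape)
Diagnoser = Callable[[Span], Optional[str]]

DIAGNOSERS: Dict[str, List[Diagnoser]] = {}


def diagnoser(module: str):
    """Register a rule owned by the domain that understands this module."""
    def register(fn: Diagnoser) -> Diagnoser:
        DIAGNOSERS.setdefault(module, []).append(fn)
        return fn
    return register


@diagnoser("mtop")
def request_failed(span):
    return "request failed" if span["status"] == "failed" else None


@diagnoser("mtop")
def request_slow(span, threshold_ms=2000):   # threshold is an illustrative value
    return "request RT too high" if span["duration_ms"] > threshold_ms else None


def diagnose(spans):
    """Run every registered rule against the spans of the matching module."""
    findings = []
    for span in spans:
        for rule in DIAGNOSERS.get(span["module"], []):
            message = rule(span)
            if message:
                findings.append((span["span_id"], message))
    return findings


log = [
    {"span_id": "0", "module": "detail", "status": "ok", "duration_ms": 3500},
    {"span_id": "0.1", "module": "mtop", "status": "failed", "duration_ms": 8200},
    {"span_id": "0.2", "module": "mtop", "status": "ok", "duration_ms": 300},
]
findings = diagnose(log)
print(findings)
```

A design like this keeps the platform generic: adding a jank rule for APM or a white-screen rule for a business page means registering another function, without changing the diagnosis loop.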

The above are some of our attempts in operation and maintenance this year. The aim is to use technology upgrades to replace process-based enablement with technology-based enablement in troubleshooting.

Next, we show the practice in Taobao and the results in other Apps of the group.

Full link operation and maintenance practice

Taobao jank troubleshooting

Internal colleagues reported that when using the Taobao App overseas, they experienced jank and some pages could not be opened. After receiving the report, we retrieved the TLOG logs for investigation.

Using the "full link visualization" function (Figure 13), we can see that the network status of the H5 page with spanID 0.1 is "failed", which is why the page could not be opened;
Using the duration-anomaly feature of "full link diagnosis" (Figure 14), we can see that a large share of network durations fall at 2s and 3s+, with some exceeding 8s, which indicates that Alibaba's CDN node was slow.

(Figure 13 Full link visualization function)

(Figure 14 Full-link stuck diagnosis function)

Ele.me main link access

Cold start full link

(Figure 14 Ele.me full link view - cold start full link)

Store full link

(Figure 15 Ele.me full link view - store full link)

Falco-based optimization practice

New index system

We now focus on how to build an online performance baseline from the end-to-end full-link perspective around the Falco observable model, and how to use data to drive continuous improvement of the Taobao App experience. The first step is building the data indicator system, which mainly includes the following points:

Indicator definition and specification: Stay close to what the user perceives. Define indicators around the user's path from click, to content presentation, to scrolling the page, focusing on technical scenarios such as page opening, content appearing on screen, click response, and scrolling. Examples include page visual and interactive time, image on-screen time, sliding frame rate (finger), and frozen frames;
Indicator measurement scheme: The principle is that indicators in each domain are handled by that domain. Taking the jank indicator as an example, it can follow the vendor's standard (Apple MetricKit), a self-built standard (APM's main-thread stall, ANR, and so on), or a custom indicator of a business domain (scene full link), such as MTOP request failure or the detail header image appearing on screen;
Indicator composition: Indicators consist of online aggregated indicators and offline aggregated indicators. Based on online and offline data and the related specifications, App experience optimization is driven from the user's perspective and by comparison with competitors.

(Figure 16 App performance indicator system)

Taking APM as an example, the scroll-related indicators are defined as follows:

(Figure 17 APM-related indicator definition scheme)

Taking the scene full link as an example: for a specific business and a single user interaction, it covers the complete call link from the start of the response to its end, from the front end to the server and back to the client. Under the scene full link, the detail first-screen indicator is defined as follows:

(Figure 18 Scene full link - detail first screen definition)

And so on for other indicators...

Optimization under the new index system

In FY22, platform technology focused on the full-link perspective, used experience as the outlet, went deep into business optimization, defined and broke down problem domains around indicators, and carried out several major special optimizations targeting what users actually perceive. We introduce them from the bottom up: how the general network-layer strategy was optimized around the request cycle, from connectivity to the transport layer to the timeout strategy; how technical strategies were upgraded for user perception, such as gateway and image optimization; and how business scenarios were technically transformed, including preprocessing and preloading in the venue framework, a lightweight Security Guard, and even graded business experience, for example not enabling the homepage information flow on low-end devices. The related practices are described below.

(Figure 19 Taobao App full-link optimization technical solution)

Request simplification and speed-up - minimalist calling practice

Taking an MTOP request as the scenario, the link mainly involves the interaction between MTOP and the network library. Analysis of the current full-link thread model shows that, from MTOP initiating the request to the network layer receiving it, the request is slowed down for the following reasons:

Many data copies: In the existing network-layer mechanism, the network library intercepts requests through a hook, and requests are forwarded to the network library for transmission through NSURLConnection and the URL Loading System. This involves multiple data copies, and the intermediate interception is time-consuming;
Too many thread switches: The thread model is overly complex, and completing a single request involves frequent thread switching;
Asynchronous turned synchronous: The original request path uses an NSOperationQueue to process tasks. This underlying queue binds the request and its response together, so an operation is released only after the response to what it sent has arrived. One "HTTP Operation" occupies all IO for a complete HTTP send-and-receive cycle, which defeats the parallelism of network requests and makes the operation queue prone to blocking.

These problems are more pronounced when many requests are issued in bulk and competition for system resources is intense (for example, cold start, when dozens of requests are issued at once).

(Figure 20 Before and after thread model optimization - minimalist call)

The transformation: MTOP calls the network library interface directly, which yields a larger performance improvement:

Simplified thread model: Skip the system URL Loading System hook mechanism and handle the thread handoff for sending and receiving data directly, reducing thread switching;
Avoid weak-network blocking: Sending and receiving data packets are split and handled separately, so a long air-interface RT does not affect I/O concurrency;
Replace the deprecated API: Move from the old NSURLConnection to calling the network library API directly.
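
The effect of splitting sending from receiving can be illustrated with a small simulation. This is not the iOS implementation: it uses Python's asyncio with a semaphore standing in for a bounded operation queue, and RESPONSE_DELAY and QUEUE_WIDTH are made-up values. In the coupled variant each operation holds its queue slot until the response arrives; in the split variant the slot is released as soon as the request is sent.

```python
import asyncio
import time

RESPONSE_DELAY = 0.05   # simulated slow network round trip, in seconds
QUEUE_WIDTH = 2         # simulated operation-queue concurrency


async def coupled(n):
    """Each operation holds a queue slot for the whole send + receive cycle."""
    slots = asyncio.Semaphore(QUEUE_WIDTH)

    async def operation():
        async with slots:
            await asyncio.sleep(RESPONSE_DELAY)  # slot stays busy while waiting

    await asyncio.gather(*(operation() for _ in range(n)))


async def split(n):
    """The slot is held only while sending; responses are awaited separately."""
    slots = asyncio.Semaphore(QUEUE_WIDTH)

    async def operation():
        async with slots:
            await asyncio.sleep(0)               # sending is quick
        await asyncio.sleep(RESPONSE_DELAY)      # waiting no longer blocks others

    await asyncio.gather(*(operation() for _ in range(n)))


def timed(coro):
    start = time.perf_counter()
    asyncio.run(coro)
    return time.perf_counter() - start


coupled_s = timed(coupled(6))
split_s = timed(split(6))
print(f"coupled: {coupled_s:.2f}s, split: {split_s:.2f}s")
```

With six requests and a queue width of two, the coupled variant needs about three round trips while the split variant needs about one, and the gap widens as the round trip gets longer, which matches the observation that weak networks and bulk requests suffer most.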

Data effect: The optimization is more pronounced in environments where system resources are tighter, such as low-end devices.

(Figure 21 Minimal call AB optimization range)

Weak network strategy optimization - Android network multi-channel practice

In environments with a poor Wi-Fi signal or a weak network, repeated retries sometimes do little to improve the success rate. The system provides a capability that lets a device request the cellular interface while it is on Wi-Fi. The network application layer can use it to reduce errors such as request timeouts and improve the request success rate.

After Android 21, the system provides a new way to get network objects, even if the device currently has a data connection via Ethernet, the application can use this method to get the connected cellular network.

Therefore, when the user equipment has WIFI and cellular network at the same time, different requests can be scheduled to the two network card channels of the Ethernet and the cellular network at the same time under a specific strategy to achieve network acceleration.

Core Changes:

Prerequisites: whether a cellular network is available in the current Wi-Fi environment;
Trigger timing: when a request has been sent and no data has returned within a set period, a retry over the cellular network is triggered. The original request is not interrupted; the response from whichever channel returns first is used, and the one that returns later is cancelled;
Time control: the waiting period is delivered through Orange configuration per scenario, and in the future it needs to be adjusted dynamically according to network strength;
Product form & compliance: when the feature is in use, the user is shown the text "Wi-Fi and mobile network are being used at the same time to improve your browsing experience; this can be turned off in Settings - General". The prompt appears the first time the feature is triggered after each launch.
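
The trigger timing above can be sketched as a "hedged" request. This is a minimal illustration with assumptions: the function name `hedged_request`, the channel labels, and the threshold are hypothetical, and the two channels are stand-in coroutines. On Android the cellular channel would presumably be obtained through `ConnectivityManager.requestNetwork` with the cellular transport type, but the source does not show that code.

```python
import asyncio


async def hedged_request(primary, fallback, hedge_after):
    """Start primary(); if it has not finished after hedge_after seconds,
    also start fallback() without interrupting primary. Return the
    (channel, result) of whichever finishes first and cancel the other.
    Error handling is omitted in this sketch."""
    tasks = {asyncio.create_task(primary()): "wifi"}
    done, _ = await asyncio.wait(list(tasks), timeout=hedge_after)
    if not done:
        # No data yet: retry over the second channel, keep the first running.
        tasks[asyncio.create_task(fallback())] = "cellular"
        done, _ = await asyncio.wait(
            list(tasks), return_when=asyncio.FIRST_COMPLETED
        )
    winner = next(iter(done))
    for task in tasks:
        if task is not winner:
            task.cancel()  # the late response is discarded
    return tasks[winner], winner.result()


async def slow_wifi():          # stand-in for a request stuck on weak Wi-Fi
    await asyncio.sleep(0.5)
    return "response via wifi"


async def quick_cellular():     # stand-in for the same request over cellular
    await asyncio.sleep(0.01)
    return "response via cellular"


if __name__ == "__main__":
    print(asyncio.run(hedged_request(slow_wifi, quick_cellular, 0.05)))
```

The threshold passed as `hedge_after` corresponds to the period that the article says is delivered through Orange configuration.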

(Figure 22 Android multi-channel network capability optimization + user compliance authorization)

Data effect: when competition for network resources is intense, the Wi-Fi + cellular dual-channel scenario gives a clear improvement in long-tail latency and timeout rate. P99/P999 performance improves by 12%/58%, and the error rate drops by 0.41‰.

Technical strategy grading - image grading practice

Device performance varies widely, and services keep growing more complex. On low-end devices many services cannot deliver the intended effect and instead cause bad experiences such as freezes. In the past we optimized performance with techniques such as delaying, concurrency, and preloading, but these only sidestep the problem: the core link still has to pay the cost of its key calls. We therefore need to grade the business experience. With graded business processes, high-end devices get the richest and most complex flows, while low-end devices can still use the core functions smoothly. Ideally both user experience and core business metrics improve. Failing that, we accept some loss of functionality in exchange for better performance, as long as core business metrics are not affected. The initial plan is to get there in two steps:

In the first stage, business grading needs a rich library of strategies and judgment conditions. We will build these general capabilities into core components so that businesses can adopt grading quickly;
In the second stage, once many businesses have adopted grading and a large body of grading strategies and AB data has accumulated, we can recommend and tune grading strategies for individual businesses, so that similar businesses can reuse them quickly and work more efficiently.
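
A strategy library with judgment conditions could look roughly like the following. This is a hypothetical sketch, not the team's component: the `Device` fields, the rule names, and the settings keys are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Device:
    tier: str            # hypothetical labels: "high", "mid", "low"
    free_memory_mb: int


@dataclass
class Rule:
    name: str
    condition: Callable[[Device], bool]   # judgment condition
    settings: Dict[str, object]           # overrides applied when it matches


# Hypothetical defaults: the full experience, with the core function always on.
DEFAULT = {"animations": True, "preload_count": 3, "core_checkout": True}

# Hypothetical rules, evaluated in order; the first match wins.
RULES: List[Rule] = [
    Rule("low-end", lambda d: d.tier == "low",
         {"animations": False, "preload_count": 0}),
    Rule("memory-pressure", lambda d: d.free_memory_mb < 300,
         {"preload_count": 1}),
]


def resolve_strategy(device, rules=RULES, default=DEFAULT):
    """Return the default settings overridden by the first matching rule."""
    for rule in rules:
        if rule.condition(device):
            return {**default, **rule.settings}
    return dict(default)


print(resolve_strategy(Device("low", 512)))
print(resolve_strategy(Device("high", 4096)))
```

Keeping the rules as data is one way to make the second stage possible, since accumulated rules and their AB results can then be compared and recommended across businesses.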

Traditional CDN adaptation rules dynamically assemble the "best" image size from factors such as network, view size, and system, to reduce network bandwidth and bitmap memory usage and improve the image loading experience on the device. Following the specification given by UED, we made the compression parameters configurable and extended the original CDN adaptation rules into an image grading strategy for different device models. This further reduces image size and speeds up the time for images to appear on screen.
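
As a rough illustration of extending size adaptation with per-device grading, the sketch below picks an image width and a compression quality. Every number and name here is an assumption made for the example (the width buckets, quality values, scale caps, and tier labels); the real rules come from the UED specification and are not given in the article.

```python
from bisect import bisect_left

# Hypothetical CDN width buckets, in pixels, sorted ascending.
CDN_WIDTHS = [110, 220, 320, 480, 640, 750, 960]
# Hypothetical per-tier compression quality and maximum pixel scale.
QUALITY_BY_TIER = {"high": 90, "mid": 75, "low": 50}
SCALE_CAP_BY_TIER = {"high": 3.0, "mid": 2.0, "low": 1.0}


def pick_image_variant(view_width_pt, screen_scale, device_tier, weak_network=False):
    """Return (width_px, quality) for an image shown in a view of the given width."""
    # Low-end devices are capped at a lower pixel density to save bitmap memory.
    scale = min(screen_scale, SCALE_CAP_BY_TIER[device_tier])
    wanted_px = view_width_pt * scale
    # Smallest bucket that covers the view, or the largest bucket available.
    index = min(bisect_left(CDN_WIDTHS, wanted_px), len(CDN_WIDTHS) - 1)
    quality = QUALITY_BY_TIER[device_tier]
    if weak_network:
        quality = min(quality, 50)   # compress harder on a weak network
    return CDN_WIDTHS[index], quality


print(pick_image_variant(160, 3.0, "high"))  # (480, 90)
print(pick_image_variant(160, 3.0, "low"))   # (220, 50)
```

With these made-up tables, the same 160 pt view loads a 480 px image on a high-end device and a 220 px image on a low-end one.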

(Figure 23 Image grading rules by device)

Lightweight link architecture - security signature-free practice

The external-link pull-up path runs from app startup, through the customs request, to landing page loading (the main request is still MTOP), and it involves several security signing operations. Signing is a CPU-intensive task, and when the path takes too long, traffic bounces. In FY22 S1 the giant wave business did a lot of performance optimization on the pull-up path, because better performance lowers the bounce rate. Most of the remaining time is now spent in the customs request, and security signing accounts for a large share of that time, so we want to be able to skip it. Services can adopt this as appropriate to raise the value of inbound traffic. The change involves several parties, namely MTOP, AServer (the unified access layer), and security:

(Figure 24 Changes in the security signature-free architecture)

Gateway protocol upgrade: the upgraded protocol supports signature-free requests and exposes a signature-free interface. If a business API is set as signature-free, a header is carried down to the network library;
AMDC scheduling service: for stability, in the short term these requests are scheduled to the online safe production environment through AMDC (the wireless network policy scheduling service). The AMDC scheduling module therefore decides, based on the marker, whether to return the signature-free VIP to the client. Once the function is stable, requests will be scheduled to the online main-site environment as appropriate;
Migration of the signature verification module: the security extension capability used to sit in front at the AServer access layer. To reduce operation and maintenance costs, it will be migrated from AServer to security. AServer will then no longer have a signature extension module, and security will enable signature verification and related functions according to API/header characteristics;
MTOP signature-free error retry: in signature-free mode, if a request fails at the MTOP layer with an illegal-signature error, it is downgraded to the old link to protect the user experience.
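
The client side of this flow, skipping the signature for marked APIs and falling back to the signed link on an illegal-signature error, can be sketched as follows. The API name, header names, and error code are hypothetical, and HMAC-SHA256 is only a stand-in for the real security signing, which the article does not describe.

```python
import hashlib
import hmac

SIGN_FREE_APIS = {"mtop.example.landing"}   # hypothetical signature-free API set
ILLEGAL_SIGN = "ILLEGAL_SIGN"               # hypothetical error code


def sign(payload: bytes, key: bytes) -> str:
    """Stand-in for the CPU-intensive security signing step."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def send_with_fallback(api, payload, key, transport):
    """Try the signature-free path first; downgrade to the signed path on failure.

    transport(api, payload, headers) is any callable that returns a dict.
    """
    if api in SIGN_FREE_APIS:
        response = transport(api, payload, {"x-sign-free": "1"})
        if response.get("error") != ILLEGAL_SIGN:
            return response
        # Illegal signature reported: fall through to the old signed link.
    return transport(api, payload, {"x-sign": sign(payload, key)})


def demo_transport(api, payload, headers):
    """Fake server side that rejects every signature-free request."""
    if "x-sign-free" in headers:
        return {"error": ILLEGAL_SIGN}
    return {"data": "ok"}


print(send_with_fallback("mtop.example.landing", b"{}", b"secret", demo_transport))
```

The fallback costs one extra round trip in the failure case, which is the trade-off for removing signing from the common path.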

Summary & Outlook

Summary: This article describes how, facing the existing challenges on mobile, we built observability by implementing call-link tracing, standard logging, and scenario-based tracing. Based on the full-link perspective and these new observability capabilities, we created a full-link operation and maintenance system and a continuous performance optimization system. Together they supply the call-chain tracing ability that mobile had long lacked, make it possible to locate problems quickly in complex call scenarios, and replace the inefficient practice of pulling people into a group to troubleshoot, which starts the shift from process empowerment to technology empowerment. Around this capability we build full-link Metrics and a full-link performance indicator system, carry out governance deep inside business scenarios, upgrade the platform's technical capabilities, and use data to drive business experience improvement and long-term tracking of experience.

Shortcomings: Although the Taobao App is gradually being connected across scenarios, the goal of locating a problem within 15 minutes is still a long way off, and many blockers remain. These include the success rate of log reporting, the effectiveness of server-side log retrieval, the efficiency of problem location, turning source-side data quality checks into products and tooling, the technical teams' understanding of problems in their fields and their continued accumulation of structured information, and finally the user experience of the product as a whole. All of these need continuous optimization.

Outlook: Continuing the mobile-native technology philosophy of Alibaba's mobile technology team, doing technology and experience well requires going deep into the mobile domain and facing two challenges: east-west, the many R&D frameworks, and north-south, the end-to-end full link. In the first phase of experience optimization in 2018 we introduced similar concepts in the request field and tried them out. Only now have we found a suitable structured theoretical basis, and through in-depth practice grounded in the characteristics of mobile terminals we continue to define and solve problems in this field. We hope to create an observable technology system for the mobile domain and to consolidate it into a lasting architecture.

