
Authors: Yang Xikai, Wu Wenyang

AutoNavi Maps has spent three consecutive years since 2019 optimizing the performance experience across its full link, ultimately cutting the time consumption of the overall core link roughly in half and greatly improving the user experience. This article summarizes the thinking and practical experience of performance optimization accumulated along the way, in the hope that it will be helpful to others.

Figure: before and after optimization (time before optimization as the 100% baseline)

Overall approach

The overall approach is divided into three parts: identifying performance bottlenecks, reverse-order special-project problem solving, and forward-order long-term control:

  • Identify performance bottlenecks: find the optimization points to target. Scientific evaluation standards and clearly defined optimization points are prerequisites for optimization. The evaluation standards must reasonably reflect the performance experience and be close to what users actually feel; the targets must be quantifiable, so that the team stays focused and executes efficiently during the special project and avoids detours;
  • Reverse-order special projects: a performance problem is rarely a single business problem; it usually requires collaboration between multiple product, development, and testing teams. We start from the problem and work backwards quickly, pooling the resources of multiple teams in the form of a special project to set goals, deliver results quickly, and build team confidence;
  • Forward-order long-term control: optimization is the process of tracing back from the "effect" to the "cause"; solving a problem after it has occurred is reverse-order problem solving. To stop problems at the source of the "cause", and to keep optimizations already made from regressing, our third idea is long-term, continuous forward-order control, which prevents existing business from gradually deteriorating and at the same time consolidates the results of the special-project optimization.

The rest of this article analyzes these three parts one by one.

Identifying performance bottlenecks

Setting standards

First-screen loading speed has a large impact on the user experience, so first-screen time is used as the statistical standard for page time consumption. As mobile hardware keeps improving, many high-end devices have enough raw performance to mask program-level performance problems. We therefore optimize for devices of different models and tiers to cover as many online users as possible.

Statistical standard

With first-screen display time settled as the statistical standard, the next question is how to define the dimensions along which first-screen time is measured:

  • Business perspective: business forms differ, so the definition of the first screen necessarily differs from page to page;
  • Product perspective: the definition of the first screen revolves around how heavily each function is used, with high-frequency functions prioritized;
  • R&D perspective: anchor the start and end points of the first screen with log instrumentation points in the business flow;
  • Communication standard across product, development, and testing: establish a unified communication language for product, development, and testing, namely quantitative data.

Device model standards

  • Model tiers: devices are divided into three tiers, high, medium, and low, according to a device score;
  • Model selection: representative models of each tier are selected based on the share of online user devices, so as to cover as many users as possible and include devices from representative manufacturers. Of course, the availability of devices in the existing test labs also has to be considered; after all, procurement is not always timely, and unnecessary waste should be avoided.

Determine optimization items

AutoNavi Maps has accumulated a long history, so optimizing each scene means facing complex business code and even business blind spots. This makes it very challenging to quickly analyze a large amount of historical business and accurately locate time-consuming points. Relying on manual analysis alone would require unrealistic manpower and time, so an acceleration plan has to come from tools and methodology.

Identify optimization points top-down

  • Analysis along the device dimension:

Unlimited business scenarios run on the limited performance resources of mobile devices, so resource allocation is bound to be stretched. Time consumption therefore has to be analyzed per device tier: a time-consuming problem on a weaker phone may not be a problem at all on a high-end phone, the optimization points differ, and targeted, per-tier strategies are needed. For example, complex interactive animations are a time-consuming point on low-end devices; disabling some animations on the search page brings a large performance benefit without hurting the user experience, while on high-end devices this cost can be ignored.

  • Analysis along the dimension of parallel businesses:

With so many business points, why do we choose travel scenarios, search scenarios, and so on for time-consumption analysis? This requires working with product and BI and letting the data speak. Analysis of online user behavior shows that most users click the search box to enter the search home page, and online user feedback also reports that entering search is slow. Compared with other functions, the timeliness and importance of the search home page are self-evident, so business classification is based on online data: this scenario ranks first, performance resources should be tilted toward it, and its time-consuming points need to be analyzed first.

  • Analysis along the dimensions inside a business:

First, the business scenario itself is combed through along its entire link to identify the key time points; then the scenario logging tool instruments these points and uploads them to the server, collecting real first-screen time data from online users as an effective basis for the quantitative target. The timestamp difference between two key points is the time of that phase, which helps analyze where a business spends its time. Around this business analysis process, we have accumulated many analysis tools to assist the work.
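
The sketch below illustrates this kind of key-point timing; the object and point names are illustrative assumptions, not AutoNavi's actual logging SDK. The difference between two recorded timestamps is the time of that phase.

```kotlin
import android.os.SystemClock

object SceneTracer {
    private val marks = mutableMapOf<String, Long>()

    // Record a key point in the business flow, e.g. "page_create" or "first_frame".
    fun mark(point: String) {
        marks[point] = SystemClock.elapsedRealtime()
    }

    // Phase time = timestamp difference between two key points (null if either is missing).
    fun phase(start: String, end: String): Long? {
        val s = marks[start] ?: return null
        val e = marks[end] ?: return null
        return e - s
    }

    // Hand all marks to an uploader so real users' first-screen data can be aggregated server-side.
    fun report(scene: String, upload: (scene: String, marks: Map<String, Long>) -> Unit) {
        upload(scene, marks.toMap())
        marks.clear()
    }
}
```

Marking "page_create" when the page is entered and "first_frame" when the first-screen content is rendered makes phase("page_create", "first_frame") the first-screen time of that scene.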

Minimal set, addition, subtraction

The minimal set is a bottoming-out process: keep only the smallest runnable version that does not deform the first-screen product form, and remove everything else unrelated to the first screen. This is the subtraction. The minimal set can be understood as the best optimization result achievable without changing the existing architecture. If the extreme figure of the minimal set meets the standard, the necessary dependencies are then added back one by one on this basis to make the product fully functional, while other unnecessary dependencies are optimized, removed, or postponed. This is the addition. Of course, if even the extreme figure of the minimal set cannot meet the standard, optimization points have to be found in other dimensions; breakthroughs can usually be found in network time consumption and architectural rationality.

Reverse-order special projects

A performance problem is not a single business problem; it usually requires collaboration between multiple product, development, and testing teams. So our thinking is:

Start from the problem and work backwards quickly, pool the resources of multiple teams in the form of a special project, set the goal, tackle it quickly to get results, and build team confidence.

Special-project tackling

Tackling problems through special projects is also a process of building while accumulating. The means of troubleshooting performance problems were discrete at first; often one problem was solved in isolation, and the next time a different scenario appeared the work had to be repeated. So our thinking is:

Accumulate reusable solutions, problem-solving approaches, general frameworks, and tool platforms, and optimize the efficiency of the "optimization means" themselves, so that the cost keeps going down.

The startup project was the first performance special project. Optimizing one scene cost 30 people across 3 version iterations. The manpower was high because the "first time" faced many problems: metrics had to be analyzed and standards defined, instrumentation tools had to be built from scratch, there was no experience with the optimization process, and control mechanisms had to be newly built.

The search project completed one scene optimization at a cost of 8 people, a much better manpower picture, and took 2 versions. By then, some of the instrumentation tools built during the startup project were available, and with some optimization experience many detours were avoided.

The core-link special project completed six scenes at a cost of 24 people within a single version. The optimization process was methodical: less manpower, more scenarios, and a shorter time, thanks to improved optimization efficiency and gradually decreasing costs. Relatively mature analysis tools, optimization tools, and control tools were accumulated in the course of continuous optimization.

Optimization

Performance optimization is a systemic problem. We divide the optimization plan into three layers: business, engine, and basic capabilities, with optimization points defined clearly from top to bottom. The upper business layer performs adaptive resource scheduling, the intermediate engines provide acceleration capabilities, and the lower layer provides high-performance components.

Business adaptive resource scheduling

Optimization at the business layer mainly reaches the optimal performance state through business orchestration and scheduling, but scheduling and tuning each business individually is repetitive and cumbersome. To reduce this cost, we developed a resource scheduling framework: once a business is connected, the scheduling work is done by the framework. While the application runs, the scheduling framework senses and collects the operating environment, makes different scheduling decisions for different environmental conditions, generates the corresponding performance optimization strategy, and finally executes the corresponding optimization functions according to that strategy. At the same time it monitors the scheduling context and the execution effect of the scheduling strategy and feeds them back into the scheduling decision system, providing input for further decision-making and optimization. In this way the expected extreme performance experience can be achieved in different operating environments.
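
The sketch below shows the closed loop just described: sense, decide, execute, monitor, with the monitoring result feeding the next round of decisions. All interface and type names are illustrative assumptions; EnvSnapshot and Action are fleshed out in the perception and decision sketches in the following subsections.

```kotlin
interface EnvironmentSensor { fun sense(): EnvSnapshot }
interface DecisionEngine { fun decide(env: EnvSnapshot): List<Action> }
interface StrategyExecutor { fun execute(actions: List<Action>): ExecutionRecord }
interface EffectMonitor { fun report(env: EnvSnapshot, actions: List<Action>, record: ExecutionRecord) }

class ResourceScheduler(
    private val sensor: EnvironmentSensor,
    private val engine: DecisionEngine,
    private val executor: StrategyExecutor,
    private val monitor: EffectMonitor
) {
    // One scheduling pass; in practice this would be driven by environment change events.
    fun tick() {
        val env = sensor.sense()
        val actions = engine.decide(env)
        val record = executor.execute(actions)
        monitor.report(env, actions, record)   // feedback informs later decisions
    }
}

class ExecutionRecord  // placeholder: scheduling context plus execution effect
```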

1. Environmental perception

The perceived environment is divided into four dimensions: hardware device, business scenario, user behavior, and system status (a snapshot sketch follows the list):

  • Hardware device: on one hand, the group laboratory benchmarks known devices to classify them as high-, mid-, or low-end models; on the other hand, compute capability is evaluated in real time on the user's device itself;
  • Business scenario: businesses are divided into foreground display, background operation, interactive operation, and so on. In general, a business scenario performing interactive operations in the foreground has the highest priority, and background data preprocessing has the lowest. Within the same category, priorities are refined by comparing business UV, transaction volume, resource consumption, and other dimensions;
  • User behavior: server-side user portraits are combined with local real-time calculation to determine the user's functional preferences and operating habits, preparing for precise, per-user optimization decisions in the next step;
  • System status: on one hand, the system provides interfaces for extreme states such as memory warnings, temperature warnings, and power-saving mode; on the other hand, system resource conditions can be determined in real time by monitoring memory, threads, CPU, and power.
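
The four dimensions can be pictured as a snapshot that the scheduler consumes; the field names and granularity below are assumptions for illustration.

```kotlin
enum class DeviceTier { HIGH, MID, LOW }

data class EnvSnapshot(
    val tier: DeviceTier,             // hardware: lab score plus on-device capability test
    val foregroundScene: String,      // business scenario currently interacting with the user
    val userPreferences: Set<String>, // user behavior: functional preferences and habits
    val memoryWarning: Boolean,       // system status: memory warning received
    val thermalWarning: Boolean,      // system status: temperature warning / power-saving mode
    val cpuBusy: Boolean              // system status: sampled CPU load above a threshold
)
```
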
2. Scheduling decision

After sensing the environment state, the scheduling system combines the various states with scheduling rules to make decisions on business and resource allocation (a small decision sketch follows the list).

  • Downgrade rules: turn off high-energy-consumption or low-priority functions on low-end devices or when system resources hit shortage alarms (such as memory and temperature alarms);
  • Avoidance rules: while a high-priority function is running, low-priority functions step aside. For example, from the moment the user taps the search box until the search results are fully displayed, low-priority background tasks are suspended to protect the interactive experience;
  • Pre-processing rules: pre-process based on the user's operations and habits. If a user usually taps search about 3 s after startup, the user's search results are preloaded before that point, so the interaction feels instant when the user taps;
  • Congestion control rules: actively reduce resource requests when device resources are tight; for example, when the CPU is busy, actively reduce thread concurrency, so that when high-priority tasks arrive they are not starved of resources and the performance experience does not suffer.
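
A sketch of how the four rule families could map an environment snapshot (see the perception sketch above) to concrete actions; the rule bodies, thresholds, and names are assumptions.

```kotlin
sealed class Action {
    data class Disable(val feature: String) : Action()      // downgrade rule
    data class Defer(val task: String) : Action()            // avoidance rule
    data class Preload(val target: String) : Action()        // pre-processing rule
    data class LimitConcurrency(val max: Int) : Action()     // congestion control rule
}

fun decide(env: EnvSnapshot, usualSecondsBeforeSearch: Int?): List<Action> {
    val actions = mutableListOf<Action>()
    // Downgrade: turn off costly features on low-end devices or under resource alarms.
    if (env.tier == DeviceTier.LOW || env.memoryWarning || env.thermalWarning) {
        actions += Action.Disable("complex_animations")
    }
    // Avoidance: while a high-priority interactive scene runs, pause low-priority work.
    if (env.foregroundScene == "search_results") {
        actions += Action.Defer("background_preprocessing")
    }
    // Pre-processing: preload what the user habitually opens shortly after launch.
    if (usualSecondsBeforeSearch != null && usualSecondsBeforeSearch <= 3) {
        actions += Action.Preload("search_home")
    }
    // Congestion control: shrink thread concurrency when the CPU is already busy.
    if (env.cpuBusy) {
        actions += Action.LimitConcurrency(max = 2)
    }
    return actions
}
```
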
3. Strategy execution

Strategy execution is divided into task execution and hardware tuning. Task execution mainly controls the operation of the corresponding tasks through the memory cache, database, thread pool, and network library, indirectly achieving scheduling control over all kinds of resources. Hardware tuning controls hardware resources directly through cooperation with system manufacturers; for example, when a CPU-intensive high-priority business starts running, the CPU frequency is raised and the running threads are bound to the big cores, avoiding the performance loss of threads switching back and forth and making the most of system resources.

4. Effect monitoring

During resource scheduling, each module is monitored, and the environment status, scheduling strategy, execution records, business effect, resource consumption, and so on are fed back to the scheduling system, which uses them to judge the pros and cons of the scheduling strategy and to tune it further.

Engine acceleration

1. Map engine

The map engine is the part unique to a map application. Optimization here starts with the drawing strategy and mainly includes batched and chunked rendering, frame-rate scheduling, message scheduling, and so on.

2. Cross-end engine

The cross-end engine has to support the business, and it is also a general solution shared by all scenarios, so it has more room to play than per-client optimization, and it is close enough to the business to touch business code directly. The optimization strategy of the cross-end engine is therefore to reduce the performance cost of business code. The main measures are:

  • Thread priority
  • Context preload
  • Business framework reuse
  • Require reference reuse

Here is a brief introduction to context preloading. To avoid affecting the running state of existing business, we designed an idle-time preloading scheme that executes, before the page is entered, work such as time-consuming calculations and time-consuming import-file loading in advance (a minimal sketch of the idle-time slicing follows the list).

  • Idle time: preload when the business thread is idle, to avoid affecting other pages;
  • Segmentation: the content preloaded each time is kept below 16 ms of work, so the preloading task does not block the current thread;
  • Preloading: perform the target page's calculations in advance to speed up entering the target page.
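
A minimal sketch of idle-time, time-sliced preloading on Android, assuming the preparation work can be split into small steps (the class and step names are illustrative). Each idle callback runs steps for at most roughly 16 ms, so preloading never blocks the thread for longer than about a frame.

```kotlin
import android.os.Looper
import android.os.MessageQueue
import android.os.SystemClock

class IdlePreloader(private val steps: ArrayDeque<() -> Unit>) : MessageQueue.IdleHandler {

    fun start() {
        // Register on the current (business) thread's looper; runs only when that thread is idle.
        Looper.myQueue().addIdleHandler(this)
    }

    override fun queueIdle(): Boolean {
        val deadline = SystemClock.uptimeMillis() + 16
        while (steps.isNotEmpty() && SystemClock.uptimeMillis() < deadline) {
            steps.removeFirst().invoke()   // e.g. evaluate a module or parse an import file in advance
        }
        // Returning true keeps the handler registered until all steps are done.
        return steps.isNotEmpty()
    }
}
```
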
3. H5 container

1. Offline package acceleration

Offline package acceleration mainly addresses the loading speed of complex H5 pages: many resource files and long downloads make pages load slowly. The usual remedy is to add loading screens to mask the waiting, which still leaves the user waiting a long time and ultimately costs conversion rate. Against this background, and combining some of AutoNavi's existing platform capabilities, an offline package acceleration capability was built. The full link includes (a sketch of the resource interception step follows the list):

  • Offline package construction: speed up business development through front-end scaffolding and dynamically specify the offline package's resource configuration;
  • Offline package publishing: connect to the existing service publishing capabilities and build a front-end visual publishing platform, providing grayscale control, package updates, data statistics, and other capabilities;
  • On-device management: package download, management, activation, and control of download and update timing; pre-download packages for high-frequency pages so they open "in a second";
  • Resources take effect: resource loading inside the container is intercepted and routed to the offline resource management module; on a cache hit the offline resource is used directly, and on a miss the download falls back to the normal network.
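
A sketch of the "resources take effect" step using Android's standard shouldInterceptRequest hook; OfflinePackageManager is a hypothetical stand-in for the offline resource management module described above.

```kotlin
import android.webkit.WebResourceRequest
import android.webkit.WebResourceResponse
import android.webkit.WebView
import android.webkit.WebViewClient
import java.io.File
import java.io.FileInputStream

class OfflineWebViewClient(private val packages: OfflinePackageManager) : WebViewClient() {

    override fun shouldInterceptRequest(
        view: WebView,
        request: WebResourceRequest
    ): WebResourceResponse? {
        val local: File? = packages.lookup(request.url.toString())
        return if (local != null && local.exists()) {
            // Cache hit: serve the pre-downloaded offline resource directly.
            WebResourceResponse(packages.mimeTypeOf(local), "utf-8", FileInputStream(local))
        } else {
            // Cache miss: return null so the WebView falls back to a normal network load.
            null
        }
    }
}

interface OfflinePackageManager {
    fun lookup(url: String): File?
    fun mimeTypeOf(file: File): String
}
```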

2. Container pre-creation

Pre-creating and warming up containers greatly improves the loading speed of H5 pages. The cost of creating a WebView instance is relatively high; pre-creation and cache reuse at an appropriate time after the app starts address both first-open and second-load speed. AMap already has startup task scheduling and idle task scheduling, so the pre-creation can be performed in the corresponding WebView module on that basis. As for the context-switching problem of a pre-created WebView, AMap's page stack is a custom implementation with only one independent Activity, so compatibility is naturally good. Pre-creation trades space for time, and the differentiated configuration for devices of different performance still needs polishing. It can also be combined with on-device intelligence, using signals such as the user's page-jump behavior and frequency to decide dynamically whether to pre-create.
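
A sketch of pre-creating a WebView during an idle startup task and reusing it later; the pool and its hooks are assumptions. MutableContextWrapper lets an instance created with the application context be re-bound to the page's context when it is taken from the pool.

```kotlin
import android.content.Context
import android.content.MutableContextWrapper
import android.webkit.WebView

object WebViewPool {
    private val cache = ArrayDeque<WebView>()

    // Called from a main-thread idle/startup task after launch; trades memory for speed.
    fun warmUp(appContext: Context, count: Int = 1) {
        repeat(count) {
            cache.addLast(WebView(MutableContextWrapper(appContext)))
        }
    }

    // Reuse a pre-created instance if available, otherwise fall back to a fresh one.
    fun obtain(pageContext: Context): WebView {
        val webView = cache.removeFirstOrNull() ?: return WebView(pageContext)
        (webView.context as MutableContextWrapper).baseContext = pageContext
        return webView
    }
}
```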

High-performance architecture components

1. Thread pool

The thread pool supports business scheduling strategies such as task priority scheduling, control of the total number of threads, and thread avoidance, so that device resources are used fully and reasonably.

The thread queue management module provides 5 priority queues (a minimal executor sketch follows the list):

  • High-priority queue: for UI-related tasks whose results must come back quickly, such as high-priority tasks in the startup phase;
  • Sub-high-priority queue: for tasks that need to return promptly, such as loading business page files;
  • Normal (low-priority) queue: mainly for tasks that do not need to return immediately, such as network requests;
  • Background (lowest-priority) queue: for tasks users will not perceive, such as analytics events and time-consuming IO operations;
  • Main-thread idle queue: for tasks that do not need to run immediately but that the business cannot run on demand; they run only when the main thread is detected to be idle.
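
A minimal sketch of priority-based task scheduling on top of the JDK thread pool, with the queue tiers mapped to the priorities above (the main-thread idle queue is omitted because it is driven by the looper rather than an executor); pool sizes are assumptions.

```kotlin
import java.util.concurrent.PriorityBlockingQueue
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

// Smaller ordinal = higher priority; PriorityBlockingQueue dequeues the smallest element first.
enum class TaskPriority { HIGH, SUB_HIGH, NORMAL, BACKGROUND }

class PriorityTask(val priority: TaskPriority, private val body: () -> Unit) :
    Runnable, Comparable<PriorityTask> {
    override fun run() = body()
    override fun compareTo(other: PriorityTask) = priority.compareTo(other.priority)
}

// Total thread count is capped so low-priority work cannot exhaust device resources.
val executor = ThreadPoolExecutor(
    4, 4, 30L, TimeUnit.SECONDS,
    PriorityBlockingQueue<Runnable>()
)

fun submit(priority: TaskPriority, body: () -> Unit) {
    executor.execute(PriorityTask(priority, body))
}
```
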
2. Network library

Network requests are the most time-consuming part of a scene, and their performance almost determines the first-screen time of the scene. We monitor every link of the network request and optimize each to the extreme. The key points include refined request scheduling, concurrent preprocessing, DNS preloading, connection reuse, and so on (a DNS pre-resolution sketch follows the stage list below).

Schematic of the request link (stages listed below):

  • Queuing: refined request scheduling; thread resources are tiered, high-priority requests have their own dedicated resources, and resources can flow from high to low but not from low to high, so high-priority requests never queue. At the same time, the concurrency of low-priority requests is limited so that excessive concurrency does not preempt the underlying bandwidth;
  • Preprocessing: a series of time-consuming operations such as public parameters, signing, and encryption; preprocessing is converted from serial to parallel, reducing preprocessing time;
  • DNS resolution: keep a whitelist of commonly used domains and pre-resolve them at startup, so resolution takes zero time when actually needed;
  • Connection establishment: with strategies such as HTTP/2 long connections and pre-established connections, connection setup takes almost zero time;
  • Request uplink/downlink: based on body size, intelligently decide whether compression is needed, reducing body size and transmission time;
  • Parsing and callback: for scenes with more complex responses (such as route planning), use a more efficient data protocol format (such as protobuf) to reduce data size and parsing time.
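
A sketch of DNS pre-resolution for a whitelist of frequently used domains; the whitelist, cache shape, and names are assumptions. Resolving at startup means the lookup is effectively free when the first real request is issued.

```kotlin
import java.net.InetAddress
import java.util.concurrent.ConcurrentHashMap
import kotlin.concurrent.thread

object DnsPrefetcher {
    private val cache = ConcurrentHashMap<String, List<InetAddress>>()

    // Pre-resolve whitelisted domains on background threads at startup.
    fun prefetch(whitelist: List<String>) {
        whitelist.forEach { host ->
            thread(name = "dns-prefetch-$host") {
                runCatching { cache[host] = InetAddress.getAllByName(host).toList() }
            }
        }
    }

    // The network library consults the cache first and falls back to a live lookup on a miss.
    fun lookup(host: String): List<InetAddress> =
        cache[host] ?: InetAddress.getAllByName(host).toList()
}
```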

Forward-order long-term control

Optimization is the process of tracing back from the "effect" to the "cause", and solving a problem after it has occurred is reverse-order problem solving. To stop problems at the source of the "cause", and to keep optimizations already made from regressing, our third idea is:

Long-term, continuous forward-order control prevents the existing business from gradually deteriorating and at the same time consolidates the results of the special-project optimization.

Horizontally, the AutoNavi Maps client covers business lines such as travel, search, and ride-hailing; vertically, the architecture spans the business layer, the platform adaptation layer, the cross-end engine layer, and the map engine layer, across multiple language stacks. Performance problems therefore take a long time to follow up and a long chain to investigate. The control approach thus focuses on building standards, processes, and automated platforms and tools.


Standards

After a comprehensive special-project cleanup, the goal of performance control is to prevent continued deterioration. Because of test fluctuations, hotfixes, rapid iteration, and frequent dynamic plug-in releases, there are many outlets that need control; if any area is left uncontrolled, performance will inevitably keep deteriorating. The control standard is therefore defined as a fixed baseline with the period-over-period ratio as the quantitative criterion, and all change factors are brought under control, effectively preventing superimposed deterioration.
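
A minimal sketch of such a control gate, assuming the criterion is "measured time must not exceed the fixed baseline by more than an allowed ratio" (the threshold value is an assumption):

```kotlin
// Pass if the measured first-screen time stays within the allowed regression over the fixed baseline.
fun passesGate(measuredMs: Double, baselineMs: Double, allowedRegression: Double = 0.05): Boolean =
    measuredMs <= baselineMs * (1 + allowedRegression)
```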

Process

The main version process of the AutoNavi client is divided into three major stages: requirements analysis and design, independent business iterative development, and integration testing. The integration testing stage already has a large number of business bugs, so the time left for discovering and fixing performance problems is very tight. To solve this, the control process must make good use of each stage of the main process to eliminate performance problems stage by stage.

  • At requirements analysis and program design, evaluate and identify problems in advance;
  • In the iterative development stage, discover and resolve problems early to reduce the risk that performance problems are exposed late, fixed late, and end up hurting the online user experience;
  • In the integration testing stage, review the data dashboard daily, find problems in time, and rely on the platform and tools to troubleshoot quickly and accelerate the circulation of problems;
  • In the grayscale and release stages, watch the online data dashboard, establish an alerting mechanism, find problems in time, and troubleshoot online problems through user logs.

Platform

Relying on the Titan continuous integration platform and the ATap automated test platform, we built a tool chain that links development, building, performance testing, problem follow-up, troubleshooting, circulation, and resolution across the whole link, improving the efficiency of finding and fixing problems.

  • Titan continuous integration platform

    • Scheduled builds, support for locating-package tasks, and a performance-package build type
    • Automated test triggering, supporting both package-triggered and scheduled triggering
    • Integrated gates and decision-making, integrated display of performance test results, and an integrated decision approval process
  • ATap automated test platform

    • Performance dashboard that aggregates performance data to find problems quickly
    • Instrumentation details, with integrated quick-troubleshooting tools to speed up investigation
    • Problem follow-up, combined with Aone, monitoring the problem-resolution process and accelerating the flow

Summary

So far, the performance experience of AutoNavi's core links has improved greatly, from the initial optimization results to the later increases in optimization efficiency and reductions in optimization cost. To summarize the overall optimization process:

  • Tactically, the combination of "special projects" + "technical accumulation" + "long-term control" ensures that performance experience problems are solved in a healthy way.
  • Strategically, we used to rely on "people" to solve problems; now we rely on "people", "architecture", and "tools". Will "tools" eventually be able to solve or avoid problems on their own? As the tools accumulated through "technical accumulation" and the platforms built for "long-term control" keep growing, we believe it is only a matter of time before quantitative change becomes qualitative change.
