
Text | Wu Chenghui (alias: Qingxiao), Senior Technical Expert at Ant Group

Proofreading | Chen Zhen, Zhuang Xiaodan, Feng Jiachun

This article is 6,250 words; estimated reading time 11 minutes.

Let's first review the problems that upper-level business monitoring faces:

  • Siloed ("chimney-style") monitoring and analysis built over and over

In the technical-risk domain, at one point more than five systems were building their own monitoring and analysis capabilities. The general logic was much the same: pull raw monitoring data, do basic aggregation and processing (period-over-period comparisons, threshold checks, and the like), combine the current scenario's inputs with some algorithmic capability, reach a conclusion, and trigger a series of actions. These chimney-style monitoring and analysis builds were all fairly rudimentary and repetitive, lacking intelligence and systematization.

  • SRE experience cannot be standardized and accumulated

In analysis scenarios there is also the problem that SRE experience cannot be standardized and reused. What is SRE experience? Take a drop in transaction payments as an example. If payments drop by 10%, an SRE usually follows several analysis paths. First, check whether Taobao transactions have fallen (the source), then check transaction creation, cashier-page rendering, and so on; we call this category business dependency. Next, look at the overall picture: are there large-scale stress tests, changes, or other actions in flight? We call this category change association. Finally, deciding on recovery measures such as traffic cut-over or rate limiting based on the current fault is what we call post-analysis operation.

  • Long-tail requirements

For example, common requirements such as computing a success rate across multiple data series or measuring traffic loss. Solving these through productization is very costly, and users cannot quickly modify or customize the related capabilities.

[Figure]

At the same time, the monitoring system itself faces problems:

  • Monitoring data is complicated to use

It spans tens of thousands of data tables, hundreds of thousands of custom data tables, and more than 20 data types, plus cross-site deployment, heterogeneous data sources, and other issues.

  • Monitoring services are not open enough

How to clean and compute dynamic logs, how to store the resulting time series, how to reuse monitoring's time-series detection capabilities, and so on. We need to fully service-orient the monitoring data and capabilities.

  • An efficient programmable coding platform is missing

There is no efficient platform to help SREs first accumulate experience, then drive the standardization and reuse of that experience, and finally productize it.

Based on the above problems, we proposed the idea of MaaS (Monitoring as a Service): offer monitoring capabilities as services, open them up and integrate them into the various technical-risk domains, quickly complete the construction of SRE scenarios, and accumulate reusable capabilities. It mainly covers the following four milestones:

[Figure]

1. Open services: open up monitoring capabilities such as computing, storage, algorithms, and views.

2. Co-build several key technical-risk scenarios, such as change defense, stress-test traffic diversion, lossless fault injection, and localization and emergency response.

3. Promote the standardized accumulation of services, so that more scenarios can reuse and co-build these capabilities.

4. Solve the problem of linking "monitoring" with "control".

[Figure]

(By AIOps here we mostly mean the collection, accumulation, and continuous maintenance and optimization of expert experience.)

PART. 1 Platform Technology Overview

Capability Servicing

Monitoring's service layer consists of several parts overall: data services, configuration services, computing and storage services, algorithm services, notification, cloud-native services, and so on, plus integrations of external capabilities such as messaging and caching.

Let me briefly give two examples to illustrate what servicing means:

Take the view service. When a change is initiated and 10 of the 100 application metrics and 20 business metrics associated with it show problems, we need to dynamically create a data view for those 10 metrics to analyze the scope of business impact. That is where the view service comes in.

Then the dynamic computing service. If I need to compute in real time the interface method call status of 5 machines of a certain application in data center A (or a zone such as GZXXX) between 11:00 and 12:00, the dynamic computing service is what gets scheduled.
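As an illustration, the kind of dynamic computing request described above might be expressed roughly as follows. This is a minimal sketch: `ComputeSpec`, `build_request`, and all field names are hypothetical stand-ins, not the actual MaaS service API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ComputeSpec:
    """Hypothetical description of one dynamic computation."""
    app: str          # target application
    metric: str       # e.g. interface method call status
    hosts: List[str]  # the machines to aggregate over
    zone: str         # e.g. a logical zone like "GZXXX"
    start: str        # window start, "HH:MM"
    end: str          # window end, "HH:MM"


def build_request(spec: ComputeSpec) -> Dict:
    """Translate the spec into a scheduling request payload."""
    return {
        "type": "dynamic-compute",
        "app": spec.app,
        "metric": spec.metric,
        "filters": {"hosts": spec.hosts, "zone": spec.zone},
        "window": [spec.start, spec.end],
    }


spec = ComputeSpec(
    app="trade-core",
    metric="interface_method_calls",
    hosts=[f"host-{i}" for i in range(5)],
    zone="GZXXX",
    start="11:00",
    end="12:00",
)
payload = build_request(spec)
```

The point of the sketch is that the computation is described declaratively and generated on demand, rather than pre-configured for every possible host/zone/time combination.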

[Figure]

Data Servicing

A very important part of capability servicing is data servicing, because data is the basis of all monitoring and analysis. Data servicing is mainly divided into two steps.

  • Step one: model the monitoring data

From the perspective of data usage, we abstract data into virtual data tables, virtual-table column definitions, and column relationships. A virtual table's physical binding can be metric data or relational metadata, and these physical implementations ultimately map to concrete monitoring metrics or underlying metadata services.

  • Step two: service templates

We abstract data services into three categories:

Entity query: for example, querying an application or a Tbase cluster;

Data query: for example, querying an application's Sofa Service data at the APP aggregation dimension;

Relation query: querying the entities related to a given entity, for example the Tbase clusters associated with cif.

  • In the end, we achieved automatic generation of 50,000+ data services with minute-level SDK updates. All monitoring data can now be accessed through a highly semantic interface: as long as the data exists in the monitoring system, it can be reached through the corresponding generated data service, achieving service-oriented openness of all monitoring data.
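To make the three query categories concrete, here is a toy, self-contained sketch in Python. The in-memory tables and function names (`entity_query`, `data_query`, `relation_query`) are illustrative only; the real services are generated from the virtual-table metadata described above.

```python
# Toy backing stores standing in for the real metadata and metric services.
ENTITIES = {
    ("app", "cif"): {"name": "cif", "owner": "sre"},
    ("tbase", "tbase-cluster-1"): {"name": "tbase-cluster-1"},
}
RELATIONS = {("app", "cif"): [("tbase", "tbase-cluster-1")]}
METRICS = {("cif", "sofa_service", "APP"): [("12:00", 0.999), ("12:01", 0.998)]}


def entity_query(kind, name):
    """Entity query: e.g. look up an application or a Tbase cluster."""
    return ENTITIES.get((kind, name))


def data_query(app, metric, dim):
    """Data query: e.g. an app's Sofa Service data at the APP dimension."""
    return METRICS.get((app, metric, dim), [])


def relation_query(kind, name):
    """Relation query: e.g. the Tbase clusters associated with cif."""
    return [entity_query(k, n) for k, n in RELATIONS.get((kind, name), [])]
```

In the real platform these three shapes are instantiated per virtual table, which is what allows 50,000+ services to be generated automatically rather than written by hand.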

[Figure]

R&D efficiency

  • Large-repository management. We designed a code structure that supports service accumulation, permission isolation (code review managed independently per directory), and logic reuse. For example, cache (Tbase) developers and SREs can jointly define cache-problem discovery, analysis, and recovery under the cache directory, while standard services such as container-health and network-anomaly checks can be reused from core-service. The platform also provides consistent multi-environment debugging: monitoring and analysis logic can run directly in a local IDE, because we proxy locally accessed data and function services to the server in dynamic-proxy mode, giving a fully consistent local development experience. This greatly improved R&D efficiency; online data can be debugged directly during development, a very large improvement over traditional log-based debugging, and a basic analysis function can usually be completed within an hour.
  • One-stop release and operations. The traditional serverless release flow roughly goes: packaging the project into a Jar file takes about 1 minute and produces 100-300 MB; building an image from the Jar takes about 10 minutes; releasing the image takes about 20 minutes; so the whole process is about half an hour. We instead built a release and operations model at function granularity, using the MaasFunction as the entry point of a cross-file linker. Starting from the MaasFunction entry, we analyze the files the class depends on (other functions it calls, the services below them, and in turn the util layer they depend on) and extract just that business code fragment. After dynamic compilation the Jar is currently around 500 KB; depending on code size, compilation generally finishes within 5 seconds, and the fragment is then hot-loaded onto the target machine. This achieves 5-second releases and 1-second rollbacks, with supporting logging, monitoring, and other capabilities.
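The cross-file linking step described above is essentially a reachability walk over the file dependency graph: everything reachable from the MaasFunction entry is kept, everything else is dropped. A minimal sketch (the file names and dependency map are invented for illustration):

```python
# Illustrative dependency graph: entry file -> files it depends on.
DEPS = {
    "MaasFunction.java": ["OtherFunction.java", "CacheService.java"],
    "OtherFunction.java": ["StringUtil.java"],
    "CacheService.java": ["StringUtil.java"],
    "StringUtil.java": [],
    "UnusedService.java": ["StringUtil.java"],  # not reachable, excluded
}


def reachable(entry, deps):
    """Transitive closure of files reachable from the entry point."""
    seen, stack = set(), [entry]
    while stack:
        f = stack.pop()
        if f in seen:
            continue
        seen.add(f)
        stack.extend(deps.get(f, []))
    return seen


fragment = reachable("MaasFunction.java", DEPS)
```

Only the files in `fragment` would be compiled into the ~500 KB Jar, which is why it can be built and hot-loaded in seconds.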

[Figure]

Multilingual Runtime

During initial construction, all functions ran in one shared cluster, and we often hit performance problems such as infinite loops, OOMs, and excessive thread creation. Because business code is uncontrollable and code review cannot completely eliminate such problems, we needed a fully isolated function-execution environment. As the figure shows, each function runs in at least 2 independent containers (for disaster tolerance), so an OOM in function 1 will not affect function 2. This gives us function-level isolation at the execution layer, and as measured by SLO it greatly improved platform stability.

When we rolled out function-level isolation at scale, we hit a cost problem. The minimum pod size the underlying Sigma scheduler supports is 0.5c (constrained by the physical NIC, among other things), so with two replicas for disaster tolerance one function occupies at least 1c of physical resources; as function services grew, this cost became hard to sustain. From our observations, most functions actually use less than 10% of a CPU core, often even less. Together with the Serverless PaaS team we therefore built a high-density deployment mode: we isolate five 0.1c containers inside each 0.5c pod and bind function execution to container IP + port. This reduces the resource cost per function from 1c to 0.2c, a factor-of-5 reduction.
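The cost arithmetic can be checked in a few lines, assuming the figures quoted above (0.5c pods, five 0.1c containers per pod, two replicas per function):

```python
def cost_per_function(replicas=2, container_cores=0.1):
    """Cores consumed by one function in high-density mode."""
    return replicas * container_cores


# Before: 2 replicas x one dedicated 0.5c pod each.
dedicated = 2 * 0.5          # = 1.0c
# After: 2 replicas x one 0.1c container each (5 containers share a pod).
high_density = cost_per_function()  # = 0.2c
```

This matches the stated reduction from 1c to 0.2c, i.e. 1/5 of the original cost.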

[Figure]

Recently we also added Python support. The figure below shows a simple code demo.

[Figure]

PART. 2 Application Scenarios

Change monitoring

At Ant, change-induced failures account for one third of all failures, and monitoring is a very important part of the change process, but it currently faces many problems. When a change is initiated, the system automatically generates core system and business metrics based on the change's scope, such as application GC, CPU, service success rate, and error data; this solves the first problem, not knowing which metrics to watch during a change. Our MaaS computing service then dynamically generates a set of real-time metric computations based on the change scope (before change, after change, change group, non-change group, and so on); because the computation is generated dynamically from the change scope, this solves the problem of coarse monitoring granularity. Once the data is produced, our anomaly detection algorithms determine whether any metric is abnormal during the change, solving the problem of not knowing how to read the metrics. We use function orchestration to wire up the core change process, and the whole flow requires no user involvement. If change monitoring finds an anomaly, it automatically triggers interception and blocking.

  • Effective: 163 real faults were intercepted over the year, 75 of them by change monitoring, accounting for 46.01% of all interceptions.
  • General: the core defense services for change monitoring are unified and cover 87.74% of the site's main operations scenarios.
  • Scale: 12,000 change-defense runs and 300,000 dynamic monitoring creations per day.
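To make the change-group/non-change-group comparison concrete, here is a hedged sketch: a simple relative-drop rule stands in for the real anomaly detection algorithm, and all names and thresholds are illustrative.

```python
def detect_change_anomaly(before, after, control_before, control_after,
                          drop_threshold=0.1):
    """Flag an anomaly when the changed group drops materially more
    than the unchanged (control) group over the same window."""
    def avg(xs):
        return sum(xs) / len(xs)

    change_drop = (avg(before) - avg(after)) / avg(before)
    control_drop = (avg(control_before) - avg(control_after)) / avg(control_before)
    # Subtracting the control group's drop removes site-wide effects
    # (e.g. a traffic dip that hits changed and unchanged machines alike).
    return (change_drop - control_drop) > drop_threshold


anomalous = detect_change_anomaly(
    before=[100, 102, 98], after=[70, 72, 69],          # changed machines fell ~30%
    control_before=[100, 101, 99], control_after=[99, 100, 98],
)
```

The value of the four-way split (before/after x change/control) is exactly this baseline subtraction: a drop shared by both groups is ambient, while a drop unique to the change group points at the change.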

[Figure]

Business link alarm analysis

Under normal circumstances, the noise ratio of business alarms reaches 66%, which is very disruptive to R&D and SRE staff day to day and not conducive to emergency response and stability work. SREs actually have a lot of experience judging noise, but the noise-reduction logic cannot be customized, which makes noise reduction very difficult.

Through in-depth co-construction with the international SRE team, we built a set of alarm-analysis services. The process can be briefly summarized as:

  1. Build the business chain and pull the data
  2. Run anomaly detection
  3. Perform localization analysis
  4. Push the exceptions and build analysis views
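The four steps above can be sketched as a small function pipeline. Every step here is a stub with made-up data; it only illustrates the flow, not the real detection or localization logic.

```python
def build_chain(alert):
    """Step 1: resolve the business chain for the alerting metric."""
    return {"alert": alert, "chain": ["taobao", "trade", "cashier"]}


def pull_data(ctx):
    """Step 1 (cont.): pull a series per node on the chain (fake data)."""
    ctx["series"] = {n: [100, 99, 55] for n in ctx["chain"]}
    return ctx


def detect(ctx):
    """Step 2: flag nodes whose latest value dropped more than 20%."""
    ctx["anomalies"] = [n for n, s in ctx["series"].items() if s[-1] < 0.8 * s[0]]
    return ctx


def locate(ctx):
    """Step 3: the first anomalous node on the chain is the suspected source."""
    ctx["root"] = next((n for n in ctx["chain"] if n in ctx["anomalies"]), None)
    return ctx


def analyze(alert):
    ctx = build_chain(alert)
    for step in (pull_data, detect, locate):
        ctx = step(ctx)
    return ctx  # step 4 would push ctx and build the analysis view


result = analyze({"metric": "trade_pay", "drop": 0.1})
```

The real service replaces each stub with the corresponding MaaS data service, anomaly detector, and view builder, but the orchestration shape is the same.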

Through this scenario you can see what SRE experience consists of: what counts as a source-level decline, how to handle small traffic, how to handle normal service loss within a business process, how to do business analysis and localization, and so on.

  • The alarm effectiveness rate increased from 34.68% to 91.15%.
  • Alarm coverage of international GOC P1 and P2 services reached 85%, and in S2 it was extended at scale to the entire site.
  • It took SRE only 2 weeks to go from the overall idea to the first PoC implementation.

Configuration coding

This scenario mainly solves the efficiency problem of batch coverage, for example building simulation sites or automatically covering application monitoring rules. We expose these capabilities as services to improve efficiency. Take application monitoring-rule coverage as a simple example: the code first defines a rule template, then performs a series of configuration substitutions, and finally applies the result at scale in batches. During simulation-environment construction, thresholds were adjusted in real time according to traffic with zero manual configuration; for the Chongqing Xiaojin site build-out, 5 person-days of configuration work were reduced to zero manual monitoring configuration. Automatic application-rule coverage was likewise reduced from dozens of person-days during the 527 exercise in previous years to one hour.
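The template-then-batch-cover pattern might look roughly like this. The rule fields (`metric`, `threshold`, and so on) are hypothetical, not the actual monitoring-rule schema.

```python
# Hypothetical monitoring-rule template; None fields are filled per app.
RULE_TEMPLATE = {
    "metric": "error_rate",
    "comparison": ">",
    "threshold": None,
    "app": None,
}


def render_rules(apps, thresholds):
    """Fill the template once per application and return the batch."""
    rules = []
    for app in apps:
        rule = dict(RULE_TEMPLATE)
        rule["app"] = app
        rule["threshold"] = thresholds.get(app, 0.01)  # default: 1%
        rules.append(rule)
    return rules


rules = render_rules(["trade", "pay", "cashier"], {"pay": 0.005})
```

A batch of rules produced this way would then be pushed through the configuration service, which is what turns person-days of clicking into one run of a script.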

(Coding cases from middleware SREs:)

  • Automated application-rule coverage:

Site-wide application-rule coverage reduced to 1 hour; zero manual configuration for the 527 international application alarms.

  • xx business site:

Monitoring for 10+ middleware products reduced to zero manual configuration.

  • Simulation environment construction:

Simulation rules automatically covered, thresholds dynamically adjusted, zero manual configuration.

[Figure]

In follow-up articles we will introduce in detail how to use MaaS for more technical-risk capability building (stay tuned), so I won't elaborate here.

PART. 3 Closing Thoughts

MaaS's future plans fall into three parts:

  • Platform capability building: replicate monitoring's service-orientation path to more technical-risk scenarios, with dynamic elasticity and standardized R&D processes.
  • Coding ecosystem construction: community building; a service catalog for the monitoring and analysis domain (for example, how to analyze container CPU, or what the hotspot methods are); function services and an application marketplace, so that more SREs can collaborate on continuous construction within a domain and the capabilities can ultimately be operated as a community.
  • Monitoring + X (crossing technical-risk domains) exploration. From our two years of experience: monitoring + change = change monitoring, monitoring + drills = lossless injection, monitoring + stress testing = performance analysis, and so on. We have seen qualitative changes in these areas and hope to uncover more such scenarios, for example monitoring + rate limiting + capacity to build fine-grained rate-limiting capabilities.

[Figure]

As shown in the picture above, the MaaS logo looks like a house built of wooden blocks, symbolizing our hope that businesses can be assembled quickly like building blocks. The M is also two small triangles; triangles symbolize stability, and we hope the MaaS platform can very stably support the businesses built on top of it. A paperclip shape is also integrated into the M, symbolizing connection.

The platform's goal and dream: one day, analyzing the operating system will no longer require finding someone in the systems department, analyzing middleware will no longer depend on middleware engineers, and analyzing the cache will no longer mean waiting for the caching SRE expert at Alipay to come back with a conclusion. If a process cannot start or the configuration center cannot be reached, it can be analyzed with one-click online code diagnosis instead of lengthy, inefficient, word-of-mouth documents, so that technical-risk knowledge can truly be shared through code.

We dream that the entire technical-risk capability stack can be automated through the MaaS platform: one day, second-level stress-test circuit breaking, change defense, fine-grained traffic allocation, contingency decision-making, self-healing, and so on can be truly automatic, truly unattended.

- END -

We started MaaS after Double Eleven in 2019, and nearly two years have flashed by. So far, MaaS has close to 200 active developers, more than 1,000 function services, 1,000 function calls per day, and nearly 20 billion data points analyzed. Our MaaSHub (community product) and multi-language support are also iterating rapidly.

  • Skepticism

When MaaS was first being promoted (and to some extent still today), why would anyone use the platform? Most engineers were skeptical. The most common questions: Why should my code live on your platform? Your platform isn't mature yet, and my business is very important. I just want to pull some data; why do I have to learn something this complicated? That period was difficult, but through our guidance and support, users can now develop all kinds of functions on the platform by themselves, and they have begun to promote it spontaneously. It isn't that our platform is inherently great; rather, we have solid monitoring data, computing, storage, and algorithm capabilities, with the entire 40-person Antmonitor team as our strongest backing.

  • Customer acquisition cost

A coded product is naturally different from a productized one. In the early days, when service capabilities were lacking, converting and retaining a single user cost at least 2 person-weeks of R&D effort; you can imagine how much such a high cost made us cherish each user. For the existing 200+ users, requests basically go from development to pre-release the same day and go live the next. Only when users feel secure and trusting will they stay.

  • Platform ecology

The word "ecosystem" is rather grand and rather empty. In concrete terms, building a MaaS ecosystem requires answering the following questions:

  1. How do we let users' code gradually turn into accumulated platform assets? When will the cache diagnostic code be standardized and quickly reusable (caching is a relatively mature domain we have long cooperated on)? This is hard but necessary; at minimum we need a mechanism that makes this path possible and simple.
  2. How do the platform's accumulated assets in turn drive its users forward? The platform has many analysis capabilities; if they are quickly surfaced to users, users will discover and adopt them, making it easier and faster to build scenarios.
  3. How do we reuse user scenarios? Is the problem a user solved a special case or a general one? If it is general, can the user's scenario be reused directly so that the general solution's logic is reused?

Of course, the whole service-orientation process is tedious and unglamorous. In the future we want to leap from monitoring to technical risk as a whole, so service-orienting more technical-risk platforms is the only way forward. As for how to quickly integrate external systems' service capabilities, we are also exploring and have achieved some results.

  • Production practice of the large-repository mode & the FaaS mode

Why a large repository? It mainly solves the problem of code transparency, and the goal of transparency is to promote code accumulation and higher-level reuse. We therefore abandoned the one-function-one-repo model of traditional FaaS platforms: the primary goal is the accumulation and reuse of code, with an efficient function-development platform second.

The FaaS model is still ahead of its time. The vast majority of engineers are used to the traditional project model, with third-party packages, TRs, and reflection everywhere, which does not fit the function development model well. Along the way we did a lot of compatibility work so that functions can use the same facilities as traditional applications. But today, with Serverless not yet mainstream, the future of FaaS still needs more exploration and practical innovation.

  • Analytical DSL

We spent great effort trying to design our own analysis DSL. The figure below shows an analysis case for an abnormal Alipay transaction:

[Figure]

Although we always wanted to raise the level of abstraction of the analysis as much as possible, you can see that many details are unavoidably exposed (detection thresholds, time, comparison operators, return-value types, and so on). Once the details are exposed, the result deviates greatly from our original intention.

[Figure]

The reason the DSL could not land is really a matter of understanding the hierarchy in the figure above. A DSL is not suited to describing details; its strength is limited expressiveness. What we really lack right now is the foundation of the building (the first, second, and third floors), and we are still far from the DSL level.

  • Coding vs. Productization

Productization is the end point of coding; the two are not contradictory.

Especially for the various analysis capabilities in the monitoring field, rushing to build analysis and localization products before the underlying analysis capability exists is very dangerous, and we have suffered for it. Our current path is to first implement the analysis capability in code, then abstract the coded function service into a Controller, expose the necessary parameters as product inputs, and iterate backwards from there until it becomes a stable and powerful analysis product.
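One way to picture the function-to-Controller step: register a coded analysis function together with the parameters that become product inputs, then have the product layer invoke it only through that declared interface. All names here (`controller`, `invoke`, the registry) are illustrative, not the real platform API.

```python
REGISTRY = {}


def controller(name, params):
    """Register an analysis function as a product-facing controller,
    declaring which parameters become product inputs."""
    def wrap(fn):
        REGISTRY[name] = {"fn": fn, "params": params}
        return fn
    return wrap


@controller("cpu-analysis", params=["app", "window_minutes"])
def analyze_cpu(app, window_minutes):
    # Stand-in for a real coded analysis capability.
    return f"analyzing CPU of {app} over the last {window_minutes} min"


def invoke(name, **inputs):
    """Product layer: validate the declared inputs, then run the function."""
    entry = REGISTRY[name]
    missing = [p for p in entry["params"] if p not in inputs]
    if missing:
        raise ValueError(f"missing product inputs: {missing}")
    return entry["fn"](**inputs)


out = invoke("cpu-analysis", app="trade-core", window_minutes=30)
```

The declared `params` list is what the product UI would render as inputs, so the coded capability hardens into a product interface without rewriting the analysis itself.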

[Figure]


SOFAStack

SOFAStack™ (Scalable Open Financial Architecture Stack) is a set of middleware for quickly building financial-grade distributed architectures, and a collection of best practices honed in financial scenarios.