Outages of the Federal Reserve's payment system, Amazon's cloud services, and domestic Internet platforms such as Bilibili... In recent years, outages have occurred frequently around the world, and system stability has gradually become a focus of the industry.

As Internet services become deeply integrated into production and daily life, software has to meet increasingly diverse needs, which inevitably means scaling up systems and introducing emerging technology architectures. The complexity of information systems is rising rapidly, and all of this makes system stability harder to guarantee. Ma Pengwei of the Institute of Cloud Computing and Big Data at the China Academy of Information and Communications Technology believes it is an inevitable trend that information system stability has become a focus of the industry.

To help all industries achieve more efficient operation and maintenance and provide comprehensive stability assurance for their business, Ant Digital recently released the Business-Intelligent Observability Service (BOS). The product gives heterogeneous applications, both on and off the cloud, out-of-the-box intelligent observability, helping enterprises improve operation and maintenance efficiency by more than 3 times. At the product launch, Ma Hengyang, a product expert at Ant Digital, gave a comprehensive overview of the challenges of traditional IT operation and maintenance and of the product's capabilities.

Four challenges faced by traditional IT operation and maintenance

Industries across the board are undergoing digital transformation, including DevOps adoption, distributed architectures, and containerization. Alongside the benefits of digitalization and cloud migration, complex business scenarios and large user bases bring new challenges and greater risk to operation and maintenance. The main challenges are as follows:

1. Lack of business-oriented digital operation and maintenance: Most enterprises today operate and maintain their systems from the perspective of applications or resources, and lack the ability to do so from a business perspective. Enterprise business scenarios are also complex: mobile banking and WeChat banking for users, finance and HR systems for internal employees, and open platforms for partners. Traditional operation and maintenance methods cannot quantify or visualize these scenarios, let alone map the business to the application systems behind it;

2. Low link coverage: About 40% of user experience failures are caused by the client itself, and about 60% by the client's calls to servers or middleware. What used to be a single-point request has become a long call chain, and any request may pass through multiple heterogeneous nodes such as client, server, and middleware. When a fault occurs, operation and maintenance personnel cannot quickly tell which hop in the call chain is abnormal or has become a performance bottleneck;

3. O&M products are numerous and fragmented: Many companies have purchased or built a variety of monitoring products for applications, middleware, basic resources, and so on. These products are used by different departments, which have also built log, link-tracing, and other operation and maintenance tools. Even so, information still has to be collected manually when a fault occurs, resulting in a long troubleshooting cycle. For example, an application failure may be caused by an abnormality on the virtual machine it runs on, but the two monitoring platforms each raise their own alarm and the alarms cannot be automatically correlated;

4. No unified standard for operation and maintenance data: Without one, massive volumes of operation and maintenance data cannot be correlated across dimensions, cannot support upper-layer observability and intelligent operation and maintenance capabilities, and cannot be analyzed or mined.

These four challenges leave operation and maintenance personnel caught in a storm of alarms every day yet unable to pinpoint faults accurately. Production incidents occur frequently, but there are no good means of observation or emergency response. And every fault requires coordination among multiple parties, such as business, application R&D, and operation and maintenance, which is both inefficient and costly.
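To make the alarm-correlation problem from challenge 3 concrete, here is a minimal Python sketch. It assumes a hypothetical application-to-virtual-machine mapping (`APP_TO_VM`) and made-up platform names. It groups alarms that refer to the same machine within a short time window, and it illustrates the idea only, not any particular product's logic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str      # monitoring platform that raised the alarm (hypothetical names)
    target: str      # application or virtual machine the alarm refers to
    timestamp: int   # seconds since epoch
    message: str


# Hypothetical CMDB-style mapping: application -> virtual machine it runs on.
APP_TO_VM = {"payment-app": "vm-17", "transfer-app": "vm-22"}


def correlate(alarms, window_seconds=120):
    """Group alarms that resolve to the same VM and fire close together in time."""
    by_vm = defaultdict(list)
    for alarm in alarms:
        # Application alarms are resolved to their host; VM alarms map to themselves.
        vm = APP_TO_VM.get(alarm.target, alarm.target)
        by_vm[vm].append(alarm)

    groups = []
    for vm, items in by_vm.items():
        items.sort(key=lambda a: a.timestamp)
        current = [items[0]]
        for alarm in items[1:]:
            if alarm.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alarm)
            else:
                groups.append((vm, current))
                current = [alarm]
        groups.append((vm, current))
    return groups


alarms = [
    Alarm("app-monitor", "payment-app", 1000, "error rate above threshold"),
    Alarm("infra-monitor", "vm-17", 1030, "CPU saturation"),
    Alarm("infra-monitor", "vm-22", 5000, "disk almost full"),
]
groups = correlate(alarms)
for vm, related in groups:
    print(vm, [a.message for a in related])
```

With this sample data, the application alarm and the virtual machine alarm end up in one group, which is the association the two separate platforms cannot make on their own.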

How can these challenges be met? The traditional approach centers on monitoring: finding fault points in basic resources, with operation and maintenance personnel as the main users. In recent years, with the rise of cloud native, the concept and technology of observability have developed and spread. Observability offers multiple ways to observe an application system, such as metrics, links (traces), and logs, making it easier to find the root cause of failures, turn the system into a white box, and see what is happening inside it. Its users have also expanded from operation and maintenance to application development.

But for complex business scenarios, this is far from enough. Ant Group's business scenarios are complex, and each business flow passes through many application systems, so knowing what is happening inside the business becomes very important. Ant has distilled its experience into business scenario visualization and business semantics for data, which make it possible to map business to applications. When a service becomes abnormal, intelligent observation techniques can then be used to locate and recover from the fault quickly.

Five capabilities of the Business-Intelligent Observability Service

The Business-Intelligent Observability Service (BOS) is an operation and maintenance platform built on Ant's large-scale practice in technical risk prevention and control. It combines integrated data analysis with experience from large-scale practice, visualizes business scenarios and gives data business semantics, provides out-of-the-box intelligent observability for heterogeneous applications on and off the cloud, and delivers all-round business stability assurance, building a new paradigm for business observation and putting stability on a stronger footing.

BOS offers the following core values:

Core value 1: Digital business operation and maintenance <br> Ant has hundreds of business domains, with diverse business types, a large number of business scenarios, and high business volume. Business anomalies such as traffic drops, sudden traffic surges, and traffic failures therefore need to be detected at all times, and fast diagnosis must be available when an anomaly occurs. To that end, observation data such as links, logs, and metrics are aggregated according to a business scenario model, providing digital business operation and maintenance capabilities:

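By fusing business link and log data and adding business dependency trajectories, a multi-stage business model can be built, for example for the transaction business (transaction created -> transaction paid -> payment succeeded). Business owners, R&D, and operation and maintenance personnel can all become familiar with the business flow through visualization, automatically perceive upstream and downstream business dependencies, and define faults and align emergency response according to business impact;
By fusing link and log data and adding business semantic behaviors, the data can be automatically aggregated into a dependency link for a single business. Take the payment action: what do the request call dependencies of the payment business look like on the server side? When the payment business is impaired, the corresponding business link can be viewed to identify abnormal nodes such as application services and middleware, automatically mapping business anomalies to application anomalies;
By fusing metric and log data and adding business semantic dimensions, rich business metrics such as transaction volume and transfer rate can be flexibly defined, and holographic observability can be used to discover and locate faults quickly, providing business continuity assurance.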
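As a rough illustration of the multi-stage business model (transaction created -> transaction paid -> payment succeeded), the Python sketch below derives per-stage counts and stage-to-stage conversion from log-like events. The stage names, the `(trade_id, stage)` event shape, and the function names are assumptions made for this example, not the product's data model.

```python
# Hypothetical stage names for the transaction business, in flow order.
STAGES = ["trade_created", "trade_paid", "payment_succeeded"]


def stage_counts(events):
    """Count the distinct trades that reached each stage."""
    reached = {stage: set() for stage in STAGES}
    for trade_id, stage in events:
        if stage in reached:
            reached[stage].add(trade_id)
    return {stage: len(ids) for stage, ids in reached.items()}


def conversion_rates(counts):
    """Conversion from each stage to the next; None if the earlier stage is empty."""
    rates = {}
    for prev, nxt in zip(STAGES, STAGES[1:]):
        rates[f"{prev}->{nxt}"] = counts[nxt] / counts[prev] if counts[prev] else None
    return rates


# Log-like events: (trade id, stage reached).
events = [
    ("t1", "trade_created"), ("t1", "trade_paid"), ("t1", "payment_succeeded"),
    ("t2", "trade_created"), ("t2", "trade_paid"),
    ("t3", "trade_created"),
    ("t4", "trade_created"),
]
counts = stage_counts(events)
rates = conversion_rates(counts)
print(counts)
print(rates)
```

A sudden fall in one stage's conversion while the stage before it stays healthy points at the part of the flow, and therefore the applications behind it, that should be examined first.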

Digital business operation and maintenance handles fault location, emergency response, and system visualization from a business perspective, but it depends on observability capabilities and data being in place. We provide a complete mechanism for measuring location adequacy, which gauges the completeness of the underlying observability data. Businesses are also sorted by priority and importance to achieve wider coverage, so that every business on the platform can be seen at a glance.

Core value 2: Holographic observability and fault location <br> End-to-end full-link observation: BOS provides distributed full-link tracing from client to server to middleware, and uses link diagrams, topology diagrams, and sequence diagrams to identify and pin down abnormal points and performance bottlenecks in link calls. For client applications, it provides crash analysis that monitors app failures such as crashes, freezes, and lag, and promptly reports the corresponding memory stack information to help locate problems. It also provides client monitoring capabilities such as startup analysis, network analysis, power analysis, memory analysis, H5 performance analysis, and mini program analysis;

Rich server-side performance monitoring: BOS visualizes many aspects of how an application runs, such as application service interfaces, resource usage, JVM runtime, and port liveness. Fine-grained observation data is aggregated along dimensions such as single service, data center, unit, and application, so metrics can be drilled down layer by layer. Metrics, links, and logs are also genuinely fused: from an error count you can view the corresponding error log statistics, from metrics such as slow interfaces and slow SQL you can query the corresponding link details, and within a single link you can view the application's runtime metrics and associated logs;

Performance diagnosis and analysis: BOS provides CPU snapshot analysis, memory snapshot analysis, thread analysis, and exception analysis, which reconstruct the code execution process and help quickly locate program failures caused by threads and stacks. In addition, Arthas, a powerful tool for diagnosing online problems in Java, uses bytecode enhancement so that you can view the running state of a program without restarting the JVM process;

Fault location and self-healing: Related alarms and abnormal events are aggregated by risk dimension, and single-application diagnosis, link diagnosis, dependency diagnosis, and fault decision analysis help locate the fault point quickly. For known risk events, a preconfigured risk plan can be triggered automatically, giving faults a self-healing capability;

Application security governance: Based on instrumentation, security policies are injected into the application runtime environment, giving the application the ability to resist network attacks from the black and gray markets. RASP security technology lets an application detect attacks and protect itself at runtime, with an attack interception rate as high as 98.7% and RT < 1 ms. When a service becomes abnormal, service governance can be implemented on top of ServiceMesh, and sidecar nodes can be observed in links and monitoring to keep the sidecar stable and avoid business impact, with the interface showing a rich, fused view of observation data. Finally, BOS connects with the application change process to observe change traffic in real time. In this way it covers prevention, governance, and change across the entire application life cycle.

Core value 3: Integrated data analysis <br> In addition to its own rich observability data collection, BOS can integrate data from third-party systems, which is reported into the data model using the open source OpenTelemetry standard protocol. The data is preprocessed, goes through secondary computation, and is stored in a highly reliable database.

BOS also connects to the metadata center or CMDB of third-party operation and maintenance and change platforms, converts heterogeneous metadata into unified technical risk metadata, and aggregates it into different impact models for different business fault-location scenarios, such as system dependency impact, business link impact, and customer asset impact. Time series data is then integrated onto these impact surface models to build a real-time technical risk data middle platform, so that upper-layer observability capabilities are truly decoupled from the underlying heterogeneous data sources.
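One way to picture a system dependency impact model is as a reverse traversal of a dependency graph. The Python sketch below assumes a hypothetical unified metadata table (`DEPENDS_ON`) with invented system names. It only illustrates the idea of an impact surface and does not reflect how the platform stores or computes it.

```python
from collections import deque

# Hypothetical unified metadata: each system lists what it depends on.
DEPENDS_ON = {
    "mobile-banking": ["payment-app"],
    "payment-app": ["account-service", "message-queue"],
    "account-service": ["core-db"],
    "hr-system": ["core-db"],
}


def impacted_by(failed_node, depends_on):
    """Return every system that directly or indirectly depends on failed_node."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    impacted = set()
    queue = deque([failed_node])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


print(sorted(impacted_by("core-db", DEPENDS_ON)))
print(sorted(impacted_by("message-queue", DEPENDS_ON)))
```

Attaching time series data (error rates, latency) to the nodes returned by such a traversal is what turns a static dependency map into a live view of a fault's impact.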

The purpose of integrated data analysis is not only to manage data in a unified way, but also to correlate it in support of operation and maintenance across technical risk scenarios, such as fault diagnosis, root cause analysis, and roll-up and drill-down. This helps answer questions such as where a business decline originates and what proportion of a service is impaired. For example, when a business becomes abnormal, BOS detects business-related changes, diagnoses business-related applications, and analyzes application dependencies. The abnormal points are aggregated and pushed to emergency responders so that they can grasp the impact of the fault and make emergency decisions in the shortest possible time, ultimately achieving Ant's 1-5-10 goal for technical risk emergency response (detect the anomaly within 1 minute, locate the problem within 5 minutes, and recover from the fault within 10 minutes).
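The 1-5-10 goal can be checked against an incident timeline with simple arithmetic. The sketch below assumes that all three limits are measured from the moment the fault starts, which the text does not spell out. The `Incident` fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All timestamps are in seconds on the same clock (hypothetical fields).
    started_at: float
    detected_at: float
    located_at: float
    recovered_at: float


# Limits in minutes. Assumption: each limit is measured from the fault start.
TARGETS = {"detect": 1, "locate": 5, "recover": 10}


def check_1_5_10(incident):
    """Report, per phase, whether the incident met its 1-5-10 limit."""
    elapsed_minutes = {
        "detect": (incident.detected_at - incident.started_at) / 60,
        "locate": (incident.located_at - incident.started_at) / 60,
        "recover": (incident.recovered_at - incident.started_at) / 60,
    }
    return {phase: elapsed_minutes[phase] <= limit for phase, limit in TARGETS.items()}


# Detected after 45 s, located after 280 s, recovered after 700 s.
result = check_1_5_10(Incident(0, 45, 280, 700))
print(result)
```

In this example, detection and location are within their limits, while recovery at 700 seconds (about 11.7 minutes) misses the 10-minute target.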

Core value 4: Intelligent scenario-based prevention and control <br> Ant has explored AIOps algorithms and tools extensively and has built up a complete algorithm capability platform covering the deployment, training, regression, and decision-making of intelligent algorithms. Combined with the alarm module and working on time series data, it can flag previously unseen spikes as business anomalies, including sudden rises and drops, slow rises and declines, drops to zero, long-term trend anomalies, and frequency anomalies. It also explains in detail why a given point did not raise an alarm, for example because of year-on-year filtering, period-over-period filtering, or filtering of series that rise and fall together. Accuracy is stable at above 90%, and abnormal fluctuations of more than 5% in either direction can be identified. Intelligent scenario-based prevention and control helps more enterprises automate operation and maintenance and reduce its labor cost.
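The kinds of anomaly and filter named above can be illustrated with a deliberately simple threshold rule. This is not Ant's algorithm, which the text describes as a trained AIOps platform. The function name, the parameters, and the reuse of the 5% figure as a threshold are assumptions for this sketch, which also shows how a same-period filter can provide the reason a point does not raise an alarm.

```python
def classify_point(previous, current, same_period_previous, same_period_current,
                   threshold=0.05):
    """Classify one point of a time series and give the reason.

    previous / current: the last two values of the live series.
    same_period_previous / same_period_current: the values at the same
    positions in a reference period (for example, the same time yesterday).
    """
    if previous > 0 and current == 0:
        return ("zero_drop", "value fell to zero")
    if previous == 0:
        return ("normal", "no baseline to compare against")

    change = (current - previous) / previous
    if abs(change) <= threshold:
        return ("normal", "change within threshold")

    # Same-period filter: suppress the alarm if the reference period
    # moved by roughly the same relative amount at this point.
    if same_period_previous > 0:
        expected = (same_period_current - same_period_previous) / same_period_previous
        if abs(change - expected) <= threshold:
            return ("normal", "filtered: same-period data shows a similar change")

    label = "sudden_rise" if change > 0 else "sudden_drop"
    return (label, f"changed by {change:+.1%}")


print(classify_point(100, 80, 100, 100))   # live drop with a flat reference period
print(classify_point(100, 80, 100, 81))    # the same drop also happened in the reference
print(classify_point(100, 0, 100, 100))    # traffic fell to zero
```

Returning the reason together with the label mirrors the idea of explaining why a point was or was not alarmed, which matters when operators need to trust automated suppression.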

Core value 5: Proven in large-scale 11.11 practice <br> BOS is the eyes of production safety and stability assurance, so its own stability is extremely important. The BOS framework can scale out quickly and elastically for observation targets of different magnitudes. All components are developed in-house by Ant, which provides strong technical assurance. The platform delivers very high performance in collection, computing, and storage, and supports multi-site, multi-center disaster recovery deployment, achieving four-nines financial-grade emergency disaster recovery to handle large-scale scenarios and keep business uninterrupted.

Open and compatible with various heterogeneous applications

Today, more and more enterprises are building observability systems and product capabilities, because observability can give an enterprise's different departments and staff a greater competitive advantage.

For operation and maintenance engineers and R&D engineers, holographic observability spans the whole process of business design, R&D, operations, and operation and maintenance: end-to-end full-link visualization locates call bottlenecks, and one-stop application observation quickly diagnoses the root cause of faults. For project managers and architects, faults can be defined by business impact and emergency response coordinated across departments; multi-view business scenarios, topologies, links, and dashboards map business to systems and break down data silos, making business-oriented operations possible. For the enterprise as a whole, production failures are reduced and security and stability are better assured, achieving the goal of production safety.

BOS will be offered externally in a more open and compatible form. A full set of business observation services is available on Alibaba Cloud public cloud, where it can be used together with the SOFAStack financial-grade cloud-native distributed solution and other Alibaba Cloud products to take better advantage of cloud native. It also supports private hybrid cloud delivery. It can currently be deployed in heterogeneous environments such as Alibaba Cloud Apsara, VMware virtual machines, Kubernetes containers, and OpenStack, supports localized (domestic) architectures, and has obtained Xinchuang certification.

Today, distributed and containerized applications account for only part of enterprise systems. Most application systems are still off the cloud, running on classic virtual machines, and these core systems face the same O&M challenges described above. BOS provides out-of-the-box business observation capabilities for application systems written in heterogeneous languages and built on heterogeneous technology stacks, so that off-cloud applications can also benefit from observability technology.

Enterprises that have already experimented with observability using open source products such as SkyWalking, Prometheus, and EFK are also covered, because BOS is compatible with them. It can collect link data reported by open source tracing products, collect monitoring metrics produced via the Prometheus protocol, and query raw logs in ES, allowing application systems to migrate to BOS seamlessly and at no cost.

In addition, Ant Digital provides SRE consulting and configuration services. Through Ant's SRE consulting service, in-depth research is carried out to understand the current state of an enterprise's operation and maintenance. Combined with Ant's technical risk practice, this yields a consulting report on how the enterprise should develop its own operation and maintenance and build its SRE system. Beyond the BOS product itself, Ant Digital also provides business configuration services, building a business showcase around the enterprise's pilot applications that includes business scenarios, business metrics, business dashboards, inspection scripts, fault diagnosis trees, and contingency plans. Enterprise staff are trained during the configuration process, which leads to better product adoption and leaves the enterprise genuinely autonomous and in control.

At present, many state-owned banks, joint-stock banks, city commercial banks, rural credit cooperatives, and insurance companies in the financial industry use BOS. Examples include observability and fault diagnosis and self-healing at Bank of Ningbo, and the hybrid cloud unified observation platform of China Property & Casualty Insurance.

Yu Bin, general manager of Ant Group's Digital Industry Division, said: "In the future, Ant Digital will provide a richer product portfolio and work with ecosystem partners to serve the digital transformation of more enterprises, so that products, technologies, and services bring more value to their business."

Ant Digital is the technology business segment of Ant Group. It is committed to continuously opening up Ant Group's core technical capabilities in blockchain, artificial intelligence, cloud computing, security technology, and other fields, contributing to the digital upgrade of small and medium-sized financial institutions, the digital transformation of small and micro business operations, digital collaboration across industrial chains, and cross-border services.


Ant Technology

The official technology account of Ant Group, sharing Ant's cutting-edge technology innovation and exploration.