Alibaba Cloud Solution Architect Zhang Ping: System Construction of Cloud-Native Digital Safety Production

Today's topic is "safety production". The content is divided into three parts:

The first part covers the background of safety production and our understanding of the field;
The second part introduces how safety production work is carried out within Alibaba Group, as a reference for you;
The third part is the overall safety production solution we have distilled, to help you implement safety production in your own enterprise or environment.

Digital safety production background

When it comes to safety production, we first need to look at the general industry background. As my colleagues have just mentioned, all walks of life are now carrying out digital transformation and upgrading of their businesses. Businesses are moving to the cloud and going online, and application architectures are being transformed into cloud-native ones. Once each business goes online, we find that the traditional safety production concept and management model also need to become online and digital.

As online systems become more and more complex, business failures cannot be avoided, and a failure has a huge impact on the enterprise. Improving the ability to locate, handle, and recover from failures is the most important goal of safety production at this stage. In the process of digital transformation and upgrading, each of us should think about how to build a digital safety production system at the same time.

Safety production challenges from a business perspective

Recent failures show that business failures happen not only to ordinary enterprises but also to large domestic and foreign Internet companies that have invested heavily in safety production. After a failure occurs, business interruption, economic loss, and the impact of public opinion all bring great challenges. How can we help you build a safety production system? That is the core theme of our discussion today.

Within Alibaba Group, after more than ten years of exploration, we have accumulated a series of products and services, as well as a methodology for building safety production. Our conclusion is to take "high availability and stability prevail" as the guiding principle for facing safety production challenges on the business side.

What is digital safety production

Today we are talking about digital safety production. The safety production that first comes to mind is probably still the traditional kind: in factories, workshops, coal mines, or construction sites, we often see slogans, posters, and related notices. Traditional safety production refers to taking accident prevention and control measures to avoid accidents that cause personal injury and property loss in production and operation activities.

The digital safety production we are discussing today is tied to the digital transformation and upgrading of the business, and mainly addresses enterprise business continuity management. When an anticipated or unanticipated accident or disaster occurs, the enterprise protects important business activities with reasonable cost and resources, ensures that operations resume within a specified time, and keeps the impact of the disaster and the disruption to a minimum.

Digital safety production has special requirements in the following aspects:

Digitally enabled safety production. After the business moves from offline to online, the touchpoints across the whole business life cycle are digitized. The focus of safety production then also shifts from offline to online, and safety production itself requires digital empowerment.
Cloud-native-empowered safety production. Digital transformation brings an architecture upgrade: all systems are on the cloud and are designed using cloud-native and microservice architectures. The safety production platform also needs to be upgraded in step, adapting seamlessly to cloud-native product capabilities and supporting future architecture expansion.
Best practices for safety production. The construction of a safety production system needs to be tested in practice. Inside Alibaba Group, a team of more than 100 people continues to explore safety production, and has accumulated a set of best practices that suit all walks of life and are still evolving.

Digital safety production construction content

Based on the above discussion, to do safety production well, we break the core content down into three parts: construction before, during, and after the event.

Beforehand: We need the relevant organizational structure in place, the institutional processes and system architecture built in advance, water level (capacity) monitoring and fault monitoring for the relevant systems, and protection, traffic switching, and change management capabilities that match the SLA.
During the event: We need agile and rapid collaboration so that faults can be quickly discovered, located, and recovered from. For example, within Alibaba, a Double Eleven or major failure scenario usually requires collaboration among hundreds or even thousands of people. In that context, we first need a unified mechanism to ensure consistency, as well as the full-link monitoring (observability) capability mentioned by a colleague to ensure rapid discovery. In addition, we need a systematic capability to automate and coordinate the event handling process. Rapid location relies on the system's trace and topology capabilities, and truly rapid recovery relies on the system's protection capabilities and on unitized disaster recovery and multi-active deployment.
After the event: We need to reflect, summarize the root cause, and define actions. After each fault emergency is completed, we do a review, grade the fault, determine responsibility, and produce system improvement items so that the entire architecture continues to improve iteratively. For managers, we need to analyze the cause of the failure, how efficiently the team cooperated during handling, and the stability statistics of the team and the product, so that the entire safety production management system is measurable, assessable, and manageable. Finally, visualization provides metric-based, global control of business safety production.

Alibaba Group Best Practices

Alibaba Group Global Operation Command Center

First, at the organizational support layer, we have an organization called the Global Operation Command Center, or GOC. Within the group, more than 60 business units (BUs) connect all safety production-related businesses to the GOC for unified and collaborative handling.

Then there is the monitoring (observability) we just mentioned, which is a very important link. We aggregate all observability data and manual feedback (such as feedback collected by Taobao customer service and Alibaba Cloud customer service) into a unified event center and manage it on a systematic platform.

Finally, all fault emergencies are gathered at the command centers on both sides of the Taiwan Strait and the three places. The emergency staff on duty use emergency coordination, fault location, and quick recovery tools to handle and quickly recover from faults, carry out post-event review and improvement, and use mechanism operation and other strategies to control safety production risk events across the entire group.

Big picture of safety production system

Safety production is a complete system. With the help of this structure diagram, I will give you a general introduction. The group's safety production system is relatively large, and we divide the overall work into small modules.

First, there is the technical support of the platform. As introduced earlier, safety production involves many people in different roles and observability data from different business systems, and safety production management requires capabilities such as stress testing, fault emergency coordination, drills, fault location, traffic switching, and review. A corresponding platform within the group provides this support.

On this platform, systems in various fields are built, including fault management, multi-active, full-link stress testing, and change management, all supported by this large platform. Building a safety production platform is itself the digital transformation of safety production work.

Above the platform are the related management systems, data operations, and technical culture. When we started working on safety production, what we felt most strongly was that we could not measure it: after a fault, we could not locate where the problem was or whose problem it was. Once mechanisms and operational activities such as fault level definition, fault classification, and stability classification were in place, safety production work became measurable and assessable.

The platform and the institutional systems then need to be paired with drills for standardized acceptance, to ensure that these systems and product capabilities are effectively implemented and play their role.

Core elements of safe production

People & Organization

The core of safety production consists of three parts. The first part is the organizational structure. We believe that safety production is a top-level project of the enterprise, and that it is necessary to establish a top-down unified organization that can coordinate all safety production capabilities.

Within the group, we have such a vertical organizational structure. The command center is a department at the same level as each business BU, and then there are corresponding professional roles to support each business BU. Some organizational roles such as the arbitration committee ensure that our system can be effectively implemented.

System & Process

The second part is mainly about institutional processes. After more than ten years of construction, the group has accumulated many of them.

A unified definition of failure levels for the whole group: provides quantitative standards for resource scheduling and decision-making in the emergency process;
Standardized emergency procedures: make incident handling quick and orderly;
Assessment standards for fault points and stability points: uniformly measure the results of safety production;
Fault grading, responsibility determination, and dispute negotiation mechanisms: a long-term mechanism that sustains safety production work.

Tool platform

The last part is the tools. The group's systems and processes are not just on paper or hanging on the wall. All of our mechanisms and processes are supported by corresponding system platforms and, with system capabilities such as robots and NLP technology, are carried out in actual day-to-day work.

Definition of failure level

The definition of failure levels is the basis for the operation of the safety production system. We define any service interruption, service quality degradation, or experience degradation in the production environment as a failure, regardless of the cause. Note that this is a failure defined from a business perspective. Its advantage is that failures can be discovered before users notice them, and it is more accurate than traditional monitoring.

Going down a layer, there are many supporting platforms, such as middleware, databases, the cloud platform, networks, and servers, and we define lower-layer indicators and faults according to the characteristics of each. The overall principle, however, is still based on business impact: viewed from top to bottom, the service of a lower-layer system usually becomes a business dependency of the upper-layer system.

Based on the definition of fault levels, there are many subdivisions within the group in actual implementation. Here are a few commonly used sequences: the P sequence represents the general level definition, the D sequence represents the data quality level, the S sequence represents the degree of impact on important customers, the E sequence represents the public opinion level, and the I sequence represents the infrastructure-related level.

We usually have 4 levels for each sequence: 4 represents a common failure and 1 represents a serious failure. The smaller the number, the higher the urgency and importance.
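The sequence-plus-level scheme can be sketched as a small data type. This is only an illustration: the class and method names are hypothetical, and comparing urgency across different sequences purely by number is an assumption rather than something the talk states.

```python
from dataclasses import dataclass
from enum import Enum


class Sequence(Enum):
    """Fault sequences named in the talk."""
    P = "general"
    D = "data quality"
    S = "impact on important customers"
    E = "public opinion"
    I = "infrastructure"


@dataclass(frozen=True)
class FaultLevel:
    sequence: Sequence
    level: int  # 1 = serious failure ... 4 = common failure

    def __post_init__(self):
        if not 1 <= self.level <= 4:
            raise ValueError("level must be between 1 and 4")

    @property
    def code(self) -> str:
        # Short label such as "P1" or "D3".
        return f"{self.sequence.name}{self.level}"

    def more_urgent_than(self, other: "FaultLevel") -> bool:
        # Smaller number means higher urgency (assumed comparable across sequences).
        return self.level < other.level
```

For example, `FaultLevel(Sequence.P, 1).code` yields the label `P1`, and a level outside 1 to 4 is rejected at construction time.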

In actual implementation, we must first bring all services into the management scope and define fault levels for all of them. The definition needs to be reviewed with the various roles, including development, testing, product, operation and maintenance, and dependent business parties, to ensure that agreement is reached in advance. After the fault level definition is officially released, everyone invests in and supports back-end resources according to this level. Once a failure occurs, the emergency procedure for that level is initiated, and resources of the matching level are coordinated to participate in the emergency.

The fault level of each business scenario is determined by a comprehensive judgment of business importance, impact, and duration. A well-defined fault level should be structured and measurable, and should work together with full-link observability to realize automatic fault discovery.

Once a fault occurs, rules defined on the observability indicators are used to automatically calculate the fault level, and once the fault standard is reached, a robot automatically sends the fault notification.
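A minimal sketch of rule-based level calculation follows. The metric (error rate), the thresholds, and the names `LevelRule`, `ORDER_RULES`, `evaluate_level`, and `build_notification` are all hypothetical; the talk only says that the level is derived from rules over observable indicators, impact, and duration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LevelRule:
    level: int             # 1 = most serious, 4 = common
    min_error_rate: float  # hypothetical indicator: fraction of failed requests
    min_duration_s: int    # how long the indicator has stayed above the threshold


# Hypothetical thresholds for one business scenario.
ORDER_RULES = [
    LevelRule(level=1, min_error_rate=0.50, min_duration_s=60),
    LevelRule(level=2, min_error_rate=0.20, min_duration_s=60),
    LevelRule(level=3, min_error_rate=0.05, min_duration_s=120),
    LevelRule(level=4, min_error_rate=0.01, min_duration_s=300),
]


def evaluate_level(error_rate: float, duration_s: int, rules=ORDER_RULES) -> Optional[int]:
    """Return the most serious matching level, or None if no fault standard is reached."""
    for rule in sorted(rules, key=lambda r: r.level):
        if error_rate >= rule.min_error_rate and duration_s >= rule.min_duration_s:
            return rule.level
    return None


def build_notification(scenario: str, error_rate: float, duration_s: int) -> Optional[str]:
    """Build the message a notification robot could send; None means no fault yet."""
    level = evaluate_level(error_rate, duration_s)
    if level is None:
        return None
    return f"[P{level}] {scenario}: error rate {error_rate:.0%} for {duration_s}s"
```

Keeping the rules as data rather than code makes it easier for the reviewing roles to agree on them before release.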

1-5-10 Mechanism

With the fault level definition, we can accurately identify business fault risks and find and deal with them in time. So how do we measure the efficiency of fault handling? This involves a core mechanism in digital safety production: the 1-5-10 mechanism for failures.

Within the group, we set an assessment target for all faults: business faults must be discovered and notified within 1 minute, relevant personnel must respond and initially locate the fault within 5 minutes, and fault recovery must be completed within 10 minutes. Based on this core guiding mechanism, we then break it down further to build the entire safety production system.
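The 1-5-10 target can be checked mechanically against a fault's timeline. The sketch below assumes that all three targets are measured from the moment the fault starts, which the talk does not spell out; `FaultTimeline` and `check_1_5_10` are hypothetical names.

```python
from dataclasses import dataclass

# Targets in minutes, taken from the 1-5-10 mechanism.
TARGETS_MIN = {"discover": 1, "respond": 5, "recover": 10}


@dataclass
class FaultTimeline:
    started_at: float     # seconds on any consistent clock
    discovered_at: float  # fault discovered and notified
    responded_at: float   # personnel responded and located the fault
    recovered_at: float   # service recovered


def check_1_5_10(t: FaultTimeline) -> dict:
    """Return, per stage, whether the target was met (measured from fault start)."""
    elapsed = {
        "discover": t.discovered_at - t.started_at,
        "respond": t.responded_at - t.started_at,
        "recover": t.recovered_at - t.started_at,
    }
    return {stage: elapsed[stage] <= TARGETS_MIN[stage] * 60 for stage in TARGETS_MIN}
```

A fault discovered after 45 seconds, located after 280 seconds, and recovered after 720 seconds would meet the first two targets and miss the third.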

1-5-10 Strategy Breakdown

1-5-10 focuses on the three major links of "discovery, location, and quick recovery". Broken down further, it involves multiple links in architecture, development, and operation and maintenance. Each link has its own business rules, related mechanisms, and corresponding systems that we need to build.

For example, the "1" part mainly involves full-link observability, including the intelligent baselines and full-link monitoring that we usually pay close attention to. These are what we need to build for this link.

In the second part, the 5-minute response and location, we usually send notifications through mobile channels, including text messages, phone calls, and DingTalk. Then there are collaboration tools: we collaborate through DingTalk robots, using NLP robot technology for sign-in and emergency process interaction, to realize ChatOps.

For location, we need capabilities such as an observability system, a contingency plan system, and change management. Typically, when a failure occurs, the platform first sends us a failure notification, followed by information about relevant changes made before the failure. The system then pushes the contingency plans related to the scenario, and emergency personnel use the observability capabilities to assist in locating the fault.

For the 10-minute fast recovery part, one of our most powerful measures is switching traffic by unit. Once the system determines that the impact of the failure and the estimated recovery time are unacceptable, we can switch traffic unit by unit based on the unitized multi-active capability: recover first, then diagnose. In addition, small-scale failures can be partially recovered through the contingency plan system.

Finally, there are the related operation mechanisms and drill acceptance. The operation mechanism is also a very important part of safety production, ensuring continuous iteration of the relevant capabilities. Drills can inject simulated faults into the online environment in real time to test the systems and processes.

Testable Metrics

A big pain point in safety production is that it cannot be measured. Usually, we don't know which product is stable, which team is doing it well, and we don't know the direction of future improvement. Based on the above product technology system construction, we have designed many operating standards.

Fault score: After each fault occurs, the system automatically calculates a score based on the impact scope, the duration, and configured weights. It is an outcome indicator used to measure product and emergency efficiency. Through continuous operation, we can set a failure score quota for each team, and then set future safety production goals.
Stability score: It consists of 14 indicators in the fields of engineering design, architecture, and operation and maintenance. For each business development team, we systematically generate evaluation indicators covering review coverage in the design process, observability coverage in operation, grayscale capability in the release process, and the completion rate of follow-up actions. The stability score is a process indicator that evaluates the investment in safety production.
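To make the fault score concrete, here is a minimal sketch. The talk names only the inputs (impact scope, duration, weights); the multiplicative formula, the units, and the names `fault_score` and `remaining_quota` are assumptions for illustration, not the group's actual calculation.

```python
from typing import Iterable


def fault_score(impact_ratio: float, duration_min: float, weight: float = 1.0) -> float:
    """Hypothetical score: grows with impact scope, duration, and business weight.

    impact_ratio is the affected share of traffic or users, between 0 and 1.
    """
    if not 0.0 <= impact_ratio <= 1.0:
        raise ValueError("impact_ratio must be between 0 and 1")
    if duration_min < 0 or weight < 0:
        raise ValueError("duration_min and weight must be non-negative")
    return round(impact_ratio * duration_min * weight, 2)


def remaining_quota(quota: float, scores: Iterable[float]) -> float:
    """Fault score quota a team has left after the faults recorded so far."""
    return round(quota - sum(scores), 2)
```

With such a score, a team's quota becomes a budget: every fault spends part of it, which gives managers an outcome number to track over time.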

The failure score and the stability score are the two core indicators, and are important criteria for judging whether a team is qualified in the field of safety production. In addition, there are a series of mechanisms such as business availability, circuit breaking, and change control. These mechanisms are all built into their respective system platforms for automatic management.

Emergency process

For the emergency process, we aggregate all events to the GOC. There are two main types of event access. One is user-side feedback, the manual part, which is connected through intelligent customer service. The other is observability alarms, for which we have connected dozens of monitoring systems across the group's business BUs. The large volume of incoming alarm data goes through convergence, suppression, and intelligent algorithm processing. Combined with robot processing and filtering in the background, it is finally integrated into a unified platform that determines the fault level. Events and faults are coordinated through DingTalk groups. When an emergency is turned into work orders, the systems coordinate with each other accordingly. The handling process is captured in the knowledge base, and data for the whole process is displayed in a unified, visual way on a large screen.

The handling process is completed entirely in the DingTalk group. After the fault notification is sent, the relevant personnel sign in to the group, and the emergency process is presented in a unified way through the group. In the event of a major failure, we escalate to our executive team to coordinate more people.
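Alarm convergence and suppression can be illustrated with a simple time-window grouping. This is a sketch of one basic technique, not the intelligent algorithms used inside the group; the alarm tuple layout, the 60-second window, and the function name `converge` are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Alarm = Tuple[float, str, str]  # (timestamp in seconds, service, metric)


def converge(alarms: Iterable[Alarm], window_s: float = 60.0) -> List[dict]:
    """Fold alarms with the same (service, metric) into one event when they
    arrive within window_s of that event's first alarm."""
    events: List[dict] = []
    open_event: Dict[Tuple[str, str], dict] = {}
    for ts, service, metric in sorted(alarms):
        key = (service, metric)
        current = open_event.get(key)
        if current is not None and ts - current["first_seen"] <= window_s:
            # Suppressed: counted on the existing event instead of raising a new one.
            current["count"] += 1
            current["last_seen"] = ts
        else:
            current = {
                "service": service,
                "metric": metric,
                "first_seen": ts,
                "last_seen": ts,
                "count": 1,
            }
            open_event[key] = current
            events.append(current)
    return events
```

Feeding the converged events, rather than raw alarms, into level calculation keeps the emergency group from being flooded by repeats of the same problem.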

Mechanism operation

In addition to the product capabilities and related organizational structure just mentioned, mechanism operation is also a very important part of safety production. We will have very rich operational activities and various awards. The experience of the teams with excellent performance can be shared, and the teams that do not perform well can summarize and improve, so as to ensure a long-term mechanism for safe production.

Digital safety production system construction plan

The Big Picture of Digital Safety Production

If an enterprise wants to build safety production, the core is divided into two parts: one is the construction of the technical system, and the other is the construction of the service system.

For the technical system, we need to form a unified platform. As a colleague mentioned just now, enterprises today have many monitoring systems, one for each business, and the application layer, middle layer, database, cloud platform, and network each have their own systems as well. If we build in such a decentralized way, it is difficult to form a unified emergency command center. Our suggestion is to build a unified platform that provides the various operational capabilities for safety production and integrates the capabilities of each system to form a unified command center.

In terms of services, we must ensure that the mechanism culture and organizational structure can effectively support the implementation of safety production.

Digital safety production platform

For the digital safety production platform, we have designed a framework that integrates existing capabilities, such as observability, contingency plans, work orders, and event management, by assembling them by domain and abstracting them into one overall platform, where the full life cycle of personnel and events is managed in a unified way. Through the platform, we then form the corresponding business domains, support our various business scenarios, and serve the upper-level businesses. This is the overall architectural idea behind the unified safety production platform.

Construction of digital safety production system-full life cycle service design

Beforehand

When making long-term planning, enterprises need to clearly design the product structure and business structure related to safe production, and need to have corresponding business management thinking.

During

During operation, we need to build system capabilities such as drills, stress testing, rate limiting, and multi-active deployment, to ensure that we can effectively assess and prevent risks.

Here is a case involving business applications that everyone has become familiar with recently: epidemic prevention and control systems, such as health codes, venue codes, and nucleic acid testing. In the early stage, we run a stress test on the entire system to evaluate its online capacity and water level.

Suppose the evaluation shows that the capacity of the online production system is 10,000 QPS, and we prepare system resources for that traffic. If peak traffic then exceeds 10,000 QPS, the traffic protection capability we have configured ensures that the system does not collapse entirely under extreme conditions.
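One common way to implement this kind of traffic protection is a token bucket rate limiter. The sketch below is a generic single-process illustration, not the product used in the talk; the class name and the injectable `clock` parameter (added so the behavior can be tested deterministically) are assumptions. In the example from the talk, `rate_per_s` would be set to the measured capacity of 10,000.

```python
import time


class TokenBucket:
    """Admit at most rate_per_s requests per second on average, with a bounded burst."""

    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Over capacity: reject quickly so the requests already admitted stay healthy.
        return False
```

Requests rejected by `allow()` would typically receive a fast "busy, retry later" response, which keeps the system serving at its tested capacity instead of collapsing under the excess.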

Going up one level, if the system has higher SLA requirements, we also need to build active-active capability, which is a major measure in safety production. We must ensure that in extreme cases, when the entire business system collapses, a corresponding active-active site can take over the business. All these capabilities require unified scheduling and management on a unified platform.

Afterwards

The last part is improvement after the event. This covers a lot, such as improving emergency coordination capabilities, the product architecture, and the entire management mechanism. Whether the improvements are completed on time and whether they achieve the intended effect is also a very important closed loop, and we need a platform to support it.

Digital Safety Production-Holographic Observation Platform

For platform capability building, we take each important link apart and look at it. The first is observability, which is the acos full-link monitoring we just talked about. Another point is that observability does not necessarily depend on one particular platform, but must effectively integrate all the monitoring capabilities already present at the business site. acos is compatible with ARMS application monitoring on proprietary cloud, enhances access for business, log, and heterogeneous monitoring systems, and achieves rapid fault location through improved visualization.

Digital Safety Production-Full-Link Stress Test

The second part is the full-link stress test. A core part of safety production is to first understand the processing capacity of our platform: we need to find out the system's water level and its ultimate carrying capacity. That way, when real peak business traffic arrives, we know what to do and can handle it calmly.

The full-link stress test is a crucial link in the group's promotion activities. Every full-link stress test is carried out in the production system, so that the data obtained is real and the shortcomings and system problems it exposes are exactly those that would occur online. We can then accurately find the shortcomings and raise the overall load capacity of the business system.

Digital safety production - "1-5-10" emergency coordination

Here we mainly introduce 1-5-10 emergency coordination. When building the safety production system, we first need to integrate alarm events and manual feedback, and then define fault levels from the business point of view, to ensure that faults are discovered within 1 minute and notified accurately and promptly.

In the emergency process, we have corresponding horizontal support capabilities, including connecting resources, cross-team and cross-vendor collaboration, and built-in DevOps and ChatOps capabilities, to ensure that the system can automatically find the contact person and help complete fault location quickly.

In the early stage of construction, we mainly rely on effective integration of the enterprise's existing capabilities. We do have mature solutions, but that does not mean everything needs to be torn down and rebuilt; usually, an enterprise should build on its current state. The quick recovery part includes rules and capabilities such as contingency plans, disaster recovery, and active-active. Of all the parts of safety production construction, 1-5-10 is the easiest to put into practice and the fastest to show results.

Digital Safety Production - Traffic Protection

Regarding the ability of traffic protection, I have just mentioned a part of it.

In a real-world environment, coping with sudden traffic spikes requires preparing additional resources, which is a trade-off between cost and efficiency. For some businesses, we estimate a business peak during construction. Based on this peak, we cannot prepare an unlimited amount of computing and storage resources, but when the business peak arrives, we also cannot allow the system to go out of service.

Therefore, we need the system's rate limiting capability to keep the business available in extreme cases and to buy time for operations staff to scale out. Usually, together with the full-link stress test just introduced, we can use the traffic protection capability and the elasticity of cloud-native containers to ride out traffic peaks smoothly and maximize the stability of the overall business.

Digital safety production-disaster recovery and multi-active solutions

Disaster recovery and multi-active deployment is the ultimate strategy for safety production. When a large-scale failure occurs, relying only on the fault location and recovery capabilities of individual systems may not meet the SLA of important systems. At that point, we need business-level disaster recovery and multi-active capability.

In our high-level disaster recovery solution, the overall architecture is unitized, that is, a business-level remote multi-active solution. Many enterprises, however, approach disaster recovery through database synchronization and storage replication, with each application managing its own part to obtain disaster recovery capability.

The disaster recovery and multi-active system has a general management and control platform that collaboratively manages the traffic access layer, middleware, and databases. It can be understood as a large traffic scheduling system: after a business failure, it can automatically schedule the traffic of a single application and execute the related business flows, and the platform completes the switching process automatically with no manual adjustment of the application.
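The traffic scheduling idea can be sketched as a router that assigns each user a home unit and moves only the affected users when a unit fails. This is a simplified illustration of the routing decision alone; the class name `UnitRouter`, the hash-based assignment, and the unit names are assumptions, and a real multi-active system also has to coordinate middleware and data synchronization, as the paragraph above notes.

```python
import zlib


class UnitRouter:
    """Route each user to a home unit; reroute only users of failed units."""

    def __init__(self, units):
        self.units = list(units)
        self.healthy = set(self.units)

    def mark_down(self, unit: str) -> None:
        self.healthy.discard(unit)

    def mark_up(self, unit: str) -> None:
        if unit in self.units:
            self.healthy.add(unit)

    def route(self, user_id: str) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy unit available")
        # Stable hash so the same user always lands on the same home unit.
        h = zlib.crc32(user_id.encode("utf-8"))
        home = self.units[h % len(self.units)]
        if home in self.healthy:
            return home
        # Home unit is down: spread its users over the remaining healthy units.
        candidates = [u for u in self.units if u in self.healthy]
        return candidates[h % len(candidates)]
```

Marking a unit down changes routing immediately for that unit's users while everyone else stays on their home unit, which matches the "recover first, then diagnose" approach: traffic is moved away before the root cause is known.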

Multi-active construction has obvious advantages in resource utilization, switching success rate, and degree of automation, and it is also the ultimate goal of enterprise safety production construction.
