Recently, the Distributed System Stability Laboratory of the China Academy of Information and Communications Technology officially released the "Guidelines for Information System Stability Guarantee Capability Building" (hereinafter referred to as the "Guidelines"). Ant Group was invited to take an active part in discussing and compiling the Guidelines, which include exemplary system stability assurance cases from many well-known institutions, Ant Group among them, and aim to give industries a reference for improving their system stability capabilities.

As digital transformation advances across fields, information systems are applied ever more widely, the business they carry becomes more critical, and high-frequency user access has become the norm. To meet this growing demand, most information systems keep raising the ceiling of their processing capacity through distributed architecture transformation, DevOps adoption, and the introduction of a large number of open source technologies. These measures have increased the complexity of information system architectures exponentially and raised stability risk significantly. At the same time, the state attaches great importance to information system stability: the "Regulations on the Security Protection of Critical Information Infrastructure", issued in 2021, set out clear requirements for the stability assurance of China's critical information infrastructure.

Against this background, the "Guidelines for Information System Stability Guarantee Capability Building" came into being. As the first research result in China to comprehensively summarize the practical experience and methodology of stability assurance, the Guidelines lay out the background, basic principles, key elements, core capabilities, and evaluation system for building information system stability assurance capabilities in the new stage, and discuss the future development trends of stability assurance work.

The Guidelines hold that information systems are the infrastructure of every industry, and that the rapid development of Internet technology has brought many new challenges to system stability, with distributed systems facing the highest stability risks. To address this, the Guidelines propose, for the first time, an information system stability assurance framework for the digital age, consisting of "two general principles, three key elements, four types of core capabilities, and five important tasks".

In addition, to help industries improve their system stability assurance, the Guidelines collect a number of best-practice cases in information system stability. Ant Group's stability assurance system is included among the Internet industry cases.

Ant Group provides payment, wealth management, insurance, and other services mainly through the Alipay client, serving billions of users. Its business scenarios are complex and involve financial services, so the stability requirements are extremely high. Over years of business growth, Ant Group has gradually built TRaaS (Technological Risk-defense as a Service), a risk prevention and control system that addresses the problems of stability assurance. TRaaS covers the stability risks that may arise across the entire R&D and operations process, and provides stability risk prevention and control solutions spanning process systems, culture building, technical solutions, and platform systems. The goal is to discover risks proactively, recover from them automatically, and support high-quality business growth.

Simply put, TRaaS is an immune system for Alipay's entire distributed architecture that brings its technical risk capabilities together. It combines high availability and fund security capabilities with AIOps so that the system can heal itself after failures. TRaaS also has the following six characteristics:

  • Unified change management and control, with intelligent change risk defense;
  • Standardized, SOP-based fault management built on ChatOps, with fine-grained assistance for emergency fault localization;
  • Intelligent resource capacity scheduling to achieve the optimal balance between stability and cost;
  • Intelligent real-time verification of fund vouchers at the trillion scale;
  • Large-scale chaos engineering that drives the evolution of stability technology and promotes a technology risk culture;
  • AIOps that improves operations efficiency while keeping risk under control.

In fact, TRaaS grew out of Ant Group's hands-on experience running ultra-large-scale systems. It is a technical risk prevention and control platform that matured step by step through demanding tests such as "Double Eleven", and it safeguards the stability of Ant Group's internal ultra-large-scale systems.

Li Zheng, chief risk architect at Ant Technology, said that over the past ten years, because of its emphasis on system stability and security, Ant Group has accumulated a great deal of experience and technology. TRaaS is the technical risk platform capability that Ant has built up and refined over the years in its large-scale, complex internal business. In the future, Ant will gradually open up more technologies and products to help all parties build stable digital systems.

At present, Ant Group is making its TRaaS technology risk prevention and control platform available externally through commercial offerings and open source projects. Ant hopes to share its platform capabilities and practical experience in technology risk prevention and control with partners in various industries, so that they can work together, share risk protection technology, and safeguard the stability of enterprise systems.


Ant Technology (蚂蚁技术)

The official technology account of Ant Group, sharing Ant's explorations in cutting-edge technology innovation.