The Practice and Thinking Behind SOFAStack｜A New Generation of Distributed Cloud PaaS Platform That Creates a New Cloud Experience for Enterprises

In recent years, cloud computing has developed at rocket speed, and infrastructure has grown more heterogeneous by the day. This is a clear trend at the infrastructure layer. As the infrastructure becomes more complex, unified resource scheduling across it becomes a major challenge.

In an increasingly complex, heterogeneous infrastructure, how should existing applications and new applications move to the cloud? Faced with the challenges posed by large amounts of heterogeneous infrastructure, how can enterprises maximize the value of moving to the cloud?

On December 15, at the Global Distributed Cloud Conference, themed "Leading Distributed Cloud Reform and Helping the Digital Economy in the Bay Area", Ma Zhenxiong, Product Director of the Digital Technology Division of Ant Group, shared Ant Group's practice and thinking behind building the distributed cloud PaaS platform SOFAStack on top of heterogeneous distributed cloud infrastructure.

PART. 1 Service Mesh defines a new path to the cloud for applications

As cloud native develops, enterprises carry a large amount of historical baggage into their technology upgrades. This baggage consists of their existing heterogeneous applications, which differ in technical architecture, in communication protocols, and in development frameworks.

Managing these existing applications in a unified way on heterogeneous infrastructure touches the entire application life cycle: the cost of transforming applications during research and development, unified service governance for heterogeneous applications at runtime, and, during operations, unified metadata management, unified changes, unified disaster tolerance, unified emergency response, and fund security across the infrastructure. All of these are challenges at the PaaS layer.

If unified resource scheduling at the IaaS layer starts from the perspective of resources, then the PaaS layer above it needs to consider, from the perspective of applications, what challenges the complexity of the entire distributed infrastructure brings and how enterprises should deal with them.

Enterprises have a large and varied amount of historical baggage. Transforming all of it into distributed or cloud-native applications is very expensive, and few companies are willing to bear the time and cost of making all of their legacy systems cloud native in a short period.

Compared with other ways of moving to the cloud, Service Mesh works across platforms and across protocols, and it requires no intrusive changes to business code. An application can quickly be given a Sidecar to join the Mesh, gaining the benefits of a distributed architecture along with security and observability, while the overall architecture evolves smoothly. Enterprises can upgrade their architecture step by step and achieve end-to-end security and trust together with full-link observability.
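
The non-intrusive idea can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: the `legacy_app` function, its `key=value` text protocol, and the `Sidecar` class are assumptions, not SOFAStack or Service Mesh APIs. The point is that protocol translation, tracing, and request counting live in the Sidecar, while the application code stays untouched.

```python
import json
import uuid
from typing import Callable, Dict, List


# Hypothetical legacy application: it contains only business logic and
# speaks a plain "key=value;key=value" text protocol.
def legacy_app(request: str) -> str:
    fields = dict(pair.split("=", 1) for pair in request.split(";"))
    return f"status=ok;account={fields['account']}"


class Sidecar:
    """Illustrative sidecar that wraps an application without changing it."""

    def __init__(self, app: Callable[[str], str]) -> None:
        self.app = app
        self.request_count = 0           # simple observability counter
        self.traces: List[str] = []      # trace IDs seen by this sidecar

    def handle_json(self, body: str) -> str:
        """Accept JSON from the mesh and translate it to the app's protocol."""
        payload: Dict[str, str] = json.loads(body)
        # Reuse the caller's trace ID, or start a new trace if none was sent.
        trace_id = payload.pop("trace_id", None) or uuid.uuid4().hex
        self.traces.append(trace_id)
        self.request_count += 1

        legacy_request = ";".join(f"{k}={v}" for k, v in payload.items())
        legacy_reply = self.app(legacy_request)

        # Translate the reply back to JSON and attach the trace ID.
        reply = dict(pair.split("=", 1) for pair in legacy_reply.split(";"))
        reply["trace_id"] = trace_id
        return json.dumps(reply, sort_keys=True)
```

Calling `Sidecar(legacy_app).handle_json('{"account": "42"}')` returns a JSON reply with a generated trace ID, even though `legacy_app` knows nothing about JSON or tracing.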

Overall, Service Mesh first reduces the cost of transforming traditional applications into distributed, cloud-native applications; second, it solves the problems of interconnection and unified management across all of an enterprise's new and old systems; third, it makes the upgrade of the enterprise application architecture smoother; and fourth, it lets enterprises keep the technology stacks of their existing systems and meet their own requirements for controllability.

Forrester has long followed Ant Group's innovative technology. Dai Kun, chief analyst at Forrester serving technology decision makers, released "The Total Economic Impact of Ant Group Service Mesh" and shared his research on Mesh.

To become intelligent in the future, enterprises need to drive that process through microservices rather than in the piecemeal fashion of the past. The customized functions of traditional applications must be dynamically assembled through Service Mesh in order to move to the cloud.

Through interviews with Ant Group customers, Forrester found that traditional financial institutions and Internet financial institutions alike face common challenges under a hybrid architecture, including infrastructure upgrades, application development upgrades, and cloud-on-cloud interactions. Forrester found that Service Mesh brings significant benefits, from cost savings in transforming a single application to more efficient operations and security management. Based on three years of research data, customers' return on investment reached 99% after adopting Ant's Service Mesh products.

PART. 2 SOFAStack delivers unified operations and elastic disaster recovery across heterogeneous infrastructure

Based on its own accumulated technology and real-world scenarios, Ant Digital Technology defines six major capabilities of a distributed cloud PaaS platform in the operations state: unified metadata management, unified cluster resource management, unified change capabilities, unified emergency response capabilities, unified disaster tolerance capabilities, and unified end-to-end observability from business and application down to infrastructure. On this basis, Ant Digital Technology redefines SRE and delivers unified application operations capabilities.

The industry generally takes the "R" in SRE to mean Reliability. Drawing on more than ten years of pursuing business availability and continuity, and on more than a dozen large-scale Double Eleven events, Ant Digital Technology has redefined SRE by changing the R from Reliability to Risk, which means that Ant's own assurance system is built around risk. From more than ten years of accumulated technology, it created its own technology risk protection platform, TRaaS. It is this accumulation that lets Ant achieve unattended operations for business, applications, and infrastructure, a kind of "autonomous driving" for operations.

Ant's technical risk prevention and control system consists, from top to bottom, of three goals (high availability, fund security, and low cost), three organizational guarantees (team, culture, and system), and four lines of defense (requirements, R&D, release, and monitoring). Together these have produced a complete set of platform capabilities for technical risk assurance. The platform is made up of four capability areas: emergency response, change, capacity, and fund security.

The emergency response platform has built a risk-centered protection system that covers the periods before, during, and after a failure, corresponding to failure risk detection, failure location, emergency response and self-healing, and post-failure review. The change platform, centered on changes, can automatically analyze, defend against, and block change risks before, during, and after a change. The capacity platform provides automatic detection, capacity planning, and capacity assurance for bottlenecks across global data centers and the overall system. Finally, the fund security platform establishes a second line of defense through fund verification that is non-intrusive to business applications, helping enterprises avoid fund security risks and reduce financial losses.
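
As a rough illustration of analyzing and blocking change risk before and during a change, here is a minimal sketch of a guarded, batched rollout. It is not how TRaaS is implemented. The function name `guarded_rollout` and the `precheck`, `apply`, and `healthy` callbacks are all hypothetical, and the post-change review step is left out for brevity.

```python
from typing import Callable, Dict, List


def guarded_rollout(
    batches: List[List[str]],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    precheck: Callable[[], bool],
) -> Dict[str, object]:
    """Apply a change batch by batch and stop as soon as risk is detected."""
    # Before the change: block it outright if the pre-change analysis fails.
    if not precheck():
        return {"status": "blocked", "applied": []}

    applied: List[str] = []
    for batch in batches:
        for host in batch:
            apply(host)
            applied.append(host)
        # During the change: halt before the next batch if any host in
        # the batch just changed reports unhealthy.
        if not all(healthy(host) for host in batch):
            return {"status": "halted", "applied": applied}

    return {"status": "done", "applied": applied}
```

Because each batch is verified before the next one starts, a bad change reaches only the hosts changed so far rather than the whole fleet.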

If the first core challenge is solving problems in the R&D state and the runtime state, and the second is solving problems in the operations state, then the third core challenge is solving disaster tolerance at the level of the overall architecture.

With the vigorous development of distributed cloud infrastructure, enterprise data centers are moving from centralized to dispersed, which means that any enterprise application can run at any time on any node of any data center in the country. From an application perspective, this change creates an urgent need for an overall application architecture that lets the business scale without limit across regions and cities. Driven by its pursuit of business continuity, Ant has built an ultra-large-scale deployment of five centers in three locations for the financial industry while supporting business growth, and has developed a geo-distributed multi-active unit architecture that addresses three major enterprise pain points: disaster tolerance, elasticity, and grayscale release.

In terms of disaster tolerance, the architecture supports an enterprise's evolution from a single active data center, to dual-active in the same city, to three centers in two locations, and on to multi-active across multiple locations. The failure of one business unit does not affect another business unit, so business reliability and continuity are guaranteed by the architecture itself.
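
The isolation property can be sketched as follows, under simplifying assumptions: users are sharded by ID, each shard has one home unit, and each unit has one designated backup. The `UnitRouter` class and the unit names are hypothetical, not part of SOFAStack.

```python
from typing import Dict, Set


class UnitRouter:
    """Hypothetical router that pins each user shard to a home unit."""

    def __init__(self, home: Dict[int, str], backup: Dict[str, str]) -> None:
        self.home = home              # shard number (0..n-1) -> home unit
        self.backup = backup          # unit -> unit that takes over on failure
        self.down: Set[str] = set()   # units currently marked unavailable

    def mark_down(self, unit: str) -> None:
        self.down.add(unit)

    def route(self, user_id: int) -> str:
        shard = user_id % len(self.home)
        unit = self.home[shard]
        # Only traffic whose home unit failed is moved to the backup unit.
        if unit in self.down:
            unit = self.backup[unit]
        if unit in self.down:
            raise RuntimeError(f"no healthy unit for shard {shard}")
        return unit


router = UnitRouter(
    home={0: "unit-a", 1: "unit-b", 2: "unit-c"},
    backup={"unit-a": "unit-b", "unit-b": "unit-c", "unit-c": "unit-a"},
)
```

After `router.mark_down("unit-b")`, users of shard 1 are served by `unit-c`, while users whose home is `unit-a` or `unit-c` are routed exactly as before.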

In terms of elasticity, flexible deployment and rapid scale-out, combined with flexible traffic allocation, let an enterprise's data centers expand beyond city and regional boundaries and achieve truly unlimited scalability.

In terms of grayscale release, combining cross-unit routing and traffic distribution makes it easy to implement innovative grayscale approaches such as blue-green units.
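
One common way to split traffic between a blue group and a green group of units is a stable hash of the user ID. The sketch below assumes that approach; the function name `pick_group` and the percentage parameter are hypothetical, and the source does not describe how Ant assigns traffic.

```python
import hashlib


def pick_group(user_id: str, green_percent: int) -> str:
    """Deterministically assign a user to the "blue" or "green" unit group."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    # A stable hash keeps each user in the same bucket across requests
    # and across processes (unlike Python's built-in hash()).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "green" if bucket < green_percent else "blue"
```

Raising `green_percent` from 0 to 100 moves users to the new version gradually, and a user who has already been moved to green never flips back to blue as the percentage grows.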

The multi-site multi-active architecture is very complex. It contains four layers from top to bottom: the access layer, with routing rules and traffic distribution; the application layer, with middleware routing; the data layer, with data sharding and data routing; and the operations layer, with unified disaster tolerance, unified monitoring, and unit topology.

Take the financial industry as an example. When a large bank migrates off the mainframe, the key questions are how to move the core system down onto a distributed cluster and how the distributed cluster can match the performance and stability of the mainframe system. A very important capability here is the multi-site multi-active architecture.

Finally, in tackling these three core challenges, Ant developed SOFAStack, a new generation of distributed cloud PaaS platform. The platform has many top customer cases in the financial industry. By design, it meets the financial industry's standards for capacity, performance, scale, high availability, compliance, and cost reduction and efficiency, which are far higher than those of other industries. More importantly, SOFAStack comes from the financial industry but is not limited to it. Ant hopes to use SOFAStack to empower more industries and help more companies complete their digital transformation.

PART. 3 The future direction of SOFAStack

The future of Mesh will go through three important development stages:

In the first stage, more Mesh product forms appear beyond Service Mesh, including Message Mesh, Cache Mesh, DB Mesh, and so on. This stage helps enterprises manage heterogeneous runtime infrastructure more easily.

In the second stage, on top of compatibility with heterogeneous runtime infrastructure, the goal is to define a community or de facto API standard that gives enterprises a unified programming interface. Once an enterprise has developed an application, any change to the underlying infrastructure is transparent to that application. The vision at this stage is to let an application build once, run anywhere: once developed, it needs no changes and can run at any time on any node of any data center across the country, while the runtime infrastructure that the node exposes upward can vary.
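
The "build once, run anywhere" idea can be sketched with a small abstract interface. The `StateAPI` interface and both runtimes below are hypothetical stand-ins, not a real community standard or a SOFAStack API. The business function depends only on the interface, so swapping the runtime underneath requires no change to it.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class StateAPI(ABC):
    """Hypothetical unified programming interface for key-value state."""

    @abstractmethod
    def save(self, key: str, value: str) -> None: ...

    @abstractmethod
    def load(self, key: str) -> Optional[str]: ...


class MemoryRuntime(StateAPI):
    """Stand-in for one kind of runtime infrastructure."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, key: str, value: str) -> None:
        self._data[key] = value

    def load(self, key: str) -> Optional[str]:
        return self._data.get(key)


class NamespacedBytesRuntime(StateAPI):
    """Stand-in for a different infrastructure: namespaced keys, byte values."""

    def __init__(self, namespace: str) -> None:
        self._namespace = namespace
        self._blobs: Dict[str, bytes] = {}

    def save(self, key: str, value: str) -> None:
        self._blobs[f"{self._namespace}/{key}"] = value.encode("utf-8")

    def load(self, key: str) -> Optional[str]:
        blob = self._blobs.get(f"{self._namespace}/{key}")
        return None if blob is None else blob.decode("utf-8")


def mark_order_paid(state: StateAPI, order_id: str) -> Optional[str]:
    """Business code: written once against StateAPI, unaware of the runtime."""
    state.save(f"order:{order_id}", "PAID")
    return state.load(f"order:{order_id}")
```

`mark_order_paid(MemoryRuntime(), "1001")` and `mark_order_paid(NamespacedBytesRuntime("unit-a"), "1001")` run the same business code on two different runtimes.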

In the third stage, where the first two stages sink basic services end to end into the infrastructure, more horizontal capabilities will sink as well, including resource calls and system calls. At this stage, as much logic as possible that is unrelated to the business itself will be moved out of business applications and into the Sidecar. This frees business developers to program against capabilities rather than the underlying layers and to focus on the business itself.

Finally, Ant Group remains committed to forward-looking planning and continuous innovation in technical architecture, and will continue to polish end-to-end trusted-native capabilities on heterogeneous infrastructure.

In the future, Ant hopes to build SOFAStack into a cross-cloud operating system for the digital transformation of all industries.
