*Service Mesh* is the foundation of the next generation of microservice architecture. Ant Group began technical exploration and pilot projects at the beginning of 2018. Service Mesh now covers thousands of Ant applications, including all of its core call chains.

Through the large-scale adoption of Service Mesh, Ant Group has taken a solid step toward cloud native and verified its feasibility. Sinking capabilities into the infrastructure has improved R&D and operations efficiency and reduced costs for both the business teams and the infrastructure team.

Ant is also actively open-sourcing its mature technology. Its self-developed data plane, MOSN, is now open source, and open source enthusiasts are welcome to contribute:

https://github.com/mosn/mosn

|Foreword|

Microservice architecture is becoming the mainstream system architecture at Internet companies and financial institutions. At its core is a service framework that integrates service communication and service governance. As microservice frameworks continue to evolve, the service mesh (Service Mesh) has emerged as a new kind of microservice architecture that is considered to have good prospects because of its flexibility and generality.

The Industrial and Commercial Bank of China (hereinafter ICBC) has actively explored the service mesh field. It began pre-research on service mesh technology in 2019 and, after in-depth research and practice, built a service mesh platform in 2021. The service mesh is developed in integration with the existing microservice architecture. It helps ICBC's application architecture move toward a distributed, service-oriented model and is intended to carry the core banking system on the open platform in the future.

PART. 1 Development status of service mesh in the industry

Since service mesh technology appeared in 2016, many open source products have emerged in the industry, such as Istio (Google, IBM, Lyft), Linkerd (Buoyant), and Consul (HashiCorp). Among them, the Istio community is the most active and most widely recognized, and Istio is regarded as the benchmark open source service mesh.

A service mesh is an infrastructure layer dedicated to service communication. It takes over the business container's traffic by injecting a Sidecar container into the business Pod. The Sidecar also connects to the mesh control plane and governs the traffic it proxies according to the policies the control plane issues. The governance capabilities of the original service framework are thus moved down into the Sidecar container, so that basic framework capabilities sink into the infrastructure and are decoupled from the business system.

Figure 1: Schematic diagram of service mesh

After the Sidecar container takes over the inbound and outbound traffic of a back-end service, services communicate through standard protocols, which enables cross-language and cross-protocol service calls. The Sidecar can also govern the traffic it proxies, for example through unified service routing, encryption, and monitoring collection.

Figure 2: Schematic diagram of the service mesh request flow

PART. 2 Service mesh technology at ICBC: exploration and practice

ICBC started its IT architecture transformation project in 2015. To date, the distributed system covers more than 240 key applications, with more than 480,000 distributed service provider nodes in production and an average of more than 12.7 billion service calls per day. Cluster processing capacity has gradually come to exceed the performance capacity of the mainframe. While ICBC's distributed service platform stably supports existing business systems, it also faces challenges that are common across the industry, such as:

(1) Interconnecting cross-language technology stacks requires developing several sets of basic frameworks, so development and maintenance costs are high.

(2) Across multiple product lines, applications use different versions of the basic framework, so rolling out a framework upgrade to every application takes a long time. Running multiple framework versions in parallel in production also creates considerable compatibility pressure.

To address these pain points, ICBC introduced service mesh technology to decouple business systems from infrastructure and improve service governance capabilities.

Integrate and develop with the microservice framework to build an enterprise-level service mesh platform

In building the Service Mesh platform, ICBC integrated existing infrastructure from its distributed system, such as the registry and service monitoring. The most basic capability of the original service framework client, communication protocol encoding and decoding, was kept in the business system as a lightweight client. All remaining capabilities of the framework client were sunk into the Sidecar. This keeps the platform compatible with the service framework and allows a smooth transition.

ICBC has now completed construction of the Service Mesh platform. In integrating it with the distributed service platform, ICBC connected the service governance and monitoring systems of heterogeneous-language applications, decoupled business systems from middleware, and enriched traffic management capabilities. Business pilots have been completed in applications such as robo-advisory and text recognition.

Figure 3: Comparison of Sidecar and Microservice SDK

The service mesh control plane includes a configuration center, registration center, security center, control center, monitoring center, and log center. The data plane Sidecar uses the same communication protocols (Dubbo/Spring Cloud) as the original service framework, so the service mesh system and the original service framework system can call each other and applications can migrate smoothly.

Figure 4: ICBC service mesh architecture diagram

Explore enterprise-level solutions to support large-scale deployment and smooth migration

The ICBC service mesh has put traffic proxy deployment modes, smooth migration, and performance optimization into practice in scenarios such as big data and high-frequency online services.

(1) Non-intrusive traffic proxy deployment mode in big data scenarios

ICBC applications are mainly written in Java, but Python is also widely used in the big data field. For heterogeneous-language scenarios, the service mesh platform provides a non-intrusive traffic proxy based on transparent hijacking, which makes it easier to onboard heterogeneous-language applications. At its core, the non-intrusive proxy modifies iptables rules to intercept traffic entering and leaving the business container and redirect it to the Sidecar container.

Concretely, when the business Pod starts, an Init Container (initialization container) modifies the Pod's iptables rules so that traffic entering and leaving the business container is redirected to the Sidecar container. The Sidecar thereby takes over the business container's traffic.

Figure 5: Schematic diagram of transparent hijacking traffic proxy
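The redirection performed by the Init Container can be sketched as a small Python helper that only builds the iptables commands, without running them. This is an illustration, not ICBC's actual rule set: the ports 15006 and 15001 and the UID 1337 are the defaults used by open source Istio and are assumptions here, as is the function name `build_redirect_rules`.

```python
from typing import List


def build_redirect_rules(inbound_port: int = 15006,
                         outbound_port: int = 15001,
                         sidecar_uid: int = 1337) -> List[str]:
    """Return the iptables commands an init container might run.

    The commands are returned as strings and are NOT executed here.
    Port numbers and UID are assumed values (Istio's defaults).
    """
    return [
        # Inbound: TCP connections arriving at the Pod go to the Sidecar.
        f"iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports {inbound_port}",
        # Outbound: skip traffic sent by the Sidecar itself, or it would loop.
        f"iptables -t nat -A OUTPUT -p tcp -m owner --uid-owner {sidecar_uid} -j RETURN",
        # Outbound: leave loopback traffic untouched.
        "iptables -t nat -A OUTPUT -p tcp -d 127.0.0.1/32 -j RETURN",
        # Outbound: everything else from the business container goes to the Sidecar.
        f"iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports {outbound_port}",
    ]
```

Rule order matters in this sketch: the two `RETURN` rules must precede the final `REDIRECT` in the `OUTPUT` chain, otherwise the Sidecar's own upstream connections would be redirected back to itself.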

However, iptables brings significant performance and maintainability challenges. For high-frequency online services, we therefore provide a traffic proxy solution in which a lightweight client cooperates with the Sidecar.

(2) Low-intrusion traffic proxy deployment mode in high-frequency online scenarios

For high-frequency online services, we introduced a lightweight client into the business application. Transparently to the business code, the client changes the application's service registration and discovery behavior: registration and subscription requests that were originally sent to the registration center are sent instead to the local Sidecar at 127.0.0.1, and the Sidecar registers and subscribes with the registration center on the application's behalf. Because the business container subscribes through the Sidecar, the service destination address it obtains is the local Sidecar address 127.0.0.1. All subsequent requests are sent directly to the Sidecar, which forwards them to the real destination address, providing the traffic proxy capability.

Figure 6: Schematic diagram of port traffic proxy
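The registration and subscription flow above can be illustrated with a minimal in-memory sketch. The classes `Registry` and `Sidecar`, the local port 20880 (Dubbo's conventional default), and the service name used in the test are all hypothetical; a real Sidecar would also load-balance across providers and handle inbound traffic, which this sketch omits.

```python
from typing import Dict, List, Tuple

Address = Tuple[str, int]


class Registry:
    """In-memory stand-in for the registration center."""

    def __init__(self) -> None:
        self._providers: Dict[str, List[Address]] = {}

    def register(self, service: str, addr: Address) -> None:
        self._providers.setdefault(service, []).append(addr)

    def lookup(self, service: str) -> List[Address]:
        return list(self._providers.get(service, []))


class Sidecar:
    """Proxies registration, subscription, and requests for one Pod."""

    LOCAL: Address = ("127.0.0.1", 20880)  # assumed local listening address

    def __init__(self, registry: Registry, pod_ip: str) -> None:
        self.registry = registry
        self.pod_ip = pod_ip
        self._routes: Dict[str, List[Address]] = {}

    def register(self, service: str, app_port: int) -> None:
        # The app registers with the Sidecar; the Sidecar registers
        # the Pod's real address with the registration center.
        self.registry.register(service, (self.pod_ip, app_port))

    def subscribe(self, service: str) -> Address:
        # Remember the real providers, but hand the app only the local address.
        self._routes[service] = self.registry.lookup(service)
        return self.LOCAL

    def forward(self, service: str) -> Address:
        # Choose the real destination for a request received locally.
        providers = self._routes.get(service, [])
        if not providers:
            raise LookupError(f"no provider known for {service}")
        return providers[0]
```

The design point is that the business application never sees a remote address: it only ever connects to 127.0.0.1, so routing decisions can change inside the Sidecar without touching application code.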

(3) Smooth migration from traditional deployment to mesh deployment

ICBC's microservices currently consist mainly of Dubbo-based and Spring Cloud-based service instances, which run at large scale in production. A newly introduced service mesh must therefore be able to transition smoothly from the original microservice system. The ICBC service mesh supports both the Dubbo and Spring Cloud protocols, so mesh instances and original service framework instances can call each other over the same protocol. Under a shared registration center, the service mesh system and the original distributed service system can coexist and evolve together, and migration is smooth.

Figure 7: Schematic diagram of smooth migration

(4) Performance challenges and optimization after large-scale deployment

ICBC's largest registry cluster currently serves an ultra-large-scale scenario with more than 480,000 providers. In the open source Istio architecture, service discovery addresses and configuration are pushed in full through Pilot's xDS API. With a very large number of service instances, full pushes hurt the performance and stability of Pilot and the Sidecar. The service mesh platform introduces a third-party registration center and configuration center. The Sidecar connects to them directly, which supports on-demand subscription and precise configuration delivery and greatly reduces the load on Pilot and the Sidecar. Stress tests show that the control plane has the capacity to support millions of instances.

Figure 8: The evolution of ICBC control plane components
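The difference between a full push and on-demand subscription can be shown with a toy comparison of how many endpoints a single Sidecar has to hold. The function names and the synthetic registry below are assumptions made for illustration and do not reflect the platform's real data model or the xDS wire format.

```python
from typing import Dict, Iterable, List

Snapshot = Dict[str, List[str]]


def full_push(registry: Snapshot) -> Snapshot:
    """Full delivery: the Sidecar receives every service's endpoints."""
    return {service: list(endpoints) for service, endpoints in registry.items()}


def on_demand_push(registry: Snapshot, subscribed: Iterable[str]) -> Snapshot:
    """On-demand delivery: only the services this Sidecar subscribed to."""
    return {s: list(registry[s]) for s in subscribed if s in registry}


def endpoint_count(snapshot: Snapshot) -> int:
    """Total number of endpoints contained in a snapshot."""
    return sum(len(endpoints) for endpoints in snapshot.values())


def demo_registry(services: int = 1000, endpoints_per_service: int = 10) -> Snapshot:
    """Build a synthetic registry; names and addresses are made up."""
    return {
        f"svc-{i}": [f"10.0.{i % 256}.{j}:20880" for j in range(endpoints_per_service)]
        for i in range(services)
    }
```

With the synthetic registry of 1,000 services and 10 endpoints each, a full push carries 10,000 endpoints to every Sidecar, while a Sidecar that subscribes to three services receives 30. The gap widens as the registry grows, which is the motivation for on-demand subscription.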

Build enterprise-level service governance capabilities to support precise traffic control

Open source Istio's traffic management capabilities are limited: it provides only basic routing and observability, which do not meet enterprise-level needs. SOFAMesh builds on the Istio architecture with a self-developed data plane and tuned control plane components to meet enterprise needs. ICBC worked with the SOFAMesh team to build a financial-grade service mesh platform and to strengthen its enterprise-level traffic control capabilities. The ICBC service mesh has complete monitoring and operations capabilities: it can monitor the running status of each node, allocate traffic to each node in real time, remove faulty nodes in real time, and apply unified security controls to every node.

(1) Monitoring and operations capabilities

The service mesh platform has complete built-in monitoring and alerting capabilities and can report metrics such as service monitoring and link monitoring to third-party monitoring systems. When service governance functions such as rate limiting, circuit breaking, degradation, and fault self-healing are triggered, a corresponding alert event is raised at the same time.

(2) Traffic management capability

The service mesh platform can match traffic precisely at a fine granularity: it identifies a specific set of traffic by its attributes and applies controls to exactly that traffic. The platform currently supports enterprise-level traffic control at the label, method, service, and application levels, including rate limiting, circuit breaking, degradation, routing, traffic mirroring, link encryption, authentication, fault drills, and fault isolation.
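As a rough illustration of multi-level matching, the sketch below selects the rate-limit rule that applies to a request. It covers only the matching step, not the limiter itself. The precedence order (label, then method, then service, then application), the class names, and the example values in the test are assumptions, since the article does not describe how the platform resolves overlapping rules.

```python
from dataclasses import dataclass
from typing import List, Optional

# Most specific level first. This precedence order is an assumption.
LEVELS = ["label", "method", "service", "application"]


@dataclass
class Request:
    application: str
    service: str
    method: str
    label: Optional[str] = None  # e.g. a traffic tag; absent on most requests


@dataclass
class LimitRule:
    level: str    # one of LEVELS
    key: str      # value the request must carry at that level
    max_qps: int  # limit to apply when the rule matches


def select_rule(request: Request, rules: List[LimitRule]) -> Optional[LimitRule]:
    """Return the most specific rule matching the request, or None."""
    for level in LEVELS:
        for rule in rules:
            if rule.level == level and getattr(request, level) == rule.key:
                return rule
    return None
```

A request without a label never matches a label-level rule, so it falls through to the method, service, or application rule, which is how a narrow rule for one slice of traffic can coexist with a broad default.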

(3) Fault self-healing capability

Traditional fault handling relies on monitoring and alerting followed by emergency plans that temporarily deal with the faulty node. It depends on business and operations teams writing custom emergency plans and on experienced operations engineers, so it is costly for newcomers. The plan steps are also scattered across documents and hard to maintain, and they tend to become outdated as the business iterates, which increases operational complexity. The service mesh platform provides a unified basic fault self-healing system. It uses the service request failure rate within a time window as the golden indicator, supplemented by the minimum number of calls in the window, a failure rate multiple, and similar parameters, to detect common faults automatically and isolate the faulty node from the network on the client or server side. Once the faulty node recovers, it is automatically returned to service. This gives the business self-healing capability and improves the availability of the distributed system in operation.

Figure 9: Working diagram of fault isolation
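The detection rule described above (failure rate within a window, a minimum call count, and a failure rate multiple) might be combined as in the following sketch. The thresholds, the interpretation of the multiple as "relative to the failure rate of the other nodes", and all names are assumptions; the article does not give the platform's actual formula.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class WindowStats:
    calls: int     # requests sent to the node in the current window
    failures: int  # how many of those requests failed

    @property
    def failure_rate(self) -> float:
        return self.failures / self.calls if self.calls else 0.0


def nodes_to_isolate(window: Dict[str, WindowStats],
                     min_calls: int = 20,
                     min_failure_rate: float = 0.5,
                     rate_multiple: float = 3.0) -> Set[str]:
    """Return the nodes that should be isolated for this window."""
    total_calls = sum(s.calls for s in window.values())
    total_failures = sum(s.failures for s in window.values())
    isolated: Set[str] = set()
    for node, stats in window.items():
        if stats.calls < min_calls:
            continue  # too little traffic to judge the node
        if stats.failure_rate < min_failure_rate:
            continue  # golden indicator is within bounds
        peer_calls = total_calls - stats.calls
        if peer_calls == 0:
            continue  # no peers to compare with; keep the only node
        peer_rate = (total_failures - stats.failures) / peer_calls
        if stats.failure_rate < rate_multiple * peer_rate:
            continue  # peers fail about as often: likely not this node's fault
        isolated.add(node)
    return isolated
```

Because the function is evaluated per window, recovery falls out naturally in this sketch: once a node's failure rate drops in a later window, it is no longer in the returned set and can receive traffic again. The multiple also prevents mass isolation when every node fails at once, for example during a downstream outage.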

(4) Security management capability

The service mesh platform supports security authentication. It supports Chinese national cryptographic algorithms and a variety of mainstream algorithms for building encrypted channels, which makes data transmission more secure, and it takes a zero-trust approach to network security to achieve trusted, encrypted communication across the full link. It can also identify the caller's ID and apply access control policies (blacklist/whitelist) based on that ID. In business scenarios with multiple access parties, this guards against individual client system failures and malicious attacks: abnormal clients can be blacklisted and illegal access denied, protecting system availability.

Figure 10: Schematic diagram of security management and control
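Caller-ID-based access control with a blacklist and an optional whitelist can be sketched as follows. The class `AccessPolicy`, the rule that the blacklist takes precedence over the whitelist, and the rejection of unidentified callers are assumptions chosen for the illustration, in keeping with the zero-trust stance described above.

```python
from typing import Iterable, Optional


class AccessPolicy:
    """Decides whether a caller may access a service, based on its ID."""

    def __init__(self,
                 whitelist: Optional[Iterable[str]] = None,
                 blacklist: Iterable[str] = ()) -> None:
        # whitelist=None means "no whitelist configured", not "empty whitelist".
        self.whitelist = set(whitelist) if whitelist is not None else None
        self.blacklist = set(blacklist)

    def allows(self, caller_id: Optional[str]) -> bool:
        if caller_id is None:
            return False  # unidentified callers are never trusted
        if caller_id in self.blacklist:
            return False  # blacklist wins over whitelist
        if self.whitelist is not None:
            return caller_id in self.whitelist
        return True
```

Blacklisting a misbehaving client is then a policy change in the mesh rather than a code change in the provider, which is what allows an abnormal caller to be cut off quickly.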

PART. 3 Future Outlook

As the next-generation microservice technology in the cloud-native field, service mesh has evolved for more than five years, but it has been run in large-scale production only by a few leading companies, and the financial industry, represented by banks, has had no successful cases so far. The ICBC service mesh has completed business pilots in multi-language, heterogeneous-technology, and edge scenarios. These pilots have largely demonstrated the advantages of service mesh in traffic control and system scalability, as well as the feasibility of sinking service governance into the infrastructure layer and strongly decoupling middleware from business systems.

Next, building on a comprehensive review of the pilot experience, ICBC will expand the scope of pilot applications and fully validate how well service mesh technology adapts to different technical architectures and the bank's diverse business scenarios. At the same time, it will continue to refine platform capabilities and improve performance capacity and stability across the board, in order to provide best practices and a reference for adopting service mesh technology in the financial industry.


