Background
SOFAStack is Ant Group's commercial, financial-grade cloud-native architecture product. With SOFAStack, you can quickly build a cloud-native microservice system and rapidly develop cloud-native applications that are more reliable, scalable, and easier to maintain. At the macro-architecture level, it provides an evolution path from a single data center, to active-active in the same city, to three centers in two regions, and on to multi-active deployment across regions, so that system capacity can be expanded and scheduled arbitrarily across multiple data centers, making full use of server resources and providing data-center-level disaster recovery to ensure business continuity.
At the application lifecycle management level, SOFAStack provides a multi-mode application PaaS platform, SOFAStack CAFE (Cloud Application Fabric Engine). It delivers full-lifecycle PaaS capabilities covering application management, process orchestration, application deployment, and cluster operation and maintenance. It meets the O&M requirements of both classic and cloud-native architectures in financial scenarios, helps traditional architectures transition smoothly, and keeps fintech risk under control.
In terms of cloud-native architecture O&M, SOFAStack CAFE provides cloud-native multi-cluster release and O&M capabilities for unitized applications through its unitized hybrid cloud product LHC (LDC Hybrid Cloud), realizing multi-region, multi-data-center, and multi-cloud hybrid deployment of applications. This article demystifies LHC and discusses in detail some of our practices in its underlying Kubernetes multi-cluster release system.
Challenge
When the LHC product was first created, the first problem we faced was choosing a suitable underlying Kubernetes multi-cluster framework for it. At that time, the Kubernetes community had just completed its official multi-cluster project, KubeFed, which provided a series of basic multi-cluster capabilities such as multi-cluster management, multi-cluster distribution of Kubernetes resources, and status reflow. It naturally became our best choice at the time.
But as mentioned above, the community framework provides only these "basic capabilities", and for our unitized hybrid cloud product they fall short in many places and even conflict with it. The most prominent problem is that the community has no concept of "unitization": its multi-cluster model is purely a collection of Kubernetes clusters, and for any multi-cluster Kubernetes resource (called a federated resource in KubeFed) the distribution topology can only be per cluster. In the unitized model, however, the resources of an application service are distributed across multiple deployment units, and the relationship between deployment units and clusters is flexible. In our current model, the relationship between a cluster and deployment units is 1:n, i.e. one Kubernetes cluster can contain multiple deployment units. This is where we diverge from the community framework, and it is also our biggest challenge: the upper-layer business needs to manage Kubernetes resources in the dimension of deployment units, while the underlying community framework only recognizes clusters.
In addition, the basic capabilities covered by KubeFed itself are not enough to meet all our needs, for example the lack of tenant isolation between clusters, the lack of support for distributing resource annotations, and the high requirements on network connectivity between the host cluster and member clusters. Resolving these conflicts and filling the capability gaps therefore became the key issues in building the underlying multi-cluster capabilities of the LHC product.
Practice
Below, we walk through, module by module, some specific practices in building the underlying Kubernetes multi-cluster capabilities of the LHC product.
Multi-Topology Federated CRD
In the community KubeFed framework, Kubernetes resources are distributed to multiple clusters through federated CRs.
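A typical federated CR spec looks roughly like the following minimal FederatedDeployment sketch in KubeFed's types.kubefed.io/v1beta1 API (resource and cluster names are illustrative):

```yaml
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: demo-deployment
  namespace: demo
spec:
  placement:              # distribution can only be specified per cluster
    clusters:
    - name: cluster1
    - name: cluster2
  template:               # the single-cluster resource body (a Deployment here)
    metadata:
      labels:
        app: demo
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: demo
      template:
        metadata:
          labels:
            app: demo
        spec:
          containers:
          - name: demo
            image: nginx:1.21   # illustrative image
  overrides:              # per-cluster customizations of the template
  - clusterName: cluster2
    clusterOverrides:
    - path: /spec/replicas
      value: 5
```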
As you can see, it mainly contains three fields: placement specifies the clusters to distribute to, template contains the single-cluster resource body of the federated resource, and overrides specifies the parts of the template body to customize in each member cluster.
As mentioned above, the Kubernetes resources of a unitized application need to be distributed by deployment unit rather than by cluster, so the community CRD above obviously cannot meet the requirement and needs to be modified.
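After the modification, the spec of the new federated CR takes roughly the following shape. This is a sketch: apart from topologyType, the value cell, and the deployment-unit names, the field names shown here are assumptions inferred from the description below.

```yaml
apiVersion: types.kubefed.io/v1beta1   # assumption: same CRD, exposed as another version
kind: FederatedDeployment
metadata:
  name: demo-deployment
  namespace: demo
spec:
  topologyType: cell      # distribute by deployment unit ("cell"); "cluster" keeps the
                          # community-native, cluster-dimension behavior
  placement:
    cells:                # hypothetical field: deployment units instead of clusters
    - name: rz00a
    - name: rz01a
  template:               # single-cluster resource body, same as the community form
    metadata:
      labels:
        app: demo
    spec:
      replicas: 3
  overrides:              # hypothetical: per-deployment-unit customizations
  - cellName: rz00a
    cellOverrides:
    - path: /spec/replicas
      value: 5
```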
As you can see, we did not abandon the community CRD entirely, but "upgraded" it. By turning the concrete "cluster" into an abstract "topology", we made the distribution topology of federated resources fully customizable and broke the restriction of a single cluster dimension. For example, in the spec above we set topologyType to cell, which means the resource is distributed by deployment unit; if it were set to cluster instead, the CR would be fully compatible with the community's native cluster-dimension distribution mode.
Of course, just defining a new CRD does not solve the problem; we also need to modify the corresponding implementation to make it work. However, making the community KubeFed controller aware of the multi-topology model would require extensive changes to its underlying implementation, which would likely end up as a half rewrite with high R&D costs, and the changes could no longer be contributed back upstream, so maintenance costs would also be high. We needed a better way to decouple our multi-topology model from KubeFed's native multi-cluster capabilities.
An independent, extended federation-layer ApiServer
Since we do not want to make too many intrusive modifications to the community KubeFed controller, we need a translation layer that converts the multi-topology federated CRD above into the corresponding community CRD. For a given topology, the conversion logic is deterministic, so the simplest and most efficient approach is to perform the conversion directly in the Kubernetes ApiServer, whose Conversion Webhook capability for CRDs is exactly what this translation layer needs.
Therefore, we pair the KubeFed controller with a dedicated Kubernetes ApiServer to form an independent Kubernetes control plane, which we call the "federation layer". This independent control plane contains only data related to federated multi-clusters, which ensures that it does not interfere with other Kubernetes resources and also avoids a strong dependency on an external Kubernetes cluster during deployment.
So what is special about the federation-layer ApiServer? Its main body is still the native Kubernetes ApiServer, providing everything an ApiServer can provide; what we did was "wrap" it and build in the capabilities the federation layer needs to extend. Several key extensions are described in detail below.
Built-in multi-topology federated CRD conversion capability
As mentioned above, this is the most important capability provided by the federation-layer ApiServer. Using the multi-version capability of Kubernetes CRDs, we define the multi-topology federated CRD and the community CRD as two versions of the same CRD, and then implement the conversion between the two by integrating a Conversion Webhook for this CRD into the federation-layer ApiServer. In this way, on the federation-layer control plane, any federated CR can be read and written in both forms at the same time: the upper-layer business only cares about deployment units (or other business topologies), while the underlying KubeFed controller still only sees clusters, giving it support for the multi-topology federated CRD model without being aware of it.
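As a sketch of the mechanism, the CRD multi-version plus conversion-webhook wiring looks roughly like this. The version names and the webhook service reference are assumptions; in the product the conversion logic is built into the federation-layer ApiServer itself.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: federateddeployments.types.kubefed.io
spec:
  group: types.kubefed.io
  names:
    kind: FederatedDeployment
    plural: federateddeployments
  scope: Namespaced
  versions:
  - name: v1beta1            # community, cluster-dimension form read by the KubeFed controller
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v1alpha1           # hypothetical version name for the multi-topology form
    served: true
    storage: false
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  conversion:
    strategy: Webhook        # conversion handled by logic built into the federation-layer ApiServer
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: federation-conversion    # hypothetical service name and path
          namespace: federation-system
          path: /convert
```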
The following uses the deployment-unit topology as an example to briefly introduce the conversion between it and the cluster topology. In KubeFed, a member cluster is registered by creating a KubeFedCluster object that contains the cluster's access configuration, and a federated CR then selects target clusters by referencing KubeFedCluster object names in its placement. So all the conversion logic has to do is translate the deployment-unit names in the multi-topology federated CR into the KubeFedCluster object names of the corresponding clusters. Since the relationship between a cluster and deployment units is 1:n, we simply create an additional KubeFedCluster object for each deployment unit, containing the access configuration of the cluster it resides in, and generate its name through a unified naming rule that can be addressed from the deployment unit's namespace (i.e. the tenant and workspace group) and name.
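A minimal sketch of such a per-deployment-unit KubeFedCluster object (only the KubeFedCluster schema itself comes from KubeFed; the name, namespace, and naming convention shown here are assumptions for illustration):

```yaml
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  # hypothetical name derived from deployment unit "rz00a" in tenant "tenant-a",
  # workspace group "wsg-prod"; the real naming rule in the product may differ
  name: tenant-a-wsg-prod-rz00a
  namespace: kube-federation-system
spec:
  # access configuration of the physical cluster that hosts deployment unit rz00a
  apiEndpoint: https://cluster1.example.com:6443
  secretRef:
    name: tenant-a-wsg-prod-rz00a-credentials
```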
By analogy, we can easily support more topology types in a similar way, which greatly improves the flexibility of our federated model.
Support direct use of MySQL/OceanBase as the etcd database
An etcd database is an essential dependency of any Kubernetes ApiServer. Within Ant's own infrastructure, we have abundant physical machine resources and a strong DBA team to keep etcd continuously highly available, but this is not the case in the complex external delivery scenarios of SOFAStack. Outside the company, the cost of operating etcd is much higher than operating MySQL. In addition, SOFAStack is often delivered together with OceanBase, and we also want to make full use of the mature multi-data-center disaster recovery capabilities provided by OceanBase (OB) to solve the database high-availability problem.
Therefore, after some research and experimentation, we integrated Kine, the etcd-on-MySQL adapter open-sourced by the k3s community, into the federation-layer ApiServer, so that it can use an ordinary MySQL database directly as its etcd backend and we no longer need to maintain a separate etcd. We also adapted to some behavioral differences between OceanBase and MySQL (such as primary switchover and auto-increment ordering), making it fully compatible with OceanBase so we can enjoy the high availability and strong data consistency that OB provides.
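For illustration only: the product embeds Kine inside the federation-layer ApiServer, but the equivalent standalone wiring, using Kine's documented flags, looks roughly like this (image names and the DSN are illustrative):

```yaml
containers:
- name: kine
  image: rancher/kine            # assumption: pin a published kine release tag in practice
  args:
  # MySQL/OceanBase DSN in kine's documented endpoint format
  - --endpoint=mysql://user:password@tcp(mysql.example.com:3306)/federation
  - --listen-address=0.0.0.0:2379          # expose an etcd-compatible endpoint
- name: federation-apiserver
  image: registry.example.com/federation-apiserver   # hypothetical image
  args:
  - --etcd-servers=http://127.0.0.1:2379   # point the ApiServer at kine instead of etcd
```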
In addition, we also built several Admission Plugins into the federation-layer ApiServer for validating and initializing federation-related resources. Most of them are tied to product business semantics, so we will not go into details here.
It is worth mentioning that all of these extensions can be split out into standalone components and webhooks, so they could also be installed in the community-native plugin fashion without a hard dependency on the standalone ApiServer. We keep the ApiServer separate today mainly to isolate federation-layer data and to make independent deployment and maintenance easier.
To sum up, from an architectural perspective the federation-layer ApiServer mainly acts as a north-south bridge for federated resources. As shown in the figure below, southbound it provides the KubeFed controller with a standard ApiServer carrying community federated resources via the CRD multi-version capability, and northbound it provides upper-layer products with the mapping and conversion from business topology (deployment units) to cluster topology.
KubeFed Controller Capability Enhancement
As mentioned earlier, beyond the federation model itself, the community KubeFed controller's own capabilities cannot meet all our needs, so we enhanced it during productization. Some of the general enhancements have been contributed back to the community, such as support for configuring the controller's worker concurrency and the multi-cluster informer cache sync timeout, and support for preserving special fields of Service objects. The more advanced capabilities are implemented as pluggable enhancements behind Feature Gates, keeping our code base in sync with the upstream community in real time. Below we introduce some representative enhancements.
Support sub-cluster multi-tenant isolation
In SOFAStack, whether in public or private cloud, all resources are isolated at the granularity of tenants and workspaces (groups), ensuring that users and their environments do not affect each other. For KubeFed, the main resource it manages is the Kubernetes cluster, and the community implementation applies no isolation to it at all. This can be seen from the deletion logic for federated resources: when a federated resource is deleted, KubeFed checks every cluster managed in its control plane to make sure the corresponding single-cluster resource is deleted from all member clusters. Under SOFAStack's product semantics this is clearly unreasonable and risks different environments affecting each other.
Therefore, we made some non-intrusive extensions to federated resources and to the KubeFedCluster object that represents a managed member cluster in KubeFed: by injecting some well-known labels, they carry business-layer metadata such as tenant and workspace group information. With this data, we added a pre-selection step for member clusters when the KubeFed controller processes a federated resource, so that any processing of a federated resource limits its read and write scope to the tenant and workspace group it belongs to, achieving complete isolation across tenants and environments.
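A sketch of what such injected metadata might look like on a KubeFedCluster (the label keys are hypothetical; the real well-known keys used in the product may differ):

```yaml
apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  name: tenant-a-wsg-prod-rz00a
  namespace: kube-federation-system
  labels:
    # hypothetical well-known label keys carrying business-layer metadata; the controller
    # pre-selects only clusters whose labels match the federated resource's tenant/workspace
    cafe.sofastack.io/tenant: tenant-a
    cafe.sofastack.io/workspace-group: wsg-prod
# spec (apiEndpoint, secretRef) as in the earlier example, omitted here
```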
Support grayscale release capability
For a financial-grade production release and deployment platform like SOFAStack CAFE, grayscale release is an essential capability. For any change to an application service, we want it to be rolled out gradually to specified deployment units in a user-controllable way, which puts corresponding requirements on the underlying multi-cluster resource management framework.
As seen in the introduction to the federated CRD above, we use placement to specify the deployment units (or other topologies) a federated resource should be distributed to. When a federated resource is issued for the first time, we can achieve grayscale release by gradually adding the deployment units to be released to placement. But when we want to update the resource, there is no way to do grayscale: at this point placement already contains all deployment units, and any modification to the federated resource is immediately synchronized to all of them. Nor can we do grayscale by resetting placement to only the deployment units being released, because that would cause the resources in the other deployment units to be deleted immediately. To support grayscale release, we therefore need a way to specify which deployment units in placement should be updated while the rest remain unchanged.
To this end, we introduced the concept of a placement mask. As the name suggests, it acts like a mask over placement: when the KubeFed controller processes a federated resource, the topology scope it updates becomes the intersection of placement and the placement mask. We then only need to specify the placement mask when updating a federated resource to finely control the range of deployment units affected by the change, achieving fully self-controlled grayscale release.
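A sketch of how this might look on the multi-topology federated CR from earlier (the placementMask field name and shape are assumptions based on the description above; template and overrides are unchanged and omitted):

```yaml
spec:
  topologyType: cell
  placement:
    cells:
    - name: rz00a
    - name: rz01a
  placementMask:        # hypothetical field: only the intersection with placement is updated
    cells:
    - name: rz00a       # this change is rolled out to rz00a only; rz01a keeps its current state
```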
As shown in the figure below, we added a placement mask containing only the deployment unit rz00a to the federated resource. You can see that the sub-resource in rz00a has been updated successfully (its generation has increased to 2), while the resource in rz01a is left untouched (so no new generation is produced), achieving the effect of a grayscale release.
It is worth mentioning that the placement mask solves not only grayscale release but also disaster recovery release. When some clusters (deployment units) become unavailable due to a data center failure, we can continue to release normally to the remaining available deployment units through the placement mask, so the whole multi-cluster release is not blocked by a local failure. After a cluster recovers, the placement mask also prevents unexpected automatic changes from reaching the newly recovered deployment units, keeping the release of changes strongly controlled.
Support custom annotation propagation policy
KubeFed follows a principle when distributing resources: only spec-like attributes are distributed, and status-like attributes are not. The rationale is simple: the spec of member-cluster resources should be strongly controlled by the federation layer, while their status should remain independent. For any Kubernetes object, most attributes are unambiguously one or the other: besides the spec and status fields themselves, metadata fields such as name and labels belong to the spec, while creationTimestamp, resourceVersion, and the like belong to the status. But there is one attribute that can act as both spec and status: annotations.
In many cases, we cannot fit all the spec and status of a Kubernetes object into its actual spec and status fields. A typical example is Service. Anyone familiar with Service knows that, with a LoadBalancer-type Service and the CCM (Cloud Controller Manager) provided by a cloud vendor, we can manage load balancers on different cloud platforms. Since Service is a built-in Kubernetes object, its spec and status are fixed and not extensible, and different vendors' CCMs support different parameters, so the Service's annotations naturally carry these configurations and play the role of spec. At the same time, some CCMs also write certain load balancer states back to annotations, such as intermediate states and error messages during load balancer creation, so the Service's annotations also play the role of status. KubeFed then faces a question: should the annotations field be distributed? The community chose not to distribute it at all. This does not affect annotations acting as status, but it loses all control over annotations acting as spec.
So can we have both? The answer is yes. In KubeFed, every Kubernetes resource type that needs multi-cluster support requires a FederatedTypeConfig object in the federation layer, which specifies information such as the GVKs of the federated type and the single-cluster type. Since whether annotations behave as spec or status also depends on the concrete resource type, this object is the right place for an extension: we added a propagating-annotations configuration to it, in which we can explicitly specify (with wildcard support) which annotation keys of this resource type act as spec. The KubeFed controller distributes and controls those keys, while the remaining keys are treated as status and never overwrite the values on member-cluster resources. With this extension we can flexibly customize the annotation propagation policy for any Kubernetes resource type and achieve complete multi-cluster control over the resource spec.
Taking Service as an example, we configure its FederatedTypeConfig object accordingly.
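A sketch of what that configuration might look like (the base FederatedTypeConfig fields follow KubeFed's API; the propagatingAnnotations stanza is a hypothetical rendering of the extension described above, and the exact field name in the product may differ):

```yaml
apiVersion: core.kubefed.io/v1beta1
kind: FederatedTypeConfig
metadata:
  name: services
  namespace: kube-federation-system
spec:
  targetType:                 # the single-cluster type
    kind: Service
    pluralName: services
    scope: Namespaced
    version: v1
  federatedType:              # the federated type
    group: types.kubefed.io
    kind: FederatedService
    pluralName: federatedservices
    scope: Namespaced
    version: v1beta1
  propagation: Enabled
  # hypothetical extension: annotation keys matching these patterns are treated as spec
  # and are distributed/controlled by the federation layer; all other keys are treated
  # as status and left untouched on member clusters
  propagatingAnnotations:
  - service.beta.kubernetes.io/*
```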
The first figure below shows the delivery template specified in the FederatedService's template, and the second shows the managed Service in an actual member cluster. You can see that the spec-class annotation we specified in the federated resource (such as service.beta.kubernetes.io/antcloud-loadbalancer-listener-protocol) was successfully delivered to the sub-resource, while the status-class annotation belonging to the sub-resource itself (such as status.cafe.sofastack.io/loadbalancer) is preserved as usual and is neither deleted nor overwritten by the strong control of the annotations field.
In addition, we enhanced the KubeFed controller's status reflow capability so that it can reflow the status fields of all federated resource types in real time, and added support for KMS-encrypted storage of member-cluster access configurations at the federation layer to meet financial-grade security compliance requirements, among other improvements. Due to space limitations, we will not cover them one by one.
At this point, the federation layer already meets most of the needs of the upper-layer unitized application release and O&M product. But as mentioned above, we are building a "hybrid cloud" product, and heterogeneous clusters and limited network connectivity are the most typical problems we encounter when operating Kubernetes clusters across environments. For the federation layer, which mainly focuses on Kubernetes application resource management, cluster heterogeneity has little impact: any cluster that conforms to the Kubernetes specification within a certain version range can in theory be managed directly. Limited network connectivity, however, is fatal: KubeFed manages member clusters in push mode, which requires the KubeFed controller to reach each member cluster's ApiServer directly, demanding a great deal of network connectivity. Many user network environments cannot meet such requirements, and even when they can, doing so is expensive (for example, connecting the central cluster's network to every user cluster). We therefore had to find a way to lower the federation layer's requirements on inter-cluster connectivity so that our product can fit more network topologies.
Integrated ApiServer Network Proxy
Since a direct forward connection from the KubeFed controller to a member cluster's ApiServer may be blocked, we need a proxy between the two that can establish a reverse connection, with forward access then going through the long-lived tunnel the proxy establishes. ApiServer Network Proxy (ANP) is an ApiServer proxy developed by the community to solve network isolation inside Kubernetes clusters. It provides exactly the reverse long-connection proxy capability we need, allowing us to access a member cluster's ApiServer normally without a forward network path.
However, ANP mainly solves access to the ApiServer within a single cluster: its connection model is many clients accessing one ApiServer, whereas for multi-cluster management such as the federation layer, the model is one client accessing many ApiServers. We therefore extended ANP's backend connection model to support dynamic backend selection by cluster name: requests are routed to the long-lived tunnel established by the agent that registered that cluster name. The final architecture is shown in the figure below. With this "multi-cluster extended" ANP, we can easily manage multiple clusters even in harsher network environments.
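For intuition only, here is one plausible approximation of cluster-name-based routing using stock ANP flags; the product's actual extension to ANP's backend model is custom and may differ. The binaries and flags below come from the apiserver-network-proxy project, while the image tags and endpoints are assumptions.

```yaml
# Federation-layer side: the proxy server routes proxied requests by destination host.
containers:
- name: proxy-server
  image: registry.k8s.io/kas-network-proxy/proxy-server   # pin a published release tag
  args:
  - --proxy-strategies=destHost,default
  - --server-port=8090        # port the KubeFed controller dials for proxied traffic
  - --agent-port=8091         # port agents dial back to establish the reverse tunnel
---
# In each member cluster: the agent dials out to the proxy server and registers an
# identifier; using the cluster name as the host identifier lets destHost routing
# pick the right tunnel for that cluster.
containers:
- name: proxy-agent
  image: registry.k8s.io/kas-network-proxy/proxy-agent    # pin a published release tag
  args:
  - --proxy-server-host=federation-anp.example.com        # reachable address of the proxy server
  - --proxy-server-port=8091
  - --agent-identifiers=host=cluster1                     # the member cluster's name
```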
Summary
Finally, let us briefly highlight, in terms of concrete product capabilities, some advantages of the SOFAStack CAFE multi-cluster product over community KubeFed:
● Natively supports multi-cluster application release in the dimension of LDC deployment units through an extensible multi-topology federation model, shielding the underlying Kubernetes infrastructure
● Supports multi-tenancy: resources such as Kubernetes clusters can be isolated at the tenant and workspace level in the underlying layer
● Breaks the constraints of a purely declarative model, supporting fine-grained multi-cluster grayscale release as well as disaster recovery release
● Supports advanced capabilities such as custom annotation propagation, complete status reflow, and KMS encryption of cluster access credentials
● Continues to manage all user clusters in push mode with the help of ANP even under limited network connectivity, such as heterogeneous hybrid clouds
● The multi-cluster control plane can be deployed independently of any Kubernetes cluster and supports using MySQL/OceanBase directly as the backend database
At present, SOFAStack has been adopted by more than 50 financial institutions at home and abroad. Among them, enterprises such as Zhejiang Rural Credit and Sichuan Rural Credit are using CAFE's unitized hybrid cloud architecture to manage the full lifecycle of containerized applications and to build multi-region, highly available multi-cluster management platforms.
Future plans
As can be seen from the practice section above, our current use of the underlying multi-cluster framework focuses mainly on Kubernetes cluster management and multi-cluster resource management, but multi-cluster has much broader possibilities. Going forward, we will gradually evolve capabilities including, but not limited to, the following:
● Multi-cluster resource dynamic scheduling capability
● Multi-cluster HPA capability
● Multi-cluster Kubernetes API proxy capability
● Lightweight CRD capability that uses single-cluster native resources directly as templates
We will continue to share our thinking and practice around these capabilities. You are welcome to keep following our multi-cluster products, and we look forward to your comments and exchanges at any time.