Still worrying about multi-cluster management? RedHat, Ant, and Alibaba Cloud bring OCM to the open source community

Description: To let developers and users work in multi-cluster and hybrid environments as easily as on a single Kubernetes cluster, using the open source projects and products they already know, RedHat, Ant Group, and Alibaba Cloud have jointly initiated and open sourced OCM (Open Cluster Management; project website: https://open-cluster-management.io/). OCM aims to solve the lifecycle management of resources, applications, configurations, policies, and other objects in multi-cluster and hybrid environments. OCM has submitted an application to the CNCF TOC for incubation as a Sandbox-level project.

Author: Feng Yong (Lu Jing)

In cloud computing, not having heard of Kubernetes is like not knowing that Chongqing hot pot must have chili. Kubernetes has become the de facto standard platform for managing data centers, much as Android is on mobile phones and Windows is on laptops. Around Kubernetes, the open source community has built a rich technology ecosystem. Whether the need is CI/CD, monitoring and operations, application frameworks, or security and intrusion prevention, users can find projects and products that suit them. However, once the scenario extends to multi-cluster and hybrid cloud environments, only a handful of open source technologies are available, and they are often neither mature nor comprehensive.

To let developers and users build on familiar open source projects and products in multi-cluster and hybrid environments as easily as on a single Kubernetes cluster, RedHat, Ant Group, and Alibaba Cloud jointly initiated and open sourced OCM (Open Cluster Management; project website: https://open-cluster-management.io/). OCM aims to solve the lifecycle management of resources, applications, configurations, policies, and other objects in multi-cluster and hybrid environments. OCM has submitted an application to the CNCF TOC for incubation as a Sandbox-level project.

Open Cluster Management

Multi-cluster management development history

Let's go back a few years. While the industry was still debating whether Kubernetes was production-ready, the first players to adopt "multi-cluster federation" technology appeared. Most of them were Kubernetes pioneers operating at far above average scale. RedHat and Google were the earliest entrants and tried KubeFed v1; they later worked with IBM to draw on that experience and launch KubeFed v2. Besides these large companies exploring multi-cluster federation in production, most commercial service products built on Kubernetes have also evolved from single-cluster offerings to multi-cluster and hybrid cloud scenarios. In fact, vendors and business users share common needs, which fall into the following areas:

Multi-regional issues: When the cluster needs to be deployed on a heterogeneous infrastructure or across a wider region.

A Kubernetes cluster relies on etcd as its data persistence layer. As a distributed system, etcd places requirements on the network latency between its members and limits on the number of members. Higher latency can be tolerated to some extent by tuning parameters such as the heartbeat interval, but this cannot satisfy cross-country or cross-continent deployments, nor can it guarantee enough availability zones in large-scale scenarios. To keep etcd running stably, it is therefore generally planned as multiple clusters, one per region. In addition, for reasons of business availability and security, hybrid cloud architectures are increasingly accepted by users. It is difficult to deploy a single etcd cluster across cloud service providers, so the Kubernetes cluster is correspondingly split into several. As the number of clusters grows and administrators struggle to keep up, an aggregated control system that manages and coordinates multiple clusters at once becomes a natural need.
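To make the latency constraint more concrete, here is a rough Python sketch. It follows the commonly cited etcd tuning rule of thumb (heartbeat interval around the maximum round-trip time between members, election timeout at least ten times the round-trip time, and never below the defaults of 100 ms and 1000 ms). The function name and the sample round-trip times are illustrative assumptions, not measurements or part of the original article.

```python
# Hypothetical helper: derive etcd timing flags from the worst-case
# round-trip time (RTT) between etcd members. The multipliers follow the
# usual rule of thumb; treat the result as a starting point, not a guarantee.

DEFAULT_HEARTBEAT_MS = 100    # etcd default for --heartbeat-interval
DEFAULT_ELECTION_MS = 1000    # etcd default for --election-timeout


def suggest_etcd_timeouts(max_rtt_ms: float) -> dict:
    """Return suggested heartbeat and election timeouts in milliseconds."""
    heartbeat = max(DEFAULT_HEARTBEAT_MS, round(max_rtt_ms))
    election = max(DEFAULT_ELECTION_MS, round(10 * max_rtt_ms))
    return {
        "heartbeat_interval_ms": heartbeat,
        "election_timeout_ms": election,
    }


if __name__ == "__main__":
    # Assumed RTTs: same data center vs. an intercontinental link.
    for label, rtt in [("same data center", 1), ("cross-continent", 250)]:
        print(label, suggest_etcd_timeouts(rtt))
```

The larger the round-trip time, the longer the cluster needs to notice a failed leader, which is one reason a single etcd cluster is rarely stretched across continents.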

Scale problem: When a single cluster encounters a bottleneck in scale.

It is true that open source Kubernetes has obvious scale bottlenecks; worse, it is hard to truly quantify the scale a Kubernetes cluster can support. Initially the community provided the kubemark suite to verify cluster performance, but reality fell short of the ideal. kubemark works by repeatedly scaling and scheduling workloads under different numbers of nodes. In practice, however, the causes of Kubernetes performance bottlenecks are complex and the scenarios are many. It is hard for kubemark to describe cluster scale comprehensively and objectively, so it can serve only as a very coarse-grained reference. Later the community adopted scale envelopes to measure cluster capacity along multiple dimensions, and later still came perf-tests, a more advanced cluster stress testing suite. Once users understand the scale problem more clearly, they can plan the distribution of multiple Kubernetes clusters in advance according to their actual scenarios (such as IDC scale and network topology), and the demand for multi-cluster federation emerges.

Disaster tolerance/isolation issues: When there are more granular isolation and disaster tolerance requirements.

Disaster recovery for business applications is achieved by using in-cluster scheduling policies to deploy applications across infrastructure availability zones of different granularities. Combined with technologies such as network routing, storage, and access control, this addresses business continuity after an availability zone fails. But what about failures at the cluster level, or even of the cluster management control plane itself?

As a distributed system, etcd can naturally handle most node failure scenarios. Unfortunately, in practice the etcd service may still go down, whether because of operator error or a network partition. To prevent an etcd problem from "destroying the world", the "blast radius" is often reduced to provide a finer-grained disaster recovery strategy. For example, in practice teams tend to build multiple clusters within a single data center to avoid split-brain problems, and to make each cluster an independent autonomous system that keeps running in full, or at least holds its current state stably, even under a network partition or when the higher-level control plane is offline. This naturally creates the need to control multiple Kubernetes clusters at the same time.

On the other hand, the isolation requirement also stems from the cluster's lack of multi-tenancy capabilities, which leads teams to adopt cluster-level isolation directly. Incidentally, the good news is that control plane fairness and multi-tenant isolation in Kubernetes are being built brick by brick. With the API Priority and Fairness feature, which reached Beta in version 1.20, you can proactively define soft traffic isolation policies per scenario instead of passively restricting traffic through penalty-style ACLs. If workloads are split across multiple clusters at planning time, the isolation problem is solved naturally. For example, we can assign dedicated clusters to big data workloads, or dedicated clusters to specific business applications, and so on.

Main functions and architecture of OCM

OCM aims to simplify the management of multiple Kubernetes clusters deployed in a mixed environment. It can be used to expand multi-cluster management capabilities for different management tools in the Kubernetes ecosystem. OCM summarizes the basic concepts required for multi-cluster management and believes that in multi-cluster management, any management tool needs to have the following capabilities:

Understand the definition of a cluster
Select one or more clusters through a certain scheduling method
Distribute configuration or workload to one or more clusters
Govern user's access control to the cluster
Deploy management probes to multiple clusters

OCM adopts the hub-agent architecture and includes several primitives and basic components for multi-cluster management to meet the above requirements:

Managed clusters are defined through the ManagedCluster API. OCM also installs an agent named Klusterlet in each cluster to handle cluster registration, lifecycle management, and related functions.

The Placement API defines how configuration or workloads are scheduled to clusters. The scheduling result is stored in the PlacementDecision API. Other configuration management and application deployment tools can use the PlacementDecision to determine which clusters need configuration and application deployment.

The ManifestWork API defines the configuration and resource information distributed to a cluster.

The ManagedClusterSet API groups clusters and provides the boundary for user access to clusters.

The ManagedClusterAddOn API defines how a management probe is deployed to multiple clusters and how it communicates securely and reliably with the control plane of the hub.
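How these primitives fit together can be sketched in a few lines of Python. This is a toy model under stated assumptions, not OCM's scheduler: the resources are plain dictionaries shaped like the corresponding Kubernetes objects, the apiVersion strings may differ between OCM releases, the selection logic (label match, then the first N names alphabetically) is a deliberate simplification, and all function names are hypothetical.

```python
# Toy model: Placement picks clusters, PlacementDecision records the result,
# and one ManifestWork per selected cluster carries the resources to apply.

def select_clusters(clusters, match_labels, number_of_clusters=None):
    """Simplified stand-in for Placement: filter by labels, keep first N."""
    selected = sorted(
        c["metadata"]["name"]
        for c in clusters
        if all(
            c["metadata"].get("labels", {}).get(key) == value
            for key, value in match_labels.items()
        )
    )
    return selected if number_of_clusters is None else selected[:number_of_clusters]


def placement_decision(name, cluster_names):
    """Build a PlacementDecision-shaped dict holding the scheduling result."""
    return {
        "apiVersion": "cluster.open-cluster-management.io/v1beta1",  # assumed version
        "kind": "PlacementDecision",
        "metadata": {"name": name},
        "status": {"decisions": [{"clusterName": n} for n in cluster_names]},
    }


def manifest_works_for(decision, work_name, manifests):
    """Create one ManifestWork per decided cluster.

    On the hub, a ManifestWork lives in the namespace named after the
    managed cluster it targets.
    """
    return [
        {
            "apiVersion": "work.open-cluster-management.io/v1",
            "kind": "ManifestWork",
            "metadata": {"name": work_name, "namespace": d["clusterName"]},
            "spec": {"workload": {"manifests": manifests}},
        }
        for d in decision["status"]["decisions"]
    ]


if __name__ == "__main__":
    clusters = [
        {"metadata": {"name": "cluster-b", "labels": {"env": "prod"}}},
        {"metadata": {"name": "cluster-a", "labels": {"env": "prod"}}},
        {"metadata": {"name": "cluster-c", "labels": {"env": "test"}}},
    ]
    decision = placement_decision("demo", select_clusters(clusters, {"env": "prod"}))
    config_map = {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "demo"}}
    for work in manifest_works_for(decision, "demo-config", [config_map]):
        print(work["metadata"]["namespace"], work["metadata"]["name"])
```

A deployment tool that only reads the decision and writes the works never needs to know how the clusters were chosen, which is the separation the APIs above are meant to provide.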

The architecture is shown in the figure below. The registration component is responsible for the registration and lifecycle management of both clusters and management add-ons; work is responsible for resource distribution; placement is responsible for scheduling workloads across clusters. On top of this, developers or SRE teams can conveniently develop and deploy management tools for different scenarios based on the API primitives provided by OCM.

By using OCM's API primitives, the deployment and operation of many other open source multi-cluster management projects can be simplified, and many Kubernetes single-cluster management tools can be extended with multi-cluster capabilities. For example:

Simplify the management of Submariner and other multi-cluster network solutions. Use OCM's add-on management function to centralize the deployment and configuration of Submariner on a unified management platform.
Provide rich multi-cluster scheduling strategies and a reliable resource distribution engine for application deployment tools (KubeVela, ArgoCD, etc.).
Extend existing Kubernetes single-cluster security policy management tools (Open Policy Agent, Falco, etc.) with multi-cluster security policy management capabilities.

OCM also has two built-in management plug-ins for application deployment and security policy management. The application deployment plug-in adopts the subscriber model, which can obtain application deployment resource information from different sources by defining a subscription channel (Channel). Its architecture is shown in the following figure:

At the same time, to integrate closely with the Kubernetes ecosystem, OCM has implemented several designs from Kubernetes sig-multicluster, including KEP-2149 Cluster ID and the clusterset concept in KEP-1645 Multi-Cluster Services API. We are also working with other developers in the community to advance the Work API.

The main advantages of OCM

Highly modular --- freely selectable/tailorable modules

The overall OCM architecture closely resembles a "microkernel" operating system. The OCM chassis provides core capabilities such as the cluster metadata abstraction, while the other extension capabilities are detachable and deployed as independent components. As shown in the figure above, apart from the core capabilities, every upper-level capability of the OCM solution can be tailored to actual needs. For example, if we don't need a complex cluster topology, we can leave out cluster grouping. If we use OCM only for metadata and don't need it to distribute any resources, we can even leave out the agent component responsible for resource distribution entirely. This also helps users adopt OCM gradually: in the initial stage they may need only a small part of the functionality, then introduce more feature components as their scenarios expand. Components can even be hot-plugged into and out of a running control plane.

More inclusive --- Swiss Army Knife for complex usage scenarios

From the start, the OCM solution was designed to build advanced capabilities for complex scenarios by integrating mainstream third-party technical solutions. For example, to support rendering and delivering more complex application resources, OCM supports installing applications as Helm Charts and loading them from remote Chart repositories. The Addon framework is also provided so that users can develop customizations for their own needs through its extensibility interface. For example, Submariner is a multi-cluster network connectivity solution integrated through the Addon framework.

Ease of use --- reduce the complexity of use

To reduce complexity for users and make migration to the OCM solution easier, OCM provides a traditional imperative-style multi-cluster federation control flow. Note that the functions mentioned below are still under development and will be officially released in subsequent versions:

Through ManagedClusterAction, we can issue atomic commands to managed clusters one by one. This is the most intuitive way for a central control system to orchestrate each cluster automatically. A ManagedClusterAction has its own instruction type, instruction content, and execution status.
Through ManagedClusterView, we can actively "project" resources in a managed cluster onto the central system of the multi-cluster federation. By reading the "projection" of these resources at the center, we can make more dynamic and accurate decisions in the federated system.

Practice of OCM in Ant Group

OCM technology has been applied to Ant Group's infrastructure. As a first step, OCM Klusterlets were deployed to the managed clusters one by one using operational methods similar to the community Cluster API, which brought the metadata of dozens of online and offline clusters within Ant into OCM. These Klusterlets give the upper-level product platform the basic capabilities for multi-cluster management and operations, which will make future feature expansion easier. Specifically, this first step of adopting OCM covers the following aspects:

Certificateless: In a traditional multi-cluster federation system, we often need to configure a cluster access certificate in the metadata of each cluster; this is also a required field in the cluster metadata model of KubeFed v2. Because OCM adopts a pull architecture throughout, the agents deployed in each cluster pull tasks from the hub, and the hub never actively accesses the actual clusters. The metadata of each cluster is therefore only a fully "desensitized" placeholder. And because no certificate information needs to be stored, the OCM approach carries no risk of certificates being copied and misused.
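The pull model described above can be illustrated with a small in-memory Python sketch. The `Hub` and `Agent` classes here are hypothetical simplifications: a real Klusterlet talks to the hub's Kubernetes API rather than polling an in-process queue. The point of the sketch is only the direction of the connection, since the hub object holds no credentials or endpoints for any managed cluster.

```python
from collections import defaultdict, deque


class Hub:
    """Central control plane: stores pending work per cluster name.

    Note what is absent: no kubeconfig, certificate, or API endpoint for
    any managed cluster. The cluster name is just a placeholder key.
    """

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, cluster_name, task):
        self._queues[cluster_name].append(task)

    def pull(self, cluster_name):
        """Hand over and clear all pending tasks for one cluster."""
        queue = self._queues[cluster_name]
        tasks = list(queue)
        queue.clear()
        return tasks


class Agent:
    """Runs inside a managed cluster and initiates every exchange."""

    def __init__(self, cluster_name, hub):
        self.cluster_name = cluster_name
        self.hub = hub
        self.applied = []

    def sync_once(self):
        """Pull pending tasks from the hub and apply them locally."""
        tasks = self.hub.pull(self.cluster_name)
        self.applied.extend(tasks)
        return len(tasks)


if __name__ == "__main__":
    hub = Hub()
    agent = Agent("cluster-a", hub)
    hub.enqueue("cluster-a", "apply configmap/demo")
    hub.enqueue("cluster-b", "apply configmap/other")
    print(agent.sync_once(), agent.applied)
```

Because only the agent opens connections, a managed cluster behind a firewall or NAT can still be managed as long as it can reach the hub.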

Automated cluster registration: The previous cluster registration process involved a lot of manual intervention, which lengthened collaboration and communication time and cost flexibility when making changes, such as elasticity at the site level or data center level. In many scenarios, manual verification is still indispensable. The review and verification capabilities of OCM cluster registration can be integrated into in-house approval workflow tools to automate the entire cluster registration process and achieve the following goals:

(1) Simplify the cluster initialization/takeover process. (2) Clearly control the authority of the control center.
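The approval-gated registration flow can be modelled as a small state machine. The sketch below is written in Python under stated assumptions: it mirrors the general OCM idea that a cluster joins only after the hub side both accepts the cluster and approves its certificate signing request, but the class, the state names, and the `allowed_approvers` check are hypothetical stand-ins for an in-house approval tool.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    """Registration request created by the agent of a new cluster."""

    cluster_name: str
    hub_accepts_client: bool = False  # hub-side acceptance of the cluster
    csr_approved: bool = False        # approval of the agent's CSR

    @property
    def state(self) -> str:
        if self.hub_accepts_client and self.csr_approved:
            return "Joined"
        if self.hub_accepts_client or self.csr_approved:
            return "PartiallyApproved"
        return "Pending"


def approve(registration, approver, allowed_approvers):
    """Approve a registration on behalf of an authorised reviewer.

    Raises PermissionError when the approver is not authorised, which is
    where an external approval workflow would plug in.
    """
    if approver not in allowed_approvers:
        raise PermissionError(f"{approver} may not approve cluster registrations")
    registration.hub_accepts_client = True
    registration.csr_approved = True
    return registration


if __name__ == "__main__":
    request = Registration("cluster-a")
    print(request.state)
    approve(request, "sre-oncall", allowed_approvers={"sre-oncall"})
    print(request.state)
```

Keeping the approval step explicit is what lets the control center's authority stay clearly bounded while the rest of the takeover runs unattended.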

Automatic installation and uninstallation of cluster resources: "Taking over" a cluster mainly involves two things: (a) installing the application resources required by the management platform in the cluster, and (b) entering the cluster metadata into the management platform. The resources in (a) can be further divided into cluster-level and namespace-level resources, while (b) is generally a critical operation for the upper-level control system: from the moment the metadata is entered, the product is considered to have taken over the cluster. Before OCM was introduced, all of this had to be prepared manually, step by step. With OCM the entire process can be automated, reducing the cost of manual collaboration and communication. In essence, this means turning cluster takeover into a defined process and defining states on the cluster metadata, so that the product hub can automatically carry out the routine tasks the process requires. Once a cluster is registered in OCM, the procedures for installing and uninstalling resources are clearly defined.

With the above work, dozens of clusters within Ant are now under OCM management. During major events such as Double Eleven, clusters that were created and deleted automatically were also onboarded and removed automatically. Next, the team plans to integrate with application management technologies such as KubeVela to jointly deliver cloud-native management of applications and security policies within Ant.

Practice of OCM in Alibaba Cloud

At Alibaba Cloud, the OCM project is one of the core dependencies that KubeVela relies on for uniform application delivery in hybrid environments. KubeVela is a "one-stop" application management and delivery platform based on the Open Application Model (OAM), and it is currently the only cloud native application platform project hosted by the CNCF Foundation. Functionally, KubeVela provides developers with an end-to-end application delivery model, along with multi-cluster operations capabilities such as canary release, elastic scaling, and observability, and it delivers and manages applications in hybrid environments through a unified workflow. Throughout this process, OCM is the main technology KubeVela uses to implement Kubernetes cluster registration, management, and application distribution strategies.

On the public cloud, these KubeVela features combined with Alibaba Cloud ACK multi-cluster management give users a powerful application delivery control plane that makes the following easy to achieve:

One-click site construction in a hybrid environment. For example, a typical hybrid environment might be a public cloud ACK cluster (production cluster) plus a local Kubernetes cluster (test cluster) managed through ACK multi-cluster. In these two environments, the providers of application components often differ. For example, the database component may be MySQL in the test cluster and the Alibaba Cloud RDS product on the public cloud. In such a hybrid environment, traditional application deployment and operations are extremely complicated. KubeVela lets users define, in a single deployment plan, the artifacts to be deployed, the delivery workflow, and the differentiated configuration for each environment. This not only eliminates tedious manual configuration, but also greatly reduces release and operations risk by relying on the automation and determinism of Kubernetes.

Multi-cluster microservice application delivery: Microservice applications under the cloud native architecture are often composed of diverse components, such as container components, Helm components, middleware components, and cloud service components. KubeVela provides users with a multi-component application delivery model oriented to microservices. Using the distribution strategy provided by OCM, it performs unified application delivery across multi-cluster and hybrid environments, which greatly reduces the difficulty of operating and managing microservice applications.
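The idea of one deployment plan with per-environment differences can be sketched as a base plan plus overrides. The Python below is a generic illustration and does not use KubeVela's actual schema or API: the plan layout, the component fields, and the environment names are all assumptions made for the example.

```python
import copy


def render_for_env(base_plan, overrides, env):
    """Return a copy of the base plan with the overrides for one environment.

    `overrides` maps environment name -> component name -> fields to patch.
    The base plan is never mutated, so it can be rendered repeatedly.
    """
    plan = copy.deepcopy(base_plan)
    for component, patch in overrides.get(env, {}).items():
        plan["components"][component].update(patch)
    return plan


if __name__ == "__main__":
    # Hypothetical plan: the test cluster runs MySQL in-cluster, while
    # production swaps in a managed database service.
    base_plan = {
        "components": {
            "web": {"type": "container", "replicas": 1},
            "database": {"type": "mysql"},
        }
    }
    overrides = {
        "production": {
            "web": {"replicas": 3},
            "database": {"type": "rds"},
        }
    }
    for env in ("test", "production"):
        print(env, render_for_env(base_plan, overrides, env))
```

Keeping the differences in one override table, rather than in separate copies of the plan, is what makes the per-environment drift visible and reviewable.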

In the future, the Alibaba Cloud team will work with partners such as the RedHat/OCM community, Oracle, and Microsoft to further improve KubeVela's application orchestration, delivery, and operations capabilities for hybrid environments, so that the delivery and management of microservice applications in the cloud-native era can truly be both fast and good.

Join the community

The OCM community is still in an early stage of rapid development, and interested companies, organizations, schools, and individuals are very welcome to participate. Here you can work alongside technical experts from Ant Group, RedHat, and Alibaba Cloud, as well as core Kubernetes contributors, to learn, build, and promote the adoption of OCM together.

GitHub address ( https://github.com/open-cluster-management-io )
Learn about OCM through video ( https://www.youtube.com/channel/UC7xxOh2jBM5Jfwt3fsBzOZw )
Come to the community weekly meeting ( https://docs.google.com/document/d/1CPXPOEybBwFbJx9F03QytSzsFQImQxeEtm8UjhqYPNg )
Chat freely in the Kubernetes Slack channel #open-cluster-mgmt ( https://slack.k8s.io/ )
Join the mailing group to browse key discussions ( https://groups.google.com/g/open-cluster-management )
Visit the official website of the community for more information ( https://open-cluster-management.io/ )

On September 10 this year, INCLUSION·The Bund Conference will be held as scheduled. As a global financial technology event, it will continue to maintain the original intention of making technology more inclusive. In the multi-cluster and hybrid cloud architecture open source special session on the afternoon of the 11th, the main developers of the OCM community will bring you the best practices of multi-cluster and hybrid cloud built around OCM. You are welcome to participate offline and have face-to-face communication.

Thank you for your attention and participation in OCM, and welcome to share with more friends who have the same needs. Let us work together to further the multi-cluster and hybrid cloud experience!

Copyright Statement: The content of this article is contributed voluntarily by real-name registered users of Alibaba Cloud, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and does not assume the corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.
