This article is an overview of the first stage of our exploration and practice of cloud-native open collaboration technology.
Ant's infrastructure teams have been driving a comprehensive evolution toward cloud native for more than three years. We have pooled online and offline computing resources into what behaves like one large computer, and decoupled the service system by sinking capabilities into the mesh. It is fair to say that we have fully embraced cloud-native technology and reaped the dividends it brings.
Yet once resources and services had been made cloud native, we found that the operation and maintenance (O&M) system built on these basic capabilities was still far from the open, shared spirit of cloud-native technology, and that its technical approach was inconsistent with the declarative, white-box philosophy of cloud native. At the same time, the lack of matching technical support, historical baggage, and other issues stood in the way of genuinely cloud-native O&M. What I want to introduce today is Ant's technical exploration and practice in cloud-native O&M against this background.
Exploration of large-scale cloud native operation and maintenance
Let's first review Ant's actual practices and the problems they face, starting with the classic O&M platform Ant has operated for many years. A platform of this type generally includes controllers, business models, an orchestration engine, atomic tasks, and pipelines; at Ant, such a platform is really a collection of services. These platforms serve centralized, standardized, low-frequency application release and O&M well, but the model shows obvious shortcomings in practice.
First, non-standard applications, application-specific requirements, high-cost requirements, non-urgent requirements, and technical transformation requirements are often poorly served. In Ant's practice, non-standard O&M requirements, high-cost changes that deeply affect the core application and O&M models, and the need to expose large numbers of basic capabilities or O&M functions have gone unmet for a long time; even when such requirements are reasonable, it is hard for them to win enough priority to be implemented. On the R&D side, the O&M platform has accumulated highly complex business logic over the years, so a modification and its tests span long cross-system transformation links, while exposing basic capabilities and productizing O&M capabilities depends on scarce front-end and server-side R&D resources. These problems make platform development ever more difficult, especially in hotspots such as the product GUI, the business model, and the orchestration engine; for lack of extension mechanisms, our internal practice has even seen recurring demand for live code edits and service releases. After a platform goes online, unified quality assurance and online full-link functional verification also come under great pressure. For end users, the black-box computation behind a command button is opaque, hard to audit, and hard to predict, and issues such as impulsive operations and unfamiliar operation interfaces have long threatened online stability. These problems have existed for a long time, and we hope a generational evolution of the technology will resolve them.
As cloud-native basic services gradually stabilized, R&D engineers whose applications fell outside the O&M platform's scope began using the cloud-native community tool chain on their own initiative. Thanks to the high openness and configurability of the Kubernetes ecosystem, they could declare application operation and O&M requirements in a self-service, flexible, and transparent way, and complete release and O&M operations at application granularity.
Using community technologies such as kustomize, users shortened the path to the infrastructure; text templating technologies such as Velocity partially solved the dimension explosion of static YAML files under many variables and the problem of setting defaults; and code review enabled multi-factor changes and audits. Because Kubernetes and its ecosystem provide horizontal capabilities for resources, services, O&M, and security, this simple approach is broadly universal and applicable: you can "replay" the same data against different Kubernetes clusters, and a change to the infrastructure is essentially a flow of declared data. The Git-repository-based R&D method and GitOps process place low demands on product R&D resources, so such tooling can usually be built fairly simply without leaning on dedicated product teams. Compared with the classic O&M platform, these benefits are clear, but from an engineering perspective the drawbacks are just as obvious.
First, the Kubernetes API is complex by design. The low level API provided by Kubernetes alone exposes more than 500 models and more than 2,000 fields, covering almost every aspect of the infrastructure application layer; even professionals struggle to understand every detail. Second, the degree of engineering in this approach is very low: it violates the DRY principle as well as the high-cohesion, low-coupling division of responsibility between teams. Even with some tool support, in one typical internal case a multi-application infra project still maintained more than 50,000 lines of YAML. Meanwhile, because team boundaries split the tooling across platforms, users had to switch between multiple platforms with different operating styles, plus black-screen commands on a jump server; a complete release took 2 days.
Because the degree of engineering is low, cross-team collaboration depends on manual synchronization in chat groups. The final YAML is assembled from parts defined by multiple teams, most of which belong to Kubernetes or the O&M platform team; this content must be tracked and synchronized continuously to avoid corruption, and the long-term maintenance cost is high.
Kusion: a cloud-native open collaboration technology stack
Each of the two models above has clear advantages and clear problems. So can we have both? Can we fully exploit the dividends of cloud-native technology while inheriting the strengths of the classic O&M platform, and build an open, transparent, collaborative O&M system?
With these questions in mind, we explored, practiced, and created Kusion, a cloud-native programmable technology stack built on the idea of infrastructure as code.
As everyone knows, Kubernetes provides a declarative low level API and advocates that ecosystem capabilities be defined and served through CRD extensions, with the whole ecosystem following unified API specification constraints and reusing API technologies and tools. The Kubernetes API specification advocates loosely coupled, reusable low level API objects, so that high level APIs can be "composed" from low level ones. Kubernetes itself offers a minimal solution that favors open-source adoption and deliberately does not include technologies or solutions above the API.
Returning to the origin of cloud-native technology, let's look back at the application technology ecosystem of Borg, the predecessor of Kubernetes. As shown in the figure below, on top of BorgMaster the Borg team built a three-piece access suite: BCL (Borg Configuration Language), a command-line tool, and a corresponding web service. Users write their requirements declaratively in BCL, submit BCL files to the Borg cluster through the command-line tool, and view task details through the web GUI. After extensive research we learned that Google's internal O&M capabilities, product ecosystem, and quality technology ecosystem are all built on this suite, which has been iterated internally for many years.
This inspired us. Today we have container technology, a service system, a large user base with differentiated needs, and a number of automated O&M platforms. We hope that a cloud-native purpose-built language and tools can link the Kubernetes ecosystem, the various O&M platforms, and the large user base; that a single source of truth can eliminate O&M platform islands; and that we can complete the generational evolution of cloud-native infrastructure at the application and O&M levels, achieving "Fusion on Kubernetes".
With this goal in mind, we have kept exploring and practicing. We have now formed the Kusion technology stack and applied it in Ant's production practice.
The Kusion technology stack builds on these basic capabilities and includes the following components:
KCL (Kusion Configuration Language)
KCL interpreter and its Plugin extension mechanism
KCL R&D toolset: Lint, Format, Doc-Gen, IDE Plugin(IDEA, VsCode)
Kusion Kubernetes ecological tools: OpenAPI-tool, KusionCtl(Srv)
Konfig configuration codebase, including platform-side and user-side code
OCMP (Open CloudNative Management Practice) Practice Manual
Kusion sits on top of the infrastructure as an abstraction and management layer that serves upper-layer applications. Users in different roles jointly use the horizontal capabilities provided by the Kubernetes ecosystem, consuming the infrastructure through declarative, intent-oriented definitions. Kusion supports typical cloud-native scenarios as well as some classic O&M scenarios, completing the first stage of construction. Products currently integrated with Kusion include the IaC release and O&M product InfraForm, the site-building product SiteBuilder, and the quick recovery platform. By integrating Kusion into the automation systems, we try to reconcile the black-box imperative automation systems with the open declarative configuration system, so that each can play to its strengths.
Integration & landing
The first problem any new technology system faces is landing. Let's look at Kusion's thinking and practice on integration and landing.
The overall idea is to start from the change-intensive business layer and the orchestration layer of the classic O&M system, externalize the corresponding logic as declarative KCL configuration blocks, and have controllers integrate them automatically.
This way of thinking has precedents, so let's learn from our peers. Take the construction site of Leishenshan Hospital as an example: a large share of the on-site components are prefabricated, tested, and verified elsewhere, then delivered to the site, where a tower crane assembles them. These components require good quality control and must have water pipes, wires, and other "capabilities" built in, otherwise they will not work even once assembled. At the same time, they must leave the business side a certain amount of custom configuration space, and be easy to assemble and automate so as to improve on-site efficiency. The large-scale O&M activities we face resemble such construction sites, and the efficient methods of modern infrastructure are well worth studying.
Mapped onto our actual scenario, our KCL-based way of working needs to meet the following requirements (a KCL sketch of such a component follows the list):
Divisible and coordinated labor: component production, acceptance, delivery, and use can be reasonably divided and coordinated by role, meeting the needs of the software supply chain
External, prefabricated components: components exist independently of the automation system, and prefabricated components are fully tested and verified before delivery
Built-in resources, services, identity, and other elements: components expose only valid business information to users, with cloud-native and trusted logic built in
Easy business customization: components provide a certain degree of custom configuration capability
Easy automation: components support automated assembly and automated "create, delete, update, and query" operations
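As a concrete illustration, the sketch below models such a prefabricated component in KCL. The schema name and fields are hypothetical, but they show the shape we aim for: a small business-facing configuration surface, built-in trusted elements, and delivery-time validation.

```
# A hypothetical prefabricated component in KCL (illustrative names and fields).
schema PrefabComponent:
    # Business-facing configuration surface: the only fields users touch.
    appName: str
    replicas: int = 3

    # Built-in elements: identity, resources, and services are fixed by the
    # platform side and verified before delivery, not configured by users.
    serviceAccount: str = "sa-${appName}"
    cpu: str = "500m"
    memory: str = "512Mi"

    # Delivery-time quality control: violations fail at compile/test time.
    check:
        replicas > 0, "replicas must be positive"
        len(appName) > 0, "appName must not be empty"
```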
Next, let's walk through the typical Kusion-based process, abstracted and simplified to a degree.
As mentioned earlier, the Kubernetes API provides an OpenAPI-based low level API and an extension mechanism, designed for high cohesion, low coupling, easy reuse, and easy assembly, and exposed in the form of Resources and Custom Resources. In addition, Kubernetes provides a large number of command operations on resources such as containers and Pods. SDN, Mesh, and other capability extensions follow the same overall constraints and methods, mostly providing either resource definitions or command operations.
On this foundation, Ant's practice divides the overall workflow into four steps (a condensed KCL sketch of the flow follows the list):
Coding. For resource definitions, KCL structures are generated from OpenAPI model/CRD definitions; for command operations, the corresponding declarative KCL structures are written by hand. These structures correspond to the platform side's definitions of atomic capabilities.
Abstraction. Platform-side PaaS engineers abstract and assemble these atomic declarations into user-facing front-end structures, covering scenarios such as AppConfiguration, Action/Ops, Locality/Topology, SA/RBAC, and Node/Quota, and provide a collection of Templates that simplify writing. Taking AppConfiguration as an example, we provide SigmaAppConfiguration and SigmaJobConfiguration for service-type and task-type application definitions respectively, plus SofaAppConfiguration for the characteristics of SOFA applications. These front-end structures serve as the "interface layer" of Kusion Models. Because of differing business progress, the scenarios have accumulated to different levels of maturity and still need long-term accumulation and polishing.
Configuration. Application-side R&D or SRE engineers describe application requirements against these front-end structures. Users can define an application's configuration baseline and per-environment configurations through structure declarations; in most cases this is just a handful of key-value pairs. For complex requirements, users can write logic directly or organize code by inheriting structures.
Automation. Once the application-side configuration is complete, the available "components" are defined and the conditions for automation are in place. Platform-side controllers can perform automated integration work such as compiling, executing, emitting output, modifying code, and querying elements through the KCL CLI or GPL-binding APIs, and users can map and apply KCL code to Kubernetes clusters through the KusionCtl tool.
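The condensed, hypothetical KCL sketch below walks through these steps; real schemas in the Konfig library are far richer.

```
# Step 1 - coding: an atomic structure generated from the Kubernetes
# OpenAPI definition (heavily elided here).
schema Deployment:
    apiVersion: str = "apps/v1"
    kind: str = "Deployment"
    metadata: {str:}    # ObjectMeta, elided
    spec: {str:}        # DeploymentSpec, elided

# Step 2 - abstraction: a business-facing front-end structure written by
# the platform side, exposing only meaningful fields.
schema SofaAppConfiguration:
    image: str
    replicas: int = 1
    needPublicNetwork: bool = False

# Step 3 - configuration: application-side R&D or SRE declare intent as
# simple key-value pairs against the front-end structure.
appConfiguration = SofaAppConfiguration {
    image = "reg.example.com/demo/app:v1"    # hypothetical image
    replicas = 3
}

# Step 4 - automation: controllers compile and apply this code through
# the KCL CLI / APIs, or users run it via KusionCtl.
```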
Through this unified workflow, we have exposed a large number of basic capabilities of the Kubernetes ecosystem in a lightweight way, declaratively encapsulated application-oriented configuration and O&M capabilities on top of the atomic capabilities, and landed them in concrete application scenarios, with Kusion's R&D tools helping users complete the work. Let's look more closely at the layered collaboration between the platform side and the user side. Platform-side engineers abstract and define front-end structures such as SofaAppConfiguration, which covers the business image, required resources, config, secret, sidecar, LB, DNS, replica count, logical resource pool, release strategy, whether resources are oversold, whether to access the public network, and so on.
A front-end structure cannot work on its own; in fact, every front-end structure has a corresponding back-end structure. The back end is transparent to the front end, and the two are separated and decoupled. At runtime, the back-end structure "translates" the data produced by the front-end structure into the corresponding low level API objects. This reverse-dependency approach relies on KCL language capabilities.
From an engineering perspective, the platform-side engineers have in effect defined a lightweight, declarative, application-level API. This front/back separation has many advantages. First, the front-end structures used on the application side stay simple and clean, business-oriented, and independent of implementation details. Second, pointing the compilation at different back-end files dynamically switches back-end structures, enabling platform version switching and similar capabilities. Finally, the separation preserves full flexibility under a unified model: for example, the platform can compile multiple files at once, as in `kcl base.k prod.k backend.k`, combining the baseline, environment configuration, and back-end structure in a single compilation. In fact, every scenario reduces to the paradigm `kcl user_0.k ... user_n.k platform_0.k ... platform_n.k`, where the user.k files are user-side code and the platform.k files are platform-side code. Viewed from another angle, this is how multiple teams collaborate: each team defines platform capabilities and constraints bottom-up, completes the application-level configuration baseline and per-environment characteristics, and finishes the definition of the last mile.
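A minimal sketch of the back-end half of this separation, reusing the hypothetical atomic `Deployment` structure from the earlier sketch: compiled together with the user-side files (e.g. `kcl base.k prod.k backend.k`), it renders the front-end data into low level API objects, so swapping the back-end file swaps the platform implementation.

```
# backend.k (sketch): translate front-end data into low level API objects.
# `appConfiguration` is declared in the user-side files compiled together
# with this file and is visible in the shared package scope.
deployment = Deployment {
    metadata = {name = "demo-app"}    # hypothetical application name
    spec = {
        replicas = appConfiguration.replicas
        template.spec.containers = [{image = appConfiguration.image}]
    }
}
```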
With the workflow clarified, let's look at KCL practice through the Konfig library. In the Konfig code repository we define platform-side and user-side code spaces, share and reuse code through the unified configuration of the codebase, and keep the overall infrastructure code definitions visible. On the user side, code is organized in a three-level directory of project, stack, and component (a component corresponds to an Ant internal application). Take cloudmesh as an example: the project directory tnt/middleware/cloudmesh contains multiple stacks, such as dev and prod, and each stack contains multiple components. Code is isolated along these three dimensions while sharing context.
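The resulting layout looks roughly like this (directories beyond the cloudmesh example are hypothetical):

```
konfig/
└── tnt/middleware/cloudmesh/     # project
    ├── dev/                      # stack
    │   ├── app-a/                # component (an Ant internal application)
    │   └── app-b/
    └── prod/                     # stack
        ├── app-a/
        └── app-b/
```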
For code quality, we use unit testing, integration testing, and other means to assure both platform-side and user-side code, and we are introducing verification methods such as code scanning, configuration replay, configuration verification, and dry-run to assure the reliability of code changes. For development, we use trunk-based development with release branches so that different applications can be developed in parallel with minimal code corruption, and we protect stable branches with tags.
In the IaC product landing scenario, infrastructure description code is managed through standardized structures, code versioning, multi-environment code isolation, and CI pipelines; static and dynamic diffs, simulation, exception notification, and risk-control gating of code changes keep infrastructure changes controllable, while Pull Requests provide change audits and track who changed what. Taking the key steps of a business release scenario as an example: after the business code passes quality assurance and the image is built, the CI process controller automatically updates the image field of the corresponding KCL file in the Konfig repository through the KCL API and initiates a Pull Request, which triggers the release process.
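Conceptually, the controller's automated change is a one-field edit in the application's KCL file, something like the following (hypothetical values):

```
appConfiguration = SofaAppConfiguration {
    # Only this field is rewritten by the CI controller via the KCL API,
    # after which a Pull Request is opened with the diff.
    image = "reg.example.com/demo/app:20210801-3f9c2d"
    replicas = 3
}
```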
IaC provides verification means such as compilation and testing, live-diff, dry-run, and risk-control gating, and supports visualization of the execution process. The product is built on KCL language capabilities and tools with minimal business customization. The whole process starts from an automated modification of Konfig code: platform, application, and SRE collaborate on code, then release online through the product interface, with O&M capabilities such as batching, staged rollout, and rollback. The code "components" in Konfig can be integrated and reused across scenarios; for example, components integrated by the release controller can also be integrated by the site-building controller, and a controller only needs to care about its automation logic, not the internals of the integrated components. Taking the typical site-building scenario from the beginning of this article as an example: after adopting Kusion, the user-side configuration code shrank to 5.5% of its former size, the four platforms users had to face were replaced by one unified code library, and delivery time (absent other exceptions) dropped from 2 days to 2 hours.
Now consider a more dynamic scenario: large-scale quick recovery. On receiving monitoring alarm input, the quick recovery platform decides on a list of abnormal container hostnames and must perform recovery operations such as restarting the containers.
We write the declarative application recovery O&M code in KCL: an online CMDB query is performed through the KCL Plugin extension, the hostname list is converted into a multi-cluster Pod list, and the Pod recovery operations are defined declaratively. The quick recovery platform executes `KusionCtl run AppRecovery.k` to complete the Pod recovery across multiple clusters. This way, the quick recovery controller needs to understand neither the details of container recovery nor the execution details of Kubernetes multi-cluster mapping, and can focus on its own anomaly judgment and decision logic.
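A sketch of what such recovery code could look like; the plugin name, its query function, and the PodRecovery schema are all hypothetical:

```
import kcl_plugin.cmdb    # hypothetical plugin wrapping the online CMDB query

schema PodRecovery:
    cluster: str
    namespace: str
    podName: str
    action: str = "restart"

# Abnormal hostnames decided by the quick recovery platform, passed in as a
# top-level argument, e.g. -D hostnames='["host-1", "host-2"]'.
_hostnames: [str] = option("hostnames")

# Hypothetical plugin call: resolve hostnames to Pods across clusters.
_pods = cmdb.list_pods(_hostnames)

# Declare the desired recovery operation for each Pod; KusionCtl maps the
# result onto the right clusters when running AppRecovery.k.
recoveries = [PodRecovery {
    cluster = p.cluster
    namespace = p.namespace
    podName = p.name
} for p in _pods]
```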
During project landing we also found many platform-side design problems caused by schedule pressure and other factors: for example, platform-side operation definitions are not standardized enough, and common definitions such as application dependencies are too scattered. We will need to keep consolidating and improving these in subsequent landings. Open configuration gives users more flexibility and room, but it demands stronger security guarantees than the black-box approach. In parallel with the open collaboration work, the Trusted Native Technology Department is building a cloud-native trusted platform, which tightly combines identity with Kubernetes technology to provide technical support stronger than community solutions.
For example, with open configuration, could an untrusted, insecure service mount certificates and gain access to a target service to obtain key data? Without identity propagation and a high bar for Pod security, this is entirely possible. By hardening scenarios such as PSP (Pod Security Policy), service verification, and service authentication through the trusted platform, we can strengthen the security policies of critical links as needed. Compared with community solutions, the trusted platform defines a more complete SPIFFE identity and makes that identity act on every link of resources, networks, and services. It is fair to say that trustworthiness is a necessary prerequisite for openness. At the same time, the authentication and isolation capabilities the trusted platform provides also rely on Kusion to encapsulate and expose their atomic capabilities at the application configuration level, making it easier for applications on Kusion to use them. The open collaboration stack and the trusted platform are orthogonal, complementary cloud-native application-layer technologies.
Finally, a summary of integration and landing:
80% of the content is written on the platform side: standard configuration blocks are provided through application-oriented front-end structures, low level API resources and operations are shielded by back-end structure definitions, and the application's workload, orchestration, O&M, and other requirements are described this way. The focus is on what can be defined, what the defaults are, and what the constraints are, all shared and reused through the Konfig repository. The platform side trends toward engine-ification: it concentrates on automated control logic, with KCL code as the extension technology for externalized business logic. We hope that, facing complex O&M business requirements, the platform-side controllers will gradually evolve toward low-frequency or even zero changes.
The remaining 20% is written on the application side, using the platform-side front-end structures as the interface to declare the application's demands. The focus is on what you want and what you want to do: what you write is what you get. The application side organizes code through an engineering structure for multi-project, multi-tenant, multi-environment, and multi-application work, initiates changes through Pull Requests, and completes white-box online changes through the CI/CD pipeline. At the same time, the application side is free to compile, test, verify, and simulate a single application, delivering it for use after full verification; multiple applications can be combined flexibly on demand through KCL language capabilities. Splitting large, complex problems down to application granularity and merging them back on demand after full verification is, in essence, divide and conquer. Matching Ant's actual situation, we support execution and visualization in R&D and test environments through the KusionCtl tool, and drive the online deployment process through the InfraForm and SiteBuilder products.
Collaborative configuration problem model
Having covered the landing ideas and scenario practices, let's drill down into specific collaboration scenarios and analyze the design and application of KCL in configuration scenarios.
First, some key points of writing a lightweight application-level API on the platform side. Platform-side engineers can extend a structure through single inheritance, define attribute dependencies and values through the mixin mechanism, write declarations inside a structure in any order, and use common capabilities such as conditionals and default values.
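A small sketch of these capabilities (schema and mixin names are illustrative, and mixin syntax has evolved across KCL versions):

```
# A mixin "mixes" a common capability into different structures.
schema MonitorMixin:
    monitorPort: int = 9090

schema AppBase:
    mixin [MonitorMixin]
    # References `name`, which is declared below: in-schema declarations
    # are resolved order-independently.
    workload: str = "${name}-deployment"
    name: str
    replicas: int = 1

# Single inheritance: extend and specialize the base structure.
schema SofaApp(AppBase):
    replicas: int = 3
```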
To briefly contrast declarative and imperative styles, take the Fibonacci sequence as an example. A block of declarative code can be regarded as a system of equations: the order in which the equations are written does not essentially affect the solution, and the "solving" is done by the KCL interpreter. This avoids a large amount of imperative assembly and order-of-evaluation code, an optimization that is especially noticeable for structures with complex dependencies.
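A Fibonacci schema in KCL, adapted from the language's samples, makes this concrete:

```
schema Fib:
    # n1 and n2 are written before n yet depend on it: declaration order
    # inside a schema does not affect the solution.
    n1 = n - 1
    n2 = n1 - 1
    n: int
    value: int

    if n <= 1:
        value = 1
    elif n == 2:
        value = 1
    else:
        value = Fib {n = n1}.value + Fib {n = n2}.value

fib8 = Fib {n = 8}.value    # 21
```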
For complex structures, imperative assembly takes more than twice as much code, patch code makes results unpredictable, and execution order must be considered throughout; adjusting the order of dependent modules during modularization is especially cumbersome and error-prone. We write the various supporting capabilities through the mixin mechanism and "mix" them into different structures through mixin declarations.
For the platform side, stability assurance is particularly important.
As the amount of configuration data grows, a sound type system is an effective means of catching problems at compile time. The KCL spec includes a complete type system design, and we are putting static type checking and inference into practice to gradually strengthen type completeness.
At the same time, KCL provides several immutability mechanisms so that users can declare structure attributes immutable on demand. With these two basic but important techniques, a large class of constraint violations can be checked and caught at compile time.
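A sketch of both means: type annotations fail the compilation on mismatched values, and exported (non-underscore) top-level variables are immutable.

```
schema Resource:
    cpu: int        # millicores; assigning a string here fails type checking
    memory: str

res = Resource {
    cpu = 500
    memory = "512Mi"
}

# `res` is exported and immutable: a second `res = ...` assignment is a
# compile error. Mutable intermediate values use the `_` prefix instead.
_tmp = 256
_tmp = _tmp * 2
```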
For business-facing content, KCL supports validation rules and unit tests built into the structure. Take the code shown in the figure below as an example: validation rules for containerPort, services, and volumes are defined in AppBase, and environment-specific rules are layered on in MyProdApp. Validation rules are currently evaluated at runtime; we are working on judging them through static analysis at compile time to catch problems earlier.
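In the spirit of that example, a minimal sketch of how validation rules sit next to the structures they guard:

```
schema AppBase:
    containerPort: int
    services: [str] = []
    volumes: [str] = []

    check:
        containerPort >= 1 and containerPort <= 65535, "containerPort must be in 1..65535"
        len(services) > 0, "at least one service must be declared"

schema MyProdApp(AppBase):
    replicas: int = 3

    check:
        # Environment-specific rule stacked on top of the base rules.
        replicas >= 3, "production requires at least 3 replicas"
```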
In addition, upgrades are a problem the platform side must face. Consider the worst case first: the front-end structure provided to users needs an incompatible adjustment. Following the approach of adding new configuration items and retiring old ones, we need to disable the retired field and inform users in a reasonable way.
When the platform itself has an incompatible update, the problem is similar, except that it is the platform-side back-end structure that must be adjusted, and application-side users do not perceive it directly. For such problems, KCL provides field deprecation: using a disabled field raises a warning or an error at compile time, and an error blocks compilation, forcing users to fix the code during compilation rather than letting the problem reach runtime and cause impact.
For compatible platform-side adjustments, it is usually enough to modify the imported atomic definition files in the back-end structure. Changes to the KCL interpreter itself are verified through unit testing, integration testing, fuzz testing, and so on; changes to plugins are verified by the plugins' own tests. Interpreter and plugin changes are additionally verified against the UT and IT suites that depend on the Konfig codebase to ensure existing code keeps working. Once tested and verified, a Pull Request is initiated and goes through code review.
Let's briefly sort out the application-side collaboration scenarios. Assuming a baseline configuration and a production-environment configuration, there are three typical scenarios in our practice (a KCL sketch follows the three).
In the first scenario, the baseline and the production configuration each define part of a same-named configuration, and KCL automatically merges them into the final configuration block; if the two sides conflict, a conflict error is reported. This is very effective for symmetric configuration scenarios.
In the second scenario, the production configuration is expected to override some configuration items of the baseline, similar to Kustomize's overlay mechanism. This is in fact what most users familiar with Kubernetes want.
In the third scenario, the writer wants a configuration block to be globally unique and unmodifiable in any form: if a same-named configuration appears, an error is reported at compile time. In real scenarios, the baseline and the per-environment configurations can be completed jointly by R&D and SRE, or independently by Dev; Kusion itself does not restrict how users divide the work.
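A sketch of the first two scenarios using KCL's attribute operators as we use them, where `:` declares a mergeable value that must not conflict and `=` overrides:

```
# base.k - baseline configuration
appConfiguration: SofaAppConfiguration {
    image: "reg.example.com/demo/app:v1"
    replicas: 3
}

# prod.k - production configuration, compiled together with base.k
appConfiguration: SofaAppConfiguration {
    # Scenario 1: merged with the baseline; two different `:` values for
    # the same attribute would be a compile-time conflict.
    labels.env: "prod"
    # Scenario 2: an explicit override of a baseline item, overlay-style.
    replicas = 5
}

# Scenario 3 forbids any same-named redeclaration: a globally unique
# block fails compilation if another configuration touches it.
```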
Through these scenario analyses we now have a preliminary picture of KCL. We design KCL with programming language theory and technology and cloud-native application scenarios as inputs, hoping to support the platform and application sides in describing the infrastructure with simple, effective technical means, and to surface problems as much as possible in the KCL compilation and testing stages so as to reduce the frequency of problems at runtime. In addition, we provide convenient language capabilities and tools to help different user groups work more efficiently, and we organize and share code in an engineered way to connect to the Kubernetes API ecosystem.
Abstract model
Having analyzed Kusion's integration and collaborative programming scenarios, we understand how the Kusion technology stack is composed and used. Now let's look at Kusion's key abstract models.
Start with the abstract model of KCL code. Taking the figure below as an example, KCL code forms two directed acyclic graphs during compilation, corresponding to the declarations inside structures and the declarations that use structures. The compilation process can be roughly divided into three steps: unfolding, merging, and substitution. Through this process, most substitution operations are completed at compile time, and the final solution is obtained with only a few computations at runtime. During compilation we perform type checking and value checking simultaneously; the difference is that type checking generalizes, taking the least upper bound in the partial order, while value checking specializes, taking the greatest lower bound.
The KCLVM interpreter adopts a standard layered, decoupled design consisting of three parts: parser, compiler, and VM. We try to complete as much work as possible at compile time, such as graph unfolding, substitution, type checking, and inference, so that the VM stays as simple as possible. In the future, the KCLVM compiler will support compiling to a WASM intermediate representation. We also support extending VM runtime capabilities through the plugin mechanism, and are considering LSP Server support to reduce the cost of IDE and editor integration.
In terms of engineering, we organize KCL code in three levels: project, stack, and component. When code is mapped to Kubernetes clusters, Kusion supports two mapping methods.
The first maps a stack to a namespace, with components living inside that namespace: resource quota is shared within the stack, and components are isolated through SDN and Mesh capabilities. This is the common practice in the community.
The second maps a component to a namespace: the stack is identified by labels, management permissions are managed through SAs, resource quota is defined per component, and components are isolated by the namespace's isolation capabilities. This is Ant's current practice in the online environment. With either mapping, users need not perceive the details of physical cluster docking and switching. In addition, every resource definition in KCL code can be located by a unique resource ID, which is the basis for automated "create, delete, and update" operations on the code.
To support the isolation and mapping logic above, we provide the KusionCtl tool, which helps users with common functions such as project structure initialization, Kubernetes cluster mapping, execution status tracking and display, and identity/permission integration. Users can use KusionCtl to execute and verify work in R&D and test environments.
For online environments, we recommend making changes through Kusion-based O&M products. Defining the infrastructure through KCL code that is open, transparent, declarative, intent-oriented, and hierarchically decoupled is essentially collaborative work oriented around data and its constraints, and a change is a flow of data. Through up-front compilation, computation, and verification, we finally deliver the data to the runtime of each environment. Compared with the flow of computation logic in classic imperative systems, this avoids, to the greatest extent, runtime data errors caused by complex imperative computation; in particular, when computation logic changes, a runtime computation error usually means an online fault.
Finally, let's look at one form of Kusion's technical architecture, again through the layered logic of controllers, business layer, orchestration layer, tasks, and pipelines. Bottom-up, the Kubernetes API Server converges the pipelines and provides native resource definitions, extended through CRDs & Operators to provide stable atomic task definitions. In my personal view, an Operator is, as its name says, an "operator": it repeats a simple loop of receiving orders and executing operations, continuing to operate until the order is fulfilled.
Operators should be kept as simple as possible, avoiding complex business logic decomposition, control logic, and state machines; likewise, avoid creating new Operators for minor differences or using Operators for simple data and YAML conversion. As the convergence point of atomic infrastructure capabilities, an Operator should be as cohesive and stable as possible. At the business and orchestration layers, we write KCL code in the Konfig repository and combine it with GitOps to support application-granularity change, compilation, testing, and verification. The controller layer is highly engineered, focusing on automation logic, with controllers and GUI product interfaces customized to the business scenario's needs. An application's configuration code "component" is shared and reused by multiple controllers; for example, site building, release, and part of O&M all rely on the application's AppConfiguration configuration code block.
Summary & Outlook
Finally, a summary of the open collaboration technology work.
We often say that Kubernetes is the Linux/Unix of cloud computing. Yet compared with Unix's rich peripheral ecosystem, Kubernetes still has a long way to go in supporting technical capabilities; compared with the easy-to-use Shell and tools, we still lack a language and tooling that match Kubernetes' declarative, open, shared design philosophy. Kusion hopes to help in this field: to make the infrastructure more open, efficient, easy to share, and easy to collaborate on, to improve stability, and to simplify access to cloud-native technology facilities.
Our exploration and practice are still at a preliminary stage, and we hope Kusion's technology and service capabilities will play an active role in the evolution of O&M, trustworthiness, and cloud-native architecture.
We hope to drive the genuine coding of infrastructure, facilitate cross-team DevOps, and provide the technical support for continuous deployment and operations. On trustworthiness, support for policy as code, trusted integration, and standardization is one of our follow-up focuses; in particular, integration with the policy engine is a key step in opening up trusted technical capabilities.
On cloud-native architecture, we will continue to drive architecture modernization, use technical means to support faster innovation in upper-layer automation products and businesses, and support and serve infrastructure application scenarios through unified processes and enterprise-grade technical capabilities.
Throughout history, technology has always evolved towards improving the overall efficiency of social collaboration.
The cloud-native open collaboration brought by Kusion is, no doubt, one more footnote to this simple rule taking effect once again.