Author: Shi Guiming (alias: Mocheng)
Ant Group Technical Expert
Technical lead of Ant Group's multi-cloud configuration management system
Focus areas: cloud-native infrastructure, container services, configuration management, IaC, PaC, GitOps, etc.
This article is 2,369 words, about a 7-minute read
Background
KusionStack
Before discussing how Kusion is used at Ant Group, let's first look at how configuration was managed at Ant Group before Kusion.
The figure above shows the application baseline configuration management system before Kusion was introduced. The "application baseline configuration" mentioned here is not an application's dynamic switches. It is the infrastructure configuration injected into the application, such as the software versions it depends on, middleware configuration, and network and database settings.
As the figure shows, the application baseline management system is a standard browser/server (B/S) architecture that provides users with a Console and an API. Over 6 or 7 years of development it has handled most of Ant Group's application baseline configuration requirements. It is rich in functionality and has very sophisticated configuration computation capabilities. At present it supports 15,000+ applications, nearly 400 application configuration items, and 50+ global configurations.
Under this architecture, users at the top layer interact with the system through forms or through the APIs of integrated systems. An RDBMS is the storage medium, and application configuration is stored in a Key-Value-like form. The capability layer mainly provides common capabilities such as role management, authentication and authorization, versioning, and configuration auditing. It also computes application configuration through templates, for example a templated Deployment, and finally renders the user's baseline configuration as a Deployment. Both the templates and the baseline configurations have complex and flexible inheritance capabilities. For example, an application can be configured with a Zone-level baseline (a Zone is a logical data center), an environment-level baseline, or an application-level baseline, and a later level can inherit from an earlier one, much like the inheritance relationship between a subclass and its superclass.
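The layered inheritance described above can be sketched roughly as follows. This is a minimal illustration, not the baseline system's actual logic: the merge order (Zone, then environment, then application, with later layers overriding earlier ones) and all keys and values are assumptions made for the example.

```python
from typing import Any, Dict, List

Config = Dict[str, Any]


def merge(base: Config, override: Config) -> Config:
    """Return a new config where `override` wins; nested dicts merge recursively."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def resolve_baseline(layers: List[Config]) -> Config:
    """Fold the layers in order, so each later layer inherits from and overrides earlier ones."""
    resolved: Config = {}
    for layer in layers:
        resolved = merge(resolved, layer)
    return resolved


# Hypothetical baselines at three levels (names and values are illustrative only).
zone_baseline = {"sidecar": {"version": "1.0"}, "dns": {"ndots": 2}}
env_baseline = {"sidecar": {"version": "1.1"}}
app_baseline = {"replicas": 3, "dns": {"ndots": 5}}

effective = resolve_baseline([zone_baseline, env_baseline, app_baseline])
print(effective)
```

In this sketch the application inherits the sidecar version from the environment level and overrides the Zone-level DNS setting, while the input baselines themselves are left unmodified.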
In addition to the baseline configuration of the application itself, the system also manages global configurations, such as global DNS configuration, load balancing, network configuration, and more. This is a classic architecture, and there is no doubt that it has effectively supported all kinds of configuration requirements over the years, including large-scale events such as 618 and Double 11. However, as Ant Group's cloud-native process advanced, some bottlenecks gradually appeared in this classic architecture. You may have run into similar problems if your configuration management uses an architecture like this. Here are a few examples:
● Flexibility: As the number of businesses grows, the infrastructure configuration of applications becomes more varied and customization requirements keep increasing. The original architecture mainly addresses standard applications and general scenarios;
● Openness: The main code for the core capabilities of the baseline system is owned by the PaaS team, and diverse requirements have to be scheduled internally by that team. The system is not open enough, so the experience of the strong SRE team cannot be reused and accumulated;
● Transparency: Configuration computation is a black box. The calculation logic of many configurations is hardcoded, so it is impossible to determine what a configuration change will ultimately affect and how widely. For example, modifying the global sidecar version has caused abnormalities across batches of online applications.
Industry benchmarking
With these problems in mind, we studied and benchmarked industry practice:
1. In Google's book "The Site Reliability Workbook", Google engineers summarize some common problems from their own practice. One very important point is that, in the course of configuration management, they had not realized that large-scale configuration management is in essence a programming language problem. Declaring and validating configuration requirements can be solved with a language.
2. In Google's own practice, K8s is an open source product built on Google's years of experience in large-scale cluster management, but internally Google mainly uses Borg. Borg's configuration toolkit consists of:
● BCL: Users write BCL code to describe the configuration their infrastructure requires;
● Borgcfg: Users apply the configuration to the Borg cluster through Borgcfg;
● Webconsole: View the release status through the Webconsole.
Our investigation showed that a large part of Google's operations capabilities, products, and quality ecosystem has evolved over the years on top of these three components.
Based on these findings, we arrived at a Borg-like approach to infrastructure configuration management at Ant Group: we use a language, tools, and services to implement Ant Group's next-generation configuration management architecture.
Next Generation Configuration Management Architecture
Under this new architecture, the overall design is no longer just a simple B/S architecture. The user interface for configuration has evolved from browser forms to a central, open configuration repository (the "big library"), and this repository is built on Kusion. How Kusion is used has been covered before, so I will not go into the technical details of the configuration repository itself. The point to emphasize here is that the repository is designed as an architecture that supports delivery to multiple sites.
The new configuration management architecture has the following main features:
● Following the idea of configuration as code, it abstracts a unified application configuration model, builds up reusable model components, and allows configuration code to be written once and migrated to multiple sites. In the abstract domain model, a Stack is the smallest set of configurations and a Project is an abstraction over a group of Stacks. The model covers not only the application baseline configuration of an App but also non-application configurations such as database configuration, load balancing configuration, and even Network Policy.
● Create protection rules that correspond to an organization's unique security, compliance, and governance requirements through a policy controller mechanism.
● Declarative automation that continuously monitors operational status and ensures that the desired state defined in Git is met.
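As a rough sketch of the first two features, the Project/Stack domain model and a policy check over it might look like the following. The field names, the `Policy` type, and the image-pinning rule are all hypothetical and chosen only for illustration. They are not Kusion's actual model or policy controller.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Stack:
    """Smallest set of configurations, e.g. one environment of an application."""
    name: str
    config: Dict[str, Any]


@dataclass
class Project:
    """An abstraction over a group of Stacks."""
    name: str
    stacks: List[Stack] = field(default_factory=list)


# A policy returns an error message, or None when the stack is acceptable.
Policy = Callable[[Stack], Optional[str]]


def require_pinned_image(stack: Stack) -> Optional[str]:
    """Illustrative rule: reject images without a tag or with the 'latest' tag."""
    image = stack.config.get("image", "")
    if ":" not in image or image.endswith(":latest"):
        return f"{stack.name}: image must be pinned to an explicit tag"
    return None


def check_project(project: Project, policies: List[Policy]) -> List[str]:
    """Run every policy against every stack and collect the violations."""
    violations: List[str] = []
    for stack in project.stacks:
        for policy in policies:
            message = policy(stack)
            if message is not None:
                violations.append(message)
    return violations


demo = Project(
    name="demo-app",
    stacks=[
        Stack("dev", {"image": "demo:latest"}),
        Stack("prod", {"image": "demo:1.4.2"}),
    ],
)
violations = check_project(demo, [require_pinned_image])
print(violations)
```

Keeping policies as plain functions over a Stack is one possible design. It lets organization-specific rules be added without touching the model itself.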
Application release case
Next, let's walk through a concrete product case, using the iterative release of an application as the example:
1. During a business iteration, the user modifies the business code and submits it, CI tests run, an image is built, and the built image is pushed to the remote image registry. Then the configuration management service layer, here the auto-image-updater component, automatically writes the update into the corresponding configuration file in the configuration repository.
2. The update triggers a series of quality assurance measures on the configuration repository, such as change scanning, testing, and auditing, and at the same time triggers an application release process. The release process is risk-based and visualized. It promotes the change step by step through pre-release, simulation, and grayscale stages before it finally enters the production environment.
In each promotion stage, the configuration code is fetched from the configuration repository and the configuration management service layer is used to obtain the KCL compilation result, that is, the Spec model. A productized "Live Diff" then compares the Spec with the real runtime state in the production environment, so that participants can better identify the content and scope of the change. Finally, the change is applied to the corresponding environment, for example to the K8s cluster, using risk prevention and control measures such as batched release.
3. Finally, the visualization products used in this process show the release progress, the Diff of the application configuration, and the configuration of historical versions.
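The idea behind the "Live Diff" in step 2 can be sketched as a field-by-field comparison between the desired Spec and the live state. The version below is a simplified illustration under assumed inputs (plain nested dictionaries with made-up fields). It does not reflect how Kusion or K8s actually computes diffs.

```python
from typing import Any, Dict, List, Optional, Tuple

Change = Tuple[str, str, Optional[Any], Optional[Any]]  # (kind, path, live value, desired value)


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. {'spec': {'replicas': 3}} -> {'spec.replicas': 3}."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def live_diff(spec: Dict[str, Any], live: Dict[str, Any]) -> List[Change]:
    """List the changes needed to move the live state to the desired spec."""
    desired, actual = flatten(spec), flatten(live)
    changes: List[Change] = []
    for path in sorted(set(desired) | set(actual)):
        if path not in actual:
            changes.append(("add", path, None, desired[path]))
        elif path not in desired:
            changes.append(("remove", path, actual[path], None))
        elif desired[path] != actual[path]:
            changes.append(("change", path, actual[path], desired[path]))
    return changes


# Hypothetical compiled Spec and live runtime state.
spec = {"spec": {"replicas": 3, "image": "demo:1.4.2"}}
live = {"spec": {"replicas": 3, "image": "demo:1.4.1", "paused": False}}

changes = live_diff(spec, live)
for change in changes:
    print(change)
```

Presenting the result as a list of paths with old and new values is what lets reviewers judge the scope of a change before it is applied to an environment.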
Issues and Outlook
Let’s review a few questions we started with:
1. Flexibility: The new architecture supports all kinds of flexible customization requirements;
2. Openness: With the KCL language and the open configuration repository, users can complete their infrastructure configuration on their own through the new user interface, without waiting for the configuration management platform developers to build it;
3. Transparency: During the change process, change risks can be identified through the productized "Live Diff".
Through the explorations above, we have solved, to a certain extent, the problems Ant Group faced in advancing its cloud-native process. We also ran into various difficulties along the way. One is how to switch from the old architecture to the new one: during this generational evolution the old and new systems have to coexist, which can be handled by means such as double writing. Under the new architecture, two further issues are worth exploring: how to manage global configuration, and how to better extend the dimensions of configuration.
Global site configuration: In the old baseline system, a global configuration is not a simple Key-Value pair and is quite complicated in use. For example, DNSConfig is arranged by priority across tenants, environments, Zones, and so on. Such configurations are complex, and it is difficult to describe them in KCL.
Configuration inheritance and extension: Taking the App baseline configuration as an example, configuration is currently supported mostly at the application level and the application-environment level. Finer-grained configuration, such as per-Zone configuration, has to be implemented by writing if/else in KCL code. Extending to other granularities and automating through APIs bring new challenges.
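To make the DNSConfig example more concrete, here is a minimal sketch of priority-based resolution across tenant, environment, and Zone. The rule structure, the selector keys, and the "highest priority wins" semantics are assumptions for illustration. The source does not describe the real DNSConfig rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DnsRule:
    priority: int              # higher value wins (assumed semantics)
    selector: Dict[str, str]   # e.g. {"tenant": "t1", "zone": "z1"}; empty matches everything
    nameservers: List[str]


def matches(rule: DnsRule, target: Dict[str, str]) -> bool:
    """A rule matches when every selector key equals the target's value."""
    return all(target.get(key) == value for key, value in rule.selector.items())


def resolve_dns(rules: List[DnsRule], target: Dict[str, str]) -> Optional[DnsRule]:
    """Pick the matching rule with the highest priority, or None if nothing matches."""
    candidates = [rule for rule in rules if matches(rule, target)]
    return max(candidates, key=lambda rule: rule.priority, default=None)


# Hypothetical rules: a global default, an environment override, and a tenant+Zone override.
rules = [
    DnsRule(0, {}, ["10.0.0.1"]),
    DnsRule(10, {"env": "prod"}, ["10.1.0.1"]),
    DnsRule(20, {"tenant": "t1", "zone": "z1"}, ["10.2.0.1"]),
]

chosen = resolve_dns(rules, {"tenant": "t1", "env": "prod", "zone": "z1"})
print(chosen.nameservers if chosen else None)
```

Even this small sketch needs selectors, priorities, and a tie-breaking policy, which hints at why such configuration is awkward to express in a purely declarative configuration language.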
There are some internal solutions for these issues, and we look forward to continuing exchanges with you in the follow-up open discussions.
Related Links
Kusion toolchain and engine:
http://github.com/KusionStack/kusion
Kusion model library:
http://github.com/KusionStack/konfig
Roadmap:
http://KusionStack.io/docs/governance/intro/roadmap
Learn more...
KusionStack Star ✨:
https://github.com/KusionStack/Kusion
Recommended reading of the week
KCL: Declarative Cloud-Native Configuration Policy Language
KusionStack Open Source | Kusion Model Library and Tool Chain Exploration Practice
Highlights recap | KusionStack is open source