Author: Wang Siyu, Alibaba Cloud technical expert, OpenKruise community leader

Typically, the Pod is the smallest unit of operation available in Kubernetes. Some companies have hacked the Kubelet code in their clusters so they could do more with containers, but patching the runtime path like this is really the wrong approach, as it is not conducive to open source and community collaboration. Now OpenKruise, a Cloud Native Computing Foundation sandbox project, provides advanced capabilities for operating container runtimes in any vanilla Kubernetes cluster. In this talk, we cover the usage of some OpenKruise features and how they work with the Kubelet and CRI.

This talk is divided into the following parts. First, we introduce the restrictions Kubernetes places on container-runtime operations: its mechanisms expose controls for Pods but not for the runtime itself. Second, we show how OpenKruise extends these container-runtime operations. Third, we walk through a short demo of performing these operations with OpenKruise. Fourth, we briefly introduce our follow-up plans.

What are the restrictions on actions on container runtimes in Kubernetes?

Container runtime in Kubernetes

[figure]

As shown in the figure above, this is the basic structure of Kubernetes. On each node, the Kubelet receives changes from the API server; for example, when the Kubelet receives a Pod creation, it completes the operation by calling the underlying implementations through public interfaces such as CRI (Container Runtime Interface), CNI, and CSI. For the container runtime specifically, creating containers and pulling images is done by calling the real underlying runtime through the CRI interface.

CRI was introduced in Kubernetes 1.5. It consists of protocol buffers and a gRPC API, providing a well-defined abstraction layer. Its purpose is to let the Kubelet shield itself from the details of the underlying runtime implementation and expose only the required interface.
https://github.com/kubernetes/cri-api

Before Kubernetes 1.5, the Kubelet was coupled with Docker: it imported Docker's client and operated on Docker directly. With CRI, the Kubelet does not need to care what the real underlying runtime is; it only needs to call this layer of interface. The implementation behind the interface may be containerd, CRI-O, or Docker.

The responsibility of CRI is to manage the container runtime and images, including starting and stopping containers, operating sandbox containers, collecting container state, and pulling and querying images. CRI therefore provides a relatively complete container interface, as shown in the following figure.

[figure]

Operational limitations of container runtimes in Kubernetes

The Kubernetes API does not provide operations on the container runtime. The only thing it provides is the v1 Pod API (Pod CRUD and the Pod subresources API). Apart from Pod creation and update, the only operations that touch the runtime are the Exec subresource and the Log subresource.

At the API level, Kubernetes restricts users to creating or deleting Pods; on the containers inside, they can only perform operations such as Exec and Log. At the Kubernetes interface level, users cannot pull images, restart containers, or perform similar operations.
[figure]

Is it possible to extend this API?

We found that the Kubelet currently provides no hook or plug-in mechanism that would let an outer layer dynamically extend what it does. So is it possible to add a new component, similar to the Kubelet, that connects to the CRI API and extends the container operations available in Kubernetes?

Such a component would also call the CRI layer; for example, it could pull images and restart containers. On top, it would watch a CRD resource defined on the Kubernetes API, through which users declare the CRI operations they want performed. For example, a user could declare that certain images be pulled or certain containers be restarted.

This is the extension approach we arrived at for container-runtime operations in Kubernetes.

[figure]

What is OpenKruise?

OpenKruise Concept

OpenKruise is an extension suite for Kubernetes that makes up for many of its deficiencies, such as gaps in application workloads (deployment- and release-related functions) and in container-runtime operations. It works with vanilla Kubernetes and provides more powerful and efficient capabilities for managing application containers, sidecars, and image distribution.

In November 2020, OpenKruise joined CNCF as a Sandbox project.
[figure]

OpenKruise itself is not a PaaS platform, but a PaaS platform can better manage and operate cloud-native applications by using the extension capabilities OpenKruise provides. Interested readers can learn more about OpenKruise through the following links.

Github:
https://github.com/openkruise/kruise

WebSite:
https://openkruise.io

Features of OpenKruise

OpenKruise is an extension based on CRDs, and its functions can be roughly divided into five parts:
(1) Application workloads: grayscale release, flow control, and in-place upgrade for stateless applications, stateful applications, and other workloads;
(2) Sidecar container management: more powerful independent definition and independent deployment of sidecars;
(3) Multi-domain application management: when an application is deployed across multiple partitions, spreading it out and managing it by shard;
(4) Application availability protection: protecting the high availability of cloud-native applications running on Kubernetes;
(5) Extended operation capabilities: enhanced operations on the container runtime. This is the main function introduced in this article, and we expand on it in detail below.

[Figure: OpenKruise functional diagram]

Architecture of OpenKruise

[figure]

As shown in the figure, OpenKruise consists of two main components: a centralized one (kruise-manager) and a node-side one (kruise-daemon). The central kruise-manager includes controllers and webhooks. Combining the central role of kruise-manager with the kruise-daemon running on each node accomplishes many capabilities that Kubernetes itself does not provide. Kruise-daemon operates the CRI runtime in an extensible way that avoids modifying the Kubelet.

Extended functions of Runtime

The runtime extensions consist of three core capabilities.

In-place upgrade function

In-place upgrade is the ability to upgrade an image without deleting and recreating the Pod.
[figure]

As shown in the figure above, the first capability is not implemented directly through kruise-daemon; instead it uses a native mechanism of the Kubelet, and it is called in-place upgrade.

How should in-place upgrade be understood? Take a simple example: suppose a Pod, pod-a, was created through a Deployment or an OpenKruise CloneSet, and we want to upgrade the image of its app container from v1 to v2. With a Deployment, the usual approach is a recreate-style update, that is, the Pod is rebuilt to perform the upgrade. After the rebuild, the Pod name and Pod UID have changed (and the image has been upgraded to v2).

The name and UID of the old Pod necessarily change, because it is no longer the same Pod object. With in-place upgrade, by contrast, the Pod object is the original object: the Pod name and UID remain unchanged, and the Pod IP and its node also remain unchanged. The only thing that changes is the image, from v1 to v2. Since the Pod stays on its node, it does not need to be rescheduled by the scheduler, and IP allocation, volume allocation, and mounting are all skipped, so an obvious benefit is that this scheduling time is saved.

As we all know, when an application image is upgraded from v1 to v2, often only the top layers change, while most of the base image and shared layers at the bottom do not.

When we do an in-place upgrade on the same node, we can reuse most of the layers of the original v1 image and only download the small set of layers that changed.

While the app container is being upgraded, the other containers in the Pod, such as the sidecar container, keep running normally and are unaffected. Conversely, when we upgrade the sidecar container, the app container keeps running normally. This largely avoids impact on the business while upgrading a bypass (e.g., operations) container.

1.1 Advantages

• Saves operation time, including Pod scheduling, IP allocation, and volume allocation and mounting;
• Reuses most of the image layers;
• Upgrading one container does not affect the other containers in the Pod.

1.2 How it works

The principle of in-place upgrade can be understood simply as follows: when the Kubelet creates each container, it computes a hash value for it. When an upper layer modifies the image of the app container in the Pod spec, the expected hash of that container changes. When the Kubelet finds that the hash of the app container in the Pod spec is inconsistent with the hash of the actually running container, it stops the old app container and creates a new one from the new image, thereby realizing the container's in-place upgrade.
[figure]
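As a sketch of how in-place upgrade is requested from the workload side, the CloneSet below asks Kruise to apply image changes in place where possible. The manifest follows the OpenKruise v1alpha1 API; treat the concrete names and image tags as illustrative:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: sample
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
        - name: app
          image: nginx:1.25   # changing only this field triggers an in-place upgrade
  updateStrategy:
    # InPlaceIfPossible: upgrade in place when only container images changed,
    # fall back to recreating the Pod otherwise
    type: InPlaceIfPossible
```

With this strategy, editing only `spec.template.spec.containers[*].image` keeps the Pod object, name, UID, and IP intact during the rollout.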

Container restart function
[figure]

Container restart is a function that many businesses, including operations platforms, rely on. You may ask: in Kubernetes, since a Pod is stateless, when you want to restart, can you not just delete the Pod and create a new one?

That is of course possible, but there are many debugging scenarios where the business does not want a freshly rebuilt Pod; it wants the container restarted in place, which is equivalent to restarting the business process inside. For example, it may want to keep data in volumes, or network and stack information. These scenarios are why businesses need the ability to restart a container of a Kubernetes Pod in place.

Kubernetes has no native ability to restart containers. The only option is to manually enter the container and kill the application process inside; when the container exits, the Kubelet pulls it up again. That method is really a hack. With the container restart capability provided by OpenKruise, you only need to create a CR through the API.

What goes in the CR is very clear: the namespace only needs to be the same as the Pod's, and name is a custom name. You specify which Pod and which containers need to be restarted. Once this information is defined, you submit the CR. When Kruise receives it, kruise-manager first passes it through the webhooks, which inject some information into it; then kruise-daemon picks up the CR and, according to the information defined in it, finds the corresponding container of the Pod, executes its preStop hook through the CRI Exec interface, and, after the preStop completes, calls the CRI Stop to stop the container.

This way of stopping is consistent with how the Kubelet itself stops containers when a Pod is deleted. After kruise-daemon stops the old app container, the Kubelet perceives that the app container has stopped, creates a new container, and pulls it up, achieving a graceful in-place container restart.

Code example:

apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: xxx
spec:
  podName: pod-name
  containers:
    - name: app
  strategy:
    # ...
  activeDeadlineSeconds: 300
  ttlSecondsAfterFinished: 1800
status:
  containerRecreateStates:
    - name: app
      phase: Succeeded
  phase: Completed
  # ..
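For illustration, here is a hedged sketch of what the elided strategy section can look like. The field names (failurePolicy, orderedRecreate, terminationGracePeriodSeconds) follow the OpenKruise v1alpha1 ContainerRecreateRequest API; check the official documentation before relying on the exact values:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: restart-app        # illustrative name
spec:
  podName: pod-name
  containers:
    - name: app
  strategy:
    failurePolicy: Fail              # abort if recreating any container fails
    orderedRecreate: false           # recreate listed containers in parallel
    terminationGracePeriodSeconds: 30
  activeDeadlineSeconds: 300         # give up if not finished within 5 minutes
  ttlSecondsAfterFinished: 1800      # garbage-collect this CR 30 minutes after completion
```

activeDeadlineSeconds and ttlSecondsAfterFinished bound how long the request runs and how long the finished CR lingers in etcd.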

Image warm-up function

Warming up images on nodes in advance, including on newly created nodes, can greatly reduce the time subsequent Pod scale-outs take.
[figure]

As we can see from the figure above, OpenKruise provides upper-level users with a CRD called ImagePullJob. The user defines which image needs to be warmed up and can optionally configure a selector, either a label selector over nodes or a Pod selector that warms the image up on the nodes where the matching Pods are located.

When a user creates an ImagePullJob, Kruise's internal logic splits it into a NodeImage CR for each node. After synchronization, the kruise-daemon on each node gets the NodeImage CR corresponding to that node and warms up the images defined in it on that node.

In other words, the image list in each node's NodeImage is the full set of images that all upper-layer ImagePullJobs specify should be pulled on that node. Once kruise-daemon gets the NodeImage, it calls the CRI image-pull interface to complete the warm-up.

Code example:

apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: test-job
spec:
  image: nginx:latest
  parallelism: 10
  selector:
    # ...
  podSelector:
    # ...
  completionPolicy:
    # ...
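To illustrate the per-node split described above, here is a hedged sketch of the NodeImage object that kruise-daemon consumes. The structure is based on the OpenKruise v1alpha1 NodeImage API, but the exact tag and status layout shown here is illustrative and should be verified against the official docs:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: NodeImage
metadata:
  name: node-1            # one NodeImage per node, named after the node
spec:
  images:
    nginx:                # image name -> tags to keep pulled on this node
      tags:
        - tag: latest
status:
  imageStatuses:
    nginx:
      tags:
        - tag: latest
          phase: Succeeded   # per-tag pull result reported by kruise-daemon
```

The spec aggregates every image that any ImagePullJob targets at this node, and the status is how kruise-daemon reports pull progress back to kruise-manager.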

Future project planning

In December 2021, OpenKruise released its first official version, v1.0, bringing cloud-native application automation to a new level. More than two years have passed since OpenKruise released version 0.1 in 2019; over 70 contributors have contributed, and the star count has exceeded 3,000. In 2022, we will work to promote OpenKruise to a CNCF incubating project and to push the field of cloud-native application automation toward further maturity.

Users:
• Alibaba Group, Ant Group, Douyu TV, Shentong, Boss Zhipin
• Hangyin Consumer, Wanyi Technology, Multipoint, Bringg, Zuojiang Technology
• Lyft, Ctrip, Enjoy Wisdom, VIPKID, 1-to-1 in charge
• Xiaohongshu, Bixin, Yonghui Technology Center, who to learn from, Hello Travel
• Spectro Cloud, Aijia Life, Arkane Systems, Dipu Technology, Spark Thinking
• OPPO, Suning, Happy Time, Mobvista, Shenzhen Phoenixwood Network Co., Ltd.
• Xiaomi, NetEase, Meituan Finance, Shopee, LinkedIn

