In the last article we talked about the history of PaaS, from Cloud Foundry's sad exit to Docker's coronation: it was Docker's seemingly "tiny" improvement that set off a butterfly effect and shook up the entire open source PaaS market.
To help everyone better understand the container, the core technology of PaaS, this article starts from a humble process and explains what a container really is, and how PaaS "forerunners" such as Cloud Foundry implemented containers.
Process vs container
Take the Linux operating system as an example. When a program is executed, the binary file on disk turns into a running process: the data loaded into memory, the values in the registers, the instructions on the stack, and all the other device state involved. A process is the dynamic, live expression of a program's data and state. The goal of container technology is to isolate and restrict that state and data. It can be said that the essence of a container is simply a special kind of process in Linux, and this special process is realized mainly by two mechanisms provided by the Linux system. Let's review them first.
Namespace
Linux Namespace is a feature of the Linux kernel that partitions kernel resources so that one group of processes sees one set of resources while another group sees a different set. It works by assigning a group of processes and resources to the same namespace, with different namespaces referring to different resources; the same resource can also exist in more than one namespace. Examples of such resources are process IDs, host names, user IDs, file names, and certain names related to network access and inter-process communication. The namespace types are listed below; a small sketch after the list shows how each type maps to a flag passed to the kernel:
- Mount namespaces
- UTS namespaces
- IPC namespaces
- PID namespaces
- Network namespaces
- User namespaces
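Each of these namespace types corresponds to a CLONE_NEW* flag handed to the kernel when a process is created or detached (Mount is CLONE_NEWNS, UTS is CLONE_NEWUTS, IPC is CLONE_NEWIPC, PID is CLONE_NEWPID, Network is CLONE_NEWNET, User is CLONE_NEWUSER). The following minimal C sketch is my own illustration, not code from any PaaS product: it uses unshare(2) to move the calling process into a new UTS namespace and change the hostname there, leaving the host's hostname untouched. It needs root (CAP_SYS_ADMIN) to run.

```c
/* A minimal sketch: isolate the hostname with a UTS namespace. Run as root. */
#define _GNU_SOURCE
#include <sched.h>      /* unshare, CLONE_NEWUTS */
#include <stdio.h>
#include <unistd.h>     /* sethostname, gethostname */

int main(void)
{
    if (unshare(CLONE_NEWUTS) == -1) {   /* detach into a new UTS namespace */
        perror("unshare");
        return 1;
    }

    /* This change is only visible inside the new namespace. */
    sethostname("in-container", 12);

    char name[64];
    gethostname(name, sizeof(name));
    printf("hostname inside the namespace: %s\n", name);
    return 0;   /* the host still reports its original hostname */
}
```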
Super process
In a Linux system, the process with PID == 1 is called the super process (the init process). It is the root of the entire process tree and is responsible for spawning all other user processes. Every other process hangs under it, and if it exits, all the processes beneath it are killed.
Isolation & Restriction
We just mentioned isolation and restriction. What do these mean, concretely?
Isolation
Take Docker as an example (the same applies to Cloud Foundry, which simply isn't installed on my machine). We can run the following command to start a simple container:
$ docker run -it busybox /bin/sh
This command tells Docker to run a container from the busybox image and execute /bin/sh once it starts; the -it flags attach standard input (stdin) and allocate a pseudo-terminal (tty) so we can interact with the container. With this command we land in a shell inside the container. Running the top command in the container and on the host gives the following results:
(The output of running top inside and outside the container)
We find only two running processes in the container: one is /bin/sh, the super process with PID == 1, and the other is the top we just ran. Every other process on the host is invisible to the container. This is isolation.
(The isolated top process; image from the web)
Normally, whenever we run a /bin/sh program on the host, the operating system assigns it a process number, say PID == 100. Now suppose we run that /bin/sh inside a container through Docker. When process 100 is created, Docker applies a bit of "sleight of hand" so that it can never see the 99 processes that came before it, and the program running in the container therefore takes itself to be the super process with PID == 1.
What this mechanism really does is manipulate the process space seen by the isolated program: the process displayed as PID == 1 inside the container is actually the process with PID == 100 on the host. The technology behind it is the Linux Namespace mechanism, which in practice is just an optional parameter used when Linux creates a process. In Linux, the function that creates a thread is ("thread" is not a typo here: Linux implements threads as processes, so the same call describes process creation):
int pid = clone(main_function, stack_size, SIGCHLD, NULL);
If we add the CLONE_NEWPID flag to this call:
int pid = clone(main_function, stack_size, CLONE_NEWPID | SIGCHLD, NULL);
The new process will then see a brand-new process space. Because it is the only process in that space, its PID is 1.
Such a process is the most basic form of isolation in a Linux container.
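To make the snippet above concrete, here is a runnable expansion of it: a minimal sketch of my own (not Docker's or Cloud Foundry's actual code) that creates a child with clone() and CLONE_NEWPID. Inside the new namespace the child reports PID 1, while the host sees an ordinary PID. It needs root to run.

```c
/* A sketch of PID-namespace isolation with clone(); run as root. */
#define _GNU_SOURCE
#include <sched.h>      /* clone, CLONE_NEWPID */
#include <signal.h>     /* SIGCHLD */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];             /* stack for the cloned child */

static int child_main(void *arg)
{
    (void)arg;
    /* Inside the new PID namespace this prints 1. */
    printf("PID seen inside the namespace: %d\n", getpid());
    return 0;
}

int main(void)
{
    /* clone() takes the top of the stack on most architectures. */
    pid_t pid = clone(child_main, child_stack + STACK_SIZE,
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        exit(EXIT_FAILURE);
    }
    printf("PID seen by the host: %d\n", pid);   /* an ordinary host PID */
    waitpid(pid, NULL, 0);
    return 0;
}
```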
Restriction
A container that has only Namespace isolation is as incomplete as a programmer without a computer.
If we only isolate without restricting, the program in its cage can still consume as many system resources as it likes. To impose resource limits on an isolated program, a second technique is used: Cgroups.
Cgroups, short for Linux Control Groups, started as a project by Google engineers in 2006. In Linux it sets an upper limit on the resources a group of processes can use, covering CPU, memory, disk I/O, network bandwidth, and more.
Cgroups exposes its API to users as a file system, so users can drive its features simply by modifying the values in those files.
(A process restricted by a cgroup; image from the web)
On a Linux system (Ubuntu here), you can run the following command to view the Cgroups API files:
mount -t cgroup
(Cgroup file system)
As you can see from the figure above, the system has multiple Cgroups subsystems mounted, including cpu, memory, I/O, and others.
Let's use the CPU as an example to see what Cgroups can do. To limit CPU usage we need two parameters, cfs_period and cfs_quota; they are also the two parameters we most often tune to limit CPU for programs running in Docker on our public cloud. The two are used together: within every period of length cfs_period, the process group may be allocated at most cfs_quota of CPU time. In other words, cfs_quota / cfs_period is the CPU usage cap; for example, with the default period of 100000 µs, a quota of 20000 µs caps the group at 20000 / 100000 = 20% of one CPU.
To limit a process's CPU usage, first create a directory named container under /sys/fs/cgroup/cpu:
/sys/fs/cgroup/cpu/ > mkdir container
The container directory now holds a series of CPU limit parameter files generated automatically by the Linux kernel, which tells us we have successfully created a CPU control group named container:
(The default CPU resource file list)
To see the CPU limit in action, let's start an infinite loop with the following shell one-liner:
while : ; do : ; done &
In the output of top we can see that the loop is running as process 398 and, being an infinite loop, it occupies 100% of a CPU:
(The process of infinite loop occupies 100% of the CPU)
Now let's look at cpu.cfs_quota_us and cpu.cfs_period_us in the container directory:
(The CPU limit parameter by default)
This is what they look like before any restriction is applied: a cfs_quota_us of -1 means there is no cap on CPU usage. Now let's change that value:
echo 20000 > /sys/fs/cgroup/cpu/container/cpu.cfs_quota_us
Then write the PID of the loop we started earlier, 398, into the control group's tasks file:
echo 398 > /sys/fs/cgroup/cpu/container/tasks
Running top again, we find that the CPU usage of the infinite loop has dropped to 20%: the CPU resource limit has taken effect.
(Use cgroup to limit the infinite loop process of CPU usage)
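For completeness, the same steps can also be driven from code. The sketch below is only an illustration under the assumptions of the walkthrough above (cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu, and the loop running as PID 398); it is not how Cloud Foundry or Docker actually implements this. Run it as root.

```c
/* A minimal sketch of the cgroup v1 steps above done from C. Run as root. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Write a single string value into a cgroup control file. */
static void write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fprintf(f, "%s", value);
    fclose(f);
}

int main(void)
{
    /* 1. Create the control group (the kernel fills it with control files). */
    mkdir("/sys/fs/cgroup/cpu/container", 0755);

    /* 2. Allow 20000 us of CPU time per 100000 us period, i.e. 20% of one CPU. */
    write_file("/sys/fs/cgroup/cpu/container/cpu.cfs_quota_us", "20000");

    /* 3. Move the target process (398 in the example above) into the group. */
    write_file("/sys/fs/cgroup/cpu/container/tasks", "398");

    return 0;
}
```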
That is the principle of restricting a container with Cgroups. In exactly the same way we can limit a process's memory and network bandwidth, and if the process in question is a container process, we end up with a resource-controlled container. In fact, in the early days of the cloud era, Cloud Foundry and the other "forerunners" all created and managed containers this way. Compared with the latecomers, their isolation and restriction scheme is simple and easy to understand, but it inevitably falls short in some scenarios.
One special note: only containers running on Linux are "simulated" by constraining processes in this way. Containers on Windows and macOS run inside a real virtual machine that container software such as Docker Desktop manages behind the scenes.
Summary
This article started from the principle of containers and the Linux technologies that implement container isolation and restriction, and introduced how PaaS platforms such as Cloud Foundry built containers in the early cloud era. The next article will look at what Docker changed on top of the Cloud Foundry style of container, and how it solved Cloud Foundry's fatal shortcomings.
How did Docker stir things up, and how does a Docker container differ from a traditional virtual machine?
Stay tuned for the next article, where we will continue the conversation.