A cornerstone for understanding container technology: cgroup

Hi everyone, this is Zhang Jintao.

Both container technology and virtualization technology (whatever level of abstraction the virtualization works at) can isolate and restrict resources.

Container technology achieves this restriction and isolation by relying on the cgroup and namespace features provided by the Linux kernel.

Let's first summarize the role of these two technologies:

The main function of cgroup: managing the allocation and restriction of resources;
The main function of namespace: encapsulation, abstraction, restriction, and isolation, so that the processes inside a namespace appear to have their own global resources.

In this article, we will focus on cgroups.

Why pay attention to cgroup & namespace

Explosive growth of cloud native and container technology

From the chroot system call and chroot jail, introduced in 1979 during the development of Unix Version 7, to the open-sourcing of Docker in 2013, the open-sourcing of Kubernetes and the rise of its cloud native ecosystem in 2014, and today's booming cloud native landscape, container technology has gradually become one of the mainstream foundational technologies.

As more and more companies and individuals adopt cloud services and container technology, resource allocation, isolation, and security have become hot topics.

Container technology is not hard to use, but to use it well in a large-scale production environment, we still need to master its core.

The following is a rough history of container technology and the cloud native ecosystem:

Figure 1, the development history of container technology

The figure shows the development trajectory of container technology and the cloud native ecosystem. Container technology actually appeared very early, so why did it only take off after Docker arrived? What were the problems with the early chroot and Linux-VServer?

Security issues brought by Chroot

Figure 2, chroot example

Chroot can isolate a process and its children from the rest of the operating system. However, a process running as root can escape from the chroot.

package main

import (
    "log"
    "os"
    "syscall"
)

func getWd() (path string) {
    path, err := os.Getwd()
    if err != nil {
        log.Println(err)
    }
    log.Println(path)
    return
}

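func main() {
    // Keep a handle on the real root so that we can return to it later.
    RealRoot, err := os.Open("/")
    if err != nil {
        log.Fatalf("[ Error ] - /: %v\n", err)
    }
    // Defer Close only after the error check, when the handle is known to be valid.
    defer RealRoot.Close()
    path := getWd()

    // Enter the jail: make the current directory the new root.
    err = syscall.Chroot(path)
    if err != nil {
        log.Fatalf("[ Error ] - chroot: %v\n", err)
    }
    getWd()

    // fchdir(2) through the saved handle moves the working directory outside the jail.
    err = RealRoot.Chdir()
    if err != nil {
        log.Fatalf("[ Error ] - chdir(): %v", err)
    }
    getWd()

    // chroot(".") from outside the jail restores the real root.
    err = syscall.Chroot(".")
    if err != nil {
        log.Fatalf("[ Error ] - chroot back: %v", err)
    }
    getWd()
}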

Run it as a normal user and then with sudo:

➜  chroot go run main.go 
2021/11/18 00:46:21 /tmp/chroot
2021/11/18 00:46:21 [ Error ] - chroot: operation not permitted
exit status 1
➜  chroot sudo go run main.go
2021/11/18 00:46:25 /tmp/chroot
2021/11/18 00:46:25 /
2021/11/18 00:46:25 (unreachable)/
2021/11/18 00:46:25 /

When run with sudo, the program switches from the current directory back to the system's original root directory; in other words, it escapes the chroot. An ordinary user is not even permitted to call chroot.

Security vulnerabilities of Linux VServer

Linux-VServer is a soft partitioning technology based on Security Contexts, which can isolate virtual servers that share the same hardware resources. Its main problem is that VServer does not protect well against "chroot-again" attacks: an attacker can use this weakness to escape the restricted environment and access arbitrary files outside the restricted directory. (Related vulnerabilities have been published in the National Information Security Vulnerability Database since 2004.)

Advantages of modern container technology

Lightweight: based on the cgroup and namespace capabilities provided by the Linux kernel, containers are very cheap to create;
A certain degree of isolation;
Standardization: packaging and distributing applications as container images shields many problems caused by inconsistent environments;
DevOps support (applications can easily be moved between environments such as development, testing, and production while retaining all of their functionality);
Added protection for the infrastructure, improving reliability and scalability;
DevOps/GitOps support (fast and effective continuous delivery, with version and configuration management);
Team members can effectively simplify, accelerate, and orchestrate the development and deployment of applications.

Now that we understand why technologies such as cgroups and namespaces deserve attention, let's get to the focus of this article and learn about cgroups together.

What is a cgroup

Cgroup is a feature of the Linux kernel that limits, controls, and isolates the resources (such as CPU, memory, and disk I/O) used by a group of processes. It was originally developed by two engineers from Google and has been available since Linux kernel v2.6.24, which was officially released in January 2008.

So far, cgroup has two major versions: cgroup v1 and cgroup v2. The following content is based on cgroup v2; the differences between the two versions are described later in this article.

The main resources that cgroups can restrict are:

CPU
Memory
Network
Disk I/O

When we allocate available system resources to cgroups by a certain percentage, the remaining resources can be used by other cgroups or other processes on the system.

Figure 4, cgroup resource allocation and remaining available resources example

Composition of cgroup

cgroup stands for "control group" and is never capitalized. A cgroup is a mechanism for organizing processes hierarchically and distributing system resources along that hierarchy in a controlled manner. The singular form is usually used to refer to the whole feature, and also as a qualifier, as in "cgroup controller".

There are two main components of a cgroup:

core - responsible for hierarchically organizing processes;
controller - usually responsible for distributing a specific type of system resource along the hierarchy. Each cgroup has a cgroup.controllers file that lists all the controllers available for the cgroup to enable. When multiple operations are specified in one write to cgroup.subtree_control, either all of them succeed or all of them fail. If multiple operations are specified on the same controller, only the last one takes effect. Controllers in a cgroup are destroyed asynchronously and may be kept alive for a while by lingering references;

All cgroup core interface files are prefixed with "cgroup.". The interface files of each controller are prefixed with the controller name and a dot. A controller name consists of lowercase letters and "_", but never starts with "_".
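The naming convention above can be checked mechanically. The following is a minimal Go sketch, not part of any cgroup library: the function name classifyInterfaceFile is hypothetical, and it only inspects the file name without touching the filesystem.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyInterfaceFile splits an interface file name at the first dot and
// reports which component owns it: the core ("cgroup" prefix) or a controller.
// ok is false when the name does not follow the naming convention.
func classifyInterfaceFile(name string) (owner string, isCore bool, ok bool) {
	prefix, _, found := strings.Cut(name, ".")
	if !found || prefix == "" || strings.HasPrefix(prefix, "_") {
		return "", false, false
	}
	// Only lowercase letters and "_" are allowed in the prefix.
	for _, r := range prefix {
		if (r < 'a' || r > 'z') && r != '_' {
			return "", false, false
		}
	}
	return prefix, prefix == "cgroup", true
}

func main() {
	for _, name := range []string{"cgroup.procs", "memory.max", "cpu.weight", "_bad.file"} {
		owner, isCore, ok := classifyInterfaceFile(name)
		fmt.Printf("%-14s owner=%q core=%v valid=%v\n", name, owner, isCore, ok)
	}
}
```

A tool that lists a cgroup directory could use a check like this to group files by controller before displaying them.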

Core interface files of cgroup

cgroup.type - (single value) a read-write file that exists on non-root cgroups. Writing "threaded" to the file converts the cgroup into a threaded cgroup. When read, it shows one of 4 values:

1) domain - a normal, valid domain cgroup
2) domain threaded - a threaded domain cgroup that serves as the root of a threaded subtree
3) domain invalid - a cgroup in an invalid state
4) threaded - a threaded cgroup that is a member of a threaded subtree

cgroup.procs - (newline separated) a read-write file that exists on all cgroups. Each line lists the PID of a process belonging to the cgroup. The PIDs are not ordered, and the same PID may appear more than once if the process was moved to another cgroup and then back;

cgroup.controllers - (space separated) a read-only file that exists on all cgroups. It shows all the controllers available to the cgroup;

cgroup.subtree_control - (space separated) a read-write file that exists on all cgroups and is initially empty. If a controller appears more than once in a write, the last operation on it wins. When multiple enable and disable operations are specified, either all succeed or all fail.

1) The controller name prefixed with "+" means the controller is enabled
2) The controller name prefixed with "-" means the controller is disabled
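To make the "last one wins" and "all or nothing" rules concrete, here is a small Go model of a single write to cgroup.subtree_control. It is only an in-memory sketch of the semantics described above, not the kernel's implementation; applySubtreeControl and sortedNames are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// applySubtreeControl models one write such as "+cpu +memory -io".
// enabled is the current set; available is what cgroup.controllers lists.
// On any error the original set is returned untouched (all or nothing).
func applySubtreeControl(enabled, available map[string]bool, write string) (map[string]bool, error) {
	// Work on a copy so that a failure leaves the current state unchanged.
	next := make(map[string]bool, len(enabled))
	for name := range enabled {
		next[name] = true
	}
	for _, op := range strings.Fields(write) {
		if len(op) < 2 || (op[0] != '+' && op[0] != '-') {
			return enabled, fmt.Errorf("malformed operation %q", op)
		}
		name := op[1:]
		if !available[name] {
			return enabled, fmt.Errorf("controller %q is not available", name)
		}
		// A later operation on the same controller overrides an earlier one.
		if op[0] == '+' {
			next[name] = true
		} else {
			delete(next, name)
		}
	}
	return next, nil
}

// sortedNames returns the set members in a stable order for printing.
func sortedNames(set map[string]bool) []string {
	names := make([]string, 0, len(set))
	for name := range set {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	available := map[string]bool{"cpu": true, "memory": true, "io": true}
	enabled := map[string]bool{"io": true}

	next, err := applySubtreeControl(enabled, available, "+cpu +memory -io -cpu +cpu")
	fmt.Println(sortedNames(next), err)

	// "pids" is not available here, so nothing from this write is applied.
	next, err = applySubtreeControl(enabled, available, "+cpu +pids")
	fmt.Println(sortedNames(next), err)
}
```

In the first write, cpu is enabled, disabled, and enabled again, so it ends up enabled. In the second write the whole request is rejected and the set stays as it was.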

cgroup.events - a read-only file that exists on non-root cgroups. It contains the following fields:

1) populated - 1 if the cgroup or its descendants contain any live processes; otherwise 0.
2) frozen - 1 if the cgroup is frozen; otherwise 0.
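Files such as cgroup.events hold one "key value" pair per line, which makes them easy to parse. The sketch below assumes exactly that layout; parseFlatKeyed is a hypothetical helper, and the sample content is hard-coded instead of being read from /sys/fs/cgroup.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseFlatKeyed parses "key value" lines, the layout used by cgroup.events.
func parseFlatKeyed(content string) (map[string]string, error) {
	result := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue // tolerate blank lines
		}
		key, value, found := strings.Cut(line, " ")
		if !found {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		result[key] = value
	}
	return result, scanner.Err()
}

func main() {
	// Sample content; on a real system it would come from <cgroup>/cgroup.events.
	events, err := parseFlatKeyed("populated 1\nfrozen 0\n")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("populated:", events["populated"] == "1")
	fmt.Println("frozen:", events["frozen"] == "1")
}
```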

cgroup.threads - (newline separated) a read-write file that exists on all cgroups. Each line lists the TID of a thread belonging to the cgroup. The TIDs are not ordered, and the same TID may appear more than once if the thread was moved to another cgroup and then back.

cgroup.max.descendants - (single value) a read-write file. The maximum allowed number of descendant cgroups.
cgroup.max.depth - (single value) a read-write file. The maximum allowed depth of the tree below the current cgroup.

cgroup.stat - a read-only file with the following fields:
1) nr_descendants - the number of visible descendant cgroups.
2) nr_dying_descendants - the number of descendant cgroups that have been deleted by the user and are waiting to be destroyed by the system.

cgroup.freeze - (single value) a read-write file that exists on non-root cgroups. The default value is 0. Writing 1 freezes the cgroup and all of its descendant cgroups: their processes are stopped and do not run again until the cgroup is unfrozen. Freezing a cgroup takes some time. When it completes, the "frozen" value in the cgroup.events file is updated to "1" and a corresponding notification is issued. The frozen state of a cgroup does not affect any cgroup tree operations (deleting, creating, and so on);

cgroup.kill - (single value) a write-only file that exists on non-root cgroups. The only allowed value is 1. Writing 1 kills the cgroup and all of its descendant cgroups (every process in them is killed with SIGKILL). It is generally used to kill a whole cgroup tree; the kill handles concurrent forks and is protected against migrations;

Ownership and migration of cgroups

Each process in the system belongs to exactly one cgroup, and all threads of a process belong to the same cgroup. A process can be migrated from one cgroup to another. Migrating a process does not affect the cgroups of its existing descendant processes.

Figure 5, the cgroup allocation of the process and its children; an example of cross-cgroup migration

Migrating a process across cgroups is an expensive operation, and stateful resources such as memory are not moved together with the process. Therefore, frequently migrating processes across cgroups as a way to apply different resource restrictions is discouraged.

How to implement cross-cgroup migration

Each cgroup has a read-write interface file, "cgroup.procs", which records one PID per line for every process in the cgroup. A process can be migrated by writing its PID to the "cgroup.procs" file of another cgroup.

However, a single write(2) call can migrate only one process. If a process has multiple threads, all of them are migrated together; the exception is a threaded subtree, where threads of the same process can be placed in different cgroups.

When a process forks a child process, the child is born in the cgroup to which its parent belongs.

A cgroup without any child cgroups or live processes can be destroyed by removing its directory (a cgroup associated only with zombie processes is considered empty and can be removed as well).

What are cgroups

When multiple separate control groups are explicitly mentioned, the plural form "cgroups" is used.

cgroups form a tree structure (a given cgroup may have multiple child cgroups). Each non-root cgroup has a cgroup.events file whose populated field indicates whether the cgroup's sub-hierarchy contains live processes. The cgroup.subtree_control file of a non-root cgroup can only contain controllers that are enabled in its parent's cgroup.subtree_control file.

Figure 6, cgroups example

As shown in the figure, cgroup1 restricts the use of cpu and memory resources, so it controls the CPU cycles and memory allocation of its descendants (that is, it limits the cpu and memory allocation of cgroup2, cgroup3, and cgroup4). cgroup2 enables the memory limit but not the cpu limit. As a result, the memory of cgroup3 and cgroup4 is restricted by the memory settings of cgroup2, while for cpu, cgroup3 and cgroup4 compete freely within the cpu limit of cgroup1.

This also makes it clear that cgroup resources are distributed from top to bottom. A downstream cgroup can further distribute a resource only after that resource has been distributed to it by its upstream cgroup. In other words, the cgroup.subtree_control file of a non-root cgroup can only contain controllers that are enabled in its parent's cgroup.subtree_control file.

So, will the processes inside a cgroup compete with its child cgroups?

Of course not. In cgroup v2, a non-root cgroup can distribute domain resources to its child cgroups only when it contains no processes of its own. In short, only a cgroup that does not contain any processes can enable domain controllers in its cgroup.subtree_control file, which ensures that processes are always on the leaf nodes.
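The two rules discussed so far (top-down distribution, and no processes in a cgroup that distributes resources) can be expressed as a toy model. The Go sketch below is purely illustrative: the Cgroup type and EnableForChildren method are hypothetical, live only in memory, and simplify matters by treating every controller as a domain controller.

```go
package main

import "fmt"

// Cgroup is a toy in-memory model of a cgroup v2 node, not a kernel interface.
type Cgroup struct {
	Name           string
	Parent         *Cgroup         // nil for the root cgroup
	Procs          []int           // PIDs that sit directly in this cgroup
	SubtreeControl map[string]bool // controllers distributed to the children
}

// EnableForChildren models writing "+<controller>" to cgroup.subtree_control.
func (c *Cgroup) EnableForChildren(controller string) error {
	if c.Parent != nil {
		// Top-down rule: the parent must have distributed the controller first.
		if !c.Parent.SubtreeControl[controller] {
			return fmt.Errorf("%s: %q is not enabled in the parent", c.Name, controller)
		}
		// No-internal-process rule: a non-root cgroup holding processes
		// cannot distribute resources to child cgroups.
		if len(c.Procs) > 0 {
			return fmt.Errorf("%s: cgroup still contains processes", c.Name)
		}
	}
	if c.SubtreeControl == nil {
		c.SubtreeControl = make(map[string]bool)
	}
	c.SubtreeControl[controller] = true
	return nil
}

func main() {
	root := &Cgroup{Name: "root"}
	busy := &Cgroup{Name: "busy", Parent: root, Procs: []int{4242}}
	empty := &Cgroup{Name: "empty", Parent: root}

	fmt.Println(root.EnableForChildren("memory"))  // allowed: root is exempt
	fmt.Println(empty.EnableForChildren("memory")) // allowed: parent enabled it, no processes
	fmt.Println(empty.EnableForChildren("cpu"))    // rejected: root never enabled cpu
	fmt.Println(busy.EnableForChildren("memory"))  // rejected: busy holds a process
}
```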

Mount and delegate

How to mount cgroup

cgroup v2 supports the following mount options:

memory_recursiveprot - recursively applies memory.min and memory.low protection to the entire subtree, without requiring explicit propagation down to the leaf cgroups; cgroups inside the subtree can still compete freely with each other;
memory_localevents - can only be set at mount time or changed by remounting from the init namespace. This is a system-wide option. With it, memory.events is populated only with the data of the current cgroup. Without it, the events of all subtrees are counted as well, which is the default;
nsdelegate - can only be set at mount time or changed by remounting from the init namespace. This is also a system-wide option. It treats cgroup namespaces as delegation boundaries, which is one of the two ways to delegate cgroups;

cgroup delegation method

Set the mount option nsdelegate;
Grant the user write access to the directory and to its cgroup.procs, cgroup.threads, and cgroup.subtree_control files.

The results of the two methods are the same. Once delegated, the user can build a sub-hierarchy under the directory, and all of its resource allocation is constrained by the parent node. Currently, cgroup places no restrictions on the number of cgroups in a delegated sub-hierarchy or on its nesting depth (explicit limits may be added later).

Cross-cgroup migration was mentioned earlier. Delegation makes it clear that cross-cgroup migration is restricted for ordinary users: the user must have write access to the "cgroup.procs" file of the target cgroup, and also to the "cgroup.procs" file of the common ancestor of the source cgroup and the target cgroup.

Delegation and migration

Figure 7, an example of delegated authority

As shown in the figure, the ordinary user User0 has the delegated authority of cgroup[1-5].

Why does User0 fail when trying to migrate the process from cgroup3 to cgroup5?

This is because User0 only has the permissions of cgroup1 and cgroup2, and does not have the permissions of cgroup0. The delegation rules clearly state that the user needs write access to the "cgroup.procs" file of the common ancestor, which in the figure means cgroup0.
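Finding the common ancestor is a simple path comparison. The Go sketch below assumes a hypothetical layout in which cgroup3 sits under cgroup1 and cgroup5 sits under cgroup2, both below cgroup0; the paths and the commonAncestor helper are illustrative and are not taken from the figure.

```go
package main

import (
	"fmt"
	"strings"
)

// commonAncestor returns the deepest cgroup path that contains both a and b.
// Paths are relative to the cgroup mount point, for example "/cgroup0/cgroup1".
func commonAncestor(a, b string) string {
	as := strings.Split(strings.Trim(a, "/"), "/")
	bs := strings.Split(strings.Trim(b, "/"), "/")
	var shared []string
	// Collect path components for as long as both paths agree.
	for i := 0; i < len(as) && i < len(bs) && as[i] == bs[i]; i++ {
		shared = append(shared, as[i])
	}
	return "/" + strings.Join(shared, "/")
}

func main() {
	// Assumed layout: the source and target live in different delegated subtrees.
	src := "/cgroup0/cgroup1/cgroup3"
	dst := "/cgroup0/cgroup2/cgroup5"
	fmt.Println("write access needed on cgroup.procs of:", commonAncestor(src, dst))

	// Within one delegated subtree the common ancestor stays inside it.
	fmt.Println(commonAncestor("/cgroup0/cgroup1/cgroup3", "/cgroup0/cgroup1/cgroup4"))
}
```

Under this assumed layout the first call yields /cgroup0, which is outside what User0 can write to, so the migration is refused. The second call yields /cgroup0/cgroup1, so a migration inside one delegated subtree would pass the check.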

Resource allocation model and function

The following are the resource distribution models of cgroups:

Weight - (for example, cpu.weight) all weights are in the range [1, 10000], and the default is 100. Resources are distributed in proportion to the weights.
Limit - in the range [0, max]; the default is "max", which is a no-op (for example, io.max). Limits can be over-committed (the sum of the limits of the child nodes may exceed the amount of resources available to the parent node).
Protection - in the range [0, max]; the default is 0, which is a no-op (for example, io.low). A protection can be a hard guarantee or a best-effort soft boundary, and protections can also be over-committed.
Allocation - in the range [0, max]; the default is 0, meaning no resources. Allocations cannot be over-committed (the sum of the allocations of the child nodes cannot exceed the amount of resources available to the parent node).
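For the weight model, each active sibling receives its weight divided by the sum of all sibling weights. The Go sketch below shows only this arithmetic; shareByWeight is a hypothetical helper, and it assumes every sibling is actively competing for the resource.

```go
package main

import "fmt"

// shareByWeight splits total among siblings in proportion to their weights.
// Weights outside the range [1, 10000] are rejected.
func shareByWeight(weights map[string]int, total float64) (map[string]float64, error) {
	sum := 0
	for name, w := range weights {
		if w < 1 || w > 10000 {
			return nil, fmt.Errorf("weight %d of %q is outside [1, 10000]", w, name)
		}
		sum += w
	}
	shares := make(map[string]float64, len(weights))
	for name, w := range weights {
		shares[name] = total * float64(w) / float64(sum)
	}
	return shares, nil
}

func main() {
	// Two sibling cgroups: one with the default weight, one with three times as much.
	shares, err := shareByWeight(map[string]int{"a": 100, "b": 300}, 100)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("a=%.0f%% b=%.0f%%\n", shares["a"], shares["b"])
}
```

With weights of 100 and 300, the siblings receive 25% and 75% of the contended resource.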

cgroups provide the following functions:

Resource restriction - as illustrated in the cgroup examples above, cgroups can be nested to restrict resources in a tree structure.
Prioritization - when resources are contended, decides which processes get their resources first.
Accounting - monitors and reports resource limits and usage.
Control - controls the state of processes (start, stop, suspend).

cgroup v1 and cgroup v2

Deprecated core functions

cgroup v2 differs considerably from cgroup v1. Let's look at which cgroup v1 features are deprecated in cgroup v2:

Multiple hierarchies, including named ones, are not supported;
The v1 mount options are not supported;
The "tasks" file is removed, and "cgroup.procs" is not sorted
- In cgroup v1, "cgroup.procs" is a list of thread group IDs. The list is not guaranteed to be sorted or free of duplicate TGIDs; if that property is required, user space should sort and deduplicate the list. Writing a thread group ID to this file moves all threads in that group to this cgroup;
cgroup.clone_children is removed. clone_children only affects the cpuset controller. If clone_children is enabled (set to 1) in a cgroup, a new cpuset cgroup copies its configuration from the parent cgroup during initialization;
/proc/cgroups is meaningless for v2. Use the "cgroup.controllers" file in the root cgroup instead;

Problems with cgroup v1

The most significant difference between cgroup v2 and v1 is that cgroup v1 allows any number of hierarchies, which can cause problems. Let's look at this in detail.

When mounting a cgroup hierarchy, you can specify a comma-separated list of subsystems to mount as file system mount options. By default, mounting a cgroup file system attempts to mount a hierarchy that includes all registered subsystems.

If an active hierarchy with exactly the same set of subsystems already exists, it is reused for the new mount.

If no existing hierarchy matches, and any of the requested subsystems is in use in an existing hierarchy, the mount fails with -EBUSY. Otherwise, a new hierarchy associated with the requested subsystems is activated.

It is currently not possible to bind a new subsystem to an active cgroup hierarchy or to unbind a subsystem from one. When a cgroup file system is unmounted, the hierarchy remains active if any child cgroups exist under the top-level cgroup; if there are no child cgroups, the hierarchy is deactivated.

This is the problem in cgroup v1, and it is solved very well in cgroup v2.

The connection between cgroup and container

Here we take Docker as an example. Create a container and limit its available CPU and memory:

➜  ~ docker run --rm -d  --cpus=2 --memory=2g --name=2c2g redis:alpine 
e420a97835d9692df5b90b47e7951bc3fad48269eb2c8b1fa782527e0ae91c8e
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/cpu.max
200000 100000
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/memory.max
2147483648
➜  ~ 
➜  ~ docker run --rm -d  --cpus=0.5 --memory=0.5g --name=0.5c0.5g redis:alpine
8b82790fe0da9d00ab07aac7d6e4ef2f5871d5f3d7d06a5cdb56daaf9f5bc48e
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/cpu.max       
50000 100000
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/memory.max
536870912

As the example above shows, when we use Docker to create a new container and specify CPU and memory limits for it, the cpu.max and memory.max files of the corresponding cgroup are set to the matching values.

To check the resource quota of a container that is already running, you can also read the corresponding files directly.
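The two numbers in cpu.max are a quota and a period in microseconds, and dividing the first by the second gives the CPU count passed to --cpus. The Go sketch below converts the file content back into that number; parseCPUMax is a hypothetical helper, and the sample strings are the values shown in the session above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMax converts the content of cpu.max ("<quota> <period>") into a
// number of CPUs. limited is false when the quota is "max" (no limit).
func parseCPUMax(content string) (float64, bool, error) {
	fields := strings.Fields(content)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("expected two fields, got %d", len(fields))
	}
	period, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil || period <= 0 {
		return 0, false, fmt.Errorf("invalid period %q", fields[1])
	}
	if fields[0] == "max" {
		return 0, false, nil
	}
	quota, err := strconv.ParseInt(fields[0], 10, 64)
	if err != nil || quota < 0 {
		return 0, false, fmt.Errorf("invalid quota %q", fields[0])
	}
	return float64(quota) / float64(period), true, nil
}

func main() {
	for _, content := range []string{"200000 100000", "50000 100000", "max 100000"} {
		cpus, limited, err := parseCPUMax(content)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		if !limited {
			fmt.Printf("%q: no CPU limit\n", content)
			continue
		}
		fmt.Printf("%q: %.1f CPUs\n", content, cpus)
	}
}
```

The first two inputs correspond to the containers started with --cpus=2 and --cpus=0.5 and convert back to 2.0 and 0.5 CPUs.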

Summary

That concludes this detailed introduction to cgroup, one of the cornerstones of container technology. Next, I will write about namespaces and other container technologies, so stay tuned!

You are welcome to subscribe to my official account【MoeLove】
