Thoroughly understand the cornerstone of container technology: namespace (Part 2)

Hi everyone, this is Zhang Jintao.

Both container technology and virtualization technology (whatever abstraction level the virtualization works at) isolate and restrict resources.

Container technology implements this resource-level restriction and isolation with two features provided by the Linux kernel: cgroup and namespace.

Let's first summarize the role of these two technologies:

cgroup: manages the allocation and restriction of resources;
namespace: encapsulates, abstracts, restricts and isolates, so that the processes in a namespace appear to have their own global resources;

This article is part of a series. In it, we continue to talk about namespaces.

Namespace types

Let's start with an overview of the namespace types. The previous article introduced four of them: Cgroup, IPC, Network and Mount. Here we continue with the remaining ones.

Namespace	Flag	Isolates
Cgroup	CLONE_NEWCGROUP	Cgroup root directory
IPC	CLONE_NEWIPC	System V IPC, POSIX message queues
Network	CLONE_NEWNET	Network devices, protocol stacks, ports, etc.
Mount	CLONE_NEWNS	Mount points
PID	CLONE_NEWPID	Process IDs
Time	CLONE_NEWTIME	Boot and monotonic clocks
User	CLONE_NEWUSER	User and group IDs
UTS	CLONE_NEWUTS	Hostname and NIS domain name
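
The flags in this table are single-bit constants, so several namespaces can be requested at once by OR-ing them together. The sketch below spells out the values as I know them from the kernel header linux/sched.h (Python 3.12+ also exposes them as os.CLONE_NEW*). The helper name flags_for and the short type names used as keys are made up for illustration.

```python
# CLONE_NEW* values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "time":   0x00000080,  # CLONE_NEWTIME
    "mnt":    0x00020000,  # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}

def flags_for(*ns_types: str) -> int:
    """OR together the CLONE_NEW* bits for the given namespace types."""
    mask = 0
    for ns_type in ns_types:
        mask |= CLONE_FLAGS[ns_type]
    return mask

print(hex(flags_for("pid", "uts")))  # 0x24000000
```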

PID namespaces

In Linux, every process has its own PID. PID namespaces isolate process IDs, which means processes in different PID namespaces can have the same PID.

Process numbering in each PID namespace starts from 1. Inside a PID namespace, fork(2), vfork(2) and clone(2) create further processes, each with a PID that is unique within that namespace.

To use PID namespaces, the kernel needs to be built with the CONFIG_PID_NS option, for example:

(MoeLove) ➜ grep CONFIG_PID_NS /boot/config-$(uname -r)
CONFIG_PID_NS=y

init process

We all know that Linux has a special process, the so-called init process, which is the process with PID 1.

As mentioned above, process numbering in each PID namespace starts from 1. So what is special about PID 1 inside a PID namespace?

First, process 1 in a PID namespace becomes the parent of all orphaned processes in that namespace.

Second, if this process terminates, the kernel sends SIGKILL to all remaining processes in the namespace. This behavior is closely related to graceful shutdown and smooth upgrades of applications on Kubernetes. (If you are interested in this topic, leave a comment and I can write a dedicated article about it.)

Finally, starting from Linux v3.4, calling reboot(2) inside a PID namespace causes the init process of that namespace to exit immediately. This is a rather special technique that can be used to deal with container exit problems on heavily loaded machines.
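
One practical consequence of the first point: whatever runs as PID 1 in a container has to collect the exit status of the processes it ends up parenting, otherwise they linger as zombies. The loop below is only a sketch of that reaping duty built on os.waitpid. It reaps the caller's own children, and the function name reap_children is hypothetical.

```python
import os

def reap_children() -> dict:
    """Wait for every child of this process; return {pid: exit code}."""
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, 0)  # block until any child exits
        except ChildProcessError:
            return reaped                    # no children left to wait for
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        else:
            reaped[pid] = -os.WTERMSIG(status)  # killed by a signal
```

A real init would run such a loop for as long as the container lives, usually triggered by SIGCHLD, rather than once.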

Hierarchical structure of PID namespace

PID namespaces can be nested. Every PID namespace except the initial one has a parent PID namespace.

In other words, PID namespaces form a tree, and every PID namespace in it can be traced back to the root (initial) PID namespace. The depth is not unlimited, though: starting from Linux v3.7, the maximum depth of the tree is limited to 32.

If this maximum depth is exceeded, the call fails with the error "No space left on device" (ENOSPC). (I ran into it once when I tried to nest containers.)

Processes in the same PID namespace (at the same level) are visible to each other.

But a process in a child PID namespace cannot see the processes in the upper layer (that is, in the parent PID namespace).
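
The nesting can be observed from the outside: since Linux 4.1, the NSpid line in /proc/[pid]/status lists a process's PID in every PID namespace it belongs to, outermost first. Here is a small sketch that extracts it. The sample status text is made up, and the function name is hypothetical.

```python
def nspid_chain(status_text: str) -> list:
    """Return the PIDs from the 'NSpid:' line of a /proc/[pid]/status dump."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # kernels before 4.1 have no NSpid line

# Hypothetical process: PID 4321 in the parent namespace, PID 1 in the child one.
sample = "Name:\tsleep\nPid:\t4321\nNSpid:\t4321\t1\n"
print(nspid_chain(sample))  # [4321, 1]
```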

Whether a process is visible determines whether other processes can interact with it, for example by sending it signals. You should be familiar with this, so I won't go into details here.

So, can a process be moved between PID namespaces at different levels?

Let's start with the conclusion: within the PID namespace hierarchy, a process can only be moved in one direction (from higher to lower). That is:

A process can be moved from a parent PID namespace to a child PID namespace;
A process cannot be moved from a child PID namespace to a parent PID namespace;

Figure 1: moving a process between PID namespaces with setns(2)

The hierarchy of PID namespaces can be queried with the ioctl_ns(2) operation NS_GET_PARENT, which I won't expand on here. So, how is the move described above implemented?

To answer this question, first realize that which processes have permissions over a PID namespace is already determined when the namespace is created. As for the move itself, we can simply think of it as a kind of relational mapping, similar to a symbolic link.

All threads of a process must be in the same PID namespace, so that they can send signals to each other. As a result, CLONE_NEWPID cannot be specified together with CLONE_THREAD. But what if processes in different PID namespaces need to send signals to each other? That can be solved with a shared signal queue.

In addition, the /proc filesystem that we often work with contains many /proc/${PID} directories, which show the processes visible in the PID namespace. This filesystem can also be mounted directly, for example:

(MoeLove) ➜ mount |grep proc 
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
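
Because every visible process shows up as a numeric directory under /proc, listing those directories is a quick way to see what a PID namespace contains. A minimal sketch follows. The proc_root parameter is my own addition so the function can be pointed at any procfs mount, and the function name is hypothetical.

```python
import os

def visible_pids(proc_root: str = "/proc") -> list:
    """Return the sorted PIDs that have a directory under proc_root."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())
```

Run against a procfs mounted inside a new PID namespace, the list would only contain the processes of that namespace.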

Is there a way to know the most recently allocated PID?

Yes. Since Linux v3.3, the file /proc/sys/kernel/ns_last_pid records the last PID that was allocated in the current PID namespace.

When the next PID needs to be allocated, the kernel searches for the lowest unused PID greater than this value, allocates it, and then updates the value in this file.
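
Reading the file is enough to watch the counter move. A sketch follows. The path parameter exists only so the function can be tried against a copy of the file, and the function name is hypothetical.

```python
def last_pid(path: str = "/proc/sys/kernel/ns_last_pid") -> int:
    """Return the PID most recently allocated in the current PID namespace."""
    with open(path) as f:
        return int(f.read().strip())
```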

Time namespaces

Before talking about time namespaces, we need to talk about monotonic time. The system time we usually refer to is the real-time clock (CLOCK_REALTIME), that is, the wall-clock time shown by the machine. It can be adjusted forward or backward (think of what an NTP service does). The monotonic clock (CLOCK_MONOTONIC) counts the time elapsed since some fixed point in the past; it only moves forward and is not affected by changes to the system time.
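
The difference is easy to observe from user space. Python's time module exposes these clocks through time.clock_gettime (CLOCK_BOOTTIME is Linux-only). The helper name read_clocks is hypothetical.

```python
import time

def read_clocks() -> dict:
    """Read the wall clock, the monotonic clock and the boot-time clock in seconds."""
    return {
        "realtime": time.clock_gettime(time.CLOCK_REALTIME),
        "monotonic": time.clock_gettime(time.CLOCK_MONOTONIC),
        "boottime": time.clock_gettime(time.CLOCK_BOOTTIME),
    }

first = read_clocks()
second = read_clocks()
# Monotonic time never goes backwards; realtime may, if the clock is adjusted.
print(second["monotonic"] >= first["monotonic"])  # True
```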

Using time namespaces requires the kernel to be built with the CONFIG_TIME_NS option, for example:

(MoeLove) ➜ grep CONFIG_TIME_NS /boot/config-$(uname -r)
CONFIG_TIME_NS=y

The time namespace does not virtualize the CLOCK_REALTIME clock. You may wonder why the kernel supports time namespaces at all. They mainly serve some special scenarios, such as migrating or checkpoint/restoring containers, where the clocks seen by the processes must stay consistent.

All processes in a time namespace share the following two clocks provided by the namespace:

CLOCK_MONOTONIC: monotonic time, a clock that cannot be set;
CLOCK_BOOTTIME (and likewise CLOCK_BOOTTIME_ALARM): a clock that cannot be set and that also includes the time the system is suspended.

At present a time namespace can only be created by calling unshare(2) with the CLONE_NEWTIME flag. The process that creates the time namespace is not moved into it; only the child processes it creates afterwards are placed in the new time namespace. Processes in the same time namespace share CLOCK_MONOTONIC and CLOCK_BOOTTIME.

The time namespace that a process's future children will be placed in is shown by the symbolic link /proc/[pid]/ns/time_for_children.

(MoeLove) ➜ ls -al /proc/self/ns/time_for_children 
lrwxrwxrwx. 1 tao tao 0 Dec 14 02:06 /proc/self/ns/time_for_children -> 'time:[4026531834]'

The file /proc/[pid]/timens_offsets records the offsets of the monotonic and boot-time clocks of a time namespace, relative to the initial time namespace. (The offsets can be modified as long as no process has entered the new time namespace yet. I won't expand on that here; if you are interested, leave a comment to discuss.)

Note that in the initial time namespace, the offsets shown by /proc/self/timens_offsets are all 0.

(MoeLove) ➜ cat /proc/self/timens_offsets 
monotonic           0         0
boottime            0         0
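
This output is easy to parse: one line per clock, each followed by two integers. The sketch below folds the two numbers into a single nanosecond value. The function name is hypothetical, and the non-zero offsets in the second sample are made up to show the arithmetic.

```python
def parse_timens_offsets(text: str) -> dict:
    """Map clock name -> total offset in nanoseconds."""
    offsets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        clock, secs, nanosecs = line.split()
        offsets[clock] = int(secs) * 1_000_000_000 + int(nanosecs)
    return offsets

initial = "monotonic           0         0\nboottime            0         0\n"
print(parse_timens_offsets(initial))  # {'monotonic': 0, 'boottime': 0}

# Made-up offsets: one day ahead, and 4.5 seconds behind.
shifted = "monotonic 86400 0\nboottime -5 500000000\n"
print(parse_timens_offsets(shifted)["boottime"])  # -4500000000
```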

In the timens_offsets output, the second and third columns mean:

<offset-secs>: offset in seconds (s), may be negative
<offset-nanosecs>: offset in nanoseconds (ns), an unsigned value

The following clock interfaces are all affected by the time namespace:

clock_gettime(2)
clock_nanosleep(2)
nanosleep(2)
timer_settime(2)
timerfd_settime(2)

On the whole, time namespace is still very useful in some special scenarios.

User namespaces

User namespaces, as the name implies, isolate user IDs, group IDs and so on.

Using user namespaces requires the kernel to be built with the CONFIG_USER_NS option, for example:

(MoeLove) ➜ grep CONFIG_USER_NS /boot/config-$(uname -r)
CONFIG_USER_NS=y

The user ID and group ID of a process can be different inside and outside a user namespace.

For example, a process may run as a privileged user (root) inside a user namespace, while outside that user namespace it is just an ordinary unprivileged user. This is where user and group mappings (uid_map, gid_map) come in.

Starting from Linux v3.5, the mappings can be viewed in /proc/[pid]/uid_map and /proc/[pid]/gid_map:

(MoeLove) ➜ cat /proc/self/uid_map 
         0          0 4294967295
(MoeLove) ➜ cat /proc/self/gid_map 
         0          0 4294967295
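
Each line of uid_map (and gid_map) holds three numbers: the first ID of a range inside the namespace, the ID it maps to outside, and the length of the range. The identity mapping shown above covers the whole 32-bit ID space. Below is a sketch of how such a map translates an ID. The function names and the second sample mapping are made up for illustration.

```python
def parse_id_map(text: str) -> list:
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]

def to_outside(id_map: list, inside_id: int):
    """Translate an ID seen inside the namespace to the outside ID, or None if unmapped."""
    for inside_start, outside_start, length in id_map:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None

identity = parse_id_map("         0          0 4294967295\n")
print(to_outside(identity, 1000))  # 1000

rootless = parse_id_map("0 1000 1\n")  # root inside maps to UID 1000 outside
print(to_outside(rootless, 0))  # 1000
print(to_outside(rootless, 1))  # None
```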

User namespaces can also be nested. They are created with the CLONE_NEWUSER flag, using system calls such as unshare(2) or clone(2). The maximum nesting depth is also 32.

If a child process is created by fork(2), or by clone(2) without the CLONE_NEWUSER flag, the child is in the same user namespace as its parent. As with PID namespaces, the tree relationship can be queried through the ioctl(2) interface.

A single-threaded process can join another user namespace through the setns(2) system call.

In addition, user namespaces come with a very important set of rules about how Linux capabilities apply. I won't expand on Linux capabilities here; briefly:

A process has a capability inside a user namespace if it is a member of that namespace and the capability is in its effective capability set.
If a process has a capability in a user namespace, it also has that capability in all child user namespaces of that namespace.
The kernel records the user that created a user namespace as its owner; a process in the parent user namespace running as that user has all capabilities in the namespace.
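
Whether a process actually holds a capability can be checked from the CapEff line of /proc/[pid]/status, which is a hexadecimal bitmask of the effective set. Here is a sketch. CAP_SYS_ADMIN is capability number 21 in linux/capability.h; the function name and the sample masks are made up.

```python
CAP_SYS_ADMIN = 21  # from <linux/capability.h>

def has_capability(cap_eff_hex: str, cap: int) -> bool:
    """Check one bit in a CapEff value such as '0000003fffffffff'."""
    return bool((int(cap_eff_hex, 16) >> cap) & 1)

print(has_capability("0000003fffffffff", CAP_SYS_ADMIN))  # True
print(has_capability("0000000000000000", CAP_SYS_ADMIN))  # False
```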

Docker natively supports user namespaces, which adds a layer of protection to the container environment.

UTS namespaces

UTS namespaces isolate the hostname and the NIS domain name.

Using UTS namespaces requires the kernel to be built with the CONFIG_UTS_NS option, for example:

(MoeLove) ➜ grep CONFIG_UTS_NS /boot/config-$(uname -r)
CONFIG_UTS_NS=y

Within one UTS namespace, changes made through the sethostname(2) and setdomainname(2) system calls are visible to all processes in that namespace. Different UTS namespaces are isolated from each other, so such changes are not visible across them.

Namespaces main API

Many system calls were mentioned above. Here we pick a few important ones to introduce.

clone(2)

The clone(2) system call creates a new process and, for each CLONE_NEW* flag in its arguments, places the new process in a newly created namespace of that type. Of course, this system call also does things that have nothing to do with namespaces. In most cases the CAP_SYS_ADMIN capability is required; the exception is user namespaces, which since Linux 3.8 can be created without privileges.

unshare(2)

The unshare(2) system call moves the calling process into a new namespace. Likewise, the namespaces to create are selected by the CLONE_NEW* flags in its argument. In most cases the CAP_SYS_ADMIN capability is required; the exception is user namespaces, which since Linux 3.8 can be created without privileges.
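
From a scripting language, unshare(2) can be reached through libc. The following sketch wraps it with ctypes and turns a failure into an OSError. It assumes a libc that exports unshare (glibc does). The wrapper names and the choice of CLONE_NEWUTS are only for illustration; on Python 3.12+ the built-in os.unshare would be the simpler route.

```python
import ctypes
import os

CLONE_NEWUTS = 0x04000000  # from <linux/sched.h>

def unshare(flags: int) -> None:
    """Call unshare(2); raise OSError (e.g. PermissionError) on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

def try_new_uts_namespace() -> bool:
    """Return True if this process could move into a new UTS namespace."""
    try:
        unshare(CLONE_NEWUTS)
        return True
    except OSError:
        return False  # typically EPERM without CAP_SYS_ADMIN
```

When the call succeeds, a later sethostname(2) in this process no longer affects the hostname seen by processes outside the new namespace.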

setns(2)

The setns(2) system call moves the calling process into an existing namespace, which changes the contents of the corresponding /proc/[pid]/ns directory. Child processes created by the process can in turn adjust their namespaces by calling unshare(2) and setns(2). This usually requires the CAP_SYS_ADMIN capability.

Some key directories

/proc/[pid]/ns/ directory

Every process has a /proc/[pid]/ns/ subdirectory, whose entries change when the process calls setns(2). As long as one of the files in this directory is held open, the corresponding namespace is not destroyed. A file descriptor obtained by opening one of these files can be passed to setns(2) to join that namespace.

Linux 3.7 and earlier: the files appear as hard links;
Linux 3.8 and later: the files appear as symbolic links;

(MoeLove) ➜ ls -l --time-style='+' /proc/$$/ns  
total 0
lrwxrwxrwx. 1 tao tao 0  cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 tao tao 0  ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 tao tao 0  mnt -> 'mnt:[4026531840]'
lrwxrwxrwx. 1 tao tao 0  net -> 'net:[4026532008]'
lrwxrwxrwx. 1 tao tao 0  pid -> 'pid:[4026531836]'
lrwxrwxrwx. 1 tao tao 0  pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx. 1 tao tao 0  time -> 'time:[4026531834]'
lrwxrwxrwx. 1 tao tao 0  time_for_children -> 'time:[4026531834]'
lrwxrwxrwx. 1 tao tao 0  user -> 'user:[4026531837]'
lrwxrwxrwx. 1 tao tao 0  uts -> 'uts:[4026531838]'

If two processes are in the same namespace, the corresponding entries in their /proc/[pid]/ns/ directories point to the same target (the same namespace type and inode number).
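
That makes comparing two processes a matter of reading and comparing symlinks. Here is a sketch with hypothetical function names: parse_ns_link works on the link text alone, while namespace_of needs a Linux /proc and permission to inspect the target process.

```python
import os

def parse_ns_link(target: str) -> tuple:
    """Split a link target like 'pid:[4026531836]' into ('pid', 4026531836)."""
    ns_type, _, rest = target.partition(":[")
    return ns_type, int(rest.rstrip("]"))

def namespace_of(pid, ns_type: str) -> tuple:
    """Read /proc/<pid>/ns/<ns_type>; pid may be an int or the string 'self'."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{ns_type}"))

def same_namespace(pid_a, pid_b, ns_type: str) -> bool:
    """Return True if both processes are in the same namespace of this type."""
    return namespace_of(pid_a, ns_type) == namespace_of(pid_b, ns_type)

print(parse_ns_link("pid:[4026531836]"))  # ('pid', 4026531836)
```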

The files in this directory are described below:

File	Since	Description
/proc/[pid]/ns/cgroup	Linux 4.6	The cgroup namespace of the process
/proc/[pid]/ns/ipc	Linux 3.0	The IPC namespace of the process
/proc/[pid]/ns/mnt	Linux 3.8	The mount namespace of the process
/proc/[pid]/ns/net	Linux 3.0	The network namespace of the process
/proc/[pid]/ns/pid	Linux 3.8	The PID namespace of the process; it stays the same for the whole lifetime of the process
/proc/[pid]/ns/pid_for_children	Linux 4.12	The PID namespace of child processes created by the process; not necessarily the same as /proc/[pid]/ns/pid
/proc/[pid]/ns/time	Linux 5.6	The time namespace of the process
/proc/[pid]/ns/time_for_children	Linux 5.6	The time namespace of child processes created by the process
/proc/[pid]/ns/user	Linux 3.8	The user namespace of the process
/proc/[pid]/ns/uts	Linux 3.0	The UTS namespace of the process

/proc/sys/user directory

The files in the /proc/sys/user directory record the limits on each type of namespace. When a limit is reached, the related calls fail with the error ENOSPC.

File	Limit
max_cgroup_namespaces	The maximum number of cgroup namespaces that each user can create in the user namespace
max_ipc_namespaces	The maximum number of IPC namespaces that each user can create in the user namespace
max_mnt_namespaces	The maximum number of mount namespaces that each user can create in the user namespace
max_net_namespaces	The maximum number of network namespaces that each user can create in the user namespace
max_pid_namespaces	The maximum number of PID namespaces that each user can create in the user namespace
max_time_namespaces	The maximum number of time namespaces that each user can create in the user namespace (since Linux 5.7)
max_user_namespaces	The maximum number of user namespaces that each user can create in the user namespace
max_uts_namespaces	The maximum number of UTS namespaces that each user can create in the user namespace
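
These limits are plain integers stored in files, so collecting them takes only a short loop. A sketch follows. The directory parameter is my own addition so the function can be tried on a copy of the directory, and the function name is hypothetical.

```python
import os

def namespace_limits(directory: str = "/proc/sys/user") -> dict:
    """Return {namespace type: limit} for every max_*_namespaces file."""
    limits = {}
    for name in sorted(os.listdir(directory)):
        if name.startswith("max_") and name.endswith("_namespaces"):
            ns_type = name[len("max_"):-len("_namespaces")]
            with open(os.path.join(directory, name)) as f:
                limits[ns_type] = int(f.read().strip())
    return limits
```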

Namespace life cycle

Normally, a namespace is destroyed when the last process in it terminates or leaves the namespace.

But in some cases a namespace is kept alive even after its last process has exited. Briefly, these special cases are:

A file in /proc/[pid]/ns/* is held open or bind-mounted;
The namespace is hierarchical (PID or user) and still has child namespaces;
It is a user namespace that still owns non-user namespaces (such as PID namespaces);
It is a PID namespace and some process still refers to it through /proc/[pid]/ns/pid_for_children;

Of course, there are other cases as well; I will add them when I have time.

Summary

In the previous article and this one, I introduced the history of Linux namespaces, their basic types, the main APIs, and some usage scenarios.

Namespaces are a core part of container technology. Later articles in this series will continue to cover containers, Kubernetes and related technologies, so stay tuned.

You are welcome to subscribe to my public account 【MoeLove】.
