Understanding the cornerstone of container technology: namespace (Part 1)

Hi everyone, this is Zhang Jintao.

Both the container technology and the virtualization technology we talk about today (whatever level of abstraction the virtualization works at) can isolate and restrict resources.

Container technology achieves this isolation and restriction by relying on two features provided by the Linux kernel: cgroup and namespace.

Let's first summarize the role of these two technologies:

The main function of cgroup: to manage the allocation and restriction of resources;
The main function of namespace: to wrap, abstract, restrict and isolate resources, so that the processes in a namespace appear to have their own global resources.

In the previous article, we focused on cgroup. In this article, we will focus on namespace.

What is Namespace?

Here is the definition of namespace from Wikipedia:

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for a set of resources and processes, but those namespaces refer to distinct resources.

That definition is rather roundabout. Put simply, a namespace is a mechanism provided by the Linux kernel for isolating resources between processes. It wraps a global system resource in an abstraction, so that the processes inside the namespace appear to have their own independent instance of that resource. Linux provides several kinds of namespaces, each isolating a different kind of resource.

Scenarios where namespaces are used on their own have been relatively rare, but namespaces are the cornerstone of container technology.

Let's take a look at its development first.

The history of Namespace

Figure 1, the historical process of namespace

Early stage: Plan 9

The idea and earliest use of namespaces can be traced back to Plan 9 from Bell Labs. This is a distributed operating system developed by the Computing Science Research Center at Bell Labs from the mid-1980s to 2002 (the stable Fourth Edition was released in 2002, after ten years of polishing since the first public release in 1992). It is still being developed and used by operating system researchers and enthusiasts. The design and implementation of Plan 9 focused on the following 3 points:

File system: All system resources are exposed in the file system and identified by nodes. All interfaces are also presented as part of the file system.
Namespace: It makes better use of the hierarchical structure of the file system and provides the so-called "separation" and "independence".
Standard communication protocol: 9P protocol (Styx/9P2000).

Figure 2, Plan 9 from Bell Labs icon

Joining the Linux kernel

Namespaces entered the Linux kernel in the 2.4.x series, and 2.4.19 was the first release to include them. However, per-process namespaces had been under implementation since version 2.4.2.

Figure 3, Linux Kernel Note

Figure 4, Linux Kernel corresponding operating system version

Linux 3.8: basically complete

In Linux 3.8, user namespaces were finally fully integrated into the kernel. With that, the namespace-related capabilities used by Docker and other container technologies were basically in place.

Figure 5, Linux Kernel gradually evolved from 2001 to 2013, completing the realization of namespace

Namespace type

Namespace	Flag	What it isolates
Cgroup	CLONE_NEWCGROUP	Cgroup root directory
IPC	CLONE_NEWIPC	System V IPC, POSIX message queues
Network	CLONE_NEWNET	Network devices, protocol stacks, ports, etc.
Mount	CLONE_NEWNS	Mount points
PID	CLONE_NEWPID	Process IDs
Time	CLONE_NEWTIME	Boot and monotonic clocks
User	CLONE_NEWUSER	User and group IDs
UTS	CLONE_NEWUTS	Host name and NIS (Network Information Service) domain name
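The flags in the table are bit values that can be OR'ed together when passed to clone(2) or unshare(2). As a rough sketch in Python (the helper names combine and decode are made up for this illustration; the numeric values are taken from the kernel header linux/sched.h, not from this article), the mapping can be modelled like this:

```python
# Numeric values of the namespace flags, as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "CLONE_NEWTIME": 0x00000080,
    "CLONE_NEWNS": 0x00020000,
    "CLONE_NEWCGROUP": 0x02000000,
    "CLONE_NEWUTS": 0x04000000,
    "CLONE_NEWIPC": 0x08000000,
    "CLONE_NEWUSER": 0x10000000,
    "CLONE_NEWPID": 0x20000000,
    "CLONE_NEWNET": 0x40000000,
}


def combine(*names):
    """Hypothetical helper: OR the named flags into a single bitmask."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]
    return mask


def decode(mask):
    """Hypothetical helper: list the namespace flags set in a bitmask."""
    return sorted(name for name, bit in CLONE_FLAGS.items() if mask & bit)


if __name__ == "__main__":
    # A new cgroup namespace plus a new mount namespace.
    mask = combine("CLONE_NEWCGROUP", "CLONE_NEWNS")
    print(hex(mask), decode(mask))
```

Requesting a new cgroup namespace and a new mount namespace at the same time is simply the OR of the two corresponding bits.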

Cgroup namespaces

A cgroup namespace virtualizes the view of a process's cgroups, as seen through /proc/[pid]/cgroup and /proc/[pid]/mountinfo.

Using cgroup namespaces requires a kernel built with the CONFIG_CGROUPS option. This can be verified as follows:

(MoeLove) ➜ grep CONFIG_CGROUPS /boot/config-$(uname -r)
CONFIG_CGROUPS=y
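The same check can be scripted. Below is a minimal sketch in Python; the function name kernel_option is made up for this example, and it assumes the usual kernel config format of OPTION=value lines, with unset options appearing as comments.

```python
def kernel_option(config_text, option):
    """Return the value of `option` (for example 'y' or 'm'), or None if unset.

    `config_text` is the content of a kernel config file such as
    /boot/config-<release>. Unset options appear there as comment lines
    ("# CONFIG_FOO is not set"), so they never match.
    """
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


if __name__ == "__main__":
    sample = "CONFIG_CGROUPS=y\n# CONFIG_EXAMPLE is not set\n"
    print(kernel_option(sample, "CONFIG_CGROUPS"))
    print(kernel_option(sample, "CONFIG_EXAMPLE"))
```

Matching on the option name followed by "=" keeps CONFIG_CGROUPS from being confused with longer option names that share the same prefix.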

Cgroup namespaces serve several isolation purposes:

Preventing information leaks (a container should not see any information from outside the container).
Simplifying container migration.
Better confinement of container processes: the container's cgroup file systems can be mounted in such a way that container processes cannot access the cgroup directories above their own.

Each cgroup namespace has its own set of cgroup root directories. These root directories are the base points for the relative locations shown in the corresponding records of /proc/[pid]/cgroup. When a process creates a new cgroup namespace with CLONE_NEWCGROUP (via clone(2) or unshare(2)), its current cgroup directories become the cgroup root directories of the new namespace.

(MoeLove) ➜ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/session-2.scope

When a process reads the /proc/[pid]/cgroup file of a target process, the pathname in the third field of each record is relative to the reading process's root directory for the corresponding cgroup hierarchy. If the cgroup directory of the target process lies outside the root directory of the reading process's cgroup namespace, the pathname shows a ../ entry for each ancestor level in the cgroup hierarchy.
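To make the record format concrete, here is a small sketch in Python that splits a /proc/[pid]/cgroup record into its three colon-separated fields and checks whether the path points outside the reader's cgroup namespace root. The names CgroupRecord, parse_cgroup_line and outside_namespace_root are invented for this illustration.

```python
from typing import List, NamedTuple


class CgroupRecord(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_cgroup_line(line):
    """Split one record of the form 'hierarchy-ID:controller-list:path'."""
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    return CgroupRecord(
        int(hierarchy_id),
        controllers.split(",") if controllers else [],  # empty for cgroup v2
        path,
    )


def outside_namespace_root(record):
    """True if the path climbs above the reader's cgroup namespace root."""
    return record.path == "/.." or record.path.startswith("/../")


if __name__ == "__main__":
    for line in ("0::/user.slice/user-1000.slice/session-2.scope",
                 "7:freezer:/../moelove-sub"):
        record = parse_cgroup_line(line)
        print(record, outside_namespace_root(record))
```

A path beginning with /.. tells the reader that the target process lives in a cgroup it cannot address from inside its own namespace.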

Let's take a look at the following example (it uses cgroup v1; if you would like to see a cgroup v2 version, let me know in the comments):

In the initial cgroup namespace, as root (or a user with root privileges), we create a child cgroup named moelove-sub in the freezer hierarchy and put a process into it.

(MoeLove) ➜ mkdir -p /sys/fs/cgroup/freezer/moelove-sub
(MoeLove) ➜ sleep 6666666 & 
[1] 1489125                  
(MoeLove) ➜ echo 1489125 > /sys/fs/cgroup/freezer/moelove-sub/cgroup.procs

We create another child cgroup in the freezer hierarchy, named moelove-sub2, and write the PID of the current shell into it. You can see that the current shell is now managed under the moelove-sub2 cgroup.

(MoeLove) ➜ mkdir -p /sys/fs/cgroup/freezer/moelove-sub2
(MoeLove) ➜ echo $$
1488899
(MoeLove) ➜ echo 1488899 > /sys/fs/cgroup/freezer/moelove-sub2/cgroup.procs 
(MoeLove) ➜ cat /proc/self/cgroup |grep freezer
7:freezer:/moelove-sub2

We use unshare(1) to start a new process. The -C option creates a new cgroup namespace and the -m option creates a new mount namespace.

(MoeLove) ➜ unshare -Cm bash
root@moelove:~#

From the new shell started with unshare(1), we can look at the /proc/[pid]/cgroup files of the new shell itself, of PID 1, and of the process from the example above:

root@moelove:~# cat /proc/self/cgroup | grep freezer
7:freezer:/
root@moelove:~# cat /proc/1/cgroup | grep freezer
7:freezer:/..

# the first example process
root@moelove:~# cat /proc/1489125/cgroup | grep freezer
7:freezer:/../moelove-sub

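The first command shows that the freezer cgroup membership of the new shell is displayed relative to the freezer cgroup root directory that was established when the new cgroup namespace was created. In absolute terms the new shell is in the /moelove-sub2 freezer cgroup, and the root of the freezer hierarchy in the new cgroup namespace is also /moelove-sub2, so the membership is displayed as /. The other two processes live outside that root, so their paths start with /.. instead. However, /proc/self/mountinfo shows something unexpected: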

root@moelove:~# cat /proc/self/mountinfo | grep freezer
1238 1230 0:37 /.. /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer

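The fourth field ( /.. ) shows the directory in the cgroup file system that forms the root of this mount. By the definition of cgroup namespaces, the current freezer cgroup directory of the process has become its root directory, so this field should display / . It displays /.. because the mount entry we are looking at still belongs to the initial cgroup namespace. We can fix this by remounting the freezer cgroup file system from the new shell.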

root@moelove:~# mount --make-rslave /
root@moelove:~# umount /sys/fs/cgroup/freezer
root@moelove:~# mount -t cgroup -o freezer freezer /sys/fs/cgroup/freezer
root@moelove:~# cat /proc/self/mountinfo | grep freezer
1238 1230 0:37 / /sys/fs/cgroup/freezer rw,relatime - cgroup freezer rw,freezer
root@moelove:~# mount |grep freezer
freezer on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer)

IPC namespaces

IPC namespaces isolate IPC resources, such as System V IPC objects and POSIX message queues. Each IPC namespace has its own set of System V IPC identifiers and its own POSIX message queue file system. Objects created in an IPC namespace are visible to all processes that are members of that namespace, but not to processes in other IPC namespaces.

Using IPC namespaces requires a kernel built with the CONFIG_IPC_NS option. This can be verified as follows:

(MoeLove) ➜ grep CONFIG_IPC_NS /boot/config-$(uname -r)
CONFIG_IPC_NS=y

The following /proc interfaces are distinct in each IPC namespace:

/proc/sys/fs/mqueue: POSIX message queue interfaces
/proc/sys/kernel: System V IPC interfaces (msgmax, msgmnb, msgmni, sem, shmall, shmmax, shmmni, shm_rmid_forced)
/proc/sysvipc: System V IPC interfaces

When an IPC namespace is destroyed (that is, when the last process that is a member of the namespace terminates), all IPC objects created in that namespace are destroyed as well.
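Whether two processes share an IPC namespace, and can therefore see each other's IPC objects, can be checked by comparing their namespace links. The sketch below is in Python and relies on something this article does not cover: the symbolic links under /proc/[pid]/ns/, whose targets look like ipc:[4026531839] (the number is only an illustrative inode value). The function names are made up for this example.

```python
import os
import re

# Link targets under /proc/[pid]/ns/ have the form "type:[inode]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link target: {target!r}")
    return match.group(1), int(match.group(2))


def same_namespace(pid_a, pid_b, ns_type="ipc"):
    """True if both processes are members of the same namespace.

    Reading another user's /proc/[pid]/ns/ links may need extra privileges.
    """
    link_a = os.readlink(f"/proc/{pid_a}/ns/{ns_type}")
    link_b = os.readlink(f"/proc/{pid_b}/ns/{ns_type}")
    return parse_ns_link(link_a) == parse_ns_link(link_b)


if __name__ == "__main__":
    print(parse_ns_link("ipc:[4026531839]"))
```

On a Linux host, same_namespace("self", 1) would tell you whether the current process still shares the IPC namespace of PID 1, provided you are allowed to read that link.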

Network namespaces

Network namespaces isolate the system resources associated with networking, including:

Network devices
IPv4 and IPv6 protocol stacks
IP routing tables
Firewall rules
/proc/net (i.e. /proc/[pid]/net)
/sys/class/net
Files in the /proc/sys/net directory
Port numbers (sockets)
UNIX domain abstract socket namespace
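Because /proc/net is per-namespace, reading it is a quick way to see which network devices a process can actually reach. Here is a minimal sketch in Python that extracts the interface names from the text of /proc/net/dev. The function name interface_names is invented for this example, and it assumes the usual layout of that file: two header lines followed by one "name: counters" line per device.

```python
def interface_names(proc_net_dev_text):
    """Return the device names listed in the text of /proc/net/dev."""
    names = []
    # The first two lines are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        name, separator, _counters = line.partition(":")
        if separator:
            names.append(name.strip())
    return names


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as handle:
            print(interface_names(handle.read()))
    except OSError:
        print("/proc/net/dev is not available on this system")
```

Run inside a freshly created network namespace, this would be expected to list only the loopback device, since a new network namespace starts out with nothing else.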

Using network namespaces requires a kernel built with the CONFIG_NET_NS option. This can be verified as follows:

(MoeLove) ➜ grep CONFIG_NET_NS /boot/config-$(uname -r)
CONFIG_NET_NS=y

A physical network device can exist in only one network namespace. When a network namespace is freed (that is, when the last process in the namespace terminates), its physical network devices are moved back to the initial network namespace, not to the namespace of the parent process.

A virtual network device pair (veth(4)) provides a pipe-like connection between network namespaces, with one end placed in each namespace. When a network namespace is destroyed, the veth(4) devices it contains are destroyed as well.

Mount namespaces

Mount namespaces first appeared in Linux 2.4.19. Mount namespaces isolate the list of mounts seen by the processes in each namespace instance. Processes in different mount namespace instances therefore see different directory hierarchies.

The view a process has of its mount namespace is exposed through the following files:

/proc/[pid]/mounts
/proc/[pid]/mountinfo
/proc/[pid]/mountstats

A new mount namespace is created with the CLONE_NEWNS flag, using either clone(2) or unshare(2).

If the mount namespace is created with clone(2), the mount list of the child's namespace is a copy of the mount list in the parent process's mount namespace.
If the mount namespace is created with unshare(2), the mount list of the new namespace is a copy of the mount list in the caller's previous mount namespace.

What chain reaction does a change to the mounts in one namespace cause in the others? That is governed by shared subtrees, which we look at next.

Each mount is marked with one of the following propagation types:

MS_SHARED: This mount shares events with the members of its peer group. Mount and unmount events immediately under this mount propagate to the other mounts in the group. Conversely, mount and unmount events under the peer mounts propagate to this mount.
MS_PRIVATE: This mount is private. Mount and unmount events do not propagate into or out of this mount.
MS_SLAVE: Mount and unmount events propagate into this mount from a master shared peer group. However, mount and unmount events under this mount do not propagate to any other mount.
MS_UNBINDABLE: This is like a private mount, and in addition it cannot be bind mounted. Any attempt to bind mount it will fail.

The propagation type is shown in the optional fields of /proc/[pid]/mountinfo. The kernel assigns a unique ID to each peer group, and all mounts in the same peer group show that ID (the X in the tags below).

(MoeLove) ➜ cat /proc/self/mountinfo  |grep root  
65 1 0:33 /root / rw,relatime shared:1 - btrfs /dev/nvme0n1p6 rw,seclabel,compress=zstd:1,ssd,space_cache,subvolid=256,subvol=/root
1210 65 0:33 /root/var/lib/docker/btrfs /var/lib/docker/btrfs rw,relatime shared:1 - btrfs /dev/nvme0n1p6 rw,seclabel,compress=zstd:1,ssd,space_cache,subvolid=256,subvol=/root

shared:X: the mount is shared in peer group X.
master:X: the mount is a slave to peer group X, that is, it receives events from the master group with ID X.
propagate_from:X: the mount is a slave and receives propagation from shared peer group X. This tag always appears together with master:X.
unbindable: the mount cannot be bind mounted.
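These tags can be read programmatically. Below is a rough sketch in Python that pulls the root, the mount point and the optional propagation tags out of a /proc/[pid]/mountinfo line. The function names and the returned dictionary layout are invented for this illustration; the field order (mount ID, parent ID, major:minor, root, mount point, mount options, optional fields, a "-" separator, file system type, source, super options) follows proc(5).

```python
def parse_mountinfo_line(line):
    """Extract a few fields from one /proc/[pid]/mountinfo record."""
    fields = line.split()
    # Optional fields sit between the mount options (index 5) and the "-".
    separator = fields.index("-", 6)
    return {
        "mount_id": int(fields[0]),
        "parent_id": int(fields[1]),
        "root": fields[3],
        "mount_point": fields[4],
        "tags": fields[6:separator],
        "fs_type": fields[separator + 1],
    }


def propagation_type(info):
    """Classify a parsed record by its propagation tags."""
    tags = info["tags"]
    if "unbindable" in tags:
        return "unbindable"
    shared = any(tag.startswith("shared:") for tag in tags)
    slave = any(tag.startswith("master:") for tag in tags)
    if shared and slave:
        return "shared+slave"  # a mount can be both at once
    if shared:
        return "shared"
    if slave:
        return "slave"
    return "private"  # no tag at all means a private mount


if __name__ == "__main__":
    line = ("65 1 0:33 /root / rw,relatime shared:1 - btrfs "
            "/dev/nvme0n1p6 rw,ssd,subvolid=256,subvol=/root")
    info = parse_mountinfo_line(line)
    print(info["mount_point"], propagation_type(info))
```

A record without any optional field, such as the freezer mount shown earlier in this article, is classified as private.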

The propagation type of a new mount depends on its parent mount. If the propagation type of the parent mount is MS_SHARED, the propagation type of the new mount is also MS_SHARED; otherwise it defaults to MS_PRIVATE.

Regarding mount namespaces, we also need to pay attention to the following points:

(1) Each mount namespace has an owner user namespace. If the new mount namespace and the mount namespace it was copied from are owned by different user namespaces, the new mount namespace is considered less privileged.

(2) When a less privileged mount namespace is created, its shared mounts are reduced to slave mounts, so that mounts made in the less privileged namespace do not propagate to the more privileged one.

(3) Mounts that arrive as a single unit from a more privileged mount namespace are locked together and cannot be separated in the less privileged mount namespace.

(4) Certain mount(2) flags (such as the read-only flag) and the atime flags become locked when they are propagated from a more privileged to a less privileged mount namespace, and cannot be changed there.

Summary

That concludes this introduction to namespaces in the Linux kernel. For reasons of space, the remaining namespace types and the way containers use namespaces will be covered in the next article, so stay tuned!

You are welcome to subscribe to my official account [MoeLove].
