Learn how to use Podman to run containers in separate user namespaces.

Podman is part of the libpod library, which lets users manage pods, containers, and container images. In my last article, I wrote about Podman as a more secure way to run containers. Here, I’ll explain how to use Podman to run containers in separate user namespaces.

I have long considered the user namespace a great feature for separating containers. The user namespace, developed primarily by Eric Biederman at Red Hat, lets you specify user identifier (UID) and group identifier (GID) mappings for running containers. This means you can run as UID 0 inside the container and as UID 100000 outside the container. If container processes escape from the container, the kernel treats them as running as UID 100000. Furthermore, any file object owned by a UID that is not mapped into the user namespace is treated as owned by nobody (UID 65534, as set by kernel.overflowuid), and container processes are not allowed to access it unless the object is accessible by “others” (that is, world-readable or world-writable).

If you have a file with permissions 660 that is owned by “real” root, and container processes in the user namespace try to read it, the kernel blocks the access and the processes see the file as owned by nobody.

An example

Here’s how it works. First, I create a file on my system that is owned by root.

$ sudo bash -c "echo Test > /tmp/test"
$ sudo chmod 600 /tmp/test
$ sudo ls -l /tmp/test
-rw-------. 1 root root 5 Dec 17 16:40 /tmp/test

Next, I volume-mount the file into a container running with the user namespace mapping 0:100000:5000.

$ sudo podman run -ti -v /tmp/test:/tmp/test:Z --uidmap 0:100000:5000 fedora sh
# id
uid=0(root) gid=0(root) groups=0(root)
# ls -l /tmp/test
-rw-rw----. 1 nobody nobody 8 Nov 30 12:40 /tmp/test
# cat /tmp/test
cat: /tmp/test: Permission denied

The --uidmap setting above tells Podman to map a range of 5000 UIDs inside the container, starting at UID 100000 outside the container (the range 100000-104999), to a range starting at UID 0 inside the container (the range 0-4999). Inside the container, if my process runs as UID 1, it is UID 100001 on the host.
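To make the arithmetic concrete, here is a minimal Python sketch of how a container UID translates to a host UID under a single mapping. The names UidMap, parse_uidmap, and container_to_host are my own for illustration and are not part of Podman; the sketch assumes one mapping triple in the container:host:length form used above.

```python
from typing import NamedTuple, Optional


class UidMap(NamedTuple):
    """One --uidmap triple: container start, host start, range length."""
    container_start: int
    host_start: int
    length: int


def parse_uidmap(spec: str) -> UidMap:
    """Parse a 'container:host:length' string such as '0:100000:5000'."""
    container_start, host_start, length = (int(part) for part in spec.split(":"))
    return UidMap(container_start, host_start, length)


def container_to_host(uid: int, mapping: UidMap) -> Optional[int]:
    """Return the host UID for a container UID, or None if it is unmapped."""
    offset = uid - mapping.container_start
    if 0 <= offset < mapping.length:
        return mapping.host_start + offset
    return None


mapping = parse_uidmap("0:100000:5000")
print(container_to_host(0, mapping))     # 100000
print(container_to_host(1, mapping))     # 100001
print(container_to_host(5000, mapping))  # None: outside the 0-4999 range
```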

Since the real UID=0 is not mapped into the container, any file owned by root is treated as owned by nobody. This protection cannot be overridden even if the process inside the container has the CAP_DAC_OVERRIDE capability. DAC_OVERRIDE enables a root process to read and write any file on the system, even if the file is not owned by root and is not world-readable or world-writable.
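The following Python sketch models that decision in a deliberately simplified way. It assumes the 0:100000:5000 mapping from the example, ignores group ownership, SELinux, and every other kernel check, and uses function names I made up for illustration. It is a mental model, not the kernel's actual algorithm.

```python
import stat

OVERFLOW_UID = 65534  # default kernel.overflowuid, displayed as "nobody"

# Host UIDs covered by --uidmap 0:100000:5000 (assumed for this sketch)
HOST_START, LENGTH = 100000, 5000


def owner_seen_in_container(host_uid: int) -> int:
    """Return the owner UID a container process sees for a host-owned file."""
    offset = host_uid - HOST_START
    return offset if 0 <= offset < LENGTH else OVERFLOW_UID


def container_root_can_read(mode: int, owner_host_uid: int) -> bool:
    """Simplified model: may root inside the container read this file?"""
    if owner_seen_in_container(owner_host_uid) != OVERFLOW_UID:
        # The owner is mapped, so the namespaced CAP_DAC_OVERRIDE applies.
        return True
    # The owner is unmapped: only the "others" permission bits count.
    return bool(mode & stat.S_IROTH)


print(owner_seen_in_container(0))           # 65534 (nobody)
print(container_root_can_read(0o600, 0))    # False: real root's private file
print(container_root_can_read(0o644, 0))    # True: world-readable
```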

Capabilities in a user namespace are not the same as capabilities on the host. They are namespaced capabilities. This means my container's root has capabilities only within the container, and really only over the range of UIDs that are mapped into the user namespace. If a container process escapes from the container, it has no capabilities over UIDs that are not mapped into the user namespace, including UID=0. Even if the process somehow entered another container, it would not have those capabilities there if that container uses a different range of UIDs.

Note that SELinux and other technologies also limit what happens if a container process breaks out of a container.

Using podman top to show user namespaces

We’ve added features to podman top that let you examine the usernames of the processes running inside a container and identify their real UIDs on the host.

Let’s start by running a sleep container using our UID mapping.

$ sudo podman run --uidmap 0:100000:5000 -d fedora sleep 1000

Now run podman top:

$ sudo podman top --latest user huser
USER   HUSER
root   100000

$ ps -ef | grep sleep
100000   21821 21809  0 08:04 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000

Note that podman top reports that the user process is running as root inside the container but as UID 100000 on the host (HUSER). The ps command also confirms that the sleep process is running as UID 100000.
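The HUSER value can also be cross-checked against the kernel's own record of the mapping. This is a minimal Python sketch that assumes the standard /proc/<pid>/uid_map format (ID inside the namespace, ID outside the namespace, and range length on each line). The sample text is illustrative rather than captured from the run above, and the helper names are hypothetical.

```python
from typing import List, Optional, Tuple

MapEntry = Tuple[int, int, int]  # (inside ID, outside ID, length)


def parse_uid_map(text: str) -> List[MapEntry]:
    """Parse the contents of a /proc/<pid>/uid_map file."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, count = (int(field) for field in fields)
        entries.append((inside, outside, count))
    return entries


def host_uid(container_uid: int, entries: List[MapEntry]) -> Optional[int]:
    """Translate a UID inside the namespace to the UID seen on the host."""
    for inside, outside, count in entries:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None


# Illustrative content for a container started with --uidmap 0:100000:5000
sample = "         0     100000       5000\n"
entries = parse_uid_map(sample)
print(host_uid(0, entries))  # 100000, matching the HUSER column for root
```

On a real system, the text would come from reading /proc/<pid>/uid_map for the container's process, where <pid> is the PID shown by ps.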

Now let’s run a second container, but this time we’ll choose a separate UID map starting at 200000.

$ sudo podman run --uidmap 0:200000:5000 -d fedora sleep 1000
$ sudo podman top --latest user huser
USER   HUSER
root   200000

$ ps -ef | grep sleep
100000   21821 21809  0 08:04 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
200000   23644 23632  1 08:08 ?         00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000

Note that podman top reports that the second container is running as root inside the container but as UID=200000 on the host.

Also note the ps command, which shows that both sleep processes are running: one as UID 100000 and the other as UID 200000.

This means that running the container in a separate user namespace allows for traditional UID separation between processes, which has been a standard Linux/Unix security tool from the beginning.
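The separation holds only when the host ranges of two containers do not overlap. Here is a small Python sketch of that check, using the two mappings from the examples; host_range and ranges_overlap are illustrative names of my own, not Podman functions.

```python
def host_range(spec: str) -> range:
    """Return the host UID range covered by a 'container:host:length' map."""
    _, host_start, length = (int(part) for part in spec.split(":"))
    return range(host_start, host_start + length)


def ranges_overlap(a: range, b: range) -> bool:
    """True if the two half-open host UID ranges share at least one UID."""
    return a.start < b.stop and b.start < a.stop


first = host_range("0:100000:5000")   # host UIDs 100000-104999
second = host_range("0:200000:5000")  # host UIDs 200000-204999
print(ranges_overlap(first, second))  # False: the containers are separated
```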

User namespace issues

For years, I’ve advocated the user namespace as the security tool everyone wants but hardly anyone has used. The reason is that there has been no filesystem support for it, nor a shifting filesystem.

In containers, you want to share the base image between many containers. The examples above all use the Fedora base image. Most files in the Fedora image are owned by the real UID=0. If I run a container on this image with the user namespace 0:100000:5000, by default it sees all of these files as owned by nobody, so we need to shift all of these UIDs to match the user namespace. For years, I’ve wanted a mount option that tells the kernel to remap these file UIDs to match the user namespace. Upstream kernel storage developers continue to investigate this and have made some progress on the feature, but it is a difficult problem.

Podman can use different user namespaces on the same image thanks to the automatic chown built into containers/storage by a team led by Nalin Dahyabhai. The first time Podman uses a container image in a new user namespace, containers/storage chowns (that is, changes ownership of) all files in the image to the UIDs mapped in the user namespace and creates a new image. Think of it as a fedora:0:100000:5000 image.

When Podman runs another container on the image with the same UID mappings, it uses the “pre-chowned” image. When I run the second container with 0:200000:5000, containers/storage creates a second image; let’s call it fedora:0:200000:5000.
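The caching behavior described here can be pictured with a toy Python model. This is purely illustrative: ShiftedImageCache and its fields are hypothetical, the chown is simulated by a counter, and nothing here reflects how containers/storage is actually implemented.

```python
class ShiftedImageCache:
    """Toy model: one chowned copy of an image per distinct UID mapping."""

    def __init__(self) -> None:
        self._images = {}    # (image, uidmap) -> name of the chowned image
        self.chown_runs = 0  # how many times the (slow) chown was needed

    def get(self, image: str, uidmap: str) -> str:
        key = (image, uidmap)
        if key not in self._images:
            # First use of this image with this mapping: chown and remember.
            self.chown_runs += 1
            self._images[key] = f"{image}:{uidmap}"
        return self._images[key]


cache = ShiftedImageCache()
cache.get("fedora", "0:100000:5000")  # chown happens
cache.get("fedora", "0:100000:5000")  # reuses the pre-chowned image
cache.get("fedora", "0:200000:5000")  # new mapping, so chown happens again
print(cache.chown_runs)  # 2
```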

Note that if you run podman build or podman commit and push the newly created image to a container registry, Podman uses containers/storage to reverse the shift and pushes the image with all files chowned back to the real UID=0.

This can cause a real slowdown when creating containers with a new UID mapping, because the chown can be slow, depending on the number of files in the image. Also, on a normal OverlayFS, the chown copies up every file in the image. A normal Fedora image can take up to 30 seconds to finish the chown and start the container.

Fortunately, the Red Hat kernel storage team (mainly Vivek Goyal and Miklos Szeredi) added a new feature to OverlayFS in kernel 4.19 called metadata-only copy-up. If you mount an overlay filesystem with the metacopy=on option, it does not copy up the contents of the lower layers when you change file attributes; instead, the kernel creates new inodes that hold the attributes and refer to the lower-layer data. It still copies up the contents if they change. If you want to try it out, this feature is available in the Red Hat Enterprise Linux 8 Beta.

This means that container chown can happen in less than two seconds, and you won’t double the storage space per container.

This makes it feasible for tools like Podman to run containers in different user namespaces, greatly improving system security.

Looking ahead

I would like to add a new option to Podman, such as --userns=auto, that automatically picks a unique user namespace for each container you run. This is similar to the way SELinux works with separate multi-category security (MCS) labels. If you set the environment variable PODMAN_USERNS=auto, you won’t even need to set the option.
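As a rough illustration of what “automatically picks a unique user namespace” could mean, here is a Python sketch that chooses the first free host UID range that does not collide with ranges already in use. This is only my guess at one possible allocation strategy; the function name, the default size of 5000, and the starting UID of 100000 are assumptions borrowed from the examples above, not a description of any actual Podman behavior.

```python
from typing import Iterable, Tuple


def pick_free_range(
    used: Iterable[Tuple[int, int]],
    size: int = 5000,
    first: int = 100000,
    limit: int = 2**32 - 1,
) -> Tuple[int, int]:
    """Pick the lowest (host_start, size) range not overlapping any used range.

    `used` holds (host_start, length) pairs of ranges already handed out.
    """
    candidate = first
    for start, length in sorted(used):
        if candidate + size <= start:
            break  # the gap before this used range is large enough
        candidate = max(candidate, start + length)
    if candidate + size > limit:
        raise ValueError("no free UID range of the requested size")
    return candidate, size


print(pick_free_range([]))                                  # (100000, 5000)
print(pick_free_range([(100000, 5000), (200000, 5000)]))    # (105000, 5000)
```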

Podman finally allows users to run containers in separate user namespaces. Tools like Buildah and CRI-O can also take advantage of user namespaces. For CRI-O, however, Kubernetes needs to understand which user namespace the container engine will run in, and the upstream community is working on that.

In my next article, I’ll explain how to run Podman as a non-root user in a user namespace.


Via: opensource.com/article/18/…

By Daniel J Walsh, Lujun9972

This article is originally compiled by LCTT and released in Linux China