
1. Introduction

Anyone who has worked with Docker has probably heard a sentence like "Docker containers achieve resource isolation and resource limits through the Linux namespace and cgroup features". Today we take a look at these two underlying Linux technologies that make containers possible.

2. Namespace

A namespace wraps a global system resource in an abstraction, so that the processes inside the namespace appear to have their own isolated instance of that resource. Changes to the resource are visible to the other processes that are members of the namespace, but are invisible to processes outside it.

At present, the Linux kernel implements the following seven namespaces (the newer time namespace is covered at the end of this section):

Namespace  Flag (used in API calls)  Isolates
Cgroup     CLONE_NEWCGROUP           Cgroup root directory (since Linux 4.6)
IPC        CLONE_NEWIPC              System V IPC, POSIX message queues (since Linux 2.6.19)
Network    CLONE_NEWNET              Network devices, stacks, ports, etc. (since Linux 2.6.24)
Mount      CLONE_NEWNS               Mount points (since Linux 2.4.19)
PID        CLONE_NEWPID              Process IDs (since Linux 2.6.24)
User       CLONE_NEWUSER             User and group IDs (started in Linux 2.6.23 and completed in Linux 3.8)
UTS        CLONE_NEWUTS              Hostname and NIS domain name (since Linux 2.6.19)

View the namespaces of a process:
[root@i-k9pwet2d ~]# pidof bash
14208 11123 2053

[root@i-k9pwet2d ~]# ls -l  /proc/14208/ns
total 0
lrwxrwxrwx 1 root root 0 Jul 20 09:36 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Jul 20 09:36 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Jul 20 09:36 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Jul 20 09:36 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Jul 20 09:36 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Jul 20 09:36 uts -> uts:[4026531838]

Under /proc/[pid]/ns, every process exposes the namespaces it belongs to. Each entry is a symbolic link whose target names the namespace type and its inode number. We can use readlink to check whether two processes are in the same namespace: if the inode numbers are the same, they are.

[root@i-k9pwet2d ~]# readlink /proc/11123/ns/uts
uts:[4026531838]
[root@i-k9pwet2d ~]# readlink /proc/14208/ns/uts
uts:[4026531838]
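The comparison above can be wrapped in a small helper. This is only a minimal sketch: same_ns is a hypothetical function name, and it assumes a Linux system with /proc mounted and permission to read the ns links of both processes.

```shell
# same_ns PID1 PID2 TYPE
# Succeeds (exit status 0) when both processes are in the same namespace of
# the given type (uts, pid, mnt, net, ipc, user, ...).
# Hypothetical helper for illustration; needs read access to /proc/PID/ns.
same_ns() {
    ns_a=$(readlink "/proc/$1/ns/$3") || return 2
    ns_b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Example: compare the current shell with its parent process.
if same_ns "$$" "$PPID" uts; then
    echo "shell and parent share a UTS namespace"
else
    echo "different UTS namespaces, or links not readable"
fi
```

Returning 2 when a link cannot be read keeps "not the same" (status 1) apart from "could not tell".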

How does a process get into a namespace? The following system calls are involved:

clone(): creates a new process. If flags contains one or more CLONE_NEW* values, new namespaces of those types are created and the child process becomes a member of them.

int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
                 /* pid_t *parent_tid, void *tls, pid_t *child_tid */ );

setns(): joins an existing namespace. fd is a file descriptor for one of the links under /proc/[pid]/ns, and nstype is the namespace flag from the table above.

int setns(int fd, int nstype);

unshare(): moves the calling process out of its current namespace and into a newly created one. flags selects which namespace types to create.

int unshare(int flags);

ioctl(): can be used to query information about a namespace.

int ioctl(int fd, unsigned long request, ...);

Let's look at how each namespace isolates its resource, using the unshare shell command.
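Before going through them one by one, here is a minimal sketch of the general pattern: start a command in a new namespace and compare its ns link with our own. try_new_uts is a hypothetical helper name. The sketch assumes the unshare command from util-linux, and either root or a kernel that allows unprivileged user namespaces (hence --user --map-root-user); when neither holds it simply reports "unavailable".

```shell
# try_new_uts: print the UTS namespace link of this shell and of a command
# started in a new UTS namespace. Hypothetical helper for illustration.
try_new_uts() {
    before=$(readlink /proc/self/ns/uts) || { echo "unavailable"; return 0; }
    # --user --map-root-user lets an unprivileged user attempt this; where
    # that is forbidden, unshare fails and we report it instead of aborting.
    after=$(unshare --user --map-root-user --uts \
                readlink /proc/self/ns/uts 2>/dev/null) \
        || { echo "unavailable"; return 0; }
    echo "before: $before"
    echo "after:  $after"
    if [ "$before" != "$after" ]; then
        echo "new namespace"
    fi
}

try_new_uts
```

The same comparison works for the other namespace types by changing the unshare flag and the link name.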

1.PID Namespace

The PID namespace isolates process IDs. With it, the main process of each container sees itself as PID 1, while the same processes have different PIDs on the host.

[root@i-k9pwet2d ~]# unshare --fork --pid --mount-proc /bin/bash
[root@i-k9pwet2d ~]# ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1 115680  2036 pts/0    S    10:46   0:00 /bin/bash
root        12  0.0  0.1 115684  2048 pts/0    S    10:47   0:00 -bash
root        30  0.0  0.0 155468  1804 pts/0    R+   10:57   0:00 ps -aux
[root@i-k9pwet2d ~]# ls -l /proc/1/ns
total 0
lrwxrwxrwx 1 root root 0 Jul 20 11:05 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Jul 20 11:05 mnt -> mnt:[4026532545]
lrwxrwxrwx 1 root root 0 Jul 20 11:05 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Jul 20 11:05 pid -> pid:[4026532546]
lrwxrwxrwx 1 root root 0 Jul 20 11:05 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Jul 20 11:05 uts -> uts:[4026531838]

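Inside the new PID namespace we can only see the processes that belong to it. Comparing the inode numbers with the earlier listing shows that this bash now lives in new mnt and pid namespaces (--mount-proc also creates a new mount namespace), while the remaining namespaces are still shared with the host.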
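The "process 1" effect can also be checked without an interactive shell. The following is a minimal sketch; pid_in_new_ns is a hypothetical helper name, and it assumes util-linux unshare plus either root or permission to create unprivileged user namespaces. If the namespace cannot be created, it prints "unavailable".

```shell
# pid_in_new_ns: print the PID that a shell sees for itself when it is the
# first process in a new PID namespace. Hypothetical helper for illustration.
pid_in_new_ns() {
    # --fork is needed so that the child, not unshare itself, becomes the
    # first process of the new PID namespace.
    unshare --user --map-root-user --fork --pid sh -c 'echo $$' 2>/dev/null \
        || echo "unavailable"
}

pid_in_new_ns
```

When the call succeeds, the inner shell reports PID 1, even though the host sees it under an ordinary, much larger PID.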

2.Mount Namespace

The mount namespace isolates the mount points seen by different processes or groups of processes. Mount operations performed inside the container do not affect the mounts on the host.

Let’s create a namespace

unshare --mount --fork /bin/bash

Mount a directory

[root@i-k9pwet2d ~]# mkdir /tmp/mnt
[root@i-k9pwet2d ~]# mount -t tmpfs -o size=1m tmpfs /tmp/mnt

[root@i-k9pwet2d ~]# df -h |grep mnt
tmpfs            1M     0   1M   0% /tmp/mnt

The mount made inside the namespace does not affect the host: running the same query on the host returns nothing.

df -h |grep mnt 

3.User Namespace

The user namespace isolates users and user groups. Let's create a user namespace and change the prompt:

[root@i-k9pwet2d ~]# PS1='\u@container#' unshare --user -r /bin/bash

root@container#

Looking at the ns links again, the user link of the new shell (PID 1835 in this run) differs from that of the original shell, so the two are already in different user namespaces.

[root@i-k9pwet2d ~]# readlink /proc/1835/ns/user
user:[4026532192]
[root@i-k9pwet2d ~]# readlink /proc/$$/ns/user
user:[4026531837]

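The biggest advantage of user namespaces is that containers can run without root privileges on the host: a process can be root inside the namespace while being mapped to an unprivileged user outside it, which limits the damage an application running as root can do to the host.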
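The mapping between IDs inside and outside a user namespace is exposed in /proc/[pid]/uid_map, one "inside-start outside-start length" triple per line. The sketch below looks up what an inside UID maps to; outside_uid is a hypothetical helper name, and the host UID 1000 in the example is an assumed value, not taken from the session above.

```shell
# outside_uid INSIDE_UID
# Reads uid_map-style lines ("inside-start outside-start length") on stdin
# and prints the ID outside the namespace that INSIDE_UID maps to.
# Hypothetical helper for illustration; returns 1 if the ID is unmapped.
outside_uid() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { exit found ? 0 : 1 }
    '
}

# A mapping like the one "unshare --user -r" sets up for a user whose host
# UID is 1000 (assumed example value): root inside is UID 1000 outside.
printf '0 1000 1\n' | outside_uid 0      # prints 1000

# Show the mapping of the current process, if the file is available.
if [ -r /proc/self/uid_map ]; then
    cat /proc/self/uid_map
fi
```

An ID that falls outside every range has no mapping, which is why the helper reports failure instead of printing a number.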

4.UTS Namespace

The UTS Namespace is used to isolate host names, allowing each UTS Namespace to have a separate host name.

[root@i-k9pwet2d ~]# unshare --fork --uts /bin/bash

Changing the hostname inside the namespace does not affect the hostname of the host:

[root@i-k9pwet2d ~]# hostname -b container
[root@i-k9pwet2d ~]# hostname
container

On the host:

[root@i-k9pwet2d ~]# hostname
i-k9pwet2d

5.IPC Namespace

The IPC namespace isolates certain IPC resources, namely System V IPC objects (see sysvipc(7)) and, since Linux 2.6.30, POSIX message queues (see mq_overview(7)). Used together with the PID namespace, it lets processes in the same IPC namespace communicate with each other, while processes in different IPC namespaces cannot.

We use the IPC-related commands in Linux to test this:

The ipcs -q command lists the message queues used for inter-process communication.

The ipcmk -Q command creates a message queue.

Let’s create an IPC Namespace first

[root@i-k9pwet2d ~]# unshare --fork --ipc /bin/bash

Create a message queue, then list the queues:

[root@i-k9pwet2d ~]# ipcmk -Q
Message queue id: 0

[root@i-k9pwet2d ~]# ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    
0x1de4aef6 0          root       644        0            0 

Running the same query on the host shows no queue, so the IPC objects are isolated:

[root@i-k9pwet2d ~]# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

6.Net Namespace

The net namespace isolates network devices, IP addresses, and ports. It gives the processes in each namespace their own IP addresses, ports, and network interfaces.

Let’s go ahead and create a Net Namespace

[root@i-k9pwet2d ~]# unshare --net --fork /bin/bash

View network and port information

[root@i-k9pwet2d ~]# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

[root@i-k9pwet2d ~]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name

We see only a loopback interface, lo, and it is DOWN. After bringing it up (for example with ip link set lo up), the namespace has its own network address.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

On the host:

[root@i-k9pwet2d ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:96:e1:36:04 brd ff:ff:ff:ff:ff:ff
    inet 10.150.25.9/24 brd 10.150.25.255 scope global noprefixroute dynamic eth0
       valid_lft 80720sec preferred_lft 80720sec
    inet6 fe80::5054:96ff:fee1:3604/64 scope link
       valid_lft forever preferred_lft forever

[root@i-k9pwet2d ~]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      757/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      1112/master
...

7.Cgroup Namespace

The cgroup namespace virtualizes a process's view of its cgroups. Each cgroup namespace has its own set of cgroup root directories. It is supported since Linux 4.6.

The virtualization provided by the cgroup namespace has several uses:

  • It prevents information leaks. Without it, cgroup directory paths outside the container would be visible to processes inside the container.
  • It simplifies tasks such as container migration.
  • It allows better confinement of containerized processes. The container can mount its own cgroup file system, so it does not need access to the host's cgroup directories.

8.Time Namespace

The time namespace virtualizes two system clocks, CLOCK_MONOTONIC and CLOCK_BOOTTIME, to provide time isolation. It is available since Linux 5.6; see time_namespaces(7).


3. About Cgroup

From the above, we know that when we run a container, Docker and similar tools create a set of namespaces for it; to the operating system, the container is simply a group of processes. Namespaces control what that group can see, but not how much of the host's resources it can use. With great power comes great responsibility, and we cannot let such a group run unchecked, which is why we have cgroups (Linux control groups).

The main purpose of Cgroup is to limit the resources that can be used by a process group, including CPU, memory, disk, network bandwidth, and so on.

The cgroups framework provides the following:

  • Resource limits: we can set memory or CPU limits for a process group, or restrict it to specific devices.
  • Prioritization: one or more groups can be given a larger share of CPU time or disk I/O throughput.
  • Accounting: the resource usage of a group can be monitored and measured.
  • Control: process groups can be frozen, stopped, and restarted.

A Cgroup can consist of one or more processes that are all bound to the same set of restrictions. These groups can also be hierarchical, meaning that subgroups can inherit the restrictions managed by the parent group.

The Linux kernel exposes cgroups through a set of controllers, also called subsystems. A controller is responsible for distributing one particular type of system resource to a group of one or more processes. For example, the memory controller limits memory usage, while the cpuacct controller accounts for CPU usage.
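To find out which group a process belongs to under a given controller, we can read /proc/[pid]/cgroup, whose lines have the form "hierarchy-id:controller-list:path". The following is a minimal sketch; cgroup_path is a hypothetical helper name and the sample lines are illustrative values in the cgroup v1 format used throughout this article.

```shell
# cgroup_path CONTROLLER
# Reads /proc/PID/cgroup-style lines ("hierarchy-id:controller-list:path")
# on stdin and prints the group path for CONTROLLER.
# Hypothetical helper for illustration; returns 1 if the controller is absent.
cgroup_path() {
    awk -F: -v ctl="$1" '
        {
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == ctl) {
                    sub(/^[^:]*:[^:]*:/, "")   # drop the first two fields
                    print
                    found = 1
                    exit
                }
        }
        END { exit found ? 0 : 1 }
    '
}

# Illustrative input; on a live system use: cgroup_path cpu < /proc/self/cgroup
printf '%s\n' '4:cpu,cpuacct:/loop' '3:memory:/' | cgroup_path cpu   # prints /loop
```

Controllers that are mounted together, such as cpu and cpuacct, share one line, which is why the controller list is split on commas.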

We can use mount to see which cgroup subsystems are mounted on the system:

[root@i-k9pwet2d ~]# mount -t cgroup 
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)

You can see that cgroup has been mounted to /sys/fs/cgroup/ as a file system

[root@i-k9pwet2d ~]# ls -l /sys/fs/cgroup/
total 0
drwxr-xr-x 2 root root  0 Jul 20 12:23 blkio
lrwxrwxrwx 1 root root 11 Jul 20 12:23 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Jul 20 12:23 cpuacct -> cpu,cpuacct
drwxr-xr-x 2 root root  0 Jul 20 12:23 cpu,cpuacct
drwxr-xr-x 2 root root  0 Jul 20 12:23 cpuset
drwxr-xr-x 4 root root  0 Jul 20 12:23 devices
drwxr-xr-x 2 root root  0 Jul 20 12:23 freezer
drwxr-xr-x 2 root root  0 Jul 20 12:23 hugetlb
drwxr-xr-x 2 root root  0 Jul 20 12:23 memory
lrwxrwxrwx 1 root root 16 Jul 20 12:23 net_cls -> net_cls,net_prio
drwxr-xr-x 2 root root  0 Jul 20 12:23 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Jul 20 12:23 net_prio -> net_cls,net_prio
drwxr-xr-x 2 root root  0 Jul 20 12:23 perf_event
drwxr-xr-x 2 root root  0 Jul 20 12:23 pids
drwxr-xr-x 4 root root  0 Jul 20 12:23 systemd

Let's look at an example of how a cgroup limits CPU usage.

We start a busy-loop script that consumes nearly 100% of one CPU, and then use a cgroup to limit it to 50%.

$ cat loop.sh
#!/bin/sh
while [ 1 ]; do
    :
done

Run the script in the background; in this run its PID is 21497.

nohup bash loop.sh &   

Next, we create a control group named loop under the cpu subsystem:

[root@i-k9pwet2d ~]# mkdir /sys/fs/cgroup/cpu/loop

The loop group is a child of the root group of the cpu hierarchy. As mentioned above, child groups inherit the settings of their parent group, so loop starts out with access to all of the system's CPU. The kernel populates the new directory with control files automatically:

[root@i-k9pwet2d shell]# ls -l /sys/fs/cgroup/cpu/loop
total 0
-rw-r--r-- 1 root root 0 Jul 20 17:15 cgroup.clone_children
--w--w--w- 1 root root 0 Jul 20 17:15 cgroup.event_control
-rw-r--r-- 1 root root 0 Jul 20 17:15 cgroup.procs
-r--r--r-- 1 root root 0 Jul 20 17:15 cpuacct.stat
-rw-r--r-- 1 root root 0 Jul 20 17:15 cpuacct.usage
-r--r--r-- 1 root root 0 Jul 20 17:15 cpuacct.usage_percpu
-rw-r--r-- 1 root root 0 Jul 20 17:15 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Jul 20 17:15 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Jul 20 17:15 cpu.rt_period_us
-rw-r--r-- 1 root root 0 Jul 20 17:15 cpu.rt_runtime_us
-rw-r--r-- 1 root root 0 Jul 20 17:15 cpu.shares
-r--r--r-- 1 root root 0 Jul 20 17:15 cpu.stat
-rw-r--r-- 1 root root 0 Jul 20 17:15 notify_on_release
-rw-r--r-- 1 root root 0 Jul 20 17:15 tasks

Check the CPU limits that the loop group inherited. The scheduling period (cpu.cfs_period_us) is 100000 us, i.e. 100 ms, and the quota (cpu.cfs_quota_us) is -1, which means unlimited.

[root@i-k9pwet2d shell]# cat /sys/fs/cgroup/cpu/loop/cpu.cfs_period_us
100000
[root@i-k9pwet2d shell]# cat /sys/fs/cgroup/cpu/loop/cpu.cfs_quota_us
-1

To limit the process's CPU usage to 50%, we set cpu.cfs_quota_us to 50000, so that the group may use at most 50000 us of CPU time in every 100000 us period:

echo 50000 >/sys/fs/cgroup/cpu/loop/cpu.cfs_quota_us
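The relationship between the percentage we want and the quota we write is quota = period × percent / 100. A minimal sketch of that arithmetic follows; cfs_quota_us is a hypothetical helper name and it uses integer arithmetic only.

```shell
# cfs_quota_us PERCENT [PERIOD_US]
# Prints the cpu.cfs_quota_us value that caps a group at PERCENT of one CPU
# for the given period (default 100000 us, the value seen above).
# Hypothetical helper for illustration.
cfs_quota_us() {
    percent=$1
    period=${2:-100000}
    echo $(( period * percent / 100 ))
}

cfs_quota_us 50           # prints 50000
cfs_quota_us 200 100000   # prints 200000, i.e. two full CPUs
```

A value above 100 is meaningful on multi-core machines, because the quota counts CPU time summed over all cores.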

Add the script's PID to the tasks file of the loop control group:

[root@i-k9pwet2d shell]# echo 21497 >/sys/fs/cgroup/cpu/loop/tasks

At this point the script's CPU usage is capped at 50%, as the top output shows:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                        
21497 root      20   0  113284   1176    996 R 50.0  0.1  12:17.48 bash  
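The steps above (create the group, set the quota, add the PID) can be collected into one function. This is a sketch under stated assumptions: limit_cpu is a hypothetical helper, it follows the cgroup v1 file layout used in this article, and on a real system the group directory would be something like /sys/fs/cgroup/cpu/loop and root would be required. The example runs against a scratch directory so that it is safe to execute.

```shell
# limit_cpu GROUP_DIR PID QUOTA_US
# Creates the control group directory if needed, sets its CFS quota, and
# moves PID into the group by writing to its tasks file (cgroup v1 layout).
# Hypothetical helper for illustration.
limit_cpu() {
    group_dir=$1
    mkdir -p "$group_dir" || return 1
    echo "$3" > "$group_dir/cpu.cfs_quota_us" || return 1
    echo "$2" > "$group_dir/tasks" || return 1
}

# Dry run against a scratch directory instead of the real cgroup file system.
scratch=$(mktemp -d)
limit_cpu "$scratch/loop" 21497 50000
cat "$scratch/loop/cpu.cfs_quota_us"   # prints 50000
rm -rf "$scratch"
```

To undo the limit on a real system, write the PID into the tasks file of the parent group and then remove the empty loop directory with rmdir.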

When Docker starts a container, the CPU limit parameters --cpu-period and --cpu-quota adjust exactly these values in the control group that belongs to the container.
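To make the correspondence concrete, the sketch below reads the two CFS files from a control group directory and prints the matching docker run flags. docker_flags_from_group is a hypothetical helper name, and the scratch directory stands in for a real group such as the loop group above. A negative quota means "no limit", so in that case the quota flag is simply left out.

```shell
# docker_flags_from_group GROUP_DIR
# Prints the docker run CPU flags that match the CFS settings stored in
# GROUP_DIR (cgroup v1 file names). Hypothetical helper for illustration.
docker_flags_from_group() {
    period=$(cat "$1/cpu.cfs_period_us") || return 1
    quota=$(cat "$1/cpu.cfs_quota_us") || return 1
    if [ "$quota" -lt 0 ]; then
        echo "--cpu-period=$period"
    else
        echo "--cpu-period=$period --cpu-quota=$quota"
    fi
}

# Scratch directory populated with the values used in this article.
demo=$(mktemp -d)
echo 100000 > "$demo/cpu.cfs_period_us"
echo 50000 > "$demo/cpu.cfs_quota_us"
docker_flags_from_group "$demo"   # prints --cpu-period=100000 --cpu-quota=50000
rm -rf "$demo"
```

The printed flags could then be passed to docker run together with an image name of your choice.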



If anything in this short article falls short, corrections are welcome.
