preface

You have probably heard of Docker container technology; it is all the rage right now. Its main advantages are:

  • No kernel restart is needed, so applications can be started and scaled in seconds.
  • High resource utilization: the host kernel schedules resources directly, so the performance overhead is small.
  • All dependent services start with one click; an image is built once and runs anywhere, keeping test and production environments highly consistent.
  • The application's runtime environment is independent of the host environment and fully controlled by the image, so images for multiple environments can be tested and deployed on a single physical machine.
  • It enables continuous delivery and deployment.

However, when we learn something, we should not stop at surface-level commands; more importantly, we should understand the internals. Docker's core technology is built mainly on Linux namespaces, cgroups, and union file systems, plus Docker's own networking.

This article introduces Docker's core technologies: namespaces, cgroups, and Union FS (Docker networking will be covered in a later post). It summarizes what I have learned during this period.

Namespace

The Linux namespace is a resource isolation mechanism provided by the Linux kernel:

  • The system can assign different namespaces to processes; each process has its own namespace.
  • Resources are allocated independently per namespace, so processes in different namespaces are isolated from and do not interfere with each other.

To isolate different kinds of resources, the Linux kernel provides six types of namespace (a seventh, the cgroup namespace, was added in Linux 4.6):

  • PID namespace: isolates process IDs
  • Net namespace: isolates network devices, IP addresses, ports, and routing tables
  • IPC namespace: isolates System V IPC and POSIX message queues
  • Mnt namespace: isolates mount points
  • UTS namespace: isolates hostname and domain name
  • User namespace: isolates user and group IDs

**Note:** UTS stands for UNIX Time-Sharing System.
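
Which namespaces a process belongs to can be read directly from /proc/<pid>/ns. As a minimal Go sketch (Linux only), the following prints the namespace IDs of the current process:

package main

import (
	"fmt"
	"os"
)

// Print the namespace IDs of the current process by resolving the
// symbolic links under /proc/self/ns (Linux only).
func main() {
	entries, err := os.ReadDir("/proc/self/ns")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		target, err := os.Readlink("/proc/self/ns/" + e.Name())
		if err != nil {
			continue
		}
		fmt.Println(target) // e.g. "net:[4026531956]"
	}
}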

Linux implementation of namespace

Since a namespace is associated with a process, we can see that the task_struct structure contains variables associated with the namespace.

## \kernel\msm4.4\include\linux\sched.h
struct task_struct {
	/* ... task_struct also contains other information about the process,
	   such as the process state ... */
	/* namespaces */
	struct nsproxy *nsproxy;
};

The nsproxy structure holds pointers to the implementations of the various namespaces:

# \kernel\msm4.4\include\linux\nsproxy.h
/*
 * A structure to contain pointers to all per-process
 * namespaces - fs (mount), uts, network, sysvipc, etc.
 *
 * The pid namespace is an exception -- it's accessed using
 * task_active_pid_ns. The pid namespace here is the
 * namespace that children will use.
 *
 * 'count' is the number of tasks holding a reference.
 * The count for each namespace, then, will be the number
 * of nsproxies pointing to it, not the number of tasks.
 *
 * The nsproxy is shared by tasks which share all namespaces.
 * As soon as a single namespace is cloned or unshared, the
 * nsproxy is copied.
 */
struct nsproxy {
	atomic_t count;
	struct uts_namespace *uts_ns;
	struct ipc_namespace *ipc_ns;
	struct mnt_namespace *mnt_ns;
	struct pid_namespace *pid_ns_for_children;
	struct net 	     *net_ns;
};
extern struct nsproxy init_nsproxy;  /* init_nsproxy initializes all namespaces except mnt_ns */

Linux operations on namespaces

There are three main system calls for working with namespaces: clone, setns, and unshare.

clone

// Linux lightweight processes are created by the clone() function
/*
 * fn          when a new process is created with clone, it starts execution
 *             by calling the function pointed to by fn
 * child_stack specifies the stack space allocated to the child process
 * flags       specifies which namespaces to create for the new process:
 *             CLONE_NEWIPC    corresponds to the IPC namespace
 *             CLONE_NEWNET    corresponds to the Net namespace
 *             CLONE_NEWNS     corresponds to the Mount namespace
 *             CLONE_NEWPID    corresponds to the PID namespace
 *             CLONE_NEWUSER   corresponds to the User namespace
 *             CLONE_NEWUTS    corresponds to the UTS namespace
 *             CLONE_NEWCGROUP corresponds to the cgroup namespace, giving the
 *                             process its own cgroup control group (since Linux 4.6)
 * arg         is passed to fn
 */
int clone(int (*fn)(void *), void *child_stack, int flags, void *arg);

clone creates a new process and adds it to the new namespace(s) without affecting the current process. A child created by clone can share the parent's virtual address space, file descriptors, and signal handler table.

Note:

(1) clone differs considerably from fork: the clone call takes a function pointer int (*fn)(void *), and that function is executed in the child process.

(2) The biggest difference is that clone no longer copies the parent's stack space: the second argument, void *child_stack, supplies newly allocated stack space for the child, so the stack is neither inherited nor copied but freshly created.
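
Go programs do not usually call clone() directly; the runtime wraps it via os/exec. As a rough sketch of the flags above (Linux only, run as root), the following launches a shell inside new UTS and PID namespaces:

package main

import (
	"os"
	"os/exec"
	"syscall"
)

// Launch a shell inside new UTS and PID namespaces. os/exec forks
// via clone(2) under the hood; the namespace flags are passed in
// SysProcAttr.Cloneflags (Linux only, requires root).
func main() {
	cmd := exec.Command("sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}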

setns

/*
 * fd     a file descriptor referring to one of the /proc/[pid]/ns namespace files
 * nstype the namespace type to join (one of the CLONE_NEW* flags);
 *        0 means any namespace type is allowed
 */
int setns(int fd, int nstype);

// For example:
int fd = pidfd_open(1234, 0);
setns(fd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWUTS);

setns joins the calling thread (a single thread, or a single-threaded process) to the namespace specified by fd.
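
From Go, the same call is available through the golang.org/x/sys/unix package. A sketch (the pid 1234 is an arbitrary example; requires root):

package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// Join the network namespace of another process (pid 1234 is an
// arbitrary example). setns(2) applies to the calling thread, so the
// goroutine must be pinned to its OS thread first. Requires root.
func main() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	fd, err := unix.Open("/proc/1234/ns/net", unix.O_RDONLY, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	if err := unix.Setns(fd, unix.CLONE_NEWNET); err != nil {
		panic(err)
	}
	fmt.Println("now running inside pid 1234's network namespace")
}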

unshare

int unshare(int flags);

unshare moves the calling process into a new namespace: the process leaves its current namespace and enters a freshly created one. Note the difference from clone(), which puts a newly created child process into the new namespace instead.
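
A minimal Go sketch of unshare(2) (Linux only, run as root); the hostname change below is visible only inside the new UTS namespace:

package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
)

// Detach from the current UTS namespace and change the hostname.
// Like setns, unshare(2) acts on the calling thread, so the
// goroutine is pinned to its OS thread first.
func main() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := syscall.Unshare(syscall.CLONE_NEWUTS); err != nil {
		panic(err)
	}
	// Visible only inside the new UTS namespace
	if err := syscall.Sethostname([]byte("unshare-demo")); err != nil {
		panic(err)
	}
	name, _ := os.Hostname()
	fmt.Println("hostname in the new UTS namespace:", name)
}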

Common namespace operations

Query the namespaces on the current system: lsns -t <type>

[root@aliyun ns]# lsns -t mnt
        NS TYPE NPROCS PID USER   COMMAND
4026531840 mnt      88   1 root   /usr/lib/systemd/systemd --system --deserialize 17
4026531856 mnt       1  13 root   kdevtmpfs
4026532151 mnt       1 541 chrony /usr/sbin/chronyd

Query the namespaces of a process: ll /proc/<pid>/ns

#Check the process ID first
[root@aliyun proc]# docker inspect 97649934abf3 | grep -i pid
            "Pid": 26103,
            "PidMode": "",
            "PidsLimit": null,
#View the namespaces of the process
[root@aliyun proc]# ll /proc/26103/ns
total 0
lrwxrwxrwx 1 root root 0 Aug 31 15:19 ipc -> ipc:[4026532163]
lrwxrwxrwx 1 root root 0 Aug 31 15:19 mnt -> mnt:[4026532161]
lrwxrwxrwx 1 root root 0 Aug 31 15:17 net -> net:[4026532166]
lrwxrwxrwx 1 root root 0 Aug 31 15:19 pid -> pid:[4026532164]
lrwxrwxrwx 1 root root 0 Aug 31 15:19 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Aug 31 15:19 uts -> uts:[4026532162]

To enter a namespace, run nsenter -t <pid> -n ip addr, where -t specifies the target process ID and -n selects its Net namespace. nsenter is essentially a thin layer of encapsulation over setns: instead of having to supply a namespace file descriptor, we can simply specify a process ID.

nsenter runs a specified command inside the namespaces of a specified process. Because most containers ship with only a minimal set of commands to stay lightweight, debugging a container's network can be a real pain: docker inspect <container-id> only reveals the container's IP, and connectivity to other networks cannot be tested from inside (this can also be handled through Docker networking). With nsenter we can enter the container's network namespace and debug it using the host's own commands.

#Use ip addr on the host to view network information in the container
[root@aliyun proc]# docker inspect 97649934abf3 | grep -i pid
            "Pid": 26103,
            "PidMode": "",
            "PidsLimit": null,
[root@aliyun proc]# nsenter -t 26103 -n ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
#The network information inside the container is identical
[root@97649934abf3 /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

Namespace practice

#Run a sleep command in a new Net namespace
[root@aliyun proc]# unshare -fn sleep 60
#Find the PID of the sleep process
[root@aliyun /]# ps -ef | grep sleep
root     27992  2567  0 15:47 pts/0    00:00:00 unshare -fn sleep 60
root     27993 27992  0 15:47 pts/0    00:00:00 sleep 60
root     28000 25995  0 15:47 pts/1    00:00:00 grep --color=auto sleep
#View the Net namespace of Sleep
[root@aliyun /]# lsns -t net
        NS TYPE NPROCS   PID USER COMMAND
4026531956 net      92     1 root /usr/lib/systemd/systemd --system --deserialize 17
4026532160 net       2 27992 root unshare -fn sleep 60
#Enter the net namespace of sleep through nsenter to check the network information (note: complete this within the 60-second lifetime of the sleep process)
[root@aliyun /]# nsenter -t 27992 -n ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00


Cgroup

Control Groups (cgroups) are a Linux mechanism for limiting and monitoring the resources used by one process or a group of processes, such as CPU, memory, and disk I/O. A simple example: if I run a while(true) program, the system's CPU immediately maxes out; with a cgroup we can cap the CPU resources the program may use.

Cgroups organize and manage each system resource hierarchically within separate subsystems: every cgroup can contain child cgroups, so the resources available to a child cgroup are constrained both by its own configured parameters and by those of its parent. A sketch of this hierarchy follows.
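
As a minimal Go sketch of that hierarchy (cgroup v1, assuming the CPU controller is mounted at /sys/fs/cgroup/cpu; the demo-parent and demo-child names are made up for illustration):

package main

import (
	"os"
	"path/filepath"
)

// Build a parent/child cgroup pair under the v1 CPU controller.
// The child may set its own cpu.cfs_quota_us, but it can never use
// more CPU than the parent's quota allows. Requires root.
func main() {
	parent := "/sys/fs/cgroup/cpu/demo-parent"
	child := filepath.Join(parent, "demo-child")

	if err := os.MkdirAll(child, 0755); err != nil {
		panic(err)
	}
	// Parent: 20ms of CPU per 100ms period (20% of one core)
	must(os.WriteFile(filepath.Join(parent, "cpu.cfs_quota_us"), []byte("20000"), 0644))
	// Child asks for 50%, but remains capped at the parent's 20%
	must(os.WriteFile(filepath.Join(child, "cpu.cfs_quota_us"), []byte("50000"), 0644))
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}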

Linux implementation of cgroup

We can also see variables associated with cgroups in the task_struct structure

## \kernel\msm4.4\include\linux\sched.h
struct task_struct {
	/* Control Group info protected by css_set_lock: */
	struct css_set __rcu *cgroups;
	/* cg_list protected by css_set_lock and tsk->alloc_lock: */
	struct list_head cg_list;
};

struct css_set {
	/*
	 * Set of subsystem states, one for each subsystem. This array is
	 * immutable after creation apart from the init_css_set during
	 * subsystem registration (at boot time).
	 */
	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
};

Subsystems under Cgroup

Under /sys/fs/cgroup we can see the subsystems contained in cgroup:

  • blkio: limits input/output for block devices such as disks, CDs, and USB drives;
  • cpu: uses the scheduler to control CPU access for cgroup tasks;
  • cpuacct: generates CPU usage reports for cgroup tasks;
  • cpuset: on multi-core systems, assigns dedicated CPUs and memory nodes to cgroup tasks;
  • devices: allows or denies access to devices by cgroup tasks;
  • freezer: suspends and resumes cgroup tasks;
  • memory: sets memory limits per cgroup and generates memory usage reports;
  • net_cls: tags network packets so they can be classified per cgroup;
  • ns: the namespace subsystem;
  • pids: limits the number of processes in a cgroup.

The most important and most commonly used are the cpu and memory subsystems.

CPU subsystem

The CPU subsystem's main control files are cpu.shares (relative CPU weight), cpu.cfs_period_us (the scheduling period, 100000 microseconds by default), cpu.cfs_quota_us (the CPU time available per period; -1 means unlimited), and cpu.stat (throttling statistics).

Demonstrating cgroup CPU limits in Go

Write a test.go file and run it:

package main

// Two busy loops (one in a goroutine, one in main) to saturate the CPU.
func main() {
	i := 0
	go func() {
		for {
			i++
		}
	}()

	for {
		i++
	}
}

Viewing CPU usage with top before applying any limit, we can see the test program consuming more than 90% of a CPU.

Create a folder under the cgroup CPU subsystem and use it to restrict the process:

[root@aliyun test]# mkdir /sys/fs/cgroup/cpu/test
[root@aliyun test]# cd /sys/fs/cgroup/cpu/test
[root@aliyun test]# ls
cgroup.clone_children  cpuacct.stat          cpu.cfs_period_us  cpu.rt_runtime_us  notify_on_release
cgroup.event_control   cpuacct.usage         cpu.cfs_quota_us   cpu.shares         tasks
cgroup.procs           cpuacct.usage_percpu  cpu.rt_period_us   cpu.stat
#Set cpu.cfs_quota_us to 2000; with cpu.cfs_period_us at its default of 100000, this caps the test.go process at about 2% CPU
[root@aliyun test]# echo 2000 > cpu.cfs_quota_us
#Add the test process ID to cgroup.procs
[root@aliyun test]# echo 31944 > cgroup.procs

Viewing the process with top again, we can see the test program's CPU usage has dropped to about 2%. The effect is obvious!

Note:

About tasks and cgroup.procs: many articles describe cgroup "tasks" as OS processes, which is not accurate. More precisely, the PIDs in the cgroup.procs file are process IDs in the usual sense, while the PIDs in the tasks file can be the PIDs of Linux lightweight processes (LWPs); threads in the Linux pthread library are in fact implemented as lightweight processes. In short: the PID of a Linux process's main thread equals the process PID, while the other threads' PIDs (LWP PIDs) are assigned independently.

To add a single thread to a cgroup, write its thread PID to the tasks file; the thread's process PID then automatically appears in cgroup.procs. Conversely, writing a process PID to cgroup.procs causes all of the process's thread PIDs to appear automatically in tasks. Using tasks, we can therefore manage at the thread level, as sketched below.
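
A small Go sketch of that thread-level management (assuming the /sys/fs/cgroup/cpu/test cgroup created earlier still exists; run as root):

package main

import (
	"os"
	"runtime"
	"strconv"
	"syscall"
)

// Put only the calling thread (LWP) into the cgroup by writing its
// TID to the tasks file; writing to cgroup.procs instead would move
// the whole process. Assumes /sys/fs/cgroup/cpu/test exists.
func main() {
	runtime.LockOSThread() // keep this goroutine pinned to one OS thread
	tid := syscall.Gettid()

	err := os.WriteFile("/sys/fs/cgroup/cpu/test/tasks",
		[]byte(strconv.Itoa(tid)), 0644)
	if err != nil {
		panic(err)
	}
	// Busy loop: only this thread is throttled by the cgroup's quota
	for {
	}
}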

The Memory subsystem

The cgroup memory subsystem is called the Memory Resource Controller. It can limit the physical memory and swap used by all tasks in a cgroup, and it controls whether the process is killed when the limit is exceeded (OOM).

Demonstrating cgroup memory limits in Go

Write a mem.go file

package main

// Re-exec trick: the parent re-runs itself as /proc/self/exe inside new
// namespaces, places the child into a memory cgroup, and the child then
// launches stress to apply memory pressure.

import (
	"fmt"
	"io/ioutil"
	"os"
	"os/exec"
	"path"
	"strconv"
	"syscall"
)

const CgroupMemoryHierarchyMount = "/sys/fs/cgroup/memory"

func main() {
	if os.Args[0] == "/proc/self/exe" {
		fmt.Println("---------- 2 ----------")
		fmt.Printf("Current pid: %d\n", syscall.Getpid())

		// Create the stress child process to apply memory pressure
		allocMemSize := "99m" // just below the 100m cgroup limit
		fmt.Printf("allocMemSize: %v\n", allocMemSize)
		stressCmd := fmt.Sprintf("stress --vm-bytes %s --vm-keep -m 1", allocMemSize)
		cmd := exec.Command("sh", "-c", stressCmd)
		cmd.SysProcAttr = &syscall.SysProcAttr{}
		cmd.Stdin = os.Stdin
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr

		if err := cmd.Run(); err != nil {
			fmt.Printf("stress run error: %v", err)
			os.Exit(-1)
		}
		// The child's job is done once stress exits
		os.Exit(0)
	}

	fmt.Println("---------- 1 ----------")
	cmd := exec.Command("/proc/self/exe")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS | syscall.CLONE_NEWPID,
	}
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// Start the child process
	if err := cmd.Start(); err != nil {
		fmt.Printf("/proc/self/exe start error: %v", err)
		os.Exit(-1)
	}

	cmdPid := cmd.Process.Pid
	fmt.Printf("cmdPid: %d\n", cmdPid)

	// Create a child cgroup
	memoryGroup := path.Join(CgroupMemoryHierarchyMount, "test_memory_limit")
	os.Mkdir(memoryGroup, 0755)
	// Set the memory limit
	ioutil.WriteFile(path.Join(memoryGroup, "memory.limit_in_bytes"),
		[]byte("100m"), 0644)
	// Add the child process to the cgroup
	ioutil.WriteFile(path.Join(memoryGroup, "tasks"),
		[]byte(strconv.Itoa(cmdPid)), 0644)

	cmd.Process.Wait()
}

How it works (at startup, stress allocates 99M of memory; the cgroup caps usage at 100M):

  1. When we first run it with go run mem.go or go build, os.Args[0] is not "/proc/self/exe", so the if branch is skipped.
  2. cmd := exec.Command("/proc/self/exe") then creates a child process that re-executes the current binary under the name /proc/self/exe.
  3. Under /sys/fs/cgroup/memory/, a test_memory_limit cgroup is created and its memory limit is set to 100M.
  4. The child process is added to the cgroup's tasks file.
  5. The parent waits for the child process to finish.
  6. The child process is actually the same program, but because its name is /proc/self/exe it satisfies the initial if statement, so it creates and runs the stress child process.

Run it.

Enter the test_memory_limit directory to inspect the memory limit and the task list; the limit matches what we configured:

[root@aliyun test_memory_limit]# cat memory.limit_in_bytes
104857600
[root@aliyun test_memory_limit]# cat cgroup.procs
3599    // the /proc/self/exe process
3602
3603    // the stress process

You can check the resource usage with top.

cgroup.procs lists the processes in the test_memory_limit cgroup. These are real processes; you can inspect them with pstree -p.

Now let's see what happens when stress exceeds the 100M limit.

Modify the code to request 110M of memory and run it again.

The whole process group is killed with SIGKILL (kill -9). That is the cgroup memory limit at work.

Docker demonstrates cgroup limits on CPU and memory

#Start an nginx image
docker run -d --cpu-shares 513 --cpus 0.2 --memory 1024M --memory-swap 1234M --memory-swappiness 7 -p 8081:80 nginx
#Argument parsing
--cpu-shares 513       relative CPU weight (cpu.shares)
--cpus 0.2             CPU limit, the ratio cpu.cfs_quota_us / cpu.cfs_period_us
--memory 1024M         memory.limit_in_bytes
--memory-swap 1234M    memory.memsw.limit_in_bytes
--memory-swappiness 7  memory.swappiness
-p 8081:80             exposed host port : nginx port
#Viewing the container ID
f4437f9db69d

Next, enter the CPU subsystem directory under cgroup and find the docker folder; inside it is a folder named after the container ID we just started.

Going into this folder, we find the same layout as in the earlier test.go case.

The --cpu-shares 513 parameter passed when starting the container shows up in the cpu.shares file.

The --cpus 0.2 parameter is reflected as the ratio of cpu.cfs_quota_us to cpu.cfs_period_us.

Similarly, a docker folder exists in the memory subsystem under cgroup, again containing a folder named after the container ID.

Go into this folder and compare the configuration with the docker run parameters:

The --memory 1024M parameter is reflected in memory.limit_in_bytes.

The --memory-swap 1234M parameter is reflected in memory.memsw.limit_in_bytes.

The --memory-swappiness 7 parameter is reflected in memory.swappiness.

That completes the analysis of the two most important cgroup subsystems: cpu and memory.

Union FS

Docker images are actually built from layers of file systems, combined by UnionFS. UnionFS can mount multiple directory layers together to form a single virtual file system. A simple example: folder a contains a1.txt and a2.txt, and folder b contains b1.txt and b2.txt. If we create a new folder c and union-mount a and b onto it, c then contains a1.txt, a2.txt, b1.txt, and b2.txt.

At a basic level, a typical Linux file system consists of bootfs and rootfs:

  • **Bootfs (boot file system)** mainly contains the bootloader and the kernel; the bootloader's job is to load the kernel. When Linux starts, the bootfs file system is loaded, and once boot completes the entire kernel lives in memory; ownership of that memory passes from bootfs to the kernel, and bootfs is unmounted. The bottom layer of a Docker image is bootfs.

  • **Rootfs (root file system)**, on top of bootfs, contains the standard directories and files of a typical Linux system, such as /dev, /proc, /bin, and /etc. The rootfs is what varies between operating system distributions, such as Ubuntu and CentOS.

Implementing a union file system with the mount command

#Creating a folder
[root@aliyun uniontest]# mkdir lower upper merge work
#Write files into the lower and upper directories
[root@aliyun uniontest]# echo "from lower" > lower/in_lower.txt
[root@aliyun uniontest]# echo "from upper" > upper/in_upper.txt
[root@aliyun uniontest]# echo "from lower" > lower/in_both.txt
[root@aliyun uniontest]# echo "from upper" > upper/in_both.txt
#Run the mount command to mount it
[root@aliyun uniontest]# mount -t overlay overlay -o lowerdir=lower/,upperdir=upper/,workdir=work merge
#View the files after the union mount
[root@aliyun uniontest]# tree merge
merge
├── in_both.txt
├── in_lower.txt
└── in_upper.txt
#For a file present in both layers, the upper layer's copy is shown
[root@aliyun uniontest]# cat merge/in_both.txt 
from upper
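The same union mount can also be performed programmatically. A sketch using the raw mount(2) syscall from Go (assuming the same four directories exist and we run as root):

package main

import "syscall"

// Equivalent of the mount command above: overlay the read-only lower/
// directory with the writable upper/ directory onto merge/, using work/
// as the scratch directory. Several read-only layers could be stacked
// as "lowerdir=l1:l2:l3", which is how Docker stacks image layers.
func main() {
	opts := "lowerdir=lower,upperdir=upper,workdir=work"
	if err := syscall.Mount("overlay", "merge", "overlay", 0, opts); err != nil {
		panic(err)
	}
}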

Overlay2

Let’s distinguish a few concepts.

  • OverlayFS refers to the Linux kernel driver
  • Overlay/Overlay2 refers to the Docker storage driver.

Overlay2 works with three kinds of directories: lowerdir, upperdir, and merged. The image layers (lowerdir) are read-only; since an image can consist of many layers, there may be multiple lowerdir directories. The container layer (upperdir) sits above lowerdir: it is the read-write layer created when the container starts, and all changes to container data happen there. Finally, the merged directory is the container's mount point, the unified view exposed to the user. All of these layer directories live under /var/lib/docker/overlay2/.

Let's pull an image and analyze it; we find it has only one layer:

/var/lib/docker/overlay2

The folder named l (lowercase L) contains short symbolic links to each layer; the shortened names exist only to avoid exceeding the size limit on mount arguments.

Entering the folder whose name starts with e757, we find the layer's image content in the diff folder and its short name in the link file.

If we start a container from the image and look again, two more layers appear:

Running mount shows the container's overlay mount: the merged directory is the mount point, backed by lowerdir, upperdir, and workdir. workdir is a scratch layer used for copy_up operations, which are triggered when a change touches a file that exists only in lowerdir (for example, the file is not yet in upperdir and must first be copied up from lowerdir).

From the mount output we can trace the layers:

  1. The short link VKYIMVSFFSVOAWCBIZZQ6RNFB4 corresponds to the 8cbdc1ab4a0567cf7841b3a6326bbbec185850b56dff309a572a6cff5be7742f-init/diff folder, the container's init configuration layer (marked ①), and the short link 4LISVLHRVBORTVPCSNCJHBMVSZ corresponds to the image's e75746dca68dcf02f7fd5dc90a5828066c4d66ab8e6719030362e546e98bdc62/diff folder (marked ②). Both ① and ② belong to lowerdir, the read-only layers.

  2. Looking at ①, we can see the contents of its diff folder.

  3. Looking at ②, we can see the information in its dev and etc folders.

  4. Finally, look at the upperdir and merged folders.

As another test, enter the newly started Ubuntu container and create a test.txt file.

Now look at the container's read-write layer: test.txt appears in both the diff and merged folders of the container's own layer, but not in the image's read-only layers, which confirms the analysis.

conclusion

That concludes the analysis of the three core technologies behind Docker. It took about three or four days, but it was very rewarding for me personally: I had heard of all of this before without understanding it thoroughly, and forcing myself to study it seriously this time paid off. I hope to go even deeper in the future. The article may be a bit long lol

If you spot any problems, please point them out and I will correct them.
