Containers: past and present

What is a container

As the name suggests, a container is something that holds things, the way a cup holds a drink ☕️.

In the computer world there are no drinks, only resources: CPU, memory, disk, and so on. A container's job is to hold these computing resources. The English word "container" can mean either a vessel or a shipping container, and the shipping-container sense is probably the better fit semantically. Picture a car 🚗 (our program) being loaded into a shipping container at the port of Tianjin (the development environment) and shipped to the port of Singapore (the production environment) without losing a single part; on arrival, the car 🚘 starts right up. That is the first advantage of containerization: the environment is packaged together with the program (namespaces). We put the car and the exact gasoline it needs (memory, CPU) into the container together, so it can start on arrival instead of relying on local gasoline that may not match the model. The other advantage is that the container's size can be chosen flexibly, avoiding waste (control groups). As for where containers came from, that brings us to their past life.

Virtual machines

Before containers were available, isolation was done with virtual machines. Most VM isolation schemes are based on a hypervisor (a virtualization layer added between the physical hardware and the operating systems), which isolates hardware at the physical level. Each virtual machine is a complete operating system implementation, consisting of an operating system, applications, binary files, and the necessary libraries; a full copy of these files takes up dozens of gigabytes, and VM startup can be slow. (Virtual machine architecture diagram; the image comes from the web.) A virtual machine is an abstraction of physical hardware that turns one server into multiple servers.

There is a classic saying that any problem in computer science can be solved by adding a layer of indirection, and if that doesn't solve it, add another one (ps: classic nesting dolls 2333). This lets you run Linux programs in a Linux VM, Windows programs in a Windows VM, or install an older Linux VM on a newer Linux host to run programs that are incompatible with newer versions. It all sounds wonderful, but the virtual machine's biggest problem is that it is too "heavy": a VM includes every basic component of a computer (CPU, memory, a complete operating system implementation, and all the applications attached to it). That is a nightmare for operations work. It's as if you just want to start one small app, but first you have to boot an entire "physical" computer: CPU, memory, operating system, and so on.

The container

The container (Docker) architecture diagram; the image comes from the web.

Unlike virtual machines, containers virtualize at the operating-system level. Compared with VMs, containers are more portable and lightweight. A container is an abstraction at the application layer that packages code and dependencies together. Multiple containers can run on the same machine, sharing the operating-system kernel, each running as an isolated process in user space. Containers take up less space than VMs (container images are typically tens of megabytes), so a host can run more applications. The resources each container uses still come from the host OS, but the resource quotas and environment isolation are guaranteed by Docker. Operations and deployment also become simple: package the resources -> ship -> start the container.

The core of the container

Namespace

Introduction

control what you can see

Linux Namespace is a kernel-level environment-isolation mechanism provided by Linux. The kernel currently implements six different types of namespaces; each wraps a particular operating-system resource in an abstraction so that the processes inside the namespace appear to have their own isolated instance of that resource. Supporting container implementations is one of the main goals of namespaces. The table below lists the six namespace implementations.

| Name | Clone flag | Kernel version | Role |
| --- | --- | --- | --- |
| Mount namespaces | CLONE_NEWNS | Linux 2.4.19 | Provides file-system-level isolation |
| UTS namespaces | CLONE_NEWUTS | Linux 2.6.19 | Isolates the sethostname() and setdomainname() system calls |
| IPC namespaces | CLONE_NEWIPC | Linux 2.6.19 | Isolates interprocess communication (IPC) resources |
| PID namespaces | CLONE_NEWPID | Linux 2.6.24 | Isolates process IDs |
| Network namespaces | CLONE_NEWNET | Started in Linux 2.6.24, largely completed by Linux 2.6.29 | Isolates network resources such as IP addresses and routing tables |
| User namespaces | CLONE_NEWUSER | Started in Linux 2.6.23, largely completed by Linux 3.8 | Isolates user IDs and group IDs |

Namespaces have three key system calls:

  1. clone() creates a new process; passing the CLONE_NEW* flags above places the child in new namespaces, achieving resource isolation.
  2. setns() moves the calling process into an existing namespace.
  3. unshare() detaches the calling process from its current namespaces, moving it into newly created ones.

use in golang

All the following examples were run with Go 1.15.3 on Linux kernel 4.14.81.

Our directory structure is very simple, containing only our main file and mod dependency files:

```
.
├── go.mod
└── main.go

0 directories, 2 files
```

The command to start a container in Docker is `docker run {image} {cmd}`. Mirroring Docker's startup, the command to start our container will be `go run main.go run {cmd}`.

hello container

Following programmer convention 👀, emmmm, let's start with a hello world.

```go
package main

import (
	"fmt"
	"os"
)

// docker run <image> <cmd>
// go run main.go run <cmd>
func main() {
	switch os.Args[1] {
	case "run":
		run()
	default:
		panic("bad command")
	}
}

func run() {
	fmt.Printf("Running %v\n", os.Args[2:])
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}
```

Let's try running it: `go run main.go run echo hello container`. We get: `Running [echo hello container]`. Our first container is up and running; it simply takes the command-line arguments and prints them as-is. We'll keep refining it.

run cmd in container

Let's add a little more "action" to our container; here's the code:

```go
// Our main function stays the same
func run() {
	fmt.Printf("Running %v\n", os.Args[2:])
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Run()
}
```

Run `go run main.go run echo hello container` and we'll get the output:

```
Running [echo hello container]
hello container
```

In the run function we take the command-line argument (echo) and execute it, so on screen we see the line printed by echo. With exec we can also start a shell: `go run main.go run /bin/bash`. This is an important step, because with a shell we can inspect our "container" much more clearly.

UTS namespace

To refine our main function, let's try our first namespace: the UTS namespace.

The UTS namespace isolates the nodename and domainname returned by the uname() system call. In a container context, UTS allows each container to have its own hostname and domain name. Many scripts customize their behavior based on these values, and in our case it makes it easier to tell host and container apart.

```go
func run() {
	fmt.Printf("Running %v\n", os.Args[2:])
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	// SysProcAttr must be set before cmd.Run(), or the flags have no effect
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}
	cmd.Run()
}
```

Continue with `go run main.go run /bin/bash`: build the binary, sudo, and execute it to get the container running. Our container's hostname is inherited from the host, so running `hostname` right away shows the same value as the host; subsequent changes, however, are no longer synchronized back to the host.

To make it easier to tell the container from our host later on, we want to set the process's hostname right before it starts. The catch is that the process blocks on the `cmd.Run()` line, and calling sethostname before `cmd.Run()` would interfere with our host. So is there a better way? The answer is... nesting dolls, classic nesting dolls. We start our current program again inside the container, let that copy set the hostname, and then exec zsh for a pleasant interactive session 😸.

In Linux there is a special file, `/proc/self/exe`, which always points to the currently running program, so it's easy to use it for happy nesting dolls.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// docker run <image> <cmd>
// go run main.go run <cmd>
func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("bad command")
	}
}

func run() {
	fmt.Printf("Running run %v\n", os.Args[2:])
	// Re-execute ourselves (the nesting doll) with the "child" subcommand,
	// inside a new UTS namespace
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}
	err := cmd.Run()
	if err != nil {
		panic(err)
	}
}

func child() {
	fmt.Printf("Running child %v\n", os.Args[2:])
	// We are already inside the new UTS namespace, so this changes
	// only the container's hostname, not the host's
	syscall.Sethostname([]byte("container"))
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	if err != nil {
		panic(err)
	}
}
```

pid namespace

The PID namespace isolates process IDs: processes in different PID namespaces can use the same PID. One big benefit is that containers can be migrated between hosts while the processes inside keep the same process IDs. A PID namespace also allows each container to have its own init process (the process with PID 1), the "ancestor of all processes", which handles system initialization tasks and reaps orphaned child processes when they terminate (preventing zombie processes).

In the UTS namespace section above we used /proc/self/exe, which points to our currently running program. To ease into PIDs, let's briefly look at the /proc directory in Linux and what a PID is.

PID stands for Process Identifier, the unique ID automatically assigned by the operating system to each process when it is created. The `/proc` directory is a virtual, in-memory file system that records runtime information about the operating system. Listing `/proc` on our machine shows one directory per running process, named after its PID, and `/proc/self` points to the process doing the reading!

The figure below shows the processes running inside our container: you can see sudo, /proc/self/exe (our container), zsh, and the current ps. Let's try the PID namespace; we expect the exe process to become our init process (PID = 1).

```go
func run() {
	fmt.Printf("Running run %v\n", os.Args[2:])
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// The clone flags are bits; use the | operator to request
		// more than one namespace at once
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	err := cmd.Run()
	if err != nil {
		panic(err)
	}
}
```

Run the program again. In the program's own output you can see that the process's pid has changed, yet ps still shows the other set of pids. In fact, a process inside a PID namespace has two pids: one on the host and one inside the container. The `ps` command reads the `/proc` directory, so what we see inside the container through ps is actually the host's pids. We can mount a fresh in-memory proc directory to see only the currently running processes; in the figure below, 1, 14 and 6 are our exe, ls and zsh processes respectively.

However, ps on the host still shows our change: mounting inside our container affects the host machine. `CLONE_NEWNS`, which isolates mount points, is the natural choice, but since mounting over the host's `/proc` directory directly affects the host, could we mount an entirely new file system instead?

mount namespace

In order not to affect the host's proc directory, we grab an alpine file system and put it in the current directory. The file structure now looks like this:

```
tree -L 2
.
├── alpine
│   ├── bin
│   ├── dev
│   ├── etc
│   ├── home
│   ├── lib
│   ├── media
│   ├── mnt
│   ├── opt
│   ├── proc
│   ├── root
│   ├── run
│   ├── sbin
│   ├── srv
│   ├── sys
│   ├── tmp
│   ├── usr
│   └── var
├── go.mod
├── implement-container
└── main.go
```

Linux inherits a very important system call from Unix: chroot. We can change the container's root directory with chroot + chdir, and then mount the proc directory.

```go
func child() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())

	syscall.Sethostname([]byte("container"))
	// Switch the container's root to the alpine file system
	syscall.Chroot("./implement-container/alpine")
	syscall.Chdir("/")
	// Mount a fresh proc so ps sees only this pid namespace
	syscall.Mount("proc", "proc", "proc", 0, "")
	defer syscall.Unmount("proc", 0)

	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	if err != nil {
		panic(err)
	}
}
```

Compile and run the program again. The process IDs are now what we expected: PID 1 is our own process, PID 6 is the shell we started, and PID 8 is the current ps process.

But don't get too excited: if we run ps on our physical machine, we find some very glaring red output.

This is because systemd sets the default mount-event propagation mechanism to MS_SHARED. I won't expand on it here; you can read up on the mount propagation mechanism yourself. The fix is as simple as setting Unshareflags to CLONE_NEWNS.

```go
func run() {
	// ... some code
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// The clone flags are bits; use the | operator to request
		// more than one namespace at once
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}
	// ... some code
}
```

For those of you who have used Docker, our container now looks a lot like a Docker container: it has its own PIDs, its own hostname, and its own file system (ps: we use alpine's file system, which I actually copied out of the Docker alpine image). The remaining IPC, network, and user namespaces are used in basically the same way, and the role of namespaces is our opening line: namespaces control what you can see.

CGroups

control what you can use

Introduction

So far we have used namespace technology to isolate the environment, but our container can still consume the host's resources without limit: CPU, I/O, network, and so on. We need to add resource limits to our container, and Linux cgroups can do exactly that.

CGroups, in full Control Groups, are a mechanism provided by the Linux kernel to limit the resources used by a single process or a group of processes; they allow fine-grained control over CPU, memory, and other resources. There are four very important concepts in cgroups:

  • Task: each process is a task
  • Control group: a collection of processes, describing the restricted resources and their quotas
  • Hierarchy: control groups form hierarchies; a child hierarchy's cgroup inherits the attributes of its parent's cgroup
  • Subsystem: each controllable resource is a subsystem. Typical examples:

| Name | Kernel version | Role |
| --- | --- | --- |
| cpu | Linux 2.6.24 | Limit the CPU usage of processes |
| cpuacct | Linux 2.6.19 | Account for the CPU usage of processes in a cgroup |
| cpuset | Linux 2.6.19 | Bind processes to specific CPUs |
| memory | Linux 2.6.24 | Limit the memory used by processes |
| pids | Linux 4.3 | Limit the number of processes |
| net_cls | Linux 2.6.29 | Tag network packets so they can later be managed with traffic control |
| ... | ... | ... |

Cgroups are a kernel-space control mechanism in Linux, so how are they exposed to user processes? The answer is the virtual file system (VFS): operations on a file hierarchy in user space are translated into operations on the cgroup hierarchy. The path is /sys/fs/cgroup, where you can see one directory per subsystem.

You can read more introductory material on control groups on your own; next, let's put them into practice.

use in golang

pid control group

The pids subsystem controls the number of processes in our container. If the number of processes has already reached the limit, the kernel returns an insufficient-resources error on further creation. The pids cgroup path is /sys/fs/cgroup/pids. Let's create a pid group first.

```go
func child() {
	// Remember to call this before chroot; otherwise the new cgroup
	// would not be created in the host's cgroup tree
	pidControl(20)
	// other code...
}

func pidControl(maxPids int) {
	pidCg := "/sys/fs/cgroup/pids"
	groupPath := filepath.Join(pidCg, "gocg")
	// Create the gocg group
	err := os.Mkdir(groupPath, 0775)
	if err != nil && !os.IsExist(err) {
		panic(err)
	}
	// Maximum number of pids
	must(ioutil.WriteFile(filepath.Join(groupPath, "pids.max"), []byte(strconv.Itoa(maxPids)), 0700))
	// Add the current process to the gocg group
	must(ioutil.WriteFile(filepath.Join(groupPath, "cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700))
}
```

In this program we create a new pids cgroup named gocg, limit the number of processes to 20, and add the current process to the gocg control group. When we run the program and inspect gocg on the host, we find that besides the pids.max and cgroup.procs we wrote, the OS has created several files automatically. Checking pids.max and cgroup.procs shows that our control data was written as expected.

Let's test whether the pid limit actually works. There is a classic shell one-liner called the fork bomb:

```
:(){ :|:& };:
```

What it means:

  1. `:()` defines a function named `:`
  2. the function takes no arguments
  3. the function body `{ :|:& }` calls itself recursively, piping into another copy of itself running in the background
  4. `;` closes the function definition
  5. the final `:` invokes the function

The result is that it keeps creating child processes until the machine collapses. Let's test it. Looking inside the container, you can see that the number of processes is capped by the cgroup mechanism.

cpu control group

The cpu subsystem limits the CPU time slices a cgroup may use. The most basic parameters are cpu.cfs_quota_us and cpu.cfs_period_us: cfs_period_us sets the length of a scheduling period, and cfs_quota_us sets how much CPU time the cgroup may consume within each period. Together, the two files set the CPU usage ceiling. Both files are in microseconds (us); cfs_period_us may range from 1 millisecond (ms) to 1 second (s), cfs_quota_us must be at least 1 ms, and writing -1 to cfs_quota_us removes the limit. Words alone aren't very clear, so here are a few examples to help you understand.

```
1. Limit to 1 CPU (250ms of CPU time every 250ms)
# echo 250000 > cpu.cfs_quota_us  /* quota  = 250ms  */
# echo 250000 > cpu.cfs_period_us /* period = 250ms  */

2. Limit to 2 CPUs/cores (1000ms of CPU time every 500ms)
# echo 1000000 > cpu.cfs_quota_us /* quota  = 1000ms */
# echo 500000 > cpu.cfs_period_us /* period = 500ms  */

3. Limit to 20% of 1 CPU (10ms of CPU time every 50ms)
# echo 10000 > cpu.cfs_quota_us   /* quota  = 10ms   */
# echo 50000 > cpu.cfs_period_us  /* period = 50ms   */
```
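The arithmetic behind those examples is simply quota = cores x period. A tiny helper makes that explicit (cfsQuota is a hypothetical name of ours, not a kernel or stdlib API):

```go
package main

import "fmt"

// cfsQuota returns the cpu.cfs_quota_us value that caps a cgroup at
// `cores` CPUs for a given cpu.cfs_period_us
func cfsQuota(cores float64, periodUs int) int {
	return int(cores * float64(periodUs))
}

func main() {
	fmt.Println(cfsQuota(1, 250000))  // 1 core with a 250ms period
	fmt.Println(cfsQuota(2, 500000))  // 2 cores with a 500ms period
	fmt.Println(cfsQuota(0.2, 50000)) // 20% of a core with a 50ms period
}
```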

Let's feel the charm of the cpu cgroup. First, start our container; then run a small infinite-loop shell script and watch the CPU consumption:

```
while true; do :; done
```

You can see that our process pegs an entire core, so let's add a cpu cgroup to limit the CPU to 0.5 cores.

```go
func child() {
	// Limit to 20 processes
	pidControl(20)
	// Limit CPU usage to 0.5 cores
	cpuControl(0.5)
	// other code...
}

func cpuControl(core float64) {
	cpuCg := "/sys/fs/cgroup/cpu"
	groupPath := filepath.Join(cpuCg, "gocg")
	// Create the gocg group
	err := os.Mkdir(groupPath, 0775)
	if err != nil && !os.IsExist(err) {
		panic(err)
	}
	// 10ms period
	cfs := float64(10000)
	// CPU quota = period * cores
	must(ioutil.WriteFile(filepath.Join(groupPath, "cpu.cfs_quota_us"), []byte(strconv.Itoa(int(cfs*core))), 0700))
	must(ioutil.WriteFile(filepath.Join(groupPath, "cpu.cfs_period_us"), []byte(strconv.Itoa(int(cfs))), 0700))
	// Add the current process to the gocg group
	must(ioutil.WriteFile(filepath.Join(groupPath, "cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700))
}
```

Run the same command again, and you can see the process is now limited to 0.5 cores. The remaining cgroup subsystems are used in basically similar ways, and the role of cgroups is our opening line: cgroups control what you can use.

conclusion

We implemented a very simple container. Through namespaces we gave it a hostname, PID space, and root file system independent of the host; through cgroups we limited it to at most 20 processes and 0.5 CPU cores. The core of industrial container implementations is the same combination of namespaces + cgroups. Linux's API design is highly orthogonal, and by configuring different parameters we can obtain all kinds of containers. One closing thought: writing a technical article is really exhausting; my knowledge is still too thin, and I need to keep building up my daily reserves so that writing goes more smoothly. The full source code, along with a slimmed-down alpine file system, has been uploaded to GitHub: Github.com/yinpeihao/g…

That's goodbye to Linux for now; it is hard stuff. Next I'll try to work through a Go standard-library source package ~

reference

  1. What is container technology?
  2. Official Documentation of Namespace
  3. Containers From Scratch