What is the difference between Docker and containers?

Containers actually predate Docker, and the two terms are not synonymous: a container is built from a collection of Linux kernel features, while Docker is one tool built on top of those features.

In 2008, LXC (Linux Containers) was released. It is an operating-system-level virtualization technology that provides lightweight virtualization to isolate processes and resources. LXC is a userspace interface to the kernel’s container features, and early versions of Docker used it as their execution backend.

In 2013, Docker was released. It combined Linux technologies such as LXC, union file systems, and cgroups into a de facto container standard. Docker also introduced the idea of shipping software as images and built a build-ship-run development workflow around it, which makes developing, releasing, and operating software easier.

| Comparison | Traditional development | Container-based development |
| --- | --- | --- |
| Build method | Maven packaging | Dockerfile |
| Deliverable | JAR or WAR package | Container image |
| Consistency | Weak: development, test, and production environments are hard to keep consistent | Strong: independent of the OS and runtime environment |

What kernel technologies does Docker use?

Namespaces: isolating the running environment of a process

A namespace is a mechanism the Linux kernel uses to isolate kernel resources. Within a namespace, a process can see only the resources that belong to it; processes in different namespaces are isolated from and invisible to each other. A namespace wraps a global system resource so that the processes in each namespace appear to have their own independent instance of it. Changing a system resource inside one namespace affects only the processes in that namespace, not those in others.

One of the main reasons the Linux kernel implements namespaces is to support lightweight virtualization. Processes in the same namespace are aware of each other but know nothing about processes outside it, so a container believes it is running on an independent system, which is how isolation is achieved.
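To make the idea concrete, the sketch below lists the namespaces that a process belongs to by reading the symbolic links under `/proc/<pid>/ns`. It assumes a Linux host with `/proc` mounted; the helper name `list_namespaces` is made up for this illustration, and on other systems it simply returns an empty dictionary.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: identifier} for a process, or {} if unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # not Linux, /proc not mounted, or no such process
    for name in names:
        try:
            # Each entry is a symlink whose target identifies the namespace.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable with the current privileges
    return result


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:20} {ident}")
```

Two processes that show the same identifier for an entry share that namespace. Running the script on the host and again inside a container should show different identifiers for the namespaces that the container runtime isolates.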

From a Docker implementer’s point of view, the file system can be isolated by using chroot to switch the root directory that a process sees.

Each container needs its own IP address, ports, routes, and network stack, isolated from those of other containers. Each container has its own PID (process ID) space, isolated from the host’s PIDs. Each container also has its own users and user groups, which must be isolated as well.

With these global system resources isolated, the processes in a container appear to have a system environment of their own. Linux provides system calls that operate on processes and namespaces. For example, clone() creates a new process and lets the caller specify which new namespaces it runs in.
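As a rough sketch of how namespaces are requested, the code below combines the `CLONE_NEW*` flags from `<linux/sched.h>` into the bitmask that clone() or unshare() expects. The flag values are the ones defined in the Linux headers; the helper names `namespace_flags` and `unshare` are assumptions made for this example. The `unshare` wrapper calls the C library through `ctypes`, works only on Linux, and usually needs root privileges, so it is defined here but not invoked.

```python
import ctypes
import os

# Namespace flags from <linux/sched.h>, passed to clone(2) or unshare(2).
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS: mount points
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP: cgroup root directory
    "uts": 0x04000000,     # CLONE_NEWUTS: hostname and domain name
    "ipc": 0x08000000,     # CLONE_NEWIPC: System V IPC, POSIX message queues
    "user": 0x10000000,    # CLONE_NEWUSER: user and group IDs
    "pid": 0x20000000,     # CLONE_NEWPID: process IDs
    "net": 0x40000000,     # CLONE_NEWNET: network devices, ports, routes
}


def namespace_flags(names):
    """OR together the flags for the requested namespace kinds."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def unshare(names):
    """Detach the calling process into new namespaces (Linux only, usually root).

    For "pid", only children created afterwards enter the new namespace.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_flags(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For example, `namespace_flags(["pid", "net"])` yields `0x60000000`, the mask a runtime would pass to request a new PID namespace and a new network namespace together.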

Cgroups: limiting resource quotas

Cgroups (control groups) is a mechanism provided by the Linux kernel to limit the resources used by one or more processes. It allows fine-grained control over resources such as memory and CPU.

Cgroups defines a subsystem for each type of resource it can control. The typical subsystems are:
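As an illustration of what such limits look like in practice, the sketch below converts two user-facing settings into the numbers a runtime would write into cgroup v1 control files: `cpu.cfs_quota_us` and `cpu.cfs_period_us` for CPU, and `memory.limit_in_bytes` for memory. The function names are hypothetical, and the default period of 100000 microseconds is the value Docker documents for its `--cpus` option.

```python
_UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def cpu_quota_us(cpus, period_us=100_000):
    """CPU time (microseconds) a group may use per period, for `cpus` CPUs."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return int(round(cpus * period_us))


def memory_limit_bytes(text):
    """Parse a size such as '512m' or '2g' into bytes (1024-based units)."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # a bare number is taken as bytes


if __name__ == "__main__":
    print(cpu_quota_us(1.5))           # value for cpu.cfs_quota_us
    print(memory_limit_bytes("512m"))  # value for memory.limit_in_bytes
```

A limit of 1.5 CPUs therefore means the group may consume 150000 microseconds of CPU time in every 100000-microsecond period.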

  1. cpu: limits the CPU usage of processes.
  2. cpuacct: reports the CPU usage of the processes in a cgroup.
  3. cpuset: assigns specific CPU nodes or memory nodes to the processes in a cgroup.
  4. memory: limits the memory usage of processes.
  5. blkio: limits the block device I/O of processes.
  6. devices: controls which devices processes can access.
  7. net_cls: tags the network packets of the processes in a cgroup so that the tc (traffic control) module can then control those packets.
  8. freezer: suspends or resumes the processes in a cgroup.
  9. ns: lets processes in different cgroups use different namespaces.

Each of these subsystems needs to coordinate with other modules in the kernel to accomplish resource control.
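To see which subsystems a process is attached to, one can read `/proc/<pid>/cgroup`, where each line has the form `hierarchy-ID:controller-list:cgroup-path`. The parser below is a minimal sketch; the function name and the sample text are made up for illustration.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup.

    Returns a list of (hierarchy_id, controllers, path) tuples.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only; the path may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


if __name__ == "__main__":
    sample = "4:cpu,cpuacct:/docker/abc\n3:memory:/docker/abc\n"
    for hierarchy_id, names, path in parse_proc_cgroup(sample):
        print(hierarchy_id, names, path)
```

On a real Linux host the input would come from `open("/proc/self/cgroup").read()`. On systems that use cgroup v2, the file contains a single line with hierarchy ID 0 and an empty controller list.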

Cgroup hierarchy:

The cgroup hierarchy diagram shows the overall relationship between processes and cgroups. Each P at the bottom represents a process. Every process descriptor holds a pointer to an auxiliary data structure, css_set (cgroups subsystem set), and a process that points to a css_set is added to that css_set’s process list. A process can belong to only one css_set, while a css_set can contain multiple processes. Processes that belong to the same css_set are subject to the resource limits associated with that css_set.

The “M×N Linkage” in the diagram shows that css_sets and cgroup nodes can be associated many-to-many through auxiliary data structures. However, the cgroups implementation does not allow a css_set to be associated with more than one node in the same cgroup hierarchy, because cgroups does not allow multiple limit configurations for the same resource.

When a css_set is associated with nodes in several cgroup hierarchies, the processes in that css_set are subject to limits on several resources. When a cgroup node is associated with several css_sets, the process lists of all those css_sets are subject to the same limit on the same resource.
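The association rules can be modelled with a few lines of Python. This is only a toy model of the relationships described in the text, not the kernel's actual data structures; the class and method names are invented for the example.

```python
class CgroupNode:
    """A node in one cgroup hierarchy, e.g. hierarchy 'cpu', path '/docker/abc'."""

    def __init__(self, hierarchy, path):
        self.hierarchy = hierarchy
        self.path = path
        self.css_sets = []  # one node may be linked to many css_sets


class CssSet:
    """A set of processes that share the same cgroup membership."""

    def __init__(self):
        self.processes = []  # one css_set may contain many processes
        self.nodes = {}      # hierarchy name -> CgroupNode (at most one each)

    def attach(self, node):
        # A css_set may not be linked to two nodes of the same hierarchy,
        # because that would mean two limit configurations for one resource.
        if node.hierarchy in self.nodes:
            raise ValueError(f"already attached in hierarchy {node.hierarchy!r}")
        self.nodes[node.hierarchy] = node
        node.css_sets.append(self)


if __name__ == "__main__":
    group = CssSet()
    group.processes.extend([101, 102])
    group.attach(CgroupNode("cpu", "/web"))
    group.attach(CgroupNode("memory", "/web"))
    print(sorted(group.nodes))
```

Attaching the same css_set to a second node in the `cpu` hierarchy raises `ValueError`, which mirrors the restriction that one css_set cannot be linked to multiple nodes of the same hierarchy.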

Switching a process’s root directory to the union-mounted rootfs (change root)

Docker uses a union file system’s ability to mount multiple directories from different locations onto the same directory, so that the layers of a container image are presented as a single rootfs (root file system). The rootfs packages the files and directories of an entire operating system and is the most complete “dependency library” an application needs to run.

AUFS (Another Union File System, later renamed Advanced Multi-Layered Unification File System) is used in Docker as a union file system. AUFS can mark each branch directory as read-only, read-write, or whiteout-able. It also supports layering: changes to a read-only branch are applied logically and incrementally in a writable branch, so the read-only content itself is never modified.

When Docker starts a container from an image, it sets up the image’s file system and mounts a new read-write layer on top of it for the container. The container is created in this file system, with the read-write layer stacked above the image layers. Docker currently supports AUFS, Btrfs, VFS, and DeviceMapper as storage drivers.
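The behaviour of a read-write layer stacked on read-only layers can be simulated with plain dictionaries. The sketch below is a simplified in-memory model, not how AUFS is implemented: each layer maps a path to its content, and a sentinel object stands in for a whiteout entry. All names are hypothetical.

```python
WHITEOUT = object()  # marks a path as deleted in an upper layer


def union_lookup(layers, path):
    """Look up `path` in layers ordered from top (read-write) to bottom."""
    for layer in layers:
        if path in layer:
            if layer[path] is WHITEOUT:
                break  # deleted in an upper layer; lower copies are hidden
            return layer[path]
    raise FileNotFoundError(path)


def union_write(layers, path, content):
    """Writes always land in the top read-write layer."""
    layers[0][path] = content


def union_delete(layers, path):
    """Deleting hides the path with a whiteout instead of touching lower layers."""
    union_lookup(layers, path)  # raises FileNotFoundError if not visible
    layers[0][path] = WHITEOUT


if __name__ == "__main__":
    image_layer = {"/etc/hosts": "from image", "/bin/sh": "shell"}
    container_layer = {}
    layers = [container_layer, image_layer]
    union_write(layers, "/etc/hosts", "from container")
    print(union_lookup(layers, "/etc/hosts"))  # the upper layer wins
    print(image_layer["/etc/hosts"])           # the image layer is unchanged
```

Because every change lands in the top layer, many containers can share the same read-only image layers without affecting one another.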

In Docker, an upper-layer image depends on the image below it. The lower-layer image is called the parent image, and an image without a parent is called a base image.

Therefore, to start a container from an image, Docker loads its parent images all the way down to the base image, and the user’s process runs in the writable file system layer on top.

The data of all parent images, together with IDs, networking, LXC-managed resource limits, and the container-specific configuration, make up what Docker calls a container.
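Walking from an image down to its base image amounts to following parent links until none remains. A minimal sketch, assuming the image metadata is available as a dictionary that maps each image ID to its parent ID (or `None` for a base image); the function name and the sample IDs are invented for illustration.

```python
def image_chain(images, image_id):
    """Return the image IDs from `image_id` down to its base image."""
    chain = []
    seen = set()
    current = image_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at image {current!r}")
        seen.add(current)
        chain.append(current)
        current = images[current]  # parent ID, or None for a base image
    return chain


if __name__ == "__main__":
    images = {"app": "runtime", "runtime": "base", "base": None}
    print(image_chain(images, "app"))
```

The last element of the returned list is the base image, and the layers would be stacked in the reverse order, with the writable layer placed on top.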

Docker security:

The main security concern with Docker containers is that they share the host kernel, so the attack surface is especially large when a container is compromised.

The main purpose of SELinux is to minimize the resources that service processes in the system can access (the principle of least privilege).

On a system that uses SELinux, whether a resource can be accessed depends not only on whether the user has the corresponding permission (read, write, or execute) on it, but also on whether the process’s type is allowed to access that type of resource.

As a result, even for a process running as root, the system checks the type of the process and the types of resources it is allowed to access before granting access. This shrinks the process’s room to act to a minimum.

Even a service running as root can generally access only the resources it needs, so if a program is compromised, the damage is limited to the resources it is allowed to access. This greatly improves security.
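The two-stage decision can be sketched as follows. This is a toy model of type enforcement, not the SELinux policy engine: the rule table, the function names, and the single rule shown are assumptions made for the example, although `httpd_t`, `httpd_sys_content_t`, and `shadow_t` are type names used in common SELinux policies.

```python
# (process type, resource type, object class) -> permitted operations
ALLOW_RULES = {
    ("httpd_t", "httpd_sys_content_t", "file"): {"read", "getattr"},
}


def selinux_allows(rules, source_type, target_type, obj_class, perm):
    """Type enforcement: anything without a matching allow rule is denied."""
    return perm in rules.get((source_type, target_type, obj_class), set())


def access_allowed(dac_ok, rules, source_type, target_type, obj_class, perm):
    """Access needs both the ordinary permission check and the type check."""
    return dac_ok and selinux_allows(rules, source_type, target_type, obj_class, perm)


if __name__ == "__main__":
    # A web server running as root passes the ordinary permission check
    # (dac_ok=True) but still cannot read a file labelled shadow_t.
    print(access_allowed(True, ALLOW_RULES, "httpd_t", "httpd_sys_content_t", "file", "read"))
    print(access_allowed(True, ALLOW_RULES, "httpd_t", "shadow_t", "file", "read"))
```

The second call is denied even though the process is root, because no rule lets the `httpd_t` type read `shadow_t` files.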
