It has been almost five years since I started developing HyperContainer (runV) in early May 2015, so it feels a bit late to be writing an article about what a secure container is. But after five years, as more and more people feel the need for the concept without quite being able to articulate it, it is time to revisit the term "secure container."

Origin: The naming of secure containers

Phil Karlton has a famous saying:

There are only two hard problems in computer science: cache invalidation and naming things.

As far as the container world is concerned, naming certainly lives up to that quote: it is the kind of problem that silences veteran developers and brings newcomers to tears.

In systems software alone, what we now commonly call the Linux container has previously gone by names like jail, zone, virtual server, and sandbox. Similarly, in the early days of the virtualization stack, a virtual machine environment was itself sometimes called a container. After all, the word, which denotes something that contains, encapsulates, and isolates, is simply too common. So much so that the famously rigorous Wikipedia titles its entry on this class of technology "OS-level virtualization," sidestepping the question of what a container is.

After Docker appeared in 2013, containers, together with a series of concepts such as "immutable infrastructure" and "cloud native," overturned over the following years the fine-grained application deployment model based on "package + configuration" combinations: a simple declarative policy plus immutable containers describes the software stack refreshingly well. How applications are deployed may seem beside the point, but what we want to emphasize here is this:

A container in the cloud-native context is essentially an "application container": an application packaged in a standard format to run in a standard operating system environment (usually the Linux ABI), or the program/technology that runs such a package.

I wrote this definition down myself, but it is not my invention; it reflects the consensus of the OCI (Open Container Initiative) specifications, which dictate what environment a containerized application is placed in and how it is operated: which executable on the container root filesystem to start, as which user, what CPU and memory resources are required, what external storage is attached, what sharing requirements exist, and so on.
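To make that concrete, here is a minimal sketch (in Python, for illustration) of the kind of config.json document the OCI runtime spec describes. The field names follow the spec; the values are invented examples:

```python
import json

# Illustrative subset of an OCI runtime-spec config.json.
# Field names follow the OCI runtime specification; the values are
# invented examples, not defaults.
oci_config = {
    "ociVersion": "1.0.2",
    "process": {
        "user": {"uid": 1000, "gid": 1000},        # which user to run as
        "args": ["/bin/my-app", "--serve"],        # executable on the rootfs to start
        "cwd": "/",
    },
    "root": {"path": "rootfs", "readonly": True},  # container root filesystem
    "linux": {
        "resources": {
            "cpu": {"quota": 200000, "period": 100000},  # i.e. 2 CPUs' worth
            "memory": {"limit": 512 * 1024 * 1024},      # 512 MiB
        }
    },
}

print(json.dumps(oci_config, indent=2))
```

An OCI runtime such as runc, or a secure-container runtime, consumes exactly this kind of document when asked to create a container.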

With this consensus as the foundation, we come to secure containers, another story of naming blood and tears. Back then my co-founder Zhao Peng named our technology "virtualized container," and we used the slogan "Secure as VM, Fast as Container" to get attention. From then on, people worried about container security simply called the technology "secure container," and the name stuck. In our minds, though, what this technology provides is an additional layer of isolation, which may mean one link in a security chain, but may also mean fault isolation, certain optimization possibilities, or other functions. In fact, it makes more sense to define a secure container as:

A runtime technology that provides a complete operating system execution environment (usually the Linux ABI) for container applications, but isolates the application's execution from the host operating system and keeps the application from directly accessing host resources, thereby providing additional protection between the host and the containers, and between containers.

Indirection: The essence of a secure container

The only solution to security problems is to accept that bugs will happen, and then block them with an extra layer of isolation. — Linus Torvalds, LinuxCon NA 2015

Why introduce an indirection layer for security? Because at the scale of today's mainstream host operating systems, such as Linux, it is impossible to verify in theory that a program is bug-free, and once the right bug is properly exploited, a security risk becomes a security incident. Neither frameworks nor frequent patching can guarantee security. Additional isolation and a reduced attack surface cannot guarantee the absence of vulnerabilities either, but combining them lowers the risk of a complete breach.
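A toy calculation illustrates the "extra layer" argument. Assuming (with entirely made-up numbers) that exploiting the host kernel and escaping the additional isolation layer are roughly independent events, the chance of a complete breach is the product of the two:

```python
# Toy illustration with made-up probabilities: if a complete breach requires
# exploiting a host kernel bug AND separately escaping the extra isolation
# layer, and the two events are roughly independent, the combined chance is
# the product of the individual chances.
p_kernel_bug_exploited = 0.01  # assumed chance per attack; illustrative only
p_isolation_escaped = 0.01     # assumed chance the extra layer also falls

p_single_layer = p_kernel_bug_exploited
p_two_layers = p_kernel_bug_exploited * p_isolation_escaped

# Neither layer is assumed perfect, yet the combined risk drops sharply.
print(f"one layer: {p_single_layer}, two layers: {p_two_layers}")
```

The numbers are fiction, but the multiplicative effect is the point: neither layer needs to be perfect for the combination to be far harder to break than either alone.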

Kata: cloud-native virtualization

We launched the Kata Containers secure container project at KubeCon in December 2017. The project's two predecessors, runV, which we launched, and Clear Containers, from Intel, were both released in May 2015 (yes, before Linus's quote above). The idea behind this family of projects is simple:

  1. The container mechanism of the operating system itself cannot solve the security problem; an additional isolation layer is required.
  2. The virtual machine is a ready-made isolation layer, and cloud services like AWS have convinced the world that the "secure of VM" is acceptable to customers.
  3. As long as there is a kernel inside the virtual machine to support the semantics of the OCI specification, running Linux applications on that kernel is not too hard to implement.
  4. Virtual machines may not yet be fast enough for a container environment. Can we also have the "speed of Container"?

If that last problem can be solved, we have the "secure container" we want, and that is Kata Containers. Today Kata Containers is typically used in a Kubernetes environment, where the kubelet drives runtime operations through the CRI interface via containerd or CRI-O. Image operations are handled by these CRI daemons, which on request write the runtime operations out as an OCI spec and hand it to an OCI runtime. Here, with containerd 1.2 and later and Kata Containers 1.5 and later, the typical flow is:

  • Each Pod has a shim-v2 process that performs the various runtime operations for containerd/CRI-O. This process shares the Pod's lifecycle and exposes a ttRPC interface to containerd/CRI-O;
  • shim-v2 provides isolation by starting a virtual machine for the Pod to serve as the PodSandbox. It runs a Linux kernel, usually trimmed down so that unnecessary devices are not supported;
  • The virtual machine used here can be QEMU or Firecracker. With Firecracker there are almost no emulated devices at all; with QEMU, configuration and patches shrink it as far as possible. Other supported hypervisors include ACRN and Cloud Hypervisor, the latter of which may see more and more use in the future;
  • The virtual machine may boot from an initrd with no operating system image at all, or carry a tiny operating system, but either way it is not configured as a general-purpose OS. It is simply the infrastructure supporting container applications, plus whatever is needed to collect metrics or trace and debug applications;
  • The container's rootfs is hot-plugged into the virtual machine after the sandbox starts, once the OCI request to create the container arrives from containerd/CRI-O. Preparing the container's rootfs can therefore proceed in parallel with the virtual machine's own startup;
  • Per CRI semantics and the OCI specification, a Pod can start multiple related containers; they are placed in the same virtual machine and can share namespaces with one another;
  • External volumes can be inserted into the PodSandbox as block devices or file-system shares, but to the containers in the Pod they all appear as mounted file systems. The file-system approach now gaining popularity is virtio-fs, designed precisely for scenarios like Kata: it is not only faster and more complete in its POSIX file-system support than the traditional 9pfs, but thanks to vhost-user and DAX it can also share the page cache for identical content across different Pods, making them behave like runC containers in this respect and saving precious memory;
  • For networking, a tc filter can directly support the various CNI plugins that provide the container network. The advantage is that this works naturally without any adjustment; the cost is some efficiency lost to the extra bridge hop. In production, one can also consider an enlightened network mode, using a special CNI plugin for more efficient access to the container network.
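The point about overlapping rootfs preparation with sandbox boot can be sketched with a toy timeline. This is illustrative Python with made-up sleep durations standing in for the real work (the actual shim-v2 is written in Go):

```python
import threading
import time

# Toy timeline showing why hot-plugging the rootfs after sandbox boot helps:
# VM startup and rootfs preparation can overlap. The sleep durations are
# made up and stand in for the real work.

def boot_sandbox_vm():
    time.sleep(0.3)   # stand-in for launching the VM and guest kernel

def prepare_rootfs():
    time.sleep(0.25)  # stand-in for pulling/unpacking the image snapshot

start = time.time()
vm = threading.Thread(target=boot_sandbox_vm)
fs = threading.Thread(target=prepare_rootfs)
vm.start()
fs.start()
vm.join()
fs.join()
elapsed = time.time() - start

# Overlapped, the total approaches max(0.3, 0.25) rather than 0.3 + 0.25.
print(f"elapsed: {elapsed:.2f}s")
```

Because the two steps overlap, total startup time approaches the longer of the two rather than their sum, which is part of how Kata pursues the "speed of Container."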

As you can see, Kata Containers is first and foremost a full-featured container runtime engine. Unlike a traditional virtual machine, what the engine presents is a container, and by not spending memory where it is unnecessary and sharing whatever memory can be shared, it reduces memory overhead. Less memory overhead means not only a smaller footprint but also lighter, faster startup. For most scenarios this achieves "secure of VM, speed of Container." Beyond security, compared with a traditional virtual machine it keeps more of a container's flexibility and less of a physical machine's operational feel, which is why we call the technology "cloud-native virtualization."

gVisor: process-level virtualization

Six months after Kata Containers, at KubeCon in Copenhagen, Google responded by open-sourcing gVisor, the secure container technology they had been developing internally for five years.

If Kata Containers is a combination and adaptation of existing isolation technologies to build an isolation layer between containers and the host, the design of gVisor is plainly more radical: gVisor is a user-mode operating system kernel, called Sentry, written from scratch in Go. It does not depend on a virtual machine; instead, a "Platform" mechanism makes the host redirect all of the application's system-level accesses back to Sentry, which then asks the host to perform the necessary operations on its behalf.

What gVisor implements is a pure application-oriented isolation layer: a "filter" from the Linux ABI to the Linux ABI. The advantage of a clean rewrite is that it need not bend to the constraints of the existing technology stack: it can be written to be lighter, it is certainly quicker to start, and in fact it is easier to scale its resources, in short, more container-like. Many friends in the OS community make no secret of preferring gVisor's architecture, if only it could solve some of the problems that are currently hard to solve.

As an isolation layer, gVisor's security rests on three things:

  • First, the attack surface is smaller: the host operating system serves only about 20% of the Linux system calls on behalf of sandboxed applications. gVisor's authors found through research that most attacks arrive through defects in uncommon system calls, whose implementation paths generally receive less review and hardening than the hot paths. gVisor is designed to stop most such attacks by ensuring that an application's use of infrequent system calls never lands on the host kernel at all;
  • Second, they determined that the most commonly attacked system call is open(), so the genuinely necessary open() calls are delegated to a dedicated process called the Gofer, which can be more easily restricted, audited, and controlled;
  • Finally, they wrote the kernel in Go, a high-level language, which of course brings better memory safety; they admit, though, that Go is not yet a truly "systems-level" language, that considerable tinkering was needed to get there, and that they contributed many changes back to the Go runtime along the way.
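The first point can be sketched as a simple allowlist filter. This is a toy model in Python; the syscall names in the allowlist are illustrative, not gVisor's actual policy:

```python
# Toy model of attack-surface reduction: only an allowlist of system calls
# is ever forwarded to the host kernel; everything else is handled by the
# user-space kernel. The names here are illustrative, not gVisor's policy.
HOST_ALLOWLIST = {"read", "write", "futex", "mmap", "exit_group"}

def forward_to_host(syscall_name: str) -> str:
    if syscall_name in HOST_ALLOWLIST:
        return f"forwarded {syscall_name} to host kernel"
    # Uncommon syscalls are emulated (Sentry-style) and never reach the
    # host implementation, so bugs there cannot be triggered from the sandbox.
    return f"emulated {syscall_name} in user space"

print(forward_to_host("read"))
print(forward_to_host("keyctl"))  # an uncommon syscall stays off the host
```

The design choice mirrors the article's argument: the host kernel's rarely-exercised code paths simply become unreachable from inside the sandbox.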

Of course, gVisor's architecture is elegant, but few companies other than Google are capable of re-implementing a kernel (Microsoft's original WSL being a comparable effort), and the advanced design comes with some practical problems:

  • First, it is not Linux, so there is a compatibility gap compared with solutions like Kata;
  • Second, under the current Linux system-call convention and CPU instruction architecture, every system-call interception carries considerable performance overhead. Full-stack optimization can alleviate this to some extent, but for workloads that make many system calls the performance loss cannot be ignored.

So gVisor cannot be the ultimate solution in the short term. It does, however, suit certain specific scenarios, and it offers insights that may influence the evolution of future operating systems and even CPU instruction sets, pushing us toward a more complete answer for secure containers.

Secure containers: more than just security

The isolation layer of a secure container keeps application problems, whether malicious attacks or accidental errors, from affecting the host machine or crossing between different Pods. In fact, the impact of this extra isolation layer goes beyond security: it also benefits scheduling, quality of service, and the protection of application data.

Traditional operating-system container technology is an extension of kernel process management: a container is itself a group of related processes, fully visible to the host scheduler, and all the containers and processes in a Pod are scheduled and managed by the host. In environments with very large numbers of containers, this places a heavy burden on the host kernel, and the cost of that burden has been observed in many real-world deployments. With a secure container, this fine-grained detail is no longer visible to the host; the isolation layer reduces the host's scheduling overhead and maintenance burden, and avoids quality-of-service interference between containers, and between containers and the host. On the other hand, the secure container acts as a barrier that prevents the host's operations and maintenance activities from directly touching application data, so users' application data can be protected inside the sandbox, reducing the authorizations users must grant and preserving their data privacy.

Looking ahead, we can see that secure containers will do more than security isolation. Because of the isolation layer, the secure container's kernel is independent of the host's and dedicated to serving the application. From this perspective, a rational re-division and optimization of functionality between host and application shows promising potential: in the future, secure containers may not only lower the cost of isolation but even improve application efficiency.

Isolation makes cloud-native infrastructure more complete.

Kata Containers open-source project: katacontainers.io/

Related live-stream recommendation: "Cloud-Native Isolation Has to Be Said"

The live webcast SOFAChannel#16, "Cloud-Native Isolation Has to Be Said," will invite Peng Tao (nickname: Bader), a Kata Containers maintainer, to bring us closer to Kata Containers, the cloud-native infrastructure discussed in this article, and to share how it solves container isolation in a cloud-native architecture.

He will cover the following aspects:

  • Starting from the Kubernetes Pod: what a Pod shares in the classic container scenario;
  • The problems of a shared kernel, and their solutions;
  • And God said, let there be light; we say, let there be Kata;
  • The speed of containers, the security of VMs;
  • A grand tour of Kata Containers features;
  • What? You said something about adding a VM indirection layer?

SOFAChannel#16: "Cloud-Native Isolation Has to Be Said"

Live time: 2020/5/21 (Thursday) 19:00-20:00

Registration: Click “here” to register

Financial-Grade Distributed Architecture (Antfin_SOFA)