This article is based on a speech delivered by Ouyang Jian, technical director of the Meituan Infrastructure Department/Container R&D Center, at QCon 2018.

Background

Meituan's container cluster management platform is called HULK. Marvel's Hulk grows more powerful when he gets angry, which resembles a container's "elastic scaling" capability, so we named the platform HULK. Other companies appear to have container platforms with the same name; any resemblance is purely coincidental.

In 2016, Meituan began to adopt containers. By then Meituan already operated at considerable scale, and many systems predated containers, including the CMDB, service governance, monitoring and alerting, the publishing platform, and so on. As we explored container technology, it was impractical to abandon these assets, so the first step of containerization was to integrate the container life cycle with these platforms, covering operations such as applying for/creating, deleting/releasing, publishing, and migrating containers. We then verified the feasibility of containers as the runtime environment for core online services.

In 2018, after two years of operation and practical exploration, we upgraded the container platform to HULK 2.0:

  • Upgraded the OpenStack-based scheduling system to Kubernetes (hereafter referred to as K8s), the de facto standard in container orchestration.
  • Provided richer and more reliable container elasticity policies.
  • Optimized and polished the underlying system to address problems encountered earlier.

Meituan's container usage today: there are more than 3,000 online services and more than 30,000 container instances, and many core-link services with high concurrency and low latency requirements have been running stably on HULK. This article mainly introduces some of our practices in container technology, focusing on the optimization and polishing of the underlying system.

The basic architecture of the Meituan container platform

First, let me introduce the basic architecture of the Meituan container platform. Most container platform architectures are broadly similar.

First of all, the container platform connects with service governance, the publishing platform, the CMDB, monitoring and alerting, and other systems. Through these integrations, containers provide the same experience as virtual machines: developers can use containers the same way they use VMs, without changing their habits.

In addition, containers provide elastic scaling: the number of container instances of a service can be increased or decreased dynamically according to elasticity policies, so that the service's processing capacity adjusts dynamically. There is also a special module, the "service portrait". Its main job is to collect and aggregate the runtime metrics of a service's container instances so that container scheduling and resource allocation can be done better. For example, from the CPU, memory, and IO usage of a service's container instances we can determine whether the service is compute-intensive or IO-intensive, and try to place complementary containers together at scheduling time. As another example, if we know that each container instance of a service normally runs about 500 processes, we create the container with a reasonable process limit (say, a maximum of 1,000 processes) to prevent the container from consuming too many system resources when something goes wrong. If a container of this service suddenly tries to create 20,000 processes at runtime, we have good reason to believe it has hit a bug; the earlier resource constraint contains the damage, and an alarm notifies the service owner to handle it in time.
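
As a concrete illustration of this kind of limit (not HULK's actual implementation), the sketch below caps how many processes a container may create by writing to the pids cgroup controller; the cgroup path and the limit of 1,000 are hypothetical, and the pids controller is only available on kernels that support it.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical cgroup path for one container; the real path depends on
// the container runtime and cgroup driver in use.
const containerPidsCgroup = "/sys/fs/cgroup/pids/docker/<container-id>"

// limitContainerPids caps how many processes/threads the container may create.
// If a buggy service tries to fork thousands of processes, fork() starts
// failing once pids.current reaches pids.max, protecting the host.
func limitContainerPids(cgroupPath string, maxPids int) error {
	target := filepath.Join(cgroupPath, "pids.max")
	return os.WriteFile(target, []byte(fmt.Sprintf("%d\n", maxPids)), 0644)
}

func main() {
	// A service observed to run ~500 processes gets roughly 2x headroom.
	if err := limitContainerPids(containerPidsCgroup, 1000); err != nil {
		fmt.Fprintln(os.Stderr, "failed to set pids limit:", err)
		os.Exit(1)
	}
}
```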

The next layer is container orchestration and image management. Container orchestration addresses the dynamic side of containers, including when containers are created, where they are created, when they are deleted, and so on. Image management addresses the static side, including how container images are built, how they are distributed, where they are distributed, and so on.

The lowest layer is the container runtime. Meituan uses the mainstream Linux + Docker container solution; HULK Agent is our management agent running on each server.

Expanding the container runtime layer, the architecture diagram reads from bottom to top:

  • At the lowest level are basic physical resources such as CPU, memory, disk and network.
  • At the next level up, we use CentOS 7 as the host operating system, with Linux kernel 3.10. On top of the default CentOS kernel we added new features developed by Meituan for container scenarios and tuned some kernel parameters for high-concurrency, low-latency services.
  • At the next level up, we use the Docker that ships with the CentOS distribution, currently version 1.13, again with some of our own features and enhancements. HULK Agent is the host management agent we developed ourselves. Falcon Agent runs both on the host and inside containers; it collects basic monitoring metrics of the host and containers and reports them to the backend and the monitoring platform.
  • The top layer is the container itself. We currently support CentOS 6 and CentOS 7 containers. For CentOS 6 there is a container-init process we developed, which is the first process inside the container; it initializes the container and then starts the business processes. For CentOS 7, we use systemd as PID 1 in the container. Our containers support all the major programming languages, including Java, Python, Node.js, C/C++, and more. Above the language layer sit various proxy services, including the service governance agent, logging agent, encryption agent, and so on. Our containers also support Meituan's internal business environments, such as set information and swimlane information; combined with the service governance system, this enables intelligent routing of service invocations.

Meituan mainly uses open-source components from the CentOS family because we believe Red Hat has strong open-source engineering capability, and we hope its distribution has already solved most system problems for us, rather than using upstream community versions directly. Yet we found that even with CentOS components deployed, we still run into problems that neither the community nor Red Hat has addressed. To some extent this shows that large domestic Internet companies have reached a world-leading level in the scale and complexity of their application scenarios, which is why they hit these problems before the community and Red Hat's other customers do.

Some problems encountered by containers

In container technology itself, we encountered problems in four main areas: isolation, stability, performance, and promotion.

  • Isolation has two levels: first, whether a container can correctly understand its own resource allocation; second, whether containers running on the same server affect each other. For example, if one container's IO is very high, the service latency of other containers on the host increases.
  • Stability: system functions become unstable under high pressure, large scale, and long-running operation. For example, containers cannot be created or deleted, processes hang, or the system crashes.
  • Performance: when virtualization and container technology are compared, containers are generally believed to execute more efficiently. In practice, however, we hit special cases where the same code, running in a container, showed worse throughput and response latency than in a virtual machine.
  • Promotion: even after the problems above are basically solved, businesses may still be reluctant to use containers. Part of the reason is technical, such as the difficulty of adopting containers and the maturity of the surrounding tools and ecosystem, all of which affect the cost of using containers. But promotion is not a purely technical issue; it is closely tied to the company's business development stage, technical culture, organizational structure, KPIs, and other factors.

Container implementation

A container is essentially a group of related processes serving the same business goal, running in their own namespaces. Processes in the same namespace can communicate with each other but cannot see processes in other namespaces. Each namespace can have its own hostname, process ID space, IPC, network, file system, users, and other resources. In a way this achieves a simple form of virtualization: a host can run multiple "systems" at the same time without them being aware of one another.

In addition, to limit a namespace's use of physical resources, the CPU, memory, and other resources available to its processes must be restricted. That is what cgroups (control groups) do. For example, the "4C4G" container we often talk about simply means that the processes in that container's namespaces can use at most 4 CPU cores and 4 GB of memory.

In short, the Linux kernel provides namespaces for isolation and cgroups for resource limits. Namespaces plus cgroups form the underlying technology of containers (rootfs is the file-system layer of the container).
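
To make the namespace-plus-cgroup idea concrete, here is a minimal, Linux-only Go sketch of the well-known "container from scratch" pattern: the parent re-executes itself in new UTS, PID, and mount namespaces, and the child sets its own hostname and runs a command as PID 1. It illustrates the kernel mechanism only; it is not how Docker or HULK is implemented, and cgroup limits and rootfs setup are omitted.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// Usage: sudo go run main.go run /bin/sh
// The shell started in the child sees its own hostname and is PID 1 in its
// namespace, while the host is unaffected: namespaces give the isolation,
// and cgroups (not shown) would add the resource limits.
func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		os.Exit(1)
	}
	switch os.Args[1] {
	case "run":
		parent()
	case "child":
		child()
	}
}

func parent() {
	// Re-exec ourselves as "child" inside new UTS, PID and mount namespaces.
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}

func child() {
	// Inside the new namespaces: a private hostname, and this process is PID 1.
	if err := syscall.Sethostname([]byte("toy-container")); err != nil {
		panic(err)
	}
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```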

Meituan's solutions, improvements, and optimizations

Isolation

I had worked with virtual machines before I touched containers, and I was surprised to find that the CPU and memory information visible inside a container was that of the host server, not of the container's own configuration. To this day the community-edition container still behaves this way: inside a 4C4G container you may see 40 CPUs and 196 GB of memory, which are actually the host's resources. This is like the container having an "inflated self-image": it believes it has capabilities it does not actually have, which can cause real problems.

The figure above is an example of memory information isolation. When system memory information is requested, the community Linux kernel returns the host's memory information whether the request comes from the host or from inside a container. If an application in the container sizes itself according to the memory it discovers, the actual resources are far from sufficient and the system soon hits OOM exceptions.

With our kernel change, when memory information is requested from inside a container, the kernel returns the container's own memory information based on its cgroup settings (similar to what LXCFS does).
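
The gap can be seen from user space without any kernel change: the MemTotal line in /proc/meminfo reports the host's memory, while the container's real budget lives in its memory cgroup. The sketch below reads both (the cgroup v1 path shown is a common default and may differ per system); LXCFS-style solutions and Meituan's kernel change make the first source agree with the second.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// memTotalFromProc returns the MemTotal line from /proc/meminfo, which on an
// unpatched kernel reflects the *host's* memory even inside a container.
func memTotalFromProc() string {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return "unavailable: " + err.Error()
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "MemTotal:") {
			return s.Text()
		}
	}
	return "MemTotal: not found"
}

// memLimitFromCgroup reads the memory limit of the current cgroup, which is
// the container's actual budget (e.g. 4 GiB for a "4C4G" container).
func memLimitFromCgroup() string {
	b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return "unavailable: " + err.Error()
	}
	return strings.TrimSpace(string(b)) + " bytes"
}

func main() {
	fmt.Println("/proc/meminfo says: ", memTotalFromProc())
	fmt.Println("memory cgroup says: ", memLimitFromCgroup())
}
```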

CPU information isolation is implemented similarly to memory isolation; here is an example of how the perceived number of CPUs affects application performance.

As you all know, JVM GC (garbage collection) affects the execution performance of Java programs. By default the JVM computes the number of parallel GC threads with the formula ParallelGCThreads = (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8), where ncpus is the number of CPUs the JVM discovers. Once the JVM inside a container discovers the host's CPU count (usually far higher than the container's actual CPU limit), it starts far too many GC threads, which directly degrades GC performance: Java services see higher latency, more spikes in the TP monitoring curves, and lower throughput. There are several ways to solve this problem (a worked example of the formula follows the list below):

  • Explicitly pass the JVM startup parameter -XX:ParallelGCThreads to tell the JVM how many parallel GC threads to start. The disadvantage is that every business must be aware of it and pass different JVM parameters for containers of different sizes.
  • Hack glibc inside the container so that the JVM (via sysconf) obtains the container's correct CPU count. We did this for a while. The advantage is that services do not need to be aware of anything and automatically adapt to containers of different sizes. The disadvantages are that you must use the modified glibc, which carries upgrade and maintenance costs, and the problem remains if a native glibc is used.
  • On the new platform we improved the kernel so that the correct CPU count is returned inside the container, making the fix transparent to businesses, images, and programming languages (similar issues also affect the performance of OpenMP, Node.js, and other runtimes).
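
As a worked example of why the CPU count matters, the sketch below evaluates the JVM's default formula for two inputs: seen from a 48-logical-CPU host it yields 33 parallel GC threads, while the 4 cores a 4C4G container is actually entitled to would yield only 4.

```go
package main

import "fmt"

// parallelGCThreads mirrors the JVM default:
// ncpus <= 8 ? ncpus : 3 + ncpus*5/8 (integer arithmetic).
func parallelGCThreads(ncpus int) int {
	if ncpus <= 8 {
		return ncpus
	}
	return 3 + ncpus*5/8
}

func main() {
	// What the JVM computes when it "sees" the whole 48-CPU host:
	fmt.Println("host view (48 CPUs):    ", parallelGCThreads(48)) // 33 GC threads
	// What it should compute for a 4C4G container:
	fmt.Println("container view (4 CPUs):", parallelGCThreads(4)) // 4 GC threads
}
```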

For a time our containers ran with root privileges, implemented by adding the --privileged=true flag to docker run. Such broad privileges let a container see the disks of all containers on the host, causing both security and performance problems. The security problem is easy to understand, but why performance? Imagine every container scanning disk state at the same time. Excessive privilege also shows up elsewhere: mount operations can be performed at will, the NTP time can be changed arbitrarily, and so on.

In the new version we removed root privileges from containers and found some side effects, such as certain system calls failing. By default we now grant containers only the additional SYS_PTRACE and SYS_ADMIN capabilities, so that GDB works and the hostname can be set. If a special container needs more privileges, they can be configured on our platform at service granularity.
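
For reference, on a Kubernetes-based platform such as HULK 2.0 the same idea, no privileged mode plus just the two extra capabilities, can be expressed in a container's security context. The sketch below uses the standard Kubernetes API types and is a generic illustration rather than HULK's actual per-service configuration; the image name is made up.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// defaultSecurityContext builds a non-privileged security context that still
// allows gdb (SYS_PTRACE) and setting the hostname (SYS_ADMIN); services that
// need more would opt in per service on the platform side.
func defaultSecurityContext() *corev1.SecurityContext {
	privileged := false
	return &corev1.SecurityContext{
		Privileged: &privileged,
		Capabilities: &corev1.Capabilities{
			Add: []corev1.Capability{"SYS_PTRACE", "SYS_ADMIN"},
		},
	}
}

func main() {
	// Attach the security context to a container spec in a pod template.
	_ = corev1.Container{
		Name:            "app",
		Image:           "registry.example.com/base/centos7:latest", // hypothetical image
		SecurityContext: defaultSecurityContext(),
	}
}
```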

Linux has two kinds of IO: direct IO and buffered IO. Direct IO writes data straight to disk; buffered IO first writes data to the page cache and flushes it to disk later. Most scenarios use buffered IO.

We use the Linux 3.x kernel. In the community version, the buffered IO of all containers shares one kernel cache; the cache is not isolated and there is no rate limiting, so a container with heavy IO can easily affect the other containers on its host. Buffered IO cache isolation and throttling are greatly improved in Linux 4.x through cgroup v2. We implemented the same capability in our 3.10 kernel: each container gets a proportional share of the IO cache based on its own memory, and the rate at which cached data is written back to disk is limited by the container's cgroup IO configuration.
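
The upstream mechanism referred to here is the cgroup v2 io controller. The sketch below shows what a per-container IO throttle looks like through that interface; the cgroup path, device numbers, and limits are made up, and Meituan's 3.10 backport exposes equivalent knobs rather than this exact file.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setIOMax throttles a cgroup (v2) to the given read/write byte rates on one
// block device, e.g. major:minor "8:0" for /dev/sda. Writeback of the cgroup's
// dirty pages is then constrained by roughly the same write limit.
func setIOMax(cgroupPath, device string, rbps, wbps uint64) error {
	line := fmt.Sprintf("%s rbps=%d wbps=%d\n", device, rbps, wbps)
	return os.WriteFile(filepath.Join(cgroupPath, "io.max"), []byte(line), 0644)
}

func main() {
	// Hypothetical container cgroup; limit it to 100 MB/s reads, 50 MB/s writes.
	cg := "/sys/fs/cgroup/hulk/service-a/container-1"
	if err := setIOMax(cg, "8:0", 100<<20, 50<<20); err != nil {
		fmt.Fprintln(os.Stderr, "failed to set io.max:", err)
		os.Exit(1)
	}
}
```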

Docker itself supports quite a few cgroup resource limits for containers, but K8s exposes only a subset of them when it invokes Docker. To reduce interference between containers, we set different resource limits for the containers of different services based on the resource allocation in the service portrait. Besides the common CPU and memory limits, there are IO limits, ulimit limits, PID limits, and so on. We extended K8s to support all of these.

Core dump files are generated in the course of running containers, for example when a C/C++ program accesses memory out of bounds, or when the system kills a process that uses too much memory during OOM handling.

In the community container stack, core dump files are by default written on the host. Some core dumps are quite large, for example JVM core dumps are usually several GB, and a buggy program may dump repeatedly; this produces heavy disk IO that affects other containers. Another problem is that the owner of the business container has no access to the host machine and therefore cannot retrieve the dump file for further analysis.

To address this, we changed the core dump flow so that the dump file is written to the container's own file system and counted against the container's own cgroup IO limits.
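
A common way to implement "dump into the container's own file system" is to register a pipe handler in kernel.core_pattern and let the handler resolve the crashing process's root via /proc/&lt;pid&gt;/root. The sketch below illustrates that general approach, assuming the host-side PID is passed to the handler; it is not Meituan's exact implementation.

```go
package main

// Registered on the host roughly as:
//   kernel.core_pattern = |/usr/local/bin/core-helper %P %e
// (%P, the host-side PID, is assumed to be available on the kernel in use;
//  %e is the executable name). The kernel pipes the core image to stdin.

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: core-helper <host-pid> <comm>")
		os.Exit(1)
	}
	hostPID, comm := os.Args[1], os.Args[2]

	// /proc/<pid>/root is the crashing process's root file system. For a
	// container process that is the container's own rootfs, so the core file
	// lands inside the container and its owner can fetch it without host access.
	dest := filepath.Join("/proc", hostPID, "root", "tmp",
		fmt.Sprintf("core.%s.%s", comm, hostPID))

	out, err := os.Create(dest)
	if err != nil {
		os.Exit(1)
	}
	defer out.Close()

	// The kernel streams the core dump on stdin.
	if _, err := io.Copy(out, os.Stdin); err != nil {
		os.Exit(1)
	}
}
```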

Stability

In practice we found that the Linux kernel and Docker are the main factors affecting system stability. Although both are reliable system software in their own right, they still have bugs that surface in large-scale, high-intensity scenarios. This again shows that Chinese Internet companies operate at world-leading levels of application scale and complexity.

On the kernel side, Meituan found an implementation problem in the kernel 4.x buffered IO throttling, which was confirmed and fixed by the community. We also followed up a series of Ext4 patches for CentOS, which resolved a period of frequent process hangs.

We hit two key stability issues with Red Hat Docker:

  • After the Docker service was restarted, docker exec could not enter the container; this turned out to be a complicated problem. Before the fix we used nsenter instead of docker exec and reported the issue to Red Hat, who addressed it in an update earlier this year: access.redhat.com/errata/RHBA…

  • Under certain conditions the Docker daemon would panic and containers could not be deleted. After debugging it ourselves and comparing against the latest code, we found the problem had already been fixed in Docker upstream; we reported it to Red Hat and it was quickly resolved: github.com/projectatom…

For open-source system software such as the kernel, Docker, and K8s, one point of view holds that we do not need to analyze problems ourselves and only need to pick up the latest updates from the community. We disagree. We believe the technical team's own capability matters, for the following reasons:

  • Meituan's application scale is large and its scenarios are complex; many problems may not yet have been encountered by others, so we cannot passively wait for someone else to solve them.
  • Some real business problems or requirements, such as returning the correct CPU count inside the container, may not be considered important, or even considered the right idea, by the community, and may never be addressed.
  • Often the community only fixes a problem in the latest upstream, which may be unstable, and even backporting the fix to the version we use is hard to schedule and guarantee.
  • The community releases many patches, often with terse descriptions; without a deep understanding of the problem, it is difficult to connect the issues we actually hit with a particular series of patches.
  • For some complex problems, the community's solution may not suit our scenarios, and we need the ability to judge and make trade-offs ourselves.

When Meituan tackles problems in open-source systems, we usually go through five stages: investigating on our own, developing our own solution, watching the community, interacting with the community, and finally contributing back to the community.

Performance

Container platform performance mainly includes two aspects:

  • Performance of business services running on containers.
  • Performance of container operations (create, delete, and so on).

The figure above is an example of our CPU allocation. Our mainstream server is a dual-socket machine with 24 physical cores, i.e. two NUMA nodes with 12 cores each, giving 48 logical CPUs with hyper-threading. This is a typical NUMA architecture: each node has its own memory, and a CPU in one node accesses its local memory much faster than the other node's memory.

In the past we encountered network interrupts concentrated on CPU0, which could cause higher network latency or even packet loss under heavy traffic. To guarantee network processing capacity, we reserve eight logical CPUs on Node 0 to handle network interrupts and host system tasks, such as CPU-heavy work like image decompression. These eight logical CPUs do not run any container workload.

In container scheduling, we try not to allocate a container's CPUs across NUMA nodes. Practice shows that cross-node memory access hurts application performance badly; in some compute-intensive scenarios, keeping a container inside a single node raises throughput by more than 30%. Per-node allocation also has a downside: it increases CPU fragmentation. To use CPU resources more efficiently, the real system lets the containers of CPU-insensitive services span nodes, based on the information from the service portrait.
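
At the cgroup level, keeping a container inside one NUMA node looks roughly like the sketch below: the container's cpuset is restricted to that node's CPUs and to that node's memory, so neither compute nor memory traffic crosses nodes. The paths and CPU ranges are illustrative; on the real platform the scheduler makes this decision using the service portrait.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// bindToNode pins a container's cgroup to one NUMA node: cpuset.cpus limits
// which logical CPUs it may run on, cpuset.mems limits which node's memory
// it may allocate from, so there is no cross-node memory access.
func bindToNode(cgroupPath, cpus string, node int) error {
	if err := os.WriteFile(filepath.Join(cgroupPath, "cpuset.cpus"), []byte(cpus), 0644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(cgroupPath, "cpuset.mems"), []byte(fmt.Sprint(node)), 0644)
}

func main() {
	// Hypothetical 4-core container placed entirely on node 1.
	// (Node 0's first CPUs are reserved for network interrupts and host tasks.)
	cg := "/sys/fs/cgroup/cpuset/hulk/service-b/container-7"
	if err := bindToNode(cg, "12-15", 1); err != nil {
		fmt.Fprintln(os.Stderr, "cpuset binding failed:", err)
		os.Exit(1)
	}
}
```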

The figure above compares the TP latency lines of a real service before and after the CPU allocation optimization. The TP999 line dropped by an order of magnitude, and all the lines became more stable.

Performance optimization: File system

The first step of file-system performance optimization is selection. Based on the applications' read/write characteristics, we chose the Ext4 file system (more than 85% of file operations are on files smaller than 1 MB).

The Ext4 file system has three journaling modes:

  • journal: both metadata and data are written to the journal before the data is written in place.
  • ordered: only metadata is journaled, and the data is guaranteed to reach the disk before its metadata is journaled.
  • writeback: only metadata is journaled, with no guarantee that data reaches the disk before the metadata.

We chose writeback mode (ordered is the default), which is the fastest of the journaling modes; its drawback is that data may not be recoverable after a failure. Most of our containers are stateless, so when one fails we simply bring up another on a different machine; between performance and data safety we chose performance. The container also offers an optional memory-based file system (tmpfs) to speed up services that read and write a large number of temporary files.
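
A minimal sketch of the two mount decisions, using the mount(2) wrapper from golang.org/x/sys/unix: an Ext4 volume mounted with data=writeback for container storage, and an optional tmpfs for heavy temporary file IO. The device path, mount points, and tmpfs size are made up.

```go
package main

import (
	"golang.org/x/sys/unix"
)

func main() {
	// Ext4 with the writeback journaling mode: metadata-only journaling and no
	// ordering guarantee between data and metadata -- fastest, acceptable here
	// because the containers are stateless.
	if err := unix.Mount("/dev/sdb1", "/data/containers", "ext4",
		unix.MS_NOATIME, "data=writeback"); err != nil {
		panic(err)
	}

	// Optional memory-backed file system for services that churn temporary files.
	if err := unix.Mount("tmpfs", "/data/containers/c1/tmp", "tmpfs",
		0, "size=2g"); err != nil {
		panic(err)
	}
}
```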

As shown in the figure above, creating a virtual machine inside Meituan takes at least three steps and more than 300 seconds on average, whereas creating a container from an image takes 23 seconds on average. The agility and speed of containers are obvious.

The 23-second average expansion time already reflects optimizations in many areas, such as the expansion workflow, image distribution, initialization, and service startup. The rest of this section focuses on the image distribution and decompression optimizations we made.

The figure above shows the overall architecture of Meituan container image management, which has the following features:

  • Multiple sites exist.
  • Cross-site synchronization is supported; whether an image needs to be synchronized across sites is determined by its label.
  • Each site keeps image backups.
  • Each site has a P2P network for image distribution.

Image distribution is an important factor in container expansion time.

  • Cross-site synchronization: ensures that a server can always pull images from the nearest image repository during expansion, reducing pull time and cross-site bandwidth consumption.
  • Base image pre-distribution: Meituan's base images are shared images used to build service images, usually several hundred MB in size; the business image layer contains the application code and is usually much smaller than the base image. During expansion, if the base image is already local, only the business image needs to be pulled, which greatly speeds up expansion. To achieve this, we pre-distribute the base images to all servers.
  • P2P image distribution: in scenarios such as base image pre-distribution, thousands of servers pull images from the image repository at the same time, putting great pressure on the repository service and the network bandwidth. We therefore built P2P image distribution: a server can fetch image chunks not only from the image repository but also from other servers.

As the figure above shows, the original distribution time grows rapidly as the number of receiving servers increases, while the P2P distribution time stays essentially flat.

Docker image pulling downloads layers in parallel but decompresses them serially. To improve decompression speed, we did some optimization work at Meituan.

For decompressing a single layer, we replaced Docker's default single-threaded decompression with a parallel algorithm, implemented by replacing gzip with pgzip.
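
As a sketch of that swap: github.com/klauspost/pgzip is a largely drop-in replacement for compress/gzip that pipelines decompression and read-ahead across multiple goroutines, so a single layer no longer decompresses on one core. The file names, block size, and goroutine count below are illustrative.

```go
package main

import (
	"io"
	"os"
	"runtime"

	"github.com/klauspost/pgzip"
)

// decompressLayer unpacks one gzip-compressed layer tarball stream to dst,
// using multiple goroutines instead of Docker's default single-threaded gzip.
func decompressLayer(compressed io.Reader, dst io.Writer) error {
	// 1 MB blocks, read ahead with one goroutine per CPU (illustrative tuning).
	zr, err := pgzip.NewReaderN(compressed, 1<<20, runtime.NumCPU())
	if err != nil {
		return err
	}
	defer zr.Close()
	_, err = io.Copy(dst, zr)
	return err
}

func main() {
	in, err := os.Open("layer.tar.gz") // hypothetical compressed layer blob
	if err != nil {
		panic(err)
	}
	defer in.Close()
	out, err := os.Create("layer.tar") // decompressed tar, merged into rootfs later
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if err := decompressLayer(in, out); err != nil {
		panic(err)
	}
}
```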

Docker images are layered, and combining the layers is a serial operation: decompress one layer, merge it, decompress the next layer, merge it, and so on. In fact only the merging has to be serial; the decompression can run in parallel. So we decompress multiple layers in parallel, store the decompressed data in temporary storage, and then merge the layers sequentially according to their dependencies. This change alone (decompressing all layers into temporary space in parallel) nearly doubled disk IO and still kept decompression from being fast enough, so we switched to a memory-based ramdisk for the temporary files, removing the extra file-write overhead. Having done the above, we also found that the number of layers in an image affects download and decompression time. The figure above shows our simple test results: parallel decompression greatly reduces decompression time regardless of how many layers the image has, and the benefit is especially large for images with many layers.

Promotion

The first step in promoting containers is being able to articulate their advantages. We believe containers have the following advantages:

  • Lightweight: containers are small and fast, and can start in seconds.
  • Application distribution: containers are distributed as images, so development, test, and production containers have exactly the same configuration.
  • Elasticity: capacity can be expanded rapidly based on metrics such as CPU and memory usage, QPS, and latency.

The combination of these three features brings greater flexibility and lower computing costs to the business.

Because the container platform itself is a technology product whose customers are the RD teams of the various business lines, we need to consider the following factors:

  • Product advantages: promoting the container platform is, to some extent, a ToB business in its own right; it starts with a good product that has clear advantages over the previous solution (virtual machines).
  • Integration with existing systems: the product should integrate well with the customer's existing systems rather than forcing the customer to throw everything away and start from scratch.
  • Native application development platform and tools: the product should be easy to use and come with a tool chain that works well together.
  • Smooth migration from virtual machines to containers: it is best to provide an easy-to-execute migration path from the legacy product to the new one.
  • Work closely with application RD teams: provide good customer support, even when a problem is not caused by the product itself.
  • Resource tilt: support the disruptive new technology at the strategic level by tilting resources toward the container platform, for example by no longer allocating VM resources unless there is a sufficient reason.

Conclusion

Docker containers orchestrated with Kubernetes are one of the mainstream container cloud practices, and HULK, Meituan's container cluster management platform, adopts this scheme as well. This article shared some of Meituan's exploration and practice in container technology, mainly covering our optimization work on the Linux kernel, Docker, and Kubernetes in the Meituan container cloud, as well as some thoughts on driving the containerization process. You are welcome to discuss and exchange ideas with us.

About the author

Ouyang Jian graduated from Tsinghua University in 2006 and has 12 years of experience in data center development and management. He previously worked as a Staff Engineer at VMware China, CTO of Matchless Technology, and Chief Architect of Zhongke Ruiguang. He is now the technical director of the Meituan Infrastructure Department/Container R&D Center, responsible for Meituan's containerization work.

Recruitment information

The Meituan-Dianping infrastructure team is hiring senior engineers and technical experts in Java, based in Beijing and Shanghai. We are the group's core team dedicated to building company-level, industry-leading infrastructure components, covering distributed monitoring, service governance, high-performance communication, message-oriented middleware, basic storage, containerization, cluster scheduling, and more. Interested candidates are welcome to send their resumes to [email protected].