About the authors

Leon Li, senior development engineer at Tencent, focuses on container storage and image storage. He is currently responsible for the design and development of the Tencent container image service and the image storage acceleration system.

Li Zhiyu, backend development engineer at Tencent Cloud. Responsible for Tencent Cloud TKE cluster node and runtime work, including customized development and troubleshooting of containerd, Docker and other container runtime components.

Hong Zhiguo, Tencent Cloud architect, responsible for the development of basic components of the TKE product such as the container runtime, K8s, container networking and the Mesh data plane.

Background

In the context of containerized services, different scenarios place different demands on container startup. In offline computing, and in online services that need to scale up computing resources rapidly (scaling groups), the requirements on container startup speed are high.

Image pulls often account for 70% or more of the entire container startup cycle. According to our statistics, one offline computing service takes up to 40 minutes to scale out thousands of Pods each time because of its large container image. Image distribution has become a major obstacle to rapid elastic scaling of containers.

ImageApparate (phantom)

To solve this problem, the Tencent Cloud Container Service (TKE) team developed ImageApparate (phantom), a next-generation image distribution solution that improves the speed of large-scale image distribution by 5 to 10 times.

In response to the problems caused by the existing Docker image download model, community discussion of new schemes mainly focuses on lazy pulling of image data, and on new image formats whose minimum unit is no longer the layer but a chunk or an individual file in the image.

But OCI V2 is still a long way off, so how do we deal with this kind of scenario right now?

Back to the problem itself: under the current OCI V1 and container runtime interaction logic, the complete image must be downloaded before the container can run, yet the container only uses part of the image content during startup and runtime. The FAST '16 paper Slacker measured how much data common official images on DockerHub actually read when used, and concluded that on average only 6.4% of the content needs to be read. In other words, most of the content in an image may never be needed during the container's lifetime. If we load only that ~6% of the data, we can greatly reduce the image pull time and thus speed up container startup, which provides the theoretical premise for the subsequent optimization.

Therefore, the key to reducing container startup time is to make the rootfs of the container image available without first downloading the whole image, loading its content on demand instead.

Based on this premise, TCR launched the ImageApparate container image acceleration service within a framework compatible with OCI V1. In a concurrent pull test on 200 nodes, the image content actually read was only 5% to 10% of the total image size, and, consistent with the analysis above, ImageApparate improved the total container startup time by 5 to 10 times compared with the traditional method of downloading the full image. The tests do not just measure container creation time, but the total time from container start until the business process can provide service:

  • Sequential read of a large file: time from container start to finishing a sequential read of a 500MB file
  • Random reads of small files: time from container start to completing 1000 random reads of 4KB-16KB
  • Running a Python program: time from container start to loading the Python interpreter and finishing a simple piece of Python code
  • Running a GCC compilation: time from container start to compiling a simple piece of C code with GCC and running it

ImageApparate solution design

Problems with the traditional model

Since the release of Docker, huge changes have taken place in cloud computing, and traditional virtual machines have gradually been replaced by containers. Following the concept of Build, Ship and Run, Docker delivered an excellent design for the container runtime and the container image, leading the entire container industry. However, as time passed, the Ship and Run sides of containers have gradually exposed problems in the face of a wide range of user scenarios.

The traditional container startup and image download flow is as follows (a minimal sketch of steps 1 and 2 follows the list):

  1. Access the image repository service to obtain permission authentication and the image storage address
  2. Access the image storage address over the network to download all image layers and decompress them
  3. Based on the image's layer information, mount all layers with a union filesystem to form the rootfs, then create and start the container on top of this filesystem
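To make steps 1 and 2 concrete, here is a minimal sketch of the registry interaction a traditional client performs, using the Docker Registry HTTP API v2 against a public registry. It is only an illustration of the standard API, not TCR-specific code, and error handling is trimmed.

```go
// Minimal sketch of steps 1-2 of the traditional flow: obtain a pull token
// from the registry's auth service, then fetch the image manifest that lists
// every layer blob a traditional client must download and unpack in full.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	repo, tag := "library/nginx", "1.9"

	// Step 1: permission authentication - request a pull-scoped bearer token.
	var token struct {
		Token string `json:"token"`
	}
	resp, err := http.Get("https://auth.docker.io/token?service=registry.docker.io&scope=repository:" + repo + ":pull")
	if err != nil {
		panic(err)
	}
	json.NewDecoder(resp.Body).Decode(&token)
	resp.Body.Close()

	// Step 2: fetch the manifest from the registry.
	req, _ := http.NewRequest("GET", "https://registry-1.docker.io/v2/"+repo+"/manifests/"+tag, nil)
	req.Header.Set("Authorization", "Bearer "+token.Token)
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var manifest struct {
		Layers []struct {
			Digest string `json:"digest"`
			Size   int64  `json:"size"`
		} `json:"layers"`
	}
	json.NewDecoder(resp.Body).Decode(&manifest)
	for _, l := range manifest.Layers {
		// Each blob would then be fetched from /v2/<repo>/blobs/<digest> and unpacked.
		fmt.Println(l.Digest, l.Size)
	}
}
```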

The container image design has been in use since the release of Docker and has become the de facto standard we use today, OCI V1. Its layered design dramatically reduces space usage: union filesystems (AUFS, overlayfs) mount the layers on top of each other to form a complete read-only root filesystem (rootfs), and writes made while the container runs go to the uppermost read/write layer of the union filesystem. It is an elegant design; a minimal overlayfs mount sketch is given after the list of tar problems below.

However, developers and users have an endless pursuit of speed. With the widespread use of the cloud, and in order to make full use of the elasticity of cloud resources, users often need newly scaled-out compute nodes to provide containerized computing capacity as fast as possible (the container starts and the service can accept traffic). At this point the new node has to download every layer of the container image, which greatly slows down container startup. In this scenario the layered design of the container image is not exploited at all.

The community has also begun to focus on problems with the OCI V1 container image format. Tar packages, the OCI V1 layer distribution format, mainly have the following problems:

    1. Content redundancy between different layers
    2. No per-file addressing capability; the whole layer must be unpacked before any file can be accessed
    3. No support for parallel unpacking
    4. Whiteout files are used to handle deletions, and converting them between different storage types makes decompression inefficient
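The union mount described above, and the whiteout handling from point 4, can be illustrated with a minimal overlayfs sketch. The paths are hypothetical and the snippet assumes Linux with root privileges; real runtimes drive these mounts through their storage drivers or snapshotters.

```go
// Minimal sketch of how a union filesystem assembles a rootfs from image
// layers (step 3 of the traditional flow). Paths are hypothetical.
package main

import "golang.org/x/sys/unix"

func main() {
	// lowerdir lists the unpacked read-only image layers (top layer first);
	// upperdir/workdir hold the container's writable layer.
	opts := "lowerdir=/var/lib/demo/layer2:/var/lib/demo/layer1," +
		"upperdir=/var/lib/demo/upper,workdir=/var/lib/demo/work"

	// In the tar layers themselves, a deleted file is encoded as a
	// ".wh.<name>" whiteout entry that the storage driver must translate
	// into overlayfs' own whiteout representation when unpacking.
	if err := unix.Mount("overlay", "/var/lib/demo/merged", "overlay", 0, opts); err != nil {
		panic(err)
	}
}
```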

TCR Apparate OCI artifacts

Our design goal is a production-grade service that supports both the accelerated mode and the normal mode on a node. To stay decoupled from normal OCI V1 image storage, we developed Image Attached Storage, IAS (ImageAttachStorage). Combined with the Foreign Layer mechanism in the image manifest, accelerated images can be produced, uploaded and downloaded entirely within OCI V1 semantics while inheriting the original image's permissions. The manifest index of the accelerated image is stored as an OCI artifact in the image repository's own storage.

To support on-demand loading and overcome the shortcomings of tar described above, ImageApparate uses a read-only filesystem in place of tar. The read-only filesystem provides file addressing within the image layer while delivering reliable performance as the rootfs. ImageApparate still uses a layered design: the attached storage address is specified directly in the manifest's Foreign Layer, and the layer stored in IAS can be mounted on demand when the image is downloaded. A hedged sketch of such a layer descriptor is shown below.
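As a rough illustration of the Foreign Layer idea, the sketch below builds an OCI layer descriptor whose urls field points at attached storage instead of a registry blob, using the standard image-spec Go types. The media type, URL and digest are illustrative assumptions, not the actual values ImageApparate writes.

```go
// Hedged sketch of a Foreign-Layer-style descriptor: the manifest keeps a
// layer entry whose "urls" field points at the attached storage (IAS)
// instead of a blob in the registry.
package main

import (
	"encoding/json"
	"os"

	"github.com/opencontainers/go-digest"
	v1 "github.com/opencontainers/image-spec/specs-go/v1"
)

func main() {
	layer := v1.Descriptor{
		// Foreign/non-distributable layers are served from external URLs,
		// which lets an OCI V1 registry reference data kept outside it.
		MediaType: v1.MediaTypeImageLayerNonDistributable,
		Digest:    digest.FromString("example layer"), // placeholder digest
		Size:      1024,
		URLs:      []string{"https://ias.example.com/layers/sha256/..."}, // hypothetical IAS address
	}
	json.NewEncoder(os.Stdout).Encode(layer)
}
```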

After the user enables the image acceleration function and sets the relevant rules, ImageApparate runs the following process in the background whenever an image is pushed:

  1. The user pushes an image to the TCR repository with any client that conforms to the OCI V1 interface standard (including Docker)
  2. TCR's image service writes the user data to the backend storage of the image repository itself, typically COS object storage
  3. TCR's image service checks the image acceleration rules and, if they match, sends a Webhook notification to the Apparate-Client component requesting a conversion of the image format
  4. Apparate-Client receives the notification, reads the data from COS and, using a specific algorithm, converts each layer of the image individually into a layer format that supports ImageApparate mounting, writing the result into IAS (a hypothetical sketch of this webhook-driven conversion follows)
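The sketch below illustrates the webhook-driven conversion in steps 3 and 4 in a purely hypothetical form: an HTTP handler that accepts the notification and converts each layer asynchronously. All field names, paths and endpoints are assumptions for illustration; the real Apparate-Client interface is not described in this article.

```go
// Hypothetical sketch of the Apparate-Client side of the flow: receive the
// registry's webhook, then convert each layer to the mountable format and
// write it into IAS. Names and fields are illustrative assumptions.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type convertRequest struct {
	Repository string   `json:"repository"`
	Tag        string   `json:"tag"`
	Layers     []string `json:"layers"` // layer digests stored in COS
}

func main() {
	http.HandleFunc("/webhook/convert", func(w http.ResponseWriter, r *http.Request) {
		var req convertRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Convert layers asynchronously so the webhook call returns quickly.
		go func() {
			for _, dgst := range req.Layers {
				// 1. read the tar.gz layer from COS
				// 2. convert it to the read-only, mountable layer format
				// 3. write the result into IAS under a content-addressed path
				log.Printf("converting layer %s of %s:%s", dgst, req.Repository, req.Tag)
			}
		}()
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```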

As a result, TCR users only need to define rules to mark which images should be accelerated; the way CI/CD is used does not change, so the original development workflow is preserved.

Image Attached Storage: IAS (ImageAttachStorage)

As the name implies, IAS is a data storage location in addition to the image repository's own backend storage. IAS can use the same object storage as the image repository, or NFS, or Lustre. Besides the storage address, Apparate's image attached storage defines a plug-in style interface (POSIX-compliant) and the layout of image layers inside IAS. Each layer corresponds to a directory in IAS, and identical layers are reused through content addressing; a sketch of such a content-addressed layout is given below. The read-only filesystem file contains all the content of the original layer, and the entire directory tree can be accessed at any time once the metadata index is loaded. Apparate currently uses Tencent Cloud CFS as its IAS implementation, and CFS's high throughput and low latency fit the image download scenario well.
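As a rough illustration of a content-addressed layout, the sketch below maps a layer digest to a per-layer directory that would hold the read-only filesystem file and its metadata index. The actual directory structure used by Apparate is not documented here, so this layout is an assumption.

```go
// Illustrative sketch of a content-addressed layout in IAS: each layer digest
// maps to one directory holding the read-only data file and its metadata
// index, so identical layers are stored once and reused.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// layerDir returns the hypothetical directory for a layer, e.g.
// /ias/layers/sha256/ab12.../ containing image.ro (data) and meta.idx (index).
func layerDir(iasRoot, dgst string) string {
	algo, hex, _ := strings.Cut(dgst, ":")
	return filepath.Join(iasRoot, "layers", algo, hex)
}

func main() {
	fmt.Println(layerDir("/ias", "sha256:ab12cd34..."))
}
```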

The local cache of image data is implemented by each IAS storage plug-in itself. The current CFS implementation uses the FS-Cache framework as the local cache, automatically caching pages of data accessed on the remote storage according to the local disk cache capacity, which effectively improves the performance and stability of repeated access to image data.

Runtime implementation

The IAS storage plug-in currently used by ImageApparate on the node is called apparate-snapshotter and is implemented through containerd's proxy snapshotter capability.

apparate-snapshotter parses the IAS information recorded in the image layer to obtain the attached data storage address, then connects to that data storage service and provides POSIX access to the remote data locally.

For example, in the CFS scenario, the remote data is mounted locally and the mount point serves as the entry for subsequent local access. The snapshotter, or the kernel, loads the remote data on demand when it is needed. A generic sketch of how a proxy snapshotter is exposed to containerd is given below.
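For reference, this is the general shape of a proxy snapshotter service that containerd can be pointed at through a proxy_plugins entry in its configuration. It uses containerd's stock native snapshotter only to keep the sketch self-contained; apparate-snapshotter's own IAS-mounting logic is not shown.

```go
// Generic sketch of exposing a snapshotter to containerd over gRPC, the
// proxy-snapshotter mechanism that apparate-snapshotter relies on.
package main

import (
	"log"
	"net"

	snapshotsapi "github.com/containerd/containerd/api/services/snapshots/v1"
	"github.com/containerd/containerd/contrib/snapshotservice"
	"github.com/containerd/containerd/snapshots/native"
	"google.golang.org/grpc"
)

func main() {
	// A real implementation would mount IAS data and return those mounts
	// from Prepare/View; here the stock native snapshotter stands in.
	sn, err := native.NewSnapshotter("/var/lib/demo-snapshotter")
	if err != nil {
		log.Fatal(err)
	}

	rpc := grpc.NewServer()
	snapshotsapi.RegisterSnapshotsServer(rpc, snapshotservice.FromSnapshotter(sn))

	// containerd is then pointed at this socket via a proxy_plugins entry
	// in its config (type = "snapshot", address = the socket path below).
	l, err := net.Listen("unix", "/run/demo-snapshotter.sock")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(rpc.Serve(l))
}
```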

Read-only image format

The read-only property is crucial for an image filesystem that supports lazy pull: because a read-only filesystem never has to handle writes, fragmentation or garbage collection, the layout of its data blocks and indexes can be optimized ahead of time when the filesystem is built, which greatly improves read performance.

The read-only filesystem currently supported by IAS also adds a directory index sorted lexicographically, which greatly speeds up lookup of directory entries; the sketch below shows why.
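A small sketch of why a lexicographically sorted index helps: directory entries can be located with binary search instead of a linear scan. The on-disk format itself is not shown; this only illustrates the lookup idea.

```go
// Lookup in a sorted directory index is O(log n) via binary search,
// instead of scanning every entry in the directory.
package main

import (
	"fmt"
	"sort"
)

// lookup returns the position of name in a sorted index, or -1 if absent.
func lookup(index []string, name string) int {
	i := sort.SearchStrings(index, name)
	if i < len(index) && index[i] == name {
		return i
	}
	return -1
}

func main() {
	index := []string{"bin", "etc", "lib", "usr", "var"} // sorted at build time
	fmt.Println(lookup(index, "lib")) // 2
	fmt.Println(lookup(index, "opt")) // -1
}
```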

Using ImageApparate in TCR

Creating the acceleration component

ImageApparate is currently an alpha feature in TCR and requires whitelist enablement. When creating the acceleration component, select the high-performance edition of CFS and make sure that this CFS edition is available in your region.

Creating acceleration rules

Create an acceleration rule; only images or tags matching the rule will be accelerated automatically. After an image is pushed to TCR, you can see that an image matching the acceleration rule generates an OCI artifact with the -apparate suffix.

Enabling acceleration on a TKE cluster

Enable the image acceleration configuration when creating the TCR plug-in in the TKE cluster, then label the nodes in the cluster that should be accelerated: kubectl label node xxx cloud.tencent.com/apparate=true. Pod images in the cluster can still use the original image name (for example, test/nginx:1.9); the acceleration plug-in automatically selects the accelerated image to mount. If the image has been accelerated, you can observe that the Pod's image field in the TKE cluster has been replaced with test/nginx:1.9-apparate.

Follow-up work

When container images are loaded on demand, the layer may no longer be the smallest unit of reuse, and ImageApparate will later explore file- or block-based image formats and conversion tools for higher performance and efficiency. On the interface side, IAS will also support more data sources, including integration with the TKE P2P component; on-demand loading combined with P2P can better cope with very large image loading scenarios and greatly reduce the pressure on the source registry.

Private beta invitation

The ImageApparate acceleration service is now open for private beta, and we sincerely invite you to apply. The quota is limited; click here to go to the private beta application page and submit your information.

References

  • Slacker: Fast Distribution with Lazy Docker Containers
  • Image Manifest V2, Schema 2
  • General Filesystem Caching
  • EROFS: A Compression-friendly Readonly File System for Resource-scarce Devices