Abstract: At Computing TechDay41, Alibaba Cloud senior R&D engineer Okban presented Docker image optimization and best practices. Starting from how Docker stores images, this article explains the key points to optimize during a build for both image storage and network transfer, and introduces Docker's newer multi-stage build feature, which solves the problem of intermediate build artifacts.

Here are the highlights:

Image concept



What is an image? Concretely, an image is a file stored as multiple layers. Compared with an ordinary ISO system image, layered storage brings two advantages. First, layered images are easier to extend: we can build our Nginx image on top of an Ubuntu base image, so we only need to do the Nginx installation and configuration work on top of that base, and once the Nginx image is made we no longer have to build every image from scratch. Second, it saves storage space. Suppose we have two images, tag 1.0 and tag 2.0. Transferred the traditional way, each image is roughly 130 MB; stored in layers, the two images can share their common layers (the two purple layers in the figure) and together need only a little over 140 MB. This saves storage space and also reduces network overhead: if we have already downloaded the lower image, then to fetch the upper one we only need to download the roughly 10 MB that differs.
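As an illustration of the first point, a minimal Dockerfile that layers Nginx on top of an Ubuntu base might look like the sketch below (package names and configuration paths are assumptions, not taken from the talk):

```dockerfile
# Each instruction adds one layer on top of the shared Ubuntu base layers.
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y nginx && rm -rf /var/lib/apt/lists/*
COPY nginx.conf /etc/nginx/nginx.conf
CMD ["nginx", "-g", "daemon off;"]
```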

Looked at more abstractly, a Docker image is a standardized means of delivery that Docker provides. Traditionally an application is delivered as an executable file, and that executable does not include its runtime environment: perhaps it was developed and tested on a 32-bit system and the user runs 64-bit, or it was built against version 1.0 of some software while the user's environment has 2.0. Problems like these only surface at delivery time and cost us time in troubleshooting. If we deliver in the standardized form of a Docker image, we can avoid them.

Basic image operations and storage



An image has a coordinate, and that coordinate basically consists of four parts. At the front is the registry's domain name; every service provider has its own domain. Once we have chosen a provider and its domain, we generally apply for our own namespace with that provider. The repository name usually identifies what the image is for, such as an Ubuntu image or a CentOS image, and the tag distinguishes versions, for example tagging an Ubuntu image as 16.04. With those coordinates in place, we can start operating on images.
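Putting the four parts together, an image coordinate looks roughly like this (the domain and namespace below are hypothetical placeholders):

```
<registry-domain>/<namespace>/<repository>:<tag>
registry.example.com/my-team/ubuntu:16.04
```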

First we need to log in, using the login command. Then, when we have an image prepared locally and want to upload it, we tag it, changing its coordinates to the coordinates we want to push to, and then push and pull as needed. Finally, Docker provides two more commands for delivering images. If we are in a special environment with no network access, we can package an image into an ordinary file for transfer. For example, when we work with public-security customers, they cannot pull images through our registry, so we may have to save the image as a file and deliver it on a USB drive.
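A sketch of those operations with standard Docker CLI commands (the registry domain and image names are placeholders):

```bash
docker login registry.example.com                              # log in to the registry
docker tag myapp:1.0 registry.example.com/my-team/myapp:1.0    # re-tag to the upload coordinate
docker push registry.example.com/my-team/myapp:1.0             # upload the image
docker pull registry.example.com/my-team/myapp:1.0             # download the image

# Offline delivery when the registry is unreachable:
docker save -o myapp.tar registry.example.com/my-team/myapp:1.0
docker load -i myapp.tar
```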

Image Storage Details



A Docker image is a union file system, and each image is layered storage. For example, in the first layer we add three new files; on top of that layer we add another layer with one more file; and in a third layer we make some changes, say modifying File3 and deleting File4. This is where the union file system's copy-on-write mechanism comes in: the underlying layers an image depends on are read-only, so we cannot modify them directly. If we want to modify File3, we cannot change the file in place; we have to copy it up into the current layer, L3 in this example, and modify the copy there.

When we look at the image from the union file system's point of view, we do not see L1, L2 and L3; we just see File1, File2, File3 and File4 in the merged view. Once we understand how that works, we can understand what a container looks like at runtime. The bottom part of the figure is again a three-layer image, L1, L2 and L3. When the container runs, the Docker daemon dynamically creates a writable layer on top as the container's running layer. When the container needs to modify a file such as File2, the same copy-on-write mechanism copies it up and modifies the copy; newly created files also go into this writable layer. The container likewise gets a merged view at runtime. When we stop the container, that view is torn down, but the container's read-write layer is retained, so if we restart the stopped container we still see the changes we made inside it.
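The writable container layer and the copy-on-write behaviour can be observed directly with `docker diff`, which lists what a container added (A), changed (C) or deleted (D) relative to its image. A small demonstration; the output shown is indicative only:

```bash
docker run --name cow-demo ubuntu:16.04 bash -c "echo hello > /etc/motd && rm /var/log/dpkg.log"
docker diff cow-demo
# C /etc
# C /etc/motd          <- copied up into the writable layer and modified
# D /var/log/dpkg.log  <- marked deleted; the read-only image layer is untouched
```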

Common storage drivers include AUFS, OverlayFS and Device Mapper. The first two work at the file level: when a file needs to be modified, the whole file is copied up and then changed. Device Mapper works at the block level: to modify a file it does not copy the entire file, only the blocks being changed, so when large files need to be modified Device Mapper performs much better than AUFS or OverlayFS. AUFS and OverlayFS are therefore better suited to traditional web applications, which do not do heavy file manipulation but do care about startup, for example frequent releases that need containers to start quickly, while file-modification efficiency matters less; for those, a file-based driver is a good fit. For compute-intensive applications we can choose Device Mapper, which starts more slowly but performs better.
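You can check which driver the local daemon is using, and switch it through the daemon configuration; a minimal sketch:

```bash
# Show the storage driver currently in use.
docker info --format '{{.Driver}}'

# To change it, set the driver in /etc/docker/daemon.json and restart the daemon, e.g.:
#   { "storage-driver": "overlay2" }
```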

Automated image builds



When we build an image, Docker provides a standardized set of build instructions. We write those instructions into a script-like file called a Dockerfile, and Docker parses the Dockerfile and builds it into an image automatically, so you can simply think of it as a standardized build script. What does this Dockerfile do? We start from the official OpenJDK image, because we are going to run a Java application and need the Java environment. The next two LABEL instructions mark the image with its version and build date. The following six RUN instructions install Maven, the Java build and lifecycle management tool. Then we add the source code from the outside into the image, run two more RUN commands to package it, and finally write the startup command.
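The talk does not reproduce the exact Dockerfile, but based on the description it would look roughly like the sketch below (versions, download URLs and paths are illustrative assumptions):

```dockerfile
FROM openjdk:8-jdk

LABEL version="1.0"
LABEL build-date="2017-09-01"

# Six separate RUN instructions installing Maven -- one layer each.
RUN wget http://mirrors.example.com/maven/apache-maven-3.5.0-bin.tar.gz
RUN tar -zxf apache-maven-3.5.0-bin.tar.gz
RUN mv apache-maven-3.5.0 /usr/lib/maven
RUN ln -s /usr/lib/maven/bin/mvn /usr/bin/mvn
RUN mvn --version
RUN mkdir -p /usr/src/app

# Add the source code and package it.
ADD . /usr/src/app
RUN cd /usr/src/app && mvn clean
RUN cd /usr/src/app && mvn package

CMD ["java", "-jar", "/usr/src/app/target/app.jar"]
```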



On the whole this Dockerfile is perfectly workable and the idea is clear: step by step it picks the base image, sets up the compilation environment, adds the source code, runs the build and writes the startup command. Readability and maintainability are fine, but it can still be optimized.

First, we can reduce the number of layers. Docker puts a limit on how many layers an image can have: apart from the top read-write layer added when a container runs, an image can have at most 127 layers. The six RUN commands below all do Maven installation work, so we can chain them together into a single layer, and we can do the same for the subsequent build commands. That cuts the number of layers in half, from 14 to 7.

When optimizing an image we want to minimize the number of layers, but that trades off against the readability of the Dockerfile, so we need a compromise: reduce the layers as much as possible without badly hurting readability. In fact the six RUN commands do one thing, setting up the Maven environment and preparing for compilation, so merging them costs little, as shown in the sketch below.
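For example, the six Maven-related RUN instructions from the sketch above can be chained into a single RUN with `&&`, producing one layer instead of six:

```dockerfile
RUN wget http://mirrors.example.com/maven/apache-maven-3.5.0-bin.tar.gz && \
    tar -zxf apache-maven-3.5.0-bin.tar.gz && \
    mv apache-maven-3.5.0 /usr/lib/maven && \
    ln -s /usr/lib/maven/bin/mvn /usr/bin/mvn && \
    mkdir -p /usr/src/app
```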



Now let's continue optimizing. What else can we do? When installing the Maven build tool we add one more step: after unpacking the installation package we delete the archive and any unpacked files we no longer need, cleaning up the intermediate products of the build. We need to pay attention to what each build instruction leaves behind and clear out the garbage as much as possible; when we install software through apt-get we can do the same kind of cleanup. By adding that one line we reduce the image from 137 MB to 119 MB.
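A hedged sketch of that cleanup, continuing the illustrative Maven layer above: the downloaded archive is removed inside the same RUN so it is never committed into the layer, and apt-get installs get similar treatment ("curl" is just an example package):

```dockerfile
RUN wget http://mirrors.example.com/maven/apache-maven-3.5.0-bin.tar.gz && \
    tar -zxf apache-maven-3.5.0-bin.tar.gz && \
    mv apache-maven-3.5.0 /usr/lib/maven && \
    ln -s /usr/lib/maven/bin/mvn /usr/bin/mvn && \
    rm apache-maven-3.5.0-bin.tar.gz            # delete the intermediate archive in the same layer

RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*                 # drop the apt package lists in the same layer
```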

Cleaning up after apt-get installs is something basically everyone who writes Dockerfiles ends up doing, so the official Debian and Ubuntu repository images ship a hack by default that automatically deletes the apt package caches after an install.
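You can inspect that behaviour yourself; to the best of my knowledge the official images ship an apt configuration snippet at `/etc/apt/apt.conf.d/docker-clean` that discards downloaded package files after installation:

```bash
docker run --rm ubuntu:16.04 cat /etc/apt/apt.conf.d/docker-clean
```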



We can also make use of the build cache. Docker build enables the cache by default, and there are three conditions for a cache hit: the parent layer is the same, the build instruction itself is unchanged, and for ADD or COPY instructions the checksums of the files being added are unchanged. As long as an instruction meets these conditions, that layer is not rebuilt; the result of the previous build is reused directly. Based on this behaviour we can reorder the Dockerfile. Take a Java application as an example: its POM file describes the Java dependencies, and in normal development those dependencies change far less often than the source code. So we first add the POM and resolve all the dependencies it declares, and only then add the source code and do the build. As long as we do not disable the cache, we do not have to download the dependencies on every build, which saves a lot of time and also some network traffic.
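A minimal sketch of that ordering for a Maven project (paths and the dependency-resolution goal are standard Maven usage, not taken from the talk):

```dockerfile
WORKDIR /usr/src/app

# Dependencies change far less often than source code, so copy the POM first
# and resolve them in their own layer; an unchanged pom.xml means a cache hit.
COPY pom.xml .
RUN mvn dependency:go-offline

# Changing the source only invalidates the layers from here down.
COPY src ./src
RUN mvn package
```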



Alibaba Cloud's container image service now provides a build feature. Looking at statistics on users' failed builds, we found that network issues account for 90% of the failures; for example, users developing with Node often get stuck halfway through installing packages with npm. So we recommend adding a software mirror source. We add the Alibaba Cloud Maven address to the configuration so that packages are downloaded from the Alibaba Cloud mirror, and build time drops by 40% directly, which also helps the success rate of image builds.
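A hedged sketch of pointing the build at nearby mirrors; the exact mirror URLs and file paths are assumptions and may differ from what the team actually uses:

```dockerfile
# Switch Ubuntu's apt sources to the Alibaba Cloud mirror.
RUN sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list

# Ship a Maven settings.xml that lists the Alibaba Cloud Maven repository
# (e.g. https://maven.aliyun.com/repository/public) as a mirror.
COPY settings.xml /root/.m2/settings.xml
```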

Multi-stage builds



The final product of this Dockerfile is really just a Java application; we do not care about the building, compiling, packaging and installing, we only want the end result. So we can build the image in stages. First, we bake everything we dealt with before into a base image: the FROM image above has been changed to this new one, with the software sources already pointed at the Maven mirror. We make use of the cache, add the source code, and build an image whose only job is to do the build. Then we copy the built Java package into a second image and write the startup command there. But these two Dockerfiles really are two separate images, so we need a helper script. The first line of the shell script runs the first build: we point it at the build Dockerfile and produce an app-build image. The next lines create a container from that image, copy the build product out of it, and run the second build that packages the Java application. This makes each Dockerfile much clearer than before, and the steps are simple.
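A possible shape for that helper script (file and image names are illustrative):

```bash
#!/bin/sh
# Sketch of the two-Dockerfile "builder" pattern used before multi-stage builds.

# 1. Build the image that contains the full Maven environment and compiles the app.
docker build -f Dockerfile.build -t app-build .

# 2. Copy the build product (the jar) out of that image.
docker create --name app-build-tmp app-build
docker cp app-build-tmp:/usr/src/app/target/app.jar ./app.jar
docker rm app-build-tmp

# 3. Build the small runtime image that contains only the product.
docker build -f Dockerfile.run -t app .
```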



Since 17.05 Docker officially supports multi-stage builds, so we can drop the helper script entirely. We only need to give the base image of the first stage a name, and then the second stage can reference the products of the first stage directly. For example, we build the Java application in the first stage and copy the jar from Maven's target directory into the new image in the second stage. After all the optimizations the effect is as shown in the figure. For the first build, it took 102 seconds before optimization and only 55 seconds after; the main gain is on the network. When we modify a Java file and rebuild, the second build originally took 86 seconds; with the build cache, because the Maven installation layer was cached, it came down to under 20 seconds, and after the full optimization it took only 8 seconds, because all the dependencies downloaded in the earlier layers were cached and unchanged, so only the modified source had to be built. Those 8 seconds are pretty much the pure build time.
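The same pipeline written as a single multi-stage Dockerfile might look like this (image tags and paths are illustrative; note the runtime stage only needs a JRE, which is the size win discussed next):

```dockerfile
FROM maven:3.5-jdk-8 AS build
WORKDIR /usr/src/app
COPY pom.xml .
RUN mvn dependency:go-offline
COPY src ./src
RUN mvn package

# The runtime stage copies only the jar produced by the "build" stage.
FROM openjdk:8-jre
COPY --from=build /usr/src/app/target/app.jar /app.jar
CMD ["java", "-jar", "/app.jar"]
```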

Now look at the storage optimization. For the first build, the image was 137 MB before optimization and only 81 MB after the whole optimization. The base image here changed from a JDK to a JRE. Why? Before, we put the whole process into one image, and the build step needed to run Maven, which will not run without a JDK. Once we split the build into stages, the Maven build happens in the build image and only the real runtime goes into the new image, which no longer needs the JDK; we use a JRE directly, and the image shrinks by nearly 50%. When we modify the source code and rebuild, thanks to layer sharing the second build before optimization added two or three layers totaling about 9 MB, while the second build after optimization added only 1.93 KB. With that, our Dockerfile optimization is done.

What are the important points of image optimization? They are as follows:

1. Reduce the number of image layers and combine related commands where possible;

2. Clean up intermediate build products, for example deleting installation packages right after they are installed;

3. Optimize network requests by using mirror sources and open-source sites with good connectivity, which saves time and reduces the failure rate;

4. Use the build cache as much as possible, putting the things that do not change, or change rarely, first, because only unchanged layers can be cached;

5. Use multi-stage builds to keep the purpose of each image clear: separate the build from the final product, do the build in a build image, and package the final product into the final image.

Container Image Service

Finally, a brief introduction to the Alibaba Cloud container image service. The service has been in public beta for a year, and all of it is currently free. It is deployed in 12 regions around the world, and each region has intranet and VPC endpoints, so access from an ECS instance in the same region is very fast. Team management and organization accounts have been launched; image builds and image event notifications provide DevOps capabilities; and for image optimization we provide layer-information browsing, image analysis, image security scanning and image synchronization.