Are you sure you can write a Dockerfile?

GitHub now hosts thousands of Dockerfiles, but not all of them are efficient. This article introduces Dockerfile best practices from several angles to help you write cleaner, more efficient Dockerfiles. If you're new to Docker, congratulations, this article is for you. The next installment will go deeper, so stay tuned!

This article uses a Maven-based Java project as an example and improves its Dockerfile step by step until it is as clean and efficient as we can make it. Each intermediate step illustrates a best practice in one particular area.

1. Reduce build time

A development cycle consists of building a Docker image, changing the code, and then rebuilding the Docker image. If you can take advantage of caching during the process of building an image, you can reduce the unnecessary repetition of build steps.

Build order affects cache utilization

The order of the build steps is important. When a file that is copied into the image changes, or when a line in the Dockerfile changes, the cache for that step is invalidated and every subsequent step has to be rebuilt. So the best way to make use of the cache is to put the instructions that change least often first and the instructions that change most often last.
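The ordering rule can be sketched as follows. This is a minimal illustration rather than the project's actual Dockerfile: the base image tag, the package names, and the jar path target/app.jar are all assumptions made for the example.

```dockerfile
# Sketch only: base image, packages, and jar path are assumptions.
FROM debian:bookworm

# Installing tools rarely changes, so it comes first and stays cached
# across ordinary code edits.
RUN apt-get update && apt-get -y install default-jdk ssh vim

# The project files change on almost every build, so they are copied last.
# Only this step and the ones after it are rebuilt when the code changes.
COPY . /app

CMD ["java", "-jar", "/app/target/app.jar"]
```

If the COPY instruction came before the RUN instruction, every code change would also force the packages to be reinstalled.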

Copy only the required files to avoid invalidating the cache

When copying files into an image, COPY only the files that are actually required. Avoid COPY . (with a dot as the source), which copies the entire directory. If the contents of any copied file change, the cache for that step is invalidated. In this article's example, only the built JAR file is needed in the image, so copying just that file means the cache is unaffected when other, unrelated files change.
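A minimal sketch of copying a single artifact, assuming the JAR has been built on the host as target/app.jar (a hypothetical name; a real Maven build usually produces a file named after the artifact and version):

```dockerfile
# Sketch only: base image, packages, and jar name are assumptions.
FROM debian:bookworm
RUN apt-get update && apt-get -y install default-jdk ssh vim

# Copy just the built artifact instead of "COPY . /app".
# Edits to unrelated files in the project no longer invalidate this layer.
COPY target/app.jar /app/

CMD ["java", "-jar", "/app/app.jar"]
```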

Group related commands into cacheable units

Each RUN instruction is a cacheable unit of execution. Too many RUN instructions add unnecessary layers to the image, while putting every command into a single RUN instruction means that any change invalidates the whole unit and slows down the development cycle. When installing software with a package manager, you normally update the package index before installing. Put the index update and the installation in the same RUN instruction so that they form one cacheable unit. Otherwise a cached, stale index layer may be reused and you might install outdated packages.
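As a hedged sketch, the index update and the installation can be combined like this (the base image and package names are assumptions for illustration):

```dockerfile
# Sketch only: base image and packages are assumptions.
FROM debian:bookworm

# Avoid: two separate steps. The "apt-get update" layer can be served from
# an old cache entry while the install step is re-run against a stale index.
# RUN apt-get update
# RUN apt-get -y install default-jdk ssh vim

# Prefer: one cacheable unit, so the index is refreshed whenever the
# package list changes.
RUN apt-get update \
    && apt-get -y install default-jdk ssh vim
```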

2. Reduce the image size

Image size matters: the smaller the image, the faster the deployment and the smaller the attack surface.

Remove unnecessary dependencies

Remove unnecessary dependencies and do not install debugging tools. If you really need a debugging tool, you can install it later in the running container. Some package managers, such as APT, install recommended packages in addition to the ones you ask for, which increases the image size for no reason. With apt-get, the --no-install-recommends flag prevents these extra packages from being installed. If it turns out that you do need one of them, add it explicitly.
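A minimal sketch of this step, with the debugging tools dropped and recommended packages disabled (the base image and the default-jdk package are assumptions for the example):

```dockerfile
# Sketch only: base image and package name are assumptions.
FROM debian:bookworm

# No ssh or vim, and no "recommended" extras pulled in by APT.
RUN apt-get update \
    && apt-get -y install --no-install-recommends default-jdk
```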

Delete the cache of the package management tool

Package managers maintain their own cache, which ends up in the image. The recommended approach is to remove that cache at the end of the same RUN instruction that installed the packages. If you remove it in a later instruction, the image does not get any smaller, because the earlier layer still contains the files.

Of course, there are more advanced ways to reduce image size, such as the multi-stage build described below. Next we'll explore how to optimize Dockerfiles for maintainability, security, and repeatability.
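For APT, the cleanup can be sketched as below. The base image and package name are assumptions; /var/lib/apt/lists is where APT stores the downloaded package index files.

```dockerfile
# Sketch only: base image and package name are assumptions.
FROM debian:bookworm

# Install and clean up in the same RUN instruction, so the index files
# are never written into a committed layer.
RUN apt-get update \
    && apt-get -y install --no-install-recommends default-jdk \
    && rm -rf /var/lib/apt/lists/*
```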

3. Maintainability

Use official images whenever possible

Using an official image can save a lot of maintenance time, because its installation steps already follow best practices. If you have multiple projects, they can share the same image layers, because they can all use the same base image.

Use more specific tags

Try not to use the latest tag for base images. It is convenient, but the image behind latest can change significantly over time. It is better to name a specific tag of the base image in the Dockerfile. Let's use OpenJDK as an example and specify the tag 8. Check the official repository for other available tags.
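A sketch of pinning the base image to the tag 8 mentioned above. The jar name is hypothetical, and the tags that are actually published should be checked in the official repository before relying on this.

```dockerfile
# Sketch only: the jar name is an assumption.
# Pin a specific tag instead of the implicit "latest".
FROM openjdk:8

COPY target/app.jar /app/
CMD ["java", "-jar", "/app/app.jar"]
```

Because the official image already ships the JDK, the apt-get installation steps are no longer needed.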

Use the smallest base image

The size of the image varies with the tag variant of the base image. The slim variant is based on Debian, while the alpine variant is based on the much smaller Alpine Linux distribution. One notable difference: Debian uses the GNU C library (glibc), while Alpine uses musl libc, a lightweight alternative to glibc aimed at embedded operating systems and mobile devices. As a result, using Alpine can cause compatibility problems in some cases. In the case of OpenJDK, the jre variants contain only the Java runtime, not the full JDK, which also greatly reduces the image size.

4. Reproducibility
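As a rough sketch, switching to a JRE-only, Alpine-based variant could look like this. The tag 8-jre-alpine and the jar name are assumptions; confirm which variants the official repository currently publishes.

```dockerfile
# Sketch only: tag and jar name are assumptions.
# JRE-only (no compiler or JDK tools) on top of Alpine Linux (musl libc).
FROM openjdk:8-jre-alpine

COPY target/app.jar /app/
CMD ["java", "-jar", "/app/app.jar"]
```

If the application loads native libraries built against glibc, the slim variant may be the safer choice.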

So far, we've assumed that the JAR file is built on the host. This is not ideal, because it does not take advantage of the consistent environment that containers provide. For example, if your Java application relies on a particular operating system library, the result may differ depending on the machine on which the JAR file was built.

Build from source code in a consistent environment

The source code is the source of truth for your Docker image; the Dockerfile only provides the build steps.

You should first determine all the dependencies you need to build your application. The sample Java application in this article is simple and requires only Maven and the JDK, so the base image should be the smallest official Maven image that also includes the JDK. If you need more dependencies, you can install them in a RUN instruction. The pom.xml file and the src folder need to be copied into the image, because they are the inputs used when the mvn package command (-e to show errors, -B to run in non-interactive "batch" mode) is finally executed.

We have now solved the problem of inconsistent environments, but there is another problem: **after every code change, all the dependencies described in pom.xml have to be fetched again.** Let's solve this problem.
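A minimal sketch of building from source inside the image. The image tag maven:3.6-jdk-8-alpine and the output name target/app.jar are assumptions for illustration; the real jar name depends on the artifactId, version, or finalName set in pom.xml.

```dockerfile
# Sketch only: image tag and jar name are assumptions.
FROM maven:3.6-jdk-8-alpine

WORKDIR /app

# Build inputs: the project descriptor and the source tree.
COPY pom.xml .
COPY src ./src

# -e prints full error output, -B runs Maven in non-interactive batch mode.
RUN mvn -e -B package

CMD ["java", "-jar", "/app/target/app.jar"]
```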

Get the dependencies in a separate step

Using the caching mechanism described earlier, we can make dependency fetching its own cacheable unit. As long as the contents of pom.xml remain unchanged, this layer stays cached no matter how the code changes. Placing a RUN instruction between the two COPY instructions tells Maven to fetch only the dependencies.

Now there's a new problem: compared with simply copying the prebuilt JAR file, the image is larger, because it contains many build-time dependencies that are not needed to run the application.
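One way to sketch the separate dependency step is with the dependency:resolve goal of the Maven dependency plugin. The image tag and jar name are again assumptions for the example.

```dockerfile
# Sketch only: image tag and jar name are assumptions.
FROM maven:3.6-jdk-8-alpine

WORKDIR /app

# Step 1: only the project descriptor. This layer, and the dependency
# download below, are rebuilt only when pom.xml changes.
COPY pom.xml .
RUN mvn -e -B dependency:resolve

# Step 2: the source code, which changes far more often.
COPY src ./src
RUN mvn -e -B package

CMD ["java", "-jar", "/app/target/app.jar"]
```

Note that dependency:resolve covers the project's dependencies; Maven may still download some build plugins during the package step.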

Use multi-stage builds to remove build-time dependencies

A multi-stage build is recognizable by its multiple FROM instructions. Each FROM starts a new build stage, which can be named with the AS keyword. In this example, the first stage is named builder so that the second stage can reference it directly. The first stage contains all the build dependencies in a consistent environment.

The second stage produces the final image. It contains only what the application needs at runtime, in this case an Alpine-based minimal JRE image. The builder stage is cached, but it is not part of the final image. To add the built JAR file to the final image, use COPY --from=STAGE_NAME, where STAGE_NAME is the name of the earlier build stage.

Multi-stage builds are the preferred solution for removing build-time dependencies.

This article started from a large image built in an inconsistent environment and ended with a minimal image built in a consistent environment, while making full use of the caching mechanism. The next article will cover additional uses of multi-stage builds.
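Putting the two stages together, a hedged sketch might look like the following. The image tags, the stage name builder, and the jar name app.jar are assumptions for illustration, not a definitive build.

```dockerfile
# Sketch only: image tags and jar name are assumptions.

# Stage 1: full build environment with Maven and the JDK.
FROM maven:3.6-jdk-8-alpine AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn -e -B dependency:resolve
COPY src ./src
RUN mvn -e -B package

# Stage 2: runtime only. Nothing from the builder stage ends up here
# except what is copied explicitly.
FROM openjdk:8-jre-alpine
COPY --from=builder /app/target/app.jar /
CMD ["java", "-jar", "/app.jar"]
```

The image can then be built as usual, for example with docker build -t myapp . where myapp is a placeholder name.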