Container technology, represented by Docker, has been a hot topic for several years. We assumed Docker would have become the standard way to deploy and manage services in production among startups with no legacy burden. Recently, however, some friends were visibly surprised to learn that we run Docker and Kubernetes in production. These teams have not adopted Docker and continue to use traditional operations schemes; they would rather keep more direct control over the system, even at the cost of more complicated operation and maintenance work.

Docker is built on Linux container technology (originally LXC), which big companies like Google were already using more than ten years ago; some large domestic companies also began investing in container research and adoption very early. For them, containers make full use of computing resources and save hardware costs. For a small company, however, squeezing 10 servers down to five won't be what helps the company survive.

Container technology really began to have a wide impact with the birth of Docker. When Docker first came out, the slogan on its official website, "Build, Ship, Run", accurately described its positioning: it covers the whole path from packaging to deployment that traditional operations handles. Because of differences in language, environment, and platform, almost every company ends up building its own release system; with Docker, all of this can be standardized. And compared with a heavyweight virtual machine, a Docker container is little more than a packaged process, with only a small amount of extra overhead, and its startup speed is almost the same as starting the application process directly.
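As a minimal sketch of that build-ship-run flow (the image name and registry below are hypothetical):

# Build: package the application and its environment into an image
docker build -t registry.example.com/myapp:1.0 .

# Ship: push the image to a registry
docker push registry.example.com/myapp:1.0

# Run: start it on any host that runs Docker
docker run -d registry.example.com/myapp:1.0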

For a small team like ours with no full-time operations staff, it was natural to adopt Docker as the service release tool from the very beginning of the project:

Stage 1: Mapping host port + HAProxy forwarding

Because a Docker container is just an encapsulated process (or group of processes), a container must bind to a host port in order to provide network services beyond the host machine. Port management is the last thing anyone wants to worry about, so to avoid port conflicts, Docker can bind each exposed container port to a random host port. At that point we need a service discovery mechanism to help services on different machines find and communicate with each other.
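For example, with the -P flag Docker picks a random free host port for each exposed container port (the image here is just for illustration):

# Expose container port 80 on a random host port
docker run -d -P --name web nginx

# Ask Docker which host port was bound, e.g. 0.0.0.0:32768
docker port web 80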

We used a simple solution built from a pair of open source tools, docker-register and docker-discover, which form two components:

  1. Each worker node VM runs a docker-register container, which scans the local containers and registers each service's name (using the image name as the service name) and ports (both the in-container port and the host port) into etcd.

  2. A docker-discover container acts as the central proxy. It runs an HAProxy instance, scans the services registered in etcd every few seconds, and regenerates the HAProxy configuration file. The template for the backend configuration looks like this:

{% for service in services %}
listen {{ service }}
    bind *:{{ services[service].port }}
    {% for backend in services[service].backends %}
    server {{ backend.name }} {{ backend.addr }} check inter 2s rise 3 fall 2
    {% endfor %}
{% endfor %}
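To give a sense of what the proxy consumes, the registrations in etcd might look something like this (the key layout and service names are illustrative, not the tools' exact schema):

# One key per backend instance of each service
etcdctl ls --recursive /services
# /services/api/10.0.1.5:32768
# /services/api/10.0.1.6:32771
# /services/web/10.0.1.5:32769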

This scheme gave us the basic functions of a release system: application publishing, service discovery, load balancing, and process supervision. Publishing meant executing a script that pulled the latest images on the worker nodes, and process supervision was provided by the Docker daemon's restart=always policy.
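The restart policy is set per container at run time; a minimal example (hypothetical image name):

# The Docker daemon restarts this container whenever it exits
docker run -d --restart=always registry.example.com/myapp:1.0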

Apart from providing a consistent runtime environment that made releases and rollbacks more controllable, this simple system still relied on remote script execution for releases, just like traditional operations, and its functionality was quite limited. As our back-end grew, it would soon become insufficient.

Stage 2: Rancher

Docker itself only provides a runtime environment. Besides running services, we also need a container orchestration system to coordinate multiple service containers. Typically, we expect the orchestration system to accomplish several things:

Basic release automation features:

The orchestration process includes allocating machines (scheduling), pulling images, starting/stopping/updating containers, liveness monitoring, and scaling container counts up and down.

Declaratively define the service stack:

Provide a mechanism to declare services' network ports, images, and versions in configuration files, so that a complete set of services can be recreated from configuration whenever needed (see the sketch after this list).

Service discovery:

Provide DNS and load balancing: when a container starts, other services must be able to reach it, and when a container dies, traffic must no longer be directed to it.

State reconciliation:

The system needs to continuously check that the cluster matches the state declared in the configuration. For example, if a host machine goes down, the containers that ran on it need to be started on another healthy node, and if a container dies, it needs to be restarted.
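As a sketch of what a declaratively defined stack looks like (a docker-compose-style file; the service and image names are hypothetical):

# docker-compose.yml: declare the service's image, version and ports
cat > docker-compose.yml <<'EOF'
version: '2'
services:
  api:
    image: registry.example.com/api:1.2.0
    ports:
      - "8080:8080"
    restart: always
EOF

# Recreate the whole stack from the declared configuration
docker-compose up -d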

Judging by design ideas, community activity, and other factors, Kubernetes was undoubtedly the best choice among orchestration tools. But it has many components, the learning curve is not shallow, and because of the Great Firewall, even installing it in China is not an easy thing.

At this point we noticed that Rancher had just been officially released. It wasn't as popular as Kubernetes, but it had all the features we needed plus a simple, easy-to-use web UI. In those early days we had few machines and few services, and we wanted an orchestration tool that would let us focus our limited energy on development, so we quickly moved our services onto Rancher.

To be precise, Rancher is a container management platform that supports three orchestration engines: Kubernetes, Swarm, and Rancher's own Cattle (and recently, it seems, Mesos). In terms of feature completeness and ease of use, Rancher could pass for commercial software, and it is extremely simple to deploy, which is why we chose it as our entry-level container management platform.

(Rancher component diagram: most of the software features commonly needed by small and medium-sized companies can be found in it.)


Stage 3: Kubernetes

Although Rancher was very easy to use, as our back-end machines and projects grew in number, some of its problems were exposed. The UI would hang, and releases became slower and slower. After version 1.3, it could not even guarantee the desired state of a service (container count, version), getting stuck in the upgrading or finishing state. What finally made us decide to migrate was a major failure, suspected to be caused by a network avalanche, in which all machines in the cluster repeatedly went offline. Orchestration systems are generally designed so that containers on worker nodes communicate over an overlay network, and a disconnection between a node and the Rancher server should not affect containers that are already running. In Rancher after 1.3, however, whether by deliberate design or due to a bug, when a worker node reconnected to the Rancher server, all containers on that node were rescheduled, so every container in the cluster was being destroyed and recreated over and over.

Whenever we ran into problems, large or small, with Rancher, we found it hard to find solutions from the community; we were probably among the unlucky early adopters who stepped on the mines first. Although Rancher is open source, its technical documentation is much thinner than its user documentation, so it is difficult to troubleshoot problems by understanding its implementation. And there are very few contributors outside the Rancher company itself, an obvious contrast with Kubernetes.

Unlike Rancher, which is a complete out-of-the-box solution that installs with preset configuration and can be used as a release tool even without knowing its components well, Kubernetes provides a set of building blocks and requires users to have some understanding of its components. The community offers a number of installers to help with deployment, but they all require a fair amount of configuration work before the cluster is usable.
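For instance, with kubeadm, one of those community installers, a minimal cluster bootstrap looks roughly like this (the CIDR is just a common choice, and the join values are placeholders printed by kubeadm init):

# On the master: initialize the control plane
kubeadm init --pod-network-cidr=10.244.0.0/16

# On each worker: join the cluster with the token printed above
kubeadm join <master-ip>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>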

(Kubernetes architecture diagram, from DevOps.)

After gaining some experience with container orchestration on Rancher, we felt confident enough to adopt Kubernetes, which performed better and had a far more active community, and we began the migration.

We shifted online traffic from the Rancher cluster to the Kubernetes cluster bit by bit. Early on we found that as soon as traffic rose slightly, packet loss became serious. We narrowed the problem down to slow DNS resolution, and the kernel logs showed that the host's conntrack count had hit its upper limit, causing packets to be dropped.

A note on conntrack in iptables: when we use iptables as a connection-oriented (stateful) firewall, allowing a specific connection to an IP address means letting the reply packets out as well as letting the subsequent packets in. Since IP itself is a stateless protocol, a session table is needed to record stateful connections, and conntrack is what tracks these sessions. With microservices, frequent calls between services generate a large number of DNS queries, and the default Linux conntrack_max is easily exceeded, so setting a high conntrack_max on machines that host DNS services is practically mandatory.
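A sketch of how to check and raise the limit (the value below is only an example; it should be sized to actual traffic):

# Symptom in the kernel log when the table overflows:
#   nf_conntrack: table full, dropping packet
dmesg | grep conntrack

# Current usage versus the limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit (add it to /etc/sysctl.conf to survive reboots)
sysctl -w net.netfilter.nf_conntrack_max=1048576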

The container ecosystem

Beyond orchestration, compared with the traditional architecture, containers also gave us a whole set of logging and monitoring solutions. Take log collection: services in containers can simply print logs to standard output, and Docker supports a variety of logging drivers, so logs can be sent directly to platforms such as Graylog or AWS CloudWatch, or a collector such as Fluentd can pick up Docker's default json-file output. Under a traditional architecture, by contrast, each application may need a specially configured logger that outputs to the collection system, or a dedicated collector must be configured to gather the various log files.
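For example, the logging driver can be chosen per container; here is a sketch that ships a container's output to Fluentd (the collector address and image are hypothetical):

# Send this container's stdout/stderr to a Fluentd collector
docker run -d \
    --log-driver=fluentd \
    --log-opt fluentd-address=fluentd.example.com:24224 \
    registry.example.com/myapp:1.0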

Small companies don't invest much in infrastructure and usually don't have people familiar with big open source projects like Kubernetes, but our company is open to new technology and willing to let engineers try it. Orchestration systems take on extra internal complexity to absorb the complexity that would otherwise live in configuration and scripts, and that does make them harder to troubleshoot when something goes wrong. But anyone who has done operations work knows how hard it is to make self-written scripts general-purpose: change the environment slightly and they may no longer work. Operations is fairly tedious work, and with the orchestration system handling it, we can put our energy into more meaningful things.

Picture by: Mengtianfang

Reference:

jasonwilder.com/blog/2014/0…

thenewstack.io/containers-…

rancher.com/rancher/

conntrack-tools.netfilter.org/manual.html