Hello everyone, I am Glacier ~~

I must have forgotten to pay my respects to the server gods before the holiday, because ever since, the servers have had one problem after another. I don't know whether it's a people problem or a feng shui problem. Before leaving work yesterday, I reminded my ops colleague several times: if you install the Kafka cluster with Docker, make sure you give the Kafka servers plenty of disk space. Our company's business volume is large, and a huge amount of inter-service communication, data flow, and log collection and shipping goes through the Kafka message bus.

Unexpectedly, as soon as I got to the office this morning and opened my laptop, my inbox was flooded with server alerts. Then the monitoring screen showed that several test servers on the intranet were down. You can imagine the look on my face.

What on earth was going on? I'd only just arrived and things were already on fire? Which servers were down? I looked at the big screen again. Oh no, weren't these exactly the Kafka cluster servers I had warned my ops colleague about yesterday?

They'd only just entered testing. Surely it couldn't be that bad?

So I hurried over to my ops colleague and asked: how did you configure the servers yesterday?

He said: Configure? Isn't it just a test environment? I didn't configure much; I installed the Kafka cluster with the defaults, 120GB of space per server.

Me: Didn't I tell you to give the servers more disk space?

However speechless I felt, the problem still had to be solved. So I quickly logged in to the server and tried to change into Docker's default data directory.

[root@localhost ~]# cd /var/lib/docker

The result is an error message, as shown below.

[root@localhost ~]# ls
-bash: cannot create temp file for here-document: No space left on device
-bash: cannot create temp file for here-document: No space left on device
-bash: cannot create temp file for here-document: No space left on device
(…the same error repeated many more times…)

I couldn't even switch directories. What now? Instinctively, I checked the disk usage on the server, and one look told the whole story.

[root@localhost ~]# df -lh
Filesystem                  Size  Used Avail Use% Mounted on
devtmpfs                    3.8G     0  3.8G   0% /dev
tmpfs                       3.9G     0  3.9G   0% /dev/shm
tmpfs                       3.9G   82M  3.8G   3% /run
tmpfs                       3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/mapper/localhost-root   50G   50G    0G 100% /
/dev/sda1                   976M  144M  766M  16% /boot
/dev/mapper/localhost-home   53G    5G   48G  10% /home
tmpfs                       779M     0  779M   0% /run/user/0
overlay                      50G   50G    0G 100% /var/lib/docker/overlay2/d51b7c0afcc29c49b8b322d1822a961e6a86401f0c6d1c29c42033efe8e9f070/merged
overlay                      50G   50G    0G 100% /var/lib/docker/overlay2/0e52ccd3ee566cc16ce4568eda40d0364049e804c36328bcfb5fdb92339724d5/merged
overlay                      50G   50G    0G 100% /var/lib/docker/overlay2/16fb25124e9b85c7c91f271887d9ae578bf8df058ecdfece24297967075cf829/merged

Damn it, the root partition was at 100%, just as I suspected. The output also contains several important lines, shown below.

overlay 50G 50G 0G 100% /var/lib/docker/overlay2/d51b7c0afcc29c49b8b322d1822a961e6a86401f0c6d1c29c42033efe8e9f070/merged
overlay 50G 50G 0G 100% /var/lib/docker/overlay2/0e52ccd3ee566cc16ce4568eda40d0364049e804c36328bcfb5fdb92339724d5/merged
overlay 50G 50G 0G 100% /var/lib/docker/overlay2/16fb25124e9b85c7c91f271887d9ae578bf8df058ecdfece24297967075cf829/merged
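To confirm exactly which subdirectories are eating the space, a quick du sweep helps. A minimal sketch (run as root on the affected host; the path is Docker's default, adjust if yours differs):

```shell
# Rank the largest space consumers under Docker's data directory.
# stderr is silenced because unreadable subdirs would clutter the output.
du -h --max-depth=1 /var/lib/docker 2>/dev/null | sort -rh | head
```

On a machine like this one, the overlay2 directory will almost certainly dominate the list.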

Aren't these overlay filesystems mounted right under Docker's default data directory?

What next? The root partition was full, but /home still had plenty of free space, so the quickest stopgap was to migrate Docker's default data directory from /var/lib/docker to /home/docker. Everything else could wait until the servers were reassigned.

No time to waste, so I started migrating Docker's default data directory.

There are two ways to migrate Docker's default data directory, and I'll share both with you here. One is the soft (symbolic) link method; the other is the configuration change method. Let's look at each in turn.

1. Soft link method

(1) Docker's default storage location is /var/lib/docker. We can check the current data directory with the following command.

[root@localhost ~]# docker info | grep "Docker Root Dir"
Docker Root Dir: /var/lib/docker

(2) Next, we stop the Docker service with either of the following commands.

systemctl stop docker

or

service docker stop

(3) Then move the /var/lib/docker directory to /home.

mv /var/lib/docker /home

Depending on how much data there is, this step can take a long time.
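A side note of my own: a plain mv across filesystems is really a copy followed by a delete, and if it is interrupted you are left in a half-moved state. Where rsync is installed, it can do the copy resumably and verifiably. A small sketch on throwaway directories (stand-ins for the real /var/lib/docker and /home/docker):

```shell
# Demonstrate a verifiable copy; the temp dirs stand in for the real paths.
src=$(mktemp -d)          # stand-in for /var/lib/docker
dst=$(mktemp -d)/docker   # stand-in for /home/docker
echo "layer-data" > "$src/blob"
if command -v rsync >/dev/null 2>&1; then
    rsync -a "$src"/ "$dst"/   # -a preserves perms/owners; add -P for progress
else
    cp -a "$src" "$dst"        # portable coreutils fallback
fi
diff -r "$src" "$dst"          # exits 0 only if the copy is identical
```

Only after diff reports no differences would you remove the original directory and proceed to the symlink step.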

(4) Next, create a soft link, as shown below.

ln -s /home/docker /var/lib/docker

(5) Finally, we start the Docker service again.

systemctl start docker

or

service docker start

(6) Check the directory of the Docker image again, as shown below.

[root@localhost ~]# docker info | grep "Docker Root Dir"
Docker Root Dir: /home/docker

At this point, the Docker image directory is successfully migrated.
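The mechanics of the soft-link trick can be rehearsed on throwaway directories, which is a handy way to convince yourself that the old path keeps working before touching a production host. A sketch (the temp paths are stand-ins, not the real Docker directories):

```shell
# Simulate: move the data away, then make the old path a symlink to the new home.
src=$(mktemp -d)            # stand-in for /var/lib/docker
home=$(mktemp -d)           # stand-in for /home
echo "image-data" > "$src/layer"
mv "$src" "$home/docker"    # relocate the data
ln -s "$home/docker" "$src" # old path now points at the new location
cat "$src/layer"            # reads through the symlink: prints image-data
```

Anything that still opens files via the old path, Docker included, follows the symlink transparently.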

Next, let's look at the configuration change method.

2. Configuration change method

The Docker daemon supports a --graph startup option (short form -g) that specifies its data directory; the default is --graph=/var/lib/docker. All we need to do is point it at /home/docker instead.

In this case, the server operating system I use is CentOS, so Docker's configuration can be modified as follows.

(1) Stop the Docker service.

systemctl stop docker

or

service docker stop

(2) Modify the Docker service unit file.

vim /etc/systemd/system/multi-user.target.wants/docker.service

Find the ExecStart line in the unit file and change it as follows.

ExecStart=/usr/bin/dockerd --graph=/home/docker

(3) Reload the configuration and start Docker.

systemctl daemon-reload
systemctl start docker

(4) Check the directory of the Docker image again, as shown below.

[root@localhost ~]# docker info | grep "Docker Root Dir"
Docker Root Dir: /home/docker

At this point, the Docker image directory is successfully migrated.
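One caveat worth adding: on newer Docker releases the --graph flag is deprecated in favor of --data-root, and editing unit files under /etc/systemd/system directly risks being overwritten by package upgrades. The cleaner route on recent versions is the daemon.json configuration file. A sketch (the target path /home/docker is our example, not a required value):

```json
{
  "data-root": "/home/docker"
}
```

Save this as /etc/docker/daemon.json, then run systemctl daemon-reload && systemctl restart docker and verify again with docker info | grep "Docker Root Dir".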

With that, the Kafka cluster was at least temporarily usable again, and the data could keep flowing. I then reallocated the servers, set up a new Kafka cluster, and migrated the test environment over to it at noon. It's in testing as we speak…

Did you learn something from this?

PS: The server operating system version is as follows.

[root@localhost ~]# cat /etc/redhat-release
CentOS Linux release 8.1.1911 (Core) 

The Docker versions used are as follows.

[root@localhost ~]# docker info
Client:
 Debug Mode: false

Server:
 Containers: 4
  Running: 3
  Paused: 0
  Stopped: 1
 Images: 33
 Server Version: 19.03.8
############ other output omitted ############

Finally, why did I ask the ops team to give the Kafka cluster servers larger disks in the first place?

Because the traffic in our production environment is relatively heavy, usually 50,000 to 80,000 QPS, and much higher at peak times. At the time, I was mirroring part of the production traffic into the test environment. If the Kafka cluster's disks are not large enough, then whenever consumer performance degrades, or messages pile up in Kafka for any other reason, Kafka can consume a huge amount of disk space. And once the disk fills up, the Kafka brokers crash and the servers go down.
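Besides bigger disks, it is also worth bounding how much data Kafka itself is allowed to keep. These broker settings (the values here are illustrative, not our production settings; tune them to your own traffic) cap retention per partition, so a lagging consumer cannot fill the disk indefinitely:

```properties
# server.properties — illustrative values only
log.retention.hours=72            # delete segments older than 3 days
log.retention.bytes=10737418240   # per-partition retention cap, ~10 GB
log.segment.bytes=1073741824      # roll segments at 1 GB so old data is deletable
```

Retention is enforced at segment granularity, which is why a smaller segment size lets the cleaner reclaim space sooner.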

OK, that's all for today. I'm Glacier. If you have any questions, leave a comment below, and let's talk tech and make it into a big company together ~~