Abstract: Remote sensing images, as selfies of the Earth, provide auxiliary information in more dimensions and from a broader perspective, helping people perceive conditions in fields such as natural resources, agriculture, forestry and water conservancy, and traffic and disaster response.

This article is shared from the Huawei Cloud community post "AI + Cloud Native: Putting Satellite Remote Sensing Through the Wringer", by TSJSDBD.

Is 1 + 1 > 2?

Remote sensing images, as selfies of the Earth, give people auxiliary information in more dimensions and from a broader perspective, helping them perceive conditions in fields such as natural resources, agriculture, forestry and water conservancy, and traffic and disaster response.

AI can already surpass humans in many fields, but its key advantage is automation: it saves time and effort. It can significantly improve the efficiency of remote sensing image interpretation, automatically detecting ground features such as buildings, rivers, roads, and crops, and providing a basis for decision-making in smart city development and governance.

Cloud native has been a hot topic in recent years. Its advantages of easy setup, repeatability, and freedom from host dependencies make it a natural fit for AI algorithms from almost any angle. That is why, across so many fields, most AI workloads run their inference algorithms in Docker containers.

With AI and cloud native each so capable, shouldn't combining them make high-performance AI interpretation tasks such as land cover classification, target extraction, and change detection easy? We thought so too, so we built our Sky and Earth platform on AI + Kubernetes (cloud native) to support AI processing of remote sensing imagery.

The ideal was beautiful, but the road there was like the Journey to the West: eighty-one tribulations to pass through before finally obtaining the scriptures.

Business Scenario Introduction

The business scenario in question is called pan fusion (pan-sharpening), which you can think of as a "multi-lens collaborative beautification" feature applied to the Earth's selfie. (In phone terms: several cameras shoot at the same time, and the shots are merged into one large, high-resolution color image.)

In short, the business is: read two images, generate one new image. We run this function inside a container, and each fused result image is about 5 GB.

The catch is that a single batch needs to process more than 3,000 satellite images, which means each batch needs to run more than 3,000 containers at the same time. Cloud native for the win!

Business Architecture Diagram

To aid understanding, here is the logical breakdown of this business scenario implemented on a cloud native architecture:

In the cloud, both the raw data and the result data have to live in object storage buckets, because at this data volume only object storage is affordable (object storage costs about 10 cents per GB, while file storage costs about 30 cents per GB).

Because the containers are independent of one another, each container only needs to process its own image. For example, container 1 handles 1.tif, container 2 handles 2.tif, and so on.

Therefore, the supervisor program only needs to submit the corresponding number of containers (3,000+) and monitor whether each one completes successfully (simplified here for illustration; the actual business scenario is a pipeline). With the requirements decomposed into this ideal cloud native shape, let's start stepping on the landmines. A rough sketch of the supervisor loop is shown below.
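This is only an illustrative sketch using the official kubernetes Python client; the namespace, container image (pan-fusion:latest), resource sizes, and scene IDs are hypothetical placeholders, not the platform's actual code.

```python
# Minimal sketch of the supervisor loop: one Job per satellite scene.
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() in-cluster
batch = client.BatchV1Api()

def fusion_job(scene_id: str) -> dict:
    """One Job per scene: the container reads its own inputs from the object
    bucket and writes one fused (~5 GB) result image back."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"pan-fusion-{scene_id}"},
        "spec": {
            "backoffLimit": 2,
            "template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "fusion",
                    "image": "pan-fusion:latest",          # hypothetical image
                    "args": [f"--scene={scene_id}.tif"],   # e.g. 1.tif, 2.tif ...
                    "resources": {"requests": {"cpu": "2", "memory": "2Gi"}},
                }],
            }},
        },
    }

for scene_id in ["0001", "0002"]:    # in reality, 3,000+ scenes per batch
    batch.create_namespaced_job(namespace="remote-sensing", body=fusion_job(scene_id))
```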

Note: the problems below are presented after being sorted out; in reality they appeared interleaved and tangled together.

The K8s Master goes down

Soon after the jobs were submitted, the system started reporting failures: calls to the K8s API were failing. The K8s Master had hung.

How we dealt with the K8s Master, summary version:

1. The Master hung because its CPU was maxed out;

2. Scale up the Master nodes (repeated N times);

3. Performance optimization: increase the number of compute nodes in the cluster;

4. Performance optimization: deliver containers in batches;

5. Performance optimization: when querying container progress, avoid the ListPod interface.

A detailed version:

Monitoring showed the Master's CPU was already maxed out, so the simplest, crudest idea was to scale the Master up. We scaled from 4U8G × 3, testing and failing all the way, up to 32U64G × 3, and the CPU was still pegged. Clearly, scaling up alone was not going to work.

With more than 3,000 containers submitted to K8s, a large number of Pods sat in the Pending state. The K8s Scheduler keeps polling each Pending Pod to check whether resources have freed up to place it, which puts heavy CPU pressure on the Scheduler. Adding more compute nodes to the cluster reduces the number of queued Pods.

Also, since the queue was so long, it made more sense to deliver the containers to K8s in batches rather than overwhelm it all at once. We started submitting tasks in batches, dropping the batch size from 1,000 to 500 and then to 100.

At the same time, when querying Pod progress, we avoided the ListPod interface and queried specific Pods directly, because a List call forces K8s to enumerate every Pod's information internally, which is also very expensive. Both mitigations are sketched below.
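A hedged sketch of those two mitigations with the kubernetes Python client; the namespace, batch size, and pause interval are illustrative values, not the ones used on the real platform.

```python
# Deliver Pods in small batches and poll each Pod by name instead of ListPod.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def deliver_in_batches(pod_bodies, batch_size=100, pause_s=30):
    """Create Pods in chunks so the API server / scheduler is not flooded at once."""
    for i in range(0, len(pod_bodies), batch_size):
        for body in pod_bodies[i:i + batch_size]:
            core.create_namespaced_pod(namespace="remote-sensing", body=body)
        time.sleep(pause_s)              # let the scheduler drain the queue

def pod_phase(name: str) -> str:
    """Query one Pod directly (a GET) rather than ListPod, which makes the
    API server enumerate every Pod in the namespace."""
    pod = core.read_namespaced_pod(name=name, namespace="remote-sensing")
    return pod.status.phase              # Pending / Running / Succeeded / Failed
```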

After this combination of fixes, the Master nodes finally stopped dying. But push one problem down and another pops up.

Containers die halfway through

Although the Master no longer failed, after delivering another batch or two of jobs, containers started failing again.

How we dealt with the dying containers, summary version:

1. The containers were killed by Eviction;

2. The Eviction was triggered by nodes reporting DiskPressure (their storage was full);

3. Expand the nodes' storage capacity;

4. Extend the tolerance time before eviction actively kills containers.

A detailed version:

(Note: the issues below are presented in order after troubleshooting and sorting; when things were actually going wrong, they were nowhere near this tidy.)

The script execution logs inside the containers could not be found.

When we queried Pod information, the Events showed that some containers had been killed by Eviction, and the reason given for the eviction was DiskPressure (that is, the node's storage was full).

When DiskPressure occurs, the node is marked for eviction, and the logic that actively evicts containers kicks in:

Once a node enters the eviction state, any container on it that has not finished running within 5 minutes is actively killed by Kubelet: K8s wants to get out of the eviction state as quickly as possible by killing containers to free up resources.

Each of our containers runs for roughly 1-2 hours, so containers should not be killed the moment eviction kicks in (killing a container halfway through its run wastes all that work). We want to wait as long as possible for the containers to finish, so the pod-eviction timeout tolerance should be raised to 24 hours (well above the average per-container execution time).
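As a side note, a small read-only check like the following (our own assumption, not part of the original tooling) can at least surface which nodes are already reporting DiskPressure before Kubelet's eviction logic starts killing containers:

```python
# List nodes and flag any that currently report the DiskPressure condition.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for node in core.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type == "DiskPressure" and cond.status == "True":
            print(f"{node.metadata.name}: DiskPressure since {cond.last_transition_time}")
```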

The direct cause of DiskPressure was insufficient local disk capacity, so node storage had to be expanded. There were two options: 1) attach cloud storage (EVS volumes) to the nodes; 2) use VMs that come with larger local storage.

The bandwidth of cloud storage (EVS) is too low, only 350 MB/s, while a single node may run more than 30 containers at the same time, so that bandwidth is nowhere near enough. We therefore chose i3-type VMs, which come with local storage: their 8 NVMe disks are assembled into a RAID 0 array, multiplying the bandwidth by 8.

Object storage writes fail

Container executions continued to fail.

The containers were failing to write to object storage. Summary version:

1. Write to a local directory first, then cp the result to the object bucket;

2. Switch from a normal object bucket to a parallel file bucket that supports file semantics.

A detailed version:

When the script generated new images, it failed to write them to storage:

Our cluster has about 500 cores in total, so roughly 250 containers (2U2G each) run at the same time, all appending to the same object bucket concurrently. That concurrency turned out to be the source of the I/O problem.

The s3fs object storage mount is itself poorly suited to appending to large files: it operates on whole files, so appending even 1 byte causes the entire file to be rewritten.

So we generate the target image file locally and copy it to object storage only at the end of the script, effectively adding a local staging area. A sketch of this pattern follows.
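This is a minimal sketch of the "write locally, upload once" pattern, using boto3 as a generic S3-compatible client (the real platform writes to OBS through s3fs / parallel file buckets); the endpoint, bucket, and file paths are placeholders.

```python
# Write the ~5 GB result to fast local disk while the algorithm runs, then ship
# it to object storage in one sequential upload instead of many small appends.
import boto3

def upload_result(local_path: str, bucket: str, key: str) -> None:
    s3 = boto3.client(
        "s3",
        endpoint_url="https://obs.example-region.example.com",  # placeholder endpoint
    )
    s3.upload_file(local_path, bucket, key)   # multipart upload handled automatically

# upload_result("/local-ssd/out/scene_0001_fused.tif", "fusion-results", "scene_0001_fused.tif")
```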

For the staging storage we tried both options: 1) block storage (EVS), whose 350 MB/s bandwidth was too low and slowed the whole job down; 2) VMs with local storage, where multiple local disks form a RAID array and the bandwidth is extremely high.

Huawei Cloud also offers an extension of the object storage protocol that supports POSIX append semantics, called parallel file buckets, so we converted the normal object buckets to parallel file buckets to support large-scale concurrent appends.

The K8s compute nodes go down

So we kept running. But containers failed yet again.

This time the compute nodes themselves were failing. After troubleshooting and sorting things out, the summary version:

1. The compute nodes were marked down because they had not reported a K8s heartbeat for a long time;

2. There was no heartbeat because Kubelet (the K8s agent on each node) was struggling;

3. Kubelet was being starved of resources by the containers;

4. To protect Kubelet, set Limits on all containers.

Detailed version, starting from the pile of strange symptoms:

  • Containers failed to start, reporting timeout errors.

  • Then there were PVC shared storage mount failures:

  • Or containers that would not terminate properly (and could not be deleted).

  • Querying the nodes' Kubelet logs showed all kinds of timeout errors:

With so many low-level container operations timing out, it initially felt like the Docker daemon was hanging, so we tried the classic fix of restarting the Docker service.

Meanwhile, the K8s cluster showed many compute nodes as Unavailable (the nodes looked dead).

Kubelet had not sent a heartbeat to the Master for a long time, so the Master considered those nodes down. In other words, not only the Docker daemon but also Kubelet itself was affected.

With host processes such as Kubelet and Docker unable to work properly, we come to Kubernetes' concepts of Request and Limit for scheduling containers.

Request is what K8s uses to schedule a container onto a node with spare capacity; Limit is passed down to Docker to cap the container's resources (exceeding the memory limit gets the container OOM-killed). To avoid having jobs killed, we had set only Request and no Limit, which meant each container could actually exceed the resources it requested and grab extra host resources. With a large number of containers running concurrently, the host's own processes were starved.

Not killing jobs is friendly to users, but it is no good if the platform itself cannot survive. So we set Limits to force user processes to stay within their resource budget, killing them if they exceed it. This guarantees that host processes (Docker, Kubelet, and so on) always have enough resources to run. A sketch of this configuration follows.
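An illustrative sketch with the kubernetes Python client: every worker container gets both a request (used for scheduling) and a limit (a hard cap enforced on the node). The 2U2G figures mirror the container size quoted earlier; the image name is a placeholder.

```python
# Give each worker container both requests and limits so runaway containers
# cannot starve kubelet / dockerd on the node.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "2", "memory": "2Gi"},
    limits={"cpu": "2", "memory": "2Gi"},     # exceeding memory => OOM-killed
)

container = client.V1Container(
    name="fusion",
    image="pan-fusion:latest",                # hypothetical image
    resources=resources,
)
```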

K8s compute nodes, down again

So we kept running tasks. And many jobs failed yet again (and again).

The nodes had failed again. Summary version:

1. Log analysis showed this hang was caused by PLEG (Pod Lifecycle Event Generator) being unhealthy;

2. PLEG was unhealthy because the node had accumulated too many historical containers (>500), making the relist query take too long;

3. Clean up finished containers promptly (even finished containers still consume node storage);

4. The container interfaces still timed out in various ways (CPU and memory are protected by Limits, but I/O can still be preempted);

5. Improve system disk I/O performance so that Docker container interfaces (such as List) stop timing out.

A detailed version:

Kubelet: PLEG is not healthy:

Searching the Kubelet logs for PLEG turned up quite a few errors:

This error occurs when Kubelet times out while listing all containers on the current node, including ones that have already finished running. Looking at the code:

Github.com/kubernetes/…

The timeout Kubelet uses here is hardcoded to 3 minutes, so the more Pods there are, the more likely the timeout is to trigger. Many reports show that PLEG problems tend to appear once the cumulative number of containers on a node exceeds about 500. It would be nice if K8s were more flexible here and adjusted this timeout dynamically.

The mitigation is to clean up finished containers promptly. However, cleaning up a finished container also removes its records and logs, so something must compensate for that (for example, a log collection system); a cleanup sketch follows.
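A hedged sketch of such a cleanup pass with the kubernetes Python client; the namespace is a placeholder, and in practice this would run periodically and only after logs have been shipped off-node.

```python
# Delete Pods that have already reached Succeeded or Failed, so the node's
# container list stays short enough for Kubelet's relist to finish in time.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def reap_finished_pods(namespace: str = "remote-sensing") -> None:
    for pod in core.list_namespaced_pod(namespace=namespace).items:
        if pod.status.phase in ("Succeeded", "Failed"):
            core.delete_namespaced_pod(name=pod.metadata.name, namespace=namespace)
```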

Besides the sheer number of containers, slow I/O can also make the list-all-containers interface time out.

When a large number of containers run concurrently, the cloud disk's write bandwidth gets saturated:

The impact on storage pools is also significant:

The resulting poor I/O performance also contributes to list-container timeouts, and hence to PLEG errors.

To address this, use VMs with high-speed local disks wherever possible, and combine multiple data disks into RAID arrays to raise read/write bandwidth.

These VMs then serve as K8s nodes, and containers on them read and write the local disks directly, with good I/O performance. (The nodes are used much like a big data cluster that depends heavily on local shuffle.)

With these measures in place, the subsequent batches of jobs completed smoothly.

Conclusion: the road to "AI + cloud native"

That cloud native is the trend has become a consensus, and every field has begun trying to build its business on a cloud native base. AI is the future, and an unstoppable force of the present. Yet when AI walks this cloud native road, the going is not entirely smooth; at the very least, Huawei Cloud's cloud native base (including storage, networking, and the surrounding infrastructure) still has room to improve.

That said, there is no need to worry too much: after years of AI + cloud native accumulation, Huawei Cloud's Sky and Earth platform can now stably process petabytes of remote sensing imagery per day, supporting space-based, airborne, and ground-based scenarios alike, and it remains a clear leader in this field. There were twists and turns along the way, but every difficulty is the fire of nirvana, and every difficulty overcome is a promise to future customers. We can say it plainly: AI + cloud native = genuinely sweet.

The purpose of this article is not to dwell on the difficulties but to summarize and share them with peers in the field, so as to accelerate the development of remote sensing and jointly push AI + cloud native into real-world practice.
