Author: Peng Nanguang (Guangnan)

This article is compiled from a talk delivered by Peng Nanguang, senior R&D engineer at Alibaba Cloud, at KubeCon China 2021. It shares how Alibaba has met the stability challenges of large-scale clusters with its self-developed universal link detection and directional inspection tool, KubeProbe. All of the Alibaba Cloud Native team's content from this KubeCon is collected in the e-book "Cloud Native and New Possibilities of the Cloud Future"; a download link is provided at the end of the article.

The ability to quickly discover and locate problems is the cornerstone of rapid system recovery. Only after a problem has been found and located can we talk about how to fix it and minimize user impact. So how do we find and locate problems before users do, in complex, large-scale scenarios? I will share some of our experience and practice in managing large Kubernetes clusters and quickly finding and locating problems: how we met the stability challenges of large clusters by developing our own universal link detection and directional inspection tool, KubeProbe.

Link detection: simulate generalized user behavior and detect whether links and systems are abnormal

Directional inspection: check abnormal indicators in clusters to find existing or potential risks

System enhancement: find problems faster and more efficiently, and analyze root causes

After finding problems: post-checks and self-healing, chat-ops

Business background and challenges

The Container Service team of Alibaba Cloud's cloud-native application platform owns products such as ACK and ASI and manages large-scale Kubernetes clusters. It not only provides Kubernetes services to external public cloud users, but also supports Alibaba Group's move to the cloud and the comprehensive containerization of Alibaba applications.

Currently, the entire Alibaba business runs on Kubernetes clusters and has been made cloud native and containerized: Tmall, Taobao, AutoNavi, Kaola, Ele.me, and so on. Container Service also serves as the control base of Alibaba Cloud, and many cloud services run on these clusters as well, such as Video Cloud, DataWorks, the MSE microservice engine, and MQ message queues. We are responsible for the stability of this infrastructure.

Cloud-native architecture is becoming more and more popular, and more and more products and applications are adopting it. The figure roughly shows the modern architecture of a cloud-native application: it is born on the cloud and grows on the cloud, with layered services provided at every level. This layering lets the business and the application focus on the business layer, shielding them from the complex concepts of the platform and infrastructure layers.

From the perspective of stability, when the application architecture is layered, the stability of upper-layer applications begins to depend on the underlying infrastructure. In addition, a unified base not only enables large-scale resource scheduling optimization and the co-location of online and offline workloads, but also poses great challenges to the infrastructure team in keeping large clusters stable.

A Kubernetes cluster is very complex: a single cluster has dozens or even hundreds of components on its links, let alone large-scale multi-cluster management. But the business developers running on top do not perceive this complexity, because we have packaged it away and left users with a simple, unified interface. An application like Taobao, for example, is actually extremely complex, but to the user it is just submitting an order; behind that button lies extremely complex machinery. Why do we do this? Because we keep the complexity to ourselves and give simplicity to the user.

Good application developers are not necessarily infrastructure experts. Cloud native lets the business focus on the business and the infrastructure focus on the infrastructure. Much of the time, business teams can only care about the stability of the business itself; most of the time they are not able to, or do not want to spend the manpower to, care about the stability of the infrastructure and platform layers. Therefore, for platform-layer and infrastructure stability, we need to keep the complexity to ourselves, leave simplicity to the users, and provide them with stable platform-layer services. At the same time, we care more about global stability and global availability than about the availability of any single point.

Container Service is the base of Alibaba Group's businesses and of Alibaba Cloud's control plane and cloud services. On top of it run all kinds of workloads: e-commerce, middleware, second-party businesses, search, Alibaba Cloud services, and so on. In addition, there are hundreds of self-developed and open-source components, hundreds of thousands of component changes per year, thousands of clusters, hundreds of thousands of nodes, and single clusters whose node count has exceeded ten thousand. The business architecture is also complicated, including single-tenant clusters, multi-tenant clusters, VC clusters, federated clusters, and so on, as well as online/offline co-location, unified scheduling, and big-promotion activities. At runtime there are runC, runD, and more.

Therefore, complex components, frequent changes, diverse user scenarios, huge clusters, complex business architectures... all of these pose challenges to our business:

Challenge 1: How to reduce systemic risk. The scenarios are complex and the business forms are diverse; any overlooked detail that seems insignificant, or any carelessly handled link, may let the damage spread;

Challenge 2: How to be responsible for the stability of users' clusters. Finding and locating problems before users do has become the top priority of Container Service's production stability work, and also the cornerstone of the global high-availability system.

The systems are so complex that any minor omission or careless handling can lead to unexpected damage. How do we reduce systemic risk? How do we take responsibility for the global stability of heterogeneous user clusters? How do we find and locate existing or upcoming problems in these clusters before users do? This is the key to cluster stability work, and also the cornerstone of the Kubernetes global high-availability system.

Thinking and planning

Based on these challenges, we did some thinking and made some presuppositions. The following figure shows an extremely simplified link for a user release or scale-out. Although extremely simplified, we can still see that the link is relatively complex.

To make sure this user scale-out/release link works, we first make a few presuppositions:

Presupposition 1: There are many complex components on the link, and each component is upgraded and iterated on independently, so data monitoring cannot cover every scenario without blind spots;

Presupposition 2: Even if the monitoring data of every component and node on the link is normal, that does not guarantee that the cluster link is 100% available; only link detection can determine that it is available.

Presupposition 3: For proving that a cluster is unavailable, proof by contradiction beats direct proof. Even if 100% of the monitoring data is normal, a failed release proves that the link is broken. In addition, we also need to pay attention to multi-cluster management. The following are some instability factors in multi-cluster management; as you can see, the complexity of stability management is amplified in the multi-cluster scenario.

Presupposition 4: In large-scale cluster scenarios, data consistency problems become more pronounced and may cause serious failures, becoming a significant destabilizing factor;

Presupposition 5: The monitoring and alarm links inside a cluster depend on the cluster itself; if the cluster fails, monitoring and alarming may fail along with it.

Here are some of our solutions based on the above assumptions.

Exploration and solutions

1. Link detection

Link detection simulates user behavior in a broad sense to detect whether links are clear and flows are normal.

To discover system problems before users do, we must first become users of the system ourselves: the users who use it the most, understand it the best, and constantly use it and perceive its state.

Link detection means simulating the behavior of users in a broad sense to probe the various objects waiting to be detected on the cluster's component links. Note that the "user" here does not only mean people who use the system in the narrow sense; it means users in a broader sense, which can be understood and extended to dependent downstream systems.

In addition, while implementing full-link detection, we also need to break the link apart and implement short-link detection within the full link, as a supplement to full-link detection.
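
As a concrete illustration, the following is a minimal link-detection sketch in Go (not KubeProbe's actual code; the namespace and naming are hypothetical). It acts as a generalized user of the APIServer by creating a ConfigMap, reading it back, and deleting it, which exercises the control-plane write path end to end. A probe like this can be packaged as an image and triggered by the platform described later.

```go
// probe_configmap.go: a minimal link-detection sketch (not KubeProbe's actual code).
// It exercises the APIServer write path end to end by creating, reading back, and
// deleting a ConfigMap, and reports how long the round trip took.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the probe runs as a Pod inside the target cluster and that a
	// namespace named "kubeprobe" exists (hypothetical).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build clientset: %v", err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	name := fmt.Sprintf("kubeprobe-link-%d", time.Now().Unix())
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "kubeprobe"},
		Data:       map[string]string{"probe": "link-detection"},
	}

	start := time.Now()
	if _, err := client.CoreV1().ConfigMaps("kubeprobe").Create(ctx, cm, metav1.CreateOptions{}); err != nil {
		log.Fatalf("probe failed at create: %v", err)
	}
	if _, err := client.CoreV1().ConfigMaps("kubeprobe").Get(ctx, name, metav1.GetOptions{}); err != nil {
		log.Fatalf("probe failed at read-back: %v", err)
	}
	if err := client.CoreV1().ConfigMaps("kubeprobe").Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		log.Fatalf("probe failed at delete: %v", err)
	}
	log.Printf("link probe succeeded in %s", time.Since(start))
}
```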

2. Directional inspection

Directional inspection means checking and analyzing abnormal indicators in large-scale clusters to find existing or potential risks, much like routine pipe inspection.

For example, we have many clusters divided into many cluster groups. Across different cluster groups: is the etcd cold/hot backup fully configured, is the risk-control rate-limiting configuration normal, is the webhook version correct, are the co-location parameters consistent, is any certificate about to expire, and so on. There may be differences between cluster groups, but clusters of the same type should be consistent with one another, so we can run targeted inspections against them.

Here are some common scenarios for link detection:

Think of a game designer: if he never plays the game he made, how will he find problems with the game mechanics and make it better? If we want to discover system problems before users do, we must first become users of the system, and be the users who use it the most, understand it most deeply, and constantly use it and perceive its state.

In other words, link detection means becoming a user of our own system and simulating the behavior of "users" in the broad sense, probing the various objects waiting to be detected across clusters, components, and links.

It is important to note again that the "user" here does not just mean people using the system in the narrow sense, but users in a broader sense, or by extension, dependent downstream systems.

For example, when business developers want to release their services, the request goes through the Git system, then the release system, and then our underlying infrastructure platform, ASI. That is a full-link detection scenario: the business developer is the user, and the detection object can be the full link. However, if we regard etcd as a system service, then APIServer is its user in the broad sense, and it makes sense to simulate APIServer's requests to etcd and probe that link.

Other examples follow the same idea: MSE operating ZooKeeper, external users creating ACK clusters through the Alibaba Cloud console, a PaaS platform operating federated clusters, or even a Video Cloud business party initiating a transcoding task.

It is also important to note that although full-link detection looks attractive, the full link is often very long, and when it fails it can be hard to tell where the problem is. Therefore, while implementing full-link detection, we also need to break the link apart and implement short-link detection within the full link, as a supplement to full-link detection.
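
A short-link probe isolates a single segment of that full link. The sketch below, for example, checks only the APIServer-to-etcd segment; it assumes a stock kube-apiserver that exposes per-check health endpoints such as /readyz/etcd, which is not necessarily how KubeProbe implements it.

```go
// shortlink_probe.go: a sketch of a short-link probe that checks just the
// APIServer -> etcd segment instead of the whole release link. It assumes a
// stock kube-apiserver that exposes the per-check health endpoint /readyz/etcd.
package main

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build clientset: %v", err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// A failure here points at the APIServer<->etcd segment specifically,
	// which is much easier to act on than "the full release link is broken".
	body, err := client.Discovery().RESTClient().
		Get().
		AbsPath("/readyz/etcd").
		DoRaw(ctx)
	if err != nil {
		log.Fatalf("short-link probe failed: apiserver->etcd unhealthy: %v", err)
	}
	log.Printf("apiserver->etcd healthy: %s", string(body))
}
```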

The following figure shows a directional inspection scenario. While link detection focuses on link availability, directional inspection focuses on the risks hidden in cluster data. In large-scale cluster scenarios, data consistency is a very hard problem.

Directional inspection checks all the data and indicators in a cluster or link against known failure causes, to find inconsistencies or data deviations and defuse potential risks in advance.

For example, in a group of clusters of the same type: cluster A's certificate validity is found to be less than three years while the other clusters' certificates are valid for three years; cluster B's webhook version may be v2 while the other clusters are on v3; cluster C's risk-control rate limiting is not configured to throttle Pod eviction while the other clusters are, which is clearly not as expected; cluster D's etcd cold/hot backup is not configured or not running properly. All of these we can find out in advance.
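
To make the idea concrete, here is a simplified directional-inspection sketch; the field names, thresholds, and majority-vote rule are illustrative assumptions, not KubeProbe's real checks. It flags certificates that are close to expiry and clusters whose webhook version deviates from the majority of their cluster group.

```go
// inspection_sketch.go: a simplified directional-inspection sketch (illustrative only).
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"time"
)

type ClusterFinding struct {
	Name           string
	APIServerCert  []byte // PEM-encoded certificate collected from the cluster
	WebhookVersion string
}

// certExpiringSoon parses a PEM certificate and reports whether it expires
// within the given window.
func certExpiringSoon(pemBytes []byte, window time.Duration) (bool, time.Time, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return false, time.Time{}, fmt.Errorf("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, time.Time{}, err
	}
	return time.Until(cert.NotAfter) < window, cert.NotAfter, nil
}

// inspectGroup flags clusters that deviate from the majority webhook version
// and clusters whose certificates are close to expiry.
func inspectGroup(findings []ClusterFinding, certWindow time.Duration) []string {
	var risks []string

	// Majority vote on webhook version: clusters of the same type should be consistent.
	count := map[string]int{}
	for _, f := range findings {
		count[f.WebhookVersion]++
	}
	majority := ""
	for v, n := range count {
		if n > count[majority] {
			majority = v
		}
	}

	for _, f := range findings {
		if f.WebhookVersion != majority {
			risks = append(risks, fmt.Sprintf("%s: webhook version %s differs from group majority %s",
				f.Name, f.WebhookVersion, majority))
		}
		if len(f.APIServerCert) > 0 {
			if soon, notAfter, err := certExpiringSoon(f.APIServerCert, certWindow); err == nil && soon {
				risks = append(risks, fmt.Sprintf("%s: certificate expires at %s",
					f.Name, notAfter.Format(time.RFC3339)))
			}
		}
	}
	return risks
}

func main() {
	findings := []ClusterFinding{
		{Name: "cluster-a", WebhookVersion: "v3"},
		{Name: "cluster-b", WebhookVersion: "v2"}, // deviates from the group
		{Name: "cluster-c", WebhookVersion: "v3"},
	}
	for _, r := range inspectGroup(findings, 90*24*time.Hour) {
		fmt.Println("RISK:", r)
	}
}
```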

System implementation

Based on the background presuppositions and solutions above, we designed and implemented an inspection and detection platform called KubeProbe (not open source, and unrelated to any similarly named project in the community).

Early on we also considered the community project Kuberhealthy, made some code contributions to it, and fixed some serious bugs. But in the end we chose to build our own, because its functionality was not sufficient for our scenario.

The figure above shows the central architecture. We have a central control system. Users' probe cases are packaged as images in a unified repository, use our common SDK library, and implement custom inspection and detection logic. On the central management system we configure the relationship between clusters and cases, such as which cluster groups a case should run in, and do various runtime configurations. We support periodic triggering, manual triggering, and event triggering (for example, on a release). When a case is triggered, a Pod is created in the target cluster to run the custom inspection or detection logic; the Pod reports success or failure to the central end through a callback or a message queue, and the central end is responsible for alarming and for cleaning up case resources.
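
The sketch below illustrates only the reporting step of that flow: a probe case posting its result back to the central end over an HTTP callback. The environment variables, payload fields, and callback endpoint are hypothetical; the real platform uses its own SDK and can also report through a message queue.

```go
// report_sketch.go: a sketch of how a probe case might report its result back
// to the KubeProbe central end (hypothetical callback URL and payload fields).
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type ProbeResult struct {
	Case     string    `json:"case"`
	Cluster  string    `json:"cluster"`
	Success  bool      `json:"success"`
	Message  string    `json:"message"`
	Finished time.Time `json:"finished"`
}

// reportToCenter posts the result to the central end so it can raise or clear
// alarms and clean up the case's resources.
func reportToCenter(callbackURL string, result ProbeResult) error {
	payload, err := json.Marshal(result)
	if err != nil {
		return err
	}
	resp, err := http.Post(callbackURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("central end rejected report: %s", resp.Status)
	}
	return nil
}

func main() {
	// In this sketch the central end injects its callback address and the case
	// identity into the probe Pod via environment variables (hypothetical names).
	callback := os.Getenv("KUBEPROBE_CALLBACK_URL")
	result := ProbeResult{
		Case:     os.Getenv("KUBEPROBE_CASE_NAME"),
		Cluster:  os.Getenv("KUBEPROBE_CLUSTER"),
		Success:  true, // replace with the outcome of the custom probe logic
		Message:  "link probe passed",
		Finished: time.Now(),
	}
	if err := reportToCenter(callback, result); err != nil {
		fmt.Fprintln(os.Stderr, "failed to report result:", err)
		os.Exit(1)
	}
}
```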

For example, when Kubelet is released in batches on our component operation and maintenance platform, each batch triggers a link detection case for the relevant clusters as a post-check. If the post-check of a release fails, we block the user's ongoing release to prevent further damage, and immediately raise an alarm to notify the people involved to check whether the new component version behaves as expected.
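
The gating logic can be pictured roughly as follows; RunCase, BlockRelease, and Alarm are hypothetical placeholders for the platform's real integrations with the component operation and maintenance platform.

```go
// postcheck_gate.go: a sketch of the release post-check idea. After each batch
// of a component release, run the relevant link-detection case; if it fails,
// block the release and alarm.
package main

import (
	"errors"
	"fmt"
	"log"
)

// RunCase would trigger a KubeProbe case against the given cluster and wait
// for its result (placeholder).
func RunCase(cluster, caseName string) (bool, error) {
	// ... trigger the case via the central end and wait for its callback ...
	return true, nil
}

// BlockRelease would pause the ongoing release on the component ops platform (placeholder).
func BlockRelease(release, reason string) { log.Printf("BLOCK %s: %s", release, reason) }

// Alarm would notify the on-call owners (placeholder).
func Alarm(msg string) { log.Printf("ALARM: %s", msg) }

// postCheckBatch gates one batch of a release on the post-check result.
func postCheckBatch(release, cluster string) error {
	ok, err := RunCase(cluster, "kubelet-release-postcheck")
	if err != nil || !ok {
		reason := fmt.Sprintf("post-check failed on %s: ok=%v err=%v", cluster, ok, err)
		BlockRelease(release, reason) // stop the blast radius from growing
		Alarm(reason)                 // ask the owners to verify the new component version
		return errors.New(reason)
	}
	return nil
}

func main() {
	if err := postCheckBatch("kubelet-upgrade-batch-3", "cluster-a"); err != nil {
		log.Fatal(err)
	}
	log.Println("batch passed post-check, release may continue")
}
```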

We also support third-party event callbacks, which makes it faster to integrate with third-party systems.

In addition, for high-frequency, short-period probe cases that need to run 7*24, we implemented another resident, distributed architecture: a ProbeOperator in each cluster watches for changes to a ProbeConfig CRD, and the probe logic runs over and over inside a resident probe Pod. This architecture reuses the additional capabilities provided by KubeProbe's central end, such as alarming, root cause analysis, and release blocking, and follows the standard Operator cloud-native design. The resident model greatly increases detection frequency (because it eliminates the overhead of creating inspection Pods and cleaning up data), basically achieves seamless 7*24 coverage of the cluster, and is easy to integrate with externally.
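
The resident mode can be sketched as a long-lived loop inside the probe Pod. In the real system a ProbeOperator watches the ProbeConfig CRD and manages these Pods; the ProbeSpec struct and period below are only stand-ins for the CRD fields.

```go
// resident_probe.go: a sketch of the resident (distributed) mode, with a
// hypothetical ProbeSpec in place of the ProbeConfig CRD.
package main

import (
	"context"
	"log"
	"time"
)

// ProbeSpec stands in for the relevant fields of the ProbeConfig CRD (hypothetical).
type ProbeSpec struct {
	Name   string
	Period time.Duration // high frequency is affordable because no Pod is created per run
}

// runOnce holds the custom probe logic; here it is just a placeholder.
func runOnce(ctx context.Context, spec ProbeSpec) error {
	// ... e.g. the ConfigMap round-trip probe sketched earlier ...
	return nil
}

// residentLoop runs the probe over and over inside a long-lived probe Pod,
// instead of paying Pod creation/cleanup cost on every run.
func residentLoop(ctx context.Context, spec ProbeSpec) {
	ticker := time.NewTicker(spec.Period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := runOnce(ctx, spec); err != nil {
				log.Printf("probe %s failed: %v", spec.Name, err) // would report to the central end
				continue
			}
			log.Printf("probe %s ok", spec.Name)
		}
	}
}

func main() {
	residentLoop(context.Background(), ProbeSpec{Name: "apiserver-etcd-shortlink", Period: 5 * time.Second})
}
```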

Another important point: the platform only provides platform-level capabilities. Whether it really works depends on whether the cases built on it are rich, and on whether it is easy for more people to write inspection and detection cases. Just as with a test platform, the test cases matter more than the platform itself. Some general workload detection and component detection can find many problems on the control links, but many more problems, even problems at the business layer, depend on the joint efforts of the infrastructure and business-layer teams.

In our practice, testing and business colleagues have contributed many inspection cases: test colleagues contributed full-link creation/deletion detection for ACK & ASK, full-link scale-out cases for canary businesses, Local Life colleagues contributed PaaS platform application inspections, and so on. These have brought considerable stability results and benefits. At present we maintain dozens of inspection/detection cases; the number of inspection/detection runs is approaching 30 million a year and may exceed 100 million next year. More than 99% of cluster control-plane problems and hidden risks can now be discovered in advance, which is a very good result.

After problem discovery: root cause analysis and incident handling

Let's talk about what happens after a problem is discovered, using a medical consultation as an analogy. The patient notices "Oh, I don't feel well!" That is discovering the problem. The doctor consults various lab tests, aggregates and analyzes the information, makes inferences, and tells the patient: "You haven't slept for 24 hours; you can't sleep because you are very anxious; and the root cause of your anxiety is the final exam the day after tomorrow." That is locating the root cause. Then, to address the root cause, he tells the patient: "Don't worry, I just heard that primary school students no longer have to take final exams." This whole process has to be fast!

The content of an alarm from link detection is often messy, which makes it different from a data monitoring alarm. As mentioned above, a link detection alarm is likely to be just a sentence like the patient's "I don't feel well"; as the doctor, you have to judge why he is uncomfortable and what the root cause is. Data monitoring, by contrast, often directly states the cause, such as "etcd OOM"; with link detection alarms, existing on-call experience alone may not get the best results.

In addition, rapid location and root cause analysis is a tree-like search, a process of applying experience to judge and infer: how to deduce the root cause from a messy symptom. The core of it is logic.

This is different from a health check-up, which simply lists tests 1, 2, 3, 4, 5... and hands you a pile of numbers. Most of the time, even with a physical examination center, we still need a professional doctor in the hospital to interpret the results and judge the condition for us, right?

At the same time, the key to root cause analysis and problem self-healing lies in sinking expert experience into the system. The biggest benefit of doing so is that the experience becomes reusable and exportable. Think about it: if we encode the ability of the most professional doctor into the system, is it not far more convenient for him to analyze everyone's condition?

This is the whole process after KubeProbe finds a problem. We first built a centralized root cause analysis system. It aggregates all the information related to a failure, including events, logs, changes, alarms, component upgrades, and so on, analyzes the aggregate, correlates the events, and finally uses a tree-like analysis system to preliminarily locate the cause of a probe failure, for example an APIServer timeout or an etcd disconnection.
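
The tree-like analysis can be pictured roughly like this; the signals, rules, and confidence numbers below are purely illustrative, not KubeProbe's real rule set.

```go
// rootcause_tree.go: a sketch of the tree-like root cause analysis idea.
package main

import "fmt"

// Signals is the aggregated context for one failed probe: events, logs,
// changes, alarms, component upgrades, reduced to the fields the rules need.
type Signals struct {
	APIServerTimeout   bool
	EtcdDisconnected   bool
	RecentRelease      string // empty if no release overlaps the failure window
	APIServerRecovered bool
}

// Cause is one candidate root cause with a confidence score.
type Cause struct {
	Reason     string
	Confidence float64
}

// Node is one decision node in the analysis tree.
type Node struct {
	Match    func(Signals) bool
	Cause    Cause
	Children []*Node // more specific causes refine a matching parent
}

// analyze walks the tree depth-first and collects every matching cause.
func analyze(n *Node, s Signals, out *[]Cause) {
	if !n.Match(s) {
		return
	}
	*out = append(*out, n.Cause)
	for _, c := range n.Children {
		analyze(c, s, out)
	}
}

func main() {
	tree := &Node{
		Match: func(s Signals) bool { return s.APIServerTimeout },
		Cause: Cause{"APIServer timeout", 0.5},
		Children: []*Node{
			{
				Match: func(s Signals) bool { return s.EtcdDisconnected },
				Cause: Cause{"etcd disconnected behind APIServer", 0.8},
			},
			{
				Match: func(s Signals) bool { return s.RecentRelease != "" },
				Cause: Cause{"recent component release impacted APIServer", 0.5},
			},
			{
				Match: func(s Signals) bool { return s.APIServerRecovered },
				Cause: Cause{"transient network jitter, already recovered", 0.9},
			},
		},
	}

	var causes []Cause
	analyze(tree, Signals{APIServerTimeout: true, APIServerRecovered: true}, &causes)
	for _, c := range causes {
		fmt.Printf("candidate: %-45s confidence %.0f%%\n", c.Reason, c.Confidence*100)
	}
}
```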

In addition, I would like to add that text association is also a good root cause analysis method: we can use machine learning to train text recognition and associate a failure case with the root cause most strongly correlated with it. We have only touched the edge of this AIOps work and are still exploring it, but I believe it must be one of the directions for the future.

The whole process of KubeProbe root cause analysis and post-handling

The lower left corner of the figure above shows one of our failure alarms. After it is detected, it first goes through the central root cause analysis system for correlation. The single most likely cause might be that the APIServer was disconnected but has already recovered, so it was probably just occasional network jitter that we do not need to pay special attention to for now; at that moment the confidence level shown is 90%.

Other possible causes are also correlated. For example, this probe coincides with the release of a component whose publisher is XXX; we can see that this release had some impact on the APIServer, for example list-watches that repeatedly did not meet expectations, after which the APIServer's list-watch had problems. The confidence level there is 50%.

Once we have a preliminary cause, we enter the secondary confirmation system to confirm it. For example, if we judge that the cause may be an APIServer timeout, an etcd disconnection, a node timeout, and so on, we automatically call the APIServer interface again to see whether it still times out or has recovered. If it has recovered, we only raise a normal alarm, telling the user that things are fine now but still need attention. If it has not recovered, it is serious, it gets the highest priority, and it goes straight to a phone alarm.
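
A minimal sketch of that secondary confirmation for the "APIServer timeout" cause might look like this, assuming the probe can reach the APIServer's /readyz endpoint; the alarm functions are placeholders for the real notification channels.

```go
// secondary_confirm.go: a sketch of the secondary-confirmation step: re-probe
// the APIServer; downgrade to a normal alarm if it has recovered, escalate to
// a phone alarm if it has not.
package main

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func normalAlarm(msg string) { log.Printf("NOTICE: %s", msg) }
func phoneAlarm(msg string)  { log.Printf("PHONE ALARM (highest priority): %s", msg) }

// apiserverHealthy re-pulls a lightweight APIServer endpoint with a short timeout.
func apiserverHealthy(client kubernetes.Interface) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := client.Discovery().RESTClient().Get().AbsPath("/readyz").DoRaw(ctx)
	return err == nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build clientset: %v", err)
	}

	if apiserverHealthy(client) {
		normalAlarm("probe failed earlier, but APIServer has recovered; keep an eye on it")
		return
	}
	phoneAlarm("probe failed and APIServer is still timing out")
}
```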

The same idea applies elsewhere: if the system cannot locate a problem and the problem persists, a high-level alarm is triggered, and the corresponding root cause identification logic is then added to the analysis tree.

Too many alarms are as bad as no alarms; I hate alarm floods. From experience, once we built the root cause analysis + secondary confirmation + post-check system, our on-call cost dropped by more than 90% and keeps dropping; the final state could be described as unattended. You can try similar work too: the investment is small and the payoff is big. Since building these systems, we can proudly say that we have handled every single alarm (yes, every alarm, across thousands of clusters and tens of millions of probes) with minimal effort, and nothing has been missed.

And finally, a little dessert for the on-call folks: chat-ops.

Chat-ops System based on NLP Semantic Recognition

Using the NLP robot provided by DingTalk, we built a fairly complete chat-ops system, so that our on-call people can conveniently operate KubeProbe functions in the alarm group just by chatting, for example: re-run a failed probe, query cluster status, pull diagnostic information, query probe logs, and silence a cluster's alarms.
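
Conceptually, once DingTalk's NLP robot has recognized the intent, the chat-ops side only needs to map it to a KubeProbe action; the dispatcher below is a purely hypothetical sketch of that mapping, not the real integration.

```go
// chatops_sketch.go: a sketch of the chat-ops idea. The real system relies on
// DingTalk's NLP robot for semantic recognition; here a recognized intent is
// simply a string, and the KubeProbe actions are placeholders.
package main

import (
	"fmt"
	"strings"
)

// dispatch maps a recognized intent in the alarm group to a KubeProbe action.
func dispatch(intent, cluster string) string {
	switch {
	case strings.Contains(intent, "rerun"):
		return fmt.Sprintf("re-running the failed probe case on %s", cluster)
	case strings.Contains(intent, "status"):
		return fmt.Sprintf("querying current probe status of %s", cluster)
	case strings.Contains(intent, "diagnose"):
		return fmt.Sprintf("pulling diagnostic information for %s", cluster)
	case strings.Contains(intent, "logs"):
		return fmt.Sprintf("fetching probe logs for %s", cluster)
	case strings.Contains(intent, "silence"):
		return fmt.Sprintf("silencing alarms for %s", cluster)
	default:
		return "sorry, I did not understand; try rerun / status / diagnose / logs / silence"
	}
}

func main() {
	// e.g. the on-call engineer types "please rerun the case" in the alarm group
	fmt.Println(dispatch("please rerun the case", "cluster-a"))
}
```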

The picture above shows us operating the chat-ops system; the process is very convenient.

For example, at night, when I am already in bed, it sends me an alarm: a cluster that failed earlier has recovered and needs my attention.

Now that I am paying attention, I want one of my commonly used cases to run again (it might be a long-period case, say once an hour, while the short-link cases may run at any time), so I tell the robot to run it again. The robot recognizes my intent and runs the case against that cluster again. After the run, I can query the cluster's current status. This is very convenient: sometimes in the evening, on the road, or in bed, you can still handle an on-call for a system quite comfortably.

Demo samples

1. Release

2. Probe list

3. Probe Pod starts running

4. Detection results

5. Root cause analysis & alarms

6. Chat-ops

Click “here” to download the full contents of the e-book “Cloud Native and New Possibilities of the Cloud Future”.

E-book: Cloud Native and New Possibilities of the Cloud Future

Copy the link below to download the e-book for free:

https://developer.aliyun.com/topic/download?id=8265
