With the growth of Alibaba's big data products, the number of servers keeps increasing, and the pressure on IT operations grows proportionally. Service interruption caused by hardware and software failures is one of the major factors affecting stability. This article explains in detail how Alibaba implements hardware failure prediction, automatic server offlining, service self-healing, and self-balancing cluster reconstruction, so that hardware failures are handled in an automatic closed loop before they affect the business, and common hardware failures are resolved automatically without manual intervention.

1. Background

1.1. Challenges

MaxCompute, the offline computing platform that carries 95% of Alibaba Group's data storage and computation, has grown to a scale of hundreds of thousands of servers as the business has expanded. The characteristics of offline jobs make hardware failures hard to notice at the software level, and at the same time the group-wide unified hardware fault-reporting thresholds often miss hardware failures that do affect applications. Each missed failure poses a significant challenge to cluster stability.

In view of these challenges, we face two problems: discovering hardware failures in time, and migrating services off the failed machines. Below, we analyze these two problems and introduce DAM, the automatic hardware self-healing platform, in detail. Before that, let us first look at Tianji, the application management system of the Apsara operating system.

1.2. Tianji application management

MaxCompute is built on top of Apsara, Alibaba's data center operating system, where all applications are managed by Tianji. Tianji is an automated data center management system that manages the hardware life cycle and various static resources (programs, configurations, operating system images, data, etc.) in the data center. Our hardware self-healing system is closely integrated with Tianji's Healing mechanism to build a closed-loop system of hardware fault discovery and self-healing repair for complex businesses.



Through Tianji, we can issue instructions for physical machines (restart, reinstall, repair), and Tianji relays them to each application on that machine; each application then decides how to respond to the instruction according to its own business characteristics and self-healing scenario.
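As a rough illustration of this handshake, the sketch below shows how an application-level hook might agree to release a machine only after its own safeguards pass; the class and function names are hypothetical and do not reflect the real Tianji interface.

# Hypothetical sketch (not the real Tianji API): an application on a physical
# machine acknowledges a repair instruction only after draining its own work.

from dataclasses import dataclass

@dataclass
class Instruction:
    machine: str          # hostname of the affected physical machine
    action: str           # "restart" | "reinstall" | "repair"

class AppSelfHealingHook:
    """Per-application hook that decides when a machine-level action may proceed."""

    def __init__(self, app_name: str):
        self.app_name = app_name

    def can_release_machine(self, instruction: Instruction) -> bool:
        # Each application applies its own business rules, e.g. make sure data
        # replicas exist elsewhere and running jobs are migrated before agreeing.
        replicas_safe = self.check_replicas_elsewhere(instruction.machine)
        jobs_drained = self.drain_jobs(instruction.machine)
        return replicas_safe and jobs_drained

    def check_replicas_elsewhere(self, machine: str) -> bool:
        return True   # placeholder: query the application's own metadata service

    def drain_jobs(self, machine: str) -> bool:
        return True   # placeholder: stop scheduling new work onto this machine

def handle_instruction(instruction: Instruction, hooks: list) -> str:
    # The management system only executes the action once every application
    # on the machine has agreed to release it.
    if all(hook.can_release_machine(instruction) for hook in hooks):
        return f"execute {instruction.action} on {instruction.machine}"
    return "wait: some application has not released the machine yet"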

2. Discovering hardware faults

2.1. How faults are discovered

We pay attention to the following hardware components: hard disk, memory, CPU, NIC, and power supply. The following lists the methods and main tools for discovering common hardware problems:



Hard disk faults account for more than 50% of all hardware faults. The most common of these is the hard disk media fault, which usually manifests as file reads/writes that fail, hang, or become slow. However, read/write problems are not necessarily caused by media failures, so it is necessary to describe how media failures manifest at each level.



A. An error message similar to the following can be found in /var/log/messages

Sep 3 13:43:22 host1.a1 kernel: : [14809594.557970] sd 6:0:11:0: [sdl] Sense Key : Medium Error [current]

host1.a1 kernel: : [61959097.553029] Buffer I/O error on device sdi1, logical block 796203507

B. Changes in tsar I/O metrics refer to changes or sudden jumps in rs/ws/await/svctm/util. Because reads and writes pause while the disk is reporting errors, these changes usually show up first in iostat and are then collected by tsar.
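As a minimal illustration of this kind of detection (the thresholds, sampling layout, and field names are assumptions, not the production rules), a spike detector over these metrics could look like the following Python sketch.

# Illustrative sketch only: flag suspicious jumps in per-device I/O metrics of the
# kind tsar/iostat expose (await, svctm, util). Thresholds are invented for
# demonstration, not the values used in production.

def detect_io_anomaly(samples, await_jump_ms=100.0, util_high=95.0):
    """samples: list of dicts like {"await": 3.2, "svctm": 1.1, "util": 40.0},
    ordered oldest to newest, typically one point per collection interval."""
    if len(samples) < 2:
        return False
    prev, cur = samples[-2], samples[-1]
    # A sudden jump in await together with a saturated device is a typical
    # signature of reads/writes stalling while the disk retries a bad sector.
    await_spike = cur["await"] - prev["await"] > await_jump_ms
    device_saturated = cur["util"] > util_high
    return await_spike and device_saturated

# Example: the last sample shows await jumping from 4 ms to 800 ms at 99% util.
history = [{"await": 3.8, "svctm": 1.0, "util": 42.0},
           {"await": 4.1, "svctm": 1.1, "util": 45.0},
           {"await": 800.0, "svctm": 200.0, "util": 99.0}]
print(detect_io_anomaly(history))  # True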


C. Changes in system-level indicators are usually secondary effects of the I/O changes, such as a load increase caused by processes stuck in the D (uninterruptible sleep) state.

D. SMART value changes refer to changes in attribute 197 (Current_Pending_Sector) and attribute 5 (Reallocated_Sector_Ct). The relationship between these two values and read/write exceptions is as follows:





In conclusion, observing only one stage of the whole error-reporting chain is not enough; comprehensive analysis across multiple stages is needed to prove a hardware problem. And since we can rigorously prove media failures, we can also work backwards and quickly distinguish between software and hardware causes when an unknown problem appears.
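A minimal sketch of this multi-signal reasoning, assuming simplified inputs (kernel log lines, an I/O anomaly flag, and SMART counters such as those parsed from smartctl -A output), might look like this; the thresholds and the two-out-of-three rule are illustrative, not the actual decision logic.

# Sketch of the "multiple stages of evidence" idea: a media-failure verdict
# is only raised when independent signals agree. Field names and thresholds
# are assumptions for illustration.

import re

MEDIUM_ERROR = re.compile(r"Sense Key\s*:\s*Medium Error|Buffer I/O error", re.I)

def kernel_log_has_media_error(lines):
    # lines: iterable of strings from /var/log/messages
    return any(MEDIUM_ERROR.search(line) for line in lines)

def smart_counters_degraded(smart, pending_threshold=1, realloc_threshold=1):
    # smart: {"Current_Pending_Sector": int, "Reallocated_Sector_Ct": int}
    return (smart.get("Current_Pending_Sector", 0) >= pending_threshold or
            smart.get("Reallocated_Sector_Ct", 0) >= realloc_threshold)

def is_media_failure(kernel_lines, io_anomaly, smart):
    """Require at least two independent stages to agree before blaming hardware;
    a single noisy signal is not treated as proof."""
    signals = [kernel_log_has_media_error(kernel_lines),
               bool(io_anomaly),
               smart_counters_degraded(smart)]
    return sum(signals) >= 2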

The above tools were accumulated from operations experience and fault scenarios. We also know that a single source is not enough, so we introduce additional sources of hardware fault discovery and combine the various inspection methods to reach a final diagnosis of hardware faults.

2.2. How to converge

The previous section described many tools and paths for discovering hardware faults, but not every discovered fault needs to be acted on immediately; we adhere to the following principles when converging hardware problems:





2.3. Coverage

Taking the IDC work orders of a production cluster in month x of 20xx as an example, the statistics of hardware faults and work orders are as follows:



Excluding out-of-band failures, our hardware failure detection rate is 97.6%.

3. Repairing hardware faults

3.1. Self-healing process

For the hardware problems of each machine, we automatically issue a rotation work order to follow up. Currently there are two self-healing processes: the [application maintenance process] and the [non-application maintenance process]. The former is for hot-pluggable hard disk failures; the latter is for all other whole-machine hardware failures.
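The routing between the two processes can be pictured with the simplified sketch below; the component lists, flow names as strings, and ticket format are assumptions for illustration only.

# Simplified sketch of routing a detected fault into one of the two self-healing
# flows described above. The flow names mirror the text; everything else is hypothetical.

HOT_SWAPPABLE = {"hdd", "ssd"}          # disks can be replaced with services running
WHOLE_MACHINE = {"memory", "cpu", "nic", "power", "mainboard"}

def open_rotation_ticket(machine: str, component: str) -> dict:
    if component in HOT_SWAPPABLE:
        flow = "application-maintenance"      # hot-pluggable disk: repair in place
    elif component in WHOLE_MACHINE:
        flow = "non-application-maintenance"  # whole-machine fault: offline the host
    else:
        flow = "manual-triage"                # unrecognized component: human follow-up
    return {"machine": machine, "component": component, "flow": flow}

print(open_rotation_ticket("host1.a1", "hdd"))
# {'machine': 'host1.a1', 'component': 'hdd', 'flow': 'application-maintenance'}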



In our automated process, there are a few clever designs:

A. Diskless diagnosis





B. Impact assessment and impact escalation





C. Automatic fallback handling for unknown problems


D. Downtime analysis








3.2. Process statistical analysis

If the same hardware problem repeatedly triggers self-healing, it can be found in the work-order statistics of the process. For example, with the Lenovo RD640 virtual serial port problem, before locating the root cause we found through statistics that machines of the same model went through repeated downtime-and-self-healing cycles, and the problem reappeared even after the machines were reinstalled. We then quarantined those machines, keeping the cluster stable and buying time for the investigation.
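The kind of work-order statistics that surfaced this pattern can be sketched as follows; the ticket fields and thresholds are hypothetical.

# Sketch: count self-healing tickets per (model, machine) over a window and flag
# machines that keep coming back, then flag models with many such machines.

from collections import Counter

def find_repeat_offenders(tickets, min_repeats=3):
    """tickets: iterable of dicts like {"machine": "host1.a1", "model": "RD640",
    "action": "reboot"}; returns (model, machine) pairs that repeatedly self-healed."""
    per_machine = Counter((t["model"], t["machine"]) for t in tickets)
    return [key for key, count in per_machine.items() if count >= min_repeats]

def models_with_systemic_issue(tickets, min_machines=5):
    """A model is suspicious when many distinct machines of that model repeat."""
    repeaters = find_repeat_offenders(tickets)
    per_model = Counter(model for model, _machine in repeaters)
    return [model for model, n in per_model.items() if n >= min_machines]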

3.3. The pitfall of coupling with business problems

In fact, with the complete self-healing system described above, some business/kernel/software problems that need handling can also enter this self-healing system and fall into the unknown-problem branch. Using hardware self-healing to solve business problems, however, is a bit like drinking poison to quench thirst: it makes it easy for more and more problems that have not been thought through to be pushed down to this layer as a catch-all.



At present, we are gradually removing the handling of non-hardware problems and returning to hardware-oriented self-healing scenarios (software-oriented general self-healing is also carried by the system, but such scenarios are tightly coupled with the business and cannot be generalized across the group). This is more conducive to classifying hardware versus software problems and to discovering unknown problems.

4. Architecture evolution

4.1. Moving to the cloud

The initial version of the self-healing architecture ran on the controller of each cluster, because the operations engineers initially handled problems on the controller as well. However, as automation progressed, we found that this architecture seriously hindered data openness, so we rebuilt it as a centralized architecture. A centralized architecture, in turn, runs into the problem of massive data processing, which a handful of servers simply cannot handle.

Therefore, we further carried out a distributed, service-oriented rebuild of the system to support massive business scenarios. We split the modules of the architecture apart, introduced Alibaba Cloud Log Service (SLS), Alibaba Cloud stream computing (Blink), and Alibaba Cloud analytic database (ADS), and let cloud products carry the individual collection and analysis tasks, leaving only the core hardware fault analysis and decision-making functions on our own servers.

Here is an architectural comparison of DAM1 and DAM3:



4.2. Digitalization

As the self-healing system has matured, data from each stage are produced in a stable fashion. Higher-dimensional analysis of these data gives us more valuable and clearer information. At the same time, we reduce the dimensionality of the high-dimensional analysis results and assign each machine a health score. Through health scores, operations engineers can quickly grasp the hardware condition of a single machine, a rack, or a cluster.
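A minimal sketch of such a health score, assuming made-up signal names and weights rather than the real scoring model:

# Illustrative only: collapse several per-machine signals into a single 0-100 score,
# then aggregate scores to get the rack/cluster view mentioned above.

def health_score(machine_signals: dict) -> float:
    """machine_signals: e.g. {"pending_sectors": 2, "io_anomalies_7d": 1,
    "self_healing_tickets_90d": 0, "ecc_errors_7d": 0}."""
    penalties = (
        10 * min(machine_signals.get("pending_sectors", 0), 5) +
        15 * min(machine_signals.get("io_anomalies_7d", 0), 3) +
        20 * min(machine_signals.get("self_healing_tickets_90d", 0), 2) +
        5  * min(machine_signals.get("ecc_errors_7d", 0), 4)
    )
    return max(0.0, 100.0 - penalties)

def cluster_health(machines: dict) -> float:
    # machines: {hostname: machine_signals}; average gives a cluster-level score.
    scores = [health_score(sig) for sig in machines.values()]
    return sum(scores) / len(scores) if scores else 100.0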

4.3. As a service

Based on control of the whole data link, we provide the entire fault self-healing system as a standardized, full hardware life cycle service to different product lines. With the decision logic sufficiently abstracted, the self-healing system exposes various perception thresholds, supports customization by each product line, and forms a full life cycle service that can be tailored to individual needs.
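The per-product-line customization of perception thresholds can be pictured with the configuration sketch below; the keys, values, and product-line names are invented for illustration.

# Hypothetical configuration shape: a default policy plus per-product-line overrides.

DEFAULT_POLICY = {
    "pending_sector_threshold": 1,     # SMART 197 count that triggers a ticket
    "io_util_threshold": 95,           # %util considered saturated
    "auto_offline": True,              # allow automatic machine offlining
}

PRODUCT_OVERRIDES = {
    "latency-sensitive-service": {"io_util_threshold": 80},
    "cold-storage":              {"pending_sector_threshold": 5, "auto_offline": False},
}

def policy_for(product_line: str) -> dict:
    policy = dict(DEFAULT_POLICY)
    policy.update(PRODUCT_OVERRIDES.get(product_line, {}))
    return policy

print(policy_for("cold-storage"))
# {'pending_sector_threshold': 5, 'io_util_threshold': 95, 'auto_offline': False}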

5. Fault self-healing closed-loop system

In the AIOps closed loop of perception, decision, and execution, fault self-healing of software and hardware is the most common application scenario, and the industry generally chooses fault self-healing as the first landing scenario of AIOps. In our view, providing a universal closed-loop fault self-healing system is the cornerstone of realizing AIOps and even NoOps (unattended operations), and an intelligent self-healing closed loop is especially important for operating massive systems.

5.1. Necessity

In a complex distributed system, operational conflicts among the various architectures are inevitable, and the essence of these conflicts is information asymmetry. The asymmetry arises because every distributed software architecture is designed as its own closed loop. Today these conflicts can be smoothed over with a variety of mechanisms and operations tools, but this approach amounts to patching: as the architecture keeps evolving, the patches seem endless and keep multiplying. It is therefore necessary to abstract this behavior as self-healing, declare it explicitly at the architecture level, let every piece of software participate in the whole self-healing process, and in this way turn the original conflicts into synergy.

At present, we focus our architecture and product design on the biggest conflict point in the operations scene, the conflict between hardware and software, and improve the overall robustness of the complex distributed system through self-healing.

5.2. Generality

Through hardware self-healing rotation across a large number of machines, we found that:








Therefore, self-healing is in essence a layer of closed-loop architecture built on top of a complex distributed system after fully abstracting operations automation, so that the architectural ecosystem achieves greater coordination and unity.