Hands-on: Hard disk failure prediction based on the random forest algorithm

Abstract: The industry hopes to use machine learning to build hard disk failure prediction models that detect failures earlier and more accurately, reduce operation and maintenance costs, and improve the service experience. This case uses the random forest algorithm to train such a model.

This article is shared from the Huawei Cloud Community “Hard Drive Failure Prediction Based on Random Forest Algorithm”, the original author: Shanhaizhiguang.

Experiment goals

Master the basic process of training a model with machine learning methods;
Master the basic methods of data analysis with Pandas;
Master how to use scikit-learn to build, train, save, and load a random forest model, make predictions with it, compute accuracy metrics, and view the confusion matrix.

Case introduction

With the growth of the Internet and cloud computing, the demand for data storage keeps rising, and large-scale data storage centers have become essential infrastructure. Although newer storage media such as SSDs outperform mechanical disks in many respects, their cost is still too high for most data centers, so large data centers continue to rely on traditional mechanical hard disks.

A mechanical hard disk typically lasts 3 to 5 years, and its failure rate rises significantly after 2 to 3 years, which sharply increases the number of disk replacements. According to statistics, hard disk failures account for more than 48% of server hardware failures, which makes them a major factor in server reliability. As early as the 1990s, people realized that the data on a drive is worth far more than the drive itself and wanted a technology that could predict drive failures and so protect the data. S.M.A.R.T. emerged in response.

S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) is an automated monitoring and early-warning system and specification for hard disk health. Detection instructions in the drive monitor the running state of hardware such as the heads, platters, motor, and circuitry, record the readings, and compare them with the safe values preset by the manufacturer. If a monitored value exceeds or is about to exceed its preset safe range, the host's monitoring hardware or software can automatically warn the user and perform minor automatic repairs, so that the data on the disk can be protected in advance. Apart from some very old models, most hard drives now support this technology. For more information, see the S.M.A.R.T. entry in Baidu Encyclopedia.

Although hard disk manufacturers use S.M.A.R.T. to monitor disk health, most of them predict failures with hand-designed rules, which perform poorly and cannot meet the increasingly strict requirements for predicting failures in advance. The industry therefore hopes to use machine learning to build hard disk failure prediction models that detect failures earlier and more accurately, reduce operation and maintenance costs, and improve the service experience.

In this case you will train and test a hard disk failure prediction model using an open source S.M.A.R.T. dataset and the random forest algorithm. For a theoretical explanation of the random forest algorithm, see this video.

Notes

If this is your first time using JupyterLab, see the “ModelArts JupyterLab Instructions” to learn how to use it;
If you encounter an error while using JupyterLab, refer to the “ModelArts JupyterLab FAQ Solution” to try to resolve it.

Experiment steps

1. Data set introduction

The dataset used in this case is an open source dataset from Backblaze, Inc., a computer backup and cloud storage provider. Every year since 2013, Backblaze has publicly released the S.M.A.R.T. log data of the hard drives in its data centers, which has effectively promoted research on hard disk failure prediction with machine learning. Backblaze has released a very large amount of S.M.A.R.T. log data, and this case only aims to demonstrate quickly how to build a failure prediction model with machine learning, so only the data released for 2020 is used. The data has been prepared and placed in OBS, and you can download it by running the following code.

Note: The code to download data in this step needs to run on Huawei Cloud ModelArts Codelab

import os
import moxing as mox

# Download and unpack the prepared dataset from OBS if it is not already present
if not os.path.exists('./dataset_2020.zip'):
    mox.file.copy('obs://modelarts-labs-bj4/course/ai_in_action/2021/machine_learning/hard_drive_disk_fail_prediction/dataset_2020.zip', './dataset_2020.zip')
    os.system('unzip dataset_2020.zip')

if not os.path.exists('./dataset_2020'):
    raise Exception('Error! Data does not exist!')

!ls -lh ./dataset_2020

Data interpretation:

2020-12-08.csv: S.M.A.R.T. log data for 2020-12-08, extracted from the Q4 2020 dataset published by Backblaze
2020-12-09.csv: S.M.A.R.T. log data for 2020-12-09, extracted from the Q4 2020 dataset published by Backblaze
dataset_2020.csv: the processed S.M.A.R.T. log data for the full year 2020
prepare_data.py: running this script downloads the S.M.A.R.T. log data for all of 2020 and processes it to produce dataset_2020.csv; the script needs 20 GB of local storage to run

2. Data analysis

Before building any model with machine learning, you need to analyze the dataset to understand its size, attribute names, attribute values, summary statistics, and null values. You have to understand the data before you can use it well.

2.1 Read CSV files

Pandas is a commonly used Python data analysis library, and we start by using it to load a CSV file from the dataset. Taking 2020-12-08.csv as an example, we first load the file to analyze its S.M.A.R.T. log data.
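The loading step can be sketched as follows. This is a minimal sketch and not the notebook's own code: the in-memory sample, its values, and the path in the comment are assumptions made so that the example runs without the downloaded dataset.

```python
import io

import pandas as pd

# Hypothetical miniature of a Backblaze daily log (values are made up).
# The real file has 149 columns; only a few are reproduced here.
SAMPLE_CSV = """date,serial_number,model,capacity_bytes,failure,smart_5_raw,smart_9_raw
2020-12-08,SN0001,ST12000NM0007,12000138625024,0,0,20345
2020-12-08,SN0002,ST4000DM000,4000787030016,0,8,43120
2020-12-08,SN0003,ST12000NM0007,12000138625024,1,,18007
"""

# In the notebook this would presumably read the real file instead,
# e.g. pd.read_csv('./dataset_2020/2020-12-08.csv') (assumed path).
df_data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(type(df_data))   # a pandas DataFrame
print(df_data.dtypes)  # string columns load as object, numeric columns as int/float
```

An empty field, like smart_5_raw in the last row, is loaded as a null value (NaN), which matters for the null-value analysis later on.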

2.2 View the data size of a single CSV file
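The size of a DataFrame is available through its shape attribute, which returns the number of rows and the number of columns. A minimal sketch on a made-up stand-in for df_data (the column values are assumptions):

```python
import pandas as pd

# Made-up stand-in for df_data; the real daily file is far larger.
df_data = pd.DataFrame({
    "serial_number": ["SN0001", "SN0002", "SN0003", "SN0004"],
    "failure": [0, 0, 1, 0],
    "smart_9_raw": [20345, 43120, 18007, 905],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df_data.shape
print(f"{n_rows} rows x {n_cols} columns")
```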

2.3 Look at the first five rows of data

When you load CSV with Pandas, you get a DataFrame object, which you can think of as a table. You can call the object’s head() function to see the first five rows of the table

df_data.head()

5 rows × 149 columns

The first 5 rows of the table are shown above. The header row holds the attribute names, and the rows below it hold the attribute values. The Backblaze website explains the meaning of each attribute.

2.4 Check the statistical indicators of the data

After looking at the first five rows of the table, we then call the describe() function of the DataFrame object to calculate the statistical metrics for the table data

df_data.describe()

8 rows × 146 columns

By default, the describe() function computes statistics only for numeric columns. Since the first three columns of the table, 'date', 'serial_number', and 'model', are strings, they have no statistics.

count: the number of non-null values in the column
mean: the mean of the column
std: the standard deviation of the column
min: the minimum value of the column
25%: the 25th percentile of the column
50%: the 50th percentile (median) of the column
75%: the 75th percentile of the column
max: the maximum value of the column
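To make these metrics concrete, here is a small sketch on a made-up column (the values are assumptions chosen so the results are easy to check by hand). It also shows that count ignores null values, which is why a low count hints at missing data.

```python
import pandas as pd

# Made-up column with one null value
df = pd.DataFrame({"smart_9_raw": [100, 200, 300, 400, 500, None]})

stats = df["smart_9_raw"].describe()
print(stats)

# The percentile rows are the same values that quantile() returns
q25 = df["smart_9_raw"].quantile(0.25)
median = df["smart_9_raw"].median()
print(q25, median)  # 200.0 300.0
```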

2.5 Check the null value of data

As the output above shows, the count for some attributes is small. For example, the count for smart_2_raw is much smaller than the total number of rows in df_data. So let's take a closer look at the null values of each column by executing the following code

df_data.isnull().sum()

date                         0
serial_number                0
model                        0
capacity_bytes               0
failure                      0
smart_1_normalized         179
smart_1_raw                179
smart_2_normalized      103169
smart_2_raw             103169
smart_3_normalized        1261
smart_3_raw               1261
smart_4_normalized        1261
smart_4_raw               1261
smart_5_normalized        1221
smart_5_raw               1221
smart_7_normalized        1261
smart_7_raw               1261
smart_8_normalized      103169
smart_8_raw             103169
smart_9_normalized         179
smart_9_raw                179
smart_10_normalized       1261
smart_10_raw              1261
smart_11_normalized     161290
smart_11_raw            161290
smart_12_normalized        179
smart_12_raw               179
smart_13_normalized     161968
smart_13_raw            161968
smart_15_normalized     162008
...
smart_232_normalized    160966
smart_232_raw           160966
smart_233_normalized    160926
smart_233_raw           160926
smart_234_normalized    162008
smart_234_raw           162008
smart_235_normalized    160964
smart_235_raw           160964
smart_240_normalized     38968
smart_240_raw            38968
smart_241_normalized     56030
smart_241_raw            56030
smart_242_normalized     56032
smart_242_raw            56032
smart_245_normalized    161968
smart_245_raw           161968
smart_247_normalized    162006
smart_247_raw           162006
smart_248_normalized    162006
smart_248_raw           162006
smart_250_normalized    162008
smart_250_raw           162008
smart_251_normalized    162008
smart_251_raw           162008
smart_252_normalized    162008
smart_252_raw           162008
smart_254_normalized    161725
smart_254_raw           161725
smart_255_normalized    162008
smart_255_raw           162008
Length: 149, dtype: int64

This display is not easy to view, so we plot the number of null values as a graph to make it more intuitive

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df_data_null_num = df_data.isnull().sum()
x = list(range(len(df_data_null_num)))
y = df_data_null_num.values
plt.plot(x, y)
plt.show()

As you can see from the above results, some of the attributes in the table have a large number of null values.

In machine learning, null values in a dataset are very common, and they arise for many reasons. For example, a user profile has many attributes, but not every user has a value for each one, which produces null values. Data that was never collected because of a transmission timeout can also show up as null values.

2.6 Analysis of category equilibrium

The task here is “hard disk failure prediction”, that is, predicting whether a hard disk is healthy or failed at a given time. This is a failure prediction or anomaly detection problem, and such problems share a characteristic: normal samples are plentiful, failure samples are rare, and the two classes differ enormously in size.

For example, counting the values of the failure column shows that df_data contains more than 160,000 samples of healthy hard disks but only 8 failure samples, so the classes are extremely imbalanced.
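A class count of this kind can be obtained with value_counts(). The sketch below uses a made-up failure column (995 healthy rows and 5 failed rows are assumptions, not the real counts):

```python
import pandas as pd

# Made-up label column: 0 = healthy, 1 = failed
df_data = pd.DataFrame({"failure": [0] * 995 + [1] * 5})

# Absolute number of samples per class
class_counts = df_data["failure"].value_counts()
# Share of each class in the whole table
class_share = df_data["failure"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```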

Most machine learning methods learn from the statistics of the training data. If the class-imbalanced data above were used directly for training, the model would be clearly biased toward the majority class, and the minority-class samples would be “swamped” and contribute little to learning. So we need to balance the classes. To obtain more failure samples, we can go through all of Backblaze's S.M.A.R.T. log data for 2020, pick out every failure sample, and randomly pick out the same number of healthy samples. This can be done with the following code.

This code has been commented out because it needs 20 GB of local storage to run. You do not need to run it: dataset_2020.zip was downloaded at the beginning of this case, and the dataset_2020.csv inside that zip package is exactly the file the code produces
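The balancing idea itself (keep every failure sample, then draw the same number of healthy samples at random) can be sketched on a small in-memory table. This is a simplified illustration, not the prepare_data.py script: the helper name balance_by_undersampling, the seed, and the toy data are all assumptions.

```python
import pandas as pd


def balance_by_undersampling(df, label_col="failure", seed=0):
    """Keep every failed row plus an equally sized random sample of healthy rows."""
    failed = df[df[label_col] == 1]
    healthy = df[df[label_col] == 0].sample(n=len(failed), random_state=seed)
    balanced = pd.concat([failed, healthy])
    # Shuffle so that the two classes are interleaved
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 6 failed drives among 200 rows
df_all = pd.DataFrame({
    "serial_number": [f"SN{i:04d}" for i in range(200)],
    "failure": [1] * 6 + [0] * 194,
})

df_balanced = balance_by_undersampling(df_all)
print(df_balanced["failure"].value_counts())
```

Undersampling discards most healthy rows, which is acceptable here because healthy samples are abundant while failure samples are the scarce resource.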

The code below loads log data into the df_data object again, so to avoid running out of memory you can manually free the memory first, because JupyterLab does not automatically reclaim memory in the environment while it is running
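One common way to free the memory by hand is to delete the reference and then ask the garbage collector to run. A minimal sketch, assuming df_data is the only reference to the table (the stand-in DataFrame is made up):

```python
import gc

import pandas as pd

# Stand-in for the large table loaded earlier
df_data = pd.DataFrame({"smart_9_raw": range(100_000)})

# Drop the reference, then trigger a collection pass
del df_data
unreachable = gc.collect()  # number of unreachable objects found
print(unreachable)
```

If other variables still point at the same data (for example a filtered copy or a cell output), that memory stays in use until those references are removed as well.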

2.7 Load the class-balanced dataset

dataset_2020.csv contains hard disk S.M.A.R.T. log data that has already been class-balanced. Now let's load the file and check the class balance again

As you can see, there are 1497 normal samples and 1497 failure samples

3. Feature engineering

With a usable training set in place, the next step is feature engineering, which broadly means choosing which attributes in the table to use for building the machine learning model. The quality of hand-crafted features largely determines how well a machine learning model performs, so practitioners spend a great deal of effort on feature design, a time-consuming and labor-intensive task that requires expert experience.

3.1 Study on the correlation between SMART attribute and hard disk failure

(1) Backblaze analyzed the correlation between hard disk failures and SMART attributes and found that SMART 5, 187, 188, 197, and 198 correlate most strongly with failures. These SMART attributes are also related to scan errors, reallocation counts, and probational counts [1];
(2) El-Shimi et al. found that, in a random forest model, SMART 9, 193, 194, 241, and 242 carry the largest weights in addition to the five attributes above [2];
(3) Pitakrat et al. evaluated 21 machine learning algorithms for predicting hard disk failures and found that the random forest algorithm had the largest area under the ROC curve, while the KNN classifier had the highest F1 score [3];
(4) Hughes et al. also studied machine learning methods for predicting hard disk failures. They analyzed the performance of SVM and Naive Bayes, and SVM performed best, with a detection rate of 50.6% and a false positive rate of 0% [4].

[1] Klein, Andy. “What SMART Hard Disk Errors Actually Tell Us.” Backblaze Blog, Cloud Storage & Cloud Backup, 6 Oct. 2016, http://www.backblaze.com/blog…
[2] El-Shimi. “Predicting Storage Failures.” Vault: Linux Storage and File Systems Conference, 2017, Cambridge.
[3] Pitakrat, Teerat, Andre van Hoorn, and Lars Grunske. “A Comparison of Machine Learning Algorithms for Proactive Hard Drive Failure Detection.” Proceedings of the 4th International ACM SIGSOFT Symposium on Architecting Critical Systems. ACM, 2013.
[4] Hughes, Gordon F., et al. “Improved Disk-Drive Failure Warnings.” IEEE Transactions on Reliability 51.3 (2002): 350-357.

The above are some earlier research results. This case uses a random forest model, so following result (2) above, SMART 5, 9, 187, 188, 193, 194, 197, 198, 241, and 242 can be selected as features. Their meanings are:
SMART 5: reallocated sector count
SMART 9: accumulated power-on time
SMART 187: errors that could not be corrected
SMART 188: command timeout count
SMART 193: head load/unload count
SMART 194: temperature
SMART 197: number of sectors waiting to be remapped
SMART 198: errors reported to the operating system that could not be corrected by hardware ECC
SMART 241: total number of logical block addresses (LBAs) written
SMART 242: total number of logical block addresses (LBAs) read

In addition, hard drives of different models and from different manufacturers may record SMART log data to different standards, so it is best to use data from a single drive model as training data and train a model specifically for that drive model. To predict failures for several drive models, you may need to train a separate model for each.

3.2 Selection of hard disk model

Execute the following code to see how many samples there are for each hard disk model

df_data.model.value_counts()

ST12000NM0007                         664
ST4000DM000                           491
ST8000NM0055                          320
ST12000NM0008                         293
TOSHIBA MG07ACA14TA                   212
ST8000DM002                           195
HGST HMS5C4040BLE640                  193
HGST HUH721212ALN604                  153
TOSHIBA MQ01ABF050                     99
ST12000NM001G                          53
HGST HMS5C4040ALE640                   50
ST500LM012 HN                          40
TOSHIBA MQ01ABF050M                    35
HGST HUH721212ALE600                   34
ST10000NM0086                          29
ST14000NM001G                          23
HGST HUH721212ALE604                   21
ST500LM030                             15
HGST HUH728080ALE600                   14
Seagate BarraCuda SSD ZA250CM10002     12
WDC WD5000LPVX                         11
WDC WUH721414ALE6L4                    10
ST6000DX000                             9
TOSHIBA MD04ABA400V                     3
ST8000DM004                             2
ST18000NM000J                           2
Seagate SSD                             2
ST4000DM005                             2
ST8000DM005                             1
ST16000NM001G                           1
DELLBOSS VD                             1
TOSHIBA HDWF180                         1
HGST HDS5C4040ALE630                    1
HGST HUS726040ALE610                    1
WDC WD5000LPCX                          1
Name: model, dtype: int64

The ST12000NM0007 model has the most data, so we filter out the rows for this model

df_data_model = df_data[df_data['model'] == 'ST12000NM0007']

3.3 Feature selection

Select the 10 attributes mentioned above as features

Null values exist, so fill them in first
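The two steps above (selecting the 10 SMART attributes, then filling null values) might look like the sketch below. Several details are assumptions rather than the notebook's confirmed code: using the raw columns instead of the normalized ones, filling nulls with 0, and the random toy table standing in for df_data_model.

```python
import numpy as np
import pandas as pd

# The 10 SMART attributes chosen in section 3.1
SMART_IDS = [5, 9, 187, 188, 193, 194, 197, 198, 241, 242]
# Assumption: use the raw readings (smart_<id>_raw) as features
feature_cols = [f"smart_{i}_raw" for i in SMART_IDS]

# Toy stand-in for df_data_model with a few null values
rng = np.random.default_rng(0)
df_data_model = pd.DataFrame(
    rng.integers(0, 100, size=(6, len(feature_cols))).astype(float),
    columns=feature_cols,
)
df_data_model["failure"] = [0, 1, 0, 1, 0, 1]
df_data_model.loc[2, "smart_241_raw"] = np.nan
df_data_model.loc[4, "smart_242_raw"] = np.nan

print(df_data_model[feature_cols].isnull().sum())  # nulls per feature before filling

# Assumption: replace nulls with 0; other strategies (mean, median) are possible
X_data = df_data_model[feature_cols].fillna(0)
Y_data = df_data_model["failure"]
```

X_data and Y_data are the names used by the split in the next section.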

3.4 Divide training set and test set

train_test_split from scikit-learn can be used to split the data into a training set and a test set. test_size is the proportion of the data held out for the test set, typically 0.3, 0.2, or 0.1

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as the test set; random_state fixes the split
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=0)

4. Start training

4.1 Model Construction

Once the training set and test set are ready, the model can be built. Building the model is very simple: just call RandomForestClassifier from the scikit-learn machine learning library

from sklearn.ensemble import RandomForestClassifier 

rfc = RandomForestClassifier()

The random forest algorithm has many hyperparameters, and different values produce different training results. Beginners can simply use the default values provided by the library. Once you have some understanding of how the random forest algorithm works, you can try modifying the hyperparameters to tune the model.
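As an illustration of what adjusting hyperparameters looks like, the sketch below sets a few of them explicitly. The specific values are arbitrary examples, not tuned or recommended settings for this dataset.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only; suitable settings depend on the data
rfc_tuned = RandomForestClassifier(
    n_estimators=100,    # number of trees in the forest
    max_depth=8,         # limit tree depth to reduce overfitting
    min_samples_leaf=2,  # each leaf must contain at least 2 samples
    random_state=0,      # fix the seed so that runs are repeatable
)

# get_params() lists every hyperparameter and its current value
print(rfc_tuned.get_params())
```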

4.2 Data fitting

Training the model, that is, fitting it to the training data, is also very simple to implement. You start training by calling the fit function
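In the notebook the call is presumably rfc.fit(X_train, Y_train) on the split from section 3.4. The sketch below shows the same call on synthetic data so that it runs on its own; the data from make_classification and n_estimators=10 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training set: 200 samples, 10 features, 2 classes
X_train, Y_train = make_classification(n_samples=200, n_features=10, random_state=0)

rfc = RandomForestClassifier(n_estimators=10, random_state=0)
rfc.fit(X_train, Y_train)  # fit() trains the forest and returns the estimator

print(len(rfc.estimators_))  # one fitted decision tree per estimator
print(rfc.classes_)          # class labels seen during training
```

The output recorded in the original notebook run is shown next.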

/home/ma-user/anaconda3/envs/XGBoost-Sklearn/lib/python3.6/site-packages/sklearn/ensemble/forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

5. Start predicting

Call predict to start the prediction

Y_pred = rfc.predict(X_test)

5.1 Statistical prediction accuracy

In machine learning, there are four commonly used performance indicators for classification problems: Accuracy, Precision, Recall and F1-score. The closer the four indicators are to 1, the better the performance will be. The Sklearn library has functions for these four metrics that can be called directly.

For a theoretical explanation of the four metrics, refer to this article.
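The four metrics can be computed with the corresponding functions in sklearn.metrics. The sketch below uses a small hand-made set of labels and predictions (assumed values, not the model's real output) so that the numbers can be checked by hand: 2 true positives, 1 false positive, 2 false negatives, and 3 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made example: 1 = failed, 0 = healthy
Y_test = [0, 0, 0, 0, 1, 1, 1, 1]
Y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(Y_test, Y_pred)    # (TP + TN) / all = 5/8 = 0.625
precision = precision_score(Y_test, Y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(Y_test, Y_pred)        # TP / (TP + FN) = 2/4 = 0.5
f1 = f1_score(Y_test, Y_pred)                # harmonic mean of precision and recall = 4/7

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For failure prediction, recall deserves particular attention, since a missed failure (false negative) is usually more costly than a false alarm.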

Each time the random forest model is trained, the test metrics come out slightly different. This is normal, because the training process of the random forest algorithm involves randomness. For a given trained model and a given sample, however, the prediction is always the same.
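If repeatable results are needed, the randomness can be pinned by passing random_state to RandomForestClassifier. The sketch below, on synthetic data (an assumption for illustration), trains twice with the same seed and checks both that the predictions match and that repeated predictions from one trained model are stable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)


def train_and_predict(seed):
    """Train a forest with a fixed seed and return the model and its test predictions."""
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    model.fit(X_train, Y_train)
    return model, model.predict(X_test)


model_a, pred_a = train_and_predict(seed=42)
model_b, pred_b = train_and_predict(seed=42)

# Same seed and same data give the same forest, hence the same predictions
same_seed_identical = np.array_equal(pred_a, pred_b)
# A trained model always predicts the same result for the same samples
repeat_identical = np.array_equal(pred_a, model_a.predict(X_test))
print(same_seed_identical, repeat_identical)
```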

5.2 Save, load and forecast the model

Save the model

import pickle
with open('hdd_failure_pred.pkl', 'wb') as fw:
    pickle.dump(rfc, fw)

Load the model

with open('hdd_failure_pred.pkl', 'rb') as fr:
    new_rfc = pickle.load(fr)

Predict again with the loaded model
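In the notebook this step presumably calls new_rfc.predict on test samples. The self-contained sketch below shows the idea on synthetic data and serializes in memory with pickle.dumps and pickle.loads instead of writing a file; both choices are assumptions made so that the example runs anywhere.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small trained model
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rfc = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Round-trip the model through pickle, as saving and loading a file would
new_rfc = pickle.loads(pickle.dumps(rfc))

# The reloaded model should reproduce the original model's predictions
original_pred = rfc.predict(X[:5])
reloaded_pred = new_rfc.predict(X[:5])
print(original_pred, reloaded_pred)
```

Pickled models should be loaded with the same scikit-learn version that saved them, and pickle files from untrusted sources should never be loaded.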

5.3 View the confusion matrix

The confusion matrix is another way to analyze a classification model. Its horizontal axis gives the predicted class, its vertical axis gives the true label, and each cell holds the number of test samples with that combination of true and predicted class.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix 

LABELS = ['Healthy', 'Failed'] 
conf_matrix = confusion_matrix(Y_test, Y_pred) 
plt.figure(figsize =(6, 6)) 
sns.heatmap(conf_matrix, xticklabels = LABELS,  
            yticklabels = LABELS, annot = True, fmt ="d"); 
plt.title("Confusion matrix") 
plt.ylabel('True class') 
plt.xlabel('Predicted class') 
plt.show()

6. Ideas for improving the model

The content above demonstrates the process of building a hard disk failure prediction model with the random forest algorithm. The accuracy of the model is not high. Here are several ideas for improving it:
(1) This case only uses Backblaze's data for 2020, so you can try using more training data;
(2) This case only uses 10 SMART attributes as features, so you can try other ways of building features;
(3) This case trains the model with the random forest algorithm, so you can try other machine learning algorithms.

Click into Huawei Cloud ModelArts Codelab to directly run the code of this case
