Abstract: The industry hopes to use machine learning to build hard disk failure prediction models that can detect failures in advance more accurately, reduce operation and maintenance costs, and improve the business experience. In this case, a random forest algorithm will be used to train a hard disk failure prediction model.

This article is shared from the HUAWEI Cloud community article "Hard Disk Failure Prediction Based on the Random Forest Algorithm", original author: Shanhaizhiguang.

Experiment goal

  1. Master the basic process of using machine learning methods to train models;
  2. Master the basic methods of using pandas for data analysis;
  3. Master how to use scikit-learn to construct, train, save, and load a random forest model, make predictions with it, compute accuracy metrics, and view its confusion matrix.

Case content introduction

With the development of the Internet and cloud computing, the demand for data storage grows day by day, and large-scale mass data storage centers have become indispensable infrastructure. Although new storage media such as SSDs already outperform mechanical hard disks in many respects, their cost is still too high for most data centers. Therefore, large data centers still mainly use traditional mechanical hard disks as the storage medium.

The life cycle of a mechanical hard disk is usually 3 to 5 years. After 2 to 3 years, the failure rate increases significantly and the volume of disk replacements rises sharply. According to statistics, hard disk failures account for more than 48% of server hardware failures and are an important factor affecting server reliability. As early as the 1990s, people realized that data is far more valuable than the hard disk itself and longed for a technology that could predict hard disk failures and protect data relatively safely, which is how SMART technology came into being.

SMART, short for "Self-Monitoring, Analysis and Reporting Technology", is an automatic hard disk status detection and early-warning system and specification. Using detection instructions in the hard disk hardware, it monitors the operating condition of components such as the heads, platters, motor, and circuitry, records the measurements, and compares them with safety thresholds preset by the manufacturer. If a monitored value is about to exceed or has already exceeded its safety threshold, the monitoring hardware or software on the host can automatically warn the user and perform minor automatic repairs, protecting the hard disk data in advance. Except for some very early hard drives, most hard drives are now equipped with this technology. For more information about this technology, see the SMART entry on Baidu Baike.

Although hard disk manufacturers use SMART technology to monitor hard disk health, most of them rely on rule-based failure prediction methods whose prediction performance is very poor and cannot meet the increasingly strict demand for predicting hard disk failures in advance. Therefore, the industry hopes to use machine learning to build hard disk failure prediction models that detect failures in advance more accurately, reduce operation and maintenance costs, and improve the business experience.

This case will walk you through using an open-source SMART data set and the random forest algorithm in machine learning to train a hard disk failure prediction model and test its performance.
For a theoretical explanation of the random forest algorithm, please refer to this video.

Precautions

  1. If you are using JupyterLab for the first time, please refer to the "ModelArts JupyterLab User Guide" to learn how to use it;
  2. If you encounter an error while using JupyterLab, please refer to the "ModelArts JupyterLab Common Problem Solutions" to try to solve the problem.

Experimental steps

1. Introduction to the data set

The dataset used in this case is an open-source dataset from Backblaze, a computer backup and cloud storage service provider. Since 2013, Backblaze has publicly released the SMART log data of the hard drives used in its data centers every year, effectively promoting the development of hard drive failure prediction with machine learning.
Because the amount of SMART log data released by Backblaze is large, and this case is meant to quickly demonstrate the process of building a hard disk failure prediction model with machine learning, only the data released for 2020 is used. The relevant data has been prepared and placed in OBS; run the following code to download it.

Note: The code for downloading data in this step needs to be run on Huawei Cloud ModelArts Codelab

import os
import moxing as mox  # ModelArts SDK for copying data between OBS and the notebook environment

# Download and unzip the dataset if it is not already present
if not os.path.exists('./dataset_2020.zip'):
    mox.file.copy('obs://modelarts-labs-bj4/course/ai_in_action/2021/machine_learning/hard_drive_disk_fail_prediction/dataset_2020.zip', './dataset_2020.zip')
    os.system('unzip dataset_2020.zip')

if not os.path.exists('./dataset_2020'):
    raise Exception('Error! The data does not exist!')

!ls -lh ./dataset_2020

[Output: listing of the files in the ./dataset_2020 directory]

Data interpretation:

2020-12-08.csv: SMART log data for 2020-12-08, extracted from the 2020 Q4 data set released by Backblaze
2020-12-09.csv: SMART log data for 2020-12-09, extracted from the 2020 Q4 data set released by Backblaze
dataset_2020.csv: processed SMART log data for the whole of 2020; Section "2.6 Analysis of category balance" below explains how this data was obtained
prepare_data.py: run this script to download the SMART log data for the whole of 2020 and process it into dataset_2020.csv. Running the script requires 20 GB of local storage space

2. Data analysis

Before building any machine learning model, you need to analyze the data set to understand its size, attribute names, attribute values, statistical indicators, and null values, because we can only make good use of data that we understand.

2.1 Read csv file

Pandas is a commonly used Python data analysis module. We first use it to load the csv files in the data set. Taking 2020-12-08.csv as an example, let's load this file and analyze the SMART log data.
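
A minimal sketch of this loading step with pandas (the file path is an assumption based on the directory downloaded above):

import pandas as pd

# Load one day of SMART logs into a DataFrame
df_data = pd.read_csv('./dataset_2020/2020-12-08.csv')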

2.2 View the size of a single csv file

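A minimal sketch of checking the size of the table, i.e. its number of rows and columns:

print(df_data.shape)  # (number of rows, number of columns)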

2.3 View the first 5 rows of data

After using pandas to load the csv, what you get is a DataFrame object, which can be understood as a table. Call the head() function of the object to view the first 5 rows of data in the table.

df_data.head()
[Output: the first 5 rows of df_data]

5 rows × 149 columns

Shown above are the first 5 rows of data in the table. The header contains the attribute names, and the attribute values appear below them. The Backblaze website explains the meaning of each attribute, translated as follows:
[Table: meanings of the dataset columns, translated from the Backblaze documentation]

2.4 View the statistical indicators of the data

After viewing the first 5 rows of data in the table, we then call the describe() function of the DataFrame object to calculate the statistical indicators of the table data.

df_data.describe()
[Output: statistical summary produced by df_data.describe()]

8 rows × 146 columns

Shown above are the statistical indicators of the table data. The describe() function performs statistical analysis on numeric columns by default. Since the first three columns of the table, 'date', 'serial_number', and 'model', are of string type, no statistical indicators are produced for them.

The meaning of each row of statistical indicators is explained as follows:
count: the number of non-null values in the column
mean: the mean of the column values
std: the standard deviation of the column values
min: the minimum of the column values
25%: the 25th percentile of the column values
50%: the median (50th percentile) of the column values
75%: the 75th percentile of the column values
max: the maximum of the column values

2.5 Check for null values in the data

From the above output, it can be observed that the count of some attributes is quite small. For example, the count of smart_2_raw is much smaller than the total number of rows of df_data, so we need to take a closer look at the null values in each column. Execute the following code to view the null counts:

df_data.isnull().sum()

date                         0
serial_number                0
model                        0
capacity_bytes               0
failure                      0
smart_1_normalized         179
smart_1_raw                179
smart_2_normalized      103169
smart_2_raw             103169
smart_3_normalized        1261
smart_3_raw               1261
smart_4_normalized        1261
smart_4_raw               1261
smart_5_normalized        1221
smart_5_raw               1221
smart_7_normalized        1261
smart_7_raw               1261
smart_8_normalized      103169
smart_8_raw             103169
smart_9_normalized         179
smart_9_raw                179
smart_10_normalized       1261
smart_10_raw              1261
smart_11_normalized     161290
smart_11_raw            161290
smart_12_normalized        179
smart_12_raw               179
smart_13_normalized     161968
smart_13_raw            161968
smart_15_normalized     162008
                         ...  
smart_232_normalized    160966
smart_232_raw           160966
smart_233_normalized    160926
smart_233_raw           160926
smart_234_normalized    162008
smart_234_raw           162008
smart_235_normalized    160964
smart_235_raw           160964
smart_240_normalized     38968
smart_240_raw            38968
smart_241_normalized     56030
smart_241_raw            56030
smart_242_normalized     56032
smart_242_raw            56032
smart_245_normalized    161968
smart_245_raw           161968
smart_247_normalized    162006
smart_247_raw           162006
smart_248_normalized    162006
smart_248_raw           162006
smart_250_normalized    162008
smart_250_raw           162008
smart_251_normalized    162008
smart_251_raw           162008
smart_252_normalized    162008
smart_252_raw           162008
smart_254_normalized    161725
smart_254_raw           161725
smart_255_normalized    162008
smart_255_raw           162008
Length: 149, dtype: int64

This display format is not easy to read, so let's plot the number of null values per column as a graph, which is more intuitive.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Count the null values in each column and plot them against the column index
df_data_null_num = df_data.isnull().sum()
x = list(range(len(df_data_null_num)))
y = df_data_null_num.values
plt.plot(x, y)
plt.show()

[Figure: number of null values per column]

As can be seen from the above results, some attributes in the table have a large number of null values.

In the field of machine learning, null values in data sets are a very common phenomenon, and there are many possible causes. For example, a user profile contains many attributes, but not all users have values for every attribute, which produces null values; or some data may fail to be collected because of a transmission timeout, which also produces null values.

2.6 Analysis of category balance

The task we want to accomplish is "hard disk failure prediction", that is, to predict whether a given hard disk is healthy or failed at a given time. This is a failure prediction or anomaly detection problem, and such problems share a characteristic: there are many normal samples and very few failure samples, so the sizes of the two classes differ greatly.
For example, executing the following code shows that df_data contains more than 160,000 normal hard disk samples but only 8 failed samples; the classes are extremely imbalanced.
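
A minimal sketch of this check, using the failure column (0 means healthy, 1 means failed):

print(df_data['failure'].value_counts())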

Since most machine learning methods learn based on statistical ideas, training directly on such imbalanced data would bias the model heavily towards the majority class, while the minority class would be "overwhelmed" and play almost no role in learning, so we need to balance the classes.
To obtain more failure samples, we can select all the failure samples from the SMART log data released by Backblaze for the whole of 2020, and also randomly select the same number of normal samples, which can be achieved with the following code.

The following code has been commented out because running it requires 20 GB of local storage space. You do not need to run it: dataset_2020.zip was already downloaded at the beginning of this case, and the dataset_2020.csv it contains is exactly the file this code produces.
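
A minimal sketch of the class-balancing idea, assuming the full-year 2020 logs have already been loaded into a DataFrame named df_2020 (a hypothetical name used only for illustration):

import pandas as pd

# df_2020 is assumed to hold all of Backblaze's 2020 SMART logs
df_failed = df_2020[df_2020['failure'] == 1]                 # keep every failure sample
df_normal = df_2020[df_2020['failure'] == 0].sample(
    n=len(df_failed), random_state=0)                        # randomly pick the same number of normal samples
df_balanced = pd.concat([df_failed, df_normal])
df_balanced.to_csv('./dataset_2020/dataset_2020.csv', index=False)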

import gc
del df_data   # delete the df_data object
gc.collect()  # the next step will load new log data into df_data; reclaim memory manually here to avoid the risk of running out of memory, since JupyterLab does not automatically free memory while running

2.7 Load a data set with balanced categories

dataset_2020.csv is the hard disk SMART log data that has been processed by category balance. Let's load the file and confirm the category balance.
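
A minimal sketch of this step (the file path is an assumption based on the directory downloaded above):

import pandas as pd

df_data = pd.read_csv('./dataset_2020/dataset_2020.csv')
print(df_data['failure'].value_counts())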

As you can see, there are 1497 normal samples and 1497 failure samples.

3. Feature Engineering

After preparing a usable training set, the next step is feature engineering. In layman's terms, feature engineering means selecting which attributes in the table to use to build the machine learning model. The quality of manually designed features largely determines the effectiveness of a machine learning model, so researchers in this field spend a lot of effort on manual feature design, which is time-consuming, labor-intensive, and requires expert experience.

3.1 Related research on SMART attributes and hard disk failure

(1) Backblaze analyzed the correlation between hard disk failures and SMART attributes in its data centers and found that SMART 5, 187, 188, 197, and 198 have the highest correlation with failures; these SMART attributes are related to scan errors and reallocation/retry counts [1];
(2) El-Shimi et al. found that, in a random forest model, in addition to the above 5 features, the attributes SMART 9, 193, 194, 241, and 242 have the largest weights [2];
(3) Pitakrat et al. evaluated 21 machine learning algorithms for predicting hard disk failures and found that, among the 21 algorithms tested, the random forest algorithm had the largest area under the ROC curve, while the KNN classifier had the highest F1 score [3];
(4) Hughes et al. also studied machine learning methods for predicting hard disk failures. They analyzed the performance of SVM and Naive Bayes; SVM achieved the best result, with a detection rate of 50.6% and a false alarm rate of 0% [4].

[1] Klein, Andy. “What SMART Hard Disk Errors Actually Tell Us.” Backblaze Blog: Cloud Storage & Cloud Backup, 6 Oct. 2016, http://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
[2] El-Shimi, Ahmed. “Predicting Storage Failures.” VAULT - Linux Storage and File Systems Conference, 22 Mar. 2017, Cambridge.
[3] Pitakrat, Teerat, André van Hoorn, and Lars Grunske. “A comparison of machine learning algorithms for proactive hard disk drive failure detection.” Proceedings of the 4th International ACM SIGSOFT Symposium on Architecting Critical Systems. ACM, 2013.
[4] Hughes, Gordon F., et al. “Improved disk-drive failure warnings.” IEEE Transactions on Reliability 51.3 (2002): 350-357.

The above are some of the research results of predecessors. This case uses a random forest model, so based on result (2) above we can select SMART 5, 9, 187, 188, 193, 194, 197, 198, 241, and 242 as features. Their meanings are:
SMART 5: Reallocated sector count
SMART 9: Accumulated power-on hours
SMART 187: Reported uncorrectable errors
SMART 188: Command timeout count
SMART 193: Head load/unload cycle count
SMART 194: Temperature
SMART 197: Current pending sector count (sectors waiting to be remapped)
SMART 198: Uncorrectable sector count (errors that cannot be corrected by hardware ECC)
SMART 241: Total LBAs written
SMART 242: Total LBAs read

In addition, since different hard disk manufacturers and different hard disk models may record SMART log data according to different standards, it is best to select data from a single hard disk model as training data and train a model dedicated to predicting failures of that model. If you need to predict failures for several different models, you may need to train a separate model for each.

3.2 Hard disk model selection

Execute the following code to see how much data there is for each hard disk model.

df_data.model.value_counts()

ST12000NM0007                         664
ST4000DM000                           491
ST8000NM0055                          320
ST12000NM0008                         293
TOSHIBA MG07ACA14TA                   212
ST8000DM002                           195
HGST HMS5C4040BLE640                  193
HGST HUH721212ALN604                  153
TOSHIBA MQ01ABF050                     99
ST12000NM001G                          53
HGST HMS5C4040ALE640                   50
ST500LM012 HN                          40
TOSHIBA MQ01ABF050M                    35
HGST HUH721212ALE600                   34
ST10000NM0086                          29
ST14000NM001G                          23
HGST HUH721212ALE604                   21
ST500LM030                             15
HGST HUH728080ALE600                   14
Seagate BarraCuda SSD ZA250CM10002     12
WDC WD5000LPVX                         11
WDC WUH721414ALE6L4                    10
ST6000DX000                             9
TOSHIBA MD04ABA400V                     3
ST8000DM004                             2
ST18000NM000J                           2
Seagate SSD                             2
ST4000DM005                             2
ST8000DM005                             1
ST16000NM001G                           1
DELLBOSS VD                             1
TOSHIBA HDWF180                         1
HGST HDS5C4040ALE630                    1
HGST HUS726040ALE610                    1
WDC WD5000LPCX                          1
Name: model, dtype: int64

It can be seen that the ST12000NM0007 model has the most data, so we filter the data down to this model.

df_data_model = df_data[df_data['model'] == 'ST12000NM0007']

3.3 Feature selection

Select the 10 attributes mentioned above as features
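
A minimal sketch of this step, assuming the raw SMART columns are used and keeping the variable names used later in this case:

# The 10 SMART attributes chosen above; using the *_raw columns here is an assumption
SMART_FEATURES = [
    'smart_5_raw', 'smart_9_raw', 'smart_187_raw', 'smart_188_raw',
    'smart_193_raw', 'smart_194_raw', 'smart_197_raw', 'smart_198_raw',
    'smart_241_raw', 'smart_242_raw',
]

X_data = df_data_model[SMART_FEATURES]
Y_data = df_data_model['failure']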

The selected features contain null values, so the null values must be filled in first.
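
A minimal sketch of the null handling, assuming the nulls are simply filled with 0:

# Filling nulls with 0 is an assumption; other strategies (mean, median, etc.) are also possible
X_data = X_data.fillna(0)
print(X_data.isnull().sum().sum())  # should now print 0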

3.4 Divide training set and test set

Use sklearn's train_test_split to split the data into a training set and a test set. test_size is the proportion of the test set, usually 0.3, 0.2, or 0.1.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=0) 

4. Start training

4.1 Build the model

After preparing the training set and test set, you can start building the model. The steps are very simple: directly call the RandomForestClassifier provided by the machine learning framework sklearn.

from sklearn.ensemble import RandomForestClassifier 

rfc = RandomForestClassifier()

The random forest algorithm has many hyperparameters, and building the model with different parameter values will produce different training results. Beginners can simply use the default values provided by the library; once you have some understanding of the principles of the random forest algorithm, you can try modifying the model parameters to tune the training results, for example as sketched below.
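
A sketch with explicitly chosen hyperparameters (the values below are only illustrative, not tuned for this data):

# Illustrative hyperparameter values only; tune n_estimators and max_depth for your own data
rfc_tuned = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)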

4.2 Data fitting

Model training, that is, the process of fitting the training data, is also very simple to perform: call the fit function to start training.
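
A minimal sketch of this step:

rfc.fit(X_train, Y_train)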

/home/ma-user/anaconda3/envs/XGBoost-Sklearn/lib/python3.6/site-packages/sklearn/ensemble/forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

5. Start prediction

Call the predict function to start prediction

Y_pred = rfc.predict(X_test)

5.1 Compute prediction accuracy metrics

In machine learning, there are four commonly used performance metrics for classification problems: accuracy, precision, recall, and F1-score. The closer each of these metrics is to 1, the better the performance. The sklearn library provides functions for all four metrics, which can be called directly.

For a theoretical explanation of these four metrics, refer to this article.
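
A minimal sketch of computing the four metrics with sklearn on the test set predictions obtained above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('accuracy :', accuracy_score(Y_test, Y_pred))
print('precision:', precision_score(Y_test, Y_pred))
print('recall   :', recall_score(Y_test, Y_pred))
print('F1-score :', f1_score(Y_test, Y_pred))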

Each time the random forest model is retrained, the test metrics will be somewhat different. This is caused by the randomness in the training process of the random forest algorithm and is a normal phenomenon. However, for a given trained model and a given sample, the prediction result is deterministic.

5.2 Model saving, loading, and re-prediction

Model save

import pickle
with open('hdd_failure_pred.pkl', 'wb') as fw:
    pickle.dump(rfc, fw)

Model loading

with open('hdd_failure_pred.pkl', 'rb') as fr:
    new_rfc = pickle.load(fr)

Model prediction
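
A minimal sketch of predicting with the reloaded model; for the same samples it should produce the same results as the original rfc:

Y_pred_new = new_rfc.predict(X_test)
print((Y_pred_new == Y_pred).all())  # expected to print True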

5.3 View the confusion matrix

To analyze the performance of a classification model, you can also look at the confusion matrix. The horizontal axis of the confusion matrix represents the predicted classes, the vertical axis represents the true classes, and the value in each cell is the number of test samples with the corresponding combination of true and predicted class.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

LABELS = ['Healthy', 'Failed']
conf_matrix = confusion_matrix(Y_test, Y_pred)
plt.figure(figsize=(6, 6))
sns.heatmap(conf_matrix, xticklabels=LABELS,
            yticklabels=LABELS, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

[Figure: confusion matrix heatmap of the test set predictions]

6. Ideas to improve the model

The above demonstrates the process of building a hard disk failure prediction model with the random forest algorithm. The accuracy of the model is not high, and there are several ideas for improving it:
(1) This case only uses Backblaze's 2020 data; you can try using more training data;
(2) This case only uses 10 SMART attributes as features; you can try other ways of constructing features;
(3) This case uses the random forest algorithm to train the model; you can try other machine learning algorithms.

Click to enter the HUAWEI CLOUD ModelArts Codelab to run this case code directly
