Public account: You and the cabin | Author: Peter | Editor: Peter

Hello, I’m Peter

Today we bring you a hands-on machine learning article on industrial data: steel plate defect detection and classification based on machine learning classification algorithms.

The data in this article comes from the UCI Machine Learning Repository, a website that provides datasets specifically for machine learning: archive.ics.uci.edu/ml/index.ph…

The dataset covers seven types of steel plate faults (Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, and Other_Faults) and 27 features describing the defects.

The main knowledge points of this article:

Data and information

See the dataset page for details: archive.ics.uci.edu/ml/datasets…

Data preprocessing

Import data

In [1]:

import pandas as pd
import numpy as np

import plotly_express as px
import plotly.graph_objects as go
# subgraph
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt
import seaborn as sns 
sns.set_theme(style="whitegrid")
%matplotlib inline

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:

df = pd.read_excel("faults.xlsx")
df.head()

Out[2]:

Data segmentation

Separate the seven fault-type columns from the feature fields:

df1 = df.loc[:,"Pastry":]  # 7 Different types
df2 = df.loc[:,:"SigmoidOfAreas"]  # all feature fields

# Classification data
df1.head()  

Here’s the data for 27 features:

Classification label generation

A single label column is generated for the 7 different fault types:
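The code for this step is not shown in the original; here is a minimal sketch, assuming the class name is recovered from the seven one-hot fault columns with idxmax:

# Sketch (assumption): for each row, take the fault column whose value is 1 as the class name
df1["Label"] = df1.idxmax(axis=1)
df1["Label"].head()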

Type encoding

In [7]:

# The seven fault-type column names
columns = df1.columns[:7].tolist()

dic = {}
for i, v in enumerate(columns):
    dic[v] = i  # Category code starts at 0

dic

Out[7]:

{'Pastry': 0, 'Z_Scratch': 1, 'K_Scatch': 2, 'Stains': 3, 'Dirtiness': 4, 'Bumps': 5, 'Other_Faults': 6}

In [8]:

df1["Label"] = df1["Label"].map(dic)

df1.head()

Out[8]:

Data consolidation

In [9]:

df2["Label"] = df1["Label"]
df2.head()

EDA

Basic statistics of data

In [10]:

df2.isnull().sum()

The result shows no missing values:

Individual feature distribution

parameters = df2.columns[:-1].tolist()

sns.boxplot(data=df2, y="Steel_Plate_Thickness")
plt.show()

The box plot shows the distribution of values for a single feature. Box plots of the value distributions of all parameters are drawn below:

# Two basic parameters: set rows and columns
fig = make_subplots(rows=7, cols=4)  # 7 rows, 4 columns

# fig = go.Figure()
# Add one box trace per parameter to fill the grid of subplots

for i, v in enumerate(parameters):
    r = i // 4 + 1
    c = i % 4 + 1

    fig.add_trace(go.Box(y=df2[v].tolist(), name=v),
                  row=r, col=c)

fig.update_layout(width=1000, height=900)

fig.show()

Some conclusions:

  1. Feature values span a wide range, from negative values up to about 10M
  2. Some features have outliers
  3. Some features only take the values 0 and 1

Sample imbalance

Quantity per category

In [15]:

df2["Label"].value_counts()

Out[15]:

6    673
5    402
2    391
1    190
0    158
3     72
4     55
Name: Label, dtype: int64

You can see that category 6 has 673 samples, while category 4 has only 55. The classes are clearly imbalanced.

Resolving imbalance with SMOTE

In [16]:

X = df2.drop("Label",axis=1)
y = df2[["Label"]]

In [17]:

# Use SMOTE from imblearn
from imblearn.over_sampling import SMOTE

smo = SMOTE(random_state=42)
X_smo, y_smo = smo.fit_resample(X, y)
y_smo

Count the numbers in each category:
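The per-class counts after resampling are shown only as an image in the original; a small sketch to reproduce them (np.ravel handles y_smo whether it comes back as a DataFrame or an array; with the default SMOTE strategy, every class is brought up to the majority count of 673):

# Count samples per class after SMOTE
pd.Series(np.ravel(y_smo)).value_counts()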

Data normalization

The feature matrix is normalized with StandardScaler:

In [19]:

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

ss = StandardScaler()
data_ss = ss.fit_transform(X_smo)

# Restore to the original data
# origin_data = ss.inverse_transform(data_ss)

The normalized feature matrix:

In [21]:

df3 = pd.DataFrame(data_ss, columns=X_smo.columns)
df3.head()

Out[21]:

Add y_smo

In [22]:

df3["Label"] = y_smo
df3.head()

Modeling

Randomly shuffle the data

In [23]:

from sklearn.utils import shuffle
df3 = shuffle(df3)

Data set partitioning

In [24]:

X = df3.drop("Label",axis=1)
y = df3[["Label"]]

In [25]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

Modeling and Evaluation

Wrap modeling and evaluation in a function:

In [26]:

from sklearn.model_selection import cross_val_score  # Cross validation score
from sklearn import metrics  # Model evaluation


def build_model(model, X_test, y_test):

    # X_train and y_train come from the enclosing scope
    model.fit(X_train, y_train)
    # Predicted probability
    y_proba = model.predict_proba(X_test)
    # Take the index with the highest probability as the predicted class
    y_pred = np.argmax(y_proba, axis=1)
    y_test = np.array(y_test).reshape(-1)

    print(f"{model} model score:")
    print("Recall rate:", metrics.recall_score(y_test, y_pred, average="macro"))
    print("Precision:", metrics.precision_score(y_test, y_pred, average="macro"))

# Logistic regression (classification)
from sklearn.linear_model import LogisticRegression

# Build a model
model_LR = LogisticRegression()
# Call the function
build_model(model_LR, X_test, y_test)

LogisticRegression() model score:
Recall rate: 0.8247385525937151
Precision: 0.8126617210922679

Here is how to build each model separately:

Logistic regression

Modeling

In [28]:

# Logistic regression
from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import cross_val_score

# Build the model
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)

Out[28]:

LogisticRegression()

Prediction

In [29]:

# Predicted probability
y_proba = model_LR.predict_proba(X_test)
y_proba[:3]

Out[29]:

array([[4.83469692e-01, 4.23685363e-07, 1.08028560e-10, 3.19294899e-07,
        8.92035714e-02, 1.33695855e-02, 4.13956408e-01],
       [3.49120137e-03, 6.25018002e-03, 9.36037717e-03, 3.64702993e-01,
        1.96814910e-01, 1.35722642e-01, 2.83657697e-01],
       [1.82751269e-05, 5.55981861e-01, 3.16768568e-05, 4.90023258e-03,
        2.84504970e-03, 3.67190965e-01, 6.90319398e-02]])

In [30]:

# Take the index of the maximum probability as the predicted class
y_pred = np.argmax(y_proba, axis=1)
y_pred[:3]

Out[30]:

array([0, 3, 1])

Evaluation

In [31]:

# Confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
confusion_matrix

Out[31]:

array([[114,   6,   0,   0,   7,  11,  10],
       [  0, 114,   1,   0,   2,   4,   4],
       [  0,   1, 130,   0,   0,   0,   2],
       [  0,   0,   0, 140,   0,   1,   0],
       [  1,   0,   0,   0, 120,   3,   6],
       [ 13,   3,   2,   0,   3,  84,  11],
       [ 21,  13,   9,   2,   9,  25,  71]])

In [32]:

y_pred.shape

Out[32]:

(943,)

In [33]:

y_test = np.array(y_test).reshape(943)

In [34]:

print("Recall rate:",metrics.recall_score(y_test, y_pred, average="macro"))
print("Accuracy:",metrics.precision_score(y_test, y_pred, average="macro")) Recall rate:0.8247385525937151Accurate rate:0.8126617210922679
Copy the code

Random forest
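The code for this model is not shown in the original; a minimal sketch, assuming an sklearn RandomForestClassifier with default parameters evaluated through the build_model helper defined above:

# Sketch (assumption): random forest classifier with default parameters
from sklearn.ensemble import RandomForestClassifier

model_RF = RandomForestClassifier()
build_model(model_RF, X_test, y_test)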

SVM
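Again the original code is not shown; a sketch assuming an sklearn SVC, with probability=True so that predict_proba is available inside build_model:

# Sketch (assumption): support vector classifier; probability=True enables predict_proba
from sklearn.svm import SVC

model_SVC = SVC(probability=True)
build_model(model_SVC, X_test, y_test)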

Decision tree
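A sketch for this section, assuming an sklearn DecisionTreeClassifier with default parameters:

# Sketch (assumption): decision tree classifier with default parameters
from sklearn.tree import DecisionTreeClassifier

model_DT = DecisionTreeClassifier()
build_model(model_DT, X_test, y_test)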

Neural network
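A sketch for this section, assuming an sklearn MLPClassifier (max_iter raised so training converges):

# Sketch (assumption): multi-layer perceptron classifier
from sklearn.neural_network import MLPClassifier

model_MLP = MLPClassifier(max_iter=500)
build_model(model_MLP, X_test, y_test)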

GBDT

from sklearn.ensemble import GradientBoostingClassifier
gbdt = GradientBoostingClassifier(
# loss='deviance',
# learning_rate=1,
# n_estimators=5,
# subsample=1,
# min_samples_split=2,
# min_samples_leaf=1,
# max_depth=2,
# init=None,
# random_state=None,
# max_features=None,
# verbose=0,
# max_leaf_nodes=None,
# warm_start=False
)

gbdt.fit(X_train, y_train)

# Predicted probability
y_proba = gbdt.predict_proba(X_test)
# Index of maximum probability
y_pred = np.argmax(y_proba,axis=1)

print("Recall rate:",metrics.recall_score(y_test, y_pred, average="macro"))
print("Accuracy:",metrics.precision_score(y_test, y_pred, average="macro")) Recall rate:0.9034547294196564Accurate rate:0.9000750791353891
Copy the code

LightGBM
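The LightGBM code is not shown in the original; a minimal sketch, assuming the lightgbm package's LGBMClassifier with default parameters:

# Sketch (assumption): LightGBM classifier with default parameters
from lightgbm import LGBMClassifier

model_LGB = LGBMClassifier()
build_model(model_LGB, X_test, y_test)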

Results

Model                 Recall    Precision
Logistic regression   0.82473   0.8126
Random forest         0.9176    0.9149
SVM                   0.8897    0.8856
Decision tree         0.8698    0.8646
Neural network        0.8908    0.8863
GBDT                  0.9034    0.9
LightGBM              0.9363    0.9331

The results are clear:

  1. The ensemble learning models LightGBM, GBDT, and Random forest outperform the other models
  2. The LightGBM model performs best!