This article is the third in the machine learning series (and the eighth overall, counting the pre-machine learning articles). The concepts here are relatively simple, and the focus is on code practice. As mentioned in the previous article, we can use linear regression to make predictions, but real life presents not only prediction problems but also classification problems. We can distinguish them simply by the type of the predicted value: predicting a continuous variable is regression, and predicting a discrete variable is classification.

I. Logistic regression: binary classification

1.1 Understanding logistic regression

We artificially set a boundary on the continuous predicted values: one side of the boundary is labeled 1 and the other side 0. In this way we turn a regression problem into a classification problem.

As shown in the figure above, we compress the continuous values into the range 0–1 and take 0.5 as the classification boundary: if the probability is greater than 0.5, we predict 1; if it is less than 0.5, we predict 0.

We cannot do arithmetic with values ranging from negative infinity to positive infinity directly, but we can map them into the interval 0–1 with the logistic function (also called the sigmoid function or S-shaped function).


\sigma(x) = \frac{1}{1+e^{-x}}
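To build intuition, here is a minimal sketch that plots the sigmoid curve (the function name sigmoid and the plotting details are our own illustrative choices, not part of this article’s dataset code):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 200)
plt.plot(x, sigmoid(x))
plt.axhline(0.5, color='gray', linestyle='--')  # the 0.5 decision boundary
plt.title('Sigmoid function')
plt.show()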

So that’s a simple explanation of logistic regression. Let’s now apply it to real data and practice binary classification in code.

1.2 Code practice: importing the dataset

Import the required libraries:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Import the dataset (don’t worry about this domain name):

df = pd.read_csv('https://blog.caiyongji.com/assets/hearing_test.csv')
df.head()
age physical_score test_result
33 40.7 1
50 37.2 1
52 24.7 0
56 31 0
35 42.9 1

The dataset comes from an experiment with 5,000 participants that examined the effects of age and physical fitness on hearing loss, specifically the ability to hear high-pitched tones. Each participant was rated for physical ability and then took an audio test (pass/fail) assessing their ability to hear high frequencies.

  • Features: 1. age 2. physical score
  • Label: test_result (1 = pass, 0 = fail)

1.3 Observing the data

sns.scatterplot(x='age',y='physical_score',data=df,hue='test_result')

We use Seaborn to plot a scatter plot of the age and physical score features, colored by test result.

sns.pairplot(df,hue='test_result')

The pairplot method plots the pairwise relationships between features.

From these plots we can make a rough judgment: it is hard to pass the test past age 60, and participants who pass generally have a physical score above 30.
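As a quick numeric sanity check on that judgment, a sketch (assuming df is loaded as above):

# mean age and physical score for the failing (0) and passing (1) groups
df.groupby('test_result')[['age', 'physical_score']].mean()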

1.4 Training the model

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, plot_confusion_matrix

X = df.drop('test_result', axis=1)
y = df['test_result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=50)

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# define the model
log_model = LogisticRegression()
# train the model
log_model.fit(scaled_X_train, y_train)
# predict on the test set
y_pred = log_model.predict(scaled_X_test)
accuracy_score(y_test, y_pred)

After data preparation, we define a LogisticRegression model, fit the training data with the fit method, and then predict with the predict method. Finally, the accuracy_score method shows the model’s accuracy is 92.2%.

II. Model performance evaluation: accuracy, precision, and recall

How did we arrive at 92.2% accuracy? We call the plot_confusion_matrix method to draw the confusion matrix.

plot_confusion_matrix(log_model,scaled_X_test,y_test)

We observed 500 test instances and obtained the matrix as follows:

We define the four cells of the matrix as follows (see the sketch after this list for how to read these values programmatically):

  • True Positive (TP): the prediction is positive and the actual result is positive, e.g., 285 in the lower right corner of the figure.
  • True Negative (TN): the prediction is negative and the actual result is negative, e.g., 176 in the upper left corner.
  • False Positive (FP): the prediction is positive but the actual result is negative, e.g., 19 in the lower left corner.
  • False Negative (FN): the prediction is negative but the actual result is positive, e.g., 20 in the upper right corner.
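A sketch of how to read these four values programmatically (assuming y_test and y_pred from section 1.4; for binary 0/1 labels, sklearn’s confusion_matrix flattens in TN, FP, FN, TP order):

from sklearn.metrics import confusion_matrix

# rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)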

The formula for Accuracy is as follows:


Accuracy = \frac{TP+TN}{TP+TN+FP+FN}

In this example:


Accuracy = \frac{285+176}{285+176+20+19} = 0.922

The formula for Precision is as follows:


Precision = \frac{TP}{TP+FP}

In this example:


Precision = \frac{285}{285+19} = 0.9375

The formula of Recall is as follows:


Recall = \frac{TP}{TP+FN}

In this example:


Recall = \frac{285}{285+20} = 0.934
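A minimal sketch to verify this arithmetic, using the four counts read off the matrix above:

tp, tn, fp, fn = 285, 176, 19, 20
print('accuracy :', (tp + tn) / (tp + tn + fp + fn))  # 0.922
print('precision:', tp / (tp + fp))                   # 0.9375
print('recall   :', tp / (tp + fn))                   # about 0.934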

We call the classification_report method to verify these results.

print(classification_report(y_test,y_pred))

III. Softmax: multiclass classification

3.1 Understanding Softmax multinomial logistic regression

Logistic regression and Softmax regression are both classification models built on top of linear regression; there is no essential difference between them. Both are fitted by maximum (log-)likelihood estimation based on the Bernoulli distribution (generalized to the multinomial case for Softmax).

Maximum likelihood estimation: in simple terms, maximum likelihood estimation uses the known sample outcomes to infer the parameter values that are most likely (i.e., with maximum probability) to have produced those outcomes.

The terms “probability” and “likelihood” are often used interchangeably in English, but they have very different meanings in statistics. Given a statistical model with parameters θ, “probability” describes how plausible a future outcome x is (knowing the parameter values θ), while “likelihood” describes how plausible a particular set of parameter values θ is once the outcome x is known.
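A standard worked example (not tied to this article’s data): suppose we observe k successes in n independent Bernoulli trials. The likelihood of a parameter value p is

L(p) = p^k (1-p)^{n-k}

Setting the derivative of \log L(p) to zero yields the maximum likelihood estimate \hat{p} = \frac{k}{n}, i.e., exactly the observed frequency of success.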

The Softmax regression model first computes a score for each class, then applies the softmax function to these scores to estimate a probability for each class. We predict the class with the highest estimated probability, which is simply the class with the highest score.
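For reference, the softmax function maps a vector of class scores s to class probabilities:

\sigma(s)_k = \frac{e^{s_k}}{\sum_{j} e^{s_j}}

Below is a minimal NumPy sketch of this formula (subtracting max(s) is a standard numerical-stability trick and does not change the result):

import numpy as np

def softmax(s):
    exp_s = np.exp(s - np.max(s))  # shift scores for numerical stability
    return exp_s / exp_s.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))  # three probabilities summing to 1; the first class is predicted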

3.2 Code practice: importing the dataset

Import the dataset (don’t worry about this domain name):

df = pd.read_csv('https://blog.caiyongji.com/assets/iris.csv')
df.head()
sepal_length sepal_width petal_length petal_width species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa

This dataset contains 150 iris samples with four measurements each: the length and width of the petals and the length and width of the sepals. It covers three iris species: setosa, versicolor, and virginica.

  • Features: 1. sepal length 2. sepal width 3. petal length 4. petal width
  • Label: species (setosa, versicolor, virginica)

3.3 Observing the data

sns.scatterplot(x='sepal_length',y='sepal_width',data=df,hue='species')

We use Seaborn to plot a scatter plot of sepal length vs. sepal width, colored by iris species.

sns.scatterplot(x='petal_length',y='petal_width',data=df,hue='species')

We use Seaborn to plot a scatter plot of petal length vs. petal width, colored by iris species.

sns.pairplot(df,hue='species')

The pairplot method plots the pairwise relationships between features.

We can make a rough judgment: in terms of overall petal and sepal size, setosa is the smallest, versicolor is medium-sized, and virginica is the largest.
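A quick numeric check of that size ordering (a sketch, assuming df now holds the iris data):

# mean sepal and petal measurements per species
df.groupby('species').mean()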

3.4 Training the model

X = df.drop('species', axis=1)
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=50)

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# define a multinomial (softmax) model
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
# train the model
softmax_reg.fit(scaled_X_train, y_train)
# predict on the test set
y_pred = softmax_reg.predict(scaled_X_test)
accuracy_score(y_test, y_pred)

After data preparation, we define a multinomial LogisticRegression model by setting multi_class="multinomial", with the solver set to lbfgs. We fit the training data with the fit method and predict with the predict method. Finally, the accuracy_score method shows the model’s accuracy is 92.1%.

We call the classification_report method to check precision, recall, and accuracy.

print(classification_report(y_test,y_pred))
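To see section 3.1’s “predict the class with the highest estimated probability” in action, a sketch using the fitted model (softmax_reg and scaled_X_test as defined in section 3.4):

import numpy as np

# per-class probability estimates for the first test sample
proba = softmax_reg.predict_proba(scaled_X_test[:1])
print(proba.round(3))                          # three probabilities summing to 1
print(softmax_reg.classes_[np.argmax(proba)])  # the class with the highest probability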

3.5 Extension: plotting the petal-based classification

We extract only the petal length and petal width features to plot the classification regions of the irises.

X = df[['petal_length','petal_width']].to_numpy()
y = df['species'].factorize()[0]  # setosa=0, versicolor=1, virginica=2 (order of appearance)

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

# build a grid covering the petal feature space
x0, x1 = np.meshgrid(np.linspace(0, 8, 500).reshape(-1, 1),
                     np.linspace(0, 3.5, 200).reshape(-1, 1))
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)  # probability of versicolor
zz = y_predict.reshape(x0.shape)       # predicted class over the grid

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()

The resulting petal-based classification plot is as follows:

IV. Summary

This article focuses on hands-on practice rather than conceptual understanding, and you should get warmed up through actual coding. By the end of this article, you should be familiar with the basic concepts of machine learning. Let’s briefly summarize:

  1. Classification of machine learning
  2. Industrial processes for machine learning
  3. Concept of features, tags, instances, models
  4. Overfitting, underfitting
  5. Loss function, least squares method
  6. Gradient descent, learning rate

Linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, Lasso regression, and ElasticNet regression are the most commonly used regression techniques. We have also covered the sigmoid function, the Softmax function, and maximum likelihood estimation.

If anything is still unclear, please refer to:

  • Machine learning (2): Understand linear regression and gradient descent and make simple predictions
  • Machine learning (1): Understand and practice machine learning in 5 minutes
  • Pre-machine learning (5): Master common Matplotlib usage in 30 minutes
  • Pre-machine learning (4): Master common Pandas usage in 30 minutes
  • Pre-machine learning (3): Master common NumPy usage in 30 minutes
  • Pre-machine learning (2): Master common Jupyter Notebook usage in 30 minutes
  • Pre-machine learning (1): Mathematical symbols and Greek letters