Machine learning 009 - Solving multi-class classification problems with a logistic regression classifier

(Python libraries and versions used in this article: Python 3.5, NumPy 1.14, scikit-learn 0.19, Matplotlib 2.2)

Machine learning 008 explained how to solve binary classification problems with simple linear classifiers, but what about multi-class problems?

Here we introduce a classifier that can handle multi-class problems: logistic regression. Although the name contains the word "regression", logistic regression is used not only for regression analysis but also for classification. It is a widely used machine learning algorithm that estimates the probability that a sample belongs to a given category. For a deeper derivation of the logistic regression formulas, see https://blog.csdn.net/devotion987/article/details/78343834.
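To build a little intuition before the code (a minimal sketch, not a full derivation; see the link above for that): logistic regression passes a linear score through the sigmoid function to turn it into a probability between 0 and 1.

import numpy as np

def sigmoid(z):
    # Map a raw linear score w*x + b to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # roughly [0.018, 0.5, 0.982]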


1. Prepare the data set

Here we construct some simple samples as our data set. First we need to analyze the data set so that we have a clear understanding of its characteristics.

Prepare the data set first:

import numpy as np
import matplotlib.pyplot as plt

# Feature vectors
X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2],
              [1.2, 1.9], [6, 2], [5.7, 1.5], [5.4, 2.2]])  # custom data set
# Class labels
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # three categories

Draw the data points as a scatter plot, grouped by category:
class_0=np.array([feature for (feature,label) in zip(X,y) if label==0])
# print(class_0) # Make sure there are no problems
class_1=np.array([feature for (feature,label) in zip(X,y) if label==1])
# print(class_1)
class_2=np.array([feature for (feature,label) in zip(X,y) if label==2])
# print(class_2)

# drawing
plt.figure()
plt.scatter(class_0[:,0],class_0[:,1],marker='s',label='class_0')
plt.scatter(class_1[:,0],class_1[:,1],marker='x',label='class_1')
plt.scatter(class_2[:,0],class_2[:,1],marker='o',label='class_2')
plt.legend()

######################## Summary ########################

1. The y labels show that the data set contains three categories, and the scatter plot shows that the data points of each category cluster together. This is therefore a typical multi-class classification problem.

2. The data set is small (three samples per category) and has only two features. Moreover, the scatter plot shows that the categories are well separated, so they should be relatively easy to classify.

#########################################################


2. Construct the logistic regression classifier

The logistic regression classifier is very simple to build, as the following code shows. Let's first try classifying with the classifier's default parameters.

# Build a logistic regression classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=37)  # use the default parameters first
classifier.fit(X, y)  # train the logistic regression classifier
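With the trained model in hand we can already ask it for predictions. The sample below is a made-up point (it happens to lie near the class_0 cluster), used only to illustrate the call:

new_sample = np.array([[4, 6]])  # hypothetical point, not in the training set
print(classifier.predict(new_sample))  # likely prints [0], the nearby category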

Although we have built the logistic regression classifier and trained it on our data set, how do we check the result of training? We have no test set yet, so for now we plot the classifier's decision regions on the training set to get an intuitive view of the classification. To see this in a diagram, we define a helper function dedicated to plotting the classifier's effect, as follows.

Draw the classifier's decision regions:
def plot_classifier(classifier, X, y):
    # Compute the plotting range from the data, with a margin of 1.0
    x_min, x_max = min(X[:, 0]) - 1.0, max(X[:, 0]) + 1.0
    y_min, y_max = min(X[:, 1]) - 1.0, max(X[:, 1]) + 1.0
    step_size = 0.01  # grid step size
    # Build a dense grid of points covering the whole plotting range
    x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size),
                                     np.arange(y_min, y_max, step_size))
    # Predict a label for every grid point, then reshape back to the grid
    mesh_output = classifier.predict(np.c_[x_values.ravel(), y_values.ravel()])
    mesh_output = mesh_output.reshape(x_values.shape)
    plt.figure()
    plt.pcolormesh(x_values, y_values, mesh_output, cmap=plt.cm.gray)  # color the regions by predicted class
    plt.scatter(X[:, 0], X[:, 1], c=y, s=80, edgecolors='black', linewidth=1, cmap=plt.cm.Paired)  # overlay the training points
    # specify the boundaries of the figure
    plt.xlim(x_values.min(), x_values.max())
    plt.ylim(y_values.min(), y_values.max())

    # specify the ticks on the X and Y axes
    plt.xticks((np.arange(int(min(X[:, 0])- 1), int(max(X[:, 0]) +1), 1.0)))
    plt.yticks((np.arange(int(min(X[:, 1])- 1), int(max(X[:, 1]) +1), 1.0)))

    plt.show()

Then call the plotting function directly to check the classification effect of the logistic regression classifier on the training set.

plot_classifier(classifier, X, y)
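Besides eyeballing the plot, a quick numeric sanity check is the mean accuracy on the training set (optimistic by construction, since the model has already seen these points):

print('Training accuracy:', classifier.score(X, y))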

######################## Summary ########################

1. You can easily define and train a logistic regression model with the LogisticRegression class from sklearn.linear_model.

2. Because we used the classifier's default parameters rather than the most suitable ones, the classification result is not the best achievable. For example, as the figure shows, the model can distinguish the three categories, but it can clearly be optimized further.

#########################################################
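For the curious: by default, scikit-learn's LogisticRegression handles our three categories with a one-vs-rest scheme, fitting one binary classifier per category. A quick way to confirm this is to look at the shape of the learned parameters:

print(classifier.coef_.shape)       # (3, 2): one weight row per category, one column per feature
print(classifier.intercept_.shape)  # (3,): one intercept per category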


3. Optimize the classification model

The logistic regression classifier has two particularly important parameters: solver and C. solver selects the algorithm used to solve the underlying optimization problem, while C sets the penalty for classification errors (formally, it is the inverse of the regularization strength). The larger C is, the more heavily classification errors are punished, i.e. the less acceptable misclassified training points become.
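As a side note, here is a minimal sketch of switching solvers (parameter names as documented by scikit-learn 0.19): the default 'liblinear' solver only supports one-vs-rest, while 'lbfgs', 'newton-cg', 'sag' and 'saga' can also fit a true multinomial model.

# A multinomial variant of the same classifier (a sketch, not tuned)
classifier_mn = LogisticRegression(solver='lbfgs', multi_class='multinomial',
                                   random_state=37)
classifier_mn.fit(X, y)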

Here, as a starting point, we optimize only the value of C. Below we pick a handful of C values, plot the classification result for each, and judge by eye which one works best. Of course, it would be more scientific to evaluate each parameter setting on a test set with proper evaluation metrics; a sketch of that approach follows the loop below.

# Optimize parameter C in the model
for c in [1, 5, 20, 50, 100, 200, 500]:
    classifier = LogisticRegression(C=c, random_state=37)
    classifier.fit(X, y)
    plot_classifier(classifier, X, y)
# The larger C is, the sharper the separation between categories becomes.
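As a sketch of the "more scientific" route mentioned above: hold out part of the data and score each C on the unseen portion. With only nine samples the numbers are barely meaningful, so treat this purely as a template:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=37, stratify=y)  # one test sample per category
for c in [1, 5, 20, 50, 100, 200, 500]:
    clf = LogisticRegression(C=c, random_state=37).fit(X_train, y_train)
    print('C={}: test accuracy={:.2f}'.format(c, clf.score(X_test, y_test)))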

######################## Summary ########################

1. Optimizing a model is laborious, hands-on work, and it is where machine learning skill is tested most. Here, as an introduction, we optimized only a single parameter of the logistic regression classifier.

2. The larger the C value of the logistic regression classifier, the more sharply the model separates the categories, which matches our expectation. Does that mean we should simply set a very large C from the start? Not necessarily: an overly large C lets the model chase individual training points and risks overfitting. A sketch of a more systematic way to choose C follows this summary.

#########################################################
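Rather than guessing ever-larger C values by hand, the choice can be automated with a cross-validated grid search over C (again only a sketch, since three-fold cross-validation on nine samples is barely meaningful):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [1, 5, 20, 50, 100, 200, 500]}
search = GridSearchCV(LogisticRegression(random_state=37), param_grid, cv=3)
search.fit(X, y)
print('Best C found:', search.best_params_['C'])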


Note: the code for this article has been uploaded to (my GitHub); you are welcome to download it.

References:

1. Classic Examples of Python Machine Learning (the Chinese edition of Python Machine Learning Cookbook), by Prateek Joshi, translated by Tao Junjie and Chen Xiaoli