Logistic regression is a widely used supervised machine learning method. Although its name contains the word "regression", it is actually a classification method. This article introduces how to use logistic regression for classification.

First, I will introduce the basic principles of logistic regression.

Figure 1. Graph of the logistic function

Logistic regression gets the "logistic" in its name because it uses the logistic function (also known as the sigmoid function), whose form is Equation (1) in Figure 2 and whose graph is shown in Figure 1:

y = 1 / (1 + e^(-t))    (1)

Since logistic regression is a classification method, we will consider the simplest case, binary classification, where the output is labeled y = 0 or y = 1. The predicted value produced by linear regression is z = ω^T x + b. Letting t = z and substituting into Equation (1) gives Equation (2):

y = 1 / (1 + e^(-(ω^T x + b)))    (2)

If y is regarded as the probability of a positive example, then 1 - y is the probability of a negative example, and the ratio of the two is called the odds; taking the logarithm gives the "log odds" of Equation (3):

ln(y / (1 - y)) = ω^T x + b    (3)

We then solve for ω and b using maximum likelihood estimation. Treating y as the posterior probability p(y = 1 | x), we obtain Equations (4) and (5) in Figure 3:

p(y = 1 | x) = e^(ω^T x + b) / (1 + e^(ω^T x + b))    (4)
p(y = 0 | x) = 1 / (1 + e^(ω^T x + b))    (5)

Letting β = (ω; b) and x̂ = (x; 1), the log-likelihood of Equation (6) can be written:

ℓ(ω, b) = Σ_{i=1}^{m} ln p(y_i | x_i; ω, b)    (6)

From Equation (6), Equations (7), (8) and (9) in Figure 4 follow: writing p_1(x̂; β) = p(y = 1 | x̂; β) and p_0(x̂; β) = 1 - p_1(x̂; β) as in (7), the per-sample likelihood of (8) is

p(y_i | x_i; ω, b) = y_i p_1(x̂_i; β) + (1 - y_i) p_0(x̂_i; β)    (8)

and substituting it into (6) yields the objective function (9), whose minimization gives the optimal parameters:

ℓ(β) = Σ_{i=1}^{m} ( -y_i β^T x̂_i + ln(1 + e^(β^T x̂_i)) )    (9)

These derivations are fairly involved, and only the main steps are listed here; if you are interested, you may consult the relevant references yourself. A small numerical sketch of Equations (1) and (9) follows the figure captions below.

Figure 2. Logistic regression derivation, Equations (1)-(3)

Figure 3. Logistic regression derivation, Equations (4)-(6)

Figure 4. Logistic regression derivation, Equations (7)-(9)
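
To make the derivation more tangible, here is a minimal numerical sketch (an addition of mine, not from the original derivation) of the sigmoid of Equation (1) and the objective function of Equation (9); the toy values of beta, X_hat and y are made up purely for illustration.

import numpy as np

def sigmoid(t):
    # Equation (1): the logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-t))

def objective(beta, X_hat, y):
    # Equation (9): sum over samples of -y_i * beta^T x_i + ln(1 + e^(beta^T x_i))
    z = X_hat @ beta
    return np.sum(-y * z + np.log(1.0 + np.exp(z)))

# Toy data: each row of X_hat is (x; 1), beta is (omega; b)
X_hat = np.array([[0.5, 1.0], [-1.2, 1.0]])
y = np.array([1, 0])
beta = np.array([2.0, -0.5])
print(sigmoid(X_hat @ beta))      # predicted probabilities of the positive class
print(objective(beta, X_hat, y))  # value of the objective at this beta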

Now that we understand the basic principles, let's illustrate the use of logistic regression with an example.

The logistic regression model used in this article comes from scikit-learn, as does the dataset, as shown below.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a toy dataset: 100 points, 2 informative features, 2 classes
X, y = make_classification(n_samples=100, n_features=2,
                           n_informative=2, n_redundant=0,
                           n_clusters_per_class=1,
                           class_sep=2.0, random_state=15)
fig, ax = plt.subplots(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Figure 5. Data points used in this example

The result is shown in Figure 5. This dataset is generated with the make_classification method: there are 100 points with two features (dimensions), and the data is divided into two classes. It can be seen that the purple points fall into one class and the yellow points into the other. Next, we split the dataset into a training set and a test set. The code is as follows.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=30)
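
As a quick sanity check (my own addition), printing the shapes confirms the 70/30 split:

print(X_train.shape, X_test.shape)  # expected: (70, 2) (30, 2)
print(y_train.shape, y_test.shape)  # expected: (70,) (30,)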

In this case, test_size=0.30 puts 30% of the data, that is 30 points, in the test set, and random_state is set to 30, a value we can choose at will. Next, we train the logistic regression model and make predictions, reporting the results with the classification_report method.

# Train the model and evaluate it on the test set
model = LogisticRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print(classification_report(y_test, y_predict))

The result is shown in Figure 6. As can be seen from Figure 6, the accuracy of this model is 0.97. Since there are 30 test points in total, this means only one prediction is wrong, indicating that the model classifies quite well.
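
If you would like to see exactly which point was misclassified, a short check such as the following (my own addition) will do:

# Index of the test sample whose prediction disagrees with the true label
wrong = np.where(y_predict != y_test)[0]
print(wrong, X_test[wrong], y_test[wrong], y_predict[wrong])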

Figure 6. Model results report

Then, to give you a further sense of the model's classification behavior, let's go a step further and look at the classification boundary of the logistic regression model, that is, where the model starts to divide the plane. The code is as follows.

step = 0.01
x_min = X[:, 0].min() - 1
x_max = X[:, 0].max() + 1
y_min = X[:, 1].min() - 1
y_max = X[:, 1].max() + 1
x_mesh, y_mesh = np.meshgrid(np.arange(x_min, x_max, step),
                             np.arange(y_min, y_max, step))
# Flatten the mesh into a 2-D array with one grid point per row
data_mesh = np.stack([x_mesh.ravel(), y_mesh.ravel()], axis=-1)
Z = model.predict(data_mesh).reshape(x_mesh.shape)
fig, ax = plt.subplots(figsize=(8, 6))
plt.pcolormesh(x_mesh, y_mesh, Z, cmap=plt.cm.cool)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.ocean)
plt.show()

This code is a little complicated, so let me explain. The design idea is as follows: since the logistic regression model used here is a binary classifier, the results fall into two classes, so we mark the region the model assigns to each class with a color, giving two colored areas; a point that falls into an area belongs to that area's class. x_mesh, y_mesh = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step)) generates a grid of points covering the region, where x_min, x_max, y_min and y_max are the region's boundaries, chosen slightly larger than the range of the dataset we are using. data_mesh = np.stack([x_mesh.ravel(), y_mesh.ravel()], axis=-1) flattens the grid into a two-dimensional array with one point per row, and Z = model.predict(data_mesh) is the predicted class of each point in the region. plt.pcolormesh and plt.scatter then draw the color of the regions and the color of the data points respectively, so we can clearly see which region each point falls into. The result is shown in Figure 7.

Figure 7. Different colors represent the different decision regions

It can be seen from the result that one green point falls into the wrong region, meaning that this point is predicted incorrectly, which is consistent with the classification_report result obtained earlier.
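
Beyond the hard class regions, you can also look at how confident the model is near the boundary. The sketch below is my own variation on the code above; it uses scikit-learn's predict_proba method to color the grid by the predicted probability of the positive class instead of the hard class label.

# Probability of class 1 at each grid point, reshaped back to the mesh
proba = model.predict_proba(data_mesh)[:, 1].reshape(x_mesh.shape)
fig, ax = plt.subplots(figsize=(8, 6))
plt.pcolormesh(x_mesh, y_mesh, proba, cmap=plt.cm.cool)
plt.colorbar(label='P(y=1)')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.ocean)
plt.show()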

Logistic regression is widely used in machine learning and works well, but it also has some disadvantages: it cannot solve nonlinear problems, it is sensitive to multicollinearity in the data, it has difficulty handling imbalanced data, and so on. Its underlying theory is also considerably more complex than what the author introduces here; readers who want a deeper understanding can look up the relevant material on their own. For the imbalance issue in particular, see the short example below.
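
For instance, for imbalanced data, scikit-learn's LogisticRegression accepts a class_weight parameter; a minimal sketch of that mitigation (not used in this article's example) would be:

# Reweight samples inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced')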

About the author: Mort, a data analysis enthusiast, is good at data visualization, focuses on machine learning, and hopes to learn and exchange ideas with more friends in the industry.
