• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/3…
  • This paper addresses: www.showmeai.tech/article-det…
  • Statement: All rights reserved; please contact the platform and the author, and cite the source, before reposting

Introduction

In this article we introduce one of the most common models in machine learning: logistic regression. It is also by far the most widely used baseline solution for classification problems in industry. Logistic regression is popular because it is simple, effective, and interpretable.

The structure of this paper is as follows:

  • Part 1: A review of machine learning and classification problems. We revisit one of the most important problem types in machine learning, the classification problem, its variants, and its mathematical abstraction.

  • Part 2: The core idea of logistic regression. We introduce the path from linear regression to logistic regression and explain the core idea behind logistic regression.

  • Part 3: The Sigmoid function and classifier decision boundaries. We introduce the Sigmoid transformation function at the heart of the logistic regression model and the decision boundaries produced by different classifiers.

  • Part 4: The gradient descent algorithm used for model optimization. Gradient descent is the most commonly used optimization algorithm for learning model parameters.

  • Part 5: Model overfitting and regularization. We describe how to analyze the state of a model, the overfitting problem, and regularization methods that alleviate overfitting.

  • Part 6: Feature transformation and nonlinear segmentation. We describe feature transformations, such as constructing polynomial features, that take a linear classifier into nonlinear classification scenarios and give it the ability to perform nonlinear segmentation.

(This article on logistic regression assumes some basic machine learning knowledge; readers without that background can first read the ShowMeAI article Illustrated Machine Learning | Machine Learning Basics.)

1. Machine learning and classification

1) Classification problems

Classification problem is a very important part of machine learning. Its goal is to determine which category a sample belongs to based on certain characteristics of a known sample. The classification problem can be subdivided as follows:

  • Binary classification problem: the classification task has two categories, and the goal is to determine which of the two known classes a new sample belongs to.

  • Multiclass Classification problem: Indicates that there are multiple categories in the Classification task.

  • Multilabel Classification problem: each sample is assigned a set of target labels.

2) Mathematical abstraction of classification problems

To solve a classification problem algorithmically, the training data are mapped to sample points in an n-dimensional space (where n is the feature dimension). What we need to do is partition the points in this n-dimensional sample space so that each point is assigned to one of the categories.

The following figure shows two types of sample points in a two-dimensional plane. Our model (classifier) is learning a method to distinguish different categories. For example, a straight line is used to segment two types of sample points.

There are many common application scenarios for classification problems. We choose several examples to illustrate them:

  • Spam identification: as a binary classification problem, messages can be classified as “spam” or “normal”.

  • Image content recognition: since the image content may be of more than one type (cats, dogs, people, and so on), this is a multi-class classification problem.

  • Text sentiment analysis: it can be treated as a binary classification problem that divides sentiment into positive and negative, or as a multi-class classification problem with more sentiment levels, such as very negative, negative, positive, and very positive.

2. The core idea of logistic regression algorithm

Next, we introduce the logistic regression algorithm. Logistic regression is an extension of linear regression for handling classification problems.

1) Linear regression and classification

Classification problems and regression problems are similar to some extent. Both of them predict unknown results by learning data sets. The difference lies in the different output values.

  • The output values for the classification problem are discrete values (such as spam and normal mail).

  • The output value of a regression problem is a continuous value (such as the price of a house).

Since classification and regression problems are similar to some extent, can we perform classification on the basis of regression?

One possible approach is to perform a linear fit first and then quantize the fitted continuous output into a discrete value, that is, to solve the classification problem with “linear regression + a threshold”.

Let’s look at an example. Suppose we have a data set of tumor sizes, and we need to determine whether each tumor is benign (represented by 0) or malignant (represented by 1) based on its size. This is a classic binary classification problem.

As shown in the figure above, we can make an intuitive judgment in this simple scene: if the tumor size is greater than 5, the tumor is malignant (output 1); if it is less than 5, the tumor is benign (output 0). Now we try the idea mentioned earlier and fit the data with the univariate linear function $h(x) = \theta_0 + \theta_1 x$, shown as the black line in the figure.

In this way, the classification problem can be transformed into: for this linear fitting hypothesis function, given the size of a tumor, simply substitute it into the hypothesis function and compare its output value with 0.5:

  • If the linear regression value is greater than 0.5, output 1 (malignant tumor).

  • If the linear regression value is less than 0.5, output 0 (benign tumor).

The classification problem in the data set above is perfectly solved. However, if we change the data set as shown in the figure and still use 0.5 as the threshold, we will misjudge the tumor of size 6 as benign.

Therefore, simply comparing the output of a linear fit against a fixed threshold is a very unstable way to classify.
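To make this instability concrete, here is a minimal NumPy sketch of the “linear regression + threshold” idea on made-up 1D data; the sample values and the added outlier are illustrative, not taken from the figure.

```python
import numpy as np

# Illustrative 1D data (tumor size -> 0 benign / 1 malignant); not the figure's exact values
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Least-squares fit of h(x) = theta0 + theta1 * x
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classify by thresholding the regression output at 0.5
print((X @ theta >= 0.5).astype(int))    # [0 0 0 0 1 1 1 1] -- separates the two groups

# Add one extreme malignant sample; the refitted line flattens and the rule breaks
x2, y2 = np.append(x, 50.0), np.append(y, 1.0)
X2 = np.column_stack([np.ones_like(x2), x2])
theta2, *_ = np.linalg.lstsq(X2, y2, rcond=None)
print((X2 @ theta2 >= 0.5).astype(int))  # the sample of size 6 is now misjudged as benign
```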

2) The core idea of logistic regression

Since the “linear regression + threshold” approach struggles to yield a classifier with good robustness, we extend it to obtain the more robust Logistic Regression (also called “log-odds regression” in some texts). Logistic regression fits the data through a logistic function and thereby predicts the probability of an event.

The output of linear regression is a continuous value whose range is unbounded, so we cannot derive a stable decision threshold from it directly. Could we map this result into a fixed interval (such as 0 to 1) and then make the judgment?

Yes. That is exactly what logistic regression does, and the function used to compress and transform the continuous values is called the Sigmoid function.

The mathematical expression of the Sigmoid function is:

$$S(x) = \frac{1}{1+e^{-x}}$$

As you can see, the output of the Sigmoid function always lies between 0 and 1.
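As a quick illustration, a minimal NumPy implementation of the Sigmoid function could look like the sketch below (the helper name sigmoid is our own):

```python
import numpy as np

def sigmoid(x):
    """S(x) = 1 / (1 + e^(-x)): squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# ~[0.000045, 0.269, 0.5, 0.731, 0.99995]
```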

3. Sigmoid function and decision boundaries

Now that we have seen the Sigmoid function, let us look at how it combines with linear fitting to solve classification problems and to produce a clear, interpretable classifier “decision boundary”.

1) Classification and decision boundaries

Decision boundaries are the boundaries on which the classifier distinguishes the samples, mainly including linear decision boundaries and non-linear decision boundaries, as shown in the figure below.

2) Linear decision boundary generation

So how does logistic regression produce a decision boundary, and how is this related to the Sigmoid function?

Here’s an example:

If we denote the Sigmoid function by $g$, logistic regression produces its output through the hypothesis function $h_{\theta}(x) = g\left(\theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2}\right)$.

For the example in the figure, we temporarily set the parameters $\theta_{0}, \theta_{1}, \theta_{2}$ to $-3$, $1$, and $1$ respectively. Then, for the two types of sample points in the figure, let us plug some coordinates into $h_{\theta}(x)$.

  • For points above the line, $\left(x_{1}, x_{2}\right)$ (e.g. $\left(100, 100\right)$), $-3 + x_{1} + x_{2}$ is greater than 0, and the Sigmoid mapping yields a value greater than 0.5.

  • For points below the line, $\left(x_{1}, x_{2}\right)$ (e.g. $\left(0, 0\right)$), $-3 + x_{1} + x_{2}$ is less than 0, and the Sigmoid mapping yields a value less than 0.5.

If we use 0.5 as the decision threshold, the fitted line $-3 + x_{1} + x_{2} = 0$ becomes the decision boundary (in this case, a linear decision boundary).
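A small sketch of this example, using the parameter values chosen above and a hypothetical helper h for the hypothesis function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta0, theta1, theta2 = -3.0, 1.0, 1.0   # the parameter values chosen above

def h(x1, x2):
    """h_theta(x) = g(theta0 + theta1*x1 + theta2*x2)"""
    return sigmoid(theta0 + theta1 * x1 + theta2 * x2)

print(h(100, 100))  # ~1.0   -> above the line -3 + x1 + x2 = 0, classified as 1
print(h(0, 0))      # ~0.047 -> below the line, classified as 0
```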

3) Non-linear decision boundary generation

In fact, we can obtain not only linear decision boundaries: when $h_{\theta}(x)$ is more complex, we can even obtain nonlinear decision boundaries for separating the samples.

Here is another example: if we denote the Sigmoid function by $g$, logistic regression produces its output through the hypothesis function $h_{\theta}(x) = g\left(\theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \theta_{3} x_{1}^{2} + \theta_{4} x_{2}^{2}\right)$.

For the example in the figure, we temporarily set the parameters $\theta_{0}, \theta_{1}, \theta_{2}, \theta_{3}, \theta_{4}$ to $-1$, $0$, $0$, $1$, and $1$ respectively. Then, for the two types of sample points in the figure, what do we get if we plug some coordinates into $h_{\theta}(x)$?

  • For points outside the circle, $\left(x_{1}, x_{2}\right)$ (e.g. $\left(100, 100\right)$), plugging into $\theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \theta_{3} x_{1}^{2} + \theta_{4} x_{2}^{2}$ gives a value greater than 0, and the Sigmoid mapping yields a value greater than 0.5.

  • For points inside the circle, $\left(x_{1}, x_{2}\right)$ (e.g. $\left(0, 0\right)$), plugging into $\theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \theta_{3} x_{1}^{2} + \theta_{4} x_{2}^{2}$ gives a value less than 0, and the Sigmoid mapping yields a value less than 0.5.

If we use 0.5 as the decision threshold, the fitted circle $-1 + x_{1}^{2} + x_{2}^{2} = 0$ becomes the decision boundary (in this case a nonlinear decision boundary).
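The same kind of check for the circular boundary, again as an illustrative sketch with the parameter values above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(x1, x2):
    """h_theta(x) with theta = (-1, 0, 0, 1, 1), i.e. g(-1 + x1^2 + x2^2)."""
    return sigmoid(-1.0 + x1 ** 2 + x2 ** 2)

print(h(100, 100))  # ~1.0   -> outside the circle x1^2 + x2^2 = 1, classified as 1
print(h(0, 0))      # ~0.27  -> inside the circle, classified as 0
```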

4. Gradient descent and optimization

1) Loss function

In the previous examples, we manually chose some values of $\theta$ and obtained decision boundaries. But as you can clearly see, different parameter values give different decision boundaries.

Which decision boundary is best? We need to define a function that quantifies the quality of the model — the loss function (sometimes called the “objective function” or “cost function”). Our goal is to minimize the loss function.

The simplest and most direct way to measure the difference between the predicted value and the true answer is the mean squared error: for all sample points $x_{i}$, take the squared difference between the predicted value $h_{\theta}(x_{i})$ and the true answer $y_{i}$, then average over the samples. The smaller this value, the smaller the difference.


$$MSE=\frac{1}{m} \sum_{i=1}^{m}\left(f\left(x_{i}\right)-y_{i}\right)^{2}$$
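For reference, this formula takes only a few lines to compute; a sketch with our own helper name mse:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error between predictions f(x_i) and answers y_i."""
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

print(mse([0.9, 0.2, 0.4], [1.0, 0.0, 0.0]))  # 0.07
```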

The loss function corresponding to the mean squared error (MSE) is widely used to define and optimize the loss for regression problems, but it is not suitable for logistic regression. If we build a loss function from the Sigmoid-transformed hypothesis in this way, we get the curve shown in the figure below: it is very bumpy and not smooth, which in mathematical terms is a non-convex loss function (for more on loss functions and convex optimization, see the ShowMeAI article Illustrated AI Math Basics | Calculus and Optimization), and it is difficult to find the optimal parameters (the parameters that minimize the function value).

Explanation: In the scenario of logistic regression model, the loss function obtained by using MSE is non-convex, and its mathematical characteristics are not very good. We expect the loss function to be convex as follows. In convex optimization problems, the local optimal solution is also the global optimal solution, which makes convex optimization problems easier to solve in a sense, while general non-convex optimization problems are more difficult to solve.

We would prefer the loss function to be convex, as shown in the figure below, because then we have mathematically well-behaved optimization methods for minimizing it.

In the logistic regression setting, we therefore use the logarithmic loss function (binary cross-entropy loss) instead. It measures the quality of the parameters well and keeps the loss function convex. The formula of the logarithmic loss function is as follows:


$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$$

where $y^{(i)}$ is the sample label, equal to 1 for a positive sample and 0 for a negative sample. Let's look at the two cases:


$$\operatorname{Cost}\left(h_{\theta}(x), y\right)=\left\{\begin{array}{ll} -\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{array}\right.$$
  • When $y^{(i)}=0$: the sample is negative, so if $h_{\theta}(x)$ is close to 1 (i.e. the prediction is positive), then $-\log \left(1-h_{\theta}(x)\right)$ is large and the model is heavily penalized.

  • When $y^{(i)}=1$: the sample is positive, so if $h_{\theta}(x)$ is close to 0 (i.e. the prediction is negative), then $-\log \left(h_{\theta}(x)\right)$ is large and the model is heavily penalized.
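Below is a small sketch of the logarithmic loss on made-up predicted probabilities; the clipping constant eps is our own safeguard against log(0), not part of the formula.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy J(theta) for predicted probabilities h_theta(x)."""
    p = np.clip(y_prob, eps, 1 - eps)   # keep log() away from 0
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 1.0, 0.0, 0.0])
confident_right = np.array([0.9, 0.8, 0.1, 0.2])
confident_wrong = np.array([0.1, 0.2, 0.9, 0.8])
print(log_loss(y, confident_right))  # ~0.16 (small loss)
print(log_loss(y, confident_wrong))  # ~1.96 (heavily penalized)
```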

2) Gradient descent

The loss function can be used to measure the quality of model parameters, but we still need some optimization methods to find the best parameters (minimize the current loss function value). One of the most common algorithms is the “gradient descent method”, which iterates gradually to reduce the loss function (very easy to use in convex function scenarios). Like going down a hill, find the direction (slope), one small step at a time, until you reach the bottom of the hill.

Gradient descent is a first-order optimization algorithm, often also called the method of steepest descent. To find a local minimum of a function with gradient descent, we iteratively take steps of a specified size from the current point in the direction opposite to the gradient (or approximate gradient) at that point.

In the figure above, $\alpha$ is called the learning rate; intuitively, it is the step size of each move toward the minimum. If it is too large, the update tends to overshoot the minimum; if it is too small, too many iterations are needed.
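As an illustration, here is a minimal batch gradient descent loop for logistic regression on synthetic data; the data generation, learning rate, and iteration count are arbitrary choices for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic 2D data: the label depends on whether x1 + x2 is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
Xb = np.column_stack([np.ones(len(X)), X])   # prepend a bias column for theta_0

theta = np.zeros(Xb.shape[1])
alpha = 0.1                                  # learning rate

for _ in range(1000):
    p = sigmoid(Xb @ theta)                  # h_theta(x) for every sample
    grad = Xb.T @ (p - y) / len(y)           # gradient of the log loss J(theta)
    theta -= alpha * grad                    # one small step against the gradient

print("learned theta:", theta)               # roughly proportional to (0, 1, 1)
```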

(For more on loss functions and convex optimization, see the ShowMeAI article Illustrated AI Math Basics | Calculus and Optimization, http://www.showmeai.tech/article-detail/165; for more summaries of supervised learning, see the ShowMeAI AI Knowledge & Skills Quick Reference | Machine Learning - Supervised Learning.)

5. Regularization and mitigation of overfitting

1) Over-fitting phenomenon

When the training data is not enough or the model is complex and overtrained, the model will fall into the state of Overfitting. As shown in the figure below, different fitting curves (decision boundaries) obtained represent different model states:

  • Fitting curve 1 can correctly classify some samples, but there are still a large number of samples that are not correctly classified and the classification accuracy is low, which is the state of “under-fitting”.

  • Fitting curve 2 can correctly classify most samples and has sufficient generalization ability, so it is a better fitting curve.

  • Fitting curve 3 separates the current samples very well, but when a new sample arrives there is a high probability it will be misclassified, because the decision boundary tries too hard to learn the current sample points, even “memorizing” them outright.

The “jitter” in a fitting curve indicates that the curve is irregular and not smooth (fitting curve 3 in the figure above): the model has learned the training data too deeply and has overfit.

2) Regularization

Regularization is one way to deal with overfitting. By adding a regularization term to the loss function, we constrain the search space of the parameters, which keeps the fitted decision boundary from fluctuating wildly. Adding a regularization term to the logarithmic loss function gives the following (here, an L2 regularization term):


$$J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)-\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]+\frac{\lambda}{2 m} \sum_{j=1}^{n} \theta_{j}^{2}$$

where $\lambda$ is the regularization coefficient, which controls the strength of the penalty. The larger $\lambda$ is, the smaller the absolute values of the parameters $\theta$ must be to keep $J(\theta)$ small; this usually corresponds to a smoother, and therefore simpler, function, so overfitting is less likely to occur. We can still use gradient descent to optimize the regularized loss function.
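Compared with the unregularized loss, only the loss value and the gradient change (by convention the bias term $\theta_{0}$ is not penalized). A minimal sketch, with our own helper name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_loss_and_grad(theta, X, y, lam):
    """Log loss plus the L2 penalty (lambda / 2m) * sum_j theta_j^2 (bias excluded)."""
    m = len(y)
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    reg = np.r_[0.0, theta[1:]]              # theta_0 is conventionally not penalized
    loss = (-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
            + lam / (2 * m) * np.sum(reg ** 2))
    grad = X.T @ (p - y) / m + (lam / m) * reg
    return loss, grad
```

The gradient descent update itself, theta -= alpha * grad, stays exactly the same as before.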

6. Feature transformation and nonlinear expression

1) Polynomial features

If we fit the input features linearly and feed the result directly to the Sigmoid function, we obtain a linear decision boundary. By adding polynomial features, however, the sample points can be fitted with polynomial regression, which yields better nonlinear decision boundaries.

In polynomial regression, the regression function is a polynomial of the regression variables. A polynomial regression model is still a kind of linear regression model, because the regression function is linear in the regression coefficients.

In practical applications, increasing model complexity by adding nonlinear features of the input data is usually effective. A simple and general approach is to use polynomial features, which yield higher-order features and interaction terms between features and often lead to better results.
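For example, a hand-rolled degree-2 expansion for two input features might look like the sketch below; libraries such as scikit-learn's PolynomialFeatures automate this for arbitrary degrees.

```python
import numpy as np

def add_degree2_features(X):
    """Expand (x1, x2) into (x1, x2, x1^2, x2^2, x1*x2): a hand-rolled degree-2 expansion."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

X = np.array([[0.0, 0.0],
              [1.0, 2.0]])
print(add_degree2_features(X))
# [[0. 0. 0. 0. 0.]
#  [1. 2. 1. 4. 2.]]
```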

2) Nonlinear segmentation

As shown in the figure below, in logistic regression the fitted decision boundary can be turned into a nonlinear decision boundary by adding polynomial features, enabling nonlinear segmentation.

  • $Z_{\theta}(x)$ denotes the fitted function, where $\theta$ are the parameters. When $Z_{\theta}(x)=\theta_{0}+\theta_{1} x$, the decision boundary is linear;

  • When $Z_{\theta}(x)=\theta_{0}+\theta_{1} x+\theta_{2} x^{2}$, using polynomial features, the decision boundary is nonlinear.

The underlying idea is that data which is not linearly separable in a low-dimensional space may become linearly separable in a higher-dimensional space. The linear separating surface found in the high-dimensional space, mapped back to the low-dimensional space, appears as a nonlinear separating boundary.
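A tiny illustration of this idea with made-up 1D data: the two classes cannot be separated by a single threshold on $x$, but after adding the feature $x^{2}$ they become linearly separable.

```python
import numpy as np

# Made-up 1D data where the class depends on |x|: no single threshold on x separates it
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Map each sample to (x, x^2); in this 2D space the line x^2 = 1.5 separates the classes
phi = np.column_stack([x, x ** 2])
print((phi[:, 1] > 1.5).astype(int))   # [1 1 0 0 0 1 1] -- matches y
# Mapped back to 1D, that boundary is |x| = sqrt(1.5): a nonlinear split with two thresholds
```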

For more summaries of supervised learning algorithms and models, see the ShowMeAI AI Knowledge & Skills Quick Reference | Machine Learning - Supervised Learning.

Video tutorial

You can watch the [bilingual subtitles] version of the video on Bilibili:

[MIT, bilingual + data download] 6.036 | Introduction to Machine Learning (2020, complete version)

www.bilibili.com/video/BV1y4…

ShowMeAI related articles recommended

  • 1. Machine learning basics
  • 2. Model evaluation methods and criteria
  • 3.KNN algorithm and its application
  • 4. Detailed explanation of logistic regression algorithm
  • 5. Detailed explanation of naive Bayes algorithm
  • 6. Detailed explanation of decision tree model
  • 7. Detailed explanation of random forest classification model
  • 8. Detailed analysis of regression tree model
  • 9. Detailed explanation of GBDT model
  • 10. Detailed explanation of the XGBoost model
  • 11. Detailed explanation of LightGBM model
  • 12. Detailed explanation of support vector machine model
  • 13. Detailed explanation of clustering algorithm
  • 14.PCA dimension reduction algorithm in detail

ShowMeAI series tutorials recommended

  • Illustrated Python programming: From beginner to Master series of tutorials
  • Illustrated Data Analysis: From beginner to master series of tutorials
  • The mathematical Basics of AI: From beginner to Master series of tutorials
  • Illustrated Big Data Technology: From beginner to master
  • Illustrated Machine learning algorithms: Beginner to Master series of tutorials