This is the seventh day of my participation in the August More text Challenge. For details, see:August is more challenging

preface

This is the first of a series of notes in Ng’s machine learning course on linear regression and logistic regression, and the differences between them.


LinearRegression

Element linear regression

It’s a regression problem, so let’s look at the definition of the cost function.

Cost function

Linear square cost function:


h Theta. ( x ) = Theta. 0 + Theta. 1 x h_{\theta}(x)=\theta_0+\theta_1x

Sum of squares of modeling errors:


J ( Theta. 0 . Theta. 1 ) = 1 2 m i = 1 m ( h Theta. ( x ( i ) ) y ( i ) ) 2 J(\theta_0,\theta_1)=\dfrac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2

Several concepts:

  • Sum of the Squared Errors (SSE) : used to indicate the performance of function fitting

  • Mean square error (MSE) : the expected value of the squared difference between the estimated value of a parameter and the true value of a parameter.

Gradient descent

Based on the classic downhill problem, when you are at a certain point on a mountain, how to choose the best direction to go down the mountain, and how big to go with each step. Gradient descent is designed to minimize the cost function.

Algorithm idea:

  • Initially select a parameter combination and calculate the cost function
  • Then find the parameter combination that causes the cost function value to decrease the most, and calculate the cost function likewise
  • Continue the process until the target value is the local minimum

Batch Gradient Descent

Formula (assuming two parameters) : Theta equals theta j j – alpha partial partial theta jJ (theta 0, theta j) \ theta_j = \ theta_j – \ alpha \ dfrac {\ partial} {\ partial \ theta_j} J (, \ \ theta_0 theta_j) theta equals theta J J – alpha partial theta J partial J (theta 0, theta. J)

  • Alpha alpha is the Learing rate, which determines how far down we go in the direction of the maximum reduction of the cost function. In batch gradient descent, we subtract the learning rate times the derivative of the cost function from all parameters simultaneously each time.
  • For multiple θ\thetaθ, the values must be updated synchronously, and the learning rate will also be updated, because the closer we get to the extreme point, the smaller our learning rate, that is, the steps we take down the mountain, the more accurate the result will be. In fact, you want the partial derivative to be zero, so that it converges to the local optimal solution.

How to select α\alphaα value?

  • Based on the above equation, we can see that if α\alphaα is too small, each step is very small and moves closer to the lowest point bit by bit, such an iteration will take a long time.
  • If α\alphaα is too large, then the pace of movement is too large, and the lowest point may be crossed, and the next iteration will be moved again and again, and the lowest point will be crossed again and again, so that convergence will not be possible.
  • In general practical application, select a group of intervals such as [0.01,0.1,1,10][0.01,0.1,1,10][0.01,0.1,1,10] and try many times, and then select a more appropriate interval according to the results.

In gradient descent method, when we are close to local lows, gradient descent will automatically take smaller amplitude, this is because when we are close to local lows, apparently in local minimum derivative is equal to zero, so when we are close to the local minimum, derivative value will automatically become smaller and smaller, so the gradient descent will automatically take smaller amplitude, So alpha \alpha alpha doesn’t have to be reduced at this point.

Linear regression of gradient descent

Gradient descent algorithm:


R e p e a t    u n t i l    c o n v e r g e n c e { Theta. j = Theta. j Alpha. partial partial Theta. 0 J ( Theta. 0 . Theta. 1 ) f o r    j = 0    a n d    1 } \begin{aligned} &Repeat\; until\; convergence\{ \\ & \quad\theta_j=\theta_j-\alpha\dfrac{\partial}{\partial\theta_0}J(\theta_0,\theta_1) \\ & \quad for\; j=0\; and\; 1\\ &\} \end{aligned}

Take the derivative of the cost function:


partial partial Theta. j J ( Theta. 0 . Theta. 1 ) = partial partial Theta. j 1 2 m i = 1 m ( h Theta. ( x ( i ) ) y ( i ) ) 2 \dfrac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)=\dfrac{\partial}{\partial\theta_j}\frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2


j = 0 . partial partial Theta. 0 J ( Theta. 0 . Theta. 1 ) = 1 m i = 1 m ( h Theta. ( x ( i ) ) y ( i ) ) . Theta. 0 = Theta. 0 Alpha. 1 m i = 1 m ( h Theta. ( x ( i ) ) y ( i ) ) j=0,\quad\dfrac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)}),\quad\theta_0=\theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})


j = 1 . partial partial Theta. 1 J ( Theta. 0 . Theta. 1 ) = 1 m i = 1 m ( h Theta. ( x ( i ) ) y ( i ) ) x ( i ) . Theta. 1 = Theta. 1 Alpha. 1 m i = 1 m ( h Theta. ( x ( i ) ) y ( i ) ) x ( i ) j=1,\quad\dfrac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)},\quad\theta_1=\theta_1-\alpha\frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}

We can also use ** “least squares” ** to solve the model. In linear regression, the least squares method is an attempt to find a line that is the smallest sum of Euclidean distances from all samples to the line, i.e. J(θ0,θ1)J(\theta_0,\theta_1)J(θ0,θ1) in the model.

Solution: take the derivative of θ0,θ1\theta_0,\theta_1 \theta_0, θ1 respectively and let the formula be 0, so as to solve the optimal solution of θ0,θ1\theta_0,\theta_1\0,θ1.

Multiple linear regression

In fact, it is a single feature => a multidimensional feature, expressed as a vector

Assuming that the number of features is NNN, the eigenmatrix is as follows:


X = x ( 1 ) : x ( 2 ) : x ( 3 ) : . . . Characteristics of the 1 Characteristics of the 2 Characteristics of the 3 Characteristics of the 4 1 0.2 41 6 2 0.6 35 4 5 0.5 60 8 . . . . . . . . . . . . = [ x ( 1 ) T x ( 2 ) T x ( 3 ) T . . . x ( n ) T ] X=\begin{matrix}\\\\x^{(1)}:\\x^{(2)}:\\x^{(3)}:\\… {matrix} \ \ end begin {matrix} \ \ \ text characteristic} {1 & \ text characteristic} {2 & \ text features {} 3 & \ text features {} 4 \ \ 1 & 2 & 0.2 & 41 & 6 \ \ 0.6 & 35 & 4 \ \ \ \ 5 & & & 60 0.5 8… &… &… &… \end{matrix}=\begin{bmatrix}x^{(1)^T}\\x^{(2)^T}\\x^{(3)^T}\\… \\x^{(n)^T}\\\end{bmatrix}

X (I)x^{(I)}x(I) is the third eigeninstance, the third row of the eigenmatrix, is a vector, Such as x (2) = [20.6354] x ^ {} (2) = 2 \ \ begin {bmatrix} newline0.6 \ newline35 \ newline4 \ end (2) = {bmatrix} x [20.6354]

The multivariable HHH is expressed as:


h Theta. ( x ) = Theta. 0 + Theta. 1 x + Theta. 2 x 2 + . . . + Theta. n x n h_{\theta}(x)=\theta_0+\theta_1x+\theta_2x_2+… +\theta_nx_n

It can be seen that the parameter is a vector with dimension N +1n+1n+1, and the dimension of the eigenmatrix is m∗(n+1)m*(n+1)m∗(n+1). For the convenience of calculation, x0= 1X_0 =1×0=1 is introduced, that is, the formula is:


h Theta. ( x ) = Theta. 0 x 0 + Theta. 1 x 1 + Theta. 2 x 2 + . . . + Theta. n x n h_{\theta}(x)=\theta_0x_0+\theta_1x_1+\theta_2x_2+… +\theta_nx_n

Make theta equals theta n] [theta. Theta 0 1… \ theta = \ begin {bmatrix} \ theta_0 \ newline \ theta_1 \ newline… {bmatrix} \ newline \ theta_n \ end theta equals theta n] [theta. Theta 0 1…

H θ(x)=θTxh_{\theta}(x)=\theta^Txhθ(x)=θTx

Gradient descent

Expanding on the gradient descent of linear regression,


partial partial Theta. j J ( Theta. j ) = partial partial Theta. j 1 2 m i = 1 m ( h Theta. ( x ( i ) ) y ( i ) ) 2 \dfrac{\partial}{\partial\theta_j}J(\theta_j)=\dfrac{\partial}{\partial\theta_j}\frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2

Feature scaling: scale each feature approximately to [−1,1][-1,1][−1,1]

Mean normalization: normalization


  • x i = x i mu i m a x m i n x_i = \dfrac{x_i-\mu_i}{max-min}

  • i > 1 . because x 0 = 1 I > 1, \ text {for} x_0 = 1

Normal equations

Assuming that the training set result is vector yyy, after introducing x0x_0x0, the eigenmatrix can also be written as:


X = Characteristics of the 1 Characteristics of the 2 Characteristics of the 3 Characteristics of the 4 1 0.2 41 6 2 0.6 35 4 5 0.5 60 8 . . . . . . . . . . . . = 1 x ( 1 ) T 1 x ( 2 ) T 1 x ( 3 ) T 1 . . . X = \ begin {matrix} \ \ \ text characteristic} {1 & \ text characteristic} {2 & \ text features {} 3 & \ text features {} 4 \ \ 1 & 2 & 0.2 & 41 & 6 \ \ 0.6 & 35 & 4 \ \ \ \ 5 & & & 60 0.5 8… &… &… &… \end{matrix}=\begin{matrix}\\\\1&x^{(1)^T}\\1&x^{(2)^T}\\1&x^{(3)^T}\\1&… \end{matrix}

Write the above cost function in matrix form:


J ( Theta. ) = 1 2 m ( X Theta. y ) T ( X Theta. y ) J(\theta)=\frac{1}{2m}(X\theta-y)^T(X\theta-y)

In fact, the value of parameter θ\thetaθ which makes J(θ)J(\theta)J(θ) the smallest is solved as follows:


Theta. = a r g m i n ( X Theta. y ) T ( X Theta. y ) \theta = argmin(X\theta-y)^T(X\theta-y)

Make E theta equals theta (X – y) T E_ (X theta – y) = {\ theta} (X \ theta – y) ^ T (X \ theta – y) E theta equals theta (X – y) T (X theta – y), the theta \ theta theta derivation can solve:


Theta. = ( X T X ) 1 X T y \theta = (X^TX)^{-1}X^Ty

Comparison of gradient descent with normal equations

Gradient descent Normal equations
You need to choose the learning rate
Alpha. \alpha
No study rate is required
Alpha. \alpha
Multiple iterations are required No iteration is required, and matrix operations can be performed in a single line of code
Under multiple characteristics, the adaptability is good The computational complexity of the matrix inverse is O(
n 3 n^3
), so if the feature dimension is too high (especially over 10000 dimension), the operation cost is too high, and this method should not be considered again
Can be applied to some more complex algorithms, suitable for all types of models The matrix needs to be invertible, and, for some more complex algorithms, this method does not work. Only suitable for linear regression model, not suitable for logistic regression model and other models

Logistic Regression

It’s a classification problem. For classification problems, data are generally discrete, and it is not a good method to apply linear regression to deal with classification problems. Therefore, a classical algorithm for classification problems — logistic regression is introduced below.

The essence of logistic regression is to assume that the data follow this distribution, and then use maximum likelihood estimation to estimate the parameters, so that the variable range of the output is always between 0 and 1. In the following, the algorithm is introduced by taking dichotomies as an example.

The distribution function of logistic regression is SigmoidSigmoidSigmoid function: g(z)=11+e−zg(z)=\dfrac{1}{1+e^{-z}}g(z)=1+e−z1. The image of this function is as follows:

[External link image dump failed, the source site may have anti-link mechanism, it is recommended to save the image directly upload (img-CB4IBxex-1628157342810)(/imgs/Sigmoid function image.jpg)]

It can be seen that this function converts the input value to the probability value, so we can construct the logistic regression model by using this hypothesis function:


h Theta. ( x ) = g ( Theta. T x ) = 1 1 + e Theta. T x h_{\theta}(x)=g(\theta^{T}x)=\dfrac{1}{1+e^{-\theta^{T}x}}

Where, XXX represents the eigenvector, and θ\thetaθ is the parameter we want to take.


P ( y = 1 x ; Theta. ) = h Theta. ( x ) P(y=1|x; \theta)=h_{\theta}(x)

That is, the probability of predicting y=1y=1y=1 given XXX and θ\thetaθ

The decision function can be obtained:


y ( i ) = 1 . if  P ( y = 1 x ) > 0.5 Y (I) = 1, \ text {if} P (y | x = 1) > 0.5

0.50.50.5 is the threshold we choose. Of course, other values can be selected according to the actual situation. For example, if the prediction of positive examples is higher, it can be greater than 0.5

The decision boundary

This concept is introduced by using professor Ng’s powerpoint lecture

Linear decision boundary

Nonlinear boundary

Cost function

The cost function of logistic regression is defined as:


J ( Theta. ) = 1 m i = 1 m C o s t ( h Theta. ( x ( i ) ) y ( i ) ) J(\theta)=\frac{1}{m}\sum\limits_{i=1}^m Cost(h_{\theta}(x^{(i)})-y^{(i)})

Among them:


C o s t ( h Theta. ( x ) . y ) = { l o g ( h Theta. ( x ) ) y = 1 when l o g ( 1 h Theta. ( x ) ) y = 0 when Cost (h_ {\ theta} (x), y) = \ begin {cases} – log (h_ {\ theta} (x)) & y = 1 \ text {the} \ \ – log (1 – h_ {\ theta} (x)) & y = 0 \ text {cases} {the} \ end

Theta Cost (h (x), y) Cost (h_ {\ theta} (x), y) Cost (h theta (x), y) functions of features are:

  • When actual y = 1 y = 1 y = 1, theta h (x) = 1 h_ {\ theta} (x) = 1 h theta (x) = 1 when the price is 0, theta when h (x) h_ {\ theta} (x) h theta (x) not for 111, The cost increases as hθ(x)h_{\theta}(x)hθ(x) decreases;
  • When the actual y = 0, y = 0, y = 0, h theta h_ (x) = 0 {\ theta} (x) = 0 h theta (x) = 0 when the price is 0, when h theta (x) h_ {\ theta} (x) h theta (x) not for 000, The cost increases as hθ(x)h_{\theta}(x)hθ(x) increases;

The image is as follows:Can be further simplified:


J ( Theta. ) = 1 m i = 1 m [ y ( i ) l o g ( h Theta. ( x ) ) + ( 1 y ( i ) ) l o g ( 1 h Theta. ( x ) ) ] J(\theta)=-\frac{1}{m}\sum\limits_{i=1}^m [y^{(i)}log(h_{\theta}(x))+(1-y^{(i)})log(1-h_{\theta}(x))]

Use gradient descent

The derivative of the cost function J(θ)J(\theta)J(θ) can be obtained


partial partial Theta. J ( Theta. ) = 1 m i = 1 m ( h Theta. ( x ( i ) ) y ( i ) ) x ( i ) \dfrac{\partial}{\partial\theta}J(\theta)=\frac{1}{m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}

Same for theta equals theta j j – alpha 1 m ∑ I = 1 m (h theta – y (x (I)) (I)) \ theta_j = \ theta_j – \ alpha \ frac {1} {m} \ sum \ limits_ {I = 1} ^ m (h_ {\ theta} (x ^ {(I)}) – y ^ {(I)}) theta equals theta j j – alpha m1i = 1 ∑ m (h theta – y (x (I)) (I)) continually update each parameter values, to beg to function and minimum value, namely derivative value to 000 theta \ theta theta.

Multiple classification problem

One-vs-all, also known as one-VS-the Rest method, is usually adopted to realize multi-classification, which transforms the multi-classification problem into multiple dichotomies problem. In fact, it is to take turns to treat one feature as a positive sample and the rest as negative samples, and then conduct model training in the same way as the dichotomies. If the sample feature number is NNN, a total of N −1n-1n−1 decision boundaries will be obtained.

For example: given the input XXX, calculate hθ(I)(x), I =1,2… , nh_ ^ {\ theta} {(I)} (x), I = 1, 2,… , nh theta (I) (x), I = 1, 2,… ,n, and then compare, if hθ(k)(x)h_{\theta}^{(k)}(x)hθ(k)(x) is closest to 111, it is predicted that XXX belongs to KKK.

Overfitting and underfitting

What is overfitting?

Underfitting: the fitting degree is not high and the data is far from the fitting curve

Overfitting: Excessive fitting, which appears to fit almost every piece of data, but loses information. In machine learning, it is often found that the degree of training set fitting is too high and too accurate, and such a model has no practical prediction significance.

regularization

The existing overfitting problem has been mentioned above, so the methods to solve the overfitting are generally as follows:

  • Reduce the number of features. This obviously destroys the integrity of the feature.
  • Retain the feature, but weaken the coefficient θ I \theta_iθ I of some higher order terms. We call this weakening the punishment of the parameter θ I \theta_iθ I.

Regularization is the process of weakening higher-order features.

Regularization in linear regression

On the basis of linear regression, we introduce the parameter λ\lambdaλ to realize the regularization penalty. Change the cost function to:


J ( Theta. ) = 1 2 m i = 1 m [ ( h Theta. ( x ( i ) ) y ( i ) ) 2 + Lambda. i = 1 n Theta. j 2 ] = 1 2 m [ ( X Theta. y ) T ( X Theta. y ) + Lambda. j = 1 n Theta. j 2 ] J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^m [(h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda\sum\limits_{i=1}^{n}\theta_j^2 ]\\ =\frac{1}{2m}[(X\theta-y)^T(X\theta-y)+\lambda\sum\limits_{j=1}^{n}\theta_j^2]

The larger λ\lambda lambda is, the greater the regularization penalty, and the more overfitting can be avoided. But not too big, because as the parameter goes to zero, you might end up with a straight line.

At the same time, there are corresponding changes in gradient descent:


R e p e a t    u n t i l    c o n v e r g e n c e { Theta. 0 = Theta. 0 Alpha. 1 m i = 1 m [ ( h Theta. ( x ( i ) ) y ( i ) ) x 0 ( i ) ] Theta. j = Theta. j Alpha. 1 m i = 1 m [ ( h Theta. ( x ( i ) ) y ( i ) ) x ( i ) + Lambda. m Theta. j ] f o r   j = 1 . 2 . . . . n } \begin{aligned}&Repeat\; until\; convergence\{ \\ & \quad \theta_0=\theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m [(h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)}] \\ & \quad \theta_j=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m [(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)} + \ frac {\ lambda} {m} \ \ \ & theta_j] \ quad for \ j = 1, 2,… ,n\\ &\}\end{aligned}

It can be further simplified as:


Theta. j = Theta. j ( 1 Alpha. Lambda. m ) Alpha. 1 m i = 1 m [ h Theta. ( x ( i ) ) y ( i ) ] x j ( i ) \ theta_j = \ theta_j (1 – \ alpha \ dfrac {\ lambda} {m}) – \ alpha \ dfrac {1} {m} \ sum \ limits_ {I = 1} ^ {m} [h_ {\ theta} (x ^ {(I)}) – y ^ {(I)}] x ^ {(I) }_j

It can be seen that the change of the gradient descent algorithm of the regularized linear regression is that the θ\thetaθ value is reduced by an additional value each time based on the original algorithm’s update rules, that is, the θ\theta theta theta value is also decreased each time the θ\theta theta is updated in the gradient descent.

If a normal equation is used, it can also be solved as follows:


Theta. = ( X T X + Lambda. [ 0 . . . . . . 0 0 1 . . . 0 . . . . . . . . . . . . 0 0 . . . 1 ] ) 1 X T y \theta= (X^TX+\lambda \begin{bmatrix}0&… &… \ \ & 0 0 and 1 &… & 0 \ \… &… &… &… \ \ 0 & 0 &… &1\end{bmatrix})^{-1}X^Ty

Regularization in logistic regression

Similarly, for logistic regression, we also add a regularized expression to the cost function to obtain the cost function:


J ( Theta. ) = 1 m i = 1 m [ y ( i ) l o g ( h Theta. ( x ) ) ( 1 y ( i ) ) l o g ( 1 h Theta. ( x ) ) ] + Lambda. 2 m j = 1 n Theta. j 2 J(\theta)=\frac{1}{m}\sum\limits_{i=1}^m [-y^{(i)}log(h_{\theta}(x))-(1-y^{(i)})log(1-h_{\theta}(x))]+\frac{\lambda}{2m}\sum\limits_{j=1}^{n}\theta_j^2

Similarly, the gradient descent algorithm is:


R e p e a t    u n t i l    c o n v e r g e n c e { Theta. 0 = Theta. 0 Alpha. 1 m i = 1 m [ ( h Theta. ( x ( i ) ) y ( i ) ) x 0 ( i ) ] Theta. j = Theta. j Alpha. 1 m i = 1 m [ ( h Theta. ( x ( i ) ) y ( i ) ) x ( i ) + Lambda. m Theta. j ] f o r   j = 1 . 2 . . . . n } \begin{aligned}&Repeat\; until\; convergence\{ \\ & \quad \theta_0=\theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m [(h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)}] \\ & \quad \theta_j=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m [(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)} + \ frac {\ lambda} {m} \ \ \ & theta_j] \ quad for \ j = 1, 2,… ,n\\ &\}\end{aligned}

Note that θ0\theta_0θ0 is not involved in any of these regularization.

summary

The difference between linear regression and logistic regression

Both linear regression and logistic regression are generalized linear models. Linear regression assumes that the dependent variable Y follows a Gaussian distribution, while logistic regression assumes that the dependent variable Y follows a Bernoulli distribution. Linear regression is a regression task and logistic regression is a classification task.

Although the cost function of logistic regression and linear regression looks the same after taking the derivative, the assumed function hθ(x)h_{\theta}(x)hθ(x) of the two are different.

  • Linear regression: hθ(x)=θTxh_{\theta}(x)=\theta^{T}xhθ(x)=θTx
  • Logistic regression: h theta (theta Tx) h_ (x) = g {\ theta} (x) = g (\ theta ^ {T} (x) h theta (x) = g (theta Tx), including g (z) g (z) g (z) as SigmoidSigmoidSigmoid function

Code implementation (Github) attached with homework:

  • ex1-Linear Regression
  • ex2-Logistic Regression