
Regularization of linear regression

We have discussed two algorithms for linear regression before:

  • Gradient descent
  • Normal equations

Regularized linear regression


$$J(\theta)=\frac{1}{2 m}\left[\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\lambda \sum_{j=1}^{n} \theta_{j}^{2}\right]$$
$$\min_{\theta} J(\theta)$$

For the cost function above, we want to find the $\theta$ that minimizes it.

Gradient descent

Recall that ordinary (unregularized) gradient descent looks like this:

Repeat {


$$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \qquad (j=0,1,2,3,\ldots,n)$$

}

To regularize this, we take the update for $\theta_0$ out on its own and add the penalty term to the updates for the remaining parameters, because, as we said before, the penalty term starts at $\theta_1$.

Repeat {


$$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$$


$$\theta_{j}:=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}+\frac{\lambda}{m} \theta_{j}\right] \qquad (j=1,2,3,\ldots,n)$$

}

The $\theta_j$ update can be simplified as shown below. Note that $1-\alpha \frac{\lambda}{m}$ is a number slightly less than 1, so each iteration first shrinks $\theta_j$ a little and then performs the usual gradient step:

Repeat {


$$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$$


$$\theta_{j}:=\theta_{j}\left(1-\alpha \frac{\lambda}{m}\right)-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \qquad (j=1,2,3,\ldots,n)$$

}
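To make the update concrete, here is a minimal Octave sketch of one iteration of this regularized update, vectorized over all parameters. The variable names X (the m × (n+1) design matrix with a leading column of ones), y, theta, alpha, and lambda are my own choices for illustration, not from the original post:

    m = size(X, 1);
    h = X * theta;                       % predictions h_theta(x^(i)) for every example
    grad = (1 / m) * (X' * (h - y));     % unregularized gradient for all j
    reg = (lambda / m) * theta;          % penalty term lambda/m * theta_j
    reg(1) = 0;                          % theta_0 (index 1 in Octave) is not penalized
    theta = theta - alpha * (grad + reg);

Zeroing the first entry of the penalty vector reproduces the separate update for $\theta_0$ without writing a second formula.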

Normal equation

The original normal equation:


$$\theta=\left(X^{T} X\right)^{-1} X^{T} y$$

where the rows of $X$ are the training examples and $y$ stacks the targets:

$$X=\begin{bmatrix} \left(x^{(1)}\right)^{T} \\ \left(x^{(2)}\right)^{T} \\ \vdots \\ \left(x^{(m)}\right)^{T} \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \qquad y=\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$$

After adding regularization, the normal equation becomes:


$$\theta=\left(X^{T} X+\lambda\begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}\right)^{-1} X^{T} y$$

where the diagonal matrix $\begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$ has dimension $(n+1) \times (n+1)$; its first diagonal entry is 0 because $\theta_0$ is not penalized.
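In code, the regularized normal equation is nearly a one-liner. A minimal Octave sketch, again assuming X, y, and lambda as inputs:

    p = size(X, 2);                      % p = n + 1 parameters
    L = eye(p);
    L(1, 1) = 0;                         % do not penalize theta_0
    theta = (X' * X + lambda * L) \ (X' * y);

Using the backslash operator solves the linear system directly, which is numerically preferable to forming the explicit inverse.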

Regularization of logistic regression

Cost function of logistic regression:


$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$$

We likewise add a regularization term at the end, which gives:


$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]+\frac{\lambda}{2 m} \sum_{j=1}^{n} \theta_{j}^{2}$$

Even if you fit many high-order parameters, adding the regularization term keeps the parameters small, so you still get a reasonable decision boundary.

Gradient descent

We already know that the gradient descent updates for linear regression and logistic regression look the same in form, so we can carry the regularized linear regression updates over directly:

Repeat {


$$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$$


$$\theta_{j}:=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}+\frac{\lambda}{m} \theta_{j}\right] \qquad (j=1,2,3,\ldots,n)$$

}

To regularize logistic regression, we add the same penalty term $\frac{\lambda}{m} \theta_{j}$ to the second update, and the result looks identical:

Repeat {


$$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$$


$$\theta_{j}:=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}+\frac{\lambda}{m} \theta_{j}\right] \qquad (j=1,2,3,\ldots,n)$$

}

It looks the same as linear regression, but it is important to remember that the two hypotheses $h_\theta(x)$ are different: for logistic regression, $h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T} x}}$.
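In code, the only change from the linear regression sketch above is the hypothesis. A minimal Octave sketch of one regularized update step for logistic regression, with the same assumed variable names:

    m = size(X, 1);
    h = 1 ./ (1 + exp(-X * theta));      % sigmoid hypothesis instead of X * theta
    grad = (1 / m) * (X' * (h - y));
    reg = (lambda / m) * theta;
    reg(1) = 0;                          % theta_0 is not penalized
    theta = theta - alpha * (grad + reg);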

Advanced optimization

When we discussed logistic regression, we mentioned advanced optimization algorithms other than gradient descent, but did not go into detail. So how do we use regularization with those advanced algorithms? The general template looks like this:


    function [jVal, gradient] = costFunction(theta)
      jVal = [code to compute J(θ)];
      gradient(1) = [code to compute ∂J(θ)/∂θ_0];
      gradient(2) = [code to compute ∂J(θ)/∂θ_1];
      ...
      gradient(n+1) = [code to compute ∂J(θ)/∂θ_n];

Again, you need to write the costFunction function yourself, and in this function:

  • function [jVal, gradient] = costFunction(theta) takes as input the parameter vector $\theta=\begin{bmatrix}\theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n}\end{bmatrix}$.
  • jVal = [code to compute J(θ)] computes the regularized cost $J(\theta)$.
  • gradient(1) = [code to compute ∂J(θ)/∂θ_0] computes $\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$, with no penalty term, since $\theta_0$ is not regularized.
  • gradient(n+1) = [code to compute ∂J(θ)/∂θ_n] computes $\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{n}^{(i)}+\frac{\lambda}{m} \theta_{n}$.
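Putting the pieces together, here is a minimal sketch of such a costFunction for regularized logistic regression. Passing X, y, and lambda as extra arguments (and binding them with an anonymous function) is my own choice for illustration:

    function [jVal, gradient] = costFunction(theta, X, y, lambda)
      % Regularized logistic regression: cost J(theta) and its gradient.
      m = size(X, 1);
      h = 1 ./ (1 + exp(-X * theta));          % sigmoid hypothesis
      theta_reg = theta;
      theta_reg(1) = 0;                        % exclude theta_0 from the penalty
      jVal = (-1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
             + (lambda / (2 * m)) * sum(theta_reg .^ 2);
      gradient = (1 / m) * (X' * (h - y)) + (lambda / m) * theta_reg;
    end

The function can then be handed to an advanced optimizer such as fminunc:

    options = optimset('GradObj', 'on', 'MaxIter', 400);
    initial_theta = zeros(size(X, 2), 1);
    [theta, jVal] = fminunc(@(t) costFunction(t, X, y, lambda), initial_theta, options);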