
The normal equation comes up in the context of multiple linear regression. With gradient descent we iterate an update formula and walk downhill step by step; the normal equation, by contrast, jumps straight to the optimal θ. It is essentially one step.

What is a normal equation?

Here’s a simple example:

Intuition:

If 1D ($\theta \in \mathbb{R}$):


$$J(\theta)=a \theta^{2}+b \theta+c$$

Now suppose theta is just a real number, not a vector, and the function J is a quadratic function of theta.

How do we minimize this function in one step? If you’ve ever taken high school math, you know the answer: take the derivative and set it to zero, $\frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta) = 0$.
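For the quadratic above (assuming $a > 0$, so the parabola opens upward), this is a one-line calculation:

$$\frac{\mathrm{d}}{\mathrm{d}\theta}\left(a\theta^{2}+b\theta+c\right)=2a\theta+b=0 \quad\Longrightarrow\quad \theta=-\frac{b}{2a}$$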

But in general we are not dealing with a scalar; θ is a vector. In gradient descent we loop, nudging each $\theta_j$ a little at a time. The normal equation instead takes the partial derivative of J with respect to every $\theta_j$, sets each one equal to 0, and solves for all of them directly.
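Written out (assuming the usual squared-error cost J from the earlier gradient-descent discussion, which this post does not restate):

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^i)-y^i\right)^2, \qquad \frac{\partial}{\partial \theta_j} J(\theta)=0 \quad \text{for } j=0,1,\dots,n$$

Solving this system of $n+1$ equations simultaneously is exactly what the normal equation below does.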

Now take a concrete training set, add a column $x_0 = 1$, and turn it into a design matrix:


$$X=\begin{bmatrix}1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36\end{bmatrix}$$

Let’s also list y as a vector:


$$y=\begin{bmatrix}460 \\ 232 \\ 315 \\ 178\end{bmatrix}$$

The matrix X contains all the feature values and has size $m \times (n+1)$; y is an m-dimensional vector. Here m is the number of training samples.

Now a single step, $\theta=\left(X^{T} X\right)^{-1} X^{T} y$, gives the optimal solution.

Setting θ equal to X-transpose-X, inverted, times X-transpose times y gives the value of θ that minimizes the cost function.

In words: the transpose of the design matrix times the matrix itself, then the inverse of that product, then the transpose of the design matrix again, times the vector y.

So the normal equation is:

m examples $(x^1, y^1), \dots, (x^m, y^m)$, and n features.

Say the training set has m training samples and n features. Then the feature vector x for the i-th sample is


$$x^i = \begin{bmatrix} x^i_0 \\ x^i_1 \\ x^i_2 \\ \vdots \\ x^i_n \end{bmatrix} \in \mathbb{R}^{n+1}$$

Stacking the transposes of these vectors as rows turns them into the matrix X:


$$X = \begin{bmatrix} \cdots\ (x^1)^T\ \cdots \\ \cdots\ (x^2)^T\ \cdots \\ \cdots\ (x^3)^T\ \cdots \\ \vdots \\ \cdots\ (x^m)^T\ \cdots \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$$

And y is:


$$y = \begin{bmatrix} y^1 \\ y^2 \\ y^3 \\ \vdots \\ y^m \end{bmatrix} \in \mathbb{R}^m$$

With X and y written out, the normal equation is:


$$\theta=\left(X^{T} X\right)^{-1} X^{T} y$$

In Octave this takes only one line: pinv(X' * X) * X' * y.
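As a minimal sketch using the example data above (the variable names here are my own, not from the original post):

```octave
% Design matrix with the x0 = 1 column already added
X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];

% Target values
y = [460; 232; 315; 178];

% Normal equation: solve for theta in one step
theta = pinv(X' * X) * X' * y
```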

This method also does not require feature scaling.

What if the matrix $X^{T} X$ is not invertible?

In fact, Octave has two functions for inverting matrices: pinv() and inv(). With the former (the pseudo-inverse), you still get a correct value of θ even if the matrix is not invertible.
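A quick illustration of the difference, using a deliberately singular matrix of my own:

```octave
A = [1 2; 2 4];   % singular: the second row is twice the first
% inv(A) warns that the matrix is singular to machine precision
B = pinv(A)       % the Moore-Penrose pseudo-inverse still returns a usable matrix
```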

Recall what invertible means: a matrix A is invertible if you can find a matrix B such that $AB = BA = I$, the identity matrix.
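For example, a 2×2 case worked by hand:

$$A=\begin{bmatrix}1 & 2\\ 3 & 4\end{bmatrix},\qquad A^{-1}=\frac{1}{1\cdot 4-2\cdot 3}\begin{bmatrix}4 & -2\\ -3 & 1\end{bmatrix}=\begin{bmatrix}-2 & 1\\ 1.5 & -0.5\end{bmatrix},\qquad AA^{-1}=I$$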

In general, a non-invertible matrix comes up in one of two cases:

  1. Redundant features, for example

    $x_1 = \text{size in feet}^2$, $x_2 = \text{size in m}^2$: one feature measures the area in square feet and the other in square meters, so the two are linearly dependent. In this case you can simply discard one of the features (see the Octave sketch after this list).

  2. Too many features (m ≤ n)

    In this case, delete some features or use regularization.

    We’ll talk about regularization later.

    I had a half-baked idea: why not just loop over the m samples again and call it a square matrix, hahahahaha, just like $$\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\ a_{21} & a_{22} & a_{23} & a_{24} & a_{25} \\ a_{31} & a_{32} & a_{33} & a_{34} & a_{35} \\ a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\ a_{21} & a_{22} & a_{23} & a_{24} & a_{25} \end{bmatrix}$$
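Here is a minimal Octave sketch of case 1, with made-up sizes; x2 is just x1 converted from square feet to square meters:

```octave
x1 = [2104; 1416; 1534; 852];   % size in square feet
x2 = x1 * 0.0929;               % the same sizes in square meters (redundant)
X  = [ones(4, 1) x1 x2];        % design matrix with the x0 = 1 column
rank(X' * X)                    % prints 2, not 3: X'X is singular
```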

Compare the normal equation with gradient descent

| Gradient descent | Normal equation |
| --- | --- |
| Need to choose α | No need to choose α |
| Needs many iterations | No need to iterate |
| Works well even when n is large | Need to compute $(X^TX)^{-1}$; slow if n is very large |

The normal equation method also does not work for some more sophisticated learning algorithms.
  • Gradient descent:
    • Disadvantages:
      • It takes several attempts to select a suitable alpha
      • Multiple iterations are required
    • Advantages:
      • It can also be used when there is a large amount of data
  • Normal equation:
    • Advantages:
      • You don’t have to choose alpha
      • You don’t need multiple iterations
      • No need to do feature scaling or worry about feature ranges
    • Disadvantages:
      • You need to compute $(X^TX)^{-1}$, which is on the order of $O(n^3)$, so it gets slow when there are many features
      • It cannot be used with some more complex learning algorithms

Conclusion

With a simple algorithm and a small amount of data, the normal equation is the quicker choice. When the data is large (especially the number of features n) or the algorithm is more complex, you still need gradient descent.