
Notes of Andrew Ng’s Machine Learning —— (1) Linear Regression with One Variable

Model Representation

Notation


  • $x^{(i)}$ denotes the input variables, called input features
  • $y^{(i)}$ denotes the output variable, called the target variable
  • $(x^{(i)}, y^{(i)})$ is called a training example
  • $\{(x^{(i)}, y^{(i)});\ i=1,\dots,m\}$ is called a training set

⚠️【Note】The superscript $^{(i)}$ in the notation is simply an index into the training set.


  • $X$ denotes the space of input values
  • $Y$ denotes the space of output values

In many examples, $X = Y = \mathbb{R}$.

A more formal description of supervised learning

To learn a function $h: X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$.

For historical reasons, this function $h$ is called a hypothesis.

Pictorially:

With our new notation, the categories of supervised learning problems can be described like this:

  • regression problem: the target variable is continuous.
  • classification problem: $y$ can take on only a small number of discrete values.

Hypothesis & Regression Model

To represent the hypothesis $h$: for example, if we want to fit a data set like this:

Simply, we may set $h_\theta(x) = \theta_0 + \theta_1 x$. This means that we are going to predict that $y$ is a linear function of $x$.

Plotting this in the picture, it will be something like this:

And what this function is doing is predicting that $y$ is some straight-line function of $x$.

In this case, this model is called linear regression with one variable (or univariate linear regression).

P.S. We can also fit more complicated (perhaps non-linear) functions, but this linear case is the simplest.
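In code, this hypothesis is just a one-liner. A minimal sketch in Python (the $\theta$ values here are arbitrary placeholders, not fitted ones):

    def h(x, theta0=0.5, theta1=1.2):
        # Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x
        return theta0 + theta1 * x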

Cost Function

Well, now we have got this:

To get a "good" hypothesis function, we need to choose $\theta_0$ and $\theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$. A cost function is of great help with this goal.

Cost function

A cost function can help us to measure the accuracy of our hypothesis function.

This takes an average difference of all the results of the hypothesis with inputs from x’s and the actual output y’s:


$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]^2 \tag{1}$$

The function $(1)$ is exactly our cost function. Take a look at it:

  • $h_\theta(x^{(i)}) - y^{(i)}$ shows the difference between the predicted value and the actual value.

  • $\frac{1}{m}\sum^{m}_{i=1}\dots$ gives the mean of the squares of $h_\theta(x^{(i)}) - y^{(i)}$.

  • The mean is halved ($\frac{1}{2}$) as a convenience for the computation of gradient descent, since differentiating the square cancels the $\frac{1}{2}$: $\frac{d}{df}\,\frac{1}{2}f^2 = f$.

P.S. This function is otherwise called the "squared error function" or "mean squared error".
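As a quick illustration, here is a minimal Python sketch of equation $(1)$ (my own helper, not course code), with the training set given as a list of $(x, y)$ pairs:

    def cost(theta0, theta1, training_set):
        # J(theta0, theta1) = (1/2m) * sum of squared errors
        m = len(training_set)
        return sum((theta0 + theta1 * x - y) ** 2 for x, y in training_set) / (2 * m)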

Goal

To make $h_\theta(x)$ close to $y$, we just need to minimize our cost function by adjusting the values of $\theta_0$ and $\theta_1$.

We describe this goal like this:


$$\mathop{minimize}\limits_{\theta_0, \theta_1} J(\theta_0, \theta_1) \tag{2}$$

Or, more directly:


$$\mathop{minimize}\limits_{\theta_0, \theta_1} \frac{1}{2m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]^2 \tag{3}$$

And if we are working with linear regression with one variable, then $h_\theta(x) = \theta_0 + \theta_1 x$.

⚠️【Note】Hypothesis Function & Cost Function

  • $h_\theta(x)$, for fixed $\theta_i$, is a function of $x$.
  • $J(\theta_0, \theta_1)$ is a function of the parameters $\theta_i$.

Cost Function – Intuition I

Let's draw some pictures for a better understanding of what the values of the cost function mean.

To get started, we are going to work with a simplified hypothesis function, $h_\theta(x) = \theta_1 x$ (i.e. $\theta_0 = 0$):

Our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by $h_\theta(x)$) which passes through these scattered data points.

Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distance of the scattered points from the line ($\frac{1}{m}\sum^m_{i=1}[h_\theta(x^{(i)})-y^{(i)}]^2$) is the smallest.

Ideally, the line should pass through all the points of our training data set. In such a case, the value of the cost function $J(\theta_0, \theta_1)$ will be 0.

E.g. an ideal situation where $J = 0$:

Let this be our training set: only three points, (1, 1), (2, 2) & (3, 3).

Now, we try setting $\theta_1$ to different values: -0.5, 0, 0.5, 1, 1.5…

When $\theta_1 = 1$, we get a slope of 1 which goes through every single data point in our model. When $\theta_1 = 0.5$, we see the vertical distance from our fit to the data points increase:

By doing this, we get a series of graphs of $h_\theta(x)$ in the x-y plane, which yields the following graph of $J(\theta_1)$ against $\theta_1$:

Thus, as a goal, we should try to minimize the cost function. In this case, $\theta_1 = 1$ is our global minimum.
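We can check these values numerically. A tiny sketch (my own, reusing the cost idea from equation $(1)$) over the three-point training set:

    training_set = [(1, 1), (2, 2), (3, 3)]

    def J(theta1):
        # Simplified hypothesis: h(x) = theta1 * x (theta0 fixed at 0)
        m = len(training_set)
        return sum((theta1 * x - y) ** 2 for x, y in training_set) / (2 * m)

    for t in [-0.5, 0, 0.5, 1, 1.5]:
        print(t, J(t))  # J(1) = 0.0 is the global minimum; J(0.5) ~= 0.583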

Cost Function – Intuition II

Unlike before, this time we won't continue with the simplified hypothesis; we are going to keep both parameters $\theta_0$ and $\theta_1$. So the hypothesis function will be $h_\theta(x) = \theta_0 + \theta_1 x$.

Here’s our problem formulation as usual:


$$\begin{array}{rl} \textrm{Hypothesis: } & h_\theta(x)=\theta_0+\theta_1x\\ \textrm{Parameters: } & \theta_0, \theta_1\\ \textrm{Cost Function: } & J(\theta_0, \theta_1) = \frac{1}{2m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]^2\\ \textrm{Goal: } & \mathop{minimize}\limits_{\theta_0, \theta_1} J(\theta_0, \theta_1) \end{array}$$

Same as last time, we want to understand the hypothesis $h$ and the cost function $J$ via a series of graphs. However, this time we'd like to use a contour plot to describe $J(\theta_0, \theta_1)$.

A contour plot is a graph that contains many contour lines.

A contour line of a two variable function has a constant value at all points of the same line.

An example:

Taking any color and going along the ‘circle’, one would expect to get the same value of the cost function.

To approach our optimization objective, we can try setting the parameters $\theta_i$ to different values.

When $\theta_0 = 360$ and $\theta_1 = 0$, the value of $J(\theta_0, \theta_1)$ in the contour plot gets closer to the center, thus reducing the cost. Now we get a better fit of the data:

Minimizing the cost function as much as possible, the resulting $\theta_1$ and $\theta_0$ tend to be around 0.12 and 250 respectively. Plotting these values on the graph to the right seems to put our point in the center of the inner-most 'circle'.

Obviously, we don't want to write software that just plots a contour plot and then makes us manually read off the numbers to reach our goal. We want an efficient algorithm that automatically finds the values of $\theta_0$ and $\theta_1$ that minimize the cost function $J$. The gradient descent algorithm that we will talk about next works great on this problem.

Gradient Descent

There is an algorithm called gradient descent for minimizing the cost function $J$. And we can use it not only in linear regression; it's actually used all over the place in machine learning.

Let's talk about gradient descent for minimizing some arbitrary function $J$. So here's the problem setup:

We put θ0\theta_0θ0 on the x axis and θ1\theta_1θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup:

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

Imagine that we are physically standing at a point on a hill. In gradient descent, what we're going to do is spin 360 degrees around, just look all around us, and ask: "If I were to take a little step in some direction, and I want to go down the hill as quickly as possible, what direction should I take?" Then we take a step in that direction, and repeat until we converge to a local minimum, like the black line in the picture above shows.

Notice that if we choose different starting points for gradient descent, we may reach different local optima.

Mathematically, this is the definition of the gradient descent algorithm:

Gradient Descent Algorithm

repeat until convergence {


$$\qquad\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)\qquad \textrm{(for } j=0 \textrm{ and } j=1\textrm{)}$$

}

The $\alpha$ is a number called the learning rate. It basically controls how big a step we take downhill with gradient descent.

At each iteration, one should simultaneously update all the parameters $\theta_1, \theta_2, \dots, \theta_n$. Updating a specific parameter before computing the update for another one in the same iteration would yield a wrong implementation:
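A small sketch of the difference (the gradient helpers here are made-up stand-ins for the partial derivatives, just to keep the snippet runnable):

    # Stand-in gradients for some J(theta0, theta1); here J = theta0^2 + theta1^2.
    def grad0(t0, t1): return 2 * t0
    def grad1(t0, t1): return 2 * t1

    theta0, theta1, alpha = 1.0, 2.0, 0.1

    # Correct: both updates are computed from the OLD values, then assigned together.
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    theta0, theta1 = temp0, temp1

    # Wrong: theta0 has already changed by the time grad1 is evaluated.
    # theta0 = theta0 - alpha * grad0(theta0, theta1)
    # theta1 = theta1 - alpha * grad1(theta0, theta1)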

Gradient Descent Intuition

Let's explore the scenario where we use one parameter $\theta_1$ and plot its cost function to implement gradient descent. Our formula for a single parameter is (repeat until convergence):


$$\theta_1 := \theta_1 - \alpha\frac{d}{d\theta_1}J(\theta_1)$$

Regardless of the sign of the slope $\frac{d}{d\theta_1}J(\theta_1)$, $\theta_1$ eventually converges to its minimum value.

The following graph shows that when the slope is negative, the value of $\theta_1$ increases, and when it is positive, the value of $\theta_1$ decreases:

On a side note, we should adjust our parameter $\alpha$ to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too much time to obtain the minimum value, implies that our step size is wrong.

How does gradient descent converge with a fixed step size $\alpha$? The intuition behind the convergence is that $\frac{d}{d\theta_1}J(\theta_1)$ approaches 0 as we approach the bottom of our convex function. At the minimum the derivative will always be 0, and thus we get $\theta_1 := \theta_1 - \alpha \cdot 0$.
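A quick numerical sketch of this effect, using the toy training set from earlier (my own example; $\alpha = 0.1$ is an arbitrary choice): even though $\alpha$ is fixed, the steps shrink on their own as the derivative approaches 0.

    # For the toy set [(1, 1), (2, 2), (3, 3)] with h(x) = theta1 * x:
    # d/dtheta1 J(theta1) = (1/m) * sum((theta1*x - y) * x) = (14/3) * (theta1 - 1)
    def dJ(theta1):
        return (14 / 3) * (theta1 - 1)

    theta1, alpha = 2.5, 0.1
    for _ in range(5):
        step = alpha * dJ(theta1)
        theta1 -= step
        print(f"theta1 = {theta1:.4f}, step = {step:.4f}")
    # The step size decreases every iteration without changing alpha.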

Gradient Descent for Linear Regression

Now we have learnt gradient descent, the linear regression model, and the squared error cost function. This time, we are going to put gradient descent together with our cost function; that will give us an algorithm for linear regression, for fitting a straight line to our data.

This is what we worked out:

We are going to apply gradient descent algorithm to minimize our squared error cost function.

The key term is the derivative term:


$$\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]^2 = \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum^m_{i=1}\left[\theta_0+\theta_1x^{(i)}-y^{(i)}\right]^2$$

$$\begin{array}{rl} j=0: & \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]\\ j=1: & \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right] \cdot x^{(i)} \end{array}$$

Plug them back into our gradient descent algorithm:

repeat until convergence {


$$\qquad\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]$$


$$\qquad\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right] \cdot x^{(i)}$$

}

Notice: update $\theta_0$ and $\theta_1$ simultaneously.

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate. So, this is simply gradient descent on the original cost function J.

This method looks at every example in the entire training set on every step, and is called batch gradient descent.
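Since each step sums over all $m$ examples, batch gradient descent vectorizes naturally. Here's a NumPy sketch of exactly the two update rules above (my own code, not from the course):

    import numpy as np

    def batch_gradient_descent(x, y, alpha=0.01, iterations=1000):
        """x, y: 1-D arrays of the m training examples. Returns (theta0, theta1)."""
        m = len(x)
        theta0 = theta1 = 0.0
        for _ in range(iterations):
            error = theta0 + theta1 * x - y  # h(x^(i)) - y^(i) for all i at once
            # Simultaneous update: both gradients use the same error vector.
            theta0 -= alpha * error.sum() / m
            theta1 -= alpha * (error * x).sum() / m
        return theta0, theta1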

Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum, and no other local optima; thus gradient descent always converges (assuming the learning rate $\alpha$ is not too large) to the global minimum. Indeed, $J$ is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function:

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent; the x's in the figure (joined by straight lines) mark the successive values of $\theta$ that gradient descent went through as it converged to its minimum.


Experiment

As a supplement, I wrote an easy-to-understand but surprisingly low-performance univariate gradient descent in Python:

(Space is limited, so I removed a lot of comments; the complete code is in this Gist.)

import math
import random

class LinearRegressionWithOneVariable(object):
    def __init__(self, training_set):
        self.training_set = training_set  # [(x, y), ...]
        self.theta = [0.0, 0.0]  # [theta_0, theta_1]

    def _hypothesis(self, x, theta):
        # h_theta(x) = theta_0 + theta_1 * x
        return theta[0] + theta[1] * x

    def _cost(self, dataset, theta):
        # Squared error cost: J = (1/2m) * sum((h(x) - y)^2)
        s = 0
        for i in dataset:
            s += (self._hypothesis(i[0], theta) - i[1]) ** 2
        return s / (2 * len(dataset))
    
    def _gradient_descent(self, dataset, starting_theta, learning_rate, epsilon, max_count=4294967296):
        theta = list.copy(starting_theta)
        last_theta = list.copy(starting_theta)
        cost = self._cost(dataset, theta)
        last_cost = cost * 2
        count = 0
        epsilon = abs(epsilon)
        diff = epsilon + 1
        while (diff > epsilon):
            count += 1
            if count >= max_count:
                raise RuntimeError("no convergence after {mc} iterations.".format(mc=max_count))
    
            try:
                t_sum = sum(self._hypothesis(i[0], theta) - i[1] for i in dataset)
                theta[0] = theta[0] - learning_rate * t_sum / len(dataset)
    
                t_sum = sum(
                    ((self._hypothesis(i[0], theta) - i[1]) * i[0] for i in dataset)
                    )
                theta[1] = theta[1] - learning_rate * t_sum / len(dataset)
    
                last_cost = cost
                cost = self._cost(dataset, theta)
    
                if not any((math.isnan(x) or math.isinf(x) for x in theta)) and abs(cost) <= abs(last_cost):
                    diff = max((abs(last_theta[i] - theta[i]) for i in range(len(theta))))
                    last_theta = list.copy(theta)
                    learning_rate += learning_rate * 4
                else:
                    theta = list.copy(last_theta)
                    learning_rate /= 10
                    if (learning_rate == 0):
                        learning_rate = self._get_learning_rate(self.training_set)
    
            except OverflowError:
                theta = list.copy(last_theta)
                learning_rate /= 10
                if (learning_rate == 0):
                    learning_rate = self._get_learning_rate(self.training_set)


        return theta, count
    
    def _get_learning_rate(self, dataset):
        # Heuristic initial learning rate based on the largest target value.
        return 1 / max(i[1] for i in dataset)

    def regress(self, epsilon=1, learning_rate=0, verbose=False):
        init_theta = [random.random(), random.random()]
    
        if learning_rate == 0:
            learning_rate = self._get_learning_rate(self.training_set)
        
        self.theta, count = self._gradient_descent(self.training_set, init_theta, learning_rate, epsilon)
        
        if verbose:
            print('Gradient descent finished after {count} iterations:\ntheta_0:\t{t0}\ntheta_1:\t{t1}\ncost:\t\t{cost}\nhypothesis:\th(x) = {t0} + {t1}*x'.format(
                    count=count, t0=self.theta[0], t1=self.theta[1], cost=self._cost(self.training_set, self.theta)))
    
    def hypothesis(self, x):
        # Predict y for a new x using the fitted parameters.
        return self._hypothesis(x, self.theta)

Usage example:

    >>> model = LinearRegressionWithOneVariable([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)])
    >>> model.regress(verbose=True, epsilon=0.0001)
    theta_0:        0.025486336182825836
    theta_1:        0.9940305813471573
    cost:           9.99815680487604e-05
    hypothesis:     h(x) = 0.025486336182825836 + 0.9940305813471573*x
    >>> model.hypothesis(10)
    9.9657921496544