2.1 Perceptron Learning

1. Basic concepts

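Suppose that in a binary classification task the two classes are 1 (positive class) and -1 (negative class). We define the net input $z$ as a weighted sum of the input features, and an activation function $\phi(z)$ that compares it with a threshold $\theta$: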


$$
z = w^\intercal x, \qquad \phi(z) = \begin{cases} 1, & z \ge \theta \\ -1, & z < \theta \end{cases}
$$

$x_i$ denotes the $i$-th input feature, and $w_i$ the weight of that feature.


$$
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}, \qquad
w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix}
$$

For simplicity, move $\theta$ to the left-hand side of the inequality and redefine $z \leftarrow z - \theta$, which gives

$$
z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m
$$

where $w_0 = -\theta$ and $x_0 = 1$.

The activation function becomes


$$
\phi(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0 \end{cases}
$$
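The thresholded activation with the bias absorbed into $w_0$ can be sketched in a few lines of Python. This is only an illustration: the function names `net_input` and `predict`, the use of plain lists, and the example weights are assumptions, not part of the original notes.

```python
def net_input(x, w):
    """Return z = w_0 * 1 + w_1 * x_1 + ... + w_m * x_m.

    w[0] is the bias weight w_0 = -theta; x holds only x_1..x_m,
    so the constant input x_0 = 1 is implicit.
    """
    return w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], x))


def predict(x, w):
    """Unit step activation: 1 if z >= 0, otherwise -1."""
    return 1 if net_input(x, w) >= 0 else -1


# Hypothetical weights: theta = 0.5, so w_0 = -0.5.
w = [-0.5, 1.0, 1.0]
label_low = predict([0.2, 0.2], w)   # z is below 0, so the label is -1
label_high = predict([0.4, 0.3], w)  # z is above 0, so the label is 1
```

Storing $-\theta$ as `w[0]` means the comparison is always against 0, which is what lets the threshold be learned like any other weight.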

2. Implementation steps

  1. Initialize the weights to 0 or to small random numbers.
  2. Iterate over all training samples $x^{(i)}$ and perform the following operations for each one:

    1. Calculate the predicted value $\hat y$.
    2. Update the weights $w$.

The weight vector $w$ is updated as

$$
\begin{aligned}
w &= w + \Delta w \\
\Delta w &= \eta (y - \hat y) x
\end{aligned}
$$

where $\eta$ is the learning rate, a constant between 0 and 1, and $y$ is the true class label. This process is repeated several times until the model converges.

Note that the perceptron is only guaranteed to converge if the two classes are linearly separable and the learning rate is small enough. To keep training from running forever when the model cannot converge, a threshold can be set on the number of misclassified samples: if the number of misclassifications exceeds the threshold, the model is considered unable to converge.
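The steps above can be sketched as a small training loop. This is a minimal illustration in plain Python, not the book's implementation: the names `train_perceptron` and `predict`, the toy data (logical AND), and the `max_epochs` cap are assumptions. The epoch cap is a simple guard against non-convergence that stands in for the misclassification threshold mentioned above.

```python
def predict(x, w):
    """Unit step on z = w_0 + w_1 * x_1 + ... + w_m * x_m."""
    z = w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], x))
    return 1 if z >= 0 else -1


def train_perceptron(X, y, eta=0.5, max_epochs=50):
    """Apply w = w + eta * (y - y_hat) * x sample by sample.

    Returns the weights and the number of misclassifications per epoch.
    max_epochs is an assumed safeguard for data that is not separable.
    """
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the bias weight (x_0 = 1)
    errors_per_epoch = []
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            update = eta * (y_i - predict(x_i, w))  # 0 if the sample is correct
            w[0] += update
            for j, x_ij in enumerate(x_i, start=1):
                w[j] += update * x_ij
            errors += int(update != 0.0)
        errors_per_epoch.append(errors)
        if errors == 0:  # every sample classified correctly: converged
            break
    return w, errors_per_epoch


# Toy, linearly separable data (logical AND with labels in {-1, 1}).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
w, errors = train_perceptron(X, y)
print(errors)
```

On this toy data the error count drops to 0 after a handful of epochs. On data that is not linearly separable the loop would simply stop at `max_epochs` with a nonzero final count.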

Chapter 2.2 of the book works through an example on the Iris dataset.

2.2 Adaptive Linear Neuron (Adaline)

1. Basic concepts

Adaline is an improvement on the perceptron model of section 2.1. It updates the weights based on a continuous linear activation function instead of a unit step function, and uses a quantizer to predict the class label once the weights have been updated. It also illustrates the core idea of defining a cost function and minimizing it, which lays the groundwork for understanding logistic regression, support vector machines (SVM), and regression models later on.

2. Minimizing the cost function by gradient descent

  1. The cost function is simply a function $J(w)$ that measures the difference between the value $\hat y$ predicted by the model and the true value $y$. Our goal is to find the $w$ that minimizes $J(w)$ by repeatedly adjusting the parameter $w$. That is, for each input $x^{(i)}$ the model outputs a prediction $\phi(z^{(i)})$, and we want the difference between the predicted and true values to be as small as possible.
  2. In Adaline, we define the cost function as the sum of squared errors (SSE) between the model output and the true class label.

$$
J(w) = \frac{1}{2} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right)^2
$$
  1. This function has two advantages:

    • It is differentiable. The coefficient 1/2 is included only to make the derivative simpler.
    • It is a convex function with a minimum point.

    As a result, the weights $w$ that minimize the cost function can be found by gradient descent.

  2. Update the weights

    To compute the gradient of the cost function, we take its partial derivative with respect to each weight $w_j$:


$$
\frac{\partial J}{\partial w_j} = -\sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}
$$

Then update all the weights simultaneously:

$$
\begin{aligned}
w &= w + \Delta w \\
\Delta w &= -\eta \nabla J(w) \\
\Delta w_j &= -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}
\end{aligned}
$$

The update carries a negative sign in front of the partial derivative because the cost function decreases fastest in the direction opposite to the gradient.
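For reference, the partial derivative used in the update follows from the chain rule. This sketch assumes the SSE cost $J(w) = \frac{1}{2} \sum_i (y^{(i)} - \phi(z^{(i)}))^2$ and the identity activation $\phi(z) = z$ with $z^{(i)} = \sum_k w_k x_k^{(i)}$. The identity form is an assumption here, since the text only states that the activation is linear.

$$
\begin{aligned}
\frac{\partial J}{\partial w_j}
&= \frac{\partial}{\partial w_j} \frac{1}{2} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right)^2 \\
&= \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \sum_k w_k x_k^{(i)} \right) \\
&= -\sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}
\end{aligned}
$$

The factor 2 produced by differentiating the square cancels the coefficient $\frac{1}{2}$, which is why that coefficient is included in the cost.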

  1. There are two main differences between Adaline and the perceptron of section 2.1:
    • The perceptron updates the weights after every sample: it processes one sample, updates the weights, then moves on to the next sample and updates them again. Adaline, as described in this section, updates all the weights at once, using a gradient computed over all the samples in a single batch.
    • In this section $\phi(z)$ is a real number, not an integer class label.
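Batch gradient descent for Adaline can be sketched as follows. This is a minimal illustration under stated assumptions rather than the book's code: the identity activation $\phi(z) = z$, the names `train_adaline` and `predict`, the one-dimensional toy data, and the chosen `eta` and `n_epochs` are all hypothetical.

```python
def train_adaline(X, y, eta=0.05, n_epochs=20):
    """Batch gradient descent on J(w) = 1/2 * sum_i (y_i - phi(z_i))^2.

    Assumes the identity activation phi(z) = z.
    Returns the weights and the cost recorded at the start of each epoch.
    """
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the bias weight (x_0 = 1)
    costs = []
    for _ in range(n_epochs):
        # Errors y_i - phi(z_i) for every sample, using the current weights.
        errors = [
            y_i - (w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], x_i)))
            for x_i, y_i in zip(X, y)
        ]
        costs.append(0.5 * sum(e * e for e in errors))
        # Delta w_j = eta * sum_i error_i * x_ij; the errors were computed
        # before any weight changed, so all weights are updated together.
        w[0] += eta * sum(errors)
        for j in range(1, len(w)):
            w[j] += eta * sum(e * x_i[j - 1] for e, x_i in zip(errors, X))
    return w, costs


def predict(x, w):
    """Quantizer: threshold the continuous output at 0."""
    z = w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], x))
    return 1 if z >= 0 else -1


# Hypothetical one-dimensional toy data, already centered around 0.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
w, costs = train_adaline(X, y)
print(costs[0], costs[-1])
```

Unlike the perceptron loop, the weights change only once per pass over the data, and the error that drives the update is the real-valued difference $y - \phi(z)$ rather than a difference of class labels. If `eta` is set too large the recorded cost grows instead of shrinking, which is an easy way to spot a bad learning rate.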

2.3 Large-scale machine learning and stochastic gradient descent

1. Background

Batch gradient descent uses the whole training set for every weight update. When the dataset is very large, each step toward the global optimum therefore requires a lot of computation, and the model converges very slowly. In this case, stochastic gradient descent works better.

2. Updating the weights by stochastic gradient descent

At each step, one training sample $x^{(i)}$ is selected at random and used to update the weights:

$$
\Delta w_j = \eta \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}
$$

Stochastic gradient descent does not necessarily reach the global optimum, but it gets close to it.
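The single-sample update can be sketched as follows. As before, this is an illustration under assumptions: the identity activation $\phi(z) = z$, the name `train_adaline_sgd`, the fixed `seed`, the step count, and the toy data are hypothetical choices and not taken from the book.

```python
import random


def train_adaline_sgd(X, y, eta=0.01, n_steps=1000, seed=1):
    """Stochastic gradient descent: one randomly chosen sample per update.

    Assumes the identity activation phi(z) = z. The seed only makes
    the run repeatable.
    """
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the bias weight (x_0 = 1)
    for _ in range(n_steps):
        i = rng.randrange(len(X))  # pick one training sample at random
        z = w[0] + sum(w_j * x_j for w_j, x_j in zip(w[1:], X[i]))
        error = y[i] - z
        # Delta w_j = eta * (y_i - phi(z_i)) * x_ij, using this sample only.
        w[0] += eta * error
        for j, x_ij in enumerate(X[i], start=1):
            w[j] += eta * error * x_ij
    return w


# Same hypothetical toy data as a batch run would use.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
w_sgd = train_adaline_sgd(X, y)
print(w_sgd)
```

Each update costs one sample's worth of work instead of a full pass over the data. With a constant learning rate the weights keep jittering around the minimum rather than settling on it exactly, which matches the remark that the method approaches the optimum without necessarily reaching it.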

The code for the Iris example is in the code for this chapter.