The delta rule

Although the perceptron rule can successfully find a weight vector when the training examples are linearly separable, it fails to converge when they are not. Another training rule, the delta rule, was devised to overcome this deficiency: if the training examples are not linearly separable, the delta rule converges toward a best-fit approximation of the target concept. The key idea of the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors and find the weight vector that best fits the training examples.


Gradient descent

The perceptron rule requires the training examples to be linearly separable; when they are not, it fails to converge. The delta rule was introduced to remove this requirement: it converges toward a best-fit approximation of the target concept even when the examples are not separable.

Its key idea is to use gradient descent to search the hypothesis space of possible weight vectors for the weight vector that best fits the training examples.

To keep things simple, consider training an unthresholded perceptron, i.e. a linear unit, whose output o is:

o(x) = w · x = w0 + w1*x1 + w2*x2 + … + wn*xn

First, specify a measure of the training error of a hypothesis (weight vector) relative to the training examples:

E(w) = (1/2) Σ_{d∈D} (t_d − o_d)²

where D is the set of training examples, t_d is the target output for training example d, and o_d is the output of the linear unit for training example d. E(w) is half the sum, over all training examples, of the squared difference between the target output t_d and the linear unit output o_d. We write E as a function of w because the linear unit output o depends on this weight vector.

Under this measure, the hypothesis that minimizes E over the given training data is taken to be the best (most probable) hypothesis in H; in other words, learning amounts to finding a set of weights that minimizes E.

Intuitively, minimizing E shrinks the gap between the target outputs t_d and the linear unit outputs o_d, that is, it brings the hypothesis closer and closer to the target concept.
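To make these definitions concrete, here is a minimal C++ sketch (illustrative only, not part of the original article) of the linear unit output and of the error E(w); it assumes each input vector x carries a leading component fixed to 1, so that w[0] plays the role of the bias w0.

#include <vector>
#include <cstddef>

// Output of the linear unit: o = w . x = w0 + w1*x1 + ... + wn*xn
// (x[0] is assumed to be 1, so w[0] acts as the bias w0).
double linear_output(const std::vector<double>& w, const std::vector<double>& x)
{
    double o = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i)
        o += w[i] * x[i];
    return o;
}

// Training error E(w) = 1/2 * sum over all examples d of (t_d - o_d)^2.
double training_error(const std::vector<double>& w,
                      const std::vector<std::vector<double>>& X,
                      const std::vector<double>& t)
{
    double E = 0.0;
    for (std::size_t d = 0; d < X.size(); ++d)
    {
        double diff = t[d] - linear_output(w, X[d]);
        E += 0.5 * diff * diff;
    }
    return E;
}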

In the figure above, the red and green points are not linearly separable (no line completely separates the two classes), but we can still look for a line that separates the two classes as well as possible, making as few mistakes as possible.

Why can't the perceptron rule handle this?

Because its output is thresholded: anything greater than 0 maps to 1, anything less than 0 maps to -1. This hard threshold makes the output jump discontinuously (it is not differentiable), so, as illustrated in the figure above, during training the separating line keeps rotating clockwise or counterclockwise without ever converging to an optimal position.

In the figure above, the two axes represent the possible values of the two weights of a simple linear unit, and each contour (circle) corresponds to a value of the training error E.

To find a weight vector that minimizes E, gradient descent search starts from an arbitrary initial weight vector and then repeatedly modifies it in small steps. At each step, the weight vector is moved in the direction of steepest descent along the error surface (the blue line in the figure), and the process continues until the global minimum of the error is reached.

 

What is the steepest descent?

This direction can be found by computing the derivative of E with respect to each component of the vector w. This vector of derivatives is called the gradient of E with respect to w, written ∇E(w).

∇E(w) is itself a vector whose components are the partial derivatives of E with respect to each w_i: ∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]. Interpreted as a vector in weight space, the gradient gives the direction in which E rises most steeply; its negative therefore gives the direction of steepest descent.

Now that the direction is determined, the gradient descent training rule is:

w ← w + Δw

where:

Δw = −η ∇E(w)

Here η is a positive constant called the learning rate, which determines the step size of the gradient descent search. The negative sign makes the weight vector move in the direction that decreases E.

This training rule can also be written in terms of its components:

w_i ← w_i + Δw_i

where:

Δw_i = −η ∂E/∂w_i        (Formula 1)

Steepest descent can thus be implemented by changing each component w_i of w in proportion to ∂E/∂w_i.

The components ∂E/∂w_i of the gradient vector can be obtained by differentiating E in the training error formula given earlier.

The derivation is omitted.
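(For reference, the omitted step is just the chain rule applied to the definition of E(w) above; the reconstruction below follows the standard textbook calculation.)

∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{d∈D} (t_d − o_d)² ]
        = (1/2) Σ_{d∈D} 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_{d∈D} (t_d − o_d) ∂/∂w_i (t_d − w · x_d)
        = Σ_{d∈D} (t_d − o_d) (−x_id)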

We finally obtain:

∂E/∂w_i = Σ_{d∈D} (t_d − o_d)(−x_id)

Here x_id denotes the single input component x_i of training example d. We now have a formula for the gradient expressed entirely in terms of the linear unit inputs x_id, its outputs o_d, and the target values t_d of the training examples.

Substituting this expression into Formula 1 gives the gradient descent weight update rule:

Δw_i = η Σ_{d∈D} (t_d − o_d) x_id        (Formula 2)

The gradient descent algorithm for training linear units is therefore: pick an initial random weight vector; apply the linear unit to all training examples and compute Δw_i for each weight according to Formula 2; update each weight w_i by adding Δw_i; and repeat the process.

Because the error surface contains only a single global minimum, the algorithm converges to the weight vector with minimum error whether or not the training examples are linearly separable, provided a sufficiently small learning rate η is used.

If η is too large, the gradient descent search risks overstepping the minimum of the error surface rather than settling into it. A common refinement is therefore to gradually decrease η as the number of gradient descent steps grows, for example as sketched below.
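One simple schedule (an illustrative sketch, not a prescription from the article) divides an initial rate η0 by a quantity that grows with the step count, η_t = η0 / (1 + t/τ):

// Illustrative learning-rate schedule: eta_t = eta0 / (1 + t / tau),
// where t is the current gradient descent step and tau controls how fast eta shrinks.
double decayed_learning_rate(double eta0, int step, double tau)
{
    return eta0 / (1.0 + static_cast<double>(step) / tau);
}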

Pseudocode of the gradient descent algorithm:
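In place of the pseudocode figure, here is an illustrative C++ sketch (names and loop structure are my own, not from the original article) of the batch procedure just described; the per-example accumulation step is marked (T4.1) and the batch weight update (T4.2):

#include <vector>
#include <cstddef>

// Batch gradient descent for a linear unit (Formula 2).
// Each example is (X[d], t[d]); X[d][0] is assumed to be 1 so w[0] is the bias.
void gradient_descent(std::vector<double>& w,
                      const std::vector<std::vector<double>>& X,
                      const std::vector<double>& t,
                      double eta, int epochs)
{
    for (int epoch = 0; epoch < epochs; ++epoch)
    {
        std::vector<double> delta(w.size(), 0.0);        // initialize each delta_w_i to zero
        for (std::size_t d = 0; d < X.size(); ++d)
        {
            double o = 0.0;                              // linear unit output o = w . x
            for (std::size_t i = 0; i < w.size(); ++i)
                o += w[i] * X[d][i];
            for (std::size_t i = 0; i < w.size(); ++i)
                delta[i] += eta * (t[d] - o) * X[d][i];  // (T4.1)  delta_w_i += eta*(t - o)*x_i
        }
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] += delta[i];                            // (T4.2)  w_i += delta_w_i
    }
}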

 

To obtain the stochastic approximation of gradient descent, delete step (T4.2) and replace step (T4.1) with w_i ← w_i + η (t − o) x_i.

 

 

Stochastic gradient descent algorithm

Gradient descent is an important general learning paradigm. It is a strategy for searching through a large or infinite hypothesis space, and it can be applied whenever both of the following conditions hold:

(1) the hypothesis space contains continuously parameterized hypotheses;

(2) the error can be differentiated with respect to these hypothesis parameters.

The main practical issues in applying gradient descent are:

(1) Sometimes the convergence process may be very slow;

(2) If there are multiple local minima on the error surface, there is no guarantee that the process will find the global minimum.

 

A common variant of gradient descent that alleviates these difficulties is known as incremental gradient descent or stochastic gradient descent.

The gradient descent training rule in Formula 2 computes the weight update only after summing over all the training examples in D. The idea behind stochastic gradient descent is instead to approximate this search by updating the weights incrementally, following the error computed for each individual example.

The modified training rule is like Formula 2, except that as we iterate through the training examples the weights are updated after each example according to:

Δw_i = η (t − o) x_i        (Formula 3)

where t is the target value, o is the unit output, and x_i is the i-th input for the current training example.
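A minimal C++ sketch of this incremental update, under the same illustrative assumptions as the batch sketch above (x[0] fixed to 1, names my own):

#include <vector>
#include <cstddef>

// Stochastic (incremental) gradient descent for a linear unit (Formula 3):
// w_i += eta * (t - o) * x_i, applied immediately after each training example.
void stochastic_gradient_descent(std::vector<double>& w,
                                 const std::vector<std::vector<double>>& X,
                                 const std::vector<double>& t,
                                 double eta, int epochs)
{
    for (int epoch = 0; epoch < epochs; ++epoch)
    {
        for (std::size_t d = 0; d < X.size(); ++d)
        {
            double o = 0.0;                              // linear unit output for example d
            for (std::size_t i = 0; i < w.size(); ++i)
                o += w[i] * X[d][i];
            for (std::size_t i = 0; i < w.size(); ++i)
                w[i] += eta * (t[d] - o) * X[d][i];      // Formula 3 update
        }
    }
}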

Stochastic gradient descent can be viewed as defining a separate error function for each individual training example d:

E_d(w) = (1/2) (t_d − o_d)²

where t_d and o_d are the target value and the unit output for training example d.

Stochastic gradient descent iterates over the training examples d in D, at each step altering the weights according to the gradient of E_d(w). The sequence of these weight updates, taken over all the training examples, provides a reasonable approximation to descending the gradient of the original error function E(w).

 

Key differences between standard gradient descent and stochastic gradient descent:

(1) In standard gradient descent, the error is summed over all the examples before the weights are updated, whereas in stochastic gradient descent the weights are updated as each training example is examined.

(2) In standard gradient descent, each weight update step sums over many examples, so each step requires more computation.

(3) If E(w) has multiple local minima, stochastic gradient descent can sometimes avoid falling into them, because it uses the various gradients of E_d(w), rather than the gradient of E(w) itself, to guide its search.

 

Note:

The incremental rule in Formula 3 looks the same as the perceptron training rule seen earlier, but the two rules are different: in the incremental (delta) rule, o is the unthresholded output of the linear unit, o = w · x, whereas in the perceptron rule, o is the thresholded output, o = sgn(w · x).
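The difference is easy to see in code; a small illustrative sketch (function names are my own):

#include <vector>
#include <cstddef>

// The delta rule uses the unthresholded linear output o = w . x ...
double linear_output(const std::vector<double>& w, const std::vector<double>& x)
{
    double o = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i)
        o += w[i] * x[i];
    return o;
}

// ... while the perceptron rule uses the thresholded output o = sgn(w . x), i.e. +1 or -1.
double perceptron_output(const std::vector<double>& w, const std::vector<double>& x)
{
    return linear_output(w, x) > 0.0 ? 1.0 : -1.0;
}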

 

Example:

Given inputs x1 and x2 and an output o, train weights w1 and w2 satisfying w1*x1 + w2*x2 = o (the code below uses two weights and no bias term).

The training examples are:

  

x1 x2 o
1 4 19
2 5 26
5 1 19
4 2 20

The header file

#ifndef HEAD_H_INCLUDED
#define HEAD_H_INCLUDED

#include <iostream>
#include <fstream>
#include <vector>
#include <cstdio>
#include <cstdlib>
#include <cmath>

using namespace std;

const int DataRow = 4;              // number of training examples
const int DataColumn = 3;           // columns per example: x1, x2, o
const double learning_rate = .01;   // eta

extern double DataTable[DataRow][DataColumn];
extern double Theta[DataColumn-1];  // the weights being trained

const double loss_theta = 0.001;    // stop when the average weight change drops below this
const int iterator_n = 100;         // maximum number of iterations

#endif // HEAD_H_INCLUDED

 

The source code

#include "head.h"

double DataTable[DataRow][DataColumn];
double Theta[DataColumn-1];

// Read the training examples from data.txt and initialize the weights.
void Init()
{
    ifstream fin("data.txt");
    for (int i = 0; i < DataRow; i++)
    {
        for (int j = 0; j < DataColumn; j++)
        {
            fin >> DataTable[i][j];
        }
    }
    if (!fin)
    {
        cout << "fin error";
        exit(1);
    }
    fin.close();
    for (int i = 0; i < DataColumn-1; i++)
    {
        Theta[i] = 0.5;
    }
}

// Standard (batch) gradient descent: accumulate the updates over all examples,
// then apply them once per iteration (Formula 2).
void batch_gradient()
{
    double loss = 1000;
    for (int i = 0; i < iterator_n && loss >= loss_theta; i++)
    {
        double Thetasum[DataColumn-1] = {0};
        for (int j = 0; j < DataRow; j++)
        {
            double error = 0;
            for (int k = 0; k < DataColumn-1; k++)
            {
                error += DataTable[j][k] * Theta[k];            // o = w . x
            }
            error = DataTable[j][DataColumn-1] - error;         // t - o
            for (int k = 0; k < DataColumn-1; k++)
            {
                Thetasum[k] += learning_rate * error * DataTable[j][k];   // accumulate eta*(t-o)*x_k
            }
        }
        double a = 0;
        for (int k = 0; k < DataColumn-1; k++)
        {
            Theta[k] += Thetasum[k];
            a += abs(Thetasum[k]);
        }
        loss = a / (DataColumn-1);   // average absolute weight change, used as the stopping test
    }
}

// Stochastic (incremental) gradient descent: update the weights immediately
// after each individual example (Formula 3).
void stochastic_gradient()
{
    double loss = 1000;
    for (int i = 0; i < iterator_n && loss >= loss_theta; i++)
    {
        for (int j = 0; j < DataRow; j++)
        {
            double error = 0;
            for (int k = 0; k < DataColumn-1; k++)
            {
                error += DataTable[j][k] * Theta[k];            // o = w . x
            }
            error = DataTable[j][DataColumn-1] - error;         // t - o
            double a = 0;
            for (int k = 0; k < DataColumn-1; k++)
            {
                Theta[k] += learning_rate * error * DataTable[j][k];   // w_k += eta*(t-o)*x_k
                a += abs(learning_rate * error * DataTable[j][k]);
            }
            loss = a / (DataColumn-1);
            if (loss <= loss_theta) break;
        }
    }
}

void printTheta()
{
    for (int i = 0; i < DataColumn-1; i++)
        cout << Theta[i] << " ";
    cout << endl;
}

int main()
{
    Init();
    //batch_gradient();
    stochastic_gradient();
    printTheta();
    return 0;
}
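Assuming the header is saved as head.h, the source as main.cpp, and the four training rows above are placed in data.txt, the program can be built with an ordinary C++ compiler (e.g. g++ main.cpp -o delta) and run from the same directory. The four examples are exactly fitted by w1 = 3 and w2 = 4, so the printed Theta values should end up close to 3 and 4; switching the commented-out call in main() between batch_gradient() and stochastic_gradient() lets you compare the two variants.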



This article is mostly reprinted from: www.cnblogs.com/dztgc/archi…