Neuron model

The most basic component of a neural network is the neuron model. The figure below shows the M-P neuron model: a neuron receives input signals passed from $n$ other neurons; these signals arrive through weighted connections, the neuron's total weighted input is compared with its threshold $\theta$, and the result is processed by an activation function to produce the output.

The ideal activation function is the step function shown in Figure (a) below, whose two output values correspond to the two states of the neuron (i.e., whether the total input exceeds the threshold). However, because the step function is discontinuous and non-smooth, the Sigmoid function shown in Figure (b) below is typically used in practice.

Perceptron and multi-layer networks

The perceptron is composed of two layers of neurons, as shown in the figure below. The input layer receives external input signals and passes them to the output layer, which consists of M-P neurons (also called threshold logic units).

The perceptron can easily implement the logical AND, OR and NOT operations (assuming $y = f(\sum_i w_i x_i - \theta)$, where $f$ is the step function):

  • AND: let $w_1 = w_2 = 1, \theta = 2$; then $y = f(1 \cdot x_1 + 1 \cdot x_2 - 2)$, and $y = 1$ only when $x_1 = x_2 = 1$;
  • OR: let $w_1 = w_2 = 1, \theta = 0.5$; then $y = f(1 \cdot x_1 + 1 \cdot x_2 - 0.5)$, and $y = 1$ when $x_1 = 1$ or $x_2 = 1$;
  • NOT (considering only $x_1$): let $w_1 = -0.6, w_2 = 0, \theta = -0.5$; then $y = f(-0.6 \cdot x_1 + 0 \cdot x_2 + 0.5)$, so $y = 0$ when $x_1 = 1$ and $y = 1$ when $x_1 = 0$.
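The three fixed-weight gates above can be checked directly; the following is a small sketch (the function and variable names are illustrative, not from the text):

```python
def f(z):
    # Step activation: outputs 1 when the weighted input reaches the threshold.
    return 1 if z >= 0 else 0

def mp_neuron(x1, x2, w1, w2, theta):
    # An M-P neuron: y = f(w1*x1 + w2*x2 - theta)
    return f(w1 * x1 + w2 * x2 - theta)

AND = lambda x1, x2: mp_neuron(x1, x2, 1, 1, 2)       # w1=w2=1, theta=2
OR  = lambda x1, x2: mp_neuron(x1, x2, 1, 1, 0.5)     # w1=w2=1, theta=0.5
NOT = lambda x1:     mp_neuron(x1, 0, -0.6, 0, -0.5)  # w1=-0.6, w2=0, theta=-0.5

print([AND(a, b) for a, b in [(0,0), (0,1), (1,0), (1,1)]])  # [0, 0, 0, 1]
print([OR(a, b)  for a, b in [(0,0), (0,1), (1,0), (1,1)]])  # [0, 1, 1, 1]
print([NOT(a)    for a in (0, 1)])                           # [1, 0]
```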

More generally, given a training set, the weights $w_i\ (i = 1, 2, \dots, n)$ and the threshold $\theta$ can be learned. $\theta$ can be regarded as the weight $w_{n+1}$ of the connection to a dummy node with a fixed input of $-1.0$, so that weight and threshold learning are unified into weight learning alone.

For a training sample $(\pmb{x}, y)$, if the current perceptron output is $\hat{y}$, the weights are adjusted according to:


$$w_i \leftarrow w_i + \Delta w_i \\ \Delta w_i = \eta(y - \hat{y})x_i$$

where $\eta \in (0, 1)$ is the learning rate. From the formula above, if the prediction is correct, i.e. $\hat{y} = y$, then $\Delta w_i = 0$ and the weights do not change; otherwise they are adjusted in proportion to the error.

The perceptron has only one layer of functional neurons (i.e., only the output layer applies an activation function), so its learning capacity is limited. The AND, OR and NOT problems above are all linearly separable: if a problem is linearly separable, there exists a linear hyperplane separating the classes, as shown in the figure below, and it can be proved that the perceptron's learning process converges to an appropriate weight vector $\pmb{w}$. On a non-separable problem the perceptron oscillates and never converges, as shown in Figure (d) below; the XOR problem is such a nonlinearly separable problem.
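The perceptron update rule above can be sketched in a few lines of NumPy. This is a minimal illustration (the helper names are assumptions, not from the text): the threshold is folded into the weights via the fixed $-1.0$ dummy input described earlier, and on the linearly separable AND problem the training loop converges:

```python
import numpy as np

def step(z):
    return 1.0 if z >= 0 else 0.0

def train_perceptron(X, y, eta=0.1, epochs=100):
    # Append a dummy input fixed at -1.0, so theta is learned as w_{n+1}.
    Xb = np.hstack([X, -np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            y_hat = step(w @ xi)
            w += eta * (yi - y_hat) * xi   # Delta w_i = eta (y - y_hat) x_i
            errors += int(y_hat != yi)
        if errors == 0:                    # converged on separable data
            break
    return w

# AND is linearly separable, so the perceptron converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1], dtype=float)
w = train_perceptron(X, y_and)
preds = [step(np.append(xi, -1.0) @ w) for xi in X]
print(preds)  # [0.0, 0.0, 0.0, 1.0]
```

Running the same loop on XOR targets would never reach `errors == 0`, which is exactly the oscillation described above.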

To solve nonlinearly separable problems, multiple layers of functional neurons are needed. As shown in Figure (a) below, a layer of neurons is added between the input and output layers, called the hidden layer. Both the hidden layer and the output layer are functional neurons with activation functions.

A more common neural network is the layered structure shown in the figure below, in which the neurons of each layer are fully connected to those of the next layer, with no intra-layer or cross-layer connections. Such networks are called multi-layer feedforward neural networks.

Intuitively, neural network learning adjusts the connection weights $\pmb{w}$ between layers and the threshold $\theta$ of each functional neuron according to the training data.

The error BackPropagation algorithm

Multi-layer networks have stronger learning ability and require more sophisticated learning algorithms. The error BackPropagation (BP) algorithm is the most prominent and widely used among them; a BP neural network generally refers to a multi-layer feedforward neural network trained with the BP algorithm.

Given a data set $D = \{(\pmb{x}_1, \pmb{y}_1), \dots, (\pmb{x}_m, \pmb{y}_m)\}$, $\pmb{x}_i \in \mathbb{R}^d$, $\pmb{y}_i \in \mathbb{R}^l$: each input sample has $d$ features, and the output is an $l$-dimensional real-valued vector expressing the classification information. The figure below shows a multi-layer feedforward network structure.

For a training sample $(\pmb{x}_k, \pmb{y}_k)$, suppose the network's output is $\hat{\pmb{y}}_k = (\hat{y}^k_1, \dots, \hat{y}^k_l)$, where:


$$\hat{y}^k_j = f(\beta_j - \theta_j)$$

Its mean square error is:


$$E_k = \frac{1}{2}\sum^l_{j=1}(\hat{y}^k_j - y^k_j)^2$$

The network in the figure above has $(d + l + 1)q + l$ parameters to determine (all connection weights, plus the thresholds of every neuron outside the input layer, where $q$ is the number of hidden neurons). BP is an iterative learning algorithm that updates its parameter estimates in each round using a generalized perceptron learning rule. The update of any parameter $v$ takes the form:


$$v \leftarrow v + \Delta v$$

Taking the connection weight $w_{hj}$ from the hidden layer to the output layer as an example: the BP algorithm is based on a gradient descent strategy and adjusts each parameter in the direction of the negative gradient of the target. For the error $E_k$ and learning rate $\eta$, we have:


$$\Delta w_{hj} = -\eta \frac{\partial E_k}{\partial w_{hj}}$$

$w_{hj}$ first affects the input value $\beta_j$ of the $j$-th output neuron, which in turn affects the output value $\hat{y}^k_j$ and finally $E_k$:


$$\frac{\partial E_k}{\partial w_{hj}} = \frac{\partial E_k}{\partial \hat{y}^k_j}\cdot \frac{\partial \hat{y}^k_j}{\partial \beta_j}\cdot \frac{\partial \beta_j}{\partial w_{hj}}$$

According to the definition of $\beta_j$ (the input to the $j$-th output neuron, $\beta_j = \sum_{h=1}^q w_{hj} b_h$; see the formula in Figure 5.7), it is clear that:


$$\frac{\partial \beta_j}{\partial w_{hj}} = b_h$$

where $b_h$ is the output of the $h$-th neuron in the hidden layer. The Sigmoid function has the property:


$$f'(x) = f(x)(1 - f(x))$$

It follows that:


$$g_j = -\frac{\partial E_k}{\partial \hat{y}^k_j}\cdot \frac{\partial \hat{y}^k_j}{\partial \beta_j} = -(\hat{y}^k_j - y^k_j)f'(\beta_j - \theta_j) = \hat{y}^k_j(1 - \hat{y}^k_j)(y^k_j - \hat{y}^k_j)$$

Combining the above, the BP update formula for $w_{hj}$ is:


$$\Delta w_{hj} = \eta g_j b_h$$

The update formulas for the other parameters can be derived in the same way. The learning rate $\eta$ controls the step size of each iteration: if it is too large the algorithm tends to oscillate, and if it is too small convergence may be very slow.
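The Sigmoid derivative property used in the derivation above can be verified numerically. A small sketch (the variable names are illustrative) compares the analytic form $f'(x) = f(x)(1 - f(x))$ against a central-difference approximation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-2.0, 0.0, 1.5])
h = 1e-6
# Central-difference approximation of the derivative at each point.
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)
# Analytic derivative: f'(x) = f(x)(1 - f(x)).
analytic = sigmoid(xs) * (1 - sigmoid(xs))
print(np.allclose(numeric, analytic, atol=1e-8))  # True
```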

The pseudocode of the BP algorithm is as follows:

Here’s an example:
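As a concrete illustration, below is a minimal NumPy sketch of standard BP (a per-sample, single-hidden-layer implementation under the formulas derived above; it is an assumption-laden sketch, not the book's code — the hidden-layer gradient $e_h = b_h(1-b_h)\sum_j w_{hj} g_j$ and the threshold updates $\Delta\theta_j = -\eta g_j$, $\Delta\gamma_h = -\eta e_h$ follow the same derivation pattern as $\Delta w_{hj}$). It trains on XOR, the nonlinearly separable problem from earlier:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, V, gamma, W, theta):
    b = sigmoid(X @ V - gamma)        # hidden outputs b_h
    return sigmoid(b @ W - theta)     # network outputs \hat{y}_j

rng = np.random.default_rng(0)
d, q, l = 2, 4, 1                     # input, hidden, output layer sizes
eta = 0.5                             # learning rate

V = rng.normal(scale=0.5, size=(d, q)); gamma = np.zeros(q)   # input -> hidden
W = rng.normal(scale=0.5, size=(q, l)); theta = np.zeros(l)   # hidden -> output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

err_before = np.mean((forward(X, V, gamma, W, theta) - Y) ** 2)

for epoch in range(10000):
    for x, y in zip(X, Y):
        b = sigmoid(x @ V - gamma)
        y_hat = sigmoid(b @ W - theta)
        g = y_hat * (1 - y_hat) * (y - y_hat)   # output gradient g_j
        e = b * (1 - b) * (W @ g)               # hidden gradient e_h
        W += eta * np.outer(b, g)               # Delta w_hj = eta g_j b_h
        theta -= eta * g                        # Delta theta_j = -eta g_j
        V += eta * np.outer(x, e)               # Delta v_ih = eta e_h x_i
        gamma -= eta * e                        # Delta gamma_h = -eta e_h

err_after = np.mean((forward(X, V, gamma, W, theta) - Y) ** 2)
print(err_after < err_before)   # training reduces the mean squared error
```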

Global minimum and local minimum

In fact, the learning process of a neural network can be regarded as a parameter optimization process. "Optimal" here comes in two flavors: the local minimum and the global minimum. For $(\pmb{w}^*, \theta^*)$, if there exists $\epsilon > 0$ such that


$$\forall (\pmb{w}; \theta) \in \{(\pmb{w}; \theta) \mid \|(\pmb{w}; \theta) - (\pmb{w}^*; \theta^*)\| \leq \epsilon\}$$
  • we have $E(\pmb{w}; \theta) \geq E(\pmb{w}^*; \theta^*)$, then $(\pmb{w}^*; \theta^*)$ is a local minimum solution;
  • if instead $E(\pmb{w}; \theta) \geq E(\pmb{w}^*; \theta^*)$ holds for any $(\pmb{w}; \theta)$, then $(\pmb{w}^*; \theta^*)$ is the global minimum solution.

In practical tasks we hope to find the global minimum, but the error function often has multiple local minima, and gradient descent stops searching once it reaches one of them, which is not necessarily optimal. Researchers have therefore designed several strategies that try to "jump out" of a local minimum and continue the search toward the global minimum:

  1. Initialize several neural networks with different sets of random parameter values, train them all, and take the solution with the smallest error as the final result. This is equivalent to starting the search from different initial points; among the different local minima reached, one may be close to the global minimum.
  2. Use simulated annealing: at each iteration, a result worse than the current solution is accepted with a certain probability, which helps the search escape local minima. The probability of accepting a worse solution decreases gradually over time to keep the algorithm stable.
  3. Use stochastic gradient descent, which adds randomness to the gradient computation, so that even at a local minimum the computed gradient may be nonzero, allowing the search to continue.
  4. Genetic algorithms.
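Strategy 2 can be sketched in a few lines. This is a generic simulated-annealing illustration on a hypothetical 1-D objective (the function, cooling schedule and step size are all assumptions, not from the text); the key line is the Metropolis-style acceptance of worse solutions with probability $e^{-\Delta E / T}$, which shrinks as the temperature $T$ decays:

```python
import math
import random

random.seed(1)

def f(x):
    # Illustrative 1-D objective (not from the text): the global minimum is
    # near x = -1.43 and a shallower local minimum lies near x = +1.37.
    return x**4 - 4 * x**2 + x

def simulated_annealing(x0, t0=2.0, cooling=0.999, steps=20000):
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(steps):
        cand = x + random.gauss(0.0, 0.5)       # random neighbouring solution
        fc = f(cand)
        # A worse solution is accepted with probability exp(-(fc - fx)/t);
        # as the temperature t decays, this probability shrinks toward zero.
        if fc < fx or random.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling
    return best_x, best_f

# Start at the local minimum; by construction the best solution tracked is
# never worse than the starting point, and early high-temperature moves may
# carry the search over the barrier to the global minimum.
best_x, best_f = simulated_annealing(1.37)
print(best_f <= f(1.37))
```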