This is the seventh day of my November challenge

Backpropagation: a fast algorithm for computing the gradient of the cost function

Use matrices to quickly compute the output

To explain the problem, we first give a clear definition of the weights in a neural network, as shown in the figure below:

We also define $b^l_j$ as the bias of the $j$-th neuron in layer $l$.

Based on these definitions, the activation $a^l_j$ of the $j$-th neuron in layer $l$ can be written in terms of the activations $a^{l-1}_k$ of the neurons in layer $l-1$:


$$a^l_j=\sigma\left(\sum_k w^l_{jk}a^{l-1}_k+b^l_j\right)$$

In the sum $\sum_k w^l_{jk}a^{l-1}_k$, the index $k$ runs over all neurons in layer $l-1$. A weight matrix $w^l$ is defined for each layer $l$; its entries are the weights of the connections between layer $l-1$ and layer $l$, so $w^l_{jk}$ is the weight of the connection from the $k$-th neuron in layer $l-1$ to the $j$-th neuron in layer $l$.

The formula above can be written more compactly in vectorized form:


$$a^l=\sigma(w^l a^{l-1}+b^l)$$

When computing $a^l$, the intermediate quantity $z^l = w^l a^{l-1}+b^l$ is calculated first. $z^l$ is called the weighted input to the neurons in layer $l$, so the formula above can also be written as $a^l=\sigma(z^l)$.
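As a minimal sketch of this vectorized computation (assuming NumPy; the helper names sigmoid and feedforward and the toy 2-3-1 network below are illustrative, not part of the original post):

```python
import numpy as np

def sigmoid(z):
    # elementwise sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    # a: column vector of input activations, shape (n_in, 1)
    # weights[i], biases[i]: parameters connecting layer i to layer i+1
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b      # weighted input z^l = w^l a^{l-1} + b^l
        a = sigmoid(z)            # activation a^l = sigma(z^l)
    return a

# usage sketch: a 2-3-1 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]
x = np.array([[0.5], [0.1]])
print(feedforward(x, weights, biases))
```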

Two assumptions about the cost function

The objective of the backpropagation algorithm is to compute the partial derivatives $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ of the cost function $C$ with respect to the weights $w$ and the biases $b$. $C$ is defined as the quadratic cost function:


$$C=\frac{1}{2n}\sum_x \|y(x)-a^L(x)\|^2$$

Here $n$ is the total number of training samples, $\sum_x$ sums over all samples, $y=y(x)$ is the desired target output for input $x$, $L$ is the number of layers in the network, and $a^L=a^L(x)$ is the vector of activations output by the network when $x$ is the input.

In order to apply the backpropagation algorithm, two assumptions are made about the cost function $C$:

  • The cost function can be written as an average $C=\frac{1}{n}\sum_x C_x$ over per-sample costs $C_x$; for the quadratic cost, the cost of a single sample is $C_x=\frac{1}{2}\|y-a^L\|^2$.
  • The cost function can be written as a function of the output of the neural network: $C=C(a^L)$. Under this assumption, $C_x$ can be further written as:

$$C=\frac{1}{2}\|y-a^L\|^2=\frac{1}{2}\sum_j(y_j-a^L_j)^2$$
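As a small sketch (assuming NumPy column vectors y and a_L for a single training sample; the function name is illustrative), the per-sample quadratic cost could be computed as:

```python
import numpy as np

def quadratic_cost(a_L, y):
    # C_x = 0.5 * ||y - a_L||^2 for one training sample
    return 0.5 * np.sum((y - a_L) ** 2)
```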

Hadamard product (Schur product): $s\odot t$ denotes the elementwise product of two vectors, $(s\odot t)_j=s_j t_j$. For example:

$$\begin{bmatrix}1\\2\end{bmatrix}\odot\begin{bmatrix}3\\4\end{bmatrix}=\begin{bmatrix}1\cdot 3\\2\cdot 4\end{bmatrix}=\begin{bmatrix}3\\8\end{bmatrix}$$

Differential, gradient, and gradient descent

Four basic formulas

The core of the backpropagation algorithm is to compute the partial derivatives $\frac{\partial C}{\partial w^l_{jk}}$ and $\frac{\partial C}{\partial b^l_j}$. To do this, we first introduce an intermediate quantity $\delta^l_j$, which represents the error of the $j$-th neuron in layer $l$:


$$\delta^l_j=\frac{\partial C}{\partial z^l_j}$$

This definition is based on the heuristic understanding that $\frac{\partial C}{\partial z^l_j}$ is a measure of the neuron's error.

Backpropagation is based on four basic formulas that allow us to compute the error $\delta^l$ and the gradient of the cost function.

Formula 1

The error of the output layer, $\delta^L$:


$$\delta^L_j=\frac{\partial C}{\partial a^L_j}\sigma'(z^L_j) \tag{1}$$

In this formula, $\frac{\partial C}{\partial a^L_j}$ represents the rate at which the cost $C$ changes with the activation output by the $j$-th neuron in layer $L$, and $\sigma'(z^L_j)$ denotes the rate of change of the activation function $\sigma$ at $z^L_j$. For the quadratic cost function $C=\frac{1}{2}\sum_j(y_j-a_j)^2$, we have $\frac{\partial C}{\partial a^L_j}=(a_j-y_j)$.

For convenience of computation, formula (1) can be rewritten in matrix form:


$$\delta^L=\nabla_a C\odot\sigma'(z^L)$$

Substituting the partial derivative of the quadratic cost function gives:


$$\delta^L=(a^L-y)\odot\sigma'(z^L)$$
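A minimal NumPy sketch of this output-layer error for the quadratic cost (the helper names sigmoid_prime and output_error are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    return sigmoid(z) * (1 - sigmoid(z))

def output_error(a_L, y, z_L):
    # delta^L = (a^L - y) ⊙ sigma'(z^L); '*' is the elementwise (Hadamard) product
    return (a_L - y) * sigmoid_prime(z_L)
```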

Formula 2

Express the error $\delta^l$ of the current layer in terms of the error $\delta^{l+1}$ of the next layer:


$$\delta^l=\left((w^{l+1})^T\delta^{l+1}\right)\odot\sigma'(z^l) \tag{2}$$

Formula (2) shows how to derive the error of the current layer from the error of the next layer. This is the "backward" in backpropagation: the error can be regarded as being moved backwards through the network. By combining formulas (1) and (2), the error of any layer can be calculated.
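A hedged sketch of this backward step (assuming w_next is $w^{l+1}$, delta_next is $\delta^{l+1}$, and z_l is the weighted input of the current layer; the function names are illustrative):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

def backpropagate_error(w_next, delta_next, z_l):
    # formula (2): delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
    return np.dot(w_next.transpose(), delta_next) * sigmoid_prime(z_l)
```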

Formula 3

The rate of change of the cost function with respect to any bias in the network:


$$\frac{\partial C}{\partial b^l_j}=\delta^l_j \tag{3}$$

Formula (3) shows that the error $\delta^l_j$ and the partial derivative $\frac{\partial C}{\partial b^l_j}$ are exactly equal.

Formula 4

The rate of change of the cost function with respect to any weight:


$$\frac{\partial C}{\partial w^l_{jk}}=a^{l-1}_k\delta^l_j \tag{4}$$

Formula (4) gives the partial derivative $\frac{\partial C}{\partial w^l_{jk}}$, which can be written more simply as:


$$\frac{\partial C}{\partial w}=a_{in}\,\delta_{out}$$


Here $a_{in}$ is the activation on the input side of the connection carrying the weight $w$, and $\delta_{out}$ is the error on the output side of that connection, as shown in the figure below.
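A minimal sketch of formulas (3) and (4) for a single layer (assuming delta is the error $\delta^l$ and a_prev is the activation $a^{l-1}$, both NumPy column vectors; the function name is illustrative):

```python
import numpy as np

def layer_gradients(delta, a_prev):
    # formula (3): dC/db^l = delta^l
    nabla_b = delta
    # formula (4): dC/dw^l_{jk} = a^{l-1}_k * delta^l_j, i.e. the outer product delta (a^{l-1})^T
    nabla_w = np.dot(delta, a_prev.transpose())
    return nabla_b, nabla_w
```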

A property of the S-shaped (sigmoid) function is that $\sigma$ changes very little when its output approaches 0 or 1, which makes weight learning slow. In this case the output neuron is said to be saturated, and weight learning stops (or proceeds very slowly).
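For reference, this follows from the standard identity for the derivative of the sigmoid:

$$\sigma(z)=\frac{1}{1+e^{-z}},\qquad \sigma'(z)=\sigma(z)\bigl(1-\sigma(z)\bigr)$$

so $\sigma'(z)\approx 0$ whenever the output $\sigma(z)$ is close to 0 or 1, which is exactly the saturation described above.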

Proofs of the four formulas

Background knowledge

The chain rule from multivariable calculus:


$$\frac{dw}{dt}=\sum^n_{i=1}\frac{\partial w}{\partial x_i}\frac{dx_i}{dt}$$

Proof of Formula 1

Recall the definition of the output error $\delta$:


$$\delta^L_j=\frac{\partial C}{\partial z^L_j}$$

Applying the chain rule above, this definition can be rewritten as:


$$\delta^L_j =\sum_k \frac{\partial C}{\partial a^L_k}\frac{\partial a^L_k}{\partial z^L_j}$$

Here $k$ runs over all neurons in the output layer. The activation $a^L_k$ of the $k$-th output neuron depends on the weighted input $z^L_j$ of the $j$-th neuron only when $k=j$, so $\frac{\partial a^L_k}{\partial z^L_j}$ vanishes when $k\neq j$ and the sum reduces to a single term:


$$\delta^L_j = \frac{\partial C}{\partial a^L_j}\frac{\partial a^L_j}{\partial z^L_j}$$

Since $a^L_j=\sigma(z^L_j)$, the second factor on the right-hand side can be written as $\sigma'(z^L_j)$, which gives formula (1):


$$\delta^L_j=\frac{\partial C}{\partial a^L_j}\sigma'(z^L_j) \tag{1}$$

Proof of Formula 2

Again, start from the definition of the error $\delta$ and expand it:


$$\delta^l_j=\frac{\partial C}{\partial z^l_j}$$

We want to express this in terms of the next layer's error $\delta^{l+1}_k=\frac{\partial C}{\partial z^{l+1}_k}$, so we apply the chain rule over the weighted inputs $z^{l+1}_k$ of layer $l+1$:


$$\delta^l_j=\frac{\partial C}{\partial z^l_j} =\sum_k \frac{\partial C}{\partial z^{l+1}_k}\frac{\partial z^{l+1}_k}{\partial z^{l}_j}=\sum_k \frac{\partial z^{l+1}_k}{\partial z^{l}_j}\, \delta^{l+1}_k$$

By the definition of $z$:


$$z^{l+1}_k=\sum_j w^{l+1}_{kj}a^l_j+b^{l+1}_k=\sum_j w^{l+1}_{kj}\sigma(z^l_j)+b^{l+1}_k$$

Differentiating gives $\frac{\partial z^{l+1}_k}{\partial z^{l}_j}$:


$$\frac{\partial z^{l+1}_k}{\partial z^{l}_j}=w^{l+1}_{kj}\sigma'(z^l_j)$$

This derivative can be checked term by term for each value of $j$:

When $j=1$:

$$\frac{\partial z^{l+1}_k}{\partial z^{l}_1}=w^{l+1}_{k1}\sigma'(z^l_1)$$

When $j=2$:

$$\frac{\partial z^{l+1}_k}{\partial z^{l}_2}=w^{l+1}_{k2}\sigma'(z^l_2)$$

When $j=m$:

$$\frac{\partial z^{l+1}_k}{\partial z^{l}_m}=w^{l+1}_{km}\sigma'(z^l_m)$$

This completes the check.

Combining the formulas above gives formula (2):


$$\delta^l_j=\sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'(z^l_j)
\qquad\text{or, in matrix form,}\qquad
\delta^l=\left((w^{l+1})^T\delta^{l+1}\right)\odot\sigma'(z^l)$$

Proof of Formula 3


$$\frac{\partial C}{\partial b^l_j}=\frac{\partial C}{\partial z^l_j}\frac{\partial z^l_j}{\partial b^l_j}=\delta^l_j \times 1$$

Proof of Formula 4


$$\frac{\partial C}{\partial w^l_{jk}}=\frac{\partial C}{\partial z^l_{j}}\frac{\partial z^l_{j}}{\partial w^l_{jk}}=\delta^l_j\frac{\partial \left(\sum_k w^l_{jk}a^{l-1}_k+b^l_j\right)}{\partial w^l_{jk}}=a^{l-1}_k\delta^l_j$$

Backpropagation algorithm process

  • Input $x$: set the activation $a^1$ of the input layer.
  • Feedforward: for $l=2,3,\dots,L$, compute the intermediate quantities $z^l=w^l a^{l-1}+b^l$ and $a^l=\sigma(z^l)$.
  • Output-layer error $\delta^L$: compute the error of the last (output) layer, $\delta^L=\nabla_a C\odot\sigma'(z^L)$.
  • Backpropagate the error: starting from $\delta^L$, compute the error of each earlier layer ($l=L-1,L-2,\dots,2$) via $\delta^l=\left((w^{l+1})^T\delta^{l+1}\right)\odot\sigma'(z^l)$.
  • Output: compute the gradient of the cost function, $\frac{\partial C}{\partial b^l_j}=\delta^l_j$ and $\frac{\partial C}{\partial w^l_{jk}}=a^{l-1}_k\delta^l_j$.

Building on the backpropagation algorithm, gradient descent is then applied to update the parameters:
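A minimal sketch of that update (assuming a learning rate eta and the gradients nabla_b, nabla_w produced by backpropagation; this is the plain gradient-descent rule $w\to w-\eta\,\frac{\partial C}{\partial w}$, $b\to b-\eta\,\frac{\partial C}{\partial b}$, not necessarily the exact code from the original post):

```python
def gradient_descent_step(weights, biases, nabla_w, nabla_b, eta):
    # w <- w - eta * dC/dw and b <- b - eta * dC/db, applied layer by layer
    weights = [w - eta * nw for w, nw in zip(weights, nabla_w)]
    biases = [b - eta * nb for b, nb in zip(biases, nabla_b)]
    return weights, biases
```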

Implementation of the algorithm

```python
def backprop(self, x, y):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # feedforward
    activation = x
    activations = [x]   # stores the activations of every layer
    zs = []             # stores the weighted inputs z of every layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        # compute the activation from z
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass
    # cost_derivative computes the difference between the network output and the
    # desired output (nabla_a C); sigmoid_prime returns the derivative of the sigmoid
    delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
    # gradients of the parameters b and w in the output layer
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    for l in range(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
    return nabla_b, nabla_w
```
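The nabla_b and nabla_w returned here are the gradients for a single training sample; they can be averaged over a mini-batch and fed into the gradient-descent update sketched above.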