An in-depth series of articles

CHAPTER 3 Improving the Way Neural Networks Learn

Softmax preface

Most of the time we use the cross-entropy cost to address the learning slowdown problem. However, I want to briefly introduce another approach, based on a Softmax layer of neurons. In artificial neural networks (ANNs), Softmax is often used as the activation function of the output layer, not only because it works well, but also because it makes the network's outputs easier to interpret. In addition, training with Softmax and the log-likelihood cost function works better than training with the quadratic cost function.

Softmax function properties

The Softmax function is defined as follows:

$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$

where z_j^L is the weighted input to the j-th neuron in the output layer L, and the sum in the denominator runs over all neurons in that layer. The exponentials in the formula ensure that all output activations are positive, and the sum in the denominator ensures that the Softmax outputs sum to 1. This particular form therefore guarantees that the output activations form a probability distribution in a natural way. You can think of Softmax as a kind of rescaling of the z_j^L, whose results are then combined to form a probability distribution.

The most distinctive feature of the Softmax function is that it outputs, for each neuron, the ratio of the exponential of that neuron's weighted input to the sum of the exponentials over all neurons in the layer. This makes the output easy to interpret: the larger a neuron's output, the more likely it is that the category corresponding to that neuron is the true category.
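
As a concrete illustration, here is a minimal NumPy sketch of the Softmax computation (the test vector is arbitrary):

```python
import numpy as np

def softmax(z):
    """Softmax of a vector of weighted inputs z.
    Subtracting max(z) before exponentiating is a standard numerical-stability
    trick; it does not change the result, since Softmax is shift-invariant."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 3.0, 4.0])
a = softmax(z)
print(a)          # approx. [0.0900, 0.2447, 0.6652] -- all positive
print(a.sum())    # 1.0 -- the outputs form a probability distribution
```

Note how the largest weighted input receives by far the largest share of the probability mass.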

Monotonicity of Softmax

One can show that the derivative ∂a_j^L / ∂z_k^L is positive when j = k and negative when j ≠ k. As a result, increasing z_j^L increases the corresponding output activation a_j^L and decreases all the other output activations. A proof of this monotonicity is sketched below.
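
The derivative can be computed directly from the definition of a_j^L:

$$\frac{\partial a_j^L}{\partial z_k^L} = \frac{\delta_{jk}\, e^{z_j^L} \sum_m e^{z_m^L} - e^{z_j^L}\, e^{z_k^L}}{\left(\sum_m e^{z_m^L}\right)^2} = a_j^L\left(\delta_{jk} - a_k^L\right)$$

where δ_jk equals 1 when j = k and 0 otherwise. For j = k this is a_j^L (1 - a_j^L) > 0, while for j ≠ k it is -a_j^L a_k^L < 0, which is exactly the claimed monotonicity.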

Non-locality of Softmax

A convenient property of a sigmoid output layer is that each output a_j^L is a function of the corresponding weighted input z_j^L alone. A Softmax layer is different: because the denominator sums over all the weighted inputs z_k^L, each output a_j^L is closely tied to every weighted input in the layer. The key point to understand about a Softmax layer is therefore that any particular output activation a_j^L depends on all of the weighted inputs, not just its own.

Inverting the Softmax layer

Suppose we have a neural network with a Softmax output layer and that the activations a_j^L are known. It is easy to show that the corresponding weighted inputs have the form z_j^L = ln a_j^L + C, where the constant C is independent of j.
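
The derivation is a one-liner: taking logarithms of the Softmax definition gives

$$\ln a_j^L = z_j^L - \ln\sum_k e^{z_k^L} \quad\Longrightarrow\quad z_j^L = \ln a_j^L + C, \qquad C = \ln\sum_k e^{z_k^L}$$

Since the sum in C runs over the whole layer, C is the same for every j.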

How Softmax solves the learning slowdown problem

We now know a fair amount about Softmax layers of neurons. But we have not yet seen how a Softmax layer addresses the learning slowdown problem. To understand that, let us define the log-likelihood cost function. We use x to denote a training input to the network and y to denote the corresponding desired output (the correct class). The log-likelihood cost associated with this training input is then

$$C \equiv -\ln a_y^L$$
So, for instance, if we are training on MNIST images and the input is an image of a 7, the corresponding log-likelihood cost is -ln a_7^L. To see that this makes intuitive sense, consider the case where the network is doing a good job, i.e. it is confident the input is a 7. In that case it estimates the corresponding probability a_7^L to be close to 1, so the cost -ln a_7^L is small. Conversely, if the network is performing badly, the probability a_7^L becomes small and the cost -ln a_7^L becomes large. So the log-likelihood cost also behaves the way we expect a cost function to behave.
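
A quick numeric sketch of this behaviour, with made-up probability vectors:

```python
import numpy as np

# Hypothetical softmax outputs for an image of a 7 (digit classes 0-9).
confident = np.array([0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.0, 0.98, 0.01, 0.0])
unsure    = np.full(10, 0.1)   # the network assigns 10% to every digit

y = 7                          # index of the correct digit
print(-np.log(confident[y]))   # ~0.02 -> small cost when the network is confident
print(-np.log(unsure[y]))      # ~2.30 -> large cost when it is not
```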

What about the learning slowdown? To analyze that, recall that the key to the slowdown is the behaviour of the quantities ∂C/∂w and ∂C/∂b. I will not go through the derivation explicitly here, but a little algebra gives

$$\frac{\partial C}{\partial b_j^L} = a_j^L - y_j$$

$$\frac{\partial C}{\partial w_{jk}^L} = a_k^{L-1}\left(a_j^L - y_j\right)$$
These equations are just like the analogous expressions we obtained earlier for the cross-entropy cost. And, just as in that earlier analysis, they ensure that we do not run into a learning slowdown. In fact, it is useful to think of a Softmax output layer with the log-likelihood cost as being quite similar to a sigmoid output layer with the cross-entropy cost.
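
These expressions are easy to verify numerically. The following sketch, with arbitrary weighted inputs and target class, compares the analytic formula ∂C/∂b_j^L = a_j^L - y_j against a finite-difference estimate; since ∂z_j^L/∂b_j^L = 1, perturbing z_j is equivalent to perturbing b_j:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cost(z, y):
    """Log-likelihood cost C = -ln a_y for true class index y."""
    return -np.log(softmax(z)[y])

z = np.array([0.5, -1.2, 2.0, 0.3])   # hypothetical weighted inputs
y = 2                                 # hypothetical true class index
a = softmax(z)

analytic = a - np.eye(len(z))[y]      # a_j - y_j
numeric = np.zeros_like(z)
eps = 1e-6
for j in range(len(z)):
    d = np.zeros_like(z)
    d[j] = eps
    numeric[j] = (cost(z + d, y) - cost(z - d, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-10: the two gradients agree
```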

Given this similarity, should you use a sigmoid output layer with the cross-entropy cost, or a Softmax output layer with the log-likelihood cost? In fact, in many application scenarios both approaches work well. As a general rule, the combination of Softmax and log-likelihood is preferable whenever the output activations need to be interpreted as probabilities. That is not always a concern, but it is useful for classification problems such as MNIST, where the classes are disjoint.

Proving the effectiveness of Softmax mathematically

The Softmax function, restated in the simpler per-neuron notation used for the rest of this derivation, is:

$$a_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$$

The derivative of Softmax is somewhat special in that it splits into two cases:

$$\frac{\partial a_j}{\partial z_i} = a_j\,(1 - a_j) \quad \text{when } i = j, \qquad \frac{\partial a_j}{\partial z_i} = -a_j\, a_i \quad \text{when } i \ne j$$

As mentioned above, the quadratic cost function can make training slow: the farther the initial output is from the true value, the more slowly the network learns. That problem can be fixed with the cross-entropy cost function. In fact, it can also be solved in another way, by pairing the Softmax activation function with the log-likelihood cost function.

The log-likelihood cost function is:

$$C = -\sum_k y_k \ln a_k$$

Here a_k denotes the output (activation value) of the k-th output neuron and y_k denotes the corresponding target value, which is either 0 or 1. Because the y_k are 0 or 1 and, for each sample, exactly one of them equals 1 while the rest are 0, the sum collapses and the cost simplifies to

$$C = -\ln a_j$$

where j is the index of the neuron whose target value is 1 (the true class).
To verify that Softmax combined with this cost function also avoids the training slowdown described above, the next step is to derive the gradient of the cost with respect to the output layer's weights w and biases b.

First, take the partial derivative of the cost with respect to a bias b_i of the output layer. Writing j for the index of the true class (so C = -ln a_j) and noting that ∂z_i/∂b_i = 1, the chain rule gives

$$\frac{\partial C}{\partial b_i} = \frac{\partial C}{\partial a_j}\,\frac{\partial a_j}{\partial z_i}\,\frac{\partial z_i}{\partial b_i} = -\frac{1}{a_j}\,\frac{\partial a_j}{\partial z_i}$$

When i = j, substituting the first case of the Softmax derivative above gives

$$\frac{\partial C}{\partial b_i} = -\frac{1}{a_j}\, a_j\,(1 - a_j) = a_j - 1$$

When i ≠ j, substituting the second case gives

$$\frac{\partial C}{\partial b_i} = -\frac{1}{a_j}\,\bigl(-a_j\, a_i\bigr) = a_i$$

In both cases this is exactly ∂C/∂b_i = a_i - y_i.
According to the four fundamental equations of back propagation (see the earlier article "Back Propagation Algorithm" for the detailed analysis), we know that

$$\frac{\partial C}{\partial w_{ik}} = a_k^{\mathrm{in}}\,\frac{\partial C}{\partial b_i}$$

where a_k^in denotes the activation of the k-th neuron in the previous layer. So, when i = j,

$$\frac{\partial C}{\partial w_{ik}} = a_k^{\mathrm{in}}\,(a_j - 1)$$

and when i ≠ j,

$$\frac{\partial C}{\partial w_{ik}} = a_k^{\mathrm{in}}\, a_i$$

In compact form, ∂C/∂w_{ik} = a_k^in (a_i - y_i): the gradients are driven directly by the output error a - y, so there is no learning slowdown.
For example, suppose that after several layers of computation the score vector for one training sample is [2, 3, 4]. Passing it through the Softmax function gives the probabilities

$$a = \left[\frac{e^2}{e^2+e^3+e^4},\; \frac{e^3}{e^2+e^3+e^4},\; \frac{e^4}{e^2+e^3+e^4}\right] \approx [0.0900,\; 0.2447,\; 0.6652]$$

If the sample's correct class is the second one, then the partial derivative computed above (which is simply a - y, i.e. ∂C/∂z) is

$$[0.0900,\; 0.2447 - 1,\; 0.6652] = [0.0900,\; -0.7553,\; 0.6652]$$

and back propagation then proceeds from this error vector.
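
The same numbers can be reproduced in a few lines of NumPy:

```python
import numpy as np

z = np.array([2.0, 3.0, 4.0])          # scores for one training sample
a = np.exp(z) / np.exp(z).sum()        # softmax probabilities
y = np.array([0.0, 1.0, 0.0])          # one-hot target: the second class is correct

print(a)       # approx. [0.0900, 0.2447, 0.6652]
print(a - y)   # approx. [0.0900, -0.7553, 0.6652] -- the error passed to back propagation
```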


Attention! If the targets y_k are not restricted to 0 or 1 but are real values in the interval [0, 1] (still summing to 1 for each sample), only a small change is needed: the 1 in the i = j result above is replaced by y_j, giving

$$\frac{\partial C}{\partial b_j} = a_j - y_j$$

and the other derivatives are adjusted accordingly. That is why in some places you will see the formula written directly as

$$\frac{\partial C}{\partial z_i} = a_i - y_i$$

Both versions are correct; they simply start from different assumptions about the targets, so the intermediate forms look different.
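
For completeness, the derivation behind that remark, starting from C = -∑_k y_k ln a_k and assuming only that the targets sum to 1, is:

$$\frac{\partial C}{\partial z_i} = -\sum_k \frac{y_k}{a_k}\,\frac{\partial a_k}{\partial z_i} = -\sum_k y_k\,(\delta_{ki} - a_i) = a_i\sum_k y_k - y_i = a_i - y_i$$

using ∂a_k/∂z_i = a_k(δ_ki - a_i) and ∑_k y_k = 1. With one-hot targets this reduces to the two-case result derived earlier.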

The relation between cross entropy and logarithmic likelihood

Conclusion: the cross-entropy loss and the maximum-likelihood loss coincide when each sample belongs to exactly one class.

The key points behind this agreement are as follows:

Each sample belongs to one, and only one, class; the idea of maximum likelihood is to maximize the probability of drawing the observed samples, so each sample contributes the probability of the single class it is in. The joint probability of the data set can then be written as a product over samples, and taking the logarithm turns that product into a sum, which is exactly the cross-entropy form. With multiple classes, as long as the class each sample belongs to is unique, the maximum-likelihood loss is still consistent with the cross-entropy loss.
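
In symbols, for a single sample with one-hot targets y_k and predicted class probabilities a_k:

$$p(\text{sample}) = \prod_k a_k^{\,y_k} \quad\Longrightarrow\quad -\ln p(\text{sample}) = -\sum_k y_k \ln a_k$$

so maximizing the likelihood of the data set (a product over independent samples) is the same as minimizing the summed cross-entropy.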

Derivation:

The Bernoulli distribution

The Bernoulli distribution is also called the 0-1 distribution. If a random variable x follows a Bernoulli distribution with parameter μ (0 ≤ μ ≤ 1), the probabilities of x taking the values 1 and 0 are:

$$p(x=1\mid\mu) = \mu, \qquad p(x=0\mid\mu) = 1 - \mu$$

The probability distribution over x can then be written compactly as:

$$\mathrm{Bern}(x\mid\mu) = \mu^{x}\,(1-\mu)^{1-x}$$
The log-likelihood function of a sample set drawn from a Bernoulli distribution

Given a data set D = {x_1, x_2, …, x_N} of observed values of x, and assuming the observations are drawn independently from the Bernoulli distribution p(x | μ) (so that p(x_1, x_2, …, x_N) = ∏_n p(x_n)), the likelihood function of μ is:

$$p(D\mid\mu) = \prod_{n=1}^{N} p(x_n\mid\mu) = \prod_{n=1}^{N} \mu^{x_n}\,(1-\mu)^{1-x_n}$$

From a frequentist point of view, μ is estimated by maximizing the likelihood function, which is equivalent to maximizing its logarithm:

$$\ln p(D\mid\mu) = \sum_{n=1}^{N}\bigl[x_n \ln\mu + (1-x_n)\ln(1-\mu)\bigr]$$

Setting the derivative with respect to μ to zero gives the maximum-likelihood solution:

$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n$$

Here, however, we only care about the log-likelihood expression itself:

$$\ln p(D\mid\mu) = \sum_{n=1}^{N}\bigl[x_n \ln\mu + (1-x_n)\ln(1-\mu)\bigr]$$
Cross entropy loss function

$$L = -\sum_{n=1}^{N}\bigl[x_n \ln z_n + (1-x_n)\ln(1-z_n)\bigr]$$

Here x represents the original signal (the target) and z represents the reconstructed signal (the prediction). The loss function is to be minimized, whereas the likelihood function is to be maximized, and the two expressions differ only by a sign.
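
A tiny NumPy sketch of that sign relationship, with made-up signal values:

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])   # hypothetical original signal (targets)
z = np.array([0.9, 0.2, 0.8, 0.7, 0.1])   # hypothetical reconstruction (predictions)

# Bernoulli log-likelihood of x under parameters z, and the cross-entropy loss:
log_likelihood = np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))
cross_entropy  = -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))

print(log_likelihood, cross_entropy)   # the two differ only by a sign
```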


References

[1] Michael Nielsen. CHAPTER 3: Improving the way neural networks learn [DB/OL]. http://neuralnetworksanddeeplearning.com/chap3.html, 2018-06-22.

[2] Zhu Xiaohu, Zhang Freeman. Another Chinese Translation of Neural Networks and Deep Learning [DB/OL]. https://github.com/zhanggyb/nndl/blob/master/chap3.tex, 2018-06-22.

[3] __hon. Softmax log-likelihood cost function (formula derivation) [DB/OL]. https://blog.csdn.net/u014313009/article/details/51045303, 2018-06-22.

[4] HIT_NLP. A step-by-step walk-through of the softmax function and its derivation [DB/OL]. https://www.jianshu.com/p/ffa51250ba2e, 2018-06-22.