
Softmax and the cross entropy loss are fundamental to classification tasks. This article introduces them from several perspectives, explaining where cross entropy comes from and how the forward and backward passes of the network work.

In a typical multi-class classification task, we use a fully connected layer to map the feature channels to the number of classes and obtain a score for each class, apply softmax to turn the scores into class probabilities, and use cross entropy as the loss function to optimize the model. This article gives a brief introduction to softmax and the cross entropy function.

Softmax


$$a_i=\frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}}$$

where $z_i$ is the score (logit) of class $i$ and $a_i$ is the resulting probability of class $i$.

This maps the inputs to probabilities in $[0,1]$ that sum to 1. On the one hand, softmax widens the gaps between the input values: we want the model to assign a much larger probability to the correct class, which plain linear normalization struggles to achieve. On the other hand, the composition of softmax with the cross entropy function is smooth, and its gradient takes a very simple form.
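As a minimal illustrative sketch (not code from the original article), the formula above can be written directly in NumPy; subtracting the maximum score first is a common numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """Map a vector of scores z to a probability distribution over classes."""
    z = z - np.max(z)          # softmax is shift-invariant; this avoids overflow in exp
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())      # probabilities in [0, 1] that sum to 1
```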

Cross entropy

Any discussion of cross entropy has to start with maximum likelihood estimation from probability theory. Maximum likelihood estimation is used to estimate the parameters of the underlying distribution from observed samples; here, it is what allows us to fit the parameters of the model. The standard recipe is:

  1. Write down the likelihood function
  2. Take the logarithm
  3. Differentiate
  4. Set the derivative to zero and solve the resulting equation

The likelihood function is defined as follows:


$$\begin{aligned}L(\theta|x) &= P(x|\theta)\\ &= P(x_1, x_2,\cdots,x_n|\theta)\\ &= \prod_{i=1}^{n}p(x_i|\theta)\end{aligned}$$

Here $P$ is the joint probability of the samples. Since a product is awkward to differentiate, we take the logarithm:


$$\begin{aligned} \log\prod_{i=1}^{n}x_i &= \log(x_1\cdot x_2\cdots x_n) \\ &= \log(x_1)+\log(x_2)+\cdots+\log(x_n) \\ &= \sum_{i=1}^{n}\log(x_i)\end{aligned}$$

The abstract form is not very intuitive, so consider an example: out of 10 balls drawn, 6 are white and 4 are black. Assume the draws follow a Bernoulli distribution with parameter $p$, the probability of drawing a white ball. The likelihood function is then:


$$L=p^6(1-p)^4 \tag{1}$$

$$\log(L)=6\log(p)+4\log(1-p) \tag{2}$$
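Completing steps 3 and 4 of the recipe for this example, differentiate equation 2 with respect to $p$ and set the derivative to zero:

$$\frac{d\log(L)}{dp}=\frac{6}{p}-\frac{4}{1-p}=0 \quad\Rightarrow\quad 6(1-p)=4p \quad\Rightarrow\quad p=\frac{6}{10}=0.6$$

The maximum likelihood estimate is simply the observed frequency of white balls.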

So what’s the relationship between maximum likelihood and cross entropy?

Let’s divide both sides of equation 2 by N=10, the total number of samples:


$$\frac{\log(L)}{N}=\frac{6}{N}\log(p)+\frac{4}{N}\log(1-p)$$

Let $\frac{6}{N}=q$, so that $\frac{4}{N}=1-q$; then

$$\frac{\log(L)}{N}=q\log(p)+(1-q)\log(1-p)$$

Since gradient descent minimizes a loss function, negating the expression above gives the binary cross entropy loss function; the multi-class case is obtained in the same way:


$$CE=-\sum_i q_i\log(p_i)$$
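As a small numerical sketch of my own (not code from the article), the ball example can be used to check that the average negative log-likelihood is exactly the cross entropy between the empirical frequencies $q$ and the model probability $p$:

```python
import numpy as np

# Ball example: 6 white and 4 black balls out of N = 10 draws.
N = 10
q = np.array([0.6, 0.4])   # empirical (true) distribution of white / black
p = np.array([0.7, 0.3])   # some candidate model probabilities

# Cross entropy: CE = -sum_i q_i * log(p_i)
ce = -np.sum(q * np.log(p))

# Negative average log-likelihood: -(1/N) * log(p_white^6 * p_black^4)
nll = -(6 * np.log(p[0]) + 4 * np.log(p[1])) / N

print(ce, nll)   # the two values are identical: minimizing CE maximizes the likelihood
```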

On the other hand, cross entropy is, as the name suggests, a kind of entropy. Entropy is a concept from information theory that measures the uncertainty (the average information content) of a distribution, and it is defined as follows:


$$\mathrm{Entropy}=-\sum_i q_i\log(q_i)$$

The two expressions are clearly very similar: $p$ is the probability predicted by the model, and $q$ is the true distribution (in practice a one-hot or label-smoothed target). The purpose of minimizing cross entropy is to push $p$ as close as possible to $q$; when the two are close enough, the cross entropy approaches the entropy of $q$. This gives an information-theoretic way of understanding cross entropy.

KL divergence, by the way, is just cross entropy minus entropy, so KL divergence is also called relative entropy.
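A quick NumPy sketch (again my own illustration, using the same notation) makes the relationship concrete: cross entropy equals the entropy of $q$ plus the KL divergence from $q$ to $p$:

```python
import numpy as np

q = np.array([0.6, 0.4])                 # true distribution
p = np.array([0.7, 0.3])                 # predicted distribution

cross_entropy = -np.sum(q * np.log(p))
entropy = -np.sum(q * np.log(q))
kl = np.sum(q * np.log(q / p))           # KL(q || p), the relative entropy

print(cross_entropy, entropy + kl)       # the two values match
```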

The gradient

In probability theory we can solve for the parameters of a distribution in closed form, but how do we solve for the huge number of parameters in a network?

We approximate them with gradient descent, and the prerequisite for gradient descent is computing the gradient.

The derivative of the softmax + cross entropy composition also has a very simple form. Let us derive the gradient of the loss with respect to the input $z$.

First, the gradient of softmax has to be split into two cases, $i=j$ and $i\neq j$, where $a_i$ denotes the softmax output defined above and $z_i$ the corresponding input score. Both cases follow from the quotient rule and ordinary partial differentiation.

When $i = j$:


$$\begin{aligned} \frac{\partial a_i}{\partial z_i} &= \frac{\partial\left(\frac{e^{z_i}}{\sum_{k}e^{z_k}}\right)}{\partial z_i}\\ &=\frac{\sum_k e^{z_k}e^{z_i}-(e^{z_i})^2}{(\sum_k e^{z_k})^2}\\ &=\frac{e^{z_i}}{\sum_{k}e^{z_k}}\left(1-\frac{e^{z_i}}{\sum_{k}e^{z_k}}\right)\\ &=a_i(1-a_i) \end{aligned}$$

When $i \neq j$:


$$\begin{aligned} \frac{\partial a_j}{\partial z_i} &= \frac{\partial\left(\frac{e^{z_j}}{\sum_{k}e^{z_k}}\right)}{\partial z_i}\\ &=\frac{-e^{z_j}}{(\sum_k e^{z_k})^2}e^{z_i}\\ &=-a_i a_j \end{aligned}$$
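Both cases can be verified at once with a small NumPy check (my own sketch, not part of the original derivation): the full Jacobian of softmax is $\mathrm{diag}(a)-aa^T$, whose diagonal entries are $a_i(1-a_i)$ and whose off-diagonal entries are $-a_ia_j$:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)

# Analytic Jacobian: diag(a) - outer(a, a)
analytic = np.diag(a) - np.outer(a, a)

# Numerical Jacobian via central differences
eps = 1e-6
numeric = np.zeros((len(z), len(z)))
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-10, the two Jacobians agree
```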

For the cross entropy function, the derivative with respect to the softmax output $a_j$ is:


$$\frac{\partial L_j}{\partial a_j}=-y_j\frac{1}{a_j}$$

By the chain rule:


$$\begin{aligned} \frac{\partial L}{\partial z_i} &= \sum_j\left(\frac{\partial L_j}{\partial a_j}\frac{\partial a_j}{\partial z_i}\right)\\ &=\sum_{j \neq i}\left(\frac{\partial L_j}{\partial a_j}\frac{\partial a_j}{\partial z_i}\right) + \sum_{j=i}\left(\frac{\partial L_j}{\partial a_j}\frac{\partial a_j}{\partial z_i}\right)\\ &=\sum_{j\neq i}a_i y_j + \left(-y_i(1-a_i)\right)\\ &=a_i\sum_j y_j-y_i \end{aligned}$$

Since the label is a one-hot vector ($\sum_j y_j = 1$), this simplifies to


$$\frac{\partial L}{\partial z_i}=a_i-y_i$$
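As a final sanity check (a sketch added here, not from the article), a central-difference gradient of the combined softmax + cross entropy loss reproduces $a_i - y_i$:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def ce_loss(z, y):
    """L = -sum_i y_i * log(softmax(z)_i)"""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])            # one-hot label

analytic = softmax(z) - y                # gradient derived above

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (ce_loss(z + dz, y) - ce_loss(z - dz, y)) / (2 * eps)

print(analytic)
print(numeric)                           # matches a - y up to ~1e-9
```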