Red Stone’s personal website: Redstonewill.com

Speaking of the cross entropy loss function ("Cross Entropy Loss"), the following formula comes to mind:

L = -[\,y\log \hat{y} + (1-y)\log(1-\hat{y})\,]

where y is the true sample label and ŷ is the predicted probability.
We’re already familiar with this cross entropy function, and most of the time we simply use it directly. But where does it come from? Why does it measure the gap between the true sample label and the predicted probability? And are there other forms of the cross entropy function? Perhaps many readers are not quite sure. No problem, I’m going to answer these questions in the plainest language possible.

1. The mathematical principle of the cross entropy loss function

As we know, in binary classification models such as Logistic Regression and Neural Networks, the true sample labels are {0, 1}, representing the negative and positive class respectively. The model usually ends with a Sigmoid function that outputs a probability value reflecting how likely the prediction is to be positive: the larger the value, the higher that probability.

The Sigmoid function’s expression is shown below; its graph is the familiar S-shaped curve rising from 0 to 1:

g(s) = \frac{1}{1+e^{-s}}

Here s is the output of the model’s last linear layer. The Sigmoid function has the following characteristics: when s = 0, g(s) = 0.5; when s >> 0, g(s) ≈ 1; when s << 0, g(s) ≈ 0. Clearly, g(s) maps the linear output of the previous layer to a probability value in (0, 1). This g(s) is the predicted output ŷ that appears in the cross entropy formula.
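To make these properties concrete, here is a minimal Python sketch (assuming NumPy is available; the function name is my own) that evaluates the Sigmoid at a few points:

```python
import numpy as np

def sigmoid(s):
    """Map a raw score s to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

# Check the properties mentioned above: g(0) = 0.5, g(s) -> 1 for large s, g(s) -> 0 for very negative s.
for s in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"s = {s:6.1f}  ->  g(s) = {sigmoid(s):.4f}")
```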

As stated above, the predicted output (the output of the Sigmoid function) represents the probability that the current sample’s label is 1:

\hat{y} = P(y=1\,|\,x)

Obviously, the probability of the current sample’s label being 0 is then:

1-\hat{y} = P(y=0\,|\,x)

Here is the key step: from the perspective of maximum likelihood, we can combine the above two cases into a single expression:

P(y\,|\,x) = \hat{y}^{\,y}\,(1-\hat{y})^{\,1-y}

Don’t worry if you are not familiar with maximum likelihood estimation; just look at it this way:

When the true sample label y = 0, the first factor \hat{y}^{\,y} equals 1, and the expression reduces to:

P(y\,|\,x) = 1-\hat{y}

When the true sample label y = 1, the second factor (1-\hat{y})^{\,1-y} equals 1, and the expression reduces to:

P(y\,|\,x) = \hat{y}

In both cases the probability expression is exactly the same as before; we have simply merged the two cases into one formula.

Now for the key point. Looking at the combined probability expression, we want P(y|x) to be as large as possible. First, apply the log function to P(y|x); since log is monotonically increasing, this does not change where the maximum is attained. We get:

\log P(y\,|\,x) = y\log \hat{y} + (1-y)\log(1-\hat{y})

We want log P(y|x) to be as large as possible, which is equivalent to wanting -log P(y|x) to be as small as possible. So we can define the loss function as Loss = -log P(y|x), which gives:

L = -[\,y\log \hat{y} + (1-y)\log(1-\hat{y})\,]

It is that simple: we have derived the loss function for a single sample. To compute the total loss over N samples, we just add up the N individual losses:

L = -\sum_{i=1}^{N}\left[\,y^{(i)}\log \hat{y}^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right)\,\right]

Thus, we have completed the full derivation of the cross entropy loss function.
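As a quick sanity check on the N-sample formula, here is a minimal NumPy sketch (variable and function names are my own choices, not from the original article):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Sum of -[y*log(y_hat) + (1-y)*log(1-y_hat)] over all samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])   # true labels in {0, 1}
y_pred = np.array([0.9, 0.2, 0.7, 0.4])   # predicted probabilities (toy values)
print(binary_cross_entropy(y_true, y_pred))
```

In practice the sum is often divided by N to report an average loss, which does not change where the minimum is.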

2. An intuitive understanding of the cross entropy loss function

Some readers might say: I now know how to derive the cross entropy loss function, but is there a more intuitive way to understand this expression than simply memorizing the formula? Good question! Next, we analyze the cross entropy function graphically to deepen your understanding.

First, write down the cross entropy loss function for a single sample:

L = -[\,y\log \hat{y} + (1-y)\log(1-\hat{y})\,]

We know that when y = 1:

L = -\log \hat{y}

At this point, the relationship between L and the predicted output is shown in the figure below:

Looking at the graph of L, it is simple and clear. The horizontal axis is the predicted output ŷ, and the vertical axis is the cross entropy loss L. Obviously, the closer the predicted output is to the true label 1, the smaller the loss L; the closer it is to 0, the larger L becomes. The trend of the function matches what we want in practice.

When y = 0:

L = -\log(1-\hat{y})

At this point, the relationship between L and the predicted output is shown in the figure below:

Similarly, the closer the predicted output is to the true label 0, the smaller the loss L; the closer the prediction gets to 1, the larger L becomes. Again, the trend of the function matches the actual situation.
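If you want to reproduce the two curves described above, the following short sketch (assuming NumPy and Matplotlib are installed) plots L = -log(ŷ) and L = -log(1-ŷ):

```python
import numpy as np
import matplotlib.pyplot as plt

y_hat = np.linspace(0.001, 0.999, 500)  # predicted outputs, avoiding exactly 0 and 1

plt.plot(y_hat, -np.log(y_hat), label="y = 1: L = -log(y_hat)")
plt.plot(y_hat, -np.log(1 - y_hat), label="y = 0: L = -log(1 - y_hat)")
plt.xlabel("predicted output y_hat")
plt.ylabel("cross entropy loss L")
plt.legend()
plt.show()
```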

From the above two graphs, we get a more intuitive understanding of the cross entropy loss function: regardless of whether the true label y is 0 or 1, L measures the gap between the predicted output and y.

In addition, it is worth noting that the graphs show that the more the predicted output differs from y, the larger L becomes, i.e. the greater the "penalty" on the current model; and the penalty grows nonlinearly, blowing up rapidly as the prediction approaches the wrong label. This is due to the nature of the log function itself. The benefit is that the model is strongly pushed to make its predicted output closer to the true label y.
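A few concrete numbers make this nonlinear penalty easy to see. The snippet below (toy values chosen only for illustration) prints the loss of a positive sample (y = 1) at several predicted probabilities:

```python
import math

# For a positive sample (y = 1) the loss is -log(y_hat).
for y_hat in [0.9, 0.5, 0.1, 0.01]:
    print(f"y_hat = {y_hat:5.2f}  ->  L = {-math.log(y_hat):.3f}")
# Prints roughly 0.105, 0.693, 2.303, 4.605: each step away from the true label costs much more than the last.
```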

3. Other forms of the cross entropy loss function

What? There are other forms of the cross entropy loss function? That’s right! What I just introduced is the most typical form. Next, I will derive another form of the cross entropy loss function from a different angle.

In this form, the true sample labels are assumed to be +1 and -1, representing the positive and negative class respectively. The Sigmoid function has the following property:

1 - g(s) = g(-s)

Keep this property in mind; it will come in handy shortly.

As before, for y = +1 the following holds:

P(y=+1\,|\,x) = g(s)

For y = -1, using the Sigmoid property above, the following holds:

P(y=-1\,|\,x) = 1 - g(s) = g(-s)

Now the key point: since y is either +1 or -1, we can fold the label into the argument of the Sigmoid and merge the two cases into a single expression:

P(y\,|\,x) = g(ys)

This is not hard to see: substituting y = +1 and y = -1 recovers exactly the two expressions above.
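For completeness, here is that substitution written out as a worked step, using the Sigmoid property 1 - g(s) = g(-s) noted earlier:

\begin{aligned}
y = +1:\quad & P(y=+1\,|\,x) = g(ys) = g(s) \\
y = -1:\quad & P(y=-1\,|\,x) = g(ys) = g(-s) = 1 - g(s)
\end{aligned}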

Next, apply the log function again to obtain:

\log P(y\,|\,x) = \log g(ys)

We want to maximize this probability, which is equivalent to minimizing its negative. So the corresponding loss function can be defined as:

L = -\log g(ys)

Remember the expression for the Sigmoid function? Substituting g(ys) = \frac{1}{1+e^{-ys}} gives:

L = -\log \frac{1}{1+e^{-ys}} = \log\left(1+e^{-ys}\right)

This L is exactly the other form of the cross entropy loss function we set out to derive. For N samples, the total cross entropy loss is:

L = \sum_{i=1}^{N}\log\left(1+e^{-y^{(i)} s^{(i)}}\right)

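Here is a minimal NumPy sketch of this ±1-label form (names are my own; np.logaddexp(0, -y*s) computes log(1 + e^{-ys}) in a numerically stable way):

```python
import numpy as np

def logistic_loss(y, s):
    """Sum of log(1 + exp(-y*s)) over all samples, with labels y in {-1, +1}."""
    return np.sum(np.logaddexp(0.0, -y * s))

y = np.array([+1.0, -1.0, +1.0, -1.0])   # labels in {-1, +1}
s = np.array([2.0, -1.5, 0.3, 0.8])      # raw linear scores from the model (toy values)
print(logistic_loss(y, s))
```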
Now let’s look at this form graphically. When y = +1:

L = \log\left(1+e^{-s}\right)

At this point, the relationship between L and the linear score s (the input to the Sigmoid) is shown in the figure below:

The horizontal axis is s and the vertical axis is L. Obviously, the larger (more positive) s is, i.e. the more it agrees with the true label +1, the smaller the loss L; the more negative s becomes, the larger L gets.

On the other hand, when y = -1:

L = \log\left(1+e^{s}\right)

At this point, the relationship between L and the linear score s is shown in the figure below:

Similarly, the more negative s is, i.e. the more it agrees with the true label -1, the smaller the loss L; the more positive s becomes, the larger L gets.
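As a final check, the two forms are really the same loss written with different label conventions. The snippet below (a quick numerical comparison with toy values; names are my own) maps labels from {0, 1} to {-1, +1} and shows that both formulas give identical per-sample losses:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

s = np.array([2.0, -1.5, 0.3, 0.8])   # raw linear scores (toy values)
y01 = np.array([1.0, 0.0, 1.0, 0.0])  # labels in {0, 1}
ypm = 2.0 * y01 - 1.0                 # the same labels mapped to {-1, +1}

form1 = -(y01 * np.log(sigmoid(s)) + (1.0 - y01) * np.log(1.0 - sigmoid(s)))
form2 = np.logaddexp(0.0, -ypm * s)   # log(1 + exp(-y*s))
print(np.allclose(form1, form2))      # True: both forms give the same losses
```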

4. Summary

This article has mainly introduced the mathematical principle and derivation of the cross entropy loss function, presenting two forms of it from different angles. The first form is more common in practical applications, for example in neural networks and other complex models; the second is typically used with simple logistic regression models.