Wangjiang Artificial Think Tank

Deep learning series

Why do we need a cross entropy cost function

Humans tend to learn quickly from obvious mistakes; when our errors are less clear-cut, learning becomes much slower. Neural networks, however, do not necessarily behave this way, and in this respect they seem quite different from human learning: artificial neurons actually have a hard time learning precisely when they are badly wrong.

To understand the source of this problem, note that a neuron learns by changing its weights and bias at a rate determined by the partial derivatives of the cost function, $\partial C/\partial w$ and $\partial C/\partial b$. So when we say learning is slow, what we are really saying is that these partial derivatives are small.

According to the notation defined in the previous chapter, "Back Propagation Algorithm", for the quadratic cost function the partial derivative of the cost with respect to an output-layer weight is

$$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k\,(a^L_j - y_j)\,\sigma'(z^L_j)$$

The $\sigma'(z^L_j)$ term is what can cause learning to slow down when an output neuron gets stuck at the wrong value.

We can see why: when the output of the neuron approaches 0 or 1, the sigmoid curve becomes very flat, so $\sigma'(z^L_j)$ is very small, and therefore $\partial C/\partial w^L_{jk}$ in the equation above is also very small. This is the real reason learning is slow. And, as we will see later, this slowdown occurs for the same reason in more general neural networks, not just in this particular example.
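To see this slowdown numerically, here is a minimal sketch (not from the original article; the single-neuron setup with one input x = 1 and target y = 0 is an assumption made for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# A single sigmoid neuron with one input x = 1 and target y = 0.
# Quadratic cost for one sample: C = (a - y)^2 / 2, so dC/dw = (a - y) * sigmoid'(z) * x.
x, y = 1.0, 0.0
for w, b in [(0.6, 0.9), (2.0, 2.0)]:   # mildly wrong vs. badly wrong (saturated)
    z = w * x + b
    a = sigmoid(z)
    grad_w = (a - y) * sigmoid_prime(z) * x
    print(f"w={w}, b={b}: output a={a:.3f}, sigmoid'(z)={sigmoid_prime(z):.4f}, dC/dw={grad_w:.4f}")
```

The badly wrong, saturated neuron has the larger error but the much smaller gradient, precisely because $\sigma'(z)$ is nearly zero there.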

Note that the quadratic cost function is a perfectly good choice when linear neurons are used in the output layer. Suppose we have a multi-layer, multi-neuron network whose final output layer consists entirely of linear neurons, so the sigmoid is not applied and the output is simply $a^L_j = z^L_j$. If we use the quadratic cost function, then for a single training sample $x$ the output error is

$$\delta^L = a^L - y$$

This suggests that if the output neurons are linear then the quadratic cost no longer causes the problem of reduced learning speed. In this case, the quadratic cost function is an appropriate choice.

But if the output neurons are sigmoid neurons, we had better consider other cost functions.

Definition of cross entropy cost function

So how do we solve this problem? Research shows that we can replace the quadratic cost function with the cross entropy cost function. To understand what cross entropy is, let's change our simple example a little. Suppose we now want to train a neuron that has several inputs $x_1, x_2, \ldots$, corresponding weights $w_1, w_2, \ldots$, and a bias $b$:

The output of the neuron is $a = \sigma(z)$, where $z = \sum_j w_j x_j + b$ is the weighted sum of the inputs. We define the cross entropy cost function of this neuron as follows:

$$C = -\frac{1}{n}\sum_x \big[\, y \ln a + (1-y)\ln(1-a) \,\big]$$

where $n$ is the total number of training examples, the sum runs over all training inputs $x$, and $y$ is the corresponding target output.

For the cross entropy cost function, the output error for a single training sample $x$ is

$$\delta^L = a^L - y$$

The partial derivative of the cost with respect to an output-layer weight is

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n}\sum_x a^{L-1}_k\,(a^L_j - y_j)$$

Here the $\sigma'(z^L_j)$ term has disappeared, so the cross entropy avoids the problem of slow learning.
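To make the contrast concrete, here is a small illustrative sketch (again assuming a single sigmoid neuron with input x = 1 and target y = 0, values chosen for illustration) comparing the weight gradient under the quadratic cost and under the cross entropy when the neuron starts out badly wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid neuron, one training sample: input x = 1, target y = 0,
# but the weights start it out badly wrong (output close to 1).
x, y = 1.0, 0.0
w, b = 2.0, 2.0
z = w * x + b
a = sigmoid(z)
sigma_prime = a * (1.0 - a)

grad_quadratic = (a - y) * sigma_prime * x   # dC/dw for C = (a - y)^2 / 2
grad_cross_ent = (a - y) * x                 # dC/dw for the cross entropy

print(f"output a = {a:.3f}")
print(f"quadratic-cost gradient     = {grad_quadratic:.4f}")   # tiny: learning is slow
print(f"cross-entropy cost gradient = {grad_cross_ent:.4f}")   # large: learning is fast
```

The cross-entropy gradient is proportional to the error $a - y$ itself, so the more wrong the neuron is, the faster it learns.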

So when should we replace the quadratic cost with the cross entropy? In fact, if the output neurons are sigmoid neurons, cross entropy is generally the better choice. Why? When we initialize the weights and biases of the network, we usually use some random method. It can happen that these initial choices are badly wrong for some training input, for example the target output is 1 while the actual output is close to 0, or vice versa. If we use a quadratic cost function, this leads to a decrease in learning speed. It does not stop learning completely, because the weights continue to learn from the other samples, but it is clearly not what we want.

Application of cross entropy in classification problem

When the cross entropy loss function is applied to classification problems, each category label can only be 0 or 1, whether the problem is single-label or multi-label classification.

Application of cross entropy to single classification problem

Single-label here means that each image sample belongs to exactly one category, for example it contains only a dog or only a cat. Cross entropy is the standard loss for single-label classification:

$$L = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$$

The equation above is the loss for a single sample, where $n$ is the number of categories, $y_i$ is the one-hot label and $\hat{y}_i$ is the predicted probability for category $i$. For example, with label $(0, 1, 0)$ and prediction $(0.3, 0.6, 0.1)$, the loss is $-\log 0.6 \approx 0.51$; for a batch, the per-sample losses are averaged.
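A minimal sketch of the single-label case (the three-class logits and labels below are made up for illustration), using a softmax output and the loss above:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return e / e.sum()

def single_label_cross_entropy(y_true, y_pred):
    # y_true is one-hot, y_pred is a probability distribution over the classes
    return -np.sum(y_true * np.log(y_pred))

logits = np.array([1.0, 2.0, 0.1])        # raw network outputs for 3 classes (made up)
y_pred = softmax(logits)                  # predictions sum to 1
y_true = np.array([0.0, 1.0, 0.0])        # one-hot label: the sample is class 1

print("predictions:", np.round(y_pred, 3))
print("loss:", single_label_cross_entropy(y_true, y_pred))   # equals -log(y_pred[1])
```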

Application of cross entropy in multi-label problem

Multi-label here means that each image sample can belong to several categories at once, for example containing both a cat and a dog. The label for a multi-label sample is n-hot, unlike the one-hot label of the single-label problem.

It is worth noting that the predictions here are computed with the sigmoid function, which maps each node's output into $[0, 1]$ independently; the sum of all predicted values is no longer 1. In other words, each label is distributed independently, and the labels have no influence on each other. The cross entropy is therefore computed separately for each node, and since each node has only two possible label values, it follows a binomial (two-outcome) distribution, for which, as noted above, the calculation of entropy simplifies.

Similarly, the calculation of the cross entropy also simplifies, i.e.

$$L_j = -\big[\, y_j \log(\hat{y}_j) + (1 - y_j)\log(1 - \hat{y}_j) \,\big]$$

Note that the formula above applies to a single node only; this must be distinguished from the single-label loss.
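And a matching sketch of the multi-label case (labels and logits again made up), where each node gets an independent sigmoid and its own binary cross entropy, and the per-node losses are summed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multi_label_cross_entropy(y_true, y_pred):
    # One binary cross-entropy term per node, summed over the nodes.
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

logits = np.array([2.0, -1.0, 0.5])       # raw outputs for 3 independent labels (made up)
y_pred = sigmoid(logits)                  # each in [0, 1]; they need NOT sum to 1
y_true = np.array([1.0, 0.0, 1.0])        # n-hot label: the sample has labels 0 and 2

print("predictions:", np.round(y_pred, 3))
print("loss:", multi_label_cross_entropy(y_true, y_pred))
```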

Proof of the cross entropy cost function's derivative with respect to the weights

Definition of the cross entropy cost function:

$$C = -\frac{1}{n}\sum_x \big[\, y \ln a + (1-y)\ln(1-a) \,\big], \qquad a = \sigma(z),\quad z = \sum_j w_j x_j + b$$

Taking the partial derivative of the cost function $C$ with respect to a weight $w_j$:

$$\frac{\partial C}{\partial w_j} = -\frac{1}{n}\sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \frac{\partial \sigma(z)}{\partial w_j}$$

where, by the definition of $z$,

$$\frac{\partial \sigma(z)}{\partial w_j} = \sigma'(z)\, x_j$$

so

$$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x \frac{\sigma'(z)\, x_j}{\sigma(z)\,\big(1-\sigma(z)\big)}\,\big(\sigma(z) - y\big)$$

According to the definition of the sigmoid function,

$$\sigma'(z) = \sigma(z)\,\big(1-\sigma(z)\big)$$

Substituting this into the expression above gives

$$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x x_j\,\big(\sigma(z) - y\big)$$

The vector form is

$$\frac{\partial C}{\partial \mathbf{w}} = \frac{1}{n}\sum_x \mathbf{x}\,\big(\sigma(z) - y\big)$$

The same method gives, for the bias,

$$\frac{\partial C}{\partial b} = \frac{1}{n}\sum_x \big(\sigma(z) - y\big)$$
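As a sanity check on this result (a small sketch with a made-up toy dataset), the analytic gradient can be compared against a finite-difference estimate of the cross entropy cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, b, X, Y):
    a = sigmoid(X @ w + b)
    return -np.mean(Y * np.log(a) + (1 - Y) * np.log(1 - a))

# Tiny made-up dataset: 4 samples, 3 inputs each.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = np.array([0.0, 1.0, 1.0, 0.0])
w, b = rng.normal(size=3), 0.1

# Analytic gradient from the derivation: dC/dw = (1/n) * sum_x x * (sigma(z) - y)
a = sigmoid(X @ w + b)
grad_analytic = X.T @ (a - Y) / len(Y)

# Finite-difference check, one weight at a time.
eps = 1e-6
grad_numeric = np.zeros_like(w)
for j in range(len(w)):
    dw = np.zeros_like(w)
    dw[j] = eps
    grad_numeric[j] = (cross_entropy_cost(w + dw, b, X, Y)
                       - cross_entropy_cost(w - dw, b, X, Y)) / (2 * eps)

print("analytic:", np.round(grad_analytic, 6))
print("numeric :", np.round(grad_numeric, 6))   # the two should closely agree
```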

The meaning and origin of cross entropy

Our discussion of cross entropy has focused on algebraic analysis and code implementation. This is useful, but it leaves broader conceptual questions unanswered, such as: what exactly does cross entropy mean? Is there an intuitive way to think about it? How would we have come up with the concept in the first place?

Let's start with the last question: what might motivate us to think of cross entropy in the first place? Suppose we have discovered the learning slowdown and understood that its source is that, for the quadratic cost function, the partial derivatives with respect to the output neuron's weights and bias are

$$\frac{\partial C}{\partial w_j} = x_j\,(a - y)\,\sigma'(z), \qquad \frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)$$

The $\sigma'(z)$ term is what causes learning to slow down when the output neuron gets stuck at the wrong value. After staring at these formulas, we might be tempted to choose a cost function in which the $\sigma'(z)$ term does not appear. That is, for a single training sample $x$, we would like the cost $C = C_x$ to satisfy:

$$\frac{\partial C}{\partial w_j} = x_j\,(a - y), \qquad \frac{\partial C}{\partial b} = a - y$$

If we could choose a cost function satisfying these conditions, it would express in a simple way the intuition that the larger the initial error, the faster the neuron learns, and it would also remove the learning slowdown. In fact, starting from these conditions, let's see whether the form of the cross entropy can be derived purely from mathematical intuition. By the chain rule, we have

$$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}\,\sigma'(z)$$

Using $\sigma'(z) = \sigma(z)\,(1-\sigma(z)) = a\,(1-a)$, the last equation becomes

$$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}\,a\,(1-a)$$

Comparing this with the condition $\partial C/\partial b = a - y$, we obtain

$$\frac{\partial C}{\partial a} = \frac{a - y}{a\,(1-a)}$$

Integrating this equation with respect to $a$, we get

$$C = -\big[\, y \ln a + (1-y)\ln(1-a) \,\big] + \text{constant}$$

where the constant is the constant of integration. This is the contribution of a single training sample $x$ to the cost function. To obtain the full cost function we average over all the training samples, obtaining

$$C = -\frac{1}{n}\sum_x \big[\, y \ln a + (1-y)\ln(1-a) \,\big] + \text{constant}$$

where the constant here is the average of the individual constants. So we see that the two conditions above determine the form of the cross entropy uniquely, up to an additive constant term. The cross entropy did not appear out of thin air; it is something we could have arrived at in a natural and simple way.
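As a quick numerical check on the integration step (an arbitrary point a = 0.3 with y = 1, chosen for illustration), the derivative of the single-sample cross entropy can be compared with the expression $(a - y)/(a(1 - a))$ we started from:

```python
import numpy as np

# Check numerically that C(a) = -[y*ln(a) + (1-y)*ln(1-a)] indeed has
# dC/da = (a - y) / (a * (1 - a)), as required by the derivation above.
def C(a, y):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

y, a, eps = 1.0, 0.3, 1e-6
numeric = (C(a + eps, y) - C(a - eps, y)) / (2 * eps)
analytic = (a - y) / (a * (1 - a))
print(numeric, analytic)   # both approximately -3.333
```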

So what is the intuitive meaning of cross entropy? How should we think about it? Explaining this in depth would take us further afield than we want to go here. It is worth mentioning, however, that there is a standard way of interpreting cross entropy that comes from information theory. Roughly speaking, cross entropy is a measure of "uncertainty". In particular, our neuron is trying to compute the function $x \to y = y(x)$, but instead it computes the function $x \to a = a(x)$. Suppose we think of $a$ as the neuron's estimated probability that $y$ is 1, and $1 - a$ as the estimated probability that $y$ is 0. Then the cross entropy measures, on average, how uncertain we are when we learn the true value of $y$: the uncertainty is low when the output is what we expect, and high when it is not. Of course, I have not given a rigorous definition of what "uncertainty" really means here, so this may sound like an overstatement. But in fact, information theory gives a precise way to define uncertainty. See the section "The mathematical history of cross-entropy" below for details.

The mathematical history of cross-entropy

Generally speaking, Entropy is used to describe the uncertainty of a system. Entropy has different interpretations in different fields, such as the definition of thermodynamics and information theory.

First, give an “unrigorous” conceptual statement:

  • Entropy: can represent the self-information of an event A, that is, how much information A contains.
  • KL divergence: Can be used to indicate how different event B is from the perspective of event A.
  • Cross entropy: can be used to indicate how to describe event B from the perspective of event A.

In a word, the KL divergence can be used as a cost, and under certain conditions minimizing the KL divergence is equivalent to minimizing the cross entropy. The cross entropy is easier to compute, so we take the cross entropy as the cost.

The amount of information

First, the amount of information. Suppose we hear two pieces of news: Event A: Brazil has qualified for the 2018 World Cup finals. Event B: China has qualified for the 2018 World Cup finals. Intuitively, it is obvious that event B carries more information than event A. The reason is that event A has a high probability of occurring, while event B has a low probability. So the more unlikely an event is, the more information we gain when it happens; the more probable an event is, the less information we gain. The amount of information should therefore be related to the probability of the event.

Assume that $X$ is a discrete random variable whose set of possible values is $\mathcal{X}$, with probability distribution function $p(x) = \Pr(X = x),\ x \in \mathcal{X}$. The amount of information (self-information) of the event $X = x_0$ is defined as:

$$I(x_0) = -\log\big(p(x_0)\big)$$

Because $p(x_0)$ is a probability, its range is $[0, 1]$; over this range $-\log p$ decreases monotonically, growing without bound as $p \to 0$ and reaching 0 at $p = 1$.
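A small sketch of this definition (the probabilities 0.9 and 0.02 are made-up stand-ins for events A and B in the World Cup example):

```python
import numpy as np

def information(p):
    # Self-information of an event with probability p, in nats.
    return -np.log(p)

p_A = 0.9    # event A: very likely (e.g. Brazil qualifying) -- assumed value
p_B = 0.02   # event B: very unlikely (e.g. China qualifying) -- assumed value
print(f"I(A) = {information(p_A):.3f} nats")   # small: little information
print(f"I(B) = {information(p_B):.3f} nats")   # large: a lot of information
```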

What is Entropy?

In the context of information theory, entropy represents the expected amount of information contained in an event. Now that we have a definition of the amount of information, entropy simply expresses its expectation over all possible outcomes.

So entropy is defined as

$$H(X) = -\sum_{i=1}^{n} p(x_i)\,\log\big(p(x_i)\big)$$
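A matching sketch of entropy as the expected amount of information (the example distributions are made up):

```python
import numpy as np

def entropy(p):
    # H(X) = -sum_i p_i * log(p_i), in nats; assumes all p_i > 0.
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))        # fair coin: maximum uncertainty for 2 outcomes (~0.693)
print(entropy([0.98, 0.02]))      # nearly certain coin: low uncertainty (~0.098)
print(entropy([0.6, 0.3, 0.1]))   # a made-up 3-outcome distribution (~0.898)
```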

How to measure the difference between two events/distributions: KL divergence

What we discussed above is the self-information of an event A of a random variable $x$. If we have another, independent event B related to the same random variable $x$, how do we compute the difference between them?

Here we introduce the default method: the KL divergence, sometimes called the KL distance, which is used to calculate the difference between two distributions. The name suggests something like the distance between two points, but it is not a true distance, because the KL divergence is not symmetric. Symmetry for a distance means that the distance from A to B equals the distance from B to A.

Mathematical definition of the KL divergence:

Relative entropy is also called KL divergence. If we have two separate probability distributions P(x) and Q(x) for the same random variable X, we can use the Kullback–Leibler (KL) divergence to measure the difference between the two distributions.

Wikipedia definition of relative entropy

In the context of machine learning, DKL(P‖Q) is often called the information gain achieved if P is used instead of Q.

For discrete events, we can define the difference between events A and B as:

$$D_{KL}(A\|B) = \sum_{i} P_A(x_i)\,\log\frac{P_A(x_i)}{P_B(x_i)} = \sum_{i} P_A(x_i)\,\log\big(P_A(x_i)\big) - \sum_{i} P_A(x_i)\,\log\big(P_B(x_i)\big) \qquad (2.1)$$

For continuous events, we simply replace the sum with an integral.

It can be seen from the formula:

  • If $P_A = P_B$, i.e. the two event distributions are exactly the same, then the KL divergence is equal to 0.
  • Looking at the expanded formula, the term to the left of the minus sign, $\sum_i P_A(x_i)\log\big(P_A(x_i)\big)$, is just the negative of the entropy of event A; keep this fact in mind.
  • If we reverse the order and compute $D_{KL}(B\|A)$, the entropy of B is used instead, and we get a different answer. So the KL divergence is not symmetric: when comparing two distributions A and B there is a "frame of reference" problem.

In other words, the KL divergence is determined by the entropy of A itself together with the expectation of B taken under A. When the KL divergence is used to measure the difference between two events or distributions (continuous or discrete), the formula above computes the expected value, under A, of the logarithmic difference between A and B.
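A short sketch of formula (2.1) for two made-up discrete distributions, which also shows the asymmetry discussed above:

```python
import numpy as np

def kl_divergence(p_a, p_b):
    # D_KL(A || B) = sum_i P_A(x_i) * log(P_A(x_i) / P_B(x_i)); assumes strictly positive probabilities.
    p_a, p_b = np.asarray(p_a, float), np.asarray(p_b, float)
    return np.sum(p_a * np.log(p_a / p_b))

A = [0.7, 0.2, 0.1]   # made-up distribution A
B = [0.5, 0.3, 0.2]   # made-up distribution B

print(f"D_KL(A||B) = {kl_divergence(A, B):.4f}")
print(f"D_KL(B||A) = {kl_divergence(B, A):.4f}")   # different: the KL divergence is not symmetric
```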

KL divergence = cross entropy − entropy?

If we use the KL divergence by default to calculate the difference between two distributions, then what does the cross entropy do?

In fact, the formula for the cross entropy is very similar to that of the KL divergence; it is the second half of the KL divergence (formula 2.1). In other words, the KL divergence of A and B = the cross entropy of A and B − the entropy of A.

Compare this to the formula for the KL divergence:

$$D_{KL}(A\|B) = \sum_i P_A(x_i)\,\log\big(P_A(x_i)\big) - \sum_i P_A(x_i)\,\log\big(P_B(x_i)\big)$$

Here is the formula for entropy:

$$S(A) = -\sum_i P_A(x_i)\,\log\big(P_A(x_i)\big)$$

And this is the cross entropy formula:

$$H(A, B) = -\sum_i P_A(x_i)\,\log\big(P_B(x_i)\big)$$

Putting these together, $D_{KL}(A\|B) = -S(A) + H(A, B)$. The most important observation here is that if $S(A)$ is a constant, then minimizing $D_{KL}(A\|B)$ is equivalent to minimizing $H(A, B)$. In other words, the KL divergence and the cross entropy are equivalent under this condition.
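A brief numerical check of this identity (using two made-up distributions A and B):

```python
import numpy as np

A = np.array([0.7, 0.2, 0.1])   # made-up "true" distribution
B = np.array([0.5, 0.3, 0.2])   # made-up "model" distribution

entropy_A = -np.sum(A * np.log(A))        # S(A)
cross_AB  = -np.sum(A * np.log(B))        # H(A, B)
kl_AB     =  np.sum(A * np.log(A / B))    # D_KL(A || B)

print(f"H(A,B) - S(A) = {cross_AB - entropy_A:.6f}")
print(f"D_KL(A||B)    = {kl_AB:.6f}")     # the two values match
```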

Why can cross entropy be a cost?

And now the last point: minimizing the difference between the distribution learned by the model and the distribution of the training data is equivalent to minimizing the KL divergence between these two distributions, that is, minimizing $D_{KL}(P_{train}\,\|\,P_{model})$.

Referring to the formulas in the section above:

  • Here A is the true distribution of the training data, $P_{train}$;
  • B is the distribution learned by the model from the training data, $P_{model}$.

Conveniently, the distribution A of the training data is given and fixed. So, by the argument in the section above, because A is fixed, minimizing $D_{KL}(A\|B)$ is equivalent to minimizing $H(A, B)$, the cross entropy of A and B. This shows that the cross entropy can be used to measure the difference between the distribution learned by the model and the distribution of the training data. We learn the "best model" when the cross entropy reaches its minimum, which equals the entropy of the training-data distribution.
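A tiny sketch of this claim (with a made-up two-outcome data distribution A, scanning candidate models B over a grid): the cross entropy H(A, B) attains its minimum, equal to the entropy S(A), exactly when B matches A:

```python
import numpy as np

# For a fixed data distribution A, scan candidate model distributions B and
# verify that the cross entropy H(A, B) is minimized at B = A, where it
# equals the entropy S(A). (Toy Bernoulli distributions, purely illustrative.)
A = np.array([0.7, 0.3])
S_A = -np.sum(A * np.log(A))

best_p, best_H = None, np.inf
for p in np.linspace(0.01, 0.99, 99):
    B = np.array([p, 1 - p])
    H = -np.sum(A * np.log(B))
    if H < best_H:
        best_p, best_H = p, H

print(f"entropy S(A)      = {S_A:.4f}")
print(f"min cross entropy = {best_H:.4f} at B = [{best_p:.2f}, {1 - best_p:.2f}]")
```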

However, fitting the training-data distribution perfectly often means overfitting, because the training data are not the same as the real data; we only assume that they are similar, and we generally also assume a Gaussian-distributed error between them, which acts as a lower bound on the model's generalization error.

Therefore, when evaluating machine learning models, we should not only look at the error rate and cross entropy on the training data, but also pay attention to performance on the test data. If the model performs well on the test set, we can be more confident that it is neither over- nor under-fitted. Cross entropy also has a further advantage over the misclassification rate: it works well with many probabilistic models.

So the logic is: to make the learned model distribution as close as possible to the real data distribution, we minimize the KL divergence between the model distribution and the training-data distribution; and since the distribution of the training data is fixed, minimizing the KL divergence is equivalent to minimizing the cross entropy.

Because the two are equivalent, and the cross entropy is simpler and nicer to compute, we naturally use the cross entropy as the cost.


References

[1] Michael Nielsen. Chapter 3: Improving the way neural networks learn [DB/OL]. http://neuralnetworksanddeeplearning.com/chap3.html, 2018-06-21.

[2] Zhu Xiaohu, Zhang Freeman. Another Chinese Translation of Neural Networks and Deep Learning [DB/OL]. https://github.com/zhanggyb/nndl/blob/master/chap3.tex, 2018-06-21.

[3] Fine tuning. Why can cross-entropy be used to calculate costs? [DB/OL]. https://www.zhihu.com/question/65288314/answer/244557337, 2018-06-21.

[4] Stanley Complex. Understanding the use of cross entropy in machine learning: the intuition behind cross entropy [DB/OL]. https://blog.csdn.net/tsyccnh/article/details/79163834, 2018-06-22.