Entropy


$$
\begin{aligned}
\text{Entropy} &= \sum_i P(i)\log\frac{1}{P(i)} \\
&= -\sum_i P(i)\log P(i)
\end{aligned}
$$

The formula above is the definition of Shannon's entropy, but it is hard to get a feel for it just by staring at it, so let's work through an example.

Suppose there are four people, each with an equal probability of winning ($\frac{1}{4}$). Let's calculate the Entropy of this distribution.

import torch

a = torch.full([4], 1/4.)  # tensor([0.2500, 0.2500, 0.2500, 0.2500])
print("Entropy:", -(a*torch.log2(a)).sum())  # Entropy: tensor(2.)
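As a cross-check (a small sketch using `torch.distributions.Categorical`, which is not part of the original code), `Categorical.entropy()` computes the same quantity, but with the natural log, so the result is in nats rather than bits:

import torch

a = torch.full([4], 1/4.)
dist = torch.distributions.Categorical(probs=a)
print(dist.entropy())                                 # tensor(1.3863), i.e. ln(4) nats
print(dist.entropy() / torch.log(torch.tensor(2.)))   # tensor(2.) bits, matching the result above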

Higher entropy means the distribution is more stable and the outcome is less surprising.

Now suppose there are still four people, but the probabilities of winning become 0.1, 0.1, 0.1, 0.7. What does the Entropy become?

a = torch.tensor([0.1, 0.1, 0.1, 0.7])
print("Entropy:", -(a*torch.log2(a)).sum())  # Entropy: tensor(1.3568)

The entropy has gone down. One way to think about it: under this distribution, if you are one of the people with probability 0.1 and you are told that you won, you are more surprised than you would be in the equal-probability case.
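To make the "surprise" concrete, here is a small sketch of the self-information $-\log_2 P(i)$ of each outcome, i.e. how many bits of surprise a win carries for each person:

import torch

uniform = torch.full([4], 1/4.)
skewed = torch.tensor([0.1, 0.1, 0.1, 0.7])
print(-torch.log2(uniform))  # tensor([2., 2., 2., 2.])
print(-torch.log2(skewed))   # tensor([3.3219, 3.3219, 3.3219, 0.5146])

A person with probability 0.1 who wins carries about 3.3 bits of surprise, versus 2 bits for everyone in the uniform case; entropy is just the average of these surprises weighted by the probabilities.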

Finally, suppose the probabilities of winning become 0.001, 0.001, 0.001, 0.997. What does the Entropy become?

a = torch.tensor([0.001, 0.001, 0.001, 0.997])
print("Entropy:", -(a*torch.log2(a)).sum())  # Entropy: tensor(0.0342)

In this case the entropy is even lower, which means that under this distribution, being told that you won (as one of the people with probability 0.001) would be extremely surprising.
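Putting the three distributions side by side, here is a small helper (a sketch, not part of the original code) that treats $0\log 0$ as 0, so it also works when some probabilities are exactly zero:

import torch

def entropy_bits(p):
    # -sum p*log2(p), using the convention 0*log(0) = 0
    logp = torch.where(p > 0, torch.log2(p), torch.zeros_like(p))
    return -(p * logp).sum()

for p in [torch.full([4], 1/4.),
          torch.tensor([0.1, 0.1, 0.1, 0.7]),
          torch.tensor([0.001, 0.001, 0.001, 0.997])]:
    print(entropy_bits(p))  # tensor(2.), tensor(1.3568), tensor(0.0342)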

Cross Entropy

The Entropy of a distribution $p$ is usually denoted $H(p)$. The Cross Entropy of two distributions $p$ and $q$ is denoted $H(p,q)$ and defined as


$$
\begin{aligned}
H(p,q) &= -\sum_x p(x)\log q(x) \\
&= H(p) + D_{KL}(p\|q)
\end{aligned}
$$

where $D_{KL}$ is the Kullback-Leibler divergence (also known as relative entropy or information divergence), defined as


$$
D_{KL}(P\|Q) = -\sum_i P(i)\ln\frac{Q(i)}{P(i)}
$$

A simpler way to think about it: if you plot $P$ and $Q$ as functions, the less they overlap, the larger $D_{KL}$ is; if the two functions almost completely overlap, $D_{KL}\approx 0$. When $P=Q$, Cross Entropy is equal to Entropy.
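We can sanity-check the decomposition $H(p,q) = H(p) + D_{KL}(p\|q)$ numerically. This is a sketch with two made-up distributions, using natural logs throughout:

import torch

p = torch.tensor([0.1, 0.1, 0.1, 0.7])
q = torch.full([4], 1/4.)

H_p  = -(p * torch.log(p)).sum()      # entropy of p
H_pq = -(p * torch.log(q)).sum()      # cross entropy H(p, q)
D_kl = (p * torch.log(p / q)).sum()   # KL divergence D_KL(p || q)

print(H_pq, H_p + D_kl)               # both ≈ 1.3863, confirming H(p,q) = H(p) + D_KL(p||q)
print((p * torch.log(p / p)).sum())   # D_KL(p || p) = 0: identical distributions have zero divergence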

For a classification problem, the target $p$ is a 0-1 (one-hot) encoding, i.e. [0 0 … 1 … 0 0]. Obviously, the Entropy of this encoding is $H(p)=0$, since $1\log 1=0$ (and the remaining $0\log 0$ terms are taken to be 0). The Cross Entropy between this target $p$ and the predicted distribution $q$ is therefore


$$
\begin{aligned}
H(p,q) &= H(p) + D_{KL}(p\|q) \\
&= D_{KL}(p\|q)
\end{aligned}
$$

That is, when we optimize the Cross Entropy of $p$ and $q$, if $p$ is a 0-1 encoding, this is the same as directly optimizing the KL divergence of $p$ and $q$. As described above, the KL divergence measures how much the two distributions overlap; when it approaches 0, $p$ and $q$ get closer and closer, which is exactly the goal of our optimization.
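A quick sketch of this special case: when $p$ is one-hot, all terms with $p(i)=0$ vanish, so $H(p)=0$ and the cross entropy reduces to $-\log q(\text{true class})$:

import torch

p = torch.tensor([0., 0., 1., 0.])       # one-hot target (class 2)
q = torch.tensor([0.1, 0.2, 0.6, 0.1])   # predicted distribution

H_pq = -(p * torch.log(q)).sum()
print(H_pq)              # tensor(0.5108)
print(-torch.log(q[2]))  # tensor(0.5108), the same value: just -log of the predicted true-class probability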

Let's give an example to show that $H(p,q)$ is the target we need to optimize. Suppose there is a 5-class classification problem (you can imagine five animals), the true value is $p = [1\ 0\ 0\ 0\ 0]$ and the predicted value is $q = [0.4\ 0.3\ 0.05\ 0.05\ 0.2]$. Then


$$
\begin{aligned}
H(p,q) &= -\sum_i p(i)\log q(i) \\
&= -(1\log 0.4 + 0\log 0.3 + 0\log 0.05 + 0\log 0.05 + 0\log 0.2) \\
&= -\log 0.4 \\
&\approx 0.916
\end{aligned}
$$

Suppose that after a round of parameter updates, the predicted value changes to $q = [0.98\ 0.01\ 0\ 0\ 0.01]$. Then


$$
\begin{aligned}
H(p,q) &= -\sum_i p(i)\log q(i) \\
&= -(1\log 0.98 + 0\log 0.01 + 0\log 0 + 0\log 0 + 0\log 0.01) \\
&= -\log 0.98 \\
&\approx 0.02
\end{aligned}
$$

The Cross Entropy drops from about 0.916 to about 0.02, whereas when MSE is used as the loss it only drops by about 0.3 to 0.4, so intuitively gradient descent makes faster progress with Cross Entropy.
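Here is a numeric check of the two cross entropy values above (a sketch using the natural log; since $p$ is one-hot, we only sum over the entries where $p(i) > 0$, which also sidesteps the $0\log 0$ terms):

import torch

p  = torch.tensor([1., 0., 0., 0., 0.])
q1 = torch.tensor([0.4, 0.3, 0.05, 0.05, 0.2])
q2 = torch.tensor([0.98, 0.01, 0., 0., 0.01])

def cross_entropy(p, q):
    mask = p > 0                                  # only the true class contributes
    return -(p[mask] * torch.log(q[mask])).sum()

print(cross_entropy(p, q1))  # tensor(0.9163)
print(cross_entropy(p, q2))  # tensor(0.0202)

In practice we do not compute this by hand: PyTorch provides `F.cross_entropy`, as shown below.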

import torch
import torch.nn.functional as F

x = torch.randn(1, 784)   # [1, 784]
w = torch.randn(10, 784)  # [10, 784]
logits = x @ w.t()        # [1, 10]
pred = F.softmax(logits, dim=1)
pred_log = torch.log(pred)

# Note the difference between the arguments passed to cross_entropy and nll_loss below.
# cross_entropy() already combines softmax and log internally, so it must be given the raw logits
print(F.cross_entropy(logits, torch.tensor([3])))

# nll_loss() expects values that have already been passed through softmax and log
print(F.nll_loss(pred_log, torch.tensor([3])))
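As a follow-up note (a sketch, not part of the original code), `F.log_softmax` is the numerically more stable way to obtain `pred_log`, and feeding its output to `nll_loss` gives the same value as calling `cross_entropy` on the raw logits:

import torch
import torch.nn.functional as F

x = torch.randn(1, 784)
w = torch.randn(10, 784)
logits = x @ w.t()
target = torch.tensor([3])

log_probs = F.log_softmax(logits, dim=1)  # same as torch.log(F.softmax(...)), but more stable
print(F.nll_loss(log_probs, target))
print(F.cross_entropy(logits, target))    # prints the same value as the line above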