
9.1 Information Entropy

Information entropy is a concept from information theory. Before explaining information entropy, let's first look at what information itself is.

In 1948, Shannon proposed in his famous paper "A Mathematical Theory of Communication" that "information is something used to eliminate random uncertainty". For example, suppose we want to know what day of the week it is. Without any information, it could be any day from Monday to Sunday. But if someone tells us that today is the weekend, it can only be Saturday or Sunday; the information has eliminated the uncertainty that today might be Monday through Friday.

But the concept above is still rather vague. Is there a formula that quantifies how much information an event carries? Yes, it is the amount of information:


$$H(x) = -\log_2 p(x)$$

Shannon came up with this formula based on three properties of information:

  1. Monotonicity: the higher the probability of an event, the less information it carries. For example, "the sun rises in the east" is a certain event with probability 1, so $H(x) = 0$: it eliminates no uncertainty at all.
  2. Non-negativity: the amount of information cannot be negative.
  3. Additivity: if a system is composed of multiple independent subsystems, its amount of information should equal the sum of the amounts of information of those subsystems.
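
As a quick sanity check, here is a minimal Python sketch (not from the original article; the example probabilities are made up) showing that the formula above satisfies these three properties:

```python
import math

def self_information(p: float) -> float:
    """Amount of information (in bits) of an event with probability p."""
    return -math.log2(p)

# Monotonicity: rarer events carry more information.
print(self_information(0.5))   # 1.0 bit   (e.g. a fair coin landing heads)
print(self_information(1/7))   # ~2.81 bits (e.g. guessing the day of the week)

# A certain event carries no information.
print(self_information(1.0))   # 0.0

# Additivity: for independent events, the amounts of information add up.
p, q = 0.5, 0.25
assert math.isclose(self_information(p * q),
                    self_information(p) + self_information(q))
```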

We denote the probability of event $x_i$ by $p(x_i)$. The expectation of the amount of information over all events in a system is then:


$$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$

Some $p(x_i)$ may be 0, in which case $\log_2 p(x_i)$ is undefined; since $\lim_{p \to 0^+} p\log_2 p = 0$, we artificially define $0\log_2 0 = 0$.

This is information entropy, which represents the expected amount of information needed to eliminate a system's uncertainty, i.e., the uncertainty of the system itself.

The figure below shows the binary entropy function. When the two events are equally likely, the information entropy reaches its maximum; the same holds for $n$ events: entropy is maximized when all of them are equally likely.
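
To make this concrete, here is a minimal Python sketch (not part of the original article) that implements the entropy formula with the $0\log_2 0 = 0$ convention and checks the binary case numerically:

```python
import math

def entropy(probs):
    """Information entropy in bits, using the convention 0 * log2(0) = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Binary entropy is maximal when both outcomes are equally likely.
print(entropy([0.5, 0.5]))    # 1.0 bit
print(entropy([0.9, 0.1]))    # ~0.47 bits
print(entropy([1.0, 0.0]))    # 0.0 bits -- no uncertainty at all

# The same holds for n outcomes: the uniform distribution maximizes entropy.
print(entropy([0.25] * 4))    # 2.0 bits, the maximum for 4 outcomes
```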

The following is not relevant to decision trees, so you can skip it if you’re not interested.

9.2 KL Divergence and Cross Entropy

KL divergence is a concept in information theory used to quantify the difference between two probability distributions P and Q:


$$D_{KL}(P \| Q) = \sum_x P(x)\big(\log_2 P(x) - \log_2 Q(x)\big)$$

The greater the difference between the two distributions, the greater the KL divergence.

Note that the formula above measures how much Q differs from P with P as the reference, and in general $D_{KL}(P \| Q) \ne D_{KL}(Q \| P)$, so KL divergence is not symmetric. Using Jensen's inequality one can prove that $D_{KL}(P \| Q) \ge 0$, with equality if and only if $P = Q$.
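
A small numerical sketch (again not from the original article; the distributions P and Q are made-up examples) illustrating the asymmetry and non-negativity:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * (math.log2(pi) - math.log2(qi))
               for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]
Q = [0.5, 0.5]

print(kl_divergence(P, Q))  # ~0.53 bits
print(kl_divergence(Q, P))  # ~0.74 bits -- different: KL divergence is not symmetric
print(kl_divergence(P, P))  # 0.0        -- zero only when the distributions match
```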

Let's expand the parentheses and look at the two parts separately:


$$D_{KL}(P \| Q) = \sum_x \big(-P(x)\log_2 Q(x)\big) - \sum_x \big(-P(x)\log_2 P(x)\big)$$

The second term, $-\sum_x P(x)\log_2 P(x)$, is the information entropy of $P$; once the system (the distribution $P$) is fixed, this term is a constant. The first term, $-\sum_x P(x)\log_2 Q(x)$, is the cross entropy. The smaller the cross entropy, the smaller the KL divergence, and therefore the smaller the difference between the two distributions. That is why cross entropy can be used as a loss function in deep learning. As for the minus signs: the logarithm of a probability is non-positive, so each term becomes non-negative, and the whole expression is simply a difference of two non-negative numbers, which makes it easier to interpret.
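
Putting it together, here is a minimal sketch (not from the original article; P and Q are made-up examples) of the decomposition $D_{KL}(P \| Q) = H(P, Q) - H(P)$, where $H(P, Q)$ is the cross entropy and $H(P)$ the entropy of $P$:

```python
import math

def entropy(p):
    """H(P) in bits, with the 0 * log2(0) = 0 convention."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log2 Q(x); assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]        # "true" distribution
Q = [0.6, 0.4]        # model's predicted distribution

print(cross_entropy(P, Q))                # ~0.80 bits, the cross entropy
print(entropy(P))                         # ~0.47 bits, fixed once P is fixed
print(cross_entropy(P, Q) - entropy(P))   # ~0.33 bits, the KL divergence

# Minimizing cross entropy over Q is equivalent to minimizing D_KL(P || Q),
# since H(P) does not depend on Q -- this is why cross entropy works as a loss.
```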
