Preface

In 1948, Dr. Claude Shannon published "A Mathematical Theory of Communication", which founded information theory. Since then, information theory has flourished in the field of communication and laid a solid theoretical foundation for the success of communication engineering. But information theory is also a universal theory, and it can be regarded as the "internal power" behind the current wave of artificial intelligence and machine learning, because the essence of artificial intelligence is processing information. I believe machine learning algorithms can be understood through the lens of information theory. With this in mind, I look forward to studying information theory, I am full of respect for Dr. Shannon, and I am grateful to the scholars who have continued to extend the theory.

This article can be regarded as my study notes on information theory (written as if I were teaching the material). If there are any mistakes, please point them out. Thank you!

Entropy

Before introducing the concept of entropy, we can consider the following question:

Which experiment is more uncertain: flipping a fair coin, or rolling a fair six-sided die?

Roughly speaking, we feel that the coin toss is a little less uncertain, because the coin, after all, has only 2 outcomes while the die has 6. But how do we quantify this intuition, so that the magnitude of the uncertainty of two random variables can be compared numerically?

Shannon proposed the concept of entropy to answer this question. For discrete random events like these, the uncertainty can be measured by the discrete entropy.

Entropy is a measure of the uncertainty of a random variable. For a discrete random variable $X$, its discrete entropy is defined as:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

where the calligraphic $\mathcal{X}$ denotes the set of all values that $x$ can take, and the logarithm is base 2.

Let us apply Shannon's discrete entropy to the two experiments introduced above.

Let the random variable $X$ be the outcome of flipping a fair coin, with $X = 1$ for heads and $X = 0$ for tails. Then:

$$H(X) = -\sum_{x} p(x) \log p(x) = -2 \times \frac{1}{2} \log \frac{1}{2} = 1 \text{ bit}$$

Note: since the probabilities of heads and tails are equal, the two terms are combined to keep the layout clean.

Let the random variable $Y$ be the outcome of rolling a fair six-sided die, with $p(y) = \frac{1}{6}$ for each face. Then:

$$H(Y) = -\sum_{y} p(y) \log p(y) = -6 \times \frac{1}{6} \log \frac{1}{6} = \log 6 \approx 2.585 \text{ bits}$$

In summary, we conclude:

$$H(X) = 1 \text{ bit} < H(Y) = \log 6 \approx 2.585 \text{ bits}$$

The uncertainty of the random variable $X$ is indeed smaller than that of the random variable $Y$, which matches our intuition.
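To make this concrete, here is a minimal Python sketch of my own (not from the original post) that computes the discrete entropy of the coin and the die directly from their probability vectors:

```python
import math

def entropy(probs):
    """Discrete (Shannon) entropy in bits: H = -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair coin: two equally likely outcomes
print(entropy([0.5, 0.5]))    # 1.0 bit

# Fair six-sided die: six equally likely outcomes
print(entropy([1/6] * 6))     # log2(6) ≈ 2.585 bits
```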

Let's look a bit further: the higher the entropy of a random variable, the greater its uncertainty, or in other words, the more information the random variable carries. But what exactly is this information? For a coin flip, the information is which of the 2 outcomes (heads or tails) occurred; similarly, for a die roll, the information is which of the 6 faces landed up. So, from a computer's point of view, what is entropy?

What this means is that, in a computer, you need 1 bit to represent the outcome of a coin flip and $\log 6 \approx 2.585$ bits to represent the outcome of a die roll. In other words, **entropy is the average code length needed to describe a random variable**. Why can we say that?

According to the definition of entropy, we have:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathbb{E}\left[\log \frac{1}{p(X)}\right]$$

That is, entropy is the expectation of the function $\log \frac{1}{p(X)}$ of the random variable $X$, and $\log \frac{1}{p(x)}$ is precisely the code length assigned to the outcome $x$.
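As a quick numerical check (again a sketch of my own, with an arbitrary illustrative distribution), we can estimate the expectation $\mathbb{E}[\log \frac{1}{p(X)}]$ by sampling $X$ and averaging $\log_2 \frac{1}{p(x)}$ over the samples; the average approaches the entropy, i.e., the average code length:

```python
import math
import random

def entropy(probs):
    """Exact discrete entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def estimated_entropy(probs, n=100_000):
    """Estimate E[log2(1/p(X))] by sampling X and averaging log2(1/p(x))."""
    outcomes = list(range(len(probs)))
    draws = random.choices(outcomes, weights=probs, k=n)
    return sum(math.log2(1 / probs[x]) for x in draws) / n

probs = [0.7, 0.2, 0.1]            # illustrative distribution, chosen arbitrarily
print(entropy(probs))              # ≈ 1.157 bits
print(estimated_entropy(probs))    # ≈ 1.157 bits (Monte Carlo estimate)
```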

So far, we have seen the meaning of entropy in information theory and its physical meaning in computer coding. Finally, what is the entropy of a deterministic event? It should be zero: a deterministic event is certain, it contains no uncertainty, and in other words, it carries no information.

Mutual Information

Entropy describes the uncertainty of a single random variable. Is this value fixed? Is there a way to reduce the uncertainty? And if it can be reduced, can we quantify by how much?

Here is a short digression. In his book "The Beauty of Mathematics", Dr. Wu Jun mentioned that the essence of web search is a process of reducing uncertainty: using the user's keywords and other signals, the engine filters out irrelevant results and gets as close as possible to the user's search intent. This way of thinking reflects the "internal power" nature of information theory.

Let's look at an example of how the uncertainty of an event changes.

Now suppose I give you a coin and tell you it is fair. You flip it 100 times and report the results: 90 heads and 10 tails. You start to wonder, "Is this really a fair coin?"

From the entropy section above, we know that the entropy of this coin should be 1 bit. But after this experiment, is the entropy of the coin still 1 bit? We can now assume that the probability of heads is 0.9 and the probability of tails is 0.1, and compute the entropy:

$$H(X \mid Y) = -0.9 \log 0.9 - 0.1 \log 0.1 \approx 0.469 \text{ bits}$$

where $H(X \mid Y)$ denotes the entropy of the original coin given the fact that we observed 90 heads.

After the 100 flips, we know the coin is likely biased, and the new entropy is 0.469 bits. In other words, after learning the fact that we got 90 heads and 10 tails, the entropy of the coin has decreased by 0.531 bits. This 0.531-bit quantity of information is called the mutual information.
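Here is a small Python sketch of this calculation (my own addition), computing the entropy before and after the observation and the resulting reduction:

```python
import math

def entropy(probs):
    """Discrete entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_before = entropy([0.5, 0.5])   # assumed fair coin: 1.0 bit
h_after  = entropy([0.9, 0.1])   # after observing 90 heads / 10 tails: ≈ 0.469 bits

print(h_after)                   # ≈ 0.469
print(h_before - h_after)        # ≈ 0.531 bits of uncertainty removed
```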

Thus, we introduce the definition of mutual information:

For two random variables $X$ and $Y$ with joint distribution $p(x, y)$ and marginal distributions $p(x)$ and $p(y)$, the mutual information is defined as:

$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

At first glance, this definition of mutual information does not look quite the same as our example. In fact, this definition is written in terms of relative entropy. Let's set relative entropy aside for now and see whether the definition can be rewritten in the "reduction of information" form we just described.

Expanding the definition:

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = \sum_{x,y} p(x,y) \log \frac{p(x \mid y)}{p(x)} = H(X) - H(X \mid Y)$$

Intuitively, $H(X)$ is the amount of information in the original random variable $X$, $H(X \mid Y)$ is the amount of information in $X$ after knowing the fact $Y$, and the mutual information $I(X;Y)$ is the amount by which the original information is reduced once the fact $Y$ is known.
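The sketch below (my own, using an illustrative 2×2 joint distribution chosen arbitrarily) checks numerically that the two forms agree: mutual information computed from the definition equals $H(X) - H(X \mid Y)$.

```python
import math

log2 = math.log2

# Illustrative joint distribution p(x, y) over two binary variables
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]                              # marginal p(x)
p_y = [sum(joint[x][y] for x in range(2)) for y in range(2)]   # marginal p(y)

# Mutual information by the definition: sum p(x,y) log2( p(x,y) / (p(x) p(y)) )
mi_def = sum(joint[x][y] * log2(joint[x][y] / (p_x[x] * p_y[y]))
             for x in range(2) for y in range(2) if joint[x][y] > 0)

# Mutual information as H(X) - H(X|Y), with p(x|y) = p(x,y) / p(y)
h_x = -sum(p * log2(p) for p in p_x if p > 0)
h_x_given_y = -sum(joint[x][y] * log2(joint[x][y] / p_y[y])
                   for x in range(2) for y in range(2) if joint[x][y] > 0)
mi_reduction = h_x - h_x_given_y

print(mi_def)        # ≈ 0.278 bits
print(mi_reduction)  # ≈ 0.278 bits, the same value
```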

To describe mutual information more vividly, allow me to borrow the Venn diagram commonly seen online: the two circles represent $H(X)$ and $H(Y)$, and their overlapping region is the mutual information $I(X;Y)$.

Finally, if the random variables $X$ and $Y$ are independent, what is their mutual information? It should be 0, which means that knowing the fact $Y$ does not reduce the uncertainty of $X$. This is intuitive, because independent variables carry no information about each other.
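A quick check of this claim (my own sketch, with arbitrarily chosen marginals): for independent variables the joint factorizes as $p(x,y) = p(x)\,p(y)$, so every log term in the definition vanishes and the mutual information is exactly 0.

```python
import math

log2 = math.log2

# Independent variables: the joint is the product of the marginals
p_x = [0.3, 0.7]
p_y = [0.6, 0.4]
joint = [[p_x[x] * p_y[y] for y in range(2)] for x in range(2)]

mi = sum(joint[x][y] * log2(joint[x][y] / (p_x[x] * p_y[y]))
         for x in range(2) for y in range(2))

print(mi)   # 0.0 — knowing Y tells us nothing about X
```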

Relative Entropy

From the discussion above, we know that in information theory an isolated random variable can be quantified by its entropy, and the dependence between two random variables can be quantified by their mutual information. But how different are two random variables from each other? In other words, do the two random variables have similar distribution functions, and if not, can the difference between them be quantified?

Relative entropy is the answer to the above question. It quantifies the degree of difference between two distributions. In other words, relative entropy represents the “distance” between the two distributions.

We introduce the definition of relative entropy:

The relative entropy between two probability distributions $p(x)$ and $q(x)$ is defined as:

$$D(p \parallel q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$

One application of relative entropy: suppose the true distribution of a sample is $p(x)$, and the distribution we obtain by fitting the sample is $q(x)$. Then the degree of difference between them is $D(p \parallel q)$. Going one step further, under the true distribution $p$ the average code length needed to represent the random variable is its entropy $H(p)$ (which is equivalent to $H(X)$), while encoding with our estimated distribution $q$ costs more. Their relationship can be expressed as:

$$H(p) + D(p \parallel q) = -\sum_{x} p(x) \log q(x)$$

That is, the average code length when a code built from $q$ is used to describe samples drawn from $p$ is $H(p) + D(p \parallel q)$ bits.

So we’ve quantified the distance between the two distributions.
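The following sketch (my own, with illustrative distributions) computes the relative entropy and verifies the relationship above: entropy plus relative entropy equals the cross term $-\sum_x p(x)\log q(x)$.

```python
import math

log2 = math.log2

def entropy(p):
    """H(p) = -sum p log2(p)."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D(p || q) = sum p log2(p / q)."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """-sum p log2(q): average code length when encoding p-samples with a q-code."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]   # true sample distribution (illustrative)
q = [0.5, 0.5]   # the distribution we fit / assumed

print(kl_divergence(p, q))               # ≈ 0.531 bits of extra cost
print(entropy(p) + kl_divergence(p, q))  # ≈ 1.0
print(cross_entropy(p, q))               # ≈ 1.0, the same value
```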

Finally, we resolve the confusion left by the definition of mutual information:

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = D\big(p(x,y) \parallel p(x)\,p(y)\big)$$

From this derivation, we can see that mutual information is essentially the relative entropy between the joint distribution $p(x,y)$ and the product of the two marginal distributions $p(x)\,p(y)$, i.e., it describes their degree of difference. If that difference is 0, the two random variables are independent, which matches our intuition.
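As a final check (my own sketch, reusing the illustrative 2×2 joint distribution from earlier), computing $D\big(p(x,y) \parallel p(x)\,p(y)\big)$ directly gives the same ≈ 0.278 bits we obtained for $I(X;Y)$ above.

```python
import math

log2 = math.log2

# Same illustrative joint distribution as before
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(joint[x][y] for x in range(2)) for y in range(2)]

# Flatten both distributions and compute D( p(x,y) || p(x) p(y) )
p_joint   = [joint[x][y] for x in range(2) for y in range(2)]
p_product = [p_x[x] * p_y[y] for x in range(2) for y in range(2)]

kl = sum(pi * log2(pi / qi) for pi, qi in zip(p_joint, p_product) if pi > 0)
print(kl)   # ≈ 0.278 bits — identical to I(X; Y) computed earlier
```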

Reprinted from: zhuanlan.zhihu.com/p/36192699