The maximum entropy model explained in simple terms

0x00 Abstract

In this article, we try to explain the maximum entropy model in an understandable way, avoiding mathematical formulas as much as possible. Instead, we use perceptual, intuitive reasoning to explain the maximum entropy model from a big-picture perspective, and we pick a few concrete application scenarios from the classics to help you grasp the concept in depth.

0x01 Background Concepts

1. What is entropy?

The concept of entropy can be understood in many ways.

1.1 Understanding entropy from a physical perspective

Entropy originated in physics. The concept was first proposed by the German physicist Rudolf Clausius to describe how evenly energy is distributed in space: the more evenly the energy is distributed, the greater the entropy. That is, entropy is a measure of the state of a material system, used to characterize the degree of disorder of the system.

  • The greater the entropy, the more disordered the system, meaning that the system's structure and motion are uncertain and irregular.
  • The lower the entropy, the more orderly the system, meaning that the system has a definite and regular state of motion.

1.2 Understanding entropy in terms of system complexity

Information entropy can also be used as a measure of the complexity of a system, that is, a measure of the ordered, organized and complex state of a material system.

  • If the system is more complex, i.e., there are more distinct possible situations, then its information entropy is relatively large.

  • If the system is simpler, there are only a few possible situations (in the extreme case just one, whose probability is then 1 and whose information entropy is 0), and the information entropy is small.

The greater the entropy, the greater the uncertainty of the system, and the more possibilities of the future development of the system.

1.3 Derivation & Definition of entropy

The definition of entropy is as follows: $H(X) = -\sum_{x} p(x)\log_2 p(x)$

Here p(x) is the probability of the random event x, and H(X) is called the entropy of the random variable X. It measures the uncertainty of the random variable and is the expectation of the information produced by all possible events. The minus sign ensures that the amount of information is positive or zero.

Why base 2? All we really need is for a low-probability event x to correspond to a large amount of information, so the choice of logarithm base is essentially arbitrary. We simply follow the usual convention of information theory and use 2 as the base of the logarithm.

From the formula: the more values the random variable can take, the more states it has, the greater its information entropy, and the greater the degree of disorder.
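As a quick illustration of the formula, here is a small Python sketch (numpy assumed available) that computes entropy in bits for a few distributions; the uniform distribution comes out highest and a deterministic one comes out as zero.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))     # base-2 logarithm, so the unit is bits

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over 4 outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits: less uncertain
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits: completely certain
```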

Entropy is highest when the random distribution is uniform. Here’s a very good example of a uniform distribution with maximum entropy:

Take exam scores in a class, for example. We usually expect the scores of the whole class to follow a Gaussian distribution. If, after one test, it turns out that everyone has exactly the same score, then from the school's point of view this is breaking news: the probability of such an outcome is extremely low, yet it happened. It does not conform to the norm, so the next step should be an investigation.

1.4 A practical example of entropy as a measure of system complexity

Here’s an example from the classics:

Entropy of Li Zicheng's rebel army

For example, in history, Li Zicheng's rebel army was mainly divided into a Shaanxi gang and a Henan gang. Let's say the Shaanxi gang accounts for 2/3 and the Henan gang for 1/3.

So the specific entropy value is:

- [1/3 * log(1/3) + 2/3 * log(2/3)] = 0.63

The entropy of Liangshan

This time the division is by faction ("mountain stronghold"). For example:

  • Song Jiang faction (Hua Rong, Lei Heng, Li Kui, Qingfeng Shan, Ou Peng, Lü Fang, Guan Sheng, Hu Yanzhuo…)
  • Sanshan faction (Lu Zhishen, Wu Song, Yang Zhi, Zhang Qing, Cao Zheng…)
  • Chao Gai faction (Lin Chong, Liu Tang, the three Ruan brothers, Bai Sheng…)
  • Lu Junyi's Daming Prefecture faction (Yan Qing, the Cai brothers…)
  • Dengzhou faction (Sun Li, Sister-in-law Gu…)
  • Jieyang River faction (Li Jun, Tong Wei, Tong Meng…)
  • Independents (Xiao Rang, Jin Dajian, An Daoquan…)

Assuming that Song Jiang's faction accounts for 1/3 and each of the other six factions for 1/9, the entropy value is:

- [1/3 * log(1/3) + 6 * 1/9 * log(1/9)] = 1.83

By this measure Liangshan has the higher entropy and is therefore more unstable, so accepting amnesty from the court was a sensible choice; the rebel army, by contrast, was relatively stable and could push on through hardship to defeat Chongzhen.
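A quick check of the two entropy values above (using the natural logarithm, which is what the quoted numbers imply):

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs)   # natural log, matching the numbers above

chuangjun = [2/3, 1/3]                # Shaanxi gang, Henan gang
liangshan = [1/3] + [1/9] * 6         # Song Jiang's faction plus six smaller factions

print(round(entropy(chuangjun), 2))   # 0.64 (the 0.63 above is the same value truncated)
print(round(entropy(liangshan), 2))   # 1.83
```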

2. Maximum entropy principle

The intuition behind the maximum entropy principle is that the world tends toward disorder: business, life, steel, politics, everything eventually rots and dies!

What it describes is a fact of nature: given what we already know (the constraints), the probability distribution of the unknown should stay as close as possible to the natural state of things.

The whole development of the universe is a process of increasing entropy. Without the influence of external forces, things will always be in a more chaotic state, and if the influence of external forces is applied, the system will reach its most chaotic state possible under the constraint of external forces.

For example, when you look at a system, you should expect it to evolve naturally toward its most disordered state. One might say: I am applying an external force to this system, imposing a constraint to make it orderly. But in fact, the system still tends to be as disordered as it can be given the forces and constraints you impose. No matter how neatly you coil your earphone cable, for instance, after it has been left alone for a while you will find it in a somewhat tangled state again.

The maximum entropy principle states that when you predict the probability distribution of a random event,

  • Don’t make any assumptions about the unknown.
  • If you don’t know anything about this distribution, guess the uniform distribution with the highest entropy.
  • If you know something about the distribution, then guess the distribution with the highest entropy satisfying those conditions (i.e. the prediction should satisfy all the known constraints).

From the perspective of risk, the greater the entropy the better: retaining all of the uncertainty minimizes the risk of prediction. You can't put all your eggs in one basket. When the probability distribution is the most uniform, the prediction risk is the smallest, and that is exactly when the entropy of the distribution is the largest.
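To make the principle concrete, here is a minimal, hypothetical sketch (scipy assumed available): among all distributions over a six-sided die whose mean is constrained to 4.5, we numerically search for the one with maximum entropy. With no constraint beyond normalization, the answer would simply be the uniform distribution.

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)                    # a six-sided die

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)             # avoid log(0)
    return np.sum(p * np.log(p))           # minimizing this maximizes the entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},            # probabilities sum to 1
    {"type": "eq", "fun": lambda p: (p * faces).sum() - 4.5},  # known constraint: E[X] = 4.5
]
bounds = [(0.0, 1.0)] * 6
p0 = np.full(6, 1.0 / 6)                   # start from the uniform distribution

res = minimize(neg_entropy, p0, bounds=bounds, constraints=constraints)
print(np.round(res.x, 3))                  # probabilities tilt toward the larger faces
```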

3. Discriminative model vs. generative model

The classification problem is to determine the corresponding label y for a given data point x. There are two kinds of classification models: discriminative models and generative models.

3.1 Generative model

A generative model is so called because the idea behind it is: x is the features, y is the label, and it is the label that generates the features. For example, if the label is "elephant", then the likely features are big ears, a long trunk, and so on. The model represents the generative relationship between the input X and the output Y.

A generative model studies the joint probability distribution P(X,Y) of the observed data X and the hidden class Y, and then uses Bayes' formula to obtain the conditional probability P(Y|X), predicting the y with the largest conditional probability. In other words, in the training phase only P(X,Y) is modeled, and all the parameters of the joint distribution need to be determined. The specific process is:

  • First, the distribution of the data is learned from the training samples, and one distribution is finally chosen as the distribution of all the input data; it is a joint distribution P(X,Y) (note that X contains all the features Xi, Y contains all the labels Yi, and each label Yi needs to be modeled).
  • Then, for each new sample, P(Y|X) is computed and Y is derived; the label with the highest probability is chosen as the result, so there is no explicit decision boundary.

Let's walk through the generative model workflow again.

Learning stage, modeling:


$$P(X,Y)=P(X|Y)\,P(Y)$$

Then, based on the new X, derive Y:


$$P(Y|X)=\frac{P(X,Y)}{P(X)}$$

The advantage of the generative model is that it carries a great deal of information, which I call "god-level information": it can be used not only to infer labels but also to do other things. Generative models focus on how the data is produced. However, a generative model needs a sufficiently large amount of data to ensure the original distribution is well sampled, so it is slower. In other words, the generative model tries to simulate the true distribution of the data.

Examples of generative models: HMM / Naive Bayes. Naive Bayes needs to model the input X and the output Y at the same time to obtain the joint distribution P(X,Y), so it is a generative model. Since X is a complex object and modeling it is painful, Naive Bayes has to make strong independence assumptions and wear the "naive" hat for the rest of its life.
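As a toy illustration of this generative recipe (the data and names below are made up), the sketch estimates the joint distribution P(X,Y) by counting and then applies Bayes' rule to recover P(Y|X):

```python
from collections import Counter

# Hypothetical toy data: (feature, label) pairs.
data = [("big_ears", "elephant"), ("long_nose", "elephant"),
        ("big_ears", "elephant"), ("stripes", "zebra"),
        ("stripes", "zebra"), ("long_nose", "elephant")]

n = len(data)
joint = Counter(data)                         # counts approximate P(X, Y)

def p_y_given_x(x):
    """P(Y | X = x) obtained from the joint counts via Bayes' rule."""
    p_x = sum(c for (xi, _), c in joint.items() if xi == x) / n   # P(X = x)
    return {y: (joint[(xi, y)] / n) / p_x                          # P(X,Y) / P(X)
            for (xi, y) in joint if xi == x}

print(p_y_given_x("big_ears"))   # {'elephant': 1.0}
```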

3.2 Discriminative model

A discriminative model models the conditional probability P(Y|X) directly, that is, it trains a model of Y directly from the features X. Put simply, it gives the probability of each predicted result for a given set of feature values.

The discriminative model learns the Y (the label) of the data directly from the provided X (the features). The training process determines the parameters of the complex mapping inside the model that constructs P(Y|X).

The discriminative model builds a single model over all samples to determine the overall decision boundary. Observing the features of an input, it predicts the most likely label, and ultimately draws a clear, or relatively clear, boundary (which can be achieved by complex function mappings, decision stacking mechanisms, and so on).

In other words, it is a discriminative model because we drop the independence assumption: we can no longer give a joint probability distribution, only a posterior probability, and that is what makes it discriminative.

Examples of discriminative models: logistic regression / SVM / CRF.

The advantage of the discriminative model is that its requirements on data volume are not as strict as those of the generative approach; it is faster, and with a small amount of data its accuracy is often better.
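A minimal sketch of the discriminative recipe, assuming scikit-learn is available: logistic regression models P(Y|X) directly from the features without ever modeling how the features were generated. The slope feature below is purely illustrative, echoing the sheep/goat example later in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: the steepest slope (in degrees) the animal was seen climbing.
X = np.array([[10.0], [15.0], [20.0], [50.0], [60.0], [85.0]])
y = np.array([0, 0, 0, 1, 1, 1])         # 0 = sheep, 1 = goat

clf = LogisticRegression().fit(X, y)     # models P(y|x) directly
print(clf.predict_proba([[85.0]]))       # [[P(sheep|x), P(goat|x)]]
```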

3.3 Comparison

From the perspective of probabilistic graphical models, a directed graph expresses a deductive relationship, that is, B appears under the premise of A; such a model is also called a generative model.

An undirected graph expresses a relationship of joint compatibility, that is, it is consistent for A and B to appear together; such a model is also called a discriminative model.

The generative model is usually calculated with joint probability (since we know the premise of A, we can calculate the joint probability).

Discriminative models are usually computed using conditional probabilities (because we don't know the premise, we can only "assume" A and ask for the probability of B given A).

The goal of the generative model is to find the joint probability distribution P(X,Y) and then compute the conditional probability distribution P(Y|X) from it: P(Y|X) = P(X,Y)/P(X). Because it learns the joint distribution, it can express the distribution of the data from a statistical point of view and reflect the similarity of data of the same class. But it does not care where the boundaries between classes are.

The discriminative model directly learns the decision function Y = f(X) or the conditional probability distribution P(Y|X) from the training data. It does not care about the generative relationship between X and Y; it only cares what the output Y should be for a given input X, and it does not need to learn the joint distribution of X and Y. It cannot reflect the characteristics of the training data themselves, but it looks for the optimal classification surface between different classes, reflecting the differences between data of different classes. Because it learns P(Y|X) or f(X) directly, the data can be abstracted, and features can be defined and used at various levels, which simplifies the learning problem.

3.4 Some classic examples from the Internet

Example 1: Suppose you want to determine whether an animal is a goat or a sheep

The discriminative model's method:

Learn a single model from historical data, and then, by extracting this animal's features, predict the probability that it is a goat and the probability that it is a sheep. You only need to learn the particular differences between a sheep and a goat.

For example, goats are better at climbing: if the animal can walk on a slope of more than 45 degrees, it is a goat; otherwise it is a sheep.

This animal is walking on an 85-degree slope, so it is a goat.

The generative model's method:

Based on the characteristics of goats, first learn a goat model (slender body; horns triangular in cross-section and curved like a sickle; hair generally thick and short, mostly white, black, gray, brown, or mixed colors; tail pointing upward; bold temperament; feeds on shrub shoots).

Then learn a sheep model based on the characteristics of sheep (plump body, short head; rams have large spiral horns, ewes have no horns or only thin small ones; wool is dense, mostly white; sheep have thin, flexible lips suited to grazing very short pasture; they are gentle and timid and feed mainly on grass).

Then you extract various features from the animal in question, feed them into the goat model to see the probability, feed them into the sheep model to see the probability, and take whichever is larger.

In the first, discriminative approach you only remember the differences between a goat and a sheep. In the second, generative approach you actually learn what a sheep is and what a goat is.

Example 2: Identify which language a speech belongs to

For example, a person walks up and says something to you, and you need to identify whether she is speaking Chinese, English, or French. There are two ways you can do this:

  • Every language, you put a lot of effort into learning Chinese, English, French, etc. By learning, I mean you know what sounds correspond to what language. And then when someone comes up to you and talks to you, you can figure out what voice they’re talking about. That’s GM.
  • Instead of learning every language, you learn the differences between the language models and then categorize them. It means that I learned that there is a difference in pronunciation between Chinese and English and other languages, and I learned the difference. This is DM.

Example 3: Tracking algorithms

  • Generative model: usually learns a model that represents the target, then uses it to search the image region, minimizing the reconstruction error. It amounts to building a model that describes the target and then doing pattern matching: the region of the image that best matches the model is the target.
  • Discriminative model: treats tracking as a binary classification problem and finds the decision boundary between target and background. It does not care how the target is described, only what distinguishes the target from the background; then, given an image region, it checks which side of the boundary the region falls on.

3.5 How to Select a Model?

In general, generative models require a sufficient amount of data to ensure that the true data distribution is well sampled. The discriminative model is not as strict about the amount of data as the generative approach. But it is always case by case.

According to Bayes' formula, we know that P(X,Y) = P(X|Y)P(Y) = P(Y|X)P(X), so the generative model can also be expressed as: P(Y|X) = P(X|Y)P(Y)/P(X).

Using the formula P(Y|X) = P(X|Y)P(Y)/P(X), the first thing to solve for is P(X|Y). So some places also say that modeling P(X|Y) is GM (of course P(Y) and P(X) are also needed), while modeling P(Y|X) directly is DM. The meaning is the same: GM models P(X|Y) first, from which the joint distribution P(X,Y) = P(X|Y)P(Y) can be obtained, so it can still be said that GM models the joint distribution.

For example, let X = whether a person smokes and Y = the risk of lung cancer. Modeling P(Y|X) can be very complicated, while modeling P(X|Y) is relatively easy, because modeling P(X|Y) means looking at smoking only among the people who have lung cancer. Hundreds of millions of people smoke, but those who get lung cancer are a small minority, so it is easier to model starting from the lung cancer patients: out of 100,000 people with lung cancer, for instance, you can sample 1,000.

4. Maximum entropy model concept

The maximum entropy model is a discriminative classification model; it models the conditional distribution directly and works with its conditional entropy.

Suppose X is the input features and Y is the class label, we have some empirical knowledge, and a data set is given. A discriminative model learns a conditional probability distribution P(y|x). There are infinitely many such distributions; among this infinite set, which one is a good one? Applying the maximum entropy principle is how we pick the model.

We model what we know, and make no assumptions about what we don't know. In other words, given a set of facts (features + outputs), we choose the model that conforms to all the facts and is otherwise as uniform as possible.

  • The maximum entropy model based on the maximum entropy principle is the model that most conforms to the probability distribution in the natural state, so the model is the most likely to happen in reality.
  • The maximum entropy principle expresses equal probabilities (as uniformly distributed as possible) by maximizing entropy. “Equally likely” is not easy to manipulate, whereas entropy is a numerical indicator that can be optimized.
  • When you don’t have any constraints you just have to find the model with the highest entropy.
  • If the probability model needs to satisfy some constraints, then the maximum entropy principle is to make no subjective assumptions about unknown situations in the set of conditions that satisfy known constraints.
  • The best distribution is the one that has the highest entropy given this empirical knowledge. That is, the maximum entropy statistical model obtains the model with the largest information entropy among all the models satisfying the constraint conditions.
  • According to the assumption of “maximum entropy of ignorant information”, we carry out “equal likelihood processing” on the classified phenomena respectively, and further get the probability of all the phenomena, which can be used to predict the classification.
  • For an input X, we import X into the maximum entropy model and obtain a class Y, which is the category X should belong to under the model that most conforms to the natural law distribution (uniform distribution of all possibilities).

5. Build the maximum entropy model

5.1 Purpose

Our task is to construct a statistical model that accurately represents the behavior of a random process: given a context x, the model predicts the probability of the output y, that is, p(y|x).

The generation of a classification model is generally divided into two steps:

  • Extract a set of facts (rules) about the decision process from the samples.
  • Build a model of the classification process based on these facts.

Empirical knowledge forms the constraints, and it is introduced through feature functions. A feature function f(x,y) describes some fact (a piece of empirical knowledge) about x and y.
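A feature function is typically just an indicator. A hypothetical sketch (the function name and the 45-degree rule are illustrative, borrowed from the sheep/goat example earlier):

```python
def f_slope_goat(x, y):
    """f(x, y): 1 if the observed slope exceeds 45 degrees AND the label is 'goat', else 0."""
    return 1.0 if x.get("slope_deg", 0) > 45 and y == "goat" else 0.0

print(f_slope_goat({"slope_deg": 85}, "goat"))   # 1.0 -- the fact holds
print(f_slope_goat({"slope_deg": 85}, "sheep"))  # 0.0 -- the fact does not hold
```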

5.2 Derivation Process

The derivation is an optimization process: given a training set, i.e., given the constraints (the knowledge), we look for the conditional probability models {P(y|x)} that satisfy the constraint set.

Here we treat P(y|x) as a family of probability distributions: every x corresponds to a conditional distribution P(y|x), and each of these distributions has an entropy. Maximizing the entropy then means maximizing the sum of the entropies of all these distributions; and since every x has an empirical probability P̃(x), we also weight each entropy by it, to express which distributions matter more when maximizing.

Then feature functions are used to fold the training data into the conditional probability model as constraints. Finally, following the maximum entropy principle, among all possible probability models (probability distributions) the one with the maximum conditional entropy is selected as the final classification model. Learning a maximum entropy model is therefore a constrained optimization problem.

The original training method for the maximum entropy model is an iterative algorithm called Generalized Iterative Scaling (GIS). The principle of GIS is not complicated: the model from the n-th iteration is used to estimate the expected value of each feature on the training data. It can be roughly summarized by the following steps (a minimal code sketch follows the list):

  1. The initial model of the zeroth iteration is assumed to be uniformly distributed with equal probabilities.
  2. Use the model from the n-th iteration to estimate the expected value of each feature on the training data. If it exceeds the actual (empirical) value, decrease the corresponding model parameter; otherwise, increase it.
  3. Repeat step 2 until convergence, that is, until the feature expectations under the model match those of the training sample; the parameters obtained are then optimal.
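A minimal, illustrative GIS sketch under stated assumptions (the toy data and feature names are made up; a production implementation would also add the slack feature that keeps the per-sample feature sum constant, which this sketch omits for brevity):

```python
import numpy as np

# Hypothetical toy corpus: x is a tuple of binary context features, y is the label.
train = [
    (("sunny", "warm"), "out"),
    (("sunny", "cold"), "out"),
    (("rainy", "warm"), "in"),
    (("rainy", "cold"), "in"),
    (("sunny", "warm"), "out"),
]

labels = sorted({y for _, y in train})
# One indicator feature f_i(x, y) per (word, label) pair seen in the training data.
feats = sorted({(w, y) for x, y in train for w in x})
f_idx = {f: i for i, f in enumerate(feats)}
N = len(train)

def fvec(x, y):
    """Vector of f_i(x, y) values for all features."""
    v = np.zeros(len(feats))
    for w in x:
        if (w, y) in f_idx:
            v[f_idx[(w, y)]] = 1.0
    return v

# GIS formally requires sum_i f_i(x, y) to be a constant C; we take the maximum as C.
C = max(fvec(x, y).sum() for x, _ in train for y in labels)

# Empirical expectation of each feature under the training data.
emp = sum(fvec(x, y) for x, y in train) / N

lam = np.zeros(len(feats))          # step 1: start from the uniform model (all weights 0)

def predict(x):
    """P(y | x) under the current parameters lam."""
    scores = np.array([np.exp(lam @ fvec(x, y)) for y in labels])
    return scores / scores.sum()

for _ in range(200):                # steps 2-3: iterate until (approximate) convergence
    model = np.zeros(len(feats))    # model expectation of each feature
    for x, _ in train:
        p = predict(x)
        for j, y in enumerate(labels):
            model += p[j] * fvec(x, y) / N
    lam += np.log(emp / model) / C  # raise weights where the model undershoots, lower otherwise

print(dict(zip(labels, np.round(predict(("sunny", "warm")), 3))))
```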

5.3 Application

After we get the parameters of the model, we get the maximum entropy model.

With the model in hand, the output is the value of P(y|x): given an input x, we obtain the conditional probability of each y, and x can be assigned to a y according to these probabilities. That is how classification is done.

Note that this P(y|x) is not a simple Bayesian probability computed directly from the sample data; it is the distribution obtained by maximizing the (weighted sum of) conditional entropy.

0x03 The maximum entropy model in the classics

Let’s look at it from a risk aversion perspective.

1. Apply maximum entropy

In Outlaws of the Marsh, Song Jiang was being escorted to Jiangzhou and had to pass by Liangshan, so Wu Yong sent men to wait at the various road junctions.

Scholar Wu (Wu Yong) did not know which road Song Jiang would take, so he deployed men on all four roads ("fearing they might miss him on the way, he instructed the chieftains to wait on all four roads for his brother") and waited for Song Jiang.

The three men traveled for a whole day. At night they stopped at an inn to rest, lit a fire, and cooked some rice. Song Jiang said to the two guards, "To tell you the truth, we pass by Liangshan Marsh today. There are bold men in the stronghold there who know my name, and I fear they may come down, carry me off, and give you a fright. You and I should get up early tomorrow and take the side road, even if it means walking a few extra miles." The two men said, "Captain, how could we have known if you had not told us? We certainly must not run into them; let us take the detour." That night the plan was settled. The next day they rose at the fifth watch, and the two officers and Song Jiang left the inn, keeping to the side path. When they had gone about thirty li, a band of men came out from behind the hillside ahead. When Song Jiang saw them he could only groan. The men who came were none other than Liangshan men, and the hero at their head was Liu Tang, the Red-Haired Devil, who with thirty to fifty followers had come to kill the two guards. Zhang Qian and Li Wan dropped to their knees in a heap on the ground. Song Jiang cried, "Brother! Who are you going to kill?" Liu Tang said, "Brother, if I don't kill these two wretches, what am I waiting for?" Song Jiang said, "Don't stain your own hands; give me the knife and I will do it myself." The two guards could only wail. Liu Tang handed the knife to Song Jiang. Song Jiang took it and asked Liu Tang, "What do you mean by killing these officers?" Liu Tang said, "Our leaders at the stronghold, wishing to do you this kindness, specially sent men to Yuncheng County to break open the jail, but learned that you, brother, had never been put in prison and had not suffered. Then, fearing we might miss you on the road to Jiangzhou, we waited here. If we don't kill these two officers, what then?"

2. Not using maximum entropy

This is the example of Huarong Road in the Romance of The Three Kingdoms.

Because Zhuge Liang did not have enough soldiers, he could not apply the maximum entropy model. To hedge the risk, he could only make a targeted plan based on Cao Cao's characteristics, and he finally chose Huarong Road.

Cao Cao was a top-tier figure in Chinese history, and Zhuge Liang, resourceful almost beyond human measure, was another. One could think two steps further ahead than an ordinary person, the other three.

But this kind of planning is not for ordinary people. We ordinary people can only honestly choose the maximum entropy model.

Mr. Kongming’s plan

Guan Yunchang was standing at his side, but Kongming completely ignored him. Yunchang said, "I have followed my brother into battle for many years and have never fallen behind. Today we face a great enemy, yet the strategist gives me no assignment. What is the meaning of this?" Kongming laughed and said, "Do not take offense, Yunchang! I had wanted to trouble you to hold the most important pass, but there is something that holds me back, so I dare not send you." Yunchang said, "What holds you back? I ask to hear it." Kongming said, "In the past Cao Cao treated you very generously, and you still owe him a debt of gratitude. Today, defeated, he is bound to flee along Huarong Road; if I send you, you will surely let him pass. Therefore I dare not send you." Yunchang said, "The strategist worries too much! It is true that Cao Cao once treated me generously, but I repaid him by slaying Yan Liang and Wen Chou and lifting the siege at Baima. If I run into him today, how could I let him go?" Kongming said, "And if you do let him go, what then?" Yunchang said, "Then deal with me by military law!" Kongming said, "Very well, put it in writing." Yunchang then signed a military pledge and said, "And what if Cao Cao does not come by that road?" Kongming said, "Then I will give you a written pledge in return." Yunchang was delighted. Kongming said, "Yunchang, at the Huarong path, on the high ground, pile up firewood and set fires to raise smoke, and draw Cao Cao that way." Yunchang said, "If Cao Cao sees the smoke he will know there is an ambush. Why would he come?" Kongming laughed and said, "Have you not heard of the art of war's principle of the empty and the full? Cao Cao is skilled at using troops, and only this will deceive him. When he sees the smoke he will take it for a bluff and will surely come that way. General, do not be soft-hearted." Having received his orders, Yunchang led Guan Ping, Zhou Cang, and five hundred swordsmen to Huarong Road to lie in ambush. Xuande said, "My brother is deeply bound by old kindness. If Cao Cao really takes Huarong, I fear he will let him go." Kongming said, "Last night I watched the heavens; Cao Cao is not yet fated to die. Leaving this favor for Yunchang to repay is also a fine thing." Xuande said, "Sir, your divine foresight is rarely matched in this world."

Cao Cao’s calculation

As they marched, a scout reported, "There are two roads ahead. May I ask which road the Prime Minister will take?" Cao Cao asked, "Which road is shorter?" The scout said, "The main road is somewhat flatter, but it is more than fifty li longer. The Huarong Road path is shorter by more than fifty li, but the road is narrow and dangerous, full of pits and hollows, and hard to pass." Cao Cao sent men up the hill to look; they returned and reported, "Along the small path, smoke is rising from the hills in several places; on the main road there is no movement at all." Cao Cao ordered the vanguard to take the Huarong Road path. The generals said, "Where beacon smoke rises, there must be troops. Why go that way?" Cao Cao said, "Have you not read in the book of war: where it seems empty it is full, and where it seems full it is empty? Zhuge Liang is full of schemes; he deliberately had men raise smoke in the hills so that our army would not dare take that mountain road, while he himself lies in ambush on the main road. My mind is made up; I will not fall into his trap!" The generals all said, "The Prime Minister's brilliant calculations are beyond compare." So they took Huarong Road. The men were hungry and the horses exhausted; the burned and scalded leaned on staffs as they walked, the arrow-struck and spear-wounded dragged themselves along. Armor was soaked through and incomplete, weapons and banners in disarray. Most had been routed in panic along the Yiling road and rode bareback horses without saddles or bridles, their clothing and gear all abandoned. It was the depth of winter and bitterly cold; the hardship was beyond description.

0xEE Personal information

Thoughts on life and technology

Wechat public account: Rosie’s Thinking

If you want timely notifications of my articles, or to see the technical material I recommend, please follow this account.
