This article was first published on my personal public account, TechFlow. Original writing is not easy, so please consider following.


Today, in the 14th article of our machine learning series, we will talk about the well-known EM algorithm.

The full name of the EM algorithm is the Expectation-Maximization algorithm. EM is often counted among the top ten machine learning algorithms, which is pretty impressive. I’ve read a lot of blogs and papers, but very few of them explain the ins and outs of this algorithm and why it works, so today I’m going to try to make it as clear as I can.

EM is essentially an advanced version of maximum likelihood estimation. Remember maximum likelihood estimation? We talked about it in the article on the Bayesian model; let’s quickly review it.


Maximum likelihood estimation

Let’s say we have a coin and we want to know the probability that it comes up heads, so we run an experiment and flip it 10 times. We find 5 heads and 5 tails, so we conclude that the probability of the coin coming up heads is 50%.

On the face of it, this is a perfectly reasonable conclusion. But on closer analysis we find a problem: the experimental result is not strongly coupled to the experimental parameter. If the coin had been tampered with so that it comes up heads 60% of the time, there is still a decent chance of getting 5 heads and 5 tails in 10 flips. Likewise, if the probability of heads were 70%, there is still some chance of getting 5 heads and 5 tails. Given only this result, how can we claim the chance of heads is exactly 50%?

So what are we supposed to do? Keep doing the experiment?

Obviously, no matter how many experiments we do, we cannot solve this problem fundamentally. Since the parameter affects the probability of the outcome, we should come at it from that angle and start with the probability. The number of heads in a series of flips follows a binomial distribution, so let the probability of heads be p; the probability of tails is then 1 - p. Plugging into the binomial formula, we can compute, for any given p, the probability that 5 out of 10 flips come up heads.
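Concretely, treating the 10 flips as independent, this probability as a function of p is:

$$P(\text{5 heads in 10 flips} \mid p) = \binom{10}{5}\, p^{5} (1-p)^{5}$$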

Thus, we can get a curve like this:

So when the probability of heads is 0.5, the probability of getting 5 heads out of 10 flips is the highest. We regard the probability of heads as the parameter of the experiment, and we call the probability of the observed result the likelihood. Maximum likelihood estimation then means finding the parameter that maximizes the probability of the current experimental result.
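As a quick sanity check, here is a minimal sketch (assuming numpy and scipy are available) that evaluates this likelihood on a grid of p values and confirms that the peak sits at p = 0.5:

```python
import numpy as np
from scipy.stats import binom

# Likelihood of observing 5 heads in 10 flips, as a function of the heads probability p
p_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(k=5, n=10, p=p_grid)

best_p = p_grid[np.argmax(likelihood)]
print(f"p that maximizes the likelihood: {best_p:.2f}")  # prints 0.50
```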

In other words, we work backwards from the experimental result and the probabilities to the most likely cause, i.e. the most likely parameter, and that is the maximum likelihood estimate.

Once the principle is understood, the solution is easy to follow.

First, we need to express the probability of the experimental outcome as a function of the parameter. This function is called the likelihood function.

After we have the function, we simplify it. For example, when the experiment is repeated many times, we take the logarithm of the likelihood function, which turns the product into a sum.

Finally, we take the derivative of the simplified likelihood function, set the derivative to 0, and find the value of the parameter at the extreme point, which is the best parameter we find through the maximum likelihood estimation method.
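As a worked instance of these three steps for the coin experiment (n flips, k of them heads):

$$L(p) = \binom{n}{k} p^{k} (1-p)^{n-k}, \qquad \log L(p) = \text{const} + k \log p + (n-k)\log(1-p)$$

$$\frac{d \log L(p)}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{k}{n}$$

With k = 5 and n = 10 this gives $\hat{p} = 0.5$, matching the peak of the curve above.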


Introducing hidden variables

So that’s just the basic use of maximum likelihood estimation, but what happens if we change the problem a little bit and introduce one more variable?

Let’s look at a classic example, again, flip a coin, but if we change the conditions a little bit, the whole problem will be completely different.

This example comes from a classic paper on the EM algorithm: Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897–899. In this example, we have two coins, A and B. The probability of A coming up heads is 0.5 and that of B is 0.4. In each round, we randomly select one of the two coins for the experiment.

In each round we flipped the chosen coin five times and recorded the number of heads and tails. After 5 rounds, we got the following results:

Since we knew which coin was selected in each round, the calculation goes very smoothly. But if we remove the coin information and suppose we don’t know which coin was chosen in each round, how do we figure out the probabilities that A and B come up heads?

In the new experiment, we don’t know which coin was chosen, which means there is a variable we cannot observe. Such a variable is called a hidden variable, and its existence breaks the direct connection between the parameters and the experimental results. In this problem, we want to know the probability of heads for each coin, and to compute it we need to know which coin was used in each round. But to figure out which coin was used in each round, we need to know each coin’s probability of heads. The two quantities are intertwined and interdependent, and we know too little to untangle them directly. It’s a chicken-and-egg situation, stuck in a loop.

The EM algorithm was born to solve exactly this problem.


The EM algorithm

As we said, the hidden variables and the parameters we want are entangled in a loop, and we don’t have enough information to untangle them. If we can’t solve it directly, then we won’t; we’ll just brute-force it.

Yes, you read that right. The essence of the EM algorithm is simple and crude: since we can’t solve for the hidden variables, we don’t even try; we just assume an initial value for the parameters, look at the result, and iterate.

Let p1 be the probability of heads for coin A and p2 the probability of heads for coin B. Originally, we hoped to find, by maximum likelihood estimation, the p1 and p2 that best explain the observed results. Now we simply assume initial values and iterate:

Let’s assume p1 = 0.7 and p2 = 0.3. This choice is completely arbitrary; you could assume anything else. Now we plug p1 and p2 into the results.

For example, in the first round there are 3 heads and 2 tails. If the coin is A, the probability of this outcome is easy to compute from the assumed probability of heads: $p_1^3 (1-p_1)^2 = 0.7^3 \times 0.3^2 \approx 0.03087$ (the binomial coefficient is the same for both coins, so we can drop it). Similarly, the probability under coin B is $0.3^3 \times 0.7^2 \approx 0.01323$. We work out all the rounds the same way:
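A minimal sketch of this per-round computation; only the first round’s counts (3 heads, 2 tails) are given explicitly in the text, so that is the round used here:

```python
# Likelihood of a round's flips under each coin, using the initial guesses
# p1 = 0.7 (coin A) and p2 = 0.3 (coin B). The binomial coefficient is the
# same for both coins, so it is omitted, matching the numbers in the article.
def round_likelihood(heads: int, tails: int, p: float) -> float:
    return p ** heads * (1 - p) ** tails

heads, tails = 3, 2                            # first round: 3 heads, 2 tails
like_A = round_likelihood(heads, tails, 0.7)   # ~0.03087
like_B = round_likelihood(heads, tails, 0.3)   # ~0.01323
print(like_A, like_B)
```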

Now that we have these probabilities, we can obviously predict, based on this probability table, which coin was used in each round.

Following the maximum likelihood principle, we conclude that the coins used in each round are:

The first round is coin A

The second round is coin B

The third round is coin B

The fourth round is coin A

The fifth round is coin B

What’s the use of guessing which coin was used in each round? Quite simply, we can use these guesses to re-estimate the values of p1 and p2.

For example, coin A is guessed in rounds 1 and 4, which together contain 10 flips, of which 6 are heads and 4 are tails, so we revise p1 to 0.6. Coin B is guessed in rounds 2, 3, and 5, which contain 15 flips, of which 5 are heads and 10 are tails, so the probability of heads becomes 1/3. We can see that after just one iteration, our estimates are already closer to the true values.
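In code, this hard re-estimation step is just a ratio of counts (the head and tail totals below are the ones stated above):

```python
# Re-estimate p1 and p2 from the hard coin assignments
heads_A, tails_A = 6, 4     # rounds 1 and 4, assigned to coin A
heads_B, tails_B = 5, 10    # rounds 2, 3 and 5, assigned to coin B

p1 = heads_A / (heads_A + tails_A)   # 0.6
p2 = heads_B / (heads_B + tails_B)   # 0.333...
```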

The result is OK, but the method is still crude, and we can do better.

Improving the example

Let’s improve the calculation in the example above. The main problem is that after computing the probabilities from the assumed parameters, we directly guess, by maximum likelihood, which coin was flipped in each round. This works, but it doesn’t feel rigorous, because the guess is an all-or-nothing decision and not necessarily accurate.

Is there a better way?

In fact, there is. Instead of directly guessing which coin was chosen in a given round, we can keep the probability of each coin and use it to compute expectations, which works better. From the calculation above, we can compute the probability that each coin was chosen in each round:

Substituting these probabilities into the experimental results and computing expectations, we obtain the expectation table for p1:


In the same way, we can calculate the new p2 expectation table:

Plug in, and we get that the new p2 is 0.377.
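As an illustration of this soft version, here is a small sketch for the first round only (the full table of per-round counts is in the figure and is not reproduced here): the two likelihoods from before are normalized into probabilities, and the round then contributes its heads and tails to the two coins in proportion to them.

```python
# E step for one round: turn the two likelihoods into responsibilities,
# then split the round's heads and tails between the coins accordingly.
like_A, like_B = 0.03087, 0.01323     # first round, computed with p1 = 0.7, p2 = 0.3
resp_A = like_A / (like_A + like_B)   # ~0.70, probability this round used coin A
resp_B = 1 - resp_A                   # ~0.30

heads, tails = 3, 2
expected_heads_A = resp_A * heads     # ~2.1 heads credited to coin A
expected_tails_A = resp_A * tails     # ~1.4 tails credited to coin A
expected_heads_B = resp_B * heads     # ~0.9 heads credited to coin B
expected_tails_B = resp_B * tails     # ~0.6 tails credited to coin B
```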

After we switch from a hard guess to using probabilities in the iteration, our estimates become more accurate; in other words, we converge faster. We repeat this process until it converges, and at convergence we obtain the maximum likelihood estimates of p1 and p2. This is the essence of the whole EM algorithm.

Let’s lay out the procedure of the EM algorithm. First, we plug an arbitrary initial parameter value into the experimental results to compute the probability distribution (or values) of the hidden variables. Then we use the hidden variables to update the parameter values, and we repeat this until convergence. We can further abstract this into two steps, the E step and the M step:

In the E step, we compute the expected values (the distribution) of the hidden variables based on the currently assumed parameter values.

In the M step, we compute the maximum likelihood estimate of the parameters based on the estimated values of the hidden variables.

According to this theory, we can also improve the above process.
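Putting the E step and M step together, here is a minimal sketch of the full loop for the two-coin problem. Note that the per-round head/tail counts below are placeholders, since the article’s actual table appears in a figure that is not reproduced here; only the first round (3 heads, 2 tails) comes from the text.

```python
# A minimal EM sketch for the two-coin example.
# NOTE: the rounds below are illustrative placeholders except for the first one.
rounds = [(3, 2), (2, 3), (1, 4), (3, 2), (2, 3)]   # (heads, tails) per round

def likelihood(heads, tails, p):
    return p ** heads * (1 - p) ** tails            # binomial coefficient omitted

p1, p2 = 0.7, 0.3                                   # arbitrary initial guesses
for _ in range(20):                                 # iterate until (approximately) converged
    heads_A = tails_A = heads_B = tails_B = 0.0
    for h, t in rounds:
        # E step: responsibility of coin A for this round
        lA, lB = likelihood(h, t, p1), likelihood(h, t, p2)
        rA = lA / (lA + lB)
        heads_A += rA * h
        tails_A += rA * t
        heads_B += (1 - rA) * h
        tails_B += (1 - rA) * t
    # M step: re-estimate the parameters from the expected counts
    p1 = heads_A / (heads_A + tails_A)
    p2 = heads_B / (heads_B + tails_B)

print(p1, p2)
```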

So that’s the algorithm, and I think you can follow it intuitively. But we haven’t proved it mathematically: why does this work? Why must this method converge, and is the value it converges to really the best one? We still need to prove it mathematically.


Mathematical proof

Let’s say we have a sample set X made up of m samples, which we can write as $X = \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$. For each of these m samples there is an unobserved hidden variable $z^{(i)}$. There is also a parameter $\theta$, which is the parameter we want to find by maximum likelihood estimation. Because the model contains the hidden variable z, we cannot directly differentiate the likelihood function and find its extremum.

Let’s first write the likelihood function with the hidden variable:

$$\prod_{i=1}^{m} p(x^{(i)}; \theta) = \prod_{i=1}^{m} \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$
We want the globally optimal parameter, so we hope to find the $\theta$ that maximizes this product. Taking its log, we get the log-likelihood:

$$L(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$
Now let the probability distribution of the hidden variable $z^{(i)}$ be $Q_i(z^{(i)})$, with $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$. The equation above can then be rewritten as:

$$L(\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
At this point we seem to be stuck, but we’re not. As we wrote in a previous article, for a convex function Jensen’s inequality gives E[f(x)] >= f(E[x]): the expectation of the function is greater than or equal to the function of the expectation. The logarithm is a concave function, so we can still use Jensen’s inequality, only with the direction of the inequality reversed: log(E[x]) >= E[log(x)].

In the equation above, $Q_i(z^{(i)})$ is the probability distribution of the hidden variable, so $\sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$ is exactly the expectation of $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$ under $Q_i$. Substituting into Jensen’s inequality, we get:

$$L(\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \ge \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
The right-hand side of the inequality is much easier to handle: if we fix the distribution over z, we can easily solve for the parameter $\theta$ that maximizes it; and once we have $\theta$, we can in turn optimize over z. This method of fixing one of two variables and alternately optimizing the other is called coordinate ascent, and it is a very common technique in machine learning.

As shown in the figure above, the contour lines of the loss function wind around the optimum. With coordinate ascent we fix one coordinate at a time, optimize the other, and then alternate; in this way we can also arrive at the optimal solution.
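For readers who haven’t seen coordinate ascent before, here is a tiny sketch on a made-up concave quadratic; the objective and its closed-form coordinate updates are purely illustrative:

```python
# Coordinate ascent on f(x, y) = -(x**2 + y**2 - x*y - 2*x - 3*y).
# Each update below is the closed-form maximizer of f in one coordinate
# with the other coordinate held fixed (partial derivative set to zero).
x, y = 0.0, 0.0
for _ in range(50):
    x = (y + 2) / 2    # maximize over x with y fixed
    y = (x + 3) / 2    # maximize over y with x fixed

print(x, y)            # converges to (7/3, 8/3), the maximum of f
```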

Besides this intuition, we can also explain it mathematically.

Since what we have is an inequality, we cannot directly maximize the left-hand side; instead we keep optimizing the right-hand side so as to push the left-hand side upward. Let’s call the left-hand side of the inequality $L(\theta)$ (the log-likelihood) and the right-hand side $J(z, Q)$ (a lower bound), and then look at a figure I found on a well-known blog:

The red curve at the top of the figure is $L(\theta)$, and the curve below it is J. Each time we fix z, we can find a better $\theta$; $L(\theta)$ moves toward a higher point and eventually reaches its maximum.

Intuitively this is OK, but we need to prove it mathematically.

According to Jensen’s inequality, equality holds only when the random variable inside is a constant. Here that variable is $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$, so we set it equal to a constant c:

$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$
Since $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$, summing both sides over $z^{(i)}$ tells us that $\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = c$. Substituting this back into the equation above, we get:

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{c} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
So after this sequence of transformations, we find that the formula for $Q_i(z^{(i)})$ is in fact the posterior probability of the hidden variable given the sample and the current $\theta$. This is exactly the E step we showed above. Once $Q_i$ is determined, we take the derivative of the lower bound with respect to $\theta$, find the extremum, and obtain the $\theta$ that maximizes it; that is the M step.

So, the whole process of EM algorithm is to repeat this process until convergence.
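Written compactly in the notation above, one full iteration is:

$$\text{E step:}\quad Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$

$$\text{M step:}\quad \theta^{(t+1)} := \arg\max_{\theta} \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i^{(t)}(z^{(i)})}$$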

So how do we make sure the algorithm converges? It is actually not hard. Because in the E step we chose $Q_i$ according to the equality condition of Jensen’s inequality, the bound is tight at the current parameter, that is:

$$L(\theta^{(t)}) = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}$$
In the M step we fix $Q_i^{(t)}$ and take the derivative to obtain the maximizing parameter $\theta^{(t+1)}$, so the right-hand side evaluated at $\theta^{(t+1)}$ must be at least as large as it was at $\theta^{(t)}$. However, we are not sure that under the new $\theta^{(t+1)}$ the old distribution $Q_i^{(t)}$ still satisfies the equality condition of Jensen’s inequality, so we only get an inequality:

$$L(\theta^{(t+1)}) \ge \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \ge \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} = L(\theta^{(t)})$$
So we have shown that the likelihood function never decreases from one iteration to the next. When it converges, the value it reaches is our maximum likelihood estimate, and the corresponding parameter $\theta$ is the parameter we set out to find.


Conclusion

That wraps up the introduction to the EM algorithm. My strongest impression is that this is another algorithm built on mathematical derivation: the derivation is rigorous, the algorithm works very well, and with it we can solve many problems that look intractable at first glance. What is even more remarkable is that even if we set aside the rigorous proofs and derivations, we can still understand the idea of the algorithm intuitively. No wonder it is counted among the top ten machine learning algorithms; it really is a classic.

Finally, I wonder whether, while reading, you had the feeling that the idea behind the EM algorithm looks familiar. A sense of deja vu?

That feeling is right. If you think back to K-means, you will recall that there, too, we started by guessing, because we did not know the cluster centers, and then approached them bit by bit through iteration. Thinking a little further, you will find that the computation in K-means can be matched against the steps of the EM algorithm: with suitable modeling, K-means can be cast as an instance of the EM framework. Interested readers can explore this on their own, or look forward to a later article in this series.

That’s all for the EM algorithm. If you found this article helpful, please follow or share it; your support means a lot to me.