Even those who have studied machine learning may still have only a rudimentary understanding of MLE (Maximum Likelihood Estimation), MAP (Maximum A Posteriori estimation), and Bayesian estimation. A basic model can be built from any of these three perspectives; for example, for Logistic Regression:


MLE: Logistic Regression

MAP: Regularized Logistic Regression

Bayesian: Bayesian Logistic Regression
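
To make the contrast concrete, here is a compact sketch of the three formulations for logistic regression with parameters w (the symbols x_i, y_i, N, and λ below are generic textbook notation, not taken from the article):

```latex
% MLE: maximize the likelihood of the data
\hat{w}_{\mathrm{MLE}} = \arg\max_{w} \sum_{i=1}^{N} \log p(y_i \mid x_i, w)

% MAP: likelihood plus a prior; a Gaussian prior on w yields L2 regularization
\hat{w}_{\mathrm{MAP}} = \arg\max_{w} \Big[ \sum_{i=1}^{N} \log p(y_i \mid x_i, w) - \lambda \lVert w \rVert^2 \Big]

% Bayesian: keep the whole posterior over w and average over it at prediction time
p(y' \mid x', D) = \int p(y' \mid x', w)\, p(w \mid D)\, dw
```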


Using a practical example, this article explains the essential differences between the three in an easy-to-understand way, hoping to help readers clear away obstacles to understanding.


Hypothesis Space

What is the hypothesis space? We can think of it this way. Machine learning includes many kinds of algorithms, such as linear regression, support vector machines, neural networks, decision trees, GBDT, and so on. When we model, the first step is to select a specific algorithm, such as a support vector machine. Once we choose an algorithm, we have chosen a hypothesis space. A hypothesis space usually contains an infinite number of different solutions (or models), and what an optimization algorithm (such as gradient descent) does is select the best one or several solutions/models from it, based on the sample data. For example, if we choose to use a support vector machine, our candidate solutions/models are concentrated in the corresponding region of the space (the blue dots in the figure).
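
As a brief formal sketch (standard textbook notation, not tied to the original figure): choosing an algorithm fixes a family of candidate functions, and learning selects a member of that family.

```latex
% A hypothesis space is a set of candidate models indexed by parameters \theta
\mathcal{H} = \{ f_\theta : \theta \in \Theta \}

% Example: the family of linear classifiers behind a linear SVM
\mathcal{H}_{\mathrm{linear}} = \{ x \mapsto \operatorname{sign}(w^\top x + b) : w \in \mathbb{R}^d,\ b \in \mathbb{R} \}
```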

A specific “toy” problem


Zhang San has a math problem and wants to ask for help. After some thought, he remembers that a friend of his is a teacher in the Computer Science Department of Tsinghua University, so he decides to enlist the help of the computer science students at Tsinghua. What strategies can Zhang San use to get help?


Here, the “Tsinghua Computer Science Department” is a hypothesis space. In this hypothesis space, each student can be seen as a model.

For Zhang San, there are three different strategies to choose from.


The first strategy: MLE

The first strategy is to pick the best student in the department and ask him to solve the problem. For example, we could choose the student who has done best on the last three exams.

The general learning pipeline is divided into a “learning process” and a “prediction process”. The scheme of the first strategy can be represented in the following figure. Here, the learning process is equivalent to picking the best student from the whole department. The students’ past report cards are what we know as the training data D, and the process of selecting the best student (computing each student’s historical average and picking the highest) corresponds to MLE. Once we have found the best student, we can move on to prediction. In the prediction part, we ask him to answer the problem X′ in Zhang San’s hand, and we then obtain his solution Y′.
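
In symbols (a minimal sketch using the D, X′, Y′ notation above), the first strategy amounts to:

```latex
% Learning: pick the single model that best explains the training data D
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\, p(D \mid \theta)

% Prediction: use only that one model to answer the new question X'
Y' = \arg\max_{y}\, p(y \mid X', \hat{\theta}_{\mathrm{MLE}})
```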



The second strategy: MAP


The difference from the first strategy is that in the second strategy we also take into account the advice of the teacher, Zhang San’s friend. The teacher offers an opinion:

“Xiao Ming’s and Xiao Hua’s scores may be somewhat inflated.”

Suppose that when we rank students by their grades, the top two are Xiao Ming and Xiao Hua, in that order. If we did not consider the teacher’s evaluation, we would certainly pick Xiao Ming. However, since the teacher has made negative comments about Xiao Ming and Xiao Hua, we are likely to end up choosing the third-ranked student instead of Xiao Ming or Xiao Hua.


Let’s draw a diagram of the second strategy. The only difference from the diagram above is the addition of the teacher’s evaluation, which we call the prior. That is, we select the student we think is the best (who can be regarded as a model) according to both the students’ past performance and the teacher’s evaluation. Then we ask him to answer Zhang San’s problem X′ and obtain his answer Y′. This whole process corresponds to MAP estimation and prediction.




At this point, some readers may be wondering: “How is the teacher’s prior assessment combined with the students’ past achievements?” To answer this question, we have to introduce a very famous theorem, Bayes’ theorem, shown below. The term on the left is the quantity that MAP optimizes; by Bayes’ theorem it can be decomposed into the likelihood (the MLE part, i.e. the first strategy) and the prior (the teacher’s evaluation). The denominator is a constant with respect to the model, so we do not need to worry about it.
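
Since the figure is not reproduced here, this is the standard decomposition it refers to:

```latex
% Bayes' theorem: posterior = likelihood x prior / evidence
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

% MAP maximizes the left-hand side; p(D) does not depend on \theta and can be dropped
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)
```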


The third strategy: Bayesian

Finally, we introduce the third strategy. This strategy, as many of you can imagine, is to get everyone involved in answering Zhang San’s problem and then obtain the final answer through some kind of weighted averaging. Say there are three students, and we know nothing about them. When we ask the question, the first student answers A, the second student answers A, but the third student answers B. In this case, we can pretty much take A as the standard answer. Now consider a slightly more complicated situation. Suppose we learn from their past records that the third student has won gold medals in national olympiad competitions several times. What should we do then? Obviously, in this case, we should give the third student a bigger say than the other two.


When we apply the above idea to Zhang San’s question, we ask all the computer science students to participate in answering it, then aggregate their answers to come up with the final answer. If we know each student’s weight, the aggregation step is deterministic. But how do we obtain each student’s say (weight)? That is exactly what Bayesian estimation does!


The following diagram illustrates the whole process of Bayesian estimation and prediction. As with MAP, we know each student’s scores on the previous three exams (D) and the teacher’s prior. However, unlike MAP, our goal here is no longer to “select the best student”, but to obtain each student’s say (weight) from the observed data (D); these weights must add up to 1, so that they form a valid distribution.


To sum up, under the third strategy, given the past test scores (D) and the teacher’s evaluation (prior), our goal is to estimate the distribution of student weights, also known as the posterior distribution. So how do we estimate this distribution? This is what Bayesian estimation does, and there are many ways to do it, such as MCMC, variational methods, and so on; since this is not the focus of this article, it will not be explained further here. Interested readers can follow the later columns on Bayesian estimation. Intuitively, because we know each student’s past performance, it is easy to gauge their ability and estimate each student’s say (weight).


Once we have this distribution (that is, the weight of each student), we can then make predictions in a way that is similar to a weighted average, and those students with higher weights will naturally have more say.
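
Written out (a compact sketch in the same D, X′, Y′ notation), the third strategy keeps the whole posterior and averages over it at prediction time:

```latex
% The posterior weights form a valid distribution: they integrate (sum) to 1
\int p(\theta \mid D)\, d\theta = 1

% Prediction: a weighted average of every model's answer, weighted by its posterior probability
p(Y' \mid X', D) = \int p(Y' \mid X', \theta)\, p(\theta \mid D)\, d\theta
```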

The above is a basic explanation of MLE, MAP and Bayesian estimation. Here we try to answer two common questions.


Q: Why does MAP estimation gradually approach MLE as we observe more and more data?

We continue with the previous example of MAP (the second strategy). Here, we change the original problem a little bit. In the previous example we assumed that we could get each student’s scores from the last three tests. But here we go a step further and assume that every student’s score from the last 100 tests can be obtained.


What difference does this change make? If you think about it, it is easy to imagine. Consider two scenarios. Suppose we know that a student has done well on the past three tests, but the teacher tells us that the student’s ability is not very good; we are likely to believe the teacher, since after all it is difficult to get a comprehensive picture of a student from only three tests. On the other hand, suppose we learn that the student has ranked first in the class on all of the last 100 tests, yet the teacher still tells us that the student’s ability is not very good. How would we react? Two or three exams might be put down to luck, but it is hard to attribute being number one 100 times in a row to luck, right? We might even doubt the teacher’s character and wonder whether the teacher is deliberately slandering the student.


In other words, as we observe more and more data, our confidence in the information obtained from the data grows, and the teacher’s feedback becomes less and less important. In the limit, when we have an infinite number of data samples, the MAP estimate converges to the MLE estimate; the two become equivalent.
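
A one-line sketch of why this happens: the log-likelihood accumulates one term per sample, while the log-prior stays fixed.

```latex
% MAP objective: N likelihood terms plus a single, fixed prior term
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \Big[ \sum_{i=1}^{N} \log p(d_i \mid \theta) + \log p(\theta) \Big]

% As N grows, the sum dominates and the prior term becomes negligible,
% so the MAP estimate approaches the MLE estimate
```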

Q: Why is Bayesian estimation more difficult than MLE and MAP?

To recap, both MLE and MAP look for a single top student, whereas Bayesian estimation estimates the weight of each student. In the former case, in order to find the best student, we only need to know the “relative” ranking of the students. How should we understand this? For example, if there are three students A, B, and C in a class, and we know that A is better than B and B is better than C, then we can conclude that A is the best. We do not need to know the exact scores of A, B, and C.


But in Bayesian estimation, we have to know the absolute weight of each student, because the final answer is a weighted average of the answers given by all the students, and the weights of all the students must add up to 1 (any distribution must integrate, or sum, to 1). Suppose we know each student’s ability a1, a2, …, an; can these be used directly as weights? Obviously not. One of the easiest ways to obtain valid weights is to normalize: compute S = a1 + a2 + … + an, and use ai/S as the weight of student i. That does not look too hard; it is just an addition, right?
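
As a tiny illustration (the ability scores and answers below are made up for this sketch, echoing the three-student example above), normalizing abilities into weights and taking a weighted vote looks like this:

```python
# Toy sketch: turn raw ability scores into weights that sum to 1,
# then aggregate the students' answers by a weighted vote.
from collections import defaultdict

abilities = [3.0, 2.0, 9.0]           # hypothetical ability scores a1..an
answers   = ["A", "A", "B"]           # each student's answer to the new question

S = sum(abilities)                     # the normalizing constant
weights = [a / S for a in abilities]   # weights now sum to 1 (a valid distribution)

# Weighted vote: each answer accumulates the weights of the students who gave it
votes = defaultdict(float)
for w, ans in zip(weights, answers):
    votes[ans] += w

final_answer = max(votes, key=votes.get)
print(weights)       # [0.214..., 0.142..., 0.642...]
print(final_answer)  # "B": the strong third student outweighs the other two
```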


It is easy to see that the time complexity of this addition is O(n), where n is the number of students. If the hypothesis space contained only a few hundred students, this would not be a problem. But in practice, say we use a support vector machine as our model, and every possible solution in the hypothesis space is a “student”: how many students does this hypothesis space contain? Infinitely many! That is, you would have to perform this kind of addition over an infinite number of terms. Of course, such an addition can be written as an integral, but the problem is that the integral usually has no closed-form solution, so it has to be approximated. That is what MCMC and variational methods do.
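
A minimal sketch of the idea behind the sampling-based workaround (MCMC is one way to produce the posterior samples; the notation is generic):

```latex
% The intractable weighted average over all models in the hypothesis space
p(Y' \mid X', D) = \int p(Y' \mid X', \theta)\, p(\theta \mid D)\, d\theta

% Monte Carlo approximation with M posterior samples \theta^{(1)}, \dots, \theta^{(M)}
p(Y' \mid X', D) \approx \frac{1}{M} \sum_{m=1}^{M} p(Y' \mid X', \theta^{(m)}),
\qquad \theta^{(m)} \sim p(\theta \mid D)
```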


Here are some important take-aways:



  • Each model (algorithm) defines a hypothesis space, which generally contains an infinite number of possible solutions.

  • MLE does not take a prior into account, whereas MAP and Bayesian estimation both do.

  • MLE and MAP select the single relatively best model (point estimation), while the Bayesian method estimates the posterior distribution from the observed data and makes a group decision through that posterior; it therefore does not aim to select one best model.

  • As the number of samples tends to infinity, MAP theoretically approaches MLE.

  • Bayesian estimation is computationally expensive and is usually carried out with approximate algorithms such as MCMC or variational methods.

