Original post (my blog): The difference between Bayesian estimation, maximum likelihood estimation and maximum posteriori estimation

More deep learning resources for machine learning


The example analysis

Even those who have studied machine learning may still have a rudimentary understanding of MLE(Maximum likelihood Estimation), MAP(maximum posterior estimation), and Bayesian estimation. A basic model can be modeled from these three perspectives, for example, for Logistic Regression:

MLE: Logistics Regression

MAP: Regularized Logistics Regression

Bayesian: Bayesian Logistic Regression

Combined with practical examples, this paper explains the essential differences between the three in an easy-to-understand way, hoping to help readers clear away the obstacles in understanding.

(4) Hypothesis Space

What is the hypothesis space? We can think of it this way. Machine learning includes many kinds of algorithms, such as linear regression, support vector machine, neural network, decision tree, GDBT and so on. When we model, the first step is to select a specific algorithm such as “support vector machine”. ** Once we choose an algorithm, we choose a hypothesis space. ** In a hypothesis space, we usually have an infinite number of different solutions (or models), and what an optimization algorithm (such as gradient descent) does is to select the best one or more solutions/models, depending on the sample data. For example, if we choose to use support vector machines, that corresponds to our alternative solutions/models being concentrated in the top half (blue dots).

A specific “toy” problem

“Zhang SAN has a math problem and wants to ask for help. After some thinking, I found that my friend was a teacher in the Computer department of Tsinghua University. So he decided to enlist the help of computer science students at Tsinghua. So what strategies does Joe use to get help?

Here, “Tsinghua Computer Department” is a hypothetical space. In this hypothetical space, each student can be seen as a model.

For John, there are three different strategies he can choose from.

The first strategy: MLE

The first strategy is to pick the best student from the department and ask him to solve the puzzle. For example, we could choose the students who have done best in the last three tests.

** The general learning process is divided into “learning process” and “prediction process”. The scheme of the first strategy can be represented in the following figure. ** Here, the learning process is equivalent to selecting the best students from all the departments. So, the “student past report card” here is what we know as training data D, and the process of selecting the best students (calculating the historical average and selecting the highest) is called the MLE. Once we find the best students, we can go into the prediction. In the prediction part, we can ask him to answer The puzzle X ‘in Joe’s hand, and then we can get his solution Y’.

The second strategy: MAP

The difference with the first strategy is that in the second strategy we take the advice of the teacher, who is Zhang SAN’s friend. The teacher gave his own opinion: “There may be some moisture in xiao Ming’s and Xiao Hua’s grades.” _ When we rank students according to their grades, assuming that the top two are Xiao Ming and Xiao Hua in turn, if we do not consider the teacher’s evaluation, we will definitely target Xiao Ming. However, since the teacher has made some negative comments on Xiaoming and Xiaohua, then at this time, we are likely to end up choosing the third in the class, rather than Xiaoming or Xiaohua.

Let’s draw a diagram of the second strategy. The only difference from the diagram above is the addition of the teacher’s evaluation, which we call Prior. That is to say, we select a student who we think is the best according to the student’s previous performance and teacher’s evaluation (it can be regarded as a model). Then you can ask him to answer Mr. Zhang’s puzzle X ‘and get his answer Y’. The whole process is similar to MAP estimation and prediction.

Prior to this point, some readers may wonder: “How do teachers’ evaluations correlate with students’ past achievements?” . To answer this question, we have to introduce a very famous theorem called bayes’ theorem, which is shown below. The term on the left is the part of the MAP that needs to be optimized, which can be decomposed by Bayes’ theorem into MLE (the first strategy) and Prior, the teacher’s evaluation. In this case, the denominator is Constant, so you don’t have to worry about it.

The third strategy — Bayesian

Finally, we introduce a third strategy. As many of you can imagine, ** is basically asking everyone to participate in answering Joe’s puzzle, but in the end we use some weighted averaging to get the final answer. ** Let’s say there are three students and we don’t know anything about these three students. By asking the question, the first student answered A, the second student answered A, but the third student answered B. In this case, we can pretty much take A as the standard answer. Then consider a slightly more complicated situation. Suppose we learn from their past performance that the third student has won gold MEDALS in national Olympic Games several times. What should we do then? Obviously, in this case, we must give the third student more voice than the other two students.

When we apply the above idea to Zhang SAN’s question, we actually ask all the computer science students to participate in answering this question, and then aggregate their answers and come up with the final answer. If we know the weighting of each student, the process of summarizing is deterministic. But how to get the voice (weight) of each student? That’s what Bayesian estimation does!

The following diagram illustrates the whole process of Bayesian estimation and prediction. Similar to MAP, we know each student’s previous three exam scores (D) and their teacher’s Prior. However, different from MAP, ** our goal here is no longer “to select the best students”, but to obtain the voice of each student (weight) through the observation data (D). ** And all these weights should add up to 1, which is equivalent to a valid distribution.

To sum up, under the third strategy, given past test scores (D) and teacher evaluations (Prior), our goal is to estimate the Distribution of student weights, also known as Posterior Distribution. So how do we estimate this distribution? This part is what Bayesian estimation does, and there are many ways to do it, such as MCMC, Variational Method, etc., but this is not the focus of this article, so it will not be explained further here. Interested readers can focus on Bayesian estimation later in the column. From the perspective of intuition, because we know the past performance of each student, it is easy for us to know their ability level and estimate the power of speech (weight) of each student.

Once we have this distribution (that is, the weight of each student), we can then make predictions in a way that is similar to a weighted average, and those students with higher weights will naturally have more say.

The above is a basic explanation of MLE, MAP and Bayesian estimation. Here we try to answer two common questions.

Q: As we observe more and more data, MAP estimation gradually approaches MLE.

We continue with the previous example of MAP (the second strategy). Here, we change the original problem a little bit. In the previous example we assumed that we could get each student’s scores from the last three tests. But here we go a step further and assume that every student’s score from the last 100 tests can be obtained.

So what kind of change will this change? If you think about it, it’s pretty easy to imagine. Let’s imagine two of these scenarios. Suppose we know that a student has done well in the past three tests, but the teacher tells us that the student’s ability is not very good, then we are likely to believe the teacher, after all, it is difficult to get a comprehensive understanding of a student from only three tests. On the other hand, suppose we learn that the student has been number one in the class on all the last 100 tests, but at the same time the teacher tells us that the student’s ability is not very good, how would we react? Two or three exams may be considered luck, but it’s hard to equate being number one 100 times in a row with luck, right? We may even doubt the teacher’s moral character, is it deliberately slander people?

In other words, as we observe more and more data, the greater the confidence in the information we obtain from the data, the less important the teacher’s feedback becomes. Ideally, when we have an infinite number of data samples, the MAP approximates the MLE estimate, which is the same thing.

Q: Why is Bayesian estimation more difficult than MLE and MAP?

To recap, both MLE and MAP are looking for a top student. Bayesian estimation estimates the weight of each student. In the first case, in order to find the best students, we only need to know the “relative” level of excellence among students. How do I understand this? For example, if there are three students A,B and C in A class, we know that student A is better than B and B is better than C, then we can infer that student A is the best. We do not need to clearly know the scores of A and B…..

But in the Bayesian estimation model, we have to know the absolute weight of each student, because the final answer we get is a weighted average of the answers given by all the students, and the weight of all the students must add up to 1(the integral sum of any distribution must be equal to 1). Suppose we know each student’s ability, a1, A2,…. Student: An, can this be a weight? Obviously not. One of the easiest ways to get the weights is to sum them up and then get the weights. For example, a1+… Plus an is equal to S, and use a1/S as the weight. That doesn’t look too hard, but it’s just an addition, right?

We can easily see that the time complexity of this addition operation is O(n), depending on the number of students in the population. If we had a hypothetical space of a few hundred students, this wouldn’t be a problem. But in practice, let’s say we’re using a support vector machine for our model, and every possible solution in our hypothesis space is a student, how many students are there in this hypothesis space? Countless!! That is, you have to do this kind of addition on an infinite number of numbers. Of course, this kind of addition can exist in the form of an integral, but the problem is that integrals usually don’t have a closed-form solution. You have to approximate it. That’s what MCMC or Variational methods do.

Here are some important take-aways:

  • Each model defines a hypothesis space, which generally contains an infinite number of possible solutions.
  • Prior is not considered in MLE, whereas prior is considered in MAP and Bayesian estimation.
  • MLE and MAP are the relatively best models (Point estimation), while Bayesian method estimates posterior distribution through observation data and makes group decisions through posterior distribution. Therefore, the latter does not aim to select a best model.
  • When the number of samples is infinite, MAP theoretically approximates MLE.
  • The complexity of Bayesian estimation is large, which is usually approximated by MCMC and other approximate algorithms.

Post a summary at the end:


The theoretical analysis

one Machine learning

The core idea is to learn rules from past experience, so as to predict new things. For supervised learning, the more useful the sample, the more accurate the training.

The process of machine learning and its knowledge are shown as follows:

To put it simply:

  1. First, we define our Model assumptions, such as linear classification, linear regression, logistic regression, SVM, and deep learning networks.
  2. How do we measure the quality of our models? Define loss functions (objective functions), lost functions, such as Square Loss
  3. How to optimize the hypothetical model, andoptimizationProcess. To put it simply, it is to choose an algorithm (such as gradient descent, Newton method, etc.) to optimize the objective function and finally get the optimal solution.
    1. Different models use different algorithms. For example, logistic regression is usually solved by gradient descent, neural network is solved by reverse derivation, and Bayesian model is solved by MCMC.
    2. Machine learning = Model + Optimization (different algorithms)
  4. Another question, how do you measure the complexity of your model? Because complex models are prone to overfitting. The solution to overfitting is to add regularization.
  5. So how do we know that the solution is really good? Verify this with ** cross-validation **.

two ML vs MAP vs Bayesian

  1. ML (maximum likelihood estimation) : it is given a model parameters, and then try to maximize p (D | parameters). That’s the probability of seeing the sample set given the parameters. The goal is to find the parameter that maximizes the probability ahead.
    1. Logistic regression is based on ML;
    2. Disadvantages: Will not add our prior knowledge to the model.
  2. The MAP (maximum a posteriori estimation) : maximize p | D (parameters).
  3. Bayesian: Our prediction takes into account all possible parameters, that is, the parameter space (the distribution of parameters).
  • Both ML and MAP belong to the same category, called freqentist, and the ultimate goal is the same: find an optimal solution, and then use the optimal solution to make predictions.

3. ML

We need to maximize p (D | parameters), we usually can use this part of the optimization to get the derivative is set to 0. However, ML estimation does not take prior knowledge into account, and it is easy to cause over-fitting phenomenon.

For example, in the estimation of cancer, a doctor may see 100 patients a day, but only 5 patients will be diagnosed with cancer. Under the MODEL of ML estimation, we get a probability of getting cancer of 0.05.

This is obviously unrealistic, because we know from experience that the probability is much lower. However, ML estimation does not incorporate this knowledge into the model.

Four. MAP

Through the above deduction, we can find that the biggest difference between MAP and ML lies in the p(parameter) item, so it can be said that MAP can solve ML’s shortcoming of lacking prior knowledge, and optimize the loss function after adding prior knowledge.

The p(parameter) term actually acts as a regularization. For example, if p(parameter) is assumed to follow the Gaussian distribution, it is equivalent to adding a L2 norm; If p is assumed to obey the Laplace distribution, it is equivalent to adding an L1 norm.

Five. Bayesian

Again: ML and MAP will only give an optimal solution, whereas Bayesian models will give a distribution of parameters, such as the parameters of the model, assuming the parameter space has parameters 1, 2, 3… Parameter N, what the Bayesian model learns is the importance of these parameters (that is, the distribution), and then when we predict the new sample, we will ask all models to predict together, but each model will have its own weight (the weight is the learned distribution). The final decision is made by all the estimators according to their weights.

However, ensemble of the model has the great advantage that it can reduce variance. This investment is very similar. For example, when we invest in various types of stocks, the risk will always be lower than when we invest in a certain stock.

Six. Frequentist and Bayesian. What is the difference between frequentist and Bayesian?

Let’s conclude with a simple example. Let’s say you’re the monitor of your class, and you have a question you want to know the answer to, you can ask all the students in your class. One option is to ask a classmate who is a top student. The other option is to ask all the students, and then aggregate the answers, but when you aggregate the answers, you give each student a weight based on how good or bad they are. The idea of the first scheme is similar to ML and MAP, while the second scheme is similar to Bayesian model.

Seven. The difficulties of the Bayesian

So the core technology in the field of bayesian is to approximate calculation of p (\ theta | D), we call it a bayesian inference, to put it bluntly, the core of the problem here is to approximate the complicated integral (integral), a solution is to use monte carlo algorithm. For example, if I want to calculate the average height of all employees in a company, the simplest and most rude way is to ask the administrative staff to measure them one by one and then calculate the average. But if you want to calculate the average height of all Chinese people, how do you do it? (Obviously, individual measurements are impossible.)

The sampling. We measured the height of randomly selected people, and then estimated the number of reviewers across the country based on their height. Of course, the more accurate the number of samples, the more representative the data sampled, the more accurate. That’s the idea of the Steward of the Monte Carlo algorithm.

Example:

Suppose we don’t know PI, but want to calculate the area of a circle. It can also be approximated by sampling. Add some points randomly to the square shown below, and remember that the number of points falling into the red region is N1 and the number falling into the white region is N2. Then the area of the quarter circle is N1 /(N1 +n2). — Monte Carlo idea

So, how do we evaluate continuous functions? Sample n more data and approximate the final integral value.

Suppose we want to calculate the expected value of f(x), and we also have a p(x) distribution, then we can keep sampling from the p(x) distribution, such as x1,x2… Xn, and then you take those samples and you compute f of x, so you end up with f of x1 plus f of x2, plus f of xn divided by n

However, the samples mentioned in the above example are independent. That is, each sample is independent from other samples and does not affect the sampling between each other. However, in reality, sometimes we want to speed up the sampling of valid samples. This is about optimizing the sampling process, and it’s a big topic in machine learning.

Again, using the sampling method mentioned above we can approximate complex integrals, we can calculate the area of a circle, we can calculate the average height of the national population. However, this sampling method is independent. Sometimes, we hope to use fewer samples to approximate a certain target more accurately. Therefore, the research in the field of sampling is to study how to optimize the whole sampling process and make the process more efficient.

MCMC sampling method, also known as Markov Chain Monte Carlo sampling method, means that each sample is correlated with each other.

But MCMC algorithms need to be calculated on the entire data set. In other words, in order to get a sample, you need to iterate through all the data. So when N is large, it obviously doesn’t apply. And the main reason that limits the development of Bayesian method is the high computational complexity. Therefore, the most important question for people in the Field is: how to optimize the sampling so that it can learn bayesian models in the context of big data?

An example of reducing iteration complexity:

For logistic regression, there is batch gradient descent method (that is, the whole data set is used to update parameters) when using gradient descent method to update parameters. In order to reduce computational complexity, people use stochastic gradient descent method, that is, randomly select samples from the data set to update parameters.

So, can we apply this idea to MCMC sampling?

Yes! Langevin Dynamic (one of the MCMC algorithms) and stochastic Optimizaiton (such as stochastic gradient descent) can be used together. In this way, we can sample with a small number of samples, and the efficiency of sampling is not dependent on N, but on M, which is much less than N.


reference

[1] Greedy technology. Machine learning in the MLE, MAP and bayesian estimation [DB/OL]. https://zhuanlan.zhihu.com/p/37543542, 2018-06-20.

[2] Quack girl. Bayesian thoughts and the maximum likelihood estimation, the difference between the maximum a posteriori estimation [DB/OL]. http://www.cnblogs.com/little-YTMM/p/5399532.html, 2018-06-20.