Maximum likelihood estimation is one of the most common parameter estimation methods in machine learning. The modeling process needs a likelihood function that describes the probability of the observed real data occurring under different model parameters; the likelihood function is a function of the model parameters. Maximum likelihood estimation finds the optimal parameters that maximize the probability of the observed data, so that the statistical model matches the real data as closely as possible.

The mathematical derivation behind linear regression and the least squares method

We use a betting example to simulate the probabilistic reasoning process of machine learning. Suppose you take part in a betting game: you watch a coin being flipped 10 times and observe which flips come up heads and which come up tails. You then get a single chance to bet on the outcome of the next toss, and a correct guess wins you $100. What do you decide at this point?

Probability and likelihood

Generally, a coin has two sides, and if the coin is evenly made, the probability that it comes up heads on any given flip is 0.5. With such a coin, you would likely get about 5 heads in 10 flips. But suppose someone has tampered with the coin in advance so that it comes up heads every time; you now flip it 10 times and get 10 heads. You will not guess tails on the next toss, because after seeing 10 heads in a row you no longer intuitively believe that this is a normal coin. Now, if someone flips a coin 10 times and gets 6 heads and 4 tails, how do you estimate the probability that the next flip comes up heads?

Since we did not make the coin, we do not know whether it is perfectly uniform, so we can only infer the coin's construction from our observations. Suppose the coin has a parameter θ that determines how fair it is: θ = 0.5 means the coin is fair, with a probability of 0.5 of coming up heads on each flip, and θ = 1 means the coin always comes up heads, with a probability of 1 of heads on each flip. The process of inferring the coin's construction parameter θ from the observed heads and tails is a parameter estimation process.

Probability

10 flips of a coin can produce different outcomes: "5 heads and 5 tails", "4 heads and 6 tails", "10 heads and 0 tails", and so on. If we know how the coin is constructed, i.e., we are given the coin's parameter θ, then the probability of "6 heads and 4 tails" is:

$$P(\text{6 heads, 4 tails} \mid \theta) = \binom{10}{6}\,\theta^{6}\,(1-\theta)^{4} = 210\,\theta^{6}\,(1-\theta)^{4} \qquad \text{(Formula 1)}$$
Formula 1 is a probability function: it gives the probability of the fact "6 heads and 4 tails" occurring given the parameter θ. Different values of θ mean different probabilities of this outcome occurring. A probability function is generally denoted by P or Pr.

In this calculation, you have to count the ways of choosing which 6 of the 10 flips come up heads, using combinations. "6 heads and 4 tails" can arise from many different orderings, such as heads-heads-tails-tails-..., tails-tails-heads-heads-..., or heads-tails-heads-tails-..., and there are C(10, 6) = 210 combinations in total for choosing the 6 heads among the 10 flips. If the probability of heads on each flip is 0.6, then the probability of tails is (1 − 0.6). Each flip is independent of the others, so the probability of "6 heads and 4 tails" is the product of the probabilities of the individual flips, multiplied by the 210 combinations.
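As a quick check of Formula 1, here is a minimal Python sketch (the value θ = 0.6 is just the one used in the example above):

```python
from math import comb

theta = 0.6            # assumed probability of heads on a single flip
n_heads, n_tails = 6, 4

# Formula 1: C(10, 6) * theta^6 * (1 - theta)^4
combinations = comb(10, n_heads)
probability = combinations * theta**n_heads * (1 - theta)**n_tails

print(combinations)    # 210
print(probability)     # about 0.251
```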

Probability reflects how likely an outcome is to happen, given the underlying causes.

Likelihood

Unlike probability, likelihood starts from a known result and reasons backwards about the cause. Specifically, the likelihood function starts from the observed data and measures how close the statistical model comes to the real observed data when the parameter θ takes different values. This is very much like the bet at the beginning: you have been shown a series of heads and tails, you do not know how the coin is constructed, and before betting on the next toss you have to work backwards from what you have already seen to infer the coin's construction. For example, when we observe the fact "10 heads and 0 tails", it is highly likely that the coin comes up heads every time; when we observe the fact "6 heads and 4 tails", we might guess that the coin is not even, with a probability of 0.6 of heads on each flip.

The likelihood function is calculated in a very similar way to the probability function above, but unlike the probability function, the likelihood function is a function of θ; that is, θ is unknown. The likelihood function measures how likely the true observed data is to occur under different values of the parameter θ. The likelihood function is usually the joint probability of multiple observations, that is, the probability that all of the observations occur together. The probability of a single observation occurring is P(θ). If the observations are independent of each other, the probability of all of them occurring can be expressed as the product of the probabilities of the individual samples. A brief note on the relationship between independence and joint probability: if event A and event B are independent of each other, then the probability that A and B both occur is the probability of A times the probability of B. By contrast, the events "it rains" and "the ground is wet" are not independent of each other: they happen together and are highly correlated, so their joint probability cannot be expressed as the product of the individual probabilities. Two coin flips do not affect each other, so the probability of a particular sequence of heads and tails can be expressed as the product of the individual flip probabilities.
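As a small illustration of the independence argument, a minimal sketch using the same assumed θ = 0.6 as before:

```python
theta = 0.6  # assumed probability of heads

# Two independent flips: the joint probability is the product of the
# individual probabilities.
p_heads_heads = theta * theta        # P(heads, then heads) = 0.36
p_heads_tails = theta * (1 - theta)  # P(heads, then tails) = 0.24

# Probability of one specific ordering of 6 heads and 4 tails.
p_one_ordering = theta**6 * (1 - theta)**4

print(p_heads_heads, p_heads_tails, p_one_ordering)
```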

Likelihood functions are usually denoted by L. When the coin parameter θ takes different values, the likelihood function can be expressed as:

$$L(\theta \mid \text{6 heads, 4 tails}) = \binom{10}{6}\,\theta^{6}\,(1-\theta)^{4}$$
$$L(\theta \mid x_{1}, \ldots, x_{N}) = \prod_{i=1}^{N} P(x_{i}; \theta) \qquad \text{(Formula 2)}$$
Plotting the first line of Formula 2 as a function of θ shows that the likelihood function is largest when the parameter θ is 0.6, and for other parameter values the probability of "6 heads and 4 tails" occurring is smaller. So in this bet, I would guess that the coin comes up heads next time, because given what has been observed, the most plausible explanation is that heads comes up with probability 0.6.
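A minimal sketch of that plot's conclusion: evaluate the first line of Formula 2 on a grid of θ values and locate the maximum (the grid resolution of 0.01 is arbitrary):

```python
import numpy as np
from math import comb

# Likelihood of "6 heads and 4 tails" as a function of theta (Formula 2, line 1).
thetas = np.linspace(0.0, 1.0, 101)
likelihood = comb(10, 6) * thetas**6 * (1 - thetas)**4

best_theta = thetas[np.argmax(likelihood)]
print(best_theta)  # about 0.6 -- the maximum likelihood estimate of theta
```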

The general form of the likelihood function is the second line of Formula 2, which, as mentioned earlier, is the product of the probabilities of the individual samples.

Maximum likelihood estimation

Once the meaning of the likelihood function is understood, the mechanism of maximum likelihood estimation is easy to grasp. The likelihood function is a function of the statistical model's parameters; it describes the probability of the observed real data occurring under different parameter values. Maximum likelihood estimation finds the optimal parameters, i.e., the parameters that maximize the likelihood function. In other words, under the optimal parameters the observed data is most likely to occur.

Maximum likelihood estimation of linear regression

As mentioned in the previous article, the error term of linear regression is the difference between the predicted value and the true value (line 1 of Formula 3); it may be random noise or other influencing factors not taken into account by the linear regression model. One of the major assumptions of linear regression (also mentioned in the previous article) is that the errors follow a normal distribution with a mean of 0, and that the individual observations do not affect each other, i.e., they are independent. The probability density function of the normal distribution is shown in the second line of Formula 3. In line 2, Pr(x; μ, σ) uses a semicolon to emphasize that μ and σ are parameters of the probability density function; this is different in meaning from the vertical bar | used for conditioning in a conditional probability.

$$\varepsilon^{(i)} = y^{(i)} - \omega^{T} x^{(i)}$$
$$\Pr(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$
$$\Pr(\varepsilon^{(i)}; 0, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(\varepsilon^{(i)})^{2}}{2\sigma^{2}}\right)$$
$$\Pr(y^{(i)} \mid x^{(i)}; \omega, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)} - \omega^{T} x^{(i)})^{2}}{2\sigma^{2}}\right) \qquad \text{(Formula 3)}$$
The third line of Formula 3 is obtained by substituting ε for x and taking the mean μ to be 0. Substituting the relationship between x and y from line 1 into line 3 gives line 4, which is the probability of the i-th single sample.
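Here is line 4 of Formula 3 written out as a small Python function; the variable names and the example numbers are illustrative, not from the article:

```python
import numpy as np

def sample_density(x_i, y_i, w, sigma):
    """Density of a single observation under Formula 3, line 4."""
    residual = y_i - w @ x_i  # the error term from line 1 of Formula 3
    return np.exp(-residual**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Example with made-up numbers:
print(sample_density(np.array([1.0, 2.0]), 5.1, w=np.array([1.0, 2.0]), sigma=1.0))
```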

As mentioned earlier, the likelihood function is the product of the probabilities of the individual observations. A sample set has N observations; the probability of a single observation occurring is given in line 4 of Formula 3, and the product over the N observations is given in line 1 of Formula 4. The likelihood function can then be written out as the expression in line 2 of Formula 4, where x and y are the observed real data and are known, and ω is the model parameter to be solved for.

$$L(\omega) = \prod_{i=1}^{N} \Pr(y^{(i)} \mid x^{(i)}; \omega, \sigma)$$
$$L(\omega) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)} - \omega^{T} x^{(i)})^{2}}{2\sigma^{2}}\right) \qquad \text{(Formula 4)}$$
Given the observed data x and y, how should the parameter ω be chosen to optimize the model? Maximum likelihood estimation tells us to choose the ω that maximizes the likelihood function L. The product sign and the exponential terms make L look complicated, and it is not convenient to work with directly, so statisticians take the logarithm of the likelihood function. The properties of the logarithm greatly simplify the computation, and applying the logarithm to the likelihood function does not change where the optimal value of the parameter ω is attained. The log-likelihood function is usually denoted by a script ℓ.

$$\ell(\omega) = \log L(\omega)$$
$$= \log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)} - \omega^{T} x^{(i)})^{2}}{2\sigma^{2}}\right)$$
$$= \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)} - \omega^{T} x^{(i)})^{2}}{2\sigma^{2}}\right)\right]$$
$$= N \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^{2}}\sum_{i=1}^{N}\left(y^{(i)} - \omega^{T} x^{(i)}\right)^{2} \qquad \text{(Formula 5)}$$
Since the logarithm turns multiplication into addition, the product in line 2 of Formula 5 becomes the sum in line 3 of Formula 5. And since the logarithm cancels the exponential, we end up with the function in line 4 of Formula 5.

Since we only care about the value of the parameter ω at which the likelihood function is maximal, and the standard deviation σ does not affect which ω is chosen, we can ignore the terms involving σ. Adding a minus sign turns negative into positive and converts the original maximization problem into a minimization problem; the final result is Formula 6.

$$\omega^{*} = \arg\min_{\omega} \sum_{i=1}^{N}\left(y^{(i)} - \omega^{T} x^{(i)}\right)^{2} \qquad \text{(Formula 6)}$$
Formula 6 is almost the same as the loss function optimized by the least squares method in the previous article, namely the sum of squared differences between the true values and the predicted values. It can be said that all roads lead to the same destination.
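The equivalence can be checked numerically: the negative log-likelihood and the sum of squared errors differ only by terms that do not depend on ω, so they share the same minimizer. A minimal sketch with made-up data and an assumed known σ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data just for illustration.
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)
sigma = 0.5  # assumed known noise standard deviation

def neg_log_likelihood(w):
    residuals = y - X @ w
    return 50 * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(residuals**2) / (2 * sigma**2)

def sum_of_squares(w):
    return np.sum((y - X @ w)**2)

# For any candidate parameters, the two objectives differ by the same constant,
# so minimizing one is the same as minimizing the other.
for w in (np.array([2.0, -1.0]), np.array([1.5, -0.5])):
    print(neg_log_likelihood(w) - sum_of_squares(w) / (2 * sigma**2))
```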

To solve Formula 6 for the parameters, the derivative method from the previous article can be used: setting the derivative to 0 yields a matrix equation, and the solution of that matrix equation is the optimal solution of the model. Gradient descent can also be used to find the optimal solution; gradient descent will be introduced later in this column.
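A hedged sketch of the closed-form route with NumPy (the data shapes and noise level below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up observed data.
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

# Setting the derivative of Formula 6 to zero gives the matrix equation
# (X^T X) w = X^T y, whose solution is the optimal parameter vector.
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# An equivalent, numerically more robust route.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal_eq)  # both estimates are close to w_true
print(w_lstsq)
```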

Least squares and maximum likelihood

The derivation above shows that the formulas for least squares and maximum likelihood are almost the same. Intuitively, the least squares method finds the parameters that minimize the error distance between the observed data and the regression hyperplane, while maximum likelihood estimation maximizes the probability of the observed data occurring. When we assume that the errors are normally distributed, the closer all the error terms are to the mean 0, the higher the probability. The normal distribution is symmetric about its mean, so driving the error terms toward the mean is equivalent to minimizing the distances.

Conclusion

Maximum likelihood estimation is one of the most common parameter estimation methods in machine learning; it is used in logistic regression, deep neural networks, and other models. We need a likelihood function that describes the probability of the observed real data occurring under different model parameters; the likelihood function is a function of the model parameters. Maximum likelihood estimation finds the optimal parameters that maximize the probability of the observed data, so that the statistical model matches the real data as closely as possible.