Understanding linear regression

In the current setup, the hypothesis function

$$h_\theta(x) = \theta^T x$$

gives our prediction. To keep the predicted value close to the label value $y$, we use an objective (loss) function

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2,$$

and we obtain our parameter values by optimizing $\theta$ to make this objective function as small as possible.
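As a concrete sketch of these two pieces in NumPy (the names `hypothesis` and `cost` are illustrative, not from the original text):

```python
import numpy as np

def hypothesis(theta, X):
    """Prediction h_theta(x) = theta^T x for every row of the design matrix X."""
    return X @ theta

def cost(theta, X, y):
    """Squared-error objective J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = hypothesis(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)
```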

Least squares method

We select $\theta$ to minimize $J(\theta)$: initialize $\theta$, then use the gradient descent algorithm to update it,

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta).$$

For a single training example, the partial derivative in this formula is calculated as follows:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j}\,\frac{1}{2}\left(h_\theta(x) - y\right)^2 = \left(h_\theta(x) - y\right)\cdot\frac{\partial}{\partial \theta_j}\left(\sum_{i=0}^{n}\theta_i x_i - y\right) = \left(h_\theta(x) - y\right) x_j$$

Therefore, the update rule of the gradient descent algorithm can be obtained. Repeat the calculation until convergence:

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta\!\left(x^{(i)}\right)\right) x_j^{(i)} \qquad \text{(for every } j\text{)}$$
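A minimal sketch of this update applied one example at a time (the stochastic/LMS form); the learning rate `alpha` and the fixed sweep count are illustrative assumptions, since the text only says to repeat until convergence:

```python
import numpy as np

def lms_gradient_descent(X, y, alpha=0.01, epochs=100):
    """Repeat theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):            # stand-in for "until convergence"
        for x_i, y_i in zip(X, y):     # one update per training example
            theta = theta + alpha * (y_i - x_i @ theta) * x_i
    return theta
```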

Alternatively, the normal equations can be solved in closed form:

$$\theta = \left(X^T X\right)^{-1} X^T \vec{y}$$
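In code this amounts to a single linear solve. A least-squares solver is used below instead of inverting $X^T X$ explicitly; this is mathematically equivalent to the closed form above but numerically more stable (a sketch, not part of the original derivation):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form theta = (X^T X)^{-1} X^T y via a stable least-squares solve."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```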

Probabilistic interpretation

The following is the focus of this article. Is there a deeper reason why least squares is a reasonable loss measure? Here is a probabilistic assumption that justifies it.

Assume that the target value and the inputs are related as follows:

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},$$

that is, each target value consists of the model's prediction plus an error term $\epsilon^{(i)}$.

It is assumed that the errors are independently and identically distributed according to a Gaussian (normal) distribution with mean 0 and variance $\sigma^2$, i.e. $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. The probability density function can then be written as:

$$p\left(\epsilon^{(i)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(\epsilon^{(i)}\right)^2}{2\sigma^2}\right)$$

Substituting $\epsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$, this can be converted to the following form:

$$p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
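As a small sketch, this density is straightforward to evaluate numerically (the name `gaussian_density` is mine):

```python
import numpy as np

def gaussian_density(residual, sigma):
    """Density of N(0, sigma^2) evaluated at residual = y - theta^T x."""
    return np.exp(-residual ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
```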



Here $p\left(y^{(i)} \mid x^{(i)}; \theta\right)$ represents the distribution of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$. Because the observed $x$ and $y$ are fixed, we can construct a likelihood function of $\theta$ and ask which $\theta$ best satisfies the conditions above. The likelihood function is:

$$L(\theta) = p\left(\vec{y} \mid X; \theta\right)$$

Based on the independence assumption above, the likelihood function can be written as:

$$L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$



Given this probabilistic model, how do we get the best parameters? We can use maximum likelihood estimation: choose $\theta$ to maximize the likelihood function. Because the logarithm is monotonic, we can take the log of $L(\theta)$; its extremum points are the same, and differentiation becomes more convenient. The point where the partial derivatives are 0 is the extremum. The log-likelihood is:

$$\ell(\theta) = \log L(\theta) = m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

From this expression, maximizing $\ell(\theta)$ means making the latter term, $\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$, as small as possible, since the first term does not depend on $\theta$. This latter term is exactly the least-squares objective $J(\theta)$, which explains why we used the least-squares loss measure in the first place.
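This equivalence is easy to see in code: the log-likelihood below differs from $-J(\theta)/\sigma^2$ only by a constant in $\theta$, so maximizing one is the same as minimizing the other (a sketch; `sigma=1.0` is an arbitrary choice):

```python
import numpy as np

def linear_log_likelihood(theta, X, y, sigma=1.0):
    """ell(theta) = m*log(1/(sqrt(2*pi)*sigma)) - (1/sigma^2) * J(theta)."""
    m = len(y)
    residuals = y - X @ theta
    const = -m * np.log(np.sqrt(2 * np.pi) * sigma)   # independent of theta
    return const - np.sum(residuals ** 2) / (2 * sigma ** 2)
```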


Understanding logistic regression

In fact, most of the derivation for logistic regression is the same as above, including the independence assumption, the distributional assumptions, and the maximum likelihood calculation. The following is a brief walk-through of the process.

Similar to linear regression, we predict $y$ from a given $x$, but $y$ takes only the values 0 and 1, so a logistic (sigmoid) function is used to transform the linear output:

$$h_\theta(x) = g\left(\theta^T x\right) = \frac{1}{1 + e^{-\theta^T x}}$$

The derivative of $g$ will be needed later; it is:

$$g'(z) = g(z)\left(1 - g(z)\right)$$
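A small sketch of the sigmoid and its derivative, matching the identity above:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)
```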

Given this regression model, how do we fit $\theta$? Similar to the linear regression model above, we obtain a maximum likelihood estimate by making some probabilistic assumptions. Suppose the following:

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

Together, these two cases can be written as:

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y}\left(1 - h_\theta(x)\right)^{1-y}$$

Assuming the m training examples were generated independently, the likelihood function is as follows:

$$L(\theta) = \prod_{i=1}^{m} \left(h_\theta\!\left(x^{(i)}\right)\right)^{y^{(i)}}\left(1 - h_\theta\!\left(x^{(i)}\right)\right)^{1-y^{(i)}}$$



It is easy to take the logarithm:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m}\left[y^{(i)}\log h_\theta\!\left(x^{(i)}\right) + \left(1-y^{(i)}\right)\log\left(1 - h_\theta\!\left(x^{(i)}\right)\right)\right]$$
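As a sketch, this log-likelihood can be evaluated directly, reusing the `sigmoid` helper above (the small `eps` is a practical guard against `log(0)` and is not part of the derivation):

```python
import numpy as np

def logistic_log_likelihood(theta, X, y, eps=1e-12):
    """ell(theta) = sum_i [ y_i*log h(x_i) + (1 - y_i)*log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```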



The next step is similar: how do we maximize the likelihood function? Using gradient ascent, we update the parameter values:

$$\theta := \theta + \alpha \nabla_\theta\, \ell(\theta)$$

(Note the plus sign: we are maximizing $\ell(\theta)$, so we ascend the gradient.)



The gradient is calculated as follows, using the identity $g'(z) = g(z)(1 - g(z))$:

$$\frac{\partial}{\partial \theta_j}\ell(\theta) = \left(\frac{y}{g(\theta^T x)} - \frac{1-y}{1 - g(\theta^T x)}\right)\frac{\partial}{\partial \theta_j} g\left(\theta^T x\right) = \left(\frac{y}{g(\theta^T x)} - \frac{1-y}{1 - g(\theta^T x)}\right) g\left(\theta^T x\right)\left(1 - g\left(\theta^T x\right)\right) x_j = \left(y - h_\theta(x)\right) x_j$$



Therefore, the parameters are updated as follows:

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta\!\left(x^{(i)}\right)\right) x_j^{(i)} \qquad \text{(for every } j\text{)}$$
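A minimal gradient-ascent sketch of this update, in the same stochastic style as the linear-regression loop earlier (again, `alpha` and the epoch count are illustrative, and `sigmoid` is the helper defined above):

```python
import numpy as np

def logistic_gradient_ascent(X, y, alpha=0.1, epochs=100):
    """Repeat theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            theta = theta + alpha * (y_i - sigmoid(x_i @ theta)) * x_i
    return theta
```

Note that the loop body differs from the linear-regression version only in the sigmoid applied to $x_i^T \theta$, which is exactly the point made below.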

If you compare the two parameter update formulas above, you will find they have the same form, but note that they are not the same algorithm, because $h_\theta\!\left(x^{(i)}\right)$ is defined differently: it is linear in one case and a sigmoid in the other. The perceptron algorithm can be derived by a similar process.