Logistic regression


Linear regression

In statistics, linear regression is a regression analysis that models the relationship between one or more independent variables and a dependent variable by a least-squares fit of a linear regression equation. This equation is a linear combination of one or more model parameters, called regression coefficients. With a single independent variable the model is called simple regression; with more than one it is called multiple linear regression. Linear regression is the simplest model in machine learning.

Linear regression is defined as follows: for a sample, the output value is assumed to be a linear combination of its feature values (this is the model assumption). The model obtained by fitting the data is then:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x$$

By training on the sampled data set, the objective function is made to approximate the true function (the fitted equation); the least-squares method is generally used. There are many methods for solving for the optimal parameters… but that is another question.
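To make this concrete, here is a minimal sketch of a least-squares fit with NumPy. The data and variable names are made up for illustration; this is one way to do it, not the only one.

```python
import numpy as np

# Toy data: y = 2*x + 1 plus noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=50)

# Design matrix with a bias column, so theta = [intercept, slope]
X = np.column_stack([np.ones_like(x), x])

# Least squares: minimize ||X @ theta - y||^2 (solved stably with lstsq)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, slope:", theta)
```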

Logistic regression

Logistic regression is a nonlinear model and, like linear regression, is among the most commonly used algorithms in machine learning. Linear regression is mainly used for prediction (modeling and forecasting), while logistic regression is mainly used for binary classification. Both are supervised learning methods. (Classification is also a special case of prediction.)

Logistic regression gives the probability of belonging to a class (between 0 and 1): it maps the input to [0, 1] with a nonlinear function, so that classification can be done with the model. The function generally adopted is the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

With this function, the output value is mapped to a probability, which is exactly the role classification needs.
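A minimal sketch of the sigmoid (the function names are my own):

```python
import numpy as np

def sigmoid(z):
    """Map any real input to (0, 1): sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# The model h_theta(x) = sigmoid(theta^T x) outputs P(y=1 | x)
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # -> [~0.0067, 0.5, ~0.9933]
```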

With the model in hand, how should the loss function be constructed to optimize it? Using the MSE approach, the loss function would be:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

But with the sigmoid substituted in, this function is non-convex, so optimization can get stuck in a locally optimal solution. Therefore, another loss function needs to be sought.
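To see this numerically, below is a small sketch (my own toy setup: a single sample with x = 1 and label y = 0, one scalar parameter) that approximates the curvature of each loss with second differences. The MSE-of-sigmoid curve has regions of negative curvature (non-convex), while the cross-entropy loss derived next does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.linspace(-6, 6, 601)

# One toy sample: x = 1, y = 0 (assumption for illustration)
p = sigmoid(theta)            # model output h_theta(x)
mse = 0.5 * p**2              # squared error against y = 0
xent = -np.log(1 - p)         # cross-entropy against y = 0

# Second differences approximate curvature; a negative value means non-convex
print("min curvature, MSE loss:     ", np.diff(mse, 2).min())   # < 0: non-convex
print("min curvature, cross-entropy:", np.diff(xent, 2).min())  # > 0: convex
```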

Considering that the sigmoid output itself represents the probability of belonging to a class, we have:

$$P(y=1 \mid x;\theta) = h_\theta(x), \qquad P(y=0 \mid x;\theta) = 1 - h_\theta(x)$$

Therefore, it can be written in the following form (valid because $y$ takes only the values 0 and 1):

$$P(y \mid x;\theta) = h_\theta(x)^{y}\,\left(1 - h_\theta(x)\right)^{1-y}$$

In this way, we can use the tools of mathematical statistics to solve for the parameters. For the training data, the principle of maximum likelihood requires maximizing this probability.

For a batch of $m$ data points, the likelihood is:

$$L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}}\,\left(1 - h_\theta(x^{(i)})\right)^{1-y^{(i)}}$$

Taking the negative log gives the cross-entropy loss:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$

This function is convex and therefore does not fall into a locally optimal solution.

From this we can see that the essence of cross entropy (the loss used for logistic regression) is actually maximum likelihood.

(Confused?)

Well, the quantity here is the probability that $y$ belongs to the positive class, i.e. $h_\theta(x) = P(y=1 \mid x)$… Read that way, everything else makes sense.
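Putting the pieces together, here is a minimal sketch (NumPy, made-up data, illustrative names) of training logistic regression by gradient descent on the cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
X = np.column_stack([np.ones(len(X)), X])   # bias column

theta = np.zeros(X.shape[1])
lr = 0.1
for _ in range(1000):
    h = sigmoid(X @ theta)                  # P(y=1 | x) for every sample
    grad = X.T @ (h - y) / len(y)           # gradient of the cross-entropy loss
    theta -= lr * grad

loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
print("cross-entropy loss:", loss, "theta:", theta)
```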

Maximum likelihood and maximum posterior probability

  

Probability: probability theory studies, given a known model and parameters, the chance that an event occurs. Probability is a deterministic thing, an ideal value; as the number of trials goes to infinity, frequency equals probability. The "frequentist school" holds that the world is deterministic and that the parameter $\theta$ in a model is a fixed value, so their approach is to model the event itself directly. Since the frequentist school believes the model parameter is a fixed value, it generally uses maximum likelihood estimation (MLE) to estimate that value.

Statistics: statistics starts from given observed data and uses that data for modeling and parameter estimation. In plain terms, it derives from the observed data a corresponding model and the parameters describing that model (for example, conjecturing that the data follow a Gaussian model and then working out the model's specific parameters).

Likelihood function and probability function: for the function $P(x \mid \theta)$, there are two cases:

  1. If $\theta$ is held fixed and $x$ is the variable, the function is called the probability function; it describes the probability of different outcomes $x$ occurring.
  2. If $\theta$ is the variable and $x$ is a fixed value (given), the function is called the likelihood function; it describes the probability of the event $x$ occurring under different values of $\theta$. In this case the function is also written as $L(\theta \mid x)$ (see the sketch after this list).
  3. Take care to distinguish whether $P(x \mid \theta)$ is a conditional probability or a likelihood function. The relationship between $L(\theta \mid x)$ and $P(x \mid \theta)$ is: when $x$ is fixed, $L(\theta \mid x) = P(x \mid \theta)$; and when $\theta$ is a random variable, $P(x \mid \theta)$ is a conditional probability. Colloquially, $P(x \mid \theta)$ does not always denote a conditional probability: in the likelihood case $x$ is fixed while $\theta$ varies, and $\theta$ is a definite parameter value rather than a random variable (it is simply not yet known and needs to be estimated).
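For example, for coin flips the likelihood evaluates the same expression $P(x \mid \theta)$ at different $\theta$ with the data held fixed. A minimal sketch, with data and names made up for illustration:

```python
import numpy as np

# Fixed observed data: 10 coin flips, 7 heads (made-up example)
flips = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

def log_likelihood(theta, x):
    """log P(x | theta) for i.i.d. Bernoulli(theta) data."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Treat theta as the variable, data fixed: this is the likelihood function
grid = np.linspace(0.01, 0.99, 99)
ll = np.array([log_likelihood(t, flips) for t in grid])
print("MLE of theta:", grid[ll.argmax()])   # 0.7 = 7/10
```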

Bayes' formula:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

This expression represents the confidence in event A once event B has been observed. Here $P(A)$ represents the prior probability of A… that is, the confidence in event A on its own, before any observation. Bayesian thinking holds that the world is uncertain, so you assume an estimate (a prior probability) and then adjust that estimate based on observations. In general terms, when modeling an event, the model parameter $\theta$ is not taken to be a definite value; rather, the parameter $\theta$ itself obeys some underlying distribution (hence the assumption and selection of the prior probability is important!!). The Bayesian method of parameter estimation is maximum a posteriori (MAP) estimation. The specific form is as follows:

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid x) = \arg\max_{\theta} \frac{P(x \mid \theta)\,P(\theta)}{P(x)}$$

When maximizing the posterior probability, since $P(x)$ is already known (a fixed value determined by the observations), the maximum posterior probability is actually

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(x \mid \theta)\,P(\theta)$$

So the posterior probability is affected by two terms, $P(x \mid \theta)$ and $P(\theta)$: the former is just the likelihood function, while the latter is the prior distribution of the parameter. When the prior distribution is assumed to be 1 (a uniform prior), maximizing the posterior probability and maximizing the likelihood function are equivalent.

The difference between maximum a posteriori probability and maximum likelihood estimation:

The difference between the two really comes down to how the parameter $\theta$ is viewed. Maximum a posteriori estimation holds that the parameter itself obeys some underlying distribution, which must be taken into account, whereas the likelihood function treats the parameter as a fixed value, not a random variable. Taking logs, the essence of the posterior objective is $\log P(x \mid \theta) + \log P(\theta)$, and the $\log P(\theta)$ term looks just like a regularizer. Could that be a coincidence?? It fits just right?? It's amazing… Anyway, the essence of maximizing a posteriori probability is to treat $\theta$ as a random variable, giving a likelihood function with a penalty term.
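To make the penalty-term view concrete, here is a sketch comparing MLE and MAP for the same toy Bernoulli coin as above; the Beta(2, 2) prior is my own assumption for illustration. The log-prior pulls the estimate toward the prior's mass near 0.5.

```python
import numpy as np

# Same toy coin data as above: 7 heads in 10 flips (made up)
heads, n = 7, 10

grid = np.linspace(0.01, 0.99, 99)
log_lik = heads * np.log(grid) + (n - heads) * np.log(1 - grid)

# Assumed Beta(2, 2) prior favoring theta near 0.5; its density is 6*t*(1-t),
# and constants do not affect the argmax, so log t + log(1-t) suffices
log_prior = np.log(grid) + np.log(1 - grid)

print("MLE:", grid[log_lik.argmax()])                 # 0.7 (= 7/10)
print("MAP:", grid[(log_lik + log_prior).argmax()])   # ~0.67, pulled toward 0.5
```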
