Author | WEIWEI    Compiled by | NewBeeNLP

The purpose of the "What do interviewers ask" series of articles is to sort out ML/DL/NLP-related knowledge points as completely and comprehensively as possible. We hope this series can be of some help, whether you are a beginner, a student preparing for interviews, or someone reviewing old material to learn something new.

The series will be updated continuously; we hope you enjoy it. The following articles have been sorted out so far:

  • What do interviewers ask about SVM

  • What do interviewers ask about ELMo

  • What do interviewers ask about Transformer

  • What do interviewers ask about BERT

1. A one-sentence summary of logistic regression

Logistic regression assumes that the data follow a Bernoulli distribution and solves for the parameters by maximizing the likelihood function via gradient descent, in order to perform binary classification of the data.

This sentence actually contains five points, which we unpack one by one:

  • Logistic regression hypothesis
  • Loss function of logistic regression
  • The solution method of logistic regression
  • The purpose of logistic regression
  • How to classify logistic regression

2. Logistic regression hypothesis

Any model has its own hypotheses; the model is applicable only when those hypotheses hold.

Hypothesis #1

The first basic assumption of logistic regression is that the data follows a Bernoulli distribution.

Bernoulli distribution: a discrete probability distribution. If a trial succeeds, the random variable takes the value 1; if it fails, it takes the value 0. Denote the probability of success by $p$ and the probability of failure by $q = 1 - p$; the probability mass function is $P(X = k) = p^k (1-p)^{1-k}$ for $k \in \{0, 1\}$.

In logistic regression, since the data are assumed to follow a Bernoulli distribution, there is a notion of success and failure; in the binary classification setting these correspond to the positive class and the negative class, so each sample has a probability of being positive and a probability of being negative. Specifically, we write it in the following form:

$$P(y = 1 \mid x) = p, \qquad P(y = 0 \mid x) = 1 - p$$

Hypothesis #2

The second assumption of logistic regression is that the probability of the positive class is computed by the sigmoid function, namely:

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

The probability that the sample is predicted positive:

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$

The probability that the sample is predicted negative:

$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

Written together, i.e., the probability of the sample's category:

$$P(y \mid x; \theta) = h_\theta(x)^{y} \, (1 - h_\theta(x))^{1 - y}$$

A note on this combined formula: the $y$ here is not the sample label but the class whose probability you want. Plugging in $y = 1$ gives the probability of the positive class, and $y = 0$ gives the probability of the negative class. Also, this combined form is very useful for solving for the parameters, as we will see shortly.

Note also that $h_\theta(x)$ is a probability, not yet a predicted label. Strictly speaking, we should compute both the positive-class probability and the negative-class probability and pick whichever is larger. Since the two sum to 1, in practice we use only the positive-class probability: as long as it exceeds 0.5, the sample is classified as positive. But this 0.5 is set by hand. If you prefer, you can require the positive-class probability to exceed 0.6; in that case, even a sample with positive-class probability 0.55 is predicted as negative rather than positive.
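To make the two assumptions and the thresholding concrete, here is a minimal sketch (function and parameter names are mine, not from the article):

```python
import numpy as np

def sigmoid(z):
    # Squash a real-valued score into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    # Assumption 2: P(y=1 | x; theta) = sigmoid(theta^T x).
    return sigmoid(X @ theta)

def predict(X, theta, threshold=0.5):
    # The 0.5 threshold is a convention, not part of the model;
    # raise it (e.g. to 0.6) to be stricter about the positive class.
    return (predict_proba(X, theta) >= threshold).astype(int)
```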

3. The loss function of logistic regression

It is said that the loss function of logistic regression is its maximum likelihood function, but why?

Here’s a quick summary of maximum likelihood estimation, just in case the interviewer asks:

Maximum likelihood estimation: using the known sample outcomes, work backward to the model parameter values that are most likely (with maximum probability) to have produced those outcomes (the model is fixed; the parameters are unknown).

The model is the prediction formula above:

$$P(y \mid x; \theta) = h_\theta(x)^{y} \, (1 - h_\theta(x))^{1 - y}$$

The unknown parameters are the $\theta$ inside it. The "sample outcome information" is our training data $(x, y)$: the features and the labels. What we know is that when the features take these values, the sample belongs to class $y$ (positive or negative).

"Working backward to the parameters that most probably produced the observed outcomes" means the following. Suppose we know a sample point is positive, i.e. $y = 1$. Plugging it into the model, the predicted positive-class probability is $P(y = 1 \mid x; \theta) = h_\theta(x)$; positive is exactly what we expect, so we should make this probability as large as possible to match the actual label. Conversely, if the sample is negative, the model gives the negative-class probability $1 - h_\theta(x)$, and again we want to maximize it. Thanks to the combined form above, we do not even need to distinguish the positive and negative cases.

It all boils down to one sentence:

Put a sample, positive or negative, into the model and, in a word, make its probability $P(y \mid x; \theta)$ as large as possible.

We have only discussed a single sample so far. For the whole training set we of course want the probability of all the samples to be as large as possible, i.e. our objective function is a joint probability. Assuming the samples are independent, the probability of all samples can be written as:

$$L(\theta) = \prod_{i=1}^{m} P(y_i \mid x_i; \theta) = \prod_{i=1}^{m} h_\theta(x_i)^{y_i} \, (1 - h_\theta(x_i))^{1 - y_i}$$

Personally, at this point I can only call this an objective function, because it is our target. But then what is the loss function? In other algorithms the loss function is generally defined via the error between the true value and the predicted value, which is easy to understand.

After searching the literature for a long time, I found no official derivation that introduces the log loss function this way, so I can only offer my personal understanding: logistic regression has no native loss function; the log loss is simply what we end up calling one. So why is it called the log loss function?

Our goal is to maximize the objective function above, and to do that we will take derivatives. Before differentiating we simplify the expression, otherwise it is too complicated. How do we simplify?

  • Step 1: take the logarithm, turning the product into a sum. The simplified result is:

$$\log L(\theta) = \sum_{i=1}^{m} \big[ y_i \log h_\theta(x_i) + (1 - y_i) \log (1 - h_\theta(x_i)) \big]$$

  • Step 2: since by convention a loss function is minimized, add a negative sign:

$$J(\theta) = -\log L(\theta)$$

  • After simplification, this is what we can call the loss function:

$$J(\theta) = -\sum_{i=1}^{m} \big[ y_i \log h_\theta(x_i) + (1 - y_i) \log (1 - h_\theta(x_i)) \big]$$
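As a quick numerical sanity check, here is a minimal sketch of this loss (names and the clipping constant are mine, not from the article):

```python
import numpy as np

def log_loss(X, y, theta, eps=1e-12):
    # Negative log-likelihood of the Bernoulli model:
    # J(theta) = -sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ],
    # with h_i = sigmoid(theta^T x_i).
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```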

4. The solution method of logistic regression

Generally, gradient descent is used to solve for the parameters. Gradient descent itself comes in several variants: stochastic gradient descent, batch gradient descent, and mini-batch gradient descent (a mini-batch sketch follows the list below).

  • Put simply, batch gradient descent finds the global optimum (the log loss is convex), but each parameter update must traverse all of the data, which means heavy and redundant computation, so updates are slow when the data volume is large.
  • Stochastic gradient descent updates frequently and with high variance. The upside is that SGD may jump to new, potentially better local optima; the downside is that convergence becomes more erratic.
  • Mini-batch gradient descent combines the advantages of SGD and batch GD, using n samples per update. It reduces the number of parameter updates and converges more stably; this is the variant generally adopted in deep learning.
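Here is a minimal mini-batch gradient descent sketch for the loss above (learning rate, epochs, and batch size are illustrative defaults, not values from the article):

```python
import numpy as np

def fit_logistic_minibatch(X, y, lr=0.1, epochs=100, batch_size=32, seed=0):
    # Mini-batch gradient descent on the log loss.
    # Gradient over a batch B: (1/|B|) * X_B^T (h_B - y_B).
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            h = 1.0 / (1.0 + np.exp(-(X[b] @ theta)))
            theta -= lr * X[b].T @ (h - y[b]) / len(b)
        # batch_size=len(y) recovers batch GD; batch_size=1 recovers SGD.
    return theta
```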

In addition, interviewers may bring up optimization methods you are less familiar with, such as Adam and the momentum method (I will not expand on them here; if there is time later I will write a dedicated piece on optimizers). Plain gradient descent with a fixed learning rate has two fatal problems:

  • The first is how to choose an appropriate learning rate for the model. Keeping the same learning rate throughout is not appropriate. At the start of training the parameters are far from the optimum, so a large learning rate is needed to approach it quickly; later in training the parameters are already close to the optimum, and keeping the initial learning rate makes it easy to overshoot and oscillate around the optimum. Colloquially, it learns too far and runs past the target.
  • The second is how to choose an appropriate learning rate for each parameter. In practice it is not reasonable to keep the same learning rate for all parameters: parameters that are updated frequently can use a suitably smaller learning rate, while parameters that are updated rarely should use a larger one (see the sketch after this list).
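One classic answer to both problems is a per-parameter adaptive learning rate in the style of AdaGrad; a minimal sketch (not from the article, names are mine):

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.1, eps=1e-8):
    # AdaGrad: each parameter's effective step size shrinks with its
    # accumulated squared gradient, so frequently-updated parameters take
    # smaller steps while rarely-updated ones keep larger steps.
    cache += grad ** 2
    theta -= lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```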

5. The purpose of logistic regression

Binary classification of data.

6. How logistic regression classifies

As mentioned above, we set a threshold and ask whether the positive-class probability exceeds it. The threshold is generally 0.5, so we only need to check whether the positive-class probability is greater than 0.5.

7. Why does logistic regression use the maximum likelihood function as its loss function

The usual comparison is against the squared loss (least squares), since linear regression uses the squared loss. The reason is that composing the squared loss with the sigmoid function yields a non-convex objective, which is hard to optimize and only gives local optima, whereas the log-likelihood yields a higher-order continuously differentiable convex function, for which the global optimum can be found.

Second, the parameter update under the log loss is fast, because the gradient depends only on $x$ and the prediction error, not on the gradient of the sigmoid itself (which can be tiny and slow learning down). A short derivation of this claim follows.
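This is the standard single-sample derivation (not specific to this article), using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ with $h = \sigma(\theta^T x)$:

```latex
% Per-sample log loss: J = -[ y log h + (1 - y) log(1 - h) ], h = sigma(theta^T x).
\frac{\partial J}{\partial \theta}
  = -\left( \frac{y}{h} - \frac{1 - y}{1 - h} \right) h (1 - h)\, x
  = (h - y)\, x
% The sigmoid's own gradient h(1-h) cancels, leaving only the prediction
% error (h - y) times the input x.
```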

8. During logistic regression training, what is the impact if many features are highly correlated, or if one feature is repeated 100 times

First, the conclusion: if the loss function eventually converges, the classifier's performance is not affected even when many features are highly correlated.

But consider the features themselves. Suppose there is only one feature and, without any resampling, you repeat it 100 times. After training, the data are unchanged, just with the feature duplicated 100 times; essentially the original feature's weight is split into 100 parts, each duplicate carrying one hundredth of the original weight.

With random sampling, after training converges, the 100 copies can still be regarded as jointly playing the role of the original feature, though the individual weights may scatter and partly cancel each other out. The sketch below checks the weight-splitting claim empirically.
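A quick empirical check of the weight-splitting claim (a sketch using scikit-learn; the synthetic data and the large C, which nearly disables regularization, are my choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
y = (x[:, 0] + rng.normal(size=1000) > 0).astype(int)  # noisy labels

# Baseline: a single copy of the feature.
w_single = LogisticRegression(C=1e6).fit(x, y).coef_[0, 0]

# The same feature repeated 100 times: only the sum of the copies' weights
# is identified, and the (tiny) L2 penalty splits it evenly, so each copy
# gets roughly 1/100 of the single-feature weight.
X_rep = np.repeat(x, 100, axis=1)
w_rep = LogisticRegression(C=1e6).fit(X_rep, y).coef_[0]
print(w_single, w_rep.sum(), w_rep.mean())  # w_rep.sum() ≈ w_single
```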

9. Why do we still remove highly correlated features before training

  • Removing highly correlated features makes the model more interpretable.
  • It can greatly improve training speed. If many features in the model are highly correlated, then even when the loss function converges, the parameters may not, which slows training down; moreover, more features by themselves increase training time.

10. Summary of the advantages and disadvantages of logistic regression

Advantages:

  • The form is simple and the model is highly interpretable. The influence of each feature on the final result can be read off its weight: a feature with a larger weight has a larger influence on the final result.
  • The model works reasonably well. It is acceptable in engineering (as a baseline); if the feature engineering is done well, the results will not be too bad, and feature engineering can be developed by everyone in parallel, greatly speeding up development.
  • Training is fast. At classification time, the amount of computation depends only on the number of features. Moreover, distributed optimization of logistic regression with SGD is mature, and training can be sped up further simply by adding machines, so several model versions can be iterated in a short time.
  • The resource footprint is small, especially memory, because only the feature values of each dimension need to be stored.
  • The output is easy to adjust. Logistic regression conveniently yields final classification results because it outputs a probability score for each sample; we can simply cut these scores at a threshold (above the threshold is one class, below is the other).

Disadvantages:

  • The accuracy is not very high. Because the form is very simple (essentially a linear model), it is difficult to fit the true distribution of the data.
  • It is hard to deal with imbalanced data. For example, with a highly imbalanced positive-to-negative ratio such as 10,000 to 1, predicting all samples as positive already makes the loss function small, but such a classifier cannot actually distinguish positive from negative samples.
  • Handling nonlinear data is troublesome. Without introducing other methods, logistic regression can only deal with linearly separable data, and moreover only with binary classification.
  • Logistic regression by itself cannot select features. Sometimes we use GBDT to select features first and then feed them to logistic regression, as in the sketch below.
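One common realization of that last pipeline (a hedged sketch with scikit-learn; the function name and top_k are mine):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def gbdt_then_lr(X_train, y_train, top_k=20):
    # Step 1: fit a GBDT and rank features by importance.
    gbdt = GradientBoostingClassifier(n_estimators=100).fit(X_train, y_train)
    keep = np.argsort(gbdt.feature_importances_)[::-1][:top_k]
    # Step 2: train logistic regression on the selected features only.
    lr = LogisticRegression(max_iter=1000).fit(X_train[:, keep], y_train)
    return keep, lr
```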

– END
