Red Stone’s personal website: Redstonewill.com

Machine learning is both a theoretical and a practical discipline. When applying for machine-learning-related jobs, we often encounter a wide variety of machine learning questions and knowledge points. To help you sort out and understand these topics, and to better prepare you for machine learning written tests and interviews, Red Stone will serialize a series of machine learning test-question articles on this public account. We hope they are helpful to everyone!

Q1. In the regression model, which of the following has the greatest influence on the trade-offs between under-fitting and over-fitting?

A. Polynomial order

B. When updating the weight W, do we use matrix inversion or gradient descent

C. Use constant terms

Answer: A

It is very important to choose the right polynomial order. If the order is too high, the model becomes more complex and tends to overfit; if the order is too low, the model is too simple and tends to underfit. If the concepts of overfitting and underfitting are unclear, see the figure below:

Q2. Suppose you have the following data: both the input and the output have only one variable. A linear regression model (y = wx + b) is used to fit the data. What is the mean square error obtained by leave-one-out cross-validation?

A. 10/27

B. 39/27

C. 49/27

D. 55/27

Answer: C

Analysis: Leave-one-out cross-validation, simply put, works as follows: given N samples, each sample in turn is held out as the test sample, and the other N−1 samples are used as training samples. This yields N models and N test results, and the average of these N results measures the performance of the model.

For this problem, we first plot the three sample points. From the three fitted lines in the analysis below, the points can be reconstructed as (0, 2), (2, 2), and (3, 1):

Each time, two points are used for the linear fit, which gives three cases, as shown in the figure below:

In the first case, the regression model is y = 2, and the error E1 = 1.

In the second case, the regression model is y = -x + 4, and the error E2 = 2.

In the third case, the regression model is y = -1/3x + 2, and the error E3 = 2/3.

Then the total mean square error is:

MSE = (1/3) × (1² + 2² + (2/3)²) = (1/3) × (1 + 4 + 4/9) = 49/27


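This computation can be checked with a short script. The three sample points (0, 2), (2, 2), and (3, 1) are a reconstruction inferred from the three fitted lines above, since the original figure showing the data is not included here:

```python
# Leave-one-out cross-validation for y = wx + b on three points.
# The points are reconstructed from the analysis and are an assumption.
points = [(0, 2), (2, 2), (3, 1)]

squared_errors = []
for i, (x_test, y_test) in enumerate(points):
    # Train on the other two points.
    (x1, y1), (x2, y2) = [p for j, p in enumerate(points) if j != i]
    w = (y2 - y1) / (x2 - x1)   # slope through the two training points
    b = y1 - w * x1             # intercept
    squared_errors.append((y_test - (w * x_test + b)) ** 2)

mse = sum(squared_errors) / len(points)
print(mse)  # 49/27 ≈ 1.8148
```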
Q3. Which of the following statements is correct about Maximum Likelihood Estimate (MLE)?

A. MLE may not exist

B. MLE is always present

C. If MLE exists, then its solution may not be unique

D. If MLE exists, its solution must be unique

Answer: AC

Analysis: If the likelihood function L(θ) is discontinuous at its maximum and the first derivative does not exist there, the MLE may not exist, as shown in the figure below:

The other case is that the MLE is not unique: the maximum is attained at two values of θ, as shown below:

Q4. If we say that the “linear regression” model perfectly fits the training sample (the training sample error is zero), which of the following statements is true?

A. The test sample error is always zero

B. The test sample error cannot be zero

C. None of the above answers are correct

Answer: C

Analysis: From the training sample error being zero, nothing can be inferred about whether the test sample error is zero. It is worth mentioning that if the training error is exactly zero, the model has very likely overfit the training data and may not have good generalization ability!

Q5. In a linear regression problem, we use R-squared to judge the degree of fit. At this point, if a feature is added and the model remains unchanged, what is the following true?

A. If R-squared increases, this feature is meaningful

B. If R-Squared decreases, this feature is meaningless

C. Looking at R-squared alone, it is impossible to determine whether this feature is meaningful.

D. None of the above is true

Answer: C

Analysis: In linear regression, R-squared measures how closely the regression predictions match the true sample outputs. Its expression is:

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

In this formula, the numerator is the sum of squared differences between the true values and the predicted values, similar to the mean square error (MSE); the denominator is the sum of squared differences between the true values and their mean, similar to the variance (Var). The value of R-squared indicates the quality of the fit: a result of 0 means the model fits very poorly, and a result of 1 means the model has no error. Generally speaking, the larger R-squared is, the better the fit. However, R-squared alone cannot truly quantify how good the model is, because R-squared never decreases as more features are added, even meaningless ones; it only gives a rough measure.

In this case, R-squared alone cannot tell whether the added feature is meaningful. Generally speaking, if a feature is added, R-squared may become larger or stay the same, but this is not necessarily related to the feature being useful.

If Adjusted R-squared is used instead:

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1)

where n is the number of samples and p is the number of features. Adjusted R-squared offsets the effect of the number of features on R-squared: it increases only when an added feature improves the fit by more than chance would predict, and the larger it is, the better.
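As a rough sketch of these points (synthetic data; the helper functions `r_squared` and `adjusted_r_squared` are illustrative names, not from any particular library), the following shows that plain R-squared never decreases when a pure-noise feature is added:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 3 * x + rng.normal(scale=0.5, size=n)   # toy linear data with noise

def r_squared(X, y):
    # Ordinary least squares with an intercept column, then R^2.
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def adjusted_r_squared(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2_one = r_squared(x.reshape(-1, 1), y)
noise = rng.normal(size=n)                   # a meaningless extra feature
r2_two = r_squared(np.column_stack([x, noise]), y)

# Adding a regressor cannot decrease R^2 (up to numerical precision):
print(r2_two >= r2_one)
print(adjusted_r_squared(r2_one, n, 1), adjusted_r_squared(r2_two, n, 2))
```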

Q6. Which of the following statements about residuals in linear regression analysis is correct?

A. The mean of residuals is always zero

B. The mean of residuals is always less than zero

C. The mean of residuals is always greater than zero

D. None of the above is true

Answer: A

Analysis: In linear regression analysis, the goal is to minimize the sum of squared residuals, which is a function of the parameters. Setting the partial derivative of this sum with respect to the intercept parameter to zero yields the condition that the residuals sum to zero, i.e., the mean of the residuals is zero (this holds whenever the model includes an intercept term).
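A quick numerical check of this claim, using toy data and NumPy's `np.polyfit` to fit a line with an intercept:

```python
import numpy as np

# With an intercept term, the OLS residuals average to zero.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(size=100)   # toy linear data

w, b = np.polyfit(x, y, deg=1)             # least-squares line y = w*x + b
residuals = y - (w * x + b)
print(np.mean(residuals))                  # ~0 up to floating-point error
```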

Q7. Which of the following statements about Heteroskedasticity is true?

A. The error terms of the linear regression have different variances

B. The error terms of the linear regression have the same variance

C. The error terms of the linear regression are zero

D. None of the above is true

Answer: A

Analysis: Heteroskedasticity is the opposite of homoskedasticity. Homoskedasticity is assumed to ensure that estimates of the regression parameters have good statistical properties. An important assumption of the classical linear regression model is that the random error terms in the population regression function are homoskedastic, i.e., they all have the same variance. If this assumption is not satisfied, that is, the random error terms have different variances, the linear regression model is said to exhibit heteroskedasticity.

Generally speaking, the presence of outliers tends to increase heteroskedasticity.
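A small sketch of the difference (synthetic data, with an assumed noise model whose standard deviation grows with x):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 500)

# Homoskedastic errors: constant standard deviation.
homo = rng.normal(scale=1.0, size=x.size)
# Heteroskedastic errors: standard deviation grows with x.
hetero = rng.normal(scale=0.3 * x)

# Compare the spread of the errors in the low-x and high-x halves:
print(np.std(homo[:250]), np.std(homo[250:]))      # roughly equal
print(np.std(hetero[:250]), np.std(hetero[250:]))  # second is much larger
```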

Q8. Which of the following reflects the strong correlation between X and Y?

A. The correlation coefficient is 0.9

B. For the null hypothesis β = 0, the p-value is 0.0001

C. For the null hypothesis β = 0, the t-value is 30

D. None of the above is true

Answer: A

Analysis: We are familiar with the correlation coefficient: it reflects the degree of linear correlation between two variables and is generally denoted r. Its expression is:

r = Cov(X, Y) / √(Var[X] · Var[Y])

where Cov(X, Y) is the covariance of X and Y, Var[X] is the variance of X, and Var[Y] is the variance of Y. The value of r lies in the range [−1, 1], and a larger |r| indicates a stronger correlation. Option A, r = 0.9, indicates a strong correlation between X and Y.
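The definition can be computed directly; the toy data below is made up for illustration:

```python
import numpy as np

# Pearson correlation coefficient from its definition:
# r = Cov(X, Y) / sqrt(Var[X] * Var[Y])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly 2x: strong linear relation

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / np.sqrt(x.var() * y.var())
print(r)                                   # close to 1

# Sanity check against NumPy's built-in correlation matrix:
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```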

The p-value and t-value do not by themselves quantify the strength of a correlation; they are only compared against a threshold to reach a binary conclusion. For example, consider two hypotheses:

  • Null hypothesis H0: there is no “linear” correlation between the two parameters.

  • Alternative hypothesis H1: There is “linear” correlation between the two parameters.

If the threshold is 0.05 and the computed p-value is small, say 0.001, we can say "there is very significant evidence to reject the H0 hypothesis in favor of the H1 hypothesis," i.e., there is a "linear" correlation between the two parameters. The p-value is used only for this binary judgment, so it cannot be said that p = 0.06 is necessarily better than p = 0.07.

Q9. Which of the following assumptions do we follow in deriving linear regression parameters (multiple choices)?

A. Linear relation between X and Y (polynomial relation)

B. Model errors are statistically independent

C. Errors generally follow a normal distribution of 0 mean and fixed standard deviation

D. X is nonrandom and measurement error free

Answer: ABCD

Analysis: In the linear regression derivation and analysis, we have assumed that the above four conditions are true.

Q10. In order to observe and test the linear relationship between Y and X, where X is a continuous variable, which of the following figures is appropriate?

A. Scatter plot

B. Bar chart

C. Histogram

D. None of the above is true

Answer: A

Analysis: A scatter plot reflects the relationship between two variables and is the most intuitive choice for examining the linear relationship between Y and X.

Q11. In general, which of the following methods is commonly used to predict a continuous dependent (output) variable?

A. Linear regression

B. Logistic regression

C. Linear regression and logistic regression are both acceptable

D. None of the above is true

Answer: A

Analysis: Linear regression is generally used for predicting real-valued outputs, while logistic regression is generally used for classification problems.

Q12. The correlation coefficient between individual health and age is -1.09. What conclusion can you tell the doctor based on this?

A. Age is a good predictor of health

B. Age is a poor predictor of health

C. None of the above is true

Answer: C

Analysis: A correlation coefficient of −1.09 cannot exist, because the correlation coefficient ranges over [−1, 1].

Q13. In least squares line fitting, which of the following offsets do we use? The abscissa is the input X and the ordinate is the output Y.

A. Vertical offsets

B. Perpendicular offsets

C. Both offsets are acceptable

D. None of the above is true

Answer: A

Analysis: The linear regression model uses vertical offsets to compute loss functions such as the mean square error. Perpendicular offsets are commonly used in principal component analysis (PCA).

Q14. Suppose we generate some data using a third-order polynomial in which Y is a function of X (so a third-order polynomial fits the data well). Which of the following statements is true (multiple choice)?

A. Simple linear regression is easy to cause high bias and low variance.

B. Simple linear regression is easy to cause low bias and high variance.

C. Third-order polynomial fitting will cause low bias and high variance.

D. Third-order polynomial fitting with low bias and low variance

Answer: AD

Analysis: Bias and variance are two relative concepts, just like underfitting and overfitting. If the model is too simple, it usually underfits, accompanied by high bias and low variance. If the model is too complex, it usually overfits, with low bias and high variance.

Use a graph to graphically show the relationship between bias and variance:

Image source: https://www.zhihu.com/question/27068705

Bias can be regarded as the gap between the model's predictions and the true samples. To obtain low bias, the model must be made more complex, but this easily causes overfitting. Variance can be regarded as how much the model's predictions fluctuate across different training sets, which shows up in its performance on the test set. To obtain low variance, the model must be simplified, but this easily causes underfitting. In practice, bias and variance are a trade-off. If the model performs well on both the training samples and the test set, the bias and variance will both be relatively small, which is the ideal situation.
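The situation in Q14 can be sketched numerically (synthetic data generated from an assumed third-order polynomial with noise; a degree-1 fit underfits, a degree-3 fit matches the data-generating process):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n=30):
    # Data generated by a third-order polynomial plus noise (assumed form).
    x = rng.uniform(-2, 2, size=n)
    y = x**3 - 2 * x + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()

mses = {}
for degree in (1, 3):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    mses[degree] = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)

print(mses)   # the degree-3 fit has markedly lower test error
```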

Q15. Suppose you are training a linear regression model. Consider the following two statements:

1. If the amount of data is small, overfitting is likely to occur.

2. If the hypothesis space is small, overfitting is likely to occur.

Which of the following statements is true about these two statements?

A. 1 and 2 are both false

B. 1 True. 2 False

C. 1 false. 2 True

D. 1 and 2 are correct

Answer: B

For the first statement: if the amount of data is small, it is easy to find a model in the hypothesis space that fits the training samples well, which easily causes overfitting; such a model does not have good generalization ability.

For the second statement: if the hypothesis space is small, it contains fewer candidate models, so it is unlikely to find one that fits the samples well; this tends to cause high bias and low variance, i.e., underfitting.
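Statement 1 can be illustrated with a toy sketch: a flexible model fit to very few points attains (near-)zero training error but a much larger test error:

```python
import numpy as np

rng = np.random.default_rng(3)
# Only 5 training points from a simple noisy linear relation (toy setup).
x_train = rng.uniform(-1, 1, size=5)
y_train = x_train + rng.normal(scale=0.2, size=5)
x_test = rng.uniform(-1, 1, size=100)
y_test = x_test + rng.normal(scale=0.2, size=100)

# A degree-4 polynomial passes through all 5 points: training error ~ 0.
coeffs = np.polyfit(x_train, y_train, deg=4)
train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
print(train_mse, test_mse)   # train ~0; test is typically much larger
```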

References:

https://www.analyticsvidhya.com/blog/2016/12/45-questions-to-test-a-data-scientist-on-regression-skill-test-regression-solution/