
26. What is a common loss function?

For a given input X, the model produces f(X) as a prediction of the true value Y. The prediction f(X) may or may not agree with Y (remember that some loss or error is usually inevitable), and a loss function measures the degree of this prediction error. The loss function, written L(Y, f(X)), quantifies how far the predicted value f(X) of your model is from the true value Y.
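For reference, the loss functions most often listed in answer to this question (following the standard statistical-learning formulation) are the 0-1 loss, the squared loss, the absolute loss, and the logarithmic (log-likelihood) loss:

```latex
\text{0-1 loss:}\quad L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}
\qquad
\text{squared loss:}\quad L(Y, f(X)) = (Y - f(X))^2
```
```latex
\text{absolute loss:}\quad L(Y, f(X)) = |Y - f(X)|
\qquad
\text{log loss:}\quad L(Y, P(Y \mid X)) = -\log P(Y \mid X)
```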

27. Why does XGBoost use a Taylor expansion, and where is the advantage?

XGBoost uses both the first and second derivatives of the loss, and the second-order information allows faster and more accurate descent. By taking a second-order Taylor expansion of the objective as a function of the predicted score, leaf splitting can be optimized from the gradient statistics of the input data alone, without committing to a specific form of the loss function. In essence, the choice of loss function is separated from the model/parameter optimization. This decoupling increases XGBoost's applicability, allowing the loss function to be selected on demand, so it can be used for classification as well as regression.
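As a sketch of the idea (following the form given in the XGBoost paper), the objective at round t is approximated by a second-order Taylor expansion around the previous round's prediction:

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big),\;\;
h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)
```

Only the first- and second-order gradient statistics g_i and h_i enter the split-gain computation, which is why any twice-differentiable loss can be plugged in.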

28. What is the difference between covariance and correlation?

Correlation is the standardized form of covariance. Covariances themselves are hard to compare. For example, if we calculate the covariance of salary ($) and age (years), the two variables are measured in different units, so we get covariances whose magnitudes cannot be compared with each other. Dividing the covariance by the product of the two standard deviations removes the units and gives the correlation, which always lies between -1 and 1.
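A minimal sketch with NumPy (the salary/age arrays are toy data made up for illustration): rescaling a variable changes the covariance but leaves the correlation unchanged.

```python
import numpy as np

salary = np.array([3000.0, 4500.0, 5200.0, 8000.0, 12000.0])  # toy data
age = np.array([22.0, 25.0, 28.0, 35.0, 45.0])                 # toy data

cov = np.cov(salary, age)[0, 1]          # depends on the units of both variables
corr = np.corrcoef(salary, age)[0, 1]    # unit-free, always in [-1, 1]

# Expressing salary in thousands scales the covariance by 1/1000,
# but the correlation coefficient stays exactly the same.
cov_k = np.cov(salary / 1000.0, age)[0, 1]
corr_k = np.corrcoef(salary / 1000.0, age)[0, 1]
print(cov, cov_k, corr, corr_k)
```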

29. How does XGBoost find the optimal split features? Is sampling done with or without replacement?

XGBoost computes a gain score for every candidate feature during training, and the feature with the maximum gain is selected as the basis for splitting; the importance of each feature is thus recorded during model training — the number of times a feature is involved in splits from root to leaf gives its importance ranking.
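A minimal sketch (assuming the xgboost Python package is available; the toy data and hyperparameter values are arbitrary) showing how the recorded importances can be read back after training, either by accumulated split gain or by split count:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                   # toy features
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # toy labels

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

# Importance by accumulated split gain vs. by number of splits ("weight")
booster = model.get_booster()
print(booster.get_score(importance_type="gain"))
print(booster.get_score(importance_type="weight"))
```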

30. What is the difference between discriminative and generative models?

Discriminative method: learn a decision function Y = f(X), or the conditional probability distribution P(Y | X), directly from the data and use it as the prediction model; this is the discriminative model.

Generative method: learn the joint probability distribution P(X, Y) from the data, then derive the conditional probability distribution P(Y | X) as the prediction model; this is the generative model.

The discriminant model can be obtained from the generative model, but the generative model cannot be obtained from the discriminant model.

Common discriminative models include k-nearest neighbors, SVM, decision trees, the perceptron, linear discriminant analysis (LDA), linear regression, traditional neural networks, logistic regression, boosting, and conditional random fields. Common generative models include naive Bayes, hidden Markov models, Gaussian mixture models, the document topic model latent Dirichlet allocation (also abbreviated LDA), and restricted Boltzmann machines.

31. The difference between linear and nonlinear classifiers, and their advantages and disadvantages

Linear versus nonlinear is defined with respect to the model parameters and the input features. For example, if the input is x and the model is y = ax + bx^2, the model is nonlinear in x; but if you treat x and x^2 as two input features, the same model is linear. Linear classifiers have good interpretability and low computational complexity; their disadvantage is a relatively weak fitting ability. Nonlinear classifiers have strong fitting ability; their disadvantages are that they overfit easily when data are insufficient, have high computational complexity, and are poorly interpretable. Common linear classifiers include LR, Bayesian classification, the single-layer perceptron, and linear regression. Common nonlinear classifiers include decision trees, random forest, GBDT, and the multi-layer perceptron; SVM can be either (linear kernel vs. Gaussian kernel).
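To make the x vs. x^2 remark concrete: with the expanded feature vector φ(x) = (x, x^2), the model below describes a curve in the original input yet is linear in its parameters, which is what "linear model" refers to here:

```latex
y = w_0 + w_1 x + w_2 x^2 = w_0 + \mathbf{w}^\top \phi(x), \qquad \phi(x) = (x,\; x^2)
```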

32. The difference between the L1 and L2 norms

The L1 norm is the sum of the absolute values of the elements of a vector; using it as a penalty is known as Lasso regularization. For example, for the vector A = (1, -1, 3), the L1 norm is |1| + |-1| + |3| = 5. The L2 norm is the sum of squares of the elements raised to the power 1/2 (i.e., the square root of the sum of squares); it is also known as the Euclidean norm (and, for matrices, the Frobenius norm). The Lp norm is the sum of the absolute values of the elements raised to the power p, all raised to the power 1/p.
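A quick NumPy check of these definitions (purely illustrative):

```python
import numpy as np

a = np.array([1.0, -1.0, 3.0])

l1 = np.linalg.norm(a, ord=1)             # |1| + |-1| + |3| = 5
l2 = np.linalg.norm(a, ord=2)             # sqrt(1 + 1 + 9) = sqrt(11)
p = 3
lp = np.sum(np.abs(a) ** p) ** (1.0 / p)  # general Lp norm
print(l1, l2, lp)
```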

33. What prior distributions do the L1 and L2 regularizers correspond to? This comes up in interviews: the L1 regularizer corresponds to a Laplacian prior on the parameters, and the L2 regularizer corresponds to a Gaussian prior.
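A sketch of why this holds: under maximum a posteriori estimation, the negative log of the prior becomes the penalty term, so a Gaussian prior yields an L2 (ridge) penalty and a Laplace prior yields an L1 (lasso) penalty:

```latex
p(w) \propto e^{-\frac{\|w\|_2^2}{2\sigma^2}} \;\Rightarrow\; -\log p(w) = \tfrac{1}{2\sigma^2}\|w\|_2^2 + \text{const}
\qquad
p(w) \propto e^{-\frac{\|w\|_1}{b}} \;\Rightarrow\; -\log p(w) = \tfrac{1}{b}\|w\|_1 + \text{const}
```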

34. Give a brief introduction to logistic regression.

Logistic regression is a classification model in machine learning. Because it is simple and efficient, it is widely used in practice. In practical work we may encounter problems such as: predicting whether a user will click on a specific product, predicting a user's gender, predicting whether a user will buy from a given category, or judging whether a comment is positive or negative. These can all be seen as classification problems — more precisely, as binary classification problems. To solve them, existing classification algorithms such as logistic regression or support vector machines are often used. They all belong to supervised learning, so a batch of labeled data must be collected as a training set before using these algorithms. Some labels can be retrieved from logs (user clicks, purchases), some from information the user fills in (gender), and some may need to be labeled manually (comment polarity).
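For completeness, the standard form of the binary logistic regression model (not spelled out in the answer above) maps a linear score through the sigmoid function and is fitted by maximizing the log-likelihood, i.e., minimizing the log loss:

```latex
P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}},
\qquad
\min_{w, b} \; -\sum_{i=1}^{N} \Big[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \Big]
```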

35. Talk about AdaBoost and its weight update formula. When the weak classifiers are Gm and the sample weights are w1, w2, …, please write down the final decision formula.

Given a training dataset T = {(x1, y1), (x2, y2), …, (xN, yN)}.
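Since the question explicitly asks for the formulas, here is the standard AdaBoost weight update and final decision rule for reference (the textbook formulation with labels y_i ∈ {-1, +1}, which the truncated answer above does not reach):

```latex
\text{Initial weights:}\quad D_1 = (w_{11}, \dots, w_{1N}), \quad w_{1i} = \tfrac{1}{N}
```
```latex
\text{For } m = 1, \dots, M:\quad
e_m = \sum_{i=1}^{N} w_{mi}\, I\big(G_m(x_i) \neq y_i\big), \qquad
\alpha_m = \tfrac{1}{2} \ln \frac{1 - e_m}{e_m}
```
```latex
w_{m+1,\,i} = \frac{w_{mi}}{Z_m} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big), \qquad
Z_m = \sum_{i=1}^{N} w_{mi} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big)
```
```latex
\text{Final decision:}\quad G(x) = \operatorname{sign}\Big( \sum_{m=1}^{M} \alpha_m G_m(x) \Big)
```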

36. Those of you who do a lot of Internet searching know that when you accidentally enter a non-existent word, the search engine will prompt you to enter the correct word. For example, if you type “Julw” into Google, the system will guess your intention: “July”


When a user types a word, it may be spelled correctly or incorrectly. If we write c for the correct spelling and w for the wrong one, the job of spell checking is to infer c from w when w is observed. In other words: given w, find the most likely c among several candidates.
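This is a Bayesian inference problem; in the usual formulation, the denominator P(w) is the same for every candidate c and can therefore be dropped:

```latex
\hat{c} = \arg\max_{c} P(c \mid w)
        = \arg\max_{c} \frac{P(w \mid c)\, P(c)}{P(w)}
        = \arg\max_{c} P(w \mid c)\, P(c)
```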

37. Why is Naive Bayes so “naive”?

Because it assumes that all features in the data set are equally important and mutually independent. As we all know, this assumption is far from true in the real world, so calling naive Bayes "naive" is fair. The naive Bayes model is naive precisely in the sense that it assumes the sample features are independent of each other given the class. This assumption rarely holds exactly in reality, but there are plenty of cases where feature correlations are small, so the model still works well.
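The independence assumption shows up directly in the model's factorization: the class-conditional joint likelihood is written as a product of per-feature likelihoods,

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```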

38. Please give a rough comparison of pLSA and LDA.

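As a rough, hedged summary of the usual comparison: pLSA treats the document-topic and topic-word distributions as fixed unknown parameters and estimates them by maximum likelihood (typically via EM), whereas LDA places Dirichlet priors on those distributions and performs Bayesian inference, which also lets it generalize more naturally to unseen documents:

```latex
\text{pLSA:}\quad p(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)
\qquad
\text{LDA:}\quad \theta_d \sim \mathrm{Dir}(\alpha),\;\; \varphi_k \sim \mathrm{Dir}(\beta),\;\;
z_{dn} \sim \mathrm{Mult}(\theta_d),\;\; w_{dn} \sim \mathrm{Mult}(\varphi_{z_{dn}})
```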

39. Please talk about the EM algorithm in more detail.

What exactly is the EM algorithm? Expectation-maximization (EM) is an algorithm for finding maximum-likelihood or maximum a posteriori estimates of parameters in probabilistic models that depend on unobserved latent (hidden) variables.
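In the usual notation (observed data X, latent variables Z, parameters θ), each iteration alternates an E-step and an M-step; this is the standard statement of the algorithm rather than anything specific to the text above:

```latex
\text{E-step:}\quad Q\big(\theta \mid \theta^{(t)}\big) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\big[ \log p(X, Z \mid \theta) \big]
\qquad
\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\, Q\big(\theta \mid \theta^{(t)}\big)
```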

40. How to select K in KNN?

About what KNN is, you can check this article: "From k-nearest Neighbor algorithm, distance measurement to KD tree, SIFT+BBF algorithm" (link: blog.csdn.net/v_july_v/ar… ). If a smaller value of K is chosen, predictions use training examples from a smaller neighborhood: the approximation error decreases, but the estimation error increases and the prediction becomes sensitive to noisy neighbors; a smaller K means a more complex model that overfits easily. If a larger value of K is chosen, it is equivalent to making predictions with training examples from a larger neighborhood. Its advantage is that it reduces the estimation error of learning, but its disadvantage is that the approximation error increases: training instances far from (dissimilar to) the input instance also influence the prediction and can make it wrong, and a larger K means a simpler overall model. When K = N the model is completely inadequate, because no matter what the input instance is, it simply predicts the class that is most frequent in the whole training set; the model is far too simple and ignores a large amount of useful information in the training instances. In practical applications, K is usually taken as a relatively small value, and the optimal K is selected by cross-validation (simply speaking, using part of the samples as the training set and part as the validation set).
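A minimal sketch of choosing K by cross-validation with scikit-learn (the data set and the candidate range of K are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a range of small K values and pick the one with the best CV accuracy.
param_grid = {"n_neighbors": list(range(1, 16))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```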

41. Methods to prevent over-fitting

Overfitting occurs because the learning capacity of the algorithm is too strong relative to the data; some assumptions (such as samples being independent and identically distributed) may not hold; or the training sample is too small to estimate the distribution of the whole space. Ways to deal with it (points 3 and 4 are illustrated in the sketch below):
1. Early stopping: stop training if model performance shows no significant improvement after several iterations.
2. Increase the original data, add random noise to it, or resample.
3. Regularization: regularization limits the complexity of the model.
4. Cross-validation.
5. Feature selection / feature dimensionality reduction.
6. Creating a validation set is the most basic way to prevent overfitting: the goal of the model we finally train is to perform well on the validation set, not just the training set.
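A sketch with scikit-learn on toy data (the candidate values of the regularization strength C are arbitrary): a smaller C means a stronger L2 penalty and a simpler model, and cross-validation is used to pick the value that generalizes best.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Smaller C = stronger L2 penalty = simpler model; pick C by cross-validated accuracy.
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"C={C}: mean CV accuracy = {score:.3f}")
```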

42. In machine learning, why do we often need to normalize the data?

Machine learning models are widely used in the Internet industry, for example for ranking (see: learning-to-rank practice, www.cnblogs.com/LBSer/p/443… ). In general, most of the time in a machine learning application is spent on feature processing, and a key step there is normalizing the feature data. Why normalize? Many students do not fully understand the explanation given by Wikipedia: 1) normalization speeds up the search for the optimal solution by gradient descent; 2) normalization may improve accuracy.
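A small sketch of point 1 in practice (scikit-learn, toy data; the exaggerated scale on one feature is my own choice for illustration): gradient-based learners such as SGD usually converge faster and more reliably when features are standardized first, so scaling is placed in the pipeline before the model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 1000.0  # put one feature on a wildly different scale

raw = SGDClassifier(random_state=0)
scaled = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

print(cross_val_score(raw, X, y, cv=5).mean())     # typically less stable
print(cross_val_score(scaled, X, y, cv=5).mean())  # scaling helps the gradient steps
```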

43. What is the least squares method?

We often say things like "generally speaking" or "on average". If, on average, non-smokers are in better health than smokers, the reason I use the word "average" is that there are exceptions to everything; there is always someone who smokes but, because of regular exercise, is healthier than their non-smoking friends. One of the simplest examples of least squares is the arithmetic mean. The least squares method (also called the method of least squares) is a mathematical optimization technique. It finds the best functional match to the data by minimizing the sum of squared errors, making it easy to estimate unknown parameters so that the sum of squared errors between the fitted values and the actual data is minimized.
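To make the "arithmetic mean is a least-squares estimate" remark concrete: minimizing the sum of squared deviations from a single constant c, and setting the derivative to zero, gives exactly the sample mean:

```latex
\min_{c} \sum_{i=1}^{n} (x_i - c)^2
\;\Longrightarrow\;
\frac{d}{dc} \sum_{i=1}^{n} (x_i - c)^2 = -2 \sum_{i=1}^{n} (x_i - c) = 0
\;\Longrightarrow\;
c = \frac{1}{n} \sum_{i=1}^{n} x_i
```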

44. Does gradient descent find the fastest direction of descent?

Gradient descent does not necessarily move in the globally fastest descending direction; it is only the fastest descending direction of the objective function on the tangent plane at the current point (of course, in higher dimensions this is not literally a plane). In practical implementations, the Newton direction (which takes the Hessian matrix into account) is generally considered the fastest descending direction and can achieve superlinear convergence, whereas gradient descent algorithms generally converge only linearly, or even sublinearly (for some problems with complex constraints). By Lin Xiaoxi (www.zhihu.com/question/30… )
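For comparison, the two update rules mentioned above in their standard textbook forms (step size η, gradient ∇f, Hessian H):

```latex
\text{Gradient descent:}\quad x_{k+1} = x_k - \eta\, \nabla f(x_k)
\qquad
\text{Newton's method:}\quad x_{k+1} = x_k - H(x_k)^{-1}\, \nabla f(x_k)
```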

45. A brief talk about Bayes' theorem

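For reference, the standard statement of Bayes' theorem, with the law of total probability expanding the denominator:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B) = \sum_{i} P(B \mid A_i)\, P(A_i)
```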

46. How should we understand that decision trees and XGBoost can handle missing values, while some models such as SVM are sensitive to missing values?

Source of this analysis: www.zhihu.com/question/58… First, two points to clear up the confusion: (1) the fact that a toolkit handles missing data automatically does not mean the underlying algorithm itself can handle missing items; (2) for missing data, models based on decision trees tend to cope better than models that rely on distance measures. The answer at the link also describes how tree models such as random forest and XGBoost handle missing values, and it ends with some suggestions for choosing a model when the data contain missing values.
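A small sketch of the practical side (assuming the xgboost package; toy data and hyperparameters are arbitrary): XGBoost accepts NaN entries directly and learns a default branch direction for missing values at each split, whereas a distance-based model would need the NaNs imputed first.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Knock out some entries: XGBoost treats np.nan as "missing"
# and learns which branch missing values should follow at each split.
mask = rng.random(X.shape) < 0.2
X[mask] = np.nan

model = xgb.XGBClassifier(n_estimators=30, max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))
```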

47. Please give some examples of standardization and normalization

In simple terms, standardization processes the data column by column in the feature matrix, converting the feature values of the samples to the same scale via the z-score method. The general formula is (x - mean) / std, where mean is the average value and std is the standard deviation. From the formula we can see that standardization subtracts, for each attribute (each column), the mean of that attribute and then divides by its standard deviation. Geometrically, this moves the origin of the coordinate axis to the mean and then rescales, so it involves both translation and scaling. The result is that, for each attribute (each column), the data are clustered around 0 with variance 1. Each attribute/column is processed separately.
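Concrete examples with scikit-learn (the small matrix is toy data): StandardScaler implements the z-score standardization described above, and MinMaxScaler is the usual min-max normalization to [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 600.0]])

X_std = StandardScaler().fit_transform(X)   # per column: (x - mean) / std
X_mm = MinMaxScaler().fit_transform(X)      # per column: (x - min) / (max - min)

print(X_std)
print(X_mm)
```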

48. How does random forest deal with missing values?

Yieshah: As we all know, there are many ways to deal with missing values in machine learning, but the question "how does random forest deal with missing values" indicates that the key point is how random forest in particular handles them, so let us first briefly introduce random forest. A random forest is composed of a number of decision trees. First, bootstrap data sets are built, i.e., samples are drawn at random with replacement from the original data to form new data sets (which will contain duplicate records). A decision tree is then built for each such data set, but not directly with all the features: at each step only a random subset of the features is considered. In this way multiple decision trees are constructed, forming a random forest. A sample is fed into every decision tree, the judgment of each tree is recorded, and the predictions of all the trees are aggregated (bagging) to obtain the final output. So how does a random forest deal with missing values? Because of the way a random forest is created and trained, its handling of missing values is quite special: a common scheme (due to Breiman) is to first fill in a rough value — the median for numerical features, the most frequent value for categorical ones — and then iteratively refine the filled values using the sample proximities computed from the trained forest.
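A minimal practical sketch (scikit-learn, toy data): depending on the scikit-learn version, the forest may not accept NaNs directly, so a simple stand-in for the rough-fill step is median imputation before training; this is only an approximation of the proximity-based refinement described above, not that scheme itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # introduce missing values

# Rough fill: replace each missing entry with the column median, then fit the forest.
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(X, y)
print(model.score(X, y))
```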

49. How does random forest assess the importance of features?

There are two ways to measure the importance of a variable, Decrease GINI and Decrease Accuracy:

  1. Decrease GINI:

For classification problems (assigning a sample to a certain category), i.e., discrete-variable problems, CART uses the Gini value as the criterion, defined as Gini = 1 - ∑ P(i)², where P(i) is the proportion of class-i samples in the data set at the current node. For example, with 2 classes and 100 samples at the current node, 70 belonging to the first class and 30 to the second, Gini = 1 - 0.7×0.7 - 0.3×0.3 = 0.42. It can be seen that the more even the class distribution, the larger the Gini value, and the more uneven the distribution, the smaller the Gini value. When searching for the best split feature and threshold, the evaluation criterion is argmax(Gini - Gini_left - Gini_right), i.e., find the feature f and threshold th such that the Gini value of the current node minus the Gini values of the left and right child nodes is largest.

For regression problems it is simpler: argmax(Var - Var_left - Var_right) is used directly as the evaluation criterion, i.e., the variance Var of the training samples at the current node minus the variance Var_left of the left child node and the variance Var_right of the right child node should be as large as possible.

  2. Decrease Accuracy:

For a tree Tb(x), we can use its OOB (out-of-bag) samples to obtain a test error, error1. Then the j-th column of the OOB samples is randomly permuted, with the other columns kept unchanged, giving error2. We can then use error2 - error1 to describe the importance of variable j. The basic idea is that if variable j is important enough, perturbing it will greatly increase the test error; conversely, if perturbing it does not increase the test error, the variable is not that important.
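A sketch of both measures with scikit-learn (toy data, arbitrary hyperparameters): feature_importances_ on a fitted random forest is the impurity-based (Gini) measure, and permutation_importance implements the accuracy-decrease idea, here computed on a held-out set rather than on OOB samples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

print("Gini importance:", forest.feature_importances_)

# Shuffle each feature column in turn and measure the drop in test accuracy.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation importance:", perm.importances_mean)
```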

50. How can K-means be optimized?

Analysis of K-means: under big-data conditions it consumes a lot of time and memory. Suggestions for optimizing K-means (see also the sketch below):
1. Reduce the number of clusters k, because every sample has to compute its distance to every cluster center.
2. Reduce the feature dimensionality of the samples, for example through dimensionality reduction such as PCA.
3. Investigate other clustering algorithms and compare their performance by testing them on toy data.
4. Use a Hadoop cluster: the K-means algorithm is easy to parallelize.
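One additional optimization that is common in practice (a scikit-learn sketch on toy data; MiniBatchKMeans is my own suggestion, not part of the list above): mini-batch K-means updates the centers from small random batches, which greatly reduces time and memory on large data sets, and PCA can be applied first to reduce the feature dimension (suggestion 2).

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X, _ = make_blobs(n_samples=100_000, n_features=50, centers=8, random_state=0)

# Reduce dimensionality first (suggestion 2), then cluster on mini-batches.
model = make_pipeline(PCA(n_components=10),
                      MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0))
labels = model.fit_predict(X)
print(labels[:10])
```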
