Organized according to the order of the course, so it is easy to remember and understand


Embedded method

The embedded method lets the algorithm itself decide which features to use; that is, feature selection and model training are carried out simultaneously.

When using the embedded method:

  • First, a machine learning algorithm or model is trained to obtain a weight coefficient for each feature, and features are selected from the largest coefficient to the smallest (these weight coefficients often represent the feature's contribution or importance to the model).
  • Based on this assessment of contribution, we can identify the features that are most useful for modeling.
  • Compared with filter methods, the results of the embedded method are tailored to the utility of the model itself, so it is more effective at improving model performance.
  • Because each feature's contribution to the model is considered, irrelevant features (those that correlation filtering would remove) and undifferentiated features (those that variance filtering would remove) are deleted for lack of contribution, so the embedded method can be seen as an evolved version of the filter methods.

  • Disadvantages
    • The statistics used by filter methods have ranges that can be determined with statistical knowledge and common sense (for example, a p-value should be below the significance level of 0.05), but the weight coefficients used by the embedded method have no such ready-made range. We can say that features with a weight coefficient of 0 contribute nothing to the model, but when a large number of features all contribute to the model and their contributions differ, it is hard to define a valid threshold.
      • In this case, the weight-coefficient threshold is our hyperparameter, and we may need a learning curve, or properties of the model itself, to determine its optimal value. Below we explore the embedded method with random forest and decision tree models.
    • The embedded method relies on an algorithm to select features, so its speed depends heavily on that algorithm. If the computation is heavy and slow, the embedded method itself will be very time-consuming and labor-intensive, and after selection we still have to evaluate the model ourselves.

feature_selection.SelectFromModel

class sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)

  • SelectFromModel is a meta-transformer that can be used with any estimator that, after fitting, has a coef_ or feature_importances_ attribute, or that accepts penalty terms among its parameters (for example, random forests and tree models have feature_importances_, logistic regression supports L1 and L2 penalties, and linear support vector machines also support the L2 penalty).
    • For models with feature_importances_, features whose importance is below the provided threshold parameter are considered unimportant and are removed. feature_importances_ values range from 0 to 1; if the threshold is small, say 0.001, only features that contribute nothing at all to label prediction are removed, while a threshold set very close to 1 may leave only one or two features.
  • Embedding with models that use penalty terms (a short sketch follows this list)
    • For a model with a penalty term, the larger the regularization penalty, the smaller the coefficients of the features in the model become. When the penalty grows large enough, some feature coefficients become 0, and as it keeps growing, all coefficients approach 0. The key observation is that some features' coefficients reach 0 sooner than others; those are the features we can screen out. In other words, we keep the features with large coefficients.
    • Support vector machines and logistic regression use the parameter C to control the sparsity of the returned feature matrix: the smaller C is, the fewer features are returned.
    • Lasso regression uses the alpha parameter to control the returned feature matrix: the larger alpha is, the fewer features are returned.
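To make the penalty-term idea concrete, here is a minimal sketch (my own illustration, not from the course): it uses sklearn's built-in breast cancer dataset as a stand-in, since the course data is prepared elsewhere, and an L1-penalized logistic regression whose C value is chosen only for demonstration.

# Minimal sketch of penalty-based embedded selection (illustration only).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# With an L1 penalty, some coefficients shrink to exactly 0; a smaller C means a
# stronger penalty, so fewer features survive the selection.
lr_l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)
X_l1 = SelectFromModel(lr_l1).fit_transform(X_bc, y_bc)
print(X_bc.shape, "->", X_l1.shape)  # fewer columns remain after selection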
Parameter descriptions:

  • estimator: any model estimator with a feature_importances_ or coef_ attribute, or with L1/L2 penalty terms, can be used here.
  • threshold: the feature-importance threshold; features whose importance falls below this threshold are removed.
  • prefit: defaults to False; indicates whether an already-fitted model is passed directly to the constructor. If True, transform must be called directly, fit and fit_transform cannot be used, and SelectFromModel cannot be used together with cross_val_score, GridSearchCV, and similar utilities that clone the estimator.
  • norm_order: a non-zero integer or positive/negative infinity; defaults to 1. The order of the norm used to filter out coefficients below the threshold when the estimator's coef_ attribute is more than one-dimensional.
  • max_features: the maximum number of features to select under the threshold setting. To disable the threshold and select on max_features alone, set threshold=-np.inf (illustrated in the sketch below).
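The prefit and max_features parameters are easy to misread, so here is a minimal sketch (my own illustration, not from the course) using sklearn's small built-in digits dataset as a stand-in for the course data:

# Minimal sketch of prefit and max_features (illustration only).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_d, y_d = load_digits(return_X_y=True)
rfc = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_d, y_d)

# prefit=True: pass an already-fitted estimator and call transform directly.
X_pre = SelectFromModel(rfc, threshold=0.005, prefit=True).transform(X_d)

# To select purely by max_features, disable the threshold with -np.inf.
X_top20 = SelectFromModel(rfc, threshold=-np.inf, max_features=20, prefit=True).transform(X_d)
print(X_pre.shape, X_top20.shape)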

In practice, the first two parameters are the ones that matter most. Here we use a random forest as an example and use a learning curve to help us find the best value of threshold.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as RFC

RFC_ = RFC(n_estimators=10, random_state=0)
# X, y are the feature matrix and labels prepared in the earlier sections
X_embedded = SelectFromModel(RFC_, threshold=0.005).fit_transform(X, y)
X_embedded

""" array([[ 0, 0, 0, ..., 253, 0, 0], [254, 254, 254, ..., 254, 255, 254], [ 9, 254, 254, ..., 0, 254, 254], ..., [ 0, 0, 0, ..., 0, 255, 255], [ 0, 0, 27, ..., 242, 0, 0], [ 0, 0, 0, ..., 0, 0, 0]], dtype=int64) """

X_embedded.shape  # (42000, 47)
  • Using a learning curve to choose the threshold
Let's draw a learning curve over threshold:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

RFC_.fit(X, y).feature_importances_  # inspect the feature importances

# Thresholds range from 0 up to the largest feature importance
threshold = np.linspace(0, (RFC_.fit(X, y).feature_importances_).max(), 20)
score = []
for i in threshold:
    X_embedded = SelectFromModel(RFC_, threshold=i).fit_transform(X, y)
    once = cross_val_score(RFC_, X_embedded, y, cv=5).mean()
    score.append(once)
plt.plot(threshold, score)
plt.show()

From the plot, as the threshold rises, more and more features are deleted, the information loss grows, and the model's performance becomes worse and worse.

  • Verify the model's performance after feature selection
X_embedded = SelectFromModel(RFC_, threshold=0.00067).fit_transform(X, y)
X_embedded.shape
cross_val_score(RFC_, X_embedded, y, cv=5).mean()
# 0.9391190476190475

The number of features immediately shrinks to around 324, smaller than the 392 columns kept by median-based variance filtering, and the cross-validation score of about 0.939 is higher than the 0.9388 obtained after variance filtering, because the embedded method is tuned to this model's own performance rather than to a generic statistic.

  • We can refine the learning curve
score2 = []
for i in np.linspace(0, 0.00134, 20):
    X_embedded = SelectFromModel(RFC_, threshold=i).fit_transform(X, y)
    once = cross_val_score(RFC_, X_embedded, y, cv=5).mean()
    score2.append(once)
plt.figure(figsize=[20, 5])
plt.plot(np.linspace(0, 0.00134, 20), score2)
plt.xticks(np.linspace(0, 0.00134, 20))
plt.show()

  • Fit with threshold at the best position found on the curve
X_embedded = SelectFromModel(RFC_, threshold=0.000071).fit_transform(X, y)
X_embedded.shape  # (42000, 340)

cross_val_score(RFC_, X_embedded, y, cv=5).mean()  # 0.9392857142857144

With the embedded method, we can easily achieve the goals of feature selection: reducing the computational load and improving model performance.

The embedded method may be a more efficient approach than filtering, which requires reasoning about many statistics. However, when the algorithm itself is computationally heavy and slow, filter methods are much faster than the embedded method, so on large datasets we still give priority to filtering.

Wrapper method

  • Similarities with the embedded method
    • Feature selection and algorithm training are carried out simultaneously
    • It relies on the algorithm's own selection mechanism, such as the coef_ attribute or the feature_importances_ attribute, to complete feature selection.
  • Differences from the embedded method
    • We often use an objective function as a black box to help us select features, rather than entering a threshold for some metric or statistic ourselves.
    • Unlike filter and embedded methods, which solve the whole problem in a single training run, the wrapper method trains repeatedly on feature subsets, so it has the highest computational cost.
  • The wrapper method trains the estimator on the initial feature set and obtains the importance of each feature through the coef_ attribute or the feature_importances_ attribute. The least important features are then trimmed from the current feature set. This process is repeated recursively on the pruned set until the desired number of selected features is reached.

Note that the "algorithm" here does not refer to the classification or regression algorithm we ultimately use to fit the data (i.e., not the random forest), but to a dedicated data mining algorithm, our objective function, whose core job is to select the optimal feature subset. The most typical objective function is recursive feature elimination (RFE). It is a greedy optimization algorithm that aims to find the best-performing subset of features: it builds the model repeatedly, keeping the best features or eliminating the worst at each iteration, then in the next iteration builds the next model using the features that were not selected in the previous round, until all features are exhausted. It then ranks the features according to the order in which they were kept or removed and finally picks the best subset.

Of all feature selection methods, the wrapper method is the most effective at improving model performance, and it can achieve excellent results with very few features. With the same number of features, the wrapper and embedded methods are comparable, but the wrapper is slower to compute, so it is not suitable for very large datasets. In return, the wrapper method is the feature selection method that best guarantees the model's performance.

feature_selection.RFE

class sklearn.feature_selection.RFE(estimator, n_features_to_select=None, step=1, verbose=0)

  • Parameters:
    • estimator: the instantiated estimator to be used
    • n_features_to_select: the number of features you want to select
    • step: the number of features to remove at each iteration
  • Attributes
    • .support_: returns a boolean mask indicating, for each feature, whether it was finally selected
    • .ranking_: returns each feature's rank according to its importance over the iterations (selected features have rank 1)
  • Characteristics
    • It is a greedy optimization algorithm that aims to find the best-performing subset of features. It builds the model repeatedly, keeping the best features or eliminating the worst at each iteration, then in the next iteration builds the next model using the features that were not selected in the previous round, until all features are exhausted.
  • Advantages
    • Of all feature selection methods, the wrapper method is the most effective at improving model performance, and it can achieve excellent results with very few features
    • The wrapper method is the feature selection method that best guarantees the model's performance
  • Disadvantages
    • With the same number of features, the wrapper and embedded methods are comparable, but the wrapper is slower to compute, so it is not suitable for very large datasets

feature_selection.RFECV

RFECV runs RFE inside a cross-validation loop to find the optimal number of features. It adds the cv parameter and is otherwise used in exactly the same way as RFE.
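Here is a minimal sketch of RFECV (my own illustration, not from the course), using sklearn's small built-in digits dataset as a stand-in for the course data; the step and cv values are chosen only for demonstration:

# Minimal sketch of RFECV (illustration only).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X_d, y_d = load_digits(return_X_y=True)
rfc = RandomForestClassifier(n_estimators=10, random_state=0)

# RFECV removes `step` features per iteration and uses cross-validation
# to decide how many features to keep.
selector = RFECV(rfc, step=5, cv=5).fit(X_d, y_d)
print(selector.n_features_)        # number of features chosen by cross-validation
X_rfecv = selector.transform(X_d)  # reduced feature matrix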

# Recursive feature elimination: feature_selection.RFE
from sklearn.feature_selection import RFE
RFC_ = RFC(n_estimators=10, random_state=0)
selector = RFE(RFC_, n_features_to_select=340, step=50).fit(X, y)
selector.support_

""" array([False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False]) """
  • Inspecting the attributes
selector.support_.sum()  # 340

selector.ranking_

""" array([10, 9, 8, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 6, 6, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 5, 4, 4, 5, 3, 4, 4, 4, 5, 4, 5, 7, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 6, 7, 4, 3, 1, 2, 3, 3, 1, 1, 1, 1]) "" "
  • Verify the results of feature selection
X_wrapper = selector.transform(X)
cross_val_score(RFC_, X_wrapper, y, cv=5).mean()  # 0.9379761904761905

# Plot the learning curve
score = []
for i in range(1, 751, 50):
    X_wrapper = RFE(RFC_, n_features_to_select=i, step=50).fit_transform(X, y)
    once = cross_val_score(RFC_, X_wrapper, y, cv=5).mean()
    score.append(once)
plt.figure(figsize=[20, 5])
plt.plot(range(1, 751, 50), score)
plt.xticks(range(1, 751, 50))
plt.show()

From the curve, with only 50 features under the wrapper method the model's performance already exceeds 90%, making it far more efficient per feature than the embedded and filter methods.

Summary of feature selection

  • Filter methods are faster, but coarser. Wrapper and embedded methods are more precise and better suited to tailoring the selection to a specific algorithm, but they are computationally heavier and take longer to run.
  • With a large amount of data, variance filtering and the mutual information method are preferred for a first pass, followed by the other feature selection methods.
  • When using logistic regression, the embedded method is preferred; when using support vector machines, the wrapper method is preferred.