Previous related articles:

  1. Linear regression for machine learning
  2. Machine learning logistic regression and Python implementation
  3. Machine learning project actual trading data anomaly detection
  4. Decision Tree for Machine Learning
  5. Python implementation of Decision Tree for machine learning
  6. PCA for Machine Learning
  7. Feature engineering for machine learning

When it comes to feature engineering, a saying circulates widely in the industry: data and features determine the upper limit of machine learning, and models and algorithms merely approach that upper limit. This alone shows how important feature engineering is.

I. Interpretation and significance of feature engineering

So what is feature engineering? Let's start with features. A feature is information extracted from the raw data that is useful for predicting the result, i.e., a relevant attribute of the data. Feature engineering is the process of using domain knowledge and skills to process the data so that the features work better in machine learning algorithms.

Why it matters:

  1. Better features mean more flexibility.
  2. Better features mean good results even with simple models.
  3. Better features mean better results can be trained.

II. The specific process of feature engineering

The whole process can be summarized in the following steps: feature usage scheme, feature acquisition scheme, feature processing, and feature monitoring.

1. Feature usage scheme

Once we have defined our goal, the first thing to do is analyze, based on the business scenario, what data we need to achieve it; that is, based on business understanding, to list as many independent variables as possible that may affect the dependent variable. For example, if I want to predict a user's order status, or recommend a product to a user, what information do I need to collect? It can be gathered along three directions:

  1. Store: the store's category, rating, shipping speed, etc.
  2. Product: the product's category, rating, number of buyers, color, etc.
  3. User: historical purchase information, spending power, shopping-cart conversion rate, time spent on the product page, age, location, etc.

Then we need to assess the availability of the data we want:

  1. Difficulty of acquisition: can we actually collect it? User age, for example, is hard to obtain, since not everyone fills it in when registering.
  2. Coverage: some data is not available for every object; new users, for instance, have no historical purchase information.
  3. Accuracy: fields such as user age or store rating also have accuracy problems, because a store may fake orders and users may not enter their real age.
  4. Can it be fetched quickly enough when computing online in real time?

2. Feature acquisition scheme

After deciding which features we need, the next step is to consider how to obtain and store them. This is mainly divided into offline feature acquisition and online feature acquisition.

1. Offline feature acquisition: massive data can be processed offline using distributed file storage platforms such as HDFS and processing tools such as MapReduce and Spark.
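
As a rough illustration, here is a minimal PySpark sketch of such an offline feature job; the HDFS paths, table layout and column names are all hypothetical:

# A minimal offline-feature sketch; the paths and columns below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("offline-user-features").getOrCreate()

orders = spark.read.parquet("hdfs:///warehouse/orders")        # one row per order: user_id, amount, ...
user_features = (orders.groupBy("user_id")
                       .agg(F.count("*").alias("order_cnt"),             # historical order count
                            F.avg("amount").alias("avg_order_amount")))  # a simple spending-power proxy
user_features.write.mode("overwrite").parquet("hdfs:///features/user_offline")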

2. Online feature acquisition: online features put more emphasis on the latency of data retrieval. Since this is an online service, the data must be fetched in a very short time, which places high demands on lookup performance. The data can be stored in indexes or key-value (KV) stores. Lookup performance, however, conflicts with data volume, so a compromise is needed; we use a hierarchical feature acquisition scheme.
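
To make the hierarchical idea concrete, here is a minimal sketch of a two-level lookup (an in-process cache in front of a KV store); the Redis host, port and key format are assumptions:

# A minimal two-level feature lookup sketch; host, port and key naming are assumptions.
import json
import redis  # the redis-py client

_local_cache = {}                                  # level 1: in-process cache, lowest latency
_kv = redis.Redis(host="localhost", port=6379)     # level 2: shared KV store

def get_user_features(user_id):
    key = f"feat:user:{user_id}"
    if key in _local_cache:                        # serve from the in-memory layer when possible
        return _local_cache[key]
    raw = _kv.get(key)                             # otherwise fall back to the KV store
    feats = json.loads(raw) if raw else {}         # empty dict if the features are missing
    _local_cache[key] = feats
    return feats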

3. Feature processing

Then comes the most important part of feature engineering, and the part we spend most of our time on: feature processing.

3.1 Feature Cleaning

The first step of feature processing is feature cleaning, which mainly involves two tasks.

3.1.1 Cleaning Dirty Data (Abnormal Data)

How do we detect abnormal data? There are mainly the following approaches.

1. Statistics-based outlier detection
(1) Simple statistical analysis: run descriptive statistics on an attribute and check which values are unreasonable. For age, for example, we can specify a valid range [0, 100]; samples outside this range are treated as abnormal.
(2) 3σ rule: when the data follow a normal distribution, the probability of lying more than 3σ from the mean is P(|x - μ| > 3σ) <= 0.003, a small-probability event, so by default we assume such samples barely occur. A sample whose distance from the mean exceeds 3σ is therefore treated as an outlier.
(3) Outliers can also be detected using the range and the interquartile range (IQR).
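
Below is a minimal sketch of these statistical checks on toy age data; the injected anomalies are made up:

import numpy as np

rng = np.random.default_rng(0)
ages = np.concatenate([rng.normal(30, 8, 500), [300, -5]])   # two injected anomalies

# (1) range check: ages must fall within [0, 100]
range_outliers = ages[(ages < 0) | (ages > 100)]

# (2) 3-sigma rule: |x - mean| > 3 * std
mu, sigma = ages.mean(), ages.std()
sigma_outliers = ages[np.abs(ages - mu) > 3 * sigma]

# (3) IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
iqr_outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]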

2. Distance-based outlier detection: the idea behind the k-nearest-neighbour approach. A point whose distance to most other points exceeds a certain threshold is treated as an outlier; common distance measures are absolute (Manhattan) distance, Mahalanobis distance and Euclidean distance.

3. Density-based outlier detection: examine the density around the current point to find local outliers (for example, the Local Outlier Factor, LOF).
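
For the distance- and density-based ideas, a minimal sketch using scikit-learn's LocalOutlierFactor (a density-based detector built on k-nearest-neighbour distances) could look like this:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # a dense normal cluster
               [[8, 8], [9, -7]]])              # two obvious outliers appended

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                     # -1 marks a detected outlier, 1 marks normal
outliers = X[labels == -1]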

3.1.2 Missing value processing

For a given feature:

  1. If the value is missing in too many samples, the feature can simply be dropped.
  2. If only a few values are missing, they can be filled with the global mean or median.
  3. The feature can also be taken as a target, and the missing values predicted with a model trained on the samples where it is present, as sketched below.
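
A minimal sketch of the three strategies on a toy DataFrame; the column names are made up:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan, 28],
                   "income": [3.2, 4.1, 5.0, 7.5, 2.8, 3.9]})

# 1. drop the feature (or the rows) when too much is missing
df_dropped = df.drop(columns=["age"])

# 2. fill with the global mean / median when only a few values are missing
df_filled = df.fillna({"age": df["age"].median()})

# 3. predict the missing values from the rows where the feature is present
known, unknown = df[df["age"].notna()], df[df["age"].isna()]
model = LinearRegression().fit(known[["income"]], known["age"])
df.loc[df["age"].isna(), "age"] = model.predict(unknown[["income"]])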

3.1.3 Data sampling

Data sampling mainly deals with sample imbalance. In some cases the numbers of positive and negative samples in the collected data differ greatly, while most models are sensitive to the positive/negative ratio (logistic regression, for example), so sampling is needed to balance the positive and negative samples.

There are two main cases when dealing with sample imbalance:

  1. The positive/negative gap is large, and both classes have plenty of samples. Here down-sampling can be used: under-sample the class with many samples (the majority class) in the training set, discarding some samples to relieve the class imbalance.

  2. The positive/negative gap is large, and the overall number of samples is small. Here up-sampling can be used: over-sample the class with few samples (the minority class) in the training set, synthesizing new samples to relieve the class imbalance. The classic SMOTE algorithm, explained and used in the earlier article (machine learning project: SMOTE for transaction-data anomaly detection), belongs here and is not repeated in this section.
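
A minimal sketch of naive down-sampling and up-sampling with sklearn.utils.resample; the SMOTE approach from the earlier article could replace the naive up-sampling step:

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(12),
                   "label": [0] * 10 + [1] * 2})        # 10 negatives vs 2 positives
major, minor = df[df.label == 0], df[df.label == 1]

# down-sampling: randomly discard majority-class samples
major_down = resample(major, replace=False, n_samples=len(minor), random_state=0)
balanced_down = pd.concat([major_down, minor])

# up-sampling: randomly repeat minority-class samples
minor_up = resample(minor, replace=True, n_samples=len(major), random_state=0)
balanced_up = pd.concat([major, minor_up])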

3.2 Feature preprocessing

For different types of features, there are different processing methods, which are summarized below

3.2.1 Numerical features (continuous data)

Numerical features generally need the following kinds of treatment.

1.1 Statistics.

We need to look at the maximum, minimum, mean and variance of the feature in order to analyze the data better. The iris dataset from sklearn is used below as an example.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()  # load the dataset
# iris.data[:5]  # show the first 5 rows of the dataset
series = pd.Series(iris.data[:, 0])
series.describe()  # describe() returns the count, mean, std, quantiles, min and max of the feature
count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
dtype: float64

1.2 Dimensionless processing. The commonly used methods are standardization and interval scaling (min-max scaling).

(1) Standardization: compute the mean and standard deviation of the feature, then express each value as the number of standard deviations it lies from the mean. The formula is:

x' = (x - μ) / σ
(2) Interval scaling: there are several variants; a common one scales using the minimum and maximum values. The formula is:

x' = (x - min) / (max - min)
(3) The difference between standardization and normalization: for a feature matrix of m samples and n features, a row is one sample with n features and a column is one feature across m samples. Standardization and interval scaling, described above, operate on the columns of the feature matrix, bringing the feature values onto the same scale, whereas normalization operates on the rows: each sample, viewed as a vector, is rescaled (for example to unit norm).
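
To make the row/column distinction concrete, here is a minimal sketch on the iris features loaded above: sklearn's Normalizer rescales each row (sample), while the column-wise scalers appear in the next snippet.

from sklearn.preprocessing import Normalizer

# row-wise: each sample (row) is rescaled to unit L2 norm
row_normed = Normalizer(norm="l2").fit_transform(iris.data)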

Both methods (standardization and interval scaling) are already provided by sklearn:

from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(iris.data)    # standardization

from sklearn.preprocessing import MinMaxScaler
MinMaxScaler().fit_transform(iris.data)      # interval (min-max) scaling
array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.30555556, 0.70833333, 0.08474576, 0.04166667],
       [0.13888889, 0.58333333, 0.10169492, 0.04166667],
       [0.13888889, 0.41666667, 0.06779661, 0.        ],
       [0.        , 0.41666667, 0.01694915, 0.        ],
       [0.41666667, 0.83333333, 0.03389831, 0.04166667],
       [0.38888889, 1.        , 0.08474576, 0.125     ],
       [0.30555556, 0.79166667, 0.05084746, 0.125     ],
       [0.22222222, 0.625     , 0.06779661, 0.08333333],
       [0.38888889, 0.75      , 0.11864407, 0.08333333],
       [0.22222222, 0.75      , 0.08474576, 0.08333333],
       [0.30555556, 0.58333333, 0.11864407, 0.04166667],
       [0.22222222, 0.70833333, 0.08474576, 0.125     ],
       [0.08333333, 0.66666667, 0.        , 0.04166667],
       [0.22222222, 0.54166667, 0.11864407, 0.16666667],
       [0.13888889, 0.58333333, 0.15254237, 0.04166667],
       [0.19444444, 0.41666667, 0.10169492, 0.04166667],
       [0.19444444, 0.58333333, 0.10169492, 0.125     ],
       [0.25      , 0.625     , 0.08474576, 0.04166667],
       [0.25      , 0.58333333, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.10169492, 0.04166667],
       [0.13888889, 0.45833333, 0.10169492, 0.04166667],
       [0.30555556, 0.58333333, 0.08474576, 0.125     ],
       [0.25      , 0.875     , 0.08474576, 0.        ],
       [0.33333333, 0.91666667, 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.19444444, 0.5       , 0.03389831, 0.04166667],
       [0.33333333, 0.625     , 0.05084746, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.02777778, 0.41666667, 0.05084746, 0.04166667],
       [0.22222222, 0.58333333, 0.08474576, 0.04166667],
       [0.19444444, 0.625     , 0.05084746, 0.08333333],
       [0.05555556, 0.125     , 0.05084746, 0.08333333],
       [0.02777778, 0.5       , 0.05084746, 0.04166667],
       [0.19444444, 0.625     , 0.10169492, 0.20833333],
       [0.22222222, 0.75      , 0.15254237, 0.125     ],
       [0.13888889, 0.41666667, 0.06779661, 0.08333333],
       [0.22222222, 0.75      , 0.10169492, 0.04166667],
       [0.08333333, 0.5       , 0.06779661, 0.04166667],
       [0.27777778, 0.70833333, 0.08474576, 0.04166667],
       [0.19444444, 0.54166667, 0.06779661, 0.04166667],
       [0.75      , 0.5       , 0.62711864, 0.54166667],
       [0.58333333, 0.5       , 0.59322034, 0.58333333],
       [0.72222222, 0.45833333, 0.66101695, 0.58333333],
       [0.33333333, 0.125     , 0.50847458, 0.5       ],
       [0.61111111, 0.33333333, 0.61016949, 0.58333333],
       [0.38888889, 0.33333333, 0.59322034, 0.5       ],
       [0.55555556, 0.54166667, 0.62711864, 0.625     ],
       [0.16666667, 0.16666667, 0.38983051, 0.375     ],
       [0.63888889, 0.375     , 0.61016949, 0.5       ],
       [0.25      , 0.29166667, 0.49152542, 0.54166667],
       [0.19444444, 0.        , 0.42372881, 0.375     ],
       [0.44444444, 0.41666667, 0.54237288, 0.58333333],
       [0.47222222, 0.08333333, 0.50847458, 0.375     ],
       [0.5       , 0.375     , 0.62711864, 0.54166667],
       [0.36111111, 0.375     , 0.44067797, 0.5       ],
       [0.66666667, 0.45833333, 0.57627119, 0.54166667],
       [0.36111111, 0.41666667, 0.59322034, 0.58333333],
       [0.41666667, 0.29166667, 0.52542373, 0.375     ],
       [0.52777778, 0.08333333, 0.59322034, 0.58333333],
       [0.36111111, 0.20833333, 0.49152542, 0.41666667],
       [0.44444444, 0.5       , 0.6440678 , 0.70833333],
       [0.5       , 0.33333333, 0.50847458, 0.5       ],
       [0.55555556, 0.20833333, 0.66101695, 0.58333333],
       [0.5       , 0.33333333, 0.62711864, 0.45833333],
       [0.58333333, 0.375     , 0.55932203, 0.5       ],
       [0.63888889, 0.41666667, 0.57627119, 0.54166667],
       [0.69444444, 0.33333333, 0.6440678 , 0.54166667],
       [0.66666667, 0.41666667, 0.6779661 , 0.66666667],
       [0.47222222, 0.375     , 0.59322034, 0.58333333],
       [0.38888889, 0.25      , 0.42372881, 0.375     ],
       [0.33333333, 0.16666667, 0.47457627, 0.41666667],
       [0.33333333, 0.16666667, 0.45762712, 0.375     ],
       [0.41666667, 0.29166667, 0.49152542, 0.45833333],
       [0.47222222, 0.29166667, 0.69491525, 0.625     ],
       [0.30555556, 0.41666667, 0.59322034, 0.58333333],
       [0.47222222, 0.58333333, 0.59322034, 0.625     ],
       [0.66666667, 0.45833333, 0.62711864, 0.58333333],
       [0.55555556, 0.125     , 0.57627119, 0.5       ],
       [0.36111111, 0.41666667, 0.52542373, 0.5       ],
       [0.33333333, 0.20833333, 0.50847458, 0.5       ],
       [0.33333333, 0.25      , 0.57627119, 0.45833333],
       [0.5       , 0.41666667, 0.61016949, 0.54166667],
       [0.41666667, 0.25      , 0.50847458, 0.45833333],
       [0.19444444, 0.125     , 0.38983051, 0.375     ],
       [0.36111111, 0.29166667, 0.54237288, 0.5       ],
       [0.38888889, 0.41666667, 0.54237288, 0.45833333],
       [0.38888889, 0.375     , 0.54237288, 0.5       ],
       [0.52777778, 0.375     , 0.55932203, 0.5       ],
       [0.22222222, 0.20833333, 0.33898305, 0.41666667],
       [0.38888889, 0.33333333, 0.52542373, 0.5       ],
       [0.55555556, 0.54166667, 0.84745763, 1.        ],
       [0.41666667, 0.29166667, 0.69491525, 0.75      ],
       [0.77777778, 0.41666667, 0.83050847, 0.83333333],
       [0.55555556, 0.375     , 0.77966102, 0.70833333],
       [0.61111111, 0.41666667, 0.81355932, 0.875     ],
       [0.91666667, 0.41666667, 0.94915254, 0.83333333],
       [0.16666667, 0.20833333, 0.59322034, 0.66666667],
       [0.83333333, 0.375     , 0.89830508, 0.70833333],
       [0.66666667, 0.20833333, 0.81355932, 0.70833333],
       [0.80555556, 0.66666667, 0.86440678, 1.        ],
       [0.61111111, 0.5       , 0.69491525, 0.79166667],
       [0.58333333, 0.29166667, 0.72881356, 0.75      ],
       [0.69444444, 0.41666667, 0.76271186, 0.83333333],
       [0.38888889, 0.20833333, 0.6779661 , 0.79166667],
       [0.41666667, 0.33333333, 0.69491525, 0.95833333],
       [0.58333333, 0.5       , 0.72881356, 0.91666667],
       [0.61111111, 0.41666667, 0.76271186, 0.70833333],
       [0.94444444, 0.75      , 0.96610169, 0.875     ],
       [0.94444444, 0.25      , 1.        , 0.91666667],
       [0.47222222, 0.08333333, 0.6779661 , 0.58333333],
       [0.72222222, 0.5       , 0.79661017, 0.91666667],
       [0.36111111, 0.33333333, 0.66101695, 0.79166667],
       [0.94444444, 0.33333333, 0.96610169, 0.79166667],
       [0.55555556, 0.29166667, 0.66101695, 0.70833333],
       [0.66666667, 0.54166667, 0.79661017, 0.83333333],
       [0.80555556, 0.5       , 0.84745763, 0.70833333],
       [0.52777778, 0.33333333, 0.6440678 , 0.70833333],
       [0.5       , 0.41666667, 0.66101695, 0.70833333],
       [0.58333333, 0.33333333, 0.77966102, 0.83333333],
       [0.80555556, 0.41666667, 0.81355932, 0.625     ],
       [0.86111111, 0.33333333, 0.86440678, 0.75      ],
       [1.        , 0.75      , 0.91525424, 0.79166667],
       [0.58333333, 0.33333333, 0.77966102, 0.875     ],
       [0.55555556, 0.33333333, 0.69491525, 0.58333333],
       [0.5       , 0.25      , 0.77966102, 0.54166667],
       [0.94444444, 0.41666667, 0.86440678, 0.91666667],
       [0.55555556, 0.58333333, 0.77966102, 0.95833333],
       [0.58333333, 0.45833333, 0.76271186, 0.70833333],
       [0.47222222, 0.41666667, 0.6440678 , 0.70833333],
       [0.72222222, 0.45833333, 0.74576271, 0.83333333],
       [0.66666667, 0.45833333, 0.77966102, 0.95833333],
       [0.72222222, 0.45833333, 0.69491525, 0.91666667],
       [0.41666667, 0.29166667, 0.69491525, 0.75      ],
       [0.69444444, 0.5       , 0.83050847, 0.91666667],
       [0.66666667, 0.54166667, 0.79661017, 1.        ],
       [0.66666667, 0.41666667, 0.71186441, 0.91666667],
       [0.55555556, 0.20833333, 0.6779661 , 0.75      ],
       [0.61111111, 0.41666667, 0.71186441, 0.79166667],
       [0.52777778, 0.58333333, 0.74576271, 0.91666667],
       [0.44444444, 0.41666667, 0.69491525, 0.70833333]])

1.3 Discretization

Discretization is a very important treatment for numerical features: it converts continuous numerical data into categorical data. Because the value space of continuous data can be infinite, continuous features are discretized to make them easier to represent and process in a model. In industry, continuous values are rarely fed directly to a logistic regression model; instead, a continuous feature is discretized into a series of 0/1 features, which has the following advantages:

  1. Sparse-vector inner products are fast to compute, and the results are easy to store and scale.

  2. Discretized features are robust to abnormal data: if a feature is defined as "age > 30 is 1, otherwise 0", an abnormal value such as an age of 300 causes no great disturbance to the model, whereas it would if the feature were not discretized.

  3. Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N variables, each has its own weight, which effectively introduces non-linearity into the model, improves its expressive power and makes fitting easier.

  4. After discretization, feature crossing can be performed, going from M+N variables to M*N variables, which further introduces non-linearity and improves expressive power (a minimal crossing sketch is given after this list).

  5. After discretization, the model is more stable. If user age is bucketed into ranges such as 20-30, for example, a user does not become a completely different sample just by turning one year older. Of course, samples right next to an interval boundary behave the opposite way, so how to split the intervals is an art in itself.
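
A minimal sketch of crossing two discretized features; the bins and column names are made up:

import pandas as pd

df = pd.DataFrame({"age": [18, 25, 34, 52], "income": [2.0, 4.5, 8.0, 3.0]})
df["age_bin"] = pd.cut(df["age"], bins=[0, 20, 30, 60], labels=["young", "mid", "senior"])
df["income_bin"] = pd.cut(df["income"], bins=[0, 3, 6, 10], labels=["low", "mid", "high"])

# crossing the two binned features: M + N one-hot columns become up to M * N columns
df["age_x_income"] = df["age_bin"].astype(str) + "_" + df["income_bin"].astype(str)
crossed = pd.get_dummies(df["age_x_income"])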

Common discretization methods include equal-width partitioning and equal-frequency partitioning. (1) Equal-width partitioning divides the feature evenly by its range, and each interval is treated equally. For example, a feature ranging over [0, 10] can be split into 10 segments: [0,1), [1,2), ..., [9,10).

(2) Equal-frequency partitioning divides according to the total number of samples, so that every interval contains the same number of samples. Suppose a distance feature ranges over [0, 3000000] and needs to be split into 10 segments; with an equal-width split, most samples would end up in the first segment. An equal-frequency split might instead produce [0,100), [100,300), [300,500), ..., [10000, 3000000]: the earlier intervals are dense and the later ones sparse.

ages = np.array([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32])  # some age data
# pandas' cut method splits the data into equal-width bins
# factory = pd.cut(ages, 4)
factory = pd.cut(ages, 4, labels=['Youth', 'YoungAdult', 'MiddleAged', 'Senior'])  # labels lets you name each bin
# factory = pd.cut(ages, bins=[18, 25, 35, 60, 100], labels=['a', 'b', 'c', 'd'])  # bins specifies the bin edges explicitly
# factory.dtype  # CategoricalDtype -- cut returns a Categorical object
test = np.array(factory)  # get the binned data
test
# factory.codes  # array([0, 0, 0, 0, 0, 0, 1, 1, 3, 2, 2, 1], dtype=int8)
array(['Youth', 'Youth', 'Youth', 'Youth', 'Youth', 'Youth', 'YoungAdult',
       'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult'],
      dtype=object)
# Now look at equal-frequency partitioning
# pandas' qcut method splits the data into equal-frequency bins
factory = pd.qcut(ages, 4)
# factory
factory.value_counts()   # with equal-frequency partitioning, each bin contains the same number of samples
(19.999, 22.75]    3
(22.75, 29.0]      3
(29.0, 38.0]       3
(38.0, 61.0]       3
dtype: int64

3.2.2 Categorical data

One-hot encoding

For categorical data, the most important treatment is one-hot encoding. Here is a concrete example.

# Create some simple raw data
testdata = pd.DataFrame({'age': [4, 6, 3, 3], 'pet': ['cat', 'dog', 'dog', 'fish']})
testdata
age pet
0 4 cat
1 6 dog
2 3 dog
3 3 fish
# Method 1: the get_dummies method provided by pandas
pd.get_dummies(testdata, columns=['pet'])  # first argument: the original data; columns: the feature(s) to encode, possibly several; returns a new DataFrame
age pet_cat pet_dog pet_fish
0 4 1 0 0
1 6 0 1 0
2 3 0 1 0
3 3 0 0 1
testdata.pet.values.reshape(-1, 1)
array([['cat'],
       ['dog'],
       ['dog'],
       ['fish']], dtype=object)
# Method 2: sklearn's OneHotEncoder

from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(testdata.age.values.reshape(-1, 1)).toarray()
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])
# In older sklearn versions OneHotEncoder could not handle string values directly; they first had to be converted
# OneHotEncoder().fit_transform(testdata.pet.values.reshape(-1, 1)).toarray()  # raises an error there
from sklearn.preprocessing import LabelEncoder
petvalue = LabelEncoder().fit_transform(testdata.pet)
print(petvalue)  # [0 1 1 2]  the string categories are converted to integer categories
OneHotEncoder().fit_transform(petvalue.reshape(-1, 1)).toarray()  # the result matches the get_dummies result above
[0 1 1 2]





array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

3.2.3 Time-based data

Temporal data can be converted into either continuous values or discrete values.

Continuous values

Examples: duration (how long a page was viewed), interval (time since the last purchase or click).
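
A minimal sketch of turning raw timestamps into such continuous features; the events and column names are made up:

import pandas as pd

events = pd.DataFrame({
    "enter_time": pd.to_datetime(["2021-01-01 10:00:00", "2021-01-01 11:00:00"]),
    "leave_time": pd.to_datetime(["2021-01-01 10:05:30", "2021-01-01 11:20:00"]),
    "last_purchase": pd.to_datetime(["2020-12-25", "2020-11-01"]),
})

# duration: how long the page was viewed, in seconds
events["view_seconds"] = (events["leave_time"] - events["enter_time"]).dt.total_seconds()

# interval: days since the last purchase, relative to a reference time
now = pd.Timestamp("2021-01-01")
events["days_since_purchase"] = (now - events["last_purchase"]).dt.days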

Discrete values

Hour of the day (hour_0-23), day of the week (week_Monday, ...), week of the year, weekday/weekend, quarter of the year, and so on.

# An example: two years of hourly bicycle rental data
import pandas as pd
data = pd.read_csv('kaggle_bike_competition_train.csv', header=0, error_bad_lines=False)
data.head()  # print the first 5 rows to see what the data looks like
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1
# Process the datetime column
# First, split it into its components
data = data.iloc[:, :1]  # keep only the datetime attribute
temp = pd.DatetimeIndex(data['datetime'])
data['date'] = temp.date            # date
data['time'] = temp.time            # time
data['year'] = temp.year            # year
data['month'] = temp.month          # month
data['day'] = temp.day              # day
data['hour'] = temp.hour            # hour
data['dayofweek'] = temp.dayofweek  # day of the week
data['dateDays'] = (data.date - data.date[0])  # ['0 days', '0 days', ..., '1 days', ...]
data['dateDays'] = data['dateDays'].astype('timedelta64[D]')  # convert to a float number of days
data
datetime date time year month day hour dayofweek dateDays
0 2011-01-01 00:00:00 2011-01-01 00:00:00 2011 1 1 0 5 0.0
1 2011-01-01 01:00:00 2011-01-01 01:00:00 2011 1 1 1 5 0.0
2 2011-01-01 02:00:00 2011-01-01 02:00:00 2011 1 1 2 5 0.0
3 2011-01-01 03:00:00 2011-01-01 03:00:00 2011 1 1 3 5 0.0
4 2011-01-01 04:00:00 2011-01-01 04:00:00 2011 1 1 4 5 0.0
... ... ... ... ... ... ... ... ...
10881 2012-12-19 19:00:00 2012-12-19 19:00:00 2012 12 19 19 2 718.0
10882 2012-12-19 20:00:00 2012-12-19 20:00:00 2012 12 19 20 2 718.0
10883 2012-12-19 21:00:00 2012-12-19 21:00:00 2012 12 19 21 2 718.0
10884 2012-12-19 22:00:00 2012-12-19 22:00:00 2012 12 19 22 2 718.0
10885 2012-12-19 23:00:00 2012-12-19 23:00:00 2012 12 19 23 2 718.0

10886 rows × 9 columns

temp
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
               '2011-01-01 02:00:00', '2011-01-01 03:00:00',
               '2011-01-01 04:00:00', '2011-01-01 05:00:00',
               '2011-01-01 06:00:00', '2011-01-01 07:00:00',
               '2011-01-01 08:00:00', '2011-01-01 09:00:00',
               ...
               '2012-12-19 14:00:00', '2012-12-19 15:00:00',
               '2012-12-19 16:00:00', '2012-12-19 17:00:00',
               '2012-12-19 18:00:00', '2012-12-19 19:00:00',
               '2012-12-19 20:00:00', '2012-12-19 21:00:00',
               '2012-12-19 22:00:00', '2012-12-19 23:00:00'],
              dtype='datetime64[ns]', name='datetime', length=10886, freq=None)

3.3 Feature dimension reduction

In real projects it is not true that the higher the feature dimension, the better. Why do we need dimensionality reduction?

  1. The higher the feature dimension, the easier it is for the model to overfit, and a more complex model is then harder to use.
  2. The higher the number of mutually independent feature dimensions, the more training samples are needed to reach the same performance on the test set with an unchanged model.
  3. More features increase the cost of training, testing and storage.
  4. In some models, such as the distance-based KMeans and KNN, too high a dimension hurts both the accuracy and the performance of the distance computation.
  5. Visual analysis: in low dimensions (two or three) we can plot the data and inspect it visually, which becomes impossible as the dimension grows. Machine learning also has the classic notion of the curse of dimensionality.

Because of all these problems with high-dimensional features, we need dimensionality reduction and feature selection. Commonly used dimensionality-reduction algorithms include PCA and LDA. The goal is to map a dataset from a high-dimensional space into a lower-dimensional space while losing as little information as possible, or while keeping the data points as separable as possible after the reduction. PCA was analyzed in an earlier article.
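
As a minimal illustration, sklearn's PCA can be applied to the iris data used throughout this article, keeping 2 components purely for the example:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
iris_2d = pca.fit_transform(iris.data)       # 150 x 4  ->  150 x 2
print(pca.explained_variance_ratio_)         # share of variance retained by each component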

3.4 Feature Selection

After data preprocessing is complete, we need to select meaningful features to feed into the machine learning algorithm and model for training. The goal of feature selection is to find an optimal feature subset. Feature selection removes irrelevant or redundant features, which reduces the number of features, improves model accuracy and shortens running time; on the other hand, a simpler model built on truly relevant features also helps us understand how the data were generated.

According to the form of feature selection, feature selection methods can be divided into three types

  1. Filter: score each feature by divergence or correlation, set a threshold (or the number of features to keep), and select the top-ranked features.
  2. Wrapper: select (or exclude) several features at a time according to an objective function (usually a predictive-performance score).
  3. Embedded: first train a machine learning model, obtain the weight coefficient of each feature, and select features by coefficient from large to small. Similar to Filter, but the quality of a feature is determined through training.

3.4.1 Filter

Variance threshold method

With the variance threshold method, first compute the variance of each feature, then keep only the features whose variance exceeds a chosen threshold.

from sklearn.feature_selection import VarianceThreshold
# iris.data[:, 0].var()  # 0.6811222222222223
# iris.data[:, 1].var()  # 0.18675066666666668
# iris.data[:, 2].var()  # 3.092424888888889
# iris.data[:, 3].var()  # 0.5785315555555555

# threshold: the variance threshold; only features whose variance exceeds it are kept.
# For the iris data above, only the third column is returned.
VarianceThreshold(threshold=3).fit_transform(iris.data)

Correlation coefficient method

With the correlation coefficient method, first compute each feature's Pearson correlation coefficient with the target and the corresponding p-value. The code below uses sklearn.feature_selection's SelectKBest class together with the correlation coefficient to select features.

# First, look at how the Pearson correlation coefficient is computed
from scipy.stats import pearsonr  # used to compute the correlation coefficient
np.random.seed(0)  # fix the seed so the same random numbers are generated each time
size = 300
test = np.random.normal(0, 1, size)
# print("low noise:", pearsonr(test, test + np.random.normal(0, 1, size)))
# print("high noise:", pearsonr(test, test + np.random.normal(0, 10, size)))
# the higher the correlation coefficient, the smaller the p-value

from sklearn.feature_selection import SelectKBest

# Select the K best features and return the data restricted to them.
# The first argument is a scoring function: it takes the feature matrix and the target vector
# and returns a pair of arrays (scores, p-values), one entry per feature; here it is defined
# as the Pearson correlation of each feature with the target.
# The parameter k is the number of features to keep.
SelectKBest(lambda X, Y: list(np.array(list(map(lambda x: pearsonr(x, Y), X.T))).T), k=2).fit_transform(iris.data, iris.target)
array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       ...,
       [5.2, 2. ],
       [5.4, 2.3],
       [5.1, 1.8]])

Chi-square test

The classic chi-square test checks the correlation between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable has N possible values and the dependent variable has M. For every combination (independent variable equals i, dependent variable equals j), consider the difference between the observed frequency and the expected frequency and build the statistic:

χ² = Σ (A - E)² / E,  where A is the observed frequency and E the expected frequency.
The code for selecting features with sklearn.feature_selection's SelectKBest class combined with the chi-square test is as follows:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Select K best features and return the data after selecting features
SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)
array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       ...,
       [5.2, 2. ],
       [5.4, 2.3],
       [5.1, 1.8]])

3.4.2 Wrapper

Feature selection is treated as a search over feature subsets: candidate subsets are screened and their quality is evaluated with a model. A typical wrapper algorithm is recursive feature elimination.

Recursive feature elimination method

Recursive feature elimination uses a base model to run several rounds of training. After each round, the features with the smallest weight coefficients are removed, and the next round is trained on the remaining feature set. With logistic regression, for example, the procedure is: (1) train a model on the full feature set; (2) remove 5-10% of the weakest features according to the coefficients of the linear model (which reflect the correlations) and watch how accuracy/AUC changes; (3) repeat until accuracy/AUC drops significantly, then stop.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
 
# Recursive feature elimination method, return the data after feature selection
# Parameter Estimator is the base model
# parameter n_features_to_select indicates the number of selected features
RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(iris.data, iris.target)
array([[3.5, 0.2],
       [3. , 0.2],
       [3.2, 0.2],
       ...,
       [3. , 2. ],
       [3.4, 2.3],
       [3. , 1.8]])

3.4.3 Embedded

Feature selection method based on penalty term

Besides screening features, a base model with a penalty term also performs dimensionality reduction. Using sklearn.feature_selection's SelectFromModel class together with a logistic regression model with an L1 penalty, the feature-selection code looks like this:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Feature selection with an L1-penalized logistic regression as the base model
# (newer sklearn versions require an L1-capable solver such as liblinear)
SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(iris.data, iris.target)
array([[5.1, 3.5, 1.4],
       [4.9, 3. , 1.4],
       [4.7, 3.2, 1.3],
       ...,
       [6.5, 3. , 5.2],
       [6.2, 3.4, 5.4],
       [5.9, 3. , 5.1]])

Feature selection method based on tree model

Among tree models, GBDT can also serve as the base model for feature selection, using sklearn.feature_selection's SelectFromModel class together with a GBDT model:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
 
# GBDT as the base model for feature selection
SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)
array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       ...,
       [5.2, 2. ],
       [5.4, 2.3],
       [5.1, 1.8]])

That concludes the description of feature engineering. I have not yet worked through feature monitoring and will add it after studying it later. Once feature engineering is done, the next step is model selection and tuning, which the next article will study and organize.

References: 1. blog.csdn.net/rosenor1/ar… 2. www.cnblogs.com/jasonfreak/…

Welcome to follow my personal public account, AI Computer Vision Workshop. It irregularly pushes articles on machine learning, deep learning, computer vision and related topics; you are welcome to learn and exchange ideas with me.