Feature engineering is the use of domain knowledge and existing data to create new features for machine learning algorithms; it can be manual or automated. The automatic feature engineering performed by neural networks is often not applicable to the more complex tasks found in practice. This article therefore focuses on data mining and traditional machine learning and does not cover deep learning fields such as image recognition and natural language processing.

As the saying goes, data and feature engineering determine the upper limit of a model; improving the algorithm merely approaches that limit.

Searching the Douban Books channel for the keyword "feature engineering" returns only two books rated above 7 points: Mastering Feature Engineering and Introduction to Feature Engineering and Practice.

When the data dimensionality is limited, feature engineering is very important, yet few articles are published on the subject, let alone a global and systematic treatment of feature engineering theory and practice. I therefore want to summarize a general data-science feature-engineering methodology based on my work experience, relevant books and excellent articles.

This article proceeds step by step through feature understanding, feature cleaning, feature construction, feature transformation and related dimensions, covering both theory and code implementation. For the code part, company data cannot be disclosed, so the public Titanic dataset is used instead.

import pandas as pd
import numpy as np
import seaborn as sns

df_titanic = sns.load_dataset('titanic')

The data fields are described as follows:

1. Feature Understanding

1.1 Distinguishing structured data from unstructured data

Data stored in tabular form is structured data; unstructured data has no fixed form, such as text, messages, logs, etc.

1.2 Distinguish between quantitative and qualitative data

  • Quantitative data: numerical values used to measure the amount of something;
  • Qualitative data: Refers to the categories used to describe the nature of something.

2. Feature Cleaning

The goal is to improve data quality and reduce the risk of algorithmic modeling errors.

In real business modeling, the data often has various problems, such as being incomplete, noisy or inconsistent, and erroneous data can adversely affect the model.

The data cleaning process includes data alignment, missing value handling, outlier handling, data transformation and other processing steps.

2.1 Data Alignment

This mainly covers alignment of time, fields, and units/dimensions.

1) time:

  • Date formats are inconsistent ['2019-07-20', '20190720', '2019/07/20', '20/07/2019']
  • Timestamp units are inconsistent: some are expressed in seconds and some in milliseconds.
  • Invalid times, e.g. a timestamp of 0 or an end timestamp of FFFF.

2) the field:

  • A name written in the gender field, a mobile phone number written in the ID number field, etc.

3) dimensions:

  • Unify numeric types [e.g. 1, 2.0, 3.21E3, 4]
  • Unify units [e.g. 180cm vs. 1.80m]

2.2 Missing Value Handling

When only a small amount of data is missing, it can be left unprocessed or deleted, or filled with the mean, median, mode or a similar within-group mean.

When the missing values strongly affect the model and there is enough non-missing data, model prediction or interpolation can be used. When too many values are missing, the missingness itself can be encoded.

The missing-value ratio is calculated for each field, and a strategy is then chosen according to that ratio and the importance of the field, as shown in the following figure:

Summary of the null-value distribution

df_titanic.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0

1) Deleting tuples

Delete the objects (tuples, records) that have missing attribute values, leaving a complete information table.

Advantages:

Simple and easy to apply. It is effective when an object has multiple missing attribute values and the deleted objects are very few compared with the full data set.

Disadvantages:

When the missing data accounts for a large proportion, and especially when it is not missing at random, this method may bias the data and lead to incorrect conclusions.

Code implementation

The embark_town field has two null values, so deleting those rows can be considered.

df_titanic[df_titanic["embark_town"].isnull()]


df_titanic.dropna(axis=0,how='any',subset=['embark_town'],inplace=True)

2) Data filling

Fill in null values with certain values to complete the information table. Usually, based on statistical principles, a missing value is filled according to the distribution of the remaining values in the data set.

(a) Manual filling

Fill in values manually based on business knowledge.

(b) Special-value filling (treating missing values as a special value)

Treat the null value as a special attribute value distinct from all other values, e.g. fill all nulls with "unknown". This is usually a temporary fill or an intermediate step.

Code implementation

df_titanic['embark_town'].fillna('unknown', inplace=True)

(c) Statistical filling

If the missing rate is low (less than 95%) and the field importance is low, fill according to the data distribution.

Common fill statistics:

Mean value: For data that conform to uniform distribution, the mean value of the variable is used to fill in missing values.

Median: In the case of skewed data distribution, the median is used to fill in missing values.

Mode: Discrete features can use mode to fill in missing values.

  • Median filling

Fare: If there are many missing values, fill them with the median.

df_titanic['fare'].fillna(df_titanic['fare'].median(), inplace=True) 
  • Mode filling

embarked: only two values are missing; fill with the mode.

df_titanic['embarked'].isnull().sum()
# 2
# Note: .mode() returns a Series, so take its first element as the fill value
df_titanic['embarked'].fillna(df_titanic['embarked'].mode()[0], inplace=True)
df_titanic['embarked'].value_counts()
  • Filling with sklearn's Imputer

The Imputer class provides basic strategies for handling missing values, such as replacing them with the mean, median or mode of the column (or row) in which they occur. The class is also compatible with different missing-value encodings. (In newer scikit-learn versions, Imputer has been replaced by sklearn.impute.SimpleImputer.)

Fill missing values with: sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

Main parameters:

  • missing_values: the value regarded as missing, either an integer or NaN (numpy.nan, represented by the string 'NaN'); the default is NaN.
  • strategy: replacement strategy, a string, default 'mean'. (1) 'mean': replace with the mean of the feature column; (2) 'median': replace with the median of the feature column; (3) 'most_frequent': replace with the mode of the feature column.
  • axis: the axis to impute along; default axis=0 for columns, axis=1 for rows.
  • copy: if True (default), do not modify the original data set; if False, modify in place. In-place modification is not possible when: ① X is not an array of floating-point values; ② X is sparse and missing_values=0; ③ axis=0 and X is a CSR matrix; ④ axis=1 and X is a CSC matrix.
  • statistics_ attribute: when axis=0, the array of fill values for each feature column; when axis=1, accessing this attribute raises an error.
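A minimal sketch with the newer API, sklearn.impute.SimpleImputer, which replaces the deprecated Imputer (the column choice and strategy here are illustrative assumptions, not part of the original):

import numpy as np
from sklearn.impute import SimpleImputer

# Median strategy; 'mean' and 'most_frequent' work the same way
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
age_filled = imputer.fit_transform(df_titanic[['age']])
imputer.statistics_   # the fill value learned for each column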

  • Within-group mean filling

age: group by sex, pclass and who; for samples falling in the same group, fill with the group's mean or median.

df_titanic.groupby(['sex', 'pclass', 'who'])['age'].mean()

Result:
sex     pclass  who
female  1       child    10.333333
                woman    35.500000
        2       child     6.600000
                woman    32.179688
        3       child     7.100000
                woman    27.854167
male    1       child     5.306667
                man      42.382653
        2       child     2.258889
                man      33.588889
        3       child     6.515000
                man      28.995556
Name: age, dtype: float64

age_group_mean = df_titanic.groupby(['sex', 'pclass', 'who'])['age'].mean().reset_index()

age_group_mean result (first rows):
   sex     pclass  who    age
0  female  1       child  10.333333
1  female  1       woman  35.500000
2  female  2       child   6.600000
3  female  2       woman  32.179688
4  female  3       child   7.100000
5  female  3       woman  27.854167
6  male    1       child   5.306667
7  male    1       man    42.382653
8  male    2       ...

def select_group_age_median(row):
    condition = ((row['sex'] == age_group_mean['sex']) &
                 (row['pclass'] == age_group_mean['pclass']) &
                 (row['who'] == age_group_mean['who']))
    return age_group_mean[condition]['age'].values[0]

df_titanic['age'] = df_titanic.apply(
    lambda x: select_group_age_median(x) if np.isnan(x['age']) else x['age'], axis=1)

Result:
0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...
886    27.000000
887    19.000000
888    27.854167
889    26.000000
890    32.000000

sns.distplot(df_titanic.age)

(d) Model-based prediction filling

Use the field to be filled as the label and the rows without missing values as training data; build a classification/regression model, then predict and fill in the missing field.

K-nearest neighbors (KNN)

Determine the K samples closest to the sample with missing data according to Euclidean distance or correlation analysis, then estimate the missing value as a weighted average (or vote) of those K values.
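A minimal sketch of this idea with scikit-learn's KNNImputer (available since scikit-learn 0.22); the column list and n_neighbors are illustrative assumptions:

from sklearn.impute import KNNImputer

# Each missing age is replaced by the (uniformly weighted) mean of the 5 nearest rows,
# where distance is computed over the non-missing numeric columns
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')
num_cols = ['age', 'sibsp', 'parch', 'fare']
age_knn_filled = knn_imputer.fit_transform(df_titanic[num_cols])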

Regression

Establish a regression equation on the complete data. For objects containing nulls, substitute the known attribute values into the equation to estimate the unknown attribute values, and fill with the estimates. When the variables are not linearly related, this leads to biased estimates; linear regression is commonly used.

Code implementation

age: has many missing values. The six features sex, pclass, who, fare, parch and sibsp are used to build a random forest model that fills in the missing ages.

df_titanic_age = df_titanic[['age', 'pclass', 'sex', 'who', 'fare', 'parch', 'sibsp']]
df_titanic_age = pd.get_dummies(df_titanic_age)
df_titanic_age.head()

# Split into rows with known and unknown age
known_age = df_titanic_age[df_titanic_age.age.notnull()]
unknown_age = df_titanic_age[df_titanic_age.age.isnull()]

# y is the target (age); X contains the remaining features
y_train_for_age = known_age['age']
X_train_for_age = known_age.drop(['age'], axis=1)
X_test_for_age = unknown_age.drop(['age'], axis=1)

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(X_train_for_age, y_train_for_age)

# Fill in the missing ages with the predictions
y_pred_age = rfr.predict(X_test_for_age)
df_titanic.loc[df_titanic.age.isnull(), 'age'] = y_pred_age

sns.distplot(df_titanic.age)

(e) Interpolation filling

These include random imputation, multiple imputation, hot-deck imputation, Lagrange interpolation, Newton interpolation and so on.

  • Linear interpolation

An estimate of a missing value can be computed by interpolation. Interpolation estimates the value of an intermediate point from two known points (x0, y0) and (x1, y1): assuming y = f(x) is a straight line, compute f(x) from the two known points, and then y can be obtained for any known x, giving an estimate of the missing value.

Interpolate (method = ‘linear’, axis) the.interpolate(method = ‘linear’, axis) method replaces the NaN value with a value along the given axis via linear interpolation. The difference is the interpolated value between front and back or between top and bottom

df_titanic['fare'].interpolate(method = 'linear', axis = 0)

Interpolation can also be applied across rows of a DataFrame:

# axis=1 interpolates across the columns of a numeric DataFrame (it is not valid on a single Series);
# the column selection here is illustrative
df_titanic[['age', 'fare']].interpolate(method='linear', axis=1)

Code implementation

df_titanic['fare'].interpolate()
  • Multiple Imputation

The idea of multiple imputation comes from Bayesian estimation: the value to be imputed is regarded as random, and its value is drawn from the observed values. In practice the value is usually estimated first, then different noise is added to form several candidate imputed sets, and the most appropriate imputation is chosen according to some selection criterion.

Multiple imputation proceeds in three steps:

Step 1: Generate a set of possible imputed values for each missing value; these reflect the uncertainty of the non-response model. Each set is used to impute the missing values, producing several complete data sets.

Step 2: Analyze each imputed data set with the statistical method intended for complete data sets.

Step 3: Select results from each imputed data set according to a scoring function to produce the final imputed values.
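scikit-learn's IterativeImputer (inspired by MICE) approximates this idea by iteratively modeling each feature with missing values as a function of the others. A minimal sketch under that assumption; the column list and parameters are illustrative and not from the original:

# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

num_cols = ['age', 'sibsp', 'parch', 'fare']
# sample_posterior=True draws imputations from the posterior, mimicking the "multiple" aspect
mice_imputer = IterativeImputer(max_iter=10, random_state=0, sample_posterior=True)
df_imputed = pd.DataFrame(mice_imputer.fit_transform(df_titanic[num_cols]), columns=num_cols)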

(f) Dummy variable filling

If the variable is discrete and has few distinct values, it can be converted into dummy variables. For example, if the SEX variable takes the values MALE, FEMALE and NA, it can be converted into IS_SEX_MALE, IS_SEX_FEMALE and IS_SEX_NA. If a variable has a dozen or more distinct values, the less frequent values can be grouped into an 'other' category based on their frequency, to reduce dimensionality. This approach preserves as much of the variable's information as possible.

Code implementation

sex_list = ['MALE', 'FEMALE', np.NaN, 'FEMALE', 'FEMALE', np.NaN, 'MALE']
df = pd.DataFrame({'SEX': sex_list})
display(df)

df.fillna('NA', inplace=True)
df = pd.get_dummies(df['SEX'], prefix='IS_SEX')
display(df)

# Before filling:
#       SEX
# 0    MALE
# 1  FEMALE
# 2     NaN
# 3  FEMALE
# 4  FEMALE
# 5     NaN
# 6    MALE

# After filling and encoding:
#    IS_SEX_FEMALE  IS_SEX_MALE  IS_SEX_NA
# 0              0            1          0
# 1              1            0          0
# 2              0            0          1
# 3              1            0          0
# 4              1            0          0
# 5              0            0          1
# 6              0            1          0

(g) When more than 80% of a feature's values are missing, it is recommended to delete the feature [or convert it into a binary 'is missing' indicator], as it can easily hurt the model.

df_titanic.drop(["deck"],axis=1)

2.3 Outlier Handling

1) Outlier identification

  • Box plot method
sns.catplot(y="fare",x="survived", kind="box", data=df_titanic,palette="Set2");

  • Normal distribution
sns.distplot(df_titanic.age)

  • Outlier detection methods

(a) Based on statistical analysis

Usually a statistical distribution is assumed to model the data points, and then, under that assumed model, points are judged as abnormal or not according to the distribution.

For example, by analyzing the dispersion of the data, i.e. its variation indicators, we can understand the data distribution and then use those indicators to find outliers.

Commonly used variation indicators include the range, interquartile range, mean absolute deviation, standard deviation, coefficient of variation and so on. A large value of a variation indicator means large variation and wide spread; a small value means small deviation and a denser distribution.

For example, the maximum and minimum values can be used to judge whether a variable's value exceeds a reasonable range; a customer age of -20 or 200 is an outlier.

(b) The 3σ principle

If the data are normally distributed, under the 3σ principle an outlier is a value more than three standard deviations from the mean. For a normal distribution, the probability of lying more than 3σ from the mean is P(|x − μ| > 3σ) ≤ 0.003, a rare small-probability event. If the data are not normally distributed, values can still be described by how many standard deviations they lie from the mean.
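A minimal sketch of the 3σ rule on the fare column (the column choice is an illustrative assumption):

fare_mean, fare_std = df_titanic['fare'].mean(), df_titanic['fare'].std()
lower, upper = fare_mean - 3 * fare_std, fare_mean + 3 * fare_std
outliers_3sigma = df_titanic[(df_titanic['fare'] < lower) | (df_titanic['fare'] > upper)]
outliers_3sigma.shape[0]   # number of fare values flagged as outliers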

(c) Boxplot analysis

The boxplot provides a standard for identifying outliers: if a value is less than Q1-1.5IQR or greater than Q3+1.5IQR, it is called an outlier.

Q1 is the lower quartile: a quarter of the observations are smaller than it;

Q3 is the upper quartile: a quarter of the observations are larger than it;

IQR is the interquartile range, the difference between the upper quartile Q3 and the lower quartile Q1; it contains half of all observations.

The boxplot criterion for outliers is based on the quartiles and the interquartile range, which are robust: up to 25% of the data can be made arbitrarily extreme without disturbing the quartiles, so outliers cannot shift this standard. The boxplot is therefore relatively objective and has certain advantages in identifying outliers.
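The same criterion expressed in code, sketched on the fare column (the column choice is illustrative):

Q1 = df_titanic['fare'].quantile(0.25)
Q3 = df_titanic['fare'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers_iqr = df_titanic[(df_titanic['fare'] < lower) | (df_titanic['fare'] > upper)]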

(d) Model-based detection

First, a model of the data is built. Anomalies are the objects that do not fit the model well: if the model is a set of clusters, an anomaly is an object that does not clearly belong to any cluster; with a regression model, an anomaly is an object relatively far from its predicted value.

Advantages:

These tests have a solid statistical foundation and can be very effective when sufficient data is available and the type of test to use is known.

Disadvantages:

For multivariate data fewer options are available, and for high-dimensional data the detection performance is poor.

(e) Distance-based

The distance-based approach assumes that a data object is an anomaly if it is far away from most other points. By defining a proximity measure between objects, we can judge whether an object is far from the others. The main distance measures are absolute (Manhattan) distance, Euclidean distance and Mahalanobis distance.

Advantages:

The distance-based approach is much simpler than the statistics-based approach, since it is much easier to define a distance measure for a data set than to determine the data set's distribution.

Disadvantages:

Proximity-based approaches require O(m²) time, which is not practical for large data sets.

The method is also sensitive to the choice of parameters.

Data sets with regions of different density cannot be handled well, because a global threshold is used and such density variation cannot be taken into account.

(f) Density-based

By examining the density around the current point, local outliers can be found: the local density of an outlier is significantly lower than that of most of its neighbors. This is suitable for non-uniform data sets.

Advantages:

Gives a quantitative measure of how much of an outlier an object is, and handles the data well even when it has regions of different density.

Disadvantages:

As with distance-based methods, these methods necessarily have O(m²) time complexity.

For low-dimensional data, O(m log m) can be achieved with appropriate data structures;

Parameter selection is difficult.

Although the algorithm handles this by trying different values of K and taking the maximum outlier score, upper and lower bounds for these values still have to be chosen.
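A minimal sketch of density-based detection with scikit-learn's LocalOutlierFactor; the chosen columns, n_neighbors and contamination are illustrative assumptions, not prescribed by the text:

from sklearn.neighbors import LocalOutlierFactor

X_num = df_titanic[['age', 'fare']].dropna()
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X_num)   # -1 marks points flagged as local outliers
X_num[labels == -1].head()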

(g) Clustering-based

Whether an object is considered an outlier may depend on the number of clusters (e.g. it may look like noise when K is large). There is no simple answer. One strategy is to repeat the analysis with different numbers of clusters; another is to find many small clusters, the idea being:

Smaller clusters tend to be more cohesive;

If an object is still an outlier even when there are a large number of small clusters, it is likely a true outlier.

On the downside, a group of outliers may form small clusters to evade detection.

Advantages:

Clustering techniques based on linear and near-linear complexity (K-means) may be highly effective in finding outliers.

Clusters are usually defined as the complement of outliers, so it is possible to find both clusters and outliers.

Disadvantages:

The resulting set of outliers and their scores may depend heavily on the number of clusters used and the existence of outliers in the data;

The quality of clusters produced by clustering algorithm has a great influence on the quality of outliers produced by the algorithm.

(h) Outlier detection based on proximity

An object is anomalous if it is far from most other points. This approach is more general and easier to use than statistical methods, because it is easier to define a meaningful proximity measure for a data set than to determine its statistical distribution. An object's outlier score is given by the distance to its k-th nearest neighbor and is highly sensitive to the value of K: if K is too small (e.g. 1), a small group of nearby outliers may receive low outlier scores; if K is too large, all objects in clusters with fewer than K points may become outliers. To make the scheme more robust to the choice of K, the average distance to the k nearest neighbors can be used.

Advantages:

simple

Disadvantages:

Proximity-based approaches require O(m²) time, which is not practical for large data sets.

The method is also sensitive to the choice of parameters.

Data sets with regions of different density cannot be handled well, because a global threshold is used and such density variation cannot be taken into account.

Conclusion:

In the data-processing stage, outliers are treated as anomalies that harm data quality, rather than as the targets of anomaly detection in the usual sense. Generally a simple, intuitive approach is used, judging outliers by combining boxplots and the MAD statistic.

 sns.scatterplot(x="fare", y="age", hue="survived",data=df_titanic,palette="Set1")

2) Handling methods

Outlier handling needs to be analyzed case by case. Commonly used approaches include:

  • Delete records containing outliers;
  • Before discarding filtered-out abnormal samples, it is best to reconfirm with the business side whether they are truly unneeded, to avoid filtering out normal samples;
  • Treat outliers as missing values and hand them over to missing value handling methods;
  • Use mean/median/mode to correct;
  • Do not handle.

3. Feature Construction

3.1 Feature Construction

The goal is to enhance data representation and add prior knowledge.

If the results are still not good after processing the existing variables, feature construction is needed, i.e. generating new variables.

3.3.1 Statistical construction:

  1. Build new features based on business rules, prior knowledge, etc

  2. Quartiles, median, mean, standard deviation, deviation, skewness, kurtosis, coefficient of variation

  3. Construct long and short term statistics (e.g., week, month)

  4. Time decay (the closer the observation, the higher the weight value)

  • Age: Child, young, midlife, old
def age_bin(x):
    if x <= 18:
        return 'child'
    elif x <= 30:
        return 'young'
    elif x <= 55:
        return 'midlife'
    else:
        return 'old'

df_titanic['age_bin'] = df_titanic['age'].map(age_bin)
df_titanic['age_bin'].unique()
# array(['young', 'midlife', 'child', 'old'], dtype=object)
  • Extracting the Title feature
# Note: this assumes a 'name' column (present in the Kaggle Titanic data; seaborn's titanic dataset does not include it)
df_titanic['title'] = df_titanic['name'].map(
    lambda x: x.split(',')[1].split('.')[0].strip())
df_titanic['title'].value_counts()

Result:
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Ms                2
Major             2
Mlle              2
Dona              1
Sir               1
Capt              1
Don               1
Lady              1
Mme               1
the Countess      1
Jonkheer          1

df_titanic['title'].unique()
# array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
#        'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
#        'Jonkheer', 'Dona'], dtype=object)

# Map the raw titles to a smaller set of categories
title_dictionary = {
    "Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Master": "Master",
    "Don": "Royalty", "Rev": "Officer", "Dr": "Officer", "Mme": "Mrs",
    "Ms": "Mrs", "Major": "Officer", "Lady": "Royalty", "Sir": "Royalty",
    "Mlle": "Miss", "Col": "Officer", "Capt": "Officer",
    "the Countess": "Royalty", "Jonkheer": "Royalty", "Dona": "Mrs"
}
df_titanic['title'] = df_titanic['title'].map(title_dictionary)
df_titanic['title'].value_counts()

Result:
Mr         757
Miss       262
Mrs        201
Master      61
Officer     23
Royalty      5
  • Sample family size
df_titanic['family_size'] = df_titanic['sibsp'] + df_titanic['parch'] + 1
df_titanic['family_size'].head()
# 0    2
# 1    2
# 2    1
# 3    2
# 4    1

3.3.2 Cycle values:

  1. Values over the previous n periods/days/months/years, e.g. the average over the last 5 days, etc.

  2. Year-on-year / period-on-period changes

3.3.3 Data binning (bucketing):

1) Equal-frequency and equal-width binning

(a) Custom binning

Intervals are set manually based on business experience or common sense, and the raw data is then assigned to these intervals.

(b) Equal-width binning

Divide the data into bins of the same width.

If A and B are the minimum and maximum values, the width of each interval is W = (B − A)/N, and the interval boundaries are A+W, A+2W, …, A+(N−1)W. Only the boundaries matter here, and the number of instances in each bin may differ.

Its disadvantage is that it is strongly influenced by outliers.

(c) Equal-frequency binning

Divide the data into bins that each contain the same number of instances.

The interval boundaries are chosen so that each interval contains roughly the same number of instances; for example, if N = 10, each interval should contain about 10% of the instances.

  • Binning a numerical variable
# Equal-frequency binning with qcut
df_titanic['fare_bin'], bins = pd.qcut(df_titanic['fare'], 5, retbins=True)
df_titanic['fare_bin'].value_counts()
# (7.854, 10.5]        184
# (21.679, 39.688]     180
# (-0.001, 7.854]      179
# (39.688, 512.329]    ...
# (10.5, 21.679]       ...

bins
# array([  0.    ,   7.8542,  10.5   ,  21.6792,  39.6875, 512.3292])

# Turn the bin edges into an integer-coded feature
def fare_cut(fare):
    if fare <= 7.8958:
        return 0
    if fare <= 10.5:
        return 1
    if fare <= 21.6792:
        return 2
    if fare <= 39.6875:
        return 3
    return 4

df_titanic['fare_bin'] = df_titanic['fare'].map(fare_cut)

# Binning with custom edges using cut
bins = [0, 12, 18, 65, 100]
pd.cut(df_titanic['age'], bins).value_counts()

2) Best-KS binning

1. Sort the feature values from smallest to largest.

2. Compute the maximum KS value, i.e. the cut point, denoted D, and split the data into two parts at D.

3. Recursively repeat step 2, further splitting the data on either side of D, until the number of bins reaches the preset threshold.

4. For a continuous variable, the KS value after binning is no greater than the KS value before binning.

5. During binning, the KS value after binning is determined by a single cut point rather than the joint effect of multiple cut points, and this cut point lies where the original KS value is largest.

Note: The code implementation is available online

3) Chi-square binning

A bottom-up (i.e. merge-based) data discretization method. It relies on the chi-square test: adjacent intervals with the smallest chi-square value are merged until a stopping criterion is met.

The basic idea

For exact discretization, the relative class frequencies should be consistent within an interval. Therefore, if two adjacent intervals have very similar class distributions, they can be merged; otherwise they should remain separate. A low chi-square value indicates similar class distributions.

Implementation steps

Step 1: Define a chi-square threshold in advance;

Step 2: Initialize. Sort the instances by the attribute to be discretized, with each instance initially belonging to its own interval;

Step 3: Merge intervals;

Calculate the chi-square value of each pair of adjacent intervals;

Merge the pair of adjacent intervals with the smallest chi-square value;

The chi-square value is χ² = Σᵢ Σⱼ (Aij − Eij)² / Eij, where Aij is the number of instances of class j in the i-th interval, and Eij = (Ni · Cj)/N is the expected frequency of Aij; N is the total number of samples, Ni is the number of samples in the i-th interval, and Cj is the number of samples of class j in the whole (equivalently, Ni times the overall proportion of class j);
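A minimal sketch of the χ² statistic for one pair of adjacent intervals, following the Aij/Eij definition above (the example counts are illustrative):

import numpy as np

def chi2_adjacent(bin1_counts, bin2_counts):
    """Class counts [n_class0, n_class1, ...] of two adjacent intervals.
    Assumes every class appears in at least one of the two intervals (so Eij > 0)."""
    A = np.array([bin1_counts, bin2_counts], dtype=float)   # Aij
    N = A.sum()                      # total samples in the two intervals
    Ni = A.sum(axis=1, keepdims=True)   # samples per interval
    Cj = A.sum(axis=0, keepdims=True)   # samples per class
    E = Ni * Cj / N                  # Eij
    return ((A - E) ** 2 / E).sum()

chi2_adjacent([10, 2], [8, 4])   # a small value means similar class distributions, i.e. merge candidates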

Meaning of the threshold

When the class and the attribute are independent, there is a 90% probability that the computed chi-square value is below 4.6; a chi-square value above the 4.6 threshold means the attribute and the class are not independent and the intervals cannot be merged. A larger threshold leads to more merges, so discretization yields fewer, wider intervals.

Notes

The ChiMerge algorithm recommends confidence levels of 0.90, 0.95 or 0.99, with at most 10 to 15 intervals. The chi-square threshold can also be ignored in favor of a minimum or maximum number of intervals, i.e. specifying upper and lower bounds on how many intervals are allowed. For categorical variables, some ordering must be imposed before they can be binned.

Code implementation

Github.com/tatsumiw/Ch…

4) Minimum entropy method

The total entropy should be minimized, i.e. the binning should separate the dependent variable as well as possible.

Entropy is an information-theoretic measure of disorder in data. The basic purpose of information entropy is to quantify the relationship between the amount of information and redundancy in a symbol system, so that data can be stored, managed and transmitted with the highest efficiency at minimum cost.

The lower the entropy of a data set, the smaller the differences within it. Minimum-entropy partitioning makes the data within each bin as similar as possible: given the number of bins, among all possible binnings the minimum-entropy method yields the binning with the smallest entropy.
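A minimal sketch of scoring one candidate cut point by the weighted entropy of the two resulting bins; the cut point, columns and helper names are illustrative assumptions:

import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split_entropy(x, y, cut):
    """Weighted entropy of y after splitting x at `cut`; lower is better."""
    left, right = y[x <= cut], y[x > cut]
    n = len(y)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

mask = df_titanic['age'].notnull()
split_entropy(df_titanic.loc[mask, 'age'].values,
              df_titanic.loc[mask, 'survived'].values, cut=18)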

3.3.4 Feature combination

Note: focus on combining strong (highly predictive) feature dimensions.

1) Discrete + discrete: Cartesian product

2) Discrete + continuous: bin the continuous feature and then take the Cartesian product with the categorical feature, or group by the categorical feature and compute statistics of the continuous feature (similar to cluster-based feature construction). See the sketch after this list.

3) Continuous + continuous: addition, subtraction, multiplication, division, second-order differences, etc.
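A minimal sketch of a discrete × discrete cross (Cartesian product of two categorical features); the chosen columns are illustrative assumptions:

# Cross sex and pclass into a single combined categorical feature
df_titanic['sex_pclass'] = df_titanic['sex'].astype(str) + '_' + df_titanic['pclass'].astype(str)
df_titanic['sex_pclass'].value_counts()

# The crossed feature can then be one-hot encoded like any other category
pd.get_dummies(df_titanic['sex_pclass'], prefix='sex_pclass').head()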

  • Polynomial feature generation [for continuous values]
df_titanic_numerical = df_titanic[['age', 'sibsp', 'parch', 'fare', 'family_size']]
df_titanic_numerical.head()
#     age  sibsp  parch     fare  family_size
# 0  22.0      1      0   7.2500            2
# 1  38.0      1      0  71.2833            2
# 2  26.0      0      0   7.9250            1
# 3  35.0      1      0  53.1000            2
# 4  35.0      0      0   8.0500            1

# Generate degree-2 polynomial and interaction features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
df_titanic_numerical_poly = poly.fit_transform(df_titanic_numerical)
pd.DataFrame(df_titanic_numerical_poly, columns=poly.get_feature_names()).head()

Check the correlation after the new variable is derived. The darker the color, the greater the correlation:

sns.heatmap(pd.DataFrame(df_titanic_numerical_poly, columns=poly.get_feature_names()).corr())

3.4 Feature Selection

The goal is to reduce noise and computational complexity and to enhance the predictive performance of the model.

After the data preprocessing is completed, we need to select meaningful features to input into the machine learning algorithm and model for training.

In general, features are selected in two ways:

  • Whether the feature varies: if a feature does not vary, e.g. its variance is close to 0, the samples show no difference on this feature and it is useless for distinguishing samples.
  • Correlation between the feature and the target: features highly correlated with the target should be preferred. Apart from the variance method, the other methods introduced here are all based on correlation.

According to the form of feature selection, feature selection methods can be divided into three types:

  • Filter: score each feature according to divergence or correlation, set a threshold or the number of features to keep, and select features accordingly.
  • Wrapper: select (or exclude) several features at a time according to an objective function (usually a predictive performance score).
  • Embedded: first train a machine learning model to obtain weight coefficients for each feature, then select features by coefficient magnitude from large to small. Similar to the Filter method, but feature quality is determined through training. We use the feature_selection module in sklearn for feature selection.

3.4.1 Filter methods

1) Variance filtering

This class filters features by their variance. If a feature's variance is very small, the samples show essentially no difference on it: most of its values are the same, or even all of them are, so the feature does not help distinguish samples. Whatever feature engineering follows, features with zero variance should be eliminated first. VarianceThreshold takes a threshold parameter and discards all features whose variance is below the threshold; the default is 0, i.e. only features that take the same value in every record are removed.

from sklearn.feature_selection import VarianceThreshold

variancethreshold = VarianceThreshold()  # default threshold = 0
df_titanic_numerical = df_titanic[['age', 'sibsp', 'parch', 'fare', 'family_size']]
X_var = variancethreshold.fit_transform(df_titanic_numerical)  # new matrix with unqualified features removed
variancethreshold.variances_
# array([ 79.58,   1.21467827,   0.64899903, 512.3292,   2.60032675])
del_list = df_titanic_numerical.columns[variancethreshold.get_support() == 0].to_list()  # removed features

However, if we know how many features we need, variance can also help us make feature selection in one step.

For example, if we want to keep half of the features, we can set a variance threshold that halves the number of features: find the median of the feature variances and pass that median as the threshold parameter:

df_titanic_numerical_fsvar = VarianceThreshold(np.median(df_titanic_numerical.var().values)).fit_transform(df_titanic_numerical)

When a feature is binary, its values follow a Bernoulli distribution. Suppose p = 0.8, i.e. a binary feature is deleted when one category accounts for more than 80% of its values:

X_bvar = VarianceThreshold(.8 * (1 - .8)).fit_transform(df_titanic_numerical)
X_bvar.shape

2) Chi-square filtering

The chi-square test is dedicated to classification algorithms and captures correlation; features with a p-value below the significance level are kept.

Chi-square filtering is a correlation filter specifically for discrete labels (i.e. classification problems).

feature_selection.chi2 computes the chi-square statistic between each non-negative feature and the label, and ranks the features by this statistic from high to low.

df_titanic_categorical = df_titanic[['sex', 'class', 'embarked', 'who', 'age_bin', 'adult_male', 'alone', 'fare_bin']]
df_titanic_numerical = df_titanic[['age', 'sibsp', 'parch', 'fare', 'family_size', 'pclass']]
df_titanic_categorical_one_hot = pd.get_dummies(
    df_titanic_categorical,
    columns=['sex', 'class', 'embarked', 'who', 'age_bin', 'adult_male', 'alone', 'fare_bin'],
    drop_first=True)
df_titanic_combined = pd.concat([df_titanic_numerical, df_titanic_categorical_one_hot], axis=1)

y = df_titanic['survived']
X = df_titanic_combined.iloc[:, 1:]

from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

chi_value, p_value = chi2(X, y)
# Number of features with a p-value below 0.05
k = chi_value.shape[0] - (p_value > 0.05).sum()  # 14
X_chi = SelectKBest(chi2, k=14).fit_transform(X, y)
X_chi.shape
# (891, 14)

3) F-test

It can only capture linear correlation and requires the data to follow a normal distribution; features with a p-value below the significance level are kept.

The F-test, also known as ANOVA, is a filter method that captures the linear relationship between each feature and the label. It can be used for both regression and classification, via feature_selection.f_classif (F-test for classification) and feature_selection.f_regression (F-test for regression). F-test classification is for data with discrete labels, while F-test regression is for data with continuous labels.

The essence of the F-test is to look for a linear relationship between two sets of data; its null hypothesis is that "there is no significant linear relationship between the data".

from sklearn.feature_selection import f_classif

f_value, p_value = f_classif(X, y)
X_classif = SelectKBest(f_classif, k=14).fit_transform(X, y)

4) Mutual information

It can capture any kind of correlation, cannot be used on sparse matrices, and keeps features whose mutual information is greater than 0.

Mutual information method is a filtering method to capture any relationship (linear and nonlinear) between each feature and the label. Like the F-test, it can do both regression and classification, and contains two classes:

feature_selection.mutual_info_classif (mutual information for classification) and feature_selection.mutual_info_regression (mutual information for regression).

Both classes have exactly the same usage and parameters as the F-test classes, but the mutual information method is more powerful: the F-test can only find linear relationships, whereas mutual information can find arbitrary relationships. The mutual information method does not return statistics like p-values or F-values; it returns an estimate of the mutual information between each feature and the target, taking values in [0, 1], where 0 means the two variables are independent and 1 means they are completely dependent.

from sklearn.feature_selection import mutual_info_classif as MIC

# Mutual information method
mic_result = MIC(X, y)   # mutual information estimates
k = mic_result.shape[0] - sum(mic_result <= 0)  # 16
X_mic = SelectKBest(MIC, k=16).fit_transform(X, y)
X_mic.shape
# (891, 16)

3.4.2 Wrapper methods

1) Recursive feature elimination

Recursive feature elimination uses a base model to perform several rounds of training; after each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the remaining feature set. The code using the RFE class from the feature_selection module is as follows:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_ref = RFE(estimator=LogisticRegression(),
            n_features_to_select=10).fit_transform(X, y)

2) Feature importance assessment

from sklearn.ensemble import ExtraTreesClassifier

# feature extraction
model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)

feature = list(zip(X.columns, model.feature_importances_))
feature = pd.DataFrame(feature, columns=['feature', 'importances'])
feature.sort_values(by='importances', ascending=False).head(20)

Result (top features):
    feature                     importances
2   fare                        0.227659
15  adult_male_True             0.130000
10  who_man                     0.108939
5   sex_male                    0.078065
11  who_woman                   0.059090
7   class_Third                 0.055755
4   pclass                      0.048733
3   family_size                 0.038347
0   sibsp                       0.035489
9   embarked_S                  0.029512
1   parch                       0.023778
20  fare_bin_(39.688, 512.329]  0.022985
14  age_bin_young               0.021404
12  age_bin_midlife             0.019379
6   class_Second                0.019301
    ...

3) Permutation importance

Advantages: fast to compute; easy to use and understand; provides a measure of feature importance; favors stable features.

Principle: permutation importance is calculated after the model has been trained. The idea is: keeping the target and all other columns unchanged, randomly shuffle one column of the validation-set features and see how much the model's prediction accuracy drops. For a highly important feature, random shuffling does greater damage to the model's prediction accuracy.

Interpreting the results: the first number in each row indicates how much the model's performance (accuracy in this example) declined, and the number after ± is the standard deviation over multiple shuffles.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import eli5
from eli5.sklearn import PermutationImportance

# train_X, train_y, val_X, val_y are assumed to come from train_test_split(X, y)
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())

3.4.3 Embedded methods

1) Feature selection method based on penalty terms

Using a base model with a penalty term, we can both screen out features and reduce dimensionality.

Using the feature_selection module's SelectFromModel class combined with a logistic regression model with an L1 penalty, the feature selection code is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Logistic regression with an L1 penalty as the base model
lr = LogisticRegression(solver='liblinear', penalty='l1')
X_sfm = SelectFromModel(lr).fit_transform(X, y)
X_sfm.shape
# (891, 7)

Using the feature_selection module's SelectFromModel class combined with an SVM model, the feature selection code is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

lsvc = LinearSVC(C=0.01, penalty='l1', dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_sfm_svm = model.transform(X)
X_sfm_svm.shape
# (891, 7)

2) Tree-model-based

GBDT can also be used as the base model for feature selection. Using the feature_selection module's SelectFromModel class with a GBDT model, the code is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier

# GBDT as the base model for feature selection
gbdt = GradientBoostingClassifier()
X_sfm_gbdt = SelectFromModel(gbdt).fit_transform(X, y)
X_sfm_gbdt.shape
# (891, 5)

To sum up, some practical experience with feature selection:

(1) If the feature is a category variable, then start with SelectKBest and use chi-square or tree-based selectors to select the variable;

(2) If the feature is a quantitative variable, the linear model and correlation-based selector can be used directly to select the variable;

(3) If it is a dichotomous problem, consider using SelectFromModel and SVC;

(4) Before the feature selection, we still need to do EDA.

4. Feature Transformation

1) Standardization

Convert to z-scores so that the numerical feature column has an arithmetic mean of 0 and a variance (and standard deviation) of 1. Not immune to outliers.

Note: if the numerical feature column contains unusually large or small outliers (found through EDA), more robust statistics should be used: the median instead of the arithmetic mean, and quantiles instead of the variance. This standardization method has an important parameter (the lower and upper quantile limits), best determined through EDA and data visualization. It is immune to outliers.

from sklearn.preprocessing import StandardScaler
import joblib

stan_scaler = StandardScaler()
stan_scaler.fit(x)
x_zscore = stan_scaler.transform(x)
x_test_zscore = stan_scaler.transform(x_test)
joblib.dump(stan_scaler, 'zscore.m')  # save the fitted scaler for reuse
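For the robust variant mentioned in the note above, a minimal sketch with sklearn's RobustScaler; x and x_test are the same placeholders as in the surrounding snippets, and the quantile range shown is simply the default:

from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler(quantile_range=(25.0, 75.0))  # median/IQR instead of mean/std
x_robust = robust_scaler.fit_transform(x)
x_test_robust = robust_scaler.transform(x_test)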

2) Normalization

Normalize each row so that it has unit norm (l1, l2, or max). Not immune to outliers.

That is, x' = x / ‖x‖ι, where ι denotes the chosen norm.
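A minimal sketch with sklearn's Normalizer (the norm choice is illustrative; x and x_test are the same placeholders as in the surrounding snippets):

from sklearn.preprocessing import Normalizer

normalizer = Normalizer(norm='l2')   # 'l1' or 'max' are also possible
x_normalized = normalizer.fit_transform(x)
x_test_normalized = normalizer.transform(x_test)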

3) Interval scaling

One option is to divide each value in a column by that column's maximum absolute value.

MinMaxScaler: maps linearly to [0, 1]; not immune to outliers.

MaxAbsScaler: maps linearly to [-1, 1]; not immune to outliers.

from sklearn import preprocessing
import joblib

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(x)
x_minmax = min_max_scaler.transform(x)
x_test_minmax = min_max_scaler.transform(x_test)
joblib.dump(min_max_scaler, 'min_max_scaler.m')  # save the fitted scaler for reuse

Note: the same caveat as for standardization applies here: if the column contains extreme outliers (found through EDA), use more robust statistics (median and quantiles) instead; that approach is immune to outliers.

The difference between normalization and standardization

(a) The purposes differ: normalization eliminates units and compresses values into the interval [0, 1]; standardization only adjusts the overall distribution of the feature.

(b) Normalization depends on the maximum and minimum values; standardization depends on the mean and standard deviation.

(c) Normalized output lies in [0, 1]; standardized output is unbounded.

Normalized and standardized application scenarios

(a) In classification and clustering algorithms, when distance is used to measure similarity (e.g. SVM, KNN) or when PCA is used for dimensionality reduction, z-score standardization performs better.

(b) When distance measures or covariance calculations are not involved, or the data does not follow a normal distribution, min-max scaling or other normalization methods can be used. For example, in image processing, converting an RGB image to grayscale limits its values to the range [0, 255].

(c) Tree-based methods such as random forest, bagging and boosting do not require feature normalization. Parameter-based or distance-based models do, because parameters or distances must be computed.

3.5.2 Nonlinear transformation [Statistical transformation]

Use statistical or mathematical transformations to mitigate the effect of a skewed data distribution: spread out the values in originally dense intervals as much as possible, and aggregate the values in originally sparse intervals.

These transformations belong to the family of power transforms and are usually used to create monotonic data transformations. Their main role is to stabilize the variance, keep the distribution close to normal, and make the data independent of the mean of its distribution.

1) log transformation

Log transformations are commonly used to create monotonic data transformations. Their main role is to stabilize the variance, keep the distribution close to normal, and make the data independent of the mean of its distribution. Because a log transformation stretches the range of small argument values and compresses the range of large argument values, it brings a skewed distribution as close to normal as possible. Therefore, for numerical continuous features with unstable variance and heavy-tailed distributions, a log transformation can be used to adjust the variance of the whole distribution; it is a variance-stabilizing transformation.

The log transformation belongs to the power transform family. Mathematically, it is expressed as y = log_b(x).

The natural logarithm uses b = e, with e = 2.71828… (Euler's number); base b = 10, the base of the decimal system, can also be used.

Code implementation

sns.distplot(df_titanic.fare,kde=False)

df_titanic['fare_log'] = np.log((1+df_titanic['fare']))
sns.distplot(df_titanic.fare_log,kde=False)

2) Box-Cox transformation

The Box-Cox transformation is another function in the popular family of power transforms. It requires the values to be positive (as the log transformation does); if values are negative, shifting them by a constant helps.

The Box-Cox transformation, proposed by Box and Cox in 1964, is a generalized power transformation commonly used in statistical modeling when the continuous response variable does not follow a normal distribution. After the transformation, the unobservable error and the correlation between predictor variables can be reduced to some extent. Its main feature is the introduction of a parameter that is estimated from the data itself and then determines the form of the transformation. The Box-Cox transformation can noticeably improve the normality, symmetry and variance homogeneity of the data, and it is effective on many real data sets.

Box-Cox transformation function: y(λ) = (x^λ − 1)/λ when λ ≠ 0, and y(λ) = ln(x) when λ = 0.

The transformed output y is a function of the input x and the transformation parameter λ; when λ = 0, the transformation is the natural log transformation mentioned above. The best value of λ is usually determined by maximum likelihood or maximum log-likelihood.

Code implementation

# Box-Cox requires positive inputs, so keep only positive, non-null fare values
fare_positive_value = df_titanic[(~df_titanic['fare'].isnull()) & (df_titanic['fare'] > 0)]['fare']

import scipy.stats as spstats

# Estimate the optimal lambda
_, opt_lambda = spstats.boxcox(fare_positive_value)
print('Optimal lambda value:', opt_lambda)
# -0.5239075895755266

# Perform the Box-Cox transform with the optimal lambda
fare_boxcox_lambda_opt = spstats.boxcox(
    df_titanic[df_titanic['fare'] > 0]['fare'], lmbda=opt_lambda)
sns.distplot(fare_boxcox_lambda_opt, kde=False)

3.5.3 Processing discrete variables

1) Label encoding (LabelEncoder)

LabelEncoder encodes non-contiguous numbers or text as integers; the encoded values lie between 0 and n_classes − 1.

For example, [dog, cat, dog, mouse, cat] would be converted to [1, 2, 1, 3, 2]. This leads to a curious artifact: the average of dog and mouse is cat.

Advantages: compared with one-hot encoding, LabelEncoder uses less memory and supports encoding text features.

Disadvantages: it implicitly assumes an ordinal relationship between the different categories. In its implementation, LabelEncoder sorts all unique values in the qualitative feature column to obtain the mapping from the original input to integers. Label encoding is therefore not widely used; it is generally suitable for tree models.

Code implementation

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# City-tier labels, translated from the original Chinese example
le.fit(["super-tier", "tier-1", "tier-2", "tier-3"])
print('Classes: {}'.format(list(le.classes_)))
# Classes: ['super-tier', 'tier-1', 'tier-2', 'tier-3']
print('Transformed labels: {}'.format(le.transform(["super-tier", "tier-1", "tier-2"])))
# Transformed labels: [0 1 2]
print('Inverse transform: {}'.format(list(le.inverse_transform([2, 2, 1]))))
# Inverse transform: ['tier-2', 'tier-2', 'tier-1']

2) One-hot encoding (OneHotEncoder)

OneHotEncoder expands the dimensionality used to represent categories. The simplest way to understand it: use an N-bit status register to encode N states, where each state has its own register bit and only one bit is set at a time.

Why use one-hot encoding?

One-hot encoding is used because most algorithms compute with metrics in a vector space. To prevent an artificial ordering among values of a variable that has no natural order, and to make the values equidistant, one-hot encoding extends the values of a discrete feature into Euclidean space: each value of the discrete feature corresponds to a point. This makes distance calculations between features more reasonable.

Why map feature values to Euclidean space?

Discrete features are mapped to Euclidean space via one-hot encoding because, in machine learning algorithms such as regression, classification and clustering, computing the distance or similarity between features is very important, and the distance or similarity we commonly use is similarity in Euclidean space.

Example: suppose there are three color values: red, yellow and blue.

Machine learning algorithms usually require vectorization or digitization. You might assume red = 1, yellow = 2, blue = 3, which is label encoding: giving each category a number. But this would let the machine learn "red < yellow < blue", which is not the intent: we only want the machine to distinguish the categories, not to compare their magnitudes.

Therefore, label encoding is not enough and a further transformation is needed. Since there are three color states, three bits are used: red: 1 0 0, yellow: 0 1 0, blue: 0 0 1. The distance between any two of these vectors is √2; they are equidistant in the vector space, so no bias is introduced and vector-space-based metric algorithms are essentially unaffected.

Advantages: one-hot encoding solves the problem that classifiers do not handle categorical attributes well, and to some extent it also expands the feature set. Its values are only 0 and 1, and the different categories are stored along separate dimensions.

Disadvantages: it can only binarize numeric variables and cannot directly encode string-typed categorical variables; when there are many categories, the feature space becomes very large. In such cases PCA can be used to reduce the dimensionality, and the combination of one-hot encoding + PCA is also useful in practice.

Code implementation

  • Implemented using Pandas:
sex_list = ['MALE', 'FEMALE', np.NaN, 'FEMALE', 'FEMALE', np.NaN, 'MALE']
df = pd.DataFrame({'SEX': sex_list})
display(df)
df.fillna('NA', inplace=True)
df = pd.get_dummies(df['SEX'], prefix='IS_SEX')
display(df)
# (same before/after tables as in the dummy-variable filling section above)

# One-hot encode several Titanic features at once
pd.get_dummies(
    df_titanic,
    columns=['sex', 'class', 'pclass', 'embarked', 'who', 'family_size', 'age_bin'],
    drop_first=True)

  • Implemented using sklearn:
Note: if the values are strings, first convert them to numeric values with LabelEncoder(). OneHotEncoder() in sklearn.preprocessing performs the binarization, encoding each entry of a column vector of shape (n, 1) into a one-hot row vector.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Example labels (any values with this class pattern give the output below), reshaped into a column vector
labels = np.array([0, 1, 0, 2]).reshape(-1, 1)
enc = OneHotEncoder()
enc.fit(labels)
targets = enc.transform(labels).toarray()
# transform returns a sparse matrix by default; sparse=False in the constructor gives a dense array directly
# Encoding result:
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])

3) LabelBinarizer

Its function is similar to OneHotEncoder, but OneHotEncoder can only binarize numeric variables and cannot directly encode string-typed categorical variables, while LabelBinarizer can binarize string variables directly.
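A minimal sketch (the example label values are illustrative):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit_transform(['yes', 'no', 'no', 'yes'])    # binary case: a single 0/1 column
# array([[1],
#        [0],
#        [0],
#        [1]])
lb.fit_transform(['red', 'yellow', 'blue', 'red'])   # multi-class case: one column per class
# array([[0, 1, 0],
#        [0, 0, 1],
#        [1, 0, 0],
#        [0, 1, 0]])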

3.5.4 Dimensionality reduction

Read data & show data

from sklearn import datasets
import matplotlib.pyplot as plt

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

def draw_result(X, y):
    """Scatter-plot the 2-D (dimension-reduced) data, colored by iris class."""
    plt.figure()
    # Iris-setosa
    setosa = X[y == 0]
    plt.scatter(setosa[:, 0], setosa[:, 1], color="red", label="Iris-setosa")
    # Iris-versicolor
    versicolor = X[y == 1]
    plt.scatter(versicolor[:, 0], versicolor[:, 1], color="orange", label="Iris-versicolor")
    # Iris-virginica
    virginica = X[y == 2]
    plt.scatter(virginica[:, 0], virginica[:, 1], color="blue", label="Iris-virginica")
    plt.legend()
    plt.show()

draw_result(X, y)

1) Principal Component Analysis (PCA)

Function: dimension reduction and compression

Steps:

  • Compute the mean of X
  • Subtract the mean from X, then compute the covariance matrix C = \frac{1}{m}XX^T
  • Perform an eigenvalue decomposition of the covariance matrix C
  • Sort the eigenvalues of C from largest to smallest; the eigenvectors corresponding to the first k eigenvalues form the transformation matrix P_{k\times n}

(a) Manually implement PCA

class PCA:
    def __init__(self, dimension, train_x):
        self.dimension = dimension
        self.train_x = train_x

    @property
    def result(self):
        """Return the matrix after dimensionality reduction."""
        # 1. Center the data
        data_centering = self.train_x - np.mean(self.train_x, axis=0)
        # 2. Covariance matrix
        cov_matrix = np.cov(data_centering, rowvar=False)
        # 3. Eigenvalue decomposition
        eigen_val, eigen_vec = np.linalg.eig(cov_matrix)
        # 4. Sort eigenvalues from largest to smallest and take the eigenvectors
        #    of the first k eigenvalues as the transformation matrix
        order = np.argsort(eigen_val)[::-1]
        p = eigen_vec[:, order[:self.dimension]]
        # 5. Project the centered data to obtain the reduced representation
        return np.dot(data_centering, p)

# Method call:
pca = PCA(2, X)
iris_2d = pca.result
draw_result(iris_2d, y)

(b) sklearn PCA

import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
newX = pca.fit_transform(X)


draw_result(newX, y)

2) SVD (Singular Value Decomposition)

Functions: eigendecomposition, dimension reduction

Steps (mirroring the manual implementation below):

  • Center X by subtracting the column mean
  • Compute the singular value decomposition X = U\Sigma V^T
  • Project the centered data onto the first k right singular vectors (the first k columns of V) to obtain the reduced representation

(a) Implement SVD manually

class SVD:
    def __init__(self, dimension, train_x):
        self.dimension = dimension
        self.train_x = train_x

    @property
    def result(self):
        """Return the matrix after dimensionality reduction."""
        # 1. Center the data
        data_centering = self.train_x - np.mean(self.train_x, axis=0)
        # 2. Singular value decomposition
        U, Sigma, VT = np.linalg.svd(data_centering)
        # 3. Project onto the first k right singular vectors
        return np.dot(data_centering, np.transpose(VT)[:, :self.dimension])

# Method call:
svd = SVD(2, X)
iris_svd = svd.result
draw_result(iris_svd, y)

(b) sklearn SVD

TruncatedSVD, truncated singular value decomposition (use this method when the data volume is so large that a full SVD cannot be computed).

from sklearn.decomposition import TruncatedSVD
iris_2d = TruncatedSVD(2).fit_transform(X)
draw_result(iris_2d, y)

3) PCA and SVD
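A brief supplementary note on their relationship (taking the centered data matrix X with samples as rows, as in the code above):

X = U\Sigma V^T, \qquad C = \frac{1}{m}X^TX = \frac{1}{m}V\Sigma^2V^T

The right singular vectors V are exactly the eigenvectors of the covariance matrix C, so projecting onto the first k columns of V (SVD) gives the same result as projecting onto the first k eigenvectors of C (PCA); scikit-learn's PCA is itself computed via SVD (see its svd_solver parameter).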

4) Fisher Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality-reduction method that obtains the optimal feature subspace by minimizing within-class scatter and maximizing between-class scatter.

In the classic two-dimensional illustration, LD1 separates the two normally distributed classes well with a linear decision boundary, whereas LD2 captures a large share of the dataset's variance but carries no class information, so LD2 is not a good discriminant direction.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=2)
iris_2d = lda.fit_transform(X, y)
draw_result(iris_2d, y)

LDA compared with PCA:

  • Both LDA and PCA are linear transformation techniques that can be used to reduce the dimensionality of a dataset.
  • PCA tries to find the orthogonal principal-component axes with the largest variance, while LDA tries to find the feature subspace that best separates the classes.
  • PCA is an unsupervised algorithm, whereas LDA is supervised.

5) t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality-reduction method used mainly for visualizing high-dimensional data.

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
iris_2d = tsne.fit_transform(X)
draw_result(iris_2d, y)

Five, Summary

Don't throw all the features into the model at the start, or you will be misled by useless features.

1) EDA

Plot, plot, plot; the important thing is worth saying three times.

2) Feature preprocessing

Time series: add yesterday's feature value as a feature for today, or add the change of the feature value relative to yesterday as a feature for today.
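A minimal pandas sketch of such lag features (ts_df and its 'date' / 'sales' columns are hypothetical names, not from the original):

# Hypothetical daily table with columns 'date' and 'sales'.
ts_df = ts_df.sort_values('date')
ts_df['sales_yesterday'] = ts_df['sales'].shift(1)        # yesterday's value as today's feature
ts_df['sales_diff_vs_yesterday'] = ts_df['sales'].diff(1) # change relative to yesterday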

Discretization of continuous features (it makes little sense for decision-tree models): an interesting variant is to limit the precision of floating-point features, which makes the model more robust to anomalous data and more stable.

Clipping: use clip(lower, upper) on a pandas DataFrame or Series to limit feature values to a given range.
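For example, on the Titanic 'fare' column (the 1% / 99% quantile bounds are an illustrative assumption):

# Clip fares to the 1st and 99th percentiles to tame extreme values.
low, upper = df_titanic['fare'].quantile([0.01, 0.99])
df_titanic['fare_clipped'] = df_titanic['fare'].clip(low, upper)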

3) Data cleaning

Be reasonable: do not blindly fill missing values or delete outliers; base these decisions on statistical analysis.

4) Feature transformation

Unless absolutely necessary, do not use PCA or LDA to reduce dimensionality; prefer pruning the original features directly.

  • Feature transformation to remember:

Any monotone transformation (such as a logarithm) of a single feature column has no effect on decision-tree algorithms.

For a decision tree there is no difference between X, X^3 and X^5, nor between |X|, X^2 and X^4, unless rounding errors occur.

Linear combinations: only useful for decision trees and decision-tree-based ensembles (such as gradient boosting and random forests), because the common axis-aligned split function is not good at capturing correlations between different features; they are not needed for SVM, linear regression, neural networks, etc.

  • Combination of category features and numerical features:

N1 and N2 denote numerical features, and C1 and C2 denote categorical features. The groupby operation in pandas can create the following meaningful new features (C2 can also be a discretized N1); a code sketch is given after the examples below:

median(N1)_by(C1)   \\ median
mean(N1)_by(C1)     \\ arithmetic mean
mode(N1)_by(C1)     \\ mode
min(N1)_by(C1)      \\ minimum
max(N1)_by(C1)      \\ maximum
std(N1)_by(C1)      \\ standard deviation
var(N1)_by(C1)      \\ variance
freq(C2)_by(C1)     \\ frequency

Simply combining the categorical and numerical features you already have in the ways listed above can greatly increase the number of good features available.

Combining this method with basic feature-engineering operations such as linear combinations (only for decision trees) yields even more meaningful features, for example:

N1 - median(N1)_by(C1)
N1 - mean(N1)_by(C1)
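A hedged pandas sketch of these combinations on the Titanic data, treating 'pclass' as C1 and 'fare' as N1 (the derived column names are illustrative):

# Aggregate fare (N1) within each pclass (C1) and broadcast the statistics back to every row.
stats = df_titanic.groupby('pclass')['fare'].agg(['mean', 'median', 'min', 'max', 'std', 'var'])
stats.columns = ['fare_' + c + '_by_pclass' for c in stats.columns]
df_titanic = df_titanic.join(stats, on='pclass')

# Difference-style features such as N1 - mean(N1)_by(C1):
df_titanic['fare_minus_mean_by_pclass'] = df_titanic['fare'] - df_titanic['fare_mean_by_pclass']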
  • Using genetic programming to create new features

Symbolic regression based on genetic programming (gplearn is the first genetic-programming library in the Python ecosystem).

Two uses of genetic programming:

Transformation: combine and transform existing features, either with user-defined operators (unary, binary, multivariate) or with the library's built-in functions (such as addition, subtraction, multiplication, division, min, max, trigonometric, exponential and logarithmic functions). The goal is to create new features that are as "relevant" to the target value y as possible.

As the relevance metric, Spearman correlation is mostly used for decision trees (which are immune to monotone transformations), while Pearson correlation is mostly used for other algorithms such as linear regression.

Regression: the same principle as above, but the evolved expressions are used directly for regression.
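A sketch of the transformation use case, assuming gplearn's SymbolicTransformer with its function_set and metric parameters (all hyperparameter values and the X_train / y_train names are illustrative):

from gplearn.genetic import SymbolicTransformer

function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log', 'abs', 'min', 'max']
gp = SymbolicTransformer(generations=10,
                         population_size=1000,
                         hall_of_fame=100,
                         n_components=10,
                         function_set=function_set,
                         metric='spearman',   # 'pearson' is more natural for linear models
                         random_state=0)
new_features = gp.fit_transform(X_train, y_train)   # X_train / y_train assumed to exist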

  • Create new features with decision trees:

In decision-tree-family algorithms (single decision tree, GBDT, random forest), each sample is mapped to one leaf of every tree. We can therefore add, as new features, the index (a natural number) of the leaf each sample falls into in each tree, or its one-hot vector (the sparse vector obtained by dummy coding).

Implementation: scikit-learn provides the apply() and decision_path() methods for this; XGBoost can likewise output the leaf index of each sample.
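A hedged sketch with scikit-learn's random forest apply() on a few Titanic columns (the column choice and hyperparameters are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

train = df_titanic[['pclass', 'fare', 'age']].fillna(-1)
target = df_titanic['survived']

rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(train, target)

leaf_index = rf.apply(train)                               # (n_samples, n_estimators) leaf indices
leaf_features = OneHotEncoder().fit_transform(leaf_index)  # sparse one-hot leaf features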

5) Models

  • Tree models:

Tree models are not sensitive to the magnitude of feature values, so dimensionless scaling and statistical transformations can be skipped. Because tree models do not rely on sample distances for learning, categorical features do not need to be encoded either (although character features cannot be fed in directly, so at least label encoding is required). Both LightGBM and XGBoost can treat NaN as part of the data and learn from it, so missing values do not have to be handled; in other cases, missing values need to be filled.

  • Models that rely on sample distances for learning (such as linear regression, SVM, deep learning, etc.):

Numerical features require dimensionless scaling; for features with long-tailed distributions, a statistical transformation helps the model optimize better; for linear models, feature binning (discretization) can improve the model's expressive power.
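A brief sketch of these transformations on the Titanic 'fare' column (the use of log1p and StandardScaler here is an illustrative choice):

from sklearn.preprocessing import StandardScaler

fare = df_titanic[['fare']].fillna(df_titanic['fare'].median())
fare_log = np.log1p(fare)                               # statistical transform for the long tail
fare_scaled = StandardScaler().fit_transform(fare_log)  # dimensionless: zero mean, unit variance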

Note: the above is a summary based on my work and study; if there are any mistakes, please point them out, and I sincerely welcome discussion.


Author: Yue-hao Wang, future machine learning algorithm expert