Introduction to Machine Learning — How to Build a Complete Machine Learning Project, Part 5!

The first four articles in this series:

  • How to Build a Complete Machine Learning Project
  • Machine learning data set acquisition and test set construction method
  • Data Preprocessing for Feature Engineering (Part 1)
  • Data Preprocessing for Feature Engineering (Part 2)

This article continues the discussion of feature engineering, this time introducing feature scaling and feature encoding. The former mainly covers normalization and regularization, which remove the influence of differing scales (dimensions); the latter includes ordinal encoding, one-hot encoding, and so on, and mainly deals with categorical, textual, and continuous features.


3.2 Feature scaling

Feature scaling can be divided into two methods: normalization and regularization.

3.2.1 Normalization
  1. Normalization is not limited to features; it can also be applied to the raw data. It scales features (or data) into a specified range so that they are on roughly the same scale.
  2. There are two reasons for normalization:
  • Some algorithms require the sample data or the features to have zero mean and unit variance.
  • To eliminate the dimensional differences between sample data or features, i.e., to remove the influence of differing orders of magnitude. Consider the contour lines of an objective function over two attributes:
    • A difference in order of magnitude makes the attribute with the larger magnitude dominate: it compresses the elliptical contour lines into nearly straight lines, so the objective function depends almost only on that attribute.
    • A difference in order of magnitude slows the convergence of iterative optimization. With the original features, the gradient direction at each step of gradient descent deviates from the direction of the minimum (the center of the contours), so many iterations are needed and the learning rate must be very small, otherwise the iterates oscillate widely. After standardization, the gradient at each step points almost directly at the minimum, so fewer iterations are needed.
    • All algorithms that rely on distances between samples are very sensitive to the order of magnitude of the data. For example, KNN must find the k samples closest to the current sample; when the attributes have different magnitudes, the selected k samples will differ.

  3. There are two commonly used normalization methods:
  • Min-Max Scaling. It applies a linear transformation to the original data so that the result is mapped into [0, 1], scaling the original data proportionally. The formula is:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X$ is the raw data, and $X_{max}$ and $X_{min}$ are the maximum and minimum values of the data, respectively.

  • Zero-Mean Normalization (Z-score). It maps the original data onto a distribution with mean 0 and standard deviation 1. Suppose the mean of the original feature is $\mu$ and its standard deviation is $\sigma$; the formula is:

$$z = \frac{x - \mu}{\sigma}$$
  4. If the data set is split into a training set, a validation set, and a test set, all three must use the same normalization parameters, and those parameters are computed from the training set only: the maximum and minimum for min-max scaling, and the mean and variance for z-score normalization (this is similar to how Batch Normalization is implemented in deep learning).
  5. Normalization is not a panacea. In practice, models solved by gradient descent need normalization, including linear regression, logistic regression, support vector machines, neural networks, and so on. Decision tree models, however, do not. Take the C4.5 algorithm as an example: the decision tree splits nodes mainly according to the information gain ratio of data set D with respect to feature x, and the information gain ratio is unaffected by whether the feature has been normalized; normalization does not change the information gain of the samples on feature x.
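Below is a minimal sketch of both methods using scikit-learn's `MinMaxScaler` and `StandardScaler` (the arrays are toy data standing in for real features). The key point from above is that the statistics are computed on the training set only and then reused for the other splits:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data; replace with your own (n_samples, n_features) arrays.
X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])
X_test = np.array([[1.5, 500.0]])

# Min-max scaling: maps each training feature linearly into [0, 1].
minmax = MinMaxScaler().fit(X_train)      # max/min come from the training set only
X_test_mm = minmax.transform(X_test)      # the test set reuses those statistics

# Z-score normalization: per-feature mean 0 and standard deviation 1.
zscore = StandardScaler().fit(X_train)    # mean/variance from the training set only
X_test_z = zscore.transform(X_test)

print(X_test_mm)
print(X_test_z)
```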
3.2.2 Regularization
  1. Regularization scales a norm (e.g., the L1 or L2 norm) of a sample or feature to 1.

Suppose the data set is

$$D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\}, \quad \mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})^T$$

First compute the Lp norm of a sample:

$$L_p(\mathbf{x}_i) = \left( |x_{i,1}|^p + |x_{i,2}|^p + \dots + |x_{i,d}|^p \right)^{1/p}$$

The regularized result divides each attribute value by the sample's Lp norm:

$$\mathbf{x}_i' = \left( \frac{x_{i,1}}{L_p(\mathbf{x}_i)}, \frac{x_{i,2}}{L_p(\mathbf{x}_i)}, \dots, \frac{x_{i,d}}{L_p(\mathbf{x}_i)} \right)^T$$

  2. Regularization operates on a single sample: each sample is scaled to unit norm.

    Normalization, by contrast, operates on a single attribute and requires the values of all samples on that attribute.

  3. This method is useful when a quadratic form (such as the dot product) or some other kernel method is used to compute the similarity between two samples.
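As a small sketch, scikit-learn's `Normalizer` performs exactly this per-sample scaling; the manual NumPy lines below show the same division by the Lp norm (the array `X` is toy data):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],    # L2 norm = 5  -> becomes (0.6, 0.8)
              [1.0, 1.0]])   # L2 norm = sqrt(2)

# scikit-learn version: norm can be "l1", "l2" or "max".
X_l2 = Normalizer(norm="l2").fit_transform(X)

# Manual equivalent: divide every attribute of a sample by that sample's Lp norm.
p = 2
norms = np.linalg.norm(X, ord=p, axis=1, keepdims=True)
X_manual = X / norms

print(X_l2)
print(X_manual)
```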

3.3 Feature Encoding

3.3.1 Ordinal Encoding

Definition: Ordinal encoding is generally used for categorical data whose categories have an ordering (a size relationship).

For example, grades can be divided into high, medium, and low, with the ordering “high > medium > low”. Ordinal encoding can encode these three levels as: high = 3, medium = 2, low = 1, so the ordering is preserved after conversion.
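A minimal sketch of this encoding (the mapping dictionary simply restates high = 3, medium = 2, low = 1 from the example):

```python
import pandas as pd

grades = pd.Series(["high", "low", "medium", "high"])

# Explicit mapping that preserves the ordering high > medium > low.
order = {"low": 1, "medium": 2, "high": 3}
encoded = grades.map(order)

print(encoded.tolist())   # [3, 1, 2, 3]
```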

3.3.2 One-hot Encoding

Definition: One-hot encoding is usually used for features whose categories have no ordering.

One-hot encoding uses N bits to encode N possible values. Take blood type, which has four values (A, B, AB, and O): one-hot encoding converts blood type into a 4-dimensional sparse vector, with the four blood types represented as:

  • Type A: (1, 0, 0, 0)
  • Type B: (0, 1, 0, 0)
  • Type AB: (0, 0, 1, 0)
  • Type O: (0, 0, 0, 1)
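A short sketch of this encoding with scikit-learn's `OneHotEncoder` (the sample column is illustrative; `pandas.get_dummies` would work just as well):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column of blood types, shape (n_samples, 1).
blood_types = np.array([["A"], ["B"], ["AB"], ["O"], ["A"]])

# Fixing the category order so the columns match the list above.
encoder = OneHotEncoder(categories=[["A", "B", "AB", "O"]])
one_hot = encoder.fit_transform(blood_types)   # returns a sparse matrix by default

print(one_hot.toarray())
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [1. 0. 0. 0.]]
```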

The advantages of one-hot encoding are as follows:

  • It can handle non-numeric attributes, such as blood type or gender.
  • It expands the feature set to some extent.
  • The encoded vector is sparse: only one bit is 1 and all the others are 0, and this sparsity can be exploited to save storage space.
  • It can handle missing values: when all bits are 0, the value is missing. The high-dimensional mapping approach mentioned in the discussion of missing values can then be applied, using an (N+1)-th bit to represent missing values.

Of course, one-hot encoding also has some disadvantages:

1. High-dimensional features bring the following problems:

  • In the KNN algorithm, distances between two points in a high-dimensional space are hard to measure effectively.
  • In a logistic regression model, the number of parameters grows with the dimensionality, which makes the model more complex and prone to overfitting.
  • Usually only some of the dimensions are helpful for classification or prediction, so the dimensionality needs to be reduced by feature selection.

2. One-hot encoding of discrete features is not recommended for decision tree models, for two main reasons:

  • It produces unbalanced splits, whose split gain is very small.

    For example, if blood type is one-hot encoded, then for each resulting feature (A, B, AB, or O) only a small number of samples take the value 1 while a large number take the value 0.

    The gain of such a split is very small, because after splitting:

    • The smaller child sample set accounts for a tiny fraction of all samples. Whatever its gain, it becomes almost negligible once multiplied by this fraction.

    • The larger child sample set is almost the original sample set, so its gain is almost zero.

  • It harms the learning of the decision tree.

    Decision trees rely on statistics of the data. One-hot encoding splits the data into many scattered small spaces, and the statistics in these small spaces are inaccurate, so learning deteriorates.

    The essence is that features have weak expressive power after one-hot encoding. The predictive power of the original feature is artificially split into many pieces, each of which fails to compete with the other features for the best split point, so in the end the feature appears less important than it actually is.

3.3.3 Binary Encoding

Binary encoding is divided into two main steps:

  1. First, each category is assigned a category ID by ordinal encoding.
  2. The binary representation of that category ID is then used as the encoded result.

Continuing with blood type as an example, see the following table:

Blood type | Category ID | Binary representation | One-hot encoding
A          | 1           | 0 0 1                 | 1 0 0 0
B          | 2           | 0 1 0                 | 0 1 0 0
AB         | 3           | 0 1 1                 | 0 0 1 0
O          | 4           | 1 0 0                 | 0 0 0 1

As the table shows, binary encoding essentially hashes the category ID into its binary representation to obtain a 0/1 feature vector, whose dimension is smaller than that of one-hot encoding, so it saves more storage space.
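A small hand-rolled sketch of the two steps (the helper `binary_encode` is a hypothetical name, not a library function):

```python
import numpy as np

categories = ["A", "B", "AB", "O"]

# Step 1: ordinal encoding assigns each category an ID (1-based, as in the table).
category_id = {c: i + 1 for i, c in enumerate(categories)}

# Step 2: the binary representation of the ID becomes the feature vector.
n_bits = int(np.ceil(np.log2(len(categories) + 1)))   # 3 bits cover IDs 1..4

def binary_encode(value):
    cid = category_id[value]
    return [(cid >> bit) & 1 for bit in reversed(range(n_bits))]

for blood_type in categories:
    print(blood_type, binary_encode(blood_type))
# A [0, 0, 1]   B [0, 1, 0]   AB [0, 1, 1]   O [1, 0, 0]
```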

3.3.4 Binarization

Definition: Feature binarization converts numeric attributes into Boolean attributes. It is usually used when the attribute values are assumed to follow a Bernoulli distribution.

The binarization algorithm is simple. Specify a threshold m for attribute j:

  • if the sample's value on attribute j is greater than or equal to m, it becomes 1 after binarization;
  • if the sample's value on attribute j is less than m, it becomes 0 after binarization.

By this definition, the threshold m is a key hyperparameter, and its value needs to be chosen according to the model and the specific task.
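A minimal sketch with a toy column and an illustrative threshold m = 30. Note that scikit-learn's `Binarizer` uses a strict “greater than” rule, so the plain NumPy line is the exact “>= m” rule from the definition above:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[12.0], [30.0], [45.0], [7.0]])
m = 30.0

# sklearn: values > threshold become 1, values <= threshold become 0.
print(Binarizer(threshold=m).fit_transform(X).ravel())   # [0. 0. 1. 0.]

# NumPy equivalent of ">= m -> 1, < m -> 0":
print((X >= m).astype(int).ravel())                      # [0 1 1 0]
```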

3.3.5 Discretization

Definition: As the name implies, discretization is the conversion of continuous numerical attributes to discrete numerical attributes.

So when do you need to use feature discretization?

Behind this question lies a choice between “a large number of discrete features + a simple model” and “a small number of continuous features + a complex model”.

  • For linear models, “a large number of discrete features + a simple model” is usually used.
    • Advantage: the model is simple.
    • Disadvantage: feature engineering is difficult; but successful experience, once gained, can be reused, and many people can work on it in parallel.
  • For nonlinear models (such as deep learning), “a small number of continuous features + a complex model” is usually used.
    • Advantage: no complex feature engineering is required.
    • Disadvantage: the model is complex.

Bucketing

1. Bucketing is commonly used for discretization (a code sketch is given after this list):

  • Sort the values of the continuous numeric attribute j of all samples in ascending order.
  • Then choose bucket boundaries $b_1 < b_2 < \dots < b_M$ from small to large. Here:
    • $M$, the number of buckets, is a hyperparameter that must be specified manually.
    • The size of each bucket (i.e., where its boundaries lie) is also a hyperparameter that must be specified manually.
  • Given the value $a_{i,j}$ of attribute j for a sample, assign it to a bucket:
    • if $a_{i,j} < b_1$, the bucket index is 0, and the attribute value after bucketing is 0;
    • if $b_k \le a_{i,j} < b_{k+1}$, the bucket index is $k$, and the attribute value after bucketing is $k$;
    • if $a_{i,j} \ge b_M$, the bucket index is $M$, and the attribute value after bucketing is $M$.

2. The number of buckets and the bucket boundaries must be specified manually. There are generally two approaches:

  • Specify them based on business-domain experience. For example, when bucketing annual income, 5 buckets can be chosen based on the per capita disposable income of Chinese residents in 2017, roughly 26,000 yuan:
    • annual income below 13,000 yuan (less than 0.5 times the per capita value): bucket 0;
    • annual income of 13,000 to 52,000 yuan (0.5 to 2 times): bucket 1;
    • annual income of 52,000 to 260,000 yuan (2 to 10 times): bucket 2;
    • annual income of 260,000 to 2.6 million yuan (10 to 100 times): bucket 3;
    • annual income above 2.6 million yuan: bucket 4.
  • Specify them based on the model. Train on the bucketed data set for the specific task and determine the optimal number of buckets and bucket boundaries by hyperparameter search.

3. Some practical experience can guide the choice of bucket size:

  • Buckets must be small enough that changes of the attribute value within a bucket have only a small effect on the sample label.

    In other words, it should never happen that the sample labels vary greatly within a single bucket.

  • Buckets must be large enough that each bucket contains enough samples.

    If a bucket contains too few samples, the randomness is too high and the statistics are not convincing.

  • Samples should be distributed as evenly as possible across the buckets.
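A short sketch of bucketing with manually specified boundaries, reusing the annual-income buckets above (incomes are in units of 10,000 yuan and the sample values are made up):

```python
import numpy as np

# Annual income in units of 10,000 yuan.
income = np.array([0.8, 2.0, 15.0, 80.0, 500.0])

# Bucket boundaries b_1 < b_2 < ... < b_M (here M = 4): 1.3, 5.2, 26 and 260.
boundaries = [1.3, 5.2, 26.0, 260.0]

# np.digitize returns 0 for x < b_1, k for b_k <= x < b_{k+1}, and M for x >= b_M.
buckets = np.digitize(income, boundaries)
print(buckets)   # [0 1 2 3 4]
```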

Characteristics

1. In industry, continuous values are rarely fed directly into a logistic regression model as features. Instead, the continuous features are discretized into a series of 0/1 discrete features.

Its advantages are:

  • The inner products of the sparse vectors obtained after discretization are fast to compute, and the results are convenient to store.

  • Discretized features are robust to abnormal data.

    For example, suppose sales volume in [30, 100) is encoded as 1 and anything else as 0. Without discretization, an outlier of 10,000 would interfere with the model considerably: because of its large value, it would have a large influence on the learned weights.

  • Logistic regression is a generalized linear model with limited expressive power; it can only describe linear relationships. Discretizing features effectively introduces nonlinearity, which improves the expressive power of the model and enhances its fitting ability.

    Suppose a continuous feature $x_j$ is discretized into $M$ 0/1 features $x_{j,1}, x_{j,2}, \dots, x_{j,M}$. The term $w_j \cdot x_j$ in the linear model then becomes:

    $$w_j \cdot x_j \rightarrow w_{j,1} \cdot x_{j,1} + w_{j,2} \cdot x_{j,2} + \dots + w_{j,M} \cdot x_{j,M}$$

    where $x_{j,1}, \dots, x_{j,M}$ are the new features after discretization, each taking values in $\{0, 1\}$.

    The right-hand side is a piecewise linear mapping, which is more expressive.

  • After discretization, feature crossing can be performed. Suppose continuous feature j is discretized into N 0/1 features and continuous feature k is discretized into M 0/1 features; discretizing them separately introduces N + M features.

    If features j and k are instead discretized jointly (crossed), N × M combined features can be obtained. This introduces further nonlinearity and improves the expressive power of the model. (A short code sketch of such a crossing is given at the end of this subsection.)

  • After discretization, the model is more stable.

    For example, if sales volume is discretized with [30, 100) as one interval, then fluctuations of the sales volume around 40 do not change the value of the discretized feature.

    However, values near the interval boundaries need careful handling, as does the way the intervals are chosen.

2. Feature discretization simplifies the logistic regression model and reduces the risk of model overfitting.

The reason it resists overfitting: after discretization, the model no longer fits specific feature values but a coarser notion of the feature, so it tolerates perturbations in the data and is more robust.

In addition, it greatly reduces the number of distinct values the model has to fit, lowering the model's complexity.
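A small sketch of crossing two already-bucketed features by forming a joint index and one-hot encoding it (the bucket counts and sample indices are illustrative):

```python
import numpy as np

# Bucket indices of two discretized features over 5 samples.
j_bucket = np.array([0, 1, 2, 1, 0])   # feature j discretized into N = 3 buckets
k_bucket = np.array([1, 0, 1, 1, 0])   # feature k discretized into M = 2 buckets
N, M = 3, 2

# Joint discretization: each (j, k) pair gets its own index in [0, N*M).
cross_index = j_bucket * M + k_bucket

# One-hot encode the crossed index -> N*M = 6 combined 0/1 features per sample.
crossed = np.eye(N * M, dtype=int)[cross_index]
print(cross_index)   # [1 2 5 3 0]
print(crossed)
```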


Summary

Feature scaling is a very common preprocessing step, especially normalization of feature data; for algorithms that learn model parameters by gradient descent, it helps speed up training convergence. Feature encoding, especially one-hot encoding, is also widely used when preprocessing structured data.


