001 — Data Preprocessing Techniques (Mean Removal, Range Scaling, Normalization, Binarization, One-Hot Encoding)

Environment: Python 3.5, NumPy 1.14, scikit-learn 0.19, Matplotlib 2.2

Necessity of data preprocessing: In the real world we usually have to deal with large amounts of raw data, which machine learning algorithms cannot consume directly. Preprocessing is therefore required to convert the raw data into a representation the algorithms can work with.

The most common data preprocessing techniques:


1. Mean removal

Mean removal subtracts the mean of each feature column (and divides by its standard deviation) so that every feature is centered on zero with unit variance, i.e. standardization. This eliminates bias between features.

########### standardize the data set (mean removal) ###########
import numpy as np
from sklearn import preprocessing

data=np.array([[3, -1.5, 2, -5.4],
               [0, 4, -0.3, 2.1],
               [1, 3.3, -1.9, -4.3]])  # shape=(3,4)

data_standardized=preprocessing.scale(data)

print(data_standardized.shape)
print('Mean={}'.format(data_standardized.mean(axis=0)))
print('Mean2={}'.format(np.mean(data_standardized,axis=0)))
print('standardized: ')
print(data_standardized)
print('STD={}'.format(np.std(data_standardized,axis=0)))

----------------------------------------------------------------
(3, 4)
Mean=[ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Mean2=[ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
standardized:
[[ 1.33630621 -1.40451644  1.29110641 -0.86687558]
 [-1.06904497  0.84543708 -0.14577008  1.40111286]
 [-0.26726124  0.55907366 -1.14533633 -0.53423728]]
STD=[1. 1. 1. 1.]
----------------------------------------------------------------

########################## Summary ##########################

1. After mean removal, the mean of each column of the matrix is approximately 0 and its standard deviation is 1. This keeps every feature column within a similar numeric range and prevents a column with naturally large values from dominating.

2. You can directly call the scale() method of the preprocessing module to perform mean removal on a NumPy matrix.

3. There are at least two ways to compute the mean (or std, min, max, etc.) of a NumPy matrix, as the Mean and Mean2 print statements in the code show: the ndarray method data_standardized.mean(axis=0) and the function np.mean(data_standardized, axis=0). A manual sketch of the computation follows this summary.

#############################################################
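To verify the points above by hand, here is a minimal sketch (assuming the same data matrix as above) that reproduces preprocessing.scale() with plain NumPy operations; np.allclose is used as a correctness check:

import numpy as np
from sklearn import preprocessing

data=np.array([[3, -1.5, 2, -5.4],
               [0, 4, -0.3, 2.1],
               [1, 3.3, -1.9, -4.3]])

# column-wise standardization: subtract the column mean, divide by the column std
manual=(data-data.mean(axis=0))/data.std(axis=0)

# preprocessing.scale computes exactly this (population std, ddof=0)
print(np.allclose(manual, preprocessing.scale(data)))  # True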


2. Range scaling

Necessity: the value ranges of different feature columns in a data set can vary enormously, so it is sometimes necessary to scale each feature column to a reasonable range.

########### scale the data set #########################
import numpy as np
from sklearn import preprocessing

data=np.array([[3, -1.5, 2, -5.4],
               [0, 4, -0.3, 2.1],
               [1, 3.3, -1.9, -4.3]])  # shape=(3,4)

data_scaler=preprocessing.MinMaxScaler(feature_range=(0, 1))  # scale each column to (0, 1)
data_scaled=data_scaler.fit_transform(data)

print('scaled matrix: *********************************')
print(data_scaled)

----------------------------------------------------------------
scaled matrix: *********************************
[[1.         0.         1.         0.        ]
 [0.         1.         0.41025641 1.        ]
 [0.33333333 0.87272727 0.         0.14666667]]
----------------------------------------------------------------

########################## Summary ##########################

1. After min-max scaling, every value of each feature column falls inside the requested feature_range (here (0, 1)): the column minimum maps to 0 and the column maximum maps to 1. As with mean removal, this keeps all feature columns in comparable ranges.

2. You can directly call MinMaxScaler from the preprocessing module; fit_transform() both learns each column's min/max and transforms the data in one step.

3. Unlike mean removal, min-max scaling does not center the columns at 0; it only rescales each column linearly (see the sketch after this summary).

#############################################################
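As a cross-check of point 3, a minimal sketch of the min-max formula itself; with feature_range=(0, 1) it reduces to (x - min) / (max - min), computed per column:

import numpy as np
from sklearn import preprocessing

data=np.array([[3, -1.5, 2, -5.4],
               [0, 4, -0.3, 2.1],
               [1, 3.3, -1.9, -4.3]])

# per-column min-max scaling onto [0, 1]
manual=(data-data.min(axis=0))/(data.max(axis=0)-data.min(axis=0))

scaler=preprocessing.MinMaxScaler(feature_range=(0, 1))
print(np.allclose(manual, scaler.fit_transform(data)))  # True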


3. Normalization

Normalization is used when the values of the feature vectors need to be rescaled so that every feature vector (row) lies on a common scale. The most commonly used form in machine learning adjusts each feature vector to unit L1 norm, so that the sum of the absolute values of its components is 1.

########### normalize the data set #########################
import numpy as np
from sklearn import preprocessing

data=np.array([[3, -1.5, 2, -5.4],
               [0, 4, -0.3, 2.1],
               [1, 3.3, -1.9, -4.3]])  # shape=(3,4)

data_L1_normalized=preprocessing.normalize(data,norm='l1')
print('L1 normalized matrix: *********************************')
print(data_L1_normalized)
print('sum of matrix: {}'.format(np.sum(data_L1_normalized)))

data_L2_normalized=preprocessing.normalize(data) # default: l2
print('L2 normalized matrix: *********************************')
print(data_L2_normalized)
print('sum of matrix: {}'.format(np.sum(data_L2_normalized)))

----------------------------------------------------------------
L1 normalized matrix: *********************************
[[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]
sum of matrix: 0.5656337535014005
L2 normalized matrix: *********************************
[[ 0.45017448 -0.22508724  0.30011632 -0.81031406]
 [ 0.          0.88345221 -0.06625892  0.46381241]
 [ 0.17152381  0.56602858 -0.32589524 -0.73755239]]
sum of matrix: 0.6699999596689536
----------------------------------------------------------------

########################## Summary ##########################

1. After normalization, every feature vector (row) is scaled to the same norm. This ensures data points do not differ wildly simply because of the magnitudes of their raw features, i.e. it keeps all data points on the same scale and makes different samples comparable.

2. Note the difference from mean removal: mean removal works column-wise, giving each feature column zero mean and unit variance, whereas normalization works row-wise, scaling each sample vector to unit L1 or L2 norm (a manual sketch follows this summary).

#############################################################
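A minimal sketch of the row-wise arithmetic behind preprocessing.normalize (keepdims=True keeps the norms broadcastable for the division):

import numpy as np
from sklearn import preprocessing

data=np.array([[3, -1.5, 2, -5.4],
               [0, 4, -0.3, 2.1],
               [1, 3.3, -1.9, -4.3]])

# L1: divide each row by the sum of the absolute values of its elements
manual_l1=data/np.abs(data).sum(axis=1, keepdims=True)
print(np.allclose(manual_l1, preprocessing.normalize(data, norm='l1')))  # True

# L2: divide each row by its Euclidean length
manual_l2=data/np.linalg.norm(data, axis=1, keepdims=True)
print(np.allclose(manual_l2, preprocessing.normalize(data, norm='l2')))  # True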


4. Binarization

Binarization is used to convert a numeric feature vector into a Boolean (0/1) vector.

########### binarize the data set #########################
import numpy as np
from sklearn import preprocessing

data=np.array([[3, -1.5, 2, -5.4],
               [0, 4, -0.3, 2.1],
               [1, 3.3, -1.9, -4.3]])  # shape=(3,4)

data_binarized=preprocessing.Binarizer(threshold=1.4).transform(data)
print('binarized matrix: *********************************')
print(data_binarized)

----------------------------------------------------------------
binarized matrix: *********************************
[[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [0. 1. 0. 0.]]
----------------------------------------------------------------

########################## Summary ##########################

1. After binarization all data points are either 0 or 1, hence the name.

2. The rule is simple: every value greater than threshold becomes 1, and every value less than or equal to threshold becomes 0 (see the sketch after this summary).

3. Binarization is often used to mark whether a feature is present (set to 1) or absent (set to 0).

#############################################################
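Point 2 is a single comparison; here is a minimal sketch equivalent to Binarizer(threshold=1.4):

import numpy as np
from sklearn import preprocessing

data=np.array([[3, -1.5, 2, -5.4],
               [0, 4, -0.3, 2.1],
               [1, 3.3, -1.9, -4.3]])

# strictly greater than the threshold -> 1.0, otherwise -> 0.0
manual=(data>1.4).astype(np.float64)
print(np.allclose(manual, preprocessing.Binarizer(threshold=1.4).transform(data)))  # True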


5. One-hot Encoding

Often the feature values to be processed are sparse and scattered across a wide range, and we do not need to store those raw magnitudes. In such cases we use one-hot encoding, which is effectively a tool for compacting feature vectors.

########### one-hot encode the data set #########################
import numpy as np
from sklearn import preprocessing

data=np.array([[0, 2, 1, 12],
               [1, 3, 5, 3],
               [2, 3, 2, 12],
               [1, 2, 4, 3]])  # shape=(4,4)

encoder=preprocessing.OneHotEncoder()
encoder.fit(data)
encoded_vector=encoder.transform([[2, 3, 5, 3]]).toarray()
print('one-hot encoded matrix: *********************************')
print(encoded_vector.shape)
print(encoded_vector)

----------------------------------------------------------------
one-hot encoded matrix: *********************************
(1, 11)
[[0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
----------------------------------------------------------------

########################## Summary ##########################

1. One-hot encoding compacts sparse, scattered feature values: instead of storing raw values such as 12, each value is represented by its position among the distinct values of its column. Here the shape=(4,4) data set defines an encoding that maps every 4-feature sample to an 11-dimensional binary vector (the output has shape=(1,11)).

2. Encoding method: the encoder is fitted on the original data set and then used to encode new data. Column 0 has three distinct values (0, 1, 2), so it occupies three dimensions: 0→100, 1→010, 2→001. Column 1 has two distinct values (2, 3), so only two dimensions: 2→10, 3→01. Column 2 has four distinct values (1, 2, 4, 5), so four dimensions: 1→1000, 2→0100, 4→0010, 5→0001. Column 3 has two distinct values (3, 12), so two dimensions: 3→10, 12→01. For the new sample [[2, 3, 5, 3]], the 2 in column 0 encodes as 001, the 3 in column 1 as 01, the 5 in column 2 as 0001, and the 3 in column 3 as 10; concatenated, this is the (1, 11) vector in the output.

3. If the new data contains a value the encoder never saw, e.g. [[2, 3, 5, 4]], where 4 does not occur in column 3 (which has only the two discrete values 3 and 12), that column encodes as 00 and the output becomes [[0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]]; note that the second-to-last number changes to 0 (see the sketch after this summary).

#############################################################
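To make point 3 concrete, here is a small sketch that passes handle_unknown='ignore' to OneHotEncoder; it is set explicitly so that an unseen value encodes as all zeros rather than raising an error in newer scikit-learn versions:

import numpy as np
from sklearn import preprocessing

data=np.array([[0, 2, 1, 12],
               [1, 3, 5, 3],
               [2, 3, 2, 12],
               [1, 2, 4, 3]])

encoder=preprocessing.OneHotEncoder(handle_unknown='ignore')
encoder.fit(data)

# 4 was never seen in column 3 (only 3 and 12), so its two slots stay 0
print(encoder.transform([[2, 3, 5, 4]]).toarray())
# [[0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]]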


Note: the code for this article has been uploaded to my GitHub; you are welcome to download it.

References:

1. Python Machine Learning Cookbook (Chinese edition: Classic Examples of Python Machine Learning), Prateek Joshi; translated by Tao Junjie and Chen Xiaoli.