I find that many people are like me: they have only a vague grasp of basic machine learning concepts. For example, I kept hearing about normalization and standardization but could not tell them apart. Recently I read an article that lays out clearly the definitions, application scenarios, physical meaning, and practical significance of normalization and standardization. With the original author's permission, I am reposting it with my own understanding added, so that more people can learn and improve together.

Two terms come up constantly in machine learning and data mining: normalization and standardization. What exactly are they? What benefits do they bring? How are they used? This article discusses these questions in detail.

One, what are they

1. Normalization

The most common method maps the original data onto [0, 1] through a linear transformation:

$$x^{*} = \frac{x - \min}{\max - \min}$$
Here $\min$ is the minimum value in the sample and $\max$ is the maximum. Note that in streaming-data scenarios the maximum and minimum change over time. Moreover, both are easily distorted by outliers, so this method is not robust and is best suited to traditional, clean, small-data settings.
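To make the definition concrete, here is a minimal NumPy sketch (the function name and toy data are mine, not the article's):

```python
import numpy as np

def min_max_normalize(x):
    """Linearly map a 1-D sample onto [0, 1] using its min and max."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

data = np.array([2.0, 5.0, 9.0, 13.0])
print(min_max_normalize(data))   # [0.  0.2727...  0.6364...  1.]

# One outlier stretches the range and squeezes everything else together,
# which is exactly the robustness problem described above:
print(min_max_normalize(np.append(data, 1000.0)))
```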

2. Standardization

The most common method is z-score standardization, after which the data have mean 0 and standard deviation 1:

$$x^{*} = \frac{x - \mu}{\sigma}$$
Here $\mu$ is the mean of the sample and $\sigma$ is its standard deviation, both of which can be estimated from the available samples. Given enough samples these estimates are stable, which makes the method well suited to modern, noisy, big-data scenarios.
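The corresponding sketch for z-scores (again, the function and data are illustrative):

```python
import numpy as np

def z_score_standardize(x):
    """Shift a 1-D sample to mean 0 and scale it to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

data = np.array([2.0, 5.0, 9.0, 13.0, 1000.0])
z = z_score_standardize(data)
print(z.mean(), z.std())   # ~0.0 and 1.0, even with the outlier present
```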

Two, what do they bring

The rationale for normalization is simple. Different variables often carry different units, and normalization removes the influence of those units so that different variables become comparable. For example, suppose two people differ by 10 kg in weight and 0.02 m in height. When measuring how different the two people are, the weight difference completely drowns out the height difference; after normalization the problem disappears.
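A quick numeric check of the weight/height example (the numbers are made up for illustration):

```python
import numpy as np

# Two hypothetical people as (weight in kg, height in m).
a = np.array([70.0, 1.78])
b = np.array([80.0, 1.80])

# Raw Euclidean distance: the 10 kg weight gap swamps the 0.02 m height gap.
print(np.linalg.norm(a - b))   # ~10.00002

# After column-wise min-max normalization both dimensions lie in [0, 1]
# and contribute on a comparable scale. (With only two samples each column
# collapses to {0, 1}, which makes the effect easy to see.)
X = np.vstack([a, b])
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.linalg.norm(X_norm[0] - X_norm[1]))   # sqrt(2) ≈ 1.414
```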

The rationale for standardization is more involved. A standardized value expresses how many standard deviations the original value lies from the mean. It is a relative quantity, so it removes units as well, and it brings two additional benefits: a mean of 0 and a standard deviation of 1.

What is good about a mean of 0? It centers the data around 0, which is convenient in many ways. For example, running SVD on mean-centered data is equivalent to running PCA on the original data. And in machine learning, many functions such as sigmoid, tanh, and softmax are centered around 0 (though not necessarily symmetric about it).
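The SVD/PCA equivalence is easy to verify numerically; a minimal sketch, assuming NumPy and scikit-learn are available (the random data is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# PCA fitted on the original data.
pca = PCA(n_components=3).fit(X)

# Plain SVD on the mean-centered ("decentralized") data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# The principal axes agree up to the sign of each component.
print(np.allclose(np.abs(pca.components_), np.abs(Vt)))   # True
```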

What is the advantage of a standard deviation of 1? This is a bit more involved. The distance between two points $x_i$ and $x_{i'}$ is often expressed as

$$D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j \cdot d_j(x_{ij}, x_{i'j})$$
where $d_j(x_{ij}, x_{i'j})$ is the distance between the two points along attribute $j$, and $w_j$ is the weight of that attribute's distance within the total distance. Note that setting $w_j = 1$ for all $j$ does not make every attribute contribute equally to the final result. For a given data set, the average distance over all pairs of points,

$$\bar{D} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j \cdot \bar{d}_j$$
is a constant, where

$$\bar{d}_j = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d_j(x_{ij}, x_{i'j})$$
Setting $w_j \sim 1/\bar{d}_j$ therefore makes all attributes contribute equally to the mean distance over the whole data set. Now let $d_j$ be the squared Euclidean distance (the squared 2-norm), one of the most commonly used distance measures; then

$$\bar{d}_j = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} (x_{ij} - x_{i'j})^2 = 2 \cdot \mathrm{var}_j$$
where $\mathrm{var}_j$ is the sample estimate of $\mathrm{Var}(X_j)$. In other words, under equal weights the importance of each variable in the distance is proportional to its variance on the data set. If we make the standard deviation of every dimension equal to 1, every dimension becomes equally important when computing distances.
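A numeric check of the last identity, $\bar{d}_j = 2 \cdot \mathrm{var}_j$ (random data; note this uses the population variance, NumPy's default):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # one attribute, N = 500

# Mean squared difference over all N^2 ordered pairs of points.
d_bar = np.mean((x[:, None] - x[None, :]) ** 2)

# It equals twice the (population) variance of the attribute.
print(np.allclose(d_bar, 2 * x.var()))   # True
```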

Three, how to use them

When the task involves computing distances between points, applying normalization or standardization can improve the final result, sometimes qualitatively. So how do you choose between the two? Following the previous section: if all dimensions should be treated equally and play the same role in the final distance computation, choose standardization; if you want to preserve the potential weighting relationships reflected by the standard deviations of the original data, choose normalization. In addition, standardization is better suited to modern, noisy, big-data scenarios.
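In practice both transforms are one-liners in scikit-learn; a sketch (the toy matrix is mine, for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[70.0, 1.78],
              [80.0, 1.80],
              [60.0, 1.65]])

# Normalization: every column linearly mapped onto [0, 1].
print(MinMaxScaler().fit_transform(X))

# Standardization: every column shifted and scaled to mean 0, std 1.
print(StandardScaler().fit_transform(X))
```

Whichever you choose, fit the scaler on the training set only and reuse it to transform validation and test data, so that information from the test set does not leak into the preprocessing.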