
In this article, we will show how to use the scikit-learn machine learning toolkit to perform PCA dimensionality reduction.

PCA: Principal Component Analysis

1. Introduction

Basic idea: the larger the variance along a direction, the more spread out the data is along that direction. So we choose the direction with the largest variance in the raw data (features) as the first axis of the new feature space; on this basis, the second axis must be orthogonal to the first and have the largest variance among the remaining directions, and so on.

  • The algorithm steps of PCA are summarized as follows (a NumPy sketch of these steps follows below):

    Suppose we have m samples of n-dimensional data.

    1. Arrange the original data into a matrix X with n rows and m columns, one sample per column;
    2. Zero-center each row of X by subtracting the mean of that row;
    3. Compute the covariance matrix C = (1/m)XXᵀ;
    4. Compute the eigenvalues and corresponding eigenvectors of the covariance matrix, arrange the eigenvectors as rows of a matrix ordered from the largest eigenvalue to the smallest, and take the first k rows to form a matrix P;
    5. Y = PX is the data after dimensionality reduction to k dimensions.

For the detailed derivation, see zhuanlan.zhihu.com/p/77151308 (highly recommended).
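
To make these steps concrete, here is a minimal NumPy sketch of the procedure above. The data matrix is randomly generated and the variable names are purely illustrative:

import numpy as np

# m samples of n-dimensional data, arranged as an n x m matrix X (one sample per column)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))                    # n = 5 features, m = 100 samples
k = 2                                            # target dimensionality

X_centered = X - X.mean(axis=1, keepdims=True)   # step 2: zero-mean each row
C = X_centered @ X_centered.T / X.shape[1]       # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)             # step 4: eigendecomposition (ascending eigenvalues)
order = np.argsort(eigvals)[::-1]                # sort eigenvalues from largest to smallest
P = eigvecs[:, order[:k]].T                      # first k eigenvectors as the rows of P
Y = P @ X_centered                               # step 5: Y = PX, a k x m matrix of reduced data
print(Y.shape)                                   # (2, 100)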

2. Code implementation

import sklearn.decomposition as sk_decomposition

# fr is the input data matrix (n_samples x n_features), assumed to have been loaded beforehand
pca = sk_decomposition.PCA(n_components=3, whiten=False, svd_solver='auto')
pca.fit(fr)                 # learn the principal components from the data
result = pca.transform(fr)  # project the data onto the first 3 components
print(result)
print('Ratio of the variance of each principal component to the total variance after dimensionality reduction', pca.explained_variance_ratio_)
print('Variance of each principal component after dimensionality reduction', pca.explained_variance_)
print('Number of features after dimensionality reduction', pca.n_components_)
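
For a concrete, runnable example, here is a sketch on the iris dataset that ships with scikit-learn (the dataset is chosen purely for illustration):

from sklearn.datasets import load_iris
import sklearn.decomposition as sk_decomposition

X = load_iris().data                      # 150 samples x 4 features
pca = sk_decomposition.PCA(n_components=3, whiten=False, svd_solver='auto')
reduced = pca.fit_transform(X)            # fit and transform in one step

print(reduced.shape)                      # (150, 3)
print(pca.explained_variance_ratio_)      # share of the total variance per component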

3. Explanation of PCA in sklearn

3.1 Parameters

  • n_components: this parameter specifies the number of feature dimensions that PCA should keep. The most common approach is to give the target dimensionality directly, in which case n_components is an integer greater than or equal to 1. We can instead give a minimum threshold for the proportion of total variance that the principal components must explain, and let the PCA class decide the dimensionality from the sample feature variances; in this case n_components is a number in (0, 1). We can also set the parameter to 'mle', in which case the PCA class uses the MLE algorithm to choose the number of principal components from the variance distribution of the features. Finally, we can use the default, i.e. not set n_components at all, in which case n_components = min(number of samples, number of features);
  • whiten: whether to whiten. Whitening normalizes each feature of the dimensionality-reduced data so that its variance is 1. For PCA dimensionality reduction itself, whitening is generally not required; if there are further data-processing steps after PCA, whitening can be considered. The default is False, i.e. no whitening;
  • svd_solver: specifies the method used for singular value decomposition (SVD). Since eigendecomposition is a special case of SVD, PCA libraries are generally implemented on top of SVD. There are four possible values: {'auto', 'full', 'arpack', 'randomized'}. 'randomized' is generally suited to PCA on data with many samples, many dimensions, and a low proportion of principal components; it uses a randomized algorithm to accelerate SVD. 'full' is SVD in the traditional sense, using the corresponding implementation from scipy. The use cases for 'arpack' are similar to those for 'randomized'; the difference is that 'randomized' uses scikit-learn's own SVD implementation, while 'arpack' uses scipy's sparse SVD implementation. The default is 'auto', i.e. the PCA class weighs the trade-offs among the solvers above and chooses an appropriate one. In general the default is sufficient (the sketch after this list shows the different ways of setting these parameters);
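
The different ways of setting these parameters might look like the sketch below (construction only; the objects still need to be fitted to data):

import sklearn.decomposition as sk_decomposition

# Keep exactly 2 principal components
pca_int = sk_decomposition.PCA(n_components=2)

# Keep as many components as needed to explain at least 95% of the total variance
pca_ratio = sk_decomposition.PCA(n_components=0.95)

# Let the MLE algorithm choose the number of components
pca_mle = sk_decomposition.PCA(n_components='mle')

# Whitening plus the randomized solver, e.g. for large, high-dimensional data
pca_fast = sk_decomposition.PCA(n_components=10, whiten=True, svd_solver='randomized')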

3.2 Attributes

  • explained_variance_ratio_: the ratio of the variance of each principal component to the total variance after dimensionality reduction;
  • explained_variance_: the variance of each principal component after dimensionality reduction;
  • n_components_: the number of features after dimensionality reduction.
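
As a small sketch of how these attributes are typically used (again on the iris dataset, purely for illustration), the cumulative explained variance ratio is a common way to check how many components are worth keeping:

import numpy as np
from sklearn.datasets import load_iris
import sklearn.decomposition as sk_decomposition

pca = sk_decomposition.PCA(n_components=3).fit(load_iris().data)

print(pca.explained_variance_ratio_)             # variance share of each principal component
print(pca.explained_variance_)                   # variance of each principal component
print(pca.n_components_)                         # number of components actually kept
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative share, useful for choosing k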