Introduction to PCA

Principal Component Analysis (PCA) is one of the most important dimensionality reduction methods. It is widely used for data compression, redundancy elimination, and noise removal, and it is generally the simplest algorithm for dimensionality reduction. PCA is mainly used for: (1) exploration and visualization of high-dimensional data sets; (2) data compression; (3) data preprocessing; (4) analysis and processing of images, speech, and communication signals; (5) dimensionality reduction (the most important use), removing redundancy and noise from the data.

1. The idea of PCA

PCA, as the name implies, finds the most important aspects of the data and uses them to replace the original data. Specifically, suppose our data set consists of m samples, each n-dimensional. We want to reduce these m samples from n dimensions down to n' dimensions (n' < n), while keeping the m n'-dimensional samples as representative of the original data set as possible. Some information is inevitably lost when going from n dimensions to n' dimensions, but we want that loss to be as small as possible. So how do we make the n'-dimensional data as close to the original data as possible?

In essence, PCA takes the directions of largest variance as the main features and "de-correlates" the data along orthogonal directions, that is, makes the components uncorrelated across the different orthogonal directions. The data is transformed from the original coordinate system into a new one, and the choice of the new coordinate system is determined by the data itself. The first new axis is the direction of greatest variance in the original data; the second new axis is the direction of greatest variance that is orthogonal to the first; and so on, repeating as many times as there are features in the original data. It turns out that most of the variance is contained in the first few of these new axes, so we can ignore the remaining axes, that is, reduce the dimension of the data.
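In more formal terms (a standard formulation of the same idea, added here for context): let $X$ be the centered $m \times n$ data matrix and $C = \frac{1}{m} X^\top X$ its covariance matrix. The first principal direction is the unit vector that maximizes the variance of the projected data,

$$
w_1 = \arg\max_{\|w\|=1} \frac{1}{m}\|Xw\|^2 = \arg\max_{\|w\|=1} w^\top C w,
$$

and the maximizer is the eigenvector of $C$ with the largest eigenvalue. Each subsequent direction is the eigenvector with the next-largest eigenvalue, orthogonal to all previous directions.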

2. Algorithm flow of PCA

  1. De-mean the data, that is, subtract the mean of each feature;
  2. Compute the covariance matrix;
  3. Compute the eigenvalues and eigenvectors of the covariance matrix;
  4. Sort the eigenvalues from largest to smallest;
  5. Retain the eigenvectors corresponding to the largest eigenvalues;
  6. Transform the data into the new space spanned by those eigenvectors (a minimal NumPy sketch of these steps follows this list).
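A minimal sketch of these six steps using NumPy's symmetric eigensolver; the variable names (X, k, X_reduced) are illustrative and not from the original post:

import numpy as np

def pca_steps(X, k):
    """Reduce an (m, n) data matrix X to k dimensions following the steps above."""
    X_centered = X - X.mean(axis=0)          # 1. subtract the mean of each feature
    C = np.cov(X_centered, rowvar=False)     # 2. covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(C)     # 3. eigenvalues/eigenvectors (C is symmetric)
    order = np.argsort(eigvals)[::-1]        # 4. sort eigenvalues from largest to smallest
    W = eigvecs[:, order[:k]]                # 5. keep the top-k eigenvectors (n x k)
    X_reduced = X_centered @ W               # 6. project the data into the new space (m x k)
    return X_reduced, W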

3. General process of PCA algorithm implementation:

  1. Normalize the data (center it, and optionally scale each feature);
  2. Compute the covariance matrix of the normalized data set;
  3. Compute the eigenvalues and eigenvectors of the covariance matrix;
  4. Retain the k largest eigenvalues (k is usually much smaller than n);
  5. Find the eigenvectors corresponding to those k eigenvalues;
  6. Multiply the m × n data set by the n × k matrix formed by the k n-dimensional eigenvectors to obtain the final m × k reduced data (a scikit-learn equivalent is sketched below).
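For comparison, the same process is available off the shelf. A minimal sketch using scikit-learn's PCA class, assuming scikit-learn is installed and using illustrative random data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)           # illustrative data: m = 100 samples, n = 10 features
pca = PCA(n_components=3)             # keep k = 3 components
X_reduced = pca.fit_transform(X)      # centers the data, then projects onto the top 3 eigenvectors

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component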

PCA dimensionality reduction criteria:

  1. Minimum reconstruction error (nearest reconstruction): the sum of squared errors between the reconstructed points and the original points, over all samples in the set, is as small as possible (see the formula after this list).
  2. Maximum separability: the projections of the samples in the low-dimensional space are spread out (have as large a variance) as possible.
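These two criteria are in fact equivalent (a standard result, added here for context): for centered samples $x_i$ and an orthonormal projection matrix $W \in \mathbb{R}^{n \times k}$,

$$
\sum_{i=1}^{m} \left\| x_i - W W^\top x_i \right\|^2 = \sum_{i=1}^{m} \|x_i\|^2 - \sum_{i=1}^{m} \left\| W^\top x_i \right\|^2 ,
$$

so minimizing the reconstruction error (criterion 1) is the same as maximizing the variance of the projections $W^\top x_i$ (criterion 2); both are solved by taking $W$ to be the top-$k$ eigenvectors of the covariance matrix.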

Advantages of PCA algorithm:

  1. It makes data sets easier to use;
  2. It reduces the computational cost of downstream algorithms;
  3. It removes noise;
  4. It makes results easier to interpret;
  5. It is completely parameter-free.

Disadvantages of PCA algorithm:

  1. If the user has prior knowledge about the observed objects and has already grasped some characteristics of the data, but cannot inject that knowledge into the process through parameters or other means, the expected effect may not be achieved and efficiency may suffer.
  2. Eigenvalue decomposition has some limitations; for example, the matrix being decomposed must be a square matrix (an SVD-based variant that avoids this is sketched below).
  3. For data that is not Gaussian distributed, the principal components obtained by PCA may not be optimal.
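Regarding the second point, PCA is often computed via the singular value decomposition of the centered data matrix itself, which works for any m × n matrix and avoids forming the square covariance matrix explicitly. A minimal sketch (variable names are illustrative):

import numpy as np

def pca_svd(X, k):
    """PCA of an (m, n) data matrix X via SVD, keeping k components."""
    X_centered = X - X.mean(axis=0)
    # U: left singular vectors, S: singular values (sorted descending), Vt: right singular vectors
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:k].T                   # top-k principal directions (n, k)
    X_reduced = X_centered @ W     # equivalently U[:, :k] * S[:k]
    return X_reduced, W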

Core code of the algorithm

import numpy as np

# Define the PCA algorithm
def PCA(data, r):
    data = np.float32(np.mat(data))
    rows, cols = np.shape(data)
    data_mean = np.mean(data, 0)  # Average over the columns (per-feature mean)
    A = data - np.tile(data_mean, (rows, 1))  # Subtract the mean from every sample to get A
    C = A * A.T  # Covariance-like matrix of size rows x rows (the "small matrix" trick)
    D, V = np.linalg.eig(C)  # Eigenvalues and eigenvectors of C
    idx = np.argsort(-D)  # np.linalg.eig does not sort, so order eigenvalues from largest to smallest
    V_r = V[:, idx[0:r]]  # Take the eigenvectors of the r largest eigenvalues, by column
    V_r = A.T * V_r  # Convert eigenvectors of the small matrix to eigenvectors of the large matrix
    for i in range(r):
        V_r[:, i] = V_r[:, i] / np.linalg.norm(V_r[:, i])  # Normalize each eigenvector

    final_data = A * V_r
    return final_data, data_mean, V_r
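A usage sketch for this function, with an illustrative random data matrix (the shape and names are assumptions, not from the original post):

# 40 samples with 1024 features each, e.g. flattened 32 x 32 images (illustrative data)
data = np.random.rand(40, 1024)

final_data, data_mean, V_r = PCA(data, r=20)
print(final_data.shape)  # (40, 20): each sample reduced to 20 dimensions
print(V_r.shape)         # (1024, 20): the 20 principal directions in the original space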

Reference

zhuanlan.zhihu.com/p/33191810