Introduction to PCA

Principal Component Analysis (PCA) is one of the most important dimensionality reduction methods. It is widely used for data compression, redundancy elimination, and noise removal, and it is generally the simplest algorithm for dimensionality reduction. PCA is mainly used for: (1) exploration and visualization of high-dimensional data sets; (2) data compression; (3) data preprocessing; (4) analysis and processing of images, speech, and communication signals; (5) dimensionality reduction (the most important use), removing redundancy and noise from the data.

1. The idea of PCA

PCA, as the name implies, finds the most important aspects of the data and uses them to replace the original data. Specifically, suppose our data set consists of m samples, each n-dimensional. We want to reduce these m samples from n dimensions down to n' dimensions (n' < n), while keeping the m n'-dimensional samples as representative of the original data set as possible. Some information is inevitably lost when going from n dimensions to n' dimensions, but we want that loss to be as small as possible. So how do we make the n'-dimensional data as close to the original data as possible?

In essence, PCA takes the directions of largest variance as the main features and "de-correlates" the data along orthogonal directions, that is, makes the components uncorrelated across the different orthogonal directions. The data is transformed from the original coordinate system into a new one, and the choice of the new coordinate system is determined by the data itself. The first new axis is the direction of greatest variance in the original data; the second new axis is the direction of greatest variance that is orthogonal to the first; and so on, repeating as many times as there are features in the original data. It turns out that most of the variance is contained in the first few of these new axes, so we can ignore the remaining axes, that is, reduce the dimension of the data.
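In more formal terms (a standard formulation of the same idea, added here for context): let $X$ be the centered $m \times n$ data matrix and $C = \frac{1}{m} X^\top X$ its covariance matrix. The first principal direction is the unit vector that maximizes the variance of the projected data,

$$
w_1 = \arg\max_{\|w\|=1} \frac{1}{m}\|Xw\|^2 = \arg\max_{\|w\|=1} w^\top C w,
$$

and the maximizer is the eigenvector of $C$ with the largest eigenvalue. Each subsequent direction is the eigenvector with the next-largest eigenvalue, orthogonal to all previous directions.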

2. Algorithm flow of PCA

  1. De-mean the data, that is, subtract the mean of each feature;
  2. Compute the covariance matrix;
  3. Compute the eigenvalues and eigenvectors of the covariance matrix;
  4. Sort the eigenvalues from largest to smallest;
  5. Retain the eigenvectors corresponding to the largest eigenvalues;
  6. Transform the data into the new space spanned by those eigenvectors (a minimal NumPy sketch of these steps follows this list).
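A minimal sketch of these six steps using NumPy's symmetric eigensolver; the variable names (X, k, X_reduced) are illustrative and not from the original post:

import numpy as np

def pca_steps(X, k):
    """Reduce an (m, n) data matrix X to k dimensions following the steps above."""
    X_centered = X - X.mean(axis=0)          # 1. subtract the mean of each feature
    C = np.cov(X_centered, rowvar=False)     # 2. covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(C)     # 3. eigenvalues/eigenvectors (C is symmetric)
    order = np.argsort(eigvals)[::-1]        # 4. sort eigenvalues from largest to smallest
    W = eigvecs[:, order[:k]]                # 5. keep the top-k eigenvectors (n x k)
    X_reduced = X_centered @ W               # 6. project the data into the new space (m x k)
    return X_reduced, W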

3. General process of PCA algorithm implementation:

  1. Normalize the data (center it, and optionally scale each feature);
  2. Compute the covariance matrix of the normalized data set;
  3. Compute the eigenvalues and eigenvectors of the covariance matrix;
  4. Retain the k largest eigenvalues (k is usually much smaller than n);
  5. Find the eigenvectors corresponding to those k eigenvalues;
  6. Multiply the m × n data set by the n × k matrix formed by the k n-dimensional eigenvectors to obtain the final m × k reduced data (a scikit-learn equivalent is sketched below).
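For comparison, the same process is available off the shelf. A minimal sketch using scikit-learn's PCA class, assuming scikit-learn is installed and using illustrative random data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)           # illustrative data: m = 100 samples, n = 10 features
pca = PCA(n_components=3)             # keep k = 3 components
X_reduced = pca.fit_transform(X)      # centers the data, then projects onto the top 3 eigenvectors

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component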

PCA dimensionality reduction criteria:

  1. Minimum reconstruction error (nearest reconstruction): the sum of squared errors between the reconstructed points and the original points, over all samples in the set, is as small as possible (see the formula after this list).
  2. Maximum separability: the projections of the samples in the low-dimensional space are spread out (have as large a variance) as possible.
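These two criteria are in fact equivalent (a standard result, added here for context): for centered samples $x_i$ and an orthonormal projection matrix $W \in \mathbb{R}^{n \times k}$,

$$
\sum_{i=1}^{m} \left\| x_i - W W^\top x_i \right\|^2 = \sum_{i=1}^{m} \|x_i\|^2 - \sum_{i=1}^{m} \left\| W^\top x_i \right\|^2 ,
$$

so minimizing the reconstruction error (criterion 1) is the same as maximizing the variance of the projections $W^\top x_i$ (criterion 2); both are solved by taking $W$ to be the top-$k$ eigenvectors of the covariance matrix.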

Advantages of PCA algorithm:

  1. It makes data sets easier to use;
  2. It reduces the computational cost of downstream algorithms;
  3. It removes noise;
  4. It makes results easier to interpret;
  5. It is completely parameter-free.

Disadvantages of PCA algorithm:

  1. If the user has prior knowledge about the observed objects and has already grasped some characteristics of the data, but cannot inject that knowledge into the process through parameters or other means, the expected effect may not be achieved and efficiency may suffer.
  2. Eigenvalue decomposition has some limitations; for example, the matrix being decomposed must be a square matrix (an SVD-based variant that avoids this is sketched below).
  3. For data that is not Gaussian distributed, the principal components obtained by PCA may not be optimal.
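Regarding the second point, PCA is often computed via the singular value decomposition of the centered data matrix itself, which works for any m × n matrix and avoids forming the square covariance matrix explicitly. A minimal sketch (variable names are illustrative):

import numpy as np

def pca_svd(X, k):
    """PCA of an (m, n) data matrix X via SVD, keeping k components."""
    X_centered = X - X.mean(axis=0)
    # U: left singular vectors, S: singular values (sorted descending), Vt: right singular vectors
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:k].T                   # top-k principal directions (n, k)
    X_reduced = X_centered @ W     # equivalently U[:, :k] * S[:k]
    return X_reduced, W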

Core code of the algorithm

import numpy as np

# Define the PCA algorithm
def PCA(data, r):
    data = np.float32(np.mat(data))
    rows, cols = np.shape(data)
    data_mean = np.mean(data, 0)  # Average over the columns (per-feature mean)
    A = data - np.tile(data_mean, (rows, 1))  # Subtract the mean from every sample to get A
    C = A * A.T  # Covariance-like matrix of size rows x rows (the "small matrix" trick)
    D, V = np.linalg.eig(C)  # Eigenvalues and eigenvectors of C
    idx = np.argsort(-D)  # np.linalg.eig does not sort, so order eigenvalues from largest to smallest
    V_r = V[:, idx[0:r]]  # Take the eigenvectors of the r largest eigenvalues, by column
    V_r = A.T * V_r  # Convert eigenvectors of the small matrix to eigenvectors of the large matrix
    for i in range(r):
        V_r[:, i] = V_r[:, i] / np.linalg.norm(V_r[:, i])  # Normalize each eigenvector

    final_data = A * V_r
    return final_data, data_mean, V_r
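A usage sketch for this function, with an illustrative random data matrix (the shape and names are assumptions, not from the original post):

# 40 samples with 1024 features each, e.g. flattened 32 x 32 images (illustrative data)
data = np.random.rand(40, 1024)

final_data, data_mean, V_r = PCA(data, r=20)
print(final_data.shape)  # (40, 20): each sample reduced to 20 dimensions
print(V_r.shape)         # (1024, 20): the 20 principal directions in the original space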

Reference

zhuanlan.zhihu.com/p/33191810