Principal component analysis (PCA) method steps and code details

Preface

In the last section, we learned that building a neural network model involves more than assembling the architecture, understanding what each parameter means, and avoiding common pitfalls. The first step is a thorough examination of the data set itself, before touching any neural network code. This step is critical: a single choice in data processing can noticeably affect the experimental results. This section describes a common dimensionality reduction method, PCA, which reduces the number of variables in a data set while retaining as much information as possible.

1. What is principal component analysis?

Principal component analysis (PCA) is a common data analysis method. It is widely used for dimensionality reduction of high-dimensional data and for extracting the main feature components of the data. PCA reduces the dimension of large data sets by reducing the number of indicators while retaining most of the information those indicators carried in the original data. The bottom line: reduce the number of metrics and keep as much information as possible.

1.1 Application scope of PCA
  • Dimensionality reduction techniques exist for both labeled and unlabeled data.
  • This section focuses on dimensionality reduction for unlabeled data; the same techniques can also be applied to labeled data.
1.2 Advantages and disadvantages

The advantage of PCA lies in dimensionality reduction: it makes it easy to extract the main features of the data, makes the data easier to use, reduces computational overhead, and removes noise. The disadvantage is that dimensionality reduction is not always necessary and useful information may be lost; keeping only the main information of the training set may also lead to over-fitting. PCA applies to structured data. Beyond compressing the data, PCA also makes the resulting features independent of each other after dimensionality reduction.

2. Methods and steps of PCA

As a traditional machine learning algorithm, PCA can be derived with basic linear algebra (covariance matrix computation, eigenvector and eigenvalue calculation, orthogonality, ...). The underlying mathematics is not covered in detail in this section; interested readers can refer to the linear algebra chapter of the "flower book" (the Deep Learning textbook) for the derivation. The PCA procedure can be divided into five steps.

2.1 Standardize the range of continuous initial variables (from unstructured to structured)

The purpose of this step is to standardize the range of the continuous indicators, because PCA is very sensitive to the variances of the initial variables. If the ranges of the initial variables differ greatly, the variables with larger ranges will dominate the variance. Standardization transforms the data onto comparable scales. The two most common methods are the following (a small code sketch of both follows the formulas below).

  • Min-max normalization: the original data is linearly transformed so that the result is mapped to the range [0, 1], putting all features on the same scale. The normalization formula is:

$$X_{\text{norm}}=\frac{X-X_{\min}}{X_{\max}-X_{\min}}$$
  • Zero-mean (z-score) normalization: maps the original data to a distribution with mean 0 and standard deviation 1. Assuming the original feature has mean μ and standard deviation σ, the normalization formula is:

$$z=\frac{x-\mu}{\sigma}$$
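As a minimal illustration of the two formulas above (not part of the original text), here is a small NumPy sketch; the toy matrix `X` and its values are assumptions chosen only for the example:

```python
import numpy as np

# toy data: 5 samples, 2 indicators with very different ranges (hypothetical values)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 250.0],
              [4.0, 180.0],
              [5.0, 220.0]])

# min-max normalization: map each column to [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# zero-mean (z-score) normalization: zero mean, unit standard deviation per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.min(axis=0), X_norm.max(axis=0))          # [0. 0.] [1. 1.]
print(X_std.mean(axis=0).round(6), X_std.std(axis=0))  # ~[0. 0.] [1. 1.]
```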

This method applies only to structured data. Categorical features such as sex or blood type take values from a limited set of options, and the raw input is usually a string. Such features can be preprocessed with ordinal encoding, one-hot encoding, binary encoding, and so on; a short encoding sketch follows the list below.

  • Ordinal encoding: typically used for categorical data that has an order between categories. For example, grades can be divided into high, middle and low, with the ordering "high > middle > low". Ordinal encoding assigns a numerical ID to each category according to this ordering, for example high = 3, middle = 2, low = 1, so the order relationship is preserved after the transformation.

  • One-hot encoding: typically used for categorical features with no order between categories. For example, blood type has four values (A, B, AB and O). One-hot encoding turns blood type into a 4-dimensional sparse vector: A is represented by (1, 0, 0, 0), B by (0, 1, 0, 0), AB by (0, 0, 1, 0) and O by (0, 0, 0, 1). (When the number of category values is large, sparse vectors should be used to save space, and feature selection can be applied to reduce the dimensionality.)

  • Binary encoding: carried out in two steps. First, each category is assigned a category ID with ordinal encoding; then the binary representation of that ID is used as the result. Taking blood types A, B, AB and O as an example: the ID of blood type A is 1, which is 001 in binary; the ID of blood type B is 2, which is 010 in binary; the representations of AB and O follow by analogy. Binary encoding essentially hashes the ID into a 0/1 feature vector whose dimension is smaller than that of one-hot encoding, saving storage space.

  • Categorical embedding: neural-network-based encoding of categorical variables. Interested readers can refer to the relevant literature (this may be covered in a separate chapter later, so it is not expanded here).

    For unstructured text data, the main options are the bag-of-words model, TF-IDF, topic models and word embedding models. They are only mentioned briefly here; for readers specializing in NLP this is nothing new.
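A minimal, hedged sketch of the three encodings described above, assuming scikit-learn is available; the blood-type values come from the text, while the explicit category order and the 1-based IDs are assumptions made for the example:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

blood = np.array([['A'], ['B'], ['AB'], ['O']])

# ordinal encoding: each category gets an integer ID (order specified explicitly here)
ids = OrdinalEncoder(categories=[['A', 'B', 'AB', 'O']]).fit_transform(blood)

# one-hot encoding: each category becomes a 4-dimensional 0/1 vector
onehot = OneHotEncoder().fit_transform(blood).toarray()

# binary encoding: the binary representation of the (1-based) category ID
binary = [format(int(i) + 1, '03b') for i in ids.ravel()]  # ['001', '010', '011', '100']

print(ids.ravel())   # [0. 1. 2. 3.]
print(onehot)
print(binary)
```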

2.2 Calculate covariance matrix to identify correlation

The purpose of this step is to observe whether the variables are correlated with each other, i.e. whether the indicators contain redundant information. The covariance matrix used here is a p × p symmetric matrix (where p is the number of dimensions) whose entries are the covariances of all possible pairs of initial variables. For a 3-dimensional data set with three variables x, y and z, the covariance matrix is the 3 × 3 matrix shown below:


$$\left[\begin{array}{lll} \operatorname{Cov}(x, x) & \operatorname{Cov}(x, y) & \operatorname{Cov}(x, z) \\ \operatorname{Cov}(y, x) & \operatorname{Cov}(y, y) & \operatorname{Cov}(y, z) \\ \operatorname{Cov}(z, x) & \operatorname{Cov}(z, y) & \operatorname{Cov}(z, z) \end{array}\right]$$

The covariance of a variable with itself is its variance, and covariance is symmetric (Cov(a, b) = Cov(b, a)), so the upper triangle equals the lower triangle. If a covariance is positive, the two variables are positively correlated; if it is negative, they are negatively correlated. A small NumPy sketch of this step follows.
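A minimal sketch of this step with NumPy; the random matrix `Z` below is a hypothetical stand-in for standardized indicators:

```python
import numpy as np

rng = np.random.RandomState(0)
Z = rng.randn(100, 3)            # 100 standardized samples of three variables x, y, z (hypothetical)

# np.cov expects variables in rows, so pass the transpose; the result is the 3 x 3 covariance matrix
C = np.cov(Z.T)
print(C.shape)                   # (3, 3)
print(np.allclose(C, C.T))       # True: Cov(a, b) = Cov(b, a), the matrix is symmetric
```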

2.3 Calculate the eigenvectors and eigenvalues of the covariance matrix to identify the principal components

The principal components of the data are determined by computing the eigenvectors and eigenvalues of the covariance matrix. First, the definition: a principal component is a new variable constructed as a linear combination, or mixture, of the initial variables. The new variables are uncorrelated, and most of the information in the original variables is squeezed, or compressed, into the first components. In plain terms, ten-dimensional data yields up to ten principal components, but PCA tries to put as much information as possible into the first component, then the maximum of the remaining information into the second component, and so on.

Organizing the information in the principal components this way lets us reduce dimensionality without losing too much information, by dropping the components that carry little of it. The new indicators are uncorrelated and, being linear combinations of the initial variables, less interpretable. Each principal component represents a direction in which the data explains the maximum variance. The relation between variance and information is that the greater the variance carried by a direction, the greater the dispersion of the data points along it, and the more information it contains. Computing the eigenvalues of the covariance matrix yields the maximum variances, and the corresponding eigenvectors give the best projection directions; diagonalizing the covariance matrix makes the covariances between the new variables 0 while making each new variable's variance as large as possible. The resulting maximization problem can be written as:


$$\left\{\begin{array}{l} \max \left\{\omega^{\mathrm{T}} \Sigma \omega\right\} \\ \text {s.t.} \quad \omega^{\mathrm{T}} \omega=1 \end{array}\right.$$

Using the Lagrange multiplier method, the constrained problem becomes an unconstrained optimization; setting the derivative with respect to ω to 0 gives Σω = λω, and therefore:


$$D(\boldsymbol{x})=\boldsymbol{\omega}^{\mathrm{T}} \Sigma \boldsymbol{\omega}=\lambda \boldsymbol{\omega}^{\mathrm{T}} \boldsymbol{\omega}=\lambda$$

Sorting the computed eigenvalues in descending order therefore gives the principal components in order of importance. A short eigendecomposition sketch follows.
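A minimal NumPy sketch of this step, continuing from the covariance matrix `C` computed in the previous sketch (variable names are illustrative):

```python
import numpy as np

# eigendecomposition of the symmetric covariance matrix C from the previous sketch
eigvals, eigvecs = np.linalg.eigh(C)   # eigh is for symmetric matrices; columns of eigvecs are eigenvectors

# sort eigenvalues (and their eigenvectors) from largest to smallest
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                         # variances along the principal directions, largest first
```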

2.4 Create feature vectors to determine which principal components to retain

Computing the eigenvectors and ordering them by descending eigenvalue gives the principal components in order of importance. In this step we choose either to keep all of them or to discard those of low importance, and we stack the eigenvectors we decide to keep into a matrix called the feature vector. How many to keep depends on our requirements. Writing the retained eigenvectors as ω1, ω2, …, ωd, each n-dimensional sample is mapped to a d-dimensional sample by the following mapping:


$$\boldsymbol{x}_{i}^{\prime}=\left[\begin{array}{c} \boldsymbol{\omega}_{1}^{\mathrm{T}} \boldsymbol{x}_{i} \\ \boldsymbol{\omega}_{2}^{\mathrm{T}} \boldsymbol{x}_{i} \\ \vdots \\ \boldsymbol{\omega}_{d}^{\mathrm{T}} \boldsymbol{x}_{i} \end{array}\right]$$

The d-th element of the new x_i' is the projection of x_i onto the d-th principal direction ω_d. By selecting the eigenvectors corresponding to the largest d eigenvalues, we discard the small-variance features (noise) and map each n-dimensional column vector x_i to a d-dimensional column vector x_i'. The proportion of information retained after dimensionality reduction is defined as (a short code sketch follows the formula):


$$\eta=\sqrt{\frac{\sum_{i=1}^{d} \lambda_{i}^{2}}{\sum_{i=1}^{n} \lambda_{i}^{2}}}$$
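Continuing the eigendecomposition sketch above, a minimal illustration of keeping the top d eigenvectors and evaluating the retained-information ratio η (d = 2 is an arbitrary choice for the example):

```python
import numpy as np

d = 2                                  # number of components kept (illustrative choice)
W = eigvecs[:, :d]                     # feature vector: the top-d eigenvectors from the previous sketch

# proportion of information retained after keeping d components, per the formula above
eta = np.sqrt(np.sum(eigvals[:d] ** 2) / np.sum(eigvals ** 2))
print(W.shape, eta)
```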
2.5 Recast data along the principal component axis

In this step, the feature vector formed from the retained eigenvectors of the covariance matrix is used to reorient the data from the original axes to the principal component axes. This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set. A compact end-to-end sketch of steps 2.1–2.5 follows.
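Putting steps 2.1–2.5 together, here is a compact from-scratch sketch in NumPy only; the toy matrix `X` and the choice d = 2 are assumptions made for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                   # 100 samples, 5 indicators (hypothetical data)
d = 2                                   # target dimensionality (illustrative)

# step 2.1: standardize each column to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# step 2.2: covariance matrix of the standardized data
C = np.cov(Z.T)

# step 2.3: eigenvalues / eigenvectors, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# step 2.4: feature vector = the d most important eigenvectors
W = eigvecs[:, :d]

# step 2.5: recast the data along the principal component axes
X_new = (W.T @ Z.T).T                   # equivalently Z @ W
print(X_new.shape)                      # (100, 2)
```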

2.6 Summary

Besides maximizing the variance in the objective function, other formulations can be considered, such as minimizing the reconstruction (regression) error, which leads to a different objective function; its principle and solution turn out to be equivalent to the one above. PCA is a linear dimensionality reduction method and therefore has limitations. It can be extended through kernel mappings to kernel principal component analysis (KPCA), and manifold-based dimensionality reduction methods such as isometric mapping (Isomap), locally linear embedding (LLE) and Laplacian eigenmaps can be considered for complex data sets on which plain PCA, being linear, performs poorly. A hedged KPCA sketch follows.
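As a hedged illustration of the kernel extension mentioned above, using scikit-learn's KernelPCA; the moons data set and the kernel parameters are assumptions chosen only to show the API:

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

# a small non-linearly structured toy data set
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                                 # plain linear PCA
X_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15).fit_transform(X)  # kernel PCA (RBF kernel)

print(X_pca.shape, X_kpca.shape)
```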

The following sections illustrate the use of PCA for dimensionality reduction with concrete examples.

3 Hands-on with PCA

So far we have studied the theoretical basis of PCA. Fundamentally it is a dimensionality reduction algorithm, but it can also serve as a tool for visualization, noise filtering, and feature extraction and engineering. The scikit-learn library is used below to show how to apply it.

3.1 PCA for visualization
3.1.1 Code Implementation — Theory part
  • Import related libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
  • Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction of data, and it is relatively easy to visualize. Let's randomly generate 200 two-dimensional points.
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');

The results are shown below:

  • There is clearly a nearly linear relationship between the x and y variables. Unlike a linear regression problem, which would try to predict y from x, PCA instead tries to learn about the relationship between the x and y variables. In scikit-learn, the PCA estimator quantifies this relationship by finding a list of the principal axes in the data and using those axes to describe the data set.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_)
print(pca.explained_variance_)

The printed output gives the principal components and the explained variance.

By fitting, PCA learns certain quantities from the data, most importantly the components and the explained variance. To give these numbers a geometric meaning, let's visualize them as vectors over the input data, using the components to define the direction of each vector and the explained variance to define its squared length.

def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->',
                      linewidth=2,
                      shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');

These vectors represent the principal axes of the data, and the length of each vector indicates how important that axis is in describing the distribution of the data; more precisely, it measures the variance of the data when projected onto that axis. The projection of each data point onto the principal axes gives the principal components of the data. This transformation from the data axes to the principal axes is an affine transformation, which basically consists of a translation, a rotation and a uniform scaling. A short dimensionality-reduction sketch follows.
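To connect this picture with dimensionality reduction, here is a short, hedged sketch continuing from the code above: it keeps only the single most important axis of the same data `X` and then maps the result back to 2-D for comparison (an illustrative extension, not required by the text above):

```python
# keep only the most important axis of the same data X
pca_1d = PCA(n_components=1)
X_1d = pca_1d.fit_transform(X)            # shape (200, 1): coordinates along the principal axis
X_back = pca_1d.inverse_transform(X_1d)   # back to 2-D, with the minor-axis variation removed

plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_back[:, 0], X_back[:, 1], alpha=0.8)
plt.axis('equal');
```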

3.1.2 PCA visualization case study: handwritten digits

The usefulness of dimensionality reduction is not obvious in only two dimensions, but it becomes much clearer with higher-dimensional data. Here the application of PCA is demonstrated on the handwritten digits data set.

First load data:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
digits = load_digits()
digits.data.shape

Each sample consists of an 8 × 8 pixel image, i.e. 64 dimensions in total. We can use PCA to project them down to a more manageable number of dimensions:

Project data to 2 dimensions
pca = PCA(2)  # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)

Visualize the first two principal components of each point to understand the data:

plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('Accent', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();

The complete data set is a 64-dimensional cloud of points, and these are the projections of each data point along the directions of greatest variance. Essentially, we have found the optimal stretch and rotation in 64-dimensional space that lets us see the layout of the digits in two dimensions, without referring to the labels.

3.1.3 Definition of Component

This can be understood in terms of combinations of basis vectors. For example, each image in the training set is defined by a set of 64 pixel values, which we write as a vector x:


$$x=\left[x_{1}, x_{2}, x_{3} \cdots x_{64}\right]$$

To build the image, we multiply each element of the vector by the pixel it describes, and then add the results together to build the image:


$$\operatorname{image}(x)=x_{1} \cdot(\text { pixel } 1)+x_{2} \cdot(\text { pixel } 2)+x_{3} \cdot(\text { pixel } 3)+\cdots+x_{64} \cdot(\text { pixel } 64)$$

One way to imagine reducing the dimensionality of this data is to zero out all but a few of these basis vectors. For example, if we use only the first two pixels, we get a two-dimensional projection, but it does not reflect the whole image very well, because we have thrown away most of the pixels.

def plot_pca_components(x, coefficients=None, mean=0, components=None,
                        imshape=(8, 8), n_components=2, fontsize=12,
                        show_mean=True):
    if coefficients is None:
        coefficients = x

    if components is None:
        components = np.eye(len(coefficients), len(x))

    mean = np.zeros_like(x) + mean

    fig = plt.figure(figsize=(1.2 * (5 + n_components), 1.2 * 2))
    g = plt.GridSpec(2, 4 + bool(show_mean) + n_components, hspace=0.3)

    def show(i, j, x, title=None):
        ax = fig.add_subplot(g[i, j], xticks=[], yticks=[])
        ax.imshow(x.reshape(imshape), interpolation='nearest')
        if title:
            ax.set_title(title, fontsize=fontsize)

    # left: the true image
    show(slice(2), slice(2), x, "True")

    approx = mean.copy()

    counter = 2
    if show_mean:
        show(0, 2, np.zeros_like(x) + mean, r'$\mu$')
        show(1, 2, approx, r'$1 \cdot \mu$')
        counter += 1

    # top row: each basis component; bottom row: cumulative approximation
    for i in range(n_components):
        approx = approx + coefficients[i] * components[i]
        show(0, i + counter, components[i], r'$c_{0}$'.format(i + 1))
        show(1, i + counter, approx,
             r"${0:.2f} \cdot c_{1}$".format(coefficients[i], i + 1))
        if show_mean or i > 0:
            plt.gca().text(0, 1.05, '$+$', ha='right', va='bottom',
                           transform=plt.gca().transAxes, fontsize=fontsize)

    # right: the final approximation
    show(slice(2), slice(-2, None), approx, "Approx")
    return fig


from sklearn.datasets import load_digits

digits = load_digits()
sns.set_style('white')

fig = plot_pca_components(digits.data[2],
                          show_mean=False)

The top row of the visualization shows the individual pixels and the bottom row shows the cumulative contribution of those pixels to the constructed image. Using only two pixel-based components is clearly insufficient: it builds only a tiny fraction of a 64-pixel image. The pixel representation is not the only choice of basis, however; we can use other basis functions, each of which contains some predefined contribution from every pixel:


$$\text { image }(x)=\operatorname{mean}+x_{1} \cdot(\text { basis } 1)+x_{2} \cdot(\text { basis } 2)+x_{3} \cdot(\text { basis } 3)+\cdots$$

PCA can be thought of as the process of choosing the optimal basis functions, such that adding together just the first few of them is enough to reconstruct most elements of the data set. The principal components, which serve as the low-dimensional representation of the data, are simply the coefficients that multiply each element of this series. The figure below shows the reconstruction of a digit using the mean plus the first two PCA basis functions:

pca = PCA(n_components=2)
Xproj = pca.fit_transform(digits.data)
sns.set_style('white')
fig = plot_pca_components(digits.data[10], Xproj[10],
                          pca.mean_, pca.components_)

The visualization is as follows:

Unlike the pixel basis, the PCA basis allows us to recover the salient features of the input image with just the mean plus two components. The amount of each pixel in each component follows from the orientation of the vectors in our two-dimensional example. PCA thus forms a set of basis functions that is far more efficient than the raw pixel basis. So how do we choose the number of components so that the representation is good enough?

3.1.4 How to determine the number of components

The number of components needed to describe the data can be determined by looking at the cumulative explained variance ratio as a function of the number of components:

pca = PCA().fit(digits.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

The visualization results are as follows:

This curve quantifies how much of the total 64-dimensional variance is contained in the first N components. For example, we can see that the first ten components contain approximately 75% of the variance, while about 50 components are needed to describe 99% of the variance. This visualization helps us understand the level of redundancy; a short sketch of choosing the number automatically follows.
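As a hedged aside, scikit-learn also accepts a fraction for n_components, in which case PCA itself picks the smallest number of components that retains that share of the variance (the 0.99 threshold below is only an example):

```python
pca = PCA(n_components=0.99)     # keep enough components to explain 99% of the variance
pca.fit(digits.data)
print(pca.n_components_)         # the number of components PCA actually selected
```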

3.2 PCA for noise filtering
3.2.1 Code Implementation — Theoretical part

PCA can also be applied to noise filtering. The idea is that any component whose variance is much larger than the effect of the noise should be relatively unaffected by the noise. In plain terms, reconstructing the data using only the largest subset of principal components preferentially keeps the signal and throws out the noise.

First, plot several of the noise-free input digits.

def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks': [], 'yticks': []},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 16))

plot_digits(digits.data)

Add some random noise, create a dataset, and redraw.

np.random.seed(42)
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)

The data set is now noisy handwritten digits with spurious pixels, and it is hard to recognize the digits with the naked eye. Let's train a PCA on this data, requiring the projection to retain 50% of the variance.

pca = PCA(0.50).fit(noisy)
pca.n_components_

The result is 12: retaining 50% of the variance corresponds to 12 principal components. Now we compute these components and use the transform to reconstruct the filtered digits:

components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)

It can be seen clearly from the visualization that PCA has a noise-filtering property, and that data passed through PCA dimensionality reduction also gains a feature selection effect.

3.2.2 PCA for dimensionality reduction: Labeled Faces in the Wild (cf. the scikit-learn official example)
  1. Project description

Labeled Faces in the Wild is a common benchmark data set for face verification. Each sample contains 1850 features; PCA is used to see whether the dimensionality of these features can be reduced.

  2. Development process

    1. Download the data and observe the labels and samples in the data set.

    2. Observe the sample label content.

    3. Prepare the data, observe its characteristics, and compute a suitable number of components:

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA as RandomizedPCA
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)
pca = RandomizedPCA(150)
pca.fit(faces.data)
```

There are 1288 samples in total and each sample has 1850 features (the data cleaning of the first step is ignored here); the PCA is fit with 150 components.

  3. Plot the principal component images
```python
fig, axes = plt.subplots(3, 8, figsize=(9, 4),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(62, 47), cmap='bone')
```

Plot the cumulative explained variance curve:

```
import numpy as np

plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
```

Compute the components, reconstruct the images, and visualize the result:

For this high-dimensional data, it is useful to visualize the images associated with the first few principal components and observe how they vary (these component images are known as eigenfaces). We can see that 150 components account for more than 95% of the variance, so 150 components are enough to reconstruct most of the characteristics of the data. The input images are compared with the reconstructed images in the figure below.

```
# Compute the components and projected faces
pca = RandomizedPCA(150).fit(faces.data)
components = pca.transform(faces.data)
projected = pca.inverse_transform(components)

# Plot the results
fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
                       subplot_kw={'xticks': [], 'yticks': []},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
    ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')
```

The first row shows the input images, and the second row shows the images reconstructed from just 150 of the roughly 3000 initial features. Although the dimensionality of the data is reduced by nearly a factor of 20, we can still judge the content of the reconstructed images with the naked eye. This means the low-dimensional representation is informative enough to make downstream classification efficient; a hedged classification sketch follows.
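As a hedged illustration of using the reduced representation for classification, continuing from the `faces` data and the `RandomizedPCA` alias defined above and in the spirit of the eigenfaces + SVM example cited in the references; the split and SVM parameters below are illustrative, not tuned:

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target,
                                                random_state=42)

# chain the 150-component PCA with an RBF-kernel SVM classifier
model = make_pipeline(RandomizedPCA(150, whiten=True, random_state=42),
                      SVC(kernel='rbf', class_weight='balanced'))
model.fit(Xtrain, ytrain)

print(classification_report(ytest, model.predict(Xtest),
                            target_names=faces.target_names))
```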

4 Outlook

In this section, from theoretical derivation through code analysis to case studies, we covered PCA and how to use it for dimensionality reduction, high-dimensional visualization, noise filtering and feature selection. PCA also has limitations as a dimensionality reduction method: for example, with higher-order (non-linear) correlations, a kernel function is needed to turn the non-linear relationships into linear ones before reducing the dimension. Among unsupervised dimensionality reduction techniques there are also methods such as random projection and feature agglomeration; see the scikit-learn documentation for more information.

References

  1. 100 Questions in Machine Learning (百面机器学习)
  2. Faces recognition example using eigenfaces and SVMs (scikit-learn documentation)
  3. A Step-by-Step Explanation of Principal Component Analysis (PCA)
  4. Using PCA to simplify data
  5. In Depth: Principal Component Analysis (Python Data Science Handbook)

Recommended reading

  • Differential operator method
  • Using PyTorch to build a neural network model for handwriting recognition
  • Using PyTorch to build neural network models and compute back-propagation
  • How to optimize model parameters and ensemble models
  • TORCHVISION object detection fine-tuning tutorial