Previous related articles:

  1. Linear regression for machine learning
  2. Machine learning logistic regression and Python implementation
  3. Machine learning project actual trading data anomaly detection
  4. Decision Tree for Machine Learning
  5. Python implementation of Decision Tree for machine learning
  6. PCA for Machine Learning
  7. Feature engineering for machine learning

This article introduces a dimensionality reduction algorithm: PCA (Principal Component Analysis).

In machine learning there is a problem known as the curse of dimensionality. In real projects, the sample data we have to deal with may have hundreds or thousands of dimensions, or even more. In that case, training a model directly on the raw sample data costs a great deal of time, and the corresponding resource consumption is unacceptable. At this point we need to reduce the dimensionality of the data. This of course means losing some information, but given the correlations present in real data, we can find ways to reduce the dimensionality while minimizing the loss of information.

PCA is a dimensionality reduction method with a rigorous mathematical foundation, and it has been widely used.

Before presenting the PCA algorithm in detail, let's review the relevant linear algebra. First, it should be emphasized that the vectors mentioned below, unless otherwise stated, all refer to column vectors.

1. Inner product and projection of vectors

The inner product of two vectors of the same dimension is defined as:

$$(a_1, a_2, \cdots, a_n)^\mathsf{T} \cdot (b_1, b_2, \cdots, b_n)^\mathsf{T} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$
Now let's look at the geometric meaning of the inner product. Suppose A and B are two n-dimensional vectors; for simplicity, assume they are two-dimensional, i.e. $A = (x_1, y_1)^\mathsf{T}$ and $B = (x_2, y_2)^\mathsf{T}$. In the two-dimensional plane, A and B can be represented by two directed line segments originating from the origin, as shown in the figure below:

Now we draw a perpendicular from point A to the line containing B. The foot of the perpendicular is the projection of A onto B. Let the angle between A and B be $\alpha$; then the scalar length of the projection is $|A|\cos(\alpha)$, where $|A| = \sqrt{x_1^2 + y_1^2}$ is the magnitude of vector A, i.e. the scalar length of the line segment A. Recall that the inner product can also be written as

$$A \cdot B = |A|\,|B|\cos(\alpha),$$

so if $|B| = 1$, the inner product $A \cdot B$ is exactly the scalar length of the projection of A onto the direction of B.
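A minimal NumPy check of this relation (the vectors A and B here are arbitrary illustrative values, with B chosen as a unit vector):

import numpy as np

A = np.array([3.0, 2.0])                   # an arbitrary 2-D vector
B = np.array([1.0, 1.0]) / np.sqrt(2)      # a unit vector, |B| = 1

inner = A @ B                              # inner product a1*b1 + a2*b2
cos_alpha = inner / (np.linalg.norm(A) * np.linalg.norm(B))
proj_len = np.linalg.norm(A) * cos_alpha   # |A|cos(alpha), the projection length of A onto B

print(inner, proj_len)                     # both values are equal because |B| = 1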




Basis and basis transformation

We know that a two dimensional vector can correspond to a directed line segment starting from the origin in a two dimensional Cartesian coordinate system. For example, the following vector:

We can represent the vector in the picture as (3, 2), where 3 is its projection onto the x-axis and 2 is its projection onto the y-axis.

In other words, we have implicitly introduced a reference: the vectors of length 1 in the positive x and y directions. So the vector (3, 2) essentially says that its projection onto the x-axis is 3 and its projection onto the y-axis is 2. Note that the projection here is a signed quantity, so it can be negative.

So the vector (x, y) actually represents the linear combination $x(1,0)^\mathsf{T} + y(0,1)^\mathsf{T}$. Here (1,0) and (0,1) are called a basis in two dimensions; moreover, they form an orthonormal basis.

By default we choose (1,0) and (0,1) as the basis. This is of course convenient, because they are the unit vectors in the positive directions of the x and y axes, so that points and vectors in the two-dimensional plane correspond one to one. But in fact any two linearly independent two-dimensional vectors can serve as a basis; in the plane, two linearly independent vectors can be intuitively regarded as two vectors that do not lie on the same line.

For example, (1,1) and (-1,1) can also form a basis. In general, we want the basis vectors to have magnitude 1, because, from the meaning of the inner product, if a basis vector has magnitude 1 we can simply take the dot product of a vector with that basis vector to directly obtain its coordinate on the new basis. In fact, for any vector we can always find a vector of magnitude 1 in the same direction by dividing each component by the magnitude. For example, the basis above can be changed to $(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})$ and $(-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})$.

Now, if we want the coordinates of (3,2) on the new basis, that is, the values of its projections in the two directions, then by the geometric meaning of the inner product we just compute the inner product of (3,2) with each of the two basis vectors, and the new coordinates follow directly. The following diagram shows the new basis and the coordinates of (3,2) on it:

Matrix representation of the basis transformation

So, from the previous description, to transform (3,2) into coordinates on the new basis we take the inner product of (3,2) with each of the new basis vectors. We can write this transformation succinctly as a matrix multiplication:

$$\begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} \frac{5}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{pmatrix}$$
To generalize to multiple vectors: say we have m vectors; we simply arrange these two-dimensional vectors as the columns of a matrix with two rows and m columns, multiply the basis matrix by this matrix, and obtain the coordinates of all of the vectors in the new basis. For example, (1,1), (2,2), (3,3) can be transformed as follows:

$$\begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \end{pmatrix} = \begin{pmatrix} \sqrt{2} & 2\sqrt{2} & 3\sqrt{2} \\ 0 & 0 & 0 \end{pmatrix}$$

So we can see that a basis change can be represented as a matrix multiplication.
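A quick NumPy sketch of this basis change, using the basis matrix and vectors from above:

import numpy as np

# Rows of the basis matrix are the new basis vectors (1/sqrt(2), 1/sqrt(2)) and (-1/sqrt(2), 1/sqrt(2))
basis = np.array([[ 1, 1],
                  [-1, 1]]) / np.sqrt(2)

v = np.array([3, 2])
print(basis @ v)            # [ 3.53553391 -0.70710678], i.e. (5/sqrt(2), -1/sqrt(2))

V = np.array([[1, 2, 3],    # (1,1), (2,2), (3,3) arranged as columns
              [1, 2, 3]])
print(basis @ V)            # first row: sqrt(2), 2*sqrt(2), 3*sqrt(2); second row: all zeros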

In general, if we have M N-dimensional vectors and want to transform them into a new space spanned by R N-dimensional basis vectors, we first arrange the R basis vectors as the rows of a matrix A and the M vectors as the columns of a matrix B; the product AB is then the transformed result, and the m-th column of AB is the transform of the m-th column of B. In mathematical form:

$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_R \end{pmatrix} \begin{pmatrix} a_1 & a_2 & \cdots & a_M \end{pmatrix} = \begin{pmatrix} p_1 a_1 & p_1 a_2 & \cdots & p_1 a_M \\ p_2 a_1 & p_2 a_2 & \cdots & p_2 a_M \\ \vdots & \vdots & \ddots & \vdots \\ p_R a_1 & p_R a_2 & \cdots & p_R a_M \end{pmatrix}$$

where each $p_i$ is a row vector representing the i-th basis vector and each $a_j$ is a column vector representing the j-th original data record.
If R is less than N, this transforms N-dimensional data into a lower-dimensional space, so this matrix multiplication can also represent dimensionality reduction.

The meaning of multiplying two matrices is therefore: transform every column of the right-hand matrix into the space in which every row of the left-hand matrix is a basis vector.

The optimization goal

We know from the above that the same set of data can be given different representations by choosing different bases, and if the number of basis vectors is smaller than the dimension of the vectors themselves, we achieve dimensionality reduction.

So the next question is how to find an optimal basis: if we have a set of N-dimensional vectors and want to reduce them to K dimensions (K less than N), how do we select K basis vectors so as to retain as much of the information in the original N-dimensional vectors as possible?

Let's work through a specific example. Assume our data consists of 5 records, arranged as follows:

$$\begin{pmatrix} 1 & 1 & 2 & 4 & 2 \\ 1 & 3 & 3 & 4 & 4 \end{pmatrix}$$
Each column is one data record, and each row is a field, i.e. a feature.

First, for the convenience of subsequent processing, we zero-mean the data, that is, we subtract from each field (each row) the mean of that field, so that after processing the mean of every field is 0. The transformed data is:

$$\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}$$
The distribution in the coordinate system is as follows:

Now, we want to reduce this data to one dimension, but keep as much of the original information as possible. How to deal with it?

From the discussion above, this problem amounts to choosing a direction in the two-dimensional plane and projecting all the data onto the line along that direction. How, then, should we choose this direction (or basis) so as to retain as much of the original information as possible? An intuitive criterion is: we want the projected values to be as spread out as possible.

As the example above shows, if we project onto the x-axis, the two leftmost points overlap, and so do the two middle points; the four distinct two-dimensional points are left with only two distinct values after projection, which is a serious loss of information. Similarly, if we project onto the y-axis, the top two points overlap with each other, as do the two points lying on the x-axis. So neither the x-axis nor the y-axis is a good choice of projection direction. Visually, if the five points are projected onto an oblique line passing through the first and third quadrants, they remain distinguishable after projection.
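A small NumPy sketch of this intuition, using the zero-meaned example data above: the spread (variance) of the projected values along the oblique direction (1,1)/√2 is clearly larger than along either axis.

import numpy as np

# Zero-meaned example data, one column per record
X = np.array([[-1, -1, 0, 2, 0],
              [-2,  0, 0, 1, 1]])

def proj_variance(direction):
    # Variance (1/m convention) of the data projected onto a unit direction
    d = direction / np.linalg.norm(direction)
    proj = d @ X                 # scalar projection of every record
    return np.mean(proj ** 2)    # the fields already have zero mean

print(proj_variance(np.array([1, 0])))   # x-axis: 1.2
print(proj_variance(np.array([0, 1])))   # y-axis: 1.2
print(proj_variance(np.array([1, 1])))   # oblique direction: 2.0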

Now let's express this problem mathematically.

The variance

We have argued that we want the projected values to be as spread out as possible, and this spread can be expressed mathematically by the variance. The variance of a field $a$ is:

$$\mathrm{Var}(a) = \frac{1}{m}\sum_{i=1}^{m}(a_i - \mu)^2$$

Since we have already zero-meaned every field above, this simplifies to:

$$\mathrm{Var}(a) = \frac{1}{m}\sum_{i=1}^{m}a_i^2$$

Covariance

For the problem of reducing two dimensions to one, it suffices to find the direction that maximizes the variance of the projections. For higher dimensions, however, there is another issue to solve. Consider reducing three dimensions to two. As before, we first find the direction that maximizes the post-projection variance, which fixes the first direction; then we choose the second direction.

If we simply chose the direction with the largest variance again, it would obviously be almost coincident with the first direction, and such a dimension is clearly not useful; therefore there must be another constraint. Intuitively, we want the two fields to represent as much of the original information as possible, and we do not want them to be (linearly) correlated, because correlation means the two fields are not completely independent and necessarily carry duplicated information.

Mathematically, the correlation can be expressed by the covariance of two fields. For zero-meaned fields $a$ and $b$:

$$\mathrm{Cov}(a, b) = \frac{1}{m}\sum_{i=1}^{m}a_i b_i$$
When the covariance is 0, the two fields are uncorrelated. To make the covariance zero, we can only choose the second basis vector in a direction orthogonal to the first (for zero-meaned data, an inner product of zero corresponds to a covariance of zero). So the two directions we end up choosing must be orthogonal.

At this point we obtain the optimization goal of the dimensionality reduction problem: to reduce a set of N-dimensional vectors to K dimensions (K greater than zero and less than N), choose K unit (norm-1) orthogonal basis vectors such that, when the original data is transformed onto this basis, the covariance between every pair of fields is 0 and the variance of each field is as large as possible (under the orthogonality constraint, take the K directions with the largest variances).

Covariance matrix

We see that the final goal is closely related to the within-field variances and the between-field covariances, so we would like to express both in a unified way. Observing them, we notice that both can be written as inner products, and inner products are closely related to matrix multiplication. This leads to the following idea:

Suppose we have only two fields, a and b, each with m observations; we stack them as the rows of a matrix X:

$$X = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \\ b_1 & b_2 & \cdots & b_m \end{pmatrix}$$
Then we multiply X by its transpose and scale by the coefficient 1/m:

$$\frac{1}{m}XX^\mathsf{T} = \begin{pmatrix} \frac{1}{m}\sum_{i=1}^{m}a_i^2 & \frac{1}{m}\sum_{i=1}^{m}a_i b_i \\ \frac{1}{m}\sum_{i=1}^{m}a_i b_i & \frac{1}{m}\sum_{i=1}^{m}b_i^2 \end{pmatrix}$$

The diagonal elements are exactly the variances of the two fields, and the off-diagonal elements are their covariance. By the rules of matrix multiplication this conclusion generalizes easily: suppose we have m n-dimensional data records arranged as the columns of an n-by-m matrix X, and let $C = \frac{1}{m}XX^\mathsf{T}$. Then C is a symmetric matrix whose diagonal elements are the variances of the fields, and whose element in row i, column j (equal to the element in row j, column i) is the covariance of fields i and j.
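A quick check with NumPy on the example data; note that 1/m is used here, whereas np.cov defaults to dividing by m-1 (pass bias=True to divide by m):

import numpy as np

X = np.array([[-1, -1, 0, 2, 0],    # zero-meaned example data, fields as rows
              [-2,  0, 0, 1, 1]])
m = X.shape[1]

C = X @ X.T / m                     # covariance matrix with the 1/m convention
print(C)                            # [[1.2 0.8]
                                    #  [0.8 1.2]]
print(np.cov(X, bias=True))         # bias=True also divides by m, so it matches C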

Diagonalization of covariance matrix

According to the deduction above, achieving our optimization goal is equivalent to diagonalizing the covariance matrix: making all elements other than the diagonal equal to 0 and sorting the diagonal elements from largest to smallest. Stated this way it may still not be entirely clear, so let's look further at the relationship between the covariance matrix of the original data and the covariance matrix of the data after a basis transformation:

Let the covariance matrix of the original data matrix X be C, let P be a matrix whose columns are a set of basis vectors, and let $Y = P^\mathsf{T}X$; then Y is the data obtained by transforming X onto the basis contained in P. Let the covariance matrix of Y be D. Then:

$$D = \frac{1}{m}YY^\mathsf{T} = \frac{1}{m}(P^\mathsf{T}X)(P^\mathsf{T}X)^\mathsf{T} = P^\mathsf{T}\left(\frac{1}{m}XX^\mathsf{T}\right)P = P^\mathsf{T}CP$$
It's quite clear now! The P we are looking for is nothing but the P that diagonalizes the original covariance matrix. In other words, the optimization goal becomes: find a matrix P such that $P^\mathsf{T}CP$ is a diagonal matrix with its diagonal elements arranged from largest to smallest; then the first K columns of P are the basis vectors we want. Multiplying the transpose of the matrix formed by the first K columns of P by X reduces X from N dimensions to K dimensions while satisfying the optimization conditions above.

As noted above, the covariance matrix C is a real symmetric matrix. In linear algebra, real symmetric matrices have a series of very nice properties:

1) The eigenvalues of a real symmetric matrix are all real numbers

2) The eigenvectors of a real symmetric matrix corresponding to different eigenvalues are orthogonal.

3) If an eigenvalue $\lambda$ has multiplicity r, then there must be r linearly independent eigenvectors corresponding to $\lambda$, and these r eigenvectors can be unit-orthogonalized.

It follows from the properties above that a real symmetric matrix with n rows and n columns must have n unit orthogonal eigenvectors. Let these n eigenvectors be $e_1, e_2, \cdots, e_n$; we form them into a matrix by columns:

$$E = \begin{pmatrix} e_1 & e_2 & \cdots & e_n \end{pmatrix}$$
Then for the covariance matrix C we have the following conclusion:

$$E^\mathsf{T}CE = \Lambda = \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{pmatrix}$$
where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues corresponding to the respective eigenvectors (there may be duplicates).

At this point, we have found the matrix P we were looking for: P = E.

P is the matrix obtained by normalizing the eigenvectors of the covariance matrix and arranging them as columns, so each column of P is an eigenvector of C. If the columns of P are arranged from left to right in descending order of their eigenvalues, then multiplying the transpose of the matrix formed by the first K columns of P by the original data matrix X gives the dimension-reduced data matrix Y.
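A minimal sketch of this diagonalization with NumPy, using the example covariance matrix from above (np.linalg.eigh is used because C is real symmetric; it returns eigenvalues in ascending order, so we reverse them):

import numpy as np

C = np.array([[1.2, 0.8],
              [0.8, 1.2]])              # covariance matrix of the example data (1/m convention)

eigvals, E = np.linalg.eigh(C)          # columns of E are unit orthogonal eigenvectors
order = np.argsort(eigvals)[::-1]       # sort eigenvalues from largest to smallest
eigvals, E = eigvals[order], E[:, order]

print(eigvals)                          # [2.  0.4]
print(np.round(E.T @ C @ E, 10))        # the diagonal matrix diag(2, 0.4)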

This concludes the discussion of the mathematical principles behind PCA.

PCA algorithm

The algorithm steps of PCA are summarized as follows:

Suppose we have m records of n-dimensional data.

1) Arrange the original data by columns into a matrix X with n rows and m columns

2) Zero-mean each row of X (representing an attribute field), i.e., subtract the mean of this row

3) Compute the covariance matrix $C = \frac{1}{m}XX^\mathsf{T}$

4) Work out the eigenvalues and corresponding eigenvectors of the covariance matrix

5) Arrange the eigenvectors into a matrix in columns from left to right according to the size of the corresponding eigenvalues, and take the first k columns to form the matrix P

6) $Y = P^\mathsf{T}X$ is the data after reduction to k dimensions
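These six steps can be transcribed almost literally into NumPy. Here is a minimal sketch (it uses the 1/m covariance of step 3, unlike the np.cov-based implementation later in the article, and the sign of the result may be flipped, which is harmless for PCA):

import numpy as np

def pca_steps(data, k):
    # data: one sample per row; arrange samples as columns -> n x m matrix X (step 1)
    X = data.T.astype(float)
    # Step 2: zero-mean each row (each row is a field)
    X = X - X.mean(axis=1, keepdims=True)
    m = X.shape[1]
    # Step 3: covariance matrix C = (1/m) X X^T
    C = X @ X.T / m
    # Step 4: eigenvalues and eigenvectors of C
    eigvals, eigvects = np.linalg.eigh(C)
    # Step 5: columns sorted by descending eigenvalue; keep the first k as P
    P = eigvects[:, np.argsort(eigvals)[::-1][:k]]
    # Step 6: Y = P^T X is the k-dimensional data
    return P.T @ X

data = np.array([[-1, -2], [-1, 0], [0, 0], [2, 1], [0, 1]])  # the example data, one sample per row
print(pca_steps(data, 1))   # approximately [[-2.121 -0.707  0.     2.121  0.707]] (up to sign)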

Example

Take the zero-meaned data from the example above:

$$X = \begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}$$

and use PCA to reduce this two-dimensional data to one dimension.

Since each row of this matrix is already zero-mean, we can directly compute the covariance matrix:

$$C = \frac{1}{5}\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}\begin{pmatrix} -1 & -2 \\ -1 & 0 \\ 0 & 0 \\ 2 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} \frac{6}{5} & \frac{4}{5} \\ \frac{4}{5} & \frac{6}{5} \end{pmatrix}$$
Then we compute its eigenvalues and eigenvectors; the detailed solution procedure is not covered here and can be found in any linear algebra reference. The eigenvalues are:

$$\lambda_1 = 2, \qquad \lambda_2 = \frac{2}{5}$$
The corresponding eigenvectors are:

$$c_1\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad c_2\begin{pmatrix} -1 \\ 1 \end{pmatrix}$$
These eigenvectors are general solutions, where $c_1$ and $c_2$ can be arbitrary nonzero real numbers. The normalized eigenvectors are:

$$\begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}, \qquad \begin{pmatrix} -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}$$
So our matrix P is:

$$P = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}$$
We can verify that P diagonalizes the covariance matrix C:

$$P^\mathsf{T}CP = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}\begin{pmatrix} \frac{6}{5} & \frac{4}{5} \\ \frac{4}{5} & \frac{6}{5} \end{pmatrix}\begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & \frac{2}{5} \end{pmatrix}$$
Finally, multiplying the transpose of the first column of P (that is, the first row of $P^\mathsf{T}$) by the data matrix gives the dimension-reduced representation:

$$Y = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix} = \begin{pmatrix} -\frac{3}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 & \frac{3}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}$$
Code implementation

import numpy as np
import pandas as pd
# topNfeat: the number of dimensions to keep after reduction
def PCA(X, topNfeat=9999999):
    # 1. By default the raw data has one sample per row and one feature per column,
    #    so take the transpose so that each column represents a sample
    X = X.T
    # 2. Zero-mean each row of the data (each row represents an attribute field)
    meanValues = np.mean(X, axis=1)    # mean of each row
    meanValues = np.mat(meanValues).T  # convert the vector to an n*1 matrix
    meanRemoved = X - meanValues       # subtract the mean

    # 3. Compute the covariance matrix
    covMat = np.cov(meanRemoved)  # np.cov divides by (number of samples - 1), the degrees of freedom
    # covMat = meanRemoved @ meanRemoved.T / (meanRemoved.shape[1])
    print("Covariance matrix: \n", covMat)

    # 4. Compute the eigenvalues and corresponding eigenvectors of the covariance matrix
    eigVals, eigVects = np.linalg.eig(covMat)
    print("Eigenvalues: \n", eigVals)
    print("Eigenvectors: \n", eigVects)

    # 5. Arrange the eigenvectors as columns from left to right by decreasing eigenvalue,
    #    and take the first topNfeat columns to form the matrix P
    #    (argsort returns the indices that sort the array ascending, so negate to sort descending)
    eigValInd = np.argsort(-eigVals)
    eigValInd = eigValInd[0:topNfeat]     # keep the indices of the first topNfeat eigenvalues
    redEigVects = eigVects[:, eigValInd]  # redEigVects is the required transformation matrix, i.e. P
    print("Transformation matrix: \n", redEigVects)

    # 6. Y = P^T X is the data after reduction to topNfeat dimensions
    X_PCA = redEigVects.T @ X   # for data that is not already zero-mean, use meanRemoved here instead
    return X_PCA
    
    
X = np.array([[-1, -2], [-1, 0], [0, 0], [2, 1], [0, 1]])  # the example data above, one sample per row
PCA(X, topNfeat=1)

Covariance matrix: 
 [[1.5 1. ]
 [1.  1.5]]
Eigenvalues: 
 [2.5 0.5]
Eigenvectors: 
 [[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
Transformation matrix: 
 [[0.70710678]
 [0.70710678]]

array([[-2.12132034, -0.70710678,  0.        ,  2.12132034,  0.70710678]])

sklearn

sklearn already provides a PCA implementation for us. Now we use it to reduce the dimensionality of the handwritten digits dataset that ships with sklearn.

from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target
X.shape, y.shape

((1797, 64), (1797,))

First, split the dataset into a training set and a test set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
X_train.shape

(1347, 64)

First, we use the KNN algorithm to build and evaluate a model without PCA dimensionality reduction:

%%time

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)   # accuracy on the test set

Wall time: 69.8 ms

knn_clf.score(X_test, y_test)   # accuracy on the test set

0.9866666666666667

Next, apply PCA dimensionality reduction:

# PCA lives in sklearn's decomposition module
from sklearn.decomposition import PCA

# n_components is the number of dimensions to reduce to; here we go straight down to 2 dimensions
pca = PCA(n_components=2)
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)   # note that the test set must be transformed as well
X_train_reduction.shape

(1347, 2)

pca.explained_variance_ratio_
# the proportion of the total variance explained by each principal component after reduction;
# the larger the proportion, the more important that principal component is

array([0.14566817, 0.13735469])
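As a side note, a common way to decide how many components to keep is to fit a PCA with all components on the training data from above and look at the cumulative explained variance ratio; a minimal sketch (the variable names here are illustrative):

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA(n_components=X_train.shape[1])
pca_full.fit(X_train)

# Cumulative proportion of variance explained by the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest k whose cumulative ratio reaches 95%
print(np.argmax(cumulative >= 0.95) + 1)   # 28 for this split, matching pca.n_components_ below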
%%time

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_reduction, y_train)   # after dimensionality reduction, the elapsed time drops significantly

Wall time: 2 ms

knn_clf.score(X_test_reduction, y_test)
# because we reduced the data from 64 dimensions straight down to 2, the information loss is severe
# and the accuracy is only about 0.6

0.6066666666666667
# If we pass a float between 0 and 1 instead, it specifies the proportion of the original information (variance) to retain
# For example, below we keep 95% of the information
pca = PCA(0.95)
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)

pca.n_components_   # n_components_ shows the actual number of dimensions PCA reduced the data to

28
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_reduction, y_train)
knn_clf.score(X_test_reduction, y_test)

0.98

As you can see, the accuracy is essentially unchanged, but the dimensionality has been reduced from 64 to 28, saving a great deal of resources. This is the principle and code implementation of PCA (principal component analysis).

Refer to the article: www.cnblogs.com/mikewolf200…

Welcome to follow my personal WeChat official account, AI computer vision workshop. This account posts articles on machine learning, deep learning, computer vision, and related topics from time to time; you are welcome to learn and exchange ideas with me there.