1. Written at the front

If you want to work in data mining or machine learning, you need to master the common machine learning algorithms. Here is a brief overview of them:

  • Supervised learning algorithms: logistic regression, linear regression, decision trees, Naive Bayes, K-nearest neighbors, support vector machines, ensemble algorithms such as AdaBoost, etc.
  • Unsupervised learning algorithms: clustering, dimensionality reduction, association rules, PageRank, etc.

Previously published in this series:

  • Algorithm theory + practice: the K-nearest neighbor algorithm
  • Algorithm theory + practice: decision trees
  • Algorithm theory + practice: Naive Bayes
  • 【Vernacular Machine Learning】Algorithm theory + practice: support vector machines (SVM)
  • Algorithm theory + practice: the AdaBoost algorithm
  • Algorithm theory + practice: the K-means clustering algorithm

To understand these principles in detail, I have read the Watermelon Book (Zhou Zhihua's Machine Learning), Statistical Learning Methods, Machine Learning in Action and other books, and listened to some machine learning courses, but the writing always felt abstruse and I lost patience reading it: theory everywhere, while practice is what matters most.

In my opinion, understanding the idea behind an algorithm and how to use it matters more than following its mathematical derivation. The idea gives you an intuitive feel for why the algorithm is reasonable; the mathematical derivation expresses that reasonableness in a more rigorous language. For example, we can say a pear is sweet, and we can express that in mathematical language as "the sugar content is 90%", but only by taking a bite yourself can you truly feel how sweet the pear is, and truly understand what 90% sugar tastes like. If these machine learning algorithms are pears, the primary purpose of this article is to take a bite out of them. There are also several other purposes:

  • Test my own understanding of the algorithms and summarize the algorithm theory.
  • Learn the core ideas of these algorithms happily, find the fun in learning them, and lay a foundation for studying them in depth.
  • Pair the theory in each article with a practical case, so that we really learn to apply it; this not only exercises programming ability but also deepens the grasp of the theory.
  • Put all my previous notes and references in one place, so they are easy to look up later.

Learning an algorithm should not only give you its theory, but also some fun and the ability to solve practical problems!

Principal Component Analysis (PCA) is an unsupervised method commonly used for dimensionality reduction, that is, for data preprocessing. It is not used as the final model, but as an auxiliary step that helps other algorithms make better and faster decisions. As we know, if the feature dimension of the data given to us is too high, it is first of all troublesome to compute; secondly, there may be correlation between features, which increases the complexity of the problem and makes analysis inconvenient. At this point we wonder: can we get rid of some features? However, we cannot drop features at will, because blindly removing features loses key information contained in the data and easily leads to wrong conclusions, which is harmful to the analysis.

Therefore, we want a reasonable way to reduce the number of indicators we need to analyze while keeping as much information of the original data as possible. PCA is one such way. So today we will learn PCA dimensionality reduction and see what kind of task PCA accomplishes and how it accomplishes it.

First, we will spot dimensionality reduction in a scene from everyday life; then we will see what dimensionality reduction actually does through an example of reducing two-dimensional data, introducing at the macro level what task PCA completes; after that I will explain, at the micro (mathematical) level, how PCA completes this task. Finally, we will implement PCA in pure Python to reduce the dimension of the iris data set, and then call the PCA tool in sklearn to do a dimensionality-reduction analysis for face recognition, to see what PCA looks like in a real task. With this tool, your model will analyze data much better. One last reminder: although this is still a plain-language series, there will be some mathematics in it this time. After all, without the mathematics, PCA remains an abstraction no matter how it is described. It will not be too difficult, though, and with my plain language mixed in you can easily feel the power of the mathematics. Haha, let's get started! By the way, PCA and principal component analysis are the same thing!

The outline is as follows:

  • Data dimension reduction? It’s all around us!
  • What exactly is PCA doing? (Macro perspective grasp)
  • How does PCA reduce dimension? (Understand the mathematical principle of PCA from a microscopic perspective)
  • PCA programming practice (Dimensional reduction analysis of Iris data set + Face recognition)

OK, let’s go!

2. Data dimension reduction? It’s all around us!

We often say that algorithms come from life, but we walk too fast in life and don't pay attention to the scenery along the road. That's OK; as long as you like to hear other people's stories, I have kept an eye out for some for you, haha. Here's a taste of dimensionality reduction from a story:


When we walk around a city, we always find that its roads have curious names: Beijing Road, Jingha Road, all sorts of names. Take my city: the road names are quite systematic. Because to the north of us is the Bohai Sea and to the south is the Yellow River, the east-west roads are all named after the Yellow River — from south to north they are Yellow River Road 1, Yellow River Road 2, Yellow River Road 3, and so on — and the north-south roads are all named after the Bohai Sea — from east to west they are Bohai Road 1, Bohai Road 2, Bohai Road 3, and so on. It is very similar to the x and y axes of a coordinate system. When you first arrive in a city you might wonder: why name the roads at all? Some even have fancy names you have never heard of. The reason is that, with street names, a location in the city can be uniquely identified (for example, "the intersection of Yellow River Road 5 and Bohai Road 3"), so when you want to find a place it is easy to find. After all, the road names are a commonly accepted standard: when everyone says "the intersection of Yellow River Road 5 and Bohai Road 3", we can be sure we all mean the same position. These roads are rather like the coordinate system we talk about, and every place in the city is like a point in that coordinate system, uniquely identified by its coordinates. Now here is a problem: I want you to convert a location marked with two numbers — the intersection of Yellow River Road 5 and Bohai Road 3 — into a location description marked with only one number. What do you do? You may not think of it immediately. But if a railway passes through the city, and all the important buildings in the city lie alongside the railway, then the position of each point can be described by how far it is from the railway's starting point. Wouldn't that solve the problem? For example, the intersection of Yellow River Road 5 and Bohai Road 3 is quite close to the railway, right beside its starting point. This is in fact a kind of dimensionality reduction: information that originally needed two dimensions can now be expressed with one railway.

Of course, this positioning is not as accurate as using two dimensions: some locations are far from the railway, and that distance does not show up in the new representation. This shows that dimensionality reduction is not lossless; it may cause some information to be lost.

So what is dimensionality reduction good for, other than losing some information? Going back to the example above, the first use of dimensionality reduction is data compression. If you can only write your home address for a new friend on a tiny note, and "Yellow River Road ×× and Bohai Road ××" will not fit, you can just write "Railway No. 5". The second use is data visualization and feature extraction. For example, if you want to open a shop in the city, you can first see where the population is densest; after dimensionality reduction you can easily see around which points there are the most other points, i.e. where the crowd is densest. The third use is outlier detection and clustering. For example, after reducing the dimension you might find that most households live near one of the city's two railway stations, but one or two families live near neither — so you have found the city's mavericks. You might also find that of the families near the two stations, one group is surnamed Zhang and the other Li; in this way you have clustered the city's households into two categories.

Now we have an intuitive understanding of dimensionality reduction. What does this have to do with our protagonist, PCA? Principal component analysis is one way of reducing the dimension of data. As mentioned above, compressing data is bound to lose some information; what PCA does is, as far as possible, find the major key features that separate the data and drop the features that contribute little to distinguishing it. In this way we reduce the dimension while retaining as much of the original information as possible. In the example above, we found the railway line as our principal component, ordered all the houses in the city by where the railway reaches them, and thus obtained a numerical representation by distance. As for the lost location information: yes, some is lost, but at least every place can still be represented.

Let’s take a closer look at what PCA is doing.

3. What exactly is PCA doing? (Macro perspective grasp)

Above, we got a feel for dimensionality reduction from a life scene and learned that principal component analysis is a dimensionality-reduction method that retains as much of the data's information as possible while reducing its dimension. I said above that the railway is the principal component we found — but what is a principal component in data? Let's look at an example. The following table shows some students' scores in Chinese, math, physics and chemistry. First, assume the scores in these subjects are unrelated, that is, a student's score in one subject says nothing about their scores in the others. Now I ask you to distinguish the learning level of the three students using as few subjects as possible. I am sure you would not choose the Chinese score as the criterion (because the Chinese scores are all the same — there is no difference), and you can see at a glance that math, physics and chemistry can serve as principal components of this data (obviously math is the first principal component: for the most part we can tell the three students apart by their math scores alone). This is just like our usual exams — why do parents like to say "math is what counts"? Because math really does spread the scores out and decide who comes first. See, a single math score can complete the classification task without considering the other three subjects. Isn't that dimensionality reduction?

So how do you know that math is the principal component? You say: because math spreads the scores out. Spreading the scores out really means that the gap between students may be relatively large — this is the variance we often talk about in probability theory. The larger the variance, the more information we obtain. When PCA looks for principal components, it is actually looking for K directions that separate the samples as far as possible, i.e. the directions with maximum variance, as the principal components. This reduces the data to K dimensions (the original dimension is of course larger than K) while retaining as much of the data's information as possible.
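To make "larger variance means more distinguishing power" concrete, here is a minimal sketch with made-up scores (the table from the original post is not reproduced here, so these numbers are purely hypothetical):

    import numpy as np

    # Hypothetical scores for 3 students in 4 subjects (rows = students).
    # Chinese is identical for everyone; math spreads the students out the most.
    scores = np.array([
        # Chinese, Math, Physics, Chemistry
        [90, 95, 82, 80],
        [90, 72, 76, 75],
        [90, 48, 65, 72],
    ])

    subjects = ["Chinese", "Math", "Physics", "Chemistry"]
    variances = scores.var(axis=0)          # variance of each subject across students
    for name, v in zip(subjects, variances):
        print(f"{name:9s} variance = {v:.1f}")

    # Chinese has variance 0 (useless for telling students apart);
    # Math has the largest variance, so it alone already separates the three students.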

OK, the table above was relatively simple — you could see the principal component immediately. Now look at another set of scores. Can you pick out which subject can serve as the principal component? You might say: that's easy, didn't you say to look for the largest variance? I'll compute the variance of each subject and see which one is largest. Not quite — finding the principal component is not necessarily picking the original subject with the largest variance; it is more like a balanced combination of these subjects. What does that mean? Each subject represents a direction, so every student becomes a point in this space according to their scores. The principal component direction we are looking for is not necessarily one of the original directions (or several of them); it is the direction such that, after projecting onto it, the students are spread as far apart as possible — that is, the variance is as large as possible — which helps us distinguish them.


Let's borrow this picture from Andrew Ng: if we look at these data points in two dimensions and project the samples onto the orange line, you will see that the projections of the points onto the orange line are spread far apart, and it is still easy to separate the data along this line. But if you project onto the magenta line, the projections are squeezed too close together to tell apart. So we tend to choose the direction of the orange line as the principal component.

So, for scores with higher dimensions, we prefer to split the data along an axis like this (looking at it from another angle). Well, that is what PCA does at the macro level: look for principal components, separating the data as much as possible without losing too much of the original information.

To be specific, when PCA reduces dimension it treats the data as points in space, tries to find several directions (the orange line above, PC1 and PC2 below), projects the points onto them, and makes the projections as spread out as possible. These directions are the principal components, and the sample points in space can then be described in terms of these new directions. There is also a requirement on the directions themselves: they must not interfere with each other, i.e. they must have no linear relationship, just like the x and y axes, so that they describe the data well and there is no redundancy between principal components.

Therefore, principal components must satisfy two standard conditions: they are uncorrelated with each other, and when they are used to describe the data, the variance should be as large as possible.

So how does PCA actually do this? For that we need to understand it mathematically.

4. How does PCA reduce dimension? (Understand the mathematical principle of PCA from a microscopic perspective)

When it comes to mathematics we need to be more rigorous. As we saw above, PCA looks for principal components, and the two standard conditions on principal components are: first, they are uncorrelated with each other (note that uncorrelated does not mean independent — it only means there is no linear relationship); second, when they describe the data, the variance is as large as possible, i.e. the projected data points are spread as far apart as possible.

The next step is to figure out how PCA measures these two conditions. For the first condition, uncorrelatedness, we can look for a set of principal components whose pairwise covariance is zero (more on this later). For the second condition — projecting the data as far apart as possible — things are a little trickier, so before tackling it let's look at what a projection is and how to measure it.

This has to start with vector representation and basis transformation:

4.1 Vector representation and basis transformation

To see what a projection is, let's look at the inner product of vectors.

Inner product: the inner product of two vectors $A$ and $B$ can be written as

$$A \cdot B = |A|\,|B|\cos\theta$$

where $\theta$ is the angle between them. So if the magnitude of $B$ is 1, then $A \cdot B = |A|\cos\theta$, i.e. the inner product of $A$ and $B$ equals the length of the projection of $A$ onto the line along $B$. Take a look at the picture below: taking the inner product of two vectors is doing a projection.
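As a quick sanity check (just a sketch, using arbitrary example vectors), the projection length of A onto a unit vector B is exactly their dot product:

    import numpy as np

    A = np.array([3.0, 2.0])
    B = np.array([1.0, 1.0])
    B_unit = B / np.linalg.norm(B)           # make |B| = 1

    proj_length = A.dot(B_unit)              # inner product with a unit vector
    cos_theta = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))

    print(proj_length)                       # 3.5355... = |A| * cos(theta)
    print(np.linalg.norm(A) * cos_theta)     # same value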

So, let’s look at how vectors are represented in space:


If we pick a basis in space, then every vector in that space can be represented as a linear combination of the basis vectors. What does that mean? For example, if we fix a coordinate system, then the red vector is (3, 2), because its projection onto the x axis is 3 and its projection onto the y axis is 2. In fact this vector can be written as the linear combination

$$(3, 2) = 3 \times (1, 0) + 2 \times (0, 1)$$

and (1, 0) and (0, 1) are called a basis of the two-dimensional space. We also want the basis vectors to be unit vectors and to be perpendicular to each other, i.e. unrelated. If you are wondering why we bother with a basis — remember the first condition for principal components? Uncorrelated. Is that just a coincidence? No, it is a bit of foreshadowing that is itching to come out.

So what do we do with this basis? Don't panic. Once we have a basis, we can do a change of basis. What does that mean? In the figure above we said the red vector's coordinates are (3, 2), but strictly speaking that is only true given the coordinate system: in my x-y coordinate system, the red vector's coordinates are (3, 2). What if I change the coordinate system? Look at the picture below and tell me: in the blue coordinate system, what are the coordinates of the red vector? Obviously no longer (3, 2). So when you think about things in the future, remember the frame of reference — everything is true only under certain conditions. Ha, I didn't expect to learn a life lesson from a basis transformation. So what do the original coordinates (3, 2) become under the new coordinate system? We still need a basis for the new coordinate system; looking at the blue arrows, the basis could be α and β. Then what do we do with that basis?


Transformation: take the inner product of the data with the first basis vector to get the first new coordinate component, then take the inner product with the second basis vector to get the second new coordinate component. The data point (3, 2) is thus mapped to its coordinates in the new basis.

So, in the blue coordinate system, the red vector's coordinates are exactly those two inner products, one with α and one with β.

So, having understood the above transformation, we can get a more general expression for the basis transformation:

The meaning of multiplying two matrices is this: transform every column vector of the right-hand matrix into the space defined by the row vectors of the left-hand matrix. In other words, if the rows of the left matrix are a set of basis vectors and the columns of the right matrix are the samples, then the product of the two matrices is the representation of the sample set in the space defined by the new basis.
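A minimal numpy sketch of this idea (the new basis here is just an example — a 45° rotation — not the specific α and β from the figure):

    import numpy as np

    # Rows of P are the new basis vectors (unit length, mutually orthogonal).
    s = 1 / np.sqrt(2)
    P = np.array([[ s, s],     # first basis vector
                  [-s, s]])    # second basis vector, orthogonal to the first

    # Columns of X are the samples; here a single sample, the vector (3, 2).
    X = np.array([[3.0],
                  [2.0]])

    Y = P @ X                  # coordinates of the samples in the new basis
    print(Y.ravel())           # [ 3.5355..., -0.7071...] = (5/sqrt(2), -1/sqrt(2))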

So what does all this about bases have to do with PCA? Now we can say it:


What PCA does is find a set of basis vectors (the principal components) that are uncorrelated with each other, and transform all the data into coordinates on this basis so that the variance is as large as possible.

We know the basis vectors must be uncorrelated, but how does PCA actually pick them? It certainly does not pick all the directions at once. What PCA does is: first pick one basis direction that maximizes the variance of the data projected onto it; then pick a second basis direction, orthogonal to the first, that maximizes the variance of the projections onto it; then a third basis direction orthogonal to the first two; and so on, always keeping the basis vectors orthogonal to each other while making each projection's variance as large as possible. Once we talk about orthogonal basis directions, we cannot avoid another mathematical concept: covariance. Let's start with that.

4.2 Covariance

If variance measures how much a single variable fluctuates, then covariance measures how correlated two variables are. The covariance of two variables X and Y (with m samples each) can be written as:

$$\mathrm{Cov}(X, Y) = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})$$

If we first center the data so that the means of X and Y are both 0, the covariance simplifies to $\mathrm{Cov}(X, Y) = \frac{1}{m}\sum_{i=1}^{m}x_i y_i$.

So if we want two basis directions to be uncorrelated, we must make their covariance zero.
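A quick sketch (with random example data) showing that after centering, the covariance is just the averaged product of the two variables:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 0.5 * x + rng.normal(size=100)       # y is correlated with x

    # Center both variables so their means are 0
    xc, yc = x - x.mean(), y - y.mean()

    cov_manual = (xc * yc).mean()            # (1/m) * sum(x_i * y_i) after centering
    print(cov_manual)                        # clearly non-zero: x and y are correlated

    # An uncorrelated pair should give a covariance near 0
    z = rng.normal(size=100)
    zc = z - z.mean()
    print((xc * zc).mean())                  # close to 0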

4.3 Optimization objectives of PCA

Having talked about basis transformation and covariance, it is easy to express the two conditions of PCA in mathematical terms:


Reduce a set of N-dimensional vectors to K dimensions (0 < K < N): choose K unit vectors that are orthogonal to each other as the basis, such that after the original data is transformed onto this basis, the covariance between any two features is 0 and the variance of each feature is as large as possible (take the K largest variances).

Let's assume our data set X is a 2×m matrix (already centered), where 2 means two feature dimensions and m means m samples. Let's compute $\frac{1}{m}XX^{T}$ and see what we get:

Do you see it? The resulting matrix is the covariance matrix of the features: the two elements on the main diagonal are the variances of the features themselves, and the two off-diagonal elements are the covariance between the two features. So in one stroke we get both the variance of each feature and the covariance between features. And this covariance matrix is a symmetric matrix!
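A small sketch (with made-up data) verifying that for centered data, (1/m)·X·Xᵀ is exactly the covariance matrix — variances on the diagonal, covariances off it:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2, 100))            # 2 features, 100 samples
    X = X - X.mean(axis=1, keepdims=True)    # center each feature (mean 0)

    m = X.shape[1]
    C = (X @ X.T) / m                        # (1/m) X X^T
    print(C)                                 # 2x2: variances on the diagonal, covariance off it

    # np.cov uses 1/(m-1) by default; with bias=True it matches our 1/m version
    print(np.cov(X, bias=True))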

Now the task is clear. How does PCA meet its two criteria? One is to make the covariance between features 0 — how? The other is to make the variances as large as possible — that is, diagonalize the covariance matrix and arrange the diagonal elements from largest to smallest.


Diagonalization of the covariance matrix: transform the covariance matrix so that all elements except those on the diagonal are 0, with the diagonal elements arranged from largest to smallest.

So how do we diagonalize this covariance matrix? This is where linear algebra comes in, with the following theorem:


Real symmetric matrix: an n×n real symmetric matrix always has n unit orthogonal eigenvectors $e_1, e_2, \dots, e_n$. Stacking them as the columns of a matrix $E = (e_1\ e_2\ \cdots\ e_n)$, we have

$$E^{T} C E = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$$

where $\lambda_1, \dots, \lambda_n$ are the n eigenvalues of the covariance matrix C.

So, as long as we arrange the eigenvalues from large to small, arrange the corresponding eigenvectors accordingly from top to bottom, and multiply the matrix formed by the first K rows by the original matrix X, we get the reduced data matrix Y. This is the result after PCA, and the first K eigenvectors form a basis of the new space — they are exactly the K principal components PCA finds. You might be a little confused at this point: why? Why do the eigenvectors of the covariance matrix diagonalize it, and why, when the eigenvalues are sorted from large to small, are the corresponding eigenvectors the basis we want? It's not obvious yet, is it?

So let's take the next step carrying this question. Assume the eigenvectors really are that basis, let P be the matrix whose rows are the first K eigenvectors, and let Y = PX be the reduced data matrix. Now look at what we get when we take the inner product of Y with itself and multiply by 1/m:

$$\frac{1}{m}YY^{T} = \frac{1}{m}(PX)(PX)^{T} = P\left(\frac{1}{m}XX^{T}\right)P^{T} = PCP^{T} = \Lambda_K$$

What do we get? The inner product of the reduced data matrix Y with itself (times 1/m) is the covariance matrix in the new dimensions, and this covariance matrix is exactly the diagonal matrix formed by the first K eigenvalues obtained when the original covariance matrix C was diagonalized.
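A short sketch (again on made-up centered data) checking both facts at once: the eigenvectors of C diagonalize it, and the covariance of Y = PX is the diagonal matrix of the top-K eigenvalues:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(4, 200))            # 4 features, 200 samples
    X = X - X.mean(axis=1, keepdims=True)    # center

    m = X.shape[1]
    C = X @ X.T / m                          # covariance matrix (4x4, symmetric)

    # eigh is for symmetric matrices; its eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]        # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    K = 2
    P = eigvecs[:, :K].T                     # rows of P are the top-K eigenvectors
    Y = P @ X                                # reduced data, shape (2, 200)

    print(np.round(Y @ Y.T / m, 6))          # diagonal matrix with the top-2 eigenvalues
    print(np.round(eigvals[:K], 6))          # matches the diagonal above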

This shows that the principal components PCA finds are uncorrelated with each other, and the variance of each principal component is one of the K largest values selected from the N dimensions. That is exactly what we were looking for, so transforming X to Y with these eigenvectors as the new basis is completely reasonable.

This is how PCA mathematically finds the K principal components in detail.

Let's walk through the process, and then feel it through an example. When we get a new data set X (N dimensions) and want to reduce it to K dimensions, PCA does the following:

  • First, center each dimension of X so that its mean is 0
  • Then compute the covariance matrix $C = \frac{1}{m}XX^{T}$
  • Then compute the eigenvalues and eigenvectors of C, arrange the eigenvalues from large to small, and arrange the corresponding eigenvectors accordingly
  • This diagonalizes the covariance matrix
  • Take the first K eigenvectors; these are the K basis vectors
  • Finally, multiply the matrix formed by these K basis vectors by X to get the reduced K-dimensional matrix Y

Here’s an example to get a feel for the process:


  1. Input data X: 2 features, 5 training samples; subtract the mean of each feature
  2. Compute the covariance matrix $C = \frac{1}{5}XX^{T}$


  3. Compute the eigenvalues and eigenvectors of C, then normalize the eigenvectors to unit length


  4. Diagonalize the covariance matrix (the diagonal entries are the eigenvalues, sorted from large to small)


  5. We reduce to one dimension, so we take the eigenvector with the largest eigenvalue as the basis, multiply it by X, and get the final one-dimensional result Y

That wraps up the mathematical principle of PCA — have you digested it? If not, the code below can still help you understand the process. Next we will hand-write PCA to reduce the dimension of the iris data set.
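Before moving on to iris, here is a compact sketch of the five steps above on a small 2-feature, 5-sample matrix (the numbers are made up for illustration, not necessarily the ones from the original worked example):

    import numpy as np

    # Hypothetical data: 2 features (rows), 5 samples (columns)
    X = np.array([[1.0, 1.0, 2.0, 4.0, 2.0],
                  [1.0, 3.0, 3.0, 4.0, 4.0]])

    # 1. Subtract the mean of each feature
    X = X - X.mean(axis=1, keepdims=True)

    # 2. Covariance matrix C = (1/m) X X^T
    m = X.shape[1]
    C = X @ X.T / m

    # 3. Eigenvalues and unit eigenvectors, sorted from large to small
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 4./5. Reduce to K = 1: take the top eigenvector as the basis and project
    K = 1
    P = eigvecs[:, :K].T        # (1, 2): the single basis vector as a row
    Y = P @ X                   # (1, 5): the data expressed in one dimension
    print(Y)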

5. PCA programming practice

As for putting PCA into practice, we already know that PCA is generally used for data preprocessing and dimensionality reduction, playing a strong supporting role in helping the downstream model work better. Now let's see how to turn the mathematical process above into Python code, starting from the simple iris data set, and see the effect of reducing its dimension.

5.1 Dimension reduction analysis of iris data set

First, I want to use the iris data set (because it is simple and easy to understand) to implement the PCA process above step by step and see what effect the dimensionality reduction has in the end. Let's get started:


Recall the above process:

  1. Get the data and standardize it
  2. Compute the covariance matrix C
  3. Find the eigenvalues and eigenvectors of the covariance matrix, and sort the eigenvalues
  4. Take the first K eigenvectors as the basis and multiply by X to get the reduced data matrix Y

Here is the implementation, step by step:

  1. Import packages (the iris data set is in sklearn.datasets)

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
  2. Import the data and normalize it

    iris = load_iris()
    X = iris.data                                  # X.shape == (150, 4)
    X_norm = StandardScaler().fit_transform(X)     # standardize
    X_norm.mean(axis=0)                            # the mean of each dimension is now 0
  3. The following is the process of dimensionality reduction using PCA

    # Eigenvalues and eigenvectors
    # np.cov computes the covariance matrix directly: each row is a feature, each column a sample
    ew, ev = np.linalg.eig(np.cov(X_norm.T))

    # Sort the eigenvalues from large to small and reorder the eigenvectors accordingly
    ew_order = np.argsort(ew)[::-1]
    ew_sort = ew[ew_order]
    ev_sort = ev[:, ew_order]   # ev_sort.shape == (4, 4)

    # We reduce to K = 2, so the first two columns of the sorted eigenvectors are the basis
    V = ev_sort[:, :2]          # 4*2
    X_new = X_norm.dot(V)       # shape (150, 2)
  4. Let’s visualize X_new to see what it looks like after dimensionality reduction:

    colors = ['red', 'black', 'orange']

    plt.figure()
    for i in [0, 1, 2]:
        plt.scatter(X_new[iris.target == i, 0],
                    X_new[iris.target == i, 1],
                    alpha=.7,
                    c=colors[i],
                    label=iris.target_names[i]
                   )

    plt.legend()
    plt.title('PCA of IRIS dataset')
    plt.xlabel('PC_0')
    plt.ylabel('PC_1')
    plt.show()

    The results are as follows: after PCA reduces the dimension, the features become 2 columns, so the data becomes easy to visualize, and we can see that each class of the iris data is now relatively easy to separate. So basic machine learning algorithms such as decision trees, KNN and so on applied afterwards can get good results; this not only simplifies the computation, it also gives us a basis for choosing an algorithm.

This process was written out to help digest the mathematical formulas. In fact, sklearn has already packaged this tool for us and we can use it directly — the whole dimensionality reduction can be done in one line, which is much simpler. Let's try it:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_new = pca.fit_transform(X_norm)

"""View some properties of PCA"""
print(pca.explained_variance_)              # the amount of information (explained variance) carried by each new feature
print(pca.explained_variance_ratio_)        # each new feature's share of the original data's total information
print(pca.explained_variance_ratio_.sum())  # total information retained after dimensionality reduction

# Output:
# [4.22824171 0.24267075]    the variances of the features after dimensionality reduction
# [0.92461872 0.05306648]    the proportion of the original information each new feature carries
# 0.977685206318795          about 3% of the information is lost while dropping half of the features -- not bad

In this way the dimensionality reduction of X_norm is done. For details about PCA in sklearn, please refer to the documentation. Here are a few small details about PCA:


  • pca.explained_variance_ shows the amount of information (explained variance) carried by each new feature after dimensionality reduction; pca.explained_variance_ratio_ shows each new feature's share of the original data's total information; pca.explained_variance_ratio_.sum() gives the total information retained after dimensionality reduction.
  • How do we choose the n_components parameter? It can be an integer (greater than 0 and less than the total number of dimensions), directly specifying how many dimensions to keep. It can also be n_components='mle', which lets the computer choose the number of components by itself based on a maximum-likelihood estimate. Or it can be a float between 0 and 1 together with svd_solver='full', which tells PCA directly how much of the information to retain — a nice way to specify the retained information without plotting and exploring: first ask for, say, 80% of the information, see how many dimensions that keeps, and then adjust from there. A short sketch of these options follows this list.
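A minimal sketch of the three ways to set n_components (the iris X_norm from above is assumed to already exist):

    from sklearn.decomposition import PCA

    # 1. An integer: keep exactly this many dimensions
    pca_int = PCA(n_components=2).fit(X_norm)

    # 2. 'mle': let PCA choose the number of components by maximum likelihood estimation
    pca_mle = PCA(n_components='mle').fit(X_norm)
    print(pca_mle.n_components_)                    # number of dimensions chosen automatically

    # 3. A float in (0, 1) with svd_solver='full': keep enough components
    #    to explain at least this fraction of the variance
    pca_80 = PCA(n_components=0.8, svd_solver='full').fit(X_norm)
    print(pca_80.n_components_)                     # dimensions needed to keep ~80% of the information
    print(pca_80.explained_variance_ratio_.sum())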

5.2 Dimension reduction practice on a face data set

This time we use the face data set that comes with sklearn, fetched with fetch_lfw_people (from sklearn.datasets). Let's see what PCA looks like on it:

# Import packages
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import numpy as np
# Import the data and explore it
faces = fetch_lfw_people(min_faces_per_person=60)
faces.images.shape   # (1348, 62, 47)   1348 pictures, each 62*47
faces.data.shape     # (1348, 2914)     the last two dimensions flattened: 2914 features (pixels)
# Let's visualize a few of the faces first to see what they look like
fig, axes = plt.subplots(3, 8, figsize=(8, 4), subplot_kw={"xticks": [], "yticks": []})
for i, ax in enumerate(axes.flat):
    ax.imshow(faces.images[i, :, :], cmap='gray')

Next, we use PCA for dimensionality reduction

pca = PCA(150).fit(faces.data)   # reduce to 150 dimensions
V = pca.components_              # this is the basis, shape (150, 2914); each row is a basis vector,
                                 # and multiplying it by our samples X gives the reduced matrix

The components_ attribute above is the set of basis vectors extracted for the 150 dimensions, i.e. the eigenvectors in the mathematical formulas. We can visualize them and see for ourselves what the basis looks like (i.e. what principal-component features PCA has extracted from the raw data).

fig, axes = plt.subplots(3, 8, figsize=(8, 4), subplot_kw={"xticks": [], "yticks": []})
for i, ax in enumerate(axes.flat):
    ax.imshow(V[i, :].reshape(62, 47), cmap='gray')

The results are as follows. Why do these look like ghosts? Don't be afraid — these are the principal-component features PCA extracted. Although you cannot see whole faces, you will notice that the features PCA extracts focus on the facial features: eyes, mouth and nose are quite prominent. This is quite consistent with how we recognize faces ourselves — mainly by the differences in facial features — so the features PCA extracts here are reasonable.

Then you may wonder: can the dimensionality reduction done by PCA be reversed? PCA provides an interface called inverse_transform, which maps the reduced data back to the original feature space. We can do an experiment:

X_dr = pca.transform(faces.data)        # X_dr.shape == (1348, 150)
X_inverse = pca.inverse_transform(X_dr)
X_inverse.shape                         # (1348, 2914)  look, the shape comes back

Look at the shape — it does come back. But does the content come back?

fig, ax = plt.subplots(2, 10, figsize=(10, 2.5), subplot_kw={"xticks": [], "yticks": []})
for i in range(10):
    ax[0, i].imshow(faces.images[i, :, :], cmap='binary_r')
    ax[1, i].imshow(X_inverse[i].reshape(62, 47), cmap="binary_r")  # dimensionality reduction is not completely reversible

We can see that the recovered images are blurry, although the faces are still almost recognizable.


inverse_transform maps X_dr back into a space with the same dimensions as the original data; it does not restore the original data exactly. But we also see that the data reduced to 150 dimensions really does retain most of the original information, so the reconstructed images look very similar to the originals, only slightly blurred.

So dimensionality reduction cannot be fully reversed — the information that was thrown away is hard to get back. But this helps us understand a common application of PCA: real-world face recognition systems. At bus stations and train stations, when we show our ID, why can the system judge so quickly whether it is really us? Techniques like PCA are part of the answer: using, say, 150 features to decide whether you are that person is much cheaper than using the original 2,914 features.

6. Summary

First, we felt what dimensionality reduction is from a life scene, then used an example to explain what PCA actually does. To sum up: when PCA looks for principal components, it is looking for K directions that separate the samples as far as possible, and those K directions are uncorrelated with each other. This reduces the data to K dimensions while retaining as much information as possible. On that basis we discussed the mathematical principle of PCA and understood how it finds the K uncorrelated directions. Finally, we implemented the mathematical computation of PCA on the iris example, got to know PCA in sklearn, and then used it to reduce and invert the face recognition data set and compared the results.

In short, I hope this is helpful. PCA is a very useful technique, generally used in data preprocessing and feature engineering. One final detail: what is the difference between PCA and ordinary feature selection? Feature selection picks, from the existing features, those that carry the most information, and the selected features are still interpretable. PCA instead compresses the existing features: after dimensionality reduction, the new features are no longer any of the original ones, which means we no longer know what each PCA feature "means". A small sketch of this contrast follows.
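A minimal sketch of that contrast on the iris data (SelectKBest here is just one common feature-selection tool, used for illustration):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    iris = load_iris()
    X, y = iris.data, iris.target

    # Feature selection: keeps 2 of the ORIGINAL columns, so they stay interpretable
    selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
    print([iris.feature_names[i] for i in selector.get_support(indices=True)])
    # e.g. ['petal length (cm)', 'petal width (cm)'] -- still real measurements

    # PCA: builds 2 NEW features as combinations of all 4 original columns
    pca = PCA(n_components=2).fit(X)
    print(pca.components_)   # each row mixes all four original features; no direct physical meaning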

