I. Introduction to PCA

1. Relevant background

In many fields of research and application, it is often necessary to observe a large number of variables and to collect large amounts of data for analysis. A large multivariate sample certainly provides rich information, but it also increases the workload of data collection. More importantly, in most cases many of the variables are correlated with one another, which increases the complexity of the analysis. If each indicator is analyzed separately, the analysis is isolated rather than comprehensive, while blindly dropping indicators loses a great deal of information and easily leads to wrong conclusions.

Therefore, a reasonable method is needed to reduce the number of indicators to be analyzed while losing as little as possible of the information contained in the original indicators, so that the collected data can still be analyzed comprehensively. Since the variables are correlated to some degree, it is possible to use a smaller number of composite indicators to summarize the information carried by the original variables. Principal component analysis (PCA) and factor analysis are dimensionality reduction methods of this kind.

Table 1 shows the Chinese, mathematics, physics and chemistry scores of several students:

[Table 1]

First, assume that the scores in these subjects are uncorrelated, that is, the score in one subject has no bearing on the scores in the others. Then it can be seen at a glance that the math, physics and chemistry scores form the principal components of the data (obviously, math is the first principal component, because the math scores are the most spread out). Why can you tell at a glance? Because the coordinate axes happen to be the right ones! Now see Table 2 for the math, physics, chemistry, Chinese, history and English scores of another group of students.

[Table 2]

There is so much data that it looks a bit messy! In other words, you cannot see the principal components of this data set directly, because in this coordinate system the data are very scattered. The root of the problem is that the current axes do not let you see through the fog. If you plot the data in the corresponding space, you may be able to find the principal components from a different perspective, as shown in Figure 1 below:

[Figure 1]

But for higher-dimensional data, can you still picture the distribution? Even if you could, how do you find the axes of the principal components exactly, and how do you measure how much of the information the extracted components carry? This is where principal component analysis comes in.

3. Data dimension reduction

To illustrate what the principal components of data are, let us start with data dimension reduction. How does it work? Suppose you have a series of points in three dimensions that lie on an inclined plane passing through the origin. If you represent the data in the natural coordinates x, y and z, you need three dimensions, yet the points are in fact only distributed on a two-dimensional plane. So where is the catch? If you think about it a little more, could you rotate the x, y, z coordinate system so that the plane of the data coincides with the x, y plane? Exactly! If the rotated coordinate system is denoted x', y', z', then this set of data can be represented using only the x' and y' dimensions. Of course, if you want to recover the original representation, you have to save the transformation matrix between the two coordinate systems. This is how the dimension of the data is reduced.

To see the nature of this process: if you arrange the data as the rows or columns of a matrix, the rank of that matrix is 2. These data are correlated, and the largest linearly independent set of these vectors through the origin contains only 2 vectors, which is why the plane was assumed to pass through the origin in the first place. What if the plane does not pass through the origin? That is what data centering is for: shift the coordinate origin to the center of the data, so that vectors that were previously linearly independent become linearly dependent in the new coordinate system. Interestingly, any three points are coplanar, which is to say that any three points in three-dimensional space are linearly dependent after centering. In general, n points, once centered, can always be analyzed in a subspace of dimension at most n-1.

In the previous paragraph it was argued that nothing is discarded when the dimension of the data is reduced, because the component of every point in the third direction, perpendicular to the plane, is zero. Now suppose the data have a small jitter along the z' axis. We can still represent the data in two dimensions, because we may assume that the information along the other two axes constitutes the principal components of the data and is sufficient for our analysis, while the jitter along z' is most likely noise. In other words, the data were originally perfectly correlated, and the noise makes the correlation imperfect; however, the spread of the data along the z' axis is very small compared with the spread within the plane. Taking all this into account, the projections of the data onto the x' and y' axes can be regarded as the principal components of the data.
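
The rank and jitter argument above is easy to check numerically. Below is a minimal MATLAB sketch (not part of the original post; the plane directions, jitter level and variable names are my own choices, and the centering step relies on implicit expansion, i.e. R2016b or newer):

% Minimal sketch: points on an inclined plane through the origin give a
% centred data matrix of rank 2; a small jitter adds one tiny singular value.
rng(1);
n  = 50;
v1 = [1 0 1] / norm([1 0 1]);             % two directions spanning the plane
v2 = [0 1 1] / norm([0 1 1]);
P  = randn(n, 1) * v1 + randn(n, 1) * v2; % n x 3 points lying on the plane
Pc = P - mean(P);                         % centre the data
disp(rank(Pc))                            % prints 2
Pj = Pc + 0.01 * randn(n, 3);             % small "noise" off the plane
disp(svd(Pj)')                            % two large singular values, one tiny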

The feature selection problem mentioned in class is about eliminating features that are unrelated to the class labels. Here, however, many of the features are related to the class labels but contain noise or redundancy. In this case a feature dimension reduction method is needed to reduce the number of features, reduce noise and redundancy, and reduce the chance of overfitting.

The idea of PCA is to map the n-dimensional features onto k new, mutually orthogonal dimensions (k < n). These k dimensions are called the principal components; they are newly constructed features rather than the result of simply discarding n-k of the original n features.

II. A PCA example

Now assume the following set of data:

    x      y
   2.5    2.4
   0.5    0.7
   2.2    2.9
   1.9    2.2
   3.1    3.0
   2.3    2.7
   2.0    1.6
   1.0    1.1
   1.5    1.6
   1.1    0.9

The rows represent the samples and the columns represent the features: there are 10 samples with two features each. You can think of them as 10 documents, where x is the TF-IDF of the word "learn" in each document and y is the TF-IDF of the word "study".

The first step is to compute the mean of x and of y, and then subtract the corresponding mean from every sample. Here the mean of x is 1.81 and the mean of y is 1.91, so the first sample becomes (0.69, 0.49) after subtracting the means. The mean-subtracted data are:

   x - mean(x)    y - mean(y)
      0.69           0.49
     -1.31          -1.21
      0.39           0.99
      0.09           0.29
      1.29           1.09
      0.49           0.79
      0.19          -0.31
     -0.81          -0.81
     -0.31          -0.31
     -0.71          -1.01

The second step is to compute the covariance matrix of the features. If the data were three-dimensional, the covariance matrix would be

   C = ( cov(x,x)   cov(x,y)   cov(x,z)
         cov(y,x)   cov(y,y)   cov(y,z)
         cov(z,x)   cov(z,y)   cov(z,z) )

Here there are only x and y, and the result is

   cov = ( 0.616555556   0.615444444
           0.615444444   0.716555556 )

The entries on the diagonal are the variances of x and y, and the off-diagonal entries are the covariances. Covariance measures the extent to which two variables vary together. A covariance greater than 0 means that when one of x and y increases, the other tends to increase as well; a covariance less than 0 means that when one increases, the other tends to decrease. If x and y are statistically independent, their covariance is 0; but a covariance of 0 does not imply that x and y are independent. The larger the absolute value of the covariance, the stronger the joint variation of the two variables, and vice versa. Covariance also depends on the units of the variables: if the scale on which the variables are measured changes, their covariance changes accordingly.
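
As a quick illustration of these properties, here is a small MATLAB sketch with made-up numbers (not part of the original example): cov returns the variances on the diagonal and the covariance off the diagonal, and rescaling a variable rescales the covariance.

x = [1 2 3 4 5]';
y = [2 4 5 4 6]';
C = cov(x, y);      % 2x2: variances on the diagonal, covariance off it
C(1, 2)             % positive here: x and y tend to increase together
cov(100*x, y)       % rescaling x by 100 also scales the covariance by 100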

The third step is to find the eigenvalues and eigenvectors of the covariance matrix:

   eigenvalues  = ( 0.0490833989   1.28402771 )

   eigenvectors = ( -0.735178656   -0.677873399
                     0.677873399   -0.735178656 )

The two eigenvalues are shown above, and the corresponding eigenvectors are the columns below them. For example, the eigenvector corresponding to the eigenvalue 0.0490833989 is (-0.735178656, 0.677873399)^T. The eigenvectors here have been normalized to unit vectors.

The fourth step is to sort the eigenvalues in descending order, select the largest k of them, and use the corresponding k eigenvectors as column vectors to form the eigenvector matrix.

Here there are only two eigenvalues. We choose the larger one, 1.28402771, whose corresponding eigenvector is (-0.677873399, -0.735178656)^T.

The fifth step is to project the sample points onto the selected eigenvectors. Suppose the number of samples is m and the number of features is n. Then the mean-subtracted sample matrix is DataAdjust (m×n), the covariance matrix is n×n, and the matrix of the k selected eigenvectors is EigenVectors (n×k). The projected data FinalData is

FinalData (10×1) = DataAdjust (10×2 matrix) × eigenvector (-0.677873399, -0.735178656)^T

The result is:

   FinalData
   -0.827970186
    1.77758033
   -0.992197494
   -0.274210416
   -1.67580142
   -0.912949103
    0.0991094375
    1.14457216
    0.438046137
    1.22382056

In this way, the original n-dimensional features of the samples are transformed into k dimensions, and these k dimensions are the projections of the original features onto the k chosen directions.
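
The five steps above can be reproduced in a few lines of MATLAB. This is only a minimal sketch for the example data (the variable names are my own, and eig may return the eigenvector with the opposite sign, which is equally valid):

% The 10 samples (rows) with two features x and y, as in the table above
data = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0; ...
        2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];

dataAdjust = data - mean(data);        % step 1: subtract the mean of each feature
C = cov(dataAdjust);                   % step 2: covariance matrix (2x2)
[V, D] = eig(C);                       % step 3: eigenvalues and eigenvectors
[~, order] = sort(diag(D), 'descend'); % step 4: sort eigenvalues, keep largest k
k = 1;
eigenVectors = V(:, order(1:k));       % 2x1, equals +/-(-0.6779, -0.7352)'
finalData = dataAdjust * eigenVectors  % step 5: project onto the chosen direction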

The data above can be interpreted as fusing the "learn" and "study" features into a single new "LS" feature, which essentially represents both. The whole process is illustrated in Figure 2 below:

[Figure 2]

The plus signs represent the mean-centered sample points, and the two oblique lines are the two orthogonal eigenvector directions (since the covariance matrix is symmetric, its eigenvectors are orthogonal). The matrix multiplication in the last step projects the original sample points onto the axes defined by the eigenvectors.

The whole PCA procedure looks extremely simple: find the eigenvalues and eigenvectors of the covariance matrix, then transform the data. But why are the eigenvectors of the covariance matrix the ideal k directions? What is the meaning hidden behind this? What is the significance of PCA as a whole?

III. PCA derivation

First look at the following figure:

[Figure 3]

In the first part we used student scores as an example. There the data points are six-dimensional, i.e., each observation is a point in a six-dimensional space, and we want to represent it with fewer dimensions.

Suppose for the moment there are only two dimensions, that is, two variables, given by the horizontal and the vertical coordinate, so each observation corresponds to a point in the plane. If the data form a roughly elliptical cloud, the ellipse has a major axis and a minor axis. In the direction of the minor axis the data vary very little; in the extreme case where the minor axis degenerates to a point, only the direction of the major axis can explain the variation of the points, and the reduction from two dimensions to one happens naturally.

In the figure above, u1 is the direction of the principal component, and u2 is the direction in the plane orthogonal to u1. The n data points have their greatest spread (maximum variance) along the u1 axis, so the projection onto u1 represents most of the information in the original data; even if u2 is discarded, little information is lost. Moreover, u1 and u2 are uncorrelated. Keeping only u1 reduces two dimensions to one.

The greater the difference between the major and minor axes of the ellipse, the more worthwhile the dimensionality reduction.

1. Maximum variance theory: in signal processing it is assumed that the signal has large variance while the noise has small variance. The signal-to-noise ratio is the ratio of the signal variance to the noise variance, and the larger it is the better. In the figure above, the variance of the projections of the samples onto u1 is large and onto u2 is small, so the projection onto u2 can be regarded as being caused by noise.

Therefore, we consider the best k-dimensional features to be those for which, after mapping the n-dimensional sample points to k dimensions, the sample variance along each of the k dimensions is as large as possible.

For example, suppose we project the five points shown in the figure below onto one dimension, represented here by a straight line through the origin (the data have already been centered):

[Figure 4]

Suppose we choose two different lines for the projection; which one is better? According to the variance maximization criterion above, the line on the left is better, because the variance of the projected sample points is larger there (equivalently, the sum of the absolute values of the projections is larger).

The method for calculating the projection is shown in Figure 5 below:

[Figure 5]

In this figure the red dot represents a sample x and the blue dot represents its projection onto u, where u is the direction vector of the line and is a unit vector. The distance of the blue projection point from the origin is ⟨x, u⟩, that is, x^T u (or u^T x).
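
This projection formula is easy to check in MATLAB; the following is a tiny sketch with made-up values for x and u (both are my own illustrative choices):

x = [3; 2];                 % a sample point
u = [1; 1] / norm([1; 1]);  % unit direction vector of the line
d = x' * u                  % signed distance of the projection from the origin
p = d * u                   % the projection point itself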

2. The least squares method: we use the least squares method to determine the direction of each principal axis (each principal component).

Given a set of data (in what follows, vectors are column vectors):

$$x_1, x_2, \ldots, x_m$$

its data center (mean) is located at:

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$$

Data centering (moving the coordinate origin to the center of the sample points):

$$x_i \leftarrow x_i - \bar{x}, \qquad i = 1, 2, \ldots, m$$

In other words, we seek the direction u1 along which the sum of the absolute values of the projections is largest (equivalently, the variance is largest). The projection is computed as described above, namely as the inner product of x_i and u1. Since only the direction of u1 matters, we also take u1 to be a unit vector.

In this case, maximize the following formula:

$$\max_{u_1} \sum_{i=1}^{m} \left| x_i^T u_1 \right|$$

For convenience in matrix algebra, we square the absolute-value terms and maximize the following instead:

$$\max_{u_1} \sum_{i=1}^{m} \left( x_i^T u_1 \right)^2$$

The inner product of two vectors can be transformed into matrix multiplication:

$$\langle x_i, u_1 \rangle = x_i^T u_1$$

Therefore, the objective function can be expressed as:

$$J(u_1) = \sum_{i=1}^{m} \left( x_i^T u_1 \right)^2$$

The matrix product in parentheses is a vector inner product. Since the transpose of a column vector is a row vector, a row vector times a column vector is a scalar, and the transpose of a scalar is itself, the objective function can be rewritten as:

$$J(u_1) = \sum_{i=1}^{m} \left( u_1^T x_i \right)\left( x_i^T u_1 \right)$$

Regrouping the terms in parentheses:

$$J(u_1) = \sum_{i=1}^{m} u_1^T \left( x_i x_i^T \right) u_1$$

Since u1 does not depend on i, it can be moved outside the summation, and the formula simplifies to:

$$J(u_1) = u_1^T \left( \sum_{i=1}^{m} x_i x_i^T \right) u_1$$

Readers who have studied matrix algebra may notice that the sum in the parentheses is a matrix multiplied by its own transpose. Define the matrix

$$X = \left[\, x_1 \;\; x_2 \;\; \cdots \;\; x_m \,\right],$$

whose i-th column is x_i.

Then:

$$\sum_{i=1}^{m} x_i x_i^T = X X^T$$

So the objective function is finally reduced to:

$$\max_{u_1} \; u_1^T X X^T u_1 \qquad \text{subject to} \qquad \|u_1\|_2 = 1$$

Here $u_1^T X X^T u_1$ is a quadratic form of the matrix $X X^T$.

Suppose $\lambda$ is an eigenvalue of $X X^T$ and $\xi$ the corresponding eigenvector; then

$$\lambda \|\xi\|_2^2 = \xi^T X X^T \xi = \left( X^T \xi \right)^T \left( X^T \xi \right) = \left\| X^T \xi \right\|_2^2 \ge 0 \;\;\Longrightarrow\;\; \lambda \ge 0.$$

So $X X^T$ is a positive semidefinite symmetric matrix, i.e., $u_1^T X X^T u_1$ is the quadratic form of a positive semidefinite matrix, and by matrix algebra the objective function attains a maximum.

We now solve for this maximum, and for the direction of u1 at which it is attained.

To solve the first problem, note that the square of the 2-norm of a vector x is:

$$\|x\|_2^2 = x^T x$$

Similarly, the objective function can be expressed as the squared 2-norm of the mapped vector:

$$u_1^T X X^T u_1 = \left( X^T u_1 \right)^T \left( X^T u_1 \right) = \left\| X^T u_1 \right\|_2^2$$

Having converted the quadratic form into a norm, and since u1 is a unit vector, maximizing the objective function becomes the question: when a matrix transforms a vector, how large can the ratio of the vector's length after the transformation to its length before be? According to a theorem of matrix algebra, the maximum of this ratio is the largest singular value of the matrix, namely:

$$\max_{x \neq 0} \frac{\|A x\|_2}{\|x\|_2} = \sigma_{\max}(A)$$

where $\sigma_{\max}(A)$ is the largest singular value of the matrix A (which is also the 2-norm of A), equal to the square root of the largest eigenvalue of $A^T A$ (or of $A A^T$).
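
This relationship is easy to verify numerically; the sketch below uses an arbitrary small matrix of my own choosing, not one from the text:

A  = [2 1 0; 1 3 1];            % arbitrary example matrix
n1 = norm(A);                   % matrix 2-norm
n2 = max(svd(A));               % largest singular value
n3 = sqrt(max(eig(A' * A)));    % sqrt of the largest eigenvalue of A'*A
[n1 n2 n3]                      % all three values coincide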

For our problem, $X X^T$ is a positive semidefinite symmetric matrix, which means that its eigenvalues are all greater than or equal to 0 and that eigenvectors corresponding to different eigenvalues are orthogonal, forming an orthonormal basis of the space.

Now for the second problem. In the general case, let the n eigenvalues of the symmetric matrix $X X^T$ be

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0,$$

with the corresponding unit eigenvectors

$$\xi_1, \xi_2, \ldots, \xi_n.$$

Any vector x can be expanded in the basis formed by these eigenvectors as:

$$x = \sum_{i=1}^{n} \alpha_i \xi_i.$$

Then:

$$X X^T x = X X^T \sum_{i=1}^{n} \alpha_i \xi_i = \sum_{i=1}^{n} \alpha_i \lambda_i \xi_i$$

So:

$$x^T X X^T x = \sum_{i=1}^{n} \lambda_i \alpha_i^2 \;\le\; \lambda_1 \sum_{i=1}^{n} \alpha_i^2 = \lambda_1 \|x\|_2^2$$

So for the second problem, the maximum of the objective function is attained when u1 is taken along the eigenvector corresponding to the largest eigenvalue of $X X^T$; that eigenvector gives the direction of the first principal component u1. (The direction of the second principal component is the eigenvector corresponding to the second largest eigenvalue, and so on.)

That’s it.
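
This conclusion can also be checked numerically. The sketch below (random data and variable names of my own) compares the top eigenvector of X X^T with a brute-force search over unit directions in the plane; the two agree up to sign:

rng(0);
X = randn(2, 100);                   % columns are 2-D samples
X = X - mean(X, 2);                  % centre the data
S = X * X';                          % the matrix from the derivation
[V, D] = eig(S);
[~, idx] = max(diag(D));
u1 = V(:, idx);                      % eigenvector of the largest eigenvalue
theta = linspace(0, pi, 1000);       % brute-force search over unit directions
vals  = arrayfun(@(t) [cos(t) sin(t)] * S * [cos(t); sin(t)], theta);
[~, best] = max(vals);
uBrute = [cos(theta(best)); sin(theta(best))];
[u1 uBrute]                          % same direction, possibly opposite sign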

The proportion of the total information captured by the selected principal components can be calculated with the following formula:

$$\eta = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{n} \sigma_i^2}$$

Here the denominator is the sum of the squares of all singular values, and the numerator is the sum of the squares of the k largest singular values that were selected.

A common rule of thumb is to choose the principal axes so that they account for roughly 85% of the total. This is only a general guideline, however; the number of principal axes to keep depends on the actual situation.
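
For the example in Section II, this ratio can be computed from the singular values of the centred data matrix. The following minimal sketch (my own code, reusing the example data) shows that a single principal component already retains about 96% of the variance:

data = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0; ...
        2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
s = svd(data - mean(data));          % singular values, largest first
k = 1;
ratio = sum(s(1:k).^2) / sum(s.^2)   % about 0.963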

function varargout = MainForm(varargin)
% MAINFORM MATLAB code for MainForm.fig
%      MAINFORM, by itself, creates a new MAINFORM or raises the existing
%      singleton*.
%
%      H = MAINFORM returns the handle to a new MAINFORM or the handle to
%      the existing singleton*.
%
%      MAINFORM('CALLBACK',hObject,eventData,handles,...) calls the local
%      function named CALLBACK in MAINFORM.M with the given input arguments.
%
%      MAINFORM('Property','Value',...) creates a new MAINFORM or raises the
%      existing singleton*.  Starting from the left, property value pairs are
%      applied to the GUI before MainForm_OpeningFcn gets called.  An
%      unrecognized property name or invalid value makes property application
%      stop.  All inputs are passed to MainForm_OpeningFcn via varargin.
%
%      *See GUI Options on GUIDE's Tools menu.  Choose "GUI allows only one
%      instance to run (singleton)".
%
% See also: GUIDE, GUIDATA, GUIHANDLES

% Edit the above text to modify the response to help MainForm

% Last Modified by GUIDE v2.5 17-Mar-2014 21:27:08

% Begin initialization code - DO NOT EDIT
gui_Singleton = 1;
gui_State = struct('gui_Name',       mfilename, ...
    'gui_Singleton',  gui_Singleton, ...
    'gui_OpeningFcn', @MainForm_OpeningFcn, ...
    'gui_OutputFcn',  @MainForm_OutputFcn, ...
    'gui_LayoutFcn',  [] , ...
    'gui_Callback',   []);
if nargin && ischar(varargin{1})
    gui_State.gui_Callback = str2func(varargin{1});
end

if nargout
    [varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
    gui_mainfcn(gui_State, varargin{:});
end
% End initialization code - DO NOT EDIT


% --- Executes just before MainForm is made visible.
function MainForm_OpeningFcn(hObject, eventdata, handles, varargin)
% This function has no output args, see OutputFcn.
% hObject    handle to figure
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)
% varargin   command line arguments to MainForm (see VARARGIN)

% Choose default command line output for MainForm
handles.output = hObject;
clc;
set(handles.axes1, 'XTick', [], 'YTick', [], ...
    'XTickLabel', '', 'YTickLabel', '', 'Color', [0.7020 0.7804 1.0000], 'Box', 'On',...
    'xlim', [-1 1], 'ylim', [-1 1]);
set(handles.axes2, 'XTick', [], 'YTick', [], ...
    'XTickLabel', '', 'YTickLabel', '', 'Color', [0.7020 0.7804 1.0000], 'Box', 'On',...
    'xlim', [-1 1], 'ylim', [-1 1]);
set(handles.axes3, 'XTick', [], 'YTick', [], ...
    'XTickLabel', '', 'YTickLabel', '', 'Color', [0.7020 0.7804 1.0000], 'Box', 'On',...
    'xlim', [-1 1], 'ylim', [-1 1]);
handles.Ims = 0;
handles.c = 0;
handles.Im = 0;
handles.f = 0;
handles.Img = 0;
% Update handles structure
guidata(hObject, handles);

% UIWAIT makes MainForm wait for user response (see UIRESUME)
% uiwait(handles.figure1);


% --- Outputs from this function are returned to the command line.
function varargout = MainForm_OutputFcn(hObject, eventdata, handles)
% varargout  cell array for returning output args (see VARARGOUT);
% hObject    handle to figure
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)

% Get default command line output from handles structure
varargout{1} = handles.output;



function edit1_Callback(hObject, eventdata, handles)
% hObject    handle to edit1 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)

% Hints: get(hObject,'String') returns contents of edit1 as text
%        str2double(get(hObject,'String')) returns contents of edit1 as a double


% --- Executes during object creation, after setting all properties.
function edit1_CreateFcn(hObject, eventdata, handles)
% hObject    handle to edit1 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    empty - handles not created until after all CreateFcns called

% Hint: edit controls usually have a white background on Windows.
%       See ISPC and COMPUTER.
if ispc && isequal(get(hObject,'BackgroundColor'), get(0,'defaultUicontrolBackgroundColor'))
    set(hObject,'BackgroundColor','white');
end


% --- Executes on button press in pushbutton1.
function pushbutton1_Callback(hObject, eventdata, handles)
% hObject    handle to pushbutton1 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)
filePath = OpenImageFile();
if filePath == 0
    return;
end
Img = imread(filePath);
if ndims(Img) == 3
    Img = rgb2gray(Img);
end
sz = size(Img);
sz0 = [112 92];
if ~isequal(sz, sz0)
    Img = imresize(Img, sz0, 'bilinear');
end
% wh = 600;
% if sz(1) > wh
%     rate = wh/sz(1);
%     Img = imresize(Img, rate, 'bilinear');
% end
% Display the loaded image
imshow(Img, [], 'Parent', handles.axes1);
handles.Img = Img;
handles.sz = size(Img);
guidata(hObject, handles);


% --- Executes on button press in pushbutton2.
function pushbutton2_Callback(hObject, eventdata, handles)
% hObject    handle to pushbutton2 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)
if isequal(handles.Img, 0)
    return;
end
% Obtain the feature vector of the face image and keep the first 90%
% of its entries as the reduced representation
f = GetFaceVector(handles.Img);
f = f(1:round(length(f)*0.9));
handles.f = f;
guidata(hObject, handles);
msgbox('Dimensionality reduction succeeded!', 'Message');


% --- Executes on button press in pushbutton3.
function pushbutton3_Callback(hObject, eventdata, handles)
% hObject    handle to pushbutton3 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)
if isequal(handles.f, 0)
    return;
end
% Encode the feature vector as a QR code image
Im = QrGen(handles.f);
% Display the QR code
imshow(Im, [], 'Parent', handles.axes2);
handles.Im = Im;
guidata(hObject, handles);


% --- Executes on button press in pushbutton4.
function pushbutton4_Callback(hObject, eventdata, handles)
% hObject    handle to pushbutton4 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)
if isequal(handles.Im, 0)
    return;
end
% Decode the QR code image back into the stored data
c = QrDen(handles.Im);
set(handles.edit1, 'String', c);
handles.c = c;
guidata(hObject, handles);


% --- Executes on button press in pushbutton5.
function pushbutton5_Callback(hObject, eventdata, handles)
% hObject    handle to pushbutton5 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)
if isequal(handles.c, 0)
    return;
end
% Recover the face image from the decoded data
Ims = FaceRec(handles.c, handles.sz);
% Display the result
imshow(Ims, [], 'Parent', handles.axes3);
handles.Ims = Ims;
guidata(hObject, handles);


% --- Executes on button press in pushbutton6.
function pushbutton6_Callback(hObject, eventdata, handles)
% hObject    handle to pushbutton6 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)
% Clean up: reset the axes and the text box
axes(handles.axes1); cla reset;
axes(handles.axes2); cla reset;
axes(handles.axes3); cla reset;
set(handles.axes1, 'XTick', [], 'YTick', [], ...
    'XTickLabel', '', 'YTickLabel', '', 'Color', [0.7020 0.7804 1.0000], 'Box', 'On');
set(handles.axes2, 'XTick', [], 'YTick', [], ...
    'XTickLabel', '', 'YTickLabel', '', 'Color', [0.7020 0.7804 1.0000], 'Box', 'On');
set(handles.axes3, 'XTick', [], 'YTick', [], ...
    'XTickLabel', '', 'YTickLabel', '', 'Color', [0.7020 0.7804 1.0000], 'Box', 'On');
set(handles.edit1, 'String', '');


% --- Executes on button press in pushbutton7.
function pushbutton7_Callback(hObject, eventdata, handles)
% hObject    handle to pushbutton7 (see GCBO)
% eventdata  reserved - to be defined in a future version of MATLAB
% handles    structure with handles and user data (see GUIDATA)
choice = questdlg('Are you sure you want to exit the system?', ...
    'Exit', ...
    'OK', 'Cancel', 'Cancel');
switch choice
    case 'OK'
        close;
    case 'Cancel'
        return;
end

For the references and the full code, please send a private message to the blogger.