Original link: tecdat.cn/?p=8491

 

Principal component analysis (PCA) is a dimensionality reduction technique that transforms a large number of correlated variables into a small set of uncorrelated variables called principal components. For example, PCA can be used to transform 30 correlated (and probably redundant) environmental variables into five uncorrelated component variables while preserving as much of the information in the original data set as possible.

 

A principal component analysis model: observed variables (X1 to X5) map onto principal components (PC1, PC2).

The general steps of PCA analysis are as follows:

  1. Data preprocessing. PCA derives its results from the correlations between variables. Either the raw data matrix or the correlation matrix can be passed to the principal() and fa() functions; make sure there are no missing values in the data before the calculation (see the sketch after this list).
  2. Determine the number of principal components to retain (factor analysis is not involved here).
  3. Extract the principal components (rotation is not involved here).
  4. Interpret the results.
  5. Calculate the principal component scores.
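
As a minimal sketch of step 1, assuming the psych package and the built-in USJudgeRatings data set used later in this post:

```r
# Load psych, which provides the principal(), fa() and fa.parallel() functions
library(psych)

# USJudgeRatings ships with base R: 43 judges rated on 12 variables
data(USJudgeRatings)

# principal() accepts either the raw data or a correlation matrix;
# with raw data, confirm there are no missing values first
sum(is.na(USJudgeRatings))   # should be 0
```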

The goal of PCA is to replace a large number of correlated variables with a small set of uncorrelated variables while preserving as much of the information in the original variables as possible. These derived variables, called principal components, are linear combinations of the observed variables. The first principal component can be written as

PC1 = a1X1 + a2X2 + … + akXk

It is a weighted combination of the k observed variables and explains the largest share of the variance in the original set of variables. The second principal component is also a linear combination of the original variables; it explains the second-largest share of the variance and is orthogonal to the first principal component. Each subsequent principal component maximizes the variance it explains while remaining orthogonal to all of the preceding components. Ideally, we want to explain the full set of variables with as few principal components as possible.

The data set USJudgeRatings contains lawyers’ ratings of state judges in the US Superior Court. The data frame contains 43 observations and 12 variables:
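
A quick look at the data (base R only):

```r
dim(USJudgeRatings)      # 43 rows (judges) and 12 columns (rating variables)
head(USJudgeRatings, 3)  # CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
```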

The question then arises: is it possible to summarize the 12-variable assessment information with fewer variables? If so, how many? How do you define them?

First, the number of principal components is determined. Here, the Cattell scree test is used to show the relationship between the eigenvalues and the principal components. The general rule is that a component is retained when its eigenvalue is both greater than 1 and greater than the corresponding eigenvalue from parallel analysis. Let’s go straight to the diagram:
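
A sketch of how such a plot can be produced with fa.parallel() from the psych package (assuming, as in the rest of this post, that all 12 variables including CONT are analyzed):

```r
# Scree plot plus parallel analysis based on 100 simulated data sets
fa.parallel(USJudgeRatings, fa = "pc", n.iter = 100,
            show.legend = FALSE, main = "Scree plot with parallel analysis")
abline(h = 1)   # reference line for the eigenvalue-greater-than-1 criterion
```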

Assessing the number of principal components to retain for the US judge ratings. The scree plot (line with x symbols), the eigenvalue-greater-than-1 criterion (horizontal line), and parallel analysis based on 100 simulations (dashed line) all indicate that retaining one principal component is sufficient.

It can be seen that only the first component (the leftmost point) has an eigenvalue greater than 1 and greater than the parallel-analysis eigenvalue, so retaining one principal component preserves most of the information in the data set. The next step is to extract that principal component with the principal() function.
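
A hedged sketch of that call; the object is named PC1 here only so that it matches the scores call shown further below:

```r
# Extract one unrotated principal component and keep the component scores
PC1 <- principal(USJudgeRatings, nfactors = 1, rotate = "none", scores = TRUE)
PC1   # prints loadings, h2, u2, SS loadings and Proportion Var
```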

It can be seen that the first principal component (PC1) is highly correlated with almost every variable (except CONT); that is, it can serve as a general evaluation dimension. The h2 column gives the communalities: the proportion of each variable’s variance that is explained by the principal component. The u2 column gives the uniquenesses: the proportion of variance that the principal component cannot explain (1 − h2). The SS loadings row contains the eigenvalue associated with the component, that is, the standardized variance accounted for by that component (here, about 10 for the first principal component). Finally, the Proportion Var row indicates how much of the whole data set each principal component explains; here the first principal component explains 84% of the variance of the 12 variables.

PC1$scores

Principal component scores
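
As a quick check of how poorly the single component captures CONT (same assumptions as above):

```r
# Correlation between the raw CONT ratings and the first component's scores
cor(USJudgeRatings$CONT, PC1$scores)
```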

Since the correlation between the variable CONT and PC1 is too low, meaning PC1 cannot represent CONT, we add a second principal component, PC2, to capture CONT. Building on the previous post, the diagram is as follows:
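
A sketch of the corresponding two-component solution and plot (the object name pc2 and the use of psych’s biplot() method are assumptions, not the original author’s code):

```r
# Extract two unrotated components so that CONT gets its own dimension
pc2 <- principal(USJudgeRatings, nfactors = 2, rotate = "none", scores = TRUE)
pc2           # PC1 and PC2 loadings plus the cumulative Proportion Var
biplot(pc2)   # variables drawn as vectors, judges as points in the PC1-PC2 plane
```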

It can be seen that PC1 (84.4%) and PC2 (9.2%) together explain 93.6% of the variance of the 12 variables. In the plot, the 11 variables other than CONT are strongly correlated with PC1, so their vectors lie roughly along the PC1 axis; CONT cannot be represented by PC1 and is therefore roughly perpendicular (orthogonal) to it. PC2 is roughly parallel to CONT, indicating that it essentially represents CONT.