Original link: tecdat.cn/?p=6592

Original source: Tuo Duan Data Tribe (official WeChat account)

 

There are two main use cases for dimensionality reduction: data exploration and machine learning. It is useful for data exploration because reducing the dimension to a few dimensions (for example, 2 or 3 dimensions) allows visualization of samples. This visualization can then be used to gain insights from the data (for example, detect clustering and identify outliers). For machine learning, dimensionality reduction is useful because the model generally generalizes better when fewer features are used in the fitting process.

In this article, we will investigate dimensionality reduction techniques:

  • Principal component analysis (PCA): the most popular dimensionality reduction method
  • Kernel PCA: a variant of PCA that allows for nonlinearity
  • t-SNE (t-distributed stochastic neighbor embedding): a nonlinear dimensionality reduction technique

A key difference between these methods is that PCA outputs a rotation matrix, which can be applied to any other matrix with the same features in order to transform the data.

Load data set

We can load the dataset in the following way:

 
# f holds the raw CSV text of the whisky dataset (loaded earlier; not shown here)
df <- read.csv(textConnection(f), header = TRUE)
# select the flavor variables
features <- c("Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
              "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral")
feat.df <- df[, c("Distillery", features)]

Assumptions about the outcome

Before we start reducing the dimensionality of the data, we should form some expectations about the results.

Since neighbouring distilleries use similar distilling techniques and resources, their whiskies should have similarities. To test this hypothesis, we will check whether the mean expression of the whisky features differs between distilleries from different regions. To do this, we perform a MANOVA test:
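A minimal sketch of how the MANOVA could be carried out (assuming the flavor features are in feat.df and the region labels in df$Region):

# MANOVA: do the mean flavor profiles differ between regions?
whisky.manova <- manova(as.matrix(feat.df[, features]) ~ df$Region)
summary(whisky.manova)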

## (MANOVA summary table showing a significant effect of region)
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test statistic is significant at the 5% level, so we can reject the null hypothesis (regions have no effect on features).

Location of the distilleries

Since region plays an important role for whisky, we will explore the locations of the distilleries in the dataset by mapping their latitude and longitude. The dataset covers the following Scotch whisky regions: Campbeltown, Highlands, Islands, Islay, Lowlands, and Speyside.
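A sketch of how the distillery locations could be mapped with ggplot2 (this assumes df also contains Latitude, Longitude, and Region columns, which are not shown in the loading snippet above):

library(ggplot2)
# plot the distillery locations, colored by region
ggplot(df, aes(x = Longitude, y = Latitude, color = Region)) +
  geom_point(size = 2) +
  geom_text(aes(label = Distillery), size = 2, vjust = -1, check_overlap = TRUE) +
  labs(title = "Scotch whisky distilleries by region")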


PCA

Using PCA to visualize the whisky dataset:
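A minimal sketch of the PCA using base R's prcomp (the original plotting code is not shown; the region colors and distillery labels are assumptions about how the figures were produced):

# run PCA on the flavor features
pca <- prcomp(feat.df[, features], center = TRUE, scale. = FALSE)
# first figure: scores on the first two principal components, colored by region
plot(pca$x[, 1], pca$x[, 2], col = as.integer(factor(df$Region)),
     pch = 19, xlab = "PC1", ylab = "PC2")
# second figure: add the distillery labels to interpret the clusters
text(pca$x[, 1], pca$x[, 2], labels = feat.df$Distillery, pos = 3, cex = 0.6)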

 

 

In the second figure, we plot the labels of the distilleries so that we can interpret the clusters in more detail.

 

Overall, the principal components seem to reflect the following characteristics:

  • PC1 indicates the intensity of flavor: smoky and medicinal (e.g. Laphroaig or Lagavulin) versus mild (e.g. Auchentoshan or Aberlour)
  • PC2 represents flavor complexity: smooth, well-balanced flavor profiles (e.g. Glenfiddich or Auchentoshan) versus more distinctive flavor profiles (e.g. GlenDronach or Macallan)
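The cluster-versus-region table below could be produced along these lines (a minimal sketch assuming k-means with three clusters on the first two principal-component scores; the original clustering code is not shown):

set.seed(1)
# cluster the distilleries in PCA space and cross-tabulate clusters against regions
km <- kmeans(pca$x[, 1:2], centers = 3)
table(Cluster = km$cluster, Region = df$Region)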

 

##   Cluster Campbeltown Highlands Islands Islay Lowlands Speyside
## 1       1           2        17       2     2        0       19
## 2       2           0         8       2     1        3       22
## 3       3           0         2       2     4        0        0

A reasonable interpretation of the clusters is as follows:

  • Cluster 1: complex whiskies, mainly from the Highlands/Speyside
  • Cluster 2: well-balanced whiskies, mainly from Speyside and the Highlands
  • Cluster 3: smoky whiskies, mainly from Islay

The visualization reveals two interesting observations:

  • Oban and Clynelish are the only Highland distilleries whose flavors resemble those of the Islay distilleries.
  • Highland and Speyside whiskies differ in one major respect: at one extreme are smooth, balanced whiskies such as Glenfiddich; at the other end of the spectrum are whiskies with more distinctive flavors, such as Macallan.

This concludes our visual exploration using PCA. We will investigate the use of PCA for prediction at the end of this article.

Kernel PCA

Kernel PCA (KPCA) is an extension of PCA that uses kernel functions, which are well known from support vector machines. By mapping the data into a reproducing kernel Hilbert space, it is possible to separate data even when they are not linearly separable.

 

Use KPCA in R

To perform KPCA, we use the kpca function from the kernlab package.

Using this kernel, we can reduce the dimensionality as follows:
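The kernel definition itself is not shown in the text; the sketch below assumes a Gaussian (RBF) kernel with a hypothetical bandwidth sigma:

library(kernlab)
# kernel PCA with an RBF kernel, keeping two nonlinear components
k.pca <- kpca(~., data = feat.df[, features], kernel = "rbfdot",
              kpar = list(sigma = 0.1), features = 2)
k.scores <- rotated(k.pca)   # projections of the samples onto the kernel principal components
plot(k.scores[, 1], k.scores[, 2], col = as.integer(factor(df$Region)),
     pch = 19, xlab = "KPC1", ylab = "KPC2")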

 

Having retrieved the new dimensions, we can now visualize the data in the transformed space:

 

In terms of visualization, the results are slightly coarser than what we obtained with conventional PCA. Nevertheless, the Islay whiskies are well separated and we can see a cluster of Speyside whiskies, while the Highland whiskies are spread more widely.

 

t-SNE

t-SNE has become a very popular method for visualizing data.

 

Visualizing the data using t-SNE

Here, we reduce the dimension of the whisky data set to two dimensions:
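A sketch using the Rtsne package (the perplexity value is a hypothetical choice; it has to be small because the dataset contains fewer than 100 distilleries):

library(Rtsne)
set.seed(1234)   # t-SNE is stochastic, so fix the seed for reproducibility
tsne <- Rtsne(as.matrix(feat.df[, features]), dims = 2,
              perplexity = 10, check_duplicates = FALSE)
plot(tsne$Y[, 1], tsne$Y[, 2], col = as.integer(factor(df$Region)),
     pch = 19, xlab = "V1", ylab = "V2")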

Compared with PCA, the separation of clusters is clearer, especially for clusters 1 and 2.

For the t-SNE embedding, we can interpret the dimensions as follows:

  • V1 represents flavor complexity. The outliers here are the smoky Islay whiskies on the right (e.g. Lagavulin) and the complex Highland whiskies on the left (e.g. Macallan).
  • V2 represents smoky/medicinal flavor.

Supervised learning using PCA

It is crucial that the PCA is performed on the training data only, independently of the test data. Therefore, the following approach is needed:

  1. Perform PCA on the training data set and train the model on the transformed data.
  2. Apply the PCA transformation learned from the training data to the test data set and evaluate the model's performance on the transformed data.

For this we will use a k-nearest-neighbor (KNN) model. In addition, since all variables lie in the same range [0, 4], no further scaling of the features is required. We have to optimize k, so we also reserve a validation set to determine this parameter.

PCA transformation

First, we define a helper function to evaluate prediction accuracy.

# classification accuracy: fraction of predictions that match the true labels
get.accuracy <- function(preds, labels) {
    correct.idx <- which(preds == labels)
    accuracy <- length(correct.idx) / length(labels)
    return(accuracy)
}

In the following code, we perform PCA on the training data and study the explained variance in order to select an appropriate number of dimensions:
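A sketch of how the cumulative explained variance could be computed (samp.train is an assumed index vector from an earlier random split, and data is assumed to be the numeric feature matrix):

pca.train <- prcomp(data[samp.train, ], center = TRUE, scale. = FALSE)
explained <- pca.train$sdev^2 / sum(pca.train$sdev^2)
# cumulative percentage of variance explained per number of dimensions
rbind(N_dim   = seq_along(explained),
      Cum_Var = round(cumsum(explained) * 100))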

##         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## N_dim      1    2    3    4    5    6    7    8    9    10    11    12
## Cum_Var   22   41   52   63   72   79   85   90   94    97    99   100

Since a sufficient percentage of the variance is explained by 3 dimensions, we will use this value to set up the training, validation, and test data sets.
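A sketch of the projection onto the first three principal components (samp.val and samp.test are assumed index vectors for the validation and test samples):

# project training, validation, and test samples into 3-dimensional PCA space
n.dim     <- 3
train.pca <- pca.train$x[, 1:n.dim]
val.pca   <- predict(pca.train, newdata = data[samp.val, ])[, 1:n.dim]
test.pca  <- predict(pca.train, newdata = data[samp.test, ])[, 1:n.dim]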

Now that we have transformed the training, validation, and test sets into the PCA space, we can apply k-nearest neighbors.
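A sketch of the KNN classification in PCA space (k = 9 is the value that would have been chosen on the validation set; the region labels are assumed to be stored in df$Region):

library(class)   # provides knn()
preds.pca <- knn(train = train.pca, test = test.pca,
                 cl = df$Region[samp.train], k = 9)
print(paste0("PCA+KNN accuracy for k = 9 is: ",
             round(get.accuracy(preds.pca, df$Region[samp.test]), 3)))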

## [1] "PCA+KNN accuracy for k = 9 is: 0.571"

Let’s examine whether a model using PCA is better than a model based on raw data:
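A corresponding sketch for KNN on the raw, untransformed features:

preds.raw <- knn(train = data[samp.train, ], test = data[samp.test, ],
                 cl = df$Region[samp.train], k = 7)
print(paste0("KNN accuracy for k = 7 is: ",
             round(get.accuracy(preds.raw, df$Region[samp.test]), 3)))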

## [1] "KNN accuracy for k = 7 is: 0.524"

 

# Variance of the whisky features
print(diag(var(data)))
##      Body     Smoky Medicinal   Tobacco     Honey     Spicy
## 0.8656635 0.5145007 0.7458276 0.9801642 0.1039672 0.7279070
## (variances of the remaining features omitted)

So far we have tried to identify all six Scotch whisky regions based on taste alone; the question is whether we can get better performance. We know that regions that are under-represented in the data set are hard to predict. So what happens if we restrict ourselves to fewer regions?

  • Island whiskies are merged with Islay whiskies
  • Lowland/Campbeltown whiskies are merged with Highland whiskies

In this way, the problem is reduced to three regions: Islay/Islands, Highlands/Lowlands/Campbeltown, and Speyside. We run the analysis again:
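A sketch of the regrouping (assuming the six original region labels are stored in df$Region):

region3 <- as.character(df$Region)
region3[region3 %in% c("Islands", "Islay")] <- "Islay/Islands"
region3[region3 %in% c("Highlands", "Lowlands", "Campbeltown")] <- "Highlands/Lowlands/Campbeltown"
region3 <- factor(region3)
# the PCA + KNN pipeline is then rerun with region3 as the target labels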

## [1] "PCA+KNN accuracy for k = 13 is: 0.619"

With an accuracy of 61.9%, we can conclude that grouping the whisky regions in this way was indeed worthwhile for our small sample.

Using KPCA for supervised learning

Forecasting with KPCA is not as straightforward as with PCA. In PCA, the eigenvectors are computed in the input space, but in KPCA the eigenvectors live in the reproducing kernel Hilbert space. Therefore, we cannot simply transform new data points, since the explicit mapping function ϕ that is being used is unknown.
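One way to arrive at the numbers referred to in the code below is to run KPCA on the full dataset and then split the projected scores into training and test parts; this leaks information from the test set, which is why the accuracy is overestimated. A sketch (not necessarily the author's exact code; the kernel parameters and k are hypothetical):

# KPCA on all samples (training and test together), then KNN on the projected scores
k.pca.all  <- kpca(~., data = feat.df[, features], kernel = "rbfdot",
                   kpar = list(sigma = 0.1), features = 3)
scores     <- rotated(k.pca.all)
preds.kpca <- knn(train = scores[samp.train, ], test = scores[samp.test, ],
                  cl = df$Region[samp.train], k = 9)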

 

# Note: this overestimates the actual performance
accuracy <- get.accuracy(preds.kpca, df$Region[samp.test])

 

Summary

We have seen how PCA, KPCA, and t-SNE can be used to reduce the dimensionality of a data set. PCA is a method that is suitable both for visualization and for supervised learning. KPCA is a nonlinear dimensionality reduction technique. t-SNE is a more recent nonlinear method that excels at visualizing data, but lacks the interpretability and robustness of PCA.

In the low-dimensional plots there are large regions of the space without any samples. This could indicate one of two things:

  1. There is still a lot of potential for trying out new whiskies.
  2. Only a limited number of flavor combinations are possible and work well together.

I prefer the second option. Why? In the PCA plot, the lower-right corner is the largest region without samples. Looking at the whiskies closest to this region, we see Macallan along the y axis and Lagavulin along the x axis. Macallan is known for its complex flavors and Lagavulin for its smoky flavor.

A whisky located in the lower right of the two-dimensional PCA space would be both: complex and smoky. My guess is that such a whisky would simply be too much for the palate.

 

 

 

Thank you very much for reading this article, please leave a comment below if you have any questions!

 

