Original link: tecdat.cn/?p=9913

Original source: Tuoduan Data Tribe public account (WeChat)

 


 

Overview and Definition

In this article, we consider some alternatives to ordinary least squares (OLS) for fitting linear models. These alternative methods can sometimes provide better prediction accuracy and better model interpretability.

  • Prediction accuracy: provided the true relationship is approximately linear, the ordinary least squares estimates have low bias. OLS also performs well when n >> p. However, if n is not much larger than p, the fit can have high variance, leading to overfitting and/or poor predictions on new data. And if p > n, there is no longer a unique least squares estimate, so the method cannot be used at all.

This problem is one aspect of the curse of dimensionality. As p grows, a given observation x tends to lie closer to the boundary of the sample space than to its nearest neighbouring observations, which causes major problems for prediction. In addition, for large p the training samples are spread sparsely through the feature space, making it difficult to identify trends and make predictions.

By constraining or shrinking the estimated coefficients, we can often greatly reduce the variance at the cost of a negligible increase in bias, which usually yields a significant improvement in prediction accuracy.

  • Model interpretability: irrelevant variables add unnecessary complexity to the resulting model. By removing them (setting their coefficients to zero), we obtain a model that is easier to interpret. With OLS, however, it is extremely unlikely that any estimated coefficient will be exactly zero.

    • Subset selection: we identify a subset of the p predictors and then fit a model by least squares on that reduced set of variables.

Although we have discussed the application of these techniques to linear models, they are also applicable to other methods, such as classification.

Detailed methods

Subset selection

Best subset selection

Here we fit a separate OLS regression for each possible combination of the p predictors and then examine the resulting fits. The problem with this approach is that the best model is hidden among 2 ^ p possibilities. The algorithm has two stages: (1) for each k = 1, ..., p, fit all models containing exactly k predictors and keep the best one of each size; (2) select a single model from these candidates using an estimate of the prediction error, such as cross-validation. More specific criteria, such as AIC and BIC, are discussed below.

The same idea applies to other types of model, such as logistic regression, although the score used to compare fits changes: for logistic regression we use the deviance instead of RSS and R ^ 2.

 

Choose the best model

Each of the selection algorithms above requires us to decide which of the candidate models is best. As noted earlier, the model containing the most predictors always has the smallest RSS and the largest R ^ 2 when judged by training error. To select the model with the lowest test error instead, we need to estimate the test error. There are two ways to do this.

  1. Indirectly, by adjusting the training error to account for the optimism (overfitting bias) of the training fit, as with Cp, AIC, BIC and adjusted R ^ 2.
  2. Directly, by estimating the test error with a validation set or with cross-validation.

 

Validation and cross validation

Typically, cross-validation gives a more direct estimate of the test error and makes fewer assumptions about the underlying model. It can also be used with a wider range of model types.

 

Ridge regression

Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity: the RSS plus λ times the sum of the squared coefficients. Like OLS, ridge regression seeks estimates that make the RSS small, but the added shrinkage penalty grows with the size of the coefficients, and its effect is to shrink the estimates toward zero. The parameter λ controls the amount of shrinkage; with λ = 0, ridge regression behaves exactly like OLS. Choosing a good value of λ is crucial and should be done with cross-validation. Ridge regression also requires the predictors X to be centered (mean = 0), so the data must be standardized beforehand.
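As a concrete sketch of the idea, ridge regression can be fit with the glmnet package, where alpha = 0 selects the ridge penalty and cv.glmnet() picks λ by cross-validation. This is only an illustration with assumed object names; the worked examples later in this post use caret instead.

library(ISLR)
library(glmnet)

Hitters <- na.omit(Hitters)                        # drop rows with missing Salary
x <- model.matrix(Salary ~ ., Hitters)[, -1]       # predictor matrix, intercept column removed
y <- Hitters$Salary

grid <- 10^seq(10, -2, length.out = 100)           # wide grid of lambda values
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # glmnet standardizes x by default

set.seed(1)
cv.out <- cv.glmnet(x, y, alpha = 0)               # 10-fold CV over lambda
coef(ridge.mod, s = cv.out$lambda.min)             # coefficients are shrunken but not exactly zero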

 

Why is ridge regression better than least squares?

The advantage is evident in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, which lowers the variance at the cost of a small increase in bias. Plain OLS, by contrast, can have high variance even though it is unbiased. The lowest test MSE tends to occur where the decrease in variance and the increase in bias balance, so by tuning λ to trade a little bias for less variance we can reach a lower test MSE.

Ridge regression is therefore most effective when the least squares estimates have high variance. It is also far more computationally efficient than subset selection, since it fits only a single model for each λ, and the fits for all values of λ can be computed essentially at once.

The lasso

Ridge regression has at least one disadvantage: it includes all p predictors in the final model. The penalty pushes many coefficients close to zero, but never exactly to zero. This is usually not a problem for prediction accuracy, but it makes the model harder to interpret. The lasso overcomes this shortcoming: for a sufficiently small value of the shrinkage fraction s (s = 1 corresponds to ordinary OLS, and as s approaches 0 the coefficients shrink toward zero), some coefficients are forced to be exactly zero. The lasso therefore also performs variable selection.
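A matching sketch for the lasso with glmnet (alpha = 1 gives the lasso penalty); again this is only an illustration with assumed names, separate from the caret-based example later in the post.

library(ISLR)
library(glmnet)

Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

set.seed(1)
cv.lasso <- cv.glmnet(x, y, alpha = 1)                        # CV chooses the penalty
lasso.mod <- glmnet(x, y, alpha = 1, lambda = cv.lasso$lambda.min)

lasso.coef <- as.matrix(coef(lasso.mod))[, 1]                 # named vector of coefficients
lasso.coef[lasso.coef != 0]                                   # only the variables the lasso keeps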

 

Dimension reduction method

So far, the methods we have discussed have controlled variance either by using a subset of the original variables or by shrinking their coefficients toward zero. We now explore a class of models that transform the predictors and then fit a least squares model to the transformed variables. Dimension reduction turns the problem of estimating p + 1 coefficients into the simpler problem of estimating M + 1 coefficients, where M < p. Two approaches to this task are principal components regression and partial least squares.

Principal components regression (PCR)

PCA can be described as a method for deriving a low-dimensional set of features from a large number of variables.

In PCR, we construct the first M principal components and then use these components as the predictors in a linear regression fit by least squares. This can often give a better model than ordinary least squares on all p predictors, because using only M components reduces the effects of overfitting.

 

Partial least squares

The PCR approach described above identifies the linear combinations, or directions, of X that best represent the predictors. These directions are found in an unsupervised way: the response is not used to determine them.

PLS is a supervised alternative: it chooses the new features by giving higher weight to the variables that are most strongly related to the dependent variable.

In practice, PLS often performs no better than ridge regression or PCR. That is because, even though PLS can reduce bias, it also tends to increase variance, so the overall trade-off is roughly a wash.

 

Interpreting results in high dimensions

We must always be careful about how we report the results of a model, especially in high-dimensional settings. Multicollinearity is a serious problem here, because any variable in the model can be written as a linear combination of all the other variables, so we can never know exactly which variables (if any) are truly predictive of the outcome.

 

Examples

Subset selection method

Best subset selection

We want to predict a baseball player's Salary on the basis of various statistics from the previous year.

library(ISLR)
attach(Hitters)
names(Hitters)
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"      
##  [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"   
## [11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division" 
## [16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"
dim(Hitters)

## [1] 322  20
str(Hitters)
## 'data.frame':    322 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
sum(is.na(Hitters$Salary)) / length(Hitters[, 1]) * 100

## [1] 18.32

As it turns out, Salary is missing for about 18 percent of the players. We will omit the rows with missing data.

Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263  20

Perform best subset selection, using RSS to quantify each fit.

library(leaps)

regfit <- regsubsets(Salary ~ ., Hitters)
summary(regfit)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept), none forced in or forced out
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## [Selection matrix omitted: each row marks with "*" the variables included in
##  the best model of that size, from 1 up to 8 predictors.]

In the selection matrix, an asterisk indicates that the variable is included in the corresponding model.

##  [1] 0.3215 0.4252 0.4514 0.4754 0.4908 0.5087 0.5141 0.5286 0.5346 0.5405
## [11] 0.5426 0.5436 0.5445 0.5452 0.5455 0.5458 0.5460 0.5461 0.5461

Across these models, R ^ 2 increases monotonically as more variables are added (the 19 values above correspond to models with 1 to 19 variables).

We can plot RSS, adjusted R ^ 2, Cp, AIC and BIC using the built-in plotting functions.

Note: apart from RSS and R ^ 2, the quantities above are all estimates of the test error.
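The code behind the R ^ 2 values and the plots is not shown above; a plausible sketch, assuming the search was refit with nvmax = 19 (which is what the 19 R ^ 2 values suggest):

library(leaps)

regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg.summary <- summary(regfit.full)

reg.summary$rsq                      # the monotonically increasing R^2 values shown above

par(mfrow = c(2, 2))
plot(reg.summary$rss,   xlab = "Number of variables", ylab = "RSS",          type = "l")
plot(reg.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "l")
plot(reg.summary$cp,    xlab = "Number of variables", ylab = "Cp",           type = "l")
plot(reg.summary$bic,   xlab = "Number of variables", ylab = "BIC",          type = "l")

plot(regfit.full, scale = "bic")     # leaps' built-in plot: top row = best model by BIC

which.max(reg.summary$adjr2)         # model size with the largest adjusted R^2
which.min(reg.summary$bic)           # model size with the smallest BIC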

Forward and backward stepwise selection
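The calls that produced the two summaries below are not shown, but the Call line in the first summary indicates they had this form:

regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(regfit.fwd)
summary(regfit.bwd)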

 

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept), none forced in or forced out
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## [Selection matrix omitted: each row marks with "*" the variables included in
##  the forward-selection model of each size, from 1 to 19 predictors.]

## Subset selection object
## 19 Variables (and intercept), none forced in or forced out
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## [Selection matrix omitted: same layout as above, for backward elimination.]

We can see here that the models with 1 to 6 variables are identical for best subset selection and forward selection.

Ridge regression and lasso

Setting up cross-validation

We will also apply validation-set and cross-validation approaches to the regularization methods.

Validation set

Rather than estimating the test error indirectly with adjusted R ^ 2, Cp and BIC, we can estimate it directly with a validation set or cross-validation. We must use the training observations only for every aspect of model fitting and variable selection; the test error is then computed by applying the fitted model to the test (validation) data.
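The split and fitting code is not shown above. A plausible sketch with caret follows; the seed, the roughly 50/50 split via createDataPartition(), and the names training/test are assumptions, so the output below will not be reproduced exactly.

library(caret)

set.seed(123)
inTrain  <- createDataPartition(Hitters$Salary, p = 0.5, list = FALSE)
training <- Hitters[inTrain, ]
test     <- Hitters[-inTrain, ]

ridge <- train(Salary ~ ., data = training,
               method = "ridge",                                 # elasticnet-based ridge in caret
               preProcess = c("center", "scale"),
               tuneGrid = expand.grid(lambda = c(0, 1e-04, 0.1)), # the three lambda values in the output
               trControl = trainControl(method = "boot", number = 25))
ridge

ridge.pred <- predict(ridge, newdata = test)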

## Ridge Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ... 
## 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       400   0.4       40       0.09       
##   1e-04   400   0.4       40       0.09       
##   0.1     300   0.5       40       0.09       
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

mean(ridge.pred - test$Salary)^2
## [1] 30.1

k-fold cross-validation

Use k-fold cross-validation to select the best lambda.

For cross-validation, we split the data into test and training data.
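A sketch of the 10-fold cross-validated ridge fit summarized below, assuming the same training/test split as in the previous sketch:

ridge.cv <- train(Salary ~ ., data = training,
                  method = "ridge",
                  preProcess = c("center", "scale"),
                  tuneGrid = expand.grid(lambda = c(0, 1e-04, 0.1)),
                  trControl = trainControl(method = "cv", number = 10))
ridge.cv

ridge.pred <- predict(ridge.cv, newdata = test)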

## Ridge Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 120, 119, 120, 120, 119, ... 
## 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       300   0.6       70       0.1        
##   1e-04   300   0.6       70       0.1        
##   0.1     300   0.6       70       0.1        
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
predict(ridge.cv$finalModel, type = 'coef', mode = 'norm')$coefficients[19, ]
##      AtBat       Hits      HmRun       Runs        RBI      Walks 
##   -157.221    313.860    -18.996      0.000    -70.392    171.242 
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI 
##    -27.543      0.000      0.000     51.811    202.537    187.933 
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors 
##   -224.951     12.839    -38.595     -9.128     13.288    -18.620 
## NewLeagueN 
##     22.326 
sqrt(mean(ridge.pred - test$Salary)^2)
## [1] 17.53

The fit still leaves a sizeable average error in the predicted salaries. The regression coefficients do not actually reach zero, but that is expected: ridge regression shrinks coefficients without ever setting them exactly to zero, and their scale here reflects the fact that we standardized the data beforehand.

Now we should check whether this is better than a plain lm() model.
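A sketch of the corresponding caret call for ordinary least squares, again assuming the earlier split:

lmfit <- train(Salary ~ ., data = training,
               method = "lm",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 10))
lmfit

lmfit.pred <- predict(lmfit, newdata = test)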

## Linear Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 120, 121, 119, 119, 119, ... 
## 
## Resampling results
## 
##   RMSE  Rsquared  RMSE SD  Rsquared SD
##   300   0.5       70       0.2        
## 
## 
coef(lmfit$finalModel)
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##     535.958    -327.835     591.667      73.964    -169.699    -162.024 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##     234.093     -60.557     125.017    -529.709     -45.888     680.654 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##     393.276    -399.506      19.118     -46.679      -4.898      41.271 
##  NewLeagueN 
## 
sqrt(mean(lmfit.pred - test$Salary)^2)
## [1] 17.62

As we can see, the ridge regression fit has a somewhat lower RMSE and a higher R ^ 2 than the plain linear model.

The lasso
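A sketch of the caret lasso fit summarized below; method = "lasso" tunes the shrinkage fraction s, and the tuning grid here is read off the output while the rest is assumed.

lasso <- train(Salary ~ ., data = training,
               method = "lasso",
               preProcess = c("center", "scale"),
               tuneGrid = expand.grid(fraction = c(0.1, 0.5, 0.9)),
               trControl = trainControl(method = "cv", number = 10))
lasso

# coefficients at the selected fraction, via the underlying elasticnet fit
predict(lasso$finalModel, type = "coefficients", s = 0.5, mode = "fraction")

lasso.pred <- predict(lasso, newdata = test)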

## The lasso 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 121, 120, 120, 120, 119, ... 
## 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0.1       300   0.6       70       0.2        
##   0.5       300   0.6       60       0.2        
##   0.9       300   0.6       70       0.2        
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.5.
## $s
## [1] 0.5
## 
## $fraction
##   0 
## 0.5 
## 
## $mode
## [1] "fraction"
## 
## $coefficients
##      AtBat       Hits      HmRun       Runs        RBI      Walks 
##   -227.113    406.285      0.000    -48.612    -93.740    197.472 
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI 
##    -47.952      0.000      0.000     82.291    274.745    166.617 
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors 
##   -287.549     18.059    -41.697     -7.001     30.768    -26.407 
## NewLeagueN 
##     19.190 
sqrt(mean(lasso.pred - test$Salary)^2)
## [1] 14.35

With the lasso, we see that many coefficients have been forced to exactly zero. Even where its RMSE is no better than that of ridge regression, this sparsity gives it an interpretability advantage over the linear regression model.

PCR and PLS

Principal component regression
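The pcr() call behind the summary below is not shown; a plausible sketch with the pls package on the assumed training split, cross-validating the RMSEP:

library(pls)

set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = training, scale = TRUE, validation = "CV")
summary(pcr.fit)
validationplot(pcr.fit, val.type = "MSEP")     # CV MSE is essentially minimized with few components

pcr.pred <- predict(pcr.fit, test, ncomp = 3)  # predictions used for the test RMSE below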

 

## Data:    X dimension: 133 19 
##  Y dimension: 133 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    336.9    323.9    328.5    328.4    329.9    337.1
## adjCV        451.5    336.3    323.6    327.8    327.5    328.8    335.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       335.2    333.7    338.5     334.3     337.8     340.4     346.7
## adjCV    332.5    331.7    336.4     332.0     335.5     337.6     343.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        345.1     345.7     329.4     337.3     343.5     338.7
## adjCV     341.2     341.6     325.7     332.7     338.4     333.9
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X        [values garbled in the source]
## Salary   [values garbled in the source]
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         95.37    96.49     97.45     98.09     98.73     99.21     99.52
## Salary   [values garbled in the source]
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.77     99.90     99.97     99.99    100.00
## Salary     62.65     65.29     66.48     66.77     67.37

The summary reports the cross-validated RMSEP and, for the training data, the percentage of variance explained. Plotting the cross-validated MSE shows that it is essentially minimized with only a few components. This is a big simplification compared with least squares, as we can account for most of the variance using only 3 components instead of 19.

Evaluate on the test data set.

sqrt(mean((pcr.pred - test$Salary)^2))
## [1] 374.8

Lower than lasso/linear regression RMSE.
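The number of components can also be tuned through caret (its method = "pcr" is labelled "Principal Component Analysis" in the output); a sketch on the same assumed split:

pcr.caret <- train(Salary ~ ., data = training,
                   method = "pcr",
                   preProcess = c("center", "scale"),
                   tuneGrid = expand.grid(ncomp = 1:3),
                   trControl = trainControl(method = "cv", number = 10))
pcr.caret

pcr.pred <- predict(pcr.caret, newdata = test)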

 

## Principal Component Analysis 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 121, 120, 118, 119, 120, 120, ... 
## 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
##   1      300   0.5       100      0.2        
##   2      300   0.5       100      0.2        
##   3      300   0.6       100      0.2        
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

caret selects the model with 3 components as the best.

sqrt(mean(pcr.pred - test$Salary)^2)
## [1] 21.86

However, PCR results are not easily interpreted, since each component is a linear combination of all the original variables.

Partial least squares
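A sketch of the partial least squares fit summarized below, again with the pls package on the assumed training split:

library(pls)

set.seed(3)
pls.fit <- plsr(Salary ~ ., data = training, scale = TRUE, validation = "CV")
summary(pls.fit)

pls.pred <- predict(pls.fit, test, ncomp = 2)  # the text below takes M = 2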

## Data:    X dimension: 133 19 
##  Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    328.9    328.4    332.6    329.2    325.4    323.4
## adjCV        451.5    328.2    327.4    330.6    326.9    323.0    320.9
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       318.7    318.7    316.3     317.6     316.5     317.0     319.2
## adjCV    316.2    315.5    313.5     314.9     313.6     313.9     315.9
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        323.0     323.8     325.4     324.5     323.6     321.4
## adjCV     319.3     320.1     321.4     320.5     319.9     317.8
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X        [values garbled in the source]
## Salary    51.56    54.90    57.72    59.78    61.50    62.94    63.96
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         90.55    93.49     95.82     97.05     97.67     98.45     98.67
## Salary    65.34    65.75     66.03     66.44     66.69     66.77     66.94
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.02     99.26     99.42     99.98    100.00
## Salary   [values garbled in the source]

Taking the best M to be 2, we evaluate the corresponding test error.

sqrt(mean(pls.pred - test$Salary)^2)
## [1] 14.34

Here we can see an improvement in RMSE compared to PCR.

