Original link: tecdat.cn/?p=8652

Original source: Tuoduan Data Tribe official account

 

Partial least squares (PLS) regression is a form of regression in which new components, formed as linear combinations of the original predictors, are used to explain both the independent and the dependent variables in the model.

In this article, we will use PLS to predict income.

library(Ecdat)
data(Mroz)   # labor-supply data used throughout this post
str(Mroz)
## 'data.frame':    753 obs. of  18 variables:
##  $ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ experience: int  14 5 15 6 7 33 11 35 24 21 ...

First, we prepared the data by dividing it into training and test sets.

set.seed(777)
train <- sample(c(TRUE, FALSE), nrow(Mroz), replace = TRUE)  # roughly 50/50 training/testing split
test <- !train                                               # held-out observations used for evaluation below

In the code above, we call set.seed to make the split reproducible. We then create the "train" indicator (and its complement, "test"), which assigns each observation to the training or the test set.

Now we use the "plsr" function from the pls package to create the model, with cross-validation, and then use the "summary" function to examine the results.
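
The exact call is not shown here; a minimal sketch, assuming the pls package and a model of income on all the other predictors with the default 10-segment cross-validation (pls.fit is a name introduced for this sketch):

library(pls)                                          # assumed: provides plsr() and validationplot()
pls.fit <- plsr(income ~ ., data = Mroz, subset = train,
                scale = TRUE, validation = "CV")      # cross-validated partial least squares fit
summary(pls.fit)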

## Data:    X dimension: 392 17
##          Y dimension: 392 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           11218     8121     6701     6127     5952     5886     5857
## adjCV        11218     8114     6683     6108     5941     5872     5842
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        5853     5849     5854      5853      5853      5852      5852
## adjCV     5837     5833     5837      5836      5836      5835      5835
##        14 comps  15 comps  16 comps  17 comps
## CV         5852      5852      5852      5852
## adjCV      5835      5835      5835      5835
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X                                                               69.13
## income    49.26    66.63    72.75    74.16    74.87    75.25    75.44
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         72.82    76.06     78.59     81.79     85.52     89.55     92.14
## income    75.49    75.51     75.51     75.52     75.52
##         15 comps  16 comps  17 comps
## X          94.88     97.62    100.00
## income     75.52     75.52     75.52

The output reports the cross-validated root mean squared error of prediction (RMSEP) in the validation section. Because there are 17 independent variables, up to 17 components are considered. The variance explained in the dependent variable improves very little after components 3 or 4. We can visualize these results with validation plots.
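
A minimal sketch of the plotting calls, assuming the pls package's validationplot function and the pls.fit object defined above:

validationplot(pls.fit, val.type = "RMSEP")  # cross-validated prediction error by number of components
validationplot(pls.fit, val.type = "R2")     # cross-validated R-squared by number of components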

 

We will use our model to make predictions.

After that, we calculate the test mean squared error: we subtract the test-set values of the dependent variable from our model's predictions, square the differences, and take the average.
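
The prediction call is not shown here; a minimal sketch, assuming we keep 3 components, roughly where the cross-validated error levels off (pls.pred is the object name used in the next line):

pls.pred <- predict(pls.fit, Mroz[test, ], ncomp = 3)  # test-set predictions from the PLS fit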

mean((pls.pred - Mroz$income[test])^2)

## [1] 63386682

Next, we run the data through a traditional least squares regression model and compare the results, computing the test mean squared error in the same way.
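
The fitting call is not shown here; a minimal sketch, assuming the same formula and training subset (lm.fit is the object name used in the code that follows):

lm.fit <- lm(income ~ ., data = Mroz, subset = train)  # ordinary least squares with all predictors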

## [1] 59432814

The least squares model does a little better than the partial least squares model. However, if we look at the model summary, we see several variables that are not statistically significant. Let's remove them and see what happens.

summary(lm.fit)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -20131  -2923  -1065   1670  36246
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.946e+04  3.224e+03   6.036 3.81e-09 ***
## workno      -4.823e+03
## hoursw                  5.517e-01   7.712 1.14e-13 ***
## child6      -6.313e+02  6.694e+02  -0.943
## educw        1.268e+02  1.889e+02   0.671 0.502513
## hearnw       6.401e+02  1.420e+02   4.507          ***
## wagew        1.945e+02  1.818e+02   1.070 0.285187
## hoursh       6.030e+00  5.342e-01  11.288  < 2e-16 ***
## ageh        -9.433e+01  7.720e+01  -1.222 0.222488
## educh        1.784e+02  1.369e+02   1.303 0.193437
## wageh        2.202e+03  8.714e+01  25.264  < 2e-16 ***
## educwm      -4.394e+01  1.128e+02  -0.390
## educwf       1.392e+02  1.053e+02   1.322 0.186873
## unemprate    1.657e+02  9.780e+01   1.694 0.091055 .
## cityyes      3.475e+02  6.686e+02   0.520 0.603496
## experience   1.229e+02  4.490e+01   2.737 0.006488 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared:  0.75,  Adjusted R-squared:  0.744
## F-statistic: 67.85 on 17 and 374 DF,  p-value: < 2.2e-16
lm.pred <- predict(lm.fit, Mroz[test, ])
mean((lm.pred - Mroz$income[test])^2)

## [1] 57839715

The error is reduced further, which suggests that the least squares regression model is better than the partial least squares model here. In addition, partial least squares models are harder to interpret. For these reasons, the least squares model remains the more popular choice.

 

