Source: tecdat.cn/?p=22319 | Tuoduan Data Tribe official account

In this article, a partial least squares regression (PLSR) model is built and its prediction performance is evaluated. To obtain a reliable model, we also implement several common outlier-detection and variable-selection methods, which can remove potential outliers and "clean" the data by keeping only a subset of selected variables.

Steps

  • Building a PLS regression model
  • K-fold cross-validation of PLS
  • Monte Carlo cross-validation (MCCV) of PLS
  • Double cross-validation (DCV) of PLS
  • Outlier detection with Monte Carlo sampling
  • Variable selection with CARS
  • Variable selection with moving window PLS (MWPLS)
  • Variable selection with Monte Carlo uninformative variable elimination (MCUVE)
  • Variable selection with Random Frog

Building a PLS regression model

This example illustrates how to build a PLS model on benchmark near-infrared (NIR) data.

plot(X');                    % display the spectral data
xlabel('Wavelength index');
ylabel('Intensity');

Parameter setting

A = 6;              % number of latent variables (LVs)
method = 'center';  % internal preprocessing method for X
PLS = pls(X, y, A, method);   % build the PLS model

 

The pls.m function returns a structure PLS containing the fields below. Interpretation of the results:

regcoef_original: the regression coefficients linking X and y.
X_scores: the scores of X.
VIP: variable importance in projection, a criterion for assessing the importance of each variable.
RMSEF: the root mean square error of fitting.
y_fit: the fitted values of y.
R2: the percentage of explained variation in y.
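These fields can be inspected directly for a quick check of the fit. A minimal sketch, assuming the field names listed above:

fprintf('RMSEF = %.4f, R2 = %.2f%%\n', PLS.RMSEF, PLS.R2);
plot(y, PLS.y_fit, 'o');           % fitted vs. measured values
xlabel('Measured y'); ylabel('Fitted y');
figure;
bar(PLS.VIP);                      % variable importance in projection
xlabel('Variable index'); ylabel('VIP');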

K-fold cross-validation of PLS

This example shows how to perform k-fold cross-validation for a PLS model.

clear;
A = 6;   % number of LVs
K = 5;   % number of cross-validation folds
% run k-fold cross-validation here to obtain the structure CV

plot(CV.RMSECV);   % plot the RMSECV value at each number of latent variables (LVs)
xlabel('Number of latent variables (LVs)');
ylabel('RMSECV');

 

The returned value CV is a structure with the fields below. Interpretation of the results:

RMSECV: the root mean square error of cross-validation; smaller is better.
Q2: the same meaning as R2, but computed by cross-validation.
optLV: the number of LVs that reaches the minimum RMSECV (maximum Q2).
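The optimal number of LVs can then be plugged back into the model-building step. A minimal sketch, assuming the fields above and the pls call from the first section:

A_opt = CV.optLV;                       % LV count that minimizes RMSECV
PLS_final = pls(X, y, A_opt, method);   % refit the model with the optimal LV count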

Monte Carlo cross-validation (MCCV) of PLS

This example explains how to run MCCV for PLS modeling. Like k-fold CV, MCCV is another method of cross-validation: it repeatedly draws random training/validation splits of the samples.

% parameter setting
A = 6;
method = 'center';
N = 500;   % number of Monte Carlo samplings
% run MCCV here to obtain the structure MCCV
plot(MCCV.RMSECV);   % plot the RMSECV value at each number of LVs
xlabel('Number of latent variables (LVs)');
 
MCCV   % display the result structure

MCCV is a structure. Interpretation of the results:

Ypred: the predicted values.
Ytrue: the true values.
RMSECV: the root mean square error of cross-validation; the smaller, the better.
Q2: the same meaning as R2, but computed by cross-validation.
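For intuition, the MCCV scheme can be sketched from scratch: each run randomly splits the samples, fits PLS on the training part, and records the prediction error on the held-out part. An illustrative sketch, not the toolbox code; the training ratio and the layout of regcoef_original (intercept as the last element) are assumptions:

n = size(X, 1);
ratio = 0.8;                         % fraction of samples used for training (assumed)
ntrain = floor(n * ratio);
err = zeros(N, 1);
for i = 1:N
    perm = randperm(n);              % a fresh random split in every run
    tr = perm(1:ntrain);             % training indices
    te = perm(ntrain+1:end);         % validation indices
    model = pls(X(tr,:), y(tr), A, method);
    b = model.regcoef_original;      % assumed: coefficient vector with intercept last
    ypred = [X(te,:), ones(numel(te),1)] * b;
    err(i) = sqrt(mean((y(te) - ypred).^2));   % RMSE of this run
end
rmsecv_mc = mean(err)                % aggregate error over all N runs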

Double cross-validation (DCV) of PLS

This example explains how to run DCV for PLS modeling. Like k-fold CV, DCV is a cross-validation scheme: an inner CV loop chooses the number of LVs, while an outer CV loop estimates the prediction error.

% parameter setting
N = 50;
DCV = dcv(X, y, A, K, method, N);   % run double cross-validation
DCV                                 % display the result structure
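Conceptually, DCV nests the LV-selection step inside an outer error-estimation loop. An illustrative sketch of that nesting; inner_cv is a hypothetical helper standing in for the k-fold CV step shown earlier, and the coefficient layout is assumed:

n = size(X, 1);
folds = mod(randperm(n), K) + 1;     % random outer fold assignment
rmsep = zeros(K, 1);
for k = 1:K
    te = (folds == k);               % outer test fold
    tr = ~te;                        % outer training folds
    A_opt = inner_cv(X(tr,:), y(tr), A, K, method);   % hypothetical inner CV picks the LV count
    model = pls(X(tr,:), y(tr), A_opt, method);
    b = model.regcoef_original;      % assumed: coefficient vector with intercept last
    ypred = [X(te,:), ones(nnz(te),1)] * b;
    rmsep(k) = sqrt(mean((y(te) - ypred).^2));
end
mean(rmsep)                          % prediction error estimated without selection bias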

Outlier detection with Monte Carlo sampling

This example explains how to use the Monte Carlo sampling-based outlier detection method.

A = 6;
method = 'center';
N = 500;        % number of Monte Carlo samplings (value assumed; not given in the original)
ratio = 0.75;   % fraction of samples drawn in each sampling (value assumed)
F = mc(X, y, A, method, N, ratio);

Interpretation of the results:

predError: the prediction error of each sample.
MEAN: the mean prediction error of each sample.
STD: the standard deviation of the prediction error of each sample.

plot(F);   % diagnostic plot

 

Note: samples with high MEAN or STD values are more likely to be outliers and should be considered for removal before modeling.
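One way to act on these diagnostics is to flag samples whose MEAN or STD exceeds a cutoff. An illustrative sketch; the 3-sigma style cutoffs are assumptions, not from the source:

cutM = mean(F.MEAN) + 3*std(F.MEAN);        % cutoff on the mean prediction error (assumed)
cutS = mean(F.STD) + 3*std(F.STD);          % cutoff on its standard deviation (assumed)
out = find(F.MEAN > cutM | F.STD > cutS);   % candidate outlier indices
Xc = X;  yc = y;
Xc(out,:) = [];                             % remove flagged samples before final modeling
yc(out) = [];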

Variable selection with CARS


A = 6;
fold = 5;
CARS = car(X, y, A, fold);   % run CARS and keep the result for plotting below

Interpretation of the results:

optLV: the number of LVs of the optimal model.
vsel: the selected variables (columns of X).

plotcars(CARS);   % diagnostic plot

 

Note: in this figure, the top and middle panels show how the number of selected variables and the RMSECV change across iterations. The bottom panel shows how the regression coefficient of each variable (one line per variable) evolves across iterations. The vertical line marked with a star indicates the optimal model, i.e. the one with the lowest RMSECV.
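The selected variables can then be used to refit a more parsimonious model. A minimal sketch, assuming the result fields listed above:

Xsel = X(:, CARS.vsel);                        % keep only the CARS-selected columns
PLS_cars = pls(Xsel, y, CARS.optLV, method);   % refit on the reduced data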

Variable selection with moving window PLS (MWPLS)

load corn_m51;                   % example data
width = 15;                      % window size
[WP, RMSEF] = mw(X, y, width);   % moving-window PLS (output names as used in the plot)
plot(WP, RMSEF);
xlabel('Window position');

 

Note: from this figure, regions with low RMSEF values are the ones suggested for inclusion in the PLS model.
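To act on this, one can keep the windows whose RMSEF is close to the minimum and expand them into variable indices. An illustrative sketch; the 10% cutoff rule and the assumption that WP holds each window's starting index are mine, not from the source:

cut = min(RMSEF) * 1.1;             % keep windows within 10% of the best RMSEF (assumed)
sel = WP(RMSEF < cut);              % promising window positions
win = arrayfun(@(p) p:p+width-1, sel(:)', 'UniformOutput', false);
vars = unique(cell2mat(win));       % expand windows into variable indices
vars = vars(vars <= size(X, 2));    % clip to the valid column range
PLS_mw = pls(X(:, vars), y, A, method);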

Variable selection with Monte Carlo uninformative variable elimination (MCUVE)

N = 500;
method = 'center';
% run MCUVE here to obtain the structure UVE
UVE   % display the result structure


plot(abs(UVE.RI));   % plot the absolute reliability index of each variable

 

Interpretation of the results. RI: the reliability index of UVE, a measure of variable importance; the larger its absolute value, the more important the variable.
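A common follow-up is to keep the variables with the largest |RI|. An illustrative sketch; the number of retained variables is an assumption, not from the source:

ri = abs(UVE.RI);
[~, order] = sort(ri, 'descend');        % rank variables by reliability
vsel = order(1:50);                      % keep the 50 highest-|RI| variables (count assumed)
PLS_uve = pls(X(:, vsel), y, A, method);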

Variable selection with Random Frog

A = 6;
N = 10000;   % number of iterations
method = 'center';
FROG = rd_pls(X, y, A, method, N);   % run the Random Frog algorithm


              N: 10000
              Q: 2
          model: [10000x700 double]
        minutes: 0.6683
         method: 'center'
          Vrank: [1x700 double]
         Vtop10: [505 405 506 400 408 233 235 249 248 515]
    probability: [1x700 double]
           nVar: [1x10000 double]
          RMSEP: [1x10000 double]

plot(FROG.probability);   % assumed plot command; the original shows only the axis labels
xlabel('Variable index');
ylabel('Selection probability');

 

Interpretation of the results:

model: a matrix that stores the variables selected in each iteration. probability: the probability of each variable being included in the final model; the larger, the better. This is a useful indicator of variable importance.
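Variables with a high selection probability can then be kept for a final model. An illustrative sketch; the 0.3 cutoff is an assumption, not from the source:

vsel = find(FROG.probability > 0.3);      % high-probability variables (cutoff assumed)
PLS_frog = pls(X(:, vsel), y, A, method);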

