Original link: tecdat.cn/?p=9706

Original source: Tuo Duan Data Tribe WeChat official account

 

Overview

Here we relax the assumptions of the popular linear approach. Sometimes the linearity assumption is simply a poor approximation. There are many ways to address this; some of them, such as regularization, reduce model complexity but still rely on a linear model, so they can only improve things so far. This article focuses on the following extensions of the linear model:

  • Polynomial regression: a simple way to provide a nonlinear fit to the data.
  • Step functions: divide the range of a variable into K distinct regions to produce a qualitative variable, which has the effect of fitting a piecewise constant function.
  • Regression splines: more flexible than polynomials and step functions, and in fact an extension of both.
  • Local regression: similar to regression splines, but the regions are allowed to overlap, and they do so smoothly.
  • Smoothing splines: also similar to regression splines, but they minimize a residual sum of squares criterion subject to a smoothness penalty.
  • Generalized additive models: extend the methods above to handle multiple predictors.

Polynomial regression

This is the most traditional way to extend the linear model. As we add higher-order terms to the polynomial, polynomial regression lets us produce nonlinear curves while still estimating the coefficients by least squares.
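For reference, the degree-d polynomial regression model (a standard formulation, not written out in the original post) replaces the single linear term with powers of the predictor:

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i

The coefficients \beta_0, \ldots, \beta_d are still estimated by ordinary least squares, so the model remains linear in the parameters even though it is nonlinear in x.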

Step functions

Step functions are often used in biostatistics and epidemiology.

Regression spline

Regression splines are one of many basis-function approaches that extend polynomial and step-function regression. In fact, polynomial and step functions are simply special cases of the basis-function approach.

Consider, for example, a piecewise cubic fit: the range of the predictor is divided at one or more knots and a separate cubic polynomial is fit in each region. Without any constraints, the fitted function can jump at the knots. A better approach is to impose constraints so that the fitted curve is continuous at each knot (and, ideally, has continuous first and second derivatives there), which gives a regression spline.

Select the location and number of knots

One option is to put more knots where we think the changes are fastest and fewer knots where they are more stable. In practice, however, knots are usually placed in a uniform manner.

To be clear, in this case there are actually five knots, including the boundary knot.

So how many knots should we use? An easy choice is to try many knots and see which produces the best curve. However, a more objective approach is to use cross validation.

Compared with polynomial regression, spline curves can show more stable results.

Smoothing splines

We have discussed regression splines, which are created by specifying a set of knots, generating a sequence of basis functions, and then estimating the spline coefficients by least squares. Smoothing splines are another way to create a spline. Recall that our goal is to find a function that fits the observed data well, that is, one that minimizes the RSS. However, if we place no restrictions on the function, we can always drive the RSS to zero by choosing a function that interpolates all of the data exactly.
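A smoothing spline avoids this by minimizing a penalized residual sum of squares; this is the standard criterion, spelled out here for clarity rather than quoted from the original post:

\sum_{i=1}^{n} \left( y_i - g(x_i) \right)^2 + \lambda \int g''(t)^2 \, dt

The first term rewards fidelity to the data, the second penalizes roughness, and the tuning parameter \lambda \ge 0 controls the trade-off: larger values of \lambda give a smoother function g.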

Select the smoothing parameter Lambda

Again, we resort to cross-validation. It turns out that LOOCV can be computed very efficiently for smoothing splines, regression splines, and other arbitrary basis-function fits.

Smoothing splines are often preferable to regression splines because they usually create simpler models with a comparable fit.

Local regression

Local regression computes the fit at a target point x0 using only the nearby training observations.

Local regression can be generalized in many ways, particularly to multivariate settings: for example, one can fit a multiple linear regression model that is global in some variables but local in others.

Generalized additive model

 

GAMs provide a general framework for extending the linear model by allowing nonlinear functions of each variable while preserving additivity.
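Concretely, a GAM for a quantitative response replaces each linear term \beta_j x_{ij} with a smooth function f_j(x_{ij}); this is the standard formulation, added here for clarity:

y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i

The model is called additive because a separate f_j is estimated for each predictor and their contributions are summed.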

Fitting a GAM with smoothing splines is not quite as simple, because least squares can no longer be used. Instead, a method called backfitting is used.

Pros and cons of GAM

Advantages

  • GAMs allow us to fit a nonlinear function to each predictor, so we can automatically model nonlinear relationships that standard linear regression would miss. We do not need to try many different transformations of each variable by hand.
  • The nonlinear fits can potentially make more accurate predictions of the response Y.
  • Because the model is additive, we can still examine the effect of each predictor on Y while holding the other variables fixed.

Disadvantages

  • The main limitation is that the model is restricted to be additive, so important interactions can be missed.

 

Examples

Polynomial regression and step functions

library(ISLR)
attach(Wage)

We can use the poly() function to fit a polynomial, specifying the variable and the degree. By default it returns a matrix of orthogonal polynomials, which means each column is a linear combination of the variables age, age^2, age^3, and age^4. If you want the raw powers directly, specify raw=TRUE; this does not affect the predictions, but it is useful for inspecting the coefficient estimates.

library(knitr)  # provides kable() for formatted tables
fit <- lm(wage ~ poly(age, 4), data = Wage)
kable(coef(summary(fit)))

 

Now let's create a grid of age values at which we want predictions. Then we will plot the data and the fitted quartic polynomial with its standard-error bands.

ageLims <- range(age)
age.grid <- seq(from = ageLims[1], to = ageLims[2])

pred <- predict(fit, newdata = list(age = age.grid), se = TRUE)
# Standard-error bands: fitted values plus/minus two standard errors
se.bands <- cbind(pred$fit + 2 * pred$se.fit, pred$fit - 2 * pred$se.fit)

plot(age, wage, xlim = ageLims, cex = .5, col = "darkgrey")
lines(age.grid, pred$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 2, col = "blue", lty = 3)

In this simple example of nested models, we can use an ANOVA (F-test) to decide which polynomial degree is needed.
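The post does not show the fitting code; below is a minimal sketch (the object names fit.1 through fit.5 are assumptions) of the five nested fits and the anova() call that would produce a table like the one that follows.

fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
# F-tests comparing each model with the next, more complex one
anova(fit.1, fit.2, fit.3, fit.4, fit.5)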

## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1   2998 5022216
## 2   2997 4793430  1    228786 143.59 <2e-16 ***
## 3   2996 4777674  1     15756   9.89 0.0017 **
## 4   2995 4771604  1      6070   3.81 0.0510 .
## 5   2994 4770322  1      1283   0.80 0.3697
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value comparing the linear model M1 to the quadratic model M2 is essentially zero, indicating that a linear fit is not sufficient; the cubic term is also significant, while the degree-4 and degree-5 terms add little. We can therefore conclude that a quadratic or cubic model is a reasonable fit for these data, favouring the simpler model.

We can also use cross-validation to select the polynomial degree.
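The cross-validation code is not shown in the post; here is a minimal sketch using 10-fold CV via cv.glm() from the boot package (the number of folds, the seed, and the object names are assumptions).

library(boot)
set.seed(1)
cv.errors <- rep(NA, 5)
for (d in 1:5) {
  # A Gaussian glm() fit is equivalent to lm(), but works with cv.glm()
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.errors[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
plot(1:5, cv.errors, type = "b", xlab = "Polynomial degree", ylab = "10-fold CV error")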

 

The minimum cross-validation error here is actually attained by the degree-4 polynomial, but choosing a degree-2 or degree-3 model costs little. Next, we consider predicting whether an individual earns more than $250,000 a year.

However, computing confidence intervals for the probabilities directly would not be sensible, because we would end up with some negative probabilities. To obtain sensible confidence intervals, it makes more sense to compute the predictions on the logit scale and then transform them back.
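The logistic fit and the logit-scale transformation are omitted in the post; a minimal sketch is given below, assuming a degree-4 polynomial logistic regression and reusing age.grid from above. The objects pfit and se.bands are the ones used in the plotting code that follows.

# Logistic regression for the binary event wage > 250 (i.e. > $250,000)
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
pred <- predict(fit, newdata = list(age = age.grid), se = TRUE)  # predictions on the logit scale

# Transform fitted values and the +/- 2 SE band back to the probability scale
pfit <- exp(pred$fit) / (1 + exp(pred$fit))
se.bands.logit <- cbind(pred$fit + 2 * pred$se.fit, pred$fit - 2 * pred$se.fit)
se.bands <- exp(se.bands.logit) / (1 + exp(se.bands.logit))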

Plotting:

plot(age, I(wage > 250), xlim = ageLims, type = "n", ylim = c(0, .2))
lines(age.grid, pfit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

Step functions

Here we use the cut() function to split the range of age into bins.

table(cut(age, 4))

## 
## (17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1] 
##         750        1399         779          72
fit <- lm(wage~cut(age, 4), data=Wage)
coef(summary(fit))

 

##                        Estimate Std. Error t value  Pr(>|t|)
## (Intercept)              94.158      1.476  63.790 0.000e+00
## cut(age, 4)(33.5,49]     24.053      1.829  13.148 1.982e-29
## cut(age, 4)(49,64.5]        ...        ...     ...       ...
## cut(age, 4)(64.5,80.1]    7.641      4.987   1.532 1.256e-01

 

Splines

In this case, we will use cubic splines, fitted with the bs() function from the splines package.

Since we are using a cubic spline of three knots, the resulting spline has six basis functions.

dim(bs(age, df = 6))

## [1] 3000    6

Fit spline curve.
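A minimal sketch of such a fit with bs() is shown below; the interior knots at ages 25, 40, and 60 are assumptions, since the post does not state the knot locations.

library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
pred <- predict(fit, newdata = list(age = age.grid), se = TRUE)

plot(age, wage, col = "darkgrey", cex = .5)
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2 * pred$se.fit, lty = "dashed")
lines(age.grid, pred$fit - 2 * pred$se.fit, lty = "dashed")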

 

We can also fit a smoothing spline. Here we first fit a spline with 16 degrees of freedom, and then let cross-validation select the smoothness, which yields 6.8 effective degrees of freedom.
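The fitting calls themselves are not shown in the post; a minimal sketch of the two smoothing-spline fits described above is:

plot(age, wage, xlim = ageLims, cex = .5, col = "darkgrey")
fit  <- smooth.spline(age, wage, df = 16)    # smoothness fixed at 16 degrees of freedom
fit2 <- smooth.spline(age, wage, cv = TRUE)  # smoothness chosen by cross-validation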

fit2$df

## [1] 6.8

lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
legend("topright", legend = c("16 DF", "6.8 DF"),
       col = c("red", "blue"), lty = 1, lwd = 2, cex = 0.8)

Local regression

Perform local regression.
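The post does not show this code; a minimal sketch using loess() with two spans follows (the span values 0.2 and 0.5 are assumptions).

plot(age, wage, xlim = ageLims, cex = .5, col = "darkgrey")
fit  <- loess(wage ~ age, span = .2, data = Wage)  # smaller span: more local, wigglier fit
fit2 <- loess(wage ~ age, span = .5, data = Wage)  # larger span: smoother fit
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)), col = "blue", lwd = 2)
legend("topright", legend = c("Span = 0.2", "Span = 0.5"),
       col = c("red", "blue"), lty = 1, lwd = 2, cex = 0.8)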

GAMs

Now we use a GAM to predict wage using natural splines of year and age, with education as a qualitative predictor. Since this is just a big linear regression model with an appropriate choice of basis functions, we can fit it with the lm() function.
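A minimal sketch of that lm() fit is below; the natural-spline degrees of freedom (4 for year, 5 for age) are assumptions.

library(splines)
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)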

To fit the model with smoothing splines instead, we cannot use least squares via lm(); we need the gam package.
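A minimal sketch using the gam package's s() terms, matching the model formula that appears in the ANOVA output further below:

library(gam)
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)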

Plot the two fitted models:
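A minimal sketch of the plotting calls (not shown in the post); plot.Gam() is the gam package's method for plotting the lm()-based fit, though the exact calls used by the author may differ.

par(mfrow = c(1, 3))
plot(gam.m3, se = TRUE, col = "blue")   # smoothing-spline GAM
plot.Gam(gam1, se = TRUE, col = "red")  # natural-spline fit from lm()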

 

In these plots, the function of year looks fairly linear. We can create a new model that is linear in year and use ANOVA to compare the candidate models.
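A minimal sketch of the two additional models and the anova() comparison that would produce the table below (the object names gam.m1 and gam.m2 are assumptions):

gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)         # excludes year
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)  # linear in year
anova(gam.m1, gam.m2, gam.m3, test = "F")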

## Analysis of Variance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
##   Res.Df     RSS Df Sum of Sq    F  Pr(>F)
## 1   2990 3712881
## 2   2989 3693842  1     19040 15.4 8.9e-05 ***
## 3   2986 3689770  3      4071  1.1    0.35
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that a GAM with a linear year component is much better than a GAM that omits year entirely, while there is no evidence that a nonlinear function of year is needed. Below is the summary of the fitted smoothing-spline GAM.

##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -119.43  -19.70   -3.33   14.17  213.48
##
## (Dispersion Parameter for Gaussian family taken to be 1236)
##
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29888
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
##              Df  Sum Sq Mean Sq F value  Pr(>F)
## s(year, 4)    1   27162   27162      22 2.9e-06 ***
## s(age, 5)    1  195338  195338     158 < 2e-16 ***
## education     4 1069726  267432     216 < 2e-16 ***
## Residuals  2986 3689770    1236
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)
## (Intercept)
## s(year, 4)        3    1.1   0.35
## s(age, 5)         4   32.4 <2e-16 ***
## education
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The large p-value for s(year, 4) in the "Anova for Nonparametric Effects" table reconfirms that a nonlinear function of year does not contribute to the model.

Next, we fit a GAM in which one of the terms is a local regression fit.
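A minimal sketch using a lo() term inside gam(); the span of 0.7 is an assumption.

# Local regression for the age term, smoothing spline for year
gam.lo <- gam(wage ~ s(year, 4) + lo(age, span = 0.7) + education, data = Wage)
plot(gam.lo, se = TRUE, col = "green")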

We can also use local regression to create interaction terms before calling gam().

We can plot the resulting surface.
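A minimal sketch of a bivariate lo() interaction term and its surface plot; the span of 0.5 is an assumption, and the akima package is needed to plot the two-dimensional surface.

# year and age enter through a single two-dimensional local regression surface
gam.lo.i <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)

library(akima)
plot(gam.lo.i)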

