Original article: tecdat.cn/?p=21444 | Source: the Tuoduan data tribe official account

 

Logistic regression is a commonly used method in research: it can be used for screening influencing factors, predicting probabilities, classification, and so on. In medical research, for example, data obtained with high-throughput sequencing technology make high-dimensional variable selection challenging. Penalized logistic regression can perform variable selection and coefficient estimation for high-dimensional data, and efficient algorithms make the computation feasible. This article introduces penalized logistic regression methods such as LASSO and ridge regression.

Methods

As we have seen before, the classical technique for estimating the parameters of a parametric model is maximum likelihood: we choose the parameter values that maximize the (log-)likelihood of the sample.
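For the logistic model used throughout this article, this can be sketched as follows (standard notation, stated here for reference):

\[
\widehat{\beta} = \operatorname*{argmax}_{\beta}\ \log\mathcal{L}(\beta)
= \operatorname*{argmax}_{\beta}\ \sum_{i=1}^n \Big[ y_i\log\pi_i + (1-y_i)\log(1-\pi_i) \Big],
\qquad
\pi_i = \frac{e^{x_i^\top\beta}}{1+e^{x_i^\top\beta}}.
\]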

The objective function here is only concerned with goodness of fit. But in econometrics we generally believe that simple theories are preferable to more complex ones, so we want to penalize overly complex models.

That’s a good idea, and it is often mentioned in econometrics textbooks, but the choice of model is usually not part of the inference itself: typically, we estimate the parameters by maximum likelihood and then use the AIC or the BIC to compare two models. The Akaike criterion (AIC) is based on a penalized likelihood.
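In its standard (textbook) form, with \(k\) the number of estimated parameters,

\[
AIC = -2\log\mathcal{L}(\widehat{\theta}) + 2k.
\]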

We have a measure of goodness of fit on the left, and on the right, the penalty increases with the “complexity” of the model.

Here, complexity is the number of variables used. But suppose we do not perform variable selection and instead consider the regression on all the covariates. Define the ℓ0 "norm" of the coefficient vector as its number of non-zero components, \(\|\beta\|_{\ell_0}=\sum_j \mathbf{1}(\beta_j\neq 0)\).

With this notation, the AIC objective can be written as a penalized log-likelihood, \(-2\log\mathcal{L}(\beta)+2\|\beta\|_{\ell_0}\).

In fact, this is our target function: a goodness-of-fit term plus a penalty on the size of the coefficients. More specifically, we will consider objectives of exactly that form.

In this article, I want to discuss numerical algorithms for solving such optimization problems, for the ℓ2 penalty (ridge regression) and the ℓ1 penalty (LASSO regression).
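Concretely, and consistently with the code used below (which adds lambda*sum(beta^2) or lambda*sum(abs(beta)) to the negative log-likelihood), the two problems can be sketched as

\[
\widehat{\beta}^{\text{ridge}}_{\lambda} = \operatorname*{argmin}_{\beta}\Big\{-\log\mathcal{L}(\beta)+\lambda\|\beta\|_{2}^{2}\Big\},
\qquad
\widehat{\beta}^{\text{lasso}}_{\lambda} = \operatorname*{argmin}_{\beta}\Big\{-\log\mathcal{L}(\beta)+\lambda\|\beta\|_{1}\Big\}.
\]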

Standardization of covariables

Here we use data from patients who had an infarction and were observed in the emergency room, and we want to see who survived, in order to obtain a predictive model. The first step is to standardize the variables (to unit variance) by considering linear transformations of the covariates x_j:


for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])

 

Ridge regression

Before running some code, recall that we want to solve the penalized problem introduced above: minimize the negative log-likelihood plus λ times the squared ℓ2 norm of the slope coefficients.

If we were considering the log-likelihood of Gaussian variables, we would obtain a sum of squared residuals and an explicit solution. This is not the case for logistic regression.

The heuristics behind ridge regression are illustrated in the figure below. In the background we can visualize the (two-dimensional) log-likelihood of the logistic regression, and if we rewrite the optimization problem as a constrained optimization problem, the blue circle is our constraint:

 

The problem can be written equivalently in constrained form (it is a strictly convex problem).
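A sketch of that equivalent constrained formulation (for some radius s, in one-to-one correspondence with the penalty level λ):

\[
\max_{\beta_0,\beta}\ \log\mathcal{L}(\beta_0,\beta)
\quad\text{subject to}\quad \|\beta\|_{2}\le s.
\]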

Therefore, the constrained maximum should lie on the blue disk (on its boundary when the constraint is binding).

LogLik = function(bbeta){
  b0 = bbeta[1]
  beta = bbeta[-1]
  sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta)))
}
# X is assumed here to contain only two covariates, so that the log-likelihood surface
# can be drawn in two dimensions (and is displayed before adding the constraint circle)
u = seq(-4, 4, length=251)
v = outer(u, u, Vectorize(function(x,y) LogLik(c(1,x,y))))
lines(u,  sqrt(1-u^2), type="l", lwd=2, col="blue")
lines(u, -sqrt(1-u^2), type="l", lwd=2, col="blue")

 

 

Let’s now consider the objective function, with the following code (the penalized negative log-likelihood, written as a function of the parameters and of λ):


PennegLogLik = function(bbeta, lambda=0){
  b0 = bbeta[1]; beta = bbeta[-1]
  -sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta))) + lambda*sum(beta^2)
}

Why not try a standard optimizer? We mentioned that using optimization routines is not wise because they rely so heavily on starting points.

beta_init = lm(y~., data)$coefficients
vpar = matrix(NA, 1000, 8)
for(i in 1:1000){
  vpar[i,] = optim(par = beta_init*rnorm(8,1,2), function(x) PennegLogLik(x,lambda),
                   method = "BFGS", control = list(abstol=1e-9))$par
}
par(mfrow=c(1,2))
plot(density(vpar[,2]))

 

Obviously, even if we change the starting point, it looks like we’re converging toward the same value. This can be considered optimal.

The following code is then used to compute β_λ:


beta_init = lm(y~., data)$coefficients
logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x,lambda),
                     method = "BFGS", control=list(abstol=1e-9))

We can visualize the evolution of βλ as a function of λ

v_lambda = exp(seq(-2, 5, length=61))
# est_ridge: matrix of estimates obtained by running the optimization for each value of v_lambda
plot(v_lambda, est_ridge[1,], col=colrs[1])
for(i in 2:7) lines(v_lambda, est_ridge[i,], col=colrs[i])

This seems to make sense: we can observe the contraction as λ increases.

Ridge regression using the Newton-Raphson algorithm

We have seen that we can also solve this problem with the Newton-Raphson algorithm, just as in the case without a penalty term; only the update step changes. The iteration is sketched below, followed by the code.
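The update computed in the code below can be summarized as

\[
\beta^{(k+1)} = \big(X^\top\Delta X + 2\lambda \mathbb{I}\big)^{-1} X^\top\Delta z,
\qquad
\Delta = \mathrm{diag}\big(\pi_i(1-\pi_i)\big),\quad
\pi_i = \frac{e^{x_i^\top\beta^{(k)}}}{1+e^{x_i^\top\beta^{(k)}}},\quad
z = X\beta^{(k)} + \Delta^{-1}(y-\pi).
\]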

for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
# beta holds the successive iterates (initialized beforehand, e.g. with unpenalized estimates);
# Delta is a diagonal weight matrix and lambda a given penalty level
for(s in 1:9){
  pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
  diag(Delta) = pi*(1-pi)
  z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
  B = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*% (t(X)%*%Delta%*%z)
  beta = cbind(beta,B)
}
beta[,8:10]
              [,1]        [,2]        [,3]
XInter  0.59619654  0.59619654  0.59619654
XFRCAR  0.09217848  0.09217848  0.09217848
XINCAR  0.77165707  0.77165707  0.77165707
XINSYS  0.69678521  0.69678521  0.69678521
XPRDIA -0.29575642 -0.29575642 -0.29575642
XPAPUL -0.23921101 -0.23921101 -0.23921101
XPVENT -0.33120792 -0.33120792 -0.33120792
XREPUL -0.84308972 -0.84308972 -0.84308972

Again, it seems to converge very quickly.

Interestingly, with this algorithm, we can also get the variance of the estimator
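The quantity Varb computed in the code below corresponds to the sandwich-type expression

\[
\mathrm{Var}[\widehat\beta] = \big(X^\top\Delta X + 2\lambda\mathbb{I}\big)^{-1} X^\top\Delta\,\mathrm{Var}[z]\,\Delta X \big(X^\top\Delta X + 2\lambda\mathbb{I}\big)^{-1},
\qquad \mathrm{Var}[z]=\Delta^{-1}.
\]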

The following code then computes β_λ as a function of λ, together with its variance:

for(s in 1:20){
  pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
  diag(Delta) = (pi*(1-pi))
  z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
  B = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*% (t(X)%*%Delta%*%z)
  beta = cbind(beta,B)
}
Varz = solve(Delta)
Varb = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*% t(X)%*%Delta %*% Varz %*%
  Delta %*% X %*% solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X)))

We can visualize the evolution of βλ (as a function of λ)


plot(v_lambda,est_ridge[1,],col=colrs[1],type="l")
for(i in 2:7) lines(v_lambda,est_ridge[i,],col=colrs[i])

 

And we obtain the evolution of the variance:

 

Recall that when λ=0 (on the left of the graph), β_λ coincides with the unpenalized maximum-likelihood estimate. Thus, as λ increases, (i) the bias increases (the estimates shrink towards 0) and (ii) the variance decreases.

Ridge regression with glmnet

As usual, R functions are available for ridge regression. Let's use the glmnet function with alpha=0.


for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
glm_ridge = glmnet(X, y, alpha=0)
plot(glm_ridge, xvar="lambda")

 

The same coefficient paths can also be displayed as a function of the ℓ1 norm of the coefficients.

Ridge regression with orthogonal covariates

An interesting case is obtained when the covariates are orthogonal. This can be achieved through a principal component analysis of the covariates.


library(factoextra)
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord

Let’s do ridge regression for these (orthogonal) covariables


glm_ridge = glmnet(pca_X, y, alpha=0)
plot(glm_ridge,xvar="lambda",col=colrs,lwd=2)

 

We clearly observe the shrinkage of the parameters towards 0 as λ grows.
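As a point of intuition (a standard result for the linear least-squares analogue, stated under the assumption of orthonormal covariates, and up to how the penalty is parametrized), ridge then simply rescales each unpenalized coefficient:

\[
\widehat\beta_j^{(\lambda)} = \frac{\widehat\beta_j^{(0)}}{1+\lambda},
\]

which is the kind of uniform shrinkage towards 0 visible in the path plot above.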

Application

Let’s try a second set of data

We can try various values of λ

reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=0)
plot(reg, xvar="lambda", col=c("blue","red"), lwd=2)
abline(v=log(.2))



or

# plot_lambda(): plotting helper for a given value of lambda (its definition is not shown here)
abline(v=log(1.2))
plot_lambda(1.2)

The next step is to change the penalty criterion to the ℓ1 norm.

Standardization of covariables

As mentioned earlier, the first step is to standardize the variables (to unit variance) by considering linear transformations of the covariates x_j.


for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)

LASSO regression

The heuristics behind LASSO regression are illustrated in the figure below. In the background we can visualize the (two-dimensional) log-likelihood of the logistic regression; if we reconsider the optimization problem as a constrained optimization problem, the blue square is our constraint:

LogLik = function(bbeta){
  b0 = bbeta[1]
  beta = bbeta[-1]
  sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta)))
}
# u, v: grid and log-likelihood surface computed as in the ridge section,
# with X again restricted to two covariates
contour(u, u, v, add=TRUE)
polygon(c(-1,0,1,0), c(0,1,0,-1), border="blue")

The advantage here is that it can be used as a variable selection tool.

Heuristically, the mathematics can be explained as follows. Consider a simple regression y_i = x_i β + ε, with an ℓ1 penalty and an ℓ2 loss function. The optimization problem, its first-order condition and its solution are sketched below.
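A standard derivation sketch (one covariate, no intercept):

\[
\min_{\beta}\ \frac{1}{2}\sum_{i=1}^n (y_i - x_i\beta)^2 + \lambda|\beta|,
\qquad
-\sum_{i=1}^n x_i(y_i-x_i\beta) + \lambda\,\partial|\beta| \ni 0,
\]

so that

\[
\widehat\beta_\lambda
= \frac{\mathrm{sign}\!\Big(\sum_i x_i y_i\Big)\Big(\big|\sum_i x_i y_i\big|-\lambda\Big)_+}{\sum_i x_i^2},
\]

i.e. a soft-thresholded version of the unpenalized estimate.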

 

The optimization process

Let’s start with standard (R) optimization routines, such as BFGS


# PennegLogLik is assumed here to be the l1-penalized objective, i.e. the negative
# log-likelihood plus lambda*sum(abs(beta))
logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x,lambda),
                     hessian=TRUE, method = "BFGS", control=list(abstol=1e-9))

# est_lasso: matrix of estimates obtained by running the optimization over the grid v_lambda
plot(v_lambda, est_lasso[1,], col=colrs[1], type="l")
for(i in 2:7) lines(v_lambda, est_lasso[i,], col=colrs[i], lwd=2)

 

The result is erratic.

Using glmnet

For comparison, using an R program dedicated to LASSO, we get the following


glm_lasso = glmnet(X, y, alpha=1)
plot(glm_lasso, xvar="lambda", col=colrs, lwd=2)

If we look closely at the output, we can see that there is variable selection, in the sense that some components satisfy β_{j,λ}=0: they are "exactly zero".

glmnet(X, y, alpha=1, lambda=exp(-4))$beta
7x1 sparse Matrix of class "dgCMatrix"
               S0
FRCAR   .
INCAR   0.11005070
INSYS   0.03231929
PRDIA   .
PAPUL   .
PVENT  -0.03138089
REPUL  -0.20962611

With the generic optimization routine, on the other hand, we cannot expect coefficients to be exactly zero:

opt_lasso(.2)
        FRCAR         INCAR         INSYS         PRDIA 
 0.4810999782  0.0002813658  1.9117847987 -0.3873926427 
        PAPUL         PVENT         REPUL 
-0.0863050787 -0.4144139379 -1.3849264055

 

Orthogonal covariates

Before getting to the mathematics, note that the "variable selection" process is very clear-cut when the covariates are orthogonal, which can again be obtained through a principal component analysis:


pca = princomp(X)
pca_X = get_pca_ind(pca)$coord

glm_lasso = glmnet(pca_X, y, alpha=1)
plot(glm_lasso, xvar="lambda", col=colrs)
plot(glm_lasso, col=colrs)

 

Standard lasso

If we go back to the original LASSO approach, the goal is to solve the following problem: minimize the negative log-likelihood plus λ times the ℓ1 norm of the slope coefficients (notice that the intercept is not penalized). The first-order condition then involves the subdifferential of the absolute value: we have to check whether 0 belongs to the subdifferential of the objective at the candidate solution.
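A sketch of the objective and of the standard subgradient optimality condition:

\[
\min_{\beta_0,\beta}\ -\log\mathcal{L}(\beta_0,\beta) + \lambda\sum_{j}|\beta_j|,
\qquad
0 \in \frac{\partial\big(-\log\mathcal{L}\big)}{\partial\beta_j} + \lambda\,\partial|\beta_j|,
\]

where \(\partial|\beta_j|=\{\mathrm{sign}(\beta_j)\}\) if \(\beta_j\neq 0\) and \(\partial|\beta_j|=[-1,1]\) if \(\beta_j=0\). In particular, \(\beta_j=0\) is optimal as soon as the corresponding score component is smaller than λ in absolute value.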

 

 

 

Looking at the left-hand side (the score component), we can distinguish the cases β_j>0, β_j<0 and β_j=0, rewrite the previous equation in each case, and then reduce everything to a set of rules that can be used to check a candidate solution.

We can also decompose β_j into its positive and negative parts, replacing β_j by β_j^+ − β_j^−, where β_j^+, β_j^− ≥ 0. The LASSO problem then becomes a smooth problem in (β_j^+, β_j^−) with positivity constraints.

Let α_j^+ and α_j^− denote the Lagrange multipliers associated with β_j^+ and β_j^− respectively.

To satisfy the stationarity conditions, we take the gradient of the Lagrangian with respect to β_j^+ and with respect to β_j^−, and set them to zero; the two resulting conditions are sketched below.
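Writing f for the loss part of the objective (the negative log-likelihood, or the squared loss in the linear case), the two stationarity conditions can be sketched as

\[
\nabla_j f + \lambda - \alpha_j^+ = 0,
\qquad
-\nabla_j f + \lambda - \alpha_j^- = 0,
\]

together with the complementary-slackness conditions \(\alpha_j^+\beta_j^+=0\) and \(\alpha_j^-\beta_j^-=0\).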

 

For convenience, a soft threshold function is introduced
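The usual definition, together with a minimal R sketch (the function name soft_thresholding is just an illustrative choice, not taken from the original code):

\[
S(z,\gamma) = \mathrm{sign}(z)\,(|z|-\gamma)_+ =
\begin{cases}
z-\gamma & \text{if } z>\gamma\\
0 & \text{if } |z|\le\gamma\\
z+\gamma & \text{if } z<-\gamma
\end{cases}
\]

soft_thresholding = function(x, gamma) sign(x) * pmax(abs(x) - gamma, 0)  # illustrative helper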

 

Notice that the optimization problem can then be treated coordinate by coordinate, and we observe that each coordinate has a closed-form solution obtained by soft-thresholding, as sketched below.
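A sketch of that coordinate-wise solution, for standardized covariates (the exact scaling of λ depends on the normalization convention used in the loss):

\[
\beta_j \leftarrow \frac{S\Big(\sum_{i} x_{ij}\big(y_i - \sum_{k\neq j} x_{ik}\beta_k\big),\ \lambda\Big)}{\sum_i x_{ij}^2}.
\]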

This is a coordinate update. Now, consider a (slightly) more general problem, with weights ω_i in the loss part.

The coordinate update becomes:
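Again as a sketch, with the same caveat about normalization:

\[
\beta_j \leftarrow \frac{S\Big(\sum_{i} \omega_i\, x_{ij}\big(z_i - \sum_{k\neq j} x_{ik}\beta_k\big),\ \lambda\Big)}{\sum_i \omega_i\, x_{ij}^2},
\]

where the z_i play the role of the response in the weighted problem.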

Back to our original question.

LASSO logistic regression

Here the logistic problem can be expressed as a quadratic programming problem. Recall the log-likelihood written earlier.

 

This is a concave function of the parameters, so we can use a quadratic approximation of the log-likelihood, based on a Taylor expansion around the current estimates. The approximation is a weighted least-squares criterion in a working response z_i, where π_i denotes the current predicted probability; both are sketched below.
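The standard iteratively-reweighted least-squares approximation, around current estimates \(\widetilde\beta_0,\widetilde\beta\):

\[
\log\mathcal{L}(\beta_0,\beta) \approx -\frac{1}{2}\sum_{i=1}^n \omega_i\,\big(z_i - \beta_0 - x_i^\top\beta\big)^2 + C,
\qquad
\omega_i = \widetilde\pi_i(1-\widetilde\pi_i),
\quad
z_i = \widetilde\beta_0 + x_i^\top\widetilde\beta + \frac{y_i - \widetilde\pi_i}{\widetilde\pi_i(1-\widetilde\pi_i)}.
\]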

In this way, we obtain a penalized weighted least-squares problem, which we can solve by coordinate descent as before.


beta0 = sum(y-X%*%beta)/(length(y))
beta0list[j+1] = beta0
betalist[[j+1]] = beta
obj[j] = (1/2)*(1/length(y))*norm(omega*(z - X%*%beta - 
beta0*rep(1,length(y))),'F')^2 + lambda*sum(abs(beta))

if (norm(rbind(beta0list[j],betalist[[j]]) - 
rbind(beta0,beta),'F') < tol) { break } 
} 
return(list(obj=obj[1:j],beta=beta,intercept=beta0)) }
Copy the code

The result looks like the output of a call to glmnet, and for some sufficiently large λ we do get components that are exactly zero.

Application on the second data set

Now consider a second dataset with two covariables. The code to get the LASSO estimate is

plot_l = function(lambda){
  m = apply(df0, 2, mean)
  s = apply(df0, 2, sd)
  for(j in 1:2) df0[,j] = (df0[,j]-m[j])/s[j]
  reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=1)
  p = function(xt,yt) predict(reg, newx=cbind(x1=xt, x2=yt), s=lambda, type="response")
  # u: grid of covariate values defined earlier; v: predicted probabilities on that grid
  v = outer(u, u, p)
  image(u, u, v, col=clr10, breaks=(0:10)/10)
  points(df$x1, df$x2, pch=c(1,3)[1+z], cex=1.5)
  contour(u, u, v, levels=.5, add=TRUE)
}

For small values of λ, we only get some shrinkage of the parameters.

plot(reg, xvar="lambda", col=c("blue","red"), lwd=2)
abline(v=exp(-2.8))
plot_l(exp(-2.8))

But with a larger λ, there is variable selection, with β_{1,λ}=0.

 

