Original article: tecdat.cn/?p=21444 | Source: the Tuoduan data tribe official account

 

Logistic regression is a commonly used method in research: it can be used for screening influencing factors, predicting probabilities, classification, and so on. In medical research, for example, data obtained with high-throughput sequencing technology make high-dimensional variable selection challenging. Penalized logistic regression can perform variable selection and coefficient estimation for high-dimensional data, and efficient algorithms make the computation feasible. This article introduces penalized logistic regression methods such as LASSO and ridge regression.

Methods

As we have seen before, the classical technique for estimating the parameters of a parametric model is maximum likelihood: we choose the parameter values that maximize the (log-)likelihood of the sample.
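For the logistic model used throughout this article, this can be sketched as follows (standard notation, stated here for reference):

\[
\widehat{\beta} = \operatorname*{argmax}_{\beta}\ \log\mathcal{L}(\beta)
= \operatorname*{argmax}_{\beta}\ \sum_{i=1}^n \Big[ y_i\log\pi_i + (1-y_i)\log(1-\pi_i) \Big],
\qquad
\pi_i = \frac{e^{x_i^\top\beta}}{1+e^{x_i^\top\beta}}.
\]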

The objective function here is only concerned with goodness of fit. But in econometrics we generally believe that simple theories are preferable to more complex ones, so we want to penalize overly complex models.

That’s a good idea, and it is often mentioned in econometrics textbooks, but the choice of model is usually not part of the inference itself: typically, we estimate the parameters by maximum likelihood and then use the AIC or the BIC to compare two models. The Akaike criterion (AIC) is based on a penalized likelihood.
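In its standard (textbook) form, with \(k\) the number of estimated parameters,

\[
AIC = -2\log\mathcal{L}(\widehat{\theta}) + 2k.
\]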

We have a measure of goodness of fit on the left, and on the right, the penalty increases with the “complexity” of the model.

Here, complexity is the number of variables used. But suppose we do not perform variable selection and instead consider the regression on all the covariates. Define the ℓ0 "norm" of the coefficient vector as its number of non-zero components, \(\|\beta\|_{\ell_0}=\sum_j \mathbf{1}(\beta_j\neq 0)\).

With this notation, the AIC objective can be written as a penalized log-likelihood, \(-2\log\mathcal{L}(\beta)+2\|\beta\|_{\ell_0}\).

In fact, this is our target function: a goodness-of-fit term plus a penalty on the size of the coefficients. More specifically, we will consider objectives of exactly that form.

In this article, I want to discuss numerical algorithms for solving such optimization problems, for the ℓ2 penalty (ridge regression) and the ℓ1 penalty (LASSO regression).
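Concretely, and consistently with the code used below (which adds lambda*sum(beta^2) or lambda*sum(abs(beta)) to the negative log-likelihood), the two problems can be sketched as

\[
\widehat{\beta}^{\text{ridge}}_{\lambda} = \operatorname*{argmin}_{\beta}\Big\{-\log\mathcal{L}(\beta)+\lambda\|\beta\|_{2}^{2}\Big\},
\qquad
\widehat{\beta}^{\text{lasso}}_{\lambda} = \operatorname*{argmin}_{\beta}\Big\{-\log\mathcal{L}(\beta)+\lambda\|\beta\|_{1}\Big\}.
\]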

Standardization of covariables

Here we use data from patients who had an infarction and were observed in the emergency room, and we want to see who survived, in order to obtain a predictive model. The first step is to standardize the variables (to unit variance) by considering linear transformations of the covariates x_j:


for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])

 

Ridge regression

Before running some code, recall that we want to solve the penalized problem introduced above: minimize the negative log-likelihood plus λ times the squared ℓ2 norm of the slope coefficients.

If we were considering the log-likelihood of Gaussian variables, we would obtain a sum of squared residuals and an explicit solution. This is not the case for logistic regression.

The heuristics behind ridge regression are illustrated in the figure below. In the background we can visualize the (two-dimensional) log-likelihood of the logistic regression, and if we rewrite the optimization problem as a constrained optimization problem, the blue circle is our constraint:

 

The problem can be written equivalently in constrained form (it is a strictly convex problem).
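A sketch of that equivalent constrained formulation (for some radius s, in one-to-one correspondence with the penalty level λ):

\[
\max_{\beta_0,\beta}\ \log\mathcal{L}(\beta_0,\beta)
\quad\text{subject to}\quad \|\beta\|_{2}\le s.
\]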

Therefore, the constrained maximum should lie on the blue disk (on its boundary when the constraint is binding).

LogLik = function(bbeta){
  b0 = bbeta[1]
  beta = bbeta[-1]
  sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta)))
}
# X is assumed here to contain only two covariates, so that the log-likelihood surface
# can be drawn in two dimensions (and is displayed before adding the constraint circle)
u = seq(-4, 4, length=251)
v = outer(u, u, Vectorize(function(x,y) LogLik(c(1,x,y))))
lines(u,  sqrt(1-u^2), type="l", lwd=2, col="blue")
lines(u, -sqrt(1-u^2), type="l", lwd=2, col="blue")

 

 

Let’s now consider the objective function, with the following code (the penalized negative log-likelihood, written as a function of the parameters and of λ):


PennegLogLik = function(bbeta, lambda=0){
  b0 = bbeta[1]; beta = bbeta[-1]
  -sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta))) + lambda*sum(beta^2)
}

Why not try a standard optimizer? We mentioned that using optimization routines is not wise because they rely so heavily on starting points.

beta_init = lm(y~., data)$coefficients
vpar = matrix(NA, 1000, 8)
for(i in 1:1000){
  vpar[i,] = optim(par = beta_init*rnorm(8,1,2), function(x) PennegLogLik(x,lambda),
                   method = "BFGS", control = list(abstol=1e-9))$par
}
par(mfrow=c(1,2))
plot(density(vpar[,2]))

 

Obviously, even if we change the starting point, it looks like we’re converging toward the same value. This can be considered optimal.

The following code is then used to compute β_λ:


beta_init = lm(y~., data)$coefficients
logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x,lambda),
                     method = "BFGS", control=list(abstol=1e-9))

We can visualize the evolution of βλ as a function of λ

v_lambda = exp(seq(-2, 5, length=61))
# est_ridge: matrix of estimates obtained by running the optimization for each value of v_lambda
plot(v_lambda, est_ridge[1,], col=colrs[1])
for(i in 2:7) lines(v_lambda, est_ridge[i,], col=colrs[i])

This seems to make sense: we can observe the contraction as λ increases.

Ridge regression using the Newton-Raphson algorithm

We have seen that we can also solve this problem with the Newton-Raphson algorithm, just as in the case without a penalty term; only the update step changes. The iteration is sketched below, followed by the code.
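The update computed in the code below can be summarized as

\[
\beta^{(k+1)} = \big(X^\top\Delta X + 2\lambda \mathbb{I}\big)^{-1} X^\top\Delta z,
\qquad
\Delta = \mathrm{diag}\big(\pi_i(1-\pi_i)\big),\quad
\pi_i = \frac{e^{x_i^\top\beta^{(k)}}}{1+e^{x_i^\top\beta^{(k)}}},\quad
z = X\beta^{(k)} + \Delta^{-1}(y-\pi).
\]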

for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
# beta holds the successive iterates (initialized beforehand, e.g. with unpenalized estimates);
# Delta is a diagonal weight matrix and lambda a given penalty level
for(s in 1:9){
  pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
  diag(Delta) = pi*(1-pi)
  z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
  B = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*% (t(X)%*%Delta%*%z)
  beta = cbind(beta,B)
}
beta[,8:10]
              [,1]        [,2]        [,3]
XInter  0.59619654  0.59619654  0.59619654
XFRCAR  0.09217848  0.09217848  0.09217848
XINCAR  0.77165707  0.77165707  0.77165707
XINSYS  0.69678521  0.69678521  0.69678521
XPRDIA -0.29575642 -0.29575642 -0.29575642
XPAPUL -0.23921101 -0.23921101 -0.23921101
XPVENT -0.33120792 -0.33120792 -0.33120792
XREPUL -0.84308972 -0.84308972 -0.84308972

Again, it seems to converge very quickly.

Interestingly, with this algorithm, we can also get the variance of the estimator
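The quantity Varb computed in the code below corresponds to the sandwich-type expression

\[
\mathrm{Var}[\widehat\beta] = \big(X^\top\Delta X + 2\lambda\mathbb{I}\big)^{-1} X^\top\Delta\,\mathrm{Var}[z]\,\Delta X \big(X^\top\Delta X + 2\lambda\mathbb{I}\big)^{-1},
\qquad \mathrm{Var}[z]=\Delta^{-1}.
\]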

The following code then computes β_λ as a function of λ, together with its variance:

for(s in 1:20){
  pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
  diag(Delta) = (pi*(1-pi))
  z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
  B = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*% (t(X)%*%Delta%*%z)
  beta = cbind(beta,B)
}
Varz = solve(Delta)
Varb = solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X))) %*% t(X)%*%Delta %*% Varz %*%
  Delta %*% X %*% solve(t(X)%*%Delta%*%X + 2*lambda*diag(ncol(X)))

We can visualize the evolution of βλ (as a function of λ)


plot(v_lambda,est_ridge[1,],col=colrs[1],type="l")
for(i in 2:7) lines(v_lambda,est_ridge[i,],col=colrs[i])

 

And we obtain the evolution of the variance:

 

Recall that when λ=0 (on the left of the graph), β_λ coincides with the unpenalized maximum-likelihood estimate. Thus, as λ increases, (i) the bias increases (the estimates shrink towards 0) and (ii) the variance decreases.

Ridge regression with glmnet

As usual, R functions are available for ridge regression. Let's use the glmnet function with alpha=0.


for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
glm_ridge = glmnet(X, y, alpha=0)
plot(glm_ridge, xvar="lambda")

 

The same coefficient paths can also be displayed as a function of the ℓ1 norm of the coefficients.

Ridge regression with orthogonal covariates

An interesting case is obtained when the covariates are orthogonal. This can be achieved through a principal component analysis of the covariates.


library(factoextra)
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord

Let’s do ridge regression for these (orthogonal) covariables


glm_ridge = glmnet(pca_X, y, alpha=0)
plot(glm_ridge,xvar="lambda",col=colrs,lwd=2)

 

We clearly observe the shrinkage of the parameters towards 0 as λ grows.
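As a point of intuition (a standard result for the linear least-squares analogue, stated under the assumption of orthonormal covariates, and up to how the penalty is parametrized), ridge then simply rescales each unpenalized coefficient:

\[
\widehat\beta_j^{(\lambda)} = \frac{\widehat\beta_j^{(0)}}{1+\lambda},
\]

which is the kind of uniform shrinkage towards 0 visible in the path plot above.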

Application

Let’s try a second set of data

We can try various values of λ

reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=0)
plot(reg, xvar="lambda", col=c("blue","red"), lwd=2)
abline(v=log(.2))



or

# plot_lambda(): plotting helper for a given value of lambda (its definition is not shown here)
abline(v=log(1.2))
plot_lambda(1.2)

The next step is to change the penalty criterion to the ℓ1 norm.

Standardization of covariables

As mentioned earlier, the first step is to standardize the variables (to unit variance) by considering linear transformations of the covariates x_j.


for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)

LASSO regression

The heuristics behind LASSO regression are illustrated in the figure below. In the background we can visualize the (two-dimensional) log-likelihood of the logistic regression; if we reconsider the optimization problem as a constrained optimization problem, the blue square is our constraint:

LogLik = function(bbeta){
  b0 = bbeta[1]
  beta = bbeta[-1]
  sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta)))
}
# u, v: grid and log-likelihood surface computed as in the ridge section,
# with X again restricted to two covariates
contour(u, u, v, add=TRUE)
polygon(c(-1,0,1,0), c(0,1,0,-1), border="blue")

The advantage here is that it can be used as a variable selection tool.

Heuristically, the mathematics can be explained as follows. Consider a simple regression y_i = x_i β + ε, with an ℓ1 penalty and an ℓ2 loss function. The optimization problem, its first-order condition and its solution are sketched below.
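A standard derivation sketch (one covariate, no intercept):

\[
\min_{\beta}\ \frac{1}{2}\sum_{i=1}^n (y_i - x_i\beta)^2 + \lambda|\beta|,
\qquad
-\sum_{i=1}^n x_i(y_i-x_i\beta) + \lambda\,\partial|\beta| \ni 0,
\]

so that

\[
\widehat\beta_\lambda
= \frac{\mathrm{sign}\!\Big(\sum_i x_i y_i\Big)\Big(\big|\sum_i x_i y_i\big|-\lambda\Big)_+}{\sum_i x_i^2},
\]

i.e. a soft-thresholded version of the unpenalized estimate.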

 

The optimization process

Let’s start with standard (R) optimization routines, such as BFGS


# PennegLogLik is assumed here to be the l1-penalized objective, i.e. the negative
# log-likelihood plus lambda*sum(abs(beta))
logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x,lambda),
                     hessian=TRUE, method = "BFGS", control=list(abstol=1e-9))

# est_lasso: matrix of estimates obtained by running the optimization over the grid v_lambda
plot(v_lambda, est_lasso[1,], col=colrs[1], type="l")
for(i in 2:7) lines(v_lambda, est_lasso[i,], col=colrs[i], lwd=2)

 

The result is erratic.

Using glmnet

For comparison, using an R program dedicated to LASSO, we get the following


glm_lasso = glmnet(X, y, alpha=1)
plot(glm_lasso, xvar="lambda", col=colrs, lwd=2)

If we look closely at the output, we can see that there is variable selection, in the sense that some components satisfy β_{j,λ}=0: they are "exactly zero".

glmnet(X, y, alpha=1, lambda=exp(-4))$beta
7x1 sparse Matrix of class "dgCMatrix"
               S0
FRCAR   .
INCAR   0.11005070
INSYS   0.03231929
PRDIA   .
PAPUL   .
PVENT  -0.03138089
REPUL  -0.20962611

With the generic optimization routine, on the other hand, we cannot expect coefficients to be exactly zero:

opt_lasso(.2)
        FRCAR         INCAR         INSYS         PRDIA 
 0.4810999782  0.0002813658  1.9117847987 -0.3873926427 
        PAPUL         PVENT         REPUL 
-0.0863050787 -0.4144139379 -1.3849264055

 

Orthogonal covariates

Before getting to the mathematics, note that the "variable selection" process is very clear-cut when the covariates are orthogonal, which can again be obtained through a principal component analysis:


pca = princomp(X)
pca_X = get_pca_ind(pca)$coord

glm_lasso = glmnet(pca_X, y, alpha=1)
plot(glm_lasso, xvar="lambda", col=colrs)
plot(glm_lasso, col=colrs)

 

Standard lasso

If we go back to the original LASSO approach, the goal is to solve the following problem: minimize the negative log-likelihood plus λ times the ℓ1 norm of the slope coefficients (notice that the intercept is not penalized). The first-order condition then involves the subdifferential of the absolute value: we have to check whether 0 belongs to the subdifferential of the objective at the candidate solution.
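A sketch of the objective and of the standard subgradient optimality condition:

\[
\min_{\beta_0,\beta}\ -\log\mathcal{L}(\beta_0,\beta) + \lambda\sum_{j}|\beta_j|,
\qquad
0 \in \frac{\partial\big(-\log\mathcal{L}\big)}{\partial\beta_j} + \lambda\,\partial|\beta_j|,
\]

where \(\partial|\beta_j|=\{\mathrm{sign}(\beta_j)\}\) if \(\beta_j\neq 0\) and \(\partial|\beta_j|=[-1,1]\) if \(\beta_j=0\). In particular, \(\beta_j=0\) is optimal as soon as the corresponding score component is smaller than λ in absolute value.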

 

 

 

Looking at the left-hand side (the score component), we can distinguish the cases β_j>0, β_j<0 and β_j=0, rewrite the previous equation in each case, and then reduce everything to a set of rules that can be used to check a candidate solution.

We can also decompose β_j into its positive and negative parts, replacing β_j by β_j^+ − β_j^−, where β_j^+, β_j^− ≥ 0. The LASSO problem then becomes a smooth problem in (β_j^+, β_j^−) with positivity constraints.

Let α_j^+ and α_j^− denote the Lagrange multipliers associated with β_j^+ and β_j^− respectively.

To satisfy the stationarity conditions, we take the gradient of the Lagrangian with respect to β_j^+ and with respect to β_j^−, and set them to zero; the two resulting conditions are sketched below.
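Writing f for the loss part of the objective (the negative log-likelihood, or the squared loss in the linear case), the two stationarity conditions can be sketched as

\[
\nabla_j f + \lambda - \alpha_j^+ = 0,
\qquad
-\nabla_j f + \lambda - \alpha_j^- = 0,
\]

together with the complementary-slackness conditions \(\alpha_j^+\beta_j^+=0\) and \(\alpha_j^-\beta_j^-=0\).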

 

For convenience, a soft threshold function is introduced
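The usual definition, together with a minimal R sketch (the function name soft_thresholding is just an illustrative choice, not taken from the original code):

\[
S(z,\gamma) = \mathrm{sign}(z)\,(|z|-\gamma)_+ =
\begin{cases}
z-\gamma & \text{if } z>\gamma\\
0 & \text{if } |z|\le\gamma\\
z+\gamma & \text{if } z<-\gamma
\end{cases}
\]

soft_thresholding = function(x, gamma) sign(x) * pmax(abs(x) - gamma, 0)  # illustrative helper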

 

Notice that the optimization problem can then be treated coordinate by coordinate, and we observe that each coordinate has a closed-form solution obtained by soft-thresholding, as sketched below.
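A sketch of that coordinate-wise solution, for standardized covariates (the exact scaling of λ depends on the normalization convention used in the loss):

\[
\beta_j \leftarrow \frac{S\Big(\sum_{i} x_{ij}\big(y_i - \sum_{k\neq j} x_{ik}\beta_k\big),\ \lambda\Big)}{\sum_i x_{ij}^2}.
\]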

This is a coordinate update. Now, consider a (slightly) more general problem, with weights ω_i in the loss part.

The coordinate update becomes:
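Again as a sketch, with the same caveat about normalization:

\[
\beta_j \leftarrow \frac{S\Big(\sum_{i} \omega_i\, x_{ij}\big(z_i - \sum_{k\neq j} x_{ik}\beta_k\big),\ \lambda\Big)}{\sum_i \omega_i\, x_{ij}^2},
\]

where the z_i play the role of the response in the weighted problem.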

Back to our original question.

LASSO logistic regression

Here the logistic problem can be expressed as a quadratic programming problem. Recall the log-likelihood written earlier.

 

This is a concave function of the parameters, so we can use a quadratic approximation of the log-likelihood, based on a Taylor expansion around the current estimates. The approximation is a weighted least-squares criterion in a working response z_i, where π_i denotes the current predicted probability; both are sketched below.
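The standard iteratively-reweighted least-squares approximation, around current estimates \(\widetilde\beta_0,\widetilde\beta\):

\[
\log\mathcal{L}(\beta_0,\beta) \approx -\frac{1}{2}\sum_{i=1}^n \omega_i\,\big(z_i - \beta_0 - x_i^\top\beta\big)^2 + C,
\qquad
\omega_i = \widetilde\pi_i(1-\widetilde\pi_i),
\quad
z_i = \widetilde\beta_0 + x_i^\top\widetilde\beta + \frac{y_i - \widetilde\pi_i}{\widetilde\pi_i(1-\widetilde\pi_i)}.
\]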

In this way, we obtain a penalized weighted least-squares problem, which we can solve by coordinate descent as before.


beta0 = sum(y-X%*%beta)/(length(y))
beta0list[j+1] = beta0
betalist[[j+1]] = beta
obj[j] = (1/2)*(1/length(y))*norm(omega*(z - X%*%beta - 
beta0*rep(1,length(y))),'F')^2 + lambda*sum(abs(beta))

if (norm(rbind(beta0list[j],betalist[[j]]) - 
rbind(beta0,beta),'F') < tol) { break } 
} 
return(list(obj=obj[1:j],beta=beta,intercept=beta0)) }
Copy the code

The result looks like the output of a call to glmnet, and for some sufficiently large λ we do get components that are exactly zero.

Application on the second data set

Now consider a second dataset with two covariables. The code to get the LASSO estimate is

plot_l = function(lambda){
  m = apply(df0, 2, mean)
  s = apply(df0, 2, sd)
  for(j in 1:2) df0[,j] = (df0[,j]-m[j])/s[j]
  reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=1)
  p = function(xt,yt) predict(reg, newx=cbind(x1=xt, x2=yt), s=lambda, type="response")
  # u: grid of covariate values defined earlier; v: predicted probabilities on that grid
  v = outer(u, u, p)
  image(u, u, v, col=clr10, breaks=(0:10)/10)
  points(df$x1, df$x2, pch=c(1,3)[1+z], cex=1.5)
  contour(u, u, v, levels=.5, add=TRUE)
}

For small values of λ, we only get some shrinkage of the parameters.

plot(reg, xvar="lambda", col=c("blue","red"), lwd=2)
abline(v=exp(-2.8))
plot_l(exp(-2.8))

But with a larger λ, there is variable selection, with β_{1,λ}=0.

 

