Original link: http://tecdat.cn/?p=6851

A common situation in education or medicine is that we have a continuous measure. One example is BMI; another is a test where a score of 70 counts as passing. In such settings, researchers are sometimes interested in whether BMI exceeds 30, or whether a student passes or fails. The substantive question usually concerns the probability that someone falls above or below a clinically significant threshold. A common approach is therefore to dichotomize the continuous measurement and analyze the resulting binary variable with logistic regression.
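For illustration, dichotomizing in R is a one-liner (the BMI values below are made up for this example):

bmi <- c(22.4, 31.2, 27.8, 35.0, 29.6)   # hypothetical BMI values
obese <- as.integer(bmi > 30)            # 1 if BMI exceeds the clinical threshold of 30
obese
# [1] 0 1 0 1 0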

How does this approach behave in practice? Anyone who has dichotomized a continuous variable at different thresholds before running a logistic regression will know that the estimated coefficients change with the threshold.

We can explore this with simulation. First, the data-generating process:

dat <- data.frame(x = rnorm(250))  # the predictor; its distribution and n were not shown, so a standard normal with n = 250 is assumed here
dat$yc <- -0.5 + dat$x + rlogis(nrow(dat))  # continuous outcome with standard logistic errors
hist(dat$yc, main = "")

Then we can dichotomize the outcome yc at different cut points and see whether this affects the estimated coefficient of x in a logistic regression:

coef(glm((yc > -2) ~ x, binomial, dat))["x"]  # cut at the extreme -2
#         x
# 0.9619012
coef(glm((yc > 0) ~ x, binomial, dat))["x"]   # cut at the midpoint 0
#        x
# 1.002632
coef(glm((yc > 2) ~ x, binomial, dat))["x"]   # cut at the extreme 2
#         x
# 0.8382662

What if we instead apply linear regression directly to yc?

# First, we write a function to extract the coefficient and then
# rescale it to the log-odds scale. The original function body was
# garbled; this is a plausible reconstruction of the rescaling step.
trans <- function(fit, scale = pi / sqrt(3)) {
  coef(fit)["x"] / sigma(fit) * scale  # OLS slope / residual SD, times pi/sqrt(3)
}
trans(lm(yc ~ x, dat))
#        x
# 1.157362
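Why this rescaling works: logistic regression can be viewed as thresholding a latent continuous variable whose errors follow a standard logistic distribution, and that distribution has standard deviation pi/sqrt(3) ≈ 1.814. Dividing the OLS slope by the residual standard deviation and multiplying by pi/sqrt(3) therefore expresses it on the log-odds scale:

beta_logit ≈ (pi / sqrt(3)) * beta_OLS / sigma_residual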

All these figures are not far from one another. We can now repeat the process many times and compare the patterns in the results. I use 2500 replications:

# v = very; l/m/h = low/medium/high; t = threshold; ols = the transformed linear regression.
# The data are regenerated inside each replication; the middle and high cut points
# (0, 1, 2) are inferred from the -2/-1 pattern, as they were garbled in the source.
colMeans(res <- t(replicate(2500, {
  dat <- data.frame(x = rnorm(250))
  dat$yc <- -0.5 + dat$x + rlogis(nrow(dat))
  c(vlt = coef(glm((yc > -2) ~ x, binomial, dat))["x"],
    lt  = coef(glm((yc > -1) ~ x, binomial, dat))["x"],
    mt  = coef(glm((yc >  0) ~ x, binomial, dat))["x"],
    ht  = coef(glm((yc >  1) ~ x, binomial, dat))["x"],
    vht = coef(glm((yc >  2) ~ x, binomial, dat))["x"],
    ols = trans(lm(yc ~ x, dat)))
})))
#     vlt.x      lt.x      mt.x      ht.x     vht.x     ols.x
# 1.0252116 1.0020822 1.0049156 1.0101613 1.0267511 0.9983772

These are the average estimated coefficients for each method.

boxplot(res)
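To quantify the spread seen in the boxplot (this step is not in the original analysis), we can also compare the column standard deviations:

apply(res, 2, sd)  # SD of each method's estimates across the 2500 replications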

We see that although the means are all roughly the same, the estimated coefficients vary more when the threshold is extreme. The least variable estimate is the transformed linear regression coefficient, so the linear regression approach gives the most stable results.

How are the estimated coefficients from the different methods related to each other?

library(GGally)  # for ggpairs
ggpairs(as.data.frame(res))

We see that the estimated coefficient when the threshold is very low correlates only weakly (.13) with the estimate when the threshold is very high. These differences reflect nothing but the choice of threshold, and in a real data analysis they could be misleading.
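The exact correlations can also be read off numerically (the column names follow the reconstruction above):

round(cor(res), 2)            # full correlation matrix of the six estimators
cor(res)["vlt.x", "vht.x"]    # the ~.13 correlation between the extreme thresholds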


These results also suggest a related point: the relationship between predictor and outcome may genuinely differ across quantiles of the outcome. Quantile regression can be used to check whether this is the case for the original continuous outcome, as sketched below.
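A minimal sketch of that check, using the quantreg package (the choice of quantiles here is arbitrary and not part of the original post):

library(quantreg)
# fit the same linear model at several quantiles of the continuous outcome yc
fit_q <- rq(yc ~ x, tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = dat)
summary(fit_q)  # roughly equal slopes across tau would indicate a stable relationship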

Thank you so much for reading this article, and if you have any questions please leave a comment below!