Original link: tecdat.cn/?p=6358

Original source: Tuoduan Data Tribe (拓端数据部落) WeChat public account

 

Multiple imputation has become a common method for dealing with missing data. Suppose we want to use multiple imputation to fill in missing values of a covariate X. The natural next question is: should the outcome Y be included as a covariate in the imputation model for X?

 

Stata

To illustrate these concepts, we simulate a small data set in Stata, with no missing data initially:

    clear
    set obs 100
    gen x = rnormal()
    gen y = x + 0.25*rnormal()
    twoway (scatter y x) (lfit y x)

 

Scatter plot of Y versus X before any data is missing

Next, we set 50 of X’s 100 observations to be missing:

    gen xmiss = (_n <= 50)
    replace x = . if xmiss == 1

The imputation model

In this article we have two variables, Y and X, and the analysis model is some type of regression of Y on X (that is, Y is the dependent variable and X is a covariate). We want to generate imputations that give us valid estimates of the parameters of the Y | X model.

Imputing X while ignoring Y

Suppose we use a regression model to impute X, but do not include Y as a covariate in the imputation model. We can easily do this in Stata, generating a single imputation for each missing value, and then plotting Y against X, using the imputed value of X where it was missing and the observed value where it was observed:

    mi set wide
    mi register imputed x
    mi impute regress x, add(1)

 

Y versus X, where the missing X values were imputed ignoring Y

This clearly shows the problem with ignoring Y when imputing missing values of X: among the observations where X was imputed, there is no association between Y and X, when in fact there should be one.
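
As a rough numerical check on the plot, the following self-contained Stata sketch repeats the simulation and compares the correlation between Y and X among the observed and the imputed values. The seed values are arbitrary assumptions made here for reproducibility, and the variable name _1_x relies on the data being declared with mi set wide, which stores the first imputation under that name.

```stata
* Sketch only: the seed values are arbitrary assumptions, not from the original example
clear
set seed 1234
set obs 100
gen x = rnormal()
gen y = x + 0.25*rnormal()

* Make X missing for the first 50 observations
gen xmiss = (_n <= 50)
replace x = . if xmiss == 1

* Impute X once, ignoring Y (wide style stores the imputation in _1_x)
mi set wide
mi register imputed x
mi impute regress x, add(1) rseed(5678)

* Correlation between Y and X where X was observed, then where X was imputed
correlate y _1_x if xmiss == 0
correlate y _1_x if xmiss == 1
```

Under these assumptions the second correlation should be close to zero, up to sampling variability, because the imputed values were drawn without any reference to Y.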

 

Taking the outcome into account

Suppose instead that we take the outcome Y into account, including it as a covariate in the imputation model for X. The X | Y imputation model is then fitted using the individuals for whom X is observed. Since we assume that X is missing at random given Y, this complete-case fit is valid. Therefore, if in truth there is no association between X and Y, we should (in expectation) find none in the complete-case fit, so including Y does not create a spurious association; if there is an association, the imputed values will reflect it.

Continuing with our simulated data set, we first discard the previously generated imputation and then re-impute X, this time including Y as a covariate in the imputation model:

    mi set M = 0
    mi impute regress x = y, add(1)

Y versus X, where Y is used to impute the missing X values
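
A plot of this kind could be produced along the following lines. This is a minimal sketch that rebuilds the simulated data from scratch so that it runs on its own; the seed values and the legend labels are assumptions made for this illustration.

```stata
* Sketch only: the seed values and legend labels are assumptions
clear
set seed 1234
set obs 100
gen x = rnormal()
gen y = x + 0.25*rnormal()
gen xmiss = (_n <= 50)
replace x = . if xmiss == 1

* Impute X once, this time with Y as a covariate in the imputation model
mi set wide
mi register imputed x
mi impute regress x = y, add(1) rseed(5678)

* Plot Y against X, with separate markers for observed and imputed X values
twoway (scatter y _1_x if xmiss == 0) (scatter y _1_x if xmiss == 1), ///
    legend(order(1 "X observed" 2 "X imputed"))
```

If the imputation model is doing its job, the imputed points should follow roughly the same trend as the observed ones.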

 

Variable selection in multiple imputation

The general rule for selecting variables for the imputation model is that every variable involved in the analysis model must be included, either as a variable being imputed or as a covariate in the imputation model.
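
Applied to the two-variable example, the rule might look like the sketch below: the analysis model regresses Y on X, so Y appears as a covariate when X is imputed. The choice of 20 imputations and the seed values are assumptions made for this illustration; the example above used a single imputation only to make the plots easy to read.

```stata
* Sketch only: 20 imputations and the seed values are arbitrary assumptions
clear
set seed 1234
set obs 100
gen x = rnormal()
gen y = x + 0.25*rnormal()
replace x = . in 1/50

mi set wide
mi register imputed x

* Y is part of the analysis model, so it also enters the imputation model for X
mi impute regress x = y, add(20) rseed(5678)

* Fit the analysis model in each imputed data set and pool the estimates
mi estimate: regress y x
```

Here mi estimate combines the per-imputation regression results into a single set of estimates and standard errors.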