Original link:tecdat.cn/?p=8287

Original source: the Tuoduan Data Tribe official account

 

 

Introduction

Missing values are considered a primary obstacle in predictive modeling, so it is important to master methods for overcoming them.

The choice of method for imputing missing values greatly affects a model's predictive ability. In most statistical analysis software, listwise deletion is the default way of handling missing values. However, it is not ideal, because it leads to information loss.

 

In this article, I’ve listed five approaches to missing value imputation in R.

 

 

Multivariate imputation by chained equations (MICE)

Multivariate imputation by chained equations is commonly used by R users. Compared with a single imputation (such as the mean), creating multiple imputations accounts for the uncertainty in the missing values.

MICE assumes that data are missing at random (MAR), which means that the probability that a value is missing depends only on the observed values and can therefore be predicted from them. It imputes the data variable by variable, by specifying an imputation model for each variable.

For example, suppose we have variables X1, X2, …, Xk. If X1 has missing values, it is regressed on the other variables X2 through Xk, and its missing values are replaced with the resulting predictions. Similarly, if X2 has missing values, then X1, X3, …, Xk are used as the independent variables in the prediction model, and the missing values are again replaced with the predicted values.

By default, linear regression is used to predict continuous missing values and logistic regression to predict categorical missing values. Once this cycle is complete, multiple datasets are generated; these datasets differ only in their imputed values. It is generally considered good practice to model these datasets separately and then combine their results.

To be precise, the package uses the following methods (see the sketch after this list for how to specify them per variable):

  1. pmm (predictive mean matching) – for numeric variables
  2. logreg (logistic regression) – for binary variables (2 levels)
  3. polyreg (Bayesian polytomous regression) – for unordered factor variables (>= 2 levels)
  4. polr (proportional odds model) – for ordered factor variables (>= 2 levels)
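
Here is a minimal sketch of specifying these methods per variable with mice(); the data frame df and its columns (smoker, region, grade) are hypothetical, and make.method() fills in sensible defaults for the remaining columns:

> library(mice)
> # 'df' is a hypothetical data frame with mixed variable types
> meth <- make.method(df)       # default method for each column
> meth["smoker"] <- "logreg"    # binary factor (2 levels)
> meth["region"] <- "polyreg"   # unordered factor (>= 2 levels)
> meth["grade"]  <- "polr"      # ordered factor (>= 2 levels)
> imp <- mice(df, method = meth, m = 5, seed = 1)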

Now let’s actually look at it.

> path <- "../Data/Tutorial"
> setwd(path)
# load the data
> data <- iris
> summary(iris)
# generate 10% missing values at random (prodNA is from the missForest package)
> iris.mis <- prodNA(iris, noNA = 0.1)
# check the missing values introduced into the data
> summary(iris.mis)

 

I removed the categorical variable to focus on continuous values here. To work with categorical variables, simply encode their levels and follow the same steps.

> iris.mis <- subset(iris.mis, select = -c(Species))
> summary(iris.mis)

The md.pattern() function returns a tabular summary of the missing values present in each variable of the dataset.

> md.pattern(iris.mis)

 

Let’s read this table. There are 98 complete observations with no missing values. Sepal.Length has 10 missing observations; similarly, Sepal.Width has 13 missing values, and so on.

We can also create a visualization of the missing values.

 

> library(VIM)   # aggr() comes from the VIM package
> mice_plot <- aggr(iris.mis, col=c('navyblue','yellow'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(iris.mis), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

 

Let’s interpret this. 67% of the observations in the dataset are complete, with no missing values; 10% of values are missing in Petal.Length, 8% in Petal.Width, and so on. You can also look at the histogram, which clearly shows the share of missing values in each variable.

Now, let’s estimate the missing values.
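
The imputed_Data object summarized below comes from a call to mice(). Here is a minimal sketch of that call; m = 5 and seed = 500 match the output below, while maxit = 50 is an assumption:

> library(mice)
> # impute the missing values; pmm = predictive mean matching
> imputed_Data <- mice(iris.mis, m = 5, maxit = 50, method = "pmm", seed = 500)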

 
> summary(imputed_Data)

Multiply imputed data set
Call:
 Number of multiple imputations: 5
Missing cells per column:
Sepal.Length Sepal.Width Petal.Length Petal.Width 
13            14          16           15 
Imputation methods:
Sepal.Length Sepal.Width Petal.Length Petal.Width 
"pmm"        "pmm"        "pmm"       "pmm" 
VisitSequence:
Sepal.Length Sepal.Width Petal.Length Petal.Width 
1              2            3           4 
PredictorMatrix:
              Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        0          1            1            1
Sepal.Width         1          0            1            1
Petal.Length        1          1            0            1
Petal.Width         1          1            1            0
Random generator seed value: 500

Here is a description of the parameters used:

  1. m – the number of imputed datasets
  2. maxit – the number of iterations used to impute the missing values
  3. method – the imputation method; here we used predictive mean matching (pmm)

Since there are five imputed datasets, you can use the complete() function to extract any one of them.

You can also fit a model to each dataset and merge the results with the pool() command to obtain pooled output, as sketched below.
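
A hedged sketch of both steps; the regression formula here is illustrative only:

> # extract, say, the second of the five completed datasets
> completeData <- complete(imputed_Data, 2)
> # fit a model on each imputed dataset, then pool the results
> fit <- with(imputed_Data, lm(Sepal.Width ~ Sepal.Length + Petal.Width))
> combine <- pool(fit)
> summary(combine)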

Please note that I used these commands for demonstration purposes only. You can replace the variables and try it yourself.

 

Multiple imputation (Amelia)

 

This package also performs multiple imputation (generating multiple imputed datasets) to deal with missing values. Multiple imputation helps reduce bias and improve efficiency. It uses a bootstrap-based EMB (expectation-maximization with bootstrapping) algorithm, which makes it possible to impute many variables, including cross-sectional and time-series data, quickly and reliably. It can also use the parallel imputation capability of multicore CPUs.

It makes the following assumptions:

  1. All variables in the dataset have a multivariate normal distribution (MVN); the data are summarized using means and covariances.
  2. The data are missing at random (MAR).

 

Therefore, this works best when the data have an approximately multivariate normal distribution. If not, transformations are performed to bring the data closer to normality.

Now let’s actually look at it.

The only thing you need to take care of is specifying the categorical variables, as shown below.
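
The amelia_fit object used below would come from a call like the following; a sketch in which m = 5 is conventional and noms names the categorical column (iris.mis is assumed to still contain Species here):

> library(Amelia)
> # specify categorical variables with 'noms'; generate m = 5 imputed datasets
> amelia_fit <- amelia(iris.mis, m = 5, parallel = "multicore", noms = "Species")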

> amelia_fit$imputations[[1]]
> amelia_fit$imputations[[2]]
> amelia_fit$imputations[[3]]
> amelia_fit$imputations[[4]]
> amelia_fit$imputations[[5]]

To check a specific column of an imputed dataset, or to export all imputed datasets to files, use the following commands:

> amelia_fit$imputations[[5]]$Sepal.Length
> write.amelia(amelia_fit, file.stem = "imputed_data_set")

 

Random forests

As the name implies, missForest is an implementation of the random forest algorithm. It is suited to nonparametric imputation of mixed variable types. So, what is a nonparametric method?

Nonparametric methods make no explicit assumptions about the functional form of f. Instead, they try to estimate f so that it is as close to the data points as possible without being impractical.

How does it work? In short, it builds a random forest model for each variable, then uses that model to predict the missing values in the variable with the help of the observed values.

It produces an OOB (out-of-bag) imputation error estimate. Furthermore, it provides a high level of control over the imputation process: there is an option to return the OOB error separately for each variable rather than aggregated over the entire data matrix, which helps you judge more precisely how accurately each variable was imputed. A minimal sketch of the call follows.
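
The iris.imp object used below comes from a missForest() call; a minimal sketch (iris.mis is assumed to contain the Species column so that a PFC error can be reported):

> library(missForest)
> # impute missing values; returns the imputed data ($ximp) and an OOB error estimate
> iris.imp <- missForest(iris.mis)
> # aggregated out-of-bag imputation error (NRMSE for numeric, PFC for factors)
> iris.imp$OOBerror
> # or request the OOB error separately for each variable:
> # iris.imp <- missForest(iris.mis, variablewise = TRUE)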

NRMSE is the normalized root mean squared error; it represents the error in the imputation of continuous values. PFC (proportion of falsely classified) represents the error in the imputation of categorical values.

> iris.err <- mixError(iris.imp$ximp, iris.mis, iris)
> iris.err
     NRMSE       PFC 
 0.1535103 0.0625000

This indicates an error of 6% for the categorical variable and 15% for the continuous variables. This can be improved by tuning the mtry and ntree parameters: mtry is the number of variables randomly sampled at each split, and ntree is the number of trees grown in the forest. A sketch follows.
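
A sketch of tuning these two parameters; the values shown are illustrative, not tuned:

> # mtry: variables tried at each split; ntree: trees per forest
> iris.imp2 <- missForest(iris.mis, mtry = 3, ntree = 300)
> iris.imp2$OOBerror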

 

Nonparametric regression (aregImpute)

 

A different bootstrap resample is used for each of the multiple imputations. An additive model (a nonparametric regression method) is then fitted to samples drawn with replacement from the original data, and the missing values (acting as the dependent variable) are predicted from the non-missing values (the independent variables).

It then uses predictive mean matching (the default) to impute the missing values. Predictive mean matching works well for continuous and categorical variables (binary and multi-level) without computing residuals or requiring a maximum likelihood fit.


aregImpute() automatically recognizes the variable type and treats each accordingly.
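
The impute_arg object printed below would be produced by a call like the following; the formula and n.impute = 5 are assumptions:

> library(Hmisc)
> # impute all variables with additive regression + predictive mean matching
> impute_arg <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length +
+                            Petal.Width + Species, data = iris.mis, n.impute = 5)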

> impute_arg

 

The output shows R² values for the predicted missing values; the higher the value, the better the predictions. You can also inspect the imputed values with the following command:

> impute_arg$imputed$Sepal.Length

 

Multiple imputation with diagnostics

Multiple imputation with diagnostics provides several functions for handling missing values. Like MICE, it builds multiple imputation models to approximate the missing values, and it also uses the predictive mean matching method.

Although I explained predictive mean matching (PMM) above, here is a quick recap: for each observation with a missing value on a variable, we find the observation (among those with observed values) whose predicted mean for that variable is closest to the predicted mean of the missing observation. The observed value from this "match" is then used as the imputed value.

 

Its notable features include the following (a sketch of the call follows this list):

  1. It allows graphical diagnostics of the imputation models and of the convergence of the imputation process.
  2. It uses a Bayesian version of regression models to handle the separation problem.
  3. The imputation model specification is similar to regression output in R.
  4. It automatically detects irregularities in the data, such as high collinearity among variables.
  5. Moreover, it adds noise to the imputation process to address the problem of additive constraints.
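
A minimal sketch, assuming the mi package (whose feature set matches this description); the seed value is arbitrary:

> library(mi)
> # impute, then inspect summary statistics of the imputed values
> mi_data <- mi(iris.mis, seed = 335)
> summary(mi_data)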

 

 

 

As shown above, it uses summary statistics to describe the imputed values.

 

Endnotes

 

In this article, I explained five methods for missing value imputation. These methods can help you achieve greater accuracy when building predictive models.

