Original link: http://tecdat.cn/?p=8287

Original source: Tecdat (拓端数据部落) WeChat official account

Introduction

Missing values are considered one of the primary obstacles in predictive modeling. It is therefore important to master methods for dealing with them.

The choice of method used to impute missing values strongly affects a model's predictive ability. In most statistical analyses, deleting incomplete observations is the default way of handling missing values, but it can lead to a loss of information.

In this article, I present five approaches to imputing missing values in R.

Multivariate Imputation by Chained Equations (MICE)

Multivariate imputation by chained equations is commonly used by R users. Creating multiple imputations, as opposed to a single imputation (such as the mean), accounts for the uncertainty in the missing values.

MICE assumes that the data are missing at random (MAR), which means that the probability that a value is missing depends only on the observed values, so the missing values can be predicted from them. It imputes data on a variable-by-variable basis by specifying an imputation model for each variable.

For example, suppose we have variables X1, X2, …, Xk. If X1 has missing values, it is regressed on the other variables X2 through Xk, and its missing values are replaced with the resulting predictions. Similarly, if X2 has missing values, then X1 and X3 through Xk are used as the independent variables in the prediction model, and the missing values are again replaced by the predicted values.

By default, linear regression is used to predict continuous missing values, and logistic regression is used for categorical missing values. Once this cycle is complete, multiple data sets are generated. These data sets differ only in their imputed missing values. It is generally considered good practice to build models on each of these data sets separately and then combine their results.

To be precise, the methods used by this package are:

  1. pmm (Predictive Mean Matching) – used for numeric variables
  2. logreg (Logistic Regression) – used for binary variables (2 levels)
  3. polyreg (Polytomous Regression) – used for unordered factor variables (>= 2 levels)
  4. polr (Proportional Odds Model) – used for ordered factor variables (>= 2 levels)

Now let’s actually do it.

# prodNA() from the missForest package seeds missing values at random
> library(missForest)
> path <- "../Data/Tutorial"
> data <- iris
> summary(iris)
# generate 10% missing values at random
> iris.mis <- prodNA(iris, noNA = 0.1)
# check missing values
> summary(iris.mis)

I have removed the categorical variable so we can focus on continuous values here. To handle categorical variables, simply encode the class levels and follow the same steps.

> iris.mis <- subset(iris.mis, select = -c(Species))
> summary(iris.mis)

_md.pattern()_ returns a tabular representation of the missing values present in each variable of the data set.

> library(mice)
> md.pattern(iris.mis)

Let's take a look at this table. There are 98 observations with no missing values, 10 observations with a missing value in Sepal.Length, 13 with a missing value in Sepal.Width, and so on.

We can also create a visual representation of the missing values using the VIM package.

> library(VIM)
> mice_plot <- aggr(iris.mis, col=c('navyblue','yellow'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(iris.mis), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

Let's interpret this quickly: 67% of the values in the data set are complete, with no missing values; Petal.Length has 10% missing values, Petal.Width has 8%, and so on. You can also look at the histogram, which clearly shows the share of missing values in each variable.

Now, let's impute the missing values.
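The call that produced the summary below is missing from the original text. Here is a minimal sketch consistent with the printed output (m = 5, method = 'pmm', seed = 500); the maxit value of 50 and the object name imputed_Data are assumptions:

# impute missing values with predictive mean matching; maxit is assumed
> imputed_Data <- mice(iris.mis, m = 5, maxit = 50, method = 'pmm', seed = 500)
> summary(imputed_Data)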

Multiply imputed data set
Call:
 Number of multiple imputations: 5
Missing cells per column:
Sepal.Length Sepal.Width Petal.Length Petal.Width 
13            14          16           15 
Imputation methods:
Sepal.Length Sepal.Width Petal.Length Petal.Width 
"pmm"        "pmm"        "pmm"       "pmm" 
VisitSequence:
Sepal.Length Sepal.Width Petal.Length Petal.Width 
1              2            3           4 
PredictorMatrix:
              Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        0          1            1            1
Sepal.Width         1          0            1            1
Petal.Length        1          1            0            1
Petal.Width         1          1            1            0
Random generator seed value: 500

Here is a description of the parameters used:

  1. m – the number of imputed data sets
  2. maxit – the number of iterations used to impute the missing values
  3. method – the imputation method used; here, predictive mean matching ('pmm')

Since there are five imputed data sets, you can select any one of them using the _complete()_ function.
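For instance, a minimal sketch that extracts the second of the five imputed data sets (the object name completeData is an assumption):

# select the 2nd of the 5 complete data sets
> completeData <- complete(imputed_Data, 2)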

You can also fit a model to each imputed data set and combine their results into a pooled output using the _pool()_ command.
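As an illustration, a minimal sketch that fits a linear model on each imputed data set via _with()_ and then pools the estimates (the model formula is arbitrary):

# fit a linear model on each of the 5 imputed data sets
> fit <- with(imputed_Data, lm(Sepal.Width ~ Sepal.Length + Petal.Width))
# pool the results across imputations
> combine <- pool(fit)
> summary(combine)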

Please note that I used the above commands for demonstration purposes only. You can replace the variables in them and try it yourself.

Multiple imputation (Amelia)

This package also performs multiple imputation (generating multiple imputed data sets) to handle missing values. Multiple imputation helps reduce bias and increase efficiency. The package is powered by the EMB (Expectation-Maximization with Bootstrapping) algorithm, which makes imputation faster and more reliable for many variables, including cross-sectional and time-series data. It can also use multicore CPUs to run imputations in parallel.

It makes the following assumptions:

  1. All the variables in the data set have a multivariate normal distribution (MVN). It uses means and covariances to summarize the data.
  2. The data are missing at random (MAR).

Therefore, this package works best when the data have a multivariate normal distribution. If not, a transformation is performed first to bring the data closer to normality.

The only caveat concerns categorical variables, which have to be identified explicitly when calling the imputation function.
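The imputation call itself is missing from the text; a minimal sketch, assuming a version of iris.mis that still contains the Species column (declared as nominal via noms) and the object name amelia_fit:

> library(Amelia)
# m = 5 imputations; Species is declared as a nominal (categorical) variable
> amelia_fit <- amelia(iris.mis, m = 5, parallel = "multicore", noms = "Species")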

# access the imputed outputs
> amelia_fit$imputations[[1]]
> amelia_fit$imputations[[2]]
> amelia_fit$imputations[[3]]
> amelia_fit$imputations[[4]]
> amelia_fit$imputations[[5]]

To check a particular column in the dataset, use

> amelia_fit$imputations[[5]]$Sepal.Length
# export the imputed data sets to CSV files
> write.amelia(amelia_fit, file.stem = "imputed_data_set")

Random forests (missForest)

As the name suggests, missForest is an imputation algorithm based on random forests. It is suitable for non-parametric imputation of mixed variable types. So, what is a non-parametric method?

Non-parametric methods do not make explicit assumptions about the functional form of _f_. Instead, they estimate _f_ in whatever form brings it as close to the data points as possible.

How does it work? In short, it builds a random forest model for each variable, then uses that model to predict the missing values in the variable with the help of the observed values.

It yields an OOB (out-of-bag) imputation error estimate and provides a high level of control over the imputation process. There is an option to return the OOB error separately for each variable instead of aggregated over the whole data matrix, which helps judge how accurately each variable was imputed.
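The imputation call itself does not appear in the text; a minimal sketch, assuming iris.mis from earlier and the object name iris.imp used below:

> library(missForest)
# impute missing values; a random forest is grown for each variable
> iris.imp <- missForest(iris.mis)
# check the imputed data
> iris.imp$ximp
# check the out-of-bag imputation error
> iris.imp$OOBerror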

NRMSE stands for normalized root mean squared error; it represents the error for the imputed continuous values. PFC (proportion of falsely classified) represents the error for the imputed categorical values.

> iris.err <- mixError(iris.imp$ximp, iris.mis, iris)
> iris.err
    NRMSE       PFC 
0.1535103 0.0625000

This indicates a 6% error for the categorical variables and a 15% error for the continuous variables. This can be improved by tuning the _mtry_ and _ntree_ parameters: mtry is the number of variables randomly sampled at each split, and ntree is the number of trees to grow in the forest.
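As an illustration, a minimal sketch of such tuning (the particular values are arbitrary):

# rerun the imputation with tuned parameters and recompute the error
> iris.imp2 <- missForest(iris.mis, mtry = 2, ntree = 200)
> mixError(iris.imp2$ximp, iris.mis, iris)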

Nonparametric regression method (Hmisc's aregImpute)

In this approach, a different bootstrap resample is used for each of the multiple imputations. An additive model (a nonparametric regression method) is then fitted on samples drawn with replacement from the original data, and the missing values (acting as the dependent variable) are predicted from the non-missing values (the independent variables).

It then uses predictive mean matching (the default) to impute the missing values. Predictive mean matching works well for continuous and categorical variables (binary and multi-level) without computing residuals or maximum-likelihood fits.

It automatically identifies the variable types and treats them accordingly.
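The call that produced impute_arg below is missing from the text; a minimal sketch, assuming the four numeric columns of iris.mis and five imputations:

> library(Hmisc)
# impute with additive regression, bootstrapping, and predictive mean matching
> impute_arg <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                           data = iris.mis, n.impute = 5)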

> impute_arg

The output shows R² values for the predicted missing values: the higher the value, the better the predictions. Use the following command to check the imputed values:

# check the imputed values for the variable Sepal.Length
> impute_arg$imputed$Sepal.Length

Multiple imputation with diagnostics

Multiple imputation with diagnostics provides several ways of dealing with missing values. It likewise builds multiple imputation models to approximate the missing values and uses the predictive mean matching method.

Although I explained predictive mean matching (PMM) above, here is a simpler summary: for each observation with a missing value in a variable, we find the observation (among the available values) with the closest predicted mean for that variable. The observed value from this "match" is then used as the imputed value.

  1. It allows graphical diagnostics of the imputation models and of the convergence of the imputation process.
  2. It uses a Bayesian version of the regression models to handle issues such as separation.
  3. The imputation model specification is similar to regression output in R.
  4. It automatically detects irregularities in the data, such as high collinearity among variables.
  5. Moreover, it adds noise to the imputation process to solve the problem of additive constraints.

As the sketch below shows, it uses summary statistics to describe the imputed values.
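No code accompanies this section in the original; the features described match the mi package, so here is a minimal sketch assuming it (the object name mi_data and the seed are arbitrary, and argument support may vary across mi versions):

> library(mi)
# impute missing values; the seed value is arbitrary
> mi_data <- mi(iris.mis, seed = 335)
# summary statistics describing the imputed values
> summary(mi_data)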

Endnotes

In this article, I described five approaches to imputing missing values. These methods can help you achieve greater accuracy when building predictive models.

