Missing values arise when data are clustered, grouped, deleted, or truncated because information is absent from the raw data. In other words, the values of one or more attributes in an existing dataset are incomplete.

There are three ways to deal with missing values: deletion, filling, and leaving them untreated.

Application examples (Python) will be updated as projects are analyzed.

1 Deletion

The main approaches are the simple deletion method and the weighting method.

The simple deletion method is the most primitive way to deal with missing values.

1.1 Simple deletion method

Principle:

  • Delete data items (objects, tuples, records) with missing values (delete rows).
  • When a feature value is missing for most objects, the feature is removed (delete columns).

Advantages:

  • Simple and effective when an object is missing several attribute values and the objects deleted for containing missing values make up a very small fraction of the data in the information table.

Disadvantages:

  • It trades historical data for completeness of information, which wastes resources and discards the hidden information carried by the deleted objects. When the information table contains only a few objects, deleting even a handful of them can seriously distort the information in the table and the correctness of the results. The method also performs very poorly when the percentage of null values varies greatly from attribute to attribute.

Case study:

  • In the Titanic data set, the feature ‘PassengerId’ was removed from the training set because it had no relationship to survival.
train_df = train_df.drop(['PassengerId'], axis=1)  # drop the column, keep all rows
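
More generally, both kinds of deletion can be done with pandas' dropna; a minimal sketch, where the DataFrame contents are illustrative rather than taken from the Titanic project:

import pandas as pd

df = pd.DataFrame({"age": [22.0, None, 35.0], "fare": [7.3, 8.1, None]})
rows_dropped = df.dropna()                  # delete rows containing any missing value
cols_dropped = df.dropna(axis=1, thresh=2)  # keep only columns with >= 2 non-null values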

1.2 Weighting method

Principle:

  • When values are not missing completely at random, bias can be reduced by weighting the complete cases. After cases with incomplete data are marked, the complete cases are assigned different weights, which can be estimated by logistic or probit regression (see the sketch at the end of this subsection).

Advantages:

  • If the explanatory variables include a variable that is a decisive factor in estimating the weights, this method can effectively reduce the bias. If the explanatory variables are unrelated to the weights, it does not reduce the bias.

Disadvantages:

  • When several attributes are missing, different weights must be assigned to each combination of missing attributes, which greatly increases the computational difficulty and reduces the accuracy of prediction. In that case the weighting method is not ideal.
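
A minimal sketch of the weighting idea, under the assumption that the missingness of a variable y is driven by a fully observed covariate x and modelled with logistic regression; the variable names and toy data are illustrative, not from the original post:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)
observed = rng.random(1000) < 1 / (1 + np.exp(-x))  # missingness depends on x

# Model P(observed | x), then weight each complete case by its inverse propensity.
propensity = LogisticRegression().fit(x.reshape(-1, 1), observed)
p = propensity.predict_proba(x[observed].reshape(-1, 1))[:, 1]
weighted_mean = np.average(y[observed], weights=1.0 / p)
print(weighted_mean)  # closer to y's true mean (0) than the naive y[observed].mean()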

2 Filling

This type of method completes the information table by filling empty values with certain values, usually based on statistical principles: an empty value is filled according to the distribution of the values taken by the other objects in the decision table, for example using the average of that attribute over the other objects. The following filling methods are commonly used in data mining:

2.1 Filling manually

Advantages:

  • Since the user knows the data best, this method produces the least data deviation and is probably the most effective filling method.

Disadvantages:

  • However, this method is time-consuming and is not feasible when the dataset is large and null values are numerous.

2.2 Treating missing attribute values as special values

A null value is treated as a special attribute value, different from any other value of the attribute; for example, all empty values are filled with ‘unknown’. This in effect creates a new concept, and it can lead to serious data skew, so it is generally not recommended.

2.3 Mean filling (Mean/Mode Completer)

The attributes in the information table are divided into numerical and non-numerical attributes, which are handled separately.

If the missing value is numeric, it is filled with the average of that attribute over all other objects. If it is non-numeric, then, following the statistical notion of the mode, it is filled with the value that occurs most frequently among all other objects (that is, the value with the highest frequency).

A similar method is the Conditional Mean Completer. Here, too, the missing attribute value is filled with an average of that attribute over other objects; the difference is that the values averaged are taken not from all objects in the information table but only from the objects whose decision attribute value matches that of the incomplete object.

Both completion methods start from the same premise: fill the missing attribute value with its most probable value; they differ only in the details. Compared with other methods, they use most of the information in the existing data to infer the missing values.
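
A minimal sketch of mean and mode filling with pandas; the column names, values, and the hypothetical ‘survived’ grouping column are illustrative:

import pandas as pd

df = pd.DataFrame({
    "age": [22.0, None, 35.0, None, 29.0],   # numeric: fill with the mean
    "embarked": ["S", "C", None, "S", "S"],  # non-numeric: fill with the mode
})
df["age"] = df["age"].fillna(df["age"].mean())
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

# Conditional Mean Completer: average only within objects sharing the same
# decision attribute (a hypothetical 'survived' column, for illustration):
# df["age"] = df.groupby("survived")["age"].transform(lambda s: s.fillna(s.mean()))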

2.4 Hot deck imputation (or nearby filling)

For an object containing a null value, the hot deck method finds the most similar object among the complete records and fills the null with that object's value. Different problems may use different criteria to judge similarity. The method is conceptually simple and uses relationships within the data to estimate null values; its disadvantage is that the similarity criterion is hard to define and involves many subjective factors.
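
A minimal sketch in which similarity is plain distance on the one observed column; the DataFrame and columns are illustrative, and real applications need a domain-specific similarity criterion:

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 2.1],
                   "b": [10.0, 20.0, 30.0, None]})
donors = df.dropna()                              # complete records act as donors
for i in df.index[df["b"].isna()]:
    dist = (donors["a"] - df.loc[i, "a"]).abs()   # similarity on the observed column
    df.loc[i, "b"] = donors.loc[dist.idxmin(), "b"]
print(df)  # row 3 borrows b=20.0 from its nearest donor (row 1)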

2.5 Clustering imputation

The most typical representative is K-nearest-neighbour imputation, which first determines the K samples closest to the sample with missing data according to Euclidean distance or correlation analysis, and then estimates the missing value as a weighted average of those K samples' values.

Like mean imputation, this belongs to single-value imputation; the difference is that it uses a hierarchical clustering model to predict the class of the sample with the missing variable and then imputes with the mean of that class.

Suppose X = (X1, X2, …, Xp) are the fully observed variables and Y is the variable with missing values. First cluster on X (or a subset of it), then impute each missing case with the mean of its class. If the explanatory variables introduced here are later analysed together with Y, this imputation method introduces autocorrelation into the model and hinders the analysis.
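
A minimal sketch using scikit-learn's KNNImputer, which implements the “K closest samples, weighted average” idea described above; the toy array is illustrative:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [8.0, 16.0]])
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))  # the nan becomes 4.0, the weighted mean of its 2 nearest neighbours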

2.6 Filling with all possible values (Assigning All Possible Values of the Attribute)

This method fills the vacant attribute value by trying every possible value of that attribute, and it can achieve a good completion effect. However, when the data volume is large or many attribute values are missing, the computational cost is very high and the number of candidate combinations to test is large.

A variant fills missing attribute values on the same principle, except that it tries the possible attribute values only from objects with the same decision attribute, rather than from all objects in the information table, which reduces the cost of the original method to some extent.
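
A minimal sketch of the enumeration step itself, with a hypothetical record and attribute domains; how the best completion is then chosen (for example via attribute reduction, as in section 2.7) is problem-specific and omitted:

from itertools import product

record = {"color": None, "size": "L", "grade": None}
domains = {"color": ["red", "blue"], "grade": ["A", "B", "C"]}

missing = [k for k, v in record.items() if v is None]
for combo in product(*(domains[k] for k in missing)):
    candidate = {**record, **dict(zip(missing, combo))}
    print(candidate)  # 2 x 3 = 6 candidate completions to evaluate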

2.7 Combinatorial Completer

This method tries every possible value of the vacant attribute and selects, judged from the reduction results for the final attributes, the best value as the fill. It is a data completion method aimed at attribute reduction and can yield good reduction results, but when the data volume is large or many attribute values are missing, the computational cost is very high.

The other variant, the Conditional Combinatorial Completer, fills missing attribute values on the same principle, except that the candidate values are tried only from objects with the same decision attribute rather than from all objects in the information table. This reduces the cost of the combinatorial completion method to some extent; still, when the information table contains a large amount of incomplete data, the number of candidate combinations increases dramatically.

2.8 Regression

Based on the complete cases, a regression equation (model) is established. For objects containing null values, the known attribute values are substituted into the equation to estimate the unknown attribute values, and these estimates are used as fill values. Biased estimates can result when the variables are not linearly related or when the predictor variables are highly correlated.
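
A minimal sketch of regression imputation: fit on the complete cases, then predict the missing values; the column names and data are illustrative:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.1, 4.2, None, 8.1, None]})
complete = df.dropna()                    # complete cases define the model
model = LinearRegression().fit(complete[["x"]], complete["y"])
mask = df["y"].isna()
df.loc[mask, "y"] = model.predict(df.loc[mask, ["x"]])
print(df)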

2.9 Maximum likelihood estimation (Maximum Likelihood, ML)

On the assumption that values are missing at random and that the model is correct for the complete data, the maximum likelihood estimates of the unknown parameters can be obtained from the marginal distribution of the observed data (Little and Rubin). This approach is also known as maximum likelihood estimation that ignores the missing values. In practice, the Expectation-Maximization (EM) algorithm is usually used to compute the maximum likelihood estimates.

This method is more attractive than deleting cases or single-value imputation, but an important premise is a large sample: the number of valid samples must be sufficient for the ML estimates to be asymptotically unbiased and normally distributed. Moreover, the method may get stuck in local extrema, its convergence is not particularly fast, and the computation is complicated.
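
A minimal sketch of EM for maximum likelihood under a bivariate normal model in which x is always observed and y is sometimes missing; the toy data and all names are assumptions for illustration, not the original post's example:

import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)
miss = rng.random(n) < 0.3           # ~30% of y missing at random
y_obs = np.where(miss, np.nan, y)

# Initialise the mean and covariance from the complete cases.
mu = np.array([x.mean(), np.nanmean(y_obs)])
cov = np.cov(x[~miss], y_obs[~miss])

for _ in range(50):
    # E-step: fill each missing y with E[y | x], keeping the conditional
    # variance so the second moment is not understated.
    beta = cov[0, 1] / cov[0, 0]
    cond_var = cov[1, 1] - beta * cov[0, 1]
    y_fill = np.where(miss, mu[1] + beta * (x - mu[0]), y_obs)
    y2_fill = np.where(miss, y_fill ** 2 + cond_var, y_obs ** 2)

    # M-step: re-estimate the parameters from the expected sufficient statistics.
    mu = np.array([x.mean(), y_fill.mean()])
    sxy = np.mean(x * y_fill) - mu[0] * mu[1]
    cov = np.array([[x.var(), sxy], [sxy, np.mean(y2_fill) - mu[1] ** 2]])

print(mu)   # approaches the true means (0, 1)
print(cov)  # approaches the true covariance matrix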

2.10 Multiple Imputation (MI)

The idea of multiple imputation derives from Bayesian estimation: the value to be imputed is regarded as random, and its value is drawn from the observed values. In practice, the value to be imputed is usually estimated first, then different noise terms are added to form several candidate imputed values, and the most suitable one is selected according to some criterion.

Multiple imputation proceeds in three steps: ① for each null value, generate a set of candidate imputed values that reflect the uncertainty of the non-response model; each candidate can be used to fill the missing values, producing several complete datasets. ② Analyse each imputed dataset with the statistical method intended for complete data. ③ Combine the results from the imputed datasets, selecting according to a scoring function, to produce the final imputed value.
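
A minimal sketch of this three-step workflow using scikit-learn's IterativeImputer with posterior sampling; pooling a column mean across the imputed datasets, and the toy array, are purely illustrative:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0],
              [4.0, 8.0], [np.nan, 10.0]])

estimates = []
for seed in range(5):                        # step 1: several plausible fills
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    X_filled = imp.fit_transform(X)
    estimates.append(X_filled.mean(axis=0))  # step 2: analyse each dataset
print(np.mean(estimates, axis=0))            # step 3: pool the results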

The idea of multiple imputation is consistent with that of Bayesian estimation, but it remedies several of its shortcomings. First, Bayesian estimation relies on maximum likelihood, which requires the model form to be specified correctly; if the parametric form is wrong, the conclusions will be wrong, that is, the prior distribution affects the accuracy of the posterior. Multiple imputation, by contrast, rests on large-sample asymptotic theory for complete data, and in data mining the data volume is large, so the prior distribution has little influence on the results. Second, Bayesian estimation requires only the prior distribution of the unknown parameters and does not exploit the relationships between parameters, whereas multiple imputation estimates the joint distribution of the parameters and makes use of those relationships.

At the same time, multiple imputation preserves the two basic advantages of single imputation: complete-data analysis methods can still be applied, and the knowledge of the data collector can be incorporated. Compared with single imputation, it has three further important advantages. First, the imputed values are drawn randomly so as to represent the distribution of the data, which increases the efficiency of estimation. Second, when the imputations are random draws under a model, valid inference is obtained simply by combining the complete-data inferences in a direct way, which reflects the additional variation caused by the missing values under that model. Third, by imputing with random draws under several different models, the sensitivity of the inference to the choice of non-response model can be studied directly with complete-data methods.

Multiple imputation also has disadvantages: ① generating the multiple imputations takes more work than a single imputation; ② storing the multiply imputed datasets requires more space; ③ analysing the multiply imputed datasets takes more effort than analysing a singly imputed one.

3 Leaving untreated

Data mining is performed directly on the data containing null values. Such methods include Bayesian networks and artificial neural networks.

A Bayesian network is a graphical model that represents probabilistic relationships between variables. It provides a natural way to express causal information and to discover latent relationships in data. In the network, nodes represent variables and directed edges represent the dependencies between them. Bayesian networks are suitable only for users who have some knowledge of the domain, or at least a clear picture of the dependencies between variables. Otherwise, learning the network structure directly from data is highly complex (it grows exponentially with the number of variables), the network is expensive to maintain, and the many parameters to be estimated bring high variance to the system and hurt its prediction accuracy. When any single object has a large number of missing values, there is a danger of exponential explosion.

Artificial neural networks can also handle null values effectively, but research on using them for this purpose needs further development, and the approach still has limitations in data mining applications.

4 Summary

The advantages, disadvantages, and applicable settings of these methods are summarized below:

The imputation methods above all work well when values are missing at random. The two mean-imputation methods are the easiest to implement and were widely used in the past, but they disturb the sample considerably; in particular, when imputed values are used as explanatory variables in a regression, the parameter estimates deviate substantially from their true values. By comparison, maximum likelihood estimation and multiple imputation are two relatively good methods, and since maximum likelihood lacks the uncertainty component that multiple imputation provides, more and more people tend to use multiple imputation.

Resources: blog.csdn.net/w352986331q…