Python Data Analysis Basics: outlier detection and processing

Author: xiaoyu

Python Data Science

This article continues with another common problem in data cleaning: outlier detection and handling.

1 What is an outlier?

In machine learning, anomaly detection and handling is a relatively small branch, or a by-product of machine learning. In a typical prediction problem, the model expresses the structure of the whole sample and captures its general properties. Points whose behavior is not consistent with those general properties are called outliers (abnormal points). In prediction problems, outliers are usually unwelcome: developers care about the properties of the whole sample, while the mechanism that generates an outlier is completely different from that of the rest of the sample. If the algorithm is sensitive to outliers, the resulting model cannot represent the whole sample well, and its predictions will be inaccurate.

On the other hand, in some scenarios outliers are exactly what analysts are interested in. Take disease prediction: the physical indicators of healthy people are usually similar in certain dimensions, so if a person's indicators are abnormal, their physical condition must have changed in some respect. The change is not necessarily caused by disease (such cases are often called noise points), but the occurrence and detection of abnormalities is an important starting point for disease prediction. Similar scenarios include credit fraud, cyber attacks, and so on.

2 Detection method of outliers

Common outlier detection methods include statistics-based methods, clustering-based methods, and some methods designed specifically for outlier detection. They are introduced below.

1. Simple statistics

With pandas, we can call describe() directly to get a statistical summary of the data (only a rough look at the statistics). Note that these statistics describe continuous (numeric) data. For example:

df.describe()

Alternatively, a simple scatter plot often makes outliers clearly visible.

2. The 3σ rule

This rule has a precondition: the data must be normally distributed. Under the 3σ rule, a value that lies more than 3 standard deviations from the mean can be regarded as an outlier. The probability of a value falling within ±3σ of the mean is 99.7%, so the probability of a value lying more than 3σ from the mean is P(|x − μ| > 3σ) ≤ 0.003, which is a rare, low-probability event. If the data does not follow a normal distribution, outliers can still be described by how many standard deviations they lie from the mean.

(Figure: the red arrows point to the outliers.)
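
The rule above can be sketched in a few lines of numpy. This is a minimal illustration, not a definitive implementation: the function name three_sigma_outliers, the n_sigma parameter, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def three_sigma_outliers(values, n_sigma=3.0):
    """Return a boolean mask: True where a value lies more than
    n_sigma standard deviations away from the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()
    return np.abs(values - mean) > n_sigma * std

# Synthetic, roughly normal data with one injected extreme value
# (illustrative only, not taken from the article).
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50.0, scale=5.0, size=1000), 120.0)

mask = three_sigma_outliers(data)
print(data[mask])  # the values flagged as outliers
```

Note that the mean and standard deviation are themselves pulled by extreme values, so a very large outlier can partly hide smaller ones.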

3. Box plot

This method uses the interquartile range (IQR) of a box plot to detect outliers and is also known as Tukey's test. A box plot is defined as follows:

The interquartile range (IQR) is the difference between the upper quartile and the lower quartile. Using 1.5 times the IQR as the threshold, any point above the upper quartile + 1.5 × IQR, or below the lower quartile − 1.5 × IQR, is considered an outlier. Here is an implementation in Python using numpy's percentile function.

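import numpy as np

# Percentiles of the 'length' column: min, Q1, median, Q3, max
Percentile = np.percentile(df['length'], [0, 25, 50, 75, 100])
IQR = Percentile[3] - Percentile[1]
# Upper and lower limits at 1.5 times the IQR beyond the quartiles
UpLimit = Percentile[3] + IQR * 1.5
DownLimit = Percentile[1] - IQR * 1.5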
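
As a self-contained variant of the snippet above, the following sketch wraps the limits in a function and uses them to select the outliers. The function name iqr_limits, the parameter k, and the small lengths array are assumptions for illustration; they stand in for df['length'].

```python
import numpy as np

def iqr_limits(values, k=1.5):
    """Return (lower, upper) limits at k times the IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical measurements standing in for df['length'].
lengths = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 10.3, 9.7, 25.0, 1.0])

down_limit, up_limit = iqr_limits(lengths)
# Keep only the points outside the limits.
outliers = lengths[(lengths < down_limit) | (lengths > up_limit)]
print(down_limit, up_limit, outliers)
```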

The same check can be done visually with seaborn's boxplot function:

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(y='length', data=df, ax=ax)
plt.show()

(Figure: the red arrows point to the outliers.)

These are the simple methods for identifying outliers. The following sections introduce some more complex outlier detection algorithms. Because they involve much more material, only the core ideas are presented; interested readers can study them in more depth.

4. Model-based detection

This approach generally builds a probability distribution model, computes the probability that each object fits the model, and treats low-probability objects as outliers. If the model is a set of clusters, an outlier is an object that does not clearly belong to any cluster; if the model is a regression, an outlier is an object relatively far from its predicted value.

Probabilistic definition of an outlier: an outlier is an object that has low probability under a probability distribution model of the data. This requires knowing which distribution the data set follows; if the assumed distribution is wrong, for example when the data is actually heavy-tailed, the resulting judgments about outliers will be unreliable.
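
To make the probabilistic definition concrete, here is a minimal sketch that assumes the data follows a multivariate Gaussian, fits the mean and covariance, and flags the objects with the lowest log-density. The Gaussian assumption, the function name gaussian_log_density, and the 1% cutoff are illustrative choices, not part of the original method description.

```python
import numpy as np

def gaussian_log_density(X):
    """Log-density of each row of X under a Gaussian fitted to X itself."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - mean
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of every row to the fitted mean.
    maha_sq = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    _, log_det = np.linalg.slogdet(cov)
    k = X.shape[1]
    return -0.5 * (maha_sq + log_det + k * np.log(2 * np.pi))

# Synthetic 2-D data with one injected far-away point (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(500, 2)), [[8.0, -8.0]]])

log_p = gaussian_log_density(X)
threshold = np.quantile(log_p, 0.01)   # assumed cutoff: lowest 1% of densities
is_outlier = log_p < threshold
print(np.where(is_outlier)[0])
```

If the Gaussian assumption does not hold, the low-density points found this way may simply reflect the wrong model rather than real anomalies.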

For example, the RobustScaler method used in feature engineering relies on the quantiles of each feature: it divides the data into segments by quantile and uses only the middle segment, for example the range from the 25% to the 75% quantile, to compute the scaling. This reduces the influence of abnormal data.
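
The idea can be sketched by hand with numpy: center on the median and divide by the 25% to 75% quantile range, which is what scikit-learn's RobustScaler is documented to do by default. The function name robust_scale and the sample array are assumptions for illustration.

```python
import numpy as np

def robust_scale(values, q_low=25.0, q_high=75.0):
    """Center on the median and scale by the q_low..q_high quantile range."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    low, high = np.percentile(values, [q_low, q_high])
    return (values - median) / (high - low)

# Five ordinary values and one extreme value (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

scaled = robust_scale(x)
# Ordinary standardization for comparison: mean and std are both
# dominated by the extreme value.
standardized = (x - x.mean()) / x.std()

print(scaled)
print(standardized)
```

With standardization the five ordinary values are squeezed into almost the same number, while the quantile-based scaling keeps them spread out because the extreme value does not enter the median or the quantile range.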

Advantages and disadvantages: (1) these tests have a solid foundation in statistical theory and can be very effective when there is enough data and knowledge of the type of test to use; (2) for multivariate data there are fewer options available, and for high-dimensional data these methods perform poorly.

5. Proximity-based outlier detection

Statistical methods use the distribution of the data to identify outliers, and some even require specific distributional conditions. In practice, the data rarely satisfies such assumptions, so these methods have limitations.

It is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution. This approach is more general and easier to use than statistical methods: the outlier score of an object is given by the distance to its k-th nearest neighbor (KNN).

Note that the outlier score is highly sensitive to the value of k. If k is too small, a small number of outliers that lie close to each other may receive low outlier scores. If k is too large, all objects in clusters with fewer than k points may become outliers. To make the scheme more robust to the choice of k, the average distance to the k nearest neighbors can be used instead.
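
Both scoring variants described above can be sketched with a brute-force distance matrix in numpy. The function name knn_outlier_scores, the use_mean flag, and the synthetic data are assumptions for illustration; the full pairwise matrix is only practical for small data sets.

```python
import numpy as np

def knn_outlier_scores(X, k=5, use_mean=False):
    """Outlier score per row: distance to the k-th nearest neighbor,
    or the mean distance to the k nearest neighbors if use_mean is True."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # full m x m distance matrix
    dist_sorted = np.sort(dist, axis=1)
    neighbours = dist_sorted[:, 1:k + 1]       # column 0 is the distance to itself
    if use_mean:
        return neighbours.mean(axis=1)
    return neighbours[:, -1]

# One dense cluster plus one isolated point (illustrative only).
rng = np.random.default_rng(2)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([cluster, [[10.0, 10.0]]])

scores_kth = knn_outlier_scores(X, k=5)
scores_mean = knn_outlier_scores(X, k=5, use_mean=True)
print(np.argmax(scores_kth), np.argmax(scores_mean))  # index of the highest score
```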

Advantages and disadvantages: (1) the method is simple; (2) proximity-based methods require O(m^2) time, where m is the number of objects, so they are not suitable for large data sets; (3) the method is sensitive to the choice of parameters; (4) it cannot handle data sets with regions of different density, because it uses a global threshold and cannot take such density variations into account.

5. Density-based outlier detection

From a density-based point of view, an outlier is an object in a low-density region. Density-based outlier detection is closely related to proximity-based outlier detection, because density is usually defined in terms of proximity. A common definition takes the density as the reciprocal of the average distance to the k nearest neighbors: if that distance is small, the density is high, and vice versa. Another definition is the one used by the DBSCAN clustering algorithm, where the density around an object equals the number of objects within a specified distance d of that object.
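
The two density definitions above can be sketched as follows. The function names, the values of k and d, and the synthetic data are assumptions for illustration; the first function would divide by zero if an object had k exact duplicates.

```python
import numpy as np

def pairwise_distances(X):
    """Brute-force Euclidean distance matrix (m x m)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def inverse_knn_density(X, k=5):
    """Density as the reciprocal of the mean distance to the k nearest neighbors."""
    dist = np.sort(pairwise_distances(X), axis=1)
    return 1.0 / dist[:, 1:k + 1].mean(axis=1)   # skip column 0 (self)

def radius_count_density(X, d=1.0):
    """DBSCAN-style density: number of other objects within distance d."""
    dist = pairwise_distances(X)
    return (dist <= d).sum(axis=1) - 1           # subtract the object itself

# One tight cluster plus one isolated point (illustrative only).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(scale=0.5, size=(200, 2)), [[10.0, 10.0]]])

knn_density = inverse_knn_density(X, k=5)
count_density = radius_count_density(X, d=1.0)
print(np.argmin(knn_density), count_density[-1])
```

Under both definitions the isolated point sits in the lowest-density region, so it would be the first candidate for an outlier.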

Advantages and disadvantages: (1) these methods give a quantitative measure of how much an object is an outlier and work well even when the data has regions of different density; (2) like distance-based methods, they necessarily have O(m^2) time complexity, although O(m log m) can be achieved for low-dimensional data by using specific data structures; (3) parameter selection is difficult. The LOF algorithm addresses this by examining several values of k and taking the maximum outlier score, but upper and lower bounds for these values still have to be chosen.

6. Outlier detection based on clustering method

Clustering-based definition of an outlier: an object is a clustering-based outlier if it does not strongly belong to any cluster.

The effect of outliers on the initial clustering: if outliers are detected through clustering, then because the outliers themselves affect the clustering, it is questionable whether the resulting structure is valid. This is also a weakness of the k-means algorithm, which is sensitive to outliers. One way to deal with this is to cluster the objects, remove the outliers, and then cluster the objects again (this does not guarantee an optimal result).
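
The cluster, remove, re-cluster procedure can be sketched with a small hand-written k-means. This is a simplified illustration under stated assumptions: the initial centroids are passed in explicitly (one point from each blob) to keep the run deterministic, the outlier score is taken as the distance to the nearest centroid, and the 99% cutoff is arbitrary.

```python
import numpy as np

def kmeans(X, init_centroids, n_iter=50):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

def centroid_distance_scores(X, centroids):
    """Outlier score: distance from each object to its nearest centroid."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.min(axis=1)

# Two blobs plus one far-away point (illustrative only).
rng = np.random.default_rng(3)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[6.0, 6.0], scale=0.5, size=(100, 2))
X = np.vstack([blob_a, blob_b, [[3.0, 20.0]]])

# Step 1: cluster all objects.
centroids = kmeans(X, init_centroids=X[[0, 100]])
# Step 2: score the objects and drop the highest-scoring ones (assumed cutoff).
scores = centroid_distance_scores(X, centroids)
keep = scores < np.quantile(scores, 0.99)
# Step 3: cluster again without the removed objects.
X_clean = X[keep]
centroids_refit = kmeans(X_clean, init_centroids=X_clean[[0, 100]])
print(np.argmax(scores), centroids, centroids_refit)
```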

Advantages and disadvantages: (1) clustering techniques with linear or near-linear complexity (such as k-means) can find outliers very efficiently; (2) clusters are usually defined as the complement of outliers, so clusters and outliers can be found at the same time; (3) the resulting set of outliers and their scores may depend heavily on the number of clusters used and on the presence of outliers in the data; (4) the quality of the clusters produced by the clustering algorithm strongly affects the quality of the outliers it identifies.

7. Special outlier detection

In fact, the clustering methods mentioned above were originally designed for unsupervised classification, not for finding outliers. Outlier detection is simply a derived use that they happen to support.

Besides the methods above, two commonly used methods are designed specifically for detecting outliers: One Class SVM and Isolation Forest. Their details are not covered here.
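
For readers who want to try the two methods, here is a minimal usage sketch with scikit-learn's IsolationForest and OneClassSVM. It assumes scikit-learn is installed; the synthetic data and the parameter values (contamination, nu, gamma) are illustrative guesses, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic inliers plus two injected far-away points (illustrative only).
rng = np.random.default_rng(4)
inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X = np.vstack([inliers, [[8.0, 8.0], [-9.0, 7.0]]])

# Isolation Forest: contamination is the assumed share of outliers.
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
iso_labels = iso.fit_predict(X)      # -1 marks outliers, 1 marks inliers

# One Class SVM: nu is an upper bound on the fraction of training errors.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
svm_labels = ocsvm.fit_predict(X)    # same -1 / 1 convention

print(np.where(iso_labels == -1)[0])
print(np.where(svm_labels == -1)[0])
```

Both estimators follow the same convention of returning -1 for outliers and 1 for inliers, which makes it easy to compare their results on the same data.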

3 Handling methods of outliers

Once outliers have been detected, we need to handle them. Common ways of handling outliers can be roughly divided into the following:

Delete the records: directly delete the records that contain outliers;
Treat as missing values: treat the outliers as missing values and handle them with missing-value methods;
Mean correction: replace the outlier with the average of the observations immediately before and after it;
No processing: perform data mining directly on the data set that contains the outliers.
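
The first three options can be sketched with pandas as follows. The series, the threshold used to flag the outlier, and the choice of the median for filling are assumptions for illustration; in practice the flag would come from one of the detection methods above.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier at position 3.
s = pd.Series([10.0, 11.0, 10.5, 95.0, 10.8, 11.2], name="length")
is_outlier = s > 50   # assumed flag, standing in for a real detection step

# Option 1: delete the records that contain outliers.
deleted = s[~is_outlier]

# Option 2: treat outliers as missing values, then fill them
# (here with the median of the remaining values).
as_missing = s.mask(is_outlier)
filled = as_missing.fillna(as_missing.median())

# Option 3: mean correction with the observations just before and after.
# An outlier in the first or last position has only one neighbor and
# would become NaN with this simple approach.
neighbour_mean = (s.shift(1) + s.shift(-1)) / 2
corrected = s.copy()
corrected.loc[is_outlier] = neighbour_mean.loc[is_outlier]

print(deleted.tolist())
print(filled.tolist())
print(corrected.tolist())
```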

Whether to delete outliers depends on the actual situation. Some models are not very sensitive to outliers, so their performance is not affected even when outliers are present. Other models, such as logistic regression (LR), are very sensitive to outliers; if the outliers are not handled, problems such as overfitting may lead to very poor results.

4 Outlier summary

The above is a summary of outlier detection and handling methods.

We can find outliers with these detection methods, but the results are not absolutely correct; each case needs to be judged with an understanding of the business. Likewise, how to handle outliers, whether to delete them, correct them, or leave them alone, depends on the actual situation; there is no fixed rule.
