Author: Mochou

Source: Hang Seng LIGHT Cloud Community

In today's big data landscape, data processing takes up a large share of the work. Like a tomato headed for scrambled eggs with tomato, the data has to go through seasoning, washing, and cooking before it can be released to production, er, served at the table.

Here is a brief share of my understanding of data cleaning. It matters a great deal: faced with a messy plate of scrambled eggs with tomato, nobody wants to eat.

An uncleaned dataset typically has these kinds of non-compliant data: duplicates, errors, null values, abnormal values, and so on. Erroneous data comes from problems on the business side, such as a gender field where male is recorded as female; we cannot fix that during cleaning and can only push the correction back to the source. It is like a customer ordering Henan tomatoes while the kitchen has only stocked Shandong ones: the chef cannot solve it, only the front of house can negotiate a change. So cleaning deals with the other three problems. It must also be said that all cleaning should be driven by the actual business. Take duplicates, for example: maybe duplicates are exactly what the business wants.

1. Duplicate values

If the actual business does not want duplicate values, they can simply be removed. When merging in the database, for example, use union instead of union all. If union cannot be used, you can group by the primary key and take the first row in each group:

```sql
row_number() over (partition by .. order by .. desc) as ..
```

If row_number is not supported either, you will just have to make do.

Other languages have similar deduplication functions; Python, for example, uses drop_duplicates().
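
A minimal sketch in pandas (the DataFrame df and its id column are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3], "value": ["a", "a", "b", "c"]})

# Drop rows that are exact duplicates
deduped = df.drop_duplicates()

# Or keep only the first row per primary key, much like the row_number() approach
deduped_by_key = df.drop_duplicates(subset="id", keep="first")
```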

2. Missing values


Null values are usually handled by filling them in, and how to fill depends on the actual business. Generally speaking:

  • If there are only a few null values, a continuous variable can be filled with a statistic such as the mean or median;
  • If there are many null values, say more than 50%, filling with the mode can be considered;
  • If nulls dominate the column, there is no need to use the original data at all; instead, generate an indicator (dummy) variable marking whether the value is missing, to take part in subsequent modeling.

We have already seen how to do this in the database; how do we do it in Python?

```python
# Check the proportion of null values in each column
df.apply(lambda col: sum(col.isnull()) / col.size)

# Fill the nulls in col1 with the column mean, using fillna from the pandas package
df.col1.fillna(df.col1.mean())
```
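
For the third case above (nulls too numerous to be worth filling), a minimal sketch of an indicator dummy variable; the column name col1 follows the example above:

```python
# 1 if col1 is missing, 0 otherwise; the dummy takes part in later modeling
df["col1_missing"] = df["col1"].isnull().astype(int)
```
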
3. Noise values

Noise values are values that differ greatly from the rest of the data for the same indicator; they are also called outliers, such as an age above 150. Noise values seriously distort model results, making the conclusions untrue or biased, so they need to be removed. The commonly used methods are: for a single variable, the capping method and the binning method; for multiple variables, the clustering method.

  • Capping method

We have all learned about the normal distribution: the probability of a record falling more than three standard deviations from the mean is only about 0.3%, so we can replace these peripheral values with the mean plus or minus three standard deviations. This is the capping method.
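
A minimal sketch of 3σ capping on a pandas Series (the ages data here is made up for illustration):

```python
import numpy as np
import pandas as pd

# A hypothetical column of ages: mostly ordinary values plus one record of 150
ages = pd.Series(list(np.random.normal(35, 8, 1000)) + [150.0])

lower = ages.mean() - 3 * ages.std()
upper = ages.mean() + 3 * ages.std()

# Pull everything outside mean ± 3 standard deviations back to the boundary
capped = ages.clip(lower=lower, upper=upper)
```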

In the database this can be handled with case when; in Python we can write a function (the one below caps at the 1st and 99th percentiles rather than at three standard deviations):

```python
def cap(x, quantile=[0.01, 0.99]):
    """The capping method handles outliers.

    Args:
        x: a pandas Series holding a continuous variable
        quantile: range of the lower and upper quantiles, here 0.01 and 0.99
    """
    # Compute the quantiles: Q01 and Q99 are the 1st and 99th percentiles respectively
    Q01, Q99 = x.quantile(quantile).values.tolist()
    # Replace the outliers with the designated quantiles
    if Q01 > x.min():
        x = x.copy()
        x.loc[x < Q01] = Q01
    if Q99 < x.max():
        x = x.copy()
        x.loc[x > Q99] = Q99
    return x
```
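
One possible way to use it, assuming a DataFrame df of continuous columns (df and the income column are hypothetical names):

```python
# Cap every continuous column of the DataFrame at its 1st and 99th percentiles
df_capped = df.apply(cap)

# Or cap a single column with custom quantiles
df["income"] = cap(df["income"], quantile=[0.05, 0.95])
```
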
  • Binning method

The binning method smooths ordered data by examining its "neighborhood". The sorted values are distributed into a number of bins, each bin is represented by a specific value such as its mean or median, and a standard is then set to judge whether each bin is good or bad; bad bins need special treatment. There are two ways to bin: equal-depth binning, where each bin holds the same number of samples, and equal-width binning, where each bin covers the same range of values.

For example, take the numbers 1, 2, 66, 8, 9, 2, 1, 4, 6. First sort them: 1, 1, 2, 2, 4, 6, 8, 9, 66, and then divide them into three bins. Bin A: 1, 1, 2; bin B: 2, 4, 6; bin C: 8, 9, 66.

Taking the mean of each bin, A is about 1.3, B is 4, and C is about 27.7. C's value is clearly far above both the mean and the median of the whole set, so bin C is a bad bin, and the data in C can be handled together.
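
A minimal sketch of equal-depth binning in Python that reproduces the example above (for larger data, pandas' qcut and cut give equal-depth and equal-width bins respectively):

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 66, 8, 9, 2, 1, 4, 6])

# Equal-depth binning: sort the values, then split them into three bins
# with the same number of samples in each
sorted_vals = values.sort_values().to_numpy()
bins = np.array_split(sorted_vals, 3)

# Representative value of each bin: here, the mean
for name, contents in zip("ABC", bins):
    print(name, contents, round(contents.mean(), 1))
```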

  • Clustering method

The methods above are all univariate; outliers across multiple variables call for clustering.

The idea is that normal values share similar characteristics. Good tomatoes, for instance, are red, taste sweet and sour, and have intact skins, while a bad tomato differs from "the others" in some way, such as a strange taste. Data objects can therefore be divided into several sets: objects within the same set are highly similar, while objects in different sets differ greatly. Cluster analysis can then surface the outliers that fall outside these sets, and such outliers are often abnormal data.