Big Data Preprocessing (Complete Steps)

Although the title says "complete steps", the coverage here is not actually complete; the original text follows.

One: Why preprocess data?

(1) Real-world data is dirty: incomplete, noisy, and inconsistent.

(2) Without high-quality data there can be no high-quality mining results; high-quality decisions must rely on high-quality data.

(3) Problems in the raw data:

- inconsistency: the data contradicts itself
- duplication: the same records appear repeatedly
- incompleteness: attributes of interest are missing values
- noise: errors or anomalies in the data (values deviating from what is expected)
- high dimensionality

Two: Methods of data preprocessing

(1) Data cleaning: removing noise and irrelevant data (a minimal cleaning sketch follows at the end of this section).

(2) Data integration: combining data from multiple data sources and storing it in a consistent data store (an integration sketch also follows below).

(3) Data transformation: converting the raw data into a form suitable for data mining (normalization and discretization are sketched below).

(4) Data reduction: the main methods include data cube aggregation, dimensionality reduction, data compression, numerosity reduction, discretization, and concept hierarchy generation.

Three: Reference principles for data selection

(1) Make the meaning of attribute names and attribute values as clear as possible.

(2) Unify the attribute coding across multiple data sources.

(3) Remove unique attributes (e.g. record IDs, which carry nothing to mine).

(4) Remove duplicate attributes.

(5) Remove fields that can be neglected.

(6) Choose associated fields sensibly.

(7) Further processing: fill in missing data, eliminate abnormal data, smooth noisy data, and correct inconsistent data; in other words, remove noise, fill null and missing values, and resolve inconsistencies.

A typical workflow:

data analysis (with visualization tools) to find dirty data -> clean the dirty data (e.g. with MATLAB or Java/C++) -> statistical analysis (Excel's Data Analysis feature works well: maximum and minimum, median, mode, mean, variance, etc., plus a scatter plot; a code sketch follows below) -> again find dirty data, or data irrelevant to the experiment, and remove it -> final experimental analysis -> validation against real-world examples -> end.
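As a concrete illustration of the cleaning step, here is a minimal Python/pandas sketch (the original text suggests MATLAB or Java/C++; the idea is the same). The data, the column names, and the plausible-range bounds are all hypothetical.

```python
import pandas as pd

# Hypothetical dirty data: a missing age, a duplicate record, an anomaly.
df = pd.DataFrame({
    "age":    [23, 25, None, 25, 240],
    "income": [3000, 4200, 4200, 4200, 5100],
})

df = df.drop_duplicates()                          # remove repeated records
df["age"] = df["age"].fillna(df["age"].median())   # fill missing values
df["age"] = df["age"].clip(0, 120)                 # values outside a plausible
                                                   # range are treated as noise
print(df)
```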
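For the integration step, one common sub-task is unifying the attribute coding of multiple sources before combining them into one consistent store. A sketch under invented assumptions (the two sources, their gender codes, and the column names are all hypothetical):

```python
import pandas as pd

# Two hypothetical sources that encode the same attribute differently.
source_a = pd.DataFrame({"id": [1, 2], "gender": ["M", "F"]})
source_b = pd.DataFrame({"id": [3, 4], "gender": [1, 2]})

# Unify the attribute coding, then store both in one consistent table.
source_b["gender"] = source_b["gender"].map({1: "M", 2: "F"})
combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined)
```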
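For the transformation and reduction steps, a sketch of min-max normalization plus equal-width discretization, a simple form of concept hierarchy where numeric income becomes low/medium/high. The income values and bin labels are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"income": [3000, 4200, 5100, 8800, 12000]})

# Min-max normalization into [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

# Equal-width discretization into three concept levels.
df["income_level"] = pd.cut(df["income"], bins=3,
                            labels=["low", "medium", "high"])
print(df)
```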
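Finally, the statistical-analysis step of the workflow (maximum/minimum, median, mode, mean, variance, scatter plot), sketched in Python rather than Excel; the x/y values are invented.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

print("max:", df["y"].max(), " min:", df["y"].min())
print("median:", df["y"].median(), " mode:", df["y"].mode().tolist())
print("mean:", df["y"].mean(), " variance:", df["y"].var())

df.plot.scatter(x="x", y="y")   # scatter plot to eyeball outliers
plt.show()
```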