Background

Drawing on relevant material from around the Internet, we compiled this article, which explains the importance of data and its role in various stages and fields. More importantly, it explains in detail the principles and methods of data preprocessing and feature selection.

What is data?

  • The result of observation, experiment, or calculation. Examples: numbers, words, images, sounds, etc.

What is data analysis?

  • Concentrate and extract the information hidden behind the data
  • Summarize the internal patterns of the object under study, helping managers make effective judgments and decisions

The importance of data in data analysis

  • Data analysis is data + analysis: data comes first, analysis second. Data is the foundation of analysis, so its quality, relevance, and dimensionality all affect the results of data analysis.

Data analysis flow chart

Data preprocessing

What is data preprocessing?

Before feature engineering and modeling, detect and remove noisy and irrelevant data from the dataset, handle missing data, and remove blank records.

Why do data preprocessing?

For example:

  1. Missing values: Class = “”
  2. Errors or outliers: Wages = “−10”
  3. Contradictions: Age = 42 but Birthday = “03/17/1997”

Meaning of data preprocessing?

Improving the quality of the data helps improve the accuracy and performance of the subsequent learning process

The importance of data preprocessing

Data preprocessing is important. How important is it?

The data and its features determine the upper limit of machine learning; models and algorithms merely bring us closer to that limit

Mind maps for feature processing

Data cleaning

What is data cleansing?

Delete irrelevant and duplicate data from the original dataset, filter out data unrelated to the mining topic, and handle missing values and outliers.

Causes of missing data

Information is temporarily unavailable; information was omitted; some attributes do not apply to certain objects; and so on.

Types of missingness

Missing completely at random; missing at random; missing not at random

The need to handle missing values

Recover the lost information; increase the certainty of the data; obtain reliable output

Common data cleansing methods

Missing value handling

  1. Delete the tuple
  2. Mean/median/mode imputation (see the sketch after this list)
  3. Use a fixed value
  4. Nearest-neighbor imputation
  5. Regression methods
  6. Interpolation methods
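
A minimal sketch of methods 1–3 with pandas and scikit-learn; the DataFrame and its columns are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, np.nan, 42, 31],
                   "wage": [3000, 5200, np.nan, 4100]})

dropped = df.dropna()                                           # 1. delete tuples with nulls
mean_filled = SimpleImputer(strategy="mean").fit_transform(df)  # 2. mean imputation
median_filled = df.fillna(df.median())                          # 2. median imputation
fixed_filled = df.fillna(0)                                     # 3. fixed-value imputation
```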

Outlier handling

  1. Delete records that contain outliers
  2. Treat as missing values
  3. Mean correction
  4. Leave unprocessed

Skewed distributions

To understand the distribution of the data more comprehensively, describe it jointly with the mode, median, and mean. For data that is clearly skewed left or right, the median describes the data better than the mean, which is pulled by extreme values.

Skewed to the left means the long tail is on the left, with more extreme values on the left

Many algorithms require samples to be normally distributed

Normal distribution

Most of the frequency is concentrated in the center, and the frequencies at the two tails fall off roughly symmetrically

Why convert skewness data to normally distributed data?

Many models assume that data is normally distributed

Why is normal distribution common in nature?

If many factors are independent and identically distributed and their effects add up, the superimposed result approaches a normal distribution; this is the central limit theorem

Central limit theorem

If you repeatedly draw sufficiently large samples, the sample means are approximately normally distributed around the population mean
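
A quick numerical illustration with NumPy (the sample sizes are hypothetical): means of many samples drawn from a decidedly non-normal uniform distribution are themselves approximately normally distributed around the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 samples of size 50 from Uniform(0, 1); the population mean is 0.5
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

print(sample_means.mean())  # close to 0.5, the population mean
print(sample_means.std())   # close to sqrt((1/12)/50) ≈ 0.041; a histogram is bell-shaped
```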

Missing value handling

① Delete the tuple

② Mean/median/mode imputation

  • Impute with the mean if the missing value is numeric, with the mode if it is not; problems with mean imputation:
  • Reduced variability;
  • Weakened covariance and correlation estimates

③ Regression method

  • Use a model to fill in the missing values;
  • Problems with regression imputation: it overstates the model fit and the correlation estimates;
  • Weakened variance;

Outlier detection

The 3σ rule; scatter plots or box plots;
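
A sketch of the 3σ rule with NumPy (the data are synthetic): values more than three standard deviations from the mean are flagged as outliers.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(10, 1, 500), 25.0)  # normal data plus one injected outlier
mu, sigma = x.mean(), x.std()
outliers = x[np.abs(x - mu) > 3 * sigma]     # points outside mean ± 3σ; flags the 25.0
```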

Boxplot and quartiles

Quartiles

  • Sort all values in ascending order and split them into four equal parts; the values at the three split points are the quartiles.
  • The value 25% of the way up the ordered data is Q1
  • The value at the 50% position is Q2 (the median)
  • The value at the 75% position is Q3
  • Interquartile range: IQR = Q3 − Q1
  • Floor: values below Q1 − 1.5 × IQR are outliers
  • Cap: values above Q3 + 1.5 × IQR are outliers (see the sketch after this list)
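
A sketch of the fences in NumPy (the values are hypothetical); points outside [floor, cap] are treated as outliers, exactly as a box plot draws them:

```python
import numpy as np

x = np.array([7, 9, 12, 14, 15, 18, 21, 95])  # hypothetical values
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1                                  # interquartile range
floor, cap = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < floor) | (x > cap)]          # here: array([95])
```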

Outlier handling

  1. Delete records that contain outliers
    • Suitable when the anomalies are obvious and few in number
  2. Treat as missing values
    • Handle them with the missing-value methods described above
  3. Mean correction
    • Correcting with a mean is simple and efficient, with little information loss
  4. Leave unprocessed
    • If the algorithm is not sensitive to outliers, they can be left unprocessed; if it is sensitive, they should not be.

Data integration

What is data integration?

Combine data from multiple sources into a consistent data store

Classification of data integration

1. Entity recognition

  • Same name, different meaning
  • Different names, same meaning
  • Inconsistent units

Examples: Customer_id in one database and customer_number in another refer to the same entity; the data encoding for Pay_type can be “H” and “S” in one database and 1 and 2 in the other.

2. Redundant attribute recognition

An attribute is redundant if it can be derived from another attribute or set of attributes

Correlation analysis test:
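
The original shows the test as an image. For numeric attributes A and B, one common test is the Pearson correlation coefficient (supplied here for completeness, not taken from the image):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B}$$

A value near +1 or −1 suggests the attributes are strongly correlated and one of them may be redundant; a value near 0 suggests no linear relationship.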

Data transformation

What is data transformation?

To transform or unify data into a form suitable for mining

Contents involved in data transformation:

  • Smoothing: remove noise from the data
  • Aggregation: summarize or aggregate the data
  • Data generalization: use concept hierarchies to replace low-level or “raw” data with higher-level concepts
  • Normalization: scale attribute data so that it falls into a small, specified range
  • Attribute construction: construct new attributes and add them to the attribute set to aid the mining process.

What methods are involved in data transformation?

  1. Simple function transformation
  2. Normalization
    1. Why normalize?
    2. What does normalization mean?
    3. How is normalization implemented?
    4. How do you normalize data that contains outliers?
  3. Discretization of continuous attributes
    1. Unsupervised discretization
    2. Supervised discretization
  4. Attribute construction
  5. Wavelet transform

Data transformation – standardization/normalization
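
A minimal sketch of the two common scalings with scikit-learn (the feature matrix is hypothetical): min-max normalization to [0, 1] and z-score standardization.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 600.0]])                  # hypothetical features on very different scales

X_minmax = MinMaxScaler().fit_transform(X)    # (x - min) / (max - min)  -> range [0, 1]
X_zscore = StandardScaler().fit_transform(X)  # (x - mean) / std         -> mean 0, std 1
```

For data containing outliers (question 4 in the list above), `sklearn.preprocessing.RobustScaler`, which centers on the median and scales by the IQR, is one common answer.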

Continuous attribute discretization – equal-width algorithm, equal-frequency algorithm
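
A sketch of both algorithms with pandas (the ages are hypothetical): `pd.cut` produces equal-width bins, `pd.qcut` equal-frequency bins.

```python
import pandas as pd

ages = pd.Series([18, 21, 22, 23, 25, 27, 31, 32, 37, 41, 45, 61])

equal_width = pd.cut(ages, bins=3)   # 3 intervals of equal width
equal_freq = pd.qcut(ages, q=3)      # 3 intervals holding roughly equal numbers of points
```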

Continuous attribute discretization — K-means clustering algorithm
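
A sketch of one-dimensional k-means discretization with scikit-learn (the data and k are hypothetical): each value's cluster label becomes its discrete bin.

```python
import numpy as np
from sklearn.cluster import KMeans

x = np.array([1.2, 1.5, 1.7, 5.0, 5.2, 5.1, 9.8, 10.0]).reshape(-1, 1)
bins = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x)
# bins assigns each value to one of 3 clusters centered on the cluster means
```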

Continuous attribute discretization — ChiMerge algorithm

Supervised discretization

ChiMerge is a χ²-based discretization method. It uses a bottom-up strategy: recursively find the most similar adjacent intervals and merge them into larger intervals.

Process:

Initially, each distinct value of the numeric attribute A is treated as one interval, and a χ² test is performed on every pair of adjacent intervals.

The adjacent intervals with the smallest χ² value are merged, since a low χ² means their class distributions are the most similar
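
A sketch of the χ² test applied to one pair of adjacent intervals with SciPy (the class counts are hypothetical); ChiMerge would merge the pair with the smallest statistic and repeat:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: two adjacent intervals; columns: counts of each class inside them
table = np.array([[10, 2],
                  [12, 3]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
# a small chi2 (large p) means similar class distributions -> merge candidates
```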

Data reduction

What is data reduction?

  • Generate a new dataset that is much smaller but preserves the integrity of the original data

The meaning of data reduction?

  • Improve modeling accuracy; shorten the time needed for data mining; reduce the cost of storing data

Classification of data reduction

Attribute reduction

  1. Merge attributes
  2. Stepwise forward selection
  3. Stepwise backward elimination
  4. Decision tree induction
  5. Principal component analysis (see the sketch after this list)
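
A minimal PCA sketch with scikit-learn (the data are synthetic): project the attributes onto the top principal components and check how much variance survives.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))  # hypothetical 5-attribute data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # shape (100, 2)
print(pca.explained_variance_ratio_)    # variance retained per component
```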

Data reduction – dimensionality reduction

LDA – Linear discriminant analysis

LDA is a supervised dimensionality-reduction technique: every sample in its dataset comes with a class label. This differs from PCA, an unsupervised technique that ignores the class labels.

Idea: after projection, the within-class variance is minimized and the between-class variance is maximized
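
A minimal LDA sketch with scikit-learn; unlike the PCA sketch above, the class labels y drive the projection (iris is just a convenient labeled dataset):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)              # labeled data with 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)               # supervised projection to 2-D
```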

Data reduction – numerosity reduction

  • Parametric – assume the data fits some model, estimate the model's parameters, store only the parameters, and discard the data

1. Regression model

y = wx + b

x and y are numeric database attributes, and the coefficients w and b are the slope and y-intercept of the line. They are obtained by the least-squares method, which minimizes the error between the fitted line and the actual data.
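
A sketch of the parametric idea with NumPy (the data points are hypothetical): fit w and b by least squares, then store just those two numbers in place of the raw pairs.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])  # roughly y = 2x

w, b = np.polyfit(x, y, deg=1)            # least-squares slope and intercept
# keep only (w, b); reconstruct values later as y_hat = w * x + b
```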

2. Log-linear model

A log-linear model approximates a discrete multidimensional probability distribution. Taking a 3-D log-linear model as an example:

Non-parametric – histograms, clustering, sampling

Feature selection

Overview of feature engineering

Why do feature selection?

  1. Alleviate the curse of dimensionality
  2. Reduce the difficulty of the learning task

Three goals for feature selection

  1. Improve the predictive performance of the model
  2. Faster and more efficient models
  3. Provide a better understanding of the underlying process that generated the data

The principle of feature selection?

  1. The divergence of a feature
  2. The correlation between a feature and the target

Classification of features

  1. Relevant features: features related to the current learning task;
  2. Irrelevant features: features unrelated to the current learning task (the information they provide is useless for it). E.g., for predicting student grades, the student ID is an irrelevant feature.
  3. Redundant features: features whose information can be deduced from other features. For example, “area” can be derived from “length” and “width”, so it is redundant.

Feature selection vs. feature extraction

**Common ground:** both are dimensionality-reduction methods with the same purpose.

Difference:

  • Feature extraction exploits relationships between attributes, e.g., combining different attributes into new ones, and therefore changes the original feature space
  • Feature selection picks a subset of the original feature set (an inclusion relationship) and does not change the original feature space

Feature selection – Filter

Filter Methods

What is filtering?

Score each feature by divergence or relevance, set a threshold (or a number of features to keep), and select the features accordingly

1. Variance selection method

  • The variance selection method first computes the variance of each feature, then keeps the features whose variance exceeds a chosen threshold (see the sketch after this list).
  • This method can only be used when the feature values are all discrete variables; a continuous variable must be discretized before it can be used
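
A sketch with scikit-learn's VarianceThreshold (the matrix and threshold are hypothetical): columns whose variance falls below the threshold are dropped.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 0, 0],
              [0, 1, 1]])  # the first column is constant (variance 0)

X_sel = VarianceThreshold(threshold=0.1).fit_transform(X)  # constant column removed
```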

2. Correlation coefficient selection method

Judge the strength of the correlation between two variables by the size of the correlation coefficient, then select the most correlated features. The Pearson correlation coefficient is commonly used

3. Chi-square test

The chi-square value is determined by the degree of deviation between the actual observed values of the sample and the theoretically expected values

4. Mutual information method

Mutual information indicates whether there is a relationship between two variables and how strong that relationship is
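
A sketch of the chi-square and mutual-information filters via scikit-learn's SelectKBest (k is hypothetical; iris is just a convenient dataset):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)               # chi-square scores
X_mi = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)  # mutual information
```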

Residuals and the coefficient of determination

Feature selection – Wrapper – stepwise forward selection

Wrapper Methods

What is the wrapper method?

  • The wrapper method is essentially a search: it treats feature combinations as a search space, finds the optimal feature combination in that space, and returns it.

Stepwise forward:

  • Variables are gradually added, building larger and larger subsets

Steps

  1. Start with an empty model
  2. Fit 5 simple one-variable linear regression models and pick the best of the single-variable models
  3. Search the remaining 4 variables and find the one that, added to the existing model, most reduces the residual sum of squares (see the sketch after this list)
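
A sketch of this greedy loop with NumPy, scored by residual sum of squares (the data and the budget of 5 candidate variables are hypothetical; in practice you stop once the improvement becomes negligible):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                         # 5 candidate variables
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=100)

def rss(cols):
    """Residual sum of squares of a least-squares fit on the chosen columns."""
    A = np.column_stack([X[:, cols], np.ones(len(y))])  # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return ((y - A @ beta) ** 2).sum()

selected, remaining = [], list(range(X.shape[1]))
while remaining:                                      # grow the subset greedily
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
print(selected)                                       # indices 0 and 3 should come out first
```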

Feature selection – Wrapper – stepwise backward elimination

Stepwise backward elimination:

Start with the set of all variables and eliminate them one by one until the best model remains

Steps:

  1. At the beginning the model contains all the variables
  2. Remove the variable with the maximum p-value
  3. Fit the new (p−1)-variable model and again remove the variable with the maximum p-value
  4. Repeat the above steps until the stopping condition is reached (see the sketch after this list)
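
A sketch of this p-value-driven loop using statsmodels (the data and the 0.05 stopping threshold are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 1] + rng.normal(size=100)  # only variable 1 matters

cols = list(range(X.shape[1]))
while cols:
    fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = fit.pvalues[1:]             # skip the intercept's p-value
    worst = int(pvals.argmax())
    if pvals[worst] < 0.05:             # stop once every remaining variable is significant
        break
    cols.pop(worst)
print(cols)                             # variable 1 should survive
```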

Feature selection — Embedded

Embedded Methods

What is the embedded method?

  • First, a machine learning algorithm or model is trained to obtain weight coefficients for each feature; the features are then selected in descending order of those coefficients.

Embedded method with regularization

When there are many sample features but relatively few samples, the formula above easily overfits. To alleviate the overfitting problem, a regularization term can be introduced into it:

L1 regularization

L1 regularization is the sum of the absolute values of the elements of the weight vector

L2 regularization

L2 regularization is the square root of the sum of the squares of the elements of the weight vector

Both L1 and L2 regularization help reduce the risk of overfitting, but the former has an extra benefit: it yields sparse solutions more easily than the latter, i.e., the learned ω has fewer non-zero components
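
A sketch of L1-based embedded selection with scikit-learn (the data and alpha are hypothetical): Lasso drives most coefficients exactly to zero, and SelectFromModel keeps the features whose coefficients survive.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 4 * X[:, 2] - 3 * X[:, 7] + rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(lasso.coef_))    # sparse: mostly features 2 and 7
X_sel = SelectFromModel(lasso, prefit=True).transform(X)
```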

Conclusion

Data cleaning: standardize formats, remove abnormal data, correct errors, and remove duplicate data

Data integration: The organic integration of data from disparate sources

Data transformation: transform the data into a form suitable for data mining through smoothing, aggregation, generalization, normalization, and other methods

Data reduction: produce a much smaller dataset that still largely preserves the integrity of the original data, so that mining it gives the same, or nearly the same, results as before reduction

Feature selection: improve the generalization ability of the model and reduce overfitting; deepen the understanding of features and their values.

Final words

In recent years, with the rapid development of AIOps, the demand across industries for IT tools, platform capabilities, solutions, AI scenarios, and usable datasets has exploded. **Against this backdrop, Cloud Intelligence launched the AIOps Community in August 2021,** aiming to raise an open-source banner and build an active community for customers, users, researchers, and developers across industries to contribute, solve industry problems together, and advance the technology in this field.

The community has open-sourced the visual choreography platform FlyFish, the operation and maintenance management platform OMP, the cloud service management platform (Moore platform), the Hours algorithm, and other products, and has won a string of community honors in a short time.

Visual Choreography Platform – FlyFish:

Project introduction: www.cloudwise.ai/flyFish.htm…

Github address: github.com/CloudWise-O…

Gitee address: gitee.com/CloudWise/f…

Business case: www.bilibili.com/video/BV1z4…

Some large screen cases:

You can add xiaoyuerwie to join the developer exchange group and chat one-on-one with industry experts!

You can also get Cloud Intelligence AIOps materials from the assistant and keep up with FlyFish's latest progress!

(Some of the material/information in this article comes from the Internet; in case of infringement, please contact the assistant or send a private message for handling.)
