The article introduces:

Dealing with missing values, dealing with noisy data and outliers, handling duplicate data, data aggregation, sampling, discretization, and principal component analysis.

Data Quality Issues

Poor data quality can adversely affect data mining. Common data quality problems include noise, outliers, missing values, and duplicate data. This section presents examples of Python code to alleviate some of these data quality issues. We start with an example dataset from the UCI machine learning repository that contains information about breast cancer patients. We will first download the dataset using the Pandas read_csv() function and display its first five data points.

import pandas as pd

data = pd.read_csv('archive.ics.uci.edu/ml/machine-…', header=None)

data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']

data = data.drop(['Sample code'], axis=1)

print('Number of instances = %d' % (data.shape[0]))

print('Number of attributes = %d' % (data.shape[1]))

data.head()

1.1 Missing Data

It is common for an object to be missing one or more attribute values. In some cases, the information was simply not collected; in other cases, some attributes do not apply to the data instance. This section presents examples of different ways to handle missing values. According to the description of the data (archive.ics.uci.edu/ml/datasets…), missing values in this dataset are encoded as '?'. The code below replaces them with NaN.

import numpy as np

data = data.replace('?', np.NaN)

print('Number of instances = %d' % (data.shape[0])) 

print('Number of attributes = %d' % (data.shape[1])) 

print('Number of missing values:') 

for col in data.columns: 

    print('\t%s: %d' % (col,data[col].isna().sum()))

Note that only the 'Bare Nuclei' column contains missing values. In the following example, the missing values in the 'Bare Nuclei' column are replaced by the median value of that column. The values before and after replacement are shown for a subset of data points.

data2 = pd.to_numeric(data['Bare Nuclei'])   # convert the column from strings to numbers so that median() works

print('Before replacing missing values:')

print(data2[20:25])

data2 = data2.fillna(data2.median())

print('\nAfter replacing missing values:')

print(data2[20:25])

Another common approach is to discard data points that contain missing values rather than replace them. This can be done easily by applying the dropna() function to the DataFrame, as shown below.
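For example, the following lines are a minimal illustration of this alternative; the result is stored in a new variable so the original DataFrame is preserved.

data2 = data.dropna()

print('Number of rows after discarding missing values = %d' % (data2.shape[0]))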

1.2 Outliers

Outliers are data instances whose characteristics are very different from the rest of the dataset. In the sample code below, we draw a boxplot to identify the columns in the table that contain outliers. Note that the values in all columns (except 'Bare Nuclei') are originally stored as 'int64', while the values in the 'Bare Nuclei' column are stored as string objects (because the column originally contained '?' for missing values). Therefore, we must first convert the column to a numeric value before creating the boxplot. Otherwise, the column will not be displayed when the boxplot is drawn.

data2 = data.drop(['Class'],axis=1) 

data2['Bare Nuclei'] = pd.to_numeric(data2['Bare Nuclei']) 

data2.boxplot(figsize=(20,3))

The boxplot shows that only five columns (Marginal Adhesion, Single Epithelial Cell Size, Bland Chromatin, Normal Nucleoli, and Mitoses) contain abnormally high values. To discard the outliers, we can compute the Z-score for each attribute and remove instances containing an unusually high or low Z-score (for example, Z > 3 or Z <= -3).

The following code shows the results of standardizing data columns. Note that missing values (NaN) are not affected by the standardization process.

Z = (data2-data2.mean())/data2.std() 

Z[20:25]

The following code shows the result of discarding instances with Z > 3 or Z <= -3 in any column.

print('Number of rows before discarding outliers = %d' % (Z.shape[0])) 

Z2 = Z.loc[((Z > -3).sum(axis=1)==9) & ((Z <= 3).sum(axis=1)==9),:]

print('Number of rows after discarding outliers = %d' % (Z2.shape[0]))

1.3 Duplicate Data

Some datasets, especially those obtained by merging multiple data sources, may contain duplicate or nearly duplicate instances. The term deduplication is usually used to refer to the process of dealing with duplicate data problems. In the example below, we first examine duplicate instances in the breast cancer dataset.

dups = data.duplicated()

print('Number of duplicate rows = %d' % (dups.sum()))

data.loc[[11,28]]

The duplicated() function returns a Boolean Series indicating whether each row is a duplicate of an earlier row in the table. It turns out that there are 236 duplicate rows in the breast cancer dataset. For example, the instance with row index 11 has the same attribute values as the instance with row index 28. Although such duplicate rows may correspond to samples from different individuals, in this hypothetical example we assume that the duplicates are samples taken from the same individual and show below how to remove the duplicated rows.
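A minimal sketch of one way to remove them, using the pandas drop_duplicates() function:

print('Number of rows before discarding duplicates = %d' % (data.shape[0]))

data2 = data.drop_duplicates()

print('Number of rows after discarding duplicates = %d' % (data2.shape[0]))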

Aggregation

Data aggregation is a preprocessing task in which the values of two or more objects are combined into a single object. The motivations for aggregation include (1) reducing the size of the data to be processed, (2) changing the granularity of analysis (from fine scale to coarse scale), and (3) improving the stability of the data. In the following example, we use a daily precipitation time series from a weather station located at Detroit Metro Airport. The original data was obtained from the Climate Data Online site (www.ncdc.noaa.gov/cdo-web/). The following code loads the precipitation time series and plots its daily values.

daily = pd.read_csv('DTW_prec.csv', header='infer')

daily.index = pd.to_datetime(daily['DATE'])

daily = daily['PRCP']

ax = daily.plot(kind='line', figsize=(15,3))

ax.set_title('Daily Precipitation (variance = %.4f)' % (daily.var()))

Observe that the daily time series appears quite chaotic, varying greatly from one time step to the next. The time series can be grouped and aggregated by month to obtain total monthly precipitation values. The resulting monthly series varies more smoothly than the daily one.

monthly = daily.groupby(pd.Grouper(freq='M')).sum()

ax = monthly.plot(kind='line', figsize=(15,3))

ax.set_title('Monthly Precipitation (variance = %.4f)' % (monthly.var()))

In the following example, the daily precipitation time series is grouped and summarized by year to obtain the annual precipitation value.

annual = daily.groupby(pd.Grouper(freq='Y')).sum() 

ax = annual.plot(kind='line', figsize=(15,3))

ax.set_title('Annual Precipitation (variance = %.4f)' % (annual.var()))

Sampling

Sampling is a commonly used approach that facilitates (1) data reduction for exploratory data analysis and for scaling algorithms to big data applications, and (2) quantifying the uncertainty that arises from varying data distributions. There are many methods of sampling, such as sampling without replacement, in which each selected instance is removed from the dataset, and sampling with replacement, in which each selected instance is not removed and may therefore be selected more than once in the sample. In the following example, we apply sampling both with and without replacement to the breast cancer dataset obtained from the UCI machine learning repository. We first display the first five records of the table, as shown below.
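The first five records can be displayed with head():

data.head()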

In the code below, a sample of size 3 is randomly selected (without replacement) from the original data.

sample = data.sample(n=3)

sample

In the next example, we randomly select 1% of the data (without replacement) and display the selected sample. The random_state argument of the function specifies the seed value of the random number generator.

sample = data.sample(frac=0.01, random_state=1)

sample

Finally, we sample with replacement to create a sample whose size is equal to 1% of the total data. You should be able to observe duplicate instances in the sample by increasing the sample size.

sample = data.sample(frac=0.01, replace=True, random_state=1)

sample

Discretization

Discretization is a data preprocessing step typically used to convert a continuous-valued attribute into a categorical attribute. The following example illustrates two simple but widely used unsupervised discretization methods (equal width and equal frequency) applied to the 'Clump Thickness' attribute of the breast cancer dataset. First, we draw a histogram showing the distribution of the attribute values. The value_counts() function can also be used to count the frequency of each attribute value.

data['Clump Thickness'].hist(bins=10)

data['Clump Thickness'].value_counts(sort=False)

For the equal-width method, we can use the cut() function to split the attribute into four bins of equal interval width. The value_counts() function can then be used to determine the number of instances in each bin.

bins = pd.cut(data['Clump Thickness'],4)

bins.value_counts(sort=False)

For the equal-frequency method, we can use the qcut() function to divide the values into four bins such that each bin contains nearly the same number of instances.

bins = pd.qcut(data['Clump Thickness'],4)

bins.value_counts(sort=False)

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a classical method for reducing the number of attributes in data by projecting the data from its original high-dimensional space into a lower-dimensional space. The new attributes created by PCA (also known as components) have the following properties: (1) they are linear combinations of the original attributes, (2) they are orthogonal (perpendicular) to each other, and (3) they capture the maximum amount of variation in the data. The following example illustrates the application of PCA to an image dataset of 16 RGB files, each 111 x 111 pixels in size. The code reads each image file and flattens the RGB image into 111 x 111 x 3 = 36,963 feature values, producing a 16 x 36,963 data matrix.
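The loading code is not reproduced in the article; the sketch below shows one way to build such a matrix with matplotlib, assuming the 16 images are stored as pics/Picture1.jpg through pics/Picture16.jpg (hypothetical file names).

import numpy as np
import matplotlib.image as mpimg

numImages = 16
imgData = np.zeros(shape=(numImages, 111*111*3))

for i in range(1, numImages+1):
    # hypothetical file names; adjust to wherever the images are stored
    filename = 'pics/Picture' + str(i) + '.jpg'
    img = mpimg.imread(filename)     # array of shape (111, 111, 3)
    imgData[i-1] = img.flatten()     # 36,963 feature values per image

print(imgData.shape)                 # (16, 36963)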

Using PCA, the data matrix is projected onto its first two principal components. The projected values of the original image data are stored in a pandas DataFrame object named projected, as shown in the sketch below.
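A sketch of this projection step using scikit-learn's PCA (the article does not name the library used; the variable imgData comes from the loading sketch above).

from sklearn.decomposition import PCA

numComponents = 2
pca = PCA(n_components=numComponents)
pca.fit(imgData)

# project the 16 x 36,963 matrix onto the first two principal components
projected = pd.DataFrame(pca.transform(imgData), columns=['pc1', 'pc2'])

projected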

Finally, we draw a scatter plot to display the projected values. The images of burgers, drinks, and pasta are projected onto roughly the same region, whereas the images of fried chicken (shown as black squares) are harder to distinguish.
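A minimal plotting sketch; the list food, assigning a category to each image in file order (four images per category), is a hypothetical assumption since the article does not provide the labels.

import matplotlib.pyplot as plt

# hypothetical labels, one per image, in the same order as the rows of projected
food = ['burger']*4 + ['drink']*4 + ['pasta']*4 + ['fried chicken']*4

markers = {'burger': 'b+', 'drink': 'ro', 'pasta': 'gd', 'fried chicken': 'ks'}
for label, marker in markers.items():
    rows = [i for i, f in enumerate(food) if f == label]
    plt.plot(projected.loc[rows, 'pc1'], projected.loc[rows, 'pc2'], marker, label=label)

plt.legend()
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()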