Data Preprocessing and Feature Engineering Summary - Missing Values (PART 2)

These notes are organized to follow the course they are based on, which makes the material easier to remember and understand.

The code location is as follows:

Missing values

Dealing with missing values is a very important part of data preprocessing.

# Missing values are filled with SimpleImputer from the sklearn.impute module
import pandas as pd
data = pd.read_csv("Narrativedata.csv", index_col=0)
data.head()

# info() shows that Age and Embarked contain missing values
data.info()

"""
Int64Index: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         714 non-null float64
Sex         891 non-null object
Embarked    889 non-null object
Survived    891 non-null object
dtypes: float64(1), object(3)
memory usage: 34.8+ KB
"""
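
As a complement to info(), the missing entries can also be counted directly. The following is a minimal sketch on a small hypothetical frame; the `toy` DataFrame and its values are made up for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for Narrativedata.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Share of missing entries in each column
missing_ratio = toy.isnull().mean()
print(missing_ratio)
```

Counting per column makes it easy to see which features need filling and how much of each is missing.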

The data used here comes from the Titanic dataset. It has three features, one numeric (Age) and two string-valued (Sex and Embarked), and the label (Survived) is also string-valued. From here on, this data serves as the running example for the data preprocessing methods in sklearn.

impute.SimpleImputer

class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)

Parameters

Parameter	Meaning & Input
missing_values	Tells SimpleImputer what a missing value looks like in the data. The default is the null value np.nan.
strategy	How missing values are filled. The default is "mean". "mean" fills with the mean (numeric features only); "median" fills with the median (numeric features only); "most_frequent" fills with the mode (numeric and string features); "constant" fills with the value given in the fill_value parameter (numeric and string features).
fill_value	Used only when strategy is set to "constant". Accepts a string or a number, which is the value to fill with.
copy	Defaults to True, in which case a copy of the feature matrix is created. Otherwise the missing values are filled in the original feature matrix.
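
The parameters in the table can be tried out on tiny arrays. This is a minimal sketch, not the walkthrough's own code: the arrays `ports` and `ages`, the fill value "Unknown", and the use of -1 as a missing-value placeholder are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical string-valued column with one missing entry
ports = np.array([["S"], ["C"], [np.nan], ["S"]], dtype=object)

# "most_frequent" fills with the mode and also works on string data
by_mode = SimpleImputer(strategy="most_frequent").fit_transform(ports)

# "constant" fills with fill_value ("Unknown" is an arbitrary choice)
by_const = SimpleImputer(strategy="constant", fill_value="Unknown").fit_transform(ports)

# missing_values can name a different placeholder, here -1 in numeric data
ages = np.array([[22.0], [-1.0], [26.0]])
by_mean = SimpleImputer(missing_values=-1, strategy="mean").fit_transform(ages)

print(by_mode.ravel())
print(by_const.ravel())
print(by_mean.ravel())
```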

The input data must be two-dimensional, so the column is reshaped with reshape(-1, 1)

# reshape(-1, 1) raises the column to two dimensions, because fit only accepts 2-D data
age = data.loc[:,"Age"].values.reshape(-1, 1)
age[:20]

""" array([[22.], [38.], [26.], [35.], [35.], [nan], [54.], [ 2.], [27.], [14.], [ 4.], [58.], [20.], [39.], [14.], [55.], [ 2.], [nan], [31.], [nan]]) """
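
To make the shape change concrete, here is a small sketch with a hypothetical three-element array (the values are made up). The -1 tells NumPy to infer the number of rows, and the 1 asks for a single column.

```python
import numpy as np

ages_1d = np.array([22.0, 38.0, np.nan])  # shape (3,): one-dimensional
ages_2d = ages_1d.reshape(-1, 1)          # shape (3, 1): 3 rows, 1 column

print(ages_1d.shape)
print(ages_2d.shape)
```

Note that the separator is a comma: reshape(-1, 1) passes two integers, whereas reshape(-1.1) passes a single float and fails.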

Filling with the mean, the median, and 0

from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer()   # fills with the mean by default
imp_median = SimpleImputer(strategy="median")   # fill with the median
imp_0 = SimpleImputer(strategy="constant", fill_value=0)  # fill with the constant 0

# fit_transform returns the filled array; from here on each name holds that array, not the imputer
imp_mean = imp_mean.fit_transform(age)
imp_median = imp_median.fit_transform(age)
imp_0 = imp_0.fit_transform(age)

imp_mean[:20]
""" array([[22.], [38.], [26.], [35.], [35.], [29.69911765], [54.], [ 2.], [27.], [14.], [ 4.], [58.], [20.], [39.], [14.], [55.], [ 2.], [29.69911765], [31.], [29.69911765]]) """

# Output omitted for the next two
imp_median[:20]

imp_0[:20]
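
The three strategies can be compared side by side on a small array. This is a minimal sketch with made-up values; it keeps each imputer and its result under separate, hypothetical names so that the fitted imputer stays available for inspection.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing entry
ages = np.array([[10.0], [20.0], [np.nan], [60.0]])

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)

filled_mean = mean_imputer.fit_transform(ages)
filled_median = median_imputer.fit_transform(ages)
filled_zero = zero_imputer.fit_transform(ages)

# statistics_ holds the fill value learned for each column during fit
print(mean_imputer.statistics_)
print(median_imputer.statistics_)
print(filled_zero.ravel())
```

Keeping the fitted imputer around also allows the same learned fill value to be applied to other data with transform.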

Applying the fill

# Write the median-filled values back into the Age column
data.loc[:,"Age"] = imp_median
data.info()

Filling Embarked with the mode

# Fill Embarked with its most frequent value
Embarked = data.loc[:,"Embarked"].values.reshape(-1, 1)
imp_mode = SimpleImputer(strategy = "most_frequent")
data.loc[:,"Embarked"] = imp_mode.fit_transform(Embarked)
data.info()

It’s actually easier to fill in with Pandas and Numpy

The pandas methods fillna and dropna are used

# Fill in the data with pandas and NumPy
import pandas as pd
data = pd.read_csv("Narrativedata.csv", index_col=0)
data.head()

# Fill Age with its mean using fillna
data.loc[:,"Age"] = data.loc[:,"Age"].fillna(data.loc[:,"Age"].mean())

# .dropna(axis=0) deletes every row that contains a missing value; .dropna(axis=1) deletes every column that contains one
data.dropna(axis=0, inplace=True)
data.info()
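
The same pandas approach extends to the median and the mode. The sketch below uses a hypothetical `toy` frame with made-up values, and the choice to drop rows based only on the Sex column is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 30.0],
    "Embarked": ["S", None, "S", "C"],
    "Sex": ["male", "female", None, "male"],
})

# Fill the numeric column with its median
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Fill a string column with its mode; mode() returns a Series, so take the first entry
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Drop only the rows where Sex is still missing
cleaned = toy.dropna(subset=["Sex"])
print(cleaned)
```

Using subset restricts the row deletion to the named columns, which avoids discarding rows over gaps in columns that do not matter for the task.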
