Notes arranged to follow the Caicai (菜菜) sklearn course, so they are easy to remember and understand

The code location is as follows:

Missing values

Dealing with missing values is a very important part of data preprocessing.

SimpleImputer lives in the sklearn.impute module.
import pandas as pd
data = pd.read_csv("Narrativedata.csv",index_col = 0)
data.head()

info() shows that Age and Embarked contain missing values:
data.info()

"""
Int64Index: 891 entries, 0 to 890
Data columns (total 4 columns):
Age         714 non-null float64
Sex         891 non-null object
Embarked    889 non-null object
Survived    891 non-null object
dtypes: float64(1), object(3)
memory usage: 34.8+ KB
"""
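As a complement to info(), missing entries can also be counted directly. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical stand-ins, not the Titanic file) and assumes the usual pandas `isnull()` idiom.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
    "Survived": ["No", "Yes", "Yes", "Yes", "No"],
})

# Number of missing entries per column
missing_counts = toy.isnull().sum()
# Fraction of missing entries per column
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

The same two lines should work unchanged on the `data` frame loaded above.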

Here we use Titanic data with three features, one numeric and two character (string), plus a character label. From here on, this dataset serves as the running example for getting familiar with the data preprocessing methods in sklearn.

impute.SimpleImputer

class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)

Parameters
Parameter: meaning and input
missing_values: Tells SimpleImputer what a missing value looks like in the data. Defaults to the null value np.nan.
strategy: The strategy used to fill in missing values. Defaults to "mean".
  • "mean": fill with the mean (numeric features only)
  • "median": fill with the median (numeric features only)
  • "most_frequent": fill with the mode (numeric and character features)
  • "constant": fill with the value given in the fill_value parameter (numeric and character features)
fill_value: Used when strategy is set to "constant". Accepts a string or a number as the value to fill in.
copy: Defaults to True, which creates a copy of the feature matrix. Otherwise the missing values are filled into the original feature matrix.
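The four strategies can be compared side by side on tiny inputs. This is a minimal sketch with made-up arrays (`numbers` and `ports` are hypothetical); it assumes a scikit-learn version in which SimpleImputer accepts object-dtype string columns for the "most_frequent" and "constant" strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric column with one missing entry
numbers = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])
# Made-up string column with one missing entry (object dtype keeps the NaN)
ports = np.array([["S"], ["C"], ["S"], [np.nan]], dtype=object)

# Numeric-only strategies
mean_filled = SimpleImputer(strategy="mean").fit_transform(numbers)
median_filled = SimpleImputer(strategy="median").fit_transform(numbers)

# Strategies that also work on character features
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(ports)
const_filled = SimpleImputer(
    strategy="constant", fill_value="Unknown"
).fit_transform(ports)

print(mean_filled.ravel())
print(median_filled.ravel())
print(mode_filled.ravel())
print(const_filled.ravel())
```

The mean and median differ here because the single large value 10.0 pulls the mean upward, which is one reason the median is often preferred for skewed features.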
  • Our input data must be 2-D, so reshape it with reshape(-1, 1)
# reshape(-1, 1) adds a dimension, because fit requires 2-D data
age = data.loc[:,"Age"].values.reshape(-1,1)
age[:20]

""" array([[22.], [38.], [26.], [35.], [35.], [nan], [54.], [ 2.], [27.], [14.], [ 4.], [58.], [20.], [39.], [14.], [55.], [ 2.], [nan], [31.], [nan]]) """
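To make the shape requirement concrete, the following sketch compares a 1-D array with its 2-D column form. The ages are made up, and `age_1d` and `age_2d` are illustrative names only.

```python
import numpy as np

# Made-up ages; one value is missing
age_1d = np.array([22.0, 38.0, np.nan, 35.0])
print(age_1d.shape)

# -1 lets NumPy infer the number of rows; 1 fixes a single column
age_2d = age_1d.reshape(-1, 1)
print(age_2d.shape)
```

A single column selected from a DataFrame is 1-D in the same way, which is why the reshape is needed before calling fit.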
  • Fill with the mean, the median, and 0
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer()   # Fill with average value by default
imp_median = SimpleImputer(strategy="median")   # median fill
imp_0 = SimpleImputer(strategy="constant",fill_value=0)  # fill with constant 0

imp_mean = imp_mean.fit_transform(age)
imp_median = imp_median.fit_transform(age)
imp_0 = imp_0.fit_transform(age)

imp_mean[:20]
""" array([[22.], [38.], [26.], [35.], [35.], [29.69911765], [54.], [2.], [27.], [14.], [4.], [58.], [20.], [39.], [14.], [55.], [2.], [29.69911765], [31.], [29.69911765]]) """
# Output of the next two is omitted
imp_median[:20]

imp_0[:20]
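Note that the snippet above rebinds imp_mean, imp_median and imp_0 to the filled arrays, so the fitted imputer objects are no longer available afterwards. If the learned fill value is needed later, for example to fill new data with the statistic learned from the training data, one option is to keep the imputer in its own variable. A minimal sketch with made-up arrays (`train` and `new` are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training data and new data for a single feature
train = np.array([[20.0], [30.0], [np.nan], [40.0]])
new = np.array([[np.nan], [25.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)              # learns the mean of the observed values
print(imputer.statistics_)      # learned fill value, one per column

train_filled = imputer.transform(train)
new_filled = imputer.transform(new)   # reuses the mean learned from train
```

Here fit and transform are called separately instead of fit_transform, so the same fitted object can be applied to more than one array.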
  • Write the filled values back
# Write the median-filled values back into Age
data.loc[:,"Age"] = imp_median
data.info()
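The array returned by fit_transform is 2-D with a single column. When writing it back, flattening it first makes the intended shape explicit. The sketch below uses a made-up frame (`toy` is hypothetical) and is one possible way to do the assignment, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column with two missing ages
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 40.0]})

# Double brackets select a one-column DataFrame, which is already 2-D
filled = SimpleImputer(strategy="median").fit_transform(toy[["Age"]])
print(filled.shape)

# Flatten to 1-D before assigning back to the column
toy["Age"] = filled.ravel()
print(toy["Age"].isnull().sum())
```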
  • Fill Embarked with the mode
# Fill Embarked with the mode (the most frequent value)
Embarked = data.loc[:,"Embarked"].values.reshape(-1,1)
imp_mode = SimpleImputer(strategy = "most_frequent")
data.loc[:,"Embarked"] = imp_mode.fit_transform(Embarked)
data.info()
It’s actually easier to fill in with Pandas and Numpy
  • Use fillna and dropna
# Pandas and NumPy can also be used to fill in the data
import pandas as pd
data = pd.read_csv("Narrativedata.csv",index_col = 0)
data.head()

# Fill Age with its mean using .fillna
data.loc[:,"Age"] = data.loc[:,"Age"].fillna(data.loc[:,"Age"].mean())

# .dropna(axis=0) deletes all rows that contain missing values; .dropna(axis=1) deletes all columns that contain missing values
data.dropna(axis=0,inplace=True)
data.info()
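The same pandas approach extends to the median and the mode, and dropna can be limited to chosen columns. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", np.nan, "S", "C"],
    "Survived": ["No", "Yes", np.nan, "Yes"],
})

# Median for the numeric column
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
# mode() returns a Series because ties are possible, so take its first entry
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
# Drop only the rows where the label is missing
toy = toy.dropna(subset=["Survived"])

print(toy)
```

Restricting dropna with subset avoids discarding rows whose only gaps are in columns that have already been filled or that do not matter for the task.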