This is day 20 of my participation in the November writing challenge. For details, see: The last writing challenge of 2021

Data analysis – Data preprocessing

Handling duplicate values

duplicated() finds duplicate values

 import pandas as pd
 a=pd.DataFrame(data=[['A',19],['B',19],['C',20],['A',19],['C',20]],
                columns=['name','age'])
 print(a)
 print('--------------------------')
 a=a.duplicated()
 print(a)
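
As a rough sketch of how the check can be tuned: `keep` and `subset` are parameters of pandas' `DataFrame.duplicated`, while the variable names below are only illustrative.

```python
import pandas as pd

a = pd.DataFrame(data=[['A', 19], ['B', 19], ['C', 20], ['A', 19], ['C', 20]],
                 columns=['name', 'age'])

# Default (keep='first'): every repeat after the first occurrence is marked.
first_kept = a.duplicated()

# keep=False: every row that has a duplicate is marked, including the first one.
all_marked = a.duplicated(keep=False)

# subset: only the listed columns are compared (here: age).
by_age = a.duplicated(subset=['age'])

print(first_kept.tolist())
print(all_marked.tolist())
print(by_age.tolist())
```

The result is always one boolean per row, so it can be used directly as a filter, for example `a[all_marked]`.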

Check the data as a whole instead of row by row

any()

 import pandas as pd
 a=pd.DataFrame(data=[['A',19],['B',19],['C',20],['A',19],['C',20]],
                columns=['name','age'])
 print(a)
 print('--------------------------')
 a=any(a.duplicated())
 print(a)
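
The same whole-table check can also be written with the Series methods `any()` and `sum()`. This is only a sketch, and `dup_mask`, `has_dup` and `n_dup` are illustrative names.

```python
import pandas as pd

a = pd.DataFrame(data=[['A', 19], ['B', 19], ['C', 20], ['A', 19], ['C', 20]],
                 columns=['name', 'age'])

dup_mask = a.duplicated()        # one boolean per row
has_dup = bool(dup_mask.any())   # True if at least one row is a repeat
n_dup = int(dup_mask.sum())      # True counts as 1, so this counts the repeats

print(has_dup, n_dup)
```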

drop_duplicates() deletes duplicate values

The inplace parameter controls whether the original data is modified

 import pandas as pd
 a=pd.DataFrame(data=[['A',19],['B',19],['C',20],['A',19],['C',20]],
                columns=['name','age'])
 print(a)
 print('--------------------------')
 b=a.drop_duplicates(inplace=False)
 a.drop_duplicates(inplace=True)
 print(a)
 print('--------------------------')
 print(b)
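
A short sketch of a few more options, assuming the same sample data: `subset`, `keep` and `ignore_index` are parameters of `DataFrame.drop_duplicates`, and a call with `inplace=True` returns `None` rather than a DataFrame. The names `b` to `e` and `result` are illustrative.

```python
import pandas as pd

a = pd.DataFrame(data=[['A', 19], ['B', 19], ['C', 20], ['A', 19], ['C', 20]],
                 columns=['name', 'age'])

# inplace=False (the default) returns a new DataFrame and leaves `a` untouched.
b = a.drop_duplicates()

# keep='last' keeps the last occurrence; ignore_index=True renumbers rows from 0.
c = a.drop_duplicates(keep='last', ignore_index=True)

# subset limits the comparison to the listed columns.
d = a.drop_duplicates(subset=['age'])

# inplace=True changes the object itself and returns None.
e = a.copy()
result = e.drop_duplicates(inplace=True)

print(b)
print(c)
print(d)
print(result)
```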

Handling missing values

NaN represents a missing value

 import pandas as pd
 a=pd.read_csv(r'text.csv')
 print(a)

isnull() checks every position and reports whether its element is missing

 import pandas as pd
 a=pd.read_csv(r'text.csv')
 print(a.isnull())

any() checks whether each column (or, with axis=1, each row) contains a missing element

 import pandas as pd
 a=pd.read_csv(r'text.csv')
 print(a.isnull().any())
 print(a.isnull().any(axis=1))
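
Because the contents of text.csv are not shown, the sketch below uses a small made-up DataFrame instead. The columns `name` and `score` are assumptions. It also adds `sum()` to count the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for text.csv; the column names are assumptions.
a = pd.DataFrame({'name': ['A', None, 'C'],
                  'score': [1.0, np.nan, np.nan]})

mask = a.isnull()               # same shape as `a`, True where a value is missing
col_has_na = mask.any()         # per column (axis=0 is the default)
row_has_na = mask.any(axis=1)   # per row
na_per_col = mask.sum()         # number of missing values in each column

print(col_has_na)
print(row_has_na)
print(na_per_col)
```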

Deleting with del and dropna()

 import pandas as pd
 a=pd.read_csv(r'text.csv')
 del a['name']
 print(a)

 import pandas as pd
 a=pd.read_csv(r'text.csv')
 b=a.dropna(axis=0)
 print(b)
 c=a.dropna(axis=1)
 print(c)

del deletes the specified column; dropna() deletes the rows (axis=0) or columns (axis=1) that contain missing values.
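
dropna() can also be made less strict. The sketch below uses made-up data (the column names are assumptions) and the `how`, `thresh` and `subset` parameters of `DataFrame.dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are assumptions.
a = pd.DataFrame({'name': ['A', 'B', None, None],
                  'age': [19, np.nan, 20, np.nan],
                  'city': ['X', 'Y', None, None]})

rows_any = a.dropna()                    # drop rows with at least one missing value
rows_all = a.dropna(how='all')           # drop rows only if every value is missing
rows_thresh = a.dropna(thresh=2)         # keep rows with at least 2 non-missing values
rows_subset = a.dropna(subset=['age'])   # only look at the age column

print(rows_any)
print(rows_all)
print(rows_thresh)
print(rows_subset)
```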

fillna() fills missing values

import pandas as pd
a=pd.read_csv(r'text.csv')
a=a.fillna('wu')
print(a)
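
Filling every column with the same string can leave text inside numeric columns. As a sketch, fillna() also accepts a dict that maps column names to fill values. The data and column names below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are assumptions.
a = pd.DataFrame({'name': ['A', None, 'C'],
                  'age': [19.0, np.nan, 20.0]})

# A dict maps column name -> fill value, so each column gets a suitable default.
b = a.fillna({'name': 'wu', 'age': 0})

print(b)
```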

Fill from the previous (or next) row

pad/ffill: fill from the previous row
backfill/bfill: fill from the next row

import pandas as pd
a=pd.read_csv(r'text.csv')
print(a)
print('---------------------')
# ffill() is equivalent to fillna(method='pad'), which newer pandas versions deprecate
b=a.ffill()
print(b)
print('---------------------')
# bfill() is equivalent to fillna(method='bfill')
c=a.bfill()
print(c)
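
A small sketch on made-up data (the `score` column is an assumption) showing the method forms `ffill()` and `bfill()`, plus the `limit` parameter, which caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a gap of two missing values.
a = pd.DataFrame({'score': [1.0, np.nan, np.nan, 4.0]})

b = a.ffill()          # copy the previous row's value downward
c = a.bfill()          # copy the next row's value upward
d = a.ffill(limit=1)   # fill at most one consecutive missing value per gap

print(b)
print(c)
print(d)
```

Note that a missing value in the first row cannot be forward-filled and one in the last row cannot be back-filled, since there is no neighbouring row to copy from.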

Filling numeric data

Mean: mean()

Fill each column with its own mean

import pandas as pd
a=pd.read_csv(r'text.csv')
print(a)
print('---------------------')
# numeric_only=True skips text columns, which have no mean
a=a.fillna(a.mean(numeric_only=True))
print(a)

Median: median()

import pandas as pd
a=pd.read_csv(r'text.csv')
print(a)
print('---------------------')
# numeric_only=True skips text columns, which have no median
a=a.fillna(a.median(numeric_only=True))
print(a)
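
To compare the two fills, here is a sketch on made-up data (the column names are assumptions). It selects the numeric columns with `select_dtypes` first, so text columns are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 60.0 acts as an outlier.
a = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                  'age': [10.0, np.nan, 20.0, 60.0]})

# Only numeric columns have a mean or median.
num_cols = a.select_dtypes(include='number').columns

by_mean = a.copy()
by_mean[num_cols] = a[num_cols].fillna(a[num_cols].mean())

by_median = a.copy()
by_median[num_cols] = a[num_cols].fillna(a[num_cols].median())

print(by_mean)
print(by_median)
```

The mean is pulled upward by the outlier, while the median is not, which is one reason to prefer the median for skewed columns.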

Filling string data

Mode: mode()

import pandas as pd
a=pd.read_csv(r'text.csv')
print(a)
print('---------------------')
for i in a.columns:
    a[i] = a[i].fillna(a[i].mode()[0])
print(a)
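
One edge case: a column that is entirely missing has no mode, so indexing the first element of `mode()` fails. A minimal sketch with a guard, on made-up data (the column names are assumptions):

```python
import pandas as pd

# Hypothetical data; the note column is entirely missing.
a = pd.DataFrame({'city': ['X', 'Y', 'X', None],
                  'note': [None, None, None, None]})

for col in a.columns:
    modes = a[col].mode()    # missing values are ignored; ties return several values
    if not modes.empty:      # an all-missing column yields an empty result
        a[col] = a[col].fillna(modes.iloc[0])

print(a)
```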

Data transformation

map() data conversion

import pandas as pd
data={'sex':[1,0,1,1,0]}
a=pd.DataFrame(data)
a['sex-T']=a['sex'].map({1:'male',0:'female'})
print(a)
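
Values that are missing from the mapping dict come back as NaN. The sketch below shows this with a made-up extra code `2` and one possible way to label such values. The column name `sex-T2` and the label `'unknown'` are illustrative.

```python
import pandas as pd

a = pd.DataFrame({'sex': [1, 0, 2]})   # 2 is a code the mapping does not cover

mapping = {1: 'male', 0: 'female'}
a['sex-T'] = a['sex'].map(mapping)                      # 2 becomes NaN
a['sex-T2'] = a['sex'].map(mapping).fillna('unknown')   # label unmapped codes

print(a)
```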

Dummy variable

import pandas as pd
data={'sex':['male','female','male','female','confidential']}
a=pd.DataFrame(data)
a=pd.get_dummies(a)
print(a)
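
A sketch of two commonly used options of `pd.get_dummies`: `dtype=int` asks for 0/1 integers (newer pandas versions return booleans by default), and `drop_first=True` drops the first category to avoid a redundant column. The names `full` and `reduced` are illustrative.

```python
import pandas as pd

a = pd.DataFrame({'sex': ['male', 'female', 'male', 'female', 'confidential']})

# One 0/1 column per category, named <column>_<category>.
full = pd.get_dummies(a, columns=['sex'], dtype=int)

# drop_first=True drops the first category in sorted order (confidential).
reduced = pd.get_dummies(a, columns=['sex'], drop_first=True, dtype=int)

print(full)
print(reduced)
```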
