
【 Key points 】

2. How to find missing values
3. How to discard missing values
4. How to fill missing values

In this episode, we’ll look at how missing values are handled in Pandas.

In NumPy and Pandas, missing values are represented by NaN (Not a Number), which is a special floating-point value:

import pandas as pd
import numpy as np

var = np.array([1, np.nan, 3, 4])
print(var)
print(var.dtype)

[  1.  nan   3.   4.]
float64

The result shows that NumPy stores the whole array with a native floating-point dtype (float64) so that it can hold the missing value.

An important property of NaN is that any arithmetic operation involving it also results in NaN:

import pandas as pd
import numpy as np

var = np.array([1, 2, np.nan, 4])
print(np.nan + 1)
print(np.nan * 5)
print(var.sum())
print(var.max())

nan
nan
nan
nan

Of course, NumPy also offers a way to ignore NaN values in such operations. It provides NaN-aware versions of its aggregation functions:

import pandas as pd
import numpy as np

var = np.array([1, 2, np.nan, 4])
print(np.nansum(var))
print(np.nanmax(var))
print(np.nanmin(var))

7.0
4.0
1.0
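
To see what these functions do under the hood, here is a minimal sketch that locates the NaN entries with np.isnan() and sums the rest by hand. The variable names mask and manual_sum are only illustrative and do not come from the original example.

```python
import numpy as np

var = np.array([1, 2, np.nan, 4])

# NaN never compares equal to anything, including itself,
# so use np.isnan() rather than == to locate it.
mask = np.isnan(var)
print(mask)

# Summing only the non-NaN entries gives the same result as np.nansum().
manual_sum = var[~mask].sum()
print(manual_sum, np.nansum(var))
```

The Boolean mask idea used here is the same one Pandas relies on when it looks for missing values.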

Pandas handles missing values in a similar way. When NaN marks a missing number, the data is stored as floating point; when it marks a missing value among Python objects, the dtype is object.

If you set a value in a Series of integers to missing, the entire Series is converted to floating point:

import pandas as pd
import numpy as np

var = pd.Series([1,2],dtype=int)
print(var)
var[0] = np.nan
print(var)

0    1
1    2
dtype: int32

0    NaN
1    2.0
dtype: float64
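
If the integers need to stay integers, one option outside the scope of the original example is the nullable "Int64" extension dtype in Pandas. The sketch below is only an illustration under that assumption; the names as_float and as_nullable_int are made up for it.

```python
import numpy as np
import pandas as pd

# Building the Series with a NaN already in it yields float64.
as_float = pd.Series([1, np.nan, 3])
print(as_float.dtype)

# The nullable "Int64" extension dtype keeps the integers
# and marks the gap with pd.NA instead of NaN.
as_nullable_int = pd.Series([1, None, 3], dtype="Int64")
print(as_nullable_int)
print(as_nullable_int.isnull())
```

Note the capital "I" in "Int64": it is a different dtype from NumPy's plain int64.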

Similarly, a Series of strings has dtype object, so setting some of its values to missing leaves the dtype as object:

import pandas as pd
import numpy as np

var = pd.Series(['aa','bb'])
print(var)
var[0] = np.nan
print(var)

0    aa
1    bb
dtype: object

0    NaN
1     bb
dtype: object

How to handle missing values when using Pandas

The first is how to find missing values. Missing values can be found using isnull() and notnull().

import pandas as pd
import numpy as np

var = pd.Series(['aa',np.nan,1,np.nan])
print(var.isnull())
print(var.notnull())

0    False
1     True
2    False
3     True
dtype: bool

0     True
1    False
2     True
3    False
dtype: bool

Both methods return a Boolean mask. isnull() returns True at the missing positions, whereas notnull() does the opposite.

This Boolean mask data can be used as a filtering tool:

import pandas as pd
import numpy as np

var = pd.Series(['aa',np.nan,1,np.nan])
print(var[var.notnull()])

0    aa
2     1
dtype: object
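
The mask is also handy for counting. As a small sketch, summing the result of isnull() gives the number of missing entries, because True counts as 1. The DataFrame and its column names 'x' and 'y' are invented for this illustration.

```python
import numpy as np
import pandas as pd

var = pd.Series(['aa', np.nan, 1, np.nan])

# True counts as 1 when summed, so this is the number of missing entries.
n_missing = var.isnull().sum()
print(n_missing)

# The same idea on a DataFrame gives one count per column.
df = pd.DataFrame({'x': [1, np.nan, 3], 'y': [np.nan, np.nan, 6]})
missing_per_column = df.isnull().sum()
print(missing_per_column)
```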

The natural next step is dealing with missing values, and the easiest way to do that is to discard them.

For a Series, dropna() simply discards the missing entries:

import pandas as pd
import numpy as np

var = pd.Series(['aa',np.nan,1,np.nan])
print(var.dropna())

0    aa
2     1
dtype: object

DataFrame, however, is a bit more complicated. It cannot simply remove a missing value by itself. It either removes the entire row where the missing value is located or removes the entire column.

import pandas as pd
import numpy as np

df = pd.DataFrame([[1,np.nan,2], [2,3,5], [np.nan,4,6]])
print(df)
print(df.dropna())
print(df.dropna(axis=1))

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6

     0    1  2
1  2.0  3.0  5

   2
0  2
1  5
2  6

dropna() also has an important parameter called how. It defaults to how='any', which deletes a row (or column) as soon as it contains any NaN value. With how='all', a row (or column) is deleted only if all of its values are NaN.

import pandas as pd
import numpy as np

df = pd.DataFrame([[1,np.nan,2,4], [2,3,5,3], [np.nan,np.nan,np.nan,np.nan]])
print(df)
print(df.dropna(how='any'))
print(df.dropna(how='all'))

     0    1    2    3
0  1.0  NaN  2.0  4.0
1  2.0  3.0  5.0  3.0
2  NaN  NaN  NaN  NaN

     0    1    2    3
1  2.0  3.0  5.0  3.0

     0    1    2    3
0  1.0  NaN  2.0  4.0
1  2.0  3.0  5.0  3.0
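
The how parameter works the same way for columns. The following sketch combines it with axis=1 on a small made-up DataFrame whose middle column is entirely NaN; the names drop_empty_cols and drop_any_cols are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, np.nan, 5],
                   [np.nan, np.nan, 6]])

# Column 1 is entirely NaN, so how='all' along axis=1 drops only that column.
drop_empty_cols = df.dropna(axis=1, how='all')
print(drop_empty_cols)

# how='any' along axis=1 also drops column 0, which contains a single NaN.
drop_any_cols = df.dropna(axis=1, how='any')
print(drop_any_cols)
```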

For finer control, the thresh parameter specifies the minimum number of non-missing values a row (or column) must have in order to be kept.

import pandas as pd
import numpy as np

df = pd.DataFrame([[1,np.nan,2,4], [2,np.nan,np.nan,3], [np.nan,np.nan,np.nan,np.nan]])
print(df)
print(df.dropna(thresh=3))

     0    1    2    3
0  1.0  NaN  2.0  4.0
1  2.0  NaN  NaN  3.0
2  NaN  NaN  NaN  NaN

     0    1    2    3
0  1.0  NaN  2.0  4.0

We see that the second and third rows are removed because they have fewer than three non-missing values.
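
To make the threshold rule concrete, this sketch counts the non-missing values in each row by hand and checks that the manual filter keeps the same rows as dropna(thresh=3). The helper names non_missing, kept_manually and kept_by_thresh are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, 4],
                   [2, np.nan, np.nan, 3],
                   [np.nan, np.nan, np.nan, np.nan]])

# Number of non-missing values in each row.
non_missing = df.notnull().sum(axis=1)
print(non_missing)

# Keeping rows with at least 3 non-missing values by hand...
kept_manually = df[non_missing >= 3]
# ...selects the same rows as dropna(thresh=3).
kept_by_thresh = df.dropna(thresh=3)
print(kept_manually.equals(kept_by_thresh))
```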

However, sometimes we do not want to delete a row because too much useful information would be lost, so an alternative is to fill in the missing values.

The first method is to fill the missing value with the specified value:

import pandas as pd
import numpy as np

df = pd.DataFrame([[11,np.nan,22,44], [22,np.nan,np.nan,33], [np.nan,np.nan,np.nan,np.nan]])
print(df.fillna(0))

      0    1     2     3
0  11.0  0.0  22.0  44.0
1  22.0  0.0   0.0  33.0
2   0.0  0.0   0.0   0.0

In this example, we fill in all the missing values with 0.
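
A single constant is not the only choice. As a hedged sketch, fillna() also accepts a dict of per-column values, or a Series such as the column means. The DataFrame below and its column names 'a', 'b' and 'c' are invented for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [11, 22, np.nan],
                   'b': [np.nan, 4, 8],
                   'c': [22, np.nan, np.nan]})

# A dict maps column name -> fill value; columns not listed are left alone.
by_dict = df.fillna({'a': 0, 'b': -1})
print(by_dict)

# df.mean() skips NaN by default, so each gap gets its own column's mean.
by_mean = df.fillna(df.mean())
print(by_mean)
```

Whether a constant, a mean, or something else is appropriate depends on what the data represents.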

Another common method is to fill with neighboring values, which is widely used in time series analysis. You can fill with the preceding value (forward fill) or with the following value (backward fill).

import pandas as pd
import numpy as np

data = pd.Series([1,np.nan,2,np.nan,3],index=list('abcde'))
print(data)
print(data.fillna(method='ffill'))
print(data.fillna(method='bfill'))

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

It is easy to see from the results how the padding works from front to back and from back to front.
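
Newer Pandas releases deprecate the method argument of fillna() in favor of the dedicated ffill() and bfill() methods. The sketch below repeats the same fills with those methods; the names forward and backward are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, np.nan, 3], index=list('abcde'))

# Forward fill: each gap takes the last valid value before it.
forward = data.ffill()
print(forward)

# Backward fill: each gap takes the next valid value after it.
backward = data.bfill()
print(backward)
```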

The same goes for DataFrame, which additionally lets you choose the axis along which to fill. In this example, we set axis=1 to fill horizontally from front to back, so each missing value takes the value to its left in the same row:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1,np.nan,2,4], [2,3,5,3], [np.nan,5,4,np.nan]])
print(df)
print(df.fillna(method='ffill', axis=1))

     0    1  2    3
0  1.0  NaN  2  4.0
1  2.0  3.0  5  3.0
2  NaN  5.0  4  NaN

     0    1    2    3
0  1.0  1.0  2.0  4.0
1  2.0  3.0  5.0  3.0
2  NaN  5.0  4.0  4.0

Of course, when filling from front to back, a NaN in the first position has no preceding value to copy, so it stays NaN, as the first value of row 2 in the output above shows.
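
One possible way to cover such a leading gap, offered here only as a sketch, is to follow the forward fill with a backward fill. The Series and the names forward_only and both are made up for the illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([np.nan, 1, np.nan, 3], index=list('abcd'))

# Forward fill alone cannot fill 'a': there is no earlier value to copy.
forward_only = data.ffill()
print(forward_only)

# A follow-up backward fill covers the leading gap.
both = data.ffill().bfill()
print(both)
```

Whether borrowing a later value is acceptable depends on the data; in a time series it means using information from the future.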

You should now have a thorough understanding of how Pandas handles missing values.