Introduction

During data processing, Pandas uses NaN to represent unparsed or missing data. NaN keeps the data aligned, but it cannot participate in ordinary mathematical operations.

This article will explain how Pandas handles NaN data.

An example of NaN

As mentioned above, missing data is represented as NaN. Let's look at a specific example.

Let's start by building a DataFrame:

In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]: 
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

The DataFrame above only has the indexes a, c, e, f, h, so let's reindex it with additional labels:

In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]: 
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

When data is missing, many NaNs are generated.

To check for NaN, either the isna() or notna() methods can be used.

In [7]: df2['one']
Out[7]: 
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]: 
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]: 
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

Note that in Python, None compares equal to None:

In [11]: None == None                                                 # noqa: E711
Out[11]: True

But np.nan is not equal to itself:

In [12]: np.nan == np.nan
Out[12]: False
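Because np.nan does not equal itself, an equality comparison can never find missing values; a minimal sketch showing why pd.isna() is the right tool:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# An equality comparison never matches NaN, so this mask is all False:
eq_mask = s == np.nan

# pd.isna() is the reliable way to locate missing values:
na_mask = pd.isna(s)
```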

Missing values of integer type

NaN is a float, so a column containing NaN is upcast to float by default. To keep integer values alongside missing data, use the nullable integer dtype:

In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]: 
0       1
1       2
2    <NA>
3       4
dtype: Int64
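The same nullable dtype can also be reached by converting an existing float series; a small sketch:

```python
import numpy as np
import pandas as pd

# A missing value upcasts the series to float64:
s = pd.Series([1, 2, np.nan, 4])

# astype('Int64') converts to the nullable integer dtype, keeping <NA>:
s_int = s.astype('Int64')
```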

Missing values of datetime type

Missing values of datetime type are represented by NaT:

In [15]: df2 = df.copy()

In [16]: df2['timestamp'] = pd.Timestamp('20120101')

In [17]: df2
Out[17]: 
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

In [19]: df2
Out[19]: 
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]: 
float64           3
datetime64[ns]    1
bool              1
object            1
dtype: int64
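NaT behaves like NaN for detection purposes; a minimal sketch showing that pd.isna() finds it too:

```python
import pandas as pd

# A missing datetime becomes NaT, which pd.isna() detects like NaN:
ts = pd.Series(pd.to_datetime(['2012-01-01', None, '2012-01-03']))
```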

Conversions between None and np.nan

For numeric types, assigning None converts the value to NaN:

In [21]: s = pd.Series([1, 2, 3])

In [22]: s.loc[0] = None

In [23]: s
Out[23]: 
0    NaN
1    2.0
2    3.0
dtype: float64

For object dtype, assigning None leaves the value unchanged:

In [24]: s = pd.Series(["a", "b", "c"])

In [25]: s.loc[0] = None

In [26]: s.loc[1] = np.nan

In [27]: s
Out[27]: 
0    None
1     NaN
2       c
dtype: object
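Even though the object series above stores two different sentinels, pd.isna() treats both as missing; a small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"])  # object dtype
s.loc[0] = None                 # stored as None
s.loc[1] = np.nan               # stored as NaN

# Despite the two different representations, both count as missing:
mask = s.isna()
```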

Calculations with missing values

Arithmetic involving a missing value yields a missing value:

In [28]: a
Out[28]: 
        one       two
a       NaN -0.282863
c       NaN  1.212112
e  0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771

In [29]: b
Out[29]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [30]: a + b
Out[30]: 
        one  three       two
a       NaN    NaN -0.565727
c       NaN    NaN  2.424224
e  0.238417    NaN -2.088472
f -4.209138    NaN -0.989859
h       NaN    NaN -1.413542

But NaN is skipped (effectively treated as 0) in statistics such as sum and mean:

In [31]: df
Out[31]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [32]: df['one'].sum()
Out[32]: -1.9853605075978744

In [33]: df.mean(1)
Out[33]: 
a   -0.895961
c    0.519449
e   -0.595625
f   -0.509232
h   -0.873173
dtype: float64

cumsum and cumprod skip NaN by default but keep the NaN positions in the result. To propagate NaN instead, add skipna=False:

In [34]: df.cumsum()
Out[34]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e  0.119209 -0.114987 -2.544122
f -1.985360 -0.609917 -1.472318
h       NaN -1.316688 -2.511893

In [35]: df.cumsum(skipna=False)
Out[35]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e       NaN -0.114987 -2.544122
f       NaN -0.609917 -1.472318
h       NaN -1.316688 -2.511893
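The skipna behavior described above can be checked on a toy series; a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                     # NaN skipped -> 4.0
total_strict = s.sum(skipna=False)  # NaN propagates -> nan

cs = s.cumsum()                     # NaN kept in place: [1.0, NaN, 4.0]
cs_strict = s.cumsum(skipna=False)  # NaN propagates:    [1.0, NaN, NaN]
```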

Filling NaN data with fillna

In data analysis, NaN values usually need to be handled. One approach is to fill them with fillna.

Fill with a constant:

In [42]: df2
Out[42]: 
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [43]: df2.fillna(0)
Out[43]: 
        one       two     three four   five            timestamp
a  0.000000 -0.282863 -1.509059  bar   True                    0
c  0.000000  1.212112 -0.173215  bar  False                    0
e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
h  0.000000 -0.706771 -1.039575  bar   True                    0

You can also specify the fill method, such as pad:

In [45]: df
Out[45]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [46]: df.fillna(method='pad')
Out[46]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h -2.104569 -0.706771 -1.039575

You can specify the number of rows to fill:

In [48]: df.fillna(method='pad', limit=1)
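A small runnable sketch of the limit parameter, using ffill, which is the modern equivalent of fillna(method='pad'):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# limit=1 forward-fills at most one consecutive NaN; the second one stays:
filled = s.ffill(limit=1)
```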

Summary of fill methods:

Method name       Description
pad / ffill       Fill values forward
bfill / backfill  Fill values backward

You can also fill with a pandas object, such as the column means:

In [53]: dff
Out[53]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN       NaN -1.157892
5 -1.344312       NaN       NaN
6 -0.109050  1.643563       NaN
7  0.357021 -0.674600       NaN
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [54]: dff.fillna(dff.mean())
Out[54]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3 -0.140857  0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

The above operation is equivalent to:

In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
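The equivalence between fillna and where can be verified on a toy frame; a minimal sketch:

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                    'B': [np.nan, 2.0, 4.0]})

# Fill with column means directly, and via the where() formulation:
filled = dff.fillna(dff.mean())
via_where = dff.where(pd.notna(dff), dff.mean(), axis='columns')
```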

Deleting data containing NaN with dropna

Besides filling the data with fillna, you can use dropna to drop the rows or columns containing NaN.

In [57]: df
Out[57]: 
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  1.212112 -0.173215
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.706771 -1.039575

In [58]: df.dropna(axis=0)
Out[58]: 
Empty DataFrame
Columns: [one, two, three]
Index: []

In [59]: df.dropna(axis=1)
Out[59]: 
        two     three
a -0.282863 -1.509059
c  1.212112 -0.173215
e  0.000000  0.000000
f  0.000000  0.000000
h -0.706771 -1.039575

In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
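dropna also accepts how and subset parameters to control what counts as droppable; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [np.nan, np.nan, 1.0],
                   'two': [np.nan, 2.0, 3.0]})

# how='all' drops only rows whose values are all missing:
kept = df.dropna(how='all')

# subset restricts the NaN check to the given columns:
kept2 = df.dropna(subset=['two'])
```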

Interpolation with interpolate

When analyzing data, we sometimes need to smooth it by filling the gaps with interpolate(), which is very simple to use:

In [61]: ts
Out[61]: 
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
                ...   
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

In [64]: ts.interpolate()
Out[64]: 
2000-01-31    0.469112
2000-02-29    0.434469
2000-03-31    0.399826
2000-04-28    0.365184
2000-05-31    0.330541
                ...   
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

interpolate can also take arguments that specify how to interpolate, for example by time:

In [67]: ts2
Out[67]: 
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.785037
2005-01-31         NaN
2008-04-30   -9.011531
dtype: float64

In [68]: ts2.interpolate()
Out[68]: 
2000-01-31    0.469112
2000-02-29   -2.657962
2002-07-31   -5.785037
2005-01-31   -7.398284
2008-04-30   -9.011531
dtype: float64

In [69]: ts2.interpolate(method='time')
Out[69]: 
2000-01-31    0.469112
2000-02-29    0.270241
2002-07-31   -5.785037
2005-01-31   -7.190866
2008-04-30   -9.011531
dtype: float64

Or interpolate by the float values of the index:

In [70]: ser
Out[70]: 
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [71]: ser.interpolate()
Out[71]: 
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [72]: ser.interpolate(method='values')
Out[72]: 
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64
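The difference between the default linear method and method='values' can be reproduced in a few lines; a minimal sketch:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, np.nan, 10.0], index=[0.0, 1.0, 10.0])

# Default linear interpolation treats points as equally spaced -> midpoint 5.0:
linear = ser.interpolate()

# method='values' uses the index spacing, so index 1.0 maps to value 1.0:
by_index = ser.interpolate(method='values')
```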

In addition to interpolating a Series, you can also interpolate a DataFrame:

In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ...:                     'B': [0.25, np.nan, np.nan, 4, 12.2, 14.4]})

In [74]: df
Out[74]: 
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [75]: df.interpolate()
Out[75]: 
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

interpolate also accepts a limit parameter to cap the number of consecutive NaN values filled:

In [95]: ser.interpolate(limit=1)
Out[95]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64

Replacing values with replace

replace can substitute a single value or a list of values:

In [102]: ser = pd.Series([0., 1., 2., 3., 4.])

In [103]: ser.replace(0, 5)
Out[103]: 
0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]: 
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

You can replace specific values per column in a DataFrame:

In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]: 
     a    b
0  100  100
1    1    6
2    2    7
3    3    8
4    4    9

You can also replace using a fill method such as pad:

In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]: 
0    0.0
1    0.0
2    0.0
3    0.0
4    4.0
dtype: float64
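replace additionally supports regular-expression matching; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': ['bat', 'foo', 'bait']})

# With regex=True, to_replace is treated as a regular expression;
# 'bat' matches ^ba.$ but the four-character 'bait' does not:
out = df.replace(to_replace=r'^ba.$', value='new', regex=True)
```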

This article has been included in http://www.flydean.com/07-python-pandas-missingdata/
