This article was first published on the author's personal public account, TechFlow. Original writing takes effort, so a follow is appreciated.


This is the fourth article in our series, and in it we will talk about basic operations on DataFrame.

In the last article, we introduced the common ways of indexing a DataFrame, such as iloc, loc, and boolean (logical) indexing. In today’s article we will look at some of the basic operations on DataFrame.

Data alignment

When we add two DataFrames, pandas automatically aligns them on their row and column labels. Any position that does not exist in both DataFrames is set to NaN (Not a Number).

First let’s create two dataframes:

import numpy as np
import pandas as pd

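# 3x3 frame with columns a, b, c and row labels '1' to '3'
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('abc'), index=['1', '2', '3'])

# 4x3 frame with columns a, b, d and row labels '2' to '5'
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abd'), index=['2', '3', '4', '5'])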

The result is exactly what we expected. We simply create each DataFrame from a NumPy array and then specify its index and columns. This is basic usage.


We then add the two DataFrames with df1 + df2.


We find that pandas takes the union of the row and column labels of the two DataFrames and sets NaN at every position that does not occur in both. This makes sense, because there is nothing to add at those positions. The same holds for all four arithmetic operations on two DataFrames: addition, subtraction, multiplication, and division. When two DataFrames are divided, positions without a counterpart are set to NaN, and division by zero can produce another kind of abnormal value: not necessarily NaN, but possibly inf.
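As a minimal sketch of the alignment described above, the following rebuilds df1 and df2 and applies the + and / operators. The names total and ratio are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list('abd'), index=['2', '3', '4', '5'])

# Labels are aligned first; positions missing from either side become NaN.
total = df1 + df2
print(total)

# Division aligns the same way; 3 / 0 at row '2', column 'a' yields inf.
ratio = df1 / df2
print(ratio)
```

Only rows '2' and '3' in columns a and b exist in both inputs, so those are the only positions that hold numbers in the results.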

fill_value

When we operate on two DataFrames, we usually do not want null values in the result, so we need a way to fill them. An operator such as + cannot take extra parameters, so instead we use the arithmetic methods that DataFrame provides.

DataFrame provides the following common arithmetic methods: add, sub, mul, div, floordiv, and pow. Each has a reversed counterpart prefixed with r: radd, rsub, rmul, rdiv, rfloordiv, and rpow.

This looks like a lot, but there is little to it: the r-prefixed methods such as radd simply swap the two operands. For example, if we want the reciprocal of every element in a DataFrame, we can write 1 / df. Since 1 is not a DataFrame, we cannot call a DataFrame method on it, and therefore cannot pass extra arguments. To get around this, we can write 1 / df as df.rdiv(1), which lets us pass arguments.


Since a division by zero occurs during the calculation, we get inf, which means infinity.
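A small sketch of the reversed division, assuming it is applied to the df1 defined earlier (the name reciprocal is introduced here only for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])

# df1.rdiv(1) computes 1 / df1 element by element.
reciprocal = df1.rdiv(1)
print(reciprocal)

# The element at row '1', column 'a' is 0, so its reciprocal is inf.
print(reciprocal.loc['1', 'a'])
```

Because rdiv is a method, it can also accept keyword arguments such as fill_value, which the operator form cannot.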

We can pass fill_value to methods like add and div to fill in values missing on one side before calculating. That is, a position missing from only one DataFrame is replaced with the value we specify. If it is missing from both DataFrames, the result is still NaN.


In the result, the positions (1, d), (4, c), and (5, c) are still NaN, because they are missing from both df1 and df2, so they are not filled.
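The behavior can be sketched as follows, assuming the method used is add with fill_value=0 (the text does not say which method or fill value produced the original output, and the name filled_sum is illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list('abd'), index=['2', '3', '4', '5'])

# A value missing from only one side is treated as 0 before adding.
filled_sum = df1.add(df2, fill_value=0)
print(filled_sum)

# Positions missing from both inputs stay NaN.
print(filled_sum.isna())
```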

The fill_value parameter appears in many other APIs, such as reindex, and is used in the same way.
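As an illustration of fill_value in reindex, here is a minimal sketch. The new row label '4' and the name reindexed are chosen here only for the example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])

# Row '4' does not exist in df1, so reindex creates it and fills it with 0
# instead of the default NaN.
reindexed = df1.reindex(['1', '2', '3', '4'], fill_value=0)
print(reindexed)
```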

So what do we do about null values that remain even after filling? Do we have to find these positions manually and fill them one by one? Fortunately, pandas also provides APIs for handling null values.

Null value APIs

Before we handle null values, the first thing we need to do is find them. For this we have the isna method, which returns a DataFrame of bool type. Each position in it indicates whether the corresponding position in the original DataFrame is null.
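A brief sketch of isna follows. The frame df3 is assumed here to be the fill_value sum of df1 and df2, since the text does not show how it was built.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list('abd'), index=['2', '3', '4', '5'])
df3 = df1.add(df2, fill_value=0)  # assumed definition of df3

# True marks a null position, False marks a valid value.
mask = df3.isna()
print(mask)

# Summing the mask counts the null values in each column.
print(mask.sum())
```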


dropna

Of course, just knowing whether null values are present is not enough. Sometimes we want data without any null values, and then we can choose to drop them with the dropna method of DataFrame.


We find that dropna discards every row that contains a null value and keeps only the rows without any. Sometimes we want to discard columns instead of rows, which we can control by passing the axis parameter.
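The two directions can be sketched as below, again assuming df3 is the fill_value sum of df1 and df2 (the names rows_kept and cols_kept are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list('abd'), index=['2', '3', '4', '5'])
df3 = df1.add(df2, fill_value=0)  # assumed definition of df3

# Default: drop every row that contains at least one null value.
rows_kept = df3.dropna()
print(rows_kept)

# axis=1: drop every column that contains at least one null value.
cols_kept = df3.dropna(axis=1)
print(cols_kept)
```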


This gives us only the columns without null values. Besides choosing between rows and columns, we can also control how strictly the drop is applied through the how parameter, which supports two values: 'all' and 'any'. With 'all', a row or column is dropped only when every value in it is null. With 'any', it is dropped as soon as a single null value is present. The default is 'any', and in general we rarely need to change it.
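To make the difference between 'all' and 'any' concrete, the sketch below uses the plain sum df1 + df2, which contains rows and columns that are entirely null. The name summed is introduced only for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list('abd'), index=['2', '3', '4', '5'])
summed = df1 + df2  # rows '1', '4', '5' and columns 'c', 'd' are all NaN

# how='all': drop a row only if every value in it is null.
print(summed.dropna(how='all'))

# how='any' (the default): drop a row if any value in it is null.
# Every row has NaN in columns c and d, so nothing is left.
print(summed.dropna(how='any'))

# The same option works for columns when combined with axis=1.
print(summed.dropna(axis=1, how='all'))
```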

fillna

In addition to dropping data that contains null values, pandas can also fill them in. In practice, this is the more common approach.

We can simply pass in a specific value to fill:


fillna returns a new DataFrame in which every NaN is replaced with the value we specified. If we do not want a new DataFrame but want to modify the original one, we can pass inplace=True to indicate an in-place operation, and pandas will modify the original DataFrame.

df3.fillna(3, inplace=True)
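The difference between the two calling styles can be sketched as follows. df3 is assumed to be the fill_value sum of df1 and df2, and the names filled, missing_before, and missing_after are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list('abd'), index=['2', '3', '4', '5'])
df3 = df1.add(df2, fill_value=0)  # assumed definition of df3

# Without inplace, fillna returns a new DataFrame and leaves df3 untouched.
filled = df3.fillna(3)
missing_before = int(df3.isna().sum().sum())

# With inplace=True, df3 itself is modified.
df3.fillna(3, inplace=True)
missing_after = int(df3.isna().sum().sum())

print(missing_before, missing_after)
```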

In addition to filling with a fixed value, we can compute the value to fill with. For example, we can fill with the mean, maximum, or minimum of a column. The fillna method works on a Series as well as on a DataFrame, so we can fill a single column or several columns of a DataFrame.
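A minimal sketch of filling with a computed statistic, assuming df3 is the fill_value sum of df1 and df2 (the names c_filled and all_filled are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list('abd'), index=['2', '3', '4', '5'])
df3 = df1.add(df2, fill_value=0)  # assumed definition of df3

# Fill a single column (a Series) with its own mean; NaN is skipped by mean().
c_filled = df3['c'].fillna(df3['c'].mean())
print(c_filled)

# Passing a Series of per-column means fills each column with its own mean.
all_filled = df3.fillna(df3.mean())
print(all_filled)
```

Swapping mean() for max() or min() fills with the column maximum or minimum in the same way.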


Besides filling with a computed value such as the mean, maximum, or minimum, we can also fill a missing value with the value from the row before or after it. To do this, we use the method parameter, which accepts two values: ffill, which fills with the value of the previous row, and bfill, which fills with the value of the next row.


We can see that when we fill with ffill, a NaN in the first row is preserved because there is no previous row to take a value from. Likewise, when we use bfill, a NaN in the last row cannot be filled.
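The sketch below shows both directions on an assumed df3 (the fill_value sum of df1 and df2). It calls the dedicated DataFrame.ffill and DataFrame.bfill methods, because recent pandas releases deprecate the method argument of fillna in favor of them. The names forward and backward are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                   columns=list('abd'), index=['2', '3', '4', '5'])
df3 = df1.add(df2, fill_value=0)  # assumed definition of df3

# Forward fill: each NaN takes the value from the row above it.
# Column d is NaN in the first row, so that cell stays NaN.
forward = df3.ffill()
print(forward)

# Backward fill: each NaN takes the value from the row below it.
# Column c is NaN in the last rows, so those cells stay NaN.
backward = df3.bfill()
print(backward)
```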

Conclusion

In today’s article we introduced some basic DataFrame operations, such as the four arithmetic operations. When performing them, the row and column indexes of the two DataFrames may not line up, which produces null values that we then need to deal with. We can fill them by passing fill_value during the calculation, or by calling fillna on the result afterwards.

In practice, it is rare to add or subtract two DataFrames directly, but it is very common for a DataFrame to contain null values. Filling and handling null values is therefore an important skill and deserves particular attention.

That is all for today’s article. If you enjoyed it, please show a little support by following, sharing, and liking.

This article is formatted using MDNICE