This is day 9 of my participation in the First Wen Challenge 2022.

Introduction to Pandas

Pandas is a third-party module dedicated to data processing. It has many powerful features for working with CSV and Excel files, including statistics, lookups, and other operations. Since it is a third-party module, we need to install it before using it:

pip install pandas

Then we can try the basic operations of Pandas. Let’s start with the following code:

import pandas as pd

# Build a DataFrame from a list of dictionaries, one dictionary per row
df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1}
])
print(df)

In the code, we create a DataFrame object, which we will describe in more detail later. We pass in a list whose elements are dictionaries. Let's look at the output:

       city  temperature
0  Nanchang            2
1     Wuhan           -1

You can see that the display looks like a table, and the output adds a row index for us. If you use Jupyter, the display will look even better. Above we printed all the data; we can also print only part of it:

import pandas as pd

# Create a DataFrame object
df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1}
])
# Print only the first row
print(df.head(1))

This time we call the df.head() method, passing in argument 1 to fetch 1 row from the beginning. The output looks like this:

       city  temperature
0  Nanchang            2

Instead of viewing entire rows of data, we can index specific columns:

import pandas as pd

# Create a DataFrame object
df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1}
])
# Select a single column by name
print(df['city'])

For example, we index the city column, and the output is as follows:

0    Nanchang
1       Wuhan
Name: city, dtype: object

In addition to this general approach, we can also do conditional indexing. The operations are described in detail in the following sections.

Pandas data structures

Data structures here means the special data types defined in Pandas. Two of them are used most often: Series, and the DataFrame we used above. What are they, and how do they differ?

Series

A Series is a one-dimensional data structure, similar to a list but different from it in several ways. Let's start with the following code:

import pandas as pd

# Create a Series object
s = pd.Series([1, 2, 3, 4], dtype='float32')
print(s)

When we create a Series, we can specify a dtype parameter, which sets the data type of the Series. The dtype parameter is optional. When it is specified, the Series stores all of its elements as that type, which is one way it differs from a list.

Here is the output:

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float32
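As a small illustration of the dtype point above, here is a minimal sketch. The variable names are only for illustration. It contrasts a plain list, where each element keeps its own type, with a Series, where all elements share one dtype.

```python
import pandas as pd

# A plain list keeps each element's own Python type
items = [1, 2.5, 3]
list_types = [type(x).__name__ for x in items]
print(list_types)

# A Series stores every element with one shared dtype
s = pd.Series(items, dtype='float32')
print(s.dtype)

# astype returns a new Series converted to another dtype
as_int = s.astype('int64')
print(as_int.tolist())
```

Note that converting to an integer dtype truncates the fractional part, so 2.5 becomes 2 here.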

Let's look at some common Series attributes and methods:

import pandas as pd

# Create a Series object
s = pd.Series([1, 2, 3, 4, 4, 4], dtype='float32')
# Look at the first element
print(s.head(1))
# Count the non-missing values in the Series
print(s.count())
# Check the data type
print(s.dtype)
# View the data
print(s.values)
# View the index
print(s.index)
# Check the number of elements (similar to count, but includes missing values)
print(s.size)

Here is the output:

0    1.0
dtype: float32
6
float32
[1. 2. 3. 4. 4. 4.]
RangeIndex(start=0, stop=6, step=1)
6

A few things to note here: s.values returns a NumPy ndarray object. s.index is the index of the Series; it is a pandas.core.indexes.range.RangeIndex object, a type defined by Pandas.
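The two types mentioned above can be checked directly. The following is a minimal sketch rather than anything from the original example; it assumes NumPy is importable, which is always the case when Pandas is installed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 4, 4], dtype='float32')

# s.values is a NumPy array, so NumPy methods work on it directly
values = s.values
print(isinstance(values, np.ndarray))
print(values.sum())

# The default index is a RangeIndex, which behaves like range(0, len(s))
index = s.index
print(isinstance(index, pd.RangeIndex))
print(list(index))
```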

In addition to the above operations, Series has a number of statistical functions:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 4, 4], dtype='float32')
# Mean
print(s.mean())
# Maximum
print(s.max())
# Minimum
print(s.min())
# Standard deviation
print(s.std())

The following output is displayed:

3.0
4.0
1.0
1.264911

These operations are self-explanatory, so we will not go through them here. Let's take a look at the index and slice operations for Series, which we use often:

import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])

# Index
print(s[3])
# Slice
print(s[0:2])
# Index and slice with iloc (by position)
print(s.iloc[0])
print(s.iloc[0:2])
# Index and slice with loc (by label)
print(s.loc[0])
print(s.loc[0:2])
# Slice by string label, which requires an index that holds those labels
t = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(t.loc['a':'c'])

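We used several ways to index and slice above, of which [] is the simplest. iloc and loc are two indexers offered by Pandas, and both can be used for indexing and slicing. iloc selects by integer position, while loc selects by index label. With the default integer index the two look alike, except that a loc slice includes its end label while an iloc slice excludes its end position. To slice by label, as in 'a':'c', you need loc and an index that contains those labels.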
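The difference between the two indexers is easier to see with string labels. Here is a minimal sketch; the values and labels are made up for illustration.

```python
import pandas as pd

# String labels make the label/position distinction visible
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = s.loc['a':'c']   # labels 'a' through 'c'; the end label is included

print(by_position.tolist())
print(by_label.tolist())
```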

Many of the methods described above also work on a DataFrame. We will come back to that later; first, let's look at the DataFrame itself.

DataFrame

A DataFrame is equivalent to a two-dimensional table and is used much more often than a Series. Let’s first look at how to create a DataFrame object:

import pandas as pd

# Create a DataFrame object
df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1}
])
print(df['city'])

This code has been demonstrated above. The statistical functions, indexing, and slicing described for Series are also available for a DataFrame, but they behave slightly differently because a DataFrame is a two-dimensional table. To take a closer look, we first create a DataFrame and specify its column names and row index:

import pandas as pd

# Rows are given as lists; columns and index supply the labels
df = pd.DataFrame([
    ['Nanchang', 2],
    ['Wuhan', -1]
], columns=['city', 'temperature'], index=[0, 1])

print(df)

The output is as follows:

       city  temperature
0  Nanchang            2
1     Wuhan           -1

Same as before. In addition, we can view some properties:

import pandas as pd

df = pd.DataFrame([
    ['Nanchang', 2],
    ['Wuhan', -1]
], columns=['city', 'temperature'], index=[0, 1])

print(df.shape)    # (rows, columns)
print(df.index)    # row labels
print(df.columns)  # column labels
print(df.values)   # underlying data as an array
print(df.T)        # transpose

The following output is displayed:

(2, 2)
Int64Index([0, 1], dtype='int64')
Index(['city', 'temperature'], dtype='object')
[['Nanchang' 2]
 ['Wuhan' -1]]
                    0      1
city         Nanchang  Wuhan
temperature         2     -1

The output of df.T looks a little strange, but on closer inspection it is the transpose of the original data: the rows and columns are swapped.
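To make the transpose concrete, here is a minimal sketch that inspects the labels before and after. It reuses the same two-row DataFrame, and the variable name transposed is only for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    ['Nanchang', 2],
    ['Wuhan', -1]
], columns=['city', 'temperature'])

transposed = df.T

# Column labels become the row index, and the row index becomes the columns
print(list(transposed.index))
print(list(transposed.columns))

# A column of the transposed frame is a row of the original
print(transposed[0].tolist())
```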

Let's look at indexing a DataFrame:

import pandas as pd

df = pd.DataFrame([
    ['Nanchang', 2],
    ['Wuhan', -1]
], columns=['city', 'temperature'], index=[0, 1])

print(df.loc[:, 'city'])
print(df.loc[:, 'city':'temperature'])

Above we used the loc indexer, which selects by label in the DataFrame. Here,

df.loc[:, 'city']

selects the city column for all rows, and

df.loc[:, 'city':'temperature']

selects all rows for the columns from city through temperature. Since the data has only two columns, the output is as follows:

0    Nanchang
1       Wuhan
Name: city, dtype: object
       city  temperature
0  Nanchang            2
1     Wuhan           -1

Next, index with iloc:

import pandas as pd

df = pd.DataFrame([
    ['Nanchang', 2],
    ['Wuhan', -1]
], columns=['city', 'temperature'], index=[0, 1])

# All columns, rows from position 1 onward
print(df.iloc[1:, :])
# All rows, columns from position 1 onward
print(df.iloc[:, 1:])

iloc is used in much the same way as loc, except that it selects by integer position instead of by label. The output is as follows:

    city  temperature
1  Wuhan           -1
   temperature
0            2
1           -1
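With the default integer index, loc and iloc can look interchangeable. The sketch below assumes hypothetical row labels 'r1' and 'r2', which are not part of the original example, to show that loc takes labels while iloc takes positions.

```python
import pandas as pd

# Hypothetical row labels, chosen only to show how loc and iloc differ
df = pd.DataFrame([
    ['Nanchang', 2],
    ['Wuhan', -1]
], columns=['city', 'temperature'], index=['r1', 'r2'])

# loc takes a row label and a column label
temp_by_label = df.loc['r2', 'temperature']

# iloc takes a row position and a column position
temp_by_position = df.iloc[1, 1]

print(temp_by_label, temp_by_position)
```

Both expressions address the same cell, the temperature for Wuhan.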

Read data from a CSV file

Above we looked briefly at the Series and DataFrame types. In practice, we usually load a DataFrame from a local CSV file. Let's see how to do this. Start with a simple CSV file:

city,temperature
Nanchang,2
Wuhan,-1

The code to read the file is as follows:

import pandas as pd

df = pd.read_csv('test.csv')
print(df)

We call read_csv and pass in the file path. The following output is displayed:

       city  temperature
0  Nanchang            2
1     Wuhan           -1

The display is the same as before. After reading the file, we can perform the above operations on the data. Of course, we can modify the data and save it to a file:

import pandas as pd
df = pd.read_csv('test.csv')
df = df.drop('city', axis=1)
df.to_csv('test.csv')

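After reading the file, we call drop to remove the city column; axis=1 indicates that a column, rather than a row, is dropped. We then call to_csv to save the result to the file. Note that to_csv also writes the row index as an extra first column by default; pass index=False to omit it.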
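The read, drop, and save steps can be tried without touching the disk. The following is a minimal sketch, not the original workflow: it assumes an in-memory io.StringIO buffer in place of test.csv and passes index=False so that the row index is not written.

```python
import io
import pandas as pd

# StringIO stands in for test.csv so the sketch does not touch the disk
source = io.StringIO('city,temperature\nNanchang,2\nWuhan,-1\n')
df = pd.read_csv(source)

# Drop the city column, then write without the row index
df = df.drop('city', axis=1)
target = io.StringIO()
df.to_csv(target, index=False)

csv_text = target.getvalue()
print(csv_text)
```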

Conditional queries

We can also do some conditional queries, using a new DataFrame object:

import pandas as pd

df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1},
    {'city': 'Shanghai', 'temperature': 1},
])
# Keep only the rows whose temperature is above 0
print(df[df['temperature'] > 0])

Here,

df[df['temperature'] > 0]

selects the rows whose temperature is greater than 0. The following output is displayed:

       city  temperature
0  Nanchang            2
2  Shanghai            1

We can also specify multiple conditions:

import pandas as pd

df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1},
    {'city': 'Shanghai', 'temperature': 1},
])
# Combine two conditions with &; each condition needs its own parentheses
print(df[(df['temperature'] > 0) & (df['temperature'] < 2)])

In the code above, we look for rows whose temperature is greater than 0 and less than 2. One thing to note is that the expression

df['temperature'] > 0

returns a boolean Series, and indexing df with it keeps the rows that match the condition. Multiple conditions are connected with & or |, and each condition must be wrapped in parentheses. Here is the output of the code:

       city  temperature
2  Shanghai            1
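It can help to look at the intermediate True/False values on their own. Here is a minimal sketch on the same data; the names mask, extremes, and not_positive are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1},
    {'city': 'Shanghai', 'temperature': 1},
])

# The comparison yields one True/False value per row
mask = df['temperature'] > 0
print(mask.tolist())

# | means "or": rows colder than 0 or warmer than 1
extremes = df[(df['temperature'] < 0) | (df['temperature'] > 1)]
print(extremes['city'].tolist())

# ~ negates a mask
not_positive = df[~mask]
print(not_positive['city'].tolist())
```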

The above operation is called Boolean indexing and is also available in NumPy. In addition to Boolean indexing, we can also implement this with the query method:

import pandas as pd

df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1},
    {'city': 'Shanghai', 'temperature': 1},
])
# The condition is written as a string that refers to column names
print(df.query('temperature > 0'))

We pass a condition in query, where the condition needs to be a string. Here is the output:

       city  temperature
0  Nanchang            2
2  Shanghai            1

The query method also supports multiple conditions, connected by & and |:

import pandas as pd

df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1},
    {'city': 'Shanghai', 'temperature': 1},
])
print(df.query('temperature > 0 & temperature < 2'))

The result is the same as above.
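Query strings can also refer to Python variables. The sketch below assumes two hypothetical bounds, low and high, which are not in the original example. It references them with the @ prefix, uses the keyword and in place of &, and checks the result against the equivalent boolean index.

```python
import pandas as pd

df = pd.DataFrame([
    {'city': 'Nanchang', 'temperature': 2},
    {'city': 'Wuhan', 'temperature': -1},
    {'city': 'Shanghai', 'temperature': 1},
])

# Hypothetical bounds, referenced inside the query string with @
low, high = 0, 2

with_query = df.query('temperature > @low and temperature < @high')
with_mask = df[(df['temperature'] > low) & (df['temperature'] < high)]

print(with_query['city'].tolist())
print(with_query.equals(with_mask))
```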