The preface

More than 60% of students who have studied Pandas still prefer Excel, mostly because the initial process of manipulating data in Python was too painful to choose the desired rows and columns.

The Pandas tutorial covers the basic column direction indexes for its length, but it does not meet the growing need for personalized service. In order to relieve the pain, increase the pleasure and meet the needs, in the second part, we carry out the index separately and introduce two commonly used index methods in detail based on the scene:

The first is based on location (integer) index, case short and quick, have a rough understanding, in practice occasionally useful, but its application scope is not as wide as the second.

The second is the name-based (tag-based) index, which is the focus of the blackboard practice, as it will be an important building block for data cleansing and analysis later on.

First, a brief introduction to the case data of the exercise:

Traffic sources The source of subsidiary visitors Payment conversion rate The guest unit price
Level 1 -A 35188 9.98% 54.3
Level 1 -B 28467 11.27% 99.93
Level 1 -C 13747 2.54% 0.08
Level 1 -D 5183 2.47% 37.15
Level 1 -E 4361 4.31% 91.73
Level 1 -F 4063 11.57% 65.09
Level 1 -G 2122 10.27% 86.45
Level 1 -H 2041 7.06% 44.07
Level 1 -I 1991 16.52% 104.57
Level 1 -J 1981 5.75% 75.93
Level 1 -K 1958 14.71% 85.03
Level 1 -L 1780 13.15% 98.87
Level 1 -M 1447 1.04% 80.07
The secondary -A 39048 11.60% 91.91
The secondary -B 3316 7.09% 66.28
The secondary -C 2043 5.04% 41.91
Level 3 -A 23140 9.69% 83.75
Level 3 -B 14813 20.14% 82.97
Level 4 -A 216 1.85% 94.25
Level 4 -B 31 0.00%
Level 4 -C 17 0.00%
Level 4 -D 3 0.00%

Just like the first data set, it records the number of visitors, payment conversion rate and customer unit price corresponding to the details of each channel source under different traffic sources. Although the data set is short (the complex case data set will follow as promised at the end of the foundation chapter), it is representative enough to begin our indexed performance.

1. Location-based (numeric) indexes

Let’s take a look at how indexes work:

Df.iloc [row index, column index] The first position is the row index, and the second position is the column index, and the argument is the position of which columns we want to fetch

We need to fill in the corresponding row and column parameters according to the actual situation.

Scenario 1 (Row selection)

Target: Select all rows whose traffic source is equal to one level.

To select rows from 0 to 12, enter 0:13. To select all columns, enter colon:.

Scenario 2 (Column selection)

Goal: We want to take a look at the source of traffic and the price per customer of all channels. For all traffic channels, i.e. all rows, in the position of the first row argument we type:; The traffic source is column 1, the customer unit price is column 5, the corresponding column index is 0 and 4 respectively:

Across the column selection
Continuous selection
0:5

Scenario 3 (Cross selection of rows and columns)

Goals: We want to take a look at secondary and tertiary traffic sources, their respective visitors, and payment conversion rates

The index is 13:17, and the index is 13:18. For the first four columns, we need traffic source, source details, visitors, and conversion. We pass 0:4 as the parameter.

2. Index based on name (label)

In order to establish a horizontal comparison of the body sense, we still use the above three scenes.

Scenario 1: Select all rows of the primary channel.

Instead of a single value, we can filter all rows with a traffic channel of level 1. We need to determine which values are equal to level 1.

We can pass the value of this column to the row argument position
Pandas will default to returning rows with a result of True (in this case, rows with an index from 0 to 12) and discard rows with a result of False

Scenario 2: We want to take a look at the source of traffic and the unit price of customers of all channels.

All channels are equal to all rows. To extract the traffic source and unit price column, enter the name of the column in the parameter position.

Scenario 3: We want to extract secondary and tertiary traffic sources, source details corresponding to visitors and payment conversion rates.

Line extraction with judgment, column extraction input specific name parameters.

df2.loc[df2['Traffic Source'].isin(['2'.'3']], ['Traffic Source'.'Source Details'.'Number of visitors'.'Payment conversion rate']]
Copy the code

isin

Since loC is more widely used, we should give it a chicken leg and a down-to-earth scene to practice. Before inserting the scene, we’ll take 30 seconds to review the use of column (Series) evaluation in Pandas.

df2['Number of visitors'].mean()
df2['Number of visitors'].std()
df2['Number of visitors'].median()
df2['Number of visitors'].max()
df2['Number of visitors'].min()
Copy the code

All you need to do is add a tail, the mean, standard difference, and so on. With that in mind, let’s move on to scenario four.

Scenario 4: For traffic channel data, we should really focus on high-quality channels. If we define high-quality channels with above-average number of visitors, conversion rate, and customer unit price, how do we find these channels?

Ideas: The quality channel should meet the three conditions of visitors, conversion and higher than average customer orders at the same time, which is the key to solve the problem. Let’s see what the averages are:

df2['Number of visitors'] > df2['Number of visitors'].mean()
df2['Payment conversion rate']> df2['Payment conversion rate'].mean()
df2['Customer unit price'] > df2['Customer unit price'].mean()
Copy the code

&
or
|

(df2['Number of visitors'] > df2['Number of visitors'].mean())&(df2['Payment conversion rate']> df2['Payment conversion rate'].mean())&(df2['Customer unit price'] > df2['Customer unit price'].mean())
Copy the code

df2.loc[(df2['Number of visitors'] > df2['Number of visitors'].mean())&(df2['Payment conversion rate']> df2['Payment conversion rate'].mean())&(df2['Customer unit price'] > df2['Customer unit price'].mean()),:]
Copy the code

The key is to map the rows and columns you want to select in your mind to the corresponding row and column parameters.

With a little practice, Python will be able to manipulate and analyze data as much as it wants, and you’ll see how beautiful Python is compared to Excel.