This is the fourth day of my participation in Gwen Challenge

Antecedents to review

  • Find xiaoming and Li Xiaohua’s student numbers with different scores (half completed)
    • Find columns with different scores (done)
    • Find the student number for that column
  • Group and sum the scores of students who scored incorrectly (incomplete).

Find the student number for that column

Let’s recall the data structure briefly introduced below. Data of a data structure is a one-dimensional array data for a series and a two-dimensional array data for a Dataframe. So we can use the following operations directly

data_clean.values[:,0]
//data_clean.values[:3.0] // Here I only take the first three lines, because I also made up the single subject data.Copy the code

Group and sum the scores of the students who scored wrong

The data for individual subjects are as follows

filepath_danke = "Single subject data.csv"
data_table_danke = pd.read_csv(filepath_danke, encoding='utf-8')
data_table_danke
Copy the code

The output is as follows:

So how do you do the grouping sum?

First consider the group summation scenario:

According to [gender grouping], the [average] of the national population [life span] was counted. According to [class grouping], the [average] of the [math score] in the group was more than 80 points

It is clear that grouping requires three concepts:

  • By what grouping
  • What data to get
  • What do you do with the data

Here we need to calculate the [total score] of the students according to the [student number group].

So pd gives us a group by function.

Df.groupby (by what group)[what data to get]. What to do with the dataCopy the code

For example

student_sum = data_table_danke.groupby('student id') ['results'].sum(a)Copy the code
Student ID 20160003 548 20160006 489 20160010 221 Name: result, DType: INT64Copy the code

So far we have learned filtering and grouping summation through a simple case study, but for some common data analysis operations, there is still a sorting. So let’s add a requirement.

Additional requirements: the highest and second highest marks for each subject

Logically, you should group, sort, and sum.

But it’s hard to sort the grouped data and pick the first two (at least I don’t yet)

So we can use a property of grouping

Grouping does not affect sorting within a group

So you can sort and group and sum

namely

filepath_danke = "Single subject data.csv"
data_table_danke = pd.read_csv(filepath_danke, encoding='utf-8')
data_table_danke.sort_values('results', ascending = False, inplace=True)
res = data_table_danke.groupby('student id').head(2).groupby('student id') ['results'].sum()
res
Copy the code