• Original article: Fully Extracting Insights from a Kaggle Dataset Using Python’s Pandas and Seaborn
  • Original author: Strikingloo
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: haiyang – tju
  • Proofreader: Rocheers Leviding

Curiosity and intuition are two of the data scientist’s most powerful tools. The third one may be Pandas.

In my last article, I showed how to assess a dataset’s completeness, plot some variables, and look at trends over time.

For this, I used Python’s Pandas library in a Jupyter Notebook for data analysis and processing, and the Seaborn library for visualization.

As in the previous article, we use the 120 years of Olympic Games dataset from Kaggle. There we studied how female participation grew over time, the distributions of athletes’ weight and height, and a few other variables, but we did not use the data on which sport each athlete competed in.

This time, we’ll focus on the Sport column of the dataset and see what we can learn from it.

A few questions I can think of are:

  • Which sports favor heavier athletes? What about taller ones?
  • Which sports are new and which are old? Are there any sports that were discontinued because they fell out of favor with the Olympics?
  • Are there sports in which the same teams always win? And which sports are the most diverse, with winners coming from many different regions?

As before, the code used in this analysis is in this GitHub project, which you can fork to add your own analysis and insights. Let’s get started!

Weight and height analysis

In our first analysis, we wanted to look at which sports have the heaviest and tallest athletes, and which sports have the lightest or shortest athletes.

As we saw in the previous article, both height and weight depend heavily on sex, and the dataset has more data on male athletes than on female athletes. So we’ll run the analysis for men, but the same code works for either sex: just change the Sex filter.

# Keep only male athletes
male_df = df[df.Sex == 'M']
# Min, max and mean of Weight and Height for each sport
sport_weight_height_metrics = male_df.groupby(['Sport'])[['Weight', 'Height']].agg(
  ['min', 'max', 'mean'])

sport_weight_height_metrics.Weight.dropna().sort_values('mean', ascending=False)[:5]

As you can see, by grouping the athletes by sport I can compute the minimum, maximum and mean weight and height for each sport.
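To make the grouping step concrete, here is a minimal sketch on a tiny made-up DataFrame. Only the column names (Sex, Sport, Weight, Height) mirror the Kaggle dataset; the rows and the helper name sport_metrics are invented for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration; only the column names match the dataset.
toy_df = pd.DataFrame({
    'Sex':    ['M', 'M', 'M', 'F', 'M'],
    'Sport':  ['Judo', 'Judo', 'Rowing', 'Rowing', 'Rowing'],
    'Weight': [70.0, 90.0, 85.0, 68.0, 95.0],
    'Height': [170.0, 180.0, 190.0, 175.0, 194.0],
})

def sport_metrics(df, sex):
    """Hypothetical helper: min, max and mean Weight/Height per sport for one sex."""
    filtered = df[df.Sex == sex]
    return filtered.groupby('Sport')[['Weight', 'Height']].agg(['min', 'max', 'mean'])

male_metrics = sport_metrics(toy_df, 'M')
# The columns form a two-level index: (measurement, statistic).
print(male_metrics.Weight.sort_values('mean', ascending=False))
```

Passing 'F' instead of 'M' gives the same table for female athletes, which is all that toggling the filter amounts to.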

Then I looked at the five sports with the heaviest athletes and found (in kilograms):

Sport             min    max    mean
Tug-Of-War        75.0   118.0  95.61
Basketball        59.0   156.0  91.68
Rugby Sevens      65.0   113.0  91.00
Bobsleigh         55.0   145.0  90.38
Beach Volleyball  62.0   110.0  89.51

Not too surprising, is it? Tug-of-war players, basketball players and rugby players are all heavy. Interestingly, weights vary widely in basketball (from 59 kg to 156 kg) and in rugby sevens (from 65 kg to 113 kg), while most tug-of-war players weigh more than 80 kg.

I then plotted the average weight for each sport and found that it roughly follows a normal distribution:

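# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a histogram plus a density curve
sns.histplot(sport_weight_height_metrics.Weight.dropna()['mean'], kde=True)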

The average weight of athletes follows a normal distribution.

The height of athletes has a similar normal distribution, but its variance is very small and highly concentrated near the mean:

The height of the athletes is normally distributed.

Next, I plotted the per-sport averages as an ordered scatter plot, to see whether any outliers stand out.

means = list(sport_weight_height_metrics.Weight.dropna()['mean'])
sports = list(sport_weight_height_metrics.Weight.dropna().index)
plot_data = sorted(zip(sports, means), key = lambda x:x[1])
plot_data_dict = {
    'x' : [i for i, _ in enumerate(plot_data)],
    'y' : [v[1] for i, v in enumerate(plot_data)],
    'group' :  [v[0] for i, v in enumerate(plot_data)]
}
sns.scatterplot(data = plot_data_dict, x = 'x' , y = 'y')

The average height of athletes in each Olympic sport.

Indeed, the sports with the heaviest athletes stand out as clear outliers from the rest of the chart, as do the sports with the lightest athletes. For height the variance is noticeably smaller, yet the gap between the outliers and the sports near the mean is even more pronounced, which makes it clearer still that most sports sit close to the mean.
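One way to put a number on "outlier" is a z-score: how many standard deviations a sport's mean sits from the average across all sports. The sketch below uses invented mean weights, and the threshold of 1.5 is an arbitrary choice for the example, not something taken from the analysis above.

```python
import pandas as pd

# Invented per-sport mean weights (kg), for illustration only.
mean_weights = pd.Series({
    'Sport A': 55.0, 'Sport B': 72.0, 'Sport C': 74.0,
    'Sport D': 75.0, 'Sport E': 76.0, 'Sport F': 78.0, 'Sport G': 96.0,
})

# z-score: distance from the mean across sports, in standard deviations.
z_scores = (mean_weights - mean_weights.mean()) / mean_weights.std()

threshold = 1.5  # arbitrary cut-off, chosen for this example
outliers = z_scores[z_scores.abs() > threshold]
print(outliers)
```

With these made-up numbers only the lightest and the heaviest sport pass the threshold, which is the same pattern the scatter plot shows visually.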

The average weight of athletes in each sport.

The plot_data variable built earlier can be reused to list the sports with the lowest and highest average weights.

print('lightest:')
for sport,weight in plot_data[:5]:
    print(sport + ':' + str(weight))

print('\nheaviest:')    
for sport,weight in plot_data[-5:]:
    print(sport + ':' + str(weight))

The result (omitting the heaviest because we’ve already seen it above) is as follows:

lightest:
Gymnastics: 63.3436047592
Ski Jumping: 65.2458805355
Boxing: 65.2962797951
Trampolining: 65.8378378378
Nordic Combined: 66.9095595127

Gymnasts, even male ones, are by far the lightest! They are followed by ski jumpers, boxers (which surprised me a little) and trampolinists, which makes sense.

If we looked for the tallest and shortest athletes, the results would be less surprising. I guess we all expected the same sport to be at the top of the list, and unsurprisingly, it was. At least now we can say it’s not a stereotype.

shortest (cm):
[…]: 167.644438396
Weightlifting: 169.153061224
Trampolining: 171.368421053
[…]: 171.555352242
Wrestling: 172.870686236

tallest (cm):
Rowing: 186.882697947
Handball: 188.778373113
Volleyball: 193.265659955
Beach Volleyball: 193.290909091
[…]: 194.872623574

We can see that gymnasts are generally very light and short. However, some of the sports in the height rankings do not appear in the weight rankings. So I wondered what “build” (that is, weight divided by height) each sport has.

mean_heights = sport_weight_height_metrics.Height.dropna()['mean']
mean_weights = sport_weight_height_metrics.Weight.dropna()['mean']
avg_build = mean_weights/mean_heights
avg_build.sort_values(ascending = True)
builds = list(avg_build.sort_values(ascending = True))

plot_dict = {'x':[i for i,_ in enumerate(builds)],'y':builds}
sns.lineplot(data=plot_dict, x='x', y='y')

The graph looks linear until we get to the top where most outliers fall:

The build (weight/height) distribution of Olympic athletes.

Here are the sports with the smallest and largest builds:

Smallest build (kg/cm)
Alpine Skiing      0.441989
Archery            0.431801
Art Competitions   0.430488
Athletics          0.410746
Badminton          0.413997

Largest build (kg/cm)
Tug-Of-War         0.523977
Rugby Sevens       0.497754
Bobsleigh          0.496656
Weightlifting      0.474433
Handball           0.473507

Rugby sevens and tug-of-war are the sports with the largest builds. Alpine skiers are among the slightest, followed by archers and art competitors (art competitions being an Olympic event I only just learned about and need to read more on).
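Dividing one Series by another aligns them on their index, so each sport's mean weight is divided by that same sport's mean height regardless of row order. The sketch below shows this with invented numbers; nsmallest and nlargest are one way to pull the extremes of the ratio directly, without sorting first.

```python
import pandas as pd

# Invented per-sport means, for illustration only (kg and cm).
mean_weights = pd.Series({'Sport A': 60.0, 'Sport B': 80.0, 'Sport C': 100.0})
# Deliberately in a different order: division matches on the index labels.
mean_heights = pd.Series({'Sport C': 190.0, 'Sport A': 170.0, 'Sport B': 180.0})

avg_build = mean_weights / mean_heights  # kg per cm, per sport

print(avg_build.nsmallest(2))  # the two slightest builds
print(avg_build.nlargest(2))   # the two heaviest builds
```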

Changes in sports over time

Now that we’ve done all the interesting things we can think of with these three columns, I want to start looking at the time variable, specifically the Year column. I’d like to see whether new sports have been introduced to the Olympics, and when. I also want to look at sports that have been dropped.

The code below shows when each sport first appears. This pattern is generally useful whenever we want to see when something first shows up, for instance to explain an unusual jump in a variable.

from collections import Counter

# First year each sport appears, sorted chronologically
sport_min_year = male_df.groupby('Sport').Year.agg(['min', 'max'])['min'].sort_values()
year_count = Counter(sport_min_year)
year = list(year_count.keys())
new_sports = list(year_count.values())

data = {'x':year, 'y':new_sports}
sns.scatterplot(data=data, x = 'x', y='y')

This chart shows how many sports appeared in the Olympic Games for the first time in each year. In other words, how many sports were introduced each year:

Number of sports introduced each year.

So although many sports already existed before 1910, and most were introduced before 1920, new ones kept being added. Looking at the numbers, many new sports were introduced in 1936, and after that only a few (fewer than five) in any given year. No new sports were introduced between 1936 and 1960, when biathlon arrived. After that, new sports were added fairly regularly:

Sport           introduced
Biathlon           1960
Luge               1964
Volleyball         1964
Judo               1964
Table Tennis       1988
Baseball           1992
Short Track Speed Skating 1992
Badminton           1992
Freestyle Skiing    1992
Beach Volleyball    1996
Snowboarding        1998
Taekwondo           2000
Trampolining        2000
Triathlon           2000
Rugby Sevens        2016
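A list like this can be read straight off the first year each sport appears. The sketch below uses a handful of invented rows and an assumed cut-off of 1960; value_counts is an alternative to Counter for tallying how many sports debut in each year.

```python
import pandas as pd

# Invented (Sport, Year) rows, for illustration only.
toy_df = pd.DataFrame({
    'Sport': ['Sport A', 'Sport A', 'Sport B', 'Sport B', 'Sport C', 'Sport D'],
    'Year':  [1896, 1920, 1964, 2016, 1964, 2000],
})

# First year each sport shows up in the data.
first_year = toy_df.groupby('Sport').Year.min()

# Number of sports debuting per year, ordered by year.
debuts_per_year = first_year.value_counts().sort_index()

# Sports introduced in or after the assumed cut-off year.
cutoff = 1960
recent = first_year[first_year >= cutoff].sort_values()

print(debuts_per_year)
print(recent)
```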

An analogous analysis of discontinued sports (those whose last year is not recent) reveals that most of the sports on this list are ones I’ve never heard of (although that is hardly a good indicator of whether a sport is popular or not!).

Basque Pelota    1900
Croquet          1900
Cricket          1900
Roque            1904
Jeu De Paume     1908
Racquets         1908
Motorboating     1908
Lacrosse         1908
Tug-Of-War       1920
Rugby            1924
Military Ski Patrol 1924
Polo             1936
Aeronautics      1936
Alpinism         1936
Art Competitions 1948
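The same idea, mirrored, finds discontinued sports: take the last year each sport appears and keep those whose last year falls well before the most recent Games in the data. The rows below are invented, and the 20-year gap is an arbitrary assumption for the example.

```python
import pandas as pd

# Invented (Sport, Year) rows, for illustration only.
toy_df = pd.DataFrame({
    'Sport': ['Sport A', 'Sport A', 'Sport B', 'Sport B', 'Sport C'],
    'Year':  [1900, 1908, 1896, 2016, 1936],
})

# Last year each sport appears, and the most recent Games in the data.
last_year = toy_df.groupby('Sport').Year.max()
latest_games = toy_df.Year.max()

max_gap = 20  # arbitrary: a sport absent for more than 20 years counts as discontinued
discontinued = last_year[last_year < latest_games - max_gap].sort_values()
print(discontinued)
```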

We can see that art competitions were dropped after 1948, polo has not appeared in the Olympics since 1936, and neither has aeronautics. If anyone knows what the aeronautics event actually was, please let me know. I picture it taking place in planes, but I can’t imagine what the competition looked like. Maybe plane races? Let’s bring those back!

That’s all for today, folks! I hope you enjoyed this tutorial, and maybe picked up a fun fact to bring up at your next family dinner. As always, feel free to fork the code from this analysis and add your own insights. As a follow-up, I’m thinking of training a small machine learning model to predict an athlete’s sex from the Sport, Weight and Height columns. Which model would you use? If anything in this article is unclear or simply wrong, please let me know so we can all learn together!

