This is day 22 of my participation in the August Writing Challenge.

This note is inspired by the R notebook at www.kaggle.com/umeshnaraya…. The goal is to make it as simple as possible to implement the visualizations from that R notebook in Python, along with some additional plots, and to add comments and explanations that help Seaborn/Python beginners with data visualization and customization. We keep things interesting by playing with different colors.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We use pandas to read in the dataset. Each row corresponds to a specific game and contains the game's name, the year it was released, and some categorical attributes such as platform, genre, and publisher. Each row also includes the cumulative sales that the game achieved in each region.

df = pd.read_csv("/home/kesci/input/Datasets6073/vgsales.csv")
df.head()
   Rank  Name                      Platform  Year    Genre         Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0  1     Wii Sports                Wii       2006.0  Sports        Nintendo   41.49     29.02     3.77      8.46         82.74
1  2     Super Mario Bros.         NES       1985.0  Platform      Nintendo   29.08     3.58      6.81      0.77         40.24
2  3     Mario Kart Wii            Wii       2008.0  Racing        Nintendo   15.85     12.88     3.79      3.31         35.82
3  4     Wii Sports Resort         Wii       2009.0  Sports        Nintendo   15.75     11.01     3.28      2.96         33.00
4  5     Pokemon Red/Pokemon Blue  GB        1996.0  Role-Playing  Nintendo   11.27     8.89      10.22     1.00         31.37

Checking the maximum year value, we see that it is 2020, which is an impossible release date.

year_data = df['Year']
print("Max Year Value: ", year_data.max())
Max Year Value:  2020.0

By looking up the name of the entry with the wrong year, we can search the web for the game's actual release date and replace the current value with the correct one.

max_entry = year_data.idxmax()
print(max_entry)
max_entry = df.iloc[max_entry]
pd.DataFrame(max_entry).T
5957
      Rank  Name                    Platform  Year  Genre       Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
5957  5959  Imagine: Makeup Artist  DS        2020  Simulation  Ubisoft    0.27      0         0         0.02         0.29

df['Year'] = df['Year'].replace(2020.0, 2009.0)
print("Max Year Value: ", df['Year'].max())
Max Year Value:  2017.0
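A value-based replacement rewrites every 2020 entry in the column. If you would rather pin the fix to the one offending row, something like the following should work. This is a minimal sketch on a made-up three-row frame (`toy` and its values are illustrative assumptions, not the real dataset).

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({
    "Name": ["Wii Sports", "Imagine: Makeup Artist", "Mario Kart Wii"],
    "Year": [2006.0, 2020.0, 2008.0],
})

# Locate the row holding the maximum year by its index label ...
bad_label = toy["Year"].idxmax()

# ... and overwrite only that cell, leaving every other row untouched.
toy.loc[bad_label, "Year"] = 2009.0

print(toy["Year"].max())
```

Because `idxmax()` returns an index label, `.loc` is the matching accessor here; it keeps working even if the index is no longer a plain 0..n-1 range.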

Next we examine the number of games (rows) and the number of unique publishers, platforms, and genres to get a sense of how the games in our dataset are distributed.

print("Number of games: ", len(df))
publishers = df['Publisher'].unique()
print("Number of publishers: ", len(publishers))
platforms = df['Platform'].unique()
print("Number of platforms: ", len(platforms))
genres = df['Genre'].unique()
print("Number of genres: ", len(genres))
Number of games:  16598
Number of publishers:  579
Number of platforms:  31
Number of genres:  12
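One detail worth knowing about the counts above: `unique()` keeps a missing value as one of its results, so a column with gaps reports one extra "category". The sketch below, on a small hypothetical frame, contrasts it with `nunique()`, which skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with one missing publisher.
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Ubisoft", np.nan, "Nintendo"],
    "Genre": ["Sports", "Simulation", "Racing", "Sports"],
})

# unique() keeps NaN as one of the values, so the missing publisher is counted.
with_nan = len(toy["Publisher"].unique())

# nunique() ignores NaN by default, so only real publisher names are counted.
without_nan = toy["Publisher"].nunique()

print(with_nan, without_nan)

# nunique() on a DataFrame counts every column in one call.
print(toy.nunique())
```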

Let’s do a simple null check. We could search the web for all the missing years and publishers, but for now we simply drop the entries for games that don’t have all the data.

print(df.isnull().sum())
df = df.dropna()
Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64
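Since only the year and publisher columns contain gaps, the drop can also be restricted to those columns, and it can be useful to report how many rows were lost. A minimal sketch, assuming a small made-up frame in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row B lacks a year, row C lacks a publisher.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2006.0, np.nan, 2008.0, 2009.0],
    "Publisher": ["Nintendo", "Ubisoft", np.nan, "Sega"],
})

before = len(toy)

# Drop a row only when one of the named columns is missing.
cleaned = toy.dropna(subset=["Year", "Publisher"])

print(f"Dropped {before - len(cleaned)} of {before} rows")
```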

Let’s create a simple bar chart of the total annual worldwide sales of video games. We take all of our video game sales data, group it by “Year”, and call .sum() to get the total for each year. This gives us the total sales for each year, with the year as the index (row name).

In the dataset, the year index is a floating-point number, such as “2006.0” instead of “2006”, so we get our x values by casting the index to integers. Once the data is ready, we simply pass the x and y variables to Seaborn’s barplot function. We also set the x label and the title, rotate the x tick labels, and change their font size.

# Total global sales per year; the index holds the (float) years
y = df.groupby('Year')['Global_Sales'].sum()
x = y.index.astype(int)

plt.figure(figsize=(12,8))
ax = sns.barplot(y = y.values, x = x)
# Years are on the x-axis and sales on the y-axis
ax.set_xlabel(xlabel='Year', fontsize=16)
ax.set_xticklabels(labels = x, fontsize=12, rotation=50)
ax.set_ylabel(ylabel='$ Millions', fontsize=16)
ax.set_title(label='Game Sales in $ Millions Per Year', fontsize=20)
plt.show()
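To see what the grouping step produces before anything is plotted, here is a minimal sketch on a hypothetical three-row frame (the numbers are made up for illustration). It selects the sales column first, sums within each year, and casts the float index to integers.

```python
import pandas as pd

# Hypothetical sales rows; two games share the year 2006.
toy = pd.DataFrame({
    "Year": [2006.0, 2006.0, 2008.0],
    "Global_Sales": [82.74, 1.26, 35.82],
})

# Select the sales column first, then sum within each year.
totals = toy.groupby("Year")["Global_Sales"].sum()

# The index holds floats (2006.0, 2008.0); cast it to int for cleaner tick labels.
years = totals.index.astype(int)

print(years.tolist())
print(totals.round(2).tolist())
```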

Below we’ve created a simple bar chart of the total number of video games released each year, with a slight twist: it’s horizontal. Our years, usually on the x-axis, are now on the y-axis, while the count of “Global_Sales” entries, usually on the y-axis, is now on the x-axis.

# Number of releases per year (count of non-null Global_Sales entries)
x = df.groupby('Year')['Global_Sales'].count()
y = x.index.astype(int)

plt.figure(figsize=(12,8))
colors = sns.color_palette("muted")
ax = sns.barplot(y = y, x = x.values, orient='h', palette=colors)
ax.set_xlabel(xlabel='Number of releases', fontsize=16)
ax.set_ylabel(ylabel='Year', fontsize=16)
ax.set_title(label='Game Releases Per Year', fontsize=20)
Text(0.5, 1.0, 'Game Releases Per Year')
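The horizontal layout comes down to swapping which variable goes on which axis and passing `orient='h'`. The following is a minimal sketch on a made-up list of release years; the `Agg` backend line is only there so it runs without a display, and the years are passed as strings so they are treated as categories.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical release years, one row per game.
toy = pd.DataFrame({"Year": [2006, 2006, 2007, 2008, 2008, 2008]})

# size() counts the rows in each group, i.e. releases per year.
releases = toy.groupby("Year").size()

fig, ax = plt.subplots(figsize=(6, 4))
# Categories (years) go on y and counts on x; orient="h" makes that explicit.
sns.barplot(x=releases.values, y=releases.index.astype(str), orient="h", ax=ax)
ax.set_xlabel("Number of releases")
ax.set_ylabel("Year")
```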

Below we have created a point plot showing the publisher with the highest sales in each year. Global sales are on the y-axis and years are on the x-axis, and we use the point plot’s “hue” parameter to indicate the top publisher.

We use a pivot table, which makes it easy to compute “publishers”, the name of the top-selling publisher for each year, and “sales”, the worldwide sales that publisher generated in that year.

Note that the pivot table accepts an aggregation function argument (aggfunc), with other options such as mean and median. The point plot takes a DataFrame, so you can simply pass column names for x, y, and hue. We also customize our x tick labels by rotating and resizing them.

# Rows are publishers, columns are years, cells are summed global sales
table = df.pivot_table('Global_Sales', index='Publisher', columns='Year', aggfunc='sum')
publishers = table.idxmax()   # top publisher per year
sales = table.max()           # that publisher's sales per year
years = table.columns.astype(int)
data = pd.concat([publishers, sales], axis=1)
data.columns = ['Publisher', 'Global Sales']

plt.figure(figsize=(12,8))
ax = sns.pointplot(y = 'Global Sales', x = years, hue='Publisher', data=data)
ax.set_xlabel(xlabel='Year', fontsize=16)
ax.set_ylabel(ylabel='Global Sales Per Year', fontsize=16)
ax.set_title(label='Highest Publisher Revenue in $ Millions Per Year', fontsize=20)
ax.set_xticklabels(labels = years, fontsize=12, rotation=50)
plt.show()
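The pivot-then-`idxmax` step is easier to follow on a tiny example. The sketch below uses a hypothetical five-row frame (publisher names and figures are made up) to show how the per-year winner and its sales fall out of the pivot table.

```python
import pandas as pd

# Hypothetical sales; Sega has no 2006 entry, so that cell becomes NaN.
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Nintendo", "Ubisoft", "Ubisoft", "Sega"],
    "Year": [2006, 2007, 2006, 2007, 2007],
    "Global_Sales": [80.0, 20.0, 5.0, 30.0, 10.0],
})

# Rows are publishers, columns are years, cells are summed sales.
table = toy.pivot_table(values="Global_Sales", index="Publisher",
                        columns="Year", aggfunc="sum")

top_publisher = table.idxmax()  # row label of the largest value in each column
top_sales = table.max()         # the largest value in each column

summary = pd.concat([top_publisher, top_sales], axis=1)
summary.columns = ["Publisher", "Global Sales"]
print(summary)
```

Both `idxmax()` and `max()` skip missing cells by default, so a publisher with no releases in a given year simply cannot win that year.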

Below, we plot the game with the highest global sales in each year. We also display the underlying data for reference. You could map a different color to each game, but adding a legend with that many entries would make the plot look cluttered.

The data preparation for this chart is similar to the one above, except that we do not use hue to represent categories in the data. Instead, we use a palette and pass it the number of colors we want from that particular palette.
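To get a feel for what a palette with a requested number of colors looks like, it can be inspected on its own. A small sketch, where `n_bars` is a hypothetical stand-in for the number of bars in the chart:

```python
import seaborn as sns

n_bars = 5  # hypothetical number of bars to color

# Ask the "GnBu_d" palette for exactly n_bars colors.
colors = sns.color_palette("GnBu_d", n_bars)

# The palette behaves like a list with one (r, g, b) tuple per requested color.
print(len(colors))
print(colors[0])
```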

# Rows are game names, columns are years
table = df.pivot_table('Global_Sales', index='Name', columns='Year')
table.columns = table.columns.astype(int)
games = table.idxmax()   # top game per year
sales = table.max()      # that game's sales figure per year
years = table.columns
data = pd.concat([games, sales], axis=1)
data.columns = ['Game', 'Global Sales']

# One color per bar
colors = sns.color_palette("GnBu_d", len(data))
plt.figure(figsize=(12,8))
ax = sns.barplot(y = years, x = 'Global Sales', data=data, orient='h', palette=colors)
ax.set_xlabel(xlabel='Global Sales Per Year', fontsize=16)
ax.set_ylabel(ylabel='Year', fontsize=16)
ax.set_title(label='Highest Revenue Per Game in $ Millions Per Year', fontsize=20)
plt.show()
data

Year  Game                                              Global Sales
1980  Asteroids                                         4.310
1981  Pitfall!                                          4.500
1982  Pac-Man                                           7.810
1983  Baseball                                          3.200
1984  Duck Hunt                                         28.310
1985  Super Mario Bros.                                 40.240
1986  The Legend of Zelda                               6.510
1987  Zelda II: The Adventure of Link                   4.380
1988  Super Mario Bros. 3                               17.280
1989  Tetris                                            30.260
1990  Super Mario World                                 20.610
1991  The Legend of Zelda: A Link to the Past           4.610
1992  Super Mario Land 2: 6 Golden Coins                11.180
1993  Super Mario All-Stars                             10.550
1994  Donkey Kong Country                               9.300
1995  Donkey Kong Country 2: Diddy’s Kong Quest         5.150
1996  Pokemon Red/Pokemon Blue                          31.370
1997  Gran Turismo                                      10.950
1998  Pokémon Yellow: Special Pikachu Edition           14.640
1999  Pokemon Gold/Pokemon Silver                       23.100
2000  Pokémon Crystal Version                           6.390
2001  Gran Turismo 3: A-Spec                            14.980
2002  Grand Theft Auto: Vice City                       16.150
2003  Mario Kart: Double Dash!!                         6.950
2004  Grand Theft Auto: San Andreas                     20.810
2005  Nintendogs                                        24.760
2006  Wii Sports                                        82.740
2007  Wii Fit                                           22.720
2008  Mario Kart Wii                                    35.820
2009  Wii Sports Resort                                 33.000
2010  Kinect Adventures!                                21.820
2011  Mario Kart 7                                      12.210
2012  New Super Mario Bros. 2                           9.820
2013  Grand Theft Auto V                                18.890
2014  Pokemon Omega Ruby/Pokemon Alpha Sapphire         11.330
2015  Call of Duty: Black Ops 3                         5.064
2016  Uncharted 4: A Thief’s End                        4.200
2017  Phantasy Star Online 2 Episode 4: Deluxe Package  0.020
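One thing to keep in mind when reading this table: the pivot table for this chart is built without an aggfunc, and pandas then averages duplicate entries by default. A title listed once per platform is therefore ranked by its mean per-platform sales rather than its total. The sketch below, with two made-up games, shows how the yearly winner can differ between the two aggregations.

```python
import pandas as pd

# One hypothetical game sold on two platforms, one sold on a single platform.
toy = pd.DataFrame({
    "Name": ["Game X", "Game X", "Game Y"],
    "Platform": ["PS4", "XOne", "PS4"],
    "Year": [2015, 2015, 2015],
    "Global_Sales": [6.0, 4.0, 7.0],
})

# Default aggregation is the mean of the duplicate (Name, Year) entries.
mean_table = toy.pivot_table(values="Global_Sales", index="Name", columns="Year")

# Explicit sum adds the platforms together instead.
sum_table = toy.pivot_table(values="Global_Sales", index="Name",
                            columns="Year", aggfunc="sum")

winner_by_mean = mean_table.idxmax().loc[2015]
winner_by_sum = sum_table.idxmax().loc[2015]
print(winner_by_mean, winner_by_sum)
```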

# Top 10 publishers by number of games released
data = df.groupby(['Publisher']).count().iloc[:,0]
data = pd.DataFrame(data.sort_values(ascending=False))[0:10]
publishers = data.index
data.columns = ['Releases']

colors = sns.color_palette("spring", len(data))
plt.figure(figsize=(12,8))
ax = sns.barplot(y = publishers, x = 'Releases', data=data, orient='h', palette=colors)
ax.set_xlabel(xlabel='Number of Releases', fontsize=16)
ax.set_ylabel(ylabel='Publisher', fontsize=16)
ax.set_title(label='Top 10 Total Publisher Games Released', fontsize=20)
ax.set_yticklabels(labels = publishers, fontsize=14)
plt.show()

# Top 10 publishers by total global sales
data = df.groupby(['Publisher'])['Global_Sales'].sum()
data = pd.DataFrame(data.sort_values(ascending=False))[0:10]
publishers = data.index
data.columns = ['Global Sales']

colors = sns.color_palette("cool", len(data))
plt.figure(figsize=(12,8))
ax = sns.barplot(y = publishers, x = 'Global Sales', data=data, orient='h', palette=colors)
ax.set_xlabel(xlabel='Revenue in $ Millions', fontsize=16)
ax.set_ylabel(ylabel='Publisher', fontsize=16)
ax.set_title(label='Top 10 Total Publisher Game Revenue', fontsize=20)
ax.set_yticklabels(labels = publishers, fontsize=14)
plt.show()
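The sort-then-slice pattern used for the top-10 charts can also be written with `Series.nlargest`, which returns the largest values already in descending order. A minimal sketch on a made-up frame, taking the top 2 instead of the top 10:

```python
import pandas as pd

# Hypothetical sales rows for three publishers.
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Ubisoft", "Nintendo", "Sega", "Ubisoft"],
    "Global_Sales": [40.0, 3.0, 30.0, 5.0, 4.0],
})

# Total sales per publisher.
revenue = toy.groupby("Publisher")["Global_Sales"].sum()

# nlargest(n) returns the n biggest values, sorted from largest to smallest.
top2 = revenue.nlargest(2)
print(top2)
```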

# Genres ranked by number of games released
rel = df.groupby(['Genre']).count().iloc[:,0]
rel = pd.DataFrame(rel.sort_values(ascending=False))
genres = rel.index
rel.columns = ['Releases']

colors = sns.color_palette("summer", len(rel))
plt.figure(figsize=(12,8))
ax = sns.barplot(y = genres, x = 'Releases', data=rel, orient='h', palette=colors)
ax.set_xlabel(xlabel='Number of Releases', fontsize=16)
ax.set_ylabel(ylabel='Genre', fontsize=16)
ax.set_title(label='Genres by Total Number of Games Released', fontsize=20)
ax.set_yticklabels(labels = genres, fontsize=14)
plt.show()

# Genres ranked by total global sales
rev = df.groupby(['Genre'])['Global_Sales'].sum()
rev = pd.DataFrame(rev.sort_values(ascending=False))
genres = rev.index
rev.columns = ['Revenue']

colors = sns.color_palette('Set3', len(rev))
plt.figure(figsize=(12,8))
ax = sns.barplot(y = genres, x = 'Revenue', data=rev, orient='h', palette=colors)
ax.set_xlabel(xlabel='Revenue in $ Millions', fontsize=16)
ax.set_ylabel(ylabel='Genre', fontsize=16)
ax.set_title(label='Genres by Total Revenue Generated in $ Millions', fontsize=20)
ax.set_yticklabels(labels = genres, fontsize=14)
plt.show()
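The two genre charts rely on the same grouping with different aggregations, so both figures can be computed in one pass if you prefer. The following sketch uses pandas named aggregation on a hypothetical five-row frame; the column names `Releases` and `Revenue` mirror the ones used above, and the values are made up.

```python
import pandas as pd

# Hypothetical sales rows across three genres.
toy = pd.DataFrame({
    "Genre": ["Sports", "Action", "Action", "Racing", "Action"],
    "Global_Sales": [80.0, 5.0, 3.0, 30.0, 2.0],
})

# Named aggregation: one column for release counts, one for summed sales.
by_genre = toy.groupby("Genre").agg(
    Releases=("Global_Sales", "count"),
    Revenue=("Global_Sales", "sum"),
)

# The same table can then be sorted either way before plotting.
print(by_genre.sort_values("Releases", ascending=False))
print(by_genre.sort_values("Revenue", ascending=False))
```

In this toy data the genre with the most releases (Action) is not the one with the most revenue (Sports), which is exactly the kind of contrast the two charts are meant to expose.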

I am "White and White I", a programmer who likes to share knowledge ❤️
