Being able to quickly create powerful visualizations and carry out exploratory data analysis is becoming critical in today’s business world. Today we are going to talk about using Python to visualize data!

Once you have a clean data set, the next step is exploratory data analysis (EDA). EDA is the process of figuring out what the data can tell us; we use it to look for patterns, relationships, or anomalies that guide our later analysis. There are many methods in EDA, but one of the most effective starting tools is the pair plot (also known as a scatterplot matrix). A pair plot lets us see both the distribution of single variables and the relationship between two variables. Pair plots are a great way to identify trends for follow-up analysis and, fortunately, they are easy to implement in Python!

In this article, we will walk through getting pair plots up and running in Python using the Seaborn visualization library. We’ll see how to create a default pair plot to quickly examine our data, and how to customize the visualization for deeper insight. The code for the project is available on GitHub as a Jupyter Notebook. In this project, we will explore a real-world data set of national-level socioeconomic data collected by GapMinder.

Pair Plots in Seaborn

Before we start, we need to know what data we have. We can load the socioeconomic data as a pandas DataFrame and look at its columns.



Each row of the data represents an observation for one country in one year, and the columns hold the variables (data in this format is known as tidy data). There are two categorical columns (country and continent) and four numerical columns. Besides year, the numerical columns are: life_exp, life expectancy at birth in years; pop, population; and gdp_per_cap, gross domestic product per capita in international dollars.
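The article does not show the file it loads the data from, so here is a minimal sketch that builds a small stand-in DataFrame with the same column names and then separates the categorical columns from the numerical ones. The name `df_demo` and all values are made up for illustration; they are random numbers, not GapMinder data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same column names as the article's data.
# The values are random and carry no real-world meaning.
rng = np.random.default_rng(42)
n = 12
df_demo = pd.DataFrame({
    "country": [f"country_{i % 4}" for i in range(n)],
    "continent": rng.choice(["Africa", "Asia", "Europe"], size=n),
    "year": np.repeat([1997, 2002, 2007], 4),
    "life_exp": rng.normal(65, 8, size=n).round(1),
    "pop": rng.lognormal(16, 1.5, size=n).round(),
    "gdp_per_cap": rng.lognormal(8, 1, size=n).round(2),
})

# Look at the first rows, then split the columns by type
print(df_demo.head())
categorical_cols = df_demo.select_dtypes(exclude="number").columns.tolist()
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
print("categorical:", categorical_cols)
print("numerical:  ", numeric_cols)
```

Each row is one country in one year, which is the tidy layout described above.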

Although we will use the categorical variables for coloring later, the default pair plot in Seaborn draws only the numerical columns. Creating the default pair plot is simple: we import the Seaborn library and call the pairplot function, passing it our DataFrame:

# Seaborn visualization library
import seaborn as sns
# Create the default pairplot
sns.pairplot(df)

I’m still amazed that a single line of code can give us the entire plot! The pair plot builds on two basic figures, the histogram and the scatter plot. The histograms on the diagonal show the distribution of a single variable, while the scatter plots in the upper and lower triangles show the relationship between two variables. For example, the leftmost plot in the second row shows the scatter plot of life_exp versus year.

The default pair plot alone often gives us valuable insights. We see a positive correlation between life expectancy and GDP per capita, suggesting that people in higher-income countries tend to live longer (although this of course does not prove that one causes the other). It also appears that life expectancy around the world is rising over time. To display population and GDP per capita more clearly in later plots, we can transform those columns by taking the base-10 logarithm of their values:

# NumPy provides the log10 function
import numpy as np

# Take the log of population and gdp_per_capita
df['log_pop'] = np.log10(df['pop'])
df['log_gdp_per_cap'] = np.log10(df['gdp_per_cap'])

# Drop the non-transformed columns
df = df.drop(columns = ['pop', 'gdp_per_cap'])
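As a rough illustration of why the log transformation helps, the sketch below compares the skewness of a heavily right-skewed series before and after taking log10. The series is synthetic (drawn from a log-normal distribution as a stand-in for a column such as population), so the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a column such as population
skewed = pd.Series(rng.lognormal(mean=16, sigma=1.5, size=1000))
logged = np.log10(skewed)

# Skewness near zero means a roughly symmetric distribution
print("skew before log10:", round(skewed.skew(), 2))
print("skew after log10: ", round(logged.skew(), 2))

# How far the largest value sits from the typical value, before and after
print("max / median before:", round(skewed.max() / skewed.median(), 1))
print("max / median after: ", round(logged.max() / logged.median(), 2))
```

After the transformation the largest values no longer dwarf the rest, so histograms and scatter plots use the available axis range much better.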


While this plot alone can be useful in an analysis, we can make it more valuable by coloring the points according to a categorical variable such as continent. This is very easy in Seaborn! All we need to do is pass the hue keyword in the call to sns.pairplot:

sns.pairplot(df, hue = 'continent')

Now we see that Oceania and Europe tend to have the highest life expectancy and Asia the largest population. Note that the log transformation of population and GDP per capita spreads these variables out more evenly, which makes them easier to read.

The plot above is more informative, but there are still some problems: the stacked histograms on the diagonal are not very easy to interpret. A better way to show univariate distributions from multiple categories is a density plot. We can swap the histograms for density plots in the function call. While we are at it, we pass some keywords to the scatter plots to change the transparency, size, and edge color of the points.

# Create a pair plot colored by continent with a density plot on the
# diagonal and format the scatter plots.
# Note: seaborn 0.9 renamed the size argument to height.
sns.pairplot(df, hue = 'continent', diag_kind = 'kde',
             plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},
             height = 4)

Density plots on the diagonal make it easier to compare the distributions across continents than stacked histograms. Changing the transparency of the scatter points improves readability because the points overlap considerably (known as overplotting).

As a final example of the default pairplot, let’s reduce clutter by plotting only the years from 2000 onward. We will still color by continent, but this time we will not draw the year column. To limit the columns that are drawn, we pass a list of vars to the function. To make the plot easier to read, we can also add a title.


# matplotlib is needed for the title
import matplotlib.pyplot as plt

# Plot colored by continent for years 2000-2007
# (height replaces the older size argument)
sns.pairplot(df[df['year'] >= 2000],
             vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'],
             hue = 'continent', diag_kind = 'kde',
             plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},
             height = 4);

# Title
plt.suptitle('Pair Plot of Socioeconomic Data for 2000-2007', size = 28);
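To see what the row filter and the vars list select before anything is drawn, here is a small pandas-only sketch. The frame `demo` and its values are hand-made for illustration and are not taken from the GapMinder data.

```python
import pandas as pd

# Tiny hand-made frame with illustrative values only
demo = pd.DataFrame({
    "year": [1997, 2002, 2007, 1997, 2002, 2007],
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "life_exp": [60.0, 62.0, 64.0, 75.0, 76.0, 77.0],
    "log_pop": [7.1, 7.2, 7.3, 6.5, 6.5, 6.6],
    "log_gdp_per_cap": [3.0, 3.1, 3.2, 4.2, 4.3, 4.4],
})

# The boolean mask keeps only the rows from 2000 onward
recent = demo[demo["year"] >= 2000]

# Columns that would be drawn; year is deliberately left out
plot_vars = ["life_exp", "log_pop", "log_gdp_per_cap"]

print(sorted(recent["year"].unique().tolist()))
print(recent[plot_vars + ["continent"]])
```

The continent column stays in the filtered frame because pairplot still needs it for the hue coloring, even though it is not one of the plotted variables.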

This is starting to look good! If we were going to build a model, we could use the information in these plots to inform our choices. For example, we know that log_gdp_per_cap is positively correlated with life_exp, so we could create a linear model to quantify this relationship. For this article we’ll stick to plotting, and if we want to explore our data further, we can customize the pair plot using the PairGrid class.
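The article does not build the linear model, but the idea can be sketched in a few lines. The example below assumes an ordinary least-squares line fitted with `np.polyfit`, and it uses synthetic data generated from a made-up linear relationship, so the fitted numbers only show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: an assumed linear relationship plus noise (not real values)
log_gdp_per_cap = rng.uniform(2.5, 4.5, size=200)
life_exp = 20 + 12 * log_gdp_per_cap + rng.normal(0, 3, size=200)

# Fit life_exp = slope * log_gdp_per_cap + intercept by least squares
slope, intercept = np.polyfit(log_gdp_per_cap, life_exp, deg=1)

print("slope:    ", round(slope, 2))
print("intercept:", round(intercept, 2))

# Predicted life expectancy at a GDP per capita of 10**4 international dollars
print("prediction at log_gdp_per_cap = 4:", round(slope * 4 + intercept, 1))
```

Because the predictor is on a log10 scale, the slope reads as the change in life expectancy associated with a tenfold increase in GDP per capita.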

Customize using PairGrid

In contrast to the sns.pairplot function, sns.PairGrid is a class, which means that it does not automatically fill in the plots for us. Instead, we create an instance of the class and then map specific functions to the different sections of the grid. To create a PairGrid instance with our data, we use the following code, which also limits the variables we will show:


# Create an instance of the PairGrid class.
# df is the log-transformed DataFrame from above;
# height replaces the older size argument.
grid = sns.PairGrid(data = df[df['year'] == 2007],
                    vars = ['life_exp', 'log_pop',
                            'log_gdp_per_cap'], height = 4)


If we were to display this, we would get a blank figure because we have not yet mapped any functions to the grid sections. A PairGrid has three sections to fill in: the upper triangle, the lower triangle, and the diagonal. To map plots to these sections, we use the grid.map_* method for the section. For example, to map a scatter plot to the upper triangle we use:


# Map a scatter plot to the upper triangle
grid = grid.map_upper(plt.scatter, color = 'darkred')


The map_upper method takes any function that accepts two arrays of variables (such as plt.scatter) along with associated keywords (such as color). The map_lower method works exactly the same way but fills the lower triangle of the grid. The map_diag method is slightly different because it takes a function that accepts a single array (remember that the diagonal shows only one variable). An example is plt.hist, which we use to fill in the diagonal section below:


# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins = 10, color = 'darkred', 
                     edgecolor = 'k')
# Map a density plot to the lower triangle
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')


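In this case, we use a two-dimensional kernel density estimate (a density plot) in the lower triangle. Put together, this code gives us a grid with scatter plots in the upper triangle, histograms on the diagonal, and density plots in the lower triangle.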
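The calling convention for the three sections can be checked with a small self-contained sketch. The wrapper functions `two_array_func` and `one_array_func`, the `calls` list, and the synthetic `demo` data are all invented for illustration; the sketch assumes seaborn 0.9 or later, where the figure size argument is called height.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (random values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "life_exp": rng.normal(65, 10, size=80),
    "log_pop": rng.normal(7, 0.7, size=80),
    "log_gdp_per_cap": rng.normal(3.5, 0.5, size=80),
})

calls = []  # records how each mapped function was called

def two_array_func(x, y, **kwargs):
    """Off-diagonal functions receive two arrays, x and y."""
    calls.append(("pair", len(x), len(y)))
    plt.scatter(x, y, **kwargs)

def one_array_func(x, **kwargs):
    """Diagonal functions receive a single array."""
    calls.append(("single", len(x)))
    plt.hist(x, **kwargs)

grid = sns.PairGrid(demo, height=2)
grid.map_upper(two_array_func, color="darkred")
grid.map_lower(two_array_func, color="gray")
grid.map_diag(one_array_func, color="darkred")

print(len(calls), "calls in total")
print(sum(1 for c in calls if c[0] == "single"), "of them on the diagonal")
```

With three variables there are three panels in each triangle and three on the diagonal, and each mapped function is called once per panel.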



The real benefit of the PairGrid class comes when we want to create custom functions to map different information onto the plot. For example, I might want to add the Pearson correlation coefficient between two variables to the scatter plot. To do this, I write a function that takes two arrays, calculates the statistic, and then draws it on the plot. The following code shows how this is done (credit to a Stack Overflow answer):


# Imports used by the function and the grid
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Function to calculate correlation coefficient between two arrays
def corr(x, y, **kwargs):
    # Calculate the value
    coef = np.corrcoef(x, y)[0][1]
    # Make the label
    label = r'$\rho$ = ' + str(round(coef, 2))
    # Add the label to the plot
    ax = plt.gca()
    ax.annotate(label, xy = (0.2, 0.95), size = 20, xycoords = ax.transAxes)

# Create a pair grid instance (height replaces the older size argument)
grid = sns.PairGrid(data = df[df['year'] == 2007],
                    vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'], height = 4)

# Map the plots to the locations
grid = grid.map_upper(plt.scatter, color = 'darkred')
grid = grid.map_upper(corr)
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
grid = grid.map_diag(plt.hist, bins = 10, edgecolor = 'k', color = 'darkred');


Our new function is mapped to the upper triangle because it needs two arrays to compute a correlation coefficient (note also that we can map multiple functions to the same grid section). As a result, each scatter plot in the upper triangle is annotated with its correlation coefficient.
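For readers who want to see what the statistic is, the sketch below computes the Pearson correlation coefficient by hand (the sum of products of the centered values, divided by the square root of the product of their sums of squares) and compares it with np.corrcoef. The helper name `pearson` and the sample arrays are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation computed from centered values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Illustrative values only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print("by hand:    ", round(pearson(x, y), 4))
print("np.corrcoef:", round(np.corrcoef(x, y)[0][1], 4))
```

np.corrcoef returns the full 2 x 2 correlation matrix, which is why the function in the article indexes it with [0][1] to pick out the coefficient between the two arrays.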



The correlation coefficient now appears above the scatter plot. This is a relatively simple example, but we can use PairGrid to map any function we want onto the plot. We can add as much information as we need, as long as we can figure out how to write the function! As a final example, we can show summary statistics on the diagonal instead of a plot.



This shows the general idea: in addition to using any existing function in a library (such as matplotlib) to map data onto the figure, we can write our own function to show custom information.
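The code for the summary-statistics example is not included in this article, so the following is only a sketch of how such a diagonal function might look, not the original implementation. The function name `summary`, the choice of statistics, the annotation position, and the synthetic `demo` data are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def summary(x, **kwargs):
    """Annotate the current axes with summary statistics of one variable."""
    x = pd.Series(x)
    label = "\n".join([
        f"mean = {x.mean():.2f}",
        f"median = {x.median():.2f}",
        f"std = {x.std():.2f}",
    ])
    ax = plt.gca()
    ax.annotate(label, xy=(0.1, 0.4), xycoords=ax.transAxes, size=12)
    return label

# Synthetic stand-in data (random values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "life_exp": rng.normal(65, 10, size=80),
    "log_pop": rng.normal(7, 0.7, size=80),
    "log_gdp_per_cap": rng.normal(3.5, 0.5, size=80),
})

grid = sns.PairGrid(demo, height=2)
grid.map_upper(plt.scatter, color="darkred")
grid.map_lower(plt.scatter, color="darkred")
grid.map_diag(summary)  # text on the diagonal instead of a histogram
```

The function accepts and ignores extra keywords such as color and label, because map_diag passes them to whatever function it is given.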

Conclusion

Pair plots are a powerful tool for quickly exploring distributions and relationships in a data set. Seaborn provides a simple default method for making pair plots, which can be customized and extended through the PairGrid class. In a data analysis project, much of the value often comes not from flashy machine learning but from straightforward data visualization. A pair plot gives us a comprehensive first look at our data and is a great starting point for a data analysis project.


This article was recommended by Beijing Post @ Love coco – Love life teacher and translated by the Ali Yunqi Community.

The original article is titled "Visualizing Data with Pair Plots in Python".

The tiger said eight ways.

This article is an abridged translation. For more details, please refer to the original text.