Data visualization is a major task for data scientists. In the early stages of a project, exploratory data analysis (EDA) is often performed to gain understanding and insight into the data. Especially for large, high-dimensional data sets, data visualization can really help make the data relationships clearer and easier to understand.

At the end of a project, it is just as important to present the final results in a clear, concise and compelling way that an audience, which is often non-technical, can understand.

Matplotlib is a popular Python library that makes data visualization easy. However, setting up the data, parameters and figure from scratch every time you start a new project is tedious. In this article, we’ll look at five ways to visualize data and write some quick and easy functions for them with Python’s Matplotlib library.

First, let’s take a look at this overview chart, which will guide you in choosing the right visualization technique for a given situation:



Choose the appropriate data visualization technique for the situation

Scatter plot

Scatter plots are great for showing the relationship between two variables because you can see the original distribution of the data directly. You can also easily see the relationship between different groups of data by setting different colors, as shown in the figure below. What if you want to visualize the relationship between three variables? No problem! Just add one more parameter (such as the size of the point) to represent the third variable, as shown in the second figure below.






Now let’s write the code. First import Matplotlib’s pyplot module under the alias plt. Call plt.subplots() to create a new plot. The x-axis and y-axis data are passed in as the arrays x_data and y_data, which are handed, along with the other parameters, to ax.scatter() to draw the scatter plot. We can also set the point size, color and alpha transparency, and even switch the y-axis to a logarithmic scale. Finally, set the title and axis labels for the plot. The function handles the whole drawing process end to end!

import matplotlib.pyplot as plt
import numpy as np

def scatterplot(x_data, y_data, x_label="", y_label="", title="", color = "r", yscale_log=False):

    # Create the plot object
    _, ax = plt.subplots()

    # Plot the data, set the size (s), color and transparency (alpha)
    # of the points
    ax.scatter(x_data, y_data, s = 10, color = color, alpha = 0.75)

    # Optionally switch the y-axis to a logarithmic scale
    if yscale_log:
        ax.set_yscale('log')

    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
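To make the "third variable" idea concrete, here is a minimal sketch in which marker size carries the third variable and color carries the group. The helper name bubble_scatter and the random data are assumptions made up for illustration; they are not part of the original article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def bubble_scatter(x, y, sizes, groups, x_label="", y_label="", title=""):
    """Hypothetical helper: marker area encodes a third variable,
    color encodes group membership."""
    x, y, sizes, groups = map(np.asarray, (x, y, sizes, groups))
    _, ax = plt.subplots()
    # One scatter call per group, so each group gets its own color
    # from the default color cycle and its own legend entry
    for name in np.unique(groups):
        mask = groups == name
        ax.scatter(x[mask], y[mask], s=sizes[mask], alpha=0.75, label=str(name))
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend(loc="best")
    return ax


# Synthetic data, used only to exercise the function
rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = 2 * x + rng.normal(size=60)
sizes = rng.uniform(10, 200, size=60)      # third variable, as marker area in points^2
groups = np.repeat(["A", "B", "C"], 20)    # group label for each point

ax = bubble_scatter(x, y, sizes, groups, "x", "y", "Three variables in one scatter plot")
```

In a notebook you would drop the matplotlib.use("Agg") line, or call ax.figure.savefig(...) to write the plot to a file.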

Line chart

When one variable changes markedly as another varies (that is, the two have a high covariance), a line chart is the best way to see the relationship clearly. For example, the chart below makes it clear that the percentage of bachelor’s degrees earned by women in different majors has changed significantly over time.

Drawn as a scatter plot, these points would crowd together and look chaotic, making it hard to see what the data means. A line chart is ideal here, because it essentially shows how the two variables (the percentage of women and time) vary together. Again, different colors can be used to distinguish groups of data.



Percentage of women with bachelor’s degrees (US)

The code is similar to the scatter plot code, with minor parameter changes.

def lineplot(x_data, y_data, x_label="", y_label="", title=""):
    # Create the plot object
    _, ax = plt.subplots()

    # Plot the line, set the linewidth (lw), color and
    # transparency (alpha) of the line
    ax.plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1)

    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
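To draw several groups on the same axes in different colors, one option is to call ax.plot() once per series. The sketch below assumes a hypothetical helper multi_lineplot that takes a dict mapping a series name to its y values; the two trends are synthetic and do not come from the degree data in the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def multi_lineplot(x_data, series, x_label="", y_label="", title=""):
    """Hypothetical helper: draw one line per entry in `series`."""
    _, ax = plt.subplots()
    for name, y_data in series.items():
        # Each call takes the next color from Matplotlib's default color cycle
        ax.plot(x_data, y_data, lw=2, label=name)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend(loc="best")
    return ax


# Synthetic trends, used only to exercise the function
years = np.arange(1970, 2012)
series = {
    "Field A": np.linspace(10, 45, years.size),
    "Field B": np.linspace(60, 75, years.size),
}
ax = multi_lineplot(years, series, "Year", "Percentage", "One line per group")
```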

Histogram

Histograms are good for viewing (or discovering) the distribution of data. Below is a histogram of the percentage of people at different IQ levels. You can clearly see where the center and the median lie, and that the data follows a normal distribution. Using a histogram rather than a scatter plot clearly shows the relative differences between the frequencies of the bins. Binning (discretizing the data) also helps you see the bigger picture: plotting every raw data point without binning would add a lot of noise and make the true distribution hard to see.



Normally distributed IQ

Below is the code to create a histogram with Matplotlib. There are two parameters to note. The first is n_bins, which controls how finely the data is discretized. On the one hand, more bins give more detailed information, but may introduce noise and pull the picture away from the overall distribution. On the other hand, fewer bins give more of a “bird’s eye view” of the data, showing the overall shape without the fine detail. The second parameter, cumulative, is a Boolean that controls whether the histogram is cumulative, that is, whether it approximates the probability density function (PDF) or the cumulative distribution function (CDF).

def histogram(data, n_bins, cumulative=False, x_label = "", y_label = "", title = ""):
    _, ax = plt.subplots()
    # Matplotlib's keyword for the number of bins is `bins`
    ax.hist(data, bins = n_bins, cumulative = cumulative, color = '#539caf')
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
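The effect of the two parameters is easiest to see side by side. The following sketch draws the same synthetic sample (an assumption, generated only for this example) with 10 bins, with 100 bins, and as a cumulative histogram. It also passes density=True to the cumulative panel so that its last bar reaches 1, which is what makes it comparable to a CDF.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, normally distributed sample standing in for IQ scores
rng = np.random.default_rng(42)
iq = rng.normal(loc=100, scale=15, size=5000)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Few bins: smooth overall shape, little detail
counts_coarse, _, _ = axes[0].hist(iq, bins=10, color="#539caf")
axes[0].set_title("10 bins")

# Many bins: more detail, but also more visible noise
counts_fine, _, _ = axes[1].hist(iq, bins=100, color="#539caf")
axes[1].set_title("100 bins")

# Cumulative and normalized: each bar is the fraction of samples
# at or below that bin, so the bars rise from near 0 to 1
cum, _, _ = axes[2].hist(iq, bins=100, cumulative=True, density=True, color="#539caf")
axes[2].set_title("cumulative, density=True")
```

Without density=True the cumulative histogram shows running counts instead of fractions, so its last bar equals the sample size.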

What if you want to compare the distribution of two variables in your data? Some people might think that you have to make two separate histograms and put them side by side for comparison. But actually, there’s a better way: overlay histograms with different transparencies. For example, set the opacity of the uniform distribution to 0.5 so that you can see the normal distribution behind it. In this way, the user can see the distribution of two variables on the same graph.



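The code for the overlaid histogram has to do three things: compute a common range that covers both data sets, derive the bin width from that range and the desired number of bins, and draw both histograms on the same plot, with one of them more transparent so that both distributions remain visible.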

# Overlay 2 histograms to compare them
def overlaid_histogram(data1, data2, n_bins = 0, data1_name="", data1_color="#539caf", data2_name="", data2_color="#7663b0", x_label="", y_label="", title=""):
    # Set the bounds for the bins so that the two distributions are fairly compared
    max_nbins = 10
    data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]
    binwidth = (data_range[1] - data_range[0]) / max_nbins

    if n_bins == 0:
        bins = np.arange(data_range[0], data_range[1] + binwidth, binwidth)
    else:
        bins = n_bins

    # Create the plot
    _, ax = plt.subplots()
    ax.hist(data1, bins = bins, color = data1_color, alpha = 1, label = data1_name)
    ax.hist(data2, bins = bins, color = data2_color, alpha = 0.75, label = data2_name)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'best')
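The key step for a fair comparison is that both histograms use exactly the same bin edges. Here is a minimal sketch of that step on its own. The helper shared_bin_edges is hypothetical, and it uses np.linspace rather than np.arange, which is one way (an assumption of this sketch, not the article's choice) to get an exact number of bins without floating-point step issues.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def shared_bin_edges(data1, data2, n_bins=10):
    """Hypothetical helper: equal-width bin edges spanning both data sets."""
    lo = min(np.min(data1), np.min(data2))
    hi = max(np.max(data1), np.max(data2))
    # n_bins bins need n_bins + 1 edges
    return np.linspace(lo, hi, n_bins + 1)


# Synthetic samples, used only to exercise the helper
rng = np.random.default_rng(7)
normal = rng.normal(0, 1, 1000)
uniform = rng.uniform(-3, 3, 1000)

edges = shared_bin_edges(normal, uniform, n_bins=20)

_, ax = plt.subplots()
# Opaque histogram behind, semi-transparent one in front
ax.hist(normal, bins=edges, color="#539caf", alpha=1, label="normal")
ax.hist(uniform, bins=edges, color="#7663b0", alpha=0.5, label="uniform")
ax.legend(loc="best")
```

Passing an array of edges to bins, rather than an integer, stops Matplotlib from computing a separate range for each data set.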

Bar chart

Bar charts are best suited to visualizing categorical data with few categories (fewer than 10). With too many categories, the bars crowd together and the chart becomes messy and hard to read. Bar charts work well for categorical data because the categories are easy to tell apart by the height (or length) of the bars, and each category can be given its own color. There are three types of bar chart: regular, grouped and stacked. Refer to the code for detailed instructions.

The regular bar chart is shown in the figure below. In the barplot() function, x_data holds the positions of the bars on the x-axis, y_data holds the bar heights, and error_data holds the standard deviations, drawn as error bars centered on the top of each bar.



The grouped bar chart is shown in the figure below. It lets us compare multiple categorical variables. In the figure, the first comparison is how the scores vary by group (G1, G2, and so on); the second is between the genders, which are distinguished by color. In the code, y_data_list is a list of sublists, each representing one group. We assign x coordinates to each sublist, loop over the sublists, give each a different color, and draw the grouped bar chart.



Stacked bar charts are suited to visualizing categorical data that contains sub-categories. The chart below shows daily server load as stacked bars. With the servers stacked in different colors, you can compare them and see which servers are working most efficiently and how much load each one carries every day. The code follows the same pattern as the grouped bar chart, looping over each group, only this time each new bar is stacked on top of the previous ones instead of being drawn next to them.



Here is the code for the three types of bar chart:

def barplot(x_data, y_data, error_data, x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Draw bars, position them in the center of the tick mark on the x-axis
    ax.bar(x_data, y_data, color = '#539caf', align = 'center')
    # Draw error bars to show standard deviation, set ls to 'none'
    # to remove line between points
    ax.errorbar(x_data, y_data, yerr = error_data, color = '#297083', ls = 'none', lw = 2, capthick = 2)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)



def stackedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Running total of the bars drawn so far at each x location
    bottom = np.zeros(len(x_data))
    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # The bottom of each bar is the top of everything stacked
        # beneath it (zero for the first category)
        ax.bar(x_data, y_data_list[i], color = colors[i], bottom = bottom, align = 'center', label = y_data_names[i])
        bottom = bottom + np.asarray(y_data_list[i])
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')



def groupedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Total width for all bars at one x location
    total_width = 0.8
    # Width of each individual bar
    ind_width = total_width / len(y_data_list)
    # Left-edge offsets that center each cluster of bars about the x tick mark
    alteration = np.arange(-(total_width/2), total_width/2, ind_width)

    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # Move the bar to the right on the x-axis so it doesn't
        # overlap with previously drawn ones; align = 'edge' makes the
        # offsets refer to the left edge of each bar
        ax.bar(np.asarray(x_data) + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width, align = 'edge')
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
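With three or more layers in a stacked bar chart, each layer has to sit on the running total of all the layers beneath it, not just on the layer before it. The sketch below shows that bookkeeping in isolation. The helper stacked_bars and the server load numbers are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def stacked_bars(x_data, y_data_list, labels):
    """Hypothetical helper: stack any number of layers using a running total."""
    _, ax = plt.subplots()
    bottom = np.zeros(len(x_data))
    for y_data, label in zip(y_data_list, labels):
        # Each layer starts where the stack so far ends
        ax.bar(x_data, y_data, bottom=bottom, align="center", label=label)
        bottom = bottom + np.asarray(y_data, dtype=float)
    ax.legend(loc="upper right")
    # `bottom` now holds the total height of each stack
    return ax, bottom


# Made-up daily loads for three servers over five days
days = np.arange(5)
loads = [
    [3, 4, 2, 5, 1],  # server 1
    [1, 2, 2, 1, 3],  # server 2
    [2, 1, 4, 2, 2],  # server 3
]
ax, totals = stacked_bars(days, loads, ["server 1", "server 2", "server 3"])
```

On day 0, the third server's bar starts at 3 + 1 = 4 and the stack ends at 6, which matches the first entry of totals.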

Box plot

The histograms described above are ideal for visualizing the distribution of a variable. But what if you want more information? For example, you may want a clear view of the spread of the data. Or the median may be very different from the mean: is that because there are many outliers, or because the distribution itself is skewed to one side?

A box plot shows all of this. The bottom and top of the box are the first and third quartiles (that is, the 25th and 75th percentiles of the data), and the horizontal line inside the box is the second quartile (the median). The whiskers (the T-shaped lines extending above and below the box) show the range of the data; by default, Matplotlib draws them out to the farthest points within 1.5 times the interquartile range of the box and plots anything beyond them as individual outliers.
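To see where those numbers come from, here is a small worked example that computes the quartiles, the whisker ends and the outliers by hand with NumPy. It assumes the common 1.5 × IQR rule, which is Matplotlib's default, and uses a tiny made-up data set.

```python
import numpy as np

# Tiny made-up data set with one obvious outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])   # 3.25, 5.5, 7.75
iqr = q3 - q1                                         # 4.5

# Fences: points beyond these are treated as outliers
lower_fence = q1 - 1.5 * iqr                          # -3.5
upper_fence = q3 + 1.5 * iqr                          # 14.5

# Whiskers end at the most extreme data points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()   # 1.0, 9.0

outliers = data[(data < lower_fence) | (data > upper_fence)]  # [30.]
```

Note that the mean of this data set is 7.5 while the median is 5.5: the single outlier pulls the mean up, which is exactly the kind of situation a box plot makes visible.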

def boxplot(x_data, y_data, base_color, median_color, x_label="", y_label="", title=""):
    _, ax = plt.subplots()

    # Draw boxplots, specifying desired style
    ax.boxplot(y_data
               # patch_artist must be True to control box fill
               , patch_artist = True
               # Properties of median line
               , medianprops = {'color': median_color}
               # Properties of box
               , boxprops = {'color': base_color, 'facecolor': base_color}
               # Properties of whiskers
               , whiskerprops = {'color': base_color}
               # Properties of whisker caps
               , capprops = {'color': base_color})

    # By default, the tick label starts at 1 and increments by 1 for
    # each box drawn. This sets the labels to the ones we want
    ax.set_xticklabels(x_data)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)

Because a box is drawn for each group or variable, the box plot is very easy to set up. x_data is a list of groups or variables, and each value in x_data corresponds to one column of values (a column vector) in y_data. Matplotlib’s boxplot() function draws one box for each column of y_data, and the remaining arguments set the style of the boxes.
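As a quick illustration of the "one box per column" layout, the sketch below builds a 2-D array with three columns of synthetic data and passes it straight to ax.boxplot(). The column names and the random samples are assumptions made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 200 rows, one column per variable
rng = np.random.default_rng(1)
y_data = np.column_stack([
    rng.normal(0, 1, 200),
    rng.normal(1, 2, 200),
    rng.exponential(1, 200),   # skewed, so expect outliers on the high side
])
x_data = ["normal(0, 1)", "normal(1, 2)", "exponential"]

_, ax = plt.subplots()
# boxplot() draws one box per column and returns a dict of the artists it created
result = ax.boxplot(
    y_data,
    patch_artist=True,  # needed so the boxes can be filled
    medianprops={"color": "#297083"},
    boxprops={"color": "#539caf", "facecolor": "#539caf"},
)
# Boxes sit at x = 1, 2, 3 by default; relabel those ticks with our names
ax.set_xticklabels(x_data)
```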

There you have it: 5 quick and easy data visualizations with Matplotlib! Wrapping plotting calls into functions always makes code easier to write and read. I hope this article was helpful and that you learned something from it. If you liked it, give it a like!


The original article was published on April 23, 2018

Author: Abstract bacteria

This article is from “Big Data Digest”, a partner of the cloud community. Follow “Big Data Digest” for related information.