A Brief introduction to the Pandas style

Kenneth Reitz and Mike Driscoll

Brett Cannon (1)

Brett Cannon (2)

Introduction to the

It is well known that Excel supports cell formatting and minidiagrams. In this article, Chris Moffitt’s Stylin ‘With Pandas How to set DataFrame styles for Pandas

Set the DataFrame columns to different styles, such as semicolons, decimals, currency symbols, date formats, percentages, etc.
Highlight data in a column of the DataFrame that meets certain criteria, for example, displaying the maximum and minimum values in a column in different colors;
Different gradients are used to display the proportion of each line of data to the total data. The larger the proportion, the darker the color, and the smaller the proportion, the lighter the color.
The function similar to Excel mini bar chart is realized. According to the data size, the mini bar chart is displayed in the cell.
Draw mini charts using Sparklines support library and Pandas. You need to install the Sparklines support library using the PIP install Sparklines command.

What is style? Why use styles?

The basic idea behind styles is that users can adjust the presentation of data without affecting the computation of the data itself.

The most intuitive style is the currency symbol. If only one number, such as 25.06, is displayed, it is not clear to the reader whether this number represents US dollars, pounds, yen or RMB, but ￥25.06, everyone knows that this is RMB. Thus, the style can provide more information to the data.

Percentages are also an important form of data styling. 0.12 is not as intuitive as 12%, and it is much clearer and simpler to use percentages to illustrate data.

Pandas supports different styles to display data in a more understandable manner, but the data type is not changed, so the mathematical, date, and string functions provided by Pandas can still be used to manipulate data.

The Pandas style also includes the advanced styles mentioned in the text, such as adding visual elements such as colors and bar charts. This article provides a brief introduction to the basic functions of the Pandas style to improve the readability of data analysis reports.

Data File Description

CSV (click to download). This document quotes the sales data of an enterprise in 2018 invented by the author of the original text Chris, but the title of the data table is Chinese, which is convenient for everyone to identify the information of the data column in the code.
The file format is changed from XLSX to CSV.
The file contains five columns, including account number (string), name (string), item (string), quantity (integer), unit price (2 decimal digits), amount (2 decimal digits), and date (DateTime).

Jupyter Notebook example file, click download.

0.1 Importing Numpy and Pandas

import numpy as np
import pandas as pd
Copy the code

The following code is different from the original one, which uses an XLSX file. This one uses a CSV file and needs to parse the date.

Also note that the data files and ipynb files must be in the same folder.

df = pd.read_csv('2018 Sales Summary Table. CSV ',parse_dates=['date'])
Copy the code

0.2 `df.head()`Preview the data

0.3 Summary data: Calculate the average amount and total consumption of each guest in 2018 by grouping the guests’ names

df.groupby('name') ['value'].agg(['mean'.'sum'])
Copy the code

1.1 Set the amount style: currency symbol, semicolon, 2 decimal digits

You can see that the average shows 6 decimal places, no currency sign, and no semicolons on the whole numbers. In this case, you can use DataFrame style.format.

Format (‘${0:,.2f}’), the $sign in quotes represents the dollar, the comma after the colon (:) represents the thousand, and the.2f represents the 2 decimal places.

(
    df.groupby('name') ['value']
    .agg(['mean'.'sum'])
    .style.format('${0:,.2f}'))Copy the code

After adjustment, the data is displayed as follows:

As you can see, with the.style.format() function, the DataFrame shows data in the mean column with dollar signs, integers with a semicolon, and decimals with only two digits, which makes it easier to understand the concept of amount.

Format (‘${0:,.0f}’) and change the 2 to 0.

(
    df.groupby('name') ['value']
    .agg(['mean'.'sum'])
    .style.format('${0:,.0f}'))Copy the code

At this point, you can see that the monetary data has been rounded to the whole number.

For more examples of Python String formatting, see the Python String Format Cookbook.

1.2 Set date and percentage styles

Next, the groupby function is used to calculate the monthly sales amount and its proportion to the annual total sales by the sales date.

monthly_sales = df.groupby([pd.Grouper(key='date', freq='M'[])'value'].agg(['sum']).reset_index()
monthly_sales['Monthly percentage'] = monthly_sales['sum']/df['value'].sum()
Copy the code

Here you can see that the date is displayed as the last day of the month, and the scale column is displayed as 6 decimal places. This style is obviously not intuitive, but it is not a good method to adjust the style of the data column by column. Can you adjust the style of data column by column at once? Of course, the answer is yes! Use a style dictionary.

format_dict = {'sum': '${0:,.0f}'.'date': '{:%Y-%m}'.'Monthly percentage': '{:.2%}'}
monthly_sales.style.format(format_dict).hide_index()
Copy the code

The above code first defines format_dict as a style dictionary, styling the sum, date, and month columns, respectively. The sum contains the dollar sign, semicolon, and 0 decimal digit. The date is in the classic “month” format, while the month is a percentage with two decimal places.

The second line calls the format_dict set in the first line using style.format() and hides the index with hide_index(). Hiding the index is useful in many cases.

As a result, the DataFrame data display is more tolerable, the data display is clearer, and it is actually very simple to implement.

2. Highlight the maximum and minimum values

Pandas also supports the use of color to highlight conditional data, similar to the conditional format in Excel. In this article, I will show how to highlight the maximum value (light green) and minimum value (red) in a column.

(
    monthly_sales
    .style
    .format(format_dict)
    .hide_index()
    .highlight_max(color='lightgreen')
    .highlight_min(color='red'))Copy the code

In addition to the lightgreen keyword, you can also use ‘#cd4f39’ color code, which is more flexible. You can also set multiple conditions, such as highlight_max and highlight_min.

3. Gradient style

Changing the style of the numbers is not enough. We want to add color gradients to indicate how large a column is. This is something that Excel already provides for Pandas. To do this, use the background_gradient function of style.

(
    monthly_sales.style
    .format(format_dict)
    .background_gradient(subset=['sum'], cmap='BuGn'))Copy the code

Subset =[‘sum’] sets the color gradient for a column, subset=[‘sum’], and the color map uses the CMAP parameter ‘BuGn’, which is a green gradient. See the matplotlib documentation for the values of the color map, and more colors are available.

The display effect is as follows, do you feel more intuitive?

4. Mini bar chart style

In addition to gradients, Pandas supports displaying mini bar charts in DataFrame cells.

(
    monthly_sales
    .style
    .format(format_dict)
    .hide_index()
    .bar(color='#FFA07A', vmin=100_000, subset=['sum'], align='zero')
    .bar(color='lightblue', vmin=0, subset=['Monthly percentage'], align='zero')
    .set_caption('2018 Sales List '))Copy the code

Color still supports color coding and keywords. Vmin is the base value, for example, the base value of sales is 100,000, the base value of percentage is 1, subset is the column targeted, and align represents the alignment.

You can also set the title of a DataFrame with the set_caption function. The code is displayed as follows.

5. Sparklines – Charts

The pattern is not built-in for Pandas, but it is still useful and cool. It comes from a support library called Sparklines. See its documentation for details. This module can add mini charts to Pandas’ DataFrame.

First, import Sparklines;

import sparklines
Copy the code

Next, define functions that call Sparklines;

def sparkline_str(x):
    bins=np.histogram(x)[0]
    sl = ' '.join(sparklines(bins))
    return sl

sparkline_str.__name__ = "Trend chart"
Copy the code

Finally, call the defined sparkline_str function in the groupby function;

df.groupby('name') ['number'.'value'].agg(['mean', sparkline_str])
Copy the code

This summary function can show the data intuitively. What’s more interesting is that the chart is composed of plain text, which is simple and practical, right?

conclusion

Pandas is a popular style function that allows for intuitive and clear analysis of data written by analysts when they have to present the results to others. There are many more styles and functions for Pandas. This article provides a basic guide to getting started with the Pandas style. If you want to play Pandas, read the Pandas official style documentation.

Finally, thanks to Alexas_Fotos for the creative images.