Public account: You and the cabin by: Peter Editor: Peter

Visualization of Plotly

In the previous articles, I explained how to use Plotly to draw bar charts, pie charts, and scatter charts, all of which are common visualization methods. This article introduces one of the most common statistical charts you can draw with Plotly: the box plot.

Further reading

The first eight Plotly visualization articles are:

  • Cool! The advanced visualization tool Plotly_Express
  • Playing with scatter charts in Plotly
  • Playing with pie charts in Plotly
  • Playing with funnel charts in Plotly
  • Playing with bar charts in Plotly
  • Playing with bubble charts in Plotly
  • Playing with stock charts in Plotly
  • Playing with Gantt charts in Plotly

Box plots

What is a box plot

A box plot is a statistical chart that shows how dispersed a group of data is, and it quickly reveals the outliers in the data. It is shaped like a box, hence the name; it is also known as a box-and-whisker plot or boxplot.

In 1977, John W. Tukey, a famous American mathematician, introduced the box plot in his book Exploratory Data Analysis. Hats off to the master!

Quartiles

Quartiles are the most important concept behind box plots, so here is a short introduction.

A quartile is a kind of quantile in statistics: all values are sorted from smallest to largest and divided into four equal parts, and the values at the three division points are the quartiles.

  • First quartile (Q1): also known as the lower quartile; the value at the 25% position when all sample values are sorted in ascending order.
  • Second quartile (Q2): also known as the middle quartile or median; the value at the 50% position when all sample values are sorted in ascending order.
  • Third quartile (Q3): also known as the upper quartile; the value at the 75% position when all sample values are sorted in ascending order.

The difference between Q3 and Q1 is called the interquartile range (IQR): IQR = Q3 - Q1

Quartile calculation

To calculate the quartiles, first calculate the positions of the three quartiles:

# n is the number of samples
Position of Q1 = (n + 1) / 4
Position of Q2 = (n + 1) * 2 / 4
Position of Q3 = (n + 1) * 3 / 4

Here is an example of the position calculation. Take the following 11 unsorted values:

6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36

Arrange the values in ascending order:

6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49

Then the positions of the three quartiles are:

# n = 11 is the number of samples
Position of Q1 = (11 + 1) / 4 = 3
Position of Q2 = (11 + 1) * 2 / 4 = 6
Position of Q3 = (11 + 1) * 3 / 4 = 9

The corresponding quartiles are Q1 = 15, Q2 = 40, Q3 = 43, so IQR = Q3 - Q1 = 28.

If a calculated position is not an integer, that is, n + 1 is not a multiple of 4, the quartile is generally taken as the weighted average of the two values on either side of the position (or simply as their average). The closer a value is to the position, the higher its weight: the value on the left gets a weight of 1 minus the fractional part of the position, and the value on the right gets the fractional part. For example, take the following sample:

2, 3, 4, 5

The position of Q1 is (4 + 1) / 4 = 1.25, which lies between the 1st value (2) and the 2nd value (3) and is closer to 2, so Q1 is:

Q1 = 2 * (1 - 0.25) + 3 * 0.25 = 2.25   # 0.25 is the fractional part of the position

With a plain average instead, Q1 = (2 + 3) / 2 = 2.5
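
The position rule described above can be sketched in a few lines of plain Python. This is only an illustration of the (n + 1) position method with weighted averaging, not a description of how Plotly computes quartiles internally; the helper names quantile_at and quartiles are made up for this example.

```python
def quantile_at(values, fraction):
    """Return the value at position (n + 1) * fraction, counting from 1."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) * fraction
    # Clamp the position so very small samples never index outside the data.
    position = min(max(position, 1), n)
    left = int(position)        # 1-based index of the value on the left
    weight = position - left    # fractional part of the position
    if weight == 0:
        return data[left - 1]
    # Weighted average: the nearer neighbour gets the larger weight.
    return data[left - 1] * (1 - weight) + data[left] * weight


def quartiles(values):
    """Return (Q1, Q2, Q3) for the given values."""
    return tuple(quantile_at(values, f) for f in (0.25, 0.5, 0.75))


sample = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3, q3 - q1)        # 15 40 43 28

print(quartiles([2, 3, 4, 5])[0])  # 2.25
```

Both results match the hand calculations: integer positions pick a sample value directly, while the position 1.25 blends the 1st and 2nd values with weights 0.75 and 0.25.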

Comparison of 4 different box plots

[Figure: comparison of 4 different box plots, from Wikipedia]

What a box plot is used for

  • It gives a rough indication of whether the data is symmetric.
  • It shows the dispersion of the data distribution, which is especially useful when comparing several samples.
  • It reflects the central location and spread of one or more groups of continuous quantitative data.
  • It lets you compare the levels of different categories of data and reveals the dispersion, outliers, and distribution differences among them.

The biggest advantage of the box plot is that its quartiles are barely affected by outliers, so it depicts the spread of the data accurately and stably, and it also helps with data cleaning.
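
One quick way to see this robustness is to compare the mean and the median before and after adding an extreme value. The sketch below reuses the 11 values from the quartile example; the added value 500 is a made-up outlier chosen only for illustration.

```python
from statistics import mean, median

sample = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
with_outlier = sample + [500]  # hypothetical extreme value

# The mean more than doubles, while the median moves by only 0.5.
print(round(mean(sample), 2), median(sample))              # 33.18 40
print(round(mean(with_outlier), 2), median(with_outlier))  # 72.08 40.5
```

Because the box is built from the median and the quartiles, a single extreme value shifts it only slightly and shows up as a separate point instead.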

The data set

The following sections show how to draw box plots for various scenarios. Most of the data used in this article comes from the tips data set (restaurant bills and tips) built into Plotly:

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go

# Tips data set
tips = px.data.tips()
tips.head()

This article draws the charts in two ways:

import plotly.express as px  # 1. px implementation
import plotly.graph_objects as go  # 2. go implementation

Drawing box plots with px

Point-based box plot

Use px.strip() to draw each data point as a marker:

fig = px.strip(
    tips,
    x='day',  # day of the week
    y='total_bill'  # total bill
)

fig.show()

Another px.strip() example:

fig = px.strip(
    tips,
    x='time',  # Lunch or Dinner
    y='tip'  # tip amount
)

fig.show()

Basic box plot

fig = px.box(
    tips,  # data set
    y="total_bill"  # field to draw the box plot for
)

fig.show()

Grouped box plot

Draw a separate box for each category:

fig = px.box(
    tips,  # data set
    y="tip",  # field to plot
    color="time"  # field that determines the color group
)

fig.show()

Here is a grouped box plot that uses more fields:

fig = px.box(
    tips,
    x="day",  # grouping field
    y="total_bill",  # values for the box plot
    color="day"  # color group
)

fig.show()

Box plot with scatter points

Sometimes we want the box plot to show scatter points as well; the scatter points are the raw data. The points parameter accepts four values:

  • 'all': show all points
  • 'outliers': show only the outliers
  • 'suspectedoutliers': show the outliers and highlight the suspected ones
  • False: show no points

fig = px.box(
    tips,
    x="day",
    y="total_bill",
    points="all"   # one of 'all', 'outliers', 'suspectedoutliers', False
)

fig.show()
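
To make the outlier options more concrete, here is a small sketch of the thresholds involved. It assumes the usual Tukey convention (points more than 1.5 x IQR away from the box count as outliers) and the thresholds given in the Plotly reference for suspected outliers (below 4*Q1 - 3*Q3 or above 4*Q3 - 3*Q1). The function name classify_points and the sample values are made up for illustration; Plotly does this classification itself when it draws the figure.

```python
def classify_points(values, q1, q3):
    """Label each value as 'inside', 'outlier' or 'suspected'."""
    iqr = q3 - q1
    # Inner fences: the farthest the whiskers can reach.
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Outer fences: equal to q1 - 3 * iqr and q3 + 3 * iqr.
    outer_low, outer_high = 4 * q1 - 3 * q3, 4 * q3 - 3 * q1
    labels = []
    for v in values:
        if v < outer_low or v > outer_high:
            labels.append((v, "suspected"))
        elif v < inner_low or v > inner_high:
            labels.append((v, "outlier"))
        else:
            labels.append((v, "inside"))
    return labels


# Q1 = 15 and Q3 = 43 give inner fences (-27, 85) and outer fences (-69, 127).
result = classify_points([6, 40, 49, 90, 130], q1=15, q3=43)
print(result)
```

With these fences, 90 falls between the inner and outer fence and is labelled an outlier, while 130 lies beyond the outer fence and is labelled suspected.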

Box plot with a chosen quartile method

There are three methods for computing the quartiles:

  • linear: linear interpolation, the default
  • exclusive: if the sample size is odd, the median is not included in either half; Q1 is the median of the lower half and Q3 is the median of the upper half
  • inclusive: if the sample size is odd, the median is included in both halves; Q1 is the median of the lower half and Q3 is the median of the upper half

fig = px.box(
    tips,
    x="day",
    y="tip",
    color="smoker")

fig.update_traces(quartilemethod="exclusive")  # 'exclusive', 'inclusive' or 'linear' (default)

fig.show()

Comparison of the three quartile methods

Simulate a data set:

data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

pd.DataFrame(dict(
    linear=data,
    inclusive=data,
    exclusive=data
))

The pandas melt method reshapes the data above from a wide table into a long table. Its main parameters are:

  • id_vars: names of the columns to keep as they are
  • value_vars: names of the columns to unpivot; if omitted, all remaining columns are unpivoted
  • var_name and value_name: custom names for the resulting variable and value columns
  • col_level: the level to melt if the columns are a MultiIndex

# Compare the results of the 3 quartile methods

import plotly.express as px
import pandas as pd

data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

df = pd.DataFrame(dict(
    linear=data,
    inclusive=data,
    exclusive=data
)).melt(var_name="quartilemethod")  # convert the wide table to a long table

df
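
The melt call in this article only sets var_name. As a small additional sketch, the example below uses id_vars, value_vars and value_name on a made-up table (the column names student, math, english, subject and score are hypothetical) to show what each parameter does.

```python
import pandas as pd

# Wide table: one row per student, one column per subject.
wide = pd.DataFrame({
    "student": ["A", "B"],
    "math": [90, 80],
    "english": [85, 70],
})

# Long table: one row per (student, subject) pair.
long = wide.melt(
    id_vars=["student"],              # kept as an identifier column
    value_vars=["math", "english"],   # columns turned into rows
    var_name="subject",               # holds the former column names
    value_name="score",               # holds the former cell values
)

print(long)
```

The long format is what px.box expects when one column should drive color or facet_col, which is why the article melts the data before plotting.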

Set the quartile method and the jitter for the trace in each facet column:

fig = px.box(
    df,
    y="value",
    facet_col="quartilemethod",
    color="quartilemethod",
    boxmode="overlay",
    points='all')

# jitter=0 means no jitter, so the points line up in a single column
fig.update_traces(quartilemethod="linear", jitter=0, col=1)
fig.update_traces(quartilemethod="inclusive", jitter=0, col=2)
fig.update_traces(quartilemethod="exclusive", jitter=0, col=3)

fig.show()

Notched box plot

fig = px.box(
    tips,
    x="day",
    y="tip",
    color="smoker",
    notched=True,  # show the notch
    title="Box plot of the tips data set",
    hover_data=["day"]
)

fig.show()

Drawing box plots with go

Basic box plot

import plotly.graph_objects as go

fig = go.Figure(data=[go.Box(
    y=[0, 1, 1, 2, 4, 7, 9, 15, 21],
    boxpoints='all',  # 'all', 'outliers', 'suspectedoutliers' or False
    jitter=0.3,  # add jitter between data points
    pointpos=-1.5  # position of the points relative to the box, range: [-2, 2]
)])

fig.show()

Grouped box plot

np.random.seed(1)  # set the random seed

y1 = np.random.randn(60) - 1  # generate 60 random values
y2 = np.random.randn(60) - 1

fig = go.Figure()

# Add two traces to the figure
fig.add_trace(go.Box(y=y1))
fig.add_trace(go.Box(y=y2))

fig.show()

We can also set the color of each trace:

fig = go.Figure()

# Add two traces to the figure
fig.add_trace(go.Box(
    y=y1,  # values
    name="Figure 1",  # trace name
    marker_color="red"  # color
))

fig.add_trace(go.Box(
    y=y2,
    name="Figure 2",
    marker_color="lightseagreen"
))

fig.show()

With boxmode='group', traces that share the same x categories are drawn side by side:

import plotly.graph_objects as go

x = ['day 1', 'day 1', 'day 1', 'day 1', 'day 1', 'day 1',
     'day 2', 'day 2', 'day 2', 'day 2', 'day 2', 'day 2']

fig = go.Figure()

fig.add_trace(go.Box(
    x=x,
    y=[0.2, 0.2, 0.6, 1.0, 0.5, 0.4, 0.2, 0.7, 0.9, 0.1, 0.5, 0.3],
    name='kale',
    marker_color='#3D0970'
))

fig.add_trace(go.Box(
    x=x,
    y=[0.6, 0.7, 0.3, 0.6, 0.0, 0.5, 0.7, 0.9, 0.5, 0.8, 0.7, 0.2],
    name='radishes',
    marker_color='#0F4136'
))

fig.add_trace(go.Box(
    x=x,
    y=[0.1, 0.3, 0.1, 0.9, 0.6, 0.6, 0.9, 1.0, 0.3, 0.6, 0.8, 0.5],
    name='carrots',
    marker_color='#FA851B'
))

fig.update_layout(
    yaxis_title='numerical value',
    boxmode='group'  # group the boxes of the different traces at each x value
)
fig.show()

Fully styled box plot

import numpy as np
import plotly.graph_objects as go

# X-axis data
x_data = ['Ming', 'little red', 'little weeks', 'note:', 'zhang', 'little su']

N = 80

# Generate the Y-axis data and convert it to integers
# (np.int was removed from NumPy, so the built-in int is used)
y0 = (10 * np.random.randn(N) + 60).astype(int)
y1 = (13 * np.random.randn(N) + 78).astype(int)
y2 = (11 * np.random.randn(N) + 83).astype(int)
y3 = (9 * np.random.randn(N) + 76).astype(int)
y4 = (15 * np.random.randn(N) + 91).astype(int)
y5 = (12 * np.random.randn(N) + 80).astype(int)

y_data = [y0, y1, y2, y3, y4, y5]

# Color settings
colors = ['rgba(93, 164, 214, 0.5)', 'rgba(155, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)',
          'rgba(155, 65, 54, 0.5)', 'rgba(27, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']

fig = go.Figure()

# Add 6 traces, one per student, by iterating over the lists with zip
for xd, yd, cls in zip(x_data, y_data, colors):
    fig.add_trace(go.Box(
        y=yd,  # Y-axis data
        name=xd,  # trace name
        boxpoints='all',  # show all points next to the box
        jitter=0.5,  # jitter distance
        # whiskerwidth=0.2,
        fillcolor=cls,  # fill color
        marker_size=2,  # marker size
        line_width=1  # line width
    ))

# Layout settings
fig.update_layout(
    title='Comparison of results of 6 students',
    yaxis=dict(
        autorange=True,
        showgrid=True,  # show the grid
        zeroline=True,  # show the zero baseline
        dtick=5,
        gridcolor='rgb(255, 255, 255)',  # grid and baseline settings
        gridwidth=1,
        zerolinecolor='rgb(255, 255, 255)',
        zerolinewidth=2,
    ),
    margin=dict(
        l=40,
        r=30,
        b=80,
        t=100,
    ),
    paper_bgcolor='rgb(243, 243, 243)',  # background settings
    plot_bgcolor='rgb(243, 243, 243)',
    showlegend=True  # show the legend
)

fig.show()

Quartiles computed in 3 different ways

The quartiles under the 3 calculation methods:

import plotly.graph_objects as go

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
fig = go.Figure()

fig.add_trace(go.Box(y=data, quartilemethod="linear", name="Linear Quartile"))
fig.add_trace(go.Box(y=data, quartilemethod="inclusive", name="Inclusive Quartile"))
fig.add_trace(go.Box(y=data, quartilemethod="exclusive", name="Exclusive Quartile"))

fig.update_traces(
    boxpoints='all',  # one of 'all', 'outliers', 'suspectedoutliers', False
    jitter=0  # no jitter: the points line up in a single column
)

fig.show()

The resulting figure clearly shows the differences between the three quartile methods.
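
The exclusive and inclusive rules can also be checked by hand. The sketch below follows the description given earlier in this article (split the sorted data at the median, then take the median of each half); it is an illustration in plain Python, not Plotly's own implementation, and the function name quartiles_by_halves is made up. The linear method is not reproduced here.

```python
from statistics import median


def quartiles_by_halves(values, include_median):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even sample size: the two halves are the same under both rules.
        lower, upper = data[:half], data[half:]
    elif include_median:
        # Inclusive: the median belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # Exclusive: the median belongs to neither half.
        lower, upper = data[:half], data[half + 1:]
    return median(lower), median(data), median(upper)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(quartiles_by_halves(data, include_median=False))  # (2.5, 5, 7.5)
print(quartiles_by_halves(data, include_median=True))   # (3, 5, 7)
```

For these nine values the exclusive rule gives a wider box (2.5 to 7.5) than the inclusive rule (3 to 7), and the median is 5 in both cases.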

Horizontal box plot

x1 = np.random.randn(50)
x2 = np.random.randn(50) + 5

fig = go.Figure()
# Passing the values as x instead of y draws the boxes horizontally
fig.add_trace(go.Box(x=x1))
fig.add_trace(go.Box(x=x2))

fig.show()

Grouped horizontal box plot

import plotly.graph_objects as go

y = ['day 1', 'day 1', 'day 1', 'day 1', 'day 1', 'day 1',
     'day 2', 'day 2', 'day 2', 'day 2', 'day 2', 'day 2']

fig = go.Figure()

fig.add_trace(go.Box(
    y=y,
    x=[0.2, 0.2, 0.6, 1.0, 0.5, 0.4, 0.2, 0.7, 0.9, 0.1, 0.5, 0.3],
    name='kale',
    marker_color='#3D0970'
))

fig.add_trace(go.Box(
    y=y,
    x=[0.6, 0.7, 0.3, 0.6, 0.0, 0.5, 0.7, 0.9, 0.5, 0.8, 0.7, 0.2],
    name='radishes',
    marker_color='#0F4136'
))

fig.add_trace(go.Box(
    y=y,
    x=[0.1, 0.3, 0.1, 0.9, 0.6, 0.6, 0.9, 1.0, 0.3, 0.6, 0.8, 0.5],
    name='carrots',
    marker_color='#FA851B'
))

fig.update_layout(
    # xaxis_title='Value',
    xaxis=dict(
        title="Value",
        zeroline=False
    ),
    boxmode='group'  # group the boxes of the different traces at each y value
)

fig.update_traces(orientation='h')  # draw the boxes horizontally
fig.show()

Box plot with mean and standard deviation

import numpy as np
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Box(
    y=np.random.randn(50),
    name='Mean',
    marker_color='mediumblue',
    boxmean=True  # show the mean only
))
fig.add_trace(go.Box(
    y=np.random.randn(50),
    name='Mean and standard deviation',
    marker_color='red',
    boxmean='sd'  # show both the mean and the standard deviation
))

fig.show()

Four ways to display data points

import plotly.graph_objects as go

y_data = [0.75, 5.25, 5.5, 6, 6.2, 6.6, 6.80, 7.0, 7.2, 7.5, 7.5, 7.75, 8.15,
          8.15, 8.65, 8.93, 9.2, 9.5, 10, 10.25, 11.5, 12, 16, 20.90, 22.3, 23.25]

fig = go.Figure()

fig.add_trace(go.Box(
    y=y_data,
    name="All data points",
    jitter=0.3,  # jitter distance
    pointpos=-1.8,  # position of the points relative to the box
    boxpoints='all',  # show all data points
    marker_color='rgb(7,40,89)',
    line_color='rgb(7,40,89)'
))

fig.add_trace(go.Box(
    y=y_data,
    name="Only whiskers",
    boxpoints=False,  # no data points, only the whiskers
    marker_color='rgb(109,56,125)',
    line_color='rgb(9,56,125)'
))

fig.add_trace(go.Box(
    y=y_data,
    name="Suspected outliers",
    boxpoints='suspectedoutliers',  # show outliers and highlight suspected ones
    marker=dict(
        color='rgb(8,81,156)',
        outliercolor='rgba(219, 64, 82, 0.6)',
        line=dict(
            outliercolor='rgba(219, 64, 82, 0.6)',
            outlierwidth=2)),
    line_color='rgb(8,81,156)'
))

fig.add_trace(go.Box(
    y=y_data,
    name="Whiskers + outliers",
    boxpoints='outliers',  # show only the outliers
    marker_color='rgb(107,174,214)',
    line_color='rgb(107,174,214)'
))

fig.update_layout(title_text="Box plot with styled outliers")
fig.show()

Rainbow box plot

import plotly.graph_objects as go
import numpy as np

N = 40  # number of boxes
# One HSL color per box, with the hue running from 0 to 360
c = ['hsl(' + str(h) + ',50%' + ',50%)' for h in np.linspace(0, 360, N)]

The plotting code is:

fig = go.Figure(data=[go.Box(
    # Use trigonometric functions to generate the values
    y=3.5 * np.sin(np.pi * i/N) + i/N + (1.5 + 0.5 * np.cos(np.pi*i/N)) * np.random.rand(10),
    marker_color=c[i]
) for i in range(int(N))])

# Layout settings
fig.update_layout(
    # x and y axis settings
    xaxis=dict(showgrid=True,
               zeroline=False,
               showticklabels=False),
    yaxis=dict(zeroline=False,
               gridcolor='white'),
    # Background color settings
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.show()