Box plot, box-and-whisker plot, boxplot: the chart goes by several names, and you may find you are not so sure you understand it. How many lines are there in the picture? What does each line mean? Does the line in the middle represent the arithmetic mean, the median, or the mode?

What is the point of the box plot? What is its practical significance in data analysis?

Next, we will walk step by step through the box plot and the story behind it, starting with the concept.

1. Getting to know the box plot

The box plot is one of the important inventions of John Wilder Tukey. Tukey was born in 1915 in New Bedford, Massachusetts. He earned a master's degree in chemistry at Brown University at the age of 22 and then a doctorate in mathematics at Princeton. Interestingly, he did not go straight into the statistics work that would make his name. During the Second World War he joined the Fire Control Research Office, where much of the weapons-related research turned out to require solving statistical problems first. From then on Tukey's career changed direction, and one of the great statisticians of his generation was in the making.

The biggest advantage of the box plot is that it is built from the median and the quartiles, so it is hardly affected by outliers and describes the spread of the data in a relatively stable way. To say it again, because it matters: the box plot is robust to outliers.
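To make the robustness claim concrete, here is a minimal sketch that compares the mean and standard deviation with the median and interquartile range on a small sample, once with two extreme values (25 and 30) and once without them. The helper name `summarize` is made up for this illustration, and the quartiles come from NumPy's default `np.percentile` interpolation.

```python
import numpy as np

# A small sample with two extreme values, and the same sample without them.
with_outliers = np.array([1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30])
without_outliers = np.array([1, 6, 2, 7, 4, 2, 3, 3, 8])


def summarize(data):
    """Return (mean, std, median, IQR) for a 1-D array (hypothetical helper)."""
    q1, q3 = np.percentile(data, [25, 75])
    return (float(np.mean(data)), float(np.std(data)),
            float(np.median(data)), float(q3 - q1))


stats_with = summarize(with_outliers)
stats_without = summarize(without_outliers)

labels = ("mean", "std", "median", "IQR")
for label, a, b in zip(labels, stats_with, stats_without):
    # The mean and std move a lot; the median and IQR move only a little.
    print(f"{label:>6}: with outliers = {a:.2f}, without = {b:.2f}")
```

The mean and standard deviation shift by several units when the two extreme values are added, while the median and the interquartile range each shift by only one.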

To illustrate, let's draw a plot and read it off. The following Python code was run in a Jupyter notebook under Anaconda:

    %matplotlib inline
    import matplotlib
    from matplotlib import pyplot as plt

    # The font settings are only needed when labels contain Chinese characters.
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    matplotlib.rcParams['font.family'] = 'sans-serif'

    sample = [1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30]
    plt.boxplot(sample)    # draw the box plot of the sample
    plt.title('Box plot')
    plt.show()



Start by taking in what it looks like from the outside. In the middle is a rectangle, which you can think of as the box. There is a line inside the box, and a T-shaped whisker on each side of it, one above and one below. Beyond the whiskers there are also two hollow circles, which not every box plot has. Let's go through these one by one.

2. Five elements of the box plot

One important point needs spelling out, because most people overlook it: to draw a box plot you need to compute quantiles, and quantiles are defined on ordered data. If your data are not in order, sort them first.

2.1 Median

The median is the 1/2 quantile. Sort the data from smallest to largest. If the number of data points n is odd, the median is the value at position (n+1)/2. If n is even, the median is the arithmetic mean of the values at positions n/2 and n/2+1.
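The odd/even position rule can be sketched in a few lines of plain Python. The function name `median_by_position` is made up for this illustration, and the ten-value case simply drops the last element of the sample to show the even branch.

```python
def median_by_position(values):
    """Median via the sorted-position rule (positions are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd n: take the value at position (n + 1) / 2.
        return float(ordered[(n + 1) // 2 - 1])
    # Even n: average the values at positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


sample = [1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30]
median_odd = median_by_position(sample)        # 11 values, odd branch
median_even = median_by_position(sample[:-1])  # 10 values, even branch
print(median_odd, median_even)
```

For the eleven-value sample the sixth sorted value is 4, so the function returns 4.0. With the last value dropped, the fifth and sixth sorted values are 3 and 4, giving 3.5.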

    import numpy as np

    sample = [1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30]
    np.percentile(sample, 50)    # 50th percentile, i.e. the median

The median is 4.0.

2.2 Lower quartile Q1

The first quartile sits at the bottom edge of the box, so it is also called the "lower quartile". Quartiles, as the name says, split the sorted data into four equal parts. Two position rules are currently in use, (n+1)/4 and (n-1)/4; (n+1)/4 is the usual choice, that is, the first quartile is the value at position (n+1)/4. The result may of course be a fraction. When it is, we'll talk about the calculation later.

2.3 Upper quartile Q3

The third quartile sits at the top edge of the box, so it is also called the "upper quartile". Again, quartiles split the sorted data into four equal parts. The two position rules are 3(n+1)/4 and 3(n-1)/4; 3(n+1)/4 is the usual choice, that is, the third quartile is the value at position 3(n+1)/4.
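Here is a minimal sketch of the (n+1)/4 position rule for both quartiles. The function names are made up for this illustration, and the handling of a fractional position (linear interpolation between the two neighbouring values) is an assumption, since the text does not fix one. NumPy's default `np.percentile` uses a different position convention, so its results are shown alongside for comparison.

```python
import numpy as np


def value_at_position(ordered, pos):
    """Value at a 1-based, possibly fractional position in sorted data.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighbouring values.
    """
    n = len(ordered)
    pos = min(max(pos, 1), n)    # clamp the position into [1, n]
    lower = int(pos)             # position just below (or at) pos
    upper = min(lower + 1, n)    # position just above, capped at n
    frac = pos - lower
    return ordered[lower - 1] + frac * (ordered[upper - 1] - ordered[lower - 1])


def quartiles_n_plus_1(values):
    """Q1 and Q3 at positions (n+1)/4 and 3(n+1)/4 of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return (value_at_position(ordered, (n + 1) / 4),
            value_at_position(ordered, 3 * (n + 1) / 4))


sample = [1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30]
q1, q3 = quartiles_n_plus_1(sample)               # positions 3 and 9
q1_frac, q3_frac = quartiles_n_plus_1(sample[:-1])  # positions 2.75 and 8.25
q1_np, q3_np = np.percentile(sample, [25, 75])    # NumPy's default convention
print(q1, q3, q1_frac, q3_frac, q1_np, q3_np)
```

On the eleven-value sample the (n+1)/4 rule lands exactly on the third and ninth sorted values, 2 and 8, while NumPy's default interpolation gives 2.5 and 7.5. Both are legitimate conventions, so it is worth knowing which one your plotting library uses.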

2.4 Inner limits

The two T-shaped whiskers we have seen are the inner limits. The upper whisker extends to the largest data value that does not exceed Q3 + 1.5 × IQR (where IQR = Q3 - Q1), that is, the maximum after the outliers are removed. The lower whisker extends to the smallest data value that is not below Q1 - 1.5 × IQR, that is, the minimum after the outliers are removed.
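The whisker calculation can be sketched as follows. The variable names are chosen for this illustration, and the quartiles are taken from NumPy's default `np.percentile`, which is an assumption about the convention rather than the only possible choice.

```python
import numpy as np

sample = np.array([1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30])

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# The limits beyond which a point counts as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# The whiskers stop at the most extreme data values still inside the limits.
inside = sample[(sample >= lower_limit) & (sample <= upper_limit)]
whisker_low = inside.min()
whisker_high = inside.max()

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: [{lower_limit}, {upper_limit}]")
print(f"whiskers end at {whisker_low} and {whisker_high}")
```

With Q1 = 2.5 and Q3 = 7.5 the limits are -5 and 15, so the whiskers end at the data values 1 and 8 rather than at the limits themselves.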

3. Box plots and outlier cleaning

In the box plot above you can see two circles, which represent two outliers. The most important use of a box plot is identifying outliers, which is very handy for data cleaning. That's all for today.
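One parting sketch: the same 1.5 × IQR rule that draws the circles can be used to separate outliers from the rest of the data. The function name `split_outliers` and its parameter `k` are made up for this illustration, and whether flagged points should actually be dropped depends on your data, so treat this as a starting point rather than a recipe.

```python
import numpy as np


def split_outliers(data, k=1.5):
    """Split data into (kept, outliers) using the k * IQR rule.

    Hypothetical helper; quartiles use NumPy's default interpolation.
    """
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    # A point is an outlier if it lies beyond either limit.
    is_outlier = (data < q1 - k * iqr) | (data > q3 + k * iqr)
    return data[~is_outlier], data[is_outlier]


sample = [1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30]
cleaned, outliers = split_outliers(sample)
print("outliers:", outliers.tolist())
print("cleaned: ", cleaned.tolist())
```

On this sample the two flagged values are 25 and 30, and the remaining nine values are kept in their original order.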
