Author: xiaoyu

Python Data Science

Python data analyst


A histogram is a tool for quickly displaying the distribution of data. It is intuitive, easy to understand, and popular with data enthusiasts. The versions we see most often come from high-level plotting libraries such as Matplotlib and Seaborn.

This article summarizes the ways to draw a histogram in Python, which fall into three broad categories (each section ends with a short summary):

  • Implement the histogram in pure Python, without any third-party libraries
  • Use NumPy to create a histogram that summarizes the data
  • Use Matplotlib, Pandas, and Seaborn to draw a histogram

Let’s take a look at each of these methods.

Implementing a histogram in pure Python

When drawing a histogram in pure Python, the simplest approach is to report how many times each value occurs. A dictionary is well suited to this task. Let’s look at how it is done.

>>> a = (0, 1, 1, 1, 2, 3, 7, 7, 23)

>>> def count_elements(seq) -> dict:
...     """Tally elements from `seq`."""
...     hist = {}
...     for i in seq:
...         hist[i] = hist.get(i, 0) + 1
...     return hist

>>> counted = count_elements(a)
>>> counted
{0: 1, 1: 3, 2: 1, 3: 1, 7: 2, 23: 1}

As you can see, count_elements() returns a dictionary whose keys are the unique values in the input sequence and whose values are the number of times each one appears. The counting is done by hist[i] = hist.get(i, 0) + 1: hist.get(i, 0) returns the current count for i, or 0 if i has not been seen yet.

In fact, the standard library already covers this with collections.Counter, a subclass of Python’s dictionary that overrides its .update() method.

>>> from collections import Counter

>>> recounted = Counter(a)
>>> recounted
Counter({0: 1, 1: 3, 3: 1, 2: 1, 7: 2, 23: 1})

As you can see, the result matches that of our own function. We can confirm that the two results are equal by comparing their items:

>>> recounted.items() == counted.items()
True
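The remark about .update() can be made concrete. In this minimal sketch (the variable names plain and tally are illustrative, not from the original article), a plain dictionary's update() replaces an existing value, whereas Counter.update() adds to the existing counts:

```python
from collections import Counter

plain = {1: 3, 7: 2}
plain.update({1: 1})       # dict.update() overwrites the value stored for key 1

tally = Counter({1: 3, 7: 2})
tally.update([1, 1, 23])   # Counter.update() adds the counts found in the iterable

print(plain)                 # {1: 1, 7: 2}
print(tally[1], tally[23])   # 5 1
```

This is why a Counter can keep absorbing new batches of data without losing the counts it already holds.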

Building on the function above, we can reinvent a small wheel, ascii_histogram(), which uses Python’s string formatting to display the histogram:

def ascii_histogram(seq) -> None:
    """A horizontal frequency-table/histogram plot."""
    counted = count_elements(seq)
    for k in sorted(counted):
        print('{0:5d} {1}'.format(k, '+' * counted[k]))

This function prints the values in ascending order, with each occurrence represented by a plus sign (+). Calling sorted() on a dictionary returns a sorted list of its keys, and counted[k] then looks up the count for each key.

>>> import random
>>> random.seed(1)

>>> vals = [1, 3, 4, 6, 8, 9, 10]
>>> # Each number in `vals` will occur between 5 and 15 times.
>>> freq = (random.randint(5, 15) for _ in vals)

>>> data = []
>>> for f, v in zip(freq, vals):
...     data.extend([v] * f)

>>> ascii_histogram(data)
    1 +++++++
    3 ++++++++++++++
    4 ++++++
    6 +++++++++
    8 ++++++
    9 ++++++++++++
   10 ++++++++++++

In this code, the values in vals are all distinct, and the number of times each one appears is chosen at random between 5 and 15. Passing the resulting data to the function defined above gives a pure-Python rendering of the histogram.

Summary: In pure Python, a frequency table (not a true histogram) can be built directly with collections.Counter.

Implementing a histogram with NumPy

What we built above in pure Python is a simple frequency table. Mathematically, a histogram is a mapping from bins (intervals) to frequencies, and it can be used to estimate the probability density function of a variable. The pure Python version only tallies individual values, so it is not a true histogram.

So let’s upgrade the simple version above. A true histogram first divides the range of the variable into bins (the "boxes"), that is, into intervals, and then counts the number of observations that fall in each one. NumPy’s histogram() function does exactly this, and it is also the basis of the Matplotlib and Pandas functions discussed later.
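The two steps just described (split the range into intervals, then count) can be sketched in pure Python. This is only an illustration: the helper name binned_counts is hypothetical, it assumes the data contain at least two distinct values, and values lying exactly on a bin edge may be assigned differently than NumPy would assign them.

```python
def binned_counts(values, n_bins=10):
    """Count how many values fall into each of `n_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins          # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # Map the value to a bin index; the maximum goes into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


a = (0, 1, 1, 1, 2, 3, 7, 7, 23)
print(binned_counts(a))  # [5, 1, 0, 2, 0, 0, 0, 0, 0, 1]
```

Unlike the frequency table from the previous section, the output always has n_bins entries, including empty bins.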

As an example, consider a sample of floating-point data drawn from a Laplace distribution. This distribution has fatter tails than the normal distribution and is described by two parameters, location and scale:

>>> import numpy as np

>>> np.random.seed(444)
>>> np.set_printoptions(precision=3)

>>> d = np.random.laplace(loc=15, scale=3, size=500)
>>> d[:5]
array([18.406, 18.087, 16.004, 16.221,  7.358])

Because this is a continuous distribution, tallying each individual floating-point value (down to its last decimal place) is not useful: almost every value is unique. Instead, you can divide the data into bins and count the observations in each bin, which is what a real histogram does.

Let’s see how NumPy computes these histogram counts.

>>> hist, bin_edges = np.histogram(d)

>>> hist
array([ 1,  0,  3,  4,  4, 10, 13,  9,  2,  4])

>>> bin_edges
array([ 3.217,  5.199,  7.181,  9.163, 11.145, 13.127, 15.109, 17.091,
       19.073, 21.055, 23.037])

This may not look very intuitive. By default, np.histogram() uses 10 equal-width bins and returns a tuple of (counts, bin edges), as shown above. Note that there is one more edge than there are bins, which the following code confirms.

>>> hist.size, bin_edges.size
(10, 11)
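Since a histogram can serve as an estimate of a probability density function, it may help to see the normalized form as well. The sketch below passes density=True to np.histogram(); the generator seed and the variable names are assumptions made for illustration, so the sample differs from the one printed above.

```python
import numpy as np

rng = np.random.default_rng(444)   # assumed seed, for reproducibility only
d = rng.laplace(loc=15, scale=3, size=500)

density, bin_edges = np.histogram(d, density=True)
widths = np.diff(bin_edges)

# With density=True, bar height times bar width sums to 1 over all bins.
total_area = float((density * widths).sum())
print(round(total_area, 6))  # 1.0
```

The heights are no longer counts: each one is the count divided by the sample size and by the bin width.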

So how does NumPy choose the bins? np.histogram() does it for us, but so far we have not seen how. Let’s take a closer look at what np.histogram() does internally, using the sequence a from earlier as an example.

>>> # Take the minimum and maximum of a (a is a tuple, so use the built-ins)
>>> first_edge, last_edge = min(a), max(a)

>>> n_equal_bins = 10  # NumPy's default number of bins
>>> bin_edges = np.linspace(start=first_edge, stop=last_edge,
...                         num=n_equal_bins + 1, endpoint=True)
...
>>> bin_edges
array([ 0. ,  2.3,  4.6,  6.9,  9.2, 11.5, 13.8, 16.1, 18.4, 20.7, 23. ])

First we take the minimum and maximum of a, then set the default number of bins, and finally use NumPy’s linspace() to compute the bin edges. The result matches what np.histogram() produces for a: the range 0 to 23 is divided into 10 equal parts, so each bin is 23 / 10 = 2.3 wide.
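The claim that these hand-computed edges agree with NumPy's can be checked directly. A small sketch, with a converted to an array for convenience (the names manual_edges and numpy_edges are illustrative):

```python
import numpy as np

a = np.array([0, 1, 1, 1, 2, 3, 7, 7, 23])

# Edges computed by hand, as in the walkthrough above.
manual_edges = np.linspace(a.min(), a.max(), num=10 + 1)

# Edges and counts computed by NumPy with its default of 10 bins.
hist, numpy_edges = np.histogram(a)

print(np.allclose(manual_edges, numpy_edges))  # True
print(hist)                                    # [5 1 0 2 0 0 0 0 0 1]
```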

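Besides np.histogram(), two related functions can produce the same counts: np.bincount() and np.searchsorted(). np.bincount() counts the occurrences of each non-negative integer, which matches np.histogram() when every integer gets its own bin. Let’s look at the code and compare the results.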

>>> bcounts = np.bincount(a)
>>> hist, _ = np.histogram(a, range=(0, max(a)), bins=max(a) + 1)

>>> np.array_equal(hist, bcounts)
True

>>> # Reproducing `collections.Counter`
>>> dict(zip(np.unique(a), bcounts[bcounts.nonzero()]))
{0: 1, 1: 3, 2: 1, 3: 1, 7: 2, 23: 1}
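np.searchsorted() is mentioned above but not demonstrated. The following is a minimal sketch of how it can be combined with np.bincount() to reproduce the counts of np.histogram() for a given set of bin edges. It illustrates the idea and is not a copy of NumPy's internal implementation; the variable names are illustrative.

```python
import numpy as np

a = np.array([0, 1, 1, 1, 2, 3, 7, 7, 23])
edges = np.linspace(a.min(), a.max(), num=11)   # 10 equal-width bins

# side='right' returns, for each value, how many edges are <= that value;
# subtracting 1 turns this into the index of the bin the value falls in.
idx = np.searchsorted(edges, a, side='right') - 1

# np.histogram() treats the last bin as closed on the right, so the maximum
# value belongs to the final bin rather than to a bin past the end.
idx = np.clip(idx, 0, len(edges) - 2)

counts = np.bincount(idx, minlength=len(edges) - 1)
hist, _ = np.histogram(a, bins=edges)

print(counts)                        # [5 1 0 2 0 0 0 0 0 1]
print(np.array_equal(counts, hist))  # True
```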

Summary: To compute a histogram with NumPy, use np.histogram() or np.bincount() directly.

Visualizing histograms with Matplotlib and Pandas

Now that we’ve seen how to build a histogram with Python’s basic tools, let’s see how to do it with more powerful libraries. Matplotlib wraps NumPy’s histogram() and adds richer visualization features.

import matplotlib.pyplot as plt

# An interface to the matplotlib.axes.Axes.hist() method
n, bins, patches = plt.hist(x=d, bins='auto', color='#0504aa',
                            alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('My Very Own Histogram')
plt.text(23, 45, r'$\mu=15, b=3$')
maxfreq = n.max()
# Set the upper limit of the Y-axis
plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)

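Here, bins='auto' lets NumPy estimate a suitable number of bins from the data (based on the Sturges and Freedman-Diaconis estimators) instead of using a fixed count.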
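To see what the automatic choice amounts to, np.histogram_bin_edges() can report the edges each rule would produce without drawing anything. This is a sketch under assumptions: the seed and the names rng and n_bins are illustrative, and the exact bin counts depend on the sample and the NumPy version, so none are quoted here.

```python
import numpy as np

rng = np.random.default_rng(444)   # assumed seed, for reproducibility only
d = rng.laplace(loc=15, scale=3, size=500)

# Number of bins each rule would choose for this sample.
n_bins = {
    rule: len(np.histogram_bin_edges(d, bins=rule)) - 1
    for rule in ('sturges', 'fd', 'auto')
}

for rule, count in n_bins.items():
    print(f'{rule:8s} {count}')
```

The same rule names can be passed as bins= to np.histogram() and plt.hist().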

If you work with Python’s scientific stack, you can also call Pandas’ Series.plot.hist(), which draws the histogram of a Series with Matplotlib, as shown in the code below.

import pandas as pd

# Generate some commute-time data
size, scale = 1000, 10
commutes = pd.Series(np.random.gamma(scale, size=size) ** 1.5)

commutes.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.title('Commute Times for 1,000 items')
# The x-axis holds the binned values, the y-axis the counts
plt.xlabel('Commute Time')
plt.ylabel('Counts')
plt.grid(axis='y', alpha=0.75)

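DataFrame.hist() works similarly, but draws a separate histogram for each column of the DataFrame.

Summary: To plot a histogram with Pandas, use Series.plot.hist() or DataFrame.plot.hist(); with Matplotlib, use matplotlib.pyplot.hist().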
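As noted at the start of this section, Matplotlib's histogram builds on NumPy's. The sketch below checks this by comparing the values returned by Axes.hist() with those of np.histogram() for the same data and bin count. The seed and the non-interactive Agg backend are assumptions made so that the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(444)   # assumed seed, for reproducibility only
d = rng.laplace(loc=15, scale=3, size=500)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(d, bins=20)        # counts, edges, drawn bars
hist, bin_edges = np.histogram(d, bins=20)

print(np.array_equal(n, hist))        # True
print(np.allclose(bins, bin_edges))   # True
plt.close(fig)
```

The only extra return value from Matplotlib is patches, the container of rectangles that were drawn.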

Plotting a kernel density estimate (KDE)

Kernel density estimation (KDE) estimates the probability density function of a random variable and produces a smoothed view of the data.
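To make the idea concrete, here is a hand-rolled Gaussian KDE in NumPy: every sample contributes one small bell curve, and the estimate is their average. This is a sketch, not what Pandas runs internally. The function name gaussian_kde_sketch, the seed, and the use of Scott's rule of thumb for the bandwidth are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


rng = np.random.default_rng(0)     # assumed seed, for reproducibility only
samples = rng.normal(loc=10, scale=4, size=1000)

# Scott's rule of thumb for the bandwidth (an assumption of this sketch).
bandwidth = samples.std(ddof=1) * samples.size ** (-1 / 5)

grid = np.linspace(samples.min() - 5 * bandwidth,
                   samples.max() + 5 * bandwidth, 2000)
density = gaussian_kde_sketch(samples, grid, bandwidth)

# A density must integrate to 1; approximate the integral with a Riemann sum.
area = float(density.sum() * (grid[1] - grid[0]))
print(round(area, 3))  # 1.0
```

A smaller bandwidth gives a wigglier curve that follows the sample closely; a larger one gives a smoother curve.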

With Pandas, you can create a kernel density plot using plot.kde(), which is available on both Series and DataFrame objects. First, we generate two different samples for comparison (both drawn from normal distributions):

>>> # Two normally distributed samples
>>> means = 10, 20
>>> stdevs = 4, 2
>>> dist = pd.DataFrame(
...     np.random.normal(loc=means, scale=stdevs, size=(1000, 2)),
...     columns=['a', 'b'])
>>> dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2)
          a      b
min   -1.57  12.46
max   25.32  26.44
mean  10.12  19.94
std    3.94   1.94

As seen above, we generated two normally distributed samples and compared them using a few descriptive statistics. Next, we draw the histogram and KDE of each column on the same Matplotlib axes by passing a shared ax to plot.kde() and plot.hist().

fig, ax = plt.subplots()
dist.plot.kde(ax=ax, legend=False, title='Histogram: A vs. B')
dist.plot.hist(density=True, ax=ax)
ax.set_ylabel('Probability')
ax.grid(axis='y')
ax.set_facecolor('#d8dcd6')

Summary: To draw a KDE plot with Pandas, use Series.plot.kde() or DataFrame.plot.kde().

A polished alternative: Seaborn

A more advanced visualization tool is Seaborn, a powerful library built on top of Matplotlib. For histograms, Seaborn has distplot(), which plots the histogram of a univariate distribution together with its KDE and is very convenient to use. (In Seaborn 0.11 and later, distplot() is deprecated in favor of histplot() and displot().) Here is the implementation, using d generated above as an example:

import seaborn as sns

sns.set_style('darkgrid')
sns.distplot(d)

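distplot() also has a fit parameter: pass it a distribution from scipy.stats, and Seaborn fits that distribution to the data and draws the fitted PDF over the histogram: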

from scipy import stats

sns.distplot(d, fit=stats.laplace, kde=False)

Notice the slight difference between the two figures. In the first case you are estimating an unknown probability density function (PDF), while in the second case you know the distribution and want to know which parameters better describe the data.
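The second case, where the distribution family is known and only its parameters are estimated, can also be reproduced numerically with SciPy alone. In this sketch the seed and the variable names are assumptions; laplace.fit() returns maximum-likelihood estimates of the location and scale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(444)   # assumed seed, for reproducibility only
d = rng.laplace(loc=15, scale=3, size=500)

# Estimate the two Laplace parameters from the sample.
loc_hat, scale_hat = stats.laplace.fit(d)
print(f'loc ~ {loc_hat:.2f}, scale ~ {scale_hat:.2f}')
```

With 500 observations, the estimates should land close to the true values of 15 and 3 used to generate the data.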

Summary: To draw a histogram with Seaborn, use seaborn.distplot(). Seaborn also has a standalone KDE plot, seaborn.kdeplot().

Other tools in Pandas

Besides the plotting tools, Pandas provides the convenient .value_counts() method, which counts the occurrences of each non-null value and returns the result as a Pandas Series, as shown in the following example:

>>> import pandas as pd

>>> data = np.random.choice(np.arange(10), size=10000,
...                         p=np.linspace(1, 11, 10) / 60)
>>> s = pd.Series(data)

>>> s.value_counts()
9    1831
8    1624
7    1423
6    1323
5    1089
4     888
3     770
2     535
1     347
0     170
dtype: int64

>>> s.value_counts(normalize=True).head()
9    0.1831
8    0.1624
7    0.1423
6    0.1323
5    0.1089
dtype: float64
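For continuous data, where almost every value is unique, .value_counts() also accepts a bins argument that groups the values into equal-width intervals before counting. A short sketch with an assumed seed and illustrative variable names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(444)   # assumed seed, for reproducibility only
s = pd.Series(rng.laplace(loc=15, scale=3, size=500))

# sort=False keeps the intervals in ascending order instead of by frequency.
binned = s.value_counts(bins=10, sort=False)
print(binned)
```

The index of the result is made of intervals rather than single values, so it reads like a small histogram table.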

pandas.cut() is another convenient tool, used to sort data into explicitly specified bins. For example, suppose we have some people’s ages and want to group them into age brackets, as shown in the following example:

>>> ages = pd.Series(
...     [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)  # bin edges
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> groups = pd.cut(ages, bins=bins, labels=labels)

>>> groups.value_counts()
child           6
adult           5
teen            3
military_age    2
preteen         1
dtype: int64

>>> pd.concat((ages, groups), axis=1).rename(columns={0: 'age', 1: 'group'})
    age         group
0     1         child
1     1         child
2     3         child
3     5         child
4     8         child
5    10         child
6    12       preteen
7    15          teen
8    18          teen
9    18          teen
10   19  military_age
11   20  military_age
12   25         adult
13   30         adult
14   40         adult
15   51         adult
16   52         adult
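In the table above, the person aged 10 is labeled child because pandas.cut() closes each interval on the right by default, giving (0, 10], (10, 13], and so on. The sketch below reuses the same data and shows how the right=False option shifts values that sit exactly on an edge (the names right_closed and left_closed are illustrative):

```python
import numpy as np
import pandas as pd

ages = pd.Series(
    [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
bins = (0, 10, 13, 18, 21, np.inf)
labels = ('child', 'preteen', 'teen', 'military_age', 'adult')

# Default: intervals are (0, 10], (10, 13], ...
right_closed = pd.cut(ages, bins=bins, labels=labels)
# right=False: intervals are [0, 10), [10, 13), ...
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# The person aged 10 sits on a bin edge, so the label depends on `right`.
print(right_closed[5], left_closed[5])  # child preteen
```

Which convention is right depends on how the brackets are defined, so it is worth checking a boundary value whenever the edges coincide with real data points.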

Besides being easy to use, both operations ultimately run in Cython code, which makes them fast as well.

Summary: Other ways to build a histogram are .value_counts() and pandas.cut().

Which method should I use?

So far, we have seen a number of ways to build a histogram. But what are the strengths and weaknesses of each, and how do you choose between them? No single method solves every problem, so choose according to your situation; the summary at the end of each section above is offered as a reference.

Reference: https://realpython.com/python-histograms/
