Random sampling

The numpy.random module complements Python’s built-in random module by adding functions that efficiently generate samples from many probability distributions, such as the normal distribution and the Poisson distribution.

  • numpy.random.seed(seed=None) Seed the generator.

seed() specifies the integer value from which the random number generation algorithm starts. If the same seed value is used, the same sequence of random numbers is generated every time. If no seed is set, a value is chosen automatically (for example, based on the system time), so the random numbers differ from run to run.

During data preprocessing, new operations are often added or processing strategies changed. If random operations are involved, it is better to fix the random seed so that random variation between runs does not affect the comparison of results.
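The effect of fixing the seed can be checked directly. The sketch below is only an illustration: the seed value 42 and the variable names are arbitrary choices, not part of the original text.

```python
import numpy as np

# Seeding with the same value restarts the generator from the same state,
# so the two draws below are identical.
np.random.seed(42)  # 42 is an arbitrary illustrative seed
first = np.random.rand(3)

np.random.seed(42)
second = np.random.rand(3)
print(np.array_equal(first, second))  # True

# Without re-seeding, the generator keeps advancing and the values differ.
third = np.random.rand(3)
print(np.array_equal(first, third))  # False
```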


Discrete random variables

The binomial distribution

The binomial distribution applies to problems with repeated independent trials, where each trial has only two possible outcomes and the probability of each outcome is the same in every trial. An example is the probability of winning exactly 6 out of 10 rounds of a game.

The binomial probability function expressed in code:

binom.pmf(k) = choose(n, k) * p**k * (1-p)**(n-k)

Mathematical representation of the binomial probability function:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

  • numpy.random.binomial(n, p, size=None) Draw samples from a binomial distribution.

This function draws samples from a binomial distribution: size is the number of samples to draw, n is the number of Bernoulli trials performed, p is the probability of success in each trial, and the return value is the number of successes in n trials.
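As a quick check of the probability function above, the value can be computed term by term and compared with scipy. This is a minimal sketch that assumes a fair game (p = 0.5) for the "6 wins out of 10" case; that probability is an assumption made for illustration.

```python
from math import comb

from scipy import stats

n, p, k = 10, 0.5, 6  # 6 wins in 10 rounds, assuming a 50/50 game

# Probability mass function written out term by term
manual = comb(n, k) * p**k * (1 - p)**(n - k)
library = stats.binom.pmf(k, n, p)

print(manual)                    # 0.205078125
print(round(float(library), 6))  # 0.205078
```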

[example] Nine (n = 9) oil exploration wells are drilled in a field, each with a probability of 0.1 (p = 0.1) of producing oil. What is the probability that all nine wells fail?

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(20200605)
n = 9  # number of trials
p = 0.1  # probability of success in each trial
size = 50000
x = np.random.binomial(n, p, size)
# Alternatively, simulate binomial random variables with scipy:
# y = stats.binom.rvs(n, p, size=size)  # returns a numpy.ndarray
print(np.sum(x == 0) / size)  # 0.3897

plt.hist(x)
plt.xlabel('Random variable: Number of successes')
plt.ylabel('Number of occurrences in the sample')
plt.show()
# binom.pmf returns an array in which each element is the probability of the corresponding value of the random variable
s = stats.binom.pmf(range(10), n, p)
print(np.around(s, 3))
# [0.387 0.387 0.172 0.045 0.007 0.001 0.    0.    0.    0.   ]

[example] A coin is tossed twice. What is the probability of getting heads both times?

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(20200605)
n = 2  # number of trials: two coin flips
p = 0.5  # probability of heads on each flip
size = 50000
x = np.random.binomial(n, p, size)
# Alternatively, simulate binomial random variables with scipy:
# y = stats.binom.rvs(n, p, size=size)  # returns a numpy.ndarray
print(np.sum(x == 0) / size)  # 0.25154
print(np.sum(x == 1) / size)  # 0.49874
print(np.sum(x == 2) / size)  # 0.24972

plt.hist(x, density=True)
plt.xlabel('Random variable: the number of heads.')
plt.ylabel('Density (50,000 samples)')  # density=True normalizes the histogram
plt.show()
# binom.pmf returns an array in which each element is the probability of the corresponding value of the random variable
s = stats.binom.pmf(range(n + 1), n, p)
print(np.around(s, 3))
# [0.25 0.5  0.25]

Expectation: E(x) = np
Variance: Var(x) = np(1-p)
Both can be computed with stats.binom.stats(n, p, loc=0, moments='mv').

Poisson distribution

The Poisson distribution is mainly used to estimate the probability that an event occurs a given number of times within a fixed period of time.

The Poisson probability function expressed in code:

poisson.pmf(k) = exp(-lam) * lam**k / k!

Mathematical representation of the Poisson probability function:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

  • numpy.random.poisson(lam=1.0, size=None) Draw samples from a Poisson distribution.

This function draws samples from a Poisson distribution: size is the number of samples to draw, lam is the average number of events per unit interval, and the return value is the number of events that occur in one unit interval.

Given that an airline’s reservations office receives an average of 42 calls per hour, what is the probability that it receives exactly six calls in a 10-minute period?

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(20200605)
lam = 42 / 6  # average number of booking calls per 10 minutes
size = 50000
x = np.random.poisson(lam, size)
# Alternatively: x = stats.poisson.rvs(lam, size=size)
print(np.sum(x == 6) / size)  # 0.14988

plt.hist(x)
plt.xlabel('Random variable: number of booking calls per 10 minutes')
plt.ylabel('Number of occurrences in 50,000 samples')
plt.show()
# Use poisson.pmf(k, mu) to compute the exact probability from the distribution:
x = stats.poisson.pmf(6, lam)
print(x)  # 0.14900277967433773
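The exact value in the airline example can also be reproduced from the Poisson probability function itself. This is a small sketch, assuming Python 3.8+ and scipy are available; the names manual and library are illustrative only.

```python
from math import exp, factorial

from scipy import stats

lam = 42 / 6  # average calls per 10-minute window
k = 6         # exactly six calls

# exp(-lam) * lam**k / k!, written out directly
manual = exp(-lam) * lam**k / factorial(k)
library = stats.poisson.pmf(k, lam)

print(round(manual, 4), round(float(library), 4))  # 0.149 0.149
```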

Hypergeometric distribution

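In the hypergeometric distribution, the trials are not independent and the probability of success changes from draw to draw, because sampling is done without replacement.

Mathematical representation of the hypergeometric probability function, where M is the population size, n is the number of success elements in the population, N is the number of elements drawn, and k is the number of successes among them:

$$P(X = k) = \frac{\binom{n}{k}\binom{M-n}{N-k}}{\binom{M}{N}}$$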

  • numpy.random.hypergeometric(ngood, nbad, nsample, size=None) Draw samples from a Hypergeometric distribution.

size is the number of samples to draw, ngood is the number of elements in the population marked as successes, nbad is the number of elements not marked as successes, ngood + nbad is the population size, and nsample is the number of elements drawn (less than or equal to the population size). The return value is the number of success elements among the nsample elements drawn.

[example] Out of a total of 20 animals, 7 are dogs. If 12 animals are drawn without replacement, what is the probability that exactly 3 of them are dogs?

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(20200605)
size = 500000
x = np.random.hypergeometric(ngood=7, nbad=13, nsample=12, size=size)
# Alternatively: x = stats.hypergeom.rvs(M=20, n=7, N=12, size=size)
print(np.sum(x == 3) / size)  # 0.198664

plt.hist(x, bins=8)
plt.xlabel('Number of dogs')
plt.ylabel('Number of occurrences in 500,000 samples')
plt.title('Hypergeometric distribution',fontsize=20)
plt.show()

# M is the population size, n is the number of success elements in the population,
# N is the number of elements drawn, and k is the number of successes among the N drawn.
x = range(8)
# Use hypergeom.pmf(k, M, n, N, loc) to compute the probability of k successes
s = stats.hypergeom.pmf(k=x, M=20, n=7, N=12)
print(np.round(s, 3))
# [0. 0.004 0.048 0.199 0.358 0.286 0.095 0.01]
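The probability for the dog example can also be obtained by counting combinations. The sketch below assumes scipy's naming (M for the population, n for the successes in it, N for the draws); the variable names manual and library are illustrative.

```python
from math import comb

from scipy import stats

M, n, N, k = 20, 7, 12, 3  # population, dogs, animals drawn, dogs wanted

# Ways to pick k dogs and N-k non-dogs, divided by all ways to pick N animals
manual = comb(n, k) * comb(M - n, N - k) / comb(M, N)
library = stats.hypergeom.pmf(k, M, n, N)

print(round(manual, 4), round(float(library), 4))  # 0.1987 0.1987
```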

Mean and variance of the hypergeometric distribution:

Mean: E(x) = N(n/M)
Variance: Var(x) = N(n/M)(1-n/M)((M-N)/(M-1))

Note: for a hypergeometric distribution with N draws, let p = n/M. When the population size M is large enough, (M-N)/(M-1) is approximately 1, so the expectation is Np and the variance is approximately Np(1-p).

# Use stats(M, n, N, loc=0, moments='mv') to compute the mean and variance
stats.hypergeom.stats(20, 7, 12, moments='mv')
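The formulas for the mean and variance can be compared with the values scipy reports. A minimal sketch for the dog example (M = 20, n = 7, N = 12), with illustrative variable names:

```python
from scipy import stats

M, n, N = 20, 7, 12
p = n / M

mean_formula = N * p
var_formula = N * p * (1 - p) * (M - N) / (M - 1)

mean, var = stats.hypergeom.stats(M, n, N, moments='mv')
print(round(mean_formula, 4), round(float(mean), 4))  # 4.2 4.2
print(round(var_formula, 4), round(float(var), 4))    # 1.1495 1.1495
```

Here the population is small, so the correction factor (M-N)/(M-1) = 8/19 is far from 1 and the variance is well below Np(1-p) = 2.73.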

Continuous random variables

Uniform distribution

  • numpy.random.uniform(low=0.0, high=1.0, size=None) Draw samples from a uniform distribution.

Samples are uniformly distributed over the half-open interval [low, high) (includes low, but excludes high). In other words, any value within the given interval is equally likely to be drawn by uniform.

[example] Generate size uniformly distributed random numbers in the interval from low to high.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(20200614)
a = 0
b = 100
size = 50000
x = np.random.uniform(a, b, size=size)
print(np.all(x >= 0))  # True
print(np.all(x < 100))  # True
y = (np.sum(x < 50) - np.sum(x < 10)) / size
print(y)  # 0.40144

plt.hist(x, bins=20)
plt.show()

a = stats.uniform.cdf(10, 0, 100)
b = stats.uniform.cdf(50, 0, 100)
print(b - a)  # 0.4

As a special case of uniform(), rand() returns uniformly distributed random numbers in [0, 1).

  • numpy.random.rand(d0, d1, ... , dn) Random values in a given shape.

Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).

[example] Generate uniformly distributed random numbers in [0, 1) with the specified shape.

import numpy as np

np.random.seed(20200614)
print(np.random.rand())
# 0.7594819171852776

print(np.random.rand(5))
# [0.75165827 0.16552651 0.0538581 0.46671446 0.89076925]

print(np.random.rand(4, 3))
# [[0.10073292 0.14624784 0.40273923]
#  [0.21844459 0.22226682 0.37246217]
#  [0.50334257 0.01714939 0.47780388]
#  [0.08755349 0.86500477 0.70566398]]

np.random.seed(20200614)
print(np.random.uniform())  # 0.7594819171852776
print(np.random.uniform(size=5))
# [0.75165827 0.16552651 0.0538581 0.46671446 0.89076925]

print(np.random.uniform(size=(4, 3)))
# [[0.10073292 0.14624784 0.40273923]
#  [0.21844459 0.22226682 0.37246217]
#  [0.50334257 0.01714939 0.47780388]
#  [0.08755349 0.86500477 0.70566398]]

As another special case of the uniform distribution, randint() returns uniformly distributed random integers in [low, high).

  • numpy.random.randint(low, high=None, size=None, dtype='l') Return random integers from low (inclusive) to high (exclusive).

Return random integers from the “discrete uniform” distribution of the specified dtype in the “half-open” interval [low, high). If high is None (the default), then results are from [0, low).

In other words, if high is given, the values are random integers in [low, high); otherwise, they are random integers in [0, low).

import numpy as np

np.random.seed(20200614)
x = np.random.randint(2, size=10)
print(x)
# [0 0 0 1 0 1 0 0 0 0]

x = np.random.randint(1, size=10)
print(x)
# [0 0 0 0 0 0 0 0 0 0]

x = np.random.randint(5, size=(2, 4))
print(x)
# [[3 3 0 1]
#  [1 1 0 1]]

x = np.random.randint(1, 10, [3, 4])
print(x)
# [[2 1 7 7]
#  [7 2 4 6]
#  [8 7 2 8]]
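The half-open interval is easy to get wrong, so a quick check can help. In this sketch the seed and the bounds low = 1, high = 4 are arbitrary illustrative choices; the point is that high itself is never returned.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, for repeatability only
low, high = 1, 4
x = np.random.randint(low, high, size=10000)

print(x.min(), x.max())  # 1 3
print(np.unique(x))      # [1 2 3]
```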

Normal distribution

Mathematical representation of the standard normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

  • numpy.random.randn(d0, d1, ... , dn) Return a sample (or samples) from the “standard normal” distribution.

Generate an array of the specified shape whose values follow the standard normal distribution (mean 0, standard deviation 1).

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(20200614)
size = 50000
x = np.random.randn(size)
y1 = (np.sum(x < 1) - np.sum(x < -1)) / size
y2 = (np.sum(x < 2) - np.sum(x < -2)) / size
y3 = (np.sum(x < 3) - np.sum(x < -3)) / size
print(y1)  # 0.68596
print(y2)  # 0.95456
print(y3)  # 0.99744

plt.hist(x, bins=20)
plt.show()

y1 = stats.norm.cdf(1) - stats.norm.cdf(-1)
y2 = stats.norm.cdf(2) - stats.norm.cdf(-2)
y3 = stats.norm.cdf(3) - stats.norm.cdf(-3)
print(y1)  # 0.6826894921370859
print(y2)  # 0.9544997361036416
print(y3)  # 0.9973002039367398

You can also specify the parameters of the distribution, such as mu and sigma for the Gaussian distribution.

  • numpy.random.normal(loc=0.0, scale=1.0, size=None) Draw random samples from a normal (Gaussian) distribution.

normal() creates an array of the given size with mean loc (mu) and standard deviation scale (sigma). It is equivalent to:

sigma * np.random.randn(...) + mu

[example]

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(20200614)
x = 0.5 * np.random.randn(2, 4) + 5
# Alternatively, with scipy (requires `from scipy import stats`):
# x = 0.5 * stats.norm.rvs(size=(2, 4)) + 5
print(x)
# [[5.39654234 5.4088702 5.49104652 4.95817289]
# [4.31977933 4.76502391 4.70720327 4.36239023]]

np.random.seed(20200614)
mu = 5  # mean
sigma = 0.5  # standard deviation
x = np.random.normal(mu, sigma, (2, 4))
print(x)
# [[5.39654234 5.4088702 5.49104652 4.95817289]
# [4.31977933 4.76502391 4.70720327 4.36239023]]

size = 50000
x = np.random.normal(mu, sigma, size)

print(np.mean(x))  # 4.996403463175092
print(np.std(x, ddof=1))  # sample standard deviation
# ddof: int, optional. Delta Degrees of Freedom. The divisor used in
# calculations is N - ddof, where N is the number of elements.
# By default ddof is zero.

plt.hist(x, bins=20)
plt.show()
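The role of ddof can be seen on a tiny array where the arithmetic is easy to follow. This is an illustrative sketch; the four values are made up for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = x.size
squared_dev = np.sum((x - x.mean()) ** 2)  # 5.0

# ddof=0 (default): divide by n, the population standard deviation
print(np.isclose(np.std(x), np.sqrt(squared_dev / n)))                # True
# ddof=1: divide by n - 1, the sample standard deviation
print(np.isclose(np.std(x, ddof=1), np.sqrt(squared_dev / (n - 1))))  # True
```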

Exponential distribution

The exponential distribution describes the time interval between events.

Mathematical representation of the exponential distribution:

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$

  • numpy.random.exponential(scale=1.0, size=None) Draw samples from an exponential distribution.

scale = 1/lambda

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(20200614)
lam = 7
size = 50000
x = np.random.exponential(1 / lam, size)
# Alternatively: x = stats.expon.rvs(loc=0, scale=1 / lam, size=size, random_state=None)
y1 = (np.sum(x < 1 / 7)) / size
y2 = (np.sum(x < 2 / 7)) / size
y3 = (np.sum(x < 3 / 7)) / size
print(y1)  # 0.63218
print(y2)  # 0.86518
print(y3)  # 0.95056

plt.hist(x, bins=20)
plt.show()

y1 = stats.expon.cdf(1 / 7, scale=1 / lam)
y2 = stats.expon.cdf(2 / 7, scale=1 / lam)
y3 = stats.expon.cdf(3 / 7, scale=1 / lam)
print(y1)  # 0.6321205588285577
print(y2)  # 0.8646647167633873
print(y3)  # 0.950212931632136
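The cumulative probabilities above follow from the closed form of the exponential CDF, 1 - exp(-lambda * t). The sketch below evaluates it at the same three points and compares it with scipy; the names ts, manual and library are illustrative.

```python
import numpy as np
from scipy import stats

lam = 7
ts = np.array([1, 2, 3]) / lam  # the points 1/7, 2/7, 3/7

manual = 1 - np.exp(-lam * ts)                # closed-form CDF
library = stats.expon.cdf(ts, scale=1 / lam)  # scale = 1/lambda

print(np.round(manual, 4))           # [0.6321 0.8647 0.9502]
print(np.allclose(manual, library))  # True
```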


Other random functions

Randomly selecting elements from a sequence

  • numpy.random.choice(a, size=None, replace=True, p=None) Generates a random sample from a given 1-D array.

choice() draws elements from a sequence. If a is an integer, the values are drawn from np.arange(a); if a is an array, the values are drawn from the elements of a. The replace argument controls whether the same element may be drawn more than once, and p gives the probability of selecting each element.

[example]

import numpy as np

np.random.seed(20200614)
x = np.random.choice(10, 3)
print(x)  # [2 0 1]

x = np.random.choice(10, 3, p=[0.05, 0, 0.05, 0.9, 0, 0, 0, 0, 0, 0])
print(x)  # [2 3 3]

x = np.random.choice(10, 3, replace=False, p=[0.05, 0, 0.05, 0.9, 0, 0, 0, 0, 0, 0])
print(x)  # [3 0 2]

aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
x = np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
print(x)  # ['pooh' 'rabbit' 'pooh' 'pooh' 'pooh']

np.random.seed(20200614)
x = np.random.randint(0, 10, 3)
print(x)  # [2 0 1]
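Two properties of replace and p can be verified without relying on particular random values. This is a hedged sketch: the seed is arbitrary and the probability vector is the one from the example above.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed
p = [0.05, 0, 0.05, 0.9, 0, 0, 0, 0, 0, 0]

# Values with probability 0 are never drawn.
x = np.random.choice(10, 1000, p=p)
print(np.unique(x))  # [0 2 3]

# Without replacement, all drawn values are distinct. Only three values
# have non-zero probability here, so all three must appear.
y = np.random.choice(10, 3, replace=False, p=p)
print(sorted(y.tolist()))  # [0, 2, 3]
```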

Shuffle the data set

Data is usually stored in the order in which it was collected, but many machine learning algorithms assume that samples are independent of one another, so the data set usually needs to be shuffled first.

  • numpy.random.shuffle(x) Modify a sequence in-place by shuffling its contents.

This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remains the same.

shuffle() reorders x in place. If x is a multi-dimensional array, it is shuffled only along axis 0. The original array is modified and the function returns None.

[example] Shuffle arrays in place, like shuffling a deck of cards: the order changes but the elements stay the same.

import numpy as np

np.random.seed(20200614)
x = np.arange(10)
np.random.shuffle(x)
print(x)
# [6 8 7 5 3 9 1 4 0 2]

print(np.random.shuffle([1, 4, 9, 12, 15]))
# None

x = np.arange(20).reshape((5, 4))
print(x)
# [[0 1 2 3]
# [4 5 6 7]
# [8 9 10 11]
# [12 13 14 15]
# [16 17 18 19]]

np.random.shuffle(x)
print(x)
# [[4 5 6 7]
# [0 1 2 3]
# [8 9 10 11]
# [16 17 18 19]
# [12 13 14 15]]

  • numpy.random.permutation(x) Randomly permute a sequence, or return a permuted range.

If x is a multi-dimensional array, it is only shuffled along its first index.

The permutation() function, like shuffle(), scrambles data on axis 0, but it does not alter the original array.

[example]

import numpy as np

np.random.seed(20200614)
x = np.arange(10)
y = np.random.permutation(x)
print(y)
# [6 8 7 5 3 9 1 4 0 2]

print(np.random.permutation([1, 4, 9, 12, 15]))
# [4 1 9 15 12]

x = np.arange(20).reshape((5, 4))
print(x)
# [[0 1 2 3]
# [4 5 6 7]
# [8 9 10 11]
# [12 13 14 15]
# [16 17 18 19]]

y = np.random.permutation(x)
print(y)
# [[8 9 10 11]
# [0 1 2 3]
# [12 13 14 15]
# [16 17 18 19]
# [4 5 6 7]]
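When a data set has separate feature and label arrays, both must be shuffled in the same order. One common approach, sketched here with made-up arrays named features and labels, is to draw a permutation of the row indices and apply it to both arrays.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed
features = np.arange(10).reshape((5, 2))  # hypothetical data: row i is [2i, 2i+1]
labels = np.arange(5)                     # hypothetical labels: label i belongs to row i

idx = np.random.permutation(len(labels))  # shuffled row indices
shuffled_features = features[idx]
shuffled_labels = labels[idx]

# Each shuffled row still sits next to its own label.
print(np.array_equal(shuffled_features[:, 0], 2 * shuffled_labels))  # True
# Indexing returned new arrays; the originals are unchanged.
print(np.array_equal(labels, np.arange(5)))  # True
```

Calling shuffle() on each array separately would not work here, because each call draws a different random order.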

Figure: relationships among the binomial, Poisson, exponential, geometric, negative binomial, and gamma distributions. Source: Wu Chen on Zhihu, zhuanlan.zhihu.com/p/32932782



From: DataWhale Community Team Study