In recent years, Bayesian statistics has become a cornerstone of empirical finance. This chapter cannot cover all the concepts in the field. For a general introduction, refer to the textbook by Geweke (2005); for a financially motivated one, refer to Rachev (2008).

13.3.1 Bayes’ formula

The most common interpretation of Bayes' formula in finance is the diachronic interpretation. It states that, as time goes by, we learn new information about variables or parameters of interest, such as the mean return of a time series. Formula 13-5 describes this theorem formally.

Formula 13-5 Bayes' formula

$$p(H|D) = \frac{p(H)\,p(D|H)}{p(D)}$$

Where H represents an event (hypothesis) and D represents the data that may be provided by experiments or the real world [4]. On the basis of these definitions, we get:

  • p(H) is called the prior probability;
  • p(D) is the probability of the data under any hypothesis, called the normalization constant;
  • p(D|H) is the likelihood (i.e., the probability) of the data under hypothesis H;
  • p(H|D) is the posterior probability, i.e., the probability after we have seen the data.

Consider a simple example. There are two boxes, B1 and B2; B1 holds 20 black balls and 70 red balls, while B2 holds 40 black balls and 50 red balls. A ball is picked at random from one of the two boxes, and suppose the ball is black. What are the probabilities of "H1: the ball comes from B1" and "H2: the ball comes from B2"?

Before a ball is picked at random, the two hypotheses are equally likely. But once the ball is known to be black, we have to update the probabilities of the two hypotheses according to Bayes' formula. Consider hypothesis H1:

  • Prior probability: p(H1) = 1/2
  • Normalization constant: p(D) = 1/2 · 20/90 + 1/2 · 40/90 = 1/3
  • Likelihood: p(D|H1) = 20/90

The updated probability of H1 is then p(H1|D) = (1/2 · 20/90) / (1/3) = 1/3.

The result also makes intuitive sense: the probability of drawing a black ball from B2 is twice the probability of drawing one from B1. Accordingly, having drawn a black ball, the updated probability of hypothesis H2 is p(H2|D) = 2/3, twice the updated probability of hypothesis H1.
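These numbers are easy to verify in a few lines of Python. The following is a minimal sketch of Formula 13-5 applied to the box example (the variable names are illustrative, not from the original text):

# Bayes' formula for the two-box example
# B1: 20 black, 70 red balls; B2: 40 black, 50 red balls
p_H1 = p_H2 = 0.5                      # prior probabilities: both boxes equally likely
p_D_H1 = 20 / 90                       # likelihood of a black ball given B1
p_D_H2 = 40 / 90                       # likelihood of a black ball given B2
p_D = p_H1 * p_D_H1 + p_H2 * p_D_H2    # normalization constant, equals 1/3
print(p_H1 * p_D_H1 / p_D)             # posterior p(H1|D) -> 0.333...
print(p_H2 * p_D_H2 / p_D)             # posterior p(H2|D) -> 0.666...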

13.3.2 Bayesian regression

With PyMC3, the Python ecosystem provides a comprehensive package that technically implements Bayesian statistics and probabilistic programming.

Consider the following example based on noisy data around a straight line. [5] First, an ordinary least squares regression (see Chapter 11) is performed on the data set; the results are shown in Figure 13-15:

Figure 13-15 Sample data points and regression lines

In [1]: import numpy as np
        import pandas as pd
        import datetime as dt
        from pylab import mpl, plt

In [2]: plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        np.random.seed(1000)
        %matplotlib inline

In [3]: x = np.linspace(0, 10, 500)
        y = 4 + 2 * x + np.random.standard_normal(len(x)) * 2

In [4]: reg = np.polyfit(x, y, 1)

In [5]: reg
Out[5]: array([2.03384161, 3.77649234])

In [6]: plt.figure(figsize=(10, 6))
        plt.scatter(x, y, c=y, marker='v', cmap='coolwarm')
        plt.plot(x, reg[1] + reg[0] * x, lw=2.0)
        plt.colorbar()
        plt.xlabel('x')
        plt.ylabel('y')

The OLS regression approach yields fixed values for the two parameters of the regression line (intercept and slope). Note that np.polyfit returns the coefficient of the highest-order monomial (in this case, the slope of the regression line) at index 0 and the intercept at index 1. The original parameters 2 and 4 are not perfectly recovered due to the noise contained in the data.
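The index ordering is easy to check on noise-free data; a minimal sketch (an illustration, not from the book):

import numpy as np

reg = np.polyfit([0, 1, 2], [4, 6, 8], deg=1)  # points exactly on the line y = 4 + 2 * x
print(reg)                  # -> [2. 4.]: slope at index 0, intercept at index 1
print(np.polyval(reg, 3))   # evaluate the fitted polynomial at x = 3 -> 10.0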

Bayesian regression with the PyMC3 package assumes that the parameters of interest follow some kind of distribution. For example, consider the equation of the regression line $\hat{y}(x) = \alpha + \beta x$. Suppose we now have the following priors:

  • $\alpha$ is normally distributed, with mean 0 and standard deviation 20;
  • $\beta$ is normally distributed, with mean 0 and standard deviation 10.

For the likelihood, assume a normal distribution with mean $\hat{y}(x)$ and a standard deviation that is uniformly distributed between 0 and 10.
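In compact notation, the assumptions just stated read as follows (a restatement, not from the original text):

$$\alpha \sim \mathcal{N}(0,\, 20^2), \qquad \beta \sim \mathcal{N}(0,\, 10^2), \qquad \sigma \sim \mathcal{U}(0,\, 10),$$
$$y \sim \mathcal{N}\!\left(\hat{y}(x),\, \sigma^2\right) = \mathcal{N}(\alpha + \beta x,\, \sigma^2).$$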

A core element of Bayesian regression is (Markov chain) Monte Carlo (MCMC) sampling [6]. In principle, this works like drawing balls from the boxes many times in the previous example, just in a more systematic and automated fashion.
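Before turning to MCMC proper, the sampling idea itself can be illustrated with the box example: choose a box at random many times, draw a ball, and keep only those draws that yield a black ball; the surviving draws approximate the posterior. A minimal sketch (plain Monte Carlo, not MCMC, and not from the book):

import numpy as np

np.random.seed(1000)
n = 100000
box = np.random.randint(1, 3, n)                 # choose B1 (1) or B2 (2) with equal probability
p_black = np.where(box == 1, 20 / 90, 40 / 90)   # probability of a black ball per chosen box
black = np.random.random(n) < p_black            # simulate the draws
print((box[black] == 1).mean())                  # approximates p(H1|D) = 1/3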

For the technical sampling, three different functions are called:

  • find_MAP() finds a starting point for the sampling algorithm by deriving the local maximum a posteriori point;
  • NUTS() implements the so-called "efficient No-U-Turn Sampler with dual averaging" (NUTS) algorithm for MCMC sampling, given the assumed priors;
  • sample() draws a number of samples, given the starting value from find_MAP() and the optimal step size from the NUTS algorithm.

The above functions are wrapped in a PyMC3 Model object and executed in the with statement:

In [8]: import pymc3 as pm

In [9]: %%time
        with pm.Model() as model:
            alpha = pm.Normal('alpha', mu=0, sd=20)  ❶
            beta = pm.Normal('beta', mu=0, sd=10)  ❶
            sigma = pm.Uniform('sigma', lower=0, upper=10)  ❶
            y_est = alpha + beta * x  ❷
            likelihood = pm.Normal('y', mu=y_est, sd=sigma,
                                   observed=y)  ❸
            start = pm.find_MAP()  ❹
            step = pm.NUTS()  ❺
            trace = pm.sample(100, tune=1000, start=start,
                              progressbar=True)  ❻
        logp = -1067.8, ||grad|| = 60.354: 100%| 28/28 [00:00<00:00, 474.70it/s]
        Only 100 samples in chain.
        Auto-assigning NUTS sampler...
        Initializing NUTS using jitter+adapt_diag...
        Multiprocess sampling (2 chains in 2 jobs)
        NUTS: [sigma, beta, alpha]
        Sampling 2 chains: 100%| 2200/2200 [00:03<00:00, 690.96draws/s]
        CPU times: user 6.2 s, sys: 1.72 s, total: 7.92 s
        Wall time: 20min

In [10]: pm.summary(trace)  ❼
Out[10]:
            mean        sd  mc_error   hpd_2.5  hpd_97.5       n_eff      Rhat
alpha   3.764027  0.174796  0.013177  3.431739  4.070091  152.446951  0.996281
beta    2.036318  0.030519  0.002230  1.986874  2.094008  106.505590  0.999155
sigma   2.010398  0.058663  0.004517  1.904395  2.138187  188.643293  0.998547

In [11]: trace[0]  ❽
Out[11]: {'alpha': 3.9303300798212444,
          'beta': 2.0020264758995463,
          'sigma_interval__': -1.3519315719461853,
          'sigma': 2.0555476283253156}

❶ Define the priors.

❷ Specify the linear regression.

❸ Define the likelihood.

❹ Find the starting value by optimization.

❺ Instantiate the MCMC algorithm.

❻ Draw a number of samples using NUTS.

❼ Show the summary statistics from the sampling.

❽ Estimates from the first sample.

All three estimates are fairly close to the original values (4, 2, 2). However, the procedure yields a large number of estimates. They are best illustrated with a trace plot (see Figure 13-16), which shows the resulting posterior distribution for each parameter as well as the individual estimates per sample. The posterior distributions give an intuitive sense of the uncertainty in the estimates:

In [12]: pm.traceplot(trace, lines={'alpha': 4, 'beta': 2, 'sigma': 2});

Figure 13-16 Posterior distribution and trajectory diagram

Taking just the alpha and beta values from the trace, all the resulting regression lines can be plotted (see Figure 13-17):

In [13]: plt.figure(figsize=(10, 6))
         plt.scatter(x, y, c=y, marker='v', cmap='coolwarm')
         plt.colorbar()
         plt.xlabel('x')
         plt.ylabel('y')
         for i in range(len(trace)):
             plt.plot(x, trace['alpha'][i] + trace['beta'][i] * x)  ❶

❶ Plot a single regression line for each pair of sampled parameter values.

Figure 13-17 Regression lines based on different estimates
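Instead of plotting every sampled regression line, the trace can also be collapsed into point estimates, for instance the posterior means. A short sketch, assuming x, y, and the trace object from In [9] are still available:

# reduce the trace to point estimates via posterior means
alpha_hat = trace['alpha'].mean()
beta_hat = trace['beta'].mean()
print(alpha_hat, beta_hat)  # roughly 3.76 and 2.04, cf. the pm.summary() output
plt.plot(x, alpha_hat + beta_hat * x, 'k--', lw=2.0)  # a single "mean" regression line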

13.3.3 Two financial instruments

Having introduced Bayesian regression with PyMC3 on dummy data, the move to real financial data is straightforward. The example uses financial time series data for two exchange-traded funds, GLD and GDX (see Figure 13-18):

In [14]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                           index_col=0, parse_dates=True)

In [15]: data = raw[['GDX', 'GLD']].dropna()

In [16]: data = data / data.iloc[0]  ❶

In [17]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 2138 entries, 2010-01-04 to 2018-06-29
         Data columns (total 2 columns):
         GDX    2138 non-null float64
         GLD    2138 non-null float64
         dtypes: float64(2)
         memory usage: 50.1 KB

In [18]: data.iloc[-1] / data.iloc[0] - 1  ❷
Out[18]: GDX   -0.532383
         GLD    0.080601
         dtype: float64

In [19]: data.corr()  ❸
Out[19]:          GDX      GLD
         GDX  1.00000  0.71539
         GLD  0.71539  1.00000

In [20]: data.plot(figsize=(10, 6));

❶ Normalize the data to a starting value of 1.

❷ Calculate the relative performance.

❸ Calculate the correlation between the two financial instruments.

Figure 13-18 Normalized prices of GLD and GDX over time

In the following example, the dates of the individual data points are visualized with a scatter plot. To this end, the DatetimeIndex object of the DataFrame is converted to matplotlib dates. Figure 13-19 shows a scatter plot of the time series data, plotting GLD values against GDX values and coloring each data pair differently according to its date: [7]

In [21]: data.index[:3]
Out[21]: DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06'],
                       dtype='datetime64[ns]', name='Date', freq=None)

In [22]: mpl_dates = mpl.dates.date2num(data.index.to_pydatetime())  ❶
         mpl_dates[:3]
Out[22]: array([733776., 733777., 733778.])

In [23]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'));  ❷

❶ Convert the DatetimeIndex object to matplotlib dates.

❷ Define a color bar with customized date formatting.

Figure 13-19 Scatter plot of GLD price and GDX price
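The conversion can also be checked in the opposite direction; a small sketch, assuming mpl_dates from In [22]:

# num2date() inverts the date2num() conversion;
# matplotlib stores dates as floating-point day counts
print(mpl.dates.num2date(mpl_dates[0]))  # -> 2010-01-04 00:00:00+00:00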

Bayesian regression is then implemented on the basis of these two time series. The parameterization is essentially the same as in the previous example with dummy data. Figure 13-20 shows the results of the MCMC sampling procedure, given the assumed prior probability distributions for the three parameters:

In [24]: with pm.Model() as model:
             alpha = pm.Normal('alpha', mu=0, sd=20)
             beta = pm.Normal('beta', mu=0, sd=20)
             sigma = pm.Uniform('sigma', lower=0, upper=50)
             y_est = alpha + beta * data['GDX'].values
             likelihood = pm.Normal('GLD', mu=y_est, sd=sigma,
                                    observed=data['GLD'].values)
             start = pm.find_MAP()
             step = pm.NUTS()
             trace = pm.sample(250, tune=2000, start=start,
                               progressbar=True)
         logp = -1493.7, ||grad|| = 188.29: 100%| 27/27 [00:00<00:00, 1609.34it/s]
         Only 250 samples in chain.
         Auto-assigning NUTS sampler...
         Initializing NUTS using jitter+adapt_diag...
         Multiprocess sampling (2 chains in 2 jobs)
         NUTS: [sigma, beta, alpha]
         Sampling 2 chains: 100%| 4500/4500 [00:09<00:00, 465.07draws/s]
         The estimated number of effective samples is smaller than 200 for some parameters.

In [25]: pm.summary(trace)
Out[25]:
            mean        sd  mc_error   hpd_2.5  hpd_97.5       n_eff      Rhat
alpha   0.913335  0.005983  0.000356  0.901586  0.924714  184.264900  1.001855
beta    0.385394  0.007746  0.000461  0.369154  0.398291  215.477738  1.001570
sigma   0.119484  0.001964  0.000098  0.115305  0.123315         ...       ...

In [26]: fig = pm.traceplot(trace)


Figure 13-20 Posterior distribution and trajectory of GDX and GLD data

Figure 13-21 adds all the resulting regression lines to the previous scatter plot. All the regression lines are quite close to each other:

In [27]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         for i in range(len(trace)):
             plt.plot(data['GDX'],
                      trace['alpha'][i] + trace['beta'][i] * data['GDX'])
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'));

Figure 13-21 Multiple Bayesian regression lines across GDX and GLD data

Figure 13-21 reveals a major shortcoming of the regression method used: it does not take into account changes over time. That is, the most recent data is treated the same as the oldest.

13.3.4 Keep estimates up to date

As noted earlier, the Bayesian approach in finance is generally most useful in its diachronic interpretation, that is, in the sense that new data revealed over time allows for better regressions and estimates.

To incorporate this concept into the current example, assume that the regression parameters are not only random and distributed in some fashion, but that they also "wander" randomly over time. This is the same way in which financial theory generalizes from random variables to stochastic processes (which are essentially ordered sequences of random variables).
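For intuition, a Gaussian random walk is nothing more than the cumulative sum of normally distributed increments; each value wanders randomly away from the previous one. A minimal NumPy sketch with purely illustrative numbers:

import numpy as np

np.random.seed(1000)
steps = np.random.normal(loc=0.0, scale=0.05, size=10)  # normally distributed increments
walk = steps.cumsum()  # a Gaussian random walk: the ordered cumulative sums
print(walk)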

To this end, we define a new PyMC3 model, this time specifying the parameter values as random walks. After specifying the distributions of the random-walk parameters, we can proceed with specifying the random walks for alpha and beta. To make the whole procedure more efficient, 50 data points at a time share common coefficients:

In [28]: from pymc3.distributions.timeseries import GaussianRandomWalk

In [29]: subsample_alpha = 50
         subsample_beta = 50

In [30]: model_randomwalk = pm.Model()
         with model_randomwalk:
             sigma_alpha = pm.Exponential('sig_alpha', 1. / .02, testval=.02)  ❶
             sigma_beta = pm.Exponential('sig_beta', 1. / .02, testval=.02)  ❶
             alpha = GaussianRandomWalk('alpha', sigma_alpha ** -2,
                                        shape=int(len(data) / subsample_alpha))  ❷
             beta = GaussianRandomWalk('beta', sigma_beta ** -2,
                                       shape=int(len(data) / subsample_beta))  ❷
             alpha_r = np.repeat(alpha, subsample_alpha)  ❸
             beta_r = np.repeat(beta, subsample_beta)  ❸
             regression = alpha_r + beta_r * data['GDX'].values[:2100]  ❹
             sd = pm.Uniform('sd', 0, 20)  ❺
             likelihood = pm.Normal('GLD', mu=regression, sd=sd,
                                    observed=data['GLD'].values[:2100])  ❻

❶ Define the priors for the random-walk parameters.

❷ Specify the models for the random walks.

❸ Bring the parameter vectors to interval length (see the np.repeat() sketch after this list).

❹ Define the regression model.

❺ The prior for the standard deviation.

❻ Define the likelihood with mu as the regression result.
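The effect of np.repeat() in step ❸ is to stretch each interval coefficient over the data points of its interval; a toy sketch with 3 intervals of length 4:

import numpy as np

coeffs = np.array([0.9, 1.1, 1.0])  # one coefficient per interval
print(np.repeat(coeffs, 4))         # each value repeated 4 times
# -> [0.9 0.9 0.9 0.9 1.1 1.1 1.1 1.1 1.  1.  1.  1. ]
# in the model: 42 interval values * 50 data points each = 2100 values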

These definitions are somewhat more involved than before because the model uses random walks instead of single random variables. However, the MCMC inference step remains essentially the same. Note, though, that the computational burden increases significantly, since one parameter pair has to be estimated per random-walk interval, 2100 / 50 = 42 pairs in total (instead of just 1 as before):

In [31]: %%time
         import scipy.optimize as sco
         with model_randomwalk:
             start = pm.find_MAP(vars=[alpha, beta],
                                 fmin=sco.fmin_l_bfgs_b)
             step = pm.NUTS(scaling=start)
             trace_rw = pm.sample(250, tune=1000, start=start,
                                  progressbar=True)
         logp = -657: 2%| 82/5000 [00:00<00:08, 550.29it/s]
         Only 250 samples in chain.
         Auto-assigning NUTS sampler...
         Initializing NUTS using jitter+adapt_diag...
         Multiprocess sampling (2 chains in 2 jobs)
         NUTS: [sd, beta, alpha, sig_beta, sig_alpha]
         Sampling 2 chains: 100%| 2500/2500 [02:48<00:00, 8.59draws/s]
         CPU times: user 27.5 s, sys: 3.68 s, total: 31.2 s
         Wall time: 5min 3s

In [32]: pm.summary(trace_rw).head()  ❶
Out[32]:
              mean        sd  mc_error   hpd_2.5  hpd_97.5        n_eff  \
alpha__0  0.673846  0.040224  0.001376  0.592655  0.753034  1004.616544
alpha__1  0.424819  0.041257  0.001618  0.348102  0.509757   804.760648
alpha__2  0.456817  0.057200  0.002011  0.321125  0.553173   800.225916
alpha__3       ...       ...       ...       ...       ...          ...
alpha__4  0.651465  0.057472  0.002197  0.544076  0.761216   978.073246

              Rhat
alpha__0  0.998637
alpha__1  0.999540
alpha__2  0.998075
alpha__3  0.998995
alpha__4  0.998060

❶ Summary statistics per interval (only the first five rows for the alpha parameter are shown).

Figure 13-22 shows a subset of the estimates, illustrating how the regression parameters alpha and beta evolve over time:

In [33]: sh = np.shape(trace_rw['alpha'])  ❶
         sh  ❶
Out[33]: (500, 42)

In [34]: part_dates = np.linspace(min(mpl_dates), max(mpl_dates), sh[1])  ❷

In [35]: index = [dt.datetime.fromordinal(int(date)) for date in part_dates]  ❷

In [36]: alpha = {'alpha_%i' % i: v for i, v in
                  enumerate(trace_rw['alpha']) if i < 20}  ❸

In [37]: beta = {'beta_%i' % i: v for i, v in
                 enumerate(trace_rw['beta']) if i < 20}  ❸

In [38]: df_alpha = pd.DataFrame(alpha, index=index)  ❸

In [39]: df_beta = pd.DataFrame(beta, index=index)  ❸

In [40]: ax = df_alpha.plot(color='b', style='-', legend=False,
                            lw=0.7, figsize=(10, 6))
         df_beta.plot(color='r', style='-.', legend=False,
                      lw=0.7, ax=ax)
         plt.ylabel('alpha/beta');

❶ Get the shape of the object holding the parameter estimates.

❷ Create a list of dates matching the number of intervals.

❸ Collect the relevant parameter time series in two DataFrame objects.

Figure 13-22 Selected parameter estimates over time


Absolute price data and relative return data

The analysis in this section is based on normalized price data. This is for illustration purposes only, because the corresponding graphical results are easier to understand and interpret (they are visually "more appealing"). Real-world financial applications, however, should rely on return data, to ensure the stationarity of the time series data.
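A returns-based variant of the simple pooled regression might look as follows; this is a sketch assuming the data DataFrame from In [15], not part of the original example:

# log returns yield (approximately) stationary series
rets = np.log(data / data.shift(1)).dropna()
reg_rets = np.polyfit(rets['GDX'], rets['GLD'], deg=1)
print(reg_rets)  # slope and intercept of the returns-based OLS regression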

Using the mean alpha and beta values, we can illustrate how the regression is updated over time. Figure 13-23 again shows the data points as a scatter plot. In addition, it shows the 42 regression lines resulting from the mean alpha and beta values. It is obvious that updating over time considerably improves the regression fit for the current/most recent data; in other words, every time period needs its own regression:

In [41]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'))
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         x = np.linspace(min(data['GDX']), max(data['GDX']))
         for i in range(sh[1]):  ❶
             alpha_rw = np.mean(trace_rw['alpha'].T[i])
             beta_rw = np.mean(trace_rw['beta'].T[i])
             plt.plot(x, alpha_rw + beta_rw * x, '-', lw=0.7,
                      color=plt.cm.coolwarm(i / sh[1]))

❶ Draw one regression line for each of the 42 intervals of length 50.

Figure 13-23 Scatter diagram containing time-dependent regression lines (updated estimates)

This concludes the introduction to Bayesian regression. Python provides, with PyMC3, a powerful library for implementing different approaches of Bayesian statistics and probabilistic programming. Bayesian regression in particular has become a popular and important tool in financial econometrics.

This article is excerpted from Python Financial Big Data Analysis 2nd Edition

Python Financial Big Data Analysis, 2nd Edition is divided into five parts with 21 chapters. Part 1 introduces the use of Python in finance, covering the reasons for using Python in the financial industry, Python's infrastructure and tools, and some concrete introductory examples of Python in financial econometrics. Part 2 covers the basics of Python, the well-known Python libraries NumPy and pandas, and object-oriented programming. Part 3 introduces the basic techniques and methods of financial data science, including data visualization, input/output operations, and finance-related mathematical tools. Part 4 introduces the application of Python to algorithmic trading, focusing on common algorithms, including machine learning, deep neural networks, and other artificial-intelligence-related algorithms. Part 5 explains the pricing of options and derivatives based on Monte Carlo simulation, covering the valuation framework, the simulation of financial models, derivatives valuation, and portfolio valuation. The book is suitable for developers in the financial industry who are interested in using Python for big data analysis and processing.