Original link: tecdat.cn/?p=24814

Original source: Tuoduan Data Tribe public account (拓端数据部落公众号)

When it comes to the stock market, there are countless different ways to make money. And it seems that everywhere you go in finance, people tell you that you should learn Python. After all, Python is a popular programming language that is used across all kinds of domains, including data science. A large number of packages are available to help you reach your goals, and many companies use Python to build data-centric applications and the scientific computing that the financial world relies on.

Most importantly, Python lets us take advantage of many trading strategies that would be difficult to analyze by hand or in a spreadsheet. One of those strategies is pairs trading.

Pairs trading

Pairs trading is a mean-reversion strategy with the distinctive advantage of always being hedged against broad market moves. The strategy is based on mathematical analysis.

Here’s how it works. Suppose you have a pair of securities X and Y that have some underlying economic relationship. An example might be two companies producing the same product, or two companies in a supply chain. If we can model this economic connection mathematically, we can trade it.
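One common way to formalize this connection, and essentially what the Engle-Granger procedure used later in this article estimates, is a linear relationship whose residual (the spread) is mean-reverting:

$$Y_t = \beta X_t + \epsilon_t$$

If the spread $\epsilon_t$ is stationary, we can buy the spread (long Y, short $\beta$ units of X) when it is unusually far below its mean and sell it when it is unusually far above, betting on reversion rather than on the direction of the overall market.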

To understand pairs trading, we need to understand three mathematical concepts: stationarity, integration (differencing), and cointegration.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, coint

Stationary/non-stationary

Stationarity is the most common untested assumption in time series analysis. We usually say data are stationary when the parameters of the data-generating process do not change over time. Consider two series, A and B: series A is generated with fixed parameters and is therefore stationary, while the parameters used to generate series B change over time, so B is non-stationary.

We will create a function that generates a data point from a Gaussian (normal) distribution. The probability density of the Gaussian distribution is:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Here $\mu$ is the mean and $\sigma$ is the standard deviation; the square of the standard deviation, $\sigma^2$, is the variance. The rule of thumb says that roughly 68% of the data falls between $\mu - \sigma$ and $\mu + \sigma$, which means the normal sampler is more likely to return values close to the mean than values far from it.
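As a quick sanity check on that rule of thumb, the probability mass of a standard normal distribution within one standard deviation of the mean can be computed with scipy (a small aside, not part of the original code):

from scipy.stats import norm

# Probability that a standard normal draw lands within one standard
# deviation of the mean: approximately 0.6827.
print(norm.cdf(1) - norm.cdf(-1))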


def generate_datapoint(params):
    # Draw one sample from a normal distribution with the given mean and standard deviation
    mu = params[0]
    sigma = params[1]
    return np.random.normal(mu, sigma)

From there, we can create two graphs showing stationary and non-stationary time series.

# Series A: stationary, generated with fixed parameters
T = 100
params = (0, 1)  # constant mean and standard deviation
A = pd.Series(index=range(T), dtype=float)
A.name = 'A'
for t in range(T):
    A[t] = generate_datapoint(params)
A.plot()
plt.show()
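Series B, the non-stationary counterpart, can be generated the same way but with parameters that change over time. The exact drift used in the original is not shown, so the sketch below simply lets the mean grow with t:

# Series B: non-stationary, because the mean of the generating process changes with time
T = 100
B = pd.Series(index=range(T), dtype=float)
B.name = 'B'
for t in range(T):
    # the parameters of the data generation process depend on t
    params = (t * 0.1, 1)
    B[t] = generate_datapoint(params)
B.plot()
plt.show()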

Why is stationarity important

Many statistical tests require that the data being tested be stationary. Using certain statistics on a non-stationary data set can produce garbage results. As an example, let's take the mean of our non-stationary series B.

m = np.mean(B)

plt.figure()
plt.plot(B.index, B.values)
# draw the sample mean as a horizontal dashed line
plt.hlines(m, 0, len(B), linestyles='dashed', colors='r')
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()

The calculated value is the average of all data points, but it is useless for predicting any future state of the series. Compared with any particular point in time it is meaningless, because it is a mash-up of different states at different times. This is just a simple, clear example of why non-stationarity skews analysis; the problems that arise in practice are more subtle.
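To make the "mash-up of states" point concrete, here is a quick check using the B series generated above: the mean over the first half of the sample differs substantially from the mean over the second half, so neither half, nor the overall mean, describes the process at any particular time.

print(np.mean(B.iloc[:T // 2]))   # mean of the first half of B
print(np.mean(B.iloc[T // 2:]))   # mean of the second half of B
print(np.mean(B))                 # overall mean: a blend of different regimes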

Testing for stationarity: the Augmented Dickey-Fuller (ADF) test

To test for stationarity, we need to test for what is called a unit root. The autoregressive unit root test starts from the model

$$x_t = \phi x_{t-1} + \varepsilon_t$$

and the hypothesis test

$$H_0: \phi = 1 \ \ \text{(unit root, difference stationary)} \qquad H_1: |\phi| < 1 \ \ \text{(stationary)}$$

It is called a unit root test because, under the null hypothesis, the autoregressive polynomial $\phi(z) = 1 - \phi z$ has a root equal to 1; under the alternative the series is (trend-)stationary. If $\phi = 1$, taking the first difference gives

$$\Delta x_t = \varepsilon_t$$

which is stationary.

The test statistic is

$$t_{\phi=1} = \frac{\hat{\phi} - 1}{SE(\hat{\phi})}$$

where $\hat{\phi}$ is the least squares estimate and $SE(\hat{\phi})$ is the usual standard error estimate. The test is a one-sided, left-tailed test. If $\{x_t\}$ is stationary, $\hat{\phi}$ is asymptotically normal; under the non-stationary null hypothesis, however, the statistic follows the non-standard Dickey-Fuller distribution instead. The following function uses the Augmented Dickey-Fuller (ADF) test to check a series for stationarity.

def stationarity_test(X, cutoff=0.01):
    # In adfuller, H_0 is that a unit root exists (the series is non-stationary).
    # We need a significant p-value to conclude that the series is stationary.
    pvalue = adfuller(X)[1]
    if pvalue < cutoff:
        print('p-value = ' + str(pvalue) + ' The series ' + X.name + ' is likely stationary.')
    else:
        print('p-value = ' + str(pvalue) + ' The series ' + X.name + ' is likely non-stationary.')
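Running the check on both series (a usage sketch; the exact p-values depend on the random draws above):

stationarity_test(A)
stationarity_test(B)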

Looking at the test statistic of series A (and the corresponding p-value), we can reject the null hypothesis of a unit root, so series A is likely stationary. For series B, on the other hand, the null hypothesis cannot be rejected, so that series is likely non-stationary.

Cointegration

Correlations between financial quantities are notoriously unstable. Nevertheless, correlation is used in almost all multivariate financial problems. An alternative statistical measure of relatedness is cointegration. It is probably a more robust measure of the link between two financial quantities, although so far there is little derivatives theory built on the concept.

Two stocks may be highly correlated over the short term but diverge in the long run, with one rising while the other falls. Conversely, two stocks may track each other and never drift more than a certain distance apart, yet show a correlation that flips between positive and negative. Correlation matters if we trade over the short term, but it tells us little if we hold a position for the long term.
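Cointegration captures the second kind of behaviour. As a worked definition (the standard Engle-Granger form, which is also the test used below), two series are cointegrated when

$$X_t \sim I(1), \quad Y_t \sim I(1), \quad \text{and} \quad u_t = Y_t - \beta X_t \sim I(0)$$

that is, each series on its own behaves like a random walk (is integrated of order one), but some linear combination of the two is stationary and therefore mean-reverting.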

Let's now construct two cointegrated series as an example and plot the difference between them.

# X is a random walk; Y is X plus a constant offset and some noise,
# so Y - X is stationary and the two series are cointegrated
X_returns = np.random.normal(0, 1, 100)
X = pd.Series(np.cumsum(X_returns), name='X')
noise = np.random.normal(0, 1, 100)
Y = X + 6 + noise
Y.name = 'Y'
pd.concat([X, Y], axis=1).plot()
plt.show()

(Y - X).plot()                                               # plot the difference between the two series
plt.axhline((Y - X).mean(), color='red', linestyle='--')     # add the mean
plt.xlabel('Time')
plt.xlim(0, 99)
plt.show()

Cointegration test

Steps of the cointegration test procedure:

  1. Test each component series for a unit root individually, using univariate unit root tests such as the ADF and PP tests.
  2. If the unit root cannot be rejected for the individual series, the next step is to test for a cointegration relationship among the components, that is, to check whether some linear combination of them is I(0).

If we find that each time series has a unit root, we proceed with the cointegration test. There are three main cointegration tests: Johansen, Engle-Granger, and Phillips-Ouliaris. We will mainly use the Engle-Granger test.

Let's consider the regression model

$$Y_t = c + \beta X_t + u_t$$

where $c$ is a deterministic term. The hypotheses are:

$$H_0: u_t \sim I(1) \ \ \text{(no cointegration)} \qquad H_1: u_t \sim I(0) \ \ \text{($X_t$ and $Y_t$ cointegrated, normalized cointegrating vector $(1, -\beta)$)}$$

We then run a unit root test on the estimated residuals $\hat{u}_t$. The test applies to the model

$$\Delta \hat{u}_t = (\rho - 1)\,\hat{u}_{t-1} + \sum_{j=1}^{p} \psi_j \, \Delta \hat{u}_{t-j} + \varepsilon_t$$

with test statistic

$$t_{\rho=1} = \frac{\hat{\rho} - 1}{SE(\hat{\rho})}$$

Now that you know what co-integration of two time series means, we can test it and measure it using Python:

score, pvalue, _ = coint(X, Y)
print(pvalue)

Correlation and cointegration

Correlation and cointegration, while conceptually similar, are not the same thing. To demonstrate this, we first look at two time series that are correlated but not cointegrated.

A simple example is a pair of diverging random walks.

X_returns = np.random.normal(1, 1, 100)
Y_returns = np.random.normal(2, 1, 100)

# Both series trend upward (and are therefore correlated), but they drift apart
X_diverging = pd.Series(np.cumsum(X_returns), name='X')
Y_diverging = pd.Series(np.cumsum(Y_returns), name='Y')

pd.concat([X_diverging, Y_diverging], axis=1).plot()
plt.xlim(0, 99)
plt.show()

Next, we can output the correlation coefficient and the result of the cointegration test.
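A sketch of that check, using the pandas correlation of the two diverging series and the same statsmodels coint test as before:

print('Correlation: ' + str(X_diverging.corr(Y_diverging)))

score, pvalue, _ = coint(X_diverging, Y_diverging)
print('Cointegration test p-value: ' + str(pvalue))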

As we can see, there is a very strong correlation between series X and Y. However, the cointegration test gives a p-value of 0.7092, which means the two series are not cointegrated.

Another example of this situation is a normally distributed series and a square wave.

Y2 = pd.Series(np.random.normal(0, 1, 800), name='Y2') + 20
Y3 = Y2.copy()
# Overwrite Y3 with an alternating square wave around the same level
for i in range(0, 800, 200):
    Y3[i:i + 100] = 30
    Y3[i + 100:i + 200] = 10

plt.figure()
Y2.plot()
Y3.plot()
plt.ylim([0, 40])
plt.show()

# The correlation is almost zero, but the cointegration test p-value is significant
print('Correlation: ' + str(Y2.corr(Y3)))
score, pvalue, _ = coint(Y2, Y3)
print('Cointegration test p-value: ' + str(pvalue))

 

Although the correlation is very low, the p-value indicates that these time series are co-integrated.


import datetime
from pandas_datareader import data as pdr
import fix_yahoo_finance as yf
yf.pdr_override()

Data science in trading

Before we begin, let's define a function that makes it easy to find cointegrated pairs using the concepts we have covered.

def find_cointegrated_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs

We're looking at a group of tech companies to see whether any of them are cointegrated. We'll start by defining the list of securities we want to examine, and then pull pricing data for each security from 2013 to 2018.

As mentioned earlier, we have formed an economic hypothesis that there is some connection between a subset of securities within the tech sector, and we want to test whether any cointegrated pairs exist. This introduces far less multiple-comparison bias than searching hundreds of securities, and only slightly more than forming a hypothesis for a single test.

start = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2018, 1, 1)

# Tickers analysed below; the full list of tech tickers used in the original
# is not shown, so only those mentioned in the text appear here
tickers = ['AAPL', 'ADBE', 'EBAY', 'MSFT', 'SYMC']

df = pdr.get_data_yahoo(tickers, start, end)['Close']
df.tail()
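Note that the fix_yahoo_finance package has since been renamed yfinance, so on a current setup the same download can be sketched like this (an alternative to, not part of, the original code):

import yfinance as yf

df = yf.download(tickers, start='2013-01-01', end='2018-01-01')['Close']
df.tail()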

 

# Heat map of the p-values of the pairwise cointegration tests.
# Only the cells above the diagonal are filled in; everything else is masked.
scores, pvalues, pairs = find_cointegrated_pairs(df)
import seaborn
seaborn.heatmap(pvalues, xticklabels=tickers, yticklabels=tickers,
                cmap='RdYlGn_r', mask=(pvalues >= 0.98))
plt.show()
print(pairs)

Our algorithm lists two cointegrated pairs: AAPL/EBAY and ADBE/MSFT. We can analyze their patterns.


S1 = df['ADBE']
S2 = df['MSFT']
score, pvalue, _ = coint(S1, S2)
print(pvalue)

As we can see, the p-value is less than 0.05, which means ADBE and MSFT do form a cointegrated pair.

Calculating the spread

Now we can plot the spread between these two time series. To actually calculate the spread, we use linear regression to obtain the coefficient of the linear combination of our two securities, as in the Engle-Granger method mentioned earlier.

# Estimate the hedge ratio b with ordinary least squares (Engle-Granger step 1)
S1 = sm.add_constant(S1)
results = sm.OLS(S2, S1).fit()
S1 = S1['ADBE']
b = results.params['ADBE']

spread = S2 - b * S1
spread.plot()
plt.axhline(spread.mean(), color='black')
plt.legend(['Spread'])
plt.show()

Alternatively, we can examine the ratio between two time series

ratio = S1 / S2
ratio.plot()
plt.axhline(ratio.mean(), color='black')
plt.legend(['Ratio'])
plt.show()

Whether we use the spread or the ratio, we can see that the ADBE/MSFT pair tends to move around its mean. We now need to standardize this ratio, because the absolute ratio is not the most useful form for analysis. For that we use the z-score.

The z-score measures how many standard deviations a data point lies above or below the population mean. It is calculated as:

$$z = \frac{x - \mu}{\sigma}$$

def zscore(series):
    return (series - series.mean()) / np.std(series)


zscore(ratio).plot()
plt.axhline(zscore(ratio).mean(), color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Ratio z-score', 'Mean', '+1', '-1'])
plt.show()

By placing two additional lines at z-scores of +1 and -1, we can clearly see that, in most cases, any large deviation from the mean eventually converges back. This is exactly the mean-reverting behaviour a pairs trading strategy needs.

Trading signals

When engaging in any type of trading strategy, it is always important to clearly define and describe the point in time when the trade is actually taking place. For example, what is the best indicator I need to buy or sell a particular stock?

Set up rules

We'll use the ratio time series we created to see whether it tells us to buy or sell at a particular moment. We start by creating a prediction variable Y: if the predicted move in the ratio is positive, the signal is "buy"; otherwise it is "sell".

The beauty of pairs trading signals is that we don't need to forecast absolute price levels; we only need to know which way the ratio will move next: up or down.

Train/test split

When training and testing models, a 70/30 or 80/20 split is typical. A single year of data would give us only 252 points (the number of trading days in a year), so we use the full multi-year ratio series before splitting it into training and test sets.

ratios = df['ADBE'] / df['MSFT'] 
print(len(ratios) * .70 )

train = ratios[:881]
test = ratios[881:]

Feature engineering

We need to figure out which features actually matter for predicting the direction of the ratio's next move. Knowing that the ratio always eventually reverts to its mean, moving averages and other indicators related to the mean should be important.

Let’s try:

  • 60-day moving average of the ratio
  • 5-day moving average of the ratio
  • 60-day standard deviation
  • z-score
ratios_mavg5 = train.rolling(window=5, center=False).mean()
ratios_mavg60 = train.rolling(window=60, center=False).mean()
std_60 = train.rolling(window=60, center=False).std()
zscore_60_5 = (ratios_mavg5 - ratios_mavg60) / std_60

plt.figure(figsize=(15, 7))
plt.plot(train.index, train.values)
plt.plot(ratios_mavg5.index, ratios_mavg5.values)
plt.plot(ratios_mavg60.index, ratios_mavg60.values)
plt.legend(['Ratio', '5d Ratio MA', '60d Ratio MA'])
plt.ylabel('Ratio')
plt.show()

plt.figure(figsize=(15, 7))
zscore_60_5.plot()
plt.axhline(0, color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Rolling ratio z-score', 'Mean', '+1', '-1'])
plt.show()

Create the model

The mean of the standard normal distribution is 0 and its standard deviation is 1. From the figure it is clear that whenever the z-score of the ratio moves more than one standard deviation away from its mean, it tends to revert back. Based on this, we can create the following trading signals:

  • Buy (1) whenever the z-score falls below -1, meaning we expect the ratio to rise.
  • Sell (-1) whenever the z-score rises above 1, meaning we expect the ratio to fall.

Training to optimize

We can use our model on real data


plt.figure(figsize=(15, 7))
train.plot()
buy = train.copy()
sell = train.copy()
# Keep only the points where the signal fires
buy[zscore_60_5 > -1] = 0
sell[zscore_60_5 < 1] = 0
buy[160:].plot(color='g', linestyle='None', marker='^')
sell[160:].plot(color='r', linestyle='None', marker='v')
plt.legend(['Ratio', 'Buy Signal', 'Sell Signal'])
plt.show()

plt.figure(figsize=(18, 9))
S1 = df['ADBE'].iloc[:881]
S2 = df['MSFT'].iloc[:881]
S1[60:].plot(color='b')
S2[60:].plot(color='c')
buyR = 0 * S1.copy()
sellR = 0 * S1.copy()
# When you buy the ratio, you buy S1 and sell S2
buyR[buy != 0] = S1[buy != 0]
sellR[buy != 0] = S2[buy != 0]
# When you sell the ratio, you sell S1 and buy S2
buyR[sell != 0] = S2[sell != 0]
sellR[sell != 0] = S1[sell != 0]
buyR[60:].plot(color='g', linestyle='None', marker='^')
sellR[60:].plot(color='r', linestyle='None', marker='v')
plt.legend(['ADBE', 'MSFT', 'Buy Signal', 'Sell Signal'])
plt.show()

 

Now we can clearly see when we should buy or sell corresponding stocks.

Now, how much can we expect to gain from this strategy?

# Trade using a simple strategy
def trade(S1, S2, window1, window2):
    # If either window length is 0, the algorithm doesn't make sense, so exit
    if (window1 == 0) or (window2 == 0):
        return 0

    # Compute the rolling moving averages, rolling standard deviation and z-score of the ratio
    ratios = S1 / S2
    ma1 = ratios.rolling(window=window1, center=False).mean()
    ma2 = ratios.rolling(window=window2, center=False).mean()
    std = ratios.rolling(window=window2, center=False).std()
    zscore = (ma1 - ma2) / std

    # Simulate trading: start with no money and no positions
    money = 0
    countS1 = 0
    countS2 = 0
    for i in range(len(ratios)):
        # Sell the ratio (sell S1, buy S2) whenever the z-score is > 1
        if zscore.iloc[i] > 1:
            money += S1.iloc[i] - S2.iloc[i] * ratios.iloc[i]
            countS1 -= 1
            countS2 += ratios.iloc[i]
        # Buy the ratio (buy S1, sell S2) whenever the z-score is < -1
        elif zscore.iloc[i] < -1:
            money -= S1.iloc[i] - S2.iloc[i] * ratios.iloc[i]
            countS1 += 1
            countS2 -= ratios.iloc[i]
        # Clear positions when the z-score moves back close to zero
        elif abs(zscore.iloc[i]) < 0.75:
            money += S1.iloc[i] * countS1 + S2.iloc[i] * countS2
            countS1 = 0
            countS2 = 0
    return money

trade(df['ADBE'].iloc[881:], df['MSFT'].iloc[881:], 5, 60)

That's a decent profit for such a simple strategy.

Areas for improvement and further steps

This is by no means a perfect strategy, and ours is not the best possible implementation of it. Still, there are several things that could be improved.

1. Use more securities and more diversified time horizons

For the cointegration tests in our pairs trading strategy, I used only a handful of stocks. Naturally (and in practice) it is more effective to use clusters of stocks within an industry. I also used a time horizon of only five years, which may not be representative of longer-run stock market behaviour.

2. Dealing with overfitting

Anything involving data analysis and model training runs the risk of overfitting. There are many ways to deal with it, from out-of-sample validation to statistical techniques such as Kalman filters.

3. Adjust trading signals

Our trading algorithm does not take into account the actual stock prices overlapping or crossing. The code only decides to buy or sell based on the ratio; it never considers which stock is actually higher or lower.

4. A more advanced approach

This is just the tip of the algorithmic trading iceberg. The approach here is deliberately simple, dealing only with moving averages and ratios. If you want to go further, there are more complex statistical tools to draw on; other advanced topics include the Hurst exponent, the half-life of mean reversion, and Kalman filters.

