It would be nice if a data analyst's model could accurately predict tomorrow's market index. I laugh at the thought, and then I tell myself it is impossible. Still, interesting ideas and practical application scenarios are what make practice more and more engaging and push me to keep learning.

Getting to the topic: this article uses a public data source, the NASDAQ index on every trading day over the last 10 years (2007-08-29 to 2017-08-25), and walks through the whole process of building an ARIMA model and forecasting the NASDAQ index with Python.

Use pandas' read_excel method to read the data stored in Excel. The code is shown below:

import pandas as pd

Mydata = pd.read_excel('Nasdaq index.xlsx', sheet_name='Sheet1')

* * * *

1. Obtain time series data


The raw data looks like this. The Date column holds the trading date and has a date data type. Close/Last is the index value at the market close on that day.

pandas reads the file into a DataFrame. pd.Series then converts the DataFrame into a time series object, whose index is the date and whose values are the corresponding closing index values.
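
A minimal sketch of this conversion, assuming the sheet has the two columns described above (`Date` and `Close/Last`). The small inline DataFrame stands in for the real Excel file, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by pd.read_excel;
# the numbers below are invented and are not real index values.
Mydata = pd.DataFrame({
    "Date": ["2017-08-25", "2017-08-23", "2017-08-24"],
    "Close/Last": [6205.25, 6200.00, 6210.50],
})

# Parse the dates, use them as the index, and sort chronologically.
nasdaq = pd.Series(
    Mydata["Close/Last"].values,
    index=pd.to_datetime(Mydata["Date"]),
    name="nasdaq",
).sort_index()

print(nasdaq)
```

Sorting by the index matters when the spreadsheet lists the most recent day first, because differencing and forecasting both assume chronological order.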

The time series chart is shown below; the series is clearly non-stationary.

Click Figure options in the figure window above to add a chart title, an x-axis label and a y-axis label through the visual interface. You can also manually set the minimum and maximum values of the x and y axes. This is similar to adjusting the horizontal and vertical axes in our earlier Excel article.

Scale sets whether an axis is linear or logarithmic. Once configured, click Apply.

The Curves tab lets you set parameters for the curves in the graph:

Line style offers solid, dashed, dash-dot and dotted lines.

Draw style controls how consecutive points are joined: the default connects them with straight segments, while the step options draw the curve as a staircase.

Width sets the line width, and Color sets the line color.

Marker style sets the shape of the marker, such as a point, square or circle.

Size adjusts the size of the marker.

Face color adjusts the fill color of the marker.

Edge color adjusts the border color of the marker.
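
The same settings can also be made in code instead of through the dialog. This is a minimal sketch, assuming matplotlib is installed; the synthetic series, titles and style values are placeholders, not the ones used for the charts in this article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic random walk standing in for the index series.
rng = np.random.default_rng(0)
values = 2500 + np.cumsum(rng.normal(size=200))

fig, ax = plt.subplots()
(line,) = ax.plot(
    values,
    linestyle="--",               # Line style: "-", "--", "-.", ":"
    linewidth=1.5,                # Width
    color="tab:blue",             # Color
    marker="o",                   # Marker style
    markersize=3,                 # Size
    markerfacecolor="white",      # Face color
    markeredgecolor="tab:blue",   # Edge color
)
ax.set_title("NASDAQ index (synthetic data)")   # chart title
ax.set_xlabel("Trading day")                    # x-axis label
ax.set_ylabel("Close/Last")                     # y-axis label
ax.set_yscale("linear")                         # Scale: "linear" or "log"
ax.set_ylim(values.min() - 10, values.max() + 10)  # manual axis limits
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to open the figure window.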

2. Perform the difference calculation


A basic requirement for ARMA modeling is that the series to be predicted is stationary: individual values should fluctuate around the mean of the series without an obvious upward or downward trend. In statistical terms, a stationary series is one whose numerical characteristics, such as expectation, variance, autocovariance and autocorrelation coefficients, do not change over time.
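
In symbols, this (weak) stationarity condition can be written as follows. The notation is an assumption of this note: $X_t$ is the series, $\mu$ its mean, $\gamma_k$ the autocovariance at lag $k$, and $\rho_k$ the autocorrelation coefficient.

$$
\begin{aligned}
E[X_t] &= \mu && \text{for all } t,\\
\operatorname{Var}(X_t) &= \gamma_0 < \infty && \text{for all } t,\\
\operatorname{Cov}(X_t, X_{t+k}) &= \gamma_k && \text{for all } t \text{ and every lag } k,\\
\rho_k &= \gamma_k / \gamma_0 .
\end{aligned}
$$

None of the quantities on the right-hand side depends on $t$, which is exactly what "does not change over time" means here.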

Intuitively, if the time series shows an upward or downward trend, the original series needs to be differenced. By inspection, this data set is a non-stationary time series, so d-th order differencing is applied until the original series is transformed into a stationary one.

First, take the first-order difference and check whether the resulting series is stationary.

As the figure below shows, after one round of differencing the series is basically stationary. We can difference once more to compare how stable the two results are.

The second-order difference was then computed, and the two resulting plots were compared. The series after second-order differencing fluctuates less than the series after first-order differencing, so d=2 is selected.
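
A minimal sketch of the differencing step with pandas, assuming the series is held in a `pandas.Series`. The synthetic random walk below replaces the real index data, so its numbers say nothing about the NASDAQ series itself. Comparing standard deviations is one simple, informal way to put a number on how much each differenced series fluctuates.

```python
import numpy as np
import pandas as pd

# Synthetic trending series standing in for the index (not real data).
rng = np.random.default_rng(42)
series = pd.Series(2500 + np.cumsum(rng.normal(loc=0.5, scale=10, size=500)))

diff1 = series.diff().dropna()         # first-order difference: x[t] - x[t-1]
diff2 = series.diff().diff().dropna()  # second-order: difference of the differences

for name, s in [("original", series), ("d=1", diff1), ("d=2", diff2)]:
    print(f"{name:>8}: mean={s.mean():10.2f}  std={s.std():10.2f}")
```

A common rule of thumb treats a rise in standard deviation from one order of differencing to the next as a hint of over-differencing, so it is worth checking this number alongside the plots.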

* * * *

3. Determine the p and q parameters in ARMA


* * * *

3.1 ARMA model


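In standard notation, the ARMA(p, q) model is:

$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$

This function consists of three parts, each of which is explained below.

The first part, $c + \sum_{i=1}^{p} \phi_i X_{t-i}$, is the autoregressive model AR(p) with parameter p;

$\varepsilon_t$ represents white noise;

$\sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$ represents the moving average model MA(q).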

3.2 The AR(p) model

If q=0, the ARMA(p, q) model reduces to the AR(p) model.

For an AR(p) model, the ACF decays gradually, that is, it tails off, while the PACF is equal to zero for lags j > p, that is, it cuts off.
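
This behaviour can be checked on simulated data. The sketch below uses only NumPy and an AR(1) process with a coefficient of 0.7 chosen purely for illustration; `sample_acf` is a small helper written for this example, not a library function.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of a 1-D array."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )

rng = np.random.default_rng(0)
phi, n = 0.7, 5000
eps = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]  # AR(1): x_t = phi * x_{t-1} + eps_t

r = sample_acf(x, 5)
# Lag-2 partial autocorrelation computed from the first two autocorrelations.
pacf2 = (r[2] - r[1] ** 2) / (1 - r[1] ** 2)

print("ACF, lags 1-5:", np.round(r[1:], 2))  # shrinks gradually, roughly like phi**k
print("PACF, lag 2:", round(pacf2, 3))       # near zero, i.e. beyond lag p = 1
```

The autocorrelations shrink lag by lag without ever dropping abruptly to zero, whereas the partial autocorrelation is already negligible at lag 2.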

3.3 The MA(q) model

If p=0, the ARMA(p, q) model reduces to the MA(q) model.

For an MA(q) model, the ACF is equal to 0 for lags j > q, that is, it cuts off, while the PACF decays exponentially, that is, it tails off.
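
The mirror-image pattern shows up in a simulated MA(1) process. Again this is only a NumPy sketch: the coefficient 0.6 is arbitrary and `sample_acf` is a helper defined for the example.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of a 1-D array."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )

rng = np.random.default_rng(1)
theta, n = 0.6, 5000
eps = rng.normal(size=n + 1)
x = eps[1:] + theta * eps[:-1]  # MA(1): x_t = eps_t + theta * eps_{t-1}

r = sample_acf(x, 5)
# Lag-2 partial autocorrelation computed from the first two autocorrelations.
pacf2 = (r[2] - r[1] ** 2) / (1 - r[1] ** 2)

print("ACF, lags 1-5:", np.round(r[1:], 2))  # lag 1 is clearly non-zero, later lags are near zero
print("PACF, lag 2:", round(pacf2, 3))       # still clearly non-zero: the PACF tails off
```

Here the theoretical lag-1 autocorrelation is theta / (1 + theta**2), and every autocorrelation beyond lag q = 1 is zero.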

3.4 ARMA model

If neither the ACF nor the PACF cuts off, that is, both tail off, then an ARMA(p, q) model should be considered, where p and q are both non-zero. Box, Jenkins and Reinsel (1994), pioneers of time series analysis, held that p ≤ 2 and q ≤ 2 are sufficient for most cases. Of course, to be safe, the maximum values tried for p and q can be made larger. (Chen Qiang, Advanced Econometrics)

3.5 ARIMA model

ARIMA(p, d, q) is the autoregressive integrated moving average model. AR stands for autoregressive, and p is the number of autoregressive terms; MA stands for moving average, and q is the number of moving average terms; d is the number of times the series is differenced to make it stationary.
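
Written with the lag operator $L$ (where $L X_t = X_{t-1}$), this definition says that an ARIMA(p, d, q) model is an ARMA(p, q) model applied to the series after it has been differenced d times. The symbols $\phi_i$, $\theta_j$, $c$ and $\varepsilon_t$ are standard notation assumed for this note.

$$\Bigl(1 - \sum_{i=1}^{p} \phi_i L^i\Bigr)(1 - L)^d X_t = c + \Bigl(1 + \sum_{j=1}^{q} \theta_j L^j\Bigr)\varepsilon_t$$

For example, with d = 1 the differenced series is $(1 - L)X_t = X_t - X_{t-1}$, and with d = 2 it is $(1 - L)^2 X_t = X_t - 2X_{t-1} + X_{t-2}$.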

* * * *

3.6 Determine the values of p and q in this example

In the following figure, the first subplot is the ACF (autocorrelation) plot and the second is the PACF (partial autocorrelation) plot. The ACF tails off after lag 1 (gradually shrinking toward 0 inside the confidence interval), so q=1; the PACF tails off after lag 15 (gradually shrinking toward 0 inside the confidence interval), so p=15.

4. Use the ARIMA model to predict the NASDAQ index for the next 10 trading days


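from statsmodels.tsa.arima.model import ARIMA  # replaces the removed statsmodels.tsa.arima_model

model = ARIMA(nasdaq, order=(15, 2, 1)).fit()  # p=15, d=2, q=1

model.forecast(10)  # point forecasts for the next 10 trading days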

Comments and discussion are welcome.
