Original link: tecdat.cn/?p=12272

Original source: Tuo End Data Tribe public account


 

With the ARIMA model, you can predict a time series using the past values of the series. In this paper, we build an optimal ARIMA model from scratch and extend it to both the Seasonal ARIMA (SARIMA) and SARIMAX models.

 

1. Introduction to time series prediction

A time series is a sequence in which measurements are recorded at regular time intervals.

Depending on the frequency of observations, a time series may be annual (e.g., a yearly budget), quarterly (e.g., expenditures), weekly (e.g., sales volume), daily (e.g., weather), hourly (e.g., stock prices), by the minute (e.g., inbound call volume), or even by the second (e.g., network traffic).

Why predict?

Because forecasting time series, such as demand and sales, is often of great business value.

In most manufacturing companies, it drives basic business planning, procurement, and production activities. Any error in the forecast will spread throughout the supply chain or any business environment associated with it. Therefore, it is important to forecast accurately in order to save costs, which is crucial to success.

The technology and concepts behind time series forecasting can be applied to any business, not just in manufacturing.

Broadly, time series forecasting can be divided into two types.

If only the previous values of a time series are used to predict its future values, it is called univariate time series forecasting.

If predictors other than the series itself (known as exogenous variables) are also used, it is called multivariate time series forecasting.

This article focuses on a special type of prediction approach called ARIMA modeling.

ARIMA is a forecasting algorithm based on the idea that the information in the past values of a time series can by itself be used to predict future values.

2. Introduction to ARIMA model

So what exactly is the ARIMA model?

ARIMA is a class of models that “explains” a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values.

Any “non-seasonal” time series with patterns that are not random white noise can be modeled using the ARIMA model.

An ARIMA model is characterized by three terms: p, d, and q, where

p is the order of the AR (autoregressive) term

q is the order of the MA (moving average) term

d is the order of differencing required to make the time series stationary

If a time series has seasonal patterns, seasonal terms need to be added, and it becomes SARIMA (short for “seasonal ARIMA”). More on that once we finish with ARIMA.

So what does “order of the AR term” actually mean? Let’s look at “d” first.

3. What do P, D and Q mean in ARIMA model

The first step in building an ARIMA model is to make the time series stationary.

Why is that?

Because the term “autoregressive” in ARIMA means it is a linear regression model that uses its own lags as predictors. As you know, linear regression models work best when the predictors are not correlated and are independent of each other.

So how do you make a sequence stationary?

The most common approach is differencing: subtract the previous value from the current value.

Hence, the value of d is the minimum order of differencing needed to make the series stationary. If the time series is already stationary, then d = 0.
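As a quick illustration of what differencing does, here is a minimal sketch with made-up numbers (not from the article's dataset):

```python
import numpy as np

series = np.array([10., 12., 15., 19., 24.])

first_diff = np.diff(series)        # subtract each value from the one before it
second_diff = np.diff(series, n=2)  # difference the already-differenced series

print(first_diff)   # [2. 3. 4. 5.]
print(second_diff)  # [1. 1. 1.]
```

Note how each round of differencing removes one layer of trend: the original series grows at an increasing rate, the first difference grows linearly, and the second difference is constant.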

Next, what are “p” and “q”?

“p” is the order of the “autoregressive” (AR) term. It refers to the number of lags of Y to be used as predictors. “q” is the order of the “moving average” (MA) term. It refers to the number of lagged forecast errors that should go into the ARIMA model.

4. What are AR and MA models

So what are AR and MA models? What are the actual mathematical formulas for AR and MA models?

A pure AR model is one where Yt depends only on its own lags. That is, Yt is a function of the “lags of Yt”:

Yt = α + β1·Y(t-1) + β2·Y(t-2) + … + βp·Y(t-p) + εt

where Y(t-1) is the first lag of the series, β1 is the coefficient of that lag as estimated by the model, and α is the intercept.

Likewise, a pure moving-average (MA-only) model is one where Yt depends only on the lagged forecast errors:

Yt = α + εt + φ1·ε(t-1) + φ2·ε(t-2) + … + φq·ε(t-q)

The error terms ε are the errors of the autoregressive models of the respective lags. The errors εt and ε(t-1) are the errors from the following equations:

Y(t) = β1·Y(t-1) + β2·Y(t-2) + … + εt

Y(t-1) = β1·Y(t-2) + β2·Y(t-3) + … + ε(t-1)

That’s the AR and MA models.
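To make the AR idea concrete, here is a minimal simulation of an AR(1) process; the coefficients are hypothetical, chosen only to illustrate the dependence of Yt on its own lag:

```python
import numpy as np

rng = np.random.default_rng(42)
n, alpha, beta1 = 500, 0.5, 0.7  # hypothetical intercept and lag-1 coefficient

y = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    # AR(1): Yt is a linear function of its own previous value plus noise
    y[t] = alpha + beta1 * y[t - 1] + eps[t]

# the sample lag-1 autocorrelation should come out close to beta1
lag1_corr = np.corrcoef(y[:-1], y[1:])[0, 1]
print(round(lag1_corr, 2))
```

The stronger beta1 is, the more slowly the autocorrelation of the simulated series decays, which is exactly the signature the PACF/ACF plots below are used to detect.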

So what are the equations of the ARIMA model?

An ARIMA model is one where the time series has been differenced at least once to make it stationary, and the AR and MA terms are then combined. Thus, the equation becomes:

Yt = α + β1·Y(t-1) + … + βp·Y(t-p) + εt + φ1·ε(t-1) + … + φq·ε(t-q)

In words: predicted Yt = constant + a linear combination of the lags of Y (up to p lags) + a linear combination of the lagged forecast errors (up to q lags).

The goal, therefore, is to identify the values of p, d, and q.

5. How to find the difference order in ARIMA Model (D)

The purpose of differencing is to make the time series stationary.

But be careful not to over-difference the series: an over-differenced series may still be stationary, and over-differencing will affect the model parameters.

So how do you determine the correct order of differencing?

The correct order of differencing is the minimum differencing required to get a near-stationary series that fluctuates around a defined mean, with an ACF plot that reaches zero fairly quickly.

If the autocorrelations are positive for many lags (10 or more), the series needs further differencing.

If you cannot really decide between two orders of differencing, choose the order that gives the smallest standard deviation in the differenced series.

Let’s look at an example.

First, I will check whether the series is stationary using the Augmented Dickey-Fuller test (adfuller()).

Why is that?

Because differencing is only needed if the series is non-stationary. Otherwise, no differencing is required, that is, d = 0.

The null hypothesis of the ADF test is that the time series is non-stationary. So, if the p-value of the test is less than the significance level (0.05), you reject the null hypothesis and conclude that the time series is indeed stationary.

So, in our case, if the p-value is > 0.05, we go on to find the order of differencing.

from statsmodels.tsa.stattools import adfuller
from numpy import log

result = adfuller(df.value.dropna())
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

ADF Statistic: -2.464240
p-value: 0.124419

Since the p-value is greater than the significance level, let’s differentiate the sequence to see what the autocorrelation graph looks like.

import numpy as np, pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.figsize': (9, 7), 'figure.dpi': 120})

df = pd.read_csv('wwwusage.csv', names=['value'], header=0)

# Original series
fig, axes = plt.subplots(3, 2, sharex=True)
axes[0, 0].plot(df.value); axes[0, 0].set_title('Original Series')
plot_acf(df.value, ax=axes[0, 1])

# 1st order differencing
axes[1, 0].plot(df.value.diff()); axes[1, 0].set_title('1st Order Differencing')
plot_acf(df.value.diff().dropna(), ax=axes[1, 1])

# 2nd order differencing
axes[2, 0].plot(df.value.diff().diff()); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(df.value.diff().diff().dropna(), ax=axes[2, 1])

plt.show()

 

Differencing plots

For the above series, the time series becomes stationary after two orders of differencing. But on the autocorrelation plot of the second differencing, the lag goes into the negative zone fairly quickly, which indicates the series might have been over-differenced.

So, even though the series is not perfectly stationary (weak stationarity), I will tentatively fix the order of differencing at 1.

from pmdarima.arima.utils import ndiffs
y = df.value

# ADF test
ndiffs(y, test='adf')  # 2

# KPSS test
ndiffs(y, test='kpss')  # 0

# PP test
ndiffs(y, test='pp')  # 2

6. How to find the order of AR term (p)

The next step is to identify whether the model needs any AR terms. You can find the required order of the AR term by inspecting the partial autocorrelation (PACF) plot.

But what is PACF?

Partial autocorrelation can be imagined as the correlation between the series and its lag after excluding the contributions of the intermediate lags. So, the PACF conveys the pure correlation between a lag and the series. That way, you know whether that lag is needed in the AR term.

How do I find the order of AR terms?

Any autocorrelation in a stationarized series can be corrected by adding enough AR terms. So, we initially take the order of the AR term to be equal to the number of lags that cross the significance limit in the PACF plot.

# PACF plot of the first-differenced series
fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df.value.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0, 5))
plot_pacf(df.value.diff().dropna(), ax=axes[1])
plt.show()

 

AR order

You can see that the PACF at lag 1 is quite significant, since it is well above the significance line. Lag 2 is also significant, slightly crossing into the significance region (blue area).

7. How to find the order of MA term (q)

Just as we looked at the PACF plot for the order of the AR term, you can look at the ACF plot for the order of the MA term. An MA term is technically the error of the lagged forecast.

The ACF tells you how many MA terms are needed to remove autocorrelation from the stationarized series.

Let’s look at the autocorrelation plot of the differenced series.

fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df.value.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0, 1.2))
plot_acf(df.value.diff().dropna(), ax=axes[1])
plt.show()

 

MA order

A couple of lags are well above the significance line. So let’s tentatively fix q as 2.

8. How to handle an under- or over-differenced time series

How do you handle it?

If the series is slightly under-differenced, adding one or more additional AR terms usually fixes it. Likewise, if it is slightly over-differenced, try adding an additional MA term.

9. How to build ARIMA model

Now that the values of p, d, and q have been determined, we have everything needed to fit the ARIMA model.

                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.value   No. Observations:                   99
Model:                 ARIMA(1, 1, 2)   Log Likelihood                -253.790
Method:                       css-mle   S.D. of innovations              3.119
Date:                Wed, 06 Feb 2019   AIC                            517.579
Time:                        23:32:56   BIC                            530.555
Sample:                             1   HQIC                           522.829
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             1.1202      1.290      0.868      0.387      -1.409       3.649
ar.L1.D.value     0.6351      0.257      2.469      0.015       0.131       1.139
ma.L1.D.value     0.5287      0.355      1.489      0.140      -0.167       1.224
ma.L2.D.value    -0.0010      0.321     -0.003      0.998      -0.631       0.629
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.5746           +0.0000j            1.5746            0.0000
MA.1           -1.8850           +0.0000j            1.8850            0.5000
MA.2          545.3515           +0.0000j          545.3515            0.0000
-----------------------------------------------------------------------------

The model summary reveals a lot of information. The table in the middle is the coefficient table, where the values under “coef” are the weights of the corresponding terms.

Notice that the coefficient of the MA2 term is close to zero and its p-value (the “P>|z|” column) is highly insignificant. Ideally, each p-value should be less than 0.05.

So let’s rebuild the model without the MA2 term.

                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.value   No. Observations:                   99
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -253.790
Method:                       css-mle   S.D. of innovations              3.119
Date:                Sat, 09 Feb 2019   AIC                            515.579
Time:                        12:16:06   BIC                            525.960
Sample:                             1   HQIC                           519.779
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             1.1205      1.286      0.871      0.386      -1.400       3.641
ar.L1.D.value     0.6344      0.087      7.317      0.000       0.464       0.804
ma.L1.D.value     0.5297      0.089      5.932      0.000       0.355       0.705
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.5764           +0.0000j            1.5764            0.0000
MA.1           -1.8879           +0.0000j            1.8879            0.5000
-----------------------------------------------------------------------------

The model AIC has decreased, which is good. The p-values of the AR1 and MA1 terms are now highly significant (<< 0.05).

Let’s plot the residuals.

 

Residual density

The residuals seem fine, with a mean close to zero and uniform variance. Let’s plot the actual values against the fitted values.

 

Actual vs. fitted

When you set dynamic=False, the in-sample lagged values are used for prediction.

That is, the model is trained up to the previous value in order to make the next prediction.

So, we seem to have a good ARIMA model. But is that the best?

You can’t say that yet, because at this point we have only compared the fitted values with the actual data; we haven’t actually forecast the future.

Therefore, cross-validation is now required.

10. How to manually find the best ARIMA model using cross-validation

In cross-validation, you forecast future data and then compare the forecasts with the actual values.

To cross-validate, create training and test datasets by splitting the time series into two contiguous parts, at roughly a 75:25 ratio or another reasonable ratio based on the time frequency of the series.

Why not sample the training data randomly?

Because the chronological order of a time series must stay intact for it to be used for forecasting.

You can now build ARIMA models on training data sets, predict and plot them.

plt.figure(figsize=(12, 5), dpi=100)
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)
plt.show()

 

Forecast and reality

From the chart, the ARIMA(1,1,1) model seems to give forecasts in the right direction, and the actual observations lie within the 95% confidence band.

But each forecast is consistently below the actuals. That means adding a small constant to our forecasts would almost certainly improve the accuracy, so there is definitely room for improvement.

So, what I’ll do is increase the order of differencing to 2 (that is, set d=2), iteratively increase p up to 5 and q up to 5, and see which model gives the lowest AIC while also producing forecasts closer to the actuals.

While doing so, I watch the p-values of the AR and MA terms in the model summary. They should be as close to zero as possible, ideally less than 0.05.

                             ARIMA Model Results
==============================================================================
Dep. Variable:               D2.value   No. Observations:                   83
Model:                 ARIMA(3, 2, 1)   Log Likelihood                -214.248
Method:                       css-mle   S.D. of innovations              3.153
Date:                Sat, 09 Feb 2019   AIC                            440.497
Time:                        12:49:01   BIC                            455.010
Sample:                             2   HQIC                           446.327
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const              0.0483      0.084
ar.L1.D2.value     1.1386      0.109     10.399      0.000       0.924       1.353
ar.L2.D2.value    -0.5923      0.155     -3.827      0.000
ar.L3.D2.value     0.3079      0.111      2.778      0.007       0.091       0.525
ma.L1.D2.value    -1.0000      0.035    -28.799      0.000      -1.068      -0.932
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1557           -0.0000j            1.1557           -0.0000
AR.2            0.3839           -1.6318j            1.6763           -0.2132
AR.3            0.3839           +1.6318j            1.6763            0.2132
MA.1            1.0000           +0.0000j            1.0000            0.0000
-----------------------------------------------------------------------------

 

Revised forecasts versus actual values

The AIC has been reduced from 515 to 440. The p-values of the coefficient terms are below 0.05, which is good.

So it’s much better overall.

Ideally, you should go back to several points in time, say 1, 2, 3, and 4 quarters back, and check how good the forecasts are at each point in the year.

 

11. Accuracy index of time series prediction

Common accuracy indicators used to judge predictions are:

  1. Mean absolute percentage error (MAPE)
  2. Mean error (ME)
  3. Mean absolute error (MAE)
  4. Mean percentage error (MPE)
  5. Root mean squared error (RMSE)
  6. Lag-1 autocorrelation of the error (ACF1)
  7. Correlation between actual and forecast (CORR)
  8. Min-max error (MINMAX)

Typically, MAPE, correlation, and min-max error can be used if you want to compare the forecasts of two different series.

Why not the other metrics?

Because only the above three are percentage-style errors that vary between 0 and 1, so you can judge the quality of the forecast regardless of the scale of the series.

The other error metrics are quantities in the units of the series. That means a series with a mean of 1000 may have an RMSE of 100, while a series with a mean of 10 may have an RMSE of 5, so they can’t really be used to compare the forecasts of two time series on different scales.
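The forecast_accuracy helper called below is not shown in the article; a plausible numpy sketch of these metrics (omitting the ACF1 term, which would need an autocorrelation routine) would be:

```python
import numpy as np

def forecast_accuracy(forecast, actual):
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    err = forecast - actual
    mins = np.minimum(forecast, actual)
    maxs = np.maximum(forecast, actual)
    return {
        'mape': np.mean(np.abs(err) / np.abs(actual)),  # mean absolute percentage error
        'me': np.mean(err),                             # mean error
        'mae': np.mean(np.abs(err)),                    # mean absolute error
        'mpe': np.mean(err / actual),                   # mean percentage error
        'rmse': np.sqrt(np.mean(err ** 2)),             # root mean squared error
        'corr': np.corrcoef(forecast, actual)[0, 1],    # correlation
        'minmax': 1 - np.mean(mins / maxs),             # min-max error
    }

print(forecast_accuracy([102., 98., 110.], [100., 100., 100.]))
```

A perfect forecast yields 0 for MAPE, RMSE, and min-max error, which is what makes those three comparable across series of different scales.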

forecast_accuracy(fc, test.values)

#> {'mape': 0.02250131357314834,
#>  'me': 3.230783108990054,
#>  'mae': 4.548322194530069,
#>  'mpe': 0.016421001932706705,
#>  'rmse': 6.373238534601827,
#>  'acf1': 0.5105506325288692,
#>  'corr': 0.9674576513924394,
#>  'minmax': 0.05163154777672227}

A MAPE of about 2.2% implies the model is about 97.8% accurate in predicting the next 15 observations.

But in industrial applications, you will be given many time series to make predictions, and the predictions will be repeated periodically.

Therefore, we need a way to automate the process of optimal model selection.

12. How to do automatic ARIMA forecasting in Python

A stepwise search over multiple combinations of the p, d, and q parameters is performed, and the model with the minimum AIC is selected.
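The article uses pmdarima's auto_arima for this search. The core idea, trying candidate orders and keeping the one with the lowest AIC, can be sketched for plain AR models with nothing but numpy (an illustration only; auto_arima searches the full (p, d, q) space and uses proper maximum-likelihood fits):

```python
import numpy as np

def ar_aic(y, p):
    """Fit AR(p) by ordinary least squares; return AIC up to an additive constant."""
    y = np.asarray(y, dtype=float)
    # design matrix: intercept plus p lagged columns
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])
    target = y[p:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    n, k = len(target), p + 1
    return n * np.log(rss / n) + 2 * k  # lower is better

# simulate a series whose true AR order is 2 (hypothetical coefficients)
rng = np.random.default_rng(7)
y = np.zeros(400)
eps = rng.normal(size=400)
for t in range(2, 400):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + eps[t]

best_p = min(range(1, 6), key=lambda p: ar_aic(y, p))
print(best_p)
```

The AIC penalizes each extra parameter by 2, which is why the search does not simply pick the largest order: a bigger model must reduce the residual sum of squares enough to pay for its parameters.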

print(model.summary())

#> Fit ARIMA: order=(1, 2, 1); AIC=525.586, BIC=535.926, Fit time=0.060 seconds
#> Fit ARIMA: order=(0, 2, 0); AIC=533.474, BIC=538.644, Fit time=0.005 seconds
#> Fit ARIMA: order=(1, 2, 0); AIC=532.437, BIC=540.192, Fit time=0.035 seconds
#> Fit ARIMA: order=(0, 2, 1); AIC=525.893, BIC=533.648, Fit time=0.040 seconds
#> Fit ARIMA: order=(2, 2, 1); AIC=515.248, BIC=528.173, Fit time=0.105 seconds
#> Fit ARIMA: order=(2, 2, 0); AIC=513.459, BIC=523.798, Fit time=0.063 seconds
#> Fit ARIMA: order=(3, 2, 1); AIC=512.552, BIC=528.062, Fit time=0.272 seconds
#> Fit ARIMA: order=(3, 2, 0); AIC=515.284, BIC=528.209, Fit time=0.042 seconds
#> Fit ARIMA: order=(3, 2, 2); AIC=514.514, BIC=532.609, Fit time=0.234 seconds
#> Total fit time: 0.865 seconds
#>
#>                              ARIMA Model Results
#> ==============================================================================
#> Dep. Variable:                   D2.y   No. Observations:                   98
#> Model:                 ARIMA(3, 2, 1)   Log Likelihood                -250.276
#> Method:                       css-mle   S.D. of innovations
#> Date:                Sat, 09 Feb 2019   AIC                            512.552
#> Time:                                   BIC                            528.062
#> Sample:                             2   HQIC                           518.825
#> ==============================================================================
#>                  coef    std err          z      P>|z|      [0.025      0.975]
#> ------------------------------------------------------------------------------
#> const          0.0234
#> ar.L1.D2.y     1.1586      0.097     11.965      0.000       0.969       1.348
#> ar.L2.D2.y    -0.6640      0.136     -4.890      0.000
#> ar.L3.D2.y     0.3453      0.096      3.588      0.001       0.157       0.534
#> ma.L1.D2.y    -1.0000      0.028    -36.302      0.000      -1.054      -0.946
#>                                     Roots
#> ==============================================================================
#>                   Real          Imaginary           Modulus         Frequency
#> ------------------------------------------------------------------------------
#> AR.1            1.1703           -0.0000j            1.1703           -0.0000
#> AR.2            0.3763           -1.5274j            1.5731           -0.2116
#> AR.3            0.3763           +1.5274j            1.5731            0.2116
#> MA.1            1.0000           +0.0000j            1.0000            0.0000
#> ------------------------------------------------------------------------------

13. How to interpret the residual graph in ARIMA model

Let’s look at the residuals diagram.

 

Residual plots

So what explains it?

Top left: the residuals appear to fluctuate around a mean of zero and have uniform variance.

Top right: the density plot suggests a normal distribution with zero mean.

Bottom left: all the dots should fall in line with the red line; any significant deviation would imply a skewed distribution.

Bottom right: the correlogram (i.e., ACF) plot shows that the residuals are not autocorrelated. Any autocorrelation would imply a pattern in the residuals that the model does not explain, so you would need to look for more predictors (X variables) for the model.

Overall, the model fits well. Let’s make a prediction.

 

 

14. How do I automatically build SARIMA models in Python

The problem with a plain ARIMA model is that it does not support seasonality.

If your time series has defined seasonality, use SARIMA, which applies seasonal differencing.

Seasonal differencing is similar to regular differencing, but instead of subtracting consecutive terms, you subtract the value from the previous season.
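A minimal sketch of seasonal differencing on a made-up monthly series (period m = 12):

```python
import numpy as np

m = 12
months = np.arange(36)
# trend of +1 per month plus a repeating yearly pattern (hypothetical data)
y = months + 10 * np.sin(2 * np.pi * (months % m) / m)

seasonal_diff = y[m:] - y[:-m]  # subtract the same month of the previous year
print(seasonal_diff)  # the yearly pattern cancels, leaving the trend step of 12
```

Because the seasonal pattern repeats exactly every m observations, subtracting the value from one season back removes it entirely, which is why seasonal differencing stabilizes a seasonal series the way regular differencing stabilizes a trending one.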

Thus, the model is represented as SARIMA(p,d,q)x(P,D,Q)m, where P, D, and Q are the orders of the seasonal AR term, the seasonal differencing, and the seasonal MA term, and m is the frequency of the time series (the number of periods per season).

If the model has a well-defined seasonal pattern, enforce D = 1 for the given frequency m.

Here are some practical suggestions for building the SARIMA model:

In general, set the model parameter D to no more than 1, and keep the total differencing d + D at no more than 2. If the model has seasonal components, try to keep only the SAR or the SMA terms, not both.

We build a SARIMA model on the drug sales data set.

 

Seasonal difference

After applying the usual differencing (lag 1), the seasonal spikes are still intact. In view of that, the series also needs to be corrected with seasonal differencing.

Let’s build the SARIMA model. To do so, set seasonal=True, set the frequency of the monthly series m=12, and enforce D=1.

Fit ARIMA: order=(1, 0, 1) seasonal_order=(0, 1, 1, 12); AIC=534.818, BIC=551.105, Fit time=1.742 seconds
Fit ARIMA: order=(0, 0, 0) seasonal_order=(0, 1, 0, 12); AIC=624.061, BIC=630.576, Fit time=0.028 seconds
Fit ARIMA: order=(1, 0, 0) seasonal_order=(1, 1, 0, 12); AIC=596.004, BIC=609.034, Fit time=0.683 seconds
Fit ARIMA: order=(0, 0, 1) seasonal_order=(0, 1, 1, 12); AIC=611.475, BIC=624.505, Fit time=0.709 seconds
Fit ARIMA: order=(1, 0, 1) seasonal_order=(1, 1, 1, 12); AIC=557.501, BIC=577.046, Fit time=3.687 seconds
(... truncated ...)
Fit ARIMA: order=(3, 0, 0) seasonal_order=(1, 1, 1, 12); AIC=554.570, BIC=577.372, Fit time=2.431 seconds
Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 1, 0, 12); AIC=554.094, BIC=570.381, Fit time=0.220 seconds
Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 1, 2, 12); AIC=529.502, BIC=552.305, Fit time=2.120 seconds
Fit ARIMA: order=(3, 0, 0) seasonal_order=(1, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Total fit time: 31.613 seconds

 

The model estimates the AIC, and the p-values of the coefficients look significant. Let’s look at the diagnostic plots of the residuals.

The best model, SARIMAX(3, 0, 0)x(0, 1, 1, 12), has an AIC of 528.6, and its p-values are highly significant.

Let’s look ahead 24 months.

 

SARIMA — Final forecast

 

15. How to establish SARIMAX model with exogenous variables

The SARIMA model we built is good.

But for the sake of completeness, let’s try adding external predictive variables (also known as “exogenous variables”) to the model. This model is called the SARIMAX model.

The only requirement for using an exogenous variable is that you also need to know the value of the variable during the forecast period.

To illustrate, I’ll use the seasonal index from the classic seasonal decomposition for the last 36 months.

Why the seasonal index? Isn’t SARIMA already modeling seasonality?

You’re right.

Besides, I want to see how the model turns out if we force the recent seasonal pattern onto the training and the forecast.

Second, it makes for a good demonstration: you can use this as a template and plug any variable into the code. The seasonal index is a good exogenous variable because it repeats once per frequency cycle, 12 months in this case.

As a result, you will always know what value the seasonal index will hold for future forecasts.

Let’s calculate the seasonality index so that we can use it as a (external) predictor of the SARIMAX model.
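The article derives the seasonal index from a classical seasonal decomposition (e.g., statsmodels' seasonal_decompose). A crude numpy stand-in shows the idea, each calendar month's average level relative to the overall mean; the data here is hypothetical:

```python
import numpy as np

m = 12
pattern = np.array([0., 1., 4., 2., 0., -1., -3., -2., 0., 1., 2., 0.])
y = 50.0 + np.tile(pattern, 4)  # four years of made-up monthly data

monthly_mean = y.reshape(-1, m).mean(axis=0)  # average of each calendar month
seasonal_index = monthly_mean / y.mean()      # ratio of each month to the overall level

print(seasonal_index.round(3))
```

The resulting 12 values average to 1.0 and repeat every cycle, so extending them to any future month is trivial, which is exactly the property that makes the seasonal index usable as a SARIMAX exogenous variable.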

The exogenous variable (seasonal index) is ready. Let’s build the SARIMAX model.

Fit ARIMA: order=(1, 0, 1) seasonal_order=(0, 1, 1, 12); AIC=536.818, BIC=556.362, Fit time=2.083 seconds
Fit ARIMA: order=(0, 0, 0) seasonal_order=(0, 1, 0, 12); AIC=626.061, BIC=635.834, Fit time=0.033 seconds
Fit ARIMA: order=(1, 0, 0) seasonal_order=(1, 1, 0, 12); AIC=598.004, BIC=614.292, Fit time=0.682 seconds
Fit ARIMA: order=(0, 0, 1) seasonal_order=(0, 1, 1, 12); AIC=613.475, BIC=629.762, Fit time=0.510 seconds
Fit ARIMA: order=(1, 0, 1) seasonal_order=(1, 1, 1, 12); AIC=559.530, BIC=582.332, Fit time=3.129 seconds
(... truncated ...)
Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 1, 0, 12); AIC=556.094, BIC=575.639, Fit time=0.260 seconds
Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 1, 2, 12); AIC=531.502, BIC=557.562, Fit time=2.375 seconds
Fit ARIMA: order=(3, 0, 0) seasonal_order=(1, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Total fit time: 30.781 seconds

So we have a model with an exogenous term. But the coefficient on that term is so small that its contribution is negligible. Let’s go ahead and forecast anyway.

We have effectively imposed on the model the most recent seasonal effects of the last 3 years.

Let’s forecast the next 24 months. To do this, you need seasonal index values for the next 24 months.

SARIMAX prediction

 

