Original: http://tecdat.cn/?p=4130

Whenever you observe a trend over time, you are looking at a time series. As the de facto choice for studying financial market performance and weather forecasts, time series analysis is one of the most popular analytical techniques because of its inseparable relationship with time: we are always interested in predicting the future.

Time-dependent models

An intuitive way to forecast is to look at the most recent points in time. Today's share price is likely to be closer to yesterday's price than to its price five years ago, so recent values are weighted more heavily when predicting today's price. Such correlations between past and present values indicate time dependence, which forms the basis of a popular time series technique called ARIMA (Autoregressive Integrated Moving Average). ARIMA accounts for both seasonal variation and one-off "shocks" from the past to make future predictions.

However, ARIMA makes strict assumptions. To use ARIMA, a trend should have regular periods, as well as a constant mean and variance. If we want to analyze an increasing trend, for example, we must first transform it so that it no longer grows but stagnates. If our data cannot be transformed to meet these assumptions, ARIMA will not work.
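For instance, a growth trend can often be flattened by differencing the series. A minimal pandas sketch (the series below is illustrative, not from the article):

import pandas as pd

# Illustrative upward-trending series (not from the article).
prices = pd.Series([100, 103, 107, 110, 114, 117])

# First-order differencing removes the growth, leaving a series with
# a roughly constant mean that ARIMA can model.
diffed = prices.diff().dropna()
print(diffed.tolist())  # [3.0, 4.0, 3.0, 4.0, 3.0]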

To avoid squeezing our data into this mold, we can consider alternatives such as neural networks. The long short-term memory (LSTM) network is a type of neural network that builds on time-dependent models. Although highly accurate, neural networks lack interpretability: it is difficult to identify which model components lead to a particular prediction.

Trend-based models

In addition to exploiting correlations between values at similar points in time, we can step back and model the overall trend. A time series can be thought of as the sum of individual trends. Consider, for example, the trend in Google searches for "persimmon", a fruit.

From Figure 1, we can infer that persimmons may be seasonal: as supply peaks in November, grocery shoppers may turn to Google for nutrition facts or persimmon recipes.

Figure 1. Seasonal trends in Google searches for ‘persimmon’, from http://rhythm-of-food.net/per…

In addition, Google searches for persimmons have become more frequent over the past few years.

Figure 2. Overall growth trend in Google searches for "persimmon", from http://rhythm-of-food.net/per…

Therefore, the Google search trend for persimmons can be modeled as a seasonal trend added to an increasing growth trend. A technique that models a time series as a sum of trends in this way is called a Generalized Additive Model (GAM).

The principle behind GAM is similar to that of regression, except that instead of summing the effects of individual predictors, a GAM is a sum of smooth functions. Functions allow us to model more complex patterns, and they can be averaged to obtain smoother curves.

Because GAMs are based on functions rather than variables, they are not constrained by the linearity assumption in regression, which requires predictor and outcome variables to move in a straight line. Furthermore, unlike neural networks, we can isolate each function in a GAM and study its effect on the resulting prediction.

In this tutorial, we will:

See an example of GAM in action.

Learn how functions in a GAM can be identified through backfitting.

Learn how to validate time series models.

Example: Daylight Saving Time

Anyone who lives in a region with four seasons knows that there is less daylight in winter than in summer. To remedy this, some countries move their clocks forward an hour during the summer, allowing more daylight for evening outdoor activities and, hopefully, less energy spent heating and lighting homes. This practice of pushing clocks forward in the summer is known as Daylight Saving Time (DST), and it was first implemented in the early 20th century.

But the actual benefits of DST remain controversial. Notably, DST has been shown to disrupt sleep patterns, which can affect job performance and even lead to accidents. So whenever the clocks are adjusted, people are prompted to question the rationale behind DST, and Wikipedia is one of the sources they turn to for answers.

To investigate trends in views of the DST Wikipedia page, we first used a Python script to extract page-view counts from the Wikipedia database, covering 2008 to 2015. Next, we performed time series analysis in Python using Prophet, a GAM package published by Facebook researchers. The package is also available in R.
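The extraction script itself is not reproduced here, but once the page-view counts are in hand, Prophet expects a two-column dataframe. A minimal sketch, assuming a hypothetical CSV of daily counts (the filename and column names are assumptions):

import pandas as pd

# Hypothetical file of daily page-view counts extracted from the
# Wikipedia database; filename and column names are assumptions.
df = pd.read_csv("dst_pageviews.csv")  # columns: date, views

# Prophet expects exactly two columns: 'ds' (date stamp) and
# 'y' (the value to forecast).
df = df.rename(columns={"date": "ds", "views": "y"})
df["ds"] = pd.to_datetime(df["ds"])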

The Prophet package is user-friendly and lets us specify the different types of functions that make up the GAM's trend line. There are three main types of functions:

Overall growth. This can be modeled as a straight (linear) or slightly curved (logistic) trend. In this analysis, we use the default linear growth model.

Seasonal variation. This is modeled using a Fourier series, which is simply a way of approximating periodic functions. The exact functions are derived through a process called backfitting, explained in the next section. We can specify whether we expect weekly and/or yearly trends to exist. In this analysis, we include both: weekly trends seem plausible based on past research, since people are likely to be outdoors, with less Internet activity, on weekends; and yearly trends would be consistent with the twice-yearly clock switches.

Special events. Besides modeling general trends, we should also account for one-off events. These include any phenomenon, be it a policy announcement or a natural disaster, that adds a ripple to an otherwise smooth trend. If we do not account for irregular events, the GAM may mistake them for persistent occurrences, and their effects will be propagated incorrectly.

Special events in our analysis include the exact dates on which American clocks are switched back and forth. We can also specify windows before and after each event during which we expect it to have a significant impact. For example, online searches about DST may start to increase before each switch. But search behavior after the switch may differ depending on whether the clocks move forward or backward: people may be more inclined to search online for the reason behind their lack of sleep, but not when they gain an extra hour of it. In addition to the clock-change dates, we also include major DST-related events. In 2010, for example, there were protests in Israel over an unusually early switch to winter time, caused by differences between the Hebrew and solar calendars. The full list of events included in our analysis can be found in the code.
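The exact event dates live in the article's accompanying code, but a special-events table for Prophet might be sketched as follows; the dates, window widths, and event names below are illustrative assumptions:

import pandas as pd

# Hypothetical special-events table. Prophet expects columns
# 'holiday' and 'ds', with optional effect windows around each date.
clock_changes = pd.DataFrame({
    "holiday": "clock_change",
    "ds": pd.to_datetime([
        "2008-03-09", "2008-11-02",  # US spring-forward / fall-back
        "2009-03-08", "2009-11-01",
        # ...remaining switch dates through 2015 go here...
    ]),
    "lower_window": -7,  # searches may build in the week before
    "upper_window": 7,   # and linger in the week after
})

one_off_events = pd.DataFrame({
    "holiday": "israel_dst_protest",       # assumed event name
    "ds": pd.to_datetime(["2010-09-12"]),  # assumed date
    "lower_window": 0,
    "upper_window": 1,
})

holidays = pd.concat([clock_changes, one_off_events], ignore_index=True)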

In addition to the above, the Prophet package asks us to specify a prior value that determines how sensitive the trend line is to changes in the data. Higher sensitivity yields a more jagged trend line, which may hurt how well it generalizes to future values. We can tune this prior when we validate our model, as we will see later in this tutorial.
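A sketch of how the model might be specified, with the prior exposed through Prophet's changepoint_prior_scale parameter (0.05 is the package default and merely a starting point; in older versions the import path is fbprophet):

from prophet import Prophet  # older versions: from fbprophet import Prophet

m = Prophet(
    growth="linear",             # default overall-growth function
    holidays=holidays,           # special-events table from above
    weekly_seasonality=True,     # Fourier-series weekly component
    yearly_seasonality=True,     # Fourier-series yearly component
    changepoint_prior_scale=0.05,  # trend sensitivity prior
)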

Now we can proceed to fit the GAM. Figure 3 shows the resulting functions for overall growth, special events, and seasonal variation (a sketch of the fitting step follows the figure caption):

Figure 3. The functions that make up the GAM for predicting page views of the DST Wikipedia article. In the first two plots, overall trend and special events (i.e., "holidays"), the x-axis is labeled "ds" for "date stamp". Duplicate year labels appear because the gridlines do not align with the same date each year.
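Continuing the sketch above, the fit and a component plot like Figure 3 could be produced as follows:

# Fit the model on the page-view dataframe, forecast a year ahead,
# and plot the learned components (trend, holidays, weekly and
# yearly seasonality), which is how a figure like Figure 3 is made.
m.fit(df)
future = m.make_future_dataframe(periods=365)  # extend one year ahead
forecast = m.predict(future)
fig = m.plot_components(forecast)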

We can see that overall page views of the DST Wikipedia article have been declining over the years, possibly because of competing online sources that explain DST. We can also observe how peaks in page views coincide with particular events. The weekly trend shows that people are most likely to read about DST on Mondays and least likely on weekends. Finally, the yearly trend shows that page views peak at the end of March and the end of October, when the clock changes occur.

Conveniently, we do not need to know in advance the exact predictive functions contained in the GAM. Instead, we simply specify some constraints, and the best functions are derived for us automatically. How does the GAM do this?

Backfitting

To find the trend line that best fits the data, GAM uses a procedure called backfitting. Backfitting iteratively adjusts the functions in a GAM so that they produce a trend line that minimizes prediction error. A simple example illustrates the process.

Suppose we have the following data:

Figure 4. Sample dataset, consisting of two predictor variables and one outcome variable.

Our goal is to find suitable functions for the predictors so that we can accurately predict the outcome.

First, we focus on finding a function for Predictor 1. A good initial guess might be to multiply it by 2:

Figure 5. The result of applying a "multiply by 2" function to Predictor 1.

As we can see from Figure 5, applying the "multiply by 2" function to Predictor 1 lets us predict the outcome perfectly half of the time. However, there is still room for improvement.

Next, we focus on finding a function for Predictor 2. By analyzing the prediction errors left after fitting Predictor 1's function, we can see that 100% accuracy is achieved by adding 1 to the result whenever Predictor 2 is positive, and doing nothing otherwise (that is, a step function).

This is the gist of the backfitting process, which can be summarized in the following steps (a code sketch follows the list):

Step 0: Define a function for one predictor and compute the resulting prediction error.

Step 1: Derive a function for the next predictor that minimizes the remaining error.

Step 2: Repeat Step 1 for all predictors, and run further cycles to re-derive their functions as necessary, until the prediction error can no longer be reduced.
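The article's GAM functions are derived by Prophet internally, but backfitting itself is easy to illustrate. A minimal sketch, using simple polynomial smoothers and toy data that mimic the example above (both are illustrative choices, not the article's method):

import numpy as np

def backfit(X, y, degree=3, n_iter=20):
    """Naive backfitting: repeatedly fit a one-dimensional smoother
    (here, a polynomial) for each predictor against the partial
    residuals left over by the other predictors' functions."""
    n, p = X.shape
    coefs = [np.zeros(degree + 1) for _ in range(p)]
    intercept = y.mean()
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove the intercept and every other
            # predictor's current function from the outcome.
            partial = y - intercept - sum(
                np.polyval(coefs[k], X[:, k]) for k in range(p) if k != j
            )
            coefs[j] = np.polyfit(X[:, j], partial, degree)
            # Center the function so the intercept stays identifiable.
            coefs[j][-1] -= np.polyval(coefs[j], X[:, j]).mean()
    return intercept, coefs

# Toy data in the spirit of the example: outcome = 2 * x1 + step(x2).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] + (X[:, 1] > 0) + rng.normal(scale=0.1, size=200)

intercept, coefs = backfit(X, y)
pred = intercept + sum(np.polyval(coefs[j], X[:, j]) for j in range(2))
print("mean absolute error:", np.abs(pred - y).mean())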

Now that we have fitted our model, we need to test it: does it accurately predict future values?

Validating time series models

Cross-validation is the go-to technique for evaluating how well a model predicts future values. However, time series models are an exception to this rule: standard cross-validation does not work for them.

Recall that cross-validation involves splitting a dataset into random subsamples that are used to train and test a model repeatedly. Crucially, the data points used in a training sample must be independent of those in its test sample. This is impossible for time series, where data points are time-dependent, so data in the training set would still carry time-based correlations with data in the test set. Hence, different techniques are needed to validate time series models.

Instead of sampling data points at random, we can slice the data by time. If we want to test the model's prediction accuracy over the next year (i.e., our forecast horizon), we can divide the dataset into training segments of one year (or longer) and use each segment to predict values for the year that follows. This technique is called simulated historical forecasting. As a guide, if our forecast horizon is one year, we should make a simulated forecast every six months. Figure 6 shows the results of 11 simulated forecasts of DST Wikipedia page views.
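Prophet ships with a diagnostics helper that performs exactly this kind of simulated historical forecasting. A sketch matching the setup described below (three-year training segments, a forecast every six months, a one-year horizon):

from prophet.diagnostics import cross_validation

# Simulated historical forecasts: an initial 3-year training window,
# a new cutoff every 6 months, and a 1-year forecast each time.
df_cv = cross_validation(
    m,
    initial="1095 days",  # three-year training segments
    period="180 days",    # one simulated forecast every six months
    horizon="365 days",   # one-year forecast horizon
)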

Figure 6. Simulated historical forecasts of DST Wikipedia page views.

In Figure 6, the forecast horizon is one year, and each training segment contains three years of data. For example, the first forecast band (red) uses data from January 2008 to December 2010 to predict views from January 2011 to December 2011. As we can see, apart from the first two simulated forecasts, which were misled by the unusually high page activity in 2010, the forecasts frequently overlap with the actual values.

To better assess the model's accuracy, we can take the average prediction error across all 11 simulated forecasts and plot it against the forecast horizon, as shown in Figure 7. Notice how the error grows as we try to predict further into the future.

Figure 7. Prediction error across the forecast horizon. The red line shows the mean absolute error of the 11 simulated forecasts, while the black line shows the smoothed trend of the error.
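Prophet can also summarize and plot error against the forecast horizon, much like Figure 7. A sketch, continuing from the cross-validation output above:

from prophet.diagnostics import performance_metrics
from prophet.plot import plot_cross_validation_metric

# Average error by forecast horizon, as summarized in Figure 7.
df_p = performance_metrics(df_cv)
print(df_p[["horizon", "mae"]].head())

# Plot mean absolute error against the forecast horizon.
fig = plot_cross_validation_metric(df_cv, metric="mae")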

Recall that one of the parameters we need to tune is the prior value, which determines how sensitive our trend line is to changes in the data. One way to tune it is to try different values and compare the resulting prediction errors, as shown in Figure 8. As we can see, a prior that is too high yields a trend line that generalizes poorly, which leads to larger errors.

Figure 8. Comparison of prediction errors resulting from different prior values.
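One way to run such a comparison is a simple grid search over the prior; a sketch, reusing df and holidays from the earlier sketches (the candidate values are illustrative):

from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

# Compare mean absolute error across candidate prior values.
for prior in [0.01, 0.05, 0.1, 0.5]:  # illustrative grid
    m = Prophet(holidays=holidays, changepoint_prior_scale=prior)
    m.fit(df)
    df_cv = cross_validation(m, initial="1095 days",
                             period="180 days", horizon="365 days")
    mae = performance_metrics(df_cv)["mae"].mean()
    print(f"changepoint_prior_scale={prior}: mean MAE = {mae:.1f}")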

Besides tweaking the priors, we can also tweak the underlying growth model, the seasonal trends, and the special-event settings. Visualizing our data also helps us identify and remove outliers. For example, we could improve the forecasts by excluding data from 2010, during which page views were abnormally high.

Limitations

As you might have guessed, having more training data in a time series does not necessarily lead to a more accurate model. Outliers and rapidly changing trends can thwart any forecasting effort. Worse, sudden shocks that permanently alter the time series can render all past data irrelevant.

Therefore, time series analysis is best suited to trends that are stable and systematic, something we can assess by visualizing the data.

Summary

Time series analysis is a technique for deriving trends across a period of time, which can then be used to predict future values. A Generalized Additive Model (GAM) does this by identifying and summing multiple functions to produce the trend line that best fits the data.

Functions in a GAM can be identified using the backfitting algorithm, which iteratively fits and adjusts the functions to reduce prediction error.

Time series analysis is best suited to stable and systematic trends.