This is a small independent practice project my lead gave me when I first joined the company. It is about time series prediction; the scenario is fairly simple, and the whole thing is only a little over 100 lines of code. Still, I hit quite a few pits along the way, so I feel it is worth sharing (and ranting about). The complete code and sample data are on my GitHub: github.com/scarlettgin…

1. Background

The company platform exposes various APIs for internal or external callers. These APIs serve different functions, such as account queries, version distribution, red-envelope grabbing and so on. The log records how many times each API is accessed per minute, which means each API has 1440 records per day (1440 minutes). Stringing the days together gives something that looks like a stock chart. I want to predict the traffic on day N+1 from the historical data of the previous N days. The predicted value then serves as a reasonable reference for the real value on the new day, compared in real time: when the actual traffic deviates greatly from the prediction, the access is considered abnormal and an alarm is triggered.

2. Data exploration

I put a sample data file in the data folder. First, look at the size and structure of the data:

import pandas as pd

data = pd.read_csv(filename)
print('size: ', data.shape)
print(data.head())


data.png

Data size: 10,080 records in total, i.e. 10,080 minutes, which is seven days of data. The columns mean the following: date: the timestamp, at minute granularity; count: the number of times the API was accessed within that minute.

Plot the series to see what it looks like (some of the exploratory plotting helpers are collected in the test_stationarity.py file, including time series plots, moving averages and so on; try them out yourself if you are interested):

import pandas as pd
import matplotlib.pyplot as plt

def draw_ts(timeseries):
    timeseries.plot()
    plt.show()

data = pd.read_csv(path)
data = data.set_index('date')
data.index = pd.to_datetime(data.index)
ts = data['count']
draw_ts(ts)


sequence.png

Look at this silly graph, with its points dropping straight to zero. That was the first pit I hit: I started modeling as soon as I got the data, and only after a long stretch of agonizing did I find out that the sharp drops to 0 were missing values that the ETL colleagues' pipeline had automatically filled with zero.

Fill the holes: replace each missing value with the mean of the values before and after it, and take another look:




Sequence after filling missing values.png
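The fill code itself is not shown in the article; below is a minimal sketch of the idea, assuming the missing minutes show up as zero counts (as the ETL zero-filling described above implies). For a single missing point, linear interpolation is exactly the mean of its neighbours:

import numpy as np

# a sketch: treat the zero-filled minutes as missing, then interpolate
# between the values just before and after each gap
ts = ts.replace(0, np.nan).interpolate(method='linear')
draw_ts(ts)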

It is found that this data has the following characteristics, which should be taken into account in model design and data preprocessing:

1. This is a periodic time series, and the value fluctuates regularly on a daily basis. The API in the figure is more active in the afternoon and evening, and less frequent in the morning and early morning. You need to do the decomposition before modeling.

2. My second pit: the data itself is not smooth, with plenty of sudden spikes and dips, which is bad for prediction. After all, the model needs to learn from a normal series to give an objective judgment on unknown data; otherwise there will be frequent false alarms, which makes things rather awkward. So the series has to be smoothed.

3. This is just the sequence plot of one API; different APIs look very different, since after all they serve different functions. How to make the model adapt to APIs of different shapes is also a problem to consider.

3. Preprocessing

3.1 Splitting into training and test sets

The first six days of data form the training set, and the seventh day is the test set.

class ModelDecomp(object):
    def __init__(self, file, test_size=1440):
        self.ts = self.read_data(file)                    # the full series, loaded from file
        self.test_size = test_size                        # length of the test set, one day = 1440 minutes
        self.train_size = len(self.ts) - self.test_size
        self.train = self.ts[:len(self.ts) - test_size]   # first six days
        self.test = self.ts[-test_size:]                  # seventh day
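The read_data helper is not shown anywhere in the article; here is a minimal sketch of what it might look like, based on the loading code from section 2 (the date and count column names come from the sample data):

import pandas as pd

def read_data(self, file):
    # load the csv, index it by minute-level timestamps and return the count series
    data = pd.read_csv(file)
    data = data.set_index('date')
    data.index = pd.to_datetime(data.index)
    return data['count']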

3.2 Smoothing the training data

To remove the burrs in the data you could use a moving average, but I did not use it here, because I found that for my data a moving average did not give a smooth enough result. The method I used instead is very simple but works well: take the difference between each point and the previous one as a new series, shave off the outliers in that series (i.e. the unreasonably large changes) and fill them with values interpolated between the data before and after, taking care that several consecutive points may all change by a large amount:

import numpy as np
from datetime import timedelta

def _diff_smooth(self, ts):
    dif = ts.diff().dropna()  # first-order difference series
    td = dif.describe()       # descriptive stats: min, 25%, 50%, 75%, max
    high = td['75%'] + 1.5 * (td['75%'] - td['25%'])  # high threshold: 1.5x the interquartile range above the 75% quantile
    low = td['25%'] - 1.5 * (td['75%'] - td['25%'])   # low threshold, same idea

    # indexes of the points whose change exceeds the thresholds
    forbid_index = dif[(dif > high) | (dif < low)].index
    i = 0
    while i < len(forbid_index) - 1:
        n = 1  # how many consecutive points change too much; most of the time it is a single point
        start = forbid_index[i]  # start of the abnormal stretch
        while forbid_index[i + n] == start + timedelta(minutes=n):
            n += 1
        i += n - 1
        end = forbid_index[i]  # end of the abnormal stretch
        # fill the stretch evenly with values between the points just before and after it
        value = np.linspace(ts[start - timedelta(minutes=1)], ts[end + timedelta(minutes=1)], n)
        ts[start: end] = value
        i += 1
    return ts

self.train = self._diff_smooth(self.train)
draw_ts(self.train)

Smoothed training data:




Training sequence after smoothing.png

3.3 Periodic decomposition of the training data

Using the statsmodels package:

from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(self.ts, freq=freq, two_sided=False)
# self.ts: the time series
# freq: the period, here 1440 minutes, i.e. one day
# two_sided: in rows 2 and 4 of the plot below, the left side is blank.
#   If set to True, both sides would be blank; False guarantees that the
#   series has data at its end, which is what we need for forecasting.

self.trend = decomposition.trend
self.seasonal = decomposition.seasonal
self.residual = decomposition.resid
decomposition.plot()
plt.show()


Decomposition diagram.png

Looking at the plot from top to bottom: the first row, observed, is the original data; the second row, trend, is the decomposed trend component; the third row, seasonal, is the periodic component; and the last row, residual, is what remains. I used the additive model of seasonal_decompose, i.e. observed = trend + seasonal + residual (there is also a multiplicative model). In the modeling step we learn and forecast only the trend component. How do we then turn the trend forecast into a reasonable final result? By adding the components back, of course; this is described in more detail below.
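As a quick sanity check (a small sketch reusing the decomposition object from the snippet above), the three components should add back up exactly to the original series wherever the trend is defined; the initial stretch of points is NaN because two_sided=False only uses past data:

recomposed = decomposition.trend + decomposition.seasonal + decomposition.resid
# the difference should be essentially zero for the additive model
print((self.ts - recomposed).dropna().abs().max())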

4. Modeling

4.1 Training

The decomposed trend component is trained separately with an ARIMA model:

def trend_model(self, order):
    self.trend.dropna(inplace=True)
    train = self.trend[:len(self.trend) - self.test_size]
    # ARIMA training parameter order = (p, d, q); see the official
    # documentation for details. The tuning process is omitted here.
    self.trend_model = ARIMA(train, order).fit(disp=-1, method='css')

4.2 Prediction

After the trend is forecast, the seasonal component is added back to produce the final prediction. More importantly, though, what we need is not a specific value but a reasonable interval: when the real data falls outside this interval, the alarm is triggered. The high and low error bounds are derived from the decomposed residual data:

d = self.residual.describe()
delta = d['75%'] - d['25%']  # interquartile range of the residuals
self.low_error, self.high_error = (d['25%'] - 1 * delta, d['75%'] + 1 * delta)

Forecast the trend, then do the final addition to get the predicted values for the seventh day together with their high and low confidence bounds:

def predict_new(self):
    '''
    Predict the new data
    '''
    # continue from the end of train: generate a time index of length n
    # and assign it to the prediction series
    n = self.test_size
    self.pred_time_index = pd.date_range(start=self.train.index[-1], periods=n + 1, freq='1min')[1:]
    self.trend_pred = self.trend_model.forecast(n)[0]
    self.add_season()

def add_season(self):
    '''
    Add the seasonal and residual parts back onto the predicted trend
    '''
    self.train_season = self.seasonal[:self.train_size]
    values = []
    low_conf_values = []
    high_conf_values = []

    for i, t in enumerate(self.pred_time_index):
        trend_part = self.trend_pred[i]

        # mean of the seasonal values at the same time of day in the training data
        season_part = self.train_season[
            self.train_season.index.time == t.time()
        ].mean()

        # predicted value = trend + season; the bounds add the residual-based errors
        predict = trend_part + season_part
        low_bound = trend_part + season_part + self.low_error
        high_bound = trend_part + season_part + self.high_error

        values.append(predict)
        low_conf_values.append(low_bound)
        high_conf_values.append(high_bound)

    # the prediction and its lower and upper bounds
    self.final_pred = pd.Series(values, index=self.pred_time_index, name='predict')
    self.low_conf = pd.Series(low_conf_values, index=self.pred_time_index, name='low_conf')
    self.high_conf = pd.Series(high_conf_values, index=self.pred_time_index, name='high_conf')

4.3 Evaluation

For the seventh-day prediction, the evaluation metric is the root mean square error (RMSE), plus a plot comparing the prediction with the real values:

md = ModelDecomp(file=filename, test_size=1440)
md.decomp(freq=1440)
md.trend_model(order=(1, 1, 3))  # ARIMA order (p, d, q); the q value shown here is an assumed example, tune it for your data
md.predict_new()
pred = md.final_pred
test = md.test

plt.subplot(211)
plt.plot(md.ts)  # smoothed training data plus the untouched test data
plt.title(filename.split('.')[0])

plt.subplot(212)
pred.plot(color='blue', label='Predict')       # predicted values
test.plot(color='red', label='Original')       # real values
md.low_conf.plot(color='grey', label='low')    # lower confidence bound
md.high_conf.plot(color='grey', label='high')  # upper confidence bound

plt.legend(loc='best')
plt.title('RMSE: %.4f' % np.sqrt(sum((pred.values - test.values) ** 2) / test.size))
plt.tight_layout()
plt.show()


Prediction results.png

As you can see, an RMSE of 462.8 against original data on the order of several thousand is acceptable. The two abrupt spikes in the test data also fell outside the confidence interval and were correctly reported.
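The alarm rule itself is just a comparison against the interval; a minimal sketch of how the flagged minutes could be counted, reusing test, md.low_conf and md.high_conf from the evaluation snippet above:

# flag every minute whose real value falls outside the predicted interval
alarms = test[(test.values < md.low_conf.values) | (test.values > md.high_conf.values)]
print('minutes flagged as abnormal:', len(alarms))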

5. Conclusion

As mentioned above, different APIs differ greatly in shape, and this article shows only one of them. In this project I also ran into other shapes: some have an obvious upward or downward trend, some start flat and then grow, and so on. But they are all typical periodic time series, and the core idea stays the same: decompose properly, restore the forecast by adding the components back, and set a confidence interval. The concrete steps can be adjusted to fit the specific business logic.