Author | David Woroniuk | Compiled by | VK | Source | Towards Data Science

What is an anomaly?

Anomalies, often referred to as outliers, are data points, data sequences, or patterns in a dataset that do not conform to the overall behaviour of the data series. Anomaly detection is therefore the task of detecting data points or sequences that do not conform to the patterns present in the broader data.

Effective detection and removal of anomalous data is useful for many business functions, such as detecting broken links embedded in web sites, spikes in Internet traffic, or sharp changes in stock prices. Flagging these phenomena as outliers, or devising pre-planned responses, can save businesses time and money.

Types of anomalies

In general, anomalous data can be divided into three categories: additive outliers, temporal-change outliers, and level-shift outliers.

Additive outliers are characterized by sudden large increases or decreases in value, which may be driven by exogenous or endogenous factors. Examples of additive outliers might be a large increase in website traffic due to a television appearance (exogenous factor) or a short-term increase in stock trading volume due to strong quarterly results (endogenous factor).

Temporal-change outliers are characterized by a short sequence that does not fit the broader trend in the data. For example, if a web server crashes, web traffic will drop to zero for a series of data points until the server is restarted, at which point traffic returns to normal.

Level-shift outliers are a common phenomenon in commodity markets, where high demand for electricity is intrinsically linked to adverse weather conditions. We can therefore observe a "level shift" in electricity prices between summer and winter, driven by weather-driven changes in demand and changes in renewable energy generation.
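To make the distinction concrete, here is a minimal illustrative sketch (not from the original article) that generates a synthetic series containing each of the three anomaly types:

import numpy as np
import pandas as pd

# Hypothetical synthetic series illustrating the three anomaly types:
rng = np.random.default_rng(42)
series = pd.Series(rng.normal(loc=100, scale=2, size=365))

series.iloc[50] += 40        # additive outlier: a single sudden spike in value
series.iloc[120:127] = 0     # temporal change: a short sequence that breaks the trend (e.g. a server outage)
series.iloc[250:] += 25      # level shift: the series settles at a new baseline (e.g. seasonal demand)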

What is an autoencoder?

Autoencoders are neural networks designed to learn a low-dimensional representation of a given input. They typically consist of two parts: an encoder, which learns to map the input data to a low-dimensional representation, and a decoder, which learns to map the representation back to the input data.

Because of this architecture, the encoder network iteratively learns an efficient compression function that maps the data to a low-dimensional representation. After training, the decoder can successfully reconstruct the original input data, and the reconstruction error (the difference between the original input and the reconstructed output) is the objective function of the whole training process.
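As a minimal sketch of this encoder-decoder structure (a toy dense autoencoder on random data, not the recurrent model built later in this article):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy dense autoencoder: an 8-dimensional input compressed to a 2-dimensional representation.
autoencoder = keras.Sequential([
    layers.Input(shape=(8,)),
    layers.Dense(2, activation='relu', name='encoder'),    # encoder: input -> low-dimensional code
    layers.Dense(8, activation='linear', name='decoder'),  # decoder: code -> reconstructed input
])

# The reconstruction error (difference between input and reconstructed output) is the training objective:
autoencoder.compile(optimizer='adam', loss='mse')

x = np.random.normal(size=(256, 8)).astype('float32')
autoencoder.fit(x, x, epochs=5, batch_size=32, verbose=0)
reconstruction_error = np.mean((autoencoder.predict(x) - x) ** 2, axis=1)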

Implementation

Now that we know the underlying architecture of the autoencoder model, we can start implementing it.

The first step is to install the libraries, packages, and modules we will use:

# Data processing:
import numpy as np
import pandas as pd
from datetime import date, datetime

# RNN autoencoder:
from tensorflow import keras
from tensorflow.keras import layers

# Plotting:
!pip install chart-studio
import plotly.graph_objects as go

Secondly, we need some data to analyse. This article uses the Historic-Crypto package to obtain historical Bitcoin data from 6 June 2013 to the present. The code below also generates daily Bitcoin returns and intraday price volatility, then deletes any missing rows and returns the first 5 rows of the data frame.

# Import Historic Crypto:
!pip install Historic-Crypto
from Historic_Crypto import HistoricalData

# Get bitcoin data, calculate returns and intraday volatility:
dataset = HistoricalData(start_date = '2013-06-06',ticker = 'BTC').retrieve_data()
dataset['Returns'] = dataset['Close'].pct_change()
dataset['Volatility'] = np.abs(dataset['Close']- dataset['Open'])
dataset.dropna(axis = 0, how = 'any', inplace = True)
dataset.head()

Now that we have some data, we should visually inspect each series for potential outliers. The plot_dates_values function below can be called repeatedly to plot each series contained in the data frame.

def plot_dates_values(data_timestamps, data_plot):
  '''
  This function plots a data series against its timestamps, with a range slider and range-selector buttons.
  Arguments:
    data_timestamps: the timestamps associated with each data instance.
    data_plot: the data series to be plotted.
  Returns:
    fig: a plot of the series with a slider and buttons.
  '''

  fig = go.Figure()
  fig.add_trace(go.Scatter(x = data_timestamps, y = data_plot,
                           mode = 'lines',
                           name = data_plot.name,
                           connectgaps=True))
  fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1 Years", step="year", stepmode="backward"),
            dict(count=2, label="2 Years", step="year", stepmode="backward"),
            dict(count=3, label="3 Years", step="year", stepmode="backward"),
            dict(label="All", step="all")
        ]))) 
  
  fig.update_layout(
    title=data_plot.name,
    xaxis_title="Date",
    yaxis_title="",
    font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  return fig.show()

We can now call the above function repeatedly to generate plots of Bitcoin's trading volume, closing price, opening price, volatility, and returns.

plot_dates_values(dataset.index, dataset['Volume'])

It is worth noting that 2020 saw a number of spikes in trading volume; it may be useful to investigate whether these spikes are anomalous or representative of the broader series.

plot_dates_values(dataset.index, dataset['Close'])

![Bitcoin closing price](qiniu.aihubs.net/newplot (1).png)

Closing prices show a significant spike into 2018, followed by a decline towards technical support levels. However, an upward trend prevails across the data.

plot_dates_values(dataset.index, dataset['Open'])

![Bitcoin opening price](qiniu.aihubs.net/newplot (2).png)

The daily opening price is similar to the above closing price.

plot_dates_values(dataset.index, dataset['Volatility'])

![Bitcoin intraday volatility](qiniu.aihubs.net/newplot (3).png)

Spikes in both price and volatility are evident in 2018. We can therefore investigate whether the autoencoder model classifies these volatility peaks as anomalies.

plot_dates_values(dataset.index, dataset['Returns'])

![Bitcoin daily returns](qiniu.aihubs.net/newplot (4).png)

Due to the randomness of the returns series, we choose to detect outliers in Bitcoin's daily trading volume, characterised by the Volume column.

We can now begin preprocessing the data for the autoencoder model. The first step is to determine an appropriate split between training data and test data. The generate_train_test_split function outlined below splits the data into training and test sets by date: when called, it returns two data frames, the training data and the test data.

def generate_train_test_split(data, train_end, test_start):
  '''
  This function splits the dataset into training data and test data using date strings.
  The strings supplied as the 'train_end' and 'test_start' arguments must be consecutive days.
  Arguments:
    data: the dataset to be split into training and test data (Pandas DataFrame).
    train_end: the date on which the training data ends (str).
    test_start: the date on which the test data starts (str).
  Returns:
    training_data: data used in model training (Pandas DataFrame).
    testing_data: data used in model testing (Pandas DataFrame).
  '''
  if isinstance(train_end, str) is False:
    raise TypeError("train_end argument should be a string.")

  if isinstance(test_start, str) is False:
    raise TypeError("test_start argument should be a string.")

  train_end_datetime = datetime.strptime(train_end, '%Y-%m-%d')
  test_start_datetime = datetime.strptime(test_start, '%Y-%m-%d')
  if train_end_datetime >= test_start_datetime:
    raise ValueError("the train_end argument must occur prior to the test_start argument.")
  if abs((train_end_datetime - test_start_datetime).days) > 1:
    raise ValueError("the train_end argument and test_start argument should be separated by 1 day.")

  training_data = data[:train_end]
  testing_data = data[test_start:]

  print('Train Dataset Shape:',training_data.shape)
  print('Test Dataset Shape:',testing_data.shape)

  return training_data, testing_data


# We now call the above function to generate the training and test data:
training_data, testing_data = generate_train_test_split(dataset, '2018-12-31', '2019-01-01')

To improve the accuracy of the model, we can "standardize" or scale the data. The function below scales the training data frame generated above, saving the training mean and training standard deviation for later standardization of the test data.

Note: It is important to scale the training and test data using the same values (the training mean and standard deviation); otherwise the difference in scale will create interpretability problems and model inconsistencies.
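In practice this just means applying the training statistics to the test set, as in this two-line sketch (test_values here is a hypothetical numpy array of raw test observations):

# Scale the test data using the *training* mean and standard deviation,
# so that training and test values live on the same scale:
test_values_scaled = (test_values - training_mean) / training_std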

def normalise_training_values(data):
  '''
  This function normalizes the input values using the mean and standard deviation.
  Arguments:
    data: the DataFrame column to be standardized (Pandas Series).
  Returns:
    values: normalized data used for model training (numpy array).
    mean: the training set mean, used to standardize the test set (float).
    std: the training set standard deviation, used to standardize the test set (float).
  '''
  if isinstance(data, pd.Series) is False:
    raise TypeError("data argument should be a Pandas Series.")

  values = data.to_list()
  mean = np.mean(values)
  values -= mean
  std = np.std(values)
  values /= std
  print("*"*80)
  print("The length of the training data is: {}".format(len(values)))
  print("The mean of the training data is: {}".format(mean.round(2)))
  print("The standard deviation of the training data is {}".format(std.round(2)))
  print("*"*80)
  return values, mean, std


# Now call the above function:
training_values, training_mean, training_std = normalise_training_values(training_data['Volume'])

As described above, the normalise_training_values function returns a numpy array of standardized training data, training_values, and we have stored training_mean and training_std for later standardization of the test set.

We can now generate the sequences used to train the autoencoder model. We define a window size of 30, giving 3D training data of shape (2004, 30, 1):

# define the number of time steps for each sequence:
TIME_STEPS = 30

def generate_sequences(values, time_steps = TIME_STEPS):
  '''
  This function generates sequences of length 'TIME_STEPS' to be passed to the model.
  Arguments:
    values: the normalized values from which the sequences are generated (numpy array).
    time_steps: the length of each sequence (int).
  Returns:
    train_data: 3D data used for model training (numpy array).
  '''
  if isinstance(values, np.ndarray) is False:
    raise TypeError("values argument must be a numpy array.")
  if isinstance(time_steps, int) is False:
    raise TypeError("time_steps must be an integer object.")

  output = []

  for i in range(len(values) - time_steps):
    output.append(values[i : (i + time_steps)])
  train_data = np.expand_dims(output, axis =2)
  print("Training input data shape: {}".format(train_data.shape))

  return train_data
  
# Now call the above function to generate x_train:
x_train = generate_sequences(training_values)

Now that the training data is processed, we can define the autoencoder model and fit it to the training data. The define_model function uses the shape of the training data to define an appropriate model, returning both the autoencoder model and a summary of its architecture.

def define_model(x_train):
  '''
  This function uses the dimensions of x_train to generate an RNN autoencoder model.
  Arguments:
    x_train: 3D data used for model training (numpy array).
  Returns:
    model: the model architecture (Tensorflow object).
    model_summary: a summary of the model architecture.
  '''

  if isinstance(x_train, np.ndarray) is False:
    raise TypeError("The x_train argument should be a 3 dimensional numpy array.")

  num_steps = x_train.shape[1]
  num_features = x_train.shape[2]

  keras.backend.clear_session()
  
  model = keras.Sequential(
      [
       layers.Input(shape=(num_steps, num_features)),
       layers.Conv1D(filters=32, kernel_size = 15, padding = 'same', data_format= 'channels_last',
                     dilation_rate = 1, activation = 'linear'),
       layers.LSTM(units = 25, activation = 'tanh', name = 'LSTM_layer_1',return_sequences= False),
       layers.RepeatVector(num_steps),
       layers.LSTM(units = 25, activation = 'tanh', name = 'LSTM_layer_2', return_sequences= True),
       layers.Conv1D(filters = 32, kernel_size = 15, padding = 'same', data_format = 'channels_last',
                     dilation_rate = 1, activation = 'linear'),
       layers.TimeDistributed(layers.Dense(1, activation = 'linear'))
      ]
  )

  model.compile(optimizer=keras.optimizers.Adam(learning_rate = 0.001), loss = "mse")
  return model, model.summary()

The model_fit function then calls the define_model function internally and supplies the epochs, batch_size, and validation_split parameters to the model. This function is then called to begin the model training process.

def model_fit():
  '''
  This function calls the 'define_model()' function above and trains the model on the x_train data.
  Arguments:
    N/A.
  Returns:
    model: the trained model.
    history: a summary of how the model trained (training error, validation error).
  '''
  # call define_model above on x_train:
  model, summary = define_model(x_train)

  history = model.fit(
    x_train,
    x_train,
    epochs=400,
    batch_size=128,
    validation_split=0.1,
    callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", 
                                              patience=25, 
                                              mode="min", 
                                              restore_best_weights=True)])
  
  return model, history


# Call the above function to generate the model and the model's training history:
model, history = model_fit()

Once the model has been trained, the training and validation loss curves should be plotted to see whether the model suffers from bias (underfitting) or variance (overfitting). This can be observed by calling the plot_training_validation_loss function below.

def plot_training_validation_loss():
  '''
  This function plots the training and validation loss curves of the trained model, enabling visual diagnosis of underfitting or overfitting.
  Arguments:
    N/A.
  Returns:
    fig: a visual representation of the model's training and validation loss.
  '''
  training_validation_loss = pd.DataFrame.from_dict(history.history, orient='columns')

  fig = go.Figure()
  fig.add_trace(go.Scatter(x = training_validation_loss.index, y = training_validation_loss["loss"].round(6),
                           mode = 'lines',
                           name = 'Training Loss',
                           connectgaps=True))
  fig.add_trace(go.Scatter(x = training_validation_loss.index, y = training_validation_loss["val_loss"].round(6),
                           mode = 'lines',
                           name = 'Validation Loss',
                           connectgaps=True))
  
  fig.update_layout(
  title='Training and Validation Loss',
  xaxis_title="Epoch",
  yaxis_title="Loss",
  font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  return fig.show()


# Call the above function:
plot_training_validation_loss()

![Training and validation loss](qiniu.aihubs.net/newplot (5).png)

It is worth noting that the training and validation loss curves converge across the chart, with the validation loss remaining slightly larger than the training loss. Given the shape and relative magnitude of the errors, we can conclude that the autoencoder model neither underfits nor overfits.

Now we can define the reconstruction error, one of the core tenets of the autoencoder model. The reconstruction error is expressed as the training loss, and the reconstruction error threshold is the maximum training loss. Therefore, when calculating the test error, any value greater than the maximum training loss can be regarded as an outlier.
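Concretely, the rule applied by the two functions that follow amounts to the short sketch below (train_mae_loss and test_mae_loss are the arrays computed later in reconstruction_error and generate_testing_loss):

# Reconstruction error threshold = maximum training MAE loss;
# any test sample whose MAE loss exceeds this threshold is flagged as an outlier:
threshold = np.max(train_mae_loss)
anomalies = test_mae_loss >= threshold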

def reconstruction_error(x_train):
  '''
  This function computes the reconstruction error and displays a histogram of the training mean absolute error.
  Arguments:
    x_train: 3D data used for model training (numpy array).
  Returns:
    fig: a visualization of the training MAE distribution.
  '''

  if isinstance(x_train, np.ndarray) is False:
    raise TypeError("x_train argument should be a numpy array.")

  x_train_pred = model.predict(x_train)
  global train_mae_loss
  train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis = 1)
  histogram = train_mae_loss.flatten() 
  fig =go.Figure(data = [go.Histogram(x = histogram, 
                                      histnorm = 'probability',
                                      name = 'MAE Loss')])  
  fig.update_layout(
  title='Mean Absolute Error Loss',
  xaxis_title="Training MAE Loss (%)",
  yaxis_title="Number of Samples",
  font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  
  print("*"*80)
  print("Reconstruction error threshold: {} ".format(np.max(train_mae_loss).round(4)))
  print("*"*80)
  return fig.show()


# Call the above function:
reconstruction_error(x_train)

Above, we saved training_mean and training_std so that they can be used to scale the test data. We now define the normalise_testing_values function to scale the test data.

def normalise_testing_values(data, training_mean, training_std):
  '''
  This function normalizes the test data using the training mean and standard deviation, generating a numpy array of test values.
  Arguments:
    data: the data used (Pandas DataFrame column).
    training_mean: the mean of the training set (float).
    training_std: the standard deviation of the training set (float).
  Returns:
    values: the normalized test data (numpy array).
  '''
  if isinstance(data, pd.Series) is False:
    raise TypeError("data argument should be a Pandas Series.")

  values = data.to_list()
  values -= training_mean
  values /= training_std
  print("*"*80)
  print("The length of the testing data is: {}".format(data.shape[0]))
  print("The mean of the testing data is: {}".format(data.mean()))
  print("The standard deviation of the testing data is {}".format(data.std()))
  print("*"*80)

  return values

This function is then called on the Volume column of testing_data, so test_value is returned as a numpy array.

# Call the above function:
test_value = normalise_testing_values(testing_data['Volume'], training_mean, training_std) 

On this basis, we define the generate_testing_loss function, which calculates the difference between the reconstructed data and the test data. Any value greater than the maximum training loss is stored in a global anomalies list.

def generate_testing_loss(test_value):
  '''
  This function uses the model to predict anomalies in the test set. In addition, it generates
  an 'anomalies' global variable containing the outliers identified by the RNN.
  Arguments:
    test_value: the test data (numpy array).
  Returns:
    fig: a visualization of the testing MAE distribution.
  '''
  x_test = generate_sequences(test_value)
  print("*"*80)
  print("Test input shape: {}".format(x_test.shape))

  x_test_pred = model.predict(x_test)
  test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis = 1)
  test_mae_loss = test_mae_loss.reshape((-1))

  global anomalies
  anomalies = (test_mae_loss >= np.max(train_mae_loss)).tolist()
  print("Number of anomaly samples: ", np.sum(anomalies))
  print("Indices of anomaly samples: ", np.where(anomalies))
  print("*"*80)

  histogram = test_mae_loss.flatten() 
  fig =go.Figure(data = [go.Histogram(x = histogram, 
                                      histnorm = 'probability',
                                      name = 'MAE Loss')])  
  fig.update_layout(
  title='Mean Absolute Error Loss',
  xaxis_title="Testing MAE Loss (%)",
  yaxis_title="Number of Samples",
  font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  
  return fig.show()


# Call the above function:
generate_testing_loss(test_value)

In addition, the distribution of the testing MAE loss is shown below, which can be compared directly with the training MAE loss distribution.

![Testing MAE loss distribution](qiniu.aihubs.net/newplot (6).png)

Finally, outliers are visually represented below.

def plot_outliers(data):
  '''
  This function determines the location of the outliers in the time series, which are then plotted.
  Arguments:
    data: the initial dataset (Pandas DataFrame).
  Returns:
    fig: a visual representation of the outliers present in the series, as determined by the RNN.
  '''

  outliers = []

  for data_idx in range(TIME_STEPS - 1, len(test_value) - TIME_STEPS + 1):
    time_series = range(data_idx - TIME_STEPS + 1, data_idx)
    if all([anomalies[j] for j in time_series]):
      outliers.append(data_idx + len(training_data))

  outlying_data = data.iloc[outliers, :]

  cond = data.index.isin(outlying_data.index)
  no_outliers = data.drop(data[cond].index)

  fig = go.Figure()
  fig.add_trace(go.Scatter(x = no_outliers.index, y = no_outliers["Volume"],
                           mode = 'markers',
                           name = no_outliers["Volume"].name,
                           connectgaps=False))
  fig.add_trace(go.Scatter(x = outlying_data.index, y = outlying_data["Volume"],
                           mode = 'markers',
                           name = outlying_data["Volume"].name + ' Outliers',
                           connectgaps=False))
  
  fig.update_xaxes(rangeslider_visible=True)

  fig.update_layout(
  title='Detected Outliers',
  xaxis_title=data.index.name,
  yaxis_title=no_outliers["Volume"].name,
  font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  
  
  return fig.show()


# Call the above function:
plot_outliers(dataset)

Outlying data identified by the autoencoder model is shown in orange, while consistent data is shown in blue.

We can see that a large portion of bitcoin transaction volume data in 2020 is considered abnormal — could it be due to increased retail transaction activity driven by COVID-19?

Experiment with the autoencoder parameters and with new datasets to see if you can find any anomalies in Bitcoin's closing price, or download different cryptocurrencies using the Historic-Crypto library!
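For instance, a hedged sketch of how the same pipeline could be re-run on the closing price (this is not in the original article; it assumes the functions defined above are still in scope, and note that model_fit() reads the global x_train and that plot_outliers() plots the 'Volume' column specifically):

# Re-run the full pipeline on the 'Close' column instead of 'Volume':
training_data, testing_data = generate_train_test_split(dataset, '2018-12-31', '2019-01-01')
training_values, training_mean, training_std = normalise_training_values(training_data['Close'])
x_train = generate_sequences(training_values)
model, history = model_fit()
test_value = normalise_testing_values(testing_data['Close'], training_mean, training_std)
generate_testing_loss(test_value)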

The original link: towardsdatascience.com/outlier-det…
