Produced by: The Cabin, by Peter. Edited by: Peter

Ng Machine Learning -10- Anomaly Detection

“There is white in black and there is black in white. There is no absolute white and there is no absolute black. Black can set off white and white can reflect black. Everything is convertible.”

This week mainly covers anomaly detection in machine learning, including:

  1. Problem motivation
  2. Gaussian distribution
  3. Algorithm usage scenarios
  4. Eight unsupervised anomaly detection techniques
  5. Anomaly detection versus supervised learning
  6. Feature selection

Anomaly Detection and Novelty Detection

An anomaly is a significant deviation from other observed data, so much so that it is suspected that it does not belong to the same data distribution as the normal point.

Anomaly detection, also known as outlier detection, is a technique for identifying patterns that do not conform to expected behavior.

It also has many business applications, such as network intrusion detection (identifying patterns in network traffic that could indicate an attack), system health monitoring, credit card fraud detection, equipment failure detection, risk identification, etc.

Problem motivation

Anomaly detection is mainly used as an unsupervised learning algorithm. The problem is introduced through an aircraft-engine inspection example.

An aircraft-engine manufacturer produces batches of engines and tests some of their characteristic variables, such as the heat generated when an engine runs, or the engine's vibration.

Suppose there are m engines and the data is as follows:


$$x^{(1)}, x^{(2)}, \ldots, x^{(m)}$$

Plotting the data gives the following chart:

For a given data set, we need to test whether $x_{test}$ is anomalous, that is, estimate the probability that the test example belongs to the same distribution as the data set.

As the figure above shows, points inside the blue circle have a high probability of belonging to the group; the farther out a point lies, the lower that probability becomes.


$$\text{if} \quad p(x) \begin{cases} < \varepsilon & \text{anomaly} \\ \geq \varepsilon & \text{normal} \end{cases}$$

Two other examples of anomaly detection:

  • Identifying cheating behavior: build a model from how often users log in, the pages they visit, and the number of posts they make, then use the model to flag users who do not fit it.
  • Monitoring data center usage: memory usage, number of disk accesses, and CPU load.

Gaussian distribution

The Gaussian distribution is also called the normal distribution. The distribution satisfies:


$$x \sim N\left(\mu, \sigma^{2}\right)$$

The probability density function is:


$$p\left(x ; \mu, \sigma^{2}\right)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)$$

The mean $\mu$ is:


$$\mu=\frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$

The variance $\sigma^2$ is:


$$\sigma^{2}=\frac{1}{m} \sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^{2}$$
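The two estimates above can be sketched in a few lines of Python (a minimal sketch; the function names are illustrative, not from the course):

```python
import math

def estimate_gaussian(xs):
    """Maximum-likelihood estimates of mu and sigma^2
    (note the 1/m factor, not 1/(m-1))."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density p(x; mu, sigma^2) of the Gaussian distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / (math.sqrt(2 * math.pi) * math.sqrt(sigma2))

mu, sigma2 = estimate_gaussian([1.0, 2.0, 3.0])
```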

An example of the Gaussian distribution: with the same mean $\mu$,

  • the larger $\sigma^2$ is, the shorter and wider the curve
  • the smaller $\sigma^2$ is, the taller and narrower the curve

Usage scenarios

There are three scenarios for anomaly detection algorithms:

  1. In feature engineering, filter out abnormal data so that it does not distort the normalization results
  2. For feature data without labeled output, filter the data to find anomalies
  3. For binary classification of labeled feature data with severe class imbalance (very few training samples in some classes), an unsupervised anomaly detection algorithm can also be considered

Algorithm

The specific steps of the algorithm are:

  1. For a given data set:


    $$x^{(1)}, x^{(2)}, \ldots, x^{(m)}$$

    calculate an estimate of $\mu$ and $\sigma^2$ for each feature.

  2. The estimated values of the two parameters are:


    $$\mu_j=\frac{1}{m}\sum\limits_{i=1}^{m}x_j^{(i)}$$


    $$\sigma_j^2=\frac{1}{m}\sum\limits_{i=1}^m(x_j^{(i)}-\mu_j)^2$$

  3. Compute $p(x)$:


$$p(x)=\prod^n_{j=1}p(x_j; \mu_j, \sigma^2_j)=\prod^n_{j=1}\frac{1}{\sqrt{2 \pi} \sigma_j} \exp \left(-\frac{(x_j-\mu_j)^{2}}{2 \sigma^{2}_j}\right)$$

The figure below shows a training set with two features, each modeled as an independent Gaussian.

The three-dimensional plot represents the density function; the $z$ axis is the value of $p(x)$ estimated from the two feature values.

When $p(x) > \varepsilon$, the example is predicted to be normal; otherwise it is anomalous.
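The steps above can be put together in a minimal Python sketch (the helper names `fit`, `p`, and `is_anomaly` are my own, not from the course):

```python
import math

def fit(X):
    """Estimate (mu_j, sigma_j^2) for every feature column of X,
    where X is a list of feature vectors."""
    m, n = len(X), len(X[0])
    params = []
    for j in range(n):
        col = [x[j] for x in X]
        mu = sum(col) / m
        sigma2 = sum((v - mu) ** 2 for v in col) / m
        params.append((mu, sigma2))
    return params

def p(x, params):
    """p(x) = product over features of the univariate Gaussian densities."""
    prob = 1.0
    for x_j, (mu, sigma2) in zip(x, params):
        prob *= math.exp(-(x_j - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return prob

def is_anomaly(x, params, eps):
    """Flag x as an anomaly when p(x) < epsilon."""
    return p(x, params) < eps

# Five normal engines clustered around (1.0, 10.0)
X_train = [[1.0, 10.0], [1.2, 10.5], [0.8, 9.5], [1.1, 10.2], [0.9, 9.8]]
params = fit(X_train)
```

An engine far from the cluster, such as `[5.0, 0.0]`, gets a vanishingly small `p(x)` and is flagged.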

Designing an anomaly detection system

When we develop an anomaly detection system, we start with labeled (anomalous or normal) data:

  • A portion of the normal data is selected to form the training set
  • The remaining normal data is then mixed with the anomalous data to form the cross-validation set and the test set
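One common use of the labeled cross-validation set, implied but not spelled out above, is to choose the threshold $\varepsilon$ by maximizing the F1 score. A sketch under that assumption (the function name is illustrative):

```python
def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds and keep the one with the best F1 score.
    p_cv: p(x) for each cross-validation example.
    y_cv: labels, 1 = anomaly, 0 = normal."""
    best_eps, best_f1 = 0.0, -1.0
    lo, hi = min(p_cv), max(p_cv)
    for step in range(1, 1000):
        eps = lo + step * (hi - lo) / 1000
        preds = [1 if prob < eps else 0 for prob in p_cv]
        tp = sum(1 for pred, y in zip(preds, y_cv) if pred == 1 and y == 1)
        fp = sum(1 for pred, y in zip(preds, y_cv) if pred == 1 and y == 0)
        fn = sum(1 for pred, y in zip(preds, y_cv) if pred == 0 and y == 1)
        if tp == 0:
            continue  # no true positives: F1 is 0, skip
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

F1 is preferred over accuracy here because the classes are heavily imbalanced: predicting "normal" for everything would already score a high accuracy.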

Eight unsupervised anomaly detection techniques

  1. Statistics-based anomaly detection
    1. Moving average (MA) method
    2. 3-sigma rule (Pauta criterion)
  2. Density-based anomaly detection
  3. Clustering-based anomaly detection
  4. Anomaly detection based on K-means clustering
  5. Anomaly detection based on One-Class SVM
  6. Anomaly detection based on Isolation Forest
  7. Anomaly detection based on PCA + MD
  8. AutoEncoder-based anomaly detection
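As one concrete instance of the statistical techniques in the list, the 3-sigma rule can be sketched as follows (a minimal sketch; the function name is my own):

```python
def three_sigma_outliers(xs):
    """3-sigma rule: flag values more than three standard
    deviations away from the mean."""
    m = len(xs)
    mu = sum(xs) / m
    sigma = (sum((x - mu) ** 2 for x in xs) / m) ** 0.5
    return [x for x in xs if abs(x - mu) > 3 * sigma]
```

For a Gaussian feature, roughly 99.7% of values fall within three standard deviations of the mean, so anything outside that band is treated as anomalous.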

Anomaly detection versus supervised learning

Labeled data is also used in anomaly detection, similar to supervised learning. The two compare as follows:

Anomaly detection:

  • Very few positive examples (anomalies, $y=1$) and a large number of negative examples ($y=0$)
  • Many different types of anomalies, which makes it very hard to train an algorithm on such a small amount of positive data
  • Future anomalies may look very different from the known ones
  • Examples: fraud detection, manufacturing (such as aircraft engines), monitoring the health of machines in a data center

Supervised learning:

  • Large numbers of both positive and negative examples
  • Enough positive examples to train the algorithm, and future positive examples are likely to be similar to those in the training set
  • Examples: spam filtering, weather forecasting, tumor classification

When the number of positive examples is small (sometimes even zero), that is, when there are many types of anomalies that have not yet been seen, an anomaly detection algorithm is usually the right choice.

Feature selection

The anomaly detection algorithm is based on the Gaussian distribution. Features that do not follow a Gaussian distribution can still be used, but it is best to transform them toward one first. Error analysis is an important part of feature selection.

Some anomalous examples may still receive a high value of $p(x)$ and be treated as normal by the algorithm. Through error analysis, we can add new features and obtain a better algorithm that detects these anomalies.

Obtaining new features: new features can be derived by combining the original features.
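Both ideas can be sketched briefly. The log transform below is one common way to make a right-skewed feature more Gaussian, and the CPU-load/network-traffic ratio is an illustrative combined feature for the data center example (the names are assumptions, not from the course):

```python
import math

def transform(x):
    """Log transform to make a right-skewed feature more Gaussian.
    Alternatives include x ** 0.5 or x ** (1 / 3)."""
    return math.log(x + 1)  # the +1 keeps x = 0 valid

def combined_feature(cpu_load, network_traffic):
    """New feature from combining raw ones: a machine with high CPU
    load but little network traffic gets a large, suspicious value."""
    return cpu_load / network_traffic
```

A stuck process that burns CPU without serving requests may look normal on each raw feature yet stand out on the ratio, which is exactly the kind of anomaly error analysis is meant to surface.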

References

  1. Li Hang, Statistical Learning Methods
  2. Eight unsupervised anomaly detection techniques