• Time Series of Price Anomaly Detection
  • By Susan Li
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: kasheemlew
  • Proofreader: Xionglong58, Portandbridge

Anomaly detection refers to detecting data points that do not conform to the statistical patterns of the rest of the data set.

Anomaly detection, also known as outlier detection, is the data-mining process of determining which observations are anomalous and the details of how they occur. Automated anomaly detection is critical today because data volumes are far too large for manual labeling. It has a wide range of applications, such as fraud prevention, system monitoring, error detection, and event detection in sensor networks.

But here, for selfish reasons, I am going to run anomaly detection on hotel room rates.

I don’t know whether you ever travel to the same place regularly and stay at the same hotel each time. Normally, room rates don’t fluctuate much. But occasionally, even the same room in the same hotel becomes prohibitively expensive, and because of travel-allowance limits you may have to pick a different hotel. After being burned a few times, I started thinking about building a model to detect such price anomalies automatically.

Of course, some anomalies you may encounter only once in a lifetime, and we know in advance they won’t recur at the same time in the next few years. Take, for example, the staggering room rates in Atlanta from February 2 to February 4, 2019.

In this article, I will try different anomaly detection techniques, applying unsupervised learning to detect anomalies in a time series of hotel room rates. Let’s get started!

Data

Getting the data was a difficult process, and what I obtained is imperfect.

The data we are using is a subset of Expedia’s personalized hotel search data. Click here to get the data set.

We’ll take a subset of train.csv:

  • Select the hotel with the most data points: prop_id = 104517.
  • Select visitor_location_country_id = 219 (a separate analysis showed that country 219 is the United States) so that the price_usd column is consistent. Countries differ in how they display taxes and room rates — a price can be per night or for the whole stay — but we know that US hotels display the nightly rate, excluding taxes.
  • Select srch_room_count = 1.
  • Select the other features we need: date_time, price_usd, srch_booking_window, and srch_saturday_night_bool.
import pandas as pd

expedia = pd.read_csv('expedia_train.csv')
df = expedia.loc[expedia['prop_id'] == 104517]
df = df.loc[df['srch_room_count'] == 1]
df = df.loc[df['visitor_location_country_id'] == 219]
df = df[['date_time', 'price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]

After this filtering is complete, we have the data we want to use:

df.info()

df['price_usd'].describe()

Right away we see a serious anomaly: the maximum price_usd value is 5584.

If a single observation is anomalous relative to the rest of the data, we call it a point anomaly (for example, a single very large transaction). We could check the logs to see what happened. After some investigation, I guessed it was either a data error, or a user who accidentally searched for a presidential suite without booking or even viewing it. To focus on finding subtler anomalies, I decided to remove this record.

expedia.loc[(expedia['price_usd'] == 5584) & (expedia['visitor_location_country_id'] == 219)]

df = df.loc[df['price_usd'] < 5584]

One caveat: we don’t know what type of room each user searched for, and the price of a standard room is very different from that of a sea-view suite with a king bed. We have no way to verify this, so keep it in mind. All right, time to move on.

Time series visualization

import matplotlib.pyplot as plt

df.plot(x='date_time', y='price_usd', figsize=(12, 6))
plt.xlabel('Date time')
plt.ylabel('Price in USD')
plt.title('Time Series of room price by date time of search');

a = df.loc[df['srch_saturday_night_bool'] == 0, 'price_usd']
b = df.loc[df['srch_saturday_night_bool'] == 1, 'price_usd']
plt.figure(figsize=(10, 6))
plt.hist(a, bins=50, alpha=0.5, label='Search Non-Sat Night')
plt.hist(b, bins=50, alpha=0.5, label='Search Sat Night')
plt.legend(loc='upper right')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show();

In general, non-Saturday-night rates were more stable and lower, while Saturday nights were noticeably more expensive. This hotel seems to be popular on weekends.

Anomaly detection based on clustering

K-means algorithm

K-means is a widely used clustering algorithm. It partitions the data into k clusters of similar points, and data points that fit poorly into any cluster can be flagged as anomalies. Before running k-means, we use the elbow method to determine the optimal number of clusters.
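The article’s elbow plot is not reproduced here, but the underlying computation can be sketched as follows. This is a minimal sketch on synthetic two-dimensional data — the cluster centers, sample size, and range of k are made up for illustration; the real features would come from the Expedia subset above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the article's features: two tight blobs.
rng = np.random.default_rng(42)
data = rng.normal(loc=[[100, 10], [300, 30]], size=(200, 2, 2)).reshape(-1, 2)

# Fit k-means for a range of k and record the inertia (within-cluster
# sum of squared distances). The "elbow" is where the curve flattens,
# i.e. where adding clusters stops explaining much extra variance.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data)
    inertias.append(km.inertia_)
```

Plotting `range(1, 11)` against `inertias` gives the elbow curve; the article reads the flattening point off that plot.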

From the elbow curve above, we see that the graph levels off after 10 clusters; that is, adding more clusters beyond that explains little additional variance in the variable of interest, which here is price_usd.

We set n_clusters=10 and use the k-means output to draw a 3D plot of the clusters.

Next, we need to figure out how many principal components to keep.

As we can see, the first component explains almost 50% of the variance and the second explains over 30%. Note that no component is entirely negligible. Still, the first two components carry more than 80% of the information, so we set n_components=2.
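That explained-variance check can be sketched like this. The random data below is a purely illustrative stand-in for the standardized price_usd, srch_booking_window, and srch_saturday_night_bool columns, and the 80% cutoff is an assumption matching the reasoning above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative stand-ins for the three features used in the article.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(150, 40, 500),    # price-like column
    rng.integers(0, 90, 500),    # booking-window-like column
    rng.integers(0, 2, 500),     # Saturday-night flag
]).astype(float)

# Standardize, then inspect how much variance each component explains.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_std)
ratios = pca.explained_variance_ratio_

# Keep enough components to cover most of the variance
# (the article keeps 2 because the first two exceed 80%).
n_keep = int(np.searchsorted(np.cumsum(ratios), 0.80) + 1)
```

On the real data, `ratios` is what the article reads the ~50% / ~30% figures from.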

The assumption behind cluster-based anomaly detection is that, once we cluster the data, normal points belong to clusters while anomalies either belong to no cluster or to very small ones. Let’s find the anomalies and visualize them.

  • Compute the distance between each point and its nearest cluster center. The largest distances are treated as anomalies.
  • We use outliers_fraction to tell the algorithm what proportion of the data set is anomalous. This varies between data sets, but as a starting point I guess outliers_fraction = 0.01, the proportion of observations in a standard normal distribution whose Z-score exceeds 3 in absolute value.
  • Use outliers_fraction to compute number_of_outliers.
  • Set threshold to the smallest distance among those outliers.
  • Store the result of the above in anomaly1 (0: normal, 1: anomalous).
  • Visualize the anomalies in the cluster view.
  • Visualize the anomalies in the time-series view.

The results show that the anomalous room rates detected by k-means clustering are either very high or very low.

Anomaly detection with Isolation Forest

Isolation Forest detects anomalies purely on the basis that anomalies are few and have different values. It can isolate anomalies without measuring any distance or density, which makes it completely different from clustering-based or distance-based algorithms.

  • We use an IsolationForest model and set contamination = outliers_fraction, telling the model that the proportion of anomalies in the data set is 0.01.
  • Fit the model, then use predict(data) to classify the data set, returning 1 for normal values and -1 for anomalies.
  • Finally, we visualize the anomalies in the time-series view.
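Those steps can be sketched as follows, again on hypothetical one-column price data rather than the article’s DataFrame:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical prices: mostly around $150, plus a few extreme values.
rng = np.random.default_rng(7)
prices = np.concatenate([rng.normal(150, 20, 500), [900, 950, 5]]).reshape(-1, 1)

outliers_fraction = 0.01
# contamination tells the model what share of the data to flag as anomalous.
model = IsolationForest(contamination=outliers_fraction, random_state=42)
model.fit(prices)

labels = model.predict(prices)          # 1 = normal, -1 = anomalous
anomalous_prices = prices[labels == -1]
```

On the real data, the rows with `labels == -1` are the ones plotted as anomalies in the time-series view.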

Anomaly Detection Based on Support Vector Machine (SVM)

SVMs are usually associated with supervised learning, but OneClassSVM treats anomaly detection as an unsupervised problem: it learns a decision function that classifies new data as similar to or different from the training set.

OneClassSVM

According to the paper Support Vector Method for Novelty Detection, SVM is a maximum-margin method, i.e. it does not model a probability distribution. The core of SVM-based anomaly detection is to find a function that is positive in regions of high point density and negative in regions of low point density.

  • When fitting the OneClassSVM model, we set nu=outliers_fraction. This is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, and it must be between 0 and 1; essentially, it is the proportion of outliers we expect in the data.
  • Specify the kernel type for the algorithm: rbf. The SVM then uses a nonlinear function to map the feature space into a higher dimension.
  • gamma is a parameter of the RBF kernel that controls the influence of individual training samples — it affects the “smoothness” of the model. Through experimentation, I found no significant difference.
  • predict(data) classifies the data. Because ours is a one-class model, it returns only +1 or -1: -1 for anomalies, 1 for normal points.
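A minimal sketch of this procedure, on made-up one-column price data (the standardization step and the specific gamma value are my assumptions, not the article’s):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Hypothetical prices with two extreme values at the end.
rng = np.random.default_rng(3)
prices = np.concatenate([rng.normal(150, 20, 500), [900, 5]]).reshape(-1, 1)

outliers_fraction = 0.01
# Scaling matters for the RBF kernel, so standardize first.
X = StandardScaler().fit_transform(prices)

# nu ~ expected fraction of outliers; rbf kernel as described above.
model = OneClassSVM(nu=outliers_fraction, kernel='rbf', gamma=0.01)
model.fit(X)

labels = model.predict(X)               # 1 = normal, -1 = anomalous
scores = model.decision_function(X)     # negative in low-density regions
```

Anomalous observations sit where `scores` is most negative, matching the density intuition in the paragraph above.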

Anomaly detection using the Gaussian distribution

The Gaussian distribution is also called the normal distribution. We will develop an anomaly detection algorithm based on the Gaussian distribution; in other words, we assume the data follows a normal distribution. This assumption does not hold for every data set, but when it does, outliers can be determined efficiently.

Scikit-learn’s [**covariance.EllipticEnvelope**](https://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html) is a function that tries to compute the key parameters of the data’s overall distribution, assuming that the bulk of our data is drawn from a multivariate Gaussian distribution. The process goes roughly like this:

  • Create two different data sets based on the categories defined earlier: search_sat_night and search_non_sat_night.
  • Apply EllipticEnvelope (Gaussian distribution) to each category.
  • Set the contamination parameter, the proportion of outliers in the data set.
  • Use decision_function to compute the decision function of the given observations; it is equivalent to a shifted Mahalanobis distance. For compatibility with the other outlier-detection algorithms, the threshold for being an outlier is 0.
  • predict(X_train) uses the fitted model to predict labels for X_train (1 means normal, -1 means anomalous).
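A sketch of this per-category fit, using made-up prices for the two groups (the group means, sample sizes, and planted outliers are all invented for illustration):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Hypothetical Saturday-night and non-Saturday-night price groups,
# each with one planted extreme value at the end.
rng = np.random.default_rng(5)
sat_night = np.concatenate([rng.normal(180, 25, 300), [900]]).reshape(-1, 1)
non_sat_night = np.concatenate([rng.normal(140, 20, 300), [800]]).reshape(-1, 1)

outliers_fraction = 0.01
results = {}
for name, X in [('sat', sat_night), ('non_sat', non_sat_night)]:
    # contamination = expected share of outliers in this category.
    env = EllipticEnvelope(contamination=outliers_fraction, random_state=42)
    env.fit(X)
    results[name] = env.predict(X)      # 1 = normal, -1 = anomalous
```

Fitting each category separately, as the article does, keeps the weekend premium from being mistaken for an anomaly.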

Interestingly, this method detects only unusually high prices, not unusually low ones.

So far, we have detected price anomalies in four different ways. Because we used unsupervised learning, once the models are built we have no ground truth to compare against, so we cannot know how well they perform. For that reason, the results of these methods should be validated in practice before being used on critical problems.

Jupyter Notebook has been uploaded to Github. Enjoy this week!

References:

  • Introduction to Anomaly Detection
  • sklearn.ensemble.IsolationForest — scikit-learn 0.20.2 documentation
  • sklearn.svm.OneClassSVM — scikit-learn 0.20.2 documentation
  • sklearn.covariance.EllipticEnvelope — scikit-learn 0.20.2 documentation
  • Unsupervised Anomaly Detection | Kaggle
