Original link: tecdat.cn/?p=22160

Original source: Tecdat (拓端数据部落) WeChat official account

 

This example shows how to use a quantile random forest to detect outliers. A quantile random forest can detect outliers with respect to the conditional distribution of Y given X.

Outliers are observations that lie far enough from most other observations in a data set to be considered anomalous. Causes of outlying observations include inherent variability and measurement error. Outliers can significantly affect estimation and inference, so it is important to detect them and then decide whether to remove them or to use a robust analysis.

To demonstrate outlier detection, this example:

1. Generates data from a nonlinear model with heteroscedastic noise and simulates a few outliers.
2. Grows a quantile random forest of regression trees.
3. Estimates the conditional quartiles (Q1, Q2, and Q3) and the conditional interquartile range (IQR) over the range of the predictor variable.
4. Compares the observations with the fences F1 = Q1 - 1.5*IQR and F2 = Q3 + 1.5*IQR. Any observation less than F1 or greater than F2 is an outlier.
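The fence rule depends on the predictor value, which a small toy calculation can make concrete. The sketch below uses made-up quartile values (hypothetical numbers, not output of the model in this example) to show that the same response value can be an outlier at one predictor value and unremarkable at another.

```matlab
% Hypothetical conditional quartiles at three predictor values (made-up numbers)
q1 = [2; 10; 20];
q3 = [6; 18; 40];
y  = [13; 13; 13];        % the same response value observed at each predictor value

k    = 1.5;
iqrC = q3 - q1;           % conditional IQR: [4; 8; 20]
f1   = q1 - k*iqrC;       % lower fences: [-4; -2; -10]
f2   = q3 + k*iqrC;       % upper fences: [12; 30; 70]

flagged = y < f1 | y > f2;   % [true; false; false]
disp(flagged')
```

Only the first point is flagged, because 13 exceeds the upper fence of 12 there while lying well inside the fences at the other two predictor values.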

Generate the data

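Generate 500 observations from a model in which t is drawn uniformly from [0, 4π] and the noise term ε_t is distributed as N(0, t + 0.01), so the noise variance grows with t. Store the data in a table.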

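n = 500;
rng('default'); % For reproducibility
t = randsample(linspace(0,4*pi,1e6),n,true)';
epsilon = randn(n,1).*sqrt(t + 0.01);
% The response y is computed from t and epsilon according to the model
Tbl = table(t,y);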
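The statement that the noise variance grows with t can be checked empirically. The following is a minimal sketch that draws only the noise term; it assumes a much larger sample than the example's 500 (so the estimates are stable) and samples t with rand rather than randsample, which is a simplification for illustration.

```matlab
rng('default');                          % for reproducibility
n = 1e5;                                 % assumption: larger than the example's 500
t = 4*pi*rand(n,1);                      % uniform on [0, 4*pi]
epsilon = randn(n,1).*sqrt(t + 0.01);    % standard deviation sqrt(t + 0.01)

lowT  = t < 1;                           % small t: variance at most about 1
highT = t > 4*pi - 1;                    % large t: variance around 12
fprintf('std for t < 1:        %.2f\n', std(epsilon(lowT)));
fprintf('std for t > 4*pi - 1: %.2f\n', std(epsilon(highT)));
```

The spread at the right end of the range should come out several times larger than at the left end, which is why a single global threshold on y would be a poor outlier rule for data like this.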

Shift five randomly chosen observations up or down (direction chosen at random) by 90% of their value.

numOut = 5;
idx = randsample(height(Tbl),numOut); % Rows to turn into outliers
Tbl.y(idx) = Tbl.y(idx) + randsample([-1 1],numOut,true)'.*(0.9*Tbl.y(idx));

Draw a scatter plot of the data and mark the simulated outliers.

figure;
plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
title('Scatter plot of data');
legend('Data','Simulated outliers','Location','NorthWest');

Grow the quantile random forest

Grow a bag of 200 regression trees.

Mdl = TreeBagger(200,Tbl,'y','Method','regression');

Mdl is a TreeBagger ensemble.

Predict conditional quartiles and interquartile ranges

Use quantile regression to estimate the conditional quartiles at 50 equally spaced values within the range of t.

tau = [0.25 0.5 0.75];
pred = linspace(0,4*pi,50)';
quartiles = quantilePredict(Mdl,pred,'Quantile',tau);

quartiles is a 50-by-3 matrix of conditional quartiles. Rows correspond to the 50 values in pred, and columns correspond to the probabilities 0.25, 0.5, and 0.75. Plot the conditional mean and median responses on the scatter plot of the data.

meanY = predict(Mdl,pred);
plot(pred,[quartiles(:,2) meanY]);
legend('Data','Simulated outliers','Median response','Mean response');

Although the conditional mean and median curves are close, the simulated outliers pull on the mean curve. Compute the conditional IQR, F1, and F2.


iqr = quartiles(:,3) - quartiles(:,1);
k = 1.5;
f1 = quartiles(:,1) - k*iqr;
f2 = quartiles(:,3) + k*iqr;

With k = 1.5, every observation less than f1 or greater than f2 is considered an outlier, but this threshold does not separate ordinary outliers from extreme ones. Setting k = 3 identifies the extreme outliers.
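The two choices of k can be combined to sort flagged points into mild and extreme outliers. The sketch below uses hypothetical quartiles and observations at a single predictor value (made-up numbers for illustration, not values from this example).

```matlab
q1 = 10; q3 = 14;                          % hypothetical conditional quartiles
iqrC = q3 - q1;                            % 4
y = [12 21 27 3 -3];                       % hypothetical observations

inner = [q1 - 1.5*iqrC, q3 + 1.5*iqrC];    % fences for k = 1.5: [4 20]
outer = [q1 - 3*iqrC,   q3 + 3*iqrC];      % fences for k = 3:   [-2 26]

flagged = y < inner(1) | y > inner(2);     % [0 1 1 1 1]
extreme = y < outer(1) | y > outer(2);     % [0 0 1 0 1]
mild    = flagged & ~extreme;              % [0 1 0 1 0]
disp([flagged; extreme; mild])
```

Every extreme outlier is also flagged at k = 1.5, so the k = 3 fences only subdivide the set that the k = 1.5 fences already found.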

Compare the observations with the fences

Plot the observations and the fences.

plot(Tbl.t,Tbl.y,'.');
hold on
plot(Tbl.t(idx),Tbl.y(idx),'*');
plot(pred,[f1 f2]);
legend('Data','Simulated outliers','F_1','F_2');
title('Outlier detection using quantile regression')

All simulated outliers fall outside [F1, F2], and so do some of the other observations.
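The comparison can also be done programmatically instead of by reading the plot. The following self-contained sketch evaluates the conditional quartiles at every observed t rather than on the 50-point grid, then flags observations outside the fences. The response model, the use of rand and randperm for sampling, and the variable names iqrC and flagged are assumptions made for this illustration; TreeBagger and quantilePredict require Statistics and Machine Learning Toolbox.

```matlab
rng('default');                                   % for reproducibility
n = 500;
t = 4*pi*rand(n,1);                               % predictor, uniform on [0, 4*pi]
% ASSUMPTION: illustrative heteroscedastic nonlinear model chosen for this sketch
y = 10 + 3*t + t.*sin(2*t) + randn(n,1).*sqrt(t + 0.01);

% Simulate five outliers by shifting y up or down by 90% of its value
numOut = 5;
idx = randperm(n,numOut)';
signs = 2*(rand(numOut,1) > 0.5) - 1;             % random +1 / -1
y(idx) = y(idx) + signs.*(0.9*y(idx));
Tbl = table(t,y);

Mdl = TreeBagger(200,Tbl,'y','Method','regression');

% Conditional Q1 and Q3 at every observed t
q = quantilePredict(Mdl,Tbl.t,'Quantile',[0.25 0.75]);
k = 1.5;
iqrC = q(:,2) - q(:,1);
f1 = q(:,1) - k*iqrC;
f2 = q(:,2) + k*iqrC;

flagged = Tbl.y < f1 | Tbl.y > f2;
fprintf('Flagged %d of %d observations; %d of the %d simulated outliers among them.\n', ...
    nnz(flagged), n, nnz(flagged(idx)), numOut);
```

Because the quartiles here are predicted for the same observations that trained the forest, the fences may be somewhat tight; out-of-bag quantile estimates (oobQuantilePredict, with out-of-bag prediction enabled when growing the forest) are one alternative worth considering.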

