LOF Outlier Detection Algorithm and Python 3 Implementation

Repost notice

This article was first published on Zhihu – Data Science and Python Notes

LOF outlier detection algorithm and Python3 implementation – Suranyi's article – Zhihu

I. Relevant background

1.1 Anomaly detection algorithm

With the rapid development of data mining technology, people pay attention not only to the overall trend of the data but, increasingly, to the outlying data points that clearly deviate from that trend, because such points often carry the more important information; processing them depends on the corresponding data mining techniques. The purpose of outlier mining is to effectively identify the abnormal data in a data set and to dig out the meaningful latent information it contains. Outliers arise for different reasons, mainly the following:

  • Data from an unusual source: fraud, intrusion, disease outbreaks, unusual experimental results, etc. The mechanism generating such outliers deviates from the normal track, and they are often the focus of attention
  • Inherent variation of the data variables: natural, periodic occurrences that reflect the distribution characteristics of the data set, such as sudden climate changes, new customer purchasing patterns, genetic mutations, etc
  • Data measurement or collection error: mainly systematic error, human error in reading data, measurement equipment failure, or "noise" interference
  • Random or human error: mainly randomness inherent in the experimental platform, plus possible human error during data entry

Outlier detection has great practical significance and broad prospects in its application fields, which mainly include the following:

  • Fraud detection: credit card misconduct, social security fraud, fraudulent use of bank cards, phone cards, etc
  • Intrusion detection: e.g., unauthorized access to computer networks
  • Activity monitoring: detecting mobile phone fraud by tracking phone activity, or suspicious trades in the equity market, in real time
  • Network performance: computer network performance testing (robustness analysis), detection of network congestion, etc
  • Natural ecology: imbalance of ecosystems, discovery of abnormal natural climate, etc
  • Public services: outbreaks of abnormal disease in public health, and emergencies in public security

As outlier detection technology matures, it will be applied in more industries and better support human decision-making.

One goal of outlier detection is to mine valuable information from seemingly chaotic masses of data and make that data serve daily life better. However, real-world data often has hundreds of dimensions and huge volume, which poses great difficulties for existing outlier detection methods. Although traditional methods perform well in their specific application fields, they no longer apply to high-dimensional, large data sets. How to apply outlier detection effectively to big, high-dimensional data is therefore one of the primary goals of current research.

1.2 Why outlier detection?

Classification learners commonly share a basic assumption in training: that the numbers of training samples in the different classes are roughly equal. A slight difference in class sizes usually has little effect, but a large difference causes trouble for the learning process. — Zhou Zhihua, Machine Learning

This was the case in Question B of the 2018 Mathorcup Mathematical Modeling Competition, "Targeted Marketing of Branded Mobile Phones". In the original problem, of the 10,000+ users surveyed, only 500+ had purchased the phone. With a classification algorithm, the learned classifier is likely to answer "no user will buy this phone". It gets the classification right about 96 percent of the time, yet it is obviously meaningless because it never predicts a positive example.

This kind of problem is called "class imbalance": the numbers of training samples in the different classes differ greatly (in the example above, purchasers are vastly outnumbered by non-purchasers). "Undersampling" and "oversampling" are often used to handle such problems, but these methods may lose information contained in the original data. An alternative is the outlier perspective: treat purchase behavior as the "anomaly" and carry out outlier mining.

1.3 Outlier mining methods

1.4 LOF algorithm background

The key step of density-based outlier detection is to assign each data point a degree of dispersion. The main idea: for any point in the given data set, if the points within its local neighborhood are dense, it is treated as a normal point, while an outlier is a point far from its nearest normal neighbors; a threshold usually defines how far is "far". Among density-based methods, the most representative is Local Outlier Factor (LOF) detection.

II. Introduction to the algorithm

Among the many outlier detection methods, LOF is a typical high-precision, density-based method. LOF assigns each data point an outlier factor, $LOF_k(O)$, that depends on the neighborhood density, and judges from it whether the point is an outlier: if $LOF \gg 1$, the data point is an outlier; if $LOF$ is close to 1, the data point is normal.

2.1 Distance measurement scale

Let’s say for samples that have no similarities, assuming commonAnd the data dimension isfor


For the data setAny two data points in, define the following common distance measurement methods

2.1.1 Euclidean distance:

$$d(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{ik} - x_{jk})^2 \right)^{1/2}$$
2.1.2 Hamming distance:

$$d(x_i, x_j) = \#\{\, k \mid x_{ik} \neq x_{jk} \,\}$$
Note: Hamming distance is used in error-control coding for data transmission; it measures how many bits of two pieces of information differ.

For example, take $x = 10101$ and $y = 10011$: it is easy to see that $x$ and $y$ differ in 2 bit positions, so the Hamming distance of $x$ and $y$ is 2.

"For data processing, one technique is to bin continuous data first, converting it into categorical (grouped) variables, which can then be measured with the Hamming distance." — Wozki Shulder
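As a quick illustration (not from the original article), both measures take only a few lines with NumPy/SciPy; the sample vectors here are made up:

import numpy as np
from scipy.spatial.distance import euclidean, hamming

x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 0, 0, 1, 1])

print(euclidean(x, y))         # sqrt of the sum of squared differences: 1.414...
print(hamming(x, y) * len(x))  # scipy returns the *fraction* of differing positions, so scale back: 2.0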

2.1.3 Mahalanobis distance:

Let the covariance matrix of the sample set $D$ be $S$, with inverse $S^{-1}$. When $S$ is invertible, the singular value decomposition $S = U \Sigma V^{T}$ gives:

$$S^{-1} = V \Sigma^{-1} U^{T}$$

If $S$ is not invertible, the Penrose generalized inverse $S^{+}$ is used instead of $S^{-1}$; from the same decomposition:

$$S^{+} = V \Sigma^{+} U^{T}$$
The Mahalanobis distance between two data points $x_i, x_j$ is:

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^{T} S^{-1} (x_i - x_j)}$$
Note: Mahalanobis distance is the covariance-adjusted distance of the data. By handling the correlation between dimensions and changes of measurement scale (e.g., via the Cholesky transformation), it is an effective way to compute similarity between sample sets.
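A minimal sketch (not from the original article): np.linalg.pinv computes the Moore-Penrose generalized inverse, so the same code covers the non-invertible case; the sample set is made up.

import numpy as np

D = np.array([[64., 580.], [66., 570.], [68., 590.], [69., 660.], [73., 600.]])
S_inv = np.linalg.pinv(np.cov(D.T))  # (pseudo-)inverse of the sample covariance matrix

def mahalanobis(xi, xj, S_inv):
    diff = xi - xj
    return np.sqrt(diff @ S_inv @ diff)

print(mahalanobis(D[0], D[1], S_inv))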

2.1.4 Spherical distance

Spherical distance is in fact a conversion built on Euclidean distance rather than an independent distance measure. It is often used in geographic-information conversion, so this article introduces it in detail.

Notation (reconstructed from the figure):

  • $O$, $R$: center and radius of the sphere
  • $\alpha$, $\beta$: complements of the latitudes of points $A$ and $B$ (colatitudes)
  • $O_1$, $O_2$: centers of the latitude circles through $A$ and $B$
  • $r_1$, $r_2$: radii of the latitude circles through $A$ and $B$
  • $\theta$: longitude difference between $A$ and $B$
  • $d$: the spherical distance to be measured

Figure: finding the spherical distance between two points A and B

Let the spherical coordinates of $A$ and $B$ be $(x_1, y_1)$ and $(x_2, y_2)$. If the sphere is the Earth, $x$ and $y$ represent latitude and longitude respectively. (Note: below, $\alpha = 90° - x_1$ and $\beta = 90° - x_2$ are the complements of the latitudes, introduced to simplify the notation of the derivation.)

As shown in the figure, $A$ lies on a latitude circle of radius $r_1 = R\sin\alpha$ centered at $O_1$, and $B$ on one of radius $r_2 = R\sin\beta$ centered at $O_2$; the distance between the planes of the two latitude circles is $|O_1O_2| = |R\cos\alpha - R\cos\beta|$.

Since the radii $O_1A$ and $O_2B$ are skew lines, $O_1O_2$ is their common perpendicular, and the angle between them equals the longitude difference $\theta = y_1 - y_2$, the formula for the distance between two points on skew lines gives the chord length:

$$|AB|^2 = |O_1O_2|^2 + r_1^2 + r_2^2 - 2 r_1 r_2 \cos\theta$$

In triangle $AOB$, by the law of cosines:

$$\cos\angle AOB = \frac{2R^2 - |AB|^2}{2R^2} = \cos\alpha\cos\beta + \sin\alpha\sin\beta\cos\theta$$

Because $\alpha$ and $\beta$ are the complements of the latitudes, converting back:

$$\cos\angle AOB = \sin x_1 \sin x_2 + \cos x_1 \cos x_2 \cos(y_1 - y_2)$$

Therefore, the spherical distance between points $A$ and $B$ is:

$$d(A, B) = R \cdot \arccos\left[\sin x_1 \sin x_2 + \cos x_1 \cos x_2 \cos(y_1 - y_2)\right]$$
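A minimal sketch of the final formula (not from the original article); latitudes and longitudes are in degrees, and the mean Earth radius R = 6371 km is an assumption for illustration:

import math

def spherical_distance(A, B, R=6371.0):
    x1, y1, x2, y2 = map(math.radians, (*A, *B))
    cos_angle = math.sin(x1) * math.sin(x2) + math.cos(x1) * math.cos(x2) * math.cos(y1 - y2)
    return R * math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp against rounding error

print(spherical_distance((39.9, 116.4), (31.2, 121.5)))  # Beijing to Shanghai, roughly 1068 km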


Any of the above measures may be chosen as appropriate.

In addition, there are the Chebyshev distance, Minkowski distance, absolute distance, Lance & Williams distance, and so on; analyze the specific problem and select a suitable measure.

Throughout, $d(A, B)$ denotes the distance between point $A$ and point $B$. By definition, symmetry holds:

$$d(A, B) = d(B, A)$$
2.2 The k-th distance

Define the k-th distance of point $O$ as $d_k(O) = d(O, P)$, where $P$ is a point satisfying both conditions:

  • (1) at least $k$ points $O' \in D\setminus\{O\}$ satisfy $d(O, O') \le d(O, P)$
  • (2) at most $k - 1$ points $O' \in D\setminus\{O\}$ satisfy $d(O, O') < d(O, P)$

In short: $P$ is the k-th nearest point to $O$.

Figure: point P is the 5th nearest point to O; four points lie strictly inside the 5th distance and six points lie within it, so the 5th distance of point O equals its 6th distance.

2.3 The k-th distance neighborhood

Define the k-th distance neighborhood of point $O$ as:

$$N_k(O) = \{\, O' \in D\setminus\{O\} \mid d(O, O') \le d_k(O) \,\}$$

Note: the concept of neighborhood here differs slightly from the one in Chinese advanced-mathematics textbooks (it is a set of specific points, not an interval). The set contains every point whose distance to $O$ does not exceed the k-th distance of $O$, so it is easy to see that $|N_k(O)| \ge k$. As in the figure above, the 5th distance neighborhood of point $O$ is $N_5(O) = \{P_1, \dots, P_6\}$, with $|N_5(O)| = 6$.
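A minimal sketch of $d_k(O)$ and $N_k(O)$ (not from the original article), keeping ties on the boundary so that $|N_k(O)|$ can exceed $k$; the points are made up:

import numpy as np

def k_distance_neighborhood(O, D, k):
    others = np.array([P for P in D if not np.array_equal(P, O)])
    dists = np.linalg.norm(others - O, axis=1)
    d_k = np.sort(dists)[k - 1]       # the k-th smallest distance
    return d_k, others[dists <= d_k]  # ties on the boundary stay in

D = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [5, 5]])
d_4, N_4 = k_distance_neighborhood(np.array([0, 0]), D, 4)
print(d_4, len(N_4))  # 2.0 5 -- two points tie at distance 2, so |N_4| = 5 > k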


2.4 Reachable distance

The k-th reachable distance from point $P$ to point $O$ is defined as:

$$d_k^{reach}(P, O) = \max\{\, d_k(O),\; d(P, O) \,\}$$

Note: the k-th reachable distance from $P$ to $O$ is at least the k-th distance of $O$. For the $k$ points nearest to $O$, the reachable distances to $O$ are treated as equivalent, all equal to $d_k(O)$.

When the reachable density of the sample points themselves is computed, the k-th (e.g., 5th) reachable distance from $P$ to $O$ is always $d_k(O)$, i.e., it is tied to the k-th distance. The definition really comes into play when computing the outlier density of new sample points within an old sample domain (for example, predict in 3.1.4 must compute the density of each point with respect to data, without considering the influence of the point's own data set).

2.5 Local reachable density

The local reachable density is defined as:

$$lrd_k(O) = 1 \Bigg/ \left( \frac{\sum_{P \in N_k(O)} d_k^{reach}(O, P)}{|N_k(O)|} \right)$$

Note: $lrd_k(O)$ is the reciprocal of the average reachable distance from point $O$ to all points in its k-th distance neighborhood; even when ties put more than $k$ points on the neighborhood boundary, the count in the denominator is still $|N_k(O)|$. If $O$ and its neighborhood points belong to the same cluster, the reachable distances are more likely to be the smaller $d_k(P)$, so their sum is small and the local reachable density is large. If $O$ is far from its neighborhood points, the reachable distances are more likely to be the larger $d(O, P)$, so their sum is large and the local reachable density is small.

Some materials use $k$ here rather than $|N_k(O)|$. After consulting a large number of references and testing on data, the author believes it should be $|N_k(O)|$: otherwise, when too many points lie on the same ring (as with the points on the same ring in Equation (13)), $lrd_k(O)$ comes out as a very small number, suggesting low density and a possible outlier where there is none.

This is also where the stipulation at the start of 2.1 that sample points must not coincide can be explained: if coincident sample points were allowed, the reachable distance here could be 0 and the reachable density would take a $1/0$ form, making the calculation meaningless.

2.6 Local outlier factor

The local outlier factor of point $O$ is defined as:

$$LOF_k(O) = \frac{\displaystyle\sum_{P \in N_k(O)} \frac{lrd_k(P)}{lrd_k(O)}}{|N_k(O)|}$$

Note: $LOF_k(O)$ is the mean ratio of the local reachable density of the points in $O$'s neighborhood to that of $O$ itself. If the ratio is close to 1, $O$'s density is similar to its neighbors' and $O$ probably belongs to the same cluster as the neighborhood; if the ratio is less than 1, $O$'s density is higher than its neighbors' and $O$ is a dense (normal) point; if the ratio is greater than 1, $O$'s density is lower than its neighbors' and $O$ may be an anomaly.

Figure: the LOF value of point O is much greater than 1, indicating a possible outlier.
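A minimal from-scratch sketch of Sections 2.2–2.6 on made-up toy data (not the implementation used later): brute-force Euclidean distances, exactly k neighbours, no tie handling.

import numpy as np

def lof_scores(D, k):
    n = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                   # exclude each point itself
    knn = np.argsort(dist, axis=1)[:, :k]            # indices of the k nearest neighbours
    d_k = dist[np.arange(n), knn[:, -1]]             # k-th distance of every point
    # reach-dist_k(O, P) = max{d_k(P), d(O, P)} for every neighbour P of O
    reach = np.maximum(d_k[knn], dist[np.arange(n)[:, None], knn])
    lrd = k / reach.sum(axis=1)                      # local reachable density
    return lrd[knn].sum(axis=1) / k / lrd            # mean lrd ratio = LOF

D = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [5., 5.]])
print(lof_scores(D, k=2))  # the square's corners score close to 1, the far point well above 1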

III. Python 3 implementation of the LOF outlier detection algorithm (improved in Part IV)

This section first introduces the functions provided by sklearn, then walks through a step-by-step implementation.

3.1 Implementation with the sklearn module

In Python 3, the sklearn module provides the LOF outlier detection algorithm.

3.1.1 Importing modules

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

3.1.2 Core functions

clf = LocalOutlierFactor(n_neighbors=20, algorithm='auto', contamination=0.1, n_jobs=1)

  • n_neighbors=20: the neighborhood size k discussed above; if it exceeds the number of samples, all samples are used for detection
  • algorithm='auto': the solving algorithm to use; keep the default
  • contamination=0.1: in the range (0, 0.5], the proportion of outliers in the sample; the default is 0.1
  • n_jobs=-1: the number of parallel jobs; -1 uses all CPUs
  • p=2: the distance metric, Euclidean by default (other distance models are rarely used and are not covered here; see the official documentation for details)
clf.fit(data)

Unsupervised learning: only the training data needs to be passed in, and it must be at least 2-dimensional.

clf.kneighbors(data)

  • Function: get the distance from every point in the k-th distance neighborhood to the center point, sorted from smallest to largest

  • Returns: arrays [distances, sample indices]

clf._decision_function(data)

  • Function: obtain the LOF value of each sample point; this function returns the negated LOF, so take the negative sign to recover it
  • clf._decision_function outputs in a more flexible way than clf._predict(data), which returns only the judgement implied by the contamination set earlier (in that proportion, -1 for outliers and 1 for normal points); a version note follows below
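Note that _decision_function is a private method of older scikit-learn releases and may be absent from newer ones; a hedged equivalent using the public API of recent versions (an assumption — check your installed version) is:

from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = clf.fit_predict(data)               # -1 = outlier, 1 = normal point
lof_values = -clf.negative_outlier_factor_   # LOF of each training sample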

3.1.3 Encapsulating functions

localoutlierfactor(data, predict, k)

  • Input: training samples, test samples, k value
  • Output: LOF value of each test sample and the corresponding k-th distance

plot_lof(result, method)

  • Input: LOF values, threshold
  • Output: distribution of LOF values, with the sample index as the abscissa

lof(data, predict=None, k=5, method=1, plot=False)

  • Input: training samples, test samples, k value, outlier threshold, whether to draw the LOF graph
  • Output: outliers and normal points, classified; if no test samples are passed in, the test samples default to the training samples
def localoutlierfactor(data, predict, k):
    from sklearn.neighbors import LocalOutlierFactor
    clf = LocalOutlierFactor(n_neighbors=k + 1, algorithm='auto', contamination=0.1, n_jobs=1)
    clf.fit(data)
    # Record the k-th distance neighborhood radius
    predict['k distances'] = clf.kneighbors(predict)[0].max(axis=1)
    # Record the LOF scores (negate the sign of sklearn's output); exclude the column just added
    predict['local outlier factor'] = -clf._decision_function(predict.iloc[:, :-1])
    return predict

def plot_lof(result, method):
    import matplotlib.pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
    plt.figure(figsize=(8, 4)).add_subplot(111)
    plt.scatter(result[result['local outlier factor'] > method].index,
                result[result['local outlier factor'] > method]['local outlier factor'], c='red', s=50,
                marker='.', alpha=None,
                label='Outlier')
    plt.scatter(result[result['local outlier factor'] <= method].index,
                result[result['local outlier factor'] <= method]['local outlier factor'], c='black', s=50,
                marker='.', alpha=None, label='Normal point')
    plt.hlines(method, -2, 2 + max(result.index), linestyles='--')
    plt.xlim(-2, 2 + max(result.index))
    plt.title('LOF local outlier detection', fontsize=13)
    plt.ylabel('Local outlier factor', fontsize=15)
    plt.legend()
    plt.show()

def lof(data, predict=None, k=5, method=1, plot=False):
    import pandas as pd
    # If no test data is passed in, use the training data as the test data
    if predict is None:
        predict = data.copy()
    predict = pd.DataFrame(predict)
    # Compute the LOF scores
    predict = localoutlierfactor(data, predict, k)
    if plot:
        plot_lof(predict, method)
    # Split outliers and normal points by the threshold
    outliers = predict[predict['local outlier factor'] > method].sort_values(by='local outlier factor')
    inliers = predict[predict['local outlier factor'] <= method].sort_values(by='local outlier factor')
    return outliers, inliers

3.1.4 Example test

Test data: Question B data of 2017 National College Students Mathematical Contest in Modeling

The test data consist of two files, and three tests are carried out: no test samples passed in, test samples passed in, and test and training samples swapped.

Test 1, no test samples passed in: task density

Data background: in crowdsourcing task pricing, the density of tasks in a region reflects how concentrated the tasks are and thus affects their pricing. Spherical distance deviation is ignored here (points are treated as lying on the same plane). A reasonable indicator is needed to describe how concentrated the tasks are.

import numpy as np
import pandas as pd

# Change the path according to the file location
posi = pd.read_excel(r'Finished project task data.xls')
lon = np.array(posi["Mission GPS longitude"][:])  # longitude
lat = np.array(posi["Mission GPS latitude"][:])  # latitude
A = list(zip(lat, lon))  # pair up as (latitude, longitude)

# Task density: take the 5th neighborhood, threshold 2 (LOF > 2 counts as an outlier)
outliers1, inliers1 = lof(A, k=5, method=2)

The given data contain 835 records; with the LOF threshold set to 2, 17 outlier records are output:


The distribution of the detected data points is shown below, where blue represents the task distribution and the red circles represent the LOF values:

Detection with k=5: the larger the red circle, the larger the LOF value. The figure shows a remarkable detection effect.

Adjusting k and detecting repeatedly: the larger the k value, the higher the accuracy.

# Plotting
import matplotlib.pyplot as plt
for k in [3, 5, 10]:
    plt.figure('k=%d' % k)
    outliers1, inliers1 = lof(A, k=k, method=2)
    plt.scatter(np.array(A)[:, 0], np.array(A)[:, 1], s=10, c='b', alpha=0.5)
    plt.scatter(outliers1[0], outliers1[1], s=10 + outliers1['local outlier factor'] * 100, c='r', alpha=0.2)
    plt.title('k=%d' % k)

Test 2, with test samples passed in: task-to-member density

Data background: in crowdsourcing task pricing, the task density reflects how concentrated the tasks are, and the member density how concentrated the members are. The task-to-member density describes the concentration of members around a task point, reflecting the relative probability that the task gets completed. Here the training samples are the member positions and the test samples the task positions.

import numpy as np
import pandas as pd

posi = pd.read_excel(r'Finished project task data.xls')
lon = np.array(posi["Mission GPS longitude"][:])  # longitude
lat = np.array(posi["Mission GPS latitude"][:])  # latitude
A = list(zip(lat, lon))  # pair up as (latitude, longitude)

posi = pd.read_excel(r'Member information data.xlsx')
lon = np.array(posi["Membership location (GPS) Longitude"][:])  # longitude
lat = np.array(posi["Membership Location (GPS) Latitude"][:])  # latitude
B = list(zip(lat, lon))  # pair up as (latitude, longitude)

# Task-to-member density: take the 5th neighborhood, threshold 2 (LOF > 2 counts as an outlier)
outliers2, inliers2 = lof(B, A, k=5, method=2)

The training samples contain 1877 records and the test samples 835. With the LOF threshold set to 2, 34 outlier records are output:


The distribution of the detected data points is shown below, where blue represents the task distribution, green the member distribution, and the red circles the LOF values.

Detection with k=5: the larger the red circle, the larger the LOF value. The detection effect is significant, but so is the misjudgment.

Detection with k=1 is poor: k=1 compares only the distance between a point and its single nearest neighbor, which has little practical meaning.

# Plotting
import matplotlib.pyplot as plt
for k, v in [[1, 5], [5, 2]]:
    plt.figure('k=%d' % k)
    outliers2, inliers2 = lof(B, A, k=k, method=v)
    plt.scatter(np.array(A)[:, 0], np.array(A)[:, 1], s=10, c='b', alpha=0.5)
    plt.scatter(np.array(B)[:, 0], np.array(B)[:, 1], s=10, c='green', alpha=0.3)
    plt.scatter(outliers2[0], outliers2[1], s=10 + outliers2['local outlier factor'] * 100, c='r', alpha=0.2)
    plt.title('k = %d, method = %g' % (k, v))

Test 3, test and training samples swapped: member-to-task density

Data background: in crowdsourcing task pricing, the task density reflects how concentrated the tasks are, and the member density how concentrated the members are. The member-to-task density describes the concentration of tasks around a member, reflecting the relative probability that the member can receive a task. Here the training samples are the task positions and the test samples the member positions.

import numpy as np
import pandas as pd

posi = pd.read_excel(r'Finished project task data.xls')
lon = np.array(posi["Mission GPS longitude"][:])  # longitude
lat = np.array(posi["Mission GPS latitude"][:])  # latitude
A = list(zip(lat, lon))  # pair up as (latitude, longitude)

posi = pd.read_excel(r'Member information data.xlsx')
lon = np.array(posi["Membership location (GPS) Longitude"][:])  # longitude
lat = np.array(posi["Membership Location (GPS) Latitude"][:])  # latitude
B = list(zip(lat, lon))  # pair up as (latitude, longitude)

# Member-to-task density: take the 5th neighborhood, threshold 5 (LOF > 5 counts as an outlier)
outliers3, inliers3 = lof(A, B, k=5, method=5)

The training samples contain 835 records and the test samples 1877. With the LOF threshold set to 5, 20 outlier records are output:


The distribution of the detected data points is shown below, where blue represents the member distribution, green the task distribution, and the red circles the LOF values.

Detection with k=5: the larger the red circle, the larger the LOF value. The figure shows a remarkable detection effect.

# Plotting
plt.figure('k=5')
outliers3, inliers3 = lof(A, B, k=5, method=5)
plt.scatter(np.array(B)[:, 0], np.array(B)[:, 1], s=10, c='b', alpha=0.5)
plt.scatter(np.array(A)[:, 0], np.array(A)[:, 1], s=10, c='green', alpha=0.3)
plt.scatter(outliers3[0], outliers3[1], s=10 + outliers3['local outlier factor'] * 20, c='r', alpha=0.2)
plt.title('k = 5, method = 5')

If method is set to 0, every point lands in the outlier set, so the LOF value of each point can be output and used directly as a density indicator; see the sketch below.
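Usage sketch (the variable names here are hypothetical): with method=0 all samples come back in the first return value, whose 'local outlier factor' column then serves as the density indicator.

density, _ = lof(B, A, k=5, method=0)
task_density = density['local outlier factor']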

3.2 Step-by-step programming implementation method

3.2.1 Distance measurement scale

distances(A, B, model='euclidean')

  • Calls functions from the geopy module to compute spherical distances
  • Input: coordinates of points A and B, distance mode ('euclidean' and 'geo' are Euclidean and spherical distance respectively)
  • Output: if A and B are two coordinate points, the distance between them; if A and B are matrices of points, the result is the matrix computed as:

$$distance_{ij} = d(A_i, B_j)$$


def distances(A, B, model='euclidean'):
    '''Distance as defined in LOF: Euclidean by default, spherical distance also provided'''
    import numpy as np
    A = np.array(A); B = np.array(B)
    if model == 'euclidean':
        from scipy.spatial.distance import pdist, squareform
        distance = squareform(pdist(np.vstack([A, B])))[:A.shape[0], A.shape[0]:]
    if model == 'geo':
        from geopy.distance import great_circle
        distance = np.zeros(A.shape[0] * B.shape[0]).reshape(A.shape[0], B.shape[0])
        for i in range(len(A)):
            for j in range(len(B)):
                distance[i, j] = great_circle(A[i], B[j]).kilometers
    if distance.shape == (1, 1):
        return distance[0][0]
    return distance

3.2.2 k-distance neighborhood radius and neighborhood points

k_distance(k, instance_A, instance_B, result, model)

  • Calls distances from 3.2.1
  • Input: k value, coordinates of points A and B, a pandas.DataFrame for storing intermediate results, distance mode
  • Output: pandas.DataFrame table (k-distance, neighborhood point coordinates)
def k_distance(k, instance_A, instance_B, result, model):
    '''Compute the k-distance neighborhood radius and the neighborhood points'''
    distance_all = distances(instance_B, instance_A, model)
    # Iterate over each point in instance_A
    for i, a in enumerate(instance_A):
        dist_dict = {}  # renamed so it does not shadow the distances() function
        distance = distance_all[:, i]
        # Record the distances from every point of instance_B to this point
        for j in range(distance.shape[0]):
            if distance[j] in dist_dict.keys():
                if instance_B[j].tolist() in dist_dict[distance[j]]:
                    pass
                else:
                    dist_dict[distance[j]].append(instance_B[j].tolist())
            else:
                dist_dict[distance[j]] = [instance_B[j].tolist()]
        # Sort by distance
        dist_sorted = sorted(dist_dict.items())
        if dist_sorted[0][0] == 0:
            dist_sorted.remove(dist_sorted[0])
        neighbours = []; k_sero = 0; k_dist = None
        # Take the first k points (ties on the boundary included)
        for dist in dist_sorted:
            k_sero += len(dist[1])
            neighbours.extend(dist[1])
            k_dist = dist[0]
            if k_sero >= k:
                break
        # Store the results
        result.loc[str(a.tolist()), 'k_dist'] = k_dist
        result.loc[str(a.tolist()), 'neighbours'] = str(neighbours)
    return result

3.2.3 Local reachable density

local_reachability_density(k, instance_A, instance_B, result, model)

  • Calls k_distance from 3.2.2
  • Input: k value, coordinates of points A and B, a pandas.DataFrame for storing intermediate results, distance mode
  • Output: pandas.DataFrame table (k-distance, neighborhood point coordinates, local reachable density)
def local_reachability_density(k, instance_A, instance_B, result, model):
    '''Local reachable density'''
    result = k_distance(k, instance_A, instance_B, result, model)
    # Iterate over each point in instance_A
    for a in instance_A:
        # Fetch the stored k-distance and neighbours (string -> list of points)
        try:
            (k_distance_value, neighbours) = result.loc[str(a.tolist())]['k_dist'].mean(), eval(result.loc[str(a.tolist())]['neighbours'])
        except Exception:
            (k_distance_value, neighbours) = result.loc[str(a.tolist())]['k_dist'].mean(), eval(result.loc[str(a.tolist())]['neighbours'].values[0])
        # Compute the reachable distances
        reachability_distances_array = [0] * len(neighbours)
        for j, neighbour in enumerate(neighbours):
            reachability_distances_array[j] = max([k_distance_value, distances([a], [neighbour], model)])
        sum_reach_dist = sum(reachability_distances_array)
        # Compute the local reachable density and store the result
        result.loc[str(a.tolist()), 'local_reachability_density'] = k / sum_reach_dist
    return result

3.2.4 Local outlier factor

local_outlier_factor(k, instance_A, instance_B, model)

  • Calls local_reachability_density from 3.2.3
  • Input: k value, coordinates of points A and B, distance mode (the intermediate pandas.DataFrame is built internally)
  • Output: pandas.DataFrame table (k-distance, neighborhood point coordinates, local reachable density, local outlier factor)
def local_outlier_factor(k, instance_A, instance_B, model):
    '''Local outlier factor'''
    import numpy as np
    import pandas as pd
    result = local_reachability_density(k, instance_A, instance_B, pd.DataFrame(index=[str(i.tolist()) for i in instance_A]), model)
    # If the test data equal the training data, reuse the result
    if np.all(instance_A == instance_B):
        result_B = result
    else:
        result_B = local_reachability_density(k, instance_B, instance_B, k_distance(k, instance_B, instance_B, pd.DataFrame(index=[str(i.tolist()) for i in instance_B]), model), model)
    for a in instance_A:
        try:
            (k_distance_value, neighbours, instance_lrd) = result.loc[str(a.tolist())]['k_dist'].mean(), np.array(eval(result.loc[str(a.tolist())]['neighbours'])), result.loc[str(a.tolist())]['local_reachability_density'].mean()
        except Exception:
            (k_distance_value, neighbours, instance_lrd) = result.loc[str(a.tolist())]['k_dist'].mean(), np.array(eval(result.loc[str(a.tolist())]['neighbours'].values[0])), result.loc[str(a.tolist())]['local_reachability_density'].mean()
        finally:
            lrd_ratios_array = [0] * len(neighbours)
            for j, neighbour in enumerate(neighbours):
                neighbour_lrd = result_B.loc[str(neighbour.tolist())]['local_reachability_density'].mean()
                lrd_ratios_array[j] = neighbour_lrd / instance_lrd
            result.loc[str(a.tolist()), 'local_outlier_factor'] = sum(lrd_ratios_array) / k
    return result

3.2.5 Encapsulating functions

lof(k, instance_A, instance_B, k_means=False, n_clusters=False, k_means_pass=3, method=1, model='euclidean')

  • Calls local_outlier_factor from 3.2.4
  • Input: k value, coordinates of points A and B, whether to use the clustering algorithm, number of clusters, minimum cluster size below which a cluster is skipped when pruning, threshold, distance mode
  • Output: outlier information, normal point information, LOF distribution
# The k_means-related parts of this function are covered in Part IV
def lof(k, instance_A, instance_B, k_means=False, n_clusters=False, k_means_pass=3, method=1, model='euclidean'):
    '''A is the fixed (training) point set, B the moving (test) point set'''
    import numpy as np
    instance_A = np.array(instance_A)
    instance_B = np.array(instance_B)
    if np.all(instance_A == instance_B):
        if k_means == True:
            if n_clusters == True:
                n_clusters = elbow_method(instance_A, maxtest=10)
            instance_A = kmeans(instance_A, n_clusters, k_means_pass)
            instance_B = instance_A.copy()
    result = local_outlier_factor(k, instance_A, instance_B, model)
    outliers = result[result['local_outlier_factor'] > method].sort_values(by='local_outlier_factor', ascending=False)
    inliers = result[result['local_outlier_factor'] <= method].sort_values(by='local_outlier_factor', ascending=True)
    plot_lof(result, method)  # see 3.1.3 for this function
    return outliers, inliers

IV. Thoughts on and improvements to the LOF algorithm

4.1 Can LOF be used for one-dimensional data?

The sklearn LOF implementation in 3.1 checks the data type during training: if the data is a list, tuple, or numpy.array, it must be at least 2-dimensional. In practice, to screen outliers in 1-dimensional data it is usually convenient enough to plot the data and judge with a chosen threshold. If you still want to use the LOF algorithm in this case, you can add a virtual dimension to the data and give it a constant value:

# data is originally 1-dimensional; convert it to 2-dimensional as follows
data = list(zip(data, np.zeros_like(data)))

In addition, the problem can be avoided by converting the data to pandas.DataFrame:

data = pd.DataFrame(data)

4.2 How large must the LOF be to count as an outlier?

The LOF computation does not specify beyond what value a point is defined as an outlier. In one stable data set, 1.1 may already be an outlier, while in another with strong fluctuations, even an LOF of 2 may still be normal. Because of this limitation, the definition of an outlier can differ from data set to data set. The author therefore suggests using the statistical distribution of the scores as a reference and setting the threshold according to the data at hand.

Threshold division based on statistical distribution

The LOF outlier scores are normalized to the interval [0, 1] and partitioned by statistical methods. The following provides a boxplot-based definition, with selection based on the anomalies it reports.

box(data, legend=True)

  • Input: LOF outlier scores; whether to annotate the anomalies in the boxplot (set legend=False to disable)
  • Output: anomaly labels
def box(data, legend=True):
    import matplotlib.pyplot as plt
    import pandas as pd
    plt.rcParams['axes.unicode_minus'] = False
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.style.use("ggplot")
    plt.figure()
    # Convert to a DataFrame first if necessary
    if type(data) != pd.core.frame.DataFrame:
        data = pd.DataFrame(data)
    p = data.boxplot(return_type='dict')
    warming = pd.DataFrame()
    y = p['fliers'][0].get_ydata()
    y.sort()
    for i in range(len(y)):
        if legend == True:
            plt.text(1, y[i] - 1, y[i], fontsize=10, color='black', ha='right')
        if y[i] < data.mean()[0]:
            form = 'low'
        else:
            form = 'high'
        warming = warming.append(pd.Series([y[i], 'abnormally ' + form]).T, ignore_index=True)
    print(warming)
    plt.show()

Pass in lof = predict['local outlier factor']; alternatively, first specify an arbitrary initial threshold (naming the resulting outliers and normal points outliers and inliers respectively) and then call:

box(outliers['local outlier factor'].tolist() + inliers['local outlier factor'].tolist(), legend=True)

The interactive-console output is shown on the left and the boxplot on the right. The output suggests that, from the viewpoint of the data distribution, 1.4 could serve as the outlier threshold; in fact 7 is more appropriate (there is an obvious gap between 2 and 7, and the value 5 used above was chosen after several trials).

Setting legend=False turns off the annotations on the right.

4.3 Can the LOF algorithm still be used if the data dimension is too large?

An overly large data dimension increases the influence of individual dimensions on one hand and the computational difficulty on the other; directly using a raw distance measure then becomes unreasonable, and more dispersed dimensions risk dominating the result. One approach is to use the Mahalanobis distance as the measure (removing scale effects). Another, borrowing the decision idea of random forests, is to run LOF on multiple lower-dimensional projections and combine the results, improving detection quality on high-dimensional data.
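A minimal sketch of the first approach (not from the original article): whitening the data with the inverse Cholesky factor of its covariance makes plain Euclidean LOF equivalent to Mahalanobis-distance LOF; the data here are synthetic.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))        # toy high-dimensional data
L = np.linalg.cholesky(np.cov(X.T))  # covariance S = L @ L.T
X_white = X @ np.linalg.inv(L).T     # whitened data: cov(X_white) is approximately I

clf = LocalOutlierFactor(n_neighbors=20)
labels = clf.fit_predict(X_white)    # -1 = outlier, 1 = normal point
lof_scores = -clf.negative_outlier_factor_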

Ensemble learning

Ensemble learning: building and combining multiple learners to accomplish a learning task usually yields significantly better generalization than a single learner. — Zhou Zhihua, Machine Learning

The data used for detection should be meaningful, which requires simple feature screening first; otherwise, no matter how "outlying" a sample is, it may carry little practical meaning. Following the ensemble idea, split the data by dimension: data of the same type should be treated together (for example, position coordinates can be considered jointly as a "distance") and kept at dimension greater than 1; otherwise, use the data transformation from 4.1 so that the general LOF form can handle it.

4.3.1 Voting mode

In voting mode, the data of every dimension are considered equally important, and an LOF threshold is set separately for each dimension for comparison. Whenever a sample's LOF exceeds a dimension's threshold, its anomaly vote count accumulates by 1; samples whose votes exceed the vote threshold are regarded as outliers.

localoutlierfactor(data, predict, k)

  • Input: training samples, test samples, k value
  • Output: LOF value of each test sample and the corresponding k-th distance

plot_lof(result, method)

  • Input: LOF values, threshold
  • Output: distribution of LOF values, with the sample index as the abscissa

ensemble_lof(data, predict=None, k=5, groups=[], method=1, vote_method='auto')

  • Input: training samples, test samples, k value, grouped feature indices, outlier thresholds, vote threshold
    • Grouped feature indices (column position minus 1, i.e., 0-based): if the first and second columns hold the same type of data, pass groups = [[0, 1]]
    • Outlier thresholds: one threshold per feature group; missing entries are filled with 1
    • Vote threshold: the maximum number of anomaly votes a normal point may receive
  • Output: outliers and normal points, classified; if no test samples are passed in, the test samples default to the training samples
def localoutlierfactor(data, predict, k, group_str):
    from sklearn.neighbors import LocalOutlierFactor
    clf = LocalOutlierFactor(n_neighbors=k + 1, algorithm='auto', contamination=0.1, n_jobs=1)
    clf.fit(data)
    # Record the LOF scores (negate the sign of sklearn's output)
    predict['local outlier factor %s' % group_str] = -clf._decision_function(predict.iloc[:, eval(group_str)])
    return predict

def plot_lof(result, method, group_str):
    import matplotlib.pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
    plt.figure('local outlier factor %s' % group_str)
    try:
        plt.scatter(result[result > method].index,
                    result[result > method], c='red', s=50,
                    marker='.', alpha=None,
                    label='Outlier')
    except Exception:
        pass
    try:
        plt.scatter(result[result <= method].index,
                    result[result <= method], c='black', s=50,
                    marker='.', alpha=None, label='Normal point')
    except Exception:
        pass
    plt.hlines(method, -2, 2 + max(result.index), linestyles='--')
    plt.xlim(-2, 2 + max(result.index))
    plt.title('LOF local outlier detection', fontsize=13)
    plt.ylabel('Local outlier factor', fontsize=15)
    plt.legend()
    plt.show()

def ensemble_lof(data, predict=None, k=5, groups=[], method=1, vote_method='auto'):
    import pandas as pd
    import numpy as np
    # If no test data is passed in, use the training data as the test data
    if predict is None:
        predict = data.copy()
    data = pd.DataFrame(data); predict = pd.DataFrame(predict)
    # Group the feature indices; ungrouped columns each form their own group
    for i in range(data.shape[1]):
        if i not in pd.DataFrame(groups).values:
            groups += [[i]]
    # Expand the threshold list, filling missing entries with 1
    if type(method) != list:
        method = [method]
        method += [1] * (len(groups) - 1)
    else:
        method += [1] * (len(groups) - len(method))
    vote = np.zeros(len(predict))
    # Compute the LOF scores per group and tally votes against each threshold
    for i in range(len(groups)):
        predict = localoutlierfactor(pd.DataFrame(data).iloc[:, groups[i]], predict, k, str(groups[i]))
        plot_lof(predict.iloc[:, -1], method[i], str(groups[i]))
        vote += predict.iloc[:, -1] > method[i]
    # Split outliers and normal points by the vote threshold
    predict['vote'] = vote
    if vote_method == 'auto':
        vote_method = len(groups) / 2
    outliers = predict[vote > vote_method].sort_values(by='vote')
    inliers = predict[vote <= vote_method].sort_values(by='vote')
    return outliers, inliers

Test 4 reuses the setting of Test 3, but now latitude and longitude are treated as independent features and the two dimensions are identified separately (even though latitude and longitude in isolation have little practical meaning).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

posi = pd.read_excel(r'Finished project task data.xls')
lon = np.array(posi["Mission GPS longitude"][:])  # longitude
lat = np.array(posi["Mission GPS latitude"][:])  # latitude
A = list(zip(lat, lon))  # pair up as (latitude, longitude)

posi = pd.read_excel(r'Member information data.xlsx')
lon = np.array(posi["Membership location (GPS) Longitude"][:])  # longitude
lat = np.array(posi["Membership Location (GPS) Latitude"][:])  # latitude
B = list(zip(lat, lon))  # pair up as (latitude, longitude)

# Member-to-task density: take the 5th neighborhood, thresholds 1.5 and 2; more than 1 vote marks an outlier
outliers4, inliers4 = ensemble_lof(A, B, k=5, method=[1.5, 2], vote_method=1)

# Plotting
plt.figure('Vote-based LOF ensemble')
plt.scatter(np.array(B)[:, 0], np.array(B)[:, 1], s=10, c='b', alpha=0.5)
plt.scatter(np.array(A)[:, 0], np.array(A)[:, 1], s=10, c='green', alpha=0.3)
plt.scatter(outliers4[0], outliers4[1], s=10 + 1000, c='r', alpha=0.2)
plt.title('k = 5, method = [1.5, 2]')

The abnormal points can still be identified accurately

4.3.2 LOF anomaly score weighting model

In anomaly-score weighting mode, the LOF values of each dimension are weighted and summed, and the result serves as the LOF score of the data as a whole. The weights can be regarded as the importance of the features, or as the relative dispersion of the data distribution; in the latter case they can be set with the entropy weight method (for an introduction to the entropy weight method, see another blog post by the author).


$$\mathrm{LOF}(O) = \sum_i w_i \,\mathrm{LOF}_i(O)$$

where $\mathrm{LOF}_i(O)$ denotes the LOF score of data point $O$ computed on the $i$-th feature (group), and $w_i$ its weight.

localoutlierfactor(data, predict, k)

  • Input: training samples, test samples, k value
  • Output: LOF value of each test sample and the corresponding k-th distance

plot_lof(result, method)

  • Input: LOF values, threshold
  • Output: distribution of LOF values, with the sample index as the abscissa

ensemble_lof(data, predict=None, k=5, groups=[], method='auto', weight=1)

  • Input: training samples, test samples, k value, grouped feature indices, outlier threshold, feature weights
    • Grouped feature indices (column position minus 1, i.e., 0-based): if the first and second columns hold the same type of data, pass groups = [[0, 1]]
    • Outlier threshold: threshold on the weighted LOF; by default ('auto') the sum of the weights is used
    • Feature weights: a weight for each dimension's LOF; missing entries are filled with 1
  • Output: outliers and normal points, classified
  • If no test samples are passed in, the test samples default to the training samples
def localoutlierfactor(data, predict, k, group_str):
    from sklearn.neighbors import LocalOutlierFactor
    clf = LocalOutlierFactor(n_neighbors=k + 1, algorithm='auto', contamination=0.1, n_jobs=1)
    clf.fit(data)
    # Record the LOF scores (negate the sign of sklearn's output)
    predict['local outlier factor %s' % group_str] = -clf._decision_function(predict.iloc[:, eval(group_str)])
    return predict

def plot_lof(result, method):
    import matplotlib.pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
    plt.scatter(result[result > method].index,
                result[result > method], c='red', s=50,
                marker='.', alpha=None,
                label='Outlier')
    plt.scatter(result[result <= method].index,
                result[result <= method], c='black', s=50,
                marker='.', alpha=None, label='Normal point')
    plt.hlines(method, -2, 2 + max(result.index), linestyles='--')
    plt.xlim(-2, 2 + max(result.index))
    plt.title('LOF local outlier detection', fontsize=13)
    plt.ylabel('Local outlier factor', fontsize=15)
    plt.legend()
    plt.show()

def ensemble_lof(data, predict=None, k=5, groups=[], method='auto', weight=1):
    import pandas as pd
    # If no test data is passed in, use the training data as the test data
    if predict is None:
        predict = data
    data = pd.DataFrame(data)
    predict = pd.DataFrame(predict)
    # Group the feature indices; ungrouped columns each form their own group
    for i in range(data.shape[1]):
        if i not in pd.DataFrame(groups).values:
            groups += [[i]]
    # Expand the weight list, filling missing entries with 1
    if type(weight) != list:
        weight = [weight]
        weight += [1] * (len(groups) - 1)
    else:
        weight += [1] * (len(groups) - len(weight))
    predict['local outlier factor'] = 0
    # Compute the LOF scores per group and accumulate the weighted LOF score
    for i in range(len(groups)):
        predict = localoutlierfactor(pd.DataFrame(data).iloc[:, groups[i]], predict, k, str(groups[i]))
        predict['local outlier factor'] += predict.iloc[:, -1] * weight[i]
    if method == 'auto':
        method = sum(weight)
    plot_lof(predict['local outlier factor'], method)
    # Split outliers and normal points by the outlier threshold
    outliers = predict[predict['local outlier factor'] > method].sort_values(by='local outlier factor')
    inliers = predict[predict['local outlier factor'] <= method].sort_values(by='local outlier factor')
    return outliers, inliers
Copy the code

Test 5 reuses the setting of Test 3, again treating latitude and longitude as independent features and identifying the two dimensions separately (even though latitude and longitude in isolation seem to have little practical meaning).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

posi = pd.read_excel(r'Finished project task data.xls')
lon = np.array(posi["Mission GPS longitude"][:])  # longitude
lat = np.array(posi["Mission GPS latitude"][:])  # latitude
A = list(zip(lat, lon))  # pair up as (latitude, longitude)

posi = pd.read_excel(r'Member information data.xlsx')
lon = np.array(posi["Membership location (GPS) Longitude"][:])  # longitude
lat = np.array(posi["Membership Location (GPS) Latitude"][:])  # latitude
B = list(zip(lat, lon))  # pair up as (latitude, longitude)

# Member-to-task density: take the 5th neighborhood, threshold 100, weights 5 and 1
outliers5, inliers5 = ensemble_lof(A, B, k=5, method=100, weight=[5, 1])

# Plotting
plt.figure('Weighted LOF anomaly-score model')
plt.scatter(np.array(B)[:, 0], np.array(B)[:, 1], s=10, c='b', alpha=0.5)
plt.scatter(np.array(A)[:, 0], np.array(A)[:, 1], s=10, c='green', alpha=0.3)
plt.scatter(outliers5[0], outliers5[1], s=10 + outliers5['local outlier factor'], c='r', alpha=0.2)
plt.title('k = 5, method = 100')

Compared with the recognition models above, the possible outliers are identified more accurately.

4.3.3 Mixed Mode

Mixed mode applies when some features are equally important while others differ in importance, i.e., 4.3.1 and 4.3.2 are considered together: data of equal importance are voted on, while data of different importance are weighted and then converted to votes via thresholds. Programmatically this only requires mixing the two parts, so it is not shown in this article.

4.4 The algorithm runs slowly on large data volumes; is there a way to improve it?

When detecting outliers, the LOF algorithm traverses the whole data set to compute the LOF value of every point, which makes it slow. Moreover, since normal points generally far outnumber outliers, and LOF judges the degree of outlierness by comparing the LOF values of all points, a large amount of unnecessary computation is performed. Pruning the original data can therefore effectively improve the computational efficiency of the LOF method. In practice, such pruning is also found to greatly reduce the probability of data being misjudged as outliers. This cluster-pruning-based detection method is called the CLOF (Cluster-based Local Outlier Factor) algorithm.

CLOF algorithm based on K-means

Before applying the LOF algorithm, use the k-means clustering algorithm to group the original data into $n_{clusters}$ clusters. For each cluster, compute its center and the average distance from all points in the cluster to that center, recorded as the cluster radius. For the points of the cluster: any point whose distance to the center is greater than or equal to the radius is put into the outlier candidate set; finally, the LOF algorithm is run on the candidate set to compute the outliers.

Let the $i$-th cluster $C_i$ contain $n_i$ points, with point set $\{x_1^{(i)}, x_2^{(i)}, \dots, x_{n_i}^{(i)}\}$.

Its center and radius are computed as:

$$c_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_j^{(i)}, \qquad R_i = \frac{1}{n_i}\sum_{j=1}^{n_i} d\left(x_j^{(i)}, c_i\right)$$
How to determine the best $n_{clusters}$ — the elbow rule

The k-means algorithm takes a specified number of clusters $n_{clusters}$, randomly generates cluster centers, iteratively assigns each object to its nearest center, and updates the centers in turn until the best clustering (the minimum value of the cost function) is reached.

The choice of $n_{clusters}$ directly affects the clustering quality. The elbow rule plots the cost function against different values of $n_{clusters}$: as $n_{clusters}$ increases, each cluster contains fewer samples, the samples are closer to their centers, and the cost function decreases; but as $n_{clusters}$ keeps growing, the improvement in the cost function diminishes (the curve flattens out). The value at which the improvement drops most sharply corresponds to the elbow, and using that $n_{clusters}$ generally achieves good results. The elbow rule only considers cost, however, and sometimes practical considerations must be taken into account as well.

Because outlier samples are generally few, if a resulting cluster is very small (say 1-4 samples while other clusters have hundreds or thousands), that cluster should not be pruned.

Define the cost function:

$$J = \frac{1}{n}\sum_{j=1}^{n} \min_{1 \le i \le n_{clusters}} d\left(x_j, c_i\right)$$

i.e., the mean distance from each sample to its nearest cluster center.
CLOF local outlier detection algorithm flow

elbow_method(data, maxtest=11)

  • Input: data set to prune, maximum number of clusters to test (default 11)
  • Output: cost curve
def elbow_method(data, maxtest=11):
    '''Use the elbow rule to determine the n_clusters value'''
    from sklearn.cluster import KMeans
    from scipy.spatial.distance import cdist
    import numpy as np
    import matplotlib.pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    ax = plt.figure(figsize=(8, 4)).add_subplot(111)
    N_test = range(1, maxtest)
    # List of cost function values
    meandistortions = []
    for n_clusters in N_test:
        model = KMeans(n_clusters=n_clusters).fit(data)
        # Cost: mean distance from each sample to its nearest cluster center
        meandistortions.append(sum(np.min(cdist(data, model.cluster_centers_, 'euclidean'), axis=1)) / len(data))
    plt.plot(N_test, meandistortions, 'bx-', alpha=0.4)
    plt.xlabel('k')
    plt.ylabel('Cost function', fontsize=12)
    plt.title('Determining the optimal n_clusters with the elbow rule', fontsize=12)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.set_xticks(np.arange(0, maxtest, 1))
    plt.show()

Cost curves for different cluster numbers, suggesting that 2-4 may be ideal values.

kmeans(data, n_clusters, m)

  • Input: data set to prune, number of clusters, minimum number of sample points a prunable cluster must contain
  • Output: pruned data set
def kmeans(data, n_clusters, m):
    '''Prune the data set with the k-means algorithm'''
    from sklearn.cluster import KMeans
    import collections
    import numpy as np
    data_select = []
    model = KMeans(n_clusters=n_clusters).fit(data)
    centeroids = model.cluster_centers_
    label_pred = model.labels_
    for k, v in collections.Counter(label_pred).items():
        if v < m:
            # Clusters that are too small are kept whole (not pruned)
            data_select += np.array(data)[label_pred == k].tolist()
        else:
            distance = np.sqrt(((np.array(data)[label_pred == k] - centeroids[k]) ** 2).sum(axis=1))
            R = distance.mean()
            # Keep only points at or beyond the cluster radius R
            data_select += np.array(data)[label_pred == k][distance >= R].reshape(-1, np.array(data).shape[-1]).tolist()
    return np.array(data_select)

Test 6 prunes data set B, which contains 1877 records.

# Use the elbow rule to determine the best pruning
elbow_method(B, maxtest=11)

Cost curve of data set B: the figure suggests 3 may be the ideal number of clusters.

# With n_clusters set as suggested above, set the minimum cluster size to 3
B_cut = kmeans(B, n_clusters=3, m=3)

Running the above code prunes data set B, which originally contained 1877 records, down to a smaller set of 719 records. Outlier detection with the LOF algorithm then gives the following results:

# Member distribution density: take the 10th neighborhood, threshold 3 (LOF > 3 counts as an outlier)
outliers6, inliers6 = lof(B_cut, k=10, method=3)
# Plotting
plt.figure('CLOF outlier detection')
plt.scatter(np.array(B)[:, 0], np.array(B)[:, 1], s=10, c='b', alpha=0.5)
plt.scatter(outliers6[0], outliers6[1], s=10 + outliers6['local outlier factor'] * 10, c='r', alpha=0.2)
plt.title('k = 10, method = 3')

With CLOF, the accuracy improves further and the misjudgment rate of the data decreases.

After pruning, the LOF values of the remaining points are no longer as pronounced, but an outlier's LOF is still a large value, and the larger k is, the more obvious the separation.

4.5 How to improve the identification accuracy

Reasonably increasing the value of k can significantly improve identification accuracy. However, increasing k also brings unnecessary computation and affects the efficiency of the algorithm, which is why this article tends to take smaller k values. Choosing k and the threshold sensibly is the key to success.

A threshold of 1.2 is chosen to show how identification accuracy improves as k increases, while the computation time grows longer and longer (only one run per setting was taken; with small samples and small k the effect is not obvious, so a strictly increasing relation is not shown, but the time is clearly longer in the last figure). The bubble sizes also show that raising the threshold appropriately can improve identification accuracy as well; it does not all depend on k.
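A hypothetical timing sketch along these lines (not from the original article), using the lof() wrapper from 3.1.3 and the A/B data from the tests above:

import time

for k in [1, 5, 10, 20, 30]:
    start = time.time()
    outliers, inliers = lof(B, A, k=k, method=1.2)
    print('k=%d: %d outliers, %.2f s' % (k, len(outliers), time.time() - start))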

V. Afterword

This article mainly summarizes the original paper on the algorithm together with the author's own learning experience. In the field of anomaly recognition, the LOF algorithm and the Isolation Forest algorithm have been singled out as the best-performing ones with the best recognition effect. The LOF outlier score can be regarded as a "high-end" substitute (see reference [3]) for the measures commonly used to characterize population density; compared with traditional methods it has more mature theoretical support.

If time permits, the author will later explain in detail, following the methods in this article and using real data, how to apply the LOF algorithm in data processing.

VI. References

[1] Wikipedia. Moore-Penrose inverse [EB/OL]. [2018-06-02]. https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse

[2] Breunig M M, Kriegel H P, Ng R T, et al. LOF: identifying density-based local outliers[C]// ACM SIGMOD International Conference on Management of Data. ACM, 2000: 93-104.

[3] Dong Tianwen, Pan Weidi, Qi Mingjia, Zhang Xiaomin. Task Pricing for "Taking Pictures to Make Money" [J/OL]. China University Students Online, Ministry of Education. [2017-11-13]. http://upload.univs.cn/2017/1113/1510570109509.pdf


Author: Zhang Liu-bin

If you have any questions, please contact QQ: 965579168

Please credit the source when reposting.