Introduction to DBSCAN

Its biggest advantage is that it can find clusters of arbitrary shape, whereas many traditional clustering algorithms (k-means, for example) only work well on convex clusters.
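The difference on non-convex data can be checked with a small experiment. The sketch below is only an illustration: the two-moons dataset, the parameter values (noise=0.05, eps=0.2, min_samples=5) and the use of the adjusted Rand index as a score are all assumptions chosen for this example, not settings from the rest of this post.

```python
from sklearn.cluster import KMeans, dbscan
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: a standard non-convex test case.
# noise, eps and min_samples are illustrative choices (assumptions).
X, y_true = make_moons(500, noise=0.05, random_state=1)

kmeans_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
_, dbscan_ids = dbscan(X, eps=0.2, min_samples=5)

# Adjusted Rand index: 1.0 means perfect agreement with the true moons
ari_kmeans = adjusted_rand_score(y_true, kmeans_ids)
ari_dbscan = adjusted_rand_score(y_true, dbscan_ids)
print(f"k-means ARI: {ari_kmeans:.2f}, DBSCAN ARI: {ari_dbscan:.2f}")
```

On data like this, DBSCAN should score clearly higher than k-means, because k-means splits the plane with a straight boundary that cuts through both moons.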

Two parameters:

The neighborhood radius R and the minimum number of points minpoints. When the number of points within radius R of a point is greater than or equal to minpoints, that neighborhood is considered dense.

Addendum: choosing the radius R empirically

For every point, compute its k-distance (the distance to its k-th nearest neighbor); this gives the k-distance set E. Sort E in ascending order to get the sorted set E', then plot the sorted k-distances as a curve. By inspecting the curve, find the position where it changes sharply; the k-distance value at that position is taken as the radius Eps (the R above).
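The sorted k-distance set can be computed in a few lines. This is a minimal NumPy sketch: the helper name sorted_k_distances, the toy data and the choice k=4 are hypothetical and only serve the illustration.

```python
import numpy as np

def sorted_k_distances(X, k):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)

rng = np.random.default_rng(0)
# Hypothetical toy data: one dense blob plus a few scattered outliers
blob = rng.normal(0.0, 0.3, size=(200, 2))
outliers = rng.uniform(-5.0, 5.0, size=(10, 2))
X = np.vstack([blob, outliers])

k_dist = sorted_k_distances(X, k=4)
```

Plotting k_dist (for example with plt.plot(k_dist)) gives the curve described above; the value where it bends sharply upward is a candidate for Eps. The full pairwise distance matrix needs O(n^2) memory, so this form is only suitable for small datasets.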

There are three types of points: core points, boundary points and noise points.

A point whose neighborhood of radius R contains at least minpoints sample points is called a core point. A point that is not a core point but lies in the neighborhood of a core point is called a boundary point. Points that are neither core points nor boundary points are noise points.
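These three definitions translate almost directly into code. The following sketch is an illustration, not how sklearn implements it: the function name classify_points and the toy data are hypothetical, and it assumes that a point counts as a member of its own neighborhood (the convention sklearn uses for min_samples).

```python
import numpy as np

def classify_points(X, radius, min_points):
    """Label each point 'core', 'boundary' or 'noise'."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    within = dists <= radius  # neighborhood membership; each point includes itself
    is_core = within.sum(axis=1) >= min_points
    # Boundary: not core, but at least one core point lies in its neighborhood
    near_core = (within & is_core[None, :]).any(axis=1)
    is_boundary = ~is_core & near_core
    kinds = np.full(len(X), "noise", dtype=object)
    kinds[is_boundary] = "boundary"
    kinds[is_core] = "core"
    return kinds

# Hypothetical toy data: four close points, one at the edge, one far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0],
              [0.7, 0.0], [5.0, 5.0]])
kinds = classify_points(X, radius=0.45, min_points=4)
print(list(kinds))
# ['core', 'core', 'core', 'core', 'boundary', 'noise']
```

The point at x = 0.7 has only two points in its neighborhood, but one of them (x = 0.3) is a core point, so it becomes a boundary point; the isolated point at (5, 5) is noise.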

Sklearn example


Generate the sample points

import numpy as np
import pandas as pd
from sklearn import datasets
# Jupyter magic; remove this line when running as a plain script
%matplotlib inline

# Generate 500 points forming two interleaving half-moons
X, _ = datasets.make_moons(500, noise=0.1, random_state=1)
df = pd.DataFrame(X, columns=['feature1', 'feature2'])
df.plot.scatter('feature1', 'feature2', s=100, alpha=0.6, title='dataset by make_moon')

Call the DBSCAN interface to complete the clustering

from sklearn.cluster import dbscan

# eps is the neighborhood radius, and min_samples is the minimum number of points
core_samples, cluster_ids = dbscan(X, eps=0.2, min_samples=20)
# A cluster_ids value of -1 means the corresponding point is a noise point

df = pd.DataFrame(np.c_[X, cluster_ids], columns=['feature1', 'feature2', 'cluster_id'])
# np.c_ turns the labels into floats; cast them back to 16-bit integers
df['cluster_id'] = df['cluster_id'].astype('i2')

df.plot.scatter('feature1', 'feature2', s=100,
    c=list(df['cluster_id']), cmap='rainbow', colorbar=False,
    alpha=0.6, title='DBSCAN cluster result')
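Besides plotting, it is often useful to summarize the labels: how many clusters were found, how large each one is, and how many points were marked as noise. A minimal sketch follows; the hard-coded array is a hypothetical stand-in for the cluster_ids returned by dbscan.

```python
import numpy as np

# Hypothetical label array standing in for the cluster_ids returned by dbscan
cluster_ids = np.array([0, 0, 1, -1, 1, 0, -1, 1, 1])

noise_mask = cluster_ids == -1  # -1 marks noise points
n_noise = int(noise_mask.sum())

# Count the members of each real cluster (noise excluded)
labels, counts = np.unique(cluster_ids[~noise_mask], return_counts=True)
n_clusters = len(labels)
sizes = dict(zip(labels.tolist(), counts.tolist()))

print(n_clusters, n_noise, sizes)
# 2 2 {0: 3, 1: 4}
```

If most points end up as noise, eps is probably too small or min_samples too large; a single huge cluster suggests the opposite.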

An example I modified myself, to understand it better

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
# Jupyter magic; remove this line when running as a plain script
%matplotlib inline

# Generate data: 100 points scattered around the line y = 0.75x + 3
X = np.empty((100, 2))
X[:, 0] = np.random.uniform(0., 100., size=100)
X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
df = pd.DataFrame(X, columns=['feature1', 'feature2'])
df.plot.scatter('feature1', 'feature2')
print(df)

Call the DBSCAN interface to complete the clustering

from sklearn.cluster import dbscan
# eps is the neighborhood radius, and min_samples is the minimum number of points
core_samples, cluster_ids = dbscan(X, eps=10, min_samples=3)
df = pd.DataFrame(np.c_[X, cluster_ids], columns=['feature1', 'feature2', 'cluster_id'])
# Optional: np.c_ turns the labels into floats; this casts them back to 16-bit integers
# df['cluster_id'] = df['cluster_id'].astype('i2')
df.plot.scatter('feature1', 'feature2', s=100,
    c=list(df['cluster_id']), cmap='rainbow', colorbar=False,
    alpha=0.6, title='DBSCAN cluster result')