Clustering is often used in the early stages of data exploration and mining. It is an exploratory analysis that requires no prior knowledge, and it is also well suited to data preprocessing when the sample size is large. For example, before any domain information or experience about an enterprise's users is available, the user base can be segmented into groups according to the characteristics of the data itself, and each group can then be analyzed further. Similarly, continuous data can be discretized with clustering to facilitate subsequent classification analysis.
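
For instance, a one-dimensional K-means can bin a continuous variable into a small number of discrete levels. The sketch below uses made-up order-amount data purely for illustration; sklearn's KBinsDiscretizer with strategy='kmeans' wraps the same idea in a single transformer.

import numpy as np
from sklearn.cluster import KMeans

amounts = np.random.RandomState(0).exponential(scale=500, size=(1000, 1))  # Made-up continuous order amounts
levels = KMeans(n_clusters=3, random_state=0).fit_predict(amounts)  # Each sample receives a discrete level 0/1/2
print(np.bincount(levels))  # Number of samples that fall into each level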

Common clustering algorithms fall into partition-based, hierarchical, density-based, grid-based, statistical, model-based and other types. Typical algorithms include K-means (the classic clustering algorithm), DBSCAN, two-step clustering, BIRCH, spectral clustering and so on.

Cluster analysis can answer questions such as: how many groups a data set can be divided into, how many samples fall into each group, how strongly the variables are related within each group, and what the typical characteristics of each group are. Beyond grouping itself, clustering also supports applications built on grouping, such as image compression. However, clustering cannot provide a clear direction for action; its results mainly serve as preprocessing and reference material for later mining and analysis, and it cannot answer the questions of “why” and “what to do”.
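
Image compression with clustering is essentially color quantization: cluster the pixel colors and replace every pixel with its cluster center. A minimal sketch, assuming a random array stands in for a real RGB image:

import numpy as np
from sklearn.cluster import KMeans

img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)  # Stand-in for a real RGB image
pixels = img.reshape(-1, 3)  # One row per pixel
km = KMeans(n_clusters=16, random_state=0).fit(pixels)  # Keep 16 representative colors
compressed = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)  # Rebuild the image from the 16 colors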

The influence of abnormal data on clustering results

K-means is one of the most commonly used clustering methods. It assigns each point to its best-fitting cluster based on the distance similarity between points. However, two kinds of data anomalies must be dealt with before applying K-means:

  1. Outliers in the data. Outliers can significantly distort the distance similarity between points, and their effect on the result is substantial. They therefore need to be handled before any method that discriminates on distance similarity is applied.
  2. Inconsistent scales across dimensions. If different dimensions or variables differ in numerical scale or unit, they must be normalized or standardized before distances are computed (see the sketch after this list). For example, bounce rate is distributed in [0, 1], order amount may be in [0, 10,000,000], and order quantity may be in [0, 1,000]. Without normalization or standardization, the similarity would be dominated by the order amount.
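
A minimal sketch of the rescaling step, using three made-up rows with exactly those kinds of mismatched scales (the values and the choice of MinMaxScaler are illustrative; StandardScaler works the same way):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Columns: bounce rate, order amount, order quantity -- wildly different scales
data = np.array([[0.35, 1200000.0, 420.0],
                 [0.62, 80000.0, 35.0],
                 [0.18, 5400000.0, 910.0]])
data_scaled = MinMaxScaler().fit_transform(data)  # Rescale every column to [0, 1]
labels = KMeans(n_clusters=2, random_state=0).fit_predict(data_scaled)  # Distances now weight all columns fairly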

Should the K-means algorithm be abandoned when the data volume is very large?

K-means performs very well in terms of algorithmic stability, efficiency and accuracy (measured against true labels), and it continues to do so on large data sets. The upper bound of its time complexity is O(n*k*t), where n is the sample size, k is the number of clusters and t is the number of iterations. With the number of clusters and the number of iterations held constant, the time consumed by K-means depends only on the sample size and therefore grows linearly with it.
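
As a rough illustration of that linear trend, the following sketch (not from the original text; the synthetic blob sizes are arbitrary) times K-means fits on data sets of increasing size while k and the iteration cap stay fixed:

import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

for n in (10000, 50000, 100000):  # Growing sample sizes, k and max_iter fixed
    data, _ = make_blobs(n_samples=n, centers=3, n_features=2, random_state=0)
    start = time.time()
    KMeans(n_clusters=3, max_iter=300, random_state=0).fit(data)
    print('n=%d  fit time: %.2fs' % (n, time.time() - start))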

Dealing with clustering of high-dimensional data

When clustering high-dimensional data, the traditional clustering methods commonly used in low-dimensional space cannot achieve satisfactory results: the clustering computation takes a long time, and the accuracy and stability of the results drop sharply compared with the true label classification. Why do these problems appear in high-dimensional space?

  • Distance-based similarity computation becomes very inefficient for high-dimensional data.
  • With a large number of attributes, the probability that clusters exist across all dimensions is very low.
  • Because of data sparsity and the vanishing contrast between near and far neighbors, distance-based similarities are almost all close to zero, which makes it hard for data clusters to exist in high-dimensional space.

There are two main methods to deal with high-dimensional data clustering: dimension reduction and subspace clustering.

  • Dimension reduction is an effective way to handle high-dimensional data. Through feature selection or dimension transformation, the high-dimensional space is reduced or mapped to a low-dimensional space, which tackles the high-dimensionality problem directly (see the sketch after this list).
  • Subspace clustering extends traditional clustering algorithms to high-dimensional data space. The idea is to select the dimensions closely related to a given cluster and then cluster within the corresponding subspace; spectral clustering, for example, is a subspace clustering method. Because the way the relevant dimensions are selected and the way the subspaces are evaluated both have to be customized, this approach places high demands on the practitioner.
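
A minimal sketch of the dimension-reduction route, assuming PCA as the transformation method and synthetic 50-dimensional data (the component count and data are illustrative only):

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

high_dim_X, _ = make_blobs(n_samples=500, n_features=50, centers=3, random_state=0)  # 50-dimensional data
low_dim_X = PCA(n_components=5).fit_transform(high_dim_X)  # Map it into a 5-dimensional space
labels = KMeans(n_clusters=3, random_state=0).fit_predict(low_dim_X)  # Cluster in the reduced space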

How to select the clustering analysis algorithm

There are dozens of clustering algorithms. The choice of algorithm mainly depends on the following factors:

  • If the data set is high-dimensional, choose spectral clustering, which is a subspace partitioning method.
  • If the data volume is small to medium, for example within 1,000,000 records, K-means is a good choice; if it exceeds 1,000,000 records, consider Mini Batch KMeans (see the sketch after this list).
  • If the data set contains noise points (outliers), the density-based DBSCAN handles them effectively.
  • If higher classification accuracy is the goal, spectral clustering tends to be more accurate than K-means.
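
The large-sample and noisy-data cases might look like the following sketch; the stand-in data set and the eps/min_samples values are illustrative and would need tuning on real data:

from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans, DBSCAN

X_demo, _ = make_blobs(n_samples=10000, centers=3, random_state=0)  # Stand-in data set

# Very large sample sizes: mini-batch K-means processes the data in small batches
labels_mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, random_state=0).fit_predict(X_demo)

# Noisy data: DBSCAN marks outliers with the label -1 instead of forcing them into a cluster
labels_db = DBSCAN(eps=0.3, min_samples=10).fit_predict(X_demo)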

Python cluster analysis

import numpy as np 
import matplotlib.pyplot as plt


# Data preparation
raw_data = np.loadtxt('./pythonlearn/cluster.txt')  # Import the data file
X = raw_data[:, :-1]  # Split out the columns to be clustered
y_true = raw_data[:, -1]  # Split out the true label column

print(X)

[[ 0.58057881  0.43199283]
 [ 1.70562094  1.16006288]
 [ 0.8016818  -0.51336891]
 ...
 [-0.75715533 -1.41926816]
 [-0.34736103  0.84889633]
 [ 0.61103884  0.46151157]]

Training the clustering model

The number of clusters is set to 3 and a K-means model object is created. The model is then trained with the fit method, and the cluster label set y_pre for the original training data is obtained with the predict method. After fit has been applied, the training-set cluster labels can also be read directly from the labels_ attribute of the clustering object. From the fitted model, the cluster_centers_ attribute gives the coordinates of each cluster center, and the inertia_ attribute gives the sum of squared distances from each sample to its closest cluster center.

from sklearn.cluster import KMeans # Import the Sklearn clustering module

n_clusters = 3 # Set the number of clusters
model_kmeans = KMeans(n_clusters=n_clusters, random_state=0) # Create cluster model objects


Knowledge: Save the algorithm to the hard disk

Python's built-in pickle standard library (cPickle in Python 2) is an effective way to do this. It can serialize and persist any type of Python object, and model objects are no exception. Its two main methods are dump and load.

  • Dump: serializes Python objects to a local file
  • Load: Reads Python objects from local files and restores instance objects

Note: use pickle in Python 3 and cPickle in Python 2.

import pickle

model_kmeans.fit(X)  # Train the clustering model
pickle.dump(model_kmeans, open("my_model_object.pkl", "wb"))  # Serialize the trained model to a local file
model_kmeans2 = pickle.load(open("my_model_object.pkl", "rb"))  # Restore the model object from the file

y_pre = model_kmeans2.predict(X)  # Predict cluster labels with the restored model
print(y_pre)
[1 1 2 0 0 1 1 0 0 0 1 2 1 2 1 0 2 2 2 2 2 2 1 1 0 0 0 0 1 2 2 2 2 0 0 0 1 1 1 2 2 2 2 2 2 1 1 0 ...]

Model effect evaluation

from sklearn import metrics  # Import the sklearn metrics module

n_samples, n_features = X.shape  # Total sample size and total number of features
print('samples: %d \t features: %d' % (n_samples, n_features))  # Print the sample size and number of features

samples: 1000 	 features: 2

Evaluation indicator 1: inertias, within-cluster sum of squared distances

inertia_ is an attribute of the fitted K-means model object: the sum of squared distances from each sample to its closest cluster center. It serves as an unsupervised evaluation metric when no true labels are available. The smaller the value, the better; a smaller value indicates that the samples are more tightly concentrated around their cluster centers, i.e. the within-cluster distances are smaller.

inertias = model_kmeans.inertia_  # Sum of squared distances from the samples to their closest cluster centers
print(inertias)
300.1262936093466

Evaluation indicator 2: adjusted_rand_s, Adjusted Rand Index (ARI)

The Rand Index measures the similarity between two cluster assignments by considering all sample pairs and counting the pairs that are placed in the same or in different clusters in both the predicted and the true clustering. The Adjusted Rand Index corrects this measure so that its expected value is close to 0 regardless of sample size and number of categories. Its range is [-1, 1]: negative values indicate a poor result, and the closer the value is to 1, the better, meaning the clustering agrees more closely with the true situation.

adjusted_rand_s = metrics.adjusted_rand_score(y_true, y_pre)  # Adjusted Rand Index
print(adjusted_rand_s)
0.9642890803276076

Evaluation indicator 3: mutual_info_s, Mutual Information (MI)

Mutual information measures the amount of information about one random variable contained in another. Here it measures the agreement between two label assignments of the same data. The result is non-negative.

mutual_info_s = metrics.mutual_info_score(y_true, y_pre)  # mutual information
print(mutual_info_s)
1.0310595406681184

Evaluation indicator 4: adjusted_mutual_info_s, Adjusted Mutual Information (AMI)

Adjusted mutual information is a corrected version of the mutual information score. It accounts for the fact that MI is generally higher when the number of clusters is larger, regardless of whether more information is actually shared, and it corrects for this effect. When two cluster assignments are identical (an exact match), AMI returns 1; for random (independent) label assignments the expected AMI is about 0, and it may also be negative.

adjusted_mutual_info_s = metrics.adjusted_mutual_info_score(y_true, y_pre)  # Adjusted mutual information
print(adjusted_mutual_info_s)
0.938399249349474
/anaconda3/lib/python3.6/site-packages/sklearn/metrics/cluster/supervised.py:732: FutureWarning: The behavior of AMI will change in version 0.22. To match the behavior of 'v_measure_score', AMI will use average_method='arithmetic' by default.

Evaluation indicator 5: homogeneity_s, Homogeneity score

A clustering result satisfies homogeneity if every cluster contains only data points that belong to a single class. The value range is [0, 1]; the larger the value, the more consistent the clustering result is with the true situation.

homogeneity_s = metrics.homogeneity_score(y_true, y_pre)  # Homogeneity score
print(homogeneity_s)
0.9385116928897981

Evaluation indicator 6: completeness_s, Completeness score

A clustering result satisfies completeness if all data points that are members of a given class are assigned to the same cluster. The value range is [0, 1]; the larger the value, the more consistent the clustering result is with the true situation.

completeness_s = metrics.completeness_score(y_true, y_pre)  # Completeness score
print(completeness_s)
0.9385372785555511

Evaluation indicator 7: v_measure_s, V-measure score

The V-measure is the harmonic mean of homogeneity and completeness: v = 2 * (homogeneity * completeness) / (homogeneity + completeness). The value range is [0, 1]; the larger the value, the more consistent the clustering result is with the true situation.

v_measure_s = metrics.v_measure_score(y_true, y_pre)  # V-measure score
print(v_measure_s)
0.938524485548298
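
As a quick check of the formula above, the harmonic mean of the homogeneity and completeness scores computed earlier reproduces the V-measure value:

v_check = 2 * homogeneity_s * completeness_s / (homogeneity_s + completeness_s)  # Harmonic mean of the two scores
print(v_check)  # Matches the v_measure_s value printed above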

Evaluation indicator 8: silhouette_s, Silhouette coefficient

The average silhouette coefficient over all samples is computed from each sample's mean intra-cluster distance and its mean distance to the nearest neighboring cluster. It is an unsupervised evaluation metric. The best value is 1 and the worst is -1; values near 0 indicate overlapping clusters, and negative values usually indicate that samples have been assigned to the wrong cluster.

silhouette_s = metrics.silhouette_score(X, y_pre, metric='euclidean')  # Average silhouette coefficient
print(silhouette_s)
0.6342086134083013

Evaluation indicator 9: calinski_harabaz_s, Calinski-Harabasz score

The score is defined as the ratio of between-cluster dispersion to within-cluster dispersion; the larger the value, the better separated the clusters. It is an unsupervised evaluation metric.

calinski_harabaz_s = metrics.calinski_harabaz_score(X, y_pre)  # Calinski and Harabaz score
print(calinski_harabaz_s)
2860.8215946947635

Visualization of model effects

centers = model_kmeans.cluster_centers_  # Coordinates of each cluster center
colors = ['#4EACC5', '#FF9C34', '#4E9A06']  # Set a color for each category
plt.figure()  # Create the canvas
for i in range(n_clusters):  # Loop over the categories
    index_sets = np.where(y_pre == i)  # Find the index set of samples in the same cluster
    cluster = X[index_sets]  # Take the subset of samples belonging to this cluster
    plt.scatter(cluster[:, 0], cluster[:, 1], c=colors[i], marker='.')  # Plot the sample points of the cluster
    plt.plot(centers[i][0], centers[i][1], 'o', markerfacecolor=colors[i], markeredgecolor='k',
             markersize=6)  # Plot the center of the cluster
plt.show()  # Display the image

    

Model application

new_X = [[1, 3.6]]  # A new data point with two feature values
cluster_label = model_kmeans.predict(new_X)  # Predict which cluster the new point belongs to
print('cluster of new data point is: %d' % cluster_label)
cluster of new data point is:  1