Unsupervised learning and supervised learning is the most important difference is that whether the training data contain tag data, in machine learning and development work, often contains a large number of tag data and a small number of tag data, unsupervised methods by learning from samples of unmarked training to excavate the inherent law of data provide the basis for further data analysis.

Clustering algorithm is one of the most commonly used methods in unsupervised learning. Performance measurement is an indicator to measure the merits of learning model, and can also be used as the objective function to optimize learning model. Clustering performance measures can be divided into two categories according to whether training data contains marker data. One is to compare clustering results with marker data, which is called “external indicators”. The other is the direct analysis of clustering results, known as internal indicators. This paper gives a detailed summary of the two performance measures and similarity methods.

directory


1. External indicators

2. Internal metrics

3. Summary of similarity method

4. Summary

1. External indicators

Before going into the details of external metrics, let’s define pairwise paired variables A and B:

A: The number of sample pairs of the dataset belonging to both the same cluster C and the same cluster K

B: the number of sample pairs of the dataset that do not belong to the same cluster C or cluster K

Use a simple example to illustrate the meaning of a and B:

Real cluster vector: [0, 0, 0, 1, 1, 1]

Prediction cluster vector: [0, 0, 1, 1, 2, 2]

A is the number of sample pairs belonging to the same cluster vector, marked with red box:

As shown in the figure above: A = 2;

B is the number of sample pairs whose data sets do not belong to the same cluster C nor to the same cluster K, marked by green boxes:

As shown in the figure above: b = 1;

Knowing the meanings of A and B, the performance measures of external indicators are introduced in detail.

1.1 RI (Rand coefficient)

RI is to measure the similarity of two cluster classes. Assuming that the number of samples is N, it is defined as follows:

Among themIt’s the number of possible sample pairs.

Assumptions:

Real cluster vector: [0, 0, 0, 1, 1, 1]

Prediction cluster vector: [0, 0, 1, 1, 2, 2]

The disadvantage of RI coefficient is that with the increase of the cluster number, the RI of the randomly allocated cluster vector gradually increases, which is not consistent with the theory. RI of the marker vector of the randomly allocated cluster should be 0.

1.2 ARI (Adjusting rand coefficient)

ARI solves the problem that RI cannot well describe the similarity of randomly assigned cluster class marker vector. ARI is defined as follows:

Where E stands for expectation and Max stands for maximum value.

The specific formula of the above formula is as follows:

Where I and j are real and predicted clusters respectively.Represents the number of real cluster class I and predicted cluster class J,The meaning of the following tableThe same,The meaning of the following tableThe same.

Use the above example to explain the calculation process of ARI.

Assumptions:

Real cluster vector: [0, 0, 0, 1, 1, 1]

Prediction cluster vector: [0, 0, 1, 1, 2, 2]

The number of cluster vector pairs in the table:

Therefore, ARI indicators are calculated according to the table:

According to the table:

So: ARI = 0.2424

from sklearn import metrics
# Real clusters
labels_true = [0.0.0.1.1.1]
# Predicted clusters
labels_pred = [0.0.1.1.2.2]
# calculation ARI
ARI = metrics.adjusted_rand_score(labels_true, labels_pred)
ARI

# > 0.24242424242424246
Copy the code

Advantages:

1) The value range of ARI is [-1,1]. The larger ARI is, the higher the similarity between the predicted cluster vector and the real cluster vector is.

2) For any number of clusters and samples, the ARI score of randomly assigned labels is close to 0;

3) It can be used to evaluate the performance measure of non-hypothetical cluster structure, such as comparing the performance of k-means clustering algorithm after dimensionality reduction through spectral clustering;

Disadvantages:

1) It is necessary to know the marking information of real data, so it is difficult to be applied in practice or data can be manually calibrated (similar to supervised learning);

1.3 AMI (Adjusted mutual information Index)

AMI is based on the mutual information score between the predicted cluster vector and the real cluster vector to measure its similarity. The larger AMI is, the higher the similarity is. AMI close to 0 means that the cluster vector is randomly allocated. MI (mutual information index) and NMI (standardized mutual information index) do not conform to the theory of random distribution of cluster vectors, that is, MI and NMI tend to increase with the increase of the number of distributed clusters, and the value of AMI is always close to 0. Therefore, AMI is adopted as the Angle based on mutual information to measure the similarity between cluster vectors. Due to space reasons, AMI calculation formula is not introduced here. Readers can refer to the clustering section of sciKit-learn official website.

Comparison between AMI and MI:

# AMI index
AMI1 = metrics.adjusted_mutual_info_score(labels_true, labels_pred)
print('AMI= ',AMI)
# If the predicted cluster vector is exactly the same as the real cluster vector
labes_pred = labels_true[:]
The similarity of the analysis based on the adjusted mutual information index should be 1
AMI2 = metrics.adjusted_mutual_info_score(labels_true, labes_pred)
print('AMI2= ',AMI2)
The similarity based on the standardized mutual information index is also 1
NMI1 = metrics.normalized_mutual_info_score(labels_true, labes_pred)
print('NMI1= ',NMI1)
# The similarity based on mutual information index is not 1, which is not consistent with the theory
MI = metrics.mutual_info_score(labels_true, labes_pred)
print('MI= ',MI)

# >
AMI1=  0.2250422831983088
AMI2=  1.0
NMI1=  1.0
MI=  0.6931471805599452
Copy the code

If the cluster vector is randomly assigned, when the number of cluster classes and samples is large, the values of MI and NMI are not 0, and the values of ARI and AMI are 0.

Number of cluster classes
cluster_num = 100
# Number of samples
samples = 1000
# distribution function
seed = 45
random_labels = np.random.RandomState(seed).randint
# Randomly assign real cluster vectors
labels_true = random_labels(low=0,high=cluster_num,size=samples)
# Random distribution of prediction cluster vectors
labels_pred = random_labels(low=0,high=cluster_num,size=samples)
# AMI calculates the similarity
AMI = metrics.adjusted_mutual_info_score(labels_true, labels_pred)
print('AMI= ',AMI)
# NMI computes similarity
NMI = metrics.normalized_mutual_info_score(labels_true, labels_pred)
print('NMI= ',NMI)
# MI computes similarity
MI = metrics.mutual_info_score(labels_true, labels_pred)
print('MI= ',MI)
# ARI computes similarity
ARI = metrics.adjusted_mutual_info_score(labels_true, labels_pred)
print('ARI= ',ARI)
Copy the code
# >
AMI=  0.006054902942529268   # is close to zero
NMI=  0.5004078317894934
MI=  2.2775606710406766
ARI=  0.006054902942529268   # is close to zero
Copy the code

Advantages:

1) For any number of clusters and samples, AMI score of randomly assigned labels is close to 0.

2) Bounded range [0,1] : near 0 means that the two labels are largely independent of each other, and close to 1 means consistent cluster vector.

Disadvantages:

1) Such indicators need to know the marking information of real data, so it is difficult to be applied in practice or data can be manually calibrated (similar to supervised learning).

2) The NMI and MI assigned by random labels are not 0.

1.4 Homogeneity, integrity and V-measure

Homogeneity and integrity is based on the mutual information fraction of conditional entropy to measure the similarity between cluster vectors. V-meansure is the harmonic average of homogeneity and integrity.

Homogeneity: each cluster contains only one class member;

Displacement: All the members of a given class were assigned to the same family class;

Homogeneity is 1:

# homogeneity is 1
labels_true = [0.0.0.1.1.1]
labels_pred = [0.0.1.2.2.2]
homogeneity=metrics.homogeneity_score(labels_true, labels_pred)
print('homogeneity= ',homogeneity)

# > homogeneity = 1.0
Copy the code

When integrity is 1:

# completeness is 1
labels_true = [0.1.1.1.2.2]
labels_pred = [0.0.0.0.2.2]
completeness=metrics.completeness_score(labels_true, labels_pred)
print('completeness= ',completeness)

# > completeness = 1.0
Copy the code

V-measure evaluates the similarity between cluster vectors by combining homogeneity and integrity. V-measure 1 indicates the optimal similarity:

# V-measure is 1
labels_true = [0.0.1.1.2.2]
labels_pred = [1.1.2.2.3.3]
completeness=metrics.v_measure_score(labels_true, labels_pred)
print('completeness= ',completeness)

# > completeness = 1.0
Copy the code

If v-measure value is small, we can give qualitative analysis from the aspects of homogeneity and integrity:

# Qualitative analysis of V-measure
labels_true =[0.0.0.1.1.1]
labels_pred =[0.0.0.1.2.3]
metrics.homogeneity_completeness_v_measure(labels_true,labels_pred)

# Order: homogeneity, completeness and V-measure
# > (0.9999999999999997, 0.5578858913022595, 0.7162089270041652)
Copy the code

The results show that the reason for the small V-measure value is the poor homogeneity among the cluster vectors.

Advantages:

1) Bounded scores: 0 represents the lowest similarity, 1 represents the highest similarity, and scores are positively correlated with similarity;

2) Intuitive interpretation: Qualitative analysis of homogeneity and integrity of clustering with poor V-measure is carried out to clarify the error types of clustering algorithm;

3) No assumptions are made on the structure of clusters: for example, the performance of k-means clustering algorithm after dimensionality reduction through spectral clustering is compared;

Disadvantages:

1) Random markers are not standardized, so random markers will not always produce the same homogeneity, completeness and V-measure scores. Especially when the number of cluster classes is large, the V-Meaure value of random label is not 0.

When the number of samples is greater than 1000 and the number of cluster classes is less than 10, such problems can be ignored. For smaller sample sizes or larger cluster numbers, adjustment coefficients (such as ARI and AMI) are recommended. As shown in the figure below, with the increase of the number of cluster classes, ARI and AMI values of randomly labeled are close to 0, v-measure and mutual_info_score (mutual information score) gradually increase.

2) Such indicators need to know the marking information of real data, so it is difficult to be applied in practice or data can be manually calibrated (similar to supervised learning).

1.5 Fowlkes – Mallows scores

Fowlkes-mallows index (FMI) is the geometric mean of the accuracy rate and recall rate:

Where TP is True Positive, that is, the number of sample pairs that the real and predicted labels belong to the same cluster class; FP is False Positive, that is, the number of sample pairs that the real label belongs to the same cluster class and the corresponding predicted label does not belong to the cluster class. FN is False Negative, that is, the number of sample pairs that the prediction label belongs to the same cluster class and the corresponding real label does not belong to the cluster class.

Assuming that the real cluster vector is [0, 0, 0, 1, 1, 1] and the predicted cluster vector is [0, 0, 1, 1, 2, 2], the FMI between the cluster vectors is calculated.

According to the definition:

TP = 2, FP=3, FN=1

Therefore:

Code representation:

# Fowlker-Mallows scores
labels_true =[0.0.0.1.1.1]
labels_pred =[0.0.1.1.2.2]
metrics.fowlkes_mallows_score(labels_true,labels_pred)

#>  0.4714045207910317
Copy the code

Advantages:

1) For any number of clusters and samples, the FMI of random cluster labels is approximately 0;

2) Bounded range: FMI range is [0,1], value close to 0 means randomly assigned labels are largely independent of each other, and value close to 1 means consistent cluster vector;

3) No assumptions are made about the structure of clusters: it can be used to compare the performance of k-means clustering algorithm after dimensionality reduction by spectral clustering;

Disadvantages:

1) Such indicators need to know the marking information of real data, so it is difficult to be applied in practice or data can be manually calibrated (similar to supervised learning).

2. Internal metrics

If the data set label is unknown, the model’s own internal metrics must be used to measure clustering performance.

The internal indicators are summarized as follows:

2.1 Silhouette Coefficient

Each sample has a corresponding contour coefficient, which consists of two scores:

  • A: Average distance between the sample and other sample points in the same cluster class;
  • B: average distance between the sample and all sample points in the nearest cluster;

The contour coefficient of each sample is defined as:

The contour coefficient of a data set is equal to the average of the contour coefficient of each sample in the data set.

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn import datasets
# Generate data set
dataset = datasets.load_iris()
X = dataset.data
y = dataset.target
# Build k-means clustering model
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
# Contour coefficient
metrics.silhouette_score(X, labels, metric='euclidean')

# > 0.5528190123564091
Copy the code

Advantages:

1) The contour coefficient is within the range of [-1,1], -1 represents the wrong clustering,1 represents the high-density clustering, and 0 represents the overlapping clustering;

2) When the cluster density is high and the separation is large, the contour coefficient of the cluster is larger;

Disadvantages:

1) The contour coefficient of convex clusters is higher than that of other types of clusters, such as density-based clusters obtained by DBSCAN.

2.2 Caliniski – Harabaz index

Criteria for evaluating good clustering models: data sets of the same cluster should be as dense as possible, and data sets of different clusters should be as far away as possible.

Define the cluster divergence matrix:

Dispersion matrix between clusters:

Among themIs the sample set of cluster class Q,Is the center of cluster q,Is the sample number of cluster q, and C is the center of all data sets.

According to the related concept of covariance, we use the trace of cluster divergence matrix to represent the density of the same cluster. The smaller the trace is, the more dense the data set of the same cluster is (the smaller the variance is). The trace of the divergence matrix between clusters represents the distance degree between different clusters. The larger the trace is, the larger the distance degree between different clusters is (the larger the variance is).

The Calinski-Harabaz index is defined according to the evaluation criteria of clustering model:

Where N is the number of data set samples, k is the number of cluster classes,.Are the traces of intercluster divergence matrix and cluster-like divergence matrix respectively.

# Caliniski - Harabaz index
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
metrics.calinski_harabaz_score(X, labels)

# > 561.62775662962
Copy the code

Advantages:

1) When clusters are dense and well separated, the higher the Caliniski-Harabaz score is, the better the clustering performance is.

2) Fast calculation speed.

Disadvantages:

1) The Caliniski-Harabaz index of convex clusters is higher than that of other types of clusters, such as density-based clusters obtained by DBSCAN.

2.3 DB Index (Davies-Bouldin Index)

We use the average distance of cluster C to represent the density of the cluster:

The | | C class C the number of clusters, dist (,,,) to calculate the distance between two samples.

The distance between the center of different cluster classes indicates the distance degree of different cluster classes:

Among them.Respectively are cluster classesandIn the center.

DB index is defined based on clustering model evaluation criteria:

The lower limit of DB index is 0. The smaller the DB index, the better the clustering performance.

Advantages:

1) DB index is simpler to calculate than contour coefficient; 2) DB index calculation only needs to know the number and characteristics of the data set;

Disadvantages:

1) The DB index of convex clusters is higher than other types of clusters, such as density-based clusters obtained through DBSCAN.

2) The distance measure of cluster center is limited to Euclidean space

At this point, this section introduces the internal indicators for evaluating the performance characteristics of clustering. The internal indicators are expanded based on similarity. If the similarity of the same cluster is good and the similarity between different clusters is bad, the clustering performance is considered to be good. Internal indicators such as contour coefficient and DB index are used to calculate similarity based on the distance between samples, while Calinski-Harabaz index is used to calculate similarity based on covariance. The next section summarizes the similarity algorithm between samples.

3. Summary of similarity algorithm

The common methods to evaluate the similarity between samples are distance calculation, cosine similarity calculation and kernel function calculation. If the distance between samples is smaller, the similarity will be higher. If the kernel function value between samples is larger, the similarity is higher.

Conversion relation between distance and kernel function:

Where D is the distance between samples, gamma often takes (1/features), and S is the kernel of the mapping.

The common similarity calculation methods are summarized below.

3.1 Distance Calculation

A given samplewith, the most commonly used method to calculate sample distance is Minkowski distance.

Formula:

The p 1 or higher.

When p=2, minkowski distance is Euclidean distance.

# Euclidean distance
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import pairwise_kernels
X = np.array([[2.3]])
Y = np.array([[0.1]])
pairwise_distances(X, Y, metric='euclidean')

# >
array([[2.82842712]])
Copy the code

When P =1, minkowski distance is Manhattan Distance.

# Manhattan Distance
pairwise_distances(X, Y, metric='manhattan')

# >
array([[4.]])
Copy the code

3.2 Cosine similarity calculation

Formula:

# Cosine similarity calculation
pairwise_distances(X,Y,metric='cosine')

# >
array([[0.16794971]])
Copy the code

3.3 linear kernel

# Linear kernel computation
pairwise_kernels(X,Y,metric='linear')

# >
array([[3.]])
Copy the code

3.4 Polynomial kernel function

# Polynomial kernel calculation
coef0 = 1
degress = 3
metrics.pairwise.polynomial_kernel(X, Y, degree=3,gamma=0.2,coef0=1)

# >
array([[4.096]])
Copy the code

3.5 Sigmoid kernel function

# Sigmoid kernel function
metrics.pairwise.sigmoid_kernel(X, Y,gamma=0.2,coef0=1)

# >
array([[0.92166855]])
Copy the code

3.6 RBF kernel function

# RBF kernel
metrics.pairwise.sigmoid_kernel(X, Y,gamma=0.2)

# >
array([[0.92166855]])
Copy the code

3.7 Laplace kernel

# Laplace kernel
metrics.pairwise.laplacian_kernel(X, Y, gamma=0.2)

# >
array([[0.44932896]])
Copy the code

3.8 Chi-square kernel function

# Chi square kernel
metrics.pairwise.chi2_kernel(X, Y, gamma=0.2)

# >
array([[0.54881164]])
Copy the code

4. Summary

This article summarizes the performance measurement method and similarity calculation method of clustering. It mainly refers to the documentation of sciKit-learn website.

reference

Scikit-learn.org/stable/modu…

Scikit-learn.org/stable/modu…

Machine Learning by Zhou Zhihua

Note: the menu of the official account includes an AI cheat sheet, which is very suitable for learning on the commute.

Like articles, click Looking at the