These are my learning notes on Zhou Zhihua's book "Machine Learning", recording the learning process; this post covers Chapter 9, Clustering.

1 Clustering task

Clustering is a typical "unsupervised learning" task: the label information of the training samples is unknown, and the goal is to reveal the intrinsic properties and laws of the data by learning from unlabeled training samples. Clustering divides the samples of a data set into several usually disjoint subsets, each of which is called a "cluster".

Formally, suppose the sample set $D=\{x_1,x_2,\dots,x_m\}$ contains $m$ unlabeled samples, where each sample $x_i=(x_{i1};x_{i2};\dots;x_{in})$ is an $n$-dimensional feature vector. A clustering algorithm divides the sample set $D$ into $k$ disjoint clusters $\{C_l \mid l=1,2,\dots,k\}$. We use $\lambda_j \in \{1,2,\dots,k\}$ to denote the cluster label of sample $x_j$, i.e. $x_j \in C_{\lambda_j}$, so the clustering result can be represented by a cluster label vector $\lambda=(\lambda_1;\lambda_2;\dots;\lambda_m)$ with $m$ elements.

2 Performance Measurement

The clustering performance measure is also known as the “validity index”. To solve the clustering problem, intuitively, we hope that “birds of a feather flock together”, that is, samples of the same cluster are as similar as possible to each other, and samples of different clusters are as different as possible. In other words, the “intra-cluster similarity” of clustering results should be relatively high, and the “inter-cluster similarity” should be relatively low.

There are roughly two categories of indicators for clustering performance measurement:

  • External index: Compare the clustering results with a reference model
  • Internal index: Directly examine the clustering results without using any reference model

For data set $D=\{x_1,x_2,\dots,x_m\}$, suppose the clustering result is the cluster partition $C=\{C_1,C_2,\dots,C_k\}$ and a reference model gives the partition $C^*=\{C_1^*,C_2^*,\dots,C_s^*\}$. Accordingly, let $\lambda$ and $\lambda^*$ denote the cluster label vectors of $C$ and $C^*$ respectively. Considering the samples in pairs, we define:


$$
a=|SS|, \quad SS=\{(x_i,x_j)\mid \lambda_i=\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\},\\
b=|SD|, \quad SD=\{(x_i,x_j)\mid \lambda_i=\lambda_j,\ \lambda_i^*\neq \lambda_j^*,\ i<j\},\\
c=|DS|, \quad DS=\{(x_i,x_j)\mid \lambda_i\neq \lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\},\\
d=|DD|, \quad DD=\{(x_i,x_j)\mid \lambda_i\neq \lambda_j,\ \lambda_i^*\neq \lambda_j^*,\ i<j\}
$$

Here $SS$ contains the sample pairs that fall in the same cluster in both $C$ and $C^*$; the other three sets are defined analogously (same or different cluster in $C$ versus $C^*$).

Since each sample pair can appear in only one of these sets, we have $a+b+c+d=\frac{m(m-1)}{2}$.

Based on the above four formulas, we can deduce the following commonly used external indicators for measuring clustering performance:

  • The Jaccard coefficient


    $$JC=\frac{a}{a+b+c}$$
  • FM index


    $$FMI=\sqrt{\frac{a}{a+b}\cdot \frac{a}{a+c}}$$
  • Rand index


    $$RI=\frac{2(a+d)}{m(m-1)}$$

The values of the above three external indices all lie in $[0,1]$, and the larger the value, the better.
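
To make the computation concrete, here is a small NumPy sketch (my own code, not from the book) that counts the pair sets $a,b,c,d$ from a clustering label vector and a reference label vector and then computes JC, FMI and RI; the function name and the toy data are assumptions for illustration.

```python
import numpy as np

def external_indices(labels, labels_ref):
    """Compute JC, FMI and RI from a clustering and a reference model."""
    labels = np.asarray(labels)
    labels_ref = np.asarray(labels_ref)
    m = len(labels)
    a = b = c = d = 0
    for i in range(m):
        for j in range(i + 1, m):                       # each sample pair (i < j) once
            same = labels[i] == labels[j]               # same cluster in C
            same_ref = labels_ref[i] == labels_ref[j]   # same cluster in C*
            if same and same_ref:
                a += 1
            elif same and not same_ref:
                b += 1
            elif not same and same_ref:
                c += 1
            else:
                d += 1
    jc = a / (a + b + c)
    fmi = np.sqrt((a / (a + b)) * (a / (a + c)))
    ri = 2 * (a + d) / (m * (m - 1))
    return jc, fmi, ri

# toy usage
print(external_indices([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))
```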

Considering the cluster partition $C=\{C_1,C_2,\dots,C_k\}$ obtained by clustering, define:


$$
avg(C)=\frac{2}{|C|(|C|-1)} \sum_{1\le i<j\le |C|} dist(x_i,x_j)\\
diam(C)=\max_{1\le i<j\le |C|} dist(x_i,x_j)\\
d_{min}(C_i,C_j)=\min_{x_i\in C_i,\, x_j\in C_j} dist(x_i,x_j)\\
d_{cen}(C_i,C_j)=dist(\mu_i,\mu_j)
$$

Here $dist(\cdot,\cdot)$ denotes the distance between two samples, and $\mu$ denotes the center of cluster $C$, i.e. $\mu=\frac{1}{|C|}\sum_{1\le i\le |C|}x_i$. $avg(C)$ is the average distance between the samples within cluster $C$, $diam(C)$ is the largest distance between samples within cluster $C$, $d_{min}(C_i,C_j)$ is the distance between the two closest samples of clusters $C_i$ and $C_j$, and $d_{cen}(C_i,C_j)$ is the distance between the centers of the two clusters.
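
These four quantities translate almost directly into code. Below is a minimal sketch (my own helper names, with Euclidean distance standing in for $dist$ and each cluster represented as a NumPy array of samples):

```python
import numpy as np

def dist(x, y):
    """Euclidean distance between two samples."""
    return np.linalg.norm(x - y)

def avg(C):
    """Average intra-cluster distance; C is an (n_samples, n_features) array."""
    n = len(C)
    s = sum(dist(C[i], C[j]) for i in range(n) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))

def diam(C):
    """Largest distance between two samples of cluster C."""
    n = len(C)
    return max(dist(C[i], C[j]) for i in range(n) for j in range(i + 1, n))

def d_min(Ci, Cj):
    """Distance between the two closest samples of clusters Ci and Cj."""
    return min(dist(x, z) for x in Ci for z in Cj)

def d_cen(Ci, Cj):
    """Distance between the centers of clusters Ci and Cj."""
    return dist(Ci.mean(axis=0), Cj.mean(axis=0))
```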

Based on the above four formulas, common internal indicators of clustering performance measurement can be derived:

  • Davies-Bouldin Index (DBI)


    $$DBI=\frac{1}{k}\sum_{i=1}^k\max_{j\neq i}\left(\frac{avg(C_i)+avg(C_j)}{d_{cen}(C_i,C_j)}\right)$$
  • Dunn index


    $$DI=\min_{1\le i\le k}\left\{\min_{j \neq i}\left(\frac{d_{min}(C_i,C_j)}{\max_{1\le l\le k}diam(C_l)}\right)\right\}$$

The smaller the DBI value, the better, and the larger the DI value, the better.
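
Reusing the helpers from the previous sketch, DBI and DI can be computed as below; again this is only an illustrative sketch, with `clusters` assumed to be a list of NumPy arrays, one array per cluster.

```python
def dbi(clusters):
    """Davies-Bouldin index: smaller is better."""
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (avg(clusters[i]) + avg(clusters[j])) / d_cen(clusters[i], clusters[j])
            for j in range(k) if j != i
        )
    return total / k

def di(clusters):
    """Dunn index: larger is better."""
    k = len(clusters)
    max_diam = max(diam(C) for C in clusters)
    return min(
        d_min(clusters[i], clusters[j]) / max_diam
        for i in range(k) for j in range(k) if j != i
    )
```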

3 Distance Calculation

For a function $dist(\cdot,\cdot)$ to be a distance measure, it must satisfy some basic properties:

  • Non-negativity: $dist(x_i,x_j)\ge 0$
  • Identity: $dist(x_i,x_j)=0$ if and only if $x_i=x_j$
  • Symmetry: $dist(x_i,x_j)=dist(x_j,x_i)$
  • Triangle inequality (directness): $dist(x_i,x_j)\le dist(x_i,x_k)+dist(x_k,x_j)$

For ordered attributes (e.g. in {1, 2, 3}, the values 1 and 2 are closer and the values can be compared in order, so the attribute is ordered; for {airplane, train, ship} no such ordering relation exists, so the attribute is unordered), given samples $x_i=(x_{i1};x_{i2};\dots;x_{in})$ and $x_j=(x_{j1};x_{j2};\dots;x_{jn})$, the most commonly used distance is the Minkowski distance:


$$dist_{mk}(x_i,x_j)=\left(\sum_{u=1}^n |x_{iu}-x_{ju}|^p\right)^{\frac{1}{p}}$$
  • When $p=1$, it is the Manhattan distance
  • When $p=2$, it is the Euclidean distance
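
As a tiny illustration (my own helper, assuming NumPy arrays or lists of numbers), the Minkowski distance and its two special cases:

```python
import numpy as np

def minkowski(xi, xj, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(np.asarray(xi) - np.asarray(xj)) ** p) ** (1.0 / p)

print(minkowski([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
```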

For unordered attributes, the VDM (Value Difference Metric) can be used. Let $m_{u,a}$ denote the number of samples that take value $a$ on attribute $u$, let $m_{u,a,i}$ denote the number of samples in the $i$-th cluster that take value $a$ on attribute $u$, and let $k$ be the number of clusters. Then the VDM distance between two discrete values $a$ and $b$ of attribute $u$ is:


$$VDM_p(a,b)=\sum_{i=1}^k \left|\frac{m_{u,a,i}}{m_{u,a}}-\frac{m_{u,b,i}}{m_{u,b}}\right|^p$$

Combining the Minkowski distance with VDM handles mixed attributes. Suppose there are $n_c$ ordered attributes and $n-n_c$ unordered attributes, with the ordered attributes placed before the unordered ones:


$$MinkovDM_p(x_i,x_j)=\left(\sum_{u=1}^{n_c}|x_{iu}-x_{ju}|^p+\sum_{u=n_c+1}^n VDM_p(x_{iu},x_{ju})\right)^{\frac{1}{p}}$$
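
Below is a rough sketch of VDM and the mixed MinkovDM distance under the stated assumption that the first $n_c$ attributes are ordered and the rest are unordered; `X` is the full sample matrix, `cluster_labels` a current cluster assignment, and all names are mine rather than the book's.

```python
import numpy as np

def vdm_p(X_col, cluster_labels, a, b, p=2):
    """VDM_p between values a and b of one unordered attribute column."""
    m_a = np.sum(X_col == a)
    m_b = np.sum(X_col == b)
    total = 0.0
    for c in np.unique(cluster_labels):
        in_c = cluster_labels == c
        m_a_i = np.sum((X_col == a) & in_c)
        m_b_i = np.sum((X_col == b) & in_c)
        total += abs(m_a_i / m_a - m_b_i / m_b) ** p
    return total

def minkov_dm(xi, xj, X, cluster_labels, n_c, p=2):
    """Mixed distance: Minkowski term over the first n_c (ordered) attributes
    plus VDM_p over the remaining (unordered) attributes."""
    ordered = np.sum(np.abs(np.asarray(xi[:n_c], dtype=float)
                            - np.asarray(xj[:n_c], dtype=float)) ** p)
    unordered = sum(vdm_p(X[:, u], cluster_labels, xi[u], xj[u], p)
                    for u in range(n_c, X.shape[1]))
    return (ordered + unordered) ** (1.0 / p)
```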

It should be noted that a "similarity measure" is usually defined based on some form of distance: the larger the distance, the smaller the similarity. However, the distance used for a similarity measure does not necessarily satisfy all the basic properties of a distance metric, especially the triangle inequality. For example, in some tasks we may want a similarity measure in which "human" and "horse" are each similar to "centaur", while "human" and "horse" are very dissimilar; to achieve this, the distances between "human" and "centaur" and between "horse" and "centaur" can be small, while the distance between "human" and "horse" is large. Such a distance no longer satisfies the triangle inequality and is called a non-metric distance. In addition, the distance formulas in this section are defined in advance, but in many real-world tasks it is necessary to determine a suitable distance based on the data samples, which can be done via distance metric learning.

4 Prototype clustering

Prototype-based clustering assumes that the clustering structure can be characterized by a set of prototypes. In general, such algorithms first initialize the prototypes and then iteratively update and refine them.

4.1 K-means algorithm

Given sample set $D=\{x_1,x_2,\dots,x_m\}$, the k-means algorithm seeks a cluster partition $C=\{C_1,C_2,\dots,C_k\}$ that minimizes the squared error:


$$E=\sum_{i=1}^k \sum_{x\in C_i}\|x-\mu_i\|_2^2$$

where $\mu_i=\frac{1}{|C_i|}\sum_{x\in C_i}x$ is the mean vector of cluster $C_i$. Intuitively, this formula describes how tightly the samples in each cluster gather around the cluster's mean vector: the smaller $E$ is, the higher the similarity of the samples within the clusters.

The k-means algorithm proceeds iteratively: each sample is assigned to the cluster of its nearest mean vector, and each cluster's mean vector is then recomputed; this repeats until the mean vectors no longer change (see the pseudocode in the book). To avoid excessively long running times, a maximum number of iterations or a minimum adjustment threshold is usually set.
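
The following is a minimal NumPy implementation of this iteration (my own sketch; the random initialization, the iteration cap and the convergence threshold are simplifications, not the book's exact pseudocode):

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means: alternate between assignment and mean-vector update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize the k mean vectors with k randomly chosen samples
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each sample to its nearest mean vector
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, k) distances
        labels = d.argmin(axis=1)
        # recompute each cluster's mean vector
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if np.linalg.norm(new_mu - mu) < tol:   # minimum adjustment threshold
            mu = new_mu
            break
        mu = new_mu
    return labels, mu
```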

4.2 Learning vector quantization

Learning Vector Quantization (LVQ) also tries to find a set of prototype vectors to characterize the clustering structure. Unlike general clustering algorithms, however, LVQ assumes that the data samples carry class labels, and this supervised information is used to assist the clustering during learning.

Suppose the data set is $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_m,y_m)\}$, where each sample $x_i=(x_{i1};x_{i2};\dots;x_{in})$ has $n$ attributes and $y_i\in Y$ is the class label of $x_i$. The goal of LVQ is to learn a set of $n$-dimensional prototype vectors $\{p_1,p_2,\dots,p_q\}$, where each prototype vector represents a cluster whose cluster label is $t_i\in Y$.

LVQ algorithm is described as follows:

As can be seen from the flow chart, the key of the LVQ algorithm lies in lines 6-10, i.e. how the prototype vectors are updated. Intuitively, if the class label of a sample $x_j$ is the same as that of its nearest prototype vector $p_{i^*}$, then $p_{i^*}$ is moved towards $x_j$:


$$p'=p_{i^*}+\eta\,(x_j-p_{i^*})$$

The distance between $p'$ and $x_j$ is then:


$$\|p'-x_j\|_2 =\|p_{i^*}+\eta(x_j-p_{i^*})-x_j\|_2 =(1-\eta)\cdot\|p_{i^*}-x_j\|_2$$

Here $\eta\in(0,1)$ is the learning rate. Similarly, if the class label of sample $x_j$ differs from that of prototype vector $p_{i^*}$, the updated distance becomes $(1+\eta)\cdot\|p_{i^*}-x_j\|_2$, i.e. the prototype vector is pushed away from $x_j$.

Through this process, any sample $x_j$ will be assigned to the cluster represented by its nearest prototype vector. In other words, each prototype vector $p_i$ defines a region $R_i$ associated with it, in which every sample is no farther from $p_i$ than from any other prototype vector, i.e.


$$R_i=\{x\in \mathcal{X} \mid \|x-p_i\|_2 \le \|x-p_{i'}\|_2,\ i'\neq i\}$$

This gives a partition $\{R_1,R_2,\dots,R_q\}$ of the sample space $\mathcal{X}$, usually called a "Voronoi partition" (Voronoi tessellation).
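
Here is a compact sketch of the LVQ training loop described above (my own simplifications: prototypes are initialized from random samples of the corresponding class, and a fixed number of random updates replaces the book's stopping condition):

```python
import numpy as np

def lvq(X, y, prototype_labels, eta=0.1, max_iter=400, seed=0):
    """Learning Vector Quantization: move prototypes towards same-class samples,
    away from different-class samples."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    # initialize each prototype with a random sample of its class
    P = np.array([X[rng.choice(np.where(y == t)[0])] for t in prototype_labels])
    for _ in range(max_iter):
        j = rng.integers(len(X))                                  # draw a random sample x_j
        i_star = np.argmin(np.linalg.norm(P - X[j], axis=1))      # nearest prototype p_{i*}
        if y[j] == prototype_labels[i_star]:
            P[i_star] += eta * (X[j] - P[i_star])                 # same class: move towards x_j
        else:
            P[i_star] -= eta * (X[j] - P[i_star])                 # different class: move away
    return P

# each sample is then assigned to the Voronoi cell of its nearest prototype
```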

4.3 Gaussian mixture clustering

Mixture-of-Gaussian clustering uses a probabilistic model to express the clustering prototypes.

The probability density function of the Gaussian distribution is as follows:


$$p(x)=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$$

where $\mu$ is an $n$-dimensional mean vector and $\Sigma$ is an $n\times n$ covariance matrix (symmetric and positive definite).

To make the dependence of the Gaussian distribution on its parameters explicit, we write its probability density function as $p(x\mid\mu,\Sigma)$ and define the Gaussian mixture distribution, where $\alpha_i>0$ is the mixture coefficient of the $i$-th component:


$$p_{\mathcal{M}}(x)=\sum_{i=1}^k \alpha_i\cdot p(x\mid\mu_i,\Sigma_i),\qquad \sum_{i=1}^k \alpha_i=1$$

Assume the samples are generated from the Gaussian mixture distribution: first a mixture component is chosen according to the prior distribution defined by $\alpha_1,\dots,\alpha_k$ (where $\alpha_i$ is the probability of choosing the $i$-th component), and then a sample is drawn from the probability density function of the chosen component.

If the training set $\{x_1,x_2,\dots,x_m\}$ is generated by the above process, let the random variable $z_j\in\{1,2,\dots,k\}$ denote the Gaussian mixture component that generated sample $x_j$; its value is unknown. Clearly, the prior probability $P(z_j=i)$ of $z_j$ corresponds to $\alpha_i$. The posterior probability is given by the following formula and is denoted $\gamma_{ji}$:


$$p_{\mathcal{M}}(z_j=i\mid x_j)=\frac{P(z_j=i)\cdot p_{\mathcal{M}}(x_j\mid z_j=i)}{p_{\mathcal{M}}(x_j)} =\frac{\alpha_i\cdot p(x_j\mid\mu_i,\Sigma_i)}{\sum_{l=1}^k \alpha_l\cdot p(x_j\mid\mu_l,\Sigma_l)}$$

When the Gaussian mixture distribution is known, Gaussian mixture clustering divides the sample set into $k$ clusters, and the cluster label $\lambda_j$ of each sample $x_j$ is determined as follows (i.e., Gaussian mixture clustering describes the prototypes with a probabilistic model, and the cluster partition is determined by the posterior probabilities corresponding to the prototypes):


$$\lambda_j=\mathop{\arg\max}_{i\in\{1,2,\dots,k\}} \gamma_{ji}$$

By maximizing the log-likelihood of the data with the EM algorithm, we obtain the update for $\alpha_i$ (the updates for $\mu_i$ and $\Sigma_i$ are derived similarly):


$$\alpha_i=\frac{1}{m}\sum_{j=1}^m\gamma_{ji}$$

That is, the mixture coefficient of each Gaussian component is determined by the average posterior probability of the samples belonging to that component.

The process of Gaussian mixture clustering is as follows:
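
Since the book's algorithm boils down to EM iterations of the posterior ($\gamma_{ji}$) and of the parameters ($\alpha_i$, $\mu_i$, $\Sigma_i$), here is a compact sketch of it (my own minimal implementation: fixed number of iterations, random initialization, and a small ridge added to the covariances for numerical stability):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density p(x | mu, Sigma) for every row of X."""
    n = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(expo) / norm

def gmm_cluster(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.full(k, 1.0 / k)                          # mixture coefficients
    mu = X[rng.choice(m, size=k, replace=False)]
    Sigma = np.array([np.eye(n) for _ in range(k)])
    for _ in range(n_iter):
        # E step: posterior gamma_{ji} of each component for each sample
        dens = np.stack([alpha[i] * gaussian_pdf(X, mu[i], Sigma[i]) for i in range(k)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: update alpha_i, mu_i, Sigma_i from the posteriors
        Nk = gamma.sum(axis=0)
        alpha = Nk / m
        mu = (gamma.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (gamma[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(n)
    labels = gamma.argmax(axis=1)                        # lambda_j = argmax_i gamma_{ji}
    return labels, alpha, mu, Sigma
```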

5 Density clustering

Density-based clustering assumes that the clustering structure can be determined by how tightly the samples are distributed. In general, a density clustering algorithm examines the connectivity between samples from the perspective of sample density, and keeps expanding clusters based on connectable samples to obtain the final clustering result.

DBSCAN is a well-known density clustering algorithm that characterizes the tightness of the sample distribution with a pair of neighborhood parameters $(\epsilon, MinPts)$. Given data set $D=\{x_1,x_2,\dots,x_m\}$, define the following concepts:

  • $\epsilon$-neighborhood: for $x_j\in D$, its $\epsilon$-neighborhood contains all samples in $D$ whose distance from $x_j$ is not greater than $\epsilon$, i.e.


    $$N_\epsilon(x_j)=\{x_i\in D \mid dist(x_i,x_j)\le \epsilon \}$$
  • Core object: if the $\epsilon$-neighborhood of $x_j$ contains at least $MinPts$ samples, i.e. $|N_\epsilon(x_j)|\ge MinPts$, then $x_j$ is a core object.

  • Directly density-reachable: if $x_j$ lies in the $\epsilon$-neighborhood of $x_i$ and $x_i$ is a core object, then $x_j$ is said to be directly density-reachable from $x_i$.

  • Density-reachable: for $x_i$ and $x_j$, if there exists a sample sequence $p_1,p_2,\dots,p_n$ with $p_1=x_i$, $p_n=x_j$, and $p_{i+1}$ directly density-reachable from $p_i$, then $x_j$ is said to be density-reachable from $x_i$.

  • Density-connected: for $x_i$ and $x_j$, if there exists $x_k$ such that both $x_i$ and $x_j$ are density-reachable from $x_k$, then $x_i$ and $x_j$ are said to be density-connected.

Based on these concepts, DBSCAN defines a "cluster" as the largest density-connected set of samples derived from the density-reachability relation. Formally, given neighborhood parameters $(\epsilon, MinPts)$, a cluster $C\subseteq D$ is a non-empty subset of samples satisfying the following properties:

  • Connectivity: $x_i\in C,\ x_j\in C \Rightarrow x_i$ and $x_j$ are density-connected
  • Maximality: $x_i\in C,\ x_j$ density-reachable from $x_i \Rightarrow x_j\in C$

If $x$ is a core object, then the set of all samples density-reachable from $x$, i.e. $X=\{x'\in D \mid x' \text{ is density-reachable from } x\}$, is exactly a cluster satisfying both connectivity and maximality.

The DBSCAN algorithm first selects a core object in the data set as a seed, and then grows the corresponding cluster from it. The algorithm is described as follows:
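
A straightforward sketch following this description (my own code, using brute-force Euclidean neighborhood queries; samples not reachable from any core object are labeled -1 as noise):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """DBSCAN: grow clusters from core objects; label -1 marks noise."""
    X = np.asarray(X, dtype=float)
    m = len(X)
    # epsilon-neighborhood of every sample (including itself)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(d[j] <= eps)[0] for j in range(m)]
    core = {j for j in range(m) if len(neighbors[j]) >= min_pts}
    labels = np.full(m, -1)
    cluster = 0
    unvisited = set(range(m))
    for s in list(core):
        if s not in unvisited:
            continue                      # this core object already belongs to a cluster
        # grow a cluster from this core object via density-reachability
        queue, members = [s], [s]
        unvisited.discard(s)
        while queue:
            p = queue.pop()
            if p in core:                 # only core objects propagate reachability
                for q in neighbors[p]:
                    if q in unvisited:
                        unvisited.discard(q)
                        queue.append(q)
                        members.append(q)
        labels[np.array(members)] = cluster
        cluster += 1
    return labels
```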

6 Hierarchical clustering

Hierarchical Clustering attempts to divide data sets at different levels, thus forming a tree-like clustering structure. Data sets can be divided by either a “bottom-up” aggregation strategy or a “top-down” splitting strategy.

AGNES is a hierarchical clustering algorithm that uses a bottom-up aggregation strategy. It first treats each sample of the data set as an initial cluster, then at each step finds the two closest clusters and merges them; this process repeats until the preset number of clusters is reached.

Given clusters $C_i$ and $C_j$, the inter-cluster distance can be computed in the following ways:

  • Single linkage: minimum distance $d_{min}(C_i,C_j)=\min_{x\in C_i,\, z\in C_j} dist(x,z)$
  • Complete linkage: maximum distance $d_{max}(C_i,C_j)=\max_{x\in C_i,\, z\in C_j} dist(x,z)$
  • Average linkage: average distance $d_{avg}(C_i,C_j)=\frac{1}{|C_i||C_j|}\sum_{x\in C_i}\sum_{z\in C_j} dist(x,z)$
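
Finally, a compact AGNES sketch (my own code) using the three linkage options above; single linkage corresponds to $d_{min}$, complete linkage to $d_{max}$, and average linkage to $d_{avg}$:

```python
import numpy as np

def agnes(X, k, linkage="single"):
    """Bottom-up agglomerative clustering: start from singleton clusters and
    repeatedly merge the two closest clusters until only k clusters remain."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]        # each sample starts as its own cluster
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def cluster_dist(ci, cj):
        pair = d[np.ix_(ci, cj)]
        if linkage == "single":
            return pair.min()       # d_min
        if linkage == "complete":
            return pair.max()       # d_max
        return pair.mean()          # d_avg

    while len(clusters) > k:
        # find the two closest clusters and merge them
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = cluster_dist(clusters[i], clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```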