Original link:tecdat.cn/?p=2981

Original source: Tuoduan Data Tribe WeChat public account

 

There are many clustering algorithms; the classic ones are K-means and hierarchical clustering.

K-means clustering analysis algorithm

The k in k-means is the final number of clusters, which you must specify in advance. K-means is one of the simpler common machine learning algorithms, and its basic procedure is as follows:

  • First, pick any k sample points as the initial centers of the k clusters;
  • For each sample point, compute its distance to each of the k centers and assign it to the cluster whose center is nearest;
  • After all sample points have been assigned, recompute the centers of the k clusters;
  • Repeat this process until the cluster assignments no longer change.
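The loop above can be sketched in a few lines of R. This is only a minimal illustration of the algorithm (the function name `my_kmeans` is made up for this example, and it assumes no cluster ever becomes empty), not a production implementation:

```r
# Minimal k-means sketch: x is a numeric matrix, k the number of clusters
my_kmeans <- function(x, k, max_iter = 100) {
  # 1. pick k random sample points as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(1L, nrow(x))
  for (i in seq_len(max_iter)) {
    # 2. assign each point to the nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # 3. recompute each center as the mean of the points in its cluster
    #    (assumes every cluster is non-empty)
    new_centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    # 4. stop once the centers no longer move
    if (isTRUE(all.equal(unname(new_centers), unname(centers)))) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}
```

In practice one simply calls the built-in `kmeans(x, k, nstart = 20)`, which additionally restarts from `nstart` random initializations to mitigate the sensitivity to the initial centers described below.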

The clustering process of K-means is shown as follows:

K-means clustering process

Although the principle of K-means clustering analysis is simple, its disadvantages are also obvious:

  • First, you must decide the number of clusters k in advance, but if you know nothing about the data you have no idea what k should be;
  • The initial centers must also be chosen, and that choice directly affects the final clustering result;
  • Every iteration recomputes the distance between each point and each center, which costs a lot of time.

It’s worth noting that there are many ways to measure distance, not necessarily Euclidean distance; it is also common to standardize the data before computing distances.
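In R, the `dist()` function supports several metrics, and `scale()` performs the standardization just mentioned. A small illustration on made-up data:

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)  # 5 toy sample points with 4 features
xs <- scale(x)                    # standardize each column before computing distances
dist(xs)                          # Euclidean distance (the default)
dist(xs, method = "manhattan")    # other metrics are available as well
```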

Hierarchical clustering method

Although the principle of K-means is simple, the principle of hierarchical clustering is even simpler. Its basic process is as follows:

  • Treat each sample point as its own cluster;
  • Compute the distance between every pair of clusters and merge the two nearest clusters into a new cluster;
  • Repeat this process until only one cluster remains.
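These three steps are exactly what R's `hclust()` performs. A tiny illustration on made-up data (the `method` argument controls how the distance between clusters is measured):

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 10)          # 10 toy sample points
hc <- hclust(dist(x), method = "complete") # agglomerative clustering, complete linkage
plot(hc)                                   # dendrogram of the merges, from 10 clusters down to 1
cutree(hc, k = 3)                          # afterwards, cut the tree into any number of clusters
```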

Hierarchical clustering does not require specifying the number of clusters in advance; it relies only on the distances between clusters and eventually produces a tree diagram (dendrogram).

Examples of hierarchical clustering

With this tree diagram, you can quickly decide how many clusters to divide the data into.

The following demonstrates the process of K-means and hierarchical clustering using the NCI60 cancer cell line data as an example.


library(ISLR)  # the NCI60 cancer cell line data comes with the ISLR package
nci.labels = NCI60$labs
nci.data = NCI60$data
sd.data = scale(nci.data)   # standardize before computing distances
data.dist = dist(sd.data)
plot(hclust(data.dist), labels = nci.labels, main = "Complete Linkage", xlab = "", sub = "", ylab = "")  # complete linkage (the hclust default)
plot(hclust(data.dist, method = "average"), labels = nci.labels, main = "Average Linkage", xlab = "", sub = "", ylab = "")  # average linkage
plot(hclust(data.dist, method = "single"), labels = nci.labels, main = "Single Linkage", xlab = "", sub = "", ylab = "")  # single (shortest-distance) linkage

Complete Linkage

Average Linkage

Single Linkage

It can be seen that different linkage criteria lead to different clustering results. Among them, complete linkage and average linkage are used more often, because they tend to produce more balanced dendrograms.

hc.out = hclust(dist(sd.data))
hc.clusters = cutree(hc.out, 4)  # cut the dendrogram into 4 clusters

 
plot(hc.out, labels = nci.labels)
abline(h = 139, col = "red")  # this height cuts the tree into 4 clusters

Hierarchical clustering is divided into 4 categories

The red line in the figure divides the cluster into four categories, and it is easy to see which samples belong to which cluster.

The above is the result of hierarchical clustering, but if k-means clustering is used, the result is likely to be different.

 

set.seed(2)  # fix the random seed so the k-means result is reproducible
km.out = kmeans(sd.data, 4, nstart = 20)
km.clusters = km.out$cluster
table(km.clusters, hc.clusters)
# The two clusterings do differ: cluster 2 of k-means coincides with cluster 3 of the hierarchical clustering

  


 
