Original link: tecdat.cn/?p=22879

Source: Tecdat (Tuoduan Data Tribe) official account

Data set overview

The iris data set is often used to demonstrate data exploration, visualization, and clustering models. It contains three iris varieties with 50 samples each, along with several attributes per flower. One variety is linearly separable from the other two, but the other two are not linearly separable from each other.

The given columns of this data set are:

I. Id
II. Sepal length (cm)
III. Sepal width (cm)
IV. Petal length (cm)
V. Petal width (cm)
VI. Variety
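The structure described above can be checked directly in R. This is a minimal sketch that assumes R's built-in `iris` data frame stands in for the file described here; the built-in version has no Id column and names its columns `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`. The name `Data` for the numeric columns is also an assumption.

```r
# Assumption: R's built-in iris data frame replaces the file described above
# (it has no Id column, and the variety column is called Species).
data(iris)

# Column names and types: four numeric measurements plus the Species factor
str(iris)

# Number of samples per variety
species_counts <- table(iris$Species)
print(species_counts)

# Keep only the numeric measurements, which is what k-means needs
Data <- iris[, 1:4]
```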

Let's visualize this data set and then cluster it using k-means.

Basic visualization

IRIS data, basic visualization before clustering

ggplot(data, aes(x, y))

ggplot(data, aes(x)) + geom_density(alpha = 0.25)

Volcano plot

ggplot(iris) + stat_density(aes(ymax = ..density.., ymin = -..density..), ...)

ggplot(data, aes(x)) + stat_density(aes(ymax = ..density.., ymin = -..density..), ...) + facet_grid(. ~ Species)
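Since the plotting calls above omit their column names and some arguments, here is one way they might be filled in. This is a sketch under assumptions: it uses the built-in `iris` data, picks `Sepal.Length` arbitrarily, and chooses `geom = "ribbon"`, `position = "identity"`, and `coord_flip()` as one plausible way to draw a mirrored density. It also uses `after_stat(density)`, the current ggplot2 spelling of `..density..`.

```r
library(ggplot2)

# Assumption: Sepal.Length is an arbitrary choice of column.
# Overlapping density curves, one per species
p_density <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.25)

# Mirrored density ("volcano") plot: the density is drawn on both sides
# of a central axis, with one panel per species
p_volcano <- ggplot(iris, aes(x = Sepal.Length)) +
  stat_density(
    aes(ymax = after_stat(density), ymin = -after_stat(density)),
    geom = "ribbon", position = "identity"
  ) +
  facet_grid(. ~ Species) +
  coord_flip()

print(p_density)
print(p_volcano)
```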

Clustering data :: Method 1

# Run k-means for 1 to 15 clusters in a loop
totalwss <- numeric(15)
for (i in 1:15) {
  totalwss[i] <- kmeans(Data, i)$tot.withinss
}
# Scree plot: total WSS plotted against the number of clusters
plot(x = 1:15,   # x = number of clusters, 1 to 15
     totalwss,   # total WSS for each number of clusters
     type = "b") # draw the points and connect them
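A self-contained version of the elbow method might look like the following. It is a sketch that assumes `Data` holds the four numeric columns of the built-in `iris` data. The `set.seed()` call and `nstart = 25` are additions for reproducibility and stability, and the percentage-drop summary is an extra aid for reading the elbow.

```r
# Assumption: Data is the numeric part of the built-in iris data.
Data <- iris[, 1:4]
set.seed(42)  # k-means starts from random centers

totalwss <- numeric(15)
for (k in 1:15) {
  # nstart = 25 repeats each fit from 25 random starts and keeps the best one
  totalwss[k] <- kmeans(Data, centers = k, nstart = 25)$tot.withinss
}

plot(1:15, totalwss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")

# Percentage drop in total WSS gained by adding one more cluster;
# the elbow is where these gains become small
drop_pct <- round(100 * -diff(totalwss) / totalwss[-15], 1)
print(drop_pct)
```

With one cluster, the total WSS equals the total sum of squares around the column means, which makes a quick sanity check on the loop.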

Clustering data :: Method 2

This method uses measures of cluster validity.

library(NbClust)
# Set the margins to c(bottom, left, top, right)
par(mar = c(2, 2, 2, 2))
# Measure the suitability of the clustering against a set of indices.
# By default, it checks 2 to 15 clusters.
# This takes some time.

Hubert index

The Hubert index is a graphical method for determining the number of clusters. In the Hubert index plot, we look for a significant knee that corresponds to a significant increase in the value of the measure, that is, a significant peak in the plot of the Hubert index second differences.

D index

In the D index plot, we look for a significant knee (a significant peak in the plot of the D index second differences) that corresponds to a significant increase in the value of the measure.

## *******************************************************************
## * Among all indices:
## * 10 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
##                    ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
## *******************************************************************

Draw a histogram showing how the indices vote on the number of clusters. Of the 26 indices, the largest group (10) voted for 2 clusters and 8 voted for 3 clusters. The remaining 8 (26 - 10 - 8) comprise the 6 votes for other cluster counts listed above and the 2 graphical indices (Hubert and D index), which propose no number. The histogram uses 15 breaks because the algorithm checks 2 to 15 clusters.

hist(Best.nc)
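The majority rule can be reproduced without rerunning NbClust by tallying the votes printed in the summary above. This is a small sketch in base R; the vector `votes` is simply the printed counts typed in by hand, and the variable names are illustrative.

```r
# Votes copied from the NbClust summary above.
# Names are the proposed numbers of clusters, values are the index counts.
votes <- c("2" = 10, "3" = 8, "4" = 2, "5" = 1, "8" = 1, "14" = 1, "15" = 1)

# Majority rule: the cluster count proposed by the most indices
best_k <- as.integer(names(votes)[which.max(votes)])
total_votes <- sum(votes)

cat("Indices that proposed a number:", total_votes, "\n")
cat("Majority choice:", best_k, "clusters\n")

# Bar chart of the same tally
barplot(votes, xlab = "Number of clusters", ylab = "Number of indices")
```

Note how close the vote is: 3 clusters, the true number of varieties, is only two votes behind.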

Clustering data :: Method 3

The Calinski-Harabasz index is essentially the ratio of between-group variance to within-group variance.

library(vegan)
# Test 1 to 10 clusters
modelData <- cascadeKM(Data, 1, 10)
# sortg = TRUE: sort the iris objects (rows) by their group membership,
# which colors group membership in the heat map and makes the chart easier to interpret.
# Two graphs are produced: a heat map, and the index value (= BC/WC) against the number of clusters.
plot(modelData, sortg = TRUE)

modelData$results[2, ]  # BC/WC value for each number of clusters

# So, which of these values is the largest? BC/WC should be as large as possible.
which.max(modelData$results[2, ])
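To make the BC/WC ratio concrete, the index can also be computed by hand from a k-means fit. The sketch below uses only base R and the usual form of the Calinski-Harabasz index, (BSS / (k - 1)) / (WSS / (n - k)), where BSS and WSS are the between-cluster and within-cluster sums of squares. It assumes the built-in `iris` data and starts at k = 2 because the index is undefined for a single cluster. It is not the implementation used by the package above.

```r
# Assumption: Data is the numeric part of the built-in iris data.
Data <- iris[, 1:4]
set.seed(42)
n <- nrow(Data)

# Calinski-Harabasz index for k = 2..10:
# between-cluster variance divided by within-cluster variance
calinski <- sapply(2:10, function(k) {
  km <- kmeans(Data, centers = k, nstart = 25)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
})
names(calinski) <- 2:10
print(round(calinski, 1))

# The preferred number of clusters is the one with the largest index
best_k <- as.integer(names(calinski)[which.max(calinski)])
cat("Largest index at k =", best_k, "\n")
```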

Clustering data using a silhouette plot :: Method 4

Try two clusters first

# Compute the distance matrix between the rows of the data matrix, using the Euclidean distance measure.
dis <- dist(Data)
silhouette(cluster, dis)

Try out 8 clusters

# Compute the distance matrix between the rows of the data matrix, using the Euclidean distance measure.
dis <- dist(Data)
silhouette(cluster, dis)
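The two attempts can be compared numerically through the average silhouette width, where values closer to 1 indicate better separated clusters. The following sketch assumes the built-in `iris` data, takes the cluster labels from `kmeans()`, and uses `silhouette()` from the `cluster` package. The helper names are illustrative.

```r
library(cluster)  # provides silhouette()

# Assumption: Data is the numeric part of the built-in iris data.
Data <- iris[, 1:4]
dis <- dist(Data)  # Euclidean distance is the default
set.seed(42)

# Average silhouette width for 2 and for 8 clusters
avg_width <- sapply(c(2, 8), function(k) {
  km <- kmeans(Data, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dis)
  mean(sil[, "sil_width"])
})
names(avg_width) <- c("k=2", "k=8")
print(round(avg_width, 3))

# Silhouette plot for the two-cluster solution
plot(silhouette(kmeans(Data, centers = 2, nstart = 25)$cluster, dis))
```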

Assess the clustering tendency

Calculate the Hopkins statistic for the iris data set and for a random data set

# 1. Generate uniform random values over the range of a column
genx <- function(x) runif(length(x), min(x), max(x))
# 2. Generate random data by applying the function to each column
apply(iris[, -5], 2, genx)
# 3. Standardize the two data sets
scale(iris[, -5])  # defaults: center = TRUE, scale = TRUE
# 4. Calculate the Hopkins statistic for the data set

# The statistic can also be evaluated using the hopkins() function.
hopkins(iris)

# 5. Calculate the Hopkins statistic for the random data set
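The five steps can be sketched end to end in base R. The function `hopkins_stat()` below is a hand-written illustration, not the function used in the original code. It follows the convention H = sum(u) / (sum(u) + sum(w)), where w holds nearest-neighbor distances of sampled real points and u holds distances from uniform random points to their nearest real point. Under this convention, values near 0.5 suggest no structure and values near 1 suggest clustered data. Some packages report the complementary value, so check the convention before comparing numbers. The sample size `m = 30` is an arbitrary choice.

```r
# Hand-written Hopkins statistic (illustrative; convention: near 1 = clustered)
hopkins_stat <- function(X, m = 30) {
  X <- as.matrix(X)
  # Distance from point p to its nearest neighbor among the rows of pts
  nearest <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))

  # w: nearest-neighbor distances of m sampled real points (excluding themselves)
  idx <- sample(nrow(X), m)
  w <- sapply(idx, function(i) nearest(X[i, ], X[-i, , drop = FALSE]))

  # u: distances from m uniform random points to their nearest real point
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(U, 1, nearest, pts = X)

  sum(u) / (sum(u) + sum(w))
}

# Steps 1-3: random reference data over the same column ranges, both standardized
genx <- function(x) runif(length(x), min(x), max(x))
set.seed(42)
iris_scaled <- scale(iris[, -5])
random_scaled <- scale(apply(iris[, -5], 2, genx))

# Steps 4-5: Hopkins statistic for the iris data and for the random data
h_iris <- hopkins_stat(iris_scaled)
h_random <- hopkins_stat(random_scaled)
cat("iris:", round(h_iris, 3), " random:", round(h_random, 3), "\n")
```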
