1. Are clustering algorithms all unsupervised learning?

What is a clustering algorithm? Clustering is a machine learning technique for grouping data points. Given a set of data points, a clustering algorithm assigns each data point to a specific group. In theory, data points in the same group should have similar attributes and/or features, while data points in different groups should have highly dissimilar attributes and/or features. As an unsupervised learning method, clustering is a common statistical data analysis technique in many fields.

Commonly used algorithms include K-means, the Gaussian Mixture Model (GMM) and the Self-Organizing Map (SOM).

2. K-means algorithm

2.1 Algorithm Process

K-means is the most popular clustering algorithm: it takes an unlabeled data set and clusters the data into different groups.

K-means is an iterative algorithm. Suppose we want to cluster the data into 𝐾 groups; the method is as follows:

  • First, 𝐾 random points are selected, called the cluster centroids.
  • Each point in the data set is associated with the nearest of the 𝐾 centroids, and all points associated with the same centroid form one cluster.
  • For each cluster, compute the mean of its points and move the associated centroid to that mean.
  • Repeat the last two steps until the centroids no longer change.

Let 𝜇1, 𝜇2, …, 𝜇𝐾 denote the cluster centroids, and use 𝑐(1), 𝑐(2), …, 𝑐(𝑚) to store the index of the cluster centroid closest to the 𝑖-th training example 𝑥(𝑖). The pseudo-code of the K-means algorithm is as follows:

```
Repeat {
    for i = 1 to m
        c(i) := index (from 1 to K) of cluster centroid closest to x(i)
    for k = 1 to K
        μk := average (mean) of points assigned to cluster k
}
```

The algorithm alternates between two steps. The first for loop is the assignment step: for each example 𝑖, compute the cluster it should belong to. The second for loop moves the cluster centroids: for each cluster 𝑘, the centroid of that cluster is recomputed.
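A minimal NumPy sketch of these two steps (the function name, the toy data and the convergence test are illustrative assumptions, not part of the original text):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X is an (m, n) data array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K randomly chosen training examples.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the closest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

labels, centroids = kmeans(np.random.rand(100, 2), K=3)
```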

The K-means algorithm can also be used to divide data into groups even when there are no very distinct clusters. For example, given a data set with two features, height and weight, K-means can divide the data into three categories to help determine the three T-shirt sizes to be produced.

2.2 Loss function

The K-means minimization problem is to minimize the sum of squared distances between all data points and their associated cluster centroids. Therefore, the cost function of K-means (also known as the distortion function) is:

$$J\left(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\right)=\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}-\mu_{c^{(i)}}\right\|^{2}$$

where $\mu_{c^{(i)}}$ is the cluster centroid closest to $x^{(i)}$. Our optimization goal is to find the $c^{(1)},\dots,c^{(m)}$ and $\mu_1,\dots,\mu_K$ that minimize this cost function.
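As a quick sketch, this distortion can be computed from the labels and centroids returned by a K-means run such as the one above (the names are illustrative):

```python
import numpy as np

def distortion(X, labels, centroids):
    """Mean squared distance from each point to its assigned centroid."""
    return np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))
```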

2.3 Selection of K value

Before running the K-means algorithm, we first need to randomly initialize all clustering centers. Here is how to do it:

  1. We should choose 𝐾 < 𝑚, that is, the number of cluster centroids should be less than the number of training examples.
  2. Randomly select 𝐾 training examples and set the 𝐾 cluster centroids equal to these 𝐾 examples. One problem with this is that K-means may get stuck in a local minimum, depending on the initialization.

To address this, we usually run the K-means algorithm several times, re-initializing randomly each time, then compare the results of the runs and select the one with the smallest cost function. This works well when 𝐾 is small (2–10), but may not help much when 𝐾 is large.
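A minimal sketch of this multiple-restart strategy, assuming scikit-learn is available (its `KMeans` estimator exposes an `n_init` parameter that runs several random initializations and keeps the run with the lowest distortion; the toy data is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)   # toy data; replace with your own

# n_init runs K-means 10 times from different random initializations and
# keeps the solution with the smallest cost (inertia_ = m * J).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
print(km.cluster_centers_)
```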

There is no single best method for choosing the number of clusters; it usually has to be chosen manually according to the problem. When choosing, think about what motivates us to use K-means clustering in the first place. One method that is often discussed is the "elbow rule". To apply it, we vary 𝐾, the total number of clusters: we run K-means with 1 cluster, meaning all the data is grouped into a single cluster, and compute the cost (distortion) function 𝐽; then we repeat with 2 clusters, 3 clusters, and so on, where 𝐾 is the number of clusters.

Plotting 𝐽 against 𝐾, we might get a curve that looks like a person's arm, which is where the "elbow rule" gets its name. In such a plot there is often a very clear elbow: the distortion drops very quickly as 𝐾 goes from 1 to 2 and from 2 to 3, reaches an elbow at 3, and after that drops very slowly. It then looks as if clustering with 3 clusters is right, because that point is the elbow of the curve. When you apply the elbow rule and get a plot like this, it is a reasonable way to choose the number of clusters.
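A minimal sketch of the elbow rule, again assuming scikit-learn and matplotlib (the toy data and names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)          # toy data; replace with your own
ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, costs, marker="o")     # look for the "elbow" in this curve
plt.xlabel("K (number of clusters)")
plt.ylabel("distortion J")
plt.show()
```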

2.4 Is KNN different from K-means?

The K-Nearest Neighbor (KNN) classification algorithm is theoretically mature and one of the simplest machine learning algorithms.

| KNN | K-means |
| --- | --- |
| KNN is a classification algorithm | K-means is a clustering algorithm |
| Supervised learning | Unsupervised learning |
| The training data set is labeled | The training data set is unlabeled and disordered; after clustering it becomes ordered (first disordered, then ordered) |
| No obvious pre-training process; it is memory-based (lazy) learning | Has an obvious pre-training process |
| Meaning of K: to classify a sample x, find the K data points nearest to x in the training set; x is assigned the label of the category C that occurs most often among these K points | Meaning of K: K is a manually fixed number; assuming the data set can be grouped into K clusters, training learns these K clusters |

The similarities

Both involve the process of finding the nearest points in the data set, given a point. That is, both use a nearest-neighbor (NN) search.
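A minimal sketch contrasting the two, assuming scikit-learn (the toy data and labels are illustrative): KNN needs labels to fit, K-means does not.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)
y = (X[:, 0] > 0.5).astype(int)      # labels exist only for the supervised case

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # supervised: needs X and y
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)       # unsupervised: needs only X

x_new = np.array([[0.3, 0.8]])
print(knn.predict(x_new))            # predicted class label
print(kmeans.predict(x_new))         # assigned cluster index (arbitrary numbering)
```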

2.5 Advantages, disadvantages and improvements of K-means

K-means can consume a lot of time and memory on large data sets. Suggestions for optimizing K-means:

  1. Reduce the number of clusters K, because each sample has to compute its distance to every cluster centroid.

  2. Reduce the feature dimension of the samples, for example through dimensionality reduction such as PCA.

  3. Investigate other clustering algorithms and test the performance of different clustering algorithms on toy data.

  4. Use a Hadoop cluster; the K-means algorithm is easy to parallelize.

  5. The algorithm may find a locally optimal clustering rather than a globally optimal one. An improved bisecting K-means algorithm can be used.

    Bisecting K-means algorithm: first treat the whole data set as one cluster, then run K-means with k = 2 to split it into two clusters and compute the sum of squared errors (SSE) of each cluster; then select the cluster with the largest SSE and split it in two again, repeating until the user-specified number of clusters k is reached. This reduces the chance of getting stuck in a poor local optimum. A minimal sketch follows.
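A minimal sketch of bisecting K-means, assuming scikit-learn for the 2-means splits (the function and variable names are illustrative; recent scikit-learn versions also ship a built-in `BisectingKMeans` estimator):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Repeatedly split the cluster with the largest SSE until k clusters remain."""
    clusters = [X]                                   # start with one cluster: all the data
    while len(clusters) < k:
        # Pick the cluster with the largest sum of squared errors (SSE).
        sse = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        to_split = clusters.pop(int(np.argmax(sse)))
        # Split it in two with a standard 2-means run.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(to_split)
        clusters.append(to_split[labels == 0])
        clusters.append(to_split[labels == 1])
    return clusters

parts = bisecting_kmeans(np.random.rand(200, 2), k=4)
print([len(p) for p in parts])
```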

3. Gaussian Mixture Model (GMM)

3.1 Idea of GMM

The Gaussian Mixture Model (GMM) is also a common clustering algorithm. Like the K-means algorithm, it is computed iteratively, using the EM algorithm. The Gaussian mixture model assumes that the data of each cluster follows a Gaussian (normal) distribution, and that the observed data distribution is the superposition of the Gaussian distributions of the individual clusters.

Consider an example data distribution. If only a single Gaussian distribution is used to fit the data, the corresponding two-standard-deviation ellipse does not match it well: intuitively, the data is clearly divided into two clusters, so fitting it with one Gaussian distribution is not reasonable, and the model needs to be extended to a superposition of multiple Gaussian distributions. Fitting a superposition of two Gaussian distributions matches the data much better. This leads to the Gaussian mixture model, which uses a linear combination of multiple Gaussian distribution functions to fit the data distribution. Theoretically, the Gaussian mixture model can fit any type of distribution.

The core idea of the Gaussian mixture model is to assume that the data can be regarded as generated from multiple Gaussian distributions. Under this assumption, each individual component is a standard Gaussian model, with its mean $\mu_k$ and variance (or covariance) $\Sigma_k$ as the parameters to be estimated. In addition, each component has a weight parameter $\pi_k$, which can be understood as the probability that a data point is generated by that component. The formula of the Gaussian mixture model is:

$$p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}\!\left(x\mid\mu_k,\Sigma_k\right),\qquad \sum_{k=1}^{K}\pi_k=1$$
Generally, we cannot directly observe the parameters of the Gaussian mixture model; we only observe a series of data points and, given the number of categories K, want to find the optimal K Gaussian components. The computation of the Gaussian mixture model therefore becomes a search for the optimal means μ, variances σ and weights π. Such problems are usually solved by maximum likelihood estimation. Unfortunately, applying maximum likelihood estimation directly to this problem yields a complex non-convex objective function: it is the logarithm of a sum, which is difficult to expand and differentiate.
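Concretely, the log-likelihood that would have to be maximized has the following "log of a sum" form (a standard expression, written here for clarity, using the same symbols as the mixture formula above):

$$\log L(\pi,\mu,\Sigma)=\sum_{i=1}^{m}\log p\!\left(x^{(i)}\right)=\sum_{i=1}^{m}\log\sum_{k=1}^{K}\pi_k\,\mathcal{N}\!\left(x^{(i)}\mid\mu_k,\Sigma_k\right)$$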

**In this case, the EM algorithm can be used.** When maximizing the objective function, the EM algorithm first fixes one set of variables so that the remaining optimization becomes tractable, solves for the maximum, and then uses the resulting optimal parameters to update the fixed variables, entering the next cycle. Specifically, the iteration of the EM algorithm for a GMM is as follows.

Firstly, the value of each parameter is selected randomly. Then, repeat the following two steps until convergence.

  • E step. Based on the current parameters, compute the probability (responsibility) that each point was generated by each component.
  • M step. Use the probabilities estimated in the E step to update the mean, variance and weight of each component (see the sketch below).
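A minimal NumPy sketch of these two steps for a one-dimensional, two-component GMM (the function name, initialization and convergence test are illustrative assumptions, not the original author's code):

```python
import numpy as np

def gmm_em_1d(x, n_iters=100, tol=1e-6, seed=0):
    """EM for a 1-D Gaussian mixture with two components; x is a 1-D array."""
    rng = np.random.default_rng(seed)
    pi = np.array([0.5, 0.5])                    # initial weights
    mu = rng.choice(x, size=2, replace=False)    # initial means: two random points
    var = np.array([x.var(), x.var()])           # initial variances
    for _ in range(n_iters):
        # E step: responsibility of each component for each point.
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means and variances from responsibilities.
        nk = resp.sum(axis=0)
        pi_new = nk / len(x)
        mu_new = (resp * x[:, None]).sum(axis=0) / nk
        var_new = (resp * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
        converged = np.max(np.abs(mu_new - mu)) < tol
        pi, mu, var = pi_new, mu_new, var_new
        if converged:
            break
    return pi, mu, var
```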

The Gaussian mixture model is a generative model. The data generation process can be understood in the simplest case: there are only two one-dimensional, unit-variance Gaussian components, N(0,1) and N(5,1), with weights 0.7 and 0.3 respectively. To generate the first data point, a component is chosen at random according to the weights, say the first Gaussian, and a point is drawn from N(0,1), say −0.5; that is the first data point. To generate the second data point, the second Gaussian N(5,1) happens to be selected, producing the second point, say 4.7. The loop continues until all data points are generated.
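A minimal sketch of this generative process, using the weights and components from the example above (the NumPy names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.7, 0.3])
means, stds = np.array([0.0, 5.0]), np.array([1.0, 1.0])

# For each point: first pick a component according to the weights,
# then sample from that component's Gaussian.
components = rng.choice(2, size=1000, p=weights)
x = rng.normal(means[components], stds[components])
```

Feeding this `x` into the EM sketch above should recover weights near 0.7/0.3 and means near 0 and 5.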

In other words, we know neither the three parameters (mean, variance, weight) of each of the optimal K Gaussian components nor which component generated each data point. Therefore, in each cycle we first fix the current Gaussian components and compute the probability that each data point was generated by each component; then we fix these generation probabilities and, from the data points and the probabilities, obtain better Gaussian components. The cycle repeats until the parameters no longer change, or change very little, yielding a reasonable set of Gaussian components.

3.2 Comparison between GMM and K-means

The similarities between the Gaussian mixture model and the K-means algorithm are as follows:

  • Both are algorithms that can be used for clustering;
  • Both require the number of clusters K to be specified;
  • Both are solved with an EM-style iteration;
  • Both tend to converge only to a local optimum.

Compared with the K-means algorithm, the advantages of the Gaussian mixture model are that it can give the probability that a sample belongs to each category; it can be used not only for clustering but also for probability density estimation; and it can be used to generate new sample points.
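A minimal sketch of these three uses, assuming scikit-learn's `GaussianMixture` (the toy data is illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.concatenate([np.random.normal(0, 1, (700, 1)),
                    np.random.normal(5, 1, (300, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.predict_proba(X[:3]))   # soft assignment: probability per component
print(gmm.score_samples(X[:3]))   # log probability density estimate at each point
X_new, _ = gmm.sample(5)          # generate new sample points from the fitted model
```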

4. How to evaluate the clustering algorithm

Due to the diversity of data and requirements, no single algorithm suits all data types, data clusters or application scenarios, and each situation may require a different evaluation method or metric. For example, K-means clustering can be evaluated with the sum of squared errors, but density-based clusters may not be spherical, and the sum of squared errors then fails. In many cases, judging the result of a clustering algorithm depends strongly on subjective interpretation. Nevertheless, evaluating a clustering algorithm is still necessary, and it is one of the most important parts of cluster analysis.

The task of clustering evaluation is to estimate the feasibility of clustering on data sets and the quality of the results produced by clustering methods. This process is divided into three sub-tasks.

  1. Estimate the clustering trend.

    This step detects whether there is a non-random cluster structure in the data distribution. If the data is basically random, the clustering results are meaningless. We can observe how the clustering error changes as the number of clusters increases: if the data is basically random, i.e., there is no non-random cluster structure, the clustering error will decrease only gradually and insignificantly as the number of clusters grows, and there will be no "true" number of clusters corresponding to an appropriate K.

  2. Determine the number of data clusters.

    After confirming the clustering tendency, we need to find the number of clusters that best matches the real data distribution in order to judge the quality of the clustering results. There are many methods for determining the number of data clusters, such as the elbow method and the Gap Statistic method. Note that the optimal number of clusters found for evaluation may differ from the number of clusters the program outputs: for example, some clustering algorithms determine the number of clusters automatically, and this may differ from the optimal number determined by other methods.

  3. Measure the clustering quality.

    In the unsupervised case, we can evaluate clustering quality by examining the separation between clusters and the compactness within clusters. Being able to define evaluation metrics also demonstrates one's ability to analyze and solve problems in practice. There are many possible metrics; several commonly used ones are listed below, and more can be found in the literature.

    Silhouette coefficient, root-mean-square standard deviation, R-squared, and the modified Hubert Γ statistic. A minimal sketch of the silhouette coefficient follows.
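A minimal sketch computing the silhouette coefficient with scikit-learn (the toy data and names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Ranges from -1 to 1; higher means clusters are compact and well separated.
print(silhouette_score(X, labels))
```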

5. Code implementation

Gaussian mixture model code

K-means code

Machine Learning

Author: @mantchs

GitHub:github.com/NLP-LOVE/ML…

Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]