• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/3…
  • This article's address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reproduction, please contact the platform and the author, and indicate the source.

Introduction

Clustering is the most common unsupervised learning algorithm. It refers to partitioning a data set into different classes or clusters according to a specific criterion (such as distance), so that data objects within the same cluster are as similar as possible and data objects in different clusters are as different as possible. In other words, after clustering, data of the same class should be gathered together as much as possible, and data of different classes should be separated as much as possible.

Clustering algorithms are applied in many scenarios, such as automatic news grouping, user segmentation, image segmentation, and so on. In many cases, the results produced by an unsupervised clustering algorithm can also be used as features in subsequent supervised learning to improve the overall effect. Below, ShowMeAI takes you through clustering algorithms.

(This part on clustering algorithms assumes basic machine learning knowledge. Readers without this background can first read the ShowMeAI article Illustrated Machine Learning | Machine Learning Basics.)

1. Clustering problem

1) Clustering problems and core concepts

As the saying goes, birds of a feather flock together. What a clustering algorithm does is group unlabeled data according to its distribution, so that similar data points fall into the same cluster as far as possible.

Let’s compare and distinguish clustering and classification:

  • Clustering is a kind of unsupervised learning while classification is a kind of supervised learning.
  • Clustering only requires manually specifying the similarity criterion and the number of categories, while classification must learn a classification method from a training set.

2) Clustering algorithm usage

Clustering algorithms are widely used. When facing the unknown, first grouping things and then studying each group one by one is a basic way for humans to explore the world. Clustering algorithms can be applied to exploratory data mining, statistical analysis, bioinformatics, data compression, computer image recognition, medical image analysis, and more. In the commercial field they can be used for market research and product categorization; in the social sciences, for crime-area analysis and so on.

Here are some sample points, which are to be grouped into three categories based on physical distance. You only need to tell the algorithm the following information, and it can complete the clustering to your requirements:

  • The number of clusters is 3;
  • The clustering criterion is physical distance;
  • The resulting classes are shown in red, green, and blue.

In fact, in addition to physical distance, any rule you can think of in real life that a computer can evaluate through computation and logic can be used as a clustering criterion.

Let's use image compression as an example. On the far left is a color photo, about 1 MB in size, which can be compressed down to tens or hundreds of kilobytes. This is what compression software does. So what is the principle behind it? One answer is the clustering algorithm.

The process from the original image to compressed storage is as follows:

Clustering algorithms can also be used for image segmentation. Each pixel in the image is a 3-dimensional vector corresponding to its [R, G, B] value. Given the number of clusters K, the algorithm represents the original image with K different colors, each pixel being assigned one of the K colors. Details are as follows:
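As a concrete illustration of this idea, here is a minimal color-quantization sketch using scikit-learn's KMeans. It is only a sketch: the file name photo.jpg and the choice K=16 are illustrative assumptions, not from the original article.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an image and flatten it into an (n_pixels, 3) array of [R, G, B] vectors.
# "photo.jpg" is a placeholder path.
img = np.asarray(Image.open("photo.jpg").convert("RGB"))
pixels = img.reshape(-1, 3).astype(np.float64)

# Cluster the pixel colors into K groups; each cluster center is one representative color.
K = 16
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with its cluster center: the image now uses only K colors.
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
Image.fromarray(quantized.astype(np.uint8)).save("photo_quantized.png")
```

Storing only the K center colors plus one small index per pixel is essentially the compression scheme described above.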

For documents, news, and products, we often use nested categorization, which is hierarchical clustering:

3) Mainstream clustering algorithms

First, let's get an overview of clustering algorithms. Mainstream clustering algorithms can be divided into two categories: Partitioning Clustering and Hierarchical Clustering. Their main differences are shown here:

Partition clustering algorithms give a series of flat clusters (separate classes), with no explicit structure indicating how the clusters relate to each other.

  • Common algorithms include K-means/K-medoids, Gaussian Mixture Model, Spectral Clustering, centroid-based Clustering, etc.

Hierarchical clustering outputs a hierarchy of clusters, and therefore provides more information than the unstructured set of clusters output by partition clustering. Hierarchical clustering can be viewed as nested partition clusterings.

  • Common algorithms include Single-linkage, Complete-linkage, Connectivity-based Clustering and so on.

The two families use different procedures during clustering. We will focus on the K-means algorithm, the single-linkage algorithm, and the complete-linkage algorithm.

2. K-means clustering algorithm

The K-means algorithm is a very fundamental clustering algorithm and is widely used. ShowMeAI explains how it works below.

1) Core concept of K-means algorithm

As mentioned, a clustering algorithm divides N data points into K classes according to their distribution (in many algorithms, K is set manually in advance). Through clustering we hope to obtain K center points, together with an assignment of each data point to a center point.

  • The center points can be found by an iterative algorithm; they should satisfy the condition that the sum of (squared) distances from all data points to their cluster centers is minimal.

  • Once the center points are determined, each data point belongs to its nearest center point.

Before we get to the core question of how to find the center points, let's settle a few smaller questions:

Q1: How to calculate the distance from the data point to the center point?

  • We usually choose the geometric distance, i.e. the squared L2 (Euclidean) distance, as written out below.
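Concretely, with data points $x_1, \dots, x_N$ and cluster centers $\mu_1, \dots, \mu_K$, this corresponds to minimizing the standard K-means objective

$$J(\mu_1, \dots, \mu_K) = \sum_{i=1}^{N} \min_{1 \le k \le K} \lVert x_i - \mu_k \rVert_2^2,$$

i.e. each point contributes the squared L2 distance to its nearest center.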

Q2: Are the center points unique? Is there a global optimal solution?

  • With multiple center points, global optimization is a very hard problem. Theoretically a global optimal solution exists, but it may not be found in practice. Since the global optimum is hard to find, we settle for the next best thing and look for a local optimum.

Q3: How are the clustering results represented?

  • Space partitioning is used: the space is divided into multiple polygonal cells (a Voronoi diagram), each cell corresponding to one cluster center.

2) K-means algorithm steps

K-means determines the center points iteratively, using an EM-style alternating procedure. Each iteration has two steps:

  • (1) Update center points: at initialization, randomly selected points serve as the starting centers; during iteration, the center of gravity (centroid) of all data points in the same class becomes the new center point.

  • (2) Assign data points: every data point is assigned to its nearest center point.

Repeat the above two steps until the center points no longer change. The process is shown in the figure:

  • Left (Assignments): at first, three points are randomly selected as the centers of the three classes, and the other points are assigned to clusters based on their distances to these three centers; then each class recomputes its center.

  • Right (Updated Means): all data points are reassigned according to the new center points (one point that originally belonged to the green center was reassigned to the red center after this iteration).
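To make the two alternating steps concrete, here is a minimal NumPy sketch of K-means (an illustrative, unoptimized version; empty clusters are not handled, and initialization simply picks k random data points):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate 'assign points' and 'update centers'."""
    rng = np.random.default_rng(seed)
    # Initialization: k randomly chosen data points serve as the starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center (squared L2 distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the centroid of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stop when centers no longer move
            break
        centers = new_centers
    return centers, labels
```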

The following example shows how K-means iterates dynamically until convergence:

  • Figure (a) has a group of scattered points, and we set the number of clusters K=2.

  • Figure (b) shows the result of the first assignment after randomly selecting two points as the initial centers.

    • You can see that the boundary between the red and blue classes runs right through the middle of the point cloud. That is obviously problematic, but it doesn't matter; the algorithm keeps going. Compute the centers of the red and blue classes.
  • As can be seen in figure (c), one center falls in the lower-left group and the other in the upper-right group. The second assignment is performed with the new center points.

  • The boundary in figure (d) is already basically enough to separate the two clusters.

  • Figures (f) and (g) show that the subsequent iterations of "update center points → reassign data points" have converged: the data point assignments are essentially stable, and clustering is complete.

The GIF below shows the process more clearly:

3. K-means disadvantages and improvements

1) Disadvantages of k-means algorithm

The K-means algorithm is simple and easy to use, but what are its disadvantages? We summarize some shortcomings of K-means below:

  • Disadvantage 1: the center point is the centroid of all data points in the same class, so a cluster center may not be an actual sample point in the data set.

  • Disadvantage 2: when computing distances we use the squared L2 distance, which is very sensitive to outliers. Noisy data and outliers can pull a center away and even change the position of the dividing boundary, as the small example below shows.
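The sensitivity in Disadvantage 2 is easy to check numerically. In this tiny 1-D example (illustrative data, chosen by us), a single outlier drags the mean (a K-means-style center) far from the bulk of the data, while the medoid (used by K-Medoids, introduced next) stays put:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # four normal points plus one outlier

# K-means-style center: the mean is pulled toward the outlier.
mean_center = data.mean()  # 22.0

# Medoid: the data point minimizing the total L1 distance to all other points.
total_l1 = np.abs(data[:, None] - data[None, :]).sum(axis=1)
medoid = data[total_l1.argmin()]  # 3.0, still a central, real data point

print(mean_center, medoid)
```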

2) K-Medoids algorithm

To address these shortcomings of K-means, the K-Medoids algorithm makes improvements:

(1) The cluster centers are restricted to be actual data points.

  • The center computation changes from directly taking the center of gravity to finding a data point near the center of gravity to serve as the new center point.

  • The center-update step of K-Medoids is accordingly more expensive than that of K-means.

(2) To avoid the sensitivity of squared distances to outliers, the square is replaced by the absolute value (the L1 distance).

To sum up, the iteration process of K-Medoids is the same as that of K-means; the differences are as follows:

  • The starting points are not arbitrary random points, but points selected from the data set.

  • Distances use the L1 distance instead of the L2 distance.

  • The new center point is not the centroid of all points in the class, but the data point in the class that is closest to all the other points of the class (i.e. minimizes the total distance to them).

  • In terms of complexity, compared with the $O(n)$ center update of K-means, the center update of K-Medoids costs $O(n^2)$.

The following figure is a systematic comparison of the K-means and K-Medoids algorithms:
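Putting these differences together, below is a minimal K-Medoids sketch in the same style as the K-means code above (illustrative only; the $O(n^2)$ medoid search is done naively):

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Minimal K-Medoids: like K-means, but centers are data points (L1 distance)."""
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), size=k, replace=False)]  # start from data points
    for _ in range(n_iter):
        # Assignment step: nearest medoid under the L1 distance.
        dists = np.abs(X[:, None, :] - medoids[None, :, :]).sum(axis=-1)
        labels = dists.argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # Update step: O(n^2) search for the member minimizing the total
            # L1 distance to all other members of the same class.
            costs = np.abs(members[:, None, :] - members[None, :, :]).sum(axis=(1, 2))
            new_medoids[j] = members[costs.argmin()]
        if np.allclose(new_medoids, medoids):  # stop when medoids no longer move
            break
        medoids = new_medoids
    return medoids, labels
```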

4. Hierarchical clustering algorithm

In contrast to partition clustering such as K-means, we also have hierarchical clustering algorithms.

1) Hierarchical clustering vs partition clustering

Partition clustering produces a set of clearly separated classes, while hierarchical clustering produces a tree-like hierarchical structure.

Converting a hierarchical clustering into a partition clustering is very simple: cutting the hierarchy at one level yields a partition clustering. As shown below:

2) Single-linkage algorithm

Next we introduce the single-linkage algorithm, a hierarchical clustering method. The algorithm builds a binary tree whose leaves represent data points and whose internal nodes each represent a cluster. As shown in the figure:

This is a bottom-up (agglomerative) clustering: the tree starts from the leaves, branches gradually grow and merge, and they keep growing until they reach the root.

If there are many leaves, the growing process involves many merges of classes. There are 11 data points in the figure, so a total of 10 merges took place.

3) Complete-linkage algorithm

The complete-linkage algorithm iterates in the same way as single-linkage; the only difference is how the distance between two classes is computed when merging. Single-linkage uses the distance between the two closest points of the two classes as the inter-class distance, while complete-linkage does the opposite: it uses the distance between the two farthest data points as the distance between the two classes.

Both ways of computing the distance have advantages and disadvantages. In general, the computational complexity of hierarchical clustering is $O(n^3)$, which is quite high. The algorithm can be accelerated with a priority-queue data structure, bringing the cost down to $O(n^2 \log n)$.
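Both linkage criteria are implemented in SciPy. The sketch below (toy random data and the choice of 3 clusters are illustrative assumptions) builds each hierarchy and then cuts it into flat clusters, which is exactly the hierarchy-to-partition conversion described earlier:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # toy 2-D data points

# Build the binary merge tree bottom-up; each row of Z records one merge.
Z_single = linkage(X, method="single")      # class distance = closest pair
Z_complete = linkage(X, method="complete")  # class distance = farthest pair

# Cut each hierarchy into 3 flat clusters (hierarchical -> partition clustering).
labels_single = fcluster(Z_single, t=3, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=3, criterion="maxclust")
print(labels_single, labels_complete)
```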

5. DBSCAN algorithm

1) DBSCAN algorithm

In the previous sections we introduced partition clustering and hierarchical clustering algorithms. Next we learn another clustering algorithm: DBSCAN.

DBSCAN is a density-based clustering algorithm. If K-means is applied to irregularly shaped point sets like the one shown below, the result will not be very good. With DBSCAN, points in the same density region can be grouped into one class.

2) Key concepts of the DBSCAN algorithm

Core object: a point lying in a dense region.

  • If the $\epsilon$-neighborhood of $x_j$ contains at least MinPts samples, i.e. $|N_\epsilon(x_j)| \geq \text{MinPts}$, then $x_j$ is a core object.

Density-reachable: core objects can be directly density-reachable or density-reachable from one another.

  • Directly density-reachable: if $x_i$ lies in the $\epsilon$-neighborhood of $x_j$ and $x_j$ is a core object, then $x_i$ is said to be directly density-reachable from $x_j$.

  • Density-reachable: for $x_i$ and $x_j$, if there is a sequence of samples $p_1, p_2, \dots, p_n$ with $p_1 = x_i$, $p_n = x_j$, and each $p_{i+1}$ directly density-reachable from $p_i$, then $x_j$ is density-reachable from $x_i$.

Density-connected: points that are density-reachable from a common core object are density-connected.

  • For $x_i$ and $x_j$, if there exists an $x_k$ such that both $x_i$ and $x_j$ are density-reachable from $x_k$, then $x_i$ and $x_j$ are said to be density-connected.

Let’s take a closer look at some of the basic concepts mentioned above.

First assume that the required minimum point density MinPts is 3.

  • Within radius $\epsilon$, the number of points around $x_1$ is 5, exceeding the threshold of 3, so $x_1$ is a core object. Likewise, $x_2$, $x_3$, and $x_4$ are core objects.

  • $x_1$ and $x_2$ lie in each other's neighborhoods, so they are directly density-reachable; $x_3$ and $x_2$ are also directly density-reachable; therefore, via $x_2$, $x_1$ and $x_3$ are density-reachable.

  • $x_3$ and $x_4$ are density-connected through multiple core objects.

3) Pseudo-code of the DBSCAN algorithm

The process, described in plain language:

  • For a data set, specify the minimum point density MinPts and the radius $\epsilon$.

  • First find the core objects: if the number of points within radius $\epsilon$ of a point is at least MinPts, that point is a core object. Put all core objects into a set.

  • Pick a core object from this set at random and check which other data points are directly density-reachable from it; if they are, add them to its cluster.

  • Keep checking whether other points are directly density-reachable from points already in the cluster, until all points have been examined. All points in the cluster then form one density cluster.
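That plain-language description translates almost line for line into code. Below is a minimal, unoptimized sketch (the names eps and min_pts mirror the radius $\epsilon$ and MinPts above; -1 marks noise points belonging to no cluster):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: grow density clusters from core objects."""
    n = len(X)
    # Pairwise distances; a point's eps-neighborhood includes itself.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}

    labels = np.full(n, -1)  # -1 = unassigned; points left at -1 are noise
    cluster = 0
    for i in core:
        if labels[i] != -1:      # this core object already belongs to a cluster
            continue
        labels[i] = cluster      # start a new density cluster from it
        frontier = [i]
        while frontier:
            j = frontier.pop()
            if j not in core:    # border points join a cluster but are not expanded
                continue
            for k in neighbors[j]:   # points directly density-reachable from j
                if labels[k] == -1:
                    labels[k] = cluster
                    frontier.append(k)
        cluster += 1
    return labels
```

Mature implementations (for example, sklearn.cluster.DBSCAN) follow the same idea with more efficient neighborhood data structures.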

For summaries of more unsupervised learning algorithms and models, see ShowMeAI's AI Knowledge Quick Reference | Machine Learning – Unsupervised Learning.

Video tutorial

You can click through to Bilibili to watch the [bilingual subtitles] version of the video:

【MIT Bilingual + Materials Download】6.036 | Introduction to Machine Learning (2020 · Complete Version)

www.bilibili.com/video/BV1y4…

Recommended related articles from ShowMeAI

  • 1. Machine learning basics
  • 2. Model evaluation methods and criteria
  • 3. KNN algorithm and its application
  • 4. Detailed explanation of logistic regression algorithm
  • 5. Detailed explanation of naive Bayes algorithm
  • 6. Detailed explanation of decision tree model
  • 7. Detailed explanation of random forest classification model
  • 8. Detailed explanation of regression tree model
  • 9. Detailed explanation of GBDT model
  • 10. Detailed explanation of XGBoost model
  • 11. Detailed explanation of LightGBM model
  • 12. Detailed explanation of support vector machine model
  • 13. Detailed explanation of clustering algorithm
  • 14. Detailed explanation of PCA dimensionality reduction algorithm

Recommended ShowMeAI tutorial series

  • Illustrated Python Programming: From Beginner to Master series of tutorials
  • Illustrated Data Analysis: From Beginner to Master series of tutorials
  • Mathematical Foundations of AI: From Beginner to Master series of tutorials
  • Illustrated Big Data Technology: From Beginner to Master series of tutorials
  • Illustrated Machine Learning Algorithms: From Beginner to Master series of tutorials