This article originated from my personal public account: TechFlow. Original content is not easy to produce, so your attention and support are appreciated.


Today's article is the 12th in the machine learning series. Let's take a look at the KMeans clustering algorithm.

In the last article, we discussed the KNN algorithm. KNN is very intuitive: it finds the nearest K neighbors using a distance formula and predicts the current result from the neighbors' results. The algorithm we are going to look at today is also very intuitive, and it is one of the most classical clustering algorithms: KMeans.

As the name suggests, "means" refers to averages, which is why it is also translated as the k-means algorithm. The meaning is the same: the algorithm obtains the class cluster of the samples by taking means.

Now that we know that Kmeans algorithm is related to mean and cluster, there are only two questions left: first, how do we calculate the mean, and second, how do we cluster once we get the mean?

Clustering algorithm

Let's take a look at an example. Suppose we have a sample of users, and we want to divide them into rich, middle-class, and working-class users based on their income, where they live, and how much they spend.

In this problem, we only know that we want to divide the sample into three categories, but we don't know exactly how; that is what we want the model to do for us. In other words, we hope the model can automatically identify the correlations between these samples and gather the strongly correlated samples together into the same class cluster. In this case, we want the model to divide the data into three categories for us.

It would be easy to divide the data manually if we went by income alone: just plot users' incomes and find the best two cut points. But what if income isn't in our data? If we knew whether the user had a car, a house, savings, or outstanding debts, it would be a little less intuitive, but still easy enough to model. And what if we don't even have that, and only know where the user works and where the user lives? The problem becomes much more abstract.

When the features are abstract and obscure, it is often hard to divide the data directly. And since we do not know the true labels, we cannot use a supervised model. To solve the problem, KMeans works the other way around: instead of dividing the data, it lets the data points that are close to each other gather together by themselves. This is the idea behind KMeans, and behind clustering in general: aggregate the data with an algorithm rather than partition it by hand.

In a clustering problem, the model aggregates a series of samples together according to the attributes of the data, and they become the same category. Such a category is called a cluster, and the center of each cluster is called the cluster center. The KMeans algorithm, as the name implies, clusters the samples into K clusters according to the value of K set by the user.

Kmeans principle

I don't know if you've heard this idea before, but humans and computers are in many ways opposites. Problems that are difficult for humans are very simple for computers. Take memory: it is difficult for humans to instantly remember a large number of things, but not for a computer; as long as bandwidth and capacity are sufficient, it can remember any amount of data, and it never misremembers. Or take calculation: it is difficult for humans to quickly evaluate complex formulas, and for multiplication and division beyond two digits we basically rely on tools. Computers are not like that; as long as CPU resources are sufficient, any amount of computation can be done.

But what humans find simple can be very difficult for computers. Take vision: we can easily tell a cat from a dog in a picture, but computers can't. Even in the era of deep learning and AI, sophisticated models and large amounts of data have to be specially designed and trained for computers to learn to distinguish images. Or take creation: human beings can create things that did not exist before, but computers cannot; so-called computer-composed music and writing is only the result of programs following fixed patterns combined with some random fluctuation. Or take thinking: humans can reason about problems they've never seen before, but computers obviously can't.


For example, in the picture above, we humans see three categories at a glance, but a computer doesn't. The data inside the computer is discrete; the computer has no vision and cannot see the connections between data points. So a problem that looks simple to us isn't simple at all for the machine. But this observation already contains the essence of the solution: since the computer can't see the connections, we need to find a way to make it "see" them.

Remember how we just quickly identified the three categories in the graph? Easy, you say: because three regions contain the most points. That's true, but it's not quantified. If we quantify it, it becomes: there are three regions with the highest density. Once quantified, it is clear that we are clustering by density. KMeans is based on this simple idea, but the idea is too simple to yield an algorithm that computes the number of clusters on its own; therefore the number of clusters K has to be provided by the user.

In other words, the algorithm doesn't know how many classes it should cluster into; we have to tell it how many.

Let's ignore this detail and assume that we know, by some mysterious method, that the data should be divided into three categories. How would KMeans divide it?

If we think about it a little, we said we're going to quantify density, but density is actually hard to quantify directly. The definition of density is itself based on the result of clustering: we must already know that a set of points forms a group before we can calculate its density, not the other way around. So the idea makes sense, but we can't apply it directly. However, just because we can't do it directly doesn't mean we can't do it in reverse; this is a very common idea in mathematics, and we see it again here.

Since we can’t cluster by density, can we cluster first and then calculate density, and adjust according to the result of density?

I googled for a long time and couldn’t find Kmeans’s original author, but I figured whoever came up with such a genius idea must be pretty smart. Kmeans is based on such a simple and clever idea.

Initialization

At the start of the algorithm, KMeans randomly selects K center points within the range of the data set and then clusters according to these K centers. Clustering given the centers is easy: for each sample we simply calculate its distance to all K centers and assign it to the closest one.
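To make this concrete, here is a minimal sketch of the assignment step, assuming the samples and centers are stored as NumPy arrays (the function and variable names are illustrative, not part of the final implementation below):

import numpy as np

def assign_to_nearest(samples, centers):
    # distances[i, j] = Euclidean distance from sample i to center j
    distances = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    # each sample is assigned the index of its closest center
    return np.argmin(distances, axis=1)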

Of course, such a result is bound to be off, but that doesn't matter: even starting from wildly inaccurate centers, we can still complete a round of clustering. If we plot the randomly chosen center positions together with the resulting clustering, we can see that although the initially chosen locations don't look right, the clustering already gives a reasonable result.


The initial clustering result is bound to be inaccurate, but that's fine: we are not afraid of inaccuracy, only of having no result at all. Now that we have a result, we can analyze it and see in which direction to optimize. With a direction of optimization, the result can become more and more accurate, just as in linear regression we do not solve for the best parameter values in one shot but approach them step by step through gradient descent.

Iteration

Before we introduce the specific iterative approach, let's examine the situation. Obviously, because of the random selection, the clustering result is bound to be inaccurate; this is caused by the distance between the randomly chosen centers and the real clusters. In other words, we have to find a way to move the centers closer to the clusters.

So how do we get close? Let’s look at what happens after perfect clustering.


Observe that in a perfect clustering, the center and the cluster overlap. So what property does this center point have? If you're familiar with physics, you can see that this center is the center of mass of the samples. If you're not familiar with the concept, it doesn't matter; as the figure above shows, the sample points are distributed evenly around the center. What does evenly distributed mean? It is easy to see that there are roughly the same number of points, with a similar distribution, to the left and right of the center, above and below it, and in every other direction. Quantifying this, we get that the mean of the coordinates of all points in the cluster is exactly where the center lies.

So the question is, in the case of a clustering error, will the mean of the sample coordinates (the center of mass) coincide with the center we choose? What’s the deviation if they don’t overlap?

In fact, we can guess from the figure above that if the center point we chose is not in the right position, it will not coincide with the center of mass of the samples after clustering, and the deviation points from our chosen center toward that center of mass, that is, toward the real cluster.

This conclusion is also intuitive: the closer we get to the real cluster, the denser the points, so the computed center of mass will clearly lean toward the real cluster. With this conclusion the method becomes very simple: after each round of clustering we calculate the center of mass of each cluster, use these centers of mass as the center points for the next round of clustering, and repeat the process. When the centers before clustering coincide with the centers of mass after clustering, the clustering has converged and we have found the clusters.
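A quick sketch of this update step and convergence test, again with illustrative names consistent with the assignment sketch above (this is not the final implementation, which comes later):

import numpy as np

def update_centers(samples, labels, k):
    # The new center of cluster j is the mean (center of mass) of the samples labelled j.
    # Assumes every cluster received at least one sample; an empty cluster would need re-seeding.
    return np.array([samples[labels == j].mean(axis=0) for j in range(k)])

def has_converged(old_centers, new_centers):
    # Converged when the centers of mass coincide with the centers used for the last assignment.
    return np.allclose(old_centers, new_centers)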

The figure below shows how the center of a class changes with iteration. We can see intuitively that as we iterate, our class center gets closer and closer to the true center of the cluster. After three iterations, we are very close to the final result. Therefore, this conclusion is correct, and the idea of using the center of mass as the new center to iterate is feasible.

[Figure: the cluster centers moving closer to the true cluster centers over successive iterations]

Code implementation

Once the principle and the flow of KMeans are clear, implementing it in Python becomes very simple.

We could of course write the logic to generate the data ourselves, but the sklearn library already provides an API for creating data, and we can easily create the data we want by calling it. We can use datasets.make_blobs to create clustered data: pass in the number of samples, the number of features, the coordinates of the true cluster centers, and the standard deviation of each cluster, and we get a corresponding batch of samples.

from sklearn import datasets

def createData():
    X, y = datasets.make_blobs(n_samples=1000, n_features=2,
                               centers=[[-1, -1], [1, 1], [2, 2]],
                               cluster_std=[0.2, 0.3, 0.4])
    return X, y


Now that we have created the data, we are ready to implement the algorithm.

First, let's develop the basic methods of the whole algorithm to simplify the subsequent development. In the KMeans problem, we already know that we adjust a sample's category according to the distance between the sample vector and each cluster center in the sample space. Therefore, we first implement the calculation of the distance between two vectors.

With Numpy, the whole calculation process becomes very simple:

import numpy as np

def calculateDistance(vecA, vecB):
    return np.sqrt(np.sum(np.square(vecA - vecB)))


In this function, we first compute the difference between the two vectors. Then we sum the squares of the components of this difference vector and take the square root, which gives us the Euclidean distance between vectors A and B.
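As a quick sanity check (with made-up values), the classic 3-4-5 right triangle should give a distance of 5:

vecA = np.array([0.0, 0.0])
vecB = np.array([3.0, 4.0])
print(calculateDistance(vecA, vecB))  # prints 5.0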

Next, we need to randomly generate the coordinates of the K cluster centers. Although the selection of initial centers in KMeans is random, note that the range of randomness is not unlimited. Since clustering looks for the K locations with the highest sample density, a legitimate cluster center cannot appear where no samples are distributed. So we can limit the random range to the extent of the sample distribution, which also greatly simplifies the computation.

def randomCenter(samples, K):
    m, n = np.shape(samples)
    centers = np.mat(np.zeros((K, n)))
    for i in range(n):
        # obtain the maximum value of column i via np.max
        mxi = np.max(samples[:, i])
        # obtain the minimum value of column i via np.min
        mni = np.min(samples[:, i])
        rangeI = mxi - mni
        # assign random coordinates within [mni, mxi] to the i-th column of the cluster centers
        centers[:, i] = np.mat(mni + rangeI * np.random.rand(K, 1))
    return centers


The logic above is not difficult to understand. We first create a coordinate matrix for the K cluster centers and initialize it to zero, where n is the number of dimensions of the samples. Next, we iterate over the n dimensions, looking for the maximum and minimum value of each dimension in the sample. With these two values we know the range of the cluster centers on each dimension. Finally, we call np.random.rand to randomize the specific coordinates.

At this point, the two basic tools needed for the algorithm have been developed. Next, as long as the iterative process is implemented, the whole KMeans is completed.

Before we move on, let’s test the two interfaces we’ve developed.

First of all, we have data:

[Figure: printout of the generated sample data]

If there is data output, it indicates that our data has been generated. Then, K cluster centers are randomly selected according to the generated data.

When we generated the data, we passed in three sample centers, so the number of cluster centers is 3, that is, our K is 3. We then call the randomCenter method and check the result.

Sure enough, we generated three points. To be sure, we need to output the range of the sample and check that the coordinates of the points we generate are within the range of our sample.

Using NumPy's max and min methods, combined with Python's slicing, we can obtain these four values very easily. Obviously, our cluster centers are all within range; there is nothing wrong with our code.
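A minimal version of this check might look like the following, assuming the createData and randomCenter functions defined above (variable names are just for illustration):

x, y = createData()
centers = randomCenter(x, 3)
print(centers)

# Each random center should fall inside the sample's min/max range on every dimension.
print(np.min(x[:, 0]), np.max(x[:, 0]))  # range of the first feature
print(np.min(x[:, 1]), np.max(x[:, 1]))  # range of the second feature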

[Figure: the three random cluster centers and the min/max range of each sample dimension]

Once these two methods are ok, we can start to develop the core logic of KMeans, which is the computational logic of clustering.

According to the procedure we worked out above, we first randomly obtain the cluster centers, then classify each sample according to the cluster centers, and finally update the location of each cluster center according to the labelled samples. The whole logic is actually very simple and not complicated to write in code:

def KMeans(dataset, k):
    m, n = np.shape(dataset)
    # For each sample: the first column is its category, the second its distance to the cluster center
    clusterPos = np.zeros((m, 2))

    centers = randomCenter(dataset, k)
    clusterChange = True
    while clusterChange:
        clusterChange = False
        # Walk through all samples
        for i in range(m):
            minD = np.inf
            idx = -1
            # Traverse the distance to each cluster center
            for j in range(k):
                dis = calculateDistance(centers[j, :], dataset[i, :])
                if dis < minD:
                    minD = dis
                    idx = j
            # If the category changes, we need another iteration
            if clusterPos[i, 0] != idx:
                clusterChange = True
            # Update the sample's clustering result
            clusterPos[i, :] = idx, minD
        # Update the coordinates of the cluster centers
        for i in range(k):
            nxtClust = dataset[np.nonzero(clusterPos[:, 0] == i)[0]]
            centers[i, :] = np.mean(nxtClust, axis=0)
    return centers, clusterPos


Now, let’s test our code to see if we can cluster the right results.

import matplotlib.pyplot as plt

centers, clusterRet = KMeans(x, 3)
plt.scatter(x[:, 0], x[:, 1], c=clusterRet[:, 0], s=3, marker='o')
plt.scatter(centers[:, 0].A, centers[:, 1].A, c='red', s=100, marker='x')
plt.show()


We plotted all the points in the sample according to the clustering results, and then marked the location of the cluster center on the same graph. The results are as follows.

[Figure: scatter plot of the clustered samples, with the three cluster centers marked as red crosses]

It is not difficult to see in the figure above that both the locations of the cluster centers and the final clustering result basically match what we would estimate by hand. This shows that the KMeans algorithm we wrote runs successfully and outputs correct results.
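As an optional cross-check (not part of the original code), we can run scikit-learn's built-in KMeans on the same data and compare the centers it finds with ours; they should be close, up to the ordering of the clusters:

from sklearn.cluster import KMeans as SklearnKMeans

model = SklearnKMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
print(model.cluster_centers_)  # should be near [[-1, -1], [1, 1], [2, 2]]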

Conclusion

That wraps up the principle and code of the KMeans algorithm. I don't know how you feel about it, but when I first learned this algorithm my strongest impression was how simple it is. The algorithm is almost too plain: it is easy to understand, there is nothing convoluted or complicated about it, and every problem and idea is straightforward.

Simple algorithms are easy to learn, but overly simple algorithms often come with shortcomings, and KMeans's weaknesses are obvious; I believe you have already felt them. In each iteration we need to recompute the category of every sample, which is a full pass over the data. And since the initial centers are chosen randomly, their starting positions may be quite far from the final clusters; the farther away they are, the more iterations are needed, and the greater the computational cost.

Then, is there any way to improve the efficiency of Kmeans?

Think about that for a moment, and I’ll talk about it next week in machine learning.

Also, because the KMeans algorithm is simple in principle and easy to implement, it often appears in the written tests of major companies' recruiting. As far as I know, Alibaba's written test for several years asked candidates to hand-write a KMeans clustering. So although the algorithm is simple, we should not take it lightly. Moreover, don't be satisfied with just understanding the principle; the more you think and question, the deeper your understanding will be, and the more flexibly you will handle interviews later.

That's all for today's article. If you feel you've gained something, please follow or share; your support means a lot to me.