This article originally appeared on my personal public account, TechFlow. Original writing is not easy; your follow is appreciated.


Today is the 13th article in the machine learning series. Let's look at how to optimize the Kmeans algorithm.

At the end of the earlier article on the Kmeans clustering algorithm, we raised a question: although Kmeans works well, every iteration has to traverse the full dataset. Once the data volume is large, the per-iteration cost is high and the iterations add up, so convergence becomes very slow.

Think about it. If we were asked this question in an interview, and we didn’t know the answer beforehand, what would we say?

As always, let's analyze the question before answering it. The problem is slow convergence and high computational cost, and we know where the cost comes from: a large sample size on one hand, and too many iterations on the other. So obviously, if we want to improve things, these are the two places to start.

These two points are the crux of the problem, and there are many ways to optimize around them. In other words, this is an open question where the reasoning matters more than any standard answer. On the other hand, if we miss the point, the answer goes off the mark, which is why, in interviews I have run, some candidates answer with "use a distributed system", "add more compute", or "switch to another algorithm".

That is to say, the process of analyzing and solving problems is more important than the solution itself.

Below, we introduce an optimization method for each of the two key points mentioned above.

Mini batch

The idea of mini batch is very simple. Since the full dataset is too large and makes each iteration too slow, can we simply reduce the scale of the data?

How do we shrink it? Very simply: we take a random sample from the whole dataset and let this small fraction stand in for the whole. By artificially reducing the sample size, we speed up each iteration.

It is true that we can improve the efficiency of iteration by sampling, but does this guarantee correctness?

That’s an easy question to answer, and we just need to do a simple experiment to prove it.

We reused last week's unoptimized code and increased the number of generated samples to 50,000. As the figure below shows, it took Kmeans 37.2 seconds to finish. The clustering result looks like this:


Then we randomly select 1,000 samples with numpy's np.random.choice and compare the running time and results before and after.


Let's also compare the cluster centers from the two runs. As the figure shows, the difference between them is very small. Printing the coordinates confirms that the error is within 0.05, which is very close indeed.
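To make the experiment concrete, here is a minimal sketch of the mini batch comparison, assuming the kmeans(dataset, k) function and the 50,000-sample dataset array from the previous article (these names are placeholders for last week's code):

import numpy as np

# Full data: accurate but slow
full_centers = kmeans(dataset, 3)

# Mini batch: cluster a random subsample of 1,000 points instead
# (replace=False keeps the sampled points distinct; this is an assumption, not the article's exact call)
sample_idx = np.random.choice(len(dataset), 1000, replace=False)
mini_centers = kmeans(dataset[sample_idx], 3)

# Compare the two sets of centers; in the experiment above the gap was within 0.05
print(np.asarray(full_centers))
print(np.asarray(mini_centers))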


Although the principle behind mini batch sounds almost trivial once spelled out, it is important, and not just important but also widely used in machine learning. In big data scenarios, almost every model relies on some form of mini batch optimization.

Still, a question naturally arises. This scheme is completely random and looks unreliable: could the sample we draw be heavily biased, for example with all points landing in the same cluster? In theory, yes. To be safe, we can repeat the sampling several times and average the resulting cluster centers until they stabilize, or we can simply fix the number of repetitions in advance and stop once it is reached.
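As a rough sketch of this "repeat and average" idea, again assuming the hypothetical kmeans(data, k) helper from the previous article, and matching clusters across runs by nearest distance (one simple way to keep corresponding centers aligned before averaging):

import numpy as np

def repeated_minibatch_centers(dataset, k, batch_size=1000, rounds=10):
    # Run Kmeans on several independent random samples
    runs = []
    for _ in range(rounds):
        idx = np.random.choice(len(dataset), batch_size, replace=False)
        runs.append(np.asarray(kmeans(dataset[idx], k)))
    # Use the first run as a reference and align every other run's centers
    # to it by nearest distance, so that we average matching clusters
    reference = runs[0]
    aligned = [reference]
    for centers in runs[1:]:
        order = [int(np.argmin(np.linalg.norm(centers - ref_c, axis=1))) for ref_c in reference]
        aligned.append(centers[order])
    # The averaged centers are our stabilized estimate
    return np.mean(aligned, axis=0)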

Kmeans++

If mini batch is a generic trick that can feel a bit like a shortcut, the following method is much more hardcore: it optimizes the Kmeans algorithm itself, which is why it is called Kmeans++.

As mentioned above, there are roughly two angles for making Kmeans more efficient: too many samples and too many iterations. The mini batch approach we just introduced targets the former; Kmeans++ targets the latter, reducing the number of iterations needed to reach convergence.

The idea is clear, but pulling it off is not simple, because the number of iterations is tied to convergence itself. We cannot just cut iterations short before convergence, or the algorithm simply fails to converge. Clustering also differs from classification, where we have an explicit loss function to optimize: with gradient descent we can enlarge the learning rate in front of the gradient to speed up convergence. Kmeans has no such knob. Each update simply moves a cluster center to the mean coordinate of its points, i.e., the centroid, so there is no way to make a single iteration move faster unless we change the iteration logic itself.

That conclusion follows from how the algorithm works, and it is correct as far as it goes. But convergence speed does not depend only on how much each iteration changes things; there is another important factor: where the iteration starts.

In other words, from what starting point do we begin converging? Obviously, the closer the initial state is to the final converged state, the fewer iterations we need. So the goal of the optimization is to find an initial state that is already close to the converged result. The idea is easy to state, but it hides a big question: during training we do not know what the converged state is, so how can we judge how far the initial state is from it?

Obviously going straight isn’t going anywhere. We need to make a detour.

A bit of analysis actually gets us quite far. First, randomly picking K sample points as the initial cluster centers works better than randomly picking K arbitrary coordinates. The reason is simple: random coordinates are drawn from the whole bounding rectangle spanned by the minimum and maximum values of the data, whereas the samples occupy only a small part of that area. And because the samples themselves are clustered, an initial state chosen from the samples is far more likely to land near a cluster than a purely random coordinate.
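A small sketch of the two initialization strategies being compared here (purely illustrative; dataset is assumed to be an (n, 2) numpy array of samples):

import numpy as np

def random_coordinate_init(dataset, k):
    # Draw k points uniformly inside the bounding box of the data;
    # they can easily land in empty regions far from any cluster
    low, high = dataset.min(axis=0), dataset.max(axis=0)
    return np.random.uniform(low, high, size=(k, dataset.shape[1]))

def random_sample_init(dataset, k):
    # Draw k actual samples; since samples concentrate inside clusters,
    # these starting centers are much more likely to sit near a cluster
    idx = np.random.choice(len(dataset), k, replace=False)
    return dataset[idx]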

But there is a small problem. In the example above there are 3 clusters, so we randomly select 3 samples as the initial state. What if all 3 happen to fall inside the same cluster? Wouldn't it then take a long time to converge?

This is a real concern, and we would like to avoid picking several points from the same cluster. But since we do not know the sample distribution, how can we tell?

Here we need another property of clusters. Look at the figure below:


We can see that clusters are centripetal: points close to a cluster tend to fall within its range, which in turn means that two points far apart from each other are more likely to belong to different clusters than two points close together.

Kmeans++ is built on these two observations. Putting together what we have so far gives us the principle of the algorithm.

Algorithm principle

First, the initial cluster centers are drawn from the samples themselves. But instead of drawing all K at once, we draw only 1.

Then, from the remaining n-1 points, we randomly pick the next cluster center. This randomness is not blind, however: we want a mechanism under which the farther a point is from all chosen cluster centers, the more likely it is to be selected, and the closer it is, the less likely.

We repeat the process until a total of K cluster centers are selected.

Roulette wheel

Let's look at how to turn weights into selection probabilities. There are several algorithms for this, but the simplest is the roulette wheel method, which is presumably inspired by gambling and lottery wheels and works on a very similar principle.

Most of us have played a spin-the-wheel lottery in a supermarket or some other setting: the pointer stays fixed, we spin the wheel, and when it stops, whatever the pointer lands on is the result of the draw.

We all know that the probability of landing on a prize is proportional to its area on the wheel: the larger the area, the higher the probability; the smaller the area, the lower.


Expressed as a formula, the probability that each point is selected is:

$$P(x_i) = \frac{D(x_i)}{\sum_{j=1}^{n} D(x_j)}$$

where $D(x_i)$ is the shortest distance from point $x_i$ to all of the cluster centers chosen so far, and $P(x_i)$ is the probability that $x_i$ is selected as the next cluster center. (The classic Kmeans++ formulation weights by the squared distance $D(x_i)^2$; the code below uses the plain distance, which follows the same idea.)

The roulette wheel method simulates this lottery with an array instead of a wheel. We flatten the wheel's sectors into a strip, so each sector becomes an interval whose length corresponds to the sector's area; obviously, the longer the interval, the more likely it is to be picked. To spin the wheel, we multiply the total length of all intervals by a random number between 0 and 1.

The interval in which this value falls is the outcome of the spin. In this way, we control the probability of each outcome while keeping the selection random.


In the figure above, the random draw is 0.68; we then subtract each interval's length in turn, and the interval at which the running value first drops to zero or below is the one selected.
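Here is a tiny standalone sketch of the roulette wheel selection just described; weights plays the role of the interval lengths (the distances D(x)):

import random

def roulette(weights):
    # Spin the wheel: a random fraction of the total length
    threshold = random.random() * sum(weights)
    for i, w in enumerate(weights):
        threshold -= w
        if threshold <= 0:
            return i
    # Guard against floating point error
    return len(weights) - 1

# Example: intervals of length 1, 2, 3 and 4; index 3 wins about 40% of the time
print(roulette([1, 2, 3, 4]))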

Conclusion

Once the roulette wheel algorithm is understood, the whole idea of Kmeans++ becomes clear at a glance: we treat the selection of cluster centers as a roulette wheel lottery and use it to draw K samples as the initial cluster centers. This way the starting point is already close to the final result, and the number of iterations needed can be reduced as much as possible.

So, does this approach work?

Again, let's verify it experimentally, starting with the code. We need a helper function that computes the minimum distance between a sample and the cluster centers chosen so far; this distance is what we feed into the roulette wheel.

The function is very simple: compute the distances and take the minimum:

def get_closest_dist(point, centroids):
    # Start from infinity and keep the running minimum
    min_dist = math.inf
    for centroid in centroids:
        # calculateDistance is the distance helper from the previous article's Kmeans code
        dist = calculateDistance(point, centroid)
        if dist < min_dist:
            min_dist = dist
    return min_dist


Then we pick the K centers: the first one at random, and each subsequent one by the roulette wheel method based on the distances to the centers already chosen, until all K centers have been picked.

import math
import random
import numpy as np


def kmeans_plus(dataset, k):
    clusters = []
    n = dataset.shape[0]
    # Pick the first center completely at random
    rdx = np.random.choice(range(n), 1)
    # np.squeeze removes the redundant dimension
    clusters.append(np.squeeze(dataset[rdx]).tolist())
    d = [0 for _ in range(len(dataset))]
    for _ in range(1, k):
        tot = 0
        # Minimum distance from each sample to the existing cluster centers
        for i, point in enumerate(dataset):
            d[i] = get_closest_dist(point, clusters)
            tot += d[i]
        # random.random() returns a float between 0 and 1;
        # multiplying the total by it simulates spinning the wheel
        tot *= random.random()
        # Select the next cluster center by the roulette wheel
        for i, di in enumerate(d):
            tot -= di
            if tot > 0:
                continue
            clusters.append(np.squeeze(dataset[i]).tolist())
            break
    return np.mat(clusters)

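As a quick usage example (hypothetical, assuming dataset is the same sample array used in the earlier experiments):

# Pick 3 starting centers with Kmeans++ and inspect them
init_centers = kmeans_plus(dataset, 3)
print(init_centers)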

Finally, let’s draw a picture to see what it looks like:


In the figure above, the white dots mark the final converged positions and the red X marks the starting positions computed by Kmeans++. They are already very close to the final result, so clearly only a few iterations are needed to converge.

Of course, Kmeans++ itself is still random, and not every draw of starting points will work out this well. But the strategy ensures that even the worst case will not be too bad.

In real scenarios, when we need to apply Kmeans to large-scale data, we usually combine several optimization strategies and average over multiple runs, so that we get a good result in a relatively short time. This is the essence of much algorithm optimization in machine learning: stop chasing the optimal solution and settle for one that is good enough. Very often, a small compromise on quality buys a large gain in efficiency.
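For example, here is a rough sketch of combining the two strategies, under the assumption that the kmeans function from the previous article can accept initial centroids (the signature is illustrative, not the article's actual API):

import numpy as np

def fast_kmeans(dataset, k, batch_size=1000):
    # 1. Mini batch: work on a random subsample instead of the full data
    idx = np.random.choice(len(dataset), batch_size, replace=False)
    batch = dataset[idx]
    # 2. Kmeans++: pick good starting centers so fewer iterations are needed
    init_centers = kmeans_plus(batch, k)
    # 3. Run the ordinary Kmeans iterations from those starting centers
    return kmeans(batch, k, init_centers)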

That's all for today's look at Kmeans optimization. If you found it useful, please like or share this article; your support means a lot to me.