Abstract: In data mining, the k-means algorithm is a cluster analysis algorithm. It computes groupings in data by repeatedly assigning points to the nearest mean, starting from a set of seed points.


The problem

The main problem solved by the k-means algorithm is shown in the figure below. On the left-hand side of the graph, we can see with the naked eye that there are four point groups. But how do we find those point groups with a computer program? That is where the k-means algorithm comes in.

[Figure: the clustering problem that K-means solves]

Algorithm overview

The algorithm is actually quite simple, as shown below:

[Figure: the four steps of the k-means iteration]

In the figure above, A, B, C, D and E are the five data points, and the gray points are our seed points, the points we use to find the point clusters. There are two seed points, so K = 2.

Then, the k-means algorithm is as follows:

  1. Randomly pick K (here K=2) seed points in the graph.
  2. Calculate the distance from every point in the figure to each of the K seed points. If point Pi is closest to seed point Si, then Pi belongs to the Si point group. (In the figure above, A and B belong to the upper seed point, while C, D and E belong to the lower-middle seed point.)
  3. Move each seed point to the center of its "point cluster". (See step 3 in the chart.)
  4. Repeat steps 2 and 3 until the seed points stop moving (in step 4 we can see the top seed point gathering A, B and C, and the bottom seed point gathering D and E).
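The four steps above can be sketched directly in Python. This is a minimal sketch for 2-D points; the function names (`kmeans`, `dist2`) and the optional `seeds` parameter for supplying fixed starting points are my own, not from the original article:

```python
import random

def dist2(a, b):
    # Squared Euclidean distance: enough to decide which seed is nearest.
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, seeds=None, max_iter=100):
    # Step 1: randomly pick K seed points (or use the ones supplied).
    if seeds is None:
        seeds = random.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, seeds[i]))
            clusters[nearest].append(p)
        # Step 3: move each seed point to the center (mean) of its cluster.
        new_seeds = []
        for old, cluster in zip(seeds, clusters):
            if cluster:
                n = len(cluster)
                new_seeds.append((sum(p[0] for p in cluster) / n,
                                  sum(p[1] for p in cluster) / n))
            else:
                new_seeds.append(old)  # empty cluster: keep the old seed
        # Step 4: stop once the seed points no longer move.
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return seeds, clusters
```

For example, clustering two well-separated groups with hand-picked starting seeds converges in a couple of iterations:

```python
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds, clusters = kmeans(points, 2, seeds=[(0, 0), (10, 10)])
```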

It's a pretty simple algorithm, but a few details deserve mention. I won't spell out the distance formula; if you finished junior high school you know how to compute it. Instead, I want to focus on the algorithm for finding the center of a point group.

Algorithm for finding the center of a point group

In general, the center of a point group can be found by simply averaging the X/Y coordinates of its points. But I want to give you three distance formulas, each of which leads to a different notion of center:

1) Minkowski Distance formula: λ can take any value, negative, positive, or infinite.

d(x, y) = ( Σᵢ |xᵢ − yᵢ|^λ )^(1/λ)

2) Euclidean Distance formula: the case λ = 2 of the first formula.

d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

3) CityBlock Distance formula: the case λ = 1 of the first formula.

d(x, y) = Σᵢ |xᵢ − yᵢ|

The centers found by the three formulas differ, as shown in the figure below (for the first one, λ is between 0 and 1).

[Figure: (1) Minkowski Distance, (2) Euclidean Distance, (3) CityBlock Distance]

The diagrams show how each metric approaches the center: the first in the shape of a star, the second as concentric circles, and the third as a diamond.
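The three formulas above collapse into a single function of λ. A minimal sketch in Python (the function name `minkowski` is mine):

```python
def minkowski(a, b, lam):
    # General Minkowski distance between two equal-length vectors.
    # lam=2 gives Euclidean distance, lam=1 gives CityBlock distance.
    return sum(abs(x - y) ** lam for x, y in zip(a, b)) ** (1.0 / lam)
```

For the classic 3-4-5 triangle, `minkowski((0, 0), (3, 4), 2)` gives 5.0 (Euclidean) while `minkowski((0, 0), (3, 4), 1)` gives 7.0 (CityBlock).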

K-Means demo

If you Google "K-Means Demo" you can find a lot of demos. Here is one recommendation: home.collection.polimi.it/matteucc/Cl…

To operate it: the left mouse button adds data points, the right button adds "seed points", and checking "Show History" lets you watch the iteration step by step.

Note: the page hosting this demo also has a nice K-Means tutorial.

K-Means++ algorithm

K-means has two major flaws, both related to initial values:

  • K is given in advance, and a good value of K is hard to estimate. Often we do not know beforehand how many categories a given data set should be grouped into. (The ISODATA algorithm arrives at a reasonable K by automatically merging and splitting classes.)
  • K-means needs initial random seed points, and this choice matters so much that different random seeds can produce completely different results. (The K-Means++ algorithm addresses this by choosing the starting points effectively.)

Here I focus on the steps of the K-Means++ algorithm:

  1. Pick one point at random from the data set as the first "seed point".
  2. For each point, calculate its distance D(x) to the nearest existing "seed point", store the values in an array, and add them up to get Sum(D(x)).
  3. Take a random value Random between 0 and Sum(D(x)), so that farther points are more likely to be chosen. The implementation: walk through the array computing Random -= D(x) until Random <= 0; the point where this happens is the next "seed point".
  4. Repeat steps 2 and 3 until all K seed points have been selected.
  5. Run the standard k-means algorithm.
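The weighted selection in steps 2 and 3 can be sketched in Python as follows. The function names (`kmeans_pp_seeds`, `euclid`) are my own; the body follows the Random -= D(x) scheme described above:

```python
import random

def euclid(a, b):
    # Plain Euclidean distance between two 2-D points.
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def kmeans_pp_seeds(points, k):
    # Step 1: the first seed point is a uniformly random data point.
    seeds = [random.choice(points)]
    while len(seeds) < k:
        # Step 2: D(x) = distance from each point to its nearest seed.
        d = [min(euclid(p, s) for s in seeds) for p in points]
        # Step 3: draw Random in [0, Sum(D(x))] and walk the array,
        # subtracting D(x) until Random <= 0. Points far from every
        # existing seed have a big D(x), so the walk is more likely
        # to stop on them.
        r = random.uniform(0, sum(d))
        for p, dx in zip(points, d):
            r -= dx
            if r <= 0:
                seeds.append(p)
                break
        else:
            seeds.append(points[-1])  # guard against float rounding
    return seeds
```

The returned seeds are then fed into ordinary k-means as its starting points (step 5).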

Code for this can be found in "Implement the K-means++ algorithm" (the link may be blocked). Apache Commons Math also implements this algorithm.

Application of k-means algorithm

If you have read this far, you might say that k-means looks pretty simple, that it is just playing around with coordinate points and not really useful, and that with all its defects it is no better than the human eye. Yes, the previous example only played with two-dimensional coordinate points, which is indeed not very interesting. But consider the following:

1) If the data is not two-dimensional but higher-dimensional, say 5-dimensional, the clusters can no longer be seen by eye and can only be computed.

2) The X and Y coordinates of a two-dimensional point are really a vector, a mathematical abstraction. Many properties in the real world can be abstracted into vectors, such as age, preferences, or products, and the point of abstracting them into vectors is to let the computer measure the distance between two properties. For example, we consider an 18-year-old closer to a 24-year-old than to a 12-year-old, and shoes closer to clothes than to computers, and so on.

As long as the properties of real-world objects can be abstracted into vectors, the k-means algorithm can be used to classify them.
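As a toy illustration of this abstraction (the customer names and attribute values below are entirely made up), once attributes become vectors the computer can compare them by distance:

```python
# Hypothetical customers described as (age, monthly_spend) vectors.
customers = {
    "alice": (18, 200.0),
    "bob":   (24, 180.0),
    "carol": (12,  20.0),
}

def euclid(a, b):
    # Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# "Similarity" between two customers is now just a distance:
d_alice_bob   = euclid(customers["alice"], customers["bob"])
d_alice_carol = euclid(customers["alice"], customers["carol"])
assert d_alice_bob < d_alice_carol  # alice resembles bob more than carol
```

In practice, attributes measured on different scales should be normalized first, otherwise the attribute with the largest numeric range dominates the distance.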

The article "K-means Clustering" gives a very good application example: the author turned the results of 15 Asian football teams from 2005 to 2010 into vectors, then used k-means to classify the teams, and got the following results:

  • Asia's first tier: Japan, South Korea, Iran, Saudi Arabia
  • Asia's second tier: Uzbekistan, Bahrain, North Korea
  • Asia's third tier: China, Iraq, Qatar, UAE, Thailand, Vietnam, Oman, Indonesia

In fact, there are many other business examples like this: analyzing a company's customers and classifying them so that different business strategies can target different customer groups, or analyzing the similarity of goods on an e-commerce site and classifying them so that different sales strategies can be applied, and so on.

Finally, here is a good slide deck on the algorithm: www.cs.cmu.edu/~guestrin/C…