Introduction to K-Means

The K-means algorithm is one of the most widely used algorithms in cluster analysis. It divides N objects into K clusters according to their attributes so that the resulting clusters satisfy two requirements: objects in the same cluster have high similarity, while objects in different clusters have low similarity. The clustering process is shown in the following figure:



As shown in the figure, data samples are drawn as dots and the center of each cluster as a cross. (a) At the beginning we have raw, unlabeled data; all points look the same and are drawn in green. (b) Suppose the data set can be divided into two classes, so let K=2, and randomly choose two points as the centers of the two classes. (c)-(f) show two iterations of clustering. First, assign each data sample to the cluster of the nearest center point; after the assignment, update the center of each cluster, that is, average the coordinates of all data points in the cluster. This "assign – update – assign – update" loop continues until the center of each cluster no longer moves.
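To make the "assign – update" loop concrete, here is a minimal sketch of the K-means iteration in Python with NumPy. The function name `kmeans`, the random initialization, and the convergence tolerance are my own choices for illustration, not something fixed by the original post:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means sketch: X is an (N, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # (b) randomly pick k samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # "assign": put each sample into the cluster of the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # "update": move each center to the mean of the samples assigned to it
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # stop once the centers no longer move
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```

Run on two well-separated groups of 2-D points with k=2, this should reproduce the two-cluster picture described above.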

It is a fairly simple algorithm, but there are a few details worth paying attention to. Here I want to talk about how the center of a group of points is found.

In general, the center of a group of points can simply be taken as the average of the points' coordinates in each dimension. We can also define the center under three different distance formulas (written out after the list below):

1. Minkowski Distance: λ can take any value, negative, positive, or infinite.



2. Euclidean Distance: the case of the first formula with λ=2.



3. CityBlock Distance: the case of the first formula with λ=1.
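For reference, the three distances between two n-dimensional points x and y can be written out as follows (these are the standard textbook forms):

```latex
% Minkowski Distance (general form with parameter \lambda)
d_{\lambda}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{\lambda} \right)^{1/\lambda}

% Euclidean Distance: the special case \lambda = 2
d_{2}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}}

% CityBlock Distance: the special case \lambda = 1
d_{1}(x, y) = \sum_{i=1}^{n} |x_i - y_i|
```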



There are some differences in the centers given by the three formulas, as shown in the figure below (for the first formula, λ is between 0 and 1).



(1) Minkowski Distance (2) Euclidean Distance (3) CityBlock Distance

The diagrams above show how each metric contracts toward the center: the first in the shape of a star, the second in concentric circles, and the third in a diamond.

Defects of the K-means algorithm

1. The number of clusters K must be given in advance, but in practice the right value of K is very hard to estimate; most of the time we do not know beforehand how many categories a given data set should be divided into.
2. K-means requires the initial cluster centers to be chosen manually (usually at random), and different initial centers may lead to different clustering results. (This can be addressed with the K-means++ algorithm.)

For the second defect, the K-means++ algorithm can be used. The basic idea of K-means++ seeding is that the initial cluster centers should be as far apart from each other as possible:

1. Select a point at random from the input data set as the first cluster center.
2. For each point x in the data set, compute the distance D(x) between x and the nearest already-selected cluster center.
3. Select a new data point as the next cluster center, using the rule that points with larger D(x) have a higher probability of being chosen.
4. Repeat steps 2 and 3 until k cluster centers have been selected, then run the standard K-means algorithm with these k initial centers.

As can be seen from the description above, the key of the algorithm is the third step: how to turn D(x) into the probability that a point is selected. One way to do it is as follows:

First pick one point at random from the data set as the first seed point. Then, for each point, compute the distance D(x) to the nearest seed point and store it in an array, and add those distances together to get Sum(D(x)). Next, take a random value and use it as a weight to choose the next seed point: draw a value Random that falls within Sum(D(x)), then walk through the array doing Random -= D(x) until Random <= 0; the point at which this happens is the next seed point. Repeat the last two steps until k cluster centers have been selected, then run the standard K-means algorithm with these k initial centers.
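Here is a small sketch of this seeding procedure in Python. The function name `kmeans_pp_init`, the use of 2-D tuples, and the Euclidean distance are my own assumptions for illustration; the returned centers would then be handed to a standard K-means routine such as the one sketched earlier:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """K-means++ seeding sketch: points is a list of (x, y) tuples, k the number of centers."""
    random.seed(seed)
    # pick the first seed point at random
    centers = [random.choice(points)]
    while len(centers) < k:
        # D(x): distance from each point to its nearest already-selected center
        d = [min(((px - cx) ** 2 + (py - cy) ** 2) ** 0.5 for cx, cy in centers)
             for px, py in points]
        total = sum(d)                    # Sum(D(x))
        r = random.uniform(0, total)      # a random value that falls within Sum(D(x))
        for point, dx in zip(points, d):
            r -= dx                       # Random -= D(x)
            if r <= 0:                    # points with larger D(x) are hit more often
                centers.append(point)
                break
        else:
            centers.append(points[-1])    # guard against floating-point rounding
    return centers
```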

You can see that step 3 chooses the new center in such a way that points with a larger distance D(x) are more likely to be selected as cluster centers. The reason is fairly simple, as shown in the figure below:

                                                

Suppose the D(x) values of A, B, C and D are as shown in the figure above. When the algorithm draws the value Sum(D(x))*random, that value falls with high probability into the segment belonging to the point with the larger D(x), so that point is selected as the new cluster center with high probability.

K-means++ code: Rosettacode.org/wiki/K-mean…

Introduction to KNN (K-Nearest Neighbor)

Algorithm idea: a sample belongs to a category if most of its k most similar samples (that is, its nearest neighbors in the feature space) belong to that category. The method determines the classification of a sample only from the categories of its nearest one or several samples.

Take a look at the picture below:



The algorithm process of KNN is as follows:

As can be seen from the figure, the data set is already labeled: one class is the blue squares, the other is the red triangles, and the green circle is the data point to be classified.

If K=3, the nearest neighbors of the green point are two red triangles and one blue square; these three points vote, so the green point belongs to the red triangle class.

If K=5, the nearest neighbors of the green point are two red triangles and three blue squares; these five points vote, so the green point to be classified belongs to the blue square class.
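A minimal sketch of this voting procedure in Python, using Euclidean distance and made-up coordinates that mimic the situation in the figure (the function name `knn_classify` and the sample points are my own assumptions):

```python
from collections import Counter

def knn_classify(query, samples, labels, k=3):
    """samples: list of (x, y) points; labels: their classes; query: the point to classify."""
    # distance from the query point to every labeled sample
    dists = [((sx - query[0]) ** 2 + (sy - query[1]) ** 2) ** 0.5 for sx, sy in samples]
    # take the k nearest neighbors and let them vote
    nearest = sorted(range(len(samples)), key=lambda i: dists[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# two red triangles near the green query point, three blue squares farther away
samples = [(1.0, 1.1), (1.2, 0.9), (3.0, 3.2), (2.9, 3.1), (3.1, 2.8)]
labels = ["red triangle", "red triangle", "blue square", "blue square", "blue square"]
print(knn_classify((1.5, 1.5), samples, labels, k=3))  # -> "red triangle"
print(knn_classify((1.5, 1.5), samples, labels, k=5))  # -> "blue square"
```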

We can see that KNN is essentially a statistical method over the data. In fact, many machine learning algorithms are based on statistics.

KNN is a kind of memory-based learning, also called instance-based learning, and belongs to lazy learning. That is, it has no explicit training phase: once the program starts and the data set is loaded into memory, classification can begin without any training.

Specifically, every time an unknown sample point arrives, we find the K nearest points around it and let them vote.

As another example, locally weighted regression (LWR) is also a memory-based approach. Consider the following data set.



No single straight line can fit this data set, because the data set does not look like a line. But within a local range, the data points can be regarded as lying on a straight line. Every time a query sample x arrives, we take it as the center on the x-axis, find several points to its left and right, perform linear regression on those sample points to obtain a local line, and then substitute x into this line to compute the corresponding y, completing one prediction. In other words, a local line is trained for every query point: train once, use once. LWR and KNN are similar in that both are tailored to the query point and trained locally.
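As a rough sketch of this "train a local line for each query" idea, the following Python snippet fits a line through the few samples nearest to the query point and evaluates it there. The function name `local_linear_predict`, the neighborhood size m, and the use of an unweighted least-squares fit are my own simplifications; classical LWR instead weights all points with a kernel centered on the query:

```python
import numpy as np

def local_linear_predict(x_query, xs, ys, m=5):
    """Fit a line y = a*x + b through the m samples nearest to x_query, then evaluate it."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    # take the m samples closest to the query position on the x-axis
    idx = np.argsort(np.abs(xs - x_query))[:m]
    # ordinary least squares on just those local points
    A = np.vstack([xs[idx], np.ones(m)]).T
    a, b = np.linalg.lstsq(A, ys[idx], rcond=None)[0]
    # substitute the query point into the local line
    return a * x_query + b

# example: a sine-shaped data set that no single straight line can fit
xs = np.linspace(0, 6, 50)
ys = np.sin(xs) + np.random.normal(scale=0.1, size=50)
print(local_linear_predict(2.0, xs, ys))  # close to sin(2.0) ≈ 0.91
```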

The difference between KNN and K-means

Although both algorithms have a parameter K and both rely on finding the nearest points, they solve different problems. KNN is a classification algorithm: it works on labeled data, has no training phase (lazy learning), and its K is the number of neighbors that vote on the class of a new sample. K-means is a clustering algorithm: it works on unlabeled data, its K is the number of clusters, and it repeats the "assign – update" loop until the cluster centers stop moving.
