1. Kmeans

Kmeans (the K-means clustering algorithm) partitions a data set into k clusters, with each cluster described by its centroid (center of mass).

Kmeans steps: (1) Randomly pick K points as initial centroids (seeds); (2) compute the distance from every other point to each of the K seeds and assign each point to the cluster of its nearest seed; (3) recompute the centroid of every cluster, and iterate steps (2)-(3) until the centroids no longer change.
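
In symbols (a standard formulation added here for clarity, not from the original post): with $c_i$ the cluster assigned to sample $x_i$ and $\mu_k$ the centroid of cluster $k$, each iteration alternates

$$c_i = \arg\min_k \lVert x_i - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i$$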

Q: How do I choose k?

A: Plot model performance for different values of k: the x-axis is the number of clusters k, and the y-axis is the sum of distances from each point to its cluster center. Choose the k at the elbow (inflection point), i.e., where the rate of decrease of the distance sum slows down.
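
A minimal sketch of this elbow plot using scikit-learn (assuming a data matrix X such as the one built in section 5; sklearn's inertia_ attribute is exactly the sum of squared distances to the centers):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

# The "elbow" of this curve suggests a reasonable k.
plt.plot(ks, sse, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("sum of squared distances to centers")
plt.show()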


2. Kmeans++

Because Kmeans picks its initial seeds at random, the convergence speed of the algorithm depends heavily on the initial values, so Kmeans++ mainly improves how the initial values are chosen. The initial seeds are selected as follows (a sketch of this rule follows the steps):

1. Randomly select one seed; 2. compute the distance from every other point to the existing seed(s); 3. pick a point with a larger distance as the next seed point (here "larger distance" can be understood as exceeding maxDistance multiplied by a random number, which can be chosen in 0.5-1); 4. repeat steps 2-3 until k seeds have been selected.
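
A minimal sketch of the selection rule described above (the function name and rng seed are illustrative; note that standard kmeans++ instead samples each new seed with probability proportional to the squared distance):

import numpy as np

def kmeanspp_init(X, k, rng=np.random.default_rng(0)):
    # Step 1: pick the first seed at random.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Step 2: each point's distance to its nearest existing seed.
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2), axis=1)
        # Step 3: accept a point whose distance exceeds
        # maxDistance * (a random number in 0.5-1).
        threshold = d.max() * rng.uniform(0.5, 1.0)
        candidates = np.where(d >= threshold)[0]
        centers.append(X[rng.choice(candidates)])
    return np.array(centers)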

3. Evaluate the clustering effect

(1) Silhouette Coefficient (SC)

The Silhouette Coefficient combines the cohesion and separation of a clustering to assess its quality. Its value lies between -1 and 1, and the larger the value, the better the clustering. Calculation steps: (1) Pick a point i and compute the average distance from this point to all other points in the same cluster, denoted a(i); this quantifies within-cluster cohesion. (2) For every other cluster, compute the average distance from point i to that cluster's points, and take the minimum over clusters, denoted b(i); this quantifies between-cluster separation. (3) For sample point i, the silhouette coefficient is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$


a(i): the average dissimilarity of vector i to the other points in the same cluster. b(i): the minimum average dissimilarity of vector i to the points of the other clusters. When the silhouette coefficient is less than 0, the point may have been assigned to the wrong cluster; when it is 0, the point lies on the border between two clusters; when it is close to 1, the point is far from the other clusters. Compute the silhouette coefficient of every point and take the average: that is the overall silhouette coefficient of the current clustering.

Therefore, the silhouette coefficient can also be used to determine the K value of Kmeans: choose the K with the largest overall silhouette coefficient as the final number of clusters.
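
A minimal sketch with scikit-learn (assuming a data matrix X; the range of K values is illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 8):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # K with the largest silhouette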

(2) Sum of squared errors (SSE)

SSE is the sum, over all points, of the squared distance from each point to its cluster center:

$$SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$

A smaller SSE means tighter clusters; this is the quantity plotted on the y-axis in the elbow method of section 1.

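A small helper for computing SSE with the hand-rolled KMeans class from section 5 below (the helper itself is hypothetical, not part of the original code):

import numpy as np

def sse(model, X):
    # Sum of squared distances from each point to its assigned center.
    labels = model.predict(X).astype(int)
    return sum(np.sum((X[labels == ct] - center) ** 2)
               for ct, center in enumerate(model.centers))
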
4. Other issues

(1) Will Kmeans converge?

The objective function of Kmeans is

$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \lVert x_i - \mu_k \rVert^2,$$

where r_{ik} = 1 if sample x_i is assigned to cluster k and 0 otherwise. Kmeans can be viewed as an EM algorithm. E step: estimate the hidden variables, i.e., which cluster each sample belongs to. M step: fix the assignment of the data points and update the parameters (the centroids). Since the EM algorithm converges, Kmeans will eventually converge.

(2) Differences between Kmeans and KNN

KNN:

Classification algorithm; supervised learning; the data set consists of labeled data; no explicit training process (memory-based learning);

Meaning of the K value: to classify a sample x, first find the K data points in the data set nearest to x, then assign x to the category that appears most often among those K neighbors (majority vote).

K-means:

Clustering algorithm; unsupervised learning; the data set is unlabeled, unorganized data; there is an explicit training process;

Meaning of the K value: k is a number set in advance; it relies on human prior knowledge to decide how many clusters the data set is divided into.

Difference:

The fundamental difference between the two algorithms is that K-means is essentially unsupervised learning, while KNN is supervised learning. K-means is a clustering algorithm, and KNN is a classification (or regression) algorithm.

The K-means algorithm divides a data set into clusters so that each cluster is homogeneous: the points within a cluster are close to each other, while the algorithm tries to keep sufficient separation between different clusters. Because it is unsupervised, these clusters carry no labels. The KNN algorithm classifies unlabeled observations based on their k nearest neighbors (k can be any number). It is also known as lazy learning because it involves minimal model training: rather than using the training data to build a generalized model for unseen data, it simply stores the data and defers the work to prediction time.
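
A side-by-side sketch of the two in scikit-learn (the toy arrays are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [8, 8], [8, 9]], dtype=float)
y = np.array([0, 0, 1, 1])  # labels exist only in the supervised setting

# KNN: supervised; "training" is little more than storing (X, y).
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[7, 8]]))  # a class label taken from y

# K-means: unsupervised; cluster indices are invented by the algorithm.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.predict([[7, 8]]))  # a cluster index, not a label from y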

5. Practice

Kmeans Python code:

import numpy as np

class KMeans(object):
    def __init__(self, n_cluster, epochs=10):
        self.n_cluster = n_cluster
        self.epochs = epochs

    def init_centers(self, X):
        # Randomly pick n_cluster samples as the initial centroids.
        idx = np.random.randint(len(X), size=(self.n_cluster,))
        centers = X[idx, :]
        return centers

    def calculate_distance(self, arr1, arr2):
        # L2 (Euclidean) distance between two points.
        distance = np.sqrt(np.sum((arr1 - arr2) ** 2))
        return distance

    def update_centers(self, X):
        # Reassign points, then move each centroid to the mean of its cluster.
        predict_class = self.predict(X)
        centers = self.centers
        for ct in range(len(centers)):
            idx, = np.where(predict_class == ct)
            samples = X[idx, :]
            assert len(samples) > 0  # every cluster must keep at least one point
            centers[ct] = np.mean(samples, axis=0)
        self.centers = centers
        return self.centers

    def fit(self, X, y=None):
        self.centers = self.init_centers(X)
        for epoch in range(self.epochs):
            self.centers = self.update_centers(X)
        return self.centers

    def predict(self, X):
        # Assign each sample to its nearest centroid.
        predict_class = np.zeros(shape=(len(X),))
        centers = self.centers
        for n_sample, arr in enumerate(X):
            min_distance = float("inf")
            p_class = 0
            for ct in range(len(centers)):
                distance = self.calculate_distance(arr, centers[ct])
                if distance < min_distance:
                    min_distance = distance
                    p_class = ct
            predict_class[n_sample] = p_class
        return predict_class

    def score(self, X):
        pass  # not implemented

Example: Define three Gaussian distributions.

np.random.seed(1)
data1 = np.random.normal(1, 0.5, size=(100, 2))
np.random.seed(1)
data2 = np.random.normal(4, 0.8, size=(100, 2))
np.random.seed(1)
data3 = np.random.normal(7, 0.8, size=(100, 2))

X = np.concatenate((data1, data2, data3), axis=0)
print(X.shape)

kmeans = KMeans(3)
centers = kmeans.fit(X)
print(centers)

Visualize the clustering results:

# Generate scatter plot for training data
import matplotlib.pyplot as plt

plt.scatter(X[:,0], X[:,1], marker="o", picker=True)
plt.title('Clusters of data')
plt.xlabel('x1')
plt.ylabel('x2')
plt.scatter(centers[:,0], centers[:,1], marker="v", picker=True)
plt.show()

