Data mining algorithms: the K-means algorithm (a Python implementation)

Introduction

K-means, also known as the k-means clustering algorithm, is a clustering algorithm used in unsupervised learning.

 

The basic idea

The K-means algorithm is relatively simple. In k-means, each cluster is represented by its centroid, and it is easy to show that convergence of k-means is equivalent to none of the centroids changing. The basic k-means flow is as follows (a minimal sketch follows the list):

Select k initial centroids (as initial clusters, each containing only that one point);

Repeat:

For each sample point, find the nearest centroid and assign the point to that centroid's cluster;

Recompute the centroid of each of the k clusters (the centroid is the mean of the sample points in the cluster);

Until the centroids stop changing
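A minimal sketch of this loop in plain NumPy may make the flow concrete; the names (kmeans_sketch, X) are illustrative, and it assumes no cluster ever ends up empty:

import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initial centroids: k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped changing
            break
        centroids = new_centroids
    return centroids, labels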

 

The number of repetitions determines the number of iterations of the algorithm. In essence, K-means minimizes an objective function: the sum of squared distances from each point to the centroid of its own cluster:

J = sum_{i=1}^{N} || x_i - c_{j(i)} ||^2

where N is the number of elements, x_i is the i-th element, c_j is the centroid of the j-th cluster, and j(i) is the cluster to which x_i is assigned.
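Given an assignment, this objective is straightforward to evaluate; a small sketch, reusing the illustrative X, labels, and centroids from above:

def sse(X, labels, centroids):
    # sum of squared distances from each point to its own cluster's centroid
    return float(((X - centroids[labels]) ** 2).sum())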

 

Algorithm complexity

The time complexity is O(NKT), where N is the number of elements, K is the number of clusters, and T is the number of iterations of the algorithm.

 

Advantages and disadvantages

Advantages

Simple and fast;

Efficient and scalable on large data sets;

The time complexity is nearly linear in the number of samples, which makes it suitable for mining large-scale data sets.

Disadvantages

K-means converges only to a local optimum, so it is sensitive to the choice of initial centroids;

Choosing the value of k that optimizes the objective function is difficult (see the sketch after this list).
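Both weaknesses are commonly mitigated in practice: rerun the algorithm from several random initializations and keep the lowest-objective result, and sweep k while watching how the objective falls (the "elbow" heuristic). A sketch built on the illustrative kmeans_sketch and sse helpers above:

def best_of_restarts(X, k, n_restarts=10):
    # rerun k-means from different seeds; keep the run with the lowest objective
    best = None
    for seed in range(n_restarts):
        centroids, labels = kmeans_sketch(X, k, seed=seed)
        score = sse(X, labels, centroids)
        if best is None or score < best[0]:
            best = (score, centroids, labels)
    return best

# elbow heuristic: plot the objective against k and look for the bend
# scores = [best_of_restarts(X, k)[0] for k in range(1, 10)]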

 

Code

# coding:utf-8
import numpy as np
import matplotlib.pyplot as plt


def loadDataSet(fileName):
    # read one point per line, coordinates separated by tabs
    dataList = []
    with open(fileName) as fr:
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))
            dataList.append(fltLine)
    return dataList


def randCent(dataSet, k):
    n = np.shape(dataSet)[1]  # n is the dimension of the data
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        # draw each centroid coordinate uniformly within the data range
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = np.mat(minJ + rangeJ * np.random.rand(k, 1))
    return centroids


def kMeans(dataSet, k):
    """The k-means algorithm."""
    m = np.shape(dataSet)[0]  # m is the number of samples
    # column 0: assigned cluster index; column 1: squared distance to its centroid
    clusterAssment = np.mat(np.zeros((m, 2)))
    centroids = randCent(dataSet, k)
    clusterChanged = True
    iterIndex = 1
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            # assign sample i to the nearest centroid
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = np.linalg.norm(np.array(centroids[j, :]) - np.array(dataSet[i, :]))
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print("k-means iteration %d (k=%d)" % (iterIndex, k))
        iterIndex += 1
        for cent in range(k):
            # get all the points in this cluster and move the centroid to their mean
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment


def showCluster(dataSet, k, centroids, clusterAssment):
    """Plot the clustered samples and the centroids (2-D data only)."""
    numSamples, dim = dataSet.shape
    if dim != 2:
        return 1
    mark = ['or', 'ob', 'og', 'ok', 'oy', 'om', 'oc', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    # draw all samples, colored by cluster
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Pr', 'Pb', 'Pg', 'Pk', 'Py', 'Pm', 'Pc', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()


if __name__ == '__main__':
    dataMat = np.mat(loadDataSet('./data.txt'))
    k = 4  # number of clusters (illustrative choice; set it to fit your data)
    centroids, clusterAssment = kMeans(dataMat, k)
    showCluster(dataMat, k, centroids, clusterAssment)
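For reference, loadDataSet expects ./data.txt to hold one point per line with tab-separated float coordinates. If you need a toy file to try the script, something like the following works (the four cluster centers are made-up values):

import numpy as np

# write 80 random 2-D points around four made-up centers, tab-separated
rng = np.random.default_rng(42)
pts = np.vstack([c + 0.5 * rng.standard_normal((20, 2))
                 for c in ([1, 1], [1, 4], [4, 1], [4, 4])])
np.savetxt('./data.txt', pts, delimiter='\t', fmt='%.4f')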