Summary of clustering algorithms

1. Classification of clustering methods

Clustering algorithms can be roughly divided into partition-based methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, and so on. In recent years, quantum clustering, spectral clustering, granularity-based clustering, probabilistic graph clustering, and synchronization clustering have also become popular.

  • 1.1 Partition-based Clustering Algorithms. Partition-based clustering optimizes an objective function through an iterative process. When the optimization reaches a local or global minimum of the objective function, the data set is divided into disjoint subsets, and each subset is regarded as a cluster. Most partition-based clustering algorithms are very efficient, but they require the number of clusters, which is often difficult to determine, to be set before the analysis. The K-means algorithm [5] and the FCM (Fuzzy C-Means) algorithm are the two most famous algorithms of this type. In addition, the k-medoids algorithms PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications), the k-modes algorithm (for categorical data), and the k-prototypes algorithm (for mixed data) [25] also belong to this category. Partition-based clustering has the advantages of easy implementation and fast clustering speed, with time complexity linear in the number of data points N, the data dimension D, and the preset number of clusters K. Its disadvantages are that optimizing the objective function is NP-hard, searching for the global minimum is costly, and the iteration easily falls into a local minimum. (A minimal K-means sketch is given after this list.)
  • 1.2 Hierarchical Clustering Algorithms. Hierarchical clustering takes a distance matrix as input and produces a cluster hierarchy diagram reflecting the distribution of the data set. Hierarchical algorithms come in two types. The first is agglomerative hierarchical clustering, which initially treats each data point as a cluster and then builds a hierarchical tree bottom-up by repeatedly merging the nearest pair of clusters. The second is divisive hierarchical clustering, which initially treats all data points as one cluster and then builds the tree top-down by repeatedly splitting the loosest cluster. Although the time cost of hierarchical clustering is higher than that of partition clustering, most hierarchical algorithms do not require the number of clusters, a parameter that is difficult to determine, to be set in advance, and they yield a multi-level clustering structure at multiple granularities; this is their biggest advantage over partition clustering.
  • 1.3 Density-based Clustering Algorithms. Density-based clustering separates high-density regions by sparse regions to find distinct clusters and isolated points, and is mainly used for spatial data clustering. The representative algorithm is DBSCAN.
  • 1.4 Grid-based Clustering Algorithms. Grid-based clustering is a multi-resolution method. It first divides the distribution space of the data set into regular cells (such as hyper-rectangles) or flexible cells (such as arbitrarily shaped polyhedra), and then obtains distinct clusters by merging connected cells that contain data. Its advantage is that the processing time is independent of the number of data points and of their input order, and it can handle any type of data. Its disadvantage is that the processing time depends on the number of cells into which each dimension is divided, which reduces clustering quality and accuracy to a certain extent. The representative algorithm is STING (STatistical INformation Grid).
  • 1.5 Model-based Clustering Algorithms. Model-based clustering uses statistical models to obtain the cluster distribution of the data set, assuming that the data are generated by a finite mixture of probability distributions. The multivariate Gaussian mixture model is the most widely used. COBWEB is a common and simple incremental conceptual clustering method of this kind, which represents hierarchical clustering results as a classification tree.
  • 1.6 Graph-based Clustering Algorithms. When graph clustering is used for cluster analysis, the first step is to build a graph suited to the specific problem: the nodes represent the basic units of the analyzed data, and the edges represent similarity (or dissimilarity) measures between those units. Typically, a metric is defined between each pair of basic units so as to preserve the local distribution properties of the data set. Graph clustering takes the local connectivity of the data set as its main source of clustering information, so it handles local data characteristics easily. The Chameleon algorithm proposed by Karypis et al. can also be regarded as a graph clustering algorithm.
  • 1.7 Recently Popular Clustering Algorithms. Quantum clustering borrows from quantum theory: it first builds a scale-dependent probability function from the source data, then applies analytic operations to obtain a potential function whose minima determine the cluster centers, and finally searches for the cluster structure by adjusting the scale parameter. Spectral clustering computes eigenvalues of a similarity matrix of the source data to discover distinct cluster regions; many spectral algorithms are easy to implement, outperform traditional algorithms such as K-means, and have been applied successfully in many settings, for example the Shi-Malik algorithm for image segmentation. Granularity-based clustering is a new research direction in information granularity; research on this method is not yet mature, and work on the computational semantics of granularity is especially scarce. Probabilistic graph clustering has been popular in recent years; its most famous representative is the AP (Affinity Propagation) algorithm published in Science in 2007. The Synchronization Clustering (SynC) algorithm can discover the internal structure of a data set through a dynamic synchronization process without knowing anything about its distribution, handles outliers well, and achieves automatic clustering using the minimum description length principle. Since grid-based algorithms are mainly used for spatial data while power load curves are time series, this paper focuses on the other types of clustering algorithms.
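
As a concrete illustration of the partition-based approach referenced in item 1.1, here is a minimal K-means sketch in Python. It is a sketch only, assuming a NumPy data matrix X of shape (N, D) and a preset cluster count k; the function and variable names are illustrative, not from the original text.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: X is an (N, D) array, k is the preset cluster count."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its cluster.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged, possibly to a local minimum of the SSE
        centers = new_centers
    return labels, centers
```

Each iteration can only decrease the sum of squared errors, which is why the method converges quickly but may stop at a local minimum, as noted above.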

2. Clustering methods for power load curves

There are many clustering methods for power load curves. Popular ones include K-means clustering, wavelet analysis, the fuzzy C-means (FCM) algorithm, ensemble clustering, self-organizing map (SOM) neural networks, extreme learning machines (ELM), and the cloud model, as well as various improved algorithms built on these.

  • 1.1 Cluster Effectiveness. Cluster-validity research evaluates clustering quality and determines the optimal number of clusters by establishing validity indices. Typical indices include the sum of squared errors (SSE), the Calinski-Harabasz index (CHI), and the Davies-Bouldin index (DBI). 1) SSE index. The SSE index $I_{SSE}$ is the sum of squared Euclidean distances between each object and the center of the cluster it belongs to, namely

$$I_{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,$$

where $C_i$ is the $i$-th cluster and $\mu_i$ is its center. (A sketch that computes all three indices follows this item.)
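
The three indices can be computed with scikit-learn, as in the following hedged sketch (the data matrix X and the range of candidate cluster counts are assumed inputs; the helper name is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def validity_curves(X, k_range=range(2, 11)):
    """SSE, CHI, and DBI for each candidate number of clusters."""
    results = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse = km.inertia_  # sum of squared distances to the nearest center
        chi = calinski_harabasz_score(X, km.labels_)
        dbi = davies_bouldin_score(X, km.labels_)
        results[k] = (sse, chi, dbi)
    return results
```

Plotting the three curves against k and looking for the SSE elbow, the CHI maximum, and the DBI minimum is the usual way to read off the optimal cluster number.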

  • 1.2 Index Analysis

  • 2. Comparison of classical clustering algorithms. Table 2 of the original paper summarizes the above algorithms (the table is not reproduced here).

As can be seen from Fig. 3 of the original paper, hierarchical clustering performs well and with low volatility. In addition, both algorithms have the advantage of requiring few input parameters. This paper therefore proposes an ensemble clustering algorithm for load curves that combines the high efficiency of partition clustering with the high accuracy of hierarchical clustering.

  • The basic idea of the ensemble clustering algorithm (EnsClust) is to repeatedly perform partition clustering on multiple subsets of the original data set and then combine the resulting cluster centers with hierarchical clustering. The algorithm consists of three steps: bootstrap resampling, partition clustering, and hierarchical clustering. First, multiple sample sets are drawn from the original data set by bootstrap resampling. Then, partition clustering reduces the data size of each sample set. Finally, hierarchical clustering combines the partition results. 3.1 The algorithm flow is as follows (a sketch is given after this list). 1) Bootstrap resampling. Bootstrap resampling draws a random sample of size N, with replacement, from the original data set of size N, so that each object has an equal probability of being selected in each draw. Partition clustering has poor stability and may converge to a local optimum; small changes in the initial cluster centers or in the data set can lead to different clustering results. Clustering over multiple bootstrap sample sets reduces the influence of random factors in the initial cluster centers and of outliers in the original data set, thus improving the stability of the algorithm. 2) Partition clustering. In the ensemble algorithm, partition clustering can be regarded as a preprocessing step for hierarchical clustering. Hierarchically clustering the original data set directly would require computing the Euclidean distance between all pairs of objects and storing the distance matrix, which is expensive in both time and space. Partition clustering is highly efficient, and its output cluster centers characterize the structure of the original data set while greatly reducing its size. Compared with random sampling, the cluster centers carry more information about the original data set and suppress the influence of noise points and outliers. 3) Hierarchical clustering. Here hierarchical clustering combines the cluster centers output by partition clustering on the sample sets. Because partition clustering easily falls into local optima and is strongly affected by the initial cluster centers, the more stable, higher-quality hierarchical clustering algorithm is used to merge these centers and produce the final clustering result.
  • 3.2 Algorithm analysis and experimental results
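
Under the assumption of a scikit-learn environment, the three-step EnsClust scheme described above might be sketched as follows (sample counts, cluster counts, and names are illustrative placeholders, not values from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def ens_clust(X, k_final, n_boot=10, k_partition=50, seed=0):
    """Ensemble clustering: bootstrap -> K-means -> hierarchical merge."""
    rng = np.random.default_rng(seed)
    n = len(X)  # assumed much larger than k_partition
    centers = []
    for b in range(n_boot):
        # 1) Bootstrap resampling: draw n objects with replacement.
        sample = X[rng.choice(n, size=n, replace=True)]
        # 2) Partition clustering reduces each sample to k_partition centers.
        km = KMeans(n_clusters=k_partition, n_init=5, random_state=b).fit(sample)
        centers.append(km.cluster_centers_)
    centers = np.vstack(centers)
    # 3) Hierarchical clustering merges the collected centers.
    hc = AgglomerativeClustering(n_clusters=k_final).fit(centers)
    # Label each original object via its nearest center's final cluster.
    nearest = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                             axis=2).argmin(axis=1)
    return hc.labels_[nearest]
```

Because hierarchical clustering only sees n_boot * k_partition centers rather than all N objects, the expensive distance matrix stays small, which is exactly the efficiency argument made above.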

Based on the above analysis, the ensemble clustering algorithm not only has higher computational efficiency but also better clustering quality than the traditional partition and hierarchical clustering algorithms.

  • 4.1 Load Curve Dimension Reduction. When the data set is large, the computation time of the ensemble clustering algorithm is still relatively long, so dimension reduction is needed to further improve clustering efficiency. Dimension reduction represents the load curves with lower-dimensional vectors; it reduces the storage space of the data on the one hand and, on the other, the time needed to compute Euclidean distances between vectors, improving the efficiency of the algorithm. Common dimension-reduction algorithms include Sammon mapping, self-organizing maps, and principal component analysis. 4.2 Classification of Dimension-Reduction Algorithms. 1) Sammon mapping. Sammon mapping (SM) is a nonlinear mapping that aims to minimize the error (stress) function

$$E = \frac{1}{\sum_{i<j} d_{ij}^{*}} \sum_{i<j} \frac{\left(d_{ij}^{*} - d_{ij}\right)^{2}}{d_{ij}^{*}},$$

where $d_{ij}^{*}$ is the distance between objects $i$ and $j$ in the original space and $d_{ij}$ is the distance between their images in the low-dimensional space. (A small function computing this stress is sketched below.)
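
As a minimal sketch of the stress formula above (the standard Sammon stress, since the original equation was elided; the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, X_low):
    """Sammon's error E between original and mapped pairwise distances.

    Assumes X_high contains no duplicate rows, so every original
    pairwise distance d*_ij is strictly positive.
    """
    d_star = pdist(X_high)  # pairwise distances in the original space
    d = pdist(X_low)        # pairwise distances in the mapped space
    return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)
```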

The DTW algorithm. Dynamic Time Warping (DTW) is a method for measuring the similarity of two time series of different lengths. It is widely applied, mainly in template matching, such as isolated-word speech recognition (deciding whether two segments of speech represent the same word), gesture recognition, data mining, and information retrieval. The two time series to be compared may differ in length, which in speech recognition shows up as different speaking speeds. Speech signals are quite random: even when the same person utters the same sound at different times, the durations will not match exactly, and different phonemes within the same word are pronounced at different speeds; some speakers draw out the “A” sound, while others shorten the “I” sound. In such complex cases, the distance (or similarity) between two time series cannot be computed effectively with the traditional Euclidean distance. (A minimal DTW implementation is sketched after the path constraints below.)
First of all, this path is not chosen arbitrarily; it must satisfy the following constraints. 1) Boundary conditions: $w_1 = (1, 1)$ and $w_K = (m, n)$. The speed of speech may vary, but the order of its parts cannot change, so the path must start at the lower-left corner and end at the upper-right corner. 2) Continuity: if $w_{k-1} = (a', b')$, then the next point of the path $w_k = (a, b)$ must satisfy $a - a' \le 1$ and $b - b' \le 1$. That is, the path cannot skip over a point when matching; it can only align with adjacent points. This guarantees that every coordinate of Q and C appears in W. 3) Monotonicity: if $w_{k-1} = (a', b')$, then the next point $w_k = (a, b)$ must satisfy $0 \le a - a'$ and $0 \le b - b'$. This requires the points on W to be monotonically ordered in time, which ensures that the matching lines (the dotted lines in figure B of the original) do not cross. Combining the continuity and monotonicity constraints, each grid point has only three possible successors: if the path has passed through grid point $(i, j)$, the next point can only be $(i+1, j)$, $(i, j+1)$, or $(i+1, j+1)$. The sketch below follows exactly these constraints.
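
A minimal DTW sketch in Python that respects the boundary, continuity, and monotonicity constraints above (the inputs q and c are assumed to be 1-D numeric sequences; names are illustrative):

```python
import numpy as np

def dtw_distance(q, c):
    """Cost of the optimal warping path between sequences q and c."""
    m, n = len(q), len(c)
    D = np.full((m + 1, n + 1), np.inf)  # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(q[i - 1] - c[j - 1])  # local distance d(q_i, c_j)
            # Continuity + monotonicity leave only three predecessors.
            D[i, j] = cost + min(D[i - 1, j],      # from (i-1, j)
                                 D[i, j - 1],      # from (i, j-1)
                                 D[i - 1, j - 1])  # from (i-1, j-1)
    return D[m, n]  # boundary condition: the path ends at (m, n)
```

The double loop makes the classic algorithm O(mn); the constrained variant discussed in the conclusion below narrows the inner loop to reduce this cost.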
3. Conclusion

In this paper, an improved DTW algorithm based on constrained scaling is proposed. A search-range limit is introduced to regulate the degree to which load sequences may be scaled, avoiding pathological matches caused by the periodic variation of the curves and reducing the average computational complexity to the O(N) level. By additionally introducing sequence translation, the accuracy of the algorithm is improved for similar users whose peak times differ. (A common way to impose such a search-range limit is sketched below.)
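
The paper's exact constrained-scaling scheme is not reproduced here; as a hedged illustration, a Sakoe-Chiba-style band of width w (a hypothetical parameter) limits the DTW search range like this:

```python
import numpy as np

def dtw_windowed(q, c, w):
    """DTW restricted to the band |i - j| <= w.

    With a fixed small w, each row touches O(w) cells, so the total
    work is O(N * w): linear in the sequence length for constant w.
    """
    m, n = len(q), len(c)
    w = max(w, abs(m - n))  # the band must still reach the corner (m, n)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = abs(q[i - 1] - c[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```

The band also caps how far one sequence can be stretched against the other, which is the sense in which a search-range limit standardizes the scaling degree.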

The Canopy algorithm

One shortcoming of the K-means and k-medoids algorithms is that the number of clusters K must be specified in advance, and a reasonable K is not easy to set in many practical applications. In such cases, the Canopy algorithm can be used to estimate the cluster number K and the initial cluster centers. Canopy is a clustering algorithm that roughly partitions the objects according to two distance thresholds T1 and T2. (The original figure illustrating a typical Canopy clustering process is not reproduced here; a rough sketch of the procedure follows.)
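
A hedged sketch of the basic Canopy procedure, assuming Euclidean distance and thresholds T1 > T2 (names and structure are illustrative):

```python
import numpy as np

def canopy_centers(X, t1, t2):
    """Rough Canopy pass over X; returns the canopy centers.

    Points within T1 of a center belong to that canopy (canopies may
    overlap); points within T2 are bound tightly and leave the pool.
    The number of centers estimates K, and the centers themselves can
    seed a subsequent K-means or k-medoids run.
    """
    assert t1 > t2, "T1 must be larger than T2"
    candidates = list(range(len(X)))
    centers = []
    while candidates:
        center = X[candidates[0]]  # pick an arbitrary remaining point
        centers.append(center)
        dists = np.linalg.norm(X[candidates] - center, axis=1)
        # Drop every point within T2 (including the picked point itself).
        candidates = [p for p, d in zip(candidates, dists) if d > t2]
    return np.array(centers)
```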

Personal ideas

The results are summarized as follows: 1) SSE, CHI, and DBI can all find the optimal number of clusters correctly; the DBI index is easy to compute and its curve is intuitive, so it is well suited as a validity index for load curve clustering. 2) Before running the clustering algorithm, the Canopy algorithm can be used to pre-classify the data, laying the foundation for a more accurate subsequent clustering. 3) The ensemble clustering algorithm combines the high efficiency of partition clustering with the high accuracy of hierarchical clustering: its efficiency is close to that of a partition algorithm, while its quality is superior to both partition and hierarchical clustering. For electricity-load applications, k-medoids is the more reasonable partition method. 4) For massive high-dimensional load curves, principal component analysis can be used to reduce the dimensionality of the data set; on that basis, the ensemble clustering algorithm is applied and the DBI validity index is used to evaluate the clustering results, as sketched below.
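
A closing sketch of that recommended pipeline (PCA, then the ensemble scheme, then DBI), reusing the illustrative ens_clust function from above; the parameter values are placeholders, not results from the paper:

```python
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score

def cluster_load_curves(X, k_final=6, n_components=10):
    """PCA -> ensemble clustering -> DBI evaluation for load curves X."""
    X_low = PCA(n_components=n_components).fit_transform(X)
    labels = ens_clust(X_low, k_final)         # sketched in section 3 above
    dbi = davies_bouldin_score(X_low, labels)  # lower DBI is better
    return labels, dbi
```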