1. Introduction to clustering algorithms

The goal of clustering is to make objects within the same cluster as similar as possible, and objects in different clusters as dissimilar as possible. Many clustering methods exist at present. By basic idea, clustering algorithms can be roughly divided into five categories: hierarchical clustering algorithms, partition-based clustering algorithms, constraint-based clustering algorithms, clustering algorithms in machine learning, and clustering algorithms for high-dimensional data. The implementation below mainly uses the partition-based Kmeans algorithm and the density-based DBSCAN algorithm.

1.1 Kmeans algorithm based on partition

A typical partition-based clustering algorithm. It represents each cluster by the cluster's center, which means the center chosen during the iterations is not necessarily a point belonging to the cluster. Its goal is to minimize the sum of squared errors (SSE) between the data points in each of the k clusters and the centroid of that cluster; SSE is also the evaluation criterion for the final clustering result of the k-means algorithm. The detailed principle of the algorithm can be found on Google or Wiki.
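For reference, a minimal sketch of the SSE criterion (the function name and arguments are illustrative, not part of the experiment code):

import numpy

def sse(points, labels, centroids):
    # Sum of squared Euclidean distances from each point to the centroid of its assigned cluster.
    total = 0.0
    for p, c in zip(points, labels):
        diff = numpy.asarray(p) - numpy.asarray(centroids[c])
        total += numpy.dot(diff, diff)
    return total

K-means alternates between assigning each point to its nearest centroid and recomputing the centroids, and each step never increases this quantity.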

1.2 DBSCAN algorithm based on density

A typical density-based clustering algorithm. It uses spatial index techniques to search the neighborhood of objects and introduces concepts such as "core object" and "density-reachable": starting from a core object, all objects density-reachable from it form one cluster. Simply put, it is an algorithm that grows clusters according to the density of objects, where the density of an object O can be determined by the number of objects near O. The DBSCAN algorithm takes two parameters, the radius Eps and the minimum count MinPts, and divides data points into the following three categories: core points, which have more than MinPts points within radius Eps; border points, which have fewer than MinPts points within radius Eps but fall within the neighborhood of a core point; and noise points, which are neither core points nor border points.
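A minimal sketch of this three-way classification, using brute-force neighborhood counting in place of a spatial index (all names here are illustrative):

import numpy

def classify(points, eps, min_pts):
    # Label each point 'core', 'border' or 'noise' per the DBSCAN definitions.
    pts = numpy.asarray(points, dtype=float)
    dist = numpy.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    within_eps = dist <= eps
    is_core = within_eps.sum(axis=1) >= min_pts  # neighborhood count, incl. the point itself
    labels = []
    for i in range(len(pts)):
        if is_core[i]:
            labels.append('core')
        elif is_core[within_eps[i]].any():       # within Eps of at least one core point
            labels.append('border')
        else:
            labels.append('noise')
    return labels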

2. Clustering implementation of user geographic location information

This experiment is implemented in Python and relies on numpy, pandas, sklearn, scipy and other scientific libraries. The data is the user's collected geographic location information, a series of longitude and latitude pairs.

import numpy

xy = numpy.array([[116.456065, 39.920965], [116.452312, 39.92304], [116.421385, 39.989539],
                  [116.455685, 39.92069], [116.455876, 39.920845], [116.455973, 39.920902],
                  [116.455645, 39.920657], [116.456022, 39.920934], [116.455685, 39.920691],
                  [116.456023, 39.920671], [116.45596, 39.920864], [116.455522, 39.920856],
                  [116.455276, 39.920407], [116.455799, 39.920867], [116.455349, 39.920425],
                  [116.45511, 39.920377], [116.455318, 39.920442], [116.455298, 39.920474],
                  [116.455839, 39.920636], [116.455979, 39.921168], [116.454281, 39.920006],
                  [116.45598, 39.920612], [116.45388, 39.919584], [116.455474, 39.920737],
                  [116.456009, 39.920641], [116.455439, 39.920574], [116.455759, 39.920841],
                  [116.455838, 39.920644], [116.455983, 39.920847], [116.459803, 39.922041],
                  [116.456029, 39.92088], [116.455539, 39.920603], [116.455989, 39.920851],
                  [116.455719, 39.920789], [116.45601, 39.92082], [116.456229, 39.920564],
                  [116.455906, 39.920771], [116.456248, 39.920868], [116.455805, 39.920544],
                  [116.455896, 39.920758], [116.43692, 39.926767], [116.454672, 39.92024],
                  [116.454813, 39.917848], [116.381415, 40.00875], [116.422925, 39.980757],
                  [116.422849, 39.9808], [116.38107, 40.009217], [116.456078, 39.920747],
                  [116.455242, 39.919515], [116.455615, 39.920533], [116.422092, 39.991104],
                  [116.454847, 39.917724], [116.456686, 39.924316], [116.45575, 39.920642],
                  [116.456713, 39.924413], [116.455846, 39.920828], [116.422108, 39.991098],
                  [118.775572, 31.97337], [118.776968, 31.97392], [118.778187, 31.973121],
                  [118.775695, 31.973254], [118.775302, 31.973807], [118.776303, 31.973692],
                  [118.777541, 31.973439], [118.776196, 31.973489], [116.448944, 39.926799],
                  [116.45487, 39.917804], [116.455762, 39.920645], [116.456146, 39.920441],
                  [116.455857, 39.920043], [116.455458, 39.920826], [116.455533, 39.920791],
                  [116.455426, 39.920896], [116.45566, 39.920811], [116.455696, 39.920621],
                  [116.453667, 39.9259], [116.466606, 39.886322], [116.455917, 39.92062]])

2.1 Clustering implementation based on Kmeans

Assume that the user's geographic locations are usually the workplace and home, so choose a k value of 2. The code is as follows:

from scipy.cluster.vq import kmeans2

# z is a third feature derived from the latitude (as in the scipy kmeans2 docs example)
z = numpy.sin(xy[:, 1] - 0.2 * xy[:, 1] ** 2)
res, idx = kmeans2(numpy.array(zip(xy[:, 0], xy[:, 1], z)), 2, iter=20, minit='points')

The output is as follows:



But in fact the user never appeared in Hebei; besides the workplace and home in Beijing, the user also spent some time in Nanjing on a business trip. So set k to 3 and run again:



res, idx = kmeans2(numpy.array(zip(xy[:, 0], xy[:, 1], z)), 3, iter=20, minit='points')

The output



This geographically separates out Nanjing. The workplace and the business-trip location now match well, but the home location is still far from the actual one.

In fact, it can already be seen that the user's whereabouts are unpredictable, so it is hard to determine k. Moreover, the result produced by Kmeans is the centroid location of each aggregated cluster, not an actual geographic location of the user. And I chose Euclidean distance rather than the spherical distance computed from longitude and latitude. Therefore, the results are not ideal.
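To illustrate the distance issue, a small self-contained sketch comparing the Euclidean distance between raw longitude/latitude values with the spherical (haversine) distance, for one Beijing point and one Nanjing point from the data set above:

from math import asin, cos, radians, sin, sqrt

def spherical_km(lng1, lat1, lng2, lat2):
    # Haversine great-circle distance in kilometres.
    lng1, lat1, lng2, lat2 = map(radians, [lng1, lat1, lng2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

beijing = (116.456065, 39.920965)
nanjing = (118.775572, 31.97337)

print(sqrt((beijing[0] - nanjing[0]) ** 2 + (beijing[1] - nanjing[1]) ** 2))  # ~8.3 "degrees", not a physical distance
print(spherical_km(beijing[0], beijing[1], nanjing[0], nanjing[1]))           # ~908 km actual distance

Because a degree of longitude shrinks as latitude increases, Euclidean distance on raw coordinates distorts geographic proximity.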

2.2 Cluster implementation based on DBSCAN

The DBSCAN algorithm depends on the chosen aggregation radius parameter and the number MinPts that must be specified. Here, spherical distance is used to measure geographic distance and serves as the radius parameter. In the following experiment, 2 km is selected as the density aggregation radius, and MinPts is 5.

from math import radians, sin, cos, asin, sqrt

import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

def haversine(lonlat1, lonlat2):
    # Spherical (great-circle) distance between two (lat, lng) points, in kilometres.
    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # mean Earth radius in km
    return c * r

def clustering_by_dbscan():
    ......
    distance_matrix = squareform(pdist(X, (lambda u, v: haversine(u, v))))
    db = DBSCAN(eps=2, min_samples=5, metric='precomputed')  # eps is in km thanks to haversine
    y_db = db.fit_predict(distance_matrix)
    X['cluster'] = y_db
    ......
    plt.scatter(X['lat'], X['lng'], c=X['cluster'])
    plt.show()
The output is as follows:

The result shows that the user's geographic location data is aggregated into 4 clusters, labeled 0.0, 1.0, 2.0 and -1.0. It can be seen that DBSCAN, given only the 2 km activity-radius parameter, divides the user's location data set into four clusters according to density, and each cluster can take an arbitrary spatial shape, so the partitioning works well. However, each resulting cluster is just a collection of individual geographic points, not a "center". And the noise points are not separated out.
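For reference, a minimal sketch of how the labels produced above can be inspected, assuming X['cluster'] has been filled in as in clustering_by_dbscan (in scikit-learn, -1 is the label DBSCAN assigns to noise points):

# Size of each DBSCAN cluster; the -1 group holds the noise points.
print(X.groupby('cluster').size())

# Members of one cluster are rows of the original data set, not a center.
print(X[X['cluster'] == 0][['lat', 'lng']])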

3. Hybrid algorithm implementation based on DBSCAN and Kmeans

According to the above experimental results, the key to the Kmeans algorithm is the choice of k, and I cannot determine in advance how many clusters the user's geographic information contains. If the actual locations are too scattered, aggregating with a fixed k may yield centroid locations far from the actual ones. The DBSCAN algorithm, by contrast, clusters well because it aggregates according to the density of a person's activity radius, but its result only partitions the data set without working out a center point. Therefore, I designed a hybrid algorithm based on DBSCAN and Kmeans: first, use the density-reachability of the DBSCAN algorithm to aggregate the user's geographic location data set into several clusters according to the activity radius; then take the data set of each cluster as new input and use the iterative aggregation of the Kmeans algorithm, with k set to 1, to compute the centroid location of each cluster. The code is as follows:

import numpy
import pandas as pd
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

def clustering_by_dbscan_and_kmeans2():
    # haversine() is the spherical-distance function defined in section 2.2.
    X = pd.DataFrame(
        {"lat": [39.920767, 39.920965, 39.92304, 39.989539, 39.92069, 39.920845, 39.920902,
                 39.920657, 39.920934, 39.920691, 39.920671, 39.920864, 39.920856, 39.920407,
                 39.920867, 39.920425, 39.920377, 39.920442, 39.920474, 39.920636, 39.921168,
                 39.920006, 39.920612, 39.919584, 39.920737, 39.920641, 39.920574, 39.920841,
                 39.920644, 39.920847, 39.922041, 39.92088, 39.920603, 39.920851, 39.920789,
                 39.92082, 39.920564, 39.920771, 39.920868, 39.920544, 39.920758, 39.926767,
                 39.92024, 39.917848, 40.00875, 39.980757, 39.9808, 40.009217, 39.920747,
                 39.919515, 39.920533, 39.991104, 39.917724, 39.924316, 39.920642, 39.924413,
                 39.920828, 39.991098, 39.991139, 31.97337, 31.97392, 31.973121, 31.973254,
                 31.973807, 31.973692, 31.973439, 31.973489, 39.926799, 39.917804, 39.920645,
                 39.920441, 39.920043, 39.920826, 39.920791, 39.920896, 39.920811, 39.920621,
                 39.9259, 39.886322, 39.92062],
         "lng": [116.455788, 116.456065, 116.452312, 116.421385, 116.455685, 116.455876,
                 116.455973, 116.455645, 116.456022, 116.455685, 116.456023, 116.45596,
                 116.455522, 116.455276, 116.455799, 116.455349, 116.45511, 116.455318,
                 116.455298, 116.455839, 116.455979, 116.454281, 116.45598, 116.45388,
                 116.455474, 116.456009, 116.455439, 116.455759, 116.455838, 116.455983,
                 116.459803, 116.456029, 116.455539, 116.455989, 116.455719, 116.45601,
                 116.456229, 116.455906, 116.456248, 116.455805, 116.455896, 116.43692,
                 116.454672, 116.454813, 116.381415, 116.422925, 116.422849, 116.38107,
                 116.456078, 116.455242, 116.455615, 116.422092, 116.454847, 116.456686,
                 116.45575, 116.456713, 116.455846, 116.422108, 116.422075, 118.775572,
                 118.776968, 118.778187, 118.775695, 118.775302, 118.776303, 118.777541,
                 118.776196, 116.448944, 116.45487, 116.455762, 116.456146, 116.455857,
                 116.455458, 116.455533, 116.455426, 116.45566, 116.455696, 116.453667,
                 116.466606, 116.455917]})
    distance_matrix = squareform(pdist(X, (lambda u, v: haversine(u, v))))
    db = DBSCAN(eps=2, min_samples=5, metric='precomputed')
    y_db = db.fit_predict(distance_matrix)
    X['cluster'] = y_db
    # Group the points of each DBSCAN cluster as [lng, lat] pairs.
    results = {}
    for i in X.values:
        if i[2] not in results.keys():
            results[i[2]] = [[i[1], i[0]]]
        else:
            results[i[2]].append([i[1], i[0]])
    print "DBSCAN output: ", len(results), results.keys()
    print "KMeans calc center as below: "
    for k in results.keys():
        xy = numpy.array(results[k])
        z = numpy.sin(xy[:, 1] - 0.2 * xy[:, 1] ** 2)  # third feature derived from the latitude
        res, idx = kmeans2(numpy.array(zip(xy[:, 0], xy[:, 1], z)), 1, iter=20, minit='points')
        address_text = my_get_address_text_by_location(res[0][1], res[0][0])  # author's reverse-geocoding helper
        print res, address_text

The output is as follows



Among them, the location information for "home", "company" and "business trip" is now very close to the user's actual information, but noise points remain, and I have not found a solution for them yet. The next step is to bring in auxiliary information, such as the timestamps obtained while collecting the user's geographic locations, to assist the analysis, hoping to get better results.