In general, there are two basic ways to query similarity in an index structure:

  1. One is the range query: given a query point and a distance threshold, find all data points in the data set whose distance from the query point is less than the threshold.
  2. The other is the k-nearest neighbor query: given a query point and a positive integer K, find the K data points closest to the query point in the data set. When K = 1, it is the nearest neighbor query. (A brute-force sketch of both query types follows this list.)
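As a concrete, if naive, illustration of the two query types, here is a minimal brute-force sketch in C++; the Point alias, the euclid helper and the function names are ours for illustration and do not come from any index library:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Point;

// plain Euclidean distance between two points of equal dimension
double euclid(const Point& a, const Point& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Range query: every point whose distance from q is less than the threshold r.
std::vector<Point> rangeQuery(const std::vector<Point>& data, const Point& q, double r) {
    std::vector<Point> result;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (euclid(data[i], q) < r) result.push_back(data[i]);
    return result;
}

// K-nearest-neighbor query: the K points closest to q (K = 1 gives the nearest neighbor).
std::vector<Point> knnQuery(std::vector<Point> data, const Point& q, std::size_t K) {
    std::sort(data.begin(), data.end(),
              [&q](const Point& a, const Point& b) { return euclid(a, q) < euclid(b, q); });
    if (data.size() > K) data.resize(K);
    return data;
}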

Similarly, there are two methods for feature point matching:

  • The simplest method is a linear scan, also known as exhaustive search: compute the distance between the input instance and every sample in the sample set in turn, and take the sample with the smallest distance as the nearest neighbor. This method is simple and direct, but when the sample (training) set is large its shortcomings are immediately exposed. In object recognition, for example, an image may yield thousands or even tens of thousands of SIFT feature points, and computing the distance from the input instance to every one of them is clearly inadequate.
  • The other is to build a data index. Because real data are usually clustered, we can build an index over the data and then match against it quickly. An index tree is a tree-structured indexing method whose basic idea is to divide the search space into hierarchical levels. Depending on whether the partitions overlap, such methods can be classified as clipping or overlapping: the former partitions space without overlap and is represented by the k-d tree; the latter allows the partitions to overlap and is represented by the R-tree.

In 1975, Jon Louis Bentley of Stanford University published a paper in Communications of the ACM introducing the k-d tree, which partitions space into multiple parts in the manner shown in the following figure.

2.1 What is a KD tree

A kd-tree (short for k-dimensional tree) is a data structure that organizes points in a k-dimensional space (e.g. 2-D (x, y), 3-D (x, y, z), and in general k-D (x1, x2, ..., xk)). It is mainly used for key searches over multidimensional data, such as range searches and nearest neighbor searches. Essentially, a kd-tree is a balanced binary tree.

First of all, it must be made clear that the k-d tree is a space-partitioning tree: to put it plainly, the whole space is divided into a number of specific parts, and the relevant search operations are then carried out within particular parts of that space. Imagine a three-dimensional space (higher dimensions are a bit hard to visualize); the kd tree divides this three-dimensional space into multiple subspaces according to certain partitioning rules, as shown in the figure below:

2.2 Construction of a KD tree

The pseudo-code of kd tree construction is shown in the figure below:

Let's introduce the k-d tree construction algorithm with a simple, intuitive example. Assume there are six two-dimensional data points {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}, located in the plane as shown in the figure below. To find nearest neighbors efficiently, the k-d tree uses the idea of divide and conquer, that is, the whole space is divided into several small parts. First, the thick black line divides the space into two parts; then, within the two subspaces, the thin black lines divide the whole space into four parts; and finally the dashed black lines divide those four parts further.

The steps for building a kd tree from the six two-dimensional data points {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)} are as follows:

  1. Determine the split dimension: split = x. Specifically, the variance of the six points' coordinates is larger along the x dimension than along the y dimension (the sample variances are about 6.97 and 5.37 respectively), so the split dimension is x.
  2. Determine node-data = (7,2). Specifically, the data are sorted by their x values; the median x value of the six points (taking the upper of the two middle values) is 7, so node-data is the point (7,2). The splitting hyperplane of this node is therefore the line x = 7 passing through (7,2) and perpendicular to the split dimension x.
  3. Determine the left and right subspaces. The hyperplane x = 7 divides the space into two parts: the part with x <= 7 is the left subspace, containing the three points {(2,3), (5,4), (4,7)}; the other part is the right subspace, containing the two points {(9,6), (8,1)}.

After dividing the space as shown above, we can see that the point (7,2) serves as the root node; the two thick red lines from the root point to (5,4) and (9,6), the root's left and right children; (2,3) and (4,7) are the left and right children of (5,4) (connected by the two thin red lines); and finally (8,1) is the left child of (9,6) (connected by a thin red line). This yields the following k-d tree:
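For completeness, applying the same splitting rule recursively to the two subspaces explains why (5,4) and (9,6) become the children of the root, a step the description above passes over:

  • Left subspace {(2,3), (5,4), (4,7)}: the split dimension alternates to y; the median of the y values {3, 4, 7} is 4, so (5,4) becomes the node, with (2,3) falling in its y <= 4 region and (4,7) in its y > 4 region.
  • Right subspace {(9,6), (8,1)}: the split dimension is likewise y; taking the upper of the two y values {1, 6} as the median (the same convention used at the root) gives 6, so (9,6) becomes the node and (8,1) falls below it.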

 

Data structure of K-D tree

The kd tree data structure described above translates into the following code (note: the code analysis in this article is based on the SIFT library maintained by Rob Hess):


/** a node in a k-d tree */
struct kd_node
{
    int ki;                       /**< partition key index */
    double kv;                    /**< partition key value */
    int leaf;                     /**< 1 if node is a leaf, 0 otherwise */
    struct feature* features;     /**< features at this node */
    int n;                        /**< number of features */
    struct kd_node* kd_left;      /**< left child */
    struct kd_node* kd_right;     /**< right child */
};


In other words, as mentioned before, the kd in kd tree stands for k-dimension, and each node is a k-dimensional point. Each non-leaf node can be imagined as a splitting hyperplane, perpendicular to one coordinate axis, that divides the space into two parts; the division proceeds recursively from the root node until no instances remain. The classical rules for constructing a k-d tree are as follows:

  1. As the depth of the tree increases, the coordinate axis used as the normal of the splitting hyperplane is chosen in rotation. For a 3-dimensional tree, the root splits on the x axis, the root's children split on the y axis, its grandchildren on the z axis, its great-grandchildren on the x axis again, and so on.
  2. At each node, the instance whose coordinate along the current splitting axis is the median of all instances there is taken as the splitting point; it becomes the parent node, and the instances falling on its two sides become the left and right subtrees.

For N instances of K-dimensional data, the time complexity of building a kd tree is O(K·N·log N).

Now that we’ve built the KD tree, what about the nearest neighbor search? Can you tell by looking at the animated GIF below?



The k-d tree algorithm has two parts: besides the algorithm for building the kd tree data structure itself, covered above, the other part consists of the operations performed on an established kd tree, such as insertion, deletion, and search (nearest neighbor search). Next, let's look at the insertion, deletion and search operations of the kd tree in turn.

2.3 Insertion into a KD tree

Elements are inserted into a k-d tree in a similar way to binary search trees. Essentially, x coordinates are compared at even levels and y coordinates at odd levels. When we reach the bottom of the tree (i.e. a null pointer), we know where the new node should be inserted. The shape of the resulting k-d tree depends on the order in which the nodes are inserted. Given N points, the average cost of inserting or retrieving one node is O(log2 N).
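To make the layer-alternating comparison concrete, here is a minimal 2-D insertion sketch; the Node2D type and the insert2D name are illustrative only and are not taken from any of the libraries discussed in this article:

#include <cstddef>

// Minimal 2-D kd-tree node, for illustration only.
struct Node2D {
    double x, y;
    Node2D *left, *right;
    Node2D(double px, double py) : x(px), y(py), left(NULL), right(NULL) {}
};

// Insert as in a binary search tree, but compare x at even depths and y at odd depths.
Node2D* insert2D(Node2D* root, double x, double y, unsigned depth = 0) {
    if (root == NULL)                        // reached a null link: this is where the new point goes
        return new Node2D(x, y);
    bool compareX = (depth % 2 == 0);        // even layer -> compare x, odd layer -> compare y
    double key     = compareX ? x : y;
    double nodeKey = compareX ? root->x : root->y;
    if (key < nodeKey)
        root->left = insert2D(root->left, x, y, depth + 1);
    else
        root->right = insert2D(root->right, x, y, depth + 1);
    return root;
}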

The following 4 images (source: China University of Geosciences e-course) illustrate the insertion sequence of (a) Chicago, (b) Mobile, (c) Toronto, and (d) Buffalo to establish a spatial K-D tree:







It should be noted that during the insertions described here, each node splits the plane it governs into two parts. Chicago, for instance, splits all the nodes of the plane into two parts: those whose x coordinate is less than 35 and those whose x coordinate is greater than or equal to 35. Similarly, Mobile splits the nodes whose x coordinate is greater than 35 into two parts: those whose y coordinate is less than 10 and those whose y coordinate is greater than or equal to 10. Toronto and Buffalo continue to split the plane according to the same bisection rule.

2.4 Deletion from a KD tree



2.5 The nearest neighbor search algorithm of the KD tree

In real life, many problems require fast analysis and search over multidimensional data, and the most commonly used tool for this is the kd tree. Searching a k-d tree is also an important step in feature matching; its purpose is to retrieve the data point in the k-d tree that is closest to the query point. The distance between two points in an n-dimensional Cartesian space is given by the following formula:
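Written out, this is the standard Euclidean form the text refers to (note that the sample routine below instead measures distance with the simpler Manhattan metric |x1 - x2| + |y1 - y2| on 2-D points):

d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )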


void innerGetClosest(NODE* pNode, PT point, PT& res, int& nMinDis)
{
    if (NULL == pNode)
        return;
    // Manhattan distance between the query point and this node's point
    int nCurDis = abs(point.x - pNode->pt.x) + abs(point.y - pNode->pt.y);
    if (nMinDis < 0 || nCurDis < nMinDis)
    {
        nMinDis = nCurDis;
        res = pNode->pt;
    }
    // first descend into the child on the same side of the splitting line as the query point
    if (pNode->splitX && point.x <= pNode->pt.x || !pNode->splitX && point.y <= pNode->pt.y)
        innerGetClosest(pNode->pLft, point, res, nMinDis);
    else
        innerGetClosest(pNode->pRgt, point, res, nMinDis);
    // distance from the query point to this node's splitting line
    int rang = pNode->splitX ? abs(point.x - pNode->pt.x) : abs(point.y - pNode->pt.y);
    if (rang > nMinDis)
        return;
    // otherwise the other child's region may still hold a closer point, so visit it as well
    NODE* pGoInto = pNode->pRgt;
    if (pNode->splitX && point.x > pNode->pt.x || !pNode->splitX && point.y > pNode->pt.y)
        pGoInto = pNode->pLft;
    innerGetClosest(pGoInto, point, res, nMinDis);
}

Below, two simple examples (from the book Local Invariant Features and Description of Images) are used to describe the basic idea of nearest neighbor search.

2.5.1 Example: query point (2.1, 3.1)

The asterisk marks the query point (2.1, 3.1). Through binary search, the nearest approximate point, the leaf node (2,3), can be found quickly along the search path. However, the leaf node found this way is not necessarily the true nearest neighbor; the true nearest neighbor must be at least as close to the query point, i.e. it must lie inside the circle centered at the query point and passing through that leaf node. To find the true nearest neighbor, a 'backtracking' operation is required: the algorithm searches backwards along the search path to check whether there is a data point even closer to the query point.

Take the query (2.1, 3.1) as an example:

  1. Binary tree search: the search starts from point (7,2), then reaches point (5,4), and finally reaches point (2,3). The nodes on the search path are <(7,2), (5,4), (2,3)>. (2,3) is first taken as the current nearest neighbor, and its distance to the query point (2.1, 3.1) is computed as 0.1414.
  2. Backtracking: after taking (2,3) as the current nearest point, the search backtracks to its parent node (5,4) and checks whether the space of that parent's other child contains a data point closer to the query point. Draw a circle centered at (2.1, 3.1) with radius 0.1414, as shown in the figure below; the circle does not intersect the hyperplane y = 4, so there is no need to enter the right subspace of node (5,4) (the grey area in the figure).
  3. Finally, backtracking to (7,2): the circle centered at (2.1, 3.1) with radius 0.1414 does not intersect the hyperplane x = 7 either, so there is no need to enter the right subspace of (7,2). At this point all nodes on the search path have been backtracked, the search ends, and the nearest neighbor (2,3) is returned with nearest distance 0.1414.



2.5.2 Example: query point (2, 4.5)

For example, the search point is (2,4.5), and the specific steps are as follows:

  1. Again a binary search is performed first, from (7,2) down to node (5,4). At (5,4) the partition hyperplane is y = 4; since the query's y value is 4.5, the search enters the right subspace and reaches (4,7), forming the search path <(7,2), (5,4), (4,7)>. However, the distance from (4,7) to the target point is 3.202, while the distance from (5,4) to it is 3.041, so (5,4) becomes the current nearest point of the query.
  2. Draw a circle centered at (2, 4.5) with radius 3.041, as shown in the figure below. The circle intersects the hyperplane y = 4, so the search must enter the left subspace of (5,4); node (2,3) is added to the search path, giving <(7,2), (2,3)>, and the search continues down to the leaf node (2,3). The distance from (2, 4.5) to (2,3) is closer than to (5,4), so the nearest neighbor is updated to (2,3) and the nearest distance is updated to 1.5.
  3. Backtracking continues to (5,4) and finally to the root node (7,2). The circle centered at (2, 4.5) with radius 1.5 does not intersect the splitting hyperplane x = 7, as shown in the figure below. At this point the search path is fully backtracked; the nearest neighbor (2,3) is returned with nearest distance 1.5.

The two examples above show that when the neighborhood of the query point intersects the space on both sides of a partition hyperplane, the subspace on the other side must also be searched, which complicates the retrieval process and reduces efficiency.
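Both results are easy to confirm with a brute-force check over the six sample points. The small self-contained program below is just such a sanity check (the variable names are ours); it prints (2,3) with distance 0.1414 for the first query and (2,3) with distance 1.5 for the second:

#include <cmath>
#include <cstdio>

int main() {
    const double pts[6][2] = {{2,3},{5,4},{9,6},{4,7},{8,1},{7,2}};
    const double queries[2][2] = {{2.1,3.1},{2,4.5}};
    for (int q = 0; q < 2; ++q) {
        int best = 0;
        double bestDist = 1e30;
        for (int i = 0; i < 6; ++i) {                 // linear scan over all six points
            double dx = pts[i][0] - queries[q][0];
            double dy = pts[i][1] - queries[q][1];
            double d = std::sqrt(dx * dx + dy * dy);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        std::printf("query (%g,%g): nearest (%g,%g), distance %.4f\n",
                    queries[q][0], queries[q][1], pts[best][0], pts[best][1], bestDist);
    }
    return 0;
}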

Research shows that the worst-case time complexity of searching a k-dimensional k-d tree with N nodes is t_worst = O(k·N^(1-1/k)).

Also, for ease of presentation, the discussion above considered only the two- and three-dimensional cases. In practical applications, however, the dimensionality is often much higher, for example 128 dimensions for SIFT feature vectors and 64 dimensions for SURF feature vectors, and the performance of straightforward k-d tree retrieval (which works well below roughly 20 dimensions) drops sharply, becoming almost as slow as a greedy linear scan. If the dimensionality of the data set is D, the data size N generally needs to satisfy N >> 2^D to achieve efficient search. This has led to a series of improvements to the k-d tree algorithm: the BBF algorithm, and a family of high-dimensional spatial index trees such as the ball tree, M-tree, VP-tree and MVP-tree (see Section 2.6, Improvement of the kd tree neighbor search algorithm: the BBF algorithm, and Section 2.7, Ball tree, M-tree, VP-tree and MVP-tree, below).

2.6 Improvement of kd tree neighbor search algorithm: BBF algorithm

Following the ideas of the previous section, and the treatment in the book Statistical Learning Methods, the nearest neighbor search algorithm of the kd tree can be summarized as follows:








  1. Find the leaf node containing the target point x in the kd tree: starting from the root node, recursively descend the kd tree. If the coordinate of x in the current splitting dimension is less than that of the splitting point, move to the left child; otherwise move to the right child, until a leaf node is reached.
  2. Take this leaf node as the "current nearest point".
  3. Recursively back up, performing the following at each node: (a) if the instance stored at this node is closer to the target point than the current nearest point, update the "current nearest point" to this instance; (b) the current nearest point must lie in the region corresponding to one child of this node; check whether the region corresponding to the other child of this node's parent contains a closer point. Concretely, check whether that region intersects the hypersphere centered at the target point whose radius is the distance between the target point and the "current nearest point". If they intersect, a closer point may exist in the other child's region, so move to that child and continue the nearest neighbor search recursively; if they do not intersect, back up.
  4. When the backtracking reaches the root node, the search ends; the final "current nearest point" is the nearest neighbor of x.

If the instance points are randomly distributed, the average computational complexity of kd tree search is O(log N), where N is the number of training instances. The kd tree is therefore well suited to k-nearest-neighbor search when the number of training instances is much larger than the dimensionality of the space; when the dimensionality approaches the number of training instances, its efficiency quickly degrades to roughly that of a linear scan.
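The four steps above can also be written quite compactly as a recursive routine. The sketch below is our own minimal version for the general k-dimensional case (the Node type and the nnSearch name are illustrative); it mirrors step 3(a) and the hypersphere-versus-hyperplane test of step 3(b):

#include <cmath>
#include <cstddef>
#include <vector>

struct Node {
    std::vector<double> pt;
    Node *left, *right;
    Node(const std::vector<double>& p) : pt(p), left(NULL), right(NULL) {}
};

// Euclidean distance between two points of equal dimension.
double dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// depth selects the splitting axis; best/bestDist hold the "current nearest point".
void nnSearch(Node* node, const std::vector<double>& q, unsigned depth,
              Node*& best, double& bestDist) {
    if (node == NULL) return;
    unsigned axis = depth % q.size();
    double d = dist(q, node->pt);
    if (d < bestDist) { bestDist = d; best = node; }           // step 3(a): update the current nearest point
    Node* nearSide = q[axis] < node->pt[axis] ? node->left : node->right;
    Node* farSide  = q[axis] < node->pt[axis] ? node->right : node->left;
    nnSearch(nearSide, q, depth + 1, best, bestDist);          // step 1: descend toward the target's side first
    if (std::fabs(q[axis] - node->pt[axis]) < bestDist)        // step 3(b): does the ball cross the splitting hyperplane?
        nnSearch(farSide, q, depth + 1, best, bestDist);
}

Calling nnSearch(root, q, 0, best, bestDist) with best initialized to NULL and bestDist to a large value leaves the nearest neighbor in best.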

As step 4 of the k-nearest neighbor search algorithm above states, "the search ends when backtracking reaches the root node". Every query therefore keeps comparing and backtracking all the way to the root before it finishes, which introduces many unnecessary backtracks and node comparisons. This redundant work makes the search quite inefficient on high-dimensional data. So how can the original kd-tree nearest neighbor search algorithm be improved?

As the standard kd tree query process above shows, the "backtracking" is determined entirely by the "query path", without considering any properties of the data points along that path. A simple improvement is to order the nodes on the "query path", for example by the distance between the query point and each node's splitting hyperplane (each such region is also called a bin), so that backtracking always starts from the tree node with the highest priority (the best bin).

Feng & Bookboy comment on this BBF mechanism:

  1. At a given level, if the splitting dimension is ki and the splitting value is kv, then abs(q[ki] - kv) is the priority of the branch that was not taken, i.e. the distance computed along that single dimension;
  2. Meanwhile, nodes are taken from the priority queue only after a leaf node has been reached. Nodes whose distances have already been computed do not appear in the queue. For example, if the path from node 1 to node 10 is 1-5-7, then 1, 5 and 7 will not appear in the priority queue. In other words, the priority queue stores the sibling nodes of the nodes on the query path: if the left subtree is searched, the corresponding right node at that level is queued, and vice versa.

This leads to the improved kd tree nearest neighbor search algorithm discussed in this section: the BBF (Best-Bin-First) query algorithm, an approximate algorithm for high-dimensional data proposed by David Lowe, the inventor of SIFT, in a 1997 paper. The algorithm ensures, with high probability, that the space most likely to contain the nearest neighbor is searched first. In addition, the BBF mechanism imposes a limit on the running time (a maximum number of nodes to examine). With BBF queries, the kd tree can be extended effectively to high-dimensional data sets.

The pseudocode is shown in the figure below (taken from the book Local Invariant Features and Description of Images):

Taking the query (2, 4.5) above as an example, the BBF search proceeds as follows:

  1. Push (7,2) into the priority queue;
  2. Extract (7,2) from the priority queue. Since (2, 4.5) lies to the left of the splitting hyperplane of (7,2), its left child (5,4) is examined next and, following the BBF rule "when searching the left/right subtree, push the sibling node of that level (the right/left node) into the queue", the sibling of (5,4), namely the right child (9,6), is pushed into the priority queue. At this point the priority queue is {(9,6)} and the best point so far is (7,2). The descent then continues to the leaf node (4,7), pushing (2,3), the sibling of the branch taken at (5,4), into the queue; the priority queue becomes {(2,3), (9,6)} and the "best point" becomes (5,4).
  3. Extract the highest-priority node (2,3) from the queue and repeat step 2, until the priority queue is empty. A simplified code sketch of this queue-driven search follows.
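A rough sketch of the mechanism is given below. It is our own simplified version, not Rob Hess's or Lowe's actual implementation, and KdNode, Branch, bbfNearest and maxChecks are illustrative names: each time the descent skips a branch, that branch is pushed into a min-priority queue keyed by the query's distance to its splitting hyperplane, and backtracking always resumes from the best bin until a fixed number of checks is exhausted.

#include <cmath>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical minimal node type, for illustration only.
struct KdNode {
    std::vector<double> point;   // the data point stored at this node
    int ki;                      // splitting dimension
    double kv;                   // splitting value
    KdNode *left, *right;        // children (both NULL at a leaf)
};

// An entry in the BBF priority queue: an unexplored branch and the query's
// distance to the splitting hyperplane that separates it from the query.
struct Branch {
    KdNode* node;
    double dist;
    bool operator<(const Branch& o) const { return dist > o.dist; } // smallest dist = highest priority
};

static double sqDist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// BBF-style approximate nearest neighbor: expand at most maxChecks branches,
// always resuming from the bin whose splitting hyperplane lies closest to q.
std::vector<double> bbfNearest(KdNode* root, const std::vector<double>& q, int maxChecks) {
    std::vector<double> best;
    if (root == NULL) return best;
    double bestDist = 1e300;
    std::priority_queue<Branch> pq;
    pq.push(Branch{root, 0.0});
    for (int checks = 0; !pq.empty() && checks < maxChecks; ++checks) {
        KdNode* node = pq.top().node;
        pq.pop();
        while (node != NULL) {                                    // descend toward a leaf
            double d = sqDist(q, node->point);
            if (d < bestDist) { bestDist = d; best = node->point; }
            if (node->left == NULL && node->right == NULL) break; // reached a leaf
            double diff = q[node->ki] - node->kv;
            KdNode* nearChild = (diff <= 0) ? node->left : node->right;
            KdNode* farChild  = (diff <= 0) ? node->right : node->left;
            if (farChild != NULL)
                pq.push(Branch{farChild, std::fabs(diff)});       // skipped branch, keyed by hyperplane distance
            node = nearChild;
        }
    }
    return best;
}

With a large enough maxChecks every branch is eventually examined and the result is exact; smaller values trade accuracy for speed, which is the point of BBF on high-dimensional data.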

2.7 Ball tree, M-tree, VP-tree and MVP-tree

2.7.1 Ball tree

Let's review the content above using the following kd tree:

Suppose we now want to find the nearest neighbor of a given target point.

From section 2.5 above, we can conclude the following:

1. To find the nearest neighbor of a given target point, we start at the root of the tree and descend to the region containing the target. As shown in the figure below, the target point is marked with an asterisk, and there appears to be a point closest to it, since that point falls inside the small dashed circle centered at the target. Before we can conclude that it really is the nearest neighbor, however, we must first check the sibling of the leaf node. In this case the dashed circle does not intersect the region of the leaf node's sibling (the shaded part of the figure), so that sibling leaf cannot contain the nearest neighbor.

2. We then go back up to the parent node and check the parent's sibling node, which covers the region on the other side of the horizontal line in the figure. Because the dashed circle intersects the rectangle at the top right…

As we have seen, the kd tree is a tree structure that can find nearest neighbors efficiently, but it is not perfect: when the data are unevenly distributed, a basic conflict appears. We want the tree to have a perfectly balanced structure, and we also want the regions being searched to be roughly square; but neither an approximate square, nor a rectangle, nor even a true square is the ideal shape to use, because they all have corners.

        

What does that mean? In the figure above, if the black instance point were even farther from the target star, the dashed circle would expand as indicated by the red line and would then intersect the lower-right corner of the upper-left rectangle. Since they intersect, the upper-left rectangle would have to be examined, even though the true nearest point is actually close to the target, so examining the upper-left rectangle is redundant work. Here we see that the kd tree divides the two-dimensional plane into rectangles, but the corners of those rectangular regions are a difficult problem to handle.

The solution is to use the ball tree as shown below:

To split a ball, first select the point farthest from the ball's center, then select a second point farthest from that first point; assign every point in the ball to whichever of these two cluster centers is nearer; finally, compute the center of each cluster and the minimum radius needed for the cluster to contain all of its data points. The advantage of this approach is that the cost of splitting a ball containing n distinct points grows only linearly with n.
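A minimal sketch of this splitting step follows; the Point alias and the splitBall and farthestFrom names are our own assumptions, not taken from a particular library, and the input is assumed to contain at least two points:

#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Point;

double dist(const Point& a, const Point& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

std::size_t farthestFrom(const std::vector<Point>& pts, const Point& ref) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < pts.size(); ++i)
        if (dist(pts[i], ref) > dist(pts[best], ref)) best = i;
    return best;
}

void splitBall(const std::vector<Point>& pts, std::vector<Point>& left, std::vector<Point>& right) {
    Point c(pts[0].size(), 0.0);                       // centroid of the current ball
    for (std::size_t i = 0; i < pts.size(); ++i)
        for (std::size_t d = 0; d < c.size(); ++d) c[d] += pts[i][d] / pts.size();
    Point p1 = pts[farthestFrom(pts, c)];              // first pivot: farthest from the centroid
    Point p2 = pts[farthestFrom(pts, p1)];             // second pivot: farthest from the first pivot
    for (std::size_t i = 0; i < pts.size(); ++i)       // assign each point to the nearer pivot
        (dist(pts[i], p1) <= dist(pts[i], p2) ? left : right).push_back(pts[i]);
}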

To find the nearest neighbor of a given target point with a ball tree, first traverse the whole tree from top to bottom to find the leaf ball containing the target, and find the point in that ball closest to the target; this establishes an upper bound on the distance to the target's nearest neighbor. Then, as in kd tree search, check the sibling nodes: if the distance from the target point to the center of a sibling ball exceeds the sum of that ball's radius and the current upper bound, the sibling cannot contain a closer point; otherwise, the subtree below the sibling must be examined further.

In the figure below, the target is again marked with a star, and the black point is its currently known nearest neighbor. Everything inside the grey ball can be ruled out, because the grey ball's center is too far away for it to contain anything closer. Processing continues in this way, recursively backtracking toward the root of the tree and checking every ball that might still contain a point closer than the current upper bound.
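The pruning test used during that backtracking reduces to a single comparison. In sketch form (a hypothetical helper; the distance from the query to the sibling's center is assumed to be computed elsewhere):

// A sibling ball whose nearest possible boundary is no closer than the current
// best distance cannot contain a better candidate and can be skipped.
bool canRuleOutBall(double distQueryToCenter, double ballRadius, double bestDistSoFar) {
    return distQueryToCenter - ballRadius >= bestDistSoFar;
}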

A ball tree, like a kd tree, is built top-down. The fundamental problem is finding a good way to split a ball containing a set of data points into two. In practice, it is not necessary to continue until each leaf ball contains only two data points; as with the kd tree, the tree-building process can be stopped early once the number of data points in a node falls below a preset minimum.

To recap, the splitting procedure described above picks the point farthest from the ball's center, then the point farthest from that first point, assigns every point to the nearer of these two cluster centers, and finally computes each cluster's center and the minimum radius needed to contain all of its data points; the cost of splitting a ball of n distinct points therefore grows only linearly with n.

2.7.2 Introduction to VP tree and MVP Tree

Distance indexing of high-dimensional feature vectors is a key technology of content-based image retrieval. The usual solution is to first reduce the dimensionality of the high-dimensional feature space and then apply one of the mainstream multidimensional index structures, such as the quadtree, the kd tree, or the R-tree family. The rationale is that these mainstream multidimensional index structures handle low-dimensional data efficiently but become inadequate in high dimensions (the so-called curse of dimensionality).

Experimental results show that efficiency drops markedly once the feature space exceeds about 20 dimensions, while visual features are usually described by high-dimensional vectors, generally of the order of 10^2 dimensions or even higher. In the high-dimensional vectors that represent an image's visual features, different dimensions carry information of different importance, so dimensionality reduction can remove the components carrying secondary information and the strongly correlated components, thereby lowering the dimension of the feature space. This method has been applied in practice.

However, this method has shortcomings: dimensionality reduction may lose useful information, especially when the components of the feature vectors are only weakly correlated. In addition, most mainstream multidimensional index structures are designed for Euclidean space and exploit its geometric properties, whereas image similarity is not necessarily limited to Euclidean distance. For these reasons, distance-based high-dimensional index structures for metric spaces, which can be applied directly to high-dimensional vector similarity queries, have attracted increasing attention.

In a metric space, the distances between objects can be exploited only through the triangle inequality, not through any other geometric properties. A vector space can be regarded as a special metric space whose elements are tuples of real coordinates. Many index structures have been proposed for the high-dimensional indexing problem in metric spaces; they can be roughly classified as shown in the figure below:
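As a side note before that classification, the pruning power that a vantage-point structure gets from the triangle inequality alone can be seen directly (this is the generic vantage-point rule; the symbols are ours): let p be a vantage point, mu the partition radius separating the "inner" points (d(p,x) <= mu) from the "outer" points (d(p,x) >= mu), q the query point, and tau the search radius. Since d(q,x) >= |d(q,p) - d(p,x)| for every x:

  • if d(q,p) + tau < mu, then every outer point satisfies d(q,x) >= mu - d(q,p) > tau, so the outer partition can be skipped;
  • if d(q,p) - tau > mu, then every inner point satisfies d(q,x) >= d(q,p) - mu > tau, so the inner partition can be skipped;
  • otherwise, both partitions must be examined.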

    

Comments:

  1. UESTC_HN_AY_GUOBO: Building on the kd tree, there are now the M-tree and the MVP-tree. The key questions are really how to choose the pivots and how to reduce the number of distance computations in metric space algorithms.
  2. Mandycool: The MVP-tree uses the triangle inequality to narrow the search region, but its target is slightly different: it answers range queries, returning the points whose distance from the query point is less than some value r. Also, the data set the authors tested had only 20 dimensions, so it is unclear how well it performs at hundreds of dimensions. Another way to reduce distance computations is embedding, which rules out some points via inequalities.

For more information, see paper 1: Distance-Based Indexing for High-Dimensional Metric Spaces, Tolga Bozkaya & Meral Ozsoyoglu, and paper 2: Image Retrieval Based on the High-Dimensional Index Structures VP-Tree and MVP-Tree in Metric Space, Wang Zhi-qiang, Gan Guo-hui, Cheng Qi-min.

Of course, if these papers are not enough to satisfy your appetite, here are many more papers on nearest neighbor algorithms for you to read: scholar.google.com.hk/scholar?q=n… , for example the spill-tree paper, An Investigation of Practical Approximate Nearest Neighbor Algorithms. Finally, below is a complete, self-contained C++ implementation that builds a kd tree from the six sample points used earlier and searches it for a nearest neighbor.

#include <iostream>
#include <vector>
#include <algorithm>
#include <string>
#include <cmath>
using namespace std;




struct KdTree{
    vector<double> root;
    KdTree* parent;
    KdTree* leftChild;
    KdTree* rightChild;
    // Default constructor
    KdTree(){parent = leftChild = rightChild = NULL;}
    // Check whether the kd tree is empty
    bool isEmpty()
    {
        return root.empty();
    }
    // Check whether the kd tree is just a single leaf node
    bool isLeaf()
    {
        return (!root.empty()) && 
            rightChild == NULL && leftChild == NULL;
    }
    // Check whether this is the root of the tree
    bool isRoot()
    {
        return (!isEmpty()) && parent == NULL;
    }
    // Check whether this subtree's root is the left child of its parent kd tree
    bool isLeft()
    {
        return parent->leftChild->root == root;
    }
    // Check whether this subtree's root is the right child of its parent kd tree
    bool isRight()
    {
        return parent->rightChild->root == root;
    }
};

int data[6][2] = {{2,3},{5,4},{9,6},{4,7},{8,1},{7,2}};

template<typename T>
vector<vector<T> > Transpose(vector<vector<T> > Matrix)
{
    unsigned row = Matrix.size();
    unsigned col = Matrix[0].size();
    vector<vector<T> > Trans(col,vector<T>(row,0));
    for (unsigned i = 0; i < col; ++i)
    {
        for (unsigned j = 0; j < row; ++j)
        {
            Trans[i][j] = Matrix[j][i];
        }
    }
    return Trans;
}

template <typename T>
T findMiddleValue(vector<T> vec)
{
    sort(vec.begin(),vec.end());
    auto pos = vec.size() / 2;
    return vec[pos];
}


// Build the kd tree
void buildKdTree(KdTree* tree, vector<vector<double> > data, unsigned depth)
{

    // Number of samples
    unsigned samplesNum = data.size();
    // Termination conditions
    if (samplesNum == 0)
    {
        return;
    }
    if (samplesNum == 1)
    {
        tree->root = data[0];
        return;
    }
    // Dimensionality of the samples
    unsigned k = data[0].size();
    vector<vector<double> > transData = Transpose(data);
    // Choose the splitting attribute (dimension)
    unsigned splitAttribute = depth % k;
    vector<double> splitAttributeValues = transData[splitAttribute];
    // Choose the splitting value
    double splitValue = findMiddleValue(splitAttributeValues);
    //cout << "splitValue" << splitValue  << endl;

    // Split the data set into two subsets according to the chosen attribute and value
    vector<vector<double> > subset1;
    vector<vector<double> > subset2;
    for (unsigned i = 0; i < samplesNum; ++i)
    {
        if (splitAttributeValues[i] == splitValue && tree->root.empty())
            tree->root = data[i];
        else
        {
            if (splitAttributeValues[i] < splitValue)
                subset1.push_back(data[i]);
            else
                subset2.push_back(data[i]);
        }
    }

    // Recursively call buildKdTree on the two subsets

    tree->leftChild = new KdTree;
    tree->leftChild->parent = tree;
    tree->rightChild = new KdTree;
    tree->rightChild->parent = tree;
    buildKdTree(tree->leftChild, subset1, depth + 1);
    buildKdTree(tree->rightChild, subset2, depth + 1);
}

// Print the kd tree level by level
void printKdTree(KdTree *tree, unsigned depth)
{
    for (unsigned i = 0; i < depth; ++i)
        cout << "\t";
            
    for (vector<double>::size_type j = 0; j < tree->root.size(); ++j)
        cout << tree->root[j] << ",";
    cout << endl;
    if (tree->leftChild == NULL && tree->rightChild == NULL) // leaf node
        return;
    else // internal (non-leaf) node
    {
        if (tree->leftChild != NULL)
        {
            for (unsigned i = 0; i < depth + 1; ++i)
                cout << "\t";
            cout << " left:";
            printKdTree(tree->leftChild, depth + 1);
        }
            
        cout << endl;
        if (tree->rightChild != NULL)
        {
            for (unsigned i = 0; i < depth + 1; ++i)
                cout << "\t";
            cout << "right:";
            printKdTree(tree->rightChild, depth + 1);
        }
        cout << endl;
    }
}


// Compute the distance between two points in space
double measureDistance(vector<double> point1, vector<double> point2, unsigned method)
{
    if (point1.size() != point2.size())
    {
        cerr << "Dimensions don't match!!" ;
        exit(1);
    }
    switch (method)
    {
        case 0: // Euclidean distance
            {
                double res = 0;
                for (vector<double>::size_type i = 0; i < point1.size(); ++i)
                {
                    res += pow((point1[i] - point2[i]), 2);
                }
                return sqrt(res);
            }
        case 1: // Manhattan distance
            {
                double res = 0;
                for (vector<double>::size_type i = 0; i < point1.size(); ++i)
                {
                    res += abs(point1[i] - point2[i]);
                }
                return res;
            }
        default:
            {
                cerr << "Invalid method!!" << endl;
                return -1;
            }
    }
}
// Search the kd tree for the nearest neighbor of the target point goal
// Input: the target point; the constructed kd tree
// Output: the nearest neighbor of the target point
vector<double> searchNearestNeighbor(vector<double> goal, KdTree *tree)
{
    /* Step 1: find the leaf node that contains the target point: starting
    from the root, recursively descend the kd tree; if the target point's
    coordinate in the current dimension is less than that of the splitting
    point, move to the left child, otherwise to the right child, until a
    leaf node is reached. Take this leaf node as the "current nearest point".
    */
    unsigned k = tree->root.size(); // dimensionality of the data
    unsigned d = 0; // depth counter, starting from dimension 0
    KdTree* currentTree = tree;
    vector<double> currentNearest = currentTree->root;
    while(!currentTree->isLeaf())
    {
        unsigned index = d % k; // current splitting dimension
        if (currentTree->rightChild->isEmpty() || goal[index] < currentTree->root[index])
        {
            currentTree = currentTree->leftChild;
        }
        else
        {
            currentTree = currentTree->rightChild;
        }
        ++d;
    }
    currentNearest = currentTree->root;

    /* Step 2: recursively back up, performing the following at each node:
    (a) if the instance stored at this node is closer to the target point than
    the current nearest point, take this instance as the "current nearest point";
    (b) the current nearest point must lie in the region of one child of some node;
    check whether the region of that node's other child intersects the hypersphere
    centered at the target point whose radius is the distance between the target
    point and the "current nearest point". If it does, a closer point may exist in
    that region, so move to the other child and continue the nearest neighbor
    search recursively; if it does not, back up. */

    // Distance between the current nearest neighbor and the target point
    double currentDistance = measureDistance(goal, currentNearest, 0);

    // If the current subtree's root is the left child of its parent, search the region
    // represented by the parent's right child, and vice versa
    KdTree* searchDistrict;
    if (currentTree->isLeft())
    {
        if (currentTree->parent->rightChild == NULL)
            searchDistrict = currentTree;
        else
            searchDistrict = currentTree->parent->rightChild;
    }
    else
    {
        searchDistrict = currentTree->parent->leftChild;
    }

    // While the subtree being examined is not rooted at the root of the whole kd tree, keep backtracking
    while (searchDistrict->parent != NULL)
    {
        // Closest possible distance between the search region and the target point
        double districtDistance = abs(goal[(d+1)%k] - searchDistrict->parent->root[(d+1)%k]);

        // If that distance is shorter than the distance between the current nearest
        // neighbor and the target point, the region may contain a closer point
        if (districtDistance < currentDistance )//&& !searchDistrict->isEmpty()
        {

            double parentDistance = measureDistance(goal, searchDistrict->parent->root, 0);

            if (parentDistance < currentDistance)
            {
                currentDistance = parentDistance;
                currentTree = searchDistrict->parent;
                currentNearest = currentTree->root;
            }
            if (!searchDistrict->isEmpty())
            {
                double rootDistance = measureDistance(goal, searchDistrict->root, 0);
                if (rootDistance < currentDistance)
                {
                    currentDistance = rootDistance;
                    currentTree = searchDistrict;
                    currentNearest = currentTree->root;
                }
            }
            if (searchDistrict->leftChild != NULL)
            {
                double leftDistance = measureDistance(goal, searchDistrict->leftChild->root, 0);
                if (leftDistance < currentDistance)
                {
                    currentDistance = leftDistance;
                    currentTree = searchDistrict;
                    currentNearest = currentTree->root;
                }
            }
            if (searchDistrict->rightChild != NULL)
            {
                double rightDistance = measureDistance(goal, searchDistrict->rightChild->root, 0);
                if (rightDistance < currentDistance)
                {
                    currentDistance = rightDistance;
                    currentTree = searchDistrict;
                    currentNearest = currentTree->root;
                }
            }
        }//end if

        if (searchDistrict->parent->parent != NULL)
        {
            searchDistrict = searchDistrict->parent->isLeft()? 
                            searchDistrict->parent->parent->rightChild:
                            searchDistrict->parent->parent->leftChild;
        }
        else
        {
            searchDistrict = searchDistrict->parent;
        }
        ++d;
    }//end while
    return currentNearest;
}

int main()
{
    vector<vector<double> > train(6, vector<double>(2, 0));
    for (unsigned i = 0; i < 6; ++i)
        for (unsigned j = 0; j < 2; ++j)
            train[i][j] = data[i][j];

    KdTree* kdTree = new KdTree;
    buildKdTree(kdTree, train, 0);

    printKdTree(kdTree, 0);

    vector<double> goal;
    goal.push_back(3);
    goal.push_back(4.5);
    vector<double> nearestNeighbor = searchNearestNeighbor(goal, kdTree);
    vector<double>::iterator beg = nearestNeighbor.begin();
    cout << "The nearest neighbor is: ";
    while(beg != nearestNeighbor.end()) cout << *beg++ << ",";
    cout << endl;
    return 0;
}