Small knowledge, big challenge! This article is participating in the “Essential Tips for Programmers” creation activity.

**Previously on KNN:**

Suppose we have a data set. Given a new instance, the simplest, most brute-force approach is to compute the distance between it and every point in the set, find the k nearest neighbors, and decide the new instance’s class by majority vote.

But if the training set is large and dense, and each instance has many features, then thousands of distances must be computed, which is a huge amount of work.

In this case, we can use a faster method: the kd tree.

# What is a KD tree

A kd tree is essentially a binary tree. It recursively partitions k-dimensional space, and each node corresponds to a k-dimensional hyperrectangular region.

Two-dimensional (k=2) rectangular region:

Three-dimensional (k=3) rectangular region:

Note that the k in “kd tree” denotes the number of features, i.e. the dimension of the data, while the k in “k-nearest neighbors” refers to the k neighbors closest to the new instance point.

# How to construct kd tree

## Principle

Input: a k-dimensional data set whose instances have the form

xi = (xi(1), xi(2), …, xi(k))T

where the superscript in xi(l) denotes the l-th feature value of xi.

Output: KD tree

### 1. Start: Construct the root node.

The optimal splitting feature is selected at the root node, usually by comparing the variance of each feature: the feature with the largest variance is chosen as the splitting axis.

Suppose we pick x(1) as the axis. We find a cut point and split the hyperrectangular region into two subregions. Usually the median of the data along the x(1) direction is chosen as the splitting point, and the root node generates left and right children of depth 1. Points whose x(1) coordinate is smaller than the splitting point go to the left child’s region, and points with a larger coordinate go to the right child’s region.

### 2. Repeat: select and split on the remaining features

For a node of depth j, x(l) is selected as the splitting axis, where l = j (mod k) + 1. The median of the x(l) coordinates of all instances in the node’s region is taken as the splitting point, and the region is again divided into two subregions.
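The axis-selection rule can be checked with a tiny snippet (illustrative only, not from the original article), here for k = 2 features:

```python
# The rule l = j (mod k) + 1 from step 2, for k = 2 features:
# the splitting axis alternates x(1), x(2), x(1), x(2), ...
k = 2
for depth in range(4):
    axis = depth % k + 1
    print(f"depth {depth} -> split on x({axis})")
```

This prints axis 1 at depth 0, axis 2 at depth 1, then cycles back to axis 1 at depth 2, which matches the worked example later in the article.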

### 3. Stop: Get kd tree

Splitting stops when the two subregions contain no more instances; the result is the kd tree.
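The three steps above can be sketched in code. The following is a minimal, illustrative implementation (not from the article): it cycles through the axes by depth rather than picking the maximum-variance feature, and takes the upper median as the splitting point. The `Node` and `build` names are my own.

```python
# Minimal kd-tree construction sketch. Axes are chosen cyclically by
# depth (step 2); the upper median is used as the splitting point.

class Node:
    def __init__(self, point, left=None, right=None):
        self.point = point    # splitting point stored at this node
        self.left = left      # subtree of smaller coordinates on the axis
        self.right = right    # subtree of larger coordinates on the axis

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])        # cycle through the k features
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2               # upper median as splitting point
    return Node(points[mid],
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

T = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
root = build(T)
print(root.point)                        # (7, 2)
print(root.left.point, root.right.point) # (5, 4) (9, 6)
```

With this data set the sketch reproduces the splits derived by hand in the worked example: (7,2) at the root, then (5,4) and (9,6), then (2,3), (4,7), and (8,1).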

## Worked example

**Input:** data set T = {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}

Output: KD tree

For convenience, the data points in the data set are labeled and visualized:

### 1. First split

Since the training data is 2-dimensional, either feature can be chosen first. Take x(1) as the axis and sort the data by x(1) from smallest to largest: 2, 4, 5, 7, 8, 9.

With an even number of values, we may take 7 as the median; that is, (7,2) becomes the root node, and the region is split along x(1) = 7.

Points with x(1) less than 7 go to the left child; points with x(1) greater than 7 go to the right.

### 2. Second split

Divide the region again:

The axis index now increases by one, so x(2) becomes the splitting axis.

For the left region after the first split, sort the x(2) coordinates from small to large: 2, 3, 4, 7.

Take the median 4; the splitting point is (5,4), and a horizontal line through it performs the second split.

Similarly, for the right region after the first split, the x(2) coordinates sorted from small to large are 1, 6.

Since there are only two points, we may take 6 as the median; the splitting point in the right region is (9,6), and a horizontal line through it performs the second split.

### 3. Continue splitting

Increasing the axis index again would give x(3), but there are only x(1) and x(2), so the axis cycles back to x(1). Equivalently, compute l = 2 (mod 2) + 1 = 1 to obtain the selected feature.

Clearly, the three instance points A(2,3), D(4,7), and E(8,1) have not yet been split off, so a line perpendicular to the x(1) axis is drawn through each of them to complete the partition.

### 4. Draw the kd tree

The splitting points at each level of the partition are:

Level 1: (7,2)

Level 2: (5,4), (9,6)

Level 3: (2,3), (4,7), (8,1)

Draw the KD tree according to this hierarchy:

# How to search a kd tree

## Principle

Input: the constructed kd tree and a target point x

Output: the nearest neighbor of x

Find the “current nearest point”: starting from the root node, recursively descend the kd tree to the leaf node whose region contains x; take that leaf as the “current nearest point”.

Backtrack: move back up toward the root. The true nearest neighbor must lie within a circle (hypersphere) centered at the target point with radius equal to its distance to the “current nearest point”. At each node on the way up, check whether the region corresponding to the other child of that node’s parent intersects this circle; if it does, that subtree may contain a closer point and must also be searched.

When the root is reached, the search ends, and the final “current nearest point” is x’s nearest neighbor.
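The descend-then-backtrack procedure can be sketched as follows. This is an illustrative implementation, not code from the article; `Node`, `build`, `dist2`, and `nearest` are hypothetical helper names, and a minimal tree builder is included so the snippet is self-contained.

```python
# Nearest-neighbor search on a kd tree: descend to a leaf first,
# then backtrack, pruning subtrees whose region cannot intersect
# the circle around the target.

class Node:
    def __init__(self, point, left=None, right=None):
        self.point, self.left, self.right = point, left, right

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid],
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def dist2(a, b):
    # squared Euclidean distance (avoids unnecessary square roots)
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, target, depth=0, best=None):
    if node is None:
        return best
    # update the "current nearest point" with this node if it is closer
    if best is None or dist2(target, node.point) < dist2(target, best):
        best = node.point
    axis = depth % len(target)
    # 1. descend into the subtree whose region contains the target
    near, far = ((node.left, node.right)
                 if target[axis] < node.point[axis]
                 else (node.right, node.left))
    best = nearest(near, target, depth + 1, best)
    # 2. backtrack: visit the other subtree only if the splitting line
    #    intersects the circle around the target
    if (target[axis] - node.point[axis]) ** 2 < dist2(target, best):
        best = nearest(far, target, depth + 1, best)
    return best

root = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(root, (2.1, 3.1)))   # (2, 3)
```

The pruning test compares the squared distance from the target to the splitting line against the squared distance to the current best; when the circle does not cross the line, the far subtree can safely be skipped.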

## Worked examples

Take the kd tree generated above as an example:

### Case 1

Input: KD tree, target point x=(2.1,3.1)

Output: The nearest neighbor point of the target

Find the current nearest point: starting from the root, x = (2.1, 3.1) lies in the left subregion of the root (7,2); descend into the lower subregion determined by (5,4), and then toward the right side of (2,3). The leaf (2,3) becomes the current nearest point.

Backtrack: draw a circle centered at (2.1, 3.1) with radius equal to its distance to the current nearest point (2,3). No other point lies inside this circle, which proves that (2,3) is the nearest neighbor of (2.1, 3.1).
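The emptiness of this circle is easy to verify numerically. A small check (illustrative snippet, not from the article):

```python
# Case 1: confirm that no other data point falls inside the circle
# centered at the target with radius equal to its distance to (2, 3).
from math import dist  # Python 3.8+

target = (2.1, 3.1)
T = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
r = dist(target, (2, 3))
print(round(r, 3))                                   # 0.141
inside = [p for p in T if p != (2, 3) and dist(target, p) < r]
print(inside)                                        # [] -- circle is empty
```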

### Case 2

Input: kd tree, target point x= (2,4.5)

Output: The nearest neighbor of the target point

Find the current nearest point: starting from the root, x = (2, 4.5) lies in the left subregion of the root (7,2); descend into the upper subregion determined by (5,4), and then toward (4,7). The leaf (4,7) becomes the current nearest point.

Backtrack: draw a circle centered at (2, 4.5) with radius equal to the distance from (2, 4.5) to (4,7). Two nodes, (2,3) and (5,4), fall inside this circle. Computing the distance from (2, 4.5) to each of them, (2,3) turns out to be closer, so (2,3) becomes the current nearest point. Then draw a new circle centered at (2, 4.5) with radius equal to the distance from (2, 4.5) to (2,3). No other node lies inside this circle, confirming that (2,3) is the nearest neighbor of (2, 4.5).
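The distances in Case 2 can be checked directly (illustrative snippet, not from the article):

```python
# Case 2: distances from the target (2, 4.5) to the candidate points.
from math import dist  # Python 3.8+

target = (2, 4.5)
for p in [(4, 7), (5, 4), (2, 3)]:
    print(p, round(dist(target, p), 3))
# (4, 7) 3.202   -- the first "current nearest point"
# (5, 4) 3.041   -- inside the first circle, but not the closest
# (2, 3) 1.5     -- the true nearest neighbor
```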

#### References:

Let’s run the numbers, Dr. Jane

#### Video materials:

3.3 K-nearest neighbor method – what is a kd tree

3.3 K-nearest neighbor method – constructing a kd tree

3.3 K-nearest neighbor method – searching a kd tree