**Abstract:** Starting from the problems vector retrieval needs to solve, this article surveys the mainstream vector retrieval techniques and analyzes a current trend in vector retrieval.

What is vector retrieval

First, let's understand what a vector is. A vector is an array of N numbers (a binary vector is an array of N bits); we call it an N-dimensional vector. Vector retrieval means retrieving, from a given vector dataset, the K vectors most similar to a query vector under some metric (K-Nearest Neighbor, KNN). Because exact KNN is computationally expensive, we usually focus on the Approximate Nearest Neighbor (ANN) problem instead.

Vector metrics

There are four common vector metrics: Euclidean distance, cosine similarity, inner product, and Hamming distance.

Different metrics suit different scenarios. Generally, Euclidean distance is used for image retrieval, cosine similarity for face recognition, and inner product for recommendation; Hamming distance is usually used for large-scale video retrieval because binary vectors are compact.
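
To make the four metrics concrete, here is a minimal sketch assuming NumPy (the function names are illustrative, not from any particular library):

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: smaller means more similar
    return np.linalg.norm(x - y)

def cosine_similarity(x, y):
    # Cosine of the angle between x and y: larger means more similar
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def inner_product(x, y):
    # Raw dot product: larger means more similar
    return np.dot(x, y)

def hamming(x_bits, y_bits):
    # Number of differing bits between two binary vectors (arrays of 0/1)
    return int(np.count_nonzero(x_bits != y_bits))

x, y = np.random.rand(128), np.random.rand(128)
print(euclidean(x, y), cosine_similarity(x, y), inner_product(x, y))
print(hamming(np.random.randint(0, 2, 64), np.random.randint(0, 2, 64)))
```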

Given a metric, we usually use recall (in this context often also called precision) to evaluate the quality of vector retrieval. For a given query vector Q, let N be the set of its true K nearest neighbors in the dataset and M be the set of K neighbors returned by retrieval; then the recall is |M ∩ N| / |N|, as illustrated below:

![](https://pic2.zhimg.com/80/v2-9d0b71d384e7410f02f965ea58d0f966_720w.jpg)

The closer the recall is to 100%, the better the index.

The problems vector retrieval solves

In essence, the conceptual framework of vector retrieval is no different from that of traditional retrieval. Keeping this in mind will make the index structures described below easier to appreciate.

1. Reduce the candidate vector set

Similar to traditional text retrieval, vector retrieval also needs index structures to avoid matching against the full dataset. Traditional text retrieval filters out irrelevant documents with an inverted index; vector retrieval relies on vector index structures to skip unrelated vectors. This article focuses on these vector index structures.

2. Reduce the cost of a single vector computation

Traditional text retrieval usually ranks with a funnel model: upper stages use lightweight computations, later stages compute more precisely, and the number of documents to score shrinks at each filtering step. For high-dimensional vectors, even a single distance computation is heavy, so vectors are usually quantized and distances computed approximately; only at the end is a very small candidate set re-ranked with the original vectors.

Below, we discuss vector retrieval with a focus on these two problems.

Vector retrieval index structures

Building efficient index structures for vectors is the number one problem facing vector retrieval. Before diving in, it is worth looking at a benchmark project that compares the recall and performance of index structures on multiple public vector datasets; it gives a quick, visual sense of how the various index structures behave.

Brute-force computation

Brute-force computation is the simplest but most computationally expensive method. When measuring recall, the result of brute-force computation serves as the ground truth. In scenarios such as face recognition, a 100% recall rate is often required, so brute force is still used there.
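
A minimal brute-force KNN sketch (assuming NumPy; names are illustrative):

```python
import numpy as np

def brute_force_knn(dataset, query, k):
    # Compute the Euclidean distance from the query to every vector,
    # then return the indices of the k closest ones.
    dists = np.linalg.norm(dataset - query, axis=1)
    return np.argsort(dists)[:k]

data = np.random.rand(10000, 128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
print(brute_force_knn(data, q, k=10))
```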

Tree-based methods

There are many tree-based methods; typical examples are KDTree, BallTree, and VPTree. By analogy with a traditional binary tree, building a tree structure ultimately comes down to deciding whether to expand left or right, and different vector tree indexes differ in the criterion used for that decision. KDTree (shown in the figure below) selects the dimension with the largest variance and splits at its median, i.e., it partitions the space with hyperplanes, while BallTree partitions the space with spheres. VPTree first selects a vantage point, computes the distance from every point to it, and splits at the median distance. At search time, these methods typically use the triangle inequality to prune unnecessary exploration.
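
As a usage sketch, assuming scikit-learn is available (the parameter values are illustrative):

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

data = np.random.rand(10000, 16)
query = np.random.rand(1, 16)

# KDTree splits the space with axis-aligned hyperplanes
kd = KDTree(data, leaf_size=40)
dist, ind = kd.query(query, k=10)

# BallTree splits the space with nested balls
ball = BallTree(data, leaf_size=40)
dist_b, ind_b = ball.query(query, k=10)
print(ind, ind_b)
```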

There are many other tree-based variants, but the idea is always the same: divide the vector space according to some criterion. No matter how the space is divided, the need for backtracking during search makes tree-based methods somewhat inferior in performance.

Hashing methods

Hashing is familiar to everyone, and vector retrieval can also be accelerated by hashing, specifically Locality Sensitive Hashing (LSH). Unlike traditional hashing, locality sensitive hashing relies on collisions to find neighbors.

A hash family is (d1, d2, p1, p2)-sensitive if:

1. If d(x, y) <= d1, then the probability that h(x) = h(y) is at least p1;

2. If d(x, y) >= d2, then the probability that h(x) = h(y) is at most p2.

In plain terms: if two points in a high-dimensional space are close to each other, we design a hash function such that their hash values collide with high probability; if the two points are far apart, their hash values collide with low probability. Different distance metrics correspond to different hash functions, and not every metric (the inner product, for example) admits a locality sensitive hash.
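
As an illustration, here is a minimal sketch of random-hyperplane LSH for the cosine metric (assuming NumPy; the class and bucket layout are illustrative, not taken from any specific library):

```python
import numpy as np

class RandomHyperplaneLSH:
    """Hash a vector to a short binary signature; vectors with a small
    angle between them collide on many bits with high probability."""

    def __init__(self, dim, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        # Each row is the normal of a random hyperplane through the origin
        self.planes = rng.standard_normal((n_bits, dim))

    def hash(self, x):
        # Bit i records which side of hyperplane i the vector falls on
        return tuple((self.planes @ x > 0).astype(np.uint8))

lsh = RandomHyperplaneLSH(dim=128, n_bits=16)
buckets = {}
data = np.random.rand(10000, 128)
for i, v in enumerate(data):
    buckets.setdefault(lsh.hash(v), []).append(i)

q = np.random.rand(128)
candidates = buckets.get(lsh.hash(q), [])  # only vectors in the same bucket are examined
```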

Inverted-index-based methods

A traditional inverted index is built by taking each word a document contains and adding the document to that word's posting list. How can an inverted index be built for vectors? The whole vector space is partitioned into K regions by clustering, each region represented by a center point C. Each vector is then compared against all center points and appended to the posting list of its nearest center, and the whole index structure is thus established.
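
A minimal sketch of this inverted-file idea, assuming NumPy and scikit-learn's KMeans (the cell counts and function names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(10000, 64).astype(np.float32)

# Build: cluster the space into nlist cells and put each vector
# into the posting list of its nearest center.
nlist = 100
km = KMeans(n_clusters=nlist, n_init=4, random_state=0).fit(data)
postings = {c: [] for c in range(nlist)}
for i, c in enumerate(km.labels_):
    postings[c].append(i)

# Search: only scan the nprobe closest cells instead of the full dataset.
def ivf_search(query, k=10, nprobe=5):
    d_to_centers = np.linalg.norm(km.cluster_centers_ - query, axis=1)
    cells = np.argsort(d_to_centers)[:nprobe]
    cand = [i for c in cells for i in postings[c]]
    dists = np.linalg.norm(data[cand] - query, axis=1)
    return [cand[j] for j in np.argsort(dists)[:k]]

print(ivf_search(np.random.rand(64).astype(np.float32)))
```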

Another inverted-index-based approach is Bag of Words (BoW), built on the same principle. For example, an image can yield hundreds of local features. All features are first clustered to form centers, which serve as the basis of the inverted index; each local feature of an image is then assigned to its nearest center to build the posting lists. At search time, results are filtered according to the number of hits.

Graph-based methods

The index structures described above can all be classified as space-partitioning methods: each vector can belong to only one partitioned region. Their biggest problem is that, to improve recall, they have to examine a large region of vectors, which leads to a surge in computation. Is there a better way? Graph-based methods can achieve this goal. The simplest idea behind them is that a neighbor's neighbor is likely also a neighbor, so nearest neighbor search is turned into graph traversal. Thanks to the graph's connectivity, the search can examine vectors in a targeted way rather than region by region, greatly reducing the number of vectors examined.

In recent years, graph methods have been a hot topic in vector retrieval research, and a number of graph indexes such as KGraph, NSG, HNSW, and NGT have emerged. In fact, the main difference between these methods lies in the construction process: different graph methods use different means to improve the quality of the graph, but the retrieval steps are basically the same: A. choose an entry point; B. traverse the graph; C. converge (a minimal search sketch follows the list below). Through continuous practice we have observed some characteristics for judging the quality of a graph index, which guide the directions in which we improve graph methods:

1. The neighbors of each point should be close to its true K nearest neighbors

2. Minimize the number of neighbors per point, i.e., reduce the out-degree

3. Ensure the connectivity of the graph as much as possible and increase the in-degree
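
The retrieval steps above can be sketched as a greedy best-first search. This is a minimal illustration assuming NumPy, a precomputed adjacency list, and Euclidean distance; it is not the exact procedure of any particular paper:

```python
import heapq
import numpy as np

def greedy_graph_search(vectors, neighbors, query, entry, k=10, ef=50):
    """vectors: (n, d) array; neighbors: dict node -> list of adjacent nodes.
    Starts from `entry`, repeatedly expands the closest unvisited node,
    and stops when no candidate can improve the current result set."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap of nodes to expand
    results = [(-dist(entry), entry)]     # max-heap (negated) of best nodes found
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                          # converged: nothing closer left to explore
        for nb in neighbors[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (dist(nb), nb))
                heapq.heappush(results, (-dist(nb), nb))
                if len(results) > ef:
                    heapq.heappop(results) # keep only the ef best
    return sorted((-d, i) for d, i in results)[:k]

# Toy usage: build a crude 8-NN graph by brute force, then search it.
vecs = np.random.rand(2000, 32)
nbrs = {i: list(np.argsort(np.linalg.norm(vecs - vecs[i], axis=1))[1:9])
        for i in range(len(vecs))}
print(greedy_graph_search(vecs, nbrs, np.random.rand(32), entry=0))
```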

Next, we take HNSW as an example to introduce the principle of graph methods. HNSW is chosen because it is easy to understand and it supports streaming index construction and updates; it is a perfect embodiment of a traditional data structure applied to vectors.

HNSW is essentially the skip list applied to graphs. As a simple and efficient index structure, the skip list is widely used in Redis, LevelDB, and other systems, as shown in the following figure:

The bottom layer holds the full data, and whether an element is promoted to the upper layers is decided by random coin flips; the higher the layer, the lower the probability of reaching it. Nodes promoted to the upper layers keep records there, which act as an express lane that lets a search move forward quickly.

Each layer of HNSW is a Navigable Small World graph (NSW) organized as a graph data structure. As in a skip list, coin flips decide up to which layer a node appears, and the node is also recorded in all lower-layer graphs, so an upper-layer graph can be seen as a coarse epitome of the layer below. During retrieval, the search proceeds from top to bottom, gradually approaching the desired region of the vector space. In addition, during graph construction HNSW prunes edges while ensuring the connectivity of the graph. Incidentally, across the many papers on graph methods, each describes its edge-pruning strategy from its own perspective; however fancy the descriptions, the essence of all these pruning methods is the same, only viewed from different angles.
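
For a concrete feel, here is a usage sketch assuming the open-source hnswlib package (the parameter values are illustrative, not recommendations):

```python
import numpy as np
import hnswlib

dim, num = 128, 100000
data = np.random.rand(num, dim).astype(np.float32)

index = hnswlib.Index(space='l2', dim=dim)             # Euclidean distance
index.init_index(max_elements=num, ef_construction=200, M=16)
index.add_items(data, np.arange(num))                   # supports incremental inserts

index.set_ef(64)                                        # search-time breadth/recall trade-off
labels, distances = index.knn_query(data[:5], k=10)
print(labels)
```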

Vector quantization

The previous sections have basically outlined the mainstream vector index techniques. After the index structure has pruned the set of vectors to examine, a single computation on a high-dimensional vector is still expensive. Is there a way to reduce that cost? Quantization techniques serve exactly this purpose: quantization maps a large value space to a smaller range of values. A simple example: the world has a population of more than 6 billion, and that 6-billion-odd count is one way to measure the people on Earth, but we can quantize them into two classes, men and women, so that a space of more than 6 billion values is quantized into a range of only two values.

Commonly used quantization methods include PQ (and its optimizations OPQ and LOPQ) and binary quantization.

The principle of PQ is shown in the figure. The P in PQ stands for product, so where does the product come from? In the figure below, a 128-dimensional vector space is divided into 4 segments, and each segment is quantized with 256 center points. The original 128-dimensional space can then be expressed by combinations of per-segment centers, i.e., 256 * 256 * 256 * 256 combinations, which is the so-called product.
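
A minimal PQ encode/decode sketch, assuming NumPy and scikit-learn's KMeans, with 4 segments of 32 dimensions and 256 centers per segment as in the figure (names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(20000, 128).astype(np.float32)
n_seg, seg_dim, n_centers = 4, 32, 256

# Train: one independent codebook (256 centers) per 32-dim segment.
codebooks = []
for s in range(n_seg):
    seg = data[:, s * seg_dim:(s + 1) * seg_dim]
    codebooks.append(KMeans(n_clusters=n_centers, n_init=1, random_state=0).fit(seg))

def pq_encode(x):
    # A 128-dim float vector becomes 4 bytes: the nearest center id per segment.
    return np.array([cb.predict(x[s * seg_dim:(s + 1) * seg_dim][None])[0]
                     for s, cb in enumerate(codebooks)], dtype=np.uint8)

def pq_decode(code):
    # Approximate reconstruction by concatenating the chosen centers.
    return np.concatenate([codebooks[s].cluster_centers_[c] for s, c in enumerate(code)])

code = pq_encode(data[0])
print(code, np.linalg.norm(data[0] - pq_decode(code)))
```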

As for binary vectors, modern CPUs provide dedicated instructions (population count), so computing the Hamming distance is very fast. In massive-scale vector retrieval, binary vectors are often generated directly; common methods for producing binary vectors include ITQ (Iterative Quantization) and DeepHash.

Other technologies

Clustering

Vector clustering mainly refers to the family of k-means methods, such as classic k-means, k-means++, k-means#, one-pass k-means, Yinyang k-means, etc. In practice, we have also used hierarchical approaches to obtain massive numbers of center points.
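
A sketch of obtaining a large number of centers hierarchically, assuming scikit-learn's MiniBatchKMeans (the two-level scheme and cluster counts are illustrative assumptions, not the exact scheme used in practice):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

data = np.random.rand(200000, 64).astype(np.float32)

# Level 1: a coarse clustering into 100 groups.
coarse = MiniBatchKMeans(n_clusters=100, batch_size=1024, random_state=0).fit(data)

# Level 2: cluster each coarse group again; the union of the fine centers
# yields roughly 100 * 100 = 10000 centers without one huge k-means run.
centers = []
for c in range(100):
    group = data[coarse.labels_ == c]
    if len(group) >= 100:
        fine = MiniBatchKMeans(n_clusters=100, batch_size=1024, random_state=0).fit(group)
        centers.append(fine.cluster_centers_)
    else:
        centers.append(group)  # tiny groups: keep the points themselves as centers
print(np.vstack(centers).shape)
```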

Inner product transform

As we saw in the hashing section, there is no locality sensitive hash for the inner product, and the inner product does not satisfy the triangle inequality either, so many index structures cannot use it directly as their metric. Most systems therefore normalize the vectors before retrieval, but normalization changes the distribution of the vector space. Through a mathematical observation, researchers found that a certain transformation makes the inner product and the Euclidean distance negatively correlated, as shown in the figure below:

After the vectors are transformed, they can be retrieved directly under the Euclidean distance metric.
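
One common transformation of this kind appends a single extra dimension so that Euclidean distance becomes a decreasing function of the inner product (a sketch of the augmentation trick from the maximum-inner-product-search literature; whether it matches the exact transform in the figure is an assumption):

```python
import numpy as np

data = np.random.rand(10000, 64).astype(np.float32)
query = np.random.rand(64).astype(np.float32)

# Data side: append sqrt(M^2 - ||x||^2), where M is the largest data norm.
M = np.linalg.norm(data, axis=1).max()
extra = np.sqrt(M ** 2 - np.linalg.norm(data, axis=1) ** 2)
data_aug = np.hstack([data, extra[:, None]])

# Query side: append 0. Then ||q' - x'||^2 = ||q||^2 + M^2 - 2 <q, x>,
# so the smallest Euclidean distance corresponds to the largest inner product.
query_aug = np.append(query, 0.0)

by_ip = np.argmax(data @ query)
by_l2 = np.argmin(np.linalg.norm(data_aug - query_aug, axis=1))
assert by_ip == by_l2
```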

Development trends in vector retrieval

At the beginning of this article, we mentioned that the conceptual framework of vector retrieval is no different from that of traditional retrieval. By now, the dividend from applying traditional methods to vector retrieval has been largely exhausted. A development trend of the past two years is the combination of different methods: by absorbing the advantages of each, we obtain better performance and can carry larger vector datasets.

Quantized graphs

We can see clearly from the benchmark that the methods in the leading positions are basically all graph methods. At the same time, we know that quantization can reduce the cost of a single vector computation, so a natural idea is to merge the two: use quantized, approximate computation to accelerate graph traversal. The advantage of this kind of method is obvious for high-dimensional vectors, where the speedup can reach more than three times.

Graph + inverted index

Although graph methods perform excellently, their disadvantages are also obvious: to guarantee retrieval quality, storing the graph consumes a lot of space, which is especially unbearable for binary vectors, where the vectors themselves are very small. In addition, the neighbors in a graph are laid out somewhat randomly, which is unfavorable to hardware that processes vectors. However, observing brute-force computation offers some inspiration. From the space-partitioning point of view, brute force can be seen in two ways: as clustering all vectors under a single center, or as clustering into N centers (N being the size of the vector dataset), each center being a vector itself. Both achieve 100% recall, but one puts too many vectors under a single center while the other has too many centers, and either way the computation becomes unacceptable. Can we find a balance point, i.e., the right number of centers, to strike a good trade-off between recall and performance? The graph + inverted index pattern emerged as the times required: a massive number of center points is used to partition the vector space, and those center points are organized with a graph method.

Huawei Cloud vector retrieval practice

Applications of vector retrieval are flourishing as never before. Huawei Cloud Search Service (CSS) has opened its vector retrieval capability to the public, providing solutions for two typical scenarios: high performance and massive data.

Our implementation is mainly based on graph methods, with extensive enhancements in edge pruning, connectivity, instruction-level acceleration, multithreading, and other aspects. In comparison tests, our recall is noticeably higher than Faiss, and the gap is more obvious on high-dimensional data. At the same recall, our performance is more than 2.3 times that of Faiss on public datasets such as SIFT and GIST. The comparison tests were done on the same machine, using the same algorithm.

Conclusion:

Aiming at the problems vector retrieval needs to solve, this article has surveyed the mainstream vector retrieval techniques and analyzed a current trend in vector retrieval. Many related methods are not covered here, but the underlying ideas are the same: these methods are extensions of traditional methods, optimizations built on the triangle inequality, applications of hierarchical schemes, and so on. We hope this article helps to sort out the landscape of vector retrieval.

References:

1. J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.

2. P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 311-321, 1993.

3. Liu, T.; Moore, A. & Gray, A. (2006). New Algorithms for Efficient High-Dimensional Nonparametric Classification. Journal of Machine Learning Research. 7:1135-1158.

4. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604-613, Dallas, TX, 1998.

5. W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577–586. ACM, 2011.

6. Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. CoRR, abs/1603.09320, 2016.

7. arxiv.org/abs/1707.00…

8. yongyuan.name/blog/vector…

9. A. Shrivastava and P. Li. An improved scheme for asymmetric LSH. Technical report, arXiv:1410.5410, 2014.

10. H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117-128, 2011.

11. arxiv.org/abs/1810.07…

Note: some of the index structure figures in this article are from yongyuan.name/blog/vector…
