In March, Facebook announced its Facebook AI Similarity Search (Faiss) project, a library for quickly finding similar items in multimedia documents, a scenario that challenges traditional query-based search engines. Facebook AI Research (FAIR) set a new record by building a billion-scale nearest-neighbor search implementation approximately 8.5 times faster than the fastest GPU k-selection algorithm previously reported in the literature, and used it to construct the first k-nearest-neighbor graph over a billion high-dimensional vectors.

About Similarity Search

Traditional databases consist of structured tables containing symbolic information. For example, an image collection can be represented as a table in which each row corresponds to an indexed image and holds information such as an image identifier and a descriptor; each row can also be linked to entities in other tables, for instance associating a picture of a user with a table of user names.

AI tools trained with deep learning, such as text embeddings (word2vec) or convolutional neural network (CNN) descriptors, generate high-dimensional vectors. This representation is far more powerful and flexible than a fixed symbolic one, as we’ll explain below. However, traditional databases queried with SQL are not suited to these new representations. First, the influx of massive multimedia content produces billions of vectors. Second, and more importantly, finding similar entities means finding similar high-dimensional vectors, which is inefficient and difficult with a standard query language.

How are vector representations used?

Suppose you have a picture of a building — say, the town hall of a medium-sized city whose name you can’t remember — and you want to find all the pictures of that building in the image set. Because you don’t remember the name of the city, the key/value queries commonly used in traditional SQL won’t help at this point.

This is where similarity search comes in. Image vectorization aims to produce similar vectors for similar images, where similar vectors are defined as those closest in Euclidean distance.
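
As a concrete (if toy) illustration, here is a minimal NumPy sketch of nearest-neighbor search by Euclidean distance; the data is randomly generated purely for the example:

    import numpy as np

    # toy stand-ins: 10,000 database vectors and one query, all 64-dimensional
    rng = np.random.default_rng(42)
    database = rng.random((10000, 64), dtype=np.float32)
    query = rng.random(64, dtype=np.float32)

    # squared Euclidean distance from the query to every database vector
    distances = np.sum((database - query) ** 2, axis=1)
    nearest = int(np.argmin(distances))   # index of the most similar vector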

Another application of vectorization is classification. Suppose you need a classifier to determine which images in an album contain chrysanthemums. The training process is well known: feed the algorithm images of chrysanthemums and of non-chrysanthemums (cars, sheep, roses, cornflowers, and so on). If the classifier is linear, it outputs a classification vector whose dot product with an image vector reflects how likely the image is to contain chrysanthemums. The classifier can then compute the dot product with every picture in the album and return the picture with the largest dot product. This kind of query is called a “maximum inner product” search.
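
A maximum inner product search can be sketched the same way; the classifier vector here is hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    database = rng.random((10000, 64), dtype=np.float32)   # image vectors
    classifier = rng.random(64, dtype=np.float32)          # hypothetical linear classifier

    scores = database @ classifier    # dot product with every image vector
    best = int(np.argmax(scores))     # image most likely to contain a chrysanthemum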

Therefore, for similarity search and classification, we need to do the following processing:

  • Given a query vector, return the list of database objects closest to it in Euclidean distance.
  • Given a query vector, return the list of database objects with the largest dot product with that vector.

An added challenge is doing these calculations on a very large scale, say, billions of vectors.

The software package

None of the existing software tools is sufficient for the database retrieval operations described above. Traditional SQL database systems are a poor fit, because they are optimized for hash-based or 1-dimensional interval retrieval; similarity search features in packages such as OpenCV are severely limited in scalability; other similarity search libraries are mainly suited to small data sets (for example, around 1 million vectors); and the remaining packages are mostly academic output written for a publication, designed to demonstrate particular settings.

The Faiss library addresses the limitations mentioned above and has the following advantages:

  • Provides a variety of similarity search methods, supporting a variety of different uses and feature sets.
  • Especially optimized for memory usage and speed.
  • A state-of-the-art GPU implementation for the most relevant indexing methods.

Similarity search evaluation

Once vectors have been extracted by the learning system (from images, videos, text files, or elsewhere), they are ready to be fed into a similarity search library.

As a reference for comparison, we have a brute-force algorithm that computes all the similarities, exactly and exhaustively, and then returns the list of the most similar elements. This provides a gold-standard reference list of results. It is important to note that an efficient implementation of the brute-force algorithm is not trivial, and its performance generally feeds into that of the other components.

If some precision is sacrificed, for example by allowing a slight deviation from the reference results, similarity search can be made orders of magnitude faster. For example, if the first and second results for an image query are swapped, it may not matter much, since both are probably correct results for the given query. Speeding up search also involves preprocessing the data set, an operation commonly referred to as indexing.

This brings us to the following three metrics:

  • Speed. How long does it take to find the 10 (or some other number of) vectors most similar to the query? It should take less time than the brute-force algorithm; otherwise, what is the point of indexing?
  • Memory consumption. How much RAM does the method consume? More or less than the original vectors? Faiss supports searching only from RAM, because disk-based databases are orders of magnitude slower, even on SSDs.
  • Accuracy. How well does the returned result list match the brute-force search results? Accuracy can be evaluated by counting the number of queries whose true nearest neighbor is returned in the first position (commonly called 1-recall@1), or by measuring the average fraction of the 10 true nearest neighbors contained in the first 10 returned results (10-intersection). Both metrics are sketched in code below.
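
A minimal NumPy sketch of both metrics (the function names are ours, not part of Faiss); I is the result matrix returned by a search, and the ground-truth arrays come from the brute-force reference:

    import numpy as np

    def recall_at_1(I, gt):
        # fraction of queries whose true nearest neighbor is ranked first;
        # I: (n_queries, k) result ids, gt: (n_queries,) true nearest-neighbor ids
        return float(np.mean(I[:, 0] == gt))

    def intersection_at_10(I, gt10):
        # average overlap between the first 10 results and the 10 true neighbors;
        # I: (n_queries, k >= 10) result ids, gt10: (n_queries, 10) true neighbor ids
        overlaps = [len(set(I[q, :10]) & set(gt10[q])) / 10.0 for q in range(len(I))]
        return float(np.mean(overlaps))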

There is usually a trade-off between speed and precision at a fixed memory budget. Faiss focuses on methods that compress the original vectors, because this is the only practical way to scale to billion-vector data sets: when you have to index a billion vectors at 32 bytes per vector, memory consumption is already substantial.

Many indexing libraries handle small data sets of around a million vectors. nmslib, for example, contains some very efficient algorithms for data of this size; it is faster than Faiss but consumes substantially more storage.

Evaluation on one billion vectors

Since the engineering community has no accepted benchmark for data sets of this size, we evaluate against published research results.

The accuracy evaluation is based on Deep1B, a data set of one billion images. Each image was processed by a CNN, and one of the CNN’s activation maps is used as the image descriptor. Comparing the Euclidean distances between these vectors quantifies how similar the images are.

Deep1B also comes with a smaller set of query images and the ground-truth similarity search results produced by the brute-force algorithm. Therefore, after running a search algorithm, we can evaluate its 1-recall@1.

Choosing the index

For evaluation purposes, we limited memory to 30 GB. This memory constraint guides the selection of the indexing method and its parameters. Indexing methods in Faiss are represented as strings; the one used here is OPQ20_80,IMI2x14,PQ20.

The string names a preprocessing step applied to the vectors (OPQ20_80), a selection mechanism (IMI2x14) indicating how the database should be partitioned, and an encoding component (PQ20) indicating that vectors are encoded with a product quantizer (PQ) producing 20-byte codes. In terms of memory usage, including all overhead, the total therefore stays under 30 GB.
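
In the Python API, such a string is passed to the index factory. A minimal sketch, assuming the 96-dimensional Deep1B descriptors:

    import faiss

    d = 96                                             # dimensionality of the Deep1B vectors
    index = faiss.index_factory(d, "OPQ20_80,IMI2x14,PQ20")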

This sounds technical, so the Faiss documentation provides guidance on how to choose the best index for your needs.

With the index type selected, the indexing process can begin. The algorithm implementation in Faiss takes the billion vectors and places them in the index. The index can be stored on disk or used immediately, and searches and additions or removals can be interleaved.
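
Continuing the sketch above, the train/add/save cycle looks roughly like this; xt (a training sample) and xb (the database vectors) are assumed to be float32 NumPy arrays, and the file name is arbitrary:

    index.train(xt)                      # learn the OPQ transform, IMI cells, and PQ codebooks
    index.add(xb)                        # add the database vectors (can be called in batches)
    faiss.write_index(index, "deep1b.index")    # persist the index to disk

    index = faiss.read_index("deep1b.index")    # reload it later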

Querying the index

When the index is ready, a set of search-time parameters is set to tune the algorithm. For ease of evaluation, search here is single-threaded. Because memory consumption is limited and fixed, the optimization is a trade-off between accuracy and search time. This means, for example, setting the parameters so that 40% 1-recall@1 is reached in the shortest possible search time.
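
For an IMI-based index like the one above, the main search-time parameter is how many cells to visit. A sketch using Faiss’s ParameterSpace helper, with xq as the query set; the value 4096 is purely illustrative:

    ps = faiss.ParameterSpace()
    ps.set_index_parameter(index, "nprobe", 4096)   # number of IMI cells to visit per query
    D, I = index.search(xq, 10)                     # 10 nearest neighbors for each query in xq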

Fortunately, Faiss comes with an auto-tuning mechanism that scans the parameter space and collects the parameters that provide the best operating points; that is, the shortest search time attainable for a given accuracy and, conversely, the best accuracy attainable for a given search time. The operating points on Deep1B are visualized in the figure below:

From this figure, we can see that reaching 40% 1-recall@1 requires a query time under 2 ms, while capping the query time at 0.5 ms only reaches 30% 1-recall@1. A 2 ms query time corresponds to a throughput of 500 QPS on a single CPU core.

That is roughly on par with the latest research: Babenko and Lempitsky’s CVPR 2016 paper “Efficient Indexing of Billion-Scale Datasets of Deep Descriptors,” which introduced the Deep1B data set, needed 20 ms to reach 45% 1-recall@1.

GPU computing on billion-scale data sets

We also invested heavily in the GPU implementation, whose native multi-GPU support yields astonishing single-machine performance. The GPU implementation can be used as a drop-in replacement for its CPU counterpart, and the GPU’s performance can be exploited without any knowledge of the CUDA API. Faiss supports all Nvidia GPUs released since 2012 (Kepler, compute capability 3.5+).

We use the roofline model as a guide; it says you should aim to saturate either the memory bandwidth or the floating-point units. Faiss’s GPU implementation is five to 10 times faster on a single GPU than its CPU counterpart, and on newer Pascal-architecture hardware such as Nvidia’s P100 it is more than 20 times faster.
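
Moving an index to the GPU takes a couple of lines and no CUDA at all; a minimal sketch, continuing the examples above:

    res = faiss.StandardGpuResources()                  # allocate resources on one GPU
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index)   # clone the CPU index onto GPU 0
    D, I = gpu_index.search(xq, 10)                     # same search API as on the CPU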

Some key performance numbers:

  • With approximate indexing, a brute-force k-nearest-neighbor graph (k = 10) on the 95 million images of the YFCC100M data set, using 128D CNN descriptors, can be constructed in 35 minutes on four Maxwell Titan X GPUs, including index construction time.
  • k-nearest-neighbor graphs on billion-scale vectors are now within reach. On the Deep1B data set, a brute-force k-NN graph (k = 10) reaching a 10-intersection of 0.65 can be built in under 12 hours on four Maxwell Titan X GPUs, or one reaching 0.8 in under 12 hours on eight Pascal P100-PCIe GPUs. The Titan X configuration can produce a lower-quality graph in under 5 hours. A code sketch of brute-force k-NN graph construction follows this list.
  • Other components also show impressive performance. For example, building the Deep1B index above required clustering 67 million 120-dimensional vectors into 262,144 clusters with k-means; 25 E-M iterations took 139 minutes on four Titan X GPUs (12.6 Tflop/s), or 43.8 minutes on eight P100 GPUs (40 Tflop/s). Note that the training data for clustering does not need to fit in GPU memory, as it can be streamed to the GPU as needed with no extra performance cost.
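
A brute-force k-NN graph construction can be sketched as follows, scaled down to toy size (the runs above use tens of millions to a billion vectors):

    import faiss
    import numpy as np

    d, n, k = 128, 100000, 10
    xb = np.random.rand(n, d).astype('float32')   # toy stand-in for the CNN descriptors

    res = faiss.StandardGpuResources()
    index = faiss.GpuIndexFlatL2(res, d)          # brute-force L2 index on one GPU
    index.add(xb)

    # each vector's nearest neighbor is itself, so ask for k+1 and drop column 0
    D, I = index.search(xb, k + 1)
    knn_graph = I[:, 1:]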

The underlying implementation

Facebook’s AI research team began developing Faiss in 2015, building on years of research and engineering experience. For the Faiss library, we chose to focus on a few fundamental technical optimizations, particularly on the CPU side, where we make heavy use of the following:

  • Multithreading, to exploit multiple cores and perform parallel searches on multiple GPUs.
  • The BLAS library, for efficient, exact distance computation via matrix-matrix multiplication. A brute-force implementation that does not use BLAS can hardly be optimal. BLAS/LAPACK is the only software dependency Faiss requires.
  • Machine SIMD vectorization and popcount, to speed up distance computations for individual vectors.

On the GPU

For the GPU implementation of similarity search described above, k-selection (finding the k smallest or largest elements) poses a performance problem, because traditional CPU algorithms (such as heap-based selection) are not GPU-friendly. For Faiss GPU, we designed the fastest lightweight k-selection algorithm known in the literature (for k <= 1024). All intermediate state is stored in registers for high-speed reads and writes, and k-selection is performed in a single pass over the input data, running at up to 55% of theoretical peak performance as given by peak GPU memory bandwidth. Because its state lives entirely in the register file, it fuses easily with other kernels, enabling extremely fast exact and approximate search.

A lot of effort also went into efficient strategies and kernel implementations for approximate search. Data sharding and data replication provide multi-GPU support, so capacity is not limited by the memory of a single GPU. Half-precision floating point (float16) is supported as well, with full float16 computation on GPUs that support it and intermediate float16 storage on earlier architectures. We found that encoding vectors as float16 brings a speedup with almost no loss of accuracy.
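
Float16 storage is selected through the cloner options when moving an index to the GPU; a sketch, assuming the standard Python wrappers and the res and index objects from the earlier examples:

    co = faiss.GpuClonerOptions()
    co.useFloat16 = True                                    # store data on the GPU in float16
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index, co)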

In short, sustained attention to these key factors matters greatly in practice, and Faiss has put real work into the engineering details.

Start using Faiss

Faiss is implemented in C++ and comes with Python bindings. To get started, download the source code from GitHub, compile it, and import the Faiss module into Python. Faiss is also fully integrated with NumPy, and all its functions accept NumPy arrays (of type float32).

Get Faiss: github.com/facebookres…

Index object

Faiss (in both C++ and Python) is built around instances of the Index object. Each Index subclass implements an indexing structure to which vectors can be added and which can be searched. For example, IndexFlatL2 is a brute-force index that searches with L2 distance.
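
The snippets below assume some toy data, which can be generated like this (mirroring the Faiss tutorial; the sizes are arbitrary):

    import numpy as np

    d = 64                                      # vector dimension
    nb = 100000                                 # database size
    nq = 10000                                  # number of queries
    np.random.seed(1234)
    xb = np.random.random((nb, d)).astype('float32')
    xq = np.random.random((nq, d)).astype('float32')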

    import faiss                   # import the Faiss library

    index = faiss.IndexFlatL2(d)   # build a brute-force L2 index of dimension d
    # here we assume xb contains an n-by-d numpy matrix of type float32
    index.add(xb)                  # add the database vectors to the index
    print(index.ntotal)

This prints the number of indexed vectors. Adding vectors to an IndexFlat simply copies them into the index’s internal storage, since no further processing is performed on them.

Perform a search:

    # xq is an n2-by-d matrix of query vectors
    k = 4                          # we want 4 similar vectors
    D, I = index.search(xq, k)     # actual search
    print(I)

I is an integer matrix. The output looks like this:

    [[  0 393 363  78]
     [  1 555 277 364]
     [  2 304 101  13]]

For the first vector of xq, the most similar vector in xb has index 0 (indices start at 0), the second most similar is #393, and the third is #363. For the second vector of xq, the list of similar vectors is #1, #555, and so on. In this case, the first three vectors of xq appear to be identical to the first three vectors of xb.

The matrix D holds the squared distances. It has the same shape as I and contains, for each query, the squared Euclidean distance to each returned vector.

Faiss implements more than a dozen index types, some composed of other indexes. The optional GPU versions expose exactly the same interface, and there are channels for converting between CPU and GPU indexes. Because the Python interface is generated from the C++ headers and mirrors the C++ indexes, Python prototyping code can easily be translated into integrated C++ code.

Read the original English post: code.facebook.com/posts/13737…


Thanks to CAI Fangfang for proofreading this article.
