1. Introduction

Many metrics can be used to measure the similarity of vectors, such as cosine distance, Hamming distance, and Euclidean distance.
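As a rough illustration of the three metrics named above, here is a minimal pure-Python sketch. The function names are chosen for this example only, and real systems would normally use a vectorized library instead.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))
```

Cosine distance ignores vector length, Euclidean distance does not, and Hamming distance is mostly used for binary codes such as hashes.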

Vector similarity search has many applications across images, video, text, and audio, such as image recognition, speech recognition, and spam filtering.

An approach based on similarity retrieval differs from one based on a machine learning model. For example, face recognition based on a supervised learning model has low interpretability, while face recognition based on similarity search is more interpretable.

But with a lot of data, such as tens of millions of images, similarity search becomes harder. Exhaustive search is feasible but very time-consuming. This article surveys the available schemes for that scenario at a high level.

2. Common schemes

The main general-purpose methods for vector similarity retrieval are as follows:

  1. Exhaustive search

This scheme has O(N) time complexity per query and is only practical for small amounts of data.
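A minimal sketch of exhaustive search, assuming Euclidean distance and an in-memory list of vectors (the function names and sample data are made up for illustration). Every query scans all N vectors, which is where the O(N) cost comes from.

```python
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exhaustive_search(query, vectors, k=1):
    """Return the k (distance, index) pairs closest to query, nearest first."""
    # One distance computation per stored vector: O(N) work for each query.
    scored = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
top2 = exhaustive_search([1.0, 1.0], vectors, k=2)
```

Here `top2` holds the exact match at index 1 followed by the nearby vector at index 3.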

  1. MySQL based search

This is achieved by designing the database table structure carefully. For example, if each vector is split into N segments, then by the pigeonhole principle any two vectors that differ in at most M positions (M < N) must have at least N-M segments that are exactly the same. Candidates can therefore be retrieved by combining per-segment equality conditions with OR in SQL.
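The segment idea above can be sketched as follows. This is only an illustration under several assumptions: the vectors are 32-bit binary fingerprints, N = 4 segments of 8 bits each, M = 3, the standard-library `sqlite3` module stands in for MySQL, and the table and column names are hypothetical. With M = 3 and N = 4, at least one segment must match, so an OR over the four segment columns returns every true match plus some false candidates, which are then filtered by an exact Hamming distance check.

```python
import sqlite3

SEGMENTS = 4       # N: number of segments per fingerprint
SEGMENT_BITS = 8   # 4 segments x 8 bits = 32-bit fingerprints
MAX_DISTANCE = 3   # M: largest accepted Hamming distance (M < N)

def split(fp):
    """Split a 32-bit integer into its segments, highest bits first."""
    mask = (1 << SEGMENT_BITS) - 1
    return [(fp >> (SEGMENT_BITS * i)) & mask for i in reversed(range(SEGMENTS))]

def hamming(a, b):
    return bin(a ^ b).count("1")

def build_table(fingerprints):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fingerprints "
        "(fp INTEGER, seg0 INTEGER, seg1 INTEGER, seg2 INTEGER, seg3 INTEGER)"
    )
    conn.executemany(
        "INSERT INTO fingerprints VALUES (?, ?, ?, ?, ?)",
        [(fp, *split(fp)) for fp in fingerprints],
    )
    return conn

def search(conn, query):
    """Fetch candidates sharing at least one segment, then verify exactly."""
    rows = conn.execute(
        "SELECT fp FROM fingerprints "
        "WHERE seg0 = ? OR seg1 = ? OR seg2 = ? OR seg3 = ?",
        split(query),
    ).fetchall()
    return sorted(fp for (fp,) in rows if hamming(fp, query) <= MAX_DISTANCE)

conn = build_table([0x12345678, 0x12345679, 0x1234FFFF, 0xFFFFFFFF])
matches = search(conn, 0x12345678)
```

In a real deployment each segment column would presumably carry an index, so the OR query touches only a small part of the table.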

  1. KD-Tree

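This scheme builds a tree, very similar to a binary search tree (BST), that splits the data along one dimension at each level. Lookups work like a binary search and take O(log N) time on average.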
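A compact KD-tree sketch, assuming points are tuples of equal length and Euclidean distance is the metric. The class and function names are illustrative. The O(log N) lookup is an average-case figure for low-dimensional data; the search below may still visit both subtrees when the splitting plane is closer than the best match found so far.

```python
import math

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build(points, depth=0):
    """Recursively split on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = euclidean(node.point, query)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
```

Calling `nearest(tree, (9, 2))` returns the distance and the closest stored point.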

  1. ES

Elasticsearch (ES) has supported vector search since version 7.x.
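As a hedged sketch of what such a query can look like, the function below builds a `script_score` search body that ranks documents by cosine similarity. It assumes a later 7.x release and a field mapped as `dense_vector`; the field name `my_vector` and the function name are hypothetical, and the script syntax differs between versions, so check the documentation for the release in use.

```python
def cosine_query(query_vector, field="my_vector", size=10):
    """Build a search body that scores every document by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative.
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }

body = cosine_query([0.1, 0.2, 0.3])
```

The resulting dictionary would be sent as the body of a search request. Note that this style of query scores every matching document, so it behaves like exhaustive search inside ES.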

  1. Vector search engine

Facebook and other companies have developed similarity search engines specifically for vectors, and some support GPU acceleration and distributed deployment.

3. Vector retrieval algorithm

In August 2019, Google introduced a new algorithm called ScaNN (Scalable Nearest Neighbors) based on their testing data (github.com/google-rese…

4. Vector retrieval engine

A vector similarity search engine integrates a variety of algorithms and indexes internally and can select a search strategy according to the type of data and the search requirements. This gives better search efficiency and makes it the better choice. The common vector search engines are currently as follows:

  1. faiss
  • Developer: Facebook
  • Open source: Yes
  • Link: github.com/facebookres…
  • Advantages:
    • Relatively mature, with plenty of available material and a high GitHub star count
    • Supports CPU and GPU acceleration
  1. milvus
  • Developer: Zilliz (a Chinese software company based in Shanghai)
  • Open source: Yes
  • Link: github.com/milvus-io/m…
  • Advantages:
    • Plenty of available material; it has joined the Linux Foundation as a project; the technical community is well maintained and the project blog is updated regularly
    • Supports GPU acceleration
    • Used by Alibaba, Xiaomi, and many other large Chinese companies
  1. SPTAG
  • Developer: Microsoft
  • Open source: Yes
  • Link: github.com/microsoft/S…
  • Advantages: unclear; whether SPTAG has any advantages compared to other engines is an open question
  1. annoy
  • A library that does not support GPU acceleration or other more complex use cases
  1. Zsearch
  • Developer: Ant Financial
  • Open source: No
  • Link: segmentfault.com/a/119000002…
  1. vearch
  • Developer: JD.com (Jingdong)
  • Open source: Yes
  • Link: www.infoq.cn/article/gxY…
  • Advantages: Built on faiss, vearch provides a flexible and easy-to-use API similar to that of ES
  1. ESKNN
  • Developer: Amazon
  • Open source: Yes
  • Link: github.com/opendistro-…
  • Advantages: Amazon built it as a direct modification of ES; some users report that it consumes more memory

5. Performance data

Performance data for some of the vector search engines mentioned above, collected from the linked sources, is summarized in the following table:

| Engine | Performance | Data Size | Vector Size | Link |
|---|---|---|---|---|
| ES | 0.6 s | 1,000,000 | 128 | github.com/jobergum/de… |
| ES-aliyun | 0.09 s | 20,000,000 | 128 | developer.aliyun.com/article/738… |
| milvus | 27 ms | 1,000,000,000 | 128 | github.com/milvus-io/m… |
| SPTAG | not good | N/A | N/A | github.com/microsoft/S…, github.com/microsoft/S… |

By comparison, Milvus performs very well: 27 ms on a dataset of 1 billion vectors, far larger than the datasets used for the other engines.

6. Introduction to Milvus

  1. Installation (easy to set up with Docker)
  • milvus.io/cn/docs/ins…
  1. Python SDK and its usage (very efficient in practice)
  • github.com/milvus-io/p…
  1. Graphical interface (tested: easy to install, configure, and use)
  • milvus.io/cn/gui/

Originally published at: blog.csdn.net/ybdesire/ar…