Introduction

ElasticSearch (ES for short) is a very popular distributed full-text search system, often used for data analysis, search, multidimensional filtering, and similar scenarios. Ant Financial has been providing an ElasticSearch service to internal business parties since 2017. We have accumulated a lot of experience in Ant Financial's financial-grade scenarios, and this time we would like to share our exploration of vector search.

ElasticSearch pain points

ElasticSearch is widely used within Ant Financial for log analysis, multidimensional analysis, search, and other scenarios. As our ElasticSearch clusters grew in number and user scenarios became richer, we faced more and more pain points:

  • How to manage the cluster;
  • How to facilitate user access and user management;
  • How to support different personalized needs of users;
  • …

To address these pain points, we developed the ZSearch generic search platform:

  • Built on a Kubernetes (K8s) base: ZSearch components are created quickly, operations and maintenance are fast, and failed machines are replaced automatically;
  • Cross-datacenter replication, giving important business parties high availability;
  • A plug-in platform with hot loading of user-defined plug-ins;
  • SmartSearch, which simplifies user search and works out of the box;
  • A Router that works with the ES internal multi-tenant plug-in to improve resource utilization.

Vector retrieval requirements

ZSearch, our universal search platform based on ElasticSearch, has kept maturing, with more and more users and richer scenarios.

With the rapid development of the business, demands on search keep growing, for example: searching by picture, by voice, or by similar vectors.

To address this need, should we add a vector search engine or extend ElasticSearch’s ability to support vector search?

We chose the latter because it makes it easier to take advantage of ElasticSearch’s good plugin specification, rich query functions, and distributed scalability.

Next, I will introduce the basic concepts of vector retrieval and our practice with it.

Basic concepts of vector retrieval

A vector is just a one-dimensional array. The problem we need to solve is to use one of the following distance measures to find the K vectors most similar to a query vector (a small Java sketch follows the list below).

  • Euclidean distance:
    • Measures the straight-line distance between two points; the smaller the value, the more similar the vectors are.
  • Cosine similarity:
    • The larger the cosine value, the more similar the vectors are;
  • Hamming distance:
    • Generally applies to binarized vectors, where each component is either 0 or 1.
    • The Hamming distance is the number of positions where the XOR of the two vectors is 1; the smaller the value, the more similar the vectors are. It is generally used for image recognition.
  • Jaccard similarity coefficient:
    • The vectors are treated as sets, so the elements need not be numbers and can be other tokens, such as words. The larger the coefficient, the more similar the vectors are. It is generally used for similar-sentence recognition.
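To make these measures concrete, below is a minimal, dependency-free Java sketch of all four; the class and method names are ours for illustration and are not part of any ES plug-in.

    import java.util.HashSet;
    import java.util.Set;

    final class Distances {
        // Euclidean: straight-line distance between two points; smaller = more similar.
        static double euclidean(float[] a, float[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
            return Math.sqrt(sum);
        }

        // Cosine: cosine of the angle between the vectors; larger = more similar.
        static double cosine(float[] a, float[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // Hamming: popcount of the XOR of two bit vectors; smaller = more similar.
        static int hamming(long[] a, long[] b) {
            int d = 0;
            for (int i = 0; i < a.length; i++) d += Long.bitCount(a[i] ^ b[i]);
            return d;
        }

        // Jaccard: |intersection| / |union| of two token sets; larger = more similar.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a); inter.retainAll(b);
            Set<String> union = new HashSet<>(a); union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }
    }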

Because vectors in retrieval scenarios have high dimensions, such as 256 or 512, the distance calculations are expensive; the algorithms introduced next make similarity recall of the topN feasible.

Vector retrieval algorithm

KNN algorithm

A KNN algorithm recalls the exact topK vectors. There are two main approaches here: KD-tree and Brute Force. We first analyzed the KD-tree algorithm and found that KD-trees are not suitable for high-dimensional vector recall, so we implemented a Brute Force (BF) plug-in for ES and used some Java tricks to speed up the calculation.

KDTree algorithm

To put it simply, a KD-tree splits the data space with axis-aligned planes and builds a binary tree to represent the splits; during retrieval, pruning reduces the number of points that need to be compared (a code sketch follows the walkthrough below).

Build a tree

Take the set of two-dimensional points (x, y) = (2,3), (5,4), (9,6), (4,7), (8,1), (7,2) as an example:

(Figure: KD-tree construction; source: blog.csdn.net/richard9006…)

  • Sort by x, determine the median value 7, and split the remaining points into the two sides;
  • The second layer is sorted by y, the third layer by x again;
  • During construction, each node also maintains the maximum and minimum x and y values of its subtree.

Find the nearest point

(Figure: KD-tree nearest-neighbor search; source: blog.csdn.net/richard9006…)

Searching for the nearest neighbor of (3,5):

  • The distance from the root node (7,2) is 5;
  • Moving to the right child (9,6), we see that the minimum x of the entire right subtree is 8, so the distance from any node in the right subtree to the query point is at least 8-3=5; the whole right subtree can therefore be pruned.
  • Similarly, while traversing the left subtree, (5,4) at distance √5 replaces (7,2) as the current best;
  • After also checking (2,3) and (4,7), the nearest point (5,4) is returned.
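To make the build-and-prune procedure concrete, here is a minimal 2-D KD-tree sketch in Java over the same six points. It prunes on the gap along the split axis, a common variant of the min/max bookkeeping described above; all names are illustrative.

    import java.util.*;

    final class KdTree {
        static final class Node {
            double[] p; Node left, right;
            Node(double[] p) { this.p = p; }
        }
        private Node root;

        KdTree(List<double[]> points) { root = build(points, 0); }

        // Build by sorting on the split axis and recursing on the median, as described above.
        private Node build(List<double[]> pts, int depth) {
            if (pts.isEmpty()) return null;
            int axis = depth % 2;
            pts.sort(Comparator.comparingDouble((double[] p) -> p[axis]));
            int mid = pts.size() / 2;
            Node n = new Node(pts.get(mid));
            n.left = build(new ArrayList<>(pts.subList(0, mid)), depth + 1);
            n.right = build(new ArrayList<>(pts.subList(mid + 1, pts.size())), depth + 1);
            return n;
        }

        private double[] best;
        private double bestDist;

        double[] nearest(double[] q) {
            best = null; bestDist = Double.MAX_VALUE;
            search(root, q, 0);
            return best;
        }

        // Skip a subtree when the gap on the split axis already exceeds the best distance.
        private void search(Node n, double[] q, int depth) {
            if (n == null) return;
            double d = Math.hypot(n.p[0] - q[0], n.p[1] - q[1]);
            if (d < bestDist) { bestDist = d; best = n.p; }
            int axis = depth % 2;
            double diff = q[axis] - n.p[axis];
            Node near = diff < 0 ? n.left : n.right;
            Node far  = diff < 0 ? n.right : n.left;
            search(near, q, depth + 1);
            if (Math.abs(diff) < bestDist) search(far, q, depth + 1); // the pruning step
        }

        public static void main(String[] args) {
            KdTree t = new KdTree(new ArrayList<>(Arrays.asList(
                new double[]{2,3}, new double[]{5,4}, new double[]{9,6},
                new double[]{4,7}, new double[]{8,1}, new double[]{7,2})));
            // Matches the walkthrough above: prints [5.0, 4.0]
            System.out.println(Arrays.toString(t.nearest(new double[]{3, 5})));
        }
    }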

Conclusion

Lucene implements the BKD tree, which can be understood as a block KD-tree. Its source code shows MAX_DIMS = 8, because the query complexity of a KD-tree is O(k·n^((k-1)/k)), where k is the dimension and n is the amount of data: the larger k is, the closer the complexity gets to linear, so KD-trees are not suitable for high-dimensional vector recall.

Brute Force

As the name implies, Brute Force compares the query against the distance of every vector. We used BinaryDocValues to implement a BF plug-in on ES. To speed up the calculation further, we used the Java Vector API, which lives in OpenJDK's Project Panama and is optimized with SIMD instructions.
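The plug-in's internals are not reproduced here, but the core idea can be sketched against the incubating jdk.incubator.vector API. This assumes a recent JDK with --add-modules jdk.incubator.vector; the early Panama builds the plug-in used exposed a slightly different API shape.

    import jdk.incubator.vector.*;

    final class SimdDistance {
        // 256-bit species = 8 floats per lane on AVX2 hardware.
        private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

        static float squaredEuclidean(float[] a, float[] b) {
            var acc = FloatVector.zero(SPECIES);
            int i = 0;
            int upper = SPECIES.loopBound(a.length);
            for (; i < upper; i += SPECIES.length()) {
                var va = FloatVector.fromArray(SPECIES, a, i);
                var vb = FloatVector.fromArray(SPECIES, b, i);
                var d = va.sub(vb);
                acc = d.fma(d, acc); // acc += d * d, fused multiply-add per lane
            }
            float sum = acc.reduceLanes(VectorOperators.ADD);
            for (; i < a.length; i++) {  // scalar tail for lengths not divisible by the lane count
                float d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }
    }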

Conclusion

Optimized with AVX2 instructions, comparing 1 million (100W) 256-dimensional vectors on a single shard has an RT of about 260 ms, 1/6 that of regular BF. ElasticSearch itself offers vector retrieval since version 7.3; its bottom layer is also based on Lucene's BinaryDocValues, and it is integrated with the Painless syntax, which makes it more flexible to use.

ANN algorithm

As we saw, the cost of KNN algorithms grows linearly with the amount of data. Scenarios such as recommendation require faster response times and can tolerate some loss of recall.

ANN stands for approximate nearest neighbor: it does not necessarily recall all of the truly nearest points. There are many ANN algorithms; open-source ES ANN plug-in implementations include the following:

  • Hash-based: LSH;
  • Quantization-based: IVF-PQ;
  • Graph-based: HNSW;

Based on its own business scenarios, ZSearch also developed an ANN plug-in (an HNSW implementation adapted to the Proxima vector search engine from Alibaba's DAMO Academy).

LSH algorithm

Locality-Sensitive Hashing splits the space with random planes to hash vectors. In the illustration, 0 means a point is on the left side of a plane and 1 means it is on the right. Hashing each vector several times, we can see that points sharing the same hash values are close to each other, so after hashing we only need to compute exact distances against the vectors whose hash values are similar in order to recall the topK fairly accurately.
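A minimal sketch of random-hyperplane LSH; the names are illustrative, and the real plug-in derives its hash functions from sample data at index-creation time, as noted in the comparison table below.

    import java.util.Random;

    final class RandomHyperplaneLsh {
        private final float[][] planes; // one random normal vector per hash bit

        RandomHyperplaneLsh(int bits, int dim, long seed) {
            Random rnd = new Random(seed);
            planes = new float[bits][dim];
            for (float[] p : planes)
                for (int d = 0; d < dim; d++) p[d] = (float) rnd.nextGaussian();
        }

        // Each bit records which side of a random hyperplane the vector falls on;
        // nearby vectors tend to agree on most bits.
        long hash(float[] v) {
            long h = 0;
            for (int b = 0; b < planes.length; b++) {
                double dot = 0;
                for (int d = 0; d < v.length; d++) dot += planes[b][d] * v[d];
                if (dot >= 0) h |= 1L << b; // 1 = "right of the plane"
            }
            return h;
        }
    }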

IVF-PQ algorithm

PQ (product quantization) is a kind of encoding. Take the 128-dimensional vector in the figure: the vector is first divided into 4 sub-vectors, and k-means clustering is run over each sub-vector position. Each original vector can then be re-encoded with the numbers of its nearest cluster centers, while the cluster-center vectors themselves are recorded; this record is called the codebook.

(Figure: PQ encoding; source: yongyuan.name/blog/vector…)

Even after PQ compression, a query still has to scan a large number of codes to get good results, so a layer of coarse filtering is added in front, as shown in the figure: the vectors are first clustered by k-means into 1024 centers to form an inverted index, the residual between each original vector and its center is computed, and PQ quantization is performed on the residual data set. Residuals are quantized instead of the original data because their variance (energy) is smaller.

In this way, at query time we first find the several cluster centers closest to the query vector, then compute distances under PQ quantization only for the vectors belonging to those centers to get a candidate top list, and finally re-rank the filtered candidates with exact distances to return the topN results (a code sketch follows the figure below).

(Figure: IVF-PQ query flow; source: yongyuan.name/blog/vector…)
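Here is a minimal sketch of PQ encoding and the asymmetric distance computation used at query time. It assumes the codebooks have already been trained with k-means; the training step and the IVF coarse layer are omitted, and all names are illustrative.

    final class ProductQuantizer {
        // codebook[m][k] is the k-th centroid for sub-vector m.
        final float[][][] codebook;

        ProductQuantizer(float[][][] codebook) { this.codebook = codebook; }

        // Encode: replace each sub-vector with the id of its nearest centroid.
        byte[] encode(float[] v) {
            int m = codebook.length, subDim = v.length / m;
            byte[] codes = new byte[m];
            for (int s = 0; s < m; s++) {
                int bestK = 0; double bestD = Double.MAX_VALUE;
                for (int k = 0; k < codebook[s].length; k++) {
                    double d = 0;
                    for (int j = 0; j < subDim; j++) {
                        double t = v[s * subDim + j] - codebook[s][k][j];
                        d += t * t;
                    }
                    if (d < bestD) { bestD = d; bestK = k; }
                }
                codes[s] = (byte) bestK;
            }
            return codes;
        }

        // Asymmetric distance: raw query vs. an encoded vector, summed per sub-space.
        double distance(float[] q, byte[] codes) {
            int m = codebook.length, subDim = q.length / m;
            double d = 0;
            for (int s = 0; s < m; s++) {
                float[] c = codebook[s][codes[s] & 0xFF];
                for (int j = 0; j < subDim; j++) {
                    double t = q[s * subDim + j] - c[j];
                    d += t * t;
                }
            }
            return d;
        }
    }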

HNSW algorithm

Before discussing the HNSW algorithm, let us first look at the NSW algorithm. The figure below shows a sequential graph-construction process:

  • Take the insertion of the fifth point, D, as an example;
  • During construction, we cap each node at three edges to keep the graph from growing too dense; in practice this limit should be chosen according to the data;
  • Starting from a random node, such as A, we record its distance to D and then traverse along A's edges; E turns out to be closest, so an edge is added there, and the search continues without revisiting nodes until 3 edges have been added;

Lookup follows the same procedure as insertion, except that instead of adding edges it returns the results found.
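A minimal sketch of this NSW-style greedy best-first search; insertion would run the same search and then link the returned neighbors (names are illustrative).

    import java.util.*;

    final class NswSearch {
        static double dist(float[] a, float[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
            return s;
        }

        // Expand the closest unexpanded candidate until no neighbor can improve the topK.
        static List<Integer> search(float[][] vec, int[][] graph, float[] q, int entry, int topK) {
            PriorityQueue<double[]> candidates =                     // min-heap by distance
                new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
            PriorityQueue<double[]> best =                           // max-heap, keeps the topK
                new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> -e[0]));
            Set<Integer> visited = new HashSet<>();

            double d0 = dist(q, vec[entry]);
            candidates.add(new double[]{d0, entry});
            best.add(new double[]{d0, entry});
            visited.add(entry);

            while (!candidates.isEmpty()) {
                double[] c = candidates.poll();
                if (best.size() >= topK && c[0] > best.peek()[0]) break; // cannot improve
                for (int nb : graph[(int) c[1]]) {                       // follow the node's edges
                    if (!visited.add(nb)) continue;
                    double d = dist(q, vec[nb]);
                    if (best.size() < topK || d < best.peek()[0]) {
                        candidates.add(new double[]{d, nb});
                        best.add(new double[]{d, nb});
                        if (best.size() > topK) best.poll();
                    }
                }
            }
            List<Integer> ids = new ArrayList<>();
            for (double[] e : best) ids.add((int) e[1]);
            return ids;
        }
    }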

The HNSW algorithm is a layered optimization of NSW. It borrows the idea of the skip list to improve query performance: the search starts from a sparse upper-layer graph and gradually descends into the dense bottom-layer graph.
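The layered descent can be sketched as follows; greedyStep and searchLayer are abstract stand-ins for the per-layer searches (for example, the searchLayer logic in the NSW sketch above).

    abstract class HnswDescent {
        abstract int greedyStep(float[] q, int entry, int level); // NSW greedy search, beam of 1
        abstract int[] searchLayer(float[] q, int entry, int ef); // full bottom-layer beam search

        int[] search(float[] q, int entryPoint, int topLevel, int ef) {
            int ep = entryPoint;
            for (int level = topLevel; level >= 1; level--)
                ep = greedyStep(q, ep, level); // descend through the sparse upper graphs
            return searchLayer(q, ep, ef);     // full search only on the dense bottom graph
        }
    }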

Comparison of ANN plug-ins on ElasticSearch

LSH plug-in:

  • Overview: sample data is passed in when the index is created and is used to compute the hash functions; hash-value fields are added at write time; at recall time, minimum_should_match controls the amount of calculation.
  • Advantages: simple implementation and high performance; no external library required; no memory concerns.
  • Disadvantages: recall rate lower than the other two algorithms, roughly around 85%; recall is affected by the initial sample data; the ES metadata is large.

IVF-PQ plug-in:

  • Advantages: high performance; high recall rate, above 90%; no memory concerns.
  • Disadvantages: requires pre-training in advance with tools such as Faiss; the ES metadata is large.

HNSW plug-in:

  • Advantages: the best query performance; the highest recall rate, above 95%.
  • Disadvantages: segment merge operations consume a significant amount of CPU at build time; query performance degrades as the number of segments grows; the whole index must fit in memory.

ZSearch HNSW plug-in

Based on our scenario (a lightweight output scenario), we chose to implement the HNSW plug-in on ES ourselves. Since our users only append new data and care most about the top10 vectors, we used the seqNo to join ES with the vector retrieval engine, which reduces CPU consumption and avoids the overhead of redundant DocValues.

Integrating with the Proxima vector retrieval framework:

  • Proxima is a universal vector search engine framework developed by Alibaba's internal DAMO Academy, similar to Facebook's open-source Faiss;
  • It supports multiple vector retrieval algorithms;
  • It has a unified interface and architecture, making it convenient for users to adapt;
  • It supports heterogeneous computing, such as GPUs.


Implement ProximaEngine

When a document is written, after the Lucene write we write to the Proxima framework, and Proxima's data is flushed directly to disk through mmap. At the end of the request the translog is flushed to disk, which completes the write request. As for the segments in memory, ElasticSearch flushes them to disk asynchronously once a flush condition is reached.
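Because the real ProximaEngine code is internal, the write order can only be sketched with hypothetical stand-in types; the three steps mirror the description above.

    final class WritePath {
        interface LuceneWriter  { long addDocument(byte[] doc); }      // returns the seqNo
        interface ProximaIndex  { void add(long seqNo, float[] vec); } // mmap-backed store
        interface Translog      { void fsync(); }

        static void write(LuceneWriter lucene, ProximaIndex proxima, Translog translog,
                          byte[] doc, float[] vector) {
            long seqNo = lucene.addDocument(doc); // 1. in-memory Lucene segment
            proxima.add(seqNo, vector);           // 2. Proxima index, mmap'd straight to disk
            translog.fsync();                     // 3. request completes once the translog is durable
            // Lucene segments are flushed to disk later, asynchronously.
        }
    }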

The query process

During a query, VectorQueryPlugin first fetches the topN vectors from the Proxima vector retrieval engine, obtaining their seqNos and similarities; it then builds a FunctionScoreQuery around a numeric newSetQuery over those seqNos and joins it with the other query statements.

Under the hood, the numeric newSetQuery is answered by traversing the BKD tree, which is very efficient.
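A hedged Lucene-level sketch of this join step; the field name and classes are illustrative, not the plug-in's actual code.

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.document.LongPoint;
    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.queries.function.FunctionScoreQuery;
    import org.apache.lucene.search.DoubleValues;
    import org.apache.lucene.search.DoubleValuesSource;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    final class VectorJoin {
        // The ANN engine returned (seqNo, similarity) pairs; filter those docs
        // with a BKD-backed point-set query and score them by their similarity.
        static Query build(Map<Long, Float> seqNoToScore) {
            long[] seqNos = seqNoToScore.keySet().stream().mapToLong(Long::longValue).toArray();
            Query filter = LongPoint.newSetQuery("_seq_no", seqNos);
            return new FunctionScoreQuery(filter, new SeqNoScores(seqNoToScore));
        }

        private static final class SeqNoScores extends DoubleValuesSource {
            private final Map<Long, Float> scores;
            SeqNoScores(Map<Long, Float> scores) { this.scores = scores; }

            @Override
            public DoubleValues getValues(LeafReaderContext ctx, DoubleValues s) throws IOException {
                NumericDocValues seqNo = DocValues.getNumeric(ctx.reader(), "_seq_no");
                return new DoubleValues() {
                    @Override public double doubleValue() throws IOException {
                        return scores.getOrDefault(seqNo.longValue(), 0f);
                    }
                    @Override public boolean advanceExact(int doc) throws IOException {
                        return seqNo.advanceExact(doc);
                    }
                };
            }
            @Override public boolean needsScores() { return false; }
            @Override public DoubleValuesSource rewrite(IndexSearcher searcher) { return this; }
            @Override public boolean isCacheable(LeafReaderContext ctx) { return false; }
            @Override public int hashCode() { return scores.hashCode(); }
            @Override public boolean equals(Object o) {
                return o instanceof SeqNoScores && ((SeqNoScores) o).scores.equals(scores);
            }
            @Override public String toString() { return "seqNoScores"; }
        }
    }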

Failover process

Of course, we also have to deal with various Failover scenarios:

  • When a shard recovers from a remote peer:
    • We intercept ElasticSearch's recovery action;
    • A snapshot of the Proxima index is generated; a write lock is needed to block writes during this step, but snapshot generation is very fast because everything is in memory;
    • The Proxima snapshot is copied to the destination node;
    • The rest of the ElasticSearch recovery process then proceeds as usual.

  • When data is recovered from the local translog, the local checkpoint of the snapshot is recorded. An operation whose sequence number is smaller than or equal to that checkpoint can be skipped; otherwise Proxima is queried back first to prevent re-applying the operation (see the sketch after this list).
  • There is still one case where data can be duplicated: the translog may already have been written into Proxima before the primary and all replica shards fail.
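The translog-replay guard from the list above can be sketched like this; ProximaIndex is a hypothetical stand-in for the real bindings.

    final class TranslogReplay {
        interface ProximaIndex {
            boolean contains(long seqNo);
            void add(long seqNo, float[] vec);
        }

        static void replay(ProximaIndex proxima, long snapshotLocalCheckpoint,
                           long opSeqNo, float[] vector) {
            if (opSeqNo <= snapshotLocalCheckpoint) return; // already covered by the snapshot: skip
            if (!proxima.contains(opSeqNo))                 // query Proxima back to avoid re-applying
                proxima.add(opSeqNo, vector);
        }
    }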

Comparison

SIFT-128-Euclidean, 1 million vectors (100W) dataset (github.com/erikbern/an…):

Open-source HNSW plug-in:

  • Data write (single-thread bulks of 1000): initial write takes 5 min and produces 25 segments, max CPU 300%; merging into 1 segment takes another 5 min, max CPU 700% (a local peak).
  • Query: top10 recall rate 98%; RT 20 ms, dropping to 5 ms after merging into 1 segment.
  • Advantages: compatible with failover.
  • Disadvantages: heavy CPU usage; query performance depends on the number of segments.

ZSearch-HNSW plug-in:

  • Data write (single-thread bulks of 1000): index build takes 15 min; it is single-threaded, so max CPU is 100%.
  • Query: top10 recall rate 98%; RT 6 ms.
  • Advantages: lower CPU consumption; no extra storage.
  • Disadvantages: if the primary and all replica shards fail, a small amount of data may be duplicated.

Conclusion

Best practices for configuring ES parameters

  • 1 million (100W) 256-dimensional vectors take up about 0.95 GB of space, which is relatively large:
    • So allocate more off-heap memory to the page cache;
    • For example, on an 8C32G machine, set the JVM to 8 GB and leave the other 24 GB to the system and the page cache;
  • Set max_concurrent_shard_requests:
    • The default value is 5 × the number of nodes. If a node has many CPUs, you can create more shards and increase this parameter (see the sketch after this list);
  • The BF algorithm needs AVX2-capable CPUs, which basically all Alibaba Cloud ECS instances support.
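For a single request, the parameter can be raised through the Java client as sketched below (it is also available as a URL parameter on the _search endpoint); the index name is illustrative.

    import org.elasticsearch.action.search.SearchRequest;

    final class ShardConcurrency {
        static SearchRequest moreParallelism() {
            SearchRequest request = new SearchRequest("my_index"); // illustrative index name
            request.setMaxConcurrentShardRequests(10); // allow more shards to be queried at once
            return request;
        }
    }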

Algorithm summary

  • Scenarios where KNN fits:

    • Small amounts of data (less than 1 million vectors per shard);
    • Scenarios where other conditions are filtered first, leaving only a small amount of data for vector recall;
    • 100% recall is required;
  • Scenarios where ANN fits:

    • Large amounts of data (tens of millions of vectors or more);
    • Vector recall first, followed by other filters;
    • Recall does not need to be 100%;
    • LSH algorithm: when a lower recall rate is acceptable and there are few additions and deletions;
    • IVF-PQ algorithm: high recall rate, large data volumes (tens of millions), few additions and deletions, needs to be built in advance;
    • HNSW algorithm: demanding recall rates, moderate data volumes (below ten million), the index fully in memory, with enough RAM;

Future plans

Deep-learning models all transform data into high-dimensional vectors, and recalling the topN requires a similarity formula, so vector retrieval scenarios will keep getting richer.

We will continue to explore vector recall on ElasticSearch, add more vector retrieval algorithms for different business scenarios, and push the model-to-vector conversion down into the ZSearch plug-in platform to reduce network overhead. We hope to keep exchanging ideas with everyone and make progress together.

About the author

Lv Liang (alias: "Ten Times") joined Ant Financial's data middleware team in 2017. He is responsible for the infrastructure of the ZSearch general search platform, including landing ZSearch components on K8s and developing high-performance query plug-ins based on ES, and has rich experience in ES performance tuning.

Appendix

  • Fast cosine similarity ES plug-in: github.com/StaySense/f…
  • Vector retrieval algorithm overview: yongyuan.name/blog/vector…
  • ANN performance testing framework: github.com/erikbern/an…

Financial-Grade Distributed Architecture (Antfin_SOFA)