Overview

Deep learning neural network models can convert unstructured data such as images, video, speech, and text into feature vectors. In addition to the feature vectors themselves, such data often needs structured attributes. A face image, for example, can be labeled with gender, whether the person wears glasses, and the capture time; a text can be tagged with its language, corpus category, and creation time.

In the past, feature vectors were usually stored in the same structured table as their tag attributes. However, traditional databases cannot search massive sets of high-dimensional feature vectors efficiently, so a dedicated feature vector database is needed to store and retrieve them.

The solution

Milvus is a vector search engine that makes high-performance search over large numbers of vectors easy. This article combines it with a traditional relational database, PostgreSQL: Milvus stores and searches the vectors, while PostgreSQL stores the unique ID that Milvus assigns to each vector together with the vector's attributes. Looking up the IDs returned by a Milvus search in PostgreSQL quickly yields the mixed query results. The solution is as follows:

Feature vector storage

The solid blue line in the figure above represents the feature vector storage process for Milvus mixed queries. First, the source feature vectors are stored in the Milvus feature vector database, and Milvus returns an ID for each one. Then the unique ID of each feature vector and its tag attributes are stored in a relational database such as PostgreSQL.

Feature vector retrieval

The solid orange line in the figure above represents the feature vector retrieval process of a Milvus mixed query. The query vector is passed to Milvus, which returns the IDs of the stored vectors most similar to it. Those IDs are then looked up in PostgreSQL, which yields the mixed query result for the query vector.
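
The storage and retrieval flows can be sketched end to end with stand-ins. The example below is only an illustration under stated assumptions: a brute-force in-memory index (the hypothetical `ToyVectorIndex` class) plays the role of Milvus, and Python's built-in `sqlite3` plays the role of PostgreSQL. Neither reflects the real Milvus or PostgreSQL APIs, and the `faces` table and its columns are made up for the example. The point is the division of labor: vectors go to the vector store, IDs and attributes go to the relational store, and a search chains the two.

```python
import math
import sqlite3


class ToyVectorIndex:
    """Hypothetical stand-in for a vector engine: brute-force Euclidean search."""

    def __init__(self):
        self._vectors = {}
        self._next_id = 0

    def add(self, vectors):
        """Store vectors and return one new ID per vector."""
        ids = []
        for vec in vectors:
            self._vectors[self._next_id] = list(vec)
            ids.append(self._next_id)
            self._next_id += 1
        return ids

    def search(self, query, top_k):
        """Return the IDs of the top_k stored vectors closest to query."""
        ranked = sorted(self._vectors, key=lambda i: math.dist(query, self._vectors[i]))
        return ranked[:top_k]


def store(index, conn, vectors, attributes):
    """Storage flow: vectors go to the index, IDs plus attributes go to SQL."""
    ids = index.add(vectors)
    rows = [(i, a["sex"], a["is_glasses"]) for i, a in zip(ids, attributes)]
    conn.executemany("INSERT INTO faces VALUES (?, ?, ?)", rows)
    return ids


def mixed_query(index, conn, query, top_k):
    """Retrieval flow: vector search first, then attribute lookup by ID."""
    ids = index.search(query, top_k)  # assumed non-empty in this sketch
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT ids, sex, is_glasses FROM faces WHERE ids IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids]  # keep the similarity order


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faces (ids INTEGER PRIMARY KEY, sex TEXT, is_glasses TEXT)")
index = ToyVectorIndex()
store(
    index,
    conn,
    vectors=[[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
    attributes=[
        {"sex": "female", "is_glasses": "True"},
        {"sex": "male", "is_glasses": "False"},
        {"sex": "female", "is_glasses": "False"},
    ],
)
hits = mixed_query(index, conn, query=[0.9, 0.9], top_k=2)
```

Here `hits` holds the attribute rows of the two stored vectors nearest to the query, ordered from most to least similar.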

Milvus mixed query

At this point, you might be wondering: why not just store the feature vectors and their attributes in a relational database? To answer that, we use Milvus (version 0.6.0) to test a mixed query on 100 million vectors from ANN_SIFT1B.

1. Feature vector data set

The feature vectors for this Milvus mixed query are 100 million 128-dimensional vectors taken from the Base Set file of ANN_SIFT1B. The vectors are treated as if they were face feature vectors, and gender, glasses, and image capture time labels are added to each one:

import random
from faker import Faker

fake = Faker()

# Load vectors from the Base Set file for import into Milvus
vectors = load_bvecs_data(FILE_PATH, 10000000)

# Randomly generate gender, glasses, and image capture time tags for a vector
sex = random.choice(['female', 'male'])
get_time = fake.past_datetime(start_date="-120d", tzinfo=None)
is_glasses = random.choice(['True', 'False'])
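
The `load_bvecs_data` helper is not shown in this article. As a rough idea of what such a loader might do, here is a minimal sketch that assumes the usual `.bvecs` layout of the SIFT1B files: each record is a 4-byte little-endian dimension followed by that many unsigned bytes. The function name `read_bvecs` and its parameters are hypothetical and are not the article's helper.

```python
import struct


def read_bvecs(path, count):
    """Read up to `count` vectors from a .bvecs file as lists of floats."""
    vectors = []
    with open(path, "rb") as f:
        while len(vectors) < count:
            header = f.read(4)
            if len(header) < 4:
                break  # end of file
            (dim,) = struct.unpack("<i", header)  # dimension of this record
            body = f.read(dim)  # one unsigned byte per component
            if len(body) < dim:
                raise ValueError("truncated .bvecs record")
            vectors.append([float(b) for b in body])
    return vectors
```

Reading the file record by record keeps memory use proportional to `count`, which matters when only part of a very large base file is needed.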

2. Feature vector storage

Import the 100 million vectors into Milvus. The ids that Milvus returns are the unique identifiers of the vectors. Store the ids together with each vector's labels in PostgreSQL. Optionally, the original feature vectors can also be stored in PostgreSQL:

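# Import the raw vectors into Milvus, which returns one ID per vector
status, ids = milvus.add_vectors(table_name=MILVUS_TABLE, records=vectors)

# Save each ID and its tags to PostgreSQL. cur is assumed to be a psycopg2
# cursor and PG_TABLE_NAME a variable holding the table name.
sql = "INSERT INTO " + PG_TABLE_NAME + " VALUES (%s, %s, %s, %s);"
for vector_id in ids:
    sex = random.choice(['female', 'male'])
    get_time = fake.past_datetime(start_date="-120d", tzinfo=None)
    is_glasses = random.choice(['True', 'False'])
    cur.execute(sql, (vector_id, sex, get_time, is_glasses))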
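
With this many rows, it can help to build the tag rows up front and insert them in batches rather than one at a time. The sketch below shows one possible way to do that using only the standard library. The helper names `make_tag_rows` and `chunks` are hypothetical, and the capture time is drawn with `random` here instead of Faker just to keep the example self-contained.

```python
import random
from datetime import datetime, timedelta


def make_tag_rows(ids, now, rng):
    """Build one (id, sex, get_time, is_glasses) row per vector ID."""
    rows = []
    for vector_id in ids:
        sex = rng.choice(["female", "male"])
        # Random capture time within the past 120 days
        get_time = now - timedelta(seconds=rng.randrange(120 * 24 * 3600))
        is_glasses = rng.choice(["True", "False"])
        rows.append((vector_id, sex, get_time, is_glasses))
    return rows


def chunks(rows, size):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


rows = make_tag_rows(ids=[10, 11, 12, 13, 14], now=datetime(2020, 1, 1), rng=random.Random(0))
batches = list(chunks(rows, 2))
```

Each batch could then be handed to something like a database cursor's `executemany` with a parameterized INSERT statement.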

3. Feature vector retrieval

Pass the vectors to be searched to Milvus. Set TOP_K = 10 and DISTANCE_THRESHOLD = 1 (both can be changed as required). TOP_K is the number of most similar results to return for each query vector, here the top 10. DISTANCE_THRESHOLD is the distance threshold between the query vector and each result vector.

ANN_SIFT1B uses Euclidean distance. Once the parameters are set, Milvus returns the IDs of the query results; these IDs are then used to query PostgreSQL, which produces the final mixed query results.
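
To make the two parameters concrete, here is a small pure-Python sketch of a top-K search by Euclidean distance followed by a threshold filter. It rests on an assumption the article does not spell out: that results whose distance is not below DISTANCE_THRESHOLD are discarded after the top-K step. The function `top_k_within` is hypothetical and says nothing about how Milvus computes its results internally.

```python
import math


def top_k_within(query, candidates, top_k, distance_threshold):
    """Return (id, distance) for the top_k nearest candidates under the threshold.

    candidates maps a vector ID to its vector.
    """
    # Rank every candidate by Euclidean distance to the query
    scored = sorted((math.dist(query, vec), vid) for vid, vec in candidates.items())
    # Keep the top_k closest, then drop any that are too far away (assumed rule)
    return [(vid, dist) for dist, vid in scored[:top_k] if dist < distance_threshold]


candidates = {1: [0.0, 0.0], 2: [3.0, 4.0], 3: [0.5, 0.0]}
close = top_k_within([0.0, 0.0], candidates, top_k=3, distance_threshold=1.0)
```

With these toy vectors, vector 2 lies at distance 5 from the query, so it falls outside the threshold even though `top_k` would otherwise allow three results.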

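# Extract the vectors to search from the query set based on query_location
vector = load_query_list(QUERY_PATH, query_location)
# Pass the vectors to Milvus to search
status, results = milvus.search_vectors(table_name=MILVUS_TABLE, query_records=vector, top_k=TOP_K)

# Query PostgreSQL with the IDs returned by Milvus for the first query vector.
# cur is assumed to be a psycopg2 cursor, PG_TABLE_NAME a variable holding the
# table name, and each result is assumed to expose an id attribute.
result_ids = [result.id for result in results[0]]
sql = "SELECT * FROM " + PG_TABLE_NAME + " WHERE ids = ANY(%s);"
cur.execute(sql, (result_ids,))
rows = cur.fetchall()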

When running a mixed query on 100 million vectors, the Milvus feature vector search takes 70 ms, and looking up the resulting ids in PostgreSQL takes no more than 7 ms.

In short, the Milvus feature vector database makes it quick to implement mixed queries over vectors and structured data. With a traditional relational database alone, storing vector data at this scale is difficult and feature vector retrieval cannot be done with high performance.
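
The article does not say how these timings were measured. One simple way to time each stage separately is a small wrapper around `time.perf_counter`, as in the sketch below. The helper name `timed` is hypothetical, and the call to `sum` only stands in for a real search or lookup call.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# Stand-in workload; in practice fn would be the vector search or the SQL lookup
total, elapsed_ms = timed(sum, [1, 2, 3])
```

Timing the vector search and the ID lookup with separate calls gives the two numbers that add up to the total mixed query time.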

Milvus feature vector search time | PostgreSQL ID lookup time
70 ms | 1 ms to 7 ms

Conclusion

This demonstration implements a Milvus-based mixed query. On a data set of 100 million feature vectors, the mixed query takes no more than 77 ms.

Milvus is easy to manage and easy to use. With the tools provided in the Milvus mixed query solution, mixed queries over vectors and structured data are straightforward to implement and can better support business needs.

Milvus is building an online developer community, so if you’re interested in Milvus’s technology discussions and trials, join us on Slack.