Milvus is a similarity search engine for massive feature vectors, and a single server can handle data at the billion scale. For data sets beyond what a single server can handle, a Milvus cluster that can scale horizontally is needed to meet the demand for high-performance retrieval over massive vector data. Mishards is a Milvus cluster middleware developed in Python. A Milvus cluster built with Mishards provides core functions such as request forwarding, read/write separation, horizontal scaling, and dynamic scaling, giving users the ability to run similarity search over very large vector data sets. This article shows how to use the Mishards sharding middleware to build a Milvus cluster. It is divided into three parts:

  • Overview of Mishards: How Mishards works and the Milvus cluster architecture

  • Setup procedure: Take two servers as an example to set up a Milvus cluster

  • Test: Import a data set of 100 million vectors into the cluster and analyze how the cluster runs



Overview of Mishards

Working principle:

Mishards is responsible for splitting requests from clients, routing them to internal sub-instances, and finally aggregating the sub-instance results and returning them to the client. Mishards’ workflow is as follows:


Workflow:

  1. The client sends a request to the Proxy

  2. The Proxy splits the client request

  3. The split requests are routed to the internal sub-instances

  4. Each instance sends its own results back to the Proxy

  5. The Proxy aggregates the results

  6. The final result is returned to the client
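
The workflow above is a scatter-gather pattern. The following is a minimal sketch of that pattern, not the Mishards implementation: `scatter_gather`, `make_instance`, and the hit values are hypothetical, and each "instance" is a plain function that returns its own top-k hits as (distance, id) pairs.

```python
import heapq
from itertools import chain
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, int]  # (distance, vector id); a smaller distance is a closer match
SearchFn = Callable[[List[float], int], List[Hit]]


def scatter_gather(query: List[float], instances: Dict[str, SearchFn], top_k: int) -> List[Hit]:
    """Send the query to every instance, then merge the partial top-k lists."""
    # Steps 2-4: every instance searches only its own share of the data.
    partial_results = [search(query, top_k) for search in instances.values()]
    # Steps 5-6: keep the top_k closest hits across all partial results.
    return heapq.nsmallest(top_k, chain.from_iterable(partial_results))


def make_instance(hits: List[Hit]) -> SearchFn:
    """Hypothetical stand-in for a Milvus instance holding pre-computed hits."""
    def search(query: List[float], top_k: int) -> List[Hit]:
        return sorted(hits)[:top_k]
    return search


instances = {
    "192.168.1.38": make_instance([(0.12, 7), (0.40, 3), (0.90, 11)]),
    "192.168.1.85": make_instance([(0.05, 42), (0.33, 18), (0.95, 2)]),
}
result = scatter_gather([0.0, 0.0], instances, top_k=3)
print(result)  # [(0.05, 42), (0.12, 7), (0.33, 18)]
```

Because every instance already returns at most top_k hits, the merge step only has to look at top_k results per instance, regardless of how much data each instance holds.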

Cluster architecture:

The Mishards instance can be installed on any server in the cluster. The overall cluster architecture of Mishards is as follows:
Select one of the Milvus instances to be writable and set the rest to be read-only. The writable Milvus instance imports data into shared storage via Mishards. Mishards then allocates the data in shared storage to each instance in the cluster (including the writable instance) using a consistent hashing algorithm. After each instance completes its query task, the results are aggregated and the final result is returned to the client.
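
To illustrate the idea of consistent hashing mentioned above, here is a minimal sketch of a hash ring with virtual nodes. It is an assumption-laden illustration, not the code Mishards uses: the class `ConsistentHashRing`, the `replicas` parameter, the choice of SHA-256, and the segment names are all hypothetical.

```python
import bisect
import hashlib
from typing import Dict, List, Tuple


class ConsistentHashRing:
    """Toy hash ring: each node owns many points ("virtual nodes") on the ring."""

    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        ring: List[Tuple[int, str]] = []
        for node in nodes:
            for i in range(replicas):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Return the node that owns the first ring point after the key's hash."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]


ring = ConsistentHashRing(["192.168.1.38", "192.168.1.85"])
segments = [f"segment_{i}" for i in range(8)]  # hypothetical data file names

assignment: Dict[str, List[str]] = {}
for segment in segments:
    assignment.setdefault(ring.node_for(segment), []).append(segment)

for node, owned in assignment.items():
    print(node, owned)
```

The useful property of this scheme is that adding an instance only moves the data that the new instance takes over; every other piece of data stays on the instance it was already assigned to.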



Setup procedure

A Milvus cluster requires two or more servers and a shared storage device:


  • Milvus must be installed and started on each server

  • Mishards only needs to be started on one of the servers

  • Any server can be selected as the shared storage



This example sets up a Milvus cluster using two servers, A and B: start Mishards and the first Milvus instance on server A (192.168.1.38); start the second Milvus instance on server B (192.168.1.85) and use this server as the shared storage that holds the data for all Milvus instances in the cluster.
Note: Parameters need to be configured before Milvus and Mishards are enabled.


The specific construction process is as follows:


1. Install MySQL

The MySQL service only needs to be started on one server in the cluster. In this example it is installed on server A (192.168.1.38).
  • Install and start the MySQL service according to the MySQL official website tutorials

  • Or install MySQL from Docker

2. Start Milvus

Milvus must be installed on each server in the cluster. Different Milvus instances have different read and write permissions. In this example, the Milvus instance on the first server is configured to be writable and the Milvus instance on the second server is configured to be read-only.
Only one Milvus instance in the cluster can be configured as writable. All the others are read-only.
Follow the instructions on the official website to install Milvus, but modify the configuration file server_config.yml before starting it.
Locate the following segments in the configuration file and modify the related parameters based on the actual environment:
    version: 0.1                      # config version

    server_config:
      address: 0.0.0.0                # milvus server ip address (IPv4)
      port: 19530                     # milvus server port, must be in range [1025, 65534]
      deploy_mode: cluster_readonly   # deployment type: single, cluster_readonly, cluster_writable
      time_zone: UTC+8                # time zone, must be in format: UTC+X

    db_config:
      primary_path: /var/lib/milvus   # path used to store data and meta
      secondary_path:                 # paths used to store data only, separated by semicolons
      backend_url: mysql://root:[email protected]:3306/milvus   # URI format: dialect://username:password@host:port/database
In the configuration file, the parameter deploy_mode determines whether a Milvus instance is read-only or writable. In the standalone version, this parameter is set to single. When using Mishards, each Milvus instance is set to either cluster_writable or cluster_readonly:
  • cluster_writable: the Milvus instance is writable
  • cluster_readonly: the Milvus instance is read-only
For the parameter backend_url, set the IP address and port of the server where MySQL is installed, following the format shown above. For other settings, refer to the Milvus standalone configuration.
In addition, the data storage location of every instance in the cluster should point to the same shared storage. In this example, the second server is used as the shared storage.
After modifying the configuration, start the Milvus service.
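
The rule that exactly one instance is writable can be checked before editing each server's configuration. The helper below is a hypothetical sketch (the function `assign_deploy_modes` is not part of Milvus or Mishards); it only computes which deploy_mode value each host should receive.

```python
from typing import Dict, List


def assign_deploy_modes(hosts: List[str], writable_host: str) -> Dict[str, str]:
    """Map each host to the deploy_mode value for its server_config.yml."""
    if len(set(hosts)) != len(hosts):
        raise ValueError("duplicate host in the cluster list")
    if writable_host not in hosts:
        raise ValueError(f"{writable_host} is not one of the cluster hosts")
    # Exactly one host is writable; every other host is read-only.
    return {
        host: "cluster_writable" if host == writable_host else "cluster_readonly"
        for host in hosts
    }


modes = assign_deploy_modes(["192.168.1.38", "192.168.1.85"], writable_host="192.168.1.38")
for host, mode in modes.items():
    print(f"{host}: deploy_mode: {mode}")
```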
3. Start Mishards


The Mishards instance only needs to be started on one server in the cluster. This example starts Mishards on server A.
Mishards is started with Docker, using the configuration file shown below. Before startup, modify the corresponding parameters in the cluster_mishards.yml file:
    version: "2.3"
    services:
        mishards:
            restart: always
            image: milvusdb/mishards
            ports:
                - "0.0.0.0:19531:19531"
                - "0.0.0.0:19532:19532"
            # volumes:
            #     - /tmp/milvus/db:/tmp/milvus/db
            #     - /tmp/mishards_env:/source/mishards/.env
            command: ["python", "mishards/main.py"]
            environment:
                FROM_EXAMPLE: 'true'
                SQLALCHEMY_DATABASE_URI: mysql+pymysql://root:[email protected]:3306/milvus?charset=utf8mb4
                DEBUG: 'true'
                SERVER_PORT: 19531
                WOSERVER: tcp://192.168.1.85:19530
                DISCOVERY_PLUGIN_PATH: static
                DISCOVERY_STATIC_HOSTS: 192.168.1.85,192.168.1.38
                DISCOVERY_STATIC_PORT: 19530
Parameter description:
  • SERVER_PORT: The service port of Mishards.
  • WOSERVER: The address of the writable Milvus instance. Currently only static configuration is supported. Reference format: tcp://127.0.0.1:19530
  • DISCOVERY_PLUGIN_PATH: User-defined search path for service discovery plug-ins. By default, the system search path is used.
  • DISCOVERY_STATIC_HOSTS: List of service addresses, separated by commas, for example 192.168.1.188,192.168.1.190
  • DISCOVERY_STATIC_PORT: The port that the service addresses listen on.


Parameters to modify:
  • SQLALCHEMY_DATABASE_URI: Change it to the address of the MySQL server.
  • WOSERVER: Change it to the IP address of the writable Milvus instance.
  • DISCOVERY_STATIC_HOSTS: Change it to the IP addresses of all Milvus instances in the cluster.
Start Mishards after making the changes.
For more detailed setup steps, refer to the Bootcamp: https://github.com/milvus-io/bootcamp/tree/master/solutions/Mishards
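
Before starting the container, it can help to sanity-check the values described above. The following is a hedged sketch that uses only the Python standard library; `check_mishards_env` is a hypothetical helper and not something Mishards provides. It checks that WOSERVER follows the tcp://host:port reference format and expands the static host list into host:port addresses.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def check_mishards_env(env: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (writable address, list of instance addresses) or raise ValueError."""
    wo = urlsplit(env["WOSERVER"])
    if wo.scheme != "tcp" or wo.hostname is None or wo.port is None:
        raise ValueError(f"WOSERVER should look like tcp://host:port, got {env['WOSERVER']!r}")
    # DISCOVERY_STATIC_HOSTS is a comma-separated list; tolerate stray spaces.
    hosts = [h.strip() for h in env["DISCOVERY_STATIC_HOSTS"].split(",") if h.strip()]
    if not hosts:
        raise ValueError("DISCOVERY_STATIC_HOSTS is empty")
    port = int(env["DISCOVERY_STATIC_PORT"])
    return f"{wo.hostname}:{wo.port}", [f"{h}:{port}" for h in hosts]


env = {
    "WOSERVER": "tcp://192.168.1.85:19530",
    "DISCOVERY_STATIC_HOSTS": "192.168.1.85, 192.168.1.38",
    "DISCOVERY_STATIC_PORT": "19530",
}
writable, members = check_mishards_env(env)
print(writable)  # 192.168.1.85:19530
print(members)   # ['192.168.1.85:19530', '192.168.1.38:19530']
```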


Test

Data preparation

The original data set used in this experiment is SIFT1B. For details on this data set, see:
http://corpus-texmex.irisa.fr/


In this test, we extracted 100 million vectors from the original data set, about 13 GB in size.

100 million test data set download address: https://pan.baidu.com/s/1N5jGKHYTGchye3qR31aNnA


Once Mishards is set up and started, you can use the cluster in the same way as a standalone Milvus server. The client connects to the cluster with the IP address of the Mishards server and the Mishards service port:

    >>> from milvus import Milvus
    >>> milvus = Milvus()
    >>> milvus.connect(host='192.168.1.38', port='19531')

You can then perform a series of operations such as creating tables, inserting vectors, building indexes, and running queries.


Test steps:

Clone the Bootcamp repository (https://github.com/milvus-io/bootcamp.git) to the server and enter the bootcamp/benchmark_test/scripts directory.
1. Create a table:
    $ python3 milvus_toolkit.py --table <table_name> --dim <dim_num> -c
2. Import data. Depending on the format of the file to import (npy, csv, fvecs, or bvecs), follow the instructions for the Milvus Bootcamp scripts (https://github.com/milvus-io/bootcamp/tree/0.5.3/scripts) and set the path of the file to import in milvus_load.py. Then run the following command to import the data:
    $ python3 milvus_load.py --table <table_name> -b
3. Create an index:
    $ python3 milvus_toolkit.py --table <table_name> --index <sq8 or sq8h or flat or ivf> --build
4. Query:
    $ python3 milvus_toolkit.py --table <table_name> --nprobe <np_num> -s
    # -s runs the query performance test; nprobe specifies the number of buckets to search during a query
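
To make the meaning of nprobe concrete, here is a toy bucket-based search in pure Python. It is only an illustration of the general idea behind bucketed (IVF-style) indexes under simplifying assumptions, not how Milvus implements them; the functions `build_buckets` and `search` and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def build_buckets(vectors: List[Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets: Dict[int, List[int]] = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets


def search(query: Vector, vectors: List[Vector], centroids: List[Vector],
           buckets: Dict[int, List[int]], nprobe: int, top_k: int) -> List[Tuple[float, int]]:
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in probe_order[:nprobe] for i in buckets[c]]
    return sorted((math.dist(query, vectors[i]), i) for i in candidates)[:top_k]


centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (4.0, 0.0), (6.0, 0.0), (9.0, 0.0)]
buckets = build_buckets(vectors, centroids)
query = (5.2, 0.0)

ids_nprobe_1 = [i for _, i in search(query, vectors, centroids, buckets, nprobe=1, top_k=2)]
ids_nprobe_2 = [i for _, i in search(query, vectors, centroids, buckets, nprobe=2, top_k=2)]
print(ids_nprobe_1)  # [2, 3]
print(ids_nprobe_2)  # [2, 1]
```

With nprobe=1 the search misses vector 1, which is the second-closest vector but sits in the other bucket. A larger nprobe scans more buckets, which costs more time but finds more of the true nearest neighbors.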

Cluster operation

You can look at the Mishards log to see how each server is doing, as shown below:

According to the run log, IP address 192.168.1.85 and IP address 192.168.1.38 participate in the query.

As shown in the following two figures, all CPUs on 192.168.1.85 and 192.168.1.38 are working. The RES column (in the table whose header row begins with PID USER) shows that the memory usage on 192.168.1.85 is 10.7G and on 192.168.1.38 is 9344M, for a total of 19.825G.

When the same data set was processed with standalone Milvus, the memory footprint was 15.9G. With Mishards, the total memory footprint is nearly 4G higher than with the standalone version. Mishards itself consumes almost no memory; the increase comes from running multiple Milvus instances, each of which has its own memory overhead.
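
The memory figures above can be reproduced with a few lines of arithmetic. This sketch assumes that the M and G suffixes in the RES column differ by a factor of 1024, which is the reading that matches the 19.825G total reported in the article.

```python
MB_PER_GB = 1024  # assumption: the M and G units in the RES column differ by 1024

mem_server_b_gb = 10.7              # 192.168.1.85
mem_server_a_gb = 9344 / MB_PER_GB  # 192.168.1.38, reported as 9344M
cluster_total_gb = mem_server_b_gb + mem_server_a_gb

standalone_gb = 15.9
overhead_gb = cluster_total_gb - standalone_gb

print(f"cluster total: {cluster_total_gb:.3f}G")  # cluster total: 19.825G
print(f"overhead:      {overhead_gb:.3f}G")       # overhead:      3.925G
```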



This article used Mishards to build a Milvus cluster, then tested the cluster and analyzed its operation with a data set of 100 million vectors. When you need to process a very large number of feature vectors, the Mishards-based Milvus distributed cluster solution can give you a better experience. Mishards will continue to be updated in future releases, and we welcome your suggestions and code contributions as we explore better clustering solutions for your scenarios.

For more on Mishards, visit:
https://github.com/milvus-io/milvus/blob/master/shards/README_CN.md




Welcome to the Milvus community

  • Milvus source: github.com/milvus-io/milvus

  • Milvus website: milvus.io

  • Milvus Slack community: milvusio.slack.com






© 2020 ZILLIZ ™