Overview

  • To support the “search by image” scenario of similar-image retrieval, a search-by-image system was designed based on ES vector index computation and the VGG16 image feature extraction model.
  • Open source: github.com/thirtyonele…

Retrieval scenario

  • Inference process: the image is read and the model generates a feature vector
  • Feature storage: Feature vectors are stored in Milvus
  • Retrieval process: on-line real-time vector retrieval
  • The specific process is as follows:
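The three stages above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: `extract_features` is a hypothetical stand-in for the VGG16 model (it merely builds a normalized intensity histogram from a list of pixel values), and `InMemoryIndex` is a hypothetical stand-in for Milvus that scores by inner product. Neither comes from the project source.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def extract_features(pixels, bins=4):
    """Hypothetical stand-in for VGG16: a normalized intensity histogram."""
    hist = [0.0] * bins
    for p in pixels:  # pixel values are assumed to lie in 0..255
        hist[min(p * bins // 256, bins - 1)] += 1.0
    return l2_normalize(hist)


class InMemoryIndex:
    """Hypothetical stand-in for the vector store (Milvus in this article)."""

    def __init__(self):
        self.names = []
        self.vectors = []

    def add(self, name, vector):
        # Feature storage: keep each vector next to its image name.
        self.names.append(name)
        self.vectors.append(vector)

    def search(self, query, top_k=3):
        # Retrieval: score every stored vector by inner product, highest first.
        scored = [
            (name, sum(q * v for q, v in zip(query, vec)))
            for name, vec in zip(self.names, self.vectors)
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


index = InMemoryIndex()
index.add("dark.jpg", extract_features([10, 20, 30, 40]))
index.add("bright.jpg", extract_features([200, 220, 240, 250]))

query = extract_features([15, 25, 35, 60])
results = index.search(query, top_k=1)
print(results[0][0])
```

In the real system the histogram would be replaced by the model's feature output and the in-memory list by a Milvus collection; the flow of inference, storage, and retrieval stays the same.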

Milvus server installation

  • Installation guide: milvus.io/cn/docs/mil…
  • Download the configuration
  mkdir -p milvus/conf && cd milvus/conf
  wget https://raw.githubusercontent.com/milvus-io/milvus/0.10.6/core/conf/demo/server_config.yaml
  • Start the service
  docker run -d --name milvus_cpu_0.11.0 \
  -p 19530:19530 \
  -p 19121:19121 \
  -v <ROOT_DIR>/milvus/db:/var/lib/milvus/db \
  -v <ROOT_DIR>/milvus/conf:/var/lib/milvus/conf \
  -v <ROOT_DIR>/milvus/logs:/var/lib/milvus/logs \
  -v <ROOT_DIR>/milvus/wal:/var/lib/milvus/wal \
  milvusdb/milvus:0.10.6-cpu-d022221-64ddc2

Milvus vector index library

  • The library is built from the feature vectors stored in an h5py (HDF5) index file
  • The retrieval type is the inner product: MetricType.IP
  import h5py
  from milvus import IndexType, MetricType

  # 1. Load the feature vectors and image names from the HDF5 index file
  h5f = h5py.File(index_dir, 'r')
  self.retrieval_db = h5f['dataset_1'][:]
  self.retrieval_name = h5f['dataset_2'][:]
  h5f.close()

  # 2. Load the vectors into Milvus, dropping the collection first if it already exists
  if self.index_name in self.client.list_collections()[1]:
      self.client.drop_collection(collection_name=self.index_name)
  self.client.create_collection({'collection_name': self.index_name, 'dimension': 512, 'index_file_size': 1024, 'metric_type': MetricType.IP})

  # Keep a mapping from the ids Milvus assigns back to the image names
  self.id_dict = {}
  status, ids = self.client.insert(collection_name=self.index_name, records=[i.tolist() for i in self.retrieval_db])
  for i, val in enumerate(self.retrieval_name):
      self.id_dict[ids[i]] = str(val)

  # FLAT is a brute-force index, so nlist is not expected to affect it
  self.client.create_index(self.index_name, IndexType.FLAT, {'nlist': 16384})
  # pprint(self.client.get_collection_info(self.index_name))
  print("************* Done milvus indexing, Indexed {} documents *************".format(len(self.retrieval_db)))
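Milvus hands back integer ids rather than file names, which is why the loading code keeps `id_dict`. Below is a minimal sketch of how a result list might be resolved back to image names. The `(id, score)` tuples are a hypothetical stand-in for whatever the client returns, `resolve_hits` is an illustrative helper that is not part of the project, and the ids and the second file name are made up.

```python
def resolve_hits(hits, id_dict):
    """Map (vector_id, score) pairs back to image names, skipping unknown ids."""
    resolved = []
    for vector_id, score in hits:
        name = id_dict.get(vector_id)
        if name is not None:
            resolved.append({"name": name, "score": score})
    return resolved


# Hypothetical ids and names; in the article the ids come from client.insert().
id_dict = {
    1001: "001_accordion_image_0001.jpg",
    1002: "002_anchor_image_0001.jpg",
}
hits = [(1002, 0.93), (1001, 0.71), (9999, 0.10)]  # 9999 is not in id_dict

resolved = resolve_hits(hits, id_dict)
print(resolved)
```

Because `id_dict` lives only in process memory, it would have to be rebuilt (or persisted elsewhere) whenever the service restarts.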

Milvus retrieval implementation

  • Because the collection was created with the inner product metric, retrieval here uses inner product (dot product) similarity. The code is as follows:
_, vectors = self.client.search(collection_name=self.index_name, query_records=[query_vector], top_k=search_size, params={'nprobe': 16})
  • To switch to Euclidean distance, create the collection with MetricType.L2 instead
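For unit-length feature vectors the two metrics rank results identically, because the squared Euclidean distance equals 2 minus twice the inner product. The sketch below checks this on made-up vectors. It assumes the features are L2-normalized, which the article does not state explicitly; for unnormalized features the two rankings can differ.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, normalized to unit length (an assumption about the features).
query = normalize([1.0, 2.0, 3.0])
database = [
    normalize(v) for v in ([1.0, 2.0, 2.5], [3.0, 1.0, 0.5], [0.5, 2.0, 4.0])
]

# Inner product: larger is more similar. L2: smaller is more similar.
rank_by_ip = sorted(
    range(len(database)), key=lambda i: -inner_product(query, database[i])
)
rank_by_l2 = sorted(
    range(len(database)), key=lambda i: squared_l2(query, database[i])
)

print(rank_by_ip == rank_by_l2)
```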

How to run

  • Download the project source code: github.com/thirtyonele…
  • Operation 1: Build the base index
  python index.py
  --train_data: path to the training image folder; the default is '<ROOT_DIR>/data/train'
  --index_file: custom path for the index file; the default is '<ROOT_DIR>/index/train.h5'
  • Operation 2: Run a similarity search
  python retrieval.py --engine=milvus
  --test_data: path to the test image; the default is '<ROOT_DIR>/data/test/001_accordion_image_0001.jpg'
  --index_file: path to the index file (the .h5 file built in Operation 1)
  --db_name: ES or Milvus index name; the default is 'image_retrieval'
  --engine: search engine type; the default is 'numpy'. The options are numpy, faiss, es, or milvus
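The flags of the retrieval script could be declared with `argparse` roughly as follows. This is a hypothetical reconstruction from the option descriptions above, not the project's actual code: the exact spelling of the `--engine` choices and the `--index_file` default (borrowed here from the index-building step) are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical reconstruction of the retrieval script's command-line flags."""
    parser = argparse.ArgumentParser(description="Image similarity search (sketch)")
    parser.add_argument(
        "--test_data",
        default="<ROOT_DIR>/data/test/001_accordion_image_0001.jpg",
        help="path to the query image",
    )
    parser.add_argument(
        "--index_file",
        default="<ROOT_DIR>/index/train.h5",  # assumed: same default as index.py
        help="path to the HDF5 index file",
    )
    parser.add_argument(
        "--db_name",
        default="image_retrieval",
        help="ES or Milvus index name",
    )
    parser.add_argument(
        "--engine",
        default="numpy",
        choices=["numpy", "faiss", "es", "milvus"],  # assumed spellings
        help="search engine type",
    )
    return parser


args = build_parser().parse_args(["--engine", "milvus"])
print(args.engine, args.db_name)
```

Restricting `--engine` with `choices` makes a mistyped engine name fail at startup instead of partway through retrieval.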

Conclusion

  • Managing data as libraries (collections) is easy to understand
  • Usage is similar to ES, but performance is better
  • Because Milvus currently supports only vector retrieval and not scalar fields, we need to build our own business database if scalar filtering is involved
  • The Milvus community plans to support distributed deployment, which will make large-index scenarios easier to handle
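The scalar filtering point can be handled on the application side by over-fetching from the vector search and then filtering against the business store. The sketch below shows one way this might look. `filter_hits`, the metadata fields, and the `(id, score)` result shape are all hypothetical and only illustrate the idea.

```python
def filter_hits(hits, metadata, predicate, top_k):
    """Keep the first top_k hits whose business-side metadata passes the predicate."""
    kept = []
    for vector_id, score in hits:  # hits are assumed to be sorted best-first
        record = metadata.get(vector_id)
        if record is not None and predicate(record):
            kept.append((vector_id, score))
            if len(kept) == top_k:
                break
    return kept


# Hypothetical business store keyed by vector id (e.g. rows from a relational DB).
metadata = {
    1: {"category": "accordion", "on_sale": True},
    2: {"category": "anchor", "on_sale": False},
    3: {"category": "accordion", "on_sale": True},
}
# Hypothetical over-fetched vector search results as (id, score) pairs.
hits = [(2, 0.95), (1, 0.90), (3, 0.80)]

kept = filter_hits(hits, metadata, lambda r: r["on_sale"], top_k=2)
print(kept)
```

The trade-off is that a very selective filter may leave fewer than `top_k` results, so the vector search has to request more candidates than the caller finally needs.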

That’s all!