Content-based recall is a common recall strategy in recommendation systems. It includes tag recall based on users or items, as well as recall based on a user's age and region. This strategy is usually implemented on top of the open source software Elasticsearch. The recall results are reasonable, but they offer little novelty or surprise. For example, recall through the tag "Andy Lau" returns almost only items that contain the words "Andy Lau"; it is unlikely to recall items about "Leon Lai", "Jacky Cheung", or the other Four Heavenly Kings. In recent years, as everything gets turned into embeddings, and especially after the successful application of Word2Vec, Item2Vec, Graph2vec, and similar techniques, recalling item vectors by item vector has become a widely used recall strategy in recommendation systems. This article focuses on building a vector search service with the open source software Vearch and using it to implement search by image.

Introduction

Recently I have been optimizing recommendations for short videos, where the optimization target is per-user screen entries (roughly, how many videos each user opens). Based on my earlier experience with feed recommendation, I wanted to recall more relevant videos for users to consume based on their play history. The short-video scenario is similar to the popular Douyin: a short video (about 5-30 s) plays automatically and takes up almost the whole screen. Users can like, share, and comment on it, or swipe up to the next video if they are not interested. The two-line title below the video is unlikely to decide whether a user wants to watch it; what matters more is whether the cover image catches the user's interest. With this background, I wanted to build a search-by-image service that recalls short videos through their cover images.

I have previously built a vector recall service with gRPC and Faiss, and in an earlier image classification project the images were converted into vectors as the classifier input. Building this service therefore comes down to two things:

  1. Preprocess each image and convert it into a fixed-length vector
  2. Load the vector of each image into Faiss and use it to run vector searches
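
The two steps above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the gray-level histogram stands in for a real feature extractor (such as SIFT or a CNN), and the brute-force inner-product search stands in for a Faiss index. The function names are hypothetical and do not come from Faiss or Vearch.

```python
import math

def to_fixed_length_vector(pixels, bins=4):
    """Toy feature extractor: an L2-normalized gray-level histogram with `bins` entries.

    `pixels` is an iterable of gray levels in [0, 255]; any image size maps to
    a vector of the same length, which is what step 1 requires.
    """
    hist = [0.0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1.0
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

def search(index, query, k=2):
    """Brute-force inner-product search over {item_id: vector}; a stand-in for Faiss."""
    scored = [
        (sum(a * b for a, b in zip(vec, query)), item_id)
        for item_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [(item_id, score) for score, item_id in scored[:k]]

# Step 1: convert every image into a fixed-length vector.
images = {
    "dark": [0, 10, 20, 30],
    "bright": [250, 240, 230, 220],
    "mixed": [0, 10, 250, 240],
}
index = {name: to_fixed_length_vector(px) for name, px in images.items()}

# Step 2: search the index with the vector of a query image.
results = search(index, to_fixed_length_vector([5, 15, 25, 35]), k=2)
print(results)
```

A real service replaces both functions but keeps the same shape: one component turns images into vectors, the other answers nearest-neighbor queries over them.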

For example, in the blog post "Use of Faiss in the project", the author extracts image features with the SIFT algorithm as 128-dimensional vectors and loads the feature vectors of each image into Faiss to recall similar vectors. Similarly, in "Faiss Server practice based on gRPC", MXPlayer's technical team upgraded a user/item vector recall service originally built on the Flask framework, and the QPS in a single-machine load test more than doubled. I originally planned to adapt my gRPC-based Faiss service to the current business scenario. However, I happened to come across Vearch, open source software from JD.com (Jingdong), so I set that plan aside and decided to study Vearch properly.

Service composition

The search-by-image service consists of two parts. One is the vector search service, which is provided by Vearch. The other extracts features from images as feature vectors, which is provided by Vearch's python-algorithm-plugin.

Vector search service - Vearch

Vearch is an elastic distributed system for high-performance similarity search over large-scale deep learning vectors. At its core is Gamma, a vector search engine built on Faiss. Besides vector search, Gamma can also store documents that contain scalar fields and quickly index and filter on them. In other words, it supports both scalars and vectors, whereas general-purpose search engines such as Elasticsearch are built primarily around scalar fields. Faiss on its own can only provide a single-machine vector search service; Vearch uses Gamma as the vector search engine, the Raft protocol for multi-replica storage, and Master and Router components to form an elastic distributed system for vector similarity search. The architecture diagram shows three main components: Master, Router, and PartitionServer. Their functions are as follows:

  • The Master manages schemas and coordinates cluster-level metadata and resources
  • The Router provides a RESTful API for creating, deleting, updating, and searching documents, routes and forwards requests, and merges results
  • The PartitionServer achieves multi-replica storage through the Raft protocol, while the actual storage, indexing, and retrieval capabilities are provided by the Gamma engine
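
To make the Router's "merging results" role concrete, here is a minimal sketch of combining per-partition hits into a global top-k. It is an illustration only: Vearch itself is written in Go, and the function name and the `(score, doc_id)` tuple layout are assumptions made for this example, not Vearch's internal format.

```python
import heapq

def merge_partition_results(partition_results, k):
    """Merge per-partition hit lists [(score, doc_id), ...] into a global top-k.

    Each partition only knows its own best hits, so the router-side merge
    keeps the k highest scores across all partitions.
    """
    all_hits = (hit for hits in partition_results for hit in hits)
    return heapq.nlargest(k, all_hits)

# Hypothetical hits returned by two partitions for the same query.
partition_a = [(0.98, "doc_3"), (0.75, "doc_9")]
partition_b = [(0.91, "doc_12"), (0.60, "doc_4")]

top = merge_partition_results([partition_a, partition_b], k=3)
print(top)
```

For the merged list to be exact, every partition has to return at least k candidates of its own.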

Gamma is to Vearch what Lucene is to Elasticsearch.

Image processing service - Vearch plugin

For image processing, Vearch also provides a plugin, python-algorithm-plugin. Vearch aims to be an elastic distributed system for high-performance similarity search, and since text, images, and videos can all be converted into vectors, the Vearch team provides plugins to integrate such data into Vearch more easily. For images, the plugin provides object detection, feature extraction, similarity search, and other functions. The processing logic is as follows: feature vectors are extracted from images, stored in Vearch's Gamma engine, and served for retrieval.

Service building

The search-by-image service consists of two services. One is the vector search service provided by Vearch. The other extracts image features as vectors and is provided by Vearch's plugin python-algorithm-plugin.

Vearch

Vearch is written in Go, and Gamma, the core engine, is written in C++ (after all, Faiss is also developed in C++), so service deployment is fairly straightforward. For single-machine mode, start the service with ./vearch -conf config.toml. For cluster mode, append the role as the last argument: ./vearch -conf config.toml ps/router/master.

However, the GCC version on our online servers is too old and there is no Go environment, among other factors, so we deploy with Docker. To understand in detail how a Faiss service evolves into Vearch, an elastic distributed system, I compiled and installed it from source.

# Download the source code
git clone https://github.com/vearch/vearch
# Switch to the image build directory
cd vearch/cloud
# Build the compile-environment image vearch/vearch_env:3.2.2, which installs GCC, git, Faiss, RocksDB, Go, etc.
# Alternatively, pull the official image: docker pull vearch/vearch_env:3.2.2
sh compile_env.sh
# Compile the vearch binary inside vearch_env; this mainly pulls the Gamma source code and compiles it
sh compile.sh
# Package the image vearch/vearch:3.2.2, which contains the compiled vearch binary and its dependent libraries
# Alternatively, pull the official image: docker pull vearch/vearch:3.2.2
sh build.sh

The official image packaging still has room for optimization. When packaging, it is recommended to use the CentOS package sources and to prepare the Faiss, RocksDB, Go, and other source files in advance.

Image processing

There is no official Docker image for the image processing service, and the image packaging files in the GitHub repository have a problem. You can use the following repository for packaging instead.

# Download the source code (with a modified Dockerfile)
git clone -b study https://github.com/haojunyu/python-algorithm-plugin
# Switch to the repository directory and build the image (tagged haojunyu/vimgs:3.2.2 here)
# Alternatively, pull the prebuilt image: docker pull haojunyu/vimgs:3.2.2
cd python-algorithm-plugin && docker build -t haojunyu/vimgs:3.2.2 .

Starting with Swarm

The services are started in Docker Swarm mode with docker stack deploy -c docker-compose.yml followed by a stack name. The docker-compose.yml file is as follows:

version: '3.3'
services:
  vearch:
    image: vearch/vearch:3.2.2
    ports:
      - "8817:8817"
      - "9001:9001"
    volumes:
      - ./config.toml:/vearch/config.toml
      - ./data:/datas
      - ./logs:/logs
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 3
    logging:
      driver: "json-file"
      options:
        max-size: "1g"
  imgs:
    image: haojunyu/vimgs:3.2.2
    ports:
      - "4101:4101"
    volumes:
      - ./python-algorithm-plugin/src/config.py:/app/src/config.py
      - ./images/imgs:/app/src/imgs
    command: ["bash", "../bin/run.sh", "image"]
    deploy:
      mode: replicated
      replicas: 3
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 3

Note: the mounted file python-algorithm-plugin/src/config.py is the configuration file of the image processing service. Its main settings are:

  • port: the port of the image processing service, 4101 by default
  • gpus: specifies whether the service uses the GPU, -1 by default
  • master_address, router_address: the addresses of the master and router services of Vearch
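
As a rough illustration, the mounted config.py might look like the sketch below. Only the four setting names and the defaults 4101 and -1 come from the list above; the address format, the service host name "vearch", and the use of port 9001 for the router are assumptions, so check the plugin's actual config.py before relying on them.

```python
# Hypothetical sketch of python-algorithm-plugin/src/config.py.
# Setting names follow the list above; everything else is an assumption.

port = 4101   # port of the image processing service (default from the text)
gpus = -1     # default from the text; the exact type and meaning are not shown here

# Assumed addresses: "vearch" is the compose service name, 8817 is the master
# port used in this article, and 9001 is assumed to be the router port.
master_address = "http://vearch:8817"
router_address = "http://vearch:9001"
```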

Service usage

The image service and the Vearch service are tightly integrated, so in general the image service is called directly and it writes the image vectors into Vearch itself. Refer to the Vearch documentation for detailed operations.

Service monitoring

# master_server refers to the Vearch master node and its port, e.g. localhost:8817
# Check the cluster status
curl -XGET http://master_server/_cluster/stats
# Check the health status
curl -XGET http://master_server/_cluster/health
# Check the server (port) status
curl -XGET http://master_server/list/server
# Creating a table locks the cluster. If the service fails during creation, the lock is not released
# and has to be cleared manually before a table can be created again.
curl -XGET http://master_server/clean_lock
# Scale replicas up or down
curl -XPOST -H "content-type: application/json" -d '{ "partition_id":1, "node_id": 1, "method": 0 }' http://master_server/partition/change_member
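
The read-only monitoring calls can also be scripted. The sketch below uses only the Python standard library and the paths from the curl commands; the helper names are hypothetical, the master address is the example value localhost:8817, and treating the response body as JSON is an assumption.

```python
import json
import urllib.request

MASTER = "localhost:8817"  # example master address from the comment above

# Read-only monitoring endpoints taken from the curl commands.
MONITORING_PATHS = ["_cluster/stats", "_cluster/health", "list/server"]

def master_url(path, master=MASTER):
    """Build the full URL of a master endpoint."""
    return f"http://{master}/{path.lstrip('/')}"

def get_json(path, master=MASTER, timeout=5):
    """GET a master endpoint and decode the body as JSON (assumed format).

    Needs a running Vearch master, so it is defined here but not called.
    """
    with urllib.request.urlopen(master_url(path, master), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

urls = [master_url(p) for p in MONITORING_PATHS]
print(urls)
```

With a master running, something like `get_json("_cluster/health")` could be polled from a cron job or a health check.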

Library and space operations

Libraries and spaces are similar to databases and tables in MySQL.

  • Library operations

# List all libraries in the cluster
curl -XGET http://master_server/list/db
# Create a library
curl -XPUT -H "content-type: application/json" -d '{ "name": "sv_month" }' http://master_server/db/_create
# View a library
curl -XGET http://master_server/db/$db_name
# Delete a library (it cannot be deleted while table spaces still exist under it)
curl -XDELETE http://master_server/db/$db_name
# List all table spaces under the specified library
curl -XGET "http://master_server/list/space?db=$db_name"

  • Table space operations

# Create the table space test (image) under the library sv_month
curl -XPUT -H "content-type: application/json" -d '{
  "name": "test",
  "partition_num": 1,
  "replica_num": 1,
  "engine": {
    "name": "gamma",
    "index_size": 70000,
    "max_size": 10000000,
    "id_type": "String",
    "retrieval_type": "IVFPQ",
    "retrieval_param": {
      "metric_type": "InnerProduct",
      "ncentroids": 256,
      "nsubvector": 32
    }
  },
  "properties": {
    "itemid": { "type": "keyword", "index": true },
    "feature1": { "type": "vector", "dimension": 512, "model_id": "vgg16", "format": "normalization" }
  }
}' http://image_server:4101/space/sv_month/_create
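
When the space definition is generated from code, a small sanity check helps. The sketch below rebuilds the same payload in Python and checks that the vector dimension is divisible by nsubvector, which Faiss-style product quantization expects because it splits each vector into equal-sized sub-vectors (here 512 / 32 = 16 dimensions each). The helper name is hypothetical, and whether Vearch validates this itself is not covered by this article.

```python
import json

def build_space_payload(name, dimension, ncentroids=256, nsubvector=32):
    """Build the space-creation body used above, with a basic IVFPQ sanity check."""
    if dimension % nsubvector != 0:
        raise ValueError(
            f"dimension {dimension} is not divisible by nsubvector {nsubvector}"
        )
    return {
        "name": name,
        "partition_num": 1,
        "replica_num": 1,
        "engine": {
            "name": "gamma",
            "index_size": 70000,
            "max_size": 10000000,
            "id_type": "String",
            "retrieval_type": "IVFPQ",
            "retrieval_param": {
                "metric_type": "InnerProduct",
                "ncentroids": ncentroids,
                "nsubvector": nsubvector,
            },
        },
        "properties": {
            "itemid": {"type": "keyword", "index": True},
            "feature1": {
                "type": "vector",
                "dimension": dimension,
                "model_id": "vgg16",
                "format": "normalization",
            },
        },
    }

payload = build_space_payload("test", dimension=512)
body = json.dumps(payload)  # string to send as the request body
print(body)
```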

Data manipulation

  • Insert data

# Insert local image data into the table space
curl -XPOST -H "content-type: application/json" -d '{ "itemid":"COCO_val2014_000000123599", "feature1":{ "feature":"../images/COCO_val2014_000000123599.jpg" } }' http://image_server:4101/sv_month/test/AW63W9I4JG6WicwQX_RC

  • Search data

# Query similar results
curl -H "content-type: application/json" -XPOST -d '{ "query": { "sum": [ { "feature":"../images/COCO_val2014_000000123599.jpg", "field":"feature1" }] } }' http://image_server:4101/sv_month/test/_search
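
The same insert and search calls can be prepared from Python. This is a minimal sketch using only the standard library; the URLs and JSON bodies mirror the curl commands, image_server:4101 is the placeholder host from those commands, and the helper names are hypothetical. Decoding the response as JSON is an assumption.

```python
import json
import urllib.request

IMAGE_SERVER = "image_server:4101"  # placeholder host from the curl examples

def build_post(path, payload, server=IMAGE_SERVER):
    """Build (but do not send) a JSON POST request to the image service."""
    return urllib.request.Request(
        f"http://{server}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )

def insert_request(db, space, doc_id, item_id, image_path):
    """Request that inserts one image; the service extracts the vector itself."""
    payload = {"itemid": item_id, "feature1": {"feature": image_path}}
    return build_post(f"{db}/{space}/{doc_id}", payload)

def search_request(db, space, image_path, field="feature1"):
    """Request that searches for images similar to image_path."""
    payload = {"query": {"sum": [{"feature": image_path, "field": field}]}}
    return build_post(f"{db}/{space}/_search", payload)

def send(request, timeout=10):
    """Send a prepared request; needs a running image service, so it is not called here."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = search_request("sv_month", "test", "../images/COCO_val2014_000000123599.jpg")
print(req.full_url)
```

Keeping request construction separate from sending makes the payloads easy to inspect and test without a live cluster.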

Results and going online

Results

Once the service is built, the search-by-image results need to be checked. The evaluation is initially done by manual inspection and ultimately by online metrics. It helps to set expectations for the results before building the service, such as:

  1. Identical images should have a similarity close to 100%
  2. A query should return results of the same type, such as dogs for dogs and cars for cars
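
The first expectation follows from how the space is configured: with normalized vectors and the InnerProduct metric, a vector compared with itself scores 1. The sketch below shows this with plain Python, assuming that the "normalization" format means L2 normalization; the vectors are made-up values, not real image features.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up feature vectors for two different images.
image_a = l2_normalize([3.0, 4.0, 12.0])
image_b = l2_normalize([12.0, 4.0, 3.0])

self_score = inner_product(image_a, image_a)   # same image against itself
cross_score = inner_product(image_a, image_b)  # two different images
print(self_score, cross_score)
```

In practice the score for an identical image may fall slightly below 1, since IVFPQ is an approximate index that stores compressed vectors.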

The following screenshot shows the search results:

Overall, the results are quite good.

Going online

A recommendation strategy can go online in the following ways:

  1. Serve requests directly online, as a ranking model does. This requires the service to support high concurrency, high performance, and high availability
  2. Online calls plus a cache, as content search does. This requires the service to support high performance and high availability, and the cache needs a high hit rate
  3. Write offline results to a cache. Strategies such as CF and hot lists can compute their results in advance
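
The second approach, online calls plus a cache, can be sketched as a small time-to-live cache in front of the search call. This is only an illustration: the class, the 600-second expiry, and the fake search function are assumptions, and a production setup would more likely use a shared cache such as Redis or Memcached.

```python
import time

class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expiry_time, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute(key) on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute(key)
        self._store[key] = (now + self._ttl, value)
        return value

backend_calls = []

def fake_vector_search(item_id):
    # Stand-in for the real online call to the vector search service.
    backend_calls.append(item_id)
    return [f"{item_id}_similar_{i}" for i in range(3)]

cache = TTLCache(ttl_seconds=600)
first = cache.get_or_compute("video_1", fake_vector_search)
second = cache.get_or_compute("video_1", fake_vector_search)  # served from the cache
print(len(backend_calls), cache.hits, cache.misses)
```

The higher the hit rate, the fewer requests reach the search service, which is why this approach tolerates a backend with limited QPS.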

About 90,000 videos (new videos from the last 7 days plus videos exposed in the last 30 days) were imported into the single-machine service. Vearch could not sustain the average load of 48 QPS from two buckets, so the second approach (online calls plus a cache) was used to go online. The strategy of searching image vectors with a user vector (the mean of the image vectors of the videos the user has played) went online through the third approach.
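
The user vector described above is an element-wise mean of the image vectors of played videos. A minimal sketch, with made-up 3-dimensional vectors instead of real 512-dimensional features and a hypothetical helper name:

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

# Made-up cover-image vectors of three videos a user has played.
played = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
user_vector = mean_vector(played)
print(user_vector)
```

Since the mean of unit vectors is generally not a unit vector, it may be worth re-normalizing the user vector before an inner-product search; whether the original setup did so is not stated here.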

References

  1. Use of Faiss in the project
  2. faiss-web-service
  3. Faiss Server practice based on gRPC
  4. Jingdong distributed vector retrieval system Vearch
  5. Vearch Chinese documents
  6. The vearch core engine gamma
  7. Vearch image processing plug-in
  8. Image search page
