Image retrieval is the task of finding images in a database that contain the same or similar instances as a given query image. This article looks at the update of Amap POI information: every new or changed POI must be promptly turned into map data based on its image sources. This is a very typical vertical application of image retrieval, and the convenience behind it rests on a range of CV techniques. Here we examine the computer vision technology behind the “Amap POI information update” business, as shared by Octopus, a senior CV engineer.

Get “computer vision” industry solutions

The implementation code, datasets, paper collections and article collections of this series have been organized into big-tech industry solution packages. Reply with the keyword “computer vision” in the official account (AI Algorithm Research Institute) background to obtain them.

Related code implementation reference

ShowMeAI community technical experts have also implemented typical image retrieval algorithms and built related applications. For details of the “CNN and Triplet Loss based image retrieval implementation”, please visit our GitHub project (github.com/ShowMeAI-Hu…) for the implementation code. Thanks to the ShowMeAI community partners who participated in this project, and PRs and Stars are welcome!

Recommended reading | click “computer vision” series

Big Tech | Image retrieval and its technical implementation in Taobao @ Computer Vision series

1. Business background of Amap image retrieval

The technology applied here is image retrieval; the application scenario is Amap, and the specific application is the update of Amap POI information (in Amap's image data, POI plaques and POIs correspond one to one).

POI: Point of Interest. On an electronic map, a POI can be a restaurant, supermarket, government office, tourist attraction, transport facility, and so on. POI data is the core data of an electronic map.

  • POI data contains name information, location information and so on, which meets users' basic need to “find a destination” on an electronic map and thus invoke navigation services.
  • POI data supports features such as “search nearby” and “reviews”, which increase user engagement and active time.
  • POI data is also a link between online and offline interaction, and an important component of the Location Based Service industry.

In Amap's business scenarios, every new or adjusted POI needs to be promptly produced into data based on its own image sources.

Generally speaking, within a short period of time (a month), POIs at the same location rarely change (in the example shown, only the “Soup Fire Kung Fu” POI is a new listing). From a technical point of view, a “redo every POI every time” approach is therefore not feasible: the operational cost is too high. A better implementation is to automatically filter out POIs that have not changed. This task is a very typical image retrieval task, and its key technique is image matching.

1.1 Task definition of image retrieval

Image retrieval problem definition: given a query image, search for similar images in a large image gallery by analyzing the visual content. Image retrieval has long been a research topic in computer vision, and it is widely used in tasks such as person re-identification, face recognition and visual localization.

The image retrieval process requires “image feature extraction” + “comparison and retrieval”:

1.1.1 Image feature extraction

This usually covers global features, local features, auxiliary features, etc., optimized for the characteristics of each task. For example, person re-identification and face recognition have strong rigid-body constraints and obvious key features (pedestrian/face keypoints), so body segmentation or keypoint detection information is integrated into the model's feature extraction.

1.1.2 Comparison and retrieval

The core technology is metric learning, whose goal is to pull samples of the same category closer together and push samples of different categories apart in a fixed-dimensional feature space. In the deep learning era, several classical designs exist, optimized through the definition of positive/negative samples and the design of the loss function:

  • Contrastive Loss
  • Triplet Loss
  • Center Loss
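As a minimal illustration of the metric-learning objective, here is a sketch of Triplet Loss in NumPy. The function names and toy embeddings are hypothetical; real training operates on batches of learned embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style Triplet Loss: push the anchor-negative distance to be
    at least `margin` larger than the anchor-positive distance."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

# Toy 2-D embeddings: the positive is close, the negative is far.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([2.0, 0.0])
print(triplet_loss(a, p, n))  # -> 0.0, the triplet is already satisfied
```

When the negative sits closer than the positive plus the margin, the loss becomes positive and gradients pull the embeddings apart.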

1.2 Autonavi business problems and difficulties

POI plaque image retrieval differs considerably from mainstream academic retrieval tasks (such as person re-identification), mainly in the following points:

  • Heterogeneous data
  • Occlusion
  • Text dependency

1.2.1 Heterogeneous data

Heterogeneous data refers to differences between images taken by different cameras, in different environments and under different conditions. The POI plaque retrieval scenario suffers seriously from this problem: as shown below, images from different sources are taken under different shooting conditions.

The brightness, shape and clarity of POI plaques vary greatly because of differing camera quality and viewing angles. Retrieving POI plaques across such heterogeneous data is a very challenging problem.

1.2.2 Influence of occlusion

In road scenes there are often trees, vehicles and other distractions, and because of shooting angles, POI plaques frequently suffer severe occlusion. Occlusion poses a great challenge for POI plaque retrieval.

1.2.3 Text dependency

Another unique feature of POI plaque is its strong dependence on text, mainly on the text of POI name.

In the scenario illustrated, the two plaques should not match, which requires introducing text features to enhance feature discrimination. Occlusion also degrades the expression of text features, so they must be traded off against image features. Moreover, text features and image features come from different modalities; how to fuse multi-modal information is another technical difficulty unique to this business.

2. Overall technical scheme

The technical scheme for plaque retrieval mainly includes “data generation” and “model optimization”. The overall technical framework is shown in the figure below:

2.1 Data generation module

The “data generation” module has two stages: “automatic data generation at cold start” and “data generation from model iteration”:

  • [1] The traditional SIFT matching algorithm is used to automatically generate the training data the model needs, completing the model's cold start.
  • [2] After the model goes online, the results of online manual operations are automatically mined and organized into training data for iterative model optimization.

2.2 Model optimization module

In the “model optimization” module, given the rich text information on plaques, the Amap team fuses visual information with text information and designs a “multi-modal retrieval model” on a Triplet Loss metric learning framework:

  • Two branches are designed: a “visual branch” and a “text branch”. The visual branch takes the plaque image as input and extracts features through two sub-branches; the text branch takes the plaque text as input and uses BERT for feature extraction.
  • For feature extraction of visual information, “global feature branch” and “local feature branch” are further designed and optimized respectively.

3. Data generation module

Training a retrieval model usually requires instance-level annotation, i.e. at POI plaque granularity. However, picking out the same POI plaque from different materials is very tedious work; manual labeling would be expensive and could not scale. Amap therefore designed a simple and efficient pipeline that automatically generates training data for the model's cold start without any manual annotation.

The specific process follows traditional feature-point matching: the SIFT algorithm matches all plaques from two data passes pairwise, and the matching results are filtered by inlier count, i.e. matched plaques whose inlier count exceeds a threshold are treated as the same plaque.
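The filtering step above can be sketched as follows. This is an illustrative NumPy version in which Lowe's ratio test on descriptor matches stands in for the geometric inlier count (the production pipeline matches SIFT keypoints and filters with geometric inliers; all names and thresholds here are hypothetical):

```python
import numpy as np

def count_good_matches(desc1, desc2, ratio=0.75):
    """Count Lowe-ratio-test matches between two descriptor sets.
    desc1, desc2: (N, D) arrays of SIFT-like descriptors."""
    # Pairwise Euclidean distances between all descriptor pairs.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    good = 0
    for row in d:
        idx = np.argsort(row)
        best, second = row[idx[0]], row[idx[1]]
        if best < ratio * second:   # unambiguous nearest neighbour
            good += 1
    return good

def same_plaque(desc1, desc2, min_inliers=10):
    """Treat two plaques as identical when enough matches survive."""
    return count_good_matches(desc1, desc2) >= min_inliers
```

In practice `desc1`/`desc2` would come from a SIFT extractor, and the surviving matches would additionally be verified geometrically before counting inliers.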

3.1 Problems existing in the traditional feature point matching algorithm

The traditional feature-point matching algorithm generalizes poorly, and the training data it produces can lead to poor model learning, embodied in:

  • [1] The training samples are relatively simple.
  • [2] Category conflict, that is, the same plaque is divided into multiple categories.
  • [3] Category error, that is, different plaques are divided into the same category.

3.2 Autonavi team’s optimization method

[1] Use the matching results of “multi-pass data” to improve plaque diversity within the same category;

Amap generates training data from the matching results of “multi-pass data”: different passes contain multiple shots of the same plaque from different viewpoints, which ensures intra-class diversity and avoids the problem that automatically generated samples are all easy samples.

[2] “Batch sampling strategy” and “MDR Loss” are adopted to reduce the sensitivity of the model to mislabeled data.

The batch sampling strategy samples by category; since the total number of categories in the data is much larger than the batch size, the category-conflict problem is alleviated.
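Sampling by category is commonly implemented as P×K sampling: each batch draws P classes and K instances per class, so a conflicting pair of classes rarely lands in the same batch. A small sketch under that assumption (all names hypothetical):

```python
import random
from collections import defaultdict

def pk_batch(labels, p=4, k=4, seed=0):
    """Sample one batch of P classes x K instances per class.
    `labels` maps sample index -> class id."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in labels.items():
        by_class[c].append(idx)
    classes = rng.sample(sorted(by_class), p)          # P distinct classes
    batch = []
    for c in classes:
        pool = by_class[c]
        # Sample K instances, with replacement if the class is small.
        picks = rng.sample(pool, k) if len(pool) >= k else rng.choices(pool, k=k)
        batch.extend(picks)
    return batch
```

Each batch then contains P·K samples with guaranteed positives (same class) and negatives (other classes) for triplet mining.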

Built on Triplet Loss, MDR Loss designs a new metric learning framework that applies regularization constraints at different distance levels, reducing the model's overfitting to noisy samples.

The following is a comparison diagram of Triplet Loss and MDR Loss.

The distance regularization constraint in MDR Loss keeps the distance between a positive sample and the anchor from being pulled arbitrarily close, and keeps negative samples from being pushed infinitely far away.

For category-error noise samples, different plaques are mistakenly assigned to the same category. Under the Triplet Loss objective, the model is then forced to pull their distance arbitrarily close; it overfits these noise samples, hurting the final result.
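A rough sketch of the idea behind MDR Loss: each pairwise distance is regularized toward the nearest of several distance levels, so positives are not collapsed to zero and negatives are not pushed to infinity. In the paper the levels are learnable; the fixed levels, weighting and function names below are illustrative only:

```python
import numpy as np

def mdr_regularizer(distances, levels=(0.2, 0.6, 1.0)):
    """Multi-level Distance Regularization (sketch): penalize each
    pairwise distance by its gap to the nearest 'level'."""
    d = np.asarray(distances, dtype=float)
    # Residual to the nearest level, per distance.
    residuals = np.min(np.abs(d[:, None] - np.asarray(levels)[None, :]), axis=1)
    return residuals.mean()

def triplet_loss(d_ap, d_an, margin=0.3):
    """Standard triplet term on precomputed distances."""
    return max(0.0, d_ap - d_an + margin)

def mdr_loss(d_ap, d_an, lam=0.1):
    """Triplet term plus distance regularization over both distances."""
    return triplet_loss(d_ap, d_an) + lam * mdr_regularizer([d_ap, d_an])
```

Because the regularizer anchors distances to finite levels, a mislabeled "positive" pair cannot drag the model into collapsing two different plaques onto one point.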

4. Model optimization module

To optimize plaque retrieval, Autonavi’s solution designs a multi-modal retrieval model that integrates visual and textual information in plaques.

  • For visual information, the model's extraction of global features and local features is optimized.
  • For text information, BERT encodes the OCR results of the plaque as auxiliary features, which are fused with the visual features for metric learning.

4.1 Global Features

In general, the global features extracted by a deep learning model are robust for retrieval tasks and can adapt to changes in plaque viewing angle, color, illumination and so on. To further improve the robustness of global features, Amap's solution is optimized in two ways:

  • Use the “Attention mechanism” to focus on important features.
  • Backbone network improvements to focus on more fine-grained features.

4.1.1 Introduce attention mechanism

In Amap's real business scenarios there are plaques with similar appearance but different details. In such cases the model should attend to fine-grained information in the plaque, such as the font of the characters, the text layout, or the text content itself.

The attention mechanism helps the model focus precisely on the critical parts, among a large amount of information, that distinguish different plaques. Introducing an attention module into the network therefore lets the model learn the key information and improves the discriminative power of the global features.

The Amap team uses Spatial Group-wise Enhance (SGE). SGE adjusts the importance of each spatial position by generating an attention factor for every position on the feature map. The SGE module works as follows:

  • First, the feature map channels are grouped.
  • A semantic feature vector is computed for each group (by global pooling).
  • A position-wise dot product between the semantic vector and the group's feature map yields an attention map.
  • The attention map re-weights the features position-wise, producing a better spatial distribution of the semantic features.
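The four steps above can be sketched in NumPy as follows. This is a simplified single-sample version: the published SGE module also applies a learnable scale and shift after normalization, which is omitted here:

```python
import numpy as np

def sge(x, groups=4, eps=1e-5):
    """Spatial Group-wise Enhance (sketch) on a (C, H, W) feature map."""
    c, h, w = x.shape
    out = np.empty_like(x)
    step = c // groups
    for g in range(groups):
        grp = x[g * step:(g + 1) * step]                   # one channel group
        sem = grp.mean(axis=(1, 2), keepdims=True)         # semantic vector (global pooling)
        attn = (grp * sem).sum(axis=0)                     # position-wise dot product -> (H, W)
        attn = (attn - attn.mean()) / (attn.std() + eps)   # normalize over positions
        gate = 1.0 / (1.0 + np.exp(-attn))                 # sigmoid attention map
        out[g * step:(g + 1) * step] = grp * gate          # re-weight the group's features
    return out
```

Positions whose features agree with the group's semantic vector get gates near 1 and are enhanced; dissimilar positions are suppressed.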

4.1.2 Improving Network Backbone

To reduce the loss of local detail, the network backbone is improved:

  • Remove the downsampling in the last block of the ResNet, so the final feature map retains more local information.
  • Replace the final Global Average Pooling layer with a GeM Pooling layer: GeM is a learnable feature aggregation method, of which Global Max Pooling and Global Average Pooling are special cases. Using GeM Pooling further improves the robustness of the global feature.
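GeM pooling itself is essentially a one-liner; a NumPy sketch (in the original formulation the exponent p is a learnable parameter, fixed here for illustration):

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized-Mean (GeM) pooling over a (C, H, W) feature map.
    p = 1 reduces to average pooling; p -> infinity approaches max pooling."""
    x = np.clip(x, eps, None)              # GeM assumes non-negative activations
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)
```

With an intermediate p (commonly around 3), the pooled descriptor emphasizes strong local activations without discarding the rest, which is why it sits between average and max pooling.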

4.2 Local Features

After optimization for global features, the existing model still does not perform well in the following three aspects:

  • Features of truncated plaques are poorly learned, as shown in figure (a).
  • Features of occluded plaques absorb irrelevant context information, as shown in figure (b).
  • Similar-looking but different plaques are hard to distinguish, as shown in figure (c).

In view of these three points, the solution further designs a local feature branch that makes the model attend to “local information” such as the geometry and texture of the plaque, which is used together with the global features for plaque retrieval.

For “local feature” extraction, the main idea is to cut the plaque vertically into several parts, attend to the local features of each part, and optimize the local features after alignment.

The alignment operation is shown in the figure above:

  • Vertical pooling of the feature map yields block-wise local features.
  • The distance matrix between the local features of the two images is computed, and the shortest path through it, following Formula 1, aligns the two images.

$$S_{i, j}=\left\{\begin{array}{ll} d_{i, j} & i=1,\ j=1 \\ S_{i-1, j}+d_{i, j} & i \neq 1,\ j=1 \\ S_{i, j-1}+d_{i, j} & i=1,\ j \neq 1 \\ \min (S_{i-1, j},\ S_{i, j-1})+d_{i, j} & i \neq 1,\ j \neq 1 \end{array}\right.$$
  • $i$, $j$ index the $i$-th block of one image and the $j$-th block of the other.
  • $d_{i,j}$ is the Euclidean distance between the features of block $i$ and block $j$ in the two images.

In this way, local feature alignment improves plaque retrieval under truncation, occlusion and inaccurate detection boxes.
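Formula 1 is a standard dynamic program over the block-distance matrix; a minimal pure-Python sketch (function name hypothetical):

```python
def align_distance(d):
    """Shortest-path alignment cost over a block-distance matrix `d`
    (Formula 1): d[i][j] is the distance between block i of one image
    and block j of the other. The monotone path tolerates blocks that
    are shifted by truncation or an inaccurate detection box."""
    m, n = len(d), len(d[0])
    S = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                S[i][j] = d[i][j]
            elif j == 0:                       # first column: can only come from above
                S[i][j] = S[i - 1][j] + d[i][j]
            elif i == 0:                       # first row: can only come from the left
                S[i][j] = S[i][j - 1] + d[i][j]
            else:
                S[i][j] = min(S[i - 1][j], S[i][j - 1]) + d[i][j]
    return S[m - 1][n - 1]
```

For two identical plaques the path hugs the zero diagonal; for vertically shifted crops, the min() lets matching blocks line up at an offset instead of being compared index-to-index.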

4.3 Text Features

POI plaques depend strongly on text, and there can be cases where “only the name text on the plaque changes”. The global and local feature branches designed above learn text features to some extent, but text occupies a small share of the overall information and the supervision signal only says whether two images are similar, so text features are not well learned.

The solution leverages the existing OCR recognition results: BERT encodes the OCR text to obtain text features, which serve as an auxiliary feature branch fused with the visual features; the fused features are used in the final metric learning for plaque retrieval.

A detail in the solution: when extracting OCR results for a plaque, to reduce the impact of inaccurate single-frame recognition, the multi-frame OCR results of the same plaque are used and concatenated. When BERT encodes the OCR text, [SEP] symbols are inserted between OCR results from different frames to distinguish them.
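The multi-frame concatenation detail can be sketched as follows; the actual BERT tokenization and encoding step is omitted, and the function name and sample strings are hypothetical:

```python
def build_bert_input(frame_ocr_results):
    """Concatenate multi-frame OCR strings for one plaque into a single
    BERT input, separating frames with [SEP] so the encoder can tell
    them apart (the tokenizer adds its own [CLS] and final [SEP])."""
    # Drop empty recognitions before joining.
    texts = [t.strip() for t in frame_ocr_results if t and t.strip()]
    return " [SEP] ".join(texts)

print(build_bert_input(["Soup Fire Kung Fu", "Soup Fire Kung F", ""]))
# -> "Soup Fire Kung Fu [SEP] Soup Fire Kung F"
```

A standard BERT tokenizer would then encode this string, and the pooled output would serve as the auxiliary text feature fused with the visual branches.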

5. Business results

Under the new technical scheme, Amap POI plaque image retrieval achieves very good results: both precision and recall exceed 95%, online metrics improve substantially, and model speed also improves markedly. Some very hard cases are gradually solved during optimization, as shown in the figure.

(a), (b) and (c) show bad cases before this scheme (left: query image; right: Rank-1 retrieval result). These bad cases make clear that plaque retrieval demands very fine-grained feature extraction, since the failures are typically similar overall but different in local details. Such cases motivated the multi-modal retrieval model in the first place, and they are gradually solved during optimization, as shown in figures (d), (e) and (f).

  • With its optimized global features and aligned local features, the multi-modal retrieval model attends more to the distinctive local characteristics of a plaque, such as the text, its font, the layout and the plaque texture, so it distinguishes similar-looking but different plaques better, as the comparison between figures (a) and (d) shows.

  • In addition, because of occlusion from different viewpoints, varying light intensity during shooting, and large color differences between cameras, some plaques are very hard to retrieve with visual features alone. OCR information is therefore added through the auxiliary feature branch to further strengthen feature robustness, so that retrieval jointly considers the visual information of the plaque and the text on it, as the comparison between figures (b) and (e) shows.

6. Summary and next optimization directions

The image retrieval scheme above is applied in Amap's real business and helps automate part of the data production. The model is not perfect, however, and bad cases remain; the following directions can be considered:

  • Semi-supervised/active learning automatically supplements data.
  • Introduce the Transformer structure.

6.1 Semi-supervised/active learning data mining

Data matters greatly, since the model itself is data-driven. Supplementing data, especially targeted data, helps the model handle extreme cases. The key is how to mine corner cases and annotate them automatically and purposefully; semi-supervised learning and active learning are relatively promising approaches.

6.1.1 Semi-supervised learning

  • A model trained on labeled data generates pseudo labels for massive unlabeled data; the labeled and pseudo-labeled data are then mixed to further optimize the model.
  • Since semi-supervised learning generates labels entirely by the model itself, it may cap the model's achievable performance.
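The pseudo-labeling step described above can be sketched as follows (a simplified sketch; the function names and the confidence threshold are illustrative, and real pipelines add per-class thresholds and re-training loops):

```python
def make_pseudo_labels(predictions, threshold=0.9):
    """Keep only high-confidence model predictions on unlabeled data
    as pseudo labels. `predictions` maps sample id -> (label, confidence)."""
    return {sid: label
            for sid, (label, conf) in predictions.items()
            if conf >= threshold}

def mixed_training_set(labeled, predictions, threshold=0.9):
    """Mix ground-truth labels with pseudo labels for re-training;
    ground truth wins on conflicts."""
    mixed = make_pseudo_labels(predictions, threshold)
    mixed.update(labeled)          # labeled data overrides pseudo labels
    return mixed
```

Low-confidence samples, which this filter discards, are exactly the candidates an active-learning loop would send to human annotators.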

6.1.2 Active learning

  • The model trained with labeled data is used to mine massive unlabeled data, and the mined valuable data is annotated manually.
  • Active learning can raise the upper limit of semi-supervised learning to a certain extent.
  • Effective combination of the two can better supplement training data and solve Corner Case.

6.2 Feature extraction and fusion of Transformer

The Transformer has been a very effective model structure in recent years. It is extremely popular in NLP, where most pre-trained models adopt its structure and design ideas, and it also performs excellently across many computer vision tasks (classification, detection, segmentation, tracking, person re-identification).

  • Compared with CNNs, the Transformer has a global receptive field and can model high-order correlations, giving better representations for feature extraction in many tasks.
  • The Transformer's input is also flexible: other modalities can easily be encoded and fed into the model together with image features, so it has great advantages for multi-modal feature fusion.

One optimization direction is to use the Transformer's correlation modeling over image patches to improve the matching of POI plaques in occlusion/truncation scenes, and to realize multi-modal feature fusion by encoding text features alongside them.


8. References

  • [1] Zhang X, Luo H, Fan X, et al. AlignedReID: Surpassing human-level performance in person re-identification[J]. arXiv preprint arXiv:1711.08184, 2017.
  • [2] Kim Y, Park W. Multi-level distance regularization for deep metric learning[J]. arXiv preprint arXiv:2102.04223, 2021.
  • [3] Radenović F, Tolias G, Chum O. Fine-tuning CNN image retrieval with no human annotation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(7): 1655-1668.
  • [4] Li X, Hu X, Yang J. Spatial group-wise enhance: Improving semantic feature learning in convolutional networks[J]. arXiv preprint arXiv:1905.09646, 2019.

  • Authors: Han Xinzi @Showmeai, Octopus @Gaode
  • Address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author and indicate the source.