The paper proposes an attention convolutional binary neural tree for weakly supervised fine-grained classification. Attention convolution operations are attached to the edges of the tree structure, a routing function at each node defines the computation path from the root node to the leaf nodes, and the final prediction is obtained by combining the predictions of all leaf nodes. Both the idea and the results of the paper are strong. Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)

Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization

  • Paper address: Arxiv.org/abs/1909.11…

Introduction


Fine-Grained Visual Categorization (FGVC) is a branch of image classification. Because the similarity among categories is very high, even ordinary people find it difficult to distinguish them, which makes it a field of significant research value. Inspired by research on neural trees, the paper designs an attention convolutional binary neural tree architecture (ACNet) for weakly supervised fine-grained classification. The main contributions of the paper are as follows:

  • ACNet, a binary neural tree structure combined with attention convolution, is proposed for fine-grained classification. Attention convolution operations are attached to the edges of the tree structure, and a routing function at each node defines the computation path from the root node to the leaf nodes, similar to a neural network. This gives the model expressive power comparable to a neural network and lets it learn features from coarse to fine levels; different branches focus on different local regions, and the final prediction is obtained by combining the predictions of all leaf nodes
  • An attention transformer module is added to strengthen the network's ability to capture the key features needed for accurate classification
  • SOTA results are achieved on three datasets: CUB-200-2011, Stanford Cars and Aircraft

Attention Convolutional Binary Neural Tree


ACNet consists of four modules: the backbone network, branch routing, attention transformer and label prediction modules, as shown in Figure 2. ACNet is defined as a pair (T, O), where T is the tree topology and O is the set of operations on the edges of the tree. The paper uses a full binary tree T = {V, E}, where V is the set of nodes and E the set of edges; for a tree of depth h there are 2^h − 1 nodes and 2^h − 2 edges. Each node is a routing module that decides which computation node comes next, and the operations on the edges are attention transformer modules. In addition, the full binary tree uses an asymmetric structure, e.g. two transformer modules on the left edge and one transformer module on the right edge, which helps extract features at different scales

Architecture

  • Backbone network module

Since the key features of fine-grained categories are highly local and require relatively small receptive fields to extract, a truncated VGG-16 network is used as the backbone, and the input resolution is enlarged accordingly
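
Below is a minimal PyTorch sketch of one way to build such a truncated backbone; the exact truncation point and input size are not given in this summary, so the cut after conv4_3 and the 448×448 input are assumptions for illustration only.

```python
import torch
import torchvision

# Build a truncated VGG-16 backbone (assumption: keep layers up to the ReLU
# after conv4_3; the exact truncation used by ACNet is described in the paper).
vgg16 = torchvision.models.vgg16(weights=None)          # pretrained weights optional
backbone = torch.nn.Sequential(*list(vgg16.features.children())[:23])

x = torch.randn(1, 3, 448, 448)                         # assumed enlarged input resolution
feat = backbone(x)
print(feat.shape)                                       # feature map fed to the tree's root node
```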

  • Branch routing module

Branch routing is used to decide which child node to route to, as shown in Figure 2(b). The i-th routing module at the k-th layer, denoted R_i^k, consists of a 1×1 convolution followed by a global context block

The general structure of the global context block is shown in Figure A; it comes from the GCNet paper. A simplified non-local (NL) block is used for the context modeling and fusion steps, and an SE block is used for the transform step, so the module can extract features with rich context information. Finally, global average pooling, element-wise square-root, L2 normalization and a sigmoid-activated fully connected layer are used to output a scalar

Suppose the branch routing module R_i^k outputs, for a sample x, the probability φ_i^k(x) of routing to the right child node; then the probability of routing to the left child node is 1 − φ_i^k(x). The child with the higher probability has a greater influence on the final result
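
The following is a rough PyTorch sketch of such a routing head, assuming the description above (1×1 conv, pooling, square-root, L2 normalization, sigmoid FC); the global context block is omitted for brevity and the class/argument names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchRouting(nn.Module):
    """Sketch of a branch routing module: 1x1 conv -> (global context block,
    omitted here) -> GAP -> element-wise sqrt -> L2 norm -> FC -> sigmoid."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.fc = nn.Linear(in_channels, 1)

    def forward(self, x):
        x = self.conv(x)                             # 1x1 convolution
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)   # global average pooling
        x = torch.sqrt(torch.relu(x) + 1e-12)        # element-wise square root (assumes non-negative features)
        x = F.normalize(x, p=2, dim=1)               # L2 normalization
        p_right = torch.sigmoid(self.fc(x))          # probability of routing to the right child
        return p_right                               # probability of the left child is 1 - p_right
```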

  • Attention transformer

The attention transformer module is designed to enhance the network's ability to capture key features. After the convolution layers, an attention module with the structure shown in Figure 2(c) is inserted; its bypass branch outputs a channel attention map of size C×1×1, which is used to weight the input features
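
A minimal sketch of such a channel-attention bypass is shown below; the reduction ratio and the exact layer choices are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the bypass attention branch: squeeze the spatial dimensions,
    produce a C x 1 x 1 attention map, and reweight the input features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        attn = self.mlp(self.pool(x))   # channel attention map, shape (N, C, 1, 1)
        return x * attn                 # reweight the input features channel-wise
```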

  • Label prediction

For the i-th leaf node of ACNet, a label prediction module P_i is used to predict the category of the target x. The accumulated probability of x travelling from the root node to that leaf node is obtained by multiplying the routing probabilities along the path. The prediction module consists of a convolution layer, a max pooling layer, an L2-normalization layer, a fully connected layer and a softmax layer. The final prediction is obtained by summing, over all leaf nodes, the product of each leaf node's prediction and its accumulated path probability

The detailed formulation of the final prediction can be found in the paper. Note that the accumulated path probabilities of all leaf nodes sum to 1, and the prediction of each leaf node also sums to 1, so the final prediction is a valid probability distribution
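
A small sketch of this combination step is given below; the function name and tensor shapes are illustrative assumptions, but the weighting-and-summing logic follows the description above.

```python
import torch

def combine_leaf_predictions(leaf_probs, path_probs):
    """Combine per-leaf class distributions into the final prediction.

    leaf_probs: (num_leaves, num_classes) softmax outputs of the label prediction modules.
    path_probs: (num_leaves,) accumulated root-to-leaf routing probabilities (sum to 1).
    """
    return (path_probs.unsqueeze(1) * leaf_probs).sum(dim=0)  # also sums to 1

# Toy usage: 4 leaves, 3 classes
leaf_probs = torch.softmax(torch.randn(4, 3), dim=1)
path_probs = torch.softmax(torch.randn(4), dim=0)
print(combine_leaf_predictions(leaf_probs, path_probs).sum())  # ~1.0
```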

Training

  • Data augmentation

During the training phase, data augmentation is performed with cropping and flipping: the image is first scaled so that its short side is 512 pixels, then a fixed-size patch is randomly cropped and randomly flipped
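
A torchvision sketch of this pipeline might look as follows; only the 512-pixel short-side resize and the random crop/flip come from the text, while the 448×448 crop size is an assumption since the exact size is not given in this summary.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize(512),              # scale the short side to 512 pixels
    T.RandomCrop(448),          # assumed crop size
    T.RandomHorizontalFlip(),   # random flip
    T.ToTensor(),
])
```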

  • Loss function

The loss function of ACNet consists of two parts: the loss from each leaf node's prediction and the loss from the final result. With h denoting the tree height and y the ground-truth label, the total loss is the negative log-likelihood loss of the final prediction plus the negative log-likelihood losses of the predictions of the individual leaf nodes
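
A plausible formalization consistent with this description is given below; the symbols are assumptions, with C(x) the final prediction, P_i(x) the prediction of the i-th leaf node, and 2^{h-1} leaves in a tree of height h.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{NLL}}\big(C(x),\, y\big) \;+\; \sum_{i=1}^{2^{h-1}} \mathcal{L}_{\mathrm{NLL}}\big(P_i(x),\, y\big)
```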

  • Optimization

The backbone network is pre-trained on ILSVRC, and the other convolutional layers are randomly initialized with "Xavier" initialization. Training proceeds in two stages: in the first stage the backbone is frozen and the rest of the network is trained for 60 epochs; in the second stage the whole network is fine-tuned for 200 epochs with a small learning rate
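
A minimal sketch of such a two-stage schedule is shown below; the toy model, optimizer choice and learning rates are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ToyACNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)  # stands in for the truncated VGG-16
        self.head = nn.Linear(16, 200)                  # stands in for the tree + leaf predictors

    def forward(self, x):
        return self.head(self.backbone(x).mean(dim=(2, 3)))

def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = ToyACNet()

# Stage 1: freeze the ILSVRC-pretrained backbone, train the rest for 60 epochs.
set_requires_grad(model.backbone, False)
opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-2, momentum=0.9)

# Stage 2: unfreeze and fine-tune the whole network for 200 epochs with a smaller learning rate.
set_requires_grad(model.backbone, True)
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```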

Experiments


Training requires a total of 512 GB of memory and 8 V100 GPUs. The following experiments mainly compare against weakly supervised fine-grained algorithms, i.e., fine-grained algorithms that do not require additional annotations

CUB-200-2011 Dataset

Stanford Cars Dataset

Aircraft Dataset

Ablation Study

  • Effectiveness of the tree architecture

As shown in Figure 5, the tree structure significantly improves accuracy. Grad-CAM is used to generate heatmaps that visualize the response regions corresponding to the leaf nodes, and different leaf nodes are found to focus on different feature regions

  • Height of the tree

  • Asymmetrical architecture of the tree

The paper compares the influence of symmetric versus asymmetric numbers of attention transformer modules on the left and right paths on recognition accuracy

  • Effectiveness of the attention transformer module

As shown in Figure 5, the attention transformer module effectively improves the accuracy of the model

  • Components in the branch routing module

The paper finds that different branch routing modules attend to different feature regions. The visualizations in Figure 6 show the response regions obtained with Grad-CAM for the R1, R2 and R3 nodes in Figure 2, respectively

Conclusion


The paper proposes an attention convolutional binary neural tree for weakly supervised fine-grained classification. Attention convolution operations are attached to the edges of the tree structure, a routing function at each node defines the computation path from the root node to the leaf nodes, and the final prediction is obtained by combining the predictions of all leaf nodes. Both the idea and the results of the paper are strong




