This article appears at blog.aistudyclub.com

Relation Net is a CVPR 2018 paper, available at: Arxiv.org/abs/1711.06… Deep learning has achieved great success in visual recognition tasks, but the authors point out that training such models requires large numbers of annotated images and many training iterations. Every time a new object category is added, labeling it takes time, and some new or rare categories may not have many labeled images at all. Humans, on the other hand, can manage few-shot (FSL) and even zero-shot (ZSL) learning with very little exposure: a child can easily recognize a zebra from a single picture in a book, or simply from hearing it described as a "striped horse". To address the poor classification performance of deep models when only a few samples are available, and inspired by this human ability, few-shot learning has regained popularity. Fine-tuning can be used in some small-sample cases, but with only one or a few samples, even data augmentation and regularization cannot avoid over-fitting. Existing few-shot methods tend to rely on relatively complex inference mechanisms, so the authors propose Relation Net, a model with a simple structure that can be trained end-to-end.

In FSL tasks, the data is generally divided into a training set, a support set and a testing set. The support set and testing set share the same label space, while the training set contains none of their labels. If the support set has C different categories with K labeled samples each, the task is called C-way K-shot. During training, a sample set and a query set are drawn from the training set to play the roles of the support set and testing set respectively; the exact procedure is explained in the training strategy section later. Relation Network consists of an embedding model and a relation model. The core idea is that the embedding model extracts feature maps from the support-set and testing-set images separately; the two feature maps are then concatenated along the channel dimension to form a new feature map, which is fed into the relation model to compute a relation score representing the similarity of the two images. The figure below shows the network structure and the process of handling one query image in the 5-way 1-shot case: the features of the 5 sample-set images and the 1 query-set image are extracted by the embedding model and concatenated pairwise into 5 new feature maps, which are sent to the relation model to compute 5 relation scores. The result is a vector of scores, where the highest score indicates the predicted category.

The loss function used in training is also relatively simple: mean squared error. In the formula, r_{i,j} denotes the relation score (similarity) between sample image i and query image j, while y_i and y_j are the ground-truth labels of the two images; the score is regressed to 1 when the labels match and to 0 otherwise.
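For reference, the objective from the paper can be written as follows (my transcription; i indexes the m sample-set images and j the n query-set images of an episode):

\varphi, \phi \leftarrow \operatorname*{argmin}_{\varphi, \phi} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( r_{i,j} - \mathbf{1}(y_i == y_j) \right)^{2}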

Reproduction of Relation Network. For the Relation Network model definition based on PaddlePaddle, please see: Github.com/txyugood/pa… Below I share the technical details of the reproduction with developers. 1. Constructing the Relation Network. The model consists of an embedding model and a relation model, both of which are mainly built from [Conv+BN+Relu] modules, so first define a BaseNet class and implement the conv_bn_layer method as follows:

import math

import numpy as np
import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr


class BaseNet:
    def conv_bn_layer(self,
                      input,
                      num_filters,
                      filter_size,
                      stride=1,
                      groups=1,
                      padding=0,
                      act=None,
                      name=None,
                      data_format='NCHW'):
        # n is the fan-out used for the MSRA-style normal weight initialization below.
        n = filter_size * filter_size * num_filters
        conv = fluid.layers.conv2d(
            input=input,
            num_filters=num_filters,
            filter_size=filter_size,
            stride=stride,
            padding=padding,
            groups=groups,
            act=None,
            param_attr=ParamAttr(name=name + "_weights", initializer=fluid.initializer.Normal(0,math.sqrt(2. / n))),
            bias_attr=ParamAttr(name=name + "_bias",
                                initializer=fluid.initializer.Constant(0.0)),
            name=name + '.conv2d.output.1',
            data_format=data_format)

        bn_name = "bn_" + name

        # momentum=1 means the moving (global) mean and variance are never updated,
        # i.e. global statistics are not recorded (see the note below).
        return fluid.layers.batch_norm(
            input=conv,
            act=act,
            momentum=1,
            name=bn_name + '.output.1',
            param_attr=ParamAttr(name=bn_name + '_scale',
                                 initializer=fluid.initializer.Constant(1)),
            bias_attr=ParamAttr(bn_name + '_offset',
                                initializer=fluid.initializer.Constant(0)),
            moving_mean_name=bn_name + '_mean',
            moving_variance_name=bn_name + '_variance',
            data_layout=data_format)

There are two ways to define a network: as a static graph or as a dynamic graph. I chose the static graph mode here. conv_bn_layer defines the Conv+BN block most frequently seen in convolutional neural networks. Note that the momentum of the batch_norm layer is set to 1, which has the effect of never updating (i.e. not recording) the global mean and variance. The parameters have the following meanings (a minimal usage example follows the list):

  • input: the tensor to be convolved
  • num_filters: number of convolution kernels (the number of channels of the output feature map)
  • filter_size: size of the convolution kernel
  • stride: convolution stride
  • groups: number of groups for grouped convolution
  • padding: padding size; 0 means no padding is added around the input
  • act: activation function applied after the BN layer; None means no activation is used
  • name: name of the operator in the computation graph
  • data_format: data layout of the input tensor, 'NCHW' by default
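As a quick illustration, a hypothetical call (my own example, not from the repo) that builds a single 3x3 Conv+BN+Relu block on an 84x84 input image could look like this:

demo_image = fluid.layers.data('demo_image', shape=[3, 84, 84], dtype='float32')
# 64 kernels of size 3x3, no padding, ReLU applied after the BN layer.
demo_conv = BaseNet().conv_bn_layer(input=demo_image, num_filters=64, filter_size=3,
                                    padding=0, act='relu', name='demo_conv1')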

Next, we define the embedding model part of Relation Network.

class EmbeddingNet(BaseNet):
    def net(self, input):
        conv = self.conv_bn_layer(
            input=input,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='embed_conv1')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='embed_conv2')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=1,
            act='relu',
            name='embed_conv3')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=1,
            act='relu',
            name='embed_conv4')
        return conv

Start by creating an EmbeddingNet class that inherits from BaseNet, so it also gets the conv_bn_layer method. Its net method, whose parameter input is the input image tensor, builds the static graph of the network. The input first passes through a Conv+BN+Relu module to produce the feature map embed_conv1, followed by a max-pooling operation. Pooling shrinks the feature map while keeping the important features, and the subsequent convolution and pooling operations serve the same purpose. The final output feature map embed_conv4 has shape [-1, 64, 19, 19]. The first dimension is batch_size; since batch_size is not known when the static graph is created, -1 means it can take any value. The second dimension is the number of channels of the feature map, which is 64 after the embedding model. The third and fourth dimensions are the height and width of the feature map, 19x19 in this case.
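As a quick sanity check (a sketch I added, not from the repo), you can build the embedding graph on a dummy input and inspect the statically inferred output shape:

image = fluid.layers.data('image', shape=[3, 84, 84], dtype='float32')
embed_model = EmbeddingNet()
feature = embed_model.net(image)
print(feature.shape)  # expected: (-1, 64, 19, 19)

The relation model code is as follows: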

class RelationNet(BaseNet):
    def net(self, input, hidden_size):
        conv = self.conv_bn_layer(
            input=input,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='rn_conv1')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='rn_conv2')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        fc = fluid.layers.fc(conv, size=hidden_size, act='relu',
                             param_attr=ParamAttr(name='fc1_weights',
                                                  initializer=fluid.initializer.Normal(0, 0.01)),
                             bias_attr=ParamAttr(name='fc1_bias',
                                                 initializer=fluid.initializer.Constant(1)),
                             )
        fc = fluid.layers.fc(fc, size=1, act='sigmoid',
                             param_attr=ParamAttr(name='fc2_weights',
                                                  initializer=fluid.initializer.Normal(0, 0.01)),
                             bias_attr=ParamAttr(name='fc2_bias',
                                                 initializer=fluid.initializer.Constant(1)))
        return fc

Create a RelationNet class that inherits from BaseNet and therefore also gets the conv_bn_layer method. In its net method, similar to the embedding model, the first few layers use [Conv+BN+Relu] modules to extract features. Finally, two fully connected layers map the feature values to a scalar relation score representing the similarity of the two images.

During training, the sample-set images and query-set images each pass through the embedding model to produce feature maps of shape [-1, 64, 19, 19], which must be concatenated before being sent to the relation model. This code is a little involved, so I will explain it in sections.

sample_image = fluid.layers.data('sample_image', shape=[3, 84, 84], dtype='float32')
query_image = fluid.layers.data('query_image', shape=[3, 84, 84], dtype='float32')
         
sample_query_image = fluid.layers.concat([sample_image, query_image], axis=0)
sample_query_feature = embed_model.net(sample_query_image)

The sample and query images are concatenated along the batch_size dimension (dimension 0) and passed through the embedding model together, producing sample_query_feature.

sample_batch_size = fluid.layers.shape(sample_image)[0]
query_batch_size = fluid.layers.shape(query_image)[0]

This part of the code takes dimension 0 of each image tensor as its batch_size.

sample_feature = fluid.layers.slice(
    sample_query_feature,
    axes=[0],
    starts=[0],
    ends=[sample_batch_size])
if k_shot > 1:
    # few_shot: merge the k_shot features of each class into a single feature
    sample_feature = fluid.layers.reshape(sample_feature, shape=[c_way, k_shot, 64, 19, 19])
    sample_feature = fluid.layers.reduce_sum(sample_feature, dim=1)
query_feature = fluid.layers.slice(
    sample_query_feature,
    axes=[0],
    starts=[sample_batch_size],
    ends=[sample_batch_size + query_batch_size])

Since the sample and query images were concatenated before feature extraction, sample_query_feature must now be sliced along dimension 0 (the batch_size dimension) to recover sample_feature and query_feature. If k_shot is greater than 1, sample_feature is first reshaped to [c_way, k_shot, 64, 19, 19] and then summed over dimension 1 (the k_shot dimension), which removes that dimension, so the shape of sample_feature becomes [c_way, 64, 19, 19]; the effective number of sample features is then c_way.

sample_feature_ext = fluid.layers.unsqueeze(sample_feature, axes=0)
query_shape = fluid.layers.concat(
    [query_batch_size, fluid.layers.assign(np.array([1, 1, 1, 1]).astype('int32'))])
sample_feature_ext = fluid.layers.expand(sample_feature_ext, query_shape)

Each query-set feature needs to be paired with the features of all C classes in the sample set, so a new dimension is added here with unsqueeze. To satisfy the parameter requirements of the expand interface, a new query_shape tensor is built and used to replicate sample_feature_ext query_batch_size times, giving a tensor of shape [query_batch_size, sample_batch_size, 64, 19, 19].

query_feature_ext = fluid.layers.unsqueeze(query_feature, axes=0)
if k_shot > 1:
    sample_batch_size = sample_batch_size / float(k_shot)
sample_shape = fluid.layers.concat(
    [sample_batch_size, fluid.layers.assign(np.array([1, 1, 1, 1]).astype('int32'))])
query_feature_ext = fluid.layers.expand(query_feature_ext, sample_shape)

As in the previous step, the query-set features also get an extra dimension and are then replicated sample_batch_size times. Note that if k_shot is greater than 1, since the reduce_sum above has already merged the k_shot features of each class, sample_batch_size must be divided by k_shot to obtain the new sample_batch_size. The expand then yields a tensor of shape [sample_batch_size, query_batch_size, 64, 19, 19].

query_feature_ext = fluid.layers.transpose(query_feature_ext, [1, 0, 2, 3, 4])
relation_pairs = fluid.layers.concat([sample_feature_ext, query_feature_ext], axis=2)
relation_pairs = fluid.layers.reshape(relation_pairs, shape=[-1, 128, 19, 19])

Finally, transpose rearranges query_feature_ext so that it has the same shape as sample_feature_ext. The two features are then concatenated along the channel dimension and reshaped, giving a tensor relation_pairs with shape [query_batch_size x sample_batch_size, 128, 19, 19].
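If the shape manipulations above are hard to follow, here is a plain NumPy analogue of the pairing logic (an illustration I added, not part of the fluid graph), for the 5-way case with 15 query images:

import numpy as np

c_way, n_query, ch, h, w = 5, 15, 64, 19, 19
sample_feature = np.random.rand(c_way, ch, h, w).astype('float32')
query_feature = np.random.rand(n_query, ch, h, w).astype('float32')

# Pair every query feature with every class feature.
sample_ext = np.tile(sample_feature[None], (n_query, 1, 1, 1, 1))  # [n_query, c_way, ch, h, w]
query_ext = np.tile(query_feature[None], (c_way, 1, 1, 1, 1))      # [c_way, n_query, ch, h, w]
query_ext = query_ext.transpose(1, 0, 2, 3, 4)                     # [n_query, c_way, ch, h, w]

relation_pairs = np.concatenate([sample_ext, query_ext], axis=2)   # concatenate on channels
relation_pairs = relation_pairs.reshape(-1, 2 * ch, h, w)
print(relation_pairs.shape)  # (75, 128, 19, 19)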

relation = RN_model.net(relation_pairs, hidden_size=8)  # RN_model is an instance of RelationNet
relation = fluid.layers.reshape(relation, shape=[-1, c_way])

Finally, the concatenated features are sent into the relation model. This first produces a vector of length query_batch_size x sample_batch_size, which is then reshaped into a tensor of shape [query_batch_size, sample_batch_size] (sample_batch_size is effectively c_way). Each row of length sample_batch_size contains the relation scores of one query image against the C classes, and its training target is the one-hot encoding of that query image's category.
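At inference time the predicted class of each query image is simply the column with the highest relation score; as a one-line sketch (my addition, using the relation tensor above):

prediction = fluid.layers.argmax(relation, axis=1)  # shape: [query_batch_size]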

The code for the loss function is as follows:

one_hot_label = fluid.layers.one_hot(query_label, depth=c_way)
loss = fluid.layers.square_error_cost(relation, one_hot_label)
loss = fluid.layers.reduce_mean(loss)

First, the query image labels query_label are converted to one-hot form; the relation tensor obtained above is trained to match this one-hot encoding. The MSE between relation and one_hot_label then gives the loss. 2. Training strategy. In FSL tasks, training on the support set alone and then predicting on the testing set is possible, but because the support set contains so few samples, the resulting classifier generally performs poorly. The training set is therefore used for training so that the classifier performs better. An effective method for this is episode-based training, whose steps are as follows:

  • Training iterates over N episodes. In each episode, C categories are randomly chosen from the training set and K labeled samples are drawn from each of them to form the sample set, corresponding to C-way K-shot on the support set, for a total of C x K samples.
  • Then several samples are randomly drawn from the remaining images of those C categories to form the query set used for training (a minimal sampling sketch follows this list).
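To make the episode construction concrete, here is a minimal sampling sketch (my illustration, not the repo's data pipeline; images_by_class is a hypothetical mapping from class name to a list of image paths):

import random

def sample_episode(images_by_class, c_way=5, k_shot=1, n_query_per_class=3):
    classes = random.sample(list(images_by_class), c_way)
    sample_set, query_set = [], []
    for label, cls in enumerate(classes):
        imgs = random.sample(images_by_class[cls], k_shot + n_query_per_class)
        sample_set += [(path, label) for path in imgs[:k_shot]]
        query_set += [(path, label) for path in imgs[k_shot:]]
    return sample_set, query_set  # e.g. 5 sample and 15 query pairs for 5-way 1-shot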

For 5-way 1-shot learning, the batch_size of the sample set is 5 and the batch_size of the query set is 15. For 5-way 5-shot learning, the batch_size of the sample set is 25 (5 images per category) and the batch_size of the query set is 10. Adam is used as the optimizer, with the learning rate set to 0.001. For data augmentation, the AutoAugment method is applied to the sample-set and query-set images as the data is read, to increase the diversity of the data.
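For the optimizer, a minimal setup sketch (assuming the loss tensor defined earlier) would be:

optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(loss)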

3. Dataset. Only miniImageNet, the dataset used in the paper's experiments, was used to validate the reproduction: 100 categories in total, with 600 images per category. The 100 categories are divided into training/validation/testing sets of 64, 16, and 20 categories respectively. The paper reports the following accuracy on the miniImageNet testing set:

Relation Net reaches about 50.44 and 65.32 in 5-way 1-shot and 5-way 5-shot accuracy respectively. The PaddlePaddle reproduction was evaluated on the miniImageNet testing set for both 5-way 1-shot and 5-way 5-shot, and the accuracy is consistent with the paper, so the model is considered successfully reproduced. Code address: Github.com/txyugood/pa…