1 Introduction

As the most common backbone, ResNet plays an important role in object detection. Many classical detectors such as RetinaNet, Faster R-CNN, and Mask R-CNN are tuned with ResNet as the backbone, and most subsequent improved algorithms use RetinaNet, Faster R-CNN, and Mask R-CNN as baselines for fair comparison.

Both TIMM and TorchVision have recently released new training recipes that improve ResNet performance. TIMM calls its scheme ResNet Strikes Back (RSB), which raises the top-1 accuracy of ResNet50 on the ImageNet-1K dataset from 76.1 to 80.4. TorchVision calls its scheme TorchVision New Recipes (TNR) and raises top-1 accuracy to 80.86. Both are large improvements.

With such strong pre-trained ResNet backbones, can we expect a correspondingly large improvement on downstream object detection tasks? This question is worth investigating, and the MMDetection team answered it through extensive experiments and parameter tuning. Taking Faster R-CNN as an example, the performance on the COCO val set is shown in the following table:

Row 1 is the Faster R-CNN baseline. After tuning the optimizer, learning rate, and weight decay coefficient, the high-accuracy pre-trained model R50-MMCLS improves Faster R-CNN mAP by up to 3.4 points (R50-MMCLS refers to the model trained in MMClassification with the RSB strategy). We also searched for a set of optimal hyperparameters for each backbone for user reference.

2 Comparison of the RSB and TNR Training Strategies on ResNet50

This article first explains the RSB and TNR training strategies in detail, and then shows how to fine-tune downstream object detection tasks to significantly improve the performance of classical detection models.

2.1 Summary Table

First, to make viewing and comparison easier, we compiled the following comparison table:

  • ResNet50-Base: the ResNet50 baseline result
  • ResNet50-RSB: the result of the ResNet Strikes Back recipe proposed by TIMM, specifically the A1 strategy
  • ResNet50-TNR: the result of the New Recipes strategy proposed by TorchVision
  • ResNet50-DeiT-S: the result of training ResNet in TIMM with the DeiT-S training strategy, used for a fair comparison between DeiT-S and ResNet Strikes Back

2.2 Details of the ResNet Baseline Training Techniques

The ResNet baseline is the ResNet50-Base column in the table above. Note that ResNet has two versions for historical reasons: ResNet-PyTorch and ResNet-Caffe differ in the Bottleneck module, which stacks 1×1-3×3-1×1 convolutions. In Caffe mode, stride=2 is placed on the first 1×1 convolution, while in PyTorch mode it is placed on the 3×3 convolution. A simple example follows:

if self.style == 'pytorch':
    # PyTorch style: the stride-2 convolution is the 3x3 conv (conv2)
    self.conv1_stride = 1
    self.conv2_stride = stride
else:
    # Caffe style: the stride-2 convolution is the first 1x1 conv (conv1)
    self.conv1_stride = stride
    self.conv2_stride = 1
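In MMDetection, the variant is selected through the backbone's `style` field, which is part of MMDetection's ResNet implementation. A minimal config sketch:

```python
# Choose the bottleneck variant when building the detector's backbone:
# 'pytorch' puts stride=2 on the 3x3 conv, 'caffe' on the first 1x1 conv.
model = dict(
    backbone=dict(
        type='ResNet',
        depth=50,
        style='pytorch'))  # or style='caffe'
```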

The baseline uses ResNet-PyTorch. ResNet50 is trained from scratch on the ImageNet-1K training set, and top-1 accuracy is computed on the ImageNet-1K validation set. The training recipe is as follows:

  • Batch size: 32×8 = 256 (8 GPUs, 32 images per GPU)
  • Optimizer: SGD with momentum 0.9
  • Learning rate: initial learning rate 0.1, decayed by a factor of 10 every 30 epochs
  • Total epochs: 90
  • Weight regularization: weight decay 1e-4
  • Training data augmentation
    • RandomResizedCrop
    • RandomHorizontalFlip
    • ColorJitter
  • Image input size: 224 for both training and testing

With this configuration, ResNet50 achieves 76.1 top-1 accuracy on the ImageNet-1K validation set.
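For reference, here is a minimal PyTorch sketch of this baseline recipe. The training loop body is elided, and the exact ColorJitter strength is an assumption:

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

model = resnet50()

# SGD with momentum 0.9 and weight decay 1e-4; lr decays 10x every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4),  # jitter strength is an assumption
    T.ToTensor(),
])

for epoch in range(90):
    # ... one pass over ImageNet-1K with softmax cross-entropy loss ...
    scheduler.step()
```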

2.3 Details of the TIMM Training Techniques

TIMM collected the latest training techniques, applied them to ResNet, and proposed the ResNet-RSB version. It comes in three variants called A1, A2, and A3, trained for 600, 300, and 100 epochs respectively, as shown below:

  • A1 is designed to provide the best-performing ResNet50 model
  • A2 is for comparison with DeiT (not a completely fair comparison, because the batch size and training tricks differ)
  • A3 is intended as a fair comparison with the original ResNet50 recipe

The authors evaluated on two datasets, specifically:

  • Val denotes the ImageNet-1K validation set
  • V2 denotes the ImageNet-1K V2 dataset

Taking A1 as an example, its training recipe is as follows (two of the ingredients are sketched in code after the list):

  • Batch size: 512×4 = 2048 (4 GPUs, 512 images per GPU)
  • Optimizer: LAMB
  • Learning rate: initial learning rate 5e-3, with a cosine learning rate schedule
  • Total epochs: 600
  • Weight decay: 0.01
  • Warmup: 5 epochs
  • Training data augmentation
    • RandomResizedCrop
    • RandomHorizontalFlip
    • RandAugment (7/0.5)
    • Repeated Augmentation
    • Mixup with alpha 0.2
    • CutMix with alpha 1.0
  • Loss: BCE instead of CE
  • Training model perturbation
    • Label Smoothing with parameter 0.1
    • Stochastic Depth with drop rate 0.05
  • Image input size:
    • the training input size is 224×224
    • at test time, following the FixRes strategy, images are resized to 236 and center-cropped to 224
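To make two of these ingredients concrete, here is a minimal PyTorch sketch of a training step with Mixup (alpha 0.2) and BCE in place of CE. This is a simplification; the actual recipe additionally applies CutMix, RandAugment, repeated augmentation, label smoothing, and stochastic depth:

```python
import torch
import torch.nn.functional as F

def mixup_bce_loss(model, images, labels, num_classes=1000, alpha=0.2):
    # Sample the mixing coefficient from Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))

    # Mix the images and the corresponding one-hot targets.
    mixed_images = lam * images + (1 - lam) * images[perm]
    onehot = F.one_hot(labels, num_classes).float()
    mixed_targets = lam * onehot + (1 - lam) * onehot[perm]

    # BCE treats each class independently, which RSB found to work at
    # least as well as softmax CE under heavy mixing augmentations.
    logits = model(mixed_images)
    return F.binary_cross_entropy_with_logits(logits, mixed_targets)
```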

Compared with the ResNet-Base recipe, many new data augmentation and model perturbation strategies are introduced, enabled by the much longer training schedule. Retrained with this strategy, ResNet50 reaches 80.4 top-1 accuracy on the ImageNet-1K validation set. Beyond these results, the authors also report other findings from their experiments:

  • Although adding so many strong augmentations and model perturbations improves final performance, convergence is slow in the early stage of training
  • With a total batch size of 512, both SGD and AdamW converge; with a total batch size of 2048, however, training with SGD and BCE loss has difficulty converging

The authors provide a very detailed comparison table:

The authors also verified the generalization ability of A1, A2, and A3 across different architectures.

Here the plus sign denotes TorchVision results, while ∗ denotes results from the DeiT paper. ResNet-50 and DeiT-S are compared as follows:

2.4 Details of the TorchVision Training Techniques

TorchVision has also released its own training recipe, detailed in an official blog post and discussed at github.com/pytorch/vis… ; the final result is as follows:

The authors also helpfully plot the improvement contributed by each trick:

Training recipe summary:

  • Batch size: 128×8 = 1024 (8 GPUs, 128 images per GPU)
  • Optimizer: SGD with momentum 0.9
  • Learning rate: initial learning rate 0.5, with a cosine learning rate schedule
  • Total epochs: 600
  • Weight regularization: weight decay 2e-5, with no decay on norm layers
  • Warmup: 5 epochs of linear warmup with lr_warmup_decay=0.01
  • Training data augmentation
    • RandomResizedCrop
    • RandomHorizontalFlip
    • TrivialAugment
    • Mixup with alpha 0.2
    • CutMix with alpha 1.0
    • Random Erasing with probability 0.1
  • Training model perturbation
    • Label Smoothing with parameter 0.1
    • EMA with decay 0.99998, updated every 32 iterations (sketched after this list)
  • Image input size:
    • the training input size is 176×176
    • at test time, following the FixRes strategy, images are resized to 232 and center-cropped to 224
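The EMA trick can be sketched in a few lines of PyTorch. This is a minimal illustration under the parameters listed above, not TorchVision's actual implementation:

```python
import copy
import torch

class EmaModel:
    """Keeps an exponential moving average of a model's weights."""

    def __init__(self, model, decay=0.99998):
        self.module = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.module.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # Called periodically (every 32 iterations in the recipe).
        for ema_t, t in zip(self.module.state_dict().values(),
                            model.state_dict().values()):
            if ema_t.is_floating_point():
                ema_t.copy_(self.decay * ema_t + (1.0 - self.decay) * t)
            else:
                ema_t.copy_(t)  # e.g. BN's num_batches_tracked
```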

As can be seen, both the RSB and TorchVision strategies focus on strong augmentation, more model perturbation, and much longer training schedules. The authors also report other findings from their experiments:

  • More complex optimizers such as Adam, RMSProp, and SGD with Nesterov momentum did not give better results, though the authors did not experiment with LAMB
  • The authors tried different LR schedulers, such as StepLR and exponential decay. Although the latter tends to work better with EMA, it usually requires extra hyperparameters, such as a minimum LR, to work properly, so the authors settled on the less hyperparameter-sensitive cosine schedule
  • The authors tried different augmentation strategies, such as AutoAugment and RandAugment, but none was superior to the simpler, parameter-free TrivialAugment
  • Bicubic and nearest-neighbor interpolation did not give better results than bilinear interpolation
  • Sync Batch Norm did not produce significantly better results than regular Batch Norm
  • When Mixup and CutMix are used together, one of them is randomly selected with equal probability for each batch; Mixup alone improves accuracy by 0.118, while CutMix improves it by 0.278
  • For FixRes, the authors found that training at 176 works best when testing at 224; an even larger test resolution such as 256 can perform better still, but they kept 224 for consistency with the baseline
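For reference, this checkpoint was later published as torchvision's `IMAGENET1K_V2` weights, so with torchvision >= 0.13 it can be loaded directly:

```python
from torchvision.models import resnet50, ResNet50_Weights

# IMAGENET1K_V2 is the new-recipe checkpoint (top-1 80.86);
# IMAGENET1K_V1 is the original 76.1 baseline.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
```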

3 Performance of High-Accuracy Pre-trained Models on Object Detection

This section examines how high-accuracy pre-trained models perform on the object detection task. The experiments use Faster R-CNN FPN with the 1x schedule on the COCO 2017 dataset. For details, see the MMDetection configuration file:

# https://github.com/open-mmlab/mmdetection/blob/master/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py
_base_ = [
    '../_base_/models/faster_rcnn_r50_fpn.py',
    '../_base_/datasets/coco_detection.py',
    '../_base_/schedules/schedule_1x.py',
    '../_base_/default_runtime.py'
]

Several core configurations are:

  • 8-GPU training with a total batch size of 16
  • 1x schedule: 12 training epochs
  • Optimizer: SGD with momentum 0.9, lr 0.02, and weight decay 0.0001

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)

If you want to understand the Faster R-CNN code and configuration parameters in more detail, see the earlier article in this series, Mastering Common Algorithms in MMDetection (2): Faster R-CNN | Mask R-CNN.

3.1 Performance When Only Replacing the Pre-trained Weights

To quickly evaluate pre-trained weights of different quality under the Faster R-CNN FPN baseline configuration, we directly replaced the pre-trained weights and measured the resulting Faster R-CNN performance:

Model download links: download.pytorch.org/models/resn… download.openmmlab.com/mmclassific… github.com/rwightman/p… download.pytorch.org/models/resn…

Note that to ensure fairness, a fixed random seed (seed=0) was used. All experiments were carried out on 8× V100 GPUs with a batch size of 16 (8×2).
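For runs launched through the MMDetection APIs, the seed can be fixed with a helper from `mmdet.apis` (available in MMDetection 2.x):

```python
from mmdet.apis import set_random_seed

# Seed Python, NumPy and PyTorch RNGs so runs are comparable across backbones.
set_random_seed(0, deterministic=False)
```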

The table shows that after replacing ResNet with high-accuracy pre-trained weights, Faster R-CNN does not improve significantly, and some configurations even degrade badly. This suggests that the original hyperparameters are no longer suitable for high-accuracy pre-trained ResNets, so tuning them is necessary. In particular, because the pre-training strategy changed, the SGD optimizer may no longer adapt well to the new pre-trained models. We therefore planned to fine-tune the detector by adjusting the optimizer, learning rate, and weight regularization.

3.2 Parameter Tuning Experiments with the ResNet Baseline Pre-trained Model

Since ResNet Strikes Back used an adaptive optimizer (LAMB) rather than SGD during pre-training, we tried AdamW as the optimizer for the downstream object detection task, hoping to reach at least the accuracy obtained with the SGD optimizer.

The details can be seen in the following table:

It can be seen that with a learning rate of 0.0001, the overall accuracy with the AdamW optimizer exceeds that of the SGD optimizer, and performance is best with weight decay 0.1.
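In MMDetection config terms, this best setting corresponds to a sketch like the following (mirroring the configs shown in the next sections, but keeping the default ImageNet-pretrained backbone):

```python
_base_ = [
    '../_base_/models/faster_rcnn_r50_fpn.py',
    '../_base_/datasets/coco_detection.py',
    '../_base_/schedules/schedule_1x.py',
    '../_base_/default_runtime.py'
]
# Replace the default SGD optimizer with AdamW: lr=0.0001, weight decay=0.1.
optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=0.0001,
    weight_decay=0.1,
    paramwise_cfg=dict(norm_decay_mult=0., bypass_duplicate=True))
```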

3.3 Parameter Tuning Experiments with the MMCLS RSB Pre-trained Model

By modifying the pre-trained model in the configuration file, we can replace the default ResNet weights with MMClassification's RSB-trained weights. On this basis we train Faster R-CNN with AdamW and SGD respectively, to measure the effect of the MMClassification RSB pre-trained model on the detection task. The MMDetection configuration file is written as follows:

_base_ = [
    '../_base_/models/faster_rcnn_r50_fpn.py',
    '../_base_/datasets/coco_detection.py',
    '../_base_/schedules/schedule_1x.py',
    '../_base_/default_runtime.py'
]
checkpoint = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.pth'  # noqa
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained', prefix='backbone.', checkpoint=checkpoint)))
optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=0.0002,
    weight_decay=0.05,
    paramwise_cfg=dict(norm_decay_mult=0., bypass_duplicate=True))

Based on the priors from the previous section, we first used AdamW as the optimizer with the learning rate set to 0.0001.

Specific values are shown in the table below:

To verify the effect of the learning rate on accuracy, we ran a learning rate sweep.

Specific values are shown in the table below:

Based on the above experiments, we found that detection accuracy improved significantly with a learning rate of 0.0002, so we set up a controlled experiment at that learning rate:

Specific values are shown in the table below:

It can be seen that accuracy is highest at lr=0.0002 and weight decay=0.05. Weight decay has little influence on accuracy within a certain range, but accuracy drops noticeably once it leaves that range.

3.4 Parameter Tuning Experiments with the TIMM RSB Pre-trained Model

Next, we replaced the ResNet pre-trained model with the one from PyTorch Image Models (TIMM). On this basis we train Faster R-CNN with AdamW, to measure the effect of the TIMM pre-trained model on the detection task. The MMDetection configuration is written as follows:

_base_ = [
    '../_base_/models/faster_rcnn_r50_fpn.py',
    '../_base_/datasets/coco_detection.py',
    '../_base_/schedules/schedule_1x.py',
    '../_base_/default_runtime.py'
]
checkpoint = 'https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-rsb-weights/resnet50_a1_0-14fe96d1.pth'  # noqa
model = dict(
    backbone=dict(
        init_cfg=dict(type='Pretrained', checkpoint=checkpoint)))
optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=0.0002,
    weight_decay=0.03,
    paramwise_cfg=dict(norm_decay_mult=0., bypass_duplicate=True))

Based on the fine-tuning priors above, we first fixed the learning rate at 0.0001 and 0.0002 respectively and adjusted the weight decay. The experimental results are as follows:


Specific values are shown in the table below:

As can be seen, although this improves on the baseline's bbox mAP of 37.4, reaching at most 39.8, there is still a gap to the maximum bbox mAP of 40.8 obtained with the MMCLS pre-trained model. We then also swept the learning rate and observed the results:

Specific values are shown in the table below:

From these results, the accuracy gap for AdamW between learning rates 0.0001 and 0.0002 is small, but accuracy drops significantly above 0.0003.

3.5 Parameter Tuning Experiments with the TorchVision TNR Pre-trained Model

Finally, we replaced the ResNet pre-trained model with TorchVision's high-accuracy model trained with the new recipe, and trained Faster R-CNN with SGD and AdamW respectively, to measure its effect on the detection task. The MMDetection configuration file is written as follows:

_base_ = [
    '../_base_/models/faster_rcnn_r50_fpn.py',
    '../_base_/datasets/coco_detection.py',
    '../_base_/schedules/schedule_1x.py',
    '../_base_/default_runtime.py'
]
checkpoint = 'https://download.pytorch.org/models/resnet50-11ad3fa6.pth'
model = dict(
    backbone=dict(
        init_cfg=dict(type='Pretrained', checkpoint=checkpoint)))
optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=0.0001,
    weight_decay=0.1,
    paramwise_cfg=dict(norm_decay_mult=0., bypass_duplicate=True))

We first optimized Faster R-CNN with SGD and searched for the optimal learning rate and weight decay:

Optimal learning rate search with fixed weight decay under SGD

Specific values are shown in the table below:

Optimal weight decay search with fixed learning rate under SGD

Specific values are shown in the table below:

The experiments show that with identical training hyperparameters, simply switching the pre-trained model to TorchVision's high-accuracy weights raises accuracy by 2.2 points (37.4 -> 39.6). With a learning rate of 0.04 and weight decay of 0.00001, Faster R-CNN with R50-TNR as the pre-trained model and SGD reaches the best result of 39.8 mAP.

Next, we tried optimizing the model with AdamW:

Optimal learning rate search with fixed weight decay under AdamW

Specific values are shown in the table below:

Optimal weight decay search with fixed learning rate under AdamW

Specific values are shown in the table below:

The experiments show that with AdamW, a learning rate of 0.0001 is clearly better than 0.0002. Accuracy peaks around weight decay 0.1, and varying it within a range has little influence on the final result. With a learning rate of 0.0001 and weight decay of 0.1, Faster R-CNN loading R50-TNR reaches the best accuracy of 40.2 mAP, 0.4 higher than with SGD (39.8 -> 40.2).

4 Summary

The experiments above show that high-accuracy pre-trained models can considerably improve object detection results. The best result for each pre-trained model and the corresponding parameter settings are shown in the following table:

The table shows that any of the high-accuracy pre-trained models improves object detection performance by about 2 points. Among them, the model trained with MMClassification improves Faster R-CNN by 3.4 points, reaching the best result of 40.8 mAP, which demonstrates that high-accuracy pre-trained models are very helpful for object detection.

If you want to reproduce these results or experiment further, refer to the relevant configuration files and PRs:

  • TorchVision high-precision model configuration and PR: github.com/open-mmlab/…
  • TIMM high-precision model configuration: github.com/open-mmlab/…

Welcome to try MMDetection, and thanks to the MMClassification team for proofreading this article!

If this article helped you, please like, bookmark, and follow us!