Motivation
1, A conventional backbone is scale-decreased: the resolution of its feature maps keeps shrinking as the network gets deeper.
2, The motivation behind this scale-decreased design: “High resolution may be needed to detect the presence of a feature, while its exact position need not be determined with equally high precision.”
3, Intuitively, a scale-decreased backbone throws away spatial information by down-sampling, making it hard for a decoder network to recover.
4, While this structure is well suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection).
Proposal
We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
Improvements
1, The scales of intermediate feature maps should be able to increase or decrease at any time, so that the model can retain spatial information as it grows deeper.
2, The connections between feature maps should be able to go across feature scales to facilitate multi-scale feature fusion (a small resampling sketch follows this list).
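To make these two principles concrete, here is a minimal sketch (my own illustration, not the paper's code) of cross-scale fusion: each parent feature map is resampled to the block's own resolution before merging, so a block deep in the network can still operate at a high resolution. The helpers `resample_to` and `fuse` are hypothetical names, assuming PyTorch.

```python
import torch
import torch.nn.functional as F

def resample_to(x, target_hw):
    """Up- or down-sample a feature map to a target spatial size (nearest neighbor)."""
    return F.interpolate(x, size=target_hw, mode="nearest")

def fuse(parents, target_hw):
    """Cross-scale fusion: bring every parent to the block's own scale, then sum."""
    return sum(resample_to(p, target_hw) for p in parents)

# Toy example: a block at stride 8 (28x28 for a 224 input) fusing a stride-32
# parent (7x7) and a stride-4 parent (56x56); the intermediate resolution can
# therefore go back *up* as depth grows, unlike a scale-decreased backbone.
p_coarse = torch.randn(1, 64, 7, 7)
p_fine = torch.randn(1, 64, 56, 56)
out = fuse([p_coarse, p_fine], target_hw=(28, 28))
print(out.shape)  # torch.Size([1, 64, 28, 28])
```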
Search Space
The search space of this paper is divided into three parts (a toy sampler is sketched after this list):
1, Scale permutations (decide the ordering of blocks): the intermediate and output blocks are permuted separately, resulting in a search space of size (N − 5)!·5!.
2, Cross-scale connections (decide the inputs for each block): the parent blocks can be any block with a lower ordering or a block from the stem network. The search space has a size of …
3, Block adjustments: each block is allowed to adjust its scale level and block type.
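As a rough illustration of how the three parts compose, the sketch below (my own toy code, not the paper's search controller) samples one architecture: it permutes intermediate and output blocks separately, picks two lower-ordered (or stem) parents per block, and randomly adjusts each block's scale level and block type; it also evaluates the (N − 5)!·5! permutation count for an example N. The constants NUM_STEM_BLOCKS, CANDIDATE_SCALES, and BLOCK_TYPES are assumptions for illustration only.

```python
import math
import random

NUM_STEM_BLOCKS = 2                      # assumed number of fixed stem blocks usable as parents
CANDIDATE_SCALES = [2, 3, 4, 5, 6, 7]    # assumed feature levels (stride = 2**level)
BLOCK_TYPES = ["residual", "bottleneck"]

def sample_architecture(num_intermediate, num_output, seed=0):
    rng = random.Random(seed)
    # 1) Scale permutations: intermediate and output blocks permuted separately,
    #    with the output blocks kept at the end of the ordering -> (N-5)! * 5! orderings.
    order = (rng.sample(range(num_intermediate), num_intermediate)
             + [num_intermediate + i for i in rng.sample(range(num_output), num_output)])
    blocks = []
    for i, block_id in enumerate(order):
        # 2) Cross-scale connections: two parents per block, each any block with a
        #    lower ordering or a stem block (negative ids denote stem blocks here).
        candidates = list(range(-NUM_STEM_BLOCKS, i))
        parents = rng.sample(candidates, k=min(2, len(candidates)))
        # 3) Block adjustments: each block may change its scale level and block type.
        blocks.append({"id": block_id,
                       "level": rng.choice(CANDIDATE_SCALES),
                       "type": rng.choice(BLOCK_TYPES),
                       "parents": parents})
    return blocks

# Size of the scale-permutation part alone, e.g. for N = 20 blocks, 5 of them outputs:
N = 20
print(math.factorial(N - 5) * math.factorial(5))   # (N - 5)! * 5!

arch = sample_architecture(num_intermediate=N - 5, num_output=5)
print(arch[:3])
```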
Thinking points
Hmm, what feels worth taking as a reference here is the “degree of freedom” of this model. The model built in this paper has a very high degree of freedom, but the search complexity is correspondingly high. The paper does not report how long the search and training take, which suggests it is not short. Still, this is roughly what NAS should ideally look like.