1. Innovation points

Existing visual foundation models such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and Wu Dao 2.0 (Wud) focus on mapping image and text representations into a cross-modal shared representation. This paper introduces a new computer vision foundation model, Florence, which expands the representations:

  1. from coarse (scene) to fine-grained (object)
  2. from static (images) to dynamic (videos)
  3. from RGB to multiple modalities (caption, depth)

2. Conclusion

Florence can be successfully transferred to different tasks across space, time, and modality with great adaptability, and it achieves new SOTA results on a wide range of visual benchmarks.

3. Implementation method

The overall Florence workflow consists of data curation, model pre-training, task adaptation, and the training infrastructure, as shown in the figure below:

1. Data curation: a dataset of 900 million image-text pairs (FLD-900M) was built, and the positive/negative pair assignment during training is corrected with UniCL.

The final FLD-900M dataset contains 900M images and 900M free-form texts (ranging from a single word or phrase to a sentence), 9.7M unique queries, and 7.5B tokens in total. An image may have multiple descriptions, but under the default contrastive setup each image is paired with only one text: that pair is treated as positive and all others as negative, i.e. for image1 described by text1, (image1, text1) is defined as a positive pair, while pairs such as (image1, text2), (image1, text3), … are classified as negative. In reality, however, one image can have several valid descriptions, e.g. image1: a dog, text1: "a dog", text2: "a cute dog". UniCL is designed to handle this case, so that both (image1, text1) and (image1, text2) count as positive pairs.
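
To make the positive/negative assignment concrete, here is a minimal sketch of a UniCL-style bidirectional contrastive loss with multiple positives. The function name `unicl_loss`, the use of label ids to mark which image-text pairs share the same meaning, and the temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def unicl_loss(img_emb, txt_emb, labels, temperature=0.07):
    """UniCL-style bidirectional contrastive loss with multiple positives.

    img_emb, txt_emb: (N, D) L2-normalised embeddings.
    labels: (N,) ids; pairs (i, j) with labels[i] == labels[j] are positives,
    so "a dog" and "a cute dog" describing the same image both count as positive.
    """
    logits = img_emb @ txt_emb.t() / temperature            # (N, N) similarity matrix
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # 1 where pair is positive

    # image-to-text direction: average log-likelihood over all positives in each row
    log_prob_i2t = F.log_softmax(logits, dim=1)
    loss_i2t = -(pos_mask * log_prob_i2t).sum(1) / pos_mask.sum(1)

    # text-to-image direction: same over the columns
    log_prob_t2i = F.log_softmax(logits, dim=0)
    loss_t2i = -(pos_mask * log_prob_t2i).sum(0) / pos_mask.sum(0)

    return (loss_i2t.mean() + loss_t2i.mean()) / 2
```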

2. Model pre-training: the pre-trained model is Transformer-based and uses a two-tower structure: a 12-layer Transformer as the language encoder (similar to CLIP) and a hierarchical Vision Transformer (ViT) as the image encoder, the CoSwin Transformer (to be added…).

The 12-layer Transformer is used as the language encoder. The image encoder is a hierarchical Vision Transformer called the CoSwin Transformer, which replaces the patch embedding and patch merging modules of the Swin Transformer with convolutional embedding layers. Two linear projection layers are added on top of the image encoder and the language encoder to match the dimensions of the image and language features.
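
Below is a minimal sketch of the two-tower layout with the added linear projections. The `language_encoder` and `image_encoder` arguments are placeholders, and the feature dimensions (768, 1536, 1024) are hypothetical; the sketch only shows how the two feature sizes are matched before the contrastive loss.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Two-tower layout: language encoder + hierarchical image encoder,
    each followed by a linear projection into a shared embedding space."""

    def __init__(self, language_encoder, image_encoder,
                 text_dim=768, image_dim=1536, embed_dim=1024):
        super().__init__()
        self.language_encoder = language_encoder   # e.g. a 12-layer Transformer
        self.image_encoder = image_encoder         # e.g. a CoSwin-style hierarchical ViT
        self.text_proj = nn.Linear(text_dim, embed_dim)    # match language feature size
        self.image_proj = nn.Linear(image_dim, embed_dim)  # match image feature size

    def forward(self, images, tokens):
        txt_feat = self.language_encoder(tokens)   # (N, text_dim) pooled text feature
        img_feat = self.image_encoder(images)      # (N, image_dim) pooled image feature
        txt_emb = F.normalize(self.text_proj(txt_feat), dim=-1)
        img_emb = F.normalize(self.image_proj(img_feat), dim=-1)
        return img_emb, txt_emb                    # fed to the contrastive loss above
```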

3. Task adaptation: the Dynamic Head adapter is used for the spatial dimension, the Video CoSwin adapter for the temporal dimension, and the METER adapter for the modality dimension.

Thanks to the hierarchical structure of the CoSwin-H image encoder, a multi-scale feature pyramid can be obtained, and the pyramid levels can be concatenated and scaled down or up. The idea of the Dynamic Head adapter is to deploy three kinds of attention, level-wise, spatial, and channel-wise, along the orthogonal dimensions of the feature tensor (level, space, and channel).
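
A minimal sketch of this idea, assuming the feature pyramid has been reshaped into a (batch, levels, space, channels) tensor; the three gating functions below are simplified stand-ins for the actual scale-, spatial-, and task-aware attention blocks of Dynamic Head, not the real implementation.

```python
import torch
import torch.nn as nn

class DynamicHeadSketch(nn.Module):
    """Simplified Dynamic-Head-style adapter: three attentions applied in
    sequence over the orthogonal dimensions of a (B, L, S, C) pyramid tensor."""

    def __init__(self, channels):
        super().__init__()
        self.level_gate = nn.Linear(channels, 1)           # level-wise (scale) attention
        self.spatial_gate = nn.Linear(channels, 1)         # spatial attention
        self.channel_gate = nn.Linear(channels, channels)  # channel-wise attention

    def forward(self, x):                  # x: (B, L, S, C)
        # level-wise: weight each pyramid level, normalised over L
        w_level = torch.softmax(self.level_gate(x.mean(dim=2)), dim=1)   # (B, L, 1)
        x = x * w_level.unsqueeze(2)

        # spatial: weight each spatial position, normalised over S
        w_spatial = torch.softmax(self.spatial_gate(x), dim=2)           # (B, L, S, 1)
        x = x * w_spatial

        # channel-wise: sigmoid gate over the C dimension
        w_channel = torch.sigmoid(self.channel_gate(x.mean(dim=(1, 2)))) # (B, C)
        x = x * w_channel[:, None, None, :]
        return x
```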

The main purpose of the METER adapter is to extend Florence to fine-grained vision-language representations. It uses a pre-trained RoBERTa as the language encoder and CoSwin as the image encoder, and then uses co-attention blocks to learn contextual representations. A co-attention block consists of a self-attention block, a cross-attention block, and a feed-forward network.
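
A minimal sketch of one such co-attention block, assuming pre-computed text and image token features of a common dimension; it follows the description above (self-attention, cross-attention, feed-forward) but is not the exact METER implementation, and a symmetric block would process the image tokens in the same way.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attention block: self-attention on the text tokens,
    cross-attention from text to image tokens, then a feed-forward network."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # self-attention within the text sequence
        t = self.norm1(text_tokens)
        text_tokens = text_tokens + self.self_attn(t, t, t)[0]
        # cross-attention: text queries attend to image keys/values
        t, i = self.norm2(text_tokens), image_tokens
        text_tokens = text_tokens + self.cross_attn(t, i, i)[0]
        # position-wise feed-forward network
        text_tokens = text_tokens + self.ffn(self.norm3(text_tokens))
        return text_tokens
```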

The video recognition adapter (Video CoSwin adapter) requires only minor changes compared with the CoSwin image encoder (see the sketch after this list):

  1. The image tokenization layer is turned into a video tokenization layer, i.e. the 2D convolution layer of CoSwin is replaced with a 3D convolution layer
  2. The Video CoSwin adapter uses a 3D-convolution-based patch merging operator
  3. In CoSwin's self-attention layers, 2D shifted windows are replaced with 3D shifted local windows
  4. A dynamic window size strategy is used: smaller shifted windows in the early stages and larger shifted windows in the later stages
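
A minimal sketch of the first two changes, with hypothetical module names and patch sizes: the 2D convolutional patch embedding becomes a 3D convolution over (time, height, width), and patch merging between stages is also done with a 3D convolution.

```python
import torch.nn as nn

class VideoTokenizer3D(nn.Module):
    """Video tokenization: the 2D convolutional patch embedding of CoSwin
    is replaced with a 3D convolution over (time, height, width)."""

    def __init__(self, in_ch=3, embed_dim=96, patch=(2, 4, 4)):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, video):              # video: (B, 3, T, H, W)
        return self.proj(video)            # (B, embed_dim, T/2, H/4, W/4)

class PatchMerging3D(nn.Module):
    """3D-convolution-based patch merging: halves the spatial resolution
    and doubles the channel dimension between stages."""

    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Conv3d(dim, 2 * dim, kernel_size=(1, 2, 2), stride=(1, 2, 2))

    def forward(self, x):                  # x: (B, C, T, H, W)
        return self.reduce(x)              # (B, 2C, T, H/2, W/2)
```
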
4. Training infrastructure: to reduce computation and memory consumption, the authors integrate several key techniques such as ZeRO, activation checkpointing, mixed-precision training, and gradient caching (a sketch combining two of them follows this list)
  • ZeRO: partitions the optimizer states, gradients, and parameters across GPUs
  • Activation checkpointing: recompute the forward pass during backpropagation instead of storing intermediate activations
  • Mixed precision: run different operations at different numerical precisions (e.g. FP16 for most operations, FP32 where needed)
  • Gradient caching: train with smaller sub-batches while preserving the effect of a large batch, so that large-batch contrastive training fits in memory
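
A minimal sketch combining two of these techniques, activation checkpointing and mixed precision, in plain PyTorch; ZeRO and gradient caching normally come from dedicated libraries (e.g. DeepSpeed) and are omitted. The `model.encode_image` / `model.encode_text` methods are placeholders, and `unicl_loss` refers to the earlier sketch.

```python
import torch
from torch.utils.checkpoint import checkpoint
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss so FP16 gradients do not underflow

def train_step(model, batch, optimizer):
    images, tokens, labels = batch
    optimizer.zero_grad()

    with autocast():  # mixed precision: most ops in FP16, sensitive ops stay in FP32
        # activation checkpointing: do not store the encoders' intermediate
        # activations; recompute them during the backward pass to save memory
        img_emb = checkpoint(model.encode_image, images, use_reentrant=False)
        txt_emb = checkpoint(model.encode_text, tokens, use_reentrant=False)
        loss = unicl_loss(img_emb, txt_emb, labels)  # contrastive loss from the sketch above

    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscale gradients, then take the optimizer step
    scaler.update()
    return loss.item()
```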

Finally, the performance of the Florence model on each benchmark, as reported in the paper, is as follows:

Reprinting without permission is prohibited.