This is the sixth day of my participation in the First Challenge 2022

I am a deep-learning beginner and this is my second paper-reading note, so there may be mistakes and omissions; please bear with me.

I look forward to improving in the process of continuous output and sharing more valuable content with you.

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Motivation

Current multimodal pre-training models (VL-BERT, UNITER, etc.) focus on general vision-language representations, whereas multimodal pre-training in e-commerce needs fine-grained representations (e.g., the material of a product) rather than the coarse-grained representations (what, where, etc.) of general scenes. This paper therefore proposes a model for the fashion domain to address this gap.

Related Works

The authors summarize 28 representative multimodal methods, as shown in the figure:

Most of the vision-language models in the figure above focus on relatively coarse general representations and are not discussed here. Two similar efforts stand out:

  1. FashionBERT. FashionBERT is the first pre-training model for the fashion domain (also work from the same team), which uses fixed-size image patches and focuses on cross-modal image-text retrieval.

  2. MAAF. MAAF derives a modality-agnostic attention fusion strategy for undifferentiated text-and-image retrieval, using an image-level attention mechanism.

This paper argues that both approaches limit the representational ability of pre-trained models, especially for fine-grained understanding of fashion tasks. Multi-scale image patches as fine-grained input are therefore urgently needed in academia and industry. Kaleido-BERT, proposed in this paper, is the first model to implicitly associate image-text semantics using a pre-aligned masking strategy.

Methods

The core idea of this paper is to focus on fine-grained representation learning and to reduce the semantic gap between images and texts. To this end, the authors use the following methods:

  1. The “Kaleido” strategy extracts a series of fine-grained image patches at different scales from the image side, yielding multi-scale image features that transfer better to downstream tasks.

  2. The SAT (Show, Attend and Tell) network is introduced to reduce the cross-modal semantic gap by pre-aligning Kaleido patches with text tokens.

Model

The model structure of Kaleido-Bert is shown as follows:

From bottom to top: the Kaleido Patch Generator (KPG), the Attention-based Alignment Generator (AAG), the Alignment-Guided Masking strategy (AGM), the cross-modal Transformer, and three pre-training tasks.

1. Kaleido Patches Generator

KPG first uses a saliency detection network (e.g., BASNet, EGNet, ICON) to extract a foreground segmentation map, then crops the main object according to that map.

Each image is then split into grids of different scales (1×1, 2×2, …, 5×5), yielding 55 image patches in total (the Kaleido patches).
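The multi-scale split above is straightforward to sketch. The following is a minimal illustration (not the paper's code) of how 1×1 through 5×5 grids over one image produce exactly 1 + 4 + 9 + 16 + 25 = 55 patches; the function name is my own:

```python
import numpy as np

def kaleido_patches(image, max_scale=5):
    """Split an image into n x n grids for n = 1..max_scale.

    For max_scale=5 this yields 1 + 4 + 9 + 16 + 25 = 55 patches,
    matching the paper's count.
    """
    h, w = image.shape[:2]
    patches = []
    for n in range(1, max_scale + 1):
        # Grid cell boundaries; linspace handles sizes not divisible by n.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                patches.append(image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]])
    return patches

img = np.zeros((224, 224, 3), dtype=np.uint8)
print(len(kaleido_patches(img)))  # 55
```

In the paper each patch is then embedded by the backbone network before entering the Transformer.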

Finally, ResNet-50 is used as the backbone network for feature extraction.

2. Attention-based Alignment Generator

The SAT network is trained directly on the FashionGen dataset (the dataset used in the FashionBERT work) and then used as a text generator to produce image descriptions.

If the generated description and the original description share co-occurring words, the attention heat map of each co-occurring word is used to determine which Kaleido patch that word is most associated with.

This stage yields partial alignment information between words in the original description and Kaleido patches.
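The alignment step can be sketched as follows: for each co-occurring word, take the patch with the highest attention weight in the caption model's heat map as its aligned patch. This is a minimal illustration with a hypothetical function name, not the paper's implementation:

```python
import numpy as np

def align_tokens_to_patches(co_words, attention):
    """Map each co-occurring word to its most-attended Kaleido patch.

    attention: dict mapping word -> 1-D array of attention weights,
    one weight per patch (as produced by an SAT-style captioner).
    """
    return {w: int(np.argmax(attention[w])) for w in co_words}

# Toy example: two words attending over three patches.
att = {"dress": np.array([0.1, 0.7, 0.2]),
       "red":   np.array([0.5, 0.3, 0.2])}
print(align_tokens_to_patches(["dress", "red"], att))  # {'dress': 1, 'red': 0}
```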

3. Alignment-Guided Masking

Unlike the random masking strategy, the pre-aligned masking strategy gives higher priority to masking words or patches that carry pre-alignment information.

Empirically, the paper masks 1 patch at the 3×3 level, 2 patches at the 4×4 level, and 3 patches at the 5×5 level.

4. Cross-modal Transformer

The paper uses the original BERT to build the multimodal Transformer, so that Kaleido-BERT remains easy to extend and transfer.

Pre-Training

Three pre-training tasks are designed in this paper:

1. Pre-aligned Mask Language Model (AMLM)

The masked word is recovered from the surrounding word features and the image-patch features.

2. Image-Text Matching (ITM)

This task determines whether the input image and text match.

3. Pre-aligned Kaleido Patch Modeling (AKPM)

As mentioned above, KPG generates 55 patches across 5 levels, and AKPM designs an independent task for each level:

  • Task 1: Rotation Recognition (RR)
  • Task 2: Jigsaw Puzzle Solving (JPS)
  • Task 3: Camouflage Prediction (CP)
  • Task 4: Grey-to-Color Modeling (G2CM)
  • Task 5: Blank-to-Color Modeling (B2CM)
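The first two self-supervised targets are simple to construct. The sketch below (my own illustration, not the paper's code) builds an RR label by rotating the 1×1-level patch a random multiple of 90° and a JPS label by shuffling the 2×2-level patches:

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_task(patch):
    """RR: rotate the 1x1-level patch by a random multiple of 90 degrees;
    the model must predict the rotation class (0-3)."""
    k = int(rng.integers(4))
    return np.rot90(patch, k), k

def jigsaw_task(patches_2x2):
    """JPS: shuffle the 2x2-level patches; the model must predict
    the permutation that restores the original order."""
    perm = rng.permutation(len(patches_2x2))
    return [patches_2x2[i] for i in perm], perm
```

CP, G2CM, and B2CM are built analogously by swapping in a foreign patch, grey-scaling a patch, or blanking a patch, with the model asked to detect or regress the original content.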

Experiments

Results

With SumR = (Rank@1 + Rank@5 + Rank@10) × 100, the results on image-to-text retrieval (ITR) and text-to-image retrieval (TIR) significantly outperform SOTA.

Kaleido-BERT also achieves significant gains on the category prediction (CR, SUB) and fashion captioning (FC) tasks, where SumCLS = (ACC + macro-F) × 100 and SumCAP = BLEU-4 + METEOR + ROUGE-L + CIDEr.
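The three summary metrics above are plain sums; a minimal sketch with my own helper names makes the arithmetic explicit:

```python
def sum_r(r1, r5, r10):
    """SumR = (Rank@1 + Rank@5 + Rank@10) x 100, for retrieval."""
    return (r1 + r5 + r10) * 100

def sum_cls(acc, macro_f):
    """SumCLS = (ACC + macro-F) x 100, for category prediction."""
    return (acc + macro_f) * 100

def sum_cap(bleu4, meteor, rouge_l, cider):
    """SumCAP = BLEU-4 + METEOR + ROUGE-L + CIDEr, for captioning."""
    return bleu4 + meteor + rouge_l + cider

print(sum_r(0.25, 0.5, 0.75))  # 150.0
```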

Notably, compared with ViLBERT and VL-BERT, which extract RoIs as features, FashionBERT's use of image patches as visual input achieves better results, suggesting the latter is the more appropriate choice for the fashion domain.

Ablation Study

  • Fixed-size splitting, Kaleido, and Kaleido + saliency detection were compared.

  • The random masking strategy and the pre-aligned masking strategy (AGM) were compared.

  • Ablation experiments were performed on five pre-training tasks.

The results are shown below:

Summary

The authors propose Kaleido-BERT, a multimodal pre-training model for the fashion domain, which achieves excellent results on downstream tasks such as image-text retrieval, category prediction, and fashion captioning.

Main contributions:

  • Kaleido patch generator

  • Attention-based alignment generator

  • Pre-aligned masking strategy