MirrorGAN tries to regenerate the text description from the generated image by learning text-to-image-to-text, so as to strengthen the consistency between the text description and the visual content. The paper was accepted at the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Address: arxiv.org/abs/1903.05…

Code address: github.com/qiaott/Mirr…

The basic principle is that if the image generated by T2I is semantically consistent with the given text description, its redescription by I2T should have exactly the same semantics as the given text description.

This blog post is a close reading of the paper, including some personal understanding, extensions, and a summary.

I. Summary of the original text

Generating an image from a given text description has two goals: visual realism and semantic consistency. Although significant progress has been made in generating high-quality, visually realistic images using generative adversarial networks, guaranteeing semantic consistency between the text description and the visual content remains very challenging. In this paper, we propose MirrorGAN, a novel global-local attentive and semantics-preserving text-to-image-to-text framework, to address this problem. MirrorGAN exploits the idea of learning text-to-image generation by redescription and consists of three modules: a semantic text embedding module (STEM), a global-local collaborative attention module (GLAM) for the cascaded image generator, and a semantic text regeneration and alignment module (STREAM). STEM generates word- and sentence-level embeddings. GLAM has a cascaded architecture for generating target images from coarse to fine scales, exploiting both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM attempts to regenerate the text description from the generated image, which should be semantically consistent with the given text description. Thorough experiments on two public benchmark datasets show that MirrorGAN outperforms other representative state-of-the-art methods.

II. Why MirrorGAN was proposed

Although generative adversarial networks (GANs) have made significant progress in generating visually realistic images, the domain gap between text and images makes it difficult and inefficient to model semantic consistency by relying on the discriminator alone. Adding an attention mechanism improves this, but word-level attention alone does not guarantee globally consistent semantics, so it remains difficult to ensure that the generated image is semantically aligned with the input text.

III. MirrorGAN overall framework

MirrorGAN implements a mirror structure by integrating T2I and I2T. It leverages the idea of learning T2I generation by redescribing it. After the image is generated, MirrorGAN regenerates its description so that its underlying semantics are consistent with the given text description. Technically, MirrorGAN consists of three modules: STEM, GLAM and STREAM. The details of this model are described below.

3.1 STEM: Semantic text embedding module

In the STEM module, a recurrent neural network (RNN) is used to extract semantic embeddings from the given text description T, including a word embedding w and a sentence embedding s:

$$
w, s=\operatorname{RNN}(T)
$$

where the text description $T=\left\{T_{l} \mid l=0, \ldots, L-1\right\}$, $L$ denotes the sentence length, and $w=\left\{w^{l} \mid l=0, \ldots, L-1\right\} \in \mathbb{R}^{D \times L}$ is the word embedding matrix, with $D$ the dimension of the embeddings.
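The paper only specifies that STEM is an RNN, so as a reading aid here is a minimal PyTorch sketch assuming a bidirectional LSTM text encoder (as in AttnGAN-style encoders); the class name and layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Sketch of STEM: encodes a caption into word embeddings w (D x L) and a sentence embedding s (D,)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM: concatenating both directions gives D = 2 * hidden_dim features per word.
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):                            # tokens: (1, L) word indices of one caption
        x = self.embed(tokens)                            # (1, L, embed_dim)
        out, (h, _) = self.rnn(x)                         # out: (1, L, 2 * hidden_dim)
        w = out.squeeze(0).t()                            # (D, L) word-level embeddings
        s = torch.cat([h[0], h[1]], dim=1).squeeze(0)     # (D,) sentence embedding from final hidden states
        return w, s


# Usage example (illustrative vocabulary size and caption length):
# w, s = TextEncoder(vocab_size=5000)(torch.randint(0, 5000, (1, 18)))
```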

3.2 GLAM: Global-local collaborative attention module in the cascaded image generator

This part adopts a multi-stage cascaded generator composed of multiple image generation networks; its basic structure is similar to the cascade structure in AttnGAN.

First, the word-level attention mechanism: the word embeddings are mapped into the semantic space of the visual features by a perceptron layer and multiplied with the visual features to obtain attention scores; the attentive word-context features are then obtained from the inner product of the results of the first two steps.

Second, the sentence-level attention mechanism is similar to the word-level one: the sentence embedding is mapped into the semantic space of the visual features by a perceptron layer and combined element-wise with the visual features to obtain attention scores; the final features are then obtained from the element-wise product of the two results.

On the basis of AttnGAN, this part additionally introduces global (sentence-level) attention and combines it with the local (word-level) attention already used in AttnGAN, so that generation attends not only to local details and semantics but also to global ones.

In the following formulas, $\{F_0, F_1, \ldots, F_{m-1}\}$ denote the $m$ visual feature transformers and $\{G_0, G_1, \ldots, G_{m-1}\}$ denote the $m$ image generators (typically $m = 3$ stages):

$$
\begin{array}{l}
f_{0}=F_{0}\left(z, s_{ca}\right) \\
f_{i}=F_{i}\left(f_{i-1}, F_{att_{i}}\left(f_{i-1}, w, s_{ca}\right)\right), \; i \in\{1,2, \ldots, m-1\} \\
I_{i}=G_{i}\left(f_{i}\right), \; i \in\{0, \ldots, m-1\}
\end{array}
$$

where $z$ is a random noise vector and $s_{ca}$ is the sentence embedding after conditioning augmentation.

Here $f_{i} \in \mathbb{R}^{M_{i} \times N_{i}}$ are the visual features and $I_{i} \in \mathbb{R}^{q_{i} \times q_{i}}$ is the image generated at stage $i$.
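As a reading aid, a minimal Python sketch of this coarse-to-fine loop is shown below; the callables `F0`, `F_blocks`, `Att_blocks`, and `G_blocks` are hypothetical stand-ins for $F_0$, $F_i$, $F_{att_i}$, and $G_i$, not the released code.

```python
def cascade_generate(z, s_ca, w, F0, F_blocks, Att_blocks, G_blocks):
    """Coarse-to-fine generation:
    f0 = F0(z, s_ca); f_i = F_i(f_{i-1}, Att_i(f_{i-1}, w, s_ca)); I_i = G_i(f_i).
    Returns the list of images [I_0, ..., I_{m-1}] at increasing resolutions.
    """
    f = F0(z, s_ca)                       # initial hidden visual features from noise + sentence
    images = [G_blocks[0](f)]             # lowest-resolution image I_0
    for F_i, Att_i, G_i in zip(F_blocks, Att_blocks, G_blocks[1:]):
        f = F_i(f, Att_i(f, w, s_ca))     # refine features with global-local attention (GLAM)
        images.append(G_i(f))             # higher-resolution image I_i
    return images
```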

$$
Att_{i-1}^{w}=\sum_{l=0}^{L-1}\left(U_{i-1} w^{l}\right)\left(\operatorname{softmax}\left(f_{i-1}^{T}\left(U_{i-1} w^{l}\right)\right)\right)^{T}
$$

where $U_{i-1} \in \mathbb{R}^{M_{i-1} \times D}$ and $Att_{i-1}^{w} \in \mathbb{R}^{M_{i-1} \times N_{i-1}}$.

$$
Att_{i-1}^{s}=\left(V_{i-1} s_{ca}\right) \circ\left(\operatorname{softmax}\left(f_{i-1} \circ\left(V_{i-1} s_{ca}\right)\right)\right)
$$

where $Att_{i-1}^{s} \in \mathbb{R}^{M_{i-1} \times N_{i-1}}$ and $V_{i-1} \in \mathbb{R}^{M_{i-1} \times D^{\prime}}$.
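To make the two attention branches above concrete, here is a minimal PyTorch sketch of the word-level and sentence-level attention. The single-sample tensor shapes and the softmax axis in the sentence branch are my assumptions; this is illustrative, not the released code.

```python
import torch
import torch.nn.functional as F


def word_attention(f, w, U):
    """Local word-level attention:
    Att^w = sum_l (U w^l) * softmax(f^T (U w^l))^T
    f: (M, N) visual features, w: (D, L) word embeddings, U: (M, D) projection.
    Returns the attentive word-context features of shape (M, N).
    """
    w_proj = U @ w                                # (M, L): words mapped into the visual space
    scores = F.softmax(f.t() @ w_proj, dim=0)     # (N, L): per-word attention over spatial locations
    return w_proj @ scores.t()                    # (M, N): sum over words of the outer products


def sentence_attention(f, s_ca, V):
    """Global sentence-level attention:
    Att^s = (V s_ca) o softmax(f o (V s_ca)), with o the element-wise product.
    f: (M, N) visual features, s_ca: (D',) augmented sentence embedding, V: (M, D') projection.
    """
    s_proj = (V @ s_ca).unsqueeze(1)              # (M, 1): sentence mapped into the visual space
    scores = F.softmax(f * s_proj, dim=1)         # (M, N): normalized over locations (assumed axis)
    return s_proj * scores                        # (M, N): attentive sentence-context features


# Example shapes (illustrative): M = 64 channels, N = 64 * 64 locations, L = 18 words, D = 256, D' = 100.
```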

3.3 STREAM: Semantic text reconstruction and alignment module

The semantic text reconstruction and alignment module regenerates a text description from the generated image so that it is aligned with the given text description as closely as possible in semantics. The generated image is first fed into an image encoder, and the text is then decoded by a recurrent neural network.

The image encoder is a pre-trained convolutional neural network (CNN) on ImageNet, and the decoder is RNN.


$$
\begin{array}{l}
x_{-1}=\mathrm{CNN}\left(I_{m-1}\right) \\
x_{t}=W_{e} T_{t}, \; t \in\{0, \ldots, L-1\} \\
p_{t+1}=\mathrm{RNN}\left(x_{t}\right), \; t \in\{0, \ldots, L-1\}
\end{array}
$$

where $x_{-1} \in \mathbb{R}^{M_{m-1}}$ is the visual feature of the generated image after encoding, $W_{e} \in \mathbb{R}^{M_{m-1} \times D}$ is the word embedding matrix that maps the word features into the visual feature space, and $p_{t+1}$ is the predicted probability distribution over words.
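To make this concrete, here is a minimal PyTorch sketch of STREAM under the assumptions that the ImageNet-pretrained encoder is a ResNet-50 (the paper only says a pretrained CNN) and that the decoder is a single-layer LSTM trained with teacher forcing; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class StreamCaptioner(nn.Module):
    """Sketch of STREAM: an ImageNet-pretrained CNN encoder plus an RNN decoder over the caption."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        cnn.fc = nn.Identity()                                  # keep the 2048-d pooled visual features
        for p in cnn.parameters():
            p.requires_grad_(False)                             # encoder parameters stay fixed (pretrained)
        self.cnn = cnn
        self.visual_proj = nn.Linear(2048, embed_dim)           # produces x_{-1} in the embedding space
        self.word_embed = nn.Embedding(vocab_size, embed_dim)   # W_e: word embedding matrix
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)     # logits defining p_{t+1} over the vocabulary

    def forward(self, image, caption_tokens):
        # image: (B, 3, H, W) generated image; caption_tokens: (B, L) ground-truth word indices.
        feats = self.cnn(image)                                 # (B, 2048) pooled features
        x_start = self.visual_proj(feats).unsqueeze(1)          # (B, 1, embed_dim): image feature fed first
        x_words = self.word_embed(caption_tokens)               # (B, L, embed_dim): x_t = W_e T_t
        inputs = torch.cat([x_start, x_words], dim=1)           # teacher forcing: image, then caption words
        out, _ = self.rnn(inputs)                               # (B, L + 1, hidden_dim)
        return self.classifier(out)                             # (B, L + 1, vocab_size) word logits
```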

IV. Loss function

The first part of the loss function is the adversarial loss, which is similar to the loss functions in StackGAN++ and AttnGAN. MirrorGAN likewise alternately trains the multi-stage generators and discriminators, and adopts an unconditional loss (visual realism) plus a conditional loss (text-image pairing semantic consistency):

$$
\mathcal{L}_{G_{i}}=-\frac{1}{2} \mathbb{E}_{I_{i} \sim p_{I_{i}}}\left[\log \left(D_{i}\left(I_{i}\right)\right)\right]-\frac{1}{2} \mathbb{E}_{I_{i} \sim p_{I_{i}}}\left[\log \left(D_{i}\left(I_{i}, s\right)\right)\right]
$$

where $I_{i}$ is an image sampled from the generator distribution $p_{I_{i}}$ at stage $i$. The first term is the unconditional loss and the second is the conditional loss, which judges whether the image matches the sentence semantics.

Similar to the generator, the discriminator loss also combines an unconditional loss (visual realism) with a conditional loss (text-image pairing semantic consistency):

$$
\begin{aligned}
\mathcal{L}_{D_{i}}=&-\frac{1}{2} \mathbb{E}_{I_{i}^{GT} \sim p_{I_{i}^{GT}}}\left[\log \left(D_{i}\left(I_{i}^{GT}\right)\right)\right]-\frac{1}{2} \mathbb{E}_{I_{i} \sim p_{I_{i}}}\left[\log \left(1-D_{i}\left(I_{i}\right)\right)\right] \\
&-\frac{1}{2} \mathbb{E}_{I_{i}^{GT} \sim p_{I_{i}^{GT}}}\left[\log \left(D_{i}\left(I_{i}^{GT}, s\right)\right)\right]-\frac{1}{2} \mathbb{E}_{I_{i} \sim p_{I_{i}}}\left[\log \left(1-D_{i}\left(I_{i}, s\right)\right)\right]
\end{aligned}
$$

where $I_{i}^{GT}$ is sampled from the real image distribution $p_{I_{i}^{GT}}$ at stage $i$.
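As a hedged illustration, the per-stage generator and discriminator objectives above can be written as binary cross-entropy terms as follows; the assumption that each discriminator `D_i` returns a pair of unconditional and conditional logits is mine, made only to keep the interface compact.

```python
import torch
import torch.nn.functional as F


def generator_loss(D_i, fake_img, sent_emb):
    """L_{G_i}: unconditional (realism) term + conditional (image-text match) term for stage i."""
    uncond_logit, cond_logit = D_i(fake_img, sent_emb)      # assumed interface: two logits per image
    real_labels = torch.ones_like(uncond_logit)
    return 0.5 * (F.binary_cross_entropy_with_logits(uncond_logit, real_labels)
                  + F.binary_cross_entropy_with_logits(cond_logit, real_labels))


def discriminator_loss(D_i, real_img, fake_img, sent_emb):
    """L_{D_i}: real images should be classified as real, generated images as fake, in both branches."""
    real_u, real_c = D_i(real_img, sent_emb)
    fake_u, fake_c = D_i(fake_img.detach(), sent_emb)       # stop gradients into the generator
    ones, zeros = torch.ones_like(real_u), torch.zeros_like(real_u)
    return 0.5 * (F.binary_cross_entropy_with_logits(real_u, ones)
                  + F.binary_cross_entropy_with_logits(fake_u, zeros)
                  + F.binary_cross_entropy_with_logits(real_c, ones)
                  + F.binary_cross_entropy_with_logits(fake_c, zeros))
```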

The second part of the loss function is the cross-entropy (CE) based text-semantic reconstruction loss, which makes the redescribed text as consistent as possible with the given text description:


$$
\mathcal{L}_{\text{stream}}=-\sum_{t=0}^{L-1} \log p_{t}\left(T_{t}\right)
$$

The two parts are added together, with the weight of the second term adjusted by λ, to obtain the final generator loss function:


$$
\mathcal{L}_{G}=\sum_{i=0}^{m-1} \mathcal{L}_{G_{i}}+\lambda \mathcal{L}_{\text{stream}}
$$
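Putting the pieces together, a minimal sketch of the final objective might look as follows; the word-level cross-entropy implements $\mathcal{L}_{\text{stream}}$, and the default value of `lam` is a placeholder rather than the paper's reported setting.

```python
import torch.nn as nn


def total_generator_loss(stage_losses, word_logits, caption_tokens, lam=20.0):
    """L_G = sum_i L_{G_i} + lambda * L_stream, with L_stream = -sum_t log p_t(T_t).

    stage_losses: list of per-stage generator losses L_{G_i}.
    word_logits: (B, L, vocab_size) STREAM predictions; caption_tokens: (B, L) target word indices.
    """
    ce = nn.CrossEntropyLoss(reduction="sum")               # summed over words, matching -sum_t log p_t(T_t)
    l_stream = ce(word_logits.reshape(-1, word_logits.size(-1)),
                  caption_tokens.reshape(-1))
    return sum(stage_losses) + lam * l_stream
```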

V. Experiments

5.1 Datasets

CUB: contains 8,855 training images and 2,933 test images of birds from 200 categories; each image has 10 text descriptions.

MS COCO: contains 82,783 training images and 40,504 test images, each with 5 text descriptions.

5.2 Evaluation criteria

Inception Score (IS): measures the quality and diversity of the generated images.

R-precision: evaluates the visual-semantic similarity between a generated image and its corresponding text description.

5.3 Experimental results

5.4 Quantitative analysis

MirrorGAN achieved the highest IS score on both CUB and COCO datasets, and achieved a higher R-precision score compared to AttnGAN.

5.5 Qualitative analysis

For the subjective visual comparison and human perception test, the authors compared and analyzed the images generated by each GAN and then collected visual surveys from several volunteers. This is only a brief introduction; you can read the original paper if you are interested in the details.

5.6 Ablation study

1. Ablation experiments on MirrorGAN components. The results show that the global and local attention in GLAM can work together to help the generator produce visually realistic, semantically consistent images and to tell the generator where to focus.

2. Visual study of the cascade structure. To better understand MirrorGAN's cascaded generation process, the authors visualize the intermediate images and the attention maps at each stage. As can be seen, in the first stage the low-resolution image contains only rough shapes and colors and lacks detail. Under the guidance of GLAM in the later stages, MirrorGAN gradually improves the quality of the generated images by focusing on the most relevant and important regions.

1) Global attention focuses more on the global context in the early stages, and then on the context around specific regions in the later stages; 2) local attention helps generate images with fine-grained detail by directing the generator to focus on the most relevant words; and 3) global and local attention are complementary and together contribute to the progressive refinement of the network.

In addition, MirrorGAN can capture small differences (such as color) between text descriptions and generate the corresponding images.

VI. Conclusion

The contributions of this paper are summarized below:

1. A novel text-to-image framework, MirrorGAN, is proposed to model T2I and I2T jointly, specifically targeting T2I generation by embodying the idea of learning T2I generation through redescription.

2. A global-local collaborative attention model is proposed, which is seamlessly embedded into the cascade generator to maintain cross-domain semantic consistency and smooth the generation process.

3. In addition to the commonly used generator losses, a cross-entropy (CE) based text-semantic reconstruction loss is proposed to supervise the generator to produce visually realistic and semantically consistent images.

Extension: 2016-2021 Text to Image (T2I