Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

Introduction

Automatic image generation from natural language descriptions is a fundamental problem in many applications, such as art generation and computer-aided design. Text-to-image synthesis methods based on Generative Adversarial Networks (GANs) are currently the most popular approach.

In GAN-based methods, a commonly used strategy is to encode the entire text description into a global sentence vector, which serves as the condition for GAN-based image generation. However, this strategy relies only on the global sentence vector and misses important fine-grained information at the word level, which hinders the generation of high-quality images. To overcome this problem, we propose an Attentional Generative Adversarial Network (AttnGAN) that performs multi-stage, fine-grained, attention-driven text-to-image generation. The model consists of two new components:

  • The first component is an attentional generative network, in which an attention mechanism enables the generator to draw different sub-regions of the image by focusing on the words most relevant to the sub-regions being drawn. (See Figure 1)
  • The other component is a Deep Attentional Multimodal Similarity Model (DAMSM), which uses both global sentence-level information and fine-grained word-level information to compute the similarity between generated images and sentences. DAMSM therefore provides an additional fine-grained image-text matching loss for training the generator.

Attentional Generative Adversarial Network (AttnGAN)

The architecture of AttnGAN is shown in Figure 2 and consists of two components: 1. the attentional generative network, and 2. the deep attentional multimodal similarity model (DAMSM).

Attentional generative network

In this section, a new attention model is introduced into the generative network, enabling it to draw different sub-regions of the image conditioned on the words most relevant to those sub-regions.

As shown in Figure 2, the attentional generative network has $m$ generators $(G_0,G_1,\dots,G_{m-1})$, which take the hidden states $(h_0,h_1,\dots,h_{m-1})$ as input and generate images from small to large scale $(\hat{x}_0,\hat{x}_1,\dots,\hat{x}_{m-1})$. Specifically:

$$h_0=F_0(z,F^{ca}(\bar{e}))$$

$$h_i=F_i(h_{i-1},F^{attn}_i(e,h_{i-1})),\quad i=1,2,\dots,m-1$$

$$\hat{x}_i=G_i(h_i)$$

$z$ is a noise vector, usually sampled from a standard normal distribution. $\bar{e}$ is the global sentence vector, and $e$ is the matrix of word vectors. $F^{ca}$ denotes the Conditioning Augmentation module, which converts $\bar{e}$ into a conditioning vector. $F^{attn}_i$ is the attention model at the $i$-th stage of AttnGAN. $F^{ca}$, $F^{attn}_i$, $F_i$, and $G_i$ are all modeled as neural networks.
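
To make the flow above concrete, here is a minimal PyTorch sketch of how one stage chains $F^{ca}$, $F^{attn}_i$, $F_i$ and $G_i$. All layer definitions and sizes are simplified assumptions rather than the authors' implementation, and the word-level attention is replaced by a trivial stand-in (its real form is sketched after the equations below).

```python
# Minimal sketch of the stage-wise flow; layers and sizes are assumptions.
import torch
import torch.nn as nn

B, T, D, D_HAT, Z_DIM, C_DIM, S = 4, 12, 256, 48, 100, 100, 16  # assumed sizes

f_ca = nn.Linear(D, C_DIM)                        # F^ca: e_bar -> conditioning vector
f_0  = nn.Linear(Z_DIM + C_DIM, D_HAT * S * S)    # F_0: (z, c) -> h_0 feature map
f_1  = nn.Conv2d(2 * D_HAT, D_HAT, 3, padding=1)  # F_1: fuse h_0 with word context
g_1  = nn.Conv2d(D_HAT, 3, 3, padding=1)          # G_1: hidden features -> RGB image
u    = nn.Linear(D, D_HAT, bias=False)            # U: project word vectors to image space

def f_attn_stub(e, h):
    """Stand-in for F^attn(e, h): broadcasts the mean projected word feature
    to every sub-region (the real attention is sketched further below)."""
    c = u(e).mean(dim=1)                          # (B, D_hat)
    return c[:, :, None, None].expand_as(h)

z     = torch.randn(B, Z_DIM)                     # noise vector
e_bar = torch.randn(B, D)                         # global sentence vector
e     = torch.randn(B, T, D)                      # word vector matrix (T words)

h0 = f_0(torch.cat([z, f_ca(e_bar)], 1)).view(B, D_HAT, S, S)  # h_0 = F_0(z, F^ca(e_bar))
h1 = f_1(torch.cat([h0, f_attn_stub(e, h0)], 1))               # h_1 = F_1(h_0, F^attn_1(e, h_0))
x1 = torch.tanh(g_1(h1))                                       # x_hat_1 = G_1(h_1)
```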

$F^{attn}(e,h)$ has two inputs: the word vectors $e\in\mathbb{R}^{D\times T}$ and the image features $h\in\mathbb{R}^{\hat{D}\times N}$ from the previous hidden layer. First, a perceptron layer maps the word features into the common semantic space of the image features: $e'=Ue$, where $U\in\mathbb{R}^{\hat{D}\times D}$. Then, using the hidden image features $h$ as the query, a word-context vector is computed for each sub-region of the image, where each column of $h$ is the feature vector of one sub-region. For the $j$-th sub-region, the word-context vector is a dynamic representation of the word vectors relative to $h_j$, computed as:


$$c_j=\sum^{T-1}_{i=0}\beta_{j,i}e'_{i},\quad\text{where}\ \ \beta_{j,i}=\frac{\exp(s'_{j,i})}{\sum^{T-1}_{k=0}\exp(s'_{j,k})}$$

where $s'_{j,i}=h^T_j e'_i$, and $\beta_{j,i}$ is the weight of the $i$-th word when generating the $j$-th sub-region. The word-context matrix for the image feature set $h$ is then defined as $F^{attn}(e,h)=(c_0,c_1,\dots,c_{N-1})\in\mathbb{R}^{\hat{D}\times N}$. Finally, the image features are combined with the corresponding word-context features to generate the image of the next stage.
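
As a reading aid, the attention computation above can be written as a small PyTorch module. The tensor layout (batch-first, sub-regions flattened into columns) and the sizes in the usage example are assumptions, not the paper's exact code.

```python
# A possible rendering of F^attn: e' = Ue, s'_{j,i} = h_j^T e'_i,
# beta = softmax over words, c_j = sum_i beta_{j,i} e'_i.
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """F^attn(e, h): word-context matrix (c_0, ..., c_{N-1}) for image features h."""
    def __init__(self, word_dim, img_dim):
        super().__init__()
        self.U = nn.Linear(word_dim, img_dim, bias=False)    # U in R^{D_hat x D}

    def forward(self, e, h):
        # e: (B, T, D) word vectors; h: (B, D_hat, N) sub-region features
        e_prime = self.U(e)                                  # (B, T, D_hat)
        s_prime = torch.bmm(e_prime, h)                      # s'_{j,i} = h_j^T e'_i -> (B, T, N)
        beta = torch.softmax(s_prime, dim=1)                 # normalize over the T words
        c = torch.bmm(e_prime.transpose(1, 2), beta)         # (B, D_hat, N): one c_j per sub-region
        return c

# Usage with assumed sizes: 12 words of dim 256, 64 sub-regions of dim 48.
attn = WordAttention(word_dim=256, img_dim=48)
e = torch.randn(2, 12, 256)
h = torch.randn(2, 48, 64)
context = attn(e, h)                 # (2, 48, 64), concatenated with h in the next stage
```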

In order to generate realistic images using multi-level conditions (sentence-level and word-level), the final objective function of the attentional generative network is defined as:


$$L=L_G+\lambda L_{DAMSM},\quad\text{where}\ L_G=\sum^{m-1}_{i=0}L_{G_i}$$

$\lambda$ is a hyperparameter that balances the two terms of the loss $L$.

The first term is the GAN loss, which jointly approximates the conditional and unconditional image distributions. At the $i$-th stage of AttnGAN, the generator $G_i$ has a corresponding discriminator $D_i$, and the adversarial loss of $G_i$ is defined as:

$$L_{G_i}=-\frac{1}{2}\mathbb{E}_{\hat{x}_i\sim p_{G_i}}[\log D_i(\hat{x}_i)]-\frac{1}{2}\mathbb{E}_{\hat{x}_i\sim p_{G_i}}[\log D_i(\hat{x}_i,\bar{e})]$$

The unconditional loss determines whether the image is real, while the conditional loss determines whether the image matches the sentence.

Alternately with the training of $G_i$, each discriminator $D_i$ is trained to classify its inputs as real or fake by minimizing the cross-entropy loss defined as:

$$L_{D_i}=-\frac{1}{2}\mathbb{E}_{x_i\sim p_{data_i}}[\log D_i(x_i)]-\frac{1}{2}\mathbb{E}_{\hat{x}_i\sim p_{G_i}}[\log(1-D_i(\hat{x}_i))]-\frac{1}{2}\mathbb{E}_{x_i\sim p_{data_i}}[\log D_i(x_i,\bar{e})]-\frac{1}{2}\mathbb{E}_{\hat{x}_i\sim p_{G_i}}[\log(1-D_i(\hat{x}_i,\bar{e}))]$$

$x_i$ is sampled from the real image distribution $p_{data_i}$ at the $i$-th scale, and $\hat{x}_i$ is sampled from the model distribution $p_{G_i}$ at the same scale. The discriminators of AttnGAN are structurally independent, so they can be trained in parallel, with each discriminator focusing on a single image scale.
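
The two adversarial losses can be sketched with binary cross-entropy on the discriminator's unconditional and conditional outputs. The helper names, and the assumption that $D_i$ exposes two logit heads, are illustrative rather than taken from the official implementation.

```python
# Hedged sketch of the per-stage adversarial losses (names are assumptions).
import torch
import torch.nn.functional as F

def generator_loss(d_uncond_logits, d_cond_logits):
    """L_{G_i}: fool D_i on both the unconditional and conditional branches."""
    real = torch.ones_like(d_uncond_logits)
    uncond = F.binary_cross_entropy_with_logits(d_uncond_logits, real)
    cond = F.binary_cross_entropy_with_logits(d_cond_logits, real)
    return 0.5 * (uncond + cond)

def discriminator_loss(real_uncond, real_cond, fake_uncond, fake_cond):
    """L_{D_i}: real images (with matching sentences) -> 1, generated images -> 0."""
    ones, zeros = torch.ones_like(real_uncond), torch.zeros_like(fake_uncond)
    loss_real = F.binary_cross_entropy_with_logits(real_uncond, ones) \
              + F.binary_cross_entropy_with_logits(real_cond, ones)
    loss_fake = F.binary_cross_entropy_with_logits(fake_uncond, zeros) \
              + F.binary_cross_entropy_with_logits(fake_cond, zeros)
    return 0.5 * (loss_real + loss_fake)

# Example call with dummy logits from D_i on generated images.
fake_u, fake_c = torch.randn(4, 1), torch.randn(4, 1)
g_loss = generator_loss(fake_u, fake_c)
```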

The second term, $L_{DAMSM}$, is a word-level fine-grained image-text matching loss computed by DAMSM, which is described in detail in the next section.

Deep Attentional Multimodal Similarity Model (DAMSM)

DAMSM learns two neural networks that map sub-regions of the image and words of the sentence into a common semantic space, so that image-text similarity can be measured at the word level and used as a fine-grained loss for image generation.

The text encoder: is a bidirectional LSTM used to extract semantic vectors from the text description. In a bidirectional LSTM, each word corresponds to two hidden states, one for each direction, which are concatenated to represent the semantic meaning of the word. The feature matrix of all words is denoted $e\in\mathbb{R}^{D\times T}$, whose $i$-th column $e_i$ is the feature vector of the $i$-th word; $D$ is the dimension of the word vectors and $T$ is the number of words. Meanwhile, the last hidden states of the bidirectional LSTM are concatenated to form the global sentence vector, denoted $\bar{e}\in\mathbb{R}^{D}$.
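
A minimal version of such a text encoder might look as follows; the vocabulary size, embedding dimension, and the omission of padding/packing are assumptions made for brevity.

```python
# Minimal bidirectional-LSTM text encoder sketch (sizes are assumptions).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=128):
        super().__init__()
        # D = 2 * hidden_dim, because the two directions are concatenated.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):
        # tokens: (B, T) word indices
        out, (h_n, _) = self.lstm(self.embed(tokens))
        words = out.transpose(1, 2)                    # e in R^{D x T}: per-word features
        sentence = torch.cat([h_n[0], h_n[1]], dim=1)  # e_bar: concat last fwd/bwd states
        return words, sentence

enc = TextEncoder()
e, e_bar = enc(torch.randint(0, 5000, (4, 12)))        # e: (4, 256, 12), e_bar: (4, 256)
```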

The image encoder: is a convolutional neural network (CNN) that maps images to semantic vectors. The intermediate layers of the CNN learn local features of different sub-regions of the image, while the later layers learn global features of the whole image. Specifically, the image encoder is built on the Inception-v3 model pre-trained on ImageNet. The input image is first rescaled to 299 × 299 pixels. Then, the local feature matrix $f\in\mathbb{R}^{768\times 289}$ is extracted from the 'mixed_6e' layer of Inception-v3 (reshaped from $768\times 17\times 17$); each column of $f$ is the feature vector of a sub-region of the image, 768 is the dimension of the local feature vectors, and 289 is the number of sub-regions. Meanwhile, the global feature vector $\bar{f}\in\mathbb{R}^{2048}$ is extracted from the last average-pooling layer of Inception-v3. Finally, the image features are transformed into the common semantic space of the text features by adding a perceptron layer:


$$v=Wf,\quad\bar{v}=\bar{W}\bar{f}$$

where $v\in\mathbb{R}^{D\times 289}$, whose $i$-th column $v_i$ is the visual feature vector of the $i$-th sub-region of the image; $\bar{v}\in\mathbb{R}^{D}$ is the global vector of the whole image; and $D$ is the dimension of the multimodal (text and image) feature space.
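
One possible way to obtain $f$, $\bar{f}$, $v$ and $\bar{v}$ with torchvision's Inception-v3 is sketched below, using forward hooks on the `Mixed_6e` and average-pooling layers; the hook-based extraction and the choice $D=256$ are assumptions, not necessarily how the authors implemented it.

```python
# Hedged sketch of the image encoder: local Mixed_6e features (768 x 17 x 17
# -> 768 x 289) plus pooled global features (2048), mapped into D dimensions.
import torch
import torch.nn as nn
from torchvision import models

D = 256                                           # assumed multimodal feature dimension
cnn = models.inception_v3(weights="IMAGENET1K_V1")  # downloads ImageNet weights; use None to skip
cnn.eval()

feats = {}
cnn.Mixed_6e.register_forward_hook(lambda m, i, o: feats.update(local=o))
cnn.avgpool.register_forward_hook(lambda m, i, o: feats.update(pooled=o))

W     = nn.Linear(768, D, bias=False)             # W: local features -> v
W_bar = nn.Linear(2048, D, bias=False)            # W_bar: global feature -> v_bar

with torch.no_grad():
    cnn(torch.randn(2, 3, 299, 299))              # images rescaled to 299 x 299
f     = feats["local"].flatten(2)                 # (B, 768, 289): one column per sub-region
f_bar = feats["pooled"].flatten(1)                # (B, 2048)
v     = W(f.transpose(1, 2)).transpose(1, 2)      # (B, D, 289): v = W f
v_bar = W_bar(f_bar)                              # (B, D):      v_bar = W_bar f_bar
```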

The attention-driven image-text matching score: This score is designed to measure the matching between an image and a sentence based on an attention model between the image and the text. First, the similarity matrix between all possible pairs of words in the sentence and sub-regions in the image is computed:


$$s=e^Tv$$

Here $s\in\mathbb{R}^{T\times 289}$, and $s_{i,j}$ is the dot-product similarity between the $i$-th word of the sentence and the $j$-th sub-region of the image. The authors find that normalizing the similarity matrix as follows works better:


$$\bar{s}_{i,j}=\frac{\exp(s_{i,j})}{\sum^{T-1}_{k=0}\exp(s_{k,j})}$$

Then, an attention model is used to compute a region-context vector for each word (query). The region-context vector $c_i$ is a dynamic representation of the image's sub-regions related to the $i$-th word of the sentence, computed as a weighted sum over the visual vectors of all sub-regions:


$$c_i=\sum^{288}_{j=0}\alpha_j v_j,\quad\text{where}\ \alpha_j=\frac{\exp(\gamma_1\bar{s}_{i,j})}{\sum^{288}_{k=0}\exp(\gamma_1\bar{s}_{i,k})}$$

$\gamma_1$ is a factor that determines how much attention is paid to the features of the relevant sub-regions when computing the region-context vector for a word.

Finally, the cosine similarity between $c_i$ and $e_i$ is used to define the relevance between the $i$-th word and the image, namely $R(c_i,e_i)=(c^T_ie_i)/(\|c_i\|\,\|e_i\|)$. Inspired by the minimum classification error formulation in speech recognition, the attention-driven image-text matching score between the entire image (denoted $Q$) and the entire text description (denoted $D$) is defined as:


$$R(Q,D)=\log\Big(\sum^{T-1}_{i=1}\exp(\gamma_2 R(c_i,e_i))\Big)^{1/\gamma_2}$$

Here, $\gamma_2$ is a factor that determines how much to magnify the importance of the most relevant word-to-region-context pair. As $\gamma_2\rightarrow\infty$, $R(Q,D)$ approaches $\max^{T-1}_{i=1}R(c_i,e_i)$.
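
Putting the last few equations together, the score $R(Q,D)$ for a single image-sentence pair can be computed as in the sketch below; the $\gamma$ values and tensor shapes are assumed for illustration.

```python
# Sketch of the attention-driven matching score R(Q, D) for one pair.
import torch

def matching_score(e, v, gamma1=5.0, gamma2=5.0):
    """e: (T, D) word features, v: (D, 289) sub-region features -> scalar R(Q, D)."""
    s = e @ v                                        # (T, 289): s = e^T v
    s_bar = torch.softmax(s, dim=0)                  # normalize over words per sub-region
    alpha = torch.softmax(gamma1 * s_bar, dim=1)     # attention over regions per word
    c = alpha @ v.t()                                # (T, D): region-context vectors c_i
    r = torch.cosine_similarity(c, e, dim=1)         # R(c_i, e_i) for each word
    return (1.0 / gamma2) * torch.logsumexp(gamma2 * r, dim=0)  # log(sum exp(.))^{1/gamma2}

e = torch.randn(12, 256)                             # 12 words in the sentence
v = torch.randn(256, 289)                            # 289 image sub-regions
score = matching_score(e, v)
```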

The DAMSM loss: This loss is designed to learn the attention model in a semi-supervised manner, where the only supervision is the matching between entire images and entire sentences (sequences of words). For a batch of image-sentence pairs $\{(Q_i,D_i)\}^M_{i=1}$, the posterior probability that sentence $D_i$ matches its image $Q_i$ is computed as:


$$P(D_i|Q_i)=\frac{\exp(\gamma_3 R(Q_i,D_i))}{\sum^M_{j=1}\exp(\gamma_3 R(Q_i,D_j))}$$

Here $\gamma_3$ is a smoothing factor determined by experiment. In this batch of sentences, only $D_i$ matches the image $Q_i$, and the remaining $M-1$ sentences are treated as mismatched descriptions. The loss function is then defined as the negative log posterior probability that the images match their corresponding ground-truth sentences (the superscript $w$ stands for "word"):


$$L^w_1=-\sum^M_{i=1}\log P(D_i|Q_i)$$

Symmetrically, we also minimize:


$$L^w_2=-\sum^M_{i=1}\log P(Q_i|D_i)$$

where:


$$P(Q_i|D_i)=\frac{\exp(\gamma_3 R(Q_i,D_i))}{\sum^M_{j=1}\exp(\gamma_3 R(Q_j,D_i))}$$

is the posterior probability that image $Q_i$ matches sentence $D_i$. The sentence-level losses $L^s_1$ and $L^s_2$ can be obtained in a similar way by redefining the matching score using the global sentence vector $\bar{e}$ and the global image vector $\bar{v}$.

The DAMSM loss is finally defined as:


$$L_{DAMSM}=L^w_1+L^w_2+L^s_1+L^s_2$$
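
In practice, the word-level terms can be computed by building the $M\times M$ matrix of scores $R(Q_i,D_j)$ for a batch and applying cross-entropy along its rows and columns, as in the hedged sketch below; the sentence-level terms reuse the same code with cosine scores between $\bar{v}$ and $\bar{e}$. The value of $\gamma_3$ is assumed.

```python
# Sketch of the word-level DAMSM terms over a batch of M matching pairs.
import torch
import torch.nn.functional as F

def damsm_word_loss(score_matrix, gamma3=10.0):
    """score_matrix[i, j] = R(Q_i, D_j); diagonal entries are the true pairs."""
    m = score_matrix.size(0)
    labels = torch.arange(m)                         # pair i matches pair i
    logits = gamma3 * score_matrix
    loss_w1 = F.cross_entropy(logits, labels)        # -sum_i log P(D_i | Q_i), averaged over M
    loss_w2 = F.cross_entropy(logits.t(), labels)    # -sum_i log P(Q_i | D_i), averaged over M
    return loss_w1 + loss_w2

scores = torch.randn(8, 8)                           # e.g. R(Q_i, D_j) from the score sketch above
loss = damsm_word_loss(scores)
```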