This paper, *Generative Adversarial Text to Image Synthesis* (text-to-image, T2I, with a GAN), was published by Reed et al. in 2016 and accepted at ICML. It can be considered the first work to generate images from text with a GAN.

Paper link: arxiv.org/pdf/1605.05…

Code link: https://github.com/zsdonghao/text-to-image

This article is an intensive reading of the paper and includes some personal understanding, background knowledge, and a summary.

1. Abstract

Automatically synthesizing realistic images from text would be interesting and useful, but current AI systems are still far from this goal. In recent years, however, general and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate compelling images of specific categories, such as faces, album covers, and room interiors. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We show that the model can generate plausible images of birds and flowers from detailed text descriptions.

2. Keywords

Deep Learning, Generative Adversarial Network, Image Synthesis, Computer Vision

3. Related work

This research falls within multimodal machine learning. A modality is any source or form of information: human senses such as touch, hearing, sight, and smell; information media such as speech, video, and text; and sensors such as radar, infrared, and accelerometers can each be called a modality. Multimodal learning aims to process and understand information from multiple modalities using machine learning methods. Key challenges include learning shared representations across modalities and predicting missing data in one modality conditioned on another.

Denton et al. (2015) synthesized multi-resolution images using a Laplacian pyramid of adversarial generators and discriminators. This produces compelling high-resolution images and also allows generation to be conditioned on class labels. Background on the Laplacian pyramid GAN: an image pyramid is a multi-scale representation of an image, a series of progressively lower-resolution images derived from the same original and arranged like a pyramid. A Gaussian pyramid is built by repeated downsampling; a Laplacian pyramid stores, at each level, the residual between that level and the upsampled version of the next coarser level, so the original image can be reconstructed as faithfully as possible by going bottom-up: upsample (double the size/resolution) and add back the predicted residual.

Radford et al. (2016) used a standard convolutional decoder but developed an effective and stable architecture combined with batch normalization, achieving striking image synthesis results. Mansimov et al. (2016) used a recurrent variational autoencoder (VAE) to generate images from text captions, but the generated images were not yet realistic.

The main differences between this work and the GANs described above are: 1) the model is conditioned on textual descriptions rather than class labels; 2) it is the first end-to-end architecture from the character level to the pixel level; 3) a manifold interpolation regularizer is introduced, which significantly improves the quality of the generated samples.

4. Background knowledge

4.1 GAN

The GAN objective is the minimax value function

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

For more of the mathematics behind GANs, see the blog post: Understand the math behind GAN.
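As a minimal illustration (my own PyTorch-style sketch, not code from the paper or its repository), this value function is usually implemented as two binary cross-entropy losses, one for the discriminator and one for the generator:

```python
import torch
import torch.nn.functional as F

def gan_losses(D, G, real_images, z):
    """Standard (unconditional) GAN losses expressed as binary cross-entropy.

    D maps an image to a logit, G maps noise z to an image.
    """
    fake_images = G(z)

    # Discriminator: push D(real) -> 1 and D(fake) -> 0.
    d_real = D(real_images)
    d_fake = D(fake_images.detach())          # stop gradients into G
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator: push D(G(z)) -> 1 (the non-saturating form of the objective).
    g_fake = D(fake_images)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss
```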

4.2 Deep symmetric structured joint embedding

To obtain a visually discriminative vector representation of the text description, the paper follows the approach of *Learning Deep Representations of Fine-grained Visual Descriptions*: a convolutional recurrent neural network text encoder is trained jointly with an image encoder through a symmetric structured joint embedding. The objective involves both an image classifier and a text classifier; writing θ for the image encoder and φ for the text encoder, the image classifier is

$$f_v(v) = \arg\max_{y \in \mathcal{Y}} \; \mathbb{E}_{t \sim \mathcal{T}(y)}\big[\theta(v)^\top \varphi(t)\big]$$

and the text classifier f_t(t) is defined symmetrically. In this paper the image encoder is GoogLeNet and the text encoder is a char-CNN-RNN (CNN plus LSTM). After the text features are obtained, the compressed text embedding is concatenated with the image features (and, in the generator, with the noise vector) and fed into the DC-GAN.
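The loss used in the paper is a structured hinge loss over class labels; the following is a simplified batch-wise sketch of the same idea (my own code, assuming each row of the batch is a matching image/text pair and all other rows in the batch are mismatches):

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(img_feat, txt_feat):
    """Simplified symmetric joint-embedding loss (a sketch, not the authors' code).

    img_feat: (N, D) image embeddings, e.g. from GoogLeNet.
    txt_feat: (N, D) text embeddings from the char-CNN-RNN encoder.
    Row i of both tensors is assumed to describe the same instance.
    """
    # Compatibility scores theta(v)^T phi(t) for all batch pairs.
    scores = img_feat @ txt_feat.t()                  # (N, N)
    targets = torch.arange(scores.size(0), device=scores.device)

    # Symmetric objective: each image should rank its own sentence highest,
    # and each sentence should rank its own image highest.
    loss_img2txt = F.cross_entropy(scores, targets)
    loss_txt2img = F.cross_entropy(scores.t(), targets)
    return loss_img2txt + loss_txt2img
```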

5. Main method

5.1 Framework

The model trains a convolutional recurrent neural network text encoder together with a deep convolutional generative adversarial network (DC-GAN). Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text encoding φ(t). The generator first encodes the text description with the text encoder to obtain a feature representation, compresses it, and concatenates it with a noise vector z (in the architecture figure, the blue block is the compressed text embedding and the white block next to it is the noise vector z). The combined vector is fed into a deconvolution (transposed convolution) network, which after several layers produces an image. The discriminator convolves the input image into a feature map, replicates the compressed text embedding and concatenates it depth-wise with the image features, and finally outputs a binary score that judges whether the image is real.
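The following is a simplified PyTorch sketch of the generator and discriminator described above (the released code is TensorFlow-based; layer widths such as ngf/ndf are my assumptions, chosen to produce 64×64 images):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Text-conditional DC-GAN generator: compress phi(t), concatenate with z,
    then upsample with transposed convolutions to a 64x64 image."""
    def __init__(self, z_dim=100, txt_dim=1024, proj_dim=128, ngf=64):
        super().__init__()
        self.project_txt = nn.Sequential(nn.Linear(txt_dim, proj_dim), nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + proj_dim, ngf * 8, 4, 1, 0), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.BatchNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, txt_embedding):
        t = self.project_txt(txt_embedding)                    # compress phi(t) to 128-d
        x = torch.cat([z, t], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)                                     # (N, 3, 64, 64)

class Discriminator(nn.Module):
    """Text-conditional discriminator: convolve the image, then concatenate the
    replicated text embedding depth-wise before the final real/fake score."""
    def __init__(self, txt_dim=1024, proj_dim=128, ndf=64):
        super().__init__()
        self.project_txt = nn.Sequential(nn.Linear(txt_dim, proj_dim), nn.LeakyReLU(0.2))
        self.conv = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1), nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1), nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, True),
        )                                                      # -> (N, ndf*8, 4, 4)
        self.head = nn.Conv2d(ndf * 8 + proj_dim, 1, 4, 1, 0)  # -> (N, 1, 1, 1) logit

    def forward(self, image, txt_embedding):
        h = self.conv(image)
        t = self.project_txt(txt_embedding)
        t = t.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, h.size(2), h.size(3))
        return self.head(torch.cat([h, t], dim=1)).view(-1)    # one logit per image
```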

5.2 First improvement: GAN-CLS

GAN-CLS: matching-aware discriminator. In a plain conditional GAN, the discriminator D sees two kinds of input: a real image with its matching text, and a synthetic image with arbitrary text. So D has to separate two sources of error: unrealistic images, and realistic images paired with mismatched text. GAN-CLS adds a third kind of input to D: a real image with a false (mismatching) text description, which D must also score as fake. This lets D learn the correspondence between the text description and the image content, not just image realism. The training procedure (Algorithm 1 in the paper) is sketched below.
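A sketch of one GAN-CLS training step (my reconstruction of Algorithm 1, expressed with binary cross-entropy in PyTorch; opt_g/opt_d are assumed to be Adam optimizers, and txt_match/txt_mismatch are pre-computed text embeddings φ(t)):

```python
import torch
import torch.nn.functional as F

def gan_cls_step(G, D, opt_g, opt_d, real_img, txt_match, txt_mismatch, z_dim=100):
    """One GAN-CLS training step: D sees three kinds of (image, text) pairs."""
    bce = F.binary_cross_entropy_with_logits
    z = torch.randn(real_img.size(0), z_dim, device=real_img.device)
    fake_img = G(z, txt_match)

    # --- Discriminator update ---
    s_r = D(real_img, txt_match)            # real image, right text -> real
    s_w = D(real_img, txt_mismatch)         # real image, wrong text -> fake
    s_f = D(fake_img.detach(), txt_match)   # fake image, right text -> fake
    d_loss = bce(s_r, torch.ones_like(s_r)) \
           + 0.5 * (bce(s_w, torch.zeros_like(s_w)) + bce(s_f, torch.zeros_like(s_f)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator update: fool D on (fake image, right text) ---
    s_f = D(fake_img, txt_match)
    g_loss = bce(s_f, torch.ones_like(s_f))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```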

5.3 Second improvement: GAN-INT

GAN-INT: learning with manifold interpolation. A large number of additional text embeddings are generated simply by interpolating between embeddings of texts from the training set. Crucially, these interpolated embeddings do not need to correspond to any actually written sentence, so they come at no extra labeling cost. This works because the feature representations learned by deep networks are approximately interpolable: the interpolation of the embeddings of sentence A ("a cow eating grass") and sentence B ("a bird sitting in a tree") still lies close to the data manifold, near the embeddings of A and B. 1) Blending two text embeddings t₁ and t₂: the interpolated embedding is βt₁ + (1−β)t₂, and the generator is additionally trained so that D accepts images generated from it. β is the fusion ratio, set to 0.5 in the paper, i.e. the two sentences are mixed half and half (see the sketch after item 2 below).

2) Style transfer: a style encoder network S is trained to invert the generator, so that S(x) extracts the style of an image x (the style takes the place of the random noise z). The training objective is $\mathcal{L}_{style} = \mathbb{E}_{t, z \sim \mathcal{N}(0,1)}\big[\lVert z - S(G(z, \varphi(t)))\rVert_2^2\big]$. To transfer style, the style s = S(x) extracted from a query image and the text embedding φ(t) are fed into the generator, x̂ = G(s, φ(t)), producing an image with the query image's style and the described content. Both of these pieces are sketched in code below.
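Below is a sketch of the GAN-INT generator term and the style network, under the same assumptions as the earlier code (a generator G(z, txt), a discriminator D(img, txt), and a style network S that maps an image back to a z-sized vector; this is my illustration, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def gan_int_generator_loss(G, D, txt_a, txt_b, z_dim=100, beta=0.5):
    """Extra generator term for GAN-INT: interpolate two training text embeddings
    (beta * t1 + (1 - beta) * t2, beta = 0.5) and ask D to accept the result."""
    t_interp = beta * txt_a + (1.0 - beta) * txt_b
    z = torch.randn(txt_a.size(0), z_dim, device=txt_a.device)
    s_f = D(G(z, t_interp), t_interp)
    return F.binary_cross_entropy_with_logits(s_f, torch.ones_like(s_f))

def style_encoder_loss(S, G, txt_embedding, z_dim=100):
    """Train S to invert the generator back to its noise input:
    L_style = E[ || z - S(G(z, phi(t))) ||_2^2 ]."""
    z = torch.randn(txt_embedding.size(0), z_dim, device=txt_embedding.device)
    return ((z - S(G(z, txt_embedding))) ** 2).sum(dim=1).mean()

def style_transfer(S, G, query_image, txt_embedding):
    """Transfer the style (pose, background) of a query image onto new text:
    s = S(x), then x_hat = G(s, phi(t))."""
    with torch.no_grad():
        s = S(query_image)          # extracted style replaces the random noise z
        return G(s, txt_embedding)
```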

6. Experiments

6.1 Datasets

Two datasets are used: CUB (birds) and Oxford-102 (flowers). CUB is split into 150 training classes and 50 test classes, while Oxford-102 has 82 training classes and 20 test classes. Each image is accompanied by 5 corresponding text descriptions.

6.2 Pre-training of text features

The text features come from a convolutional recurrent neural network text encoder (char-CNN-RNN) that is pre-trained via the structured joint embedding against 1024-dimensional GoogLeNet image embeddings (Szegedy et al., 2015). The text encoder is pre-trained simply to speed up the training of the other components and allow faster experimentation.

6.3 Training process

The training images are resized to 64×64×3. The text encoder produces 1024-dimensional embeddings, which are projected to 128 dimensions in both the generator and the discriminator before being concatenated depth-wise with the convolutional feature maps. The generator and discriminator are updated in alternating steps, with a learning rate of 0.0002 and the ADAM solver (momentum 0.5); the generator's random noise is sampled from a 100-dimensional unit normal distribution. The minibatch size is 64 and training runs for 600 epochs.
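For reference, these hyperparameters translate into roughly the following setup (a sketch that assumes the Generator/Discriminator classes from the earlier code block; all variable names are mine):

```python
import torch

# Hyperparameters reported in the paper.
z_dim, lr, beta1, batch_size, num_epochs = 100, 0.0002, 0.5, 64, 600

G = Generator(z_dim=z_dim)      # 1024-d text embedding projected to 128-d inside
D = Discriminator()

opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(beta1, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(beta1, 0.999))

# Per batch, noise is sampled from a 100-dimensional unit normal distribution.
z = torch.randn(batch_size, z_dim)
```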

6.4 Experimental results

6.5 Separating content and style

By content we mean the visual attributes of the bird itself, such as body shape, size, and color; by style we mean all the other variation in the image, such as background color and bird pose. The text embedding mainly carries content information and is usually style-agnostic, so the GAN uses the random noise z to capture style. For evaluation, K-means groups images into 100 clusters, where images from the same cluster are taken to share the same style: images of the same style (e.g., similar poses) should be more similar to each other than images of different styles. GAN-INT and GAN-INT-CLS perform best on this task.

Concretely, images are grouped into 100 clusters by K-means according to background color and the pose of the bird or flower. A trained CNN predicts a style vector for each image generated by G, and the cosine similarity between style vectors is computed for pairs of images from the same and from different clusters, yielding an ROC curve (the closer the ROC curve is to the upper-left corner, the higher the sensitivity and the lower the false-positive rate, i.e. the better the method). In the corresponding figure, the curve for caption text alone is essentially a straight diagonal line, indicating that the text is uncorrelated with image style.
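This evaluation protocol can be sketched roughly as follows (my own reading of the procedure using scikit-learn, not the released evaluation code; `image_features` and `style_vectors` are assumed inputs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score

def style_verification_auc(style_vectors, image_features):
    """Sketch of style verification: cluster images into 100 style groups, then
    score same-cluster pairs by cosine similarity of predicted style vectors.

    image_features: (N, D) features defining style clusters (background / pose).
    style_vectors:  (N, K) predicted style vectors for the generated images.
    """
    clusters = KMeans(n_clusters=100, n_init=10).fit_predict(image_features)

    # Cosine similarity between every pair of predicted style vectors.
    normed = style_vectors / np.linalg.norm(style_vectors, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Pairs from the same cluster are "same style" positives, others negatives.
    iu = np.triu_indices(len(clusters), k=1)
    labels = (clusters[:, None] == clusters[None, :])[iu].astype(int)
    return roc_auc_score(labels, sim[iu])
```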

6.6 Results of manifold interpolation

Interpolating in the learned text manifold accurately reflects changes in content such as color: for example, a bird changes from blue to red while its pose and the background stay fixed. Conversely, interpolating between two noise vectors while keeping the text fixed produces a smooth transition between two styles of bird image with the content unchanged. In the figure, the left panel interpolates between two sentences with the random noise held fixed (content changes, style stays the same), while the right panel interpolates between two noise vectors with the sentence held fixed (content stays the same, style changes). Both interpolations at inference time are sketched below.
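At inference time, the two panels correspond to the following two interpolation loops (a sketch under the same assumptions as the earlier generator code; txt_a and txt_b are two text embeddings of shape (1, 1024)):

```python
import torch

def interpolate_images(G, txt_a, txt_b, z_dim=100, steps=8):
    """Sketch of the two interpolation experiments: vary the text embedding with
    the noise fixed (content changes), or vary the noise with the text fixed
    (style changes). Assumes a generator G(z, txt) as sketched earlier."""
    z1, z2 = torch.randn(1, z_dim), torch.randn(1, z_dim)
    images_text, images_noise = [], []
    for alpha in torch.linspace(0, 1, steps):
        # Left panel: interpolate between two sentences, keep z1 fixed.
        images_text.append(G(z1, alpha * txt_a + (1 - alpha) * txt_b))
        # Right panel: interpolate between two noise vectors, keep txt_a fixed.
        images_noise.append(G(alpha * z1 + (1 - alpha) * z2, txt_a))
    return images_text, images_noise
```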

6.7 Generalization

To test generalization, the model was also trained and evaluated on the MS-COCO dataset. From a distance the results are encouraging, but on closer inspection it is clear that the generated scenes are often not coherent.

6.8 Experimental conclusions

A simple and effective model is developed for generating images from detailed visual text descriptions. We show that the model can synthesize many plausible visual interpretations of a given text caption. Our manifold interpolation regularizer substantially improves text-to-image synthesis on CUB. We demonstrate the separation of style and content, and the transfer of bird pose and background from query images onto text descriptions. Finally, results on the MS-COCO dataset demonstrate the generality of our method for generating images with multiple objects and variable backgrounds.

7. Summary

This paper is the first to generate images from text with a GAN. The authors use a convolutional recurrent neural network text encoder plus a deep convolutional generative adversarial network (DC-GAN), and on this basis make three improvements:

1) GAN-CLS: a matching-aware discriminator, which adds a third input group to D: real images with mismatching text descriptions. This lets D better learn the correspondence between the text description and the image content.

2) GAN-INT: learning with manifold interpolation, i.e. interpolating between the embeddings of the training texts to increase the variety of text conditions, giving G a stronger generation capability.

3) Separating content and style: random noise z is used to carry style, and K-means grouping of the images into 100 clusters is used to evaluate it. Because the text description itself says nothing about style, letting z characterize style means that different random z values add different styles, increasing the realism and diversity of the generated samples.

The receiver operating characteristic (ROC) curve is also called the sensitivity curve. The area under the ROC curve (AUC) is the area enclosed by the curve, the x-axis, and the line x = 1. As long as the AUC is greater than 0.5, the diagnostic test has some diagnostic value, and the closer the AUC is to 1, the better the test. The closer the ROC curve lies to the upper-left corner, the higher the sensitivity and the lower the false-positive rate, and thus the better the method; the point on the ROC curve closest to the upper-left corner has the maximum sum of sensitivity and specificity.

Further reading

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

Reading guide: a 2016-2021 Text-to-Image (T2I) paper reading route

Adversarial Text-to-image Synthesis: A Review