
StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

The main contribution of this paper is to replace the original two-stage stacked structure of StackGAN with a tree structure: multiple generators and discriminators are arranged like the branches of a tree, and each generator produces samples at a different resolution. The network architecture is also improved. The paper was published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) and extends the StackGAN paper that appeared at ICCV 2017.

Address: arxiv.org/pdf/1710.10…

Code address: github.com/hanzhanggit…

This blog post is a close-reading report of the paper, including some personal understanding, related background, and a summary. StackGAN-v1 was already covered in the earlier StackGAN post of this Text-to-Image series, so this post only summarizes StackGAN-v2.

I. Abstract

Despite the remarkable success of generative adversarial networks (GANs) in a variety of tasks, they still face challenges in generating high-quality images. In this paper, the authors propose StackGANs, designed to generate high-resolution photorealistic images. First, StackGAN-v1, a two-stage generative adversarial network architecture for text-to-image synthesis, is proposed. The Stage-I GAN sketches the primitive shape and colors of the scene based on the given text description, producing a low-resolution image. The Stage-II GAN takes the Stage-I result and the text description as input and generates a high-resolution image with photo-realistic details. Second, StackGAN-v2, an advanced multi-stage generative adversarial network architecture, is proposed for both conditional and unconditional generative tasks. StackGAN-v2 consists of multiple generators and discriminators arranged in a tree structure; images of multiple scales corresponding to the same scene are generated from different branches of the tree. By jointly approximating multiple distributions, StackGAN-v2 shows more stable training behavior than StackGAN-v1. Extensive experiments show that the proposed stacked generative adversarial networks outperform other state-of-the-art methods in generating photorealistic images.

II. Keywords

Text to Image, Generative Adversarial Network, Image Synthesis, Computer Vision

III. Why StackGAN-v2?

By modeling the data distribution at multiple scales, if any one of these model distributions shares support with the real data distribution at that scale, the stacked structure can provide good gradient signals to accelerate or stabilize the training of the entire network across scales. For example, the first branch approximates the low-resolution image distribution and produces images with the basic colors and structure of the scene; the generators of subsequent branches can then focus on completing the details to produce higher-resolution images.

To put it simply: if the distribution of generated images at any scale is close to the distribution of real images at that scale, it provides a good gradient signal to stabilize or promote the training of the entire network.

Compared with v1, StackGAN-v2 shows more stable training behavior, achieves better FID and Inception scores on most datasets, and does not suffer from mode collapse.

IV. Main Contents

4.1 StackGAN-v1 vs. StackGAN-v2

StackGAN-v1 uses two separate networks, the Stage-I GAN and the Stage-II GAN, to model the image distribution from low to high resolution.

To make the framework more general, an end-to-end network, StackGAN-v2, is proposed to model a series of multi-scale image distributions. StackGAN-v2 consists of multiple generators (G) and multiple discriminators (D) organized in a tree structure. Images from low to high resolution are generated from different branches of the tree. At each branch, the generator captures the image distribution at that scale, and the discriminator judges the authenticity of samples at that scale. The generators are trained jointly to approximate the multiple distributions, and the generators and discriminators are trained alternately, as sketched below.
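As a rough illustration of this alternating scheme (not the official implementation; the module names G, D, and the optimizers are placeholders), each discriminator is updated on samples at its own scale, and all generator branches are then updated with a single joint loss:

```python
# Minimal sketch of the alternating update (placeholder names: G returns a
# list of images from small to large scale, D is a list of per-scale
# discriminators, real_images a list of real batches at matching scales).
import torch
import torch.nn.functional as F

def train_step(G, D, real_images, opt_G, opt_D, z_dim=100, device="cpu"):
    batch = real_images[0].size(0)
    z = torch.randn(batch, z_dim, device=device)
    fakes = G(z)  # one generated batch per scale

    # 1) update every discriminator on samples of its own scale
    d_loss = 0.0
    for Di, real, fake in zip(D, real_images, fakes):
        real_logit, fake_logit = Di(real), Di(fake.detach())
        d_loss = d_loss \
            + F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) \
            + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) update all generator branches jointly with a single summed loss
    g_loss = 0.0
    for Di, fake in zip(D, fakes):
        fake_logit = Di(fake)
        g_loss = g_loss + F.binary_cross_entropy_with_logits(
            fake_logit, torch.ones_like(fake_logit))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```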

4.2 Multi-scale image distribution

Each generator branch has a hidden feature. The hidden feature of the first branch is h_0 = F_0(z), where z is noise, usually drawn from a standard normal distribution. The hidden feature of the i-th branch is h_i = F_i(h_{i-1}, z), i.e., the noise and the previous hidden feature h_{i-1} are used together as input to compute h_i. Each generator G_i then turns its hidden feature into a sample s_i = G_i(h_i), so the branches produce samples from small to large scales, as in the sketch below.
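A minimal sketch of this feature chain, assuming toy layer sizes and 2x upsampling per branch (the class name and channel counts are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeGenerator(nn.Module):
    """Sketch of h_0 = F_0(z), h_i = F_i(h_{i-1}, z), s_i = G_i(h_i)."""
    def __init__(self, z_dim=100, ch=64, n_branches=3):
        super().__init__()
        self.z_dim, self.ch = z_dim, ch
        self.f0 = nn.Linear(z_dim, ch * 4 * 4)                         # F_0
        self.fi = nn.ModuleList(                                       # F_1 .. F_{n-1}
            [nn.Conv2d(ch + z_dim, ch, 3, padding=1) for _ in range(n_branches - 1)])
        self.to_img = nn.ModuleList(                                   # G_0 .. G_{n-1}
            [nn.Conv2d(ch, 3, 3, padding=1) for _ in range(n_branches)])

    def forward(self, z):
        h = F.relu(self.f0(z)).view(-1, self.ch, 4, 4)                 # h_0 at 4x4
        samples = [torch.tanh(self.to_img[0](h))]                      # s_0
        for conv, head in zip(self.fi, self.to_img[1:]):
            h = F.interpolate(h, scale_factor=2)                       # double the resolution
            z_map = z[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
            h = F.relu(conv(torch.cat([h, z_map], dim=1)))             # h_i = F_i(h_{i-1}, z)
            samples.append(torch.tanh(head(h)))                        # s_i
        return samples                                                  # small -> large

# samples = TreeGenerator()(torch.randn(8, 100))  # toy scales: 4x4, 8x8, 16x16
```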

4.3 Joint conditional and unconditional distribution

Unconditional image generation: the discriminator distinguishes real images from generated images. Conditional image generation: the image and its corresponding conditioning variable (e.g., a text embedding) are fed into the discriminator, which judges whether the image and the condition match; this guides the generator to approximate the conditional image distribution. In the conditional setting, h_0 = F_0(c, z), where z is random noise, and h_i = F_i(h_{i-1}, c), where c is the conditioning vector. The objective function for training the discriminator D_i of the conditional StackGAN-v2 therefore consists of two terms, an unconditional loss and a conditional loss, as shown below:
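Reconstructing the objective (to the best of my reading of the paper; x_i denotes a real image at the i-th scale, s_i a generated sample, and c the conditioning vector):

$$
\mathcal{L}_{D_i} =
\underbrace{-\,\mathbb{E}_{x_i \sim p_{\mathrm{data}_i}}\big[\log D_i(x_i)\big]
-\mathbb{E}_{s_i \sim p_{G_i}}\big[\log\big(1 - D_i(s_i)\big)\big]}_{\text{unconditional loss}}
\;\;
\underbrace{-\,\mathbb{E}_{x_i \sim p_{\mathrm{data}_i}}\big[\log D_i(x_i, c)\big]
-\mathbb{E}_{s_i \sim p_{G_i}}\big[\log\big(1 - D_i(s_i, c)\big)\big]}_{\text{conditional loss}}
$$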

4.4 Color consistency regularization

As the image resolution increases across generators, the images generated at different scales should share a similar basic structure and colors. Therefore, a color-consistency regularization term is introduced to keep samples generated from the same input more consistent in color across generators, improving the quality of the generated images. The term aims to minimize the color differences between different scales. Let x_k = (R, G, B)^T denote a pixel of a generated image; the mean and covariance of the pixels are computed, with N being the total number of pixels. The color-consistency regularization term minimizes the differences in mean and covariance between adjacent scales, as written below.
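The formulas referenced above can be reconstructed as follows (notation as I recall from the paper; x_j is the j-th pixel of the sample s_k, N the number of pixels, n the batch size, and λ1, λ2 balancing weights; the term is applied between adjacent scales, i.e., for i ≥ 2):

$$
\mu_{s_k} = \frac{1}{N}\sum_{j} x_j, \qquad
\Sigma_{s_k} = \frac{1}{N}\sum_{j}\big(x_j - \mu_{s_k}\big)\big(x_j - \mu_{s_k}\big)^{T}
$$

$$
\mathcal{L}_{C_i} = \frac{1}{n}\sum_{j=1}^{n}
\Big(\lambda_1\,\big\|\mu_{s_i^{\,j}} - \mu_{s_{i-1}^{\,j}}\big\|_2^2
+ \lambda_2\,\big\|\Sigma_{s_i^{\,j}} - \Sigma_{s_{i-1}^{\,j}}\big\|_F^2\Big)
$$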

4.5 Implementation Details

The model is designed to generate 256×256 images. The input vector (noise z and the text embedding) is first transformed into a 4×4×64N_g feature tensor, where N_g is the number of channels; the generator then progressively upsamples it to 64×64×4N_g, 128×128×2N_g, and 256×256×1N_g tensors. The conditional or unconditional variables are also fed directly into intermediate layers of the network to ensure the encoded information is not ignored. All discriminators consist of downsampling blocks with 3×3 convolution kernels; each discriminator converts the image into a 4×4×8N_g tensor and finally outputs a probability through a sigmoid function. A rough shape walk-through is sketched below.
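The following sketch traces those tensor shapes through a chain of generic upsampling blocks, assuming the channel count halves at each 2x upsampling step (consistent with the sizes quoted above, but the block design and the N_g value here are only illustrative):

```python
# Shape walk-through of the generator side. N_g is kept tiny here so the
# sketch runs instantly; the block design is generic, not the official one.
import torch
import torch.nn as nn

N_g = 4

def up_block(c_in, c_out):
    # upsample x2, then a 3x3 convolution
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

z = torch.randn(1, 100)                                             # noise (+ text embedding)
h = nn.Linear(100, 4 * 4 * 64 * N_g)(z).view(1, 64 * N_g, 4, 4)     # 4 x 4 x 64N_g
channels = [64 * N_g, 32 * N_g, 16 * N_g, 8 * N_g, 4 * N_g, 2 * N_g, 1 * N_g]
for c_in, c_out in zip(channels[:-1], channels[1:]):
    h = up_block(c_in, c_out)(h)
    print(tuple(h.shape))  # ends at (1, 4N_g, 64, 64), (1, 2N_g, 128, 128), (1, N_g, 256, 256)
```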

V. Experiments

5.1 Metrics

Inception Score (IS): IS = exp(E_x D_KL(p(y|x) || p(y))), i.e., the exponential of the expected KL divergence between the conditional label distribution p(y|x) and the marginal distribution p(y). The larger the IS, the better.

Frechet Inception Distance (FID): FID measures the distance between the synthetic data distribution and the real data distribution, FID = ||m − m_r||² + Tr(C + C_r − 2(C·C_r)^(1/2)), where m and C are the mean and covariance computed from generated data, and m_r and C_r are the mean and covariance computed from real data. The smaller the FID, the better.
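A minimal sketch of this computation, assuming `act_gen` and `act_real` are N×D matrices of Inception activations that have already been extracted (the function name and inputs are placeholders):

```python
# Minimal FID computation from precomputed activations (act_gen, act_real are
# hypothetical N x D numpy arrays of Inception features).
import numpy as np
from scipy import linalg

def fid(act_gen, act_real):
    """Frechet distance between Gaussians fitted to the two activation sets."""
    m, m_r = act_gen.mean(axis=0), act_real.mean(axis=0)
    C, C_r = np.cov(act_gen, rowvar=False), np.cov(act_real, rowvar=False)
    covmean, _ = linalg.sqrtm(C.dot(C_r), disp=False)   # matrix square root
    covmean = covmean.real                               # discard tiny imaginary parts
    return float(np.sum((m - m_r) ** 2) + np.trace(C + C_r - 2.0 * covmean))
```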

5.2 Experimental Results

5.3 StackGAN-v1 and StackGAN-v2 Comparison

The end-to-end training scheme, together with color-consistency regularization, gives StackGAN-v2 more feedback and regularization for each branch, resulting in better consistency across the multi-step generation process. By jointly optimizing multiple distributions, StackGAN-v2 shows more stable training behavior and achieves better FID and Inception scores on most datasets, but it converges more slowly than v1 and requires more GPU resources.

t-SNE is a good tool for examining the generated distribution and evaluating its diversity. Applying t-SNE to the images generated by StackGAN-v1 and StackGAN-v2 on the CUB test set to probe for mode collapse, StackGAN-v1 shows a partial mode collapse (a cluster of collapsed samples), while StackGAN-v2 does not:

5.4 Some Failures

Failures are classified as mild, moderate, or severe. Mild means the generated image has a smooth, coherent appearance but lacks vivid objects; moderate means the generated image has obvious artifacts, usually a sign of mode collapse; severe means the generated image has fully collapsed. Experiments show that StackGAN-v2 effectively avoids severe mode-collapse failures.

5.5 Ablation experiment

StackGAN-v2-no-JCU removes the module that jointly approximates the conditional and unconditional distributions. StackGAN-v2-G2 uses only G2, without first generating the coarser images with G0 and G1; StackGAN-v2-3G2 uses three G2s with different noise inputs to generate images; StackGAN-v2-all-G2 stacks three G2s to generate images. Experimental results show that the StackGAN-v2 structure is effective. In the ablation of color-consistency regularization (the first row without it, the second row with it), the results show that the extra constraint provided by color-consistency regularization facilitates the multi-distribution approximation and helps the generators of different branches produce more coherent samples.

VI. Takeaways

There are three innovations in this paper: (1) The original two-stage stacked structure of StackGAN is changed into a tree structure: there are multiple generators and discriminators arranged like a tree, and each generator produces samples at a different resolution. The benefit of modeling such multi-scale image distributions is that, if the distribution of generated images at any scale is close to the real image distribution at that scale, good gradient signals can be provided to stabilize or facilitate training of the entire network. (2) Conditional and unconditional loss terms are combined in the discriminator objective. (3) Color-consistency regularization is added, which keeps the samples generated from the same input by different generators as consistent as possible in color, helping guarantee the quality of the final 256×256 images.