
Using BicycleGAN for multimodal image translation

Pix2pix and CycleGAN are very popular GANs, and academia has produced many variants and applications based on them. However, they share one drawback: the output image almost always looks the same. For example, if we performed a zebra-to-horse conversion, the converted horse photo would always have the same appearance and hue. This is an inherent property of the GAN, which learns to filter out the randomness in the noise. To make the image translation diverse, this article explains in detail how BicycleGAN solves this problem to produce richer images, and implements BicycleGAN using TensorFlow 2.

BicycleGAN results

Using BicycleGAN, a single input image can be translated into a variety of outputs with different styles and colors:

BicycleGAN architecture

Before we begin implementing BicycleGAN, a brief introduction. Hearing the name for the first time, you might think BicycleGAN is a variant of CycleGAN, but it actually has nothing to do with CycleGAN; instead, it is an improvement on Pix2pix.

Pix2pix learns a one-to-one mapping: the output for a given input is always the same. If you simply add noise to the generator input, the network learns to ignore the noise and the output image does not change. Therefore, a way is needed to force the generator not to ignore the noise but to use it to generate diverse images, i.e., a one-to-many mapping.

The following shows the models and configurations of BicycleGAN. Figure (a) is the inference configuration: image $A$ is combined with noise to produce image $\hat B$, which can be thought of as a cGAN. In BicycleGAN, the image $A$ of shape (256, 256, 3) is the condition, and the noise sampled from the latent code $z$ is a one-dimensional vector of length 8. Figure (b) is the training configuration of Pix2pix + noise. The two configurations in figures (c) and (d) are the ones BicycleGAN uses during training:

In short, BicycleGAN learns the relationship between the latent code $z$ and the target image $B$, so that the generator produces a different image $\hat B$ for each different $z$. As shown above, BicycleGAN does this by combining two models, cVAE-GAN and cLR-GAN.

cVAE-GAN

The authors of VAE-GAN argue that $L_1$ loss is not a good measure of the visual quality of images. For example, if an image is shifted a few pixels to the right, it may look no different to the human eye, yet it produces a large $L_1$ loss. Therefore, a GAN discriminator is used as a learned objective that judges whether generated images are real, and a VAE is used as the generator to produce sharper images. If the image $A$ in figure (c) above is ignored, this is VAE-GAN; with $A$ as the condition, it becomes conditional VAE-GAN (cVAE-GAN). The training steps are as follows:

  1. The VAE encodes the real image $B$ into a latent code following a multivariate Gaussian distribution, and then samples from it to create the noise input; this is the standard VAE workflow (a minimal encoder sketch follows this list).
  2. Image $A$ is used as the condition, and the noise sampled from the latent vector $z$ is used to generate the fake image $\hat B$.
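
The encoder network itself is not shown in the article; the following is a rough sketch of step 1, where the layer sizes and the simple convolutional stack are assumptions rather than the paper's exact architecture. It maps the real image $B$ to a mean and log-variance and samples $z$ with the reparameterization trick:

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_encoder(input_shape=(256, 256, 3), z_dim=8):
    # Illustrative encoder: image B -> (z, mean, logvar)
    image_B = layers.Input(shape=input_shape, name='image_B')
    x = image_B
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 4, strides=2, padding='same')(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.GlobalAveragePooling2D()(x)
    mean = layers.Dense(z_dim, name='mean')(x)
    logvar = layers.Dense(z_dim, name='logvar')(x)
    # Reparameterization trick: z = mean + sigma * epsilon, epsilon ~ N(0, I)
    z = layers.Lambda(
        lambda t: t[0] + tf.exp(0.5 * t[1]) * tf.random.normal(tf.shape(t[0]))
    )([mean, logvar])
    return Model(image_B, [z, mean, logvar], name='encoder')

The three outputs match how the encoder is called later in the article, i.e. z_encode, mean_encode, logvar_encode = self.encoder(images_B_1).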

The data flow during training is $B \rightarrow z \rightarrow \hat B$ (the solid arrows in figure (c)). The total loss function consists of three losses:

  1. $\mathcal L_{GAN}^{VAE}$: adversarial loss
  2. $\mathcal L_1^{VAE}$: $L_1$ reconstruction loss
  3. $\mathcal L_{KL}$: KL divergence loss

cLR-GAN (Conditional Latent Regressor GAN)

In cVAE-GAN, the real image $B$ is encoded to provide "real" samples of the latent vector, which are then sampled from. cLR-GAN works differently: it first uses the generator to produce a fake image $\hat B$ from random noise, then encodes the fake image $\hat B$, and finally computes the difference between the recovered latent code and the input random noise. The forward calculation steps are as follows:

  1. First, as in a cGAN, randomly generate some noise, concatenate it with image $A$, and generate the fake image $\hat B$.
  2. Then, use the same encoder as in cVAE-GAN to encode the fake image $\hat B$ into a latent vector.
  3. Finally, $\hat z$ is sampled from the encoded latent vector, and the loss is calculated between $\hat z$ and the input noise $z$.

The data flow is $z \rightarrow \hat B \rightarrow \hat z$ (the solid arrows in figure (d)). There are two losses:

  1. $\mathcal L_{GAN}$: adversarial loss
  2. $\mathcal L_1^{latent}$: $L_1$ loss between the noise $N(z)$ and the recovered latent code

By combining the two data streams, a two-way mapping cycle is created between the output and the latent space. The "Bi" in BicycleGAN comes from this bijection: a bijective mapping is, simply put, a one-to-one mapping that is invertible. In this case, BicycleGAN maps the output to the latent space and, similarly, from the latent space back to the output. The total loss is as follows:


$$loss_{Bicycle} = \mathcal L_{GAN}^{VAE} + \mathcal L_{GAN} + \lambda \mathcal L_1^{VAE} + \lambda_{latent} \mathcal L_1^{latent} + \lambda_{KL} \mathcal L_{KL}$$

In the default configuration, $\lambda = 10$, $\lambda_{latent} = 0.5$, and $\lambda_{KL} = 0.01$.

BicycleGAN implementation

There are three types of networks in BicycleGAN: the generator, the discriminator, and the encoder. Using separate discriminators for cVAE-GAN and cLR-GAN improves image quality, so we will use four networks in total: a generator, an encoder, and two discriminators.
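
The discriminator architecture is not shown in the article; the sketch below assumes a PatchGAN-style discriminator as used in pix2pix (the layer counts and kernel sizes are illustrative). It outputs one prediction per image patch, which is consistent with the (patch_size, patch_size, 1) real/fake labels used later in the training step:

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_discriminator(input_shape=(256, 256, 3), name='discriminator'):
    # PatchGAN-style discriminator: one real/fake score per image patch
    image = layers.Input(shape=input_shape)
    x = image
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 4, strides=2, padding='same')(x)
        x = layers.LeakyReLU(0.2)(x)
    # No sigmoid: the LSGAN (mean squared error) loss is applied directly
    patch_output = layers.Conv2D(1, 4, padding='same')(x)
    return Model(image, patch_output, name=name)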

Inserting the latent code into the generator

There are two ways to insert the latent code into the generator, as shown below:

  1. Concatenate it with the input image;
  2. Insert it into another layer in the generator's downsampling path.

Experiments found that the former works well. There are several ways to combine inputs and conditions of different shapes; BicycleGAN repeats the latent code several times and then concatenates it with the input image.

In BicycleGAN, the latent code has length 8, so we draw eight samples from the noise distribution and repeat each of them H×W times to form a tensor of shape (H, W, 8). In other words, the (H, W) feature map in each of the 8 channels is constant. The following code shows the tiling and concatenation of the latent code:

input_image = layers.Input(shape=image_shape, name='input_image')
input_z = layers.Input(shape=(self.z_dim,), name='z')
# Reshape the latent vector to (1, 1, z_dim) so it can be tiled spatially
z = layers.Reshape((1, 1, self.z_dim))(input_z)
# Repeat the latent code H x W times; the tile multiples leave the batch and channel dimensions unchanged
z_tiles = tf.tile(z, [1, self.input_shape[0], self.input_shape[1], 1])
x = layers.Concatenate()([input_image, z_tiles])

The next step is to create the two models, cVAE-GAN and cLR-GAN, which combine these networks and define the forward information flow.

cVAE-GAN

The following code creates the cVAE-GAN model and implements its forward pass:

images_A_1 = layers.Input(shape=input_shape, name='ImageA_1')
images_B_1 = layers.Input(shape=input_shape, name='ImageB_1')
# Encode the real image B into a sampled latent code, its mean and log-variance
z_encode, self.mean_encode, self.logvar_encode = self.encoder(images_B_1)
# Generate a fake B conditioned on image A and the encoded latent code
fake_B_encode = self.generator([images_A_1, z_encode])
encode_fake = self.discriminator_1(fake_B_encode)
encode_real = self.discriminator_1(images_B_1)
# KL divergence between the encoded Gaussian and the standard normal prior
kl_loss = -0.5 * tf.reduce_sum(1 + self.logvar_encode - tf.square(self.mean_encode) - tf.exp(self.logvar_encode))
self.cvae_gan = Model(inputs=[images_A_1, images_B_1], outputs=[encode_real, encode_fake, fake_B_encode, kl_loss])

We use the KL divergence loss in our model. Computing kl_loss inside the model is simpler and more efficient, because it can be calculated directly from the mean and log-variance without passing an external label into the training step.
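
Concretely, kl_loss above is the closed-form KL divergence between the encoded Gaussian $N(\mu, \sigma^2)$ and the standard normal prior $N(0, I)$, computed from mean_encode ($\mu$) and logvar_encode ($\log \sigma^2$):

$$\mathcal L_{KL} = -\frac{1}{2} \sum_{i} \left(1 + \log\sigma_i^2 - \mu_i^2 - \sigma_i^2\right)$$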

cLR-GAN

The following code creates the cLR-GAN model and implements its forward pass:

images_A_2 = layers.Input(shape=input_shape, name='ImageA_2')
images_B_2 = layers.Input(shape=input_shape, name='ImageB_2')
z_random = layers.Input(shape=(self.z_dim,), name='z')
# Generate a fake B from image A and randomly sampled noise
fake_B_random = self.generator([images_A_2, z_random])
# Encode the fake B back into a latent code; only the mean is used for the latent loss
_, mean_random, _ = self.encoder(fake_B_random)
random_fake = self.discriminator_2(fake_B_random)
random_real = self.discriminator_2(images_B_2)
self.clr_gan = Model(inputs=[images_A_2, images_B_2, z_random],
                     outputs=[random_real, random_fake, mean_random])

Now that we have defined the model, the next step is to implement the training steps.

The training steps

The two models are trained together but with different image pairs. So, in each training step, we fetch data twice, once for each model; this is done by creating a data pipeline that is called twice to load the data (a minimal sketch of such a pipeline follows the snippet below):

images_A_1, images_B_1 = next(data_generator)
images_A_2, images_B_2 = next(data_generator)
self.train_step(images_A_1, images_B_1, images_A_2, images_B_2)
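
The data_generator itself is not shown in the article. The sketch below assumes paired images stored side by side in a single file (sketch on the left, photo on the right, as in the pix2pix edges2shoes data); the file pattern and preprocessing are illustrative assumptions:

import tensorflow as tf

def make_data_generator(file_pattern, batch_size, image_size=256):
    # Each file holds the sketch (A) and the photo (B) side by side
    def load_pair(path):
        image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        image = tf.image.resize(image, (image_size, image_size * 2))
        image = tf.cast(image, tf.float32) / 127.5 - 1.0  # scale to [-1, 1]
        image_A = image[:, :image_size, :]   # left half: sketch
        image_B = image[:, image_size:, :]   # right half: photo
        return image_A, image_B

    dataset = (tf.data.Dataset.list_files(file_pattern)
               .map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
               .shuffle(1000)
               .batch(batch_size, drop_remainder=True)
               .repeat()
               .prefetch(tf.data.AUTOTUNE))
    # iter() lets the training loop call next(data_generator) as shown above
    return iter(dataset)

# Illustrative path and batch size
data_generator = make_data_generator('edges2shoes/train/*.jpg', batch_size=1)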

We can perform training in two different ways. One is to define and compile a Keras model with an optimizer and loss functions, and then call train_on_batch() to perform a training step; this works well for a well-defined model. Alternatively, we can use tf.GradientTape for finer control over the gradient updates. BicycleGAN has two models that share a generator and an encoder, but their parameters need to be updated with different combinations of loss functions, which makes train_on_batch infeasible without modifying its original behavior. Therefore, we use tf.GradientTape to combine the generator and discriminator training of the two models into a single training step, as follows:

  1. The first step is to perform the forward pass and collect the outputs of both models:
    def train_step(self, images_A_1, images_B_1, images_A_2, images_B_2):
        z = tf.random.normal((self.batch_size, self.z_dim))
        real_labels = tf.ones((self.batch_size, self.patch_size, self.patch_size, 1))
        fake_labels = tf.zeros((self.batch_size, self.patch_size, self.patch_size, 1))

        with tf.GradientTape() as tape_e, tf.GradientTape() as tape_g, tf.GradientTape() as tape_d1, tf.GradientTape() as tape_d2:
            encode_real, encode_fake, fake_B_encode, kl_loss = self.cvae_gan([images_A_1, images_B_1])
            random_real, random_fake, mean_random = self.clr_gan([images_A_2, images_B_2, z])
  2. Next, we update the discriminators by backpropagating the gradients:
            # discriminator loss
            self.d1_loss = self.mse(real_labels, encode_real) + self.mse(fake_labels, encode_fake)
            gradients_d1 = tape_d1.gradient(self.d1_loss, self.discriminator_1.trainable_variables)
            self.optimizer_d1.apply_gradients(zip(gradients_d1, self.discriminator_1.trainable_variables))

            self.d2_loss = self.mse(real_labels, random_real) + self.mse(fake_labels, random_fake)
            gradients_d2 = tape_d2.gradient(self.d2_loss,self.discriminator_2.trainable_variables)
            self.optimizer_d2.apply_gradients(zip(gradients_d2, self.discriminator_2.trainable_variables))
  3. We then calculate the losses based on the model outputs. Similar to CycleGAN, BicycleGAN also uses the LSGAN loss function, i.e., the mean squared error:
            self.LAMBDA_IMAGE = 10
            self.LAMBDA_LATENT = 0.5
            self.LAMBDA_KL = 0.01
            # Generator and Encoder loss
            self.gan_1_loss = self.mse(real_labels, encode_fake)
            self.gan_2_loss = self.mse(real_labels, random_fake)
            self.image_loss = self.LAMBDA_IMAGE * self.mae(images_B_1, fake_B_encode)
            self.kl_loss = self.LAMBDA_KL * kl_loss
            self.latent_loss = self.LAMBDA_LATENT * self.mae(z, mean_random)
  4. Finally, we update the weights of the generator and the encoder. The $L_1$ latent-code loss is used only to update the generator, not the encoder, because optimizing the encoder for this loss as well would encourage it to hide information about the latent code instead of learning meaningful patterns from it. Therefore, the losses are calculated separately for the generator and the encoder, and their weights are updated accordingly (the optimizers used here are sketched after this step):
            encoder_loss = self.gan_1_loss + self.gan_2_loss + self.image_loss + self.kl_loss
            generator_loss = encoder_loss + self.latent_loss

            gradients_generator = tape_g.gradient(generator_loss, self.generator.trainable_variables)
            self.optimizer_generator.apply_gradients(zip(gradients_generator, self.generator.trainable_variables))
            
            gradients_encoder = tape_e.gradient(encoder_loss, self.encoder.trainable_variables)
            self.optimizer_encoder.apply_gradients(zip(gradients_encoder, self.encoder.trainable_variables))
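
The optimizers and loss objects referenced above (self.optimizer_d1, self.optimizer_d2, self.optimizer_generator, self.optimizer_encoder, self.mse, self.mae) are not defined in these snippets; a plausible setup in the model's __init__, assuming the Adam settings commonly used for pix2pix-style GANs (an assumption, not confirmed by the article), is:

import tensorflow as tf

# Inside the BicycleGAN class's __init__ (learning rate and beta_1 are assumptions)
self.optimizer_d1 = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
self.optimizer_d2 = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
self.optimizer_generator = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
self.optimizer_encoder = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
# LSGAN adversarial loss (MSE) and L1 losses for reconstruction / latent code
self.mse = tf.keras.losses.MeanSquaredError()
self.mae = tf.keras.losses.MeanAbsoluteError()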

Training results and analysis

To train BicycleGAN, you can choose between two datasets: architectural sketches or shoe line sketches. The shoe images are simpler and therefore easier to train on, so we use that dataset as the example. The following image shows an example of BicycleGAN training results: the first image is the line sketch, the second is the real photo corresponding to the sketch, and the four images on the right are generated:

As you can see, the main difference between the images generated from the same sketch is color. They all capture the structure of the shoe almost perfectly but fall short in capturing fine details, which could be improved with longer training and hyperparameter tuning.
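
To generate a set of diverse outputs like the four on the right at inference time, one can keep the same sketch and sample a new latent vector for each output. A minimal sketch (the function and variable names are illustrative):

import tensorflow as tf

def generate_diverse_outputs(generator, image_A, z_dim=8, num_samples=4):
    # image_A has shape (1, 256, 256, 3); each random z yields a different B_hat
    outputs = []
    for _ in range(num_samples):
        z = tf.random.normal((1, z_dim))
        fake_B = generator([image_A, z], training=False)
        outputs.append(fake_B[0])
    return outputs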