
Deep Learning with Python

This article is part of a series of notes I wrote while studying Deep Learning with Python (2nd edition, by Francois Chollet). Starting with this post, the notes are published in Markdown rather than Jupyter Notebooks; you can still find the original .ipynb notebooks on GitHub or Gitee.

You can read the original book online (in English) at this website. The author has also made the accompanying Jupyter notebooks available.

This article is one of the notes for Chapter 8, Generative Deep Learning.

Generating images with variational autoencoders

DeepDream and neural style transfer, introduced in the previous two articles, only make limited "modifications" to existing works. GANs and VAEs are more creative: both sample from a latent space of images to create entirely new images or edit existing ones.

  • VAE: Variational AutoEncoder
  • GAN: Generative Adversarial Network

Sampling from the latent space

A latent space is a vector space in which any point can be mapped to a realistic image. The module that realizes this mapping (latent point -> image) is the generator in a GAN, or the decoder in a VAE.

The key for GANs and VAEs to generate images is to find a low-dimensional "latent space of representations". Once such a latent space is found, you can sample points from it and map them into image space, thereby generating entirely new images.

There is a big difference between the kinds of latent space that GANs and VAEs learn:

  • VAEs are good at learning well-structured latent spaces, in which specific directions encode meaningful axes of variation in the data.
  • The images generated by GANs can be very realistic, but their latent space is less structured and lacks continuity.

Concept vectors

Concept vector: given a latent space of representations, or an embedding space, certain directions in the space may encode meaningful axes of variation in the original data. For example, in a latent space of human face images there may be a vector representing the concept of "smiling" (a smile vector s): for a latent point z representing a face, z + s is a representation of the same face, smiling.

Having found some of these concept vectors, we can edit images as follows: project the image into the latent space, move its representation by adding the concept vector, then decode it back into image space; this changes a particular concept in the image, such as the degree of smiling. A rough sketch of this workflow follows.
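The sketch below is not code from the book: the encoder, decoder, and smile_vector are assumed to exist (for example, models built from encoder/decoder networks like the ones defined later in this post; a smile vector could be estimated as the mean latent difference between smiling and non-smiling faces).

import numpy as np

# Hypothetical sketch: editing an image along a concept vector.
# Assumes an `encoder` model (image -> latent vector) and a
# `decoder` model (latent vector -> image).
def edit_with_concept_vector(img, encoder, decoder, concept_vector, strength=1.0):
    z = encoder.predict(img[np.newaxis, ...])    # project the image into the latent space
    z_edited = z + strength * concept_vector     # move along the concept axis (e.g. "smile")
    return decoder.predict(z_edited)[0]          # decode back into image space

# A smile vector could be estimated as, e.g.:
#   smile_vector = z_of_smiling_faces.mean(axis=0) - z_of_neutral_faces.mean(axis=0)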

Variational autoencoder

An autoencoder is a type of network that takes an image, maps it to a latent space via an encoder module, and then decodes it, via a decoder module, back to an output with the same dimensions as the original image. The network is trained to make its output identical to its input, so the same image serves as both input and target. In other words, an autoencoder learns to reconstruct its original input.

By imposing constraints on the code (the encoder's output), the autoencoder can learn useful latent representations of the data. For example, if the code is constrained to be low-dimensional and sparse, the encoder is forced to compress the input data into fewer bits of information.
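As a minimal sketch of such a plain, under-complete autoencoder (this is illustrative, not the book's example; the layer sizes are arbitrary), compressing flattened 28x28 images into a 32-dimensional code:

from tensorflow import keras
from tensorflow.keras import layers

# A plain (non-variational) autoencoder: the encoder compresses the input
# into a low-dimensional code, the decoder reconstructs the input from it.
ae_input = keras.Input(shape=(28 * 28,))
code = layers.Dense(32, activation='relu')(ae_input)            # low-dimensional bottleneck
ae_output = layers.Dense(28 * 28, activation='sigmoid')(code)   # reconstruction

autoencoder = keras.Model(ae_input, ae_output)
autoencoder.compile(optimizer='rmsprop', loss='binary_crossentropy')

# It is trained to reproduce its own input, e.g.:
# autoencoder.fit(x_train_flat, x_train_flat, epochs=10, batch_size=128)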

A variational autoencoder (VAE) is a modern take on the autoencoder. It is a generative model, particularly well suited to image editing with concept vectors. Compared with classical autoencoders, a VAE learns a more continuous, highly structured latent space.

A VAE does not compress its input image into a fixed code in the latent space; instead, it turns the image into the parameters of a statistical distribution: a mean and a variance. The VAE then uses this mean and variance to sample a random element from the distribution and decodes that element back to the original input. So the encoding/decoding process of a VAE is partly stochastic.

This randomness improves the robustness of the VAE's latent space: the VAE must ensure that every point sampled from the latent space can be decoded into a valid output, which forces every location in the latent space to be meaningful.

A VAE works in three steps:

  1. The encoder module turns the input sample input_img into two parameters in the latent space, z_mean and z_log_variance;
  2. A random point z is sampled from the assumed latent normal distribution: z = z_mean + exp(z_log_variance) * epsilon, where epsilon is a random tensor of small values;
  3. The decoder module maps this latent point back to the original input image.

Because epsilon is random, every point in the neighborhood of the latent location z_mean encoded from input_img must decode to an image similar to input_img. This property forces the latent space to be continuously meaningful: any two nearby points in the latent space decode to highly similar images. Continuity, together with the low dimensionality of the latent space, forces every direction in the latent space to represent a meaningful axis of variation in the data, which can then be manipulated via concept vectors.

In Keras-style pseudocode, a VAE looks like this:

z_mean, z_log_variance = encoder(input_img)
z = z_mean + exp(z_log_variance) * epsilon
reconstructed_img = decoder(z)
model = Model(input_img, reconstructed_img)

Training a VAE requires two loss functions:

  • Reconstruction loss: makes the decoded samples match the original inputs;
  • Regularization loss: pushes the latent space toward a good structure (continuity, usable concept vectors) and also reduces overfitting to the training data; its exact form is given below.
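For reference, this regularization term is the Kullback–Leibler divergence between the encoder's distribution and a standard normal prior. Its closed form (a standard result, with μ = z_mean and log σ² = z_log_var) is:

$$
D_{KL}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\right) = -\frac{1}{2} \sum_{i} \left(1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2\right)
$$

This is what the kl_loss term in the custom layer further down computes, except that it averages over dimensions and scales by 5e-4 instead of 0.5, to balance it against the reconstruction loss.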

Before we start writing code, disable eager execution:

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

Let's implement the encoder network: a convolutional network that maps the input image x to two vectors, z_mean and z_log_var:

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
import numpy as np

img_shape = (28, 28, 1)
batch_size = 16
latent_dim = 2    # Dimensionality of the latent space: a 2D plane

input_img = keras.Input(shape=img_shape)
x = layers.Conv2D(32, 3, padding='same', activation='relu')(input_img)
x = layers.Conv2D(64, 3, padding='same', activation='relu', strides=(2, 2))(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)

shape_before_flattening = K.int_shape(x)

x = layers.Flatten()(x)
x = layers.Dense(32, activation='relu')(x)

z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

The following code uses z_mean and z_log_var to sample a point z from the latent space.

# Latent-space sampling function

def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim),
                              mean=0.,
                              stddev=1.)
    return z_mean + K.exp(z_log_var) * epsilon

z = layers.Lambda(sampling)([z_mean, z_log_var])    # Wrap the sampling function as a layer

Next comes the decoder: reshape the vector z back to image dimensions, then use a few convolution layers to produce the final image output.

# VAE decoder network

decoder_input = layers.Input(K.int_shape(z)[1:])
x = layers.Dense(np.prod(shape_before_flattening[1:]),
                 activation='relu')(decoder_input)
x = layers.Reshape(shape_before_flattening[1:])(x)
x = layers.Conv2DTranspose(32, 3,
                           padding='same',
                           activation='relu',
                           strides=(2, 2))(x)
x = layers.Conv2D(1, 3,
                  padding='same',
                  activation='sigmoid')(x)

decoder = Model(decoder_input, x)

z_decoded = decoder(z)

A VAE uses two losses, so we can't simply write loss(input, target). Instead, we write a custom layer and use the built-in add_loss method inside it to create the loss we need.

Custom layers for VAE losses:

class CustomVariationalLayer(keras.layers.Layer):
    def vae_loss(self, x, z_decoded):
        x = K.flatten(x)
        z_decoded = K.flatten(z_decoded)
        
        xent_loss = keras.metrics.binary_crossentropy(x, z_decoded)
        kl_loss = -5e-4 * K.mean(
            1 + z_log_var - K.square(z_mean) - K.exp(z_log_var),
            axis=-1)
        return K.mean(xent_loss + kl_loss)
    
    def call(self, inputs):
        x = inputs[0]
        z_decoded = inputs[1]
        
        loss = self.vae_loss(x, z_decoded)
        self.add_loss(loss, inputs=inputs)
        
        return x
    
y = CustomVariationalLayer()([input_img, z_decoded])

Finally, we instantiate the model and start training. Because the loss is handled inside the custom layer, we don't specify an external loss at compile time (loss=None), which means we don't pass external targets during training (y=None).

Here we train it on MNIST, learning a latent space for generating handwritten digits:

from tensorflow.keras.datasets import mnist

vae = Model(input_img, y)
vae.compile(optimizer='rmsprop', loss=None)
vae.summary()

(x_train, _), (x_test, y_test) = mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_train = x_train.reshape(x_train.shape + (1,))

x_test = x_test.astype('float32') / 255.
x_test = x_test.reshape(x_test.shape + (1,))

vae.fit(x=x_train, y=None,
        shuffle=True,
        epochs=10,
        batch_size=batch_size,
        validation_data=(x_test, None))

Once the model is trained, we can use the decoder network to turn arbitrary latent-space vectors into images:

import matplotlib.pyplot as plt
from scipy.stats import norm

n = 15    # Display a 15x15 grid of digits
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))

grid_x = norm.ppf(np.linspace(0.05, 0.95, n))    # ppf maps linearly spaced coordinates to values of the latent variable z
grid_y = norm.ppf(np.linspace(0.05, 0.95, n))

for i, yi in enumerate(grid_x):
    for j, xi in enumerate(grid_y):
        z_simple = np.array([[xi, yi]])
        z_simple = np.tile(z_simple, batch_size).reshape(batch_size, 2)
        x_decoded = decoder.predict(z_simple, batch_size=batch_size)
        digit = x_decoded[0].reshape(digit_size, digit_size)
        figure[i * digit_size: (i + 1) * digit_size,
               j * digit_size: (j + 1) * digit_size] = digit

plt.figure(figsize=(10, 10))
plt.imshow(figure, cmap='Greys_r')
plt.show()

The book's section ends here and never returns to the concept-vector applications mentioned above 😂, which is a pity.