Preface

Remember DeepFake, which went viral on Reddit in March 2018? Swapping one person's face onto another person's body in a video can look a bit rough and fuzzy, but as long as the resolution is not too high the result can pass for real.

For example, in the clip shown below, Mrs. Clinton is swapped with Mr. Trump in a speech video.

There is also the Trump and Nicolas face swap:

Of course, this is only the tip of the iceberg. DeepFake originally became popular through the concerted effort of some, let's say, enthusiastic programmers: you can paste any face you like into adult videos. There will be no demo of that here, and the relevant Reddit forums have already been banned, so I won't say much more about it.

This article is a companion tutorial to the code at github.com/Oldpan/Face… (filling a hole I dug earlier); here I give a brief explanation of the project.

Related research

In fact, deep-learning-based face swapping has been a very active topic. Some approaches are based on GANs and some on Glow, but they are all essentially generative models with different implementations. DeepFake uses an autoencoder, which has a structure similar to an ordinary neural network and is fairly robust. Studying it gives a general feel for generative networks, so we can draw on it when we meet similar networks or building blocks later.

Technical overview

Face swapping is a popular application in computer vision. It can be used for video synthesis, privacy protection, portrait replacement and other novel applications. The earliest face swapping worked by analysing the two faces separately: feature points were matched so that features such as the eyebrows and eyes were extracted from one face and then mapped onto the other. This approach needs no training and no dataset, but the results are relatively poor and it cannot modify the expression of the face itself.

With the recent development of deep learning, we can use deep neural networks to extract deep features from an input image and read out that hidden representation to carry out some novel tasks. Style transfer, for example, works by using a pre-trained model to extract the deep features of an image and then swapping its style.

There is also neural-network-based face swapping (face-swap) that uses a VGG network to extract features and exchange faces. Here we use a special autoencoder structure to swap faces, and it achieves good results.

Basic background: autoencoder

An autoencoder is a type of neural network that is trained to try to copy its input to its output. Like an ordinary neural network, an autoencoder has a hidden layer h whose code can be used to reproduce the input. Inside the autoencoder there is an encoding function h = f(x) and a decoding function r = g(h), as shown in the figure below.

Here x is the input, f is the encoding function that encodes x into the latent variable h, and g is the decoding network; reconstructing from the latent variable h yields the output r. We can also see that after the digit 2 is fed into the encoder, its hidden-layer code is a compressed representation, which the decoder then regenerates.
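In loss terms, training such an autoencoder simply means minimizing a reconstruction error between the input and its reconstruction, something like (in the notation above):

minimize over f and g:   L(x, g(f(x)))

where L can be, for instance, an L1 or L2 distance; the face-swapping network later in this article uses an L1 loss.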

Modern autoencoders can also be understood, more generally, as stochastic mappings p_encoder(h | x) and p_decoder(x | h). Autoencoders used to be applied mainly to dimensionality reduction and image denoising, but in recent years, with the development of neural networks and the study of latent variables, they have been pushed to the forefront of generative modelling and can be used for image generation and related tasks.

More about Autoencoders: Understanding Deep Learning: Networks Similar to Neural Networks – Autoencoders (PART 1)

The network architecture

So how do we realize face swapping with an autoencoder?

We already know that the encoder can learn the information in an input image, encode it, and store that code in the hidden layer, and that the decoder can use the learned hidden-layer information to regenerate the input image. But what happens if we directly feed the image sets of two different individuals into one autoencoder?

As shown above, if we simply throw the two sets of different faces into the autoencoder and pick a loss function to train with, we learn nothing useful this way, so we need to redesign the network.

How do you design it?

Since we want to swap the two faces, we can design two different decoding networks: a single encoding network learns the features common to the two faces, and two decoders generate each face separately.

As shown in the figure above, we design one input, i.e. one encoder (fed the two different faces), and two outputs, i.e. two decoders, so that the two different faces can be generated from the shared hidden layer.

The concrete framework

In this case, our specific framework is as follows:

As shown in the figure above, the autoencoder has one input and two outputs. At the input end, two sets of faces of different people (Nicolas and Trump) are fed in, and the corresponding faces are generated at the output end. Note that each decoder is responsible for generating one particular person's face. The hidden layer learns the information common to the two individuals, so each decoder can restore its own person's image from that shared information.

To be more specific, it looks like this:

As shown in the figure above, we input the image sets of two different individuals. The Encoder is shared; the common latent information it produces is passed to decoder A and decoder B, which generate the images of their respective individuals. The generated images are compared against the inputs with a criterion (an L1 loss is used here), and gradient descent is performed with the Adam optimizer.

Then, to swap faces, we feed image B (or A) through the encoder and use decoder A (or B) to regenerate it from the latent code, as shown in the following formula:
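Since the original figure with the formula is not reproduced here, the swap can be written out roughly as follows, with f the shared encoder and g_A, g_B the two decoders (the notation is mine):

x_swapA = g_A(f(x_B))   (B's pose and expression, redrawn with A's face)
x_swapB = g_B(f(x_A))   (A's pose and expression, redrawn with B's face)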

The network structure

The following is the concrete structure of the network designed above. It has one input and two outputs; the input side is made up of convolutional layers and fully connected layers, and the output side is also made up of convolutional layers. Note, however, that the input side uses downsampling convolutions while the output side uses upsampling convolutions, i.e. the image resolution is first reduced and then gradually raised again.

Take a look at the network design code in PyTorch:

import torch.nn as nn

# _ConvLayer, _UpScale, Flatten, Reshape and Conv2d are helper layers
# defined elsewhere in the repo (not shown here).
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()

        self.encoder = nn.Sequential(   # encoder network, shared by both faces
            _ConvLayer(3, 128),     # input 3 x 64 x 64, 64/2 = 32
            _ConvLayer(128, 256),   # 32/2 = 16
            _ConvLayer(256, 512),   # 16/2 = 8
            _ConvLayer(512, 1024),  # 8/2 = 4
            Flatten(),
            nn.Linear(1024 * 4 * 4, 1024),
            nn.Linear(1024, 1024 * 4 * 4),
            Reshape(),
            _UpScale(1024, 512),
        )

        self.decoder_A = nn.Sequential(     # decoder network for face A
            _UpScale(512, 256),
            _UpScale(256, 128),
            _UpScale(128, 64),
            Conv2d(64, 3, kernel_size=5, padding=1),
            nn.Sigmoid(),
        )

        self.decoder_B = nn.Sequential(     # decoder network for face B
            _UpScale(512, 256),
            _UpScale(256, 128),
            _UpScale(128, 64),
            Conv2d(64, 3, kernel_size=5, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x, select='A'):
        if select == 'A':
            out = self.encoder(x)
            out = self.decoder_A(out)
        else:
            out = self.encoder(x)
            out = self.decoder_B(out)
        return out
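Tying this back to the training procedure described earlier (an L1 loss optimized with Adam), one training step could look roughly like the sketch below. This is only an illustration: the optimizer setup, the learning rate and the variable names are mine, and the actual training script lives in the repo.

import torch
import torch.nn as nn

model = Autoencoder()
criterion = nn.L1Loss()

# one optimizer per decoder; both also update the shared encoder
optimizer_A = torch.optim.Adam(
    list(model.encoder.parameters()) + list(model.decoder_A.parameters()), lr=5e-5)
optimizer_B = torch.optim.Adam(
    list(model.encoder.parameters()) + list(model.decoder_B.parameters()), lr=5e-5)

def train_step(batch_A, batch_B):
    # decoder A learns to reconstruct person A from the shared latent code
    loss_A = criterion(model(batch_A, select='A'), batch_A)
    optimizer_A.zero_grad()
    loss_A.backward()
    optimizer_A.step()

    # decoder B does the same for person B
    loss_B = criterion(model(batch_B, select='B'), batch_B)
    optimizer_B.zero_grad()
    loss_B.backward()
    optimizer_B.step()
    return loss_A.item(), loss_B.item()

In practice the inputs can be randomly warped versions of the face crops, with the unwarped crops as the reconstruction targets, which ties in with the augmentation discussed below.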

The Autoencoder network defined in the code above can be represented by the diagram below:

These structures are not fixed, however: the convolution kernel sizes can be adjusted as needed, and we could also look for another structure to replace the fully connected layers, whose role is to scramble the spatial structure. In addition, sub-pixel convolution (the efficient sub-pixel convolutional layer technique, discussed in an earlier post on upsampling and convolution) is used for upsampling in the decoder stage; it can generate image detail quickly and well.
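For reference, here is a minimal sketch of what the downsampling and upsampling helper blocks could look like. This is my own reconstruction, under the assumption that each _ConvLayer halves the resolution and each _UpScale doubles it via sub-pixel convolution; the actual definitions are in the repo.

import torch.nn as nn

class _ConvLayer(nn.Sequential):
    # downsampling block: a strided convolution halves the resolution
    def __init__(self, in_ch, out_ch):
        super(_ConvLayer, self).__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.1, inplace=True),
        )

class _UpScale(nn.Sequential):
    # upsampling block: the convolution expands channels by 4x, then
    # PixelShuffle rearranges them into a feature map twice as large
    # (sub-pixel convolution)
    def __init__(self, in_ch, out_ch):
        super(_UpScale, self).__init__(
            nn.Conv2d(in_ch, out_ch * 4, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.PixelShuffle(2),
        )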

In short, to achieve face swapping while keeping the overall structure unchanged, the following points need to be satisfied:

As shown in the figure above, we need an encoding network similar to VGG, a fully connected layer that can scramble the spatial structure, and a sub-pixel network that can upsample the image quickly and well. In additional tests we found that, as the encoding network, the traditional VGG-style structure works best and drives the loss function lowest; more advanced networks did not do as well as the traditional VGG-style architecture, which is also why style transfer and image generation tend to use VGG-style networks.

Similarly, if we remove the fully connected layers from the encoder or replace them with an ordinary convolutional network, images can no longer be generated properly, i.e. the encoder fails to learn anything useful. We can, however, replace the fully connected layers with 1x1 convolutions, as in the following example:

nn.Conv2d(1024, 128, 1, 1, 0, bias=False),
nn.Conv2d(128, 1024, 1, 1, 0, bias=False),

This also has the property of disrupting the spatial structure. The advantage is that it runs faster than a fully connected layer, but the effect is not as good.

At this point, we have briefly covered the basic architecture and the choice of network layers.

Image preprocessing

There are, of course, many tricks in training. Because the hidden layer has to learn the characteristics common to two different individuals, there are tricks we can use to make training faster and smoother, much like the importance of normalizing data in ordinary deep learning training discussed earlier. What we need to do is bring the distributions of the training images as close together as possible:

As shown in the figure above, we shift image set A (Trump) by the difference between the per-channel (RGB) means of the two image sets, so that the distributions of the two inputs are as close as possible; the loss curve then drops more quickly.

The code is:

# shift set A so its per-channel (RGB) mean matches set B's
images_A += images_B.mean(axis=(0, 1, 2)) - images_A.mean(axis=(0, 1, 2))

Image augmentation

Then there is image augmentation, a technique that hardly needs introduction: it is practically a universal tool. We can multiply the number of training images by rotating, scaling and flipping them.
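Below is a minimal sketch of this kind of augmentation; the function and the parameter values are illustrative and are not taken from the repo.

import cv2
import numpy as np

def random_transform(image, rotation_range=10, zoom_range=0.05,
                     shift_range=0.05, flip_prob=0.4):
    # randomly rotate, scale, shift and horizontally flip a face crop
    h, w = image.shape[:2]
    angle = np.random.uniform(-rotation_range, rotation_range)
    scale = np.random.uniform(1 - zoom_range, 1 + zoom_range)
    tx = np.random.uniform(-shift_range, shift_range) * w
    ty = np.random.uniform(-shift_range, shift_range) * h
    mat = cv2.getRotationMatrix2D((w // 2, h // 2), angle, scale)
    mat[:, 2] += (tx, ty)
    out = cv2.warpAffine(image, mat, (w, h), borderMode=cv2.BORDER_REPLICATE)
    if np.random.rand() < flip_prob:
        out = out[:, ::-1]
    return out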

As shown above, for faces, warp-based augmentation can effectively lower the training loss. There is a limit, though: push the augmentation too far and it can have the opposite effect. One thing I forgot to mention: the faces we train on are cropped from the original images, like the aligned image in the upper right; we used the OpenCV library to detect and crop the faces used for training.
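As a rough illustration of what warp-based augmentation might look like, here is a simplified grid-jitter warp; it is a sketch of the general idea rather than the repo's exact implementation.

import cv2
import numpy as np

def random_warp(image, grid=5, sigma=4.0):
    # jitter a coarse coordinate grid and remap the image through it
    h, w = image.shape[:2]
    xs = np.linspace(0, w - 1, grid)
    ys = np.linspace(0, h - 1, grid)
    map_x, map_y = np.meshgrid(xs, ys)
    map_x = map_x + np.random.normal(0, sigma, (grid, grid))
    map_y = map_y + np.random.normal(0, sigma, (grid, grid))
    # interpolate the coarse maps up to full resolution
    map_x = cv2.resize(map_x, (w, h)).astype(np.float32)
    map_y = cv2.resize(map_y, (w, h)).astype(np.float32)
    return cv2.remap(image, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)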

Image post-processing

For post-processing, we of course need to paste the generated face back into the original image:

Poisson blending and mask-edge blending are used here, and the blending effect is easy to see.
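For example, Poisson blending can be done with OpenCV's seamlessClone, and mask-edge blending with a feathered alpha mask; the helpers below are a minimal sketch and the names are mine.

import cv2
import numpy as np

def poisson_paste(new_face, frame, mask, center):
    # new_face: generated face crop; frame: original image
    # mask: uint8 image, white (255) over the face region
    # center: (x, y) position of the face centre in the frame
    return cv2.seamlessClone(new_face, frame, mask, center, cv2.NORMAL_CLONE)

def mask_edge_blend(new_face, frame_crop, mask):
    # feather the mask edge and alpha-blend the generated face into the crop
    alpha = cv2.GaussianBlur(mask, (15, 15), 0).astype(np.float32) / 255.0
    if alpha.ndim == 2:
        alpha = alpha[..., None]
    return (alpha * new_face + (1.0 - alpha) * frame_crop).astype(np.uint8)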

Conclusion

Overall, this face-swapping project is small but structurally instructive: it is simple, easy to use and modify, and can produce good results. However, because it has quite a few parameters it does not run fast (we could change the encoder and decoder structures to speed up training and generation), and images with foreign objects covering the face may produce unrealistic results.

This is because the autoencoder does not learn only the parts of the image that need to be learned; it learns everything, so quite a few artefacts remain. This could be improved with an attention mechanism.

That's all for now.

This article is from the Oldpan blog; you are welcome to visit: Oldpan blog.

You are also welcome to follow the Oldpan blog public account, where more quality deep learning articles are brewing: