At GDG’s DevFest 2018, Zhong Zhong, Agora’s chief scientist, gave the talk “Deep Learning for Mobile Platforms: Complexity and Performance Analysis.” The following is a transcript of the speech, compiled and published by GDG.

You are welcome to join the RTC developer community to exchange experience with more developers working on real-time audio, video, and codec technology.




Opening

Recently I saw a very funny picture in a WeChat group. Everyone has seen the phone screen protector stalls near subway stations and on street corners, right? But the sign in this picture doesn’t say “phone screen protectors”; it says “model tuning”. The example may not be entirely apt, but AI and learning algorithms are indeed becoming a survival skill, which shows how popular artificial intelligence is.

Starting with the applications of AI

So, getting back to today’s topic: what do lunch and deep learning have in common? Lunch boxes and black boxes; they’re both boxes. A lunch box holds a salad, a main course, and fruit to finish the meal. A black box cannot be opened or seen into.

Today’s talk is an attempt to peek a little way inside that black box. As engineers, we all want to know not just what works, but why. I’m going to touch on a few aspects of deep learning based on the work we’ve done.

Agora mainly provides real-time audio and video communication services: transmission, encoding and decoding, pre- and post-processing, and so on. We focus on the field of real-time communication and interaction.

There are many AI applications in social and entertainment apps right now. Face beautification, stickers, recognition of interactive gestures, face swapping, and voice changing are all forms of style transfer, and AI algorithms work well in these areas.

Beyond these, AI also has many applications on the post-processing side. For example, it can sharpen the details of blurry images, presenting them more clearly and improving the viewing experience; or, when packet loss on the network distorts the received data, AI algorithms can compensate for the distortion.

AI has even more applications in the cloud, such as content moderation (identifying pornographic or violent images), speech-to-text, and affective computing, many of which have great uses in real-time communications.

Super resolution to restore blurred images

Let me introduce deep learning algorithms and their applications using the recovery of blurred images as an example.

We all know that super resolution (SR) helps recover detail. In our scenario, network bandwidth is limited and packet loss occurs, so video is compressed and transmitted at a low bit rate, and the decoded image is usually a little blurry, hurting the viewing experience. In live streaming apps especially, users want to see a clear face and hear a clear voice.

Super resolution is our post-processing step; since it is independent of the stages before it, it is placed last. The video source is encoded and transmitted over the network; the receiver decodes it into a blurry image, whose details are then enhanced, or which is upscaled, by super resolution before display.
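As a minimal sketch of where this sits in the receive path (the decoder, model, and display names here are hypothetical placeholders, not Agora APIs):

```python
def receive_loop(packets, decoder, sr_model, display) -> None:
    """Decode each packet, then apply super resolution as the final stage."""
    for packet in packets:
        low_res = decoder.decode(packet)   # blurry frame after low-bitrate decode
        high_res = sr_model(low_res)       # SR enhances detail / upscales the frame
        display(high_res)                  # SR is last and independent of earlier stages
```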

Deep neural networks have proven to be good at generating image detail. The GAN is a very effective model and is the basic model behind our super resolution algorithm; the performance and complexity analysis that follows is mainly based on GANs.

GAN model


Here is the basic idea of a GAN. It usually consists of two networks, a generator and a discriminator, which train against each other in an adversarial game, eventually reaching an equilibrium in which the generator produces convincingly realistic fake data. For example:

  • When the discriminator receives real image data, it should accept it as real data.

  • When the generator’s input is low-resolution data, we want it to output high-resolution data, and we want that output to look real. The discriminator’s job is the opposite: it tries not to let generated data slip through; it tries to reject it as fake.

Each time its output is caught, the generator trains and adjusts to make the data more realistic; the discriminator also trains and improves, getting better and better at detecting fake data. When the two converge, the discriminator can no longer tell whether the generator’s data is real or fake, and the result is accepted.
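To make the adversarial game concrete, here is a minimal GAN training loop in PyTorch. The tiny MLPs and the 2-D “real” distribution are toy stand-ins of my own, not the actual super resolution networks from the talk.

```python
import torch
import torch.nn as nn

latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))  # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))           # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) + 3.0           # toy "real" data: a shifted Gaussian
    fake = G(torch.randn(64, latent_dim))     # generated data

    # Discriminator update: accept real data, reject generated data.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator accept its output.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

The `detach()` in the discriminator step keeps its update from flowing gradients into the generator; the two networks then improve in alternation until, ideally, they reach the equilibrium described above.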

More than two-thirds of the articles at top AI conferences in recent years are likely to be GAN related.

How to design small and optimal deep learning models?

As we all know, the best deep learning results come from relatively large models trained on large platforms, such as servers where thousands of GPUs process in parallel, using very large training datasets.

However, there are many applications on mobile devices, and the challenge we face in mobile social networking, live streaming, and communication is to design a small model that meets three conditions:

  • The model can run in real time on mobile devices without consuming too much power, heating up the device, and so on.

  • Its results must be good enough; a model that is small but ineffective is meaningless.

  • Training must achieve good results with a reasonable amount of data; collecting millions or tens of millions of samples is often unrealistic because data collection is expensive.

Next we do a complexity analysis; our goal is to shrink the model. Consider some typical models, the classical deep neural networks for image analysis and recognition such as VGG: they are very large, and the number of parameters, that is, the number of weights, is an important measure of a model’s complexity.

The VGG16 model has more than 100 million parameters. A lot of work has gone into pruning, compressing, and retraining such models to make them run on mobile platforms, as well as more elaborate techniques such as reinforcement learning for finding smaller models. These methods all have potential problems: the resulting structure may not be simple enough, the computation may not be small enough, or it may not parallelize easily.
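As a sketch of one technique just mentioned, here is magnitude-based weight pruning. The per-layer threshold rule is an assumption for illustration, not the talk’s recipe; in practice the pruned model is then retrained with the zeroed weights held at zero.

```python
import torch

def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.9) -> None:
    """Zero out the `sparsity` fraction of smallest-magnitude weights per layer."""
    for module in model.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            w = module.weight.data
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
            w.mul_(w.abs() > threshold)                       # keep only larger weights
```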

Google has also done good work here. Its latest result, MobileNet V2, has 3.4 million parameters, less than 3% of VGG16’s count: nearly two orders of magnitude smaller. But for us a 3.4-million-parameter model is still very large, and a software implementation on mobile devices is still not ideal. Of course, our task is a little different: we do image super resolution, while the models above do object recognition.
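The parameter counts quoted here are easy to check against torchvision’s reference implementations (exact totals vary slightly between implementations):

```python
import torchvision.models as models

def count_params(model) -> int:
    return sum(p.numel() for p in model.parameters())

print(count_params(models.vgg16()))         # roughly 138 million
print(count_params(models.mobilenet_v2()))  # roughly 3.5 million
```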

CNNs based on ReLU

Let’s take a look at CNNs based on ReLU. Such a network is actually a piecewise linear function. This is easy to understand; in particular, when the stride is 1, the network remains a piecewise linear mapping.
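A quick numerical way to see this: evaluate a tiny, randomly initialized ReLU network on a dense 1-D grid and inspect finite-difference slopes. The slopes stay constant within each linear region and jump at the “kinks”. This toy example is mine, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 1)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((1, 8)), rng.standard_normal(1)

def f(x: float) -> float:
    """A 1-in, 1-out two-layer ReLU network."""
    return (W2 @ np.maximum(W1[:, 0] * x + b1, 0.0) + b2)[0]

xs = np.linspace(-3, 3, 601)
ys = np.array([f(x) for x in xs])
slopes = np.diff(ys) / np.diff(xs)
print(np.round(slopes[:20], 4))  # long runs of identical slopes = linear pieces
```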

To better understand what follows, let me introduce another concept: manifolds. An example makes it easy to grasp. A face image may be 1000×1000, a million pixels, but truly representing a face does not require a million numbers; one or two hundred parameters are enough. The face can in fact be represented in a lower-dimensional space. Mapping from the ambient space to that parameter space, also called the latent space, is an encoding process, a dimensionality reduction.

Conversely, going from the lower-dimensional space back to the higher-dimensional space is a decoding process: a generator. In general, encoding compresses data into a lower-dimensional parameter space called the latent space. There is a mapping from the higher-dimensional manifold to the lower-dimensional space; when both the forward and inverse mappings are continuous, it is a homeomorphism. We want to do our work in the latent space, the low-dimensional parameter space.
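In symbols (my notation, not the speaker’s): for a face image the ambient dimension is about a million, while the latent dimension is on the order of a hundred or two.

```latex
% Encoder/decoder between the data manifold and the latent space.
\[
E : M \subset \mathbb{R}^{n} \to Z \subset \mathbb{R}^{d}, \qquad
G : Z \to \mathbb{R}^{n}, \qquad d \ll n \quad (n \approx 10^{6},\; d \approx 10^{2}),
\]
\[
G(E(x)) \approx x \quad \forall x \in M, \qquad
\text{and if } E \text{ and } G = E^{-1} \text{ are both continuous, } E \text{ is a homeomorphism.}
\]
```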

As I mentioned earlier, such a deep network is a piecewise linear mapping: a piecewise linear approximation of the manifold. A simple model might approximate it with two segments, while a complex model approximates it with four and fits more closely. A more complex deep network can indeed produce better results, with higher approximation accuracy and, of course, higher complexity.

In addition, different approximation methods achieve different results: different weights give different mappings, which correspond to different approximation quality. Our training process searches for an optimal approximation, or at least a locally optimal one, making the result optimal in a certain sense. Accuracy is measured by the quality of this approximation.

To report our results: we ended up with only about 10,000 parameters, more than two orders of magnitude smaller than Google’s MobileNet V2 model for mobile devices; the Agora model is less than 1% of its size. When a model becomes this small, a problem usually appears: the latent issues of the GAN itself become more prominent. Mode collapse is one of these problems.

Mode collapse

What is the problem with mode collapse? Generators have difficulty learning multimodal distributions. Take the example of eight Gaussian distributions on a ring, a standard toy example sketched in code below. The generator is supposed to learn this distribution, but with a simple model, the training process converges to only one of the Gaussians. In a practical application such as generating digits, we expect it to produce the digits 0 through 9, like the first row in the image; instead it easily produces only one of them, like the second row, for example always generating a 1 or a vague shape, because a 1 slips past the discriminator easily. It is doing the right thing locally, but it cannot generate the other digits, so it is not very useful.
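For reference, the eight-Gaussians ring is easy to construct; the radius and standard deviation below are my own choices.

```python
import numpy as np

def eight_gaussians(n_samples: int, radius: float = 2.0, std: float = 0.05) -> np.ndarray:
    """Sample from 8 Gaussians evenly spaced on a ring (a mode-collapse benchmark)."""
    rng = np.random.default_rng(0)
    angles = rng.integers(0, 8, n_samples) * (2 * np.pi / 8)  # pick one of 8 modes
    centers = np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)
    return centers + std * rng.standard_normal((n_samples, 2))

data = eight_gaussians(1024)  # a healthy generator should cover all eight blobs
```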

How can this problem be solved or mitigated? We did a number of things. In a nutshell, we added constraints: constraints on local regions, constraints on the tangent space, and optimization in the latent space. We cannot cover all of them here, so I will just say a little about optimizing the latent space.

Optimizing the latent space

I mentioned earlier that a DNN maps a manifold into a latent, or parameter, space; an image is usually encoded into this lower-dimensional latent space. Here is an intuitive explanation. We do the reconstruction directly in the latent space: first sample some points uniformly in the latent space, then feed those points into the generator, which reconstructs image points; we overlay them on the original image. In some regions the reconstructed points lie very close to the original, but in others, such as the head of the face, the points are very sparse, which means the head is poorly recovered. The generator has collapsed into a local optimum, and this reconstruction struggles to produce good results. Of course, we could sample more densely and eventually cover the head, but that would be expensive.

We can instead optimize the latent space first, then sample it uniformly and feed the samples into the generator. With the same number of samples, the reconstructed points now cover the reconstructed image uniformly; the resulting point cloud is uniform.
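The coverage check can be sketched as follows: sample a uniform grid in the latent space, decode it, and histogram the decoded points; empty cells flag regions the generator misses. The random MLP here is a toy stand-in for a trained generator.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((64, 2)), rng.standard_normal((2, 64))

def generator(z: np.ndarray) -> np.ndarray:
    """Toy decoder: 2-D latent points -> 2-D output points."""
    return np.maximum(z @ W1.T, 0.0) @ W2.T

# Uniform grid over the (optimized) latent space.
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 30),
                            np.linspace(-1, 1, 30)), axis=-1).reshape(-1, 2)
points = generator(grid)

# Crude coverage check: count decoded points per cell of a coarse histogram.
hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=8)
print(hist.astype(int))  # empty cells = regions the generator fails to cover
```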

An analogy: take a flat sheet of paper and fold it many times in irregular ways. You are asked to pick a certain number of sample points from the folded wad such that when the paper is unfolded and flattened again, the points are uniformly distributed. That is hard; you cannot do it unless you brute-force enough points, which would be very expensive and would defeat our goal of controlling both complexity and quality. We applied a similar latent-space optimization to our model training. Because the final model has very few parameters, power consumption on an iPhone 7 is very low and the phone does not get hot. So we can take, say, a 360p video and produce a 720p, HD video.

In the future, we also want to understand deep networks better mathematically, so that we can describe specific problems in a mathematical way and further improve image clarity. That is what we plan to do next. Thank you.

Finally, for developers who want to build real-time audio and video apps or learn WebRTC, we recommend some blog posts and other resources.