This article summarizes the speech given by Zhong, chief scientist of Sonnet, at the RTC 2018 Real-Time Internet Conference. The following is the content of the speech. The PPT and video of the speech are available at the end of the article.

About the author: Zhong, chief scientist of Sonnet, received his B.S. and Ph.D. degrees in mathematics from Peking University and was a postdoctoral fellow at the Center for Automation Research, University of Maryland, USA. He holds about 100 invention patents. He has been a major member of MPEG/JVT (H.264) and INCITS, is a member of IEEE, and has published more than 30 academic papers covering pattern recognition, video codecs, computer vision, and related fields. He has served as senior chief scientist and technical director at Broadcom, Vice President of Technology at Huaya Microelectronics, and general manager of Hisense Group's chip company.

You are welcome to join the RTC developer community and share your experience with more RTC developers.



As we all know, deep learning has many applications in end-to-end real-time video communication systems. For example, we use it for super-resolution and for image restoration, in both cases with better results than traditional methods. As for the challenges: in mobile applications we must respect complexity constraints — a small model that can run in real time on mobile platforms, with reasonable limits on power consumption and CPU usage. In addition, it should learn well from a reasonably sized data set, so that its generalization ability is strong.

To show the results first: traditional algorithms usually give blurry output, but with deep learning we can recover more details and even generate some details.

In terms of computation, we can currently enlarge 480×360 to 960×720 at 120 fps on the iPhone 6 GPU, so the complexity can be effectively controlled.

We use a generative adversarial network to do the super-resolution. Generative adversarial networks have been quite hot in the past two or three years; at academic conferences on machine learning algorithms, more than two thirds of the papers relate to them. A generative adversarial network usually consists of a generator and a discriminator. The generator tries to simulate real data as closely as possible, to fool the discriminator into believing that the generated data is real and conforms to the real data distribution. The discriminator's job is the opposite: it tries to make the generated data fail its test — and the higher its standard, the higher the probability of failure. So the generator and the discriminator evolve against each other, to the point where the discriminator can no longer tell real from fake.

The generator starts from a random distribution — a noise vector z — and maps it through the network to an image that looks real. The graph below shows how the generator's distribution gradually approximates the real data: green is the distribution produced by the model, and the black dotted line is the real data distribution that it gradually reaches through the adversarial conflict. From z, the random variable just mentioned, the generator produces the result we want; in terms of the formula, what the generator is actually doing is maximizing the probability that the discriminator makes a mistake — that it cannot tell true from false, cannot tell that something is fake.

The discriminator does the opposite: it maximizes the probability that real data is judged real, while minimizing the probability that generated data is judged real — this is the conflict I mentioned, and it can be expressed as a formula. The optimal discriminator has a closed-form mathematical solution, and training reaches a Nash equilibrium. Putting the two together, the generator and discriminator jointly optimize a max/min value function.
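The value function can be sketched in code (a minimal illustration of the standard GAN objective, not code from the talk; `d_real` and `d_fake` are assumed to be the discriminator's outputs on real and generated batches):

```python
import numpy as np

def value_function(d_real, d_fake):
    """GAN value V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].
    The discriminator maximizes V; the generator minimizes it."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A sharp discriminator (real ~1, fake ~0) scores higher than a fooled one
# whose outputs hover around 0.5:
sharp = value_function(np.array([0.99, 0.98]), np.array([0.02, 0.01]))
fooled = value_function(np.array([0.55, 0.60]), np.array([0.50, 0.45]))
```

At the Nash equilibrium the discriminator outputs 0.5 everywhere, which is exactly the "cannot tell true from false" state described above.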

What can go wrong? To pass the discriminator's test, the generator may simply pick whatever is easiest to generate. After training, for example, it may output the digit 1 with high probability, because a 1 is just a vertical stroke and easy to draw well. The generator spends its cleverness on learning the easiest samples — the ones most easily judged as real — instead of covering the whole distribution, which is not what we want.

In other words, if the real distribution is a uniform ring of clusters, the generator may converge to one spot and stay there, because samples from that spot always pass the discriminator; the network eventually converges to such a state. Generators have difficulty producing this kind of multimodal, multi-cluster distribution — a phenomenon we call mode collapse.

So what are the challenges? The first, which I will discuss in a moment, is how to mitigate mode collapse — how to keep the generator out of a state where it merely outsmarts the discriminator. The second is how to measure the capacity of a convolutional neural network — how well it can perform and learn. In other words, given a deep learning task, how small can a deep convolutional neural network be and still achieve good results?


To reduce the probability of mode collapse, a local constraint is usually imposed first: the generator must not only fool the discriminator, but also keep the output for its noisy input close to the corresponding real sample. This is like adding a term to the loss function that pulls the output toward the target — a supervised-learning component.
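This combined loss can be sketched as follows (a hypothetical illustration, not the talk's exact formulation; the L1 reconstruction term and the weight `lam` are assumptions):

```python
import numpy as np

def generator_loss(d_fake, fake, real, lam=10.0):
    """Adversarial term (fool the discriminator) plus a supervised
    reconstruction term keeping the output near the paired real sample."""
    adv = -np.mean(np.log(d_fake + 1e-12))   # low when D is fooled
    recon = np.mean(np.abs(fake - real))     # local constraint (L1 distance)
    return adv + lam * recon
```

With `lam = 0` this reduces to the plain generator loss; the extra term is what anchors the output to the real sample.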

Another way to look at it: what a deep neural network learns is in fact a manifold — a topological space mapped by a homeomorphism into an N-dimensional real space, where homeomorphism means that both the forward and inverse mappings are continuous. To illustrate the concept: a surface in three-dimensional space is a two-dimensional manifold. From a coding point of view, it corresponds to a latent space of two dimensions. The forward mapping is dimensionality reduction — the encoding process (in classification problems, we try to separate classes well in this latent space). Conversely, going from the latent space back to the manifold is the generator — the decoding process, taking the stripped-down data back to what we want it to look like.

This surface sits in three-dimensional space, which we call the ambient space. Consider a Wasserstein generative adversarial network with many layers — up to ten. All it had to do was learn a mixture of two Gaussian distributions, one centered at the origin and one at (40, 40). It turns out that even a ten-layer deep network cannot learn it; the orange dots show the final state at convergence. When the data distribution is a mixture with multiple clusters or multiple peaks, such manifolds are challenging for generative adversarial networks.

What is a convolutional neural network? Consider a convolutional neural network based on the rectified linear unit (ReLU), which can be regarded as a piecewise linear mapping. The commonly used activation functions are in fact piecewise linear — whether or not they have parameters, and even when they are randomized, they are all piecewise linear mappings.

These piecewise linear mappings divide the manifold into many subspaces — many small cells. After passing through the encoder, the manifold becomes a union of many small polyhedra, on each of which the mapping is linear.
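To make this concrete, here is a small sketch (illustrative, with random weights — not a network from the talk): on the cell where the ReLU activation pattern is fixed, a one-hidden-layer ReLU network coincides exactly with a single affine map.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(6, 2)), rng.normal(size=6)
W2 = rng.normal(size=(1, 6))

def net(x):
    # tiny ReLU network: R^2 -> R^1, piecewise linear overall
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

x0 = np.array([0.5, -0.3])
m = (W1 @ x0 + b1) > 0          # activation pattern: which units are "on"
A = W2 @ (W1 * m[:, None])      # affine map of the cell containing x0
c = W2 @ (b1 * m)
```

Inside that cell, `net(x)` equals `A @ x + c` exactly; each distinct activation pattern carves out one of the small polyhedra mentioned above.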

How does mode collapse arise in this picture? When the encoder E maps the manifold M to the latent space, the distribution of E(M) tends to be extremely uneven, and classifying or controlling sampling under such a singular distribution is very difficult. The question is: can we introduce another latent space Z′ and a transport map T into the original latent space, so that the composition with the generator G, G∘T, works from a better, more uniform distribution on Z′ — one that is easier to classify and whose sampling points are easier to control? Professor Yau and collaborators have done analytical work here: using the optimal mass transport map, they can map the small cells I just mentioned back to better positions.

If you skip the optimal mass transport map and apply the decoder directly, problems appear. We sample uniformly in the coding domain (uniform distributions are the ones we can control best; non-uniform ones are hard to handle well) and overlay the sampling points on the figure. If these points are reconstructed directly with the generator (the decoder), you can see on the original figure that the head region is very sparse. This sparsity means that decoding the uniform sampling points from the latent space fails to recover the head region uniformly — which is itself a kind of mode collapse.

If instead you add the optimal mass transport map, sample uniformly in the Z′ space, and then reconstruct — composing the transport map with the generator — the result is fairly uniform. You can see the density is much better, so the optimal mass transport map makes things very easy to control in a uniformly distributed latent space.

Professor Yau and collaborators found that the decoder and encoder can be linked mathematically by a closed formula. Simply put, as long as one of them is available, the other can be derived — this is mathematically guaranteed. For deep learning this means that as long as one of them is trained well, the other can be recovered through geometric computation; there is no need to train it, which removes worries about data. In practice, however, deriving the optimal mass transport map in a high-dimensional space is difficult, and not easy with limited computing resources. So it does not completely change how we build deep neural networks.

There is a further point here: this optimal mass transport map can itself be learned with a deep neural network. The second natural question is: do we have to learn it separately, or can we learn the composed mapping all at once? This is obviously a very practical question — taking two models and combining them into one.

Let's look at mode collapse from a different perspective, one that may be more intuitive. Consider a two-dimensional surface in three-dimensional space, with a tangent space at each point. For a well-behaved manifold, the tangent space is a two-dimensional plane. When that plane degenerates into a line, or even a zero-dimensional point, mode collapse must occur: once it degenerates to a line, no matter how much you move along the lost normal direction, nothing changes — that is mode collapse, and all the more so when it is reduced to zero dimensions.

We can therefore add another penalty to the loss function — the deviation of the Jacobian's Gram matrix from the identity matrix. This pushes the tangent space to stay full rank rather than degenerating to one or zero dimensions, which also effectively reduces the occurrence of mode collapse. It is another view of the same problem.
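A sketch of such a penalty (a common form, assumed here — the talk does not spell out the exact expression): the Frobenius deviation of JᵀJ from the identity, which vanishes when the Jacobian's columns are orthonormal, i.e. the tangent space keeps full rank.

```python
import numpy as np

def rank_penalty(J):
    """|| J^T J - I ||_F^2 for a Jacobian J mapping latent to data space.
    Zero iff J has orthonormal columns (non-degenerate tangent space)."""
    k = J.shape[1]
    return np.sum((J.T @ J - np.eye(k)) ** 2)

full_rank = np.eye(3)[:, :2]   # healthy 2-D tangent space
collapsed = np.ones((3, 2))    # both tangent directions fall on one line
```

Adding `rank_penalty` of the generator's Jacobian to the loss penalizes exactly the degeneration to a line or a point described above.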

Next question: given a convolutional neural network based on the piecewise linear rectified activation function (ReLU), how strong is its learning ability? In other words, given a task, how small can we design the network and still complete the task? We want to bound its complexity rather than explore it completely open-ended; that gives us some guidelines for deploying deep learning algorithms on mobile devices.

Just now I mentioned that the encoder and decoder are piecewise linear functions. The decoder divides space into small cells, and the more cells there are, the more gaps can be filled; the quality of this approximation determines the final performance of the encoder and decoder. This is easy to understand: approximating a curve with four line segments is better than with one, and with more segments better still — subject, of course, to some conditions on the original curve, such as convexity.

The complexity of a piecewise linear mapping is a measure of its approximation ability. It is defined as the largest number of connected subsets of the N-dimensional space such that the mapping is linear on every connected subset — piecewise linear overall. This characterizes the capability of the decoder. A (K+2)-layer deep convolutional neural network is then characterized by the most complex piecewise linear mapping it can represent.
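As a toy illustration of counting these linear pieces (a sketch under simplifying assumptions — scalar input, one hidden layer, random weights — not the talk's general definition): each ReLU unit contributes at most one kink, so an n-unit network yields at most n + 1 linear pieces.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 8
w1, b1 = rng.normal(size=n_units), rng.normal(size=n_units)
w2 = rng.normal(size=n_units)

def net(x):
    # scalar-input, one-hidden-layer ReLU network: piecewise linear in x
    return np.maximum(np.outer(x, w1) + b1, 0.0) @ w2

def count_linear_pieces(a=-10.0, b=10.0, n=100001):
    """Count linear pieces of `net` on [a, b]: the affine piece is fixed
    by the ReLU activation pattern, so count pattern changes along x."""
    xs = np.linspace(a, b, n)
    pattern = (np.outer(xs, w1) + b1) > 0
    changes = np.any(pattern[1:] != pattern[:-1], axis=1).sum()
    return changes + 1
```

The count never exceeds n_units + 1, matching the idea that network size upper-bounds the complexity of the mapping it can represent.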

Each set of parameters defines a piecewise linear function, and of course different parameters give different capabilities. The conclusion is that the complexity of a deep neural network has an upper bound — a useful result. If the task we want to learn is more complex than that bound, then our network is designed too small to learn it well. Poor learning shows up in many ways, such as poor generalization: no matter how many samples you train on, the learned distribution stays inconsistent with — biased away from — the actual data distribution. And we can imagine that in the real world, some of the data is not that good.

It also has a lower bound, which is easier to understand: it corresponds to the weights that minimize the network's complexity.

So the representation capability of deep convolutional neural networks has upper and lower bounds, which basically answers the question I raised. We have learned a few things. First, requiring a homeomorphic mapping between topological spaces is actually a strong restriction: we can only learn a few simple topological structures, not overly complex ones — or only one local part, which is easy to learn, while the global structure is hard. Second, the optimal mass transport map can help, but working it out in high dimensions is something of a challenge. Third, given any deep convolutional neural network, there must exist a manifold embedded in its input ambient space whose distribution the network cannot learn. There are ways to mitigate mode collapse; and as for complexity, we have a well-defined way to measure the complexity of a neural network.

If you want to know more details, click here to get the RTC Conference PPT. Meanwhile, the conference has two addresses for live playback: video playback address 1 and video playback address 2.