Deep neural networks have advanced much faster as discriminative models than as generative models. The main reason is that a discriminative model has a clear objective, relatively simple logic, and is easy to implement. To use a popular metaphor, a discriminative model processes incoming material: a series of data is fed to the model, and the model outputs the corresponding judgment. The classifier is the typical example. A generative model is more like a factory that produces and creates, so it needs a certain creative ability, although at the present stage this ability is largely acquired through imitation learning rather than being truly creative. In applying generative models, evaluating them accurately and efficiently is very important, because guiding and correcting the generation process effectively is the key to training them. The generative adversarial framework proposed in this paper uses an idea from game theory to evaluate the generative process.

Abstract

In this paper, we propose a new framework for estimating generative models via an adversarial process. In this framework we train two models simultaneously: 1) a generative model G, which captures the data distribution; 2) a discriminative model D, which estimates the probability that an input sample came from the training data rather than from G. G is trained to maximize the probability that D makes a mistake. This framework corresponds to a minimax two-player game and embodies the idea of game theory. In the space of arbitrary functions G and D, the game has a unique solution, in which G recovers the training data distribution and the output of D equals 1/2 everywhere. If both G and D are defined as multilayer perceptrons, the entire system can be trained with backpropagation. Neither training nor sample generation requires Markov chains or unrolled approximate inference networks. In the experiments, we evaluate the samples generated by G both qualitatively and quantitatively to demonstrate the potential of the framework.

1. Introduction

It is hoped that deep learning can discover rich, hierarchical models that represent probability distributions over the kinds of data encountered in AI applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most influential applications of deep learning have involved discriminative models, usually ones that map a high-dimensional, rich sensory input to a class label. These successes rely on the backpropagation and dropout algorithms, using piecewise linear units, which have particularly well-behaved gradients. Deep generative models have had far less impact, for two reasons. First, maximum likelihood estimation and related strategies give rise to many intractable probabilistic computations that are difficult to approximate, which is a major obstacle in practice. Second, it is difficult to exploit the benefits of piecewise linear units in a generative setting. In this paper, we propose a new procedure for estimating generative models that sidesteps these difficulties.

In the adversarial network framework proposed in this paper there are two models: a generative model, and a model that is pitted against it (hereinafter the discriminative model). The discriminative model learns to determine whether a sample comes from the model distribution or from the data distribution. The generative model can be likened to a team of counterfeiters trying to produce counterfeit money, and the discriminative model to the police trying to detect it. As the game continues, both sides improve their methods until the counterfeit money is indistinguishable from the real thing.

The framework can yield specific training algorithms for many kinds of models and optimization algorithms. In this paper, we discuss only one special case: the generative model produces samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We call this special case adversarial networks. In this case, both models can be trained using only the highly successful backpropagation and dropout algorithms (for dropout, see reference [16]), and samples are drawn from the generative model using only forward propagation. Neither approximate inference nor Markov chains are needed.
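The forward-only sampling described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, the uniform noise prior, the weight initialization, and the helper names `init_mlp` and `generator_forward` are all assumptions made for the example.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Return a list of (W, b) pairs for a fully connected network."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def generator_forward(params, z):
    """Map noise z to samples with one forward pass (no Markov chain)."""
    h = z
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)          # rectified linear hidden layers
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid output in (0, 1)

rng = np.random.default_rng(0)
params = init_mlp(rng, [16, 32, 8])             # hypothetical layer sizes
z = rng.uniform(-1.0, 1.0, size=(5, 16))        # noise prior (assumed uniform)
samples = generator_forward(params, z)          # five generated samples
```

Each row of `samples` is one generated sample. With untrained weights the output is meaningless, but the sampling path is the same one a trained generator would use.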

2. Related research

Until now, most work on deep generative models has focused on models that give a parametric specification of a probability distribution function, trained by maximizing the log-likelihood. The most successful of these is probably the deep Boltzmann machine (see reference [25]). Such models usually have intractable likelihood functions and therefore require numerous approximations to the likelihood gradient. These difficulties motivated the development of “generative machines”: models that do not explicitly represent the likelihood, yet can generate samples from the desired distribution. Generative stochastic networks are one example of a generative machine that can be trained directly with backpropagation, rather than with the numerous approximations that Boltzmann machines require. This paper extends the idea of a generative machine by eliminating the Markov chains used in generative stochastic networks.

At the time of this study, we were unaware that Kingma and Welling (see reference [18]) and Rezende et al. (see reference [23]) had developed more general stochastic backpropagation rules, which allow one to backpropagate through Gaussian distributions with finite variance, and to backpropagate to the covariance parameter as well as the mean. These rules would allow a model to learn the conditional variance of the generator, which is treated as a hyperparameter in our study. Kingma and Welling [18] and Rezende et al. [23] also used stochastic backpropagation to train variational autoencoders (VAEs). Like generative adversarial networks, VAEs pair a differentiable generator network with a second neural network. Unlike in GANs, however, the second network in a VAE is a recognition model that performs approximate inference. GANs require differentiation through the visible units, so they cannot model discrete data. VAEs require differentiation through the hidden units, so they cannot have discrete latent variables. Other VAE-like approaches exist [12,22], but they are less closely related to the method studied in this paper and are not discussed in depth.

Previous studies have also used a discriminative criterion to train a generative model (see references [29,13]), but the criteria they use are intractable for deep generative models. These methods are difficult even to approximate for deep models, because they involve ratios of probabilities that cannot be approximated using variational approximations that lower bound the probability. Noise-contrastive estimation (NCE) [13] trains a generative model by learning the weights that make the model useful for discriminating data from a fixed noise distribution. Using a previously trained model as the noise distribution allows a sequence of models of increasing quality to be trained. This can be viewed as an informal competition mechanism, similar in spirit to adversarial networks. The key limitation of NCE is that its discriminator is defined by the ratio between the probability density of the noise distribution and that of the model distribution, so it requires the ability to evaluate and backpropagate through both densities.

Some earlier studies also use the general idea of having two neural networks compete. The most relevant is predictability minimization [26]. In predictability minimization, each hidden unit of a first neural network is trained to be different from the output of a second network, which predicts the value of that hidden unit given the values of all the other hidden units. The study in this paper differs from predictability minimization in three significant ways: 1) In this paper, the competition between the networks is the sole training criterion, and it is sufficient on its own to train the network. Predictability minimization is only a regularizer, which encourages the hidden units of a neural network to be statistically independent while they accomplish some other task; it is not a primary training criterion. 2) The nature of the competition is different. In predictability minimization, the outputs of two networks are compared: one network tries to make the outputs similar, while the other tries to make them different. The output in question is a single scalar. In a generative adversarial network (GAN), one network produces a rich, high-dimensional vector that is used as the input to the second network, and it attempts to choose an input that the second network does not know how to process. 3) The specification of the learning process is different. Predictability minimization is described as an optimization problem whose objective function is to be minimized, and learning approaches the minimum of that objective. GANs are based on a minimax game rather than an optimization problem, and have a value function that one network seeks to maximize and the other seeks to minimize. The game terminates at a saddle point, which is a maximum with respect to one network and a minimum with respect to the other.
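For reference, the value function mentioned here is usually written as below in the original GAN paper. The symbols p_data (the data distribution) and p_z (the noise prior fed to G) do not appear in the text above and are introduced here only for this formula.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

D is trained to increase V by assigning high probability to real samples and low probability to generated ones, while G is trained to decrease V by making D(G(z)) large.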

Generative adversarial networks are sometimes confused with the related concept of “adversarial examples” [28]. Adversarial examples are found by applying gradient-based optimization directly to the input of a classifier network, in order to find examples that are similar to the data yet misclassified. This is different from the generative adversarial network proposed in this paper, because adversarial examples are not a mechanism for training a generative model. Instead, adversarial examples are primarily an analysis tool for showing that neural networks behave in intriguing ways. For example, a neural network image classifier may classify a picture correctly with very high confidence, yet after a small perturbation that is almost imperceptible to the human eye is added to the picture, the same classifier may assign it to the wrong category with equally high confidence. The existence of such adversarial examples does suggest that training generative adversarial networks could be inefficient, because it shows that a discriminative network can be made to confidently recognize a class without emulating any of the human-perceptible attributes of that class.
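To make the idea of optimizing the input concrete, here is a small sketch on a logistic-regression classifier. It is a simplification under stated assumptions, not the procedure of reference [28]: it takes a single signed-gradient step, and the weights, the input, and the names `predict_proba` and `adversarial_perturb` are hypothetical.

```python
import numpy as np

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression classifier."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def adversarial_perturb(w, b, x, y, eps):
    """Shift every input coordinate by eps in the direction that raises the loss."""
    p = predict_proba(w, b, x)
    grad_x = (p - y) * w              # gradient of the cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w = rng.normal(size=1000)             # hypothetical trained weights
x = rng.uniform(0.0, 1.0, size=1000)  # hypothetical input with values in [0, 1]
b = 3.0 - x @ w                       # bias chosen so the clean logit equals 3
y = 1.0                               # true label of x

x_adv = adversarial_perturb(w, b, x, y, eps=0.01)
p_clean = predict_proba(w, b, x)      # confident, correct prediction
p_adv = predict_proba(w, b, x_adv)    # confidence in the true class collapses
```

No coordinate of `x_adv` differs from `x` by more than 0.01, yet the many small changes add up across 1000 dimensions and flip the prediction. Note that nothing here trains a generative model; the classifier is fixed and only the input is changed.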

3. Adversarial networks

In the next section, we present a theoretical analysis of adversarial networks. It shows that, provided the discriminative model D has enough capacity (for example, when D is not restricted to any parametric family), the minimax training criterion allows the generative model G to recover the data distribution. That is, G learns the distribution of the input samples, and samples drawn from G become indistinguishable from real ones. For an illustrative explanation, see Figure 1. In practice, the game must be implemented with an iterative numerical method. Optimizing D to completion in the inner loop of training is computationally prohibitive and, on a finite dataset, leads to overfitting. A feasible alternative is to optimize D and G alternately: after every k optimization steps for D, one optimization step is taken for G, and this alternation continues until convergence. The advantage of this scheme is that D stays near its optimal solution as long as G changes slowly enough. See Algorithm 1 for the pseudocode of this procedure.
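The alternating schedule can be illustrated with a deliberately tiny one-dimensional example. This is a sketch under strong assumptions rather than Algorithm 1 itself: the generator is G(z) = z + theta, the discriminator is a logistic unit D(x) = sigmoid(w * x + b), the gradients are written out by hand, and the real data are drawn from a unit-variance Gaussian with mean `mu`. The function name `train_toy_gan` and all hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_toy_gan(mu=2.0, k=1, iters=2000, batch=64, lr=0.05, seed=0):
    """Alternate k ascent steps on D with one descent step on G."""
    rng = np.random.default_rng(seed)
    theta = 0.0        # generator parameter: G(z) = z + theta
    w, b = 0.0, 0.0    # discriminator parameters: D(x) = sigmoid(w * x + b)
    for _ in range(iters):
        for _ in range(k):
            x = rng.normal(mu, 1.0, size=batch)            # real samples
            g = rng.normal(0.0, 1.0, size=batch) + theta   # generated samples
            d_real = sigmoid(w * x + b)
            d_fake = sigmoid(w * g + b)
            # Gradient ascent on mean(log D(x)) + mean(log(1 - D(G(z)))).
            w += lr * (np.mean((1.0 - d_real) * x) - np.mean(d_fake * g))
            b += lr * (np.mean(1.0 - d_real) - np.mean(d_fake))
        g = rng.normal(0.0, 1.0, size=batch) + theta
        d_fake = sigmoid(w * g + b)
        # Gradient descent on mean(log(1 - D(G(z)))) with respect to theta;
        # the generator only sees the data through D's gradient signal.
        theta += lr * np.mean(d_fake) * w
    return theta, w, b

theta, w, b = train_toy_gan()
d_at_theta = sigmoid(w * theta + b)   # D's output at the generator's mean
```

In this toy setting `theta` should drift from 0 toward `mu`, and the discriminator's output near the generated samples should move toward 1/2. Raising `k` gives D more updates per generator step, which is the trade-off the paragraph above describes.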

4. Theoretical analysis

4.2 Convergence of Algorithm 1

5. Empirical analysis

We trained adversarial networks on a range of datasets, including MNIST, TFD (the Toronto Face Database), and CIFAR-10. The generator network G uses a mixture of rectified linear and sigmoid activation functions, while the discriminator network uses maxout activations. We also apply dropout when training the discriminator network. Although the theoretical framework permits dropout and other noise at intermediate layers of the generator, in practice we use noise only as the input to the bottommost layer of the generator network.
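For readers unfamiliar with the two discriminator components named above, the following sketch shows one common way to write them. The array shapes and the function names `maxout` and `dropout` are assumptions for illustration, and the dropout shown is the "inverted" variant that rescales at training time, which is not necessarily the exact form used in the experiments.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout layer: each output is the maximum over several affine maps.

    x: (batch, d_in), W: (pieces, d_in, d_out), b: (pieces, d_out).
    """
    z = np.einsum("bi,pio->bpo", x, W) + b   # shape (batch, pieces, d_out)
    return z.max(axis=1)

def dropout(h, rate, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                  # hypothetical batch of inputs
W = rng.normal(size=(2, 3, 5))               # two linear pieces per output unit
b = np.zeros((2, 5))
h = maxout(x, W, b)                          # shape (4, 5)
h_drop = dropout(h, 0.5, rng)                # applied only during training
```

Because a maxout unit is a maximum of linear functions, it is piecewise linear, which ties in with the earlier remark about the well-behaved gradients of piecewise linear units.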

The advantages mentioned above are mainly computational. Adversarial models may also have a statistical advantage: the generator network G is not updated directly with data samples, but only with the gradient signal that flows back through the discriminator D. This means that components of the input data are not copied directly into the parameters of G. Another advantage of adversarial networks is that they can represent very sharp, even degenerate, distributions.

References

[1] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

[2] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

[3] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In ICML ’13.

[4] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable by backprop. In ICML ’14.

[5] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic networks trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning (ICML ’14).

[6] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

[7] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.

[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS ’2011.

[9] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks. In ICML ’2013.

[10] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS ’2013.

[11] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

[12] Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In ICML ’2014.

[13] Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS ’10).

[14] Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97.

[15] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.

[16] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.

[17] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV ’09), pages 2146–2153. IEEE.

[18] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

[19] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.

[20] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS ’2012.

[21] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

[22] Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. Technical report, arXiv preprint arXiv:1402.0030.

[23] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. Technical report, arXiv:1401.4082.

[24] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). Generative process for sampling contractive auto-encoders. In ICML ’12.

[25] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS ’2009, pages 448–455.

[26] Schmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6), 863–879.

[27] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto.

[28] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014). Intriguing properties of neural networks. ICLR, abs/1312.6199.

[29] Tu, Z. (2007). Learning generative models via discriminative approaches. In Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pages 1–8.