Original article by Heart of the Machine

Author: Siyuan

Just as physics in its great era of discovery sought unification through quantum mechanics, deep learning may also need a unified theoretical framework.

Imagine a theory that tells us which model architecture and which representation best fit a given kind of data, which loss function and which iterative method learn a given capability most efficiently, and which settings allow that capability to handle the situations it will meet in practice. With such a theory, deep learning, and machine learning more broadly, would be a discipline resting on a complete theoretical foundation.

Encouragingly, much recent cutting-edge research is becoming more systematic and more insightful. The clearest sign is at the top AI conferences of recent years, where many award-winning papers try to lay foundations for deep learning from a more fundamental and profound perspective. This article introduces the theoretical basis of deep learning and the new findings in it, based on the award-winning papers of the 2019 top AI conferences.

Mathematical basis ≠ theoretical basis

In introductory deep learning courses we hear a variety of mathematical terms: linear algebra and matrix calculus, probability theory and stochastic processes. It seems that to understand the various models, you must first understand these mathematical concepts. So aren’t these mathematical descriptions the “theoretical basis” of deep learning?

We need to be clear here that mathematics is a language and a tool; using it to describe deep learning does not by itself build the theoretical foundation of the field. It provides a mathematical basis, not a theoretical basis that consolidates the discipline. Indeed, many deep models start from mathematical derivations and obtain good properties, but this only shows that those models come with theoretical guarantees; it does not constitute a theoretical basis for deep learning as a whole.

For example, graph convolutional networks and variational autoencoders start by deriving certain properties mathematically and then build the whole model on those properties. We can say the theoretical basis of these models is solid, and understanding them requires the corresponding mathematical background. Even so, in practice we do not always follow the theoretical derivation exactly; slight modifications often yield better empirical results.

In deep learning, many models are mathematically elegant, and many others start from experiments and intuition and only then express the learning process mathematically. They are all important, but they do not answer the fundamental questions of deep learning: why do deep models learn so effectively? Why do deep models have better properties than shallow models? Why does deep learning generalize well?

Like the early twentieth century, when physics discovered one quantum phenomenon and explanation after another while struggling toward a unified quantum mechanics, deep learning now has all kinds of highly efficient models and remarkable architectures, and may likewise need a unified underlying theoretical framework.

The theoretical basis of DL

When it comes to the theoretical basis of deep learning, the first thing that may come to mind is the universal approximation theorem, which states that a feed-forward network with a single hidden layer containing enough neurons can approximate any continuous function on a compact subset of the real numbers. In layman’s terms, as long as there are enough neurons, a single-hidden-layer feed-forward neural network “has the potential” to approximate arbitrarily complex continuous functions.

Since the universal approximation theorem was proposed in 1989, we have had at least one basic theoretical guarantee: neural networks have the potential to solve all kinds of complex real-world problems.
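To make this concrete, here is a minimal sketch, not from the original article: a single-hidden-layer tanh network trained with plain gradient descent to approximate sin(x) on [−π, π]. The architecture, data, and hyperparameters are arbitrary choices for illustration; with enough hidden units the fit becomes very close, which is the practical face of the theorem.

```python
# Minimal sketch: a one-hidden-layer tanh network fit to sin(x) by gradient
# descent, illustrating universal approximation in practice. All choices
# (64 hidden units, learning rate, step count) are arbitrary for the demo.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)   # inputs in [-pi, pi]
y = np.sin(x)                                        # target function

hidden = 64
W1 = rng.normal(0.0, 1.0, (1, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.1, (hidden, 1))
b2 = np.zeros(1)
lr = 0.05

for step in range(5000):
    h = np.tanh(x @ W1 + b1)          # hidden layer, shape (256, hidden)
    pred = h @ W2 + b2                # network output, shape (256, 1)
    err = pred - y
    loss = np.mean(err ** 2)

    # Backpropagation through the two layers.
    g_pred = 2.0 * err / len(x)
    g_W2 = h.T @ g_pred
    g_b2 = g_pred.sum(axis=0)
    g_h = (g_pred @ W2.T) * (1.0 - h ** 2)
    g_W1 = x.T @ g_h
    g_b1 = g_h.sum(axis=0)

    W2 -= lr * g_W2; b2 -= lr * g_b2
    W1 -= lr * g_W1; b1 -= lr * g_b1

print(f"final mean-squared error: {loss:.5f}")  # small value => good approximation
```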

MIT professor Tomaso Poggio has argued in a series of studies [1] that theoretical research on deep learning can be divided into three categories:

  • Representation: why do deep networks perform better than shallow networks?
  • Optimization: why can gradient descent find good minima, and what characterizes a good minimum?
  • Generalization: why can over-parameterized networks still generalize well to unseen data?

For the representation problem, we want to know what determines the expressive power of a deep neural network, viewed as a composite function, and what its composition mechanism is. We are no longer satisfied with the qualitative statement that it “can fit any function”; we want a way to describe the fitting capability of a 50-layer ResNet or a 12-layer Transformer, and to understand their theoretical properties clearly.

Representation only gives the potential to fit; deep learning still has to actually find a sufficiently good set of extreme points, the model’s optimal solution. What the optimization landscape of different neural networks looks like, how to find good extreme points of such high-dimensional, complex functions, and what properties those extreme points have all need solid theoretical support.

Finally, generalization: a deep model’s ability to generalize to unseen samples directly determines its value. How to bound the generalization of a deep model, and what kind of extreme point generalizes better, are among the many important questions still waiting for a theoretical benchmark.

In a word, when it comes to the theoretical basis of deep learning, we keep hearing these keywords: representation, optimization, and generalization.

Theoretical research in 2019

In 2019, the top AI conferences repeatedly selected theoretical research for their best or outstanding paper awards. These works do not necessarily contribute new architectures or algorithms; many of them instead try to understand deep learning, its behavior and its limits, from a mathematical and theoretical perspective. Because such research tends to be more fundamental, it is also increasingly hard to read.

We went through the 2019 award-winning papers of the top AI conferences. Here “top conferences” mainly means the Class A conferences in artificial intelligence recommended by the CCF, plus ICLR, which the CCF list does not include. We divided the winning papers into those focused on models and algorithms and those focused on theory, where “theory” may mean theoretical research on deep learning itself or a theoretical understanding of a model or algorithm.

Statistical data can be found in Appendix A: www.jiqizhixin.com/articles/20…

Papers on algorithms or models are generally friendly to read, presenting an intuitive idea, a formal treatment, and final results. Theoretical papers, by contrast, demand knowledge in many areas and above all a solid mathematical foundation. A few representative studies are detailed later in this article; for now, let’s look at the overall picture.

The theoretical basis of deep learning

In fact, the theoretical basis of deep learning is still relatively thin, and the three major problems of representation, optimization, and generalization remain the most important and basic directions. The Outstanding New Directions Paper at NeurIPS 2019 [2] focuses specifically on generalization in deep learning.

Most existing theory derives generalization upper bounds from uniform convergence, but the CMU researchers argue that this family of bounds is problematic. Through a series of experiments and theoretical analyses, they show that no matter how refined a uniform convergence bound is, it cannot be shown to explain generalization; the large family of generalization bounds derived from uniform convergence is therefore in question.

In the ICLR 2019 best paper [3], the Lottery Ticket Hypothesis proposed by MIT researchers is particularly interesting. It suggests that if some parameters matter in a trained model, they already mattered before training began. In other words, right after initialization the neural network contains sub-networks that, trained on their own, can reach performance similar to the full network. Such sub-structures are called “winning tickets”, and they also point to new characteristics of the representational ability of neural networks.

There are new findings in convergence analysis as well: the ICML 2019 best paper [4] analyzed the convergence rate of sparse variational Gaussian process regression and proposed a more convenient way to compute it.

Theoretical understanding of the model

Beyond foundational theory, even more theoretical studies focus on understanding model algorithms theoretically and using that understanding to propose new solutions. The most prominent example is the NeurIPS 2018 best paper, ODENet [5], which interpreted residual networks as ordinary differential equations and thereby obtained a new way to solve them. Among the 2019 award-winning papers, even the algorithm-oriented ones contain some theoretical understanding; here we focus on the research with the strongest theoretical component.

First, the AAAI 2019 best paper [6] approached imperfect-information games from the perspective of iterative algorithms in order to build stronger agents. It abstracts a complex game into a simpler game problem and constructs a new algorithm from a game-theoretic perspective, with excellent theoretical properties.

In a NeurIPS 2019 outstanding paper [7], CMU researchers theoretically analyzed a large family of loss functions to explore what the loss function of a GAN really is. Similarly, in an ACL 2019 outstanding paper [8], the researchers argue that a theoretical model of automatic text summarization can deepen our understanding of the task and help improve summarization systems; they therefore rigorously defined some core concepts of summarization and proposed a theoretical modeling framework.

None of these top research achievements is possible without theoretical support. Below we look at what this new research looks like, from foundational theory to theoretical understanding.

Problematic generalization

What makes a model generalize well? Is good generalization on a test set really good generalization?

Many previous works treat the error on a test set as the generalization error. Regardless of whether the test data has been “peeked at”, this error is only an empirical measure; deep learning needs a theoretical analysis of the generalization ability of its learning methods. Let’s first look at the definition of generalization error.
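In standard textbook notation, with P the joint distribution over inputs X and labels Y, L a loss function, and f̂ the learned model, the generalization error is the expected loss over unseen data (a standard form consistent with the description below):

$$
R_{\exp}(\hat{f}) \;=\; \mathbb{E}_{(X,Y)\sim P}\left[\, L\big(Y, \hat{f}(X)\big) \,\right]
$$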

This expression is not complicated: it says that the generalization error is the model’s “average” prediction error over all unknown data, that is, the mathematical expectation of the error. Note that “all unknown data” is not something we can actually access, so this is only a definition.

The decline of traditional generalization theory

Previous theoretical work mainly analyzes a probabilistic upper bound on the generalization error, the oft-mentioned generalization error bound. In traditional machine learning, this bound is a function of sample size: as the number of samples grows, the bound tends to 0. It is also a function of model capacity: the stronger the model, the harder it is to learn and the larger the bound.

For example, the well-known VC dimension measures the expressive power of a family of functions by way of a uniform convergence bound over that family. In the NeurIPS 2019 Outstanding New Directions Paper, the researchers show that this approach built on uniform convergence bounds breaks down. A traditional generalization bound can be roughly expressed as follows:
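Schematically, and only as an informal sketch (the exact capacity term and constants differ across concrete results), a classical parameter-counting bound for a network trained on m samples looks like:

$$
\text{test error} \;\le\; \text{training error} \;+\; O\!\left(\sqrt{\frac{\text{depth}\times\text{width}}{m}}\right)
$$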

That is, the test error cannot exceed the training error plus some bound, a bound that shrinks as the training set grows and grows with the number of model parameters (roughly depth × width).

The problem is that this traditional generalization bound does not account for the striking “over-parameterization” of deep neural networks. Not all parameters contribute to the final prediction, and deep models carry a great deal of redundancy. Depth × width therefore does not correctly describe how hard a deep model is to learn, and over-parameterization actually makes learning easier.

The new route taken by modern generalization theory

With traditional generalization theory failing, deep learning researchers have been looking for a way out. Recent work asks: can we determine how the underlying data distribution and the algorithm together constrain a deep neural network to a “simple” family of functions? By working with a norm-bounded family of functions, perhaps uniform convergence can still yield tighter, more accurate bounds:
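Informally, these newer bounds swap the raw parameter count for some norm of the learned weights; again this is only a schematic form, and the precise capacity term varies from result to result:

$$
\text{test error} \;\le\; \text{training error} \;+\; O\!\left(\sqrt{\frac{\text{norm-based capacity of the learned weights}}{m}}\right)
$$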

These methods look very different, but they are essentially different instantiations of uniform convergence. In the CMU paper, the authors find that uniform convergence bounds cannot fully explain generalization in deep learning, and that we may need to look beyond uniform convergence to discuss generalization bounds at all.

What exactly is wrong with uniform convergence?

First, recall what uniform convergence means in machine learning: it is, in essence, the answer to the question “why does reducing training loss reduce test loss?” If the empirical risk of a family of functions converges uniformly to its expected risk, the problem is learnable.
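In textbook form, uniform convergence over a hypothesis class H asks that, with high probability over an n-sample training set, the empirical risk R̂_n and the expected risk R stay close for every hypothesis at once:

$$
\sup_{h \in \mathcal{H}} \big|\, \hat{R}_n(h) - R(h) \,\big| \;\le\; \epsilon(n, \mathcal{H}, \delta), \qquad \epsilon \to 0 \;\text{ as }\; n \to \infty
$$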

Vaishnavh Nagarajan, co-author of the paper [2], says: “Most previous studies derive generalization bounds from uniform convergence, but our study shows that this kind of bound is likely to be limited.” Rademacher complexity, covering numbers, PAC-Bayes, and many other state-of-the-art generalization-bound analyses are potentially affected.

Vaishnavh notes that a uniform convergence bound grows with the number of parameters, so for deep networks such a bound is too large to behave like a true generalization bound. It is possible to make a uniform convergence bound very tight, except that it is then no longer a bound on the original neural network but on a new one refined by techniques such as model compression.

To understand more deeply why uniform convergence fails to provide a solid foundation, the researchers ran many experiments and derivations and found that the problems show up in two main ways.

First, the generalization bound can grow as the training set grows, which is deeply problematic: intuitively, as the dataset tends to infinity, the gap between training error and test error should shrink to zero. “Even though we observed the test error decreasing normally as the dataset grew, the generalization bound grew abnormally,” Vaishnavh said.

The reason is that model complexity used to be measured by the number of parameters, and the later, refined bounds measure it by weight norms instead. The trouble is that the weight norms grow significantly as the dataset grows, cancelling out the growth of the dataset-size term in the denominator. “Parameter dependence is only one part of the generalization problem,” Vaishnavh says. “We also need to pay close attention to dataset dependence.”

Second, the researchers examine the whole class of uniform-convergence-based generalization bounds and show that no such bound, however rigorously derived and applied, can explain the generalization of neural networks trained with SGD. As Vaishnavh puts it, in deep learning, no matter how refined a uniform convergence bound is, it cannot be shown to explain generalization.

They show that even after further refinement, a uniform convergence bound can end up approximately equal to 1 while the true generalization gap is close to 0: the bound is essentially vacuous and tells us nothing.

Finally, Vaishnavh shows that in over-parameterized deep learning, decision boundaries are extremely complex: the decision boundary can carry small bends around each training data point that hurt uniform convergence without hurting generalization. We may therefore need new mathematical tools to describe the complex decision boundaries of deep networks, and theory beyond uniform convergence to discuss deep learning.

Some parameters are inherently unequal

Deep learning is strongly over-parameterized: the number of parameters far exceeds the amount of data. And, importantly, not all parameters are created equal: some are not important at all, and deleting them makes no difference. In the ICLR 2019 best paper [3], MIT researchers form a hypothesis about the underlying mechanism and verify it experimentally.

The hypothesis says that once a neural network has been randomly initialized, it contains a subnetwork that, trained from scratch for at most the same number of iterations, can match the performance of the full network. This seems counterintuitive, since we have long held that “a pruned subnetwork trained from scratch does worse than the whole network.”

The researchers hypothesize that these effective sub-structures, the “winning tickets”, are determined as soon as initialization is complete. Whether the hypothesis holds, of course, requires experiments.

The experiments

Because model pruning inherently builds a subnetwork, the researchers first check whether a pruned subnetwork is an effective sub-structure on its own. They found that if you keep the subnetwork but re-initialize its weights randomly, training can no longer match the earlier results, which demonstrates how much the original initialization matters to an effective sub-structure.

Assuming such a sub-structure exists in the network, it can be found in four steps. The core idea: since conventional pruning can remove a large fraction of weights while keeping accuracy almost unchanged, what remains is an effective sub-structure under that initialization. If we save the original initialization and reuse it in the pruned sub-structure, can we recover good training results?

Specifically, for a neural network f(x; θ), where θ are the initial weights, we train and prune, and then assign the original initial values back to the surviving sub-structure, written f(x; m ⊙ θ). Here the mask m has the same dimensions as θ and each of its elements is either 0 or 1, so f(x; m ⊙ θ) denotes a properly initialized subnetwork. The procedure has four steps (a minimal code sketch follows the list):

  • Randomly initialize the neural network f(x; θ_0), where θ_0 is drawn from some distribution D_0;
  • Train the network for j iterations to obtain parameters θ_j;
  • Prune p% of the parameters in θ_j to obtain a mask m;
  • Restore the initial parameters θ_0 on the surviving weights, creating the effective sub-structure f(x; m ⊙ θ_0).
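Below is a minimal sketch of these four steps in PyTorch-style Python. It is an illustration under stated assumptions, not the authors’ code: `model` is any network, `train_fn(model, masks)` is a hypothetical helper assumed to train for j iterations while keeping masked weights at zero, and pruning here is simple per-tensor magnitude pruning.

```python
# Minimal sketch of the four steps (assumptions: PyTorch; a generic `model`;
# a `train_fn(model, masks)` that trains for j iterations and keeps masked
# weights at zero). This is an illustration, not the authors' original code.
import copy
import torch

def find_winning_ticket(model, train_fn, prune_fraction=0.2, rounds=1):
    init_state = copy.deepcopy(model.state_dict())            # step 1: save theta_0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train_fn(model, masks)                                 # step 2: train -> theta_j

        # Step 3: prune the smallest-magnitude surviving weights in each tensor.
        for name, param in model.named_parameters():
            alive = param.detach()[masks[name].bool()].abs()
            k = int(prune_fraction * alive.numel())
            if k == 0:
                continue
            threshold = alive.sort().values[k - 1]
            masks[name][param.detach().abs() <= threshold] = 0.0

        # Step 4: rewind the surviving weights to theta_0 (i.e. m ⊙ theta_0).
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                param.mul_(masks[name])

    return model, masks                                        # the candidate "winning ticket"
```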

Across a range of experiments, the researchers found “winning tickets” for fully connected networks on MNIST and for convolutional networks on CIFAR-10. These subnetworks have only 10–20% of the parameters of the original network, yet retraining them achieves comparable results.

Test results of VGG-19 on CIFAR-10, after 30K, 60K, and 112K training iterations from left to right. Source: arXiv:1803.03635.

In the figure, the horizontal axis is the fraction of weights retained, the vertical axis is accuracy, and the legend shows the learning rate. Solid lines of each color are subnetworks with the “effective initialization”, while dashed lines are randomly re-initialized ones. Comparing the solid green line, the dashed green line, and the solid blue line, it is clear that the researchers have indeed found “winning tickets”.

“This series of experiments demonstrates the existence of effective sub-structures in neural networks,” the researchers said. “It is only a hypothesis, but it is useful for further theoretical study, especially of optimization and generalization.”

Hard-core loss-function analysis

If you feel that this kind of theoretical understanding is not mathematically elegant enough, the award-winning papers also include rigorous mathematical analyses. A NeurIPS 2019 award-winning paper [7] analyzes the family of Besov IPM loss functions, which includes L_p distances, total variation distance, Wasserstein distance, the Kolmogorov–Smirnov distance, and many others.

For this large family of losses, the researchers derive upper and lower bounds and clarify how the choice of loss function and the assumptions on the data interact to determine the minimax optimal convergence rate.

For GANs, if we write the discriminator function class as F and the class of candidate generator distributions as P, the whole GAN can be regarded as a probability distribution estimation problem:
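Written out in the IPM framework, the estimate looks roughly as follows; the notation here, with discriminator class F, candidate distributions P, and empirical distribution P̃_n, is a reconstruction for illustration rather than a quotation from the paper:

$$
\hat{P} \;\in\; \operatorname*{arg\,min}_{Q \in \mathcal{P}} \; d_{\mathcal{F}}\big(Q, \tilde{P}_n\big),
\qquad
d_{\mathcal{F}}(P, Q) \;=\; \sup_{f \in \mathcal{F}} \Big|\, \mathbb{E}_{X\sim P}[f(X)] - \mathbb{E}_{X\sim Q}[f(X)] \,\Big|
$$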

This says that the GAN’s distribution estimate directly minimizes the empirical IPM risk against the empirical distribution P̃_n. After a fairly involved mathematical analysis, the researchers reach three main conclusions:

1. They prove minimax convergence rates for distribution estimation under IPM losses, with both lower and upper bounds (Theorems 4 and 5), under the assumption that the generator distributions and the discriminator functions both belong to Besov spaces. The upper bound is obtained mainly via the wavelet-thresholding estimator proposed by Donoho et al. [9], and the results establish the optimal convergence rate over a wider range of losses than previously known (the explicit rate is given in the paper).

2. Theorem 7 shows that, for p_d ≥ p_g and σ_g ≥ d/p_g, no linear estimator can converge faster than a certain rate, also given explicitly in the paper.

“Linear estimators” here include the empirical distribution, kernel density estimators, and the recently proposed orthogonal series estimators. This lower bound shows that in many cases linear estimators can only achieve suboptimal convergence rates.

3. With regularization, GANs can achieve the minimax convergence rate using generators and discriminators of finite size. As the first theoretical results that separate GANs from other nonparametric tools, this may help explain why GANs have been so successful on high-dimensional data.

Finally, whether it is genuinely foundational theory for deep learning or new methods and models built on theory, at the 2019 top AI conferences we were glad to see cutting-edge research turning away from purely heuristic discoveries and toward a more systematic examination of its foundations. Perhaps these new findings will eventually lead us to a systematic field, a mature discipline.

References:

[1] Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization, arXiv:1908.09375

[2] Uniform convergence may be unable to explain generalization in deep learning, arXiv:1902.04742

[3] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, arXiv:1803.03635

[4] Rates of Convergence for Sparse Variational Gaussian Process Regression, arXiv:1903.03571

[5] Neural Ordinary Differential Equations, arXiv:1806.07366

[6] Solving Imperfect-Information Games via Discounted Regret Minimization, arXiv:1809.04040

[7] Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses, arXiv:1902.03511

[8] A Simple Theoretical Model of Importance for Summarization, arXiv:1801.08991

[9] Density Estimation by Wavelet Thresholding, David L. Donoho et al.

This article was originally written by Heart of the Machine.
