Product | AI technology base (public ID: rgznai100)

At the JD Artificial Intelligence Innovation Summit held on April 15, Professor Zhou Zhihua, who had just been appointed chief academic advisor of JD Artificial Intelligence's Nanjing branch, publicly shared his thoughts on deep learning.

In recent years, deep neural networks have made remarkable progress in speech and image tasks, to the point that many people now equate deep learning with deep neural networks. However, Zhou Zhihua pointed out that a survey of Kaggle competition results shows that neural networks tend to win on typical perception tasks such as images, video and sound, while on other tasks involving hybrid modeling, discrete modeling or symbolic modeling, they often lose out to other models.

Why is this? Starting from what the "deep" in deep neural networks really means, Zhou Zhihua summarized three reasons for the success of neural networks:

  • Layer-by-layer processing

  • Internal feature transformation

  • Sufficient model complexity

He concluded that as long as these three conditions are satisfied, the model does not have to be a deep neural network.

Because neural networks have certain shortcomings, in many cases people have to consider other models. Zhou Zhihua introduced the gcForest method proposed by his team, noting advantages such as good cross-task performance and adaptive model complexity.

As for the significance of the gcForest research, Zhou Zhihua put it this way in his talk: deep learning is a dark room, and until now all we knew was that there were deep neural networks inside. Now we have opened a door to this room and placed gcForest in it, and I think more things may appear in the future. From the perspective of academic and scientific development, that is the more important value of this work.

Professor Zhou Zhihua is a Fellow of ACM, AAAS, AAAI, IEEE, IAPR, IET/IEE and other societies, a "grand slam" of AI scholarly honors, and the only AAAI Fellow to have earned all of his degrees in mainland China. He has made outstanding contributions to ensemble learning, multi-label learning and semi-supervised learning within machine learning. He also co-founded and served as dean of the School of Artificial Intelligence at Nanjing University.

The following is the full text of the speech.

You may have heard recently that Nanjing University has established a School of Artificial Intelligence, the first artificial intelligence school among China's C9 universities. Today, I would like to share with you some of our own modest views on deep learning, simply for your criticism and discussion.

▌ What is deep learning?

As we all know, artificial intelligence is very hot right now, and one of the most important technologies behind this craze is deep learning. When we talk about deep learning today, we can see it in a wide variety of applications, including images, video, sound, natural language processing and so on. But if we ask the question "what is deep learning?", most people will basically tell you that deep learning is more or less the same thing as a deep neural network.

Let me give you an example. A very famous society, the Society for Industrial and Applied Mathematics (SIAM), publishes a newsletter called SIAM News. A front-page article last June focused on exactly this question: what is deep learning? Its answer, in essence, was that deep learning is a subfield of machine learning in which deep neural networks are used.

So basically, if we are going to talk about deep learning, we have to start from neural networks. Neural networks are not new; people have been studying them for more than half a century. In the past, however, we usually used neural networks with only one or two hidden layers in the middle. And what is each unit of such a neural network? It is a very simple computational model.

For example, here is a computational model that was actually proposed more than half a century ago. The unit receives some inputs, each scaled by a connection weight; when the weighted inputs arrive at the cell and their sum exceeds a threshold, the cell is activated. It is really just a very simple formula, and a neural network is a mathematical system obtained by nesting and iterating many such formulas.
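Written as a formula (this is the classical McCulloch-Pitts style neuron in my own notation, not a slide from the talk), the unit computes a weighted sum of its inputs, subtracts a threshold, and passes the result through an activation function:

$$
y = f\Big(\sum_{i=1}^{n} w_i\, x_i - \theta\Big),
$$

where the $x_i$ are the inputs, the $w_i$ are the connection weights, $\theta$ is the threshold, and $f$ is the activation function; a network nests many such units, feeding the outputs of one layer in as the inputs of the next.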

What do we mean today when we talk about deep neural networks? Basically a neural network with many, many layers. How many? Consider some numbers: in 2012, when deep learning was just beginning to attract attention, the ImageNet competition champion used 8 layers; in 2015 the winner used 152 layers; and in 2016, 1,207 layers. These are very large systems.

Such a system is very hard to train, but the good news is that the key computation inside each unit, the activation function, is continuous and differentiable. In earlier neural networks we used the sigmoid function, which is continuously differentiable, and in today's deep neural networks we often use tanh or variants of it, which are also continuously differentiable. With this property we get a very nice result: we can easily compute the gradient of the whole system, which makes it straightforward to train with the famous BP (backpropagation) algorithm.
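To make concrete why continuous differentiability matters, here is a minimal sketch (my own illustration, not code from the talk) of a tiny one-hidden-layer network with tanh and sigmoid activations trained by gradient descent; because every step is differentiable, the backward pass is just the chain rule, which is exactly what the BP algorithm automates for deep networks:

```python
import numpy as np

def dtanh(x):
    # the derivative of tanh exists everywhere, which is what makes BP possible
    return 1.0 - np.tanh(x) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 toy samples, 5 features
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

W1 = rng.normal(scale=0.1, size=(5, 8))            # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(8, 1))            # hidden -> output weights
lr = 0.5

for _ in range(500):
    # forward pass
    h_in = X @ W1
    h = np.tanh(h_in)
    out = 1.0 / (1.0 + np.exp(-(h @ W2)))          # sigmoid output in (0, 1)
    # backward pass via the chain rule
    grad_logit = (out - y) / len(X)                # dLoss/dlogit for cross-entropy
    grad_W2 = h.T @ grad_logit
    grad_h_in = (grad_logit @ W2.T) * dtanh(h_in)
    grad_W1 = X.T @ grad_h_in
    # gradient-descent update
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
```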

With algorithms like this, neural networks have achieved a great deal of success. But one question has never really been settled in academia: why do we need such deep models? Many people will say that deep learning has been hugely successful but that its theoretical foundation is unclear; we still have no clear theory of why it works or what the key ingredient is. In fact, we do not even know from what angle to analyze it, because before you can do a theoretical analysis you need some intuition about what makes the thing work, and only then can you follow that path and obtain rigorous results.

As for why deep neural networks can, and should, be deep, there is in fact still no consensus in academia. Let me give you an argument we proposed a while ago; it is essentially an argument about model complexity.

▌ What is the key to success in deep learning?

We know that the complexity of a machine learning model is related to its capacity, and this capacity directly determines its learning ability, so learning ability is tied to complexity. We have long known that if we increase the complexity of a learning model, its learning ability can improve. So how do we increase complexity?

For a model like a neural network there are two obvious ways: make the model deeper, or make it wider; but going deeper is more effective at increasing complexity. Making the network wider only adds more units and hence more basis functions; making it deeper not only adds more functions but also increases the depth of nesting, so the expressive power grows faster. From that point of view, we should try to go deeper.

You might ask: if deeper is better, did people not know this long ago? Why only start now? This relates to another issue: in machine learning, increasing learning ability is not necessarily a good thing, because one of the things we constantly struggle with is overfitting.

Given a data set, we hope to learn the general regularities underlying the data, but sometimes we instead pick up peculiarities of this particular data set, and those peculiarities are not general laws. When such wrongly learned peculiarities are applied as if they were general rules, we make big mistakes; this phenomenon is overfitting. Why would we learn the peculiarities of the data itself? Precisely because the model's learning ability is too strong.

So why did we avoid overly complicated models before, and why can we use them now? Several factors are involved. The first is that we now have much more data. If I have, say, only about 3,000 data points, the properties I learn from them are unlikely to be general; but with 30 million or even more data points, the regularities found in the data are far more likely to be general laws. So the use of big data is itself a key condition for alleviating overfitting.

The second factor is that we now have very powerful computing devices, which makes it possible to train such models. And thanks to the efforts of many researchers in this field, we also have many techniques and algorithms for training such complex models, which makes using them practical.

Summing up, there are really three things: first, we have bigger data today; second, we have powerful computing devices; third, we have many effective training techniques.

These allow us to use models of high complexity, and a deep neural network happens to be a highly complex model that is easy to realize. So this explanation, if we may call it one, seems to tell us why we can now use deep neural networks and why they work: because of complexity.

When we put forward this explanation more than a year ago, many colleagues at home and abroad agreed with it, because it sounded reasonable. But in fact I have never been entirely satisfied with it. Why? There is a latent question it does not answer: in terms of complexity alone, we cannot explain why flat or wide networks cannot perform as well as deep ones. After all, making a network wider also increases its complexity, even if less efficiently.

In fact, there was already a theoretical proof in 1989 that neural networks have universal approximation ability: with just one hidden layer, you can approximate, to arbitrary precision, any continuous function of any complexity defined on a compact set.
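The 1989 result can be stated roughly as follows (a paraphrase of the standard universal approximation theorem, with the technical conditions on the activation simplified; it is not a quotation from the talk): for any continuous target on a compact set, a single hidden layer with enough units suffices,

$$
\forall\, f \in C(K),\ K \subset \mathbb{R}^n \ \text{compact},\ \forall\, \varepsilon > 0:\quad
\exists\, N,\ w_i, b_i \in \mathbb{R},\ v_i \in \mathbb{R}^n \ \text{with}\ 
\sup_{x \in K}\Big|\, f(x) - \sum_{i=1}^{N} w_i\, \sigma\big(v_i^{\top} x + b_i\big)\Big| < \varepsilon,
$$

where $\sigma$ is a fixed sigmoidal activation. Note that the theorem says nothing about how large $N$ must be or how to find the weights, which is part of why, as discussed below, approximation power alone does not explain the success of deep networks.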

So the network does not have to be very deep for this. The reason I mention it is that some people think this approximation power is the main reason neural networks are so powerful, but that is a misunderstanding.

Every model we use in machine learning must have universal approximation ability; without it, the model would be of no use at all. Even something as simple as a Fourier transform already has this ability, so it is by no means unique to neural networks. What I want to emphasize is this: with just one hidden layer, by adding enough hidden units, the network becomes extremely powerful and extremely complex. Yet however hard we tried such models in applications, we found they were never as good as deep neural networks. So the question is probably hard to resolve from the complexity angle alone; we need to think a bit more deeply.

So we have to ask: what is the essence of deep neural networks? Our answer today is probably the ability to do representation learning, that is, to learn features. In the past, when we did machine learning, we would first take the data, say an image, and describe it with many features such as color, texture and so on. Those features were designed manually by human experts, and only after the data was expressed in this way could learning begin.

With deep learning today, we no longer need to design features by hand: we throw the data in at one end, the model comes out at the other, and everything in between, including the features, is worked out by learning itself. This is so-called feature learning, or representation learning, and compared with earlier machine learning techniques it is a great step forward: we no longer have to rely entirely on human experts to design features.

Friends in industry sometimes say that a very important ingredient here is end-to-end learning, and people consider it very important. In fact this has two sides. On the one hand, when feature learning and classifier learning are combined, they can be jointly optimized, which is good. On the other hand, if we have no idea what is happening inside, end-to-end learning is not necessarily that beneficial: the first part may be pulling east while the second part pulls west, and although the net result is a little movement east, a lot of effort is cancelled out inside.

In fact, end-to-end learning appeared in machine learning long ago, for example in feature selection; but are those end-to-end methods necessarily better than other feature selection methods? Not necessarily. So end-to-end is not the most important thing; what really matters is feature learning, or representation learning.

Let us ask the next question: what is the key to representation learning? We now have an answer: layer-by-layer processing. Let me borrow a figure from the very popular book "Deep Learning": when we feed an image through the layers of a neural network, at the bottom we see pixels; one layer up, we slowly get edges; further up, contours, object parts, and so on. A real neural network model does not necessarily separate the levels this cleanly, but broadly speaking it does abstract the object level by level as we move upward.

We now believe this is really one of the key factors in the success of deep learning, because a flat neural network can do much of what a deep network can do, except for one thing: it compresses everything into a single step of processing, so it lacks this deep, layered abstraction, and that layered abstraction may be the key. If we look a little further, you might say that layer-by-layer processing is nothing new in machine learning.

There have long been layer-by-layer models, decision trees being a very typical example; they have been around for fifty or sixty years. So why are they not as good as deep neural networks? First, their complexity is insufficient: if we consider only discrete features, a decision tree's depth cannot exceed the number of features, since each feature need be tested at most once along any path, so its model complexity has an upper bound. Second, throughout the learning process of a decision tree there is no feature transformation inside it; everything is always done in the original feature space, which may also be a problem.

Boosting is another well-known machine learning approach, and it too has never succeeded in becoming a deep model. I think the reasons are much the same: first, its complexity is not high enough; second, and more importantly, it always works in the original feature space; all of its learners live in that original space, with no feature transformation in between.

So why are deep neural networks successful? What are the key reasons? I think we first need two things: layer-by-layer processing, and internal feature transformation. Once we want these two things, a deep model is a very natural choice; with such a model we can easily do both. But once we choose a deep model, many problems follow: it overfits easily, so we need big data; it is hard to train, so we need many training tricks; and the computation is very expensive, so we need very powerful computing devices such as GPUs.

In fact, all of these are consequences of choosing deep models; they are not the reasons we use deep learning. This is a bit different from how we used to think. It is not that having these things led us to use deep models; the causality runs the other way round: because we want to use deep models, we have to think about all of these things.

One more thing to note: when we have very large training data, we must also have a sufficiently complex model. If we use a linear model, it makes little difference whether you give it 20 million samples or 200 million; it cannot learn anything more. So the need for sufficient complexity adds yet another point in favor of deep models.

For these reasons, we believe these are probably the most important factors in deep learning. So here is our current understanding:

  • First, layer-by-layer processing;
  • Second, internal feature transformation;
  • Third, sufficient model complexity.

These three things are, we now think, the key reasons deep neural networks work, or at least that is our conjecture. And if these are the conditions that matter, we can immediately see that we do not really have to use a neural network; a neural network is just one of several options. As long as a model does these three things at the same time, it can work; it does not have to be a deep neural network.

▌ Defects of deep neural networks

We have to ask: do we need to consider models other than neural networks? The answer is yes, because everyone knows that neural networks have quite a few shortcomings.

First, as anyone who has used deep neural networks knows, you spend enormous effort tuning the hyperparameters, because it is a huge system. This brings many problems. To begin with, tuning experience is hard to share. Some friends may say that the experience gained on one image data set can be reused on a second image data set. But have we considered that the parameter-tuning experience gained on a large network for images may be of little use on, say, a speech problem? So across tasks, the experience may not carry over.

This also brings a second problem. Today we all pay great attention to the reproducibility of results, whether in scientific research or in engineering practice, yet all too often one group publishes a paper reporting a result that other researchers find hard to reproduce, because even with the same data and the same method, different hyperparameter settings lead to different results.

Moreover, when we use a deep neural network, the model complexity must be specified in advance: before training, the structure of the network has to be fixed, and only then can we train it with the BP algorithm and so on. That is a big problem, because how can we know, before we have solved the task, how complex the model should be? So in practice people usually make the model more complex than necessary.

If you look at the progress in deep neural networks and deep learning over the last three or four years, what is much of the cutting-edge work doing? In effect, it is reducing the complexity of the network. ResNet is one example, and model compression, which is used a lot these days, is another: people first use excessive complexity and then bring it back down.

So is it possible for the model's complexity to adapt to the data from the very beginning? That may be difficult for neural networks, but it is possible for other models. There are many other problems as well, such as the difficulty of theoretical analysis, the need for very large amounts of data, the black-box nature of the model, and so on.

On the other hand, you might say: those are things to worry about in academic research; I work on applications, just solve my problem. Even from that point of view, it is important for us to look beyond neural networks. Popular and successful as they are, the best performance on many tasks is in fact not achieved by deep neural networks. Take the Kaggle competitions that people follow closely: they feature all kinds of real-world problems, such as air ticket pricing, hotel booking, product recommendation and so on.

If we look at the winners, many of them today are not neural networks; many are models such as random forests. If we look closely, neural networks tend to win on typical perception tasks such as images, video and sound, while on other tasks involving hybrid modeling, discrete modeling and symbolic modeling, they actually perform worse than other models.

If we re-summarize this from an academic point of view, we can say that the deep models we talk about today are basically all deep neural networks. In the jargon, they are models composed of multiple layers of parameterized differentiable nonlinear modules that can be trained with the BP algorithm.
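Spelled out as a formula (my own paraphrase of that jargon, not a slide from the talk), such a model is a nested composition of parameterized, differentiable modules, and BP works precisely because the chain rule can be pushed through every layer:

$$
F(x) = f_L\big(f_{L-1}(\cdots f_1(x;\theta_1)\cdots;\theta_{L-1});\theta_L\big),
\qquad
\frac{\partial \mathcal{L}}{\partial \theta_k}
= \frac{\partial \mathcal{L}}{\partial f_L}\,
\frac{\partial f_L}{\partial f_{L-1}}\cdots
\frac{\partial f_{k+1}}{\partial f_k}\,
\frac{\partial f_k}{\partial \theta_k}.
$$

If even one module $f_k$ is non-differentiable, a decision tree for example, this chain breaks, which is exactly the difficulty raised next.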

This raises two problems. First, the properties of the problems we encounter in the real world are not necessarily differentiable, nor are they necessarily best modeled by differentiable models. Second, over the past few decades the machine learning community has produced many, many learning modules that could serve as building blocks of a system, and quite a few of them are not differentiable.

So can such modules be used to build deep models? Can deep models built from them achieve better performance? Can we make them deep, and thereby win on the tasks where today's deep models cannot even beat random forests?

So we now face a big challenge, not just academically but also technically: can we build deep models out of non-differentiable modules?

Once this question is answered, many other questions can be answered along with it. For example, are deep models necessarily deep neural networks? Can we make non-differentiable models deep, in which case we can no longer train with the BP algorithm? And can deep models then win on more tasks? In fact, after we raised this question, some international scholars put forward similar views. For example, Geoffrey Hinton, the famous pioneer of deep learning, also suggested that deep learning might one day do without the BP algorithm, and he raised this idea a little later than we did. So I believe these questions are being explored at the very cutting edge.

Inspired by this, we set out to realize the three points concluded from the analysis I just shared with you: first, layer-by-layer processing; second, internal feature transformation; third, sufficient model complexity.

▌ Deep forest

My own research group has recently done some work in this direction: we recently proposed the deep forest approach.

I will not go into the technical details today. It is a tree-based approach that borrows many ideas from ensemble learning. On many different tasks, the results its models obtain are highly comparable with those of deep neural networks, except on some large-scale image tasks and the like. On other tasks, and especially across tasks, we can use the same set of hyperparameters and obtain good performance on very different tasks, without having to tune the parameters task by task.

Another important feature is adaptive model complexity: the model automatically decides how deep to grow based on the amount of data. It has many other nice properties as well. Many friends have downloaded our open-source code to try it, and we will later release a larger distributed version and so on, because doing larger tasks requires a larger-scale implementation; it is no longer something a single-machine version can handle.
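To illustrate the cascade idea and the adaptive depth (a minimal sketch of my own built on scikit-learn forests; it is not the group's gcForest implementation, and it omits multi-grained scanning and uses in-sample rather than cross-validated class vectors), each level's forests output class-probability vectors, those vectors are appended to the original features as input to the next level, and the cascade stops growing once validation accuracy stops improving:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split

def grow_cascade(X, y, max_levels=10, seed=0):
    """Grow forest levels until validation accuracy stops improving (adaptive depth).

    Assumes class labels are encoded as 0..C-1.
    """
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=seed)
    aug_tr, aug_va = X_tr, X_va          # features are augmented level by level
    levels, best_acc = [], -1.0
    for level in range(max_levels):
        forests = [
            RandomForestClassifier(n_estimators=100, random_state=seed + level),
            ExtraTreesClassifier(n_estimators=100, random_state=seed + level),
        ]
        tr_probs, va_probs = [], []
        for f in forests:
            f.fit(aug_tr, y_tr)
            tr_probs.append(f.predict_proba(aug_tr))   # the "internal feature transformation"
            va_probs.append(f.predict_proba(aug_va))
        # this level's prediction: average the forests' class-probability vectors
        acc = np.mean(np.argmax(np.mean(va_probs, axis=0), axis=1) == y_va)
        if acc <= best_acc:              # stop growing: complexity adapts to the data
            break
        best_acc = acc
        levels.append(forests)
        # the next level sees the original features plus this level's probability outputs
        aug_tr = np.hstack([X_tr] + tr_probs)
        aug_va = np.hstack([X_va] + va_probs)
    return levels, best_acc
```

In the published gcForest method, the augmented class vectors are generated by k-fold cross-validation to reduce overfitting, and multi-grained scanning is added for images and sequences; but the layer-by-layer processing, the feature transformation between levels, and the data-driven stopping rule are the same three ingredients discussed above.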

On the other hand, we should see this as the exploration of a new direction for the discipline. Although today it can already solve part of the problem, its further development may open prospects we cannot yet fully anticipate. Let me briefly review convolutional neural networks here: this is a very popular technology today, yet it actually developed over a very long period.

Convolution itself, in signal processing, is more than a century old, and the history of this kind of deep neural network began in 1962 with the work on the biological visual cortex by two scientists who later won the Nobel Prize. Convolution was first introduced into neural networks in 1982; after that a lot of work followed, and the BP algorithm was introduced in 1989, by which time the algorithm had basically taken shape. The first complete description of CNN appeared in 1995, and in 1998 it achieved great success in recognizing American bank checks. In 2006, deep models trained by unsupervised layer-by-layer training were proposed, and in 2009 this technique was applied to CNN, making deep CNNs feasible, which directly set off the wave of deep learning.

Looking at this history, from the emergence of convolutional neural networks to the point where the algorithm really made a huge impact in industry took thirty years of development. I always say that there is really no such thing as a disruptive technology appearing out of nowhere; every technology evolves step by step. Today we make new discoveries that can solve some problems, but we should take the long view: after many more years of work by many people, today's discoveries should become an important foundation for future technology.

As for the work we did on deep forest, I think it is really just the beginning of this direction. I will not go into the technical details, but it uses ensemble learning throughout, drawing on essentially all the diversity-enhancement techniques I know of; if you are interested, you can refer to a book on ensemble methods that I wrote myself.


What is the most important significance of this work? We used to say that deep learning is a dark room. What is in this dark room? We only knew that there were deep neural networks inside. Now that we have opened a door to this room and put deep forest in, I think more things will probably appear in the future. In the sense of academic and scientific development, that is the more important value of this work.

▌ The most important thing in the era of AI is talent

Finally, I would like to spend two minutes talking about the comprehensive and in-depth cooperation between the School of Artificial Intelligence of Nanjing University and JD in scientific research and talent cultivation.

Regarding the development of the AI industry, we have to ask: what do we actually lack? Is it equipment? In fact, artificial intelligence research does not require special, secret equipment; as long as you spend the money, the equipment can be bought, and GPUs are not embargoed high-end goods. Is it data, then? Not really. Our capabilities for collecting, storing, transmitting and processing data have all improved enormously; data is everywhere.

What the era of artificial intelligence really lacks is talent, because for this industry, how good your talent is determines how good your AI is. So we are seeing a global scramble for AI talent, not only in China but also in the United States. That is why we set up the School of Artificial Intelligence.

After informatization, human society will inevitably enter the age of intelligence; this can be said to be an irreversible trend, because providing people with intelligent, data-driven assistance makes things easier, and that is what we all want. The steam-engine revolution freed us from heavy manual labor; the artificial intelligence revolution should free us from some of the simple, repetitive mental labor. Moreover, artificial intelligence as a discipline is different from short-term investment fads and hot spots: it has developed over more than sixty years and has accumulated a large body of solid, genuine knowledge.

In the investment boom there may be buzzwords that are hot one year and gone the next. If we examine those words carefully, what is their scientific meaning? Perhaps few people can say clearly. Artificial intelligence is completely different from such things: it is a discipline that has developed over more than sixty years.

There is a serious shortage of high-level AI talent, and this is a worldwide problem. Many of our enterprises spend a great deal of money to poach people, but poaching does not create any new talent; it only moves existing talent around. Therefore, we should start from the source and cultivate high-level AI talent for the development of the country, society and industry.