On February 9, 2020, the main conference hall of AAAI 2020 welcomed three special guests on stage: Geoffrey Hinton, Yann LeCun, and Yoshua Bengio, the pioneers of the deep learning era we all know and love.

Only a few years ago they were rarely seen at general computer science conferences, and Hinton said he had not attended an AAAI meeting in a long time; after all, a decade ago neural networks were rejected by mainstream computer science researchers, and papers on them were routinely turned down at conferences. Now, with deep learning having become the absolute mainstream of machine learning research and the core technology of artificial intelligence in the public eye, the 2018 Turing Award was finally given to these three men, in (belated) recognition of their contributions.

During the two-hour special event, each of the three gave a 30-minute talk, followed by a 30-minute roundtable discussion that also took questions from the audience.

Hinton was the first to speak, and as is customary there was a brief introduction before he took the stage. Vincent Conitzer, one of the two program chairs of AAAI 2020, said: "We all know that the story of these three men is one of talent and persistence. It may be hard to imagine today, but at the time neural networks could hardly have been less fashionable. Geoff, Yann, and Yoshua did much of their key work under exactly those conditions, and their stories encourage us to follow our own research directions rather than jump on the hottest topics."

Vincent also told an amusing story about Hinton, one that Hinton has told about himself. We all know that Hinton has long been trying to figure out how the human brain works. One day he told his daughter, "I've figured out how the brain works," and her response was, "Dad, why are you saying that again?" Apparently it happens every few years.

The audience laughed, and then Geoffrey Hinton took the podium to applause. AI Tech Review has compiled the full text of his speech below.

Today I’m going to talk about some recent research done with Adam, Sara and Yee-Whye. I’m not going to talk about philosophy today, and I’m not going to explain why I haven’t been to AAAI meetings for a long time. (Laughter) I’m going to tell you about this research.

Starting, once again, by criticizing CNNs

Object recognition methods fall into two main categories. One is the older part-based models, which use modular, interpretable representations but usually require a lot of hand-engineered features, so they often lack a learned hierarchy of parts. The other is convolutional neural networks, which are trained entirely end to end. There is a basic fact about object recognition: if a feature detector is useful at one position in the image, it will also be useful at another position. CNNs have this property built in, so knowledge learned at one position transfers to other positions, and they generalize well across translation.
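To make the weight-sharing point concrete, here is a minimal NumPy sketch (not from the talk; the kernel and image are invented for illustration): one shared detector responds identically wherever its pattern appears.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' cross-correlation: slide one shared kernel over the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A small vertical-edge detector and an image containing the same edge twice.
kernel = np.array([[1., -1.],
                   [1., -1.]])
image = np.zeros((6, 8))
image[:, 2] = 1.0   # an edge at column 2
image[:, 6] = 1.0   # the same edge, translated to column 6

response = correlate2d_valid(image, kernel)
# Weight sharing means a feature learned at one position is automatically
# detected at another: the response is identical at both edge locations.
print(response[0, 1], response[0, 5])
```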

But there's a big difference between CNNs and human perception. The first part of my talk is really aimed at Yann LeCun: I'm going to point out what's wrong with CNNs and tell you why CNNs are rubbish. (Laughter)

CNNs are designed to handle translation, but they don't handle other kinds of viewpoint change nearly as well, such as rotation and scaling (though a bit better than is generally assumed). One way to deal with this is to replace the two-dimensional feature maps with four- or six-dimensional ones, but the computational cost grows far too quickly. So in practice we train CNNs on many different viewpoints so that the model learns to generalize across them, which is inefficient. The ideal neural network would need little extra effort to generalize naturally to new viewpoints: having learned to recognize something, it could recognize it magnified ten times or rotated 60 degrees. Computer graphics works this way, and we want to design neural networks that are closer to that.

Equivariance versus invariance. A typical CNN, especially one with pooling, tries to build representations that do not change as the viewpoint changes; that is invariance. What I want is equivariance, which is different: as the viewpoint changes, the representation changes with it. What I believe is that in the human perceptual system, when your viewpoint changes, the pattern of neural activity changes. I'm not saying the perceived label changes; obviously the label has to stay the same, but the representation in your perceptual activity can change a lot. What does not change with viewpoint are the connection weights, which encode the relationships between different things. I'll come back to that.
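A small sketch of the distinction, again only an illustration: pooling makes the code invariant (the shift disappears), whereas the un-pooled feature map is equivariant (it shifts along with the input).

```python
import numpy as np

def max_pool_1d(v, width=4):
    """Non-overlapping max pooling over a 1-D activity vector."""
    return v.reshape(-1, width).max(axis=1)

activity = np.zeros(16)
activity[5] = 1.0                    # a feature detected at position 5
shifted = np.roll(activity, 2)       # the same feature, shifted to position 7

print(max_pool_1d(activity))         # [0. 1. 0. 0.]
print(max_pool_1d(shifted))          # [0. 1. 0. 0.]  -> invariant: the shift is thrown away
print(activity.argmax(), shifted.argmax())   # 5 7 -> equivariant: the code moves with the input
```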

CNNs also don't parse images. When you ask a CNN to recognize an image, it doesn't do any explicit parsing; it doesn't try to figure out what is a part of what. You can understand a CNN this way: it looks at lots of pixel positions and builds richer and richer descriptions of what exists at each position, based on more and more context. Finally, when the description is rich enough, you know there is a particular thing in the image. But a CNN never explicitly parses the image.

The way CNNs recognize objects is clearly very different from how humans do it. Add a little carefully chosen noise to an image and a CNN will recognize it as something completely different, while we humans barely notice any change in the image. This is a very strange phenomenon, and to me it is evidence that CNNs use quite different information to identify images than we do. That doesn't mean CNNs are wrong, but it is very different from what humans do.

Another complaint I have about CNNs is how a unit decides whether to activate: it takes the activities in the layer below, multiplies them by weights, and adds them up. It's a process of collecting clues and summing them; if you collect enough clues, the unit activates. That is evidence accumulation, not coincidence detection, and coincidences are special. Coincidences are actually very important; physics, for example, is largely about coincidences between two different quantities, where the two sides of an equation, theory and experiment, agree. A coincidence in a high-dimensional space is very significant. For example, if you heard "February 9th, New York" on the radio, and then saw "February 9th, New York" a few more times in other sources, always February 9th and always New York, you would be struck by it; that is a coincidence in a high-dimensional space, and it is very significant.

So the kind of neuron we use now doesn't look for coincidences. But things are changing: we're starting to use Transformer models, and Transformers do look for coincidences; I'll explain that in a minute. You take the dot product of two activity vectors, which is much better than what we did before: you're computing whether those two activity vectors agree, and if they do, you activate. That's how Transformers work, and it leads to better filters. It also leads to models that are more sensitive to covariance structure in images, and what really matters here is the covariance structure of the pixels.
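As a rough illustration of the two styles of unit (my own toy example, not Hinton's): a classic neuron scores its input against fixed learned weights, while an attention-style unit scores the agreement between two activity vectors, which is exactly a high-dimensional coincidence test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Classic neuron: dot product of the incoming activity with *fixed learned weights*.
weights = rng.normal(size=16)
activity = rng.normal(size=16)
classic_score = activity @ weights          # "add up enough clues and fire"

# Attention-style unit: dot product of *two activity vectors*, i.e. a coincidence test.
query = rng.normal(size=16)
key_matching = query + 0.1 * rng.normal(size=16)   # nearly the same direction
key_random = rng.normal(size=16)

print(classic_score)
print(query @ key_matching)   # large: the two activities agree (a "coincidence")
print(query @ key_random)     # near zero on average: no agreement
```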

Finally, the most serious problem with CNNs is that they don't use coordinate frames. When we humans look at something, whenever we see a shape we impose a coordinate frame on it. This is a fundamental feature of human perception. I don't have a lot of time, so I'll try to convince you of this very quickly with an example.

Since I don't have time for any fancy demos, let's just look at these two shapes. The one on the left looks like a map of some country, a bit like Australia; but if I tell you the shape isn't upright, it's tilted, then it looks like Africa. Once you see it as Africa, it no longer looks like the mirror image of Australia that it seemed at first. But it doesn't look like Africa at first glance; if you're just told it's a country, you simply see it as a country.

The shape on the right is either an upright diamond or a square rotated 45 degrees, and depending on which you take it to be, your perception of it is completely different. If you see it as a diamond, you notice any slight difference between the angles on the left and the right, but you don't notice whether the four corners are right angles; you simply don't attend to that. In other words, if I stretch it a little vertically, so that the four corners are no longer right angles, it still looks to you like a perfectly good upright diamond.

Conversely, if you see it as a square rotated 45 degrees, you notice whether the four corners are right angles; even if one goes from 90 degrees to 88 degrees, you can tell it's no longer a right angle. But at the same time you don't care whether the left and right corners are the same.

So, depending on which coordinate frame you impose, your internal perception is completely different. Nothing in the design of CNNs can explain this phenomenon: a CNN produces a single percept for each input, and that percept does not depend on any choice of coordinate frame. I think this is related to adversarial examples, and to the fact that CNN perception is very different from human perception.

I think a good way to do computer vision is to treat it as inverse computer graphics; that idea goes back a long, long way. Computer graphics programs use hierarchical models of spatial structure, with matrices representing the transformation between the coordinate frame embedded in the whole and the coordinate frame of each part.

The whole object has its own intrinsic coordinate frame (we can assign one), and each part of the whole has its own coordinate frame too. Once the frames are chosen, the relationship between a part and the whole is determined, and it's a simple matrix operation; for a rigid body it's a linear relationship.
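A minimal sketch of that graphics-style relationship, assuming 2-D poses written as 3x3 homogeneous matrices (the numbers are arbitrary): the viewpoint-independent whole-to-part matrix is fixed, and composing it with the current viewer-to-whole pose gives the part's pose for free.

```python
import numpy as np

def pose(theta, tx, ty, scale=1.0):
    """A 2-D pose as a 3x3 homogeneous transform (rotation, scale, translation)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

# Viewpoint-independent knowledge: where the part sits in the whole's own frame.
whole_to_part = pose(theta=0.0, tx=2.0, ty=1.0)

# Viewpoint-dependent: where the whole currently is relative to the viewer.
viewer_to_whole = pose(theta=np.pi / 3, tx=5.0, ty=-2.0)

# The part's pose relative to the viewer is just a matrix product. Change the
# viewpoint and only `viewer_to_whole` changes; `whole_to_part` stays fixed.
viewer_to_part = viewer_to_whole @ whole_to_part
print(viewer_to_part)
```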

So it's a very simple linear structure, and that's the idea computer graphics uses. If you ask people who do computer graphics to show you the thing from another angle, they won't say, "I'd be glad to, but we haven't trained on that angle, so I can only turn it 15 degrees." They can show you any view you want, because they have a real 3D model: they model the spatial structure, the relationship between the parts and the whole, and those relationships are completely independent of viewpoint.

I think there's a real problem if you don't use this beautiful structure when dealing with images of three-dimensional objects. One reason is extrapolation: linear models extrapolate over long ranges very easily, while higher-order models are much harder to extrapolate. We've long been looking for linear latent manifolds, and in computer vision we know what they are: viewpoint change has a hidden linear structure with a huge effect on the image, and we have not been taking advantage of it.

The latest capsule network: the 2019 edition

Now I'm going to introduce a system called Stacked Capsule Autoencoders. Some of you may have read about capsule networks; this is yet another version of capsules. Every year I design a completely different kind of capsule network: the NeurIPS 2017 one used dynamic routing, the ICLR 2018 one used the EM algorithm, and NeurIPS 2019 has a new one, which is what I'm going to describe now.

So, first of all, forget everything about the previous versions of capsule networks; they were all wrong, only this one is right. The previous versions used discriminative learning, which I always knew was the wrong thing; I've always believed unsupervised learning is the right thing to do, so those versions were headed in the wrong direction. They also used part-to-whole prediction, which doesn't work well either; whole-to-part is much better. With part-to-whole prediction, if a part has fewer degrees of freedom than the whole (say the parts are points and you want the points to form a constellation), it's hard to predict the pose of the whole constellation from the position of one point; you need the positions of many points. So you can't predict the whole from a single part.

In this new version, we use unsupervised learning and whole-to-part relationships.

The idea behind capsules is to build more structure into the neural network and hope that the extra structure helps the model generalize better. It's inspired by CNNs, where Yann built in just one simple piece of structure, namely that feature detectors are replicated across translations, and that produced huge benefits. So my question is: can we go further in that direction, can we build in more structure, such as the ability to form parse trees and the like?

A capsule represents whether something exists; it learns what entity it should represent, and it holds some parameters of that entity. In the 2019 capsule, the final, correct one, there is a logistic unit, the light blue thing on the far left of the slide, that says whether the entity is present anywhere in the region of the image the capsule covers. In other words, the capsules themselves can be convolutional.

Inside the capsule there is also a matrix, the red one on the right, that represents the spatial relationship between the entity the capsule represents and the observer, that is, between the coordinate frame embedded in the entity and the observer's frame; from it you know which way the entity is facing, how big it is, where it is, and so on. And there is a vector of other properties, which includes things like deformations; if you are processing video it would also include velocity, colour, and so on.

Let me repeat the point: capsules are designed to capture intrinsic geometry. A capsule representing an object can predict the poses of its parts from its own pose, because the relationship between an object and its parts does not change with viewpoint. That relationship is what we want to store in the network's weights; that is the knowledge worth storing, and then we can use this viewpoint-independent knowledge to recognize objects.
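Here is one way such a capsule might be written down in code; the field names and the 3x3 pose convention are my own illustrative choices, not the paper's implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Capsule:
    presence_logit: float    # is the entity present anywhere in the region this capsule covers?
    pose: np.ndarray         # matrix relating the entity's intrinsic frame to the viewer
    features: np.ndarray     # other properties: deformation, colour, velocity, ...

def predict_part_pose(whole: Capsule, whole_to_part: np.ndarray) -> np.ndarray:
    """A whole capsule predicts a part's viewer-relative pose from its own pose and
    the viewpoint-independent whole-to-part relationship (stored in the weights)."""
    return whole.pose @ whole_to_part

# Example: a whole at some viewpoint predicts where one of its parts should be.
whole = Capsule(presence_logit=2.3,
                pose=np.array([[0., -1., 4.],
                               [1.,  0., 1.],
                               [0.,  0., 1.]]),
                features=np.zeros(16))
print(predict_part_pose(whole, np.array([[1., 0., 2.],
                                          [0., 1., 0.],
                                          [0., 0., 1.]])))
```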

Pay attention: if you understand this slide, you understand the new capsules. The idea is that we have a kind of autoencoder and we train it greedily, from pixels to parts, from parts to larger parts, from larger parts to still larger parts. The training is greedy: once you've extracted the parts from the pixels, you don't go back and re-derive them; you just take what you've got and go up a level, trying to assemble the parts into familiar wholes.

This slide shows the decoder of a two-layer autoencoder, but the units are not ordinary neurons; they are the more complex capsules. The lower layer holds information already extracted from the image into capsules (this is the inference step): we have a set of low-level capsules and already know whether each exists, what its attribute vector is, and what its pose relative to the observer is. We now learn a layer of higher-level capsules on top of them, and we want each higher-level capsule to explain several lower-level ones, that is, one whole capsule accounting for multiple part capsules. That is what has to be learned.

In this generative model we don't generate the low-level data directly; we generate predictions of what the low-level data should be, based on the high-level capsules. So we take the vector of parameters inside a high-level capsule and, via the dotted green lines here, use those parameters of the entity to predict, for each part, the spatial relationship between the whole and that part.

If the object is rigid you don't need those dotted green lines; the matrix is just a constant. If the object is deformable, you do need them. Each instantiated high-level capsule (I'll explain how they get instantiated in a moment) predicts a pose for each of the low-level capsules that have been extracted from the image. The three red squares circled by the ellipse here are the predictions that three high-level capsules make about the pose of one low-level capsule.

What we care about is that one of the high-level capsules should be able to explain the low-level capsule. So we use a mixture model. The implicit assumption in a mixture model is that exactly one component is the correct explanation, but in general you don't know which one it is.

The objective we chose is to maximize the log-likelihood that the mixture model defined by the high-level capsules assigns to the poses actually observed for the low-level capsules. Under this mixture model the log-likelihood is tractable. Everything is trained by backpropagation, which learns how to instantiate the high-level capsules.
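A hedged sketch of what that objective could look like, with poses flattened to vectors and an isotropic Gaussian standing in for the paper's actual noise model; the function names are invented for illustration.

```python
import numpy as np

def logsumexp(v):
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def gaussian_logpdf(x, mean, sigma):
    """Isotropic Gaussian log-density: a simple stand-in for the real noise model."""
    d = x.size
    return -0.5 * (np.sum((x - mean) ** 2) / sigma ** 2 + d * np.log(2 * np.pi * sigma ** 2))

def part_log_likelihood(observed_pose, predicted_poses, presence_logits, sigma=0.1):
    """Log-likelihood of one observed low-level pose (flattened to a vector) under a
    mixture whose components are the predictions of the K high-level capsules.
    Exactly one parent is assumed to explain the part, hence the mixture."""
    log_mix = presence_logits - logsumexp(presence_logits)     # log mixing proportions
    log_comp = np.array([gaussian_logpdf(observed_pose, p, sigma) for p in predicted_poses])
    return logsumexp(log_mix + log_comp)

# Example: 3 high-level capsules each predict a 6-number affine pose for one part.
rng = np.random.default_rng(0)
preds = [rng.normal(size=6) for _ in range(3)]
observed = preds[1] + 0.05 * rng.normal(size=6)    # capsule 1 explains this part well
print(part_log_likelihood(observed, preds, presence_logits=np.zeros(3)))
```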

When you backpropagate through the mixture model, the components that don't explain the data well have a posterior probability close to zero, so backpropagation barely changes them; they weren't doing anything. The components that provide the best explanation get the largest derivatives and are the ones that get learned and refined.

That is the design of the generative model. Notice that it embodies two ideas. First, each low-level capsule is explained by only one high-level capsule, which creates a parse tree in which every element has a single parent. Second, the pose of a low-level capsule is derived from a high-level capsule: the pose of the part relative to the observer is the matrix product of the pose of the whole relative to the observer and the pose of the part relative to the whole. So two things that are very important in vision, handling viewpoint change and building parse trees, are built into the model.

Now, I haven't shown you how to do the encoder, the perceptual part, and that is a hard inference problem. In the previous versions of capsules we hand-engineered the encoder: the low-level capsules voted for the high-level capsules and we checked whether the votes agreed, which was particularly difficult. Sara put a huge amount of time and effort into it and got it working, but it was very hard.

Fortunately, the Transformer came along while we were working on this. Transformers were designed for language, but they are very cleverly designed. Our situation is that we have some parts and want to infer the wholes from them, which is a very hard inference problem. With a Transformer, we can just feed everything in and let the parts attend to each other and sort it out.

So we use a multi-layer Transformer, ending up with a simple generative model paired with a complicated encoder. The multi-layer Transformer decides how agreement is handled and how the parts get organized, and all we have to do is find a way to train it.

To train a Transformer you would normally need the right answers. But here you don't actually need them: you only need derivatives that look at the answer it currently gives and push it toward a better answer, and those derivatives come from the generative model.

The idea is to take all the capsules that have been extracted and feed them into a multi-layer Set Transformer. Each low-level capsule starts with a vector describing its pose, and as you move up through the model that description is repeatedly updated using the other capsules as context. Once the descriptions of the parts are good enough, the last layer converts them into predictions of where the whole objects should be.

This multi-layer Set Transformer is easy to train because the generative model supplies it with derivatives. Its training objective is the same as the generative model's: maximize the log-likelihood of the actually observed part poses given the poses predicted from the high-level capsules. We also build in sparsity, encouraging only a few high-level capsules to be active at a time.

If you're interested in the multi-layer Set Transformer, read the paper; I won't go into more detail here.

I'm sure many of you already know how Transformers work, and I'm running out of time, so I'll go over it very, very quickly.

Here's the picture for sentences. You take a bunch of word vectors and run something like a convolutional net on top of them, so that each word vector gets updated based on the neighbouring vectors. The whole thing can be trained unsupervised, to reconstruct word vectors that have been deleted from the input.

So it's like an autoencoder built in a convolutional style, but the Transformer adds a more elaborate piece of hand design: instead of word vectors simply influencing the word vectors at the same and higher levels directly, each word vector produces a key, a query, and a value. As shown on this slide, a word vector takes its query, which is a learned vector, and compares it with the keys of the neighbouring word vectors; where they match, it takes some of the neighbour's value into its own new value. The process keeps looking for things that agree and combining them to get new representations. That is basically how a Transformer works.
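For concreteness, here is the textbook scaled dot-product self-attention step in NumPy; it is the standard mechanism being described, not the exact Set Transformer used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One self-attention step: every vector compares its query with the others' keys
    and takes a weighted mixture of their values as its updated representation."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # dot-product "coincidence" scores
    return softmax(scores, axis=-1) @ V

# 5 input vectors (word vectors, or capsule descriptions) of width 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8): same set, updated in context
```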

Here are the results of running this combination, a Set Transformer encoder plus a simple generative model with coordinate frames and parse trees, on a simple dataset.

Don't laugh: these are MNIST digit samples from the 1980s. I picked some difficult, ambiguous ones. I'm going to run the model I've described on them to see whether the idea is right. MNIST is modelled with one layer of parts, which turn out to be roughly strokes, and then a layer of wholes, the high-level capsules, which are roughly whole digits but don't correspond exactly to the digit classes.

Each part is a small, learned 11×11 template. I won't explain in detail how the parts are learned, because it works in essentially the same way as learning the wholes, so I'll focus on how the wholes are learned. The core idea is to model the pixel intensities with a mixture of predictions from the various parts, where each part can undergo an affine transformation, that is, its pose matrix lets the same template be instantiated in different ways.
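A rough sketch of instantiating one learned template under different affine poses (a crude nearest-neighbour warp, not the paper's differentiable renderer; the template and poses are invented):

```python
import numpy as np

def place_template(template, pose, canvas_shape=(40, 40)):
    """Paste an 11x11 part template into a larger canvas under an affine pose.
    For each canvas pixel, map back into template coordinates (inverse warp,
    nearest neighbour)."""
    inv = np.linalg.inv(pose)
    canvas = np.zeros(canvas_shape)
    for y in range(canvas_shape[0]):
        for x in range(canvas_shape[1]):
            tx, ty, _ = inv @ np.array([x, y, 1.0])
            ix, iy = int(round(tx)), int(round(ty))
            if 0 <= ix < template.shape[1] and 0 <= iy < template.shape[0]:
                canvas[y, x] = template[iy, ix]
    return canvas

# The same learned template, instantiated twice with different affine poses,
# can play two different roles in a digit (cf. parts 4 and 5 of the "4").
template = np.zeros((11, 11))
template[5, :] = 1.0                                               # a horizontal stroke
upright = np.array([[1., 0., 5.], [0., 1., 5.], [0., 0., 1.]])     # just shifted
rotated = np.array([[0., -1., 30.], [1., 0., 5.], [0., 0., 1.]])   # same stroke, rotated 90 degrees
image = np.maximum(place_template(template, upright),
                   place_template(template, rotated))
```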

There are a few digits here; let's look at the "4". The red part is what you get by extracting parts from the image and reconstructing the pixels from them. The green part is what you get by extracting the parts, activating the high-level capsules, and then reconstructing the low-level capsules and the pixels from the top down. Where red and green overlap you see yellow. As you can see, it is mostly yellow, with only thin red and green fringes, which means the two reconstructions differ very little.

The activations of the 24 high-level capsules are shown on the right. These capsules learn things like whole digits, or large chunks of digits, that don't correspond exactly to the digit classes.

Now let's look at how the parts make up the whole digit. For the "4", parts 4 and 5 are actually the same part under different affine transformations. Depending on the affine transformation, the instantiations look very different, so the same part can serve different purposes.

What I'm going to show you is this: after learning to extract the parts, you learn the wholes that explain how the parts combine. Then you take the 24-dimensional vectors of high-level capsule activations and plot them with t-SNE, which embeds these high-dimensional vectors in two dimensions so that similar vectors end up close together. Before looking at the picture, I should emphasize that no labels were ever used; this is completely unsupervised. And the result is:

It separates into 10 clusters, with clear boundaries and a few mistakes. Now if I label them, taking one sample from each cluster and using its label as the label of the whole cluster, I get 98.7% accuracy on MNIST directly; you could call that learning with no labels, or learning with 10 labels.
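A sketch of that evaluation protocol as I understand it, with a synthetic stand-in for the 24-dimensional capsule-presence vectors (the real numbers come from the trained model, not from this demo):

```python
import numpy as np
from sklearn.cluster import KMeans

def label_clusters_with_one_example(features, true_labels, n_clusters=10, seed=0):
    """Cluster unsupervised features, then spend exactly one ground-truth label per
    cluster (the label of one member) and report how well that labels everything."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    cluster_to_label = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        cluster_to_label[c] = true_labels[members[0]]   # one label per cluster
    predicted = np.array([cluster_to_label[c] for c in km.labels_])
    return (predicted == true_labels).mean()

# Tiny synthetic demo standing in for the 24-D capsule activations of MNIST images.
rng = np.random.default_rng(0)
centres = rng.normal(size=(10, 24)) * 5
true_labels = rng.integers(0, 10, size=500)
features = centres[true_labels] + rng.normal(size=(500, 24))
print(label_clusters_with_one_example(features, true_labels))   # close to 1.0 on this toy data
```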

In summary, MNIST is learned with this generative model in which parts have coordinate frames, and the natural classes in MNIST simply emerge. And note that MNIST digits are deformed: the relationship between a whole digit and its parts is not fixed but varies from instance to instance. It still works.

But there are two problems with this approach. The first is that human vision doesn't just take in a whole image and process it; we have a tiny fovea and have to choose where to point it. So seeing is really a sampling process, and not everything we see is at high resolution.

Human vision also depends on fixation points, and I've always believed that we perceive a figure against a background, not just shapes in isolation. That's why there are illusions like the one that is either a vase or two faces. If, psychologically, vision is seeing a figure against a ground, then this capsule model is modelling perception of the figure, not of the background. To model the background you want something more like texture or material modelling; you don't need to parse it into parts, and a variational autoencoder can do that job well.

So, to interpret MNIST digits on textured backgrounds, Sara trained a combination of stacked capsule autoencoders and a variational autoencoder, with the background modelled only by the variational autoencoder, and it does a much better job. It still doesn't perform as well as having no background at all, but I think if you want to solve the figure-ground problem, this is the right kind of theory: like people, we treat the background as just background when it's there, and we don't model it with the high-level, part-based machinery that is reserved for modelling shape.

The other problem is that this is all two-dimensional, and what we really care about is images of three-dimensional objects. Sara's previous version of the capsule network was tested on Yann's 3D object data to see whether it could handle real 3D images rather than just outlines.

To do that, we need the front end, the primary capsules, to represent the perceptible parts of objects. Think of vision as reverse engineering computer graphics: graphics builds the whole object, then its parts, then the parts of the parts, all the way down to triangles, and then renders them. Going in the reverse direction, the bottom capsules deal with things like lighting and reflectance, while the capsules above deal with geometry. What I've been talking about today is mainly about handling the hierarchy of geometric shapes.

What we're working on now is inverse rendering: going from pixels to perceptible parts. There are many possible approaches: we can use surface meshes, we can fit known geometric primitives, we can use half-space representations, and so on.

Final conclusion:

Prior knowledge about coordinate transformations and parse trees can easily be built into a simple generative model. An interesting advantage of putting the knowledge into the generative model is that the complexity of your recognition model, the encoder, doesn't interfere with it: you can make the encoder as complicated as you like, while how short the description length can get depends only on the complexity of the generative model.

So: design a generative model with some structure in it, and then dump the inverse problem, the inference, onto a big Set Transformer. If the Transformer is big enough, has enough layers, and is trained on enough data, good performance is almost guaranteed.

(End of speech)

Hinton finally has an answer he is happy with for the capsule networks he has been mulling over for years, and by the end of his talk the godfather of deep learning was beaming with satisfaction.