On December 10, 2015, the Visual Computing Group at Microsoft Research Asia, led by chief researcher Sun Jian, used a 152-layer neural network to win all three major tracks of the ImageNet challenge (image classification, object localization, and object detection) by a decisive margin.

Half a year ago, Dr. Sun Jian left Microsoft Research to join Megvii Technology (known for its Face++ platform; hereinafter Face++) as chief scientist and head of research, sparking heated discussion in the industry. Two months ago, Dr. Sun published "The Beauty of Research in Startups," detailing the research directions of Face++ and how that research is conducted. In his view, there is no essential difference between Face++ Research and MSR in mission positioning, personnel composition, or R&D methods: both are groups of self-driven people with a geek spirit exploring cutting-edge technology.

But many questions remain. He has worked in the field of images for more than ten years; why did he choose a startup? What changed in moving from a big company to a startup? What is the next "big" problem in image recognition? And how did the idea of a 152-layer neural network come about?

To that end, Heart of the Machine interviewed Dr. Sun Jian about residual networks, the ImageNet tests, data annotation, and related topics. The conversation is presented below for readers.


About 152-layer neural networks and residual learning


Heart of the Machine: In the 2015 ImageNet tests, you led the team that used a 152-layer neural network and won three major championships. How did you and your team come up with this approach, and how did you implement it?

Sun Jian: Most of the time in research, a method is arrived at only after countless attempts, while a complicated method gets simplified along the way. We tried many approaches to this problem, some of which we never published. After a great many experiments we finally arrived at the residual network and found it very effective.

Once we had found a method that worked, we analyzed how and why it works. In the end we presented it in the paper as residual learning, which we believed was the best explanation at the time. Later, many people tried new explanations and improvements; there were explanations A, B, and C, some of which we agreed with and some we did not. It was actually quite interesting.

The point of the residual network is not how many layers you can stack; at its core, it makes optimizing deep networks easy. The residual network is equivalent to re-describing the problem: the problem itself is essentially unchanged, but in the new form it can be solved easily with existing optimization algorithms. Networks that did not converge before now converge; networks that used to converge to a very bad result now converge easily to a very good one. So it essentially solves an optimization problem.
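As an illustration of this idea (a minimal sketch, not the exact block from the paper, which also handles strides and channel changes with projection shortcuts), a residual block learns a correction F(x) and adds it back to the input, so representing an identity mapping becomes trivial:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = relu(x + F(x)).

    Illustrative sketch only; the published ResNet blocks also handle
    strides and channel changes with a projection shortcut.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The shortcut: if the best mapping is near identity, the block
        # only has to push the learned residual toward zero, which is
        # easy for existing optimizers.
        return self.relu(x + out)
```

Because gradients flow unimpeded through the identity path, stacking many such blocks no longer makes the network hard to optimize.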

A screenshot of the results from the ImageNet 2015 test (ILSVRC2015) website

This problem puzzled neural network researchers for a long time. Why is it called deep learning? Depth is the number of layers in the network: the more layers, the deeper. In 2012, Geoffrey Hinton's team built an 8-layer network, and they devoted a section of the paper to showing that 8 layers work better than 5 and that deeper is better, because at the time many people still did not believe depth mattered. Even after they succeeded, there were papers arguing that shallow networks could do just as well and that "deep" was not necessary.

For a long time in the history of neural network research, people did not believe networks that deep could be optimized. Before deep learning, we studied SVMs (support vector machines) and sparse representation, which are largely linear problems. People tried to study convex problems, and tried to transform non-convex problems into convex ones. A network this deep and this complex, highly nonlinear, with so many parameters and comparatively little data: many people simply did not believe it could be optimized. Today, many factors address this to a considerable degree. Residual learning is an important one, but not the only one.

As a result, almost any deep network can be trained easily today. Depth no longer means poor convergence during training; the old spell has been broken.

In the end, the residual network was the collective work of the team (He Kaiming, Zhang Xiangyu, Ren Shaoqing, and myself). Without any one of us, I would not dare say we could have reached this point; the process went through many failures and twists. I feel very lucky that the four of us, with our different skills, came together to build this "big monster." Immersing myself in the work with them was one of the most memorable experiences of my research career.


Heart of the Machine: Can residual networks be used in fields other than image recognition?

Sun Jian: Recently they have been used in speech recognition and natural language processing. It is an idea, not a method limited to image recognition. The idea works everywhere else, and we see many examples of it, large and small.

From the paper "Deep Residual Learning for Image Recognition": a comparison of results on ImageNet after applying residual networks

The most advanced and most complex systems all use this idea, though not necessarily in the exact form of the residual network. For example, if some component of a language processing pipeline should be deep, it can now actually be made deep instead of stopping at two layers. With residual learning, or skip connections, it can be made deep, effective, and easy to train. It is not that you could not go that deep before; you could, but the results were worse. Now you have the freedom to go as deep as you want.
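To illustrate how the same idea transfers beyond images (a hedged sketch with made-up layer sizes, not drawn from any specific speech or NLP system), a skip connection can wrap any sublayer:

```python
import torch
import torch.nn as nn

class ResidualWrapper(nn.Module):
    """Wraps an arbitrary sublayer f so the output is x + f(x)."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(x)

# A deep stack stays easy to train because gradients can always
# flow through the identity paths. Sizes here are arbitrary.
dim = 256
deep_stack = nn.Sequential(*[
    ResidualWrapper(nn.Sequential(nn.Linear(dim, dim), nn.ReLU()))
    for _ in range(24)
])
y = deep_stack(torch.randn(8, dim))  # a batch of 8 feature vectors
```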

Of course, that does not mean the deeper the better; it depends on the problem and the data. There is definitely a trade-off between complexity and effectiveness, but depth is no longer the constraint.


Heart of the Machine: Will you continue to work on residual networks?

Sun Jian: This is one of our intermediate results. The residual network is one thing, but in our research we are trying to find the next big idea. Of course, the new structure might incorporate the residual approach, because it is a nice idea and not tied to one specific network.

Many people have since developed networks with all kinds of structures, but the idea of residual networks is an essential part of them. Essentially all networks are now residual networks, so the focus is no longer on adding residual connections; with them as the foundation, we study other characteristics and try to understand the problem anew, asking how we can do better. For example, a network that classifies well is not necessarily suited to problems such as detection and segmentation. Only by understanding a problem more deeply can we design the network best suited to it.


About ImageNet tests and data


Heart of the Machine: ImageNet has been around for a long time. Can its test results still be used to judge whether an image recognition model works? And how should we decide whether an image recognition model is good?

Sun Jian: ImageNet still has value today. When you take on a new problem, there is very little newly annotated data, and this dataset is indispensable. It is very generic: a model pre-trained on it is certainly not optimal for your task, but it works very well when you have very little data. ImageNet is also well constructed, and its training and test sets are very consistent. It is a platform from which research methods and new ideas are born; even the face recognition we do inherits ideas and practices from ImageNet.
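As a concrete sketch of the pre-training workflow described here (the backbone and the 10-class head are placeholder assumptions for illustration), fine-tuning an ImageNet-pretrained model on a small dataset typically looks like this:

```python
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet (ResNet-18 chosen arbitrarily).
model = models.resnet18(pretrained=True)

# Freeze the pre-trained features; with very little data it is common
# to train only the new classification head.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-way ImageNet classifier with a head for the new task
# (10 classes is a placeholder).
model.fc = nn.Linear(model.fc.in_features, 10)
```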

Of course, winning a championship within the rules of the game deserves congratulations, but the main thing is whether there is a new method or idea that others can use. With the rapid development of deep networks, the ImageNet 1K data is now prone to serious overfitting, and I look forward to the emergence of the next generation of ImageNet. We have also been thinking recently about how to design a better ImageNet.


Heart of the Machine: Fei-Fei Li has also built Visual Genome, which combines images with language. What else do you think is worth doing with datasets?

Sun Jian: Visual Genome is a very good dataset. Fei-Fei Li and her team have put great effort into it, and we are using it too. The dataset contains more than one level of annotation per image: the objects in the image and the relations between them are marked, including action relations and positional relations.

See the Paper section of the Visual Genome website for more information on the labeling
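As an illustrative sketch only (not the actual Visual Genome file format), annotations of this kind can be thought of as a small scene graph of objects plus typed relations between them:

```python
from dataclasses import dataclass

@dataclass
class ObjectAnnotation:
    name: str
    box: tuple  # (x, y, width, height) in pixels

@dataclass
class RelationAnnotation:
    subject: ObjectAnnotation
    predicate: str  # an action or positional relation
    obj: ObjectAnnotation

# Toy scene: "a man rides a horse; the horse stands on the grass."
man = ObjectAnnotation("man", (120, 40, 80, 200))
horse = ObjectAnnotation("horse", (100, 120, 200, 160))
grass = ObjectAnnotation("grass", (0, 240, 640, 120))
relations = [
    RelationAnnotation(man, "riding", horse),  # action relation
    RelationAnnotation(horse, "on", grass),    # positional relation
]
```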

These are the things you must have to study cognitive problems. For example, "there is no horse on top of a house" is common sense. You could try to cover this with a lot of statistics by feeding in a lot of training data, and it is true that in the data no horse appears on a house. But that only means the case has not appeared yet; the moment your algorithm outputs such a thing, it has already made a mistake. If language is introduced, it can tell you the common sense directly: there are no horses on houses.

In other words, why is it important that Visual Genome describes photos so explicitly? If you wanted to teach a computer to recognize pictures, how would you do it? Language is probably the most natural way to teach a computer how to read pictures.

Hopefully the dataset will grow; adding two orders of magnitude to it might produce the next unexpected breakthrough.


Heart of the Machine: Will annotation data with more dimensions be an important direction for solving the image recognition problem?

Sun Jian: At this stage it may be. There are two new directions we are also trying. One is to create synthetic data: use computer graphics to render very realistic images that look like real training images. This method can produce large amounts of data that come with annotations and can achieve very good results, though making the images truly realistic still requires learning from our colleagues in graphics. The other approach, through adversarial learning or generative adversarial networks, is to automatically generate new samples from a set of samples, without supervision.
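A minimal sketch of the second approach, a generative adversarial network, follows; the two-layer architecture, data size, and hyperparameters are assumptions made for the example, not details from the interview:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # e.g. 28x28 images flattened (assumed)

# Generator: maps random noise to a synthetic sample.
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)
# Discriminator: scores how real a sample looks (logit output).
D = nn.Sequential(
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    n = real_batch.size(0)
    fake = G(torch.randn(n, latent_dim))

    # Discriminator: label real samples 1, generated samples 0.
    opt_d.zero_grad()
    loss_d = (bce(D(real_batch), torch.ones(n, 1)) +
              bce(D(fake.detach()), torch.zeros(n, 1)))
    loss_d.backward()
    opt_d.step()

    # Generator: try to make the discriminator output 1 on fakes.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()
```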

As for annotations, some are made by hand, and some already exist online, including the correlation between adjacent frames in a video. We train face recognition by knowing that one set of photos is the same person and another set is not. Or even just knowing "these are the same person, and those are the same person" is enough to train on.
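One standard way such same-person / different-person pairs drive training is a contrastive loss over face embeddings; the sketch below is a generic textbook formulation, not necessarily the loss Face++ uses:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_person, margin=1.0):
    """Pull same-person embeddings together, push different ones apart.

    emb_a, emb_b: (batch, dim) embeddings from any face encoder.
    same_person:  (batch,) tensor of 1.0 (same) or 0.0 (different).
    """
    dist = F.pairwise_distance(emb_a, emb_b)
    loss_same = same_person * dist.pow(2)
    loss_diff = (1 - same_person) * F.relu(margin - dist).pow(2)
    return (loss_same + loss_diff).mean()

# Toy usage with random stand-in embeddings.
a, b = torch.randn(4, 128), torch.randn(4, 128)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(a, b, labels))
```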

There are many ways to obtain training data, and training data matters enormously, so we try our best to get the best data. For Megvii researchers, getting data is part of the job: figuring out how to obtain it or how to create it.


About the past and future of computer vision


Heart of the Machine: You have been involved in the field of images for nearly 20 years. What do you see as the characteristics of the field? What are some of the milestones to date?

Sun Jian: I still feel like a novice in this field; my understanding of many things is relatively shallow, so I dare not pass quick judgment on milestone events, and that is not false modesty. Of course, deep learning is the most recent and most important event. Before that, it was probably the introduction of machine learning methods into computer vision, which changed the way many problems in the field were studied. Before today's deep learning boom, the emphasis was on how to use machine learning and statistical learning to study and think about vision problems.

That did not happen at a single point in time; it happened slowly. It changed computer vision so much that a very large share of computer vision researchers today are really machine learning researchers.

Another milestone was the spread of depth sensors. The launch of Microsoft's Kinect, unveiled in 2009, was a big deal because 3D information finally became available easily and cheaply. Computer vision has two big problems: image understanding and three-dimensional reconstruction. Solving 3D was a dream; it used to take two or more pictures and a lot of effort to reconstruct a scene. Today there are sensors that measure 3D directly, which immediately opens up many applications, now and in the future.

As for the future, my mentor Dr. Shen Xiangyang often quotes: "The best way to predict the future is to create it."


Heart of the Machine: What problems need to be solved for computer vision, or image recognition, to develop further?

Sun Jian: I think it is hard to predict. Today everyone is studying unsupervised learning, because supervised learning is relatively mature while unsupervised learning is not good enough yet; that is a very big problem. I read On Intelligence many years ago, reread it recently, and was inspired again. Unsupervised learning is certainly important, and there is a lot of research going on, but it will not solve the problem right away, and it is hard to say how much immediate practical value there is in generating one pile of unlabeled data from another pile of unlabeled data.

On Intelligence, subtitled "How a New Understanding of the Brain Will Lead to the Creation of Truly Intelligent Machines"

I am looking at two things right now. One is that deep neural networks must be able to remember things. Not short-term memory, but long-term memory: as a child grows up, it builds a big memory bank; you put things in, you can decide whether to remove them, and you can combine them. You need memory mechanisms. In most of today's supervised learning, what is "remembered" is stored implicitly in the network parameters rather than memorized explicitly. There is a lot of good research on this, but it is not practical yet, and I think it will be a really big breakthrough.
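One common form such explicit memory takes in the research literature (a sketch under assumed sizes, not a description of any specific system) is a differentiable key-value memory that the network can read with soft attention:

```python
import torch
import torch.nn.functional as F

def read_memory(query, keys, values):
    """Soft read from an external memory bank.

    query:  (dim,)        what the network wants to recall
    keys:   (slots, dim)  addresses of the stored memories
    values: (slots, dim)  the stored contents
    """
    scores = keys @ query              # match query against every slot
    weights = F.softmax(scores, dim=0)
    # The read-out mixes stored values by relevance; because every step
    # is differentiable, what to store and recall can be learned.
    return weights @ values

memory_keys = torch.randn(32, 64)  # 32 slots of 64-dim keys (assumed)
memory_vals = torch.randn(32, 64)
recalled = read_memory(torch.randn(64), memory_keys, memory_vals)
```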

The other direction is how to handle continuous input and output. One reason people do so well at unsupervised learning from video is that they process all kinds of video in real time, with continuous input and output. The problem now is that we do not know how to prepare such training data to teach computers. You can feed in video, but what do you want to teach? What to teach, and at what granularity, is unclear.

We would feed in continuous, dynamic content, a little of it annotated and most of it not, because it is impossible to annotate everything. Only by organizing a big training problem for the academic community to study can we push the next step forward. Deep learning works very well today because the input and output data fit a single static function F(x). But when the task is not a static input-output mapping, and the inputs keep changing, what to do is a big challenge.


Heart of the Machine: Some argue that computer vision is too focused on functional branches like face recognition, and that it is still at the level of recognition (or perception). Should we also focus on the more important goal of cognition?

Sun Jian: That is a misunderstanding. The field of computer vision has never focused overwhelmingly on faces, and Face++ is not just about faces. We mainly work on the four core computer vision problems we care most about (image classification, object detection, semantic segmentation, and sequence learning), as well as core network training problems, underlying architecture problems, and deep learning platform problems.

Of course, the cognitive level should also be studied, or the artificial intelligence problem cannot be solved. The recent work on image captioning is a good example: it links image perception to semantic understanding, and that can in turn help solve perception. Perception is often wrong, and it can be wrong in very unreasonable ways, because there is no common sense, like identifying a horse on a house. Common sense actually lives in language; it can only be expressed through language, and people express it through language and conceptual abstraction. Until this is studied, we cannot represent knowledge, and we cannot represent the general impossibility of a horse standing on a house.
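A hedged skeleton of the image captioning idea mentioned here (a generic CNN-encoder / LSTM-decoder pattern with assumed sizes, not Face++'s system) shows how perception feeds into language:

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Skeleton captioner: a CNN perceives, an LSTM describes."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        # Keep the convolutional feature extractor, drop the classifier.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.project = nn.Linear(backbone.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.vocab_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)    # (batch, features)
        feats = self.project(feats).unsqueeze(1)   # image as first token
        words = self.embed(captions)               # (batch, len, embed)
        seq = torch.cat([feats, words], dim=1)
        hidden, _ = self.decoder(seq)
        return self.vocab_out(hidden)              # next-word logits
```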


Heart of the Machine: Deep learning is the most popular method for image recognition, and last year Science published a paper on learning to recognize handwriting with Bayesian program learning. For image recognition to develop faster and better, do we need other methods or models besides deep learning?

Sun Jian: Deep learning is a broad concept. It is end-to-end, and its concrete form is the deep neural network. Looking further ahead, I think it may become a component of unsupervised learning and reinforcement learning; it is not mutually exclusive with other methods.

In the narrow sense, deep learning refers to supervised deep learning, that is, neural networks trained with supervision. In the broad sense, it has permeated unsupervised learning and reinforcement learning; it is a large umbrella of concepts.


Heart of the Machine: The latest wave of image recognition activity is in medicine, and Hinton has said radiologists will no longer be needed because image processing technology is mature enough. In your opinion, has medical image recognition gone that far? And what needs to be done next?

Sun Jian: I think that, as a whole, it is not mature today, though individual problems show promise. Medical data is still not large enough or open enough. Medical data is often 3D, which has both advantages and disadvantages. And doctors do not just look at images; they draw on a lot of background information. On the plus side, medical image recognition is easier than general natural image recognition, because natural images contain so many things that involve our common sense and representation of knowledge, while medical images are relatively constrained, so their ambiguities and difficulties are far fewer. The problems today are probably insufficient data, too few people studying it, a lack of open data platforms, and patient privacy. Given this combination of problems, applications may be feasible in a few rare cases, but most cases, as far as I know, still need to be studied.


About the chief scientist and Megvii (Face++)


Heart of the Machine: Why did you join Face++ six months ago?

Sun Jian: I wanted to try it, to have this experience. I have been in touch with computer vision for 20 years. I first encountered image processing in my junior year, and at the end of my senior year I did my graduation project, "Hardware Implementation of a Chaotic Neural Network." Of course, neural networks then were quite different. I have worked on face recognition for a long time, but with an older generation of technology. Now, with deep learning, things that were not possible before are really reaching real-world use.

In fact, at Microsoft I always paid attention to both research methods and practical application, and did a lot of research work that was applied in the company's products. In college, I learned an idea from the teacher who taught me automatic control: to do things well, we should be both god and ghost. Being a god means understanding and getting the method right; being a ghost means testing and guiding it with practice.

I wanted to join a startup because startups today are different from what they used to be. You can think of today's startup as a division of a larger company that puts all of its people, heart, and money, 200 percent, into doing one thing. I want to be part of that very focused process.

Megvii’s main product right now

Heart of the Machine: Has Face++ had this research setup for a long time?

Sun Jian: Face++ is a technology company. Its first employees were researchers, and it adopted deep learning very early. So you can think of Face++ at the beginning as a research institute; then products, business, and sales slowly appeared, and it gradually became what it is now.

Deep learning, and computer vision in particular, involves a lot of engineering beyond pure research. It is a very practical subject: you must run experiments, process data hands-on, and understand the problem, so our research and development are not separate. Research results are delivered to the product department through an internal algorithm library and SDK; the product department builds its products on top of the SDK, and the products then go to sales.


Heart of the Machine: What is your next major research direction, or the research direction of the institute?

Sun Jian: As mentioned earlier, the institute mainly focuses on four core research topics (image classification, object detection, semantic segmentation, and sequence learning), which is exactly what I did at Microsoft, and we will continue to push forward on these problems. We are exploring new directions too, but they are not the main line.


Heart of the Machine: Does the product lead your research work, or is the research independent and closer to the cutting edge?

Sun Jian: No corporate research institute today does purely pure research; truly pure research exists only in universities. Every company's R&D department compromises on purpose to a different degree, so the work can be called neither completely independent pure research nor purely product development.


Heart of the Machine: Do startups need cutting-edge research or engineering results?

Sun Jian: We want both. That is not greed; it is the best way. We spend a lot of energy and resources researching and improving the essence of our methods, and those essential improvements are passed on to the product, for example as higher accuracy and faster speed. This cannot be short-sighted; there must be short-, medium-, and long-term goals. When the company was founded, there was no product, and all we did was research-related work. Research itself divides into two kinds, applied research and basic research: very basic research is applicable everywhere, while applied research should solve specific problems.

In fact, this is a truth we all understand, but it is easier to know than to do, and striking the right balance is the key.


Heart of the Machine: In your previous post you mentioned that Face++ will be involved in robots. Can you talk more about that?

Sun Jian: We now make hardware modules for face recognition and object recognition and cooperate with several domestic home-service robot makers. You can think of a robot's core components as eyes, brain, hands, and feet: vision is the eyes, the hands are robotic arms, the feet are something called AGVs plus navigation, and of course there are also the more difficult bipedal and multi-legged robots.

Face++ already offers the robotics industry hardware modules with our algorithms built in. Next, we are very interested in studying the body, the hands, and the legs, to make a complete robot.

This article is an original production of Heart of the Machine.