By Tony Peng, Heart of the Machine.

CVPR 2018, one of the top three computer vision (CV) conferences, kicked off in Salt Lake City, Utah, on June 18, US time. Although it was not yet the first official day of the main conference, 26 workshops and 11 challenges were more than enough to keep thousands of attendees busy.

We have selected and summarized some of the most noteworthy topics from the workshops to share with our readers as soon as possible.

Jitendra Malik, former chair of the Computer Science department at Berkeley: Studying SLAM requires combining geometry and semantics

At this year’s CVPR, the first international workshop on SLAM (Simultaneous Localization and Mapping) and deep learning received a lot of attention, thanks to the growing importance of SLAM technology in autonomous robotics and self-driving cars.

The first speaker was Jitendra Malik, a guru in the field of computer vision (CV) and former chair of the Computer Science department at the University of California, Berkeley. Malik joined Facebook AI Research (FAIR) late last year.

Malik first briefly reviewed the past few decades of research on object recognition, localization, and 3D reconstruction, starting from traditional algorithms represented by the DPM (Deformable Parts Model). He then introduced Fast R-CNN, an influential object detection algorithm that became popular around 2015, and Mask R-CNN, which extends it to instance segmentation. Finally, he covered the latest research on recovering 3D object shape.

Malik then recommended and introduced three papers he co-authored, accepted at NIPS 2017 and at CVPR over the past two years, all on reconstructing 3D structure from 2D images:

  • Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene: the goal is to take a single 2D image of a scene and recover its 3D structure in terms of a small set of factors: a layout representing the enclosing surface and a set of objects, each represented by a shape and a pose. The paper proposes a convolutional neural network approach to predict this representation and benchmarks it on large indoor-scene datasets (a toy sketch of such a factored representation appears after this list).
  • Learning Category-Specific Mesh Reconstruction from Image Collections: this paper proposes a learning framework that reconstructs three aspects of a real-world object from a single image: 3D shape, camera pose, and texture. The shape is represented as a deformable 3D mesh model of the object category, and the framework can be trained on annotated image collections without relying on ground-truth 3D or multi-view supervision.
  • Learning a Multi-view Stereo Machine: this paper proposes an end-to-end learnable multi-view stereo system that can reconstruct, and even complete unseen surfaces, from far fewer images (even a single image) than classical methods.
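To make the factored representation in the first paper more concrete, here is a minimal, purely illustrative Python sketch of a scene broken into a layout plus a set of posed objects. The field names and shapes are assumptions for illustration, not the paper’s actual data structures.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class SceneObject:
    """One object, described by a shape code and a pose (illustrative fields)."""
    shape_code: np.ndarray   # e.g. a latent vector decoded into a voxel grid or mesh
    rotation: np.ndarray     # 3x3 rotation matrix
    translation: np.ndarray  # 3-vector position in scene coordinates
    scale: float = 1.0

@dataclass
class FactoredScene:
    """A scene factored into an enclosing layout plus a set of posed objects."""
    layout: np.ndarray                         # placeholder layout parameters
    objects: List[SceneObject] = field(default_factory=list)

# A network would predict such a representation from a single RGB image;
# here we simply build one by hand to show the structure.
scene = FactoredScene(
    layout=np.zeros(6),
    objects=[SceneObject(shape_code=np.zeros(64),
                         rotation=np.eye(3),
                         translation=np.array([1.0, 0.0, 2.0]))],
)
print(len(scene.objects), "object(s) in the factored scene")
```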

Finally, Malik mentioned some new developments in the field of SLAM. In his view, traditional mapping-and-planning methods are inefficient because they require reconstructing the structure of an entire area, which is not what humans do. At the same time, traditional SLAM focuses only on annotating geometry and ignores semantics. For example, when a human sees a door marked “exit”, they naturally understand it as “you can leave through here”, but a machine has no such concept.

“SLAM needs to be studied from both semantic and geometric perspectives,” Malik said. He then introduced a dataset from Stanford University, the Stanford Large-scale 3D Indoor Spaces Dataset (S3DIS), from a 2016 CVPR paper. The paper presents a hierarchical approach to semantic parsing of building-scale 3D point clouds, and it emphasizes that identifying the structural elements of indoor spaces is essentially a detection problem rather than the usual segmentation problem. The authors validate their method on the S3DIS dataset, which covers more than 6,000 square meters of building area and contains more than 215 million points.


Jitendra Malik & Ross Girshick, creator of R-CNN: Visual question answering systems need better datasets

Malik then highlighted the importance of Visual Question Answering (VQA) and conversational systems for current AI research, as well as their remaining challenges.

VQA is an important interdisciplinary area bridging vision and language. A VQA system answers arbitrary questions from a questioner based on the information in a picture. On top of that, visual dialogue systems (proposed at last year’s CVPR) require machines to answer follow-up questions such as “How many people are in wheelchairs?” and “What is their gender?”

Why is language so important to visual understanding? A research paper entitled “Language helps categorization” suggests that, for infants, language plays a very important role in acquiring concepts of object classes: words can serve as essential placeholders that help infants build knowledge and representations of different objects more quickly.

However, Malik believes that solving VQA is much harder than object recognition. A system can use object recognition to obtain some basic information about an image, and there are many labeled datasets for that; but no dataset labels the human behaviors, goals, actions, and events in an image, which are the key to visual understanding.

Another notable speaker was Ross Girshick, a research scientist at FAIR and the researcher behind R-CNN and Fast R-CNN. In his talk, he raised a current problem with VQA: inconsistent answers.

For example, CloudCV Visual Question Answering (VQA) is a cloud-based visual question answering demo: users upload a picture, ask whatever questions they like, and the system returns answers of varying accuracy. When savvy users tease the system with different questions, they find that it sometimes gives the same answer to completely different questions.

The Machine Heart reporter was left thoroughly baffled.

A typical visual question answering dataset consists of triples of three elements: an image I, a question Q, and an answer A, written (I, Q, A). Girshick argues that measuring the accuracy of a VQA system should not examine isolated (I, Q, A) triples, but rather a structured set of data in which, for the same picture, each question Q constrains the answers to other questions.

“Building such a dataset is undoubtedly very difficult, but we need datasets that place greater demands on algorithms and models,” Girshick said.
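As a rough illustration of Girshick’s point, here is a toy Python sketch, with entirely hypothetical data and questions, that groups (I, Q, A) triples by image so that answers to related questions about the same picture can be checked against each other rather than scored in isolation.

```python
from collections import defaultdict

# Hypothetical mini-dataset of (image_id, question, answer) triples.
triples = [
    ("img_01", "How many people are in wheelchairs?", "2"),
    ("img_01", "What is their gender?", "female"),
    ("img_01", "How many people are in the picture?", "3"),
    ("img_02", "What color is the bus?", "red"),
]

# Group the triples by image so that questions about the same picture
# can be evaluated together rather than one at a time.
by_image = defaultdict(list)
for image_id, question, answer in triples:
    by_image[image_id].append((question, answer))

def consistency_check(qa_pairs):
    """Toy cross-question constraint: the number of people in wheelchairs
    cannot exceed the total number of people in the same image."""
    counts = {q: int(a) for q, a in qa_pairs if a.isdigit()}
    wheelchairs = counts.get("How many people are in wheelchairs?")
    total = counts.get("How many people are in the picture?")
    if wheelchairs is not None and total is not None:
        return wheelchairs <= total
    return True  # nothing to cross-check

for image_id, qa_pairs in by_image.items():
    print(image_id, "consistent:", consistency_check(qa_pairs))
```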

A policeman was driving down the street in his patrol car when he noticed a dark figure moving under a street lamp. It looked like a drunk man, so the policeman pulled up and asked, “Excuse me, what are you doing here?” “I’m looking for my key; I dropped it when I was opening the door.” “You dropped it under the street light?” “No, it fell in the bushes by the door!” “Then why are you looking under the street light?” “Because it’s brighter here!”

Though it is an old joke, Malik said the story resembles current scientific research. In recent years, large amounts of annotated data, powerful computing, and large-scale simulation environments have created a favorable setting for supervised learning. Like the street lamp, they rapidly improve research results, but they may not be the right road to strong artificial intelligence.


Honglak Lee: Video prediction and unsupervised learning

In the CV field, deep learning still faces many challenges in video analysis, including action recognition and detection, motion analysis and tracking, and the shallowness of current architectures. This year’s CVPR workshop on brave new ideas for video understanding brought together researchers from the field of video analysis to discuss challenges, metrics, and benchmarks.

The panel featured Honglak Lee, a Google Brain researcher and University of Michigan professor who was a student of Andrew Ng at Stanford.

Lee presented his research on video prediction and unsupervised learning.

According to Lee, a key challenge in video analysis is disentangling the many factors of variation that produce images: for a static scene, pose, shape, and lighting; for video, the distinction between background and foreground objects and the interactions between different objects in a frame. His research focuses on complex reasoning over video, such as predicting the future and acting on those predictions.

Lee mainly introduced his latest paper, accepted at ICML 2018: Hierarchical Long-term Video Prediction without Supervision. The paper presents a training method for long-term video prediction that trains an encoder, a predictor, and a decoder without high-level supervision, with further improvements from training the predictor with an adversarial loss in feature space. The method can predict roughly 20 seconds into the future of a video and delivers better results on the Human3.6M dataset.
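The paper’s exact architecture is not reproduced here, but the encoder-predictor-decoder idea can be sketched as follows. This is a minimal PyTorch illustration under assumed frame sizes and feature dimensions, and it replaces the adversarial feature-space loss described above with a plain MSE in feature space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a frame (3x64x64) to a high-level feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, feat_dim),
        )
    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the next high-level feature from past features with an LSTM."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
    def forward(self, feats):            # feats: (B, T, feat_dim)
        out, _ = self.lstm(feats)
        return out[:, -1]                # feature predicted for the next step

class Decoder(nn.Module):
    """Decodes a feature vector back into a frame."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(), # 32 -> 64
        )
    def forward(self, feat):
        x = self.fc(feat).view(-1, 128, 8, 8)
        return self.net(x)

# One prediction step: encode past frames, predict the next feature,
# decode it into a frame, and regress the predicted feature toward the true one.
encoder, predictor, decoder = Encoder(), Predictor(), Decoder()
past = torch.rand(2, 5, 3, 64, 64)                   # batch of 5-frame clips
future = torch.rand(2, 3, 64, 64)                    # the true next frame
feats = encoder(past.flatten(0, 1)).view(2, 5, -1)   # (B, T, feat_dim)
pred_feat = predictor(feats)
pred_frame = decoder(pred_feat)
feature_loss = F.mse_loss(pred_feat, encoder(future))
print(pred_frame.shape, feature_loss.item())
```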


Autonomous Driving Symposium: Challenges, Opportunities, Safety

This year’s CVPR autonomous driving workshop had a great lineup: Andrej Karpathy, head of AI at Tesla; Raquel Urtasun, head of Uber’s self-driving research and an authority in the CV field at the University of Toronto; and Kurt Keutzer, co-founder of Berkeley’s autonomous driving industry consortium.

While their respective presentations were mostly “advertising” for their companies, the final panel of the day produced some rare debate among the eight invited guests (with the exception of Karpathy).

This is no surprise. Most CVPR workshops deal with topics such as visual understanding and SLAM, which are not matters of life and death; autonomous driving, by contrast, puts the lives of hundreds of millions of people at stake, so the stakes are high. At the same time, the panelists’ understandings of autonomous driving differ from one another, and the debate sparked by these differing opinions only gave the audience more to think about.

During the hour-long panel, Machine Heart reporters summarized three of the more important questions:


What are the biggest challenges for autonomous driving?

Luc Vincent, Lyft’s vice president of engineering, thinks that neither compute nor society is ready for autonomous driving.

Berkeley’s Keutzer named perception, and Urtasun agreed, though the two later diverged: Urtasun argued that once perception is solved, planning will not be a problem, whereas Keutzer countered that the two are different things, and that even solving perception does not resolve the planning dilemmas that arise in particular scenarios.

Bo Li, a postdoctoral researcher at Berkeley, believes there are still many uncollected corner cases in autonomous driving, which may pose safety risks.


If you are a CV PhD student and want to do research on autonomous driving, what should you do?

“Make maps!” Urtasun was the first to answer, saying that there is still no measurement standard or reliable solution for high-precision mapping, and that it is technically difficult.

Urtasun’s answer was immediately challenged by several of her fellow panelists. “Don’t do it!” shot back Edwin Olson, an associate professor at the University of Michigan and CEO of May Mobility. “We’re at a very stupid point in time in autonomous driving: over-reliance on maps. I think the downsides of maps are very clear, and eventually we’re going to be less reliant on them.”

Others expressed similar views: “As algorithms improve, you won’t need maps as much.” “In the future, the technology for producing high-precision maps will become more reliable, and the number of people needed to label map data will decrease.”


How will the safety of different autonomous vehicles be measured in the future?

This question stumped many of the panelists on the spot, and there seems to be no unified standard in the industry. Olson did offer a novel idea: auto insurance, which may serve as a measure of how much faith companies have in their own safety.

Later, Bo Li suggested that, in the future, the back-end code of autonomous driving systems might be modeled and fed into benchmark evaluations. However, Will Maddern, a senior engineer at California-based autonomous driving company Nuro.AI, told Machine Heart that this is unlikely to happen anytime soon; he thinks it would be more feasible to have different vehicles drive around the same environment and compare their behavior.


Challenge results: a strong showing from the Chinese contingent

In addition to the invited talks at the workshops, another highlight of the first day was the challenge competitions. As Machine Heart reporters observed, Chinese researchers performed very well in the challenges. The (incomplete) results are as follows:


DeepGlobe Satellite Image Understanding Challenge

The DeepGlobe Satellite Image Understanding Challenge is co-sponsored by Facebook, Uber, IEEE’s GRSS, and others. Satellite imagery is a powerful source of information because it contains more structured and uniform data than everyday photos. While the computer vision community has developed many datasets of everyday images, satellite imagery has only recently attracted attention for mapping and population analysis.

So the organizers created the challenge, which is built around three satellite image understanding tasks: road extraction, building detection, and land cover classification. The datasets created and released for the competition can serve as references for future satellite image analysis research.

In the end, Lichen Zhou’s team from Beijing University of Posts and Telecommunications won first place in the road extraction task, while Chao Tian’s team from Harbin Institute of Technology won first place in the land cover classification task.

Link: deepglobe.org/workshop.ht…


Look Into Person (LIP) Challenge

The Look Into Person (LIP) Challenge is jointly organized by Sun Yat-sen University and Carnegie Mellon University. The challenge aims to push computer vision toward real-world scenarios, through problems such as human parsing and pose estimation. Wu Liu’s team from the JD Artificial Intelligence Research Institute won first place in the single-person and multi-person pose estimation tasks, two of the challenge’s five tracks.

Link: sysu-hcp.net/lip/pose_lb…


Image Compression Challenge (CLIC)

The Challenge on Learned Image Compression (CLIC), sponsored by Google, Twitter, Amazon, and others, is the first image compression challenge held at a computer vision conference. It aims to bring new methods such as neural networks and deep learning into the field of image compression.

According to the organizers, the challenge evaluates teams on two criteria: PSNR and subjective evaluation. The results were announced not long ago: across the different benchmarks, TucodecTNGcnn4p, a team from Chinese startup Tuya Technology, took first place on the MOS and MS-SSIM scores, while IipTiramisu, a joint team led by Tencent Audio and Video Lab and Professor Chen Zhen of Wuhan University, ranked first on the PSNR (peak signal-to-noise ratio) metric.
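For readers unfamiliar with the metric, PSNR is computed directly from the mean squared error between the original and reconstructed images; a minimal NumPy sketch is below (MS-SSIM is more involved and is typically computed with a dedicated library). This is illustrative only and is not the challenge’s official evaluation code.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = reference.astype(np.float64)
    reconstruction = reconstruction.astype(np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a random 8-bit image and a lightly corrupted copy of it.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3))
noisy = np.clip(original + rng.normal(0, 5, size=original.shape), 0, 255)
print(f"PSNR: {psnr(original, noisy):.2f} dB")
```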

Results: www.compression.cc/results/


Moments in Time Video Behavior Understanding Challenge

Moments in Time is a research project developed by the MIT-IBM Watson AI Lab. The project aims to build very large datasets to help AI systems recognize and understand actions and events in video. Today, the dataset contains one million labeled 3-second videos of people, animals, objects, or natural phenomena, each capturing the essence of a dynamic scene.

The challenge is divided into a Full Track and a Mini Track, and the top three finishers in both are Chinese teams:

Results: moments.csail.mit.edu/results2018…

In the Full Track, Deep-HRI from Hikvision took first place, with Megvii Technology second and Qiniu Cloud third. In the Mini Track, the SYSU_isee team from Sun Yat-sen University took first place, while Beihang University and Taiwan University took second and third, respectively.

This is what Machine Heart observed and recorded on the first day of the conference, but it is not the whole story. Over the next few days, we will continue to report on CVPR 2018 for you. Readers attending the conference are also welcome to contribute, so that we can share more great content with everyone.