From The M Tank

Compiled by Heart of the Machine

Participants: Jiang Siyuan, Liu Xiaokun

The M Tank has compiled a report, "A Year in Computer Vision," which documents research in the field of computer vision in 2016-17 and is a rare, detailed resource for developers and researchers. The material consists of four parts. In this article, Heart of the Machine presents its compilation of the first part; the remaining parts will be released later.


Table of contents


Introduction

Part One

  • Classification/Localization
  • Object detection
  • Object tracking

Part Two

  • Segmentation
  • Super-resolution, style transfer, and colorization
  • Action recognition

Part Three

  • 3D objects
  • Human pose estimation
  • 3D reconstruction
  • Other uncategorized 3D topics
  • Conclusion

Part Four

  • Convolutional architectures
  • Datasets
  • Other uncategorized material and interesting trends

Conclusion

Full PDF address: www.themtank.org/pdfs/AYearo…


Introduction

Computer vision is the study of giving machines the ability to see, or to visually analyze their environment and the stimuli within it. It usually involves the evaluation of images or video, and the British Machine Vision Association (BMVA) defines computer vision as "the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images" [1].

A true understanding of our environment cannot be achieved through visual representations alone. More precisely, vision is the process by which visual cues travel through the optic nerve to the primary visual cortex, where the brain analyzes them in a highly structured way. Extracting interpretations from this sensory information encompasses almost the whole of our natural evolution and subjective experience: how evolution enabled us to survive, and how we learn and understand the world throughout our lives.

In this respect, the visual process proper is simply the transmission of images for interpretation, while computing over those images is closer to thought or cognition and draws on a large number of brain functions. Because of this remarkable cross-domain nature, many people believe that computer vision, in truly understanding the visual environment and its context, will lead us toward strong artificial intelligence.

However, we are still in the embryonic stage of development in this field. The purpose of this article is to clarify the most important advances in computer vision in 2016 and 2017, and how these advances contribute to practical applications.

For the sake of simplicity, this article will be limited to basic definitions and will omit much of the content, especially regarding the design architecture of various convolutional neural networks.

Here are some recommended learning materials; the first two help beginners quickly build a foundation, while the last two can serve as more advanced study:


  • Andrej Karpathy: "What a Deep Neural Network thinks about your #selfie", one of the best articles for understanding the applications and design of CNNs [4].
  • Quora: "What is a convolutional neural network?", a clear and intuitive explanation, especially suitable for beginners [5].
  • CS231n: Convolutional Neural Networks for Visual Recognition, a Stanford course and an excellent resource for advanced learning [6].
  • Deep Learning (Goodfellow, Bengio & Courville, 2016): chapter 9 of this book provides detailed explanations of the features and architectural design of CNNs. A free version is available online [7].


For those who want to learn more about neural networks and deep learning, we recommend:

  • Neural Networks and Deep Learning (Nielsen, 2017) is a free online book that gives readers an intuitive understanding of the complexities of neural networks and deep learning. Even reading only chapter 1 will help beginners follow this article.


Below we introduce the first part of the report, which covers classification/localization, object detection, and object tracking: very fundamental and popular computer vision tasks. The remaining three parts by Benjamin F. Duffy and Daniel R. Flynn will follow: Part Two covers semantic segmentation, super-resolution, style transfer, and action recognition; Part Three covers 3D object recognition and reconstruction; and Part Four covers convolutional architectures and datasets.


Basic computer vision tasks


Classification/Localization

Image classification usually refers to assigning a specific label to an entire image; in the example below (left), the label for the whole image is CAT. Localization refers to finding an object's position in the image, with that position usually represented by a bounding box drawn around the object. Current classification/localization accuracy on ImageNet [9] already exceeds that of a group of trained humans [10]. For this reason, later parts of this report devote more space to topics such as semantic segmentation and 3D reconstruction.

Figure 1: Computer vision tasks. Source: CS231n course materials.
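To make the classification side of this task concrete, below is a minimal, illustrative sketch of classifying a single image with a pretrained convolutional network. It uses the torchvision library; the choice of ResNet-50 and the file name cat.jpg are our own assumptions and are not taken from any of the cited work.

```python
# Minimal illustrative sketch: single-image classification with a pretrained CNN.
# Assumes torchvision >= 0.13 (weights API) and a local file "cat.jpg";
# not code from any of the papers discussed in this report.
import torch
from PIL import Image
from torchvision import models

# Load a pretrained ResNet-50 together with its matching preprocessing pipeline.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

image = Image.open("cat.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)          # shape: (1, 3, H, W)

with torch.no_grad():
    logits = model(batch)                       # (1, 1000) class scores
    probs = logits.softmax(dim=1)

# Print the five most likely ImageNet labels.
top5 = probs.topk(5)
for p, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{weights.meta['categories'][int(idx)]}: {p.item():.3f}")
```

Localization would additionally regress a bounding box for the predicted object; detection models return boxes and labels together, which is the subject of the next section.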


However, as the number of object categories grows [11], the introduction of larger datasets will provide new metrics for measuring research progress. In this regard, Keras [12] founder Francois Chollet applied new techniques and architectures, including Xception, to an internal Google dataset containing 17,000 object categories and 350 million multi-label images [13], [14].

Figure 2: Classification/localization error rates in the ILSVRC competition, year by year. Source: Jia Deng (2016), ILSVRC2016.


ImageNet LSVRC (2016)

  • Scene classification refers to labeling an image with a specific scene class such as "greenhouse", "stadium" or "cathedral". Last year ImageNet held a scene classification challenge based on a subset of the Places2 [15] dataset: 8 million training images across 365 scene categories. Hikvision [16] won the competition with a 9% top-5 error, using an ensemble of deep Inception-style networks and not-too-deep ResNets.
  • Trimps-Soushen won the ImageNet classification task with a top-5 classification error of 2.99% and a localization error of 7.71%. The team used an ensemble of classification models (i.e., averaging the results of Inception, Inception-ResNet, ResNet and Wide Residual Network models [17]) and Faster R-CNN [18] for localization based on those labels (a minimal sketch of this kind of prediction averaging follows this list). The training set contained 1,000 categories with a total of 1.2 million images, and the held-out test set comprised 100,000 images unseen during training.
  • Facebook's ResNeXt achieved a top-5 classification error of 3.03% by using a new architecture that extends the original ResNet [19].
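As referenced in the Trimps-Soushen entry above, a common way to ensemble classifiers is simply to average their per-class probabilities. The sketch below illustrates that idea only; the three torchvision models used here are stand-ins, not the team's actual Inception/Inception-ResNet/ResNet/Wide-ResNet ensemble.

```python
# Sketch of the "average the softmax outputs" style of ensembling described above.
# The specific models and the dummy input are placeholders, not the actual
# Trimps-Soushen submission. Assumes torchvision >= 0.13.
import torch
from torchvision import models

def ensemble_predict(batch, members):
    """Average class probabilities over an ensemble of classifiers."""
    probs = None
    with torch.no_grad():
        for m in members:
            p = m(batch).softmax(dim=1)
            probs = p if probs is None else probs + p
    return probs / len(members)

# Hypothetical ensemble members, all pretrained on ImageNet-1k.
members = [
    models.resnet152(weights="IMAGENET1K_V1").eval(),
    models.inception_v3(weights="IMAGENET1K_V1").eval(),
    models.wide_resnet101_2(weights="IMAGENET1K_V1").eval(),
]

batch = torch.randn(1, 3, 299, 299)   # dummy input; real use would preprocess actual images
top5 = ensemble_predict(batch, members).topk(5)
```

Averaging probabilities (rather than raw logits) is only one of several reasonable choices; teams also commonly weight ensemble members by their validation accuracy.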


Object detection

Object detection is, as the name implies, the detection of the objects contained in an image. ILSVRC 2016 [20] defines object detection as outputting bounding boxes and labels for individual objects. This differs from the classification/localization task in that classification and localization are applied to many objects in an image rather than to a single dominant object.

Figure 3: Object detection with a single class, faces. The figure shows an example of face detection; the authors note that one open problem in object detection is small objects. Detecting the small faces in the figure exercises the model's scale invariance, its handling of image resolution, and its capacity for contextual reasoning. Source: Hu and Ramanan (2016, p. 1) [21].


One of the major trends in object detection in 2016 was the shift toward faster, more efficient detection systems. This is evident in approaches such as YOLO, SSD and R-FCN, which all tend to share computation across the entire image. This distinguishes them from the more costly per-region sub-network techniques associated with Fast/Faster R-CNN; these faster, more efficient systems are often described as "end-to-end training or learning".

The rationale behind this shared computation is generally to avoid having separate algorithms focus on their respective sub-problems in isolation, since doing so typically increases training time and can lower network accuracy. That said, such end-to-end adaptation usually happens after an initial sub-network solution exists, and is therefore a retrospective optimisation. Nevertheless, Fast/Faster R-CNN techniques remain highly effective and are still widely used for object detection.
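To illustrate the shared-computation idea in rough code (a conceptual sketch only, not the implementation of SSD, YOLO, R-FCN or Faster R-CNN; layer sizes, anchor counts and box coordinates are placeholders):

```python
# Illustrative contrast: a single-shot detector shares one pass over the whole
# feature map, while a two-stage detector runs a sub-network per region.
# All sizes here are placeholders, not those of any published detector.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

num_classes, num_anchors = 20, 3
features = torch.randn(1, 256, 38, 38)           # backbone output for one image

# Single-shot style: one conv pass yields (4 box offsets + class scores) per
# anchor at every spatial position -- computation is shared across the image.
dense_head = nn.Conv2d(256, num_anchors * (4 + num_classes), kernel_size=3, padding=1)
dense_out = dense_head(features)                  # (1, A*(4+C), 38, 38)

# Two-stage style: crop each proposal from the shared features, then run a
# small sub-network once per region (hundreds of times per image in practice).
proposals = torch.tensor([[0., 10., 10., 120., 150.],   # (batch_idx, x1, y1, x2, y2)
                          [0., 30., 40., 200., 220.]])
crops = roi_align(features, proposals, output_size=(7, 7), spatial_scale=38 / 300)
region_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 7 * 7, 4 + num_classes))
region_out = region_head(crops)                   # one prediction per proposal
```

The single-shot head produces predictions for every location in one pass over the shared feature map, whereas the two-stage style repeats its sub-network for every proposal, which is where much of the extra cost comes from.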

  • SSD: Single Shot MultiBox Detector [22] uses a single neural network that encapsulates all the necessary computation and eliminates the costly proposal-generation and feature-resampling stages; it achieves 75.1% mAP, outperforming a comparable Faster R-CNN model (Liu et al., 2016).
  • The most notable system we saw in 2016 was "YOLO9000: Better, Faster, Stronger" [23], which introduced the YOLOv2 and YOLO9000 detection systems [24]. YOLOv2 greatly improves on the YOLO model [25] proposed in 2015 and achieves strong results at very high frame rates (up to 90 FPS on low-resolution images with the original GTX Titan X). In addition to its speed, on specific object detection datasets its accuracy exceeds that of Faster R-CNN with ResNet, and of SSD.

YOLO9000 implements joint training for detection and classification and extends its predictions to data beyond its detection training set, i.e. it can detect object classes for which it has never seen labelled detection data. The YOLO9000 model provides real-time detection across more than 9,000 categories, narrowing the gap between the sizes of classification and detection datasets. Further details and pretrained models are available at pjreddie.com/darknet/yol… [26].

  • Feature Pyramid Networks for Object Detection [27], from the FAIR [28] lab, exploits "the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost", meaning representations remain powerful without sacrificing speed or memory. Lin et al. (2016) achieve top single-model results on the COCO [29] dataset; combined with a basic Faster R-CNN, their method surpasses the best results of 2016.
  • R-FCN: Object Detection via Region-based Fully Convolutional Networks [30] is another method that avoids applying a costly per-region sub-network hundreds of times per image. Instead, the region-based detector is made fully convolutional, with computation shared across the whole image. "Our result is achieved at a test-time speed of 170ms per image, 2.5-20x faster than the Faster R-CNN counterpart" (Dai et al., 2016).

Figure 4: Accuracy trade-offs in object detection. Source: Huang et al. (2016, p. 9) [31].


Note: the y-axis shows mean average precision (mAP) and the x-axis shows various feature extractors (VGG, MobileNet, …, Inception ResNet V2) across different meta-architectures. In addition, mAP small, medium and large denote average precision on small, medium and large objects respectively; that is, accuracy is stratified by object size, meta-architecture and feature extractor, with the image resolution fixed at 300. Although Faster R-CNN performs relatively well in this sample, that is of limited value because the meta-architecture is considerably slower than R-FCN.
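For reference, the mAP figures in this comparison are built on the intersection-over-union (IoU) overlap between predicted and ground-truth boxes. The helper below is a generic sketch of IoU, not the evaluation code used by Huang et al.

```python
# Generic sketch of intersection-over-union (IoU), the overlap measure that
# underlies mAP-style detection metrics. Boxes are (x1, y1, x2, y2) corners.
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection typically counts as correct when its IoU with a ground-truth box
# exceeds a threshold (0.5 is the classic PASCAL VOC choice).
print(iou((10, 10, 110, 110), (50, 50, 150, 150)))  # ~0.22
```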

Huang et al. (2016) [32] provide an in-depth performance comparison of R-FCN, SSD and Faster R-CNN. Aware of the pitfalls of comparing accuracy across machine learning approaches, they use a standardised methodology. These architectures are treated as meta-architectures because they can be combined with different feature extractors, such as ResNet or Inception.

The authors study the trade-off between accuracy and speed by changing the meta-architecture, feature extractor, and image resolution. For example, the choice of different feature extractors can result in very large changes in meta-architecture comparisons.

Real-time commercial applications, especially autonomous driving, require object detection methods that combine low power consumption with high efficiency while retaining accuracy. SqueezeDet [33] and PVANet [34] exemplify this trend.

COCO [36] is another commonly used image dataset. It is smaller and more curated than ImageNet, with a focus on object recognition within the broader context of scene understanding. Its organizers host an annual challenge covering object detection, segmentation and keypoint annotation. The results of the object detection challenges on ILSVRC [37] and on COCO [38] were as follows:

  • ImageNet LSVRC object detection from images (DET): CUImage, 66% mean average precision; won 109 of 200 categories.
  • ImageNet LSVRC object detection from video (VID): NUIST, 80.8% mean average precision.
  • ImageNet LSVRC object detection/tracking from video: CUvideo, 55.8% mean average precision.
  • COCO 2016 detection challenge (bounding boxes): G-RMI (Google), 41.5% average precision (an absolute improvement of 4.2 percentage points over the 2015 winner, MSRAVC).

Reviewing these ImageNet results, the organisers note that MSRAVC 2015 set a very high bar with the introduction of ResNet; detection performance improved across all categories for this year's entries, localization improved in both challenges, and there were large relative improvements on small object instances (ImageNet, 2016) [39].

Figure 5: ILSVRC image object detection results (2013-2016). Source: ImageNet. 2016. [Online] Workshop.


Object tracking

Object tracking is the process of following one or more specific objects of interest through a given scene. It has many applications in video and real-world interaction (where tracking usually begins from an initial object detection) and is essential for autonomous driving.

  • Fully-Convolutional Siamese Networks for Object Tracking [40] combines a basic tracking algorithm with a Siamese network, trained end-to-end, and achieves the current state of the art while operating at frame rates beyond real-time requirements. Tracking models have traditionally been built with online learning methods; this work instead learns a matching function offline (a minimal sketch of the Siamese matching idea follows this list).
  • Learning to Track at 100 FPS with Deep Regression Networks [41] attempts to address shortcomings of online training methods. The authors build a tracker that uses a feed-forward network to learn general relationships in object motion, appearance and orientation, allowing it to track novel objects efficiently without online training. The algorithm achieves the state of the art on a standard tracking benchmark while tracking generic objects at 100 FPS (Held et al., 2016).
  • Deep Motion Features for Visual Tracking [43] combines hand-crafted features, deep appearance features (from a CNN) and deep motion features (trained on optical flow images), achieving the current best results. While deep motion features are common in action recognition and video classification, the authors claim this is the first time they have been applied to visual tracking. The paper won the best-paper award in the "Computer Vision and Robot Vision" track at ICPR 2016.

"This paper presents an investigation of the impact of deep motion features in a detection-and-tracking framework. We further show that hand-crafted, deep RGB, and deep motion features contain complementary information. To the best of our knowledge, we are the first to propose fusing appearance information with deep motion features for visual tracking. Comprehensive experiments clearly show that our fusion approach with deep motion features outperforms standard methods relying on appearance information alone."

  • Virtual Worlds as Proxy for Multi-Object Tracking Analysis [44] addresses the lack of variability in existing real-world video tracking benchmarks and datasets. The paper proposes a new method for cloning the real world into virtual worlds, generating rich, synthetic, photo-realistic environments from scratch, and thereby overcoming some of the content shortages of existing datasets. The generated images are automatically labelled with accurate ground truth and can be used for tasks beyond object detection/tracking, such as optical flow.
  • Globally Optimal Object Tracking with Fully Convolutional Networks [45] focuses on handling appearance change and occlusion, treating them as two fundamental limitations of object tracking. "Our proposed method handles changes in the object's appearance using a fully convolutional network and handles occlusion using dynamic programming" (Lee et al., 2016).
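To make the Siamese tracking approach mentioned above more tangible, here is a minimal sketch of the matching idea: embed an exemplar patch and a larger search region with the same network, then cross-correlate the two embeddings to localize the target. The tiny backbone and patch sizes below are placeholders, not the architecture of Bertinetto et al.

```python
# Minimal sketch of fully-convolutional Siamese matching for tracking:
# embed the exemplar (target patch) and the search region with a shared-weight
# CNN, then cross-correlate the embeddings; the peak of the response map
# indicates the target's new position. Placeholder backbone, not SiamFC itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(                 # shared-weight embedding function
    nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
)

exemplar = torch.randn(1, 3, 127, 127)    # target patch from the first frame
search = torch.randn(1, 3, 255, 255)      # search region in the current frame

z = backbone(exemplar)                    # (1, 64, 30, 30)
x = backbone(search)                      # (1, 64, 62, 62)

# Treat the exemplar embedding as a correlation kernel slid over the search
# embedding; the argmax of the response map locates the target.
response = F.conv2d(x, z)                 # (1, 1, 33, 33)
peak = response.flatten().argmax()
```

In practice, trackers of this kind also search over several scales and smooth the response map before committing to a new target position in the next frame.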


References:

[1] British Machine Vision Association (BMVA). 2016. What is computer vision? [Online] Available at: www.bmva.org/visionoverv… [Accessed 21/12/2016]

[2] Krizhevsky, A., Sutskever, I. and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada. Available: www.cs.toronto.edu/~kriz/image…

[3] Kuhn, T. S. 1962. The Structure of Scientific Revolutions. 4th ed. United States: The University of Chicago Press.

[4] Karpathy, A. 2015. What a Deep Neural Network thinks about your #selfie. [Blog] Andrej Karpathy Blog. Available: karpathy.github.io/2015/10/25/… [Accessed: 21/12/2016]

[5] Quora. 2016. What is a convolutional neural network? [Online] Available: www.quora.com/What-is-a-c… [Accessed: 21/12/2016]

[6] Stanford University. 2016. Convolutional Neural Networks for Visual Recognition. [Online] CS231n. Available: cs231n.stanford.edu/ [Accessed 21/12/2016]

[7] Goodfellow et al. 2016. Deep Learning. MIT Press. [Online] www.deeplearningbook.org/ [Accessed: 21/12/2016]. Note: Chapter 9, Convolutional Networks. Available: www.deeplearningbook.org/contents/co…

[8] Nielsen, M. 2017. Neural Networks and Deep Learning. [Online] EBook. Available: neuralnetworksanddeeplearning.com/index.html [Accessed: 06/03/2017].

[9] ImageNet refers to a popular image dataset for Computer Vision. Each year entrants compete in a series of different Tasks called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Available: image-net.org/challenges/…

[10] See “What I learned from competing against a ConvNet on ImageNet” by Andrej Karpathy. The blog post details the author's journey to provide a human benchmark against the ILSVRC 2014 dataset. The error rate was approximately 5.1%, versus the then state-of-the-art GoogLeNet classification error of 6.8%. Available: karpathy.github.io/2014/09/02/…

[11] See new datasets later in this piece.

[12] Keras is a popular neural network-based deep learning library: keras.io/

[13] Chollet, F. 2016. Information-theoretical label embeddings for large-scale image classification. [Online] arXiv:1607.05691. Available: arXiv:1607.05691v1

[14] Chollet, F. 2016. Xception: Deep Learning with Depthwise Separable Convolutions. [Online] arXiv:1610.02357. Available: arXiv:1610.02357v2

[15] Places2 dataset, details available: places2.csail.mit.edu/. See also new datasets section.

[16] Hikvision. 2016. Hikvision ranked No.1 in Scene Classification at ImageNet 2016 challenge. [Online] Security News Desk. Available: www.securitynewsdesk.com/hikvision-r… [Accessed: 20/03/2017].

[17] See Residual Networks in Part Four of this publication for more details.

[18] Details available under Team Information Trimps-Soushen from: image-net.org/challenges/…

[19] Xie, S., Girshick, R., Dollar, P., Tu, Z. & He, K. 2016. Aggregated Residual Transformations for Deep Neural Networks. [Online] arXiv:1611.05431. Available: arXiv:1611.05431v1

[20] ImageNet Large Scale Visual Recognition Challenge (2016), Part II, Available: image-net.org/challenges/… [Accessed: 22/11/2016]

[21] Hu and Ramanan. 2016. Finding Tiny Faces. [Online] arXiv:1612.04402

[22] Liu et al. 2016. SSD: Single Shot MultiBox Detector. [Online] arXiv:1512.02325v5. Available: arXiv:1512.02325v5

[23] Redmon, J. and Farhadi, A. 2016. YOLO9000: Better, Faster, Stronger. [Online] arXiv:1612.08242v1. Available: arXiv:1612.08242v1

[24] YOLO stands for “You Only Look Once”.

[25] Redmon et al. 2016. You Only Look Once: Unified, Real-Time Object Detection. [Online] arXiv:1506.02640. Available: arXiv:1506.02640v5

[26] Redmon. 2017. YOLO: Real-time Object Detection. [Website] pjreddie.com. Available: pjreddie.com/darknet/yol… [Accessed: 01/03/2017].

[27] Lin et al. 2016. Feature Pyramid Networks for Object Detection. [Online] arXiv:1612.03144. Available: arXiv:1612.03144v1

[28] Facebook’s Artificial Intelligence Research

[29] Common Objects in Context (COCO) image dataset

[30] Dai et al. 2016. R-FCN: Object Detection via Region-based Fully Convolutional Networks. [Online] arXiv:1605.06409. Available: arXiv:1605.06409v2

[31] Huang et al. 2016. Speed/accuracy trade-offs for modern convolutional object detectors. [Online] arXiv:1611.10012. Available: arXiv:1611.10012v1

[32] ibid

[33] Wu et al. 2016. SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving. [Online] arXiv:1612.01051. Available: arXiv:1612.01051v2

[34] Hong et al. 2016. PVANet: Lightweight Deep Neural Networks for Real-time Object Detection. [Online] arXiv:1611.08588v2. Available: arXiv:1611.08588v2

[35] DeepGlint Official.2016. DeepGlint CVPR2016. [Online] Youtube.com. Available: www.youtube.com/watch?v=xhp… [Accessed: 01/03/2017].

[36] COCO – Common Objects in Context. 2016. [Website] Available: mscoco.org/ [Accessed: 04/01/2017].

[37] ILSRVC results taken from: ImageNet. 2016. Large Scale Visual Recognition Challenge 2016.

[Website] Object Detection. Available: image-net.org/challenges/… [Accessed: 04/01/2017].

[38] COCO Detection Challenge results taken from: Detections Leaderboard [Website] mscoco.org. Available: mscoco.org/dataset/#de… [Accessed: 05/01/2017].

[39] Imagenet.2016. [Online] Workshop Presentation, Slide 31. Available: image-net.org/challenges/… [Accessed: 06/01/2017].

[40] Bertinetto et al. 2016. Fully-Convolutional Siamese Networks for Object Tracking. [Online] arXiv:1606.09549. Available: arxiv.org/abs/1606.09…

[41] Held et al. 2016. Learning to Track at 100 FPS with Deep Regression Networks. [Online] arXiv:1604.01802. Available: arxiv.org/abs/1604.01…

[42] David Held. 2016. GOTURN – a Neural Network Tracker. [Online] YouTube.com. [Accessed: 03/03/2017].

[43] Gladh et al. 2016. Deep Motion Features for Visual Tracking. [Online] arXiv:1612.06615. Available: arXiv:1612.06615v1

[44] Gaidon et al. 2016. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. [Online] arXiv:1605.06457. Available: arXiv:1605.06457v1

[45] Lee et al. 2016. Globally Optimal Object Tracking with Fully Convolutional Networks. [Online] arXiv:1612.08274. Available: arXiv:1612.08274v1


Original report address: www.themtank.org/a-year-in-c…


This article was compiled by Heart of the Machine. For reprints, please contact this official account for authorization.