Machine Heart Original

Author: Yuanyuan Li

Editor: Hao Wang

Researchers have long wanted to give machines the same senses as humans, and that includes vision. Vision is one of our most important senses: more than 70 percent of the information we receive comes through our eyes. Human eyes perceive 3D information, and thanks to binocular vision, obtaining depth information comes relatively easily to us. For a computer, however, this is a hard problem: objects with different shapes in the 3D world can have exactly the same projection in the 2D world, so estimating 3D shape is an ill-posed problem. Traditional research mainly relies on various geometric relationships or prior information, but with the rise of deep learning in recent years, research on geometric methods seems to have been neglected.

Not long ago, Yi Ma, a professor at the University of California, Berkeley, tweeted: “… No tool or algorithm is a panacea. At least when it comes to 3D reconstruction, algorithms that do not rigorously apply the conditions of geometric relationships are unscientific — far from reliable and accurate.”

This article gives a brief introduction to 3D reconstruction and discusses and compares the two popular families of methods.

Terminology

3D reconstruction refers to the process of recovering the real shape and appearance of an object. Its modeling methods can be divided into active methods and passive methods.

Active method:

In the active approach, the depth information of the object is measured directly; it can be obtained by sending a signal to the object from a light or energy source such as a laser and parsing the returned signal. Reconstruction is then mainly a matter of numerical approximation to recover the object's 3D profile. The main methods are:

  1. Moiré fringe method: this method was proposed by Andrew P. Witkin in 1987, when he was working at SRI International's AI Center. It relies on the interference pattern produced by overlapping grating fringes. Its advantages are high precision and relative robustness, but it places heavy demands on the object: the surface must have a regular texture.

  2. Structured light: structured light is robust to the color and texture of the object itself. A projector casts encoded structured light onto the object being photographed, and a camera captures the result. Because different parts of the object lie at different distances and orientations from the camera, the size and shape of the encoded patterns change; the camera captures this change, and an algorithm unit converts it into depth information to obtain a 3D profile. Depending on the coding scheme, it can be divided into direct coding, time-multiplexed coding, and spatially multiplexed coding. Structured light is simple to implement and is now widely used in industrial products such as the Microsoft Kinect and Apple's iPhone X. However, the method is strongly affected by bright ambient light and limited by projection distance.

  3. Triangulation ranging: this method is based on the principle of triangulation, essentially the similar-triangles principle from geometry, as the relation below illustrates. A triangulation system contains a laser emitter and a CCD image sensor. A beam emitted by the laser is reflected by the object and detected by the image sensor. By comparing how the detected laser spot shifts when the object moves, the distance between the emitter and the object can be solved. Because of its high accuracy and low cost, this method is widely used in civilian and commercial products such as robot vacuums.
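As a concrete illustration (for the simplest geometry, an assumption not spelled out in the text: the laser beam runs parallel to the camera's optical axis at a baseline offset b), similar triangles give

\[
Z = \frac{f\,b}{x},
\]

where f is the focal length and x is the offset of the imaged laser spot on the sensor; a shift in x therefore reveals a change in the depth Z.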

Passive method:

The passive method does not "interact" with the reconstructed object at all during 3D reconstruction. It mainly uses ambient light from the surroundings to acquire images, and then infers the 3D structure of the object by interpreting those images. It can be divided into three categories:

Monocular vision

Basically, a single camera takes pictures from which the 3D reconstruction is produced. Monocular methods are divided into:

  1. Shape From Shading (SFS): the shading in the image is used to estimate the contour features of the object. Images are taken under several different lighting conditions, and the light-and-dark variations of the resulting shadows are compared to estimate the object's depth information. As you can imagine, SFS requires accurate light-source parameters and therefore cannot be used in scenes with complex lighting.

  2. Shape From Texture (SFT): SFT studies how a texture changes on an image after undergoing a distortion such as perspective (for example, a checkerboard photographed with a fisheye lens, where the pattern near the edge of the image is stretched out of shape) and works backwards from that distortion to compute depth. This method requires knowledge of the distortion and places many restrictions on the surface texture of the object, so it is seldom used in practical products.

Binocular stereo vision

In this method, images are taken either by two cameras at different positions photographing the object at the same time, or by one camera moving between different viewpoints. Obviously, this approach tries to simulate how humans obtain depth information by sensing the disparity between the images seen by the two eyes. Binocular vision relies on matching pixels between images, using either template (block) matching or epipolar geometry.
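For a standard rectified stereo pair (the textbook setup, not tied to any particular product mentioned here), the depth of a matched point follows directly from similar triangles:

\[
Z = \frac{f\,B}{d},
\]

where B is the baseline between the two cameras, f the focal length, and d the disparity, i.e. the horizontal offset between the matched pixels. Accurate pixel matching is therefore the crux of the method.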

Multi-view Stereo (MVS)

Multi-view stereo was originally a natural development of the two lines of research above. It places multiple cameras at different viewpoints, or captures images with a single monocular camera moved between viewpoints, to increase robustness, as shown in Figure 1. Along the way, this line of research has run into many new problems (for example, the amount of computation explodes) and has evolved into a problem in its own right. The area is very active at present.

Figure 1: Multiple images taken from different viewpoints with a monocular camera for 3D reconstruction. Photo source: Multi-View 3D Reconstruction, https://vision.in.tum.de/research/image-based_3d_reconstruction/multiviewreconstruction

Early stereo research was carried out mainly in laboratory settings, so the camera parameters were known. But as researchers increasingly use images taken outside the lab, such as photos downloaded from the Internet, they must first estimate the camera parameters, that is, recover camera positions and 3D points, before reconstruction.

The term "camera parameters" refers to the set of values describing the camera configuration: the camera pose, composed of position and orientation, and intrinsic camera properties such as focal length and pixel sensor size. There are many different methods, or "models", for parameterizing this configuration; the most commonly used is the pinhole camera model.

These parameters are necessary because the camera projects objects from the 3D world onto a 2D image, and now we need to deduce the 3D world information from the 2D image. It can be said that camera parameters are the parameters of the geometric model that determine the relationship between the object’s world points and image points.

At present, the main algorithm for computing camera parameters is Structure from Motion (SfM), and the success of MVS depends to a large extent on the success of SfM. Alongside SfM there is the VSLAM family of algorithms; both rely on image pixel matching and the assumption that the scene is rigid. SfM is most commonly used to compute camera models from unordered sets of images, usually offline, while VSLAM specializes in computing camera positions from video streams, usually in real time.

Structure from Motion

Most of us first met the principle of pinhole imaging in a junior high school physics class. Taking Figure 2 as an example, consider how the object point X is projected to the point x on the image. The projection center C, also called the camera center, defines the camera coordinate system, in which the object has coordinates (X, Y, Z). The line through the camera center and perpendicular to the image plane is called the principal axis (the Z axis in Figure 2). The intersection of the principal axis with the image plane is called the principal point and serves as the origin p of the image coordinate system. The distance between p and the camera center C is f, the focal length.

Figure 2: Pinhole camera model. Source: Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

According to the principle of similar triangles (right side of Figure 2), we can easily obtain the relationship between the camera coordinate system and the image coordinate system: Z/f = Y/y = X/x

Rearranging, we get:

 x = f*X/Z

 y = f*Y/Z

 z = f

That is, the coordinates of the point (X, Y, Z) in the image coordinate system are (fX/Z, fY/Z, f). This mapping is non-linear in the coordinate Z; by extending the coordinates to homogeneous form, we can express it as a matrix operation:
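In the standard pinhole-model notation (following Hartley and Zisserman, cited for Figure 2), the homogeneous form reads

\[
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
\sim
\begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix},
\]

where \sim denotes equality up to the overall scale factor Z.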

The formula above uses an image coordinate system whose origin is the center of the image plane. In practice we usually take a corner of the image, typically the upper-left corner, as the origin, with the horizontal direction as the x axis and the vertical direction as the y axis. We therefore need to scale and translate the image coordinate system to match the actual pixel coordinates: scale points by m_x and m_y along the x and y axes respectively, and translate them by p_x and p_y. The adjusted matrix becomes:
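With the scaling and translation folded into the intrinsics, the standard decomposed form is

\[
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
\sim
\begin{pmatrix} f_x & 0 & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}.
\]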

Where f_x = f*m_x and f_y = f*m_y, sometimes also written \alpha_x, \alpha_y.

In the above formula, the first matrix on the right of the equation is the camera’s internal parameter matrix K:
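Written out with the skew term s included (the entry that is 0 above), K has the conventional form

\[
K = \begin{pmatrix} f_x & s & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{pmatrix}.
\]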

s is the skew parameter of the camera, which is generally set to 0.

Next, we need to consider how the coordinates of the object are defined in the process above: so far they have been expressed relative to the camera's origin. But the camera can move at any time, and the object's position in the camera coordinate system moves with it. To obtain a stable coordinate position, we need to convert the object's coordinates to the world coordinate system.

Figure 3: Camera coordinate to world coordinate conversion. Source: Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

The transformation between camera and world coordinates can be expressed as a rotation plus a translation, as follows:

X_cam = RX + t

Where X_cam and X are the coordinates of a point in the camera coordinate system and the world coordinate system respectively, R is a 3×3 rotation matrix, and t is a 3×1 translation vector. To keep the operations uniform, we also rewrite this equation in matrix form:
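In homogeneous coordinates (both sides written as 4-vectors), the standard matrix form is

\[
X_{\text{cam}} = \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ 1 \end{pmatrix}.
\]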

By combining the camera’s internal and external parameter matrices, we can obtain the complete coordinate transformation relation of the object projected from the 3D world to the 2D image:
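In homogeneous coordinates the combined mapping is usually written as

\[
x \sim K\,[R \mid t]\,X,
\]

where x is the homogeneous image point and X the homogeneous world point.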

The complete camera matrix can be summarized as follows:
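That is, the 3×4 camera (projection) matrix is

\[
P = K\,[R \mid t].
\]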

The extrinsic camera parameters handle the conversion from the world coordinate system to the camera coordinate system, and the intrinsic parameters handle the subsequent conversion from the camera coordinate system to the 2D image coordinate system.
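To make this concrete, here is a minimal numerical sketch of the projection x ~ K[R|t]X in Python with NumPy. All parameter values are invented for illustration and do not come from the article.

```python
import numpy as np

# Intrinsics (hypothetical values, in pixels): focal lengths and principal point.
f_x, f_y, p_x, p_y = 800.0, 800.0, 320.0, 240.0
K = np.array([[f_x, 0.0, p_x],
              [0.0, f_y, p_y],
              [0.0, 0.0, 1.0]])

# Extrinsics: world-to-camera rotation R and translation t (camera 5 units away).
R = np.eye(3)
t = np.array([[0.0], [0.0], [5.0]])

# A world point in homogeneous coordinates.
X_world = np.array([[0.5], [0.2], [0.0], [1.0]])

P = K @ np.hstack([R, t])        # 3x4 camera matrix P = K [R | t]
x_h = P @ X_world                # homogeneous image point
u, v = x_h[:2, 0] / x_h[2, 0]    # divide by the third coordinate to get pixels
print(u, v)                      # 400.0 272.0
```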

Given a set of images, the SfM algorithm directly outputs the camera parameters of each image, together with a set of 3D points visible in the images, generally encoded as tracks.

Figure 4: General flow of the SfM algorithm. Photo source: Chippendale, Paul & Tomaselli, Valeria & D’Alto, Viviana & Urlini, Giulio & Modena, Carla & Messelodi, Stefano & Mauro Strano, Sebastiano & Alce, Gunter & Hermodsson, Klas & Razafimahazo, Mathieu & Michel, Thibaud & Farinella, Giovanni. (2014). Personal Shopping Assistance and Navigator System for Visually Impaired People. 10.1007/978-3-319-16199-0_27.

The first step of the SfM algorithm is to extract a set of features from each image (feature detection), generally using the Scale-Invariant Feature Transform (SIFT). SIFT applies successive Gaussian blurs to the image to obtain different image scales, finds candidate keypoints on them, and then discards the weak ones. The keypoints detected by SIFT are usually quite robust to changes in illumination, viewing angle, and so on, and even to partial occlusion. Another advantage of SIFT is its fast computation, which can basically meet real-time requirements.

Figure 5: Example of the SIFT algorithm. Photo source: Wikipedia, https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
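As a minimal sketch of this step (assuming an OpenCV build that ships SIFT, and a hypothetical image file image1.jpg), keypoint detection looks roughly like this:

```python
import cv2

# Load an image in grayscale (hypothetical file name, purely for illustration).
img = cv2.imread("image1.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute their 128-dimensional descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)  # typically a few thousand keypoints x 128
```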

After the features are determined, a k-d tree algorithm is generally used to match the features across different images. After matching, duplicate matches and matches that violate geometric constraints also need to be removed, otherwise mismatches introduce large errors. The geometric constraints are generally computed with the random sample consensus (RANSAC) eight-point algorithm.

Figure 6: Example of feature matching. Photo source: CS 6476 Computer Vision, Project 2: Local Feature Matching, https://www.cc.gatech.edu/~hays/compvision/proj2/
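A rough, self-contained sketch of this matching-and-filtering step (again using OpenCV, with two hypothetical image files and a FLANN kd-tree matcher; the ratio-test threshold of 0.7 is a common default, not a value from the article):

```python
import cv2
import numpy as np

# Two hypothetical overlapping views, purely for illustration.
img1 = cv2.imread("image1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("image2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# k-d tree (FLANN) matching plus Lowe's ratio test to drop ambiguous matches.
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
matches = flann.knnMatch(des1, des2, k=2)
good = [m[0] for m in matches
        if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]

# Enforce the geometric (epipolar) constraint: estimate the fundamental matrix
# with RANSAC and keep only the inlier matches.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
inliers = [m for m, keep in zip(good, mask.ravel()) if keep]
print(len(good), "ratio-test matches,", len(inliers), "geometric inliers")
```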

The filtered matches are merged into tracks, which are used to solve for the camera parameters and to estimate a sparse structure of the scene.

Since the algorithm is very sensitive to the accuracy of the estimated camera model, the estimate is usually refined by bundle adjustment. The main idea is to re-project the obtained 3D points back into the images, compare them with the initially detected coordinates, discard points whose error exceeds a tolerance, and then re-adjust the camera parameters.
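Written as an optimization problem in its common textbook form, bundle adjustment jointly refines all camera matrices P_i and 3D points X_j by minimizing the total reprojection error

\[
\min_{\{P_i\},\{X_j\}} \; \sum_i \sum_j d\!\left(P_i X_j,\; x_{ij}\right)^2,
\]

where x_{ij} is the observed image position of point j in image i and d(\cdot,\cdot) is the distance in the image plane.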

SfM has its drawbacks, one of the most important being that, since SfM relies on the assumption that features can be matched across views, it simply will not work when this assumption is not met. For example, if two images are taken from viewpoints that are too far apart (one from the front and one from the side), or the object is partially occluded, establishing feature matches becomes extremely problematic. In addition, because SfM optimizes through bundle adjustment, the whole algorithm runs very slowly, and it slows down significantly as the number of images grows.

Deep learning

With deep learning, researchers actually want to skip the manual pipeline of feature extraction, feature matching, and camera parameter solving, and infer the shape of a 3D object directly from images, sometimes from just a single image. At the same time, the neural network should not only learn the structure of the visible parts but also be able to infer the structure of the occluded parts.

As mentioned above, there are generally four ways to represent a 3D object: depth map, point cloud, voxel, and mesh. Accordingly, deep learning approaches can be divided into learning image-to-depth representations, image-to-voxel representations, and so on, and some studies try to combine multiple representations. Because of the required output format, most studies use encoder-decoder style neural networks, such as the 3D-R2N2 network [1] (see Figure 7). That network consists of three parts: an encoder built from a 2D convolutional neural network (2D-CNN), a 3D convolutional LSTM (3D-LSTM) as the intermediate module, and a decoder built from a 3D deconvolutional network (3D-DCNN); together they transform the input images into a voxel representation of the 3D object.

Figure 7:3D-R2N2 model structure. Source: C. B. Choy, D. Xu, J. Gwak. (2016). 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. ECCV.

Given one or more images of an object from arbitrary viewpoints, the 2D-CNN first encodes each input image x into a low-dimensional feature T(x). Then, depending on the encoded input, the 3D-LSTM cells selectively update their cell state or keep their previous state. Finally, the 3D-DCNN decodes the hidden states of the LSTM units and generates a 3D probabilistic voxel reconstruction. The authors of [1] point out that the main reason for using an LSTM-based network is that it handles object occlusion effectively, because the network only updates the units corresponding to the visible parts of the object. If a subsequent view reveals a previously occluded part and that part does not match the network's prediction, the network updates the LSTM states for that part while preserving the states of the others.
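To illustrate the encoder-decoder idea only, here is a heavily simplified single-view sketch in PyTorch. It is not the authors' implementation: the 3D-LSTM is omitted and all layer sizes are invented, but it shows the overall shape of the pipeline, a 2D-CNN encoder followed by a 3D deconvolutional decoder that outputs a 32^3 occupancy grid.

```python
import torch
import torch.nn as nn

class VoxelDecoderSketch(nn.Module):
    """Simplified encoder-decoder in the spirit of 3D-R2N2 (single view, no 3D-LSTM)."""
    def __init__(self):
        super().__init__()
        # 2D-CNN encoder: RGB image -> low-dimensional feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        # Feature vector -> seed 4x4x4 volume with 64 channels.
        self.fc = nn.Linear(128 * 4 * 4, 64 * 4 * 4 * 4)
        # 3D deconvolutional decoder: 4^3 -> 8^3 -> 16^3 -> 32^3 occupancy grid.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, img):
        feat = self.encoder(img).flatten(1)
        grid = self.fc(feat).view(-1, 64, 4, 4, 4)
        return torch.sigmoid(self.decoder(grid))  # per-voxel occupancy probability

net = VoxelDecoderSketch()
out = net(torch.randn(1, 3, 128, 128))  # a dummy RGB image
print(out.shape)                        # torch.Size([1, 1, 32, 32, 32])
```

In the actual 3D-R2N2, the encoded feature updates a grid of 3D convolutional LSTM cells rather than being reshaped directly, which is what lets the network fuse multiple views and handle occlusion as described above.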

One challenge with the voxel representation is that the model's computational and memory footprint grows cubically with resolution, and much of the output detail is lost if the resolution is kept below 32×32×32 to reduce memory requirements. Some studies achieve computational and storage efficiency through hierarchical partitioning of the output space [2, 3, 4]. Among them, the octree-generating network [4] proposed by Maxim Tatarchenko et al. achieves a significant resolution improvement, even up to 512^3.

In terms of computation, it is more tractable to use a neural network to learn a depth map. Take the study of D. Eigen et al. [5] as an example: their network is built mainly from convolutional layers, and they attempt to explicitly encode higher resolution into the structure of the network.

The neural network proposed by D. Eigen et al. has two parts: a global coarse-scale network and a local fine-scale network (see Figure 8). The task of the coarse-scale network is to predict the overall structure of the depth map using a global view of the scene. Its upper layers are fully connected, so the entire image falls within their field of view. The coarse-scale network consists of five feature-extraction layers of convolution and max pooling, followed by two fully connected layers. The input, feature map, and output sizes are shown in the figure below; the final output is downsampled by a factor of four relative to the input. The fine-scale network receives the coarse prediction together with the input image and aligns the coarse prediction with local details. It contains only convolutional layers, apart from the concatenation layer that joins the input image with the coarse prediction.

Figure 8: Structure of neural network model proposed by D. Eigen et al. Photo source: D. Eigen, C. Puhrsch, and R. Fergus. (2014). Depth map prediction from a single image using a multi-scale deep network. NIPS.

This model achieved the state of the art at the time, but in practice its predictions were still very rough; sometimes the model's prediction showed little more than the outline of the object. To further improve resolution, D. Eigen and colleagues later made several improvements to the model, including extending the network from two scales to three, which doubles the output resolution, though it is still only half the resolution of the input image.

Applying neural networks to the remaining two data forms, point clouds and meshes, brings an additional difficulty: point clouds and meshes are not regular geometric data structures and cannot be used directly. Earlier studies, such as VoxNet proposed by D. Maturana and S. Scherer [6] and 3D ShapeNets proposed by Z. Wu et al. [7], first converted the data into voxel or image representations and then trained on the voxelized data. But this greatly increases the computational burden and may blur the data. Later, some studies attempted to make neural networks learn the geometric features of the mesh directly, such as the geodesic convolutional neural networks proposed by Jonathan Masci et al. [8].

In general, if one has to choose among the four representations, voxel-based reconstruction usually achieves the highest accuracy, while the depth map is the easiest representation for transferring the reconstruction task to deep learning. The current state of the art is the Matryoshka network proposed by Stephan R. Richter and Stefan Roth [9], which represents shapes as multiple nested depth maps.

As for loss functions, because the data take different forms, there is no single universally used loss at present; commonly used ones include the scale-invariant error and cross entropy. Accuracy is mostly measured with RMSE and IoU.
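For reference, the IoU (intersection over union) between a predicted voxel set A and a ground-truth voxel set B is defined as

\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|},
\]

which equals 1 for a perfect reconstruction and approaches 0 when the two shapes barely overlap.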

Geometry vs. deep learning

Maxim Tatarchenko et al. [10] argue that the current best deep learning algorithms actually learn image classification rather than image reconstruction. They designed several simple baseline algorithms: a k-means algorithm that clusters the shapes in the training set into 500 sub-categories; a retrieval algorithm that projects the shapes in the dataset into a low-dimensional space to obtain their similarity to other shapes; and a nearest-neighbor baseline (Oracle NN) that, for each 3D shape in the test set, finds the closest shape in the training set according to IoU. This last approach cannot be applied in practice, but it gives an upper bound on the performance that retrieval methods can achieve on this task.

Tests on the ShapeNet dataset show that no neural network algorithm outperforms the nearest-neighbor baseline (Oracle NN); moreover, the highest mean IoU in the test results is below 0.6.

Figure 9: Sample reconstructions of the models on test data. Photo source: M. Tatarchenko, S. R. Richter, R. Ranftl, Z. Li, V. Koltun, T. Brox. (2019). What Do Single-view 3D Reconstruction Networks Learn? arXiv:1905.03678v1.

Figure 9 shows the performance of several models on some of the test data. The performance of the baseline models designed by the authors is essentially the same as that of the neural networks. The number in the lower-right corner of each sample is the IoU; the baseline models get the overall shape right by retrieving similar shapes, although the details may be wrong.

Another surprising finding is that if you plot a histogram of the IoU scores for each object class, the neural networks and the two baseline methods show very similar within-class distributions, even though the two baselines are essentially image-recognition methods.

Kolmogorov-Smirnov tests were performed on the IoU histograms of all test methods over the 55 classes. The null hypothesis of this test is that the two distributions show no statistically significant difference. Three examples are shown on the right side of Figure 10. Each cell in the heat map on the left shows the number of classes for which the statistical test does not allow rejecting the null hypothesis, i.e., for which the p value is greater than 0.05. As can be seen, for the large majority of classes one cannot reject the null hypothesis that the deep learning methods and the two baseline methods have the same histogram distribution; only the nearest-neighbor method differs markedly from the others.

Figure 10: IoU histogram comparison of test methods and Kolmogorov-Smirnov test results. Photo source: M. Tatarchenko, S. R. Richter, R. Ranftl, Z. Li, V. Koltun, T. Brox. (2019). What Do Single-view 3D Reconstruction Networks Learn? arXiv:1905.03678v1.

Even if we set aside the question of what deep learning actually learns during training and focus only on the reconstruction performance of neural networks, the results presented in the 3D-R2N2 study [1] are still worth discussing. The authors used four high-quality CAD models of different categories, manually edited the textures to obtain low, medium, and high texture strengths, and rendered them from different viewpoints. The MVS algorithm used for comparison is patch-match, with camera positions estimated by global SfM. The reconstructions are represented as voxels and measured by IoU. Since the 3D objects generated by patch-match are meshes, they need to be voxelized for comparison; the final output size is 32×32×32.

Figure 11: Sample test data and comparison results. Source: C. B. Choy, D. Xu, J. Gwak. (2016). 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. ECCV.

The first row of the figure above shows examples of the test data: from left to right, images with various viewpoints and texture levels from high to low. The results are interesting: the strengths and weaknesses of 3D-R2N2 and patch-match are almost exactly opposite. Figure (a) shows that when patch-match encounters objects with a low texture level, its prediction accuracy drops sharply. On the other hand, when the assumption behind patch-match is satisfied (a high texture level), its accuracy is much higher than that of 3D-R2N2, even though the output size is only 32x32x32. As in the depth-learning results discussed earlier, the neural network's prediction is again dominated by the outline, with details missing. The authors suggest that this is mainly due to bias in the training data, most of which has a low texture level, since the performance of 3D-R2N2 actually decreases as the texture level rises. Figure (b) shows that when there are too few viewpoints, patch-match cannot run at all, while 3D-R2N2 is unaffected. However, the accuracy of patch-match improves as viewpoints are added, whereas 3D-R2N2 cannot reconstruct more detail.

Figures (c) to (h) show the performance of 3D-R2N2 and patch-match on a high-texture aircraft model with 20, 30, and 40 viewpoints respectively, where (c)-(e) are the patch-match reconstructions and (f)-(h) are the 3D-R2N2 reconstructions. The 3D-R2N2 predictions barely change, while patch-match improves from a completely unrecognizable prediction to a smooth, fine-grained reconstruction.

Seeing this result, the author was reminded of the no-free-lunch theorem. Since the deep learning algorithm uses only an image as input, the available information is reduced, which is bound to lower the best attainable performance even as it broadens applicability. Yasutaka Furukawa makes the same point in his book [11]: an MVS algorithm is only as good as the quality of the input images and camera parameters.

In addition, current deep learning research faces two further problems: how to measure model performance, and what training data to use.

As mentioned earlier, mean IoU is often used as the primary quantitative metric for benchmarking single-view reconstruction. But using it as the sole indicator can be problematic, because an IoU score reflects the quality of the predicted shape only when it is high enough; low-to-medium scores can hide significant differences between two shapes. See the figure below for an example.

Figure 12: IoU values of different generated results. Photo source: M. Tatarchenko, S. R. Richter, R. Ranftl, Z. Li, V. Koltun, T. Brox. (2019). What Do Single-view 3D Reconstruction Networks Learn? arXiv:1905.03678v1.

Regarding the second problem, the training set currently used in deep learning is mainly the ShapeNet dataset. However, the shapes in its test set are very similar to those in its training set, so a reconstruction model trained on ShapeNet can easily find a shortcut: simply retrieve a similar shape from the training set. We saw exactly this phenomenon in the work of Maxim Tatarchenko et al.

So far we have discussed a number of commonly used algorithms; the author summarizes their characteristics in Table 1.

Future directions

As we have seen, the advantages and disadvantages of geometric algorithms and deep learning methods are completely different. Recovering 3D structure from a single image is valuable when conditions allow nothing more; but when more data is available, it must be better to use as much information as possible. Moreover, deep learning research that focuses on recovering 3D structure from a single image implicitly assumes that three-dimensional information can be detected from a monocular image, which should be theoretically impossible, and which is presumably why most creatures have two eyes.

In addition, in terms of productization, it is worth exploring how to combine 3D reconstruction with specific applications, such as AGVs and autonomous driving. This process will certainly bring additional challenges, including but not limited to:

  • Algorithm deployment: embedded SLAM has received a great deal of research attention, but deployment has not received enough attention in the SfM field.

  • Computational speed: embedded systems place very high demands on the running speed of algorithms. How can computationally heavy algorithms such as SfM and deep learning be optimized to meet real-time requirements?

  • Sensor fusion: driverless cars, for example, tend to deploy radar and other sensors alongside the vision system's cameras. As 3D reconstruction moves toward industry, it will need to collect more kinds of data to cope with complex visual environments, which inevitably requires a 3D reconstruction system that can handle different types of data and analyze them jointly.

Given the complex environments in which these algorithms will actually be used, geometric algorithms and deep learning methods can in fact complement each other, and it is not necessary to choose between the two. A likely trend is to let geometric methods take the lead and use deep learning's powerful feature representation ability to cover the situations where geometric methods do not apply.

Resources

Software libraries:

  • PCL (Point Cloud Library) (http://pointclouds.org/)

  • MeshLab: the open source system for processing and editing 3D triangular meshes. (http://www.meshlab.net/)

  • Colmap: a general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline with a graphical and command-line Interface. (https://colmap.github.io/)

Textbooks and tutorials:

  • Multi-View Stereo: A Tutorial. Yasutaka Furukawa. (https://www.nowpublishers.com/article/Details/CGV-052)

  • Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. (https://www.robots.ox.ac.uk/~vgg/hzbook/)

  • D. Scharstein, R. Szeliski. (2001). A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Proceedings IEEE Workshop on Stereo and Multi-Baseline Vision (SMBV 2001). (http://vision.middlebury.edu/stereo/taxonomy-IJCV.pdf)

  • NYU course slide Lecture 6: Multi-view Stereo & Structure from Motion (https://cs.nyu.edu/~fergus/teaching/vision_2012/6_Multiview_SfM.pdf)

Datasets:

  • ShapeNet dataset (https://www.shapenet.org/about)

This article was written by Machine Heart analyst Yuanyuan Li. After graduating, she stayed in Europe and chose to work in mechanical R&D, mainly responsible for image processing and putting computer vision algorithms into practice. She appreciates simple, elegant yet effective algorithms, and tries to strike a balance between deep learning enthusiasm and skepticism. She hopes to broaden her thinking by sharing her views and exchanging ideas here.

Machine Heart personal homepage: https://www.jiqizhixin.com/users/a761197d-cdb9-4c9a-aa48-7a13fcb71f83

References:

[1] C. B. Choy, D. Xu, J. Gwak. (2016). 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. ECCV.

[2] C. Hane, S. Tulsiani, and J. Malik. (2017). Hierarchical surface prediction for 3D object reconstruction. 3DV.

[3] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger.(2017). OctNetFusion: Learning depth fusion from data. 3DV.

[4] M. Tatarchenko, A. Dosovitskiy, and T. Brox. (2017). Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. ICCV.

[5] D. Eigen, C. Puhrsch, and R. Fergus. (2014). Depth map prediction from a single image using a multi-scale deep network. NIPS.

[6] D. Maturana and S. Scherer. (2015). VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. IROS.

[7] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang and J. Xiao. (2015). 3D ShapeNets: A Deep Representation for Volumetric Shapes. CVPR.

[8] J. Masci, D. Boscaini, M. M. Bronstein, P. Vandergheynst. (2015). Geodesic convolutional neural networks on Riemannian manifolds. arXiv:1501.06297.

[9] S. R. Richter and S. Roth. (2018). Matryoshka networks: Predicting 3D geometry via nested shape layers. CVPR.

[10] M. Tatarchenko, S. R. Richter, R. Ranftl, Z. Li, V. Koltun, T. Brox. (2019). What Do Single-view 3D Reconstruction Networks Learn? arXiv:1905.03678v1.

[11] Y. Furukawa and C. Hernandez. (2015). Multi-View Stereo: A Tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1-148.
