Since CVPR 2018 announced its accepted papers, Machine Heart has introduced a number of them to readers. Unexpectedly, this one went on to win the CVPR 2018 Best Paper award after we had already planned to cover it (this editor's judgment was off), so we are recommending it to you ahead of schedule.

Introduction

Object recognition, depth estimation, edge detection, pose estimation, and so on are examples of common visual tasks that the research community considers useful and works to solve. Some of these tasks are clearly related: we know that surface normals and depth are related (one is the derivative of the other), or that vanishing points are helpful for localization. Other relationships are less obvious: how should keypoint detection and shading combine to perform pose estimation?

So far, the field of computer vision has not explicitly made use of these relationships. We have made significant progress by developing advanced learners, such as ConvNets, that can find complex mappings from X to Y given training data, i.e., pairs (x, y) with x ∈ X, y ∈ Y. This is commonly called fully supervised learning, and it can often solve problems in isolation. But this siloed, task-by-task approach makes training new tasks or integrated perception systems a Sisyphean challenge: each task must be learned from scratch, individually. In doing so, quantifiable relationships between tasks are ignored, which drives up the demand for labeled data.

Figure 1: An example task structure discovered by Taskonomy. For instance, the figure shows that by combining the features learned by a surface normal estimator and an occlusion edge detector, a good network for reshading and point matching can be trained quickly with only a small amount of labeled data.

In addition, models that incorporate the relationships between tasks require less supervision, use less computation, and behave in more predictable ways. Incorporating such a structure is a first stepping stone toward developing provably efficient comprehensive/universal perception models [34, 4], that is, models that can solve a large number of tasks before the need for supervision or computation becomes intractable. However, the structure of this task space and its implications remain largely unknown. These relationships are important, but finding them is complicated by the imperfections of our learning models and optimizers.

In this paper, the researchers attempt to uncover this underlying structure and propose a framework for mapping the space of visual tasks. By "structure" they mean a collection of computationally discovered relationships that specify which tasks provide useful information to which other tasks, and how much (see Figure 1).

They therefore adopt a fully computational approach, using neural networks as the computational function class. In a feedforward network, each layer successively produces a more abstract representation of the input, containing the information needed to map from input to output. If the tasks are interrelated in some form [83, 19, 58, 46], however, these representations should also carry statistics that are useful for solving other tasks. The approach rests on measuring whether a solution to one task can be read out easily enough from a representation trained on another task; this yields an affinity matrix between tasks. Such transfers are sampled exhaustively, and a globally efficient transfer policy is extracted from them via binary integer programming. The results show that this model can solve tasks with far less data than learning each task independently, and that the resulting structure also holds on common datasets (ImageNet [78] and Places [104]).
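As a rough illustration of how these stages fit together (task-specific training, transfer sampling, affinity normalization, policy extraction), here is a minimal Python sketch. All names and values are hypothetical stand-ins, not the paper's implementation; in particular, the per-target ranking below is only a crude substitute for the AHP step, and the greedy selection is the budget-free special case of the BIP step.

```python
import numpy as np

def transfer_loss(source_repr, target_task):
    """Stand-in: train a small readout from frozen source features to the
    target labels and return its test loss. Stubbed with random values."""
    return np.random.rand()

tasks = ["depth", "normals", "reshading", "keypoints"]
reprs = {t: f"frozen_encoder_{t}" for t in tasks}  # stage I: task-specific nets

# Stage II: sample all first-order transfers (self-transfers excluded).
loss = np.array([[transfer_loss(reprs[s], t) for t in tasks] for s in tasks])
np.fill_diagonal(loss, np.inf)

# Stage III: normalize raw losses into comparable affinities. The paper uses
# AHP (pairwise tournaments between sources); a per-target rank stands in here.
affinity = loss.argsort(axis=0).argsort(axis=0) / (len(tasks) - 1)

# Stage IV: choose a source for each target. The paper solves this globally
# with binary integer programming under a supervision budget.
policy = {t: tasks[int(affinity[:, j].argmin())] for j, t in enumerate(tasks)}
print(policy)
```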

The fully computational, representation-based approach proposed in this paper avoids imposing prior (and possibly false) assumptions on the task space. This is crucial because priors about the relationships between tasks are usually derived from human intuition or analytical knowledge, while neural networks need not operate on the same principles [63, 33, 40, 45, 102, 88]. For example, although we might expect depth to transfer better to surface normals (taking derivatives is easy), the reverse direction turns out to transfer better in this computational framework (that is, better for neural networks).

Figure 2: Computationally modeling task relationships and creating the taxonomy. From left to right: I. Train a task-specific network for each task. II. Train transfer functions between tasks (first order or higher) in the latent space. III. Obtain normalized transfer affinities using AHP (Analytic Hierarchy Process). IV. Find the global transfer taxonomy using BIP (binary integer programming).

Figure 3: Task dictionary. Outputs of 24 task-specific networks (out of 26 tasks) for one query image (top left). Frame-by-frame results of applying the networks to a video can be viewed online.

Figure 4: Transfer function. A small readout function is trained to map the frozen representation from a source task's encoder to the labels of the target task. If the order is greater than 1, the transfer function receives representations from multiple source tasks.
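To make the figure concrete, below is a minimal PyTorch sketch of such a readout; the source encoders are assumed frozen, and all channel counts and layer sizes are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TransferReadout(nn.Module):
    """Small readout mapping frozen source-task representations to
    target-task outputs. For order k > 1, the k source representations
    are concatenated along the channel dimension."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.readout = nn.Sequential(  # kept shallow: transfers must be cheap
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, source_reprs):   # list of (N, C, H, W) tensors
        return self.readout(torch.cat(source_reprs, dim=1))

# Usage: a second-order transfer from two frozen encoders (8 channels each,
# illustrative) to surface normals (3 output channels).
with torch.no_grad():                  # the source encoders stay frozen
    r1 = torch.randn(2, 8, 16, 16)     # stand-ins for encoder outputs
    r2 = torch.randn(2, 8, 16, 16)
readout = TransferReadout(in_channels=16, out_channels=3)
print(readout([r1, r2]).shape)         # torch.Size([2, 3, 16, 16])
```

Only the readout's parameters are trained; this is what keeps sampling thousands of transfers computationally feasible.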

Figure 5: Transfer results from five different source tasks to surface normals (top) and 2.5D segmentation (bottom). The differences in transferability across sources are significant; in this case, reshading is among the best-transferring sources. The task-specific networks were trained with 60 times more data. "Scratch" denotes training from scratch, without transfer learning.

Figure 6: Higher-order transfer. Multiple representations can contain complementary information. For example, the staircase can be extracted by transferring simultaneously from 3D edges and curvature. For more examples, see the publicly available interactive transfer visualization page: http://taskonomy.stanford.edu/tasks/.

Figure 7: First-order task affinity matrix before (left) and after (right) AHP normalization. Lower values mean better transfer performance. For visualization, a standard affinity-distance transform is applied, dist = e^(−β·P), where β = 20 and the exponential is taken element-wise over the matrix. For the complete matrices of higher-order transfers, see the supplementary material.
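As a quick numeric illustration of this transform (the affinity values here are made up):

```python
import numpy as np

beta = 20.0
P = np.array([[0.02, 0.10],   # hypothetical normalized affinities
              [0.25, 0.01]])  # (lower = better transfer)
dist = np.exp(-beta * P)      # element-wise exponential
print(dist.round(3))
# [[0.67  0.135]
#  [0.007 0.819]]
```

Better transfers (smaller P) map to distance values near 1, which spreads the matrix out for visualization.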

Table 1: Performance of the task-specific networks: win rates vs. readouts of a random (Gaussian) network representation and vs. a statistically informed guess (the average). The win rate (%) is the proportion of test-set images on which the network beats the baseline. Table 1 lists the win rates of each task-specific network against both baselines; visual outputs on a random test sample are shown in Figure 3. The high win rates and the qualitative results indicate that the networks are trained well and stably, and can be used to model the task space.

Figure 8: Computed taxonomies for solving 22 tasks given different supervision budgets (x-axis) and maximum allowed transfer orders (y-axis). One taxonomy is enlarged for visibility. Nodes with incoming edges are target tasks, and the number of incoming edges is the order of the transfer function selected for them. That some targets are still solved by transfer even at a budget of 26 (the full set) means that some transfers outperform their fully supervised, task-specific counterparts. Node colors encode the gain and performance metrics; see the interactive solver website: http://taskonomy.stanford.edu/api/. Dimmed nodes are source-only tasks, so they appear in a taxonomy only if the BIP optimization deems them valuable as sources.
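To show the flavor of the BIP step, here is a toy formulation in Python using PuLP. The tasks, loss values, and budget are invented for illustration, and this simplification covers first-order transfers only; the paper's actual program also handles higher-order transfers and per-task weighting.

```python
import pulp

targets = ["normals", "reshading"]
sources = ["depth", "edges", "curvature"]
loss = {("depth", "normals"): 0.2,  ("edges", "normals"): 0.5,
        ("curvature", "normals"): 0.4, ("depth", "reshading"): 0.3,
        ("edges", "reshading"): 0.25, ("curvature", "reshading"): 0.6}
budget = 2  # max number of source tasks receiving full supervision

prob = pulp.LpProblem("taskonomy_bip", pulp.LpMinimize)
pick = pulp.LpVariable.dicts("pick", list(loss), cat="Binary")
use = pulp.LpVariable.dicts("use", sources, cat="Binary")

prob += pulp.lpSum(loss[e] * pick[e] for e in loss)       # total transfer loss
for t in targets:                                         # one source per target
    prob += pulp.lpSum(pick[(s, t)] for s in sources) == 1
for s, t in loss:                                         # picked edge implies
    prob += pick[(s, t)] <= use[s]                        # supervising its source
prob += pulp.lpSum(use[s] for s in sources) <= budget     # supervision budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({t: next(s for s in sources if pick[(s, t)].value() == 1)
       for t in targets})  # e.g. {'normals': 'depth', 'reshading': 'edges'}
```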

Figure 9: Evaluation of the taxonomies computed for solving the full task dictionary. As the supervision budget increases (→), the gain (left) and performance (right) achieved for each task using the policies recommended by the computed taxonomy are shown, for maximum transfer orders 1 to 4 (top to bottom).

Figure 10: Generalizing to novel tasks. Each row represents a novel test task. Left: gain and performance using an "all-in-one" transfer policy of order 1-4 devised for the novel task. Right: win rates (%) of the transfer policy against various self-supervised methods, ImageNet features, and training from scratch, shown as colored rows. Note the large advantage of the taxonomy. The uncolored rows give the corresponding loss values.

Figure 11: Importance of the structure. A comparison of the taxonomy presented here with random transfer policies (random feasible taxonomies using the maximum allowable supervision budget). The y-axis shows performance or gain, and the x-axis the supervision budget. The green and gray lines represent the taxonomy and the random connectivities, respectively. Error bars span the 5th to 95th percentiles.

Figure 12: Evaluating how well the discovered structure generalizes to other datasets: object classification on ImageNet [78] (left) and scene classification on MIT Places [104] (right). The y-axis shows accuracy on the external benchmark, while the bars on the x-axis are ordered by the taxonomy's predicted transfer performance on our dataset. A monotonically decreasing plot corresponds to a preserved ordering, i.e., perfect generalization.

Figure 13: Task similarity tree. Agglomerative clustering of tasks based on their transferring-out patterns (i.e., using the columns of the normalized affinity matrix as task features). 3D, 2D, low-dimensional geometric, and semantic tasks cluster together, using a fully computational approach.
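A minimal sketch of how such a similarity tree could be built from the affinity matrix, using random stand-in values and SciPy's agglomerative clustering (the task names and feature orientation are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
tasks = ["depth", "normals", "2d_edges", "autoencoding", "object_class"]
P = rng.random((5, 5))  # stand-in for the normalized affinity matrix
features = P.T          # column j of P serves as the feature vector of task j

Z = linkage(features, method="ward")   # agglomerative (hierarchical) clustering
clusters = fcluster(Z, t=3, criterion="maxclust")
for task, c in zip(tasks, clusters):
    print(f"{task}: cluster {c}")
```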

Taskonomy: Disentangling Task Transfer Learning

Paper address: http://taskonomy.stanford.edu/taskonomy_CVPR2018.pdf

Are visual tasks related? For example, can surface normals be used to simplify the process of estimating the depth of an image? Intuitively, positive answers to these questions suggest that there is a structure among visual tasks. Knowing this structure is of great value: it is the concept underlying transfer learning, and it provides a principled way to identify redundancies across tasks, for example in order to seamlessly reuse supervision among related tasks or to solve many tasks in one system without piling up complexity.

We propose a fully computational approach for modeling the structure of the space of visual tasks, by finding (first-order and higher) transfer-learning dependencies across a dictionary of 26 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, such as the emergence of nontrivial relationships, and exploit them to reduce the demand for annotated data. For example, we show that the total number of labeled data points needed to solve a set of 10 tasks can be reduced by roughly two-thirds (compared to independent training) while keeping performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure, including a solver that users can employ to devise efficient supervision policies for their own use cases.