
The previous section described how multi-scale features extracted from the backbone network are fused across scales by the BiFPN, and how the fused features are then fed into a YOLO-style one-stage detection head for object detection. For Tesla, of course, perception comes first.

Semantic segmentation extracts the image's semantics, which feed FSD's control decisions and are also visualized for users to strengthen their trust in Tesla.

The multi-scale feature maps obtained from the original image through this series of networks are used for different detection tasks, such as detecting vehicles, traffic lights, lane lines, and so on. This is a multi-task learning network: every task is built on the same backbone structure, the tasks share one set of features, and each task can be seen as a branch growing out of the backbone network.
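To make the structure concrete, here is a minimal, hypothetical sketch of a shared trunk with per-task heads in PyTorch. All names and layer sizes (TinyBackbone, the three heads, the channel counts) are invented for illustration and are not Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Shared trunk that returns feature maps at two scales."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)          # higher resolution, shallower features
        f2 = self.stage2(f1)         # lower resolution, deeper features
        return [f1, f2]              # multi-scale features shared by all heads

class HydraNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = TinyBackbone()
        # Each task is a branch ("head") growing from the shared trunk.
        self.heads = nn.ModuleDict({
            "vehicle":       nn.Conv2d(64, 8, 1),
            "traffic_light": nn.Conv2d(64, 4, 1),
            "lane_line":     nn.Conv2d(64, 2, 1),
        })

    def forward(self, x):
        feats = self.backbone(x)     # trunk forward pass runs once
        deepest = feats[-1]
        return {name: head(deepest) for name, head in self.heads.items()}

net = HydraNetSketch()
outputs = net(torch.randn(1, 3, 256, 256))
print({k: v.shape for k, v in outputs.items()})
```

The key point is that `backbone(x)` runs once and every head reads the same set of features.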

Let's talk about the benefits of this network structure.

Feature sharing

First is feature sharing: the different tasks share the backbone's forward inference pass, which saves test time by amortizing that pass across all of them. Tesla has thousands of kinds of targets to recognize, and because every task shares the backbone's features, this sharing saves a great deal of time.
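As a rough illustration of the saving, the sketch below reuses the hypothetical `HydraNetSketch` from above and compares one shared trunk pass feeding many heads against re-running the whole network once per task. Absolute timings depend entirely on hardware; only the trend matters.

```python
import time
import torch

n_tasks = 10
x = torch.randn(1, 3, 256, 256)

with torch.no_grad():
    t0 = time.perf_counter()
    feats = net.backbone(x)                       # trunk runs once
    _ = [net.heads["vehicle"](feats[-1])          # same head reused as a
         for _ in range(n_tasks)]                 # stand-in for many tasks
    t_shared = time.perf_counter() - t0

    t0 = time.perf_counter()
    _ = [net(x)["vehicle"] for _ in range(n_tasks)]   # trunk re-run per task
    t_separate = time.perf_counter() - t0

print(f"shared trunk: {t_shared:.4f}s   separate trunks: {t_separate:.4f}s")
```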

Low coupling between tasks

Although the different tasks share the trunk features, they can be adjusted fairly independently: to tune one task, only that task's head network needs to be fine-tuned, and the adjustment does not affect the other tasks. The result is low coupling between tasks.
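A common way to realize this low coupling, sketched below with the hypothetical `HydraNetSketch` from earlier, is to freeze every parameter and then unfreeze just the head being adjusted; the batch and labels here are stand-ins.

```python
import torch
import torch.nn.functional as F

# Freeze everything, then unfreeze only the head being fine-tuned.
for p in net.parameters():
    p.requires_grad = False
for p in net.heads["lane_line"].parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in net.parameters() if p.requires_grad), lr=1e-4)

x = torch.randn(2, 3, 256, 256)        # stand-in image batch
target = torch.randn(2, 2, 64, 64)     # stand-in lane-line labels
loss = F.mse_loss(net(x)["lane_line"], target)
loss.backward()                        # gradients reach only the unfrozen head
optimizer.step()                       # other tasks' heads are untouched
```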

Caching the bottleneck layer

Each time the network is updated with new data, the multi-scale features extracted by the trained trunk network are cached to disk. When fine-tuning, only the task heads are updated from those cached features; in other words, the trunk's features do not take part in the fine-tune optimization. The occasional full training pass is still a joint training of all tasks end to end.
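The caching idea can be sketched as follows, again reusing the hypothetical network above; the paths, dataset size, and shapes are all made up for illustration.

```python
import os
import torch

# Run the trained trunk once per image, save the multi-scale features,
# then train heads directly from the cache without touching the trunk.
os.makedirs("cache", exist_ok=True)
backbone = net.backbone.eval()

with torch.no_grad():
    for i in range(100):                       # pretend dataset of 100 images
        img = torch.randn(1, 3, 256, 256)
        feats = backbone(img)                  # expensive pass happens once
        torch.save(feats, f"cache/feats_{i}.pt")

# Head training later loads cached tensors instead of re-running the trunk:
feats = torch.load("cache/feats_0.pt")
lane_out = net.heads["lane_line"](feats[-1])   # cheap head-only forward pass
```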

In earlier versions of the neural network, images were collected by a single camera and processed by HydraNet. The various prediction tasks, such as lane lines, vehicles on the road, and traffic signs, were all based on the information provided by a single image.

Smart Summon is a new feature that can call a Tesla parked in a parking lot to drive itself to its owner. It is a fully driverless capability, a tentative step toward full self-driving. The problem Tesla runs into here is localization: relying on these per-camera views alone, the vehicle cannot control itself well enough to find its owner.

The image above shows how the bird's-eye view is stitched together in real time from road-edge detections across the cameras. There are in fact several classical algorithms for image edge detection, such as the Canny algorithm and the Hough transform introduced earlier, which can detect lane lines and road edge lines. But stitching the edge lines from different images into a consistent bird's-eye view is not easy to achieve with these algorithms. The hope, therefore, is that the neural network does all of this internally, so that there is no data exchange between separate models and the network remains consistent end to end.
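For reference, a minimal per-image baseline with OpenCV's Canny and probabilistic Hough transform might look like the sketch below; the file names are placeholders, and it detects line segments in a single image only, with no cross-camera stitching.

```python
import cv2
import numpy as np

img = cv2.imread("road.jpg")                  # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)              # gradient-based edge map

# Probabilistic Hough transform: fit line segments to the edge pixels.
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=50,
                        minLineLength=40, maxLineGap=10)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(img, (x1, y1), (x2, y2), (0, 0, 255), 2)
cv2.imwrite("road_lines.jpg", img)
```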

Stitch up each camera's curb predictions across cameras and across time.

Difficulty 1

Cross-camera fusion and the tracker are hard to write by hand.

Difficulty 2

Image space is not the correct output space

Through the neural network, the image yields good detections of lane lines and road edge lines, drawn as red and blue lines in the image. But when these lines are projected into vector space, they become a mess. The main cause of the problem is the lack of depth information for each pixel.
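One way to see why missing depth hurts: projecting image pixels into vector space typically assumes a flat ground plane and maps each pixel through a homography. The sketch below uses an entirely made-up matrix H (in practice it would come from camera intrinsics and extrinsics, and the horizon here sits at v = 240).

```python
import numpy as np

# Flat-ground projection: map image pixel (u, v) to ground-plane coordinates.
H = np.array([[0.02, 0.0,   -6.4],
              [0.0,  0.0,    1.6],
              [0.0,  0.002, -0.48]])

def pixel_to_ground(u, v):
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]                 # normalize homogeneous coordinates

print(pixel_to_ground(320, 400))        # low in the image  -> ~5 m ahead
print(pixel_to_ground(320, 250))        # near the horizon  -> ~80 m ahead
print(pixel_to_ground(320, 245))        # 5 px higher       -> ~160 m ahead
```

Without per-pixel depth, every pixel must be assumed to lie on this plane, so any slope in the road or a few pixels of detection error near the horizon turn into tens of meters of range error, which is exactly what smears the projected lines.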

For object detection, cross-camera fusion also helps with cases like a truck on the road that appears in all five cameras while no single camera sees the whole truck. The question is how to recover a complete truck detection by fusing across cameras.

So here are two questions for you:

  • The first question: how does Tesla introduce depth information through the neural network so that information from multiple images can be fused into one vector space?
  • The second: how to use multiple cameras for more accurate object recognition?

These two questions will be answered next time.