
Tesla mounts a total of eight cameras around the car body. While driving, image information is collected by all eight, but only seven of them may be used for Autopilot, because the camera mounted above the license plate provides the reversing image for parking.

  • A front three-camera cluster is mounted behind the windshield, consisting of a forward wide field camera, a main field camera, and a narrow field camera.

  • Two cameras sit on the B-pillars, the black panel separating the front door from the rear door, looking mainly to the side and forward.

  • Two cameras are built into the side turn-signal repeaters, one on each front fender, looking behind the vehicle.

The fields of view of these cameras overlap each other to ensure there are no blind spots while the vehicle is driving. Apart from the front three-camera cluster, the other four cameras are basically what enable Tesla's L3-level functions: lane changes, merging, and taking highway exits. As can be seen from real-world footage, the adjacent left and right lanes are indeed covered without blind spots.
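To keep the layout above straight, here is a small hypothetical data structure enumerating the eight-camera rig as the text describes it. The names and mounting positions follow the paragraphs above; the exact fields of view are not given in the source, so none are invented here.

```python
# Hypothetical enumeration of the eight-camera rig described above.
# Names and mounting positions follow the text; FOV values are omitted.
TESLA_CAMERAS = {
    "wide_forward":   {"mount": "behind windshield",    "looks": "forward, wide"},
    "main_forward":   {"mount": "behind windshield",    "looks": "forward"},
    "narrow_forward": {"mount": "behind windshield",    "looks": "forward, long range"},
    "left_b_pillar":  {"mount": "left B-pillar",        "looks": "side / forward"},
    "right_b_pillar": {"mount": "right B-pillar",       "looks": "side / forward"},
    "left_repeater":  {"mount": "left front fender",    "looks": "rearward"},
    "right_repeater": {"mount": "right front fender",   "looks": "rearward"},
    "rear":           {"mount": "above license plate",  "looks": "rearward (parking)"},
}

# The seven driving cameras exclude the rear parking camera.
driving_cameras = [name for name in TESLA_CAMERAS if name != "rear"]
```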

The Autopilot system cannot rely only on object detection in 2D image space; it also needs a top-down, bird's-eye view. The traditional method is to create an occupancy tracker, which projects 2D detections into 3D space and can be used to join detections from different cameras at different points in time. Such a tracker only builds a small local map around the vehicle (enough, for example, to help the car wind its way through a parking lot).
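To make the projection step concrete, here is a minimal sketch of such an occupancy tracker, an illustration rather than Tesla's implementation. It assumes a simple pinhole camera with hypothetical intrinsics `K` and an assumed camera height, back-projects a pixel onto a flat ground plane, and accumulates the hits in a small ego-centred grid.

```python
import numpy as np

# Hypothetical intrinsics for one camera; real values come from calibration.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])   # pinhole camera matrix
CAM_HEIGHT = 1.4                          # assumed camera height above ground, metres

def pixel_to_ground(u, v):
    """Back-project a pixel onto the ground plane (camera looking forward)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray in camera coords
    # Image y points down, so only rays with ray[1] > 0 hit the ground.
    if ray[1] <= 1e-6:
        return None                                  # at or above the horizon
    scale = CAM_HEIGHT / ray[1]
    return ray[0] * scale, ray[2] * scale            # (lateral, forward) in metres

class OccupancyTracker:
    """Tiny local top-down grid accumulating projected detections over time."""
    def __init__(self, size_m=40.0, res_m=0.5):
        self.res = res_m
        n = int(size_m / res_m)
        self.grid = np.zeros((n, n))                 # ego-centred occupancy counts

    def add_detection(self, u, v):
        pt = pixel_to_ground(u, v)
        if pt is None:
            return
        x, z = pt
        i = int(z / self.res)                        # forward cell
        j = int(x / self.res) + self.grid.shape[1] // 2  # lateral cell, ego-centred
        if 0 <= i < self.grid.shape[0] and 0 <= j < self.grid.shape[1]:
            self.grid[i, j] += 1                     # stitch hits across frames/cameras

tracker = OccupancyTracker()
tracker.add_detection(640, 500)   # bottom-centre pixel of a detected obstacle
```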

However, this kind of stitching has many problems: the cameras point in different directions, and it is difficult to align the images between different cameras. Hand-engineering such a tracker has been tried, and it turns out to be a very difficult development task.

To provide the Summon function (calling the vehicle to you), the neural network must produce the map the vehicle needs in order to drive. No single camera around the vehicle can see everything around it, so the images from the cameras have to be integrated into one; consider first the benefits of doing so. By stitching multiple views together, complete information around the vehicle is obtained. That information is continuous and consistent, which supports correct judgments and also enables more accurate target prediction. However, it is difficult to stitch images with traditional feature and edge extraction, so here Karpathy once again trains a neural network to do it. To obtain accurate information, the stitching is not only spatial, across the multiple camera images, but also temporal: what was seen a second ago, what is seen now, and what will be seen a second from now. Fusing image information across both time and space should give better results, but fusion itself is the crux; how to fuse is the key question.

After image information has been collected by the five cameras, features are extracted by a backbone network for each camera, and these features are then fused.
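As a minimal sketch of this stage, and not Tesla's actual architecture, the snippet below runs one shared stand-in backbone over every camera image and returns a per-camera feature map, ready to be handed to the fusion stage. The channel counts and image sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PerCameraBackbone(nn.Module):
    """Shared backbone applied to each camera image independently (sketch)."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.backbone = nn.Sequential(              # stand-in for a real CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, images):                      # images: (B, n_cams, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.view(b * n, c, h, w))
        return feats.view(b, n, *feats.shape[1:])   # (B, n_cams, feat_ch, H/4, W/4)

imgs = torch.randn(1, 5, 3, 128, 256)               # five camera images, as in the text
features = PerCameraBackbone()(imgs)                # per-camera features, ready to fuse
```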

The fusion layer brings all the views together into one unified view. The next stage is temporal: the network takes several consecutive frames, say the last 8, analyzes them, and cross-checks them for consistency. Now it can see things moving; it can see a car moving from one frame to the next, and that is the valuable part of what the network outputs. It reports not only that a car is there, but which direction the car is moving and how fast. And knowing the information from previous frames, it can also tell you where pedestrians are, where road markings are, where roadside signs are, and so on.
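Here is one way those two stages could be wired up; a sketch under assumed shapes, not the real network. A 1x1 convolution merges the per-camera features into one view per frame, and a small GRU, a stand-in for whatever temporal module is actually used, cross-checks the last 8 fused frames at every spatial location so that motion becomes visible in the output state.

```python
import torch
import torch.nn as nn

class FusionTemporalNet(nn.Module):
    """Spatial fusion across cameras, then temporal consistency over 8 frames (sketch)."""
    def __init__(self, n_cams=5, feat_ch=64, fused_ch=64):
        super().__init__()
        self.fuse = nn.Conv2d(n_cams * feat_ch, fused_ch, 1)   # merge cameras per frame
        self.temporal = nn.GRU(input_size=fused_ch, hidden_size=fused_ch,
                               batch_first=True)               # stand-in temporal module

    def forward(self, feats):                  # feats: (B, T=8, n_cams, C, H, W)
        b, t, n, c, h, w = feats.shape
        fused = self.fuse(feats.view(b * t, n * c, h, w))      # (B*T, fused_ch, H, W)
        fused = fused.view(b, t, -1, h, w)
        # Run the recurrence independently at every spatial location.
        seq = fused.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, -1)
        out, _ = self.temporal(seq)                            # consistency across frames
        return out[:, -1].view(b, h, w, -1).permute(0, 3, 1, 2)  # latest fused state

frames = torch.randn(1, 8, 5, 64, 32, 64)      # last 8 frames of per-camera features
state = FusionTemporalNet()(frames)            # (1, 64, 32, 64) spatio-temporal features
```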

Until then, the idea of using neural networks to produce a bird's-eye view existed mostly on paper. Tesla's success seems to lie in knowing how to land these technologies in a product, and that has something to do with Elon Musk's judgment; Musk comes across as someone who knows how to subtract, to strip a problem down. Karpathy, of course, is remarkable too, and he seems to push neural networks to the extreme. In my view, how to design the network structure is the part of neural network training that has largely been solved; the real difficulty is how to design the labeled samples, that is, how to design the data structure, and this is what determines how well training works.

The key to training the whole BEV net is finding the right training set. The network can use a U-Net to perform semantic segmentation: the inputs to the BEV net are the semantically segmented images from each camera, and the fused output is a top-down plan view. On this plan view, road boundaries, lane lines, and moving objects are drawn as they move across the plane. I will expand on this part later.
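To close with a concrete picture of that pipeline, the sketch below segments each camera image with a tiny stand-in for a U-Net and fuses the segmentation maps into a single top-down plan view. Everything here is an assumption for illustration: the class list, the shapes, and especially the projection, which is done with a learned 1x1 mixing rather than a geometric warp.

```python
import torch
import torch.nn as nn

N_CLASSES = 4                                   # e.g. road, lane line, vehicle, other

class TinySegNet(nn.Module):
    """Stand-in for a U-Net that outputs per-pixel class scores."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, N_CLASSES, 1),
        )

    def forward(self, x):
        return self.net(x)

class BEVNet(nn.Module):
    """Fuses per-camera segmentation maps into one top-down plan view (sketch)."""
    def __init__(self, n_cams=5, bev_hw=(64, 64)):
        super().__init__()
        self.bev_hw = bev_hw
        self.fuse = nn.Conv2d(n_cams * N_CLASSES, N_CLASSES, 1)

    def forward(self, segs):                    # segs: (B, n_cams, N_CLASSES, H, W)
        b, n, c, h, w = segs.shape
        x = segs.view(b, n * c, h, w)
        x = nn.functional.interpolate(x, size=self.bev_hw)  # resample to BEV grid
        return self.fuse(x)                     # (B, N_CLASSES, 64, 64) plan view

seg = TinySegNet()
cams = torch.randn(5, 3, 128, 256)              # five camera images
segs = seg(cams).unsqueeze(0)                   # (1, 5, N_CLASSES, 128, 256)
bev = BEVNet()(segs)                            # top-down semantic plan view
```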