This is the 21st day of my participation in the August More Text Challenge

Last night, everyone who opened TikTok was being pushed Tesla AI Day, so you may have seen it too. You might care more about the Tesla Bot, but I pay more attention to the part I know a little about, the neural networks, that is, Tesla’s FSD. Today we will focus on Andrej Karpathy’s presentation and explain, based on my own understanding, how FSD tackles the big task of autonomous driving through computer vision.

I put a lot of time and effort into this article, and it is based on my personal understanding of computer vision; I hope you like it. A free like would be great encouragement. Since writing this was not easy, please do not repost it without permission. Thank you!

Tesla’s vision-based approach to autonomous driving

During AI Day, Tesla once again made the case for its vision-based approach to autonomous driving, which uses neural networks and, ideally, will allow cars to operate anywhere on Earth through its “Autopilot” system. Andrej Karpathy, head of AI at Tesla, described Tesla’s architecture as “building an animal from the ground up” that can move around, sense its environment, and act autonomously and intelligently based on what it sees.

Start with the vision component, which builds a vector space in real time from the raw video streams collected by the eight cameras around the car. This vector space provides all the information needed to control the driverless vehicle: the lane lines on the road, the vehicles driving on it, pedestrians, traffic lights, traffic signs, and so on.

How do creatures perceive their environment visually

The picture above illustrates the process by which vertebrates perceive their environment through vision. The idea, borrowed from biology, is that the visual system is layered: information is abstracted stage by stage before it is transmitted to the brain for processing.

To elaborate, in the picture below, light (essentially electromagnetic waves) carries structural information about the outside world through a series of refractive structures (the lens, the vitreous body, and so on) and is projected onto the retina at the back of the eyeball. The LGN receives these inputs, and one axon does not mean there is only one downstream target: in fact, one LGN neuron can project to multiple downstream V1 neurons, and one V1 cell can receive inputs from multiple LGN cells. A dozen or so LGN inputs to V1 are enough to form a visual feature. As information is gradually integrated, the receptive fields of neurons also change from local and simple to global and complex. LGN neurons have concentric-circle receptive fields, while V1 can encode different local features such as orientation. Some V2 neurons, by integrating information from V1, respond to the angle formed by two orientations, while some IT neurons are activated by even more complex visual features (such as specific objects) and can identify a particular thing.

Karpathy explained how Tesla’s neural networks process information over time. Tesla designs the car’s visual cortex, which is essentially the first part of the car’s “brain” to process visual information, so that data flows into the system in a structured, intelligent way.

A few years ago, Tesla Autopilot used a single camera to collect 1280 × 960 12-bit HDR images, from which it identified lane lines to keep the vehicle in the drivable area and predicted distances to maintain a safe gap to the car ahead. All of this work was based on a single image.

Feature extraction backbone network

This is the structure of a residual network. The original images are collected and fed into a backbone network composed of residual blocks arranged in sequence. I have shared the details of the residual network structure in an earlier post.
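To make the idea concrete, here is a minimal residual block sketch in PyTorch. It only illustrates the residual (skip-connection) structure; the layer sizes and the use of PyTorch are my own assumptions, and this is not Tesla’s actual RegNet backbone.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions plus a skip connection that adds the input back in."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                           # the skip connection keeps the original signal
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)       # residual addition, then activation

x = torch.randn(1, 64, 120, 160)               # N x C x H x W
print(ResidualBlock(64)(x).shape)              # torch.Size([1, 64, 120, 160])
```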

Note that, on the left side of the image, the residual backbone extracts features from the original image at several different resolutions:


  • 20 × 15 × 512

  • 40 × 30 × 256

  • 80 × 60 × 128

  • 160 × 120 × 64

If you are not very familiar with neural networks, numbers like 80 × 60 × 128 may need some explanation. “Different resolutions” means the following: the larger the spatial size of a feature map, the higher its resolution, and the more fine-grained image detail it contains; conversely, the smaller the feature map, the lower its resolution, and the more it captures global information.
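To make those shapes concrete, here is a toy multi-scale backbone sketch in PyTorch. The network itself (plain strided convolutions, the stride-4 stem, the channel widths) is an assumption for illustration, not Tesla’s RegNet; the point is only how one forward pass of a 1280 × 960 image yields feature maps at strides 8, 16, 32, and 64.

```python
import torch
import torch.nn as nn

def stage(in_ch, out_ch, stride):
    """One downsampling stage: a strided 3x3 convolution followed by ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.ReLU(inplace=True),
    )

class TinyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = stage(3, 32, stride=4)        # 1/4 resolution
        self.stages = nn.ModuleList([
            stage(32, 64, stride=2),              # 1/8  -> 160 x 120 x 64
            stage(64, 128, stride=2),             # 1/16 ->  80 x 60 x 128
            stage(128, 256, stride=2),            # 1/32 ->  40 x 30 x 256
            stage(256, 512, stride=2),            # 1/64 ->  20 x 15 x 512
        ])

    def forward(self, x):
        x = self.stem(x)
        features = []
        for s in self.stages:
            x = s(x)
            features.append(x)                    # keep every scale for the fusion step
        return features

img = torch.randn(1, 3, 960, 1280)                # N x C x H x W
for f in TinyBackbone()(img):
    print(tuple(f.shape))
# (1, 64, 120, 160)
# (1, 128, 60, 80)
# (1, 256, 30, 40)
# (1, 512, 15, 20)
```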

The next step is the BiFPN. We have extracted a number of feature maps of different sizes from the original image; what follows is the fusion stage, that is, multi-scale feature pyramid fusion. We need information at all of these scales: feature maps with lower resolution capture features from a global perspective (that is, contextual information), while feature maps with higher resolution pay more attention to detail. Of course, we need both the details and the global context.
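Below is a minimal sketch, in PyTorch, of the weighted-fusion idea used in BiFPN for a single pair of adjacent scales. The module name, channel sizes, and the single fusion node are my own simplifications; a real BiFPN stacks repeated top-down and bottom-up passes across all scales.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFuse(nn.Module):
    """Fuse a low-resolution (global context) map into a high-resolution (detail) map."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.align = nn.Conv2d(low_ch, high_ch, kernel_size=1)   # match channel count
        self.w = nn.Parameter(torch.ones(2))                     # learnable fusion weights
        self.out = nn.Conv2d(high_ch, high_ch, kernel_size=3, padding=1)

    def forward(self, high, low):
        low = self.align(low)
        low = F.interpolate(low, size=high.shape[-2:], mode="nearest")  # upsample to match
        w = F.relu(self.w)
        w = w / (w.sum() + 1e-4)                                  # "fast normalized fusion"
        return self.out(w[0] * high + w[1] * low)

high = torch.randn(1, 128, 60, 80)   # fine detail, e.g. a stride-16 map
low  = torch.randn(1, 256, 30, 40)   # global context, e.g. a stride-32 map
print(WeightedFuse(256, 128)(high, low).shape)   # torch.Size([1, 128, 60, 80])
```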

Here is an example of the benefit of fusion. In the lower right corner of the figure above, the fine-detail feature map alone cannot decide whether the blurry object in the distance is a vehicle. Once the global context from a coarser feature map is fused in, for instance where the parallel lane lines converge toward the vanishing point, the ambiguity is resolved and the network can correctly answer “yes, it is a vehicle.”

Above the BiFPN sits the detection head: CLS predicts the category of the detected target, while REG regresses its position. Similar to YOLO, the output is a grid. Each cell indicates whether a target is present at that location, and if so, additional outputs (such as the offsets of the target’s center coordinates x, y) are used to locate the target precisely.
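As a rough sketch of what such a dense head looks like, the following PyTorch module predicts per-cell class logits (CLS) and per-cell box offsets (REG) on a feature-map grid. The channel counts, the number of classes, and the four-value box encoding are assumptions for illustration, not Tesla’s actual head.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """YOLO-style dense head: every grid cell gets class scores and box regression values."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_classes, kernel_size=1)  # per-cell class logits
        self.reg = nn.Conv2d(in_ch, 4, kernel_size=1)            # per-cell x, y offsets + w, h

    def forward(self, feat):
        return self.cls(feat), self.reg(feat)

feat = torch.randn(1, 128, 60, 80)             # fused feature map from the BiFPN
cls_map, reg_map = DetectionHead(128, num_classes=10)(feat)
print(cls_map.shape, reg_map.shape)            # (1, 10, 60, 80) (1, 4, 60, 80)
```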

Let’s take a short break here. I will continue updating this post, probably this afternoon, so please stay tuned if you want to follow along.