
preface

To read source code effectively, it helps to do some theoretical groundwork first and only then connect it back to the engineering. This post is a summary drawn from many videos and blog posts by others. We start with V1, the first version; since the first paper is fairly short, the focus here is the structure of the neural network and its overall workflow, which splits into two parts: training and recognition. The goal is a high-level understanding of the whole process.

Related resources: arxiv.org/pdf/1506.02… Referenced articles: blog.csdn.net/shuiyixin/a… www.cnblogs.com/makefile/p/…

Author’s brief introduction

The author is an impressive researcher who wrote V1, V2, and V3. Later, after the U.S. military began using YOLO recognition technology for weapons development, he withdrew from computer vision research, so versions V4 through V6 have been maintained and upgraded by his successors. He is a very responsible and highly skilled master of the field.

Introduction to the algorithm

This is part of the paper

In short, this is a very capable computer-vision recognition algorithm.

What this article tries to do is figure out how the YOLO network works. The following three blog posts should give you the neural-network background you need:

GitHub project quick start, YOLOV5 quick start, and YOLOV5 parameter setting and model-training pitfalls (parts one, two, and three)

V1 network structure

Now that you have read those three articles, let's take a quick look at what this neural network looks like.

This is the structure of the first-generation V1 network. (The fifth generation is essentially a residual network; you can see the residual structure when you read its code.)

The whole process is actually not very complicated. The V1 network passes the input through many convolutional and pooling layers until it reaches a 7 x 7 x 1024 feature map; that is flattened through a 4096-dimensional fully connected layer and finally reshaped into the 7 x 7 x 30 output.
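To make the shapes concrete, here is a back-of-the-envelope sketch of how 448 shrinks to 7 and how large the final output is. This is not the real network (the actual layer counts and strides differ); it only illustrates the overall factor-of-64 downsampling and the output size.

```python
# Illustrative only: six halvings (conv strides / max-pools) take 448 -> 7.
def v1_shape_walkthrough(size=448):
    for _ in range(6):
        size //= 2
    return size

grid = v1_shape_walkthrough()   # 7
conv_out = (grid, grid, 1024)   # final convolutional feature map
fc_hidden = 4096                # fully connected layer
fc_out = grid * grid * 30       # 1470 values, reshaped to 7 x 7 x 30

print(grid, conv_out, fc_hidden, fc_out)
```

So the 7 x 7 x 30 output is really just 1470 numbers produced by the last fully connected layer, reshaped into a grid.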

So the whole thing is essentially a stack of convolution and pooling operations, and that stack is what matters.

Compare it with the CIFAR-10 network we built earlier

It does not actually feel much more complicated. It just has many more neurons and needs many times more compute to train.

Recognition process

If we want to tease out the whole process, let’s start with the recognition process, because that’s the most intuitive part.

Convolution part

Let's set the convolution part aside for now, because it is still just the modeling process.

We will focus on the final 7 x 7 x 30 output. When we actually use the model, an input image is first scaled to 448 x 448, where the 3 is the RGB channels, and the network gives us 7 x 7 x 30 at the end. But 7 x 7 does not mean the image is shrunk to 7 x 7 pixels.

That’s what it looks like

grid cell

This thing right here is the 7×7 cell from the picture above.

The 7 x 7 x 30 tensor we output exists precisely so that we can produce this picture later

Then we can get this image through processing

Cells store information

So in the 7 x 7 x 30 output, each cell holds a 30-dimensional vector. What information is stored there, and why does it have 30 dimensions?

First of all, two kinds of information are stored there. The first is border information: center coordinates, width, height, and a confidence score. The second is the conditional probability of each class, of which there are 20.

So the 20 class probabilities account for 20 dimensions, and because each cell predicts two borders with 5 values each, that adds 10 more, giving 30 in total.
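The 30-dimensional layout can be sketched as a simple slicing of a list. The index order here (boxes first, then class probabilities) follows the description above; `split_cell_vector` is a hypothetical helper, not code from the paper.

```python
# One cell's 30 numbers: 2 boxes x (x, y, w, h, confidence) + 20 classes.
def split_cell_vector(cell):
    assert len(cell) == 30
    box1 = cell[0:5]            # x, y, w, h, confidence
    box2 = cell[5:10]           # x, y, w, h, confidence
    class_probs = cell[10:30]   # P(class_i | object), 20 classes
    return box1, box2, class_probs

b1, b2, probs = split_cell_vector(list(range(30)))
print(len(b1), len(b2), len(probs))  # 5 5 20
```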

A border

Now let's look at the borders. There are 49 cells, each cell predicts two borders, so there are 98 in total. Each border is centered in one of the cells, and a border is not necessarily a standard square: it may be very long or very flat, but its center point always lies inside the cell.
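"Centered in one of the cells" means the cell that owns a border is the one containing the border's center point. A small hypothetical helper, assuming a 448 x 448 image and a 7 x 7 grid:

```python
# Which cell "owns" a box: the cell containing the box's center point.
def owning_cell(cx, cy, img_size=448, grid=7):
    cell_size = img_size / grid        # 64 pixels per cell
    col = int(cx // cell_size)
    row = int(cy // cell_size)
    return row, col

print(owning_cell(224, 100))  # center (224, 100) falls in row 1, col 3
```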

Each border also carries a confidence score, which you can think of as the probability that there is an object in that region.

Post-processing stage

Ok, now let's move on to the final stage of recognition: how do we get from the picture above, with all those borders, to this

or something like this?

That's the main part.

First of all, we already know that we have 98 borders, and we know each cell's conditional probabilities for the different categories.

For example, take the young lady predicted above. I will use three categories as an example, so let's assume the output is 7 x 7 x 13.

So what do we do? First of all, the bounding boxes.

Border processing

We have a total of 98 borders here, and each border has its own confidence. If we assume 3 categories, then for each category we multiply the cell's conditional class probability by the border's confidence C. That way we know the score of each of the 3 categories for every border. (With the real 20 categories, every border gets 20 such scores.)
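That multiplication can be written in one line. The numbers below are made up for illustration, matching the 3-category example above:

```python
# Class-specific score for a box:
#   score(class, box) = P(class | object) * box confidence
def class_scores(class_probs, box_conf):
    return [p * box_conf for p in class_probs]

cell_probs = [0.6, 0.3, 0.1]              # conditional class probabilities
scores_box1 = class_scores(cell_probs, 0.8)
scores_box2 = class_scores(cell_probs, 0.1)
print(scores_box1, scores_box2)
```

A border with low confidence drags all of its class scores down, which is exactly what lets the post-processing stage discard it.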

So let’s figure out the total probability

So let’s do the first class

IOU

Here we need to introduce a concept: IoU, a measure of how much two borders overlap. This parameter is also available in V5's detect script.

The higher you set this threshold, the more boxes remain on the image; the lower you set it, the fewer boxes survive wherever boxes overlap.
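IoU itself is just intersection area divided by union area. Here is the standard formula for two axis-aligned boxes in (x1, y1, x2, y2) corner format (not code from the paper):

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1x1 overlap over union 7 -> 1/7
```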

The comparison process

So let’s go into the comparison process

That completes the first category.

And so on for the second and third.
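The comparison process above is non-maximum suppression (NMS): per class, keep the highest-scoring box, drop every box that overlaps it beyond a threshold, then repeat with what remains. A simplified sketch, not the paper's code; `iou` is the standard corner-format formula repeated here so the block runs on its own:

```python
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, threshold=0.5):
    # Sort box indices by score, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
    return keep

kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)],
           [0.9, 0.8, 0.7])
print(kept)  # box 1 overlaps box 0 too much and is suppressed
```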

At this point the question might be: why not just pick the single box with the highest probability? The reason is simple: one picture may contain two people.

The paper is probably in this section

The training process

Having covered the recognition process, we finally move on to the training process.

The process is actually relatively simple. Compare it with our original linear regression model: that was basically a tiny neural network with one input, one output, a loss function, and backpropagation. Prediction only needs a forward pass to get a value from the result, plus post-processing (NMS).

Essentially, what we’re doing here is a fitting process.

How do you fit it? Well, first of all, note that whether training or predicting, one pass always ends with a 7 x 7 x 30 output. We feed the neural network a labeled picture. That's what it looks like: the label is drawn on the picture.

During training, only the cell containing the center of the blue (ground-truth) rectangle does the fitting. Each cell predicts two borders, and the one with the larger IoU against the marked border (the one closest to the border you labeled) is used for the fitting calculation. The other one is left as it is.
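Picking the "responsible" predictor is just an argmax over the two borders' IoU scores against the ground truth. A minimal sketch, assuming the IoUs have already been computed:

```python
# Of a cell's predicted boxes, the one with the highest IoU against the
# ground-truth box is "responsible" and gets fitted; the rest are left alone.
def responsible_box(ious):
    return max(range(len(ious)), key=lambda i: ious[i])

print(responsible_box([0.3, 0.7]))  # the second box (index 1) is responsible
```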

Loss function

Now that we know how the fitting works, let's move on to the loss function.
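For reference, the loss in the paper is a sum of squared errors over five terms: box coordinates, box sizes (with square roots to soften the effect of large boxes), confidences of cells with and without objects, and class probabilities, where $\mathbb{1}_{ij}^{\text{obj}}$ marks the responsible predictor $j$ in cell $i$, $\lambda_{\text{coord}} = 5$, and $\lambda_{\text{noobj}} = 0.5$:

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}}
  \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2
       + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  (C_i - \hat{C}_i)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
  \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```

Here $S = 7$ (the grid size) and $B = 2$ (borders per cell), matching the 7 x 7 x 30 output discussed above.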

conclusion

At this point we are only around page 3 or 4 of the paper. The rest follows the usual structure: present the idea, describe the model and how it works, then the pros and cons, comparisons, analysis, and experimental data, which we won't go through here, since the goal was just a sense of how it works. So that's YOLO V1; on to chewing through V2 tomorrow.