Originally published on Medium by Ayoosh Kathuria; compiled by Heart of the Machine.

Object detection is an area that has benefited greatly from recent developments in deep learning. As the technology has advanced, many object detection algorithms have emerged, including YOLO, SSD, Mask R-CNN and RetinaNet. In this tutorial, we will use PyTorch to implement an object detector based on YOLO v3, one of the fastest object detection algorithms available. This article covers the first three of the tutorial's five parts.

For the past few months, I have been working in my lab on ways to improve object detection. One of the biggest lessons I learned was that the best way to learn object detection is to implement the algorithms yourself, and that is exactly what this tutorial will lead you to do.

In this tutorial, we will use PyTorch to implement an object detector based on YOLO v3, one of the fastest object detection algorithms available.

The code used in this tutorial is designed to run on Python 3.5 and PyTorch 0.3. You can find all the code at the following link:

Github.com/ayooshkathu…

This tutorial consists of five parts:

1. How YOLO works

2. Creating the layers of the network architecture

3. Implementing the forward pass of the network

4. Objectness confidence thresholding and non-maximum suppression

5. Designing the input and output pipelines

Background Required

Before you follow this tutorial, you need to know:

  • How convolutional neural networks work, including residual blocks, skip connections, and upsampling;
  • Object detection, bounding box regression, IoU, and non-maximum suppression;
  • Basic PyTorch usage. You should be able to create simple neural networks with ease.

What is YOLO?

YOLO stands for You Only Look Once. It is an object detector that uses features learned by a deep convolutional neural network to detect objects. Before we start writing code, we need to understand how YOLO works.

Fully convolutional network

YOLO uses only convolutional layers, which makes it a fully convolutional network (FCN). It has 75 convolutional layers, along with skip connections and upsampling layers. It uses no pooling of any kind; instead, convolutional layers with stride 2 downsample the feature maps. This helps prevent the loss of low-level features that pooling often causes.

Being an FCN, YOLO is invariant to the size of the input image. In practice, however, we will want to stick to a constant input size, because of problems that only surface once we implement the algorithm.

One important such problem is that if we want to process images in batches (batched images can be processed in parallel by the GPU, which speeds things up), all the images need to have a fixed height and width. This is required to concatenate multiple images into one large batch (combining many PyTorch tensors into one).

The network downsamples the image by a factor called the stride of the network. For example, if the stride of the network is 32, a 416×416 input image will produce a 13×13 output. In general, the stride of any layer in the network is the factor by which that layer's output is smaller than the network's input image.

Interpreting the output

Typically (and this is the case for all object detectors), the features learned by the convolutional layers are passed to a classifier/regressor that makes the prediction (bounding box coordinates, class labels, etc.).

In YOLO, the prediction is done with a convolutional layer (it is a fully convolutional network, remember!), whose kernel size is:

1 × 1 × (B × (5 + C))

The first thing to notice is that our output is a feature map. Since we use a 1×1 convolution, the prediction map has exactly the size of the preceding feature map. In YOLO v3 (and its descendants), the way to interpret this prediction map is that each cell can predict a fixed number of bounding boxes.

Although the technically correct term for a unit in the feature map is “neuron,” we will call it a cell in this article for the sake of intuition.

Depth-wise, the feature map has B × (5 + C) entries. B represents the number of bounding boxes each cell can predict. According to the YOLO paper, each of these B bounding boxes may specialize in detecting a certain kind of object. Each bounding box has 5 + C attributes that describe its center coordinates, its dimensions, its objectness score, and C class confidences. YOLO v3 predicts 3 bounding boxes per cell.
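As a quick sanity check for the COCO setting used later in this tutorial (B = 3 boxes per cell, C = 80 classes), the depth works out to 255, which is exactly the number of filters of the 1×1 convolution that feeds each detection layer in the official cfg:

B, C = 3, 80
depth = B * (5 + C)   # 4 box coordinates + 1 objectness score + 80 class scores per box
print(depth)          # 255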

You expect each cell of the feature map to predict an object through one of its bounding boxes if the center of the object falls in the receptive field of that cell. (The receptive field is the region of the input image visible to the cell.)

This has to do with how YOLO is trained, with only one bounding box responsible for detecting any given object. First, we must determine which cell the bounding box belongs to.

To do that, we divide the input image into a grid whose dimensions equal those of the final feature map.

Let's consider the following example, where the input image is 416×416 and the stride of the network is 32. As mentioned earlier, the feature map will be 13×13, so we divide the input image into a 13×13 grid.


The cell of the input-image grid that contains the center of the object's ground-truth box is chosen as the one responsible for predicting the object. In the image, it is the cell marked red, which contains the center of the ground-truth box (marked yellow).

Now, the red cell is the 7th cell in the 7th row of the grid. We therefore assign the 7th cell in the 7th row of the feature map (the corresponding cell on the feature map) as the one responsible for detecting the dog.

This cell can predict three bounding boxes. Which one will be assigned the dog's ground-truth label? To understand this, we must understand the concept of anchors.

Note that the cell we are talking about here is a cell of the prediction feature map; we divide the input image into a grid only to determine which cell of the prediction feature map is responsible for predicting the object.

Anchor Box

It might seem reasonable to predict the width and height of the bounding box directly, but in practice that leads to unstable gradients during training. Instead, most modern object detectors predict log-space transforms, or simply offsets to pre-defined default bounding boxes called anchors.

These transforms are then applied to the anchor boxes to obtain the predictions. YOLO v3 has three anchors, so each cell predicts three bounding boxes.

Returning to our earlier question: the bounding box responsible for detecting the dog is the one whose anchor has the highest IoU with the ground-truth box.

Making predictions

The following formula describes how the network output is transformed to obtain the bounding box prediction results.
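For reference, these are the transforms from the YOLO papers. Here t_x, t_y, t_w, t_h are the raw network outputs, c_x and c_y are the top-left coordinates of the grid cell, and p_w and p_h are the anchor dimensions; b_x, b_y, b_w, b_h are the center coordinates, width, and height of the predicted box:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)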


The center coordinates

Note that we run the center-coordinate prediction through a sigmoid function, which forces the output to lie between 0 and 1. Here is why:

Normally, YOLO does not predict the absolute coordinates of the bounding box center. Instead, it predicts:

  • Offsets relative to the top-left corner of the grid cell that predicts the object;
  • Offsets normalized by the dimensions of the feature-map cell, which are 1.

Take our image as an example. If the prediction for the center is (0.4, 0.7), then the center lies at (6.4, 6.7) on the 13×13 feature map (since the top-left coordinates of the red cell are (6, 6)).

But what if the predicted x and y coordinates are greater than 1, say (1.2, 0.7)? Then the center would be at (7.2, 6.7), which lies in the cell just to the right of the red cell, the 8th cell in row 7. This breaks the theory behind YOLO: if we postulate that the red cell is responsible for predicting the dog, the dog's center must lie in the red cell and not in a neighboring one.

Therefore, to solve this problem, we perform the sigmoid function on the output to compress it between 0 and 1, effectively ensuring that the center is in the grid cell where the prediction is performed.
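A minimal illustration of this squashing (any real-valued output lands in (0, 1); written for a recent PyTorch):

import torch

# sigmoid keeps the predicted center offsets inside the predicting cell
print(torch.sigmoid(torch.FloatTensor([-2.0, 0.0, 3.0])))   # ≈ 0.1192, 0.5000, 0.9526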

The dimension of the bounding box

The dimensions of the bounding box are predicted by applying a log-space transform to the output and then multiplying by the anchor dimensions.

How the detector output is transformed to produce the final prediction. Image source: http://christopher5106.github.io/

The resulting predictions, bw and bh, are normalized by the height and width of the image. So if the predicted bw and bh for the box containing the dog are (0.3, 0.8), then the actual width and height on the 13×13 feature map are (13 × 0.3, 13 × 0.8).

Objectness scores

The objectness score represents the probability that an object is contained inside the bounding box. It should be close to 1 for the red cell and its neighbors, and close to 0 for, say, the cells at the corners of the grid.

The objectness score is also passed through a sigmoid, so it can be interpreted as a probability.

Class confidences

Class confidences represent the probability that the detected object belongs to a particular class (dog, cat, banana, car, etc.). Before v3, YOLO applied a softmax to the class scores.

YOLO v3 drops that design, and the authors use sigmoids instead. Applying a softmax to the class scores assumes the classes are mutually exclusive; in short, if an object belongs to one class, it cannot belong to another. This holds for the COCO dataset, on which we will run our detector. However, the assumption fails when we have classes like “Woman” and “Person.” That is why the authors chose not to use a softmax activation.

Predictions on different scales

YOLO v3 makes predictions at three different scales. The detection layer is used to make predictions on feature maps of three different sizes, having strides 32, 16, and 8 respectively. This means that with a 416×416 input, we make detections on scales 13×13, 26×26, and 52×52.

The network downsamples the input image up to the first detection layer, where a detection is made using the feature map of a layer with stride 32. The feature map is then upsampled by a factor of 2 and concatenated with a feature map from an earlier layer that has the same spatial size. Another detection is now made at the layer with stride 16. The same upsampling procedure is repeated, and a final detection is made at the layer with stride 8.

At each scale, each cell predicts 3 bounding boxes using 3 anchors, making a total of 9 anchors (the anchors are different for different scales).


The authors report that this helps YOLO v3 get better at detecting small objects, a frequent complaint with earlier versions of YOLO. Upsampling helps the network learn fine-grained features, which are instrumental for detecting small objects.

Output processing

For an image of size 416×416, YOLO predicts ((52 × 52) + (26 × 26) + (13 × 13)) × 3 = 10647 bounding boxes. However, in our example image there is only one object, a dog. How do we reduce the detections from 10647 to 1?
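A one-line check of that count for the three strides (32, 16, and 8):

# (13*13 + 26*26 + 52*52) cells, 3 boxes per cell
print(sum((416 // s) ** 2 for s in (32, 16, 8)) * 3)   # 10647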

Thresholding by objectness confidence: First, we filter boxes based on their objectness score. Boxes with scores below a threshold are generally ignored.

Non-maximum suppression: NMS addresses the problem of multiple detections of the same object. For example, all three bounding boxes of the red grid cell may detect the same box, or adjacent cells may detect the same object. A minimal sketch of the idea is given below.
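NMS is implemented properly in part 4 of the tutorial. As a preview, here is a rough, self-contained sketch of greedy NMS over boxes in corner format (x1, y1, x2, y2), written against a recent PyTorch; the function names are ours, not the tutorial's:

import torch

def bbox_iou(box, boxes):
    # IoU of one box against N boxes, all in (x1, y1, x2, y2) format
    x1 = torch.max(box[0], boxes[:, 0])
    y1 = torch.max(box[1], boxes[:, 1])
    x2 = torch.min(box[2], boxes[:, 2])
    y2 = torch.min(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area1 = (box[2] - box[0]) * (box[3] - box[1])
    area2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area1 + area2 - inter)

def nms(boxes, scores, iou_thresh=0.4):
    # Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much, repeat
    _, order = scores.sort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(int(i))
        if order.numel() == 1:
            break
        ious = bbox_iou(boxes[i], boxes[order[1:]])
        order = order[1:][ious < iou_thresh]
    return keep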


Our implementation

YOLO can only detect objects belonging to the classes present in the dataset used to train the network. Our detector will use the official weights file, which was obtained by training the network on the COCO dataset, so we can detect 80 object classes.

This concludes the first part of the tutorial, which explains the YOLO algorithm in detail. If you want to dig deeper into how YOLO works, how it is trained, and how it compares with other detectors, read the original papers:

1. YOLO V1: You Only Look Once: Unified, real-time Object Detection (arxiv.org/pdf/1506.02…)

2. YOLO V2: YOLO9000: Better, Faster, Stronger (arxiv.org/pdf/1612.08…)

3. YOLO V3: An Incremental Improvement (pjreddie.com/media/files…)

4. Convolutional Neural Networks (cs231n.github.io/convolution…)

5. Bounding Box Regression (Appendix C) (arxiv.org/pdf/1311.25…)

6. IoU (www.youtube.com/watch?v=DNE…)

7. Non-maximum suppression (www.youtube.com/watch?v=A46…)

8. PyTorch Official Tutorial (pytorch.org/tutorials/b…)

Part two: Creating the layers of the YOLO network

This is the second part of the tutorial on implementing a YOLO v3 detector from scratch. Based on the concepts covered in part one, we will use PyTorch to implement the layers of YOLO, i.e., to create the basic building blocks of the whole model.

This part assumes a basic understanding of how YOLO runs and how it works, as well as basic knowledge of PyTorch, such as how to build custom neural network architectures with the nn.Module, nn.Sequential, and torch.nn.Parameter classes.

Getting started

First create a folder to hold the detector code, then create the Python file darknet.py. Darknet is the name of the underlying architecture of YOLO. This file will contain all the code that implements the YOLO network. We will also add a file called util.py to hold the various helper functions we will call. Once you have both files in the detector folder, you can track changes to them with Git.

The configuration file

The official code (written in C) uses a configuration file to build the network: a cfg file that describes the network architecture block by block. If you have used the Caffe backend before, it is the equivalent of the .prototxt file that describes the network.

We will build the network using the official cfg file published by the authors of YOLO. Download it from the address below and put it in a cfg folder inside the detector directory.

Configuration file download: github.com/pjreddie/da…

If you are on Linux, you can also cd into the detector directory and run the following commands:

mkdir cfg
cd cfg
wget https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg

If you open the configuration file, you will see something like the following:

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

We see four blocks above. Three of them describe convolutional layers, followed by a shortcut layer, the skip connection commonly used in ResNet. Below are the five types of layers used in YOLO:

1. The convolution layer

[convolutional]
batch_normalize=1 
filters=64 
size=3 
stride=1 
pad=1 
activation=leaky

2. Shortcut (skip connection)

[shortcut]
from=-3 
activation=linear

A shortcut layer is a skip connection, like the one used in ResNet. The from parameter of -3 means the output of the shortcut layer is obtained by adding the feature maps of the previous layer and the layer 3 layers back from the shortcut layer.

3. Upsample

[upsample]
stride=2

It upsamples the feature map of the previous layer by a factor of stride, using bilinear upsampling.

4. Route layer

[route]
layers = -4

[route]
layers = -1, 61

The route layer deserves some explanation. Its layers parameter can hold one or two values. When it holds only one value, the layer outputs the feature map of the layer indexed by that value. In our example it is -4, so the layer will output the feature map of the 4th layer back from the route layer.

When layers has two values, it returns the concatenated feature maps of the layers indexed by those values. In our example they are -1 and 61, so the layer outputs the feature maps of the previous layer (-1) and the 61st layer, concatenated along the depth dimension. See the shape check below.
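To make the depth-wise concatenation concrete, here is a small shape check (the channel counts are illustrative):

import torch

a = torch.randn(1, 256, 26, 26)      # e.g., an upsampled feature map (layer -1)
b = torch.randn(1, 512, 26, 26)      # e.g., the feature map of layer 61
print(torch.cat((a, b), 1).size())   # torch.Size([1, 768, 26, 26]) -- channels add up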

5. YOLO

[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=80
num=9
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=1

The YOLO block corresponds to the detection layer described in part 1. The anchors parameter defines nine anchors, but only the anchors indexed by the mask attribute are used. Here, the mask values 0, 1, 2 mean the first, second, and third anchors are used. This makes sense, since each cell of the detection layer predicts three boxes. In total we have detection layers at three scales, which together use all nine anchors.

Net

[net]
# Testing
batch=1
subdivisions=1
# Training
# batch=64
# subdivisions=16
width= 320
height = 320
channels=3
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

There is another kind of block in the configuration file, called net, but I would not call it a layer: it only describes the network input and training parameters, and it is not used in YOLO's forward pass. However, it does give us information such as the network input size, which we use to adjust the anchors in the forward pass.

Parsing configuration files

Before we begin, we add the necessary imports at the top of the darknet.py file.

from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
from torch.autograd import Variable
import numpy as np

We define a function parse_cfg that takes the path of the configuration file as input.

def parse_cfg(cfgfile):
    """
    Takes a configuration file

    Returns a list of blocks. Each block describes a block in the neural
    network to be built. A block is represented as a dictionary in the list.

    """

The idea here is to parse the cfg and store every block as a dictionary. The attributes of a block and their values are stored as key-value pairs in the dictionary. As we parse, we append these dictionaries (the variable block in the code) to a list called blocks. Our function will return this list.

We start by saving the configuration file contents in a list of strings. The following code preprocesses the list:

file = open(cfgfile, 'r')
lines = file.read().split('\n')               # store the lines in a list
lines = [x for x in lines if len(x) > 0]      # get rid of the empty lines
lines = [x for x in lines if x[0] != '#']     # get rid of comments
lines = [x.rstrip().lstrip() for x in lines]  # get rid of fringe whitespaces

We then iterate over the pre-processed list to get blocks.

block = {}
blocks = []

for line in lines:
    if line[0] == "[":               # This marks the start of a new block
        if len(block) != 0:          # If block is not empty, it stores values of the previous block.
            blocks.append(block)     # add it to the blocks list
            block = {}               # re-init the block
        block["type"] = line[1:-1].rstrip()
    else:
        key, value = line.split("=")
        block[key.rstrip()] = value.lstrip()
blocks.append(block)

return blocks
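For reference, the first [convolutional] block of the cfg excerpt shown earlier ends up parsed into a dictionary like this (note that all values are kept as strings at this stage):

{'type': 'convolutional', 'batch_normalize': '1', 'filters': '64',
 'size': '3', 'stride': '2', 'pad': '1', 'activation': 'leaky'}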

Creating the building blocks

Now we will use the list returned by parse_cfg above to construct PyTorch modules for the blocks in the configuration file.

There are five types of layers in the list. PyTorch provides pre-built layers for the convolutional and upsample types. We will write our own modules for the remaining layers by extending the nn.Module class.

The create_modules function takes the list of blocks returned by parse_cfg:

def create_modules(blocks):
    net_info = blocks[0]     #Captures the information about the input and pre-processing
    module_list = nn.ModuleList()
    prev_filters = 3
    output_filters = []

Before iterating through the list, we define the variable net_info to store information about the network.

nn.ModuleList

Our function will return an nn.ModuleList. This class is almost like a normal Python list containing nn.Module objects. However, when we add an nn.ModuleList as a member of an nn.Module object (i.e., when we add modules to our network), all the parameters of the nn.Module objects (modules) inside the nn.ModuleList are also registered as parameters of the nn.Module object (i.e., our network) to which we added the nn.ModuleList as a member.
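A quick way to see why this matters, using a toy module of our own (not part of the detector): parameters held in a plain Python list are invisible to .parameters(), while those held in an nn.ModuleList are registered.

import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super(Toy, self).__init__()
        self.registered = nn.ModuleList([nn.Conv2d(3, 8, 3)])   # parameters are registered
        self.hidden = [nn.Conv2d(3, 8, 3)]                       # plain list: parameters are not

print(sum(p.numel() for p in Toy().parameters()))   # 224 -- only the ModuleList's conv is counted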

When we define a new convolutional layer, we must specify the dimensions of its kernel. The kernel's height and width are provided by the cfg file, but its depth is exactly the number of filters (the depth of the feature map) in the previous layer. This means we need to keep track of the number of filters in the layer the convolution is being applied to. We use the variable prev_filters for this, initializing it to 3 because the image has 3 channels (RGB).

The route layer brings forward (possibly concatenated) feature maps from earlier layers. If there is a convolutional layer right after a route layer, its kernel is applied to the feature maps of those earlier layers, precisely the ones the route layer brings forward. Therefore, we need to track the number of filters not only in the previous layer, but in each of the preceding layers. As we iterate, we append the number of output filters of each block to the list output_filters.

For now, the idea is to iterate through the list of modules and create a PyTorch module for each module.

for index, x in enumerate(blocks[1:]):
    module = nn.Sequential()

    #check the type of block
    #create a new module for the block
    #append to module_list

The nn.Sequential class is used to sequentially execute a number of nn.Module objects. If you look at the cfg file, you will see that a block may contain more than one layer. For example, a convolutional block has a batch normalization layer and a leaky ReLU activation layer in addition to the convolutional layer. We string these layers together using nn.Sequential and its add_module function. For example, this is how we create the convolutional and upsampling layers:

if (x["type"] == "convolutional"): #Get the info about the layer activation = x["activation"] try: batch_normalize = int(x["batch_normalize"]) bias = False except: batch_normalize = 0 bias = True filters= int(x["filters"]) padding = int(x["pad"]) kernel_size = int(x["size"]) stride =  int(x["stride"]) if padding: pad = (kernel_size - 1) // 2 else: pad = 0 #Add the convolutional layer conv = nn.Conv2d(prev_filters, filters, kernel_size, stride, pad, bias = bias) module.add_module("conv_{0}".format(index), conv) #Add the Batch Norm Layer if batch_normalize: bn = nn.BatchNorm2d(filters) module.add_module("batch_norm_{0}".format(index), bn) #Check the activation. #It is either Linear or a Leaky ReLU for YOLO if activation == "leaky": LeakyReLU(0.1, inplace = True) module.add_module(" Leaky_ {0}". Format (index), activn) #If it's an upsampling layer #We use Bilinear2dUpsampling elif (x["type"] == "upsample"): stride = int(x["stride"]) upsample = nn.Upsample(scale_factor = 2, mode = "bilinear") module.add_module("upsample_{}".format(index), upsample)Copy the code

Routing layer/shortcut layer

Next, let’s write the code to create the Route Layer and Shortcut Layer:

    #If it is a route layer
    elif (x["type"] == "route"):
        x["layers"] = x["layers"].split(',')
        #Start of a route
        start = int(x["layers"][0])
        #end, if there exists one.
        try:
            end = int(x["layers"][1])
        except:
            end = 0
        #Positive anotation
        if start > 0:
            start = start - index
        if end > 0:
            end = end - index
        route = EmptyLayer()
        module.add_module("route_{0}".format(index), route)
        if end < 0:
            filters = output_filters[index + start] + output_filters[index + end]
        else:
            filters = output_filters[index + start]

    #shortcut corresponds to skip connection
    elif x["type"] == "shortcut":
        shortcut = EmptyLayer()
        module.add_module("shortcut_{}".format(index), shortcut)

The code that creates the route layer deserves a fair bit of explanation. First, we extract the value of the layers attribute, cast it to integers, and store it in a list.

Then we have a new layer called EmptyLayer which, as the name suggests, is just an empty layer.

route = EmptyLayer()

Its definition is as follows:

class EmptyLayer(nn.Module):
    def __init__(self):
        super(EmptyLayer, self).__init__()

Wait, an empty layer?

An empty layer might seem confusing, since it does nothing. The route layer, like any other layer, does perform an operation (bringing forward / concatenating feature maps from previous layers). In PyTorch, when we define a new layer, we subclass nn.Module and write the operation the layer performs in the forward method of the nn.Module object.

To design a layer for the route block, we would have to build an nn.Module object initialized with the values of the layers attribute as its members. Then we could write the code to concatenate/bring forward the feature maps in its forward method. Finally, we would execute this layer in the forward function of our network.

But the concatenation code is fairly short and simple (a torch.cat call on the feature maps), and designing a layer as described above would lead to unnecessary abstraction and more boilerplate. Instead, we can put a dummy layer in place of the proposed route layer and perform the concatenation directly in the forward function of the nn.Module object representing Darknet. (If this confuses you, I suggest you read up on how the nn.Module class is used in PyTorch.)
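For contrast, the dedicated route layer we are choosing not to write would look roughly like this sketch (our own illustration, not code from the tutorial):

import torch
import torch.nn as nn

class RouteLayer(nn.Module):
    def __init__(self, layers):
        super(RouteLayer, self).__init__()
        self.layers = layers                  # indices of the feature maps to bring forward

    def forward(self, outputs, i):
        # outputs: dict of cached feature maps, i: index of this layer
        maps = [outputs[i + l] if l < 0 else outputs[l] for l in self.layers]
        return torch.cat(maps, 1)             # concatenate along the channel dimension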

The convolutional layer right after a route layer applies its kernel to the (possibly concatenated) feature maps from earlier layers. The following code updates the filters variable to hold the number of filters output by the route layer.

if end < 0:
    #If we are concatenating maps
    filters = output_filters[index + start] + output_filters[index + end]
else:
    filters = output_filters[index + start]

The shortcut layer also uses an empty layer, because it too performs a very simple operation (addition). There is no need to update the filters variable, since it merely adds the feature map of a previous layer to that of the layer just behind.

YOLO layer

Finally, we’ll write the code to create the YOLO layer:

    #Yolo is the detection layer
    elif x["type"] == "yolo":
        mask = x["mask"].split(",")
        mask = [int(x) for x in mask]

        anchors = x["anchors"].split(",")
        anchors = [int(a) for a in anchors]
        anchors = [(anchors[i], anchors[i+1]) for i in range(0, len(anchors), 2)]
        anchors = [anchors[i] for i in mask]

        detection = DetectionLayer(anchors)
        module.add_module("Detection_{}".format(index), detection)

We define a new layer, DetectionLayer, that holds the anchors used to detect bounding boxes.

The detection layer is defined as follows:

class DetectionLayer(nn.Module):
    def __init__(self, anchors):
        super(DetectionLayer, self).__init__()
        self.anchors = anchors

At the end of the loop, we do some bookkeeping.

    module_list.append(module)
    prev_filters = filters
    output_filters.append(filters)

That concludes the body of the loop. At the end of create_modules, we return a tuple containing net_info and module_list.

return (net_info, module_list)

The test code

You can test the code by typing the following lines at the end of darknet.py and running the file.

blocks = parse_cfg("cfg/yolov3.cfg")
print(create_modules(blocks))

You should see a long list (106 items to be exact) with elements that look like this:

  (9): Sequential(
     (conv_9): Conv2d (128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
     (batch_norm_9): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
     (leaky_9): LeakyReLU(0.1, inplace)
   )
   (10): Sequential(
     (conv_10): Conv2d (64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (batch_norm_10): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
     (leaky_10): LeakyReLU(0.1, inplace)
   )
   (11): Sequential(
     (shortcut_11): EmptyLayer(
     )
   )

Part three: Implementing the forward pass of the network

In part 2, we implemented the layers used in the YOLO architecture. In this part, we will implement the network architecture of YOLO in PyTorch, so that we can produce an output for a given image.

Our goal is to design the forward pass of the network.

Prerequisites

  • Read the first two parts of this tutorial;
  • PyTorch basics, including how to create custom architectures with the nn.Module, nn.Sequential, and torch.nn.Parameter classes;
  • Working with images in PyTorch.

Define the network

As mentioned earlier, we use nn.Module to build custom architectures in PyTorch. Here, we define the network for our detector. In the darknet.py file, we add the following class:

class Darknet(nn.Module):
    def __init__(self, cfgfile):
        super(Darknet, self).__init__()
        self.blocks = parse_cfg(cfgfile)
        self.net_info, self.module_list = create_modules(self.blocks)

Here, we have subclassed nn.Module and named our class Darknet. We initialize the network with the members blocks, net_info, and module_list.

Implementing the forward pass of the network

The forward pass of the network is implemented by overriding the forward method of the nn.Module class.

forward serves two purposes. First, to compute the output; second, to transform the output detection feature maps into a form that is easier to process (for example, so that detection maps from different scales can be concatenated, which would otherwise be impossible because of their different dimensions).

def forward(self, x, CUDA):
    modules = self.blocks[1:]
    outputs = {}   #We cache the outputs for the route layer

The forward function takes three arguments: self, input X, and CUDA (if true, GPU is used to speed forward propagation).

Here, we iterate over self.blocks[1:] rather than self.blocks, because the first element of self.blocks is the net block, which is not part of the forward pass.

Since the route and shortcut layers need the output feature maps of earlier layers, we cache the output feature map of every layer in the dictionary outputs. The keys are the layer indices, and the values are the corresponding feature maps.

As with the create_modules function, we now iterate over module_list, which contains the modules of the network. Note that the modules were appended in the same order as they appear in the configuration file. This means we can simply run the input through each module in turn to get the output.

write = 0     #This is explained a bit later
for i, module in enumerate(modules):
    module_type = (module["type"])

Convolution layer and upsampling layer

If the module is a convolution layer or upsampling layer, then forward propagation should work as follows:

if module_type == "convolutional" or module_type == "upsample":
 x = self.module_list[i](x)
Copy the code

Routing layer/shortcut layer

If you look back at the route layer code, we have to handle two cases (as described in part 2). In the case where two feature maps are concatenated, we use the torch.cat function with its second argument set to 1, because we want to concatenate the feature maps along the depth. (In PyTorch, the input and output of a convolutional layer have the format B x C x H x W; the depth corresponds to the channel dimension.)

elif module_type == "route":
 layers = module["layers"]
 layers = [int(a) for a in layers]

 if (layers[0]) > 0:
 layers[0] = layers[0] - i

 if len(layers) == 1:
 x = outputs[i + (layers[0])]

 else:
 if (layers[1]) > 0:
 layers[1] = layers[1] - i

 map1 = outputs[i + layers[0]]
 map2 = outputs[i + layers[1]]

 x = torch.cat((map1, map2), 1)

 elif module_type == "shortcut":
 from_ = int(module["from"])
 x = outputs[i-1] + outputs[i+from_]
Copy the code

YOLO (Detection Layer)

The output of YOLO is a convolutional feature map that contains the bounding box attributes along its depth. The attributes predicted by a cell are stacked one after another along the depth. So if you want to access, say, the second bounding box of the cell at (5, 6), you have to index the map as map[5, 6, (5+C): 2*(5+C)]. This format is inconvenient for output processing such as thresholding by objectness confidence, adding grid offsets to the centers, and applying the anchors.

Another problem is that since detections happen at three scales, the dimensions of the prediction maps differ. Although the three feature maps have different dimensions, the output processing done on them is similar, and it would be nice to do these operations on a single tensor rather than three separate ones.

To solve these problems, we introduce the function predict_transform.

Transform the output

The predict_transform function lives in the file util.py, and we will import it when we use it in the forward method of the Darknet class.

Add an import at the top of util.py:

from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
from torch.autograd import Variable
import numpy as np
import cv2

predict_transform takes five parameters: prediction (our output), inp_dim (the input image dimension), anchors, num_classes, and an optional CUDA flag.

def predict_transform(prediction, inp_dim, anchors, num_classes, CUDA = True):

The predict_transform function turns the detection feature map into a 2-D tensor in which each row corresponds to the attributes of one bounding box, as shown below:


The code used for the above transformation:

    batch_size = prediction.size(0)
    stride = inp_dim // prediction.size(2)
    grid_size = inp_dim // stride
    bbox_attrs = 5 + num_classes
    num_anchors = len(anchors)

    prediction = prediction.view(batch_size, bbox_attrs*num_anchors, grid_size*grid_size)
    prediction = prediction.transpose(1,2).contiguous()
    prediction = prediction.view(batch_size, grid_size*grid_size*num_anchors, bbox_attrs)

The dimensions of the anchors correspond to the height and width attributes of the net block. These attributes describe the dimensions of the input image, which is larger than the detection map by a factor of the stride. Therefore, we must divide the anchors by the stride of the detection feature map. For example, at stride 32 the anchor (116, 90) becomes (116/32, 90/32) = (3.625, 2.8125) on the 13×13 grid.

anchors = [(a[0]/stride, a[1]/stride) for a in anchors]

Now, we need to transform the output according to the formula discussed in Part 1.

Perform Sigmoid function operations on (x,y) coordinates and Objectness scores.

    #Sigmoid the centre_X, centre_Y and object confidence
    prediction[:,:,0] = torch.sigmoid(prediction[:,:,0])
    prediction[:,:,1] = torch.sigmoid(prediction[:,:,1])
    prediction[:,:,4] = torch.sigmoid(prediction[:,:,4])

Add grid offset to center coordinate prediction:

    #Add the center offsets
    grid = np.arange(grid_size)
    a, b = np.meshgrid(grid, grid)

    x_offset = torch.FloatTensor(a).view(-1,1)
    y_offset = torch.FloatTensor(b).view(-1,1)

    if CUDA:
        x_offset = x_offset.cuda()
        y_offset = y_offset.cuda()

    x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1, num_anchors).view(-1,2).unsqueeze(0)

    prediction[:,:,:2] += x_y_offset

Apply anchor points to the bounding box dimension:

    #log space transform height and the width
    anchors = torch.FloatTensor(anchors)

    if CUDA:
        anchors = anchors.cuda()

    anchors = anchors.repeat(grid_size*grid_size, 1).unsqueeze(0)
    prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4])*anchors

Apply the sigmoid activation function to the class scores:

prediction[:,:,5: 5 + num_classes] = torch.sigmoid((prediction[:,:, 5 : 5 + num_classes]))

Finally, we want to resize the detection map to the size of the input image. The bounding box attributes here are sized according to the feature map (say, 13 × 13). If the input image is 416 × 416, we multiply the attributes by 32, i.e., by the stride variable.

prediction[:,:,:4] *= stride

This more or less concludes the body of this function.

The function returns the predicted result at the end:

return prediction

Revisit the detection layer

We have transformed our output tensors, and can now concatenate the detection maps from the three different scales into one big tensor. Note that this is only possible after the transformation, since you cannot concatenate feature maps with different spatial dimensions. After the transformation, our output tensors are simply tables with bounding boxes as their rows, which makes concatenation straightforward.

One obstacle is that we cannot initialize an empty tensor and then concatenate non-empty tensors (of a different shape) to it. So we delay initializing the collector (the tensor that holds the detections) until we get our first detection map, and then concatenate the subsequent maps to it.

Notice the write = 0 line just before the loop in the forward function. The write flag indicates whether we have encountered the first detection yet. If write is 0, the collector has not been initialized; if it is 1, the collector has been initialized and we can simply concatenate new detection maps to it.
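In isolation, the pattern looks like this (toy tensors with the three detection-map shapes that a 416×416 input would give):

import torch

write = 0
for x in [torch.randn(1, 507, 85), torch.randn(1, 2028, 85), torch.randn(1, 8112, 85)]:
    if not write:
        detections = x      # the first detection map initializes the collector
        write = 1
    else:
        detections = torch.cat((detections, x), 1)

print(detections.size())    # torch.Size([1, 10647, 85])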

Now that we have the predict_transform function, we can write the code that handles the detection feature maps in the forward function.

At the top of the darknet.py file, add the following imports:

from util import *

Then define it in the forward function:

    elif module_type == 'yolo':

        anchors = self.module_list[i][0].anchors
        #Get the input dimensions
        inp_dim = int(self.net_info["height"])

        #Get the number of classes
        num_classes = int(module["classes"])

        #Transform
        x = x.data
        x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
        if not write:              #if no collector has been initialised
            detections = x
            write = 1

        else:
            detections = torch.cat((detections, x), 1)

    outputs[i] = x

Now simply return the detections.

return detections

Testing the forward pass

The following function creates a dummy input that we can pass through our network. Before writing this function, save the test image into your working directory with the following command:

wget https://github.com/ayooshkathuria/pytorch-yolo-v3/raw/master/dog-cycle-car.png

Images can also be downloaded directly: github.com/ayooshkathu…

Now define the following functions at the top of the darknet.py file:

def get_test_input():
    img = cv2.imread("dog-cycle-car.png")
    img = cv2.resize(img, (416,416))            #Resize to the input dimension
    img_ = img[:,:,::-1].transpose((2,0,1))     # BGR -> RGB | H X W X C -> C X H X W
    img_ = img_[np.newaxis,:,:,:]/255.0         #Add a channel at 0 (for batch) | Normalise
    img_ = torch.from_numpy(img_).float()       #Convert to float
    img_ = Variable(img_)                       #Convert to Variable
    return img_

We need to type the following code:

model = Darknet("cfg/yolov3.cfg")
inp = get_test_input()
pred = model(inp)
print (pred)

You should see the following output:

(  0 ,.,.) =
   16.0962   17.0541   91.5104  ...    0.4336   0.4692   0.5279
   15.1363   15.2568  166.0840  ...    0.5561   0.5414   0.5318
   14.4763   18.5405  409.4371  ...    0.5908   0.5353   0.4979
               ⋱                ...
  411.2625  412.0660    9.0127  ...    0.5054   0.4662   0.5043
  412.1762  412.4936   16.0449  ...    0.4815   0.4979   0.4582
  412.1629  411.4338   34.9027  ...
[torch.FloatTensor of size 1x10647x85]

The shape of this tensor is 1×10647×85. The first dimension is the batch size, which is simply 1 because we used a single image. For each image in a batch, we get a 10647×85 table whose rows each represent a bounding box (4 bounding box attributes, 1 objectness score, and 80 class scores).

At this point, our network has random weights and will not produce the correct output. We need to load the pre-trained weights into the network, and for that we will use the official weights file.

Downloading the pre-trained weights

Download the weight file and put it in the detector directory. We can directly download it using the command line:

wget https://pjreddie.com/media/files/yolov3.weights

It can also be downloaded at pjreddie.com/media/files…

Understanding weight files

The official weights file is a binary file that stores neural network weights in sequence.

We have to read the weights carefully: they are stored simply as floats, with nothing to tell us which layer they belong to. If we get the reading wrong, we will likely load the weights all wrong and the model will not work at all. Since the file gives us only floats, we have no way of telling which layer a weight belongs to unless we understand exactly how the weights are stored.

First, the weights belong to only two types of layers: batch norm layers and convolutional layers. The weights of these layers are stored exactly in the order in which they appear in the configuration file. So if a convolutional block is followed by a shortcut block, and the shortcut block is followed by another convolutional block, you would expect the file to contain the weights of the former convolutional block, followed by the weights of the latter.

When a batch norm layer appears in a convolutional block, the convolution carries no bias. However, when there is no batch norm layer, the bias "weights" have to be read from the file. The figure below shows how the weights are stored.
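As a rough sanity check of this layout (the helper below is ours, not part of the tutorial's code), we can count the floats stored for a single convolutional block:

def conv_block_float_count(in_c, out_c, k, batch_normalize):
    # floats stored in the weights file for one convolutional block
    conv_weights = out_c * in_c * k * k
    if batch_normalize:
        # bn biases, bn weights, bn running mean, bn running var, then conv weights
        return 4 * out_c + conv_weights
    # conv biases, then conv weights
    return out_c + conv_weights

print(conv_block_float_count(3, 32, 3, True))   # first conv of YOLO v3: 128 + 864 = 992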


Loading the weights

We write a function to load the weights as a member function of the Darknet class. Besides self, it takes one argument: the path of the weights file.

def load_weights(self, weightfile):

The first 160 bits of the weights file store five int32 values that constitute the file header.

    #Open the weights file
    fp = open(weightfile, "rb")

    #The first 5 values are header information
    # 1. Major version number
    # 2. Minor version number
    # 3. Subversion number
    # 4,5. Images seen by the network (during training)
    header = np.fromfile(fp, dtype = np.int32, count = 5)
    self.header = torch.from_numpy(header)
    self.seen = self.header[3]

The rest of the bits represent the weights, in the order described above. The weights are stored as float32, i.e., 32-bit floats. Let's load the rest of the weights into a np.ndarray.

weights = np.fromfile(fp, dtype = np.float32)

Now we iterate over the weights and load them into the modules of our network.

ptr = 0
for i in range(len(self.module_list)):
    module_type = self.blocks[i + 1]["type"]

    #If module_type is convolutional load weights
    #Otherwise ignore.

Inside the loop, we first check whether the convolutional block has batch_normalize set to true, and load the weights accordingly.

if module_type == "convolutional":
 model = self.module_list[i]
 try:
 batch_normalize = int(self.blocks[i+1]["batch_normalize"])
 except:
 batch_normalize = 0

 conv = model[0]
Copy the code

We keep a variable called ptr to track our position in the weights array. Now, if batch_normalize is true, we load the weights as follows:

        if (batch_normalize):
            bn = model[1]

            #Get the number of weights of Batch Norm Layer
            num_bn_biases = bn.bias.numel()

            #Load the weights
            bn_biases = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
            ptr += num_bn_biases

            bn_weights = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
            ptr += num_bn_biases

            bn_running_mean = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
            ptr += num_bn_biases

            bn_running_var = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
            ptr += num_bn_biases

            #Cast the loaded weights into dims of model weights.
            bn_biases = bn_biases.view_as(bn.bias.data)
            bn_weights = bn_weights.view_as(bn.weight.data)
            bn_running_mean = bn_running_mean.view_as(bn.running_mean)
            bn_running_var = bn_running_var.view_as(bn.running_var)

            #Copy the data to model
            bn.bias.data.copy_(bn_biases)
            bn.weight.data.copy_(bn_weights)
            bn.running_mean.copy_(bn_running_mean)
            bn.running_var.copy_(bn_running_var)

If batch_normalize is not true, we only need to load the biases of the convolutional layer.

        else:
            #Number of biases
            num_biases = conv.bias.numel()

            #Load the weights
            conv_biases = torch.from_numpy(weights[ptr: ptr + num_biases])
            ptr = ptr + num_biases

            #reshape the loaded weights according to the dims of the model weights
            conv_biases = conv_biases.view_as(conv.bias.data)

            #Finally copy the data
            conv.bias.data.copy_(conv_biases)

Finally, we load the weights of the convolution layer.

#Let us load the weights for the Convolutional layers
num_weights = conv.weight.numel()

#Do the same as above for weights
conv_weights = torch.from_numpy(weights[ptr:ptr+num_weights])
ptr = ptr + num_weights

conv_weights = conv_weights.view_as(conv.weight.data)
conv.weight.data.copy_(conv_weights)

That's it for this function. You can now load weights into your Darknet object by calling its load_weights function.

model = Darknet("cfg/yolov3.cfg")
model.load_weights("yolov3.weights")

With the model built and the weights loaded, we can finally start detecting objects. In the upcoming parts, we will show how to use objectness confidence thresholding and non-maximum suppression to produce the final set of detections.


Original link: medium.com/paperspace/…