YOLO, which stands for “You Only Look Once,” is an object detection algorithm based on convolutional neural networks (CNNs). YOLO V3 is the third version in the series (YOLO, YOLO 9000, and YOLO V3), and it is both more accurate and more robust than its predecessors.

For more details on YOLO V3, see the YOLO website.


This article explains in detail how the YOLO V3 algorithm is implemented with the Keras framework. This is Chapter 3, the network, which is based on DarkNet. Of course, there are chapters 4 to n as well; after all, this is the full edition 🙂 and this one is a little longer.

GitHub source: github.com/SpikeKing/k…

Already published:

  • Article 1, training: mp.weixin.qq.com/s/T9LshbXoe…
  • Article 2, the model: mp.weixin.qq.com/s/N79S9Qf1O…
  • Article 3, the network: mp.weixin.qq.com/s/hC4P7iRGv…
  • Article 4, the ground truth: mp.weixin.qq.com/s/5Sj7QadfV…
  • Article 5, the loss: mp.weixin.qq.com/s/4L9E4WGSh…

Follow the WeChat official account DeepAlgorithm (ID: DeepAlgorithm) to learn more about deep learning techniques!


1. The network

In the model, the network model_body of YOLO V3 is built by calling the yolo_body() method with the input layer image_input, num_anchors // 3 anchors per scale, and num_classes. The structure of image_input is (?, 416, 416, 3).

model_body = yolo_body(image_input, num_anchors // 3, num_classes)  # model

In model_body, the final input is image_input, and the final output is a list of three matrices:

[(?, 13, 13, 18), (?, 26, 26, 18), (?, 52, 52, 18)]
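As a quick sanity check of that output shape (assuming 9 anchors spread over 3 scales and a single detection class, as used later in this article), the channel count works out as follows:

num_anchors, num_classes = 9, 1
# each of the 3 scales predicts num_anchors // 3 boxes per cell,
# and each box carries num_classes + 4 box coordinates + 1 confidence
channels = (num_anchors // 3) * (num_classes + 5)
print(channels)  # 18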

The base network of YOLO V3 is the DarkNet network. By convolving and concatenating feature maps from the bottom and middle layers of DarkNet, outputs at three scales are created, namely [y1, y2, y3].

def yolo_body(inputs, num_anchors, num_classes):
    darknet = Model(inputs, darknet_body(inputs))

    # First detection map y1 (13x13) and the intermediate tensor x
    x, y1 = make_last_layers(darknet.output, 512, num_anchors * (num_classes + 5))

    # Upsample x and concatenate it with a mid-level Darknet feature map (26x26)
    x = compose(
        DarknetConv2D_BN_Leaky(256, (1, 1)),
        UpSampling2D(2))(x)
    x = Concatenate()([x, darknet.layers[152].output])
    x, y2 = make_last_layers(x, 256, num_anchors * (num_classes + 5))

    # Upsample again and concatenate with a shallower Darknet feature map (52x52)
    x = compose(
        DarknetConv2D_BN_Leaky(128, (1, 1)),
        UpSampling2D(2))(x)
    x = Concatenate()([x, darknet.layers[92].output])
    x, y3 = make_last_layers(x, 128, num_anchors * (num_classes + 5))

    return Model(inputs, [y1, y2, y3])

2. Darknet

The input to the Darknet network is the image data, with shape (?, 416, 416, 3); its output is the output of the darknet_body() method, which encapsulates the network’s core logic. That is:

darknet = Model(inputs, darknet_body(inputs))

The output of darknet_body() has the shape (?, 13, 13, 1024).

A simplified diagram of the Darknet network is shown below:

The version of Darknet used by YOLO V3 is Darknet53. Why Darknet53? Because the full Darknet53 network stacks 53 convolutional layers (downsampling is done with stride-2 convolutions rather than pooling), which corresponds to the simplified Darknet diagram, namely:

53 = 2 + 1*2 + 1 + 2*2 + 1 + 8*2 + 1 + 8*2 + 1 + 4*2 + 1
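Under the common reading of this formula, the detection backbone darknet_body() below contains 52 convolutions (one stem convolution, plus, for each of the 5 groups, one downsampling convolution and two convolutions per residual block); the remaining layer is Darknet53’s final classification layer, which YOLO V3 does not use. A small sketch of the count:

# Convolutions in darknet_body(): 1 stem conv, then for each group
# 1 downsampling conv + 2 convs per residual block
blocks = [1, 2, 8, 8, 4]
backbone_convs = 1 + sum(1 + 2 * n for n in blocks)
print(backbone_convs)      # 52
print(backbone_convs + 1)  # 53, counting Darknet53's final classification layer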

In darknet_body(), the Darknet network contains 5 resblock_body() units, namely:

def darknet_body(x):
    '''Darknet body having 52 Convolution2D layers'''
    x = DarknetConv2D_BN_Leaky(32, (3, 3))(x)
    x = resblock_body(x, num_filters=64, num_blocks=1)
    x = resblock_body(x, num_filters=128, num_blocks=2)
    x = resblock_body(x, num_filters=256, num_blocks=8)
    x = resblock_body(x, num_filters=512, num_blocks=8)
    x = resblock_body(x, num_filters=1024, num_blocks=4)
    return x

The first convolution operation, DarknetConv2D_BN_Leaky(), is a combination of three operations, namely:

  • 1 Darknet 2D convolution Conv2D layer, i.e. DarknetConv2D();
  • 1 BatchNormalization (BN) layer, i.e. BatchNormalization();
  • 1 LeakyReLU layer with slope 0.1; LeakyReLU is a variant of ReLU.

That is:

def DarknetConv2D_BN_Leaky(*args, **kwargs):
    """Darknet Convolution2D followed by BatchNormalization and LeakyReLU."""
    no_bias_kwargs = {'use_bias': False}
    no_bias_kwargs.update(kwargs)
    return compose(
        DarknetConv2D(*args, **no_bias_kwargs),
        BatchNormalization(),
        LeakyReLU(alpha=0.1))

The LeakyReLU activation function is f(x) = x when x ≥ 0, and f(x) = 0.1x when x < 0.
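A minimal illustration of this activation (using NumPy rather than the Keras layer, just to show the behaviour):

import numpy as np

def leaky_relu(x, alpha=0.1):
    # positive values pass through unchanged; negative values are scaled by alpha
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [-0.2  0.   3. ]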

Darknet’s 2D convolution DarknetConv2D() works as follows:

  • The kernel weight matrix W is regularized with L2, with a coefficient of 5e-4;
  • Padding: same mode is used in general; valid mode is used only when the stride is (2, 2), which avoids introducing useless boundary information during downsampling;
  • The other parameters remain unchanged and are consistent with the two-dimensional convolution operation Conv2D().

kernel_regularizer regularizes the kernel weight parameter W, while BatchNormalization normalizes the input data x.

Implementation:

@wraps(Conv2D)
def DarknetConv2D(*args, **kwargs):
    """Wrapper to set Darknet parameters for Convolution2D."""
    darknet_conv_kwargs = {'kernel_regularizer': l2(5e-4)}
    darknet_conv_kwargs['padding'] = 'valid' if kwargs.get('strides') == (2, 2) else 'same'
    darknet_conv_kwargs.update(kwargs)
    return Conv2D(*args, **darknet_conv_kwargs)

Next comes the first residual structure, resblock_body(): the input x is (?, 416, 416, 32), the number of filters (channels) is 64, and num_blocks is 1, so the block is repeated once. This first residual structure corresponds to part 1 of the simplified network diagram.

x = resblock_body(x, num_filters=64, num_blocks=1)

In resblock_body, there is the following logic:

  • ZeroPadding2D(): pads the border of x with zeros, converting (?, 416, 416, 32) to (?, 417, 417, 32). Since the next convolution has stride 2, the edge length needs to be odd.
  • DarknetConv2D_BN_Leaky(): DarkNet’s 2D convolution with a (3, 3) kernel and a (2, 2) stride. Note that this shrinks the feature map, converting (?, 417, 417, 32) to (?, 208, 208, 64). Since num_filters is 64, it produces 64 channels.
  • compose(): a combination function that outputs the prediction map y. A 1×1 convolution is applied first, then a 3×3 convolution; the number of filters is halved and then restored, so it ends up the same as the input, i.e. 64.
  • x = Add()([x, y]): the residual operation adds x and y element-wise. The residual connection helps avoid the vanishing gradient problem that occurs when the network is deep.

Implementation:

def resblock_body(x, num_filters, num_blocks):
    '''A series of resblocks starting with a downsampling Convolution2D'''
    # Darknet uses left and top padding instead of 'same' mode
    x = ZeroPadding2D(((1, 0), (1, 0)))(x)
    x = DarknetConv2D_BN_Leaky(num_filters, (3, 3), strides=(2, 2))(x)
    for i in range(num_blocks):
        y = compose(
            DarknetConv2D_BN_Leaky(num_filters // 2, (1, 1)),
            DarknetConv2D_BN_Leaky(num_filters, (3, 3)))(x)
        x = Add()([x, y])
    return x
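A minimal sketch (assuming the helper functions above are defined and Keras is importable) that traces the shape change through the first resblock_body():

from keras import backend as K
from keras.layers import Input

inputs = Input(shape=(416, 416, 3))
x = DarknetConv2D_BN_Leaky(32, (3, 3))(inputs)      # -> (None, 416, 416, 32)
x = resblock_body(x, num_filters=64, num_blocks=1)  # -> (None, 208, 208, 64)
print(K.int_shape(x))  # (None, 208, 208, 64)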

The residual operation process is shown in the figure:

Similarly, darknet_body() performs 5 groups of resblock_body() residual blocks, repeated [1, 2, 8, 8, 4] times, each repetition being a pair of convolutions (1×1 and 3×3). Each group contains one convolution with stride 2, so the spatial dimension is reduced by a factor of 32 in total (32 = 2^5 over the 5 groups), and the output feature map has an edge length of 13, i.e. 13 = 416 / 32. The number of filters in the last layer is 1024, so the final output structure is (?, 13, 13, 1024), namely:

Tensor("add_23/add:0", shape=(? , 13, 13, 1024), dtype=float32)Copy the code

At this point, the input to the Darknet model is (?, 416, 416, 3) and the output is (?, 13, 13, 1024).
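As a quick check of the arithmetic above, the spatial size halves once per group:

# Spatial edge length after each of the 5 stride-2 resblock_body() groups
size = 416
for _ in range(5):
    size //= 2
    print(size)  # 208, 104, 52, 26, 13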


3. The feature maps

The YOLO V3 network outputs detection maps at three different scales to detect objects of different sizes. make_last_layers() is called three times to generate the three detection maps y1, y2, and y3.

The 13×13 detection map

In part 1, the output dimension is 13×13. The make_last_layers() method takes the following parameters:

  • darknet.output: the output of the darknet network, i.e. (?, 13, 13, 1024);
  • num_filters: the number of channels, 512, used to generate the intermediate tensor x, which is passed on to the second detection map;
  • out_filters: the number of channels of the first output y1, which equals the number of anchor boxes per scale × (number of classes + 4 box coordinates + 1 box confidence).

That is:

x, y1 = make_last_layers(darknet.output, 512, num_anchors * (num_classes + 5))

The make_last_layers() method performs two steps:

  • In the first step, x goes through several alternating 1×1 and 3×3 convolutions; the number of filters alternates between num_filters (512) and twice that (1024), ending at 512, so the shape goes from (?, 13, 13, 1024) to (?, 13, 13, 512);
  • In the second step, x goes through a 3×3 convolution and then a 1×1 convolution without BN and LeakyReLU, which acts like a fully connected operation and generates the prediction matrix y.

Implementation:

def make_last_layers(x, num_filters, out_filters):
    '''6 Conv2D_BN_Leaky layers followed by a Conv2D_linear layer'''
    x = compose(
        DarknetConv2D_BN_Leaky(num_filters, (1, 1)),
        DarknetConv2D_BN_Leaky(num_filters * 2, (3, 3)),
        DarknetConv2D_BN_Leaky(num_filters, (1, 1)),
        DarknetConv2D_BN_Leaky(num_filters * 2, (3, 3)),
        DarknetConv2D_BN_Leaky(num_filters, (1, 1)))(x)
    y = compose(
        DarknetConv2D_BN_Leaky(num_filters * 2, (3, 3)),
        DarknetConv2D(out_filters, (1, 1)))(x)
    return x, y

Finally, the first make_last_layers() call outputs x as (?, 13, 13, 512) and y as (?, 13, 13, 18). Since the model has only one detection category, the fourth dimension of y is 18, i.e. 3 × (1 + 5) = 18.
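A minimal shape check (assuming make_last_layers() and the Darknet helpers above are in scope; the Input layer here is a stand-in for darknet.output):

from keras import backend as K
from keras.layers import Input

feats = Input(shape=(13, 13, 1024))   # stand-in for darknet.output
x, y1 = make_last_layers(feats, 512, 3 * (1 + 5))
print(K.int_shape(x))   # (None, 13, 13, 512)
print(K.int_shape(y1))  # (None, 13, 13, 18)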

The 26×26 detection map

Part 2, whose output dimension is 26×26, contains the following steps:

  1. Through a DarknetConv2D_BN_Leaky convolution, x is converted from 512 channels to 256 channels;
  2. Through 2× upsampling with UpSampling2D, x is converted from a 13×13 structure to a 26×26 structure;
  3. x is concatenated with the output of DarkNet layer 152 via Concatenate, and the result serves as the input of the second make_last_layers() call, which generates the second prediction map y2.

Here, the input x and darknet.layers[152].output both have a 26×26 structure, as follows:

x: shape=(?, 26, 26, 256)
darknet.layers[152].output: shape=(?, 26, 26, 512)

After concatenation, the output x has the shape (?, 26, 26, 768).
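A minimal sketch of this concatenation, standalone, with hypothetical Input placeholders standing in for the upsampled x and the Darknet layer output:

from keras import backend as K
from keras.layers import Input, Concatenate

a = Input(shape=(26, 26, 256))   # stand-in for the upsampled x
b = Input(shape=(26, 26, 512))   # stand-in for darknet.layers[152].output
merged = Concatenate()([a, b])   # channels are stacked along the last axis
print(K.int_shape(merged))       # (None, 26, 26, 768)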

The purpose of this is as follows: darknet.output, the high-level abstract information from the deepest layers of Darknet, is not only fed (after several transformations) to the first detection branch, but is also reused in the second detection branch. After upsampling, it is concatenated with the feature map from the second-to-last downsampling stage of the Darknet backbone, and the result is the input of the second detection branch. The deep features carry more global, abstract information, while the shallower features keep more local detail, which helps detect smaller objects.

Finally, the same make_last_layers() is called to output the second detection map y2 and the intermediate tensor x.

Implementation:

x = compose(
    DarknetConv2D_BN_Leaky(256, (1, 1)),
    UpSampling2D(2))(x)
x = Concatenate()([x, darknet.layers[152].output])
x, y2 = make_last_layers(x, 256, num_anchors * (num_classes + 5))

Finally, the second make_last_layers() call outputs x as (?, 26, 26, 256) and y as (?, 26, 26, 18).

The 52×52 detection map

Part 3, whose output dimension is 52×52, is similar to Part 2 and contains the following steps:

x = compose(
    DarknetConv2D_BN_Leaky(128, (1, 1)),
    UpSampling2D(2))(x)
x = Concatenate()([x, darknet.layers[92].output])
_, y3 = make_last_layers(x, 128, num_anchors * (num_classes + 5))

The logic is as follows:

  • x goes through a 1×1 convolution with 128 filters, followed by upsampling; the output is (?, 52, 52, 128);
  • darknet.layers[92].output, similar to layer 152, has the structure (?, 52, 52, 256);
  • After the two are concatenated, x is (?, 52, 52, 384);
  • It is finally fed into make_last_layers(), which generates y3 as (?, 52, 52, 18); the x output is ignored.

Finally, the model is built from the inputs and outputs of the whole pipeline. The input remains unchanged, i.e. (?, 416, 416, 3), and the output is the set of prediction layers at three scales, namely [y1, y2, y3].

return Model(inputs, [y1, y2, y3])

The three outputs [y1, y2, y3] are:

Tensor("conv2d_59/BiasAdd:0", shape=(? , 13, 13, 18), dtype=float32)
Tensor("conv2d_67/BiasAdd:0", shape=(? , 26, 26, 18), dtype=float32)
Tensor("conv2d_75/BiasAdd:0", shape=(? , 52, 52, 18), dtype=float32)

Finally, yolo_body() builds the entire YOLO V3 network, with DarkNet as the base network.

model_body = yolo_body(image_input, num_anchors // 3, num_classes)
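A minimal usage sketch (assuming one detection class and 9 anchors, as in this article, and that yolo_body() and the Keras imports are in scope):

from keras.layers import Input

num_anchors, num_classes = 9, 1
image_input = Input(shape=(416, 416, 3))
model_body = yolo_body(image_input, num_anchors // 3, num_classes)
for out in model_body.outputs:
    print(out.shape)  # expected: (?, 13, 13, 18), (?, 26, 26, 18), (?, 52, 52, 18)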

A schematic diagram of the network is shown below (the layer numbering differs slightly):


Supplement 1. Convolution padding

In the convolution operation, there are two ways to handle edge data: valid mode, which discards it, and same mode, which pads it.

For example:

Data: 1 2 3 4 5 6 7 8 9 10 11 12 13; input width = 13; filter width = 6; stride = 5

In the first, valid mode, with a filter width of 6 and a stride of 5, the data processed is:

1) I'm sorry to bother you. 2) I'm sorry to bother you.Copy the code

In the second, same mode, the data processed is:

1 2 3 4 5 6 | 6 7 8 9 10 11 | 11 12 13 0 0 0

The same mode has higher data utilization, while the valid mode avoids introducing invalid edge data; each mode has its own strengths.
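A small sketch of the output-size arithmetic for the example above (a general formula, not code from this repo): with input width n, filter width k and stride s, valid gives floor((n - k) / s) + 1 windows and same gives ceil(n / s) windows.

import math

n, k, s = 13, 6, 5
valid_windows = (n - k) // s + 1    # 2 windows: [1..6], [6..11]
same_windows = math.ceil(n / s)     # 3 windows: the last one is zero-padded
print(valid_windows, same_windows)  # 2 3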


Supplement 2. The compose() function

The compose() function uses Python lambda expressions and reduce to execute a list of functions sequentially, with the output of one function becoming the input of the next. compose() is well suited to chaining layers in a neural network.

For example:

from functools import reduce

def compose(*funcs):
    # Compose functions left to right: compose(f, g)(x) == g(f(x))
    if funcs:
        return reduce(lambda f, g: lambda *a, **kw: g(f(*a, **kw)), funcs)
    else:
        raise ValueError('Composition of empty sequence not supported.')

def func_x(x):
    return x * 10

def func_y(y):
    return y - 6

z = compose(func_x, func_y)  # execute func_x first, then func_y
print(z(10))  # 10 * 10 - 6 = 94

Supplement 3. UpSampling2D upsampling

The UpSampling2D upsampling operation enlarges the feature map by a given factor. Its core is a resize, and nearest-neighbor interpolation is used by default. data_format is the data layout; the default is channels_last, with the channel dimension last, as in (128, 128, 3).

Source:

def call(self, inputs):
    return K.resize_images(inputs, self.size[0], self.size[1],
                           self.data_format)

# ...
x = tf.image.resize_nearest_neighbor(x, new_shape)

For example: data of shape (?, 13, 13, 256), after the UpSampling2D(2) operation, becomes (?, 26, 26, 256).
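A minimal standalone sketch of this:

from keras import backend as K
from keras.layers import Input, UpSampling2D

x = Input(shape=(13, 13, 256))
y = UpSampling2D(2)(x)    # each spatial cell is repeated 2x2 (nearest neighbor)
print(K.int_shape(y))     # (None, 26, 26, 256)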


Supplement 4. 1×1 convolution vs. fully connected layer

Both a 1×1 convolution layer and a fully connected layer can be used as the prediction output of the last layer, with slight differences between the two.

Point 1:

  • A 1×1 convolution layer outputs a feature map with a fixed number of channels, regardless of the number of input channels;
  • A fully connected (Dense) layer has fixed input and output sizes, which cannot be changed once the network is designed;

So the 1×1 convolution layer is more flexible than the fully connected layer.

Point 2:

For example, if the input is 13×13×1024 and the prediction output has 18 channels, the two compare as follows (see the sketch after this list):

  • The 1×1 convolution layer has few parameters; it only needs weights matching the input and output channels, e.g. 1×1×1024×18 parameters to produce a 13×13×18 map;
  • The fully connected layer has many parameters; mapping the flattened input to 18 outputs needs 13×13×1024×18 parameters.
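A small standalone sketch of the parameter counts under the assumptions above (Keras adds a bias term to each):

from keras.layers import Input, Conv2D, Dense, Flatten
from keras.models import Model

inp = Input(shape=(13, 13, 1024))

conv_out = Conv2D(18, (1, 1))(inp)         # 1x1x1024x18 weights + 18 biases
dense_out = Dense(18)(Flatten()(inp))      # 13x13x1024x18 weights + 18 biases

print(Model(inp, conv_out).count_params())   # 18450
print(Model(inp, dense_out).count_params())  # 3115026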

OK, that’s all! Enjoy it!

Follow the WeChat official account DeepAlgorithm (ID: DeepAlgorithm) to learn more about deep learning techniques!