YOLO, which stands for You Only Look Once, is an object detection algorithm based on convolutional neural networks (CNNs). YOLO V3 is the third version of the algorithm (after YOLO and YOLO 9000), with stronger and more accurate detection.

For more details on YOLO V3, see the YOLO website.

“You Only Live Once,” he said. “You Only Live Once.”

This article shares the implementation details of the YOLO V3 algorithm in the Keras framework. This is chapter 5: the loss function (Loss), which is carefully designed and combines four parts: the center point, the width and height, the box confidence, and the class confidence. Of course, there are also chapters 6 to n; after all, this is the full edition :).

GitHub source: github.com/SpikeKing/k…

Already published:

  • Article 1, training: mp.weixin.qq.com/s/T9LshbXoe…
  • Article 2, the model: mp.weixin.qq.com/s/N79S9Qf1O…
  • Article 3, the network: mp.weixin.qq.com/s/hC4P7iRGv…
  • Article 4, the true values: mp.weixin.qq.com/s/5Sj7QadfV…
  • Article 5, the loss: mp.weixin.qq.com/s/4L9E4WGSh…

Follow the WeChat official account DeepAlgorithm (ID: DeepAlgorithm) for more deep learning techniques!


1. The loss layer

During training, the network parameters are continuously adjusted to minimize the value of the loss function, thereby completing the training of the model. In YOLO V3, the loss function yolo_loss is wrapped in a custom Lambda loss layer, which is added as the last layer of the model and takes part in training. The input of this Lambda loss layer is the output of the existing model, model_body.output, together with the true values y_true; its output is a single value, namely the loss.

The core logic of the loss layer lives in yolo_loss. Besides the Lambda layer's inputs model_body.output and y_true, yolo_loss also receives three extra arguments: anchors, num_classes, and the threshold ignore_thresh.

Implementation:

model_loss = Lambda(yolo_loss,
                    output_shape=(1,), name='yolo_loss',
                    arguments={'anchors': anchors,
                               'num_classes': num_classes,
                               'ignore_thresh': 0.5}
                    )(model_body.output + y_true)

Here model_body.output is the predicted value of the existing model and y_true is the true value; both have the same format, where the last dimension 18 = 3 anchors × (1 class + 4 box values + 1 confidence):

model_body: [(?, 13, 13, 18), (?, 26, 26, 18), (?, 52, 52, 18)]
y_true: [(?, 13, 13, 18), (?, 26, 26, 18), (?, 52, 52, 18)]

Then, in the yolo_loss method, the parameters are:

  • args: the input of the Lambda layer, i.e. the concatenation of model_body.output and y_true;
  • anchors: a 2-D array with shape (9, 2), i.e. 9 anchor boxes;
  • num_classes: the number of classes;
  • ignore_thresh: the IoU threshold used for filtering anchor boxes;
  • print_loss: a switch for printing the loss value;

That is:

def yolo_loss(args, anchors, num_classes, ignore_thresh=.5, print_loss=True):
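For context, here is a minimal sketch of how this loss layer is usually wired into a trainable model, following the training script of the keras-yolo3 repository; model_body and anchors come from the earlier articles, and the y_true placeholder shapes and the dummy outer loss are assumptions based on that repo rather than part of the code above:

from keras.layers import Input, Lambda
from keras.models import Model
from keras.optimizers import Adam

num_classes = 1
# y_true placeholders, one per scale; the (grid, grid, 3, num_classes + 5) layout is an
# assumption following keras-yolo3 (article 4 describes how the true values are produced).
y_true = [Input(shape=(416 // s, 416 // s, 3, num_classes + 5)) for s in (32, 16, 8)]

# model_body and anchors come from the earlier articles of this series.
model_loss = Lambda(yolo_loss, output_shape=(1,), name='yolo_loss',
                    arguments={'anchors': anchors, 'num_classes': num_classes,
                               'ignore_thresh': 0.5})(model_body.output + y_true)
model = Model([model_body.input] + y_true, model_loss)

# The Lambda layer already returns the loss, so the outer Keras "loss" just passes it through.
model.compile(optimizer=Adam(lr=1e-3),
              loss={'yolo_loss': lambda y_true, y_pred: y_pred})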

2. Parameters

In the loss method yolo_loss, several parameters need to be set:

  • num_layers: the number of output layers, which is 1/3 of the number of anchors, i.e. 3;
  • yolo_outputs and y_true: obtained by splitting args; the first three elements are the predicted values yolo_outputs, the last three are the true values y_true;
  • anchor_mask: index array of the anchor boxes, three groups in reverse order: indices 6, 7, 8 correspond to the 13×13 scale, 3, 4, 5 to 26×26, and 0, 1, 2 to 52×52, i.e. [[6, 7, 8], [3, 4, 5], [0, 1, 2]];
  • input_shape: K.shape(yolo_outputs[0])[1:3], positions 1 to 2 of the shape of the first prediction map yolo_outputs[0], i.e. (13, 13) out of (?, 13, 13, 18), multiplied by 32. This is the input size of the YOLO network, namely (416, 416), because the network contains 5 convolutions with stride (2, 2), reducing the resolution by a factor of 32 = 2^5;
  • grid_shapes: similar to input_shape, K.shape(yolo_outputs[l])[1:3] collected into a list, i.e. the sizes of the three prediction maps: [(13, 13), (26, 26), (52, 52)];
  • m: the first position of the shape of the first prediction map, i.e. K.shape(yolo_outputs[0])[0], the number of images fed into the model, i.e. the batch size;
  • mf: m cast to float, i.e. K.cast(m, K.dtype(yolo_outputs[0]));
  • loss: the loss value, initialized to 0.

That is:

num_layers = len(anchors) // 3  # default setting
yolo_outputs = args[:num_layers]
y_true = args[num_layers:]
anchor_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]] if num_layers == 3 else [[3, 4, 5], [1, 2, 3]]
# input_shape is the output size * 32, i.e. the original input size; [1:3] is the size position, i.e. 416x416
input_shape = K.cast(K.shape(yolo_outputs[0])[1:3] * 32, K.dtype(y_true[0]))
# The size of each grid, collected into a list
grid_shapes = [K.cast(K.shape(yolo_outputs[l])[1:3], K.dtype(y_true[0])) for l in range(num_layers)]

m = K.shape(yolo_outputs[0])[0]  # batch size, tensor
mf = K.cast(m, K.dtype(yolo_outputs[0]))

loss = 0
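As a concrete example, with the default nine anchors shipped in keras-yolo3 (yolo_anchors.txt), these parameters take the following values; the snippet is only a sanity check of the constants discussed above:

import numpy as np

# Default YOLO V3 anchors (width, height) from yolo_anchors.txt of keras-yolo3.
anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                    (59, 119), (116, 90), (156, 198), (373, 326)])

num_layers = len(anchors) // 3
anchor_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]] if num_layers == 3 else [[3, 4, 5], [1, 2, 3]]

print(num_layers)      # 3
print(anchor_mask[0])  # [6, 7, 8] -> the largest anchors, used on the 13x13 scale
print(13 * 32, 26 * 16, 52 * 8)  # 416 416 416: each grid size times its stride gives the input size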

3. Prediction data

In yolo_head, the prediction map yolo_outputs[l] is decomposed into the bounding box's xy, wh, confidence, and class_probs. The input parameters are:

  • yolo_outputs[l] or feats: the l-th prediction map, e.g. (?, 13, 13, 18);
  • anchors[anchor_mask[l]] or anchors: the l-th group of anchor boxes, e.g. [(116, 90), (156, 198), (373, 326)];
  • num_classes: the number of classes, e.g. 1;
  • input_shape: the input size of the network as a Tensor, i.e. (416, 416);
  • calc_loss: a switch for computing the loss; when calculating the loss value, calc_loss is set to True;

That is:

grid, raw_pred, pred_xy, pred_wh = \
    yolo_head(yolo_outputs[l], anchors[anchor_mask[l]], num_classes, input_shape, calc_loss=True)
    
def yolo_head(feats, anchors, num_classes, input_shape, calc_loss=False):

Next, count the number of anchors, num_anchors, which is 3. Then reshape anchors into a Tensor, anchors_tensor, with the same number of dimensions as the prediction feats; its shape is (1, 1, 1, 3, 2), that is:

num_anchors = len(anchors)
# Reshape to batch, height, width, num_anchors, box_params.
anchors_tensor = K.reshape(K.constant(anchors), [1, 1, 1, num_anchors, 2])
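A quick NumPy illustration of this reshape (the anchor values are the 13x13-scale defaults mentioned above); the singleton dimensions let anchors_tensor broadcast against the (batch, grid, grid, num_anchors, 2) prediction when box_wh is computed:

import numpy as np

anchors = np.array([(116, 90), (156, 198), (373, 326)], dtype='float32')
anchors_tensor = anchors.reshape(1, 1, 1, 3, 2)
print(anchors_tensor.shape)  # (1, 1, 1, 3, 2)

# Broadcasts cleanly against a (batch, 13, 13, 3, 2) tensor of predicted wh values.
fake_wh = np.ones((2, 13, 13, 3, 2), dtype='float32')
print((fake_wh * anchors_tensor).shape)  # (2, 13, 13, 3, 2)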

Next, create a grid:

  • Get grid_shape from dimensions 1 and 2 of feats, e.g. 13×13;
  • grid_y and grid_x are built with K.arange, K.reshape and K.tile: grid_y traverses 0 to 12 along the y axis, grid_x traverses 0 to 12 along the x axis, and concatenating the two gives grid;
  • grid enumerates all (x, y) coordinate pairs; its structure is (13, 13, 1, 2);

That is:

grid_shape = K.shape(feats)[1:3]  # height, width
grid_y = K.tile(K.reshape(K.arange(0, stop=grid_shape[0]), [-1, 1, 1, 1]),
                [1, grid_shape[1], 1, 1])
grid_x = K.tile(K.reshape(K.arange(0, stop=grid_shape[1]), [1, -1, 1, 1]),
                [grid_shape[0], 1, 1, 1])
grid = K.concatenate([grid_x, grid_y])
grid = K.cast(grid, K.dtype(feats))

Next, reshape the last dimension of feats, separating the anchors dimension from the rest of the data (4 box values + box confidence + the number of classes):

feats = K.reshape(
    feats, [-1, grid_shape[0], grid_shape[1], num_anchors, num_classes + 5])

Next, compute the center point box_xy, the width-height box_wh, the box confidence box_confidence and the class confidence box_class_probs:

  • Center point box_xy: the x, y values in feats are passed through a sigmoid, the corresponding grid coordinate pair is added, and the result is divided by the grid size, giving a normalized value;
  • Width-height box_wh: the w, h values in feats are exponentiated, multiplied by the anchor boxes in anchors_tensor, and divided by the image width and height, giving normalized values;
  • Box confidence box_confidence: the confidence value in feats, normalized by a sigmoid;
  • Class confidence box_class_probs: the class_probs values in feats, normalized by a sigmoid;

That is:

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))
box_confidence = K.sigmoid(feats[..., 4:5])
box_class_probs = K.sigmoid(feats[..., 5:])

Here tx, ty, tw and th are the raw feats values, while bx, by, bw and bh are the output values, computed as follows:
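For reference, these are the box equations from the YOLO V2/V3 papers, where (c_x, c_y) is the offset of the grid cell from the top-left corner and (p_w, p_h) is the anchor size:

b_x = sigmoid(t_x) + c_x
b_y = sigmoid(t_y) + c_y
b_w = p_w * e^(t_w)
b_h = p_h * e^(t_h)

In the code above, b_x and b_y are additionally divided by the grid size, and b_w and b_h by the input size, which is where the 0~1 normalization comes from.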

The four values box_xy, box_wh, box_confidence, and box_class_probs all lie in the range 0~1.

Since we are computing the loss value, calc_loss is True, so the function returns:

  • grid: structure (13, 13, 1, 2), enumerating the coordinate pairs 0~12;
  • feats: the reshaped prediction, with the 18-dimensional last dimension split across the 3 anchors; its structure is (?, 13, 13, 3, 6);
  • box_xy and box_wh: the normalized center point xy and width-height wh; the structure of xy is (?, 13, 13, 3, 2) and that of wh is (?, 13, 13, 3, 2); both box_xy and box_wh lie in the range 0~1, i.e. the normalization is applied after bx, by, bw and bh are computed.

That is:

if calc_loss == True:
    return grid, feats, box_xy, box_wh

4. Loss function

When computing the loss value, the loss of each layer is computed in a loop and accumulated, i.e.:

for l in range(num_layers):
        # ...
        loss += xy_loss + wh_loss + confidence_loss + class_loss

Within each loop body:

  • Get the object confidence object_mask, position 4 of the last dimension; positions 0~3 are the box, and position 4 is the object confidence;
  • Get the class confidence true_class_probs, positions 5 onward of the last dimension;

That is:

object_mask = y_true[l][..., 4:5]
true_class_probs = y_true[l][..., 5:]

Next, call yolo_head to decompose the prediction map, which outputs:

  • grid: structure (13, 13, 1, 2), enumerating the coordinate pairs 0~12;
  • raw_pred: the reshaped prediction with the anchors separated out; its structure is (?, 13, 13, 3, 6);
  • pred_xy and pred_wh: the normalized center point xy and width-height wh; the structure of xy is (?, 13, 13, 3, 2), and that of wh is (?, 13, 13, 3, 2);

Then, xy and wh are concatenated into pred_box, whose structure is (?, 13, 13, 3, 4).

grid, raw_pred, pred_xy, pred_wh = \
    yolo_head(yolo_outputs[l], anchors[anchor_mask[l]], 
              num_classes, input_shape, calc_loss=True)
pred_box = K.concatenate([pred_xy, pred_wh])

Next, generate truth data:

  • raw_true_xy: the center point xy expressed as an offset within its grid cell, in the range 0~1; positions 0 and 1 of y_true are the normalized center point xy, in the range 0~1;
  • raw_true_wh: the wh of y_true expressed as a ratio to the anchors and then converted to log space, so it can be positive or negative; positions 2 and 3 of y_true are w and h normalized by the image width and height, in the range 0~1;
  • box_loss_scale: a weight computed from wh, in the range (1~2); smaller boxes receive a larger weight. A small NumPy sketch of these computations follows the implementation below.

Implementation:

# Darknet raw box to calculate loss.
raw_true_xy = y_true[l][..., :2] * grid_shapes[l][::-1] - grid
raw_true_wh = K.log(y_true[l][..., 2:4] / anchors[anchor_mask[l]] * input_shape[::-1])  # relative to anchors, in log space
raw_true_wh = K.switch(object_mask, raw_true_wh, K.zeros_like(raw_true_wh))  # avoid log(0)=-inf
box_loss_scale = 2 - y_true[l][..., 2:3] * y_true[l][..., 3:4]  # 2-w*h
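A small NumPy sketch of the same computations for a single hypothetical ground-truth box (center (0.6, 0.4), size (0.2, 0.3), all normalized to the image; the 13x13 scale and the (116, 90) anchor are just example values):

import numpy as np

xy = np.array([0.6, 0.4])            # normalized box center (x, y)
wh = np.array([0.2, 0.3])            # normalized box width and height
grid_shape = np.array([13, 13])      # (height, width) of this scale
anchor = np.array([116, 90])         # one anchor (width, height)
input_shape = np.array([416, 416])   # (height, width) of the network input

cell = np.floor(xy * grid_shape[::-1])                  # grid cell that owns the box, e.g. [7, 5]
raw_true_xy = xy * grid_shape[::-1] - cell              # offset inside the cell, ~[0.8, 0.2]
raw_true_wh = np.log(wh / anchor * input_shape[::-1])   # wh relative to the anchor, in log space
box_loss_scale = 2 - wh[0] * wh[1]                      # 1.94: smaller boxes get a larger weight

print(raw_true_xy, raw_true_wh, box_loss_scale)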

Then, ignore_mask is generated according to the IoU ignore threshold: the IoU between the prediction box pred_box and the true box true_box is computed, and positions whose best IoU is below ignore_thresh keep a mask value of 1, while anchor boxes that overlap a true box more strongly are masked out and thus ignored later in the no-object confidence loss. The shape of ignore_mask is (?, ?, ?, 3, 1): dimension 0 is the batch size and dimensions 1~2 are the feature map size.

Implementation:

ignore_mask = tf.TensorArray(K.dtype(y_true[0]), size=1, dynamic_size=True)
object_mask_bool = K.cast(object_mask, 'bool')

def loop_body(b, ignore_mask):
    true_box = tf.boolean_mask(y_true[l][b, ..., 0:4], object_mask_bool[b, ..., 0])
    iou = box_iou(pred_box[b], true_box)
    best_iou = K.max(iou, axis=-1)
    ignore_mask = ignore_mask.write(b, K.cast(best_iou < ignore_thresh, K.dtype(true_box)))
    return b + 1, ignore_mask

_, ignore_mask = K.control_flow_ops.while_loop(lambda b, *args: b < m, loop_body, [0, ignore_mask])
ignore_mask = ignore_mask.stack()
ignore_mask = K.expand_dims(ignore_mask, -1)
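The loop above relies on box_iou from the repo. Here is a minimal NumPy sketch of the same IoU computation for two boxes given as (center x, center y, w, h); the real box_iou works on Keras tensors and broadcasts over all predicted and true boxes at once:

import numpy as np

def iou_xywh(box1, box2):
    # Boxes are (center_x, center_y, w, h); convert to corner coordinates first.
    b1_min, b1_max = box1[:2] - box1[2:] / 2, box1[:2] + box1[2:] / 2
    b2_min, b2_max = box2[:2] - box2[2:] / 2, box2[:2] + box2[2:] / 2
    inter_wh = np.maximum(np.minimum(b1_max, b2_max) - np.maximum(b1_min, b2_min), 0.0)
    inter = inter_wh[0] * inter_wh[1]
    union = box1[2] * box1[3] + box2[2] * box2[3] - inter
    return inter / union

# A box fully contained in another: IoU = small area / large area = 0.25.
print(iou_xywh(np.array([0.5, 0.5, 0.4, 0.4]), np.array([0.5, 0.5, 0.2, 0.2])))  # 0.25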

Loss function:

  • xy_loss: loss of the center point. object_mask is position 4 of y_true, indicating whether the cell contains an object: 1 if it does, 0 if it does not. box_loss_scale depends on the size of the object box: 2 minus the relative area, so its range is (1~2). binary_crossentropy is the binary cross entropy.
  • wh_loss: loss of the width and height. Besides object_mask and box_loss_scale, it is multiplied by an extra coefficient 0.5 and uses the squared error K.square().
  • confidence_loss: loss of the box confidence. It consists of two parts: the first is the loss where an object exists, the second is the loss where no object exists; the latter is multiplied by ignore_mask, so boxes whose IoU with a true box exceeds the threshold are excluded from the no-object loss.
  • class_loss: loss of the class probabilities.
  • The sum of each part is divided by mf (the batch size), i.e. averaged over the batch, and the four averages are added up as the final loss value of the image.

Detail implementation:

object_mask = y_true[l][..., 4:5]  # Object mask
box_loss_scale = 2 - y_true[l][..., 2:3] * y_true[l][..., 3:4]  # Box loss scale
# Binary cross entropy with logits: z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
iou = box_iou(pred_box[b], true_box)  # IoU between the prediction box and the true box
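A quick NumPy check of the binary cross entropy formula quoted above (from_logits=True means the raw prediction x is passed through a sigmoid inside the loss); the input values are arbitrary examples:

import numpy as np

def bce_with_logits(z, x):
    # z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return -(z * np.log(s) + (1 - z) * np.log(1 - s))

print(bce_with_logits(1.0, 2.0))  # ~0.127: confident correct prediction -> small loss
print(bce_with_logits(0.0, 2.0))  # ~2.127: confident wrong prediction -> large loss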

Loss function implementation:

xy_loss = object_mask * box_loss_scale * K.binary_crossentropy(raw_true_xy, raw_pred[..., 0:2],
                                                               from_logits=True)
wh_loss = object_mask * box_loss_scale * 0.5 * K.square(raw_true_wh - raw_pred[..., 2:4])
confidence_loss = object_mask * K.binary_crossentropy(object_mask, raw_pred[..., 4:5], from_logits=True) + \
                  (1 - object_mask) * K.binary_crossentropy(object_mask, raw_pred[..., 4:5],
                                                            from_logits=True) * ignore_mask
class_loss = object_mask * K.binary_crossentropy(true_class_probs, raw_pred[..., 5:], from_logits=True)

xy_loss = K.sum(xy_loss) / mf
wh_loss = K.sum(wh_loss) / mf
confidence_loss = K.sum(confidence_loss) / mf
class_loss = K.sum(class_loss) / mf
loss += xy_loss + wh_loss + confidence_loss + class_loss

For reference, the loss function of YOLO V1 is slightly different from that of V3:
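The original figure is not reproduced here; transcribed from the YOLO V1 paper, the V1 loss is:

loss = λ_coord ∑_{i=0}^{S^2} ∑_{j=0}^{B} 1_{ij}^{obj} [(x_i - x̂_i)^2 + (y_i - ŷ_i)^2]
     + λ_coord ∑_{i=0}^{S^2} ∑_{j=0}^{B} 1_{ij}^{obj} [(√w_i - √ŵ_i)^2 + (√h_i - √ĥ_i)^2]
     + ∑_{i=0}^{S^2} ∑_{j=0}^{B} 1_{ij}^{obj} (C_i - Ĉ_i)^2
     + λ_noobj ∑_{i=0}^{S^2} ∑_{j=0}^{B} 1_{ij}^{noobj} (C_i - Ĉ_i)^2
     + ∑_{i=0}^{S^2} 1_i^{obj} ∑_{c ∈ classes} (p_i(c) - p̂_i(c))^2

V1 uses sum-squared error throughout, while V3 (as above) uses binary cross entropy for the confidence and class terms and keeps the squared error only for the width and height.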


Supplement

1. The "..." operator

In Python/NumPy, "..." is the Ellipsis operator; when indexing, it means that the other dimensions are kept unchanged and only the first or last dimensions are selected.

import numpy as np

x = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(x)        # [[ 1  2  3  4] [ 5  6  7  8] [ 9 10 11 12]]
print(x.shape)  # (3, 4)
y = x[1:2, ...]
print(y)        # [[5 6 7 8]]

2. Iterating over value combinations

In YOLO V3, when the grid values are computed, relative positions need to be converted to absolute positions: the relative value plus the coordinate of the grid cell's top-left corner. For example, the relative value (0.2, 0.3) in grid cell (1, 1) has the absolute value (1.2, 1.3). When converting coordinates, the corresponding offset is added according to the position of the grid cell. This requires iterating over all pairs of coordinate values, e.g. generating a grid matrix from 0 to 12.

This can be done efficiently with the combination arange -> tile -> concatenate.

Source:

from keras import backend as K

grid_y = K.tile(K.reshape(K.arange(0, stop=3), [-1, 1, 1]), [1, 3, 1])
grid_x = K.tile(K.reshape(K.arange(0, stop=3), [1, -1, 1]), [3, 1, 1])

sess = K.get_session()
print(grid_x.shape)  # (3, 3, 1)
print(grid_y.shape)  # (3, 3, 1)
z = K.concatenate([grid_x, grid_y])
print(z.shape)  # (3, 3, 2)
print(sess.run(z))
# Creates a 3x3 grid whose entries enumerate all (x, y) pairs from 0 to 2.

3. "::-1"

"::-1" reverses an array. In the loss code above it is used to swap (height, width) into (width, height) so that it matches the (x, y) order of the box coordinates. For example:

import numpy as np

a = np.array([1, 2, 3, 4, 5])
print(a[::-1])  # [5 4 3 2 1]

4. Session

In Keras, the backend Session can be used to evaluate tensors and check intermediate values.

from keras import backend as K

sess = K.get_session()
a = K.constant([2, 4])
b = K.constant([3, 2])
c = K.square(a - b)
print(sess.run(c))

OK, that’s all! Enjoy it!

Follow the WeChat official account DeepAlgorithm (ID: DeepAlgorithm) for more deep learning techniques!