By Matthijs Hollemans, translator: Yang Zi; Proofreading: Numbbbbb, Linus the little blacksmith; Finalized: Forelax

Object detection is one of the classic problems in computer vision:

Identify what objects are in an image and where they are in the image.

Detection is a harder problem than classification. A classifier can tell you what object is in an image, but it cannot say exactly where that object is, and it cannot handle images that contain more than one object.

YOLO is a fast and effective neural network for real-time object detection.

In this article, I will explain how to use Metal Performance Shaders to run a “simplified” version of YOLOv2 on iOS devices.

Before you read on, watch this amazing YOLOv2 introduction video.

How YOLO works

You can turn a classifier like VGGNet or Inception into an object detector by sliding a small window across the image. At each step you run the classifier on the contents of the current window. Using a sliding window like this produces hundreds of detections per image, but you keep only the ones the classifier is most certain about.
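
To make this concrete, here is a minimal sketch of the sliding-window approach in Swift. The Detection type and the classify closure are hypothetical stand-ins for a real classifier; the point is only to show how many classifier runs this costs.

import CoreGraphics

// Hypothetical result type for one confident window.
struct Detection {
    let label: String
    let score: Float
    let rect: CGRect
}

// Slide a fixed-size window over the image and classify every crop.
// `classify` stands in for running a classifier such as VGGNet on one crop.
func slidingWindowDetect(imageSize: CGSize,
                         windowSize: CGSize,
                         stride: CGFloat,
                         classify: (CGRect) -> (label: String, score: Float)) -> [Detection] {
    var detections: [Detection] = []
    var y: CGFloat = 0
    while y + windowSize.height <= imageSize.height {
        var x: CGFloat = 0
        while x + windowSize.width <= imageSize.width {
            let window = CGRect(origin: CGPoint(x: x, y: y), size: windowSize)
            let (label, score) = classify(window)   // one full classifier run per window
            if score > 0.5 {                        // keep only the most confident windows
                detections.append(Detection(label: label, score: score, rect: window))
            }
            x += stride
        }
        y += stride
    }
    return detections
}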

This works, but it’s obviously slow because you have to run the classifier many times. A slightly more efficient approach is to first predict which parts of the image contain interesting information (so-called region proposals) and then run the classifier only on those regions. The classifier has to do less work than with sliding windows, but it still runs many times.

YOLO takes a completely different approach. It does not turn a traditional classifier into a detector. YOLO really only looks at the image once (hence its name: You Only Look Once), but in a clever way.

YOLO divides the image into a 13×13 grid of cells:

Each grid cell is responsible for predicting 5 bounding boxes. A bounding box is a rectangle that encloses an object.

YOLO also outputs a confidence score that tells us how certain it is that the predicted bounding box actually encloses some object. This score says nothing about what kind of object is in the box, only whether the shape and position of the box fit something.

The predicted bounding boxes look something like the following (the higher the confidence score, the thicker the box is drawn):

For each bounding box, the corresponding grid cell also predicts a class. This works just like a classifier: it gives a probability distribution over all the possible classes. The version of YOLO we’re using was trained on the PASCAL VOC dataset and can detect 20 different classes, such as:

  • bicycle
  • boat
  • car
  • cat
  • dog
  • person
  • and so on

The confidence score for the bounding box and the class prediction are combined into one final score that tells us the probability that this bounding box contains a specific type of object. For example, the big, thick yellow box on the left tells us there is an 85% chance that it contains a dog:

Since the image is divided into 13×13 = 169 grid cells and each cell predicts 5 bounding boxes, we end up with 845 bounding boxes in total. It turns out that most of these boxes have very low confidence scores, so we only keep the boxes whose final score is 30% or more (you can change this threshold depending on how accurate you want the detector to be).
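
Here is a tiny sketch of that scoring rule in Swift; the names and the numbers are mine, purely for illustration.

// A sketch of the scoring rule:
// final score = box confidence × probability of the best class.
func finalScore(boxConfidence: Float, classProbabilities: [Float]) -> (classIndex: Int, score: Float) {
    let best = classProbabilities.indices.max { classProbabilities[$0] < classProbabilities[$1] }!
    return (best, boxConfidence * classProbabilities[best])
}

// Toy example: a 20-class distribution where index 11 ("dog", say) dominates.
var probabilities = [Float](repeating: 0.01, count: 20)
probabilities[11] = 0.81
let (classIndex, score) = finalScore(boxConfidence: 0.95, classProbabilities: probabilities)
// score ≈ 0.77, so this box survives a 30% threshold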

The final prediction:

Out of the 845 total boxes we kept only these three because they gave the best results. And even though there were 845 separate predictions, they were all made at the same time: the neural network only ran once. That’s what makes YOLO so powerful and fast.

(Above image from PjReddie.com)

The neural network

The architecture of YOLO is just a simple convolutional neural network:

Layer         kernel  stride  output shape
---------------------------------------------
Input                          (416, 416, 3)
Convolution    3×3      1      (416, 416, 16)
MaxPooling     2×2      2      (208, 208, 16)
Convolution    3×3      1      (208, 208, 32)
MaxPooling     2×2      2      (104, 104, 32)
Convolution    3×3      1      (104, 104, 64)
MaxPooling     2×2      2      (52, 52, 64)
Convolution    3×3      1      (52, 52, 128)
MaxPooling     2×2      2      (26, 26, 128)
Convolution    3×3      1      (26, 26, 256)
MaxPooling     2×2      2      (13, 13, 256)
Convolution    3×3      1      (13, 13, 512)
MaxPooling     2×2      1      (13, 13, 512)
Convolution    3×3      1      (13, 13, 1024)
Convolution    3×3      1      (13, 13, 1024)
Convolution    1×1      1      (13, 13, 125)
---------------------------------------------

The network has a very typical structure: convolution layers with 3×3 kernels and max-pooling layers with 2×2 windows. Nothing fancy. There are no fully-connected layers in YOLO.

Note: The “simplified” version of YOLO we use has only 9 convolutional layers and 6 pooling layers. The full YOLOv2 model has three times as many layers and a somewhat more complex architecture, but it is still just a regular convolutional neural network.

The very last convolutional layer has a 1×1 kernel and serves to reduce the data to the shape 13×13×125. That 13×13 should look familiar: it is the size of the grid that the image gets divided into.

So we end up with 125 channels for every grid cell. These 125 numbers contain the data for the bounding boxes as well as the class predictions. Why 125? Each grid cell predicts 5 bounding boxes, and each bounding box is described by 25 data elements:

  • the x, y, width, and height of the bounding box rectangle
  • the confidence score
  • the probability distribution over the 20 classes

Using YOLO is simple: you give it an input image (resized to 416×416 pixels), it goes through the convolutional network in a single pass, and out comes a 13×13×125 tensor describing the bounding boxes for the grid cells. All you need to do then is compute the final score for each box and throw away the ones scoring lower than 30%.
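
As a quick sanity check on those numbers, here is a short sketch (the constant names are my own):

let gridSize     = 13                        // the image is divided into 13×13 cells
let boxesPerCell = 5                         // each cell predicts 5 bounding boxes
let numClasses   = 20                        // PASCAL VOC classes
let valuesPerBox = 4 + 1 + numClasses        // x, y, w, h + confidence + 20 class probs = 25
let channels     = boxesPerCell * valuesPerBox        // 5 × 25 = 125
let totalBoxes   = gridSize * gridSize * boxesPerCell // 13 × 13 × 5 = 845
let outputCount  = gridSize * gridSize * channels     // 13 × 13 × 125 = 21,125 values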

Tip: To learn more about how YOLO works and how it was trained, watch this interview with one of its inventors. The video covers YOLOv1, an earlier version with a slightly different architecture, but the main ideas are the same. Well worth watching!

Into the Metal

What I described above is the simplified version of YOLO, and it is the one we will use in the iOS app. The full YOLOv2 network has three times as many layers and is too big to run fast on current iPhones. The simplified version uses fewer layers, which makes it faster but also a little less accurate.

YOLO is written in Darknet, a deep learning framework created by YOLO’s authors. Everything you can download is in Darknet format. And although Darknet is open source, I didn’t really want to spend a lot of time figuring out how it works.

Fortunately, someone has already done that work and converted the Darknet model to Keras, the deep learning tool I use. All I had to do was run that “YAD2K” script to convert the Darknet weights to the Keras format, and then use a script of my own to convert Keras to Metal.

However, there was a slight wrinkle. YOLO uses a technique called “batch normalization” after its convolutional layers.

The idea behind batch normalization is that neural networks work best when the data is clean. Ideally, the input to a layer has a mean of 0 and not too much variance. This should sound familiar to anyone who has done machine learning, since we often use a technique called “feature scaling” or “whitening” to prepare the input data for exactly this purpose.

Batch normalization does a similar kind of feature scaling for the data in between layers. It keeps the data from deteriorating as it flows through the network and noticeably improves the network’s performance.

To get an intuitive sense of the effect of batch normalization, here are histograms of the output of the first convolution layer, with and without batch normalization:

Batch normalization is important when training a deep network, but it turns out we can get rid of it at inference time. That’s good news, because not having to do the batch normalization calculations makes our app faster. And in any case, Metal does not have an MPSCNNBatchNormalization layer.

Batch normalization usually happens after the convolutional layer and before the activation function (a ReLU in YOLO’s case). Since both the convolution and batch normalization perform a linear transformation of the data, we can combine the batch normalization parameters with the convolution weights. This is called “folding” the batch normalization layer into the convolution layer.

Long story short: with a bit of math we can get rid of the batch normalization layers entirely, but it does mean we have to change the weights of the preceding convolution layers.

A quick recap of what a convolution layer computes: if x is a pixel in the input image and w holds the weights of the convolution layer, then each output pixel of the layer is:

out[j] = x[i]*w[0] + x[i+1]*w[1] + x[i+2]*w[2] + ... + x[i+k]*w[k] + b

In other words: the dot product of the input pixels with the weights of the convolution kernel, plus a bias term b.

And this is the batch normalization that gets applied to the output of the convolution layer:

        gamma * (out[j] - mean)
bn[j] = ---------------------- + beta
            sqrt(variance)

Batch normalization first subtracts the mean from each output pixel, then divides by the standard deviation, multiplies the result by a scaling factor gamma, and finally adds an offset beta. These four parameters (mean, variance, gamma, and beta) are what the batch normalization layer learns as the network is trained.

To get rid of the batch normalization, we can shuffle these two equations around a bit and compute new weights and a new bias term for the convolution layer:

           gamma * w
w_new = --------------
        sqrt(variance)

        gamma*(b - mean)
b_new = ---------------- + beta
         sqrt(variance)

Performing a convolution of the input x with these new weights and bias gives exactly the same result as the original convolution layer followed by batch normalization.

Now we can remove the batch normalization layers and just use the convolution layers, but with the adjusted weights and bias terms w_new and b_new. We repeat this procedure for every convolution layer in the network.
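
Here is a minimal Swift sketch of that folding step for a single output channel, just to make the formulas concrete. The real conversion happens in the Python script; the epsilon term is the usual small constant that batch normalization adds for numerical stability.

// Fold batch normalization parameters into the preceding convolution layer,
// per output channel. Illustration of the formulas above, not the real script.
func foldBatchNorm(weights: [Float],     // convolution weights for one output channel
                   bias: Float,          // 0 for YOLO's convolution layers
                   mean: Float, variance: Float,
                   gamma: Float, beta: Float,
                   epsilon: Float = 1e-3) -> (weights: [Float], bias: Float) {
    let scale = gamma / (variance + epsilon).squareRoot()
    let newWeights = weights.map { $0 * scale }      // w_new = gamma * w / sqrt(variance)
    let newBias = (bias - mean) * scale + beta       // b_new = gamma * (b - mean) / sqrt(variance) + beta
    return (newWeights, newBias)
}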

Note: The convolution layers in YOLO don’t actually use a bias, so b is 0 in the formula above. But note that after folding in the batch normalization parameters, the convolution layers do get a bias term.

Once we have folded all the batch normalization layers into their preceding convolution layers, we can convert the weights to Metal. This is a simple matter of transposing the arrays (Keras stores them in a different order than Metal) and writing them out to binary files of 32-bit floating point numbers.
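
For what it’s worth, writing out such a binary file in Swift would look roughly like the sketch below. The names are mine and the file name is made up; the real conversion script does this part in Python.

import Foundation

// Dump an array of 32-bit floats to a raw binary file.
func writeWeights(_ weights: [Float], to url: URL) throws {
    let data = weights.withUnsafeBufferPointer { Data(buffer: $0) }
    try data.write(to: url)
}

// Hypothetical usage:
// try writeWeights(foldedConv1Weights, to: URL(fileURLWithPath: "conv1_W.bin"))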

If you are curious about these steps, have a look at the conversion script yolo2metal.py for details. To verify that the folding works, the script creates a new model without the batch normalization layers but with the adjusted weights, and compares its predictions against those of the original model.

The iOS app

Naturally, I used Forge to build the iOS app. 😂 You can find the source code in the YOLO folder. To try it out: download or clone Forge, open Forge.xcworkspace in Xcode 8.3 or later, and run the YOLO target on an iPhone 6 or later.

The easiest way to test it is to point your iPhone at a YouTube video:

There is some interesting code in YOLO.swift. First, here is where the convolutional network gets created:

let leaky = MPSCNNNeuronReLU(device: device, a: 0.1)

let input = Input()

let output = input
         --> Resize(width: 416, height: 416)
         --> Convolution(kernel: (3, 3), channels: 16, padding: true, activation: leaky, name: "conv1")
         --> MaxPooling(kernel: (2, 2), stride: (2, 2))
         --> Convolution(kernel: (3, 3), channels: 32, padding: true, activation: leaky, name: "conv2")
         --> MaxPooling(kernel: (2, 2), stride: (2, 2))
         --> ...and so on...

The image coming from the camera is resized to 416×416 pixels and then fed through the convolution and pooling layers. This is very similar to how any other convolutional neural network operates.

What’s really interesting is what happens with the output. Recall that the output is a 13×13×125 tensor: 125 channels for every grid cell in the image. These 125 numbers contain the bounding box data as well as the class predictions, and we need to organize them somehow. That happens in fetchResult().

Note: The fetchResult() function runs on the CPU, not the GPU. That was simpler to implement. Having said that, the nested loops might well benefit from the GPU’s parallelism; maybe I’ll write a GPU version in the future.

Here’s how fetchResult() works:

public func fetchResult(inflightIndex: Int) -> NeuralNetworkResult<Prediction> {
  let featuresImage = model.outputImage(inflightIndex: inflightIndex)
  let features = featuresImage.toFloatArray()

The output of the convolutional network arrives as an MPSImage. We first convert it into an array of Float values, called features, to make it easier to work with.

The body of fetchResult() is one big nested loop. It visits every grid cell and each of the 5 predictions per cell:

for cy in 0..<13 {
  for cx in 0..<13 {
    for b in 0..<5 {
      . . .
    }
  }
}

Inside this loop we compute the bounding box b for grid cell (cx, cy).

First, we read the x, y, width, height, and confidence score of the bounding box from the features array:

let channel = b*(numClasses + 5)
let tx = features[offset(channel, cx, cy)]
let ty = features[offset(channel + 1, cx, cy)]
let tw = features[offset(channel + 2, cx, cy)]
let th = features[offset(channel + 3, cx, cy)]
let tc = features[offset(channel + 4, cx, cy)]

The offset() helper function finds the right place in the array to read the data from. Metal stores its data in groups of 4 channels at a time, which means the 125 channels are not laid out consecutively but are scattered across slices. (See the code for a more in-depth explanation.)
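
Roughly, that kind of offset calculation looks like the sketch below. This is my own illustration of the slices-of-4 layout; the actual code in Forge may differ in details.

// The 125 channels are stored as slices of 4 channels each; within a slice
// the data is laid out as rows × columns × 4.
let blockSize = 13 * 13 * 4             // one slice: 13×13 pixels × 4 channels
func offset(_ channel: Int, _ cx: Int, _ cy: Int) -> Int {
    let slice = channel / 4             // which group of 4 channels
    let indexInSlice = channel - slice * 4
    return slice * blockSize + (cy * 13 + cx) * 4 + indexInSlice
}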

We still need to post-process the five values tx, ty, tw, th, and tc, because their format is a little odd. If you’re wondering where the formulas below come from, they were introduced in the paper (they’re a consequence of how the network was trained):

let x = (Float(cx) + Math.sigmoid(tx)) * 32
let y = (Float(cy) + Math.sigmoid(ty)) * 32

let w = exp(tw) * anchors[2*b    ] * 32
let h = exp(th) * anchors[2*b + 1] * 32

let confidence = Math.sigmoid(tc)

x and y now give the center of the bounding box in the 416×416 input image, and w and h are the box’s width and height in that image. tc is the confidence score for the bounding box, which we convert into a percentage with the logistic sigmoid function.

We now have a bounding box, and we know how confident YOLO is that it contains an object. Next, let’s look at the class prediction to see what kind of object YOLO thinks is inside the box:

var classes = [Float](repeating: 0, count: numClasses)
for c in 0..<numClasses {
  classes[c] = features[offset(channel + 5 + c, cx, cy)]
}
classes = Math.softmax(classes)

let (detectedClass, bestClassScore) = classes.argmax()

Recall that 20 of the channels in the features array hold the class predictions for this bounding box. We read those into a new classes array. As usual for a classifier, we apply the softmax function to turn that array into a probability distribution, and then we pick the class with the highest score as the winner.
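
For reference, a softmax is just exponentiation followed by normalization. Here is a minimal sketch; Forge’s Math.softmax is presumably a more optimized version of the same idea.

import Foundation

// Numerically stable softmax: subtract the max before exponentiating.
func softmax(_ x: [Float]) -> [Float] {
    let maxValue = x.max() ?? 0
    let exps = x.map { exp($0 - maxValue) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}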

Now we can compute the final score for this bounding box, for example “I am 85% sure this bounding box contains a dog.” Since there are 845 boxes in total, we only want to keep the ones whose final score is above a certain threshold.

let confidenceInClass = bestClassScore * confidence
if confidenceInClass > 0.3 {
  let rect = CGRect(x: CGFloat(x - w/2), y: CGFloat(y - h/2),
                    width: CGFloat(w), height: CGFloat(h))

  let prediction = Prediction(classIndex: detectedClass,
                              score: confidenceInClass,
                              rect: rect)
  predictions.append(prediction)
}

That code is repeated for all the grid cells. Once the loop is done, we have a predictions array that typically holds 10 to 20 predictions.

We have already filtered out the boxes whose final score is very low, but some of the remaining boxes may still overlap each other quite heavily. So the last thing we do in fetchResult() is remove these duplicate boxes with a technique called non-maximum suppression.

  var result = NeuralNetworkResult<Prediction>()
  result.predictions = nonMaxSuppression(boxes: predictions,
                                         limit: 10, threshold: 0.5)
  return result
}

The algorithm used by the nonMaxSuppression() function is simple:

  1. Start with the box with the highest final score.
  2. Remove any remaining bounding boxes that overlap this box by more than a certain threshold (for example, by more than 50%).
  3. Return to the first step and repeat until you have traversed all the boxes.

This removes boxes that overlap too much with boxes that scored higher, keeping only the best ones.
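
Here is a minimal sketch of that algorithm over (score, rect) pairs, written purely for illustration; Forge’s nonMaxSuppression() works on Prediction values and may differ in details.

import CoreGraphics

// Intersection-over-union: how much two boxes overlap, from 0 to 1.
func iou(_ a: CGRect, _ b: CGRect) -> Float {
    let intersection = a.intersection(b)
    if intersection.isNull { return 0 }
    let interArea = intersection.width * intersection.height
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return Float(interArea / unionArea)
}

// Keep the highest-scoring boxes; drop any box that overlaps an already
// selected box by more than `threshold`.
func nonMaxSuppression(boxes: [(score: Float, rect: CGRect)],
                       limit: Int,
                       threshold: Float) -> [(score: Float, rect: CGRect)] {
    var selected: [(score: Float, rect: CGRect)] = []
    for candidate in boxes.sorted(by: { $0.score > $1.score }) {
        if selected.count >= limit { break }
        if !selected.contains(where: { iou($0.rect, candidate.rect) > threshold }) {
            selected.append(candidate)
        }
    }
    return selected
}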

And that’s all there is to it: a regular convolutional network plus a bit of post-processing of the results.

How well does it perform?

YOLO’s website claims that the simplified version can do up to 200 frames per second. But that is on a beefy laptop, not a mobile device. So how fast does it run on an iPhone?

On my iPhone 6s it takes about 0.15 seconds to process a single image. That’s only about 6 FPS, which is barely real-time. If you point the phone at a passing car, you’ll see a detection box trailing a bit behind the car. Still, I’m impressed that this works at all. 😁

Note: As explained above, the bounding boxes are processed on the CPU, not the GPU. Would YOLO run faster if that part also ran on the GPU? Maybe, but the CPU code only takes about 0.03 seconds, 20% of the run time. It’s certainly possible to hand off some of this work to the GPU, but given that the convolutional layers still take up 80% of the time, I’m not sure it would be worth it.

I think the main reason it’s slow are the convolution layers with 512 and 1024 output channels. In my experiments, MPSCNNConvolution seems to have more trouble with small images that have many channels than with larger images that have fewer channels.

Another thing I’d like to try is a different network architecture, such as SqueezeNet, and re-train that network to do bounding box prediction in its last layer. In other words, take the ideas behind YOLO and apply them on top of a smaller, faster network. Would the loss in accuracy be worth the gain in speed?

Note: By the way, the recent Caffe2 framework also runs on iOS devices with Metal support, and the caffe2-ios project includes the simplified version of YOLO. It appears to run a bit slower than the pure Metal version, around 0.17 seconds per frame.

Afterword

To learn more about YOLO, check out the following papers by YOLO authors:

  • You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2015)
  • YOLO9000: Better, Faster, Stronger by Joseph Redmon and Ali Farhadi (2016)

My implementation is based in part on TensorFlow Android Demo TF Detect, Allan Zelener’s YAD2K, and Darknet source code.

This article was translated by the SwiftGG translation team with the original author’s permission. Please visit swift.gg for the latest articles.