WWDC 2019 session 222 Understanding Images in Vision Framework

Vision is an image analysis framework built on top of Core ML that Apple introduced alongside Core ML in 2017. According to the official documentation, Vision includes basic features such as face detection, machine-learning-based image analysis, barcode recognition, and text recognition.

This article mainly introduces some of the impressive image features of the Vision framework and explains, to some extent, how they work.

Saliency is the focal area of the image

What is Saliency?

The word saliency means prominent, striking, or noteworthy. So what does Saliency mean in the context of Vision? Let's go straight to the picture above:

When we first see the image on the left, our eyes are immediately drawn to the three birds' faces. If you highlight the areas of attention in the image, it looks something like this: the highlights fall on the birds' heads.

These highlighted areas, where human attention is most likely to be focused, are the saliency, and the resulting heat map is the saliency heat map.

Two kinds of Saliency

The two saliency algorithms are clearly different. Attention-based saliency is judged intuitively by where human attention concentrates, while objectness-based saliency identifies the foreground objects and then segments them.

The middle image is attention-based: we usually focus on the faces of people and animals, so only the area around the face is highlighted. The image on the right, which highlights the entire bird, is objectness-based. The same applies to the example here:

Attention-based saliency is actually the more complex of the two. Intuitively, this is because attention itself is affected by many human factors, such as contrast, faces, subject matter, field of view, light intensity, and so on. Interestingly, it can even be affected by motion: in the image below, attention-based saliency highlights part of the path in front of the person.

For a demo, see: Highlight interesting parts of an image.

The Heat Map of Saliency

The concept of a heat map is easy to understand, but how do we obtain a saliency heat map? The basic usage pattern of the Vision API is handler plus request: create a handler (VNImageRequestHandler, the primary Vision handler) and call perform(_:) to execute the corresponding request (VNGenerateAttentionBasedSaliencyImageRequest; as the name suggests, the keyword is AttentionBasedSaliency). The code is as follows:

let handler = VNImageRequestHandler(url: imageURL)
let request: VNImageBasedRequest = VNGenerateAttentionBasedSaliencyImageRequest()
request.revision = VNGenerateAttentionBasedSaliencyImageRequestRevision1

try? handler.perform([request])
guard let result = request.results?.first,
      let observation = result as? VNSaliencyImageObservation
else { fatalError("missing result") }

let pixelBuffer = observation.pixelBuffer

The Bounding Boxes: Saliency positions

var boundingBox: CGRect { get }

Bounding boxes give the position of the detected saliency, but note that the coordinate system has its origin in the lower-left corner of the image. Attention-based saliency returns only one bounding box, while objectness-based saliency returns up to three.

The code to obtain the bounding boxes is as follows:

func addSalientObjects(in observation: VNSaliencyImageObservation,
                        to path: CGMutablePath, 
                        transform: CGAffineTransform)
{
    guard let objects = observation.salientObjects else { return }
    for object in objects {
        // Add each salient object's bounding box to the path
        path.addRect(object.boundingBox, transform: transform)
    }
}
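
The transform parameter is where the normalized, lower-left-origin coordinates get mapped into the drawing space. A minimal sketch of building such a transform, assuming a hypothetical image size and a top-left-origin drawing context, could look like this:

// Scale the normalized box up to pixel size and flip the Y axis
let imageSize = CGSize(width: 1920, height: 1080)   // hypothetical image size
let transform = CGAffineTransform(scaleX: imageSize.width, y: -imageSize.height)
    .translatedBy(x: 0, y: -1)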

Some use cases

Image saliency actually has many uses; to give a few examples:

  • For filters: apply different types of filters and image effects driven by the salient region.
  • For photo albums: improve the photo-browsing experience by automatically zooming photos to the most salient area.
  • For recognition: used together with other image algorithms, objects are first cropped out by their bounding box and then classified, which improves accuracy (a rough sketch follows below).
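
As a rough sketch of that last idea, and only a sketch (the cgImage input and the choice of the objectness-based request are my assumptions, not code from the session), cropping to the salient bounding box before classifying might look like this:

import Vision

func classifySalientRegion(of cgImage: CGImage) {
    // 1. Find the salient object
    let saliencyRequest = VNGenerateObjectnessBasedSaliencyImageRequest()
    try? VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([saliencyRequest])
    guard let observation = saliencyRequest.results?.first as? VNSaliencyImageObservation,
          let box = observation.salientObjects?.first?.boundingBox else { return }

    // 2. Convert the normalized, lower-left-origin box to top-left-origin pixel coordinates
    let width = CGFloat(cgImage.width), height = CGFloat(cgImage.height)
    let cropRect = CGRect(x: box.minX * width,
                          y: (1 - box.maxY) * height,
                          width: box.width * width,
                          height: box.height * height)

    // 3. Crop and classify only the salient region
    guard let cropped = cgImage.cropping(to: cropRect) else { return }
    let classifyRequest = VNClassifyImageRequest()
    try? VNImageRequestHandler(cgImage: cropped, options: [:]).perform([classifyRequest])
    print(classifyRequest.results ?? [])
}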

Image Classification

Image classification is the most basic capability of Vision. The Vision framework provides an API for image classification that is very convenient to use and is widely used in the iPhone Photos app. Although the Core ML framework also lets you train your own image classifier, doing so requires a lot of data and computing resources, which is a relatively high cost for ordinary developers. Moreover, the Vision classifier uses a multi-label network, so it can identify multiple objects in a single image.

What objects can be recognized? — Taxonomy

Which objects can be recognized? This brings us to the concept of the Taxonomy. A taxonomy is a classification system, borrowed from biology, in which different objects are organized by their semantic meaning. In this tree structure there are more than 1,000 categories, with broader parent classes and more specific child classes.

You can also check the entire Taxonomy using the following statement:

// List full taxonomy with
VNClassifyImageRequest.knownClassifications(forRevision: VNClassifyImageRequestRevision1)

When this taxonomy tree was constructed, each category had to be visually definable, and adjectives, abstract nouns, overly broad nouns, and occupational names were avoided. Using it follows the same unified handler-plus-request pattern of the Vision API, this time with VNClassifyImageRequest (keyword: Classify):

let handler = VNImageRequestHandler(url: imageUrl)
let request = VNClassifyImageRequest()
try? handler.perform([request])
let observations = request.results as? [VNClassificationObservation]

The result is an array of observations containing a series of recognition results and their corresponding confidence values (probabilities). Notice that the confidence values do not sum to 1; this is a consequence of the multi-label network mentioned above.

What do the resulting observations look like? For example:

// What could be recognized from the picture: animal, cat, mammal, clothing, beanie, hat, people, adult, snow...
[(animal, 0.848), (cat, 0.848), (mammal, 0.848), (clothing, 0.676), (beanie, 0.675), (hat, 0.675), (people, 0.616), (adult, 0.616), (snow, 0.445), (jacket, 0.214), (leash, 0.057), (cord, 0.057), ...]

Further filtering of results: Precision and Recall

After obtaining the recognition results, how do we analyze the observations array further to determine which results are credible enough? A key, common-sense rule is: confidence > threshold. It is easy to understand that when the confidence value exceeds a certain threshold, we can judge that the corresponding object is present in the picture. The biggest problem, however, is how to determine the threshold: it is not fixed, and it differs from image to image.

Next, we need to introduce two metrics: Precision and Recall. Here's a classic illustration:

  • Precision is the percentage of all predictions that are actually correct. It reflects the rate of false positives: the higher the precision, the more accurate the predictions and the fewer the false positives.
  • Recall is the percentage of all relevant instances that are successfully predicted. It reflects the rate of missed detections: the higher the recall, the fewer the missed detections. (A small numeric example follows below.)
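
As a quick numeric illustration (hypothetical numbers, not from the session): suppose a classifier predicts "cat" for 10 images, of which 7 really contain cats, while the test set contains 14 cat images in total. Then:

Precision = 7 / 10 = 0.7 (70% of the "cat" predictions were correct)
Recall = 7 / 14 = 0.5 (half of the actual cat images were found)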

Precision and recall both reflect the accuracy of a classification algorithm, but they have different emphases. To give two examples: in medical diagnosis, recall usually matters more, because we are more worried about missed cases; when filtering spam, precision usually matters more, because we don't want to mistakenly filter out a user's normal mail.

So, going back to the original question: how do we determine which results in the observations array satisfy confidence > threshold and should be kept? We can get the results directly by constraining precision or recall.

For example, hasMinimumRecall(_:forPrecision:) keeps only observations whose recall is at least 0.5 when precision is 0.7:

let searchObservations = observations?.filter { $0.hasMinimumRecall(0.5, forPrecision: 0.7) }

Of course, using hasMinimumPrecision(_:forRecall:) to constrain precision works the same way:

let searchObservations = observations?.filter { $0.hasMinimumPrecision(0.5, forRecall: 0.7) }

Visualizing the filtering process: the PR Curve

The PR curve reflects the relationship between precision and recall for a given classifier and can be used to measure its performance. You can see that precision and recall are inversely related: improving one tends to lower the other.

Every point on the PR curve corresponds to a pair of precision and recall values, so the curve lets us understand the filtering process above intuitively. For example, suppose we have three classifiers, with PR curves for recognizing "Cat", "Anvil", and "CD" respectively. When we apply the constraint (recall = 0.5, precision >= 0.4), the first two curves contain a point satisfying the condition while the third does not, so "CD" should clearly be filtered out of the results.

Image Similarity

Traditional ways of describing images

Besides recognizing images, we often need to determine how similar two images are. So first of all, how do we describe an image? There are two traditional approaches:

  1. Compare pixel information. This is very inaccurate: small changes can make two essentially identical images look completely different.

  2. Use keywords. But keywords are too general and cannot describe a specific picture accurately.

A description of an image must cover not only its surface appearance but also its content. The traditional methods above struggle to achieve this kind of deeper description. The clever observation is that when a classification neural network classifies an image, the network itself already builds such a description: its upper layers capture the salient information of the image while discarding redundant information. So we can use this property to describe the picture.

Vector description: FeaturePrint

FeaturePrint is a vector that describes the content of an image, similar to a traditional word vector. It reflects the information the neural network extracts while classifying the picture. With the corresponding Vision API, an image can be mapped to its FeaturePrint (which is why it is a vector).

Note that FeaturePrint is not limited to the categories in the classification Taxonomy.

With FeaturePrint, we can compare the similarity between images directly. The computeDistance(_:to:) method yields a floating-point number reflecting image similarity: as in the figure below, the smaller the distance, the more semantically similar the images.
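
A minimal sketch of the whole flow, assuming two hypothetical image URLs (imageURL1 and imageURL2), using VNGenerateImageFeaturePrintRequest and computeDistance(_:to:):

import Vision

func featurePrint(for url: URL) -> VNFeaturePrintObservation? {
    let handler = VNImageRequestHandler(url: url, options: [:])
    let request = VNGenerateImageFeaturePrintRequest()
    try? handler.perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

// The smaller the distance, the more semantically similar the two images are
if let printA = featurePrint(for: imageURL1),
   let printB = featurePrint(for: imageURL2) {
    var distance = Float(0)
    try? printA.computeDistance(&distance, to: printB)
    print("distance: \(distance)")
}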

For the Demo, see: Demo – Use FeaturePrint to compare similarities between images

Face Technology

Let’s talk about advances in facial recognition technology.

Face Landmarks: advances in facial feature point detection

Face landmark detection has always been an important part of face recognition technology. The Vision framework has made the following advances in this area (a small sketch follows the list):

  1. The number of landmark points increased from 65 to 76
  2. Each point now has its own confidence value (previously there was only a single overall confidence)
  3. Pupil detection is more accurate
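
A brief sketch of requesting the newer landmark constellation, assuming a hypothetical faceImageURL (and assuming the per-point confidence is read from precisionEstimatesPerPoint):

let handler = VNImageRequestHandler(url: faceImageURL, options: [:])
let request = VNDetectFaceLandmarksRequest()
request.revision = VNDetectFaceLandmarksRequestRevision3   // the revision with 76 points
try? handler.perform([request])

if let face = request.results?.first as? VNFaceObservation,
   let leftPupil = face.landmarks?.leftPupil {
    print(leftPupil.normalizedPoints)                       // pupil landmark positions
    print(leftPupil.precisionEstimatesPerPoint ?? [])       // per-point confidence estimates
}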

Face Capture Quality

Face capture quality is a holistic metric for judging how good a portrait shot is; the factors it weighs include lighting, blur, occlusion, expression, subject pose, and so on.

For example, the first photo scored higher than the second, meaning the first photo was of better quality.

Using the corresponding request (sketched below), we can directly obtain the face capture quality value of an image, then compare similar images and keep the better ones. For example, here is a Demo: Filter selfies by Face Capture quality
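
A minimal sketch of reading the face capture quality, assuming a hypothetical photoURL, via VNDetectFaceCaptureQualityRequest and the faceCaptureQuality property:

let handler = VNImageRequestHandler(url: photoURL, options: [:])
let request = VNDetectFaceCaptureQualityRequest()
try? handler.perform([request])

if let face = request.results?.first as? VNFaceObservation,
   let quality = face.faceCaptureQuality {
    print("face capture quality: \(quality)")   // higher means a better capture of the same subject
}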

Note that face capture quality cannot be compared against a fixed threshold. The quality values of different series of photos may be distributed over different ranges. If we filter with a fixed threshold (such as 0.520 in the figure above), we may filter out all the photos on the left, even though some of them are relatively good. In other words, face capture quality is a relative value within photos of the same subject; its absolute value does not directly reflect how good a photo is.

Other Progress

New detectors

In addition to the traditional detectors, there are new ones such as the Human Detector and the Cat and Dog Detector.
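
A brief sketch of these two detectors, assuming a hypothetical imageURL (the detectors correspond to VNDetectHumanRectanglesRequest and VNRecognizeAnimalsRequest):

let handler = VNImageRequestHandler(url: imageURL, options: [:])
let humanRequest = VNDetectHumanRectanglesRequest()      // detects human upper bodies
let animalRequest = VNRecognizeAnimalsRequest()          // detects cats and dogs
try? handler.perform([humanRequest, animalRequest])

let humans = humanRequest.results as? [VNDetectedObjectObservation] ?? []
let animals = animalRequest.results as? [VNRecognizedObjectObservation] ?? []
print("humans: \(humans.count)")
print("animals: \(animals.map { $0.labels.first?.identifier ?? "?" })")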

Enhanced video tracking technology

Video tracking technology has also been enhanced. A demo of the tracking technology can be seen here: Track multiple objects in a video. The specific improvements are as follows (a minimal usage sketch follows the list):

  1. Bounding boxes fit the tracked object more tightly, with less background clutter
  2. Occlusion is handled better
  3. It relies on machine learning algorithms
  4. Lower energy consumption
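
The usage sketch below is only illustrative; it assumes you already have an initial observation (for example from a detector) and a pixelBuffer for each incoming video frame:

// Track an object across frames with a sequence handler
let sequenceHandler = VNSequenceRequestHandler()
let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: initialObservation)
trackingRequest.trackingLevel = .accurate

// For each new frame (pixelBuffer):
try? sequenceHandler.perform([trackingRequest], on: pixelBuffer)
if let updated = trackingRequest.results?.first as? VNDetectedObjectObservation {
    print(updated.boundingBox)                 // the tracked object's position in this frame
    trackingRequest.inputObservation = updated // feed the result back in for the next frame
}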

Better integration between Vision and Core ML

Vision's support for Core ML has also improved. While it was already possible to run a Core ML model through the Vision API last year, it is now even easier to use:

  1. Vision can automatically convert the input image into the format the Core ML model expects and automatically parse the output into the appropriate observation type (a basic sketch follows the list).
  2. Vision can now take multiple images as input, but you need to set the mixRatio.
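
A basic sketch of running a Core ML classifier through Vision, assuming a hypothetical compiled model class named MyClassifier and an imageURL:

import Vision
import CoreML

func classifyWithCoreML(imageURL: URL) {
    guard let mlModel = try? MyClassifier(configuration: MLModelConfiguration()).model,
          let visionModel = try? VNCoreMLModel(for: mlModel) else { return }

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // Vision converts the input image and parses the output into observations for us
        let results = request.results as? [VNClassificationObservation]
        print(results?.prefix(3) ?? [])
    }
    try? VNImageRequestHandler(url: imageURL, options: [:]).perform([request])
}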