Editor's note: when an image classifier is shown an object that isn't in its label set, it still forces that object into one of the categories it knows. This is a long-standing headache in image recognition. What can we do about it?

By Pete Warden

Translated by: McGL

A few days ago, I received a question from Plant Village, a team I work with that is developing an app. The app can detect plant diseases and gives good results when pointed at a leaf, but if you point it at a computer keyboard, it decides the keyboard is a damaged crop.

For computer vision researchers, this result is not surprising, but for most others it is, so I want to explain why this happens and what we can do about it.

As humans, we are so used to categorizing everything in the world around us that we naturally expect machines to be able to do the same. But most models can only recognize a very limited set of objects, such as the 1,000 categories of the original ImageNet competition. Crucially, the training process assumes that every example shown to the model belongs to one of those categories, and that every prediction must come from that set. The model has no option to say “I don’t know,” and no training data to help it learn that response. This is a simplification that makes sense in a research setting, but it causes problems when we try to use the resulting models in the real world.
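
To see why, here is a minimal sketch (not from the original post; the four-class plant label set and the scores are made up for illustration). Softmax squeezes whatever scores the network produces into a distribution over the known classes, and argmax always returns one of them, so even a photo of a keyboard comes back with a plant label:

```python
import numpy as np

# Hypothetical label set for a plant-disease classifier; any closed set behaves the same.
LABELS = ["healthy_leaf", "blight", "rust", "mosaic_virus"]

def classify(logits):
    """Convert raw network scores into a prediction over the fixed label set."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return LABELS[int(np.argmax(probs))], probs

# Pretend these scores came from pointing the model at a computer keyboard.
# Softmax still forces them to sum to 1 over the four plant classes, so the
# model must answer with one of them rather than "I don't know."
keyboard_logits = np.array([0.2, 1.3, 0.4, 0.1])
label, probs = classify(keyboard_logits)
print(label, probs)   # -> "blight" with roughly 49% "confidence"
```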

When I was at Jetpac, we had a hard time convincing people that the groundbreaking AlexNet model was a huge breakthrough, because every time we handed them a demo phone running the network, they would point it at their faces and it would predict something like “oxygen mask” or “seat belt.” That’s because the ImageNet competition categories don’t include a label for people, but most of the photos tagged with masks and seat belts do include faces. Another embarrassing mistake was that when they pointed the phone at a plate, it predicted “toilet seat”! That’s because there were no plates in the original categories, and the closest white circular object in appearance was a toilet.

I think of it as an “open world” vs. “closed world” issue. The model was trained and evaluated on the assumption that only a finite universe of objects would ever be presented to it, but as soon as it is used outside the lab that assumption breaks down, and users judge the model on its performance on whatever object they put in front of it, whether or not that object was in the training set.

So what’s the solution?

Unfortunately, I don’t know of any easy solution to this problem, but I have seen some useful strategies. The most obvious is to add an “unknown” class to the training data. The bad news is that doing so opens up a different set of problems:

  • What examples should go into this unknown class? The number of possible natural images is almost unlimited, so how do you choose which ones to include?
  • How many images of each different type of object do you need in the unknown class?
  • What should you do about unknown objects that look very similar to the classes you care about? For example, adding a dog breed that isn’t in the ImageNet 1,000 but looks almost identical to one that is may push a large number of otherwise correct matches into the unknown class.
  • What proportion of your training data should be examples of the unknown class?

That last point actually touches on a larger issue. The predicted values you get from an image classification network are not probabilities. They assume that the probability of seeing a particular class equals the frequency with which that class appeared in the training data. If you try to use an animal classifier that includes penguins in the Amazon jungle, you run into this problem, because (presumably) every penguin prediction will be a false positive. Even in U.S. cities, rare breeds show up in ImageNet’s training data far more often than they do in dog parks, so they will be over-represented as false positives. The usual solution is to work out the prior probabilities of the situations you will face in production, and then use them to apply calibration values to the network’s output so that the results are closer to true probabilities.
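
To make that calibration step concrete, here is a rough sketch of prior-probability correction, not the exact recipe from the post; the two classes, priors, and numbers are invented. The idea is to divide out the class frequencies seen in training, multiply in an estimate of how often each class actually occurs in production, and renormalize:

```python
import numpy as np

def recalibrate(probs, train_priors, deploy_priors):
    """Shift softmax outputs from training-set class frequencies to production ones."""
    adjusted = probs * (deploy_priors / train_priors)
    return adjusted / adjusted.sum()

# Hypothetical two-class example: a rare breed that is common in the training
# data but almost never seen in production.
probs         = np.array([0.60, 0.40])   # network output: [rare_breed, labrador]
train_priors  = np.array([0.50, 0.50])   # the classes were balanced during training
deploy_priors = np.array([0.02, 0.98])   # what you expect to see in a dog park

print(recalibrate(probs, train_priors, deploy_priors))
# -> approximately [0.03, 0.97]; the rare breed no longer wins
```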

The main strategy that helps with the overall problem in real applications is to constrain how the model is used, so that assumptions about which objects will appear match the training data. A simple way to do this is through product design: you can build a user interface that asks users to point their device at the object of interest before running the classifier, much like apps that ask you to photograph a check or another document.

Getting a little more sophisticated, you can train a separate image classifier that tries to identify the conditions under which the main image classifier is not a good fit. This is not the same as adding a single “unknown” class, because it acts more like a cascade, or a filter in front of the detailed model. In the crop-disease case, the operating environment is visually distinctive enough that a model can be trained to distinguish leaves from other, randomly chosen photographs. There are enough similarities that such a gating model should at least be able to tell when it is being shown an unsupported scene. The gating model runs before the full image classifier, and if it doesn’t detect something that looks like a plant, it exits early and pops up an error message indicating that no crop was found.
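
Here is a minimal sketch of that cascade, assuming two hypothetical models: `leaf_gate` answers “does this look like a leaf at all?” and `disease_classifier` is the full crop-disease model. Neither name comes from the PlantVillage app; they stand in for whatever models you train:

```python
LEAF_THRESHOLD = 0.5  # tune on held-out photos of leaves vs. everything else

def diagnose(image, leaf_gate, disease_classifier):
    """Run the cheap gating model first; only run the full classifier on plausible leaves."""
    leaf_score = leaf_gate.predict(image)   # hypothetical: probability the photo shows a leaf
    if leaf_score < LEAF_THRESHOLD:
        # Early exit: show an error instead of inventing a disease for a keyboard.
        return {"ok": False, "message": "No crop found. Please point the camera at a leaf."}
    return {"ok": True, "diagnosis": disease_classifier.predict(image)}
```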

Apps that ask you to photograph a credit card or perform other kinds of OCR often use a combination of on-screen instructions and a model that detects blurring or poor alignment to guide the user toward a photo that can be processed successfully. The “got leaves?” gating model described above is a simple version of this interface pattern.

This may not be a very satisfying set of answers, but it reflects the messy reality of user expectations once you take machine learning beyond constrained research problems. There is a lot of common sense and external knowledge that people use to recognize an object, and none of it is captured in traditional image classification tasks. To achieve results that meet user expectations, we must design a complete system around our models, one that understands the world it will be deployed into and makes sensible decisions based on more than just the model’s raw output.

Source: petewarden.com/2018/07/06/…