Discovering the unknown unknowns in machine learning

The performance of a machine learning (ML) model depends both on the learning algorithm and on the data used for training and evaluation. The role of algorithms is well studied and is the focus of many benchmark challenges, such as SQuAD, GLUE, and ImageNet. There are also efforts to improve the data, including a series of workshops addressing issues in ML evaluation. In contrast, it is far less common to focus on the data used to evaluate ML models. Moreover, many evaluation datasets contain items that are easy to assess, such as photos with clearly recognizable subjects, so they miss the natural ambiguity of real-world contexts. The absence of ambiguous real-world examples in evaluation sets undermines the ability to reliably test ML performance, which leaves ML models vulnerable to "weaknesses": categories of examples that are difficult or impossible for a model to handle accurately because that category is missing from the evaluation set.

To address the problem of identifying these weaknesses in ML models, we recently launched the Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML) Data Challenge at HCOMP 2020 (open to researchers and developers worldwide until April 30, 2021). The goal of the challenge is to raise the bar for ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. CATS4ML relies on people's abilities and intuition to spot new examples that ML models are confident about but actually misclassify.

What are machine learning “weaknesses”?

There are two kinds of weaknesses: known unknowns and unknown unknowns. Known unknowns are examples for which a model is uncertain about the correct classification. The research community continues to study this in a field called active learning, and has found solutions that, in a nutshell, interactively solicit new labels from people for uncertain examples. For example, if a model is not sure whether the subject of a photo is a cat, a person is asked to verify it; but if the system is certain, a person is not asked. While there is room for improvement in this area, the good news is that the model's confidence is correlated with its performance, i.e., one can see what the model doesn't know.
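To make the "known unknowns" loop concrete, here is a minimal sketch of uncertainty sampling, the simplest form of active learning described above. The classifier, the seed data, and the pool are placeholders for illustration; any model exposing predict_proba would work.

```python
# Minimal sketch of uncertainty sampling: ask humans to label only the
# examples the model is least confident about.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_uncertain(model, unlabeled_pool, k=10):
    """Return indices of the k examples the model is least confident about."""
    probs = model.predict_proba(unlabeled_pool)   # shape: (n_examples, n_classes)
    confidence = probs.max(axis=1)                # confidence = top-class probability
    return np.argsort(confidence)[:k]             # lowest confidence first

# Toy usage: train on a small labeled seed set, then route the most
# uncertain pool examples to human labelers.
rng = np.random.default_rng(0)
X_seed, y_seed = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
X_pool = rng.normal(size=(1000, 5))

model = RandomForestClassifier().fit(X_seed, y_seed)
ask_humans = select_uncertain(model, X_pool, k=10)  # indices to send for human labeling
```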

Unknown unknowns, on the other hand, are examples where the model is confident about its answer but is actually wrong. Proactive efforts to discover unknown unknowns (e.g. Attenberg 2015 and Crawford 2019) have helped uncover many unexpected machine behaviors. In contrast to such human-driven discovery, generative adversarial networks (GANs) generate unknown unknowns for image recognition models in the form of optical illusions for computers, perturbations imperceptible to humans that cause deep learning models to make mistakes. While GANs uncover model exploits under intentional manipulation, real-world examples better highlight a model's failures in everyday operation. These real-world examples are the unknown unknowns that CATS4ML is interested in: the challenge is to collect unmanipulated examples that humans can reliably interpret but on which many ML models confidently disagree.
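In code, an unknown unknown is simply a prediction that is both highly confident and wrong. The sketch below shows one way to surface such examples on a labeled evaluation set; the model, data, and the 0.9 threshold are assumptions for illustration, not part of the challenge tooling.

```python
# Hedged sketch: find evaluation examples the model gets wrong while being
# highly confident, i.e. candidate "unknown unknowns".
import numpy as np

def find_unknown_unknowns(model, X_eval, y_true, confidence_threshold=0.9):
    """Indices of evaluation examples misclassified with high confidence."""
    probs = model.predict_proba(X_eval)           # per-class probabilities
    predictions = probs.argmax(axis=1)            # model's top guess
    confidence = probs.max(axis=1)                # how sure the model is
    confident = confidence >= confidence_threshold
    wrong = predictions != np.asarray(y_true)
    return np.flatnonzero(confident & wrong)
```

Challenge participants are effectively doing this search by hand: finding real images for which their own judgment plays the role of y_true and the benchmark models are confidently mistaken.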

First edition of the CATS4ML Data Challenge: Open Images Dataset

The CATS4ML Data Challenge focuses on visual recognition, using images and labels from the Open Images Dataset. The target images for the challenge were selected from the Open Images Dataset, along with a set of 24 target labels from the same dataset. Challenge participants are invited to invent new and creative ways to explore this existing, publicly available dataset and, focusing on the pre-selected list of target labels, to discover examples that are unknown unknowns for ML models.
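As a starting point for that exploration, one might filter the Open Images image-level annotations down to the challenge's target labels. The file name, column names, and the example label ID below are assumptions about how the Open Images metadata is distributed, so adjust them to the files you actually download from the challenge or dataset websites.

```python
# Hedged sketch: restrict Open Images image-level annotations to the
# challenge's target labels. File and column names are assumptions.
import pandas as pd

TARGET_LABELS = {"/m/01g317"}  # placeholder: fill in the 24 target label IDs from the challenge

annotations = pd.read_csv("oidv6-train-annotations-human-imagelabels.csv")  # assumed file name
subset = annotations[annotations["LabelName"].isin(TARGET_LABELS)]
print(subset[["ImageID", "LabelName", "Confidence"]].head())
```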

CATS4ML complements FAIR's recently launched DynaBench research platform for dynamic data collection. While DynaBench tackles the problems of static benchmarks with ML models in the loop during data collection, CATS4ML focuses on improving ML evaluation datasets by encouraging the exploration of adverse examples that existing ML benchmarks may miss. The results will help detect and avoid future errors, and will also provide insights into model interpretability.

In this way, CATS4ML aims to raise awareness of the problem by providing dataset resources that developers can use to uncover weaknesses in their algorithms. It will also inform researchers on how to create more balanced, diverse, and socially aware benchmark datasets for machine learning.

How to participate

We invite the global community of ML researchers and practitioners to join our efforts to discover interesting, difficult examples from the Open Images Dataset. Register on the challenge website, download the target images and label data, contribute the images you find, and compete to be among the winners!

To score points in the competition, participants submit a set of image-label pairs. Each pair is then verified by human-in-the-loop raters, and it scores when the raters' votes disagree with the average machine score for that label across a number of ML models.
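The scoring idea can be illustrated with a small sketch: a submitted image-label pair earns a point when the human raters' verdict disagrees with the average score that several ML models assign to that label. The vote format, the 0.5 threshold, and the function name are assumptions for illustration, not the official challenge rules.

```python
# Hypothetical illustration of disagreement-based scoring; not the official rules.
from statistics import mean

def disagreement_score(human_votes, model_scores, threshold=0.5):
    """human_votes: 0/1 rater verdicts that the label applies to the image.
    model_scores: per-model probabilities that the label applies."""
    human_verdict = mean(human_votes) >= threshold     # majority of raters say "yes"
    machine_verdict = mean(model_scores) >= threshold  # average model score says "yes"
    return int(human_verdict != machine_verdict)       # 1 point if they disagree

# Example: raters agree the label applies, but the models on average do not.
print(disagreement_score([1, 1, 1, 0], [0.2, 0.35, 0.1]))  # -> 1
```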

The challenge is open to researchers and developers worldwide until April 30, 2021. To learn more about CATS4ML and how to join, visit our website.

