From the Google Blog

Author: James Wexler et al.

Compiled by Heart of the Machine

Contributors: Huang Xiaotian, Li Zinan

In support of the PAIR initiative, Google has released Facets, an open source visualization tool that helps you understand, analyze, and debug ML datasets. Facets consists of two parts, Facets Overview and Facets Dive, which let users view their data at different levels of granularity. Facets can be used inside Jupyter notebooks or embedded in web pages. In addition to open-sourcing the Facets code, Google has also created a demo website. The GitHub repository and the demo site are linked below.

  • GitHub: https://github.com/pair-code/facets

  • Demo website: https://pair-code.github.io/facets/

Getting the best results from a machine learning (ML) model requires a real understanding of the data. However, an ML dataset can contain millions of data points, each with hundreds (or even thousands) of features, which makes it nearly impossible to understand the entire dataset intuitively. Visualization helps unlock the structure of such large datasets. A picture may be worth a thousand words, but an interactive visualization can be worth even more.

In support of the PAIR initiative, we are releasing Facets, an open source visualization tool that helps you understand and analyze ML datasets. Facets consists of two parts, Facets Overview and Facets Dive, which let users view their data at different levels of granularity. Use Facets Overview to get a sense of the shape of each feature of the data, or Facets Dive to explore individual observations. These visualizations let you debug your data, which in machine learning is just as important as debugging your model. They can be used inside Jupyter notebooks or embedded in web pages. In addition to open-sourcing the code, we have created a Facets demo website that lets anyone visualize their own datasets directly in the browser, with no software installation or setup and without the data ever leaving their computer.

Facets Overview

Facets Overview automatically gives users a quick understanding of the distribution of values across all features of their datasets. Multiple datasets, such as a training set and a test set, can be compared in the same visualization. Common data issues that can hamper machine learning are pushed to the forefront, such as unexpected feature values, features with a high percentage of missing values, features with unbalanced distributions, and feature distribution skew between datasets.



Facets Overview visualization of the six numeric features of the University of California, Irvine (UCI) Census datasets [1].

The features are sorted by non-uniformity, with the most non-uniformly distributed feature at the top. Numbers in red indicate possible trouble spots, in this case numeric features with a high percentage of values set to 0. The histograms on the right let you compare the distributions of the training set (blue) and the test set (orange).
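To make the idea of "a high percentage of values set to 0" concrete, here is a minimal sketch in plain Python. It does not use the Facets API. The function name, the sample rows, and the 0.5 threshold are all assumptions chosen for illustration, not values taken from Facets itself.

```python
def zero_and_missing_fractions(rows, feature):
    """Return (fraction_missing, fraction_zero) for one numeric feature.

    `rows` is a list of dicts; a value of None (or an absent key) counts
    as missing. This helper is hypothetical, not part of Facets.
    """
    if not rows:
        return 0.0, 0.0
    values = [row.get(feature) for row in rows]
    total = len(values)
    missing = sum(1 for v in values if v is None)
    zeros = sum(1 for v in values if v == 0)
    return missing / total, zeros / total


# Tiny made-up sample shaped like census records.
rows = [
    {"age": 39, "capital_gain": 2174},
    {"age": 50, "capital_gain": 0},
    {"age": 38, "capital_gain": 0},
    {"age": 53, "capital_gain": None},
]

missing_frac, zero_frac = zero_and_missing_fractions(rows, "capital_gain")

# Assumed threshold: flag the feature when at least half its values are 0.
ZERO_THRESHOLD = 0.5
flagged = zero_frac >= ZERO_THRESHOLD
```

Running the same check over every feature of both the training and test sets gives a rough, text-only approximation of the red warnings in the table.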



Facets Overview visualization showing two of the nine categorical features of the UCI Census datasets.

The features are sorted by distribution distance, with the feature that has the largest skew between the training set (blue) and the test set (orange) at the top. Note that the label values of the “target” feature differ between the training and test sets because of a trailing period in the test set (“<=50K” vs “<=50K.”). This can be seen in the chart for the feature as well as in the entries in the “top” column of the table. With this label mismatch, a model trained and tested on this data would not be evaluated correctly.
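The trailing-period mismatch described above can also be caught programmatically. The following is a minimal sketch in plain Python, independent of Facets. The helper names and the sample label lists are hypothetical, and stripping a trailing period is an assumed fix that suits this particular dataset.

```python
def label_mismatch(train_labels, test_labels):
    """Return label values that appear in only one of the two datasets."""
    train_set, test_set = set(train_labels), set(test_labels)
    only_in_train = sorted(train_set - test_set)
    only_in_test = sorted(test_set - train_set)
    return only_in_train, only_in_test


def normalize_label(label):
    """Strip surrounding whitespace and any trailing periods from a label."""
    return label.strip().rstrip(".")


# Made-up samples that mimic the mismatch in the census "target" feature.
train = ["<=50K", ">50K", "<=50K"]
test = ["<=50K.", ">50K.", "<=50K."]

only_train, only_test = label_mismatch(train, test)

# After normalization the two label vocabularies agree.
fixed_test = [normalize_label(label) for label in test]
remaining = label_mismatch(train, fixed_test)
```

Here `only_train` and `only_test` are both non-empty before the fix, which is the signal that evaluation on the raw test labels would be unreliable.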

Facets Dive

Facets Dive provides an easy-to-customize, intuitive interface for exploring the relationships between data points across the different features of a dataset. With Facets Dive, you control the position, color, and visual representation of each data point based on its feature values. If the data points have images associated with them, the images can be used as the visual representation.

Facets Dive visualization showing all 16,281 data points in the UCI Census test dataset.

The animation shows the data points being colored by one feature (“relationship”), then faceted in one dimension by the continuous feature “age” and in another dimension by the discrete feature “marital status”.



Facets Dive visualization of a large number of face drawings from the “Quick, Draw!” dataset, showing the relationship between the number of strokes and points in each drawing and whether the “Quick, Draw!” classifier correctly categorized it as a face.

Quick, Draw! dataset: https://github.com/googlecreativelab/quickdraw-dataset

Fun fact: in large datasets such as CIFAR-10 [2], a small human labeling error can easily go unnoticed. We inspected the CIFAR-10 dataset with Dive and caught a "frog-cat": an image of a frog that had been labeled as a cat.

Exploring the CIFAR-10 dataset with Facets Dive. Here the ground-truth labels are faceted by row and the predicted labels by column.

This arrangement produces a confusion matrix view in which we can drill into particular kinds of misclassification. In the example above, the ML model incorrectly labels a small percentage of true cats as frogs. Placing the actual images in the confusion matrix reveals something interesting: one of the "true cats" that the model predicted to be a frog really is a frog on visual inspection. In other words, this was not a genuine model error but an incorrectly labeled example in the dataset.
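The row-by-column arrangement described above is a confusion matrix. As a rough, non-visual sketch of the same idea, the plain Python below counts (true label, predicted label) pairs and lists the indices of the "cat predicted as frog" cell so those examples can be inspected by hand. The function name and the label lists are made up for illustration and are not CIFAR-10 results.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


# Hypothetical labels for five images.
true_labels = ["cat", "cat", "frog", "cat", "frog"]
predicted = ["cat", "frog", "frog", "frog", "frog"]

counts = confusion_counts(true_labels, predicted)

# Indices of images labeled "cat" that the model called "frog".
# These are the candidates to inspect visually for labeling errors.
suspects = [
    i
    for i, (t, p) in enumerate(zip(true_labels, predicted))
    if t == "cat" and p == "frog"
]
```

Looking at the images behind `suspects` is the manual equivalent of what Facets Dive shows directly: an off-diagonal cell may contain mislabeled data rather than model mistakes.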



Can you spot the frog-cat?

Facets has proven very valuable inside Google, and Google is now sharing these visualizations with the world in the hope that they help people discover new and interesting things about their data and, as a result, build more powerful and accurate machine learning models. Because Facets is open source, you can customize the visualizations for your own needs or contribute to the project.


References

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/datasets/Census+Income]. Irvine, CA: University of California, School of Information and Computer Science.

[2] Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Original link: https://research.googleblog.com/2017/07/facets-open-source-visualization-tool.html
