This is the 26th day of my participation in the August More Text Challenge

Introduction to Yellowbrick

Yellowbrick is a visual analysis and diagnostic tool that supports machine learning model selection. It extends the scikit-learn API to make the model-tuning phase easier to navigate. In short, Yellowbrick combines scikit-learn with Matplotlib so that we can improve our models through visualization.

Installation

pip install -U yellowbrick

Yellowbrick’s visualizers accept parameters such as color, size, and title so that you can customize a plot. Sometimes, however, you just want to draw a plot with a single line of code, and Yellowbrick offers a number of quick methods that do exactly that.

Some of the popular visualizers for feature analysis, classification, regression, clustering, and target evaluation are shown below.

Feature analysis visualization

Rank1D/Rank2D

By default, Rank1D uses the Shapiro-Wilk algorithm to assess the normality of each feature’s distribution. A bar chart is then drawn showing the relative rank of each feature.

The Shapiro-Wilk test statistically compares the sample distribution with a normal distribution to determine whether the data deviates from or conforms to normality.

from yellowbrick.features import rank1d
from yellowbrick.datasets import load_energy


X, _ = load_energy()
visualizer = rank1d(X, color="r")

By default, the Rank2D visualization tool detects the correlation between two features using Pearson correlation coefficients.

from yellowbrick.features import rank2d
from yellowbrick.datasets import load_credit


X, _ = load_credit()
print(X[['limit', 'sex', 'edu']].head())
visualizer = rank2d(X)
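
To make the Pearson coefficient behind Rank2D concrete, here is a minimal standard-library sketch of the formula. It is not Yellowbrick’s implementation, and the toy columns (limit, doubled, mirrored) are hypothetical values rather than rows from the credit dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy columns, not taken from the credit dataset.
limit = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]    # rises together with `limit`
mirrored = [4.0, 3.0, 2.0, 1.0]   # falls as `limit` rises

r_same = pearson(limit, doubled)
r_opposite = pearson(limit, mirrored)
print(r_same, r_opposite)
```

A value near 1 means two features rise together, a value near -1 means one falls as the other rises, and Rank2D colors each cell of its grid by this kind of pairwise score.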

Parallel Coordinates

Parallel coordinates is a multidimensional feature visualization technique in which the vertical axis is duplicated horizontally for each feature, so each vertical axis shows the values of one feature. Every instance is drawn as a polyline that crosses each axis at its value for that feature, and the color of the line indicates the target value. This allows many dimensions to be visualized at once; in fact, given unlimited horizontal space (such as a scrolling window), it is technically possible to display any number of dimensions.

from sklearn.datasets import load_wine
from yellowbrick.features import parallel_coordinates


X, y = load_wine(return_X_y=True)

print(X.shape,y.shape) #(178, 13) (178,)
visualizer = parallel_coordinates(X, y, normalize="standard")
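
The normalize="standard" argument matters because features on very different scales would otherwise flatten each other on the shared vertical axis. The sketch below assumes that "standard" means the usual z-score scaling (zero mean, unit standard deviation, as scikit-learn’s StandardScaler does); the two feature columns are hypothetical values, not rows from the wine dataset.

```python
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column, mu=mean)
    return [(value - mean) / std for value in column]


# Hypothetical columns on very different scales.
alcohol = [12.0, 13.0, 14.0]
proline = [500.0, 1000.0, 1500.0]

alcohol_z = standardize(alcohol)
proline_z = standardize(proline)
print(alcohol_z)
print(proline_z)
```

After scaling, both columns occupy the same range, so their lines can be compared on one plot.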

Radial Visualization

RadViz is a multivariate data visualization algorithm that places each feature dimension evenly around the circumference of a circle and then plots each instance inside the circle, at a position determined by its normalized values along the axes running from the center to each feature. This mechanism allows as many dimensions as will fit on the circle, greatly expanding the dimensionality of the visualization.

Data scientists use this method to detect separability between classes: is there an opportunity to learn from the feature set, or is there simply too much noise?

If your data contains rows with missing values (numpy.nan), those missing values will not be plotted. In other words, you may not get the full picture of the data. RadViz will issue a DataWarning to tell you the percentage that is missing.

If you receive this warning, you may want to consider an imputation strategy. For details on how to fill in missing values, see the scikit-learn imputation documentation (the old Imputer class has been replaced by SimpleImputer).

from yellowbrick.features import radviz
from yellowbrick.datasets import load_occupancy


X, y = load_occupancy()
visualizer = radviz(X, y, colors=["maroon", "gold"])
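
The placement of a point inside the circle can be sketched as a weighted average of the feature anchors. This follows the textbook RadViz formulation and assumes the row has already been normalized to non-negative values; Yellowbrick’s internals may differ in detail, and radviz_point is a hypothetical helper name.

```python
import math


def radviz_point(row):
    """Project one row of non-negative, normalized values onto the RadViz disc."""
    d = len(row)
    # One anchor per feature, spaced evenly around the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / d), math.sin(2 * math.pi * j / d))
        for j in range(d)
    ]
    total = sum(row)
    if total == 0:
        return (0.0, 0.0)  # no feature pulls the point in any direction
    # Each feature pulls the point toward its anchor in proportion to its value.
    x = sum(value * ax for value, (ax, _) in zip(row, anchors)) / total
    y = sum(value * ay for value, (_, ay) in zip(row, anchors)) / total
    return (x, y)


dominated = radviz_point([1.0, 0.0, 0.0])  # only the first feature is non-zero
balanced = radviz_point([0.5, 0.5, 0.5])   # all features pull equally
print(dominated, balanced)
```

A row dominated by one feature lands on that feature’s anchor, while a row with equal values everywhere lands at the center, which is why well-separated classes show up as distinct regions of the disc.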

PCA

The PCA decomposition visualizer uses principal component analysis to project high-dimensional data into two or three dimensions so that each instance can be plotted on a scatter plot. Using PCA means that the projected dataset can be analyzed along its axes of principal variation and interpreted to determine whether spherical distance metrics can be used.

from yellowbrick.datasets import load_spam
from yellowbrick.features import pca_decomposition

X, y = load_spam()
print(X.shape, y.shape)  # (4600, 57) (4600,)
visualizer = pca_decomposition(X, y)

Manifold

The manifold visualizer uses manifold learning to embed instances described by many dimensions into two dimensions, providing a visualization of high-dimensional data. This makes it possible to create a scatter plot that shows latent structure in the data.

Unlike decomposition methods such as PCA and SVD, manifolds generally use nearest-neighbor approaches to embedding, which allows them to capture nonlinear structures that would otherwise be lost. The resulting projection can then be analyzed for noise or separability to determine whether a decision space can be created in the data.

from sklearn.datasets import load_iris
from yellowbrick.features import manifold_embedding


X, y = load_iris(return_X_y=True)
visualizer = manifold_embedding(X, y)

Classification model visualization

Class Prediction Error

The ClassPredictionError plot shows the support (number of training samples) for each class in the fitted classification model as a stacked bar chart. Each bar is segmented to show the proportion of predictions for each class (including false negatives and false positives, as in a confusion matrix).

You can use ClassPredictionError to visualize which classes your classifier has particular difficulty with and, more importantly, what incorrect answers it gives on a per-class basis. This often gives you a better understanding of the strengths and weaknesses of different models as well as the specific challenges unique to the dataset.

The class prediction error plot provides a way to quickly see how good a classifier is at predicting the correct class.

from yellowbrick.datasets import load_game
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import class_prediction_error


X, y = load_game()
X = OneHotEncoder().fit_transform(X)
visualizer = class_prediction_error(
    RandomForestClassifier(n_estimators=10), X, y
)

Classification Report

The classification report visualizer shows the model’s precision, recall, and F1 score. For easier interpretation and problem detection, the report combines numerical scores with a color-coded heat map. All heat maps are in the range (0.0, 1.0) to make it easy to compare classification models across different classification reports.

from yellowbrick.datasets import load_credit
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import classification_report


X, y = load_credit()
visualizer = classification_report(
    RandomForestClassifier(n_estimators=10), X, y
)
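
The per-class precision, recall, and F1 values that fill such a report can be computed directly from true and predicted labels. The sketch below uses only the standard library; per_class_scores and the toy labels are hypothetical and are not part of Yellowbrick or the credit dataset.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels for illustration only.
y_true = ["default", "default", "default", "paid", "paid", "paid", "paid", "paid"]
y_pred = ["default", "default", "paid", "paid", "paid", "paid", "paid", "default"]

default_scores = per_class_scores(y_true, y_pred, "default")
paid_scores = per_class_scores(y_true, y_pred, "paid")
print(default_scores)
print(paid_scores)
```

Each class contributes one row of three scores, and the heat map simply colors those cells by value.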

Confusion Matrix

The ConfusionMatrix visualizer is a ScoreVisualizer that takes a fitted scikit-learn classifier and a set of test X and y values and returns a report showing how the predicted class of each test value compares with its actual class.

Data scientists use confusion matrices to understand which classes are most easily confused.

from yellowbrick.datasets import load_game
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import RidgeClassifier
from yellowbrick.classifier import confusion_matrix


X, y = load_game()
X = OneHotEncoder().fit_transform(X)
visualizer = confusion_matrix(RidgeClassifier(), X, y, cmap="Greens")
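
The numbers behind a confusion matrix are just counts of (actual, predicted) pairs. Here is a small standard-library sketch with hypothetical labels; rows are actual classes and columns are predicted classes, which is a layout assumption made for this example.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each actual class was predicted as each class."""
    return Counter(zip(y_true, y_pred))


# Hypothetical game outcomes and predictions.
y_true = ["win", "win", "win", "loss", "loss", "draw"]
y_pred = ["win", "win", "loss", "loss", "win", "loss"]

labels = ["draw", "loss", "win"]
counts = confusion_counts(y_true, y_pred)
# Rows: actual class. Columns: predicted class, in the order of `labels`.
matrix = [[counts[(actual, predicted)] for predicted in labels] for actual in labels]

for actual, row in zip(labels, matrix):
    print(f"{actual:>5}: {row}")
```

Off-diagonal cells show which classes are mistaken for which; in this toy data the single draw is predicted as a loss.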

Precision-Recall Curve

The PrecisionRecallCurve shows the tradeoff between a classifier’s precision (a measure of the relevance of results) and recall (a measure of completeness).

For each class, precision is defined as the ratio of true positives to the sum of true and false positives, and recall is the ratio of true positives to the sum of true positives and false negatives.

from sklearn.naive_bayes import GaussianNB
from yellowbrick.datasets import load_occupancy
from yellowbrick.classifier import precision_recall_curve


X, y = load_occupancy()
visualizer = precision_recall_curve(GaussianNB(), X, y)

ROC/AUC

A ROC/AUC plot lets the user visualize the tradeoff between a classifier’s sensitivity and specificity.

The ROC curve is a measure of a classifier’s predictive quality that compares and visualizes the tradeoff between the model’s sensitivity and specificity. When plotted, the ROC curve shows the true positive rate on the Y-axis and the false positive rate on the X-axis for each class. The ideal point is therefore the top left corner of the plot: false positives are zero and true positives are one.

Another measure, the area under the curve (AUC), summarizes the relationship between false positives and true positives in a single number. The higher the AUC, the better the model generally is. However, it is also important to check the “steepness” of the curve, as this describes the maximization of the true positive rate while minimizing the false positive rate.

from yellowbrick.classifier import roc_auc
from yellowbrick.datasets import load_spam
from sklearn.linear_model import LogisticRegression


X, y = load_spam()
visualizer = roc_auc(LogisticRegression(), X, y)
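
To see where the curve and its area come from, the following standard-library sketch builds ROC points by lowering the decision threshold one score at a time and then integrates them with the trapezoidal rule. It assumes binary labels (1 is positive) and no tied scores; the labels and scores are hypothetical, and this is not how Yellowbrick or scikit-learn implement it internally.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, starting at (0, 0) and ending at (1, 1)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit instances from the most to the least confident positive score.
    for label, _ in sorted(zip(y_true, scores), key=lambda pair: -pair[1]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

curve = roc_points(y_true, scores)
area = auc(curve)
print(curve)
print(area)
```

Here 8 of the 9 possible (positive, negative) pairs are ranked correctly, and the area works out to the same fraction, 8/9.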

Discrimination Threshold

The discrimination threshold visualizer plots precision, recall, F1 score, and related metrics against the discrimination threshold of a binary classifier.

The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. Typically this is set to 50%, but the threshold can be adjusted to increase or decrease sensitivity to false positives or to other application factors.

from yellowbrick.classifier import discrimination_threshold
from sklearn.linear_model import LogisticRegression
from yellowbrick.datasets import load_spam

X, y = load_spam()
visualizer = discrimination_threshold(
    LogisticRegression(multi_class="auto", solver="liblinear"), X, y
)
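
The effect of moving the threshold can be sketched with a few hypothetical scores. The helper below is an illustration, not Yellowbrick’s code; it also reports the queue rate, taken here to mean the share of instances flagged as positive, which is an assumption about the metric’s definition.

```python
def threshold_metrics(y_true, scores, threshold):
    """Precision, recall, and queue rate when score >= threshold means positive."""
    predicted = [1 if score >= threshold else 0 for score in scores]
    pairs = list(zip(y_true, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    queue_rate = sum(predicted) / len(predicted)  # share flagged as positive
    return precision, recall, queue_rate


# Hypothetical labels (1 = spam) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

at_default = threshold_metrics(y_true, scores, 0.5)
at_strict = threshold_metrics(y_true, scores, 0.75)
print(at_default)
print(at_strict)
```

Raising the threshold from 0.5 to 0.75 removes the false positive (precision rises) but misses one true positive (recall falls), which is the tradeoff the plot shows across all thresholds.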

Regression model visualization

Residuals Plot

In the context of a regression model, residuals are the differences between the observed value (y) and the predicted value (ŷ) of the target variable, that is, the errors of the predictions.

The residuals plot shows the residuals on the vertical axis against the dependent variable on the horizontal axis, allowing you to detect regions of the target that may be susceptible to more or less error.

from sklearn.linear_model import Ridge
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import residuals_plot


X, y = load_concrete()
visualizer = residuals_plot(
    Ridge(), X, y, train_color="maroon", test_color="gold"
)
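
As a tiny worked example of the definition, the residual of each instance is the observed value minus the predicted value. The numbers below are hypothetical, and pairing each residual with its predicted value is one common choice for the horizontal axis, assumed here for illustration.

```python
# Hypothetical observed targets and model predictions.
y_observed = [3.0, 5.0, 7.5, 9.0]
y_predicted = [2.5, 5.5, 7.0, 10.0]

# Residual = observed - predicted, one per instance.
residuals = [obs - pred for obs, pred in zip(y_observed, y_predicted)]

# Points a residuals plot would draw: (horizontal value, residual).
points = list(zip(y_predicted, residuals))
print(residuals)
print(points)
```

Residuals scattered evenly around zero suggest the model fits equally well across the range; a visible pattern suggests regions where it systematically over- or under-predicts.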

Prediction Error

The prediction error graph shows the actual targets in the data set versus the predicted values generated by our model. This allows us to see how much variance there is in the model.

Data scientists can use this plot to diagnose regression models by comparing the points against the 45-degree line, where the prediction exactly matches the actual value.

from sklearn.linear_model import Lasso
from yellowbrick.datasets import load_bikeshare
from yellowbrick.regressor import prediction_error

X, y = load_bikeshare()
visualizer = prediction_error(Lasso(), X, y)

Cook’s Distance

Cook’s distance is a measure commonly used in statistical analysis to diagnose the presence of anomalous, highly influential data points in a regression analysis.

Instances with high influence may be outliers, and datasets with a large number of highly influential points may not be suitable for linear regression without further processing, such as outlier removal or imputation.

The CooksDistance visualizer shows a stem plot of all instances by index and their associated distance scores, along with a heuristic threshold to quickly show what percentage of the dataset may be unduly affecting an OLS regression model.

from sklearn.datasets import load_diabetes
from yellowbrick.regressor import cooks_distance


X, y = load_diabetes(return_X_y=True)
visualizer = cooks_distance(X, y)
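
For intuition, Cook’s distance can be computed by hand for a one-feature least-squares fit. The sketch below uses the standard formula D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2 with residual e_i, leverage h_i, p fitted parameters, and residual variance s^2. The 4/n cutoff is a common rule of thumb assumed here for illustration, and the data are hypothetical.

```python
def cooks_distances(xs, ys):
    """Cook's distance of each point in a one-feature least-squares fit."""
    n = len(xs)
    p = 2  # fitted parameters: intercept and slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point in simple linear regression.
    leverages = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    return [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical data: a clean line y = 2x, except for the last point.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 40]

distances = cooks_distances(xs, ys)
threshold = 4 / len(xs)  # assumed rule-of-thumb cutoff
flagged = [i for i, d in enumerate(distances) if d > threshold]
print(flagged)
```

Only the last point is flagged: it has both a large residual and high leverage, so removing it would change the fitted line the most.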

Cluster model visualization

Silhouette Scores

The silhouette coefficient is used when the ground truth about the dataset is unknown, and it provides an estimate of the density of the clusters produced by the model. The score is the mean of the silhouette coefficients of all samples, where each sample’s coefficient is the difference between its mean nearest-cluster distance and its mean intra-cluster distance, normalized by the larger of the two. This yields a score between -1 and 1, where 1 indicates highly dense clusters and -1 indicates completely incorrect clustering.

The SilhouetteVisualizer displays the silhouette coefficient of each sample on a per-cluster basis, visualizing which clusters are dense and which are not. This is especially useful for spotting cluster imbalance or for selecting a value for K by comparing multiple visualizers.

from sklearn.cluster import KMeans
from yellowbrick.datasets import load_nfl
from yellowbrick.cluster import silhouette_visualizer

X, y = load_nfl()
visualizer = silhouette_visualizer(KMeans(5, random_state=42), X)
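
The per-sample coefficient can be sketched for one-dimensional points with the standard library. With a as the mean distance to the other members of the sample’s own cluster and b as the mean distance to the nearest other cluster, the coefficient is (b - a) / max(a, b). The helper name and the toy points are hypothetical.

```python
def silhouette(points, labels, i):
    """Silhouette coefficient of sample i, for 1-D points with cluster labels."""

    def mean_distance_to(cluster):
        # Mean distance from sample i to the other members of `cluster`.
        members = [
            p for j, (p, lab) in enumerate(zip(points, labels))
            if lab == cluster and j != i
        ]
        if not members:
            return None
        return sum(abs(points[i] - p) for p in members) / len(members)

    a = mean_distance_to(labels[i])  # mean intra-cluster distance
    if a is None:
        return 0.0  # convention for a cluster with a single sample
    # Mean distance to the nearest other cluster.
    b = min(mean_distance_to(c) for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)


# Two tight, well separated clusters: sample 0 is well placed.
well_placed = silhouette([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1], 0)

# The point 9.0 is labeled with the far cluster: sample 2 is misplaced.
misplaced = silhouette([0.0, 1.0, 9.0, 10.0, 11.0], [0, 0, 0, 1, 1], 2)
print(well_placed, misplaced)
```

The well-placed sample scores close to 1, while the misplaced one scores below zero, which is the kind of contrast the per-cluster bars make visible.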

Intercluster Distance

The intercluster distance map shows an embedding of the cluster centers in two dimensions that preserves the distances to the other centers.

That is, the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric.

By default, they are sized by membership, that is, the number of instances belonging to each center. This gives a sense of the relative importance of the clusters.

Note:

Just because two clusters overlap in 2D space does not mean that they overlap in the original feature space.

from yellowbrick.datasets import load_nfl
from sklearn.cluster import MiniBatchKMeans
from yellowbrick.cluster import intercluster_distance


X, y = load_nfl()
visualizer = intercluster_distance(MiniBatchKMeans(5, random_state=777), X)

Target value analysis

ClassBalance

One of the biggest challenges with classification models is class imbalance in the training data. A severe class imbalance may be masked by relatively good F1 and accuracy scores: the classifier is simply guessing the majority class rather than making any real evaluation of the underrepresented classes.

There are several techniques for dealing with class imbalance, such as stratified sampling, downsampling the majority class, and weighting.

Before taking any of these actions, however, it is important to understand what the class balance in the training data actually is.

The ClassBalance visualizer supports this by creating a bar chart of the support for each class, that is, how often each class is represented in the dataset.

from yellowbrick.datasets import load_game
from yellowbrick.target import class_balance

X, y = load_game()
visualizer = class_balance(y, labels=["draw", "loss", "win"])
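
The way imbalance can hide behind a good-looking score is easy to reproduce with a few lines of standard-library Python. The label counts below are hypothetical, and the "model" is a deliberately naive baseline that always predicts the majority class.

```python
from collections import Counter

# Hypothetical, heavily imbalanced target.
y = ["loss"] * 90 + ["win"] * 8 + ["draw"] * 2

# Support per class: the bar heights a class balance chart would show.
support = Counter(y)
majority_label, _ = support.most_common(1)[0]

# Naive baseline: always predict the majority class.
predictions = [majority_label] * len(y)
accuracy = sum(t == p for t, p in zip(y, predictions)) / len(y)

# Recall on a minority class exposes what accuracy hides.
win_hits = sum(1 for t, p in zip(y, predictions) if t == "win" and p == "win")
win_recall = win_hits / support["win"]

print(support)
print(accuracy, win_recall)
```

The baseline reaches 90% accuracy without ever predicting a win or a draw, which is why checking the class balance first is worthwhile.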

Conclusion

Yellowbrick is a toolkit that makes machine learning modeling much easier. First, it takes care of visualization during feature engineering and modeling, which greatly simplifies the workflow. Second, the variety of visualizations can reveal blind spots in our own modeling. This article covers only the tip of the Yellowbrick iceberg; see the official website for more details.

References

  • yellowbrick source code
  • yellowbrick oneliners