
Scikit-learn 0.22 has been released. I took a look: besides fixing a number of old bugs, this version also adds many new features and is noticeably more user-friendly. Let me share some of the major new features I've learned about.

▍ sklearn.ensemble module

1. Model stacking

The old ensemble learning module offered only advanced models such as gradient boosting trees and random forests. The new version adds stacking models, StackingClassifier and StackingRegressor, for classification and regression. Model stacking used to be something you had to hand-roll; now you can call the built-in method directly, which is much more convenient, especially in Kaggle competitions, where model stacking is a top weapon.

Here is an example of its use.

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(),
                          LinearSVC(random_state=42)))
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
clf.fit(X_train, y_train).score(X_test, y_test)
0.9473684210526315
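
StackingRegressor works the same way for regression. Below is a minimal sketch on the diabetes dataset (my own example, not from the release notes, assuming the same API):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor, StackingRegressor

X, y = load_diabetes(return_X_y=True)
estimators = [
    ('ridge', RidgeCV()),
    ('svr', LinearSVR(random_state=42))
]
# The final estimator is trained on the base estimators' out-of-fold predictions.
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=RandomForestRegressor(n_estimators=10, random_state=42)
)
reg.fit(X, y)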

2. Native support for missing values in gradient boosting

ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor now have native support for missing values (NaNs), so you no longer need to impute missing data before training or prediction; they run directly.

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
import numpy as np

X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(gbdt.predict(X))
[0 0 1 1]
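
HistGradientBoostingRegressor handles NaNs the same way. A quick sketch on toy data of my own (assuming the same API):

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np

# The NaN sample needs no imputation before fit or predict.
X = np.array([0, 1, 2, 3, np.nan]).reshape(-1, 1)
y = [0.0, 0.1, 0.2, 0.3, 0.4]

gbdt = HistGradientBoostingRegressor(min_samples_leaf=1).fit(X, y)
print(gbdt.predict(X))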

▍ sklearn.impute module

impute.KNNImputer has been added to the sklearn.impute module, so when we need to fill in missing values, we can use a KNN-based algorithm directly.

import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
[[1.  2.  4. ]
 [3.  4.  3. ]
 [5.5 6.  5. ]
 [8.  8.  7. ]]
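
By default the neighbors are weighted uniformly. KNNImputer also accepts weights='distance' to weight them by inverse distance (a usage sketch continuing the example above):

# Closer neighbors now contribute more to the imputed value.
imputer = KNNImputer(n_neighbors=2, weights='distance')
print(imputer.fit_transform(X))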

▍ sklearn.inspection module

inspection.permutation_importance has been added, which can be used to estimate the importance of each feature.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

X, y = make_classification(random_state=0, n_features=5, n_informative=3)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0,
                                n_jobs=-1)

fig, ax = plt.subplots()
sorted_idx = result.importances_mean.argsort()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=range(X.shape[1]))
ax.set_title("Permutation Importance of each feature")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()
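If you want the raw numbers instead of a plot, the returned object exposes the statistics directly (a small usage sketch continuing the example above):

# importances_mean / importances_std summarize the n_repeats shuffles.
for i in sorted_idx[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}"
          f" +/- {result.importances_std[i]:.3f}")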

▍ sklearn.metrics module

The new version adds a handy function, metrics.plot_roc_curve, which takes the pain out of plotting ROC curves. Previously you had to roll your own from the ROC/AUC definitions; mature ready-made code does circulate online, but from now on you can simply use this function directly (see the plotting API example below).

Also, the roc_auc_score function can now be used for multi-class classification. Two averaging strategies are currently supported: the one-vs-one (OvO) algorithm computes the average of the pairwise ROC AUC scores, and the one-vs-rest (OvR) algorithm computes the average score of each class against all the other classes. In both cases, the multi-class ROC AUC score is computed from the model's estimates of the probability that a sample belongs to a particular class. Both OvO and OvR support uniform weighting (average='macro') and weighting by prevalence (average='weighted').

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_classes=4, n_informative=16)
clf = SVC(decision_function_shape='ovo', probability=True).fit(X, y)
print(roc_auc_score(y, clf.predict_proba(X), multi_class='ovo'))
0.9957333333333332
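
The one-vs-rest strategy with prevalence weighting is just a parameter change (continuing the example above):

# Average each class-vs-rest AUC, weighted by class prevalence.
print(roc_auc_score(y, clf.predict_proba(X),
                    multi_class='ovr', average='weighted'))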


▍ The new plotting API

Scikit-learn has introduced a new plotting API for creating visualizations. It lets you quickly adjust the look of a plot without recomputing anything, and you can draw different charts on the same figure. For example:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import plot_roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svc = SVC(random_state=42)
svc.fit(X_train, y_train)
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)

svc_disp = plot_roc_curve(svc, X_test, y_test)
rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=svc_disp.ax_)
rfc_disp.figure_.suptitle("ROC curve comparison")

plt.show()

▍ Precomputed sparse nearest neighbor graphs

Most estimators based on nearest neighbor graphs now accept precomputed sparse graphs as input, so the same graph can be reused across multiple estimator fits.

To use this feature in a pipeline, you can use the memory parameter together with one of neighbors.KNeighborsTransformer or neighbors.RadiusNeighborsTransformer.

The precomputed graph can also be produced by a custom estimator.

from tempfile import TemporaryDirectory
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsTransformer
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline

X, y = make_classification(random_state=0)

with TemporaryDirectory(prefix="sklearn_cache_") as tmpdir:
    estimator = make_pipeline(
        KNeighborsTransformer(n_neighbors=10, mode='distance'),
        Isomap(n_neighbors=10, metric='precomputed'),
        memory=tmpdir)
    estimator.fit(X)

    # We can decrease the number of neighbors and the graph will not be
    # recomputed.
    estimator.set_params(isomap__n_neighbors=5)
    estimator.fit(X)
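Equivalently, you can fit the transformer yourself and reuse its sparse graph across estimators (a minimal sketch unrolling the pipeline above):

# Compute the sparse distance graph once...
transformer = KNeighborsTransformer(n_neighbors=10, mode='distance')
graph = transformer.fit_transform(X)

# ...then pass it to any estimator that accepts metric='precomputed'.
isomap = Isomap(n_neighbors=10, metric='precomputed')
embedding = isomap.fit_transform(graph)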

Those are the main updates I learned about this time; see the release notes for more details.

Link: scikit-learn.org/dev/whats_n…

▍ Installation

Upgrading is simple and takes a single command.

pip install --upgrade scikit-learn

Or use conda:

conda install scikit-learn
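
Afterwards you can confirm the installed version (sklearn.__version__ is the standard attribute for this):

import sklearn
print(sklearn.__version__)  # should print 0.22 or later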
