By Chris Albon

Translator: Flying Dragon

License: CC BY-NC-SA 4.0

Dimensionality reduction on sparse feature matrices

# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
import numpy as np

# Load the data
digits = datasets.load_digits()

# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)

# Make a sparse matrix
X_sparse = csr_matrix(X)

# Create a TSVD with 10 components
tsvd = TruncatedSVD(n_components=10)

# Conduct TSVD on the sparse matrix
X_sparse_tsvd = tsvd.fit(X_sparse).transform(X_sparse)

# Show results
print('Original number of features:', X_sparse.shape[1])
print('Reduced number of features:', X_sparse_tsvd.shape[1])

'''
Original number of features: 64
Reduced number of features: 10
'''

# Sum of the first three components' explained variance ratios
tsvd.explained_variance_ratio_[0:3].sum()  # 0.30039385372588506

Dimensionality reduction with kernel PCA

# Load libraries
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

# Create linearly inseparable data
X, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Apply kernel PCA with an RBF kernel
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
X_kpca = kpca.fit_transform(X)

print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kpca.shape[1])

'''
Original number of features: 2
Reduced number of features: 1
'''

Dimensionality reduction with PCA

# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

# Load the data
digits = datasets.load_digits()

# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)

# Create a PCA that retains 99% of the variance
pca = PCA(n_components=0.99, whiten=True)

# Conduct PCA
X_pca = pca.fit_transform(X)

# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_pca.shape[1])

'''
Original number of features: 64
Reduced number of features: 54
'''

PCA feature extraction

Principal component analysis (PCA) is a common feature extraction method in data science. Technically, PCA finds the eigenvectors of the covariance matrix with the highest eigenvalues and then uses those eigenvectors to project the data onto a new subspace of equal or smaller dimension. In practice, PCA transforms a matrix of n features into a new dataset of (probably) fewer than n features. That is, it reduces the number of features by constructing a smaller set of new variables that capture a significant portion of the information found in the original features. However, the purpose of this tutorial is not to explain the concept of PCA, which is done very well elsewhere, but to demonstrate PCA in practical use.
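Before moving to scikit-learn, the mechanism described above can be illustrated directly with numpy: center the data, take the eigenvectors of the covariance matrix with the largest eigenvalues, and project onto them. This sketch is not part of the original recipe, and the toy matrix X_small is made up purely for illustration.

import numpy as np

# Toy data: 5 observations, 3 features (made up for illustration)
X_small = np.array([[2.0, 0.5, 1.0],
                    [1.5, 0.8, 0.9],
                    [3.1, 1.1, 2.0],
                    [2.4, 0.9, 1.4],
                    [0.9, 0.3, 0.5]])

# Center the data
X_centered = X_small - X_small.mean(axis=0)

# Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)

# Eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort eigenvectors by descending eigenvalue and keep the top 2
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]

# Project the data onto the new 2D subspace
X_projected = X_centered @ components

print(X_projected.shape)  # (5, 2)

The rest of this section uses scikit-learn's implementation on the breast cancer dataset.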

# Load libraries
import numpy as np
from sklearn import decomposition, datasets
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset
dataset = datasets.load_breast_cancer()

# Load the features
X = dataset.data

Note that the raw data contained 569 observations and 30 features.

X.shape  # (569, 30)

Here’s what the data looks like

# View the data
X

'''
array([[1.79900000e+01, 1.03800000e+01, 1.22800000e+02, ...],
       [2.05700000e+01, 1.77700000e+01, 1.32900000e+02, ...],
       [1.96900000e+01, 2.12500000e+01, 1.30000000e+02, ...],
       ...,
       [7.76000000e+00, 2.45400000e+01, 4.79200000e+01, ...]])
'''

# Create a scaler object and standardize the features
sc = StandardScaler()
X_std = sc.fit_transform(X)

Note that PCA takes a parameter, the number of components. This is the number of output features, and it needs to be tuned.

# Create a PCA object with two components
pca = decomposition.PCA(n_components=2)

# Fit and transform the standardized data
X_std_pca = pca.fit_transform(X_std)

After PCA, the new data has been reduced to two features, with the same number of rows as the original data.

# View the new data
X_std_pca

'''
array([[ 9.19283683,  1.94858307],
       [ 2.3878018 , -3.76817174],
       [ 5.73389628, -1.0751738 ],
       ...,
       [ 1.25617928, -1.90229671],
       [10.37479406,  1.67201011],
       [-5.4752433 ,  0.67063679]])
'''
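As noted above, n_components is the knob that needs tuning. A quick way to judge whether two components were enough is to check how much of the total variance they retain. This short follow-up is a sketch that assumes the pca object and X_std from the blocks above; the printed values are not reproduced from the original output.

# Shape after PCA: still 569 rows, now 2 features
print(X_std_pca.shape)

# Fraction of total variance captured by each retained component
print(pca.explained_variance_ratio_)

# Total variance retained by the two components
print(pca.explained_variance_ratio_.sum())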

Grouping observations using KMeans clustering

# Load libraries
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import pandas as pd

# Make a simulated feature matrix
X, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)

# Create a DataFrame
df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])

# Create a KMeans clusterer and fit it
clusterer = KMeans(n_clusters=3, random_state=1)
clusterer.fit(X)

'''
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)
'''

# Predict the group of each observation and store it as a feature
df['group'] = clusterer.predict(X)

# Show the first five observations
df.head()
feature_1 feature_2 group
0 9.877554 3.336145 0
1 7.287210 8.353986 2
2 6.943061 7.023744 2
3 7.440167 8.791959 2
4 6.641388 8.075888 2
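To inspect what the clusterer actually learned, you can look at the fitted cluster centers and assign a brand-new observation to one of the groups. This is a sketch assuming the clusterer object from the block above; the point new_observation is made up for illustration.

import numpy as np

# Coordinates of the three learned cluster centers, one row per cluster
print(clusterer.cluster_centers_)

# Assign a new, unseen observation to the nearest learned cluster
new_observation = np.array([[7.0, 8.0]])
print(clusterer.predict(new_observation))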

Choosing the best number of components for LDA

In scikit-learn, LDA is implemented with LinearDiscriminantAnalysis, which has a parameter n_components indicating the number of features we want returned. To figure out what value to use for n_components (for example, how many features to keep), we can take advantage of the fact that explained_variance_ratio_ tells us the variance explained by each output feature and is a sorted array.

Specifically, we can run LinearDiscriminantAnalysis with n_components set to None to return the ratio of variance explained by every component feature, and then calculate how many components are required to get above some threshold of explained variance (often 0.95 or 0.99).

# Load libraries
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris flower dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create and run an LDA
lda = LinearDiscriminantAnalysis(n_components=None)
X_lda = lda.fit(X, y)

# Create an array of explained variance ratios
lda_var_ratios = lda.explained_variance_ratio_

def select_n_components(var_ratio, goal_var: float) -> int:
    # Set the initial variance explained so far
    total_variance = 0.0

    # Set the initial number of components
    n_components = 0

    # For the explained variance of each component:
    for explained_variance in var_ratio:

        # Add the explained variance to the total
        total_variance += explained_variance

        # Add one to the number of components
        n_components += 1

        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break

    # Return the number of components
    return n_components

# Run the function
select_n_components(lda_var_ratios, 0.95)  # 1

Choosing the best number of components for TSVD

# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
import numpy as np

# Load the data
digits = datasets.load_digits()

# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)

# Make a sparse matrix
X_sparse = csr_matrix(X)

# Create and run a TSVD with one fewer component than the number of features
tsvd = TruncatedSVD(n_components=X_sparse.shape[1] - 1)
X_tsvd = tsvd.fit(X)

# List of explained variance ratios
tsvd_var_ratios = tsvd.explained_variance_ratio_

def select_n_components(var_ratio, goal_var: float) -> int:
    # Set the initial variance explained so far
    total_variance = 0.0

    # Set the initial number of components
    n_components = 0

    # For the explained variance of each component:
    for explained_variance in var_ratio:

        # Add the explained variance to the total
        total_variance += explained_variance

        # Add one to the number of components
        n_components += 1

        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break

    # Return the number of components
    return n_components

# Run the function
select_n_components(tsvd_var_ratios, 0.95)  # 40

Dimensionality reduction with LDA

# Load libraries
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris flower dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create an LDA that will reduce the data down to one feature
lda = LinearDiscriminantAnalysis(n_components=1)

# Run the LDA and use it to transform the features
X_lda = lda.fit(X, y).transform(X)

# Print the number of features
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_lda.shape[1])

'''
Original number of features: 4
Reduced number of features: 1
'''

# View the ratio of explained variance
lda.explained_variance_ratio_  # array([0.99147248])