Contents

1. Get the data set

2. Visualization of data sets

3. Dimensionality reduction and visualization

3.1 Random projection dimension reduction

3.2 PCA dimension reduction

3.3 Truncated SVD dimension reduction

3.4 LDA dimension reduction

3.5 MDS dimension reduction

3.6 Isomap dimension reduction

3.7 LLE dimension reduction

3.7.1 Standard LLE

3.7.2 Modified LLE

3.7.3 Hessian LLE

3.7.4 LTSA

3.8 t-SNE dimension reduction

3.9 RandomTrees dimensionality reduction

3.10 Spectral embedding dimension reduction

4. Summary


A single image can be regarded as a point in a high-dimensional space, so an image dataset is a collection of such points. Dimensionality reduction in machine learning refers to using some mapping method to map data points from a high-dimensional space to a low-dimensional one. Its essence is learning a mapping function f: X -> Y, where X is the original representation of a data point (currently most often a vector) and Y is the low-dimensional representation the point is mapped to; the dimension of Y is usually smaller than that of X (although raising the dimension is also possible). f may be explicit or implicit, linear or nonlinear.
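To make the mapping idea concrete, here is a minimal illustrative sketch (the matrix W is arbitrary, not one of the methods used later): an explicit linear map f(x) = Wx taking a 64-dimensional point to 2 dimensions.

import numpy as np

rng = np.random.RandomState(0)
W = rng.normal(size=(2, 64))   # an arbitrary, illustrative mapping matrix
x = rng.rand(64)               # one data point in the original 64-d space
y_low = W @ x                  # its 2-d representation y = f(x) = Wx
print(y_low.shape)             # (2,)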

This project uses the MNIST handwritten digits data to walk through dimensionality reduction of an image dataset by hand.

The MNIST dataset, from the National Institute of Standards and Technology (NIST), is an entry-level computer vision dataset. It is made up of 60,000 training images and 10,000 test images of handwritten digits from 0 to 9, written half by American high school students and half by Census Bureau employees. The digit images have been preprocessed into black and white, size-normalized (28 x 28 pixels), and centered. A sample of the MNIST dataset is shown in the figure below:
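The full 70,000-image MNIST dataset is not used in this case, but for reference it could be fetched with a recent scikit-learn (0.20 or later) roughly as follows (a hedged sketch; the download takes a while):

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)  # 70,000 images of 28x28 = 784 pixels
print(mnist.data.shape)    # (70000, 784)
print(mnist.target.shape)  # (70000,)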

1. Get the data set

In this case we import the handwritten digits dataset directly from the sklearn.datasets module with load_digits. This dataset is the test set of the UCI "Optical Recognition of Handwritten Digits" dataset and a small subset of MNIST: 1,797 handwritten digit images at a resolution of 8×8, covering the ten digit classes 0 to 9.

Import the load_digits module and the relevant packages for this case as follows:

from time import time # used to calculate running time
import matplotlib.pyplot as plt 
import numpy as np
from matplotlib import offsetbox # Define the format of the graph box
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection) 

The n_class parameter of load_digits specifies how many classes of images to extract (starting from the digit 0); the default is 10. There is also a return_X_y parameter (new in scikit-learn 0.18): if True, the image data data and the labels target are returned; it defaults to False. When return_X_y is False, a Bunch object is returned, a dictionary-like object containing data, images, target, target_names, and DESCR (the complete description of the dataset).

The following shows the two reading methods respectively:

Method 1: return the Bunch object. The implementation code is as follows:

digits = datasets.load_digits(n_class=6)
print(digits)
# Fetch bunch's data and target
print(digits.data)
print(digits.target)

The following output is displayed:

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  0. ...  9.  0.  0.]
 [ 0.  0.  0. ...  4.  0.  0.]
 [ 0.  0.  6. ...  6.  0.  0.]]
[0 1 2 ... 4 4 0]

Method 2: return only data and target. The implementation code is as follows:

data, target = datasets.load_digits(n_class=6, return_X_y=True)
print(data)
print(target)

The following output is displayed:

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  0. ...  9.  0.  0.]
 [ 0.  0.  0. ...  4.  0.  0.]
 [ 0.  0.  6. ...  6.  0.  0.]]
[0 1 2 ... 4 4 0]

This case extracts only the images of digits 0 to 5 for dimensionality reduction. We first display the first six images in the dataset; the implementation code is as follows:

# plt.gray() 
fig, axes = plt.subplots(nrows=1, ncols=6, figsize=(8, 8))
for i, ax in zip(range(6), axes.flatten()):
    ax.imshow(digits.images[i], cmap=plt.cm.gray_r)
plt.show()

The effect is as follows:

To conveniently display the handwritten digit images later on, we use the Bunch-object import method. The implementation code is as follows:

digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30

2. Visualization of data sets

Visualize a selection of the images in the dataset; the implementation code is as follows:

n_img_per_row = 30  # display 30 images per row

# The whole canvas is 300*300 pixels. Each 8*8 digit image sits in a 10*10 cell, so a white border surrounds every image and keeps neighbors from running together
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))

for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
plt.figure(figsize=(6, 6))
plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')

The effect is as follows:

3. Dimensionality reduction and visualization

Image data is high-dimensional (from tens up to millions of dimensions). If each image is treated as a point in a high-dimensional space, displaying these points directly is extremely difficult, so we need to reduce their dimension and view the nested structure of the whole dataset in a 2-D or 3-D space.

The dimension reduction methods demonstrated in this case, and the sklearn modules they come from, are shown in the following table:

The process of calling the above methods for dimension reduction is similar:

  • First, create an instance of the chosen method: instance = sklearn_module.Method(parameter settings)
  • Then transform the data: X_transformed = instance.fit_transform(X); some methods, such as LDA, also require the labels y
  • Finally, visualize the transformed data: pass the transformed data and a title to draw a two-dimensional map (a minimal template is sketched below)
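A minimal sketch of this three-step pattern, using PCA as a stand-in for any of the methods below (plot_embedding is the plotting helper defined in the next part):

reducer = decomposition.PCA(n_components=2)           # 1. create an instance
X_low = reducer.fit_transform(X)                      # 2. transform the data (LDA would also need y)
plot_embedding(X_low, "PCA embedding of the digits")  # 3. visualize the 2-D result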

To make plotting convenient and keep a uniform style, a plot_embedding function is defined to draw the low-dimensional embeddings.

The color scheme is as follows:

  • Green #5dbe80
  • Blue #2d9ed8
  • Purple #a290c4
  • Orange #efab40
  • Red #eb4e4f
  • Grey #929591

The implementation code is as follows:

First we define the function that draws the sample points in two-dimensional space. Input parameters: 1. the dimension-reduced data matrix; 2. the plot title.

def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)  # normalize each dimension to [0, 1]; note X here has only two dimensions
    
    plt.figure(figsize=(6, 6))  # set the overall figure size
    ax = plt.subplot(111)
    colors = ['#5dbe80', '#2d9ed8', '#a290c4', '#efab40', '#eb4e4f', '#929591']
    
    # Draw the sample points
    for i in range(X.shape[0]):  # each row is one sample
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 # color=plt.cm.Set1(y[i] / 10.),
                 color=colors[y[i]],
                 fontdict={'weight': 'bold', 'size': 9})  # draw the sample's digit label at its position
    
    # Draw thumbnails on the sample points, keeping them sparse enough not to cover each other
    # offsetbox.AnnotationBbox requires matplotlib 1.0 or later
    if hasattr(offsetbox, 'AnnotationBbox'):
        shown_images = np.array([[1., 1.]])  # pretend a thumbnail is already shown at position (1, 1)
        for i in range(digits.data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)  # distance from this sample to all shown thumbnails
            if np.min(dist) < 4e-3:  # if the minimum distance is below 4e-3, two points are too close: skip this thumbnail with continue
                continue
            shown_images = np.r_[shown_images, [X[i]]]  # append this point to shown_images by vertical concatenation
            
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
            
    plt.xticks([]), plt.yticks([])  # hide the x and y ticks
    if title is not None:
        plt.title(title)

3.1. Random projection dimension reduction

Random projection is the simplest way to reduce dimension. It preserves the spatial structure of the whole dataset only to a very small extent and loses most of the structural information, so it is rarely used. The implementation code is as follows:

t0 = time() 
rp = random_projection.SparseRandomProjection(n_components=2, random_state=66)
X_projected = rp.fit_transform(X)
plot_embedding(X_projected, 
               "Random Projection of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:

3.2 PCA dimension reduction

PCA is the most commonly used linear, unsupervised dimension reduction method; it reduces dimension through an SVD decomposition of the data's covariance matrix. The implementation code is as follows:

t0 = time()
pca = decomposition.PCA(n_components=2)
X_pca = pca.fit_transform(X)
plot_embedding(X_pca,
               "Principal Components projection of the digits (time %.2fs)" %
               (time() - t0))
print(pca.explained_variance_ratio_)  # the fraction of the original data's variance explained by each component

The effect is as follows:
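As a hedged sanity check (not part of the original case), the PCA projection can be reproduced with plain numpy by an SVD of the centered data; up to the sign of each component the results coincide:

X_centered = X - X.mean(axis=0)                           # center the data first
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca_manual = X_centered @ Vt[:2].T                      # project onto the top-2 right singular vectors
print(np.allclose(np.abs(X_pca_manual), np.abs(X_pca)))   # True: equal up to component signs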

3.3. Truncated SVD dimension reduction

Truncated SVD reduces the data dimension linearly by means of a truncated SVD decomposition. Unlike PCA, it does not center the data before the decomposition, which means it can efficiently handle sparse matrices such as those defined in scipy.sparse, whereas PCA does not accept scipy.sparse input. In text analysis this method can apply SVD to sparse term-frequency / TF-IDF matrices, which is known as LSA (latent semantic analysis). The implementation code is as follows:

t0 = time()
svd = decomposition.TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X)
plot_embedding(X_svd,
               "Truncated SVD projection of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:
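To illustrate the sparse-input point above, a hedged sketch: TruncatedSVD accepts a scipy.sparse matrix directly, while PCA raises an error on the same input.

from scipy import sparse

X_sparse = sparse.csr_matrix(X)             # the digits data as a sparse matrix
X_svd_sparse = svd.fit_transform(X_sparse)  # works: TruncatedSVD handles sparse input
# decomposition.PCA(n_components=2).fit(X_sparse)  # would raise a TypeError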

3.4 LDA dimension reduction

The LDA dimension reduction method makes use of the labels of the original data, so after reduction, sample points of the same class cluster together in the low-dimensional space while different classes are separated more widely. The implementation code is as follows:

X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # perturb the diagonal to make X2 invertible
t0 = time()
lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X2, y)
plot_embedding(X_lda,
               "Linear Discriminant projection of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:

3.5 MDS dimension reduction

If the distances between sample points are themselves meaningful, like distances between cities, MDS dimension reduction can be used to preserve those distances in the low-dimensional space (see the sketch below). The implementation code is as follows:

clf = manifold.MDS(n_components=2, n_init=1, max_iter=100)
t0 = time()
X_mds = clf.fit_transform(X)
plot_embedding(X_mds,
               "MDS embedding of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:
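As a hedged sketch of the city-distance analogy above, MDS can also embed a precomputed distance matrix by setting dissimilarity='precomputed' (the matrix D below is made up):

D = np.array([[0., 3., 4.],
              [3., 0., 5.],
              [4., 5., 0.]])   # hypothetical pairwise distances between three "cities"
mds = manifold.MDS(n_components=2, dissimilarity='precomputed', n_init=1, max_iter=100)
coords = mds.fit_transform(D)   # 2-D coordinates that approximately preserve D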

3.6 Isomap dimension reduction

Isomap, short for Isometric Mapping, is a manifold learning method that can be seen as an extension of MDS. Unlike MDS, it preserves the geodesic distances between all data points before and after dimension reduction.

Isomap needs the number of nearest neighbors that defines each neighborhood; we already set this to 30 (n_neighbors) when loading the image data. Because the geodesic distances between sample points must be computed, this method is quite time-consuming. The implementation code is as follows:

t0 = time()
iso = manifold.Isomap(n_neighbors, n_components=2)
X_iso = iso.fit_transform(X)
plot_embedding(X_iso,
               "Isomap projection of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:

3.7 LLE dimension reduction

LLE dimension reduction also requires the neighborhood size n_neighbors. LLE preserves the distance relationships between sample points within each neighborhood; it can be understood as a series of local PCA operations that are compared globally to find the best nonlinear embedding. LLE comes in four main variants: standard, modified, hessian, and ltsa. Each is demonstrated below, along with its reconstruction error (the error made when reconstructing the original-space data from the low-dimensional embedding).

3.7.1 Standard LLE

The standard LLE code, which also prints the reconstruction error, is as follows:

clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2, method='standard')
t0 = time() 
X_lle = clf.fit_transform(X)
print("reconstruction error: %g" % clf.reconstruction_error_)  # error reconstructing the data from the embedding
plot_embedding(X_lle,
               "Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))

The effect is shown below:

3.7.2 Modified LLE

Standard LLE has a regularization problem: when n_neighbors is larger than the input dimension, the local neighborhood matrix is rank-deficient. To address this, modified LLE introduces a regularization parameter r on top of standard LLE. Modified LLE is enabled by setting method='modified', as shown below:

clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2, method='modified')
t0 = time()
X_mlle = clf.fit_transform(X)
print("reconstruction error: %g" % clf.reconstruction_error_)  # error reconstructing the data from the embedding
plot_embedding(X_mlle,
               "Modified Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:

3.7.3 Hessian LLE

Hessian LLE, also called Hessian eigenmapping, is another approach to the LLE regularization problem. It requires n_neighbors > n_components * (n_components + 3) / 2; with n_components=2 this means n_neighbors > 5, which our setting of 30 easily satisfies. The implementation code is as follows:

clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2, method='hessian')
t0 = time()
X_hlle = clf.fit_transform(X)
print("reconstruction error: %g" % clf.reconstruction_error_)  # error reconstructing the data from the embedding
plot_embedding(X_hlle,
               "Hessian Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:

3.7.4 LTSA

LTSA (Local Tangent Space Alignment) is not actually a variant of LLE, but it is implemented in LocallyLinearEmbedding because it is algorithmically similar to LLE. Rather than preserving distances between neighboring points as LLE does, LTSA characterizes the local geometry of each neighborhood via its tangent space, as shown in the following code:

clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2, method='ltsa')
t0 = time()
X_ltsa = clf.fit_transform(X)
print("reconstruction error: %g" % clf.reconstruction_error_)  # error reconstructing the data from the embedding
plot_embedding(X_ltsa,
               "Local Tangent Space Alignment of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:

3.8 t-SNE dimension reduction

This example initializes t-SNE with a PCA-generated embedding instead of t-SNE's default random initialization, i.e., we set init='pca'. This ensures the global stability of the embedding, that is, the embedding does not depend on random initialization.

The t-SNE method is sensitive to the local structure of the data and has several advantages:

  • It reveals samples that belong to different manifolds or clusters
  • It reduces the tendency of samples to crowd together in the center

Of course, it also has many disadvantages:

  • It is computationally expensive: t-SNE can take hours on millions of images, where PCA finishes the same task in minutes or seconds.
  • The algorithm is stochastic: different random seeds produce different embeddings. Of course, it is feasible to try several seeds and keep the one with the smallest reconstruction error as the final result (see the sketch after this list).
  • Global structure is preserved poorly, although this can be mitigated by initializing the sample points with PCA (init='pca').
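After fitting, TSNE exposes its final objective value as the attribute kl_divergence_, which can play the role of the "reconstruction error" mentioned above. A hedged sketch of seed selection (it refits t-SNE once per seed, so it is expensive):

best_seed = min(range(5),
                key=lambda s: manifold.TSNE(n_components=2, init='pca',
                                            random_state=s).fit(X).kl_divergence_)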

The implementation code is as follows:

tsne = manifold.TSNE(n_components=2, init='pca', random_state=10)  # create the t-SNE instance
t0 = time()  # the moment just before dimension reduction
X_tsne = tsne.fit_transform(X)  # reduce to data in two-dimensional space
plot_embedding(X_tsne, "t-SNE embedding of the digits (time %.2fs)" % (time() - t0))  # draw the embedding after reduction
plt.show()

The effect is as follows:

3.9 RandomTrees dimensionality reduction

RandomTreesEmbedding, from the sklearn.ensemble module, is technically not a manifold embedding method, but it learns a high-dimensional representation of the data that a dimension reduction method can then work on. We can therefore first use RandomTreesEmbedding to obtain a high-dimensional representation of the data, and then use PCA or truncated SVD to reduce its dimension. The implementation code is as follows:

hasher = ensemble.RandomTreesEmbedding(n_estimators=200, random_state=0, max_depth=5)
t0 = time()
X_transformed = hasher.fit_transform(X)
svd = decomposition.TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X_transformed)
plot_embedding(X_reduced,
               "Random forest embedding of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:

3.10 Spectral embedding dimension reduction

Spectral embedding, also called Laplacian eigenmaps, uses a spectral decomposition of the graph Laplacian to find a representation of the data in low-dimensional space. The implementation code is as follows:

embedder = manifold.SpectralEmbedding(n_components=2, random_state=0, eigen_solver="arpack")
t0 = time()
X_se = embedder.fit_transform(X)
plot_embedding(X_se,
               "Spectral embedding of the digits (time %.2fs)" %
               (time() - t0))

The effect is as follows:

4. Summary

This case applied a variety of dimension reduction methods, including PCA, LDA, and manifold-learning-based methods, to reduce and visualize handwritten digit image data.

Linear methods such as PCA and LDA cost little time, but they lose the nonlinear structure present in the high-dimensional space. In contrast, the manifold learning methods among the nonlinear techniques (KPCA and KLDA are not covered here; try them if you are interested) preserve that nonlinear structure well. Although typical manifold learning is unsupervised, some supervised variants exist.

When reducing the dimension of data, we must be clear about the purpose: is it feature extraction, to make the model more interpretable or more effective, or simply visualization of high-dimensional data? In choosing a method, we should try to balance time cost against the quality of the reduction.

In addition, the following points deserve attention when reducing dimension:

  • Make sure all features are on a consistent scale before reducing dimension;
  • The reconstruction error can be used to find the optimal output dimension d (when the goal is not merely visualization): as d increases, the reconstruction error decreases until it reaches the threshold set by the implementation (see the sketch after this list);
  • Noise points may cause a "short circuit" of the manifold: two parts of the manifold that would otherwise be easily separated become connected through noise points acting as a "bridge";
  • Certain inputs can make the weight matrix singular, for example when the dataset contains more than two identical sample points, or when the samples fall into disjoint groups. In that case the eigenvalue decomposition with eigen_solver='arpack' may fail to find the null space. The simplest fix is to use eigen_solver='dense': although dense may be slow, it works for singular matrices. We can also act on the cause of the singularity: if it comes from disjoint sets, try increasing n_neighbors; if it comes from duplicate sample points in the dataset, try removing the duplicates and keeping only one of each.
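A hedged sketch of the reconstruction-error scan from the second point above, using standard LLE, whose fitted reconstruction_error_ attribute measures how well the embedding reconstructs the data:

for d in range(1, 6):
    clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=d,
                                          method='standard')  # try eigen_solver='dense' if arpack fails
    clf.fit(X)
    print("d=%d, reconstruction error: %g" % (d, clf.reconstruction_error_))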