This article is part of the notes and code reproductions (clustering, dimensionality reduction) for Ng's machine learning course [1].

Author: Huang Haiguang [2]

Note: Notes and assignments (including data and original assignment files) and videos can be downloaded on Github [3].

I publish the course notes and code on the official account "Machine Learning Beginners"; please follow it. This is Part 4: Unsupervised Learning, covering week 8 of the original course, and it includes both the notes and the assignment code (the original assignments were written in Octave; the Python reproduction is given here).

Part 1: Regression

Part 2: Logistic Regression

Part 3: Support Vector Machines

The assignment code [4] for this article can be downloaded in its entirety.

Markdown file for notes [5]

PDF file of notes [6]

Notes Section contents

XIII. Clustering

13.1 Unsupervised Learning: Introduction

Refer to the video: 13-1 - Unsupervised Learning: Introduction (3 min).mkv

In this video, I'm going to start introducing clustering algorithms. This is an exciting moment, because it is the first unsupervised learning algorithm we will learn. We're going to have the computer learn from unlabeled data instead of labeled data.

So what is unsupervised learning? I briefly introduced unsupervised learning at the beginning of the course, however, it is worth comparing it to supervised learning.

In typical supervised learning, we have a labeled training set, and our goal is to find a decision boundary that separates the positive samples from the negative samples. In that case, we have a set of labels to which we fit a hypothesis function. In unsupervised learning, by contrast, our data have no labels attached, and what we get looks like this:

Here we have a bunch of points but no labels. Therefore, our training set can be written as just $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, without any labels $y$, and the points on the graph carry no label information. In other words, in unsupervised learning we take a pile of unlabeled training data, feed it into an algorithm, and tell the algorithm: find the internal structure of this data for us. We may need some kind of algorithm to help us find that structure. The data on the graph looks like it can be divided into two separate sets of points (called clusters), and an algorithm that can find the sets of points I circled is called a clustering algorithm.

This will be the first unsupervised learning algorithm we introduce. Of course, we’ll mention other types of unsupervised learning algorithms later, which can find other types of structures or other patterns for us, not just clusters.

We will first introduce clustering algorithms. After that, we will introduce other algorithms. So what do clustering algorithms do in general?

Earlier in the course I listed some applications, for example market segmentation: maybe you have a lot of customer information stored in a database, and you want to split the customers into different groups so that you can sell products or provide services better suited to each type of customer. Social network analysis: many researchers study a group of people on social networks such as Facebook or Google+, using information such as which people you contact most often and who those people in turn email, to find closely connected groups; this calls for a clustering algorithm that finds close friends within a social network. I have a friend who works on using clustering algorithms to better organize clusters of computers, or better manage data centers: if you know which computers in a data center often work together, you can reallocate resources, rearrange the network, and thereby optimize the data center and its data communication.

Finally, I’m actually working on how to use clustering algorithms to understand galaxy formation. And then we can use that knowledge to learn some astronomical details. Ok, so that’s the clustering algorithm. This will be the first unsupervised learning algorithm we introduce. In the next video, we’ll start with a concrete clustering algorithm.

13.2 K-means algorithm

Refer to the video: 13-2 - K-Means Algorithm (13 min).mkv

K-means is the most popular clustering algorithm, which takes an unlabeled data set and then clusters the data into different groups.

K-means is an iterative algorithm. Suppose we want to cluster the data into K groups; the method is as follows:

  1. First, select K random points, called cluster centroids.
  2. For each sample in the data set, compute its distance to each centroid and associate it with the nearest one; all points associated with the same centroid form one cluster.
  3. Compute the mean of each cluster and move the centroid associated with that cluster to the mean.
  4. Repeat steps 2-3 until the centroids no longer change.

Here is an example of clustering:

After 1 iteration

After 3 iterations

After 10 iterations

Use $\mu^1, \mu^2, \ldots, \mu^k$ to denote the cluster centroids, and use $c^{(1)}, c^{(2)}, \ldots, c^{(m)}$ to store the index of the cluster centroid closest to the $i$-th training example $x^{(i)}$. The pseudo-code of the K-means algorithm is as follows:

Repeat {
    for i = 1 to m
        c(i) := index (from 1 to K) of cluster centroid closest to x(i)

    for k = 1 to K
        μ(k) := average (mean) of points assigned to cluster k
}

The algorithm consists of two steps. The first for loop is the assignment step: for each sample $i$, it computes the index $c^{(i)}$ of the cluster the sample should belong to. The second for loop moves the cluster centroids: for each cluster $k$, it recomputes the centroid $\mu_k$ as the mean of the points assigned to that cluster.

The K-means algorithm can also be used to divide data into groups even when there are no clearly separated clusters. The data set shown in the figure below consists of two features, height and weight. The K-means algorithm divides the data into three categories, which helps determine the three sizes of T-shirts to be produced.

13.3 Optimization objectives

Refer to the video: 13-3 - Optimization Objective (7 min).mkv

The K-means minimization problem is to minimize the sum of squared distances between all data points and their associated cluster centroids. The cost function of K-means (also known as the distortion function) is therefore:

$$J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

where $\mu_{c^{(i)}}$ denotes the cluster centroid closest to $x^{(i)}$. Our optimization goal is to find the $c^{(1)}, \ldots, c^{(m)}$ and $\mu_1, \ldots, \mu_K$ that minimize the cost function:

$$\min_{c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K} J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$$

Reviewing the K-means iterative algorithm, the first loop (the assignment step) reduces the cost with respect to $c^{(1)}, \ldots, c^{(m)}$, while the second loop (the centroid update) reduces the cost with respect to $\mu_1, \ldots, \mu_K$. The cost must therefore decrease (or stay the same) with every iteration; if it ever increases, there is an error in the implementation.
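As a quick numerical check (a minimal sketch added for illustration, not from the original notes; the function name and arguments are only illustrative), the distortion can be computed directly from the assignments and centroids:

import numpy as np

def distortion(X, idx, centroids):
    # average squared distance between each sample and its assigned centroid
    m = X.shape[0]
    return np.sum((X - centroids[idx.astype(int)]) ** 2) / m

Printed after every iteration of the loop above, this value should never increase; if it does, the implementation has a bug.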

13.4 Random Initialization

Refer to the video: 13-4 - Random Initialization (8 min).mkv

Before running the K-means algorithm, we first need to randomly initialize all clustering centers. Here is how to do it:

  1. We should choose $K < m$, i.e., the number of cluster centroids should be smaller than the number of training examples.
  2. Randomly select $K$ training examples and set the cluster centroids $\mu_1, \ldots, \mu_K$ equal to these examples.

One problem with K-means is that it can get stuck in a local minimum, depending on the initialization.

To address this, we usually run the K-means algorithm several times, re-initializing randomly each time, and finally compare the results of the runs and keep the one with the smallest cost function. This works well when $K$ is small (roughly 2-10); when $K$ is large, repeated runs may not make a significant difference.
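A minimal sketch of this procedure (assuming helper functions like the init_centroids and run_k_means defined in the code section later in this article, plus the distortion function sketched in section 13.3; the names are illustrative):

import numpy as np

best_cost, best_centroids, best_idx = np.inf, None, None
for _ in range(100):                      # e.g. 100 random initializations
    centroids = init_centroids(X, k)      # pick k random training examples as initial centroids
    idx, centroids = run_k_means(X, centroids, 10)
    cost = distortion(X, idx, centroids)
    if cost < best_cost:                  # keep the run with the smallest distortion
        best_cost, best_centroids, best_idx = cost, centroids, idx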

13.5 Selecting the Number of clusters

Refer to the video: 13-5 - Choosing the Number of Clusters (8 min).mkv

There is no single best method for selecting the number of clusters; it usually has to be chosen manually for each problem. When choosing, think about what the motivation for clustering with K-means is, and then pick the number of clusters that best serves that purpose.

When people discuss ways to choose the number of clusters, one method they may mention is called the elbow rule. For the elbow rule, we vary $K$, the total number of clusters. We first run K-means with one cluster, meaning all the data is grouped into a single cluster, and compute the cost (distortion) function $J$; we then repeat this for $K = 2$, $K = 3$, and so on, where $K$ denotes the number of clusters.

We may get a curve that looks like a person's elbow; that is what the elbow rule refers to. Looking at such a plot, there is a very clear elbow: like a human arm held out, there is a shoulder, an elbow, and a hand. In this pattern the distortion drops very quickly from one cluster to two and from two to three, and then reaches an elbow at three; after that, the distortion decreases very slowly. It then looks right to cluster with three clusters, because that point is the elbow of the curve: the distortion falls quickly before it and very slowly after it, so we choose $K = 3$. When you apply the elbow rule and get a graph like this, it is a reasonable way to choose the number of clusters.

For example, in our T-shirt manufacturing example we want to cluster users by body size. We could divide them into 3 sizes ($K = 3$: S, M, L) or 5 sizes ($K = 5$: XS, S, M, L, XL). The choice should be based on answering the question: will the T-shirts we make after clustering fit our customers better?

Clustering reference:

1. Summary of similarity/distance calculation methods

(1) Minkowski distance (with $p = 2$ this is the Euclidean distance):

$$dist(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$$

(2) Jaccard similarity coefficient:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

(3) Cosine similarity:

Let $\theta$ denote the angle between the $n$-dimensional vectors $x$ and $y$. By the law of cosines, its cosine is:

$$\cos\theta = \frac{x^T y}{|x| \cdot |y|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\, \sqrt{\sum_{i=1}^{n} y_i^2}}$$

(4) Pearson correlation coefficient:

$$\rho_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_X)^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \mu_Y)^2}}$$

The Pearson correlation coefficient is the cosine of the angle between $x$ and $y$ after the coordinate vectors have been shifted so that their means lie at the origin.
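For reference, these quantities are easy to check numerically with NumPy and SciPy (a small sketch added here, not part of the original notes; the example vectors are made up):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])

print(distance.minkowski(x, y, p=2))                     # Minkowski distance with p=2, i.e. the Euclidean distance
print(1 - distance.jaccard([1, 0, 1, 1], [1, 1, 0, 1]))  # Jaccard similarity coefficient of two binary vectors
print(1 - distance.cosine(x, y))                         # cosine similarity
print(np.corrcoef(x, y)[0, 1])                           # Pearson correlation coefficient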

2. Measurement index of clustering

(1) Homogeneity:

Similar to precision: homogeneity is satisfied if each cluster contains samples of only one class. It can also be viewed as an accuracy rate (the sum over clusters of the proportion of correctly classified samples in the cluster relative to the total number of samples in that cluster).

(2) Completeness:

Similar to recall: completeness is satisfied if all samples of the same class are assigned to the same cluster (the sum over classes of the proportion of correctly assigned samples relative to the total number of samples of that class).

(3) V-measure:

The weighted harmonic mean of homogeneity and completeness.

(4) Silhouette coefficient:

Silhouette coefficient of a sample $i$:

Intra-cluster dissimilarity $a(i)$: the average distance between sample $i$ and the other samples in the same cluster; it should be as small as possible.

Inter-cluster dissimilarity $b(i)$: the average distance between sample $i$ and all samples of the other clusters; it should be as large as possible.

Silhouette coefficient: $s(i) = \dfrac{b(i) - a(i)}{\max\{a(i), b(i)\}}$. The closer the value is to 1, the more reasonable the clustering of the sample; a value close to -1 means the sample should rather be assigned to another cluster; a value near 0 means the sample lies on the boundary between two clusters. The mean of $s(i)$ over all samples is called the silhouette coefficient of the clustering result.

(5) ARI (Adjusted Rand Index):

The data set $S$ has $N$ elements in total, and the two clustering results are:

$$X = \{X_1, X_2, \ldots, X_r\}, \quad Y = \{Y_1, Y_2, \ldots, Y_s\}$$

The numbers of elements in the subsets of $X$ and $Y$ are:

$$a = \{a_1, a_2, \ldots, a_r\}, \quad b = \{b_1, b_2, \ldots, b_s\}$$

Denote $n_{ij} = |X_i \cap Y_j|$. Then:

$$ARI = \frac{\sum_{i,j} C_{n_{ij}}^2 - \left[ \left( \sum_i C_{a_i}^2 \right) \cdot \left( \sum_j C_{b_j}^2 \right) \right] / C_N^2}{\frac{1}{2} \left[ \left( \sum_i C_{a_i}^2 \right) + \left( \sum_j C_{b_j}^2 \right) \right] - \left[ \left( \sum_i C_{a_i}^2 \right) \cdot \left( \sum_j C_{b_j}^2 \right) \right] / C_N^2}$$
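All of these metrics are available in scikit-learn and can be computed directly from true and predicted cluster labels (a short reference sketch; the label arrays below are made up for illustration):

from sklearn import metrics

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print(metrics.homogeneity_score(labels_true, labels_pred))    # homogeneity
print(metrics.completeness_score(labels_true, labels_pred))   # completeness
print(metrics.v_measure_score(labels_true, labels_pred))      # V-measure
print(metrics.adjusted_rand_score(labels_true, labels_pred))  # ARI

# the silhouette coefficient is computed from the samples themselves rather than the true labels:
# metrics.silhouette_score(X, labels_pred)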

XIV. Dimensionality Reduction

14.1 Motivation 1: Data compression

Refer to the video: 14-1 - Motivation I: Data Compression (10 min).mkv

In this video, I want to start talking about a second type of unsupervised learning problem, called dimensionality reduction. There are several different reasons you might want to reduce dimensions. One is data compression: as we'll see in later videos, data compression not only allows us to compress the data and therefore use less memory or disk space, it also lets us speed up our learning algorithms.

But first, let's talk about what dimensionality reduction is. As a vivid example, suppose we have collected a data set with many, many features, and I have drawn two of them here.

Suppose, unknown to us, two of the features are: $x_1$, the length of an object in centimetres, and $x_2$, the length of the same object in inches.

This gives us a highly redundant representation. Instead of two separate features $x_1$ and $x_2$, both of which are basically length measurements, perhaps what we want is to reduce the data to one dimension and have just one number measuring that length. This example may seem contrived, but the centimetres-and-inches case here is actually not that unrealistic, and there is little real difference between the two features.

Reducing data from two dimensions to one: if we measure the size of something with two different instruments, one in inches and one in centimetres, both measurements become features in our machine learning problem. The two instruments do not give exactly the same value (because of error, precision, and so on), yet having both as features is somewhat redundant, so we want to reduce this two-dimensional data to one dimension.

What I see from this is something that happens in industry. If you have hundreds or thousands of features, it is easy to lose track of the features you actually need. Sometimes several different engineering teams are involved: one team gives you two hundred features, a second gives you another three hundred, a third gives you five hundred more, and with over a thousand features in total it becomes very hard to track which feature came from which team, and you end up with highly redundant features.

For years I worked on autonomous helicopter flight, and the same thing applies. If you run a survey or a test on different pilots, you might have one feature $x_1$ for their skill as a helicopter pilot and another $x_2$ for how much they enjoy flying, and those two features may be highly correlated. What you really care about is the direction of the red line, the single underlying feature that captures pilot aptitude.

Reducing data from 3D to 2D: in this example we reduce a three-dimensional feature vector to a two-dimensional one. The process is similar to the one above: we project the three-dimensional vectors onto a two-dimensional plane, forcing all the data to lie on the same plane, and so reduce them to two-dimensional feature vectors.

Such a process can be used to reduce data from any dimension to any desired dimension, such as reducing features from 1000 dimensions to 100 dimensions.

And as we will see, this will also allow some of our learning algorithms to run faster, but we'll talk about that in a later video.

14.2 Motivation 2: Data visualization

Refer to the video: 14-2 - Motivation II: Visualization (6 min).mkv

In many learning problems, if we can visualize the data, we can find a better solution, and dimensionality reduction can help us.

If we have data for many different countries, each feature vector has 50 features (GDP, GDP per capita, life expectancy, etc.). It’s impossible to visualize this 50-dimensional data. Using a dimensionality reduction method to reduce it to 2 dimensions, we can visualize it.

The problem with this is that dimensionality reduction algorithms only reduce the number of dimensions, and the meaning of the new features must be discovered by ourselves.

14.3 Principal component analysis problem formulation

Refer to the video: 14-3 - Principal Component Analysis Problem Formulation (9 min).mkv

Principal component analysis (PCA) is the most common dimension reduction algorithm.

In PCA, what we want to do is find a direction vector onto which to project the data so that the mean squared projection error is as small as possible. The direction vector is a vector passing through the origin, and the projection error is the length of the perpendicular from a feature vector to that direction vector.

The following describes the principal component analysis problem:

The problem is to reduce $n$-dimensional data to $k$ dimensions; the goal is to find vectors $u^{(1)}, u^{(2)}, \ldots, u^{(k)}$ that minimize the total projection error. Comparison of principal component analysis and linear regression:

Principal component analysis and linear regression are two different algorithms. Principal component analysis minimizes the projection error, whereas linear regression attempts to minimize the prediction error. The purpose of linear regression is to predict an outcome, whereas principal component analysis does not predict anything.

In the figure above, the error from linear regression (perpendicular to the horizontal axis) is on the left, and the error from principal component analysis (perpendicular to the red line) is on the right.

PCA can be used for data compression by reducing $n$-dimensional features to $k$ dimensions. If a 100-dimensional vector can in the end be represented with 10 dimensions, the compression rate is 90%. Similarly, the KL transform in image processing uses PCA for image compression. In any case, PCA should keep the loss of information after dimensionality reduction to a minimum.

One benefit of PCA is dimensionality reduction of the data. We can rank the newly obtained "principal component" vectors by importance, keep the most important ones as needed, and omit the remaining dimensions; this simplifies the model or compresses the data while preserving the information in the original data as much as possible.

Another great advantage of PCA is that it is completely parameter-free. There are no parameters to set manually and no empirical model to plug in; the final result depends only on the data and not on the user.

However, this can also be seen as a weakness. If the user has prior knowledge of the observed objects and has grasped some characteristics of the data, but cannot influence the processing through parameters or other means, the expected effect may not be achieved and the efficiency may be low.

14.4 Principal component analysis algorithm

Refer to the video: 14-4 - Principal Component Analysis Algorithm (15 min).mkv

PCA reduces $n$-dimensional data to $k$ dimensions:

The first step is mean normalization. We compute the mean $\mu_j$ of each feature and replace $x_j$ with $x_j - \mu_j$. If the features are on different scales, we also divide by the standard deviation $\sigma_j$.

The second step is to compute the covariance matrix $\Sigma$:

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} \right) \left( x^{(i)} \right)^T$$

The third step is to compute the eigenvectors of the covariance matrix $\Sigma$.

In Octave we can use singular value decomposition (SVD): [U, S, V] = svd(Sigma).

For the $n \times n$ matrix $\Sigma$ above, $U$ is an $n \times n$ matrix whose columns are the direction vectors that give the minimal projection error for the data. If we want to reduce the data from $n$ dimensions to $k$ dimensions, we simply take the first $k$ column vectors of $U$ to obtain an $n \times k$ matrix, denoted $U_{reduce}$, and then compute the new feature vectors by:

$$z^{(i)} = U_{reduce}^T \, x^{(i)}$$

where $x$ is $n \times 1$-dimensional, so the result is $k \times 1$-dimensional. Note that we do not apply this processing to the variance features.

14.5 Choosing the number of principal components

Refer to the video: 14-5 - Choosing the Number of Principal Components (13 min).mkv

Principal component analysis aims to reduce the average squared projection error:

$$\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{approx}^{(i)} \right\|^2$$

The total variation of the training set is:

$$\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2$$

We want to choose the smallest possible $k$ such that the ratio of the average squared projection error to the total variation of the training set is small.

If we require this ratio to be less than 1%, it means that 99% of the variance in the original data is retained. If we choose to retain 95% of the variance, we can usually reduce the dimensionality of the features in the model very significantly.

We could first set $k = 1$, run principal component analysis, obtain $U_{reduce}$ and $z$, and check whether the ratio is less than 1%. If not, we try $k = 2$, and so on, until we find the smallest $k$ that brings the ratio below 1% (this is usually possible because the features tend to be highly correlated).

There is a better way to choose $k$. When we call the svd function in Octave, we get three return values: [U, S, V] = svd(Sigma).

Here $S$ is an $n \times n$ matrix that has values only on its diagonal, with all other elements equal to 0. We can use this matrix to compute the ratio of the average squared projection error to the total variation:

$$\frac{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{approx}^{(i)} \right\|^2}{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} = 1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \leq 1\%$$

That is:

$$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 99\%$$

After the data has been compressed, we can use the following to obtain an approximation of the original features:

$$x_{approx}^{(i)} = U_{reduce} \, z^{(i)}$$
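Given the diagonal of $S$, the smallest $k$ that retains, say, 99% of the variance can be found in a few lines (a sketch added here for illustration, assuming S is the 1-D array of singular values returned by numpy's np.linalg.svd, as in the PCA code later in this article):

import numpy as np

def choose_k(S, variance_retained=0.99):
    # smallest k such that sum(S[:k]) / sum(S) >= variance_retained
    ratios = np.cumsum(S) / np.sum(S)
    return int(np.argmax(ratios >= variance_retained)) + 1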

14.6 Reconstruction from compressed representation

Refer to the video: 14-6 - Reconstruction from Compressed Representation (4 min).mkv

In earlier videos, I talked about PCA as a compression algorithm: you might compress 1000-dimensional data into 100-dimensional features, or compress three-dimensional data into a two-dimensional representation. So, if PCA is a compression algorithm, you should be able to go back from the compressed representation to an approximation of your original high-dimensional data.

So, given $z^{(i)}$, which might be 100-dimensional, how do you go back to your original representation $x^{(i)}$, which might be 1000-dimensional?

In the PCA algorithm we may have samples like the ones in the picture, $x^{(1)}$ and $x^{(2)}$. What we did was project these samples onto the one-dimensional line shown, so that each point is now specified by a single real number, say $z^{(1)}$, giving its position on that line. Given a point $z^{(1)}$, how can we go back to the original two-dimensional space? Here $x$ is 2-dimensional and $z$ is 1-dimensional, and the reverse mapping is $x_{approx} = U_{reduce} \cdot z$, as shown in the figure.

As you can see, the result is quite similar to the raw data. So this is how you go from a low-dimensional representation $z$ back to an uncompressed approximation of your original data $x$; we also call this process reconstruction of the original data.

That is how we reconstruct an approximation of the original values from the compressed representation. So, given an unlabeled data set, you now know how to apply PCA to map high-dimensional features $x$ to a lower-dimensional representation $z$, and, from this video, how to map such low-dimensional representations back to an approximation of your original high-dimensional data.

Now that you know how to implement PCA, we will go on to discuss some techniques for using PCA well in practice. In particular, in the next video, I want to talk a little about how to choose $k$.

14.7 Advice on applying principal component analysis

Refer to the video: 14-7 - Advice for Applying PCA (13 min).mkv

Let’s say we’re doing some computer vision machine learning on a 100×100 pixel image, 10,000 features in total.

  1. The first step is to use principal component analysis to compress the data down to 1000 features.
  2. Then run the learning algorithm on the training set.
  3. At prediction time, convert the input features $x$ into $z$ using the mapping learned from the training set, and then make the prediction.

Note: if we have a cross-validation set and a test set, the mapping $U_{reduce}$ should likewise be learned only from the training set and then applied to them.
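With scikit-learn this discipline comes naturally, because the PCA mapping is fitted on the training set only and then reused for the other sets (a hedged sketch; X_train, X_test, y_train, y_test and the logistic-regression classifier are placeholders, not objects defined elsewhere in this article):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=1000)
Z_train = pca.fit_transform(X_train)   # learn the mapping (U_reduce) from the training set only
Z_test = pca.transform(X_test)         # reuse the same mapping for the cross-validation/test data

clf = LogisticRegression().fit(Z_train, y_train)
print(clf.score(Z_test, y_test))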

A wrong use of principal component analysis: a common mistake is to use it to reduce overfitting (by reducing the number of features). That is a bad idea; try regularization instead. The reason is that principal component analysis discards some information without considering the outcome variable at all, and may therefore throw away very important features, whereas regularization takes the outcome variable into account and does not lose important information.

Another common mistake is to default to principal component analysis as part of the learning process. While this often works, it is better to start with all the original features and only consider principal component analysis when it is necessary (the algorithm is running too slowly or taking up too much memory).

Code part:

Machine learning exercise 4: K-means and PCA

Code modification and annotation: Huang Hai Guang, [email protected]

In this exercise, we will implement k-means clustering and use it to compress images. We’ll start with a simple 2D data set to see how K-means works, and then we’ll apply it to image compression. We will also experiment with principal component analysis and see how it can be used to find low-dimensional representations of facial images.

K – means clustering

We will implement and apply K-means to a simple 2d data set to get some intuition of how it works. K-means is an iterative, unsupervised clustering algorithm that combines similar instances into clusters. The algorithm starts by guessing the initial cluster center of each cluster, then repeatedly assigns instances to the nearest cluster and recalculates the cluster center of that cluster. The first part of our implementation is to find the function that is closest to the cluster center for each instance of the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
def find_closest_centroids(X, centroids):
    m = X.shape[0]
    k = centroids.shape[0]
    idx = np.zeros(m)

    for i in range(m):
        min_dist = 1000000
        for j in range(k):
            dist = np.sum((X[i,:] - centroids[j,:]) ** 2)
            if dist < min_dist:
                min_dist = dist
                idx[i] = j

    return idx

Let’s test this function to make sure it works. We will use the test cases provided in the exercise.

data = loadmat('data/ex7data2.mat')
X = data['X']
initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])

idx = find_closest_centroids(X, initial_centroids)
idx[0:3]

array([0., 2., 1.])

The output matches the expected values in the exercise text (remember that our arrays are zero-indexed rather than one-indexed, so the values are one lower than those in the exercise). Next, we need a function to compute the centroid of a cluster. The centroid is simply the mean of all the samples currently assigned to that cluster.

data2 = pd.DataFrame(data.get('X'), columns=['X1', 'X2'])
data2.head()

         X1        X2
0  1.842080  4.607572
1  5.658583  4.799964
2  6.352579  3.290854
3  2.904017  4.612204
4  3.231979  4.939894

sb.set(context="notebook", style="white")
sb.lmplot(x='X1', y='X2', data=data2, fit_reg=False)
plt.show()

def compute_centroids(X, idx, k):
    m, n = X.shape
    centroids = np.zeros((k, n))

    for i in range(k):
        indices = np.where(idx == i)
        centroids[i,:] = (np.sum(X[indices,:], axis=1) / len(indices[0])).ravel()

    return centroids

compute_centroids(X, idx, 3)

array([[2.42830111, 3.15792418],
       [5.81350331, 2.63365645],
       [7.11938687, 3.6166844 ]])

This output also matches the expected values in the exercise. The next section deals with some iterations and visualization results of actually running the algorithm. Since this step is not complicated, I will build it from scratch. To run the algorithm, we only need to recalculate the cluster center after assigning the sample to the nearest cluster.

def run_k_means(X, initial_centroids, max_iters):
    m, n = X.shape
    k = initial_centroids.shape[0]
    idx = np.zeros(m)
    centroids = initial_centroids

    for i in range(max_iters):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_centroids(X, idx, k)

    return idx, centroids

idx, centroids = run_k_means(X, initial_centroids, 10)

cluster1 = X[np.where(idx == 0)[0],:]
cluster2 = X[np.where(idx == 1)[0],:]
cluster3 = X[np.where(idx == 2)[0],:]

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(cluster1[:,0], cluster1[:,1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:,0], cluster2[:,1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:,0], cluster3[:,1], s=30, color='b', label='Cluster 3')
ax.legend()
plt.show()

One step we skipped was the process of initializing the cluster center. This can affect the convergence of the algorithm. Our task is to create a function that selects a random sample and uses it as the initial clustering center.

def init_centroids(X, k):
    m, n = X.shape
    centroids = np.zeros((k, n))
    idx = np.random.randint(0, m, k)

    for i in range(k):
        centroids[i,:] = X[idx[i],:]

    return centroids

init_centroids(X, 3)

array([[0.99253246, 5.01567424],
       [6.63060699, 3.01502301],
       [4.88804332, 5.50670795]])
from sklearn.cluster import KMeans  # import the KMeans class

model = KMeans(n_clusters=3, n_init=100, n_jobs=-1)
model.fit(data2)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=100, n_jobs=-1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

centroids = model.cluster_centers_
print(centroids.shape)

C = model.predict(data2)
print(C.shape)

(3, 2)
(300,)

centroids[C].shape

(300, 2)
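The image-compression part of the exercise is not reproduced in this excerpt. As a hedged sketch of the idea (assuming the exercise file data/bird_small.mat with an 'A' key holding a 128x128x3 image, as in the original assignment): treat every pixel as a point in 3-D RGB space, cluster the pixels into 16 colors with K-means, then replace each pixel by the color of its centroid.

from scipy.io import loadmat
import numpy as np

image_data = loadmat('data/bird_small.mat')
A = image_data['A'] / 255.                 # scale the pixel values to [0, 1]
X_img = A.reshape(-1, 3)                   # one row per pixel; columns are R, G, B

initial = init_centroids(X_img, 16)        # reuse the helpers defined above
idx_img, centroids_img = run_k_means(X_img, initial, 10)
idx_img = find_closest_centroids(X_img, centroids_img)

X_compressed = centroids_img[idx_img.astype(int), :]   # replace each pixel by its centroid's color
A_compressed = X_compressed.reshape(A.shape)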

The elbow rule

When people discuss ways to choose the number of clusters, one method they may mention is called the elbow rule. For the elbow rule, we vary the value K, the total number of clusters. We first run the K-means clustering method with one cluster, meaning all the data is grouped into a single cluster, and compute the cost (distortion) function J; we then repeat this for increasing K, where K denotes the number of clusters.

We may get a curve that looks like a person's elbow; that is what the elbow rule refers to. Looking at a plot like the one produced below, there is a clear elbow: like a human arm held out, there is a shoulder, an elbow, and a hand. Here the distortion drops very quickly from one cluster to two and then reaches an elbow point at two; after K=2 the distortion decreases very slowly, so it looks right to cluster with two clusters, because that point is the elbow of the curve, and we choose K=2. When you apply the elbow rule and get a graph like the one below, it is a reasonable way to choose the number of clusters.

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

c1x = np.random.uniform(0.5, 1.5, (1, 10))
c1y = np.random.uniform(0.5, 1.5, (1, 10))
c2x = np.random.uniform(3.5, 4.5, (1, 10))
c2y = np.random.uniform(3.5, 4.5, (1, 10))
x = np.hstack((c1x, c2x))
y = np.hstack((c1y, c2y))
X = np.vstack((x, y)).T

K = range(1, 10)
meanDispersions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    meanDispersions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

plt.plot(K, meanDispersions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average Dispersion')
plt.title('Selecting k with the Elbow Method')
plt.show()

# Import the required packages
import scipy
import scipy.cluster.hierarchy as sch
from scipy.cluster.vq import vq, kmeans, whiten
import numpy as np
import matplotlib.pylab as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

newData = np.load("data/filename.npy")
newData

array([[1.18945132, 0.31092235],
       [2.06415695, 0.74854414],
       [1.43769023, 0.80669682],
       [3.23039706, 0.84519783],
       [2.36892693, 0.44480961],
       [0.28997221, 2.79266758],
       [1.2099519 , 0.00638496],
       [2.09689459, 0.22796377],
       [5.50091159, 0.14275827],
       [3.47948639, 0.94978548]])

# 1. Hierarchical clustering
# Distance matrix between points, here using the Euclidean distance:
disMat = sch.distance.pdist(newData, 'euclidean')
# Hierarchical clustering:
Z = sch.linkage(disMat, method='average')
# Display the hierarchical clustering result as a dendrogram and save it as plot_dendrogram.png
P = sch.dendrogram(Z)

DBSCAN density clustering

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# from .agglomerative_clustering import test_AgglomerativeClustering,test_AgglomerativeClustering_nclusters,test_AgglomerativeClustering_linkage
# from .dbscan import test_DBSCAN,test_DBSCAN_epsilon,test_DBSCAN_min_samples
# from chapters.Cluster_EM.gmm import test_GMM,test_GMM_cov_type,test_GMM_n_components
# from .kmeans import test_Kmeans,test_Kmeans_n_init,test_Kmeans_nclusters


def create_data(centers, num=100, std=0.7):
    """Generate a data set for clustering.

    :param centers: array of cluster centers; if a center is two-dimensional, every generated sample is two-dimensional.
    :param num: number of samples.
    :param std: standard deviation of the samples within each cluster.
    :return: the data set used for clustering, a tuple whose first element is the sample set and whose second element is the true cluster labels of the samples.
    """
    X, labels_true = make_blobs(n_samples=num, centers=centers, cluster_std=std)
    return X, labels_true


def plot_data(*data):
    """Plot the data set used for clustering.

    :param data: variable argument; a tuple whose first element is the sample set and whose second element is the true cluster labels of the samples.
    """
    X, labels_true = data
    labels = np.unique(labels_true)
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    colors = 'rgbyckm'  # each cluster is plotted in a different color
    for i, label in enumerate(labels):
        position = labels_true == label
        ax.scatter(X[position, 0], X[position, 1], label="cluster %d" % label,
                   color=colors[i % len(colors)])

    ax.legend(loc="best", framealpha=0.5)
    ax.set_xlabel("X[0]")
    ax.set_ylabel("Y[1]")
    ax.set_title("data")
    plt.show()
from sklearn import cluster
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))


def test_DBSCAN(*data):
    """Test basic DBSCAN usage.

    :param data: variable argument; a tuple whose first element is the sample set and whose second element is the true cluster labels of the samples.
    """
    X, labels_true = data
    clst = cluster.DBSCAN()
    predicted_labels = clst.fit_predict(X)
    print("ARI:%s" % adjusted_rand_score(labels_true, predicted_labels))
    print("Core sample num:%d" % len(clst.core_sample_indices_))


def test_DBSCAN_epsilon(*data):
    """Test the effect of the eps parameter on the DBSCAN clustering result.

    :param data: variable argument; a tuple whose first element is the sample set and whose second element is the true cluster labels of the samples.
    """
    X, labels_true = data
    epsilons = np.logspace(-1, 1.5)
    ARIs = []
    Core_nums = []
    for epsilon in epsilons:
        clst = cluster.DBSCAN(eps=epsilon)
        predicted_labels = clst.fit_predict(X)
        ARIs.append(adjusted_rand_score(labels_true, predicted_labels))
        Core_nums.append(len(clst.core_sample_indices_))

    # plotting
    fig = plt.figure()
    ax = fig.add_subplot(1, 2, 1)
    ax.plot(epsilons, ARIs, marker='+')
    ax.set_xscale('log')
    ax.set_xlabel(r"$\epsilon$")
    ax.set_ylim(0, 1)
    ax.set_ylabel('ARI')

    ax = fig.add_subplot(1, 2, 2)
    ax.plot(epsilons, Core_nums, marker='o')
    ax.set_xscale('log')
    ax.set_xlabel(r"$\epsilon$")
    ax.set_ylabel('Core_Nums')

    fig.suptitle("DBSCAN")
    plt.show()


def test_DBSCAN_min_samples(*data):
    """Test the effect of the min_samples parameter on the DBSCAN clustering result.

    :param data: variable argument; a tuple whose first element is the sample set and whose second element is the true cluster labels of the samples.
    """
    X, labels_true = data
    min_samples = range(1, 100)
    ARIs = []
    Core_nums = []
    for num in min_samples:
        clst = cluster.DBSCAN(min_samples=num)
        predicted_labels = clst.fit_predict(X)
        ARIs.append(adjusted_rand_score(labels_true, predicted_labels))
        Core_nums.append(len(clst.core_sample_indices_))

    # plotting
    fig = plt.figure()
    ax = fig.add_subplot(1, 2, 1)
    ax.plot(min_samples, ARIs, marker='+')
    ax.set_xlabel("min_samples")
    ax.set_ylim(0, 1)
    ax.set_ylabel('ARI')

    ax = fig.add_subplot(1, 2, 2)
    ax.plot(min_samples, Core_nums, marker='o')
    ax.set_xlabel("min_samples")
    ax.set_ylabel('Core_Nums')

    fig.suptitle("DBSCAN")
    plt.show()
centers = [[1, 1], [2, 2], [1, 2], [10, 20]]  # centers used to generate the clusters
X, labels_true = create_data(centers, 1000, 0.5)  # generate the data set for clustering

X

array([[0.81394428, 0.97283296],
       [2.39126942, 1.39691895],
       [0.52294212, 2.28195147],
       ...,
       [0.07392303, 0.23407894],
       [0.92014917, 1.63693107],
       [0.95976852, 1.00443603]])

labels_true

array([0, 1, 2, 1, 2, 2, 3, 3, 3, 3, 3, 2, 2, 1, 1, 0, 0, 0, 2, 1, 3, 1, 0,
       1, 1, 1, 1, 0, 1, 2, ...])

test_DBSCAN(X, labels_true)  # call the test_DBSCAN function

ARI:0.3326689255565291
Core sample num:990

test_DBSCAN_epsilon(X, labels_true)  # call the test_DBSCAN_epsilon function

test_DBSCAN_min_samples(X, labels_true)  # call the test_DBSCAN_min_samples function

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
%matplotlib inline

X1, y1 = datasets.make_circles(n_samples=5000, factor=.6, noise=.05)
X2, y2 = datasets.make_blobs(n_samples=1000, n_features=2, centers=[[1.2, 1.2]], cluster_std=[[.1]],
               random_state=9)

X = np.concatenate((X1, X2))
plt.scatter(X[:, 0], X[:, 1], marker='o')
plt.show()

from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=3, random_state=9).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

from sklearn.cluster import DBSCAN
y_pred = DBSCAN().fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

y_pred = DBSCAN(eps=0.1).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

y_pred = DBSCAN(eps=0.1, min_samples=4).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

Principal Component Analysis

PCA is a linear transformation to find the direction of the “principal component” or maximum variance in a data set. It can be used for dimensionality reduction. In this exercise, we are first responsible for implementing PCA and applying it to a simple two-dimensional data set to see how it works. We start by loading and visualizing the data set.

data = loadmat('data/ex7data1.mat')
X = data['X']

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(X[:, 0], X[:, 1])
plt.show()

The algorithm of PCA is fairly simple. After ensuring that the data is normalized, the output is simply the singular value decomposition of the covariance matrix of the original data.

def pca(X):
    # normalize the features
    X = (X - X.mean()) / X.std()

    # compute the covariance matrix
    X = np.matrix(X)
    cov = (X.T * X) / X.shape[0]

    # perform SVD
    U, S, V = np.linalg.svd(cov)

    return U, S, V

U, S, V = pca(X)
U, S, V

(matrix([[-0.79241747, -0.60997914],
         [-0.60997914,  0.79241747]]),
 array([1.43584536, 0.56415464]),
 matrix([[-0.79241747, -0.60997914],
         [-0.60997914,  0.79241747]]))

print(X.shape)
print(U.shape)
print(S.shape)
print(V.shape)

(50, 2)
(2, 2)
(2,)
(2, 2)

Now we have the principal components (the matrix U), and we can use them to project the original data into a lower-dimensional space. For this task we will implement a function that computes the projection and selects only the top K components, effectively reducing the dimensionality.

def project_data(X, U, k):
    U_reduced = U[:,:k]
    return np.dot(X, U_reduced)

Z = project_data(X, U, 1)

We can also restore the original data by reversing the conversion step.

def recover_data(Z, U, k):
    U_reduced = U[:,:k]
    return np.dot(Z, U_reduced.T)

X_recovered = recover_data(Z, U, 1)

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(list(X_recovered[:, 0]), list(X_recovered[:, 1]))
plt.show()

Note that the projection axis of the first principal component is essentially a diagonal of the data set. When we reduce the data to one dimension, we lose the variation around that diagonal, so in our reconstruction everything lies along that diagonal.
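The face-image part of the exercise mentioned in the introduction is likewise not reproduced in this excerpt. A hedged sketch of it (assuming the exercise file data/ex7faces.mat with an 'X' key of flattened 32x32 grayscale faces, as in the original assignment), reusing the pca, project_data and recover_data functions defined above:

faces = loadmat('data/ex7faces.mat')
X_faces = faces['X']                               # each row is a flattened 32x32 face image

U_f, S_f, V_f = pca(X_faces)
Z_faces = project_data(X_faces, U_f, 100)          # keep the top 100 principal components
X_faces_recovered = recover_data(Z_faces, U_f, 100)

plt.imshow(np.asarray(X_faces_recovered[0]).reshape(32, 32).T, cmap='gray')  # first face after compression
plt.show()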

Reference material

[1] Machine learning course: www.coursera.org/course/ml
[2] Huang Haiguang: https://github.com/fengdu78
[3] GitHub: github.com/fengdu78/Co…
[4] Assignment code: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes/blob/master/codeex7-kmeans%20and%20PCA/ML-Exercise7.ipynb
[5] Markdown file: github.com/fengdu78/Co…
[6] Notes PDF: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes/blob/master/ (machine learning personal notes, complete version v5.4, A4 print edition, PDF)
