• Unsupervised Learning with Python
  • Original author: Vihar Kurama
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: zhmhhu
  • Proofreaders: Jianboy, 7Ethan

Unsupervised learning is a machine learning technique used to find patterns in data. The data supplied to an unsupervised algorithm is unlabeled, which means that only the input variables (X) are given, with no corresponding output variables. In unsupervised learning, the algorithm discovers interesting structure in the data on its own.

Yann LeCun, Facebook's director of AI research, explains that unsupervised learning (teaching machines to learn by themselves, without being explicitly told whether what they are doing is right or wrong) is the key to "true" AI.

Supervised learning vs. unsupervised learning

In supervised learning, the system tries to learn from previously given examples; in unsupervised learning, the system tries to find patterns directly in the given data. So if the dataset is labeled, the problem is a supervised one, and if the dataset is unlabeled, it is an unsupervised one.

[Image: a regression line fitted to labeled data vs. unlabeled data grouped into clusters]

The image above is an example of supervised learning: we use a regression algorithm to find the best-fit line through the features. In unsupervised learning, the input data is divided into clusters based on its features, and the cluster a point belongs to is what gets predicted.
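
To make the contrast concrete, here is a minimal sketch of the two settings using scikit-learn (the library used throughout this article); the choice of models here is illustrative, not from the original:

# A supervised model is fitted on features AND labels;
# an unsupervised model sees only the features.
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: learn the mapping from X to the given labels y
supervised_model = LinearRegression().fit(X, y)

# Unsupervised: find structure in X alone; no labels are given
unsupervised_model = KMeans(n_clusters=3).fit(X)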

Important terms

Feature: an input variable used to make predictions.

Prediction: a model's output when given an input example.

Example: a row of the dataset. An example contains one or more features and possibly a label.

Label: the result of the feature.
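
As a quick illustration, here is how these terms map onto a single row of the Iris dataset used in the rest of this article:

from sklearn.datasets import load_iris

iris = load_iris()

example = iris.data[0]            # an example: one row of the dataset
features = example                # its features: the four measurements
label = iris.target[0]            # its label: the species of this example

print(features)                   # [5.1 3.5 1.4 0.2]
print(iris.target_names[label])   # setosa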

Preparing data for unsupervised learning

In this article, we use the Iris dataset to make our first predictions. The dataset contains 150 records with five attributes: petal length, petal width, sepal length, sepal width, and species. The three species are Iris setosa, Iris virginica, and Iris versicolor. Our unsupervised algorithm is given the four features of an iris and predicts which species it belongs to.

We use the scikit-learn library in Python to load the Iris dataset and the matplotlib library to visualize the data. The following code snippet explores the dataset.

# import module
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the dataset
iris_df = datasets.load_iris()

# Available attributes of the dataset
print(dir(iris_df))

# features
print(iris_df.feature_names)

# target
print(iris_df.target)

# target name
print(iris_df.target_names)
label = {0: 'red', 1: 'blue', 2: 'green'}  # class-to-color map (not used below)

# Dataset slicing
x_axis = iris_df.data[:, 0]  # Sepal length
y_axis = iris_df.data[:, 2]  # Petal length

# Plot the points, colored by species
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()
['DESCR', 'data', 'feature_names', 'target', 'target_names']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['setosa' 'versicolor' 'virginica']

Purple: Setosa, green: Versicolor, yellow: Virginica

Clustering

In clustering, the data is divided into several groups. In short, the aim is to group points with similar characteristics together and assign them to corresponding clusters.

A visual example:

In the figure above, the image on the left shows raw, unclassified data, and the image on the right shows the same data clustered (classified according to its characteristics). When the model is given an input to predict, it checks which cluster the input belongs to based on its features and makes the prediction accordingly.

K-means clustering algorithm in Python

K-means is an iterative clustering algorithm that aims to find a local maximum in each iteration. First, the desired number of clusters is chosen. Since we know three species are involved, we program the algorithm to group the data into three clusters by passing the parameter n_clusters to our K-means model. Three points (centroids) are then assigned at random, one per cluster. Each input point is assigned to a cluster based on its distance to each centroid, after which the centroids of all clusters are recalculated.

Each cluster's centroid is a collection of feature values that defines the resulting group. Examining the centroid's feature weights can be used to qualitatively interpret what kind of group each cluster represents.
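
The loop described above is small enough to sketch directly in NumPy. This is a simplified illustration of the idea, not the scikit-learn implementation used below, and it assumes no cluster ever becomes empty:

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

The scikit-learn version used below adds the practical details this sketch omits: convergence checks, multiple restarts, and handling of empty clusters.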

We import the KMeans model from the scikit-learn library, fit it to the features, and make predictions.

K-means algorithm implementation in Python:

# import module
from sklearn import datasets
from sklearn.cluster import KMeans

# Load the dataset
iris_df = datasets.load_iris()

# declare model
model = KMeans(n_clusters=3)

# Fit model
model.fit(iris_df.data)

# Predict a single input
predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])

# Predict the entire dataset
all_predictions = model.predict(iris_df.data)

# Print predictions
print(predicted_label)
print(all_predictions)
[0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2
 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2]
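
As noted above, the fitted centroids can be inspected directly via the model's cluster_centers_ attribute, and since the cluster numbering is arbitrary, one reasonable check (an addition to the original code) is to count how many of each true species fall into each cluster:

import numpy as np

# Each row is one centroid; the columns follow iris_df.feature_names
print(model.cluster_centers_)

# Count how many samples of each species landed in each cluster
for cluster in range(3):
    counts = np.bincount(iris_df.target[all_predictions == cluster], minlength=3)
    print(cluster, dict(zip(iris_df.target_names, counts)))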

Hierarchical clustering

As the name implies, hierarchical clustering is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. The two closest clusters are then merged into one. The algorithm ends when only a single cluster remains.
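
The merge sequence can be observed directly in SciPy's linkage matrix, where each row records one merge step. A small illustration on four toy points (not part of the original article):

import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0], [0.1], [5.0], [5.1]])

# Each row is one merge: the two cluster indices merged, the distance
# between them, and the number of points in the newly formed cluster
print(linkage(points, method='complete'))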

A dendrogram can be used to visualize the result of hierarchical clustering. Now let's look at an example of hierarchical clustering on grain data. The dataset can be found here.

Implementation of the hierarchical clustering algorithm in Python:

# import module
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
import pandas as pd

# read DataFrame
seeds_df = pd.read_csv(
    "https://raw.githubusercontent.com/vihar/unsupervised-learning-with-python/master/seeds-less-rows.csv")

# Remove the grain variety from the DataFrame and save it for later
varieties = list(seeds_df.pop('grain_variety'))

# Extract the measurements as a NumPy array
samples = seeds_df.values

"""Perform hierarchical clustering of samples using the linkage() function with method ='complete' keyword argument. Merge the results. """
mergings = linkage(samples, method='complete')

"""Use the dendrogram() function to draw a tree diagram at merge time, specifying the keyword arguments labels = Varieties, leaf_rotation = 90, and leaf_font_size = 6. """
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
           )

plt.show()
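
The dendrogram shows the full hierarchy; to cut it into a fixed number of flat clusters, SciPy's fcluster() function can be used. A short continuation of the code above (this step is an addition, not in the original):

from scipy.cluster.hierarchy import fcluster

# Cut the hierarchy into 3 flat clusters and compare with the known varieties
flat_labels = fcluster(mergings, t=3, criterion='maxclust')
for variety, cluster in zip(varieties, flat_labels):
    print(cluster, variety)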

Differences between K-means and hierarchical clustering

  • Hierarchical clustering doesn't handle big data well, but K-means clustering does. This is because the time complexity of K-means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).
  • In K-means clustering, since we start with an arbitrary choice of clusters, the results of running the algorithm multiple times may differ. In hierarchical clustering, the results are reproducible (see the sketch after this list).
  • K-means works well when the clusters are hyperspherical (circles in 2D, spheres in 3D).
  • K-means does not tolerate noisy data, whereas in hierarchical clustering a noisy dataset can be clustered directly.
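
A small sketch of the reproducibility point above, using scikit-learn's random_state parameter to control the initialization (this example is an addition, not from the original):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Different random initializations can give different results
# (at minimum, the arbitrary numbering of the clusters changes)
for seed in (0, 1, 2):
    labels = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)
    print(labels[:10])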

t-SNE clustering

t-SNE is one of the unsupervised methods for visualization. t-SNE stands for t-distributed Stochastic Neighbor Embedding. It maps a high-dimensional space into a two- or three-dimensional space that can be visualized. Specifically, it models each high-dimensional object as a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.

t-SNE implementation in Python for the Iris dataset:

# import module
from sklearn import datasets
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load the dataset
iris_df = datasets.load_iris()

# Define model
model = TSNE(learning_rate=100)

# Fit model
transformed = model.fit_transform(iris_df.data)

# Plot the 2D t-SNE embedding
x_axis = transformed[:, 0]
y_axis = transformed[:, 1]

plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()

Purple: Setosa, green: Versicolor, yellow: Virginica

Here, since the Iris dataset has four features (4D), it is transformed and represented as a two-dimensional figure. In the same way, the t-SNE model can be applied to a dataset with n features.
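
For example, the same two-step pattern works unchanged on scikit-learn's digits dataset, which has 64 features per sample (a sketch to illustrate the n-feature claim, not part of the original):

from sklearn import datasets
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Each digit image is 8x8 pixels, i.e. 64 features
digits = datasets.load_digits()

model = TSNE(learning_rate=100)
transformed = model.fit_transform(digits.data)

plt.scatter(transformed[:, 0], transformed[:, 1], c=digits.target)
plt.show()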

DBSCAN clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm used as an alternative to K-means in predictive analytics. It does not require the number of clusters as input, but you do have to tune two other parameters.

The scikit-learn implementation provides default values for the eps and min_samples parameters, but you will usually need to tune them. eps is the maximum distance between two data points for them to be considered part of the same neighborhood. min_samples is the minimum number of data points in a neighborhood required to form a cluster.

DBSCAN clustering in Python

# import module
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Load the dataset
iris = load_iris()

# declare model
dbscan = DBSCAN()

# fitting
dbscan.fit(iris.data)

# Use PCA for transformations
pca = PCA(n_components=2).fit(iris.data)
pca_2d = pca.transform(iris.data)

# Plot each point according to its DBSCAN label
for i in range(0, pca_2d.shape[0]):
    if dbscan.labels_[i] == 0:
        c1 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='r', marker='+')
    elif dbscan.labels_[i] == 1:
        c2 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='g', marker='o')
    elif dbscan.labels_[i] == -1:
        c3 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='b', marker='*')

plt.legend([c1, c2, c3], ['Cluster 1', 'Cluster 2', 'Noise'])
plt.title('DBSCAN finds 2 clusters and Noise')
plt.show()
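
The effect of the two parameters can be checked by looking at the labels the model assigns; noise points receive the label -1. A short sketch continuing from the code above (the eps value here is an arbitrary example, not a recommendation):

import numpy as np

# -1 marks noise; every other value is a discovered cluster
print(np.unique(dbscan.labels_, return_counts=True))

# A larger eps merges more points into the same neighborhood
looser = DBSCAN(eps=0.9, min_samples=5).fit(iris.data)
print(np.unique(looser.labels_, return_counts=True))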

More unsupervised techniques:

  • Principal Component Analysis (PCA)
  • Anomaly detection
  • Autoencoders
  • Deep belief networks
  • Hebbian learning
  • Generative Adversarial Networks (GANs)
  • Self-organizing maps

Important links:

Supervised learning algorithms in Python:

  • Supervised Learning with Python
  • Why Artificial Intelligence and Machine Learning?
  • Introduction to Machine Learning: the idea of learning from examples and experience without being explicitly programmed.
  • Deep Learning with Python: mimicking the human brain.
  • Linear Algebra for Deep Learning: the math behind every deep learning program.

Afterword

Thanks for reading. If you find this article useful, click ❤️ below to spread the love.

If you find any mistakes in the translation or other areas that need improvement, you are welcome to revise the translation in the Nuggets Translation Project and submit a PR, for which you can earn corresponding reward points. The permanent link at the beginning of this article is the MarkDown link to this article on GitHub.


The Nuggets Translation Project is a community that translates high-quality technical articles from across the Internet, sourced from English articles shared on Nuggets. The content covers Android, iOS, front-end, back-end, blockchain, products, design, artificial intelligence, and other fields. For more high-quality translations, please follow the Nuggets Translation Project, its official Weibo, and its Zhihu column.