Chapter 13: Introduction to Unsupervised Learning

This article is part of a code reproduction of the book "Statistical Learning Methods" by Li Hang. Author of the reproduction: Huang Haiguang.

Note: the code can be downloaded from GitHub. I will continue to publish the code on the WeChat public account "Machine Learning Beginners", where this series can be read online.

1. Machine learning, or statistical learning, generally includes supervised learning, unsupervised learning, and reinforcement learning.

Unsupervised learning refers to machine learning problems in which a model is learned from unlabeled data. The essence of unsupervised learning is to discover statistical regularities or latent structures in the data; its main tasks are clustering, dimensionality reduction, and probability model estimation.

2. Unsupervised learning can be used to analyze existing data and to predict future data. The learned model can be a function z = g(x), a conditional probability distribution P(z|x), or a conditional probability distribution P(x|z).

The basic idea of unsupervised learning is to apply some kind of "compression" to the given data (a data matrix) in order to find its underlying structure, on the assumption that the compression with the least loss preserves the most essential structure. Mining the vertical (column-wise) structure of the data corresponds to clustering; mining the horizontal (row-wise) structure corresponds to dimensionality reduction; mining both at the same time corresponds to probabilistic model estimation.

3. Clustering assigns similar samples (instances) in a sample set to the same class and dissimilar samples to different classes. Clustering is divided into hard clustering, where each sample belongs to exactly one class, and soft clustering, where each sample can belong to several classes with certain probabilities. Clustering methods include hierarchical clustering and k-means clustering.
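
As a concrete illustration of hard clustering, the k-means idea can be sketched in a few lines of NumPy. The function name, the naive first-k-samples initialization, and the toy data are my own choices for illustration; practical implementations use k-means++ initialization and handle empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd's algorithm: alternate hard assignment and centroid update."""
    # Naive init: take the first k samples as centers (k-means++ is better in practice).
    centers = X[:k].astype(float)
    for _ in range(n_iter):
        # Hard assignment: each sample goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the mean of the samples assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On two well-separated point clouds this converges in a couple of iterations; the "hard" part is visible in `argmin`, which forces every sample into exactly one cluster.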

4. Dimensionality reduction transforms the samples (instances) in a sample set from a high-dimensional space to a low-dimensional space. Assuming the samples originally lie, exactly or approximately, in a low-dimensional space, dimensionality reduction yields a better representation of the structure of the data, that is, of the relationships between samples. Dimensionality reduction methods can be linear or nonlinear; principal component analysis (PCA) is the classic linear method.
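
PCA can be sketched as an SVD of the centered data matrix: the top right-singular vectors are the directions of maximum variance, and projecting onto them gives the low-dimensional representation. The function name and interface below are illustrative, not from the book.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                    # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]             # rows = directions of max variance
    return Xc @ components.T, components       # (projected data, components)
```

For data that lies exactly on a line, a single component recovers that line's direction, which is the sense in which PCA "better represents the structure" of low-dimensional data.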

5. Probabilistic model estimation assumes that the training data are generated by a probabilistic model, and learns the structure and parameters of that model from the data. Probabilistic models include mixture models, probabilistic graphical models, and so on. Probabilistic graphical models are further divided into directed and undirected graphical models.
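
A mixture model is typically estimated with the EM algorithm, alternating a soft assignment of points to components (E-step) with weighted maximum-likelihood parameter updates (M-step). Below is a minimal sketch for a one-dimensional two-component Gaussian mixture; the function name, the quantile-based initialization, and the fixed iteration count are simplifying assumptions of mine.

```python
import numpy as np

def gmm_em_1d(x, k=2, n_iter=200):
    """EM for a 1-D Gaussian mixture: soft assignment (E) + weighted MLE updates (M)."""
    w = np.full(k, 1.0 / k)                         # mixing weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over the data
    var = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances from the soft counts.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

The soft responsibilities in the E-step are exactly the "soft clustering" of point 3: every sample belongs to every component with some probability.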

6. Topic analysis is a technique of text analysis. Given a collection of texts, topic analysis aims to discover the topics of each text in the collection, where a topic is represented by a set of words. Topic analysis methods include latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), and latent Dirichlet allocation (LDA).
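
Of these, latent semantic analysis is the simplest to sketch: it applies a truncated SVD to a word-document count matrix, so that each document gets a vector of topic weights. The tiny count matrix below is invented for illustration.

```python
import numpy as np

# Toy word-document count matrix: rows = words, columns = documents.
# Docs 0-1 share "rocket/space" words, docs 2-3 share "recipe/cook" words.
X = np.array([
    [3, 2, 0, 0],   # "rocket"
    [2, 3, 0, 0],   # "space"
    [0, 0, 2, 1],   # "recipe"
    [0, 0, 1, 2],   # "cook"
], dtype=float)

# LSA: keep the top-k singular triplets; columns of U are topics (word weights,
# up to sign), and S @ Vt gives each document's weight on each topic.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
doc_topics = (np.diag(S[:k]) @ Vt[:k]).T   # document-by-topic matrix
```

In the reduced space, the two "space" documents end up nearly parallel to each other and orthogonal to the "cooking" documents, which is how LSA groups texts by topic.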

7. The purpose of graph analysis is to uncover the statistical regularities or latent structures hidden in a graph. Link analysis is a kind of graph analysis whose main aim is to find the important nodes of a directed graph; a representative method is the PageRank algorithm.
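
PageRank can be computed by power iteration on the damped transition matrix of the directed graph. The adjacency-matrix interface and the damping factor d = 0.85 below are conventional choices of mine, not prescribed by the text.

```python
import numpy as np

def pagerank(A, d=0.85, tol=1e-10):
    """Power iteration for PageRank. A[i, j] = 1 means a link from node i to node j."""
    n = len(A)
    out = A.sum(axis=1, keepdims=True)
    # Column-stochastic transition matrix; dangling nodes link uniformly everywhere.
    M = np.where(out > 0, A / np.where(out == 0, 1, out), 1.0 / n).T
    r = np.full(n, 1.0 / n)
    while True:
        # With probability d follow a link, otherwise jump to a uniformly random node.
        r_new = (1 - d) / n + d * M @ r
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
```

On a star-shaped graph where every node links to node 0, node 0 receives the highest rank, matching the intuition that heavily linked-to nodes are important.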

Download address

github.com/fengdu78/li…

References:

[1] Statistical Learning Methods: baike.baidu.com/item/ Statistical Learning Methods…

[2] Huang Haiguang: github.com/fengdu78

[3] GitHub: github.com/fengdu78/li…

[4] wzyonggege: github.com/wzyonggege/…

[5] WenDesi: github.com/WenDesi/lih…

[6] very hot very hot: blog.csdn.net/tudaodiaozh…

[7] HKTXT: github.com/hktxt/Learn…