Random Forest is an ensemble algorithm for classification and regression that combines multiple decision trees into a "forest." For classification, the forest's prediction is the mode (majority vote) of the individual trees' predictions; for regression, it is their average.

Because it has many advantages, it is widely adopted:

1. It can handle high-dimensional data without explicit feature selection, automatically ranking which features are important.

2. Training is fast and classification accuracy is high.

3. It can detect interactions between features.

4. It is much less prone to overfitting than a single decision tree (though overfitting is still possible).
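The majority-vote idea above can be sketched with scikit-learn; the iris dataset and the hyperparameters here are illustrative, not from the article.

```python
# Minimal Random Forest sketch with scikit-learn (illustrative dataset/params).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is fit on a bootstrap sample of the training data;
# the forest classifies by majority vote over the trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))
# feature_importances_ reflects the automatic feature ranking mentioned above.
print(clf.feature_importances_)
```

The `feature_importances_` attribute illustrates advantage 1: the forest itself reports which features mattered, without a separate feature-selection step.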

EM (Expectation Maximization) is an iterative method for maximum likelihood estimation. It is a parameter estimation method.

The basic idea is to choose the parameters that maximize the probability of the observed random samples. If we know the form of the probability distribution, we can obtain the final estimates by finding the parameters that maximize that probability.

Refer to article 1: given the heights of 100 male students drawn from a distribution whose mean and variance are unknown, maximum likelihood estimation chooses the mean and variance that maximize the probability of observing exactly those heights.

The maximum likelihood function is as follows, for independent samples x_1, ..., x_n drawn from a distribution with parameter θ:

L(θ) = p(x_1; θ) · p(x_2; θ) · ... · p(x_n; θ)

Since this is a product, taking the logarithm simplifies it to a sum:

log L(θ) = log p(x_1; θ) + log p(x_2; θ) + ... + log p(x_n; θ)

The maximum can be found by setting the derivative to zero, or numerically with Newton's method or gradient descent (the usual approach in practice).
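For the height example above, the derivative-equals-zero step has a closed-form answer if we assume the heights are Gaussian: the MLE of the mean is the sample mean, and the MLE of the variance is the biased sample variance (dividing by n, not n-1). The simulated data below is illustrative only.

```python
# Worked MLE example: Gaussian heights of 100 "students" (simulated data).
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=175.0, scale=6.0, size=100)

# Setting d(log L)/d(mu) = 0 gives the sample mean.
mu_hat = heights.mean()
# Setting d(log L)/d(sigma^2) = 0 gives the biased sample variance (divide by n).
var_hat = ((heights - mu_hat) ** 2).mean()

print(mu_hat, var_hat)
```

The estimates should land near the true mean 175 and variance 36 used to generate the sample.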


HMM, the hidden Markov model, is widely used in natural language processing, for example in Chinese word segmentation, part-of-speech tagging, and speech recognition.

In a typical hidden Markov model, the next state depends only on the present state and on nothing earlier (the Markov assumption). Although this assumption is not strictly correct and may discard important information, it greatly simplifies the model and its computation while still yielding useful results, so it is often used in practice.

Consider a classic HMM example, as shown below.

HMM solves three basic problems:

1. Evaluation: given the model parameters and an observation sequence, compute the probability of that observation sequence under the model (solved with the forward algorithm).

2. Decoding: given the observation sequence, find the most likely hidden state sequence (solved with the Viterbi algorithm).

3. Learning: adjust the model parameters to maximize the probability of the observation sequence (solved with the Baum-Welch algorithm).
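Problem 1 above can be sketched directly: the forward algorithm sums over all hidden state paths in O(T·N²) time. The toy parameters below (2 hidden states, 2 observation symbols) are made up for illustration.

```python
# Forward algorithm sketch for the HMM evaluation problem (toy parameters).
import numpy as np

pi = np.array([0.6, 0.4])                # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities

def forward(obs):
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # recurse over time steps
    return alpha.sum()                   # P(O | model)

print(forward([0, 1, 0]))
```

Note that the probabilities of all possible observation sequences of a given length sum to 1, which is a handy sanity check on an implementation.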

LDA (Latent Dirichlet Allocation) is a topic model used in image classification, text classification, and topic-word extraction.

It is a three-layer Bayesian probabilistic model over words, topics, and documents.

It can also alleviate the sparsity of the document-word matrix, since each document is represented by a small number of topics rather than raw word counts.

A topic model, in its simplest form, is an algorithm for discovering the topics running through a large collection of documents.

With it you can measure the similarity or distance between two documents. As a semantic mining technique built on the topic model, it can distinguish two articles by their semantics rather than by word frequency alone.

There are two commonly used topic-model algorithms: pLSA and LDA. See article 6 for more.
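LDA's document-topic output can be sketched with scikit-learn; the tiny corpus and topic count below are illustrative only.

```python
# Minimal LDA sketch with scikit-learn (illustrative corpus and topic count).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock market prices fell",
    "investors traded shares on the market",
]
counts = CountVectorizer().fit_transform(docs)  # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)          # per-document topic mixture

print(doc_topics)  # one row per document; each row is a distribution over topics
```

Comparing the topic-mixture rows (e.g. with a distance measure) is what lets LDA judge document similarity semantically rather than by shared word counts.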



1. http://www.cnblogs.com/openeim/p/3921835.html

2. http://www.cnblogs.com/skyme/p/4651331.html (HMM)

3. http://blog.csdn.net/app_12062011/article/details/50408664#t6 (HMM applied to natural language processing, in detail)

4. http://www.52nlp.cn/hmm-learn-best-practices-and-cui-johnny-blog

5. http://blog.csdn.net/daringpig/article/details/8072794

6. http://blog.csdn.net/huagong_adu/article/details/7937616 (the difference between LDA and TF-IDF)