1. Foreword

If you want to get into data mining or machine learning, you need to master the common machine learning algorithms, which include the following:

  • Supervised learning algorithms: logistic regression, linear regression, decision trees, Naive Bayes, K-Nearest Neighbors, support vector machines, ensemble algorithms such as AdaBoost, etc.
  • Unsupervised learning algorithms: clustering, dimensionality reduction, association rules, PageRank, etc.

Already published:

  • Algorithm theory + practice: K-Nearest Neighbors
  • Algorithm theory + practice: Decision Tree
  • Algorithm theory + practice: Naive Bayes
  • Algorithm theory + practice: Support Vector Machine (SVM)
  • Algorithm theory + practice: AdaBoost
  • Algorithm theory + practice: K-Means clustering
  • Algorithm theory + practice: PCA dimensionality reduction

For a detailed understanding of the principles I have read the watermelon book (Zhou Zhihua's Machine Learning), Statistical Learning Methods and other machine learning texts, and listened to some machine learning courses, but the wording always felt abstruse and I lacked the patience to read on. Theory is everywhere, but practice is what matters most, so here I want to use the plainest possible language to write a vernacular theory + practice series on machine learning algorithms.

In my opinion, understanding the idea behind an algorithm and how to use it matters more than understanding its mathematical derivation. The idea gives you an intuitive feel, so you can see why the algorithm is reasonable; the mathematical derivation just expresses that reasonableness in more rigorous language. For example, you can state in mathematical language that a pear has 90% sugar content, but only by taking a bite yourself can you truly feel how sweet the pear is, and truly understand what 90% sugar content tastes like. If these machine learning algorithms are pears, the primary purpose of this series is to take a bite out of them. There are also several other purposes:

  • Test my own understanding of the algorithms and summarize the algorithm theory.
  • Learn the core ideas of these algorithms in an enjoyable way, find the fun in them, and lay a foundation for studying them in depth.
  • Attach a practical case to the theory of each algorithm, so that what is learned can really be applied; this not only exercises programming ability but also deepens the grasp of the theory.
  • Put all my previous notes and references in one place, so they are convenient to look up later.

Learning an algorithm should not only give you the theory, but also be fun and build the ability to solve practical problems!

Today is the ninth article in the vernacular machine learning algorithm theory + practice series: EM clustering. This is an unsupervised learning algorithm. If you use a model based on maximum likelihood estimation and the model contains hidden variables, you need the EM algorithm to estimate the parameters. So EM is maximum likelihood estimation of probability-model parameters when hidden variables are present. Don't worry if you don't yet know what hidden variables or maximum likelihood are; through today's learning you can quickly grasp how the EM algorithm works, understand maximum likelihood, and finally call a toolkit to implement EM and complete a clustering of Honor of Kings heroes (when playing the game, have you ever had an opponent grab the hero you are best at? How do you then pick another hero that suits you just as well?). Ha ha, can't wait? Let's get started.

The outline is as follows:

  • Draw out the core idea of the EM algorithm from a life scenario (dividing a dish evenly: how would you do it?)
  • Explain how the EM algorithm works in plain language, with an example (you've tossed coins before, right?)
  • Look at EM clustering (how is it different from K-Means?)
  • EM algorithm in practice: clustering Honor of Kings heroes (after clustering we can find interchangeable heroes, so you are not afraid of an opponent picking the hero you are good at)

OK, let’s go!

2. The EM algorithm: let's start from dividing a dish!

This may be the first time you have heard of EM, but did you know that you have probably used its idea more than once in your life? As I always say, these algorithms come from life. What, you don't believe it? Let's look at the following scenario:


Suppose you stir-fry a dish and I ask you to divide it evenly between two plates. How would you do it? Divide it exactly in half? You could take a scale, weigh out exactly half and split it that way, but if you actually do that you are a genius and don't need machine learning. I think the way most people do it is this: first put some of the food on plate A, put the rest on plate B, then check whether the two plates hold the same amount; if not, move a little from the fuller plate to the other, then check again whether plate A and plate B hold the same amount... and repeat the whole process until the portions no longer change.

Which method would you use? If it is the latter, then congratulations: you already have a rudimentary feel for the EM algorithm, and you have worked with it more than once. So relax and read on.

In this example you can see three main steps: initialize the parameters, observe the expectation, and re-estimate. First you initialize some amount of food on each plate and then observe whether the amounts meet your expectation; these two steps together are the Expectation step, called the E step. If the result is off, you re-estimate the parameters; this is the Maximization step, called the M step. These two steps together make up the EM algorithm. Ha ha, is it suddenly clear? While it's fresh, let's look at how the EM algorithm works in detail.

3. Working principle of EM algorithm

Before the EM algorithm, we first need to look at the concept of Maximum Likelihood. "Likelihood" here means possibility or probability, so maximum likelihood simply means the greatest possibility.

What is maximum likelihood?


For example, there are two students, a boy and a girl, and we want to guess which one is taller. From experience, boys of the same age are on average taller than girls, so the boy is very likely to be the taller one. That is the idea of maximum likelihood.

Then comes the next question: what is maximum likelihood estimation?


It means that an event has already happened, and we deduce which cause was most likely to have produced it. Take the height example: if we know that one of the two is taller, then that person is most likely the boy. Maximum likelihood estimation is a method for estimating parameters from known results.

So what does all this have to do with anything? What exactly is the EM algorithm, and what does it have to do with maximum likelihood estimation? (You've got me there...)

In fact, the EM algorithm is a way of solving maximum likelihood estimation: by observing the samples, it finds the model parameters behind those samples.
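
To make maximum likelihood concrete, here is a tiny sketch with made-up numbers (not taken from the article): one coin is tossed 10 times and comes up heads 7 times, and we ask which head-probability makes that observation most likely.

import numpy as np
from scipy.stats import binom

# Made-up observation: 7 heads out of 10 tosses
heads, tosses = 7, 10

# Likelihood of the observation for a grid of candidate head-probabilities
thetas = np.linspace(0.01, 0.99, 99)
likelihoods = binom.pmf(heads, tosses, thetas)

# The maximum likelihood estimate is the theta that makes the data most probable
print(thetas[np.argmax(likelihoods)])   # ~0.7

Unsurprisingly, the answer is 7/10: maximum likelihood simply picks the parameter value under which what we actually observed is most probable.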


Going back to the dish-dividing example at the beginning: what we ultimately want is the amount of food on plate A and on plate B, and you can think of those amounts as the model parameters we want to figure out. We observe through the E step of the EM algorithm, adjust the parameters of A and B through the M step, and stop when the amounts on plate A and plate B no longer change.

Then you suddenly realize: oh, the principle of the EM algorithm is that simple? Ha ha, don't get overconfident; the problems we actually encounter are more complicated than dividing dishes. Don't believe me? Look at the next example I prepared for you:

You have all flipped a coin before. Suppose we have two coins, A and B, and we do five rounds of experiments, each round consisting of ten flips of a single coin (either A or B, only one coin per round). We then count the number of heads in each round, with the results shown in the table. Now I ask you: can you work out the probability of heads for coin A and for coin B?

How's that? You can't, because in each round you don't know whether coin A or coin B was used. What to do? Well, suppose I refine the table for you and tell you which coin was used in each round. Can you give the answer now? You say: that's easy; just call the probability that A comes up heads θA and the probability that B comes up heads θB, and count. Ha ha, beautiful! But remember what I said at the beginning:


If you use a model based on maximum likelihood estimation and the model contains hidden variables, you need the EM algorithm to estimate the parameters.

Here the second column (which coin was used) is the hidden data, and A and B are the hidden variables. In reality we do not know this column; all you are given at the start is the round number and the number of heads. So what do you do?

In other words, we cannot estimate θA and θB without knowing whether each round used coin A or coin B; but to know whether each round used A or B, we first need θA and θB, so that we can use the maximum-likelihood idea to guess, from the number of heads in each round, which coin was used. A little convoluted, right? You will find that this is a chicken-and-egg problem with no direct solution!

So after all that, what should we do? This is where the idea of the EM algorithm comes in.


I randomly initialize a θA and a θB. With these two parameters, we can use maximum likelihood to estimate, for each round, whether coin A or coin B was used. Then, based on which coin each round is assigned to, we can in turn use maximum likelihood to compute new values of θA and θB, use those to re-estimate which coin each round used, compute new θA and θB again, and so on. When the new θA and θB are the same as the previous ones, it means these θA and θB may well be the real values. This is the basic version of EM.

Well, based on what I said above, let's see how it works. There are two questions to answer first:


  1. Are the new estimates of θA and θB necessarily closer to the real θA and θB? The answer is yes, they definitely get closer to the real θA and θB; the math can prove it, but that is beyond the scope of this article, so see other books or articles. (It is like dividing the dish evenly: each adjustment gets you a little closer.)
  2. Does the iteration necessarily converge to the real θA and θB? The answer is: not necessarily. It depends on the initial values of θA and θB, but generally it does.

In fact, what I described above is only the basic version. Why? Because when we first computed the probabilities above, we directly assigned the first round to coin A.

Don't you think that is too hard, too absolute? Coin B is less likely than coin A to produce 5 heads, but the probability is not zero; it is still possible, and we should take that possibility into account. So instead, the probability that coin A was used in the first round is 0.246 / (0.246 + 0.015) = 0.9425, and the probability that coin B was used is 1 - 0.9425 = 0.0575.

Compared with the previous method, where we directly assigned the first round to coin A by maximum likelihood, we are now more cautious: we only say there is a 0.9425 probability that it was coin A and a 0.0575 probability that it was coin B; it is no longer either/or. So when we estimate θA and θB we can use the data from every round, not just a few rounds, which is obviously better.

In this step we are actually estimating, for each round, a probability distribution over A and B; this is called the E step.

So, for each round of the experiment, we work out a table with the probabilities of A and B. Then, combining these probabilities with the observed counts, we re-estimate the new θA and θB by maximum likelihood.


For example, take coin A in the first round, where there were 5 heads and 5 tails:

  • 0.9425 × 5 = 4.7125 expected heads attributed to coin A
  • 0.9425 × 5 = 4.7125 expected tails attributed to coin A

Doing this for all five rounds gives the table for coin A, and the new θA = 4.22 / (4.22 + 7.98) = 0.35. After changing the estimation method for coins A and B in this way, the new estimate of θA is closer to the real value, because we use the data from every round rather than from only some rounds.

PS: the table here is just a screenshot to illustrate the idea; its numbers are not the real values. You can see that the first-round numbers are the ones I calculated above. Following the same method, compute for each round the expected heads and tails attributed to coin A and to coin B, and from those the new θA and θB come out a bit more accurately.

In this step, we take the per-round probability distribution over coin A and coin B obtained in the E step, combine it with all the data, and re-estimate the new θA and θB by maximum likelihood. This is called the M step.

This is the advanced EM algorithm.
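
To tie the E step and the M step together, here is a minimal sketch of this upgraded (soft) EM for the two-coin problem. The per-round head counts and the initial guesses below are made up for illustration; they are not the numbers from the article's table.

import numpy as np
from scipy.stats import binom

# Made-up data: heads observed in 5 rounds of 10 tosses each
heads = np.array([5, 9, 8, 4, 7])
tosses = 10

theta_a, theta_b = 0.6, 0.5          # arbitrary initial guesses
for _ in range(100):
    # E step: probability that each round used coin A (resp. B),
    # given the current theta_a and theta_b
    like_a = binom.pmf(heads, tosses, theta_a)
    like_b = binom.pmf(heads, tosses, theta_b)
    w_a = like_a / (like_a + like_b)
    w_b = 1.0 - w_a

    # M step: re-estimate theta_a and theta_b from the expected head/tail counts
    new_a = np.sum(w_a * heads) / np.sum(w_a * tosses)
    new_b = np.sum(w_b * heads) / np.sum(w_b * tosses)

    if abs(new_a - theta_a) < 1e-6 and abs(new_b - theta_b) < 1e-6:
        break
    theta_a, theta_b = new_a, new_b

print(round(theta_a, 3), round(theta_b, 3))

If you replace the soft weights w_a with hard 0/1 assignments (assign each round entirely to whichever coin is more likely), you get the basic version of EM described first.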

At this point, the EM algorithm is pretty much covered. Do you understand it now?

Briefly summarize the above steps:


You can see that the E step of the EM algorithm computes the hidden variables from the old parameters, and the M step re-estimates the parameters from the results obtained for the hidden variables. This repeats until the parameters stop changing and we get what we want.

Next, let's look at how the EM algorithm is used for clustering. K-Means clustering was introduced earlier; both are clustering methods, so what is the difference?

4. Working principle of EM clustering

The EM algorithm is commonly used for clustering, i.e. in unsupervised models. Unsupervised learning has no labels (no y), but EM can estimate a hidden state (a pseudo-label) for the unsupervised data; once the data carries these hidden labels, the problem can be turned into a supervised-style model and solved by maximum likelihood estimation to obtain the optimal parameters. Estimating the hidden state corresponds to the E step of EM, and solving by maximum likelihood estimation corresponds to the M step.

Compared with the K-Means algorithm, EM clustering is more flexible. For example, in the two cases shown below, K-Means gives the clustering results in the figure. Because K-Means separates samples by distance, and during computation each sample can belong to only one class, it is called a hard clustering algorithm. In EM clustering, by contrast, each sample has a certain probability of belonging to each cluster, so it is called a soft clustering algorithm.
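
Here is a small sketch of the hard-versus-soft difference on made-up two-dimensional data (the data and parameters are assumptions for illustration, not the figures from the article):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Made-up data: two elongated, overlapping blobs
rng = np.random.RandomState(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[4, 0], [0, 0.2]], 150),
    rng.multivariate_normal([0, 2], [[4, 0], [0, 0.2]], 150),
])

# Hard clustering: K-Means assigns each sample to exactly one cluster
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])                       # one label per sample, e.g. [1 1 0 1 0]

# Soft clustering: GMM gives each sample a probability for every cluster
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
print(gmm.predict_proba(X)[:5].round(3))     # a probability distribution per sample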


You can think of the EM algorithm as a framework into which different models can be plugged. Common EM clustering models include GMM (the Gaussian Mixture Model) and HMM (the Hidden Markov Model). GMM clustering is one kind of EM clustering; for example, the two figures above can be clustered with a GMM.

As with K-Means, we specify the number of clusters in advance but do not know which class each sample belongs to. In general we assume the samples follow Gaussian (i.e. normal) distributions; each Gaussian distribution is one component of the model, so dividing into K classes means K components. We first initialize the Gaussian parameters of each component, and then look at which component each sample belongs to. That is the E step.

Then, from these hidden-variable results, the Gaussian parameters of each component are computed in reverse; that is the M step. The E and M steps are repeated until the Gaussian parameters of each component no longer change.

In this way the samples are clustered according to the GMM model. EM clustering can therefore solve many problems that K-Means cannot. In the EM framework we treat the potential categories as hidden variables and the samples as observations, turn the clustering problem into a parameter-estimation problem, and finally cluster the samples.

Finally, the EM algorithm is a framework in which you can use different models for clustering, such as GMM (the Gaussian Mixture Model) or HMM (the Hidden Markov Model).

  • GMM clusters through probability density; each cluster is assumed to follow a Gaussian (normal) distribution.
  • HMM uses a Markov process, in which the probability of state transitions is described by a state-transition matrix. HMM is widely used in natural language processing and speech recognition.

Well, that covers the principle of EM clustering. Now let's put it into practice!

5. EM algorithm in practice: clustering Honor of Kings heroes

That was the principle of the EM algorithm. Now that you understand it, let's strike while the iron is hot and do a small practical project to see the power of the EM algorithm!

5.1 How to use the EM toolkit

Python provides third-party toolkits that implement the EM algorithm. Since EM is a clustering framework, you need to specify which model you want to use, such as the GMM Gaussian mixture model or the HMM hidden Markov model.

Here we mainly use GMM, so we need to import the toolkit:

from sklearn.mixture import GaussianMixture

So how do you create a GMM cluster?


gmm = GaussianMixture(n_components=1, covariance_type='full', max_iter=100)

  • n_components: the number of Gaussian mixture components, i.e. the number of clusters; the default is 1. If you do not set n_components, all samples end up in the same cluster.
  • covariance_type: the covariance type. The distribution of each Gaussian mixture component is determined by its mean vector and covariance matrix, so the covariance type also characterizes the shape of the Gaussian components. There are four covariance types (a small sketch after this parameter list shows the shape of the fitted covariances for each type):


  • covariance_type='full': full covariance, i.e. no element of the covariance matrix is forced to zero;
  • covariance_type='tied': all components share the same full covariance matrix;
  • covariance_type='diag': only the diagonal of the covariance matrix is non-zero, the rest is 0;
  • covariance_type='spherical': spherical covariance; the off-diagonal elements are 0 and the diagonal elements are all equal, giving spherical components.

  • max_iter: the maximum number of iterations. The EM algorithm reaches the final model parameters by iterating the E and M steps; you can specify the maximum number of iterations here, and the default is 100.
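
Here is the small sketch mentioned above (on made-up data); it only shows how covariance_type changes the shape of the fitted covariances, to help interpret the four options:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 3)        # made-up data: 200 samples, 3 features

for cov_type in ['full', 'tied', 'diag', 'spherical']:
    gmm = GaussianMixture(n_components=4, covariance_type=cov_type).fit(X)
    print(cov_type, np.shape(gmm.covariances_))

# full      -> (4, 3, 3)  one full matrix per component
# tied      -> (3, 3)     a single matrix shared by all components
# diag      -> (4, 3)     one diagonal per component
# spherical -> (4,)       one variance per component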

Once the GMM clustering model is created, you can pass in data and fit it iteratively.

We call the fit function and pass in the sample feature matrix, and the model automatically trains the clusters. Then prediction = gmm.predict(data) clusters the data: pass in the data you want to cluster and the cluster labels are predicted.

You can see that fitting and prediction can take the same feature matrix, because clustering is unsupervised learning: you do not specify the clustering results in advance, and there is no ground truth to learn from. As long as the feature matrix is provided during training, the machine builds the clusters from it, and those clusters can then be used to label the data.
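
Putting the above together, a minimal usage sketch (with a made-up feature matrix; in the real task below the matrix comes from the hero data):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 5)            # made-up feature matrix: 100 samples, 5 features

gmm = GaussianMixture(n_components=3, covariance_type='full', max_iter=100)
gmm.fit(X)                            # unsupervised: only the feature matrix is passed in
prediction = gmm.predict(X)           # one cluster label per sample
print(prediction[:10])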

5.2 How to cluster Honor of Kings data with the EM algorithm

First of all, the principle of clustering is "birds of a feather flock together": a clustering algorithm groups data with similar feature values into one class, while the differences between classes are large. In this way the original data can be reduced in dimension, and by dividing the samples into groups (clusters) we can study the characteristics of each group. Alternatively, we can increase the number of clusters so that we can find interchangeable heroes; for example, if your opponent picks a hero you are good at, you can choose another hero from the same cluster as a backup.

5.2.1 Introduction to data sets

Let's first see what the data looks like (the data set can be downloaded here). We have collected 20 attributes for 69 heroes: maximum life, life growth, initial life, maximum mana, mana growth, initial mana, maximum physical attack, physical attack growth, initial physical attack, maximum physical defense, physical defense growth, initial physical defense, maximum HP regeneration per 5 seconds, HP regeneration growth per 5 seconds, initial HP regeneration per 5 seconds, maximum mana regeneration per 5 seconds, mana regeneration growth per 5 seconds, initial mana regeneration per 5 seconds, maximum attack speed, and attack range.

5.2.2 Execution process


  1. First, load the data source.
  2. In the preparation stage, explore the data, including with data visualization, so that we gain a deeper understanding of the hero attributes and the relationships between them; then evaluate the data quality and decide whether data cleaning is needed; finally perform feature selection to make the subsequent clustering easier.
  3. In the clustering stage, choose a suitable clustering model. Here the GMM Gaussian mixture model is used, and the clustering results are output and analyzed.

5.2.3 Hands-on

  1. First import the package
import pandas as pd
import csv
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
  2. Load the data
data_ori = pd.read_csv('dataset/heros.csv', encoding='gb18030')
features = [u'Maximum life', u'Life growth', u'Primordial life', u'Maximum mana', u'Mana growth', u'Initial mana', u'Highest object attack', u'Object attack growth', u'Initial attack', u'Maximum Object Defense', u'Object defense growth', u'Initial Defense', u'Maximum blood return every 5 seconds', u'Blood returns every 5 seconds', u'Initial blood return every 5 seconds', u'Blue back every 5 seconds Max', u'Blue growth every 5 seconds', u'Initial blue back every 5 seconds', u'Maximum attack speed', u'Attack range']
data = data_ori[features]
  3. Data exploration and visualization. We present the relationships between the 20 hero attributes as a heat map; the number in each cell is the correlation coefficient between two attributes, with a maximum of 1 indicating complete positive correlation. From the chart you can see that the 'Maximum life', 'Life growth' and 'Primordial life' attributes are highly correlated, so we only need to keep one of them. Similarly, we can filter the other highly correlated attributes and keep one from each group. As you can see in the code, I kept the features in the features_remain array, reducing the original 20 attributes to 13.
# Configure matplotlib to display Chinese characters correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Compute the correlation matrix of the attributes and draw a heat map
corr = data[features].corr()
plt.figure(figsize=(14, 14))
sns.heatmap(corr, annot=True)
plt.show()

The results are as follows:

  4. Feature engineering
features_remain = [u'Maximum life', u'Primordial life', u'Maximum mana', u'Highest object attack', u'Initial attack', u'Maximum Object Defense', u'Initial Defense', u'Maximum blood return every 5 seconds', u'Blue back every 5 seconds Max', u'Initial blue back every 5 seconds', u'Maximum attack speed', u'Attack range']
data = data_ori[features_remain]
  5. Data normalization. We can see that the 'Maximum attack speed' attribute is a percentage, which is not suitable for matrix calculation, so we convert the percentage into a decimal. We also see that the value of 'Attack range' is either ranged or melee, which is also not suitable for matrix calculation, so we map it to numbers, using 1 for ranged and 0 for melee. Then z-score normalization is applied to the feature matrix.
data[u'Maximum attack speed'] = data[u'Maximum attack speed'].apply(lambda x: float(x.strip('%')) / 100)
data[u'Attack range'] = data[u'Attack range'].map({'remote': 1, 'melee': 0})
# Use z-score normalization so that every feature dimension has mean 0 and variance 1
ss = StandardScaler()
data = ss.fit_transform(data)
  6. Build the model, produce the results, and write them to a file
# Construct the GMM clustering model with 30 clusters and fit it on the normalized features
gmm = GaussianMixture(n_components=30, covariance_type='full')
gmm.fit(data)

# Predict the cluster of each hero
prediction = gmm.predict(data)
print(prediction)
# [28 14  8  9  5  5 15  8  3 14 18 14  9  7 16 18 13  3  5  4 19 12  4 12
#  12 12  4 17 24  2  7  2  2 24  2  2 24  6 20 22 22 24 24  2  2 22 14 20
#  14 24 26 29 27 25 25 28 11  1 23  5 11  0 10 28 21 29 29 29 17]

# Insert the cluster labels as a new 'group' column in the original data
data_ori.insert(0, 'group', prediction)

We use the GMM Gaussian mixture model and output the results to a CSV file. Here is part of the output (with the number of clusters set to 30). The first column is the group (cluster). We can see that Zhang Fei and Cheng Yaojin are grouped together, Niu Mo and Bai Qi are grouped together, Lao Fuzi is in a group by himself, and Damo and Dian Wei are in one group. The characteristic of clustering is that attribute values within a class are similar while attribute values between classes differ greatly. So if you are good at playing Dian Wei, you can try Damo; likewise, Zhang Fei and Cheng Yaojin can substitute for each other. This way, even if the hero you are best at gets picked, you still have a backup hero to use.

Clustering is different from classification: clustering is an unsupervised learning method, so we have no ground-truth results to compare against, and evaluating clustering results is not as intuitive as measuring classification accuracy. Is there any way to evaluate a clustering result? Here we can use the Calinski-Harabasz index, as follows:

from sklearn.metrics import calinski_harabaz_score  # renamed calinski_harabasz_score in newer scikit-learn versions
print(calinski_harabaz_score(data, prediction))

The higher the score, the better the clustering effect, i.e. the differences within a class are small and the differences between classes are large. Of course, we still need to analyze manually what the specific clustering results mean, that is, what each class actually represents after the data has been divided into categories.
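
As a small extension (not part of the original article's code), the same score can be used to compare a few different cluster counts; the specific values of k below are arbitrary:

from sklearn.mixture import GaussianMixture
from sklearn.metrics import calinski_harabaz_score  # calinski_harabasz_score in newer scikit-learn

# data is the normalized feature matrix prepared above
for k in [3, 10, 20, 30]:
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=0)
    labels = gmm.fit_predict(data)
    print(k, calinski_harabaz_score(data, labels))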

In addition, clustering can also be used as a pre-processing stage for other data mining algorithms, helping us reduce the dimensionality of the data.

6. Summary

That's finally it. My god, every time I write it turns out to be this long, but it is all solid content that you can digest slowly; there is a lot, but you can really learn something from it. To summarize: today we first got a feel for the EM algorithm from the dish-dividing example.

Then, from the coin-toss example, we learned the basic EM and the upgraded EM, and saw how EM works: the E step of the EM algorithm computes the hidden variables from the old parameters, and the M step re-estimates the parameters from the hidden-variable results, repeating until the parameters stop changing and we get what we want.

Next, we compared EM with K-Means and saw that EM can solve some problems that K-Means cannot. EM is a framework whose concrete implementations include the Gaussian mixture model and the hidden Markov model; the hidden Markov model will be discussed in detail later.

Finally, we called the EM tool in sklearn to complete a clustering task on Honor of Kings heroes, and learned a function for measuring clustering results.

I hope that today's learning gives you an intuitive understanding of the EM algorithm; for the rigorous side, you can read the watermelon book or Statistical Learning Methods for further study.

