0x00 Abstract

This article tries to explain the Hidden Markov Model in an understandable way, avoiding mathematical formulas as much as possible. Instead, it takes a big-picture perspective and uses perceptual, intuitive reasoning to explain the hidden Markov model, and it draws a concrete application scenario from a classic novel to help you grasp the concept in depth.

0x01 Introduction

In the process of learning machine learning, you will encounter many obscure concepts and related mathematical formulas that are very difficult to understand. When faced with such situations, we should think more intuitively, using analogies or examples to follow up, which is often more effective.

In the process of explanation and elaboration, my requirement for myself is to find an example from everyday life or from a famous book, and then illustrate the concept in my own words.

0x02 Probabilistic graphical models

A probabilistic graphical model is a framework that uses a graph to represent the probabilistic dependencies among variables. Combining the knowledge of probability theory and graph theory, it uses a graph to represent the joint probability distribution of the variables in the model. It was developed by Turing Award winner Judea Pearl.

If there’s one word to describe probabilistic graphical models, it’s elegant. For a practical problem, we want to be able to mine the knowledge hidden in the data. A probabilistic graphical model constructs a graph in which observation nodes represent observed data, hidden nodes represent latent knowledge, and edges describe the relationships between knowledge and data. Finally, based on such a graph, a probability distribution is obtained, which solves the problem in a very “elegant” way.

Nodes in a probabilistic graph are divided into hidden nodes and observation nodes, and edges into directed edges and undirected edges. From the perspective of probability theory, nodes correspond to random variables and edges to the dependence or correlation between random variables: directed edges represent one-way dependence, and undirected edges represent interdependence.

Probabilistic graphical models fall into two categories: Bayesian networks and Markov networks. A Bayesian network is represented by a directed graph structure, while a Markov network is represented by an undirected graph structure. In more detail, probabilistic graphical models include the naive Bayes model, the maximum entropy model, the hidden Markov model, conditional random fields, topic models, and so on, and are widely used in many machine learning scenarios.

0x03 The difference between generative and discriminative models

Now we know that probabilistic graphs can be divided into directed graph models and undirected graph models; that is, as the name suggests, whether or not the edges in the graph have directions. So which kinds of models have directions and which don’t? It’s easy to think about: a directed edge expresses a kind of deductive relationship, where B is derived given A, and such a model is called a generative model. An undirected edge expresses a compatibility relationship, that is, A and B simply occur together, and such a model is called a discriminative model.

According to whether we model the joint probability distribution P(x, y) or the conditional probability distribution P(y | x), we arrive at the generative model and the discriminative model respectively.

1. Generative model

The generative model is generally computed with the joint probability (since we know premise A, we can calculate the joint probability); that is, to determine and estimate Y, it computes the joint probability distribution P(x, y) from the observed values and the labeled data.

The generative model simulates the process by which the data is generated. There is a causal, sequential relationship between the two random variables: the cause X comes first and the result Y comes after.


P(x,y) = P(x)P(y|x)

Through the joint distribution P(x,y), the generative model actually also models P(x) indirectly:


P(x) = \sum_y P(x,y)
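To make the two formulas above concrete, here is a minimal Python sketch. The probability table is invented purely for illustration; any table summing to 1 works:

```python
# A toy joint distribution P(x, y) over x ∈ {0, 1} and y ∈ {"a", "b"}.
# The numbers are made up for illustration.
joint = {
    (0, "a"): 0.3, (0, "b"): 0.2,
    (1, "a"): 0.1, (1, "b"): 0.4,
}

def marginal_x(joint, x):
    """P(x) = sum over y of P(x, y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional_y_given_x(joint, y, x):
    """P(y | x) = P(x, y) / P(x), inverting P(x, y) = P(x) P(y | x)."""
    return joint[(x, y)] / marginal_x(joint, x)

print(marginal_x(joint, 0))                  # 0.3 + 0.2 = 0.5
print(conditional_y_given_x(joint, "b", 1))  # 0.4 / 0.5 = 0.8
```

Note how P(x) falls out of the joint table as a by-product, which is exactly the indirect modeling of P(x) described above.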

It should be noted that during training the model learns the joint distribution P(X, Y) of X and Y. That is, only P(X, Y) is modeled in the training phase: all the parameters needed to maintain the joint probability distribution must be determined, and every label Y must be modeled. Hence there is no discriminant boundary.

At inference time, for a new sample we compute P(Y | X), derive Y, and finally choose the label with the highest probability as the result; but this no longer belongs to the modeling phase.

There are two drawbacks:

  • P(x) is difficult to estimate precisely because the features are not independent of each other, but are intricately dependent.
  • P(x) also has no direct role in the classification.

To overcome these two problems, discriminant models appear.

2. Discriminative model

Discriminative models are usually computed using conditional probabilities (because we don’t know the premise, we can only “assume” it and estimate the probability of B given A). They predict y by solving the conditional probability distribution P(y | x), or by directly computing the value of y.

The discriminative model skips P(x) and models the conditional probability P(y | x) directly; that is, it models y directly from the features x. No matter how complex the relationships inside X are, they do not affect the discriminative model’s judgment of Y, so it can confidently and boldly make use of a variety of rich, correlated features.

Specifically, the training process determines the parameters of the complex mapping P(Y | X); a single model is built over all the samples to fix an overall discriminant boundary. After that, when a batch of new samples arrives, the most likely label is predicted from the observed input features.

3. Representative models

Representative generative models include the n-gram language model, the hidden Markov model, the naive Bayes model, and so on.

Representative discriminative models include the maximum entropy model, support vector machines, conditional random fields, the perceptron model, and so on.

The difference between them is this:

  • The generative model deals with the joint probability: P(Y, X) = P(Y | X) P(X) = P(X | Y) P(Y)
  • The discriminative model computes the conditional probability: P(Y | X) = P(X | Y) P(Y) / P(X)

Discriminative models work by taking the Y (the label) of the data, learning from the features provided, and drawing a clear or relatively clear boundary (how? through a complex function mapping, or a decision-stacking mechanism); this should be obvious for linear LR and linear SVM.

The generative model works like this: it first figures out the distribution of all the data from the training samples and determines a joint distribution P(X, Y) that can serve as the distribution of all input data (note that X contains all the features x_i, and Y contains all the labels). Then, at inference time on a new sample, given the X of that sample, Y can be derived from the learned joint distribution P(X, Y) via the conditional probability


P(category | features) = \frac{P(features | category) P(category)}{P(features)}
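As a quick numeric sketch of this formula, the two categories and all the probabilities below are invented; a generative classifier combines the prior and the likelihood, then normalizes by the evidence:

```python
# Bayes' rule as used by a generative classifier; all numbers invented.
prior = {"spam": 0.4, "ham": 0.6}          # P(category)
likelihood = {"spam": 0.05, "ham": 0.001}  # P(features | category)

# P(features) is the sum over categories of P(features | c) P(c).
evidence = sum(prior[c] * likelihood[c] for c in prior)
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

print(posterior)  # the posteriors sum to 1; "spam" wins by a wide margin
```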

0x04 Markov-related concepts

When we talk about Markov, we are talking about a value that is related to the preceding n values, so it involves conditional probability; it is a generative model and a directed graph model.

Markov hypothesis: the probability of each event depends only on the preceding event.

Markov chain: a Markov chain is a series of successive events satisfying the Markov hypothesis. The nodes of a Markov chain are states and the edges are transition probabilities; it is a directed state-transition expression of a template CPD.

Two hypotheses for Markov chains

  • Homogeneous Markov hypothesis: that is, the state at time T is only affected by the state at time T-1;

  • Observation independence hypothesis: that is, the observation at any time is only affected by the state at that time;

The Markov model is an undirected probabilistic graphical model, which is not quite the same as a Markov chain. The Markov model is a probabilistic graphical model parallel to the Bayesian model. It describes the relationship between random variables that influence and interact with each other without a causal relationship. Because the influence is mutual, all the edges of the Markov model are undirected, or you could say bidirectional.

The power of the Markov model is that it removes the causal relationship of the Bayesian model, which allows it to establish relationships among many things of equal status. For example, every pixel in an image is equal in status, yet pixels can influence one another (the sky is above, the earth is below). Therefore, Markov models are widely used in image processing and image understanding.

If events are embodied as words, then the Markov model becomes the bigram language model. A Markov model can also be regarded as a state-transition process over time t, that is, a stochastic finite state machine. The probability of a state sequence can then be obtained by computing the product of the probabilities on the transition arcs between the states that form the sequence.
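The "product of probabilities on the transition arcs" can be sketched in a few lines of Python; the weather states and all the probabilities below are invented for illustration:

```python
# A first-order Markov chain as a stochastic finite state machine.
# States and probabilities are invented for illustration.
init = {"sunny": 0.6, "rainy": 0.4}
trans = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sequence_probability(states):
    """P(s1..sT) = P(s1) * product over t of P(s_t | s_{t-1})."""
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev][cur]  # multiply the probability on each transition arc
    return p

print(sequence_probability(["sunny", "sunny", "rainy"]))  # = 0.6 * 0.7 * 0.3
```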

0x05 Hidden Markov model

The Hidden Markov Model (HMM) is a statistical model used to describe a Markov process with hidden, unknown parameters. That is, it describes the transitions of a system’s hidden states and the probabilities of those states. The hidden states of a system are those states that cannot be observed from outside. The difficulty is to determine the hidden parameters of the process from the observable parameters; these parameters are then used for further analysis.

HMM is a generative model based on naive Bayes; it essentially extends the application of naive Bayes from single-sample classification to sequence-sample classification.

Specifically, HMM is a probabilistic model describing the joint distribution p(x,y) of two sequences:

  • The X sequence is visible to the outside world (the observer) and is called the observation sequence.

  • The Y sequence is invisible to the outside world and is called the state sequence.

For example, if the observation X is a word and the state Y is its part of speech, we need to guess the parts of speech from the sequence of words.

A Markov process is one where the transitions between states depend only on the preceding n states. Such a process is called an n-order Markov model, where n is the number of preceding states that influence the choice of the next state. The simplest Markov process is a first-order model, in which the choice of state depends only on the previous state. Note that this is not the same as a deterministic system, because the choice of the next state is governed by the corresponding probability and is not deterministic.

The hidden Markov model is called “hidden” because, from the outside, the state sequence (such as the parts of speech) is hidden and invisible; it is the dependent variable we want to infer. From this perspective, the states Y are also called hidden states, and the observations X are called visible states.

“Hidden” refers to information we do not know. For example, we know that trilobites are among humanity’s distant ancestors, but we do not know what evolutionary process led from trilobites to present-day humans; we only know the distribution information obtained from fossil data. Given a lot of such information, a hidden Markov model can be used for modeling, and the algorithms of this model are complicated precisely because of the missing information.

After the dependence between states and observations is determined, the hidden Markov model uses three elements to simulate the generation of the time series: the initial state probability vector, the state transition probability matrix, and the emission probability matrix.

1. Three hidden Markov problems

  • Probability calculation problem: given a model, how can the probability of generating an observation sequence be calculated efficiently? In other words, how do you evaluate the match between the model and the observation sequence?
  • Prediction problem: given a model and a sequence of observations, how do you find the state sequence that best matches the observations? In other words, how can the model’s hidden states be inferred from a sequence of observations?
  • Learning problem: given a sequence of observations, how do you adjust the model parameters to maximize the probability of that sequence? In other words, how can the model be trained to best describe the observed data?

The first two problems are pattern recognition problems: 1) computing the probability of an observable state sequence under a given hidden Markov model (evaluation), which measures the relative applicability of a model; 2) finding the sequence of hidden states that maximizes the probability of producing the observable state sequence (decoding), which is used to infer what the hidden part of the model is doing (“what is happening”). The third problem is to generate a hidden Markov model from a set of observable state sequences (learning).

The second problem is the core of the hidden Markov model; most applications of HMMs deal with problem (2). For example, in Chinese-to-English machine translation, words always have many possible meanings, and part of speech often plays a very important role. At first glance, deriving the part-of-speech sequence (the hidden state sequence) from the word sequence looks exactly like problem (2)! There are many more examples like this.

Corresponding solutions to the three problems:

  1. Forward Algorithm, Backward Algorithm
  2. Viterbi Algorithm
  3. Baum-Welch Algorithm (a special case of the EM algorithm)

A classic example from the Internet

Xiao Ming has three days of vacation. To pass the time, he can choose one of three activities each day: walking, shopping, or cleaning (these correspond to the observation sequence). But what we decide to do in life is generally affected by the weather: on a sunny day he may want to go shopping or take a walk, while on a rainy day he may not want to go out and instead stays home to clean. The weather (sunny, rainy) is the hidden state.

So, we ask three questions, corresponding to the three HMM problems:

  1. Given the whole model, I observed three consecutive days of activity: walking, shopping, and cleaning. Based on the model, calculate the probability of this sequence of behaviors occurring.
  2. Knowing this model and these three activities, I want to guess what the weather was like on each of the three days.
  3. The most complicated: I only know that I did these three things over three days, with no other information. I have to build the model myself: the transition probabilities between rain and sunshine, the probability distribution of the weather on the first day, and the probability of choosing each activity given the weather.

2. Difficulty/modeling of Hidden Markov

It is actually quite easy to simulate with an HMM if we know in advance the transition probabilities between all hidden states and the output probabilities between all hidden states and all visible states. However, when applying an HMM model, some information is often missing. How to estimate the missing information algorithmically becomes a very important question; these are the problems the HMM tries to solve.

The observed state sequence has a certain probabilistic relationship with the hidden process. We model such a process with a hidden Markov model, which consists of an underlying hidden Markov process that changes over time, and an observable set of states that are somehow related to the hidden states.

We model these examples using a hidden Markov model (HMM). This model consists of two sets of states and three sets of probabilities:

  • Hidden state: The (true) state of a system that can be described by a Markov process.
  • Observation state: The state of being ‘visible’ during this process.
  • PI vector: contains the probability (initial probability) of a particular hidden state of the (hidden) model at time t=1.
  • State transition matrix: contains the probability of passing from one hidden state to another.
  • Confusion matrix: contains the probability of an observed state for a particular hidden state given a hidden Markov model.

Therefore, a hidden Markov model introduces a set of observed states into a standard Markov process and some probabilistic relationships between them and hidden states.

0x06 An application of the hidden Markov model to the Water Margin

Liang Zhongshu’s breakout from Daming Prefecture in The Water Margin can be recast as a hidden Markov model (HMM) case; with a little adaptation it becomes easier to explain. The passage below is a rough translation of the relevant excerpt from the novel.

However, Liang Zhongshu was sitting in front of the government drunk. When he first heard the report, he was not very alarmed. After the second half of the time, the meteor scouts, one after another to report, scared out of his mind, hurriedly called the horse. Said not, see cui yun building, fiery, fire to seize the moon, very vast. When Liang Zhongshu saw this, he was anxious to get on the horse, but when he wanted to go to see it, he saw two big men pushing two carts and putting them on the road. He went to get the lamps hanging from the bowls and looked at the carts burning. When Liang Zhongshu was about to leave the east gate, two big men said: “Li Ying and Shi Jin are here!” Hand pick pu dao, big step to kill. The door officers, scared away, hand wounded more than ten. Du Qian and Song Wan, however, were good enough to come out next, and four of them joined together to hold the east gate. Liang Zhongshu see is not the head of the situation, leading the entourage, gallop south gate. According to a legend at the South Gate, “A fat monk swings his iron wand; A tiger – faced walker cast his saber and shouted to enter the city. Liang zhongshu returned to his horse, and then went to the front of the office, where Xie Zhen and Xie Bao were bumping against each other with steel forks in their hands. Eager to return to the state government, dare not approach. However, the king was well. Liu Tang and Yang Xiong were beaten together with fire and water, and their brains were gushed out and their eyes protruded. They died in front of the street. The hou of yu and the pan fled to the rest of their lives. Liang zhongshu hurried back to the horse rushed to the west door, only to hear the city God temple, artillery ring, bang heaven earthquake ground. Zou Yuan and Zou Run, bamboo poles in hand, just under the eaves of the fire. Before the south tile son, wang Dwarf tiger, one zhang green kill the future. 
Sun Xin and Sister-in-law Gu had a hidden weapon found near them and were there to help them. In front of the Tongsi Temple, Zhang Qing and Sun Erniang entered, climbed aoshan and set fire to it. At this time, the people in Beijing are in the crowd. A rat is strung out and a Wolf is running. A genie cries and a ghost cries. However, Liang Zhongshu ran to the west gate, and then Li Cheng hurried to the south gate. When he stopped his horse and looked from the drum Tower, he saw soldiers and horses lined up at the bottom of the gate. The banner read: “General Huyanjiao.” In the flame light, shake up the spirit, and be valiant; On the left there is Han Tao, on the right there is Peng Qi, and huang Xin is known to be known. He is known to be known as Qi, and is known to be known as Huang Xin. Liang Zhongshu could not go out of the city. He and Li Cheng hid under the north gate. They saw a bright fire, and they did not know the number of the horses, but it was Lin Chong, the head of a leopard, with a leaping horse and a spear. Then I turned to the east gate, and in the midst of the torches, I saw Mu Hong, Du Xing on the left, and Zheng Tianshou on the right. A hero from the Army of Three Boubu took the lead and led more than a thousand men to enter the city. Liang zhongshu path to the south gate, risking his life and walk. The torch beside the suspension bridge was qi Ming. I saw Li Kui, a black whirlwind, with Li Li on the left and Cao Zheng on the right. Li Kui stripped all over, clenched the root of his teeth, hand  double axes, from the city haoli fly to kill over. Li Li and Cao Zheng came together. Li Cheng took the lead and ran out of the city, protecting Liang Zhongshu and leaving. I saw the sound of killing under the left hand, the torches, numerous military horses, but the big sword Guan Sheng, flapping red rabbit horse, hand dance qinglongdao, the path to grab Liang Zhongshu. With two swords in hand, Li Cheng came forward to meet the enemy. 
Then Li Cheng, uninterested in war, plucked his horse and walked away. Left with Xuanzan, right with Hao Siwen, two sides hit. Sun Li, who was behind, urged the men to come and kill them. In the middle of the fight, xiao Li Guanghuarong caught up with him behind him. He took a bow and an arrow and hit Deputy Li Cheng. Li Cheng saw, pegasus running, not half an arrow, I saw the right hand under the drums and gongs, the fire dazzling, it is thunder fire Qin Ming horse dance stick, leading yan Shun, Ou Peng, behind Yang Zhi, and kill the future. Li Cheng fought and walked away, losing half his army and guarding Liang Zhongshu.

There are four options for breaking the siege: the four gates, to the east, south, west, and north. At each gate, Liang Zhongshu faces different Liangshan heroes. The heroes he meets form a chain of names.

For example, if he goes to the south gate and meets Wu Song, then the east gate and meets Shi Jin, then the north gate and meets Lin Chong, then the west gate and meets Song Jiang, we get the following list of hero names: Wu Song, Shi Jin, Lin Chong, Song Jiang.

This is called the visible state chain. But in a hidden Markov model we don’t just have this chain of visible states; we also have a chain of hidden states. In this case, the hidden state chain is the sequence of gates Liang Zhongshu selects: south gate, east gate, north gate, west gate.

In what follows, we explain using the heroes Liang Zhongshu meets: Wu Song, Shi Jin, Lin Chong, Song Jiang.

1. State transition matrix

The state transition matrix contains the probability of going from one hidden state to another. Generally speaking, the Markov chain mentioned in an HMM actually refers to the hidden state chain, because the transition probabilities hold between the hidden states (the gates).

For example, after Liang Zhongshu fails to break through the west gate, the next state could be the south gate, the east gate, the north gate, or the west gate, each with probability 1/4. That design would make the explanation easy, but we can actually set the transition probabilities at will. For example, we can define a rule as follows:

* Clockwise order: east gate ----> south gate ----> west gate ----> north gate ----> east gate ----> ...
* The probability of selecting the same gate again is 1/8, e.g. east gate ----> east gate (unlikely to try to break out at the same place again).
* Having reached a gate, the probability of choosing the next gate clockwise or counterclockwise is 3/8 each, e.g. east gate ----> south gate or east gate ----> north gate.
* Having reached a gate, the probability of choosing the opposite gate is 1/8, e.g. east gate ----> west gate (unlikely to cross the whole of Daming Prefecture to the opposite gate).

The state transition matrix is as follows:


A = \left\{ \begin{matrix} \frac18 & \frac38 & \frac18 & \frac38\\ \frac38 & \frac18 & \frac38 & \frac18\\ \frac18 & \frac38 & \frac18 & \frac38\\ \frac38 & \frac18 & \frac38 & \frac18\\ \end{matrix} \right\}

2. Confusion matrix

Confusion matrix: contains the probability of each observed state given a particular hidden state under the hidden Markov model. That is, although there are no transition probabilities between the visible states, there is a probability, called the output (emission) probability, between each hidden state and each visible state.

According to the actual text of the Water Margin, the Liangshan heroes at the four gates are:

* South gate: Wu Song, Lu Zhishen, Hu Yanzhuo, Han Tao, Peng Qi, Huang Xin, Li Kui, Li Li, Cao Zheng, Guan Sheng, Xuan Zan, Hao Siwen, Sun Li, Qin Ming, Yan Shun, Ou Peng, Yang Zhi.
* East gate: Li Ying, Shi Jin, Mu Hong, Du Xing, Zheng Tianshou.
* North gate: Lin Chong, Ma Lin, Deng Fei, Hua Rong.
* West gate: the main body of the Liangshan army.
* Taken literally, the probability of meeting Lin Chong at the north gate would be 1/4, and the probability of meeting Wu Song at the south gate would be 1/17.

But these are fixed, deterministic probabilities, and they cannot illustrate our HMM model.

Therefore, we need to change perspective to that of Liang Zhongshu and Wu Yong the Scholar. Suppose Wu Yong assigned each hero to each gate with a certain probability; that is to say, after going to a certain gate, Liang Zhongshu meets each hero with a certain probability.

Here is a simplified version with only a few heroes: Wu Song, Shi Jin, Lin Chong, Song Jiang.

* The following are the probabilities of each hero being assigned to each gate when Wu Yong deployed his troops.
* South gate: Wu Song 1/2, Shi Jin 1/4, Lin Chong 1/8, Song Jiang 1/8.
* East gate: Wu Song 1/4, Shi Jin 1/2, Lin Chong 1/8, Song Jiang 1/8.
* North gate: Wu Song 1/8, Shi Jin 1/8, Lin Chong 1/2, Song Jiang 1/4.
* West gate: Wu Song 1/8, Shi Jin 1/8, Lin Chong 1/4, Song Jiang 1/2.
* For our example: the probability of Liang Zhongshu meeting Wu Song at the north gate is 1/8, meeting Shi Jin is 1/8, and meeting Lin Chong is 1/2.

The confusion matrix is as follows:


B = \left\{ \begin{matrix} \frac12 & \frac14 & \frac18 & \frac18\\ \frac14 & \frac12 & \frac18 & \frac18\\ \frac18 & \frac18 & \frac12 & \frac14\\ \frac18 & \frac18 & \frac14 & \frac12\\ \end{matrix} \right\}

Each probability in the state transition matrix and the confusion matrix is time independent; that is, the matrices do not change over time as the system evolves. In fact, this is one of the most unrealistic assumptions Markov models make about the real world.

3. PI vector (initial probability)

Most of the Liangshan troops attacked Li Cheng at the west gate, so it was unlikely that Liang Zhongshu would go out through the west gate; the east gate was the most likely.

* South gate: 1/4
* East gate: 7/16
* North gate: 1/4
* West gate: 1/16

The PI vector is as follows:


\pi = \left\{ \begin{matrix} \frac14 \\ \frac7{16} \\ \frac14 \\ \frac1{16} \\ \end{matrix} \right\}
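Collecting the three elements, the whole Water Margin model fits in a few lines of plain Python. The state and observation orderings below are the ones used in the matrices above:

```python
# The full Water Margin HMM assembled from the three elements above.
# State order: south, east, north, west gate.
# Observation order: Wu Song, Shi Jin, Lin Chong, Song Jiang.
states = ["south", "east", "north", "west"]
heroes = ["Wu Song", "Shi Jin", "Lin Chong", "Song Jiang"]

pi = [1/4, 7/16, 1/4, 1/16]          # initial gate probabilities

A = [[1/8, 3/8, 1/8, 3/8],           # state transition matrix
     [3/8, 1/8, 3/8, 1/8],
     [1/8, 3/8, 1/8, 3/8],
     [3/8, 1/8, 3/8, 1/8]]

B = [[1/2, 1/4, 1/8, 1/8],           # confusion (emission) matrix
     [1/4, 1/2, 1/8, 1/8],
     [1/8, 1/8, 1/2, 1/4],
     [1/8, 1/8, 1/4, 1/2]]

# Sanity check: pi and every row of A and B must each sum to 1.
assert abs(sum(pi) - 1) < 1e-9
for row in A + B:
    assert abs(sum(row) - 1) < 1e-9
```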

0x07 Three problems corresponding to the Water Margin HMM

1. Sequence probability process — Evaluation

Probability calculation problem: given a model, how to efficiently calculate the probability of generating an observation sequence. That is, evaluating the probability of an observable state sequence under the given hidden Markov model.

In our Water Margin example: we know how many gates there are (the number of hidden states) and what each gate is like (the transition probabilities). Based on the observed results of “which hero do you see after picking a gate” (the visible state chain), I want to know the probability of seeing a particular result. For example, three times in a row, the heroes seen were Wu Song, Shi Jin, and Lin Chong. Based on the model, calculate the probability of this observation sequence occurring.

1.1 Direct calculation (exhaustive search)

Enumerate all possible state sequences, compute the probability of the observation sequence under each of them, and sum. The time complexity is very high: at each of the T time steps there are N possible hidden states, so there are N^T possible state sequences in total, and each of them takes on the order of T operations to evaluate.
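A direct implementation of this exhaustive search on the Water Margin model looks like this (observation indices 0..3 stand for Wu Song, Shi Jin, Lin Chong, Song Jiang):

```python
from itertools import product

# Exhaustive evaluation: enumerate every hidden gate sequence and sum its
# contribution. States 0..3 are the south, east, north, west gates.
pi = [1/4, 7/16, 1/4, 1/16]
A = [[1/8, 3/8, 1/8, 3/8], [3/8, 1/8, 3/8, 1/8],
     [1/8, 3/8, 1/8, 3/8], [3/8, 1/8, 3/8, 1/8]]
B = [[1/2, 1/4, 1/8, 1/8], [1/4, 1/2, 1/8, 1/8],
     [1/8, 1/8, 1/2, 1/4], [1/8, 1/8, 1/4, 1/2]]

def brute_force_probability(obs):
    """P(obs) = sum over all N^T state paths of P(path, obs)."""
    n, total = len(pi), 0.0
    for path in product(range(n), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

# Probability of seeing Wu Song, then Shi Jin, then Lin Chong:
print(brute_force_probability([0, 1, 2]))  # 1185/65536 ≈ 0.01808
```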

1.2 Forward algorithm (basic dynamic programming idea)

We use the forward algorithm to calculate the probability of an observation sequence given a hidden Markov model (HMM), and can therefore select the most appropriate HMM.

This algorithm adopts the idea of dynamic programming to reduce the amount of computation: at each step it directly reuses the results of the previous time step, avoiding repeated calculation, the same trick used in Viterbi. Correspondingly, its time complexity is linear in T.

Given a hidden Markov model (HMM), we consider recursively calculating the probability of an observation sequence. We first define the partial probability, which is the probability of reaching some intermediate state in the trellis. For the final observed states, the partial probability of a state includes the probability of reaching it along all possible paths. The sum of these final partial probabilities therefore equals the sum of the probabilities of all possible paths through the trellis, giving the probability of the observation sequence under the hidden Markov model (HMM).

Note: the time complexity of exhaustive search is 2T·N^T, and the time complexity of the forward algorithm is N^2·T, where T is the length of the observation sequence and N is the number of hidden states.
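A minimal sketch of the forward algorithm on the same Water Margin model; the inner sums touch every (i, j) state pair once per time step, hence the N^2·T cost:

```python
# Forward algorithm on the Water Margin model.
# States 0..3: south, east, north, west gate.
# Observations 0..3: Wu Song, Shi Jin, Lin Chong, Song Jiang.
pi = [1/4, 7/16, 1/4, 1/16]
A = [[1/8, 3/8, 1/8, 3/8], [3/8, 1/8, 3/8, 1/8],
     [1/8, 3/8, 1/8, 3/8], [3/8, 1/8, 3/8, 1/8]]
B = [[1/2, 1/4, 1/8, 1/8], [1/4, 1/2, 1/8, 1/8],
     [1/8, 1/8, 1/2, 1/4], [1/8, 1/8, 1/4, 1/2]]

def forward(obs):
    """P(obs) = sum_i alpha_T(i), built up one time step at a time."""
    n = len(pi)
    # alpha[i]: partial probability of the observed prefix, ending in state i
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return sum(alpha)

# Probability of seeing Wu Song, then Shi Jin, then Lin Chong:
print(forward([0, 1, 2]))  # 1185/65536 ≈ 0.01808, same as exhaustive search
```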

2. Sequence labeling process — Decoding

Prediction problem: given a model and a sequence of observations, how to find the state sequence that best matches this sequence of observations? In other words, how can the model’s hidden states be inferred from a sequence of observations? For a particular hidden Markov model (HMM) and a corresponding observation sequence, we often want to find the sequence of hidden states most likely to have generated the observations.

In our Water Margin example: we know how many gates there are (the number of hidden states) and what each gate is like (the transition probabilities). Based on the observed results of “which hero do you see after picking a gate” (the visible state chain), I want to know which gate was selected each time (the hidden state chain). For example, three times in a row, the heroes seen were Wu Song, Shi Jin, and Lin Chong. Which gate was picked each of those three times?

Given the observation sequence, I now want to find the hidden state sequence with the highest probability.

2.1 Viterbi algorithm

The Viterbi algorithm can be used to determine (search) the most likely sequence of hidden states under the known observation sequence and the HMM.

The Viterbi algorithm strives to find the best path to explain the observed sequence. In principle, we could find the most likely sequence of hidden states by listing all possible hidden state sequences and computing, for each combination, the probability of the corresponding observation sequence. The most likely hidden state sequence is the combination that maximizes

Pr(observed sequence | hidden state combination)

The idea rests on the following: if the best path from p1 to pN is k, and k passes through p', then the segment of k from p' to pN must itself be optimal. If it were not, there would be another path k' whose segment from p' to pN is better, and then the overall best path would not be k, a contradiction. So we only need to find the final state with the highest probability and then work backwards step by step to recover the optimal path.

In its application, the Viterbi algorithm computes a partial probability for each cell in the trellis, together with a back pointer indicating the most likely path to that cell. We first define the partial probability, which is the probability of reaching a particular intermediate state in the trellis. These partial probabilities differ from those computed in the forward algorithm: they represent the probability of the most likely path to a state at time t, rather than the sum of all path probabilities. After completing the calculation, the algorithm first finds the most likely state at the final time step, then traces back to t=1 through the back pointers (each of which points to the optimal predecessor state of the current state), so that the sequence of states on the backtrace path is the most likely hidden state sequence.
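The procedure above can be sketched directly on the Water Margin model; for brevity, the back pointers are kept here as explicit per-state paths rather than a separate backtrace pass:

```python
# Viterbi decoding on the Water Margin model.
# States 0..3: south, east, north, west gate.
# Observations 0..3: Wu Song, Shi Jin, Lin Chong, Song Jiang.
pi = [1/4, 7/16, 1/4, 1/16]
A = [[1/8, 3/8, 1/8, 3/8], [3/8, 1/8, 3/8, 1/8],
     [1/8, 3/8, 1/8, 3/8], [3/8, 1/8, 3/8, 1/8]]
B = [[1/2, 1/4, 1/8, 1/8], [1/4, 1/2, 1/8, 1/8],
     [1/8, 1/8, 1/2, 1/4], [1/8, 1/8, 1/4, 1/2]]

def viterbi(obs):
    """Return (most likely hidden-state path, its probability) for obs."""
    n = len(pi)
    # delta[i]: probability of the best path so far that ends in state i
    # paths[i]: that best path itself (plays the role of the back pointers)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    paths = [[i] for i in range(n)]
    for t in range(1, len(obs)):
        new_delta, new_paths = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
            new_paths.append(paths[best_i] + [j])
        delta, paths = new_delta, new_paths
    best = max(range(n), key=lambda i: delta[i])
    return paths[best], delta[best]

gates = ["south", "east", "north", "west"]
path, prob = viterbi([0, 1, 2])        # saw Wu Song, Shi Jin, Lin Chong
print([gates[g] for g in path])        # ['south', 'east', 'north']
```

With this model the decoded gates simply follow the largest emission probabilities, which matches the intuition that each hero was most likely posted at "his" gate.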

3. Learning

The learning problem: given a sequence of observations, how do we adjust the model parameters to maximize the probability of that sequence occurring? In other words, how do we train the model so that it best describes the observed data? That is, learning produces a hidden Markov model from a set of observable state sequences.

Returning to our Water Margin example: we know how many doors there are (the number of hidden states), but not what lies behind each door (the transition probabilities). We observe many rounds of "who did you see when you opened the door?" (the visible state chain) and want to work backwards to infer what each door is like (the transition probabilities). For example, three times in a row, the people Liang Zhongshu saw were Wu Song, Shi Jin, and Lin Chong, and nothing else is known. We want to build a model: find the probabilities of switching between the doors, and of course the probability of meeting each hero at each door.

In many cases, we only have the visible results and do not know the parameters in the HMM model. We need to estimate these parameters from the visible results, which is a necessary step in modeling.

This problem, the most difficult of the HMM-related problems, is to estimate the most appropriate hidden Markov model based on a sequence of observations (drawn from a known set) and the hidden state set associated with it, that is, to determine the triplet (pi, A, B) that best describes the known sequence.

3.1 Supervised learning algorithm — frequency/total gives the probability

First, the supervised learning algorithm: since you have enough data and it has been manually labelled, you only need to count frequencies. For example, in word segmentation: count the frequency of B→E transitions, B→M transitions, and so on, and you have the transition matrix A. Count, for each word, the frequencies of the tags B, E, M, S, and you have the observation matrix B. Count how often the first word of a sample is tagged B or S, and you have the initial probabilities pi. With that, the problem is solved: word segmentation really is just counting over a manually labelled data set.

3.2 Unsupervised learning algorithm — Baum-Welch algorithm

In the unlabelled case, you only have a pile of sentences (assume for simplicity that the sentences all have the same length of T words), and the matrices A and B cannot be measured directly. To learn these parameters you need the Baum-Welch algorithm, an adaptation of the EM algorithm to the HMM parameter-learning problem.

This parameter-learning algorithm is also called the forward-backward algorithm. To compute the probability of the whole sequence, the forward pass alone suffices; to compute probabilities involving a subsequence in the middle of the sequence, you need both the forward and the backward pass.

0x08 Reference code

UMDHMM (Hidden Markov Model Toolkit):

Hidden Markov Model (HMM) software: implementations of the Forward-Backward, Viterbi, and Baum-Welch algorithms. This is a lightweight version of HMM. UMDHMM homepage: www.kanungo.com/software/so…

Forward algorithm program (forward.c)

/* Function parameters:
 *   phmm:  the known HMM model
 *   T:     length of the observation sequence
 *   O:     observation sequence
 *   alpha: local probabilities
 *   pprob: the final observation probability
 */
void Forward(HMM *phmm, int T, int *O, double **alpha, double *pprob)
{
  int i, j;    /* state indices */
  int t;       /* time index */
  double sum;  /* intermediate value for the local probability */

  /* 1. Initialization: compute the local probability alpha of every state at t=1 */
  for (i = 1; i <= phmm->N; i++)
    alpha[1][i] = phmm->pi[i] * phmm->B[i][O[1]];

  /* 2. Induction: recursively compute the local probabilities at t=2, ..., T */
  for (t = 1; t < T; t++) {
    for (j = 1; j <= phmm->N; j++) {
      sum = 0.0;
      for (i = 1; i <= phmm->N; i++)
        sum += alpha[t][i] * phmm->A[i][j];
      alpha[t+1][j] = sum * phmm->B[j][O[t+1]];
    }
  }

  /* 3. Termination: the probability of the observation sequence equals
     the sum of all local probabilities at time T */
  *pprob = 0.0;
  for (i = 1; i <= phmm->N; i++)
    *pprob += alpha[T][i];
}

Viterbi algorithm program (Viterbi.c)

void Viterbi(HMM *phmm, int T, int *O, double **delta, int **psi,
             int *q, double *pprob)
{
  int i, j;    /* state indices */
  int t;       /* time index */
  int maxvalind;
  double maxval, val;

  /* 1. Initialization */
  for (i = 1; i <= phmm->N; i++) {
    delta[1][i] = phmm->pi[i] * phmm->B[i][O[1]];
    psi[1][i] = 0;
  }

  /* 2. Recursion */
  for (t = 2; t <= T; t++) {
    for (j = 1; j <= phmm->N; j++) {
      maxval = 0.0;
      maxvalind = 1;
      for (i = 1; i <= phmm->N; i++) {
        val = delta[t-1][i] * phmm->A[i][j];
        if (val > maxval) {
          maxval = val;
          maxvalind = i;
        }
      }
      delta[t][j] = maxval * phmm->B[j][O[t]];
      psi[t][j] = maxvalind;
    }
  }

  /* 3. Termination */
  *pprob = 0.0;
  q[T] = 1;
  for (i = 1; i <= phmm->N; i++) {
    if (delta[T][i] > *pprob) {
      *pprob = delta[T][i];
      q[T] = i;
    }
  }

  /* 4. Path (state sequence) backtracking */
  for (t = T - 1; t >= 1; t--)
    q[t] = psi[t+1][q[t+1]];
}

0xEE Personal information

Thoughts on life and technology

Wechat public account: Rosie’s Thinking

If you would like timely updates of my articles, or to see the technical material I recommend, please follow this account.
