35 HMM – Baum-Welch and Viterbi Algorithms


 

Learning Objectives:

 

The learning objectives of this module are as follows:

 

•      To understand the three problems of HMM

 

•      To explain the Baum-Welch algorithm for evaluation using the model

 

•      To discuss the Viterbi algorithm for decoding

 

35.1 Recap: Hidden Markov Model

 

As we have already discussed, a Hidden Markov Model (HMM) is an extension of a Markov model in which the input symbols are not the same as the states. This means we do not know which state we are in (hence the states are called hidden). For example, in HMM-based POS tagging, the input symbols are the words and the states are the part-of-speech tags. In addition we make two important assumptions, namely the Markov assumption and the output-independence assumption. The Markov assumption states that each state transition depends only on the origin and destination states, and the output-independence assumption states that each observation frame depends only on the state that generated it, not on neighbouring observation frames.

 

35.1.1 Parameters of an HMM:

  • In Hidden Markov Models, S = {s1,…,sn} is a set of states, where n is the number of possible states.
  • The transition probabilities form a two-dimensional matrix A = {ai,j}, where each ai,j represents the probability of transitioning from state si to state sj.
  • The emission probabilities, or the observation symbol probability distribution, form a set B of functions of the form bi(ot), the probability of observation ot being emitted by state si.
  • The initial state distribution π = {πi} gives the probability πi that si is the start state.

 

Therefore we have two model parameters: n, the number of states of the HMM, and m, the number of distinct observation symbols per state. In addition we have three probability measures: A, the transition probabilities; B, the emission probabilities; and π, the initial state distribution.
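The parameters above can be written down concretely as arrays. The following is a minimal sketch in Python/NumPy; the hypothetical 2-state, 3-symbol model and its numbers are illustrative assumptions, not values from this module:

```python
import numpy as np

# Hypothetical 2-state, 3-symbol HMM (all numbers are illustrative only).
A = np.array([[0.7, 0.3],       # A[i][j]: probability of transitioning s_i -> s_j
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # B[i][k]: probability that state s_i emits symbol k
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])       # initial state distribution

# Each probability table must sum to 1 along the appropriate axis.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert abs(pi.sum() - 1.0) < 1e-9
```

Storing A row-wise (one row per origin state) makes the later forward and Viterbi recursions simple matrix-vector products.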

 

The HMM formalism is represented as given in Figure 35.1, where {S, K, π, A, B} is the model. Here S is the set of states and K is the set of observation symbols. π = {πi} are the initial state probabilities, A = {aij} are the state transition probabilities and B = {bik} are the observation or emission probabilities.

 

Figure 35.1 The HMM Formalism

 

35.1.2 Building the observation sequence

 

Before we proceed further, let us understand how an observation sequence is generated. Given the above five elements, we can build an observation sequence

O = o1 o2 … oT

where T is the number of observations. This observation sequence is built as follows:

 

1. We first set t = 1.

2. Then we choose an initial state q1 = si according to the initial state distribution π.

3. Then we choose an observation ot = vk according to the symbol probability distribution of the current state.

4. Then we transit to a new state qt+1 = sj according to the state transition probability distribution.

5. Then we set t = t + 1 and iterate by returning to step 3 while t ≤ T.
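The five steps above can be sketched as a small sampling routine. In this sketch the 2-state, 2-symbol model values are illustrative assumptions, not from the text:

```python
import random

def sample_sequence(pi, A, B, T, seed=0):
    """Generate T observations from an HMM following steps 1-5 above."""
    rng = random.Random(seed)
    states, obs = [], []
    # Step 2: choose an initial state q_1 = s_i according to pi.
    q = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):                 # steps 3-5, iterating while t <= T
        states.append(q)
        # Step 3: choose o_t = v_k according to state q's emission distribution.
        obs.append(rng.choices(range(len(B[q])), weights=B[q])[0])
        # Step 4: transit to a new state q_{t+1} = s_j according to row q of A.
        q = rng.choices(range(len(A[q])), weights=A[q])[0]
    return states, obs

# Illustrative 2-state, 2-symbol model (values are assumptions, not from the text).
pi = [1.0, 0.0]
A  = [[0.8, 0.2], [0.3, 0.7]]
B  = [[0.9, 0.1], [0.2, 0.8]]
states, obs = sample_sequence(pi, A, B, T=5)
assert len(states) == len(obs) == 5
```

Only the observation list would be visible to an outside observer; the state list is exactly the hidden part of the model.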

 

35.2  The Three Problems of HMM

 

The following are the three problems associated with HMM:

 

Problem 1: Evaluation

 

Here, given the observation sequence O = o1,…,oT and an HMM model, how do we compute the probability of O given the model?

 

Q1: How do we compute the probability of a given sequence of observations?

 

A1: The Forward-Backward dynamic programming algorithm (the Baum-Welch algorithm)

 

Problem 2: Decoding

 

Here, given the observation sequence O = o1,…,oT and an HMM model, how do we find the state sequence that best explains the observations?

 

Q2: How do we compute the most probable sequence of states, given a sequence of observations?

 

A2: Viterbi’s dynamic programming Algorithm

 

Given an observation sequence, compute the most likely hidden state sequence

 

Problem 3: Learning

 

Here, given an observation sequence and a set of possible models, which model most closely fits the data? How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

 

Q3: Given an observation sequence and set of possible models, which model most closely fits the data?

 

A3: The Expectation Maximization (EM) heuristic.

 

35.3  Markov Assumption with an Example

 

Let us discuss the Markov assumption with an example. Assume that we have a sentence with n words. The Markov assumption states that the probability of the occurrence of word wi at time t depends only on the occurrence of word wi-1 at time t-1.

 

The normal chain rule would be as follows:

P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1)

However, the Markov assumption approximates the probability of the sequence of words as:

P(w1, w2, …, wn) ≈ P(w1) P(w2 | w1) P(w3 | w2) … P(wn | wn-1)
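To make the difference concrete, here is a toy numeric sketch for a three-word sentence; all the probability values below are made-up illustrative assumptions:

```python
# Toy sentence w1 w2 w3 with hypothetical probabilities (illustrative only).
p_w1 = 0.2               # P(w1)
p_w2_given_w1 = 0.1      # P(w2 | w1)
p_w3_given_w1_w2 = 0.3   # P(w3 | w1, w2)  -- full history
p_w3_given_w2 = 0.25     # P(w3 | w2)      -- bigram (Markov) estimate

# Full chain rule: P(w1, w2, w3) = P(w1) P(w2|w1) P(w3|w1,w2)
chain = p_w1 * p_w2_given_w1 * p_w3_given_w1_w2

# Markov approximation: P(w1, w2, w3) ~ P(w1) P(w2|w1) P(w3|w2)
markov = p_w1 * p_w2_given_w1 * p_w3_given_w2

assert abs(chain - 0.006) < 1e-12
assert abs(markov - 0.005) < 1e-12
```

The approximation replaces the full history in each conditional with only the previous word, which is what makes the HMM trellis computations tractable.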

 

A common way of representing the Hidden Markov Model is the trellis diagram shown in Figure 35.2.

Figure 35.2 Trellis Diagram

 

In this diagram we have the starting state s0, which influences the states s1,1, s1,2, s1,3, s1,4 with observation o1 at time t1, and so on until finally we end up with the states sT,1, sT,2, sT,3, sT,4 at time tT.

 

35.4 Problem 1 – Evaluation – Forward and Backward Algorithms

 

Here, given the observation sequence O = o1,…,oT and an HMM model, we need to compute the probability of O given the model; that is, we are given a sequence of observations and we need to compute the probability of that specific sequence of observations. This likelihood of a sequence can be determined by either the forward procedure or the backward procedure. We will discuss the forward algorithm, which is essentially a dynamic programming algorithm; the backward algorithm can be explained similarly. The forward algorithm can be used with two options: the "Any Path" method, where the likelihood is measured over all sequences of states of length T, and the "Best Path" method, where we score an HMM by the probability generated using the best possible sequence of states. Solving the evaluation problem involves the determination of the probability that a particular sequence of symbols O was generated by the model (Figure 35.3). Here the model starts at state q1 with initial probability πq1, followed by the transition probabilities aq1,q2, …, aqT-1,qT.

 

35.4.1 Forward Probabilities:

 

We need to determine the forward probability that, given an HMM λ, at time t the state is i and the partial observation o1 … ot has been generated, that is

αt(i) = P(o1 o2 … ot, qt = si | λ)

Here we start from the initial state and calculate the probability of each subsequent state in the forward direction; hence this probability is called the forward probability.

Figure 35.4 Forward Probability

 

The forward probability at a time slice t of a state j is the sum, over each of the N states i at time slice t-1, of the forward probability of state i multiplied by the transition probability from state i to the state j under consideration at time slice t (Figure 35.4). This sum is then multiplied by the emission probability of observing ot at state j.

35.4.2 Forward recursion:

 

As we have already discussed, the forward probability αt(i) is calculated using forward recursion. The initialization is as follows:

α1(i) = πi bi(o1),  1 ≤ i ≤ N

Here the initial forward probability of state i at time slice 1 is the product of the initial probability of state i and the probability of emitting the observation o1 at state i at time slice 1. The forward recursion is determined as follows:

αt+1(j) = [ Σi=1..N αt(i) ai,j ] bj(ot+1),  1 ≤ t ≤ T-1

Here the forward probability at time slice t+1 is determined by considering the forward probabilities of all states at time slice t, the transition probabilities from each state i to state j (the state whose forward probability is to be determined), and the emission probability of the observation at time slice t+1 at state j. Finally we have the termination as follows:

P(O | λ) = Σi=1..N αT(i)
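The initialization, recursion and termination steps can be sketched as a short function; the 2-state model at the bottom is an illustrative assumption, not the module's example:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: alpha[t][i] = P(o_1..o_t, q_t = s_i | model)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # initialization: pi_i * b_i(o_1)
    for t in range(1, T):
        # recursion: alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                # termination: sum_i alpha_T(i)

# Illustrative 2-state, 2-symbol model (values are assumptions, not from the text).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
alpha, likelihood = forward(pi, A, B, obs=[0, 1, 0])
assert np.isclose(likelihood, 0.069616)
```

Note that each row of alpha depends only on the previous row, which is exactly the dynamic programming structure that yields the N²T complexity discussed below.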

 

35.4.3 Example for the Calculation of Forward Probability

 

We show an example for the calculation of forward probabilities (Figure 35.5). Here there are three states, each of which can emit the observations R, G and B, and the observation sequence is R, R, G, B. The initial probability distribution shows that the start state is state 1. The probability of state 1 emitting R is 0.6, G is 0.2 and B is 0.2. The probabilities of state 2 and state 3 being the initial state are 0. The probabilities of state 2 emitting R, G and B are 0.2, 0.5 and 0.3, and the probabilities of state 3 emitting R, G and B are 0.0, 0.3 and 0.7. The transition probabilities from each state to all the other states are also shown in the figure. At time slice 1 we have only state 1, and its forward probability of emitting R is 0.6. Now let us calculate the forward probabilities of each of the states at time slice 2 with the observation being R.

Figure 35.5 Example for Calculation of Forward Probabilities

 

Now let us calculate the forward probabilities of each of the states at the time slice 3 with observation being G.

Now let us calculate the forward probabilities of each of the states at the final time slice 4 with final observation of the sequence being B.
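A sketch of this calculation in Python: pi and the emission probabilities are taken from the description above, but the transition matrix of Figure 35.5 is not reproduced in the text, so the values of A below are assumptions; the resulting numbers therefore only illustrate the procedure, not the figure's exact answers:

```python
import numpy as np

# States 1, 2, 3 and symbols R, G, B (indices 0, 1, 2) from the example.
pi = np.array([1.0, 0.0, 0.0])     # start state is state 1 (from the text)
B  = np.array([[0.6, 0.2, 0.2],    # state 1 emitting R, G, B (from the text)
               [0.2, 0.5, 0.3],    # state 2
               [0.0, 0.3, 0.7]])   # state 3
A  = np.array([[0.5, 0.3, 0.2],    # ASSUMED transition probabilities
               [0.2, 0.5, 0.3],    # (Figure 35.5 is not reproduced here)
               [0.1, 0.3, 0.6]])

obs = [0, 0, 1, 2]                 # the observation sequence R, R, G, B
alpha = np.zeros((len(obs), 3))
alpha[0] = pi * B[:, obs[0]]       # time slice 1: only state 1, with value 0.6
for t in range(1, len(obs)):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

assert np.allclose(alpha[0], [0.6, 0.0, 0.0])  # agrees with the text
assert alpha[-1].sum() > 0                     # P(R,R,G,B | model)
```

Note how state 3's forward probability stays 0 while R is observed, since state 3 cannot emit R at all.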

35.4.4 Forward Algorithm Complexity

 

In the naïve approach to solving Problem 1, the time taken is of the order of 2T·N^T computations, where T is the number of time slices in the sequence and N is the number of states in the HMM. The forward algorithm, however, takes only of the order of N^2·T computations.

 

35.4.5 Backward Probabilities

 

The backward probability is analogous to the forward probability, but works in the other direction, that is, in the backward direction, starting from the last state and travelling towards the initial state. Now we need to determine the backward probability that, given an HMM λ and given that the state at time t is i, the partial observation ot+1 … oT is generated:

βt(i) = P(ot+1 ot+2 … oT | qt = si, λ)

Figure 35.6 Backward Probability

 

Here we start from the final state and calculate the probability of each preceding state in the backward direction; hence this probability is called the backward probability. The backward probability at a time slice t of a state i is the sum, over each of the N states j at time slice t+1, of the transition probability from state i to state j multiplied by the emission probability of observing ot+1 at state j and by the backward probability of state j at time slice t+1 (Figure 35.6).

35.4.6 Backward recursion:

 

As we have already discussed, the backward probability βt(i) is calculated using backward recursion. The initialization is as follows:

βT(i) = 1,  1 ≤ i ≤ N

Here the initial backward probability of every state i at time slice T is 1. The backward recursion is determined as follows:

βt(i) = Σj=1..N ai,j bj(ot+1) βt+1(j),  t = T-1, …, 1

Here the backward probability at time slice t is determined by considering the backward probabilities of all states at time slice t+1, the transition probabilities from state i (the state whose backward probability is to be determined) to each state j, and the emission probability of the observation at time slice t+1 at state j.

 

Finally, the termination sums, over all states, the backward probability of state i at time slice 1 multiplied by the initial probability of state i and the probability of state i emitting o1:

P(O | λ) = Σi=1..N πi bi(o1) β1(i)
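The backward pass can be sketched in the same style, and its termination yields the same likelihood as the forward pass, which is a useful correctness check. The 2-state model here is an illustrative assumption:

```python
import numpy as np

# Illustrative 2-state, 2-symbol model (values are assumptions, not from the text).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
obs = [0, 1, 0]
N, T = 2, len(obs)

# Backward pass: beta[t][i] = P(o_{t+1} ... o_T | q_t = s_i, model)
beta = np.ones((T, N))                  # initialization: beta_T(i) = 1
for t in range(T - 2, -1, -1):
    # recursion: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
# termination: P(O | model) = sum_i pi_i * b_i(o_1) * beta_1(i)
p_backward = (pi * B[:, obs[0]] * beta[0]).sum()

# Forward pass for comparison: both procedures give the same likelihood.
alpha = pi * B[:, obs[0]]
for t in range(1, T):
    alpha = (alpha @ A) * B[:, obs[t]]
p_forward = alpha.sum()

assert np.isclose(p_forward, p_backward)
```

Having both alpha and beta available is what the Baum-Welch learning step (next module) builds on.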

 

35.5  Problem 2 – Decoding – Viterbi Algorithm

 

Given the observation sequence O = o1,…,oT and an HMM model, we now want to find the state sequence that best explains the observations. In other words, we need to compute the most probable sequence of states, given a sequence of observations. For this decoding we describe Viterbi's dynamic programming algorithm.

 

As we discussed, the solution to Problem 1 (Evaluation) was the efficient determination of the sum over all paths through an HMM. For the decoding problem we instead want to find the path with the highest probability. Here, given a set of symbols O, we determine the most likely sequence of hidden states Q that led to the observations. In other words, we want to find the state sequence Q = q1…qT which maximizes P(Q | o1, o2, …, oT), that is:

Q* = argmaxQ P(Q | O, λ)

Here we see that we need to find the state sequence that maximizes the probability, given the sequence of observations and the HMM model.

 

When we want to find the most probable state sequence we use the idea that, if we know the identity of qi, then the most probable sequence of states on i+1,…,n does not depend on observations before time i.

 

35.5.1 Viterbi Algorithm

 

The purpose of the Viterbi algorithm is to carry out an analysis of the internal processing results to find the single most likely state sequence. It uses dynamic programming to align state and observation transitions. The Viterbi algorithm is similar to computing the forward probabilities, but instead of summing over transitions from incoming states, we compute the maximum at each time slice. While in the forward algorithm we have

αt+1(j) = [ Σi=1..N αt(i) ai,j ] bj(ot+1)

in the case of the Viterbi recursion we have

δt+1(j) = [ maxi=1..N δt(i) ai,j ] bj(ot+1)

As we can see, instead of considering the summation of the forward probabilities of all the preceding states, the Viterbi recursion considers only the transition from the state where the product of the state probability and the transition probability is the maximum. Figure 35.7 shows the HMM.

 

Figure 35.7 HMM Model

 

The forward probability already discussed is as given below:

αt(i) = P(o1 … ot, qt = si | λ)

Similarly, the backward probability has already been discussed and is given below:

βt(i) = P(ot+1 … oT | qt = si, λ)

The forward probability and backward probability can be combined:

P(O, qt = si | λ) = αt(i) βt(i)

The initialization of the Viterbi algorithm is at time slice 1 and state i:

δ1(i) = πi bi(o1),  ψ1(i) = 0

Then at each recursive step we find the maximum probability over the N states, as shown by the induction:

δt(j) = [ maxi=1..N δt-1(i) ai,j ] bj(ot),  ψt(j) = argmaxi=1..N [ δt-1(i) ai,j ]

 

Here δt(j) is the probability of the state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t; ψt(j) records the best predecessor state.

 

Then we find the maximum and its argument at the final time slice as the termination condition:

P* = maxi=1..N δT(i),  qT* = argmaxi=1..N δT(i)

 

In this way we obtain the probability of the best sequence of hidden states that gave rise to the given set of observations. The final sequence of states is then computed by working backwards through the back-pointers:

qt* = ψt+1(qt+1*),  t = T-1, T-2, …, 1
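The initialization, induction, termination and backtracking steps can be sketched together as follows; the 2-state model at the bottom is an illustrative assumption, not the module's example:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: most likely state sequence for the observations."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # delta[t][j]: best path probability
    psi = np.zeros((T, N), dtype=int)   # psi[t][j]: argmax back-pointer
    delta[0] = pi * B[:, obs[0]]        # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A       # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)           # best predecessor of each j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # termination and backtracking through the psi back-pointers
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

# Illustrative 2-state, 2-symbol model (values are assumptions, not from the text).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
path, p_star = viterbi(pi, A, B, obs=[0, 1, 0])
assert path == [0, 0, 0]
```

Structurally this is the forward algorithm with the sum replaced by a max, plus the psi table needed to recover the winning path.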

35.5.2 Example for Viterbi Algorithm

 

This is the same example that we discussed in Section 35.4.3. However, here we need to find the maximum values at each step rather than the sum of the forward probabilities coming from each step of the induction. Now let us calculate the probabilities of each of the states at time slice 2 with the observation being R.

The sequence of hidden states is 1, 1, 2, 3 for the observation sequence R, R, G, B.

Figure 35.8 Example for Viterbi Algorithm

 

We will discuss the solution to the third problem associated with HMMs, that is, learning the HMM model with the EM algorithm, in the next module.

 

Summary

  • Explained the three problems of HMM
  • Discussed the Baum-Welch Forward and Backward algorithms for evaluation using the model
  • Outlined the Viterbi algorithm for decoding

 

You can view the video on HMM – Baum-Welch and Viterbi Algorithms.


 
