39 Q Learning


 

Learning Objectives

 

The learning objectives of the module are as follows:

•      To discuss model-based and model-free reward learning.

•      To discuss deterministic and non-deterministic rewards and actions.

•      To explain Q-learning, a model-free reinforcement learning method.

 

39.1 Model-based and Model-free Learning

 

There are four aspects of reinforcement learning: 1) the policy, 2) the reward function, 3) the value function and 4) the model of the environment. A policy characterizes how the agent behaves and is a mapping from states to actions, with the probability of selecting each action. The agent learns this policy π : S → A and uses it to maximize the cumulative reward V^π(s_t) in the long run. Because a reward earned sooner is worth more than one earned later, V^π(s_t) is called the discounted cumulative reward. There are two types of cumulative reward: the finite-horizon return, where the reward is accumulated over some finite number of steps, and the infinite-horizon cumulative reward, where the number of steps is infinite.
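As a concrete illustration, the short sketch below computes a finite-horizon discounted cumulative reward, sum over t of gamma^t * r_t; the reward sequence and the discount factor gamma are illustrative assumptions, not values from the module.

# Minimal sketch: discounted cumulative reward over a finite horizon.
def discounted_return(rewards, gamma=0.9):
    """Return sum_t gamma^t * r_t over the given (finite) reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0, 0, 0, 100]           # hypothetical rewards along one episode
print(discounted_return(rewards))  # 0.9**3 * 100 = 72.9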

 

39.1.1 Value Function and Optimal Value Functions

 

The state value function V^π(s) is the expected return when starting in s and following the policy π. Q^π(s,a) is the state-action value function: the expected return when starting in s, performing a, and thereafter following π. This state-action value is useful for finding the optimal policy and can be estimated from experience; we pick the best action using Q^π(s,a). The general Bellman equation can now be restated as:

V^π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) [ r(s, a, s') + γ V^π(s') ]

There is a set of optimal policies, and V^π defines a partial ordering on policies; all optimal policies share the same optimal value function V*(s).

The general Bellman optimality equation,

V*(s) = max_a Σ_{s'} P(s' | s, a) [ r(s, a, s') + γ V*(s') ],

is a system of n non-linear equations (one per state) that can be solved for V*(s), from which the optimal policy can be extracted. Having the state-action function Q*(s,a) makes it possible to choose the policy directly:

π*(s) = argmax_a Q*(s, a)

The agent has to learn the optimal policy that yields the optimal cumulative reward.
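A minimal sketch of extracting this greedy policy from a learned Q table; the table contents and action names below are hypothetical, chosen only to illustrate the argmax step.

# Minimal sketch: pi*(s) = argmax_a Q(s, a) over a small hypothetical Q-table.
Q = {("s1", "left"): 66.0, ("s1", "right"): 90.0,
     ("s2", "left"): 72.0, ("s2", "right"): 100.0}
actions = ["left", "right"]

def greedy_policy(state):
    """Return the action with the highest Q value in the given state."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_policy("s1"))  # -> "right"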

 

 

In model-based learning the agent has knowledge of the environment model parameters, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), and uses them to calculate the rewards. As the parameters are known there is no need for exploration, and the values and policy are computed using dynamic programming. The optimum policy is obtained by choosing, in each state, the action that maximizes the expected immediate reward plus the discounted value of the next state: π*(s_t) = argmax_{a_t} E[ r_{t+1} + γ V*(s_{t+1}) | s_t, a_t ].

 

The value iteration method given below computes the optimal value function V*(s), starting from an arbitrary V(s), by repeatedly backing up the immediate reward and the discounted value of the best next state, and extracting the optimum policy thereafter (Figure 39.1).

Figure 39.1 Value Iteration Method
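Since the algorithm itself appears here only as a figure, the following is a minimal sketch of value iteration on a tiny hypothetical MDP; the transition structure P, the rewards, gamma and the stopping threshold are illustrative assumptions, not taken from the module.

# Minimal value iteration sketch on a tiny hypothetical MDP.
# P[s][a] is a list of (probability, next_state, reward) triples.
gamma, theta = 0.9, 1e-6
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 5.0)]},   # state 0: stay, or move for reward 5
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},   # state 1: go back, or stay for reward 1
}
V = {s: 0.0 for s in P}                            # start from an arbitrary V(s)

while True:
    delta = 0.0
    for s in P:
        # Back up the best expected immediate reward plus discounted next-state value.
        backup = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        delta = max(delta, abs(backup - V[s]))
        V[s] = backup
    if delta < theta:                              # stop once the values have converged
        break

# Extract the optimum policy from the converged values.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in P}
print(V, policy)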

 

The policy iteration method given below computes the optimal policy, starting with an arbitrary policy π, by alternating policy evaluation (finding the cumulative value of the current policy) and policy improvement (modifying the policy greedily with respect to that value) until convergence (Figure 39.2).

 

Figure 39.2 Policy Iteration Method
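A minimal sketch of policy iteration on the same kind of hypothetical MDP used above; all transition probabilities, rewards and iteration counts are illustrative assumptions.

# Minimal policy iteration sketch: alternate evaluation and greedy improvement.
gamma = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 5.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
policy = {s: 0 for s in P}            # start from an arbitrary policy
V = {s: 0.0 for s in P}

stable = False
while not stable:
    # Policy evaluation: iterate the Bellman equation for the current policy.
    for _ in range(200):
        for s in P:
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
    # Policy improvement: act greedily with respect to the evaluated values.
    stable = True
    for s in P:
        best = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        if best != policy[s]:
            policy[s], stable = best, False
print(policy, V)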

 

39.1.2 Deterministic Rewards and Actions

In the deterministic case the reward r(s,a) and the next state δ(s,a) are fixed, known outcomes of taking action a in state s, so the value of a state-action pair is simply the immediate reward plus the discounted value of the unique next state.

39.1.3 Nondeterministic Rewards and Actions

In the nondeterministic case the reward and the next state are random, governed by p(r_{t+1} | s_t, a_t) and P(s_{t+1} | s_t, a_t), so the value of a state-action pair is defined as an expectation over the possible rewards and next states.

39.2 Basics of Temporal Difference Learning

 

Temporal difference (TD) learning is a prediction-based machine learning method. It has primarily been used for the reinforcement learning problem and is said to be a combination of Monte Carlo ideas, because it learns by sampling the environment according to some policy, and dynamic programming (DP) ideas, because it approximates its current estimate based on previously learned estimates. TD learning methods can be used to estimate value functions. If the value functions were calculated without estimation, the agent would need to wait until the final reward was received before any state-action pair values could be updated. Once the final reward was received, the path taken to reach the final state would need to be traced back and each value updated accordingly. This can be expressed formally as:

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where s_t is the state visited at time t, R_t is the reward received after time t and α is a constant step-size parameter. On the other hand, with TD methods an estimate of the final reward is calculated at each state and the value is updated at every step of the way:

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

where r_{t+1} is the observed reward at time t+1 and γ is the discount factor.

 

The TD method is called a “bootstrapping” method because the value is updated partly using an existing estimate rather than only a final reward. The environment model, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is not known; this is model-free learning. There is therefore a need for exploration, to sample from P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t).

 

We use the reward received in the next time step to update the value of the current state (or action). The temporal difference is the difference between the current value estimate and the observed reward plus the discounted value of the next state.
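A minimal sketch of this TD(0) value update; the step size alpha, gamma, and the sampled transition values are illustrative assumptions.

# Minimal TD(0) sketch: update V(s) from a single sampled transition (s, r, s').
alpha, gamma = 0.1, 0.9
V = {"s1": 0.0, "s2": 10.0}          # hypothetical current value estimates

def td0_update(s, r, s_next):
    """V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

print(td0_update("s1", 1.0, "s2"), V["s1"])  # TD error 10.0, V(s1) becomes 1.0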

 

39.3 Exploration Strategies

 

At each state (except a terminal state) the agent must select an action. There are several ways to decide which action to take. The simplest of these is greedy selection: the agent always selects the action that has the highest state-action value. This method is pure exploitation. There are more sophisticated strategies that combine exploitation and exploration. Some of these methods are:

 

ε-greedy: In this method the action with the highest estimated reward is chosen most of the time; this is the greedy action. Every once in a while, with a small probability ε, an action is instead selected at random, uniformly and independently of the action-value estimates.

 

Boltzmann method: Here we move smoothly from exploration to exploitation.

 

 

Here action a is chosen in state s with probability P(a | s) = exp(Q(s,a)/T) / Σ_b exp(Q(s,b)/T), where T is a “temperature”, a concept borrowed from simulated annealing. Simulated annealing is a probabilistic technique for approximating the global optimum of a given function; it is interpreted as slow cooling, where the probability of accepting worse solutions decreases slowly as the solution space is explored. A large T means that each action has approximately the same probability, while a small T leads to more greedy behaviour. Typically we start with a large T and decrease its value over time.
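A minimal sketch of the two selection strategies; the Q values, epsilon and temperature below are illustrative assumptions.

# Minimal sketch of epsilon-greedy and Boltzmann (softmax) action selection.
import math, random

Q = {"left": 1.0, "right": 2.0, "up": 0.5}   # hypothetical action values for one state

def epsilon_greedy(Q, epsilon=0.1):
    if random.random() < epsilon:            # explore: random action
        return random.choice(list(Q))
    return max(Q, key=Q.get)                 # exploit: greedy action

def boltzmann(Q, T=1.0):
    weights = {a: math.exp(q / T) for a, q in Q.items()}
    total = sum(weights.values())
    r, acc = random.random() * total, 0.0
    for a, w in weights.items():             # sample proportionally to exp(Q/T)
        acc += w
        if r <= acc:
            return a
    return a

print(epsilon_greedy(Q), boltzmann(Q, T=0.5))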

 

39.4 Q Learning

 

Q-learning is an off-policy algorithm for temporal difference learning. Off-policy algorithms can update the estimated value functions using hypothetical actions, i.e. actions that have not actually been tried. Q-learning attempts to learn the value of taking each action in each state; this is a model-free approach. Q-learning can be used to find an optimal action-selection policy for any given Markov Decision Process. As discussed earlier, a policy is a rule that the agent follows in selecting actions given the state it is in. When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. Q-learning is thus able to compare the expected utility of the available actions without requiring a model of the environment. While solving MDPs we defined the function Q(s,a), which gives the expected value of taking action a in state s; the Q value incorporates both the immediate reward and the expected value of the next state. Q-learning converges to an optimal policy in both deterministic and nondeterministic MDPs.

 

The advantages of Q-learning are that it is easy to use in practice and that it is exploration insensitive, i.e. the optimal values will be reached in the end regardless of how the agent explores. Because the exploration strategy has no effect on convergence, Q-learning is the most popular and most effective model-free algorithm for learning from delayed reinforcement. The disadvantage of Q-learning is that it does not scale well: when generalizing over a huge number of states and actions, convergence is quite slow. It is practical only for a small number of problems, because Q-learning can require many thousands of training iterations to converge even on modest-sized problems, and in many situations the memory required by the method becomes too large.

 

39.4.1 Q Learning Steps

 

Q-learning augments value iteration by maintaining an estimated utility value Q(s,a) for every action at every state. The utility of a state, U(s), is simply the maximum Q value over all possible actions at that state. This method learns the utilities of actions rather than of states and is in essence model-free learning.

 

The working of Q-learning is as follows. In every state the agent chooses the action a that maximizes the function Q(s, a). Q is the estimated utility function; it tells how good an action is in a given state. At each step the agent updates Q(s, a) as the sum of the immediate reward for taking the action and the best utility (Q) of the resulting state. This is done repeatedly until convergence.

 

The Q-learning rule is given below:

Q(s, a) ← r + γ max_{a'} Q(s', a')

where s is the current state, s' the next state, a the action taken, a' an action available in the next state, r the immediate reward, γ the discount factor, and Q(s,a) the expected discounted reinforcement of taking action a in state s. Here <s, a, r, s'> is an experience tuple.
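A minimal sketch of this update applied to one experience tuple <s, a, r, s'>; the table contents and gamma are illustrative assumptions (a learning rate would be added for the nondeterministic case).

# Minimal sketch: one Q-learning backup from an experience tuple <s, a, r, s_next>.
from collections import defaultdict

gamma = 0.9
Q = defaultdict(float)              # Q(s, a), initialised to zero
Q[("s2", "up")] = 100.0             # hypothetical previously learned value

def q_backup(s, a, r, s_next, actions):
    """Q(s, a) <- r + gamma * max_a' Q(s_next, a')  (deterministic rule)."""
    Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

q_backup("s1", "right", 0.0, "s2", actions=["up", "down"])
print(Q[("s1", "right")])           # 0 + 0.9 * 100 = 90.0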

 

39.4.2 Action-Value Function

 

In Q-learning, since the chosen action should yield the best next state, we need to choose the action that leads to that state. The action-value pair is as given in Figure 39.4.

Figure 39.4 Action Value Pair

 

If the action-value function is learnt accurately, then the action a is chosen using

a = argmax_a Q(s, a)

 

Figure 39.5 shows the graphical representation of Q and U values. The utility values are stored in the nodes (states) and the Q values are stored in the edges (actions).

 

Figure 39.5 Graphical Representation of Q-Learning

 

Let Q_t be our current estimate of the correct Q; then the current policy and utility function estimates are as follows:

π_t(s) = argmax_a Q_t(s, a) and U_t(s) = max_a Q_t(s, a)

 

As the U table is embedded in the Q table, there is no need to store both. To check how precise the estimate is, assume the agent is in state s_t and compute the predicted reward for some finite number of steps n; then, following the current policy, determine the actual reward, compare it with the predicted reward, and reduce the error.

 

39.5 Q-Learning: Model-Free RL

 

In Q-learning, instead of learning the optimal value function V of the state, we directly learn the optimal Q function. Recall that Q(s,a) is the expected value of taking action a in state s and then following the optimal policy thereafter. The optimal Q function satisfies V(s) = max_a Q(s, a).

 

To learn the Q function directly, we select an action greedily according to Q(s,a), without using a model.

 

Q-learning, the most important RL algorithm in use today, uses a one-step sample backup to learn the action-value function Q(s,a). Its one-step error, r + γ max_{a'} Q(s', a') − Q(s, a), follows the same update schedule as in the TD algorithm.

 

The Q-learning algorithm, which repeatedly selects an action, observes the resulting reward and next state, and applies the update rule above, is given below:
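Since the algorithm itself appears here only as a figure, the following is a minimal sketch of the tabular Q-learning loop; the env object (with reset() and step(a) returning next state, reward and a done flag), the episode count, alpha, gamma and epsilon are all illustrative assumptions.

# Minimal tabular Q-learning sketch (off-policy, epsilon-greedy behaviour).
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                                   # Q(s, a) initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy over the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # Off-policy target uses the best next action, not the one actually taken.
            target = r + gamma * max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q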

 

As this is an off-policy method, the policy being learned is not the one used to generate behaviour; the value of the best next action is used in the update. Q-learning is used when the environment cannot be controlled, for example the opponent in chess or the dice in backgammon.

 

39.6 Sarsa Algorithm

 

The Sarsa algorithm is an on-policy algorithm for TD learning. The major difference between this algorithm and Q-learning is that the maximum reward for the next state is not necessarily used for updating the Q values. Instead, a new action, and therefore reward, is selected using the same policy that determined the original action. The name Sarsa comes from the fact that the updates are done using the quintuple (s, a, r, s', a'), where s and a are the original state and action, r is the reward observed in the following state, and s' and a' are the new state-action pair. Instead of looking at all possible next actions a' and choosing the best, on-policy Sarsa uses the policy derived from the Q values to choose a', and uses the Q value of that a' to calculate the temporal difference. On employing a GLIE (greedy in the limit with infinite exploration) policy, where (1) all state-action pairs are visited an infinite number of times and (2) the policy converges in the limit to the greedy policy, the algorithm converges with probability 1 to the optimal policy and state-action values.
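A minimal sketch of the Sarsa update for a single quintuple (s, a, r, s', a'), to contrast with the Q-learning backup above; the stored Q values, alpha and gamma are illustrative assumptions.

# Minimal Sarsa sketch: the on-policy update uses the action a_next actually selected.
alpha, gamma = 0.1, 0.9
Q = {("s1", "right"): 0.0, ("s2", "up"): 100.0, ("s2", "down"): 20.0}

def sarsa_update(s, a, r, s_next, a_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s', a') - Q(s,a))."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Suppose the epsilon-greedy policy happened to pick "down" in s2:
sarsa_update("s1", "right", 0.0, "s2", "down")
print(Q[("s1", "right")])   # 0 + 0.1 * (0 + 0.9*20 - 0) = 1.8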

 

39.7 Value-Based TD vs. Q-learning

 

Both the V-function and the Q-function can be learnt by the temporal difference approach, but they are used in different circumstances. If the state space is small this may not be an important issue, but if the state space is large we have to choose based on the application. The value-based approach learns a model and a utility function, but it is difficult to learn good models for large, complex environments (e.g. learning a DBN representation). If we can learn a model, then learning the utility function is simpler than learning Q(s,a), and the model can be reused for related problems. In the Q-learning approach the Q-function is what is learnt, which is simpler to implement since we do not need to worry about representing and learning a model. But Q-functions can be substantially more complex than utility functions, a small price to pay for not having the model.

 

39.8 Q-Learning – Example

 

An example demonstrating the Q-learning process in a simple grid-world environment is shown in Figure 39.6 (a). The six grid squares represent six possible states, or locations; each arrow represents a possible action the agent can take to move from one state to another; and the number associated with each arrow represents the immediate reward r(s,a) the agent receives on performing the corresponding state-action transition. The state G is the goal state: the agent receives the reward on entering this state, and the immediate reward for all state-action transitions except those leading into G is defined to be zero. The only action the agent can perform once it enters G is to remain there, as G is an absorbing state. With all states, actions, and immediate rewards defined and the discount factor chosen as γ = 0.9, the optimal policy π* and its value function V*(s) can be determined (Figure 39.6 (b)), as can Q(s,a) (Figure 39.6 (c)). Figure 39.6 (d) shows one of the optimal policies for this setting, which directs the agent along the shortest path toward the state G.

 

Figure 39.6 (b) and (c) show the values of V*(s) for each state and Q(s,a) for each state-action pair. For the state from which the optimal policy selects the “move up” action into G, V* is 100: that action receives an immediate reward of 100, and since the agent then remains in the absorbing state, no further rewards are received. Likewise, the value of V* for the bottom centre state is 90, since the optimal policy moves the agent from this state to the right and then upward, generating immediate rewards of 0 and 100 for the two actions. Hence the discounted future reward from the bottom centre state is 0 + γ·100 + γ²·0 + γ³·0 + … = 90, as V* is the sum of discounted future rewards over the infinite future. Once the absorbing state G is reached, the agent remains in it indefinitely and receives rewards of zero.

In many real problems calculating the state value directly is unrealistic. For example, what will be the reward for a robot exploring Mars that decides to take a right turn? This problem can be avoided by exploring and experiencing how the environment reacts to the actions. It is difficult for the agent to learn the optimal policy from the sequence of immediate rewards alone, so a numerical evaluation function is defined over states and actions and the optimal policy is learnt in terms of that evaluation function. If the agent has perfect knowledge of the immediate reward function r and the state transition function δ, then the optimal policy can be acquired by learning V*. What we want instead is a function Q(s,a) that directly scores state-action pairs, i.e. tells us what action to take in each state. Given Q(s,a) it is simple to execute the optimal policy without knowing r(s,a) and δ(s,a), since π*(s) = argmax_a Q(s,a), V*(s) = max_a Q(s,a), and Q(s,a) = r(s,a) + γ V*(δ(s,a)) = r(s,a) + γ max_{a'} Q(δ(s,a), a'); only the definition of Q itself depends on r and δ. Consider a robot exploring its environment and taking new actions as it goes: it receives some reward r at every step and observes the change in the environment, i.e. the state changes to a new s' after the action a. We can use these observations (s, a, s', r) to learn an estimate Q̂(s,a) = r + γ max_{a'} Q̂(s', a'), where s' = s_{t+1}. This rule continually adjusts the estimate of Q at state s to be consistent with the estimate of Q at state s', one step in the future: temporal difference (TD) learning. Note that s' is closer to the goal and hence is more reliable, but it is still an estimate itself. Updating estimates based on other estimates is called bootstrapping, and it is done after each state-action pair, i.e. the agent learns online. What we learn about explored state-action pairs is the most useful, since those pairs are likely to be encountered again. Under suitable conditions, these updates can be proved to converge to the true Q values.

 

Figure 39.7 shows the initial state s1 of the robot (R) and the relevant values in its initial hypothesis, for example Q̂(s1, a_right) = 72, where a_right refers to the action that moves R to its right. On executing this action the robot receives an immediate reward r = 0 and moves to state s2. It then updates its estimate Q̂(s1, a_right) based on its estimates for the new state s2; in this way Q-learning propagates Q-estimates one step backwards. With γ = 0.9 the value is updated to Q̂(s1, a_right) = 90 after executing the single action, as shown below.
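The arithmetic of this single backup, assuming (as the stated result of 90 implies) that the best Q estimate available from s2 is 100:

# Worked update for the single step described above:
# Q(s1, a_right) <- r + gamma * max_a' Q(s2, a')
gamma = 0.9
r = 0
best_next = 100             # implied by the updated value of 90 reported in the text
q_s1_right = r + gamma * best_next
print(q_s1_right)           # 90.0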

 

 

When learning Q (off-policy learning), the agent should not simply follow the current policy, because it may get stuck in a suboptimal solution while better solutions it has never seen exist. So it is good to try new things now and then. Generally, if the temperature T is large the agent does a lot of exploring; if T is small it mostly follows the current policy; and T can be decreased over time as a trade-off between exploration and exploitation.

 

To improve Q-learning, as a trade-off between memory and computation, one can cache the (s, a, s', r) tuples for observed transitions and replay the update Q(s,a) = r + γ max_{a'} Q(s', a') after a while, once Q(s', a') has changed. Other methods to improve Q-learning are:

 

a) An active search for state-action pairs for which Q(s,a) is expected to change a lot can be done (prioritized sweeping), or

 

b) Updates can be done for more than just one step along the sampled path (TD(λ) learning), or

 

c) To deal with stochastic environments, the update can target the expected future discounted reward, Q(s,a) = E[ r + γ max_{a'} Q(s', a') ], where the expectation is over the possible rewards and next states.

 

When the state space is too large to deal with all states explicitly, we need to learn a function approximation: Q(s,a) ≈ fθ(s,a). Neural networks trained with back-propagation have been quite successful for these types of environments. For example, in the backgammon program TD-Gammon, which plays at expert level, the state space is very large; a neural network is used to approximate the value function, and TD(λ) is used for learning, with TD-Gammon training by playing against itself.
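A minimal sketch of Q-learning with a simple function approximator in place of a table: here a linear model over a hypothetical feature function phi(s, a) stands in for the neural network mentioned above, and all names and numbers are illustrative assumptions.

# Minimal sketch: semi-gradient Q-learning with a linear approximator
# Q_theta(s, a) = theta . phi(s, a), standing in for a neural network.
import numpy as np

alpha, gamma = 0.01, 0.9
theta = np.zeros(4)                        # weight vector

def phi(s, a):
    """Hypothetical feature vector for a state-action pair."""
    return np.array([1.0, s, a, s * a])

def q(s, a):
    return float(theta @ phi(s, a))

def update(s, a, r, s_next, actions):
    """theta <- theta + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) * phi(s,a)."""
    global theta
    target = r + gamma * max(q(s_next, a2) for a2 in actions)
    theta += alpha * (target - q(s, a)) * phi(s, a)

update(s=1.0, a=0.0, r=1.0, s_next=2.0, actions=[0.0, 1.0])
print(theta)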

 

Summary

  • Outlined value iteration, policy iteration, TD learning and exploration strategies
  • Explained deterministic vs. nondeterministic rewards and actions
  • Reviewed model-based vs. model-free RL
  • Discussed Q-learning steps, an example, and improvements to Q-learning

 

Web Links

  • http://en.wikipedia.org/wiki/Reinforcement_learning
  • http://www.cs.cmu.edu/afs/cs/project/theo-20/www/mlbook/ch13.pdf
  • http://neuro.bstu.by/ai/RL-3.pdf
  • https://cs.uwaterloo.ca/~ppoupart/ICML-07-tutorial-slides/icml07-brl-tutorial-part2-intro-ghavamzadeh.pdf
  • ce.sharif.edu/courses/91-92/1/ce717-2/resources/root/Lectures/RL.pdf
  • www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-10.pdf
  • https://www.cs.uic.edu/~piotr/cs594/YijueRL.ppt
  • https://en.wikipedia.org/wiki/Multi-armed_bandit
  • http://teaching.csse.uwa.edu.au/units/CITS3001/lectures/lectures/3001%20Reinforcement%20learning
  • https://www.udacity.com/wiki/cs271/unit10-notes#!#passive-vs-active
  • http://www.cs.mcgill.ca/~vkules/bandits.pdf
  • www.cs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt
  • http://www.inf.ed.ac.uk/teaching/courses/rl/slides15/rl02.pdf

 

Supporting & Reference Materials

  • Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction”, MIT Press, 1998
  • Marco Wiering and Martijn van Otterlo (Eds.), “Reinforcement Learning: State-of-the-Art”, Adaptation, Learning, and Optimization, Vol. 12, Springer, 2012
  • Stuart Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach” (Prentice Hall Series in Artificial Intelligence), 2009
  • Tom Mitchell, “Machine Learning”, McGraw-Hill Education, 1997
  • Ethem Alpaydin, “Introduction to Machine Learning”, The MIT Press, third edition, 2014
  • Stephen Marsland, “Machine Learning: An Algorithmic Perspective”, Chapman and Hall/CRC, 2nd edition, 2014