36 Basics of Reinforcement Learning – II


 

Learning Objectives

 

The learning objectives of the module are as follows:

 

•      To understand Reinforcement Learning and its variants

•      To discuss Passive Learning and Active Learning

•      To explain the Markov Decision Process and Partially Observable MDP

•      To illustrate Least Mean Squares and Adaptive Dynamic Programming

 

 

38.1 Reinforcement Learning

 

As already discussed in the previous module, reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards. It is a trial-and-error learning paradigm that learns from rewards and punishments. Reinforcement learning is not just an algorithm but a paradigm in itself. Its objective is to learn about a system from minimal feedback.

 

Let us recall the mathematical formulation of the learning task: we execute actions in the environment, observe the results and receive rewards.

The learning task associated with reinforcement learning can be characterized from three perspectives, namely the learning type, the environment and the rewards. The learning type can be passive or active. The environment can be known or unknown; in other words, does the agent know the effects of its actions? Does the agent have a model of the environment?

 

Here we will assume that the environment is accessible. The rewards can be terminal only or non-terminal.

 

38.2 Passive versus Active Learning

 

Passive learning is learning by an agent that has a fixed policy and tries to learn the utilities of states by observing the world go by; it is analogous to policy evaluation in policy iteration. Here the agent only learns how “good” each state or action is. The values of states are learnt and actions are taken accordingly.

 

Active learning is learning in which the agent attempts to find an optimal (or at least good) policy by acting in the world. The policy is not fixed; the agent is instead searching for the best policy. It is analogous to solving the underlying MDP (Markov Decision Process).

 

38.3  Reinforcement Approaches

 

Reinforcement learning is a sequential decision problem: the agent moves from one state to the next, and then to the succeeding states, receiving a reward for each state visited.

 

Model-Based vs. Model-Free RL

 

The model-based approach to RL learns the MDP model, or an approximation of it, and then uses it for policy evaluation or to find the optimal policy. The agent learns values of states (or state histories) and tries to maximize the utility of their outcomes. Here a model of the environment is needed, including the actions that are available and the states they lead to. In another approach the agent learns the values of state-action pairs. In this case the agent does not require a model of the environment (except for the legal moves) and cannot look ahead. The model-free approach to RL derives the optimal policy without explicitly learning the model. It is useful when the model is difficult to represent and/or learn.

 

Deterministic transitions and stochastic transitions are two reinforcement learning perspectives. M^a_ij is the probability of reaching state j when taking action a in state i. There is uncertainty in leaving state i when an action is taken: there is no guarantee that state j will actually be reached. This is a sequential problem where each and every move is recorded.

 

A simple environment presents the agent with a sequential decision problem:

 

Move cost = 0.04

Figure 38.1 Reinforcement learning

 

The temporal credit assignment problem arises when reinforcement is sparse: the agent must decide which of the actions in a sequence deserve credit or blame for the eventual reward. In an offline algorithm the action sequences are determined beforehand; in an online algorithm the action sequences are conditional on observations made along the way (Figure 38.1). The online setting is important in stochastic environments (e.g. flying a jet).

 

38.4 RL Model

 

The objective of reinforcement learning is to find the mapping that maximizes some combination of the future reinforcements (rewards) received over time. Valuation models quantify how good a mapping is. We can first consider the finite horizon model, where we maximize the total reward over a finite horizon; in other words the agent has n time steps to live. This is defined as

U = E[ r_0 + r_1 + … + r_{n-1} ]

When we do not assume a finite time we have the infinite horizon discounted model. Defining the expected value as the total reward is problematic with infinite horizons, hence we introduce a discount factor 0 ≤ γ < 1, that is, future rewards are discounted by γ per time step:

U = E[ r_0 + γ r_1 + γ² r_2 + … ] = E[ Σ_t γ^t r_t ]

The average reward model is given as

U = lim_{n→∞} (1/n) E[ r_0 + r_1 + … + r_{n-1} ]

In order to solve the problem mathematically we formulate it as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP). We maximize the state-value and action-value functions using the Bellman optimality equation. We need to use approximation methods such as dynamic programming, Monte Carlo methods and temporal-difference learning to solve the Bellman equation.
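To make the three valuation models concrete, the following minimal sketch (illustrative only, using a hypothetical reward sequence) computes the finite-horizon return, the discounted return and the average reward for a list of rewards:

```python
# Illustrative sketch: scoring a reward sequence under the three
# valuation models discussed above. The reward list is hypothetical.

def finite_horizon_return(rewards, n):
    """Total reward over the first n time steps."""
    return sum(rewards[:n])

def discounted_return(rewards, gamma=0.9):
    """Infinite-horizon discounted model: reward at time t is scaled by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """Average reward per time step over the observed horizon."""
    return sum(rewards) / len(rewards)

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]              # hypothetical reward sequence
print(finite_horizon_return(rewards, n=3))        # 1.0
print(round(discounted_return(rewards, 0.9), 2))  # 1*0.9^2 + 5*0.9^4 ≈ 4.09
print(average_reward(rewards))                    # 1.2
```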

 

38.4.1 Markov Decision Process and Partially Observable Markov Decision Process

 

The usual Markov Decision Process has the following components: the set of states S, the set of actions A, the state transition probabilities P(s’|s,a), the immediate rewards R(s,a) and the discount factor γ. The current state is always perfectly observed.

Figure 38.2 MDP Process

 

A Partially Observable Markov Decision Process (POMDP) has, just like an MDP, a set of states S, a set of actions A, probabilistic transitions, immediate rewards and a discount factor, but in addition has observations Z, observation probabilities and an initial belief b0. A brief account of how an MDP works is given below.

 

An MDP has four components, S, A, R, T, as already discussed: the (finite) state set S (|S| = n), the (finite) action set A (|A| = m), the (Markov) transition function T(s,a,s’) = Pr(s’ | s,a) (the probability of going to state s’ after taking action a in state s) and the bounded, real-valued reward function R(s) (the immediate reward we get for being in state s). For example, in a goal-based domain R(s) may equal 1 for goal states and 0 for all others. The reward function can be generalized to include action costs, R(s,a), and can also be generalized to be a stochastic function. Moreover, MDPs generalize easily to countable or continuous state and action spaces.
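As an illustration of how the components S, A, T and R fit together, here is a minimal sketch of a tabular MDP and value iteration, one of the dynamic-programming methods mentioned earlier for solving the Bellman equation. The toy states, actions, transition probabilities and rewards are assumptions made purely for illustration:

```python
# Minimal MDP sketch (S, A, T, R) plus value iteration.
# All states, actions, probabilities and rewards below are illustrative.

states = ["s0", "s1", "s2"]
actions = ["a0", "a1"]
gamma = 0.9

# T[(s, a)] maps each successor state s' to Pr(s' | s, a).
T = {
    ("s0", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s0", "a1"): {"s2": 1.0},
    ("s1", "a0"): {"s0": 1.0},
    ("s1", "a1"): {"s2": 1.0},
    ("s2", "a0"): {"s2": 1.0},
    ("s2", "a1"): {"s2": 1.0},
}
R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}   # immediate reward for being in state s

def value_iteration(eps=1e-6):
    """Iterate the Bellman optimality update until the utilities converge."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in actions
            )
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new

print(value_iteration())
```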

 

38.5  Passive Learner in a Known Environment

 

A passive learner simply watches the world go by and tries to learn the utility of being in various states. Another way to think of a passive learner is as an agent with a fixed policy that tries to determine the benefits of following it. In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below:

 

•      Agent can move {North, East, South, West}

 

•        Terminate on reaching [4,2] or [4,3]

 

Figure 38.2 Example

Figure 38.3 State Transition Probabilities

 

 

The agent is provided with a model M_ij giving the probability of reaching state j from state i (each state transitions to its neighbouring states with equal probability). Under this assumption the transition probability out of a state is determined by the number of transitions from that state: if a state has 3 transitions from it, the probability of transitioning to each of those states is 0.33 (Figure 38.3).

 

38.6 Passive Learning in a Known Environment

 

There are three approaches to passive learning in a known environment, namely LMS (least mean squares), ADP (adaptive dynamic programming) and TD (temporal difference learning). We will discuss each method in detail.

 

38.6.1 LMS (Least Mean Squares)

 

In this method we find an estimate of the utility U^π(s). The method has slow convergence, and it takes the agent well over 1000 training sequences to get close to the correct values. Assume that after entering the +1 or -1 state the agent enters a zero-reward terminal state. We need to follow the policy for many epochs to obtain the required training sequences. The agent makes random runs (sequences of random moves) through the environment (the example given in Figure 38.2) such as:

 

[1,1]->[1,2]->[1,3]->[2,3]->[3,3]->[4,3] =  +1

 

[1,1]->[2,1]->[3,1]->[3,2]->[4,2] =  -1

 

Here we carry out direct estimation, which is model-free. We estimate U^π(s) as the average total reward of the epochs containing s (calculated from s to the end of the epoch). The reward-to-go of a state s is the sum of the (discounted) rewards from that state until a terminal state is reached. The key idea is to use the observed reward-to-go of a state as direct evidence of the actual expected utility of that state.
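A minimal sketch of this direct (LMS) estimation is given below, using the two sample runs above. The per-step reward of -0.04 (reflecting the move cost of 0.04) and the list-of-(state, reward) epoch format are illustrative assumptions:

```python
# Direct (LMS) utility estimation: U(s) is the average observed reward-to-go
# over the epochs that visit s. The epochs mirror the two sample runs above;
# the -0.04 per-step reward and the data format are illustrative assumptions.

from collections import defaultdict

epochs = [
    # [(state, reward received at that state), ...]
    [((1,1), -0.04), ((1,2), -0.04), ((1,3), -0.04), ((2,3), -0.04),
     ((3,3), -0.04), ((4,3), +1.0)],
    [((1,1), -0.04), ((2,1), -0.04), ((3,1), -0.04), ((3,2), -0.04),
     ((4,2), -1.0)],
]

totals = defaultdict(float)   # sum of observed reward-to-go per state
counts = defaultdict(int)     # number of visits contributing to the average

for epoch in epochs:
    rewards = [r for _, r in epoch]
    for t, (state, _) in enumerate(epoch):
        reward_to_go = sum(rewards[t:])   # undiscounted reward-to-go from step t
        totals[state] += reward_to_go
        counts[state] += 1

U = {s: totals[s] / counts[s] for s in totals}
print(U[(1,1)])   # average reward-to-go from the start state over both epochs
```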

 

38.6.2 ADP (Adaptive Dynamic Programming)

 

Adaptive Dynamic Programming (ADP) is a model-based approach. In this method we follow the policy for a while and then use the value iteration or policy iteration algorithm to calculate the exact utilities of states given an estimated model. The method makes optimal use of the local constraints on the utilities of states imposed by the neighbourhood structure of the environment. However, it is somewhat intractable for large state spaces. Let R(i) be the reward of being in state i (often non-zero for only a few end states) and M_ij the probability of a transition from state i to j.

 

Being model-based, ADP follows the policy for a while and estimates the transition model and the reward function from observations. The estimated model is then used to compute the utility of the policy. We estimate the transition model T(s,a,s’), which is simply the fraction of times we see s’ after taking a in state s. From Figure 38.4 we see that

 

U(3,3) = 0.33 x U(4,3) + 0.33 x U(2,3) + 0.33 x U(3,2)

       = 0.33 x 1.0 + 0.33 x 0.0886 + 0.33 x (-0.4430)

       = 0.2152
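The following sketch shows the ADP bookkeeping before we turn to TD: transition counts are turned into an estimated model, which is then combined with the current utility estimates exactly as in the U(3,3) calculation above. The counts and utility values are illustrative, not taken from an actual run:

```python
# ADP sketch: estimate the transition model from observed transition counts,
# then evaluate a state with the estimated model and the current utility
# estimates (as in the U(3,3) calculation above). Counts are illustrative.

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))

def record_transition(s, s_next):
    counts[s][s_next] += 1

def estimated_M(s):
    """M_ij estimated as the fraction of times s was followed by each s'."""
    total = sum(counts[s].values())
    return {s_next: c / total for s_next, c in counts[s].items()}

# Pretend we watched state (3,3) transition equally often to its neighbours.
for s_next, n in {(4,3): 10, (2,3): 10, (3,2): 10}.items():
    for _ in range(n):
        record_transition((3,3), s_next)

U = {(4,3): 1.0, (2,3): 0.0886, (3,2): -0.4430}   # current utility estimates
U_33 = sum(p * U[s_next] for s_next, p in estimated_M((3,3)).items())
print(round(U_33, 4))   # ~0.2152, matching the hand calculation above
```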

Figure 38.3 Adaptive Dynamic Programming

38.6.3 TD (Temporal Difference Learning)

 

The key idea of temporal difference learning is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations. The performance over runs is “noisier” than LMS but has a smaller error. The approach deals only with states observed during sample runs, not all states, unlike ADP. Suppose we observe a transition from state i to state j with U(i) = -0.5 and U(j) = +0.5. These values suggest that we should increase U(i) to make it agree better with its successor, which can be achieved using the following update rule:

U(i) ← U(i) + α [ R(i) + U(j) − U(i) ]

This intuitively moves us closer to satisfying the Bellman constraint. In the above equation α is a learning rate parameter, which can be adapted for good convergence. We will discuss more about this approach in a succeeding module.
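A minimal sketch of this single TD adjustment, assuming R(i) = 0 and an illustrative learning rate of 0.1:

```python
# TD adjustment sketch for the transition described above: U(i) is nudged
# toward R(i) + U(j). The reward R(i) = 0 and the step size are assumptions.

def td_update(U_i, U_j, reward=0.0, alpha=0.1):
    """One temporal-difference adjustment of U(i) after observing i -> j."""
    return U_i + alpha * (reward + U_j - U_i)

U_i, U_j = -0.5, +0.5
U_i = td_update(U_i, U_j)
print(U_i)   # -0.4: U(i) moves closer to agreement with its successor
```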

 

38.7 Passive Learning in an Unknown Environment

 

The Least Mean Squares (LMS) and Temporal-Difference (TD) approaches operate unchanged in an initially unknown environment. The Adaptive Dynamic Programming (ADP) approach adds a step that updates an estimated model of the environment.

 

In the ADP approach the environment model is learned by direct observation of transitions. The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours. The ADP approach and the TD approach are closely related: both try to make local adjustments to the utility estimates so that each state “agrees” with its successors. While TD adjusts a state to agree with its observed successor, ADP adjusts the state to agree with all of the successors. An important difference is that TD makes a single adjustment per observed transition, whereas ADP makes as many adjustments as needed to restore consistency between the utility estimates U and the environment model M. To make ADP more efficient we can directly approximate the value iteration or policy iteration algorithm. The prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment of their own utility estimates. The advantages of approximate ADP include efficiency in terms of computation and the elimination of the long value iterations that occur in the early stages.

 

38.8 Active Learning in an Unknown Environment

 

To achieve active learning, minor changes need to be incorporated into the passive learning agent. The environment model must now incorporate the probabilities of transitions to other states given a particular action. In order to maximize its expected utility the agent needs a performance element to choose an action at each step.

Active ADP Approach

 

The agent needs to learn the transition probability M^a_ij instead of M_ij, and the input to the update function will include the action taken, as shown in the sketch below.
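In this sketch the transition counts are keyed by (state, action) rather than by state alone; the states, action names and observed transitions are illustrative:

```python
# Active ADP sketch: counts are keyed by (state, action), so the learned
# model is M^a_ij = Pr(j | i, a). All names and data here are illustrative.

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))

def record(s, a, s_next):
    counts[(s, a)][s_next] += 1

def estimated_M(s, a):
    """Fraction of times action a in state s led to each successor state."""
    total = sum(counts[(s, a)].values())
    return {s_next: c / total for s_next, c in counts[(s, a)].items()}

record((1,1), "North", (1,2))
record((1,1), "North", (1,2))
record((1,1), "North", (2,1))   # e.g. a slip to a neighbouring state
print(estimated_M((1,1), "North"))   # {(1,2): 0.66..., (2,1): 0.33...}
```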

 

Active TD Approach

 

The model acquisition problem for the TD agent is identical to that for the ADP agent. The update rule remains unchanged and the TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity.

 

38.9 Explore versus Exploit

 

Exploitation is the maximization of immediate reward, while exploration aims at long-term wellbeing. A simple strategy is to choose a random action once in every k steps and otherwise (k-1 out of k times) choose the action with the highest expected utility. The learner actively interacts with the environment: at the beginning it knows nothing about the environment, then it gradually gains experience and learns how to react to the environment. A sketch of the 1-in-k selection rule follows.
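The sketch below implements this rule (a form of epsilon-greedy selection with epsilon = 1/k); the utility estimates and action names are illustrative:

```python
# Sketch of the "random action once in k times" rule described above.
# The expected-utility table is an illustrative assumption.

import random

def choose_action(expected_utility, k=10):
    """Pick a random action 1 time in k, otherwise the greedy action."""
    actions = list(expected_utility)
    if random.random() < 1.0 / k:                  # explore
        return random.choice(actions)
    return max(actions, key=expected_utility.get)  # exploit

expected_utility = {"North": 0.2, "East": 0.5, "South": -0.1, "West": 0.0}
print(choose_action(expected_utility, k=10))
```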

 

38.9.1 Dilemma (exploration-exploitation)

 

After some number of steps, should I select the best current choice (exploitation) or try to learn more about the environment (exploration)? Choosing exploitation may result in the selection of a sub-optimal action and prevent the learning of the optimal choice. On the other hand, exploration may spend too much time trying bad, currently sub-optimal actions. We therefore have two extremes: the wacky approach (pure exploration), which acts randomly in the hope of eventually exploring the entire environment, and the greedy approach (pure exploitation), which acts to maximize utility using the current estimate. A reasonable balance is to act more wacky (exploratory) when the agent has little idea of the environment and more greedy when its model is close to correct. An example of combining the approaches is the n-armed bandit problem.

 

38.9.2 Single State: K-armed Bandit Approach

 

In the k-armed bandit problem we must explore a variety of actions (in order to estimate their values) and exploit the best of them (in order to maximize reward). Among the K levers, we choose the one that pays best (Figure 38.4). Here Q(a) is the value of action a and r_a is its reward. Set Q(a) = r_a and choose a* if Q(a*) = max_a Q(a).

 

Figure 38.4 K-armed Bandit Approach

38.9.3 Example: K-armed Bandit

 

Let us assume that we are given Rs.10 to play on a slot machine with 5 levers, where each play costs Rs.1 and each pull of a lever may produce a payoff of Rs.0, Rs.1, Rs.5 or Rs.10. Find the optimal policy, i.e. the one where the payoff is greatest. Again there is a tradeoff between exploitation and exploration: exploitation is continuing to pull the lever that returns a positive value, while exploration is trying to pull a new one. Here we have two models: the deterministic model, where the payoff of each lever is fixed but unknown in advance, and the stochastic model, where the payoff of each lever is uncertain, with a known or unknown probability.

 

In the deterministic case, when Q(a) is the value of action a and r_a is the reward of action a, we set Q(a) = r_a and choose an action a* if Q(a*) = max_a Q(a). In the stochastic model the reward is non-deterministic, p(r|a), and Q_t(a) is the estimate of the value of action a at time t. We then use the delta rule

Q_{t+1}(a) = Q_t(a) + η [ r_{t+1} − Q_t(a) ]

where η is the learning factor; the estimate Q_{t+1}(a) converges to the mean of p(r|a) as t increases.
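A small simulation sketch of the stochastic bandit with the delta rule is shown below. The payoff probabilities of each lever, the learning factor η = 0.1, the 1-in-k exploration rule and the number of plays are all illustrative assumptions, not part of the module's example:

```python
# K-armed bandit sketch with the delta rule Q_{t+1}(a) = Q_t(a) + eta*(r - Q_t(a)).
# Payoff probabilities, eta and the 1-in-k exploration are illustrative.

import random

payoffs = [0, 1, 5, 10]                        # possible payoffs in Rs
lever_probs = [                                # unknown to the agent
    [0.7, 0.2, 0.1, 0.0],
    [0.4, 0.3, 0.2, 0.1],
    [0.9, 0.05, 0.04, 0.01],
    [0.5, 0.3, 0.15, 0.05],
    [0.6, 0.2, 0.1, 0.1],
]

def pull(lever):
    """Sample a payoff for the given lever from its (hidden) distribution."""
    return random.choices(payoffs, weights=lever_probs[lever])[0]

Q = [0.0] * 5                                  # value estimate per lever
eta, k = 0.1, 5
for t in range(1000):
    if random.random() < 1.0 / k:
        a = random.randrange(5)                # explore
    else:
        a = max(range(5), key=lambda i: Q[i])  # exploit
    r = pull(a) - 1                            # each play costs Rs.1
    Q[a] += eta * (r - Q[a])                   # delta rule update

print([round(q, 2) for q in Q])                # estimated net value of each lever
```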

 

In the single-state case (a single slot machine) we learn Q(s_i, a_j), the value of action a_j in state s_i, as the reward, while for multiple states we use p(r|s_i, a_j) to account for the different reward probabilities. An action causes a state change in addition to a reward, and rewards are not necessarily of immediate value; there can also be delayed rewards.

 

Summary:

  • Outlined the objectives of Reinforcement Learning
  • Explained Passive and Active Learning.
  • Reviewed the known and unknown environment
  • Discussed exploration vs exploitation

 


Web Links

  • http://en.wikipedia.org/wiki/Reinforcement_learning
  • http://www.cs.cmu.edu/afs/cs/project/theo-20/www/mlbook/ch13.pdf
  • http://neuro.bstu.by/ai/RL-3.pdf
  • https://cs.uwaterloo.ca/~ppoupart/ICML-07-tutorial-slides/icml07-brl-tutorial-part2-intro-ghavamzadeh.pdf
  • ce.sharif.edu/courses/91-92/1/ce717-2/resources/root/Lectures/RL.pdf
  • www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-10.pdf
  • https://www.cs.uic.edu/~piotr/cs594/YijueRL.ppt
  • https://en.wikipedia.org/wiki/Multi-armed_bandit
  • http://teaching.csse.uwa.edu.au/units/CITS3001/lectures/lectures/3001%20Reinforcement%20learning.
  • https://www.udacity.com/wiki/cs271/unit10-notes#!#passive-vs-active
  • http://www.cs.mcgill.ca/~vkules/bandits.pdf
  • www.cs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt
  • http://www.inf.ed.ac.uk/teaching/courses/rl/slides15/rl02.pdf

 

Supporting & Reference Materials

  • Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction”, MIT Press, 1998
  • Wiering, Marco, van Otterlo, Martijn (Eds.), “Reinforcement Learning”, State-of-the-Art Series: Adaptation, Learning, and Optimization, Vol. 12, 2012
  • Stuart Russell and Peter Norvig “Artificial Intelligence: A Modern Approach Prentice Hall Series in Artificial Intelligence), 2009
  • Tom Mitchell, “Machine Learning”, McGraw-Hill Education, 1997
  • Alpaydin Ethem, “Introduction to Machine Learning”, The MIT Press; third edition, 2014
  • Stephen Marsland, “Machine Learning: An Algorithmic Perspective”, Chapman and Hall/CRC; 2 edition, 2014