40 Temporal Difference Learning


 

Learning Objectives

 

The learning objectives of the module are as follows:

 

• To understand Temporal Difference Learning

• To describe the Optimal Value Functions

• To analyse TD-Prediction

• To describe Learning an action value function

 

40.1 Introduction

 

Policy evaluation algorithms are intended to estimate the value functions Vπ or Qπ for a given policy π. Typically these are on-policy algorithms, and the considered policy is assumed to be stationary (or “almost” stationary). There are three classes of policy evaluation algorithms, namely Monte Carlo methods, Dynamic Programming, and Temporal Difference (TD) learning. Monte Carlo methods do not need a model of the learning environment: from experience in the form of sequences of state-action-reward samples they can approximate future rewards. They are based on the simple idea of averaging a number of random samples of a random quantity in order to estimate its average. Dynamic Programming is based on the Bellman equation, where the problem is broken down into sub-problems, and it depends on a perfect model of the environment. However, Monte Carlo methods only update after a complete sequence, that is, when the final state is reached, while Dynamic Programming requires storing and sweeping over a full model of the environment. The main issue is whether it is possible to avoid the computational expense and space requirements of storing the transition model estimate needed for full Dynamic Programming policy evaluation.

 

TD learning is considered one of the most novel ideas in reinforcement learning. Temporal Difference learning is a model-free approach which does not store an estimate of the entire transition function but instead stores an estimate of Vπ, which requires only O(n) space. It carries out local, cheap updates of the utility/value function on a per-action basis. As you may recall, value functions are state or state-action pair functions that estimate how good a particular action will be in a given state, or what the return for that action is expected to be. Temporal Difference (TD) learning methods can be used to estimate these value functions. If the value functions were to be calculated without estimation, the agent would need to wait until the final reward was received before any state-action pair values could be updated. Once the final reward was received, the path taken to reach the final state would need to be traced back and each value updated accordingly.

 

40.1.1 Comparison – Monte Carlo, Dynamic Programming and TD Approaches

 

Before we discuss the TD method, let us compare Monte Carlo and Dynamic Programming methods. We will discuss two aspects, namely bootstrapping and sampling. In bootstrapping, an update involves an estimate from a nearby state. Monte Carlo methods do not bootstrap, that is, each state is estimated independently, while Dynamic Programming methods bootstrap, that is, the estimate for each state depends on the estimates of nearby states. In sampling, an update involves an actual return. Monte Carlo methods sample, that is, they learn from experience in the environment and obtain an actual return. Dynamic Programming does not sample but learns by propagating constraints until convergence, through iterating the Bellman equations without interaction with the environment.

 

TD bootstraps, that is, an update involves an estimate of a nearby state. TD also samples, that is, an update involves an actual return.

 

40.2 Optimal Value Functions

 

Before we discuss TD learning, let us recall that for finite MDPs, policies can be partially ordered, and there is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote all optimal policies by π*. Optimal policies share the same optimal state-value function V* and the same optimal action-value function Q*:
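For reference, following the standard textbook notation (e.g. Sutton and Barto), these functions are defined as:

    V*(s) = max_π Vπ(s)   for all states s
    Q*(s, a) = max_π Qπ(s, a) = E[ rt+1 + γ V*(st+1) | st = s, at = a ]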

 

The latter is the expected return for taking action ‘a’ in state ‘s’ and thereafter following an optimal policy. Now let us discuss the Bellman optimality equation for V*.

 

40.2.1 Bellman Optimality Equation for V*

 

The value of a state under an optimal policy must equal the expected return for the best action from that state; V* is the unique solution of this system of nonlinear equations.
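In the usual notation, with transition probabilities P(s′ | s, a) and expected rewards r(s, a, s′), the Bellman optimality equation for V* has the standard form:

    V*(s) = max_a Σ_s′ P(s′ | s, a) [ r(s, a, s′) + γ V*(s′) ]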

 

The relevant backup diagram is shown in Figure 40.1.

 

 

Figure 40.1 Backup diagram

 

40.2.2 Bellman Optimality Equation for Q*

 

Similarly, the value of a state-action pair under an optimal policy must equal the expected return for taking that action and thereafter acting optimally; Q* is the unique solution of this system of nonlinear equations.
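In the same notation, the Bellman optimality equation for Q* has the standard form:

    Q*(s, a) = Σ_s′ P(s′ | s, a) [ r(s, a, s′) + γ max_a′ Q*(s′, a′) ]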

 

The relevant backup diagram is shown in Figure 40.2.

Given Q*, the agent does not even have to do a one-step-ahead search in order to act optimally: in each state it simply chooses an action that maximizes the optimal action-value function.
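In other words, any policy that is greedy with respect to Q* is optimal:

    π*(s) = argmax_a Q*(s, a)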

 


40.2.3 Solving the Bellman Optimality Equation

 

Finding an optimal policy by solving the Bellman optimality equation requires accurate knowledge of the environment dynamics, enough space and time to do the computation, and the Markov property to be satisfied. The space and time required are polynomial in the number of states, but the number of states is often huge (e.g., backgammon has about 10^20 states). Therefore we usually have to settle for approximations. Many reinforcement learning methods can be understood as approximately solving the Bellman optimality equation.

 

40.3 Policy Evaluation

 

Monte Carlo methods do not need full knowledge of the environment; they need only experience, or simulated experience (Figure 40.3). One such method averages sample returns, and is generally defined only for episodic tasks. The overall scheme is similar to DP in that it alternates policy evaluation and policy improvement.

Figure 40.3 Simple Monte Carlo Method

 

Dynamic Programming (Figure 40.4) needs a complete model of the environment and its rewards. DP bootstraps, that is, it updates estimates on the basis of other estimates.

 

In TD prediction we estimate the actual return as the sum of the next reward plus the value of the next state (Figure 40.5). As we discussed earlier, when the environment model, P(st+1 | st, at) and p(rt+1 | st, at), is not known, the setting is called model-free learning. The temporal difference is defined as the difference between the current value estimate and the sum of the next reward and the discounted value of the next state.

40.4 TD Prediction

 

The TD update for a transition from s to s’ is given in Figure 40.6.

Figure 40.6 TD Update
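In symbols, with learning rate α and discount factor γ, this update has the standard form:

    V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]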

 

The update maintains a “mean” of the (noisy) value samples. If the learning rate decreases appropriately with the number of samples (e.g., α = 1/n), then the value estimates will converge to the true values, as shown in Figure 40.7.
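As a small illustration of why a 1/n learning rate behaves like averaging, the following sketch (plain Python; the numbers are illustrative and not taken from the module) compares the incremental update with the ordinary sample mean:

    import random

    # Noisy "returns" observed for a single state (true value 5.0)
    samples = [random.gauss(5.0, 2.0) for _ in range(10000)]

    estimate = 0.0
    for n, g in enumerate(samples, start=1):
        alpha = 1.0 / n                     # learning rate decreasing as 1/n
        estimate += alpha * (g - estimate)  # incremental update toward the new sample

    print(estimate)                     # incremental estimate ...
    print(sum(samples) / len(samples))  # ... equals the sample mean (up to rounding)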

 

40.5 Error Signal

 

Another aspect we need for estimation and bootstrapping is the error signal. The error signal is defined as the difference between the current estimate and an improved estimate, and it drives the change of the current estimate. There are several notions of error, as discussed below:

 

The goal of self-consistent prediction is to predict returns that are self-consistent from one time step to the next (this is true of both TD and DP).

 

Learning using the error signal: we could simply do a reassignment that moves the current estimate toward the improved estimate.
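In the usual notation, writing the TD error as δt, the reassignment is:

    δt = rt+1 + γ V(st+1) − V(st)
    V(st) ← V(st) + α δt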

 

where ‘α’ is a small “learning rate” parameter (either constant, or decreasing over time). The above algorithm is known as “TD(0)”.

 

40.6 TD Learning

 

TD learning combines the “bootstrapping” (one-step self-consistency) idea of DP with the “sampling” idea of Monte Carlo (MC). Like MC, it does not need a model of the environment, only experience. However, TD, unlike MC, can be fully incremental: we can learn before knowing the final outcome, and we can learn without the final outcome at all (from incomplete sequences). The incorporation of bootstrapping into TD reduces its variance compared to Monte Carlo, but possibly introduces greater bias.

 

40.6.1 Example – Driving Home

 

Initially we will consider TD(0), that is, the case where λ = 0 in TD(λ). In this case we only look at the next reward. To understand the concept, consider the example shown in Figure 40.8. The value of each state in this example is the expected time-to-go.

Figure 40.8 Driving Home Example

 

Figure 40.9 shows the estimated values when we use Monte Carlo and TD approaches.

Figure 40.9 Predicted Values for Monte Carlo and TD Methods

 

Now the question is whether it is really necessary to wait until the end of the episode (here, arriving home) to start learning. According to the Monte Carlo method it is necessary to wait. However, TD learning argues that learning can occur on-line. Suppose, on another day, you again estimate when leaving your office that it will take 30 minutes to drive home, but then you get stuck in a massive traffic jam. In TD the initial estimate would be changed immediately, without waiting to arrive home. The algorithm for TD(0) is given in Figure 40.10. When the same update is applied to action values and the policy is kept greedy with respect to the current estimate, we obtain a control method called Sarsa, the on-policy TD control algorithm.

Figure 40.10 TD(0) Algorithm
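As a concrete illustration of these updates, here is a minimal sketch of tabular Sarsa in Python. The environment interface used here (env.actions, env.reset(), and env.step() returning the next state, reward, and a done flag) is a hypothetical one chosen for the example, not something defined in this module:

    import random
    from collections import defaultdict

    def sarsa(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Sarsa: on-policy TD control over action values Q(s, a)."""
        Q = defaultdict(float)  # Q[(state, action)], initialised to 0

        def policy(state):
            # epsilon-greedy with respect to the current estimate
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(state, a)])

        for _ in range(episodes):
            s = env.reset()
            a = policy(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = policy(s_next)
                target = r if done else r + gamma * Q[(s_next, a_next)]
                # TD update: move Q(s, a) toward the one-step target
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
        return Q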

 

40.6.2 Optimality of TD(0)

 

Suppose only a finite amount of experience is available, say 10 episodes or 100 time steps. Intuitively, we repeatedly present this experience until convergence is achieved, with updates made after each batch of training data. For any finite Markov prediction task, under such batch updating, TD(0) converges deterministically for a sufficiently small learning rate α. The MC method also converges deterministically under batch updating, but to a different answer.

 

40.7 TD(λ)

 

Figure 40.11 shows the TD prediction for n steps, and the mathematics behind it is given in Figure 40.12.

Figure 40.11 TD(λ) Prediction

 

Figure 40.12 Mathematics of N steps
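For reference, the quantities behind these figures have the standard form of the n-step return Gt(n) and the λ-return Gtλ, the latter being a weighted average of the n-step returns:

    Gt(n) = rt+1 + γ rt+2 + … + γ^(n−1) rt+n + γ^n V(st+n)
    Gtλ = (1 − λ) Σ_{n=1..∞} λ^(n−1) Gt(n)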

 

40.7.1 The λ Parameter

 

The λ in TD(λ) provides a smooth interpolation between λ = 0 (pure TD) and λ = 1 (pure MC). For many toy grid-world type problems, it can be shown that intermediate values of λ work best. However, for real-world problems the best λ is highly problem-dependent.

 

TD(λ) converges to the correct value function Vπ(s) with probability 1 for all λ. For a lookup table representation (Vπ(s) is a table), this requires that every state is visited an infinite number of times and that a certain schedule for decreasing α(t) is followed. However, TD(λ) converges only for a fixed policy.

 

40.8 Learning Backgammon using TD(λ)

 

The neural net observes a sequence of input patterns x1, x2, x3, …, xf, which is the sequence of board positions occurring during a game, with xf the final position. The representation of the board can be the raw board description, that is, the number of White or Black checkers at each location; a simple truncated unary encoding can be used for this representation. At the final position xf, the reward signal z is z = 1 if White wins and z = 0 if Black wins. We train the neural net using the gradient version of TD(λ). The trained network output Vt = V(xt, w) should then estimate the probability P(White wins | xt), that is, the probability that White wins given the board position xt.
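In the usual formulation of this gradient version of TD(λ), as used in Tesauro’s TD-Gammon, the weights w are adjusted after each time step according to:

    wt+1 = wt + α (Vt+1 − Vt) Σ_{k=1..t} λ^(t−k) ∇w Vk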

 

40.8.1 Multilayer Neural Network

 

The multilayer neural network for learning Backgammon is shown in Figure 40.13.

Figure 40.13 Multilayer Network

 

We let the neural net make the moves itself, using its current evaluator: all legal moves are scored, and the move with the maximum value Vt is picked for White and the move with the minimum Vt for Black.

 

TD-Gammon can teach itself by playing games against itself and learning from the outcome. This works even when starting from random initial play and zero initial expert knowledge, which is surprising. In this setting the program achieves strong intermediate play; adding hand-crafted features takes it to an advanced level of play, adding 2-ply search yields strong master play, and 3-ply search yields almost superhuman play. “TD-Leaf”, that is, n-step TD backups in 2-player games, has also achieved excellent results for checkers and chess.

 

40.9 Advantages of TD Learning

 

TD methods do not require a model of the environment, only experience, and they can be fully incremental. With TD, learning can happen before knowing the final outcome, so it requires less memory and less peak computation. We can in fact learn without ever knowing the final outcome, that is, from incomplete sequences; this is helpful in applications with very long episodes. Both MC and TD converge under certain assumptions, but generally TD does better on stochastic tasks.

 

Summary

  • Outlined TD learning, optimal value functions, and TD prediction
  • Explained the advantages of TD learning
  • Reviewed the Bellman optimality equations and TD prediction
  • Discussed TD-learning steps and examples

 

Summary of the Complete Course

  • Machine learning – an important approach to solve many of today’s problems
  • In this course we discussed the following topics
  • Basics of Machine Learning
  • Mathematical Foundations – Probability, Probability Distributions, Linear Algebra, Decision Theory and Information Theory
  • Supervised Techniques & Classification – KNN, Decision Trees, SVM, Neural Networks & Genetic Algorithms
  • Clustering – K-Means
  • Semi-supervised Learning
  • Bayesian Learning – Naïve Bayes, Bayesian Belief Networks, EM method, HMM
  • Reinforcement Learning

 

Web Links

  • http://en.wikipedia.org/wiki/Reinforcement_learning
  • http://www.cs.cmu.edu/afs/cs/project/theo-20/www/mlbook/ch13.pdf
  • http://neuro.bstu.by/ai/RL-3.pdf
  • https://cs.uwaterloo.ca/~ppoupart/ICML-07-tutorial-slides/icml07-brl-tutorial-part2-intro-ghavamzadeh.pdf
  • ce.sharif.edu/courses/91-92/1/ce717-2/resources/root/Lectures/RL.pdf
  • www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-10.pdf
  • https://www.cs.uic.edu/~piotr/cs594/YijueRL.ppt
  • https://en.wikipedia.org/wiki/Multi-armed_bandit
  • http://teaching.csse.uwa.edu.au/units/CITS3001/lectures/lectures/3001%20Reinforcement%20learning.
  • https://www.udacity.com/wiki/cs271/unit10-notes#!#passive-vs-active
  • http://www.cs.mcgill.ca/~vkules/bandits.pdf
  • www.cs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt
  • http://www.inf.ed.ac.uk/teaching/courses/rl/slides15/rl02.pdf
  • ai.cs.umbc.edu/~oates/classes/2011/ML/TemporalDifference.ppt

Supporting & Reference Materials

  • Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction”, MIT Press, 1998
  • Marco Wiering and Martijn van Otterlo (Eds.), “Reinforcement Learning: State-of-the-Art”, Series: Adaptation, Learning, and Optimization, Vol. 12, 2012
  • Stuart Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach” (Prentice Hall Series in Artificial Intelligence), 2009
  • Tom Mitchell, “Machine Learning”, McGraw-Hill Education, 1997
  • Ethem Alpaydin, “Introduction to Machine Learning”, The MIT Press, third edition, 2014
  • Stephen Marsland, “Machine Learning: An Algorithmic Perspective”, Chapman and Hall/CRC, 2nd edition, 2014