# Temporal Difference Learning

## Temporal Difference Learning for Estimating Value function

We want to estimate $V^{\pi}(s) = E_{\pi}[G_t \vert s_t = s]$ given episodes generated under policy $\pi$. If we know MDP models of environment (that is, we know the transition probability of eash state), we can repeat the Bellman operator $B^{\pi}$ to estimate: $B^{\pi}V(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S}p(s' \vert s, \pi(s)) V(s')$

In incremental every-visit MC, update the estimate using one sample of return for the current $i$-th episode:

$V^{\pi}(s) = V^{\pi}(s) + \alpha \big(G_{i, t} - V^{\pi}(s)\big)$

What if we use the following recursive formula:
(replaced $G_{i, t} \rarr r_t + \gamma V^{\pi}(s_{t+1})$) $V^{\pi}(s) = V^{\pi}(s) + \alpha \big(r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s)\big)$

## Temporal Difference TD(0) Learning

• Generate episodes $s_1, a_1, r_1, s_2, a_2, r_2, \cdots$ under policy $\pi$ where the actions are sampled from $\pi$ to make tuples ($s, a, r, s’$)
• TD(0) learning: update value by estimated value
$V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha ( \underbrace{[r_t + \gamma V^{\pi}(S_{t+1})]}_{\text{TD target}} - V^{\pi}(s_t))$

In this formula, we can define the error of TD(0) as

$\delta_t = r_t + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(s_t)$

As you can see, value function can be calculated from current value and next value. So TD(0) can immediately update value estimate using ($s, a, r, s’$) tuple without waiting for the termination of each episode.

Detailed process as follows:

• Initialize $V^{\pi}(s), \forall s \in S$
• Repeat until convergence
• Sample tuple ($s_t, a_t, r_t, s_{t+1}$)
• Update value with a learning rate $\alpha$
$V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha (\underbrace{[r_t + \gamma V^{\pi}(s_{t+1})]}_{\text{TD target}} - V^{\pi}(s_t))$

In MC, $G_t$ is itself estimated by sampling episodes. Bt in TD(0), $r_t + \gamma V^{\pi}(S_{t+1})$ is an estimator of $G_t$. That is, $G_t$ is estimated by other estimator which uses bootstrapping.

## MC vs. TD

• TD can learn online after every step
• But MC must wait until end of episode before return is known
• TD can learn with incomplete sequences
• MC can only learn from complete sequences
• TD works in continuing (non-episodic) environments
• MC only works for episodic (terminating) environments
• MC has high variance and unbiased
• Simple to use
• Good convergence properties (even with function approximation)
• Not very sensitive to initial value
• Usually more efficient in non-Markov environments
• TD has low variance and biased
• Usually more efficient than MC
• TD(0) converges to $V_{\pi}(s)$ (but not always with function approximation)
• More sensitive to initial value
• Usually more effective in Markov environments