Temporal Difference Learning

n-Step Return

Consider the following n-step returns for $n=1, 2, \infty$:

  • 1-step return ($\approx \text{TD(0)}$): $G_t^{(1)} = r_t + \gamma V^{\pi}(s_{t+1})$
  • 2-step return: $G_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V^{\pi}(s_{t+2})$
  • n-step return: $G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n}V^{\pi}(s_{t+n})$
  • $\infty$-step return ($= \text{MC}$): $G_t^{(\infty)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1} \;(\approx G_{i, t})$

$\rightarrow$ n-step temporal difference learning: $V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha \big( \underbrace{G_t^{(n)}}_{\text{TD target}} - V^{\pi}(s_t) \big)$
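As a concrete illustration, the sketch below implements this tabular n-step TD update. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and the fixed `policy(state)` function are assumed placeholders, not something defined in these notes.

```python
from collections import defaultdict

def n_step_td_prediction(env, policy, n=3, alpha=0.1, gamma=0.99, episodes=500):
    """Tabular n-step TD: V(s_t) <- V(s_t) + alpha * (G_t^(n) - V(s_t)).

    Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
    and policy(state) -> action; all three are placeholders for illustration.
    """
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        states, rewards = [s], []       # states[k] = s_k, rewards[k] = r_k
        T = float('inf')                # termination step, unknown until done
        t = 0
        while True:
            if t < T:
                s, r, done = env.step(policy(s))
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1           # episode ends in terminal state s_T
            tau = t - n + 1             # time whose value estimate is updated now
            if tau >= 0:
                # n-step return: discounted rewards r_tau .. r_{tau+n-1} ...
                G = sum(gamma ** (i - tau) * rewards[i]
                        for i in range(tau, min(tau + n, T)))
                # ... plus bootstrap gamma^n * V(s_{tau+n}) if s_{tau+n} is not terminal
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

With `n=1` this reduces to TD(0), and for `n` larger than the remaining episode length the target becomes the Monte-Carlo return.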

Combining Information from Many Different Time-Steps

In temporal difference learning we can efficiently combine information from all of these time-steps into a single target. The $\lambda$-return $G_t^{\lambda}$ combines all n-step returns $G_t^{(n)}$ with weight $(1-\lambda) \lambda^{n-1}$:
(Assume the episode terminates at step $T$; then $G_t^{(n)} = G_t^{(T-t)}$ for every $n \geq T-t$, which lets the infinite sum be truncated.)

$$\begin{aligned} G_t^{\lambda} &= (1- \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} \\ &= 1 \cdot \sum_{n=1}^{T-t} \lambda^{n-1} G_t^{(n)} - \lambda \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} \\ &= \sum_{n=1}^{T-t} \lambda^{n-1} G_t^{(n)} - \sum_{n=1}^{T-t-1} \lambda^{n} G_t^{(n)} \end{aligned}$$
  • Forward $\text{TD}(\lambda)$ learning
$$V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha \big( \underbrace{G_t^{\lambda}}_{\text{TD target}} - V^{\pi}(s_t) \big)$$

If $\lambda=0$, then $G_t^{\lambda} = G_t^{(1)}$ and the update reduces to $\text{TD}(0)$:

$$V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha \big(G_t^{(1)} - V^{\pi}(s_t)\big)$$
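The forward view can be sketched directly from these definitions, assuming a finished episode stored as `states[0..T]` and `rewards[0..T-1]` (with `rewards[t]` holding $r_t$) and a dict-like value table `V`; the helper names below (`n_step_return`, `lambda_return`, `forward_td_lambda_update`) are illustrative, not from the text.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """G_t^(n) = r_t + ... + gamma^(n-1) * r_{t+n-1} + gamma^n * V(s_{t+n}), with V(s_T) = 0."""
    T = len(rewards)                                  # terminal state is states[T]
    G = sum(gamma ** (i - t) * rewards[i] for i in range(t, min(t + n, T)))
    if t + n < T:                                     # bootstrap only before termination
        G += gamma ** n * V[states[t + n]]
    return G

def lambda_return(rewards, states, V, t, gamma, lam):
    """G_t^lambda: weight (1-lam)*lam^(n-1) on G_t^(n); the tail weight lam^(T-t-1)
    falls on the full return, since G_t^(n) = G_t^(T-t) for all n >= T-t."""
    T = len(rewards)
    G_lam, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):                         # n-step returns that still bootstrap
        G_lam += weight * n_step_return(rewards, states, V, t, n, gamma)
        weight *= lam
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lam

def forward_td_lambda_update(V, states, rewards, alpha, gamma, lam):
    """Forward TD(lambda): V(s_t) <- V(s_t) + alpha * (G_t^lambda - V(s_t)),
    applied offline at the end of the episode (targets computed with V held fixed)."""
    targets = [lambda_return(rewards, states, V, t, gamma, lam)
               for t in range(len(rewards))]
    for t, G in enumerate(targets):
        V[states[t]] += alpha * (G - V[states[t]])
```

Setting `lam=0` reproduces the TD(0) update above, while `lam=1` places all weight on the Monte-Carlo return.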

Errors in TD($\lambda$)

(proof omitted; here $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ denotes the one-step TD error)

$$G_t^{\lambda} - V(s_t) = \delta_t + \gamma \lambda \delta_{t+1} + (\gamma \lambda)^2 \delta_{t+2} + \cdots$$
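This identity is easy to check numerically on a recorded episode with $V$ held fixed; the sketch below reuses the illustrative `lambda_return` helper from the previous block.

```python
def check_lambda_error_decomposition(rewards, states, V, gamma, lam, t=0, tol=1e-9):
    """Check G_t^lambda - V(s_t) = sum_k (gamma*lam)^k * delta_{t+k} on one episode."""
    T = len(rewards)
    # one-step TD errors delta_k = r_k + gamma * V(s_{k+1}) - V(s_k), with V(s_T) = 0
    deltas = [rewards[k]
              + (gamma * V[states[k + 1]] if k + 1 < T else 0.0)
              - V[states[k]]
              for k in range(T)]
    lhs = lambda_return(rewards, states, V, t, gamma, lam) - V[states[t]]
    rhs = sum((gamma * lam) ** k * deltas[t + k] for k in range(T - t))
    assert abs(lhs - rhs) < tol, (lhs, rhs)
```

The check holds for any $\lambda$ in $[0, 1]$, including the $\lambda=1$ case discussed next.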

MC and TD(1)

When $\lambda=1$, TD(1) is roughly equivalent to every-visit Monte-Carlo, with the error accumulated online, step by step. If the value function is only updated offline, at the end of the episode, then the total update is exactly the same as the MC update.

The accumulated error is:

$$\begin{aligned} \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-1-t} \delta_{T-1} &= r_t + \gamma V(s_{t+1}) - V(s_t) \\ &\;\;+ \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - \gamma V(s_{t+1}) \\ &\;\;+ \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3}) - \gamma^2 V(s_{t+2}) \\ &\;\;\vdots \\ &\;\;+ \gamma^{T-1-t} r_{T-1} + \gamma^{T-t} V(s_T) - \gamma^{T-1-t} V(s_{T-1}) \\ &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-1-t} r_{T-1} - V(s_t) \qquad (\text{since } V(s_T) = 0) \\ &= G_t - V(s_t) \end{aligned}$$
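The same telescoping can be verified numerically by summing the discounted TD errors from each start time and comparing against the every-visit MC error $G_t - V(s_t)$, again assuming the episode representation used in the earlier sketches.

```python
def check_td1_equals_mc(rewards, states, V, gamma, tol=1e-9):
    """Offline TD(1): the accumulated error sum_k gamma^k * delta_{t+k} equals G_t - V(s_t)."""
    T = len(rewards)
    # one-step TD errors, with V(s_T) = 0 at the terminal state
    deltas = [rewards[t]
              + (gamma * V[states[t + 1]] if t + 1 < T else 0.0)
              - V[states[t]]
              for t in range(T)]
    G = 0.0
    for t in reversed(range(T)):          # returns computed backwards: G_t = r_t + gamma * G_{t+1}
        G = rewards[t] + gamma * G
        td1_total = sum(gamma ** k * deltas[t + k] for k in range(T - t))
        assert abs(td1_total - (G - V[states[t]])) < tol
```

Applied online, the equivalence is only approximate, because $V$ changes during the episode.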

Unified View of Reinforcement Learning

[Figure: unified view of reinforcement learning]

  1. Figure from David Silver's lecture.