Combining Information from Many Different Time-Steps
In temporal-difference learning we collect estimated returns as the episode unfolds, which lets us efficiently combine information from all time-steps. The $\lambda$-return $G_t^{\lambda}$ combines all n-step returns $G_t^{(n)}$ with weight $(1-\lambda)\lambda^{n-1}$ (assuming the episode terminates at step $T$):

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$$
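As an illustration, here is a minimal NumPy sketch (the function name and array conventions are my own, not from the source) that builds $G_t^{\lambda}$ directly from the weighted n-step returns:

```python
import numpy as np

def lambda_return(rewards, values, lam, gamma):
    """Forward-view lambda-return G_t^lambda for each t of one episode.

    rewards: R_1 .. R_T for an episode terminating at step T
    values:  V(S_0) .. V(S_T), with V(S_T) = 0 at termination
    """
    rewards, values = np.asarray(rewards), np.asarray(values)
    T = len(rewards)
    G_lam = np.zeros(T)
    for t in range(T):
        # n-step returns G_t^{(n)} for n = 1 .. T - t
        G_n = np.zeros(T - t)
        for n in range(1, T - t + 1):
            disc = gamma ** np.arange(n)
            G_n[n - 1] = disc @ rewards[t:t + n] + gamma ** n * values[t + n]
        # weights (1 - lam) * lam^{n-1}; the last n-step return is the
        # full return G_t and absorbs the remaining mass lam^{T-t-1}
        w = (1 - lam) * lam ** np.arange(T - t)
        w[-1] = lam ** (T - t - 1)
        G_lam[t] = w @ G_n
    return G_lam
```

With `lam=0` the weights collapse onto $G_t^{(1)}$, and with `lam=1` onto the full return $G_t$.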
If $\lambda=0$, then $G_t^{\lambda} = G_t^{(1)}$ and the update reduces to $\text{TD}(0)$:
$$V^{\pi}(S_t) \leftarrow V^{\pi}(S_t) + \alpha\left(G_t^{(1)} - V^{\pi}(S_t)\right)$$
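In code, that one-step update is a single line; here is a minimal tabular sketch (`V`, `alpha`, and `gamma` are assumed to be given, and the signature is my own):

```python
def td0_update(V, s, r, s_next, done, alpha, gamma):
    """One tabular TD(0) update from the transition (S_t, R_{t+1}, S_{t+1})."""
    target = r if done else r + gamma * V[s_next]  # one-step return G_t^{(1)}
    delta = target - V[s]                          # TD error delta_t
    V[s] += alpha * delta
    return delta
```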
Errors in TD($\lambda$)
(proof omitted)
$$G_t^{\lambda} - V(S_t) = \delta_t + \gamma\lambda\,\delta_{t+1} + (\gamma\lambda)^2\delta_{t+2} + \cdots$$
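The identity is easy to sanity-check numerically. The sketch below uses a synthetic episode of my own devising, and computes the $\lambda$-return through its standard backward recursion $G_t^{\lambda} = R_{t+1} + \gamma\big((1-\lambda)V(S_{t+1}) + \lambda G_{t+1}^{\lambda}\big)$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, lam, T = 0.9, 0.7, 6
rewards = rng.normal(size=T)                 # R_1 .. R_T
values = np.append(rng.normal(size=T), 0.0)  # V(S_0) .. V(S_T), V(S_T) = 0

# TD errors: delta_t = R_{t+1} + gamma V(S_{t+1}) - V(S_t)
deltas = rewards + gamma * values[1:] - values[:-1]

# lambda-return via G_t = R_{t+1} + gamma * ((1 - lam) V(S_{t+1}) + lam G_{t+1})
G_lam, G = np.zeros(T), 0.0
for t in reversed(range(T)):
    G = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * G)
    G_lam[t] = G

# check: G_t^lambda - V(S_t) == sum_k (gamma*lam)^k * delta_{t+k} for every t
for t in range(T):
    rhs = np.sum((gamma * lam) ** np.arange(T - t) * deltas[t:])
    assert np.isclose(G_lam[t] - values[t], rhs)
```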
MC and TD(1)
When $\lambda=1$, TD(1) is roughly equivalent to every-visit Monte Carlo, with the error accumulated online, step by step. If the value function is only updated offline, at the end of the episode, then the total update is exactly the same as the MC update.
The accumulated error telescopes:

$$
\begin{aligned}
\delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-1-t}\delta_{T-1}
&= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
&\quad + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) - \gamma V(S_{t+1}) \\
&\quad + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3}) - \gamma^2 V(S_{t+2}) \\
&\qquad \vdots \\
&\quad + \gamma^{T-1-t} R_T + \gamma^{T-t} V(S_T) - \gamma^{T-1-t} V(S_{T-1}) \\
&= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-1-t} R_T - V(S_t) \\
&= G_t - V(S_t),
\end{aligned}
$$

where all intermediate value terms cancel and $V(S_T) = 0$ at termination.
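Setting $\lambda=1$ makes the telescoping directly checkable: the discounted sum of TD errors matches the Monte Carlo error $G_t - V(S_t)$. Again a synthetic episode of my own, with $V(S_T)=0$:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, T = 0.9, 6
rewards = rng.normal(size=T)                 # R_1 .. R_T
values = np.append(rng.normal(size=T), 0.0)  # V(S_0) .. V(S_T), V(S_T) = 0

deltas = rewards + gamma * values[1:] - values[:-1]  # TD errors delta_t

for t in range(T):
    G_t = np.sum(gamma ** np.arange(T - t) * rewards[t:])    # MC return G_t
    td_sum = np.sum(gamma ** np.arange(T - t) * deltas[t:])  # summed TD errors
    assert np.isclose(td_sum, G_t - values[t])               # telescoping holds
```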