Model-free Policy Iteration with TD Methods
A summary of "Understanding Deep Reinforcement Learning"
Model-free control
Recall Updating Action-Value Functions with TD(0)
The TD(0) update for the action-value function is
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)$$
with learning rate $\alpha$ and discount factor $\gamma$. To apply this update, we must gather the tuple ($s_t, a_t, r_t, s_{t+1}, a_{t+1}$), i.e. State, Action, Reward, next State, next Action, which is why this TD(0) control method is called SARSA
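As a minimal sketch of this update in tabular form, assuming the action-value function is stored as a NumPy array indexed by (state, action); the function name and the tiny example Q-table are illustrative, not from the book:

```python
import numpy as np

def sarsa_td0_update(Q, s_t, a_t, r_t, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA (TD(0)) update of the action-value table Q."""
    # TD target bootstraps from the action the policy actually takes next: Q(s_{t+1}, a_{t+1})
    td_target = r_t + gamma * Q[s_next, a_next]
    # TD error: difference between the target and the current estimate
    td_error = td_target - Q[s_t, a_t]
    # Move the estimate a small step (alpha) toward the target
    Q[s_t, a_t] += alpha * td_error
    return Q

# Example: a tiny Q-table with 4 states and 2 actions (illustrative numbers)
Q = np.zeros((4, 2))
Q = sarsa_td0_update(Q, s_t=0, a_t=1, r_t=1.0, s_next=2, a_next=0)
```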
Model-free Policy Iteration with a TD Method
- Starting from a policy $\pi$
- Iterate until convergence
- Policy evaluation: estimate $Q^{\pi}$ with TD policy evaluation under the current $\epsilon$-greedy policy
- Policy improvement: make the policy $\epsilon$-greedy with respect to the updated $Q^{\pi}$ (see the sketch below)
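A sketch of the $\epsilon$-greedy improvement step, assuming a tabular $Q$ stored as a NumPy array; the function name and array shapes are illustrative:

```python
import numpy as np

def epsilon_greedy_policy(Q, epsilon):
    """Return pi(a|s) as an (n_states, n_actions) array, epsilon-greedy w.r.t. Q."""
    n_states, n_actions = Q.shape
    # Every action gets probability epsilon / |A| (exploration)
    pi = np.full((n_states, n_actions), epsilon / n_actions)
    # The greedy action additionally gets probability 1 - epsilon (exploitation)
    greedy_actions = Q.argmax(axis=1)
    pi[np.arange(n_states), greedy_actions] += 1.0 - epsilon
    return pi

# Example: improve an epsilon-greedy policy from a random Q-table
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 2))
pi = epsilon_greedy_policy(Q, epsilon=0.1)
print(pi)  # each row sums to 1
```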
On-Policy Control with SARSA
Note: on-policy means that the target policy (the policy being evaluated and improved) is the same as the behavior policy (the policy that generates the data)
- Start from a randomly initialized $\epsilon$-greedy policy $\pi$ at $t=0$ with initial state $s_0$. Then sample the initial action from the policy ($a_0 \sim \pi(s_0)$) and observe $r_0$ and the next state $s_1$
- Repeat until convergence
- Take the action $a_{t+1} \sim \pi(s_{t+1})$
- Observe the reward $r_{t+1}$ and the next state $s_{t+2}$
- Update the action-value function with the SARSA TD(0) update: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)$
- Improve the policy: act $\epsilon$-greedily with respect to the updated $Q$ (a full control loop is sketched below)
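Putting these steps together, here is a hedged sketch of the full on-policy SARSA loop; the toy ChainEnv class and its reset/step interface are assumptions made for illustration, not part of the book:

```python
import numpy as np

class ChainEnv:
    """Toy 5-state chain (illustrative only): action 1 moves right, action 0 moves left;
    reaching the rightmost state gives reward 1 and ends the episode."""
    n_states, n_actions = 5, 2

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(self.s + 1, self.n_states - 1) if a == 1 else max(self.s - 1, 0)
        done = self.s == self.n_states - 1
        return self.s, float(done), done   # (next state, reward, done)


def sarsa_control(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """On-policy SARSA control: the epsilon-greedy policy that generates the data
    (behavior policy) is the same policy being evaluated and improved (target policy)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))

    def sample_action(s):
        # Behave epsilon-greedily with respect to the current Q
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))
        return int(Q[s].argmax())

    for _ in range(episodes):
        s = env.reset()                    # initial state s_0
        a = sample_action(s)               # a_0 ~ pi(s_0)
        done = False
        while not done:
            s_next, r, done = env.step(a)  # observe reward and next state
            a_next = sample_action(s_next) # next action sampled from the same policy
            # SARSA update: bootstrap from the action the policy actually takes next
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            # Policy improvement is implicit: future actions are epsilon-greedy in the new Q
            s, a = s_next, a_next
    return Q


Q = sarsa_control(ChainEnv())
print(Q)  # values should increase along the chain toward the goal state
```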
Convergence Theorem of SARSA
- SARSA for finite-state and finite-action MDPs converges to the optimal action-value function $Q^{*}(s, a)$, under the following conditions:
- The policy sequence $\pi_t(a \vert s)$ is GLIE (Greedy in the Limit with Infinite Exploration): every state-action pair is visited infinitely often and the policy converges to a greedy policy
- The learning rates $\alpha_t$ form a Robbins-Monro sequence, i.e. $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$; for example, $\alpha_t = \frac{1}{t}$ (see the schedule check below)
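As a quick numerical illustration (the horizon of 100,000 steps is an arbitrary choice), the schedule $\alpha_t = 1/t$ satisfies both Robbins-Monro conditions: the partial sums of $\alpha_t$ keep growing while the partial sums of $\alpha_t^2$ stay bounded.

```python
import numpy as np

# Schedules commonly used to meet the SARSA convergence conditions:
#   GLIE exploration:   epsilon_t = 1 / t  (epsilon -> 0, yet nonzero at every finite t)
#   Robbins-Monro rate: alpha_t   = 1 / t  (sum alpha_t diverges, sum alpha_t^2 converges)
t = np.arange(1, 100_001)
alpha = 1.0 / t

print(alpha.sum())         # grows without bound with the horizon (roughly log T)
print((alpha ** 2).sum())  # bounded: converges to pi^2 / 6 ~= 1.645
```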
n-step SARSA
Consider the following n-step returns for $n = 1, 2, \dots, \infty$:
$$q_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} Q(s_{t+n}, a_{t+n})$$
Here $n = 1$ recovers the TD(0) (SARSA) target, while $n \to \infty$ recovers the Monte Carlo return
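A small sketch of how such an n-step return could be computed, assuming the first $n$ rewards are collected in a list and a tabular $Q$ provides the bootstrapped tail; the function name and example numbers are illustrative:

```python
import numpy as np

def n_step_return(rewards, Q, s_n, a_n, gamma=0.99):
    """n-step SARSA return: the first n discounted rewards plus a
    bootstrapped tail gamma^n * Q(s_{t+n}, a_{t+n})."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * Q[s_n, a_n])

# Example: a 3-step return from a tiny Q-table
Q = np.zeros((4, 2))
Q[3, 1] = 5.0
g3 = n_step_return([1.0, 0.0, 2.0], Q, s_n=3, a_n=1, gamma=0.9)
print(g3)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```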