Policy Iteration
A summary of "Understanding deep reinforcement learning"
Policy Improvement by Iterative Methods
Greedy Policy Improvement
- Start from an initial policy $\pi_0$
- Iterate over $k$ until convergence (a code sketch of the full loop follows this list)
- Evaluate the policy $\pi_k$:
$V^{\pi_k}(s) = E_{\pi_k}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \vert s_t = s]$
- Update the policy to $\pi_{k+1}$ by greedily choosing the best action for each state:
$\pi_{k+1}(s) = \argmax_{a \in A} Q^{\pi_k}(s, a)$
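To make the loop concrete, here is a minimal tabular policy-iteration sketch in Python. The two-state MDP `P`, the discount factor, and the helper names `evaluate`/`improve` are illustrative assumptions for this summary, not anything taken from the book.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (an assumption for illustration).
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def evaluate(policy, theta=1e-8):
    """Policy evaluation: sweep the Bellman expectation backup until V^{pi_k} stabilizes."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def improve(V):
    """Greedy improvement: pi_{k+1}(s) = argmax_a Q^{pi_k}(s, a)."""
    new_policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in range(n_actions)]
        new_policy[s] = int(np.argmax(q))
    return new_policy

policy = np.zeros(n_states, dtype=int)      # pi_0: arbitrary initial policy
while True:
    V = evaluate(policy)                    # evaluate pi_k
    new_policy = improve(V)                 # greedy update -> pi_{k+1}
    if np.array_equal(new_policy, policy):  # policy stable -> pi_c = pi*
        break
    policy = new_policy
print(policy, V)                            # converged policy and its value function
```

On this toy MDP the loop alternates exact evaluation of $\pi_k$ with the greedy update $\pi_{k+1}(s) = \argmax_{a \in A} Q^{\pi_k}(s, a)$ and stops as soon as the policy no longer changes.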
This process monotonically improves the policy:
- At each iteration, because $\pi_{k+1}$ picks the maximizing action:
$Q^{\pi_k}(s, \pi_{k+1}(s)) = \max_{a \in A} Q^{\pi_k}(s, a) \ge Q^{\pi_k}(s, \pi_k(s)) = V^{\pi_k}(s)$
- By unrolling this inequality we have $V^{\pi_{k+1}}(s) \ge V^{\pi_k}(s)$ for every state $s$ (the full chain of inequalities is spelled out after this list), so each update can only improve the policy.
- When the iteration converges to a policy $\pi_c$, it satisfies the Bellman optimality condition:
$Q^{\pi_c}(s, \pi_c(s)) = \max_{a \in A} Q^{\pi_c}(s, a) = V^{\pi_c}(s)$
- Thus, $\pi_c$ is the optimal policy $\pi^{\ast}$.
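For completeness, the improvement inequality above can be unrolled into the standard policy-improvement argument, written here in the notation of this summary (a sketch, not a quotation from the book):

$$
\begin{aligned}
V^{\pi_k}(s) &\le Q^{\pi_k}(s, \pi_{k+1}(s)) \\
&= E_{\pi_{k+1}}[r_t + \gamma V^{\pi_k}(s_{t+1}) \vert s_t = s] \\
&\le E_{\pi_{k+1}}[r_t + \gamma Q^{\pi_k}(s_{t+1}, \pi_{k+1}(s_{t+1})) \vert s_t = s] \\
&\le \cdots \le E_{\pi_{k+1}}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \vert s_t = s] = V^{\pi_{k+1}}(s)
\end{aligned}
$$

Because the value cannot decrease at any step and a finite MDP has only finitely many deterministic policies, the iteration must reach a fixed point, which is exactly the convergent policy $\pi_c$ above.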