Model-free Policy Iteration with Monte Carlo Methods
A summary of "Understanding Deep Reinforcement Learning"
Model-free control
Monte-Carlo Policy Iteration
- Starting from a policy $\pi$
- Iterate until convergence:
  - Policy evaluation using Monte-Carlo policy evaluation of $Q^{\pi}$ for the current policy $\pi$ (action values are evaluated so that policy improvement needs no model of the MDP)
  - Policy improvement using $\epsilon$-greedy policy improvement (see the sketch below)
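A minimal sketch of the $\epsilon$-greedy improvement step, assuming a tabular $Q$ stored as a dictionary keyed by `(state, action)` and a finite list of actions (these names are illustrative, not from the summarized book):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon explore uniformly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                           # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploitation
```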
Monte Carlo Online Control
- Initialize $Q(s, a) = 0, N(s, a) = 0, \forall (s, a)$; set $\epsilon = 1$, $k = 1$, $\pi_k = \epsilon\text{-greedy}(Q)$
- Repeat until convergence:
  - Sample episode $(s_{k, 1}, a_{k, 1}, r_{k, 1}, s_{k, 2}, \dots, s_{k, T_k})$ given $\pi_k$
  - Compute the return $G_{k, t} = r_{k, t} + \gamma r_{k, t+1} + \gamma^2 r_{k, t+2} + \cdots + \gamma^{T_k - t} r_{k, T_k}$ for $t = 1, \dots, T_k$
  - If we update with first-visit, then at the first visit to $(s_t, a_t)$ in episode $k$:
    - $N(s_t, a_t) \mathrel{+{=}} 1$
    - $Q(s_t, a_t) = Q(s_t, a_t) + \frac{1}{N(s_t, a_t)}(G_{k, t} - Q(s_t, a_t))$
  - $k = k + 1$, $\epsilon = 1/k$
  - $\pi_k = \epsilon\text{-greedy}(Q)$
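Putting the loop above into code, here is a compact Python sketch. The environment interface (`env.reset()` and `env.step(action)` returning `(next_state, reward, done)`) and the episode cap are assumptions made for illustration; the update itself follows the first-visit rule and the $\epsilon_k = 1/k$ schedule above.

```python
import random
from collections import defaultdict

def mc_online_control(env, actions, gamma=0.99, num_episodes=10_000):
    """First-visit Monte-Carlo online control with an epsilon-greedy policy."""
    Q = defaultdict(float)   # Q[(s, a)], zero for unseen pairs
    N = defaultdict(int)     # visit counts N[(s, a)]
    epsilon = 1.0

    for k in range(1, num_episodes + 1):
        # Sample one episode with the current epsilon-greedy policy pi_k
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Compute returns G_{k,t} backwards through the episode
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        # First-visit updates: only the first occurrence of each (s, a) is updated
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            N[(s, a)] += 1
            Q[(s, a)] += (returns[t] - Q[(s, a)]) / N[(s, a)]

        epsilon = 1.0 / (k + 1)   # GLIE schedule: episode k+1 uses epsilon = 1/(k+1)

    return Q
```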
GLIE Monte-Carlo Control
- Theorem: GLIE Monte-Carlo control converges to the optimal state-action value function $Q^{*}(s, a)$.
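For reference, the GLIE (Greedy in the Limit with Infinite Exploration) assumption behind this theorem can be stated as two conditions on the policy sequence $\pi_k$ (a standard formulation, stated here for completeness rather than quoted from the book):

$$
\lim_{k \to \infty} N_k(s, a) = \infty
\qquad \text{and} \qquad
\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\!\left(a = \arg\max_{a'} Q_k(s, a')\right)
$$

That is, every state-action pair is explored infinitely often and the policy becomes greedy in the limit. The $\epsilon$-greedy schedule $\epsilon_k = 1/k$ used above satisfies both conditions.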
Importance Sampling for Off-policy Monte-Carlo Control
Recall that importance sampling estimates an expectation under one distribution using samples drawn from another:

$$\mathbb{E}_{x \sim \pi}[f(x)] = \mathbb{E}_{x \sim \mu}\!\left[\frac{\pi(x)}{\mu(x)} f(x)\right]$$

Off-policy means that the target policy $\pi$ is different from the behavior policy $\mu$ (the policy currently used to act). Because the two policies induce different distributions over trajectories, experience gathered with $\mu$ cannot be used directly to train $\pi$. Instead, importance sampling is used to train the target policy $\pi$ with returns generated under the behavior policy $\mu$.

Importance sampling weights the return $G_t$ according to the similarity between the two policies along the trajectory:

$$G_t^{\pi/\mu} = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)} \frac{\pi(a_{t+1} \mid s_{t+1})}{\mu(a_{t+1} \mid s_{t+1})} \cdots \frac{\pi(a_T \mid s_T)}{\mu(a_T \mid s_T)} G_t$$

The value function is then updated towards the corrected return $G_t^{\pi/\mu}$ as the target:

$$V(s_t) \leftarrow V(s_t) + \alpha \left( G_t^{\pi/\mu} - V(s_t) \right)$$

However, because the target policy does not learn from experience it gathered itself, and the correction is a product of many per-step ratios, this estimator can have dramatically higher variance.
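As a small illustration, here is a Python sketch that computes the importance-weighted return $G_t^{\pi/\mu}$ for every step of one episode. Representing the policies as probability functions `pi(a, s)` and `mu(a, s)` is an assumption for this example, not an interface from the book:

```python
def weighted_returns(episode, pi, mu, gamma=0.99):
    """Importance-sampling-corrected returns G_t^{pi/mu} for one episode.

    episode: list of (state, action, reward) tuples generated by the behavior policy mu.
    pi(a, s), mu(a, s): action probabilities under the target and behavior policies.
    """
    T = len(episode)
    corrected = [0.0] * T
    G, W = 0.0, 1.0
    # Iterate backwards so the return and the product of ratios both accumulate from t to T.
    for t in range(T - 1, -1, -1):
        s, a, r = episode[t]
        G = r + gamma * G            # ordinary return G_t
        W *= pi(a, s) / mu(a, s)     # product of likelihood ratios over steps t..T
        corrected[t] = W * G         # G_t^{pi/mu}
    return corrected
```

These corrected returns would then replace $G_t$ in the value update above. Because the weight is a product of per-step ratios, it can explode or vanish for long episodes, which is exactly the variance problem noted in the text.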