Model-free control

Monte-Carlo Policy Iteration

  • Starting from a policy $\pi$

  • Iterate until convergence:

    • Policy evaluation using Monte-Carlo policy evaluation of $Q^{\pi}$ (action-values, so the improvement step needs no model of the MDP)
    • Policy improvement using $\epsilon$-greedy policy improvement (a minimal sketch of the $\epsilon$-greedy step follows this list)
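In practice the improved policy $\pi' = \epsilon\text{-greedy}(Q)$ is represented implicitly by how actions are sampled. Below is a minimal sketch of that sampling step, assuming $Q$ is stored as a NumPy array indexed by `[state, action]`; the array layout and the function name are illustrative assumptions, not from the notes.

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon, rng):
    """With probability epsilon take a uniformly random action,
    otherwise the greedy action argmax_a Q[s, a]."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[s]))               # exploit
```

Under this policy every action keeps probability at least $\epsilon / |A|$, which is what allows Monte-Carlo evaluation to keep visiting all state-action pairs.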

Monte Carlo Online Control

  1. Initialize $Q(s, a) = 0, N(s, a) = 0, \forall (s, a)$
    Set $\epsilon = 1$, $k = 1$, $\pi_k = \epsilon\text{-greedy}(Q)$
  2. Repeat until convergence:
    1. Sample episode $(s_{k, 1}, a_{k, 1}, r_{k, 1}, s_{k, 2}, \dots, s_{k, T}, a_{k, T}, r_{k, T})$ using $\pi_k$, and compute the return from each time step:
      $G_{k, t} = r_{k, t} + \gamma r_{k, t+1} + \gamma^2 r_{k, t+2} + \cdots + \gamma^{T - t} r_{k, T}$ for $t = 1, \dots, T$
    2. If updating with first-visit Monte-Carlo: for the first visit to $(s_t, a_t)$ in episode $k$,
      $N(s_t, a_t) = N(s_t, a_t) + 1$
      $Q(s_t, a_t) = Q(s_t, a_t) + \frac{1}{N(s_t, a_t)}\left(G_{k, t} - Q(s_t, a_t)\right)$
    3. Improve the policy (a runnable sketch of the full loop follows this list):
      $k = k + 1$, $\epsilon = 1/k$ (the GLIE schedule implied by the initialization $\epsilon = 1$, $k = 1$)
      $\pi_k = \epsilon\text{-greedy}(Q)$
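A minimal runnable sketch of this loop, assuming a tabular problem and an environment exposing `reset() -> s` and `step(a) -> (s_next, r, done)` over integer-coded states and actions; the interface, function name, and default hyperparameters are assumptions for illustration.

```python
import numpy as np

def mc_online_control(env, n_states, n_actions, n_episodes=10_000, gamma=1.0, seed=0):
    """First-visit Monte-Carlo online control with an epsilon-greedy policy
    and the GLIE schedule epsilon = 1/k.

    Assumes an environment exposing `reset() -> s` and
    `step(a) -> (s_next, r, done)` over integer-coded states and actions
    (this interface is an assumption, not part of the notes).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                      # GLIE: epsilon -> 0 as k grows

        # 1. Sample one episode with the current epsilon-greedy policy.
        episode = []                           # list of (s, a, r) tuples
        s, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))      # explore
            else:
                a = int(np.argmax(Q[s]))              # exploit
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # 2. Compute returns G_{k,t} backwards through the episode.
        returns = np.zeros(len(episode))
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            returns[t] = G

        # 3. First-visit updates: only the earliest occurrence of each (s, a).
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        for t, (s, a, _) in enumerate(episode):
            if first_visit[(s, a)] == t:
                N[s, a] += 1
                Q[s, a] += (returns[t] - Q[s, a]) / N[s, a]

    return Q
```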

GLIE Monte-Carlo Control

  • Theorem: (GLIE) Monte-Carlo control converges to the optimal state-action value function $Q^{*}(s, a)$
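
  • Here GLIE stands for Greedy in the Limit with Infinite Exploration, i.e. the standard pair of conditions:

    • every state-action pair is visited infinitely often: $\lim_{k \to \infty} N_k(s, a) = \infty$
    • the policy converges to a greedy policy: $\lim_{k \to \infty} \pi_k(a \vert s) = \mathbf{1}\left(a = \arg\max_{a'} Q_k(s, a')\right)$

  • For example, $\epsilon$-greedy with the schedule $\epsilon_k = 1/k$, as in the algorithm above, is GLIE.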

Importance Sampling for Off-policy Monte-Carlo Control

Recall the importance-sampling identity:

$\mathbb{E}_{X \sim P}[f(X)] = \mathbb{E}_{X \sim Q}\left[\frac{P(X)}{Q(X)} f(X)\right]$
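
As a quick sanity check of this identity, here is a small NumPy sketch that estimates $\mathbb{E}_{X \sim P}[f(X)]$ using only samples from $Q$; the particular distributions and choice of $f$ are arbitrary, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target P = Normal(mean=1, std=1), proposal Q = Normal(mean=0, std=2),
# and f(x) = x^2, so the true value is E_P[X^2] = 1^2 + 1 = 2.
def p_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2.0 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

x = rng.normal(0.0, 2.0, size=1_000_000)          # samples from Q only
estimate = np.mean(p_pdf(x) / q_pdf(x) * x ** 2)  # importance-weighted mean

print(estimate)                                   # close to the true value 2.0
```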

Off-policy means that the target policy $\pi$ is different from the behavior policy $\mu$ that actually gathers the experience. Because the two policies induce different distributions over trajectories, experience gathered with $\mu$ cannot be used directly to evaluate or improve $\pi$. Instead, importance sampling is used to correct the returns generated under $\mu$ so that they can be used for the target policy $\pi$.

Importance sampling re-weights the return $G_t$ according to how similar the two policies are along the trajectory:

$G_t^{\pi/\mu} = \frac{\pi(a_t \vert s_t)}{\mu(a_t \vert s_t)} \frac{\pi(a_{t+1} \vert s_{t+1})}{\mu(a_{t+1} \vert s_{t+1})} \cdots \frac{\pi(a_{T-1} \vert s_{T-1})}{\mu(a_{T-1} \vert s_{T-1})} G_t$

The corrected return $G_t^{\pi/\mu}$ is then used as the target of the value update:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( G_t^{\pi/\mu} - Q(s_t, a_t) \right)$

However, because the target policy is learning from experience it did not gather itself, the product of importance ratios makes these updates much higher-variance than their on-policy counterparts.
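
Below is a minimal sketch of this update for a single episode, assuming the episode was collected under the behavior policy $\mu$ and that both policies are available as tabular action-probability arrays `pi[s, a]` and `mu[s, a]`; these names, the tabular layout, and the step size are assumptions for illustration.

```python
import numpy as np

def off_policy_mc_episode_update(Q, episode, pi, mu, alpha=0.1, gamma=1.0):
    """Off-policy Monte-Carlo updates for one episode collected under the
    behavior policy mu, using ordinary importance sampling.

    `episode` is a list of (s, a, r) tuples; `pi[s, a]` and `mu[s, a]` are
    the action probabilities of the target and behavior policies.
    """
    T = len(episode)
    for t in range(T):
        # Importance ratio for the tail of the episode: prod_k pi(a_k|s_k)/mu(a_k|s_k).
        rho = 1.0
        for s_k, a_k, _ in episode[t:]:
            rho *= pi[s_k, a_k] / mu[s_k, a_k]

        # Plain Monte-Carlo return G_t from time t, as sampled under mu.
        G = 0.0
        for i, (_, _, r) in enumerate(episode[t:]):
            G += (gamma ** i) * r

        # Update Q towards the importance-weighted return G_t^{pi/mu};
        # the product of ratios is exactly what makes this high-variance.
        s_t, a_t, _ = episode[t]
        Q[s_t, a_t] += alpha * (rho * G - Q[s_t, a_t])
    return Q
```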