Model-free Control

Recall Optimal Policy

  • Find the optimal policy $\pi^{*}$ which maximizes the state-value at each state: $\pi^{*}(s) = \arg \max_{\pi} V^{\pi}(s)$

  • For the optimal policy $\pi^{*}$, we have,
    • $V^{\pi^{*}}(s) \geq V^{\pi}(s)$ for any policy $\pi$ and any state $s$
    • $Q^{\pi^{*}}(s, a) \geq Q^{\pi}(s, a)$ for any policy $\pi$, any state $s$ and any action $a$.
  • Iterative approach:
    • Value iteration
    • Policy iteration

Model-Free Control

Control means finding the optimal policy for an MDP. Most control problems fall into one of these conditions:

  • MDP model is unknown, but experience can be sampled
  • MDP model is known, but it is too large to use directly, except through samples.

Examples are

  • Autonomous Robot
  • Game Play
  • Portfolio Management
  • Protein Folding

On and Off-Policy Learning

On-policy learning is an approach that learns from direct experience gathered while following the behavior policy: it evaluates a policy $\pi$ from experience sampled from $\pi$ itself.

Off-policy learning, on the other hand, learns from indirect experience, such as that of human experts or other agents: it evaluates a policy $\pi$ from experience sampled from other policies. Typically, the agent learns the optimal policy while following an exploratory policy, or learns about multiple policies while following a single one.

Importance Sampling

Off-policy learning usually samples experience from a different distribution. In that case, importance sampling can estimate an expectation under one distribution from samples drawn under another:

$$
\begin{aligned}
\mathbb{E}_{X \sim P}[f(X)] &= \sum P(X) f(X) \\
&= \sum Q(X) \frac{P(X)}{Q(X)} f(X) \\
&= \mathbb{E}_{X \sim Q} \Big[ \frac{P(X)}{Q(X)} f(X) \Big]
\end{aligned}
$$
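As a rough illustration (not from the original notes), the sketch below estimates $\mathbb{E}_{X \sim P}[f(X)]$ using samples drawn from a different distribution $Q$, reweighting each sample by the ratio $P(X)/Q(X)$. The particular choices of $P$, $Q$, and $f$ here are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target P: Normal(1, 1); proposal Q: Normal(0, 2); f(x) = x**2.
def p_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from Q, not P
weights = p_pdf(x) / q_pdf(x)                      # importance ratios P(X)/Q(X)
estimate = np.mean(weights * x**2)                 # E_{X~Q}[(P/Q) f(X)] ≈ E_{X~P}[f(X)]

print(estimate)  # should be close to E_{X~P}[X^2] = 1^2 + 1 = 2
```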

Model-free Generalized Policy Improvement (GPI)

Given a policy $\pi$, estimate the state-action value function $Q^{\pi}(s, a)$. Using this estimate, update $\pi$ to $\pi'$:

$$\pi'(s) = \arg \max_{a \in A} Q^{\pi}(s, a)$$
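Concretely, with a tabular estimate of $Q^{\pi}$ this improvement step is just an arg max over the action axis; the array shapes below are illustrative assumptions, not part of the notes.

```python
import numpy as np

num_states, num_actions = 5, 3
Q = np.random.rand(num_states, num_actions)   # stand-in for an estimated Q^pi

# Greedy improvement: pi'(s) = argmax_a Q(s, a) for every state
pi_improved = np.argmax(Q, axis=1)            # shape (num_states,), one action index per state
```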

Model-Free Policy Iteration

  • Initialize policy $\pi$
  • Repeat until convergence
    • Policy evaluation: estimate $Q^{\pi}$
    • Policy improvement: Generate $\pi’ \geq \pi$
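A minimal sketch of this loop, assuming a hypothetical helper `estimate_q` (e.g. Monte Carlo or TD evaluation from sampled episodes) and an arg-max improvement step; convergence is checked by whether the greedy policy stops changing.

```python
import numpy as np

def model_free_policy_iteration(env, num_states, num_actions, estimate_q, max_iters=100):
    """Generic GPI loop: evaluate Q^pi from samples, then act greedily w.r.t. it.

    `estimate_q(env, policy)` is an assumed helper returning a
    (num_states, num_actions) array of action-value estimates gathered
    by running `policy` in `env`.
    """
    policy = np.zeros(num_states, dtype=int)          # initialize pi arbitrarily
    for _ in range(max_iters):
        q = estimate_q(env, policy)                   # policy evaluation (sample-based)
        new_policy = np.argmax(q, axis=1)             # policy improvement: pi' >= pi
        if np.array_equal(new_policy, policy):        # stop when the policy is stable
            break
        policy = new_policy
    return policy, q
```

Note that with a purely greedy, deterministic policy the evaluation step never sees actions other than $\pi(s)$, which is exactly the exploration problem discussed next.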

Policy Evaluation with Exploration

If $\pi$ is deterministic, how can we compute $Q^{\pi}(s, a)$ for $a \neq \pi(s)$? Doing so requires data on $(s, a)$ pairs with $a \neq \pi(s)$, which is what exploration provides. We can obtain this data by either:

  • Getting all $(s, a)$ pairs for $a \neq \pi(s)$
  • Or getting enough $(s, a)$ pairs with $a \neq \pi(s)$ to ensure that the resulting estimate of $Q^{\pi}$ improves the current policy

So how can we be sure that we collect enough $(s, a)$ pairs?

$\epsilon$-greedy Exploration

$\epsilon$-greedy exploration is a simple idea for ensuring continual exploration: all actions are tried with non-zero probability. Let $m = \vert A \vert$ be the number of actions. Then the $\epsilon$-greedy policy w.r.t. $Q^{\pi}(s, a)$ is

$$
\pi(a \vert s) =
\begin{cases}
\dfrac{\epsilon}{m} + 1 - \epsilon & \text{if } a = \arg \max_{a' \in A} Q(s, a') \\
\dfrac{\epsilon}{m} & \text{otherwise}
\end{cases}
$$
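A small sketch of this policy as code: it builds the full action distribution for a state from a tabular $Q$ and samples from it. The helper names and the use of a NumPy random generator are my own choices, not from the notes.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities: epsilon/m for every action, plus (1 - epsilon) on the greedy one."""
    m = len(q_values)
    probs = np.full(m, epsilon / m)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def epsilon_greedy_action(q_values, epsilon, rng=np.random.default_rng()):
    """Sample an action index according to the epsilon-greedy distribution."""
    return rng.choice(len(q_values), p=epsilon_greedy_probs(q_values, epsilon))
```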

$\epsilon$-greedy Policy Improvement

  • Theorem: For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $Q_{\pi}$ is an improvement over $\pi$:
$$V_{\pi'}(s) \geq V_{\pi}(s)$$
  • Proof:
$$
\begin{aligned}
Q_{\pi}(s, \pi'(s)) &= \sum_{a \in A} \pi'(a \vert s) Q_{\pi}(s, a) \\
&= \frac{\epsilon}{m} \sum_{a \in A} Q_{\pi}(s, a) + (1 - \epsilon) \max_{a \in A} Q_{\pi}(s, a) \\
&\geq \frac{\epsilon}{m} \sum_{a \in A} Q_{\pi}(s, a) + (1 - \epsilon) \sum_{a \in A} \frac{\pi(a \vert s) - \frac{\epsilon}{m}}{1 - \epsilon} Q_{\pi}(s, a) \\
&= \sum_{a \in A} \pi(a \vert s) Q_{\pi}(s, a) \\
&= V_{\pi}(s)
\end{aligned}
$$

The inequality holds because, for an $\epsilon$-greedy $\pi$, the weights $\frac{\pi(a \vert s) - \epsilon / m}{1 - \epsilon}$ are non-negative and sum to one, so the weighted average cannot exceed the maximum. As a result, we get $Q_{\pi}(s, \pi'(s)) \geq V_{\pi}(s)$: taking one action from $\pi'$ and then following $\pi$ is at least as good as following $\pi$ throughout. Extending this over all future steps gives $V_{\pi'}(s) \geq V_{\pi}(s)$.


Expanding the expectations, we can also prove this as follows:

$$
\begin{aligned}
V_{\pi}(s) &\leq Q_{\pi}(s, \pi'(s)) \\
&= E[R_t + \gamma V_{\pi}(s_{t+1}) \vert s_t = s, a_t = \pi'(s)] \\
&= E_{\pi'}[R_t + \gamma V_{\pi}(s_{t+1}) \vert s_t = s] \\
&\leq E_{\pi'}[R_t + \gamma Q_{\pi}(s_{t+1}, \pi'(s_{t+1})) \vert s_t = s] \\
&= E_{\pi'}[R_t + \gamma E_{\pi'}[R_{t+1} + \gamma V_{\pi}(s_{t+2})] \vert s_t = s] \\
&= E_{\pi'}[R_t + \gamma R_{t+1} + \gamma^2 V_{\pi}(s_{t+2}) \vert s_t = s] \\
&\leq E_{\pi'}[R_t + \gamma R_{t+1} + \gamma^2 Q_{\pi}(s_{t+2}, \pi'(s_{t+2})) \vert s_t = s] \\
&\qquad\qquad\qquad\qquad \vdots \\
&\leq E_{\pi'}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \dots \vert s_t = s] \\
&= V_{\pi'}(s)
\end{aligned}
$$

Note: $E[E[X \vert Y]] = E[X]$ (the law of total expectation), which justifies collapsing the nested expectations above.

Greedy in the Limit with Infinite Exploration (GLIE)

If the learning policy $\pi$ satisfies these conditions:

  • If a state is visited infinitely often, then every action in that state is chosen infinitely often (with probability 1):
$$\lim_{i \to \infty} N_i(s, a) = \infty$$
  • As $i \to \infty$, the learning policy becomes greedy with respect to the learned $Q$ function with probability 1:
$$\lim_{i \to \infty} \pi_i(a \vert s) = \mathbf{1}\big(a = \arg \max_{a' \in A} Q_i(s, a')\big) \quad \text{with probability 1}$$

We call such a policy GLIE (Greedy in the Limit with Infinite Exploration).

Bringing this idea to $\epsilon$-greedy exploration: if $\epsilon_i$ decays to zero, e.g. $\epsilon_i = \frac{1}{i}$, the strategy is GLIE.
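For example, a decaying schedule such as $\epsilon_i = 1/i$ can be paired with the $\epsilon$-greedy selection sketched earlier; the snippet below is an assumed pairing of the two, not code from the notes.

```python
# GLIE-style schedule: epsilon decays toward 0 as more episodes are seen,
# so the policy becomes greedy in the limit while every action keeps
# non-zero probability at any finite episode i.
def glie_epsilon(i):
    return 1.0 / i          # i = 1, 2, 3, ... (episode counter)

# Hypothetical usage with the epsilon-greedy helper above and a Q-table `Q`:
# for i in range(1, num_episodes + 1):
#     a = epsilon_greedy_action(Q[s], glie_epsilon(i))
```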