Model-free Control

Recall Optimal Policy

  • Find the optimal policy $\pi^{*}$ which maximizes the state-value at each state: $\pi^{*}(s) = \arg \max_{\pi} V^{\pi}(s)$

  • For the optimal policy $\pi^{*}$, we have,
    • $V^{\pi^{*}}(s) \geq V^{\pi}(s)$ for any policy $\pi$ and any state $s$
    • $Q^{\pi^{*}}(s, a) \geq Q^{\pi}(s, a)$ for any policy $\pi$, any state $s$ and any action $a$.
  • Iterative approach:
    • Value iteration
    • Policy iteration

Model-Free Control

Control means finding the optimal policy for an MDP. Most control problems fall into one of these conditions:

  • MDP model is unknown, but experience can be sampled
  • MDP model is known, but it is too large to use directly, except through samples.

Examples are

  • Autonomous Robot
  • Game Play
  • Portfolio Management
  • Protein Folding

On and Off-Policy Learning

On-policy learning is an approach that learns from direct experience gathered while following the behavior policy: it evaluates a policy $\pi$ from experience sampled from $\pi$ itself.

Off-policy learning, on the other hand, learns from indirect experience, such as that of human experts or other agents: it evaluates a policy $\pi$ from experience sampled from other policies. Typically, the agent learns the optimal policy while following an exploratory policy, or learns about multiple policies while following a single one.

Importance Sampling

Off-policy learning usually samples experience from a different distribution. In that case, importance sampling can estimate an expectation under one distribution from samples drawn under another:

$$
\begin{aligned}
\mathbb{E}_{X \sim P}[f(X)] &= \sum P(X) f(X) \\
&= \sum Q(X) \frac{P(X)}{Q(X)} f(X) \\
&= \mathbb{E}_{X \sim Q} \Big[ \frac{P(X)}{Q(X)} f(X) \Big]
\end{aligned}
$$
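As a rough illustration (not from the original notes), the sketch below estimates $\mathbb{E}_{X \sim P}[f(X)]$ using samples drawn from a different distribution $Q$, reweighting each sample by the ratio $P(X)/Q(X)$. The particular choices of $P$, $Q$, and $f$ here are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target P: Normal(1, 1); proposal Q: Normal(0, 2); f(x) = x**2.
def p_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from Q, not P
weights = p_pdf(x) / q_pdf(x)                      # importance ratios P(X)/Q(X)
estimate = np.mean(weights * x**2)                 # E_{X~Q}[(P/Q) f(X)] ≈ E_{X~P}[f(X)]

print(estimate)  # should be close to E_{X~P}[X^2] = 1^2 + 1 = 2
```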

Model-free Generalized Policy Improvement (GPI)

Given a policy $\pi$, estimate the state-action value function $Q^{\pi}(s, a)$. Using this estimate, update $\pi$ to $\pi'$:

$$\pi'(s) = \arg \max_{a \in A} Q^{\pi}(s, a)$$
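Concretely, with a tabular estimate of $Q^{\pi}$ this improvement step is just an arg max over the action axis; the array shapes below are illustrative assumptions, not part of the notes.

```python
import numpy as np

num_states, num_actions = 5, 3
Q = np.random.rand(num_states, num_actions)   # stand-in for an estimated Q^pi

# Greedy improvement: pi'(s) = argmax_a Q(s, a) for every state
pi_improved = np.argmax(Q, axis=1)            # shape (num_states,), one action index per state
```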

Model-Free Policy Iteration

  • Initialize policy $\pi$
  • Repeat until convergence
    • Policy evaluation: estimate $Q^{\pi}$
    • Policy improvement: Generate $\pi’ \geq \pi$
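A minimal sketch of this loop, assuming a hypothetical helper `estimate_q` (e.g. Monte Carlo or TD evaluation from sampled episodes) and an arg-max improvement step; convergence is checked by whether the greedy policy stops changing.

```python
import numpy as np

def model_free_policy_iteration(env, num_states, num_actions, estimate_q, max_iters=100):
    """Generic GPI loop: evaluate Q^pi from samples, then act greedily w.r.t. it.

    `estimate_q(env, policy)` is an assumed helper returning a
    (num_states, num_actions) array of action-value estimates gathered
    by running `policy` in `env`.
    """
    policy = np.zeros(num_states, dtype=int)          # initialize pi arbitrarily
    for _ in range(max_iters):
        q = estimate_q(env, policy)                   # policy evaluation (sample-based)
        new_policy = np.argmax(q, axis=1)             # policy improvement: pi' >= pi
        if np.array_equal(new_policy, policy):        # stop when the policy is stable
            break
        policy = new_policy
    return policy, q
```

Note that with a purely greedy, deterministic policy the evaluation step never sees actions other than $\pi(s)$, which is exactly the exploration problem discussed next.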

Policy Evaluation with Exploration

If $\pi$ is deterministic, how can we compute $Q^{\pi}(s, a)$ for $a \neq \pi(s)$? Doing so requires data on $(s, a)$ pairs with $a \neq \pi(s)$, which is what exploration provides. We can obtain this data by either:

  • Getting all $(s, a)$ pairs for $a \neq \pi(s)$
  • Or getting enough $(s, a)$ pairs with $a \neq \pi(s)$ to ensure that the resulting estimate of $Q^{\pi}$ improves the current policy

So how can we be sure that we collect enough $(s, a)$ pairs?

$\epsilon$-greedy Exploration

$\epsilon$-greedy exploration is a simple idea for ensuring continual exploration: all actions are tried with non-zero probability. Let $m = \vert A \vert$ be the number of actions. Then the $\epsilon$-greedy policy w.r.t. $Q^{\pi}(s, a)$ is

$$
\pi(a \vert s) =
\begin{cases}
\dfrac{\epsilon}{m} + 1 - \epsilon & \text{if } a = \arg \max_{a' \in A} Q(s, a') \\
\dfrac{\epsilon}{m} & \text{otherwise}
\end{cases}
$$
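A small sketch of this policy as code: it builds the full action distribution for a state from a tabular $Q$ and samples from it. The helper names and the use of a NumPy random generator are my own choices, not from the notes.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities: epsilon/m for every action, plus (1 - epsilon) on the greedy one."""
    m = len(q_values)
    probs = np.full(m, epsilon / m)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def epsilon_greedy_action(q_values, epsilon, rng=np.random.default_rng()):
    """Sample an action index according to the epsilon-greedy distribution."""
    return rng.choice(len(q_values), p=epsilon_greedy_probs(q_values, epsilon))
```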

$\epsilon$-greedy Policy Improvement

  • Theorem: For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $Q_{\pi}$ is an improvement over $\pi$:
$$V_{\pi'}(s) \geq V_{\pi}(s)$$
  • Proof:
$$
\begin{aligned}
Q_{\pi}(s, \pi'(s)) &= \sum_{a \in A} \pi'(a \vert s) Q_{\pi}(s, a) \\
&= \frac{\epsilon}{m} \sum_{a \in A} Q_{\pi}(s, a) + (1 - \epsilon) \max_{a \in A} Q_{\pi}(s, a) \\
&\geq \frac{\epsilon}{m} \sum_{a \in A} Q_{\pi}(s, a) + (1 - \epsilon) \sum_{a \in A} \frac{\pi(a \vert s) - \frac{\epsilon}{m}}{1 - \epsilon} Q_{\pi}(s, a) \\
&= \sum_{a \in A} \pi(a \vert s) Q_{\pi}(s, a) \\
&= V_{\pi}(s)
\end{aligned}
$$

The inequality holds because, for an $\epsilon$-greedy $\pi$, the weights $\frac{\pi(a \vert s) - \epsilon / m}{1 - \epsilon}$ are non-negative and sum to one, so the weighted average cannot exceed the maximum. As a result, we get $Q_{\pi}(s, \pi'(s)) \geq V_{\pi}(s)$: taking one action from $\pi'$ and then following $\pi$ is at least as good as following $\pi$ throughout. Extending this over all future steps gives $V_{\pi'}(s) \geq V_{\pi}(s)$.


Expanding the expectations, we can also prove this as follows:

$$
\begin{aligned}
V_{\pi}(s) &\leq Q_{\pi}(s, \pi'(s)) \\
&= E[R_t + \gamma V_{\pi}(s_{t+1}) \vert s_t = s, a_t = \pi'(s)] \\
&= E_{\pi'}[R_t + \gamma V_{\pi}(s_{t+1}) \vert s_t = s] \\
&\leq E_{\pi'}[R_t + \gamma Q_{\pi}(s_{t+1}, \pi'(s_{t+1})) \vert s_t = s] \\
&= E_{\pi'}[R_t + \gamma E_{\pi'}[R_{t+1} + \gamma V_{\pi}(s_{t+2})] \vert s_t = s] \\
&= E_{\pi'}[R_t + \gamma R_{t+1} + \gamma^2 V_{\pi}(s_{t+2}) \vert s_t = s] \\
&\leq E_{\pi'}[R_t + \gamma R_{t+1} + \gamma^2 Q_{\pi}(s_{t+2}, \pi'(s_{t+2})) \vert s_t = s] \\
&\qquad\qquad\qquad\qquad \vdots \\
&\leq E_{\pi'}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \dots \vert s_t = s] \\
&= V_{\pi'}(s)
\end{aligned}
$$

Note: $E[E[X \vert Y]] = E[X]$ (the law of total expectation), which justifies collapsing the nested expectations above.

Greedy in the Limit with Infinite Exploration (GLIE)

If the learning policy $\pi$ satisfies these conditions:

  • If a state is visited infinitely often, then every action in that state is chosen infinitely often (with probability 1):
$$\lim_{i \to \infty} N_i(s, a) = \infty$$
  • As $i \to \infty$, the learning policy becomes greedy with respect to the learned $Q$ function with probability 1:
$$\lim_{i \to \infty} \pi_i(a \vert s) = \mathbf{1}\big(a = \arg \max_{a' \in A} Q_i(s, a')\big) \quad \text{with probability 1}$$

We call such a policy GLIE (Greedy in the Limit with Infinite Exploration).

Bringing this idea to $\epsilon$-greedy exploration: if $\epsilon_i$ decays to zero, e.g. $\epsilon_i = \frac{1}{i}$, the strategy is GLIE.
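For example, a decaying schedule such as $\epsilon_i = 1/i$ can be paired with the $\epsilon$-greedy selection sketched earlier; the snippet below is an assumed pairing of the two, not code from the notes.

```python
# GLIE-style schedule: epsilon decays toward 0 as more episodes are seen,
# so the policy becomes greedy in the limit while every action keeps
# non-zero probability at any finite episode i.
def glie_epsilon(i):
    return 1.0 / i          # i = 1, 2, 3, ... (episode counter)

# Hypothetical usage with the epsilon-greedy helper above and a Q-table `Q`:
# for i in range(1, num_episodes + 1):
#     a = epsilon_greedy_action(Q[s], glie_epsilon(i))
```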