Model-free Control
A summary of "Understanding Deep Reinforcement Learning"
- Model-free Control
- Recall Optimal Policy
- Model-Free Control
- On and Off-Policy Learning
- Importance Sampling
- Model-free Generalized Policy Improvement (GPI)
- Model-Free Policy Iteration
- Policy Evaluation with Exploration
- $\epsilon$-greedy Exploration
- $\epsilon$-greedy Policy Improvement
- $\epsilon$-greedy Policy Improvement
- Greedy in the Limit of Infinite Exploration (GLIE)
Model-free Control
Recall Optimal Policy
- Find the optimal policy $\pi^{*}$ that maximizes the state-value at every state:
- For the optimal policy $\pi^{*}$, we have,
- $V^{\pi^{*}}(s) \geq V^{\pi}(s)$ for any policy $\pi$ and any state $s$
- $Q^{\pi^{*}}(s, a) \geq Q^{\pi}(s, a)$ for any policy $\pi$, any state $s$ and any action $a$.
- Iterative approach:
- Value iteration
- Policy iteration
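In symbols (a compact restatement of the objective above):

$$
\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \quad \text{for all states } s
$$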
Model-Free Control
Control means finding the optimal policy for an MDP. Most control problems fall into one of two settings:
- The MDP model is unknown, but experience can be sampled.
- The MDP model is known, but it is too big to use directly, except through samples.
Examples include:
- Autonomous Robot
- Game Play
- Portfolio Management
- Protein Folding
On and Off-Policy Learning
On-policy learning learns from direct experience generated by the policy being followed: it evaluates a policy $\pi$ from experience sampled from $\pi$ itself.
Off-policy learning, on the other hand, learns from indirect experience, such as that of human experts or other agents: it evaluates a policy $\pi$ from experience sampled from other policies. Typically, the agent learns about the optimal policy while following an exploratory policy, or learns about multiple policies while following a single one.
Importance Sampling
Off-policy learning usually draws experience from a distribution different from the one we care about. Importance sampling lets us estimate an expectation under one distribution using samples drawn from another.
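Concretely, for a target distribution $p$ and a behavior (sampling) distribution $q$ with $q(x) > 0$ wherever $p(x) > 0$, the standard identity and its Monte Carlo estimate are:

$$
\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i), \quad x_i \sim q
$$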
Model-free Generalized Policy Improvement (GPI)
Given a policy $\pi$, estimate the state-action value function $Q^{\pi}(s, a)$. Using this estimate, update $\pi$ to an improved policy $\pi'$:
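The model-free improvement step is greedy with respect to the estimated action values, since it needs no transition model:

$$
\pi'(s) = \arg\max_{a} Q^{\pi}(s, a)
$$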
Model-Free Policy Iteration
- Initialize policy $\pi$
- Repeat until convergence
- Policy evaluation: estimate $Q^{\pi}$
- Policy improvement: generate $\pi' \geq \pi$ (see the sketch after this list)
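A minimal sketch of this loop, using every-visit Monte Carlo returns for the evaluation step and an $\epsilon$-greedy improvement step. The environment interface (`env.reset()` returning a hashable state, `env.step(a)` returning `(next_state, reward, done)`) and the hyperparameters are illustrative assumptions, not part of the original notes:

```python
import random
from collections import defaultdict

def mc_control(env, n_actions, episodes=10_000, gamma=0.99, epsilon=0.1):
    """Model-free policy iteration: Monte Carlo evaluation + epsilon-greedy improvement."""
    Q = defaultdict(float)      # Q[(s, a)] -> estimated action value
    counts = defaultdict(int)   # visit counts for incremental averaging
    actions = list(range(n_actions))

    def policy(s):
        # Policy improvement: epsilon-greedy with respect to the current Q estimate.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        # Generate one episode by following the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # assumed interface: (next_state, reward, done)
            episode.append((s, a, r))
            s = s_next

        # Policy evaluation: every-visit Monte Carlo update of Q from observed returns.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]   # incremental mean

    return Q, policy
```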
Policy Evaluation with Exploration
If $\pi$ is deterministic, how can we compute $Q^{\pi}(s, a)$ for $a \neq \pi(s)$? Doing so requires data on $(s, a)$ pairs with $a \neq \pi(s)$, and gathering such data is called exploration. We can either:
- collect all $(s, a)$ pairs with $a \neq \pi(s)$, or
- collect enough $(s, a)$ pairs with $a \neq \pi(s)$ to ensure that the resulting estimate of $Q^{\pi}$ improves the current policy.
So how can we be sure that we have collected enough $(s, a)$ pairs?
$\epsilon$-greedy Exploration
$\epsilon$-greedy exploration is a simple idea for ensuring continual exploration: every action is tried with non-zero probability. Let $m = \vert A \vert$ be the number of actions. Then the $\epsilon$-greedy policy with respect to $Q^{\pi}(s, a)$ is:
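In its standard form, the greedy action gets probability $1 - \epsilon + \epsilon/m$ and every other action gets $\epsilon/m$:

$$
\pi(a \mid s) =
\begin{cases}
\dfrac{\epsilon}{m} + 1 - \epsilon & \text{if } a = \arg\max_{a'} Q^{\pi}(s, a') \\
\dfrac{\epsilon}{m} & \text{otherwise}
\end{cases}
$$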
$\epsilon$-greedy Policy Improvement
- Theorem: Given a policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $Q_{\pi}$ is an improvement over $\pi$.
- Proof:
The key claim is $Q_{\pi}(s, \pi'(s)) \geq V_{\pi}(s)$: taking the first action from $\pi'$ and following $\pi$ thereafter is at least as good as following $\pi$ throughout. This can be derived as follows:
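A standard way to carry out this derivation, assuming $\pi$ is itself $\epsilon$-soft (so that the weights $\frac{\pi(a \mid s) - \epsilon/m}{1-\epsilon}$ form a valid distribution over actions), is:

$$
\begin{aligned}
Q_{\pi}(s, \pi'(s)) &= \sum_{a} \pi'(a \mid s)\, Q_{\pi}(s, a) \\
&= \frac{\epsilon}{m} \sum_{a} Q_{\pi}(s, a) + (1-\epsilon) \max_{a} Q_{\pi}(s, a) \\
&\geq \frac{\epsilon}{m} \sum_{a} Q_{\pi}(s, a) + (1-\epsilon) \sum_{a} \frac{\pi(a \mid s) - \epsilon/m}{1-\epsilon}\, Q_{\pi}(s, a) \\
&= \sum_{a} \pi(a \mid s)\, Q_{\pi}(s, a) = V_{\pi}(s)
\end{aligned}
$$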
$\epsilon$-greedy Policy Improvement
In terms of expectations, we can then conclude $V_{\pi'}(s) \geq V_{\pi}(s)$, derived like this:
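A standard version of this derivation, repeatedly applying $V_{\pi}(\cdot) \leq Q_{\pi}(\cdot, \pi'(\cdot))$ inside the expectation, is:

$$
\begin{aligned}
V_{\pi}(s) &\leq Q_{\pi}(s, \pi'(s)) = \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s \right] \\
&\leq \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma Q_{\pi}(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
&\leq \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} Q_{\pi}(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \right] \\
&\leq \cdots \leq \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \mid S_t = s \right] = V_{\pi'}(s)
\end{aligned}
$$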
Note: $\mathbb{E}\big[\mathbb{E}[X \mid Y]\big] = \mathbb{E}[X]$ (the tower property), which is what lets the nested expectations above collapse.
Greedy in the Limit of Infinite Exploration (GLIE)
If the learning policy $\pi$ satisfies these conditions:
- If a state is visited infinitely often, then every action in that state is chosen infinitely often (with probability 1)
- As $t \to \infty$, the learning policy becomes greedy with respect to the learned $Q$ function with probability 1 (written out after this list)
then we call the policy GLIE (Greedy in the Limit of Infinite Exploration).
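Written formally, with $N_k(s, a)$ denoting the number of times action $a$ has been selected in state $s$ up to episode $k$ (a notation introduced here for convenience), the two conditions are:

$$
\lim_{k \to \infty} N_k(s, a) = \infty, \qquad \lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\left(a = \arg\max_{a'} Q_k(s, a')\right)
$$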
Bringing this idea to $\epsilon$-greedy exploration: if $\epsilon_k$ is gradually reduced to zero, the strategy is GLIE, for example:
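One common choice is a hyperbolic decay on the episode index $k$:

$$
\epsilon_k = \frac{1}{k}
$$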