Optimal Value Function
The optimal state-value function $V^{\ast}(s)$ is the maximum state-value function over all policies:
$$V^{\ast}(s) = \max_{\pi} V^{\pi}(s)$$
The optimal action-value function $Q^{\ast}(s, a)$ is the maximum action-value function over all policies:
$$Q^{\ast}(s, a) = \max_{\pi} Q^{\pi}(s, a)$$
The difference between them is that $Q^{\ast}(s, a)$ fixes the initial action $a$ (which need not be the action the policy would choose) and follows the policy afterwards, whereas $V^{\ast}(s)$ selects every action, including the first, according to the policy.
Optimal Policy
Find the optimal policy $\pi^{\ast}$, which maximizes the state-value at every state:
$$\pi^{\ast}(s) = \arg\max_{\pi} V^{\pi}(s)$$
For the optimal policy $\pi^*$, we have (illustrated numerically in the sketch after this list):
- $V^{\pi^*}(s) \ge V^{\pi}(s)$ for any policy $\pi$ and any state $s$
- $Q^{\pi^*}(s, a) \ge Q^{\pi}(s, a)$ for any policy $\pi$, any state $s$, and any action $a$
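To make these definitions concrete, here is a minimal sketch, assuming a hypothetical 2-state, 2-action MDP (the arrays `P`, `R` and the discount `gamma` are made-up illustrative values, not taken from the text). It enumerates every deterministic policy, computes each $V^{\pi}$ exactly from the linear Bellman system, and checks that the best policy's value dominates every other policy at every state.

```python
import numpy as np
from itertools import product

# Hypothetical 2-state, 2-action MDP (purely illustrative numbers).
# P[a, s, s2] = probability of moving from s to s2 under action a,
# R[s, a]     = expected immediate reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],   # action 0
              [[0.5, 0.5],
               [0.0, 1.0]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

def policy_value(pi):
    """Exact V^pi by solving the linear Bellman system (I - gamma * P_pi) V = R_pi."""
    P_pi = np.array([P[pi[s], s] for s in range(n_states)])   # |S| x |S|
    R_pi = np.array([R[s, pi[s]] for s in range(n_states)])   # |S|
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Enumerate all |A|^|S| deterministic policies (here 2^2 = 4) and keep the best.
policies = list(product(range(n_actions), repeat=n_states))
values = {pi: policy_value(pi) for pi in policies}
pi_star = max(policies, key=lambda pi: values[pi].sum())

# Dominance: V^{pi*}(s) >= V^{pi}(s) for every policy pi and every state s.
assert all(np.all(values[pi_star] >= values[pi] - 1e-9) for pi in policies)
print("pi* =", pi_star, "  V^{pi*} =", values[pi_star])
```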
Relation Between the Two Optimal Value Functions
The optimal state-value function is obtained by maximizing the optimal action-value function over actions:
$$V^{\ast}(s) = \max_{\pi} V^{\pi}(s) = \max_{a} Q^{\ast}(s, a)$$
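In array terms, once a $Q^{\ast}$ table is available, $V^{\ast}$ and a greedy optimal action fall out of a max/argmax over the action axis. A minimal sketch with made-up values:

```python
import numpy as np

# Hypothetical optimal action-values Q_star[s, a] for 2 states and 2 actions.
Q_star = np.array([[3.0, 4.5],
                   [1.2, 0.7]])

V_star  = Q_star.max(axis=1)      # V*(s)  = max_a Q*(s, a)  -> [4.5 1.2]
pi_star = Q_star.argmax(axis=1)   # greedy optimal action    -> [1 0]
```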
From these definitions we can derive the Bellman equations for the optimal policy:
$$
\begin{aligned}
Q^{\ast}(s, a) &= R(s, a) + \gamma \,\mathbb{E}_{\pi^{\ast}}\!\left[ V^{\ast}(s_{t+1}) \mid s_t = s, a_t = a \right] \\
&= R(s, a) + \gamma \,\mathbb{E}_{\pi^{\ast}}\!\left[ \max_{a'} Q^{\ast}(s_{t+1}, a') \mid s_t = s, a_t = a \right] \\
V^{\ast}(s) &= \max_{a} \left( R(s, a) + \gamma \,\mathbb{E}_{\pi^{\ast}}\!\left[ V^{\ast}(s_{t+1}) \mid s_t = s, a_t = a \right] \right) \\
&= R(s, a^{\ast}) + \gamma \,\mathbb{E}_{\pi^{\ast}}\!\left[ V^{\ast}(s_{t+1}) \mid s_t = s, a_t = a^{\ast} \right],
\quad \text{where } a^{\ast} = \arg\max_{a} Q^{\ast}(s, a)
\end{aligned}
$$
Bellman Optimality Equations for Finite MDPs
In a discrete state space, we can replace the expectation with a summation over next states, writing $T_{ss'}^{a} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ for the transition probability:
$$Q^{\ast}(s, a) = R(s, a) + \gamma \,\mathbb{E}_{\pi^{\ast}}\!\left[ V^{\ast}(s_{t+1}) \mid s_t = s, a_t = a \right] = R(s, a) + \gamma \sum_{s' \in S} T_{ss'}^{a} V^{\ast}(s')$$

$$Q^{\ast}(s, a) = R(s, a) + \gamma \,\mathbb{E}_{\pi^{\ast}}\!\left[ \max_{a'} Q^{\ast}(s_{t+1}, a') \mid s_t = s, a_t = a \right] = R(s, a) + \gamma \sum_{s' \in S} T_{ss'}^{a} \max_{a'} Q^{\ast}(s', a')$$

$$V^{\ast}(s) = \max_{a} \left( R(s, a) + \gamma \sum_{s' \in S} T_{ss'}^{a} V^{\ast}(s') \right)$$
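These summation forms map directly onto array operations. Below is a minimal sketch, assuming a hypothetical transition tensor `T` (with `T[s, a, s']` playing the role of $T_{ss'}^{a}$), reward matrix `R`, and discount `gamma`; it writes the $Q^{\ast}$ and $V^{\ast}$ backups above as one-line updates.

```python
import numpy as np

# Hypothetical finite MDP: T[s, a, s2] plays the role of T^a_{ss'},
# R[s, a] is the expected immediate reward, gamma the discount factor.
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def q_backup(Q):
    """Q(s, a) <- R(s, a) + gamma * sum_{s'} T^a_{ss'} * max_{a'} Q(s', a')."""
    return R + gamma * T @ Q.max(axis=1)

def v_backup(V):
    """V(s) <- max_a [ R(s, a) + gamma * sum_{s'} T^a_{ss'} * V(s') ]."""
    return (R + gamma * T @ V).max(axis=1)

# Q* and V* are the fixed points of these backups: applying either update
# to the optimal values leaves them unchanged, which is the Bellman
# optimality equation written as code.
Q = q_backup(np.zeros_like(R))   # one backup step starting from Q = 0
V = v_backup(np.zeros(len(R)))   # one backup step starting from V = 0
```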
Finding the Optimal Policy
A brute-force approach searches over every policy for one satisfying:
$$\pi^{\ast}(s) = \arg\max_{\pi} V^{\pi}(s)$$
With a finite action space and a finite state space, this means comparing $\vert A \vert^{\vert S \vert}$ deterministic policies, which is an enormous number. So we need an iterative approach to solve it (a minimal value-iteration sketch follows this list):
- Value iteration
- Policy iteration
- Q-learning
- SARSA
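As an illustration of the first item, here is a minimal value-iteration sketch under the same kind of hypothetical `T`, `R`, `gamma` setup as above: it repeats the Bellman optimality backup until $V$ stops changing, then reads off the greedy policy.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Value iteration for a finite MDP.

    T: |S| x |A| x |S| transition probabilities, T[s, a, s'] = T^a_{ss'}
    R: |S| x |A| expected immediate rewards
    Returns the optimal state values V* and a greedy (optimal) policy.
    """
    V = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * T @ V          # Q(s, a) = R(s, a) + gamma * sum_{s'} T^a_{ss'} V(s')
        V_new = Q.max(axis=1)          # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical 2-state, 2-action MDP (same illustrative numbers as the sketch above).
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V_star, pi_star = value_iteration(T, R, gamma=0.9)
print("V* =", V_star, " pi* =", pi_star)
```

Policy iteration alternates exact policy evaluation with greedy improvement, while Q-learning and SARSA drop the explicit model ($T$, $R$) and update action-values from sampled transitions: Q-learning toward the max-based optimality target, SARSA toward the action its current policy actually takes.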