Finding Optimal Policy

Optimal Value Function
Optimal Policy
Relation of Two Optimality Value Function
Bellman Optimality Equation for finite MDP
Finding Optimal Policy

Optimal Value Function

The optimal state-value function $V^{\ast}(s)$ is the maximum state-value function over all policies:

V^{\ast}(s) = \max_{\pi}V^{\pi}(s)

And optimal state action-value function $Q^{\ast}(s, a)$ is the maximum action-value function over all policies:

Q^{\ast}(s, a) = \max_{\pi} Q^{\pi}(s,a)

The difference between them is that $Q^{\ast}(s, a)$ takes the inital action $a$ based on policy $\pi$, but $V^{\ast}(s)$ is not.

Optimal Policy

Find the optimal policy $\pi^{\ast}$ which maximize the state-value at each state: $\pi^{\ast}(s) = \argmax_{\pi}V^{\pi}(s)$

For the optimal policy $\pi^*$, we have

$V^{\pi^*} \gt V^{\pi}(s)$ for any policy $\pi$ and state $s$
$Q^{\pi^*} \gt Q^{\pi}(s, a)$ for any policy $\pi$, any state $s$ and any action $a$

Relation of Two Optimality Value Function

Find actions which maximize state-value function $V^*(s) = \max_{\pi}V{\pi}(s) = \max_{a}Q^*(s, a)$ We can derive the bellman equations for optimal policy: $\begin{aligned} Q^*(s,a) &= R(s, a) + \gamma E_{\pi^*}[ V^*(s_{t+1}) \vert s_t = s, a_t=a] \\ &= R(s,a) + \gamma E_{\pi^*}[\max_{a'}Q^*(s_{t+1}, a') \vert s_t=s, a_t=a] \\ V^*(s) &= \max_aR(s,a) + \gamma E_{\pi^*}[V^*(s_{t+1}) \vert S_t = s, a_t = a^*] \\ & \text{where } a^* = \argmax_a R(s, a) \end{aligned}$

Bellman Optimality Equation for finite MDP

In discrete state space, we can replace the expectation term with summation: $\begin{aligned} Q^*(s,a) &= R(s,a) + \gamma E_{\pi^*}[V^*(s_{t+1}) \vert s_t=s, a_t=a] \\ &= R(s,a) + \gamma \sum_{s' \in S} T_{ss'}^a V^*(s') \end{aligned}$

\begin{aligned} Q^*(s,a) &= R(s,a) + \gamma E_{\pi^*}[\max_{a'} Q^*(s_{t+1}, a') \vert s_t=s, a_t=a] \\ &= R(s,a) + \gamma \sum_{s' \in S} T_{ss'}^a \max_{a'}Q^*(s', a') \end{aligned}

V^*(s) = \max_a R(s, a) + \gamma \sum_{s' \in S} T_{ss'}^a V^*(s')

Finding Optimal Policy

Searching the every policy satisfying: $\pi^*(s) = \argmax_{\pi} V^{\pi}(s)$

If we have finite action space and finite state space, then we should compare the number of $\vert A \vert^{\vert s \vert}$ (huge number). So we need Iterative approach to solve it:

Value iteration
Policy iteration
Q-learning
SARSA