Variants of Dynamic Programming
A summary of "Understanding deep reinforcement learning"
Policy Improvement by Iterative Methods
Asynchronous Dynamic Programming
The dynamic programming methods discussed so far use synchronous backups, which update all states in parallel at each iteration: every new value is computed only from the previous iteration's values, so the whole value function must be ready before the next iteration can use it. Asynchronous DP instead backs up states individually, in any order. This can significantly reduce computation, and it still converges as long as every state continues to be selected for update.
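As a concrete illustration, here is a minimal sketch of an asynchronous backup loop on a toy MDP. The reward matrix `R`, transition tensor `P`, discount `gamma`, and state count are made-up assumptions for illustration; the point is only that states are backed up one at a time, in an arbitrary order, until no value changes noticeably.

```python
# Minimal sketch of asynchronous DP on a toy 3-state, 2-action MDP.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
R = rng.uniform(0, 1, size=(n_states, n_actions))                  # R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']

def bellman_backup(V, s):
    """One-state backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return np.max(R[s] + gamma * P[s] @ V)

V = np.zeros(n_states)
for sweep in range(1000):
    delta = 0.0
    for s in rng.permutation(n_states):     # states visited in an arbitrary order
        v_new = bellman_backup(V, s)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new                        # written back immediately
    if delta < 1e-8:                        # stop once no state changed noticeably
        break
print(V)
```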
Variants:
- In-place dynamic programming
- Prioritized sweeping
- Real-time dynamic programming
In-Place Dynamic Programming
Synchronous value iteration stores two copies of the value function.

For all states $s$:

$V_{new}(s) = \max_{a \in A} \big(R(s, a) + \gamma \sum_{s' \in S} P(s' \vert s, a) V_{old}(s')\big)$

$V_{old}(s) = V_{new}(s)$


In-place value iteration stores only one copy of the value function.

For all states $s$:

$V(s) = \max_{a \in A} \big(R(s, a) + \gamma \sum_{s' \in S} P(s' \vert s, a) V(s')\big)$

Note: once $V(s)$ has been updated, later backups in the same sweep use the new value wherever $s$ appears as a successor state $s'$.
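A small sketch of the contrast, assuming the same kind of toy MDP as in the sketch above (made-up `R`, `P`, and `gamma`): the synchronous sweep needs a second array, while the in-place sweep overwrites its single array as it goes.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
R = rng.uniform(0, 1, size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def synchronous_sweep(V_old):
    """Two copies: every V_new(s) is computed from V_old only."""
    V_new = np.empty_like(V_old)
    for s in range(n_states):
        V_new[s] = np.max(R[s] + gamma * P[s] @ V_old)
    return V_new

def inplace_sweep(V):
    """One copy: later states in the sweep already see the updated values."""
    for s in range(n_states):
        V[s] = np.max(R[s] + gamma * P[s] @ V)
    return V
```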
Prioritized Sweeping
Prioritized sweeping updates states in order of priority rather than sweeping over all of them. To choose the next state to back up, find the state with the largest Bellman error:

$\arg\max_{s \in S} \Big\vert \max_{a \in A} \big(R(s, a) + \gamma \sum_{s' \in S} P(s' \vert s, a) V(s')\big) - V(s) \Big\vert$

Back up that state, then recompute the Bellman error of the affected (predecessor) states, since the change $V_{new}(s) - V_{old}(s)$ alters their backup targets.

This can be implemented by maintaining a priority queue, as in the sketch below.
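A minimal sketch of this idea, again on a made-up toy MDP: a heap orders states by the magnitude of their Bellman error, the state with the largest error is backed up, and its predecessors are re-queued with fresh priorities. Stale heap entries are handled by recomputing the error on pop.

```python
import heapq
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
R = rng.uniform(0, 1, size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def backup_value(V, s):
    return np.max(R[s] + gamma * P[s] @ V)

def bellman_error(V, s):
    return abs(backup_value(V, s) - V[s])

V = np.zeros(n_states)
# heapq is a min-heap, so push negative errors to pop the largest error first.
queue = [(-bellman_error(V, s), s) for s in range(n_states)]
heapq.heapify(queue)

while queue:
    _, s = heapq.heappop(queue)
    err = bellman_error(V, s)            # stored priority may be stale; recompute
    if err < 1e-8:
        continue                         # nothing left to fix for this state
    V[s] = backup_value(V, s)
    # Backing up s changes the Bellman error of its predecessors
    # (states that can transition into s), so re-queue them with fresh priorities.
    for pred in range(n_states):
        if P[pred, :, s].max() > 0.0:
            heapq.heappush(queue, (-bellman_error(V, pred), pred))
```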
Sample Backups
The previous post introduced full-width backups, which consider all successor states and actions. In contrast, sample backups use sampled rewards and transitions instead of the reward function $R$ and the transition model $P$. Because a sample backup touches only one successor state, its cost does not grow with the size of the state space, which reduces computation.
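Here is a minimal sketch of the difference for a single state-action pair; the step size `alpha` and the toy model are assumptions for illustration, and the model is used only as a simulator to draw samples from.

```python
import numpy as np

n_states, n_actions, gamma, alpha = 3, 2, 0.9, 0.1
rng = np.random.default_rng(0)
R = rng.uniform(0, 1, size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
V = np.zeros(n_states)

s, a = 0, 1

# Full-width backup: sums over every possible successor state (needs P).
full_target = R[s, a] + gamma * P[s, a] @ V

# Sample backup: draw a single successor and reward, then move V(s)
# toward the sampled target with step size alpha.
s_next = rng.choice(n_states, p=P[s, a])
r = R[s, a]                               # reward is deterministic in this toy model
sample_target = r + gamma * V[s_next]
V[s] += alpha * (sample_target - V[s])
```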
Approximate DP
The value function can be approximated by
- a function approximator $\hat{V}(s, w)$ with parameters $w$ (a neural network is commonly used), and
- applying DP to $\hat{V}(\cdot, w)$.

This is called fitted value iteration, and it follows the process below (a code sketch comes after the list).
Repeat the following until convergence:
- Sample states $\tilde{S} \subset S$.
- For each state $s \in \tilde{S}$, estimate the target value using the Bellman optimality equation:
  $\tilde{V}_k(s) = \max_{a \in A} \big(R(s, a) + \gamma \sum_{s' \in S} P(s' \vert s, a) \hat{V}(s', w_k) \big)$
- Train the next value function $\hat{V}(\cdot, w_{k+1})$ on the targets $\{ \langle s, \tilde{V}_k(s) \rangle \}$.
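Below is a minimal sketch of fitted value iteration with a linear approximator $\hat{V}(s, w) = \phi(s)^\top w$ fitted by least squares. The feature map `phi`, the number of sampled states, and the toy MDP are assumptions for illustration; a neural network trained by gradient descent could play the same role as the least-squares fit.

```python
import numpy as np

n_states, n_actions, gamma, n_features = 20, 2, 0.9, 4
rng = np.random.default_rng(0)
R = rng.uniform(0, 1, size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
phi = rng.normal(size=(n_states, n_features))     # phi[s] = feature vector of state s

def v_hat(w):
    """Approximate values of all states under parameters w."""
    return phi @ w

w = np.zeros(n_features)
for k in range(100):
    # Sample a subset of states S_tilde.
    S_tilde = rng.choice(n_states, size=10, replace=False)
    # Bellman optimality targets computed with the current approximator.
    targets = np.array([
        np.max(R[s] + gamma * P[s] @ v_hat(w)) for s in S_tilde
    ])
    # Fit w_{k+1} by least squares on the pairs (phi(s), target).
    w, *_ = np.linalg.lstsq(phi[S_tilde], targets, rcond=None)
```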