Policy Improvement by Iterative Methods

Value Iteration

  • Initialize the state value for all states
  • Iterate over $k$ until convergence
    • Update the state value for every state

$V_{k+1}(s) = \max_{a \in A}\left(R(s, a) + \gamma \sum_{s' \in S} P(s' \vert s, a) V_k(s')\right)$

We can rewrite the sum over next states as an expectation,

$V_{k+1}(s) = \max_{a \in A}\left(R(s, a) + \gamma \, E\left[V_k(s_{t+1}) \vert s_t = s, a_t = a\right]\right)$
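
As a concrete Python sketch of this backup (the nested-dict MDP representation, `R[s][a]` for rewards and `P[s][a][s2]` for transition probabilities, is an assumption for illustration, not notation from the text above):

```python
def value_iteration(R, P, gamma=0.9, tol=1e-8):
    """V_{k+1}(s) = max_a ( R(s, a) + gamma * sum_{s'} P(s'|s, a) * V_k(s') )."""
    V = {s: 0.0 for s in R}                  # initialize the state value for all states
    while True:                              # iterate over k until convergence
        V_next = {}
        for s in R:                          # sweep over every state
            V_next[s] = max(
                R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in P[s][a])
                for a in R[s]
            )
        if max(abs(V_next[s] - V[s]) for s in V) < tol:
            return V_next                    # values have (approximately) converged
        V = V_next                           # V_{k+1} becomes V_k for the next sweep


# Example: a tiny 2-state, 2-action MDP
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}}}
print(value_iteration(R, P))                 # roughly {'s0': 19.0, 's1': 20.0}
```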

  • Iterative application of the Bellman optimality backup
  • Using tensor notation, where $\mathcal{T}^a$ denotes the transition matrix for action $a$, so that $(\mathcal{T}^a V_k)(s) = E[V_k(s_{t+1}) \vert s_t = s, a_t = a]$:

$V_{k+1} = \max_{a \in A}\left(R(a) + \gamma \mathcal{T}^a V_k\right)$

  • This is called a synchronous backup. "Backup" refers to computing each new value from the values of successor states; "synchronous" refers to keeping two copies of the state values, $V_k$ and $V_{k+1}$, and computing every entry of $V_{k+1}$ from the complete, fixed $V_k$ before the next sweep begins. A vectorized sketch of one such backup is given just below.
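
A vectorized sketch of one such synchronous backup, assuming rewards are stored as a NumPy array `R` of shape $(\vert S \vert, \vert A \vert)$ and transitions as a tensor `T` of shape $(\vert A \vert, \vert S \vert, \vert S \vert)$ with `T[a, s, s2]` $= P(s' \vert s, a)$ (these array names and shapes are illustrative assumptions):

```python
import numpy as np

def synchronous_backup(V, R, T, gamma):
    """One application of V_{k+1} = max_a ( R(a) + gamma * T^a V_k )."""
    expected_next = T @ V            # contract over s': shape (|A|, |S|), E[V_k(s_{t+1}) | s, a]
    Q = R + gamma * expected_next.T  # Q[s, a] for all state-action pairs at once
    return Q.max(axis=1)             # one new value per state, all computed from the same V_k
```

Because every entry of the returned $V_{k+1}$ is computed from the same, unchanged $V_k$, all states are backed up against a single snapshot of values, which is what makes the backup synchronous.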

  • Value iteration does not produce an explicit policy at any iteration, not even the optimal one at convergence.
  • To obtain the optimal policy, save and update the best (arg max) action for each state during the iterations, or equivalently extract the greedy action from the converged values; see the sketch after this list.
  • The complexity is $O(\vert A \vert \vert S \vert^2)$ per iteration: for each of the $\vert S \vert$ states being updated, every action sums over all $\vert S \vert$ successor states.
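
To make the policy-extraction step concrete, a sketch that recovers the greedy policy from a (converged) value function, reusing the `R` / `T` array conventions assumed in the previous sketch:

```python
import numpy as np

def greedy_policy(V, R, T, gamma):
    """pi(s) = argmax_a ( R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s') )."""
    Q = R + gamma * (T @ V).T    # the same Bellman backup, but keep the Q-values
    return Q.argmax(axis=1)      # index of the best action for each state
```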