Bellman equations and Markov decision process

Recall: Markov Reward Process

A Markov Reward Process is a Markov chain with rewards. It is a tuple $(S, T, R, \gamma)$ where:

  • $S$ is a set of Markov states $(s \in S)$
  • $T$ is the transition model (or dynamics) that specifies $T(t, s_1, s_2) = T_{s_1s_2} \doteq P(s_{t+1} = s_2 \vert s_t = s_1)$
  • $R$ is the reward function $R(s_t = s) = E[r_t \vert s_t = s]$
  • $\gamma$ is the discount factor, $\gamma \in [0, 1]$

Note:

  • There are no actions
  • If $S$ is finite, then $R$ can be represented by a vector
  • By the Markov property, the transition model is independent of $t$, so it can be expressed as $T(t, s_1, s_2) = T(s_1, s_2) = P(s_{t+1} = s_2 \vert s_t = s_1)$
  • Likewise, the reward function is independent of $t$: $R(s_t = s) = R(s)$

Markov Reward Process for Finite State

If $S$ is finite, we can enumerate the states as $S = \{s_1, s_2, \dots, s_N\}$, and the transition model $T$ can be represented by a matrix:

$$T = \begin{bmatrix} P(s_1 \vert s_1) & P(s_2 \vert s_1) & \cdots & P(s_N \vert s_1) \\ P(s_1 \vert s_2) & P(s_2 \vert s_2) & \cdots & P(s_N \vert s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1 \vert s_N) & P(s_2 \vert s_N) & \cdots & P(s_N \vert s_N) \end{bmatrix}$$

Also, the reward function $R$ can be represented by a vector: $R = \big( R(s_1), R(s_2), \cdots, R(s_N) \big)^T$
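As a concrete illustration, here is a minimal sketch of a finite MRP in Python/NumPy. The transition matrix, reward vector, and discount factor are hypothetical numbers chosen for illustration, not values from these notes.

```python
import numpy as np

gamma = 0.9  # discount factor

# Transition matrix T: T[i, j] = P(s_{t+1} = s_j | s_t = s_i); each row sums to 1.
T = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])

# Reward vector R: R[i] = E[r_t | s_t = s_i].
R = np.array([1.0, 0.0, -1.0])

assert np.allclose(T.sum(axis=1), 1.0)  # rows are probability distributions
```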

Computing return from rewards

Recall that $R$ is the reward function $R(s) = E[r_t \vert s_t = s]$. We define the return $G_t$ as the total discounted reward from time step $t$:

$$\begin{aligned} G_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \\ &= \sum_{k=0}^{\infty} \gamma^k r_{t+k} \\ &= r_t + \gamma G_{t+1} \end{aligned}$$
  • Time horizon: the number of time steps in each episode
    • Finite ($t \leq T$) or infinite ($t \to \infty$)
    • If the horizon is finite, the MRP is called a finite MRP
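A small sketch of computing the return from a sampled (finite) reward sequence, using the recursion $G_t = r_t + \gamma G_{t+1}$; the reward values below are placeholders, not values from these notes.

```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    G = 0.0
    for r in reversed(rewards):  # apply G_t = r_t + gamma * G_{t+1} from the end
        G = r + gamma * G
    return G

print(discounted_return([1.0, 0.0, -1.0, 2.0], gamma=0.9))  # 1.648
```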

State-value function for MRP

  • State-value function $V$: the expected discounted sum of future rewards starting from state $s$ (a small sampling sketch follows this list):
    $$\begin{aligned} V(s_t = s) &= E[G_t \vert s_t = s] \\ &= E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \vert s_t = s] \end{aligned}$$
  • By Markov property, $V(s_t = s)$ is independent of the initial step $t$ so that we can use simpler notation $V(s)$ instead of $V(s_t = s)$.
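Since $V(s) = E[G_t \vert s_t = s]$, one rough way to estimate it is to sample episodes and average their returns. A Monte Carlo sketch, reusing the hypothetical $T$, $R$, $\gamma$ from above and truncating the infinite sum at a fixed horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
T = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])

def estimate_value(s0, n_episodes=2000, horizon=100):
    """Monte Carlo estimate of V(s0) = E[G_t | s_t = s0]."""
    total = 0.0
    for _ in range(n_episodes):
        s, G, discount = s0, 0.0, 1.0
        for _ in range(horizon):  # gamma^100 is tiny, so truncation error is small
            G += discount * R[s]
            discount *= gamma
            s = rng.choice(len(R), p=T[s])  # sample the next state from row T[s]
        total += G
    return total / n_episodes

print(estimate_value(s0=0))
```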

Bellman Equation for MRP

The value function can be decomposed into two parts:

  • Immediate reward ($r_t$)
  • Discounted value of subsequent states ($\gamma G_{t+1}$)

The Bellman equation for MRP is:

$$\begin{aligned} V(s) &= E[G_t \vert s_t = s] \\ &= E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \vert s_t = s] \\ &= E[r_t + \gamma G_{t+1} \vert s_t = s] \\ &= E[r_t \vert s_t = s] + \gamma E\big[ E[G_{t+1} \vert s_{t+1}] \,\big\vert\, s_t = s \big] \\ &= R(s) + \gamma E[V(s_{t+1}) \vert s_t = s] \end{aligned}$$

The fourth line uses the law of iterated expectations, and the last line uses the Markov property: the inner expectation $E[G_{t+1} \vert s_{t+1}]$ is exactly $V(s_{t+1})$.

If the state space is finite, the Bellman equation for the state-value function of an MRP can be written with vectors and the transition matrix:

$$\begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix} = \begin{bmatrix} R(s_1) \\ \vdots \\ R(s_N) \end{bmatrix} + \gamma \begin{bmatrix} T_{11} & \cdots & T_{1N} \\ T_{21} & \cdots & T_{2N} \\ \vdots & \ddots & \vdots \\ T_{N1} & \cdots & T_{NN} \end{bmatrix} \begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix}$$

Note : $T_{ij} = P(s_{t+1} = s_j \vert s_t = s_i)$

This can be written compactly as: $V = R + \gamma T V$
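A quick sanity check that the matrix form is just the $N$ scalar Bellman equations stacked (same hypothetical $T$, $R$, $\gamma$ as above; $V$ is an arbitrary test vector):

```python
import numpy as np

gamma = 0.9
T = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])
V = np.array([0.5, -0.2, 1.0])  # arbitrary test vector

# Right-hand side of the scalar Bellman equation for each state i:
# R(s_i) + gamma * sum_j T_ij V(s_j)
elementwise = np.array([R[i] + gamma * sum(T[i, j] * V[j] for j in range(len(R)))
                        for i in range(len(R))])
# Matrix form: R + gamma * T V
matrix_form = R + gamma * T @ V

assert np.allclose(elementwise, matrix_form)
```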

How to solve Bellman Equation

Solve the linear system with a linear solver:

$$V = R + \gamma T V \\ (\mathbb{I} - \gamma T) V = R \\ V = (\mathbb{I} - \gamma T)^{-1} R$$

Note: If $\gamma < 1$, then $\mathbb{I}- \gamma T$ is invertible
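A minimal sketch of the direct solve with NumPy, using `np.linalg.solve` rather than forming the inverse explicitly (same hypothetical $T$, $R$, $\gamma$ as above):

```python
import numpy as np

gamma = 0.9
T = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])

# Solve (I - gamma * T) V = R
V = np.linalg.solve(np.eye(len(R)) - gamma * T, R)
print(V)
assert np.allclose(V, R + gamma * T @ V)  # V satisfies the Bellman equation
```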

The computational complexity is $O(n^3)$, where $n$ is the number of states. If $n$ is large, solving the linear system directly is impractical; instead, we can use iterative methods such as dynamic programming.
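A sketch of the iterative alternative: repeatedly apply the Bellman backup $V \leftarrow R + \gamma T V$ until it stops changing. Each sweep costs $O(n^2)$, and for $\gamma < 1$ the backup is a contraction, so the iterates converge to the same $V$ as the direct solve (same hypothetical numbers as above):

```python
import numpy as np

gamma = 0.9
T = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])

V = np.zeros(len(R))
for _ in range(10_000):
    V_new = R + gamma * T @ V          # one Bellman backup (an O(n^2) sweep)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # matches np.linalg.solve(np.eye(len(R)) - gamma * T, R)
```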