# Bellman equations and Markov decision process

## Recall: Markov Reward Process

Markov Reward Process is Markov Chain with rewards. it is a tuple of $(S, T, R, \gamma)$

• $S$ is a set of Markov states $(s \in S)$
• $T$ is transition model (or dynamics) that specifies $T(t, s_1, s_2) = T_{s_1s_2} \doteq P(s_{t+1} = s_2 \vert s_t = s_1)$
• $R$ is a reward function $R(s_t = s) = E[r_t \vert s_t = s]$
• $\gamma$ is Discount factor $\gamma \in [0, 1]$

Note:

• There is no actions
• If $S$ is finite, then $R$ can be represented by a vector
• By Markov property, the transition model is independent of $t$ so that it can be expresed by $T(t, s_1, s_2) = T(s_1, s_2) = P(s_{t+1} = s_2 \vert s_t = s_1)$
• Also same as $R(s_t = s) = R(s)$

## Markov Reward Process for Finite State

If $S$ is finite, $S$ can be represented by a vector: $S=\{s_1, s_2, \dots, s_N \}$ And the transition model $T$ can be represented by a matrix:

$T = \begin{bmatrix} P(s_1 \vert s_1) & P(s_2 \vert s_1) & \cdots & P(s_N \vert s_1) \\ P(s_1 \vert s_2) & P(s_2 \vert s_2) & \cdots & P(s_N \vert s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1 \vert s_N) & P(s_2 \vert s_N) & \cdots & P(s_N \vert s_N)\end{bmatrix}$

Also, the reward function $R$ can be represented by a vector: $R = \big( R(s_1), R(s_2, \cdots, R(s_N) \big)^T$

## Computing return from rewards

$R$ is a reward function $R(s) = E[r_t | s_t = s]$ We can calculate total discounted rewards from $t$ be the return $G_t$:

\begin{aligned} G_t &= r_t + \gamma * r_{t+1} + \gamma^2 * r_{t+2} + \cdots \\ &= \sum_{t=0}^{\infty} \gamma^t r_t \\ &= r_t + \gamma G_{t+1} \end{aligned}
• Time of horizon: number of time steps in each episode
• Finite ($t \leq T$) or infinite ($t \rarr \infty$)
• If finite, MRP is called finite MRP

## State-value function for MRP

• State-value function $V$: expected discounted sum of future rewards from initial state $s$ \begin{aligned} V(s_t = s) &= E[G_t \vert s_t = s] \\ &= E[r_t + \gamma r_{t+1} + \gamma^2r_{t+2} + \gamma^3 r_{t+3} + \cdots \vert s_t = s] \end{aligned}
• By Markov property, $V(s_t = s)$ is independent of the initial step $t$ so that we can use simpler notation $V(s)$ instead of $V(s_t = s)$.

## Bellman Equation for MRP

The value function can be decomposed by:

• Immediate reward ($r_t$)
• Discounterd value of subsequent states ($\gamma G_{t+1}$)

The Bellman equation for MRP is: \begin{aligned} V(s) &= E[G_t \vert s_t = s] \\ &= E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \vert s_t = s] \\ &= E[r_t + \gamma G_{t+1} \vert s_t = s] \\ &= E[r_t | s_t = s] + \gamma E[ E[G_{t+1} \vert s_{t+1}] \vert s_t = s] \\ &= R(s) + \gamma E[V(s_{t+1}) \vert s_t = s] \end{aligned}

If we have finite state, the Bellman equation of state-value function for MRP can be represented by vectors and the transition matrix:

$\begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix} = \begin{bmatrix} R(s_1) \\ \vdots \\ R(s_N) \end{bmatrix} + \gamma \begin{bmatrix} T_{11} & \cdots & T_{1N} \\ T_{21} & \cdots & T_{2N} \\ \vdots & \ddots & \vdots \\ T_{N1} & \cdots & T_{NN} \end{bmatrix} \begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix}$

Note : $T_{ij} = P(s_{t+1} = s_j \vert s_t = s_i)$

It can simplified by: $V = R + \gamma T V$

## How to solve Bellman Equation

Solving the linear system by using linear solvers: $V = R + \gamma T V \\ (\mathbb{I}- \gamma T) V = R \\ V = (\mathbb{I} - \gamma T)^{-1} R$

Note: If $\gamma < 1$, then $\mathbb{I}- \gamma T$ is invertible

Computation Complexity is $O(n^3)$ where $n$ is the number of states. If $n$ is huge, then it is not proper to solve the linear system directly. Instead, we can use iterative methods like Dynamic Programming.