Dynamic Programming for solving Bellman equation
Dynamic Programming for solving MRP
The value function is computed iteratively with the Bellman backup $V_{k+1}(s) = R(s) + \gamma \sum_{s' \in S} P(s' \vert s) V_k(s')$ applied to every state $s$.
Note: if $\Vert V_{k+1} - V_{k} \Vert \lt \epsilon$, the value function has converged.
Computational complexity is $O(n^2)$ per iteration, where $n$ is the number of states.
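Below is a minimal sketch of this iteration, assuming the MRP is given as a NumPy transition matrix `T` and reward vector `R` (hypothetical names, not from the notes above):

```python
import numpy as np

def evaluate_mrp(T, R, gamma, epsilon=1e-6):
    """Iterate the Bellman backup V <- R + gamma * T @ V until convergence.

    T : (n, n) matrix, T[s1, s2] = P(s_{t+1} = s2 | s_t = s1)
    R : (n,) vector,   R[s]      = E[r_t | s_t = s]
    """
    V = np.zeros(len(R))
    while True:
        V_next = R + gamma * T @ V                  # one backup costs O(n^2)
        if np.max(np.abs(V_next - V)) < epsilon:    # |V_{k+1} - V_k| < epsilon
            return V_next
        V = V_next
```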
Markov Decision Process
A Markov Decision Process is an MRP with actions; it is a tuple $(S, A, T, R, \gamma)$ (a minimal array encoding is sketched after this list):
- $S$ is a set of Markov states ($s \in S$)
- $A$ is a set of actions ($a \in A$)
- $T$ is the transition model for each action, which specifies:
  $$T(s_1, s_2, a) = T_{s_1 s_2}^{a} \doteq P(s_{t+1} = s_2 \vert s_t = s_1, a_t = a)$$
- $R$ is a reward function ($R(s, a) = E[r_t \vert s_t = s, a_t = a]$)
- $\gamma$ is the discount factor ($\gamma \in [0, 1]$)
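One possible in-memory encoding of this tuple, as a sketch: states and actions are assumed to be indexed from 0, and the field names `T`, `R`, `gamma` are illustrative only. Note that `T` here is a per-action 3-D array, unlike the 2-D MRP matrix used earlier.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    T: np.ndarray    # (n_actions, n_states, n_states), T[a, s1, s2] = P(s2 | s1, a)
    R: np.ndarray    # (n_states, n_actions),           R[s, a] = E[r_t | s_t=s, a_t=a]
    gamma: float     # discount factor in [0, 1]

    @property
    def n_states(self) -> int:
        return self.R.shape[0]

    @property
    def n_actions(self) -> int:
        return self.R.shape[1]
```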
Policy
Policy $\pi$ determines how the agent chooses actions. It is a function from states to actions ($\pi : S \rightarrow A$)
- Deterministic policy : $\pi(s) = a$
- Stochastic policy : $\pi(a \vert s) = P(a_t = a \vert s_t = s)$
Note: If a policy $\pi$ is given for the MDP (i.e. the action is fixed for each state), the MDP is just an MRP with a specific transition model and reward function (see the sketch after these equations):
$$T^{\pi}(s_1, s_2) = P(s_{t+1} = s_2 \vert s_t = s_1, a_t = \pi(s_1))$$
$$R^{\pi}(s) = E[r_t \vert s_t = s, a_t = \pi(s)]$$
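A minimal sketch of this reduction for a deterministic policy, reusing the hypothetical array layout from the `MDP` sketch above (the policy is assumed to be an integer array of action indices):

```python
import numpy as np

def mdp_to_mrp(T, R, policy):
    """Collapse an MDP plus a deterministic policy into an MRP.

    T      : (n_actions, n_states, n_states), T[a, s1, s2] = P(s2 | s1, a)
    R      : (n_states, n_actions)
    policy : (n_states,) integer array, policy[s] = pi(s)

    Returns (T_pi, R_pi) with T_pi[s1, s2] = T[pi(s1), s1, s2]
    and R_pi[s] = R[s, pi(s)].
    """
    states = np.arange(R.shape[0])
    T_pi = T[policy, states, :]   # row of T selected by the policy's action in each state
    R_pi = R[states, policy]
    return T_pi, R_pi
```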
State-value function for MDP
State-value function $V^{\pi}$ is the expected discounted sum of future rewards under a particular policy $\pi$, starting from state $s$:
$$V^{\pi}(s) = E_{\pi}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \vert s_t = s]$$
Bellman equation for $V^{\pi}$:
$$V^{\pi}(s) = R^{\pi}(s) + \gamma E_{\pi}[V^{\pi}(s_{t+1}) \vert s_t = s]$$
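Because this Bellman equation is linear in $V^{\pi}$, a small discrete MDP with a fixed policy can also be evaluated exactly by solving the linear system $(I - \gamma T^{\pi})V^{\pi} = R^{\pi}$; a minimal sketch, assuming the hypothetical `T_pi`, `R_pi` arrays produced by the reduction above:

```python
import numpy as np

def evaluate_policy_exact(T_pi, R_pi, gamma):
    """Solve V = R^pi + gamma * T^pi @ V, i.e. (I - gamma * T^pi) V = R^pi."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
```

The direct solve costs $O(n^3)$, so the iterative Bellman backup is usually preferred when the state space is large.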
Action-value function for MDP
Action-value function $Q^{\pi}$ is the expected discounted sum of future rewards starting from state $s$, taking action $a$, and then following a particular policy $\pi$:
$$
\begin{aligned}
Q^{\pi}(s, a) &= E_{\pi}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \vert s_t = s, a_t = a] \\
&= E[r_t \vert s_t = s, a_t = a] + \gamma E_{\pi}[G_{t+1} \vert s_t = s, a_t = a]
\end{aligned}
$$
Bellman equation for $Q^{\pi}$:
$$Q^{\pi}(s, a) = R(s, a) + \gamma E_{\pi}[Q^{\pi}(s_{t+1}, a_{t+1}) \vert s_t = s, a_t = a]$$
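A minimal sketch of iterating this backup for a stochastic policy, again under the hypothetical array conventions above, with the policy given as a probability matrix `policy[s, a]` $= \pi(a \vert s)$:

```python
import numpy as np

def policy_evaluation_q(T, R, policy, gamma, epsilon=1e-6):
    """Iterate Q(s,a) <- R(s,a) + gamma * E[Q(s', a')], a' ~ pi(.|s'), until convergence.

    T      : (n_actions, n_states, n_states)
    R      : (n_states, n_actions)
    policy : (n_states, n_actions), policy[s, a] = pi(a | s)
    """
    Q = np.zeros(R.shape)
    while True:
        # expected value of the next state-action pair under pi, for every next state s'
        next_value = np.einsum('sa,sa->s', policy, Q)
        Q_next = R + gamma * np.einsum('ast,t->sa', T, next_value)
        if np.max(np.abs(Q_next - Q)) < epsilon:
            return Q_next
        Q = Q_next
```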
Relation of the Two Value Functions
Bellman equations for policy $\pi$:
$$
\begin{aligned}
Q^{\pi}(s, a) &= R(s, a) + \gamma E_{\pi}[V^{\pi}(s_{t+1}) \vert s_t = s, a_t = a] \\
&= R(s, a) + \gamma E_{\pi}[Q^{\pi}(s_{t+1}, a_{t+1}) \vert s_t = s, a_t = a] \\
V^{\pi}(s) &= R(s, \pi(s)) + \gamma E_{\pi}[V^{\pi}(s_{t+1}) \vert s_t = s]
\end{aligned}
$$
In a discrete MDP, the Bellman equations for policy $\pi$ become:
$$
\begin{aligned}
Q^{\pi}(s, a) &= R(s, a) + \gamma \sum_{s' \in S} P(s' \vert s, a) V^{\pi}(s') \\
V^{\pi}(s) &= R^{\pi}(s) + \gamma \sum_{s' \in S} T^{\pi}(s, s') V^{\pi}(s') \\
&= \sum_{a \in A} \pi(a \vert s) Q^{\pi}(s, a)
\end{aligned}
$$
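A minimal sketch of these two relations in the discrete case, under the same hypothetical array conventions as above:

```python
import numpy as np

def q_from_v(T, R, V, gamma):
    """Q^pi(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) V^pi(s')."""
    return R + gamma * np.einsum('ast,t->sa', T, V)

def v_from_q(policy, Q):
    """V^pi(s) = sum_a pi(a | s) Q^pi(s, a)."""
    return np.sum(policy * Q, axis=1)
```

At the fixed point, composing the two maps reproduces $Q^{\pi}$ and $V^{\pi}$, which is exactly the system of Bellman equations above.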