Dynamic Programming

For each state:
- Calculate value of executing policy for one time step
Given the total reward after n time steps, compute the total reward for n + 1 time steps
Naive update:
- For each state s:
  - Value(s) = Value(Policy(s)) + Reward(s)
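
As a rough illustration, the naive update above could be sketched in Python as follows. This is only a sketch under assumptions not stated in the notes: the world is deterministic, Policy(s) is read as the state reached from s by following the policy, and the three-state chain, its rewards, and the name naive_update are all hypothetical.

```python
def naive_update(value, policy, reward):
    """Apply one in-place sweep of the naive update to every state."""
    for s in value:
        # Assumption: policy[s] names the state reached from s,
        # so value[policy[s]] plays the role of Value(Policy(s)).
        value[s] = value[policy[s]] + reward[s]
    return value


# Hypothetical 3-state chain: A -> B -> C, and C loops back to itself.
policy = {"A": "B", "B": "C", "C": "C"}
reward = {"A": 0.0, "B": 0.0, "C": 1.0}
value = {s: 0.0 for s in policy}  # start with all values at zero

naive_update(value, policy, reward)
print(value)  # {'A': 0.0, 'B': 0.0, 'C': 1.0}
```

Each call is one sweep over the states; calling naive_update again applies the same rule to the values left by the previous sweep.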
What could possibly go wrong?