Dynamic Programming

Improved update:
- For each state s:
  - Value(s) = Discount * Value(Policy(s)) + Reward(s), where Policy(s) is the next state the policy leads to from s
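A minimal sketch of this update in Python, assuming a deterministic policy stored as a state-to-next-state mapping. The three-state chain, the name `dp_sweep`, and the choice of 100 sweeps are illustrative assumptions, not part of the original notes.

```python
def dp_sweep(value, policy, reward, discount):
    """One synchronous sweep of Value(s) = Discount * Value(Policy(s)) + Reward(s)."""
    # Every new value is computed from the previous sweep's values.
    return {s: discount * value[policy[s]] + reward[s] for s in value}


# Hypothetical chain: A -> B -> C, and C loops back to itself.
policy = {"A": "B", "B": "C", "C": "C"}
reward = {"A": 0.0, "B": 0.0, "C": 1.0}
value = {s: 0.0 for s in policy}

# Repeat the sweep until the values settle (100 sweeps is ample at discount 0.5).
for _ in range(100):
    value = dp_sweep(value, policy, reward, discount=0.5)
```

With a discount of 0.5 the values settle at Value(C) = 1 / (1 - 0.5) = 2, Value(B) = 0.5 * 2 = 1, and Value(A) = 0.5 * 1 = 0.5.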
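Low discount (factor near 0)
- Instant gratification: immediate rewards dominate
- Fast convergence, but values ignore distant rewards
High discount (factor near 1)
- Deferred gratification: future rewards count almost as much as immediate ones
- Slow convergence, but values reflect long-term reward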
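The trade-off can be checked with a small experiment. The sketch below counts sweeps until the largest value change drops below a tolerance, for a discount of 0.1 and of 0.9. The chain, the function name `sweeps_to_converge`, and the tolerance are assumptions made for illustration.

```python
def sweeps_to_converge(discount, tol=1e-6, max_sweeps=10_000):
    """Run Value(s) = Discount * Value(Policy(s)) + Reward(s) until values stop changing."""
    # Hypothetical chain: A -> B -> C, and C loops back to itself.
    policy = {"A": "B", "B": "C", "C": "C"}
    reward = {"A": 0.0, "B": 0.0, "C": 1.0}
    value = {s: 0.0 for s in policy}
    for sweep in range(1, max_sweeps + 1):
        new_value = {s: discount * value[policy[s]] + reward[s] for s in policy}
        change = max(abs(new_value[s] - value[s]) for s in policy)
        value = new_value
        if change < tol:
            return sweep, value
    return max_sweeps, value


low_sweeps, low_value = sweeps_to_converge(0.1)
high_sweeps, high_value = sweeps_to_converge(0.9)
print("discount 0.1:", low_sweeps, "sweeps, Value(A) =", round(low_value["A"], 4))
print("discount 0.9:", high_sweeps, "sweeps, Value(A) =", round(high_value["A"], 4))
```

In this setup the low discount finishes in far fewer sweeps, but Value(A) barely registers the reward two steps away. The high discount needs many more sweeps and gives A a value that reflects that reward.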
Solves the problem of temporal credit assignment
- Early iterations: Future rewards have little to no effect on a state's value
- Later iterations: Future rewards slowly work backwards to the states that led to them
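To see the reward working backwards, the sketch below prints the values after each sweep on a hypothetical five-state chain whose only reward sits at the far end. The state names and the discount of 0.9 are assumptions chosen for illustration.

```python
# Hypothetical chain s0 -> s1 -> s2 -> s3 -> goal, where goal loops back to itself.
states = ["s0", "s1", "s2", "s3", "goal"]
policy = {"s0": "s1", "s1": "s2", "s2": "s3", "s3": "goal", "goal": "goal"}
reward = {s: 0.0 for s in states}
reward["goal"] = 1.0  # the only reward is at the end of the chain
discount = 0.9

value = {s: 0.0 for s in states}
history = []  # values after each sweep, in the order of `states`
for sweep in range(5):
    # Synchronous update: each new value uses the previous sweep's values.
    value = {s: discount * value[policy[s]] + reward[s] for s in states}
    history.append([value[s] for s in states])
    print(sweep + 1, [round(value[s], 3) for s in states])
```

After sweep k, only the k states nearest the reward have a nonzero value, so s0 first learns about the reward on the fifth sweep.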