Dynamic Programming
- Improved update:
- For each state s:
- Value(s) = Discount * Value(Policy(s)) + Reward(s)
- Low discount (factor near 0)
- Instant gratification
- Fast but inaccurate convergence
- High discount (factor near 1)
- Deferred gratification
- Slow but accurate convergence
- Solves the problem of temporal credit assignment
- Early iterations: Future has little to no effect
- Later iterations: Future rewards slowly work backwards
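The update above can be sketched on a toy problem to watch rewards work backwards. This is a minimal illustration, not a definitive implementation. It assumes a deterministic world in which Policy(s) names the next state, and it updates every state from the previous sweep's values. The four-state chain, the reward numbers, the tolerance, and the names `sweep` and `run` are all hypothetical, chosen only for the example.

```python
def sweep(value, policy, reward, discount):
    """One pass of Value(s) = Discount * Value(Policy(s)) + Reward(s) over all states."""
    # Build a new dict so every state reads the previous sweep's values.
    return {s: discount * value[policy[s]] + reward[s] for s in value}


def run(policy, reward, discount, tol=1e-6, max_sweeps=10_000):
    """Sweep until no value changes by more than tol; return (values, sweeps used)."""
    value = {s: 0.0 for s in policy}
    for n in range(1, max_sweeps + 1):
        new_value = sweep(value, policy, reward, discount)
        change = max(abs(new_value[s] - value[s]) for s in value)
        value = new_value
        if change < tol:
            return value, n
    return value, max_sweeps


# Hypothetical chain 0 -> 1 -> 2 -> 3, where state 3 loops on itself
# and is the only state that pays a reward.
policy = {0: 1, 1: 2, 2: 3, 3: 3}
reward = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}

# Early sweeps: the reward at state 3 reaches one more state per sweep.
value = {s: 0.0 for s in policy}
for n in range(1, 5):
    value = sweep(value, policy, reward, discount=0.5)
    print(n, value)

# Compare how many sweeps a low and a high discount need to settle.
for discount in (0.5, 0.9):
    values, sweeps = run(policy, reward, discount)
    print(discount, sweeps, values)
```

In the first loop, state 0 stays at zero until the fourth sweep, which is the backward propagation described above. In the second loop, the lower discount should settle in fewer sweeps, while the higher discount takes longer and assigns more weight to the distant reward.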