Policy Iteration
- While there is improvement (the policy still changes)
  - Call dynamic programming to calculate the value of each state under the current policy
  - Construct a new policy:
    - For each state s
      - Pick the action that maximizes
        - Reward(s) + Discount * Value(state the action leads to)
      - This action will be the policy for that state
- Pros and cons?
(next)
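
The loop above can be sketched in Python. This is a minimal illustration, not a reference implementation, and it rests on assumptions the outline does not state: transitions are deterministic, the reward depends only on the current state, and policy values are computed with a fixed number of dynamic programming sweeps. The names `evaluate_policy`, `policy_iteration`, `next_state`, and `reward`, as well as the three-state chain, are hypothetical and chosen only for this example.

```python
def evaluate_policy(policy, next_state, reward, discount, sweeps=500):
    """Dynamic programming step: repeatedly apply
    V(s) = Reward(s) + Discount * V(state reached by the policy's action)."""
    values = {s: 0.0 for s in reward}
    for _ in range(sweeps):
        # Build the new table from the old one (synchronous update).
        values = {
            s: reward[s] + discount * values[next_state[s][policy[s]]]
            for s in reward
        }
    return values


def policy_iteration(next_state, reward, discount=0.9):
    """Alternate evaluation and improvement until the policy stops changing."""
    # Arbitrary starting policy: the alphabetically first action in each state.
    policy = {s: sorted(actions)[0] for s, actions in next_state.items()}
    while True:
        values = evaluate_policy(policy, next_state, reward, discount)
        new_policy = {}
        for s, actions in next_state.items():
            # Start from the current action and switch only on a strict
            # improvement, so ties cannot make the loop oscillate.
            best = policy[s]
            best_score = reward[s] + discount * values[actions[best]]
            for action, target in actions.items():
                score = reward[s] + discount * values[target]
                if score > best_score + 1e-9:
                    best, best_score = action, score
            new_policy[s] = best
        if new_policy == policy:  # no improvement: stop
            return policy, values
        policy = new_policy


# Hypothetical chain A - B - C where only state C gives a reward.
next_state = {
    "A": {"left": "A", "right": "B"},
    "B": {"left": "A", "right": "C"},
    "C": {"left": "B", "right": "C"},
}
reward = {"A": 0.0, "B": 0.0, "C": 1.0}

policy, values = policy_iteration(next_state, reward, discount=0.9)
print(policy)
print(values)
```

On this chain the loop settles on moving right in every state, since that is the only way to reach and stay in the rewarding state C. With random transitions, the value of the successor state would be replaced by an expectation over possible next states.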