Policy Iteration

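Start from an arbitrary policy, then repeat while the policy keeps improving:
- Run dynamic programming (policy evaluation) to calculate the value of each state under the current policy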
- Construct a new policy:
  - For each state
    - Pick the action that maximizes
      - Reward(s) + Discount * Value(s'), where s' is the state the action leads to (use the expected value if transitions are random)
    - This action will be the policy for that state
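
The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. The function names (`evaluate_policy`, `policy_iteration`), the dictionary layout `transitions[state][action] = {next_state: probability}`, and the two-state example are all assumptions made for this sketch. Rewards depend only on the state, matching the Reward(s) term in the outline.

```python
def evaluate_policy(policy, transitions, rewards, discount, tol=1e-10):
    """Dynamic programming step: iterate the value update until it converges."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            # Value of s when following the fixed policy from s onward.
            new_v = rewards[s] + discount * sum(
                p * values[s2] for s2, p in transitions[s][policy[s]].items()
            )
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values


def policy_iteration(transitions, rewards, discount):
    # Arbitrary starting policy: the first listed action in each state.
    policy = {s: next(iter(actions)) for s, actions in transitions.items()}
    while True:
        values = evaluate_policy(policy, transitions, rewards, discount)
        improved = False
        for s, actions in transitions.items():
            def backup(a):
                # Reward(s) + Discount * expected value of the next state.
                return rewards[s] + discount * sum(
                    p * values[s2] for s2, p in actions[a].items()
                )
            best = max(actions, key=backup)
            # Switch only on a strict improvement so ties cannot cause cycling.
            if backup(best) > backup(policy[s]) + 1e-9:
                policy[s] = best
                improved = True
        if not improved:
            return policy, values


# Hypothetical two-state example: only state B gives a reward.
transitions = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "back": {"A": 1.0}},
}
rewards = {"A": 0.0, "B": 1.0}

policy, values = policy_iteration(transitions, rewards, discount=0.9)
print(policy)  # {'A': 'go', 'B': 'stay'}
```

In this example the starting policy stays put in both states; one improvement pass switches A to "go", and the next pass finds nothing better, so the loop stops.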
Pros and cons?