Learning State Values

Assume perfect state-action model
Assume fixed policy
Given a policy and a state, select an action
- Action should maximize "value"
  - This is how a policy is assessed
- How do we determine this? The Reward Function!
Resuming our example
- Reward for not bumping: 1
- Reward for bumping: 0