Exploration vs. Exploitation
- Exploitation
  - From a given state s, select the action a that maximizes Q(s,a)
- Exploration
  - For Q-learning to estimate Q(s,a) accurately for each state/action pair, it must visit every state/action pair many times
  - Consequently, we cannot exploit our current knowledge on every move; some moves must try actions that do not currently look best
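
One common way to mix the two behaviors is an epsilon-greedy rule. The source does not name a specific strategy, so this is only an illustrative sketch. The function name `epsilon_greedy`, the `epsilon` parameter, and the list-of-values layout for Q(s,a) are assumptions made for this example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index for one state.

    q_values: list where q_values[a] is the current estimate of Q(s, a)
              (hypothetical layout chosen for this sketch).
    epsilon:  probability of exploring instead of exploiting.
    rng:      source of randomness, the random module or a random.Random.
    """
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Usage: with epsilon = 0.1, roughly one move in ten is exploratory.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```

With `epsilon = 0` the rule always exploits, and with `epsilon = 1` it always explores; values in between trade off the two.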