Q-Learning: The Complete Algorithm
- Compute the reward r(s)
- Select an action
- Randomized Exploration (ε-greedy): with probability ε select a random action; otherwise select the action with the highest Q-value
- Counting Exploration: if any action in s has been tried fewer times than a threshold, select the least-tried such action; otherwise select the best action
- Apply the selected action a in the current state s
- Compute the new state s'
- Update the Q-value:
- Q(s, a) = (1 - α) * Q(s, a) + α * (γ * max_a' Q(s', a') + r(s))
- Gradually decrease α and ε over time
- Let s = s'