Q-Learning: The Complete Algorithm
- Compute the reward r(s)
- Select an action
- Randomized Exploration (ε-greedy): with probability ε select a random action; otherwise select the action with the highest Q-value
- Counting Exploration: if any action in s has been tried fewer times than a threshold, select the least-tried such action; otherwise select the best action
- Apply the selected action a in the current state s
- Compute the new state s'
- Update the Q-value:
- Q(s, a) = (1 - α) * Q(s, a) + α * (γ * max_a' Q(s', a') + r(s))
- Gradually decrease α and ε over time
- Let s = s'