Q-Learning
- Q(s, a): Value of action a taken in state s
- Implicit policy
- Select a such that Q(s, a) is maximized
- Temporal difference Q-Learning
- TD update: v(s) = (1 - α) * v(s) + α * (γ * v(s') + r(s))
- Q update: Q(s, a) = (1 - α) * Q(s, a) + α * (γ * max_a' Q(s', a') + r(s))
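
The implicit policy and the Q update above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The function names (greedy_action, q_update), the dictionary-based Q table, the state and action labels, and the values of α and γ are all assumptions made for the example. The reward r(s) is passed in directly as the number r.

```python
from collections import defaultdict


def greedy_action(Q, s, actions):
    """Implicit policy: select the action a that maximizes Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q update for the transition (s, a) -> s_next with reward r.

    Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (gamma * max_a' Q(s', a') + r)
    """
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (gamma * best_next + r)
    return Q[(s, a)]


# Hypothetical two-state example; unseen (state, action) pairs default to 0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0  # assumed prior estimate for the next state

# 0.5 * 0 + 0.5 * (0.9 * 2.0 + 1.0) = 1.4
new_value = q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions)
best = greedy_action(Q, "s0", actions)
```

After this single update, the estimate for ("s0", "right") moves halfway from 0 toward the target 2.8, and the greedy policy in s0 prefers "right".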