Q-Learning
- Q(s, a): Value of action a taken in state s
- Implicit policy
- Select a such that Q(s, a) is maximized
- Temporal difference Q-Learning
- TD update: v(s) = (1 - α) * v(s) + α * (γ * v(s') + r(s))
- Q update: Q(s, a) = (1 - α) * Q(s, a) + α * (γ * max_a' Q(s', a') + r(s))
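
The implicit policy and the Q update can be sketched in a few lines of Python. This is a minimal tabular illustration, not a full agent: the names `greedy_action` and `q_update`, the dictionary-based table, the two example actions, and the default values α = 0.5 and γ = 0.9 are all assumptions made for the example. The reward r(s) is passed in as a plain number.

```python
from collections import defaultdict


def greedy_action(Q, state, actions):
    """Implicit policy: select the action a that maximizes Q(state, a)."""
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q update for the transition (s, a) -> s_next.

    Mirrors the rule above:
    Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (gamma * max_a' Q(s', a') + r(s))
    """
    # Best value reachable from the next state under the current estimates.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (gamma * best_next + reward)
    return Q[(s, a)]


# Hypothetical toy setup: unseen (state, action) pairs start at 0.0.
Q = defaultdict(float)
actions = ["left", "right"]

# Taking "right" in state 0 yields reward 1.0 and leads to state 1.
q_update(Q, s=0, a="right", reward=1.0, s_next=1, actions=actions)
```

With all estimates starting at zero, this single update moves Q(0, right) halfway toward the observed reward, so the greedy choice in state 0 becomes "right". A later update for a state that leads back to state 0 would then pick up that value through the max term, discounted by γ.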