Q-Learning 
-  Q(s, a): Value of action a taken in state s
-  Implicit policy
  
  -  Select a such that Q(s, a) is maximized
 
-  Temporal difference Q-Learning
  
  -  TD update: v(s) = (1 - α) * v(s) + α * (γ * v(s') + r(s))
  
  -  Q update: Q(s, a) = (1 - α) * Q(s, a) + α * (γ * max_a' Q(s', a') + r(s))
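
A minimal tabular sketch of the implicit policy and the Q update, assuming a small discrete set of states and actions. The names `greedy_action`, `q_update`, the dictionary-based table `Q`, and the example states, actions, and parameter values are all hypothetical and chosen only for illustration:

```python
from collections import defaultdict


def greedy_action(Q, state, actions):
    """Implicit policy: pick the action a that maximizes Q(state, a)."""
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q update in place and return the new Q(s, a).

    Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (gamma * max_a' Q(s', a') + r)
    """
    # Best value reachable from the next state under the greedy policy.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (gamma * best_next + r)
    return Q[(s, a)]


# Hypothetical example: unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 1.0

new_value = q_update(Q, "s0", "right", r=0.5, s_next="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
print(new_value)                          # roughly 0.7
print(greedy_action(Q, "s0", actions))    # right
```

With α = 0.5 the old estimate and the new target are weighted equally. A smaller α makes each update more conservative.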
  
 