Delayed Q-Learning
- Inputs:
- γ, States, Actions
- Magic values m (the number of samples gathered before an update is attempted) and ε1 (the bonus/threshold used in each update)
- Data structures:
- Q(s,a) (initialized to 1/(1-γ))
- U(s,a) accumulates the update targets r + γ max_{a'} Q(s',a') since the last attempted update
- l(s,a) counts the number of targets added to U(s,a) since the last attempted update of Q(s,a)
- t(s,a) is the timestep of the last attempted update of Q(s,a)
- LEARN(s,a) is a flag indicating whether an update to Q(s,a) can be attempted (initially true)
- t* is the time of the most recent Q-value change
- Algorithm (a Python sketch of the full loop appears after the update rules below)
- For each timestep t
- The current state is s
- Select the greedy action a = argmax_{a'} Q(s,a')
- Take action a; observe the reward r and the next state s'
- if LEARN(s,a):
- U(s,a) = U(s,a) + r + γ max_{a'} Q(s',a')
- l(s,a) = l(s,a) + 1
- if l(s,a) = m then attempt a Q-update
- else if t(s,a) < t*, LEARN(s,a) = true
- Updating a Q-value
- if Q(s,a) - U(s,a)/m ≥ 2ε1:
- Q(s,a) = U(s,a)/m + ε1
- t* = current timestep
- else if t(s,a) ≥ t*, LEARN(s,a) = false
- Whether or not the update succeeded, reset t(s,a) to the current timestep
- Zero out U(s,a) and l(s,a)
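
Putting the pieces above together, here is a minimal Python sketch of the loop. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(s_next, r)`) and rewards in [0, 1] are assumptions for illustration, not part of the notes.

```python
from collections import defaultdict

def delayed_q_learning(env, states, actions, gamma, m, eps1, num_timesteps):
    """Sketch of Delayed Q-learning following the outline above.

    Assumes env.reset() -> s and env.step(a) -> (s_next, r) with rewards in
    [0, 1]; this interface is a placeholder, not part of the original notes.
    """
    q_max = 1.0 / (1.0 - gamma)
    Q = {(s, a): q_max for s in states for a in actions}   # optimistic initialization
    U = defaultdict(float)            # accumulated update targets since last attempt
    l = defaultdict(int)              # number of targets accumulated since last attempt
    t_sa = defaultdict(int)           # time of the last attempted update of (s, a)
    LEARN = defaultdict(lambda: True) # may (s, a) attempt an update?
    t_star = 0                        # time of the most recent successful Q change

    s = env.reset()
    for t in range(1, num_timesteps + 1):
        # act greedily with respect to the current (optimistic) Q
        a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r = env.step(a)

        if LEARN[(s, a)]:
            # accumulate the update target r + gamma * max_a' Q(s', a')
            U[(s, a)] += r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            l[(s, a)] += 1
            if l[(s, a)] == m:
                # attempt a Q-update
                if Q[(s, a)] - U[(s, a)] / m >= 2 * eps1:
                    Q[(s, a)] = U[(s, a)] / m + eps1   # successful update
                    t_star = t
                elif t_sa[(s, a)] >= t_star:
                    LEARN[(s, a)] = False              # stop trying until some Q changes
                # whether or not the update succeeded:
                t_sa[(s, a)] = t
                U[(s, a)] = 0.0
                l[(s, a)] = 0
        elif t_sa[(s, a)] < t_star:
            # some Q-value changed since the last attempt, so allow learning again
            LEARN[(s, a)] = True

        s = s_next
    return Q
```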
PAC Markov Decision Processes (PAC-MDP)
- Let ε and δ be in R+, and let M be a Markov Decision Process
- A PAC-MDP algorithm follows an ε-optimal policy on all but a polynomial number of timesteps (polynomial in |S|, |A|, 1/ε, 1/δ, and 1/(1-γ)), with probability at least 1 - δ
- Informally: except with probability δ, the algorithm acts "badly" (worse than ε-optimal) on at most a polynomial number of timesteps, as formalized below
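
The same statement written out, in the sample-complexity-of-exploration framing used in the reference below (A_t denotes the algorithm's policy at time t):

```latex
% With probability at least 1 - \delta, the number of timesteps on which the
% algorithm's current policy A_t is more than \varepsilon worse than optimal
% from the current state s_t is polynomially bounded:
\Pr\!\left[\;\bigl|\{\, t : V^{A_t}(s_t) < V^{*}(s_t) - \varepsilon \,\}\bigr|
   \;\le\; \operatorname{poly}\!\bigl(|S|,\, |A|,\, \tfrac{1}{\varepsilon},\,
   \tfrac{1}{\delta},\, \tfrac{1}{1-\gamma}\bigr)\;\right] \;\ge\; 1 - \delta .
```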
Delayed Q-learning is PAC-MDP
- Requires certain values for m and ε1
- The proof is complex, but here are the main ideas:
- Averaging m samples before each update means that, with high probability, each update that actually occurs is a "good idea" (it moves Q(s,a) toward the right value)
- The combination of optimistic initial Q-values and the fact that Q-values only decrease ensures that the algorithm keeps visiting every state and trying every action; this, in turn, ensures that every (state, action) pair receives enough visits for its Q-value to come down to the "right" value
- The LEARN flags, in combination with the ε1 threshold for updates, ensure that only a finite number of Q-value updates occur (a quick counting argument follows this list)
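
The "finite number of updates" point can be made concrete with a short counting argument, assuming rewards in [0, 1] as in the paper (the paper's exact bookkeeping differs in constants):

```latex
% A successful update requires Q(s,a) - U(s,a)/m \ge 2\varepsilon_1 and then sets
% Q(s,a) \leftarrow U(s,a)/m + \varepsilon_1, so each successful update lowers
% Q(s,a) by at least \varepsilon_1.  With rewards in [0, 1], Q(s,a) starts at
% 1/(1-\gamma) and never drops below 0, so each pair allows at most
\frac{1/(1-\gamma)}{\varepsilon_1} \;=\; \frac{1}{\varepsilon_1(1-\gamma)}
% successful updates, and the total over all pairs is at most
\frac{|S|\,|A|}{\varepsilon_1(1-\gamma)} .
```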
Calculating Example Sizes
- Select values for each of the following:
- The discount (γ)
- The error tolerance (ε)
- The probability of meeting the error tolerance (δ)
- Number of states and actions
- QPAC.py
- The computations are involved and best not done by hand (a rough sketch of the style of calculation appears after this list)
- Given the five values above, this will tell you:
- The number of timesteps t to guarantee PAC behavior
- The number of samples m gathered before each attempted Q-update
- The value ε1
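
As a rough illustration of where m comes from, here is a Hoeffding-style calculation. This is not QPAC.py's or the paper's exact formula: the choice of `union_count` and the constants are assumptions, and the paper's analysis uses its own union bound.

```python
import math

def rough_m(num_states, num_actions, gamma, eps1, delta, union_count=None):
    """Rough Hoeffding-style estimate of the batch size m (illustrative only).

    Each accumulated target r + gamma * max_a' Q(s', a') lies in [0, 1/(1-gamma)],
    so Hoeffding's inequality says the average of m such targets is within eps1
    of its mean with probability at least 1 - delta_per_attempt when
        m >= ln(2 / delta_per_attempt) / (2 * eps1**2 * (1 - gamma)**2).
    union_count (how many update attempts must all be accurate) is a crude
    placeholder; the paper's count and constants differ.
    """
    if union_count is None:
        union_count = num_states * num_actions
    delta_per_attempt = delta / union_count
    return math.ceil(math.log(2.0 / delta_per_attempt) /
                     (2.0 * eps1 ** 2 * (1.0 - gamma) ** 2))

# Example: gamma = 0.9, eps1 = 0.05, delta = 0.1, 10 states, 2 actions
print(rough_m(10, 2, gamma=0.9, eps1=0.05, delta=0.1))
```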
Reference
- Strehl, A. L.; Li, L.; Wiewiora, E.; Langford, J.; Littman, M. L. PAC Model-Free Reinforcement Learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, 2006.