Delayed Q-Learning
- Inputs:
- γ, States, Actions
- Magic values m (the number of samples gathered before an update is attempted) and ε1 (the bonus/threshold used in each update)
- Data structures:
- Q(s,a) (initialized to 1/(1-γ))
- U(s,a) accumulates the update targets r + γ max_{a'} Q(s',a') since the last attempted update
- l(s,a) counts the number of targets added to U(s,a) since the last attempted update of Q(s,a)
- t(s,a) is the timestep of the last attempted update of Q(s,a)
- LEARN(s,a) is a flag indicating whether an update to Q(s,a) can be attempted (initially true)
- t* is the time of the most recent Q-value change
- Algorithm (a Python sketch of the full loop appears after the update rules below)
- For each timestep t
- The current state is s
- Select the greedy action a = argmax_{a'} Q(s,a')
- Take action a; observe the reward r and the next state s'
- if LEARN(s,a):
- U(s,a) = U(s,a) + r + γ max_{a'} Q(s',a')
- l(s,a) = l(s,a) + 1
- if l(s,a) = m then attempt a Q-update
- else if t(s,a) < t*, LEARN(s,a) = true
- Updating a Q-value
- if Q(s,a) - U(s,a)/m ≥ 2ε1:
- Q(s,a) = U(s,a)/m + ε1
- t* = current timestep
- else if t(s,a) ≥ t*, LEARN(s,a) = false
- Whether or not the update succeeded, reset t(s,a) to the current timestep
- Zero out U(s,a) and l(s,a)
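
Putting the pieces above together, here is a minimal Python sketch of the loop. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(s_next, r)`) and rewards in [0, 1] are assumptions for illustration, not part of the notes.

```python
from collections import defaultdict

def delayed_q_learning(env, states, actions, gamma, m, eps1, num_timesteps):
    """Sketch of Delayed Q-learning following the outline above.

    Assumes env.reset() -> s and env.step(a) -> (s_next, r) with rewards in
    [0, 1]; this interface is a placeholder, not part of the original notes.
    """
    q_max = 1.0 / (1.0 - gamma)
    Q = {(s, a): q_max for s in states for a in actions}   # optimistic initialization
    U = defaultdict(float)            # accumulated update targets since last attempt
    l = defaultdict(int)              # number of targets accumulated since last attempt
    t_sa = defaultdict(int)           # time of the last attempted update of (s, a)
    LEARN = defaultdict(lambda: True) # may (s, a) attempt an update?
    t_star = 0                        # time of the most recent successful Q change

    s = env.reset()
    for t in range(1, num_timesteps + 1):
        # act greedily with respect to the current (optimistic) Q
        a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r = env.step(a)

        if LEARN[(s, a)]:
            # accumulate the update target r + gamma * max_a' Q(s', a')
            U[(s, a)] += r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            l[(s, a)] += 1
            if l[(s, a)] == m:
                # attempt a Q-update
                if Q[(s, a)] - U[(s, a)] / m >= 2 * eps1:
                    Q[(s, a)] = U[(s, a)] / m + eps1   # successful update
                    t_star = t
                elif t_sa[(s, a)] >= t_star:
                    LEARN[(s, a)] = False              # stop trying until some Q changes
                # whether or not the update succeeded:
                t_sa[(s, a)] = t
                U[(s, a)] = 0.0
                l[(s, a)] = 0
        elif t_sa[(s, a)] < t_star:
            # some Q-value changed since the last attempt, so allow learning again
            LEARN[(s, a)] = True

        s = s_next
    return Q
```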
PAC Markov Decision Processes (PAC-MDP)
- Let ε and δ be in R+, and let M be a Markov Decision Process
- A PAC-MDP algorithm follows an ε-optimal policy on all but a polynomial number of timesteps (polynomial in |S|, |A|, 1/ε, 1/δ, and 1/(1-γ)), with probability at least 1 - δ
- Informally: except with probability δ, the algorithm acts "badly" (worse than ε-optimal) on at most a polynomial number of timesteps, as formalized below
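
The same statement written out, in the sample-complexity-of-exploration framing used in the reference below (A_t denotes the algorithm's policy at time t):

```latex
% With probability at least 1 - \delta, the number of timesteps on which the
% algorithm's current policy A_t is more than \varepsilon worse than optimal
% from the current state s_t is polynomially bounded:
\Pr\!\left[\;\bigl|\{\, t : V^{A_t}(s_t) < V^{*}(s_t) - \varepsilon \,\}\bigr|
   \;\le\; \operatorname{poly}\!\bigl(|S|,\, |A|,\, \tfrac{1}{\varepsilon},\,
   \tfrac{1}{\delta},\, \tfrac{1}{1-\gamma}\bigr)\;\right] \;\ge\; 1 - \delta .
```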
Delayed Q-learning is PAC-MDP
- Requires certain values for m and ε1
- The proof is complex, but here are the main ideas:
- Averaging m samples before each update means that, with high probability, each update that actually occurs is a "good idea" (it moves Q(s,a) toward the right value)
- The combination of optimistic initial Q-values and the fact that Q-values only decrease ensures that the algorithm keeps visiting every state and trying every action; this, in turn, ensures that every (state, action) pair receives enough visits for its Q-value to come down to the "right" value
- The LEARN flags, in combination with the ε1 threshold for updates, ensure that only a finite number of Q-value updates occur (a quick counting argument follows this list)
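
The "finite number of updates" point can be made concrete with a short counting argument, assuming rewards in [0, 1] as in the paper (the paper's exact bookkeeping differs in constants):

```latex
% A successful update requires Q(s,a) - U(s,a)/m \ge 2\varepsilon_1 and then sets
% Q(s,a) \leftarrow U(s,a)/m + \varepsilon_1, so each successful update lowers
% Q(s,a) by at least \varepsilon_1.  With rewards in [0, 1], Q(s,a) starts at
% 1/(1-\gamma) and never drops below 0, so each pair allows at most
\frac{1/(1-\gamma)}{\varepsilon_1} \;=\; \frac{1}{\varepsilon_1(1-\gamma)}
% successful updates, and the total over all pairs is at most
\frac{|S|\,|A|}{\varepsilon_1(1-\gamma)} .
```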
Calculating Example Sizes
- Select values for each of the following:
- The discount (γ)
- The error tolerance (ε)
- The probability of meeting the error tolerance (δ)
- Number of states and actions
- QPAC.py
- The computations are involved and best not done by hand (a rough sketch of the style of calculation appears after this list)
- Given the five values above, this will tell you:
- The number of timesteps t to guarantee PAC behavior
- The number of samples m gathered before each attempted Q-update
- The value ε1
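
As a rough illustration of where m comes from, here is a Hoeffding-style calculation. This is not QPAC.py's or the paper's exact formula: the choice of `union_count` and the constants are assumptions, and the paper's analysis uses its own union bound.

```python
import math

def rough_m(num_states, num_actions, gamma, eps1, delta, union_count=None):
    """Rough Hoeffding-style estimate of the batch size m (illustrative only).

    Each accumulated target r + gamma * max_a' Q(s', a') lies in [0, 1/(1-gamma)],
    so Hoeffding's inequality says the average of m such targets is within eps1
    of its mean with probability at least 1 - delta_per_attempt when
        m >= ln(2 / delta_per_attempt) / (2 * eps1**2 * (1 - gamma)**2).
    union_count (how many update attempts must all be accurate) is a crude
    placeholder; the paper's count and constants differ.
    """
    if union_count is None:
        union_count = num_states * num_actions
    delta_per_attempt = delta / union_count
    return math.ceil(math.log(2.0 / delta_per_attempt) /
                     (2.0 * eps1 ** 2 * (1.0 - gamma) ** 2))

# Example: gamma = 0.9, eps1 = 0.05, delta = 0.1, 10 states, 2 actions
print(rough_m(10, 2, gamma=0.9, eps1=0.05, delta=0.1))
```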
Reference
- Strehl, A. L.; Li, L.; Wiewiora, E.; Langford, J.; Littman, M. L. PAC Model-Free Reinforcement Learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, 2006.