QV-learning is a natural extension of Q-learning and Sarsa to the case where state values are learned as well. Its update equations are:
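Written out in LaTeX, the update rules take the following form, as commonly stated for QV-learning without eligibility traces. The symbols are notation assumed here: \alpha and \beta are the learning rates for the Q values and the state values, \gamma is the discount factor, and r_t is the reward received after taking action a_t in state s_t.

```latex
\begin{align}
V(s_t) &\leftarrow V(s_t) + \beta \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right] \\
Q(s_t, a_t) &\leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - Q(s_t, a_t) \right]
\end{align}
```

Both updates share the same target, r_t + \gamma V(s_{t+1}); only the quantity being moved toward it differs.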
Neutral characteristics
It is on-policy.
It learns state values (V values) alongside state-action values (Q values).
Advantages
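Bootstrapping from state values can reduce the variance of the update target compared to Q-learning and Sarsa, since a state value is updated on every visit to the state, whichever action is taken.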
Using state values often speeds up learning.
State values are easily extendable to eligibility traces.
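The point about eligibility traces can be illustrated with a minimal sketch of a TD(lambda) update for the state values, assuming accumulating traces and tabular storage. The function name, the parameter names, and the dictionary layout are hypothetical choices made for this example, and the sketch covers only the state-value part, not the Q-value update.

```python
from collections import defaultdict


def td_lambda_value_update(V, traces, s, r, s_next, done, beta, gamma, lam):
    """One TD(lambda) update of the state values with accumulating traces."""
    # TD error of the state value; terminal transitions do not bootstrap.
    delta = (r if done else r + gamma * V[s_next]) - V[s]
    traces[s] += 1.0  # mark the current state as eligible
    for state in list(traces):
        # Every recently visited state shares in the TD error.
        V[state] += beta * delta * traces[state]
        traces[state] *= gamma * lam  # decay all traces
    return delta
```

Because the traces are indexed by state only, no special handling of exploratory actions is needed in this part of the update.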
Disadvantages
Cannot handle continuous action spaces.
Algorithm
The QV-learning algorithm in schematic form:
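A minimal tabular sketch of these steps in Python, assuming epsilon-greedy exploration with random tie-breaking and separate learning rates alpha (Q values) and beta (state values). All function names, default parameters, and the small chain environment are hypothetical and serve only as illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(state, a)] for a in actions)
    # Break ties randomly so that untrained states are explored evenly.
    return rng.choice([a for a in actions if Q[(state, a)] == best])


def qv_update(Q, V, s, a, r, s_next, done, alpha, beta, gamma):
    """Apply one QV-learning update for the transition (s, a, r, s_next)."""
    # Both tables move toward the same target, built from V(s_next).
    target = r if done else r + gamma * V[s_next]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    V[s] += beta * (target - V[s])


def run_qv_learning(env_step, start_state, actions, episodes=500,
                    alpha=0.1, beta=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Run tabular QV-learning; env_step(s, a) returns (s_next, r, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # state-action values, default 0
    V = defaultdict(float)  # state values, default 0
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon, rng)
            s_next, r, done = env_step(s, a)
            qv_update(Q, V, s, a, r, s_next, done, alpha, beta, gamma)
            s = s_next
    return Q, V


def chain_step(state, action):
    """Hypothetical 5-state chain: action 1 moves right, action 0 moves left."""
    nxt = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    done = nxt == 4  # reaching the right end terminates with reward 1
    return nxt, (1.0 if done else 0.0), done
```

A call such as `Q, V = run_qv_learning(chain_step, 0, [0, 1])` trains on the toy chain. Actions are still selected from the Q values; the state values are used only to form the update target.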
Comparing this algorithm to Q-learning, we see that the value of the next state, V(s'), takes the place of the value of the highest-valued action in that state, max_a Q(s', a), in the update target. Among other things, this makes the algorithm on-policy and makes it easier to use eligibility traces.
Selected relevant publications:
M.A. Wiering. QV(lambda)-learning: A New On-policy Reinforcement Learning Algorithm. Proceedings of the 7th European Workshop on Reinforcement Learning, D. Leone (editor), pages 17-18, 2005.
M.A. Wiering and H. van Hasselt. Ensemble Algorithms in Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B, volume 38, number 4, pages 930-936, 2008.
M.A. Wiering and H. van Hasselt. The QV Family Compared to Other Reinforcement Learning Algorithms. Proceedings of IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), Nashville, USA, 2009.