This section discusses algorithms that store a value for each state, in addition to a value for each action that helps determine which action to choose. In that sense, all the algorithms in this section could be called Actor-Critic methods, although we reserve that name for the version presented under that name in the book by Sutton and Barto. Note that here a state-action value need not correspond to the expected future rewards, except that the action with the highest expected reward should normally also receive the highest state-action value.
All algorithms in this section update the state values with the following rule:
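Assuming this is the standard one-step temporal-difference (TD(0)) update, it can be written as follows. The symbols are assumptions rather than notation taken from this page: V is the state value function, s_t and s_{t+1} are the current and next state, r_t is the reward observed on that transition, \alpha is a learning rate and \gamma is a discount factor.

```latex
V(s_t) \leftarrow V(s_t) + \alpha \bigl( r_t + \gamma V(s_{t+1}) - V(s_t) \bigr)
```

The term in parentheses is the temporal-difference error: the gap between the current estimate V(s_t) and the one-step target r_t + \gamma V(s_{t+1}).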
Note that this equation is easily extended to eligibility traces. State values will in general become reliable more quickly than Q values, because the state space is smaller than the combined state-action space. This can be a big advantage, especially in problems with many actions.
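As an illustration of the extension to eligibility traces, here is a minimal tabular sketch. It assumes the state-value update is the standard temporal-difference rule with accumulating traces; the function name td_lambda_update and the parameters alpha, gamma and lam are hypothetical choices for this example, not names used by the algorithms in this section.

```python
from collections import defaultdict


def td_lambda_update(V, traces, s, r, s_next, done,
                     alpha=0.1, gamma=0.9, lam=0.8):
    """One tabular TD(lambda) step with accumulating traces (illustrative sketch).

    V      : mapping from state to value estimate (e.g. defaultdict(float))
    traces : dict from state to eligibility trace, modified in place
    Returns the temporal-difference error of this transition.
    """
    # One-step target; a terminal transition has no bootstrap term.
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]

    # Decay every existing trace, then mark the state that was just visited.
    for state in traces:
        traces[state] *= gamma * lam
    traces[s] = traces.get(s, 0.0) + 1.0

    # Move every recently visited state toward the target, in proportion
    # to its trace. With lam=0 only the current state is updated.
    for state, e in traces.items():
        V[state] += alpha * delta * e
    return delta


if __name__ == "__main__":
    # Two-step episode: state 0 -> state 1 -> terminal, with reward 1 at the end.
    V = defaultdict(float)
    traces = {}
    td_lambda_update(V, traces, s=0, r=0.0, s_next=1, done=False)
    td_lambda_update(V, traces, s=1, r=1.0, s_next=None, done=True)
    print(dict(V))
```

In this small episode the final reward also raises the value of state 0, through its decayed trace (0.9 * 0.8 = 0.72), whereas a plain one-step update would change only state 1. Note that the table is indexed by state alone, so it has one entry per state rather than one per state-action pair.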
The algorithms in this section:
Cacla also uses state values, but can be found in the section containing the continuous action algorithms.