Acla is an algorithm that uses state values to update state-dependent action values. The update to an action value is determined by the update made to the value of the state in which the action was taken. If that update increased the state value, the action is considered a good one, and the update is:
If the state value decreased, the action is considered a poor one, and the update is:
Note that both updates preserve the property that the action values lie between 0 and 1, provided they are initialised in this interval.
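As an illustration only, one simple pair of updates consistent with the description above can be written as follows. The symbols are assumptions introduced here: P(s, a) is the action value, β is a learning rate with 0 ≤ β ≤ 1, and δ_t is the change applied to the state value (for example a temporal-difference error). The published formulas may differ, for instance by also adjusting the values of the actions that were not taken.

```latex
\begin{aligned}
P(s_t, a_t) &\leftarrow P(s_t, a_t) + \beta \bigl(1 - P(s_t, a_t)\bigr) && \text{if } \delta_t > 0, \\
P(s_t, a_t) &\leftarrow P(s_t, a_t) + \beta \bigl(0 - P(s_t, a_t)\bigr) && \text{otherwise.}
\end{aligned}
```

Under these assumptions each update is a convex combination of the old value with either 1 or 0, which is why a value that starts in the interval [0, 1] stays in it.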
Neutral characteristics
It is on-policy.
Learns preference values that carry no explicit information about the expected discounted rewards.
Advantages
Using state values often speeds up learning.
State values are easily extendable to eligibility traces.
Has been shown to outperform several other algorithms on some problems.
Disadvantages
Converges to a solution that optimizes the probability of an increase in state value, rather than the expected reward. In stochastic problems these solutions may differ, although this is not typically the case.
Cannot handle continuous action spaces.
Algorithm
The Acla algorithm in schematic form:
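A minimal tabular sketch of such a loop is given below, under several assumptions that do not come from the source: the state values are learned with a one-step temporal-difference update, only the chosen action's value is moved toward 1 or 0, actions are selected epsilon-greedily over the action values, and the environment is a hypothetical five-state chain. All function and parameter names (`acla_update`, `select_action`, `chain_step`, `run_acla`, `alpha`, `beta`, `gamma`, `epsilon`) are illustrative.

```python
import random


def acla_update(prefs, action, delta, beta):
    """Move the chosen action's value toward 1 if the state value rose, else toward 0.

    Assumption: only the chosen action is updated; other variants may also
    adjust the values of the actions that were not taken.
    """
    target = 1.0 if delta > 0 else 0.0
    prefs[action] += beta * (target - prefs[action])


def select_action(prefs, epsilon, rng):
    """Epsilon-greedy selection over action values (an assumed exploration scheme)."""
    if rng.random() < epsilon:
        return rng.randrange(len(prefs))
    best = max(prefs)
    return rng.choice([a for a, p in enumerate(prefs) if p == best])


def chain_step(state, action, n_states):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left.

    Reaching the right-most state gives reward 1 and ends the episode.
    """
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, (1.0 if done else 0.0), done


def run_acla(n_states=5, episodes=200, alpha=0.1, beta=0.1,
             gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    values = [0.0] * n_states                      # state values (critic)
    prefs = [[0.5, 0.5] for _ in range(n_states)]  # action values in [0, 1] (actor)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = select_action(prefs[state], epsilon, rng)
            nxt, reward, done = chain_step(state, action, n_states)
            # Change in the state value: a one-step temporal-difference error.
            bootstrap = 0.0 if done else gamma * values[nxt]
            delta = reward + bootstrap - values[state]
            values[state] += alpha * delta
            # The sign of that change decides the direction of the action update.
            acla_update(prefs[state], action, delta, beta)
            state = nxt
    return values, prefs


if __name__ == "__main__":
    values, prefs = run_acla()
    for s, (v, p) in enumerate(zip(values, prefs)):
        print(s, round(v, 3), [round(x, 3) for x in p])
```

The actor only sees the sign of the state-value change, not its size, which matches the remark that the learned action values hold no explicit information on expected discounted rewards.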
In the publication below, it is shown that, in at least some cases, Acla performs considerably better than similar algorithms such as Q-learning and Sarsa. The only algorithm shown to perform better on the selected task is the Cacla algorithm.