In reinforcement learning, many different metrics have been used to compare the performance of algorithms and no fully standardized benchmarks exist. Therefore, we compare the results of Cacla to the results for CMA-ES and NAC from an earlier paper [Heidrich-Meisner and Igel, 2008], using the dynamics and the performance metric used therein. We choose this particular paper because it reports results for NAC and for CMA-ES, which is considered the current state-of-the-art in direct policy search and black-box optimization [Jiang et al., 2008,Gomez et al., 2008,Glasmachers et al., 2010,Hansen et al., 2010].
The dynamics of the double cart pole are as follows [Wieland, 1991]:
![]() |
||
![]() |
Because CMA-ES uses Monte Carlo roll-outs, the task was made explicitly episodic by resetting the environment every
s
[Heidrich-Meisner and Igel, 2008]. This is not required for Cacla, but was done anyway to make the comparison fair. The feature vector is
. All episodes start in
. The discount factor in the paper was
. This means that the state values are unbounded. Therefore, we use a discount factor of
. In practice, this makes little difference for the performance. Even though Cacla optimizes the discounted cumulative reward, we use the reward per episode as performance metric, which is explicitly optimized by CMA-ES and NAC.
CMA-ES and NAC were used to train a linear controller, so Cacla is also used to find a linear controller. We use a bias feature that is always equal to one, so we are looking for a parameter vector
.
A hard threshold is used, such that if the output of the actor is larger than
N
or smaller than
N
, the agent outputs
N
or
N
, respectively. The critic was implemented with a multi-layer perceptron with 40 hidden nodes an a tanh activation function for the hidden layer. The initial controller was initialized with uniformly random parameters between
and
. No attempt was made to optimize this initial range for the parameters or the number of hidden nodes.
|
The results by CMA-ES are shown in Figure 4. [Heidrich-Meisner and Igel, 2008] show that NAC performs far worse and therefore we do not show its performance. The performance of NAC is better if it is initialized close to the optimal policy, in which case the median performance of NAC reaches the optimal reward per episode of
after
to
episodes. However, this would of course assume a priori knowledge about the optimal solution. The best performance of CMA-ES is a median reward of approximately 850. As shown in the figure, for values of the parameter
other than
, the performance is worse.
We ran Cacla for just 500 episodes with fixed step sizes of
and Gaussian exploration with
. This latter value was coarsely tuned and the reason the exploration is so high is that Cacla effectively learns a bang-bang controller: the resulting actor outputs values far above
N
and far below
N
. The results are very robust to the exact setting of this exploration parameter.
We also ran Cacla with
-greedy exploration, in which a uniform random action is chosen with probability
. We used
and
, where the latter implies a fully random policy. The
-greedy versions of Cacla do not learn a bang-bang controller, because the targets for the actor are always within the admissible range. Note that the actor is only updated on average once every ten steps when
, because at the other steps the action is equal to its output.
| online | offline | ||||||
| action noise | exploration | mean | se | success | mean | se | success |
|---|---|---|---|---|---|---|---|
| 0 |
|
946.3 | 6.7 | 92.3 % | 954.2 | 6.3 | 94.5 % |
|
|
807.6 | 9.6 | 59.0 % | 875.2 | 9.4 | 84.5 % | |
|
|
29.2 | 0.0 | 0 % | 514.0 | 10.4 | 25.5 % | |
| [-20N,20N] |
|
944.6 | 6.9 | 92.5 % | 952.4 | 6.5 | 94.5 % |
|
|
841.0 | 8.7 | 60.7 % | 909.5 | 8.1 | 87.4 % | |
|
|
28.7 | 0.0 | 0 % | 454.7 | 9.5 | 11.3 % | |
| [-40N,40N] |
|
936.0 | 7.4 | 91.9 % | 944.9 | 7.0 | 93.8 % |
|
|
854.2 | 7.9 | 50.5 % | 932.6 | 6.7 | 86.7 % | |
|
|
27.6 | 0.0 | 0 % | 303.0 | 6.7 | 0 % | |
Table 2 shows the results of our experiments with Cacla. The mean reward per episode (with a maximum of 1000) and the success rate is shown, where the latter is the percentage of controllers that can balance the poles for at least
s
. The online column shows the average online results for episodes 401-500, including exploration. The offline column shows the average offline results, obtained by testing the actor without exploration after 500 episodes were concluded. In Figure 4 the median performances are shown, but the online and offline median performance of Cacla with
and
is already perfect at 1000. This can be seen from Table 2, since in those cases the success rate is over
. To be able to compare different exploration techniques for Cacla, we show the mean performances.
On average, the controllers found by Cacla after only 500 episodes are significantly better than those found by CMA-ES after 10,000 episodes. Even
results in quite reasonable greedy policies. Naturally, when
the online performance is poor, because the policy is fully random. But note that the greedy performance of
is much better than the performance of CMA-ES after 500 episodes.
To test robustness of Cacla, we reran the experiment with noise in the action execution. A uniform random force in the range
N
N
or
N
N
is added to the action before execution. The action noise is added after cropping the actor output to the admissible range and the algorithm is not informed of the amount of added noise. For example, assume the actor of Cacla outputs an action of
. Then Gaussian exploration is added, for instance resulting in
. This action is not in the admissible range
, so it is cropped to
. Then uniform noise, drawn from
or
, is added. Suppose the result is 60. Then, a force of
N
is applied to the cart. If the resulting temporal-difference is positive, the output of the actor for this state is updated towards
, so the algorithm is unaware of both the cropping and the uniform noise that were applied to its output.
The results including action noise are also shown in Table 2. The performance of Cacla is barely affected when Gaussian exploration is used. The slight drop in performance falls within the statistical margins of error, although it does seem consistent. Interestingly, the added noise even improves the online and offline performance of Cacla when
-greedy exploration with
is used. Apparently, the added noise results in desirable extra exploration.
This experiment indicates that the relatively simple Cacla algorithm is very effective at solving some continuous reinforcement-learning problems. Other previous work show that natural-gradient and evolutionary algorithms typically need a few thousand episodes to learn a good policy on the double pole, but also on the single pole task [Sehnke et al., 2010]. We do not show the results here, but Cacla also performs very well on the single-pole cart pole. Naturally, this does not imply that Cacla is the best choice for all continuous MDPs. For instance, in a partially observable MDPs an evolutionary approach to directly search in parameter space may find good controllers faster, although it is possible to use Cacla to train a recurrent neural network, for instance with real-time recurrent learning [Williams and Zipser, 1989] or backpropagation through time [Werbos, 2002]. Additionally, convergence to an optimal policy or even local optima for variants of Cacla is not (yet) guaranteed, while for some actor-critic [Bhatnagar et al., 2009] and direct policy-search algorithms convergence to a local optimum can be guaranteed.
The reason that Cacla performs much better than CMA-ES on this particular problem is that CMA-ES uses whole episodes to estimate the fitness of a candidate policy and stores a whole population of such policies. Cacla, on the other hand, makes use of the structure of the problem by using temporal-difference errors. This allows it to quickly update its actor, making learning possible even during the first few episodes. NAC has the additional disadvantage that quite a few samples are necessary to make its estimate of the Fisher information matrix accurate enough to find the natural-gradient direction. Finally, the improvements to the actor in Cacla are not slowed down by plateaus in the value space. As episodes become longer, the value space will typically exhibit such plateaus, making the gradient estimates used by NAC more unreliable and the updates smaller. Because Cacla operates directly in action space, it does not have this problem and it can move towards better actions with a fixed step size, whenever the temporal-difference is positive.
As a final note, the simple variant of Cacla will probably not perform very well in problems with specific types of noise. For instance, Cacla may be tempted to update towards actions that often yield fairly high returns but sometimes yield very low returns, making them a poor choice on average. This problem can be mitigated by storing an explicit approximation of the reward function, or by using averaged temporal-difference errors instead of the stochastic errors. These issues have not been investigated in depth.