
Gradient-free methods are useful when the function being optimized is not differentiable, or when it is expected to have many local optima. Many general methods for global optimization exist, including evolutionary algorithms [Holland, 1962; Rechenberg, 1971; Holland, 1975; Schwefel, 1977; Davis, 1991; Bäck and Schwefel, 1993], simulated annealing [Kirkpatrick, 1984], particle swarm optimization [Kennedy and Eberhart, 1995] and cross-entropy optimization [Rubinstein, 1999; Rubinstein and Kroese, 2004]. Most of these methods share some common features, which we outline below. We focus on cross-entropy optimization and a subset of evolutionary algorithms, but the other approaches can be used quite similarly. For introductions to evolutionary algorithms, see the books by [Bäck, 1996] and [Eiben and Smith, 2003].

All the methods described here use a population of solutions. Traditional evolutionary algorithms create a population of solutions and adapt this population by selecting some solutions, recombining these and possibly mutating the result. The newly obtained solutions then replace some or all of the solutions in the old population. The selection procedure typically takes into account the fitness of the solutions, such that solutions with higher quality have a larger probability of being used to create new solutions.
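The selection–recombination–mutation loop of such a traditional evolutionary algorithm can be sketched in a few lines. This is a minimal illustration on bit-string solutions; the function and parameter names are ours, not from any particular library.

```python
import random

random.seed(0)  # fix the seed for reproducibility

def evolve(fitness, pop_size=20, genome_len=8, generations=50,
           mutation_rate=0.1):
    """Minimal generational EA: tournament selection, one-point
    crossover (recombination) and bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        for _ in range(pop_size):
            # Selection: fitter solutions are more likely to become parents.
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            # Recombination: one-point crossover of the two parents.
            cut = random.randrange(1, genome_len)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]
            new_pop.append(child)
        pop = new_pop  # the offspring replace the old population
    return max(pop, key=fitness)

# Example: maximize the number of ones in the bit string ("one-max").
best = evolve(sum)
```

Here the offspring replace the entire old population; variants that keep some of the best old solutions (elitism) are equally common.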

Recently, it has become more common to adapt the parameters of a probability distribution that generates solutions, rather than to adapt the solutions themselves. This approach is used in so-called evolutionary strategies [Bäck, 1996; Beyer and Schwefel, 2002]. Such approaches generate a population, but use the fitness of the solutions to adapt the parameters of the generating distribution, rather than the solutions themselves. A new population is then obtained by generating new solutions from the adapted probability distribution. Some specific algorithms include the following. Covariance matrix adaptation evolution strategies (CMA-ES) [Hansen and Ostermeier, 2001] weigh the sampled solutions according to their fitness and use the weighted mean as the mean of the new distribution. Natural evolutionary strategies (NES) [Wierstra et al., 2008; Sun et al., 2009] use all the generated solutions to estimate a gradient for the parameters of the generating distribution, and then use natural gradient ascent to improve these parameters. Cross-entropy optimization methods [Rubinstein and Kroese, 2004] simply select a fixed number of solutions with the highest fitness, where this number is a parameter of the method, and use the mean of these elite solutions as the new mean of the distribution.

A generic evolutionary strategy is shown in Algorithm 2. The method used to compute the next parameter setting for the generating function in line 5 differs between algorithms. However, all attempt to increase the expected fitness, such that the expected fitness $E_{\theta_{t+1}}[f(x)]$ under the new parameters is higher than the expected fitness $E_{\theta_t}[f(x)]$ under the parameters of the former population. These expectations are defined by

$E_{\theta}[f(x)] = \int p_{\theta}(x) f(x) \, dx$,

where $p_{\theta}$ denotes the generating distribution with parameters $\theta$ and $f$ denotes the fitness function.
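In practice, such an expectation is rarely available in closed form, but it can be estimated by Monte Carlo sampling from the generating distribution. A small sketch, assuming a Gaussian generating distribution (the names are illustrative):

```python
import numpy as np

np.random.seed(0)  # for reproducibility

def expected_fitness(f, mean, std, n_samples=10_000):
    """Monte Carlo estimate of E_theta[f(x)] under a Gaussian
    generating distribution with parameters theta = (mean, std)."""
    samples = mean + std * np.random.randn(n_samples, mean.shape[0])
    return np.mean([f(x) for x in samples])

# A successful parameter update should increase the expected fitness:
f = lambda x: -np.sum(x ** 2)  # fitness peaks at the origin
before = expected_fitness(f, np.array([1.0, 1.0]), np.array([1.0, 1.0]))
after = expected_fitness(f, np.array([0.2, 0.2]), np.array([0.5, 0.5]))
```

The second parameter setting concentrates the distribution closer to the optimum, so its estimated expected fitness is higher.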

Care should be taken that the variance of the distribution does not shrink too quickly, in order to prevent premature convergence to sub-optimal solutions. A simple way to do this is to apply a step-size parameter [Rubinstein and Kroese, 2004] to the parameter updates, preventing overly large changes per iteration. More sophisticated methods to prevent premature convergence include the use of the natural gradient in NES, and the use of enforced correlations between the covariance matrices of consecutive populations in CMA-ES.
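A step-size parameter of this kind amounts to a simple convex combination of the old and the newly estimated distribution parameters; a sketch with illustrative names (`alpha = 1` recovers the unsmoothed update):

```python
import numpy as np

def smoothed_update(old_mean, old_std, new_mean, new_std, alpha=0.7):
    """Blend old and newly estimated parameters so that the distribution,
    and in particular its variance, changes gradually per iteration,
    reducing the risk of premature convergence."""
    mean = alpha * new_mean + (1 - alpha) * old_mean
    std = alpha * new_std + (1 - alpha) * old_std
    return mean, std

# Even if the elite solutions collapse to a single point (new_std = 0),
# part of the old variance is retained:
m, s = smoothed_update(np.array([0.0]), np.array([1.0]),
                       np.array([2.0]), np.array([0.0]))
```

With `alpha = 0.7`, the standard deviation shrinks to 0.3 instead of 0, so the next population still explores around the new mean.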

No general guarantees can be given concerning convergence to the optimal solution for evolutionary strategies. Convergence to the optimal solution for non-stationary problems, such as the control problem in reinforcement learning, seems even harder to prove. Despite this lack of guarantees, these methods can perform well in practice. The major bottleneck is usually that the computation of the fitness can be both noisy and expensive. Additionally, these methods have been designed mostly with stationary optimization problems in mind. Therefore, they are more suited to optimize a policy using Monte Carlo samples than to approximate the value of the unknown optimal policy. In Section 4, we compare the performance of CMA-ES and an actor-critic temporal-difference approach.

The gradient-free methods mentioned above all fall into a category known as metaheuristics [Glover and Kochenberger, 2003]. These methods iteratively search for good candidate solutions, or for a distribution that generates them. Another approach is to construct a more easily solvable (e.g., quadratic) model of the function that is to be optimized and then maximize this model analytically [Powell, 2002; Powell, 2006; Huyer and Neumaier, 2008]. New samples can be chosen iteratively to improve the approximate model. We are not aware of any papers that have used such methods in a reinforcement learning context, but the sample-efficiency of such methods in high-dimensional problems makes them an interesting direction for future research.
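In the one-dimensional case, one step of such a model-based method can be sketched as fitting a quadratic model to a few samples and maximizing it in closed form. This is a toy illustration of the idea, not the trust-region machinery of the cited methods; the names are ours.

```python
import numpy as np

def quadratic_surrogate_step(f, xs):
    """Fit a quadratic model f(x) ~ a*x^2 + b*x + c to the sampled
    points xs (1-D for simplicity) and return its analytic maximizer."""
    ys = np.array([f(x) for x in xs])
    a, b, c = np.polyfit(xs, ys, 2)  # least-squares quadratic fit
    if a >= 0:  # model is not concave; fall back to the best sample
        return xs[np.argmax(ys)]
    return -b / (2 * a)  # vertex of the fitted parabola

# Example: three samples of a concave function recover its maximum.
x_new = quadratic_surrogate_step(lambda x: -(x - 2.0) ** 2,
                                 np.array([0.0, 1.0, 3.0]))
```

A full method would add `x_new` to the sample set, refit the model, and repeat; the appeal is that each (expensive) function evaluation is reused to refine the model rather than discarded after one generation.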

2011-07-01