All the methods described here use a population of solutions. Traditional evolutionary algorithms create a population of solutions and adapt this population by selecting some solutions, recombining these and possibly mutating the result. The newly obtained solutions then replace some or all of the solutions in the old population. The selection procedure typically takes into account the fitness of the solutions, such that solutions with higher quality have a larger probability of being used to create new solutions.
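This select–recombine–mutate–replace loop can be sketched as follows. The quadratic toy objective, population size, mutation scale, and the choice to keep the fitter half as parents are all assumptions made for illustration, not a specific published algorithm:

```python
import random

def fitness(x):
    # Toy objective (an assumption for this sketch): maximize -||x||^2,
    # whose optimum is the origin.
    return -sum(xi * xi for xi in x)

def evolve(pop_size=20, dim=3, generations=100, seed=0):
    rng = random.Random(seed)
    # Initial population of random solutions
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: fitter solutions are more likely to become parents
        # (here: truncation selection keeping the better half).
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size // 2):
            a, b = rng.sample(parents, 2)
            # Recombination: uniform crossover of two parents
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            # Mutation: small Gaussian perturbation of each coordinate
            child = [c + rng.gauss(0.0, 0.1) for c in child]
            children.append(child)
        # Replacement: children replace the less fit half of the population
        pop = parents + children
    return max(pop, key=fitness)
```

Because the parents are retained, the best fitness in the population never decreases between generations.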
Recently, it has become more common to adapt the parameters of a probability distribution that generates solutions, rather than to adapt the solutions themselves. This approach is used in so-called evolutionary strategies [Bäck, 1996, Beyer and Schwefel, 2002]. Such approaches generate a population, but use the fitness of the solutions to adapt the parameters of the generating distribution, rather than the solutions themselves. A new population is then obtained by generating new solutions from the adapted probability distribution. Some specific algorithms include the following. Covariance matrix adaptation evolution strategies (CMA-ES) [Hansen and Ostermeier, 2001] weight the sampled solutions according to their fitness and use the weighted mean as the mean of the new distribution. Natural evolutionary strategies (NES) [Wierstra et al., 2008, Sun et al., 2009] use all the generated solutions to estimate the gradient of the expected fitness with respect to the parameters of the generating distribution, and then use natural gradient ascent to improve these parameters. Cross-entropy optimization methods [Rubinstein and Kroese, 2004] simply select the $n_e$ solutions with the highest fitness, where $n_e$ is a parameter, and use the mean of these solutions to find a new mean for the distribution.
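A minimal sketch of the cross-entropy approach, assuming a Gaussian generating distribution with independent coordinates. The objective, population size, elite size, and the additive noise floor on the standard deviation (a common heuristic against premature convergence) are all assumptions for illustration:

```python
import numpy as np

def cem(fitness, dim, pop_size=50, n_elite=10, iters=150, sigma0=1.0, seed=0):
    """Cross-entropy method sketch: sample a population from a Gaussian,
    then refit the Gaussian's mean and spread to the elite samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    sigma = np.full(dim, sigma0)
    for _ in range(iters):
        # Generate a population from the current distribution
        pop = mean + sigma * rng.standard_normal((pop_size, dim))
        scores = np.array([fitness(x) for x in pop])
        # Keep the n_elite solutions with the highest fitness
        elite = pop[np.argsort(scores)[-n_elite:]]
        # Refit the distribution to the elite set; the additive floor
        # keeps some exploration noise in the sampling distribution
        mean = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 0.05
    return mean
```

Note that only the distribution parameters survive between iterations; the solutions themselves are discarded after each refit.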
A generic evolutionary strategy is shown in Algorithm 2. The method to compute the next parameter setting for the generating function in line 5 differs between algorithms. However, all attempt to increase the expected fitness, such that the expected fitness $E_{\theta_{k+1}}\{f(X)\}$ of the new population is higher than the expected fitness $E_{\theta_k}\{f(X)\}$ of the former population. These expectations are defined by
$$E_{\theta}\{f(X)\} = \int p_{\theta}(x) f(x) \, dx \;,$$
where $p_{\theta}$ denotes the generating distribution with parameters $\theta$.
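Algorithm 2 is not reproduced here, but such a generic loop can be sketched as sampling from the generating distribution and delegating the parameter update to a pluggable rule. The fitness-weighted mean update below is a simplified, CMA-ES-flavored instance chosen as an assumption for illustration; a full CMA-ES implementation would also adapt the covariance matrix and step size:

```python
import numpy as np

def weighted_mean_update(mean, pop, scores):
    # Rank-based positive weights over the better half of the population;
    # the new mean is the weighted average of those sampled solutions.
    order = np.argsort(scores)[::-1]           # best solutions first
    mu = len(pop) // 2                         # recombine the better half
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    return (w[:, None] * pop[order[:mu]]).sum(axis=0)

def generic_es(fitness, dim, update, pop_size=40, iters=100, sigma=0.3, seed=0):
    """Generic evolutionary-strategy skeleton: sample a population from the
    generating distribution, then let `update` compute the distribution's
    new parameters from the solutions and their fitness."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    for _ in range(iters):
        pop = mean + sigma * rng.standard_normal((pop_size, dim))
        scores = np.array([fitness(x) for x in pop])
        mean = update(mean, pop, scores)       # the step that differs per algorithm
    return mean
```

Swapping in a different `update` (e.g., an elite-mean rule, or a gradient-based rule) yields the different algorithms discussed above, while the outer loop stays the same.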
No general guarantees can be given concerning convergence to the optimal solution for evolutionary strategies. Convergence to the optimal solution for non-stationary problems, such as the control problem in reinforcement learning, seems even harder to prove. Despite this lack of guarantees, these methods can perform well in practice. The major bottleneck is usually that the computation of the fitness can be both noisy and expensive. Additionally, these methods have been designed mostly with stationary optimization problems in mind. Therefore, they are more suited to optimize a policy using Monte Carlo samples than to approximate the value of the unknown optimal policy. In Section 4, we compare the performance of CMA-ES and an actor-critic temporal-difference approach.
The gradient-free methods mentioned above all fall into a category known as metaheuristics [Glover and Kochenberger, 2003]. These methods iteratively search for good candidate solutions, or for a distribution that generates these. Another approach is to construct a model of the function to be optimized that is easier to solve (e.g., a quadratic model) and then to maximize this model analytically [Powell, 2002, Powell, 2006, Huyer and Neumaier, 2008]. New samples can be chosen iteratively to improve the approximate model. We do not know of any papers that have used such methods in a reinforcement learning context, but the sample efficiency of such methods in high-dimensional problems makes them an interesting direction for future research.
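A minimal one-dimensional sketch of this model-based idea, assuming the sampled function is well approximated by a concave quadratic near its maximum: fit the model to the samples by least squares, then maximize it in closed form.

```python
import numpy as np

def quadratic_model_max(f, xs):
    """Fit a quadratic model f(x) ~ a*x^2 + b*x + c to samples of f at the
    points xs (least squares) and return the model's analytic maximizer,
    -b / (2a), which requires the fitted model to be concave (a < 0)."""
    ys = np.array([f(x) for x in xs])
    a, b, c = np.polyfit(xs, ys, deg=2)
    assert a < 0, "fitted model must be concave to have a maximum"
    return -b / (2.0 * a)
```

In an iterative scheme, new samples would then be drawn near the model's maximizer to refine the fit; the function name and sampling points here are hypothetical choices for the sketch.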