
## Methodologies to Solve a Continuous MDP

In the problem of control, the aim is an approximation of the optimal policy. The optimal policy depends on the optimal value function, which in turn depends on the model of the MDP. In terms of equation (2), the optimal policy is the policy that maximizes the optimal action value for each state: $\pi^*(s) = \arg\max_a Q^*(s, a)$. This means that rather than trying to estimate $\pi^*$ directly, we can try to estimate $Q^*$, or we can even estimate the transition and reward functions $T$ and $R$ to construct $V^*$ and $Q^*$ when needed. These observations lead to the following three general methodologies that differ in which part of the solution is explicitly approximated. These methodologies are not mutually exclusive, and we will discuss algorithms that use combinations of these approaches.

Model Approximation Model-approximation algorithms approximate the MDP and compute the desired policy on this approximate MDP. Since the state space $S$, the action space $A$, and the discount factor $\gamma$ are assumed to be known, this amounts to learning approximations of the transition function $T$ and the reward function $R$. Because of the Markov property, these functions depend only on local data. The problem of estimating these functions then translates into a fairly standard supervised-learning problem. For instance, one can use Bayesian methods [Dearden et al., 1998, Dearden et al., 1999, Strens, 2000, Poupart et al., 2006] to estimate the required model. Learning the model may not be trivial, but in general it is easier than learning the value of a policy or optimizing the policy directly. For a recent survey of model-learning algorithms, see [Nguyen-Tuong and Peters, 2011].
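As a minimal sketch of this idea, the following (illustrative, not from the survey) function estimates a tabular model from sampled transitions: empirical frequencies approximate $T$ and sample means approximate $R$. In a continuous space one would replace the counts with a regression method, but the supervised-learning character of the problem is the same.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Empirical MDP model from a list of (s, a, r, s2) samples.

    Returns T_hat[(s, a)][s2], the empirical transition probability,
    and R_hat[(s, a)], the mean observed reward.
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    T_hat = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
             for sa, nxt in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return T_hat, R_hat
```

For example, three samples of action `'a'` in state `0` that reach state `1` twice yield an estimated transition probability of 2/3 for that successor.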

An approximate model can be used to compute a value function. This can be done iteratively, for instance using value iteration or policy iteration [Bellman, 1957, Howard, 1960, Puterman and Shin, 1978, Puterman, 1994]. The major drawback of model-based algorithms in continuous-state MDPs is that even if a model is known, in general one cannot easily extract a good policy from the model for all possible states. For instance, value iteration uses an inner loop over the whole state space, which is impossible if this space is infinitely large. Alternatively, a learned model can be used to generate sample runs. These samples can then be used to estimate a value function, or to improve the policy, using one of the methods outlined below. However, if the accuracy of the model is debatable, the resulting policy may not be better than a policy that is based directly on the samples that were used to construct the approximate model. In some cases, value iteration can be feasible, for instance because $T(s, a, s')$ is non-zero for only a small number of successor states $s'$. Even so, it may be easier to approximate the value directly than to infer the values from an approximate model. For reasons of space, we will not consider model approximation further.
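To make the drawback concrete, here is a hypothetical value-iteration sketch over a finite (e.g. discretized) model in the format produced above. The `for s in states` sweep is exactly the inner loop that becomes infeasible when the state space is continuous.

```python
def value_iteration(states, actions, T_hat, R_hat, gamma=0.9, tol=1e-8):
    """Value iteration on a finite model.

    T_hat[(s, a)] maps successor states to probabilities and
    R_hat[(s, a)] is the expected immediate reward. Updates are done
    in place (Gauss-Seidel style) until the largest change is below tol.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:  # this sweep requires a finite state space
            q = [R_hat[(s, a)] + gamma * sum(p * V[s2]
                 for s2, p in T_hat[(s, a)].items()) for a in actions]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```

On a two-state chain where state 1 is absorbing with reward 1 and discount 0.5, this converges to $V(1) = 2$ and $V(0) = 1$, matching the geometric series $\sum_t \gamma^t$.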

Value Approximation In this second methodology, the samples are used to approximate $V$ or $Q$ directly. Many reinforcement-learning algorithms fall into this category. We discuss value-approximation algorithms in Section 3.1.
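As a minimal illustration (an assumption-laden sketch, not an algorithm from the survey), a single temporal-difference update for a linear value estimate $V(s) \approx w^\top \phi(s)$ shows how a sampled transition $(s, r, s')$ adjusts the value parameters directly, with no model in between:

```python
def td_update(w, phi, phi_next, reward, alpha=0.1, gamma=0.95):
    """One TD(0) update for a linear value estimate V(s) ~ w . phi(s),
    given feature vectors phi(s) and phi(s') and the sampled reward."""
    v = sum(wi * fi for wi, fi in zip(w, phi))
    v_next = sum(wi * fi for wi, fi in zip(w, phi_next))
    delta = reward + gamma * v_next - v      # TD error
    return [wi + alpha * delta * fi for wi, fi in zip(w, phi)]
```

Because $\phi$ generalizes over states, such updates remain applicable when the state space is continuous, which is why many of the algorithms in Section 3.1 take this form.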

Policy Approximation Value-approximation algorithms parametrize the policy indirectly by estimating state or action values from which a policy can be inferred. Policy-approximation algorithms store a policy directly and try to update this policy to approximate the optimal policy. Algorithms that only store a policy, and not a value function, are often called direct policy-search [Ng et al., 1999] or actor-only algorithms [Konda and Tsitsiklis, 2003]. Algorithms that store both a policy and a value function are commonly known as actor-critic methods [Barto et al., 1983, Sutton, 1984, Sutton and Barto, 1998, Konda, 2002, Konda and Tsitsiklis, 2003]. We will discuss examples of both these approaches. Using this terminology, value-based algorithms that do not store an explicit policy can be considered critic-only algorithms. Policy-approximation algorithms are discussed in Section 3.2.
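The actor-critic split can be sketched in a few lines. In this hypothetical single-transition update (the state-independent softmax actor is purely to keep the sketch small), the critic maintains a linear value estimate and its TD error scores the action the actor just took:

```python
import math

def actor_critic_step(theta, w, phi, phi_next, action, reward,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.95):
    """One actor-critic update from a single transition.

    Critic: linear value estimate V(s) ~ w . phi(s), updated by TD(0).
    Actor: softmax preferences theta over a discrete action set,
    nudged toward actions whose TD error is positive.
    """
    v = sum(wi * fi for wi, fi in zip(w, phi))
    v_next = sum(wi * fi for wi, fi in zip(w, phi_next))
    delta = reward + gamma * v_next - v                  # critic's TD error
    w = [wi + alpha_critic * delta * fi for wi, fi in zip(w, phi)]
    exp_t = [math.exp(t) for t in theta]
    z = sum(exp_t)
    probs = [e / z for e in exp_t]                       # softmax policy
    theta = [t + alpha_actor * delta *
             ((1.0 if i == action else 0.0) - probs[i])
             for i, t in enumerate(theta)]               # policy-gradient step
    return theta, w
```

Dropping the critic and scoring actions by sampled returns instead would give an actor-only method; dropping the actor and acting greedily on a learned $Q$ would give a critic-only method, matching the terminology above.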


2011-07-01