Here is a synopsis of my thesis. A short abstract is also available.

## The Minimum Description Length Principle and Reasoning under Uncertainty

### Peter Grünwald

ILLC-Dissertation Series nr. DS-1998-03

Most of the research reported in the thesis concerns the so-called Minimum Description Length (MDL) Principle. Below we briefly describe this principle and summarize the main research questions we posed and the conclusions we reached.

### 1. The MDL Principle

To be able to forecast future events, science seeks to infer general laws and principles from particular instances. This process of inductive inference is a central theme in statistics, pattern recognition and the branch of Artificial Intelligence called 'machine learning'. The Minimum Description Length (MDL) Principle is a relatively recent method for inductive inference. It has its roots in information theory and theoretical computer science (Kolmogorov complexity) rather than in statistics. The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e., to describe it using fewer symbols than are needed to describe the data literally. The more regularities there are in the data, the more we can compress it. This leads to the view (a version of Occam's famous razor) that the more we can compress a given set of data, the more we have learned about the data.
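A minimal sketch of this idea, not taken from the thesis: using the general-purpose compressor zlib as an illustrative stand-in for "exploiting regularity" (the specific strings and the choice of compressor are assumptions for the example), a highly regular string compresses to far fewer bytes than a (pseudo-)random string of the same length.

```python
import random
import zlib

n = 10_000
# strong regularity: a single repeating two-byte pattern
regular = b"01" * (n // 2)
# little exploitable regularity: pseudo-random bytes
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(n))

print(len(zlib.compress(regular)))  # far fewer bytes than n
print(len(zlib.compress(noisy)))    # close to n: almost incompressible
```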

### 2. Contents of the Thesis

The thesis consists of three parts. Part I contains an introduction to MDL intended for a general audience, followed by some theoretical research on MDL. In Part II, we apply the MDL Principle in a practical context and empirically compare it to other well-known methods of inductive inference. Part III reports on research not directly related to MDL: non-monotonic logic applied to common-sense reasoning about action and change (the so-called 'frame problem' of Artificial Intelligence). In an Epilogue to Part III, we show that the non-monotonic logic used there can be interpreted from a probabilistic/MDL point of view (see below), thereby establishing a connection to the first two parts. Below we briefly consider the two central themes of the thesis.

### 3. Research Goals and Conclusions: 'Safe' and 'Risky' Statistics

**First Central Theme: Can we use models that are too simple?**

The result of statistical analysis of a given set of data is nearly always a model that is a gross simplification of the process actually underlying these data. Nevertheless, such overly simple models are often used with great success to classify and/or predict (aspects of) future data generated by the same process. For example, we use linear models for data that are not really linearly related; we assume that 'errors' (the discrepancy between the actual data and the assumed underlying functional relationship) are normally distributed when closer inspection reveals they are not; we assume data to be independent when they are not; and so on. Yet such simplifying, and wrong, assumptions often lead to acceptable prediction, interpolation and extrapolation. How is this possible? And can we identify the situations in which this is possible and those in which it is not?
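A toy numerical illustration of this phenomenon (our own sketch, not from the thesis; the data-generating process and noise level are assumptions chosen for the example): we fit a straight line, by ordinary least squares, to data produced by a mildly nonlinear process, and observe that the wrong-but-simple model still predicts reasonably well on this range.

```python
import random
import statistics

random.seed(1)
# true (nonlinear) process: y = x + 0.1*x^2 + Gaussian noise
xs = [i / 10 for i in range(100)]
ys = [x + 0.1 * x * x + random.gauss(0, 0.5) for x in xs]

# least-squares line y = a + b*x via the closed-form formulas
mx, my = statistics.mean(xs), statistics.mean(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# mean absolute prediction error of the (wrong) linear model
mae = statistics.mean(abs(y - (a + b * x)) for x, y in zip(xs, ys))
print(round(mae, 2))  # small relative to the spread of y (roughly 0 to 20)
```

The point is only qualitative: the linear model is false, yet its prediction error is modest; the thesis asks when such wrong models can be trusted and when they cannot.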

These are the central questions of the first part of the thesis. It turns out that they are closely related to the question of whether the MDL Principle can be theoretically justified: given few data, the MDL Principle will often select a model that, once more data have become available, turns out to be too simple. Can one show that this nevertheless leads to acceptable (or even, in a sense, optimal) results? Briefly, we reach the following conclusion: overly simple models can be applied to make predictions and decisions in two different ways, a 'safe' one and a 'risky' one. If a model is used in the 'safe' way, then it will be 'reliable' in the following sense: the model gives a correct impression of the prediction error that will be made if it is used to predict future data, even when the model is a gross simplification of the process that truly underlies the given data. If the model is used in the 'risky' way, there is no such guarantee (although such usage of a model often makes sense). We state and prove several theorems showing that, under many circumstances, incorrect models can be 'reliable' in the sense indicated above. The concept of 'reliability' is based on a non-standard interpretation of probabilities. This is the second main theme of the thesis:

**Second Central Theme: The Coding-Theoretic Interpretation of Probability**

The notions of 'description method' and 'probability distribution' are very closely connected: every description method, or code, can be reinterpreted as a probability distribution, and vice versa. If data can be coded with the help of model M in only a few bits, this may be reinterpreted as saying that the data have a high probability under M. From the MDL point of view, a model of the data is really a means of describing properties of the data, so a 'model' coincides with a 'description method'. Because of the correspondence just mentioned, a probability distribution can therefore also be seen simply as a means of describing properties of the data. This view of probabilities connects MDL to more traditional methods of statistical inference. MDL is closely related to (yet different from) Bayesian statistics and the controversial yet successful 'Principle of Maximum Entropy'. A crucial difference between MDL and these related methods lies in the way probabilities are interpreted. MDL's coding-theoretic interpretation of probability sheds new light on the debate between those who hold a subjectivist view of probability (probability as a degree of belief) and those who hold the objectivist or 'frequentist' view. Here we (roughly) reach the following conclusion: probabilities can, and should, indeed be used in many situations where they are not related to frequencies; the 'maximum entropy principle' can indeed be used to assign probabilities in the presence of ignorance. However, the way such probabilities should be used to arrive at predictions and decisions differs from the way probabilistic knowledge is usually applied!
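The code/probability correspondence can be made concrete in a few lines. The facts used here are standard information theory rather than anything specific to the thesis (the particular distribution is an illustrative assumption): a distribution P yields ideal code lengths L(x) = -log2 P(x), and any set of code lengths satisfying the Kraft inequality yields, via P(x) = 2^-L(x), a (sub)probability distribution.

```python
import math

# an illustrative distribution over four symbols
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# distribution -> ideal code lengths (in bits): L(x) = -log2 P(x)
L = {x: -math.log2(p) for x, p in P.items()}
print(L)  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}

# code lengths -> distribution: P(x) = 2^-L(x),
# and the Kraft inequality sum_x 2^-L(x) <= 1
kraft = sum(2 ** -l for l in L.values())
print(kraft)  # 1.0: these are the lengths of a complete prefix code
```

Probable symbols get short codewords and improbable ones get long codewords, which is exactly the sense in which "high probability under M" and "few bits given M" are two readings of the same fact.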
