Information Theory 2014

Get back to the course site.

General Procedure

For the final exam of the course, every student will pick an information-theoretic subject from the list below and study that. Every student has to prepare and give a 20-minute talk followed by 10 minutes of questions about the chosen topic.

Detailed Procedure


A presentation should make your fellow students care about the subject and convey the main ideas of the topic/paper. Try to show how your topic connects to something we have learned in class. You should present the main result(s) with some precision, and it would be nice to give some key ingredients of proofs (where appropriate - some proofs are nasty and/or not interesting). Do not spend all your time putting up formula after formula: prepare a presentation you would like to listen to. It is important to stay on time, you leave a bad impression if your 20-minute talk lasts for 25 minutes or more. Here are a few general tips:


Your grade for the final presentation will be determined by the quality of the presentation, your ability to answer questions about the subject (we will use this list for the evaluation). You do not have to stick to the indicated sources. To the contrary, you are encouraged to find other articles or sources that are relevant for the selected topic. The final presentation counts 1/2 towards your final grade of the course (the other 1/2 are determined by the homework exercises).

Research tools

Scientific papers can most easily be found online via Google or Google Scholar. As an alternative, almost all universities have a subscription to SpringerLink, allowing you to access these articles for free from their computers. If neither works, feel free to contact us.

Suggested topics for final presentation:

Note that you are very welcome to suggest a topic yourself. In that case, send us (at least) one link to an academic paper about the subject. We will let you know quickly if that subject is appropriate.

Some subjects are far more challenging than others. The following list if roughly ordered in decreasing difficulty of topics. A good talk about a challenging subject is more impressive than a good talk about an easy subject, and the grade will reflect that.

Some topics are copied with kind permission from Mathias Madsen's course on information theory

Rate Distortion Theory

Rate-distortion theory gives an analytical expression for how much compression can be achieved using lossy compression methods; it addresses the problem of determining the minimal number of bits per symbol, as measured by the rate R, that should be communicated over a channel, so that the source (input signal) can be approximately reconstructed at the receiver (output signal) without exceeding a given distortion D.

Many of the existing audio, speech, image, and video compression techniques have transforms, quantization, and bit-rate allocation procedures that capitalize on the general shape of rate-distortion functions.

One of the most intriguing aspects of this theory is that joint descriptions are more efficient than individual descriptions. It is simpler to describe an elephant and a chicken with one description than to describe each alone. This is true even for independent random variables. It is simpler to describe \(X_1\) and \(X_2\) together (at a given distortion for each) than to describe each by itself. Why don't independent problems have independent solutions? The answer is found in the geometry. Apparently, rectangular grid points (arising from independent descriptions) do not fill up the space efficiently.

Study it in Section 10.4 in [MacKay] and Chapter 10 in [CT]

Kolmogorov Komplexity

The great mathematician Kolmogorov culminated a lifetime of research in mathematics, complexity, and information theory with his definition in 1965 of the intrinsic descriptive complexity of an object. In the treatment in the course, the object \(X\) has been a random variable drawn according to a probability mass function \(P_X(x)\). If \(X\) is random, there is a sense in which the descriptive complexity of the event \(X = x\) is \(\log \frac{1}{P_X(x)}\) , because \(\lceil{\log \frac{1}{P_X(x)}}\rceil\) is the number of bits required to describe \(x\) by an optimal code. One notes immediately that the descriptive complexity of such an object depends on the probability distribution.

Kolmogorov went further. He defined the algorithmic (descriptive) complexity of an object to be the length of the shortest binary computer program that describes the object. (Apparently, a computer, the most general form of data decompressor, will after a finite amount of computation, use this description to exhibit the object described.) Thus, the Kolmogorov complexity of an object dispenses with the probability distribution. Kolmogorov made the crucial observation that the definition of complexity is essentially computer independent. It is an amazing fact that the expected length of the shortest binary computer description of a random variable is approximately equal to its entropy. Thus, the shortest computer description acts as a universal code which is uniformly good for all probability distributions. In this sense, algorithmic complexity is a conceptual precursor to entropy.

Learn more about the topic in Chapter 14 of [CT].

Stopping Rules and Wald's equality

A stopping rule is a rule for deciding when to stop doing something, based on a finite amount of past data - for instance, when to leave the casino based on your past gambling record, or when to stop a series of clinical trials based on the results so far. Wald's equality is a theorem which states that the average gain associated with a specific stopping rule is equal to the expected gain from each individual trial, times the expected time before you stop. Some things you might want to do with this topic are:

A good place to start learning about this topic is Robert Gallager's MIT course on stochastic processes.

Entropy Rates of a Stochastic Process

The asymptotic equipartition property (AEP) established that \(n H(X)\) bits suffice on the average to describe \(n\) iid random variables. But what if the random variables are dependent? In particular, what if the random variables form a stationary process? In Chapter 4 of [CT], it is shown, just as in the iid case, that the entropy \(H(X_1,X_2, \ldots, X_n)\) grows (asymptotically) linearly in \(n\) at rate \(H(\mathcal{X})\), the entropy rate of the process.

Gambling and Data Compression

At first sight, information theory and gambling seem to be unrelated. But as you can read in Chapter 6 of [CT], there is a strong duality between the growth rate of investment in a horse race and the entropy rate of the horse race. Indeed, the sum of the growth rate and the entropy rate is a constant. In the process of proving this, it is argued that the financial value of side information is equal to the mutual information between the horse race and the side information. In order to work on this topic, you will have to understand parts of the previous topic (Chapter 4 of [CT]) as well.

Codebreaking for Traditional Cipher Systems

Information theory provides general insights about the security of various encryption schemes. However, cracking an "insecure" cipher often requires a lot of clever probabilistic inference, for instance in the form of sampling schemes like Gibbs sampling of the posterior distribution over encryption keys. Historically, cracking the German Enigma encryption machine was such a task, you can read about it in Section 18.4 of [MacKay]. More recently, Kevin Knight and colleagues has also written a number of papers on codebreaking from an explicitly Bayesian viewpoint (e.g., this one). If you like a programming challenge, you can also try to decipher the encrypted texts listed under this topic on Mathias' page and report on your findings.

Randomness Extraction and Privacy Amplifications

In cryptography, we want to limit the amount of information an eavesdropper learns. One way to make sure that a certain n-bit string X is close to unknown to an eavesdropper Eve who holds side information Y about X is to extract the randomness out of X by picking a random function mapping n bits to less bits, thereby making it more difficult for Eve to learn f(X). Study the details of this privacy-amplification procedure in Sections 8 and 9 of the [CF] script.

Hamming Codes

If you send messages through a binary channel with a fixed and symmetric probability of bit flips (i.e., 0 being received as 1 or vice versa), then your task as an encoder is essentially to choose codewords that are as far apart as possible in terms of bit flip distance. A method for creating such codebooks was invented by Hamming in the 1940s. It turns out that his encoding method can be expressed as a certain kind of matrix multiplication, and the choice of encoding scheme for the binary symmetric channel thus corresponds to a choice of matrix. The reverse inference problem then similarly becomes a linear-algebra problem. Things you can do under this topic:

Uniqueness of the Logarithmic Uncertainty Measure

In his 1948 paper, Shannon gave a small set of axioms that were sufficient to uniquely select entropy as a measure of uncertainty (up to the choice of logarithmic base). Other authors since him have used slightly different but equivalent sets of axioms, and many textbooks in information theory contain a version of this theorem. Some things you can do under this topic are:

Hash Codes: Codes for Efficient Information Retrieval

A simple example of an information-retrieval problem is the task of implementing a phone directory service, which, in response to a person's name, returns (a) a confirmation that that person is listed in the directory; and (b) the person's phone number and other details. This is a classical computer-science problem called the dictionary problem: the task of designing a data structure that maintains a set of data during 'search', 'delete' and 'insert' operations. A standard solution to the dictionary problem is a hash table. Learn all about it in Chapter 12 of [MacKay].

Language Processing: Comprehension of different syntactic structures

This paper accounts for language processing effects as witnessed through reading studies in information-theoretic terms. In particular, it uses surprisal as a measure of processing load. The usefulness of this measure is corroborated by many experiments. It also allows for a unified view on previous experimental findings which have traditionally been analyzed in different theoretical frameworks. Finally, a suprisal-based theory does in principle not need to rely on structural assumptions usually made by the generative school in linguistics. It therefore has the potential to free modern lingustic reasearch of some of the theoretical burden carried by earlier approaches.

Entropy rates in language

This paper investigates the entropy rate of English text. It was one of the first in a now blooming paradigm that tries to capture linguistic behaviour as communication in an information-theoretic sense. The authors show that while the average entropy of each sentence in a paragraph rises if the sentence is taken out of context, the entropy given all previous sentences stays more or less constant. It is hypothesized that linguistic communication is organized so as communicate at a constant entropy rate. Later research has backed up this hypothesis for spoken language. See also this link.

Machine Translation

This is the paper that started modern machine translation. The ideas presented therein are still used in modern translation system, e.g. google translate. The underlying idea is that a foreign language is basically just an encoded version of English that has to pass through a noisy channel. The translation task then reduces to decoding the source message. The paper offers a sequence of increasingly complex models to do the decoding. For the presentation, it will suffice to cover models 1 and 2. It may also be interesting to compare it to Warren Weaver's vision of automated translation and check how much of it survived.

Maximum Entropy

The maximum entropy principle is one of the most successful and widely used ideas in machine learning. It basically advocates using the model with the highest entropy. However, the choice of possible models is constrained by the data. Put simply, the maximum entropy principle postualates "Model only what you know and nothing else". A presentation on this topic may focus on either the maximum entropy principle itself (Cover and Thomas, chapter 12) or on its applications (relevant papers can be obtained from the lecturers). Those are widespread and found almost everywhere in modern machine learning systems. For example, the decision module in IBM's Watson system is a maximum entropy model. Also, many neural network models can be viewed as a combination of maximum entropy models

Discriminative vs. Generative classifiers

Discriminative and generative models are the two main categories in which one can divide machine learning algorithms. This paper tries to compare them in terms of their theoretical guarantees. It does so only for two of the most simple representatives of each class but has spawned much follow-up research since is it crucial to know which kind of model to apply in which situation. This is one of the more challenging topics and should only be taken if you already have a background in machine learning.