Information and Communication
General Procedure
In the second half of this course, you will pick an information-theoretic subject from the list below and study it. You are allowed to work alone or in groups of 2. Every group has to prepare and give a 45-minute talk about the chosen topic, followed by 15 minutes of questions. Also, every group prepares a written report about the chosen topic (to be handed in by Sunday, 8 February 2015, 23:59).
Detailed Procedure
- Look through the list of topics below. We will distribute the topics after the exercise session on Friday, 16 January 2015.
- Study the material and think about how you want to present the subject to your fellow students in a 45-minute talk.
- Write down in a few bullet points how you want to structure your talk. Email this list to Chris by Wednesday, 21 January 2015, 14:00.
- Chris will be available for personal consultation during these slots if you have questions. Please book ahead by reserving a spot in the Doodle.
- In the fourth week, you give a 45-minute talk, followed by 15 minutes of questions from the audience.
Presentation
A presentation should make your fellow students care about the subject and convey the main ideas of the topic/paper. Try to show how your topic connects to something we have learned in class. You should present the main result(s) with some precision, and it would be nice to give some key ingredients of the proofs (where appropriate; some proofs are nasty and/or not interesting). Do not spend all your time putting up formula after formula: prepare a presentation you would like to listen to. It is important to stay on time; you leave a bad impression if your 45-minute talk lasts 50 minutes or more. Here are a few general tips:
- Keep it simple! The talk is at most 45 minutes, and the goal is not to impress anybody with technical details, but to convey the main message to your fellow students, who do not know anything about your topic.
- Use at least some slides; 45 minutes might not be enough for a blackboard-only talk. Presenting from slides comes with the danger of going (way) too fast. Make sure everybody in the room can follow you. A rough average of 1 minute per slide tends to be reasonable.
- Practice your talk in order to get an estimate of the right timing!
Report
The report consists of at most 10 A4 pages giving an overview of the selected topic. A proper report contains at least the title of the chosen subject, an abstract, a general description of the topic, a high-level overview of the main ideas, the interesting low-level details, and references to the scientific articles and other resources you have used (please insert hyperlinks for every reference). Make the report more attractive by including one or more pictures or figures. As a service, you get the opportunity to submit one draft version of your report at any time, and you will get feedback from Chris within 2 working days (Monday to Friday) after submission. You have to hand in the final version of your report by Sunday, 8 February 2015, 23:59.
Grades
Your grade for the final presentation will be determined by the quality of the presentation and your ability to answer questions about the subject (we will use this list for the evaluation). The final presentation counts for 1/3 of your final grade for the course, 1/3 will be determined by the report, and 1/3 will be determined by the average of the 3 homework exercises.
Suggested topics for final presentation:
Note that you are very welcome to suggest a topic yourself. In that case, send me (at least) one link to an academic paper about the subject. I will let you know quickly whether that subject is appropriate. You do not have to stick to the indicated sources. On the contrary, you are encouraged to find other articles or sources that are relevant to the selected topic.
Some subjects are far more challenging than others. A good talk and report about a challenging subject is more impressive than a good talk/report about an easy subject, and the grade will reflect that.
Some topics are copied with kind permission from Mathias Madsen's course on information theory.
Asymptotic equipartition property
In information theory, the analog of the law of large numbers is the asymptotic equipartition property (AEP). It is a direct consequence of the (weak) law of large numbers. The AEP states that \(\frac{1}{n}\log \frac{1}{P_{X_1,\ldots,X_n}(X_1,\ldots,X_n)}\) is close to the entropy \(H(X_i)\), where \(X_1,\ldots,X_n\) are iid (independently and identically distributed) random variables. Using the AEP, you can easily prove Shannon's source-coding theorem, both for lossless and lossy compression. An extension of the AEP is also the crucial ingredient in the proof of Shannon's noisy-channel coding theorem.
Study and present the material in Chapter 3 of [CT].
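To get a feel for this concentration, here is a minimal numerical sketch (not from [CT]; the Bernoulli parameter is an arbitrary choice) that compares the normalized surprisal of an iid sample with the entropy:

```python
import numpy as np

# A minimal sketch: empirically check the AEP for iid Bernoulli(0.2) bits.
# The per-symbol surprisal -(1/n) log2 P(X_1,...,X_n) should concentrate
# around the entropy H(X) as n grows.
rng = np.random.default_rng(0)
p = 0.2
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # entropy of a single bit

for n in [10, 100, 1000, 10000]:
    x = rng.random(n) < p                           # n iid Bernoulli(p) samples
    k = x.sum()                                     # number of ones
    log_prob = k * np.log2(p) + (n - k) * np.log2(1 - p)
    print(n, -log_prob / n, H)                      # the two values converge
```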
Data Compression in practice: Arithmetic Codes
We have seen that Huffman codes are optimal symbol codes for data compression. However, they also have disadvantages, for instance when the source changes. Arithmetic codes are better suited to this case. Arithmetic codes are described in Section 5.5 of [CF] and Section 6.2 of [MacKay]. An interesting application is the Dasher tool by MacKay's group. Describe in detail what the disadvantages of Huffman codes are and how other codes (such as arithmetic codes) get around these problems.
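For intuition, here is a minimal sketch of the interval-narrowing idea behind arithmetic coding (toy distribution assumed, plain floating-point arithmetic, no decoder or termination handling; real coders as described in [MacKay] work with finite precision and adaptive probability models):

```python
# A minimal sketch of the interval-narrowing idea behind arithmetic coding.
probs = {'a': 0.6, 'b': 0.3, 'c': 0.1}   # assumed toy source distribution

def cumulative(probs):
    lo, cum = 0.0, {}
    for s, p in probs.items():
        cum[s] = (lo, lo + p)
        lo += p
    return cum

def encode(message, probs):
    cum, low, high = cumulative(probs), 0.0, 1.0
    for s in message:
        span = high - low
        c_lo, c_hi = cum[s]
        low, high = low + span * c_lo, low + span * c_hi
    return (low + high) / 2    # any number in [low, high) identifies the message

print(encode("abac", probs))
```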
Data Compression in practice: Lempel-Ziv coding
Section 6.4 of [MacKay] describes the Lempel-Ziv algorithms, which are widely used for data compression (e.g. in the gzip command). They follow a different philosophy from the codes we have seen. Understand how they work and try to analyze them in detail. You can also try to get your hands on the source of other modern-day compression algorithms and describe what kind of techniques they use and how well they work.
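As a toy illustration, the following sketch performs LZ78-style dictionary parsing, one of the Lempel-Ziv variants (gzip itself uses the LZ77 sliding-window variant combined with Huffman coding):

```python
# A minimal sketch of LZ78-style parsing: the output is a list of
# (index of longest known prefix, new symbol) pairs, and the dictionary
# of phrases is built up on the fly.
def lz78_parse(text):
    dictionary = {"": 0}       # phrase -> index
    phrase, output = "", []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        output.append((dictionary[phrase], ""))
    return output

print(lz78_parse("abababbababb"))
```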
Noisy-channel coding: Zero-error case
In the problem of noisy-channel coding, we study the question of how much information can be transmitted over a noisy channel. A channel can be described by a conditional probability distribution \(P_{Y|X}\) which for every possible channel input \(x\) describes a probability distribution \(P_{Y|X=x}\) of channel outputs \(Y\). We are interested in determining the maximum number of distinguishable signals for \(n\) uses of a communication channel.
In the zero-error case, the receiver has to be able to identify the message \(x\) with certainty from his output \(y\). In this setting, we are only interested in whether two input symbols \(x \neq x'\) can possibly be confused by the receiver or not. Hence, the zero-error capacity of a channel is a property of the confusability graph that corresponds to the channel. Studying this problem has very interesting connections to graph theory.
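As a concrete toy example, the following sketch (brute force, only sensible for small alphabets) builds the confusability relation of Shannon's pentagon channel, in which input \(i\) can be received as \(i\) or \(i+1 \bmod 5\), and finds a maximum set of pairwise non-confusable inputs for a single channel use:

```python
from itertools import combinations

# A minimal sketch: two inputs are confusable if they share a possible output;
# a zero-error code for one channel use is an independent set in the
# confusability graph.
def confusable(x, xp, outputs):
    return bool(outputs[x] & outputs[xp])

def max_zero_error_codewords(outputs):
    inputs = list(outputs)
    for size in range(len(inputs), 0, -1):
        for subset in combinations(inputs, size):
            if all(not confusable(x, xp, outputs)
                   for x, xp in combinations(subset, 2)):
                return subset
    return ()

pentagon = {i: {i, (i + 1) % 5} for i in range(5)}   # possible outputs per input
print(max_zero_error_codewords(pentagon))            # e.g. (0, 2)
```

For a single use this gives only 2 codewords; the interesting part of the topic is that two uses of the pentagon channel already allow 5 codewords, and the zero-error capacity turns out to be \(\log_2 \sqrt{5}\).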
Sources you can study for this topic (besides the Wikipedia pages above):
- Read Shannon's original paper "The Zero Error Capacity of a Noisy Channel".
- The first pages of this survey paper by Koerner and Orlitsky about zero-error information theory.
- The blackboard photos of Christian Schaffner's master course in Information theory (lectures of 24 and 26 November 2014)
Noisy-channel coding: Asymptotic case
In the (more common, and more realistic) asymptotic setting of noisy-channel coding, the capacity of a channel described by \(P_{Y|X}\) is defined to be \(C := \max_{P_X} I(X;Y)\). Shannon's noisy-channel coding theorem states that the capacity is an exact characterization of how much information can be transmitted over the channel in the asymptotic limit of many independent uses of the channel. Any code whose rate is above capacity is bound to make a large error, and there exist codes with rates below capacity that make arbitrarily small errors. It is quite challenging to prove this theorem and its converse, but it is so fundamentally important that it is worth having worked through it once. It is very well described in Chapter 7 of [CT] and Chapter 10 of [MacKay].
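As a small numerical sketch (binary symmetric channel with an assumed flip probability), the following code maximizes \(I(X;Y)\) over input distributions by grid search and compares the result with the closed-form capacity \(1 - H_2(f)\):

```python
import numpy as np

# A minimal sketch: capacity of a binary symmetric channel with flip
# probability f, found by maximizing I(X;Y) over the input distribution.
def h2(p):
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def mutual_information(q, f):
    # q = P(X=1); output distribution P(Y=1) = q(1-f) + (1-q)f
    py1 = q * (1 - f) + (1 - q) * f
    return h2(py1) - h2(f)      # I(X;Y) = H(Y) - H(Y|X) for the BSC

f = 0.1
qs = np.linspace(0.0, 1.0, 1001)
best = max(mutual_information(q, f) for q in qs)
print(best, 1 - h2(f))          # both are about 0.531 bits per channel use
```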
Hamming Code, an example of Error Correcting Codes
If you send messages through a binary channel with a fixed and symmetric probability of bit flips (i.e., 0 being received as 1 or vice versa), then your task as an encoder is essentially to choose codewords that are as far apart as possible in terms of bit-flip distance. A method for creating such codebooks was invented by Hamming in the 1940s. It turns out that his encoding method can be expressed as a certain kind of matrix multiplication, and the choice of encoding scheme for the binary symmetric channel thus corresponds to a choice of matrix. The reverse inference problem then similarly becomes a linear-algebra problem. Check Chapter 1 of [MacKay]. Things you can do under this topic:
- Read Hamming's original paper and a more modern presentation of his ideas.
- Explain how the (7,4) Hamming code works.
- Explain how Hamming codes can be represented as matrices (a small sketch follows this list).
- Find the error rate of a Hamming code and show how it depends on the bit-flip probability, the length of the codeword, and the number of parity-check bits.
- Study generalizations of the Hamming code and more advanced binary codes.
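Here is a minimal sketch of the (7,4) Hamming code in the matrix picture. The systematic generator and parity-check matrices below are one standard choice; conventions for the bit ordering differ between texts.

```python
import numpy as np

# Encoding is multiplication by the generator matrix G; decoding computes the
# syndrome H r and, if it is non-zero, flips the bit whose column of H matches.
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])          # systematic generator matrix
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])          # parity-check matrix, H G^T = 0

def encode(bits4):
    return np.array(bits4) @ G % 2

def correct(received):
    syndrome = H @ received % 2
    if syndrome.any():                          # non-zero syndrome: locate the
        for i in range(7):                      # unique matching column of H
            if np.array_equal(H[:, i], syndrome):
                received = received.copy()
                received[i] ^= 1
    return received

codeword = encode([1, 0, 1, 1])
noisy = codeword.copy(); noisy[2] ^= 1          # flip one bit
print(np.array_equal(correct(noisy), codeword)) # True: single error corrected
```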
Hash Codes: Codes for Efficient Information Retrieval
While Shannon's two main theorems are about data compression and noisy-channel coding, a practically very relevant task is information retrieval. A simple example of an information-retrieval problem is the task of implementing a phone directory service which, in response to a person's name, returns (1) a confirmation that that person is listed in the directory, and (2) the person's phone number and other details. This is a classical computer-science problem called the dictionary problem: the task of designing a data structure that maintains a set of data during 'search', 'delete' and 'insert' operations. A standard solution to the dictionary problem is a hash table. Learn all about them and their relation to information theory in Chapter 12 of [MacKay].
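As a minimal sketch of the data structure itself (names and numbers made up; Python's built-in hash is used instead of a hand-rolled hash function), a hash table with chaining looks like this:

```python
# A minimal sketch of a hash table with chaining for the dictionary problem.
class HashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)        # overwrite existing entry
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None                             # not listed in the directory

directory = HashTable()
directory.insert("Alice", "+31 20 555 0101")
print(directory.search("Alice"), directory.search("Bob"))
```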
Kolmogorov Complexity
The great mathematician Kolmogorov culminated a lifetime of research in mathematics, complexity, and information theory with his definition in 1965 of the intrinsic descriptive complexity of an object. In the treatment given in this course, the object \(X\) has been a random variable drawn according to a probability mass function \(P_X(x)\). If \(X\) is random, there is a sense in which the descriptive complexity of the event \(X = x\) is \(\log \frac{1}{P_X(x)}\), because \(\lceil{\log \frac{1}{P_X(x)}}\rceil\) is the number of bits required to describe \(x\) by an optimal code. One notes immediately that the descriptive complexity of such an object depends on the probability distribution.
Kolmogorov went further. He defined the algorithmic (descriptive) complexity of an object to be the length of the shortest binary computer program that describes the object. (A computer, the most general form of data decompressor, will, after a finite amount of computation, use this description to exhibit the object described.) Thus, the Kolmogorov complexity of an object dispenses with the probability distribution. Kolmogorov made the crucial observation that this definition of complexity is essentially computer independent. It is an amazing fact that the expected length of the shortest binary computer description of a random variable is approximately equal to its entropy. Thus, the shortest computer description acts as a universal code which is uniformly good for all probability distributions. In this sense, algorithmic complexity is a conceptual precursor to entropy.
Learn more about the topic in Chapter 14 of [CT].
Stopping Rules and the Secretary Problem
A stopping rule is a rule for deciding when to stop doing something, based on a finite amount of past data - for instance, when to leave the casino based on your past gambling record, or when to stop a series of clinical trials based on the results so far. In the secretary problem, you are confronted with a finite stopping problem.
Some things you might want to do with this topic are:
- Understand the definition of stopping rules and stopping times.
- Consider the gambling strategy "Stop when you're ahead": Before you enter the casino, you decide to leave once you have reached a certain level of capital \(c\). What is the probability that you will ever leave the casino if you use this stopping rule?
- Study and describe the solution to the secretary problem (a simulation sketch follows this list).
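As a sketch of what such a simulation could look like (the value of \(n\), the thresholds, and the number of trials are arbitrary choices), the following code estimates the success probability of the classical "reject the first \(r-1\) candidates, then take the first record" rule:

```python
import random

# A minimal Monte Carlo sketch of the secretary problem: reject the first r-1
# of n candidates, then hire the first one better than all seen so far.
# The success probability peaks near r = n/e, where it approaches 1/e (about 0.37).
def success_probability(n, r, trials=20000):
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))                 # 0 is the best candidate
        random.shuffle(ranks)
        best_seen = min(ranks[:r - 1], default=n)
        hired = next((x for x in ranks[r - 1:] if x < best_seen), ranks[-1])
        wins += (hired == 0)
    return wins / trials

n = 50
for r in (2, 10, 19, 30, 45):                  # n/e is about 18.4
    print(r, success_probability(n, r))
```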
Markov Processes: Entropy Rates of a Stochastic Process
In the case of iid random variables, the asymptotic equipartition property (AEP) establishes that \(n H(X)\) bits suffice on average to describe \(n\) iid random variables. But what if the random variables are dependent? In particular, what if the random variables form a stationary process? In Chapter 4 of [CT], such stochastic processes are defined and studied. It is shown that, just as in the iid case, the entropy \(H(X_1,X_2, \ldots, X_n)\) grows (asymptotically) linearly in \(n\) at rate \(H(\mathcal{X})\), the entropy rate of the process.
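For a stationary Markov chain the entropy rate has the simple closed form \(H(X_2|X_1) = \sum_i \pi_i H(P_{i\,\cdot})\), where \(\pi\) is the stationary distribution. The following sketch (transition matrix chosen arbitrarily) computes it and compares it with the naive per-symbol entropy:

```python
import numpy as np

# A minimal sketch: entropy rate of a two-state stationary Markov chain.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                      # row-stochastic transition matrix

# stationary distribution: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi = pi / pi.sum()

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

entropy_rate = sum(pi[i] * H(P[i]) for i in range(len(pi)))
print("H(X) =", H(pi), " entropy rate =", entropy_rate)   # the rate is smaller
```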
Gambling and Data Compression
At first sight, information theory and gambling seem to be unrelated. But as you can read in Chapter 6 of [CT], there is a strong duality between the growth rate of investment in a horse race and the entropy rate of the horse race. Shannon himself exploited his knowledge to actually win money by gambling. Indeed, one can show that the sum of the growth rate and the entropy rate is a constant. In the process of proving this, it is argued that the financial value of side information is equal to the mutual information between the horse race and the side information. In order to work on this topic, you will have to understand parts of the previous topic (Chapter 4 of [CT]) as well.
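As a small numerical check of this duality (win probabilities chosen arbitrarily, uniform \(m\)-for-1 odds assumed), proportional (Kelly) betting \(b = p\) gives doubling rate \(W = \log_2 m - H(p)\), so \(W + H\) is indeed constant:

```python
import numpy as np

# A minimal sketch: with proportional betting b = p and uniform m-for-1 odds,
# the doubling rate is W = sum_i p_i log2(m * p_i) = log2(m) - H(p).
p = np.array([0.5, 0.25, 0.125, 0.125])        # horse win probabilities
m = len(p)                                     # uniform odds: m-for-1 on each horse

H = -(p * np.log2(p)).sum()
W = (p * np.log2(m * p)).sum()                 # doubling rate with b = p
print(W, H, W + H, np.log2(m))                 # W + H equals log2(m) = 2
```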
Codebreaking for Traditional Cipher Systems
Information theory provides general insights about the security of various encryption schemes. However, cracking an "insecure" cipher often requires a lot of clever probabilistic inference, for instance in the form of sampling schemes like Gibbs sampling of the posterior distribution over encryption keys.
Historically, cracking the German Enigma encryption machine was such a task; you can read about it in Section 18.4 of [MacKay]. More recently, Kevin Knight and colleagues have also written a number of papers on codebreaking from an explicitly Bayesian viewpoint (e.g., this one).
If you like a programming challenge, you can also try to decipher the encrypted texts listed under this topic on Mathias' page and report on your findings.
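If you want a starting point for such a challenge, here is a minimal sketch of Metropolis-style sampling over substitution keys, scored with bigram statistics. The file reference.txt is an assumed long English text that you have to supply, and the toy ciphertext is far too short to be cracked reliably; the sketch only illustrates the mechanics.

```python
import math, random, string
from collections import Counter

ALPHA = string.ascii_lowercase

def bigram_logprobs(text):
    text = "".join(c for c in text.lower() if c in ALPHA)
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {bg: math.log((counts.get(bg, 0) + 1) / (total + 26 * 26))
            for bg in ((a, b) for a in ALPHA for b in ALPHA)}

def score(text, logp):
    # bigrams involving spaces score the same under every key
    return sum(logp.get(bg, 0.0) for bg in zip(text, text[1:]))

def crack(ciphertext, logp, steps=20000):
    key = dict(zip(ALPHA, random.sample(ALPHA, 26)))        # cipher -> plain
    cur = score("".join(key.get(c, c) for c in ciphertext), logp)
    for _ in range(steps):
        a, b = random.sample(ALPHA, 2)
        key[a], key[b] = key[b], key[a]                     # propose a swap
        new = score("".join(key.get(c, c) for c in ciphertext), logp)
        if new > cur or random.random() < math.exp(new - cur):
            cur = new                                       # accept
        else:
            key[a], key[b] = key[b], key[a]                 # reject: swap back
    return "".join(key.get(c, c) for c in ciphertext)

logp = bigram_logprobs(open("reference.txt").read())        # assumed reference text
print(crack("wklv lv d vhfuhw phvvdjh", logp))              # toy example only
```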
Randomness Extraction and Privacy Amplification
In cryptography, we want to limit the amount of information an eavesdropper learns. One way to make sure that a certain \(n\)-bit string \(X\) is close to unknown to an eavesdropper Eve who holds side information \(Y\) about \(X\) is to extract the randomness out of \(X\) by picking a random function mapping \(n\) bits to fewer bits, thereby making it more difficult for Eve to learn \(f(X)\). Study the details of this privacy-amplification procedure in Sections 8 and 9 of the [CF] script. It is an essential building block for virtually all information-theoretically secure protocols. It often also works against quantum adversaries.
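As a minimal sketch of the extraction step (dimensions chosen arbitrarily; the security statement is the leftover hash lemma from the [CF] script, which this code does not prove), one can hash with a randomly chosen binary matrix, a standard example of a 2-universal family of functions:

```python
import numpy as np

# A minimal sketch of privacy amplification by universal hashing:
# apply a random m x n binary matrix M to the n-bit string X, i.e. f(X) = M X mod 2.
# For x != x', Pr[M x = M x'] = 2^(-m), so the family is 2-universal.
rng = np.random.default_rng()

def extract(x_bits, m):
    n = len(x_bits)
    M = rng.integers(0, 2, size=(m, n))        # public random choice of the function
    return M @ np.array(x_bits) % 2            # the m extracted bits

x = rng.integers(0, 2, size=16)                # stand-in for the partially secret X
print(extract(x, m=4))
```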
Uniqueness of the Logarithmic Uncertainty Measure
In his 1948 paper, Shannon gave a small set of axioms that were sufficient to uniquely select entropy as a measure of uncertainty (up to the choice of logarithmic base). Other authors since him have used slightly different but equivalent sets of axioms, and many textbooks on information theory contain a version of this theorem. Some things you can do under this topic are:
- Choose a specific set of axioms and prove that the entropy is the only function fulfilling them (one standard set is sketched after this list).
- Compare different axiomatizations.
- Discuss which of the axioms constitute the most substantial assumptions, and whether they can be relaxed.
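For orientation, here is one common formulation (a sketch; Shannon's own third axiom is phrased in terms of successive choices, and the grouping form below is a standard equivalent): require that \(H(p_1,\ldots,p_n)\) be continuous in the \(p_i\), that \(H(\tfrac{1}{n},\ldots,\tfrac{1}{n})\) be monotonically increasing in \(n\), and that grouping two outcomes satisfies \(H(p_1,\ldots,p_n) = H(p_1+p_2,p_3,\ldots,p_n) + (p_1+p_2)\,H\!\left(\tfrac{p_1}{p_1+p_2},\tfrac{p_2}{p_1+p_2}\right)\). One then shows that the only functions satisfying all three conditions are \(H(p_1,\ldots,p_n) = -K\sum_i p_i \log p_i\) for a constant \(K>0\), i.e. the entropy up to the choice of unit.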