because you're here for one of the most hardcore talks in the whole schedule. I'm very excited about this myself. So the title is Nuggets of Shannon Information Theory and our presenter, Christian, will be guiding you through things. In his 1948 scientific paper, A Mathematical Theory of Communication, Claude E. Shannon introduced the word bit. The article laid down the foundations for the field of information theory, which in turn opened up the way to digital information processing. Christian will present, in an accessible way, three nuggets from Shannon information theory. So take it away, Christian. All right, thanks a lot for coming. As the Heralds have said, my name is Christian Schaffner. I'm a professor at the University of Amsterdam and I've been teaching a course about this topic, about information theory, for the last few years to master students. So I've now taken on the challenge of compressing this whole course into like 40 minutes, just for you. So, enjoy. I hope I have time to cover some of the nuggets, some of the gold pieces of this theory of Shannon. Let me start off with kind of a philosophical question, namely: what is communication? We are communicating all the time, but how do you actually define this? Ever thought about that? Yeah, I see. Yeah, eating first? Yeah, yeah, very good. So it involves two parties. Here's what Claude Shannon proposed: the fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point. So indeed, there are two players, Alice and Bob. Yeah, your two hands. And Alice wants to communicate with Bob and wants to send a particular message, say 0-1-1-0. And then hopefully, after communication has happened, Bob thinks that Alice has been sending this message. So that's a way of defining communication. All right, so let me take us back a little bit into the history of wireless communication. This goes all the way back to smoke signals. You can imagine that has been used for maybe hundreds, thousands of years. You can imagine it's not very efficient. Yes, it does reach very far, you can see smoke from far away, but sending actual messages, that's probably rather hard. Zooming in a little bit on the second half of the 19th century, when Maxwell's equations were derived and people started to understand how to handle electricity. And people like this guy here, Guglielmo Marconi, demonstrated a wireless telegraph. So instead of using a wire and Morse code, you can actually do this without a wire, wireless. And around this time as well, in the 1920s, FM radio was demonstrated, here by Edwin Howard Armstrong on the beach with his wife. So back then, around this time, things were mostly analog. And there was some ad hoc engineering, a bit like what hackers do here as well, tailored to an application. But there were big open questions like: is there a general methodology for designing these communication systems? How can we actually communicate reliably in the presence of noise? Imagine these devices are very imperfect; there are a lot of problems with them. And how fast can we go in principle? And these questions were answered by our hero, the hero of this talk, Claude Elwood Shannon. He's the father of information theory, and as you just heard before, he is the one who invented the word bit. He's a graduate of MIT, that's a famous university in the US, the Massachusetts Institute of Technology, with a thesis on a totally different topic, an algebra for theoretical genetics. So he knew a lot of different things.
And then after that, he joined Bell Labs. Bell Labs was really the place to be at that time as a scientist. And after his time at Bell Labs, he returned to MIT as a professor. And here's a funny thing I would like to read out. So when he returned to MIT in 1958, he continued to threaten corridor walkers on his unicycle, sometimes augmenting the hazard by juggling. No one was ever sure whether these activities were part of some new breakthrough or whether he just found them amusing. He worked, for example, on a motorized pogo stick, which he claimed would mean he could abandon the unicycle so feared by his colleagues. So this was a funny guy, and I think he would have loved the experience here at this hacker camp. And as you just heard, he liked juggling. There's actually a little movie I wanna show you. Let's see if that works. I downloaded it, so maybe. Constructing these juggling clowns led another mathematician, Dr. Claude Shannon, to become more intrigued with the problem of building a real juggling machine. Machines like this only give the illusion of juggling. They have a complex mechanism, but the balls never actually leave the hands; they're all held up by black wires and so on. And as far as I know, no one had ever built a real juggling machine, and it occurred to me that I would like to try to do that. The juggler's skills cannot yet be engineered, so to build a device that could imitate him, Dr. Shannon had to simplify the problem. When you bounce juggle, the ball is almost stationary, so you're not wasting much energy. All you have to do is give it a little toss like that, and then it'll come back to the same height. So you're saving a lot of energy. One can see how easy it is. Another aspect of this is that the ball is moving slowly when you catch it. W. C. Fields, Dr. Shannon's juggling machine, lacks feedback. It cannot sense variations in its toss or catch, and so must rely only on the repetitive dynamics of the balls' fall and bounce in order to maintain its rhythmical cascade. The hands help make up for the lack of feedback, catching and funneling the balls back onto the optimum path. The mathematical analysis of juggling... That's the end, you see? I haven't seen any juggling machines around here, so it's pretty cool that he built this back in the old times. And there are some more links on this slide. Actually, everything is available on GitHub. There was the link to it at the very beginning, and I will show it again at the end. So you can also download the slides and click these links. There's a biographical movie about him, The Bit Player. So if you want to know more about Shannon, you should watch it. It's great. So what did he do? He actually answered these questions. He wrote the paper called A Mathematical Theory of Communication, published in the Bell System Technical Journal, and introduced a whole new field of research, namely information theory, answering the questions we were asking before. What is communication? What is information? How can we compress it? How fast can we communicate? It's all answered in the first half of this paper. And my whole course was just about the topics of that paper, and this talk as well. So here is the one-slide summary of what he did and what the core message of information theory is. Unfortunately, it contains a lot of words that you don't understand yet. So at the end of the talk, I will come back to this slide and hopefully you will have an idea what this is all about. Before that, I want to just give you an idea.
So let's try to read this text. Normally, I would just call somebody up and ask them to read. Is somebody there? Yeah, I see somebody. Yeah, go ahead, the microphone. "The... Alice found at first was in managing her flamingo." Yep. "She succeeded in getting its body tucked away, comfortably enough." Yes. "With her arm," I'm going to guess. Yes. "With its legs hanging down, but generally just as she had got it." Very, very, very good. And so on, it will get harder, but this was very well done. Obviously a native speaker, probably; that helps a lot. So let me illustrate a few concepts with this. One could wonder: if we want to compress this English text, apparently what we could do is just leave out some letters. Well, you could still just read it out, no? So we can kind of just leave out those letters and we will save some space and compress this text. But of course, we wonder: how can you do this optimally? How far can we compress this text while still being able to recover it perfectly? That's one of the questions we're gonna answer. Another thing is, if you send this text over an imperfect channel that sometimes just erases some letters, like here, some erasure channel, then how can we actually make sure that we can recover from that? That's a topic called error correction that we're also gonna touch upon. And then you can combine these two and say, well, if you wanna send this text over a communication channel, how do we do this optimally? Should we compress it first and then put in more redundancy so that it survives the channel? Again, that is answered by information theory. So we'll get there. Before we can start, oh, well, just very quickly: the reactions when Shannon came up with this theory were like, what? Error-free communication over noise, how should we do that? And it turned out to be a bit more difficult than predicted. It actually took about 50 years, maybe 60 years, to figure out how to actually do that. So practical error-correcting codes have only come around in the 2010s, achieving the limits that Shannon already predicted back in 1948. Nowadays, applications of information theory are everywhere. For instance, in the research that I do on quantum computing and quantum cryptography, we use entropy and information theory as tools. In machine learning, in physics there's thermodynamics, in philosophy of science, economics, biology, communication theory, you name it; there's a lot of information theory out there, used everywhere. It's great. So finally, that was the overview. Now we get into a little bit more of the technical part. And in order to get started, I will have to tell you something about exponentials and logarithms. So: who here is afraid of logarithms? One, one and a half, two, three, four. Okay, a bunch of people are afraid. We will have to cover it because it's a central part. Let's start with something we unfortunately know: exponential functions. So how fast does the virus spread? Well, if on day zero we have one person infected, then if the R number is two, a day later there are gonna be two new infections. And every day we move on here, we add a day, we double the number of infections. And you see in this column here, you recognize the powers of two. Every good nerd should recognize these numbers as powers of two. Okay, that's the easy part. That's the exponential part. And they kind of go here to the right in this table.
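If you want to play with this at home, here is a tiny Python sketch of that doubling table, with the base-two logarithm recovering the exponent again. This is just illustrative; it is not necessarily how the speaker's notebook does it.

```python
import math

# Doubling table: day, number of new infections, and the exponent recovered by log2.
for day in range(13):
    infections = 2 ** day                          # exponential growth, doubling every day
    print(day, infections, math.log2(infections))  # log2 is the inverse of 2**day
```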
So for instance, two to the three, that's a simplified way of writing two times two times two; that's eight. Another example is two to the five, that's two times two times two times two times two; that's 32. And the logarithm is the inverse of it. So the logarithm to base two of a number X is the number A such that if you raise two to the A-th power, you get X. And in this picture, it's simply the other direction. So for instance here, you have 512. The logarithm to base two of 512 is nine, because two to the nine is 512. So logarithms were invented for easy calculations, because they translate multiplications on this side here on the right into additions on the left. And here's a little example. If you wanna multiply, for instance, 512 with eight, then you can do this the easy way: if you recognize them as powers of two, you can write them as two to the nine times two to the three. Well, these are all powers of two, so you only need to count how many times you multiplied two with itself. You get two to the nine plus three, so that's two to the twelve. You can look that up as well here: it's 4,096. So there's a little rule we used here, namely that two to the A times two to the B is two to the A plus B. And the logarithm does the same thing the other way around. If you just write the same equation with a logarithm on both sides, then we can write the log of this multiplication as the log of 512 plus the log of eight. So that turns this multiplication into a sum: nine plus three is twelve, and that's the same result as before. So important to remember is this rule about logarithms; I'm sure you've seen it at school: the log of C times D is the log of C plus the log of D. And a little variation of that is that the log of one over D is just minus the log of D, because the log of one is zero. Yeah, so maybe a quick refresher of stuff you have seen once upon a time. Of course these functions are not just defined on the integers; you can draw nice curves. So here are the two plots. The log is the inverse of the exponential function. And here are some noticeable points: two to the zero, well, anything to the zero is one; two to the one is two; and two to the minus one is a half. If you remember these three points, you can basically draw this function yourself. The same kind of points to remember for the log: the log of one is zero, because two to the zero is one; the log of two is one; and the log of a half is minus one, that's here. Yeah, so far so good. Now, another little thing I want to introduce before starting with information theory are probability distributions. We're gonna stick to the simple case, just discrete, so no continuous distributions. And you can think of them, in terms of programming languages, just as a list of non-negative numbers. So they can be zero, but they're all bigger than or equal to zero, and they sum to one. That's it. A very easy example is a fair coin, just 50-50. Whenever you flip the coin, you get either one or the other outcome with probability 0.5. A little bit more advanced are biased coins, those that are not fair. So here you have a distribution of, say, p and one minus p; they sum to one if you add them. And p is something between zero and one, but not a half, because otherwise it would be a fair coin. Okay, still pretty simple. Another kind of extreme case is the so-called uniform distribution. Here is a picture, this pie chart, for n equals five.
So you have five possible outcomes and all of them are equally likely to occur. So it's just one over five, 0.2 probability for every outcome. Okay, those are easy examples. A little bit more complicated example you might have seen as well is the binomial distribution. You don't really need to know what these formulas here are, but it's basically the probability of obtaining exactly k heads when flipping a coin with bias p n times. So you get something like this, this pie chart here. And that's maybe a good moment to switch over to the Jupyter notebook that I made for you. You can download this from GitHub, you can reproduce all these tables and the plots that I put in the presentation, and you see the code for how to do it. So here's a fair coin, a biased coin, a uniform distribution, and then there's a little thing to play with here, the binomial distribution, where you can actually choose the number of coins you flip and you get an interactive chart of these distributions. You can play around with that here, just to get an idea of what these distributions are. The next thing we wanna do is another example of a probability distribution. I read in Alice in Wonderland and I take the first 100,000 characters, I clean it up a little bit, I strip everything but the letters and the space, and then I just count frequencies. So if I run this, I get something like that. These are just frequencies, that means how many times an E occurs, that's the most frequent English letter, how many times a space occurs, that's actually this big chunk here, and then T, A, O; that's kind of the letter frequencies you can get. So that's actually my next example here. Frequencies, you can always do that, it's just statistics: you count the number of occurrences divided by the total. That's another distribution you can work with. All right, final thing before we actually get to the nuggets: we wanna sample from a distribution. So we wanna have a black box, this thing up there with a red button, and every time you push the button, we want a sample from this distribution. So for instance, from this fair coin. Let's try to do that, again actually in the notebook. You're probably more familiar with this than I am: there's some random function, and if I just evaluate this a few times, then you see I get a float, a floating-point number between zero and one. So now we wanna turn this into a coin, just a 50-50 outcome. Well, a little bit of thought will bring you to this: you just check whether it's smaller than a half or bigger, I mean, look at the first digit after the comma, and then you know what the bit should be. So this simple transformation turns a float between zero and one into a random bit. And what we actually did is just unfold this pie chart onto the interval between zero and one, and then we threw a random number between zero and one, and depending on whether it ends up on the blue part or the orange part, we call the outcome a zero or a one, or blue or orange. And this actually turns out to be a good thing to do, because we can do it for other distributions as well. So if you have this distribution, we can also unroll it and then throw this number, and it turns out to land here. So here a sample would be this green one. And let's do a final example with this more complicated distribution. Again, we can unfold it and run our random number. And here is the little program that does it.
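As a sketch of what such a sampler can look like (the names and the example distribution here are made up for illustration, so don't take this as the exact notebook code):

```python
import random

def sample(dist):
    """Draw one sample from a discrete distribution given as {label: probability}.

    We 'unfold' the pie chart onto the interval [0, 1), draw a uniform random
    number, and return the label of the sub-interval it falls into.
    """
    r = random.random()          # uniform float in [0, 1)
    total = 0.0
    for label, p in dist.items():
        total += p               # right edge of this label's sub-interval
        if r < total:
            return label
    return label                 # guard against floating-point rounding at the end

coin = {"0": 0.5, "1": 0.5}
letters = {"A": 0.25, "B": 0.2, "C": 0.15, "D": 0.15, "E": 0.15, "F": 0.1}

print("".join(sample(coin) for _ in range(20)))     # a sequence of random bits
print("".join(sample(letters) for _ in range(60)))  # a sample like the one on the slide
```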
And it's really just figuring out in which interval you are. You start with the first interval, and then you keep adding up the probabilities until you pass the random value that you picked before, and you return that label. So there's nothing really deep going on. And then you can sample. This is just doing it like 60 times, and you get a sequence like this. It starts like D, A, E, F, F. So here we have some labels, I gave the outcomes some labels, they could also be colors. But that's a random sample. Every time I push the button, I get a sample from this distribution. Yeah? With me? Great, now it will happen. Okay, because now we start wondering how surprised we are. If you have an event with some probability p, then the inverse of this probability, one over p, is actually a measure of how surprised we are if this event happens. If you have a small probability, then we are actually extremely surprised if it happens: one over p is gonna be huge if you have a small probability to start with. If you have a big probability to sample, we're not so surprised; I mean, of course it happened. But the funny thing, and bear with me, is that we're gonna measure this logarithmically, in bits. So we're actually gonna look at the log of one over p. And this is a positive number because of the rules we saw before. Again, an easy example, and that's maybe the most important example as well: a fair coin. If you just flip a coin, you are log of one over a half, that's log of two, so one bit surprised to see a particular outcome. If you flip two coins, you're two bits surprised. That's kind of the reason why we use the logarithm: it nicely adds up. And if you flip a bunch of coins, then we are that many bits surprised to see a particular outcome. Okay, now let's look at the other examples of distributions. Let's look at the uniform distribution. In this case, we are log of one over one fifth, so that's log of five, so 2.32 bits surprised to see a particular outcome. So just one sample from this distribution gives us roughly 2.3 fair coins' worth of surprise. It gives us a measure to compare with, literally. If you do two samples, we have two times this, so that's roughly 4.6 fair coins. So we can actually compare surprise; we can somehow measure how surprised we are. And that's gonna be the key for the next thing. Now, a little bit more tricky is a biased coin, because then we can wonder how surprised we are on average. So let's say we have this unfair coin. Then we just compute, and we basically say: okay, the blue part happens with probability 0.1 and has surprisal log of one over 0.1; plus, the big chunk has 0.9 of the mass and surprisal log of one over 0.9. We weight each surprisal by its probability; you can calculate that, put it in the Jupyter notebook or your calculator, and you get about 0.47. So it's just some number, but again, it means we are about half a fair coin surprised, on average, to see an outcome of this distribution. And of course, you can do this for any value of the bias, and you actually get this very nice curve, the binary entropy function. And if you were a student in my course, then you would have to know this function really well, because you can do a lot of stuff with it. In theoretical computer science, it's a very important function. And you see that at 0.5, the fair coin, you have exactly one bit of average surprise.
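In code, the surprisal and this average surprisal look something like the following sketch (the function names are mine; it just reproduces the numbers from the slides):

```python
import math

def surprisal(p):
    """Surprise value of an event with probability p, measured in bits."""
    return math.log2(1 / p)

def average_surprisal(dist):
    """Average surprisal of a distribution, given as a list of probabilities summing to one."""
    return sum(p * surprisal(p) for p in dist if p > 0)

print(surprisal(0.5))                 # fair coin outcome: 1 bit
print(surprisal(0.2))                 # uniform over 5 outcomes: ~2.32 bits
print(average_surprisal([0.1, 0.9]))  # biased coin: ~0.47 bits on average
```

This average surprisal is exactly the quantity that gets a name on the next slide.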
And if it's certainty, if the bias is zero or one, well, then you don't have any surprise. All right, now we're basically at the first nugget, and it's simply a definition. It's actually the definition that is used on this picture here, next to Shannon's face: the definition of Shannon entropy. The Shannon entropy is defined like this. The entropy, called H of this distribution, is just the average surprisal, measured in bits. It's really literally what I just talked about: it's the sum over the probabilities p_i, each multiplied with its surprisal measured in bits, so with the log of one over p_i. Yeah, so okay, great, we can define it. It's a number: if you give me a distribution, I can compute this number. Or you can compute it, the computer can compute it. I wrote a very simple program for that; again, it's in the Jupyter notebook. It's really just this sum here, plus some check that you actually input a distribution. You start at zero and then just add up p times the log two of p, and you have to watch out a little bit with the minus: the log of one over p is minus the log of p, that's why there's a minus here. That's the rule from before. Okay, so once a mathematician gets a definition, you kind of want to understand it. The hacker also wants to understand what it's all about. So you want to check some basic properties. First thing to notice is that the Shannon entropy, the average surprisal, is never negative. Sometimes it's zero, but it's only zero for distributions that have a certain outcome: if there's just one outcome that happens with probability one, then you're not surprised and the entropy is zero. Otherwise you have some uncertainty. It's a measure of uncertainty, of surprisal. The other extreme is if you have uniform distributions; we already saw that for five outcomes. The entropy of a uniform distribution is actually log of N, the log of the number of outcomes you can have. So those are the two extremes. And then there's a nice computation trick I'm just gonna show you; one can prove it, but that maybe goes a bit too far. For instance, for this distribution here, you can notice that it's nicely split in half. You can understand this distribution as coming from a random process that first decides whether it's on the left or the right: with 50-50 you go up or you go down, and that's one bit of entropy. And then here in the upper half, you wanna decide whether you're blue or orange, whether you're A or B, and that's again a 50-50 choice. So with weight 0.5 you go up and you have another bit of entropy, or with weight 0.5 you go down, and then you have a uniform distribution with probability one third for each of these other outcomes. And that's a nice, cute way of computing entropy, and students would practice a lot to actually do that. We don't have time, unfortunately. But we are also gonna practice a little bit, because now you know these rules. So, order these three distributions in terms of entropy. Which one has the highest entropy? Yep, the right one, it has three bits. Yes. And which one has the lowest? The middle one. Yes, indeed. So here are the entropies. They're all uniform distributions, so it's easy, it's just a log. This is log of five, about 2.3; this is log of two, which is one; and this one has three bits, because there are eight possible outcomes. Very good. The next one is slightly more tricky.
Same exercise, but now they all have six possible outcomes. Which one has the highest entropy? The left one, I hear. Why? Because it's uniform. Yeah, probably. Yes, indeed, this is the uniform distribution, and that is actually the maximum you can get with six outcomes. And then between these two, which one has more entropy, more uncertainty, more surprisal? I see somebody pointing. Right, indeed. This one kind of has bigger chunks. So indeed, if you compute it, again the Jupyter notebook will tell you: this one is actually pretty close to the uniform one, at 2.45, and this one has significantly less, because you have these big chunks where you're not so surprised, and even though there are some small ones, they don't have that much weight. So somehow it doesn't give that much entropy. Okay, now we know how to compute entropy. Let's do something with it, let's use it, let's put it into action. And here comes the first concept: how do we actually measure information? Remember this black box: every time we press this red button, it spits out some coins, in this case fair coin flips, a sequence of coin flips, a sequence of random bits, just flipped at random. Or, with some different, more complicated distribution, I showed you how to do it, you can generate sequences like that. And now we're wondering how much information is in such a sequence. And here is a clever idea, again Shannon's idea, of course. You have the data, and if you wanna find out how much information there is, what we can try is to compress it. You compress it down to the minimal number of bits, let's call it L. So you see something big here coming in and something small there coming out. But we wanna do it in a way so that you can inflate it back; we're not allowed to lose any information. That's called lossless compression. It's not what you often do with pictures, like JPEGs: you can make them really small, but you lose quality, you know? You cannot go back. Whereas here, we want to be able to go back. Like zip: you zip, then you can unzip, and it's back to normal. And this minimal number L, we will call the average information content. It's kind of what is in there, the crucial stuff that is in there. And we measure it in bits per symbol: here we got bits, and we measure them per symbol that we had in our original data. Okay, great, we can define information like that. And now comes Shannon and says that, well, this L is gonna be pretty much exactly the entropy of the data distribution. So somebody picks this distribution, I can compute the entropy, we practiced that before, and that is gonna tell me pretty much how far we can compress this information, down to its essential part. Now that's really cool. And of course, once we mathematicians get such a claim, we have to prove it. I'm not really gonna do it; unfortunately, we don't have time. Especially this direction is not that hard, but you need a bit of math. So normally, when we wanna prove such an equality, what mathematicians do is show both parts: we show a lower bound and we show an upper bound. We show that L is smaller than or equal to this entropy, and we show that L is bigger than or equal to it, so it must be equal.
And this part I'm actually not gonna go through, except to say that here's the proof; you would have to read some more stuff to understand it. But the upper bound I'm gonna do. The upper bound says that we will give a compression method to show that we can actually compress a data sequence from this distribution down to roughly the entropy in bits per symbol. And it's actually a very nice procedure that you should all know. It's called Huffman coding. Who has ever heard about Huffman coding? Oh, you guys come here and you already know everything. Okay, just a quick recap, because not all of you have seen it. It's very easy actually, so that's why I can show it. Here's what you do: given a probability distribution, combine the two nodes with the smallest probability into a new one with the combined weight. So here, I've already nicely ordered them by probability. You start with these last two and combine them into a new node that has the sum of these two probabilities. Then you just continue, keep on doing this. You now have 0.3, 0.2, 0.25, and 0.25, and you have to take the smallest of the remaining ones. These are, for instance, these two; maybe here you have a choice. If you have a choice, it doesn't matter what you do, you can just pick any of them. And you combine these two into a new node, now 0.45. So what remains is 0.25, 0.45, and 0.3. Now we have to combine these two, 0.3 and 0.25, to get 0.55. And then the last step: you always combine two to get one, because all the numbers sum up to one. That's a good check; in the end you have to get to one. Okay, so that's step one, and we are done with it. Now we can go backwards. We've created a binary tree, and this one here is the root of it. We now label the branches with zero and one. So here we have a zero and a one, we have a zero and a one here, zero one here, zero one here. And if you follow that thing back, then you get the code words over here. For instance, to get to the A, we have to go down here, so we get a one, and then a zero to go up. And that's the code word here, one zero. Or to get to the last one, to this E, we have to go down, one, down, one, down, one, so we get one, one, one. Yeah, that's it. These are the code words, the binary code words to encode these symbols. And the claim is that this is basically optimal; it works really well. And intuitively it's the right thing, no? Because the symbols with small probabilities get long code words; they are the ones you start with. You actually wanna use the long code words for the stuff that doesn't appear too often, and for the symbols with high probability, you wanna use short code words. That's intuitively what you wanna do if you wanna compress. And here's a kind of real-world example, namely this letter distribution that I was talking about before. Here's the whole table, and here is the Huffman code for these letter frequencies. The space was the most common character, and indeed, that's here at the bottom: its code word is just zero one, because you go up, zero, and down, and then you're already at the space. The next one is E; E is down here, you have to go, I think, one, one, zero, zero for the E, actually going up twice at the end. So you get such a tree, and that's the Huffman code. And the claim is that this actually gets you pretty much down to the Shannon entropy. So here the entropy is about four bits.
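For those who want to try this themselves, here is a compact sketch of the Huffman construction just described, using Python's heapq, together with the entropy for comparison. The five-symbol distribution is my guess at the small example on the slides, and the function names are made up, so treat this as an illustration rather than the speaker's notebook code.

```python
import heapq
import math

def huffman_code(dist):
    """Build a Huffman code for a {symbol: probability} distribution.

    Repeatedly merge the two nodes with the smallest probabilities,
    then read the 0/1 branch labels back off the resulting binary tree.
    """
    # Each heap entry: (probability, tie-breaker, {symbol: codeword so far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(dist.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # smallest remaining probability
        p2, _, right = heapq.heappop(heap)   # second smallest
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

def entropy(dist):
    """Shannon entropy in bits: the average surprisal of the distribution."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

example = {"A": 0.3, "B": 0.25, "C": 0.2, "D": 0.15, "E": 0.1}
code = huffman_code(example)
avg_length = sum(example[s] * len(code[s]) for s in example)
print(code)                          # frequent symbols get short code words
print(entropy(example), avg_length)  # average code length is close to the entropy
```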
So again, if you have this distribution and you get a sample, you're about four bits surprised; it's like flipping four fair coins. Okay, let's see, I did a little experiment to verify that this is really true. Again, all the code is in the Jupyter notebook; you can actually do it yourself, it's not very hard. So I took Alice in Wonderland, the first 100,000 characters. If I just store it stupidly using ASCII code, that's also a code, it uses eight bits per character, then obviously I need 800,000 bits to store this thing. So let's be more clever and use the Huffman code instead, just the Huffman code from the previous slide. And indeed, the number of bits is gonna be roughly 400,000. And that turns out to be very close to the Shannon entropy times the number of symbols: this number here is about four bits per source symbol, and we have 100,000 source symbols. So that's exactly what Shannon predicted; the two numbers are very close to each other. And you can also do the same thing with independent letter samples. So instead of using Alice in Wonderland, I could also just take 100,000 independent samples from this letter distribution, and I get pretty much the same numbers: storing them stupidly takes 800,000 bits, and using the Huffman code gives pretty much what we expect. And then there's one column that I didn't show you yet. I'm gonna zip this text: I'm gonna zip Alice in Wonderland and I'm gonna zip the independent letter samples. What do you think is gonna happen? Yeah? Which one is way better? Yeah? It is better for one, but not for the other. Okay, so here we go. It is pretty good; it is actually better than the Huffman code at compressing Alice in Wonderland, English text. But it's actually worse here: we're beating zip if we just have independent letter samples. Yeah? You have to think a little bit about that. Why is this the case? The reason is that there's much more structure in English text than just single-letter frequencies. You also have bigrams and trigrams, you have words, you have language structure, you know, sentences. If you can exploit that, you can do better, and that is what zip apparently does, quite cleverly; it beats the Huffman code big time. This won't work on independent letter samples, because there are no bigrams, I mean, only trivial bigrams. So there the Huffman code is really good, but here you actually win by zipping it. And that means you can be more clever in compressing if you have more knowledge about your data. Good, final part. I have something like 15 minutes left. Final part, one more nugget to go. And this is actually by far the hardest part, so I had to simplify a lot, and I'm gonna be very hand-waving, at least for a mathematician. Anyway. So here we have a noisy channel. A noisy channel, you can model it mathematically as a conditional probability distribution; don't worry about it. It gets an input x and it outputs something y, which is messed up a little bit. So it's noisy: it's not outputting perfectly whatever comes in, it distorts it a little bit. And here is the most famous example. It's called the binary symmetric channel. You have an input, it's just a bit, either zero or one. And with probability one minus epsilon, the bit survives: the channel just outputs the bit that was put in.
And with probability epsilon, it flips the bit. Let's say epsilon is 0.1, so with 10% probability it flips the bit. It's a noisy channel: you can put in bits, but like 10% of them are gonna be flipped, and you don't know which ones. Okay. Now, how do you cope with such a channel? So here you have our BSC with error probability 0.1, and you have some data that you want to get across this channel. You wanna encode the message in some way so that, from what comes back out of the channel, you can decode and recover the original message. And there are two figures of merit here that we wanna optimize. On the one hand, there's the code rate, that's essentially how many message bits you get across per use of the channel. On the other hand, we wanna make sure that we have a good chance of recovering the original message. These are the things we wanna optimize, and let me give you examples. A kind of stupid or trivial example is: we don't encode at all, we just send what we have. So if the message is 0 1 1 0, well, just send that over the BSC. We could be unlucky that some bit is flipped, and then, well, we're lost. The rate is very good, because we use the channel exactly once for every bit we wanna send, but the probability of success is only 0.9 per bit, which is not that great. So here's something more clever, namely the repetition code. Instead of sending each bit once, send it three times, just the same bit. If you wanna send a zero, send 0 0 0; if you wanna send a one, send 1 1 1, et cetera. Then about 10% of these bits are gonna get hit with an error. You don't know where, but you have a pretty good way to recover: you can just use majority vote. In a box of three, if there are two zeros, then probably the third one was also a zero. That actually works quite well. The rate is not so great, because it's only one third: we use the channel three times for every bit that we wanna send. But we boost our success probability per bit quite a bit. So those are some easy examples of codes. There are better codes, like the following one, the Hamming code. Who has seen the Hamming code? The same guys as before. Okay, so very quickly, here it is. You take four message bits and you add some parity-check bits, three more bits. And here's a nice graphical way to illustrate how this works. Here's an example: let's say we have the message bits 0 1 1 0. We put them in these circles according to the scheme here: we put the zero at m1, the one at m2, another one at m3, and then the zero here. And now we just wanna make sure that all the circles are happy. A circle is happy when it has even parity, so when the sum of the bits in the circle is even. So we have to add a one here to make one plus one, which is two; that's a happy circle. Here we need to add a zero to make this circle happy, and here we have to add a one to make this circle happy. And that's it, that's the encoding scheme. And it's easy to see that if some bit gets messed up, it actually doesn't matter which one, we will be able to recover, as long as it's only one bit. For instance here, if we hit this bit with an error, then we can easily figure out that it's this bit, because we can again check which circles are happy. This one is still happy.
This one is not happy anymore, because the sum of its bits is odd, and here is another one that is not happy. So the error must have occurred in the intersection of these two circles, and we can correct that bit back to one and be good. And this actually works for any place where the error occurs, so we can cope with one error here. The rate is kind of okay, four out of seven, about 0.57, and the success probability is like 93%. You see, there are many different error-correcting codes; maybe that's the topic of the next talk I'm gonna give in the future, you can actually study this in detail. Here's a little illustration of the Hamming code, it's kind of funny: there are the bits, you encode them with additional parity bits, then you hit everything with 10% noise, and you can still recover the picture reasonably well by decoding the way I just illustrated. And finally, here are our two figures of merit: the error, which we wanna keep small, and the rate, which we wanna make high. So good codes are here in this corner. And the final nugget by Shannon is that actually every channel, not just this binary symmetric channel, but any channel, has a capacity. You give me a channel, I can compute a number C for this channel. And this C is kind of a magic bar: there exist error-correcting codes that can cope with this noisy channel at any rate below that C. For this example, the BSC with 0.1 probability of error, I can compute it, and this C turns out to be about 0.53. That's actually where this line here crosses zero. And Shannon's nugget tells us that we can get as close as we want: we can get the rate up to 0.53 with arbitrarily small error. That's really amazing. He's basically claiming this line here, and that such a line exists for every noisy channel. And that's what people wondered about: how can you do that? The way he proves it is ingenious, and it's actually very nice to see the proof: what he does is just pick a random error-correcting code. Random codes are extremely good at these things. Unfortunately, they're very impractical, because decoding them is very hard, you cannot use them in practice. And that's why it took like 60 years to actually achieve these limits in practice, to make it practical enough to really do this. All right, that brings me to the end. What you've learned from this talk: hopefully something about exponentials and logarithms, about Shannon entropy, these two formulas here, the entropy and, down here, the capacity of the channel; a little bit about data compression and about error correction. And now I can actually come back to this slide, because now you actually understand it. Data compression: if data is sampled according to some probability distribution, then there's an ultimate data compression limit, and that's the entropy H of this distribution. Error correction: you can model a channel with a conditional probability distribution, and you can attach a number to it, the channel capacity. And finally, you can reliably communicate if and only if the entropy of the stuff you want to send is strictly smaller than the capacity of the channel you want to send the information over. Yeah, that's it. Thank you very much for your attention, and I'm happy to take any questions you have. A reminder that anybody asking questions should come to these microphones in the middle.
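For readers following along at home, here is a rough, self-contained Python sketch of the binary symmetric channel, the repetition code, and the Hamming(7,4) code discussed in this last part. The function names and the bit layout are my own choices, not taken from the speaker's repository; it only illustrates the idea.

```python
import random

def bsc(bits, eps=0.1):
    """Binary symmetric channel: flip each bit independently with probability eps."""
    return [b ^ (random.random() < eps) for b in bits]

def rep3_encode(bits):
    """Repetition code: send every bit three times (rate 1/3)."""
    return [b for b in bits for _ in range(3)]

def rep3_decode(bits):
    """Majority vote within each block of three received bits."""
    return [int(sum(bits[i:i + 3]) >= 2) for i in range(0, len(bits), 3)]

def hamming74_encode(d):
    """Encode 4 data bits into 7 bits, layout p1 p2 d1 p3 d2 d3 d4 (rate 4/7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity check over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity check over positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3       # syndrome = position of the flipped bit (0 = none)
    if pos:
        c[pos - 1] ^= 1              # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]

message = [0, 1, 1, 0]
print(rep3_decode(bsc(rep3_encode(message))))            # usually recovers the message
print(hamming74_decode(bsc(hamming74_encode(message))))  # survives any single bit flip
```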
Are there any questions? Nope, none from the Signal Angel, excellent. So we're all just dazzled by this display and nobody has a question. Oh, we have a question. Hello, thanks, very interesting. I was wondering, is there a theoretical limit, if you have noisy communication, where you can no longer reliably get anything across? Yes, yes. So basically, in this figure that I showed, that's exactly this line. It is a hard limit. If you come up with codes, or if you have codes, that go beyond this line, then your error will explode. So this really is a sharp border. As long as you stay below it, there are ways to do it; if you're above it, it's over. If you ride exactly on the line, then it becomes interesting; there are people figuring out how exactly it behaves there, but it's basically a sharp border. For instance, if you imagine the simple example of the binary symmetric channel that flips the bit with probability one half, it's not gonna get you anywhere, because then the binary entropy of one half is one, so you have one minus one, a capacity of zero. You just get random bits as outcomes; everything is randomized, and you cannot learn anything from the output. Hi. Yes. You're talking about noisy channels. Do you have some estimates of how noisy our IT channels are nowadays? Um, what IT channel do you have in mind? Ethernet, wireless, whatever. Oh, okay. Well, those are error-corrected; in the background, there's error correction going on. But for instance, think of transmitting satellite images. Just now the James Webb telescope unfolded and its pictures were transmitted, from, I don't know, thousands of kilometers away. The fact that they can take this information and get it down to Earth is because they put a lot of redundancy in it; they use error-correcting codes to make sure that this transmission actually works. Usually the actual noise depends a lot on the distance; it grows with distance, maybe quadratically. The closer you are, the easier it is. Okay, but for everyday applications, is it like every tenth bit is flipped, or is it way lower? Oh, this I think is way lower. I mean, for communicating over Ethernet, it's very low, extremely low. Thanks. Any final questions? Oh, one final question, go for it. Mine is a bit tangential, but I was too distracted by computers at school to actually learn any maths. So bits of this I kind of understand, because I've transferred a lot of data over my life, but at some point I need to go back and learn maths. Where would be a good place to start? Oh, actually there are many good places. In the GitHub repository that I'm linking here, the readme has all the links that I consulted when preparing, and there are a lot of nice instructional videos, for instance about logarithms, from extremely simple to brutally hard. So I think there are a lot of very nice explainer videos out there, for instance. Then it depends a bit on your taste. I've tried now to make it a bit accessible, but I know I'm a mathematician, so I don't know how much I succeeded. But I think usually playing around with things, trying to program it, for instance playing around with this Jupyter notebook, will help a lot. Yeah, cool. Thank you. You're welcome. Well, we've run out of time, but you should grab Christian outside the tent afterward.
Obviously, if this is a compressed version of his whole semester course, he's got a lot of knowledge in there that he can unpack for you. So let's thank him again for his talk. Thank you. Maybe one last thing: I would ask the organizers to have a Shannon field next time we do MCH, no? Like, there should be one. Without Shannon, we wouldn't be here. A Shannon field. Perfect.