Okay, thank you, Philipp. Thank you so much, and thank you for the invitation. I'm really glad to be here in this institute, which I gather is patterned after the Institute for Advanced Study in Princeton. So I think it's fitting that we talk about information theory here today. Because, you know, for an effort that is funded by Nokia inside a scientific institute, this fits very well: as you saw in Aspect's talk, you saw Claude Shannon's picture before, and information theory came from Bell Labs. And now Bell Labs, after many changes in ownership, is owned by Nokia. All right, so I'll give you a very broad type of talk without getting into any particular details, because I know the audience is heterogeneous. I don't want to bore you with a lot of details, but hopefully I'll give you a good feeling for what this discipline is like. All right, so of course it has to do with technology. So let me give you a few of the main highlights of the technologies under the purview of information theory. You name it: Wi-Fi, satellite communication. By the way, satellite played a very important role in the development of information theory, because it was really the first application where you had very low signal-to-noise-ratio channels and you wanted very high reliability, and that's where Shannon theory was really first applied. Internet access through copper wires, through coaxial cables and so on; the plain old telephone; fax systems; Skype, where you want to do exactly the opposite operation from the fax: in the fax you convert bits to sounds, in Skype you convert sounds to bits. And also digital storage technologies, like the CD, the DVD, the Blu-ray, where the communication is happening not across space but across time. Okay? So, digital storage of movies or images that converts video into bits; digital cameras converting images into bits. Hard disks also: you can think of a purely digital channel if you want, but there are also impairments caused by fingerprints or by dust or imperfections in the disks. And here another technology: this would be a picture from one of these Amazon big data centers. Really the main cost in running these centers is air conditioning, so information theory comes to the rescue in cutting your bill for air conditioning, and we'll see why. Or technologies for compressing information: WinZip, PDF files, all those are technologies found in every operating system, and we'll see what the role of information theory is there. All right, so that gives you a whole panoply of technologies, and now we're going to see how mathematics actually relates to these technologies. Okay, so really, if we are where we are in all the technologies that I showed, it's thanks to mathematics. Of course, physics also gets a distinguished second place, because it enables you to do what mathematics tells you to do: you can do it faster, you can do it smaller, you can do it with less power and so on. But physics is not going to tell you what to do with the chips; it's really mathematics. It's really the triumph of mathematics, and it's exciting because you get to see the power and beauty of the subject at the service of technologies that we use every day.
All right, so most sciences develop gradually, right? There is layer upon layer upon layer of contributions. Very few sciences really have a birthday, but information theory does have a birthday, and that's 1948, and it's really no exaggeration to say that information theory came out of nowhere. It was completely new. It was really the brainchild of a genius, Claude Shannon, who not only solved the main challenges but actually posed what the main challenges were, back in 1948, and it took a long time before technology was able to develop so that what he predicted back then actually had something relevant to say about the practical world. All right, so this is his paper in reprint form, A Mathematical Theory of Communication, the Magna Carta of the Information Age. And here's the first paragraph of the paper, where already it's very much motivated by technology: PCM, PPM, the exchange of bandwidth for signal-to-noise ratio and so on. Even though the paper was really a mathematical paper, he sent it to a journal, the Bell System Technical Journal, which was not really a journal read by mathematicians; it was a journal read by engineers. So he didn't put every detail of every proof there, and that created some criticism, because people were saying, well, you know, he wasn't really being rigorous. On the other hand, he wrote the paper thinking of such a wide audience that it also attracted a lot of people to the field. So there is a trade-off there as to how much detail you give and how wide an audience you want to attract, and I think he made a conscious decision to try to attract a wide audience. All right, so let me tell you a little bit about Shannon. Here's a picture from Life Magazine, 1950. He became instantly famous after this paper, so he was one of the icons of the post-war era, and back then research was very much in vogue; it's not like now. The Second World War had just been won largely thanks to scientific research, which had given great impetus to physics and cryptography and statistical signal processing and so on. So anyway, going very fast: this is the differential analyzer at MIT. This was the most advanced computing machine at the time, and we're talking about the late 30s. After his undergraduate studies, Shannon joined MIT as a research assistant and was programming this machine. This machine was able to solve sixth-order differential equations, and it used an old system, Lord Kelvin's method of integration with wheels and balls. And he noticed that, of course, the machine, as you see, is an analog machine, but it also had a digital part, because it was controlled by a series of relays. So he had this brilliant idea that he would apply Boolean algebra to the design of these relay circuits. This was his master's thesis, and it is really the beginning of computer engineering in a way: the idea of systematizing the design of logic circuits. That master's thesis actually made him quite famous; he got some important awards and so on. He went on to try to mimic what he had done here, in the sense of applying algebra to a specific problem. He tried to do that with genetics and wrote a PhD thesis on an algebra for genetics, but he never bothered publishing it, and then he joined the Institute for Advanced Study at Princeton for just a few months.
And then after that he joined Bell Labs, so he was at the New York location in the 1940s, and by the time the '48 paper appeared he had already moved to Murray Hill, which at that time was the largest industrial laboratory in the world. It still pretty much exists, and this is Nokia Bell Labs; the headquarters, I think, are there. If you want to know more about Shannon, there is now a book, A Mind at Play, that appeared for his centenary, which happened two years ago, in 2016. And there's also a movie that I was involved with, which is called The Bit Player, and we're now looking for distribution of this movie, maybe Netflix, maybe who knows what. All right, so let me tell you a little bit about what the technology was like around 1948. In the early 19th century there was the Morse telegraph. Of course, the telegraph had really taken off here in France by the time of Napoleon, but that was an optical telegraph that did not do any compression of information. The Morse code, though, is very relevant to information theory, because it has this idea of assigning short descriptions to frequent letters and vice versa, and that is something very important in information theory. Then came the wireless telegraph, Marconi; the telephone; AM radio, which was the killer application for vacuum tubes, the amplifiers; television; FM radio. And FM is also very important in the development of information theory, because with amplitude modulation the bandwidth of the transmitted signal is twice the bandwidth of the message, and at that time there was already quite a bit of incentive to use as little bandwidth as possible. So they thought, well, maybe if we just change the frequency a little bit as a function of the message, then we can make that bandwidth as small as we want. Well, it turned out to be just the opposite: the bandwidth exploded with FM relative to AM, but it gave you some advantages; you know, it sounds better, it has better noise immunity. And that's also a key ingredient in Shannon's thinking, because there you already have this notion of trading bandwidth for signal-to-noise ratio. So there are benefits to wasting bandwidth. FM gave you one; spread spectrum in the 1940s gave you even more: by using much more bandwidth than the message needs, you can even have signals that interfere with each other coexisting simultaneously in time and frequency. And then PCM, which actually was invented here, I think the first patent was in France. That's the technology used in the compact disc: you just sample at the Nyquist rate and then you represent each sample with a binary word. Pulse code modulation, yes. Perhaps in French you may know it as MIC, modulation par impulsions et codage. In Spanish they call it MIC, and maybe because they like to copy the French, I thought maybe in France it's called MIC too. Anyway, that's the technology of the CD; that's exactly what the CD does. And there again you have the ingredient that you have an analog signal and then you transform it to digital. Shannon was very interested in finding out whether, from the viewpoint of efficiency, that made sense or was just a waste. And as you saw here, in the first sentence of his paper he talks about PCM and PPM; PPM is pulse position modulation.
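To make the PCM recipe concrete, here is a minimal sketch (my own illustration, not from the talk; the 1 kHz tone, the 8 kHz sampling rate and the 8-bit words are all assumptions chosen for the example):

    import math

    fs = 8000           # sampling rate, above the Nyquist rate for a 1 kHz tone
    bits = 8            # bits per sample
    levels = 2 ** bits

    # sample the analog waveform
    samples = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(16)]

    # uniform quantization: map [-1, 1] onto integer codes 0 .. levels-1
    codes = [min(levels - 1, int((x + 1) / 2 * levels)) for x in samples]

    # represent each sample with a binary word, as in the CD
    words = [format(c, "08b") for c in codes]
    print(words)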
Okay, so as I was saying, this is just one of the few cases in science where a whole discipline comes out of the blue, and one of the greatest strengths of this paper is how easy it is to read. All right, so let me give you a very simple introduction to classifying these systems of storage and transmission. We're going to talk about both the medium and the message. Depending on what the medium is, you're going to be talking about transmission or storage. For transmission you can think of anything that can carry electromagnetic radiation: telephone wires, optical fiber, satellite broadcasting, microwave links, you name it. For storage, typically now we use semiconductor storage, volatile and flash memories; optical, for the CD and the Blu-ray; and magnetic, of course, which is also used in information technology. If we look at the message, then we have analog messages like sound, images, video, sensor measurements, but we also have a lot of applications where the messages are digital, like text, software, data files and so on. So a fundamental notion here, which Shannon noticed very early on, is that if you have a digital message, you can shoot for lossless reproduction of that message, an exact copy. But if the message is analog, that's impossible, because every real number is going to have an infinite amount of information, so there is no way that you can reproduce it exactly. You have to accept the fact that analog messages can only be reproduced approximately. Now, we also have the same kind of classification as far as the medium is concerned. You have inherently analog media, like radio or wires or vinyl in the case of LPs, or you may have digital media, like computer memory, the CD, the internet and so on. Now you may say, well, that's really a simplification, right? Because at the end of the day, all these things, computer memories or the internet, work with electricity, so at the end there are some analog waveforms that are really the ones carrying the information. And the CD is the same: even though we think of it as zeros and ones inscribed on the medium, what you see is also some kind of analog waveform if you look at it at a fine grain. And you could say also that in analog, at the end of the day, things are quantum and perhaps they are digital there after all. So it depends on how fine you want to get in the modeling. Anyway, this is a picture from Shannon's paper, and this is the coat of arms of communication theory. I just reproduced his figure, but basically we have three blocks: the transmitter, the receiver, and in the middle what Shannon called the channel. The purpose of the transmitter is to adapt the message to the channel, and the purpose of the receiver is to take the signal that comes out of the channel and recover the transmitted message, in the case of a digital message, or an approximation to the message, in the case of an analog message. All right, so this is a more modern, simpler representation of the same thing. The key problems here are really two. Even though it looks like we're dealing with a translation of sources into channel signals, there are really two problems, and the problems are transmission and compression.
And the leading idea here, the unifying idea, is redundancy. As far as compression goes, what we want to do is eliminate redundancy, eliminate what is not necessary, things that you can reconstruct at the receiver. Don't waste time sending redundant information; just send whatever is needed. And in transmission it's the dual, it's the other way around: you want to protect your data, so you actually need to add redundancy to it, so that when some of this data is damaged, you can still reconstruct it, because the data has this built-in redundancy, which is fully known to the receiver. Okay, so then Shannon says: well, really, out of this diagram, what I would recommend is that you study three special cases, and once you study these three special cases, then pretty much you'll know how to solve the most general case. So what are the three special cases? The first special case is when you don't have a channel, or better said, you have a perfect channel: what you have at the output of the encoder is what you have at the input of the decoder. If you have a digital message like a text, then you will be interested in reproducing that text perfectly. So the idea here is not only to write this text in binary, but to use as few bits as possible. That's one problem, where the main goal is to remove redundancy. Now what if we do the same thing again with a perfect channel, but with audio? Then the game changes, because now, as I said, it's really approximate reproduction. And this is going to be much more complicated, in the sense that the main goal here is going to be to fool the human ear or the human eye. We have to do a reproduction that is going to sound very much like the original, but using as few bits as possible. So how can we do that? And then finally, this one looks like the original one, but it is an important special case in which the input of the encoder is pure bits. These are just independent equiprobable bits; they have no redundancy whatsoever. And then you want to reproduce that sequence. Not perfectly, because when you have a channel, most of the time you have a random box here, and it's going to be impossible to reproduce that thing exactly with probability one. So you have to allow errors. And that was actually one of the good ideas of Shannon: that even in the case when you just have digital information, because you have a channel, you won't be able to reproduce that digital information perfectly; you'll have to allow errors. All right, so the first case is the case of lossless compression, and that technology is used all the time; for example, when you send a file through a modem or put it on a compact disc and so on, that uses lossless compression. The next one is lossy compression, like MPEG or JPEG and so on. And then finally, the idea of data transmission, where you are going to protect your data with error-correcting codes. So you want to remove redundancy in these two cases, and in this case, you want to add redundancy to combat the noise in the channel. All right, so now that we've made all those classifications, we can go back to the technologies that we saw before. And you see, these are all technologies that have an analog message in a digital medium.
So the medium here is inherently digital, and we just want to convert these sounds and images and pictures into bits. Now in this case it's the other way around: we have digital messages, like the fax, and we want to send them through an analog medium, for example the copper wire; or here we want to have digital transmission of information with satellites. Sometimes we also have digital messages and a digital medium. For example, here's one where we are storing stuff in a purely digital medium; the problem there is how you do the compression of these digital messages. And sometimes, even though you have digital to digital, you are familiar with this menu, because you may want to accept some degradation in quality for the sake of reducing the size, even if you have a digital file. You may have a digital picture, but you are willing to degrade it for the sake of reducing its size. So that would be a problem of lossy compression. All right, and then you have the plain old systems, like the telephone system, FM, analog TV, the cassette and so on. Those are analog messages in an analog medium. Now, analog messages in an analog medium looks like old technology, but in fact, of course, nowadays we also need to do things like that, right? We have our phones. Our phones, in fact, are converting analog information for an analog medium, because the electromagnetic waves are analog. Or television: almost everywhere now, television is actually digital, so you convert the message, which is analog, to digital form, and then you use those bits to feed a modulator to send them through an analog channel. The same with satellite radio. And the telephone we use nowadays is rarely analog; sometimes it's using PCM, and most of the time now it is using the Internet to carry those bits. Okay, so the pipeline in the last slide is: first we compress the signal, going from analog to digital; then we use a transmitter that goes from digital to analog; that goes through an analog medium; and then we undo all that: the receiver goes from analog to digital, and the decompressor goes from digital to analog. Now, this is the paradigm that is used most of the time in practice, and it has some advantages. For example, the person who designs this box and this box doesn't really have to talk to the person who designs this box and this box. They just need to agree on a data rate; the currency they exchange is bits. So you can have an expert here who knows a lot about how to compress sources, and an expert here who knows a lot about how to combat the noise in this channel, and you can divvy up the work between the two. And this actually comes from Shannon, this idea of breaking the task of modulating an analog signal into an analog signal by going first to digital. Shannon could show that, in fact, if you do this, then under some conditions, under some caveats, you don't lose efficiency. So by separating it into analog-to-digital and digital-to-analog, you can do it in a way that is as efficient as if you had said: no, I'm going to allow here the most general type of box, some box that goes from analog to analog. Okay, so this is what's called the separation principle, the fact that you can go through digital.
All right, so those are the three special cases. And as we saw in the last diagram, once you have solved those three special cases in isolation, you can put them together. You can send texts through channels by hooking up a lossless compressor to a data transmission system. Okay, that's what we saw before. So what is information theory? Well, information theory is the analysis of the fundamental limits of both compression and transmission. What do I mean by fundamental limits? Well, in the case of lossless compression, I'm going to make a mathematical problem out of this by saying: what I want to do is eliminate the maximum amount of redundancy. In other words, I want to find the minimal number of bits required for perfect reconstruction. In the case of lossy compression, I also want to remove redundancy, but now I'm going to require a given fidelity, for example a given signal-to-noise ratio: what is the minimum number of bits I need in order to do that? And finally, for data transmission, I'm going to have to add redundancy: how many bits do I need to include in addition to the raw data so that I can decode the information reliably? Notice that in each of these cases the figure of merit is a rate, bits per second or bits per character. In the compression cases, I want to minimize it; in the data transmission case, I want to maximize it. Okay, so the idea of making a mathematical problem out of a technology problem was not new at the time of Shannon, of course. And in the particular case of communication, there were already some successes, especially around the time of World War II, in estimation. In the estimation of signals, in order to reduce the effect of noise, both Wiener and Kolmogorov had come up with optimal filters based just on second-order statistics. But there's a big difference between their work and Shannon's: they were also able to find fundamental limits, but they found the fundamental limits of these estimation problems by actually constructing the optimal filter, the optimal solution. You find what the optimal solution is, you analyze what the optimal solution achieves, and that gives you the fundamental limit. Shannon did not have that luxury, because Shannon was facing a much tougher problem. The problem of designing an encoder or a compressor is actually much, much more difficult than the problems that were being dealt with at the time. So difficult that we still don't know what the optimal solution is. Even if I tell you exactly what the channel is and exactly what the source is, you can rarely find the optimal solution. So what Shannon did in order to find those fundamental limits: he was able to find what the fundamental limits were without actually telling you how to achieve them. How can that be? Well, he really invented the probabilistic method. The idea is that he said: I'm going to analyze not a specific system, but an ensemble of systems. I'm going to put a probability distribution on the set of all possible codes, and I'm going to show that on average the performance is going to be, say, below some threshold; for example, the error probability on average is going to be below a certain threshold.
If the average is below that particular threshold, that means that at least one system has error probability below that threshold. So that's an existence proof. Similar ideas were being developed at the same time by Erdős in combinatorics, but Erdős was a little bit lagging with respect to Shannon, because his first uses of the probabilistic method can really be seen as just counting, not so much as probabilistic tools. So it was really Shannon who came up with this method, and it's still the best way to prove theorems in information theory: to use the probabilistic method, which we call random coding. Okay, so we have three key aspects of these fundamental limits. First, this is the performance of the best encoder and decoder. Second, this is technology-independent, because Shannon poses the problem of designing these encoders and decoders without any type of complexity constraint. What that means is that information theory never really becomes obsolete. On the contrary, the more technology advances, the more relevant it becomes, because it means you can come closer and closer to the fundamental limits. And also, the nice feature is that it's really mathematics-driven design, unlike what was happening before 1948, where everybody was saying: okay, I have a particular expertise in frequency modulation, so I'm going to be very good at designing a circuit for demodulating frequency modulation; or someone else is working on coaxial cable, or someone is working on optical communications, and they each have their own domain of expertise. Shannon says: no, no, no, stop that. There's only one problem here. If you understand that problem, you don't need to specialize. You don't need to call yourself an optical communications person; you don't need to call yourself a microwave person. The principles are going to be the same. And that, of course, did not happen in a vacuum. If I had to choose the three most important influences on modern communication theory, I would select these three. Let me tell you a little bit about Markov in particular. The Markov chain played a very important role in Shannon's thinking, and it was only about 30 years old: the Markov chain really started in 1913, with the first paper Markov gave, where he was doing an analysis of Eugene Onegin, Pushkin's poem. He was looking at the probability that a consonant follows a vowel and so on, and found that a Markov assumption would fit well. Now, Shannon recognized that if you want to remove redundancy from a text, from natural language, the Morse code does it only at a very simple level, in the sense that it assigns a short code to E and a long code to Z, because one letter is more probable than the other. But language has redundancy not only because of that unequal probability of letters, but because letters have memory. If there is a Q, then most likely the next letter is going to be a U, and you can capitalize on that. So he said: well, maybe I can simplify and take a very simple toy model of language and model it as a Markov chain. And that was really one of his great strokes of genius: he said, language is really too complex, let me just make a very coarse assumption and assume that it's a Markov chain, maybe of a generalized order, where the next letter depends not just on the last letter but on the last five letters, say. But let's see what happens.
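To make that toy model concrete, here is a minimal sketch (my own illustration, not Shannon's actual example; the training string is made up): estimate letter-to-letter transition counts from a sample and generate text from the resulting first-order Markov chain.

    import random
    from collections import defaultdict

    # a made-up training sample; any text would do
    sample = "the quick brown fox jumps over the lazy dog and the queen quietly quits"

    # first-order memory: count how often letter b follows letter a
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(sample, sample[1:]):
        counts[a][b] += 1

    def next_letter(a):
        # sample the next letter in proportion to the observed counts
        # (every letter in this particular sample has at least one successor)
        succ = counts[a]
        r = random.randrange(sum(succ.values()))
        for b, c in succ.items():
            r -= c
            if r < 0:
                return b

    ch, out = "t", ["t"]
    for _ in range(40):
        ch = next_letter(ch)
        out.append(ch)
    print("".join(out))   # note: in this sample, every 'q' is followed by 'u'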
And of course, linguists like Noam Chomsky immediately came after that and said: what are you talking about? That's terrible. You cannot really model language with a Markov chain, even if the order is unlimited. Yes, it's a bad assumption, but it's a good assumption in the sense that you get a lot of mileage out of it when you're an engineer and you think that way. And when you prove theorems, too: you can prove theorems well beyond the assumption of Markov chains, for example for completely general stationary ergodic processes, but the proofs, even in that general case, work by doing Markov approximations to those sources. All right, so here's the idea. The idea of Shannon was really revolutionary at the time, although when we look at it with today's eyes we say: well, this is obvious, why wouldn't people think of doing that? He modeled the source as a random process, and he modeled the noise as a random process. The second part people already recognized: probability was already part of the toolbox of the communication engineer in 1948; people had done all sorts of analyses of, for example, frequency modulation systems. But actually modeling the source as a random process, as a Markov chain in the case of text, for example, that was more revolutionary. And the other thing Shannon said is: okay, I'm only going to worry about average behavior. What does that mean? It means I'm going to trust the asymptotics. Shannon's theory was really an asymptotic theory, in the sense that he's assuming that you're using the channel for a relatively long time, or that your sources of information are really long. All right, so let me illustrate this with a quote by a Frenchman, Jean-Luc Godard, who is a famous filmmaker. He said, all you need for a movie is a... Oh, he's Swiss? Okay, nobody's perfect. So he said, all you need for a movie is a gun and a girl. What does that have to do with information theory? Well, all you need for information theory is a log and a limit. Okay, a logarithm and a limit. This is what you always find when you study the subject: logarithms are everywhere. Where do the logarithms come from? Well, in the simplest setting, you can imagine that if I have to label the people in this room in bits, then what I need is the logarithm base two of the number of people in the room. Okay? That's really where logarithms come from. And the limit comes from the fact that, as I say, we're dealing with asymptotics: asymptotics in how long you use the channel or how long your text is. And then another thing, and this is really the key to the success of information theory, is the obsession with the toy model. If you try to really look at real sources, like images, or real channels, like the channel for the smartphone, these are very, very complicated objects. So what we do is work with a toy model, and any similarities with real sources and channels are purely coincidental. But by working with extremely simple models, we can learn a lot. You learn how to code for these sources and channels that are simple.
And the proof is in the actual technology: even though we're using these very simple models, they are extremely powerful. Okay? So here's an experiment you can do with information theory. You can ask your friends: how come the more bars you have, the faster your download is? If you ask this question to your computer science friends, they'll say: well, you have more bandwidth. But wait a second. No, you don't have more bandwidth; you have the same bandwidth. The bandwidth of the signals is already assigned, and you're not changing the bandwidth when you are closer to the base station. What happens is that your signals are affected by less noise, and therefore the system is able to discriminate more signals in the same amount of time, and therefore you can send more bits per second. Okay? So let's mathematize that, and there's Shannon's famous formula. This gives you the maximum number of bits per second, the capacity; that's the bandwidth of the channel, that's the power, and this is the noise level. You see that bandwidth appears both here and here. So if you fix this ratio and you let bandwidth go to infinity, the capacity does not go to infinity, okay? The capacity with infinite bandwidth is essentially just the signal-to-noise ratio there. And vice versa: if you have only one hertz of bandwidth, you can send one gigabit per second of information through one hertz, no problem; you just need a huge signal-to-noise ratio. So there is a trade-off between these things. Now, this formula is very nice, you can put it on t-shirts and so on, but it's very limited in scope, in the sense that it's for a very simple toy model, which is just the ideal Gaussian channel, a linear channel with an ideal rectangular transfer function. All right, so that's a formula that comes from Shannon's 1948 paper. All right, so information theory, by and large, is based on the study of these information measures. There are three key information measures: entropy, mutual information and relative entropy. In fact, Shannon only used the first two, and he didn't even call the second one mutual information; he didn't even introduce a symbol for it, but it plays a very important role. So entropy is a measure, for a discrete probability mass function, of how spread out that probability mass function is. You take the probability of an outcome, take its reciprocal, take the logarithm, and that's a measure of the surprise of that particular outcome; then you just take the average. So that's, as I say, a measure of how spread out a distribution is. And these measures, of course, are all in bits or in other information units, depending on what base you use for the logarithm; just like in physics, these information measures have units. Mutual information is a way to measure how dependent two random variables are. They don't have to be real-valued; they can be abstractly valued, these random variables. And then what is this? This is the relative entropy, here the relative entropy between the joint distribution and the product of the two marginals. This relative entropy is 0 if and only if P is equal to Q, and it's measuring, as I say, the distinctness between these two distributions. In this formula, the expectation is with respect to Y, and Y has the distribution Q.
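Since the slide formulas don't survive in the transcript, here they are as I believe they appeared, in standard notation (reconstructed, not copied from the slides):

    % Shannon's capacity of the ideal band-limited Gaussian channel
    % (here P is the transmitted power, N_0 the noise level, W the bandwidth):
    C = W \log_2\!\Big(1 + \frac{P}{N_0 W}\Big) \ \text{bits/s},
    \qquad
    \lim_{W \to \infty} C = \frac{P}{N_0} \log_2 e

    % The three key information measures, in bits (here P, Q are distributions):
    H(X) = \sum_x P_X(x) \log_2 \frac{1}{P_X(x)}, \qquad
    I(X;Y) = D\big(P_{XY} \,\|\, P_X \times P_Y\big), \qquad
    D(P \,\|\, Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}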
All right, so this was a measure introduced by Kullback and Leibler right after Shannon introduced entropy and mutual information in 1948, and they were really driven by the desire to extend these measures from discrete random variables to arbitrary random variables. This has turned out to be an extremely useful information measure that is used in many, many different subjects, and that's why it has so many names: Kullback-Leibler divergence, discrimination, information divergence and many, many others. All right, so you could view the role of information theory as a bridge, because it's the bridge that goes from the questions, and the questions are engineering questions like the ones I posed before: the minimal number of bits per second that I need to do such-and-such. That's a question the engineer would be interested in asking. And then the answer comes in terms of an information measure. Typically, for example, you will say: the minimal information content of a source, so that you can reproduce this source perfectly at the receiver, is the entropy. That's a very important theorem of Shannon's; even though he proved it only in special cases, we can prove it in very high generality nowadays. When you ask the question, what is the maximal number of bits per second I can send through the channel, it turns out that the answer is given by the maximal mutual information that you can establish between the input and the output of the channel, by optimizing over all possible inputs. And relative entropy is also the answer to other problems you can pose, maybe problems that do not deal with redundancy, but problems, for example, in detection and estimation, where it turns out to be a key measure; or in large deviations, where relative entropy also plays a very important role. All right. So here I put the emphasis on the theorems rather than the definitions, because in the layman's view of Shannon, he's always credited with the introduction of entropy, or the use of a concept from statistical physics in communications problems. And that is really not the main advance. The main advance is to be able to say that the entropy or the mutual information is actually the answer to a technology problem. Okay. So let me give you a very brief overview of some of the technology landmarks in the development of this field. Because, as I said before, Shannon tells you what the fundamental limit is, but he doesn't tell you how to achieve it. And since he poses the question without any constraint on complexity, you could even challenge this and say: well, if there is no constraint on complexity, maybe this is science fiction, because you may need such improbable complexity to actually come close to these limits. If an engineer is designing a Ferrari, the speed of light is really not very relevant, right? But if you are at Qualcomm and you're designing a wireless modem, you'd better know information theory, because, indeed, you can get very close to the Shannon limit with current technology. So let me spend just a couple of minutes telling you the main highlights.
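To put the two coding theorems mentioned above in symbols, as they would appear in any textbook (standard statements, with the usual caveats, e.g. a memoryless source and channel):

    % lossless source coding theorem: minimal rate for perfect reconstruction
    R_{\min} = H(X) \quad \text{bits per source symbol}

    % channel coding theorem: maximal rate for reliable transmission
    C = \max_{P_X} I(X;Y) \quad \text{bits per channel use}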
So Shannon already, in 1948, in his paper, makes reference to a code invented by a person in his group at Bell Labs, Hamming: the Hamming code, and that's a beautiful geometric object. In its simplest incarnation, you want to send four bits of information, but you're allowed to add three redundant bits, okay? And if you do it in a clever way, in the Hamming way, you can do it so that if one of those seven bits is received in error, then you can correct it, no matter where that error occurs. If there is more than one error, then maybe you won't be able to correct it; maybe you will even introduce more errors than you started with. All right, so that's a phenomenally beautiful object, and when people tried to come up with codes that would come close to the Shannon limit, for a long time they tried to continue along that path, what we call algebraic coding theory: trying to design beautiful mathematical objects, geometric objects, where all the codewords fill the space in a very regular way. Well, that turned out to be a wrong turn. People should not have tried to do that, because this is nice when you want to design for the worst case. You say: I know for a fact that I am only going to get one error. But Shannon was not playing that game in 1948. Shannon said: I know that my channel is going to introduce 10% errors; sometimes it will introduce exactly 10%, sometimes 7%, sometimes 15%. So it's a statistical model, an average type of model, not a worst-case type of model. All right, so then Elias in the 1950s came up with the idea of using convolutional codes, where the name really comes from convolution in a linear system: you have a long stream, or even an infinite stream, of bits coming in, and at the output you also have an infinite stream of bits. Simple encoding, but the decoding of those codes actually took 20 years, until someone came up with an optimal decoding algorithm that was also computationally feasible. It turned out that dynamic programming was the answer, and we discovered that in 1967. All right, then some other codes: algebraic-geometry codes, Reed-Solomon codes; those are the codes used in the compact disc, for example. Then in the early 60s, Gallager, in his PhD thesis, came up with this idea. Remember that Shannon in 1948 said: I don't know how to design codes; the only code I know is this Hamming code, but that is only block length 7, and if I want to go to the asymptotic regime and have long codes, I really have no clue how to do that. So, as I said, he used this idea of random coding: evaluate the performance on average, and then for sure there must be some code that achieves good performance. So Gallager said: let me borrow a key idea from there. I'm going to analyze not a particular code but an ensemble of codes, except that he didn't pick all possible codes; he restricted his choice to a very manageable class, which I'm not going to get into in detail. And he quickly forgot about these codes; in fact, the world forgot about these codes. He even wrote a book six years later, and he doesn't even mention these codes in that book. So that was kind of forgotten. And then, you see, there is a big gap before the next highlight.
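Going back to the Hamming code for a moment: here is a minimal sketch (my own illustration, not code from the talk) of the (7,4) code just described, checking that any single flipped bit is corrected:

    # (7,4) Hamming code: 4 data bits, 3 parity bits, any single error corrected.
    # Bit positions 1..7; parity bits sit at positions 1, 2 and 4.

    def hamming74_encode(d):
        """d: list of 4 data bits -> list of 7 code bits."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4          # covers positions 3, 5, 7
        p2 = d1 ^ d3 ^ d4          # covers positions 3, 6, 7
        p3 = d2 ^ d3 ^ d4          # covers positions 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(r):
        """r: list of 7 received bits -> corrected 4 data bits."""
        s1 = r[0] ^ r[2] ^ r[4] ^ r[6]   # parity check over positions 1,3,5,7
        s2 = r[1] ^ r[2] ^ r[5] ^ r[6]   # parity check over positions 2,3,6,7
        s3 = r[3] ^ r[4] ^ r[5] ^ r[6]   # parity check over positions 4,5,6,7
        pos = s1 + 2 * s2 + 4 * s3       # syndrome = position of the flipped bit
        if pos:
            r = r[:]
            r[pos - 1] ^= 1              # correct the single error
        return [r[2], r[4], r[5], r[6]]

    # no matter which of the seven bits is flipped, the data comes back intact
    c = hamming74_encode([1, 0, 1, 1])
    for i in range(7):
        r = c[:]
        r[i] ^= 1
        assert hamming74_decode(r) == [1, 0, 1, 1]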
And what was happening here is that people were actually losing faith in Shannon, because it was already 45 years later and nobody had really come up with codes that could come close to what Shannon had promised. People started saying: well, maybe there are other measures, maybe capacity is not the right measure; if you're an engineer, maybe you should be looking at other things, something called the cut-off rate, other stuff. People said: maybe what happens with capacity is that if you really want to come very close to it, you have to have an incredibly complicated machine. All right, now 1993, in France, there is the big discovery: the turbo codes. This was at ENST Bretagne. Nobody knew these people. And the paper was just simulation. They were combining some of the stuff that Elias had done here, convolutional codes, with a random ingredient, a random interleaver, just like Gallager had in his low-density parity-check codes. And they were showing amazing performance, very close to the Shannon limit. The reaction was that nobody believed them. Nobody believed that these simulations could be right. So what did people do? They tried to reproduce them, and, lo and behold, they were saying: yes, this works; indeed, the numbers these guys have are true. So that was the revolution. And then some people also rediscovered this idea of Gallager's that had lain dormant. And that's really the beginning of modern coding theory, the fact that we now have this technology in our phones. And the key was that they went back to 1948. In 1948 Shannon said: I don't know how to design the best code, but let me do it on average. And in fact, Shannon showed not only that there is a code that is going to be almost optimal, but that if you choose a code at random, with very high probability it's going to be an excellent code. So then you say: why are we worrying about this? If we select a code at random and it's good, why do we need to spend any effort on this? Well, yes, it's going to be good, but if you pick a code at random, it will have no structure, so you won't be able to decode it, because you cannot just list all the codewords of a code and then go one by one testing which is the most likely one. Why not? Because you may have two to the 100,000 codewords, two to the million codewords. Brute force is not an option. So what Berrou and Glavieux noticed is that, using their structure, it was easy to come up with a suboptimal decoder, but a suboptimal decoder that did a very good job. In other words, it was essentially like a crossword puzzle: first you try to decode the horizontal, then you go to the vertical, then you go back to the horizontal, and so on. They had two convolutional codes, and they decoded by first going to one code, then to the other, and so on. And the key idea in their principle is that you don't make hard decisions. At each point you don't say: that one, I'm sure it's a zero; that one, I'm not sure, but I would bet it's a one. No, what you do is say: I think that's a zero with a certain probability.
So you are always keeping track of the probabilities of what you think the original data was. You exchange soft information, and that's a key idea, too. So random coding, the idea of Shannon, got revived, but also the idea of not insisting on optimal decoding: allowing suboptimal decoding, belief propagation, iterative algorithms like that, which can be shown to be extremely powerful. Irregular LDPC codes, and we have here one of the experts on this: you can design these LDPC codes to have excellent performance and get as close as you want to the Shannon limit with very good complexity. And finally, the most important milestone in this race was the polar codes. Those are also capacity-achieving codes, but you can actually show that they are capacity-achieving, with a pretty elegant mathematical proof. And now, in fact, they are being pushed for some applications; even though originally this looked more like a class of codes of academic interest, because the trade-offs of delay and performance may not have been the best, they are now attracting a lot of attention for wireless communications and so on. All right, so now let me tell you a little bit about the milestones in lossless data compression. Shannon came up with a code, the Shannon code, in 1948, which is very interesting but is not used in practice. Then Huffman came up with a code, which is a prefix code and has the minimal average length. But the problem is that this only works well with sources that do not have memory. If you want to capitalize on the memory of the source, not only on the fact that the letters are not equiprobable, then this does not scale. So the Huffman code is not what you want to use. For that purpose, you can use the arithmetic coding of Rissanen, or you can use, and this is what you have in every computer, the Lempel-Ziv code. There are actually several Lempel-Ziv codes, but, as I always say, this is the most successful machine learning algorithm there is to date. Why machine learning? Well, because it really does that: it's an algorithm that you can code in just ten instructions, and it is able to learn the statistics of the source. You don't have to tune this algorithm to the specific source that you are encoding. You can feed it a text in French, you can feed it a text in Icelandic; if the text is long enough, so that it can learn, then it turns out that asymptotically you don't pay a penalty in rate for not knowing the source. You do pay a penalty non-asymptotically, because up until the point where you have learned the source very well, you're not going to be efficient. The proof that Lempel-Ziv actually is an optimal algorithm, in the sense that it achieves the Shannon limit, is one of the crown jewels of information theory: the fact that a simple, extremely efficient, linear-time algorithm can get you to the Shannon limit and essentially works for every stationary ergodic source. That's really one of the big highlights. Can you do better than that? What do you mean, better than optimal? Well, in the short run: you can learn things faster than Lempel-Ziv by other methods, but those methods are much more costly to implement and the software complexity is higher. So in most applications, people use Lempel-Ziv.
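To make the Lempel-Ziv idea concrete, here is a minimal sketch of LZ78-style incremental parsing (one member of the Lempel-Ziv family; my own illustration, not the exact variant discussed): the dictionary of phrases is learned from the data itself, which is exactly the "machine learning" flavor described above.

    def lz78_parse(text):
        """Return the LZ78 parsing as (dictionary index, next symbol) pairs."""
        dictionary = {"": 0}
        phrase = ""
        output = []
        for ch in text:
            if phrase + ch in dictionary:
                phrase += ch                 # extend the longest known phrase
            else:
                output.append((dictionary[phrase], ch))
                dictionary[phrase + ch] = len(dictionary)
                phrase = ""
        if phrase:                           # flush a possible final phrase
            output.append((dictionary[phrase[:-1]], phrase[-1]))
        return output

    print(lz78_parse("abababababab"))
    # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, 'b')]
    # the phrases grow, so each pair describes a longer and longer substring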
Okay, now there are sources for which we really don't know whether the technology is already as mature as that or not. For example, digital images: if you want lossless compression of digital images, we don't really know how good the current solutions are, because we know how to exploit redundancy well when we have Markov structures. When we go to the world of real images, we don't really understand what the actual ensemble of images we are interested in is. Of course it's a tiny, tiny fraction of all possible images, but we still don't understand it. And this is in the domain of lossless compression. When we go to the domain of lossy compression, we have to worry not only about the source but also about the eye and the ear, because we need to understand how they work so that we can fool them. And that adds yet another layer of complexity. So we saw PCM in 1938; at the same time there was this vocoder they showed at the Universal Exposition in New York. This was a system where they could program a machine to talk: there was an operator who would type a text, and then this machine would read the text. And of course, you also have the dual operation of listening to a sound and converting it into bits. Now, in our phones, we have much more sophisticated algorithms to do that, to convert these voices, this audio, into bits. All right, so the latest things would be JPEG, MPEG, and so on. Now, interestingly, I was saying before that if you are an engineer at Nokia or Qualcomm designing cellular phones to transmit information through these complicated channels, you really need to know information theory. However, the people who designed JPEG, MPEG, and so on did not necessarily know a lot of information theory. So in this area of lossy data compression, the gap between theory and practice is still quite a bit bigger than in the other two. There is still a lot of work to do here, in terms of understanding more about how the human brain works, and in coming up with algorithms that are able to do what we can do in lossless compression: learning about the source and adapting themselves to the source. How are we doing with time, Philipp? Okay, all right. So, you know, information theory is more than just finding those fundamental limits, and more than coming up with technology that will meet those fundamental limits. There is a lot of technology that has actually been spurred by information theory; information theory is also used as a design driver, and Shannon noticed that from the beginning. So let me give you a glimpse of some of the problems that are of contemporary interest, because sometimes when you tell people, I work on information theory, they say: oh, are people still working on that? I thought that had been done in 1948 by Shannon. Yes, okay, we still have a few problems left. One of the problems is the non-asymptotic regime, and this is something that I've been doing the last few years. Essentially, the question is: what is the finite block length capacity? As I told you before, Shannon's idea is to use the channel for a long time. But what if the block length is only 1000? And that's typically what happens.
They'll tell the engineer: look, you cannot use block lengths of more than 1000, because there are problems of delay and so on; you have to do it with 1500, whatever. All right, so how relevant is the Shannon limit to that? It turns out that you don't have a clean answer; you don't have a formula that gives you the capacity, because the capacity is just the limit of the best rate as a function of block length, when block length goes to infinity. So here I'm plotting the best rate that you can achieve, the highest number of bits per channel use, as a function of block length. The longer the block length, the better. Why? Because you are less at the mercy of the channel noise. In the short run, you may be subject to bad luck; in the long run, there is no such thing as bad luck or good luck, you're going to converge with probability one. So there is a gap to capacity that you have to accept because of the finite block length. We don't know a formula for that as a function of block length, but we know good upper and lower bounds. And as you see here, this is for a specific channel, a very simple channel, but you can see that in this case the bounds are quite tight. For the purposes of the engineer, knowing things like this is very important, because if the technology you have developed is at that point, then by being more clever you'll only be able to bridge half of the gap; the other half is unbridgeable, because of the finite block length penalty. Yeah, so, you know, in these single-user channels, where you only have one transmitter and one receiver, there are also problems of feedback (oh, I don't know what I did here). We don't really understand feedback very well in connection with information theory. Shannon wrote a very important paper on that, where he showed that if you have a channel without memory, then feedback is not going to help you: even if you know exactly, at the transmitter, what the receiver sees, you cannot get better capacity. When you have deletions and synchronization errors, that's also a long-standing open problem. A lot of the problems that are still open have to do with what are called multi-user channels, where we have more than one transmitter and more than one receiver. For example, you may have crosstalk: transmitter A only wants to transmit to receiver A, and B only wants to transmit to B, but there is leakage of information from A to B and from B to A. Then you may want to find the best rates that we can establish for these pairs, and this is still an open problem; even with very simple channel models there, it turns out to be open.
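For reference, the shape of the answer in this non-asymptotic regime is usually summarized by a normal approximation of roughly this form (my paraphrase of the finite-blocklength literature, not a formula read off the slide):

    \log_2 M^*(n, \epsilon) \approx nC - \sqrt{nV}\, Q^{-1}(\epsilon)

Here M*(n, ε) is the maximal number of messages at block length n and error probability ε, V is a channel parameter called the dispersion, and Q^{-1} is the inverse Gaussian tail function; the second term is precisely the finite block length penalty just described.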
Okay, so broadcast channels, where you have one transmitter and several receivers, but you may want to send different information to every receiver, or every receiver may have a different signal-to-noise ratio: that is also a very tough problem. And I said at the beginning that if you know how to solve those three special cases, then you know how to solve the whole thing, but that comes with one caveat. When I showed you that diagram where you went from analog to digital, from digital to analog and so on, and I said Shannon showed that this is optimal: he only showed it in the long run. When you have systems with these delay limitations, that's no longer optimal, and then there is something to be gained by doing joint compression and transmission of information. So rather than following the paradigm of first removing all redundancy from the source and then adding redundancy tailored to the channel, there might be something to be gained by doing it in one shot. Another very tough problem is how to do compression of short sources, short messages, and how to do this optimally. Here you have several issues to worry about. One is that, because you have short messages, it's hard to learn; of course, you can learn from previous messages, if indeed these messages are put out by the same source. The second is that you cannot rely on asymptotics, so entropy is no longer the name of the game here, because entropy is only an asymptotic answer to the question. So this is still, by and large, open, although there has been progress. As I was saying, in lossy data compression the gap between theory and practice is larger than in the others; constructive schemes are more like an art, where people come up with tricks to fool the ear and the eye, with JPEG and MPEG and so on. And then you can go to the multi-user counterparts of these problems. Say you have a left microphone and a right microphone; these are different sources, but of course they are correlated. So having a separate analog-to-digital quantizer for each of these microphones, treating each as a single source, is wasteful, because the dependence between both messages enables you to save bits. How do you save those bits? That's something we know how to solve in very simple special cases, but a lot of work remains to be done. And, you know, another thing that is beautiful about this subject is that it has so many intersections with all sorts of disciplines. Of course engineering; it came from engineering. But mathematics: it has had an important role, not just in consuming mathematics to solve its problems, but in having something to say that is of interest to mathematicians. For example, in ergodic theory, Kolmogorov immediately realized that what Shannon had done was very important for ergodic theory: there was the isomorphism problem, where entropy is really the key. In large deviations in probability, relative entropy is the key. Typically the question there is how you explain very improbable events; these improbable events may have a lot of different explanations, and large deviations tells you that you only need to analyze the most likely among the unlikely explanations. And how do you define "most likely"? Well, it's defined in terms of relative entropy: the
In functional analysis, uncertainty principles: information theory has had an important role there. In convex analysis, it has produced alternative proofs for a lot of inequalities, like Brunn-Minkowski and so on. In measure concentration, Talagrand has also used quite a bit of information theory, in isoperimetric inequalities and so on. In combinatorics, a lot of good combinatorics has come out of information theory, for example the notion of graph entropy. Random matrices is also an area that saw quite a bit of excitement, because it was very relevant to wireless communication systems, for example multi-antenna arrays, CDMA communications and so on; there was a lot of excitement about using fairly recent random matrix results.

All right, physics. Physics is becoming more and more probabilistically oriented all the time, and the push from quantum physics is also very important: quantum information theory. Of course, quantum information theory developed with some lag with respect to classical information theory, but a lot of the results that were proven in classical models now have counterparts in quantum information theory. There, the issue I mentioned before as an open problem, the issue of finite block lengths, of non-asymptotics, is even more important than in the classical case: if in the classical case I was saying that sometimes you only have a thousand bits you can send in each of these packets, in quantum it is maybe five qubits or ten qubits or something like that. At least for the rest of the lifetime of people like today's speakers.

Computer science. Kolmogorov had this idea that you could define complexity without introducing a probabilistic model the way Shannon does. Shannon essentially says: you introduce a probabilistic model, and then entropy is the answer to this very important problem of compression. Kolmogorov said: if you have a specific algorithm that does compression, then you can classify the complexity of objects just by looking at the length of the output of that compressor. But then you say, well, this is not very fundamental, because it is tied to that particular algorithm. So he said: actually, I am going to allow you to use the best possible algorithm for the object you are trying to compress. Now, the best possible algorithm always depends on what computational model you are using, but that is only going to add a constant. So he came up with this notion of Kolmogorov complexity (the formal definition is recalled after this passage), which turns out to be quite related to entropy when you go to probabilistic sources. It is really a beautiful object; the problem is that you cannot measure it, unlike entropy, and it is a purely asymptotic measure of complexity.

In theoretical computer science there is a school of people who do information theory. They tend to be much less worried about applications than people like us who come from electrical engineering, and they tend to worry a lot about problems of interactive communication and so on. I think it is fair to say that they have not really had a huge practical impact, in contrast to the Shannon crowd, who in general are very much attuned to what is happening in the real world.
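For reference (again standard definitions, not the speaker's words): the Kolmogorov complexity of a string \(x\) relative to a universal machine \(U\) is

\[
K_U(x) \;=\; \min\{\,\ell(p) \;:\; U(p) = x\,\},
\]

where \(\ell(p)\) is the length of the program \(p\); and the invariance theorem says that for any two universal machines \(U\) and \(V\) there is a constant \(c_{U,V}\), independent of \(x\), such that

\[
|K_U(x) - K_V(x)| \;\le\; c_{U,V} \quad \text{for all } x,
\]

which is what justifies speaking of "the" complexity of an object up to an additive constant.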
Economics: portfolio theory, and there is also what they call the economics of rational inattention. There is a lot of economic data that cannot be explained by decision makers who have full access to the information; it turns out that people act in ways that are not so rational. So there has been a model that says: people get the information, but they get it through a channel that has finite capacity. By introducing that into the model, they are able to explain some of the experimental data. But of course that becomes more of a social science, and as soon as you put humans into the loop it becomes quite nasty.

And then biology. Of course, 90 percent of science now is life science, and there is more and more push for using information theory on biological problems. For example, if you want to do DNA sequencing, you will want to compress those DNA sequences in an efficient way. But those sequences are very far from having Markov structure or anything like it, so we are really at the very infancy of knowing how to go about exploiting the redundancy there.

So anyway, I think I am going to stop here. Hopefully I have given you a feel for what this field is about, and for what Shannon's contribution was, at a time when technology was very much driven by specific problems rather than by an overarching theory. I think it is one of the great triumphs of mathematics, one that bridges mathematics with the real world. Hopefully some of you will get interested enough to do some reading, and if you are going to start reading something, I would start with Shannon's paper. Thank you very much. Yes, please.

To your knowledge, do the people who have developed the cell phone network take these things into account? Because as a user, I find that the sound quality on a cell phone is so much lower than it used to be with the standard telephone. Did they take this kind of analysis into account? What drives the market, such that at the end of the day the quality is so bad?

You are absolutely right, and I think what happens is that especially the younger generation does not give a damn about quality. I see my daughter sometimes: she is watching TV and I say, do you realize that you can watch this in high definition? You are watching it in standard definition. "I don't care." Or one day I saw her and her friends congregating at the kitchen table, and I said, what is that sound? They were listening to music emanating from a cellular phone, which I cannot stand. So I think we could have Skype-quality phone calls, because given how fast your phone can download data, voice is not much of an issue. It is really driven by what the market is demanding, and the market demands less and less quality all the time. That is why they get away with it.

There is a place where it is especially annoying: when somebody is interviewed on the radio, because the contrast with the standard quality of the radio is so large that it is really annoying. I remember there was a time when I was on the scientific advisory board of Orange, and they were trying to implement some kind of high-quality phone, obviously a total failure from the market point of view. The idea was precisely to provide journalists or people like that with high-quality telephones, and apparently it was a failure.
How does that fit in? Does it mean that the bandwidth, the rate, is very small?

No, no. If there was a need and a push, if consumers would say, I am not going to use this company because they give you very bad quality, I only want something that gives me Skype quality, then they would all do it immediately, because the technology is there now. Don't get me wrong: it does not mean that when you are in the subway you will get excellent quality, because you cannot go against the laws of physics; you may have a very bad signal-to-noise ratio, and there are no miracles you can do with that. But in normal conditions we could have much better quality.

Even the fixed phones sound worse than they sounded 50 years ago.

Exactly, exactly. There is the compression, and then there is the delay and so on. These vocoders have made a lot of advances, but it is still going from analog to digital with very limited data rates. I do not know whether in the future, as technology improves, people will demand better quality, but I think people are getting used to it. A lot of it is free: you get YouTube and you watch it for free, so then they say, well, if the quality is not that good, I am willing to put up with it. So to answer the question, it is really driven more by consumer demand than anything else.

True. I remember a story about a radio station. They say that at the beginning, when they turned from analog telephone to digital telephone, the quality was so good that when they were interviewing a journalist in New York from Paris, the quality was absolutely perfect, and people complained, saying it is not possible, the guy must be in the next room. So they decided to degrade the quality of the reception to give the feeling that the journalist was truly far away.

They do this in New York. There is a station that gives you traffic information every ten minutes. The guy is just sitting in a studio reading it, but they put in the noise of a helicopter engine, so that you can barely hear him, because then it sounds like he is actually in a helicopter watching the traffic. And there is also comfort noise; I do not know whether you have heard of this. Comfort noise concerns two-way communication. In the plain old telephone system you actually had instantaneous two-way communication, so you could talk and listen at the same time; there was a decoupling and so on. In modern systems it happens essentially one direction at a time, and it is very annoying if, when you are talking, you do not hear anything, or if there is a gap in the conversation and you do not hear anything: you may think the call has dropped. So they inject comfort noise, but it is generated locally.

Actually, I was in such a studio 30 years ago. Perhaps it is time to also discuss another question, about sphere packing in higher dimensions, which also came up,
about lower bounds, the Minkowski bound and so on. Does that play any role here?

Yes, of course, and it goes back to what I was saying about the Hamming code. Originally people thought: if we are going to be able to meet Shannon's promise, then we should really go down this route of what we call algebraic coding theory and build these beautiful geometrical designs. But that solves a problem which is a bit different from what Shannon had posed, and indeed there have been very nice mathematical advances, and people work on this, but the impact on practice is a bit limited because of that. The great triumph of codes has been to get away from algebraic coding theory and to use codes that have a random component and are decoded with suboptimal algorithms. That has been the key.

My feeling is somehow that this theory, information theory, is based on thermodynamics. It is very curious to see this entropy formula in this context. Some people say that actually Darwin, Boltzmann and Shannon were all about the same thing. And actually Shannon was absolutely not convinced that "entropy" was a good term here; he asked von Neumann about it, and von Neumann answered something like: nobody knows what entropy is, so call it entropy.

Yes, that is an apocryphal story, even though it has appeared in Scientific American and so on.

What I want to tell you is that you mentioned ergodic theory, and ergodic theory is absolutely not thermodynamics. And yet this whole thing seems to me a bit marked by thermodynamics, and thermodynamics is absolutely not, how can I put it, a modern vision of science; it is somehow the 19th century.

Yeah. So first of all, let me address the issue of von Neumann. The apocryphal story is that von Neumann told Shannon at the Institute for Advanced Study: this object, you should call it entropy, because then you will win every argument, since nobody knows what entropy is. Shannon was later asked about it and said: no, I never talked to von Neumann; he was a big shot, I was just a postdoc there, I really never talked to him. Moreover, he did not need von Neumann to tell him that that formula was entropy. His 1948 paper already has a reference to a book by Tolman that tries to come up with a quantum counterpart to the Gibbs entropy, and it has exactly the same formula, except without the minus sign. So it goes without saying that Shannon was well aware that the same concept, or very similar concepts, had been used in thermodynamics. If you go to Gibbs's book, you find some of the same manipulations; in fact, you find special cases of relative entropy in his work as well. So from the technical viewpoint there are some intersections. But what you are absolutely missing between that scientific discipline and this one is that this one is asking questions about information technology: what is the maximum number of bits of information you can send, and so on. That is completely absent from thermodynamics. And also, in physics, correct me if I am wrong, you cannot measure entropy directly; if you want to know the entropy of something, you have to measure temperature and other quantities and then plug them into a formula. Here, we can.
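For comparison (a standard juxtaposition, not a slide from the talk): Shannon's entropy of a distribution \((p_1, \dots, p_n)\) and the Gibbs-Boltzmann entropy of statistical mechanics are

\[
H \;=\; -\sum_i p_i \log_2 p_i \qquad \text{(bits)},
\]

\[
S \;=\; -k_B \sum_i p_i \ln p_i ,
\]

so the two expressions agree up to the constant factor \(k_B \ln 2\); the difference between the disciplines lies in the questions asked, not in the formula.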
In fact, with one of those universal algorithms, like the Lempel-Ziv algorithm, you can take a long text, apply the algorithm, and just count the size of the output; that is going to give you a very good estimate of the entropy of that object (a toy version of this is sketched below, after this exchange). So there are indeed differences, and important differences; just because some of the language or some of the alphabet we use is common, that does not mean there is a lot of overlap between the two fields.

Information is not real, it is not physical, it is something different, because, for instance, energy needs space; information does not need anything.

No, you are right, it is not really physical. And some people say the opposite; some say information is physical.

We know Landauer's words; people who ask where the idea of quantum information comes from cite Landauer, and Landauer said information is physical: if the physical object you use to encode information is based on new principles, entanglement and things like that, then the information itself should be different.

Yes, like Maxwell's demon, for example: one of the explanations is that there is information that the demon has to forget. And what happens now is that a lot of the analysis of these sparse-graph codes, and now I am talking about the particular technology that has been so successful in getting close to the channel limits, turns out to be enlightened by statistical-physics principles, because we have to look at ensembles, and from local properties we want to derive global properties. So at the end of the day we do go back to Boltzmann.

That is the comment I wanted to make when you made yours: the real relationship is about statistics, about counting things. Statistics in Boltzmann, leaving the notion of temperature aside, is just counting things, and then it is very close to what he explained to us.

I really liked that you presented the first page of Shannon's paper, because in the last sentence there is a word in italics, and the word is "meaning."

Oh yes. That is really a key point.

Because today you have really different methodologies, like hyperdimensional computing and things like that, where the encoding has no meaning anymore. You understand me?

Yes, and he got huge criticism for this: what, semantics plays no role? Oh my god, how can this be, semantics plays no role! Especially from the social sciences. And then: information theory needs semantics to be broadened, and so on. Seventy years later, we are still not there yet. But I like that sentence because there is something humorous in it: "Frequently the messages have meaning." I think less and less frequently.

It is very important that, as you say, letter A, letter B and so on have a real meaning, because the encoding is built that way, for things that have meaning. What I am saying is that we can search in a really different direction: with these distributed methods you encode things in a hyperdimensional way, and the things have no meaning anymore.
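As a toy illustration of that entropy-estimation remark (my sketch, not the speaker's; it uses Python's standard zlib module, whose DEFLATE algorithm is LZ77-based, as a stand-in for a Lempel-Ziv compressor):

```python
import zlib

def entropy_bits_per_char(text: str) -> float:
    """Crude entropy-rate estimate: compress the text and count
    the size of the output, in bits per input character."""
    data = text.encode("utf-8")
    compressed = zlib.compress(data, 9)  # DEFLATE: LZ77 + Huffman coding
    return 8 * len(compressed) / len(data)

if __name__ == "__main__":
    import random
    import string

    # Highly redundant text: the estimate should come out far below 1 bit/char.
    redundant = "the quick brown fox jumps over the lazy dog " * 200

    # Text drawn uniformly from 27 symbols: the true entropy rate is
    # log2(27), about 4.75 bits/char, and the estimate should land near it.
    alphabet = string.ascii_lowercase + " "
    noise = "".join(random.choice(alphabet) for _ in range(len(redundant)))

    print(entropy_bits_per_char(redundant))  # small
    print(entropy_bits_per_char(noise))      # close to 4.75
```

The estimate is only good for long inputs, which is exactly the asymptotic caveat raised earlier: the compressor's constant overhead swamps the count on short messages.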
So, one place where one finds compression and transmission of information nowadays is machine learning, neural networks. Is there an information-theoretic analysis there?

Well, a lot of people are trying. You see, we used to get a good pool of mathematically oriented people from engineering schools who liked to do information theory, and a lot of that pool is now going into machine learning. Machine learning, of course, is mostly very software-oriented and experimentally oriented, but there is also a nugget in it that is very solidly grounded in probability and statistics, and they do tend to use some of these tools. Still, we are far away from having a success story like the Lempel-Ziv algorithm I mentioned. That is a machine-learning algorithm, even though we do not usually think of it as such: it is an algorithm that is able to learn, and asymptotically it compresses as well as if it had known the statistics of the source, even though it knew nothing to start with. That is really the kind of theorem you would like for deep networks, for deep learning: a theorem like that, which would give you performance guarantees.

Because, for example, translation: translation is an information-theory problem, right? From language to language, from code to code. And there is no analysis of it from information theory, right?

Right, right. At this point in time, our success stories are really the toy models. Toy models have given us so much mileage; we have learned so much from these toy models, the binary symmetric channel, the binary erasure channel (their capacities are recalled in the block after this exchange). Once you understand those, you are almost there. As for understanding machine learning, I guess one day we may get there and say: okay, now we understand this. But for now we are really in the dark. For the first time, with deep learning, the technology comes before the theory.

And there is nothing wrong with that. Take the turbo codes: Glavieux and Berrou had no clue why they were working. Berrou knew some coding theory; Glavieux was really more like a computer engineer, I think; and they had a student who was doing a lot of numerical experimentation. They had very good intuition, but they were completely unable to say: it works because of this, or, I can analyze it. And I think that was very good, because they were willing to do all this experimentation, and that is excellent. In the US, coding was very much driven by the academic community, and in the academic community a paper like that, sent to the main journal in the field, would have been rejected: it would have been a paper full of simulations, and the first thing they would say is, no, we do not publish this type of paper in this journal, we want to see some analytical results. So a lot of people were discouraged from this idea of just experimenting; it is simply not the way the subject developed. It developed in a much more systematic way.
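Since those two toy channels carry so much of the theory, it may help to recall their capacities (standard formulas, not from the talk): for the binary symmetric channel that flips each bit with probability \(p\), and the binary erasure channel that erases each bit with probability \(\varepsilon\),

\[
C_{\mathrm{BSC}(p)} \;=\; 1 - h(p), \qquad h(p) = -p\log_2 p - (1-p)\log_2(1-p),
\]

\[
C_{\mathrm{BEC}(\varepsilon)} \;=\; 1 - \varepsilon \quad \text{bits per channel use.}
\]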
But could that also be a way to understand the functioning of the brain, for example? There could be a link there as well.

Well, Shannon was very interested in that at some point later in his life, and there are people who claim: yes, our information theory can help you with that. Honestly? In 2018, honestly, no. Maybe 30 years from now we will have something to say, but so far, no.

You mentioned finite block lengths, and that there are some results. Have there been results on where the limits are in that regime, and on how far we are from them?

Yes. So, these graphs I showed... sorry, it is going slowly... the graph I wanted to show you... I do not know why it is going so slowly; it must be the capacity of Apple software. Anyway, the answer is yes. The only thing, but a very important thing, is that in the asymptotics we have nice formulas, because you can rely on average quantities like mutual information and so on. In the non-asymptotics you have to work harder: you do not have a formula, you have upper and lower bounds. And the nice thing is that for those upper and lower bounds we have been able to come up with clever ideas that make them pretty tight. Did I lose the slide already? I think I lost it; it is going so slowly. Anyway, the graph I showed had the capacity here, and then there were these two curves that look like this: this axis is rate and this one is block length. Now, to get this curve, what we call the achievability curve, the lower bound, we use random coding, the same thing as Shannon did. Random coding, if you think about it, this idea of just using codewords chosen at random, is actually going to work very well when the block length is large. But if the block length is very small, say the Hamming code of block length seven, you do not want to hire a monkey to design your code, right? You want to place those codewords very carefully. So in the regime where the block length becomes small, random coding is no longer optimal, but we do not really know good tools for coming up with optimal codes in that regime. That is why at that point these curves become vertical: the uncertainty between the upper and lower bounds is very high. I think that is really one of the areas that is going to see more growth in the future: how to design systems that work for smaller block lengths. And since the block length is smaller, you do have the luxury of doing more sophisticated decoding, but still, you cannot afford exponential complexity, because two to the hundred is still a very large number.

Okay, I think we are 45 minutes past the scheduled time, and I know that you have jet lag. It was a pleasure to be here.