OK, Vijay, the floor is yours. Great, wonderful. Well, thank you very much for inviting me to give this introduction to information theory. My writ, as given to me by David Wolpert, was to give a basic introduction to the topic, covering the main concepts that are necessary for it. I'm not specifically going to make reference to stochastic thermodynamics; rather, this is a straight introduction to classical information theory as a subject. So let me start with the outline for today's lecture. Let's see if this will move. There's a little bit of delay between when I move my iPad and when I talk, so I hope that's not too unnerving. This lecture consists of four parts. There will be a very short motivation for what we're trying to do. Then I will discuss the notion of entropy, which plays the role of the basic quantification of what we call information in information theory and is related to entropy in statistical physics. Then I will discuss the notion of mutual information, which quantifies how much information one variable gives you about another — another very important concept in this topic. And finally, I'll discuss the concept of relative entropy, which is a kind of entropic or information-theoretic measure of the difference between probability distributions. There won't be time in one hour to discuss applications to things like stochastic processes and stochastic thermodynamics, but I think with some imagination you'll be able to figure out many ways of using these kinds of tools for the kinds of topics of interest in the program you're in. So let me begin with the first part, which is the motivation. And our motivation is to quantify the notion of information. So you could ask the question: what is information? The foundation of this topic is the work of Claude Shannon, who thought about this in terms of how much you learn from a stream of messages. So he considered, for example, a stream of zeros and ones. If you're thinking in terms of neurons, you might think of a voltage spike versus no spike. If you're thinking in terms of digital messages, you would think literally of the zeros and ones in your digital signal, and so on. So basically a binary signal stream. And you observe the following kinds of things. Suppose you have the message stream 0, 0, 0, 0, 0 — it's always 0. Then it's clear that, if I've told you that you always get zeros, each new zero tells you nothing, and it tells you nothing because there's no uncertainty about what it's going to be. You already know it's always going to be a zero. So there is, in some sense, no new information in each occurrence of a symbol in this stream. Evidently, we have the same situation if you always have ones. But on the other hand, if sometimes you have zeros and sometimes you have ones, then your surprise at each new symbol that you see is informative: it tells you something that you didn't know from having viewed the previous sequence of symbols. So information theory seeks to quantify this kind of surprise and use that to study how much information is transmitted between one thing and another. Okay, one thing that is clear from this is that if you have a bigger vocabulary, you can send more information.
So for example, if instead of using zeros and ones you used zeros, ones, twos, and threes, then there are more messages you can send. So in some sense, because you can send more messages, you can say more things. As it will turn out, information is a quantification of how many messages you can send; it's a count, in some sense, as we will see, of the number of things that you can say. Another thing that seems clear from this is that correlation reduces surprise and therefore reduces the amount of information you can send. This is sometimes counterintuitive, because you think that if things are correlated they tell you about each other. They may tell you about each other, but the point here is this: suppose you have a situation where every time you see a one, I promise you you're going to see another one — you're going to see two ones in a row. If that's the case, then there's no point in seeing the second one, if you like; that's a wasted message. And so you have a situation where you could essentially replace the pair "one one" with a single new symbol. In this way, the correlation between the one and the one reduces surprise and reduces the information you can send. So from this point of view, you would be the most surprised by — and would learn the most from — each symbol if they all appeared to be completely randomly distributed. Now, that seems very confusing, because it would appear that random things are the most informative, but from this perspective that's true: if what you're quantifying is surprise, then it is true that you get more information when the signal looks less correlated, or more random. I'll take the question in a moment. So you might describe this with a picture that one draws in information theory, something like this: you have a signal, the signal passes through an encoder, and comes out as an encoded response or output. And you learn the most if the output appears to be essentially a random sequence of zeros and ones. Thank you. There's a question from Debasmita. Debasmita, you can unmute and ask your question. I'm going to keep going. Yeah, okay. So then, supposing we agree with this perspective, you could ask: how can we quantify informative surprise? That's what we need to do to understand how to quantify information. So we're going to start with binary symbols, let's say zeros and ones. And for the moment, I'm also going to assume that the symbols are statistically independent — there is no correlation from time to time. So what do we want? Qualitatively, some desiderata we might have: let's suppose that p(r_i) is the probability of using symbol r_i, its frequency of use. Then, as a measure of surprise, the surprise associated with a symbol of probability p ought to be a decreasing function of p: the more frequently we make use of the symbol r_i, the less surprising it should be. So that's one thing you might imagine. Another thing you might hope for — it would be nice if this were true — concerns two independent measurements; here I'm imagining two different neurons, but it could be two different signal channels or whatever other sources of signals you want. And here I'm phrasing this in terms of neurons because I do work in theoretical neuroscience.
So suppose you have the surprise associated with the responses of two independent signaling units. You might hope that the surprises add. That doesn't have to be the case, obviously, in your measure of surprise, but it would be nice to have surprise be additive for independent events, and this should be true for any probability distribution p. And so we're led to consider that the measure of surprise ought to be the logarithm, because the logarithm of a product of things becomes the sum of the logarithms. So it would have these two properties: being a decreasing function of its argument and, what is more, having this additive property. We could take the logarithm in many bases: if we take base two, the quantity we derive is going to be called bits; if you take base e, it's going to be called nats; but using the usual rule for conversion between bases, you can pick whichever base you want. So Shannon basically used this idea to come up with the definition: he defined the entropy of a probability distribution as the average surprise. You take the symbols r_i; the surprise associated with seeing a symbol is minus the logarithm of its probability, and you average that over the symbols, weighted by their probabilities. In this case it could be just zeros and ones, and that is the notion of entropy. If you have more than two symbols — not binary, but symbols one through N — then the entropy is H = -sum_{i=1}^{N} p(r_i) log p(r_i), with the minus sign out front. Now, this was a heuristic derivation; we kind of intuited this definition as a measure of surprise. But the way in which people come up with definitions of anything is that you guess a formula that will probably do the job for you, and then you ask yourself whether the formula is useful — and it indeed turns out that the entropy, this thing, is the answer to many questions in statistics and physics. So rather than coming up with it in this heuristic way, we could instead have asked many kinds of statistical questions, and the answer would have turned out to be the entropy. We'll talk about that in a minute. And just as a connection to physics — whoever it is who's got their microphone on, please mute it — in physics, if you agree that p is the probability of the different microstates of a system, for example the different states of a gas, then it does turn out that this quantity H reproduces the thermodynamic entropy of the system and gives it a statistical interpretation. Indeed, that is why Shannon originally called this the entropy: if you applied it to the distributions of statistical physics, you reproduced the thermodynamic entropy. For us, our interest is in asking what entropy means in terms of information. So we are going to argue, by going through examples and by using a theorem called the asymptotic equipartition theorem, that the entropy of a probability distribution, roughly speaking, tells you how much information is available in the signal, in the absence of noise. Equivalently, it quantifies the logarithm of the number of possible different messages sent through the channel. So this is the sense in which it quantifies information. The idea is that each message is worth the same amount, right? There's no difference in value between different messages.
So you count the total number of messages that could be sent by the channel, you take a log, and that is the entropy. It's a count of how many things you can say. From statistical physics, you'll also recognize that this is very much like the microcanonical ensemble, where the microcanonical entropy is the logarithm of the total number of configurations accessible to the system. Anyway, our goal is to explain these interpretations and their origin; that's what we're going to start with today. First of all, before doing that — it's sort of on the way to doing that — it's helpful to work out a couple of examples to understand these definitions. So let's suppose, for example, the symbols can take the values zero and one; in other words, you've got a binary signal, and let's suppose the probability of zero is p_0 and the probability of one is 1 - p_0. Then the entropy is H = -p_0 log p_0 - (1 - p_0) log(1 - p_0). You can work this out in various cases: in the limit that the probability of zero is zero, or the probability of zero is one, plug in the numbers and you'll find in both those cases you get zero. Conversely, if you maximize the entropy, you take dH/dp_0 and set that equal to zero. A quick calculation shows you have to solve the equation -log p_0 + log(1 - p_0) = 0, and solving that, you get the answer that the entropy for this binary case is maximized when p_0 = 1/2. So in other words, if we had a channel that was transmitting binary signals and I asked how much entropy it has — that is to say, how expressive this channel is, how surprised I am at seeing every symbol — the entropy vanishes if you always send ones (the probability of zero is zero) or if you always send zeros (the probability of zero is one), and it is maximized when both ones and zeros occur with equal probability. So that tells us that a well-encoded sequence — namely, one that maximizes your surprise and conveys as many messages as possible — basically looks random to the viewer. If you compute the entropy for the particular case p_0 = 1/2, the maximum here, plug it in: you get -1/2 log_2(1/2) from the first term, and from the 1 - p_0 term, which is also one half, another -1/2 log_2(1/2), and that's one bit. So you get one bit per symbol transmitted in this case. You can extend this, just to make it clear that we can do this in a more general case than binary signals. Suppose you had many values, r_i with i going from one through N. You plug in the probabilities p(r_i), write down the entropy, and then maximize the entropy by requiring dH/dp_i = 0 for all i, p_i being the probability p(r_i) of each symbol, with the constraint, of course, that the probabilities sum to one. With a little bit of work, or by using symmetry, you can conclude that the entropy is maximized by the uniform distribution, where all the symbols are used equally often. Plugging in p_i = 1/N, you have to do minus the sum on i of p_i log p_i; each of these terms is (1/N) log(1/N), you sum it all up, and you'll find the answer: log_2 N bits of entropy come out of this signal. Okay, so roughly speaking, what we've said here is that the entropy is maximized, in the absence of constraints, by the uniform distribution.
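As a quick sanity check on these numbers, here is a minimal sketch in Python (my own illustration, not part of the lecture; the function name is my choice). It computes the entropy of a discrete distribution in bits and reproduces the two facts just stated: the binary entropy vanishes at p_0 = 0 and p_0 = 1 and peaks at one bit for p_0 = 1/2, and a uniform distribution over N symbols gives log_2 N bits.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p (zero-probability terms contribute 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Binary source: H(p0) = -p0 log2 p0 - (1 - p0) log2(1 - p0)
for p0 in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(p0, entropy_bits([p0, 1 - p0]))   # zero at the endpoints, one bit at p0 = 1/2

# Uniform distribution over N symbols gives log2 N bits
N = 8
print(entropy_bits(np.full(N, 1 / N)), np.log2(N))   # both print 3.0
```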
So one way of conceptualizing what this is supposed to say — the language that's often used, and I'm going to keep talking about the information channel as being a neuron here just for concreteness, but it's any kind of information channel you want — is that you maximize the entropy, which we can think of as the amount of information you transmit per symbol, when you use the bandwidth, namely the expressive range of the signaling channel, here a neuron, as evenly as possible. So that's the idea here. For example, taking the neuron as our example again, if it encodes information into bursts of spikes of different lengths, et cetera — you could have one spike, or two, or three, multiple spikes like that — then using them all equally often maximizes the information. Okay, so another thing that often comes up, especially in physics, is that you have a variable that takes continuous values. So let's think about how you would write down a similar notion of entropy for a continuous variable. Well, one obvious procedure is to first bin it in fixed increments of size Δr, and then declare that p(r_i) Δr is the probability that the variable r lies between r_i and r_i + Δr. Then, for this discretized, slightly coarse-grained distribution, you can write down the entropy: H = -sum_i p(r_i) Δr log[p(r_i) Δr]. Breaking up the log of the product into a sum, you find that this is -sum_i p(r_i) Δr log p(r_i) - sum_i p(r_i) Δr log Δr. The reason I've split it up this way is that now I take the continuum limit, Δr going to zero, and the first term becomes what's called the differential entropy — the integral over the continuous variable, -∫ dr p(r) log p(r) — but the second term annoyingly diverges, because there's a log Δr, which is going to diverge. So what does this mean? Well, what this means is that if you actually believe in real numbers — you actually believe that you can distinguish the 10,000th decimal place in these different numbers — then you're committing to the infinite precision of real numbers. And this infinity reflects the infinite precision of real numbers, the infinite number of decimal places; there's information in all of them. In any real situation, you typically have a finite bin size, or there's some coarse-graining that happens due to noise, and this term is basically a constant and you can drop it — because, you see, the p(r) Δr pieces sum to one, and what's left is a constant that reflects the precision with which you can resolve the variable. So there's some relationship here between the amount of information you have and the renormalization group and coarse-graining of a system — but, well, okay. Another thing worth pointing out about the second term is that, although it's sort of annoying, it always cancels when you compare entropies: if you take the entropy of one distribution P and another distribution Q and you take the difference, this constant term will cancel.
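To see this divergence concretely, here is a small numerical sketch in Python (my own illustration; the Gaussian example and the function names are my choices, not from the lecture). It discretizes a Gaussian density with bin size Δ and shows that the binned entropy tracks the differential entropy minus log_2 Δ, so the whole divergence sits in the bin-size term — exactly the piece that drops out whenever you compare two distributions binned the same way.

```python
import numpy as np

def gauss_pdf(x, sigma=1.0):
    return np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def binned_entropy_bits(pdf, lo, hi, delta):
    """Entropy (bits) after discretizing a continuous density into bins of width delta."""
    centers = np.arange(lo, hi, delta) + delta / 2
    p = pdf(centers) * delta           # probability mass in each bin
    p = p[p > 0]
    p /= p.sum()                       # renormalize the tiny truncated tails
    return float(-np.sum(p * np.log2(p)))

sigma = 1.0
h_diff = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy of a Gaussian, in bits

for delta in [0.1, 0.01, 0.001]:
    H = binned_entropy_bits(gauss_pdf, -8.0, 8.0, delta)
    # H tracks h_diff - log2(delta): the divergence lives entirely in the bin-size term
    print(delta, round(H, 4), round(h_diff - np.log2(delta), 4))
```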
So traditionally what you do is forget about it: you don't keep this term around, you just look at this part, which is the differential entropy, and for all really physical questions you ask, that's the thing that will matter, and this artificial divergence will not. One final comment about this kind of example. One thing that we've ignored so far is the possibility that your stream of signals has correlations in it. How do you deal with that? Well, suppose your correlations occur in blocks of size K. If your correlations occur in blocks of size K, we might divide the overall signal into words of length K — here's a word of length K, another word of length K, et cetera. And then, instead of talking about the probability of the symbols, say the binary symbols that make up the word, we might instead talk about the probability of a word as an object in and of itself. Then what we can do is work out the surprise per word, because I know this is a correlated block — the surprise per word, or in other words the information per word in the absence of noise. So that's like taking the entropy of words of length K: H_K = -sum over all words of P(word) log P(word). And of course each word itself consists of a sequence of symbols, so this is a sum over i_1 through i_K, the possible symbols in the word, of probability times log probability. That gives us the entropy of these words. And then, if you wanted an entropy rate, an entropy per symbol in this sequence, you would divide by the length of these blocks, these words, and you would get the entropy rate (1/K) H_K. And that's going to tell you how much you can learn every time you see a symbol. Okay, by the way, nobody is keeping their cameras on, so I have absolutely no idea if people are following this or not, unlike an in-person lecture. So please feel free to turn on your cameras; that would be very helpful. And of course feel free to ask questions, because that might also be helpful. Okay. Yes, sorry to interrupt you guys. So actually there is a question in the chat. There is a question from Faze, and the question is: why can this diverging term be considered a constant? So if you look at it — I'm looking at this diverging term — this thing here is log Δr. Δr is a fixed number; it's the bin size. So that's just a number, just some fixed constant. Of course, as you make the bin size smaller, minus log Δr will get larger and larger; that's the divergence in question. But this thing here, the integral ∫ dr p(r), is one. So this term really is a constant, and the question of what that constant means has to do with the precision of the real numbers. And again, if you compare the entropies of two things, as we mostly will later, this will go away. Does that answer your question? Okay, I don't know if it answered the question because the person didn't respond. I think there may be another question. Yes, Sudeeth — I think you can unmute yourself and ask the question. So... Yes, Sudeeth. Yeah, I think — yes, can you hear me? Yes. I think in the previous example, this example where you are taking a word of length k, you are considering the correlation between the bits in that word, and two words are not correlated, right? Exactly. So I assume that to be the case.
I assume that the words were independent of each other; at least that's the way I treated this right now — I'm talking about correlations in blocks of size k. Okay, but, okay, okay. But you can divide it like that, so that two words are uncorrelated, basically. Yeah, now I'm imagining such a situation. Suppose that is the case, because the next thing I'm going to say is: well, suppose you've got arbitrarily correlated sequences. Because really what's going to happen in real life is that there may be some correlation length or something, and the correlations will die off with distance, but that doesn't mean the signal separates into blocks. So a way of — right, that's exactly what you're going after — a way of dealing with that is: you first artificially chop the signal up into blocks of length k, go ahead and calculate, and then take the limit that k goes to infinity. So that is one method, amongst others, of trying to get the answer for arbitrary correlations. Okay, great, so I thought that's where you were going. Thanks, thanks. Yeah, yeah, good. Okay, thanks a lot. So let me go to the next page. Okay, there are many nice properties that the entropy has, and I'm going to start listing properties and then we'll use them for various things. One important property is something called the chain rule. The chain rule goes like this. Suppose you have two variables x and y and you compute the entropy of these two variables jointly — in other words, you have the joint distribution p(x, y) and you want to compute the entropy of that distribution. The statement is that this is equal to the entropy of just the variable x, namely of the marginal distribution over x, plus the conditional entropy of y given x: H(X, Y) = H(X) + H(Y|X). So you condition on the variable x, you see what the distribution over y is, and you compute the entropy of that. I should add a caution: I'm going to be flipping notations periodically depending on what's convenient. Sometimes I write H(P), meaning the entropy of some probability distribution P, and sometimes I use the notation H(X), where the distribution in question is a distribution p over x. So I should probably write — ooh, gosh, I have no idea what just happened. Okay, so we're here. What I mean is that here this P is a probability distribution over x, so sometimes I'll write the entropy in terms of the sample space instead of the distribution. I'm sorry about that notational change; sometimes it's more convenient to do it that way. So let's see. Suppose you want to compute the entropy of a distribution on two variables x and y. Well, that's a sum on both x and y of the joint distribution times the log of the joint probability: H(X, Y) = -sum_{x,y} p(x, y) log p(x, y). Now, as usual, we can factorize this distribution into the conditional distribution of y given x times p(x). Then the log splits into the log of the probability of x plus the log of the probability of y given x. And here what I've done is take p(x, y) and split it into p(x) times p(y|x) — it factorizes. So now if you look at the first piece: if I sum over y for free, I'm left with the expression for the entropy of x. And here if I sum over — something's gone wrong with the size of this thing. Let me just try to see if I can take this and shrink it, resize, amazing. Ah! What just happened?
Resize — still getting used to this, clearly. So then I can take, grab this and I can move this. Okay, great. Well, that seemed to work. Okay, so did it work? You're seeing a different screen there than I am, by the look of it. No, I'm changing — okay, now. Okay, that's my screen. Okay, great, okay. So if you look at this, then on the one hand, the first expression, after the sum on y, is precisely -sum_x p(x) log p(x), the entropy of x. And the second expression, after I sum on x for free — yeah, I think I've said this backward: in the first term I can sum for free on y, and in the second term I can sum on x for free — becomes precisely the expression for the conditional entropy of y given x. Okay, great. So all told, you see, here's the chain rule theorem, which says the entropy of x and y is the entropy of x plus the conditional entropy of y given x. And you can chain this repeatedly: the entropy of a set of variables x_1 through x_n is given by taking the first variable, then the second variable conditioned on the first, then the third conditioned on the first two, and so on in a chain: H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_{n-1}, ..., X_1). This is called the chain rule for entropy, and you get it by recursive application of the first result. We're going to use this in a moment to get some interesting results out. Okay, so we'll use that a little bit. But first, before I move on, I want to talk a little bit again about the meaning of entropy. So we have here — by the way, there's a huge lag in how my screen is being transmitted. On your screen, are you seeing the page on mutual information? Yes. Okay, because that's not what's on my screen, so this must be kind of confusing. How do we get it to show my screen? That is my screen. Okay, now the meaning of entropy again. Yes, because that's what I want to talk about before I go to mutual information. Oh no, what is it doing? Let's hope it's stable now. Okay, okay, let's see. So we're going to talk about the meaning of entropy again. I hope I'm writing on my screen — it's not showing up. I have no idea what to do about this. Why is it so slow? Is my voice coming through okay? Yes, so now we are seeing page 100, the meaning of entropy again. Yeah, okay, well, let's just keep going. So we have here, once again, this picture: you have a signal, you have an encoder, and you have an output of the encoder, R. Now let me suppose that R has an entropy H, and let's suppose that H is given by this formula, H = -sum_i p(r_i) log p(r_i). Just by the way, I'm using my pen and marking on my screen; I think by and large it's not showing up on your screen. We'll just have to do our best with the situation. So, okay, there is a theorem — a famous and super important theorem called the asymptotic equipartition theorem — that says the following thing. Suppose you draw a sequence of symbols from a distribution with entropy H, and suppose I want to transmit a message of length L, where L is very large. Well, clearly the number of possible sequences that I could send is N^L, where N is the number of possible symbols I could have used. So I could have sent N^L messages; that's 2^(L log_2 N), okay.
The asymptotic equipartition theorem says that if you transmit your messages using a distribution over the symbols with entropy H, then in fact a lot of the possible messages never get sent. What actually happens is that there is a smaller set, called the typical set. The typical set has a size of 2^(HL), which is much less than 2^(L log_2 N), the total number of messages you could send. And what's more, every single element of the typical set occurs with essentially equal probability, 2^(-HL). So what this is saying — the understanding that the quantity entropy gives us — is that if you send a long sequence of messages, the distribution over possible messages becomes uniform within the typical set and is zero outside the typical set. It's an extremely powerful theorem, a fact you can use for all manner of results; I'll discuss later how you use it to prove other very important theorems in information theory. Effectively, the asymptotic equipartition theorem converts any probability distribution into the uniform distribution over the typical set, where the size and shape of the typical set are defined by the entropy of the distribution you're using to draw the symbols. And you can use this to prove all kinds of things. So you might wonder how you prove such a thing. Actually, for IID variables it's kind of easy, so I'll tell you how you prove the asymptotic equipartition theorem for IID variables. The screen will move up in a bit. So imagine you pick X_1, X_2, X_3, a sequence like that, IID — independent, identically distributed — so for the moment we're ignoring correlations. Then here's a fact of life. Consider -(1/n) log p(X_1, ..., X_n), where I'm drawing n such variables and taking the logarithm of the joint probability. Because it's IID, I can write this as -(1/n) times the sum over the individual draws of log p(X_i), the probability of the symbol on the i-th draw. Then, by the weak law of large numbers, you know that this object converges in probability to minus the expectation value of log p(X). But that's precisely the entropy. So this is saying that minus the log of the probability of a long string, divided by n, converges to the entropy. With a little bit of work, you can show from this that all sequences that typically occur have the same probability — the log of the probability per symbol is minus the entropy, so each typical sequence has probability about 2^(-nH). And because they all have the same probability, if you add up 2^(nH) such sequences, that fills up the total probability and there's no room for anything else to happen. So that's how you derive the fact that there's a typical set, and that within the typical set each sequence has probability two to the minus (entropy times the length of the sequence). So this is a theorem you can prove in this way. If you want to prove it in more generality, where you don't pick the symbols independently and with identical distributions, that takes a little more work, of course, and we don't have time in today's lecture to prove that. Any questions? Okay. In which case, I'm going to go on to concept number two, which is the mutual information.
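(A quick numerical aside before the next concept, my own sketch in Python rather than anything from the lecture: it illustrates the law-of-large-numbers step in the proof just given. For an IID binary source, -(1/n) log_2 of the probability of the drawn sequence settles down to the entropy H as n grows, which is exactly what makes the typical set and its 2^(-nH) probabilities appear.)

```python
import numpy as np

rng = np.random.default_rng(0)

p1 = 0.2                                   # probability of the symbol 1
H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))   # source entropy in bits

for n in [10, 100, 1000, 10000]:
    x = rng.random(n) < p1                              # an IID binary sequence of length n
    log_prob = np.sum(np.where(x, np.log2(p1), np.log2(1 - p1)))
    print(n, -log_prob / n, H)                          # -(1/n) log2 P(x_1..x_n) -> H
```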
So far, we've characterized a signal in terms of how expressive it is — namely, how many different messages you can get from it, or, if you make the message very long, how big the typical set of messages you can send is. But often you're interested in a situation where you have a signal S, you have some noisy channel, and you have an output R, and you want to ask how much information R contains about S. There are many, many applications like this, including things like: if you have a neuron and it produces a message, you want to know how much the output tells you about the input. Some things are clear about what must be true. For example, suppose you had four input signals and they were transmitted faithfully into outputs; then you clearly have perfect information, and the response contains as much information as the signal. If instead the thing is noisy — namely, S1 occasionally becomes R1 or R2, S2 becomes R1 or R3 once in a while, stuff like this — then it's clear that the response should contain less information than the signal. If the output has a lower bandwidth than the input — that is to say, two inputs become the same output, for example — then it's clear that the output has less information than the input. So whatever quantity we write down to characterize how much R tells you about S has to reflect these kinds of properties. Again, to take a neuroscience example: suppose you have repeated experiments, so you present the same input each time to this neuron; sometimes you get this response, sometimes that one — it's more or less the same, very active here, not active there — but it's slightly different each time. That's like having this noise, and so somehow you have to account for the fact that something of the knowledge about the input has been lost by the time it got transmitted to the output. The standard way to do this is as follows. First, you quantify how expressive the response is. The output of the channel is some sequence R1, R2, et cetera, and let's suppose it is drawn from some probability distribution. You measure that probability distribution, and it's clear that the wider this distribution, the wider the response repertoire and the more entropy the response is going to have. So the total entropy, H_total, is simply the entropy computed from the distribution of responses; that's the total number of bits transmitted by this channel. The point, however, is that not all the bits that are transmitted may be informative about the input. It's like having a phone line with noise on it: not all the bits being transmitted down the line are informative. So the next thing we need to quantify is the noise in the system. For a given input, how variable is the response? We can do that by saying: suppose I have an input S; I can look at the conditional distribution of the output R given the input S. If the channel were completely deterministic, this distribution would be completely focused — it'd be a delta function. But if it's not — if there is noise — this distribution will be wider. The wider this conditional distribution of output given input, the more the noise, the more the entropy associated with it, and the more information you're losing to noise.
So you quantify this by computing the noise entropy given the input S to the channel. You compute the conditional entropy that we discussed earlier: -sum_r p(r|s) log p(r|s), computed for a fixed s. And then the thing called the noise entropy is the average of this over all possible inputs that you could have put into the system. This quantity, the noise entropy, quantifies the bits of information lost to noise. So the mutual information, which is supposed to quantify how much the output of the channel R tells you about the input to the channel S, is measured as the bits transmitted minus the bits lost to noise: the total entropy of the output minus the noise entropy. There are many ways of writing this; we can just write out explicitly the expression in terms of the probability distributions. It's often written with the letter I, as the mutual information I(S; R), and you can write it as the response entropy minus the conditional entropy of the response given the stimulus: I(S; R) = H(R) - H(R|S). That's a standard expression. Now, this is also an object with many interesting properties. So I've flipped pages — and there it is, the new page has come up. For example, you can take the expression we started with and use basically Bayes' rule and the factorization properties of probability distributions. So here's Bayes' rule, and by working with Bayes' rule — I won't work through the details of the algebra; you just plug in Bayes' rule and manipulate this expression a few times — you can show that the mutual information between the output and input is symmetric: the information conveyed by R about S is the same thing as the information conveyed by S about R. You have the input and the output of the channel, and however much the input tells you about the output is the same as the amount that the output tells you about the input. I'm not going to work through the details of these little calculations; they're for you to work out as exercises if you want, because they're elementary algebra, and you can work out a whole bunch of properties. You can work out that the mutual information between R and S is the same as the mutual information between S and R; there are various expressions. This one tells you that the mutual information between the output and the input is less than the entropy of the output, right? You can't know more about the input than the amount of entropy in the output. Likewise, the mutual information between output and input is less than the entropy of the input: there can't be more coming out of the channel than went into it in the first place. Things of this kind — it all makes good sense. And there are other expressions that are more symmetric between the input and the output. Let's not worry about the continuous version of this; it's the same kind of calculation, and I've left it there for reference. So, am I correct in thinking that I have like 15, 20 minutes left? Yes, 18. Okay. In which case, what I would like to say is that I've written down here a set of exercises that you guys should work out.
The answers to the exercises are here, but what I've basically done is give a standard example on which you can train your imagination for these kinds of things, called the binary symmetric channel. You have inputs zero and one coming into the channel; there's a probability α that the input is faithfully transmitted to the output, and a probability 1 - α that it gets flipped the other way. As you can imagine, if the probability of getting scrambled is one half, you should get zero mutual information here, and if the probability of being transmitted faithfully is one, then you should have perfect information at the output. So you should work that example out in detail, just to give yourself a sense of how mutual information works. I wrote out the exercise here so that you can check yourself. You'll see that what I just said — that if you randomize the output you get zero mutual information, and if you have perfect transmission you have perfect mutual information — works out, and that's for you to check as an exercise when you have a moment. Okay, so these are all those examples, and then you can do more examples, more symbols, et cetera; this is a page full of examples, two pages of examples, for you to do. Now I'd like to come to some important properties of this quantity. First: remember we talked about the chain rule for entropy. There is similarly a chain rule for mutual information, and it goes like this. Suppose I have a bunch of variables X_1, X_2 through X_n, and some other variable Y, and I take the mutual information between this collection of X's and Y. Then the chain rule theorem says — there's a missing thing here; sorry, let me put it in; there should be an I, and there's an I, yeah — that this mutual information can be written as a sum where you take the mutual information between each X_i in turn and Y, conditioned on the variables that came before it: I(X_1, ..., X_n; Y) = sum_i I(X_i; Y | X_{i-1}, ..., X_1). So there's an expression of this kind, and the proof is pedestrian: you write out the mutual information in terms of the entropies, you apply the chain rule for the entropies, you manipulate that, and you discover this chain rule for the mutual information. The same factorization of probability distributions we used earlier produces this. The reason I want this here is that this chain rule leads to an extremely important property, called the data processing inequality. Suppose X, Y and Z are processes that form a Markov chain — this is the entry point into things like stochastic processes using these kinds of techniques. What this means is that the distribution over X and Z conditioned on Y — which is simply the joint distribution of X, Y and Z divided by the probability of Y — factorizes into the product of the conditional distributions: p(x, z | y) = p(x | y) p(z | y). In other words, if I know Y, then X and Z become independent variables. That's the Markov property: conditioning on the intermediate variable factorizes the things on the two sides. So suppose that's the case, that you have a process of this kind.
Then the mutual information between X and Y exceeds the mutual information between X and Z. This is basically saying that you can lose information in a process — a physical process or an information-transmission process — but you can't gain information from nothing. If you want to know how much Z tells you about X, it's necessarily less than or equal to what Y tells you about X, since Y is the intermediate step in the chain of transmission. The proof is easy if you use the chain rule. Using the chain rule, the mutual information between X and the pair (Y, Z) is equal to the mutual information between X and Z plus the mutual information between X and Y given Z: I(X; Y, Z) = I(X; Z) + I(X; Y | Z). Or you could factorize it the other way using the chain rule: it's the mutual information between X and Y plus the mutual information between X and Z conditioned on Y, I(X; Y) + I(X; Z | Y). But the Markov property tells you that the mutual information between X and Z given Y is zero — because the distributions factorize. Because it's Markov, the distributions factorize; because they factorize, the two variables are conditionally independent, and one of the exercises I've given you is to show that independent variables have no mutual information. Clearly, if they're independent, one doesn't tell you anything about the other. Once you have that, you can write this out, and you see immediately that the non-negativity of the remaining mutual information implies that the mutual information between X and Y exceeds the mutual information between X and Z. So in other words, as you keep processing the data, you can only lose information. That's a very important property, and it arises from this set of elementary facts about factorization of probability distributions, processed through the engine of the mutual information. Okay, any questions? Oh, actually let me say one more thing and then take questions. All of this has one extremely important application, called the channel coding theorem. This is — wait, the page hasn't moved. Can you tell me when the page moves for you? Yes, so we are still on the data processing inequality. Okay, well, now we're halfway to the next page. Ah, there we go. Great, so now I'm going to talk about the channel coding theorem. Imagine that you have a signal W, you encode it into some sequence of X's, and that sequence of X's gets transmitted through a noisy channel. The noisy channel is defined by a distribution for what Y comes out given that you stick X into the channel. There's a quantity called the channel capacity, which is the maximum, over all distributions over X, of the mutual information between X and Y. That mutual information is determined by two things: the characterization of the noise in the channel, and the distribution you put on X. The encoder is supposed to produce some distribution over X, and that gets transmitted into a distribution over Y. And we're going to see that this quantity, the channel capacity, is the maximum reliable information rate — the maximum error-free information rate — over the channel. This is the central theorem of information theory; this is what Shannon developed information theory for. More specifically, the theorem says that, for something called a discrete memoryless channel — so there are no correlations — all information rates below the channel capacity C are achievable.
So as long as the signal you're trying to stuff through has an information rate lower than C, the channel capacity — I should put the letter C — that rate is achievable. We don't have time for the full proof, but I can give you an idea of it using the asymptotic equipartition property and the things we've already said. Imagine you have these X's — this is the encoded signal. It's going to go through the channel, and when it comes out of the channel, it's going to come out as a Y. But because of the noise in the system, if I stick in a particular X, there's a whole bunch of different Y's that can appear. Basically, the transmission fuzzes out the X; it spreads it out in some way. Now I'm going to use the asymptotic equipartition property that we talked about, and imagine that I'm sending a very long message. If I send a very long message, I can use the typicality property: if I look at very long output messages Y^N, all the messages that appear are within the typical set, and there are e^(N H(Y)) possible such messages. Because this thing has some entropy H(Y), it's not the case that all possible sequences of length N appear; you get e^(N H(Y)) sequences appearing at the output. That's a result from the asymptotic equipartition theorem. Now, imagine that I had tried to put in a particular encoded symbol X. Well, I know that this is going to get spread out into some region, right? But if I give you many, many instances of this, I also know, from the asymptotic equipartition theorem, that the size of this set is going to be like e^(N H(Y|X)), because I condition on the fact that I've put in an X and I see how many Y's come out, and that conditional distribution — its entropy — tells me the size of this set, according to the asymptotic equipartition theorem. So now we see what we can do. I can take the set of all possible things that might come out and divide it up into pieces, and the number of separate pieces that I can have is the number of things I can faithfully encode. So indeed, you take the total number of things you could see, e^(N H(Y)) — this should have been the entropy of Y — and you divide it by e^(N H(Y|X)). And that, by definition or construction, is precisely e^(N I(X;Y)) — there's an N there, times the mutual information between X and Y, in the exponent. So that's the number of values you can faithfully encode. This is a heuristic proof. To really do it right, you have to control the errors in the statements; you have to show that this is achievable, that there exists an encoding of X that allows you to do this — because otherwise it could happen that this ball overlaps with the next ball, which overlaps with the next ball, and everything gets confused. What you need to know is that there exists some encoding, at least asymptotically, such that the balls don't overlap and you can decode them separately. That's the kind of thing Shannon showed, and that's a lecture's worth of discussion. So I have, what, a few minutes left, but I'm going to go ahead — Matteo, I would propose to take 10 minutes of the discussion time, I hope that's okay. That's okay, yes. I don't see many questions, so yeah. Okay, so let's go on to the last topic. So far I've talked about entropy and mutual information.
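Before moving on, here is a minimal sketch in Python tying together the binary-symmetric-channel exercise from earlier and the channel capacity just defined (my own illustration; the function names and the choice of a uniform input distribution are mine, not from the lecture). It computes I(S;R) = H(R) - H(R|S) directly from a joint distribution, and for a uniform input it reproduces the standard BSC capacity C = 1 - H_b(α): zero bits when α = 1/2, one bit when α = 1.

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information_bits(joint):
    """I(S;R) = H(R) - H(R|S), computed from a joint distribution p(s, r) (rows = s)."""
    p_s = joint.sum(axis=1)
    p_r = joint.sum(axis=0)
    H_total = entropy_bits(p_r)                                   # total output entropy
    H_noise = sum(p_s[i] * entropy_bits(joint[i] / p_s[i])        # average conditional entropy
                  for i in range(len(p_s)) if p_s[i] > 0)
    return H_total - H_noise

def bsc_joint(alpha, q0=0.5):
    """Binary symmetric channel: input 0 with probability q0, faithful transmission with probability alpha."""
    return np.array([[q0 * alpha,             q0 * (1 - alpha)],
                     [(1 - q0) * (1 - alpha), (1 - q0) * alpha]])

for alpha in [0.5, 0.75, 0.9, 1.0]:
    # with a uniform input, the mutual information equals the capacity C = 1 - H_b(alpha)
    print(alpha,
          mutual_information_bits(bsc_joint(alpha)),
          1 - entropy_bits([alpha, 1 - alpha]))
```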
The last basic topic in information theory that I think you need to know is something called the relative entropy. Let me explain what that is. Besides giving you ways of characterizing how much information is transmitted by things, information theory also gives you useful ways of quantifying how different two distributions, or two modes of transmitting information, are. There are many standard ways of quantifying differences between functions and distributions. For example, famously, there's the L1 distance: you take the difference between the two functions or distributions, take the absolute value, and add it up. There's the L2 distance: you take the difference, square it, and add it up. Basically, all of these distances treat functions as infinite-dimensional vectors. But there's another quantity, called the relative entropy or Kullback-Leibler divergence, which is the thing that information theory teaches you is a good measure of distance or difference between distributions, and it appears again and again. Here's how it's defined. Suppose P(x) and Q(x) are probability distributions. Then the relative entropy, also called the Kullback-Leibler divergence, is D_KL(P || Q) = ∫ dx P(x) log[P(x)/Q(x)], the integral over all the samples. I want to discuss this quantity and why it's useful. The first thing I'd like to explain is a positivity property that it has. So I'm waiting for this to move to the next page. Okay, here. Okay, so we're trying to get to something called Jensen's inequality and its application to the relative entropy. First I need a definition or two. First of all, we're going to agree that f is convex if it satisfies these technical conditions — for the purposes of this talk, it's convex if it bends up this way. You can write that out carefully by saying that if I take a weighted sum of two inputs to the function, then the function acting on this weighted sum is less than or equal to the weighted sum of the function acting on the individual inputs: f(λ x_1 + (1 - λ) x_2) ≤ λ f(x_1) + (1 - λ) f(x_2). This inequality is what it means for the function to be convex, and you can check that that's the case. Then there's an inequality called Jensen's inequality, which says that if f is a convex function — in other words, it does this — and X is some random variable, then the expected value of f(X) is greater than or equal to f of the expected value of X: E[f(X)] ≥ f(E[X]). Basically, if you push the expectation inside the function, there is an inequality. The proof is by applying the convexity condition recursively; do that and you'll find that this is true. So now I'm going to apply Jensen's inequality to the Kullback-Leibler divergence, the relative entropy. The theorem we want is that the relative entropy between two distributions P and Q is non-negative, with equality if and only if the two distributions are equal. The way you see this is to write out minus the relative entropy: that looks like -sum_x P(x) log[P(x)/Q(x)], and pulling the minus sign into the log, that's sum_x P(x) log[Q(x)/P(x)].
So this looks like an expectation value. I'm going to use Jensen's inequality to push the expectation value inside the log, recognizing that minus the log is a convex function (the log itself is concave). Pushing it inside gives the inequality sum_x P(x) log[Q(x)/P(x)] ≤ log[sum_x P(x) Q(x)/P(x)]. Now the P's cancel inside the log, and the sum of Q(x) is one, so you get log 1, which is zero. So we find that minus the relative entropy is at most zero, and therefore the relative entropy is non-negative. Okay, that's a very important property. Likewise, there's a chain rule property, which let's not bother with. The net things coming out of this are two or three very important facts. One is that we just proved the relative entropy is always non-negative; you can also check that it's zero if P equals Q. One sort of annoying thing about the relative entropy is that it's not symmetric: the relative entropy between P and Q is not the same thing as the relative entropy between Q and P, just because of the way it's defined. Because of this — see, it almost looks like it should be a measure of distance, right? If I take two different distributions, I get some positive relative entropy between them, and if P and Q are equal it's zero, so it almost feels like a measure of distance in some sense. But because it's not symmetric, it's not a metric in the usual sense: normally, if A is one meter from B, then B is one meter from A, and this doesn't quite satisfy that property. Nevertheless, we're going to find it very useful to think about the relative entropy, the Kullback-Leibler divergence, as a distance between distributions, and that's what I'm going to end with today, by using it in that way. There are various ways of seeing this. The first is by looking at the law of large numbers. Suppose you draw n data points from — oops, this page hasn't changed; this is sort of inefficient, I'm just waiting for the page to move. Yeah, I know. Oh, okay, it moved, yeah. So I'm going to first look at the law of large numbers. Suppose you draw n data points from a distribution Q(x), and let P(x) be the empirical distribution of the data, namely the frequency of counts divided by n. So Q is the true distribution, and P(x) is the empirical distribution. Here's a statement of fact — a result in information theory: the probability that the empirical distribution you get is some particular P(x) is bounded on both sides by 2^(-n D_KL(P || Q)), times some factors. What this tells you is that, of course, the most likely thing you get is when D_KL is zero: the most likely thing is that P is going to be Q — actually, that P is going to be very close to Q in the sense of D_KL. So this quantifies the sense in which the empirical distribution of data converges to the true distribution, and the quantity that is relevant is the relative entropy, the Kullback-Leibler distance, between the empirical distribution and Q. If you want to ask: I draw some data, how far away do I expect the distribution I get in the data to be from the truth? — it's the relative entropy that controls that. This is why it's such an important quantity: it controls the nature of the convergence of data to the truth. In this way, it underpins much of the universal behavior you see as you take finite-size systems and make them large. So let me move to the next page — but it hasn't moved yet.
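Here is a small sketch in Python of the three facts just discussed (my own illustration; the particular distributions P and Q are made up for the example). It checks that D_KL is non-negative, that it is not symmetric, that it vanishes when the two distributions coincide, and that the empirical distribution of samples drawn from Q drifts toward Q in the Kullback-Leibler sense as the amount of data grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_bits(p, q):
    """Relative entropy D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])
print(kl_bits(P, Q), kl_bits(Q, P))    # both non-negative, and not equal: D_KL is not symmetric
print(kl_bits(P, P))                   # zero when the distributions coincide

# Law of large numbers in the KL sense: the empirical distribution of n draws from Q
# approaches Q, i.e. D_KL(empirical || Q) -> 0 as n grows.
for n in [100, 1000, 100000]:
    empirical = rng.multinomial(n, Q) / n
    print(n, kl_bits(empirical, Q))
```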
Mathieu, do tell me when the page moves. Yeah, I'm sorry about the slow shift here. Great. So here's another kind of application, in statistical inference and the theory of learning. Suppose that, rather than modeling the data, what you have is two possible models or distributions, called P and Q, and I want to know how well you can tell these distributions apart with finite data. Maybe you can tell them apart, maybe you can't, with finite data. Let's suppose you have events e_1 through e_n, which have been drawn independently from either P or Q, and your job is to guess whether P or Q generated the data. I'm going to define two kinds of errors, two-sided errors. One, α_n, is the probability that you guess Q but the truth is P; the other, β_n, is the probability that you guess P but the truth is Q. We would of course like to pick an inference strategy such that α_n and β_n, both of these error probabilities, are small, okay? Now suppose you devise an algorithm with α_n, the one-sided error, less than ε, where ε is some small number, and let β_n^ε be the smallest possible error on the other side given this constraint on the first error. So one error is guaranteed to be small, and I want to know how small you can make the other error. There's a theorem called the Chernoff-Stein lemma, and it says that the limit, as the amount of data goes to infinity, of -(1/n) times the logarithm of this other error is the Kullback-Leibler divergence: lim -(1/n) log β_n^ε = D_KL(P || Q). So this is a fundamental limit on inference. It says that if P and Q are very close in the sense of relative entropy, of Kullback-Leibler divergence, then it's very, very difficult to tell them apart, and there are bounds on how quickly you can squeeze down the error in telling them apart — on how much data you need to beat the error down below a certain amount. So that's a second sense in which the relative entropy is such an important quantity. So I've moved to the third page, and I don't know if you can see it yet; do let me know if you can. No, I think we still see "infinite data limit, best you can possibly do". Okay, so let's wait for page three. Tell me when you see page three, okay? Yeah, yeah, sure. Yeah, I'm sorry about the delay, but there's nothing to be done. Yes, no worries. Okay, so we can see half of it. Yes, now it's fine. Great. So the next thing is: remember, in the previous section we talked about mutual information as an important quantity, right? Characterizing how much one thing says about the other. Well, there's actually a relationship between relative entropy and mutual information. Suppose p(x, y) is the joint distribution of X and Y, and consider the product of the marginals, p(x) p(y). If X and Y were uncorrelated — and hence uninformative about each other, hence had no mutual information with each other — then the joint distribution would be that product of the two. It turns out that the expressions I gave you earlier for the mutual information are exactly the same as the relative entropy, the Kullback-Leibler divergence, between the joint distribution and the product of the two marginals. So in a very real sense, the mutual information is quantifying how different the true joint distribution is from the product of the marginals —
Namely, how much more do you actually know, in the sense of one variable telling you about the other, if you have the full joint distribution as opposed to just the marginal distributions, which act as if the variables were independent. And so there's a corollary here, which is that the mutual information is non-negative, because we already said that the relative entropy is non-negative. We showed that, right? Using Jensen's inequality. So that tells you the mutual information between two variables is also non-negative. So there are lots of other cool facts that suggest that the relative entropy is a good measure of probabilistic distance. Okay, so what's the time? Yes, so you have passed eight minutes of your quota. Don't worry, I think we still have some time because there are no more questions, not even in the chat box. Excellent. In that case, I have two more slides that I would be happy to just go forward and... Wait a second, just... I think there is a question in chat. So Farzia is asking, could you please tell what applications it has? What, the relative entropy? I don't know. Maybe Farzia, you can just unmute yourself and ask the question. Hello. Hello. Can you hear me? Yes. Actually, I'm following your lecture but I got lost a little bit. What are the applications in general? Maybe at the end of the lecture you can tell us. I have no idea where I can use it; I don't see it anywhere. I mean, over the whole of the lecture, I have no idea where we can use it and what its applications are. I'm just following some relations and some theorems, but I don't know what the physics behind it is, and maybe the technology behind it. So for example, you can use the notion of mutual information to characterize how much a neuron says about its input. That's a very important application that people have. Another thing you can do is in coding theory, using the channel coding theorem. If you characterize the noise in a channel, the problem of encoding signals is the problem of devising a distribution over the inputs to the channel that achieves the channel capacity. So I described the channel capacity theorem: what you do is you characterize the noise in the signal, and then you devise an algorithm to meet the channel capacity. That's another application. So... I see. Sorry. You mean that most of the applications are in neuroscience and in neurons? I don't know. I mentioned neuroscience because that's an area I work in. But the phone company, the cell phone company, uses that kind of thing all the time. This was originally devised by Shannon to characterize signals being sent on electronic cables in the presence of noise, to decide how much information was being transmitted. What I'm describing now with the relative entropy has many other applications. So for example, I alluded here to questions of statistical inference from data. Suppose I give you data and ask you to tell me which model produced the data. That's the general problem of scientific inference. So if you try to address that problem and ask how to trade off which model to pick, the relative entropy becomes the quantity that controls which model you pick. Okay, I see. Thanks a lot. Thanks a lot. Right. So let me finish by talking about the relative entropy, or about the D_KL.
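Tying the inference discussion back to the Chernoff-Stein picture, here is a rough sketch, with made-up distributions, of why the relative entropy controls which model you pick: the average per-sample log-likelihood ratio between the true model P and an alternative Q converges to D_KL(P || Q), so models that are close in relative entropy need a lot of data to separate.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.5])   # true model (illustrative)
q = np.array([0.6, 0.4])   # alternative model (illustrative)

n = 5000
samples = rng.choice(len(p), size=n, p=p)   # data actually drawn from p

# Average per-sample log-likelihood ratio log2 p(x)/q(x) over the observed data.
llr = float(np.mean(np.log2(p[samples] / q[samples])))

d_kl = float(np.sum(p * np.log2(p / q)))    # D_KL(p || q) in bits
print(llr, d_kl)   # the empirical average approaches D_KL(p || q) for large n
```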
So let me ask the question now of whether the relative entropy can actually function as an actual distance metric on probability distributions in some situations. To be an actual measure of distance or separation, it has to be positive, symmetric, and satisfy the triangle inequality. Okay, so let's forget about the Jensen-Shannon divergence. Here's one way of doing this. So suppose P of X and Q of X, parametrized by parameters alpha and beta, are two models on a parameter manifold. Then we can think of P and Q as being, as we've said already, separated by some distance, D_KL of alpha and beta, right? This is just the relative entropy between the distributions indexed by alpha and beta. But actually what you can do is expand this locally if P and Q are close to each other. So you take beta to be alpha plus epsilon; these are two very nearby models, very close to each other. Then you can expand the relative entropy between the two models as follows. Write D of alpha and beta, where alpha is the parameter vector of the first model and alpha plus epsilon is the parameter vector of the second model, and then do a Taylor expansion around epsilon equal to zero. Now, as we said earlier, if I have two models that are identical to each other, two distributions that are identical to each other, their relative entropy will be zero, so the zeroth-order term is gone. It's also the case that, since the relative entropy is non-negative everywhere, it will be minimized when the two models are identical, that is, when epsilon is zero, and that means the first derivative with respect to the parameters will be zero as well. So what's left is a quadratic form: the second derivatives of the relative entropy with respect to the parameters. There's a matrix of second derivatives here, and this is an important quantity. It's called the Fisher information. So let's move to the next page; I'm waiting for it to go to the next page. Let me know when it comes to the next page. Yes, sure. Flip it. Yes. So, all right. So locally, if I have two parametrized models that differ by a little bit in their parameters, we see that the relative entropy between them is one half times a matrix, J_ij, multiplying these little parameter differences, summed over i and j. This J_ij is something called the Fisher information, and it's another very important quantity in statistics and information theory. The Fisher information, because it's symmetric and positive definite, is a true metric or distance function on the parameter manifold, and this is in the sense of differential geometry: it's a true metric on this space. So given this metric, you can define distances between probability distributions as follows. Using the methods of differential geometry, you take your metric, you contract it with the differentials of the path you're taking between two points, and you take the square root; that's the little amount of path length in each little segment. And if you integrate it up, you get the length of the path between these two points. So now you can use the tools of differential geometry: you can find geodesics, or shortest-distance paths, in this kind of metric. And that is the topic of the entire field of information geometry.
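To illustrate the quadratic expansion just described, here is a small sketch for a one-parameter Bernoulli family (my choice of example, not the lecturer's): the relative entropy between nearby parameters theta and theta plus epsilon is approximately one half times the Fisher information times epsilon squared, with J(theta) = 1 / (theta (1 - theta)) in natural-log units.

```python
import numpy as np

def kl_bernoulli(a, b):
    """D_KL( Bernoulli(a) || Bernoulli(b) ) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, eps = 0.3, 1e-3
fisher = 1.0 / (theta * (1.0 - theta))    # Fisher information of the Bernoulli family

exact = kl_bernoulli(theta, theta + eps)
quadratic = 0.5 * fisher * eps**2         # the local, metric-like approximation
print(exact, quadratic)                   # the two agree to leading order in eps
```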
And you can look at the book by Shunichi Amari, a well-known professor from Japan, who was very influential in this field. These kinds of techniques, using trajectories in this Fisher information metric, are used to understand properties of inference and learning. So learning in general works like this: you start with some model, somewhere on this model manifold, that's what you assume; you get data, and every time you get data you modify the learned model by a little bit, then you get more data and you modify your learned model a little bit more, and in the end you're supposed to make it to the truth. So many questions in the theory of learning and machine learning can be partly cast in terms of trajectories on the space of mental models you have, or the models that the machine has. And you might try to characterize efficient learning as being, in some sense, a geodesic or shortest trajectory on these manifolds. So let me finish as follows. There are many challenges here. For example, just taking challenges involving the relative entropy, one challenge is using the ideas I've just described. I've moved to the next page, by the way; you can't see it, but I'm just talking so that I can get finished for your sake. So as I said, in online learning, namely if you have a system that starts with an initial distribution and is trying to learn another one, there's some trajectory an efficient learner would take, and that would presumably be characterized in terms of shortest paths in this kind of metric derived from the relative entropy, and it will be interesting to understand that. So I've come to my summary page, but I can't see it on my screen. We also can't see it. So we will wait. Actually, let me just start talking, because there aren't any questions or anything there. Yes, so there is a question in chat by Zabrin. So Zabrin, I think you can just unmute yourself and ask the question. Yeah, can you hear me? Yes. Okay, so the question was related to the law of large numbers, and what I understood is that if somehow we know the D_KL between the probability distributions in the real sense, we can see how many realizations we need to basically estimate the true picture. And I wanted to ask if you had some comments about how we can relate this to simulations and experiments. So if you have a simulation of a system, and through that you can get two probability distributions, and of course there are many data points that you could get experimentally, is there a way we can estimate how many experimental realizations we need to get close to the true picture? Yeah, so suppose you have a model which tells you that the data ought to be generated, your theory says, by some distribution Q. And what you want to know is: you want to gather data empirically and ask, is it true that in the world it's really Q, that Q of X is really the distribution? Then you could use this kind of thing to ask yourself the question, how likely is it that with n data points I would, by mistake, get a distribution P even if the truth is Q? The probability that the empirical distribution will deviate by a certain distance from the truth is something you can compute. So that allows you to put error bounds on the confidence with which you know that the empirical distribution you have measured is or isn't likely to have been generated from the truth that your theory claimed.
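As a very rough back-of-the-envelope version of the answer above, ignoring the polynomial prefactors in the bound (so this is an estimate only, with made-up numbers): if empirical distributions a relative entropy D away from the truth occur with probability of order 2^(-nD), then pushing that probability below a tolerance delta needs roughly n greater than log2(1/delta) / D samples.

```python
import numpy as np

def samples_needed(d_kl_bits, delta):
    """Rough n such that 2^(-n * d_kl_bits) < delta, ignoring polynomial prefactors."""
    return np.log2(1.0 / delta) / d_kl_bits

print(samples_needed(d_kl_bits=0.01, delta=1e-3))    # about 1000 realizations
print(samples_needed(d_kl_bits=0.001, delta=1e-3))   # ten times more for a 10x closer P
```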
So it takes a little bit of manipulation, but you can do that kind of thing. Yes. I have a question. Yes, please go ahead. Hi, I'm Veneta. So, I mean, it was a very interesting talk. I work mainly on glassy and granular systems, and all of this looks like it could connect, but I'm just not able to make the connection of how to use relative entropy to characterize very different states, for example in granular systems or glassy systems, because there are aspects of memory that people look into in these systems, where we have many particles and their degrees of freedom, which end up controlling the overall mechanical properties, for example being able to go from structure to dynamics. But how do I condense all such information using information theory? How do I make that understanding better in some sense? I don't know that I'm expressing what I think clearly, but I just want to know if there is a way of going from static information, which I do know affects the dynamics, because I can characterize it in different ways, to using the concept of relative entropy, or these kinds of quantities, to measure it. So what you're asking is a complicated and subtle question, and a very good one. I will not give you a satisfactory answer in a short period of time; we would need to spend some time at a blackboard. But just to give you the snippets of relevance: if you have, for example, a glassy system with a complicated landscape of minima, and you have one distribution and you have another distribution, how do you compare these things? There is absolutely no question that these kinds of tools are relevant. In fact, there are famous works by people like Lenka Zdeborová and Florent Krzakala and others. I think there is some issue with the connection, John. Yes. I'm not the only one here, I think. Yes, that's true. Anyone got that reference? Sorry, I think Vijay is not there. Vijay, can you hear us? I think there is some issue. I think he will reconnect. I cannot see him on the participants list now. Yeah, can you hear me? Anyone on the participant list? Yeah, yeah, I can hear you. Okay. Yeah, we can hear you. I don't know why Vijay is not here. Maybe he's trying to reconnect. Sorry guys, I don't have any idea. No, he's still not here. Let me just try to connect with him. Okay, so we have just passed the time constraint, so let's wait a couple of minutes. I'm trying to connect with him. Yeah, that was an interesting question. Did anybody get the reference he was telling us about? No, actually, no. I think, no, there is nothing in the chat. So maybe one thing is that you can always, of course, send emails to him asking for the references. But this was not supposed to be the way it should end. I'm so sorry. No problem. Okay, so someone is asking in the chat whether the lecture notes or tutorial will be available. I can tell you that the videos will be available on the website for sure. And please kindly share the link whenever it's available; it's very helpful. Yes, the website is the same one: for this particular workshop, if you go to the ICTP page, there is a dedicated website for this workshop, and just below each of the lectures there will be a link to the video, or if there is any lecture note available, you will get that too. But I know that the videos will be there for sure.
Okay, thank you very much. And then the slides: is it possible to get the slides as well? Yes, I think we are going to provide the recorded Zoom session. So, okay, you were asking for the PDF files, right? Yes, absolutely. That I don't know, because that depends on the speakers, but okay, I can ask them. Okay, thanks. Okay, so I just got a reply from Vijay. He's trying to reconnect, but somehow it's not going through. Okay, so in the chat there is a message from Moritz, and I think he's mentioning the references that Vijay told us about. It's Lenka, I don't know how to pronounce it, Zdeborová. Moritz, can you just pronounce it for us? I'm not super certain myself how her last name is pronounced. Okay, then what I can do is... So I can put it in the chat. Yes, it's okay. What I can do is just send it to everyone. Yes, yeah, by accident I sent it only to you. Yes, I think now everyone can see the names, right? Yes, thank you. Okay, so, okay, sorry guys, I think then we have to end this session. Unfortunately, it didn't go well at the end, and hopefully you will join on Monday, 22nd May. I hope you enjoyed it. Any questions for me? Okay, so then let's end the session and see you in the next session on Monday, okay? Thank you very much for joining. Thank you and bye. Bye.