So, I welcome Prasad Radhakrishnan to give his talk. As I said, he will write occasionally on the board, so those of you at the back who have difficulty can come forward to follow the lecture.

Thank you for the invitation, and thank you all for showing up; I hope what I have to say justifies your faith. I do not have very much to say: I am going to tell you about one inequality. We will try to prove that inequality, and by the end of the talk you will hopefully see that you could have proved many such inequalities. We will do this using a notion called entropy, and I will tell you what it is.

So let me just get to the problem, the Loomis-Whitney inequality. I will draw some pictures, but here is what is going on. There are n points in three-dimensional space. I project them down, and I see a certain number of shadows. I may not see n shadows. Why not? Because it could be that two points are one above the other, and then they cast the same shadow. So say n1 shadows appear on the xy-plane. Now I shine the light this way, and a certain number of shadows appear on this wall; call that number n2. And if I do it that way as well (I will not do any permanent damage), I see n3 points on that wall. The Loomis-Whitney inequality says that n1 times n2 times n3 is always at least n squared.

Once again, the picture is this: there are n points; I project them down and get n1 shadows; I project them this way and get n2; I project them that way and n3 different points appear. For example, suppose I had 10 points all lined up vertically. What is n1 then? They are all stacked vertically, so when I project them down, how many shadows do I see? One. If I project that way or this way, these 10 points give 10 shadows on each side. So n1 times n2 times n3 is 1 times 10 times 10, which is 100, which is exactly the square of n. On the other hand, if these 10 points were in a line but at an angle, then when I project I see 10 here, 10 there, and 10 here, and the product is 10 cubed, which is 1000, more than 10 squared. But the Loomis-Whitney inequality says that you will never fall below n squared.

In case my hand-waving was not clear enough, here is a model; everybody sees this differently, but there are 3 planes here, and these beads are arranged in those 3 planes, 9 beads per plane, 27 beads in all. If you project them in any of the three directions, each projection gives you 9 points. So the total number of points is 27, n1 times n2 times n3 is 9 times 9 times 9, which is 729, and 27 squared is also 729: the inequality survives, with equality.

So the question, unlike in physics, is: why is this inequality always true? How many of you are math Olympiad students? You have probably proved this already, but I would like to introduce today a method by which one can get a feeling for why such inequalities hold, and that will be through information theory. That is the goal.

So here is a so-called proof of this inequality. If I have n objects and I want to describe one of them to somebody, how many bits do I need to describe n possibilities? Log n bits. If I have to describe something which is 1 among n things, then I need log n bits.
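Before going on, here is a quick computational check of the statement. This is a minimal sketch of my own (the function name shadow_counts is mine, not from the talk): count the distinct shadows of a finite 3-D point set on the three coordinate planes and verify n1 * n2 * n3 >= n^2.

    import random

    # Count distinct shadows of a 3-D point set on the three coordinate planes.
    def shadow_counts(points):
        n1 = len({(x, y) for (x, y, z) in points})  # project down: drop z
        n2 = len({(x, z) for (x, y, z) in points})  # project sideways: drop y
        n3 = len({(y, z) for (x, y, z) in points})  # project backwards: drop x
        return n1, n2, n3

    random.seed(0)
    for trial in range(1000):
        pts = {tuple(random.randrange(4) for _ in range(3)) for _ in range(20)}
        n1, n2, n3 = shadow_counts(pts)
        assert n1 * n2 * n3 >= len(pts) ** 2  # never falls below n squared
    print("Loomis-Whitney held in all trials")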
So, to describe one point you need log n bits. To describe its projection on the xy-plane, there are only n1 possibilities, so I need how many bits? Log n1. To describe its projection on this plane, log n2; and for the projection on that plane, log n3. But log n1 plus log n2 plus log n3 is giving me every piece of information twice. Why am I saying this? Look at the x-coordinate of the point: it is available from two projections. When I project down onto the xy-plane, which coordinate of the point is lost? z is lost, but x and y are both available. When I project along the y-axis, y is lost, but x and z are both available; in the remaining case, x is lost and y and z are available. Nevertheless, the x value of the point is available from two sources. So, at least heuristically, log n1 plus log n2 plus log n3 gives me every piece of information that log n contains twice. Morally (different people have different morals), I should be able to write this inequality: 2 log n must be at most log n1 plus log n2 plus log n3. By the way, every logarithm here is to base 2, though it does not matter: you exponentiate, and you get the Loomis-Whitney inequality. Now, we did not dirty our hands with the Cauchy-Schwarz inequality or anything of that sort; we just wrote this down. And of course it is all nonsense, because there is this word "information": what does information mean? Is this mathematics? I will try to make sense of all this in the course of the talk.

A question was asked: am I assuming Euclidean geometry? For now, when I say that I project (x, y, z) onto the xy-plane, I am just suppressing the z-coordinate, that is all. Whether I arrived at this by some wavy geometry, or whether there was a heavy black hole pulling my curves, does not matter to me; in the end it is all just mathematics. I am only trying to make it dramatic because I was told it is being recorded.

Now, the general idea. I would like to address a philosophical issue about science. This is something a famous computer scientist said about the P versus NP problem, which you might have heard of, but that is not important: successful ideas in science are those that are pervasive and invasive, are invitingly elegant and methodical, are open to extensions and variants, answer an objective necessity, and capture a widespread but diffuse sense of dissatisfaction in a scientific community. This notion of information was one such thing. People use it; we had "information" and "quantum information" even in the previous talk, and you feel that there is something called information, but you are not quite comfortable justifying what information really means if you get down to the mathematics. This was the situation when electronic transmission started, in the early years of the last century, and one of the heroes of information theory, Claude Shannon, addressed this question and laid out what is called the mathematical theory of information. Some parts of it I would like to introduce to you today.
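To restate the heuristic in symbols before we make it rigorous (my own one-line summary in the talk's notation): since each of the three coordinates survives in exactly two of the three projections,

    2\log_2 n \;\le\; \log_2 n_1 + \log_2 n_2 + \log_2 n_3
    \quad\Longrightarrow\quad
    n^2 \le n_1 n_2 n_3 .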
So keep this in mind: it turns out that this notion of entropy and information pervades many areas, and there is always a way of thinking about various things in this information-theoretic way. But for today I just want to prove the Loomis-Whitney inequality using information theory, and I have already shown you what the proof is going to look like. Whenever somebody says "obviously" in the course of a mathematical proof, that is the point to be suspicious. If something were totally obvious, people would usually not say "obviously"; they would say "hence". When they start using the word "obviously", it is a little shaky. So the "obviously" in our argument is what I am trying to formalize in the course of this talk.

It is customary to refer to two people in information theory: one of them is always called Alice, and the other is called Bob. Once I gave a talk in Pune where Alice was, I think, Mastani and Bob was Bajirao, but it is usually Alice and Bob. So: Alice observes the outcome of some experiment, one of x1, x2, ..., xn, and she would like to communicate it to Bob. What do they do? They sit down in advance and agree on a protocol. For example, Bob says: Alice, if you see x1, send me 000; if you see x5, send me 0101. They come up with some encoding, code words for communicating the outcomes, and their goal is that in the end Bob must find out what Alice observed. So the encoding should be such that when Bob receives the message, he is able to figure out what Alice really wanted to tell him. And they should try to encode these things with the minimum number of bits, because each time they transmit a bit over a phone or some other medium, they are charged per bit.

Now, this need not be communication; it can even be storage. We observe some experiment and record it in the computer's memory: we might observe the color red and then write down a string of bits. We would like to encode the colors using bit strings so that we can record our information with the smallest number of bits. This notion of information transmission is very important nowadays, and Shannon studied the minimum number of bits needed to encode n outcomes. Here are the rules of the game. First, Alice and Bob can agree on the code in advance: which outcome will correspond to which code word. Second, the code words should be prefix-free. This is a slightly technical notion, but let me tell you, it is not a big deal. Suppose we both decide that whenever I see red I am going to say 00, and whenever I see black I am going to say 001. This is not a good idea, because if I tell you 00, you are left wondering whether a 1 is coming: if a 1 comes, it becomes black; otherwise it is red. This sort of coding is not allowed; when the code word ends, you should know that the transmission has ended. Such codes are called prefix-free. If instead red were 000 and black were 001, that would be fine, because when I see 000 I know it is red, and when I see 001 I know it is black.
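Here is a minimal sketch of my own (not from the talk) that checks the prefix-free condition on exactly the red/black example just described:

    # Check whether a code is prefix-free: no code word may be an
    # initial segment of another, or decoding could stall mid-word.
    def is_prefix_free(code):
        words = list(code.values())
        for i, w in enumerate(words):
            for j, v in enumerate(words):
                if i != j and v.startswith(w):
                    return False  # w is a prefix of v
        return True

    print(is_prefix_free({"red": "00", "black": "001"}))   # False: 00 starts 001
    print(is_prefix_free({"red": "000", "black": "001"}))  # True: safe to decode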
So, whatever coding they use, it should be prefix-free: one code word should not look like an initial part of another code word. And you cannot communicate by silence. You cannot say: if I see red, I will say 0 within the first second, but if I see black, I will wait 4 seconds and then say 1. You are not allowed to communicate using silence (there used to be an old Hindi song about that, maybe from the 70s, but I do not know who remembers it). You will be charged for your talk time even if you do not say anything. Those are the rules that Shannon adopted, and then he asked: suppose there are n possibilities x1, x2, ..., xn; in the worst case, how many bits must you communicate? Many of you already said that you have to transmit at least log n bits. Why? Suppose Alice and Bob decide to use k-bit strings. They could encode one possibility as all zeros, other possibilities as various mixtures of zeros and ones, up to all ones. How many such strings are there? 2 to the k. But they have to accommodate n possibilities, so 2 to the k had better be at least n, which means k had better be at least log n. Nothing deep happened; it is almost insulting to tell you this, but log n bits are needed.

So what did Shannon do? Shannon said: well, all my symbols are not equally likely. It is January, and I want to tell my mother whether it is raining or not raining; the chance of rain is rather low, and of no rain rather high. And there might be other possibilities, like if she asks me what I ate: it is unlikely that at this hour I would have eaten a samosa (at least, I do not want her to know). So there is a probability associated with each outcome; it is not as if all outcomes are equally likely. In that case, should we still use code words of the same length for all the possibilities? One might think that an outcome that is going to be transmitted more often should be given a shorter code word, while outcomes that are rare can be given longer code words, and overall our goal should be to minimize the average communication. It is as if this experiment is going to happen again and again and again, and usually the high-probability events will happen; so we should try to minimize the encoding length for those, and for the low-probability events it is fine to assign longer code words. The question is: can we save, and how much can we save?

Maybe an example. Actually, I was told a minister would be here, and the organizers will appreciate the fact that I use numbers which flatter the ruling party; anyway, these are numbers I made up. Here are four possibilities; I will not quite say whether it is for winning or losing. The probability is 0.58 that it will be the BJP, the Indian National Congress has probability 0.40, and these two parties, the NCP and the BSP, have probability 0.01 each. Now I am observing this event somewhere, and I want to transmit who won. Here is a solution: we agree in advance that if it is the BJP I say 00, if it is the Indian National Congress I say 01, if it is the NCP I say 11, and if it is the BSP I say 10. These are prefix-free: they are all of the same length, so there is no question of one being a prefix of another.
So this satisfies the rules. How many bits will I send on average? Well, no matter what happens, I send two bits, so it is two bits per outcome. And suppose I have to do this reporting for all the constituencies, so this experiment is going to be played out again and again, let us assume independently; on average it will be two bits per transmission. But I look at this data and say that assigning same-length sequences to all of them does not make sense. It would be nice if I could give this fellow a very short sequence: say, when the BJP wins I say 0, and when the Congress wins I say 1. But then I get stuck. If the NCP wins, what will I say? I cannot say 00, because 0 is already taken for the BJP, and 01 cannot be used for the same reason. So how should I design the code so that no word is a prefix of another, all events are still covered, and the total cost of transmission is minimized?

Here is a better solution, probably in fact the optimal one. The party with the highest probability gets the short code word: for the BJP I assign 0, for the INC 10, for the NCP 110, and for the BSP 111. Notice that this is still prefix-free: 0 is very short, but no other word here starts with 0; and 10 is shorter than the last two, but those two start with 11. So this is a valid encoding of the four possibilities. If you compute the expected communication, it is 1 bit with probability 0.58, 2 bits with probability 0.40, 3 bits with probability 0.01, and 3 bits with probability 0.01; you add this up, and if I did my computation correctly it comes to 1.44 bits, which is better than 2.
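A minimal sketch of my own, using the talk's made-up election numbers, comparing the two codes' expected lengths:

    # Expected code length under the fixed-length and the variable-length code.
    probs = {"BJP": 0.58, "INC": 0.40, "NCP": 0.01, "BSP": 0.01}
    fixed = {"BJP": "00", "INC": "01", "NCP": "11", "BSP": "10"}
    better = {"BJP": "0", "INC": "10", "NCP": "110", "BSP": "111"}

    def expected_length(code):
        return sum(probs[o] * len(code[o]) for o in probs)

    print(round(expected_length(fixed), 2))   # 2.0 bits per outcome
    print(round(expected_length(better), 2))  # 1.44 bits per outcome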
So, whenever you are faced with a set of possibilities and their probabilities, you can ask: what is the amount of information I get on learning the outcome? You can take a pragmatic view and declare that the information you get is the number of bits, on average, that you need to transmit in order to communicate the outcome. Here was this abstract outcome with its possibilities, and we are trying to quantify the amount of information in it; so let us take the engineering approach and say that the amount of information in an abstract probabilistic system is the number of bits needed to transmit the outcome to another party, under the best encoding you can apply to that system of outcomes, the one that makes the transmission as small as possible. You could do that; but mathematically, it turns out that a certain formula associated with these outcomes is what Shannon called the entropy of the system.

I am using a slightly scary-sounding word, entropy, and many of you who have done physics know it from a different point of view, but the two are actually related. Once again, X is a set of possibilities x1, x2, ..., xn with associated probabilities: x1 happens with probability p1, ..., xn happens with probability pn. What is the entropy of this system; if somebody communicated the outcome, how much information did they give me? Shannon said: measure the information by the formula H(X) = -(p1 log2 p1 + p2 log2 p2 + ... + pn log2 pn). The logarithm is to base 2 because we are working with binary, and you add up one such term for each of the values.

Now suppose all these probabilities were uniform, 1/n each, all outcomes equally likely; what does the formula give? Each term is -(1/n) log2(1/n), there are n of them, so the total is log2 n. So our first idea, that to transmit one of n things you need log n bits, is strictly true only if all outcomes are equally likely; if they are not, the number of bits you need might vary, as we saw in the election example. Shannon proposed that this quantity should be used for measuring the amount of information in a probabilistic system, and by a probabilistic system I just mean some number of outcomes with associated probabilities. Is everything all right so far?

Now, why this formula? What Shannon proved, what is called Shannon's first theorem, is that no matter how you encode, you will always need at least that many bits: compute this formula, and it is a lower bound on transmission. No matter how cleverly you design your prefix-free code, you will have to send this many bits on average to communicate the outcome. Secondly, you will never need more than this plus 1. So this quantity is mathematically preferable: it does not capture the exact amount of communication, but it is within one bit of it. There are also certain asymptotic properties this formula enjoys, which we will not discuss today, that make even that extra one go away: in a suitable limit, H(X) is the true number of bits you need to send (what limit, what quantity, I will not tell you today). So, is this clear? There is a probabilistic system, n outcomes with associated probabilities; Shannon says, compute the formula, the summation of pi log pi with a negative sign in front, and that formula captures the amount of information in the system; and there are practical justifications for why this formula is so important.

This quantity H(X) which Shannon defined measures the uncertainty in X, or the amount of information you learn when somebody tells you about X, and H(X) is a function only of the distribution of X. In the election example, if I permuted the probabilities among the parties and then computed the entropy, it would come out the same: the entropy formula does not care what the outcomes are, only what multiset of probabilities arises. Which makes sense; we should not really care which particular outcome carries which probability. It also turns out that the entropy is never more than log n, which you can readily believe, because you will never need more than log n bits to transmit an outcome that has n possibilities.

These are facts about entropy. I need to tell you one more thing before we can really do the proof of the Loomis-Whitney inequality.
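A minimal sketch of my own tying these statements together: an entropy function, the uniform case, and the election distribution, whose entropy of about 1.12 bits sits just below the 1.44-bit code, as Shannon's theorem demands:

    from math import log2

    # Shannon entropy H = -sum p_i log2 p_i, skipping zero-probability terms.
    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([0.25] * 4))        # uniform on 4 outcomes: 2.0 = log2(4)
    h = entropy([0.58, 0.40, 0.01, 0.01])
    print(round(h, 2))                # about 1.12 for the election example
    # Shannon: h lower-bounds every prefix-free code's average length, and
    # the best code needs less than h + 1; the 1.44-bit code fits this window.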
So, we are going to the next level now, but please ask questions; there is a lot of time. So far we talked about one outcome, just X with possibilities x1, x2, ..., xm. But imagine that there are two kinds of outcomes. Let me give you an example: x1, x2, ..., xm could be the participants in this Olympiad, and y1, y2, ..., yn could be the colors of their shirts. I pick a person at random and ask: what is the probability that the person I pick is x1 and is wearing a red shirt, or that the person is wearing a blue shirt, and so on?

So these are the m people in this room, these are the colors of their shirts, and I am going to pick one person on a certain day. It might turn out that I pick person one wearing red, and that combination has some probability. This is a joint probability of two things, person and color, and all these numbers add up to one. Is this clear? I am going to pick a person with a shirt; it will be one of these combinations, each with a certain probability, and the total probability is one.

So we can compute the entropy of this joint random variable (X, Y). I have written XY, but please do not think of it as X multiplied by Y; it is X and Y together. How many possibilities do X and Y together have? m times n. There are m times n possibilities in total, one of which is going to happen, so it is the same Shannon formula: that is the entropy of (X, Y) together.

Now, this joint table also induces a distribution on x1, ..., xm alone. What is the probability that I pick this particular participant? It is the sum of the probabilities along that row. What is the probability that I pick x2? The sum of those probabilities. So I could compute the entropy of X alone, ignoring Y; or I could sum the probabilities the other way and compute the entropy of Y alone. This joint system thus has three kinds of entropies I can compute: the entropy of the combined system; restricting attention to the person, who is being chosen with various probabilities, the entropy of the person alone; and, ignoring the person, the entropy of the color of the shirt, which has its own distribution. These are called H(X, Y), H(X), and H(Y).

Next, something called the conditional entropy of Y given X. There are various possible colors and various possible persons, and I ask: once I tell you the person, how much uncertainty do you have about the color of the shirt? For example, if the person turns out to be me, then a pink shirt is unlikely; I do not have a pink shirt. Whereas over all participants together, there are people who like pink shirts, so pink is also a possibility. Once you tell me X, the amount of uncertainty about Y might change, because picking the person may automatically eliminate some outcomes for Y. So Shannon proposes that the conditional entropy of Y given X be defined as the entropy you originally had in the complete system minus the entropy you got by learning the name of the person: H(Y | X) = H(X, Y) - H(X). There are other formulas for this, but for today let us take this as the definition. In words, H(Y | X) is the amount of residual uncertainty about Y, the color of the shirt, that remains once I know the person.

For example, take the color of the pagdi. If you say Manmohan Singh, the color of the pagdi is always blue, you notice. So the entropy of the color of the pagdi given the person can become very small, can actually be 0: there is no uncertainty if the person happens to be Manmohan Singh. If it is some other person, maybe they have pagdis of several colors at home and wear one based on their mood.
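Here is a minimal sketch of my own with a made-up joint table (the numbers are not from the talk), computing H(X, Y), the marginals, and H(Y | X) = H(X, Y) - H(X):

    from math import log2
    from collections import defaultdict

    joint = {  # P(person, shirt color); the five entries sum to 1
        ("x1", "red"): 0.30, ("x1", "blue"): 0.10,
        ("x2", "red"): 0.05, ("x2", "blue"): 0.25,
        ("x3", "blue"): 0.30,          # x3 only ever wears blue
    }

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def marginal(index):               # sum the table along one coordinate
        m = defaultdict(float)
        for key, p in joint.items():
            m[key[index]] += p
        return m.values()

    h_xy = entropy(joint.values())
    h_x = entropy(marginal(0))
    h_y = entropy(marginal(1))
    print(round(h_xy - h_x, 3))        # H(Y|X): residual uncertainty about color
    print(round(h_y, 3))               # H(Y): at least as large as H(Y|X)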
So anyway, I hope you understand what the conditional entropy of Y given X is. Now, here are facts to be verified: if you condition, your uncertainty only reduces. Originally, looking at the color of the shirt, there was a certain number of possibilities and a certain amount of uncertainty; if I tell you who the person is, the amount of uncertainty about Y can only go down: H(Y | X) <= H(Y). This needs to be proved. Intuitively it is all obvious; if you give me more information, how can the uncertainty about Y increase? But since we have committed to a formula to measure information, we need to verify such things, and this can be verified. Similarly, instead of 2 attributes you might have 3: the person, the color of the shirt, and the material, like cotton or, I do not know what people wear nowadays, Terylene or terrycot. Then the uncertainty in the material given the person and the color is at most the uncertainty in the material given only the person: H(Z | X, Y) <= H(Z | X). If you give me more information, the entropy only reduces. These are formulas to be verified, and I would like you to accept them.

Having accepted all this, we go back to Loomis-Whitney. Let me erase the board; it takes me some effort. Does it erase better with water? Let me try; mathematics is an experimental science. So, while the board gets erased: recall the Loomis-Whitney inequality. We had n points; the projection this way gave n1 shadows, that way n2, and that way n3. There was no probability anywhere; we were simply supposed to prove an inequality. But I am going to impose a probability on this system, just because I am comfortable with information theory, I know some inequalities in information theory, and I insist on using those inequalities to prove Loomis-Whitney. So I pick a point P uniformly at random from these n points.

Now, what does a point in three dimensions consist of? Three quantities: X, Y, and Z. The entropy of the point P means the entropy of (X, Y, Z), and what is that? How much uncertainty is there in X, plus how much uncertainty is there in Y after knowing X, plus how much uncertainty is there in the Z-coordinate after knowing X and Y: H(P) = H(X) + H(Y | X) + H(Z | X, Y). Is this clear? I picked one of the n points uniformly at random, and when you pick among n outcomes uniformly, the entropy is log n; so H(P) = log n.

Now I project the point down, and the shadow is my point P1. What relation does it have to P? We have just suppressed the Z-coordinate; projecting down means suppressing Z. So the entropy of P1 is just the entropy of (X, Y), and the entropy of (X, Y) is H(X) + H(Y | X); this was our definition of conditional entropy, conditional entropy being this minus this. So this is actually an equality; although I have written an inequality, all of these are actually equalities.
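A minimal sketch of my own checking the chain rule (shown here with two coordinates; the three-coordinate version works the same way), together with the fact that conditioning reduces entropy:

    from math import log2, isclose
    from collections import Counter, defaultdict

    pts = [(0, 0), (0, 1), (1, 1), (2, 0), (2, 1), (2, 2)]  # distinct points
    n = len(pts)

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    h_xy = entropy([1 / n] * n)  # a uniformly random point: H(X,Y) = log2 n
    h_x = entropy([c / n for c in Counter(x for x, y in pts).values()])
    h_y = entropy([c / n for c in Counter(y for x, y in pts).values()])

    rows = defaultdict(list)     # the y-values that share each x
    for x, y in pts:
        rows[x].append(y)
    h_y_given_x = sum(len(r) / n * entropy([1 / len(r)] * len(r))
                      for r in rows.values())

    assert isclose(h_xy, h_x + h_y_given_x)  # chain rule: H(X,Y) = H(X) + H(Y|X)
    assert h_y_given_x <= h_y + 1e-9         # conditioning reduces entropy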
Similarly, the entropy of P2 is the entropy of (X, Z), since that projection suppresses the Y-coordinate: H(P2) = H(X) + H(Z | X). And the entropy of P3 is H(Y) + H(Z | Y). Is this clear? Nothing deep is happening yet.

Now let us compare the sum of these three quantities with H(P). Add up the three expressions. H(X) appears how many times? Twice, directly. H(Y | X) appears once here, and then there is an H(Y); but we know that H(Y) >= H(Y | X), full entropy is bigger than conditional entropy, so the total contribution is at least two times H(Y | X): one contribution is directly available, the other comes from H(Y). What about H(Z | X, Y), the amount of uncertainty in Z after knowing both X and Y? The sum contains H(Z | X), the uncertainty in Z after knowing X alone, which dominates H(Z | X, Y) by one of the formulas from the previous page; and it contains H(Z | Y), which dominates H(Z | X, Y) for the same reason. So what is the net result? The total entropy of the three shadows, H(P1) + H(P2) + H(P3), is at least twice the entropy of P.

Now, we set up our experiment with the uniform distribution, so the entropy of P is exactly log n. What about the entropy of P1; is it log n1? Well, P1 is one of the n1 shadow points, but it need not be uniformly distributed. For example, if the original n points formed a heap, the shadows near the center would have a much higher chance of occurring, because there are lots of points above them, while shadows near the edge would have a lower chance. So it is some possibly crazy distribution on these n1 points; but it has only n1 outcomes, and an experiment with only n1 outcomes has entropy at most log n1. Similarly for log n2 and log n3, which is what I have written on the next slide.

So, by direct inspection, no calculations, we observe that 2 H(P) <= H(P1) + H(P2) + H(P3), by the term-by-term matching above. The left-hand side equals 2 log n: the 2 is the same 2, and H(P) is log n because we decreed the distribution to be uniform; we designed our experiment that way. On the right, each term is bounded above: an outcome with only n1 possibilities has entropy at most log n1, so H(P1) <= log n1, and likewise H(P2) <= log n2 and H(P3) <= log n3. Therefore 2 log n <= log n1 + log n2 + log n3; rearrange and exponentiate, and you get n squared <= n1 times n2 times n3.

So we started with what appeared to be a hand-wavy argument, in which at some point we said "obviously" and so on; but if you use Shannon's formula, that "obviously" can be made rigorous, and you can extract the Loomis-Whitney inequality. Once again: this formula captures the intuitive understanding of information, we defined conditional information in this way, and there are other justifications for why conditional information should be defined like this.
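Here is a minimal sketch of my own replaying the whole chain of inequalities numerically on a random point set (the names and parameters are mine):

    import random
    from math import log2
    from collections import Counter

    def entropy(counts, total):
        return -sum(c / total * log2(c / total) for c in counts.values())

    random.seed(1)
    pts = {tuple(random.randrange(3) for _ in range(3)) for _ in range(15)}
    n = len(pts)

    drop_z = lambda p: (p[0], p[1])
    drop_y = lambda p: (p[0], p[2])
    drop_x = lambda p: (p[1], p[2])

    h_sum, log_sum = 0.0, 0.0
    for proj in (drop_z, drop_y, drop_x):
        shadows = Counter(proj(p) for p in pts)   # distribution of the shadow
        h = entropy(shadows, n)
        assert h <= log2(len(shadows)) + 1e-9     # H(Pi) <= log2 ni
        h_sum += h
        log_sum += log2(len(shadows))

    assert 2 * log2(n) <= h_sum + 1e-9    # 2 H(P) <= H(P1) + H(P2) + H(P3)
    assert 2 * log2(n) <= log_sum + 1e-9  # hence n^2 <= n1 * n2 * n3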
The entropy is maximum when the outcomes are uniform; that is the most uncertain situation. And we justified the Loomis-Whitney inequality using basically these considerations: we did not really have to dirty our hands with actual numbers and fancy inequalities, but all of that is hidden inside the fact that conditional entropy is less than entropy. In proving that, somebody has done the hard work; now you internalize it and use it. Many other combinatorial identities can be obtained this way, and there is an excellent survey article by David Galvin which I invite you to look at. Entropy shows up in many places and in other contexts; there are many other situations in computation where this sort of entropy makes its appearance. So let me just flash again the quote I showed before: this idea is pervasive, it is invasive, it is elegant (I hope you liked how everything turned out neatly), and it does capture a certain widespread dissatisfaction in the scientific community. I think Shannon should be one of the heroes of our time.

I have found entropy in unlikely places myself; for example, I found it in the Mumbai Mirror. About two years ago I tried to solve their word puzzle (I do not know how many of you read the Mumbai Mirror); I usually try to find the longest word, and I found POETRY, and I was really thrilled. Poetry! How often do you find poetry? But then I counted, and it did not include all the letters, and, you would not believe me, I spent about 20 minutes looking for the word, and it turned out to be ENTROPY. Since then I have been saying that I used to be satisfied with poetry, until I found entropy. So, I will be happy to take questions; of course I have only scratched the surface, but I hope it was useful.

Question: When does equality occur in the inequality? What type of distribution of points gives equality?

Answer: The lazy answer is that every inequality we used must be tight: certain variables should be conditionally independent given certain other variables, and then equality appears; but I have not pinned it down here. By the way, the Loomis-Whitney inequality is actually an inequality in the continuous setting, and I have presented the discrete version: if you have a certain volume in space and compute the areas of its projections, then the product of the areas is at least the square of the volume. I also have a quiz for you, but first, are there any questions?

Question: The definition of Shannon entropy that you gave is a good measure of information, or rather of uncertainty; but how do we reconcile this definition with our concept of entropy in physics? In classical thermodynamics we have certain axioms for entropy, like monotonicity, a positive derivative, and so on. Does this one also have such analogues?

Answer: Actually, I do not know that much about Boltzmann's definition of entropy, but I believe it also talks about the log of the number of possibilities, and in that sense it is the same. As for Shannon entropy: if you repeat an experiment many times and ask where most of the probability resides, it turns out that the number of typical outcomes grows exponentially, and the constant in front of the exponent is the entropy; it is the rate of growth after taking logs. But unfortunately I would not be able to give you a completely precise answer about how this ties to the physics notion of entropy. Yes, next question.
Question: I think this is along the lines of what was just asked, but it may appear very strange. I understand you to say that an information-theoretic system need not be a physical system as such; but I am following the idea of Bekenstein. Bekenstein proposed entropy for black holes completely out of the blue: something strange was happening in the dynamics of black holes, and he said it looks like the entropy we are familiar with, and he was ridiculed. People said: today you are talking about entropy; tomorrow you will ask me what its temperature is. He said: maybe yes, I will ask you. Can I ask the same thing now? Since you have the entropy of an information-theoretic system, can one associate a temperature with it, some kind of volume of information and a change-in-energy cost, or something like that?

Answer: Yes; such things are studied in information theory. The word "temperature" is not used, but the derivative of the entropy and quantities like that are of consideration even here. Some kind of energy cost is there.

Question: Any more questions from the students? What is the difference between entropy and information? Are they the same? You are using them interchangeably.

Answer: For me they are the same. Given a probabilistic system, the information, the uncertainty, the entropy: they are the same for me. And yes, entropy is often described as a measure of disorder; this too is a measure of disorder.

Question: I am not able to get my head around this. In the equation you wrote for conditional entropy: in the example you gave, the uncertainty obviously depended on which x it was. So how can you write something general like H(Y | X)?

Answer: That is correct; that is a very good question. So: X has various possibilities; this is the person that we pick. Once we pick a person, that person might show up wearing this color shirt, or this one, or this one; whereas this other person always likes to wear shirts of only one color. Whenever you have two variables X and Y, this is the picture you should have: a tree in which each branch has an associated probability, what is called the conditional probability. Now, if it is this person and you ask what the residual entropy is, it is the entropy formula applied to this person's conditional distribution; if it is that person, it is applied to that one's. So when I say "the conditional entropy of Y given X", you have a valid question: which x? If it is x1, the answer is one thing; if it is x2, another; if it is x3, it is zero, because there is no uncertainty about the color of that person's shirt. What we mean by H(Y | X), stated that way, is: compute the entropy of each of these conditional distributions, and take their average weighted by the probabilities written on the branches; with what probability does this one become your residual entropy, with what probability that one, and so on. It turns out that with that definition, the formula H(Y | X) = H(X, Y) - H(X) also holds, so it is an alternative definition of conditional entropy. Of course, with this definition it is no longer clear why the average should be less than the whole: when I ignore X and look at the average of these conditional distributions, why should the entropy of that mixture be more than this weighted average of entropies? That is because of the convexity of some function, what is called Jensen's inequality or something like that, which gets used in the proof; it is not a big deal, and it was all proved 70 years ago. Now we just keep using it.
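A minimal sketch of my own of exactly this "which x?" picture (the branch probabilities are made up): the weighted average of the branch entropies is H(Y | X), and it never exceeds the entropy of the mixture, H(Y):

    from math import log2
    from collections import defaultdict

    p_x = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
    p_y_given_x = {                    # conditional distribution at each branch
        "x1": {"red": 0.5, "blue": 0.5},
        "x2": {"red": 0.2, "blue": 0.8},
        "x3": {"blue": 1.0},           # like the pagdi example: zero uncertainty
    }

    def entropy(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    # H(Y|X): average of the branch entropies, weighted by P(X = x).
    h_y_given_x = sum(p_x[x] * entropy(d) for x, d in p_y_given_x.items())

    p_y = defaultdict(float)           # marginal of Y: the mixture distribution
    for x, d in p_y_given_x.items():
        for y, p in d.items():
            p_y[y] += p_x[x] * p

    print(round(h_y_given_x, 3), round(entropy(p_y), 3))  # average <= H(Y)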
But I have a quiz. Suppose I ask the same question in d dimensions. There are n points in d dimensions; I project, that is, I suppress one of the coordinates, and I get n1 shadows; I suppress another and get n2; and so on, up to nd. There are d such projections my point can have: n1 different shadows on this hyperplane, n2 different shadows on that one, and nd on the last. Now, the total number of points was n, and I write: n1 times n2 times ... times nd is at least n to the power what? What should the exponent be, and why? It should be d minus 1. Because every piece of information, the x-coordinate say, is lost only in the one projection along the x-axis; in each of the other d minus 1 projections that information is available. Every piece of information in the point is available d minus 1 times among the projections, so you get to write d minus 1 as the exponent (a small numerical check of this appears at the end of the transcript). So many such inequalities can be proved directly from considerations of information. Thank you very much; I have gone over time.

Any other questions? Okay, if there are no other questions, let us thank our speaker for this last talk, and both the speakers of this morning. Thank you. Right now we have a tea break outside, and then we come back for the formal award function at 12 o'clock.
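As a postscript to the quiz, a minimal sketch of my own (not part of the talk) checking numerically that dropping each coordinate in turn gives counts n1, ..., nd whose product is at least n to the power d minus 1:

    import random
    from itertools import combinations

    # Product of the d projection counts, keeping d-1 coordinates each time.
    def loomis_whitney_holds(points, d):
        n = len(points)
        prod = 1
        for axes in combinations(range(d), d - 1):  # drop one coordinate
            prod *= len({tuple(p[a] for a in axes) for p in points})
        return prod >= n ** (d - 1)

    random.seed(2)
    for d in (2, 3, 4, 5):
        pts = {tuple(random.randrange(3) for _ in range(d)) for _ in range(30)}
        assert loomis_whitney_holds(pts, d)
    print("n1 * ... * nd >= n^(d-1) held in all trials")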