So it's great to have Tatyana Sharpee with us, even though at a distance. And I agreed with her that I will be here and will try to act as a teaching assistant as much as possible. So then, Tanya, you can go ahead. Well, thank you very much, Matteo, for the invitation and for the opportunity to interact and participate in the Spring College. I also appreciate you accommodating my Zoom request; I was a little bit too scared to travel such a long distance. So I hope we can make the best of the interactions that Zoom allows. This is the first lecture, and as I understand it's also the first day for you. My course will be on information maximization and hyperbolic geometry. Sometimes we read from right to left. So this is the plan, and these are the textbooks that I recommend, mostly in addition to the more recent papers that are not covered by the textbooks. The two primary textbooks are Bialek, Biophysics: Searching for Principles, and David MacKay, Information Theory, Inference, and Learning Algorithms. There are others mentioned here that you can see and use as a reference. My plan for these nine lectures is that today I will cover my view of why information theory is useful, and we will already start exploring the connection with hyperbolic geometry. Basically, today I was planning to go over concepts of information theory and how they relate to neuroscience specifically. We will talk about entropy in physics and in biology, the derivation of the Shannon entropy, various conditional and relative entropies, and mutual information. It's a two-hour, hour-and-a-half lecture, so basically twice the normal amount that I usually give in a class lecture. We will talk about the principle of maximum entropy and, time allowing, maximum noise entropy, and also discuss the connection between information and hierarchical systems. Going forward, we can adjust. You will have these lecture notes.
You can read over them, and we can adjust the plan for the other lectures as needed, but this is what I was planning so far. In the second lecture I will talk about information maximization in neural circuits and the various limits that are imposed by discretization, by nonlinearities, by filtering. We will talk about optimal predictions for the linear part of the computation, optimal predictions for the nonlinear part of the computation, and how we can compare neural responses to the optimal predictions based on the limited number of observations that one can make in neural circuits. So that is lectures one to three. The second half will be more about hyperbolic geometry. We will explore the connection between hierarchical systems, hyperbolic geometry, and information. We will talk about information theory in decision-making, also as it relates to hierarchical systems, and about information theory in search; some of that work was actually done at the ICTP. In the last few lectures I was thinking we will talk about information maximization in recurrent networks, and there will be a final concluding lecture to tie it all together. To begin with: I heard that you just went through the basics of information theory, so we can make this more of a discussion, and you can tell me what you know and what you would like to know about this topic. I will provide my own examples, and hopefully they will augment what you have just heard. Examples of information transfer: radio communication from Earth to Mars. You know the time delay; there are very exciting movies coming from Mars, and those who work on the project have to live on the Martian schedule and stay synchronized with the delivery of signals. Then telephone communication, which was one of the original applications of information theory, during transmission and encryption, and also neural transmission.
So we are biological systems, but we operate in a physical world, and so we have to obey the rules of statistical physics. There are basically two theorems underlying information theory that I find very useful. One of them is the one you discussed recently, Shannon's channel coding theorem, which states, colloquially speaking, that for a given amount of accuracy you need to invest a certain amount of energy. The other one, which we will discuss in the second half of the course, relates to Kelly gambling. Soon after Shannon published his theorem, people asked: why is information theory useful, and why is it relevant to biology? Then there was a publication by John Kelly, who showed that information determines the maximum rate of growth for, say, a portfolio in economics; and you can formulate equivalent problems in bacterial growth. Meaning that the maximum rate of growth of a bacterial population will be limited by how much information it extracts about its environment. And if you think of two populations of bacteria that are competing for resources, the one that extracts more information from its environment has a chance to outgrow and outcompete the other population. So those are some of the examples. In neural transmission, we have signals from the environment, and then there are signals in the brain that have to reduce our uncertainty about what is happening in the real world. Another example is in evolution: the accumulation of information across generations, stored in DNA. And another example, which I think is not mentioned on this slide, is the accumulation of information in cultural knowledge, as in books; some say that is the driver of human evolution. So are there any questions or comments from the audience before we go through specific examples?
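The Kelly result mentioned above can be made concrete with a small numerical sketch (my own illustration, not from the lecture slides). For repeated even-odds bets on a biased coin with win probability p, the optimal fraction of capital to stake is f* = 2p - 1, and the resulting growth rate per bet works out to exactly 1 - H(p) bits, so the achievable growth rate is set by the information the bettor has about the outcome:

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

def kelly_growth_rate(p):
    """Expected log2-growth per even-odds bet at the optimal stake f* = 2p - 1."""
    f = 2.0 * p - 1.0
    q = 1.0 - p
    return p * math.log2(1.0 + f) + q * math.log2(1.0 - f)

# Growth rate equals 1 - H(p): lower outcome entropy means faster growth.
for p in (0.55, 0.7, 0.9):
    assert abs(kelly_growth_rate(p) - (1.0 - binary_entropy(p))) < 1e-12
```

The identity follows by substituting f* into the growth expression: p log2(2p) + q log2(2q) = 1 - H(p).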
More material before questions can be asked. I cannot hear. Yeah, sorry. Can you hear me now? Yes. Yeah, so I think there are no questions as of now, but... All right, okay. So I will begin with an example of inference; it's from David MacKay's textbook, where he goes over a case of inference in a crime scene investigation. Two traces of blood were found, one type O and the other type AB. Given that we know the frequencies of the various blood types in the population, type AB is fairly rare, 1%, and type O is 60% of the population. Then you have two suspects, and one of them, Oliver, is determined to have type O. So the question is whether this observation is evidence against this person, or supports the hypothesis that they should be investigated further. What do people think? What do you think? Sorry, can you speak up? I think it doesn't give too much information, because most people are type O, so it could go either way. Yes, I think qualitatively you're right, but we will go over the quantitative derivation, and it says that this is actually weak evidence towards the innocence of this person. So we are going to compute the probabilities of the various scenarios and compare which of them is more likely. One scenario is that the suspect, Oliver, and an unknown person were present. The alternative hypothesis is that two unknown people from the population were present, not including Oliver. The probability of the data given the first scenario, S, is the probability that the AB sample came from the unknown person. Under the other hypothesis, you have the probability of both blood types coming from two unknown people. When we take the ratio of the two probabilities, it turns out to be one over twice the probability of observing the O blood type, which comes to about 0.83. So, roughly speaking, the data provide weak evidence against the proposition that Oliver was there.
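The likelihood ratio in this example takes only a few lines to check; a minimal sketch of the computation just described (the variable names are mine):

```python
p_O, p_AB = 0.60, 0.01   # population frequencies of the two blood types

# Scenario S: Oliver (type O) left one trace; an unknown person left the AB trace.
p_data_given_S = p_AB
# Alternative: two unknown people left the traces (either could be the O donor).
p_data_given_not_S = 2 * p_O * p_AB

likelihood_ratio = p_data_given_S / p_data_given_not_S   # = 1 / (2 * p_O)
print(round(likelihood_ratio, 3))  # 0.833
```

Since the ratio is below one, the type-O trace slightly favors the hypothesis that Oliver was not there.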
So it's an interesting quantitative application of our intuition about how we integrate information from various sources. Now, the information content of a variable: some outcomes are very rare and some are common. If I observe an outcome that is very rare, then intuitively I gain more information about the underlying process than I do from an event that is more common. So the definition that you probably heard is that the information is log of one over P, and today we will discuss why this, up to a normalizing constant, is actually the only function that can define information. And I see there's a question in the chat. No, it's answered. So we'll take care of the chat. It's about recording lectures. No, the question was about hyperbolic geometry. Well, sorry. Yeah, so somebody asked what hyperbolic geometry is, and I think I should answer, no? From Colin. Ah, maybe he sent it to you directly, so we didn't see the question. Okay. All right, so I will pause here and say that the ICTP is a great place to talk about hyperbolic geometry; as you walk around, you will find lots of specialists on hyperbolic geometry. We usually think that the world around us is Euclidean, meaning that parallel lines do not intersect and distances are measured according to the standard Euclidean distance. As you know, there were five axioms from which Euclid formulated his geometry. It turns out that you can replace one of them and still have a self-consistent set of axioms.
That was the invention of hyperbolic geometry: it is a geometry with negative curvature, and we will talk about why it is important for biological circuits. It is also the geometry that describes our space-time on large cosmological scales, and, you might be surprised to know, it actually also describes your perception of the world on much more local scales, although we don't notice these distortions. One example of the distortion: you might know that infants try to grab the moon; they think the moon is visually much closer to us than its real distance, and that's an example of the hyperbolic distortion in our perception of distances. More about that will come in the later part of the course. Is that okay? I can't quite see the person who asked the question, but thank you. Colin, is that the right name, or is it the last name? Yeah, thanks a lot. So, back to the information content of a variable: if you have two independent random variables, then the information adds, and the total amount of information will be the sum over the two independent random variables. Here is another example of the use of information: the submarine game. How many of you have played it? I played it as a child. You can raise your hand and we will count. Okay, very good, very popular. It can be played with minimal resources. Each player hides a submarine on a square of an 8x8 grid; just one submarine, we will simplify the game. If the submarine is hit on the first guess, then you have reduced the uncertainty from 64 squares to one, so you gained log2 of 64, six bits of information. And if the first shot is a miss, then the change in entropy is that we had 64 possibilities and now have 63 remaining.
So the information gained is minus log2 of this amount; let's see, I'm trying to find my comment button, but here: minus log2(63/64), we gained just a tiny bit of information by reducing the entropy. Then if the second shot is a miss, we gain a little more, and so on. After 32 misses, you add up all of these contributions and together they give one bit: in essence, we have reduced the uncertainty from 64 possibilities to 32, so roughly one half. And the total information content of all the outcomes is always six bits, independent of how soon the submarine was discovered. So that's one example of how information accumulates. Another example, which we will discuss in more detail in subsequent lectures, is when the probability of finding the submarine is not uniform. In this example all squares had equal probability, but imagine that you have some prior knowledge about where the submarine is located. In that case, the optimal solution is to order the squares by how likely the submarine is to be there, and then go one by one, starting with the likeliest position. And what we will talk about a little in this lecture, and also in subsequent lectures, is that the average number of steps or guesses it takes to find the target is indicative of the entropy of the underlying probability distribution. To take the extreme example, suppose you know for sure that it is located in one particular square: then you will always find it in one step, and the entropy is zero. But when there is some uncertainty, the number of steps is bounded from below by the exponential of the entropy of the distribution. So one can go back and forth between how long it takes to find something and the entropy of the underlying probability distribution, and vice versa.
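The bookkeeping above, where each miss contributes a small amount and everything telescopes to six bits, can be verified directly; a minimal sketch (the function name is mine):

```python
import math

def info_from_game(n_squares, n_misses):
    """Total bits gained from n_misses misses followed by a hit."""
    bits = 0.0
    remaining = n_squares
    for _ in range(n_misses):
        # A miss eliminates one square: gain -log2((r - 1) / r) bits.
        bits += -math.log2((remaining - 1) / remaining)
        remaining -= 1
    # The hit identifies the submarine among the remaining squares.
    bits += math.log2(remaining)
    return bits

# The total is always log2(64) = 6 bits, however late the hit comes.
for misses in (0, 1, 32, 63):
    assert abs(info_from_game(64, misses) - 6.0) < 1e-9
```

The sum telescopes: the misses contribute log2(64/remaining), and the hit contributes log2(remaining).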
The less knowledge we have about the target location, the longer it will take us to find the target. I would like to find my annotate button. That's okay. So then, another example: maximally informative experiments. This has many applications for biological systems. One example: we are given 12 weights, all equal except for one that is either heavier or lighter. We are supposed to design a weighing scheme that will identify this odd weight with the smallest number of measurements. We have a two-pan balance with three outcomes: the combined weights are equal, heavier on the right, or heavier on the left. So we have a three-outcome measurement device, and we need to design a strategy to find the odd weight, and whether it is heavier or lighter than the others, using as few measurements as possible. We are starting with 12 weights; how would you proceed? That's the question. We have a question on the submarine game in the chat. In the chat it says: would it be a good strategy to change your estimated target if it took too long to hit the submarine? Can you rephrase the question? It said: would it be a good strategy to change our estimated target if it took too long to hit the submarine? So I think another way of rephrasing this question is to say: I expected to find the submarine after this many guesses and I haven't found it yet, so it means that my assumption about the probability distribution is wrong. And that's an important aspect of these questions, because many of the problems that you find in a textbook say: here is the probability of various events, what is the optimal strategy? But in real life nobody gives you these probabilities; they need to be estimated.
In other words, sometimes we have to make a move not to the square that has the highest probability, the maximum-likelihood move, but to a square that will help us measure the overall probability distribution in the field. That would be an example of a search strategy, infotaxis, which we will talk about in more detail. But to rephrase the question: if it takes too long, then there must be something wrong with my model of the world, so maybe the probabilities that I assumed are wrong. And actually, this is an example of a kind of decision-making that animals also do. In one of my papers with collaborators, we studied how a little worm, C. elegans, makes decisions. In that case, the C. elegans is just like in this submarine game: it's crawling around, and its job is to find the bacteria that it eats, but it has to search an area which is a thousand times its body size. In a simplified experimental situation, the animal lives on a plate that is full of food, full of bacteria, and maybe these are even dead bacteria so that they do not move; prepackaged food, if you will, for C. elegans. The worms crawl around eating it, and then they are picked up with a pick and put on a new plate where there is no food. I'm not quite prepared with the slides, so I will just tell you the story now, and the slides will come in the next lecture. What happens is that the animal will search the area where it has been dropped off. Actually, it will first crawl a little way from where it was dropped, by its estimate of how much it was moved, and then it will search this area for some time. Then it will give up and start going somewhere else. It turns out that this type of decision-making can actually be formulated as an infotaxis-like trajectory, where the animal is exploring the most likely possibility.
But if it is not finding any food where it was supposed to have found food, then that must mean the model of the world is wrong, and so it says: well, it's time to move somewhere else. You can think of how this is an example of decision-making in humans: I'm working on a research problem and I expect to find a solution, but after a while, if the solution is not found, I will say, well, maybe it's time to switch to another problem; and so on with other everyday tasks. Thank you for your question. Is that good? Yeah, yeah, cool. Thank you. I appreciate your questions very much, because we should make more of an effort to be interactive given that I am on Zoom, so please ask questions. I heard Bill Bialek once say that it is more important to uncover a little than to cover a lot. So we don't want to cover a lot of material but leave questions unanswered. So this is an example of maximally informative experiments, and it has a lot of applications. This is a kind of baby example that we will talk through, but it has almost immediate applications. If any one of you ever has to apply for a research grant, what reviewers like to see is that you formulate two alternative hypotheses about the subject you're studying, and that your experiment will eliminate one of them. You can think of this as a maximally informative way of studying that particular question, because you're setting up your hypotheses as two equal alternatives. We don't want to be testing alternatives where one is most likely false and we are just fine-tuning the other. In a maximally informative experiment you would like to set up alternatives that are equally likely; then you do an experiment and eliminate half of the possibilities, and per experiment, that's maximally informative. So now we see this in a synthetic situation with these weights.
At each stage, our strategy is to set up the possibilities so that they are equally likely. For example, we have our 12 weights and these possibilities: either the first weight is heavier or lighter, or the second weight is heavier or lighter, or the third, and so on, for all 12 weights. Suppose we weigh four of them against another four. Then if the left pan is heavier, it means that either one of weights one through four is heavy, or one of weights five through eight is light. Or, if the left pan is lighter, one of one through four is light, or one of five through eight is heavy. Or, if the two pans balance, the odd weight is somewhere in the set that was not tested. In other words, if we are on the first branch of this hypothesis tree, we have eight remaining possibilities. Now I can weigh three against three, and if one side goes down, it's either one of the heavy candidates on that side or one of the light candidates on the other, and so on; we again have three possibilities at each step. So with basically three measurements we will be able to identify the odd weight. I'll give you a few minutes to look at the table here and go over the various possibilities. Is this the unique best strategy in this case? Is this a unique strategy? And if it is not unique, are there different strategies that solve this problem, or strategies that solve it better? I mean, is this maximally informative? It should be. So we'll do a computation, I think. So let's think about... Yeah, I think it should be maximally informative. Let's see. You see how we set up this weighing here? Two of the heavy candidates and one light candidate, right? So if that side goes down, we have eliminated the possibility that six is light, and it could be that one or two is heavy. So here, these are not fractions, okay? These denote the weighings.
Like in this possibility: one plus and two plus versus five minus. So if the pans are equal, that's the third outcome, then it means that it was five minus. Okay, so we can now compute... oh, I guess I don't have a computation here. The claim is that at each stage we have set up the weighing to maximize the information gained. So what is the entropy here? There are 24 possibilities, so the entropy is log2 of 24, right? Okay. So we have a question, but I don't know whether it is about this one or about the submarine case. The question from Edward says: I don't really understand how the second loop works. You mean this loop over here? Do you see my cursor? Let's see. So I guess the question is about this part. We have determined that one of weights one through four may be overweight, or one of five through eight may be underweight. Then we make a combination of two heavy candidates and one light candidate versus another two heavy candidates and another light candidate. If the pans are equal, it means that it's either seven or eight, and those are our remaining possibilities. And here you might think you shouldn't weigh seven directly against eight... actually, I guess in that branch you will know which one it is, so you can do seven versus eight. But alternatively, you can take another, known-good weight and weigh, say, one against seven. Okay, so then... So the question was answered, or I think maybe it will be helpful to go through the explanation for other people. Okay, Edward says it's okay. I think you need some time to think through this example. Any other questions on this? There is another question from Gianluca: we had three possible outcomes and started dividing the groups into three parts; is this a general rule? Yes, it's an approximation to the general rule. We would like to set it up so that... with one measurement, if we had two alternatives, you can learn one bit of information.
If you have three alternatives, you can learn log of three, base two. So we would like to split into three equally likely possibilities, so that the probabilities of the outcomes are approximately equal. Okay, thank you. If we had only two outcomes, the setup would be different. If we didn't know whether it would be... Yes, but with a weighing, you always know one pan is less and the other is more; you are taking advantage of that in this weighing experiment. Now, to generalize this idea: there is a series of papers, starting with David MacKay and others, on maximally informative experiments. They go roughly like this: you have some model of what a neuron does, and you have some unknown parameters in your model. You can ask: what stimuli should I use that will most reduce my uncertainty in these parameters? That's the main idea of maximally informative experiments. They have a disadvantage in practice, because biological systems adapt. If you had a static system whose parameters were not changing, then this idea of maximally informative experiments would be a great one. But because the parameters of the neuron can change depending on which stimuli we use, through adaptation, it can sometimes set up a cycle where we are chasing a moving target: the parameters change as a function of what stimuli are being shown to the animal. Imagine that in this weighing example the odd weight was hopping around: then it would be a more complicated scenario where you're trying to make measurements but the odd weight is playing hide and seek, jumping around depending on what measurements you make. That is more like a theory of games, and it becomes more interesting and complicated fairly quickly. Going back to the infotaxis example, sometimes, just as a point of discussion, one can provide a counterargument.
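The counting argument behind this three-way splitting can be written out as a back-of-the-envelope check (my own illustration, not from the slides): there are 24 hypotheses (12 weights, each possibly heavy or light), and each three-outcome weighing can deliver at most log2(3) bits, so at least log(24)/log(3) ≈ 2.9, i.e. three, weighings are needed, matching the strategy above:

```python
import math

n_weights = 12
n_hypotheses = 2 * n_weights            # each weight may be heavy or light: 24
bits_needed = math.log2(n_hypotheses)   # ≈ 4.58 bits of uncertainty
bits_per_weighing = math.log2(3)        # a balance has three outcomes ≈ 1.58 bits

min_weighings = math.ceil(bits_needed / bits_per_weighing)
print(min_weighings)  # 3
```

This is only a lower bound from counting; the table in the slides shows that three weighings actually suffice.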
You can set up situations where, for example, going back to the submarine game, you know perfectly well where the submarine is, but in real life you can't get to it before it changes its position. So you can have perfect information about its location, but in terms of actually getting that submarine, or getting that fish, or getting that prey, that doesn't always translate into gathering the reward. Then we are entering the regime of predictive coding and predictive information: just as you can compute information about the current position, you can also compute information about the future position. It becomes a more interesting computational problem, but one can still use these ideas to maximize information, assuming that there is a time lag between knowledge and response. Okay, so we have lots of questions in the chat. I'll try to address this one by repeating what was said before: it's about why we choose one and two in the second weighing instead of, say, one and five. Let's see. So I hope, Gitender, is it okay? Yes. Okay, Gitender is fine. Thank you. So then, one of the definitions of entropy is: how rich is the source? During lectures I like to use this example. Suppose you ask me questions, and no matter what you ask, I say yes. It's obvious that in that case I'm not conveying any information, because my entropy is zero. The same thing goes for neurons. How many of you are experimentalists? Do we have any hands raised for experimentalists? I can't quite see. No, I think no experimentalists, at least here; maybe we have some online. Okay, so then I will tell you about neurons; I did some experiments, so I will give you my theorist-experimentalist view of neurons. As you know, a neuron is a cell, and it responds with, essentially, zero or one: one means it produces a spike, and spikes are costly. So imagine a dead neuron: no matter what signal you send to it, it's not sending anything out.
That neuron conveys zero information because its entropy is zero. Now, before a neuron dies, sometimes, if you're recording, so you put an electrode next to a neuron, when it is a healthy neuron (we will show you an example of a recording), it alternates: sometimes it spikes, sometimes it doesn't. And just before it dies, it will produce lots of spikes; that's often the signature of a dying neuron. In that case, it is also not conveying any information, because it constantly says one, no matter what the input is. One example from hearing: on occasion, one can hear a very frequency-defined noise in the ear, and then it goes away. Unfortunately, what that means is that some of the cells in our ear have died. These cells have very complicated mechanical sensors, very complicated physics and mechanics to sense vibrations, and when they die, they cannot be regenerated. That's the argument for protecting your hearing and not listening at large volumes, because the damage is use-dependent. When such a cell dies, it produces a lot of spikes, and the person hears the frequency that corresponds to that neuron. Not very pleasant, but it happens. So this is an example of information content, but let's do a variation on the same theme. You're asking me questions, and now I'm saying yes and no. You can get more sophisticated and ask me the same question repeatedly, and sometimes I say yes and sometimes I say no to the same question. So I have the capacity to convey information, because the entropy of my answers is nonzero: the answer is variable, sometimes zero, sometimes one. But what is interesting is when you specify the input, specify the question: if the entropy of my answers to a particular question equals the entropy of my answers over all kinds of different questions, then no information has been conveyed. In that case, the neuron is still alive.
It produces zeros and ones, but it may be disconnected from the input, and then the information will again be zero. So I hope this provides some examples of the practical use of information. Yes, go ahead, I hear there is some question. Questions? Oh, I think we are all with you. Okay. All right. So now, we all know the definition of entropy in physics. (I would like to find my annotate button; we found it during the tests. Well, I will have to make do with the comment tool.) In physics, the entropy is the logarithm of the number of possibilities: S = log N. Now you can write this as minus the logarithm of the probability of each event, S = -log(1/N), and I can also write it as a sum over the N possibilities, S = -sum from i = 1 to N of (1/N) log(1/N). Let me check I didn't lose any signs; yes. So this is S = -sum over i of p_i log p_i. That's the definition of entropy we use when talking about a gas of particles in a volume. But this last form is more nuanced: it covers the case where the various possibilities are not equally likely, so it is a generalization of information and entropy. It is the same thing as the first equation if, instead of 1/N, we write the probability p_i of outcome i. Looking at this equation, you can see that log(1/p_i) is the information of one outcome i, and the source produces these outcomes with probabilities p_i, so we just average, summing over the various possibilities. The entropy of a source means that we have the information of a given outcome, which, as I have drawn in the past, behaves like this: when p is large, it should go to zero, and when p is small, it should be large. So that's another way to state the entropy of a source. And we talked about information being a change in entropy.
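The chain of equalities above reduces to a few lines of code; a minimal sketch (the function name is mine):

```python
import math

def entropy_bits(p):
    """Shannon entropy -sum p_i log2 p_i, in bits, of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# For N equally likely outcomes this is the physics definition log2(N):
N = 64
assert abs(entropy_bits([1 / N] * N) - 6.0) < 1e-9

# A source that always answers "yes" has zero entropy, hence conveys nothing,
# as in the lecture's example of the dead (or saturated) neuron.
assert entropy_bits([1.0]) == 0.0
```

Terms with p_i = 0 are skipped, following the usual convention 0 log 0 = 0.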
So if you learn everything about the state of the source at a given time, then you reduce the entropy to zero. Any other questions? Is this similar to what you covered earlier today? Well, I think what we did was a little bit more theoretical, with fewer examples, so I think this complements what we have done and gives more intuition. Okay. All right, so then, maybe this will be the more theoretical part; we will see whether we can go faster. This is the famous Shannon paper. If you haven't read it, I recommend that you do. It's a little difficult to get through the first few pages because the language is a little old, with terminology about telegrams and teletypes and things like that; I found it a little difficult to get through the first few pages, but then it is very interesting and informative. In this paper he derives a theorem that entropy is the unique measure of available information consistent with three postulates. Also, going with examples in the spirit of that paper, I would like to ask you: what do you think is the relationship between information and crossword puzzles? And here is another hint: why are crossword puzzles usually in 2D and not in 3D? You can imagine having a 3D structure, with clues going across, down, and in depth. Nicolo says that with each word you guess right, you gain information about the other words. Yes, right. That's right, and the full answer is in Shannon's paper. He was very interested in sending coded messages (it's a kind of World War II communication problem) and in how they can be compressed. So he studied language, and he studied the correlations between words and syllables in the language.
It turns out that if there were no correlations between the letters — meaning that knowing one letter, say an E, told you nothing, because every possible next letter would be equally likely — then it would be basically impossible to solve a crossword puzzle: there would not be enough correlations between the variables to help you out. But if the letters were too correlated, meaning that once you know one letter you basically know the whole word, then it would not be possible to construct a crossword puzzle at all. It turns out that the correlations in language are just strong enough to leave some room for variation, which makes generating crossword puzzles possible. In 3D, however, there are too many constraints, and it becomes infeasible. Further details are in Shannon's paper — it is a fairly long paper with many things in it, not just crossword puzzles. So then he posed three axioms — and you can stop me and we can go faster if this is something you already discussed, but I think it also provides a link to hyperbolic geometry and hierarchical systems, so let us go through it. These are three properties we would like an expression for the entropy, as a function of the probabilities, to have. First, it should be monotonic: as more measurements are made, the information should increase, not decrease. Second, for independent measurements, the entropies should add. Third, for a branching process — and this will be the connection with hyperbolic geometry — the total information should be a weighted sum of the information gained at each branch. So, did you go through the derivation of the Shannon entropy? No, we didn't go through this morning.
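The second axiom — additivity for independent measurements — can be checked directly on a toy source. A minimal sketch, assuming M independent measurements with K equally likely outcomes each (the values of K and M are arbitrary choices):

```python
import math
from itertools import product

def entropy(probs):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# M independent measurements, each with K equally likely outcomes,
# combine into K**M equally likely joint outcomes, so the entropies add.
K, M = 3, 4
single = [1.0 / K] * K
joint = [math.prod(ps) for ps in product(single, repeat=M)]

print(len(joint))                            # 81 joint outcomes = 3**4
print(entropy(joint), M * entropy(single))   # both equal log2(81)
```

The same additivity holds for any independent (not necessarily uniform) measurements, since the joint probabilities factorize.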
All right, then it is useful to go through it, because it is an example of a hierarchical system and will relate later to hyperbolic geometry. Suppose you have constructed your questions: you ask the first question and get answer A or B. Once you have A, you make another measurement, with possibilities A1, A2, A3 — or, on the B side, B1 and B2. Information must be a function of how many questions you have asked, because that is the number of measurements. Let us consider the case where the number of possibilities is K^M — a hierarchical process. So our answer is composed of M independent parts, and each part has K equally likely possibilities. Not two branches here and three there as in my drawing — for now the branching will be uniform. Maybe my slides got rearranged a little; just a second while I find them. Do you want a break, or shall we go non-stop? I think we were supposed to stop at half past five, so in 25 minutes from now — probably half past eight your time. All right. There is this joke from the old days: a person arrives with printed slides, and just before the talk they all fall and get mixed up. I will not reorder them; I will go with these slides as they are. So what we have is this. Remember that our assumption was independence: for independent measurements, the entropy should be additive. In other words, if we have K^M possibilities, the entropy function applied to them has to be M times the function of one measurement: f(K^M) = M f(K). We have a question — oh, I guess there is a delay between the chat and my answers. "In the crossword, you can have more or fewer black squares." No, this is me.
I'm trying to discuss this question from Jitendra about the density: what if you change the density of words? In a crossword, I guess you can add these black squares. Right, and I think it is an interesting question whether the crossword becomes simpler or more complex as you add or remove black squares. Yes, and different languages have different statistics, so you could look at the average density of these black squares in crosswords in different languages. Another example, from a recent paper, concerns a theory of human communication. The authors measured how much information is conveyed per syllable, and how fast syllables are pronounced, across languages. Italian was one of the languages they studied: as I recall, its syllables are more correlated, so each carries less information, but people speak faster; compare that with, say, Japanese, where the syllables are more independent but spoken more slowly. What does everybody think — does that sound reasonable? Go ahead, Colin. The question was: correlations between what — between different spoken words? Between different syllables. Okay. So the understanding is that there is a certain rate at which the human brain can absorb information: either you speak faster and the listener fills in what was said from the correlations, or every syllable is independent and you have to speak more slowly. Okay, so this was a digression, hopefully an entertaining one. We found that information has to satisfy additivity: because we ask independent questions, the answers must add, and therefore f(K^M) = M f(K) — this is the amount of information we get from M independent measurements. That is the independence-and-additivity statement. Now consider a pair of integers, L and N. We do not know what this function f is, but we are going to bound it by a kind of asymptotic argument.
So we look for two integers M and N such that L^N is bracketed between subsequent powers of K: K^M <= L^N < K^(M+1). The information has to be a monotonic function of the number of possibilities, so when we apply our unknown function f to these numbers, the same ordering holds: f(K^M) <= f(L^N) <= f(K^(M+1)). By the additivity property from the previous slide, this becomes M f(K) <= N f(L) <= (M+1) f(K). Now divide through by N and by f(K): we have M/N <= f(L)/f(K) <= (M+1)/N. On the other hand, taking logarithms of the original bracketing inequality gives M log K <= N log L <= (M+1) log K, and dividing by N log K gives M/N <= log L / log K <= (M+1)/N. So now we have two inequalities that are very similar to each other: both ratios are squeezed into the same interval of width 1/N. Going to the limit of large N, the difference between the ratio of our two unknown function values and the ratio of the logarithms has to be zero, because the two brackets pinch together: subtracting one from the other, |f(L)/f(K) - log L / log K| <= 1/N, and 1/N becomes our epsilon. In other words, there is really no chance for this function to be anything other than the logarithm, up to a constant factor. That is a nice example from Shannon's paper of how you start with three reasonable assumptions — monotonicity, additivity for independent measurements, and the branching property — and the only function that satisfies them has to be the logarithm. And yes, so that's the definition.
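The squeeze in this derivation is easy to verify numerically. A rough sketch, taking K = 2 and L = 3 as concrete (arbitrary) examples; the helper name `bracket` is mine:

```python
import math

def bracket(K, L, N):
    # Find the integer M with K**M <= L**N < K**(M+1); monotonicity and
    # additivity then force both f(L)/f(K) and log(L)/log(K) into the
    # interval [M/N, (M+1)/N], whose width 1/N shrinks to zero.
    M = int(N * math.log(L) / math.log(K))
    while K**M > L**N:            # guard against floating-point rounding
        M -= 1
    while K**(M + 1) <= L**N:
        M += 1
    return M / N, (M + 1) / N

lo, hi = bracket(K=2, L=3, N=1000)
ratio = math.log(3) / math.log(2)        # ~1.585
print(lo <= ratio <= hi, hi - lo)        # bracketed, interval width 1/N
```

Increasing N tightens the interval around log(L)/log(K), which is the numerical face of the uniqueness argument.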
You can change the constant — that is, use a different base for the logarithm — but there is no choice other than the logarithm of N. So for equal probabilities, the information has to be proportional to log N. To find the expression for unequal probabilities, we consider the case of rational probabilities: we write p_n = k_n divided by the sum of all the k's, and we consider the total number of equally likely possible answers, N_total = sum over n of k_n, grouped into these N groups. Think of this with the hierarchical picture, where the k_n are the numbers of possibilities within each group: at the bottom level all answers are equally likely, but the groups, with probabilities p_1, p_2, p_3, are not equally likely. We know intuitively what the form of the answer should be, but this gives a quantitative derivation. First, we need to find which of the N groups the answer belongs to; we denote this information I({p_n}), and it is what we are after. At the second stage, if we know that we are in, for example, group n, the remaining uncertainty is log2(k_n), because I have been narrowed down from all possibilities to the k_n remaining ones. If I knew exactly which possibility we fall into, I would have the full information; but if I only know the group, an uncertainty of log2(k_n) still remains. Because each group occurs with probability p_n, the average information still to be gained in the second step is the sum over n of p_n log2(k_n). And the total information gained across the two steps has to equal the logarithm of the total number of possibilities, log2(N_total). So the information at the first step plus the average information at the second step equals the total: I({p_n}) + sum_n p_n log2(k_n) = log2(N_total). If you subtract one from the other, using p_n = k_n / N_total, you get the famous equation for the total entropy of the source: I({p_n}) = -sum_n p_n log2(p_n). Any questions about this derivation?
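The two-step bookkeeping above can be checked on a small example. A sketch, with hypothetical group sizes k = [4, 2, 2] chosen for illustration:

```python
import math

def entropy(probs):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# N_total equally likely answers, sorted into groups of sizes k_n,
# so the group probabilities are p_n = k_n / N_total.
k = [4, 2, 2]
N_total = sum(k)                          # 8 equally likely answers
p = [kn / N_total for kn in k]            # [0.5, 0.25, 0.25]

# Step 1 gains I({p_n}) bits (which group); step 2 on average still
# leaves log2(k_n) bits of uncertainty inside the chosen group.
lhs = math.log2(N_total)
rhs = entropy(p) + sum(pn * math.log2(kn) for pn, kn in zip(p, k))
print(lhs, rhs)   # both 3.0 bits, so I({p_n}) = -sum_n p_n log2 p_n
```

Subtracting the second-step average from log2(N_total) recovers exactly -sum p_n log2 p_n, as in the derivation.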
I think we are fine; also online there are no questions. Okay, so now we can come back to hyperbolic geometry and explore its relationship with information. Look at this type of network — sorry, pointer issues — I am asking you to compute a distance between this node here and this node. How would you do it? Forget about information theory for a moment: I just have a hierarchical network. How do you compute a distance between these two nodes? Yes, it is a question for the audience. How do you define the distance? Well, that is also a question for you: I gave you this graph, so how would you define a distance on it? Along the graph? The minimum number of steps to go from one node to the other? Yes, right. So the minimum number of steps — you can draw it here: one, two, three, four. And then between this point and this point — I will not switch colors, I guess it does not matter — we go like this, a little trajectory, or within the cluster you have a little trajectory. That is our graph distance. Now here comes the hyperbolic geometry. Imagine you have a hierarchical network like that, but there is some uncertainty in the positions of the nodes, and you would like a distance that captures the distance along the tree. Then it turns out that the optimal continuous approximation to the distance between two points goes something like this: the path of shortest distance, the geodesic, is curved — it does not go straight across.
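The audience's answer — minimum number of steps along the graph — is just breadth-first search. A minimal sketch on a hypothetical two-level tree (the node names and adjacency are my illustration):

```python
from collections import deque

def graph_distance(adj, start, goal):
    # Breadth-first search: minimum number of edges between two nodes.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return None   # goal unreachable

# A small tree: leaves in different subtrees are four steps apart,
# because the path must climb up to the root and back down.
adj = {
    "root": ["a", "b"],
    "a": ["root", "a1", "a2"], "b": ["root", "b1", "b2"],
    "a1": ["a"], "a2": ["a"], "b1": ["b"], "b2": ["b"],
}
print(graph_distance(adj, "a1", "b2"))   # 4
```

The fact that the shortest path between leaves detours through the root is exactly the "curved geodesic" intuition in the continuous picture.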
So whenever you have a hierarchical process underneath, and you would like a distance that somehow captures that underlying hierarchy, hyperbolic geometry often arises as a natural continuous approximation to the underlying hierarchical structure. That is the qualitative statement. More quantitatively, there are various models of hyperbolic space; one of them is the Poincaré half-plane, in which the geodesics do indeed curve in this way. Okay, so we have lots of answers in the chat. Gianluca Serra says that he would count the number of jumps to define the distance. And there is a question: does this hyperbolic geometry have to do with the number of nodes at a certain distance — with the distribution of points? So imagine you have uniformly distributed points in a space; you start from one point and ask how many points lie within distance r. We know that in Euclidean geometry this goes as r^d. How does it go in hyperbolic geometry? It goes exponentially, as e^((d-1)r), where d is the dimension of the space. And what is also interesting: the exponent has to be dimensionless, so r is measured in units of the curvature radius. The example on this slide is useful because the curvature here is not uniform — for uniform curvature I would need equal branching ratios everywhere, and here the curvature is larger in some places than in others. I think a good physical model of hyperbolic space is a cauliflower or broccoli — the way it grows. Overall it grows roughly uniformly, but you can imagine that it can grow unevenly.
If there are some adverse conditions, it will grow more, and develop more detail, in certain directions. This is also a model of how we allocate possibilities: imagine that the states are under our control and we can define the categorization variables. When does it make sense to split one category, such as this one, into five subcategories rather than two? How many categories should I assign? That question is related to the information content of the source. One example from my own life: when I was transitioning from physics to neuroscience — in those days papers were printed out — I had one binder, and it said "neurons". All of my neuroscience papers fit into that one concept. Then, as you learn more about neuroscience, you start sorting into visual neurons, auditory neuroscience, general principles of coding, and so on. The same thing happens when a child is learning a language: they first learn one concept, "animal", then it becomes cats and dogs, and then further subdivisions, and so on. So there is a whole question of how you form an appropriate tree, and that relates to the information content of the source. Okay — there is a comment: "the dimension is related to the number of branches." That is a tricky question. In one sense, the dimension can represent the branching factor, and I think that is the most reasonable interpretation. But if you just look at the growth of the number of states with distance, you cannot separately determine the dimensionality, the radius, and the curvature. So I would say that the dimension is most closely tied to the branching factor; if we only look at the asymptotic growth of distances with radius, then the curvature of the space, the dimensionality, and the branching ratio are confounded.
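The growth-rate contrast behind this discussion — polynomial r^d in Euclidean space versus exponential growth in hyperbolic space and in trees — can be seen in a few lines. A rough sketch (the branching factor and radii are arbitrary choices):

```python
import math

# Circumference of a circle of radius r in 2D:
# Euclidean: 2*pi*r.  Hyperbolic (curvature -1): 2*pi*sinh(r),
# which grows like pi*e**r -- exponentially, as a tree's node count does.
for r in [1, 3, 6]:
    print(r, round(2 * math.pi * r, 1), round(2 * math.pi * math.sinh(r), 1))

# A tree with branching factor b: number of nodes within n steps of the root.
b = 3
counts = [sum(b**k for k in range(n + 1)) for n in range(5)]
print(counts)   # [1, 4, 13, 40, 121] -- exponential, never polynomial in n
```

This is also why the branching factor, dimension, and curvature are confounded if one only observes the exponential growth rate: different combinations can produce the same exponent.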
Thank you for the question. So, the last concept for today — are we stopping just about now? Yes — or do you want me to go for 15 more minutes? I wasn't sure. Well, by the way, I think we already went over the asymptotic equipartition property. Okay, then you can tell me. I think it is kind of a magical property: you have a probability distribution, and you can approximate it as a constant on the likely outcomes and zero elsewhere, and the entropy gives you the number of these likely possibilities. If you went through this, then we can stop here and go to conclusions. Okay, so we have a couple of questions in the chat. The first: someone wouldn't mind repeating the topic. Of course — we are going to talk much more about hyperbolic geometry. And the second question: is hyperbolic geometry an optimized way of dealing with distance? Is it equivalent to Euclidean geometry, or can we simply not do this work with Euclidean geometry? Well, I think the point is that it is the outcome of an optimization principle: the way in which, say, topics are organized — as in the example of the neuroscience folders — is the outcome of an optimization principle, and it turns out to obey hyperbolic geometry. That is what I gathered from this informal introduction to the general topic. Yes, thank you, Matya. Ah, okay — Valery Engelmeyer says she wouldn't mind repeating the definition of typical sets. So maybe we can take five more minutes, if you all agree, for a different perspective on typical sets. Roughly, in this figure, these are the sets of possibilities — you have lots of possibilities.
Basically, think of a probability distribution that is more or less like a Gaussian: it goes up here and comes down. I will approximate this Gaussian as zero, then a constant, then zero again. Then the statement is that the outcomes under the constant part are all roughly equally likely, and this typical set has 2^(nH) elements, where H is the entropy per variable: if I take log2 of the number of elements, I get n times the entropy. So the entropy determines how many points are within the likely scenario, and the remaining points are unlikely, so for the most part we can ignore them. For example, sometimes people ask: I went to the doctor, I did this test, and they tell me the probability of the disorder is 10 to the minus 7 — what does that mean? It means you don't have it. But that is not what they are allowed to tell you. For all practical purposes, you have a set of likely possibilities, approximately equally likely, and the entropy determines the size of this set. In other words, I am ignoring the variations in probability between these points. I hope this was qualitatively helpful. I will put chapter 3 of Cover and Thomas, which discusses the asymptotic equipartition property, on Slack, so that whoever has trouble with it can go through it. Yes, this figure is from Cover and Thomas, right? Yes, exactly. So you can send me comments and wishes for the next lecture by email or Slack, and I will see you all on Wednesday — is that the right time frame? Yes, the next lecture will be on Wednesday at the same time. Okay. Thank you very much, Tanya. Thank you. Have a nice day. Be well. Bye bye.
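The typical-set statement discussed in this last part — that a long sample from a source has probability close to 2^(-nH), so that -(1/n) log2 P(x) concentrates at the entropy H — can be checked by simulation. A minimal sketch for a Bernoulli source (the parameter p and sample size are arbitrary choices):

```python
import math
import random

random.seed(0)
p = 0.3                                                  # Bernoulli source
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))     # ~0.881 bits/symbol

def neg_log_prob_rate(x):
    # -(1/n) * log2 P(x) for a Bernoulli(p) sequence x of 0s and 1s.
    n, ones = len(x), sum(x)
    return -(ones * math.log2(p) + (n - ones) * math.log2(1 - p)) / n

n = 100_000
x = [1 if random.random() < p else 0 for _ in range(n)]
print(H, neg_log_prob_rate(x))   # nearly equal for large n: the sample
                                 # is "typical", with P(x) close to 2**(-n*H)
```

This is the asymptotic equipartition property in action: almost every long sample lands in the typical set of roughly 2^(nH) approximately equiprobable sequences.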