Good morning, everyone. It's a pleasure for us to have Professor Daniel Braun from the University of Ulm. He will deliver three lectures on sensorimotor learning and computational motor skills. Thanks, Antonio, for inviting me, and thanks for coming. So before I start, maybe I quickly jump to the acknowledgments because I tend to forget them. I also wanted to start with the first part about where I'm coming from and what I'm doing. You see a little bit here already. I did my undergraduate studies at the University of Freiburg and also started my PhD there in computational neuroscience with Ad Aertsen. We started work there on structure learning, which I'll probably talk about tomorrow, also together with colleagues at the Hebrew University. Then I moved on to the University of Cambridge to work with Daniel Wolpert on computational motor control, and I started to collaborate with Pedro Ortega on most of the theoretical ideas that I will present today. Then I spent one year at the University of Southern California with Stefan Schaal, also a computational motor control lab, working mostly with robots. And then I moved to the Max Planck Institute in Tübingen and had my own research group funded by the German Research Council; these were my PhD students there: Tim, Jordi, Felix, and Jan. In 2016 I then moved to the University of Ulm, and I have a team there now. These are the four people who started with me there: Sebastian and Cecilia, the postdocs, and Heineken and Sonja, the PhD students at the moment. Cecilia and Sonja work more on experimental questions, and Sebastian and Heineken work on theoretical issues. OK, so what am I doing? I'm interested in intelligence and intelligent behavior and how to understand it. When we think about intelligence, we typically think of abstract thinking skills like playing chess, and not so much about mundane skills like moving chess pieces across a chessboard. But intriguingly, it turned out in the history of science that it was easier to build a chess computer that can beat a human chess master than to build a robot that even comes close to the dexterity of a human child, because it is so hard to formalize all this implicit knowledge that seems so natural to us. And I guess this is also interesting from a neuroscience point of view, because brains and nervous systems appeared during evolution together with the phenomena of motor coordination and locomotion. So what I'm trying to say with this slide is that I believe, and many others as well, that intelligence doesn't start here, but rather here, in our everyday interactions: a sensorimotor intelligence, a basic intelligence that basically all animals have, and these higher forms of intelligence then build onto these more primitive forms of intelligence, if you wish. Now if you open an AI book and look for models of intelligent behavior or models of intelligence, so this is from the Russell and Norvig book, you basically find formalizations of intelligence as optimal decision making in one way or another. So for example, in the Russell and Norvig book, the idea is that you're given a task environment. That means you specify an external environment, some performance measure that tells you what you want to do, and you're given sensors and actuators. Then we draw this in a diagram, and we have the agent with a big question mark that we as AI engineers have to fill out. And what we're looking for is the agent function.
And this agent function maps basically any possible history of the agent, the so-called percept sequence, or percept action history, into the next action. So you have to basically decide what to do next, given all possible histories. And there's two kinds of questions, or you can ask with respect to this. And these come from ethology, the study of animal behavior. We can ask proximate questions, or we can ask ultimate questions. So proximate questions basically ask, why do I raise my arm, for example? And the explanation would be that there's some neuron firing in the brain, and then this leads to some muscle contraction, and then my arm lifts up slowly. So it's like a mechanistic explanation. One thing that leads to another thing that leads to another thing. But another explanation could be, I wanted to lift my arm to get your attention, which is a completely different kind of explanation. The ultimate explanations are typically, in some sense, utilitarian, and not mechanistic. So for example, in evolutionary biology, these kind of explanations are given when you ask, why do we observe a certain behavior, because it confers a certain adaptive value to the animal, some kind of advantage. And the utility function is considered as often the fitness, basically the number of offspring, in some sense. And we can also do these two kinds of views also with artificial systems. We can, I don't know, if we build a dishwasher, maybe, we think about all the things, the parts, and how we put them together, and how they interact to do something. But we can also think about our robot like that. Or we can think more abstractly, what do we want the robots to do, and we give a description on this level and that doesn't maybe not necessarily answer the mechanistic question yet. So these are, of course, two compatible views just from different angles. So we're speaking mostly about this utilitarian perspective now. Now if we think about intelligent behavior as decision making, then we have to ask, OK, how do we formalize decision making? And in very abstract terms, we can think about decision making as choices between lotteries. So a lottery can imagine it as a roulette wheel. So there are different outcomes. So each choice that you have can lead to different outcomes. And each outcome has a different probability of occurring that's indicated by the size of the slice of the cake, so to say. And each outcome has a certain benefit or utility for the decision maker, which are given here as euro values. And the question is now, how should you decide between these two lotteries? And the idea is that any decision, be it to buy this car or that car, should you study this course or that course, that anything that you decide in life can be eventually cast as such choices between lotteries. And we can apply the same idea also to the sensor motor intelligence that I was talking about. So every movement that I make, I can move from here to here, there's many different ways I could do that. So why did I choose the one that I did? So we can think about lotteries here as distributions over movement trajectories. And we have utilities that can be, for example, some task performance measure, trajectory smoothness, motor effort, or many other factors that have been considered in the past. Maybe what's slightly different to the lottery concept, but actually it's not fundamentally different, as we will see, is that the uncertainty here can arise both extrinsically from the environment. 
So imagine you were hunting a rabbit that's trying to evade you; or the uncertainty can come from your own body intrinsically, illustrated here dramatically with Wilhelm Tell, where you aim somewhere, but you don't have full control over what your actuator will do in the end, because there's noise in your system. So in any case, the idea is that decision making is basically an all-encompassing framework to think about intelligent behavior. And if we again open our AI book, in this case again the Russell and Norvig book, the standard textbook in AI, we find that this idea of intelligence as decision making rests on the fundamental paradigm of rationality. So what is rationality? Here's again the quotation from the book: for each possible history, a rational agent selects the action that is expected, and that word is important, to maximize the performance measure, given the history and the prior knowledge that you have. Because things happen that are out of your control, which we describe with these probabilities (I do something, and then this or that might happen, I don't know yet), we cannot optimize directly for the best outcome, and we're forced to optimize for some statistical measure of the outcomes. And the statistical measure par excellence, as we will see, is the expected utility. So in this case, i is the index for the lottery; it can be one or two, left or right. We want to compute the expected utility of each lottery. Each outcome s has a certain utility for us, and for each lottery there is a probability of s happening. So I compute the expected utility of lottery i as the sum over outcomes s of the probability of s under that lottery times the utility of s, and then I take the lottery that has the higher expected utility. So what you see here is that to make decisions you need basically two things: you need probability theory, and you need utility theory. Here you then see these two maximizations: this is the maximal expected utility that you can achieve, and this is the optimal lottery that achieves it. Now, these probabilities here, as we will discuss later, can either be objectively given, like on a roulette wheel, or they can be what you believe to be the case. They would then be subject to what you learn about your environment, and these probabilities could change. So learning and acting are basically combined in this maybe innocuous-looking formula. And these kinds of rational decision-making models, or Bayes-optimal models (they're sometimes called Bayes-optimal because they include, or at least can include, updating of these belief models), have been extremely successful over the last decades, explaining a wide range of phenomena from sensory processing to cognition to motor control. In sensory processing, the basic idea is usually that we have stimuli that can be ambiguous, and we have prior knowledge about the world that we can use to disambiguate these stimuli. In this sense, sensory processing is cast as an inference process, an old idea that goes all the way back to Helmholtz: we make inferences, we try to infer hidden causes that explain our sensory sensations. Also in the realm of cognition, people have used such Bayes-optimal models to explain phenomena like induction or rule learning; here is an example from Griffiths and colleagues about learning to learn. And here is a motor example, from Trommershäuser and colleagues. The idea is that you have to make very quick pointing movements to small circles on a screen. And if you hit this circle, you will get, say, two and a half cents.
If you hit the other circle, you will lose ten cents. And now you have to make this rapid pointing movement to these very small circles on the screen, and it's so difficult that, like Wilhelm Tell, you're not going to hit exactly the point you aim at; you're going to produce a spread of endpoints that's roughly Gaussian distributed. And the question is, will you choose your aim point optimally, in the sense that you optimize the expected return you will get? People seem to do that more or less spontaneously in these tasks (a small numerical sketch of this expected-gain computation follows at the end of this passage). So there's a whole range of experimental findings, especially in what I've called sensorimotor intelligence, that support the idea that people seem to make these Bayes-rational decisions, which has even led to the concept of the Bayesian brain (there's a book with that title, for example), namely the idea that the brain constantly makes predictions about the world, updates these predictions, and acts optimally with respect to these predictions. Now, this formalization of intelligent behavior as Bayes-optimal behavior has been taken to the extreme by Marcus Hutter, who developed the concept of universal artificial intelligence. The idea is that we shouldn't restrict our agent to only a certain class of environments, but should look at an agent that can deal with all possible environments. That's the agent he calls AIXI, which can adapt to any computable environment and maximize its reward by learning over time which environment it's in. And you can show theoretically that this agent will achieve the highest reward in expectation over all possible environments. I mean, you can bias an agent towards a particular environment, of course, but over all possible environments this is the best you can do. The problem is that you can also show that this agent itself is not computable. That means you cannot build a machine that implements this agent. And that raises the question of whether we should just try to approximate this agent as well as we can, or whether we are fundamentally missing something in our formalization of intelligent behavior. I'm, of course, trying to argue in this lecture that the latter is the case, and not just me but many other people as well, who work on bounded rationality or computational rationality or under other names: basically by looking at real-world decision makers and saying that these decision makers are limited, they have limited resources. And the fact that they have limited resources is not just something we can ignore while trying to approximate the optimal solution; the fact that you have limited resources should be part of your optimization problem. That's part of the intelligence: to allocate your resources in a clever way. So to have a theory about intelligent behavior without taking resource limitations into account would be a mistake. The basic idea is that you cannot simply optimize functions with arbitrary precision, because you have limited processing power. An example is chess. If we were not computationally limited, chess would be the most boring game in the world, because everything would be predetermined, at least if the other player were like us as well. We would know exactly what they will do and exactly what we will do; there would be no point in playing. But because we're limited, the game becomes interesting and challenging, and not completely predictable.
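To make the expected-gain idea concrete, here is a minimal sketch in the spirit of these pointing experiments. All the specific numbers (the circle radius, the payoffs of plus 2.5 and minus 10 cents, the motor noise standard deviation) are invented for illustration and are not the values from the actual studies.

```python
# A minimal sketch of the expected-gain idea behind the rapid pointing task.
# All numbers (circle radii, payoffs, motor noise) are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

R = 9.0                                   # radius of both circles (mm), hypothetical
reward_center = np.array([0.0, 0.0])      # hitting this circle pays +2.5 cents
penalty_center = np.array([-9.0, 0.0])    # overlapping penalty circle costs -10 cents
sigma = 4.5                               # std of Gaussian endpoint scatter (mm), hypothetical

def expected_gain(aim, n=200_000):
    """Monte Carlo estimate of the expected payoff for a given aim point."""
    endpoints = aim + rng.normal(0.0, sigma, size=(n, 2))
    in_reward = np.linalg.norm(endpoints - reward_center, axis=1) < R
    in_penalty = np.linalg.norm(endpoints - penalty_center, axis=1) < R
    return np.mean(2.5 * in_reward - 10.0 * in_penalty)

# Search aim points along the line through both circle centers: shifting the aim
# away from the penalty circle trades hit rate against penalty risk.
aims = [np.array([x, 0.0]) for x in np.linspace(0.0, 8.0, 17)]
gains = [expected_gain(a) for a in aims]
best = aims[int(np.argmax(gains))]
print("best aim offset (mm):", best[0], "expected gain (cents):", max(gains))
```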
Another factor, which you will see is similar to limited information processing power, is model uncertainty. That basically means: you have a belief about the world, but this belief has never been tested, so you're not sure whether it is true. What should you do? Should you optimize your action to ten digits behind the comma with respect to this belief? Or should you maybe explore more, to first learn whether the model is true, or should you act robustly? We'll talk about that later. Okay, so this sums up my little introduction about what I'm interested in. In my lab, we're interested in studying bounded rationality principles for learning and acting, and we then test these ideas in human sensorimotor control, because this is one way we can test intelligence. As I said, I think sensorimotor intelligence is one basic intelligence where we can test these ideas, mostly in virtual reality, where we can control things, expose people to new environments, and see how they react and learn and so on. And the idea is that we also distill general principles in this way that help us in turn to improve the theory, and maybe also to find interesting applications for machine intelligence. Okay, so the plan for the lecture was that I first wanted to talk a little bit about the theoretical ideas we've been pursuing, just the basic concepts, and then in the other two parts about applications of them, if you wish: for sensorimotor processing, for decision making, and for learning. That's the idea; let's see how far we get. Okay, so let's start with the first part. I thought that before delving into the details, I'd do a quick run through the key concepts in decision theory and information theory. Maybe I need a little bit of feedback from you: if this is trivial I will skip ahead, or I can go slower or faster, and at any time please also ask questions, so I know that I'm not talking to myself. Okay, so here is a totally eclectic selection of events in decision theory over the last 300 years, but obviously I tried to pick those that are most relevant to the lecture. Already in the 1700s we have Daniel Bernoulli, who was talking about utility and also this idea of expected utility. Then we have Knight, Ramsey, and de Finetti, who revised or proposed new ways of thinking about uncertainty and probability. Then von Neumann and Morgenstern, who gave the first axiomatic treatment of utility; that was almost 200 years later than Bernoulli. And then Savage, who brought these axiomatic ideas from von Neumann and Morgenstern together with the notion of subjective probability. Okay, so let's have a look. In the 1700s, or already many decades before, people started to be interested in gambling and games of chance, and they were developing this idea of probabilities and their relation to frequencies and so on. And the received wisdom at the time was that we should basically try to maximize expected return. Expected return just means I take the expectation value of the number of euros or ducats or dollars or whatever they had at the time. But this led to a problem, and this is the famous St. Petersburg paradox that was raised by Bernoulli. So imagine I offer you the following game: I toss a fair coin until heads shows up. That means when heads shows up, the game ends; when there's a tail, we play another round, and another round.
And every time we play another round, the payoff doubles. So if the game ends after the first round, you get one euro; in the second round you get two euros, in the third round four euros, and so on. This explodes very quickly; after 30 rounds or so you're already a billionaire. And if you now compute the expected return of this game, you have to ask: what is the probability that heads shows up exactly in round n? If you play once, it's one half, and for round n it's one half to the power of n. And the return was chosen to double every round, so it grows like two to the power of n. What happens is that every round contributes a constant amount to the expectation, so we have an infinite expected return. According to the wisdom of the time, that would have meant you should be prepared to pay any amount to enter this game. If you saw me on the street as a slightly dodgy-looking street vendor, and I offered you this game and asked how much you are prepared to pay to play it, you should be prepared to pay any amount. And of course that's insane; no sane person would do that. So where's the problem? Daniel Bernoulli postulated basically two key ideas. The first important one is that value is subjective. That means we need to distinguish the objective or nominal value, for example that my bill says 100 euros, from the subjective value, how much it is worth to me. The idea was that, for example, 100 euros have more subjective value to somebody who is poor than to somebody who is rich. And he proposed that we should represent the subjective value of money with a logarithmic function: you put the nominal value of money on the x-axis and the subjective value on the y-axis, and you plot this curve. What's important about the curve is not so much that it's logarithmic, but that its marginal utility is decreasing: if I add 100 euros up here, the increment in utility is smaller than if I add 100 euros down here, where the utility increase is larger. And then he proposed that with this, the infinity is obviously removed, and that we should optimize expected utility instead of expected return. What's important about the expected utility is the central concept of what economists call the certainty equivalent; this is the main concept that I also want to discuss later. The certainty equivalent is this: if I offer you a lottery, for example this one, where with 60% you get 100 euros and with 40% you get nothing, what is the equivalent amount in cash, that is, without uncertainty, that you would accept as being of the same value as this lottery? It could be 60 euros, which would be the expected return, but it doesn't have to be; it could be higher or lower, depending also on your attitude towards risk. The important thing is that the expected utility, or this expectation operation, is basically a way of determining the certainty equivalent of a lottery, of a choice that involves uncertainty. We'll come back to that; a small numerical sketch of the St. Petersburg game and this certainty-equivalent idea follows below.
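Here is a small numerical sketch of the St. Petersburg game and Bernoulli's resolution. The starting wealth of 100 euros and the truncation of the infinite sums are illustrative choices, not part of the original argument.

```python
# A small numerical sketch of the St. Petersburg game and Bernoulli's resolution.
# Round n (n = 1, 2, ...) ends the game with probability (1/2)**n and pays 2**(n-1)
# euros (1, 2, 4, ... as in the lecture); the wealth level is illustrative.
import math

def expected_return(rounds=60):
    # each term contributes (1/2)**n * 2**(n-1) = 1/2, so this grows without bound
    return sum((0.5 ** n) * (2 ** (n - 1)) for n in range(1, rounds + 1))

def expected_log_utility(wealth=100.0, rounds=200):
    # Bernoulli: value the payoff by log(wealth + payoff) instead of the payoff itself
    return sum((0.5 ** n) * math.log(wealth + 2 ** (n - 1)) for n in range(1, rounds + 1))

w = 100.0
eu = expected_log_utility(w)
certainty_equivalent = math.exp(eu) - w       # cash amount with the same log utility
print("expected return (truncated):", expected_return())              # about rounds/2
print("certainty equivalent in euros:", round(certainty_equivalent, 2))  # small and finite
```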
So here's the axiomatization introduced by von Neumann and Morgenstern, and their idea was quite simple: I expose people to different lotteries and ask them which one they prefer, A or B. I ask that hundreds and hundreds of times for different pairs of lotteries; every time you tell me which one you like more, and that way you reveal your preferences. What they could show is this: suppose your preferences between these lotteries fulfill these four axioms of rationality. The first one, completeness, says that you should either prefer one or the other, or be indifferent. Transitivity means that if you prefer A over B and B over C, then you should also prefer A over C, which also makes sense, I guess. Continuity means that if you have a lottery M that in your preferences lies between two other lotteries L and N, then you can find a mixing probability for L and N such that you are indifferent between that mixture and M; so you can represent any lottery by an appropriate mixture of two other lotteries that have more and less value, respectively. That also sort of makes sense. And then there's the independence axiom, which says: if you prefer one lottery over the other, and you now create a new compound lottery by saying that with probability P you get the original lottery and with probability one minus P you get some other lottery N, and you add the same lottery N in the same way to both choices, then that shouldn't affect your preference, because you've added the same thing to both. And what they could show is that if and only if your preferences obey these axioms, there exists a utility function over outcomes such that I can explain every single choice you made as picking the option with the higher expected utility. What's important here is that the probabilities were assumed to be objectively given, like on a roulette wheel. And this is the next concept that has been challenged, because if utilities can be subjective, why not probabilities as well? This is a long debate. Knight was one of the first people to raise this issue, saying that we should distinguish known uncertainty from unknown uncertainty. Known uncertainty is when I throw a die: I don't know what's going to happen, but I have a pretty good idea that it's going to be between one and six, and I have an idea about the probabilities and so on. But there's also unknown uncertainty: I sit on a plane, I travel to another country, I get out, and I don't know what's going to happen next. That's a different kind of uncertainty, maybe; that's the argument. One famous experiment designed by Daniel Ellsberg to capture these ideas, which we'll discuss later in more detail, is this: imagine you have to choose between two urns. On the left there's an urn with 50% red and 50% white balls; on the right there's an urn where you don't know what's inside. I give you a euro if you draw a red ball; which urn do you like better? Subjects tend to prefer the one with the known uncertainty. Why this is a paradox, and so on, we'll discuss later. So this idea of probabilities being subjective was proposed by Ramsey and de Finetti in the twenties. And the important point is that you have to operationalize this notion somehow. So what is a probability?
So the classical operationalization of a probability is as a frequency that converges: I throw the die a million times and record how often the six appears, and I notice that the relative frequency converges to something, and that something is the probability. The problem is that if I want to use probabilities to express how much I believe that something is true, I cannot necessarily use that, because I can believe things that are not repeatable. And here's the way they tried to operationalize this idea, namely through so-called Dutch book arguments. So you have something that you believe: there's a clown under my bed, or, I think the original one was, there was life on Mars a million years ago. Just to hammer home the idea: these are not repeatable things. And I want to know how much you believe this is true. The way it works is that you have to set the price for a ticket, and if the claim is true, if there really is a clown under the bed, the ticket pays one euro; if there's no clown, it pays nothing. You set the price for this ticket. Now the trick is that there is an opponent who decides whether you have to buy the ticket at that price or whether he buys the ticket from you. And the idea is that you're now forced to set the price honestly. For example, say you know that there is a clown under the bed, so you know the ticket will pay one euro. If you now set the price to zero, the opponent will buy these tickets from you and you will lose one euro every time, so you would not do that. And of course it also works the other way around. So you're forced to set the real price, because otherwise the opponent can build a machine to extract money from you indefinitely, and nobody wants that. And this price that you choose is, operationally, the subjective belief that you have. That's important, because you can now define probabilities for non-repeatable events. So we have these two notions: the objective probabilities, like on a roulette wheel, as I said already, and the subjective ones, where a horse race would be an example. Finally, these two ideas of subjective utility and subjective probability were put together by Savage in his famous book from 1954, and it's almost incredible that you should be able to infer both things from observing behavior. His axioms are quite intricate, so I'm not going to look at them, but I'm showing you his introductory example so you get the basic idea of his framework. There are states of the world, and in the end you want to have beliefs about these states; there are actions that you can take; and there are consequences of these actions. The scenario he has in mind is that you're making an omelet at home. You're trying to cook an omelet, and you already have five eggs in the bowl, and there's one last egg. This egg could be good or it could be bad. And now there are three different actions you could take. You could break the egg into the bowl where all the other eggs already are, or you could break it into a saucer and look.
So you see whether it's a good egg or a bad egg. Or you could just throw the egg away without looking, because you think five eggs are enough, you have to worry about your cholesterol, I don't know. And each of these actions leads to different consequences. If the egg is good and I break it into the bowl with the other eggs, I have a six-egg omelet. If I break it into the saucer, I also have a six-egg omelet, but then I have to wash the saucer, which is maybe not so good. And if I throw the egg away, I have a five-egg omelet and one good egg destroyed, which is annoying. If the egg is bad and I put it into the bowl with all the others, I've destroyed the omelet; that is for sure the worst-case outcome. If I break it into the saucer, then I have my five-egg omelet and the saucer to wash. And otherwise I just have the five-egg omelet. So the robust solution would be the five-egg omelet, where it doesn't matter what you believe, but that's not the point here. So these are all the possible consequences, and these acts are mappings from states to consequences. And what he could show is that if I ask you for your preferences between these different actions you can take, and these preferences again follow certain rationality axioms (that's what's there in the left corner), then there exists a utility function over these consequences and a subjective probability distribution over the states, such that I can explain all of your behavior as optimizing your subjective expected utility (a tiny numerical sketch of this omelet example follows after this passage). So that's still, I guess, the gold standard for thinking about decision making, certainly in the economic sciences, but also everywhere else. Yeah? What about a fully irrational agent, how would it choose, and what are the cases in which this does not hold? So there are many situations where people violate these axioms. People even violate seemingly trivial things like transitivity. But there are also famous paradoxes that have even been tested on Savage himself, and even he got it wrong. There's the famous Allais paradox, for example, which essentially tests this independence axiom (Savage also has an independence axiom), where you can show that I have two lotteries, and if I now add the same lottery to both, all of a sudden I get a reversal in preferences between the two. It sounds incredible, but if I actually showed you the numbers for this lottery, almost everybody does it. And then the question is, okay, what's going on? Are humans stupid, irrational? But with the Allais paradox, for example, even if I explain to you that you made a mistake, I guess most people afterwards still feel that they didn't make a mistake. So the question is: maybe these axioms are okay as a normative framework, but maybe there are other things to take into account. There's also this idea, and we'll come back to it later, of measurable and unmeasurable uncertainty. Here the assumption is that every uncertainty you have can be captured by a subjective probability distribution, but maybe that's not true. That's, for example, what Daniel Ellsberg is arguing.
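As a tiny illustration of Savage's omelet example, here is the subjective-expected-utility calculation with invented utility numbers; only the ordering of the consequences is taken from the lecture (ruined omelet worst, six-egg omelet best, washing the saucer and wasting an egg as small penalties), and the belief that the last egg is good is a free parameter.

```python
# A tiny sketch of Savage's omelet example as subjective expected utility.
# The six utility values are invented for illustration.
consequences = {
    ("break_into_bowl", "good"):  10.0,   # six-egg omelet
    ("break_into_bowl", "bad"):  -20.0,   # whole omelet ruined
    ("break_into_saucer", "good"): 9.0,   # six-egg omelet, saucer to wash
    ("break_into_saucer", "bad"):  4.0,   # five-egg omelet, saucer to wash
    ("throw_away", "good"):        3.0,   # five-egg omelet, one good egg wasted
    ("throw_away", "bad"):         5.0,   # five-egg omelet
}

def subjective_expected_utility(action, p_good):
    """Weight the two possible states of the world by the subjective belief p_good."""
    return (p_good * consequences[(action, "good")]
            + (1 - p_good) * consequences[(action, "bad")])

for p_good in (0.5, 0.9, 0.99):
    best = max(("break_into_bowl", "break_into_saucer", "throw_away"),
               key=lambda a: subjective_expected_utility(a, p_good))
    print(p_good, "->", best)
# With these numbers the saucer wins for moderate beliefs, and breaking the egg
# straight into the bowl only becomes optimal once you are almost sure it is good.
```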
If you were to remove the subjectivity, wouldn't you basically be studying game theory, Nash equilibria, that kind of thing? I mean, here we don't specifically look at game theory, because you have these states of the world that you have beliefs about, but of course the ideas of expected utility are also applied to the game-theoretic setting. The difficulty with game theory is this: say the two of us are playing a game. If I apply this basic notion of expected utility, then I have to have a belief about the world, so that I know: if I do this, what are the consequences, what will happen? Now the question is, can I have the same kind of model of you, a belief about what you're going to do next? If that were the case, then you would be like a tool to me, say a hammer: I know exactly that if I do this or that with a hammer, this and that will happen. If I knew neuroscience so well that you were to me like a hammer, then maybe this would work, but that's not true. I don't know what you're going to do next, and in fact what you're going to do next also depends on what you believe about me, and so on. So game theory is a little bit more complex. And the idea there was: I don't have this belief model, but what I ask for instead, and this is the notion of the Nash equilibrium, is a situation where neither of us has an incentive to deviate anymore from our behavior. So we define attractor points, which is slightly different from what we do here, even though these attractor points are defined precisely by the condition that neither of us can improve our expected utility by deviating. So expected utility theory is still there, but not in the form where I hold a belief model about you; that idea doesn't work in game theory, at least if you ask for the points of optimal behavior, because these beliefs would change all the time. Okay, any more questions? So then I can also apply these ideas to more complicated setups with sequential decision making, and then I'm basically looking for a policy. A policy tells me, in each situation in life, what to do; so I'm looking for the optimal policy. Maybe the simplest case is on the left, the expectimax tree. Every time there's a max node, that's my decision node: I make a decision about what I want to do next. The round expectation nodes are chance nodes where nature makes a move, where something happens that I cannot control. And the question is, every time I can make a choice, what should I do? As long as these trees are finite, they're easy to solve by backward induction, for example. I go to the last row of outcomes and ask: what is the utility of this outcome for me? Then I can compute the expected utility at each chance node. Then I go back and pick the one that has the higher expected utility; that becomes the value of my choice node, the triangle. And then I go back one level further up, compute expectations, and again ask which one I choose. So I can work myself backwards and solve the policy problem just by a sequence of one-step decisions. Does that make sense? A small sketch of this backward induction follows below.
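Here is a minimal sketch of that backward induction on a toy expectimax tree; the tree structure, the payoffs, and the chance probabilities are all invented for illustration. An opponent node in a zero-sum game, as discussed next, would simply take a min over its children instead of a max.

```python
# A minimal backward-induction sketch for a small expectimax tree.
# Leaf utilities and chance probabilities are made up for illustration.
def value(node):
    kind = node[0]
    if kind == "leaf":                       # terminal outcome: return its utility
        return node[1]
    if kind == "chance":                     # chance node: probability-weighted average
        return sum(p * value(child) for p, child in node[1])
    if kind == "max":                        # decision node: pick the best child
        return max(value(child) for child in node[1])

tree = ("max", [
    ("chance", [(0.5, ("leaf", 4.0)), (0.5, ("leaf", 0.0))]),   # option A: value 2.0
    ("chance", [(0.8, ("leaf", 3.0)), (0.2, ("leaf", 1.0))]),   # option B: value 2.6
])
print("value of the root decision:", value(tree))   # max(2.0, 2.6) = 2.6
```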
Now, the same idea applies when I play, say, a zero-sum game like chess, where I make a move, then the opponent makes a move, and so on, and I can again work my way backwards: what are the utilities of the end states, and what will the opponent do? Well, the opponent will do whatever harms me the most, so to say, because that's what gives him the highest reward in a zero-sum game. And then again I work myself backwards, or I can combine all of these nodes: chance nodes, opponent nodes, and my own decision nodes. The reason I'm showing you this is that with the generalized notion of the certainty equivalent that I want to talk about, we can basically interpolate between all these three cases, and more. From the earliest days of studying decision making, people have also looked at stochastic choice theory, because if you study decision making in animals and humans, in psychophysics and so on, you will notice that even if I give you the same stimulus pair, people decide differently in different trials. And if you want to explain everything with maximizing expected utility, then you have a problem, because such decision makers choose deterministically: they always choose the best option, even if the difference is tiny. Essentially all of decision theory is deterministic in this sense; game theory is an exception here. So people looked at stochastic choice rules early on. This started, for example, with Thurstone's law of comparative judgment for comparing psychophysical stimuli, and Luce had his choice axiom about the independence of irrelevant alternatives, out of which softmax-like choice rules basically come. And there are different ideas for how you can get these stochastic choice rules. One is, for example, that you have stochastic utilities: you always choose deterministically, but the utilities themselves are stochastic, and that gives you, in the end, probabilistic choice. Somewhere in the processing pipeline you need random variables, of course, to get stochastic choice rules. These were initially introduced, as I said, to explain experimental results, so they were more descriptive in nature and not so much normative decision-making principles. And then there is the research area of bounded rational choice theories, a field that was maybe initiated by Herbert Simon. He was initially interested in decision making in organizations and corporations; that's where he came up with his concepts. Here's one quote, where basically you have to make decisions with limited knowledge; another famous idea is satisficing, that you don't look for the best option, but just for one that is good enough. So that triggered this research area, but there is still no generally agreed-upon theory of bounded rationality, and even though I guess most people agree that it's a good idea to study these kinds of questions, they disagree immensely on how to study them. So there are people who come from the heuristics background and take what's sometimes called the Swiss army knife approach to the human mind. The idea is that we have many problems in life, and evolution has somehow given us many different ways to solve these problems, but there is no general problem solver, no general principle that produces intelligent behavior. It's a bunch of tricks, that's it; try to understand the tricks and you've understood bounded rationality. That's the research approach, hence the name heuristics. And then there's, I guess, the other side of the chasm that still tries to hang on to this idea of optimizing.
From the heuristics side, I guess, the idea is that optimization is even the wrong way to think about it, because optimization would basically presuppose that there is some generic problem-solving machinery. They also argue that very often you have an optimization problem that you cannot solve, and you say, okay, now I have limited resources, that means a constraint on my optimization problem. So I have a new, constrained optimization problem that is even more difficult to solve. And how do I solve it? Well, I've just made my life harder, and basically I'm back to where I started. So framing bounded rationality as optimization doesn't make any sense from this point of view. Okay, and then there's the idea of what's sometimes called computational rationality: that computation is costly. I would like to do this or that, but doing this costs this much and doing that costs that much in computation costs, thinking costs, and then I take the option that is overall the cheapest, or the best. Another concept is bounded optimality. Here the idea is that you take a particular platform, say this Intel computer with this particular speed and so on, and you ask: what is the best-performing program to solve a particular problem on that platform? The problem here is that you get the best solution for particular cases, but it's maybe not so clear how you would get a general theory of decision making with limited resources out of it. Okay, so this is the research field that we want to talk about. What's important, and I guess I said it already, but I'll now say it explicitly: when we think about decision theory, we need to distinguish normative from descriptive theories. Normative theories basically tell you how you should choose. There's a nice quote from Gilboa: if someone points out your mistake and you feel embarrassed afterwards, then the rule has a good chance of being a normative rule; if you don't feel embarrassed and still feel you've done the right thing, then maybe it's not so good as a normative rule. And descriptive theories, of course, want to describe how individuals actually choose; whether it makes sense to choose like that, or whether you should choose like that, is another issue. Okay, so that would finish the decision chapter. I don't know how we are doing with time. Yeah, okay, so then I carry on. The next review is going to be information theory. I think you have already heard a little bit about information theory, enough to lead into this lecture, so maybe some of the things I'm going to say here are not necessary; I don't know, give me a sign to move faster. Okay, so the question is, why am I talking now about information? What does information have to do with utility? On the other hand, I could pose the opposite question: if we said that decision making is central in AI, and decision making requires information processing, then isn't it strange that we don't talk about information? You could also ask it that way around: in all these expected utility theories, people don't talk about information. Why not? Okay, so here are some ideas. Oh, "information" has been shortened to "informa" on the slide, I don't know why, but you get the idea. So I believe that information and utility are very similar things that are closely intertwined.
And if you do a search on Google Scholar or something like that, you'll find there's not too much about this; you'll find one paper that claims that utility and entropy have basically the same kind of representation problem, topologically. And then there are a couple of papers where we try to argue that utility and probability are duals. What's the intuitive idea behind that? Imagine that I offer you the choice between an apple and a banana, and I record how often you choose the apple and how often you choose the banana when I repeatedly offer you this choice, and I use this probability distribution to describe your behavior. Now the idea is that there is a utility, and that the utility is larger for things that you choose more often: you choose it because you like it. So I say there is this quantity, I call it utility, that is large for things that you like and not so large for things that you don't like, that you don't choose. You can also say, and we'll look at this in detail in about half an hour, that utilities should be additive, and then you get this relationship between probabilities and utilities: the choice probability is proportional to the exponential of the utility. And if you're a physicist, you'll recognize this as a Boltzmann distribution, so you can think about physical systems in exactly this way, except that they don't have a utility but an energy, and they like to minimize the energy instead of maximizing the utility; but it's still exactly the same. You can imagine that you have a bowl with a ball in it, and the ball tries to roll to the bottom of the bowl. If you now imagine that this ball is a molecule and there's thermal noise, it will jump around, but it will still tend to roll down, and you describe the state of the ball with this probability and the energy of the ball with this utility, with a minus sign: the ball goes where the energy is small, because that's the place it likes, so to say. On the other hand, we know from Shannon that information has to do with probability, essentially log probability. And if both of these things hold, then of course utility and information would be the same kind of thing up to a constant. To some extent, I guess, that shows in physics, where you can define what is going to happen either in terms of maximizing entropy or minimizing energy; both will give you the same kind of solutions if you ask what state a particular system will be in. So the intuition is that information can also be regarded as a kind of utility, maybe an intrinsic utility: the system visits a certain state often because it likes that state. And if we want to change the behavior of the system, we apply an extrinsic utility. Say, for example, you prefer the apple over the banana, but now I put one euro as an external utility on top of the banana, and now you like the banana. Or you like to stand here; I put five euros over there, and then maybe you like standing there instead. That's how you can imagine it, and it's what we do with humans: they do certain things that they like to do, we give them money, and that way all of a sudden they do other things. We call it work.
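Before moving on, here is a small sketch of this utility-probability duality: choice probabilities of the Boltzmann or softmax form, where the probability of an option is proportional to the exponential of its utility. The utilities for the apple and the banana and the inverse-temperature values are made up for illustration.

```python
# A small sketch of the utility-probability duality: softmax / Boltzmann choice,
# p(x) proportional to exp(beta * U(x)). Utilities and beta values are illustrative.
import numpy as np

def softmax_choice(utilities, beta):
    z = np.exp(beta * (utilities - utilities.max()))   # subtract max for numerical stability
    return z / z.sum()

U = np.array([1.0, 0.5])                  # utility of apple vs banana (made up)
for beta in (0.0, 1.0, 10.0):
    p = softmax_choice(U, beta)
    print(f"beta={beta}: p(apple)={p[0]:.3f}, p(banana)={p[1]:.3f}")
# beta=0 gives uniform random choice; large beta approaches deterministic maximization.

# Adding an "extrinsic" bonus of one utility unit to the banana shifts the probabilities,
# like putting a euro on top of the banana in the lecture's example:
print(softmax_choice(U + np.array([0.0, 1.0]), 1.0))
```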
Yes, that's the plan. Yeah, so this is not supposed to be mutual information; sorry, this is just supposed to be the... But then shouldn't it be negative? Yeah, information is the negative log, that's right; yes, it should be. I'm being sloppy here with the signs, also here with energy and utility, and the same with surprise, which would be the negative log probability, whereas here the minus sign is missing. Okay, sorry, yes, exactly. Absolutely, we'll come to that; that'll be a big part of the discussion. Okay, so where do information and utility meet? In information theory, this happens in rate distortion theory, the theory of lossy compression, and we'll talk about that in a minute; it's not a coincidence that the two concepts meet there. In the decision sciences, I would say this interplay between utility and information is an active area of research. Maybe the oldest concepts, even though they're not that old, they're from the nineties, come from the economic sciences again: quantal response equilibria. This is a concept developed for game theory, and the idea is, in the end, that players are assumed to choose according to some softmax rule, these Boltzmann-factor kinds of rules, so they play stochastically. With quantal response equilibria, people have looked at how this changes the position of the Nash equilibrium, whether it explains behavior in certain games, and so on. Then there's the idea of rational inattention, also from the economic sciences and from around the same era, which is essentially an application of rate distortion theory. The idea is that we have to make decisions, but we cannot process all the information; we have limited information capacity, channel capacity, and we need to make the best choices we can with that limited capacity. So this is very closely related to what I'm going to talk about today. Then there are ideas about robust decision making and variational preferences. As I said, robust decision making means that I have model uncertainty: I don't know whether a certain belief I hold about the world is true, and the question is what I should do. I try to make decisions such that it doesn't matter too much exactly what my model is. And we will see that this can be formalized in exactly the same way as these bounded rationality ideas; we'll talk about that in more detail later, I guess. Then another thing is KL control, which I'll also mention briefly later; it has been proposed by Kappen, Todorov, and others and is a stochastic way to control systems, so it comes more from a control perspective. Then active inference is an example from Friston and colleagues, where they also try to bring utility and information together, with the idea that we choose our actions so as to minimize surprise, in the same way that we also do inference. This also fits somewhat into our framework; I'm not going to talk about it much today, but if you want to discuss it, we can. Then you already saw Naftali Tishby; I don't know whether he also spoke about his general ideas about information flow in perception-action systems. I think this time he was concentrating on information-theoretic interpretations of deep learning and so on.
Susanne Still, with the paper on the thermodynamics of prediction, comes more from a physics perspective and was asking, for a system that predicts, what kind of thermodynamic properties it will have. So for example, if you retain information that is non-predictive, that will lead to dissipation and so on; that would then be a natural drive to avoid it. Okay, so you see that there's a whole bunch of different research groups interested in this sort of topic, and many of these ideas, I guess, fit together quite nicely, as I hope to show you. Okay, so maybe this is the part we can go through fairly quickly now. What is information? Originally, Shannon was interested in how to put lots of information through a phone cable or something like that. So we have a sender and a receiver. The sender knows that there are certain states in the world; there are four different states here. And there is a code alphabet, and obviously every state should get a different code word; here the code words have length three. So what you see immediately is: I have four states and fixed-length code words of length three, which means I have two to the three possibilities to encode; that's a bit wasteful if there are only four states, and so I get redundancy. So information is what we gain whenever we learn something new, but abstractly we can think of information just as an accounting measure, a kind of log cardinality. If we have two possibilities, we can express that with a zero and a one; that would be one bit. With four possibilities, basically zero-zero, zero-one and so on, it would be two bits, and so on. That's why I'm saying it's the log cardinality; it's just counting. So if I show you, for example, a picture like this one, I can ask: what is the information? Well, that depends; you just count the number of possibilities. Each of these squares can be on or off, so you count them, take the log of that, and you have basically the information. Now, if I'm interested in transmitting information fast, I'm interested in the question of compression, and I will come to that. There are two kinds of compression that you need to distinguish: lossless compression and lossy compression. In lossless compression, we exploit redundancy, which means we effectively find that there are fewer possibilities than we originally thought. For example, if I only ever send you pictures of faces, then you know that if there's a black dot on the left, the left eye, there's also going to be a black dot somewhere on the right, and you're not going to be surprised about that anymore. So the number of possibilities has shrunk, and if I know that, I can use it for compression without losing anything; you will still see exactly the same picture. Lossy compression means throwing away information. For example, I could downsample and replace every four squares with one square. In that case I would also have fewer possibilities in the end, but I would lose information. And the interesting question is: how do I decide which information to throw away? This is where, as we'll see later, the notion of utility comes in. Okay, so Shannon was interested in two basic questions.
What is the lower bound on how much I can losslessly compress a message? And the other question, which is not so important for us today but was really important for him: what is the fastest way I can transmit a message over an unreliable channel? The picture he had in mind is that you have a channel you send bits through, but this channel has an error probability, and sometimes it flips a zero into a one or the other way around. So you have these two stages: you have some information source, you compress it, losslessly for example, which means you remove redundancy, and then you add redundancy again to make the message more robust to these errors, and then you decode again. So you remove and add redundancy; not the same kind of redundancy, of course, you have to add redundancy in a clever way to make the message robust. And I guess what's interesting, and I'll come back to this point later because it's something I want to emphasize, is that Shannon's theory doesn't really say anything about which codes you should pick. People were interested in finding and developing these codes, the error-correcting codes and so on, but he came up with theoretical predictions of what you could achieve with the best possible codes without having any idea how to produce and generate these codes. So that's a really normative statement that he could make, and we want to do something similar today. So here's a simple example. We have these four symbols, and say we choose this fixed-length code. The information would be the number of yes-no questions needed to determine the symbol, so it would be two bits: you need to ask two questions to know the symbol. And now the question, and this is I guess what information theory is about: what if these symbols are not evenly but unevenly distributed? You could of course still use the same kind of code, and you would use the same amount of information, two bits. But if you're clever, you realize that you could design a different code. The basic idea here is older than information theory; it was already used for Morse code and so on: you give short code words to frequent symbols, because they happen very often, and then maybe the overall message becomes shorter. In this example you could say: if A happens half the time, I encode A with a single zero, and to compensate, I encode the rarer symbols with longer code words. This would be a so-called prefix code, which means that none of the code words is a prefix of another code word, so it is uniquely decodable. And then you realize: half the time I need one bit, a quarter of the time I need two bits, and an eighth of the time, twice over, I need three bits; if you compute that, you use less information overall (a small numerical check of this example follows after this passage). So you say the ideal code word length should be a decreasing function of p: the more probable a particular x, the shorter its code word. You also say that if the probability is one, you don't need any code word at all, because then you already know what's going to happen. And you assume additivity for independent outcomes, so that the two code word lengths simply add.
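Here is a small numerical check of this example, with the probabilities one half, one quarter, one eighth, one eighth and a standard prefix code of the kind just described; the exact code words shown are one common choice, not the only one.

```python
# A small check of the code-length argument for the four-symbol source above.
import math

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

fixed_code    = {"A": "00", "B": "01", "C": "10", "D": "11"}       # 2 bits per symbol
variable_code = {"A": "0",  "B": "10", "C": "110", "D": "111"}     # a prefix code

def expected_length(code):
    return sum(p[s] * len(code[s]) for s in p)

entropy = -sum(q * math.log2(q) for q in p.values())
print("entropy:", entropy)                                       # 1.75 bits
print("fixed-length code:", expected_length(fixed_code))         # 2.0 bits
print("variable-length code:", expected_length(variable_code))   # 1.75 bits, i.e. -log2 p(x) per symbol
```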
From these requirements you already get the information content, this time with a minus sign: minus log of the probability is the ideal code word length. And then you can of course compute its expectation, which gives you the entropy, the expected code word length. As a side remark, you can think of this as writing down the symbols and then generating the code for them, but you can also think about it the other way around: you want to generate the symbols, and you use the code as a kind of machine. You put a marble in at the top, the marble rolls down, and at every branch it makes a 50-50 decision; if you do that, it will generate A, B, C, D with exactly these probabilities. So you can think both ways: how many random bits do you need to generate these symbols, or how many bits do you need to write down a particular random symbol sequence. Now the question is what happens if we use the wrong code. As I said, we could insist on using the fixed-length code even though the symbols are non-uniformly distributed. Then we can compute the cross entropy: the expected code word length when the code is designed for one distribution but the symbols are actually generated from the true distribution P. And you can write this as two parts: one is the entropy of the true distribution, and the other is the so-called Kullback-Leibler divergence. The entropy is the minimum number of bits you need to encode symbol sequences from the source, and the Kullback-Leibler divergence is the extra number of bits you need because you chose the wrong code. We'll talk about the Kullback-Leibler divergence quite a bit today; in coding theory it measures the extra bits you pay for using the wrong code, and it is a non-negative quantity. Okay, so that's the important source coding theorem from Shannon: the entropy is the lower bound for lossless data compression. If you use a symbol code, the code word lengths have to be integers; if you use stream codes, you can get down exactly to this bound. But the important message is: the most you can compress losslessly is given by the entropy, the ideal expected code word length; that's the best you can do. Then another quantity that is important, and that can also be seen as a special case of the Kullback-Leibler divergence, is the mutual information. If you have two random variables, they can be associated with each other, they can have a dependency. You can look at the product of the marginals, where you ignore this dependency, and ask what the informational distance is, so to say, between this product of marginals and the joint distribution; that gives you an idea of the dependency between the variables. You can express this maybe more naturally in terms of the entropy and the conditional entropy: the mutual information between X and Y tells you how much one random variable tells you about the other. For example, you ask: what is my uncertainty about X, and what is my uncertainty about X if I know Y? The difference of the two is the mutual information: how much did Y tell me about X? And you see from this that the mutual information ranges between zero and the entropy. So for example, say Y tells you everything about X.
If, for example, Y tells you everything about X, then the conditional entropy is zero, because once you know Y you have no uncertainty about X anymore, and the mutual information equals the entropy of X. If, on the other hand, Y tells you absolutely nothing about X, then the conditional entropy is the same as the unconditional entropy and the mutual information is zero. This mutual information is not so important today, but I mention it briefly because it is used to define the channel capacity. If we have a channel, that means a given conditional probability distribution: if I put in X, what is the probability of Y coming out? Then the channel capacity is defined as the maximum mutual information between X and Y over the distributions I can feed into the channel. Very often the maximizing input distribution is the uniform one, but it does not have to be. And why is the channel capacity important? Because Shannon could show that, essentially, as long as your transmission rate stays below the channel capacity, you can achieve arbitrarily small error rates. Again, how you find these error-correcting codes he could not tell. If a colleague from coding theory had come knocking on his door and asked, you said I can find a code that achieves this small error at this transmission rate, how do I do it?, he would have had to say: I don't know, I just know that it exists, you can only try harder. And that would be the end of the conversation. Okay, so finally, the last question is: what happens if we need to compress below the source entropy? So far we have been talking about lossless compression, and we said the best we can do is the source entropy. What if we need to go below that? That is the so-called rate distortion problem. We need to throw away some information, and the question is of course: which information should we throw away? The idea is that we throw away the information that is useless. What is useless information? Well, then we need a measure of usefulness, and that is where utility comes in; in information theory it enters as a cost and is called distortion. The idea is, say I give you an image or an audio file, and I downsample that file, and then I show you both versions, or you listen to the music in both versions, and you tell me how distorted you think the compressed version is compared to the original. Maybe in one picture the nose is missing and you say, my God, this is completely distorted, while the other one you find still fine, not so distorted. So you have a distortion function that tells you how much you dislike the compressed version, how distorted it is. Once you are given such a function, there are two ways to set up the problem. Either I fix a level and say I want my distortion to be smaller than D*, that is the maximum I will accept; then I ask: what is the minimum amount of information I need? Or I turn it around and say: given that I have a certain information rate R*, what is the minimum distortion I can achieve, what is the best I can do? Both problem formulations are essentially equivalent; they are just two ways of looking at the same trade-off.
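To give a feel for how such a trade-off can be traced out numerically, here is a sketch for a small discrete example. It uses the standard Blahut-Arimoto iteration, which is not covered in the lecture, and the binary source and Hamming distortion below are made up for illustration. Sweeping the trade-off parameter beta produces (distortion, rate) points on the frontier discussed next.

    import numpy as np

    p = np.array([0.8, 0.2])                  # source distribution P(X=0), P(X=1)
    d = np.array([[0.0, 1.0],
                  [1.0, 0.0]])                # Hamming distortion d(x, x_hat)

    def rate_distortion_point(beta, iters=500):
        # Blahut-Arimoto iteration: alternate between the optimal channel q(x_hat|x)
        # for a fixed marginal and the marginal induced by that channel.
        q = np.array([0.5, 0.5])              # marginal over reconstructions x_hat
        for _ in range(iters):
            cond = q * np.exp(-beta * d)      # q(x_hat | x), one row per source symbol x
            cond /= cond.sum(axis=1, keepdims=True)
            q = p @ cond                      # updated marginal q(x_hat)
        D = np.sum(p[:, None] * cond * d)                   # expected distortion
        R = np.sum(p[:, None] * cond * np.log2(cond / q))   # mutual information in bits
        return D, R

    for beta in [0.1, 1.0, 3.0, 10.0]:
        print(beta, rate_distortion_point(beta))

For small beta the rate drops towards zero at high distortion, and for large beta the rate approaches the source entropy (about 0.72 bits here) as the distortion goes to zero, which is exactly the frontier picture that follows.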
This is illustrated here for the Gaussian channel, with the information rate on one axis and the distortion on the other. If you do not want any distortion, the best you can do is the source entropy, the lossless compression. If you are happy with arbitrarily large distortion, you do not need to transmit any information at all; I can just show you random noise and you will be satisfied. And for any distortion in between, you get this line, and we will come back to it later: it is an efficiency frontier. It says that whatever algorithm you use for your compression, you cannot go below this line. We already said that at zero distortion you cannot go below the source entropy without introducing distortion, but in fact you can never go below this whole curve. It is an efficiency frontier that falls out of purely theoretical considerations. Question from the audience: in the image compression example, the distortion measure is quite subjective, isn't it? The fact that you notice the missing nose essentially requires a human observer, a machine that has evolved to recognize patterns. Sure; where this distortion measure comes from is not part of the theorem. Take the music example: there it clearly has to come from a human listener. We cannot hear certain frequencies, so that has to be captured, maybe by a psychophysicist, in such a function, and that function is then handed to the engineer who designs the compression algorithm. So the function is given? The function is given, yes. Okay, any other questions? Okay. Then let us start talking about bounded rationality, where we try to bring the concepts of utility and information together. Let us start with an intuitive picture. Assume there is a set X of possibilities; each possibility, each little x in this set, has a utility U(x). You are the decision maker, and you want to find the best x, the one with the highest U(x), and if several are tied for the highest utility, you want to find that set. Now, if you are limited, and this is the satisficing idea from Herbert Simon, then maybe you cannot afford to search for the best x, and you content yourself with solutions that are good enough. We would call that set X_R, R for resources, because the more resources you have, the smaller this set can be; and if you have no resources at all, you simply choose randomly from the whole set. So that is the idea: you accept anything from such a subset, and the more resources you have, the smaller the subset becomes. You can then reason about what happens if you choose randomly within such a set: what utility you would get, and what your uncertainty is. If you assume, say, a uniform choice probability initially, you have log N bits of uncertainty, and log N_R bits if there are still N_R options left in the acceptable set, so the deliberation has to reduce the uncertainty by the difference. You can reason along these lines.
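A tiny numerical version of that reasoning, with made-up numbers:

    import math

    N, N_R = 1024, 16                       # all options vs. the "good enough" subset

    initial_uncertainty   = math.log2(N)    # 10 bits if you would pick uniformly at random
    remaining_uncertainty = math.log2(N_R)  # 4 bits once only the acceptable set is left
    reduction = initial_uncertainty - remaining_uncertainty

    print(reduction)                        # 6 bits of uncertainty the deliberation has to remove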
You can then generalize this idea and say: instead of a hard partition into acceptable and non-acceptable outcomes, let us have a soft partition. That means in principle I can accept any outcome, but with different probabilities. The hard partition is then just a particular instance, where the probability is zero for all the excluded options and uniform inside the acceptable set. But I could also have something more like a Gaussian cloud: high probability here, and small but non-zero probabilities for the others, which I accept as well. So it is a more general way of looking at it: we no longer have crisp sets, but soft partitions, and we describe the decision maker by a probability distribution. The bounded-rational decision maker would be described by a probability distribution, and again this distribution depends on the resources I give you: with lots of resources the distribution must somehow be more concentrated, with fewer resources it is broader. And after what we have just been discussing, you might think: let us measure this, for example, by the entropy. The more resources you have, the lower the entropy of the distribution you choose from; the fewer resources you have, the higher the entropy. This is just some intuition to start with. With this intuition, we want to think about computations as processes that transform uncertainty into certainty, or rather, that transform uncertainty into less uncertainty. Here is an extreme example: in the beginning you do not know which x is best, so we represent this by a uniform distribution over all x; then the deliberation process runs, and at the end of it you know exactly which x is best, assuming there is only one. That final state can also be seen as a distribution, a special distribution, the delta distribution. If you now have limited resources, you run out of time somewhere in between, your deliberation is interrupted. Imagine the decision-making process is some kind of anytime process, so you can interrupt it at any point and still get an answer. Then you could imagine that after some time you have already been able to rule out certain alternatives, the ones that are unlikely to be the best, and instead of the delta function you get a distribution that has started to concentrate, but does not give you the exact answer yet. If we follow this intuition, we can think about decision making very abstractly, just as we think about inference: as a transition from a prior to a posterior distribution. We have our prior, and depending on the resources this could be one posterior or that could be another, without taking the particular mechanisms of decision making into account. Now we can think about this even more abstractly as a search problem, and search is costly. We can postulate, for example, that this search cost should fulfill certain axioms. As an illustration, say you want to find a treasure on an island. There is an event space omega with certain events in it, and you know the treasure is somewhere on the island.
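Before continuing with the search cost, here is a minimal sketch of the earlier point that a soft partition becomes more concentrated, and its entropy lower, as the resources grow. The utilities are made up, and the exponential form with a resource parameter beta is just one convenient choice for the sketch, not something derived here.

    import math

    U = [1.0, 0.8, 0.5, 0.2, 0.0]           # made-up utilities of five options

    def choice_distribution(beta):
        # p(x) proportional to exp(beta * U(x)); beta = 0 is the uniform prior,
        # large beta concentrates the distribution on the best option
        weights = [math.exp(beta * u) for u in U]
        z = sum(weights)
        return [w / z for w in weights]

    def entropy(dist):
        return -sum(pi * math.log2(pi) for pi in dist if pi > 0)

    for beta in [0.0, 2.0, 10.0]:
        print(beta, round(entropy(choice_distribution(beta)), 3))   # entropy drops as beta grows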
So, back to the island: you know omega is true, the treasure is somewhere on the island. If I now want to reduce this uncertainty further, that is costly, and the cost should be real-valued. It should be additive, in the sense that reducing from omega to B should cost the same as first reducing from omega to A and then from A to B; so it is decomposable. And you want the cost to be monotonic, in the sense that it is more expensive to find things that are small, that is, rare. If you postulate these properties, which are more or less the Shannon axioms just told in a slightly different way, you get the result that the search cost must be the information value of going from the knowledge state omega to the knowledge state X. Now we express this information value as a difference between potentials: we assign each event a number U, and we do it in such a way that the differences between the U's correspond exactly to these log probabilities. So if I ask, for example, what is the cost of going from omega to B, then omega has a potential, B given omega has a potential, and the difference between the two corresponds exactly to this log probability. If we do that, then the value of omega is essentially fixed; if you are a physicist you recognize it as the log partition sum. We could call it the free energy, hence the notation F. So if we zoom out a little from the story with the treasure and the search cost, what happens here is that we can think about the informational surprise as a utility, an intrinsic utility. And in fact we can think about these potentials that I introduced as utilities too, but as unnormalized utilities. Because probabilities have to sum to one: if I write p(x) = exp(-alpha * R_x), then these have to sum to one, so the R_x are in that sense normalized utilities. If instead I have potentials U_x on an arbitrary scale, then the sum over x of exp(-alpha * (U_x - F)) equals one, so U_x - F plays exactly the role of R_x; the two are related by an affine transformation. You can think of the U's as unnormalized utilities and the R's as normalized utilities, if you want. Okay, so should we have a break now? Okay.
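As a quick numerical check of that last relation before the break, here is a sketch with made-up numbers, following the cost-style sign convention used above.

    import math

    alpha = 2.0
    U = [3.1, 0.7, 5.0, 2.2]                 # unnormalized potentials on an arbitrary scale

    # free energy / log partition sum: F = -(1/alpha) * log( sum_x exp(-alpha * U_x) )
    F = -(1.0 / alpha) * math.log(sum(math.exp(-alpha * u) for u in U))

    R = [u - F for u in U]                   # "normalized" potentials R_x = U_x - F
    p = [math.exp(-alpha * r) for r in R]    # p(x) = exp(-alpha * R_x)

    print(sum(p))                            # 1.0 up to floating point, as required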