Hello and welcome everyone, it's ActInf GuestStream #33.1, January 17th, 2023. We are here with Ali and Alireza; we're going to hear about the taxonomy of surprise definitions. There will first be a presentation, followed by a discussion. So Alireza, thank you for joining. Please take it away with the presentation; we'll be looking forward to this and to the discussion.

Thank you very much, Daniel and Ali, for inviting me. So as you said, I'm going to talk about the taxonomy of surprise definitions, which is basically all, or most, of the material that we published in a recent paper in the Journal of Mathematical Psychology. It is joint work with Johanni Brea and Wulfram Gerstner, and Wulfram is my PhD supervisor. So, talking about surprise, let's start with a simple thought experiment. Let's say that you want to plan something for your weekend. To plan something for the weekend, we need information, like how the weather is going to be over the weekend. And there are some cues, like the fact that it's spring, that are informative about the weekend's weather, or the weather forecast saying it's going to be sunny. Given these cues, we can make some predictions: the weather forecast said it's sunny, and it's spring, so most likely it's going to be sunny. Let's say that it's Friday evening, you plan, for example, to go for a bike ride, you go to sleep, wake up the day after, open the window, and it's snowing. Seeing the snow, given that you expected sunny weather, this mismatch is what natural language would agree on calling a feeling of surprise. We can easily talk about these moments and say: yes, I was surprised, because I had planned to go for a bike ride. So in the scientific community, we can ask this question: in these moments, when you see snowy weather upon opening the window, what really happens in our brain?
There is a lot of literature, and a long-lasting debate in neuroscience and psychology, about the different roles of surprise in the brain. There's this loop of different, what I call, surprise-related signals: things like prediction error, free energy, surprise, or Bayesian surprise. Some people say that these things are related to learning and play an important role in model building or predictive coding. Some people say that they're quite important for exploration; for example, surprising events attract attention or drive curiosity. There is also literature on memory showing that surprising events segment our continuous stream of observations, so they help us to segment our memory. And of course, there is a lot of work on physiological signals of surprise; for example, when there's something surprising, your pupil dilates, or the EEG signal has a peak, and so on and so forth. The thing is that if I naively look at this picture from far away, it seems like surprise is everywhere in the brain. But then the puzzle is: is the surprise in the sentence "surprise modulates synaptic plasticity" really the same as the one in the sentence "surprise drives curiosity", or not? And that's basically the main motivation of our work: different studies in neuroscience and psychology use similar words, but to refer to different aspects of the moment of surprise. And I would call this art piece "the moment of surprise". Basically, the thing is that different studies, different experiments, different papers all talk about moments like this, situations like this, that we can all agree to call a surprising moment, but they talk about different aspects of it. The way that we summarized it, or put it in words in the abstract of the paper, was that there is no consensus on the definition of surprise.
When I posted the paper online on Twitter, the linguist Martin Haspelmath wrote in reply that it's unsurprising that there is no consensus on the definition of surprise, because it's not a technical term: everyday words are usually somewhat vague, and uniform definitions are desirable only for technical terms of science. And I cannot agree more with that. I want to emphasize that today I'm not going to talk about the meaning of surprise when we talk over a coffee break; it's in the context of neuroscience and psychology, when we talk about these moments and the role of surprise-related signals in the brain. So before going into the details, the math, and the whole framework, I want to present the paper in a nutshell: what is the abstract of the paper? What we do is that we take many papers from the field and identify several definitions of surprise. I list ten of the definitions here; in the paper there are 18, with all the details. These definitions include things like unsigned reward prediction error, confidence-corrected surprise, Bayesian surprise, postdictive surprise, and many things that people really use in different contexts. Then we ask: what are the similarities and differences between these different surprise definitions? We have three main results, analyzing these definitions from different aspects. The first main result is a technical classification. To talk about the technical classification, I would like to introduce a short notation. We call the observation Y, and we call cues X: things that are going to be informative about the observation. Then predictions can be seen as either a probability distribution over the observation given the cues, or one point estimate of the next observation. Looking at these surprise definitions, we could identify a category of definitions that we call observation-mismatch surprise definitions: they depend only on the point estimate of the next observation.
So they don't care what the probability was of the weather being sunny or rainy or snowy. They say: I predicted the weather was going to be sunny, and now it's not sunny; and what is the difference between being sunny and being snowy? That's the surprise. There's another group that we call probabilistic mismatch. They actually look at this distribution and say: it's true that I didn't expect the snowy weather, but it was not too unlikely, so it may not be as surprising as I would have expected. So they work with the distribution instead of a single point estimate or prediction. But actually, this conditional distribution is a marginalization of something bigger, because there is some uncertainty about, for example, how accurate the weather forecast is about the weather of the next day. That's what I would put in this parameter theta: the underlying rules of the environment, or hypotheses, like the reliability of the weather forecast. We don't have access to its true value, so we form a belief, and this pi of theta is our belief: a distribution over these possibilities. And the marginal distribution of the next observation given the cue is this marginal probability. For the third group of surprise definitions, the people defining them argue that what we actually care about is this belief: for example, how much we learn, how much we change this belief by making a new observation. Those are the definitions that we call belief-mismatch surprise definitions. So this was the technical classification: we put different definitions into three different groups and say that there is a difference in how they depend on the subject's belief. Then we go to a conceptual labeling, and that's what we call the taxonomy. We talk about what the concept, or the conceptual argument, is behind each of these surprise definitions. One group is what we call prediction surprise.
They care about how accurate or inaccurate the prediction is, how expected or unexpected the next observation is: the more accurate, the less surprising the observation. Then there's another category that we call change-point detection surprise. They don't only care about the prediction; they always compare the prediction with a baseline prediction. If something is unexpected, but that thing is unexpected under any hypothesis, then they claim it is not surprising, because it would just be a kind of outlier, so I should not consider it surprising. These are important for detecting change points, because what they care about is seeing whether there is another hypothesis under which the current observation would be more likely. For example, in the case of the weather forecast: during COVID times, there was this argument, there is this paper I'm citing here, that because there were many fewer flights going on, we had much less data from sensors, so weather forecasts suddenly stopped being accurate. And actually, if you were checking the weather forecast, you could see, without even knowing the news, that it got less and less accurate as time passed. These kinds of change-point detection surprise definitions are somehow optimal; I show later on that they're optimal for detecting these kinds of unobserved changes. The third class is what we call information-gain surprise. No matter whether there is a change or not, this new observation made me learn something more about the underlying rules of the environment: which medium, which weather forecast, is more reliable than the others. Every day that I see the weather, I use that to update my belief about the different weather forecasts. And for the last category, no matter which of these aspects I care about, the conceptual argument behind the definition is that if I'm more confident about my prediction, I should feel more surprise.
So the argument is that, no matter what the definition is, I should have an explicit term for confidence in my definition. Having this technical classification and this conceptual labeling, we can arrange the list of surprise definitions in these different boxes and get some structure. As you can see, some of these boxes are empty. And I would like to say: we have these mathematical definitions on the one side, but their justification comes from their relation to experiments and what they tell us about the brain. So it can be the case that in the future we see that none of these definitions is helpful, or we see in some situation that we need another definition. This figure tells us that there are actually places where we can come up with new definitions in new categories. But there are also restrictions. For example, if we want to talk about information gain, we should always talk about belief mismatch; we cannot talk only at the level of observations or the marginal distribution, because we are talking about learning, so we should talk about these very high-level hidden variables. Or when we talk about confidence, we inevitably need to talk about probability distributions, and we cannot talk about point estimates. But we can still fill these three different empty boxes here. Our last main result is that, although these different definitions are in different boxes, there are some links between them. We found conditions under which these definitions are indistinguishable. So there are situations where there is a one-to-one mapping between some of these definitions, and we cannot really distinguish them experimentally. So that was a summary of our work. Now I'm going to go to the mathematical framework and go one by one through all of these definitions and their links to each other. Please let me know if there is any clarification question.
For the mathematical framework, let's look at one very traditional experiment for studying surprise in the field of neuroscience: what's called the volatile oddball task. In a typical oddball task, at each time point we have two possibilities for the stimulus; here, for example, either a blue disk or a red square. And for a while, the blue disk is much less likely to appear on the screen than the other one. The reasoning behind the design of this experiment is that the participants who are watching this sequence on the screen, one stimulus after another, expect to see the red square more often than the blue disk. At the moment that he or she sees the blue disk, it's surprising, and that gives them a feeling of surprise. A question that one can ask is: can we really quantify how surprised a participant feels at time t = 4? So if we know that this is the sequence until time t = 4, can we quantify the amount of surprise or not? To talk about a participant's feeling of surprise at an observation, we have to make some assumptions about how participants perceive this sequence. We take a very common assumption of the field: that participants perceive their sensory observations as probabilistic outcomes of a generative model with hidden variables. So at each time, each of these observations comes from some distribution with some hidden variables. We assume, in a way, that the way participants think about this sequence of observations is that at each time, with some probability, one of these stimuli appears on the screen, and over time, given the sequence, they try to find this probability. To formally define this, we consider a generative model. Let's say that at time t, we denote the observation, what appears on the screen, by y_t. And there is this hidden variable, which we call the environment parameter, which is, for example, the probability of the blue disk appearing on the screen.
Now, for people who are familiar with Bayesian networks, I'm going to draw one: this link shows that the environment parameter determines the distribution over the observation. I can repeat the same thing for time t+1. Something that I didn't mention about this volatile oddball task is that, for a while, we have one image after another with some fixed probability, but once in a while there is a hidden change point, which changes the probabilities of the blue disk and the red square. So previously the blue disk was rare, and after that the other one can be rare or frequent. The environment parameter can thus change over time. So we define this random variable, the change-point indicator: if at time t+1 it is one, it means that there's a change in the environment, there's a new parameter; and if it is zero, it basically means that there's no change, the environment is stable, and theta_{t+1} is a copy-paste of theta_t. So this link just takes theta_t to theta_{t+1}. So far, the way that I presented this task was that different observations are independent of each other: at each time, one of these stimuli appears on the screen randomly. But there is this very beautiful paper by Meyniel and colleagues in 2016 in PLoS Computational Biology, where they argue that even if the sequence of observations is randomly designed, so these observations are independent of each other, people observing this sequence assume that there is some dependency. So what they estimate over time is not really the probability of observing the blue disk or the red square; they are estimating the probability of observing a blue disk after a red square, or a red square after a red square. What they assume is that there's a link between y_t and y_{t+1}. So far, I have a model of this oddball task that says that each observation depends on the environment parameter and the previous observation.
And there's a dynamic for this environment parameter. I do a bit of a trick and define this dummy variable, which I call the cue. For the oddball task, I just copy-paste the previous observation into the cue. The reason that we use the cue is to have one abstract notion of whatever has predictive power for the next observation. For a task with actions, the action can go into the cue; for the example of the weather forecast, the forecast can go into the cue; and so on. Now, having this motif, I can repeat it and make the whole sequence. Given this generative model, this mathematical model of the experiment, I can now mathematically define the surprise of observing y_{t+1}, and with this assumption we can propose some quantifications of the surprise of one stimulus. What is worth mentioning is that this generative model is a generalization of many other models in the field, which means that many of the experiments already done to study surprise can be modeled by this generative model, and our results hold for them. Going forward, this generative model also accounts for the moment of surprise from the beginning: the cue X is here, the observation is the weather of the next day, theta, the environment parameter, is, for example, how reliable the weather forecast is, and the change point is whether something happened in the environment, for example, the pandemic and the decrease in the number of flights. Now I'm going to make this definition more formal and discuss the dynamics at each of the different levels of the generative model. I start with the observations. We assume that this is the probability, at time t+1, of observing Y conditioned on the cue being X and the parameter being theta. And basically, this equation here means that this probability is time-independent.
As long as I know the cue and the parameter, it doesn't matter whether it's time step one or time step t; it is always the same distribution, which we denote by P of Y given X and theta. So basically, any change in the environment goes into the parameter, and the distribution is fixed. For the cue variables, we don't put any constraint; they can even be independent of anything the subject assumes, which is the case, for example, for the weather forecast: we just turn on the TV and see what the prediction for the next day is going to be. Then we have the change-point indicators, and their dynamic is that at each time, independently, we ask whether there is a change or not: C_{t+1} at time t+1 is a binary Bernoulli sample with probability p_c, which is what we call the change-point probability. So at each time point, with probability p_c, there is either a change or not. Then comes, in a way, the most important part of the model: how these parameters change over time. We consider a prior belief pi_0, which says that at time one, before the experiment starts, participants have some assumption, some belief, about the parameters behind this generative process; that is what we denote by pi_0. At time t, if there is no change in the environment, if C_{t+1} is equal to zero, theta_{t+1} is just a copy-paste of theta_t. But if there is a change in the environment, we assume that theta_{t+1} is sampled again, independently of the past, from this prior belief. So we have a stable environment with the same theta until C_{t+1} is equal to one; then we have a change in the environment, and the parameter is re-sampled from the prior belief. The equations that I wrote here fully describe the joint distribution of C_1 to C_{t+1}, theta_1 to theta_{t+1}, and so on: the whole generative model is described by these definitions. But if you want to see the precise mathematical definition, please look at Definition 1 in the paper.
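To make these dynamics concrete, here is a minimal toy sketch of the generative model, under simplifying assumptions of my own, not the exact model from the paper: binary observations, no cues, and a uniform prior pi_0 over theta; the function name `sample_oddball` is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_oddball(T, p_c=0.05, rng=rng):
    """Sample (c, theta, y): change indicators, parameters, observations."""
    c = np.zeros(T, dtype=int)
    theta = np.zeros(T)
    y = np.zeros(T, dtype=int)
    theta[0] = rng.uniform()                  # theta_1 ~ prior pi_0 (uniform here)
    for t in range(T):
        if t > 0:
            c[t] = rng.random() < p_c         # C_t ~ Bernoulli(p_c)
            # re-sample theta from the prior on a change, else copy-paste it
            theta[t] = rng.uniform() if c[t] else theta[t - 1]
        y[t] = rng.random() < theta[t]        # y_t ~ Bernoulli(theta_t)
    return c, theta, y

c, theta, y = sample_oddball(500)
```

Between change points theta stays constant, so the observation statistics are stationary until the next hidden change, exactly as in the volatile oddball task described above.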
Before going forward, I would like to introduce some notation: whenever I write something like this, and there is no ambiguity, I shorten the notation by dropping the random variables, the capital letters. Then I define what I call the observer's belief at time t. If you remember, pi_0 was defined as the initial belief; now pi_t is the observer's belief at time t. That is: given that I observed y_1 to y_t with cues x_1 to x_t, what is my knowledge about the current parameters of the environment? This can be summarized in the posterior probability of theta_t given the previous cues and observations. In our example, this means that given a sequence of days and the predictions of the weather forecast, I can form a belief about how reliable the weather forecast is at time t. At the end, we define this marginal probability, which is the probability of y given x: now I drop theta, put my knowledge in here, and integrate over all possibilities of what the world is at time t and how, under each possibility, the cue relates to the observation. So this was the mathematical framework. Given this framework, I can now go to a formal investigation of the different definitions and present our classification, indistinguishability conditions, and taxonomy. What is surprise? Having this model, I can ask the question: conditioned on the previous observations y_1 to y_t and the cue variables x_1 to x_{t+1}, these ones and these ones, how surprising is the next observation y_{t+1}? We are at time t, and we are going to observe y_{t+1}; this is now a valid mathematical question in this framework. To make a prediction about y_{t+1}, I need information about theta_t and x_{t+1}: the cue and the rules of the environment. Whatever I know about the rules of the environment is summarized in my belief, in whatever knowledge the observations and previous cues gave me about theta_t.
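The observer's belief pi_t can be made concrete with a grid-based sketch. Again, this is my own toy simplification, not the paper's general framework: binary observations, no cues, a flat prior over a discretized theta, and an update that already allows for a change with probability p_c.

```python
import numpy as np

grid = np.linspace(0.01, 0.99, 99)        # discretized values of theta
pi0 = np.ones_like(grid) / len(grid)      # flat prior belief pi_0

def likelihood(y, grid):
    """P(y | theta) for binary y."""
    return grid if y == 1 else 1.0 - grid

def marginal(y, pi, grid=grid):
    """Marginal probability P(y; pi) = sum_theta P(y | theta) pi(theta)."""
    return float(np.sum(likelihood(y, grid) * pi))

def update_belief(pi, y, p_c, grid=grid, pi0=pi0):
    """One Bayesian step: with probability p_c a change may have occurred."""
    pred = (1.0 - p_c) * pi + p_c * pi0   # belief over theta before seeing y
    post = likelihood(y, grid) * pred     # weight by the likelihood of y
    return post / post.sum()

# Run the observer on a block of observations where y = 1 dominates.
pi = pi0.copy()
for y in [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]:
    pi = update_belief(pi, y, p_c=0.01)

theta_hat = float(np.sum(grid * pi))      # posterior mean estimate of theta
```

After seeing mostly y = 1, the belief concentrates on high values of theta, which is exactly the "knowledge about the current parameters" summarized in pi_t.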
Given the belief and the cue, I can throw away all the previous observations and say that I only need the belief and the cue. Surprise is basically a function that takes as arguments the belief, the cue, and the next observation, and gives back a real value. One famous example is what people call surprise, or Shannon surprise. Its history is so long, it has been used for so long, that I cannot really point to the first paper that used it, but you can look at the paper of Barto and colleagues in 2013 for a review. It basically says that, given a belief, a cue, and an observation, the surprise of this observation is equal to minus the logarithm of the marginal probability: the more likely the observation, the less surprising it is. There's also another way to define it. Because here I had the knowledge at time t, one can say that there was a possibility of a change point in between, and one can extend this probability: when I'm at time t and predicting the next observation, I should also consider the possibility that the world may change in between, and put some weight on the prior knowledge. So the next observation comes, with some probability, either from the current world or from a prior world that has been reset. So there are two possibilities for these definitions, and what I want to emphasize is that in volatile environments, we are not necessarily sure which one we should take. One may say that the second definition makes more sense because it is the full probability. But there is experimental evidence in this paper by Nassar and colleagues in 2010 in the Journal of Neuroscience. What they argue is that when people observe y_{t+1}, they update their knowledge by assuming that there might have been a change, but when they predict the next observation, they don't consider that there may be a change. So, in a way, when people are thinking about the future, they don't consider the possibility of a change.
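As a quick numerical illustration of Shannon surprise, S_Sh = -log P(y; pi), here is a sketch in my toy setting (binary observations, discretized belief; the belief concentrated near theta = 0.9 is an arbitrary choice of mine):

```python
import numpy as np

grid = np.linspace(0.01, 0.99, 99)        # discretized values of theta

def marginal(y, pi):
    lik = grid if y == 1 else 1.0 - grid  # P(y | theta)
    return float(np.sum(lik * pi))        # P(y; pi)

def shannon_surprise(y, pi):
    return -np.log(marginal(y, pi))       # S_Sh(y; pi) = -log P(y; pi)

# A belief saying "y = 1 is frequent": the rare outcome y = 0 should be
# more surprising than the frequent outcome y = 1.
pi = np.exp(-0.5 * ((grid - 0.9) / 0.05) ** 2)
pi /= pi.sum()
s_rare = shannon_surprise(0, pi)
s_frequent = shannon_surprise(1, pi)
```

The less likely the observation under the marginal, the larger S_Sh, matching the verbal definition above.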
And in this sense, we argue that the first definition, the one that I call S_Sh,2, is more consistent with the experimental evidence. But throughout the paper, we keep both versions. This happens for many of the definitions, and we really do the tedious job of keeping both versions through the whole paper; today in the talk, I only present one version of each definition. So this was one of the examples. I can present different examples of surprise definitions from the field, and with three of them I'm going to present the first classification that we have. The first one was Shannon surprise. Another one is the absolute error. As I said before, one can say that, given the belief I currently have, I can make a prediction, for example by taking the average of this marginal probability; and then the difference between this average, which was my prediction, and the actual observation is defined as the surprise. It's some sort of prediction error. And there's also this other, very famous definition, Bayesian surprise, which says: I don't care what the difference is between y_{t+1} or y-hat and this marginal probability; what I care about is how much I update my knowledge about the world upon observing y_{t+1}. D_KL here stands for the KL divergence between the belief before observing y_{t+1} and after observing y_{t+1}. It is not written as a function of y_{t+1}, x_{t+1} and pi_t, but in Remark 2 in the paper, we show that if p_c, the probability of change, were equal to zero, so if there were no changes in the environment, Bayesian surprise could be written as the difference between the expected Shannon surprise and the Shannon surprise itself. And funnily enough, the Shannon surprise here appears with a negative sign. So it seems like Bayesian surprise and Shannon surprise are really complementary definitions.
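Bayesian surprise can also be illustrated numerically. In my toy setting (binary observations, discretized belief, and p_c = 0 for simplicity), it is the KL divergence between the belief before and after an observation; I write D_KL(pi_t || pi_{t+1}) here, though the direction of the divergence is a convention:

```python
import numpy as np

grid = np.linspace(0.01, 0.99, 99)        # discretized values of theta

def posterior(pi, y):
    """Belief after observing y, assuming no change point (p_c = 0)."""
    lik = grid if y == 1 else 1.0 - grid  # P(y | theta)
    post = lik * pi
    return post / post.sum()

def kl(p, q):
    """KL divergence D_KL(p || q) on the grid."""
    return float(np.sum(p * np.log(p / q)))

def bayesian_surprise(pi, y):
    return kl(pi, posterior(pi, y))       # belief before vs. belief after

# A belief saying "y = 1 is frequent".
pi = np.exp(-0.5 * ((grid - 0.9) / 0.05) ** 2)
pi /= pi.sum()
s_unexpected = bayesian_surprise(pi, 0)   # rare outcome: belief moves a lot
s_expected = bayesian_surprise(pi, 1)     # frequent outcome: barely moves
```

The unexpected outcome forces a larger belief update than the expected one, which is exactly the "how much did I learn" reading of Bayesian surprise.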
In particular, it is important to say that we cannot really get rid of the belief; the belief really remains here. We have marginal probabilities here, but we also need the belief itself to evaluate Bayesian surprise. For the case of a volatile environment with p_c greater than zero, we showed in Proposition 8 that the interpretation remains the same, but the equations become a bit uglier, so that's why I don't bring them here, but you can check them out. So, having these three definitions, something important is that Shannon surprise could be written in terms of the marginal probability; the absolute error could be written in terms of y-hat, a point estimate of the next observation; but for Bayesian surprise we needed the whole belief, because we had to compute this expectation. And that's how we make our first technical classification, which is based on the dependence on the belief: how a surprise definition depends on the belief pi. Here on top we have pi; from pi we can extract the marginal probability, and from the marginal probability we can extract a point estimate. The first group are the observation-mismatch surprise definitions that depend only on the point estimate, for example the absolute error that I introduced. The second group are the ones that depend only on the marginal probability, like Shannon surprise, which is minus the logarithm of the marginal probability. But the last group cannot be evaluated by knowing only these two variables, the distribution and the next prediction; they need the whole belief, like Bayesian surprise. Going back to the figure that I showed in the introduction: so far we have seen two definitions here in this column, two definitions here in this column, and two definitions here in this column. The first two, absolute error and Shannon surprise, are more about the prediction.
So this one was saying: what is the mismatch between the prediction and the observation? And this one was saying: how likely or unlikely was the observation? While Bayesian surprise was saying: how much information did I gain upon this observation? Okay, that was the technical classification. I can now go forward and discuss the indistinguishability conditions. Before doing so, I want to introduce another definition of surprise. There is this idea in the field that surprising events increase learning rates. There is some theory for it, and there are experiments showing that it actually happens: when there is something surprising, the learning rate increases. But there is this logical question: is increasing the learning rate always good for making better predictions, or better knowledge of the environment, or not? Because sometimes something can be unlikely, but if it is just an outlier, maybe I should not change my learning rate; I should just ignore this observation. What we found out is that it is necessary to have a comparison between the current belief and the prior belief. Because basically, the moment you want to increase your learning rate, upon surprising events, is when there is a change in the environment, and you want to say: I want to get rid of the old observations. This Bayes Factor surprise, we proposed it in a paper in Neural Computation in 2021, and it always compares the probability of the observation under the prior belief with that under the current belief. It's some sort of hypothesis testing at each point, asking whether there was a change or not. If we ignore the denominator, the probability under the prior belief, this is basically a decreasing function of the probability under the current belief, very similar to Shannon surprise: the less likely an event, the more surprising it is.
But if that event is also unlikely under the prior belief, then it may not be surprising. The reasoning becomes more valid when we show, in Proposition 1 of the paper, that this definition is actually optimal for modulating learning. In Proposition 1, we show that after observing y_{t+1}, the new belief can be written in this form, which is a trade-off between two terms: integrating the new observation into the previous belief, so take the previous belief, take the new observation, combine the two, and make a new belief; or throwing away the old belief, saying that there was probably a change in the environment, so I should reset and start from scratch, putting the new observation together with the prior belief to make a reset belief. This trade-off is controlled by something that we call the adaptation rate, here shown by gamma_{t+1}, which turns out to be an increasing function of the Bayes Factor surprise. So the higher the Bayes Factor surprise, the higher the adaptation rate; and the higher the adaptation rate, the more weight on the reset. This rule really comes out of Bayesian inference; it is a normative rule. Basically, the Bayes Factor surprise naturally emerges, or at least we defined it in a way to have this decomposition of the belief. In that paper, we also propose a variational method that turns out to have a very simple update rule for distributions in the exponential family, which we call variational surprise-minimization learning. And I would like to mention that recently it has gotten some applications in the active inference community, where people have used it for model building. So I introduced the Bayes Factor surprise as a new surprise definition. Let's now just review the definitions that I have shown so far. The first one was the absolute error surprise; it was in the family of observation mismatch, and it cared about how good my prediction is.
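The trade-off in Proposition 1 can be checked numerically in my toy setting (binary observations, flat prior, discretized belief). Under these assumptions, the exact posterior with change-point probability p_c equals a mixture of an "integrate" belief and a "reset" belief, with weight gamma = m * S_BF / (1 + m * S_BF), where m = p_c / (1 - p_c):

```python
import numpy as np

grid = np.linspace(0.01, 0.99, 99)        # discretized values of theta
pi0 = np.ones_like(grid) / len(grid)      # flat prior belief pi_0

def lik(y):
    return grid if y == 1 else 1.0 - grid # P(y | theta)

def marginal(y, pi):
    return float(np.sum(lik(y) * pi))     # P(y; pi)

def bayes_factor_surprise(y, pi):
    return marginal(y, pi0) / marginal(y, pi)   # S_BF = P(y; pi_0) / P(y; pi_t)

# Current belief: the observer thinks y = 1 is frequent.
pi = np.exp(-0.5 * ((grid - 0.9) / 0.05) ** 2)
pi /= pi.sum()

p_c, y = 0.1, 0                           # volatile world, surprising outcome
m = p_c / (1.0 - p_c)
s_bf = bayes_factor_surprise(y, pi)
gamma = m * s_bf / (1.0 + m * s_bf)       # adaptation rate

# Exact posterior: a change may have occurred before y was observed.
exact = lik(y) * ((1.0 - p_c) * pi + p_c * pi0)
exact /= exact.sum()

# "Integrate" (no change) vs. "reset" (restart from the prior) beliefs.
integrate = lik(y) * pi
integrate /= integrate.sum()
reset = lik(y) * pi0
reset /= reset.sum()
mixture = (1.0 - gamma) * integrate + gamma * reset
```

The mixture reproduces the exact posterior, and the surprising outcome gives S_BF greater than one, so gamma puts substantial weight on the reset, exactly the "surprise modulates learning" story above.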
Then there was Shannon surprise, saying how likely or unlikely an observation is; it was again caring about the prediction, but it was in the probabilistic mismatch. The Bayes Factor surprise that I just showed is in the family of change-point detection, because, as we saw, it cares about whether there was a change point or not; but it is still among the probabilistic-mismatch surprise definitions. And the other definition that I showed was Bayesian surprise, which was a difference between the previous belief and the new belief, and is among the belief-mismatch surprise definitions. Although they are in different columns, in different boxes, one can ask whether there is any link between the Shannon surprise here and the Bayes Factor surprise here, because there is at least one term that is repeated in both definitions. That's what we show in Proposition 2 of the paper: the Bayes Factor surprise can be written as a function of these interesting terms here, which are not Shannon surprise, but are very interestingly related to it. They are differences in Shannon surprise: differences between the Shannon surprise under the current belief and under the prior belief. So the Bayes Factor surprise can be written as a function of differences in Shannon surprise, and one may say that the Bayes Factor surprise has an interpretation as a relative surprise; in that sense, it always compares the prior belief with the current belief, and that's why it's good for learning. As a consequence, in Corollary 1, you can argue that the optimal surprise modulation of learning can also be written in terms of these differences, instead of the Bayes Factor surprise. But the most fun thing is that in Corollary 2 we show that if the marginal probability under the prior belief is uniform, which basically means that this term is constant, then there are strictly increasing mappings between the Bayes Factor surprise and the two definitions of Shannon surprise.
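A Proposition-2-style relation can be verified numerically in my toy setting, keeping only one version of each definition: there, the Bayes Factor surprise equals the exponential of the difference between the Shannon surprise under the current belief and under the prior belief.

```python
import numpy as np

grid = np.linspace(0.01, 0.99, 99)        # discretized values of theta
pi0 = np.ones_like(grid) / len(grid)      # flat prior belief pi_0

def marginal(y, pi):
    lik = grid if y == 1 else 1.0 - grid  # P(y | theta)
    return float(np.sum(lik * pi))

def shannon(y, pi):
    return -np.log(marginal(y, pi))       # S_Sh(y; pi)

def bayes_factor(y, pi):
    return marginal(y, pi0) / marginal(y, pi)   # S_BF

pi = np.exp(-0.5 * ((grid - 0.8) / 0.1) ** 2)   # some current belief
pi /= pi.sum()
y = 0
lhs = bayes_factor(y, pi)
rhs = np.exp(shannon(y, pi) - shannon(y, pi0))  # exp of Shannon difference
```

The identity follows directly from S_Sh = -log P(y; pi), which is why the Bayes Factor surprise reads as a relative surprise: a difference of Shannon surprises between current and prior belief.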
Strictly increasing mappings basically means that these are one-to-one functions of each other, so they are invertible: if I have one of them, I can find the other two. In a way, they are indistinguishable. And that is one of the big results of our paper: we identify the conditions under which different definitions are indistinguishable. Here in this figure, you see different definitions connected to each other with lines. Each line corresponds to one of these conditions: if the condition is satisfied, then those two definitions are indistinguishable. And C2 here means that the condition is proved in Corollary 2 of the paper. Why are these indistinguishability conditions important? The first reason is experiment design. For example, take the oddball task that I introduced before. Usually in an oddball task, it is assumed that the marginal distribution under the prior belief is flat, because at the beginning there is no preference between the blue disk and the red square. As a result, in this kind of experiment, all five of the definitions I have here are indistinguishable; I cannot run an oddball task and expect to see a difference between these five definitions. In addition, some people argue that even the prior belief itself is flat. If that is the case, then even more definitions become indistinguishable, because the oddball task is a categorical task, which is one of the special cases we identify; this basically means that seven of these definitions are indistinguishable. So it is quite important that if we want to design an experiment to distinguish some of these surprise definitions, to study different definitions of surprise, we should be careful not to have an experiment where they are indistinguishable.
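A small numerical illustration of the flat-marginal condition (the four-symbol predictive distributions are made up): when the prior predictive is uniform, as assumed in an oddball task, the Bayes Factor surprise is a constant times the exponential of Shannon surprise, so the two order observations identically and cannot be told apart by ranking responses.

```python
import math

# Categorical observations; flat prior predictive over 4 symbols,
# skewed current predictive shaped by experience.
p_prior = [0.25, 0.25, 0.25, 0.25]
p_current = [0.7, 0.15, 0.1, 0.05]

sh = [-math.log(p) for p in p_current]               # Shannon surprise
sbf = [p0 / p for p0, p in zip(p_prior, p_current)]  # Bayes Factor surprise

# Both definitions rank the four possible observations the same way.
rank_sh = sorted(range(4), key=lambda i: sh[i])
rank_sbf = sorted(range(4), key=lambda i: sbf[i])
```

Here `sbf[i] == 0.25 * exp(sh[i])` for every symbol, a strictly increasing mapping; only an experiment with a non-flat prior predictive could separate the two.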
What we do is take a series of papers in the field with experiments studying different definitions of surprise, and we identify which of these conditions are satisfied or not in each of them. The second thing this indistinguishability condition can be important for is computational modeling. There is a new preprint on bioRxiv in which the authors design a spiking neural network that tries to learn in a volatile environment and modulates its learning rate with a notion of surprise. In the setting they analyze, we can show that the conditions corresponding to these three lines are satisfied, so these definitions of surprise are equivalent. Why is this interesting for us? Because this one, the Bayes Factor surprise that I showed before, is optimal for learning, while this one is pretty easy to implement: the Bayes Factor surprise was a function of probabilities, whereas this one lives in the observation space. And that is actually the fun thing: what they do is build a spiking network that computes something related to this observation mismatch surprise definition, while there is a normative justification for why this surprise definition can be helpful for learning. So, putting it together: so far I have presented the technical classification that we proposed and the indistinguishability conditions that we found. Now I am going to discuss the taxonomy. I had 40 minutes, so I still have time, right? For the taxonomy, I need to introduce another definition; I promise this is the last definition I introduce today. It is the notion of confidence correction. There is this idea that higher confidence must lead to a stronger feeling of surprise: the more confident you are about your predictions, the more surprised you should feel. Faraji et al. propose this definition in their paper. It may look a bit strange at the moment.
I will demystify it in a bit. It is the difference between the current belief and, after seeing this observation, what my belief would be if I threw out all the previous observations and made a new belief. So it is a next belief, but under the assumption that there was a reset. Just to compare the confidence-corrected surprise with Bayesian surprise: in Bayesian surprise, I had a difference between the current belief and the next, actually updated, belief; whereas the confidence-corrected surprise compares the current belief with the reset updated belief. The reason this is interesting is that in Proposition 9 of the paper, we show that this definition can be written in this way: the first term is a difference in Shannon surprise, the second term is a kind of Bayesian surprise, and the third term is a notion of confidence. So this one looks like a relative surprise, which, as I argued before, is good for modulation of learning. And that was actually the original goal of Faraji et al.: to find a surprise definition that is good for surprise modulation. The last term is a notion of confidence, and in a preprint online, we show that in some situations it seems more intuitive to have some confidence correction in the definition. But I don't want to enter that discussion at the moment; I just want to introduce this class of definitions with confidence correction and give an overview of all the definitions so far. The absolute error was a difference between observation and prediction, in this class. Shannon surprise: how unlikely something is. Bayes Factor surprise: how unlikely something is compared to a baseline, the prior belief. Bayesian surprise: how much information I gain upon observing something. And confidence-corrected surprise includes an explicit notion of confidence. Having these five definitions, we have now seen one definition per row and at least one definition per column. Now I can introduce our taxonomy.
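The comparison just drawn can be sketched in the same toy hypothesis-grid style (the KL direction and the flat reset prior are illustrative choices following the talk's phrasing, not the exact formulas of the paper): Bayesian surprise compares the current belief with the actual updated belief, while the confidence-corrected variant compares it with a belief rebuilt from scratch.

```python
import math

thetas = [0.1, 0.5, 0.9]       # candidate Bernoulli parameters
flat = [1/3, 1/3, 1/3]         # reset / flat prior
current = [0.05, 0.15, 0.8]    # a fairly confident belief in theta = 0.9

def likelihood(y, th):
    return th if y == 1 else 1 - th

def bayes_update(y, belief):
    post = [b * likelihood(y, th) for b, th in zip(belief, thetas)]
    z = sum(post)
    return [p / z for p in post]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

y = 0  # an observation that contradicts the confident belief
bayesian_surprise = kl(current, bayes_update(y, current))      # vs. actual update
confidence_corrected = kl(current, bayes_update(y, flat))      # vs. reset update
```

Because the current belief is confident, the reset comparison yields a larger value here, matching the intuition that higher confidence should amplify surprise.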
On the screen you now see, in a seemingly arbitrary order, the definitions that I had here. We propose to label these different definitions based on the quantity they measure. This group is what we call prediction surprise: they care about how good your prediction is about the next observation. This group we call information gain surprise: they measure how much you learn from a new observation. And this group we call change point detection surprise: these were some sort of relative measures of surprise. And we have this overlapping labeling: normally a definition is prediction or change point detection, and we can make the effect stronger, or correct it, by adding an explicit term for confidence. One can ask: okay, so far I talked about when they are indistinguishable and put them in different categories, but how different are these definitions in practice? Because if I pick a random oddball task from the field and look at the data, all these definitions look quite correlated, and one may argue: does it really matter to have different definitions? But it is actually possible to design experiments where they differ. Here I am showing one example; I am not going to discuss the experiment itself, which we presented in this preprint. On the x-axis you see time, on the y-axis average surprise, and four definitions from these four different categories. Here is a change point in the environment, and we see that after the change point they have completely different behavior: one goes down, one has a spike, one goes down and up and down again. So there are possibilities to make theory-driven experiments where different definitions of surprise have very different behavior and make very different predictions. Another example is this preprint, which focuses on the role of different surprise definitions in curiosity-driven exploration: how differently they can drive exploration.
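To see qualitatively different time courses, here is a minimal simulation in the same spirit (not the experiment from the preprint; the change time, parameter grid, and seed are arbitrary): Bernoulli observations with an abrupt change, tracked by exact Bayesian inference over a parameter grid, with three definitions computed at each step.

```python
import math
import random

random.seed(0)
thetas = [i / 10 for i in range(1, 10)]   # grid of Bernoulli parameters
prior = [1 / len(thetas)] * len(thetas)

def lik(y, th): return th if y == 1 else 1 - th
def predictive(y, b): return sum(bi * lik(y, th) for bi, th in zip(b, thetas))
def update(y, b):
    post = [bi * lik(y, th) for bi, th in zip(b, thetas)]
    z = sum(post)
    return [p / z for p in post]
def kl(p, q): return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

belief = prior
shannon, bayes_factor, bayesian = [], [], []
for t in range(40):
    theta_true = 0.9 if t < 20 else 0.1     # abrupt change at t = 20
    y = 1 if random.random() < theta_true else 0
    new_belief = update(y, belief)
    shannon.append(-math.log(predictive(y, belief)))                 # prediction surprise
    bayes_factor.append(predictive(y, prior) / predictive(y, belief))  # change point detection
    bayesian.append(kl(belief, new_belief))                          # information gain
    belief = new_belief
```

Plotting the three lists against time shows the kind of divergence described in the talk: around the change point, Shannon surprise spikes, the Bayes Factor surprise jumps above 1, and Bayesian surprise rises as the belief is revised.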
And there are also many other works, not from our lab, that focus on the physiological signatures of different definitions of surprise. This one in particular, in PLOS Computational Biology, I found quite impressive: they found signatures in EEG for all three groups of prediction surprise, confidence correction surprise, and information gain surprise. So with a fixed experiment, looking at the EEG, they could see that all three have their own correlates or signatures in the EEG signal. So, I introduced the mathematical framework and formally discussed many definitions of surprise. We had the technical classification, the indistinguishability conditions, and the taxonomy. Take-home message one: surprise in different experiments refers to different definitions. Message two: these different definitions can be classified based on their dependence on the subject's belief and on the quantity they measure, which is the taxonomy, or labeling, we proposed. And the third message: under specific conditions, some of these definitions are indistinguishable, and I argued that this is useful for experiment design and also for computational modeling. So, putting everything together, back to the figure that I showed at the beginning of the presentation: our proposition is to replace, or at least order, these different signals or definitions in a structured way. Once we put them in a structured way, let me rephrase it like this: then we can ask, for example, for exploration, which of these different categories is more relevant? Is it the belief mismatch surprise definitions that are more relevant to exploration, or the observation mismatch surprise definitions? So we are not talking only about specific definitions but about categories. In addition to these surprise-related signals, or surprise definitions, we argue that there is also a need for novelty signals, or novelty definitions.
And we argue in this paper that novelty and surprise are fundamentally different, so everything that I said today would not hold for novelty. I would like to thank my collaborators for this joint work, our very cool lab in Lausanne, and you for your attention. Thank you very much, Daniel. All right. Just getting things back. Awesome. I will restore our videos. Well, thanks for that awesome talk. If anyone has a question in the live chat, please feel free. Otherwise, Ali, if you'd like, you can re-enable your video and feel free to give a first comment, or I'm happy to give a comment. But you can go first. Yeah, sure. But for some reason, I can't activate my camera here. Let me check it again. Okay. So thanks a lot, everyone. First of all, I congratulate you, and of course your co-authors, for writing such a fascinating paper. I truly enjoyed it. Well, I have some comments if I may, and also maybe a couple of questions. Your work, actually these kinds of classification works or taxonomies, is in my opinion a really significant contribution to the whole area of psychology, and especially cognitive science. Because as Muhammad Ali Khalidi's recent book Cognitive Ontology argues, these kinds of taxonomic approaches have been long debated among neuroscientists, psychologists and cognitive scientists, especially regarding the difference between cognitive kinds and natural kinds and whether, or how, they map onto each other. Because as you also mentioned, we have a folk psychological notion of surprise, or let's say a cognitive kind of surprise, but we're not always clear about how to translate that into precise mathematical terms. I remember in the Active Inference book by Parr, Pezzulo and Friston, there was about a paragraph or so distinguishing between the folk psychological notion of surprise and the statistical surprise, the surprisal functions corresponding to various distributions.
So basically, as I see it, your paper is a kind of expansion of that single paragraph, consolidating all the different conceptions of surprise into a whole unified framework. We were talking the other day about the importance and significance of the recent move toward applied category theory as a kind of, as I like to call it, conceptual housekeeping, or better still a kind of cartography of the knowledge terrain. So this kind of research, in my opinion, can be seen as congruent with this recent move toward, as Toby St Clere Smithe would call it, a kind of well-typed science: being precise and clear about our definitions. I also wanted to get back to that distinction between cognitive kinds and natural kinds. There are different taxonomic approaches in cognitive science addressing these problems, but as I understand it, your work here is more in line with the revisionist approach to these classification efforts, which basically says the current cognitive taxonomy is not precise enough to adequately address the problem of distinguishing between cognitive kinds and natural kinds, so we need more precise formalism to do that. It also reminds me of Dalton Sakthivadivel's notion of the blanket index, introduced in his paper on weak Markov blankets, because before the publication of that paper there was much confusion and ambiguity about the exact mathematical definition of the Markov blanket, or to put it in other words, to make it more general and more precise, what exactly we mean by a Markov blanket. So my question is this: as I see it, so far you have mapped these different kinds of surprise in a kind of discrete space, as you showed in one of your tables.
Do you see it as a feasible move forward, or let's say a feasible move toward a more general conception of this framework, to conceive of this unified framework as a two-dimensional continuous, or n-dimensional continuous, manifold of possible definitions of surprise, and maybe even to define an index to navigate this continuous manifold? Because in that case, obviously, you could possibly avoid those blank spaces you have in your tables, and maybe we could even gain a much deeper insight into how exactly those different kinds of surprise relate to each other in a continuous sense rather than a discrete sense. Do you think this kind of move is even feasible, from your viewpoint? That's a very good point. Thanks a lot for all the remarks; I enjoyed them very much. The way that I defined surprise in a general sense was as a function, or functional, that takes the observation and the belief and maps them to a real-valued number. So any functional of this type can be seen as a measure of surprise, and the whole space of such functionals is quite big and continuous. So in principle, yes, of course we can do that. But the space is too big, actually; we would need to put some constraints on it. I think for future work, if one wants to go in that direction, one should ask what properties one would like a surprise definition to have, in order to constrain this space, and then in some sense parameterize the space to have some interpretation. Because at the moment I am saying: okay, let's look at this part of the space where these definitions live; people are already talking about these definitions, and I would call them, for example, information gain surprise definitions. So maybe one can really formalize all the definitions it is possible to define, because, okay, I only talked about these two particular definitions with KL divergence.
If you replace the KL divergence with any other measure of distance on the space of probabilities, you get another information gain surprise definition. So I think it is feasible in principle, but it is a bit of a tedious job, and how useful it would be is not clear to me, because we already have many definitions. And as I tried to say, what justifies a definition is, at the end of the day, how useful it is to explain experimental evidence and help us think about cognition. Justified true belief distribution. Thank you, Alireza. I wanted to add a general comment and then bring us back to active inference. In exploring the active inference ontology, we've often found this tension where a given term has one definition that is very folk psychological, very everyday: belief, surprise, observation, sense, action. These are some of the most common words in any language. And then your work started to dissect that iceberg and characterize its taxonomy, showing that the pointer of a given word, surprise in this case, can actually host a diversity of technical definitions with some subtle contradictions or compatibilities and areas of indistinguishability. So I think it really raises our attention to the horizon and shows that for a lot of terms people use in a day-to-day context, there may be multiple definitions. And there are many implications: there isn't a single way to model action or belief or surprise. There are different flavors, and even within a flavor there are going to be different ways to deploy it in a given model or script or modeling context. So I think the formal coherences you're providing are going to go a long way toward ensuring, given that kind of pluralism in modeling and statistics, which has always been the case, that we have connections robust enough to enable comparison, like you did with different empirical setups.
Whether the experimenters understood SH1 versus SH2, we can post hoc, essentially as peer review, annotate or augment the definitions and repartition the different ways they analyzed their data. And so I think that really opens up a lot of directions. Something that you had on several slides but didn't really go into much was F star. So could you add a little bit on free energy? What is free energy? What's free about it? What's energetic about it? Why do we use it to help us engage in perception, cognition and action in the free energy principle and active inference? Thanks a lot for the question. The way I see free energy in the context of what I discussed is this: as long as I talk about exact Bayesian inference and really have my exact update rule for the belief, going from the previous observation and previous belief to the exact next belief, free energy doesn't play much of a role. Free energy plays a role as soon as I want to talk about approximate inference and go to the variational setting. In that case, free energy basically defines a loss between my approximation and the exact true belief, and one can show that it is an upper bound on the Shannon surprise that I defined. The difference between the free energy and the Shannon surprise is how good my approximation is. That's why, if I do exact inference, the minimized free energy is equal to the Shannon surprise; in that sense, I cannot really distinguish minimized free energy from Shannon surprise. But if I work in the approximate inference setting, the minimized free energy is the best approximation of Shannon surprise that I can get in that framework. Awesome. Just to restate that, because it's actually one of the fundamental aspects of the approach taken in active inference: if the true state of the world is a Gaussian and we're doing exact inference with a Gaussian prior, we will seek to minimize our Shannon surprise.
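The bound just stated can be verified in a three-state toy model (the numbers are arbitrary illustrations): for any approximate posterior q, the variational free energy equals Shannon surprise plus KL(q || exact posterior), so it upper-bounds Shannon surprise, with equality exactly when q is the exact posterior.

```python
import math

p_z = [0.5, 0.3, 0.2]     # prior over a discrete latent z
p_y_z = [0.9, 0.4, 0.1]   # likelihood p(y | z) of one fixed observation y

p_joint = [pz * py for pz, py in zip(p_z, p_y_z)]
p_y = sum(p_joint)                        # evidence p(y)
posterior = [pj / p_y for pj in p_joint]  # exact posterior p(z | y)

def free_energy(q):
    """F(q) = E_q[log q(z) - log p(z, y)]."""
    return sum(qi * (math.log(qi) - math.log(pj))
               for qi, pj in zip(q, p_joint))

shannon = -math.log(p_y)                  # Shannon surprise of y
f_exact = free_energy(posterior)          # equals Shannon surprise
f_approx = free_energy([1/3, 1/3, 1/3])   # any other q gives a larger F
```

The gap `f_approx - shannon` is exactly KL(q || posterior), i.e. how good the approximation is, matching the talk's description.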
If the world is not truly a Gaussian but some other unstated or unstable distribution, and we choose to do variational approximate Bayesian inference with a Gaussian prior, then free energy tracks us doing strictly as well as we could, given our choice of prior family. And so it operationalizes, with approximate Bayesian computational techniques, these formal definitions, which may or may not be tractable in large state spaces or with large amounts of data. So it's a super important point, and it's almost like there's that iceberg of surprise definitions, and then free energy ports us into the approximate, which enables some tractability in settings where an exact Bayesian approach is not plausible. Okay, two things. First, it's even more than that. No matter what the world is, I may assume that the world is, let's say, a Gaussian distribution but that there are some dependencies, and even though I'm aware of these dependencies, I can still assume in my inference that there are none. So when I think about the world, I may believe there truly are dependencies, yet still do approximate inference. So when we do variational inference, we may not necessarily care what the true distribution is: we assume the distribution has this form, yet we want to approximate it with something simpler. That's the first part. The second part: all the definitions of surprise that I talked about can be applied in this approximate setting as well. I can take the approximate belief that I found with variational inference, plug it into any of the definitions I had, and come up with definitions of surprise. What I want to emphasize is that minimized variational free energy is one particular one of these, but all of them are still applicable in the approximate variational setting. Thank you for the correction.
You mentioned variational free energy, and there is expected free energy about the future, but people have also proposed variants like the free energy of the expected future. And so it speaks to categories of functionals that do different computations; there isn't just one specific functional form. They might highlight different aspects, as in your plot of surprise through time; let's just say it was some stimulus. And I think that leads to the next question: if we're interested in some biological phenomenon, how does this work help shape our experimental design, statistical power analysis, and discussion sections? How do we translate this work, for the thousands of papers surely using surprise in a qualitative or quantitative way, to increase the rigor and interoperability of those studies? That's a very good question, and that's somehow the ideal way I would want our paper to be used: it provides a framework in which we can think very formally. If I want to design an experiment to distinguish two different notions, what are the necessary features my generative model needs to have, and how can I realize that generative model in an experimental paradigm? What we do, for example, in the preprint, not the main paper, but the preprint that is online on bioRxiv with the same set of authors, is to try to find such situations. For example, an election scenario: each week before the election, the news media tell you that a given party is going to win with a given probability, and then you see the results. We make some assumptions about the features of these experiments, and what we see is that, actually, different definitions make very different predictions. So I cannot give a systematic approach, like an algorithm where, if you want to design an experiment, you just go through these steps one by one.
But I do want to say that this framework gives us the tools to think, before doing the experiment, about how to design it to have the maximal effect. In many of these indistinguishability conditions, I say, for example, that if some distribution is flat, two definitions are indistinguishable. The thing is, if that distribution is not flat, the two are distinguishable, but the effects are very small if it is close to flat. So if I want two definitions of surprise to be distinguishable in an experiment, it's best to look at that condition and build the opposite of it into my experiment; in that way, I make the effect size as big as possible, in some sense. It's a very non-rigorous way of talking about the theory, but intuitively that's how we designed those experiments and could convince our experimental colleagues to run them, which we are doing at the moment. Awesome. Yeah, there are a lot of ways we could imagine pre-registration of studies and the development of templates and patterns for statistical analyses, so that different kinds of surprise could be highlighted. And then we could say: if the change point is going to be like this, then you will be able to distinguish, with this power, these kinds of outcomes versus others. Ali, do you want to question? Yeah, actually this discussion reminds me of Terrence Deacon's famous paper, What Is Missing from Theories of Information? Because in that paper he also claims that one of the misunderstandings about theories of information, or all the relevant theories for that matter, is the assumption that in order for any content of information to have any real-world consequences, it must have some substantial properties and must correspond to some real-world situation, something present as an extrinsic entity in some form or another.
But as he shows in that paper, we can regard information, whether as a kind of probability distribution or as any other kind of theoretical construct, as conveying exactly the kind of information we wanted to explore in a very real sense; it's not just pure data or pure abstraction. So for instance, Shannon's analysis of information is of course based on the constraint on the entropy, but on the other hand, the capacity to convey that information depends on its relation to something that is specifically absent or not produced. So this absence or presence of the entities that the information is about, the so-called aboutness of the information, is... I heard "the aboutness of the information", but we'll pause and see if he rejoins; otherwise we can continue. Okay, I'll just kind of... Do you want to provide a thought on Shannon and the aboutness? That's actually a very important point that Ali was about to make. In all the definitions that I proposed, we didn't really care about what y is or what x is; we were working in a very abstract state space, a very abstract framework. One may ask: what if I don't care about these observations at all? I may still have some predictions. So this aboutness you were talking about is quite important: no matter how confident I am or what my prediction is, how much I care about that problem matters a lot for how surprised I am. If I don't care about it at all, I may not feel surprised at all. So that's something that's really missing from the whole framework: how we can model the things that have, let's say, semantics, because at the moment the whole framework is semantics-free. How can we include some semantics in this framework? Then we would have another dimension along which to talk about surprise. Yes. And some hope that with progressive syntax schemes, we can partition that semantics into other surprise definitions.
But the bigger question is: what are we surprised about? And I think that connects to these other cognitive phenomena, or word complexes, like attention, confidence, salience. And just to put a point on that: when surprise is being used as an imperative for optimization and inference, or merely as a signal that plays into action selection, your robot or ant or person or software is going to do different things depending on the surprise formalism used. And knowing the many flavors of formalism, it would be awesome if we could have a software package that computes 18 different numbers. But then what would we do? We could choose one. We could blend them. We could make summary statistics. We could say that we're in zone A where these five are aligned, but this one isn't. And then what would we do with that action? How surprised would we be by that location in phase space? So it's a very interesting top-down, situation-independent phrasing of those relationships, but in the implementation it's always going to come down to what the modeler chooses; there just isn't a general answer to be found in the specifics. One other thought or area to explore: you mentioned the Bayes factor and its role in learning. Could you speak to optimal Bayesian inference and optimal learning? What do you mean, sorry? Like Bayes-optimal inference? Okay. What I meant is that the Bayes Factor surprise appears in Bayes-optimal learning and also appears in the variational Bayesian update rule. And that was actually what we found cool about it, because neuroscientists, particularly people talking about synaptic plasticity, have talked about surprise and how neuromodulators influence synaptic plasticity. And we saw, looking at the Bayesian framework, that such a normative modulation based on surprise exists in how the belief is updated. And it is funny that the same thing appears in lots of approximate inference schemes as well.
And it seems necessary to use such surprise measures to modulate learning, and there is lots of experimental evidence for it. So what I want to say is not only about exact Bayesian inference: it is something that appeared there and is also useful for other kinds of approximations to exact inference. So what does optimal learning mean in this case? Optimal learning means finding the exact posterior over the parameters given the model, because if I believe the model is true, all the information I can have about the hidden variable is in the posterior distribution of that variable given the observations. And basically what we mean by learning is updating this posterior distribution from one time to the next: as we go on and add more and more data points, how we track this distribution over time and how we update our belief. By optimal learning, we mean the exact Bayesian update from time t to t plus one. But this could also be done with variational inference, and the cool thing was that variational inference would also benefit from such a modulation. So, one thought on that, in a somewhat meta-science area: Bayesian statistics is becoming more prevalent, and one way of reporting the effect of a treatment would be with a Bayes factor, presenting the relative evidence for one model versus another. Something that plays a functionally similar role, with of course a technical difference, is the p-value. And so during the presentation, I thought about how the Bayes factor is associated with learning, since we can represent the evidence for two different possibilities as a ratio; whereas a p-value, especially when we put a discontinuity at point oh five, is like a change point. It's like we're sampling from the experimental outcomes: eggs are healthy, eggs are healthy, eggs are healthy, and now there's a difference between the groups.
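The exact Bayesian update from t to t plus one described above is simplest to see in a conjugate example (an illustrative sketch, not from the paper): a Beta belief over a Bernoulli parameter, where each observation updates the posterior in closed form.

```python
# Beta(a, b) belief over a Bernoulli parameter; Beta(1, 1) is flat.
a, b = 1.0, 1.0
for y in [1, 1, 0, 1]:
    a, b = a + y, b + (1 - y)   # exact posterior after each observation

posterior_mean = a / (a + b)    # 4/6 after three 1s and one 0
```

This is "optimal learning" in the talk's sense: tracking the exact posterior as data arrive; surprise-modulated schemes matter when the environment can change and this stationary update becomes too rigid.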
So that puts the task of science and learning into a change point detection framework, where, oh, now we're in the world where this is the case; versus the sort of unrolling Bayesian approach, where we get experimental results spread all over the posterior, but merely drawing results from a potentially quite dispersed posterior doesn't mean we've hit any change point with respect to the efficacy of a given treatment. Trying to connect what you said about science in general to what I was thinking: just as a point, we really took the term Bayes factor from the field of statistics, because of the hypothesis-testing nature of this kind of change point detection. At each point you want to see: okay, was there really a change in the environment or not? For example, with weather forecasts during the pandemic, forecasts really did get less accurate for a couple of weeks, and I experienced this before knowing it was a real phenomenon. I was confused: what's going on, why is my MeteoSwiss not working well anymore? So it's always some sort of hypothesis testing: am I living in the same world as yesterday, or did something change since yesterday? And in the Bayesian setting, that is done by comparing two models through a Bayes factor. We can also, even in the Bayesian setting, if we go toward approximate Bayesian inference and sampling, like particle filtering, sample this change point. So at some point we make a decision: this is the probability that there was a change or not, and at some point I should decide. I would say, yes, there was a change, and from today onward I behave as if something changed today.
And in a frequentist approach, it's a bit difficult for me to include things like the p-value in the framework I was talking about, because the nature of the p-value is talking about the probability of things more extreme than what already happened. In this Bayesian setting, we are talking about this particular observation; I don't care about values higher than the observation, lower than the observation, different from it. I care only about this observation. With a p-value, we compute the probability of observations more extreme than this one, and I don't see how that can easily fit into the framework. But if I had to find the closest measure, the one most similar to a p-value, I would say it's something like Shannon surprise, because it doesn't have the nature of a comparison. It asks: under the current belief, let's say the null hypothesis, the current hypothesis, how unlikely is the new observation? Yes, one could certainly apply a change point: Bayes factor over three means we publish the paper, Bayes factor under three is no evidence. And there's definitely some subtlety there in what the p-statistic and other empirically calculated statistics tell us. I'll read one question from the chat then, Ali. So Dave writes: Have folks working on these problems modeled the situation where counter-evidence has no impact on confidence? Delusion and cognitive dissonance are terms that come into that discussion. I'm sorry, can you read it again? I opened the page to read it myself as well, but then the voice got a bit... All good. Have folks working on these problems modeled the situation where counter-evidence has no impact on confidence? Delusion and cognitive dissonance are terms that come into that discussion. That counter-evidence doesn't have an effect on confidence? Yes, potentially where surprising beliefs are continually entertained. How could we model a situation where persistent cognitive dissonance or delusive beliefs are stabilized? Okay.
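The contrast between Shannon surprise and a p-value can be shown in a few lines, a minimal numeric sketch with a made-up belief distribution over discrete weather outcomes (the distribution and outcome names are purely illustrative):

```python
import math

# Hypothetical current belief over tomorrow's weather.
belief = {"sunny": 0.7, "cloudy": 0.25, "snow": 0.05}

def shannon_surprise(obs, p):
    """Shannon surprise: -log probability of exactly this observation.
    No tail, no 'more extreme than'; only the observed outcome matters."""
    return -math.log(p[obs])

def tail_probability(obs, p):
    """A p-value-style quantity: total probability of outcomes that are
    as likely or less likely than the one observed."""
    return sum(q for q in p.values() if q <= p[obs])

print(shannon_surprise("snow", belief))   # high: snow is unlikely under belief
print(shannon_surprise("sunny", belief))  # low: sunny was expected
print(tail_probability("snow", belief))   # sums over the 'extreme' tail
```

The first function conditions only on the single observation, which is why Shannon surprise fits the probabilistic framework, while the second aggregates over unobserved, "more extreme" outcomes, which is exactly the part that does not fit.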
When I talk about hypothesis testing in a Bayesian setting, the fun thing is that all hypotheses live at the same time. We just change the weights over them, and some of these hypotheses are in disagreement with each other. So these counterfactual assumptions, what if there was a change a while ago, or what if... Oh, wait. No, the previous observations are fixed; I have this imagination only over the unobserved states, and then I talk about different scenarios over those unobserved states. What I think the question was about: if I want to think about counterfactuals, then I should imagine how the world would be if the observation had been different. And that, at the moment, is not included in the framework. But one could think about it if we integrated the framework with a bit of causal inference. In those settings, we have counterfactuals: now I don't only condition on the past, I imagine that some of the observations were different, and then I do my inference. But I'm not sure if I got the question and if my answer was relevant. I hope it is related. Awesome. And in response, Ian wrote: cognitive dissonance, I think, has something to do with having different beliefs at different levels of consciousness. So yes, the taxonomy was presented as a single-level model, and with nested models we might have different interactions among surprise definitions, maybe even different surprise definitions playing out within a nested Bayesian architecture. Maybe one type of surprise is more amenable to edge detection in the retina, and then in a different part of the generative model there's another operationalization of surprise, in decision making or in a risk-averse setting. Of course, it's definitely the case that different levels of processing may work with different kinds of surprise. And maybe, I don't know, belief-mismatch surprise definitions are not relevant for primary cortical areas at all.
But if that's the case, I understand. But maybe I misunderstood what they mean by cognitive dissonance. Ali, any more thoughts or questions? Maybe we can each have one more question or contribution. Yeah, actually, first of all, I apologize; I got disconnected in the middle of my question. I don't know how much... It was fine. All right. So one thing about these kinds of nested hierarchies is that in the interactivist framework, they usually define different kinds of information as different levels of emergence in a kind of nested hierarchy. At the base of it all is Shannon information, which deals with the medium's capacity for conveying information. Then at another emergent level is referential information, which is about aboutness. And at the highest emergent level we have significant information, or as they like to call it, the usefulness or semantic content of the information. So my question is: this taxonomic approach that you provided in the paper, how can we map it, or can we map it at all, onto these kinds of nested hierarchies of different information? Because we could probably even say that the natures of these different kinds of information are different, right? My short answer would be the table that I showed. If we want to account for the things you discussed, we should make a few of those tables; we should add another dimension. So I don't think it's really possible to map these particular definitions onto those different categories. And for the long answer, I'll give another short answer, which is basically that we would need to build a new framework to really include all these semantics; the current probabilistic framework, the one I presented today, doesn't have the capacity to account for that.
That's why I mentioned that manifold idea in my first question, because I was thinking of something similar: maybe in order to generalize this framework, we would need many more dimensions than just the two already provided in the paper. So I don't know whether, as you say, it would be useful for future research, or whether it would yield any kind of tractable computational framework. But at least in my opinion, from a purely theoretical point of view, maybe it's worth investigating how these different frameworks could be integrated into a much more general way of thinking about information and surprise. But as I said, your paper is a very important and significant contribution in that regard. Oh, in near closing: the taxonomy notion is a classification scheme. Just as in biology, early natural-history observations led to taxonomies and the classification of different things, and on through phylogenetics and an understanding of temporal relationships. So today we have a taxonomy for surprise that we didn't have a year ago. What can be contributed by your group and by others in this emerging ecology of formalisms? If we have the taxonomy today, just a napkin sketch, a two-by-two grid, we've sorted the birds that people saw into roughly these categories. We didn't even know it was this bird, but now we can disambiguate: it was that bird. So where will you be heading in this space, and where do you think there are areas for people to contribute to this living area? That's a very good point, a very good question. There was this slide where I had the different broad domains, learning, exploration, memory, and physiological signals, in which people talk about surprise in neuroscience and psychology, with the different definitions in the middle.
I think if we want to go beyond the taxonomy, we should do more computational modeling and see which of these definitions, which parts of these two dimensions, are related to which of these functions. So basically, for now I'm just proposing some tools and a framework that can be used later on for computational modeling. These definitions may or may not be useful for some modeler trying to explain exploration in one task, learning in another task, or segmentation in some task. So the next step is really finding out where, and for what function, each of these definitions makes sense. Yes, you brought up novelty, curiosity, and learning, and surprise is woven, sometimes technically and sometimes not, into those discussions. But not all definitions of surprise match all definitions of learning, salience, attention, and novelty. So disambiguating that space is very important. Ali, any final remarks? I would like to thank you again for inviting me. Thanks a lot for all the discussion and the nice questions. Oh, awesome. It's been a great honor and learning experience for us. You're always welcome back. Till next time. I appreciate it. Thanks. Thank you. Have a nice day. Bye. Bye.