is that basically people can hold seemingly contradictory beliefs when you change the utility function, right? And the way to explain that is that, because you have model uncertainty or ambiguity, you tend to want, say, robust beliefs. So you tend towards worst-case scenarios, or, if you're a gambler, the opposite: you tend to the positive side.

So here is just a very brief example of how you can use exactly the same idea about ambiguity in order to model cooperation. On the left you see the payoff matrix of a famous cooperation game, the stag hunt game, which I think goes back to Rousseau. The idea is that there are two people and they have to decide whether to hunt hare or stag. Now, if you hunt the hare, you get a little reward, but the good thing is you can hunt it by yourself; you don't have to rely on the other person. If you hunt the stag, you get a big reward, but only if both hunt the stag, right? So if the other person decides not to cooperate with you, then you go to bed hungry, which is very bad. So there are basically two solutions to this game. One is the risk-dominant equilibrium, which would be to go for the hare, because then there's no risk involved; you know what you get. The other is the stag, which is payoff-dominant, because you get a lot, but it's a bit risky.

And so what we did was design a sensorimotor game that basically allows you to translate these payoff matrices from classical game theory into a continuous sensorimotor decision. The way it works is that you have to move from a start to a target, and you can touch the target anywhere, right? But the position that you choose basically tells you how much stag and how much hare you're doing, and the same for the other player. Now, in this case, the other player was a computer player, because we had to repeat this many times. And this thing could change, right? But what you would feel is the force here, and the force tells you the payoff. And if the other player changes their position, then you feel a change in the force, because that's like them changing their choice, and the same for you. So basically you have a dynamically coupled system. We designed this previously for the prisoner's dilemma, where we actually had two people playing against each other, and we saw that people converge to Nash equilibria and so on without actually knowing the kind of game they're playing; they were just feeling the forces and trying to adapt to that somehow. And here we chose to do this with one player, because we needed to have full control of the other player, so we chose it to be a computer player.

And what did we do? When you make this decision, whether to cooperate or not, you have a first prior belief: how likely is it that the other person is going to cooperate with me or not? And you can think about that as if the other player were an urn with an unknown bias. So cooperation is on one side, no cooperation is on the other, and maybe they cooperate some of the time. So it's like a probability to cooperate, and you don't know what that probability is; the other player becomes, for you, the urn. And now you see data: what did the other player do in the past, just like in the example yesterday? And then you update this belief: what is the probability that this player is going to cooperate?
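As a minimal sketch of this urn-style belief update (my own illustration; the talk's urn analogy suggests a Beta-Bernoulli model but does not spell one out), the ambiguity attitude can be encoded in the initial pseudo-counts, and it washes out as observations accumulate:

```python
# Hypothetical sketch: belief about the opponent's cooperation probability,
# modeled as a Beta distribution over the bias of the "urn". Optimists start
# with pseudo-counts favoring cooperation, pessimists with the opposite.
# The model choice and the numbers are my own illustration.

def posterior_mean(prior_coop, prior_defect, observations):
    """Beta-Bernoulli update; observations: 1 = cooperated, 0 = defected."""
    coop = prior_coop + sum(observations)
    defect = prior_defect + len(observations) - sum(observations)
    return coop / (coop + defect)  # posterior mean P(cooperate)

history = [1, 0] * 5  # the computer player cooperates 50% of the time

for label, (a0, b0) in {"optimist": (4.0, 1.0), "pessimist": (1.0, 4.0)}.items():
    early = posterior_mean(a0, b0, history[:2])    # time point 3 in the talk
    late = posterior_mean(a0, b0, history[:10])    # time point 11
    print(f"{label}: after 2 obs {early:.2f}, after 10 obs {late:.2f}")
# Both attitudes drift toward 0.5: the initial attitude dominates early on
# and is washed out as the data accumulates.
```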
Now, if you are pessimistic, or say ambiguity-averse, then you're going to assume the worst: that the other player is not going to cooperate. If you're optimistic, or you like the ambiguity, then you're going to think: okay, I don't know this other player, but life is good, the world is my friend, this other player is part of the world, so it's also my friend; they're going to cooperate. And then, over time, this ambiguity attitude is washed out as more and more data comes in, like we saw yesterday: the more data you have, the more the ambiguity is washed out. But of course, this initial attitude towards ambiguity has a big effect on which equilibrium you're going to converge to. If both of you are optimistic, you're very likely going to end up in the cooperative scenario; if both of you are ambiguity-averse, you're very likely going to end up in the non-cooperative scenario.

And so what we looked at here, exactly like in the example yesterday, were trials at two time points, time point three and time point eleven. At time point three, we looked at trials where the computer player had cooperated once and not cooperated once, so a 50-50 chance. At the second time point, eleven, it had cooperated five times and not cooperated five times, so also 50-50, but in the latter case you know more, just like in the case of the urn. And then we looked at the probability of the human players, these were our six human players, to cooperate. And what you see is that even though it was 50-50 in both cases, so if you just care about the mean, this shouldn't make any difference; if you were doing fictitious play, for example, where you just care about the mean, you would predict no difference here. But of course we see a difference between three and eleven. In the beginning, most subjects were actually being cooperative in this scenario, even though it was 50-50, thinking: okay, maybe the other guy made a mistake, let's try. But after ten trials this wasn't so much the case anymore; subjects then mostly converged to 50-50 cooperation, okay?

Yeah? The same subject played this many, many times, right? But do they learn what they play? Well, that's why I'm saying we played these subjects against a computer player: if we had two human subjects, you can't tell them what to do; you'd just be able to observe this once, maybe, and then that's it. With the computer player, you can basically say: okay, now we put in computer players with different ambiguity attitudes, which means they would be more or less cooperative, in the beginning at least. And we told our subjects that they would play against this computer player and that they have to decide what to do. I mean, they were just playing this game; they didn't even know what it means to cooperate or not cooperate. They were just trying to avoid large forces, essentially. But wouldn't it be that they were still playing against the same computer? Oh, no. We told them that, yes, but we told them that the player is changing, in blocks. So they were told that. So they went back to their prior. Yeah, yeah. So they knew that it's not the same player all the way through.

So we told them after every block that there was going to be a new computer player, so to say. Yeah. Yes, so it's dynamic. It's exactly like this urn model, essentially. So basically, if we go up here: the value, the V(θ) that you see there, depends on your utility and on your Bayesian belief about what happened so far, and that's dynamic; every time you see an interaction, this belief changes. Okay, so... Sorry, just one more question. Yeah? Yes. Yes. Do you mean... Yes, that's what you would predict, actually, if you were ambiguity-averse. In the beginning I'm suspicious of my environment: I see you for the first time and I think, okay, this guy looks dodgy, let's avoid him and hunt the hare. Of course, I'm kidding. And then after a while I realize: okay, you cooperate 50% of the time, and then I'll do the same. So you can change the result. Yeah, the more data you have, the less ambiguity you have; the ambiguity is washed out with the data, so to say.

Okay, then here is just a short example of how you can put these two things together: the constraint on the policy, meaning that you are restricted in the actions that you can pick, and the information constraint in the belief space, which means that you have ambiguity. What you see at the top is the definition of what you want to optimize in a Markov decision problem. If you don't know what a Markov decision problem is, just imagine the example I'm going to explain: you have to navigate a maze from a start to an end, in every cell you have to decide whether to go forward or backward, and the environment has, for example, walls, or pits you can fall into, and so on; you have to make this sequence of decisions. What you typically want to do is maximize your expected reward, so without these log terms. And what we do is basically add these two log terms: one is for ambiguity, and the other is there because you have limited action capabilities. That allows you to compute a value function, which is basically a free energy; again, it's one free energy involving the two variables that we have uncertainty about. So now we have an alpha and a beta, right?

And the question is, what effect does this have? Let's just look at an example that gives you the intuition: what kind of plan do you come up with? Let's look at the second line. You have to go to the goal state, which is over here, and you start here, I think. Now, there are these question marks in the environment. You know everything else, but you don't know what happens at these question marks; they're like an unknown jungle or something like that. You don't know what's in there, so you could be positive or negative about it. On the left side, alpha is small and beta is large. Beta large means you believe your environment is friendly, so you will have a positive attitude towards the question marks. Alpha small means that you cannot compute your actions very precisely. The shortest way for you would be to go either this way or that way, but that's where the pit is, and if you cannot control your actions precisely, because you're limited, then it's better to go the long way around. And that's exactly what you do there.

You go the long way around, and because you're friendly towards the ambiguity, you take the long way that passes the question marks. So you're an adventurer, basically. In the second column, you still have imprecise actions, but you are averse to ambiguity; you're scared. What you do is, again, take the long way around, because you cannot choose your actions precisely, but you choose the route that does not contain the question marks. And then here, you have a higher alpha, meaning you can control your actions precisely, but you believe the world is a bad place: you take the shortcut, but the one that doesn't contain question marks. And if you can control your actions precisely and you believe the world is a great place, you take the shortcut that includes the question marks. So you see how these two things interact, in all the combinations.

So now I want to start the next discussion: how I think that limited resources lead to the emergence of abstractions. The basic idea is very simple. Imagine you had unlimited resources. In that case, for each context in your environment, you would be able to compute an optimal policy; you would have basically infinitely many optimal policies, and they would be more or less independent of each other. Now imagine that you have limited resources. Then you may be forced to apply the same policy, or at least similar policies, to different scenarios. And that means that you have to abstract, because you're behaving as if different things were the same: you have to ignore something and behave the same way towards them. That is abstraction. That's the basic idea.

So how do we formalize this? Here is the trade-off between utility and information. What's new is that this A would be the action again, and this W is the world state. And now we have a scenario where, if you just look at what is inside this bracket, for each world state you're looking for the best policy in that world state. But this here would be a prior over actions that does not depend on the world state. So if you have infinite resources, this prior would be irrelevant and you would just pick the best action for each world state. But if you don't have infinite resources, then this prior becomes important, and we can ask: what would be the optimal prior? That's what I'm doing in blue over there: I'm asking, if I average over all the environments, what's the optimal prior? What does that mean intuitively? Imagine that I'm the prior and I need to be updated to the posterior, and there are different possibilities, different world states. If this world state happens, I have to walk over there in information space; if that world state happens, I have to walk over here, and so on. So which is the best prior? Intuitively, the one that is sort of in the middle between all of them; it wouldn't be good if my posteriors were here and my prior way over there. And if you do that, you can actually show that the optimal prior is the marginal of the joint distribution over A and W, and then we can rewrite this equivalently like this: if we choose the marginal here, the Kullback-Leibler divergence becomes equal to the mutual information. I also said that yesterday: you can think about the mutual information as a special kind of Kullback-Leibler divergence.
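Written out (my reconstruction in the notation above, with β as the resource parameter), the trade-off and its self-consistent solution are:

\[
\max_{p(a\mid w)} \sum_{w} p(w)\left[\sum_{a} p(a\mid w)\,U(w,a) \;-\; \frac{1}{\beta}\,D_{\mathrm{KL}}\big(p(a\mid w)\,\|\,p(a)\big)\right],
\]
\[
p^{*}(a\mid w) \;\propto\; p^{*}(a)\,e^{\beta U(w,a)}, \qquad p^{*}(a) \;=\; \sum_{w} p(w)\,p^{*}(a\mid w),
\]

and with the marginal as the prior, the averaged Kullback-Leibler term is exactly the mutual information \(I(W;A)\).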
And this is actually equivalent to the rate-distortion problem that we were discussing yesterday, where you trade off information and utility. Okay, the solution that's depicted here is the self-consistent solution; that is because we're not only looking for the best posterior but also for the best prior. And self-consistent means that this is not actually a closed-form solution: you need to iterate these equations to find the solution, and this is what you do in information theory with the so-called Blahut-Arimoto algorithm. You initialize the prior somehow, and then you run these update equations: what would the posterior be for this prior, and then what would the prior be for that posterior, and so on. You keep running until they're consistent with each other; that's why it's called a self-consistent equation. So basically, while you're running this, the information flows both ways: how to update the posterior, and how to update the prior.

So now let's look at a simple example of how this can model abstraction. Let's assume that we have different things in the world, different cats and dogs and trees and flowers, and that we play the following game: I show you one of these items, you have to tell me what it is, and I give you a reward if you get it right. And let's assume, just for the sake of argument, that you get three euros if you say exactly what it is: okay, this is, whatever, a Persian cat. If you just recognize it's a cat, you get 2.20, and if you at least recognize it's an animal and not a plant, you get 1.60. I mean, the numbers obviously don't matter. And now what do you see here? If information is fairly cheap and abundant, or you have lots of resources, the best answer is, for each item I show you, to tell me exactly what it is. The utility you get is maximal, the three euros, and you need 3.7 bits to do that in this case; if you have more available, it doesn't improve anything, that's all you need. Now let's say information becomes more expensive, or you're less capable. Then you basically jump, and your answer is going to be on the intermediate level: you just say it's a cat or a dog and so on. If you were to choose differently, you would lose money, so to say. I mean, you could choose to spend this information differently: you could choose to always recognize the dachshund, say, but then you cannot tell apart trees and flowers anymore, or something like that. Depending on your environment, that could be a sensible choice, but in this environment it wouldn't make sense. Make the information more expensive still, and you can do barely any information processing, you see, it's just 0.2 bits here: you only distinguish animals and plants. And if you have no information at all anymore, you just say it's a plant. Why would you do that? Because, sneakily, there was one more plant than animal here, so it makes sense to just guess that if you can't do anything else. So you see, as you change the rationality parameter, you get this sort of phase transition in your response.

So this is an ongoing experiment. I can show you one subject, and we've recorded a few more, but this is just to give you an idea of what we're trying to do, I guess.
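Before the experiment, here is a minimal sketch of the Blahut-Arimoto-style iteration described a moment ago (my own toy version; the utilities loosely echo the taxonomy game, and all numbers are illustrative):

```python
import numpy as np

# Minimal Blahut-Arimoto-style iteration for the utility-information trade-off.
# Toy utility loosely echoing the taxonomy game (numbers illustrative):
# world states: persian cat, siamese cat, oak, rose
# actions: name the exact species (3.00) or just say "animal"/"plant" (1.60)

U = np.zeros((4, 6))
U[np.arange(4), np.arange(4)] = 3.00   # exact identification
U[0:2, 4] = 1.60                       # "animal" is right for both cats
U[2:4, 5] = 1.60                       # "plant" is right for oak and rose
p_w = np.ones(4) / 4                   # uniform distribution over world states

def blahut_arimoto(beta, iters=300):
    p_a = np.ones(6) / 6               # initialize the prior over actions
    for _ in range(iters):
        # posterior for this prior: p(a|w) proportional to p(a) exp(beta * U(w,a))
        p_a_w = p_a * np.exp(beta * U)
        p_a_w /= p_a_w.sum(axis=1, keepdims=True)
        # prior for this posterior: the marginal p(a) = sum_w p(w) p(a|w)
        p_a = p_w @ p_a_w
    return p_a_w

for beta in (0.3, 5.0):
    print(f"beta={beta}:\n", np.round(blahut_arimoto(beta), 2))
# High beta: each state gets its exact answer. Low beta: the policy barely
# depends on the state anymore and concentrates on the coarse answers.
```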
So what we're doing is an identification task. You see here different ellipses, so to say, that go from a line to a circle: I show you an ellipse and you have to tell me where it is on that continuum. And you have an action space that looks like this. If I show you this ellipse, you would have to hit this button; if you hit this one or that one, it would be wrong. Or you could say: let's take less precision, let me choose an action on this row. Then you would just say: okay, it's this one; I don't know whether it's exactly this or that, but I'm sure it's here. Or, even coarser, you decide to go here. So not only do you need to decide which ellipse you see, you also need to decide, and that's important, the level of abstraction that you want to choose. You say: okay, this is sort of a round thing, so this level is maybe round, straight, and something intermediate; that would be only three distinctions you make in the world. So this would be the utility function: of course you get the most reward if you say precisely what it is, and less if you cannot say it so precisely, just like in the example with the dogs. I guess you can see this better here, because these were videos. Yeah, some noise, yes.

So what we did is manipulate two things, the perceptual noise and the action noise, so to say. The perceptual noise was this: there were randomly moving dots, and on the circle there were some dots that were moving coherently, and by reducing the percentage of coherently moving dots the perception becomes harder. And the other thing is that we manipulated the reaction time within which you had to decide: the video length was always the same, but in one case you had 5 seconds and in the other 400 milliseconds to make the choice.

So I'll show you some preliminary data here; as I said, this is more to explain the idea of what we're trying to do. These would be the six conditions: easy-slow means easy perception and slow reaction time, that's the easiest condition, and hard-fast is the worst. And what you see is that initially, when it's easy, you go to the higher levels, and as it becomes harder, you go back to the lower levels. Now, if you model this with the rate-distortion model, you get these kinds of predictions, in principle the same: you move from the top to the bottom. If you look more closely, there are some effects that this model obviously cannot explain, the most prominent being that, if you look to the left, at the borders you go to the top level but in the middle you stay in the middle level, and in the rate-distortion model you don't see that. At the borders the perception is easier; it's easier to recognize these extreme values, also because there are no neighbors on one side that could confuse you. This is not modeled here, because in this rate-distortion model there are no neighborhood relationships; everything is discrete. In fact, this model doesn't care that these are all ellipses; they could all be different objects, basically. So that's clearly one limitation.

But the question we can still ask is how optimally you make your choice, just like we asked yesterday in the study with the reaching. We've recorded ten subjects; this would be a representative subject. You see this rate-distortion curve again, the best utility that you can achieve, and you see that the harder it becomes, the less information you have and the less utility you achieve. But you also notice that we're lying close to the curve, not on the curve, so the subjects are not really bounded-optimal. And the thing we're trying to figure out at the minute is what the reasons are, what extra constraints the subjects might have that explain this sub-optimality, I mean sub-optimal relative to bounded-optimal. And here is a plot showing which level you should choose optimally, something like a rate-distortion curve for which precision level to choose. You see, for example, that this subject is conservative, in the sense that, considering the discrimination that they can do, they should actually choose a higher level, but they don't; they stay back. If you have points in between, that means that you mix: sometimes you go to this level and sometimes to the level below. But you see that they're clearly below, so that's something we're trying to figure out: what explains these inefficiencies.

Okay, so then I would start with the last part, about learning, and that has two parts. The first part is on structural learning, which is again learning of abstractions; it's partly earlier work from when I was a PhD student, I was always interested in abstractions and the question of how they are learned. The second part is how to model learning with bounded rational decision makers using parametric models, and there we have at least one example where we go back to the learning of abstractions, but it's more general; we also looked at learning with neural nets and things like that. So I don't know how much we can squeeze in.

Okay, so the question I was thinking about as a PhD student was: how can you learn when you have variable environments? The example is: you have different bikes, and you learn to ride them. Do you learn something different each time, or do you extract the commonality? Typically, when we study learning, we expose people to a particular problem, observe how they get better over time, record the learning curve and make analyses of that. But I was interested in what happens if I change the learning problem all the time: do you always keep learning from zero, or do you learn something more abstract in the end, so that you improve over time even though the learning problem changes all the time? Obviously this can only happen if the learning problems share some commonality. So that was the question I was interested in.

And the idea is this: imagine you have these different bikes, and imagine that in the brain there are different dials, say synaptic weights or something like that, that you can adapt to solve the problems. And imagine, just for the sake of argument, that there are only two such dials that you turn to try to adapt from one bike to the other. Then you would search essentially through this two-dimensional space, point by point, to find the new solution. But if you know that all the bikes actually share a commonality, then you might figure out that they lie on a subspace, that the two parameters you need to adapt are not independent. And if you knew that subspace, then you could just whiz along it, and the exploration would be faster, because you would essentially dismiss immediately lots of possibilities that are not promising. So the idea is that, if I am exposed to a varying environment that has structural invariants, I learn these invariants and become faster at adapting to new problems that belong to the same subspace.

The experiments we did were again in this virtual-reality setup; actually, we used the same system for many of the experiments I already showed you. You have this manipulandum, and via this mirror you have a projection from a screen; you can't see your hand, and you can create forces, which means you can create virtual objects with virtual dynamics, so to say. So here is one simple experiment that we did, a motor experiment. You have a reaching task: the subject doesn't see the hand, just a cursor, and the task is to move the cursor from a start to a target. Now we can dissociate the hand and the cursor movement, or make the relationship a bit more complicated. One thing we can do, for example, is introduce what's called a visuomotor rotation: you move your hand straight up, but the cursor goes over there. You see the cursor going there, you don't see your hand, and you correct the movement, so your hand now goes the other way. Over time, as you learn, the cursor movement becomes straighter again; you learn to move your hand differently, and you don't actually realize that it's a rotation. You just think: when I do this, something strange happens, and you adapt to it. It's a little bit like riding a bike, implicit learning.

So the question is: how can we model this? This was the first experiment we did in this direction. In this experiment, most of the time there were straight movements without perturbation, and interspersed in this sea of, say, normal movements were individual trials, single trials, that had a random rotation between the cursor and the hand movement, and you would have to do a correction, just like it's depicted here. Let's look at the experimental data first. At the very top left, you see that we had four different rotations, actually eight, but plus and minus were collapsed into one: 90, 70, 50 and 30 degree rotations. You see that the subject always starts moving straight ahead, and then there's this correction, and of course the larger the rotation angle, the bigger this correction. You also see a second bump in the speed profile, the same in the angular momentum, and also in the variability pattern.

Now, to model this, we assumed a simple optimal control model with linear dynamics and quadratic costs, so it has an inertia that you want to move; it's the kind of model that many people who have used optimal control have used, I guess. You basically trade off two things: on the one hand you want to get to the target, on the other hand you don't want to spend too much effort. But the problem is, when you do optimal control, you need to know everything, and here we don't know the rotation angle. So what I did was build a model that knew already that there was going to be a rotation, but not the rotation angle, and this rotation angle is estimated on the fly from the data that comes in. So I have this control loop where I give motor commands meant to take me to the target, I get sensory feedback, and I realize: oh, I'm not actually going where I want to go.
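A minimal sketch of this estimate-on-the-fly idea (my own simplification; the actual model is an optimal control formulation, and the smoothing gain here is just an illustrative stand-in for sensory noise and delay):

```python
import numpy as np

# Hypothetical sketch: the controller "knows" a visuomotor rotation may be
# present but must estimate the angle phi online from the mismatch between
# the intended and the observed cursor direction.

rng = np.random.default_rng(0)
true_phi = np.deg2rad(90.0)    # the imposed rotation
phi_hat = 0.0                  # start assuming no rotation
gain = 0.2                     # low gain ~ noise and delay: slow detection

for step in range(20):
    intended = np.deg2rad(90.0)                    # aim straight at the target
    command = intended - phi_hat                   # pre-compensate with the estimate
    observed = command + true_phi + rng.normal(0, 0.05)  # noisy cursor direction
    residual = observed - intended                 # what the estimate missed
    phi_hat += gain * residual                     # online system identification

print("final estimate (deg):", round(np.rad2deg(phi_hat), 1))
# The simulated trajectory starts out straight and only bends late, because
# phi_hat needs several noisy observations before the correction kicks in.
```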
And I have a system identification unit that basically asks: okay, is there a parameter φ, meaning an angle, that would explain my observation? And I adapt that. Depending on the noise, this system could detect the rotation immediately, or, if you have a lot of noise and delay in the system, which is the case for humans, it takes longer to realize. That's why, when we model this, we can also reproduce these kinds of trajectories that start out and go straight for quite a while, until the system realizes something's wrong and starts correcting.

Okay, so you could say: that's nice, this simple model reproduces the data. But what's more interesting is where it doesn't reproduce the data. So we looked at early trials versus late trials. What you see here, as an example, are trajectories for the 90-degree rotation. On the top left, in purple, you see the early trials, and then you see what happens as time goes on; these are averaged trials, I think over blocks of 200 trials: we picked out all the 90-degree rotations and averaged them. This starts quite low, and if you look at the speed profile, the second bump is not there, and the variability is quite high. As time progresses, you seem to converge to this sort of limit trajectory: the speed bump comes out and the variability reduces. You can also see this at the bottom, in the minimum distance to the target and the peak of the second speed bump: it becomes faster and the variability goes down.

So what you're looking at here is the formation of a learning-to-learn process, or learning to adapt, because remember that each single trajectory is an adaptation that you produce; it's some kind of learning process, you have to adapt your policy, and if you didn't, you would not get to the target. You have to change the mapping from observation to action, and subjects are learning to improve this over time. And to me this is the interesting part, the part that cannot be modeled by this approach, because here I already presume that I know there are going to be rotations. The only thing I can say is: if I presume this knowledge, then I can explain the data. But how do you get there? In control theory this is difficult to handle, because control theory usually assumes that you know more or less everything except a few parameters; that's the typical assumption. If you drop that, you come into the realm of reinforcement learning, but control engineers, I guess, are more interested in things like stability and the system staying under control, which makes sense if you think about controlling aeroplanes and chemical power plants and I don't know what; they don't want systems that don't know what they're doing, that are learning too much. That's right.

So here you also see the point I was making about learning to learn: if you run a control group that gets target jumps, so you move, and while you're moving the target jumps over here, you also need to change your movement, but you don't have to learn anything; you can use the same sensorimotor mapping. And that's what you see here: there's no learning effect anywhere, you get stable behavior straight away.

Okay, so the question is: what is this learning to learn, can we learn more about it? What I did in the next study was take three different groups of people, train them on different things, and then expose them to the same new learning problem, and ask whether they adapt to this learning problem differently just because they had been exposed to different statistics before, and maybe learned different structural invariants that put an exploration bias on a new problem.

The first group was a naive group that did just straight movements for 800 trials, and then they were exposed to a block of 60-degree rotations, a block of plus 60. There you see the learning curves, how the movement gets faster and straighter. You also see, in the second curve, the opposite rotation: you get the typical interference effect, and then the re-learning of the first one; these are all effects that are known in the literature. Then we had a second group who did random rotations first, and we saw that this second group was much faster in adapting and also had reduced interference. But of course the question is: well, maybe they just memorized all the rotations; maybe there was no invariant. So we had a third group that experienced random linear transformations; you can think of them as composed of rotations, shearings and scalings, and they can be quite crazy, and in 20% of the trials we exposed them to rotations, namely exactly these rotation transformations, which would drown here in a sea of noise, so to say. So if you were looking for an invariant across all these trials, you wouldn't find one, but if you just memorized everything, you should be able to memorize those 60-degree rotation trials. And what we found was that this random group performed no better than the group that hadn't learned initially, suggesting that, first of all, they didn't memorize, and second, they were not able to extract an invariant from these rather unstructured transformations.

Then we had a similar experiment in 3D, with two groups: one group learned rotations around the horizontal axis and the other around the vertical axis, and then we exposed each group to a 45-degree rotation, either horizontal or vertical. And you see, maybe not so surprisingly, that the horizontal group adapts faster to the horizontal rotation and the vertical group faster to the vertical rotation. But what you also see, if you look at the endpoint spread, where subjects are pointing, is that the horizontal group tends to spread more in the horizontal direction and the vertical group tends to explore in the vertical direction. These are not after-effects, these are active explorations, because we had a wash-out block before; they learned which direction was, most of the time, the relevant one when they were adapting in the training trials.

Then here is a more recent study about the same idea: again we were asking, can you learn structure? You remember this experiment from yesterday, where you combine sensory information with your prior knowledge, the tennis example and everything. Here the question was: instead of just learning one dimension, I could have a two-dimensional hidden variable, call the components S_x and S_y, and these could have any relationship that I want; they could be correlated, they could have a structure in this space. So I don't only have the bias that I have to estimate in the where-is-the-tennis-ball-going-to-land sense; I put in an extra bias that I now also have to estimate, in the x direction. So it's not tennis anymore, I don't know what this would be, maybe badminton. And we wanted to see: can you actually learn this structure?

So that's what we did. One group would basically have no structure: what you see is a Gaussian cloud over these two variables, which here correspond to the bias that is added to where you have to go; you're basically looking at the top of a Gaussian distribution, and the two variables are not correlated. If I tell you S_x, you don't know anything about S_y, and the other way around. So if I show you a data point, this red one, and you have to guess S_x and S_y, then of course you just say: okay, it's there. This would be like the example where you have reliable sensory feedback; you know exactly what the target is. But if I tell you the target is somewhere here, then you combine that with your prior, which says: most of the time it's here, but I don't know, and you cut out a slice. However, if you knew that the two variables were tightly correlated, like this, and I tell you the target is somewhere here, then you would know precisely where it is, because knowing this coordinate precisely, intersected with the correlation structure, tells you exactly what the other coordinate is.

So that was the idea: can we take this experiment, make it 3D, so that we have a two-dimensional plane with two hidden variables that you have to estimate in each trial, and impose a structure, namely a simple correlation structure? Could you then, in the critical trials in the end, use this correlation structure, so that if I give you information about one axis, you know the value on the other axis? So that's what we did. And this experiment took a very long time, because for a long time it didn't work: we didn't run it for enough days. The learning of this correlation is very, very slow. I think in the original experiment there were 2,000 trials, and after 2,000 trials you hardly see anything; only after we realized we had to do many more trials did we start to see an effect. We were quite perplexed.

So in the trials that we trained people on, there was full feedback, meaning I show you exactly the point, and partial feedback, which yesterday was a cloud; today it's also a cloud, but a cloud that's really stretched, like a long ellipse, almost like a line. And if you remember, the slope tells you basically how much you rely on the feedback: if the slope is 1, we said yesterday, then you don't have any feedback, there's no information, and the best you can do, playing in the dark, is to go to the middle of your prior; if you have full information, you would have a zero slope, always, independent of where the target was. And what you see here are the correlated and the uncorrelated groups. I'm just looking at the trials with no feedback first: in the no-feedback trials you don't know where this thing is, you should always go to the middle of your prior, and you see they all have a slope of 1, more or less. And now the interesting question is what happens in the partial-feedback trials, where I just have this line, and of course this line could be vertical or horizontal; these are the two possibilities here. And you see that in the uncorrelated group there is no effect, because there also wasn't anything they could learn, but in the correlated group there were effects.

If the correlation were learned perfectly, you would expect this slope here to be zero, because then you know exactly what the position of the other coordinate should be. But you see it's not zero; subject 3, for example, is particularly bad, and even here, this is the best case we have, and this is after some 4,000 trials. You can also look at how this slope evolves. It starts at 1; in the uncorrelated group it stays at 1, because there is no correlation to learn, and in the correlated group, looking only at the trials with this partial feedback of the bar, you see it moving away slowly, very slowly. At this point you've done 4,000 trials in total, but only about 630 partial-feedback trials, and you've not even learned 50% of it. In contrast, learning the mean of the distribution is fast, and that works for both groups.

And the question is: how can we model this? There are basically two kinds of learning going on. In each trial you have to learn, or adapt to, the hidden variable, which in the old experiment was the position of the target and in this case is the cursor shift. At the same time you have to learn the correlation structure, and this you learn over many trials, because that's basically the prior that you have to figure out. In a sense, what varies in every trial is the hidden variable, while the correlation structure stays the same; that is the structural invariant that you learn. The way you model this is with a hierarchical Bayesian model. On the left, you want to know the posterior: what is the probability of the hidden variable given my observations? And the observations are split into two parts, namely my current observation, little d, and all the past, capital D. Essentially, p(s | D), with the capital D, is my prior, and this prior is now itself parameterized: you have a hyper-prior, another distribution, over its parameters, in this case the covariance matrix, and also a belief about the mean. So over time you learn this prior, and in each trial you learn the current shift; those are the two things.

Now, when you model this, what you notice is that for a standard Bayesian model it's super easy to learn this correlation structure: if I show you dozens of points, you immediately know the correlation that explains them. And then you have a really hard time explaining why it takes subjects thousands of trials to figure out this correlation structure. The only way you can model this is by putting into the hyper-prior that this correlation is extremely unlikely: you have a prior that says the hypothesis that these two variables are uncorrelated has a huge prior probability, and then you need a lot of evidence to move away from that. When you model it like that, these are simulations where you see how the slope is learned, you get similar curves: very noisy, and very slow in moving away from one, meaning that you learn this correlation slowly, over thousands of trials. But this only works when you put in a very, very strong prior that the variables should be uncorrelated. Okay, how are we doing on time? Ten minutes, okay.
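As a minimal sketch of this hyper-prior argument (my own toy version: a grid posterior over the correlation ρ alone, with noisy observations of the hidden pair and a large prior spike at ρ = 0; the actual model also tracks the mean and the full covariance):

```python
import numpy as np

# Hidden 2-D variable with a strong true correlation; each trial yields only
# a noisy observation, so evidence about the correlation trickles in slowly.
rng = np.random.default_rng(1)
true_rho, noise_sd, n_trials = 0.95, 2.5, 4000
s = rng.multivariate_normal([0, 0], [[1, true_rho], [true_rho, 1]], size=n_trials)
x = s + rng.normal(0, noise_sd, size=s.shape)

# Grid posterior over rho, with a huge prior spike on "uncorrelated".
rhos = np.linspace(-0.99, 0.99, 199)
log_prior = np.where(np.abs(rhos) < 1e-6, np.log(0.999), np.log(0.001 / 198))

def log_lik(rho, x):
    # marginal of x: zero-mean Gaussian, cov [[v, rho], [rho, v]], v = 1 + noise_sd^2
    v = 1.0 + noise_sd**2
    det = v * v - rho**2
    q = (v * (x[:, 0]**2 + x[:, 1]**2) - 2 * rho * x[:, 0] * x[:, 1]) / det
    return -0.5 * (q.sum() + len(x) * np.log(det))

for n in (100, 1000, 4000):
    log_post = log_prior + np.array([log_lik(r, x[:n]) for r in rhos])
    print(n, "trials -> posterior mode rho =", round(float(rhos[np.argmax(log_post)]), 2))
# With a strong enough spike, the posterior mode stays pinned at rho = 0 for a
# long stretch of trials and only moves away once the accumulated evidence
# finally overwhelms the prior: the very slow learning described above.
```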
Okay, so then the question, and this was also a follow-up question from these kinds of studies, is: how should we select between different structures? A structure now means something like a parameter-and-model selection problem. You have some observation, let's call it D, you have some parameter S, and you have some model M. Different structures would correspond to different M's, and the different S's would trace out the subspace of that structure, so to say; that's the idea. So in the end this becomes a question of model selection and model comparison. And the Bayesian way to do model comparison is to ask: what is the probability of this data under model 1, and what is the probability of the data under model 2, if we give equal prior weight to both models? And for that we need to compute the so-called marginal likelihood.

What does that mean? Let me quickly give you the intuition. Say we have two models, model 1 and model 2. These models can explain certain data; they give probability to data. Now, a simple model would be one that is fairly concentrated: it can only explain a relatively small number of things in the world, and it gives higher probability to those, because the probabilities always have to add up to 1. A complex model is one that spreads probability mass over everything, or over relatively many things. Maybe an intuitive example: what is a complex model? If something happens in your life and you say, okay, fate, fate did this or that; fate can explain everything, so in this jargon it would be a complex model. Something more specific, like a scientific explanation of how a stone falls, can explain only a certain number of things; it cannot explain why you had an argument last night or something like that; that's a more specific model. And that means, if I give you a data point that is here, both models can explain it; in particular, the complex model will always be able to explain it. The question is which model you should use, and if you take the Bayes factor, or marginal likelihood, you should choose the one that gives the higher probability to this data point. And that means: if two models explain the data equally well, choose the simpler one.

So this marginal likelihood, where does it come from? Well, you want to know the probability of the model given the data, and that is just Bayes' rule, so we have this. And to get this, you essentially have to take the joint distribution over data and parameter and get rid of the S, integrate it out; that is what's written there. Now, what's the intuition behind this? This term gives you the likelihood of the data given the model and one particular parameter setting, and of course, in the complex model you will always find a parameter setting such that the data is explained well. That's what you see here: if you just look at the squared distance from the model to the data points, this model will be better than that one.
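In symbols (my reconstruction of what the slide presumably shows): Bayes' rule for the model, and the marginal likelihood with the structure parameter \(S\) integrated out, are

\[
p(M \mid D) = \frac{p(D \mid M)\,p(M)}{p(D)}, \qquad
p(D \mid M) = \int p(D \mid S, M)\,p(S \mid M)\,\mathrm{d}S,
\]

so with equal model priors, comparing models reduces to comparing marginal likelihoods, i.e. the Bayes factor \(p(D \mid M_1)/p(D \mid M_2)\).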
But because it's more complex, there are also going to be many parameter settings that do not fit the data well, whereas in the simple model there are not so many parameter settings that fit badly, presuming there is one that fits well. So you can think of the marginal likelihood as taking the average over all the parameter settings you're considering for the model, the average likelihood. If you have a complex model, the average likelihood may not be so high: there are maybe one or two parameter settings that fit well, but then there are millions that don't, and they drag the average down. So if you use the marginal likelihood to decide between models, you automatically take the model complexity into account. That's the story of Bayesian model selection, I guess.

And we wanted to test this in an experiment. What we did was show people dots and ask them to draw a line that fits the dots. And we trained people on different kinds of models: we used Gaussian processes with different length scales. If you don't know what that is, don't worry; I'll show you the picture. Basically, one of the two models was smoother and the other was more wiggly. And as you can see, you can have the same dots, but depending on the model you've been trained on, you would draw a different line. We trained each person on both models, and we had a color cue in the background. Training on a model means I can draw samples from the Gaussian process, curves; I show you points from a sample, and you have to guess the underlying line from the Gaussian process that produced them. If you do that many times, you learn: okay, this one here is smoother, and here I should probably try a more wiggly line. So they would try to guess this underlying line. The cue in the background was a color pattern showing some kind of mountain range that could be more wiggly or flatter, so you had some cue to know which model would fit the points, and you could learn that over time.

And then the interesting part is when you expose them to test trials. In the test trials there was no cue, and they had to decide which model they should use to connect the dots. The test trials were chosen as follows. For the Gaussian process, it's a little bit like a Gaussian, we can look at the marginal likelihood, and it has this form: it splits into two parts, a data-fit error and a complexity term. The complexity term carries the wiggliness of the model: the more wiggly one has the higher complexity and the smoother one the lower complexity. And we chose the test trials such that the data-fit error, which we can compute from the Gaussian process, would be the same for both models, so that the only difference would lie in the complexity. So the question is: if you have a trial where both models explain the data equally well, which one do you choose?

That was the question, and lo and behold, they chose the simpler model, like they should. You see here, we tried two or three different levels of this data-fit error, and most of the time the subjects chose the simpler model. And if you look at all the trials and assume they choose according to the Bayes factor, then the experimental probabilities and the theoretical probabilities should roughly lie on this diagonal, which was roughly the case.

And we checked two more things in control experiments. One was to check for physical effort, because it takes more effort to draw the wiggly line than the smooth one; so we had a version where you just clicked a mouse button, left or right, for the simpler or the more complex line that you wanted. And we also looked at spatial frequency, with a control experiment where actually the more wiggly line can be the simpler model: if the prior distribution over the wiggly lines is very narrow, if it's almost always the same line, then this model only explains a small set of data, so to say, whereas the smooth one, if it has a lot of variance, maybe explains a larger set of data. So now the situation is reversed, and the wiggly model becomes the simpler model, because in the end it's just about how many data sets you can explain. And when we did that, subjects again chose the simpler model, which was now the more wiggly one.

So, in the test trials, we're only talking about the test trials, where the data-fit term was such that both models explained the dots equally well: it looked like this shape, and you would pick this shape even though it's more wiggly, because in Bayesian terms it's not more complex; it's the simpler model. It's more wiggly in terms of spatial frequency, sure, but that equation of wiggliness with complexity is only usually okay; if you want to be strict, it's not quite right, because the prior distribution over these parameters matters. Even if you have a higher-order polynomial, if there's only, say, one value that one of the coefficients can take, then it's not so complex, because it's always the same and you can predict it. So that would be a discussion about how you should measure model complexity, and if you want to be a Bayesian, then complexity is basically: how many data sets can you explain, so to say. Okay.

Okay, so now I start with something that looks slightly different, but then we come back to the abstraction problem in a little bit: how to model learning with parametric models. What we thought was: okay, if we take this optimization criterion and now assume that our posterior belongs to a parametric family, say a Gaussian or whatever, how do we then determine it? An obvious answer is to try gradient descent: you take this whole thing as an objective and do gradient descent on the parameters that you want to learn. For example, if the distribution depends on θ, then you take the derivative with respect to θ, and you can actually write the update rule like this, because of what some people call the log trick. What's nice about this is that the derivative only hits the log of the distribution, and you know the expression for the distribution, so you can do this derivative.
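The "log trick" here is presumably the standard score-function identity. For an objective of the form \(\mathbb{E}_{p_\theta(a)}[f(a)]\),

\[
\nabla_\theta\, \mathbb{E}_{p_\theta(a)}\big[f(a)\big] \;=\; \mathbb{E}_{p_\theta(a)}\big[f(a)\,\nabla_\theta \log p_\theta(a)\big],
\]

so the gradient can be estimated from samples \(a \sim p_\theta\), which is what makes the online update possible.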
And this term here enters without a derivative, and the expectation can then be taken by sampling, so you can do an online update: I just look at samples of W and A and update my θ for each sample, an online update, okay?

And we tried this out in several scenarios. The first one was a thought experiment where we said: okay, what if a spiking neuron acted like a bounded rational decision maker, what would this neuron do? Essentially, the idea is that different spike trains come in, each with a synaptic weight; they're summed up into a post-synaptic potential, that is fed through some activation function, and according to that, spikes are produced with a certain probability. This neuron also has a reward signal, and the reward signal depends, of course, on the spike trains that come in and the spike train that goes out. The question is: what would this neuron do, how would it adjust its weights, if it were to use this kind of update rule? So that's what we do: we look for the weights that give the optimal trade-off between the reward and the limited information between input and output spike trains. And if you do the math, you get an update equation that looks like this, and the intuition behind it is actually not so difficult.

If we plot Δr, where Δr is the difference in reward for firing versus not firing, because you only have these two actions, so to say: if the information plays no role, you get this sort of sharp curve. If the reward for firing is positive, you increase the weights to make it more likely that you will fire again; if the reward is negative, you decrease the weights slightly, because you don't want to fire in that case. And what the log term does, in the end, is take the baseline firing rate into account. Let's take the extreme cases, they're always easiest to understand. If β is zero, you only care about the information, and you get a flat line: you never deviate from the average firing rate. In the β-infinity case, you just care about the reward, and the average firing rate is of no concern to you. For β in between, the adaptation rule takes both into account: you want to increase the firing rate when the reward is positive, but you don't want to deviate too much from the average firing rate, because that would mean upsetting your system a lot. So you get a neuron that doesn't want to deviate too much from its average firing rate, which in the end turns into a neuron that economizes on its synaptic weights.

We can see this in a simple example, just a toy experiment. Assume we have a signal spike train that is translated into other spike trains that are more or less correlated with it. Your neuron sees only these intermediate spike trains and wants to recreate the original signal spike train. That's a simple problem with an easy reward function: every time you produce a spike that coincides with a spike in the signal spike train, you get a reward; otherwise there's no reward. And then you see what happens. If you just care about reward, your neuron will fire all the time and will have really high weights, because this increases the chance that you will, at least by accident, produce a coincidence with the signal spike train. But if you have this information penalty, you don't want to deviate too much from your average firing rate, and you see that the weight growth here is limited compared to here, while you still achieve a similar level of utility, again something like 95% or so. Okay, so that was, as I said, a thought experiment: a bounded rational neuron would be a neuron that doesn't deviate too much from its mean firing rate and that economizes on its weights.

Then, as a second attempt, we thought: can we apply this to an artificial neural network made of many neurons, again using the same kind of update rule, with the appropriate representation of the probability distribution in each case? We tried two different scenarios: one where the network is composed of neurons of which each is a bounded rational decision maker, and one where we regard the whole network as a single bounded rational decision maker. So the network was like this: the inputs, then the layers, then a softmax output layer; the softmax layer was important because you can interpret the output as a probability distribution, and we need that in order to apply mutual information and so on. In the first scenario, each neuron is a bounded rational decision maker: the utility is the same for everybody, you do normal backpropagation with it, and each neuron also doesn't want to deviate too much, which becomes basically a weight regularization. For the network as a whole, we also do just ordinary backpropagation, but with this whole term. In both cases, you can think of it essentially as a regularization of the learning process.

Then we tried this out on the identification of handwritten digits, the MNIST dataset, and we compared it to other methods, just to see how well it does; I mean, this was never designed to do recognition of handwritten digits or anything like that. You see here the classification errors for the individual-neuron case and for the network as a whole, and these are the errors achieved by recent algorithms that use other regularization methods, like dropout or DropConnect and so on. Ours is not the best, but it's definitely in the same sort of league; that's what we found. We also tried it on a convolutional network, and again it wasn't the best but in the same sort of league. So using the KL divergence as a regularizer in a neural network setting also seems to work fairly well; that is basically what we found in this study.

What about the training time? No, I think it was, do you mean until it converges, compared to the other methods? Yeah, that's true. For the beta, yes, the beta does influence the training time, of course, because you trade off the accuracy, like you say. I'm not sure, off-hand, how the training time compared to the methods where we have similar accuracy; I would have to look. But I don't think there was much difference, because otherwise I would remember that we had discussions about it with my students. I would have to look, so I can look it up later if you want.
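As a minimal sketch of the "network as one bounded rational decision maker" variant (my own toy reconstruction, not the study's architecture: a linear softmax classifier on synthetic data, trained on utility plus a KL penalty toward a prior that is re-estimated as the output marginal):

```python
import numpy as np

# Softmax classifier whose loss carries an extra KL term penalizing deviation
# from a prior over outputs; the prior is updated to the running marginal,
# mirroring the Blahut-Arimoto prior update. All names/numbers illustrative.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
W = np.zeros((2, 2))
prior = np.ones(2) / 2
lr, inv_beta = 0.1, 0.05        # inv_beta = 0 recovers plain cross-entropy

for epoch in range(100):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(2)[y]
    # gradient of cross-entropy plus (1/beta) * KL(p(a|x) || prior),
    # both back-propagated through the same softmax
    g = np.log(p) + 1 - np.log(prior)               # d KL / d p, per class
    grad_logits = (p - onehot) + inv_beta * p * (g - (p * g).sum(axis=1, keepdims=True))
    W -= lr * (X.T @ grad_logits) / len(X)
    prior = p.mean(axis=0)                          # prior <- output marginal

print("accuracy:", ((X @ W).argmax(axis=1) == y).mean())
```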
Okay, so then we wanted to use this idea of neural networks also in the context of learning abstractions, when coupling action and perception. Let me first explain the setup without the parametric part, and then in a second step we introduce parameters. Before, we had a simple situation with essentially two variables, the world state and the action: you choose a policy for each world state, and you can have abstractions that are encoded in the prior, and so on. Now we have three variables, so we're getting slightly more complex: a world state W, then a perceptual state O if you want, and an action state A. We call this the serial scenario because the action knows nothing about the world state except through what the observation state tells it.

Usually you would think about perception and action separately; that's also what people do in robotics (and later we have a robotics application): one person figures out the perception problem, another figures out the action problem, and then you try to put these things together afterwards. Here, instead, we optimize both things at the same time. So we have a utility function that we want to optimize, and then we have two constrained information channels, one from world state to perceptual state and one from perceptual state to action, and I have two parameters with which I can regulate the accuracy, say, or how much information I want to allow. What we're doing here is optimizing for the posteriors and the priors, so we again get a set of self-consistent equations, only this time four of them, because we have one more variable.

How can we read them? Let's first look at this equation; it's fairly easy. Given an observation, you have to select an action: you have a prior over actions, and you optimize the expected utility, where you average over the unobserved world states given your observation. This is just the Bayes posterior; it is not put in by hand, it falls out of the model. So that's pretty clear. The more interesting question is the perceptual model, the one that maps from the world state to the observation state. It is also driven by the utility: the observation state is picked in such a way that, on average over all possible actions, the action stage will be able to get a lot of utility while not burning too much information. So you create an observation that is, first of all, understandable for the action stage, and that also allows it to create a lot of utility, because that's what you want. You can think about it as a kind of feature selector that emerges, and how much information this channel transmits depends on the downstream action stage.
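Concretely, one way to set up the alternating updates for this serial architecture is sketched below; this is a minimal numpy version of how I understand the description, and the exact equations in the paper may differ in detail:

```python
import numpy as np

def serial_solver(U, p_w, beta1, beta2, n_o, iters=300, seed=0):
    """Alternating (Blahut-Arimoto style) updates for W -> O -> A.

    U[w, a]: utility table; p_w[w]: world-state distribution;
    beta1, beta2: resource parameters of the two channels;
    n_o: number of perceptual states. Returns p(o|w) and p(a|o).
    """
    rng = np.random.default_rng(seed)
    n_w, n_a = U.shape
    p_o_w = rng.random((n_o, n_w)); p_o_w /= p_o_w.sum(0)  # p(o|w)
    p_a_o = rng.random((n_a, n_o)); p_a_o /= p_a_o.sum(0)  # p(a|o)

    for _ in range(iters):
        p_o = p_o_w @ p_w                            # marginal p(o)
        p_a = p_a_o @ p_o                            # marginal p(a)
        p_w_o = (p_o_w * p_w).T / (p_o + 1e-16)      # Bayes posterior p(w|o)

        # action stage: p(a|o) ~ p(a) * exp(beta2 * E_{p(w|o)}[U(w,a)])
        EU_ao = U.T @ p_w_o
        p_a_o = p_a[:, None] * np.exp(beta2 * (EU_ao - EU_ao.max(0)))
        p_a_o /= p_a_o.sum(0)

        # perception stage: driven by the free energy of the action stage
        EU_wo = U @ p_a_o                            # E_{p(a|o)}[U(w,a)]
        kl_o = (p_a_o * np.log(p_a_o / (p_a[:, None] + 1e-16) + 1e-16)).sum(0)
        dF = EU_wo - kl_o[None, :] / beta2
        p_o_w = p_o[:, None] * np.exp(beta1 * (dF.T - dF.T.max(0)))
        p_o_w /= p_o_w.sum(0)

    return p_o_w, p_a_o
```

The two updates mirror the two equations read off above: the action stage trades utility against deviation from its prior, and the perception stage is rewarded for observations whose downstream free energy is high.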
So what can you do with this? Here is a little toy example that my PhD student came up with; that's Tim. Imagine there are different animals that have different sizes, and different actions you can take: you should always run away from the big animals, for the little animals there are maybe different ways you can hunt them, different techniques, and there are also techniques that are not so specific, where you still get some utility while doing the same thing to several animals.

What you see in the top left is the typical way to think about the problem: I have the animal size, then a perceived animal size, the true size plus some noise or something like that (that's indicated by this diagonal), and this is used by the action module to make the decision. In the bottom it's not like that: in the bottom there are information economies. For the little animals it makes sense for the perception module (the one on the left) to distinguish the three little animals and spend information on that. But it doesn't make sense, for example, to distinguish the big animals, because the big animals all call for the same action. Of course this was in the utility function, but still: if distinguishing the big animals gives you no utility advantage, the perception module will not distinguish them either, because that would be a waste of information.

And what you can do now is change the action module. If I make the actions more imprecise, then, if you treat action and perception separately, of course nothing will happen on the perception side. But in the coupled case, if I can no longer choose my actions precisely, I will fall back on more generic techniques that are simpler, and that will then have consequences for the perception: if the action module doesn't distinguish anymore between different animals in terms of action, it makes no sense to waste resources on perceptually distinguishing them, and in this case you will just distinguish two categories of animals, large and small, say. (You can reproduce this with the solver sketched above; see the snippet below.) I guess the philosophical idea behind this (and that was already the case when I talked about abstractions earlier) is that the way you perceive the world depends on the computational resources you have: the fewer resources you have, the more economically you have to carve up the world.
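With the solver sketched above, the animal example can be reproduced from a made-up utility table; the numbers here are my own, only the structure is from the talk:

```python
import numpy as np

U = np.array([
    # hunt1 hunt2 hunt3 generic run-away
    [3.0, 0.0, 0.0, 1.0, 0.0],   # small animal 1
    [0.0, 3.0, 0.0, 1.0, 0.0],   # small animal 2
    [0.0, 0.0, 3.0, 1.0, 0.0],   # small animal 3
    [0.0, 0.0, 0.0, 0.0, 2.0],   # big animal 1
    [0.0, 0.0, 0.0, 0.0, 2.0],   # big animal 2
    [0.0, 0.0, 0.0, 0.0, 2.0],   # big animal 3
])
p_w = np.full(6, 1 / 6)

# generous resources: perception should separate all six animals
p_o_w_hi, p_a_o_hi = serial_solver(U, p_w, beta1=10.0, beta2=10.0, n_o=6)

# scarce action resources: perception stops paying for distinctions the
# action stage can no longer exploit and collapses toward small vs. big
p_o_w_lo, p_a_o_lo = serial_solver(U, p_w, beta1=10.0, beta2=0.5, n_o=6)
```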
Yes, I had this discussion before when giving a talk, with visual psychologists, and we actually both agreed that for humans this model is maybe not too great, because there are so many things you can do that it makes sense for perception to be fairly generic, independent of the action part. The argument we were having was that this may be a better model of, say, insect vision, where you have a more limited setup that evolves through evolution rather than in an ontogenetic setup. But I still believe that doesn't invalidate the model. What I'm saying is that perception should be coupled to the action; if we're saying that humans can do so many things that it makes no sense to restrict perception too much, because there are too many possibilities, then these two things just become a bit more independent again. And I guess even in humans there are studies that look at the connection between action and perception: if you learn, for example, to manipulate a new object in a new way, then the perception of the object also changes in some sense, because you then know what you have to attend to and everything. So I guess to some extent it's true, just not in the extreme case I'm making here, that, I don't know, your motor cortex is gone, you're paralyzed, and all of a sudden there are only small and big animals in the world. Although, you know Niels Birbaumer, who was working with ALS patients that lose basically all connection with the world in terms of action? One of his claims (of course that's not proven) is that you then also lose the ability to think, and maybe perception to some extent. I don't know. But yes, in this extreme version, for humans, probably not. Maybe we can talk afterwards.

But if you pack all the information into the prior, then you don't need the utility function? Yeah, I mean, okay, this is the general discussion about active inference, I guess. So yes, you can do that; in fact the two are, I guess, equivalent. Where's the... here. I can write, for example, p(a); let's just take one variable as an example. This is the thing I've been showing you the whole time. But now you could say: why do you write it like this, let's write it differently, pulling the exp(beta U(a)) term together with the prior p0(a). This is not normalized, but the normalization is just a constant, so for the optimization it makes no difference. I could call this Q(a), and I could say this is my desired distribution; you called it prior, I would maybe call it desired distribution, but we mean the same thing. And then all you would have to do is adjust the distribution p(a), the one you want to vary, to match this one as closely as possible. (The worked identity is written out below.)

So you can tell different stories. Here the story was: I have this prior, I try to move away from the prior as little as possible, and I try to optimize this utility. The story you would tell is: I have this desired distribution (I call it desired distribution now, to be clearer), I have somewhere a distribution p that I can change, and I want to make it as close as possible to my desired distribution. Both do exactly the same thing; it just depends on whether you prefer to talk about utilities or about desired distributions, but in the end, mathematically, they do the same thing. Well, I don't know, I would still say (that was the argument) that the log probability is also like a utility, so for me you only need one name. Is that true? I don't know. The desired distribution and the utility... I mean, if you say prior, I wouldn't say it's wrong, but for me it would be confusing, because when I'm talking about priors I mean something different: not the thing that you desire, but the thing that you want to get away from if you can; that's the logic in the priors that I'm talking about. And the thing that you call prior is the thing that you want to get to, but you don't know how; it's like the target.
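Written out, the identity from the board is the following, with $Z$ the normalizer, which is constant in $p$:

$$
p^*(a) \;=\; \operatorname*{arg\,max}_{p}\ \mathbb{E}_p[U(a)] - \frac{1}{\beta}\,\mathrm{KL}\!\left(p \,\middle\|\, p_0\right) \;=\; \frac{1}{Z}\, p_0(a)\, e^{\beta U(a)}, \qquad Z = \sum_{a'} p_0(a')\, e^{\beta U(a')} .
$$

With the desired distribution $Q(a) = p_0(a)\, e^{\beta U(a)} / Z$ one has, for any $p$,

$$
\mathbb{E}_p[U] - \frac{1}{\beta}\,\mathrm{KL}(p \,\|\, p_0) \;=\; -\frac{1}{\beta}\,\mathrm{KL}(p \,\|\, Q) + \frac{1}{\beta}\log Z ,
$$

so maximizing the free energy over $p$ and minimizing the divergence to the desired distribution pick out exactly the same $p^*$.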
So in some sense it's okay to call them priors, but in some other sense they're almost opposite things: one is the thing you want to get away from, the other is the thing you want to get to. But I guess this is arguing about names, which you can always do; more important is that the math is more or less the same.

Okay, so then I said we can apply the same idea when we have parameterized distributions, and that's what we tried here: another of the pictures from Jen. She used a simulation of the NAO robot, and the idea was to make it slightly more realistic, even though it's still a toy problem. What's more realistic is that we used a neural network for the perception part. So we have two distributions, perception and action: the perception distribution was a neural network that got a pixel image from the camera, so that part is quite realistic. The action part was simple, because we only had four different actions and also only four different world states. The four world states are these different mugs that you see there: a mug can have a handle on the left, on the right, on both sides, or no handle at all. Depending on the handle, there are different ways you can grasp it: if the handle is on the left you can grasp it on the left, if it's on the right you can grasp it on the right, and you can also grasp with both hands; you waste one arm movement, but it's a generic movement that always works, except for the mug that has no handle. That's what's illustrated with this utility function. The highest utility is when you always grasp the mug with the grip that's perfect for it: the left-handle mug with the left hand, the right-handle one with the right hand, and the one that has two handles with both hands (I guess you could also do that one with one hand, but both hands is best). But you also get utility if you do the two-hand grip on the mugs that have only one handle. And then there's the mug that has no handle, where you should choose the action of doing nothing (otherwise it would just slip); it's just an example.

So the question was: now we do the same thing, just parametrically, here a neural network and here a simple multivariate distribution. The perception part is just a simple feed-forward network, nothing special; you compute the gradient with respect to these parameters using the log trick that I mentioned before, applying the same idea. And this is the action module, which simply learns a categorical distribution to map X to one of these four actions. This is what you get, depending on the beta values you set: if the betas are generous, you will always do the perfect action; if the betas are intermediate, you will distinguish in the observation space just mugs with handles versus mugs without handles; and if you have basically no resources, then there's just always one action and just one state. So again you see that the way you carve up the world depends on this accuracy you can afford, on the information resources you have.

In which part, here, or in general? In general, yes, it does: you get the pixel image, it's fed into the neural network, this neural network then produces an output X, and this X is then mapped to the action here.
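As a concrete stand-in for the utility function on the slide (the numeric values are my own; only the ordering of outcomes is described in the talk), the non-parametric version of this problem can be fed straight into the serial solver sketched earlier:

```python
import numpy as np

# world states: handle left, handle right, both handles, no handle
# actions:      grasp left, grasp right, grasp with both hands, do nothing
U = np.array([
    [2.0, 0.0, 1.0, 0.0],   # handle left: left grip best, two-hand grip works
    [0.0, 2.0, 1.0, 0.0],   # handle right
    [1.0, 1.0, 2.0, 0.0],   # both handles: both hands is best
    [0.0, 0.0, 0.0, 1.0],   # no handle: best to do nothing
])

# generous betas should give the perfect grip for every mug; lowering beta2
# should collapse perception to "has a handle" vs. "no handle"
p_o_w, p_a_o = serial_solver(U, p_w=np.full(4, 0.25), beta1=8.0, beta2=1.0, n_o=4)
```

The parametric version in the study replaces the p(o|w) table with the perception network and the p(a|o) table with the learned categorical distribution, trained via the log-derivative gradient.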
But the difficulty is that you don't have that knowledge. So the question (this was at IROS, which is a robotics conference) was essentially the following: assume I have two separate modules, one for perception and one for action. The action module was a simple parameterized distribution, but the perception module was actually quite realistic, a neural network where you optimize the parameters. Now you want to couple the two things, and the idea was: okay, let's couple them with this equation that we have and update it. And basically it worked, and you get this effect that, depending on how you choose the betas, the way you make distinctions is different.

Okay, so then there are, I think, two last studies that I can quickly present; that should just fit in the time. It's again this idea that limited resources should lead to an optimal division of labor. For me that is coupled with this idea of abstraction: what I would ultimately like to understand is how limited resources lead to the formation of hierarchies and division of labor. If you think, for example, about a company: if everybody had unlimited abilities, there would be no need to join together to form a company or to divide labor or anything like that. But if everybody is limited somehow, then we have an incentive to say: okay, I'm limited, you're limited, but if we join together we have more processing power, we can solve new problems, and we can increase the utility for everybody, so to say. And you see, in a shop for example, that there tends to be this hierarchical formation that seems to be efficient; that's what I would like to understand. Then you also have these different time scales of abstraction that arise: the CEO of the company decides things on a time scale of years, maybe, while in the shop the assistant talks to a customer on a time scale of minutes. But this should be a general principle, maybe also applicable to brains with simple neurons that join together: each neuron by itself is pretty useless, but if you put them together they do amazing things. The same is true for humans: each human by itself would not be so impressive, but big groups do impressive things. So yes, these are first steps towards these ideas, I guess.

So here again we have a model with three variables, just like before, with a slight difference: now the action module can still see the world state directly. This gives rise to a different interpretation. The world state is turned into an X, which you can think of as a selector: this selector chooses between experts that are indexed by X, and the selected expert then looks at W and makes a decision about it, basically like a mixture-of-experts system. And then the question is how you divide the work between these experts, because there's just one utility function: we assume a utility function that depends on the world state and the action. The experts first have to decide what to be experts about, and the selector has to decide which expert to select. So the problem formulation is very similar to before: I want to optimize utility, I have this limited-capacity channel for the selector, and here also one for the action. What's new here is that we have these three variables, and here X is given (we said it's the index of the expert) and the expert has to make a decision from world state to action; that's why this information term is written as a conditional information. And again you will get coupled equations that you can solve; one way to write them is sketched below.
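By analogy with the serial case, the problem and its coupled equations can be written as follows; I am reconstructing the exact form here, so details may differ from the paper:

$$
\max_{p(x|w),\, p(a|w,x)}\ \mathbb{E}\!\left[U(W,A)\right] - \frac{1}{\beta_1}\, I(W;X) - \frac{1}{\beta_2}\, I(W;A \mid X)
$$

with self-consistent solutions

$$
p^*(a \mid w, x) \;\propto\; p(a \mid x)\, e^{\beta_2 U(w,a)}, \qquad p(a \mid x) \;=\; \sum_w p(w \mid x)\, p^*(a \mid w, x),
$$

$$
p^*(x \mid w) \;\propto\; p(x)\, \exp\!\Big(\beta_1\Big[\,\mathbb{E}_{p^*(a|w,x)} U(w,a) - \tfrac{1}{\beta_2}\,\mathrm{KL}\big(p^*(a \mid w,x) \,\big\|\, p(a \mid x)\big)\Big]\Big), \qquad p(x) = \sum_w p(w)\, p^*(x \mid w).
$$

Each expert $x$ optimizes utility against its own prior $p(a \mid x)$, and the selector prefers the expert whose free energy is highest for the given world state; the experts' priors are exactly where their specialization shows up.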
Here is another example from my PhD student; it's also a bit constructed. Assume there are different diseases in the world, here heart diseases and different lung diseases, and there are different treatments whose payoffs are given by this utility function. What you can then see is, for example: take two different populations, one where these diseases are equally probable and another where heart diseases are more prevalent (and there are two different heart diseases; they're not the same). In the one case you get a situation where you have an expert for heart disease and experts for the two different lung diseases. In the other population you get a different division of labor: one lung doctor and two different kinds of experts for the two different heart diseases. There it makes more sense to specialize and make more distinctions, because in the example it was assumed that there are two different heart diseases where, if you distinguish them, you can get even more utility.

And this is some other stuff that we've been playing around with. In this example we asked what happens if we model these expert systems in such a way that each expert has a prior, and this prior is a generative neural network, a variational autoencoder in this case. If that doesn't mean anything to you, just imagine a very flexible distribution that you can sample from. The world state comes in, you select one of these experts, and then the thinking process is modeled by a Markov chain Monte Carlo search, where you try out different possibilities to find the best one. You combine these two things: you learn these priors, and you combine them with this search process to get to the posterior. And you can then see that if I have multiple experts and less thinking time, I can do better. If you have infinite thinking time, one expert can do everything, because you can optimize everything. If you have less thinking time, then it's better to have multiple experts, each of which can already find the optimum in its own surroundings with a few search steps; whereas if you have just one expert, standing in the middle of the room, say, who has to find the optimum anywhere in the room, it would take lots of time, essentially.

This is another example, where we asked: what if we represent the posterior distribution parametrically again, here with linear functions? These are different experiments that we did; this is just stuff that's going on at the minute. Say you have a classification problem that cannot be linearly separated, but each expert is just a linear expert, so no individual can solve the classification task: can you put them together such that you can solve the problem? Or regression: if you want to regress this function and you just have linear regressors, then you have different experts that together try to approximate the function.
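As a flavor of the linear-experts experiments, here is a self-contained toy of my own: several linear regressors jointly fitting a nonlinear function, with a soft selector standing in for the free-energy machinery.

```python
import numpy as np

rng = np.random.default_rng(1)

# nonlinear target that no single linear expert can fit
x = np.linspace(-3, 3, 200)
y = np.sin(x)

K = 4                                        # number of linear experts
centers = rng.choice(x, K, replace=False)    # where each expert is "responsible"
slopes, intercepts = np.zeros(K), np.zeros(K)
X = np.stack([x, np.ones_like(x)], axis=1)   # design matrix for a line fit

for _ in range(50):
    # selector: soft responsibilities; sharper means more selector capacity
    resp = np.exp(-4.0 * (x[:, None] - centers[None, :]) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)

    # each expert: weighted least squares on its region, then recenter
    for k in range(K):
        w_k = resp[:, k]
        A = X.T @ (w_k[:, None] * X)
        b = X.T @ (w_k * y)
        slopes[k], intercepts[k] = np.linalg.solve(A, b)
        centers[k] = (w_k @ x) / w_k.sum()

pred = (resp * (x[:, None] * slopes + intercepts)).sum(axis=1)
print("mean squared error:", np.mean((pred - y) ** 2))
```

No single line can fit the sine, but four experts, each linear in its own region, together approximate it well; the selector's sharpness plays the role of the information budget.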
And the last thing I can briefly mention is that we're now also looking at bigger networks. I just showed you three variables; now you can ask: if I have a utility function over action and world state, and I have many people, many experts, that can do different things, and each of them has a different limitation on how much information they can process, what is the optimal way to put these people together to solve the problem, looking over all the possible ways you can combine them? We restricted ourselves here to pure feed-forward information-processing paths, and in that case you can decompose everything again with these free energies, in more or less the same way that we discussed already. Then you get graphs like this, where the information goes from world state to action selection, and you can choose between many different graphs. The difficulty is that you try these out for different utility functions, and then you have to argue whether there is a common principle: what are good design principles, what are structures that appear very often and are successful in solving tasks? That's the kind of stuff I'm looking at at the minute.

So that brings me to the end of my talk, just in time. I hope I convinced you that bounded rationality is an interesting research topic, and that this kind of free energy principle is maybe also an interesting way to think about problems, giving a unified perspective, I guess, on many things that are otherwise disparate. Okay, that's it.