So, it's my enormous pleasure to introduce Lucy for her presentation this afternoon. I apologise on behalf of the faculty: I'm the one person who is here, because everybody else is in an assembly meeting. So, I'm sorry, Lucy, that we can't all be here; I think I got the good end of the deal. As you can see from Lucy's slide, she is a senior Harvard University PhD student who's here in Okinawa with us for the next 12 months. She's sharing her time between the Theoretical Sciences Visiting Program, which is part of what her presentation is today, our unit, the Human Developmental Neurobiology Unit, and Kenji Doya's unit. So, she gets around. As I say, Lucy is a final-year student at Harvard. Her adviser is Sam Gershman, and her research applies information theory and reinforcement learning towards understanding how humans learn and make decisions under cognitive resource constraints. So, how good is your decision-making when the resources available to you are limited? She has been a somewhat restless researcher; in fact, I'm a little bit jealous of all the places Lucy has had the opportunity to work at. She spent time at Baylor College of Medicine, at MIT, at Janelia Research Campus and at UCL. Lucy's also a really passionate teacher as well as a researcher, and I can attest to that because I imposed on her to give a lecture to my class, and it was better than any of the talks I gave them, and she really engaged them. So, just be careful, because she'll have you up at the board doing all sorts of things during this presentation if I know her. She's also developed and taught many courses, including Math Tools for Neuroscience and How Music Plays the Brain, and she's served as a resident tutor at Harvard College, where she lived with and advised undergraduate students. Lucy's going to be here for another few months, so if you have an opportunity, either call in and see her at the Theoretical Sciences Visiting Program or come to our lab; we would love to see you. Actually, most of my lab are here today, which is a testament to how much we enjoy Lucy's company as well as her science. So, I should probably stop talking, because you didn't actually come here to listen to me today. Lucy is going to leave some time at the end for questions, and please, please use the microphone so that the people on Zoom can hear your questions. So, Lucy, I'm going to stop talking and hand over to you, who is, I hope, the person everybody here has come to listen to today.

Well, I haven't even said anything yet. Hi, everyone. Thank you all first for coming today. I'm really excited to be here at OIST. And my family is actually visiting me this week, so it's really special to have them all in the back row there. Today, I'm going to tell you about how cognitive constraints shape our decisions and actions. Imagine you're a new graduate student, you're new to Japan, and you're grocery shopping at San-A after work one day. You need to buy milk, so you go to the milk aisle, but there are so many options to choose from that are very unfamiliar. Being in this new environment, you haven't figured out what your favourite brand of milk is yet. But because you're a grad student, you're probably also mentally drained after a long day of solving quantum physics or patching neurons. Or you might need to rush back to lab after the grocery store to run some more experiments.
And so, you decide that it's easier to just buy the same milk brand that you bought last week instead of trying out a new option or figuring out which one you like the most. So, as I've illustrated with this example, our behaviour is undeniably shaped by constraints on our cognitive resources. This is because every decision comes with a cognitive cost, and we often make a trade-off between the value of our choices and the cognitive cost of making the choice. In other words, the more cognitive resources you spend on making a decision, the better your decision could be. So, the fundamental question I seek to tackle in my research is this: how do humans and other animals make decisions when their cognitive resources are limited? I tackle this question using mathematical tools from reinforcement learning theory and information theory to build models of decision-making behaviour. As a quick primer for those who are unfamiliar: reinforcement learning is a branch of machine learning that describes how an agent learns from its environment. Information theory is a branch of mathematical statistics that quantifies how information is communicated from a source to a destination, and how novel that information is based on what is already known. Using these two formal frameworks, we can build mathematical models describing how people learn from their environment and what information they use to make decisions or select actions. Okay, so let's start cooking up a model together. In reinforcement learning models, you have a couple of ingredients. You first have an agent, in our case a person or the brain, that takes actions in the environment and gets feedback in the form of a reward signal. Here I'm going to lay out the mathematical notation along with the concepts and try to give you an intuition for these things. A policy is what the agent follows in order to know what action to take in every state; you can think of the policy as instructions that tell the agent what to do in every state. As a concrete example, you can think of states as different restaurants that you could go to, and actions as different food orders that you could place at each restaurant. Each entry in the grid is the probability that you would order that particular food item at that restaurant. So, for example, at Sushiro, which is my favourite rotary sushi chain, you're probably most likely going to order sushi, and not ramen or gyoza. Similarly, at Ichiran, which is a popular ramen chain in Japan, you're probably going to order ramen, and not sushi or gyoza. And at an izakaya, which is a very common Japanese eatery-pub combination, you might be equally likely to order any of these three options. That's an example of what a policy would look like. Another ingredient in reinforcement learning models is value. Value is the average reward earned by following a policy π, and the goal of the agent in reinforcement learning is to find the optimal policy that maximizes value. What do I mean by this? I'll give you two contrasting examples. Here's a policy which I would say is pretty bad: at Sushiro, you order ramen and nothing else; at Ichiran, you order gyoza and nothing else; at an izakaya, you order sushi. Maybe that last one is actually an okay choice. But, in general, you're not exactly maximizing the overall value of the food that you're eating at each of these restaurants.
On the other hand, an optimal policy is one that maximizes value. So, here, you're ordering sushi at Sushiro, you're ordering ramen at Ichiran, and maybe this izakaya is actually known for its ramen and gyoza, so you could order them equally, for example. Now, this basic setup of reinforcement learning has been used to model human behaviour in a lot of different learning tasks, but a lot of studies have also shown that people aren't exactly optimal at maximizing value. And the whole point of my talk, and the whole thesis of the field of resource rationality, is that humans are not optimal at maximizing value; instead, they do the best that they can to learn and make decisions with limited cognitive capacity. Now, this is where information theory becomes useful for defining the problem, because another way to think about policies is as a limited information channel that transmits information about states of the environment to guide action selection. By combining reinforcement learning with information theory, we can model decision-making with a resource-rational approach by introducing the following constraint: you want to find a policy that maximizes value, but subject to a limit on capacity. Specifically, the policy complexity, which is the mutual information between states and actions if you're familiar with information theory, must stay under some capacity limit C. And policy complexity, as we'll explore later, is the amount of information about the state used to select actions. In mathematics, this is what we call an optimization problem, and by solving it, we arrive at a solution: a model that describes how people should behave if they want to maximize value with limited capacity. And this is that equation. So, before this equation scares you, I'm going to provide a translation that hopefully gives some intuition about what it means. What this equation is saying is that in state S, you choose an action A by striking a balance between choosing the most rewarding action in that state and repeating an action that you've chosen most often in the past. In other words, even more simply: information from the environment, or the state, combined with your history of actions, will determine the decision you make in the present. Now, if you're familiar with reinforcement learning, the components of the model map onto familiar RL terms. The optimal policy combines state-action values with the marginal action probability, sometimes called the default policy, and the balance is determined by an inverse temperature parameter beta, which also controls how much the agent explores or exploits. OK, so returning to the more concise definition: what do I mean by striking a balance? How do you decide how much information from the environment to use versus how much to rely on your previous actions? Well, in fact, this beta parameter represents how much you're paying attention to the state to choose your actions, and, in doing so, it's also the parameter that controls the trade-off between average reward and policy complexity. The larger the value of beta, the more complex your policy is, which you can also think of as the amount of cognitive resources you're willing to spend in order to increase your reward. This is kind of like the milk example that I gave at the start.
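(The slide equations are not reproduced in this transcript, so the following is a hedged reconstruction from the verbal description above, consistent with how the policy-compression objective and its solution are usually written; a sketch, not the exact slide.)

\[
\max_{\pi}\; V(\pi) \;=\; \sum_{s} P(s) \sum_{a} \pi(a \mid s)\, Q(s,a)
\qquad \text{subject to} \qquad I(S;A) \;\le\; C,
\]

with the solution

\[
\pi^{*}(a \mid s) \;\propto\; P(a)\, \exp\!\big[\beta\, Q(s,a)\big],
\]

where \(Q(s,a)\) are the state-action values, \(P(a) = \sum_{s} P(s)\, \pi(a \mid s)\) is the marginal action probability (the default policy), and \(\beta\) is the inverse temperature, the multiplier that sets the balance between value and action history.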
All right, if you're further mathematically curious, an interesting result is that the inverse of beta is actually the slope of the optimal reward-complexity trade-off frontier. OK, to give one more concrete example before jumping into the data: here I've depicted a policy where the most rewarding action to take in this state is action two, but in your history of actions you've most often taken action one. So, to compare and contrast: an individual with a high capacity, C = 1, and therefore a high value of beta, would pay more attention to the state and select the action that's best in that particular state. You can see that their decision-making policy ends up looking more like the state-action values. On the other hand, an individual with a lower capacity of C = 0.2, and so a lower beta, tends to select actions based on their action history, and, as you can see, action one is the action they've chosen the most, so that's what's reflected in their decision-making policy. So, one more time: beta controls how much you're paying attention to the state, to the environment, and how much cognitive resources you're willing to spend in making your decisions. Now, to briefly summarize the main takeaways from the model we just built: any resource-limited agent, humans, animals and robots included, must learn and make decisions under a capacity limit. This leads to a trade-off between average reward, or value, and policy complexity. As a result, policies will always be compressed to remain under the capacity limit, and this is why, in our work, we call this a model of policy compression. So, how do we know that our computational model of decision-making is actually an accurate description of human behaviour? Well, we designed simple decision-making tasks for people to do, and we see if their behaviour matches what the model predicts. In our experiments, people see a bunch of images, one after another, on a screen, and they have to learn which of three keys on the keyboard to press in order to earn money. Here's an example of the task progression. On every trial, the subject sees an image on the screen, and then they have to pick one of these three keys to respond with in under two seconds. They receive feedback as to whether their choice was correct or incorrect; they don't know anything about which keys are correct for each image when they start. Then they see the next image, they make another response, they receive more feedback, this time correct, and so on and so forth. The goal of the participant is to get as many trials correct as possible, because eventually that converts into the amount of money they earn from participating in our experiment. Now, each image, which is the state in this task, maps onto an optimal action: the action that will most likely give correct feedback. It's very simple: three different images and three different actions. For each state, or image, there's one action that is optimal, and we can construct a reward matrix, or reward table, that describes the probability of getting reward, or correct feedback, for each state-action combination. So, in this example, every time you see the blueberries and you hit the K key, you will always get green, correct, feedback. Now, we can also play with these reward matrices so that they're probabilistic.
So, what that means is, for example, that now, if you press the K key for this image of blueberries, you're only going to get correct feedback 80% of the time, even though it's still the action with the highest probability of reward. Sometimes you'll get rewarded for pressing J or L, but only 20% of the time. We do this because, one, it makes it harder for people to learn what the best action is for each state, and two, because we can see how people's behaviour changes in response to different reward distributions. Okay, so we have a series of tasks where we change the reward tables and ask whether people's behaviour changes in a way that aligns with our policy compression model. In this first task, we have two conditions, Q1 and Q2, red and blue, and they share the same reward table. But in the blue condition, Q2, one of the states appears more often than the other two states. Practically, this means that people are seeing one particular stimulus image about three times more often than the other two images. And the effect this should have, if they learn it correctly, is that in the blue condition they're going to be using action A1 more often than in the red condition. That's what this little distribution here means: A1 is being used more often than A2 or A3. Okay, so the model predicts that there's going to be more policy compression when one state appears more frequently than others. The intuition is that you can encode high-frequency states with more precision, and the states that occur less frequently with less precision. An example that relates to our running restaurant example: if you visit a certain restaurant more frequently than others, you're more likely to spend more cognitive resources learning which dish is the best thing to order there, compared to a restaurant that you don't visit very often. Behaviourally speaking, the model predicts that when one state is visited more frequently, in the blue condition, people earn more reward with the same policy complexity, and that their choices are less variable. Now, this pattern is not predicted by a standard reinforcement learning model, or by another very successful model that combines reinforcement learning and working memory, the RLWM model. And when the data came in, we indeed saw that people's behaviour matched the predictions of the policy compression model. In the same task, we can also look at how people's choices are biased. This task, again, manipulates the frequency of the images that are appearing, and, like I said, as a result people are picking action one more often in general. So what we should expect is that people will tend to pick the optimal action for the state that appears the most, which is A1 again. The example here: if gyoza is the best thing to order at the restaurant you visit most often, maybe you're more likely to order gyoza at other restaurants too. To analyse this, we can compare how often people pick action one versus the other action that's not optimal but has the same reward value. If people were picking according to the reward probabilities, they should pick A1 and the other action equally often. But here we predict that people would pick action one more often than the other action. And that's exactly what we see, and what's predicted by the policy compression model.
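(To make this prediction concrete, here is a minimal, self-contained sketch in Python of the policy-compression fixed point described earlier, applied to the state-frequency manipulation; the reward table, the state frequencies and the beta value are invented for illustration, and this is an editorial sketch rather than the authors' code. At a fixed beta, the skewed condition should come out with lower policy complexity for comparable reward, and a higher marginal probability of A1.)

```python
import numpy as np

def optimal_policy(Q, p_s, beta, n_iters=200):
    """Fixed-point iteration for the policy-compression solution:
    pi(a|s) ∝ P(a) exp(beta * Q(s,a)), with P(a) = sum_s P(s) pi(a|s)."""
    p_a = np.full(Q.shape[1], 1.0 / Q.shape[1])   # start from a uniform default policy
    for _ in range(n_iters):
        pi = p_a * np.exp(beta * Q)               # combine action history and values
        pi /= pi.sum(axis=1, keepdims=True)       # normalise each row
        p_a = p_s @ pi                            # update the marginal action probabilities
    return pi

def complexity_and_value(pi, p_s, Q):
    """Policy complexity I(S;A) in bits, and average reward V."""
    p_a = p_s @ pi
    mi = p_s @ (pi * np.log2(pi / p_a)).sum(axis=1)
    return mi, p_s @ (pi * Q).sum(axis=1)

# Same reward table in both conditions: one optimal action per state
Q = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

conditions = {
    "Q1 (uniform states)": np.array([1/3, 1/3, 1/3]),
    "Q2 (state 1 three times as frequent)": np.array([0.6, 0.2, 0.2]),
}
for name, p_s in conditions.items():
    pi = optimal_policy(Q, p_s, beta=3.0)
    mi, v = complexity_and_value(pi, p_s, Q)
    print(f"{name}: I(S;A) = {mi:.2f} bits, V = {v:.3f}, P(A1) = {(p_s @ pi)[0]:.2f}")
```

In the skewed condition, the marginal action distribution tilts toward A1, so the mutual information between states and actions drops at the same beta: more compression, and the choice bias toward A1 falls out of the same mechanism.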
So, in the data, you see that action one is being chosen more often than action three and action two in the other two states, and the standard RL model doesn't predict that choice bias. In the second task, we manipulate how frequently actions are chosen by designing the reward matrix such that several states share the same optimal action. Sorry, this laser is fading. But you can see that in Q2 all three states share action one as an optimal action. Now, you'll also notice that in states two and three there are actually two actions that are optimal, so choosing either action will always give you a reward. But as a result, overall, you're going to choose action one more often in general. So the prediction here is more policy compression when one action is chosen more frequently than others. The intuition is that you can encode states that share optimal actions with less precision, because if you choose the same thing in every state, you don't need to think too much about what the state is. The restaurant version: if you have a favourite dish that is served at multiple restaurants, you don't need to think about which restaurant you're at in order to choose your food order. And again, this shows up in behaviour. When one optimal action is shared across states, people earn more reward for the same or lower policy complexity, and their choices are overall less variable. Their choices are actually made slightly faster as well, because if you don't need to think about it, you can just make your choice. Now, what's interesting here is that this overall pattern is also predicted by the standard reinforcement learning model. So you might ask, well, how do you know your model is better than the other model? Again, you can look at the pattern of choice biases to answer this question, because in this particular condition, like I said, if you're just responding to the reward probabilities, you should pick A1 and A2 equally often in S2, for example, and A1 and A3 equally often in S3. But the policy compression model predicts that you would actually pick A1 more often than A2 or A3 in those two states, because it's a simpler policy to follow. So again, if you always order gyoza, you tend to order gyoza at other restaurants, even if there's another dish that's just as good. We can make that comparison, and what we see is that people are indeed picking A1 more often than the other action that's just as good, and that's not predicted by the standard reinforcement learning model. Finally, in a third task, we wanted to know how time pressure changes the way people behave. Here, again, the two conditions share the same reward table, but now, in the second condition, people have to make their response in one second rather than two seconds, so they don't have as much time. The prediction here is that under time pressure, policies will be further compressed. The intuition is that decoding the right action from the state takes time, so less time means less precision. This is pretty intuitive: if you're in a hurry, you don't have much time to think about what you want to order. So here the behavioural signatures are that under time pressure, people have lower policy complexity, usually lower reward, and their choices are made faster. And in our data, we actually find that choices are a bit more stochastic, because under time pressure people choose more randomly.
That's not necessarily predicted by our model, but, again, data are not always as clean as the model. Okay, so time pressure also intensifies choice bias. When you're under pressure, you're more likely to choose actions that you've chosen in the past; you rely more on your action history. This is kind of like the milk example again, where you just reach for the same milk you bought last week: if you're in a hurry, you're more likely to go with your default order. And this time, what we can do is examine the tendency to pick an action that's not optimal. We should see that, in state three, people are picking action one more often under time pressure. And that's exactly what we see: in state three, people pick action one more under time pressure, which is the open bars, and this is just the difference between the two conditions. All right, so for time's sake I will skip this result, but we also find that choice bias increases with cognitive load, and, as a nod to my other interest in computational psychiatry, in another data set we find that patients with schizophrenia show more choice bias under cognitive load than controls do. Okay, so this is all great. We can show that behaviour matches this model, which is pretty cool. But why does this matter? Well, I've thought about a really interesting application of this work, which is that it allows us to design choice environments that help people make better decisions. The logic here is that many people are going about their lives with high cognitive load. We're all busy, we have a lot of things to worry about and think about, but if we understand how people's behaviour is biased under cognitive resource constraints, we can design choice environments to help people make better decisions. This has already been done with qualitative heuristics. One area where it is being done is organ donation. There's this huge problem, in America and worldwide too, that people die every year because they are waiting for an organ transplant that they never get, and a lot of people are waiting for an organ transplant. One solution is to change the policy such that organ donation is opt-out rather than opt-in. Opt-in programs mean that you have to go and sign up to be an organ donor yourself, which is high cognitive load, because it takes effort to go and do the thing. Opt-out programs mean that by default you're already enrolled as an organ donor, and you remain one unless you indicate that you don't want to be. So it's low cognitive load, in the sense that you're already enrolled. Now, this is really consistent with our computational model, where people are biased towards the default status quo, especially under high cognitive load or time pressure. You can also think of this as behavioural inertia: people want to keep doing what they've been doing, even if what they've been doing is nothing, so people want to continue doing nothing, right? And there's this very interesting result where countries with opt-in organ donation policies have fewer organ donations than countries with opt-out policies. When your default is that you're automatically enrolled in organ donation, you'll have more organ donors registered. A second great example is saving for retirement.
So there was a policy change in the last few decades in the US where retirement savings plans became automatic enrollment. This is, for example, 401(k) plans in the US, which are provided by your employer to help you save for retirement. What happens is that when you get hired, you're automatically enrolled in these retirement plans. And they found that automatic enrollment increased participation: 86% of new hires participate under automatic enrollment, which means only about 14% of people opted out. Whereas before automatic enrollment, it took almost 20 years to reach that same level of participation in retirement savings. Automatic enrollment also improves savings outcomes, because people start saving earlier. And that's incredible, because on the y-axis here you can see the cumulative savings as a percentage of the worker's wages, and, as you can see, automatic enrollment is much higher than opt-in policies. So the idea is that computational modelling might allow us to do this even better, to design choice environments that help people make better decisions. There was some recent work from two researchers suggesting an analogy to buildings and architecture: the world saw an increase in the height of towers once people switched from architectural thinking to engineering thinking. The idea here is that quantitative models of choice are potentially more useful than the qualitative psychological principles that the policies I just told you about are currently based on. So here's what we've learned so far. Humans adapt their decision-making strategies under cognitive resource constraints. One, choices are biased by the frequency of encountered states: you're more likely to devote extra cognitive resources to states that you visit often. Two, choices are biased towards frequently chosen actions: people tend to choose actions that they've chosen most often in the past. Three, choice bias increases with cognitive load and under time pressure: you're more likely to go with the familiar choice when you're extra limited in cognitive resources. And four, understanding these biases can help us design choice environments that help people make better decisions. All right, so as a neuroscientist, a natural follow-up question to ask is: where in the brain can we find evidence for cost-sensitive action selection? This is important because, somehow, even with its limited capacity, the brain is still able to generate flexible and adaptive behaviour in many different environments. To tackle this question, we teamed up with Bence Ölveczky's lab at Harvard, which studies motor learning in rats, and specifically the circuit between the motor cortex and the striatum, namely the dorsolateral striatum (DLS) and the dorsomedial striatum (DMS), which some of you might also study. With the data coming out of their lab, we were really able to map our model and its predictions onto what they were seeing in animal motor learning behaviour. I'll tell you the overarching hypotheses first and then show you the data in support of them, even though it was more of a cyclical journey of looking at data and coming up with hypotheses, as you all know. Our hypotheses are: first, motor cortex modulates attention to the external state; in our model, that's the beta. Second, the dorsomedial striatum learns about rewarding state-action pairs.
So it basically learns the values in this state-action value table. And third, the dorsolateral striatum acts as a default policy that stores frequently used actions or action patterns, which you can think of as habits. Our approach, in collaboration with their lab, is to look at a motor sequence learning task, lesion these different areas of the brain before and after learning to see how that affects the animals' behaviour, and then do the same thing in our model to see how it affects the model's behaviour. Okay. So a really talented former graduate student in Bence's lab, Kevin, came up with this piano-playing task for rats. The task has several different kinds of sequence learning. In one of them, the sequences are visually cued: the animal gets cued for each action that it's going to take. So here's a video of an animal doing this task. Oops. Here we go. It's a bit laggy. Uh-oh. Maybe the resolution is not very high here. It's not playing on this computer either; it's kind of moving, but okay, I'll just describe what's happening. The animal sees a light light up, and it presses the lever associated with the light; another light lights up, and it presses that lever; a third light lights up, and it presses that lever; and finally it gets rewarded after three distinct cued lever presses. In contrast, in another task that deals with automatic sequences, the animal is overtrained: it's taught to do one particular sequence many, many times over its entire life, and that's the only sequence it knows. So in the next video, which may or may not show up, there are basically no cues involved; the animal just goes to the levers and starts pressing out the sequence itself. Okay. So this is what we learned from that study. What Kevin did was lesion each of these different areas of the brain, in animals that had already learned these two different tasks, and then look at whether the animals were still able to do each task after the lesion. What happened is that animals with a motor cortex lesion were still able to do the automatic sequence, but they were not able to do the visually cued sequence. Here, fraction success is how many trials they got correct, pre-lesion and post-lesion. Now, when he lesioned the dorsomedial striatum (DMS), surprisingly, he found that the animals could still perform the automatic sequences and could also still perform the visually cued sequences. You might think that's strange: I thought you were deleting this part of the equation? But, like I said, the hypothesis is that DMS learns about rewarding state-action pairs. If you've already learned what you needed to learn and then you cut out that part of the brain, there's no learning involved anymore, so the animals would still be able to do the two tasks. Finally, when he lesioned the dorsolateral striatum (DLS), which is in charge of frequently used actions, as you might expect, the animals weren't able to do the automatic sequences anymore, which required them to execute a habit they'd already learned. But the animals were actually still able to do the visually cued sequences, which is very interesting. So, to recap: motor cortex lesions spare the automatic sequence but abolish the cued one; DMS lesions spare both; and DLS lesions abolish the automatic sequence but spare the cued one. Okay, so we have a host of results here. I got the spinning wheel of death. Oh, there we go, okay. All right, so that was a whole line of work.
That was Kevin's whole seven-year PhD, in fact. But then a new PhD student came in and, with another postdoc in the lab, Chesha and Kea decided to zoom in on the function of the dorsomedial striatum. What they wanted to compare was the role of DMS in learning, because they had a vaguely similar hypothesis. They had two different tasks in their setup. In one, the animal only had to hit one lever to get reward. These were all cued tasks: the animal saw a visual cue light up, had to press the lever corresponding to that visual cue, and then it would get reward. All animals first learned this task, and afterwards they learned the three-lever task. That's the same task as I described before, where animals see visual cues one after another and press the corresponding levers one after another. It's a shame, because these videos are pretty cool, but maybe you can see them later. Okay, so what they found is that DMS is critical for learning sensory-guided representations. In blue, I have the performance of control animals, so non-lesioned animals, and in red is the performance of animals with DMS lesions. You can tell that, even though both groups end up learning the task to about 85% accuracy, the DMS-lesioned animals take way, way longer to learn it. One hallmark signature of this is that they ignore visual cues early in learning, which is a signature of low policy complexity. Now, in the three-lever task, which they do once they've learned the one-lever task, what you can see is that controls are able to learn the task to maybe 50 to 60%, but the DMS-lesioned animals are all over the place and their average accuracy is very low. And as you can see, the red traces are much shorter than the blue traces; that's because those animals actually just give up at some point. So, after eventually learning the one-lever task, these naive DMS-lesioned animals cannot learn the three-lever task. Remember, these are animals that were lesioned before they had learned anything. Specifically, they're unable to break the action chunk of lever-to-water-port. What happens is that the animals keep doing the one-lever behaviour in the three-lever task: they see a light, press the lever, and then go to the water port, even though they're supposed to press the next lever and then a third one before they can get reward. This suggests that a habit has really formed; they're unable to break this habit of pressing the lever and going to the water port. Again, I wish you could see the videos, but it's very distinct behaviour: the DMS-lesioned animals really cannot stop doing the behaviour they learned in the previous task. We used our model to confirm that learning is impaired by the DMS lesion. The way we did this was to first fit the learning parameters of the model on the one-lever data; these parameters are responsible for learning the different state-action values in the model. When we fit the model to the data, we see that the learning rate is much higher for the controls than for the DMS-lesioned animals, which is consistent with our hypothesis. And, as you can see, if you take the model and then simulate behaviour, having the model do the same task as the animals, we can recapitulate the result.
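(A rough illustration of this fitting step: the sketch below fits a single learning rate to toy one-lever choice data by maximum likelihood. The delta-rule learner, the fixed beta, the fixed default policy and the simulated data are editorial simplifications, not the authors' actual fitting procedure.)

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_lik(lr, states, actions, rewards, beta=3.0, n_s=3, n_a=3):
    """Negative log-likelihood of choices under a policy-compression agent
    with a delta-rule value update and a fixed default policy."""
    Q = np.zeros((n_s, n_a))
    p_a = np.full(n_a, 1.0 / n_a)
    nll = 0.0
    for s, a, r in zip(states, actions, rewards):
        logits = np.log(p_a) + beta * Q[s]      # pi(a|s) ∝ P(a) exp(beta Q(s,a))
        p = np.exp(logits - logits.max())
        p /= p.sum()
        nll -= np.log(p[a])
        Q[s, a] += lr * (r - Q[s, a])           # lr = 0 (a DMS lesion) means no learning
    return nll

# Toy data standing in for one-lever trials: cue s, choice a, reward r
rng = np.random.default_rng(1)
states = rng.integers(0, 3, 500)
actions = rng.integers(0, 3, 500)
rewards = (states == actions).astype(float)     # toy rule: correct lever matches the cue

fit = minimize_scalar(neg_log_lik, bounds=(1e-3, 1.0), method="bounded",
                      args=(states, actions, rewards))
print(f"fitted learning rate: {fit.x:.3f}")
```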
You can also take the same fitted model parameters from the first task and then run the model on the second, three-lever task. And you actually get results pretty comparable to the animals' three-lever performance, even though the model has never seen the three-lever task before. Okay, so that's encouraging. So now we can add to the host of results we have observed, and we can also use our model to model the lesions in our own equations. A motor cortex lesion in our model corresponds to setting beta to zero; a DMS lesion corresponds to setting the learning rates to zero; and a DLS lesion corresponds to completely cutting off the default-policy module of the model (sketched in code below). Modelling is very useful here because it allows us to explore a large hypothesis space that we haven't explored experimentally. To give you a sense of that: you have three different lesion sites, and then three different time points at which you could lesion the animal. You could lesion the animal before it has ever learned anything, after it has learned the second, three-lever task, or between the two tasks. And you can look at behaviour before and after the lesion in each of the different tasks. So modelling is very useful for exploring a wide variety of hypotheses that you might not have thought about before. You can do this for all three lesion sites. This is the result I just showed you: if you lesion DMS before any learning has occurred, the animals can sort of do the one-lever task; these are the lesioned animals, and for the three-lever task there's no bar here because it's just zero. The animals can't do the three-lever task at all. Then, going back to the first result I showed you, I'm only going to show the results from the cued task in Kevin's data. Here all of his animals are experts: they only learned the three-lever task. And you can see that the model recapitulates this result, although more extremely: before the lesion, it can perform the task very well, and after the lesion, basically zero, which is consistent with his data. Then, again, for DMS, similarly, there should be no significant change between before and after the lesion for the three-lever task. And similarly with DLS, the dorsolateral striatum, there should be no significant difference before and after the lesion. You can see some idiosyncrasies in how the model deals with things, but generally speaking it's qualitatively consistent. So now you can simulate a host of different hypotheses for different lesion time points and different tasks, and this is really interesting because it allows us to see things that we might not have thought about before. For example, an interesting prediction is that expert animals with a DMS lesion should be able to perform the three-lever task, which is what Kevin showed, but they shouldn't be able to go back and learn the one-lever task after they've been lesioned. That's not something you would think about intuitively, and it's something we could test in the future. Another interesting prediction is that DLS-lesioned animals should always be able to perform any cued task, no matter when their lesion happens. OK, so the big picture here, and I'll sum up, is that the brain balances reward and complexity by penalising complexity during learning.
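(To make the lesion-to-parameter mapping concrete: a minimal sketch of how each lesion could be implemented in a policy-compression agent. The parameter names and the toy cue-lever task are editorial assumptions, not the authors' code.)

```python
import numpy as np

def apply_lesion(beta, lr, use_default, lesion):
    """Map each lesion onto model parameters, following the talk:
    motor cortex -> beta = 0 (no attention to the external state),
    DMS          -> learning rate = 0 (no new value learning),
    DLS          -> the default-policy module is removed entirely."""
    if lesion == "motor_cortex":
        beta = 0.0
    elif lesion == "DMS":
        lr = 0.0
    elif lesion == "DLS":
        use_default = False
    return beta, lr, use_default

def choose(Q_row, p_a, beta, use_default, rng):
    # pi(a|s) ∝ P(a) exp(beta Q(s,a)); without DLS there is no P(a) term
    logits = beta * Q_row + (np.log(p_a) if use_default else 0.0)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Example: an agent lesioned before any learning has occurred
rng = np.random.default_rng(0)
beta, lr, use_default = apply_lesion(beta=3.0, lr=0.1, use_default=True,
                                     lesion="DMS")
Q = np.zeros((3, 3))                  # values start unlearned
p_a = np.full(3, 1.0 / 3.0)           # default policy over the three levers
s = 0                                 # cue for lever 0
a = choose(Q[s], p_a, beta, use_default, rng)
r = 1.0 if a == s else 0.0            # toy rule: correct lever matches the cue
Q[s, a] += lr * (r - Q[s, a])         # with lr = 0, nothing is ever learned
```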
Now, there are a bunch of other equations involved in this model, but the two on this slide are the most important ones. The anatomical picture is that the dorsomedial striatum learns the state-dependent policy, the dorsolateral striatum stores the automatic default policy, and these two get combined to form the full policy. Cortical areas like motor cortex, and prefrontal cortex, which I didn't even talk about but which there is evidence projects to the dorsomedial striatum, modulate attention to the external state. And finally, learning happens through dopamine, which is related to average reward and maybe controls this attention to the external state through cortical projections. All right, so to end, the final takeaways are that cognitive constraints shape our decisions and actions, and any resource-limited agent, humans, rats or robots, must compress its policies. As a result, choices are biased by the distribution of states, actions and rewards in the decision environment, as well as by external factors such as cognitive load and time constraints. And finally, lesion studies reveal that distinct brain areas work together to balance the automatic execution of learned behaviours with the more cognitively demanding ability to flexibly respond to relevant sensory information. All right, so with that, I just want to thank my lab, past and present, all the people I've gotten to work with, and all my collaborators in the Ölveczky lab, which is in the Department of Organismic and Evolutionary Biology at Harvard. So thank you.

Well, Lucy, I want to thank you for an amazing talk. I'm exhausted, but I'm also really engaged, so there's a little bit of balance going on there. I think we have time for some questions, so I'd like to open it up.

Yeah, so they start with something even simpler, which is just going to the water port and realising that's where they get water. And then the next step is associating the lever with the water port, so knowing that if they press the lever, they get water. And then finally, the next stage is cue, lever, water port. Oh, well, one cool thing about the Ölveczky lab is that they've built these automated animal behaviour boxes where they can just let the animal do its thing in there 24/7, basically. But even then, it still takes months to learn these cued tasks. Yeah. Okay, other questions. You have to turn it on, maybe.

Hello. So, thank you for the talk. You gave some examples of using information from these models to help people make better choices. Do you have examples of these things being used to make people make not necessarily better choices? Like worse choices, let's say, or better for somebody else.

The classic example is social media. The whole point of social media, right, is to keep you on social media for longer. And, you know, I don't know what kind of research goes on at social media companies, but I suspect a lot of it is: how do you increase engagement, and what are the psychological principles behind addiction, basically? So, you know, use the power wisely.

I'm going to give an example for any parents in the room. One of the things we've all learned is that if you want your child to choose quickly, give them a choice of two things, not an open choice. So, at its most basic level, that's limiting their options. Yeah, yeah, exactly, yeah.

I have a question on that comment, actually.
Lucy, thank you very much for that talk. It was amazing. I would like to know if you could map on the concept of analysis paralysis, when you have too many options and you don't want to take any, or maybe you just need to be handed the two best options. What do you think of that from the modelling perspective?

Yeah, so I think cognitive load is essentially what that is, right? It's hard to evaluate each particular option, which you can think of as the different states. If you have too many options to evaluate, then you're kind of overwhelmed by the cognitive costs you have to pay to make the decision, and you might as well just go with whatever someone tells you, for example. Something that doesn't require you to think as much.

If you'd really like to take a look at that, Gaston, come into our lab one day and watch us let children choose a prize from the prize box. Not only is it exhausting for them, it's exhausting for the parents. Questions? Yeah, over here. Do you have a microphone near? Hello?

Thank you for the talk. You talked a lot about cognitive load, and I was wondering if you model, or plan to model, the stopping of, how can I phrase this, the animal or the model stopping the decision-making process as a whole, instead of just choosing an action quickly: making the choice to not choose anything at all and continue without reward.

Yeah, so you're kind of talking about the part where the animals had just given up. Yeah, yeah. So in other studies, not this one, I've modelled inaction as an action, so the agent can choose to not do anything, for example. That's definitely something that I could incorporate into the model to make it even more similar to how the animals behave. And, for example, you could think about making inaction even less costly, because it's less costly to do nothing, obviously, than to do something.

I was curious if that was the reason why you called beta inverse temperature. Yeah, so in reinforcement learning, that's a very common parameter that people use, and temperature is a parameter that basically describes how stochastic your choices are. Sometimes people write the softmax with Q divided by a temperature T, but using beta as a parameter is sometimes more intuitive, because it's multiplicative rather than divisive. So that's why it's called the inverse temperature.

OK, any other questions? It's just turned four o'clock, so if anybody needs to be anywhere... yeah, we've got one up here.

Yeah, I'll ask the questions from Zoom. There are two, actually. The first one is from HiHow: what explains the high rate of organ donation in the US, despite it being an opt-in culture? I didn't know that the rate was very high, really, for organ donation. I mean, if I think about it myself, I'm an organ donor in the US, and they do make it kind of easy to check the box. For example, they pair it with another action: when you go do your driver's licence business in the US, they give you a form, and it has one box that says, do you want to be an organ donor? And you're like, yeah, sure. So making choices easy, and again, like Gail said, a very small number of options, might also increase participation, yeah.

OK, thank you. And the second one is from Fabienne: I was wondering about the increased stochasticity in participants experiencing time pressure.
Wouldn't it be caused by the decision-making process being cut short, with whatever the state is at that point ending up being the choice?

Say the last part again, whatever the...? So whatever the thing you had last, whatever state you were in last, becomes the choice, because your time is over. Yeah, people have different strategies for what they do under time pressure. If you think about a keyboard task, when you're limited in time, sometimes people are just pressing keys randomly with three fingers, you know; other times people are jamming one key over and over. So it depends on what each person's time-constrained strategy is, yeah. Thank you.

Yeah, we can hit mine. Yeah, hi. Thanks a lot. I was wondering about the inverse temperature parameter, because presumably the compression should make things less computationally expensive, but at the same time it's just a multiplication, so it doesn't really make the computation per se any easier, does it?

Yeah, so that's related to the information theory part. In reinforcement learning, we think of high beta as meaning less stochastic choices, but in information theory, in this setup, beta also represents how much attention you're giving to the state when you're choosing your action. The complexity here is an information-theoretic quantity, and there are different kinds of costs, right? An information-theoretic cost is more of a memory cost than, say, a statistical cost. I'd have to show you the breakdown of the different types of complexity, but that's a good question, yeah.

Maybe one or two final questions and then we'll let Lucy relax. Nick.

Hello, cheers for a cool talk. I have a broader, maybe slightly philosophical question, I guess. It seems like, from this perspective generally and from your model, there's a kind of normative dimension where doing well or making good decisions is somehow maximising reward or optimising. And I've noticed in the broader culture recently there's a bit of a pushback against this kind of single metric, where we're trying to optimise everything and we've kind of optimised ourselves into a corner where we're just maximising this one thing. And I wonder about the implications of these types of models, which kind of celebrate that tendency, right? My own feeling is that we do a lot more satisficing than we do optimising; we're doing a lot more of "this is good enough, this will do". And when we have really constrained experimental conditions, yeah, we get the kind of feedback that we get when we do these types of things. But in real life, it seems to be a lot more stochastic, or it looks to be, at least.

Yes, 100%. And that's the argument that I have been having with my adviser. And let me tell you, in the full paper, actually, we show that there are two different ways to look at this problem, which have an equivalent mathematical formulation but two different objectives. So, like you said, instead of optimising up to some constraint, you could have satisficing, where you say, this level of reward is good enough for me. Or maybe I don't even care about reward; there's another dimension that I care more about, right?
Although a lot of people now try to define reward in much broader ways; reward might be play, or curiosity, that kind of thing. But at least here, there's another formulation of the model where, instead of saying you have a capacity constraint, you have a satisficing level. So instead of a vertical line at C, this is my capacity constraint, it's actually a horizontal line: this is the amount of reward that I'm fine with. Because sometimes you're like, I don't need to do well on this test, for example, or, I don't need to put that much effort into this because I don't really care about the outcome. And so you adjust your behaviour based on that, even though you have ample resources to do better if you wanted. It's a really good thought.

Final question for Lucy. Okay, well, I think the fact that everybody stayed past four o'clock is an indication that they found being here reinforcing. So I'd like us to now reinforce Lucy in the standard manner.