Hello and welcome everyone to the Active Inference live stream. This is ActInf Stream 8.1. It is November 10th, 2020. Welcome to TeamCom, everyone. We are an experiment in online team communication, learning, and practice related to Active Inference. You can find us at our website, ActiveInference.org. You can find us on our Twitter, at email, on our public Keybase team, or on YouTube. This is a recorded and archived live stream, so please provide us with feedback so that we can improve our work. All backgrounds and perspectives are welcome here, and as far as video etiquette for live streams goes: mute yourself if there's noise in your background, raise your hand so that we can hear from everybody on the stack, and use respectful speech behavior, et cetera. Today, we are excitingly in ActInf Stream 8.1, and it's gonna go a little something like this. We're gonna start with some intros and warm-ups, and then we'll get to the sections of 8.1. In 8.1, as well as 8.2, we're gonna be discussing the paper, Scaling Active Inference by Tschantz et al. in 2019; thanks to Blue for the suggestion of the paper. We're gonna go through the goals, the abstract, and the roadmap, and then we'll have a little notation and math overview. We'll also take a look at the figures with a special focus on what each figure is showing or representing or invoking. In today's discussion, 8.1, and next week's discussion, 8.2, we're gonna be discussing this same paper, so save and submit your questions, and also get in touch with us if you wanna participate. For the rest of 2020, we're gonna be discussing this paper, eight, as well as nine, 10, and 11, which you can learn more about by checking our Twitter. So check it out there. All right, here we are in the intros. Welcome again, Shannon. I'll kick out other Shannon. So for the introductions, we'll go around and just introduce yourself and your location, and we'll roll with it as people join and leave the chat. 
So just say hello in a short introduction and then pass it to somebody else. I'm Daniel, I'm in California, and I will pass it to Blue. Hi, I'm Blue, and I am in Southern New Mexico in Las Cruces, and I will pass it to Sasha. Hi, I'm Sasha, I'm also in California, and I will pass it to Ivan. Hello, my name is Ivan, I'm in Moscow, Russia, and I pass it to Alejandra. I think she disappeared for a second, so let's go to Alex. Okay. Hello, I'm Alex, I'm also in Moscow, Russia, so welcome everybody. Cool, and Alejandra has been on before, so hopefully we know her, but when she jumps back in, we can hear from her. So for these warm-up questions, and also it's pretty cool to have a slightly smaller panel, so I'm really looking forward to hearing everyone's perspective on what they got out of the paper, what they brought to the paper. It'll be really cool to hear about. So the first question is, what made you excited to read this paper and or to learn about these topics? So while people are raising their hand, and feel free to do so, something that initially made me excited to read this paper was just about learning how active inference could be applied to the kinds of problems that people apply other machine learning algorithms to. So that's a starting point. Blue? So I pitched this paper, I hadn't thoroughly digested it when I picked it. I had read the first author's other paper in PLOS Computational Biology, but applying agent-based models and modeling parameters to active inference was really what intrigued me. And this paper I especially liked because of the ability to apply it in a continuous framework as opposed to like something in discrete time steps. And I thought that that was significant and important. Cool. Welcome Alejandra. Do you wanna just give your quick intro and say hi? Oh, nevermind. Anyone else wanna add something that made them excited to learn about this topic or read about the paper? Yeah, thanks again for the suggestion, Blue. 
The second question is, what is your experience or background with computers slash computational modeling, and what is something interesting that you've been learning about or studying recently in the area? And I think that my goal with this question was to show to our listeners slash viewers that there's many computational backgrounds that intersect in active inference. So it's a community that includes some people who are really on the cutting edge of computational modeling, and that's what we see reflected in this paper. So big props to everybody who is reading and talking about it, because it is at the cutting edge of that type of machine learning research. But also there's this deep tradition of action-oriented research and embodied cognition and so many other areas that might not necessarily entail a computational background, or even a computational approach in how they're studied, until this type of intersectional research. So does anyone want to just share something that they've been learning about in computing, or what type of computational modeling they've had experience with? So I can start. A lot of my computer work had been in bioinformatics, which does deal with computers, but it's very different kinds of questions than the machine learning community. However, increasingly there's application of machine learning to bioinformatics. So this was just kind of like a different perspective on machine learning, outside of the biological contexts that I had mainly seen it in. And I'll just throw up the third question in case anyone wants to answer that one, which is: what kinds of problems might machine learning based upon active inference be applied to? So you can just raise your hand if you... Yep, go ahead Sasha. 
Yeah, I would say, I guess to sum all these questions: I have very limited experience in working with computers, but my background is in neuroscience, and I'm always curious to learn what machine learning has to offer for understanding the process of learning and how that can link back to neurobiology concepts. And so I'm interested in understanding these equations and having a better grasp on what the relationships are like. Cool, because learning is really a biologically inspired topic. It's based upon learning in organisms, or the kind of control that organisms enact in their environment. And that's not even getting into the whole neural network slash actual brain debacle. So definitely these kinds of learning questions. Any other thoughts on this one? Blue? So like you, my background is in bioinformatics, but I've branched out since getting out of my doctoral research into doing all kinds of different computational and agent-based modeling type applications, even quantum computing a little bit. This system particularly interests me because, like Sasha, I'm interested in learning how things learn, but also from the beyond-brain perspective: stay in biology, but think about biological systems at a very simple level, like slime molds or something like this. How do these systems learn? Because still there's some element of learning that happens, and when you can kind of deconstruct it into a computational model, then you can start to recognize input and output in simpler biological systems beyond the brain. Cool, and let's return to that multi-scale cognition a little bit later on; I wrote that down so we can definitely return there. Any last thoughts on the warmups before we get into the paper? Very cool. 
All right, so today we are going to be talking about this Scaling Active Inference paper, which is an arXiv paper, and the paper presents its goal, or what it does, as: we present a model of active inference that is applicable in high dimensional control tasks with both continuous states and actions. Our model builds upon previous attempts to scale active inference by including an efficient planning algorithm as well as the quantification and active resolution of model uncertainty. Our model makes two primary contributions. First, we showed that the full active inference construct can be scaled to the kinds of tasks considered in the reinforcement learning literature. This involved extending previous models of deep active inference to include model uncertainty and expected information gain. Second, we highlighted the overlap between active inference and state-of-the-art approaches to model-based reinforcement learning. Welcome back Alejandra. And some of these topics were introduced, or at least initially brought up, in ActInf Stream 8.0. So if you're a little bit unsure what is model-based versus non-model-based reinforcement learning, or what is reinforcement learning, or what is machine learning, it's all good. There's kind of a sequence of ideas that you can follow to understand why this paper is so exciting. And just go to ActInf Stream 8.0 if you're curious. The way that we can phrase what the big goal or question of the paper is, in terms of non-active inference vocabulary, would be: how can we apply active inference to challenging control tasks? And like Blue mentioned, these are the ones that often involve continuous decisions being made. So like turning a dial up or down, or a steering wheel, rather than like chess or checkers or Connect Four, or the Mario Kart example, which now I won't ever forget. 
And then how to connect these continuous challenging control tasks which have elements of planning and all these other things happening and connect these tasks formally to reinforcement learning. Any thoughts or questions on the goal of the paper? Cool. So the abstract, which we'll just run through is in reinforcement learning, RL, agents operate in partially observed and uncertain environments. Isn't that the truth? Model-based reinforcement learning suggests that this control of these partially observed and uncertain environments is best achieved by learning and exploiting a probabilistic model of the world. So just to be clear, this is what traditional model-based reinforcement learning is. Model-free or unsupervised reinforcement learning would be learning the very, very direct relationships between outcomes like reward and states. Model-based reinforcement learning abstracts a little bit above that and it considers how learning occurs in a probabilistic model of the world. Active inference is an emerging normative framework in cognitive and computational neuroscience that offers a unifying account of how biological agents achieve this control. In this framework, inference, learning and action emerge from a single imperative that maximizes the Bayesian evidence for a niched model of the world. And so learning, inference and action are all part of the organism's generative model and we're gonna see how that gets specified and how it plays out. The challenge is that implementations of this process, this integration between inference and action, if you will, have thus far been restricted to low-dimensional and idealized situations. So this could be, for example, playing a very simple game or doing a two-by-two matrix like a prisoner's dilemma or there are other cases where active inference has been used for slightly higher dimensional but still quite idealized situations like reading from a known set of options. 
Here, we present a working implementation of active inference that applies to higher dimensional tasks with proof of principle results demonstrating efficient exploration and an order of magnitude increase in sample efficiency over strong model-free baselines. So just by reading that, we can look forward to some improvements on classical algorithms. And as far as this magnitude increase in sample efficiency, the reason why they're focused on sample efficiency is because these distributions that we want to understand the shape of are not just like a simple curve, there's a lot of dimensions to them and they're very rugged, so small changes can change your height, so to speak, quite a lot, which means that the challenge is to sample effectively, not get trapped in one area or not to oversample one region at the cost of another. Our results demonstrate the feasibility of applying active inference at scale and highlight the operational homologies between active inference and current model-based approaches to reinforcement learning. So what would be the implications of that? Well, it'd be pretty epic if all of the hype and buzz that we hear about machine learning or reinforcement learning or all the places where we know reinforcement learning gets applied, ranging from recommendation algorithms to maps, to just all these types of things, could active inference play a role in that and what would that do for our conception of these kinds of algorithms? And there's probably a lot of directions to take that but that's the equivalence, is if there is a homology and potentially even a superiority of active inference techniques in certain cases over reinforcement learning, then it liberates us from some of the baggage of the straightforward model-based reinforcement learning or the critiques of reinforcement learning but also opens the door to some new ways to implement strategies and understand the systems that we're working with. Cool. 
So the paper has a pretty clear roadmap. This is like driving on Highway 10 in New Mexico, and there's an introduction section. The relationship to further work is actually section five, which we'll get to in a second, but the paper begins with an introduction and a consideration of active inference as a topic. It then moves directly to the model and lays it out in the six subsections 3.1 through 3.6. From a mathematical perspective we're gonna be focusing primarily on one, two, and three, because those are the parts that we've heard the most about and they're the most relatable. So the generative model and the recognition distribution are the P and the Q, mind your P's and Q's, and this is everything that we've been talking about with the two directions between the agent and the world, and whether we think about that in the Bayesian structural computationalist perspective or the enactivist perspective; the whole tale of two densities, integrating internalism and externalism, all these things come together in 3.1. 3.2, learning and inference, is where we're gonna draw the bow together with reinforcement learning and other machine learning algorithms. 3.3 is where we're gonna come closer to control theory, because we're gonna be talking about setting policies for action. Then 3.4 through 3.6, hopefully when Alec is here next week we'll have a little bit more detail there. Maybe we could even see a simulation, or hear about some of the degrees of freedom that the authors had in the model, why they chose to do what they did. The part that will be very cool, as an experimental biologist, is the experimental section, and here it's an interesting approach to do one set of tasks that conveys exploration of a state space and another set of tasks that conveys exploitation. So they're saying like, this model has two hands: it's good at exploring, but it's good at exploiting as well. 
Then there's the connection to previous work, and the three areas that they tie it most directly to with the keywords, as well as in this closing section, are related to deep active inference, which, in contrast to, I guess, shallow active inference, would mean that you're not just minimizing your surprise about your instantaneous sensory information but you have a deep or a counterfactual model that goes through time. So that's like the example of taking a jacket even though you're not cold yet. That's deep active inference. You're reducing your surprise about future states of your temperature under a generative model where if you go outside and it's cold you're gonna get cold. But shallow active inference would be like you wouldn't put on the jacket, or you wouldn't even take it with you, because it wouldn't be a relevant affordance in the one-second timeframe; but it actually is relevant in the one-second timeframe under a deep generative model of temperature. The other two are model-based reinforcement learning, which we've mentioned a few times, and information gain. Information is reduction of uncertainty, and when we're reducing our uncertainty about environmental causes it helps us plan better actions, but we also need to reduce our uncertainty about which actions to take, and that's called a policy, and that's what relates us here to control theory. Then there's a discussion and conclusion. Okay, before we jump into the math and the actual sections, are there any thoughts or questions on the roadmap? All right, here is a partial notation reference, and I just separate them onto two sides, so if you're only gonna look at a few parameters, if your parameter overload kicks in at six, the ones on the left are the important ones. These are the ones that will help you get 80% of what is happening, and then the next 10% are the ones on the right, and then there's a few other ones which aren't on this page that are more like the esoteric statistical parameters. 
The big parameter is S with a hat, just like Sasha with a hat, and that's the true state of the environment. So s-hat is what is actually trying to be predicted, and that's like the actual temperature of the environment or the actual amount of sunlight. But that's not what the agent is able to track directly; it's able to make an estimate of the environmental state, and you can imagine there's times where that's an easy estimate to make and times where it's a hard estimate to make. And this is the difference between the discrete and the continuous case: if the state you're estimating is "is it day or night," that's a zero or a one, a discrete case. A continuous case would be "what is the amount of light," and that's a continuous variable, so there's a lot more options, a lot more shades of gray, infinitely many of them. S with no hat is the agent's internal, ongoing estimate of this environmental state that it's trying to track, and we can think about it with just one S, or there can be multiple S's: like "is it humid or is it dry," that's one axis; another, "is it light or dark." O are the observations; so in the case of light being estimated, that would be the observations of light on the retina, and in the case of temperature, it's like a thermometer reading. A are the policy actions by the agent, and those arise from pi, which is a policy. A policy is like a layer above action. So, for example, an action would be moving my left foot; my policy would be "if I hear this sound, then I move my left foot"; another policy would be "I always move my left foot." And so there's different kinds of policies: they can be very nuanced, they can be conditional, they can be related to each other, or they can be very straightforward and stand alone. T in italics is the time steps of the model; this is just a modeling parameter that's gonna come up again and again, and it has to do with how time is measured, in terms of epochs or states or continually. Okay, then on the right side we have little and big theta. Little theta is often used just to mean a parameter of the generative model, and big Theta is like the space of all the parameters, so those are kind of used similarly. Then lambda is related to the likelihood distribution; it relates some of the statistical parameters. And F is the free energy, and this represents this super-smooth variational optimization landscape: just as you would really wanna be following s-hat but actually have to follow s, with free energy, if we could be following F, then we'd be doing perfectly; you just strictly cannot do any better than that. However, we can't directly track F, so we have to track G, which is the expectation of free energy, which we'll also be able to look at. Okay, any thoughts or questions on notation? Anyone seeing similar notation from another field, or a difference? Is there something that you'd imagine a control theory model would have that isn't in here? All right, anyone can just raise their hand at any time; I'm not sure where several of our colleagues have gone to. Let's start by thinking about the generative model. So this is a generative model, P of a bunch of variables, and this is going to be kind of like what the organism is slash does. We're gonna leave aside for a second some of these avenues that can arise with the agent-niche symmetry, which we've talked about in the context of Jelle Bruineberg's and Axel Constant's work, and there's a lot of other things that we've brought up. Let's just try to isolate what's really relevant, which is that the agent is doing some type of an implementation, or is some kind of an implementation, of a generative model of a couple of things at once. It's generating a model of observations through time (the tilde on top of the variable means "through time"), so observations through time and states through time, which you can think of as a function of policy choices pi and parameters of the model theta. So, for example, one of my parameters in my
theta is like: if I get hit by a bike, then it's gonna hurt. And then my generative model is like: under the policy of looking both ways to cross the road, I predict that my states will include not being hit by a bike, and so I won't have any pain observations. Under that same policy you could imagine other types of state and observation predictions. And then there are a few subsequent mathematical details about the insides of this equation; it's a little bit bracketed, no pun intended, at the top, and that's a shorthand, and then we can explore and unpack it a few more levels. So, like on that first row, we can see that the P of o, s, pi, theta, those relationships, can be unpacked on the other side of the equals sign, the right-hand side, into just the probability of the parameters existing, and then just the probability of the policy existing, multiplied by this big other thing. So why would that be the case? And we won't go through all of them in this detail, but I think this one's really helpful to understand for active inference. And people, just raise your hand if you wanna talk, but just to keep it informational and fun, I'll just keep going until somebody wants to ask a question. So the reason why P of pi is there, the probability of the policy, is like: if the policy is unrealistic, you don't even wanna consider it. You wanna basically take unlikely, impossible, or implausible policies and systematically not consider them. So if I'm trying to get out of the way of the incoming bike, and I'm spending a lot of my time, or my clock cycles, or a lot of my cognitive infrastructure, on action strategies that involve me doing a Spider-Man and swinging from a building, it's so implausible that it's probably not gonna be effective. So we want to have something in our model that represents likely or possible policies being considered a little bit more richly than implausible policies. And similarly for P of theta, which is a little more abstract because it's just a parameter of the model, but it's like saying: if it's an unlikely parameter state, it's also not worth thinking too much about. I don't need to plan for it being negative 40 in California; however, our Russian colleagues might need to plan for that. So it's gonna be niche-dependent and contextual whether they need to plan for it or not. Okay, now let's go into this big Pi notation on the right side of the right-hand side of the first line of equations. This is kind of like a Sigma: Sigma adds up the elements of a series, like sigma of one plus one over two plus one over four or something, and the big Pi is a big multiplier. And it's saying: from the first time step, t equals one, through all of the time steps, big T, do this calculation. What is the calculation? Well, it's the probability of observations given states, and that's the observation at a certain time given the state at that time. So it's like: if my estimate of the state (not s-hat, just s with no hat), if my state is getting hit by a bike, there's gonna be a very likely observation of it hurting. Now, that doesn't say whether it's likely that I'm getting hit by a bike; it's just that, if it is that state, then there's gonna be this observation. So that's the mapping between observations and states, but it's the observation's probability, conditional (with a straight line) on the state being a certain way. Okay, that's the first part. And the second part is the probability of that state arising, conditional on the previous state, the previous policy, and the parameters. And so again, with this bike model, it's like: what is the probability of my estimated state s of t at that time being "hit by a bike," given that I wasn't being hit by a bike before, and my policy was looking both ways, and my theta model parameters tell me that if I look both ways, it's really likely that I'll stay safe, for example. But "stay safe" isn't what's directly specified by the model theta; it's actually specifying the likelihood that a given policy and state mapping are going to relate into the future.
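To gather the pieces just walked through in one place, the factorization being described can be written out as follows (this is a reconstruction from the discussion here; the paper's exact indexing and notation may differ slightly):

```latex
P(\tilde{o}, \tilde{s}, \theta, \pi)
  \;=\; P(\theta)\, P(\pi)\,
  \prod_{t=1}^{T} P(o_t \mid s_t)\;
  P(s_t \mid s_{t-1}, \pi_{t-1}, \theta)
```

The big Pi multiplies one pair of factors per time step: the observation-given-state mapping, and the policy-conditioned state transition, exactly as in the bike example.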
future with a certain state and then separately we can consider what observations arise given a state so it's a little bit like just showing our work or being very clear with what is the actual state of the world s hat what's the estimated state s what are the mappings between policies states and future states that's the right side then what's the mapping between observations and states in that first part okay any thoughts or questions because the rest just kind of continues from here in a similar way okay we'll just keep on you know write down any cool questions that are coming to mind Yvonne yeah go ahead well you're muted first but yes then continue thank you could you please explain one more time what the small theta is it's a parameter of it's a parameter of the generative model but what is it yes great question so we can actually we can go just just to show it in the context of the paper and this by the way is in 3.1 but it introduces some of the other equations earlier the theta are the details of the model that relate the transition distribution so here we're looking at the states and the previous policy the s t minus one and pi of t minus one and how given theta it maps on to s of t so what is theta just from looking at the equation what is theta going to be telling us it's going to be telling us how to map t minus ones state and policy state estimate and policy to the time t states so let's just imagine that it were a linear model so I had a linear y equals mx plus b and that was how we were going to map forward just totally a simple way to do it then theta would be like a vector containing m and b or theta would how many dimensions of the model there are is how tall theta is going to be in a sense but theta are the parameters of the model that say what the transition distribution is so yeah if it was a linear model then theta is kind of like a wrapper around the parameters of the linear model which is the slope m and then the elevation b if this were a 
neural network with a million parameters then theta would have a million parameters but it's kind of specifying the likelihood that the s of t minus one and pi of t minus one map on to s of t so let's just say that theta were one it just always mapped to one for some reason then it's like whatever these two are it's going to map to one that's like you know f of x equals one yeah that's the parameter that's the simplest possible model is whatever the previous states and policy are we just map to one the next one would be f of x equals you know insert your model there and so it is a statistical point about what's the parameter of a model and like aren't these all parameters aren't they all variables aren't they all random variables aren't they all going to be variables in my python code yes to all of them and that's why there's a little bit of nuance and that's actually why we go into the details here because you can get pretty tangled up and confused in the gray zone with what's a estimator and what is the what is the scientist estimating what is the organism estimating and a lot of times when we just look at it it's a little bit clearer and we can think about a few layers of unpacking with the top most level of understanding is this big p top equation then we can unpack it and define well what does that really mean how do we pull out the likelihood of the parameters like the likelihood of data from the likelihood of policy from this time-dependent series how do we define the distribution of p of o of t given s of t like the second line and so each of these we get continue unpacking and then each of them get specified in functions in programming which hopefully we'll hear from Alec about next week and there sometimes it becomes another level of clarity when you look at the function definition and you can say oh this function is actually taking in variable a, b, and c and then instead of seeing them as just o sub t, s sub t the variables name could be states through 
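To make theta's role concrete, here is a toy sketch in Python; the linear transition and all the numbers are made up for illustration (the paper itself uses neural networks for this mapping):

```python
import numpy as np

def transition(s_prev, pi_prev, theta):
    """Predict the next state estimate from the previous state and policy.

    Toy linear model: s_t = m_s * s_{t-1} + m_pi * pi_{t-1} + b.
    Here theta is just the wrapper [m_s, m_pi, b]; for a neural network,
    theta would instead hold a million weights, playing the same role.
    """
    m_s, m_pi, b = theta
    return m_s * s_prev + m_pi * pi_prev + b

theta = np.array([0.9, 0.5, 0.1])               # made-up parameter values
s_next = transition(s_prev=1.0, pi_prev=0.0, theta=theta)
print(s_next)                                    # 0.9 * 1.0 + 0.5 * 0.0 + 0.1
```

The degenerate model mentioned above, f of x equals one, would just be a transition function that ignores its inputs and returns 1 no matter what theta says.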
So I know that, especially if you haven't looked at these equations in a long time, or perhaps many equations in a long time, sometimes you get halfway down the page and it's like: what, what is mu of theta again? But it is referring to something specific; it's just a little bit of a shorthand to make sure that it fits visibly in a containable space. So, any other thoughts? But good question there with the parameters. So just one of the other... yeah, Sasha, go ahead. Yeah, thank you, that was really useful to walk through those steps, especially with the bike example. And it kind of brings to mind cases of when something is very unlikely to happen, but it's certain to have really bad outcomes, so, like, you know, it's really going to hurt if you get hit by a bike; versus things that are very likely to happen but are not certain of having a bad outcome, like, I don't know, stubbing your toe on something. And so these are things that are weighted differently in a model, based on the certainty and the severity of the outcome. So I think that's really cool that this is all incorporated into one model. Yeah, and when we specify why we're making certain decisions: first we're just laying these equations out to describe our little control theory robot playing Mario Kart, or playing control theory games on an N64. But what if people had a little bit more nuanced discussion on collective decision making, and topics like: somebody thinks, well, I think it's totally unlikely that the minimum miles per gallon of cars could be 40 miles by 2030, and someone else says, that's super likely actually, it's totally possible, it's totally likely, I think that it can be done. Or: I think that this other policy that you're suggesting is implausible, or that it will have a bad outcome. And so people can disagree on whether a policy is actualizable or not, for various reasons; it might come across as an ethical constraint or a legal constraint or a logistical one, whatever it happens to be, but for whatever reason, that P of pi is low for them. And then we can unpack that and say, okay, all these different elements are there, depending on the situation. And then somebody else might say: I'm with you on that outcome being unlikely; I just don't think it's that big of a deal. Saying, oh yep, I think that that kind of a tsunami will happen once every 100 years, but I think that my transition matrix from tsunami damage to being repaired just says that that's an easy transition, so we don't really need to plan so heavily around it, because when it does happen, we'll be able to rebuild quickly, for example. So these just allow us to give a little bit more detail into our infrastructure, really our architecture, of decision making. And so I think, Blue, that's definitely where we can take it back to this multiple-scales-of-decision-making question. Just to look through a couple of these other equations before we move to Q: here is the probability model of observations given states. And so again, that's like: if my estimated state (not s-hat), if my estimated state is getting hit by a bike, or stubbing my toe, the observations will be such-and-such in my proprioceptors. And that is given by a normal distribution, which means a big N on the variable o of t and, after a semicolon, parameterized by mu and sigma squared, which are the mean and the variance. And those together, mu and sigma squared, it turns out that those are the hard parts to estimate, just like the challenge of doing a linear regression on a billion data points would be estimating the mean and the variance. So you could fit the mean by adding up all the numbers and then dividing by the amount of numbers, and for fitting the variance you could use like a least-squares model; but still, that would be a computation that you could do on pen and paper but would just be very hard. In this case, we're going to be doing what is called amortized learning, using a functional.
A functional is like a function of another function, and what we're going to be doing is estimating these statistical parameters with this function lambda, and then we're going to fit a neural network for that function lambda. And so even though neural networks are often understood as providing additional degrees of flexibility — like being able to learn conditional relationships between data; this took me a little bit to think about — the neural network is actually just facilitating us in this model. It's not saying it would have to be this way for all future active inference models; it's actually still just fitting a neural network model of a normal distribution. So we're finding the parameters mu and sigma, which are very difficult to estimate, as you can imagine, for a nonlinear control task, but we're still fitting them with a function around a normal. So it's this interesting juxtaposition of using advanced techniques like a neural network — highly parameterized techniques — to actually drill back down to a mean and a variance. And then this p of o given s is the first part of this factorization, and s of t given the previous state and the previous policy is the second part here. So this part gets estimated with f of lambda, and then the second factor of that series through time has another normal distribution, with mu of theta and sigma squared of theta, and those two get wrapped up into a functional over them. Similarly, s of t on the right-hand side is what's being conditioned on, and we're going to be estimating the mean and variance of what is conditioned on; and here it's similar — we're making a function that estimates what is being conditioned on. Okay, then we have the early part of the equation, which is the probability of the policy. It's going to be pretty simple in this case, but this is where you could have a lot more degrees of flexibility. This is sort of an initial skeleton paper as far as what could be done with this framework.
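Stepping back to the amortization point above: the idea of a learned function that maps a state to the likelihood's mean and variance can be sketched in a few lines. The tiny network below, and the names (`f_lambda`, the layer sizes, the 2-dimensional state), are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer network standing in for the amortizing function:
# it maps a state s (2-dimensional here, purely for illustration) to the
# parameters (mu, sigma) of a Normal distribution over observations.
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)
W2, b2 = rng.normal(size=(2, 16)), np.zeros(2)

def f_lambda(s):
    h = np.tanh(W1 @ s + b1)          # hidden layer
    mu, log_sigma = W2 @ h + b2       # two outputs: mean and log-std
    return mu, np.exp(log_sigma)      # exponentiating keeps sigma positive

mu, sigma = f_lambda(np.array([0.3, -0.7]))
```

The point of the exponential on the second output is exactly the "drill back down to a mean and a variance" move: however flexible the network, what comes out is just the two parameters of a normal distribution.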
And that's kind of cool — that's actually what defines it, in some ways, as being research, as being cutting edge. This isn't the fine-tuning of parameters; this is laying out the framework that's going to help us in a lot of other contexts. And then the probability of the policy — this is the part where we actually get to free energy. So if we remember from the notation slide, f is the free energy, where we strictly couldn't be making any better decisions if we knew it, but g is the expected free energy. Just like s with a hat is the real state of the environment — what we'd really want to track — and the free energy f would be what our policy decisions would strictly want to converge upon, we don't get s hat, we get s, and we don't get f, we get g. What is g, and where does it come into play? Well, the probability of a policy is given here. Sigma here is actually not the statistical parameter sigma squared — it's kind of annoying, but this sigma is the softmax function, which is basically the make-a-choice function in math: s for softmax here, sigma for variance there. And what it's applied to is the negative expected free energy. Sometimes there are negatives and logs getting thrown around, and these are all just ways to say: pick the best, given the expected free energy of each policy. And this is how we get to certain statements, which have definitely been discussed on this livestream before, like "the agent picks the most likely policy." And it's like — wait, but is it likely for me to win or lose? What if I know it's unlikely for me to win? If it's one in a hundred that I'm going to win, how is that me picking the most likely policy? But that's actually what is happening. It's not just simply the likeliest policy — there's a little more nuance, which we'll look into with g. Okay, so that was sort of the p side. Let's move through the q side a little more quickly.
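Before moving on — the policy-selection rule just described, passing the negative expected free energy of each policy through a softmax to get selection probabilities, can be sketched directly. The three G values are made-up numbers for illustration:

```python
import numpy as np

def softmax(x):
    z = x - x.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

G = np.array([4.0, 2.5, 6.0])    # hypothetical expected free energy per policy
p_pi = softmax(-G)               # lower expected free energy -> higher probability
```

The probabilities sum to one, and the policy with the lowest G gets the highest probability — so "picking the most likely policy" means the policy the softmax makes most probable, not the policy under which outcomes are easiest to predict.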
We will recognize a lot of the variables, and also a lot of the structural relationships between equations. Here's the q. So we have p and q between the agent and the environment — in a sense these are the two distributions linking them. Not that one is being computed by the other, but thinking about it in that Bayesian computationalist framework, where the data goes into the model, the model goes to the hyperparameters, and the hyperparameters generate — hey Shannon, welcome back — going back to the observed states of the world. So q is going to be distinguished from p. P is the implementation of the generative model of the world, whereas q is going to very distinctly have different parameters: instead of p of o, s, pi, theta, here we have q of s through time, policies, and theta. So this is a slightly simpler mathematical formulation, and it gets unpacked as the states through time, the policy, and the model parameters. Again we see the likelihood of the model parameters and the likelihood of the policy coming out in front, and these are similar to the p — but p doesn't have to equal q. And then there's a slightly simpler piece here, which is just about the states given the observations. This is like: literally given the retinal shadows, what is the state estimate of the world? So in p, if we look at just this second line, that is the probability of observations given states — given that it's daytime, there should be a lot of photons; if the estimated state of the world is nighttime, there shouldn't be any photons: p of o given s. But it's o given s there, and in q we actually have s given o. So the p is different from the q, but the order has also switched. This is called a recognition distribution, because we're basically doing an estimation of the states given the observations. So q is kind of the incoming way that the observations get connected to the state
estimators of the world — that's the recognition. But then the generative model, which is the p, entails understanding what the hidden states of the world are — s, your estimates of them, that is — and then generating out observations that are linked up to policy. And here there's a slightly different way that the parameters of this model and the policy mapping are handled, with a few more statistical parameters. But similarly — and we don't need to go into too much detail here — these are also going to be distributed using some statistical distribution that helps us estimate them. And similarly, we can take all the statistical parameters — for a normal distribution it's a two-parameter model, mean and variance, but you could have distributions with one or with more than two parameters — and whatever number of statistical parameters you want to estimate, we're just going to wrap them into a function and figure out some way to estimate them. So that's a degree of freedom you can think of in the model: you could swap out this normal for some other type of distribution, or you could have another way to estimate the distribution's parameters. Okay, that's p and q — and it is really important and interesting what they are. Any thoughts or questions, or shall we just keep cruising? Okay, yeah — Alex? Yeah, I want to ask, maybe in these terms: so p is for the generative model, and q is for the recognition distribution, the recognition model. As we discussed in a previous session, a generative model exists, or acts, in dynamics — you have a generative model in some way; it was the example with dominoes, where you act in accordance with this generative model. And in the example of the road and a bike, this generative model defines the way you look both ways and then go if you don't see a bike. In these terms, what would the recognition distribution, the recognition model, be? Would it be understanding what a bike looks like, or what the road
looks like, and so on? Or is that the wrong direction for my understanding? A good question — and actually I had emailed Alec to clarify a few things, because I looked at this q distribution and thought: wait, it's a recognition distribution, but where are the observations? How can q be recognition when there are no observations? And it turns out there are observations related to q, but he said it's basically the norm in the recognition-distribution literature — this part of the Bayesian computational field — that these are what they call the two models: the recognition distribution, which goes from the observations to the generative model's parameters (and which, ironically, by convention is written without the observations), and the generative model, which goes from the organism's parameter estimates of the world to predicted observations. And that's how expectation maximization works. Now, we had all these interesting discussions, which you're alluding to, Alex — for example, does the organism have an internal model that's generative, or is the generative model enacted in terms of the relationship? That's the internalism-versus-externalism discussion we had. So I'm going to try to combine what Alec informed me about, and also channel a little Maxwell here, and I think the answer is going to be: there are multiple philosophical lenses you could apply to real-world situations, and this is only what is specified in the math and in the programming. So if you include in your recognition distribution the luminosity and the edges and the color, and those are the observations, and you're doing image classification on recognizing a bike — simplifying it, but there it is — then that is strictly the only thing in the recognition distribution. It's going to be about the mapping of the states given observations, and your observations could include those variables we just talked about. So if it's being conditioned
on that, then that's what's happening. And then, even given a single formalization — like, okay, the observations are the overall brightness and the amount of movement, or whatever it happens to be — it could still be a pluralistic philosophical scenario, where somebody says: okay, that's a modeling convenience, but really those observations are in the space between the organism and the environment; and someone else says: yep, it's a modeling convenience, but I think those parameters are something about the neural firing; and someone else says it's both of them; and so on. So I would be open to correction, and I would also love to hear Alex's perspective on this, but I think the answer would be that the papers we read — A Tale of Two Densities, and the one on integrating externalism and internalism — are actually about how, by framing it within active inference, we move beyond those philosophical debates. Or, another way to say that: we build a bigger tent that includes both of those perspectives in a way that's compatibilist, because those are the debates people had about these models. Like in the reinforcement learning literature: okay, is the reward the signal? Is a pleasing aesthetic body form the reward, or is it the signal of the reward? And if it is the reward, isn't that a circular definition — organisms seek out what's rewarding? But it's not like you just want to see the aesthetics; there might be some second intention, for example. So how do you actually draw out these questions — is it happening in the organism's head with reward, or is it enacted? That was sort of the philosophical groundwork, and now we're, for a minute, putting those bags of all different stripes down and looking at just the way we're going to specify it and how it plays out in simulations — knowing that one person might have a Bayesian structural-computationalist perspective on the examples we're going to look at in a second, and somebody
else might take a heavily enactive interpretation and say: oh well, the cart that's on the hill — it's like the cart-hill combo organism that's being modeled, or something like that. But in the end, whether a person thinks it's enacted or thinks it's just a Bayesian computationalist approach, they're still going to at least be able to point to the same equations, and ultimately the same code as well. So that's, hopefully, an explanation of how the active inference framework and its specifics end up unifying these general debates — which, without such a unifying framework, would potentially be intractable, because somebody would say: okay, well, I think that q is happening in the brain of the organism; somebody else says: I think q happens in between the niche and the organism; someone else says: I think q is a modeling shorthand; someone says: I agree with persons one and three, or one and two. And now we're saying: okay, they're kind of all correct — all ways we can phrase the specifics of what we're talking about here. Not that everyone's correct because there is no wrong answer, but under this reconceptualization of active inference we find that some of these previous questions or phrasings become yes-and instead of either/or. Okay — to move a little more quickly through these last few equations before we highlight the experiments, and then probably return to the multi-scale question we talked about earlier, to what other areas control could be interesting in, and to what questions we'd want to ask Alec and other attendees next week. Here is equation one, and I put this after because I think it helps to first understand p and q and their mappings, but this is what is ultimately being minimized. So again: we don't get f — that's the real free energy, the one we would really want to be converging towards, which, as they put it, means converging towards an approximation of the posterior
distribution of the generative model p over states, policies, and parameters, given observations. That would be like the best possible state trajectory through time and the best possible policy given the observations of the world. You know: given that Bitcoin is at this value and Ethereum is at this value, the state trajectory that's best for me entails this policy. So that's clearly what we want to know, but it is intractable, so we need some type of convergence towards this intractable quantity. And this is where a lot of the more specific technical details come in — for example the Kullback-Leibler divergence, the informational distance between two distributions. Those distance metrics are bounded at zero: if two distributions are identical, their divergence is zero, and as they get more and more different, that number strictly increases. So it turns out there are ways to posit a hypothetical perfect distribution, take the divergence between where you really are and that hypothetical perfect distribution — which is always non-negative — and then try to converge towards that perfect distribution by minimizing the divergence. That divergence is informational, but it still behaves like a distance in a certain way. I wrote out a few details: the E is the expectation — the expectation over states through time, policies, and model parameters — and it's going to be something that combines the q and the p. The natural log makes it a little easier to apply machine learning techniques, with a few other benefits, scaling things and bringing certain values below the number line. And that is always greater than or equal to — that's this part right here — the log evidence of the observations. So it turns out that by using this estimator we can bound certain quantities — there are a few things that get bounded, and they're not even all phrased here.
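As an aside, the Kullback-Leibler divergence just described — zero for identical distributions, strictly positive otherwise — has a closed form for two univariate Gaussians, which makes the "distance to a hypothetical perfect distribution" idea easy to see numerically. This is a generic illustration, not code from the paper:

```python
import numpy as np

def kl_gauss(mu_q, s_q, mu_p, s_p):
    """KL(q || p) for univariate Gaussians q = N(mu_q, s_q^2), p = N(mu_p, s_p^2)."""
    return np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

d0 = kl_gauss(0.0, 1.0, 0.0, 1.0)   # identical distributions
d1 = kl_gauss(0.0, 1.0, 1.0, 1.0)   # means one unit apart
d2 = kl_gauss(0.0, 1.0, 3.0, 1.0)   # means three units apart
```

Here d0 comes out to zero and the divergence grows as the distributions separate (d2 > d1 > 0). Note also that KL is not symmetric in its two arguments, which matches the remark later in the discussion that q and p "are not usually exchangeable."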
Yep — so the last thing I wrote there was that, between p and q, you get all the parameters. So what you can do, by optimizing the recognition density and the generative model back and forth — whether discretely, like a two-cycle engine, as in expectation maximization, which is how it gets implemented in code, or dynamically and continuously, as in the case of active organismal behavior — is tractably learn and condition on that mapping of how observations, states, and policies are linked, in a nuanced nexus where they're all taken into account. Rather than running a classifier that goes from just observations to states and then a second-layer model from states to policies — which would be more along the lines of traditional machine learning — this is a different approach. Okay. Then, again: we don't get f, we only get g. So here is another phrasing of f being played out through time. Building on equations three and four, we're looking at the variational free energy not just for an instant — this first one is a kind of timeless understanding of free energy; you don't see t in it, it's through time with the s tilde and the o tilde, not at a specific time — and then this one is actually a function through time: this is observation t, that specific observation. And here's where we see the divergence — the D, the KL divergence between the q and the p — and those are not usually exchangeable: it's not symmetric, the way driving between two cities would be. So this section — and again, you can unpack what the equations actually say — this learning-and-inference section talks about how to do inference and learning, updating these parameters based upon the data. Here is where we get to g. We don't get f; we only get to estimate g. So the way I broke this down was: we want to do more than just update our internal model to bound
surprise on observations. If all you wanted to do was test-retest accuracy on an image-classification dataset, all you'd have to do is bound surprise on cross-validated versions of that data, or on subsampled testing and training splits. But we want to do a few different things: we want to reduce our uncertainty in the future — that's tau, the cursive t, however you'd say it — in terms of the policies we take now. So it's quite different from just minimizing our surprise about a linear regressor given a static dataset. We want to reduce our uncertainty about observations in the future as a function of the policies we're taking now, and this minimizable, tractable expected free energy function g takes as arguments a policy and a future time point. So: as a function of the policies that are available, how, say, two years in the future, am I going to manage certain aspects of life? And it's going to be an expectation conditional on several things — mainly, we can highlight, conditional on policy. We're taking an expectation conditioned on policy because we want to know something about the states of the world, and the observations, and the estimated states of the world, given a policy. And then that is going to be strictly bigger than — and the natural log means, basically, worse in some ways, though it depends a lot on whether it's positive or negative, whether it's a negative log or the subtraction of a positive log; that's kind of confusing stuff sometimes — what it's always bigger than is the negative expectation over the observed states in the future given a policy. Here's me observing that I have 50 bitcoin — that's this part, the p of observations given policy. What I'd really like would be some type of model that just tells me what states I can expect in the future, what observations I can expect in the future — which relate to the
states, because of estimating the states given the observations — but here we're going for the observations given the policy, and it's the log of that. And then this last part is kind of the important part, which is the log of the generative model — the observations given the policy. That's how things will go: well, if we take this policy on miles per gallon, then we will end up between, you know, 1.5 and 2 degrees Celsius. Now, there are so many degrees of freedom that go into that, but that's what we'd want to estimate: relating how policies are literally going to play into observations, which then tell us about states of the world — estimated, like s without a hat — which should hopefully be related to s with a hat, the real states of the world. But still, the focus here is on the observations conditioned on policy. That takes us to policy selection, and they write that policy selection is achieved by updating q in order to minimize f via g. And the need for this paper was that when you have a short temporal horizon, you can just run out every possible policy — every possible set of five chess moves, or every possible set of tic-tac-toe moves (there can't be more than nine moves in a game). If the branching is small and the time horizon is short, you can evaluate g in full, so rather than sampling over a distribution it becomes literally just calculating a table and sorting it. However, when you have deep temporal horizons, you can't really evaluate the full branching tree, because there are too many options, so you have to sample from that distribution. That's the sampling problem — the deep counterfactual sampling problem, which is even more difficult under uncertainty. The second part is the Mario Kart paradigm: in continuous action spaces there are infinite policies, meaning some other method is going to have to come into play — because not only are we sampling deep through time,
we're sampling potentially related to counterfactuals — and there are infinite counterfactuals — and we're talking about continuous action spaces, where there are infinite policies. So if my only trading affordances are: should I hold zero or one or two bitcoin, then I can ask which of those three options, discretely, is better. But if it's continuous and I can pick values at an even finer breakdown than one, then should I do 1.0001? What if that's very different from 1.0002? Now, you might say those two should be super similar, and you should be able to just sample every little interval and estimate the distribution — right, if it's a smooth distribution then it's going to be easy to optimize; maybe you could discretize it, or use some other smooth-optimization method. But the problem is that actions and policies are not smooth. Maybe going into that turn in Mario Kart you could aim at the corner and take it as fast as you can, or you could slow down and take a slightly different angle, or you could do some other skidding maneuver — multiple separate ways to take the turn, potentially with different risks and rewards, all very nuanced. So those are the deep-through-time and counterfactual aspects, respectively — the counterfactuals aren't dealt with as much here, but the depth through time absolutely is — and then the continuous actions, which are really the key introduction of this model, taking it out of the paradigm of classifying images and putting it into the paradigm of continuous control theory. And then it turns out there are a lot of trajectories through time, given how things can go, so there's an implementation of a way to sample trajectories that uses what's called a Monte Carlo sampling approach, or MC — and there's also Markov chain Monte Carlo sampling, MCMC. Monte Carlo just means that it's a card-playing
reference in English. It's like: the number of ways you can shuffle a deck is so big, how are you going to estimate the probability of getting a royal flush or a pair of fours? Instead of actually multiplying out the exact combinatorics, the way you do it is you just sample randomly, and then you say: okay, I took a billion samples and one million of them had this feature — and you take the ratio as your estimate. So it's a sampling method, and that harkens back to earlier, where we wanted to talk about efficient sampling, because that's kind of what it's about. In the Monte Carlo sampler, which could be Markov chain based, we want to get good sampling; we don't want to sample only the hands with a royal flush, as we would with a poor randomizer, because then we're going to get an estimate that's wrong. All right. Then they go into a little more detail about g, the expected free energy, which is still, as before, about policy in the future, and now it gets broken down — and this is also where they draw some of the homologies to reinforcement learning: the extrinsic value being kind of like the reward, plus the epistemic, information-gain term. So this is where explore-exploit also comes into play: if one is only optimizing locally, one may be exploiting — getting reward for a short period of time — but it doesn't necessarily lead anywhere good. We'll see it in the hill-climber example: if you only seek reward and only exploit, sometimes you don't do very well, but if you only explore, you might never get to exploit anywhere good. And so the whole trade-off, the whole challenge of modeling, is this balance between explore and exploit, between searching deeply and searching broadly. The holy grail would be the right way to search deeply down the routes that are interesting and informative, but not down the routes that tell you little.
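The card-deck estimator just described can be written out directly. As an illustrative aside (not from the paper): estimate, by random sampling, the probability that a five-card poker hand contains at least one pair — the analytic answer is roughly 0.493, and the sampled ratio converges to it:

```python
import random
from collections import Counter

def has_pair(hand):
    """True if any rank appears at least twice in the 5-card hand."""
    ranks = [card % 13 for card in hand]   # encode cards as 0..51; rank = card mod 13
    return max(Counter(ranks).values()) >= 2

random.seed(0)
deck = list(range(52))
n, hits = 100_000, 0
for _ in range(n):
    if has_pair(random.sample(deck, 5)):
        hits += 1

estimate = hits / n    # the Monte Carlo estimate: just the ratio of hits to samples
```

With a biased sampler — say, one that over-samples hands containing pairs — the same ratio would be systematically wrong, which is the "poor randomizer" point: the quality of the estimate rests entirely on the quality of the sampling.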
But we don't know which paths are going to be informative until we go down them. However, if we have a deep generative model, we can say: in terms of me winning this chess game, this branch is not informative to search down — not because there aren't trillions of options, but because I have two rooks and they have no pieces other than their king, so I'm going to win. In terms of winning, it gives me no information to keep searching down that route; I already know I'm going to win. And there might be another avenue with fewer options per se — less branching — but more resolvable uncertainty, more info gain possible with respect to the policies and winning. All right, let's do a little bit on the experiments before we close out. The tasks they go into — hopefully all you experimentalists, like Sasha, will like this part — are the mountain car, the inverted pendulum task, and a hopper task. The mountain car is an example of exploration. What's being learnt is the policy relating position and velocity, and the reward is getting higher up. The car doesn't have a strong enough engine to just drive up to the flag, so it's got to drive up, and the perfect policy would be: right when you're stopped, flip the engine into reverse, keep it in reverse until you stop on the other side, and get a little higher up that way. This shows you why you can't locally optimize, why you need to explore. Because, let's see: you're driving uphill, you're slowing down, you start slipping back, and you say, no — stay on the gas, keep accelerating forward. Now you're slowing your descent, so when you get to the bottom you're not moving that fast, and then when you're on the other side you'd say, no, no, I still want to get to the flag.
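The dynamics being described can be simulated in a few lines. Below is a sketch using the standard mountain-car update equations — the constants are the ones commonly used in the OpenAI Gym version of the task, so treat them as assumptions rather than the paper's exact setup. It contrasts the greedy "always push toward the flag" policy with the energy-pumping "push in the direction you're already moving" policy:

```python
import math

def run(policy, steps=300):
    """Simulate mountain car; return True if the car reaches the flag (x >= 0.5)."""
    x, v = -0.5, 0.0                      # start at rest near the valley bottom
    for _ in range(steps):
        a = policy(x, v)                  # action in {-1, 0, +1}
        v += 0.001 * a - 0.0025 * math.cos(3 * x)   # engine force + gravity
        v = max(-0.07, min(0.07, v))      # velocity limits
        x = max(-1.2, min(0.6, x + v))    # position limits
        if x >= 0.5:
            return True                   # reached the flag
        if x <= -1.2:
            v = 0.0                       # inelastic wall on the far left
    return False

greedy = lambda x, v: 1                    # always accelerate toward the flag
pump   = lambda x, v: 1 if v >= 0 else -1  # accelerate with the current velocity
```

The greedy policy never reaches the flag — the engine is too weak — while the pumping policy, which spends much of its time driving away from the goal to build amplitude, succeeds. That is the sense in which exploring (increasing your range) realizes the goal here.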
And so you'd end up being very shallow, focusing a lot of your time on the middle and on the right side. So if you just simply move toward the flag, you're probably not going to get there if the engine is too weak. However, if you actually say: I want to explore, I want a policy that helps me increase my range — so I go as far as I can this way, and then, as soon as I hit my maximum, I enact a policy switch and try to explore further the other way, and back again — wow, I might actually reach the goal purely as a function of exploring. Now, again, this is a little model situation where it turns out that exploring does help one realize the goal; that's not always the case, but it isolates this feature: the model needs to be able to explore. In the second case, the exploit paradigm, it's usually a little more obvious what needs to be exploited and whether you're doing it well. In the inverted pendulum task, there's a car on a rail with a heavy mass m on top of an inverted pendulum. The car can move backwards, which moves the joint to the left and pushes the pendulum — in this case straightening it out — whereas if the car takes off to the right, the pendulum is just going to fall. The reward is the height of the pendulum, so the agent tries to maximize its reward by enacting a policy, and the best policy is to keep exploiting that perfectly straight-up position, where you're in the zone of optimal control. And the other exploit task, on the right, is about increasing the distance traveled by a little jumping thing — the hopper. Does anyone else have thoughts on explore-exploit? Yeah, go ahead, Blue. So this I thought was super interesting as I was unpacking this paper, and maybe I should just lay a little of this out there: the reward is only given in this
extrinsic-value term, which is the exploit portion of the free energy equation, as given in equation six. And — I don't know, you didn't say it explicitly — but here, in the exploration experiments, the reward just enters through that exploit term; it doesn't drive any exploration via the state or parameter information gain. And then in the exploit task, the agent only uses that exploitation term, so there's no exploration. I just thought that was super interesting — what would happen if they included exploration in the exploitation paradigm? Why did they leave that out? It just seemed kind of weird to me. Well, good question. That makes me think: let's say there are some parts of the railroad track at different angles, and so, yes, you want to keep the pendulum upright, but you might also want to explore around a little to find a region where you can control it better. That actually makes me think of doing a track stand on my bicycle — there are some slopes of pavement that are easy to do a track stand on, and some that are really hard, but to get from a hard position onto an easier patch of pavement you have to explore, and that means breaking the track stand. So definitely, real control tasks involve the combination of explore and exploit. Why did they use these two? I think the reason — and let's hear from Alec as to the real answer — is that these are part of the OpenAI benchmarking set of little micro-computable benchmarks for learning algorithms. Here's the OpenAI site — let's see if we can even open it — yeah, here's their website, and there are a bunch of documented cases that allow comparison between different machine learning algorithms. So I think they're just outlining the minimal, single-dimensional, sort of extreme classic test cases. There are a ton of cases where we can already see it wouldn't work, like a double pendulum
that's chaotic, or other kinds of things this model just doesn't even try to do. But this allows direct comparison to other algorithms on the exact same task, instead of inventing something that would either not take advantage of active inference's strengths or be kind of an unfair example by making it too nuanced — like, if we added in another type of reward, something a generative model could take advantage of, then we couldn't compare against agents that don't use that kind of information. So I just think they were using the minimal cases to demonstrate what explore and exploit could look like, and then future work hopefully would involve both. Right, and I get that, but it was just interesting to me — I guess because the reward only comes in through the exploit term, that's why they had to keep the exploit term in the explore task, do you know what I mean? I just wonder what the results would be if you took out the exploit term from the explore task. Would it just wander around aimlessly? Or, without a reward for getting higher, would that just produce a random result? Do you see what I mean? I see what you mean. So there's one driving strategy where it's truly just making policy unrelated to its elevation, and depending on how that plays out it might get somewhere — most likely not, because the space of strategies that actually reach the flag is small. But yes, I agree that's an interesting question: when is it enough to just explore? When is novelty search simply enough? When might novelty search be strictly the best? When might 80 percent novelty search plus 20 percent keeping a little reminder of our elevation be a relevant and tractable strategy? Which parts of these explore and exploit objectives are even calculable? For example, it
might make sense to evaluate or exploit a current distributional value, but how do we explore the space of economic strategies? Like, I can't explore every single possible economic strategy, but I might be able to exploit some ratio by doing arbitrage. So there are some control tasks where the exploitation is really straightforward and visible and the exploration is hard, and then there are other ones where maybe the car can't see the flag, so exploring is the only thing it can do. So those are really great questions about how these things get traded off. There's the explore-exploit tradeoff and how that gets implemented; there's also accuracy minus complexity, which is that we want a good model but not one that's too complex or overfit. There's no a priori best way to trade off those two, and there are just all these interesting tradeoffs whose space we can explore.

Let's just look at what the results were. Here the x-axis is position, from the far side over to 0.5, which I guess is the flag, or at least close to it; I tried to line it up, maybe it's a little off or something like that. But we can see that the greedy agent, which just drives up, up, up, up, stays tightly bounded in a medium-velocity regime, sampling a ton from one part of the distribution. Think about it like an upside-down bell curve: it's sampling a ton from right here, but it's not sampling from up here. So there's a relationship between sampling and control, because to get to certain states is, in effect, to sample those states. The epsilon-greedy agent, which has a slightly different implementation, does sometimes appear to get outside; maybe by the end of its 100 epochs it was starting to take some trips further out and at slightly higher speeds, but it doesn't spend most of its time doing so. The active inference learner (of course they wrote the paper, so it's going to be the one that succeeds) does, in 100 epochs, learn very fast, and it does get to some pretty high speeds and some pretty distant positions.

Then, just briefly, to compare with the other task: the two exploit tasks are laid out here, and they're comparing against this DDPG agent, which is a model-free learner, so people can look into the details, or one could imagine other algorithms they could test against. These are all the kinds of things we can ask Alec about. But we can see that panel A, which is the same as C (C is just the first part of A, zoomed in really close), shows that right away in both of these, after either five epochs, like five trials or five seconds, or after maybe 30 or 40 or 50, the active inference learner just starts to take off. Once it's hit on a solution, it strictly improves it, with some noise. Whereas with the DDPG, the two things you see are, first off, it spends a ton of time near the baseline, so it takes a while before it locks onto any useful combination of parameters; and then when it gets off baseline it sometimes ends up succeeding, but still, perhaps because it's locally optimizing (we can ask why), it's just not sampling highly rewarding states. Even after it gets off the baseline, it stays off baseline, but it never gets as high as the active inference learner does extremely rapidly. So this is kind of where they're drawing their evidence for the claims about the potential superiority of, or preference for, active inference. Now yes, you could probably train a DDPG on some massive data set and get better performance than they got; they probably just used a straightforward implementation. So that's where research happens, that's what people study: okay, can we do a DDPG, whatever that is, plus this other feature being used by this other group? Or, what is it that's actually causing the
active inference learner to get off the ground so fast? And could we even get it off the ground faster in the case of this hopper task, or could we make a more reliable reward function for the inverted pendulum task? But this is control theory at its finest, really, and this is what it's all about: learning policies that map states onto reward. Those are the key pieces; that's why we spent so much time talking about the equations relating observations, state estimates, and policies. That's what it's all about: observations coming in, then policies being selected based upon the estimated states of the world, and then actions coming out. That's the Markov blanket, from the organism's perspective at least. And the outside world in these examples is simple and unchanging; gravity is gravity for the pendulum and for the hopper, and it's not changing. But we can start to think about, well, what if there were a diurnal gravity rhythm? How would the control theory change? Well, there could be a generative model in the agent's control mapping, in those p's and q's, that involved some sort of "is it day or night?", and then a conditional policy on whether I think it's day or night. So we can see how it starts adding layers there. And yeah, that's basically it, but let's have any last thoughts. What was something cool or unexpected? What's something you'd like to ask Alec about, or that you're curious to learn more about now? Oh, Shannon, go ahead.

Yeah, so I was thinking about Blue's question, about why not use the exploration component in these inverted pendulum and hopper tasks. Wouldn't it be because there's kind of only one state that has maximum reward? Like if we're completely vertical, or if we hop as far as possible. But if there were potential states with differing values of reward, then we might expect to see something different if we used the exploration component of the active inference model.

Great point. Like, if it were a bimodal distribution, then you can't just try to keep it as close to the maximum height as possible, because you wouldn't know whether you were locally optimizing or globally optimizing, and you might want to be sampling across the whole distribution just so you know. Or you might want to spend a lot of time at rewarding states, but still have some understanding of the total distribution, to know whether there might be more rewarding states that you're just not visiting at all. That is the whole challenge of local versus global optimization. And you're right that in this mountain car example, local optimization is global; you're going to get to the flag, so to speak. I know that was in the explore strategy, but in this one it's clear, as you brought up: if you're holding the pendulum at the maximum height, you're simply doing as well as you can, period. You don't need to explore other ways you could do it, like a loop-the-loop. And that's the challenge, though: you've got to walk through the valley of death to get from the local optimum to the global, and that's where these algorithms dare not tread, because they have to walk through improbable or non-preferred states, moving away from your target states, in order to get to the next peak, which might be absolutely higher in the exploit paradigm, or might reveal new territory in the explore paradigm. So there are reasons why you want to get through the valley of low likelihood, whether you're exploring or exploiting or a combination. And there are so many options, so many weeds and tangles in the valley, that the way we're going to cut through it with a machete is info gain: we're going to explore paths through the valley that are the most informative for us, given our understanding of our own policies. So if you have a four-by-four, versus a small team, versus your drone, different policies and different affordances mean that you're going to be exploring this valley differently.
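To make that bimodal point concrete, here is a toy sketch, not anything from the paper: the reward landscape, the step sizes, and both strategies are made up for illustration. A purely greedy hill-climber stalls on the nearby local peak, while even naive exploration (uniform random sampling of the landscape) can find the higher peak on the far side of the valley:

```python
import random

def reward(x):
    # Toy bimodal landscape: a local peak (height 3) at x = 2 and a
    # higher global peak (height 6) at x = 10, with a dead valley between.
    return max(0.0, 3 - abs(x - 2)) + max(0.0, 6 - 1.5 * abs(x - 10))

def greedy_climb(x, steps=200, step=0.5):
    # Pure exploitation: only move to a neighbouring point with higher reward.
    for _ in range(steps):
        best = max([x - step, x + step], key=reward)
        if reward(best) <= reward(x):
            break                  # local optimum: no neighbour improves
        x = best
    return x

def random_explore(n=200, lo=0.0, hi=14.0, seed=0):
    # Pure exploration: sample the whole landscape, keep the best point seen.
    rng = random.Random(seed)
    return max((rng.uniform(lo, hi) for _ in range(n)), key=reward)

print(reward(greedy_climb(0.0)))       # 3.0 : greedy stalls on the local peak
print(reward(random_explore()) > 3.0)  # True: exploration crosses the valley
```

Active inference's epistemic term sits between these two extremes: rather than sampling blindly, it directs the agent toward regions that are informative under its generative model of the landscape.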
And then there's whether you're exploring for the sake of exploring and learning new turf, or whether you're exploiting because you just want to have a high elevation, or whether it's a little of both; those are all nuances you can add in. But the insight of active inference is: let's actually think about what would be informative trajectories to take through that valley, based upon a generative model of the landscape, taking our own policy affordances into account. If you don't have a generative model of the landscape, you're going to be doing something like locally going downhill or locally going uphill, so you need a deeper generative model of the landscape. And if you don't take your affordances into account, then you're lost, because you might be unknowingly optimizing something that's appropriate for, you know, the drone to fly over, but you're on the ground, so you actually can't make it up that hill. It's a policy that might work for a different organism; it's just not the one that works for you. I know we're kind of mixing metaphors, but I think it's a fun way to think about these optimizations, because we are implementing a control process, something that we modeled with control theory, and we're also seeing it play out in a very abstract computational framework here. Cool stuff.

So then they talk about the relationship with previous work. I think I'll just put it on the screen; people can pause if they want to look. With that, I'll just say thanks to our cohort here for participating, super interesting stuff, and next week we hope to have Alec on to discuss some of these questions. You can check the calendar event for a follow-up survey where you can add some details: how could we improve, or what questions would you like to hear conveyed in 8.2? And for those who are not participating live, please provide us with feedback or suggestions or questions that we can cover, and just stay in touch. That was really great, so thanks a lot, everyone, for the discussion.
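For anyone who wants to play with the mountain car intuition from the discussion, here is a toy re-implementation of the classic OpenAI Gym MountainCar dynamics (the constants are the standard published ones; this is not the paper's code, and the two policies below are illustrative heuristics, not the paper's agents). It shows why the greedy "always push toward the flag" strategy fails while the energy-pumping "push in the direction of motion" strategy succeeds:

```python
import math

def mountain_car(policy, steps=1000):
    # Toy version of the classic Gym MountainCar transition dynamics.
    x, v = -0.5, 0.0              # start at rest near the valley bottom
    for t in range(steps):
        a = policy(x, v)          # action in {-1, 0, +1}
        v += 0.001 * a - 0.0025 * math.cos(3 * x)
        v = max(-0.07, min(0.07, v))
        x = max(-1.2, min(0.6, x + v))
        if x == -1.2:
            v = 0.0               # inelastic stop at the left wall
        if x >= 0.5:              # reached the flag
            return t
    return None                   # never reached the flag

greedy = mountain_car(lambda x, v: 1)                     # always push right
pumping = mountain_car(lambda x, v: 1 if v >= 0 else -1)  # push with the motion
print(greedy is None, pumping is not None)  # True True
```

The greedy car's engine is weaker than gravity near the valley floor, so it oscillates forever below the flag; the pumping policy temporarily moves away from the goal to build momentum, which is exactly the "walk through the valley" behavior discussed above.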