Hello and welcome everyone to the Active Inference Lab. This is Model Stream number 2.1 on April 16th, 2021. Today is going to be an awesome Model Stream. We're just going to briefly go around and introduce ourselves, and then I'll just mention how the session will be run today, and then we'll pass it to Nor for a presentation. I'm Daniel, and I'm a postdoctoral researcher in California. I'll pass to Philip. Hi, yeah. I'm currently a PhD student at Oxford in my second year, and yeah, I guess this was work I did before I started my PhD, alongside Nor, and I'm currently focusing more on data efficiency, specifically within reinforcement learning. But I guess on that topic, the knowledge of Active Inference is obviously useful for approaching such research problems. And Nor? Hi, I'm Nor. I'm a third-year PhD student in the theoretical neurobiology group at the Wellcome Centre for Human Neuroimaging at UCL, that's University College London. My PhD, supervised by Karl Friston, focuses on ideas pertaining to adaptation, one of which I'll be focusing on today, which is behavioural adaptation in non-stationary environments using Active Inference. So thank you. Awesome. Thanks both for joining. For this presentation, we're going to be hearing a who-knows-how-long presentation from Nor, and then I'm going to be compiling questions from the chat. So please just type questions as they come to you and we'll address them at the end. So thanks again, and Nor, please take it away. Perfect. Thank you. So today I'll be presenting some work that I did in collaboration with Philip, who you've just heard from, Thomas Parr and Karl Friston. It's titled Active Inference: Demystified and Compared. OK, OK, perfect. OK, so the presentation structure is as follows. Can you share the screen? I think it's sharing, or are you not able to see it? I'm not seeing it. Could you just re-share it? Yep, sure. Technology, I tell you. There we go. And I'll crop it. 
So go for it. Thanks. Perfect. Thank you. So the presentation is structured as follows. First, I'll briefly motivate the problem setting and provide details of the particular active inference instantiation under consideration today, which is the discrete state-space setting. The second half of the presentation is going to be focused on some particular examples comparing the active inference formulation with reinforcement learning, specifically Q-learning and a Bayesian model-based algorithm. And then what I'm going to do is provide some face validity for particular aspects of why you would even want to use active inference. OK, so what is active inference? It's a first-principles account of how biological or artificial agents may operate in dynamic, non-stationary settings. It stipulates that these agents, in order to maintain homeostasis, reside in attracting states that minimise their entropy, or their surprise. So if you take this particular example that we're seeing, of this hungry agent opening the fridge, the way it would work is: why would you open a fridge, right? You want to make a particular choice between eating at home or outside, and in order to do that, you have to decide what is the optimal action that would allow you to resolve your own uncertainty about the current state of affairs. And that would then help you decide whether you want to cook at home or walk to the restaurant. In this particular instance, this has led to the agent opening the fridge to check whether it even has food at home. And what's nice about active inference is that it allows you to think about these problem settings in a more formal way, by specifying that optimal behaviour is determined by evaluating the evidence, that is, the sensory input, under the agent's generative model of the observations that it is being exposed to. 
And in this particular presentation, what we'll do is focus on just the process theory that underwrites active inference, and not talk through the biological and neural plausibility of the active inference message passing scheme. But to properly motivate why we would even want to use active inference in comparison to generic reinforcement learning algorithms, we first need to start with this understanding that within active inference, there's a commitment to a purely belief-based scheme, which means that reward functions are not always necessary, because any policy that you would have has an epistemic value even in the absence of preferences. Additionally, active inference agents can also learn their own reward functions, and this lets the agent describe the type of behaviour that it expects of itself, as opposed to something that it would get from the environment. And these two particular points are really important in contrast to reinforcement learning, because under standard RL settings, the reward function will define how the agent interacts or behaves within a particular environment setting. But defining that reward function in the first place is quite difficult, because it assumes that there is a specific signal being given from the environment that can be unambiguously good or bad for the agent, which wouldn't necessarily hold true in a real setting where these environment signals can change depending on the context. For example, eating ice cream is not always going to be rewarding: if you're ill, it might make you worse. And that's why constructing these reward functions in the first place is extremely difficult, and if you're not constructing them in an appropriate way, even within an RL setting, it can result in suboptimal behaviour for your agents. 
And active inference is really good in this sense, because we are actually replacing, or bypassing, the traditional reward function that you would have in the RL setting with prior beliefs about preferred outcomes: the sort of desired states of affairs that you want to find yourself in. And this becomes important in settings where there's no reward, or a really imprecise understanding of what a reward or preference setting should look like. In this scenario, within the standard active inference discrete-state formulation, what we can do is learn the empirical prior distribution over these preferred outcomes, that is, the internal reward function of that agent. And this brings me to the first distinct conceptualisation between the RL setting and active inference: within active inference, rewards are nothing distinct, they're just a standard observation that the agent is getting from the environment, whereas in RL, they're quite necessary for the appropriate actions that the agent is going to learn. The second point we wanted to make was that active inference provides a principled account of epistemic exploration and intrinsic motivation as minimising uncertainty. And again, within the RL setting this is quite crucial, because the whole premise of lots of new algorithms that we see in RL is to try and find the right trade-off, the balance, between exploration and exploitation. So what is the right set of actions that the agent should take at a given point in time? Should it carry on trying all the different ice cream flavours that it's never been exposed to, like mustard, or should it always have the same ice cream flavour that it's been exposed to in the past and really likes? For example, hazelnut, for instance. 
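The exploration-exploitation trade-off just described can be sketched with a toy epsilon-greedy bandit, the same mechanism the Q-learning agents later in the talk use. The flavours and payoff numbers here are made up purely for illustration:

```python
import random

random.seed(0)

# Two "ice cream flavour" arms with unknown average payoffs (made-up numbers)
true_value = {"hazelnut": 1.0, "mustard": -1.0}
Q = {"hazelnut": 0.0, "mustard": 0.0}       # running estimates of each arm's value
counts = {"hazelnut": 0, "mustard": 0}
epsilon = 0.1                               # fraction of the time we explore

for _ in range(1000):
    if random.random() < epsilon:
        arm = random.choice(list(Q))        # explore: try anything
    else:
        arm = max(Q, key=Q.get)             # exploit: current favourite
    reward = true_value[arm] + random.gauss(0, 0.1)
    counts[arm] += 1
    Q[arm] += (reward - Q[arm]) / counts[arm]   # incremental mean update
```

After enough interactions the agent mostly picks the good flavour, but the fixed epsilon means it never stops sampling the bad one entirely, which is exactly the behaviour discussed for the fixed-exploration Q-learning agent later on.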
So it's an outstanding problem within RL, and under a Bayesian framework, active inference deals with it naturally using the expected free energy formulation that I'll come to in a moment. And the last bit that you can see within the active inference framework is that it naturally accounts for uncertainty as part of the belief updating process. So now that I've laid out the three things that are super interesting about the active inference scheme in comparison to RL, I'm going to provide some intuitions, some motivations, as to why we can even formulate the active inference formulation the way we do. So sorry, I just realised I skipped ahead. No, no, sorry. Great presentation. Yeah, thank you so much. Let me just scroll down to... Okay, so I previously stated that active inference stipulates that agents maintain their homeostasis by residing in attracting states that minimise surprise. So you must have been thinking: what is surprise? Well, here we define surprise as the negative log probability of outcomes. And for this we introduce one random variable, O, which corresponds to a particular outcome that's received by an agent, and this exists within a finite set of all possible outcomes, which is denoted by capital O here. So the first equation that we have just formally states that, and here P denotes the probability distribution over outcomes. Okay, so in active inference, the way the agent will actually minimise the surprise quantity that we just walked through is by maintaining a generative model of the world. And this is important because, at any given point in time, the agent won't necessarily have access to the true measurements of the current state of the world. So in this particular graphic that you see here, you've got the environment, and the agent interacting with the environment in a particular way. It's being exposed to the sensory signal, but it doesn't know what really generated the outcome O. 
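As a rough numerical illustration of the surprise definition just given, with a hypothetical categorical distribution (the numbers are not from the talk):

```python
import numpy as np

# Hypothetical categorical distribution P over four possible outcomes o
P_o = np.array([0.7, 0.2, 0.05, 0.05])

def surprise(p):
    """Surprise (surprisal) of an outcome: -log P(o)."""
    return -np.log(p)

# A likely outcome carries little surprise; a rare one carries a lot
low = surprise(P_o[0])    # -log 0.7  ~ 0.36
high = surprise(P_o[2])   # -log 0.05 ~ 3.00
```

A certain outcome (probability 1) has zero surprise, and surprise grows without bound as an outcome's probability approaches zero.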
So it can only perceive itself and the world around it through O, and it needs to make inferences about what type of states, the true causes, were responsible for the particular sensory input it is being exposed to. And this is why in active inference, when we formulate the problem, we formulate it as a partially observable Markov decision process, because in this way we are able to formulate a generative model that defines an internal distribution over the internal states that the agent would use in order to infer the outcome. So it doesn't have access to the true state, but it can make hypotheses, or beliefs, about the states that could have given rise to a particular sensory outcome that it's been exposed to. And using this, the agent will make inferences about the true state using a process of reverse mapping, specifically Bayesian model inversion. And to make this a little bit more concrete, what you can do is think about the hidden states as locations or colour, for example, and the observation space that the agent would be exposed to would be, for example, the velocity of the movement, or a particular reward, or a happy face that it's being exposed to. So if we were to think about this a little bit more formally: what is the generative model? As we described before, a generative model is a partially observable MDP within this active inference formulation, which rests on a simplified setting that we're considering here where we only have two random variables. The first one is O, which we've discussed, and the second one is S, where S denotes a random variable representing hidden or latent states, and these exist within a finite set of all possible hidden states, which is denoted by capital S here. And the joint probability that we get over O and S can be factorised into the likelihood function, which is P of O given S, and the prior over the internal states, which is P of S. 
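A minimal sketch of this factorised generative model and of Bayesian model inversion, with made-up numbers for a toy model of three hidden states and two outcomes:

```python
import numpy as np

# Toy generative model (numbers are illustrative, not from the talk)
A = np.array([[0.9, 0.2, 0.5],    # likelihood P(o|s); each column sums to 1
              [0.1, 0.8, 0.5]])
D = np.array([0.5, 0.3, 0.2])     # prior over hidden states P(s)

# Model evidence by marginalising out the hidden states: P(o) = sum_s P(o|s) P(s)
P_o = A @ D

def posterior(o):
    """Bayesian model inversion: P(s|o) = P(o|s) P(s) / P(o)."""
    joint = A[o] * D              # P(o, s) = P(o|s) P(s)
    return joint / joint.sum()
```

The `posterior` function is the "reverse mapping" mentioned above: from an observed outcome back to beliefs about the hidden states that could have caused it.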
So this gives you a very nice formulation that we're going to use in the next couple of slides. I just wanted to ask, are you able to see my mouse when I highlight, or is that not there? I can see it, yes. Yeah, perfect. So we know that for an agent to minimise its surprise, we would need to marginalise out all the possible hidden states that could have led to a given outcome, and this can be achieved by using the factorisation that I just mentioned, the likelihood and the prior. But the problem is that this is not a trivial task, because the dimensionality of the hidden states can be extremely large, and if you're considering the additional random variables that we're going to introduce in a bit, this becomes even more problematic. That's why we use another quantity, a variational approximation of this quantity P of O, which is more tractable and allows us to estimate the quantities of interest. So this will be a natural step to talk about the variational free energy, which is this variational approximation of the quantity of interest. So what is variational free energy? Variational free energy is defined as an upper bound on surprise. The first definition that we consider is derived using Jensen's inequality and is commonly known as the negative evidence lower bound in the variational inference literature. 
So we get this from equation four that we just saw, by introducing the negative log on both sides and then multiplying the term inside by one, which is essentially Q of S over Q of S, so we're assuming that Q of S cannot be equal to zero. With that, we then apply Jensen's inequality and move the log inside the expectation, and we end up with an expectation with respect to Q of S of the log of the joint over the approximate, the variational quantity of interest here. Then we take the negative inside, and we can flip it around, and we get our first nice quantity of interest: the bound that we're interested in, expressed in terms of the divergence between the approximate posterior and the joint that we have. To make this a little bit more concrete, what we can do now is further manipulate the variational free energy into the KL divergence between the approximate and the true posterior, minus the log model evidence that we had, and we can rearrange the last equation to really hone in on the connection between surprise and variational free energy. So if you remember that the KL is a divergence, which means that it cannot be less than zero, it's always greater than or equal to zero, this means that when our approximate posterior is equal to the true posterior, we end up with a variational free energy equal to the surprise, the negative log model evidence, which means that minimising free energy is essentially equivalent to maximising model evidence. OK. We can rewrite the previous equation that we had, equation 10, to express the variational free energy as a function of posterior beliefs in multiple different forms. I'm just going to focus on equation 12 here, which is complexity minus accuracy. 
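The bound just described can be checked numerically on a toy model (illustrative numbers only): the free energy equals the surprise exactly when the approximate posterior matches the true posterior, and strictly exceeds it for any other belief.

```python
import numpy as np

# Toy model: likelihood A = P(o|s) and prior D = P(s), numbers illustrative
A = np.array([[0.9, 0.2, 0.5],
              [0.1, 0.8, 0.5]])
D = np.array([0.5, 0.3, 0.2])

def free_energy(q, o):
    """F = E_q[log q(s) - log P(o, s)] for an observed outcome o."""
    joint = A[o] * D
    return np.sum(q * (np.log(q) - np.log(joint)))

o = 0
surprise = -np.log((A @ D)[o])            # -log P(o), the model evidence term

# Exact posterior: the bound is tight, F equals the surprise
q_exact = A[o] * D / (A[o] * D).sum()
# Any other belief: F strictly exceeds the surprise
q_flat = np.array([1/3, 1/3, 1/3])
```

This is just the identity F = KL[Q(s) || P(s|o)] + surprise evaluated numerically: the KL term vanishes only when Q is the true posterior.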
So this is a trade-off that's normally referred to in the papers, which essentially says the complexity term, or complexity cost, is the KL divergence between your approximate Q of S given pi and your P of S given pi, and here pi is just your policies. These can be regarded as hypotheses about how the agent is going to act, but I'll come back to what policies really entail; for now just consider them as a term that allows us to condition the free energy on a sequence of trajectories of interest. And the second term that we have is the log probability of O given S, so the likelihood, in expectation with respect to Q of S, and that gives you the accuracy. So a simple way to think about it is that the accuracy is just how accurate the model is, and the complexity is a regularisation term, a penalty term, to make sure that we're not diverging too far away from our initial priors. So this particular quantity is the variational free energy. Are there any questions at this point about the variational free energy? Not yet, thank you. Yeah, perfect. Okay, so the variational free energy is giving us this way of perceiving the environment, and it addresses one part of the active inference formulation, which is making inferences about the given world that the agent is interacting with at a given point in time. However, we have not actually accounted for the active part, whereby this particular agent that we have, under the active inference formulation, can take a series of actions, or interact with the environment in such a way that it affects that environment in the future. 
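The complexity-minus-accuracy split can be verified against the direct definition of the free energy on the same sort of toy model (numbers again purely illustrative):

```python
import numpy as np

A = np.array([[0.9, 0.2, 0.5],    # likelihood P(o|s)
              [0.1, 0.8, 0.5]])
D = np.array([0.5, 0.3, 0.2])     # prior P(s)
o = 0
q = np.array([0.6, 0.2, 0.2])     # some approximate posterior over states

# Equation-12-style split: F = complexity - accuracy
complexity = np.sum(q * (np.log(q) - np.log(D)))   # KL[q(s) || p(s)]
accuracy = np.sum(q * np.log(A[o]))                # E_q[log P(o|s)]
F = complexity - accuracy

# Identical to the direct definition F = E_q[log q(s) - log P(o, s)]
F_direct = np.sum(q * (np.log(q) - np.log(A[o] * D)))
```

The two forms agree term by term because log P(o, s) = log P(o|s) + log P(s); the split just groups the prior with the entropy of q (complexity) and leaves the likelihood term (accuracy) on its own.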
So to motivate this a little bit further, what we can think about is that not only do we want to minimise the variational free energy, we also want to minimise a quantity called the expected free energy, which depends on anticipated observations about the future, and the minimisation of this particular term allows the agent to influence the future by taking particular actions in the present, which are selected from a set of policies. So I've mentioned policies a few times now; what are they? Policies can be defined as a sequence of actions at time tau that enable an agent to transition between hidden states, and tau here indexes a trajectory up to a particular horizon, capital T, which is the total number of time steps that you are considering in a particular setup. For us to properly define a policy, we need to introduce two additional random variables. The first one is an action conditioned on tau, which is denoted by u tau here, and this exists within a finite set of all possible actions that the agent can take. The second random variable that we introduce is the policy, which is the pi that we've discussed, and this exists within a finite set of all possible policies, or sequences of actions, in the sense of the sequential policy optimisation that we're interested in here. So to make it a little bit more concrete, pi here, the random variable, can be decomposed into a series of actions over a particular time horizon. So u1, u2, going up to uT, would denote the action at time point 1, the action at time point 2, and so on, and the link is explicit when you consider that indexing a policy at a particular time point tau gives you the action at that time point. Okay, cool. I also wanted to highlight that this definition of policy is actually quite distinct from how it's considered in RL, where, when they say policy, they mean state-action policies. 
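The distinction between the two notions of policy can be sketched as follows, in a toy four-action setting (not the talk's environment):

```python
import itertools

actions = ["left", "right", "down", "up"]
T = 2  # planning horizon (number of time steps considered)

# Active inference: a policy is a fixed sequence of actions (u_1, ..., u_T)
sequential_policies = list(itertools.product(actions, repeat=T))
# 4 actions over 2 steps -> 4**2 = 16 candidate trajectories

# Reinforcement learning: a (deterministic) state-action policy maps each
# state to an action, pi(s) -> u
state_action_policy = {0: "right", 1: "down", 2: "right"}

# With T = 1 a sequential policy is just a single action, and the two
# notions coincide
one_step = list(itertools.product(actions, repeat=1))
```

Note how the number of sequential policies grows exponentially in the horizon, which is one reason the policy space is kept finite and enumerable in the discrete formulation.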
So as I just mentioned, in active inference a policy is simply a sequence of choices of actions through time, that is, a sequential policy, and this is different to a state-action policy in reinforcement learning, which is a mapping from states to actions. So your RL policy, which takes into account the action and the state, is the probability of your action given the state, and the definitions of policies in RL and active inference become exactly the same when we consider the setting where the horizon is equal to 1, so you're only considering one step ahead. Okay, so I'm going to move on a little bit and consider the quantity of interest, the expected free energy, and how we even derive it. In order to derive it, we first need to extend the variational free energy definition that we had a few slides ago, and make it dependent on both time, so tau, and policy. What we're doing essentially is taking that same equation and just decomposing it for the previous and current time step under a particular policy, so that's why we have the conditioning, and then we're decomposing it in a specific way in equation 15, and then writing out the matrix formulation in equation 16. We can come back to this if there are any questions, but the key thing to take away from the slide is that now we're including a functional dependency on time for the variational free energy, and this allows us to move to the expected free energy formulation. The key thing to note here is that we're only considering the previous time point and the present, not the future at all. So using the free energy equation from before, we can derive the expected free energy. And what is the expected free energy, then? The expected free energy is the free energy functional of future trajectories, G, and it effectively evaluates the evidence for plausible policies based on outcomes that have not been observed yet. So that's the key thing: you're making 
inferences about the set of future trajectories that you haven't observed. There are two heuristics introduced here in order to get to the G formulation that we see in equation 17. The first one is to include beliefs about future outcomes in the expectation, that is, we're supplementing the expectation under the approximate posterior with the likelihood here, which results in a predictive distribution given by these first two terms. And the second one is that we're implicitly, or I guess explicitly, conditioning the joint probability of states and observations in the generative model on the desired state of affairs, as opposed to a particular policy, and this constrains the type of preferences that the agent would have. What's helpful with these two moves is, first, that we can now evaluate this quantity before actually having the observations, and second, that the minimisation of G will actually encourage policies to be consistent with the desired state of affairs that the agent expects itself to be in. I'm just going to briefly mention that this is not the only way to derive the expected free energy, and there's been some work that's looked at other formulations, including work by Karl, where the expected free energy can be decomposed into different structures. So if anyone's interested in that, we can go through it later. But this expected free energy that I just introduced can be decomposed in certain ways, so equations 20 and 21 give two different decompositions: the first one being the epistemic and extrinsic value trade-off, and the second one being the expected cost and ambiguity terms. So if we just consider the first equation, we can say that if we're minimising this equation, then we're capturing this imperative to maximise the information gain that you would have from observing the environment about particular hidden states, while maximising the expected value 
which is scored by the log preferences, the extrinsic value here. So this particular formulation actually gives us a very clear trade-off between the first component, the epistemic value, which promotes curious behaviour, so that's what you want, with exploration encouraged as the agent seeks out new states that minimise uncertainty about the environment, and the latter bit, which is more pragmatic and encourages exploitative behaviour through this understanding of the type of states the agent would prefer to reach. In other words, the expected free energy formulation that we're seeing in equation 20 is essentially treating exploration and exploitation as two different ways of tackling the same problem: minimising uncertainty, as I mentioned at the start of the presentation. Okay, we can also think about the second equation here, which just offers us an alternative perspective on the expected free energy, which is that an agent wishes to minimise the ambiguity and the degree to which outcomes under a particular policy deviate from prior preferences. The ambiguity here is the expectation of the conditional entropy, the uncertainty about outcomes under the current policy. In this particular setting, low entropy would suggest that outcomes are quite salient and uniquely informative about the hidden states. For example, the visual cues that you might see if the room is properly lit up; in comparison, if it's quite dark, you're not going to make out anything important from that. In addition, the agent would like to pursue policy-dependent outcomes that resemble its preferred outcomes, the C matrix here, and this is achieved when the KL divergence between the predicted and the preferred outcomes is minimised by a particular policy. Okay, and these priors about future outcomes equip the agent with goal-directed behaviour, which is one of the instances that is really important in active inference. Okay, so once 
we have the expected free energy, we can derive the policies, and this is realised by deriving the probability of any policy by applying a softmax function over the expected free energy. This sort of illustrates the self-evidencing behaviour of active inference, because any policy, or action sequence, that results in lower expected free energy is more likely, and intuitively this would make sense, because the expected free energy is encapsulating all the types of things that you want to consider when you're interacting with the world. So you want to explore, you want to exploit, but you want to have a balance of that, and then when you're selecting your policy, it's just a matter of determining the set of actions which get you closest to this particular goal, and this can be defined by an attractor set that is defined by your C matrix that we described before. And if you don't have that, then it's just random exploration that you would get. Sometimes you can also include a temperature parameter beta here, and by having a hyperprior on this, you introduce an additional complexity cost into the expected free energy formulation, which allows you to account for how flat, or how confident or precise, you want your beliefs to be over the policy space. The key thing to note is that, for the sake of simplicity, I'm not going to go through a lot of the details about how these are optimised, but you can do that in multiple different ways. For example, in active inference we can optimise the expectations about the hidden states of interest, the policies and the precision through inference, and then we can also optimise the model parameters through the learning procedures involved, but those differ depending on the setup you're looking at. For example, if you're using variational Bayes, you would just iterate these objective functions until convergence, or you can do a gradient descent to find the sufficient statistics of interest. 
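Putting the last few slides together, here is a rough one-step sketch, under a toy two-state model with made-up numbers, of scoring candidate policies by expected free energy (the epistemic-plus-extrinsic-value form, as in equation 20) and turning the scores into a policy distribution with a softmax:

```python
import numpy as np

# Toy one-step setting: 2 hidden states, 2 outcomes (numbers illustrative)
A = np.array([[0.8, 0.2],            # likelihood P(o|s)
              [0.2, 0.8]])
logC = np.log(np.array([0.9, 0.1]))  # log prior preferences over outcomes (C)

def expected_free_energy(qs):
    """G = -(epistemic value) - (extrinsic value) for predicted states qs."""
    qo = A @ qs                                   # predictive outcome distribution
    epistemic = 0.0
    for o in range(len(qo)):                      # expected information gain:
        qs_post = A[o] * qs / (A[o] * qs).sum()   # E_q(o)[ KL[q(s|o) || q(s)] ]
        epistemic += qo[o] * np.sum(qs_post * (np.log(qs_post) - np.log(qs)))
    extrinsic = np.sum(qo * logC)                 # E_q(o)[ log C(o) ]
    return -epistemic - extrinsic

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

# Predicted state distributions under two candidate policies
G = np.array([expected_free_energy(np.array([0.9, 0.1])),   # matches preferences
              expected_free_energy(np.array([0.1, 0.9]))])  # opposes them
beta = 1.0                    # precision (inverse temperature)
q_pi = softmax(-beta * G)     # P(pi) = softmax(-beta * G(pi))
```

The policy whose predicted outcomes line up with the preferences gets a lower G and therefore more probability mass; raising beta sharpens the distribution towards the lowest-G policy, lowering it flattens it towards random exploration.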
again, this depends on exactly which formulation and setup you're looking at, but the key thing to note here is that there are three particular aspects of the active inference algorithm that are useful and can be taken from this framework and applied to other settings, so I'm just going to reiterate and summarise them briefly. First we have the generative model, which is crucial: in order for an agent to interact and minimise its surprise, it needs a generative model of the world, and that's described simply here, I'm not including any of the model parameters, with your outcomes, your states and your policies, and these are decomposed (apologies, there's an error with the brackets here) into your prior, your likelihood and your transition function. Then, once you have the generative model, the objective of the agent is to fit the model to sampled observations to reduce surprise, and that is through variational free energy optimisation, so this particular trade-off that we have between the complexity and the accuracy terms. And then the last part of this algorithm is to plan, so select actions that minimise uncertainty, that is, the expected free energy, and the way you do that is by having a softmax over the negative of G, the quantity we have here, and then sampling from that in order to select the best action. Okay, so that's a quick and deep dive into a massive amount of active inference literature, but I just wanted to highlight that those are the three core ingredients, if you are interested in implementing these algorithms yourself. Okay, so now I'm just going to switch gears a little bit and walk through the comparisons with reinforcement learning. In our work, we considered a modified version of the OpenAI Gym FrozenLake environment. FrozenLake has a grid-like structure with four distinct patch types: it has a starting point, which is S, so we can see it here, apologies, but it's super tiny, so 
S is here, and you've got the frozen surface, which is F, so again, I don't think you can differentiate because I'm just moving my mouse. I'm zooming in, I'm zooming in. They can see it. Perfect. So you've got the frozen F, and then you've got the hole, and then lastly you have the goal, so G here, where the Frisbee is located. All patches in this particular setup are safe except for the hole, where if the agent goes to H it gets a negative reward. The agent starts each episode at the first position, the starting position, and from there it needs to reach the Frisbee location in the least number of steps possible, and the way it can do that is by performing four different types of actions: going left, right, down or up. The agent is allowed to carry on moving through the frozen lake with multiple revisits, so it can go back to the starting position having been in other places, but each episode will end when it reaches either the hole or the goal location, and these locations differ depending on the setup in our simulations. So in one setup the position of the hole is eight and the goal is six, and in another setup the position of the hole is six and the goal is eight. And the objective, as I said, is to reach the goal in ideally as few steps as possible while avoiding the hole, because that would end the episode. If it reaches the goal, it gets a positive reward of 100, and a negative one otherwise. The key thing to note here is that the scoring metric actually gives us a way to compare the active inference algorithms to the reinforcement learning algorithms, but it's not really important for the active inference agent to have the reward function from the get-go, because it can still move around using just the information gain term, so not having the extrinsic value component, and that's quite interesting because we'll see the ramifications of that in our simulations. For this particular setup, we limited the maximum number of time steps for each episode to 15. Okay, 
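A minimal sketch of the modified FrozenLake dynamics as described, assuming a 3x3 layout with positions numbered 1 to 9 and the start at position 1 (the grid size and the hole's reward magnitude are assumptions here, not stated explicitly in the talk):

```python
# Sketch of the modified FrozenLake environment (3x3 layout assumed)
GRID_W = 3
MOVES = {"left": -1, "right": +1, "up": -GRID_W, "down": +GRID_W}

def step(pos, action):
    """Move on the grid; an invalid move leaves the agent where it is."""
    col = (pos - 1) % GRID_W
    if action == "left" and col == 0:
        return pos
    if action == "right" and col == GRID_W - 1:
        return pos
    new = pos + MOVES[action]
    return new if 1 <= new <= GRID_W * GRID_W else pos

def reward(pos, context):
    """Context 1: goal at 8, hole at 6; context 2: goal at 6, hole at 8."""
    goal, hole = (8, 6) if context == 1 else (6, 8)
    if pos == goal:
        return 100
    if pos == hole:
        return -100   # magnitude assumed; the episode ends here either way
    return 0
```

Under this layout, moving right from position 1 gives position 2, and moving up from position 5 also gives position 2, which matches the transition examples given for the generative model below.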
so what I'm going to do is first talk through the generative model that we used for the active inference formulation. What you're seeing on the slide is the graphical representation of the active inference generative model. This model contains four action states, so right, down, up and left, and these control the ability to transition between hidden states in the location factor. So for example, if you are in position one and you take the action right, then you'll end up in position two, or if you are in location five and you take the up action, you'll end up in position two as well. In this particular setting, both positions six and eight are absorbing states, because, if you'll remember, once the agent goes to that location it's not able to move out, so that's when the episode ends. And if an agent makes an invalid move in this particular maze, for example if it tries to go left from position one, it will just stay in that location, it won't move. In this particular generative model, looking at the hidden states now, we have a Kronecker tensor product between the two factors, so we have location and context here. The context cannot be changed by the agent, because this is something that's determined by the environment, and it determines where the goal and the hole locations are, whereas the location is something that the agent has control over, and that's why we have the action states over this. With the context, we have two contexts: the first one is where the goal is in location eight and the hole is in location six, and the second context is where the goal is in location six and the hole is in location eight. At each time point, the agent will observe two outcomes: one would be its own position in this particular maze, and the second would be the score that the agent would get. The likelihood for the grid position is entirely determined by the location of the agent, and the score is determined by both the location and the context in play. So if the agent is in location 
six and it's in context two then it will receive a positive reward otherwise it will receive a negative on neutral reward depending on where it is based on trying to make that comparison with reinforcement learning what we're doing here is we are introducing private preferences where the agent has plus four for positive reward negative four four sorry minus four for negative reward and otherwise and at the first stage it expects itself to be in the first location so we compared this particular gentle model and the active inference agent to two reinforcement learning algorithms so the first one was the queue learning using epsilon greedy exploration and the second one was the Bayesian model based reinforcement learning algorithm using standard thomsom sampling and thomsom sampling is a proper procedure here because it entails the optimization of dual objectives reward maximization and information gain and this is achieved by having this distribution over a particular function that we parameterized by a prior just by having a prior distribution over it that we sample from okay so for the two queue learning algorithms we have two epsilon greedy parameters so one where it's fixed exploration set to 0.1 and then another one where we have decaying exploration that starts from one and decays down to zero so first we assessed this how the agents interact in a stationary setting where the reward wasn't changing such that the goal location was always at six and the whole location was always at eight and then we evaluate the performance of the agents the key thing to take away from here is that both the Bayesian RL and the active inference agents are able to quickly learn where the reward location is and just maximize it out and this is this is performance is consistent denoted by the really tight confidence bounds that we see in comparison the queue learning agents are so for the one where we have fixed exploration it's fairly good and is able to learn where the reward is 
located, but there is some deviation, denoted by the 10% chance of selecting a random action, whereas for the Q-learning agent where epsilon is equal to one decaying to zero, the performance isn't the greatest. And for the null model, the active inference agent with no reward preference, the agent randomly goes to the hole and randomly goes to the goal 50% of the time. But the key thing to note is that, apart from the null model, all the models are doing fairly well and seem to be performing OK within this stationary setting. So the next thing... Nor, could I just make a point about these experiments as well? Just because they do deviate a bit from the traditional RL assessment. Yeah, great. So basically, usually in RL you have this ambiguity between training-time and test-time performance, and usually, especially in something like Q-learning, you just take the max over the Q-function, or you have a policy that does some search over the Q-function and takes that max given a state. But there's kind of an artificial distinction there, which is that obviously, as you're acquiring data, you're making mistakes in the environment; you're interacting with a real environment. So in order to make a fair comparison with active inference, where you don't really have this distinction between training time and test time, it's all just interaction, that's the reason why the Q-learning agent, especially when epsilon is fixed to say 0.1, never achieves the optimal policy: simply because with 0.1 probability we're also taking a random action. We're not making this distinction, like you sometimes would with normal RL, between train and test time; we're saying train and test time are basically the same thing, so however you're choosing to interact with the world is how you should be assessed. So that's why, initially looking at these curves, you might think, hang on, why isn't Q-learning solving this? But, yeah, that's just to clear something up if you're more familiar with, say, some of the deep RL experimental procedure. Perfect, thank you. OK, and then, just following on from that, we changed the environment a little bit to make it more difficult, to see whether the Bayesian and the active inference agents might struggle when we start having the reward location change after every few episodes. Specifically, we swapped the goal and the hole location at time point 21, at time points 121, 141, 251, and 451, so you can see in these figures, where the gray lines are shown, that at these points the reward location flipped. So, like the stationary setting, for the first 20 trials all the agents seem to be doing as you would expect: both the Bayesian RL and the active inference agents are doing fairly OK, the Q-learning with the fixed exploration set to 0.1 is doing fairly OK as we saw before, and the Q-learning with the decaying exploration is doing as it was doing before. But when you flip it around such that the goal locations change, what you notice is that with the Bayesian RL agent, the amount of reward or the score it gets is quite low, and then we see a phase where it transitions and converges back to the optimal policy again, in comparison to the active inference agent, which, instantly after the first trial of doing the incorrect thing, is able to switch over to the appropriate policy. And the reason for that is that in these RL settings, where we're treating this as a learning problem, you first need to do a reversal learning of where the reward location was and then learn the new reward location. You're seeing that for Q-learning, for both of the different epsilon-greedy parameterizations, and also for the Bayesian RL, and we're seeing that consistently; whereas for the active inference agent, because we're treating it as a planning-as-inference problem, where the posteriors from the previous data are carried over as priors, the agent is able to instantly realize that the policy it was following at this time step is inappropriate and switch its policy over to the other one. Again, the null model, as expected, doesn't really do much, because it's just exploring; it doesn't really care where the reward or the hole locations are. OK, Phil, do you want to add something to this, or...? No, I think the same point applies to this one. Yep, definitely. OK, so just to wrap up these two comparisons: with the stationary setting, all types of agents, including the Bayesian RL and the Q-learning, would be reasonable frameworks to use, whereas with the non-stationary, stochastic setting, having an active inference agent might be an appropriate way of handling changing dynamics. But the key caveat is that you can introduce a lot more additional complexity into the Bayesian RL or the Q-learning, or the RL framework in general, to allow for a way to handle uncertainty; it just wouldn't be a natural way of adding it, and you would have to augment the function or the algorithm in particular ways to justify it. OK, so once we had done this comparison, we were interested in why you would even want to use active inference when you don't have an understanding of the world, or you don't have a preference over the type of things that could be done, because, as we saw, the active inference null model is just exploring; it's not really doing much. And that brings us to one of the initial points I introduced at the start of the presentation, which is that within an active inference framework we don't really need a reward function; we can learn that based on some interaction with the environment. For that, what we did was carry out a few different simulations to see how the active inference agent can select different types of policies in the absence of prior preferences, and
for this we did three different experiments, where we allowed the likelihood and/or the outcome preferences to be learned over time, and saw how the agent interacted when this learning over the different preferences took place. The way we did it was using conjugate priors: because these are discrete state models, we have categorical distributions over our model parameters, and we introduce Dirichlet distributions on top as hyperpriors and learn those. We can go into the details of this if anyone has any questions, but for simplicity, just assume that all the Dirichlet distributions were set to completely flat, and the agents were all given an opportunity to learn based on the different interactions they had. So the first set of simulations that we ran was to understand how the agent would reduce its uncertainty about the environment it is interacting with. So if it doesn't know the frozen lake, how would it interact with or explore that lake? And what we see is that in this particular instance the agent was just interested in exploring; it didn't particularly care where the goal or the hole location was. This is highlighted in a series of different exploration trajectories: the first one was where the agent falls in the hole, then another where it goes around and, in the second episode, ends up at the goal location, and in some it just ends the episode by going back and forth. So this is just pure exploration, nothing more to it here. The next set of simulations was to see what would happen if the agent knew about the world but had no preferences over the type of outcomes it expected itself to be in. For this we ran multiple different simulations and saw that, in the absence of any sort of preferences, the hole could actually become really attractive if it's encountered first. We see that in the first figure, where the agent learns to prefer to hide in holes, while in the second type of trial the agent exhibited a preference to actually go to the goal location. So it is entirely the instantiation, the type of stimulus the agent is exposed to initially, that determines the type of preferences it will learn to have. OK, and then the last set of simulations was just to check what would happen when we combined this with the epistemic imperative to actually resolve uncertainty about the environment the agent was interacting with: specifically, the likelihood mapping between the outcomes given the states, together with the uncertainty about the desired state of affairs the agent expected itself to be in. What we saw was that, if we allowed a sufficient number of trials to pass, the agent in this particular setting again learned to prefer to hide in holes after a number of episodes, but it had a very distinct preference over the type of outcomes it expected itself to be in, depending on the particular trajectory at a particular time point. So that brings me to the end, and I'm just going to wrap up. In active inference, as I said at the start of the presentation, we have a particular algorithm that gives us very nice things to consider when we're operating in a Bayesian, belief-based setting. Firstly, the principled account of epistemic exploration and intrinsic motivation that we get from the particular expected free energy decomposition that we went through. The second thing I wanted to highlight was that under an active inference setting we don't have to explicitly specify a reward function, which we saw in the second set of simulations, where the agent can also learn its own reward and come to prefer something that is quite counter-intuitive from an RL perspective: the signal from the environment is saying something is bad, but the agent's internal motivations or preferences allow it to do something that's
quite at odds with what the environment is expecting it to do. And lastly, because of this Bayesian, belief-based setting, uncertainty is a natural part of the belief updating. So within stationary settings, active inference agents perform as well as reinforcement learning agents; however, in non-stationary settings they outperform them, due to their ability to carry out this planning as inference. And I think that holds; this is not a conclusive statement, but it is a nice way to start thinking about how, if you're scaling up active inference agents to interact with the same type of environments as reinforcement learning agents, environments that might be a little hard to resolve because there is some non-stationarity or some weird fluctuations happening, active inference agents could potentially perform quite nicely if we have this planning-as-inference setup. And that brings me to the end of the presentation. I just want to thank everyone who was involved with this work, so Carl, Tom, and Phil, and everyone who's helped me think through these interesting ideas that I presented, and everyone also for listening. Thank you. Thank you, awesome talk. So you can maybe unshare and we can ask a few questions, and anyone who's watching live is more than welcome to ask questions too. Maybe I'll start with just a general point: it was awesome to see how clearly you differentiated between the reinforcement learning and the active inference paradigms. One question I had was, you mentioned that in the case of a time horizon of one there was an equivalence between the reinforcement learning approach and the active inference approach. Now, people are using reinforcement learning to do planning for the future amidst uncertainty, so how are they accomplishing those kinds of planning, and what are some situations where active inference might be able to step into those settings and potentially do better? Yeah, I think that's a good question, and Phil, feel free to jump in if I miss something. So within the RL setting, from what I know, if they are considering temporal horizons greater than one, then they have hierarchical RL, or they have options, where they're considering trajectories greater than one. These allow them to consider a whole sequence of play that's rolled out if they select a particular policy, instead of that one single action-to-state mapping. But I haven't really worked with options that much; maybe Phil, do you know? Yeah, so options are one way to achieve this kind of multi-step choice. It's almost like an inductive bias that your choices will be contiguous blocks of more than one step, whereas a policy is just, I'm going to take one step and then I'll have this next state I exist in. But I think what's important to disambiguate is that you can actually do planning in model-based style approaches, because you can view active inference as sort of model-based in the RL paradigm, and indeed, within model-based RL you can have planning methods similar to what is done in active inference. However, I guess the key distinction here is that when you do planning in model-based RL, you tend to take some sort of mini evolutionary method, usually something like the cross-entropy method, and what you're doing is a search over the actions you could take at a certain point in time that maximize your reward over a horizon of, say, 20 steps. And you can do some clever stuff; you can terminate with a Q-value function. But fundamentally it doesn't take you away from this definitional difference, which is that in active inference you're doing this planning over actions, even though they might seem superficially similar, basically drawing lots of actions
and seeing which ones maximize some sort of utility. In the case of active inference, all the useful information and the desiderata about exploration and exploitation are wrapped up in that maximization, whereas in RL it's a bit harder to understand: in the classical sense we're just trying to maximize reward, but you can have heuristics where you say, oh, but maybe I also want to maximize some notion of model uncertainty, and it gets a bit more difficult to naturally integrate all these approaches into the same thing. Whereas in the example of active inference, because everything's a distribution, you just put a hyperprior over it and integrate it out if you don't want to deal with tuning it. So yeah, that's where I feel there is this difference. Cool, and what kinds of settings do you think those types of action-as-inference and planning-as-inference could be utilized in? What kinds of datasets or questions or contexts are people currently using another type of method for, where you're excited to see active inference play a role? I think mostly open-ended problems where you don't really have a reward function, because the way the RL agent learns to interact with the environment is through a reward function; so anything where you don't really have that, or where you have an environment that is changing. But I know there's a whole host of people within the RL community working on intrinsic motivation, or internal motivation, so those sorts of things do overlap with the active inference formulation. As for particular paradigms, I think for me the more interesting aspect of active inference comes from thinking about biological agents: if you're modeling a patient, or someone with schizophrenia, then with this Bayesian framework you can change the priors to try and see how the person is interacting, where I guess it could be that the way they're evaluating their policies is broken, or it could just be other parts that are different. But I think within the RL setting, if you are stepping outside the standard game-like benchmarks, so MuJoCo or Gym environments, and start going into ones which are more open-ended, where you don't have any rewards, then active inference could be potentially useful. But I'm a little hesitant to say whether it will be better, because if you start augmenting Bayesian RL with all sorts of interesting components, then to a certain extent it will be active inference scaled up, which is potentially a contentious point, but I think it depends on exactly what you are incorporating. And that's why, for this particular presentation and our work, we were quite careful in defining what reinforcement learning meant: you have to have this reward function in play and you want to maximize it, and anything you include in the RL framework has to have that objective. But if you bypass that and say, OK, I'm just going to add in all sorts of interesting components to make the algorithm, or the way the agent interacts with the environment, similar to an active inference framework, or perhaps even better, then that fine distinction between where RL is better and where active inference is better isn't really there for me. I think both communities, from my perspective, are working toward similar things, which is sequential decision making, and with our work it's mostly sequential decision making in the face of uncertainty, whereas some RL work might not be focused on that. So I think it's when you start drawing the boundaries that it becomes a little hazy where things are separate or not. I went off on a little tangent there, but Phil, do you want to add anything to that, in terms of the environments and the paradigms that might be useful?
Yeah, I think if, say, you're given an environment you just have no prior knowledge about, including how you want to behave in it, you could argue that you could deploy an RL agent that uses some sort of curiosity or epistemic-uncertainty-reduction mechanism. And I know there is a tiny bit of work about learning priors over reward functions, but I'm not hugely aware of it. But I think what's important to understand is that, in the limit of exploring the entire environment, your epistemic uncertainty is going to go to zero, right? You will have observed everything, and then it's unclear what your RL agent is going to do at that point, especially if you have a deep neural network that parameterizes how you take actions. Whereas, and I think this is what was really interesting for me to see when we ran these experiments, in active inference you just put a prior over your prior preferences and eventually you learn a mode of behavior. It may not be optimal, but your agent eventually learns to adopt a behavior that is self-fulfilling, because it reduces epistemic uncertainty, and then all that's left is to say, well, I think these are useful behaviors, or at least these are behaviors for me to do in the world. Eventually you get quite repetitive behavior, and it might be a more accurate simulation of how, in the absence of any information, something intelligent might actually behave in a world, whereas in the RL paradigm it's less clear: once you reduce all that epistemic uncertainty about where the value is and you haven't found any, what is your agent really doing at that point? But I think my problem at the moment with the active inference formulation is that this works on a discrete state formulation, right? We saw really nice results in that setting, but I think if you scale it up, active inference agents are going to have the same issues if you're using something like amortized inference to approximate your likelihood or transition functions, which means you might not get these nice properties that we're seeing at this small scale. So scaling it up is an interesting, open-ended problem at the moment: scaling up in the right way, so that you can include the conjugate models we have in the discrete state formulation. That's how we did the last set of simulations, where we introduced the conjugate priors to do the learning over the prior preferences, but also the likelihood. It becomes a little hazy how to learn, say, a hyperprior over an entire neural network if you're scaling it up that way, so a little work needs to be done in that area. And in order to actually show that it's appropriate, that you can have these interesting components, you need to start thinking about how you would include these hyperpriors, because it is not reasonable to say that a hyperprior over the parameter space such as beta, the hyperprior over the gamma precision, will be sufficient in those settings, or a hyperprior over the way the agent is selecting its actions; it has to be over the model parameters. And if scaling up means you're losing that nice way to disentangle the particular model parameters, then it becomes very uncertain. I don't know; it's an open-ended problem for me. Yeah, I think high-dimensional problems still represent something that is relatively difficult, especially because the further we stray from Bayes, the less principled it becomes, and it's quite a fine line. Very interesting how, whether within the RL or active inference paradigm, there's sort of the sparse skeleton, the bare bones at the core, and then sometimes these other layers or tweaks are needed; pretty interesting to learn about. And also what you had said earlier, Nor, about how the challenge is
planning for sequential action when you're in a feedback loop: whether by moving around an environment so your local environment changes, or playing a game where the board is going to change, or trading on a market. With sequential action you can't just plan steps one through a hundred without at least thinking about some what-ifs, and then there's only having access to limited observational data and planning amidst fundamental uncertainty. So I think a lot of those points of contact with the motivations of reinforcement learning and machine learning will maybe bring some more light to active inference and push some of those frontiers you just mentioned. So I have a question, and anyone else in the chat can ask a question too. Let's say somebody wants to learn about this, and they're actually in a lucky beginner's-mind position, because they might not have been enticed by learning reinforcement learning, but they've gotten curious about active inference and excited by your presentation. What kinds of computer languages or skills might they want to learn, or what kinds of approaches or mindsets would be helpful, if somebody weren't coming from a classical machine learning perspective but rather kind of upskilling into active inference? What would either of you recommend to them? Phil, do you want to go first with that? Yeah, I mean, I think it's an interesting way to view this kind of potential person, because I felt like I was somewhat like this person way back when Nor and I were starting to have these conversations about active inference. This paper started off basically as a tutorial I was writing, having spent two or three months in the evenings reading about and trying to sift through the active inference literature, and I think it's fair to say that at times it's unapologetically dense and quite difficult to read. So, without trying to self-endorse, I do think reading this manuscript in particular is worthwhile; the whole aim when we started writing this was to really understand what is happening here, what is this expected free energy quantity, why do we care about it. So from a theoretical perspective, I think this is a very lucid presentation of the concept, so at least you can get some intuition as to what's happening. As for the coding side, I can't really speak to that; I'm sure Nor has worked with it quite a bit. I was going to say it sort of depends on what the primary objective of the person is: is it to get an understanding of the high-level conceptual ideas, or to treat it as an algorithm? Because coming from the free energy principle to active inference is a different story from taking active inference as a siloed algorithm for a specific kind of sequential decision making scheme, right? So it differs depending on that. But I will still say I think this paper is really nice, in the sense that it does try to define all the different concepts and goes through the different formulations, maybe not in as much detail with the assumptions in play, but it gives you the layout of how you might be able to derive it. Certain things, like what the approximate density really entails, are very difficult questions where you would have to dive into the variational inference literature to understand. So from that perspective, someone coming into the field should spend some time thinking about variational inference and how that ties back to the active inference formulation, because the perception part of active inference is in most instances exactly the same as in the variational inference literature, i.e. optimizing the model evidence or maximizing the evidence lower bound. The second thing that I was going to add is that the
paper with Lancelot Da Costa as first author is really good for someone who wants to drill down into deriving everything themselves, though sometimes it's super technical. So the paper that I walked through today, the one with Phil and Tom and Carl, is a nice introduction for someone who's not familiar with the mathematics and just wants a lay summary, whereas the paper with Lancelot as first author gives the detailed derivations and a lot of the assumptions in place. So that's the understanding-the-theory part. From the coding perspective, it entirely depends what the end objective is. If someone wants to work with discrete state formulations, then there's the MATLAB code Carl has written, which reflects years of work, with lots of nice simulations and examples that you can use; and our code is also online, so there's a link in the software section of the paper that gives exactly where the code is, and you can look through that and see how the simulations were done. If someone is interested in higher-dimensional formulations of active inference, there's some recent work with Zafeirios Fountas as first author; again, we have a Git repo for that work as well, which gives a breakdown of how you would implement a simple deep active inference agent, for instance with a transition network. So there are lots of different areas, but someone starting out has to decide whether they want to focus on the theoretical side or the implementation side. The theoretical side would be drilling down into the variational inference and the maths behind it; if they want to focus on the coding aspect, then they want to figure out whether it's the continuous or discrete state formulations they're interested in, and then break it down: if it's continuous, then most of it would either be writing the equations of motion themselves, or they would have to use some sort of neural network to approximate the continuous distribution of interest, or the discrete state code, which Carl's written out. And if they have questions about the discrete state, or the continuous state space as well, I'm happy to take emails. Thanks for that distinction. It's such a large difference between the MATLAB code, which we got to walk through with Ryan Smith and Christopher Whyte, which is doing matrix multiplication and such, and then here come the neural networks, which sound like they're offering new opportunities with high dimensionality and continuous variables, but also a lot of new challenges. So what is the essence that's shared by the matrix form and by this more machine learning style? Because for some people it might be splitting a hair, quite literally the difference between two computer languages, when they're thinking about active inference from an ecological psychology, or an enactive philosophical, or an embodied performance perspective, all backgrounds that converge on active inference. So, to somebody outside that strand of hair, what is it that we can really distill that is core active inference? I wrote down a few things you had said; what are those core pieces that allow us to dive into the matrix mode with MATLAB, or into the neural network mode with, say, Python? So I think it comes down to my summary slide, the way I think about it. The core ingredients of active inference are, first, formulating the generative model, and here formulating the generative model means a parameterization of it: either you can use the discrete state categorical distributions, or you can use a more continuous state formulation, and again the neural network formulation is a specific instantiation of that. So that would be one way of differentiating. The second one is the optimization of the objective functions in play: in the MATLAB code, you're doing gradient descent using mean
field message passing algorithms, a specific formulation that's been introduced in a couple of papers, which I specifically didn't walk through; or you're doing backpropagation to actually calculate or learn the distributions, and then you're just using those distributions that you have. So it sort of depends on how you optimize those objectives: either you're taking an implicit forward model or you're taking an explicit generative model. And one thing I forgot to mention is that Alex Tschantz and Conor Heins have been working on a discrete state formulation of active inference in Python, which might be of interest for people who want to focus on one specific language that can do both the more high-dimensional stuff and the discrete state formulation that Carl has. I know they're also looking for people to work on the code base, if anyone's interested; it's called inferactively, I think, and that's on GitHub as well. Cool. Any last thoughts or comments from either of you? No, I think I'm OK. What do you reckon? I think we're noticing a bit of an increase in interest in active inference. Just to give you an example, in areas where active inference wouldn't even have been considered before, like robotics, you're now beginning to see more and more of it. So if you do want to get involved in it, now is a particularly good time. Great call, Philip. And I'll just re-recommend the excellent paper that we're discussing; it's in the description of this video. Really appreciate both of you for joining; you're always welcome to come on to speak about a paper you've authored, or not. Again, thanks a lot to Nor and Philip, and I hope to see you again on a future Active Inference stream. Thank you. Thank you for inviting us. Bye bye, see you, peace, bye bye. Awesome, stop the stream. Great conversation; thanks a lot to Philip and Nor, really.