Okay. Hello and welcome everyone. It is January 28th, 2022. We're here at the Active Inference Lab, ModelStream number 5.1, with Pietro Mazzaglia and Tim Verbelen. This is going to be a ModelStream presentation and discussion on their recent work, Contrastive Active Inference. We're going to have a presentation section and then a discussion, so please feel free to ask any questions during the presentation that we can address in the discussion. And Tim and Pietro, thanks a ton. We really appreciate you joining to share your work. So please take it away, and thanks again.

Yeah, thanks Daniel for inviting us as well. So I'm Tim Verbelen, and together with my colleague Pietro I'll talk about our work on Contrastive Active Inference. We'll first set the scene: why are we looking at active inference? Well, basically our lab wants to build intelligent agents. And from that perspective, we noticed early on that if you want to build something intelligent, it needs to be embodied; it needs to be interacting with its environment. And then it's a small step, of course, to delve into active inference, where basically your agent needs to understand the environment it's interacting with and needs to build a model of it. So we'll first give an overview of active inference and the way that we approach it. A lot of this material has also been covered in a previous ModelStream, I think number three, so if you want more details, you can dig up that one again. Afterwards, Pietro will take over and he will go into the details of the contrastive approach to active inference.

So let's get started. Basically, active inference is a process theory of the brain, and it says that the brain, or the agent, builds a generative model of the environment, which is basically a joint probability distribution over observations, so think of what you can see or experience; actions, which we denote as A; and then hidden states of the environment. So basically, you have your agent that is separate from the environment, and it can do actions, it can interact with the environment, and this gives rise to new observations. The idea of the generative model is basically that the agent figures out which kinds of hidden states are changed by its actions and give rise to its observations. And if you can build such a model, then this enables the agent to plan actions that bring it to some preferred observations or outcomes, and so forth. But the crucial bit is: how do you get this model of what happens if I do my actions, how does this influence the state, and how does this influence the outcomes that I see? And the crucial part of active inference is twofold. First of all, this is what the agent does, and it does so by optimizing so-called free energy, which is an upper bound on surprise, or prediction error. So basically, the generative model allows the agent to predict the outcomes that it will witness, and the better these match your actual observations, the happier you are as an agent. And crucially, you will also select the actions that will minimize the free energy you expect in the future.
So we'll dig a bit into the models, just to set the scene, on the one hand notation-wise, so that we all know what the O's and S's and A's are, but also to then see the move that Pietro will make from the, let's say, vanilla active inference formulation towards a more contrastive formulation of the active inference free energy objective. So we start off by setting the scene with the generative model. It's a bit laid out like the diagram of the agent and the environment that was on the previous slide, but this unfolds over time. So you are in a certain state that gives rise to a certain observation, and then, given an action and your previous state, you move ahead to the next state, and this process unfolds over time. You can see some of the circles are colored gray: these are basically the things that you can observe. So you know the actions that you did up until now, and you know the observations that you saw up until now. All the rest is for you to infer. You can infer the hidden states up until now, but you can also try to infer the hidden states of the future, the actions that you want to take, or the observations that you will experience.

And so the so-called generative model, the joint distribution over the sequence of observations, states and actions, is then characterized as follows. You have a prior over actions, which basically determines the probability that you take certain actions at a certain time. You have transition probabilities: the probability that I will transition to this next state, given my previous state and the action I did. And you have the likelihood model, which basically says: given the state I am in, which observation will I see. And this encodes the so-called Markov assumption: the observation that you see at this time step only depends on your hidden state, and it does not directly depend on anything else, because if you know your current hidden state, then you know the observation you will see. So that's basically what's reflected here. Of course, having this generative model allows you as an agent to assess how likely a sequence of observations is, for example, and it allows you to predict, given these actions, what will happen. But one crucial bit, of course, is still the inverse of this model: given that I saw these observations and that I did these actions, what is my current state? And this is non-trivial. Even if you have the exact generative model, inverting it is typically intractable. So that's why in active inference you resort to variational inference, and you just say: okay, I assume that I can build a model, the so-called recognition model, for example, and this is the thing that will tell me, given certain observations, what is the probability that I am in a certain state. So that's what's depicted here. We introduce this Q, and Q is basically a variational approximate posterior. It can be any distribution, you can choose it, and you just say: okay, given some observations, I want to have the best estimate for the state I am in. And the free energy principle just states: if that's what you want, then this is easy, you just optimize the free energy, which is denoted F here. And it's basically the expectation, under your approximate posterior, of the difference between the log probability of your approximate posterior and the log probability of the generative model.
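For reference, the generative model described a moment ago factorizes, under this Markov assumption, as follows (a reconstruction in standard notation; the slides may write full sequences with tildes):

$$
P(\tilde{o}, \tilde{s}, \tilde{a}) \;=\; \prod_{t}\;\underbrace{P(a_t)}_{\text{action prior}}\;\underbrace{P(s_t \mid s_{t-1}, a_{t-1})}_{\text{transition}}\;\underbrace{P(o_t \mid s_t)}_{\text{likelihood}}
$$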
And if you can minimize that free energy, it basically means that you will have the best explanation for the observations you see, but at the same time, you also have the best approximation to the true posterior. We're not going to go through the whole derivation of this formula, but basically you can convert it to the second equation line, and this is the one that we use most often in our models. It is basically a KL divergence between this approximate posterior, so basically what is the state I am in given the observations I saw, and this prior that says: this is my guess that I am in this state, given my previous state and action. So I don't know the observation yet, but I want to have the best guess without my observation, and when I do see the observation, I don't want my beliefs to completely switch, because then probably there's something wrong with my model. And then you have the second term, which is the log-likelihood term. This is basically the accuracy of your model, or how good you are at reconstructing the outcomes.

So this is all for the past, basically, up until your current time step. You know the observations you saw, so you can evaluate the free energy, and then you can update your model in order to minimize this thing. But of course, you also need to select your actions, so you want to look into the future. And if you look into the future, then we talk about the expected free energy. Here, we use pi as a shorthand for a sequence of actions into the future, and I also switch from t to tau just to denote that we are talking about future time steps. The important bit is that now, in the expectation, you don't have an expectation only over states, but also over outcomes, because you haven't sensed your observations yet. You don't know them, so you can only take an expectation over anything that could happen. And then the move that is made in active inference is that, on the one hand, you form a term which they call, or what we call, the instrumental value, or realizing preferences. It basically states that, as an agent, I have some prior over future outcomes that I think I will realize regardless of what happens; these are your preferred outcomes, let's say. You can also see this more like homeostasis: I expect my body temperature to be 37 degrees Celsius, so my expectation before knowing anything is that it will be 37 degrees, and hence I will act in order to make it so. That's a bit what's reflected here: you disregard the dependency on your future actions, and you just say, okay, my prior is that this is what I expect. So this becomes the instrumental value. And then the second assumption is that your approximate posterior is very close to the true posterior, so that you have a good approximation. Then you can rewrite it as the second term, which basically means that on the one hand you have a belief over the states given the actions you will do, and on the other hand a belief about future states given the actions and certain outcomes that you expect to see. Basically, it asks: what would be the information that I get from observing certain outcomes? And this is kind of an epistemic value, or an information gain, which can drive you to explore.
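To keep the notation concrete, these two objectives are commonly written as follows in the active inference literature (a reconstruction from the description above, not a verbatim copy of the slides):

$$
F_t \;=\; \underbrace{D_{\mathrm{KL}}\big[\,Q(s_t \mid o_t)\;\big\|\;P(s_t \mid s_{t-1}, a_{t-1})\,\big]}_{\text{complexity}} \;-\; \underbrace{\mathbb{E}_{Q(s_t \mid o_t)}\big[\ln P(o_t \mid s_t)\big]}_{\text{accuracy}}
$$

$$
G(\pi) \;=\; \underbrace{-\,\mathbb{E}_{Q(o_\tau \mid \pi)}\big[\ln \tilde{P}(o_\tau)\big]}_{\text{instrumental term}} \;-\; \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\,D_{\mathrm{KL}}\big[\,Q(s_\tau \mid o_\tau, \pi)\;\big\|\;Q(s_\tau \mid \pi)\,\big]}_{\text{epistemic term}}
$$

Minimizing $G$ thus both realizes preferences and seeks information. Anticipating the action-selection rule described next, policies are then scored with a softmax over negative expected free energies, $Q(\pi) = \sigma(-\gamma\, G(\pi))$, where $\gamma$ is the precision parameter.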
Basically, if you don't know how to get to your preferred state, you at least want to get to states that give you more information on where you are. The classic example is the owl that needs food: what do you do, do you eat first, or do you search for prey first? The epistemic value is basically searching for prey: where should I go to get more information on where the prey is? And once you know where it is, you can realize your preferences and go towards the prey.

So how does action selection work then? Well, basically, you want to select the actions that minimize your expected free energy. At each time step, the first thing you do is use your approximate model to estimate your current state: knowing, given my latest observation, which state am I in now. Then you can evaluate the expected free energy for each policy or plan, so for any future sequence of actions. This results in a belief over policies: you take the negative expected free energy, you multiply it with this precision parameter, which just states how much confidence you have that your expected free energy estimate is correct, and then you push this through the softmax function. This just says that the policies that have low expected free energy are the ones that are most likely; that's the only thing this formula says. And then you infer the next action according to this: you select the next action from the sequence that you think will give you the minimum expected free energy. And that's how it goes: you take this action, you get a new observation, and the process repeats.

So one crucial point in our work is that it all starts with this generative model and its approximate posterior. Typically, you have a certain problem and you know what it looks like, what your observations are, what the hidden state might be, and then you can really pinpoint and write down the exact model and start optimizing. But in our case, this is often not true. If you look at a robot that drives around and gets camera inputs, for example: what is the state space that you need to track? How do you convert these pixels to the state space? All these things are just not given. And the goal in our work is: can we completely start from scratch and learn this? For this, we use deep neural networks as function approximators to actually provide us with these models, and we optimize the parameters of these neural nets, also by minimizing the free energy. That's the core idea, let's say.

So what does it look like? We call this a world model. We start off with observations and actions. The observations can be pixels, basically an n-by-m matrix of numbers, and the actions can be any action vector: it could be your velocity, or whatever your agent can do. These numbers are fed into a neural net, which we call the encoder, and this then reflects the approximate posterior. It just says: given my previous state, my action and my current observation, output a probabilistic state representation, which is basically the means and variances of Gaussian distributions. And then we have a second neural net, which we call the transition model, and this says: what will happen if I do a certain action? How will my state evolve if I do a certain action?
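A rough sketch of these two components in PyTorch, assuming Gaussian latent states as described (layer sizes are hypothetical, and the decoder comes next in the talk):

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps features to the mean and std of a diagonal Gaussian state."""
    def __init__(self, in_dim, state_dim):
        super().__init__()
        self.net = nn.Linear(in_dim, 2 * state_dim)

    def forward(self, x):
        mean, log_std = self.net(x).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

class Encoder(nn.Module):
    """Approximate posterior q(s_t | s_{t-1}, a_{t-1}, o_t)."""
    def __init__(self, obs_dim, action_dim, state_dim, hidden=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(obs_dim + action_dim + state_dim, hidden), nn.ELU())
        self.head = GaussianHead(hidden, state_dim)

    def forward(self, prev_state, prev_action, obs):
        return self.head(self.features(
            torch.cat([prev_state, prev_action, obs], dim=-1)))

class TransitionModel(nn.Module):
    """Prior p(s_t | s_{t-1}, a_{t-1}): predicts the next state
    before seeing the observation."""
    def __init__(self, action_dim, state_dim, hidden=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ELU())
        self.head = GaussianHead(hidden, state_dim)

    def forward(self, prev_state, prev_action):
        return self.head(self.features(
            torch.cat([prev_state, prev_action], dim=-1)))
```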
Now finally, we also have the decoder, or the likelihood model, that, given a state, outputs an observation. In the case of an image, for example, this will generate you a new image. And the goal is, of course, to have the best predictions possible. So if you look at the free energy formula again: in this case you again have this likelihood term, which just says, given the output of the decoder, so the generated image, I want this to be close to the actual image that you then see. So it's just a reconstruction loss in neural network terms, let's say. And the second term is a KL divergence between the distribution that you generate from the encoder and the distribution that you generate from the transition model.

So we applied this to a number of cases, which were also shown in the previous ModelStream, just to give you some intuition. The first one was the mountain car problem, which is a basic control problem. Here, the sensory input is the position of the car, and you basically have to infer not only the position you're in, but also the momentum, the velocity you have. On the right, you can see the model predicting all likely trajectories for going left or right, and you can see how in the beginning it's not sure about the velocity, so its predictions are very spread out, but the more information it gets, the more it collapses to: yeah, I'm pretty sure that this is the behavior that will happen. And then you can use this to drive the agent towards a preferred state, in this case the flag. The second one was the CarRacing environment. Here the observations are just pixels from this game, and the preferred state of the car was to be in the center of the track. You can see how it actually infers the actions that will bring it to the center of the track, and it might even cut corners in order to reach the preferred state a bit faster. And finally, we also did this on a robot navigating our lab, which we equipped with a number of sensory modalities: you can see the camera, but also a front-facing LiDAR, and also a radar range-Doppler. The range-Doppler basically gives you on the y-axis the range and on the x-axis the velocity of the reflections, the Doppler. And here you can see how in the beginning we feed it a number of real observations, and then we let the model imagine what could happen: it imagines what it will see if it turns around, for example, because it has actually learned the basic dynamics, the basic behavior, of all these sensor modalities. So this is pretty cool.

Okay, so what are the limitations of this approach? There are two core limitations that we address in the work of Pietro. The first one is that we use this pixel-wise reconstruction both to learn the model and to define your preferred state: this is the image that you want to see, try to make it happen. But the problem is that mean squared error in terms of pixels is not really the best metric. For example, if you take the left image and you want to assess how similar another image is to it, we have two examples on the right, and you can see that the same image with some salt-and-pepper noise actually scores worse in terms of mean squared error than an image where the pose of the two-joint arm is actually incorrect. So although in terms of behavior the left one is better, in terms of mean squared error, the right one is better.
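This pathology is easy to reproduce. Here is a tiny, self-contained illustration with synthetic "images" (not the ones from the slide): a sprinkle of salt-and-pepper noise costs more mean squared error than moving an entire object.

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic 32x32 "goal image": black background with a bright 4x4 blob.
goal = np.zeros((32, 32))
goal[10:14, 10:14] = 1.0

# Candidate A: the goal image with salt-and-pepper noise
# (semantically the same scene).
noisy = goal.copy()
idx = rng.choice(32 * 32, size=120, replace=False)
noisy.flat[idx] = rng.integers(0, 2, size=120).astype(float)

# Candidate B: the blob moved to another corner
# (semantically a different scene).
moved = np.zeros((32, 32))
moved[24:28, 24:28] = 1.0

mse = lambda a, b: np.mean((a - b) ** 2)
print(f"MSE(goal, noisy) = {mse(goal, noisy):.4f}")  # higher error
print(f"MSE(goal, moved) = {mse(goal, moved):.4f}")  # lower error

# Pixel-wise MSE prefers the semantically wrong image: the moved blob
# scores "closer" to the goal than the noisy but correct scene.
```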
So that's, of course, problematic if you want to control the arm towards a goal. And then the second limitation is that if you need to evaluate the expected free energy for a huge number of potential trajectories, potential actions you can take, this becomes intractable as the number increases. The ways that we cope with this in the contrastive work are, on the one hand, instead of using a pixel-wise reconstruction error, to use contrastive learning instead; how exactly this works will become apparent in the next few slides. And the second thing is that, instead of evaluating the expected free energy for all the policies, we basically amortize the policy selection scheme: we also train a neural net to output actions given the current state. And with that, we can now shift to Pietro, who will talk about the contrastive formulation of the active inference problem.

Thank you, Tim. All right. So thank you, Daniel, for having us, and thank you, Tim. I hope you can hear me well. Okay, so I'll try to share my screen now. Okay. All right. So now I will talk about our recent work, Contrastive Active Inference. This work was recently published at NeurIPS 2021, so it's very recent; it came out just last month. So let's start delving into it. The setting that we discuss in active inference is very similar to the reinforcement learning one, with the difference that in reinforcement learning, all behavior learning is driven by rewards. The agent receives a reward function, and positive rewards should reinforce positive behaviors, while negative rewards should penalize the agent so it avoids those states and actions. However, one of the problems that comes with reinforcement learning is that in order to actually learn from rewards, you need a reward function, and that's not always easy to have. For instance, as Tim mentioned, especially when the state is not known in advance, so the agent doesn't exactly know its state, it's difficult in that case to design a reward function, because you're not sure of what the agent knows and how it can assess its performance in the environment. So we instead focus on active inference. In active inference, the agent follows the principle of minimizing free energy, as we have just seen. This principle actually enables two things. One is to learn a model of the world; we call this a world model in our work. The other objective is to minimize the free energy in the future by trying to achieve some preferred outcomes. So we assume that the agent has some preferred outcome distribution that it wants to achieve, and its goal in the future will be to actually achieve these preferred outcomes. The environment setting we discuss is that of a POMDP, a partially observable Markov decision process. So just to recap: we have observations that the agent receives, it has to infer the internal state of the environment, which is not observed, and then there are actions, which are known for the past, but which the agent should infer, or somehow choose among a set of possible actions, for the future.

So this is just a summary of what the world model looks like. As we've seen in the previous slides, we have an encoder that encodes the information from an observation. We focus on visual environments, so here we have, again, an image, which is basically an N-by-N matrix.
So the encoder could be, for instance, a convolutional neural network. Then we have the hidden state model, which takes the previous state and the previous action; in particular, in our case we use some form of recurrent neural network in order to keep preserving the history of the environment. And then we have the decoder that computes a reconstruction of the observation from the current state. So it tries to encode inside the hidden state as much information as possible of what comes from the observation. The problem with reconstructions is that computing them, especially in visual environments, is quite complicated, because you need big models that have a very good representational capacity, and even then the models cannot be 100% accurate in high-dimensional settings, because predicting an image exactly, pixel by pixel, is practically infeasible. So it rarely happens.

Let me give an example here. A few weeks ago, I was training a VAE model, so a model similar to this one on the left, with this encoder-decoder architecture, on an Atari game, the Breakout game, trying to learn a hidden state and to learn actions on top of that hidden state. The problem is that the reconstructions of the VAE were actually pretty bad, in that they were losing very important information about the game. For instance, it was kind of able, with some uncertainty, to model where the paddle of the game is, but it wasn't able to model where the ball is, which is actually one of the two most important details in order to actually be able to play. So even having the reward function available in this case, so having the game score, the agent wasn't able to learn the task, because the state was lacking the most important information needed to keep improving. This is one issue that we try to overcome in our work.

The second part of active inference involves learning to pursue the preferred outcomes. In order to pursue preferred outcomes, an active inference agent will do two things: on the one hand, try to minimize the distance with respect to these preferred outcomes, and on the other hand, also minimize the ambiguity with respect to the environment. Normally, this is done by trying to match two distributions, as we saw with the KL divergence: trying to match the distribution of the imagined outcomes with the preferred outcome distribution. However, again, in a high-dimensional setting this can be quite complex, because how do you define a distribution on a high-dimensional image? Could it be, for instance, just a Gaussian centered on the pixels, with the mean being the pixel values and then some fixed standard deviation? In that case, we get into trouble, because we have the same issue discussed before. We could, for instance, have this kind of goal here and a noisy observation like this, which actually has a higher mean squared error compared to an image that is very distant from the goal. And these kinds of situations, especially when using reconstructions, or in more realistic settings, are very, very likely, because, for instance, you can think that the center image is actually just a reconstruction from the model, which is not a hundred percent accurate. So that could well be the case, and indeed the agent would be confused: it would think that it's not achieving the goal compared to, for instance, a past observation, even though it was actually closer to the goal.
Or again, when there is some noise in the environment: in real-world setups, like in robotics, we always have this noise in the observations. So it's hard to match a preferred outcome in a high-dimensional setting, and we also try to overcome this issue here.

So what we propose is to use contrastive learning. Contrastive learning is a mechanism popular in the unsupervised learning scene that we will discuss more in depth in a few moments. The objectives that we want to achieve with our method are: to avoid reconstructions in learning the world model, so we don't have the decoder anymore, as we see on the right; then, to be able to match preferred outcomes in a lower-dimensional space, because we have seen that in high dimensionality that's problematic; and also, we would like this low-dimensional state to be somehow representative of the task, so that when we match our goal in this low-dimensional space, we are actually doing something that brings us closer to the actual preferred outcome that we want to achieve in the high-dimensional setting.

So let's compare, to see what the differences are between the likelihood active inference model and the contrastive one. The idea in the likelihood active inference model is that we want to maximize the accuracy of reconstruction. This means that we have this decoder that maximizes the likelihood of the observation given the state: we want the state to maximize the information that it contains about the observation, basically. In contrastive active inference, we do something different. Instead of trying to reconstruct the current observation, we compress this observation with the encoder again and compare it to all the others, or not quite all the others, as we'll see in a while, because that's infeasible, but to many, many other samples that represent something different. In the latent space, in this compressed space, we want our state and the compressed image to be very close, while our state should be very distant from all the other images. So we are indeed maximizing the similarity with the corresponding sample, which is here called the positive sample, while we want to minimize the similarity, so maximize the distance, against all the other samples, which are called negative samples in contrastive learning. As we'll see in a moment, this mechanism maximizes a lower bound on the mutual information: we are basically trying to maximize the information between corresponding observation and state, while minimizing the information with respect to all the other, negative pairs.

So, as we've seen before, the free energy of the past can be summarized with this equation here. Here I'm just talking about one single time step: Tim presented it for full sequences by using the tilde notation, while here we're just considering one time step at a time. As we've seen, the free energy is basically an upper bound on the surprisal that we want to minimize; we minimize free energy in order to minimize the surprisal of the agent. And here it is actually evident that we have this evidence bound: the KL divergence is always greater than or equal to zero, and we basically want it to reach zero, hypothetically, in order to tighten this evidence bound. And this can also be rewritten in a way that is more practical to implement.
So we have the likelihood of the observation given the state, which represents the accuracy of our model, and then, again, the complexity of the model, which is the KL divergence between our variational distribution Q, where we use Q of s given o, which is basically the auto-encoding variational posterior, as is typically done in variational autoencoders: you infer the parameters of your posterior distribution by using the corresponding observation. And we want to minimize the KL divergence between this auto-encoded posterior and our prior over the current, or future, state, given the past states and actions. In our case in particular, we learn this prior: we don't just use a uniform prior over states, but we learn our prior to be predictive of what the state is, given the past. You can basically see it, in machine learning terms, as a conditional variational autoencoder.

With contrastive learning, what we try to do is, as I said, to maximize the state's similarity with the corresponding observation, our positive sample, while minimizing it with the others. This means, again, that we want to maximize the mutual information between the positive sample and the corresponding state, and minimize the information with the negative samples. This can be written like this. Noise contrastive estimation, NCE for short, provides a lower bound on the mutual information, where we see that we basically have something like a softmax over all the observation-state pairs. What does that mean? For each pair of state and observation, we want the value of this critic function f to be as high as possible for the corresponding pair, and very low with respect to the others, so that the exponential of this, compared to the sum of all the other exponentials, is high for the corresponding observation and very low for the rest. This is basically saying that we want matching pairs to get a very high value and non-matching pairs a very low value. This lower bound is an approximation, normally, where we take a number of samples K from a joint distribution that we define between X and Y. In particular, in our case, X and Y represent our observations and our hidden states: we define a priori this joint distribution to represent the fact that this state corresponds to this observation, and we want to maximize the information between them. And this function f(x, y), again, is a so-called critic function. What does that mean? It's a function that should approximate the log-density ratio that we see on the right. I won't go into the mathematical details of this, but basically it's a mapping of the two inputs, our observation and our state, and we want the output to be high for corresponding pairs and low for non-corresponding pairs.
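To make this concrete, here is a minimal InfoNCE-style loss in PyTorch — a generic sketch of the mechanism just described, with a simple bilinear critic, not the exact critic from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearCritic(nn.Module):
    """Critic f(o, s) = e(o)^T W s, scoring observation/state pairs."""
    def __init__(self, embed_dim, state_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(embed_dim, state_dim) * 0.01)

    def forward(self, obs_embeds, states):
        # Returns a (B, B) matrix of scores f(o_i, s_j) for all pairs.
        return obs_embeds @ self.W @ states.t()

def info_nce_loss(critic, obs_embeds, states):
    """The positives are the matching (o_i, s_i) pairs on the diagonal;
    every other pair in the batch acts as a negative sample."""
    logits = critic(obs_embeds, states)    # (B, B) pairwise scores
    labels = torch.arange(logits.size(0))  # positive pair indices
    # Cross-entropy pushes f(o_i, s_i) up and f(o_j, s_i), j != i, down,
    # which maximizes a lower bound on the mutual information I(o; s).
    return F.cross_entropy(logits, labels)

# Usage with random stand-in embeddings:
critic = BilinearCritic(embed_dim=64, state_dim=32)
loss = info_nce_loss(critic, torch.randn(16, 64), torch.randn(16, 32))
```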
So how do we transition from the free energy of the past that we've seen to our contrastive formulation? The first step is to add to the free energy functional a term that we assume to be constant: the entropy of the observations. How can we assume that the entropy of the observations is constant? In machine learning, we generally have a dataset from which we sample our observations of the past, so we assume that when we train, the distribution over observations is fixed. The entropy of this distribution will then always be a constant, because we cannot modify it, as opposed to the states, which we instead learn. The distribution of our outcomes cannot be modified, and so its entropy is a constant. If we add this term to the free energy functional, we can rewrite it as the KL divergence minus this information gain, or mutual information term, between the states and the observations. Given this, we can now apply the fact that the contrastive learning functional is a lower bound on the mutual information, and derive the contrastive free energy of the past, where we write all the terms out explicitly according to the previous slide. We basically have again this KL divergence term, then the value of f between o and s to maximize, and then the value of the functional with respect to all the other, negative pairs to minimize. This gives us, again, an upper bound on the surprisal. We see that this upper bound is actually even looser than the normal free energy upper bound, but as we'll see later, it has some nice properties: it explicitly helps us to get rid of the reconstruction and to learn a different representational space that has some advantages compared to the likelihood-based representation.

So let's now talk about how we can learn to behave using contrastive active inference. In the likelihood-based active inference model, what we were trying to maximize was, again, the likelihood of the future observations, but under the preferred distribution: we want the imagined outcomes to be as close as possible to the outcomes that we prefer. For visual environments, this implies that we reconstruct what we imagine will happen in the high-dimensional space, so the image basically, and then we compare it to our preferred image: for instance, as we saw before, we can use a mean squared error distance, or we can use a Gaussian and compute the likelihood under the preferred distribution. For contrastive active inference, we instead use a contrastive mechanism again, where we now want the future state to correspond with samples from the preferred distribution: we want the outcomes that we prefer to actually be close to the states that we imagine. So now we don't need to reconstruct what the outcome of our actions will look like anymore; we can just ask: is the state that we imagine matching the preferred outcome that we want to achieve? And that's what we maximize similarity with. And, again, we also have some form of ambiguity minimization, or epistemic value, in trying to minimize the similarity with respect to other outcomes. In this case, minimizing the similarity with respect to outcomes that are not in the preferred distribution basically means that you either want to go far from something that you have already seen before, in order to maybe get closer to the preferred outcomes, or you just want to minimize your ambiguity: you want to be as far as possible from other outcomes and as close as possible to the actual preferred outcome that you are hoping for.
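Schematically, the contrastive free energy of the past derived a moment ago has the following shape (my reconstruction from the description above; the paper's exact notation may differ), with $f$ the learned critic and $o'$ ranging over the negative samples:

$$
\tilde{F}_t \;=\; D_{\mathrm{KL}}\big[\,Q(s_t \mid o_t)\;\big\|\;P(s_t \mid s_{t-1}, a_{t-1})\,\big] \;-\; \mathbb{E}\big[f(o_t, s_t)\big] \;+\; \mathbb{E}\Big[\ln \sum_{o'} \exp f(o', s_t)\Big]
$$

Minimizing it pulls the state towards its matching observation and away from the negatives, while keeping the posterior close to the learned prior.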
As we've seen before, the expected free energy can be summarized like this. I'll first highlight some differences with respect to the equation that Tim presented. First of all, here we take actions to be part of the inference process. Before, we've seen that you can have a distribution over policies, then sample the actions from the policy and compute the free energy of the future conditioned on a given policy. Instead, here we make the actions part of the generative model for the future, and we actually want the agent to infer the actions for the future, not just compute them as a posterior over some distribution of policies. So we now have the actions in the posterior, so that we infer both the future states and the future actions, and we also have the prior over the preferred outcomes, which I indicate with a tilde. I hope that's not confusing, because before the tilde was used to indicate sequences, but in the paper I actually used it to indicate the preferred outcomes; so, yeah, a notation issue, but I hope that's not confusing. The tilde here basically says: this is the preferred distribution over observations, states and actions; this is our target distribution, what we hope to achieve in the future.

And we can rewrite this as the sum of three terms. First of all, I'm assuming that the agent has no prior preference over actions, so that any action that will bring it to the preferred outcomes is fine: it has a uniform prior over actions, and the actions don't really matter as long as they lead to the goal, let's say. In this way, we obtain an action entropy term. Then, the intrinsic value is the same epistemic value that we've seen before, the one that should lead the agent to explore the environment more, or to reduce its ambiguity about the environment. And then we have the extrinsic value, which is basically rewards, or just a way to get closer to the actual preferred outcomes: the value to pursue in order to minimize the distance from the preferred outcomes.

In our contrastive expected free energy, we again do a similar move as we did for the past. Here we assume that we are taking the expectation over our preferred outcomes: since we don't imagine outcomes in the future, we just assume that the outcomes will be distributed according to the preferred outcome distribution, so that we can again add the entropy over our fixed preferred outcome distribution, which is a constant. Then the steps are the same: we have again this mutual information term, between the preferred outcomes and the states that we imagine in the future, the action entropy term, and this KL divergence between the posterior states and the prior states. This is a complex term, I'd say, because it basically should represent the difference between what the agent believes will happen and what is supposed to happen in the environment. Normally, in active inference, we assume for the future that the model of the world is correct: the agent has no control over its world model, it cannot change how the environment dynamics transition from one state to another. So I'm assuming here that this KL divergence term is actually zero, though I've seen that in some works it could also be left greater than zero; but then, basically, the agent imagines that it can violate the environment dynamics, hoping for better dynamics that would allow it to be optimistic and think: the thing that I imagine will happen is actually going to happen. Here, we don't allow the agent to modify the environment dynamics from one state to another; we just assume that our posterior over the transition dynamics of the environment is correct, and so the KL divergence is zero.
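Putting the pieces together, the decomposition just described reads roughly as follows (a schematic reconstruction, with signs chosen so that minimizing $G$ maximizes all three values; the paper's exact formulation may differ):

$$
G_\tau \;\approx\; \underbrace{-\,\mathcal{H}\big[Q(a_\tau \mid s_\tau)\big]}_{\text{action entropy}} \;-\; \underbrace{\mathbb{E}\,D_{\mathrm{KL}}\big[\,Q(s_\tau \mid o_\tau)\;\big\|\;Q(s_\tau)\,\big]}_{\text{intrinsic (epistemic) value}} \;-\; \underbrace{\mathbb{E}\big[\ln \tilde{P}(o_\tau)\big]}_{\text{extrinsic value}}
$$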
Then our objective, applying the contrastive learning lower bound again, translates into this: we have the contrastive mutual information between the preferred outcomes and the imagined states, and this action entropy term. If you write it out explicitly, we again have these two terms, which are reminiscent of the extrinsic value and intrinsic value of the expected free energy. The term that should minimize the similarity with the negative samples is doing something similar to what the intrinsic value in active inference does: trying to be distant from previously seen outcomes is kind of similar to exploring the environment to minimize your ambiguity, trying to find something that gives you more information, not something that you have already seen.

The world model can be summarized in the three main components that we learn. We have our prior network, which, as I said before, is learned and should capture the transition dynamics of the environment, trying to predict future states given past states and actions. Then we have this GRU cell, which is shared between the prior and the posterior networks; this is what allows us to bring our history with us, so we don't just stop at the previous state, but also include some information about earlier states, so that we have more information available to infer what the current state actually is. Then we have our posterior network, which also has access to the current observation, and this posterior CNN, as I mentioned, is a convolutional network. And yeah, here we have the actual image resolution for our environments, which is 64 by 64, but that's less important; the important thing is that we have a convolutional model that compresses the information from the observation for us. This same convolutional network is also linked to the representation model, which is the critic of the contrastive learning mechanism: the function that matches states and observations in order to learn a good contrastive representation.

The function that we minimize with respect to the past is our contrastive free energy of the past, summed over an arbitrary number of discrete time steps over past sequences. It is important to say that, for the past, the negative samples that we take are observations from the same sequence as the corresponding observation. So let's say that we have an observation and the state: the negative samples will be all the other observations within the same sequence that are not at the same time step, but also observations that come from other sequences. So we are basically contrasting the current state both with different time steps, what happened at different moments of the same sequence of actions, and with what happened in different situations, so different sequences. That's how we try to foster our contrastive learning mechanism.
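A rough PyTorch sketch of these three components, with hypothetical sizes (the GRU cell shared between prior and posterior, the convolutional encoder for 64x64 observations, and Gaussian state heads as before):

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, action_dim, state_dim=32, hidden_dim=256):
        super().__init__()
        # Shared recurrent cell: carries the history of states and actions.
        self.cell = nn.GRUCell(state_dim + action_dim, hidden_dim)
        # Prior head p(s_t | s_{t-1}, a_{t-1}): predicts without the observation.
        self.prior = nn.Linear(hidden_dim, 2 * state_dim)
        # Convolutional encoder for 64x64x3 observations.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ELU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ELU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ELU(),
            nn.Conv2d(128, 256, 4, stride=2), nn.ELU(),
            nn.Flatten())  # -> (B, 1024) for 64x64 inputs
        # Posterior head q(s_t | h_t, o_t): combines history and observation.
        self.posterior = nn.Linear(hidden_dim + 1024, 2 * state_dim)

    def _gaussian(self, params):
        mean, log_std = params.chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

    def step(self, prev_state, prev_action, hidden, obs=None):
        hidden = self.cell(torch.cat([prev_state, prev_action], -1), hidden)
        prior = self._gaussian(self.prior(hidden))
        if obs is None:  # imagination: no observation, use the prior only
            return prior, prior, hidden
        feats = self.cnn(obs)
        post = self._gaussian(self.posterior(torch.cat([hidden, feats], -1)))
        return prior, post, hidden
```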
Then we have the action models. For our action model, we have two networks: one is the so-called action network, which basically infers the action to take given a state; and then we have this expected utility network, and this takes care of what Tim anticipated, namely that we are amortizing the action selection process over very long-term sequences by using a network that estimates what the value of a certain state is for the future. I'll try to be clearer here. Basically, we have the action network minimize this G-lambda functional, which is an estimate of how much value is in a certain state. And how do we get this estimate? The estimate is provided by this formula here: at every step, we take the actual expected contrastive free energy for that state, and then, for the future, we compromise between an estimate of what the network would predict the future value is going to be, and the value itself that we are computing with the functional. So at every step, we basically sum the value that we expect at that step, and we bootstrap, as we normally say in reinforcement learning: we apply some form of dynamic programming to sum this value with what we expect could happen in the future, and we use this as the target for learning the estimate. So we basically have the estimate on one side, and the estimate of the future plus the current value on the other, and we compare them: we want the actual estimate to be close to what is actually happening plus the future estimate. This is actually what is normally done in reinforcement learning when you apply the so-called Bellman equation to estimate what's going to happen in the future by using what you actually know, generally the reward, in our case the expected free energy value, and what you can already estimate for the future.

In our experiments, we compare four baselines. The first does model-based reinforcement learning using a likelihood model: this is the Dreamer baseline, which learns a likelihood-based world model and uses rewards for learning actions, so the reward function is already given to the agent. Then we compare with Contrastive Dreamer, which is a modification of Dreamer using contrastive learning for its world model instead of reconstructions. And then we compare the two flavors of active inference: the standard one, let's say, with the likelihood reconstruction model, and our contrastive formulation. We use similar architectures and training routines for all four baselines. The training routine can be summarized as we see here in pseudocode. For a certain number of training steps that we fix in advance, we are going to train our world model on the previous experience, drawn from a replay buffer that basically represents our dataset of past experiences. Then we are going to use the trained world model to imagine some trajectories into the future, using our action model, and the replay buffer as well, which is needed because we have to take the negative samples for the contrastive free energy functional. Then, on the imagined trajectories, we are going to train our action model to actually pursue the preferred outcomes better. And then we are going to go back to the environment and collect a new trajectory, using our world model to infer what the hidden state of the environment is at every time step and using the action model to select the actions according to the state that we inferred, and add the just-collected trajectory to the dataset.
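In code, the loop might look roughly like this (all names here are hypothetical stand-ins for the components just described; the actual routine is in the paper's implementation):

```python
def training_loop(env, world_model, action_model, replay_buffer,
                  collect_episode, num_steps, horizon=15):
    """Alternate world-model learning, latent imagination, action-model
    learning, and data collection, as in the pseudocode on the slide."""
    for step in range(num_steps):
        # 1. Train the world model on past experience by minimizing the
        #    contrastive free energy of the past (negatives from the buffer).
        batch = replay_buffer.sample()
        world_model.update(batch)

        # 2. Imagine trajectories forward from inferred states, acting
        #    with the current action network for `horizon` latent steps.
        start_states = world_model.infer_states(batch)
        imagined = world_model.imagine(start_states, action_model, horizon)

        # 3. Train the action and expected-utility networks on the imagined
        #    trajectories, bootstrapping the G-lambda value targets.
        action_model.update(imagined, negatives=replay_buffer.sample())

        # 4. Collect a new trajectory in the environment, inferring states
        #    with the world model and acting with the action model.
        replay_buffer.add(collect_episode(env, world_model, action_model))
```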
We do this continuously: train the world model, imagine some trajectories, train the action model, and keep collecting, so that we continuously improve both the data collection process, because the world model and the action model get better, and our models themselves, so that we get closer to the goal.

One important insight to discuss before diving into the empirical evaluation of the method is the fact that using contrastive learning strongly reduces the computational requirements of the model. Here I'm comparing the number of million multiply-accumulate operations for our models and the number of parameters, and as we see, these are much lower when we use a contrastive mechanism compared to using a likelihood model. This is also reflected in terms of wall-clock times: our model is quite a bit faster compared to Dreamer, which trains a likelihood-based model of the world and just uses the rewards for learning actions, and it is much, much faster than the likelihood active inference model, because the likelihood active inference model, other than having to do the reconstructions during world model training, also has to imagine the high-dimensional outcomes in the future. In that case you have even more computation, because for every imagined trajectory you have to imagine all the images that you would get pursuing a certain policy. So our model is quite a bit faster than that.

The first task that I will discuss is a simple MiniGrid task. The agent, represented by the red arrow, has to navigate a black grid in order to reach a green square that is placed in one of the corners of the grid. The environment is partially observable, because the agent doesn't know what's in every tile of the grid; in order to find the goal, it should first explore the whole grid and find the green square, or at least be in a position that allows it to see the green square in front of it. For the reward-based models, of course, we have the highest reward in correspondence with the goal state: when the agent is actually on the goal square, it will receive a reward of plus one. And this is a sparse-reward task, so in all the other states the agent will just receive a zero reward; it will just be encouraged to reach the goal. For the active inference methods, the way that I chose to define the preferred outcome is an image where the agent sees itself on the goal: basically, the agent sees itself on the goal and says, this is the position that I want to reach in the world.

And let's see what happens, at a qualitative level. As I said before, the reward for this task is just a plus one in the right square, the goal square. What happens for the active inference models? What will the active inference model provide as the value of a certain state for the agent, in order to pursue the preferred outcome? We see that the likelihood active inference agent, which is imagining the outcomes and comparing them to the preferred image, is actually giving a very high value to the right square; on a scale from zero to one, we can say that's a one. But other than that, the function it is providing is a bit confusing, because it is giving higher values in the center compared to, let's say, the last row and column, which are the ones that lead to the final goal.
The other corners are not even close to the goal one. So it's just recognizing a perfect match, and we have no signal about what the distance to the goal is: for the perfect match, the goal, the value it provides is indeed correct, but other than that, it's difficult to understand what the likelihood active inference model is providing. With contrastive active inference, we see a different pattern. The agent assigns a very low value to the center, so it understands that the center is, of course, not what it wants to see, but then it assigns high values to all the corners, and in particular the highest value is assigned to the right corner, the one with the goal, because it is of course the one that corresponds the most with what we want to achieve. But all the other corners also have a very high value. And the fact is, I would say that contrastive active inference is probably capturing more semantic information about the environment: in order to distinguish a corner from a central tile, it is actually modeling the fact that there is a corner in a certain state of the environment, and when it looks at the preferred outcome image, it says, first of all, this is a corner, and that's how it distinguishes it; and then there is the green tile where the agent is. So from a value perspective, we can say that a corner is semantically closer to our goal than a central tile, and then when you also have the green tile, which represents the goal, you are the closest.

This is of course a bit risky, because it can also lead to suboptimal behavior in some cases, but with good exploration of the environment it will surely lead to the optimal behavior, because the maximum value is still provided correctly: it's still the value of the right corner. But if you haven't seen the other corners, the agent might just go to another corner and say, okay, this looks similar to the goal, so I'm trying to do something similar to what I would like to do; if I haven't seen anything else, this is closer, this is the best I can do. This highlights how exploration is important in order to achieve preferred outcomes, and I think that applies to likelihood active inference as well, because if you hadn't seen the goal, you would just have some noisy signal in the center, so you wouldn't be able to reach the goal either.

And then here we quantify the performance. We see that with likelihood active inference the agent struggles to reach the goal consistently, while our method leads to consistent performance that is in line with the reward-based baselines. Of course, the reward-based baselines have an advantage, because during training they always have a grounded objective: even if their model is not correct, they always have this reward function that tells them, yes, this is where you need to go. Our model can take a little bit more time, because it first has to have a good model and then be able to match the preferred outcomes, but with the contrastive mechanism this process actually happens fast and leads to consistent performance, whereas with likelihood active inference we see that it would probably take more time to converge, or it just leads to suboptimal behavior; it's just inconsistent according to our evaluation.
And yeah, these are two different grid environments, one smaller and one bigger, but the results are very similar in terms of the performance obtained.

Then the other task that we discuss is a continuous control task, with a 2D planar environment where a robotic arm has to reach into a goal sphere, the red sphere. This sphere is bigger in the so-called Reacher Easy environment, which is the one for which we see the preferred outcome on the left, and it is smaller in the so-called Reacher Hard environment, which is the one that we see on the right. For a reward-based agent, we have a reward function that provides a reward of one when the agent is fully inside the sphere; otherwise, when it's outside or only partially inside the sphere, it provides a dense reward that tells you, yeah, you're getting closer to the goal. So in this case the reward function is actually helping the agent a bit more, because it's telling it that it's getting closer to the goal and that it should just search the neighboring area. For active inference, we instead just provide the preferred state of the environment, which is the agent reaching into the goal sphere.

And let's see what happens. Again we have a similar pattern, actually similar but different from the previous one. In this case, images at different distances from the goal are actually more confusing, as we've seen in the earlier example, and the likelihood active inference agent totally fails to reach the goal. This is probably because all the states look alike in this environment: the background stays the same, it is not moving as it was, for instance, in the MiniGrid environment, and the goal is always in the same position, so the difference between two images is just given by the few yellow pixels that move around. And if the model does not imagine perfectly where these pixels go, it's very unlikely to provide an informative objective for the task. Instead, our contrastive active inference agent is able to provide an informative goal, and apparently the fact that it's providing some semantic information about the task is actually helping it converge even faster than the reward-based baselines, because the reward-based agents have access to rewards only when they're close to the goal, whereas contrastive active inference provides a value signal everywhere in the environment. When the arm is very far, we have the mutual information term that basically takes over and tells you: yeah, you don't want to stay there, go elsewhere, until we eventually find the goal sphere and converge to the correct behavior. So the agent is actually converging a bit faster than the other baselines; and then the Contrastive Dreamer baseline is converging a bit faster than the Dreamer one, because its model is faster to learn, since it's contrastive.

And this is what actually happens; these are GIFs on the right, which should reset at some point. We basically see here that the task is correctly executed, so the agent is able to match the correct behavior. This is taking a bit longer than expected, but you can see that, for instance, in the Hard task the agent oscillates around the correct behavior but keeps staying there. Okay, here we see it: in the other environment it is oscillating in a position
that is very close to the goal. So it tries to stay as close as possible to that point and not be driven away from the goal by the inertia of the arm.

Then we analyzed qualitatively what's happening in terms of the values provided to the agent: what is the objective that is given to the agent in order to learn here? As I tried to explain, the reward is somewhere in the middle between zero and one when the agent is partially inside the goal, exactly one when the agent is fully inside the goal, and zero in all other situations. For the likelihood-based active inference model, we see that the signal is very similar for all the states: the agent basically thinks that there is very little difference between being very close to the goal and being very far off from the goal, and that's likely the reason why it's not converging to the optimal behavior. For contrastive active inference, instead, we see that the agent is provided an objective value somewhere in between zero and one when it's not close to the goal, and something that is very close to one when it's at the goal; and in particular, when it's closer to the goal, the value is actually a bit higher than when it's far off. So we see that there is, again, some semantic information provided, which is the actual distance of the arm from the goal. Or we can just see it as the contrastive learning mechanism saying: okay, this pose of the arm is actually very different from the one that I hope to obtain, so let's try to move to a pose that is closer; and we indeed obtain higher values when the pose is similar to the one that we want to achieve.

We exploit the fact that some semantic information is provided to the agent by contrastive learning to work on a more difficult setup: the Reacher distracting environment. In the Reacher distracting task, we have the same objective as before, so we want the agent to reach the goal by reaching into the red sphere, but now we have varying backgrounds and distractions in the environment, which can be altered colors or a tilt of the camera, and we still want the agent to reach into the sphere despite that. For the reward-based agents, the reward is the same as before, provided for the agent reaching inside the goal sphere. For active inference, the goal is actually a bit troublesome to define, because given the fact that the background is constantly varying across different episodes, we cannot define a priori what the preferred outcome looks like. So instead of doing that, we attempted providing a more neutral preferred outcome, with the agent seeing itself achieving the goal, but with the standard task background: we have this plain blue background, with the arm reaching into the goal, and we aim for this preferred outcome to transfer to the distracting setup. This is of course pretty much impossible for the likelihood active inference model, because it's trying to match this with a mean-squared-error-like function, so the signal provided will be very confusing. And very interestingly, we see here that the Dreamer method also fails, because it's based on a likelihood-based model, and, as we'll see,
As we'll see, reconstructing all the variations in the environment is very difficult, so the reconstruction-based world model struggles to provide informative states of the environment, while the contrastive-learning-based models succeed. In particular, it is very interesting that our model was able to actually achieve the goal here. We see it less consistently than before, so there is a higher variance, but the agent is still often able to reach the correct position despite all the differences in the background and all the distraction present in the environment. So the representation is actually learning what the pose of the robot should be, and the agent tries to match it in the future. Here we see some videos of what is happening: the arm is oscillating a bit more, so it is actually a bit more difficult for it to assess that it is doing the right thing, but the behaviour obtained is still quite good, I would say; it is pretty much achieving the goal.

And this shows why a likelihood-based model fails in this environment. Here we compare the ground truth, with its different varying backgrounds, against what Dreamer, or the likelihood active inference model, sees through the reconstruction. We see that, whether reconstructing from the posterior state or from the prior state, the agent cannot properly model important information about the environment, which in this case is the arm pose. It sees where the first link of the robot arm is, but it is not able to model where the second part of the arm is, because it is very uncertain about that. And that leads the agent to not being able to assess where it is in the environment, and to not provide the right value for what is going on. So using reconstruction in this environment leads to this kind of problem, where the agent is not certain about its internal states; and because the state signal is uncertain and confused, so is the value the model provides to the agent.

And that's it, so I'll just briefly summarize what we have seen. Basically, we used a contrastive model to reduce the computation of active inference; there are similar advantages in reinforcement learning, but we focused on the active inference area, where this model brought a two-fold advantage: learning the world model faster, and also imagining future trajectories faster, because you don't have the reconstruction. Then we saw that the contrastive representation learned features that better captured the relevant information about the environment, and this was key in solving both the Reacher task and especially the Reacher distracting task, where without this feature we would not be able to solve the task. Then we showed that we can use this method to obtain performance similar to engineered rewards, but in a much easier way: you can just say, this is what I want to achieve in the environment, provide that observation to the agent, and the agent will find itself a way to reach that state, without you actually having to provide a reward function for every possible state of the environment, which, especially in realistic cases, is usually unfeasible. Finally, we have also seen that exploration is very key for our method to work, because we don't want the agent to converge to a suboptimal behaviour that merely looks like the preferred outcome.
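To illustrate the two-fold advantage in the summary, here is a rough side-by-side of the two world-model training signals. The module names (such as `decoder`) and shapes are placeholders I introduce for illustration; this is a sketch of the general technique, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def world_model_losses(state_emb, obs_emb, decoder, observation):
    # state_emb, obs_emb: (B, D) embeddings for B matching pairs;
    # decoder and observation are only needed for the likelihood loss.

    # Likelihood-based (Dreamer-style): reconstruct every pixel, so every
    # background variation has to be modelled, relevant or not.
    recon_loss = F.mse_loss(decoder(state_emb), observation)

    # Contrastive alternative: no decoder at all; just classify which
    # observation in the batch matches which state, so irrelevant
    # variation can simply be ignored by the representation.
    logits = state_emb @ obs_emb.T                    # (B, B) similarities
    labels = torch.arange(state_emb.shape[0])         # diagonal = matches
    contrastive_loss = F.cross_entropy(logits, labels)

    return recon_loss, contrastive_loss
```

Dropping the decoder is also what makes imagination cheaper: imagined trajectories never need to be rendered back into pixels.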
So it is very important to wisely explore the environment before actually delving into learning our preferred policy, and we aim to look more into this in the future. Thank you very much, that was it; I don't know if there are any questions.

Thank you both, very interesting presentation. If anyone watching live wants to ask a question, please do; otherwise I have a few. You mentioned a critic model when you were describing the architecture, and that reminded me of language learning: if someone says "repeat after me" and then gives a sound, you might be accurate or you might not be, and if someone says "no, it was not that", you have a negative and a positive example. So what does that speak to, perhaps, in terms of the biological basis of contrastive learning, or how these contrastive learning settings, active inference or not, relate to the ways that organisms learn?

Okay. I would say that the contrastive learning mechanism, though it is not completely equal, somehow resembles the Hebbian learning mechanism, where, when you have corresponding pairs, things that should correspond, you want to strengthen the link, and when you have things that should not correspond, you want to weaken the link. So biologically we could see it this way: when there is something you want to link more strongly, in our case the fact that a certain observation corresponds to a certain state, then you strengthen that connection; and when you want things to be far apart, and this is maybe where contrastive learning differs a bit from the biological perspective, we actually push them farther, which is not always the case for Hebbian learning, because normally you don't have this pushing-farther mechanism. So I would say that could be one possible link. And as you said, the critic function is doing something very similar to what you mentioned: you have positive samples and you reinforce them, so the critic tells you, this is correct, and it is trained to do that. We do it with machine learning, but if you have a good critic, it should tell you, yes, this is the right sample, while for non-corresponding states and observations it tells you, this is not what we want in our representation, we want these farther apart.

Yeah, maybe to add on, Daniel: I think what you are hinting at, providing "it should be like this", is more like a way to define preferred states, so to speak. If you translate it to what we are doing, it is basically saying: these observations are what you should like. So they come into play for training the action model, as in, how do I get to these observations. The contrastive learning part is more about being able to distinguish different things; it is agnostic about what you would like to have. The contrastive learning just learns to distinguish all kinds of sounds, even all the bad ones, and then you just say: okay, but now I really want to have this sound, so try to get there. I think that's the difference here.
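For readers who want the critic view of this answer spelled out: below is a minimal sketch of a discriminator trained with matching pairs as positives and shuffled pairs as negatives. The `critic` callable is a hypothetical stand-in, and this is my illustration of the mechanism being discussed, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, state_emb, obs_emb):
    # critic(state, obs) -> logit; high for corresponding pairs.
    # Positives: aligned (state, observation) pairs from the batch.
    # Negatives: the same states paired with shuffled observations.
    perm = torch.randperm(obs_emb.shape[0])
    pos_logits = critic(state_emb, obs_emb)         # "strengthen the link"
    neg_logits = critic(state_emb, obs_emb[perm])   # "push farther apart"
    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits),
                        torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```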
That kind of sounds like paying attention to the right details, which we have seen multiple times, like with the Breakout games: how could you miss the ball? Humans watching that GIF are watching the ball, but we also have a sense of how to pay attention to the right details, and then, in terms of action, how to have curiosity about the right things. So it definitely starts to bridge into some very interesting behaviour. Another question was about the action entropy term in the free energy calculations. Maybe you could restate what the action entropy term is, since it is one of the major contributions. And also, what does that say about adding terms to the free energy calculation? The action entropy is always greater than zero, kind of like a KL divergence, and you mentioned that gives some perhaps nice properties about the boundedness of F within a lower and an upper bound. So, what is the action entropy doing here, and can we just add other terms that are bounded at zero to the free energy and use that in other ways?

Okay, so I'll start with the question about the action entropy term, and then I'll also delve into using different bounds for the free energy. Here, in the way we cast the active inference process for learning the actions, the key part is that the actions are now part of the inference process over the future. I could go back to the previous slide if necessary, but normally you see these objectives without this action term here and here; instead you have a conditional on a certain policy. That normally means you already have some set of policies and you are just trying to decide which of them is better. This could be done, for instance, by sampling a set of candidate policies first and then assessing only the ones you think are best, or by just assessing all of them. But that is impossible, for instance, in a continuous action setup, where you cannot assess all possible policies, because the actions are continuous and their number is infinite for every dimension, so you get infinite times infinite and so on; it is a huge action space. So instead, here we make the action part of the inference process: we want a separate model that tells us what action to take at every step. And I said that we obtain an action entropy term. That is because, in choosing the best action, in trying to match our actions to the ones we should actually prefer, we think of it like this: we do not have a preference over the actions themselves. For instance, if I want to reach a certain state of the environment, if I want to go from this room to the kitchen, maybe I don't care which is the shortest policy, I just care about getting there at some point; I don't care about going left or going right when I get off my chair, I just want to go where I need to go. So we don't place a prior over the actions; we just say, whatever action is fine, as long as it brings you as fast as possible to the goal, because the speed is not given by the actions themselves but by minimizing the free energy. We don't want a preference over the actions; we want the free energy to be minimized as fast as possible. So we assume a uniform distribution over the actions, and
what remains is just an entropy: the objective contains a KL divergence between the Q over actions given the states and this P(a), which basically becomes an entropy value if you assume P(a) is a constant, because then you are just subtracting a constant from it.

Switching to the other part of the question, what does this constant term mean for us? Is having a constant term useful, and are there other useful constant terms we could add? Mathematically speaking, having a constant, so having an upper bound that differs by a constant, cannot be harmful, because you are minimizing the same objective. But then, on top of that, we apply the contrastive approximation, and that leads to another upper bound, and as I said there are some implications of this: we maybe get a better representation, but are we getting farther from the actual objective? From my point of view, as long as we achieve something that actually works better, it does not really matter how far we are from the actual surprise. In any case we will always have some kind of amortization or approximation, and we will probably never get one hundred percent close to the true surprise value, because we do not have a perfect model. A perfect model of the world does not exist: it is impossible to imagine that we can model every detail of the environment, even with a billion machine learning parameters, and it is impossible to think that we will always act perfectly and always take the perfect route, choosing the optimal action every time, especially because there is always some uncertainty in the environment, and there are a lot of things we normally want to ignore in our everyday life. A lot of things are not actually important to cover in the world model, and action-wise it is not always important to be one hundred percent accurate in our movements; the important thing is that we get close to the goal. So I would say that if there is any other constant term, or any other modification of the free energy functional, that actually leads to better results without compromising the original goal of minimizing free energy, that would be a good way to address some of the issues we currently have with active inference, and it could also significantly improve the performance of artificial implementations of active inference. That is also why I think that taking advantage of some lessons learned in reinforcement learning is useful in active inference as well: there has been a ton of research on ways to amortize or approximate these quantities better, or to train a better deep learning model for some very specific aspect, and I think active inference research should take inspiration from this and benefit from it.
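To make the first step of this answer explicit in symbols, using a discrete action set $\mathcal{A}$ for concreteness (my notation, following the description above rather than the paper's exact equations): with a uniform prior $P(a_t) = 1/\lvert\mathcal{A}\rvert$,

```latex
\operatorname{KL}\!\left[\,Q(a_t \mid s_t)\;\|\;P(a_t)\,\right]
  = \mathbb{E}_{Q(a_t \mid s_t)}\!\left[\log Q(a_t \mid s_t) - \log P(a_t)\right]
  = -\,\mathcal{H}\!\left[\,Q(a_t \mid s_t)\,\right] + \log\lvert\mathcal{A}\rvert .
```

Minimizing the KL therefore amounts to maximizing the action entropy, and the constant $\log\lvert\mathcal{A}\rvert$ only shifts the bound without changing the minimizer, which is exactly the "a constant cannot be harmful" point made above.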
Yeah, maybe if I might add. Pietro, can you go back to the slide with the action? So, maybe to make this clear for people who are less familiar with reinforcement learning and are coming from more of an active inference background: in terms of active inference as the orthodox formulation would look at it, this is basically something that you should not do. What happens here is that we see action inference more like a habitual thing: I know I am in this state, or I think I am in this state, so therefore I can just infer my action without even planning. You become habituated: I have planned this hundreds of times and it is always this outcome, so I just stop planning and amortize this action into an amortized policy. That is basically the mechanism we apply here in order to avoid planning all the time, because that is the tricky part: we have too many options to plan over, so we don't want to do this, and we just say, let's amortize it from the start. This basically means that this Q, our approximate posterior, is not only over states but also over actions, and then the action entropy just falls out by introducing the action into the Q there. So it is not that we magically add an action entropy term to the formulation; it just comes out because the actions are part of our approximate posterior.

But keep in mind that this also means we have an approximate posterior over our action selection, and this works in these reinforcement learning problems because your goal there is always the same; it does not shift. If you think back to biological agents, it is also not a complex distribution like maintaining homeostasis; it is basically just: this is the reward, this is where you get it, it is always the same thing. This basically means that the environments in which we test these agents, and in which reinforcement learning solutions test their agents, are exactly environments tuned for "I can amortize what I have to do", because if I know my state, I know what I have to do. Things will change, I guess, if you have another environment where this is not the case, where you could be in a certain state and still have multiple options, and you can only really know what to do by planning ahead, or by first foraging for information about what is happening. In those kinds of environments, I think that amortization trick won't help you a lot, or rather, you cannot solve them just by amortizing. So it is a trick we used to make it work in these kinds of environments, because we have to benchmark against something and you have to be a bit on par there. But keep in mind it is not a silver bullet that will always work: we do deviate from vanilla active inference here, we cast action inference as "we just want to learn habits, we don't want to plan", and this also means there might be situations where it will not work. And then it is not active inference or the free energy principle that is not working; it is more that we made a crude approximation here, by which things might not work anymore.
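A minimal sketch of the habit-learning, or amortization, trick described here: a policy network maps the state belief straight to an action distribution and is trained by minimizing a free energy estimate over short imagined rollouts, instead of enumerating candidate policies. Everything below is illustrative; `world_model.imagine` and `efe_estimate` are hypothetical stand-ins for the learned transition model and the expected-free-energy objective.

```python
import torch
import torch.nn as nn

class AmortizedPolicy(nn.Module):
    # Maps a state belief directly to an action distribution: a "habit".
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Linear(state_dim, 2 * action_dim)

    def forward(self, state):
        mu, log_std = self.net(state).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp())

def policy_loss(policy, world_model, efe_estimate, state, horizon=5):
    # Roll the policy forward in imagination, accumulating the free
    # energy estimate; the action entropy enters with a negative sign,
    # reflecting the uniform action prior discussed earlier.
    loss = 0.0
    for _ in range(horizon):
        dist = policy(state)
        action = dist.rsample()              # reparameterized sample
        loss = loss + efe_estimate(state, action) - dist.entropy().sum(-1)
        state = world_model.imagine(state, action)   # hypothetical API
    return loss.mean()
```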
That's interesting, about which training environments favour which kinds of algorithms, and then how that shapes the perception of different algorithms. Like the navigation task: what if there was a fuel tank, or there was a larger space that was going to require multiple information-foraging trips, for example? Then the single-minded seeker is just going to die fast, but something that is able to actually engage in planning wouldn't. So that was a little bit for those who are familiar with active inference, and here is a variant on what we have seen before. How about for those who are more familiar with the Dreamer architecture, or with reinforcement learning: what makes active inference active inference, and how is it different?

Well, I think they are largely similar; I think that would be the starting point, because people often think about what the difference is, but the main point that we should rather stress as an active inference community is that there are a lot more similarities between reinforcement learning and active inference. I would say that active inference is a bit more general than reinforcement learning, in the sense that, on the one hand, we don't use a reward function per se; we relax that a bit, in that we just have a distribution over preferred outcomes, which is a bit more general, I would say. And the second thing is that, by starting off from the free energy principle, as in "this is the objective we want to minimize", you also get the extrinsic value term, which is exactly the same thing a reinforcement learning agent would optimize. So if you only look at extrinsic value, your free energy agents will also do this; but the added value, I would say, comes in the information gain terms, and these will only give you an additional benefit in environments where there is information to gain, which is not your typical reinforcement learning environment. But if you look, for example, at the T-maze mouse example from Karl Friston, these are typical environments where you can actually show that if you only go for extrinsic value, you will be acting suboptimally. You can almost prove that in some environments, given the correct model of the environment, active inference will win. I think the crucial thing we need to research is how you get the correct, or optimal, model of your world, in which, by optimizing your expected free energy, you actually do the sensible planning; and this is still largely unresolved. With our models we are taking steps in that direction, but as you can see there are lots of issues in just finding the correct model: if you just look at the maze, the likelihood-based model should be perfectly fine in theory, but in the way you optimize it in practice you see all kinds of problems, like, this little pixel is actually the most important pixel of the whole scene, and it does not appear so in my loss function, and that is why everything collapses. So in theory everything should work, but there are a lot of practical problems in finding the correct model that pays attention to the correct details, or the correct aspects, of your observations. This is something that model-based reinforcement learning shares with active inference, and I think there is a huge opportunity to find new techniques that can push both fields forward. We also showed this: contrastive Dreamer in the distracting environment also improved performance over the normal Dreamer. By having a technique that lets you build a better model, any model-based algorithm will work better. And active inference has this special notion of also taking into account information gain, in environments where you might be unsure about what your state is; that is where it can prevail. But I think that in most of the benchmark environments you see nowadays, especially in machine learning, you probably don't need these terms; you probably get away with just maximizing rewards, which is in fact also what an active inference agent would do.
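For reference, the split Tim appeals to is the standard decomposition of the expected free energy of a policy $\pi$, with $C$ denoting the preferred outcomes (standard active inference notation, not specific to this paper):

```latex
G(\pi) \;=\;
\underbrace{-\,\mathbb{E}_{Q(o_\tau \mid \pi)}\!\left[\log P(o_\tau \mid C)\right]}_{\text{extrinsic value (reward-like)}}
\;-\;
\underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\!\Big[\operatorname{KL}\!\left[\,Q(s_\tau \mid o_\tau, \pi)\;\|\;Q(s_\tau \mid \pi)\,\right]\Big]}_{\text{information gain (epistemic value)}}
```

A pure reward maximizer keeps only the first term; the second term is what drives the epistemic behaviour in settings like the T-maze, and it vanishes in environments where observations carry no extra information about the hidden state.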
And in some sense, of course, I am talking about the model-based techniques, so Dreamer-like agents in this case; the model-free ones are a bit different, as they don't need a model at all. But at least for model-based reinforcement learning agents, I think it is pretty similar to what an active inference agent can do in these environments.

Thank you, Tim. Pietro, anything you'd add to that? Yeah, I would like to discuss one aspect of Dreamer that we somewhat overlooked: the fact that it makes this same kind of amortization, which is also similar to what we have done. So we learn a policy, basically an action network, that provides the correct action for every state. But the key step that actually brings us closer to the active inference formulation is that we imagine several time steps into the future. So it is true that we don't evaluate long policies over time, and that we have this prior over actions given by our action network; but it is also true that, given that we evaluate the states we expect to see and then restart the action optimization process from there, we actually get closer to the optimization scheme of active inference. In particular, there is a paper called Sophisticated Inference that discusses this: when you actually take an action and then re-imagine from that step what is going to happen. There are some implications of this, but we are not completely drifting away from the original active inference theory because of it; it is just a different way of doing the action selection process, and in that regard Dreamer is also very close to active inference itself.

Cool, thank you. I wrote down: if you don't know where you prefer to go, you are lost. Drive fast if you know how to get there; figure it out if you don't; and then reassess continually. I hope that conveys some of the similarities and differences. Do you have any final comments? This is a very interesting line of research and we really appreciate this Model Stream; hope to see you in the future, or should I say, we expect and prefer it. But thanks again, Pietro and Tim, this is really awesome. You're welcome, thanks for having us. Thank you for having us. Have a good day, everyone.