Hello everyone, welcome to ActInf Lab. This is ActInf Lab ModelStream number 3.1. It's May 27th, 2021, and we're here with Tim Verbelen and Ozan Çatal. Today we're gonna have a ModelStream on some of their recent work on learning generative state space models for active inference. We're gonna have a presentation followed by some time for question and answer, so please feel free to write any questions in the chat and we will get to them in the conversation. So thanks again to Tim and Ozan for joining us today. We're really looking forward to what you have to share. So please take it away, and thanks again. So thank you, Daniel, for having us. So I'm Tim Verbelen, and together with Ozan Çatal, we will talk a bit about our paper on learning generative state space models for active inference. We're both researchers at imec, Ghent University, in Belgium. If you want to know more about what we do, you can read about it on our blog or follow us on Twitter. So before we dive in, maybe a short introduction on what we do and what our goal is. Basically, we want to build intelligent robots. So in our lab here in Ghent, we have a pretty large space where we have some robot manipulators on the one hand, but also driving and flying robots on the other hand. And we typically attach all kinds of sensors to these things, and then we want to process the sensor information and infer some useful actions for them. And of course, active inference is a cool methodology to try out and to further investigate, and it's in this context that this work is being done. So why would you bother learning a state space? If you look at active inference papers from recent years, you will typically find a figure like the ones on the slide. So you have a figure of the kind of environment that you're modeling, that you're investigating, and then a whole description of what the model should look like.
So you define the state space, and then, in the case of discrete state spaces, you define the so-called A, B, C and D matrices. These define the likelihood model, so what is the likelihood of seeing a certain observation when you're in a certain state, or the transition model, so how do you transition from one state to another, and so forth. And around the time we were doing this research, we saw some other people also thinking about whether we can do more learning, using these kinds of novel deep learning techniques, in these active inference methods. So, for example, Kai Ueltzhöffer proposed to basically learn a state space for the mountain car, but he still needed to explicitly encode some of the environment information in the state factor. There was also some work from Beren Millidge, who did some learning, but more on learning the policy rather than the state space. So he basically used Markov decision processes with small state spaces, where the raw observations that the environment gave you were basically suited as a state space, and he would then investigate how to learn the actions given these states. But our work is basically on: what if you don't know the state space? What if your observations are high dimensional and you cannot really use them directly as your state space? How do you define the model? How can you come up with it? And maybe as a concrete example, suppose you have one of our driving robots in the lab. This is the first-person view of the robot, so the kind of observation that you get is just a square of pixels. And then, yeah, what is your state space in this case? It might be an XY position on the map; that might be something relevant for a robot as a state. It might also be "watch out, you're approaching a cable gutter", so that might also be something useful.
Maybe the items that are stored in the racks are relevant for whatever task you have, so those might also be part of the state space. Maybe there are human workers walking around, and you also need to model them. So depending on your robot, the use case and task at hand, and what can happen in the environment, it becomes non-trivial to define a concise set of states that you want to pick, let alone that you need some model that captures this: how do you get from pixels to your XY position? There are certain methods in robotics that do parts of these things, but none of them really just figure out what the complete state space is that you need for whatever you want to do, and especially not in an active inference setting. So before we go into our methods, I'll briefly talk about active inference. Probably most of you are already familiar with the topic, but I'll just rehearse some of that stuff, even if it's just to get accustomed to our way of notation, let's say, because everybody has their own notation, and at least it will put us all on the same page for the next part. So it all starts with having an agent that needs to interact with the environment, and the agent is assumed to be separated from the environment by the so-called Markov blanket. So on the one hand, the agent can perform actions. These actions will impact the environment, which is a generative process that has some hidden states and that, given your action, provides you with some sensory observation. So in this case, an action could be for the robot to drive around, to put some currents on the motors, and an observation could be some pixels from the camera. And then the goal of the agent is to build a generative model where it builds its own state space, basically. It derives how actions affect the states and how these states then generate these observations.
And so the question is, how can we build a generative model that automatically comes up with a proper state space to learn this model? So the generative model then takes the form of a partially observable Markov decision process. It just means that we assume that the state at a given time step only depends on the previous state and the action that you were doing, and each state gives rise to a certain observation. And so then we end up with the so-called free energy principle: basically, what we want to do is build a model that maximizes the log evidence. So here you see the formula for the free energy, and you can unpack it in several ways. The first way is basically by stating that minimizing free energy is actually maximizing the evidence lower bound, let's say. And if you're at this free energy minimum, then basically your approximate posterior is actually close to your true posterior. But the third line is basically what we typically use for optimizing our model, which is, on the one hand, a complexity term, where you want your states to explain the world in as simple a way as possible, whereas on the other hand you want the highest accuracy in predicting whatever you're observing. So in this case, our variational approximate posterior is something that maps observations and actions into inferring your current state. But what about the future? So in the past, you basically knew which actions you were doing and you saw the observations that came in, so you could kind of infer which states would have given rise to the observations. That's basically the free energy of the past. But in the future, things change a bit, because now you don't know what you will observe, and you also have to select the actions you will do. So the actions, we will basically denote them as being generated by a so-called policy.
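As a reference, the three ways of unpacking the free energy mentioned here can be sketched as follows, with q(s) the approximate posterior over states and p(o, s) the generative model (the exact symbols on the slides may differ):

```latex
F = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
  = D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o)
  = \underbrace{D_{\mathrm{KL}}\big[q(s)\,\|\,p(s)\big]}_{\text{complexity}}
    - \underbrace{\mathbb{E}_{q(s)}\big[\ln p(o \mid s)\big]}_{\text{accuracy}}
```

The second line shows that minimizing F maximizes a lower bound on the log evidence ln p(o), and that at the minimum q(s) approaches the true posterior p(s | o); the third line is the complexity-minus-accuracy form used as the training loss.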
But in this case, the policy is basically just, yeah, a certain sequence of actions. And you now have to create some expectations, not only on future states, but also on future observations. So: which states do I think I will visit by doing some actions, but also, what kind of observations do I think I will see if I visit these states? So then the free energy becomes the so-called expected free energy. And again, you can unpack this thing as follows. Basically, you again have, on the one hand, the log probability of your posterior model minus the log probability of your generative model. And now the difference is that you don't know yet the observations in the future. So you condition on the policy, so everything will depend on the actions you will do, and you also have to take the expectation over all the kinds of outcomes you expect. And then, going to the second line, you basically make the move that, if your model is sufficiently trained, your approximate posterior will be very close to the true posterior. And this then basically allows you to rewrite the final term. Well, I see now that I wrote it in a different way, but you can rewrite this either as an information gain on your states plus moving towards preferred outcomes. You can also rewrite it the way we did here, where you basically state: okay, at some point in the future, I will visit some states, so that's this log probability of the states given the policy. And you basically replace this by some prior belief that, regardless of which policy I'm actually choosing, I believe that at some point in the future I will realize my preferences. So that's why you lose the conditioning on the policy in the last equation. This then basically becomes a KL divergence that says: okay, I want to find the policies that actually bring me close to the states that I prefer, that I like to be in.
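The resulting expected free energy, with preferences expressed over states rather than outcomes as described here, can be sketched as (summing over future time steps τ; p(s_τ) is the prior preferred state distribution, which no longer depends on the policy):

```latex
G(\pi) = \sum_{\tau > t}
  \underbrace{D_{\mathrm{KL}}\big[q(s_\tau \mid \pi)\,\|\,p(s_\tau)\big]}_{\text{risk: distance to preferred states}}
  \;+\; \underbrace{\mathbb{E}_{q(s_\tau \mid \pi)}\Big[H\big(p(o_\tau \mid s_\tau)\big)\Big]}_{\text{ambiguity}}
```

The first term rewards policies that steer towards preferred states; the second penalizes visiting states whose observations are unpredictable.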
And then on the other hand, you have this ambiguity term that says: okay, what I don't like is to visit states that can give me any observation, that I cannot predict. So I don't like these ambiguous states where I don't really understand what's happening. So that brings us to the whole active inference scheme. At each time step, you basically evaluate your expected free energy for each of your policies, and you then assume that you will choose the policy that will actually minimize your expected free energy. So you take the softmax over minus G, and you also weigh this with this gamma parameter, which is just like the precision. If you have a high precision, then you will have a very peaked output of the softmax function, so the softmax will basically become a max, and you then definitely choose the policy with the lowest expected free energy. If you relax the precision a bit, then you allow for some more randomness in the system. And then you end up inferring the next action according to what is the next action that I should take according to this policy. But in the work that follows, the action selection given a policy is basically just a deterministic mapping. So the policies are just certain sequences of actions, so once you choose the policy, the action is basically fixed. But in theory, you could also make this a probabilistic mapping. So let's now go to the really interesting part of the work: how do we learn state spaces with deep neural networks in such an active inference scheme? Well, basically we have two components. One is the generative model that I rehearsed from the first slide, and we have our approximate posterior model. And you can see that there are three parts in these equations. The first part is basically the transition model, so it outputs the probability of the current state given your previous state and the action you did in that state.
Second part... oh yeah, and also the initial state is basically part of the transition model, but then you just provide a zero action, for example, just to bootstrap it. The second part is then the likelihood model: given a state, you want to have a model that predicts what kind of observation you will see. And then finally, you have the posterior model that, given the previous state and action and your current observation, has to infer which state you're in. So basically the posterior model has to come up with the same thing as the transition model, but in addition, it has access to your observation. And again, similar to the transition model, to bootstrap this for the initial observation you don't have any action, but it can be modeled by the same thing. So all three components, we basically instantiate as deep neural nets that have to be trained to come up with these three models. If we put it in a schematic, we basically have something like this. Given your state and action from the previous time step and the observation from the current time step, you, on the one hand, provide the previous state and action to your transition model, which will then output a distribution over the current state. And in this case, this distribution is basically modeled as a multivariate Gaussian distribution, so the output of the neural net will be the means and the standard deviations of a Gaussian. And we basically use this distribution to then generate samples for the next time step. The posterior model also gets as input the state and the action from the previous time step, but in addition, the current observation, and it also makes a prediction for the current time step. And after sampling, you basically have the likelihood model that can then generate a prediction of the outcome. And then this state is passed to the next time step and the story repeats.
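One step of this schematic can be sketched in plain Python. These are toy stand-ins, not the actual deep neural nets: each of the three components here is a simple hand-written function returning the mean and standard deviation of a diagonal Gaussian, purely to illustrate how the pieces plug together.

```python
import random

# Toy stand-ins for the three neural nets; each returns (mean, stddev)
# of a diagonal Gaussian. The real models are learned deep networks.
def transition_model(prev_state, action):
    mu = [s + action for s in prev_state]   # prior: state drifts with the action
    sigma = [0.5 for _ in prev_state]
    return mu, sigma

def posterior_model(prev_state, action, observation):
    mu_prior, _ = transition_model(prev_state, action)
    # Posterior: pull the prior mean towards the observation, with lower spread.
    mu = [0.5 * (m + o) for m, o in zip(mu_prior, observation)]
    sigma = [0.1 for _ in prev_state]
    return mu, sigma

def likelihood_model(state):
    mu = list(state)                        # predicted observation from the state
    sigma = [0.2 for _ in state]
    return mu, sigma

def step(prev_state, action, observation):
    """One time step of the schematic: infer the posterior state, draw a
    sample from it, predict the observation, and carry the sample forward."""
    mu_q, sig_q = posterior_model(prev_state, action, observation)
    state = [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu_q, sig_q)]
    predicted_obs, _ = likelihood_model(state)
    return state, predicted_obs
```

During planning, only `transition_model` and `likelihood_model` are needed, since future observations are imagined rather than observed.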
As I already mentioned, the output of each of our neural nets is basically the means and standard deviations of a multivariate Gaussian with a diagonal covariance matrix, and we use the reparameterization trick to generate samples. So basically we generate a sample by taking the means and then adding the standard deviations times some standard normal noise. This allows you to backpropagate gradients all the way through the neural net, even if you draw samples from the models. And then, of course, for the next time step, you just propagate a sample and the story repeats. Training such a model then basically starts with collecting a dataset of action-observation sequences. So you take your agent and either let it randomly generate sequences in the environment, or, in the case of a real robot, for example, you drive it around yourself in the environment while you record the actions and the observations. And while the models haven't converged, you sample some subsequences from your dataset, you estimate the states that are visited, and you reconstruct all the observations. And then you basically backpropagate the free energy loss. This is the same formula as the free energy at the start: on the one hand, you have the KL divergence between the output of your posterior neural net and your prior neural net, and on the other hand, you have this reconstruction error that basically scores how well you're reconstructing the actual observation. Using this loss function, you just update the parameters of your neural nets. And you basically build a model that is able, on the one hand, to infer your current state given your observation, and that is able to generate new observations: if you know which state you're in, you can reconstruct these observations.
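A rough sketch of the two ingredients just described, the reparameterization trick and the free energy loss for diagonal Gaussians (plain Python for illustration; in the actual setup these are tensors inside a deep learning framework, and the reconstruction term comes from the likelihood model):

```python
import math
import random

def reparam_sample(mu, sigma):
    """Reparameterization trick: s = mu + sigma * eps, with eps ~ N(0, 1).
    Because the noise is external, gradients can flow through mu and sigma."""
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL[q || p] between two diagonal Gaussians, summed over dimensions."""
    return sum(
        math.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2 * sp**2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sig_q, mu_p, sig_p)
    )

def free_energy(mu_post, sig_post, mu_prior, sig_prior, recon_log_lik):
    """Complexity (posterior-vs-prior KL) minus accuracy (reconstruction
    log-likelihood), i.e. the per-step training loss described above."""
    return kl_diag_gauss(mu_post, sig_post, mu_prior, sig_prior) - recon_log_lik

# Identical posterior and prior: complexity is zero, so F is just the
# negative reconstruction log-likelihood.
f = free_energy([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0], recon_log_lik=-3.2)
```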
But you also have a transition model that you can then just use to plan ahead in the state space: what would happen if I did this action? And it will give you a distribution over the next state, basically. So then we come to the planning part. Once you have this model, how can you use it to let your agent do useful things? There we will use Monte Carlo sampling. One of the limitations of this kind of architecture is that it only approximates the distribution for the next state. So given a sample from the previous state and given the action, we approximate the distribution for the next state, but then we sample, during training as well. So we will never have a good distribution for, let's say, 10 steps ahead. To predict the distributions further into the future, we approximate them by doing Monte Carlo sampling. For each of the policies we want to evaluate, we sample N trajectories that all do the same actions, but that, due to the sampling, might give us different results in terms of future states and observations. So for each time step, we then have a bunch of states and a bunch of predicted observations. We then fit a Gaussian distribution using the sample means and variances, and we estimate the expected free energy as follows. On the one hand, we have this KL divergence which, if you remember, scores how good your distribution of states is according to some prior preferences over states. So you use this Gaussian distribution to calculate the KL divergence with respect to these priors. And on the other hand, you have this entropy term that scores the ambiguity of these states. So here we basically use the entropy of the observations generated in our trajectories.
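A minimal one-dimensional sketch of this Monte Carlo estimate: moment-match a Gaussian to the sampled states and observations, then compute the KL to the preferred state distribution plus the entropy of the observations. All names are illustrative, `rho` is the balance factor between the two terms, and the sketch assumes nonzero sample variance.

```python
import math

def fit_gauss(samples):
    """Moment-match a Gaussian: sample mean and standard deviation."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, math.sqrt(var)

def expected_free_energy(state_samples, obs_samples, mu_pref, sig_pref, rho=1.0):
    """G ~= KL[N(fitted states) || N(preferred)] + rho * H[N(fitted observations)],
    estimated from the Monte Carlo rollout samples at one future time step."""
    mu_s, sig_s = fit_gauss(state_samples)
    _, sig_o = fit_gauss(obs_samples)
    kl = (math.log(sig_pref / sig_s)
          + (sig_s**2 + (mu_s - mu_pref) ** 2) / (2 * sig_pref**2) - 0.5)
    entropy = 0.5 * math.log(2 * math.pi * math.e * sig_o**2)
    return kl + rho * entropy

# Rollouts whose sampled states land near the preferred state score a lower G:
g_near = expected_free_energy([0.9, 1.0, 1.1], [0.0, 0.1, -0.1], mu_pref=1.0, sig_pref=1.0)
g_far = expected_free_energy([4.9, 5.0, 5.1], [0.0, 0.1, -0.1], mu_pref=1.0, sig_pref=1.0)
```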
We also added a scale factor rho to weigh the two terms against one another, which allows you to make an agent more risky, let's say, if it puts more weight on realizing preferences early, or you could also make a more cautious agent, let's say, that really does not want to end up in ambiguous states. And crucially, we also have kind of a recursive notion in here. So we do this rollout for K time steps ahead, and then, for the future, you can kind of recursively look further into the future, let's say, and also aggregate expected free energies from that point. To make it a bit more clear, let's try to visualize it. Suppose that at some current time step t, you're in a state. Then basically what you do is use your transition model to say: okay, given that I follow policy one, what will be the next state? And you draw a sample from the distribution. So this is then s at t plus one, following policy one. And this you can repeat K times. We put in this K to have some kind of coarse-graining of actions: for example, in the case of a driving robot, it might not make sense to switch actions every 10 milliseconds, let's say. If you decide to drive forward, you want to keep on driving forward for at least some time. That's basically where this K comes from: you follow the same policy for some time steps. And you basically sample this N times. So you repeat this process N times, and you now have, at each time step, N samples of the kind of states you think you will visit after following the first policy. And if you have a second policy, you can do the same thing for that policy. In this case, we only consider two potential policies, for instance go left or go right, let's say. And then at this point, so at time step t plus K, we can basically repeat this procedure and say: okay, what if, after K time steps, I switch policy?
And maybe I first have to go left before turning right, or maybe I should turn left twice. So then you basically repeat this process, and you can keep on adding to the search tree for as long as your computation power allows you, let's say. So then, how do we calculate the expected free energy from this? We take our formula. Basically, what we do is, at each time step, first use the likelihood model to also predict the observations that I expect in these states. You then look at all your state samples from a certain time step, and this gives you the approximate Gaussian distribution for both the states and the observations. So you plug these into the formula, and this gives you a number that says: this is the expected free energy if I am in the state at time t plus K and I follow policy one thereafter. And you can do this for each of these sub-branches. And this is where the recursion comes in. We basically state: well, I assume that my active inference agent, also at that future time step, will most likely choose the policy with the lowest expected free energy. So we combine the free energy for policy one and policy two according to the weights that are provided by the softmax function. If policy one has a very low expected free energy compared to policy two, then basically only the expected free energy of policy one will be added thereafter. If they are roughly the same, then they will both get about a 0.5 weight, and if the other one is the clear winner, then it will mainly be the contribution of that one. So we basically combine these, assuming that your agent will at that point also select the policy with the least free energy. So now we have an estimate of what the expected free energy will be for the future, given that we are at time step t plus K, and then we can go further up the tree.
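This recursive, softmax-weighted combination of sub-policy free energies can be sketched as follows (illustrative Python; `G_here` is the expected free energy accumulated over the current K steps of a branch, `child_Gs` are the already-aggregated values of its sub-branches, and `gamma` is the precision):

```python
import math

def aggregate_G(G_here, child_Gs, gamma=1.0):
    """Expected free energy of a tree node: its own G plus the softmax-weighted
    G of the sub-policies, assuming the agent will again pick the policy with
    the lowest expected free energy at that future decision point."""
    if not child_Gs:
        return G_here                      # leaf node: no further branching
    logits = [-gamma * g for g in child_Gs]
    m = max(logits)                        # subtract max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return G_here + sum(wi / z * gi for wi, gi in zip(w, child_Gs))

# With a clearly better sub-policy (G = 2 vs G = 10) and high precision,
# nearly all the weight goes to the better branch:
g = aggregate_G(1.0, [2.0, 10.0], gamma=10.0)   # close to 1.0 + 2.0
```

With a low precision the children contribute more evenly, which matches the described behavior of the softmax relaxing towards an average.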
And so then, for each branch, we will again use the first part of the formula to estimate these Gaussians, to calculate the KL divergence and the entropy term, and then add this together with what we already had for the remaining part of the tree, basically. And this is very similar, I think, to what Friston proposed in his sophisticated inference paper: at each step in the future, you kind of only consider the free energy of the branches that you think will be the best ones. So in the end, you end up with an expected free energy for both of your policies, and you then select the best one and start acting accordingly. So I'll now turn to Ozan, who will give some more details on the various experiments that we did, which hopefully give some more insight into how this can work in practice. So take it away, Ozan. Yeah. Hi. I have some echo on my end. Maybe it's resolved now. Yeah, sounds fine. Thank you. Okay. So I'll go briefly through our initial experiments, the experiments we did in, like, the past one or two years. One of the first experiments we did and reported on was the mountain car, which is a fairly well-known benchmark, I think. The goal is to have this underpowered car, so it does not have enough power to reach the top of the right-hand mountain, but you want to reach it anyway. So ideally, your agent should learn here that it should first go back a bit to the left to gain enough momentum so that it can climb the steeper mountain on the right. And even though this is such a low-dimensional problem, it only has positions and velocities, it is actually a very interesting benchmark to experiment on, because there's no greedy solution: any agent that just wants to realize its preferences immediately will fail at this, because it will always fail to drive up the mountain. Now, for our experiments, we made it a bit more difficult: we made the fully observable mountain car partially observable.
So what we did was omit the velocity information and only provide some noisy estimates of the actual position. So the agent actually had to learn two things: it had to learn, first of all, what its precise position was, and then also how this position relates to the momentum via its velocity. And if we start from this setup and train the model that Tim explained a bit earlier, in its most basic form, you can actually learn a state space that closely mimics the physical constraints of the system by minimizing free energy. If you look at the rightmost figure, you'll see these sinusoidal state space dimensions, and these closely mimic the actual observations of the real world in the lower image. As you can see in the lower image, we have the ground truth in green, which is an actual trajectory of the cart's position, and then also our state transition model estimate, so our prior estimate on the position without observing, and then also, in the blue line, which is a bit harder to see, you can see the same estimate but now corrected with the observation, so that's actually just the posterior model's output. And you can see these shapes returning in the state space, which in itself does not say that much, but it gives you at least a vague idea that the state space is capturing something relevant for the problem. I think there is one more animation on this. So yeah, it's the same conclusion: as I said, you can now predict the future, as you can see in the orange line, and it appears to learn the velocity in its state space. Now, if you can go to the next slide, yeah. Now we want to actually figure out how we can use this model that we learned for active inference. Typically, in RL, you would give some sparse reward for driving up the mountain, but in our initial experiments, we were more interested in how an agent could learn from human demonstrations. So we recorded seven, I think... yeah, no, five, sorry.
Five human rollouts in the environment, just driving around with the cart, so we go first left and then right, and then we push these trajectories through our posterior model to get a preferred state distribution. So here, in this figure, you can see the eight different dimensions and how they evolve through time if you follow these trajectories. The spread, of course, gives you the standard deviation at each time point for that state value. And then you use this as the preferred state distribution and calculate the G, or even only the KL divergence between your posterior and your preferred distribution, for different trajectories through time. So here, every different color is a different rollout in the environment, or an imaginary rollout through the environment. You can see that you can actually use this preferred distribution to rank trajectories according to lowest free energy. You see that the blue curve is the only one that reaches the top, and it's also the one that the model seems to prefer. And then, in the next slide, yes. We experimented a bit with what this means in terms of the actual free energy, so now also including the entropy term. What we did here was have the agent observe the world for one time step, so we give it an initial observation to get a sort of bootstrap latent sample, and then we only use the prior model to imagine what will happen if you follow certain policies. So here we considered two possible policies that you can switch between at three time steps: you either go left, right, right, left, et cetera, or you can start with right, and those are the blue curves. You see that if there is no initial velocity, the agent imagines that the optimal policy, which is left, right, right (so first go as far up to the left as possible and then just blast your way to the top), actually gives the least amount of spread.
And it also believes that this will reach the top earlier than any other policy. Conversely, if you look at the policies that first go right, you see that there's still close to no spread, due to the lack of initial velocity, but the agent already knows that it's impossible to reach the top. Now, if you go to the next slide. Here we do the same experiment, but we add some random initial velocity. Again, we let the agent observe the world for one time step and then let it plan according to the same policies as before. First of all, you see that the agent has a much larger spread on its believed outcomes; this is due to the extra uncertainty on what the initial velocity is. And this initial velocity might also render previously infeasible policies feasible, as you can see at the bottom, because now, if the agent thinks its initial velocity is high enough, even one of the more suboptimal policies might still make it possible to reach the top. And if you go to the next slide, please. Here we can then see this in action. In this animation, we collapse the imagined trajectories onto the actual trajectory. And you see that in the beginning, if we can play it again... In the beginning, it believes that more of the red policies will reach the top, and then, as it gains more momentum, it will believe that all blue policies will reach the top, and it will even know that if I now go left, so I decelerate, then I will be less likely to reach the top. So I think, yeah. So as you can see now, it knows that red will not reach the top. And yes, another experiment we did, building upon this, was: what if we take this same approach, but now move to the kind of problems we want to solve, with high-dimensional observations that you cannot model by hand? So we took another OpenAI Gym environment, the car racer. And here the goal is to drive the red car, as you can see in the high-resolution image on the left,
on the road for as long as you can. We trained our model on a handful of human demonstrations of this, and you can see in the reconstructed image on the right that, even from a handful of observations, it can learn to predict this. So we did exactly the same thing as before. And I think this is an animation now. Yeah. If you then use the exact planning tree that Tim explained on this little car, it will have learned that the road, the gray area, is important and that it should stay on it. Now, it is a bit greedy and tries to cut corners to get onto the gray part quicker, as you can see here. Okay. If we can go to the next slide, maybe. Now, an important point here is that our active inference approach, which is model-based, seems to be a lot more data efficient than, for example, a baseline RL agent. So we took DQN, since it is also an off-policy algorithm, to be able to compare it somewhat fairly to our approach. As you can see in the first graph, for the mountain car, our model, shown in orange, quickly learns at least some method to reach the top and then afterwards just improves upon that. Due to the sparseness of the rewards, DQN is not able to learn as quickly, and even after 1,000 times more observations, it still isn't capable of climbing the mountain. And for the car racer, it's even worse. Here we trained on seven or ten rollouts and were immediately able to get a reward of 600, whilst DQN just fails to get to that same level of performance even after 1,000 rollouts in the environment. So if we can go to the next slide. Yeah. And then a final experiment I want to discuss today is the robotic navigation, or maybe more accurately, robotic control.
So here we took our KUKA platform, mounted some sensors on top of it, and then also put a laptop on it, for good measure, to have some compute. I don't think the sensors are really relevant for what we're going to discuss now. We then drove around with the robot in our lab, as you can see in this movie, just with the joystick, and captured a lot of data. We just drove up and down the aisles with the robot. Now, this environment is also a bit challenging for robots, since all these aisles are super similar: for a robot, being in aisle one or aisle two looks nearly the same, which is very difficult for the machine to disambiguate. So then, in this slide, you see what a recording might look like: you have a lidar and radar feed and some images. Yeah. Sorry, Tim. This is now the correct slide. And then the goal is again to train a model that will be able to generate future observations for the robot. I don't know if there's an animation on this slide. So you saw that the first high-resolution image was the real observation, and this is then what the model thinks will happen if the robot first turns right and then continues to drive. Note the little ghost artifact you saw a couple of frames ago. So yeah, we'll first look at it again. So we drive, and then it turns, and suddenly you will see a ghost appear. This is because the model doesn't actually know what will happen; it can only try to guess, based on its previously learned experiences. And there were a lot of people walking around in the dataset, so there was some chance that somebody would be walking there, and it just might imagine that there's somebody there. Then maybe if we go to the next slide. This is also an animation. Here you can see, for example, how these imagined samples deviate, and this is also the reason why we need this sampling: the different samples in the planning tree give different outcomes.
So you see that, given the same starting position and observation, the model learns that turning left might have different outcomes depending on which aisle it is in and where in that aisle it is. For example, in the top right, the robot imagines that it is at the end of the aisle, whilst in the bottom two, it imagines that it is inside the aisle, so it just imagines some stuff on the racks. Then here is how we can evaluate policies. So typically we provided three possible policies: turn left, go forward and turn right. And then you can imagine what each of these will do in the environment. And then, similarly as before, you can calculate G and select the one that will most likely bring you to your preferred sequences. Another nice thing about our model is that, because it's a neural network, you can put multiple types of observation into your posterior model; you can fuse them in various ways. So what you can see here is that, similar to the camera feed, the robot will also learn the effect of its actions in, for example, a lidar scan. And it will also learn the effect on the velocity bins in a radar scan. And this gives you, of course, extra robustness in your planning, because you can now reason over multiple modalities. Maybe the next slides. There are of course still some limitations to doing robotic control this way. First of all, our robot is extremely short-sighted in time. It can only learn to predict as far as the length of the sequences we provided during training. And also, the longer you roll out, the larger your search tree becomes and the sooner you will hit computational limits. So this is an area we are now actively working on. Also, currently our models still require that we pre-record the data set; our models require that we drive around ourselves and then fit the model. So that's also a point we're working on at this moment.
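As a rough sketch of the policy evaluation described above: imagine each candidate policy with the learned model, score it with the expected free energy G (a risk term pulling toward the preferred observation plus an ambiguity term), and pick the lowest. This is a hypothetical illustration assuming diagonal Gaussian predictions; none of the function names come from the paper.

```python
import numpy as np

def expected_free_energy(pred_mu, pred_sigma, preferred_mu):
    # Risk: how far the imagined observation lies from the preferred one,
    # weighted by the prediction's variance.
    risk = np.sum((pred_mu - preferred_mu) ** 2 / (2 * pred_sigma ** 2))
    # Ambiguity: differential entropy of the diagonal Gaussian prediction.
    ambiguity = np.sum(0.5 * np.log(2 * np.pi * np.e * pred_sigma ** 2))
    return float(risk + ambiguity)

def select_policy(policies, imagine, preferred_mu):
    # `imagine(policy)` stands in for the learned generative model and
    # returns (mu, sigma) of the imagined observation under that policy.
    scores = [expected_free_energy(*imagine(p), preferred_mu) for p in policies]
    return policies[int(np.argmin(scores))]
```

With a toy `imagine` where going forward keeps the robot centered on its preferred heading, `select_policy` picks "forward", since its risk term is lowest while the ambiguity terms are equal.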
And then, tying into this, our models currently do not really know how to explore, whilst there's probably a sensible way to do exploration based on the free energy principle, since your model uncertainty can be baked in. I don't know if we have any more slides actually. No? Then maybe it's time for questions. Awesome, you can unshare and we can ask some questions. So I wrote down a bunch of stuff, and anybody who's watching live, please ask some questions. Nice presentation, and awesome, very instructive videos. They made us think, and made us laugh a little bit when it cut corners. So maybe just a starting question while people are writing their questions. What brought you to be studying this topic in this way? Were you coming from active inference and saw robotics as an interesting application, or were you in the area of robotics and then found active inference to be a useful model? Yeah, so basically we were in the area of robotics, and we were working on building better low-dimensional state representations to feed into a reinforcement learning algorithm, let's say; that was our initial idea. So just building representations for better reinforcement learning. And then we stumbled upon the active inference framework, which not only gave us a way of how to build generative models, because with, let's say, the free energy of the past, we found we were basically already doing that, but we saw that it also gives us, in the same mathematical framework, a way of how to project these things into the future and use them for planning, for resolving ambiguity, and also for scoring novelty, and all these nice properties that are basically lacking in RL.
So that's why we basically started digging into this mathematical framework, going through all the papers of Karl Friston to see how all the ends are tied together, and to see whether this would still work if you don't have your state space defined up front but just learn it from data. So that's how we started this endeavor, and that's very much where we are at this point, still investigating it further. Awesome, that was going to be my second question: what differences or advantages would you describe for active inference over reinforcement learning or other machine learning frameworks? If you answered it previously, that's great, or do you want to add any other thoughts? Yeah, I think the nice part is that you automatically get the properties of resolving ambiguity and of potentially exploring, if you also estimate posterior distributions over your parameters, for example. So these are very interesting properties mathematically. However, there's still a gap to actually get these out in real-world cases if you don't have the generative model predefined, let's say. So there are still some challenges, but the theory at least is very appealing. And I think if you look at what's going on on the RL side and on the active inference side, there's not that big of a gap between the two fields, because if you look at all the curiosity bonuses that they've tried to come up with in reinforcement learning, and if you look at the model-based stuff, like Danijar Hafner with his Dreamer approach, then everything is kind of converging to the same conclusions: it's a good idea to build a model of your world and get more sample efficiency, it seems like a good idea to have some planning in there, and maybe there is some gain in curiosity. And so you see a lot of different independent research tracks all converging along the same lines.
And I think what is so appealing about active inference is that it basically brings all of this together from a single principle, which makes it a very nice framework to work with, I think. Anything to add on that, Ozan? No, yeah, before Tim mentioned it, I was already thinking along the lines of the work of Danijar Hafner. So I mean, if you look at his models, and I think that is currently almost state-of-the-art in model-based RL, then you'll find that the models they are building are very similar to the models we are building, or other active inference researchers are building. So yeah, it makes sense that all these approaches converge on a single idea if the idea works. Yep, very interesting: plan to dream, and dream or imagine so that you can sample appropriately. Why put that as a second layer on the model, or have to incentivize it in a sort of ad hoc way? Why not have that be the basis of the model? So that's a very nice point. So Dean in the chat asks: have the authors heard of the missionaries and cannibals game slash problem, which is moving two kinds of mutually incompatible agents back and forth between two sides of a river? And if they have heard of this or thought of any kind of analogous cases, do you see any applications? Yeah, I haven't heard of the game before. I think it's probably similar to crossing the river with a chicken, a fox and a goat, or what was it? Yeah, there are animal versions; there are all kinds of versions of this one. But going back and forth with different kinds of incompatible agents, where you need to sort of go to the left before you can make it up the hill, it's a little bit of a different setting. So what does that make you think of? I haven't actually really considered it for our models, but it might work if you find a way to model it in an environment and maybe collect some data on it. Yeah, I think this is a problem that is very nicely suited for doing it.
Let's say the vanilla way, where you basically write out all the different states that can happen and the observable outcomes that you can have. So I think you could actually formalize it in such a way, and then run an active inference simulation on that and see what happens. But that's less of our interest, because in our case we are mainly interested in what happens if your observations are so high dimensional that you can't even start thinking about writing out the generative model, let's say, and the only thing you can do is interact with your environment and try to learn it from the data, which is a slightly different take on the active inference problem. Great. Another question: you all are deploying these models in real time, with physical, embodied agents. So what surprised you, or what was interesting to note, in going from simulation only, where you can sort of put everything in a box and know exactly what's going to influence what, to the world of embodiment, where, I don't know, some dust could get into the robot, or it saw a person walk by? So what comes into play when you actually deploy physically, and how does the model deal with that? For me, it's actually the thing I like. Well, the first week I played around a lot with the mountain car and the car racer, but the first thing that was a real hurdle for me personally was deploying it on a real robot. You suddenly have all these hardware constraints: you don't have infinite memory, you don't have server-grade compute anymore, and you have to fit it all in. For example, I think Tim and I spent a lot of time making a demo work where we could roll this out in real time, and just the hardware constraints of doing something as complex as active inference in real time, on a real, power-constrained robot, are a challenge in and of themselves. Yeah, so if the question was, what do you get from doing it on a real system? Well, a lot of frustration and pain, I think, is the answer.
But likewise, if it then works, the satisfaction is so much higher. So I still remember Ozan and me cheering in the lab, because we gave the robot a preferred state of being nicely in the center of the aisle, and it was actually moving along the aisle. And then at the very end, it decided, hmm, this is not the center of the aisle anymore, and it just made a 180 degree turn and started driving back. And we were like, oh, this is awesome. So I think there's lots of pain and frustration to get it to work, but then once something comes out, the satisfaction is so much higher than when you see a mountain car reach the summit in software. Interesting; you could set up a physical valley, maybe make a physical mountain car, because there's so much comparison on the software realization; maybe that would be taking it to the next level. So another thing that you talked about repeatedly, maybe even in every example, was actually training the model from just a handful of human demonstrations. Like, you drove the car, you had people play the mountain car game. So what exactly is happening there? How does the model not overfit to the few trajectories you show, or not just say, hey, you only gave me three, what's the deal with these three totally different trajectories? What exactly is being learned or updated in the model when you provide just a handful of human examples? I thought you were going to answer this one. And so, basically... Yeah, I think overfitting is clearly an issue. No matter what you do, you're constrained to the data you provide to the model, so it cannot really learn anything beyond what it discovers there. That's also what we pointed out on the last slide; that's a real limitation of our current work, clearly. But when there's a real robot involved, this is the easier way to get started. So that's mainly the driver.
But at the moment we're actually working, both in simulation and on real robots, on whether we can get the systems to decide how to gather their own experience: what is interesting to learn from? Because that's actually, to shortcut to one of your previous questions, something active inference gives you: it can deal with these kinds of questions, like, my robot needs to collect its own experience, so what does it do, does it keep doing the same thing, or does it explore, these kinds of things. So these are really active areas of research. But then, to come back to the overfitting problem: one thing that mitigates overfitting a bit is the fact that you have all this stochasticity in the system, because you sample states, which makes it a bit more robust; it's not like a classifier that overfits to the training set. So let's say there's always some kind of noise in there. But you're right, it is in some sense overfitted to the data, in the sense that it cannot predict scenarios that it clearly hasn't seen before. But one of the nice things of having this free energy formulation is that you actually penalize your planning with this entropy term, which basically means that if you plan ahead into some part of the space that the model wasn't trained on, then typically you will have more variation and higher entropy for that area, because typically your predictions become very bad, or very blurry, or out of distribution. And so in some sense, the model is kind of robust against that. And if you then deploy a policy, or do the planning, it will kind of try to stay close to the regime where the model was trained, because there you basically get the better predictions. So in some sense, that also mitigates the problem. Also, I want to add that we could use as few rollouts as we did because, if you let a human do the rollouts, you actually solve the exploration problem for the agent.
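The point about the entropy term penalizing out-of-distribution plans can be put in numbers with a toy sketch (ours, not the authors' code): where the model was trained, its predictive variance stays small; outside that regime predictions blur, the variance grows, and the entropy part of G rises, so planning keeps pulling back toward the training distribution.

```python
import numpy as np

def gaussian_entropy(sigma):
    # Differential entropy of a diagonal Gaussian prediction;
    # higher sigma (blurrier prediction) means higher entropy.
    return float(np.sum(0.5 * np.log(2 * np.pi * np.e * sigma ** 2)))

# In-distribution plan: the model has seen this region, predictions stay sharp.
h_in = gaussian_entropy(np.full(16, 0.1))
# Out-of-distribution plan: reconstructions go blurry, predicted variance grows.
h_out = gaussian_entropy(np.full(16, 1.0))
# h_out > h_in, so the entropy term of G disfavors the unfamiliar plan.
```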
You already get a very good coverage of the relevant, feasible state space. For the car racer, if you let a random agent drive around as exploration, 99% of your data will still just be grass, and the agent will learn nothing about roads. So by having a human drive it, the model knows: okay, this is the road, and apparently this is important because it's in every observation I've had. That almost makes me think of two ways that we're seeing these large models be trained. One with a sort of mentor, like a human who says, here are the first places you want to be sampling, here's how you drive the first time; that's the driving instructor sitting side by side. And then there's the generative adversarial approach, which is almost the opposite: we're going to be passing the model the most confusing possible data. And so it's sort of like, with this carrot and stick, or the push and the pull, these models, from maybe one or maybe both, figure out how to be on that razor's edge. And then in the race car example, it was cutting corners. So that just made me wonder about autonomous vehicles. You'd say, okay, well, the goal is to stay on the road and to get there fast, but then sometimes getting there fast is going to take priority, and all of a sudden you're way off the road and maybe now your car's ruined, or something like that. So if these were to be deployed, how will we even know what kind of preferences to instantiate the model with? Yeah, and that's a very good point, in fact, because however nice active inference might look in theory, I think it's not a silver bullet for autonomous agents, because a lot of the subtlety is still in how you provide it with the preferred prior distribution. And that will be crucial in a real-world system. It's similar to a reward, basically.
It's a bit more informative than just a scalar reward signal, let's say, but it suffers from the same issues, in that if it finds a way to shortcut to this preferred state that you didn't envision beforehand as the designer of the experiment, let's say, it basically has the same issues as reinforcement learning. So I don't see it as a silver bullet for solving autonomy, but at least it has some tuning knobs that you can use to avoid some cases, like avoiding these ambiguous states, or at least making it first learn the model properly. So it has some nice properties, but it's not a silver bullet for solving autonomous systems, I think. Anything to add on that, Ozan? No, I think Tim said it very well. Yes, the model is not a silver bullet, definitely recognized. What areas might be interesting proving grounds, first applications, where we can at least explore it? Are those the same use cases that people have been talking about more broadly in terms of autonomous vehicles? Or might there be a sort of division of labor, where active inference is going to specialize, like in those ambiguous scenarios? I was just curious about that. Okay, so yeah, I was also still thinking, but for us, I think the real next application we're targeting is still constrained navigation, for example in a warehouse, where the impact of misplanning is still fairly limited. So actually more realistic versions of the situation we have in our lab, maybe. That's also the trajectory we're currently still on. So I think we still believe that there is some value in developing autonomous active inference agents for these more industry-like settings, where you can at least separate them from the general public and you also have some level of control over the environment.
Yeah, I think it's similar to mitigating the same problem in reinforcement learning: before you just let the system randomly pick actions, or pick any action at once, you want an environment where, as a human, you can at least shortcut the system and say, okay, you want to drive forward into this rack, that's not a good idea. So you have at least some ways of defining some rules to keep it within some safety range, let's say, and within this boundary it can, for example, move autonomously. But even then, you might have the nicer properties of the active inference agent: if it's driving around in some aisles and somebody just dropped a box in the middle of the aisle, it will not freak out because the SLAM map is no longer consistent with how the robot was programmed, for example. It will just say, okay, this is another case; either it experienced this before and it has in its world model an idea of how to cope with it, or it will just be intrigued and start learning about this new situation. So I think these are the nice properties you get with this agent, and you kind of bypass the whole problem of it getting too greedy in realizing its preferences, because you shortcut these situations by having this more rule-based system in place that limits the choice of actions in this case. Thanks for the answer. Here's a question from the chat. As the code is not disclosed, would you say sticking to what you write in your paper is sufficient to reproduce your results, or are there any further tricks you use in designing and training the models? I think the... I'm sorry, go ahead, Tim, first. Yeah, I think that, together with the appendices, it should be sufficient. What do you think, Ozan? Well, I think, well, we had some experience where I tried to help somebody out who was trying to replicate our results. I think we have some tricks for the planning.
Like, that isn't as straightforward to replicate, but for the model prediction part, we explained the architectures in pretty good detail in the appendices. And we don't do any extra tricks, for example, on data processing or our loss terms, so these should be easy to mimic. But I think the planning is a bit more involved, as we sometimes ourselves have difficulties replicating it. Planning is hard. Yeah. Okay, if anyone else has questions in the live chat, they can type them. Another piece that I thought was really fascinating, in the mountain car example at least, was how you talked about the noise allowing previously implausible policies to become possible. Like, we saw a broader spread of the trajectories when there was noise. How is that being integrated in real time, or how do the noise terms, which are often very small, change the model's understanding of where it can go and what it should do? Yeah, so in the mountain car example, there are basically two sources of noise. One is the noise on the observation that you get, so you get a noisy estimate of your position. This is typically not so large, because you need a sufficient signal-to-noise ratio to learn anything at all, let's say. But the second part was whether your agent starts with zero velocity or with a random velocity, and these are basically two separately trained models. So either you have an agent that always starts with zero initial velocity, and then basically the model learns that the initial observation has zero velocity, and it knows how to properly predict from the first observation on, let's say. Or, if you train the model where the agent starts with a random velocity, then it basically learns: yeah, from the first observation, a lot of options can still happen, depending on my velocity, and then more observations come in.
You see how the model picks up, okay, this is now my velocity, and from then on you see how the wide range of options collapses to the most likely ones. Anything on that, Ozan? Well, I was also thinking of, for example, the robotic planning example we gave. There you also see the spread, but then it's more that the model learned that at similar state values, different outcomes are possible. So it will try to make the Gaussian in the latent space a bit wider, so that if you sample from it, you might get a slightly different sample value, and that will then generate the different outcome you want to visualize. So even though the standard normal you're initially sampling from for your reparameterization trick is just a standard normal, the model can learn to inflate or deflate this distribution so that you get a wider coverage. Yeah, and for example, what you also saw in the navigation example is that, since we trained the model on pretty short temporal subsequences, of like one or two seconds in real time, it's not able to give you a very consistent prediction over longer time frames. And given the fact that every aisle in the lab looks very similar, it learns the general structure: there are racks left and right, and there might be boxes. You saw that the boxes, or the stuff actually in the racks, is very blurry, kind of brownish-blackish, but you don't really identify what is in that rack, for example. So it basically has no spatial awareness of where in the rack, where in the aisle it is; am I at the very end, or in the beginning, or in the middle? It has no idea. And that's why you saw that if you then say, okay, drive ahead for a certain amount of time, it will either predict: I'm in the middle, so if I turn around, there will still be aisle behind me; or: I'm at the very end, so if I turn around, I will see a wall.
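The inflate-or-deflate behaviour Ozan describes rests on the reparameterization trick; here is a minimal numpy sketch of it (an illustration of the general technique, not the paper's implementation).

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): the reparameterization trick,
    # so in a real VAE gradients could flow through mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros(4)
# In an ambiguous state the model can inflate log_var, so samples spread out
# and imagined rollouts cover several possible outcomes...
wide = np.std([reparameterize(mu, np.full(4, 2.0), rng) for _ in range(1000)])
# ...and deflate it where it is confident about what comes next.
narrow = np.std([reparameterize(mu, np.full(4, -2.0), rng) for _ in range(1000)])
```

The base noise is always standard normal; only the learned `mu` and `log_var` decide how far samples spread.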
So it basically has no consistent knowledge of where it is, and this is basically modeled as a kind of noise in the distributions: if you draw different trajectories, it will either think it's at the wall, or it's facing the other aisle, or there's a human passing by, so it sees some shadowy, people-like structures. All these kinds of things are then modeled as, yeah, noise in your distribution that just appears to be happening there. Makes sense. What might be helpful or required for long-term planning? Because the tree that you had, with the multiple, I guess, bifurcations or whatever they represented, that was very interesting: how you had a very fully fleshed out tree, and then you showed how you recursed to prune back down to make a policy selection. But as you suggested, that exponentially explodes the computation. So what might be helpful, or how can long-term planning be achieved with reasonable hardware? Yeah, so the key thing here is hierarchical models, I think, and that's what we're pushing very hard on now. Basically, given a model like we trained now, you put a new model on top of it that, as observations, does not get pixels, but gets state samples from this lower model. And its time step is not to predict the next time step ahead, but to predict ten time steps ahead, or however you want to coarse-grain, basically. And once you're at that point, you have a system where, if you plan ten time steps, you're actually planning a hundred time steps for the lower-level model. And this way you can keep on coarse-graining, so that you only have to explore a few policies at each level. And then down below, again, you only have to predict like one second ahead, because the rest of the plan was made by the models on top. And so this way you can easily scale down the complexity of the planning.
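Tim's coarse-graining argument can be made concrete with a back-of-the-envelope count (illustrative numbers only, not taken from the paper):

```python
# Exhaustive search over a policy tree: branching ** depth leaf rollouts.
def tree_size(n_policies, depth):
    return n_policies ** depth

n_policies = 3   # e.g. turn left / go forward / turn right
horizon = 100    # low-level time steps we would like to plan ahead
stride = 10      # one top-level step corresponds to 10 low-level steps

# Flat planning: one tree over the full horizon.
flat = tree_size(n_policies, horizon)                     # 3 ** 100 rollouts
# Hierarchical planning: a coarse tree on top, plus a short tree below
# that only fills in the next top-level step.
hierarchical = (tree_size(n_policies, horizon // stride)  # 3 ** 10 on top
                + tree_size(n_policies, stride))          # 3 ** 10 below
```

The same 100-step horizon costs roughly 2 × 3¹⁰ ≈ 10⁵ rollouts hierarchically instead of 3¹⁰⁰ flat, which is why each extra layer buys so much planning depth.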
It reminds me a lot of driving, where it will be like, okay, in five streets, take a right turn. So you're not presaging the right turn, going okay, one, two, three, four, all right, now I should get ready. What about symbolic information? Like, what if the aisle had a color gradient, or if it had one dot, two dots, three dots? Could that be learned in an unsupervised way as a symbol in that pixel-level model, or is that where a hierarchical model would come into play? Well, I think the problem currently for us, architecture-wise, is that our model isn't capable of capturing very low-level details about environments, just because we are based on a VAE approach, and the mean squared error objective we use for reconstruction, and the way we sample, already inhibit, for example, recognizing dots in your inputs. I don't know what Tim thinks about it, but I think color gradients are something the model might learn, given enough data and incentive. Yeah, so I think there are a number of problems with the approach, in some sense, in that we're now doing prediction in pixel space, let's say, and the way you evaluate the likelihood and calculate the reconstruction error basically means that you want each pixel independently predicted right on average, which means that if you have very fine-grained details, it easily ignores these. You also have the complexity term, which basically says, okay, I want to have the least complex representation for reconstruction. But this also means that the more pressure you put on restricting the complexity, the less information you will actually encode and the more blurry, basically, your reconstruction will become. So it's very similar to the beta-VAE, for those familiar with it, where you have this beta parameter that tunes how much weight you put on the KL divergence term versus the reconstruction term.
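For reference, a schematic beta-VAE-style objective along the lines Tim describes: a pixel-wise MSE reconstruction term (each pixel treated independently, which is exactly what blurs fine detail away) plus a beta-weighted KL complexity term. This is a generic sketch of that trade-off, not the paper's exact loss.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    # Reconstruction term: pixel-wise mean squared error.
    recon = float(np.mean((x - x_recon) ** 2))
    # Complexity term: KL( N(mu, diag(sigma^2)) || N(0, I) ),
    # in closed form for diagonal Gaussians.
    kl = float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
    return recon + beta * kl
```

With beta > 1 the KL term dominates, pushing the posterior toward the prior: the representation gets simpler and reconstructions get blurrier, which is the trade-off discussed above.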
And this also has an impact on what the model will actually put in the state representation and which details will be ignored. So yeah, I think a lot of these things are very difficult for the model as it is right now, just because of the way we built and parameterized the likelihood model. Very interesting. How might somebody go about learning or exploring this? Like, is there a textbook, or the citations that are in your paper, or hands-on work? What would you encourage somebody to follow, who is curious about this and wants to stay with it over the next few years? Yeah, I mean, on the active inference part, I think when we started out there wasn't that much information on how to do active inference, so we had to figure it out the hard way, by trying a lot and failing a lot. But currently, I think even in this model stream, there is some cool and accessible information on active inference. And specific implementation-wise, if you want to build a model like this, I think the current state of the art in model-based RL and in deep active inference are all pretty similar. Yeah, I agree. So a lot of things have changed regarding how accessible the information on active inference has become. If you look now at the tutorial from Ryan Smith, for example, which was also extensively covered in one of your videos, I think that really lowers the entry bar to get to know the theory, how this thing works or should work, and to play around with some small toy examples and simulations, so you can get some insight into what this thing does and how it works. For the deep learning part, let's say to build these models, then probably the resources to go to are just tutorials on variational autoencoders. I think these are the things in deep learning that are most relevant for the active inference work we discussed here.
And if you can build a variational autoencoder, and you know how active inference works, and you put it all together with some details from our paper, for example, I think it should be pretty straightforward to get to a first working example. Cool. Any closing thoughts, or even questions for our lab, or things to leave people thinking about as they dream in preparation for action? Yeah. Maybe I just want to continue a bit on what Tim said earlier: for example, if you want to build hierarchical models, you don't have to build an active inference model on top of an active inference model. We did some preliminary experiments on, for example, just using our active inference model as the dynamics model for a SLAM algorithm, and then you also already get hierarchical modeling and some of the long-term benefits. So maybe that's also interesting to think about: how this active inference model fits in with already existing techniques. Yeah, so I think what Ozan wants to say is that, in this case, we used these deep neural nets to parameterize this transition model, likelihood model and posterior model, but you shouldn't always revert automatically to these deep learning techniques. They're very popular right now, and very cool, but in some cases, if you know your environment, then you shouldn't bother learning the state space model. If you know the state space model, just use it. I know we used the mountain car example here; this would be a good example of why you shouldn't just use the deep learning approach.
For us, this was basically just a proof of principle, to get a simple example working. But it only pays off, I think, if you have these really high-dimensional observations and you have no clue how to define your state space model a priori. It's not that this is the default way to go, let's say. And we're also evolving: especially as we move to these hierarchical models, we're looking at how we can put a more discrete-style, parameterized state space model on top, where we basically chunk the whole state space into discrete parts and then have a simple transition model. Like the example you gave yourself, Daniel: if you think about navigating, you're not predicting all the pixels of all the houses; you just think, I need to go forward now, and then take the second street to the left. So you basically chunk up the whole continuous state space into some relevant parts, and then your transition model also becomes a very simple matrix of transition probabilities. So I think the future is in hierarchical models, and the future is in a mixture of some learned parts but also some very discretized, intuitively comprehensible parts. Just two points on that. One is something that's always drawn me to active inference: that it's inference conditioned on and about action. So you're not going for that 4K Google Street View of what every house looks like. And again, when that's the input data, even if you have a generative model, that's the output data. So active inference is a really principled way to reduce what you're predicting to, like, which way should my elbow move?
Not what the pixels will look like when my elbow moves, which might take gigabytes of data; if it's reduced to just the state space, then it's easier to learn. And that's why it was such an interesting contribution with your paper: to actually learn that state space in the context of high-resolution, real-time, heterogeneous sensors. And then also, something we've seen many perspectives on in the lab and in these discussions: there's the philosophical discussion, map and territory, who's really an active inference agent? Is it everything: a dust particle, a bacterium? Is the world built this way? And then there's this sort of engineering approach, where you have your preferences for how you want to see the robot work, and then you use whatever you can tinker and cobble together that satisfies you and reduces your uncertainty about performing the task you're trying to perform. So it sidesteps those questions, but in a way that actually brings us to a higher level of understanding, because I know that many listeners who are not as advanced in machine learning will be inspired and have qualitative thoughts based upon what you brought here today. So thanks again for this awesome presentation and conversation, and we'll always appreciate hearing any follow-up whenever the time is right. Yeah, thank you, Daniel, for having us. It was a really nice discussion. Okay, peace. See you later. See ya, bye. Bye.