Hello and welcome everyone. It is September 1st, 2022 and we are here in ModelStream number 6.1. We're going to be discussing branching time active inference: the theory and its generality. We're going to have a presentation followed by a discussion. So thank you Ali and Jacob for joining, and anyone else, please add your questions in the live chat. Without further ado, over to Théophile Champion, and thanks so much for joining. Really appreciate it. Hello. Thank you very much for the kind introduction, and thank you for inviting me to present today. I'm very glad to have this opportunity. So today I will be speaking about branching time active inference, basically three different versions of the approach. The first is based on variational message passing, the second on Bayesian filtering, and the third on belief propagation, which allows the model to contain several observations and several latent states at each time step. This work has been realized in collaboration with Lancelot Da Costa, Marek Grześ and Howard Bowman. So first of all, I want to speak a bit about the action-perception cycle, which is a core idea in active inference. We have two entities here. The first is the environment and the second is the agent, which is over there. The environment provides an observation to the agent, for example, an image of the environment. And then the agent needs to take this input and perform inference on it. The goal of the inference process is to extract high-level latent states, such as the position of Pac-Man in X and Y, or the position of the ghost, or whatever information may be relevant. Then, based on those states, we can perform planning and action selection. And action selection outputs the action to perform, maybe the action going up, which is fed into the environment, which produces another observation. And this cycle continues until the trial ends.
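The action-perception cycle described above can be sketched as a simple loop. This is a minimal illustration, not code from any active inference library; the environment, inference, and policy functions are hypothetical toy stand-ins.

```python
# Minimal sketch of the action-perception cycle: perceive, infer, act, repeat.
# All three functions below are toy stand-ins, not part of any library.

def infer_state(observation):
    """Map a raw observation to a high-level latent state (toy example)."""
    return {"x": observation % 3, "y": observation // 3}

def select_action(state):
    """Pick an action based on the inferred state (toy policy)."""
    return "up" if state["y"] > 0 else "right"

def environment_step(observation, action):
    """Toy environment: moving 'up' decreases y by one row (3 cells)."""
    return observation - 3 if action == "up" else observation + 1

observation = 7  # initial observation from the environment
trajectory = []
for _ in range(3):  # the cycle continues until the trial ends
    state = infer_state(observation)                      # perception
    action = select_action(state)                         # planning / action selection
    observation = environment_step(observation, action)   # environment responds
    trajectory.append(action)
```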
Now that we have the core idea of active inference, which is the action-perception cycle, I will speak about active inference in a bit more depth. Basically, active inference is about an agent which is equipped with a model. This agent makes, as I said, observations, which are represented here at the bottom of the screen. And those observations depend on the hidden states through the A matrix. So basically the A matrix provides a distribution over observations for each possible latent state. We also have the D vector, which contains the parameters of the prior over the initial hidden states, as well as the B matrix, which explains how the transitions of the environment work. So basically it explains how, given a state and an action, we get the new state at time t plus 1. We also have an action, or here it's a policy variable. And this action variable or policy depends on the precision parameter, which is called gamma, and, as we will see, influences how stochastic or deterministic the policy of the agent will be. So here we see how the prior over actions is defined. It depends, as I said, on the gamma parameter, and it is defined as a softmax of minus gamma, the precision parameter, times the expected free energy. And the expected free energy for a particular policy is basically a sum over all future time steps, so from t plus 1, which is the first time step into the future, to capital T, which is the time horizon. And for each time step, the expected free energy is defined as the expected cost plus the ambiguity. The expected cost is the KL divergence between the predictive posterior over future observations and the prior preferences. The prior preferences define which observations the agent wants to observe, and the predictive posterior defines how likely each observation is. And so what we want to do is to minimize the divergence between the two distributions so that we actually observe what we like.
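The prior over policies described above, a softmax of minus the precision gamma times the expected free energy, can be computed in a few lines. The expected free energy values here are made-up numbers for illustration.

```python
import math

# Sketch of the prior over policies: P(pi) = softmax(-gamma * G), where
# G[i] is the expected free energy of policy i. The G values are toy numbers.

def policy_prior(G, gamma):
    """Softmax of minus precision times expected free energy."""
    scores = [math.exp(-gamma * g) for g in G]
    total = sum(scores)
    return [s / total for s in scores]

G = [1.0, 2.0, 4.0]            # expected free energy per policy (toy values)
low = policy_prior(G, 0.1)     # low precision: close to uniform, stochastic
high = policy_prior(G, 10.0)   # high precision: nearly deterministic
```

Low gamma yields a near-uniform (stochastic) policy, while high gamma concentrates almost all probability on the policy with the lowest expected free energy, exactly the stochastic-versus-deterministic trade-off mentioned in the talk.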
Okay, and the second term is the entropy of the likelihood mapping expected under the variational posterior over states. I'm speaking about the variational posterior; I will explain what this is in a minute. That's the definition of the expected free energy: risk plus ambiguity, also called expected cost plus ambiguity. So I presented the model; here's a more formal definition. Here we have a joint over all the variables in the model. We saw that the policy depends upon the precision parameter. We have a gamma distribution for the precision, i.e. the gamma parameter. We have Dirichlet priors for each of the tensors of parameters, so A, B, and D. And we see that the initial state depends on D, and that the observations depend on the states through the A matrix. Then for the transition mapping, we have a state which depends on the previous state and the B matrix, as well as the action performed in the environment. Okay, so now that we have the model, what we want to do, given some observations, is to compute posterior beliefs over the latent variables. In probability theory, we call that computing the posterior distribution, and we do that through a process which is called inference. We can, for example, use exact inference, which is based on Bayes' theorem, so that the posterior is equal to the likelihood times the prior, normalized by the evidence. And basically the evidence is just obtained from the numerator by summing out all the latent variables X. The problem is that when X is a continuous random variable, this summation turns into an integral, and we may not have an analytical solution for this integral. So this method, exact inference, can become intractable because of this.
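The exact inference step just described, posterior equals likelihood times prior over evidence, looks like this for a discrete latent variable. All probability values are toy numbers.

```python
# Sketch of exact inference with Bayes' theorem on a discrete latent variable:
# posterior = likelihood * prior / evidence. All numbers are toy values.

prior = [0.5, 0.5]                 # P(x) over two latent values
likelihood = [0.9, 0.2]            # P(o | x) for the observation we made

joint = [likelihood[i] * prior[i] for i in range(2)]
evidence = sum(joint)              # sum out the latent variable x
posterior = [j / evidence for j in joint]
```

The summation over `joint` is the step that becomes an intractable integral when the latent variable is continuous, which is what motivates variational inference next.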
What we do instead, when using variational inference, is that maybe this true posterior is very complex, but we are going to approximate it using this Q, this variational distribution, and try to make the divergence between those two distributions as small as possible. So here we have the true posterior in red, and here we have an example of how the variational posterior may be fit to the true posterior. In the context of active inference, the variational distribution is defined as follows. It is a joint distribution over all the latent variables of the model, and we are doing what we call the mean-field approximation, which means that all the variables within this distribution are assumed to be independent. Thus, there is no more dependency for the pi, A, B, D, and gamma parameters. And we make a slight exception where the states still depend on the policy pi. Okay, so that's the definition of the variational distribution. And now that we have the variational distribution and the generative model, we can define the variational free energy. The goal of the variational free energy is to make sure that the approximate posterior, so our variational distribution, remains as close as possible to the true posterior. And it is defined as the KL divergence between the approximate posterior and the generative model. This variational free energy is also called the negative evidence lower bound, or ELBO, in machine learning. And it decomposes into two intuitive terms. So this is the variational free energy, and it decomposes into the KL divergence between the approximate and the true posterior, which is the term that will make the approximate posterior as close as possible to the true posterior, and here we have the log evidence, which is a constant with respect to the variational distribution we are optimizing. So really, minimizing the variational free energy amounts to minimizing the first term. Okay, so what is variational message passing?
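The decomposition just stated can be checked numerically: the free energy computed from the joint equals the KL divergence to the true posterior minus the log evidence. The distributions below are toy numbers.

```python
import math

# Numerical sketch of the variational free energy decomposition:
# F = E_q[log q(x) - log p(x, o)] = KL(q || p(x|o)) - log p(o).
# The joint distribution values are toy numbers.

joint = [0.45, 0.10]                        # p(x, o) for the observed o
evidence = sum(joint)                       # p(o)
posterior = [j / evidence for j in joint]   # p(x | o), the true posterior
q = [0.6, 0.4]                              # an arbitrary variational posterior

F = sum(q[i] * (math.log(q[i]) - math.log(joint[i])) for i in range(2))
kl = sum(q[i] * (math.log(q[i]) - math.log(posterior[i])) for i in range(2))
# F decomposes exactly as KL(q || posterior) minus the log evidence,
# so minimizing F over q minimizes the KL term (the evidence is constant).
```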
Variational message passing is an inference algorithm. Basically, it is based on what is called the Markov blanket. So let's suppose we want to compute the variational posterior for one specific node in the graphical model. What variational message passing says is that this node A only depends on its Markov blanket. More specifically, it depends on the children of A, so here D and C; it depends on the parents of A, here F and G; and also on the co-parents of A, for example E and B in this picture. And what this Markov blanket says is that we only need to know the values of the variables inside the Markov blanket to be able to perform inference over A. Here is a slightly more formal view on this question. So here we have the optimal variational distribution for one random variable, which could be, for example, A. And the equation on the right gives us the analytical solution for this posterior. We see that it only depends on the node itself, its parents, its children, and the co-parents. And that's all we need to know. Okay, so why variational message passing? The message bit comes from the decomposition of this analytical solution into messages which are added together to form the variational posterior. Here we can see the first message, which basically comes from the parents, and here we can see one message for each child. And all those messages will be added to form the parameters of the approximate posterior. Here's a practical example. Basically, we are trying to perform inference over the random variable Y, so we want to compute the parameters of the posterior distribution over Y. The way we go about doing this is that we send messages from the parents all the way to the random variable Y, and similarly from the children and the co-parents. And each time we reach a factor node, we combine the input messages and forward the result toward Y.
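The core mean-field idea behind variational message passing can be sketched on a tiny two-variable model: each variable's approximate posterior is refreshed from an expectation (a "message") over the rest of its Markov blanket. This is a deliberately simplified coordinate-ascent sketch with a made-up joint distribution, not the full VMP algorithm for exponential-family models.

```python
import math

# Simplified mean-field sketch of the VMP idea on a two-variable joint
# p(x, y) with toy numbers. Each update sets q(x) proportional to
# exp(E_q(y)[log p(x, y)]), i.e. a message built from the other variable.

p = [[0.30, 0.10],   # p(x=0, y=0), p(x=0, y=1)
     [0.20, 0.40]]   # p(x=1, y=0), p(x=1, y=1)

qx, qy = [0.5, 0.5], [0.5, 0.5]
for _ in range(50):  # iterate the coordinate updates until convergence
    # message to x: expected log-joint under the current q(y)
    mx = [math.exp(sum(qy[y] * math.log(p[x][y]) for y in range(2)))
          for x in range(2)]
    qx = [m / sum(mx) for m in mx]
    # message to y: expected log-joint under the updated q(x)
    my = [math.exp(sum(qx[x] * math.log(p[x][y]) for x in range(2)))
          for y in range(2)]
    qy = [m / sum(my) for m in my]
```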
When we have received all the messages, we just add them together, and this provides us with the parameters of the variational distribution over Y. So this was variational message passing, which is basically the algorithm that we use to perform inference. Now I will be speaking about Monte Carlo tree search, which is the planning algorithm that we use to look forward into the future and estimate the quality of each policy. But before doing that, I just want to introduce the notion of multi-index. So here we can see the root of the tree, which is the state at the present time step. And then from there, we have nodes in the future, which are indexed by sequences of indices. For example, here the sequence is of size 1 and contains just the index 1: this is the state reached after taking action 1. And here we have the state reached when performing action 1, followed by action 2. Those indices are called multi-indices because they are composed of several indices, which here represent actions. Okay, so now that we have this in mind, I can discuss Monte Carlo tree search. The way Monte Carlo tree search is structured is in four steps. First, we have the selection step, where we start at the root node and compute the UCT criterion. This is just a real value, a number, for each of the children. And then we select the node which has the highest UCT value. For example, maybe it is S1, the state we reach after taking action 1. Then from this node, we can compute the UCT of each child, and maybe this node will have the highest value. Okay, once we have reached a leaf node, we can move towards the second phase, which is about expanding the children of this node. And for each child, what we are going to do is simulate some rollouts from this node and compute the average expected free energy for this node.
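The selection step above scores each child with the UCT criterion: a value estimate plus an exploration bonus that shrinks as the child gets visited more. The numbers and the exploration constant below are illustrative, and the value here is a generic estimate; in the branching-time agents the value comes from the expected free energy rather than a game reward.

```python
import math

# Sketch of the UCT criterion used in the selection step: value estimate
# plus an exploration bonus for rarely visited children. Toy numbers.

def uct(child_value, child_visits, parent_visits, c=1.4):
    """Upper confidence bound applied to trees."""
    return child_value + c * math.sqrt(math.log(parent_visits) / child_visits)

parent_visits = 10
children = [  # (value estimate, visit count)
    (0.5, 8),   # well-explored child with a decent value
    (0.4, 1),   # barely explored child: large exploration bonus
]
scores = [uct(v, n, parent_visits) for v, n in children]
best = scores.index(max(scores))  # the child we select next
```

Even though the second child has a lower value estimate, its exploration bonus dominates, which is exactly why the visit counts must be tracked and updated during the back-propagation phase.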
Once we have this, there is a fourth phase, which is about updating the expected free energy of the ancestors of the node we just expanded, as well as the number of visits, because we want to explore parts of the tree which have not been explored a lot in the past. So we need to keep track of how many times each branch has been explored. Okay, so that was the algorithm for planning. And I have now introduced all the background required to discuss the first approach, which is branching time active inference with variational message passing. So first, we need to define the model. It is basically split into two parts. The first one represents the past and the present, and it is basically a partially observable Markov decision process, which is just a fancy way to say that we have observations which depend on states through the A matrix, we have action variables over there, and we have states on either side. The likelihood mapping, as I said, is parameterized by the A matrix. The transition, as usual, is parameterized by the B matrix. And we also have here the D vector, which defines a prior over the initial hidden states. So this part comes from standard active inference. And then the novelty is that we are expanding the future as a tree. Okay, so we have this model, and then we perform Monte Carlo tree search. Each time we want to expand a node, we are going to add some parts, some bits, to the generative model. So for example, if we want to add this part of the generative model, we are going to add one transition mapping, B. This will add a random variable, here S(1,2). And then we will add the associated observation using the likelihood mapping. So this is how we expand the generative model as we go. Mathematically, this is defined like this. So we see here a generative model over all the variables in the model.
We have Dirichlet priors over the parameters of the model, so the A, B, and D matrices. The state at time zero depends on D. Each observation depends on the state associated to it and the A matrix. And then we have a prior distribution over actions, which depends on some parameters for which we have a Dirichlet prior. Okay, then the state as usual depends on the previous state and the previous action, as well as on the B matrix, which defines the transition probabilities. So this is the POMDP part of the model. And then there is the tree-like structure that expands as planning goes on. So I is the set of all multi-indices which have been expanded in the model. For example, if I come back here, we see that there are three multi-indices that have been expanded, so I is equal to the multi-index (1), the multi-index (2), and the multi-index (1,1). Okay, and now for each of those multi-indices, we add to the generative model a transition and a likelihood mapping. So that's exactly what we are doing: for each index in the set of multi-indices which have been expanded, we have the likelihood mapping and the transition for that specific node in the future. Okay, so now we need to test this approach in an environment. For now, we'll be testing it inside a maze environment, where we have here the exit of the maze and here the starting position. And then the prior preferences of the agent will be that the closer we are to the exit, the happier the agent will be. So here we have a distance of 0, 1, 1, 2, 3, 4. One feature which is really important is that the prior preferences go through the walls. So here we have a 3 and here we have a 4 in terms of the distance; we don't have 5, 6, 7, 8, we have 4. And what this produces is this blue cell. And what this blue cell actually is, is a local minimum in which the agent can get stuck if it is not careful.
And the way to avoid this local minimum is to be able to plan far enough into the future to see that the rewarding path is actually this one, and not the one leading to the local minimum. So this is basically a challenge that we are adding to the task. And here is an illustration of the tree being expanded, and here we have the table of results. We see that as we increase the number of planning iterations, we go from an agent stuck inside the local minimum, never reaching the exit, to an agent which behaves properly and solves the task 100% of the time. Next, we need to compare our branching time active inference to the previous state of the art, which is standard active inference. Basically, we designed environments in which, for the agent to be able to solve them, it has to plan 3, 5, and then 8 time steps into the future. Standard active inference was able to properly solve the first two tasks. But because the number of policies it evaluates into the future grows exponentially with the number of time steps it has to plan for, on the last one, which is the biggest, it crashed. With our approach, we did the exact same thing, and when we increase once again the number of planning iterations, we see that the agent becomes able to solve the tasks all the time. And it does not crash, because it is able to explore the space of all possible policies in a clever fashion using Monte Carlo tree search. This was a bit of an empirical comparison between BTAI and active inference. Now what I want to do in this slide is to compare them in terms of complexity classes, which is a lot more theoretical. So basically here, each circle corresponds to a categorical distribution over a state, which means that to store one of those circles, we need to store a number of parameters equal to the number of state values.
So if we have a state which takes three values, we need to store three parameters. Then for each time step into the future, we need to store one more categorical distribution when it comes to active inference. And we also need to store one more categorical for each possible policy. So the total complexity is equal to the number of policies, times the number of time steps until the time horizon, times the number of parameters we need to store for each categorical distribution. Now, in the worst-case scenario, the first thing to say is that BTAI does not store every single possible combination. It uses the tree structure of the generative model to store only one distribution for each time step in the past and present. And then, if we expand the entire tree, it is going to store the number of actions to the power of the time horizon minus t. So this is still exponential in the worst case, because of this exponent over there. But in practice, we never expand the entirety of the tree. Maybe we will expand this, this, and this, but not the two others. So in practice, the real complexity of this algorithm is linear in the number of expansions that we are making. Okay, so that was to show that branching time active inference does not require as much storage space as standard active inference. Now I want to speak about the second approach, which is based on Bayesian filtering. So first, what is Bayesian filtering? Well, it is a Bayesian inference algorithm which starts with just a simple generative model with a state and an observation. We actually know which observation we are making, and we also know what the prior over the state and the likelihood are. And then we can, for example, use Bayes' theorem to compute the posterior over the state given some observations, and we just compute it like this. Once we have a posterior over the state, we can use it as an empirical prior. So here is our empirical prior.
And we also know the transition probabilities to the next time step. So we can use this information, as well as the action that we actually performed in the environment, to compute the predictive posterior over the state at time step one, given the action that we just made, which is U0, and the observation that we made before. And the way this is done is just by performing Bayesian prediction through the transition mapping, so summing out S0. And then comes another observation, and we can just use the predictive posterior that we got from the previous prediction step to compute a posterior over S1 according to this new observation. And those two steps, integrating evidence and then predicting through the transition mapping, are iterated as many times as we need. So this leads us to the second approach that I want to present today, which is branching time active inference with Bayesian filtering. The first thing is that we are not storing the past observations and states anymore, because all the information we need is stored within the beliefs over the state at the current time step. Okay, so when we have an observation at time T, we can perform the evidence integration step to get a belief about the state at time step T. And then we can perform forward prediction for each of the children that we want to expand. For example, maybe we'll compute this one, and then this one, and so on and so forth. And if Monte Carlo tree search tells us to expand this child, then we will just perform forward prediction for this one as well, and expand its associated observation. So that's the main idea. Really, the only difference here is that we don't have a partially observable Markov decision process for the past anymore, we only have the current time step, and we are also changing the inference algorithm from variational message passing to Bayesian filtering.
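The two Bayesian filtering steps just described, evidence integration and forward prediction through the transition mapping, can be sketched as follows. The likelihood and transition matrices are toy numbers, not from any actual environment.

```python
# Sketch of the two Bayesian filtering steps: integrating evidence with
# Bayes' theorem, then predicting the next state through the transition
# mapping for the performed action. All matrices are toy numbers.

likelihood = [[0.9, 0.2],   # P(o | s): rows index observations, columns states
              [0.1, 0.8]]
transition = [[0.7, 0.3],   # P(s' | s) for the performed action:
              [0.3, 0.7]]   # rows index s', columns s

def update(prior, o):
    """Evidence integration: posterior proportional to likelihood * prior."""
    joint = [likelihood[o][s] * prior[s] for s in range(2)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def predict(posterior):
    """Forward prediction: sum out the current state through B."""
    return [sum(transition[s2][s] * posterior[s] for s in range(2))
            for s2 in range(2)]

belief = [0.5, 0.5]          # prior over the initial state
belief = update(belief, 0)   # we observe o = 0
belief = predict(belief)     # then predict the state at the next time step
```

Iterating `update` and `predict` is the whole filter; in the branching-time variant, `predict` is also what expands each child node in the future tree.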
And here's a more formal definition of the generative model. So we can see here the likelihood and the prior over the state for the initial time step, here the current time step T. And then for each future state and observation, for each multi-index that we have already expanded, we have the likelihood mapping and the transition mapping associated to it. In terms of performance, we compare here branching time active inference using Bayesian filtering to the same algorithm, BTAI with variational message passing. And we see that for the same task, one of them performs on the scale of minutes, so between four and seven minutes per trial, where the other performs faster, between two and eleven seconds. So this speed-up in performance is basically made possible by the change of inference algorithm. But now what we want to do is to be able to define more than one observation and state at each time step. And the way we are going to do this is by changing once again the inference algorithm, from Bayesian filtering to belief propagation. So what is belief propagation? Basically, belief propagation is an algorithm that takes as input a function over some state variables. And this function, we know that it factorizes into a set of n factors, which we call fi. And the question is: how can we compute the marginal of this function when we marginalize out all the random variables except for one, sm? Okay, so we want to marginalize the distribution over all but sm. And the way belief propagation solves this task is basically by passing messages through the computational graph. So in the graph, we have two kinds of nodes. We have factor nodes, which represent factors of the distribution, for example f1. And then we have random variables, maybe x1. And we may have several of them, so f2 and then x2. Okay, and maybe we have a transition mapping between the two. So this is the setting.
And now what we want to do is to pass some messages through the graph. So first, for a message from a variable node x to a factor (let's suppose we have one more factor here), what we do to compute the output message is just multiply the input messages that come from the other edges going toward this node, and output the result. Then, for a message from a factor node, we basically take the factor associated to this factor node and multiply it by all the incoming messages. And then we marginalize out all the input dimensions, so that the message has the exact same shape as the target random variable. So this is the marginalization I'm speaking about. We do that for every single message we can compute inside the factor graph. And then we use those messages to compute the marginal, which was the goal of this algorithm. The way we do that is that we just take all the incoming messages of the variable and multiply them together, and this gives us the marginal distribution over the specific state we wanted. So this leads us to multi-modal, multi-factorial branching time active inference, which is the last approach that we have been developing. Before being able to speak a bit more about this approach, I need to introduce the notion of temporal slice. A temporal slice is just a set of states and observations: we have S states and O observations. So this is plate notation, which just duplicates a variable O times or S times. And then we have those dashed lines. What those dashed lines are doing is connecting the observations to a subset of the states over there. So for example, maybe we have observation one, which depends on state one and state two. And then maybe we have observation two, but this observation two only depends on state two.
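The two message rules above can be run by hand on a tiny factor graph: two variables, one pairwise factor, and two unary factors, all with toy values. The marginal obtained from the messages must match brute-force marginalization of the full function.

```python
# Sketch of belief propagation on a tiny factor graph: variables x1, x2,
# a unary factor f1(x1), a pairwise factor f2(x1, x2), and a unary factor
# f3(x2). All factor values are toy numbers.

f1 = [0.6, 0.4]                    # factor over x1
f2 = [[0.9, 0.1], [0.2, 0.8]]      # factor over (x1, x2)
f3 = [0.5, 0.5]                    # factor over x2

# message from f1 to x1 is just f1; the variable node forwards it to f2
m_x1_to_f2 = f1
# factor rule: multiply the incoming message in, marginalize out x1
m_f2_to_x2 = [sum(f2[a][b] * m_x1_to_f2[a] for a in range(2))
              for b in range(2)]
# variable rule: the (unnormalized) marginal of x2 is the product of
# all messages arriving at x2
marg = [m_f2_to_x2[b] * f3[b] for b in range(2)]
z = sum(marg)
marg = [m / z for m in marg]

# brute-force marginalization of the full function, for comparison
brute = [sum(f1[a] * f2[a][b] * f3[b] for a in range(2)) for b in range(2)]
zb = sum(brute)
brute = [m / zb for m in brute]
```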
So the reason for which it's a dashed line is because we can have a sparse mapping between the states and observations; we don't have to have all the possible connections. Okay. And two temporal slices can be connected through the transition mapping, which is these arrows. And these arrows mean exactly the same thing, but between two time steps. So, for example, the state over there can depend on the states at the previous time step in an arbitrary fashion, exactly like the observations depend on an arbitrary subset of the states. But this representation is a bit impractical when it comes to presenting the entire generative model. So what we do is that we just represent this temporal slice as a square, which is called TS_t, the temporal slice at time t. And the background here is gray because the observations within the temporal slice are provided, they are actually observed, while in the future the background will be white because the observations are just not observed. Okay. So now that we have this more compact representation, we can present the generative model. So here we see the initial time step. And then from there, we can expand some new temporal slices exactly when Monte Carlo tree search asks us to. So we start there, maybe we compute the UCT criterion, this is the node with the highest UCT criterion, so we are basically asked to expand its children. And we can do so by just using forward prediction for computing the states here, and then from the states, we use forward prediction once again to predict the future observations associated to this temporal slice. Okay. More formally, this generative model is a joint over all the random variables once again. And it is the product of the probability of the temporal slice at time t multiplied by all the future temporal slices, so all the temporal slices we have already expanded during Monte Carlo tree search. Now, let's look at the initial temporal slice.
Each observation depends on a subset of the states within this temporal slice. And because we are at the top of the tree, the states at time step t do not depend on anything. After that, for each temporal slice in the future, we still have the fact that the observations depend on the states within this temporal slice, but we also have the fact that the states in this temporal slice depend on the states in the previous, i.e. the parent, temporal slice. So for example, the states within this temporal slice have parents inside this temporal slice. Okay. So this is the way the generative model is defined. Now, the way we perform inference is using what we call the I and P steps, where I stands for inference and P stands for prediction. So this slide is about the inference step. The goal is to compute the posterior over the states within the initial, i.e. the current, temporal slice, given the observations. I'm not going to go through this derivation; you can pause the video if you are watching it on YouTube. Basically we do some derivations and then we obtain this solution, which just tells us to take the product of all the likelihood mappings with all the priors, and then use belief propagation to actually marginalize this function. And that's exactly what we are going to do: we use belief propagation within the current time step. So if I go back here, we have some observations here, and we use belief propagation to compute the states within the initial temporal slice. So this is the I step, which is the inference step. Then we need to perform the P step each time the Monte Carlo tree search tells us to expand a part of the generative model. Once again, I'm not going to go through this derivation, but basically the idea is to compute the posterior over the states in the next temporal slice, given the observations that we made in the current temporal slice.
So here we see that basically, once again, we have some kind of summation over all the parents of the states in the temporal slice that we want to compute the posterior for. And this is the transition mapping, which we know, and this is the posterior distribution from the previous temporal slice. So basically what we are doing is just forward prediction, in this case taking the expectation of the transition mapping. We can do this for the states and for the observations. So to come back to the main picture, we first use the I step to compute the posterior over the states of the initial temporal slice. So this is the I step. And then we can use this posterior distribution to compute the posterior over the states in the future, maybe the states in this temporal slice. And then we can use once again the P step to compute the distribution over the future observations that correspond to this temporal slice. And we are going to do that for all the temporal slices we want to expand in the future. And the next and last thing that we need to define for this approach is the expected free energy. So basically, the way we are going to define this is by first grouping all the observations into disjoint subsets. O_I is the set of all observations in the temporal slice indexed by the multi-index I, and we are going to split it into subsets. Once we have grouped those observations into subsets, the way we define the expected free energy is just as a sum, over all those groups of random variables, of the KL divergence between the prior preferences and the predictive posterior for those observations. So this is the risk term, the KL divergence between what will happen and what we want to happen. And then we compute the ambiguity for each of the observations, which is, as usual, defined as the expected entropy of the likelihood mapping. So this sort of equation is probably new for people that have not read the paper.
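The risk-plus-ambiguity definition above can be sketched numerically for a single observation subset. All distributions are toy numbers; the real agent computes these terms for each expanded temporal slice.

```python
import math

# Sketch of the expected free energy as risk plus ambiguity. The risk is
# the KL divergence between the predictive posterior over observations and
# the prior preferences; the ambiguity is the expected entropy of the
# likelihood mapping under the predicted states. Toy numbers throughout.

pred_obs = [0.7, 0.3]        # predictive posterior over a future observation
preferences = [0.9, 0.1]     # prior preferences: what we want to observe
pred_states = [0.6, 0.4]     # predicted posterior over the future state
likelihood = [[0.9, 0.2],    # P(o | s): columns index the state
              [0.1, 0.8]]

# risk: KL(predictive posterior || prior preferences)
risk = sum(pred_obs[o] * math.log(pred_obs[o] / preferences[o])
           for o in range(2))
# ambiguity: expected entropy of the likelihood columns under pred_states
ambiguity = sum(pred_states[s] *
                -sum(likelihood[o][s] * math.log(likelihood[o][s])
                     for o in range(2))
                for s in range(2))
efe = risk + ambiguity
```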
But we can look at one specific case, which makes it more intuitive. Basically, this specific case is just when each subset corresponds to one observation in the temporal slice. And in this case, the expected free energy for a specific policy, so for one specific multi-index, is just the risk for each specific observation plus its ambiguity. So we still have risk plus ambiguity. And now we need to test those approaches together, so compare branching time active inference with variational message passing, with Bayesian filtering, and the last approach, which is based on belief propagation. The way we are going to do that is by using a variant of the dSprites dataset. Basically, we represent the environment as a grid. In this environment, we have three different shapes. We have ellipses and hearts, which need to be pulled towards the bottom-right corner of the image, and we have squares, and the squares need to be pulled to the bottom-left corner of the image. And because there are way too many positions in X and Y, what we do is some form of state aggregation. For the eight first positions in the upper-left corner, if the shape is in one of those eight positions, we are going to aggregate them into one state with index zero. The next group will be index one, and so on and so forth, 2, 3, 4, up to 19 for the squares. For the hearts, we will allocate the indices between 20 and 39, and similarly for the ellipses. So what we are doing is basically reducing the state space so that some of those approaches can still do something, because they are not powerful enough to handle the entire state space. So here are the results, where we compare variational message passing, Bayesian filtering, and the last one, which is based on belief propagation.
With variational message passing, we had to use a granularity of four, which means that each cell is a four-by-four block of positions. With this setting, we were able to solve the task 96% of the time, and the average time for one trial is around five seconds. With the Bayesian filtering approach, we were able to go down to a granularity of two, so this time the cells are two-by-two blocks. With this granularity, the agent becomes able to solve the task 98% of the time. But because we reduced the granularity, we also increased the size of the state space, and this produces an increase in computation time, which means that each trial now requires around 17 seconds to execute. For the last approach, basically, we use the fact that we now know the factorization of the likelihood and transition mappings. This allows us to go all the way down to a granularity of one, so we are now able to differentiate between every single X and Y position inside the image. With this granularity, we can solve the task perfectly. And because we can take advantage of the factorization of the distributions, we go a lot faster than all of the previous approaches and can solve the task in around 2.5 seconds. I just want to make one thing very clear here: this approach was able to model every Y position, every X position, every shape of the dSprites environment, every possible orientation of the shape, and every scale, since each shape can have different sizes and the scale dimension represents that. So this approach was able to deal with around 700,000 configurations of the state space. Now that we have presented the results and shown that this approach can reach very good performance, how do we use it in practice?
Well, here I have a very small code example where I retrieve the A, B, C, and D matrices from the environment. The C matrix corresponds to the prior preferences of the agent, and as I already said several times, A corresponds to the likelihood, B to the transitions, and D is the prior over the initial states. Then the way we go about creating the BTAI_3MF agent is by creating the temporal slice builder, telling it that we have one action variable, which is called A_0, and giving it the number of values that this action can take. Then we add one state for every single state of our system: the X position, Y position, shape, scale, and orientation. And we provide as a second parameter the parameters of the prior over that state. So now, within our generative model, we have the states of the system. Then we need to add the observations: for each of the states we add one observation which depends on that state through the A matrix, so we provide the A matrix in the list of parameters there. This adds an observation for each state in the generative model. The next-to-last step is to add the transitions. For each state in the system, we say which B matrix needs to be used and what the parents of that state are. For example, the position in X of the agent depends on the position in X at the previous time step and on the action being performed. This adds the transition probabilities to the model. And the last step before building the temporal slice is to define our prior preferences. Because in the dSprites dataset we need prior preferences about the X, Y, and shape of a sprite, what we do is say: here is one factor, this is one of the subsets of observations, and we provide the C matrix associated with it. Then we call the build function, which just returns the temporal slice.
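To make the walkthrough concrete, here is a schematic, self-contained re-implementation of the builder pattern being described. The class and method names mirror the narration, but this is a hypothetical sketch, not the API of the actual BTAI_3MF package.

```python
import numpy as np

class TemporalSliceBuilder:
    """Schematic stand-in for the builder described in the talk."""

    def __init__(self, action_name, n_actions):
        self.action = (action_name, n_actions)
        self.states = {}        # name -> D vector (prior over the state)
        self.obs = {}           # name -> (A matrix, parent states)
        self.transitions = {}   # name -> (B tensor, parents incl. action)
        self.preferences = {}   # obs subset -> C matrix

    def add_state(self, name, d):
        self.states[name] = np.asarray(d)
        return self

    def add_observation(self, name, a, parents):
        self.obs[name] = (np.asarray(a), parents)
        return self

    def add_transition(self, name, b, parents):
        self.transitions[name] = (np.asarray(b), parents)
        return self

    def add_preference(self, obs_names, c):
        self.preferences[tuple(obs_names)] = np.asarray(c)
        return self

    def build(self):
        return {"action": self.action, "states": self.states,
                "obs": self.obs, "transitions": self.transitions,
                "preferences": self.preferences}

# Tiny 2-position toy, mirroring the steps in the narration.
n = 2
slice0 = (
    TemporalSliceBuilder("A_0", n_actions=2)
    .add_state("S_x", d=np.full(n, 1 / n))                             # D
    .add_observation("O_x", a=np.eye(n), parents=["S_x"])              # A
    .add_transition("S_x", b=np.zeros((n, n, 2)), parents=["S_x", "A_0"])  # B
    .add_preference(["O_x"], c=np.full(n, 1 / n))                      # C
    .build()
)
```

The fluent-builder style matches the narrated sequence: action, states with D, observations with A, transitions with B and their parents, preferences with C, then build.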
Then we create a BTAI_3MF agent which uses this temporal slice, and we provide the number of planning iterations we want the algorithm to use as well as the exploration constant that trades off exploration versus exploitation. And this is a graphical user interface that we have been developing. Here we can see the initial temporal slice. If we were in the software, we could click on the temporal slice and see the different posteriors over all the state variables. We can also see different information about the temporal slice, such as the number of times it has been visited, and so on. Then we can use the button on the left to perform a step-by-step Monte Carlo tree search. What happens in the interface is that we add the children, and then we can click on those children to explore those parts of the tree, and we will have the same information about the temporal slices in the future. So that's a tool that you can use to analyze both the planning and the beliefs of the agent. Okay, so now I am done presenting the different approaches, and it is time for me to conclude this presentation. We have seen three different approaches. The first is based on variational message passing, combining active inference and Monte Carlo tree search. The second is based on Bayesian filtering. And the last one is based on the BP algorithm, which is a mixture of belief propagation and forward prediction. In terms of performance, the first, BTAI_VMP, is basically slower and less performant than the second, which in turn is less performant than the third. But even with this increase in performance from one approach to the next, there are still some tasks that we cannot solve. For example, how do we solve image-based problems? It is not clear how we can learn the structure of the generative model.
For now, the modeler has to provide the description of the model, but it would be very nice if we could learn it from data. Also, for now, we have basically been providing the agent with useful sequences of actions. For example, if we go back to the dSprites environment, we could imagine a task where each time we go right, it only pushes the shape one position to the right, and then one more, and then one more. What we have been doing to make planning tractable is chunking all those actions together and maybe executing eight actions at once, or four actions at once. But this has its limitations, and it would be very nice if we could learn sequences of actions like this automatically. Well, that's all I wanted to say. Here are some references if you want to look them up after the presentation. Thank you. Thank you for the presentation. Very interesting, a lot of material and things for us to discuss. I'll start with just one general question, and then Jakob and Ali, looking forward to your questions. Just for some context, how did this team and you come to be working on this problem? Were you working on active inference and interested in extensions? Or were you working in a different adjacent area and came to this algorithm? Okay, so maybe one thing that I should have said is that I started a PhD at the University of Kent, and Howard and Marek, who are two of my collaborators on this project, are my supervisors. So this is how we came to work together, just through my PhD. And Lancelot Da Costa is a collaborator that I have been working with because I have been giving some presentations at the FIL, the Institute of Neurology where Karl Friston is. During one of those presentations, he told me that he was interested in working with me, and this is how I started working with Lancelot Da Costa.
In terms of my background, I come from a very computer-science school, fully into coding. Then I arrived at the University of Kent, where I started to study machine learning, and this is where I started to gain some experience in reinforcement learning and even active inference. Basically, Howard Bowman and Marek Grześ were two of my teachers, and Howard Bowman was already interested in active inference. One day, basically after class, I came to see him and said, well, I would be interested in a side project. And that's how everything started, and I ended up doing a PhD after that. So basically a bit of both: I come from machine learning, Marek Grześ as well, Lancelot comes from pure mathematics, and Howard comes from neuroscience and is the one who brought the active inference subject to the table. Excellent. Thank you. So I have many more questions, but how about Ali first with a question? Yeah. Well, first of all, thanks a lot for your amazing presentation. I really learned a lot. I'd like to make a couple of comments, if I may, and maybe a bunch of questions. As you're well aware, active inference slash FEP research has recently been going on along basically two distinct paths. There is the theoretical work, geared mostly toward developing the underlying foundational principles (for example, the work of Dalton Sakthivadivel, Maxwell Ramstead, and colleagues comes to mind), and there is the more application-oriented research, perhaps not unlike the distinction between theoretical and experimental physics lines of research. I'm not sure if you agree with me on this, but it seems to me that sadly the application-oriented research, and especially the work on the various algorithmic implementations of active inference, has not gained as much recognition as it should, at least compared to the research on the theoretical side.
I mean, the amount of related published literature is pretty scarce, comparatively. So in light of this context, your line of research seems to me a lot more daring and gains much more significance, so I wanted to congratulate you on that. But you see, I'm a big fan of unification in science and technology, so my first question is about your opinion on the possibility of unifying the different algorithmic implementations of active inference. As an example, you just mentioned the automatic learning of the structure of the generative model as a possible subject for future research. I'm not sure if you've seen it at all, but there is a very recent paper from a couple of weeks ago, namely "Learning Generative Models for Active Inference using Tensor Networks", if I'm correct, which outlines an interesting physics-inspired approach to that task. It doesn't include any citations to any of the branching time active inference papers, probably because the authors weren't aware of your work at the time of writing. But this work looks to me like a pretty good candidate for a potential integration with BTAI, in order to overcome some of the limitations you just mentioned. Another recent example would probably be Sennesh's paper "Deriving time-averaged active inference from control principles", which is an attempt to derive an infinite-horizon, average-surprise formulation of active inference. So I really liked your comparative overview of the different variants of branching time active inference, especially the benchmark analyses. And I know you described sophisticated active inference in your work as a subset of branching time active inference.
But regardless of these specific examples I just mentioned, I wanted to ask how you see the future of BTAI in terms of its possible unification with the other variants of active inference implementations, each with its own pros and cons, in order to overcome some of their limitations without compromising the advantages of each. I mean, do you see it as a possibility that branching time active inference will one day subsume all the other approaches somehow, in a truly integrated kind of framework? So, to be honest, I don't exactly know. The only part of the literature that I have been exploring is the connection of branching time active inference with standard active inference and sophisticated inference. I didn't really speak about this in the presentation, but I can quickly give the idea of what the conclusion of that work has been. Basically, it is really about how we backpropagate what I call the local expected free energy, which is the expected free energy associated with one node in the future. If you backpropagate those upwards in the tree, following Monte Carlo tree search, which basically comes from Bellman's equations and that kind of literature in reinforcement learning, you will arrive at an approach which is very close to sophisticated inference, basically because sophisticated inference also takes some inspiration from Bellman equations, just applied to the expected free energy instead of just the reward. If you go the other way, so if you propagate those local costs towards the future, then what you are effectively doing is computing the path integral of the expected free energy, and so this will be active inference, just taking the sum of the expected free energy over all future time steps. So one direction gives sophisticated inference, and the other gives active inference.
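Under strong simplifying assumptions (a deterministic toy tree of local expected free energies, a hard minimum in place of a softmax expectation), the two directions of propagation can be sketched as follows. For such a deterministic tree the two give the same best total, which is exactly the connection being described.

```python
# Each node is (local_expected_free_energy, children).
tree = (1.0, [(0.5, [(0.2, []), (0.9, [])]),
              (0.7, [(0.1, []), (0.4, [])])])

def bellman_backup(node):
    """Sophisticated-inference flavour: propagate costs upwards,
    each node adding the best (minimum) child value, Bellman-style."""
    g, children = node
    best_child = min(bellman_backup(c) for c in children) if children else 0.0
    return g + best_child

def path_sum(node, accumulated=0.0):
    """Active-inference flavour: sum local costs forward along each
    full root-to-leaf path (policy); return the best total."""
    g, children = node
    total = accumulated + g
    if not children:
        return total
    return min(path_sum(c, total) for c in children)
```

Both aggregate the same local costs, just in opposite directions: one backs values up the tree, the other integrates them along each path.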
Now, concerning the other approaches that you mentioned, I haven't read those papers, so I can't really comment on that. But yes, I believe that branching time active inference is a fairly general framework, so it may well be that some of them are related, but more research is needed. Thank you so much. Great. Great comments. Jakob, do you have a question? Yeah. Once again, thanks a lot for the awesome presentation. It definitely explained a lot of things that I didn't understand on my reading of the original paper. I'm wondering, again going back to the question of learning the structure of the different components of the generative model: in your paper, you mentioned using deep neural networks as general function approximators for learning the state-space representation, and I'm wondering whether you have given some thought to how neural networks might fit into this factor graph representation of the generative model. I guess there are perhaps also two ways to look at learning the structure of these different components. One is just the initial step in your slide, when you showed the initialization of the model, getting the A and B tensors, the prior preferences, and the prior beliefs: replacing that step with deep neural networks to learn the representations. But perhaps there's also another side, where you could dynamically change the dimensions of these different components, as in, perhaps the agent receives an observation that wasn't captured in the likelihood mapping of the A tensor, or perhaps it's a multi-agent setting where one agent has affordances the other agent doesn't, and a new transition mapping needs to be learned through observations. So I'm wondering what your thoughts are on this and how you think this might be compatible with branching time active inference. Okay, so the deep learning area is, I think, really interesting and should enable active inference to scale to more complicated tasks.
More recently, so this paper is not yet out, but I have been working on a deep learning version of active inference. There is no branching time in the picture, so there is no Monte Carlo tree search, and the reason is that it is already surprisingly difficult to make it work just for active inference. Basically, I have been reviewing some of the papers in the literature, and then I provide my own implementation of deep active inference. For example, I spoke about the dSprites dataset, and I was not able to make deep active inference work on these environments. I have been giving a presentation at the FIL about this, but basically some of the implementations on the internet contain mistakes. And some of the papers also contain some, I mean, I'm not sure if for the papers it's a mistake, but there is stuff I don't understand, and for now the authors have not been able to answer my questions. So what I'm trying to say is that I've been trying to implement deep active inference, and surprisingly, it is quite difficult to make it work, at least on the dSprites environment. So there will be a first paper analyzing what the deep neural networks are actually learning and why they are failing on these environments. Then what I want to do is to apply this implementation, which I hope is correct, to different tasks, especially the Atari games and things like that, and try to find out whether there are some tasks for which the expected free energy and this implementation of deep active inference can actually solve the task. The preliminary result that I have is that there seem to be some tasks for which deep active inference actually performs better than, for example, DQN, which is a benchmark from the reinforcement learning literature. But it's not as straightforward as it seems; for now, it's quite challenging to implement.
So that was more about the deep learning aspect. I think we still need quite a lot of work before we have something very robust that can beat more standard reinforcement learning benchmarks. And the other approaches to structure learning I have not researched for now, so I would need more time to give a more robust answer to your question. That's basically what I had to say. Thank you. One really striking aspect of the presentation was the analysis of the computational complexity, so maybe we could return to this, because it's something that we've wondered about and discussed on a few occasions. You presented the theoretical complexity classes with big-O notation and then also discussed some of the practical aspects, the actual wall-clock time on given hardware. I wasn't exactly sure what language or hardware you ran it on, but you provided the theoretical complexity class as well as some runtime figures. So I was curious to hear some thoughts on how this big-O computational complexity analysis shines light on different variants of active inference as well as branching time active inference, and what real computational resources were taxed in the analysis. Was it a RAM overflow that caused the crash that you referenced earlier? Is it CPU throttling? Is it parallelizable? Does it require temp files? What in theory is happening with the computational complexity and the exponential blow-up, and what in practice is going to help this kind of analysis scale? Okay, so first of all, this complexity analysis was done in terms of the space required to store all the parameters of the distributions over states. Here we are really interested in how much space is needed to store all the posterior distributions over states. What happens in standard active inference is that the number of policies available to the agent grows; so let's suppose we have two actions.
We have one here, one here: action zero, action one. At the first time step there are two actions. At the second time step there will be four policies that the agent can perform: it can do zero-zero, zero-one, one-zero, and one-one. And this number of policies is multiplied by two at each time step, further down the road, because each time we can pick each of the actions again for each of the previous policies. This exponential growth is quite problematic for, for example, the prior over policies. If you remember the definition of the prior over policies, we see that in order to define it we need to compute the expected free energy for each of those policies, and we need to do that for an exponentially growing number of them as the planning horizon increases. So this is the first problem. And this exponential explosion is not limited to the number of policies, because, and for this I need to go and look at the variational distribution used in variational inference, we see that the variational posterior over states depends on the policy. So for each of the policies we need to store a distribution over states, and this is once again a problem: an exponential number of policies means an exponential number of variational posteriors that we need to store. This is the kind of problem that happens in standard active inference. The number of policies grows exponentially, and we also need to store the distributions over states for each time step. Now, where branching time active inference becomes useful is that it uses the tree structure of the graph to avoid having to store every single possible combination of time step and policy.
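The blow-up, and the softmax prior over policies it makes expensive, can be illustrated numerically. The precision gamma and the stand-in expected free energies below are arbitrary numbers, not values from the experiments.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

n_actions, horizon = 2, 10
n_policies = n_actions ** horizon        # exponential blow-up: 2^10 = 1024

# Prior over policies: P(pi) = softmax(-gamma * G(pi)), so every one of
# the 1024 expected free energies must be computed just to define it.
gamma = 1.0
G = np.random.default_rng(0).uniform(0.0, 5.0, size=n_policies)
prior = softmax(-gamma * G)

# Higher precision makes the policy distribution more deterministic.
prior_sharp = softmax(-10.0 * G)
```

The policy with the lowest expected free energy always gets the highest prior probability, and raising gamma concentrates the prior onto it.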
And so these two quantities, we see, grow linearly, because we just keep in memory one distribution for each state in the past and present. Only when it comes to the current time step do we start imagining what is going to happen in the future. This growth would still be exponential if we had to explore every single possible policy in the future. But because we are using Monte Carlo tree search, we only explore a small part of the tree, and this is where the complexity moves from exponential to linear with respect to the number of expansions of the model. Each time we expand a new branch of this model, we need to store one more categorical distribution for that future temporal slice. And this is how we can move from an exponential complexity class to a linear one with respect to the number of expansions of the generative model. So that was for the complexity class. In terms of hardware, it was basically just my own computer: no GPU used, nothing like that, just a CPU. Very interesting. And on the hardware or implementation side, where do you see packages in Python or Julia, like ForneyLab and the reactive message passing paradigm being developed, or do you see GPUs as being relevant? That was the storage consideration; what kinds of scaling relationships, in theory and in practice, govern the operational aspects of the computation, rather than the space requirements? So the first thing to say is that the space complexity is also linked to the time and computational complexity, because, for example, as I said when I was speaking about the prior over policies, if we have an exponential number of policies, we need to compute an exponential number of expected free energies, one for each of those policies.
Similarly for the posterior: when we have to store the variational posteriors and there is an exponential number of them, we also need to compute them, so this also becomes intractable in terms of runtime. In terms of implementation, I know that some people have been developing ForneyLab in Julia, and I have been providing my own implementation in Python, so those are possibilities. In terms of GPUs, I guess their usage will be really useful only if the graphical model allows for parallelization. For example, one case where GPUs are very useful is for images, because each position in the image can generally be processed in parallel. So if we had a generative model where we had a likelihood mapping over just four pixels next to each other, like a patch of the image, and we had to compute all the posteriors for the whole image, then there is a very large potential for parallelization. But if, for example, a message has a dependency on a previous message, then part of the GPU will just be waiting for the input message to arrive. So there is a limitation there, because some of the analytical solutions require other messages; there are dependencies throughout the graphical model. So using GPUs, yes, but probably for specific generative models for which it is useful, such as image-based generative models. If we have something which is very simple, then I think it is not going to benefit a lot from GPU computation. Excellent. Thank you. Ali, again, if you'd like another question, or I can ask one. I also wanted to ask about the different possible future implementations of branching time active inference, because Daniel and Jakob know that I'm a big Julia fan, so I wanted to know if there is a plan to have a Julia implementation of branching time active inference, because I think we already have C++ and Python: the first branching time active inference was implemented in
C++, and the multi-modal one in Python. So what are the future plans for other implementations of BTAI? So for now, I was just planning to use Python, but I guess it should not be too hard to port it to Julia; for now, I just don't have a use for it. Also, as a future possibility for branching time active inference, I have started working on implementing a SLAM algorithm, that is, simultaneous localization and mapping: basically, we create a map of the environment and navigate through it. But within this context, the tables were going to grow exponentially, because we can have observations that depend on a very large number of states, and therefore what we need is more like a conditional probability table stored as a tree. Basically, you can encode rules within the conditional probability table. For example, suppose you have the probability of C given A and B. Then maybe if A is equal to 1 you want to branch on B, with branches 0 and 1, and maybe the leaf holds 0.1 and 0.9. What this means is that if A is equal to 1 and B is equal to 0, then the probability distribution over C is going to be 0.1 for the first value and 0.9 for the second value. So the idea is to try not to represent the entire table, but to use a tree structure to encode rules about the dynamics of the world, and about the likelihood mapping as well. Then the challenge is to be able to perform forward prediction from this tree, and inference from it as well. So this is another feature which may be integrated into BTAI in the future. Definitely a theme that runs through a lot of these discussions: representing objects as trees, and then taking the tree view, or the forest view, seriously, because the tree structure allows us to avoid redundancies and enables some new types of analyses. Jakob, do you have a question, or shall I ask one?
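The tree-shaped conditional probability table just described can be sketched with nested dictionaries. The 0.1/0.9 numbers come from the example above; the representation itself is an illustrative sketch, not the planned BTAI data structure.

```python
def cpt_tree_lookup(tree, assignment):
    """Walk a tree-structured conditional probability table.
    Internal nodes branch on one parent variable; leaves hold a
    distribution over the child. Shared leaves encode rules instead
    of enumerating every row of the full table."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][assignment[node["var"]]]
    return node

# P(C | A, B): only when A == 1 does the value of B matter, so the
# A == 0 case is a single shared leaf instead of two table rows.
tree = {
    "var": "A",
    "branches": {
        0: [0.5, 0.5],                       # A == 0: B is irrelevant
        1: {"var": "B",
            "branches": {0: [0.1, 0.9],      # A == 1, B == 0
                         1: [0.9, 0.1]}},    # A == 1, B == 1
    },
}
```

A full table over A and B would need four rows; the tree stores three leaves, and the saving grows quickly as more parents become irrelevant in more contexts.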
Yeah, maybe continuing on the SLAM thread: I'm wondering whether you have considered the application to the image classification problem, and perhaps how the image classification problem needs to be reframed to even fit, well, first of all the branching time active inference scheme, but the overall active inference scheme. Because it seems that active inference overall is much better suited to these kinds of continuously evolving problems, where the generative process changes whenever the agent takes actions, whereas an image classification problem seems way more static: in the machine learning scheme it's just input and output, and then perhaps some error that gets backpropagated through the network. So I'm wondering how you are thinking about image classification with active inference, and overall just how images can act as input in dynamical environments. So yeah, thank you for the question. Basically, the thing with image classification is that we don't have this temporal structure, as you just mentioned, which makes it quite difficult for an active inference agent to be applied to it. In some sense, it is a bit like the transition mapping being an identity function. At each time step there could be one image to classify, but I think active inference is just not really well suited to something like this. For classification, I think classification models like ResNet or whatever are much better suited. Basically, if you had to apply active inference, you would have to change the structure of the model to remove the temporal transition and just have observations, or otherwise you would need some kind of dynamic environment, like the Atari games, Pac-Man kinds of things, or the dSprites environment. In that case you can model the temporal dynamics of the environment, and here active inference really helps, because you can think about actions
and how they impact the transitions. And you can have an encoder that will compress the image: technically, you will have an image here, then an encoder network, a bit like in a variational autoencoder, that produces the parameters, a mean and a variance, of a distribution over states. Then we will have the decoder over there, which produces another image from the latent variable. And then we will have a transition network, which is also a deep neural network, which outputs the mean and variance of the distribution over the states at the next time step. Then here you will have another encoder for the future image, and another decoder for the future image as well. This transition network has to take into account the action as well as the states to predict the next state. So that's the kind of architecture you will need to create a deep active inference agent, and I think it is better suited to dynamic environments, like Atari games or dSprites, than to static environments; those are indeed a lot more complicated to apply it to. If I could give a few thoughts on image SLAM: a very fascinating point about static analyses versus dynamic analyses. So what are some ways that we could pseudo-dynamicize the image classification task? A few options. One of them is navigation amongst large databases of images: potentially choosing informative examples for training from large empirical image databases, or frames from video, or in a dynamic feedback loop with prompt engineering for AI-generated images, so that it becomes a dynamic question-response task. That would be using dynamics at the level across images, but still taking in the whole image. One other approach could be building on some of the oculomotor active inference models of attention: only taking in a small amount of the image, potentially reducing the state space or the computational complexity vastly, and then making the dynamics of some lower-level entity related to policy selections on eye movements or
attention, and then treating that as the lower level of the SLAM, with the classification, what kind of image am I looking at, as the higher level of the SLAM, while the policy is enacted at the level of which parts of the image are being scanned. Yeah, that's indeed a very nice set of ideas, which requires modification of the task but is indeed much better suited to active inference; very interesting. I wanted to also ask about two modules or functions that various other active inference proposals have had, which are hierarchical nesting of models, and learning. So how do nesting and learning influence the theoretical and the realized computational complexity? So I think having hierarchy can really help reduce the computational complexity of an active inference agent. For example, one idea would be to have a generative model over images. Imagine you had an image: an image can take an enormous number of possible configurations, probably more than there are atoms in the universe. But what you could do, a bit like imitating the structure of convolutional neural networks, is to create, over a patch of pixels, different line patterns, maybe diagonal lines or horizontal lines. You would have a first level of the hierarchy which extracts this information, and then you would have patterns of patterns, and so on and so forth. I think you can create this with categorical distributions in a hierarchical model: at the very bottom you have pixels, then you have small edges and things like that, then combinations of edges, and all the way up to having objects. But this is very complicated in terms of implementation; we would probably need to use GPUs for the inference process, because there would be a very large number of patches, so we would need to speed up the training. Still, I think this is a very good way to reduce the state space of the agent, because if you try to put an image directly as the input of
a standard active inference agent, you would have more possible images than there are atoms in the universe, probably even with small images. So hierarchy can really help with that. Sorry, you had another question? The second aspect was about learning. For example, what if we update our priors as the tree continues, or we want to consider policies on priors, or other types of updating of parameters that might be fixed in other settings? Yeah, so one thing to say about having learnable priors is that in some cases this can go really wrong. Imagine you have observations and you have states, say three possible states and two possible observations. If you start with a Dirichlet prior which is fully uniform, so the parameters are all one everywhere, what's going to happen is that when you make an observation, the inference process will infer a uniform distribution over states, because the weights within the matrix are all one. For each observation, there is no state which is more likely to have generated it, which means that the inference process produces a uniform distribution over states. And this is a problem, because what you end up having is an A matrix where maybe some states are more likely overall, but no state is more likely to generate one observation rather than another. So there is a failure of learning. If you remember, the way we update the parameters of the Dirichlet is by counting the number of state-observation pairs that we observe. If the posterior over states is uniform, say one third for each state, then we count one third of the observation for all the states at the same time, and
we are not able to identify which state generated this observation, because they are all equally likely to have generated it. So what happens is that the Dirichlet just cannot learn which states generate which observations: it counts how often each state has appeared, but not in a way that ties states to observations. This is a degenerate case which shows that learning matrices with Dirichlet priors can fail to recover the dynamics of the environment, and the likelihood of the environment as well. Maybe deep neural networks can avoid this problem, but here it seems to be a real challenge: learning the parameters of an active inference model seems to require a human to first give it a first draft, where the Dirichlet prior over the likelihood is not uniform, if the model is to be able to learn. So this is one of the challenges in active inference. It's something we've come across in specifying the state space and what policies are possible, and it's an interesting conversation, because it brings us as modelers into engagement with the model and helps clarify where we are setting scaffolds and constraints. What manifolds are we placing that agent into, such that it's set up to roll downhill within some super local context? Even if that local context is still enormous in its state space, it may still be just the tip of the iceberg in terms of the total model structures. And that's not even to say we need to explore the total model structure in practice, but in theory it's quite important, or we might just be looking where the light is and putting the rabbit in the hat: making models that play out a certain way, maybe even deterministically, because they've been told the secret in the beginning. I have one more question and then I'll let Jakob wrap up. So you contrasted three different approaches, which were variational message passing, Bayesian filtering, and belief propagation,
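The degenerate Dirichlet case described above can be reproduced in a few lines. This is a minimal sketch, not the speaker's implementation: three hidden states, two observations (as in the example), a fully uniform Dirichlet prior over the A matrix, and the standard posterior-weighted count update.

```python
# Sketch of the degenerate learning case: with a uniform Dirichlet prior
# on A, the posterior over states is uniform, so count updates never
# distinguish which state generated which observation.
import random

n_states, n_obs = 3, 2
# Uniform Dirichlet prior over the likelihood A: every count starts at 1.
a_counts = [[1.0] * n_states for _ in range(n_obs)]

def infer_states(obs):
    # P(s | o) proportional to A[o][s] under a flat prior over states,
    # where each column of A is the normalized Dirichlet counts.
    likelihood = [a_counts[obs][s] / sum(a_counts[o][s] for o in range(n_obs))
                  for s in range(n_states)]
    z = sum(likelihood)
    return [l / z for l in likelihood]

random.seed(0)
for _ in range(1000):
    obs = random.randrange(n_obs)    # environment emits some observation
    q = infer_states(obs)            # posterior over hidden states
    for s in range(n_states):        # Dirichlet update: add posterior-
        a_counts[obs][s] += q[s]     # weighted state-observation counts

# The posterior is uniform on every step, so all states accumulate
# identical counts and the learned A still cannot tell states apart:
print(infer_states(0))               # [0.333..., 0.333..., 0.333...]
```

Seeding a slightly non-uniform prior (the "first draft" mentioned above) breaks the symmetry and lets the counts start discriminating between states.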
and whether for didactic or pragmatic use, where do you see these different approaches as being useful or specialized? Where are they better, and where is one a generalization or a special case of another? So, for example, the way we structured the model in variational message passing is that we keep track of the past, and each time we add new observations into the future. In variational message passing we keep the past, and this is quite interesting, because with variational message passing you can also have backward messages. As you get a new observation, you will have the state associated to it, and you will have a forward message, but you will also have a message that goes backward in time, and those messages enable you to refine your understanding of what happened in the past. This ability to revisit and update your understanding about the past is something quite specific to the variational message passing algorithm. It does not appear, for example, in Bayesian filtering, because there we only keep a belief state over the current random variable: when we get a new observation and a new state associated to it, we just perform prediction to get the posterior, and then we get rid of the past. So we cannot have those backward messages to update the posterior beliefs over past states, and we can't really have this kind of counterfactual ability. Now, the belief propagation algorithm is very similar in spirit; it is just a more scalable approach, which enables one to have several state random variables and several observations. The BTAI_BF approach was restricted to one observation and one state, and if you have, for example, the x position and the y position of the dSprites environment, then you would need to create one random variable that
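The contrast between filtering and the backward messages just described can be illustrated with a toy two-state hidden Markov model. This is plain forward-backward smoothing rather than the full BTAI_VMP scheme, and the transition and likelihood numbers are made up for the example.

```python
# Toy 2-state HMM (made-up numbers): smoothing with backward messages
# revises beliefs about past states; pure filtering cannot.
B = [[0.9, 0.1], [0.1, 0.9]]   # transition: B[s][s2] = P(s2 | s)
A = [[0.6, 0.4], [0.1, 0.9]]   # likelihood: A[s][o] = P(o | s)
D = [0.5, 0.5]                 # prior over the initial state
obs = [0, 1, 1]                # observation sequence

def normalize(v):
    z = sum(v)
    return [x / z for x in v]

# Forward pass (this is all that Bayesian filtering ever keeps):
alpha = [normalize([D[s] * A[s][obs[0]] for s in range(2)])]
for o in obs[1:]:
    pred = [sum(alpha[-1][s] * B[s][s2] for s in range(2)) for s2 in range(2)]
    alpha.append(normalize([pred[s2] * A[s2][o] for s2 in range(2)]))

# Backward pass (the extra messages variational message passing can send):
beta = [[1.0, 1.0]]
for o in reversed(obs[1:]):
    beta.insert(0, [sum(B[s][s2] * A[s2][o] * beta[0][s2] for s2 in range(2))
                    for s in range(2)])

smoothed = [normalize([a * b for a, b in zip(alpha[t], beta[t])])
            for t in range(len(obs))]

# The filtered belief about t=0 is fixed once obs[0] arrives, but the
# smoothed belief shifts toward state 1 after seeing obs[1] and obs[2]:
print("filtered t=0:", alpha[0])
print("smoothed t=0:", smoothed[0])
```

The filtered belief at the first time step stays at roughly 0.86 for state 0, while the smoothed belief drops well below that once the later observations are folded in, which is exactly the "revising the past" ability that Bayesian filtering gives up.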
corresponds to all the combinations of those x and y positions. So maybe it will be one random variable describing the position, and if we had two values for each of the x position and the y position, then the combinations would be all the values of one times all the values of the other, so four combinations. And this grows exponentially with the number of variables. Suppose we now add a scale variable, and that this scale variable can take two values: then the total number of combinations of those three random variables will be eight, basically all the x and y positions for each of the two scale possibilities, maybe scale one and scale two. This exponential growth becomes problematic: if you don't have the ability to have several observations and several states, you get exponential growth in the number of states and observations you try to model. And this is where the other approach, multi-factor and multi-modality, is really useful to scale to more complicated environments with more states and observations. Excellent. Ali or Jakob, any closing questions or thoughts? Well, you see, I came across your work a few months ago and it got me truly excited, so much that I read all five of your papers. It is the number five, right?
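The combinatorial blow-up described above is easy to make concrete. A small sketch with dSprites-like factor names, using the two-values-per-factor sizes from the discussion (real dSprites factors have more values):

```python
# Joint vs factored state spaces: the joint grows as the product of the
# factor sizes, while each individual factor stays small.
from math import prod

factors = {"x_position": 2, "y_position": 2, "scale": 2}

joint_states = prod(factors.values())   # single joint variable: 2*2*2 = 8
largest_factor = max(factors.values())  # per-factor inference: 2 states each

# Adding one more binary factor doubles the joint state space but leaves
# every individual factor the same size:
factors["shape"] = 2
grown_joint = prod(factors.values())    # 16

print(joint_states, grown_joint, largest_factor)   # 8 16 2
```

This is the scaling argument behind the multi-factor, multi-modality version: inference cost tracks the factor sizes rather than their product.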
I mean, you published five papers up to now. Because, you see, more often than not, people see active inference and the free energy principle as basically a speculative endeavor without much pragmatic value in real applications. So in my opinion your work, as I mentioned, is a very welcome addition to this nascent yet exponentially growing field, and I hope to see more exciting developments in the future for branching time active inference, or possibly the other variants you might come up with. I'll definitely keep following your work from now on, so thanks so much for joining us today. No problem, and thank you for inviting me. I'm really glad I could present here, so thank you for the invitation. Jakob, any final thoughts? Yeah, well, this has been a really great presentation and discussion. Ali also linked your work a couple of months ago, and it also got me very excited for the future of active inference modeling. It's a topic we're discussing quite a lot in the institute, and I think that this approach to reducing the computational cost of performing active inference in more and more complex state spaces is probably the best way to go to really reach adoption of these models in different domains. So yeah, thank you very much for joining today, and I look forward to reading that paper on deep branching time active inference. Well, you're welcome back anytime, and we will certainly be observing. Thank you, bye. Perfect, yeah, great, really good.