Hello everyone, thanks for joining, and welcome to ACTIV-INFLAB livestream number 24.1. Today is June 22nd, 2021, and we're going to be discussing the paper "An Empirical Evaluation of Active Inference in Multi-Armed Bandits" with several of the authors. So thanks everyone for joining, and welcome to ACTIV-INFLAB. We are a participatory online lab that is communicating, learning, and practicing applied active inference. You can find us at the links on this screen. This is a recorded and archived live stream, so please give us feedback so that we can improve our work. All backgrounds and perspectives are welcome here, and we'll be following good video etiquette for live streams. This short link has the schedule of all the live streams we've been doing and will do for 2021. Today's session, 24.1, is the middle of a three-part series on this paper about multi-armed bandits; the 24.0 video gave some context and background on the paper we're discussing. Today we're here with three of the authors, so thanks so much to all of you who are joining, because we'll have a lot to discuss and learn about.

In today's discussion for 24.1 we're going to start with introductions, then have a presentation by the authors, and then open it up for discussion. So if you're here on the video call, or watching live in the live chat, feel free to ask any question and we can go wherever people are interested in going. That said, here we are at the introductions. We'll just go around, introduce ourselves, and say hello; and especially for the authors joining for the first time, it would be awesome to hear anything you want to say, and we'll ask you questions as well. I'm Daniel, a postdoctoral researcher in California, and I'll pass to Dave.

I'm in the tropical mountainous rainforest 120 miles north of Manila. My background is in cybernetic learning theory, general psychology, and machine translation, but not much math, so I flounder with much of the active inference. We'll go to Sarah and then continue on.

I'm Sarah, a postdoc in Dresden together with Dimitri. My original background is physics and biophysics, but I wrote my dissertation essentially about active inference, especially applied to habit learning. And that's it for me; moving on to Dimitri.

Should we let Hrvoje introduce himself? I'm not sure if he wants to do that. Okay, I can also introduce him briefly. Hrvoje is our colleague, previously at UCL; he worked at the Max Planck UCL Centre for Computational Psychiatry and Ageing Research, and he has since moved to industry, working at Secondmind on applied reinforcement learning. So he's our expert on the other side of this work, the part that isn't active inference. I'm also a postdoc at the Technical University of Dresden; both Sarah and I are at the Chair of Neuroimaging, which is headed by Stefan Kiebel. We've been involved with active inference for a while now, since probably 2015 or 2016, and have been applying it to various questions: learning, human behavior, cognitive control, decision-making in dynamic environments, and similar. So, should I switch to the slides now?
Sure, we can go to the slides, or first let me ask one general question to any of the authors who want to respond. Were you interested in active inference and then looking for domains to apply it in, or were you interested in a domain and then found active inference as a way to integrate what you were working on?

My background is also in physics, complex systems, and computational neuroscience, and as physicists we like to think about unifying theories. In this sense active inference has the appeal that it can connect quite distinct ways of thinking about decision-making: thermodynamics, stochastic dynamics, very different areas of understanding dynamical systems and control. That would be the appeal, and definitely also as a tool to apply to decision-making and to understanding human behavior. That is where this all started: what new things can we learn using this approach?

And Sarah, any thoughts, especially on the biology side? We hear a lot about the math and the physics converging toward active inference, but it's also cool to hear about biology; that was my background as well.

My background is really in biophysics, so I'm only tangentially in biology, but what I also like is that it connects to a general information-processing scheme in the brain. In my master's thesis I was working way, way down the abstraction hierarchy, on spiking neural networks and receptor dynamics, and in that area I found it easy to not see the forest for the trees. Connecting upwards, asking what general information processing is going on, and then connecting downwards again is what also attracted me to active inference.

Awesome. As usual, people will be joining or leaving. I'll unshare this screen, and then Dimitri, if you want to jump into the presentation, that would be awesome.

Yes, okay. You can see the slides, I hope? (Yep, looks great, and I'll crop everything, so go for it.) Excellent. Let me start with a bit of motivation for this work. As you've gathered, our involvement with active inference came from the cognitive neuroscience direction, and originally we were not so interested in the technical or machine learning side of it. However, if one thinks about multi-armed bandits, a very general problem that generalizes resource allocation problems, one realizes that most behavioral experiments can be cast into this framework. One sees this in a range of experimental cognitive neuroscience domains, such as decision-making in dynamic environments, value-based decision-making, structure learning, and similar. One can also think about attention as a resource allocation problem. There are many other domains that may not explicitly talk about multi-armed bandits as such, but they can also be cast into this general framework. For example, in decision-making in dynamic environments, one of the best-known tasks is probably the reversal learning task, which has been used in hundreds of papers, and this is also what initially motivated this paper: simply understanding, if I apply active inference to reversal learning tasks, how does it compare to other decision-making approaches one could apply?
Besides this cognitive neuroscience direction, multi-armed bandits have a range of industrial applications. They are really used everywhere, from marketing to finance: recommendation systems, trading applications, even many optimization problems in deep learning. There are lots of papers actually showing how you can speed up learning by finding better sets of examples for deep learning systems. So if active inference turns out to be useful for multi-armed bandits and to add something new to the existing work, that would allow it to bridge into machine learning and find new applications there.

Multi-armed bandits come in a range of different formulations. Today we will talk about stationary bandits and dynamic bandits, which just means that reward probabilities are either fixed over time or change in different ways. People also discuss adversarial bandits, risk-aware bandits, contextual bandits; one can also talk about non-Markovian bandits, where rewards depend on a sequence of selections or there is some memory dependence in the system, and so on. There is really a range of different structural definitions of multi-armed bandit problems, which require correspondingly different ways of thinking about the problem. It would be interesting in the future to expand what we did here to these other domains and see whether the results generalize to other definitions of the multi-armed bandit problem.

On the structure of the slides: we have two parts, one about stationary bandits, and then we will go to switching bandits. Given that we have two hours, I hope that will be enough time, but if not we can continue next session with switching bandits, for example. Let's see how this goes.

So, what are stationary bandits? The problem is defined as follows. On any trial, an agent makes a choice between k arms. Choice outcomes are binary variables, so we focus here on Bernoulli bandits. Outcomes could also be drawn from other distributions; for example, people talk about Gaussian bandits, depending on how the rewards are generated, but we will work with Bernoulli bandits here. We fix the reward probabilities in a very specific way: there is exactly one arm with reward probability p_max = 1/2 + epsilon, with epsilon > 0, and all other arms have reward probability fixed at 1/2. This lets us control the difficulty of the task: the smaller epsilon is, the closer to zero, the more difficult it is to distinguish the best arm from the others, and the more samples one needs to draw to detect the difference. Besides this best-arm advantage epsilon, the number of arms k also increases the difficulty: the more arms you have, the more time you need to figure out which arm is the correct one to play. As an illustration, we have a four-armed bandit here: whenever the agent pulls one of the arms it gets either zero or one, which we can think of as a reward or the absence of a reward.
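To make the setup concrete, here is a minimal sketch of the stationary Bernoulli bandit just described. This is illustrative Python of my own, not the authors' code; the class and parameter names are assumptions.

```python
import numpy as np

class StationaryBernoulliBandit:
    """K-armed Bernoulli bandit: one arm pays with p = 1/2 + eps, the rest with p = 1/2."""
    def __init__(self, k=4, eps=0.1, rng=None):
        self.rng = rng or np.random.default_rng()
        self.p = np.full(k, 0.5)
        self.best = int(self.rng.integers(k))   # index of the single best arm
        self.p[self.best] = 0.5 + eps

    def pull(self, arm):
        # outcome is 1 (reward) with probability p[arm], else 0 (no reward)
        return int(self.rng.random() < self.p[arm])
```

Smaller `eps` and larger `k` make the task harder, exactly as described above.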
For beliefs about the reward probabilities we will assume beta distributions: we use a beta distribution to model the representation of the reward probability associated with each arm, and we will use Bayesian belief updating for all action selection algorithms. This also limits the range of algorithms we compare here; I will explain a bit later why we focus on these. To give a visual example: the agent starts with flat beliefs about the reward probabilities associated with each arm, a uniform distribution, which is a special case of the beta distribution. Whenever it selects one arm it gets either one or zero as an outcome, and it updates its beliefs about that arm's reward probability in one direction or the other. Here it is sampling one arm more often, so it has an increased belief that a higher reward probability is associated with that arm.

Formally, this is implemented as a very simple generative model. With each of the k arms we assume there is some associated unknown reward probability theta_k, a number between zero and one, and choice outcomes are either zero or one, so they are binary. This constrains the observation likelihood to a Bernoulli distribution, a product of Bernoulli distributions depending on which arm is selected. The term below is our prior belief about the reward probabilities, which we assume is a product of beta distributions where alpha_0 and beta_0 are fixed to one, corresponding to the uniform distribution. One could start with other values, but this choice makes sense: it reflects no a priori knowledge about the reward probabilities.

Now, given some choice at trial t, we just apply Bayes' rule and go from prior beliefs to posterior beliefs. The nice thing about the stationary case is that if our priors are beta distributed, our posteriors will also be beta distributed, so one only needs to track how the parameters alpha and beta of the beta distribution are updated, and this works simply by counting: alpha counts how many times you observe one as an outcome, and beta counts how many times you observe zero. Because of this, inference in the stationary case is exact: both prior and posterior belong to the same distribution family (a conjugate prior setup), so you can track the updating efficiently.
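A minimal sketch of this counting update, starting from the same flat Beta(1, 1) initialization; the function name is mine, not from the paper's code.

```python
import numpy as np

def update_beliefs(alpha, beta, arm, outcome):
    """Exact conjugate update for Beta(alpha_k, beta_k) beliefs over reward probabilities.

    alpha counts observed 1s, beta counts observed 0s; only the pulled arm changes.
    """
    alpha, beta = alpha.copy(), beta.copy()
    alpha[arm] += outcome
    beta[arm] += 1 - outcome
    return alpha, beta

# start from flat (uniform) beliefs: Beta(1, 1) on every arm
k = 4
alpha, beta = np.ones(k), np.ones(k)
```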
There is another representation of the update rules one can use here. If we express the update in terms of the expected outcome and the scale parameter nu, we see that the expectations are updated in the form of a delta learning rule, where the learning rate decreases over time: nu increases over time, since whenever you select a specific arm its nu increases by one, and this means the learning rate decreases. The more you sample from one arm, the less you will update your beliefs about it. That is practically what is happening here; very simple.

This is everything we need to know about the, let's call it, perceptual part of the generative model, that is, how we update beliefs given outcomes. Now we have to introduce how we select actions based on those beliefs, and for this we talk about action selection algorithms. There are lots of examples in the literature. One of the oldest is probably epsilon-greedy; you also mentioned it during the 24.0 presentation. People also talk about UCB, the upper confidence bound, also one of the oldest; KL-UCB, an extension that uses the KL divergence as an estimate of the exploration bound; Thompson sampling; et cetera. Here we will focus mostly on three: we will use optimistic Thompson sampling as one comparison and the Bayesian upper confidence bound as another. First, both of these algorithms have been extensively analyzed in the literature, and people have found that in many examples they work much better than, let's say, non-Bayesian approaches. Second, we can use the same update rules for all three algorithms, because they are all Bayesian; they are algorithms for Bayesian bandits. Our motivation is to compare action selection principles on the basis of the action selection algorithm alone, not on the basis of potentially different ways of updating beliefs given the history of observations. That was the motivation, and it also simplifies the comparison; one could add to this list at least ten other ways of doing action selection and decision-making in multi-armed bandits.

Just as a historically important example, here is the upper confidence bound. Although I will not directly compare against it, it is relevant for understanding what happens in active inference later. The form shown here corresponds to a version adapted specifically to Bernoulli bandits; some people may be more familiar with the version derived for Gaussian bandits, and in the paper cited here one can see the different derivations and examples of how this works. What is important to notice is that the first term is just the expected probability of reward on arm k, and the remaining terms correspond to an exploration bonus, or bound. Typical for UCB is that this bound increases logarithmically with the number of trials: the longer you go without selecting one arm, the more you are pushed to select it again at some point in the future. So the bonus necessarily increases over time if you are not interacting with a specific arm.
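As one concrete member of this family, here is the classical UCB1 index; the Bernoulli-adapted variant on the slide differs in its constants, so treat this as a sketch of the general shape (empirical mean plus a log(t) bonus), not the paper's exact rule.

```python
import numpy as np

def ucb1_arm(successes, pulls, t):
    """Classical UCB1: empirical mean plus a bonus that grows with log(t)
    and shrinks with the number of pulls of each arm."""
    mean = successes / np.maximum(pulls, 1)
    bonus = np.sqrt(2.0 * np.log(max(t, 1)) / np.maximum(pulls, 1))
    index = np.where(pulls == 0, np.inf, mean + bonus)  # force one pull of every arm first
    return int(np.argmax(index))
```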
Although the algorithm is very simple, it has nice theoretical guarantees: it is an efficient algorithm that converges in the limit of infinitely many samples. However, here we will use the Bayesian variant of the upper confidence bound, which selects arms based on upper percentiles of the cumulative distribution function. As time progresses, one estimates an extreme value of the belief distribution, which in this case is just a beta distribution, and one gets this extreme value by solving an equation corresponding to the inverse regularized incomplete beta function. Basically, the more extreme the values your beliefs still contain, the more likely you are to select that arm: it is more relevant to select it because you still consider such a high reward probability possible. This algorithm has a couple of parameters, but what the authors show in their paper is that one gets very good results just by fixing c to zero, so that term simply becomes one. We just follow their advice here; we did not try to analyze other values of these parameters in this case.

Thompson sampling is again one of the classical algorithms, probably the first attempt at Bayesian bandits, at Bayesian decision-making here, and it is also extremely simple: given beliefs about the reward probability of each arm, you draw one sample per arm (so with ten arms you would draw ten samples), and you select the arm that gives you the largest value. A variant shown over the last five to ten years to work slightly better is optimistic Thompson sampling, where one constrains the samples to values larger than the expectation: one first draws a sample for each arm, and if the sample is larger than the expected reward probability, one keeps it; otherwise one uses the expected value as the reward probability associated with that arm. Then, again, one selects the arm with the maximum value across arms. This is a very stochastic approach: the exploration bonus of this algorithm comes entirely from random sampling from a probability distribution. The broader your beliefs about an arm, the more likely you are to draw a large value for it, and hence the more likely you are to select and explore it across rounds. Exploration here is driven completely by the noisiness of the sampling process itself.
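Hedged sketches of both comparison rules as I understand them from the talk. SciPy's `beta.ppf` is the inverse regularized incomplete beta function mentioned above; the quantile level 1 - 1/t reflects fixing c to zero as described, and the details should be treated as assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def bayes_ucb_arm(alpha, beta, t):
    """Bayesian UCB with c = 0: pick the arm whose posterior Beta quantile
    at level 1 - 1/t is largest."""
    q = beta_dist.ppf(1.0 - 1.0 / max(t, 2), alpha, beta)  # max(t, 2) avoids a degenerate level at t = 1
    return int(np.argmax(q))

def optimistic_thompson_arm(alpha, beta, rng):
    """Optimistic Thompson sampling: sample each posterior, but never go below
    the posterior mean; then pick the arm with the largest value."""
    mean = alpha / (alpha + beta)
    sample = rng.beta(alpha, beta)
    return int(np.argmax(np.maximum(sample, mean)))
```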
Now we come to the active inference version of this. Simplifying relative to how active inference is normally used: first, we use rolling behavioral policies, which means the agent does not track the history of the actions it performed; it just re-evaluates one-step policies on every trial, and in this case a behavioral policy corresponds to a single choice. In the type of bandit problems we are analyzing here, the agent cannot change anything in the environment, hence planning is completely irrelevant; there is no way to position yourself better in state space over time such that you would need to plan. This means that evaluating a single-choice policy one time step into the future is sufficient to make good decisions.

We base action selection on the expected free energy, in a form that decouples into a risk term and an ambiguity term; equivalently, we can think of it as the expected value of different arms plus the expected information gain, that is, how much information we can extract from each arm if we select it. Assume for the moment that we know how to compute the expected free energy; I will go through the details on the next slides. Normally the posterior over policies is estimated as a mixture of the expected free energy (an expectation about future behavior) and a second part, the free energy over past outcomes. However, because we have rolling policies, that second term is constant, the same for each policy, so the posterior over policies (or actions) just corresponds to a softmax over the expected free energy. Normally one thinks of choice selection as sampling from this posterior; here, however, we are only interested in optimal choices, so for us gamma is effectively infinite and we simply select the action that minimizes expected free energy. Gamma is a useful parameter to have if you need to fit a model to behavior, but for this practical application there is no gain in adding another source of noise.

So how do we compute the expected free energy? That is actually quite simple; there are just a few terms. One is the posterior beliefs over reward probabilities, the Q term, which is just a product of beta distributions. We have the marginal likelihood, the probability of observing an outcome given an action, obtained by marginalizing the likelihood over the current posterior beliefs. And the final ingredient is the prior preference over outcomes. This is really easy to parameterize because we work with binary outcomes: we can define a single parameter lambda, where the higher lambda is, the stronger the agent's preference for observing ones relative to zeros. This parameter also plays the role of regularizing the amount of exploration the agent does: the larger lambda is, the more the focus shifts to exploitation, to making selections based on expected value instead of expected information. The remaining term, the ambiguity, is just the expectation over the entropy of the outcomes, which also has a relatively simple form here because of the Bernoulli likelihood.

Without going into all the details, this is what the expected free energy looks like. The first term corresponds to the negative of the expected reward; because we are minimizing expected free energy, this effectively means we are maximizing expected reward. The second term is a rather complex set of expressions that estimates the expected information gain. One cannot really read off what is going on there, since we have logarithms of expected reward probabilities but also digamma functions of the parameters; so instead we can simplify the expression by approximating the information gain part, to get a better understanding of how the expected free energy scales with repeated choices.
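To make the structure of this objective concrete, here is a compact restatement of the single-step expected free energy just described, in my own notation and schematically; the exact digamma-based expression is in the paper.

```latex
% Risk--ambiguity decomposition of the expected free energy for arm a:
G(a) = \underbrace{D_{\mathrm{KL}}\!\left[\,Q(o \mid a)\,\middle\|\,P(o)\,\right]}_{\text{risk}}
     + \underbrace{\mathbb{E}_{Q(\theta_a)}\!\left[\mathcal{H}\!\left[p(o \mid \theta_a)\right]\right]}_{\text{ambiguity}}

% With the outcome preference parameterized as P(o) \propto e^{\lambda o}, this
% rearranges (up to a constant) to "negative expected reward minus expected
% information gain":
G(a) = -\lambda\,\mathbb{E}_{Q}[\theta_a]
     - \mathbb{E}_{Q(o \mid a)}\!\left[ D_{\mathrm{KL}}\!\left[\,Q(\theta_a \mid o)\,\middle\|\,Q(\theta_a)\,\right] \right]
     + \mathrm{const.}
```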
What one ends up with is a very simple form: the exploration term just corresponds to one over twice the number of times one has selected each arm, 1/(2 nu_k). This is quite similar to what we had in the UCB algorithm, where the exploration term was also divided by something proportional to the number of selections, but without the expansion of the exploration bonus over time: there is no logarithm of t. This has consequences for the efficiency of active inference, which I will show in a moment. The way we arrive at this simple form is by using the large-x approximation of the digamma function, psi(x) ≈ ln(x) - 1/(2x); as long as each arm has been sampled a sufficient number of times, this is a very accurate approximation, and indeed we see that the approximate and exact algorithms behave very similarly, so it is a reasonably good approximation.

Actually, Daniel, I think you asked a question about the scaling properties of this. One motivation for introducing the approximation is that the algorithm becomes much simpler, so it scales better and you can run it on many more arms more easily. The advantage is not huge, though; you gain maybe 10% or 20% in computation time, so it is not something that makes the exact form useless, at least not in this example, because the problem itself is simple. But it helps us build intuition about what is happening, at least about how the exploration bonus changes with time.

Before I show comparisons of the different algorithms, I want to introduce some concepts from the machine learning analysis of multi-armed bandits that people use to rate and compare how good algorithms are at solving this task. This is done by defining a regret. The regret is simply the difference between the value of what you did at trial t and the value of the best possible choice at that trial; the assumption is that there is some oracle solving the task that knows exactly what the best choice is on every trial. Normally people consider two quantities: the cumulative regret, which is just the sum of the regret from the first trial to the current one, and the regret rate, which is the cumulative regret averaged over time. We will use both here, to visualize different aspects of the algorithms. In the stationary case there is an important result: for all good algorithms one expects the regret rate to go to zero over time, that is, as the number of trials goes to infinity you should always be able to make the right choice. If that is the case, one can show that such algorithms, called asymptotically efficient, have cumulative regret that scales for large t as a constant times the logarithm of t. So one important property in the stationary case is that an algorithm's cumulative regret should scale at most like log t in the large-t limit (it can also grow more slowly); if it instead grows linearly with t, asymptotic efficiency no longer holds.
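Putting the pieces together, here is a hedged sketch of the approximate selection rule as I read it from the talk (negative expected reward scaled by the preference lambda, minus the 1/(2 nu) exploration bonus), together with the cumulative-regret bookkeeping just defined. It reuses the `StationaryBernoulliBandit` sketch from earlier; all names are illustrative.

```python
import numpy as np

def approx_efe_arm(alpha, beta, lam):
    """Approximate expected free energy: -lam * expected reward minus an
    exploration bonus of 1/(2 * nu_k); pick the arm that minimizes it."""
    nu = alpha + beta
    mu = alpha / nu
    G = -lam * mu - 1.0 / (2.0 * nu)
    return int(np.argmin(G))

def run(env, select, trials=1000):
    """Simulate one run and return the cumulative-regret trajectory."""
    k = len(env.p)
    alpha, beta = np.ones(k), np.ones(k)
    regret = np.zeros(trials)
    best_p = env.p.max()
    for t in range(trials):
        arm = select(alpha, beta)
        o = env.pull(arm)
        alpha[arm] += o
        beta[arm] += 1 - o
        regret[t] = best_p - env.p[arm]   # per-trial (expected) regret vs. the oracle
    return np.cumsum(regret)              # cumulative regret; divide by t+1 for the rate

# example: approximate active inference with lambda = 0.1 on a 4-armed task
env = StationaryBernoulliBandit(k=4, eps=0.1)
cum_regret = run(env, lambda a, b: approx_efe_arm(a, b, lam=0.1))
```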
Now to the comparison. We will look at how optimistic Thompson sampling, the Bayesian upper confidence bound, the exact active inference algorithm, and the approximate one compare to each other. I will start by trying to see what a good value of the lambda parameter would be in different settings, and also by giving an initial comparison of the exact and approximate algorithms. What I am showing here is the regret rate at different snapshots: the dotted lines are after 100 trials, the dash-dotted lines after 1,000 trials, and the solid lines after 10,000 trials. What one notices is that there is an obvious minimum at a different value of lambda (the preference parameter) for each snapshot, and that the more trials you do, the smaller lambda should be. This is not so nice and it has consequences. However, we can just pick some value; for example, the purple dotted line at around lambda = 0.1 seems close to the minimum in most of these cases. We do not want a different value for each setting, because that is just not practically feasible; you want a general algorithm that can be applied to many different situations.

When we compare Bayesian UCB, optimistic Thompson sampling, and approximate active inference (I am excluding the exact algorithm here because it behaves essentially the same for this parameter value) in terms of cumulative regret, we see that approximate active inference is not asymptotically efficient: its curve keeps diverging over time, whereas the green and yellow curves flatten out after some time and approach the slope of the dotted line, which shows the asymptotic limit you should see for large t.

Why is this happening? Because the exploration bound only ever decreases over time, the active inference algorithm gets stuck in a wrong solution with some probability, which depends on the difficulty of the task: for larger epsilon and a smaller number of arms, there is a higher probability of getting stuck. One can see this in a snapshot of different runs. This is a distribution of the cumulative regret; I am plotting the logarithm, a histogram over 10,000 runs of the different algorithms, showing at what value of the log cumulative regret they end up. For the active inference algorithm you see a spike in the tail of the distribution, at a value corresponding to just making random choices. That means the algorithm was constantly selecting a wrong arm; it never converged to the correct solution, getting stuck in a wrong solution because the exploration bound was reduced too soon.

This is, let's say, a limitation for applying active inference to stationary problems, because this is not a feature you would want in an algorithm. An example would be optimization problems where you want to find the best solution over a set of parameters: Bayesian optimization, finding the minimum of some unknown function, uses Thompson sampling, which seems to be a very efficient way of doing it.
However, if you applied active inference-based arm selection, or sample selection, to such a situation, there is a chance the algorithm fails, getting stuck in the wrong optimum; it does not do sufficient exploration. So this potentially requires some adjustments to how actions are selected, at least in the stationary case.

Okay, strange, I got some strange slide; just to go a bit into what the short-term behavior looks like. From the perspective of cognitive neuroscience, of human decision-making, you do not really care about this asymptotic limit, because you do not usually expect people either to be in a stationary environment (things always change) or to repeat actions that many times. What I am showing now is a very reduced example: with only three arms and different epsilon values (task difficulties), what is the probability, for each algorithm, of actually selecting the optimal arm? As one can see, for roughly the first 25 trials Bayesian UCB has the highest probability of selecting the best arm, but there is a range of trials, from about 50 to maybe 1,000, where the active inference-based algorithm takes over. Because of the information gain term, active inference is more efficient at targeting the arms that will give the most information, so it can recover the best arm with the highest probability in this intermediate interval. However, as you extend beyond about 1,000 trials, you see that this probability gets stuck; it never converges to one, unlike the others. This is especially evident for the difficult problem, with small epsilon. And this is, in a way, the explanation of what happens: although the algorithm reaches good solutions with higher probability than the other algorithms, there are still lots of runs in these simulations that get stuck in a wrong solution and cannot get out of it.

This raises the question of what one can do to fix active inference here. How could the generative model be changed to support an exploration bound that increases over time? Or, instead of computing expectations of the expected free energy, one could draw samples from the posterior and compute the two terms, the information gain and so on, from those samples, similar to Thompson sampling. So there are different ways one can think of adding an exploration bonus, and a third option would be to introduce learning of the lambda parameter, so that lambda itself goes to zero over time under specific rules. Currently, however, we do not have a very good solution, so we leave it at this: it is an obvious limitation of directly applying active inference to this type of problem.

Are there any questions at this point? This is about the halfway mark, where we switch to the other part.

If anybody has any thoughts, definitely; I could ask some things, and we'll also see if people in the live chat want to post questions. But how much longer of a presentation did you estimate you have, so that we can also address some general points?

Well, I don't think there is more than maybe 20 or 30 minutes max; I didn't really gauge it, but there are fewer slides in the second part.
Would it make more sense to go through it quickly here, or to do it in the coming weeks?

As you wish; I think both are fine for me. Maybe an additional 15 minutes?

Okay, so we'll complete the presentation, compile our questions in the live chat and here on the video chat, have the remaining discussion after your presentation today, and then have more open discussion next week. So please continue.

Okay. Now we go to the dynamic, non-stationary problem. The setup on any trial is very similar: an agent makes a choice between k arms, and again we focus on the Bernoulli bandit, so outcomes are just binary variables. What happens here, however, is that the reward probabilities associated with each arm change over time. We differentiate between switching bandits, where we assume the changes happen at the same time on all arms, and restless bandits. Switching bandits are also called piecewise stationary: there is a period where nothing changes, and then there is a single moment in time when the reward probabilities on all arms change. In the other variant of dynamic bandits, restless bandits, changes happen independently on each arm and continuously over time, for example following a random walk. I will only talk about switching bandits, but from some testing I did, all the results also generalize to the restless case.

Besides the number of arms k and the best-arm advantage epsilon, we now have another task difficulty parameter: the rate of change, or change probability. The more often things change, the harder the task, especially if you have many arms. We will consider switching bandits with fixed difficulty, a direct extrapolation of the stationary case, where we introduce changes in which arm is associated with the maximum reward: we always have the same set of reward probabilities on the arms; it is just that from time to time the optimal arm changes, with probability rho. In this case three parameters define the task difficulty. We can also think about switching bandits with varying difficulty, which means that with probability 1 - rho the reward probabilities associated with each arm remain fixed on the next trial, and with probability rho they are all resampled from a uniform distribution, that is, a beta distribution with both parameters equal to one. (The slide says the opposite; 1 - rho should be on the "stay fixed" branch.) In this task variant we have just k and rho as difficulty parameters; epsilon disappears, since you are effectively averaging over epsilon within the task.

Just as an example, which we will not discuss further, a restless bandit setup can look like this: one assumes, for instance, that the logit transform of the reward probability follows a random walk, a Brownian motion in the logit space of the reward probability. This would potentially require changing the generative model I will introduce, but it is not necessary; one gets quite similar solutions and behavior.

Coming back to the earlier example: if we have a four-armed bandit, the setup is exactly the same, and in addition we assume the agent has access to the underlying probability of change; this is not something unknown, which simplifies the learning rules and belief-update equations.
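A minimal sketch of the fixed-difficulty switching bandit just described; again illustrative Python of my own, not the authors' code.

```python
import numpy as np

class SwitchingBandit:
    """Fixed-difficulty switching bandit: with probability rho per trial,
    the identity of the single best arm (p = 1/2 + eps) is redrawn."""
    def __init__(self, k=4, eps=0.1, rho=0.01, rng=None):
        self.rng = rng or np.random.default_rng()
        self.k, self.eps, self.rho = k, eps, rho
        self.best = int(self.rng.integers(k))

    def pull(self, arm):
        if self.rng.random() < self.rho:          # change point: relocate the best arm
            self.best = int(self.rng.integers(self.k))
        p = 0.5 + (self.eps if arm == self.best else 0.0)
        return int(self.rng.random() < p)
```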
However, one can extend what I introduce today to setups where the probability of change also has to be learned, or where the probability of change itself changes over time, so that you have to track how often reward probabilities change on different arms; that would be an example of decision-making in volatile environments.

To visualize the algorithm: practically the only difference is that there is now an effective forgetting of what the agent learned before. One can see this, for example, in the square on the left: as time evolves and the agent selects other arms, the belief about the reward probability associated with the leftmost arm decays back toward the uniform distribution. The agent is forgetting the information and expectations it had about this arm; it assumes that with time, its beliefs about the reward probability should revert to the uniform distribution.

The algorithm is a really straightforward extension of what I already described. The generative model is now slightly more complex. Besides the observation likelihood, which remains a Bernoulli distribution, we now have a state transition term that tells us how reward probabilities change over time. It says that if there is a change, the reward probabilities become independent of their previous values and are drawn from a uniform distribution, the prior belief; and if there is no change, the transition corresponds to a delta function, meaning the reward probabilities stay unchanged from trial t-1 to trial t. Finally, on each trial we have the same prior over the probability of change, again a Bernoulli distribution with probability rho, which we assume here is a parameter known to the agent.

The problem in the dynamic case is that although you can still apply Bayes' rule and compute the posterior, both for the change indicator j_t and for the marginal posterior over reward probabilities, the exact form of the posterior no longer belongs to a conjugate family: it is not a simple beta distribution anymore but a mixture of beta distributions, and as you evolve into the future it becomes a larger and larger mixture, which is practically intractable if you extend it over any number of trials. Because of that, we want a more efficient algorithm, so we introduce a mean-field approximation: we say that our probability distribution can be described as a product of beta distributions and a categorical distribution that gives the probability of change. What we use here is not standard variational inference, where you would have to compute gradients of the variational free energy to find the optimum; it is simpler, because you need only one step to update the parameters, both for the change probability and for the reward probabilities. This makes it not quite optimal (there are better solutions), but it is very efficient.
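Schematically, in my own notation (the paper's exact formulation may differ in detail), the transition structure just described can be written as follows, with j_t the binary change indicator.

```latex
% Observation model (unchanged): Bernoulli outcomes for the chosen arm a_t
p(o_t \mid \theta_t, a_t) = \theta_{t,a_t}^{\,o_t}\,(1 - \theta_{t,a_t})^{\,1 - o_t}

% Transition: on a change (j_t = 1) the reward probabilities are redrawn from
% the uniform prior; otherwise (j_t = 0) they stay exactly where they were
p(\theta_t \mid \theta_{t-1}, j_t) = j_t \prod_k \mathcal{B}(\theta_{t,k};\, 1, 1)
  \;+\; (1 - j_t)\,\delta(\theta_t - \theta_{t-1})

% Prior over the change indicator, with known change probability \rho
p(j_t = 1) = \rho
```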
And in the end, for Bernoulli bandits, there will not be much difference: you can use better algorithms, but this gives you only a marginal advantage in the long run, simply because the problem is very noisy and it is genuinely difficult to figure out the correct choice.

This variational SMiLe rule was introduced by Liakoni and colleagues quite recently, in 2021; they provide a more detailed justification for what I am paraphrasing here. Basically, we can associate the marginal over the change probability with the exact posterior marginal, because that can be computed analytically, and then we use this as a known belief about the change probability to estimate the reward probabilities, by averaging over the different prior beliefs about the reward probabilities in log space rather than in probability space. What this results in is a very simple set of update rules. On the left we show how omega_t is updated; this corresponds to forming beliefs about the change probability based on the Bayes factor shown here, which is just the likelihood ratio between observing the outcome given that a change did not happen and given that a change did happen at the current trial. We then use that estimate to update the beliefs about the reward probabilities, depending on whether the arm was selected or not. The omega term acts as a forgetting rate: the larger omega is (the closer to one, that is, the larger the inferred probability that a change occurred), the more you revert to the initial prior parameters alpha_0 and beta_0, rather than carrying over your beliefs from the previous trial. There is also an important limit: if the agent believes there is no change in the environment, you recover the exact inference and update rules we had in the stationary case. That is another nice thing about this algorithm: it generalizes to any knowledge about the change probability.

The action selection algorithms do not change: the learning rules change, but we still select actions in the same way. For Thompson sampling, we just sample from the posterior beliefs. For the Bayesian upper confidence bound it is slightly different: we use the mixture over possible parameter values to estimate the inverse, because the predictive posterior, being a mixture of beta distributions, is difficult to invert directly, so this is an approximation one can use for Bayesian UCB. And for action selection with the approximate expected free energy, one gets the very same set of equations, because the change probability rho just drops out; it can safely be ignored there.
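Going back to the update rule for a moment, here is a hedged sketch of how the omega-based forgetting described above could look in code. The exact Bayes-factor expression in the paper may differ; the function name and the flat-prior predictive value of 1/2 (the Beta(1, 1) predictive for a binary outcome) are my assumptions.

```python
import numpy as np

def smile_update(alpha, beta, alpha0, beta0, arm, o, rho):
    """One-step variational update with forgetting, in the spirit of the
    variational SMiLe-style rule described above (a sketch, not the paper's code).

    omega is the inferred probability that a change just occurred, computed from
    the Bayes factor between 'change' and 'no change'; it acts as a forgetting
    rate pulling every arm's parameters back toward the prior (alpha0, beta0)."""
    mu = alpha[arm] / (alpha[arm] + beta[arm])
    lik_stay = mu if o == 1 else 1.0 - mu      # predictive under 'no change'
    lik_change = 0.5                           # predictive under a fresh uniform prior
    omega = rho * lik_change / (rho * lik_change + (1.0 - rho) * lik_stay)

    alpha = (1.0 - omega) * alpha + omega * alpha0   # forget toward the prior...
    beta = (1.0 - omega) * beta + omega * beta0
    alpha[arm] += o                                  # ...then add evidence for the pulled arm
    beta[arm] += 1 - o
    return alpha, beta
```

Note the stated limit holds: with omega going to zero (no inferred change), this reduces to the exact stationary counting update.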
Now, if we redo the comparison of the exact and approximate active inference algorithms, we see a slightly different picture from before. First, as you increase the number of trials and compute the regret rate at each snapshot, the regret rate does not change: the algorithm converges very quickly to a specific regret rate, and there is a clear, stable minimum over lambda independent of the number of trials you expose the algorithm to. And again, we see very similar behavior between the exact and approximate active inference algorithms. For this example I just fixed lambda to 0.5, which seems to be a reasonable parameter value for many of the situations we look at here.

That was for 10 arms. If we go to 80 arms, the picture changes slightly: it seems that the optimal lambda, although it does not change with t, does change with the number of arms, and obviously with rho and epsilon and all the other parameters. This makes things a bit difficult, and it suggests that it is important to find a learning rule, a self-optimizing way of estimating lambda. There is actually some work in the active inference literature that potentially has a solution for this; we just have to test it out at some point. Still, we can use the same value throughout; it does not influence the performance that much later on.

What we see when we compare approximate active inference with Bayesian UCB and optimistic Thompson sampling is that for a range of settings with a fixed reward-probability advantage of the best arm, active inference simply performs better after about a thousand trials or so, compared to the other two algorithms. This is shown for different change probabilities: for example, rho = 0.005 corresponds to a change every 200 trials on average, the next setting to every 100 trials, and the last to every 25 trials, which is the most difficult scenario. One sees that in the more difficult scenarios the differences disappear, so you lose the advantage. Similarly, if we fix the number of arms, for example to 40, and vary the epsilon parameter, a similar trend is visible: the easier the task, the bigger the advantage you get in the non-stationary scenario relative to the other algorithms. Of course, one would eventually destroy this advantage if changes became very slow, say every 10,000 trials, where you are approximately back in the stationary world.

That was switching bandits with fixed difficulty; a similar analysis can be done for switching bandits with varying difficulty. Interestingly, because epsilon drops out, one sees here that for the exact active inference algorithm there is a clear minimum that seems largely independent of the remaining parameters, rho and k. So we can fix lambda to 0.25 for the exact active inference algorithm, and keep lambda = 0.5 for the approximate one. What happens in the varying-difficulty case is that there seems to be an advantage to using the exact active inference decision-making algorithm: it outperforms the approximate one quite clearly in these settings, and it does not require fine-tuning in this sense; across the range of problems you get much better performance with a single choice of lambda. And again, interestingly, the more difficult the task in this regime, the smaller the differences between the algorithms, and the Bayesian UCB algorithm becomes quite efficient in these settings too. One also sees here that Bayesian UCB performs quite well, interestingly much better than optimistic Thompson sampling,
which, for me at least, was surprising; I have not found any paper previously showing something like that, so this is probably also a somewhat new result for the machine learning community.

To conclude: active inference does not result in asymptotically efficient decision-making; additional work is required to establish theoretical bounds on the regret and to derive improved algorithms. In non-stationary bandits, however, we see much better performance in comparison to the other algorithms, especially noticeable in these more difficult, changing settings. On the list of next steps: introduce learning for the lambda parameter; establish real theoretical bounds on the cumulative regret, that is, what one can expect to see given different choices of expected-free-energy-based action selection, which would hopefully also improve behavior in the stationary case; and potentially apply this to some real-world examples in the machine learning field, say recommendation systems or optimization problems, just to see how it performs in such scenarios.

With that, I would just like to thank all the collaborators on the project and the people who helped me with advice along the way. You can also find the slides here, and the code is available on my GitHub page; I would just not recommend using it in the next two weeks, because the paper is under revision and I'm constantly breaking things.

Awesome, thank you. You can unshare, and we'll return to discussion for the last 45 minutes or so. Thanks for that awesome presentation; it's always good to see some of the figures from the paper multiple times, and there were also some different figures and different views here. Again, anyone in the live chat is welcome to ask a question, as is anyone here on the Jitsi call. I'll take a question from the live chat first, and then if you're here on the Jitsi, please feel free to raise your hand.

From the live chat: since the bandit problem here is not Markovian, does this mean that we only need to consider the current time to calculate the expected free energy? And why is it not Markovian?

Well, let's first clarify what makes a problem or situation Markovian or non-Markovian. This problem is actually Markovian, because the changes depend only on the last trial: the reward probability in the non-stationary case is a function only of the reward probability at the previous time step, and that is what makes the problem Markovian. There are non-Markovian bandits, but that is not what we are discussing here; this one really is Markovian. If you went to a non-Markovian bandit, then you would potentially need to plan ahead for longer, because there would be dependencies between your actions and the outcomes you observe in a sequence, and different sequences might result in very different outcomes. Because here we are in the Markovian case and the agent cannot change the dynamics of the environment in any way, it is in a sense just a passive sampler of the environment, and there is no gain in planning ahead; you can simply re-evaluate your beliefs on each trial. It is kind of like a memoryless process.
Right, the memoryless process you brought up. And then what would have to change to account for situations where the sequence of actions does matter?

For example, one could introduce structural dependencies between different arms, so that reward probabilities depend on location, and allow the agent to make its choices only from nearby arms, introducing spatial dependencies. That would be one example where, depending on which part of the space I am in, where I selected one arm limits which arm I can select next, and that would require you to plan, depending on where you should be on different trials and how you expect things to change. That would introduce the requirement for planning.

Very interesting. Blue or Dave or Sarah, want to ask anything? Otherwise I have some questions. You mentioned a few industrial applications and a few ways in which people use multi-armed bandits. This is just a logistical question: what is the rate-limiting step in deploying active inference agents in those use cases? Is there a way to wrap the inputs and outputs of multi-armed bandits so that it's interoperable? You talked about how the learning rules were shared while you juxtaposed different action selection approaches; so, in the context of pipelines people are already using, is there a way to hot-swap active inference in and have it deployed in industrial settings very rapidly?

Yeah, that would be one idea. For example, if you have these non-stationary problems and you have already been using Thompson sampling, or even optimistic Thompson sampling, as your algorithm, then in a way you already have a means of forming beliefs about the relevant aspects of the problem, and you can really easily swap the two: the only difficulty is figuring out how to compute the expected free energy from whatever generative model people already have there, and testing it out. For stationary problems it may be trickier; it might work in some situations, but you don't have the nice guarantees that you will always find good solutions.

In a way it's an advantage of active inference in a changing world: people's preferences for a given advertisement, or the situation in trading, are always changing, and so it's a false allure to have something with extremely well-behaved behavior in the asymptotic or infinite case, because we're not in the infinite case, we're in the finite case. So the strong pillar that purportedly underpins these other approaches isn't so much of a pragmatic gain. It's pretty cool to hear about that.

Blue, thanks for the raised hand; what would you like to ask?

I have a question left over from the 24.0 video, and something you alluded to today in your talk: can you detail the difference between the switching bandits and the restless bandits? I was unclear on the timing of the switching in switching bandits. And, as part B of that question, in the case of restless bandits, are the algorithms that are optimal for switching bandits also optimal for restless bandits? You mentioned that active inference was good, but what about the others that are commonly used?

It's the same, actually. Let me explain.
In switching bandits we have a piecewise-stationary problem, which means that between trial one and trial n everything behaves as stationary, and then, when a change comes, you just get new reward probabilities associated with each arm; between changes everything is fixed. Restless bandits assume that things change continuously. The example I gave was that the reward probabilities can be described by a random walk in logit space: because a probability lies between 0 and 1, you transform it to the unconstrained space between minus infinity and infinity, where you have a simple random variable, and then you map it back into probability space with a sigmoid function, for example.

The belief-updating algorithm I showed could also just be applied there; the complication is that one does not know the change probability in the restless case, so you would need an algorithm that can also infer the change probability. And the restless case does not necessarily translate into a fixed change probability; it is a somewhat different problem, because the changes in which arm is optimal do not follow the same structure as in the switching case. One can instead use a different generative model that explicitly assumes the random walk; a hierarchical Gaussian filter, for example, is something one could apply there, and there are also other belief-updating algorithms for that.

So the local and the global maxima and minima are changing in the restless case as well?

Yes, exactly. This is like the case of varying difficulty, where the gap between the best arm and the second-best arm varies all the time. I could have drawn the lines to show this, but I don't have an illustration right now.

One other thought on a possible advantage of active inference: with a deep generative model, using the same skeleton, maybe even the same code, it could be possible to do model comparison between two different types of bandits (what kind of scenario am I in?), or even to have a deep temporal model; it could be extended in ways where a purely instantaneous sampler might be led astray.

Yes. In principle you can build any hierarchical Bayesian generative model consisting of multiple models, and you could also generalize Thompson sampling or Bayesian UCB to use it; I just wouldn't assume that would be a very efficient way to figure out which model you should currently be applying to the specific task. This is potentially exactly where active inference would provide an advantage: in such a scenario you can also learn about the generative model itself, better and better over time.

That made me wonder what this would look like at the human level, as we make sequential decisions throughout our day. What would it look like, or what would we keep in mind, if we were going to make decisions more like an active inference agent than like a Thompson sampling agent? When you're in the grocery store looking at the cereals you've had before, or not, how would an active inference agent behave differently? Just wondering.
One other thought on a possible advantage of active inference: with a deep generative model, using the same skeleton and maybe even the same code, it could be possible to do model comparison between two different types of bandits, asking what kind of scenario am I in, or even to have a deep temporal model. So it could be extended in ways where a more instantaneous sampler might be led astray.

Yes. In principle, any hierarchical Bayesian generative model that consists of multiple models could also be generalized to Thompson sampling or Bayesian UCB; I just wouldn't assume that would be a very efficient way to figure out which model you should currently be applying to the specific task. This is potentially also where active inference would provide an advantage: in such a scenario you can also learn about the generative model itself better over time.

That made me wonder what it would look like at the human level, as we make the sequential decisions of our day. What would it look like, or what would we keep in mind, if we were going to make decisions more like an active inference agent than like a Thompson sampling agent? When you're in the grocery store looking at the cereals that you've had before, or not, how would an active inference agent behave differently? Just wondering.

Yeah, that's a good question. We actually have some datasets which we could use exactly to ask these kinds of questions, but I don't have any clear answer off the top of my head. What would be your expectations, Sarah?

Well, I guess it would tell you when is a good time to try a different cereal, because you may like it even more, especially if cereal recipes change over time. Otherwise you have some likelihood of being stuck with a super bad cereal, but you'd also like to have the one you like most.

Yeah, I guess the exploration would be more structured, in a way more directed, right?

Right, like when the ingredients change, you check the ingredients, and now you've updated your likelihood of trying something.

Yeah. Thompson sampling doesn't have a directed part to its exploration, just random exploration, whereas here the focus would be more on directed exploration, and one could also think about how to add a random component.

One other piece: we're often comparing and contrasting active inference to reward learning and reinforcement learning. So instead of making the decision based on the highest expected value, or choosing something proportional to its relative value, or always going with the option with the best likelihood of being the best-tasting cereal, there can be other heuristics. It opens the door to purely curiosity-driven sampling: I've just never had this, and it's not so much a reward-maximizing maneuver as a purely epistemic-gaining maneuver. And, as we've seen, when there is both a pragmatic and an epistemic component to the function being optimized, those decisions can coexist and be put on a common grounding, unlike in a purely value-driven framework where even the exploration has to be coerced back into reward.

Yes. I think Noor Sajid, as first author, had an interesting paper, the one demystifying active inference, where they discuss learning the preferences themselves and what consequences this has. It puts a different perspective on understanding what reward is, because in the real world you don't necessarily know why something should be more rewarding than something else; you learn this over time, and you simply learn to prefer different outcomes. They don't even have to be more rewarding in an absolute sense: you just build experience with some outcomes over others and start to prefer them. It then appears as if you were doing reward-based decision making, but it's actually, in a way, preference-based decision making, right?
Cool. I had another question, about the approximation of active inference. Is that the only way to approximate active inference? Are there other moves you could have made to approximate the sections that you did approximate, or other pieces that can be approximated? What can be swapped or approximated while still retaining the essential structure of an active inference model?

Well, I don't have many ideas for what else one could do; the problem in this scenario is relatively simple. But, for example, given that we have to compute the expected information gain while we are learning, instead of estimating it efficiently with this analytic approximation, there is another way to compute it, by sampling: you draw a couple of samples from your beliefs and you get an estimate of the expected information gain. This would still be within the active inference framework, just a different way of computing the expectations and what they represent, and it would also be a way to add a bit of randomness to the decision-making process.
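That sampling alternative is easy to write down. Here is a minimal sketch for a single Beta-Bernoulli arm (the function name and sample count are illustrative): the estimate is the entropy of the averaged prediction minus the average entropy of the sampled predictions, and because the estimate is stochastic, it naturally injects some randomness into action selection.

```python
import numpy as np

def mc_info_gain(a, b, n_samples=100, rng=None):
    # Monte Carlo estimate of the expected information gain for one
    # Beta(a, b) arm: entropy of the mean prediction minus the mean
    # entropy of the sampled predictions.
    rng = rng or np.random.default_rng()
    theta = rng.beta(a, b, size=n_samples)  # samples from current beliefs

    def h(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log(p) + (1 - p) * np.log1p(-p))

    return h(theta.mean()) - h(theta).mean()

print(mc_info_gain(2.0, 3.0))  # noisy estimate; grows with uncertainty
```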
I mean, active inference, I would think, is general enough to work with many kinds of approximations. You can make approximations at many different points, but as long as you end up with some variational free energy, I guess it's still within this framework. You might separate the inference part from the action-selection part, but you can actually use it as a model of both, make approximations at different points in this joint model, and as long as you can still write down the variational free energy, I guess it's conceptually active inference.

Well, I don't know if even that is necessary. I would say any kind of Bayesian belief updating, even if it's not specifically motivated by variational inference, has the consequence of minimizing variational free energy once you compute it from whatever posterior you obtain. So it would still be, in a way, active inference: one can replace the variational SMiLe updating with any belief-updating rule and keep the same concept, because you are just getting a better bound on the marginal likelihood if you have a better posterior.

So what would you say is necessary or sufficient for a model to be considered active inference versus not? Just including sensing and action, or perception, inference and action, or agents in a niche, or blanket states? These things sit in an overlap of a Venn diagram with many classes of models and approaches, and it's in that blurry intersection that action and inference are being applied together. Is there something unique, something we can use as a diagnostic?

Well, I guess the difference is more in the action selection. Planning as inference is the distinctive part, not so much perception as inference, because, as I said, one can think of many ways to solve that part. Once you go into planning as inference, together with the idea that actions themselves will change your beliefs, and that you will choose the actions which are best at changing your beliefs toward what you want to achieve, then that is the idea of active inference. In a way it's a circular inference problem: it's no longer easy to disconnect the effect of choices from the effect of perception; they all depend on each other.

And also being aware of your own uncertainties: I think if you do standard reinforcement learning, where you just calculate expected rewards, you're not always aware of how certain or uncertain you actually are about your own generative model, whereas with active inference it's potentially easier to know which action to take in order to better learn your model.

Very interesting. It's almost like, by carrying and propagating our uncertainty, and having a self-model of action and learning, we get something like second-order cybernetics: we act in a way such that in the future we can expect to learn better or act better, as opposed to a hungry search for the best action where learning is only a one-step projection, like asking which Wikipedia article is most informative right now rather than what trajectory I expect will result in more effective learning or action, again on a common grounding. So that's pretty interesting. Looks like you have a thought, though, Dimitri.

That's definitely an important part, but I would say Thompson sampling would also satisfy that. They are all Bayesian, and a Bayesian decision-making algorithm necessarily has to take uncertainty into account if it's derived from Bayesian decision theory. But it is not as clear there that actions also have the consequence of reducing uncertainty in the future, and that you can use this as a gauge of which action is better. So yes, there are subtle differences, and they are not always obvious.

That's one reason we're so interested in ontology and in slowly scaffolding the research, so that we can actually juxtapose different models and understand where they differ. It's like two road trips where one person took a little extra loop, or crossed a bridge this way while the other person took a different route right here. You talked about the variational SMiLe updating, and just recently you mentioned that it's sort of a module you can switch out. So what is SMiLe? I also noticed a very recent citation. What does it do, and how is it different from other ways you could have filled that module?

I have implicitly used SMiLe for years now; it's just a very simple way of updating your beliefs. The problem with variational inference is that if you want to find the minimum of the approximate posterior, you typically need to iterate over several loops, following the gradient of the variational free energy, and that doesn't make it very efficient for this kind of application. The variational SMiLe approach bypasses these iterations: you can transform variational inference into a single update step, because there is a part which you can compute explicitly and treat as your fixed belief, and another part which you still update through the variational approximation. So that's a small advantage here. As I said, there are other approaches for updating beliefs in these scenarios, which in the Bayesian sense can be more optimal, in that they bring you closer to the exact posterior. We tested these things on some examples and the story remains the same, so there was no gain from making things more exact at that level. One can imagine that in a different environment a better generative model and better approximation rules would take you further, but for the Bernoulli case this is just not so.
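For flavour, here is a sketch of a surprise-modulated single-step update for a Beta-Bernoulli arm, in the spirit of the variational SMiLe rule from the surprise-learning literature. The exact rule used in the paper may differ in detail; the parameter `m`, the prior counts, and the Bayes-factor form below are assumptions for illustration.

```python
def surprise_modulated_update(a, b, reward, a0=1.0, b0=1.0, m=1.0):
    # One-step update for Beta(a, b) beliefs about a Bernoulli arm.
    # Bayes-factor surprise: how much better the prior predicts the
    # outcome than the current beliefs do (reward is 0 or 1).
    p_current = a / (a + b) if reward else b / (a + b)
    p_prior = a0 / (a0 + b0) if reward else b0 / (a0 + b0)
    s = p_prior / p_current
    gamma = m * s / (1.0 + m * s)   # surprise-dependent forgetting weight
    # Interpolate the counts toward the prior, then add the observation;
    # no inner gradient loop is needed, just this single step.
    a_new = (1 - gamma) * a + gamma * a0 + reward
    b_new = (1 - gamma) * b + gamma * b0 + (1 - reward)
    return a_new, b_new

print(surprise_modulated_update(8.0, 2.0, reward=0))  # surprising outcome
```

The appeal is visible in the shape of the code: a surprising outcome pulls the pseudo-counts back toward the prior, which is what lets the agent track switching reward probabilities without explicit change-point inference.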
What kind of empirical datasets, whether open source or just obtainable, are amenable to this kind of analysis? If somebody hears about the algorithm and wants to see it in action, or play with it themselves once your code is available a couple of weeks after this recording, is your code set up more for simulation, or is it something where we can plug in a dataset that is already structured appropriately?

It's currently just for simulations, but one could potentially play around with different algorithms, and add other algorithms, both on the learning part and on the action-selection part.

That would be super useful for everybody.

Yeah, if there are people interested in that, the repository is open and welcomes any contribution.

It was partly the structure of your paper that made us excited to juxtapose it with these other approaches: you showed that you can directly compare active inference alongside other models. Maybe that's something we keep coming back to, because the crux of the paper's results is really the different dynamics as time increased, as the relative challenge increased, and as the number of arms changed, across these different styles of agent and the two styles of bandit. So people could build directly on what you're working on, but also, more broadly, instead of a single model being presented, papers could include multiple types of models as baselines, so that we wouldn't just have to read a paper saying this algorithm works well; we could see them directly compared. And that's something more and more papers are doing with active inference.

The libraries we use were also developed relatively recently. JAX, from Google, is an acceleration library for linear algebra which lets you run this code very fast, and it integrates well with a probabilistic programming language, NumPyro, which was also developed quite recently. So one could in practice link this to behavioral data and evaluate test data very easily, with a few lines of code.

There is also the question of what language the code is in, one or multiple, and what areas of conceptual math somebody would want to know before diving in.

The programming language is Python, and it depends, as I said, on a couple of libraries like JAX and NumPy; very standard stuff except JAX, which is new. So somebody interested in using it would have to learn a bit about JAX, but that's useful long-term too, so go for it. From the math side, getting a couple of introductions to multi-armed bandits would be a good place to start. There is a quite recent introductory paper on multi-armed bandits which covers many different algorithms in the stationary context and provides some insight into how these problems have historically been analyzed. I would suggest that as a place to start.

Cool. Blue?

So I'm curious about this Google JAX. I haven't heard of it, but I thought Google had their own language. Don't they have Golang? Isn't that Google? Or TensorFlow?

Yes, they have TensorFlow, and I think the basis is the same: it's called XLA, accelerated linear algebra, for both TensorFlow and JAX. What JAX is, basically, is accelerated NumPy: you can run standard NumPy code, more or less in pure Python, and you get gradients for free. It's an autograd library, so you can compute gradients of very complex graphs. TensorFlow also allows this, but JAX seems to provide more benefits for these dynamical scenarios. I had problems when I tried learning and using TensorFlow; it's very difficult to write dynamic code there, and that's why I never actually started using it. I started with PyTorch at some point, and now that JAX has shown up, with some speed-up advantages, it's also quite attractive.
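As a tiny illustration of what "gradients for free" means in JAX, here is a minimal example; the objective below is a stand-in, not the paper's free energy:

```python
import jax
import jax.numpy as jnp

def objective(params, data):
    # Any differentiable function of the parameters works; this one
    # measures squared error between sigmoid(params) and the data.
    return jnp.sum((data - jax.nn.sigmoid(params)) ** 2)

grad_fn = jax.jit(jax.grad(objective))  # XLA-compiled gradient function
params = jnp.zeros(3)
data = jnp.array([0.2, 0.5, 0.9])
print(grad_fn(params, data))            # gradients with no manual calculus
```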
One code question, just to stay on this theme, and then a question from the chat. What about the Python implementation, like inferactively-pymdp, that Alexander Tschantz and colleagues have worked on? What are the similarities or differences with their Python approach?

Currently none, I think, because I'm also involved in that, so they will start using JAX as well.

Okay, so those threads will join together.

Yes, and that also reduces some of the complexity. One could also just write the SPM code, but that is super complex and would be super slow; just to examine the stationary scenario numerically would probably take months. That's the problem there, so I wrote the code to be very efficient, which removes a lot of complexity you don't need, although in some scenarios you do want a more general description of the problem.

So here's a question from Steven in the chat: do you think that parallel modeling processes might be used more in the future, with different model approaches highlighting different patterns of behavior happening in different niche contexts? I would say in some ways that's what your paper did, but what are you thinking?

From the practical side, yes: that's what you should be doing. Just figure out what works best and don't think too much about the philosophical side of things. But longer term one could imagine a scenario where more and more things become generalizable to one description; it's just a difficult process.

Even for those who are just learning the technical details, the visual tell for me was that the figures were grids of graphs: three different settings of difficulty, three or five different settings of arms. It wasn't just one parameter combination; each figure was a grid of combinations, and you presented what could be considered two different niches, the static and the dynamic. And today in the presentation we heard about all these other variations, which are like toggles in the code: you can say I'll take this alpha or that beta as far as the combinations of how to run it. So as the code becomes more interoperable and pruned down to the necessary pieces, it becomes easier and easier to expand it back out and choose among different options for a given piece. It's a skeleton that many variations flow off of.

Yeah, cool.

Well, in our last few minutes here, where would you like to take it next week or beyond? What are your current interests or curiosities?

For me it's just important to have this paper out, so that next time a reviewer asks, what about this or that algorithm, I can say: look, we analyzed this; it's similar here and different there, and you can get basically whatever you want, because with these adaptive parameters you can generate very different behavior. One can also ask: if you have an agent which behaves like Thompson sampling, what would be the corresponding parameters in the active inference framework which would emulate that, and can you actually differentiate between them? These are potentially interesting questions one can try to answer, and they are relevant to my work because of the constant questions I get during the review process.

But then there is also this quite interesting side in machine learning where this can find potentially interesting applications. I recently started working a bit with Monte Carlo tree search. Interestingly, this is something which has been applied in active inference as a way to compute expected free energy in complex problems, but it turns out that for very complex problems you can also use Thompson sampling inside Monte Carlo tree search, as a way to figure out which branch is best to follow in a sample, and this could potentially be improved. So you can apply active inference to planning inside active inference itself, in a kind of hierarchical, circular way.

So what does that look like, how does it get implemented, or how is it different from straightforward active inference?

The problem is that when you have a complex decision and planning problem, where you have multiple branches into the future, it's impractical to compute everything. So what people do, and a very popular way to do it, is Monte Carlo tree search, where you just sample different paths and estimate, on a sub-sample of possible paths, which path is best to follow. In active inference this corresponds to estimating the expected free energy of a path into the future through a sample. But then you have the problem of how to select the paths in the sample, and there you can apply active inference to the path selection itself: you can choose which paths to sample, rather than sampling randomly, when trying to approximate the expected free energy. That's what makes it an interesting possibility to explore.
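Here is a toy fragment of what such a tree policy could look like, with Thompson sampling choosing which branch to descend; the class and function names are hypothetical, and a full Monte Carlo tree search would also need expansion, rollouts, and backup along the visited path. One could equally rank branches by a sampled expected free energy instead of sampled value.

```python
import numpy as np

class Node:
    def __init__(self, n_actions):
        # Beta pseudo-counts over success of each action at this node.
        self.a = np.ones(n_actions)
        self.b = np.ones(n_actions)
        self.children = {}  # child nodes, keyed by (action, outcome)

def select_action(node, rng):
    # Thompson sampling as the tree policy: sample one plausible value
    # per action and descend along the best sample (instead of UCT).
    return int(np.argmax(rng.beta(node.a, node.b)))

def backup(node, action, outcome):
    # Propagate a binary outcome back into the node's beliefs.
    node.a[action] += outcome
    node.b[action] += 1 - outcome

rng = np.random.default_rng(0)
root = Node(n_actions=3)
a = select_action(root, rng)
backup(root, a, outcome=1)
```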
The way you just framed it, together with what we've seen about the relative strength of active inference in dynamic settings, is consistent with a lot of the qualitative and philosophical ways people talk about active inference: as sense-making, or wayfinding, or a navigation approach, rather than a cut-and-dried calculus of decision making that just lays the whole crystal path out before you. It's really about the instantaneous actions we take now in light of uncertainty about the present, and really about the past and the future as well. So it's always cool to see how the technical developments, as they weave, recombine, and proliferate, show us that some approaches are actually interchangeable while others really differ, and we get more technical detail and speed-ups while we also get more and more clarity on what the structure of this sense-making problem is.

I agree.

Any final thoughts from anyone? Otherwise, this was a super interesting presentation and discussion, so we really appreciate it.

Thank you. Cool, yeah. We're looking forward to next week.

Great, so thanks everyone for joining, and everyone is welcome to join live next week, where we'll continue the discussion; the .2 is kind of our jumping off into the unknown unknowns instead of just the known unknowns. So thanks again for joining, and we'll see you next week. Thanks. Bye bye.