All right. Hello and welcome, everyone, to the Active Inference Livestream. This is Active Inference Livestream number 8.0. It is November 4th, 2020, and I'm Daniel Friedman. I'll be doing a solo contextualizing discussion today. Welcome to TeamCom, everyone. We are an experiment in online team communication, learning, and practice related to Active Inference. You can find us on our Twitter account at inferenceactive, at activeinference at gmail.com, at our public Keybase team, or at our YouTube channel. This is a recorded and archived livestream, so please provide us with feedback so that we can improve on our work. All backgrounds and perspectives are welcome here, and as far as video etiquette for livestreams: mute if there's noise in your background, raise your hand so we can hear from everyone, use respectful speech behavior, et cetera.
So first, the announcement is that we have chosen the papers, the dates, and the time for the rest of the Active Streams for 2020. All meetings for the rest of 2020 will be from 7:30 to 9 a.m. PST. And the papers we'll be reading: number 8 is Scaling Active Inference. That's what this 8.0 is going to be about, and that's going to be on November 10th and 17th. Paper 9 will be The Projective Consciousness Model and Phenomenal Selfhood, a 2018 paper. Paper 10 is going to be A Variational Approach to Scripts. Paper 11 is Sophisticated Active Inference, Affective Inference, sorry: Simulating Anticipatory Affective Dynamics of Imagining Future Events. And you can see the dates of all these events, so set aside the times if it's possible for you to participate, and if you have a time zone or kind of event that you want to do that's not reflected here, just let us know, and there's our Twitter address.
All right, so here's what's going to happen in Active Stream 8.0, this one. The goal of this talk is going to be to set the context for 8.1 and 8.2, which are going to be on this same paper, Scaling Active Inference.
This is a paper by Alec Tschantz, Baltieri, Seth, and Buckley from 2019, with the arXiv ID 1911.10601. The video is an introduction to the context of some of these ideas. It's not a review or a final word. Definitely, I learned a lot just reading through the paper and researching for this presentation. So the idea is that this video will contextualize some of the ideas, math, notation, and vocabulary of the Tschantz paper. And the video should be accessible, though this is also hopefully cool and cutting-edge research. And the punchline, and don't worry if it doesn't make sense yet or the implications aren't clear yet, is that Active Inference is scalable and it's homologous to, so it's similar to and potentially preferable to, other common algorithms in a similar space, like control theory or machine learning.
So in 8.0, the first section is going to be some background on all the keywords that they provided in this paper, and then we'll talk about the goals, the abstract, and the roadmap. Then the second part of 8.0 will be a walk-through of the key equations, notation, and quotations. And we're going to do, like, the 80-20: so most of the notation, most of the meaning, but not all the sections, not all the symbols. And then we'll talk about figure one and figure two, what they represent, and how that supports the conclusions of the paper. And then in 8.1 and 8.2, we will all come together to discuss the same paper. So save and submit your questions or put them as a comment, and then get in touch if you want to participate in this one.
Okay, so let's start with the keywords. These were just the keywords that the paper provided. So I would think of them as the topics where this research is hopefully going to be on the cutting edge with respect to that field. So they are artificial intelligence and machine learning, and they also mention reinforcement learning and model-based reinforcement learning. And then systems and control, and information theory.
And then a keyword that wasn't on the paper, but that we can just add here, is, of course, active inference and the free energy principle. Each of these keywords, definitely, you could have a course or a PhD on. So each one of them is going to get about one slide. So of course, go deeper into the resources mentioned or find other courses that teach these techniques, because there's a lot to them. I'm just going to kind of go through them in a way that builds towards where we're going with the math and with this research paper, and just know that there are other people's perspectives on these topics, too.
So first, artificial intelligence. Wikipedia, when we last checked it, was saying that artificial intelligence, or AI, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals. So even using this definition, or perhaps an alternate one, which I'll get to in a second, we can say what AI results in today, in 2020: it results in maps, policies, individual and collective policies, recommendation engines; it results in encryption techniques and security; it relates to classification algorithms and things that influence us all. Now, where I would come at this is by saying, well, AI is made and used by humans as part of their natural world. And everything is natural in that it exists in the world. And so positing that there's an artificial type of cognition that is somehow distinct, rather than extended, embedded, and enacted by humans, does a few things that aren't very helpful. And there have been some excellent other discussions of this topic.
Just to move it in a positive direction, I would say that often other phrasing can be more helpful than calling it AI. So here's one example. What we're going to talk about a lot today is computer statistics. So using computers to do statistics, but basically things that pen and paper could do slowly.
So things that are about matrix calculations or about statistics with computers on big data. Another way that AI can be meant today is, like, human-in-the-loop AI. So human-in-the-loop AI could describe a map or a recommendation engine, to emphasize that it's human agency all the way through in the use affordances, as well as hopefully in what the designers considered. Another phrasing of AI that can be helpful, and that also centers the role of the human in deciding how to use it and how to design it, is intelligence augmentation, which clarifies that it's us who are augmented. Or an even more general phrasing than AI would just be the human technological niche, which would also include physical aspects of our niche, for example what we talked about in 7.2; that would include effects that aren't simply in the sort of silicon chips and their interactions that AI captures.
All right, machine learning. So machine learning, there's a lot to say about it, it's a big area, but the main machine learning topics that come into play in this paper are reinforcement learning and model-based reinforcement learning. So the simple reinforcement learner is an agent in an environment: the agent takes actions A that act on the environment, and then the environment sends back states and rewards. Sometimes it sends back just states directly, and then the reward is understood through a layer of perception by the agent. Other times the environment directly sends reward back. Where model-based reinforcement learning builds on that is by the introduction of this model, which is outlined in red here. In these types of model-based reinforcement learning, rather than learning the simple raw connection between successful behavior and reward, and learning that as a behavioral correlation, the model-based reinforcement learner is able to have a model of, oh, for example, running is painful now but then I'll feel better later.
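The simple agent-environment loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corridor environment, the names `CorridorEnv`, `random_policy`, and `run_episode`, and the reward of 1.0 at the exit are all hypothetical and do not come from the paper or the slides.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a 1-D corridor whose exit is the right-most cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (step left) or +1 (step right); the position is clipped to the corridor
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # the environment sends back a state and a reward
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps the current state to an action; this one ignores the state entirely
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state, rng)             # agent acts on the environment
        state, reward, done = env.step(action)  # environment returns state and reward
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


rng = random.Random(0)
total_reward, steps = run_episode(CorridorEnv(), random_policy, rng)
```

A model-free learner would adjust `random_policy` directly from the rewards it receives, while a model-based learner would additionally learn something like `CorridorEnv.step` and plan with it.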
And so then it allows one to pursue goals that are transiently unpleasant in order to get at longer-term goals. So think about a little maze, trying to get not just greedily towards the exit but trying to take a step back to take two steps forward.
So the similarities are that the agents are acting on or within the environment, and that their actions arise from policies, which are models of action. So what is and isn't likely; and those can be concordant with what the organism actually, quote, can or can't do, but at the very least it's what ends up getting implemented by this organism. The agents modify their action policies based upon their reward from the environment, either in a direct way, or in a symbolic way, or in a model-based way. They learn, or, to use the Bayesian phrasing, they update their model. In any case it's going to be computational, because we're going to be using computers and math to describe it. But then Bayesians would say, like, the prior is updated with evidence to generate the posterior. And there are various ways that this learning or update can be done.
Another similarity is that the environment sends rewards or signals to the agent, and one area of debate, or a question to think about, is: is the observation the reward, or how is it the reward? Is the learning itself the reward? There are machine learning algorithms that have that kind of approach as well. And then does the observation symbolize or signal the reward, or how do we think about that? Those are kind of the ways that reinforcement learning and behavioral algorithms can be applied both to abstract data classification approaches, clustering and things like that, and to direct plans for action. And it turns out that active inference is going to relate to reinforcement learning in an interesting way. That's why the authors made it a keyword.
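The Bayesian phrasing mentioned above, a prior updated with evidence to give a posterior, can be made concrete with a two-hypothesis example. This is a minimal sketch: the hypothesis names and the probabilities are made up for illustration, and `bayes_update` is a hypothetical helper rather than anything from the paper.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses given a prior and P(observation | hypothesis)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(observation), the normalizing constant
    return {h: value / evidence for h, value in unnormalized.items()}


# Assumed numbers: the agent starts undecided about whether an action is rewarding,
# then sees an observation that is four times more likely if it is.
prior = {"rewarding": 0.5, "not_rewarding": 0.5}
likelihood = {"rewarding": 0.8, "not_rewarding": 0.2}
posterior = bayes_update(prior, likelihood)
```

Feeding `posterior` back in as the next `prior` gives the repeated update that the word "learning" stands for in this phrasing.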
All right, now in a modern context this reinforcement learning is going to be implemented with, for example, a neural network. In this case we have the environment sending the state S to the agent, and that's going to go through some sort of deep neural network, DNN, and result in some policy selection. So neural networks are inside the agent. That's a way that we can implement it with basically software that's available, and that allows for some types of scalability, in the sense that we know there's going to be a relationship between how many parameters our models have, how big our model is, and how effective it's going to be. And then it also allows for non-linear inference, not just a simple linear regressor, let's say. And sometimes the little and sometimes the big theta is used for parameters, but we'll get to that later. And the neural network can be model-free, so it can just be sort of let loose to find the patterns in the data, or it can be using an underlying model. So if a certain type of event is a priori taken to be much more likely, those kinds of observations can be used in a more weighted way.
One example of another modern reinforcement learning approach is Q-learning, which was used for some games and things like that recently. The insight, just another way of doing it, is mapping the Q table, which is the mapping of state, action, and reward. So that would be mapping your current state and your behavior in that state, like, oh, I'm tired but I'm gonna keep running because it's rewarding, let's say, but in some mathematical way. And then deep Q-learning, also represented here, is when, again, it's not just a table but a neural network. So that's kind of the modern insight: just replace any other statistical module with a deep learning something. And that apparently allows for some more nuanced policies to arise.
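The Q table described above can be sketched as a dictionary keyed by (state, action) and filled in with the standard tabular Q-learning update. This is a hedged illustration, not the setup of any paper mentioned here: the five-cell corridor, the constants `ALPHA`, `GAMMA`, and `EPSILON`, and the function names are all assumptions chosen for the example.

```python
import random

ACTIONS = (-1, 1)   # step left or step right
N_STATES = 5        # hypothetical corridor; state 4 is the exit
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate


def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + action))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q, state, rng):
    # Epsilon-greedy: mostly exploit the table, sometimes explore; ties are broken at random
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def train(episodes=500, max_steps=200, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # Value of the best next action; zero once the episode has ended
            future = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the entry toward reward + discounted future value
            q[(state, action)] += ALPHA * (reward + GAMMA * future - q[(state, action)])
            state = next_state
            if done:
                break
    return q


q_table = train()
greedy_policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)}
```

Deep Q-learning keeps the same update but replaces the dictionary `q` with a neural network that takes the state as input.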
And then just one slide, just to throw up there; you can pause if you want to look at it, but it's from a paper called Deep Reinforcement Learning, and it shows just the depth of how many different areas this is kind of related to. So these are the kind of structure of the problems and the areas that this paper is getting at with these keywords.
Okay, so let's start from this model-based reinforcement learning framework and then talk about systems and control. In this framing we can look at the model-based reinforcement learning in the green and, in the blue, model-free reinforcement learning. So model-free, again, is the direct mapping from experience onto policy. And the increasingly abstract model-based reinforcement learning is based upon the supervised learning of experience into updates for the transition model, the generative model of the world, and then using that to implement some sort of planning process, resulting in some policy selection.
So control systems theory is about planning for action amidst uncertainty, because you have to plan for action and there's going to be uncertainty. And there are various kinds of uncertainty: sometimes they're metaphors for each other, other times the methods are transferable, other times one system will suffer from one type of these challenges versus another. But it includes systems that are chaotic, like a double pendulum, stochastic, multi-scale, partially observed, noisy, et cetera. So if you can't act, you can't control, you can only watch, and that's really what differentiates control theory from just statistics. Statistics is descriptive, but systems and control approaches are more oriented towards understanding the dynamics of the systems in a way that can be intervened in, so modeling this planning process and the policy explicitly. And then also, of course, the caveat, not present in many of these models, is the strategic axiom that non-action can be a form of action. So the summary here is that we want to have a mathematical model, a way to talk
about how a system has to supervise or model themselves, their action affordances, their system, and their world (well, that's the good regulator) and also have this action policy plan. So supervised learning is about supervising your input and appropriately updating your generative model of the world. And then what is your generative model of the world for? It's going to be about this policy selection. And here's a more classic control theory diagram, with the way that the system, measuring element, controller, and effector are related.
All right, information theory, another keyword. So there's probably another 500 hours of YouTube to watch on info theory, and a lot you could say about a lot of different areas. A simple way to phrase it is that information is the reduction of uncertainty, whether it's actionable or not. So it doesn't have to be connected to control theory; it can just be purely descriptive, like taking the Shannon entropy of some DNA strand. Information is always contextual, it's relational, it's model-dependent. There are a lot of critiques of this sort of naive Shannon interpretation of information theory sometimes. And the measurement or the comparison of information is challenging: like, if you calculate the entropy of the biodiversity of insects and plants on the same island, they're not necessarily on the same scale, et cetera. And there are a lot of areas that info theory touches on, like semiotics and semantics, measure theory, dynamical systems, and signal processing.
Reduction of uncertainty is also not truth, because of course wrong and precise estimates do exist, and precise estimates of external states don't really facilitate effective action in many cases. A lot of other related areas are the quantification, storage, and communication of information. The measurement and comparison were mentioned before. How do you store and transmit information through time? That's like memory. How do you communicate? That's about transmitting information through time and space, protocols for
communication. And when information gleaned by a system reduces its uncertainty about world states or causes or relationships, it can facilitate the updating of effective policy models and therefore action, in some cases but not all cases. And when the action is effective, it can be rewarding or sustainable or high fitness; there are various ways that you can think about that kind of success, depending on the mechanism, the scale, the system. And then the parameters are updated by this rewarding. So whether it's which genotypes of the plant are successful in a certain niche, or whether it's which computer virus is mutating the best, it's this sort of learning slash development slash evolution process that ultimately we want to kind of move towards an integration of.
All right, just to talk a little bit more about information theory and also introduce this conditional notation, for those who may or may not have seen it before. Why are we talking about information theory here? Because, again, when we reduce our uncertainty about causal relationships and the statistical regularities in the world, we can enact better policies. So let's get a grip on some basic information theory concepts, but also introduce this notion of a conditional. The notation A, vertical line, B means A given B, so the part after the vertical line is the part you're conditioning on. H of A is the information content: how surprising is the random variable A, in nats or in bits. And just to kind of combine these two bullet points, H of A given B is the information content of A conditioned on B. Sorry, there's a typo on there. I of A semicolon B is the mutual information between A and B, so that's shown in these sort of single overlap spaces on the Venn diagram below. And then there's also a triple overlap, and you can also have overlaps and conditionals together. So the way that we're going to think about mathematics is just sort of like nested operations that hold different kinds of transformations and potential. And there's
a lot of cool areas of information theory. James Gleick's a great writer, and this book by Stone in the middle is a good introduction, and then this book by Zenil et al., Algorithmic Information Dynamics, is also a cool application and theory.
So let's return to a little variant of a slide that we've looked at kind of before, which is this enactive versus Bayesian dichotomy slash divide that ultimately active inference is going to try to integrate. In the enactive worldview we have the agents in the world, and they're enacting in this nexus of action and perception, and the outcome of this interaction is the embodied and ecological action sequence, or policy, from an embedded agent who makes slash is their behavior. And just to go one more level in, now from a control theory perspective, into the agent, we can see that the agent has the sensory states on the input side, then they have their causal model of the world, and then their policy model. In some sense these could be like one integrated unit or not; we'll get there later.
And then we can line that up with the Bayesian structural representationalist view. In this view you have the data, the observations, which are sort of the base-level observations or parameters in the model, and then through a recognition model hyperparameters are updated, which results in a generative model of the data. And this kind of expectation maximization scheme results in the statistical convergence of a multi-level model that's representing (hence the representationalist) structures of the world (hence the structural part, maybe). And then just to line it up, we can think about that model as a Bayesian control agent, let's say. It's going to have sensory states: what is my robot sensor detecting? What is the causal model for what that means for the pose of the robot? And then the policy selection: how are we gonna correct? So these two sides of the dashed line can both kind of be internally consistent, and we want to return to this question which is
about how the free energy principle and active inference are going to say something about the relationship between the world and the agents. So we have action and perception, and we want them to be symmetrical in this kind of niche-concept way we've talked about. We want to have the free energy principle be an axiomatic set, and we want to build on top with a multi-scale active inference perspective. This is sort of where we want to be moving towards. How has active inference stepped into this gap and built up towards this goal?
So, active inference 1.0, and I'm open to people building on slash correcting me on any of these parts here, this is just very rough, epoch level. Active inference 1.0 is an implementation of a machine-learning-type algorithm that builds on the free energy principle as well as the enactivist approach, and it relates the attunement of local sensory states and policies. So in that way it's kind of like the model-free reinforcement learner. It focused on very discrete cases, so just like on a grid, for example, used pretty simple expectation maximization algorithms, and had relatively short-range spatial and/or temporal reach or depth. It was more of a... let me see if my video... not sure what happened here, sorry about that. So I'm going to just re-add my camera. C'est la vie.
Anyways, active inference 2.0 added temporal depth and niche interactions into the equations, with a longer-term planning phase, and also added in this aspect of agent embeddedness, and introduced the idea of valence and learning into the formulation. And this is kind of where active inference 3.0 and beyond go. So one direction they've been going is towards learning and affect and these sophisticated or counterfactual aspects of action; here's a citation. And then another direction that it was going towards is reflected by this Scaling Active Inference paper, which is going to introduce the idea of high-dimensional data, continuous variables, explore-exploit behavior, and also homology to reinforcement learning and working
at scale. So for me it was like the hero's journey reading this paper, because this is just a framework for a completion story. So this is the Scaling Active Inference journey. It starts off with the call to adventure, which is the question: how can we model complex control tasks with active inference? Here's the beginning of the transformation, the challenges and the temptation: should one go to sleep, or should they read the Friston bibliography? The helper comes into the picture: here's Alec. We have a transformative moment when we read and when we understand the paper Scaling Active Inference. Then we have our movement from dark to light, from reinforcement learning to active inference, from failure to control. Along the way we receive the gift of meeting new friends, discussing cool ideas, having impact on systems, and making Friston memes, of course.
So let's get to the paper. Scaling Active Inference is the paper. The goal of the paper is presented as: we present a model of active inference that is applicable in high-dimensional control tasks with both continuous states and actions. Our model builds upon previous attempts to scale active inference by including an efficient planning algorithm as well as the quantification and active resolution of model uncertainty. Our model makes two primary contributions. First, we showed that the full active inference construct can be scaled to the kinds of tasks considered in the RL literature. This involved extending previous models of deep active inference to include model uncertainty and expected info gain. Second, we highlight the overlap between active inference and state-of-the-art approaches to model-based reinforcement learning.
So the non-active-inference phrasing here is: how can we apply active inference to challenging control tasks and connect it formally to reinforcement learning? So the abstract was: in reinforcement learning, agents often operate in partially observed and uncertain environments. Model-based reinforcement learning suggests that
this is best achieved by learning and exploiting a probabilistic model of the world. Active inference is an emerging normative framework in cognitive and computational neuroscience that offers a unifying account of how biological agents achieve this. On this framework, inference, learning, and action emerge from a single imperative to maximize the Bayesian evidence for a niched model of the world. However, implementations of this process have thus far been restricted to low-dimensional and idealized situations. Here we present a working implementation of active inference that applies to high-dimensional tasks, with proof-of-principle results demonstrating efficient exploration and an order of magnitude increase in sample efficiency over strong model-free baselines. Our results demonstrate the feasibility of applying active inference at scale and highlight the operational homologies between active inference and current model-based approaches to reinforcement learning.
Okay, so for some of you that might be a lot of new ideas; others may recognize a lot in there. So let's just lay out the roadmap for how we're going to get from A to Z, and then hopefully it will be approachable from all these different angles. Section one is an introduction, and two is about active inference and what work has been done in the area. Three is the model, which is what we're going to focus on in this discussion, and 3.1 is the generative model and the recognition distribution; those are the P and Q models. Then there's a section on learning and inference, policy selection, trajectory sampling, and expected free energy, as well as the section on the fully observed model; those are the sections we won't go that much into today. Then there are the experiments related to exploration and exploitation, and the two figures related to the comparison of exploration strategies and the comparison of performance on two continuous control tasks. Then the relationship with previous work, specifically deep active inference, model-based reinforcement learning, and information gain theory. Then there's a discussion and conclusion.
All right, this quote, which you can pause to read in full; I'm just going to pick up on the end of the top paragraph: while this approach, the previous work, provides an elegant framework for evaluating expected free energy, it can only be applied in discrete state and action spaces, meaning it is not directly applicable to the high-dimensional states and continuous actions considered in reinforcement learning benchmarks. So this is one reason why active inference discrete-state-based learners have not been compared, in the authors' phrasing: the reinforcement learning benchmarks that are used to test the adequacy or the efficacy of different reinforcement learning and control theory algorithms are continuous tasks, like the kind that we'll talk about in the figures. And so what this paper is going to do is employ an approach called amortized inference, which we'll talk about more in a little bit, which utilizes function approximators, i.e.
neural networks, to parameterize distributions. Free energy is then minimized with respect to the parameters of the function approximators, not the variational parameters themselves. And then they restate what each of the sections in three are.
Here is a notation reference sheet. It's not every single parameter, especially the statistical parameters that are distributional and a lot of the later sections too, but these are the sort of parameters that we're going to see a bunch in the coming slides. And we can really just already look at what a lot of the topics that the model is going to cover are. So for example: the policy of the agent, the states that it's trying to estimate, and the true states, with the hats. It's going to be through time, and it's going to involve observations. We're kind of seeing a lot of the ingredients of the reinforcement learning algorithms that we were talking about earlier.
So let's talk about this amortized inference. Here's what they write, the authors, Tschantz et al.: amortizing the inference procedure offers several benefits. For instance, the number of parameters remains constant with respect to the size of the data, and inference can be achieved through a single forward pass of the network, rather than, let's say, backpropagation training. Moreover, while the amount of information encoded about variables is fixed, the conditional relationship between variables can be arbitrarily complex. In equation three, which we'll get to, the parameters of the transition distribution, theta, are themselves random variables. In the current context these parameters are the weights of the neural network f. This approach allows the uncertainty about these parameters to be quantified and casts learning as a process of variational inference. The prior probability of theta is given by a standard Gaussian, which acts like a regularizer during learning. And this just talks about how free energy is going to be used to select the action policy. So what is this amortized inference? Why is it being used?
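As a rough picture of the two benefits just quoted, a fixed number of parameters and inference by a single forward pass, here is a minimal sketch of an amortized recognition function. Everything in it is an assumption made for illustration: the class name `AmortizedEncoder`, the single linear layer, and the random untrained weights are not the authors' network, and the training step (minimizing free energy with respect to these weights) is deliberately left out.

```python
import math
import random


class AmortizedEncoder:
    """Hypothetical one-layer recognition function: observation -> Gaussian belief (mean, variance)."""

    def __init__(self, obs_dim, state_dim, seed=0):
        rng = random.Random(seed)
        # The weights are the only parameters; their number is fixed when the encoder is built
        self.w_mean = [[rng.gauss(0, 0.1) for _ in range(obs_dim)] for _ in range(state_dim)]
        self.w_logvar = [[rng.gauss(0, 0.1) for _ in range(obs_dim)] for _ in range(state_dim)]

    def n_parameters(self):
        return sum(len(row) for row in self.w_mean) + sum(len(row) for row in self.w_logvar)

    def infer(self, observation):
        # One forward pass gives the belief; there is no per-observation optimization loop
        mean = [sum(w * o for w, o in zip(row, observation)) for row in self.w_mean]
        variance = [math.exp(sum(w * o for w, o in zip(row, observation))) for row in self.w_logvar]
        return mean, variance


encoder = AmortizedEncoder(obs_dim=3, state_dim=2)
data_rng = random.Random(1)
dataset = [[data_rng.random() for _ in range(3)] for _ in range(1000)]
beliefs = [encoder.infer(observation) for observation in dataset]
```

However many observations go through `infer`, `encoder.n_parameters()` stays the same, which is the sense in which the cost of inference is amortized across the data.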
These papers on the bottom have a lot more information, but in amortized inference the number of parameters used is flexible. You can have a small or a large model, but what's good about it is that you can also set the size of the model to not change even as more data is input. So this allows you to train on devices that you know are going to have limited computational power, or use a very large but defined model size to implement on larger-scale networks, let's say. Again, the model can be trained with a single forward pass rather than more complex schemes. And then there's also all this degree of freedom when you design the neural network, because there are a lot of approaches that can be implemented with neural networks.
The general framing that they're going to be thinking about for this whole mathematical journey that we're about to describe is a partially observed Markov decision process. So these are the key variables that are going to be defining this organism-centric approach, let's say. At each time step t, so that one's easy to remember, there's the true state of the environment, which is s with a hat. Hat means that it's the real state, and the states are the important things. Those are like, is the bus going to hit me or not going to hit me, let's say, but they could be various other things, so we're keeping it general. And then that s hat sub t, so the real states through time, exists in a state space with a certain dimensionality. And there's also a real transition dynamics of the real states, and that's related to this p of s with a hat t, given the previous times and the action. So that's saying that the time evolution of the system, the real one, with a hat, s of t, is related to, let me see if I can actually get the laser, yep, the s of t with a hat is related to the current time and the previous time and the actions, and that also has a dimensionality. And agents do not have access to the true state of the environment, but might instead receive observations. So observations, it's a good easy
one, and those have a certain dimensionality, and they're generated according to an actual observation distribution at time t, which is related to how observations are conditioned on real states of the world. So if it's really sunny outside, then you're really going to get photons if you look at the sun. As such (again, this is because agents only get observations and not the absolute truth from the world), agents must operate on their beliefs about states. So that's s without a hat, through time; those have a certain dimensionality, and they're about the true environmental states, s with a hat. And then there's a difference between the true dynamics, without italics, and the model of the dynamics, which is always with italics, and here's how those equations are linked.
Here's where active inference comes into play. Active inference proposes that agents implement and update a generative model of their world. So italics p, that means the model of the world, of observations, states, policies, and parameters of the world, where the tilde is the sequence of variables through time. So those are the observations and states through time, and pi is the policy. Like, if I have the policy of running every day, then my observations through time are going to look like this; you can imagine that one. And theta, with italics, belongs to a larger version of theta, capital theta, and that denotes parameters of the generative model, which are themselves random variables. And agents also maintain a recognition distribution. This is like p and q, so mind the p's and q's: a distribution of the states through time and the policies and the theta parameters of the world, representing the agent's beliefs over their states, policies, and model parameters.
All right, so in this part we're going to look at the p model. Remember, it's the p's and q's. So the p is an agent implementation, that's why it's in italics, of a model of the observations through time, o with a tilde, states through time, s with a tilde, policy, and model parameters. So that's what this model is about, and it's
implemented by the agent that we care about. Here's what each of these lines says. So the first line: the left side of the equation, p, which is that function that we have in big on the top, is equal to the probability model of the parameters, theta (take those out), then the probability of the policy. So if it's not possible for me to jump a thousand feet, then it's not going to be part of my probability distribution. If it is in my power to jump a thousand feet, then I'll consider it in my policy estimations. And then the big pi: it's kind of like how sigma adds stuff up, pi multiplies stuff up. So this is saying multiply it all out through time, starting from the beginning, t equals one, and look at the p, the agent's model, of o through time, observations through time, given states through time. So the first part is just, given that I'm healthy, what's the probability that I'm seeing this observation? And then that's multiplied at each time point by the ways in which states, s of t, are conditioned on the prior states and the prior policies and the parameters of the model. So how all those things are linked, multiplied out through time, is what at a first pass describes this big equation on the top, the total model of o, s, pi, theta. Then, to dive into this p of observations given states through time, the model that the agent implements: big N notation means normal, and so this is saying the model of observations given the state estimates is a normal distribution, big N, over observations through time, with a certain mean and variance, mu and sigma, with a certain parameter, this lambda. And together these two, kind of like a tuple of variables, constitute this function, lambda. So the way that we're going to estimate this mean and variance may or may not be least squares; it might not be an L2 norm. We could use a functional approach here. That's the insight of amortization. Then the probability of states, that probabilistic transition model of how the agent believes states through time to be
linked, conditional on states at the prior time and policy at the prior time and then the parameters of the world, is related to another normal distribution of states through time, with another mu and sigma with a different subscript. And for those mu and sigma we have f with the sub theta, of how the states and the policies are related. So this is just another way to estimate the variance of different kinds of events happening, and therefore to weight sensory inputs. Together, the state estimation distributions are, again, estimated by a function. p of theta is given as a normal distribution with a mean of zero and this identity matrix of covariance, I believe, and I think this is just reflecting a neutral initialization approach, or regularization, but I'm not 100% sure about that. And this is reflecting that the probability model of policies is related to this softmax function. So it's a sigma, but it's not the same as the sigma squared in a statistical model, and it's the softmax function of the negative expected free energy of the policy, which we'll talk about in a later section. So that was our p. Let's look at q. q is this agent recognition distribution, so italic q of states through time, policy, and model parameters. Here's how that one is defined. So q is the probability of the parameters, theta, multiplied by the probability of the policies, pi, and then that same multiplication through time, for all time, of q, which is the model of states given observations. So previously we had p of o given s, the second line here, the probability of observations given states. Now we have q of s given o, so this is kind of going the other way. q of theta is a normal distribution, kind of as we saw before, with its mean and variance parameters, same as the policy and same as the q model of the states given the observations. And it turns out that those mu and sigma are also going to have a function that's going to approximate them. So, to make active inference applicable to the kinds of tasks that are
considered in reinforcement learning, the authors treated reward signals as observations in a separate modality. So that extends the generative model to include an additional scalar Gaussian (a scalar is like a single number) over reward observations, with a unit variance and a mean, and that allows them to wrap in this neural network, f sub alpha of the states, which is like this fully connected reward neural network with the parameters that they're then going to train. So that's really where the formal homology with model-based reinforcement learning arises: now we have pretty much the same architecture that was specified earlier in this discussion, where the agent's state, or sensory input, is run through this DNN and that results in this output of the policy or the reward. It's a little bit different, but that's how we see a lot of the same homologies. All right, so what do the learners do? As new observations are sampled, agents update the parameters of the recognition distribution to minimize variational free energy, or F. So the F of the observations is the expectation... okay, well, let me finish reading first. What this F is going to do is make the recognition distribution q converge towards an approximation of the intractable posterior distribution p, thereby implementing a tractable form of approximate Bayesian inference. All right, so F of o through time, with the tilde, is the variational free energy of the observations through time. So what is this magical free energy? It's not magical; it's just this equation. In this case it is an expectation that is taken under a specific recognition model, which is this expectation under q, and that's going to be related to these s, pi (policy), and theta (parameters) variables. What is this expectation going to be of? That's everything that's about to be in square brackets. This expectation is going to involve the natural log, ln, which is a way to turn a question about maximizing the goodness of a
model's fit into minimizing the negative. There are a few ways that the natural log can help. The expectation is going to be of the natural log of the recognition model minus the natural log of the generative model. So again, that's the part, q, that we can learn on, and then the part, p, that is characterized as intractable. It turns out that this expectation is always bigger than or equal to the negative natural log of p of the observations, so that's just the p model's likelihood of the observations. By minimizing the free energy of the observations through time, the agent converges heuristically towards this intractable p distribution. Given that q is only over the states, policies, and parameters, and that p also includes and focuses on the observations, what this allows the agent to do is to focus on modeling the states and the policies and the parameters that are relevant, via q, and then potentially use math, slash enactive behavior, in this model to tractably include, condition upon, or learn from its observations. So we could bring previous knowledge to the table, just like Bayesian models allow for, but also have a way to deal with sparse or dense information as it is. Okay. Crucially, active inference also proposes that an agent's goals and desires are encoded in the generative model as prior preferences for favorable observations, i.e., blood temperature at 37 Celsius. Free energy then provides a proxy for how surprising, i.e., unlikely, some observations are under the agent's model. While minimizing equation one provides an estimate for how surprising some observations are, it cannot reduce the quantity directly. So here, in equation one, we defined what F was, but it's just a definition and a bounding of what F is. It cannot reduce this quantity directly, so it's not as directly related to policy approaches. To achieve this, agents must change their observations through action. Acting to minimize variational free energy ensures the minimization of surprisal, the negative
natural log of the probability model of observations, or the maximization of the Bayesian model evidence, p of observations through time, since free energy provides an upper bound on surprisal. That was equation one. Active inference therefore proposes that agents select policies in order to minimize expected free energy, fancy G, where the expected free energy for a given policy pi at some future time is going to be defined this way. So we want to do more than just update our internal model to bound our surprise on our observations. We don't just want to fit the statistical descriptor model of the world; we want to reduce the uncertainty in the future, which is a future time point, tau, through the policy pi that we take now. This minimizable expected free energy function, fancy G, is going to take in the arguments of the policy selection and the future time step, so fancy G of pi and tau. That's the free energy function in this case, and optimizing on its value is what is going to allow for effective action. What it's going to be defined as is an expectation under the q model, which is about these observations, states, and parameters given the policy. So given that my policy is this, then how will the states and transitions of the world appear? And it's going to be the expectation of this quantity that's in brackets. It's going to be the log of states and parameters, so that's the q of the s sub tau and parameters given the policy. So that's, what states will I be in, and what will I think about along the way, potentially? These are just first-pass verbal explanations; just look at the math. But what states and parameters am I in at a future time, given my policy? So if I go on a walk every day, how will it influence my states? What does my model have to be at that time? And then that quantity minus the natural log of p of the observations, states, and model parameters given the policy. So the part that we actually want is for the observations to be the correct values, but what we can
do is just condition the states on policy. And that G value is always greater than (that's this last part of equation two) the negative expectation, under this q, which is the observations conditioned on policy, so that's like how things will actually turn out, of this log part of the generative model, that is, the observations given the policy. That's how things will go. So maybe there are other interpretations, or other parts that are really critical, but I think this is one of the key formulations: there's a policy-related optimization problem that can be phrased in a variational way and that is going to be bounded through a free-energy-like strategy, to tractably estimate some other function, q, something related to basically what we would want to know. Which is like, okay, take the observation that I look into my bitcoin wallet and there's 15 bitcoin: what policy do I have to take to see that observation, or for that counterfactual to exist? If only one knew. So that's why there's this tractable, boundable equation, potentially. I think there's a lot to say here, and Alec and anyone else, I'd appreciate any other input, but that was just one read on it. Okay, so that was most of the part that I wanted to go through at that level of depth. Now, just to convey the other sections and what they do for the paper, without going into every equation as much, and then talk about the figures. So 3.2 is the learning and inference section. In order to actually implement this optimization scheme as it's laid out, it turns out there has to be an update process. So is it going to be a sort of stick with what has worked, or is it going to be novelty search? Those are the kinds of things that optimization algorithms have to navigate: the trade-offs that are implicit in different strategies, depending on different problems. And so this section defines how this F, variational free energy, is defined through time. I'm not going to cover the implementation details, but Alec, it would be cool if you
showed us a simulation or something like that. 3.3 is about policy selection. Here's what they write: under active inference, policy selection is achieved by updating Q of policy in order to minimize free energy, fancy F, given the prior belief that policies minimize expected free energy, i.e., p of policy is related to this softmax of the negative G, the negative expected free energy of the policy. As specified in equation 3, free energy is minimized when the Q distribution of policy is the same as the optimal free-energy-minimizing expectation across policies, just like the free-energy-optimal policy selection. For discrete action spaces with short temporal horizons, basically, G of policy can be evaluated in full by considering each possible policy. So that's like Connect Four or tic-tac-toe: you can run out every single option for the game at that branch point, compute the exact value of every single possible outcome, and then make your choice. However, in continuous action spaces there are infinite policies, meaning an alternative approach is required. That's a really cool sentence, because it conveys well that the continuous action space, even for a single variable, is a truly different domain than, let's just say, a prisoner's dilemma game. There's sometimes almost no overlap in the algorithms. That's why this paper actually reflects quite an advance, and there are more details in that section. 3.4 is trajectory sampling. So now, how do you go from just having this distributional relationship between policies and other variables, like those related to generative models of the world and observation models, to a specific policy? How are they going to get there? To evaluate the expected free energy for any specific policy, first they have to evaluate the expected future beliefs conditioned on that policy. So one can't really undertake... in this sort of normative framing, and it's a secondary
debate, which we can definitely have, about to what extent these are experienced cognitively by humans, but we're just going to kind of go with it, with the intentional stance and the sort of machine learning, control theory take. And so beliefs: it's not necessarily cognitive, we're just thinking about what the agent wants to do to succeed. But basically, if the agent doesn't have an expected future belief sequence for that policy, it will be extremely difficult for them to undertake it. And the fact that the transition model is probabilistic, and that the parameters of the transition model are random variables, induces a distribution over future trajectories. So rather than just playing out every possible chess move, we're thinking about how we are fitting a more advanced kind of model. Chess is still discrete, and we'll see a specific example of continuous optimization later in the figures, but this is like having a distribution over future branch points in a chess game. So whether it's discrete or continuous, you can still have this element of uncertainty, of a distributional range over future trajectories in state spaces. So, several approaches exist to approximate the propagation of uncertain trajectories. For example, one can ignore uncertainty entirely and propagate the mean of the distribution, or one can explicitly propagate the full statistics of the distribution. So those are kind of two extreme approaches. One of them is to just say, yep, ensemble modeling is tracking the mean, and so I'm going to keep that moving forward. And then the alternative is basically to keep a summarization of the entire distribution and use that as what is being learned by the population. In the current work, a particle approach is utilized, whereby Monte Carlo, so sampling-based, samples are propagated. In particular, we consider this big B samples from the parameter distribution, theta, which is drawn from q of theta. These sections convey how policy selection is accomplished in
the model, and how the policy fitness landscape is explored in order to implement predictive control on trajectories. So implementing a policy through time must involve some element of predictive control. All right, then there are sections 3.5 and 3.6. So 3.5 has details on computing this expected free energy value, as the title might suggest. In this section they describe how to evaluate negative G for a policy, where they have used this notational convenience to talk about policies moving forward, so they kind of just made it a naked pi, rather than one with this superscript that has to do with the time, states, and uncertainty. And the negative expected free energy, which is what we are trying to get at here with the negative G, is equal to the sum of these negative expected free energies through time. And then that is where there's a decomposition. It's not my area, but I would think what it means is that there's a decomposition into this state information gain and an extrinsic value component. So this kind of breaks down into explore and exploit components, or some other trade-off. We see this a lot with this negative G. I would love it if somebody could talk a little bit about why this is, or if it always is the case, but it always seems to be presented as model accuracy penalized by model complexity, or this plus this. Is it always two parts in the equation? Is it ever just one term? Can it be three terms? Here's three terms, or four. So I'm not exactly sure why sometimes it has more or fewer terms, but that's what I'm curious about. Then in 3.6 they write: the model presented in the preceding section serves as the most general formulation, applicable in both partially observed and fully observed environments. In what follows, we describe an implementation for the fully observed case, leaving an analysis of the partially observed case for future work. So it's just another wrapper around the fully observed case to have the partial observations, and so they decided to
use some benchmarks that allowed them to basically have total observation. And this isn't like cheating. The examples they're going to describe are kind of like balancing a dinner plate on a stick. So in that case you do have total observation of the system. Controlling it may or may not be possible, given a certain stick and motor reflex time, but at the very least you can observe everything. So there are also unobservable control theory problems, and then there are ones where the system is actually understandable and controllable, like the dinner plate, but you only get distorted or malicious information. There are other ones where you get perfect information, but only at certain time points, and there's a game going on. So there are a lot of different ways that that can play out, but those are sort of the parts of the model that can get switched out, and that's also what I think it will be interesting to hear from any of the authors on. So in section four they go through some experiments, and this is kind of the results phase of their paper. They describe how they investigate, one, whether the proposed active inference model can successfully promote exploration in the absence of reward observations, i.e., exploration. So that's what's really difficult for the model-free reinforcement learners to do: when it's not getting the reward, it doesn't do very well. It can't explore very well because it's not winning. And two, whether the model can achieve good performance and high sample efficiency on challenging continuous control tasks, i.e., exploitation. So that's what requires tracking on something. We evaluate these two aspects of the model separately, leaving analysis of their joint performance, i.e., the exploration-exploitation dilemma, to future work. So again, show it separately in the kind of pure use case, show that it can do either one, and then it will be something like a trade-off between the two of them, or a wrapper layer that will end up mediating this trade-off. All right, the explore
task that they're going to do is called mountain car. You're a little car at the bottom of this valley, and you can push forward with the forward pedal or you can push reverse with a reverse pedal, and you don't have a strong enough engine to get up to the yellow flag in just one go. It's like you're at the bottom of the hill and you're in gear 3-7 on your bike, but you can go up a little bit and back a little bit. And so here you have to kind of explore and figure out a strategy that helps you explore more and more, in order to get enough speed by going up this left hill to reach the yellow flag. And then the exploit tasks that they're going to be talking about are twofold. There's the inverted pendulum task, so here we're still a little cart or something like that, and we can move forward or backwards on a track, and then there's the pendulum that we're trying to keep upright. So that's like the dinner plate on a stick. Then there's the hopper task, which is a jumping motor coordination task. So let's look at what they actually did for each of these two experiments, and then see how the active inference learner stacked up against other algorithms. So the mountain car example is a one-dimensional track positioned between two mountains. The goal is to drive up the mountain on the right. However, the car's engine is not strong enough to scale the mountain in a single pass. Similar to what was said, the only way to succeed is to drive back and forth to build up momentum. What was cool about this, I thought, was that the code was on the OpenAI site, and their github.com link is here, so it could really be seen and played with and used as a standard. So it seems pretty cool; I didn't know about this resource for standardizing different learning algorithms. So how did the active inference agent do? Well, it did well in this test; otherwise they wouldn't have made the paper. And the way to read these graphs: on A, B, and C you're going to have the same X
and Y axes. So I put the map with the engine on the top left, so it kind of maps onto the bottom-left panel, A. The position is on the X axis, so that's how far to the left and how far to the right that little train car goes, and then the velocity is whether it's at rest, zero, whether it's moving positively, to the right, or whether it's moving negatively, to the left. So the reward agent, who just sort of pursues momentum whichever way, I guess, only ends up getting a max speed of around 0.02 in either direction, and then only moves like 0.7 or something in either direction. The epsilon-greedy agent does manage to learn a short-range policy that, we can see, brings it a little bit further on one side, and it does take some rare excursions into slightly higher velocity regimes, but in the end it doesn't really get too far. Like, it learns how to go uphill when it's going uphill, but that only helps it to a limited extent. Whereas the active inference model, as implemented, has not just a broader sampling within the state-space coverage after 100 epochs, but its position extends to a much wider range in the x, and it also has a wider velocity distribution range in the y. So it was more able to get out of this little ditch by being an active inference learner than by being epsilon-greedy or just sort of model-free, reward-based. All right, the second optimization example presented in the paper is this Hopper-v2. Again, I didn't know about this resource. So this one is interesting. It is described as making a two-dimensional one-legged robot hop forward as fast as possible, and the code is also there. I'm not going to put the code up, but it kind of reminded me of this idea of evolutionary computation as control, and there are so many angles to this. First off, just this image with a checkerboard floor: it reminded me of some simulators of evolutionary computation, like for Linux and stuff like that, where you could have these blocky aliens that would replicate and take up a ton of
computational resources and just crash into each other. Sometimes it was just the outline, other times it was 3D blocks with this kind of a background and a floor plane. But here what we're evolving is control strategies, not chromosomes representing genotypes, let's just say. But actually a similarity is this idea of the genetic algorithm, and genetic algorithm stochastic search, because sometimes the way that these control policies are optimized is actually through steps that involve things like genetic search algorithms that switch over large sets of their parameters. So there are just some interesting parallels, and I just thought this was a cool example, at the intersection of motor behavior control but also complex state space estimation, even the beginnings of an enactive and maybe even embodied approach. Because at some level of movement complexity it's going to be implicit in the system, and distributed in it, just like in other systems, that it has some notion of its own affordances. If it had 500 muscles and it was a human leg, it would have some sort of tensegrity structure that kept it balanced in a way that facilitated some types of actions but not others. So it's a very interesting choice, and I'd like to hear from any authors or from other people: what are other benchmarks that are cool? What other control systems problems are interesting or applicable? What about social versions, or networked or communication-based versions? So what were the results? Okay, so here is how to look at these graphs. The x-axis is the epoch, so that's just the sampling through time; that's how many generations of parameters are tracked. And here the red lines indicate the same interval of time, so C and D are related to epochs 1 through 100, and that's actually compressed up against a very small unit on A and B. And A and C, respectively, are for this inverted pendulum, and B and D are for the hopper task. So we can talk about the same phenomena looking at A and C in reference to the pendulum
and B and D in the hopper. In both of these cases, zoomed in or zoomed out, the y-axis is reward, so in the hopper task that's just like how far you're getting, and in the inverted pendulum task it's about your time maintained, given a certain parameter combination for the policy. And what is pretty apparent is that in both cases the reward sharply increases for the active inference agent. For the pendulum it also just immediately starts climbing after some sampling, within the first 100 or even the first several epochs, probably because of the smaller state space of the cart, because all the cart can do is move forward and backwards and have a policy on that, like gas and brake and stuff like that; that's all cooked into the physics of the pendulum. I wonder if there's an inverted double pendulum task, by the way; that'd be kind of interesting. And in the hopper task there are several more parameters at play. I didn't copy out the exact number, but there are several more, so it takes a couple more epochs, reflected by the zero through maybe 40 where the average is, like, not really getting off the ground. And this is being contrasted against DDPG, which you can read more about. I'm just not going to go into what this algorithm is; I just trusted that, all right, the authors used something that makes sense. Let's hear from anyone about what other alternatives could be tested against. Those would all be things where I'm just not as up on the literature, but I definitely would welcome the comments of someone from the machine learning field. What would be an impressive or useful demonstration of active inference? Or what would it mean to do a control theory simulation in this sort of a framework? What else was used? What else is the current state of the art for these inverted pendulum and hopper tasks? I just don't know, but it'd be great if anyone does know. So, just to close out with the relationship to the previous work. Okay, this is from their closing section, which, maybe it's a disciplinary thing, but often it's like a
discussion section, but if it was relationship with previous work I would have expected it to be a little bit more contextualizing and at the front of the paper. But it read more like a discussion. Related to deep active inference, which is like this through-timeness of active inference, they write: our work builds upon these previous models by incorporating model uncertainty and its active resolution. We extend the previous point-estimate models to include full distributions over parameters, and update the expected free energy functional such that the uncertainty in these distributions is actively minimized. This brings our implementation in line with the canonical models of active inference from the cognitive and computational neuroscience literature. Moreover, it enables us to evaluate the feasibility of active exploration under the scaled active inference framework, apply the model to more complex control tasks, and obtain increased sample efficiency relative to previous models. So it's kind of like saying we've made this active inference model closer to what it has been speculated to be in the qualitative neurobehavioral realm, which I think is a good reason to be reading it right now, because it does fit in nicely with our discussions of behavior and of neurobiology in the variational ecology discussions. So this is, on the simulation side, how it comes to be this way. And this is kind of cool, because deep active inference is now implementable through these kinds of models that are presented in this paper. Previously it was like, oh yeah, I guess if it were deep active inference then it would accommodate narratives, and now this is the kind of framework where, oh yeah, okay, we could talk about narrative through time in this framework. Could we? The second area they address, again another keyword for the paper, is model-based reinforcement learning: in the current work, we opted for Bayesian neural networks to ensure consistency with
the variational principles espoused by the active inference framework, but note that ensembles can be made explicitly Bayesian with minor modifications. So they talked about several homologies to reinforcement learning, and specifically amortized inference. There's actually one quote that I pulled out from one of the papers that are on the slide, and that talked about how you might get the sense that this is like a win-for-free situation, just wrapping a functional around the estimation of these parameters, because it's like, oh, it's a neural network, I'm implementing a neural network, so I have an additional degree of freedom to learn a nonlinear policy. But then the paper argued that actually you're constrained, because the functional is estimating the mean and the variance of this Gaussian outcome, and so it doesn't actually liberate a degree of output expressivity, but rather gives a definable and scalable way to implement, on current computational hardware, this massive pencil-and-paper problem of estimating the Gaussian mean and variance of super, super complex high-dimensional state spaces. But that still is what it's doing. So, super interesting stuff, and also the note here that the ensembles could be explicitly Bayesian, with all the kinds of variants and flavors that that introduces. So for anyone who's an enthusiast or expert in these areas, I would appreciate their perspective on what that means, or why this is important or interesting. All right, then info gain, and this was one of the keywords too. So: identifying scalable and efficient exploration strategies remains one of the key open questions in reinforcement learning. Model-free methods such as the epsilon-greedy or Boltzmann choice rules, paging Dr.
Proton Sean, utilize noise in the action selection process or uncertainty in the reward statistics. A more powerful approach is to construct a model of the world, allowing the agent to evaluate which parts of the state space it has and has not visited. This allows for measures such as the amount of prediction error, or prediction error improvement, to be utilized for exploration. Okay, cool. If the learned model implicitly or explicitly captures probabilistic features, so that's the statistical regularities of the niche, then information-theoretic measures can be used to guide exploration. So there's so much to say here, and it's very interesting to bring in the information gain at the end. Really what it's saying is that this is a learned model of active inference, and the exploration is tied into the generative model of the world such that actions are facilitated not just towards the straightforwardly rewarding (that's sort of the model-free RL), nor merely the deep-model rewarding. That's like, I love running marathons because it's healthy. I'm not, you know, saying it's not healthy; it's just the mindset that helps one get there, for sure. And then this is saying we can actually go another level beyond that. We can say, with the deep generative model of what one is likely to be doing, we can maximize precision, which may be something you hear amongst the, you know, ultra, ultra committed in an area, something like: this is just how and who I am. So in that sense it would be going beyond this whole, oh well, this many miles per day is healthy for me, which is a great motivation. It might be a multi-scale thing. That's just one example, and definitely, I know, even for me it's just changed through life. So it just shows how it's related to your niche, your context, your social relationships, and so many other areas. And just to kind of close on one thought on this information gain: information gain surprised me at first, because I was trying to connect all this to information theory, and I
realized that what it's about is reducing uncertainty on target parameters. Information is about reducing your uncertainty about things, and here it's just about reducing our uncertainty about something else, or more specifically, that something else being a multi-scale model that includes some attributes that we're probably pretty familiar with, like the observation model, some aspects that are homologous to a reward model, like sort of a preference model, let's just say, but then some aspects that are maybe new, and that might be the part about the free energy. And then that area, as I understand it, is related to another type of information theory, and to model selection and the variational framework. So if that is an area that anyone has insight into, I think it's pretty cool. Maybe we could try to unpack that. I'm not sure what it looks like, but it's definitely something I wondered about, minding the Ps and Qs. And thanks, Alec, for a little bit of the email discourse clarifying a few aspects of the paper, because it was pretty interesting, and I think there are a lot of implications for what systems we can possibly apply this to, proximately or in the medium term. So, tons of interesting stuff. We'll be talking about this on November 10th and 17th, 2020, at 7.30 a.m. PST. And I think that is it. And let's see, yeah, one closing thought: all we can have is a deep generative model of the space, intertwined with action, so that the parameters we explore are the ones that are at the trade-off of successful and optimally informative. Sometimes we veer towards successful, which can be not optimally informative. Other times we are getting informed greatly in a way that may not be successful. But overall, we manage that trade-off well. It's how we've gotten this far. So great work, everyone. Thanks a lot for listening. Keep it up. Thank you for participating. We do provide follow-up forms to live participants, and it would be awesome if anyone had feedback or suggestions or questions. You can stay in
communication with us. And there's one more slide on that reparameterization trick, and the slides from that paper, but other than that, thanks for listening and for bearing with the anomalous camera error. But yeah, I'm looking forward to the end of 2020 coming in with all these cool discussions we'll be having. We would love to have some new participants come online, or bring in some new perspectives that we haven't considered, or talk about it from a beginner's perspective, from any number of fields as a starting point. Just let us know. That's the kind of stuff we just always appreciate hearing about. So have a good November and end of 2020 season, and we look forward to talking with you. All right, bye.
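
As a rough illustration of the closing point that information gain is about reducing uncertainty on target parameters, here is a minimal sketch in Python. It is not the method from the paper. The two-hypothesis setup, the likelihood tables, and names such as `expected_information_gain` are assumptions made up for this example. It scores an action by how much it is expected to shrink the entropy of a belief over a parameter.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def posterior(prior, likelihood_of_obs):
    """Bayes rule over a discrete set of parameter hypotheses."""
    joint = [p * l for p, l in zip(prior, likelihood_of_obs)]
    z = sum(joint)
    return [j / z for j in joint]

def expected_information_gain(prior, likelihoods):
    """Prior entropy minus expected posterior entropy.

    likelihoods[o][k] is P(observation o | hypothesis k) for one action.
    """
    gain = entropy(prior)
    for lik_o in likelihoods:
        # Probability of seeing observation o under the current belief.
        p_o = sum(p * l for p, l in zip(prior, lik_o))
        if p_o > 0:
            gain -= p_o * entropy(posterior(prior, lik_o))
    return gain

# Hypothetical belief over two candidate parameter values.
prior = [0.5, 0.5]

# Hypothetical action A: observations depend strongly on the parameter.
informative = [[0.9, 0.1], [0.1, 0.9]]
# Hypothetical action B: observations look the same under both values.
uninformative = [[0.5, 0.5], [0.5, 0.5]]

ig_informative = expected_information_gain(prior, informative)
ig_uninformative = expected_information_gain(prior, uninformative)
print(ig_informative, ig_uninformative)
```

Under these assumed numbers, the first action is expected to reduce uncertainty about the parameter and the second is not, so an agent that values information gain would prefer the first when exploring, even if neither action is directly rewarding.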