Welcome everyone (just adjusting things here), it is May 4th. May the 4th of '22 be with you, and we're here in ActInf Lab livestream number 43.1. Blue, I don't know how you do it, always joining during the theme song. Welcome to the ActInf Lab. We're a participatory online lab that is communicating, learning, and practicing applied active inference. You can find us at the links on this slide. This is a recorded and archived livestream, so please provide us with feedback so we can improve our work. All backgrounds and perspectives are welcome, and we'll be following video etiquette for livestreams. Check out activeinference.org if you want to learn more about what's happening in the lab and get involved and participate in the livestreams or any other activities. All right, well, we're here in livestream number 43.1 and are continuing the discussion of the paper "Predictive Coding: a Theoretical and Experimental Review" by Beren Millidge, Anil Seth, and Christopher Buckley. In 43.0 with Maria we went over some of the background and context and did a first pass on various aspects of the paper, but by no means exhausted it, because it's a long and very intricate paper, and it also touches on a ton of other areas of more general interest, so there's a lot to discuss. In 43.1 today we'll see where things go. If you're watching live, it would be awesome to have any questions in the live chat; otherwise we will just start with some introductions and pick up with whatever people are finding salient and exciting. So we'll just start by saying hello and going from there. I'm Daniel, and I'm a researcher in California. Thanks a lot, Blue, for joining.
I am a researcher in New Mexico, and yeah, I'm excited to be here today. This paper was really intricate and detailed, and it provided a lot of background that I don't have, because, you know, there are so many active inference papers that I want to read but I don't have time to get to them all, because I like to read things very thoroughly and slowly and carefully, and look up the things that I don't know, and do the back references, travel down the scientific rabbit hole, so to speak. So yeah, this was a good paper that kind of helped me, you know, shirk all those other active inference responsibilities that I have, and I came out with more questions than answers, which is always a good sign of a good paper. Awesome, well, how about going into the paper: what questions did you have? Maybe we'll get to how our questions evolved, but what were you expecting to find in the paper, and what questions were motivating you to seek it out? So I have always had a hard time with: what is the difference between predictive coding, variational Bayes, and active inference, right? And I was really hoping that the paper would kind of lay that out for me. And, well, I mean, I kind of got a good historical overview of how they all are interlocking and overlapping, which I didn't know before, because none of these things are simple, and there are intricate details in all of them, and I don't know all the ins and outs, but I think this gave a very good historical overview of how these things kind of layer into a nice little sandwich.
Great, definitely a good starting place. Let's also add the Bayesian brain to the mix, and start to look at maybe some similarities too, because they're not exactly addressing the same area, and they're compatible, often more compatible than not, so getting at what distinguishes them is very important. One way that happened, and it came up in the dot zero, was that so much of the paper was dealing with inference on the relationship between observations and unobserved latent states (think of trying to infer the temperature of the room given thermometer readings). That was so much of the paper, and then action comes into the picture in section 4.5. The authors write that predictive coding "can also be extended to include action, allowing for predictive coding agents to undertake adaptive actions without any major change to their fundamental algorithms," but that's 50 equations deep. So it was very interesting to see how, if those 50 equations are kind of like the stem (I know that the equations are not strictly sequential and building on each other in all of these cases), action then gets tucked into the bigger picture very seamlessly. I think that speaks to the relevance of distinguishing how they're similar and different, and why it matters. So how should we even begin to look at that? Well, action actually came up earlier in the paper, before 4.5: action came up in section 2.3, with dynamical predictive coding and generalized coordinates, and this I thought was a really good example. I was craving, and still am craving, a really simple example. It's one thing to talk about the equations, go through all the letters, compare this and that, but it would be really great to see a super simple example worked through from start to finish, using all of these observable states, latent states, complexity, accuracy. It would be really great to see that at some point, and if you're listening to this
livestream and want to come on and give a guest tutorial on how we actually work these equations into some very simple problem that we can solve in real life, I would be more than happy to facilitate that presentation, because I can't even jump from all these equations into how they apply. But even in 2.3 there was kind of a good-ish, real-life example about inferring, you know, the velocity and the acceleration and the jerk, just given the observed coordinates. And it talks here also about free action, which I had not heard of before: instead of minimizing the variational free energy, it says we must instead minimize the free action, which is F-bar, and I've not seen or heard it referred to in that way. But it's modeling a time series of observations, so if you know the coordinates at time one, time two, time three of a swinging pendulum, for example, then you can kind of guess how fast the object is moving, or you can infer these other variables. So I hadn't seen that free action before, but it did come up even before 4.5. Yes, and it speaks to the expected free energy, EFE, which is what you expect the free energy to be; and then there's the free energy of the expected future, which one of the authors has worked on, which is a slightly different variant; and then here's another way to look at the free energy of some future time steps. And it speaks to the difference with the variational free energy, which is what you're able to calculate right now with the information at hand, looking backwards. So the more perceptual aspects of inference can be calculated using the variational free energy approach, but anytime the future is coming into play, there's uncertainty about the observations, as well as uncertainty about what actions one will take, let alone the consequences of
the action that one will take in the future. So we do need a different way to calculate it, and then, as you're kind of getting at, this is a different way than we've seen before, with a minimization of free action. But let's return to this page and think about how we can distinguish, with a Venn diagram or something, how these core ideas are linked. One couple that we can highlight is predictive coding and predictive processing; do you have any thoughts or questions on that? Yeah, so maybe not a Venn diagram, but maybe we should build a pyramid, or a multi-level building, because based on what I read in the paper, I think predictive coding kind of starts it all off. And also it makes sense to make a layer cake, being that these all kind of start off with the layers of cortical processing that happen quite literally in the brain, so I think making a layer cake maybe makes the most sense. Okay, so we're in the kitchen, the active inference restaurant kitchen, and what are we going to build, or what is the bottom of this layer cake? So I think there might be two layers on the bottom, one from Helmholtz: we can start with the perception-as-unconscious-inference view of Helmholtz, and then predictive coding as described by Rao and Ballard in their description of the visual system and the visual cortex. Okay, and again, kind of calling back to the dot zero, the two qualitative philosophical ideas that the authors invoke, and that Maria helped us unpack, were Helmholtz's notion of perception as unconscious inference, and Kant's notion that a priori (beforehand) structure, like prior structure, is needed to make sense of sensory data. So we're going to have one bottom layer of the cake with Helmholtz, and also with the Kantian notion; this is going to be our qualitative, philosophical layer of the cake, these two notions. And then you mentioned another critical work, which
was Rao and Ballard's 1999 work, building on a broader history of applying predictive and anticipatory approaches to different neural systems. In the dot zero, the two neural systems that we looked at were, first, the retinal system, with Srinivasan 1982, which is looking at retinal physiology and the electrophysiology of photons hitting the retina; and then another area where predictive approaches, especially early on, were getting built out towards was the cortical hierarchy and the different layers in the mammalian cortex. So these are two anatomical exemplars of some micro-circuitry, some histology, that is compatible with a predictive approach. Okay, so we have these two basal layers, and we might even add more basal layers, but we have biological systems that are doing predictive things, and that includes the retina as well as the cortex. Also, of course, if anyone has questions we'll get to them in the right way, so please ask any questions or make any comments. So we have the retina and the cortex as examples, but there are others. Okay, and then if there's a third layer that we can add to the bottom, I think it would just be Bayesian statistics; well, so I think it comes maybe before that also, or maybe Bayesian statistics is like a layer two, but before the Bayesian statistics, and even before biological systems doing prediction, there's the information theory component that I think is important also. Okay, so we have: this is qualitative and philosophical, then we have biological and actual, and then we have a formal area, and that'll be information and Bayes. Yeah, and the information goes under the Bayes, I think, just like the Helmholtz comes under the Kant, right? I think it's really the relationship between information and probability, which we explored a lot in the FEP of generic quantum systems livestream, number 40 I think, so many streams ago. Okay, this is fun, and I think we're
going to start building this into a tasty structure. So we have the philosophical coming in on the left, and that's one on-ramp that, just like we discussed earlier, has been almost immediately available to thinkers for thousands of years. These are the kinds of qualitative claims that anyone can experience, whether by thinking about the blind spot or about how they're sense-making in response to some stimuli. This is just the idea that sensory observations are, first off, not direct contact with the world: we're not seeing the lightning strike event, we're seeing photons hit the retina and sound hit the eardrum, and so we're receiving sensory input. That's the kind of Plato's cave angle, and then Kant and Helmholtz elaborated on that to include this idea of unconscious inference that requires a priori structure, or what we would call a Bayesian prior. Then we have this biological area, and these are just reflecting natural systems that are exhibiting some kind of behavior. We can think about the 1999 paper of Rao and Ballard themselves, about the functional interpretation of extra-classical receptive field effects, and we talked about that in the dot zero. The classical effects are kind of the simple, normative ones; the joke is that it's "classical" because of the order the papers were published in, but that doesn't make it true, let alone the best model. The classical is just like classic rock: you might like other genres, but then there are the classics, or whatever. So that's the biological systems that are exhibiting certain kinds of outcomes, and then we have these two more formal or quantitative areas, and those are Bayesian statistics and probability, as well as information theory. They're definitely linked as well, but we'll just leave them as kind of two conjoined layers here. Okay, now where do we go from here? Well, isn't that the question. So maybe predictive coding and predictive
processing, and really, what's the difference between these two? I think predictive coding came from neuroscience, so maybe that layers on top of the biological part, the biological piece of the cake, and then predictive processing might go onto the quantitative piece. Awesome, yeah, and it really is what Maria highlighted and drew out in the dot zero, using a quote from an Andy Clark book. Clark wrote that predictive processing is "not simply the use of the data compression strategy known as predictive coding." So at least on this take, predictive coding is like MP4: it's an encoding strategy, which is also why we connected it to frame differencing and video encoding. Predictive coding is going to be something that's data-oriented, and it's related to information theory, compression, information encapsulation. And then Clark is writing that predictive processing is not simply that, and so Maria spoke to how another way to think about the difference between predictive coding and predictive processing is to use "coding" for the formalisms and implementations, and "processing" for the philosophical understanding that prediction is the basis of signal interpretation, as opposed to merely recognition or descriptive models. So if anyone has a thought on that, put it in the live chat or come get involved, but let's work with that delineation going forward. So predictive processing is going to be, sorry, predictive coding is going to be closer to the formal: whether it's predictive coding in a biological context or in a philosophical context or something, it's going to be more on the formal side, whereas predictive processing is going to be a little bit closer to the philosophical. Yeah, I think so. And then maybe, does the Bayesian brain make up the third part of our cake? If predictive coding layers over the Bayesian statistics, and predictive processing layers over
the philosophical, then does the Bayesian brain somehow make the in-between sandwich that goes over the biological? Okay, okay, so let's see. We have predictive processing as being a philosophical approach, in consideration of these philosophical memes and themes in the biological case, which we're certainly bound to; then we can have predictive coding as kind of the formal way that these predictive, anticipatory systems are implementing predictive processing; and then what is the Bayesian brain? It seems pretty fair to put it at the intersection between Bayesian and brain, but what is the Bayesian brain to you? So to me, the Bayesian brain is, well, I'm trying to look back through the paper to their description, but in my recollection of the Bayesian brain, which was probably my first introduction to Bayesian statistics, I was like, oh, this is great. And it wasn't even the Bayesian brain paper itself, it was the Jaynesian interpretation of the Bayesian brain paper, but that paper was probably my turn-on to Bayesian statistics and implementing this. So in my mind, the Bayesian brain is just this idea that the brain uses Bayesian reasoning, or Bayesian updating, to process information and go forward. But that's probably a very simplified interpretation. I think there's going to be one more realist interpretation, that the brain is doing Bayesian stuff, and then there's the more instrumentalist interpretation, that we can use Bayesian statistics to model what it is that the brain is doing. So is Bayesian the territory or the map of the brain? But we'll just leave it as an edge for now. So I'll read a question from Dean, and then I think we're going to be exploring a lot of where action and active inference come into play. So, Blue, Dean writes: what do you think of this? As coding, you are doing something; as processing, you determine by comparing, processing and/or not processing, as you're doing; so process entails
processing types. What I'm hearing there is: encoding doesn't require two arguments, it just takes one. You take one file and you zip it; you've encoded that, in a way. So predictive coding can encode and/or implement some type of informational algorithm without a reference or a comparator, whereas Dean is saying that, as processing, you determine not with this single-argument plug-and-chug, but by comparing as you're doing. And so predictive processing does have that sort of top-down and bottom-up, as it's often visually represented, but processing is entailing that full stack, and therefore the processing is requiring something like a minimum of two to have consideration. If you're processing nuts, then you're sorting them into two different kinds, or you're removing something from something else; you're doing something that has compositionality or some type of multiplicity to it. It's not just something where you can take that one file and just zip it. So that's a great point, Dean; I think I'll add two little arrows here. Yes, okay, yes, Blue, let me share the slides. Thank you, I was left in the dark here. I definitely agree with Dean's point about coding versus processing. Like, I can write a whole lot of things, right, I can take a whole lot of notes, and that's coding; that's encoding the information from our conversation, this livestream. But until I do something with those notes, like turn them into elements of a paper that I want to write or something like this, I have not processed what I have encoded, until I take them forward a step. So I think there is kind of this temporal implication with processing versus coding. Yeah, and also the timeliness of processing: you want to process the food before it has become, you know, bad or something. So, okay, we have predictive coding; again, these are just tentative slide plays, and if anyone thinks differently they can join or write a comment, but predictive coding might entail
something of a unique directionality, just like a .zip: you say, well, somebody will unzip it later, but we can encode it without that part of the loop being required. And then we have this two-directional relationship with predictive processing, being top-down and bottom-up. Okay, what do you see now, or shall I bring up something? Yeah, so I'm not really seeing anything, so why don't you bring up what is on your mind, or what you are thinking of. Where is action? And then I think that'll start to walk us towards where active inference is. So, because of the temporal nature of processing, I think action might start in the processing elements, even in something like mental action, like people doing thought correction or thought remediation. So if you have this prediction and it surfaces in your mind, but then you correct the prediction... like, you think you see a snake on the floor, and you say, oh, that's a snake, but then you say, let me see that snake, and you have to zoom in, look at it closer, and it ends up being a coiled-up rope, right? So it's not really a snake, but you directed your attention because of some prediction, and then can continue to process that prediction. So possibly action starts in the processing area, or maybe it's there in the middle. Well, I'm going to start it in the middle, and we can see where it goes, because I think it's going to have an important edge to each of these three areas. There are a few ways to represent it, of course. So how does action relate to biological systems? Or first, let's start with some philosophical frameworks that highlight the importance of action. This includes enactivism, embodiment, the four E's, five E's, seven E's, et cetera: everything involving this sort of embodied, enacted, encultured, extended approach to cognition and philosophy. And those can be qualitative, so I'm putting that
edge here. These are different qualitative and philosophical areas; yes, they can be formalized, and we'll get to that edge, but these are qualitative memes that come from this philosophical area, and they're going to draw us to action. Okay, now how about biological systems and action? What do you think about when you think of biology and action? I don't know, I think those things are kind of inextricably linked in my mind, because you can't have life without having some action. I mean, cells replicate. I think this is going to be something that doesn't demand to be stated but is totally true: biological systems are active. It's active matter, and life is this multi-scale organization of activity, so there are many ways biological systems are doing action. And what are some areas that are formal, whether or not they have to do with biology or any philosophy? Can you name a few areas of formal theory and science that you think relate to action? So, I mean, I think, and that's maybe more related to processing, but as I said earlier, I think processing is related to action: Bayesian inference and variational inference, which also tie in a lot to active inference, I think that they come in there, in the processing, processing-through-action or action-through-processing. Okay, great. So Bayesian planning-as-inference, as well as areas like cybernetics, control theory, and different formal ways of modeling active systems and modeling decision-making; these could be related to a biological system or not, and they can be drawing on a philosophical framework implicitly, explicitly, or not at all. All right, so now this is starting to get fun. You added another term there, and I think it will merit a detour, but then a return to here, which is the variational Bayesian approach. So what do you think "variational" is meaning or doing here? So it's interesting, and part of what makes this a complicated paper, and also a complicated detour, is that
the authors cast predictive coding as variational inference. Even though we've kind of made this distinction between predictive coding and predictive processing, where coding is like a one-way street and processing is like a two-way street, the authors nevertheless use the term predictive coding as variational inference. And for this predictive coding, or this variational inference, I think the key construct, and it probably overlaps with the Bayesian brain too, is the idea of having a model of the data-generating process; so I think variational inference kind of throws the idea of a model into the loop. Okay, so let's zoom in on this blue corner of formal and quantitative areas. These can be taken in an action-independent way, they can be explicitly about action, like planning-as-inference, or they could just more implicitly rest upon action. For example, in the case where we're inferring the hidden state of the temperature of the room and we're observing the thermometer, that model doesn't have a pi, it doesn't have action in it; it might just have those two parameters, the room's temperature hidden state and the observations of the thermometer. But then we can kind of take a step back and see, all right, well, there's the person who's engaging in this experimental action, the person's oculomotor system zooming in on the thermometer. Maybe we can just abstract that away, but action is always baked into it, because we're talking about active systems. But let's put some of that aside for a second and take a quick detour to talk about variational inference, because it's a really important topic, and we've also explored it in ActInf livestreams 26 and 37 and a lot of other times. So let's look at three different ways of doing Bayes, Bayes' ways, Bayes areas as it were; maybe there are more, but the three that we can list up here are
exact Bayes, Monte Carlo, and variational Bayes. So pick one, and then, what is something important to know about that way of doing Bayesian statistics? Well, I'm by no means an expert in Bayesian statistics; I actually cracked out my Bayesian data analysis book while I was reading this paper, because it was one of those papers where, like I said, I'm anal-retentive and detailed, and I have to look up every little thing I don't understand. So okay, let me start with exact Bayes. When I think about exact Bayes, it's the probability of A given B, like the probability of rain given clouds, or something like that. And it's an alternative to frequentist statistics, where the probability of rain and the probability of clouds, or a coin flip, is easier to talk about, right? Each flip of the coin is a half, and so the chance that you're going to get heads five times in a row is half times half times half times half times half. So in frequentist statistics, the probability that you're going to get rain and the probability that you're going to get clouds together is the probability of rain times the probability of clouds. That's a great way to put it. And then Bayesian statistics evaluates the probability of rain given clouds, given what you have already observed before, like: when it's rained, it's usually been cloudy this percent of the time. And it takes into consideration the probability of rain with clouds, the probability of rain, the probability of clouds, all those things separately. So it gives you a prior estimation, I think. I don't know, that's probably super confusing. Great, no, you mentioned a lot of really important points, one of which is that it's an alternative to frequentist statistical analysis. So we're not getting a p-value out of this.
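The rain-and-clouds version of exact Bayes just described can be sketched in a few lines. All the probabilities below are invented purely for illustration; the point is only that exact Bayes "literally does the division" in Bayes' rule.

```python
# Hypothetical numbers for the rain-and-clouds example.
# Exact Bayes applies Bayes' rule directly:
#   P(rain | clouds) = P(clouds | rain) * P(rain) / P(clouds)
p_rain = 0.2               # prior probability of rain (assumed)
p_clouds = 0.4             # marginal probability of clouds (assumed)
p_clouds_given_rain = 0.9  # likelihood: it is usually cloudy when it rains

p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds
print(round(p_rain_given_clouds, 2))  # 0.45
```

Nothing is approximated here: every term on the right-hand side is known, so the posterior comes out exactly, which is precisely what becomes intractable in high-dimensional cases.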
And we're able to explicitly state our priors; and, not to go down this rabbit hole, but a lot of frequentism implicitly has a uniform prior, and we'll just leave it at that for now. So exact Bayes is using the Bayes equation essentially as written, and just as you said: there's the probability of something given something else, and you find it from the probability of that other thing given the first thing, multiplied by these other terms. You fill out all of those variables and you literally do that division. We explored this in the case of Axel Constant's paper with the bacteria: there was the exact Bayesian bacterium that had its prior, took in new information, and updated what it thought was out there based upon the new incoming information, and it was in a formula, dividing exactly the numbers that you see here, in this way. Okay, so that's exact Bayes. Now, what are some issues with exact Bayes? It might be straightforward when you're talking about that coin flip, but if you're talking about a high-dimensional space, or something that is just bigger than the RAM of your computer, when it comes to the implementation there are often challenges with exact Bayes. So several techniques have been developed to approximate, in a more tractable way, what an exact Bayesian approach would give. There are two alternatives to exact Bayes implementation, and the spirit is going to be exactly the same, taking in new information and updating our priors as specified, but it's going to be done in a few different ways. So first, what about Monte Carlo? What is Monte Carlo? So, Monte Carlo, you probably know more about the history than I do, but Monte Carlo is a sampling technique. Instead of trying to calculate the probability of rain and clouds using exact Bayes, you just sample: out of a thousand rainy days, how many of them were cloudy?
And out of a thousand cloudy days, how many of them were rainy? So instead of getting an exact characterization of your distribution, like where is the overlap between clouds and rain, you just randomly pick from all of the possible days; you estimate from your sample that way. It's based upon sampling, and so it's like: we might not know one of these terms, but we can at least draw a sample. And this does a few different things. It connects us first to empirical and specific data as implemented, like in a run of a program; not just B sub i or some sort of analytical representation, but: this specific sample was pulled by this algorithm at this time. One common method is the Markov chain Monte Carlo, which uses a local resampling approach, assessing local maneuvers of a given chain that's engaged in sampling. You kind of drop these chains into different parts of the probability distribution and then have them evaluate different aspects of that distribution. So it's as easy, or hard, as dropping paratroopers into different parts of a physical landscape and then having them accept or reject different proposed moves. That's sort of a two-dimensional landscape with the height being elevation, and maybe that's what you want to map, because you might want to know: is there one central peak? Are there two peaks that are very similar? And take that into the statistical case. This is especially helpful when there's, I don't want to say no idea, but there is not the desire to explicitly formalize what shape the distribution looks like. We might want to do, say, Bayesian phylogenetics, and so we need to talk somehow about the likelihood of a given DNA sequence being that way; but where do you go from there? And so that's where the Monte Carlo sampling-based approach really comes into play.
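The counting idea just described, out of the cloudy days you happened to sample, how many were rainy, can be sketched as a tiny Monte Carlo estimate. The probabilities used to simulate "days" below are invented for illustration; in a real application you would only have the samples, not these underlying numbers.

```python
import random

random.seed(0)

# Hypothetical generative probabilities, chosen only for illustration.
P_CLOUDS = 0.4
P_RAIN_GIVEN_CLOUDS = 0.45
P_RAIN_GIVEN_CLEAR = 0.05

def sample_day():
    """Draw one simulated day as a (cloudy, rainy) pair of booleans."""
    cloudy = random.random() < P_CLOUDS
    p_rain = P_RAIN_GIVEN_CLOUDS if cloudy else P_RAIN_GIVEN_CLEAR
    return cloudy, random.random() < p_rain

# Estimate P(rain | clouds) by counting, rather than computing it exactly.
days = [sample_day() for _ in range(100_000)]
rainy_given_cloudy = [rainy for cloudy, rainy in days if cloudy]
estimate = sum(rainy_given_cloudy) / len(rainy_given_cloudy)
print(estimate)  # close to P_RAIN_GIVEN_CLOUDS, with sampling noise
```

With more samples the estimate tightens around the true conditional probability; with too few, it can be badly off, which is the undersampling risk discussed next.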
And then, because it's based upon sampling, the advantage is that you can run that sampling for three iterations or three million, so it's very flexible with the amount of computing power that you bring to bear on the challenge. But there are also the risks of oversampling, meaning you have used some unneeded electrical and computational power, which is really non-trivial; and there's the risk of undersampling. You might even be in a regime where you think that you've sampled enough, because you're getting samples that are just confirming what you already know, but there's a whole other island that you didn't do the paradrop to. So one can be lulled into a false precision with bootstrapping and sampling-based approaches, because they can give ultra-high precisions, but that can actually rest upon a biased sampling approach, or all these other features. And the one anecdote that I have on that is, in the Bayesian phylogenetics case, I remember this one professor in undergrad, Professor Moore, and he'd always say "fuzzy caterpillar": you want the trace to look like a fuzzy caterpillar. There are some technical details, but the idea is that you want the chain to be exploring the full range of a parameter; if something could be zero to one, you want to be exploring the full range, but returning to the best estimate. That shows that you're sampling the diversity of what that parameter can be, but also spending most of your time in the most likely and best regions. So if the value were at 0.5, and the range of what's possible was zero to one, the trace looks like a fuzzy caterpillar, with more thickness in the middle and a lot of fluctuation. It's not spending a ton of time at one and then dropping suddenly to zero, because that would suggest that you're not converged yet.
Whereas when you see that fuzzy caterpillar, it's like you're sampling the extremes and getting novelty and testing different combinations, but you're also returning to something that is working. So the Monte Carlo sampling based approach brings in all these opportunities and challenges associated with sampling. Well, so I was under the impression, like you talked about oversampling, but I think with oversampling, as your sampling goes to infinity, your accuracy increases, you're closer to getting the true posterior. But I think it's uneven sampling that's the danger in the Monte Carlo method, like if you're sampling too far in one end or the other. And perhaps, asymptotically, if you exhausted the state space, then you would know that you had the right answer. But the whole reason why we're using any kind of heuristic approach rather than exact Bayes is because we don't have the full state space. So yeah, there's relatively low cost to oversampling. But we don't always know when that is. And then undersampling can be just totally misleading if we take that as our only representation of that distribution. And so there's all kinds of cool techniques that involve multiple chains happening in parallel. And some of them are like higher temperature and lower temperature. So for one of those paratrooper teams, it's very cold. And so they're only making the best, very local moves. It's like the elevation is sharpened for them. It's very hard to get out of a local rut. And there might be another team that's very high temperature. And for them, the landscape is flat, maybe even completely flat. And then they're like a hot molecule that's able to move over something that might be a barrier for another team. So the cold team is going to fill in the details and locally explore quality solutions, whereas the hot team is another chain exploring more broadly. And so that's the multiple chain Monte Carlo approach.
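The hot-chain/cold-chain picture can be sketched as a toy parallel tempering scheme (in phylogenetics often called Metropolis-coupled MCMC). Everything here, the bimodal target, the temperature ladder, and the swap rule, is an illustrative assumption:

```python
import math
import random

random.seed(1)

def log_density(x):
    # Bimodal target: two well-separated peaks (the two "islands").
    return math.log(math.exp(-(x - 3.0) ** 2) + math.exp(-(x + 3.0) ** 2))

def tempered_step(x, temperature, step=0.8):
    # A hot chain (temperature > 1) sees a flattened landscape, so barriers
    # between modes are easy to cross; a cold chain refines locally.
    proposal = x + random.gauss(0.0, step)
    if math.log(random.random()) < (log_density(proposal) - log_density(x)) / temperature:
        return proposal
    return x

def parallel_tempering(n_iters, temperatures=(1.0, 5.0, 25.0)):
    chains = [0.0] * len(temperatures)
    cold_samples = []
    for _ in range(n_iters):
        chains = [tempered_step(x, t) for x, t in zip(chains, temperatures)]
        # Occasionally propose swapping states between adjacent chains,
        # so hot-chain discoveries can migrate down to the cold chain.
        i = random.randrange(len(chains) - 1)
        t_i, t_j = temperatures[i], temperatures[i + 1]
        log_ratio = (log_density(chains[i + 1]) - log_density(chains[i])) * (1 / t_i - 1 / t_j)
        if math.log(random.random()) < log_ratio:
            chains[i], chains[i + 1] = chains[i + 1], chains[i]
        cold_samples.append(chains[0])  # only the cold chain targets the true density
    return cold_samples

samples = parallel_tempering(20000)
```

Only the cold chain's samples are kept; the hotter chains exist to ferry the sampler over barriers that would trap a single cold chain on one island.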
And there's a lot to that in phylogenetics, which is where I've seen it the most, but also probably many other areas. Okay, cool. I've run it in phylogenetics too. And like, I mean, the programs can take forever to run because, you know, there are too many parameters and whatever, they can really take a long time. The computational power is no joke. Well, that's what goes into the Monte Carlo sampling. So we have exact Bayes, which is calculating the true posterior. And then we have the Monte Carlo, which is estimating the true posterior through sampling. And then we come to the variational, which is trying to minimize the divergence between the true posterior and your approximation of the posterior. Great. Minimizing divergence between the true posterior, like what we would really have wanted to actually know, what is the actual temperature in the room conditioned upon the noisy thermometer estimates that we're getting. So are we going to sample our way out of that one? The variational approach is going to be very different. And in livestream 26, I think we had the cat. And one approach, the Monte Carlo, was like pointillism, it was like dots. So we're sampling pixels from the cat, and then it becomes like a pointillist picture of a cat that, if you do sample densely enough, looks like a cat or is recognizable as a cat. And the variational approach was like a clipart template. And so let's just say that you had a template of different curves, or like you knew that it wanted to be a cat, you knew you were looking at a cat. So then you have a cat template that can get stretched or zoomed in or out. And so that might be very straightforward to optimize, because you're just trying to minimize the divergence given these very limited parametric changes that you can implement. And of course, you run into a situation, and we'll explore various aspects of variational inference: what if you try to stretch that cat emoji onto a different animal?
Or what if it's a totally different kind of data set? So you will find some divergence minimizing solution. That doesn't mean it's sufficient or even in the right ballpark, just like doing an L2 least squares minimizing linear regression will always find a best fitting linear regression. But that best fitting linear regression can be hilariously inaccurate. Like if the data are like a parabola, like a U, the best fitting regression might be like a flat line through the middle. Or if 80% of the points are here and 20% over here, there might be a regression that's very misleading. That's like the so-called Simpson's paradox. So in variational, we're not going to use sampling. We're going to use the minimization of a divergence, which is a KL divergence. But they mention in this paper, it can also be a different kind of divergence. And the Rényi divergence has been explored in some recent work by Sajid et al. So we're going to minimize the divergence between the true posterior and a distribution Q that we control and is tractable to optimize. So it's like, instead of that sampling based approach where we're dropping the teams at different parts of the landscape and then they report back on information, here in the variational approach, it's like we have a template. And then we're going to just do stretching and bending. Again, this is kind of a torture metaphor, but we're going to stretch and bend. And there's only a few parameters on the stretching and bending. And we're going to find the best fit of those stretching and bending parameters. Like if it were a linear regression, y equals mx plus b, the two dials that you get to change are the m, the slope on x, and b, like how high or low the regression is. And then you're finding the m and b that best fit the data y. And so you're kind of minimizing the least squares error term. Here, it's not a linear regression that we're fitting by minimizing the least squares.
It's a variational Bayesian approximation that we're fitting by minimizing the divergence between the true posterior and the distribution q that we control and that is able to be optimized. What else would you add on that? I think that's it. So the true posterior and the distribution that is tractable and we can optimize over. Yeah, that's it. So to close this out, the computational intensity on variational is much less, right? Because instead of calculating all of the parameters for every little dot, every little point on the sample, we are calculating for like a big blob. It's like one big blob instead of several little points. But we calculate for that big blob: this is the density of that function, and these are the parameters of that. So now we're going to back out of this formal cul-de-sac that we've been in and just remember to ourselves, even if it's your first time hearing variational inference, or you're kind of like Blue or I, where it's like we've read it in how many papers but still every time we kind of start on square zero. So why do we use variational Bayesian inference? So the first thing is it allows us to use Bayesian statistics and probability, which we might prefer over, for example, frequentism. Okay, variational Bayes then allows us to use a heuristic Bayesian implementation. So if there's something that's too challenging or implausible for exact Bayes, now we can approach it like Monte Carlo or variational. These are both heuristics. Unlike Monte Carlo, it's not based upon sampling. Variational is based upon fitting a family of equations. And so it can be implemented very efficiently, which isn't always the case for Monte Carlo. It rests upon an appropriate factorization of the problem. And one other piece is this connects to analytical equations in physics. So it's almost like Monte Carlo is the computer scientist's approach. It's like, how much RAM do we have? How many processors do we have?
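Here is a minimal sketch of "minimize the divergence between the true posterior and a distribution q we control": the true posterior is assumed to be a known Gaussian, and q is a Gaussian whose two dials (mean m and spread s) are tuned by gradient descent on a closed-form KL. All the numbers are made up for illustration:

```python
import math

# Assumed "true posterior": a Gaussian N(mu, sigma^2), e.g. the room's
# temperature given noisy readings. Our tractable family q is also
# Gaussian, with parameters (m, s) -- the dials we stretch and bend,
# like m and b in a linear regression.
mu, sigma = 4.0, 1.5

def kl(m, s):
    # KL(q || p) between two Gaussians, in closed form.
    return math.log(sigma / s) + (s**2 + (m - mu)**2) / (2 * sigma**2) - 0.5

m, s = 0.0, 3.0   # start q far from the target
lr = 0.1
for _ in range(500):
    # Analytic gradients of the KL with respect to the variational dials.
    grad_m = (m - mu) / sigma**2
    grad_s = -1.0 / s + s / sigma**2
    m -= lr * grad_m
    s -= lr * grad_s
```

After the loop, (m, s) has converged to (mu, sigma) and the KL is essentially zero; in real variational inference the true posterior is unknown, so one instead minimizes a bound (the free energy), but the "turn a few dials to shrink a divergence" mechanic is the same.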
How big is this data set going to be? And then there's no analytical equation or closed form for Monte Carlo. It's like engineering. It's kind of like you have to have some art and skill and science coming together for the Monte Carlo to be the best it can be. And then it's more about an adequacy and an efficacy question, because you're not converging to the infinite asymptote. Again, otherwise you could have done something else. And then the variational is like the physicist's approach, and it draws on the variational calculus of Feynman and others. And this is the part that's very amenable and connected to the equations of least action and all these other things like factorizing equations. So it's funny that you bring up the computer scientist approach in Monte Carlo. And so I am a horrible computer person. I mean, I will brute force wrangle my data, like just give me more RAM. Just give me more processing power. I just want to make things work, without any patience for finesse. So in my mind, I think about the Monte Carlo like the brute force, just shove your data through the algorithm no matter how much RAM it takes. I think about Monte Carlo like that. And I think about the variational Bayes more like the people that can do, you know, very simple things in code to make it run much faster, that have that very good skill, like, oh, well, we can just run this little factorization, or, you know, we can make the problem run so much better on the computer. So I think about the variational like that. But you're right, it does have that connection to statistical physics and to Feynman. But I think about it as like a highly optimized way to run Bayesian statistics. But why, Daniel, would someone want to run Bayesian statistics in the first place? So maybe you want to give us an example? Why would that happen? Yeah, we will pull back.
Just the last point there is that for the variational methods, we might even have a heuristic or an algorithm or approximation of those, for example, message passing on Forney factor graphs, which draw equivalences with Bayes graphs, where nodes are random variables and edges are statistical relationships. Bayes graphs can undergo variational inference. And there's a tractable way to use toolkits like those developed by BIASlab and use message passing and a Forney factor graph representation to have a level of implementability for even the variational, whereas it's not possible to imagine a heuristic for Monte Carlo in the same way: sampling is the heuristic. You can do better sampling, but that's the game that you have chosen. Whereas in variational, even this question of minimizing divergence does require kind of an approximate or an operational approach, because how can we minimize the divergence between the true posterior and something we control when we don't know what the true posterior is? So there has to be a little bit more added. But this is a very awesome discussion, because it's helping us pull out: this is the sort of exact approach, and then there's a more computer science like sampling approach, and then a more analytical and physics related approach. Okay, so why do we want to do Bayes? Let's pull back to our triangle here. Why do we want to do Bayes, thinking about the concepts here? Why do we want to do Bayes? I asked you, you're not allowed to turn the question around on me. That was me asking you. You're not allowed to just talk and then ask me. Well, there's a few ways to take it. I guess one would be, what is the alternative policy selection? If we're doing research or application, what is our alternative to picking some Bayesian approach? So if we take the Bayesian road, we know that there's going to be some sort of tri-vergence later.
Exact Bayes, Monte Carlo, variational Bayes, something else, or we could go down a different statistical path, like we could use frequentist statistics. And maybe that's totally adequate and effective for the situation. And both are just maps. So a well-fitting linear model of height and weight doesn't make height and weight a linear model. A really good fitting action perception loop, whether it rests upon a kernel of frequentist statistics or Bayesian statistics, doesn't make those systems the way that you modeled them. So I would say Bayesian statistics is useful when we want to do some kind of formal or quantitative inference where we want to be specific and explicit about our prior beliefs and how we want incoming information to update those beliefs. So I agree. And it's what I always try to tell people when I'm explaining what is Bayesian statistics, which a lot of people have never heard of. In college, at least when I graduated, there was no Bayesian statistics course. Now many schools have them, and I'm outdated already. It's really like, when you're doing frequentist statistics, you go out to the desert and count mice every year. Every year you go and count how many mice you see in the desert or something like this. And you expect to see some mice in the desert, otherwise you wouldn't be there counting them. So you expect a non-zero answer. So every year you go out, you find 10 mice, 20 mice, 50 mice, 30 mice, 20 mice, 10 mice, and every year you go count mice. So whenever you go, you have some expectation that you will find mice. You also have some expectation that there are going to be fewer than a thousand mice, because every year you've been sampling this one place. And so what Bayesian statistics does is it actually takes that prior into account, unlike frequentist statistics. They don't care if you ever found mice there before. Your probability of finding zero mice and 30 mice and 50 mice and 100 mice and a thousand mice is all the same.
It doesn't matter how many years you've been there before. You are starting with nothing in frequentist statistics. You don't get to have any guess that there's even going to be one mouse in the desert. With Bayesian statistics, obviously you think that there are going to be some mice and you think it's going to be some number you can count reasonably with whatever area. And so what Bayesian statistics allows for is that prior belief or that expectation that you have as the experimenter. Awesome. And that's kind of what we were getting at with that idea of frequentism having this implicit uniform distribution. You flip the coin 10 times, you get three heads. Your maximum likelihood estimate is it comes up 30% of the time on heads, .3. However, if you want to take like a bigger picture of you, you might have some sort of precision about where it's likely to be. Like is it likely to be .5? And you're going to be surprised if it were .3 because you just pulled it out of a cash register. And then in that case, you still might say, I want to have a uniform prior. And if it is .3, then it's .3. Or you might want to say, I'm very confident it's something else. But Bayesian statistics opens up the space to be specific about how we want to use our priors, how confident we want to be, how much precision we want to have in those priors, and then how we want to incorporate new information. So any other thought on that? Or let's continue on that and connect it to some formalisms. So I have one more thought on the coin flip with the Bayesian idea. Like if you know it's a fair coin and the probability is 50%. And like if you're flipping coins and we're betting each time, like, you know, dollar, dollar, dollar, if we flipped that coin, like you flip the coin 10 times in a row in its heads every single time, like on the 11th time, like my bet is on tails because like it's not, it's a fair coin. Like it's not going to come up every single time heads. 
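The uniform-prior versus confident-prior coin case above can be sketched with conjugate Beta-Binomial updating. The Beta(50, 50) "confident fair coin" prior is an illustrative choice, not anything from the paper:

```python
# Bayesian coin-flip sketch: a Beta(a, b) prior over the heads
# probability, updated by observed flips (Beta-Binomial conjugacy).
def update(a, b, heads, tails):
    # Conjugate update: just add the observed counts to the prior
    # pseudo-counts.
    return a + heads, b + tails

# Uniform prior: Beta(1, 1). After 3 heads in 10 flips, the posterior
# mean sits near the 0.3 maximum-likelihood estimate.
a, b = update(1, 1, 3, 7)
uniform_mean = a / (a + b)        # 4/12, about 0.333

# Confident "fair coin" prior: Beta(50, 50). The same data barely
# moves the estimate away from 0.5.
a, b = update(50, 50, 3, 7)
confident_mean = a / (a + b)      # 53/110, about 0.482
```

The prior's pseudo-counts are exactly the "precision" dial: the more pseudo-flips you start with, the more new evidence it takes to move you.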
And so frequentist statistics just never allows for that previous information. And also, like, where is that information? Which is something that always baffles me. Like, does the coin already know that it's been flipped 10 times and it's come up heads all those times, and that it then has to come up tails sooner or later? I mean, the probability is so low that you're going to get 1000 coin flips of a fair coin in a row that are heads. Like, I don't know if it's ever happened to anyone ever. So as we increase the number of flips, and if it's heads, heads, heads, heads, it becomes increasingly probable that the next flip is going to be tails, in my view, right? Which is maybe totally wrong. But where is that information stored? And the Bayesian kind of allows for that to be there. And I wonder if this even connects to: where is that information stored, in the quantum reference frame of the observer and nowhere else? But let's look at how that gets implemented in these formalisms. So you just described that setting where somebody has a belief that the coin is fair. Maybe that's an empirically grounded belief, like previously they flipped it 1000 times and they got 50-50. Or they have just an a priori belief that it's fair. If that belief were generated by a real other data set, we would call that parametric empirical Bayes, because it's the process where you set the priors as well as their confidence based upon some collected data set. That's parametric empirical Bayes. Or one could just take that as a Kantian synthetic a priori and just say coins ought to be 50-50, and that's what I'm sticking with. Let's just say that then we play this game, we flip it 10 times, and likely or not, 10 heads happen. That's the trace of behavior in the niche that actually happened. Now that might be seen as a totally likely outcome by somebody who believes that it always comes up heads, hence they're unsurprised.
Or depending on what your priors were, that could be variably surprising. And then you talked about, well, when and how should you update your beliefs on that coin? Like maybe something changed while you weren't looking at it. And so you want your best estimates to very heavily reflect the recency. And maybe that should be like a moving average of the most recent three flips, or maybe the most recent 30. So let's look at two places in the paper where they do something like that. The first one has to do with predictive coding and frame differencing algorithms in video compression. And so in this area, we can think about the way that the video looks, for people who are watching this video, as the output. And then we want to encode how surprised we are by what is happening when the frames are changing. And so the simplest method is just count where it's different and convey that information. But it might also be important to use, as they write, more advanced methods that predict each new frame using a number of past frames weighted by a coefficient, in an approach known as linear predictive coding. So that's one case where you're determining how many frames back, how many coin flips back, should we use for that nowcasting, and how should we weight them. So that's one area. And then a second area that they bring up and connect is the Kalman filter. So here's the Kalman filter. And this is their formalisms 33 and 34. And it has a lot to do with Bayesian statistics. And the Bayesian Kalman filter is very common. We can see it in two different ways. So on the left side of this slide, the top image shows a prior prediction. That's the prediction. And then the second is the measurements. And then there's the fusion. So we have the prediction, the measurement here with like GPS, and then there's the fusion. So that is very similar to having a prior and then some updated sensory information and then the posterior.
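The "predict each new sample from coefficient-weighted past samples and transmit only the error" idea can be sketched for a one-dimensional signal. The two coefficients below are hand-picked so the predictor nails linear trends; a real codec would fit them to the data:

```python
# Linear predictive coding sketch: predict each new sample from a
# weighted sum of past samples, and transmit only the prediction error.
coeffs = [2.0, -1.0]   # predicts linear trends exactly: 2*x[t-1] - x[t-2]

def encode(signal):
    residuals = list(signal[:2])   # the first two samples are sent as-is
    for t in range(2, len(signal)):
        prediction = coeffs[0] * signal[t - 1] + coeffs[1] * signal[t - 2]
        residuals.append(signal[t] - prediction)
    return residuals

def decode(residuals):
    # The receiver runs the same predictor and adds back the residuals.
    signal = list(residuals[:2])
    for t in range(2, len(residuals)):
        prediction = coeffs[0] * signal[t - 1] + coeffs[1] * signal[t - 2]
        signal.append(prediction + residuals[t])
    return signal

ramp = [1.0, 2.0, 3.0, 4.0, 5.0]             # a perfectly predictable signal
assert encode(ramp)[2:] == [0.0, 0.0, 0.0]   # nothing surprising to send
assert decode(encode(ramp)) == ramp
```

A predictable signal compresses to near-silence on the channel; only the unpredicted part, the residual, ever needs to be transmitted, which is the predictive coding move in miniature.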
So it's just labeled slightly differently. But we can already see how this is totally related to Bayesian statistics. What this also brings into the picture is like a pseudocode that unrolls through time. So the Bayesian statistics approach doesn't have to be about time. It can be about a static dataset. And then you could be modeling, like, okay, per extra sample from the population, of their height and weight, we're going to update our inference on that relationship. But that's a timeless analysis. It doesn't incorporate some sort of unfolding through time. That can be incorporated into the Kalman filter using this pseudocode down here on the left. There's some prior knowledge of state. And this is like two dimensions. And then there's a cloud representing a distribution of precision or uncertainty. So a more precise estimate would be like a tighter cloud there, or a sharper peak or sharper valley, depending on how you want to think about it. And then less precision would be like a broader basin or a more diffuse ink drop. And then there's a prediction, then a measurement occurs, which can be precise or can also have an associated measurement error term, which could be fit with parametric empirical Bayes based upon testing, or it could be determined on the fly with expectation maximization. And then that prior gets juxtaposed in this update step, and that outputs. So this takes sort of this timeless Bayesian update scheme and specifically adapts it to the case of something happening through time. And that's shown on the right side here with this image of, like, time is happening. And we're getting these noisy x measurements. And the true temperature is the purple. And let's just say it's unchanging in this case. But that could also be changing. One can just imagine that, especially if it's very noisy, it's hard to get increasingly complex dynamical patterns. But that's all part of the game. And then the prior starts high. And then the prior is weighted.
It's kind of like a spline. It's weighted through all of these x's. And you could imagine one extreme case is: move to the last one that you saw. And so that would be basically recapitulating the measurement distribution through time. The other extreme case would be: weight all the time points evenly, so that we're kind of converging to a moving average. So if we have 100 hours of video that is like at a two, and then all of a sudden it switches to an eight, then your moving average would very, very slowly start moving up from a two. And then one could imagine that there's some intermediate solution that doesn't use the whole data set, because that's too slow to adapt and maybe takes up too much memory, but isn't just a one step instant switch to the nearest measurement. And that question of how fast you should update your Kalman filter is a parameter in the model. And so that is exactly what is being statistically optimized: how much through time should we update our ongoing estimate. And so in the static or the timeless Bayesian case, that's where we talk about precision. How much should we update our inference as new information is added to our data set? But that doesn't mean appended in a temporal way. The Kalman filter is making it explicit that these data points are being added sequentially. And that is providing us with this pseudocode by which the latent state estimate is updated, not just as a function of adding data, but actually adding data in a sequential and unrolling way. What do you think, Blue? Yeah, great. I think that that was a super good explanation and very clear. And yeah, my first interaction with Kalman filtering was in imaging and image processing. And so I think that's maybe where people come across that a lot, especially doing laser scanning microscopy or video. I guess the video update also is like that. So. Yes. So let's just look at some of their writing.
But these are like, the smaller the formalism, the less we've paid attention to it, and the more that we would love to know about what it actually means. Well, let's kind of pick up above their 35. They write: Kalman filtering proceeds in two steps. First, the state is projected forwards using the dynamics model or prior. And that's the p of mu t plus one conditioned upon mu t: the mean at the next time step conditioned upon the mean at this time step. So that's like, what is going to happen next conditioned upon how it is now? Then these estimates are corrected, or sort of brought to a compromise, by new sensory data, by inverting the likelihood mapping: p of the observations at the next time point conditioned upon our estimate of where the latent state will be. So this is: what temperature will the room be at the next time step, conditioned upon how it is now? This likelihood mapping, which is like the A matrix in the POMDPs, is: what is the probability of the outcome in the next time step being a certain way, conditioned upon how we think the room's temperature is going to be? And so these are the equations and some of the variables. And then they kind of conclude, in this little section, the derivation of the rules is relatively involved. And that's in appendix A and some other places. But also the Kalman filtering can be interpreted as finding the optimum of a maximum a posteriori estimation problem. So this is almost like we have Kant with the exhortation of the a priori. What information are we bringing to the table, qualitatively with Kant or quantitatively with Bayes? That's the a priori. And then Bayes kind of bridges the a priori, the prior, with the a posteriori. Not sure if that's the Latin for that one, but it's the posterior, what happens after the sensory update. And then that space between the before and the after is the during. And that's the during of the now that we're doing action and inference within.
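Those two steps, project the state forward with the dynamics model, then correct against the new observation, can be sketched as a scalar Kalman filter for the noisy-thermometer example. The noise levels and priors here are made-up illustrative values, not the paper's:

```python
import random

random.seed(0)

# A constant true temperature observed through a noisy thermometer.
true_temp = 20.0
readings = [true_temp + random.gauss(0.0, 2.0) for _ in range(200)]

mu, var = 10.0, 25.0     # prior belief: mean and variance (deliberately off)
process_var = 0.01       # how much we think the room drifts per step
obs_var = 4.0            # assumed thermometer noise variance

for y in readings:
    # Predict: project the state forward under the dynamics model
    # (here the state is assumed static, so only uncertainty grows).
    var += process_var
    # Correct: the Kalman gain balances prior precision against
    # sensory precision, then nudges the estimate toward the reading.
    gain = var / (var + obs_var)
    mu = mu + gain * (y - mu)
    var = (1 - gain) * var
```

The gain is exactly the "how fast should you update" dial discussed above: large when the prior is uncertain relative to the sensor, small once the estimate has become confident.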
Let's look at how they introduce action, and then connect it back to active inference. So I think the Kalman filtering and the updating. So I think we kind of skipped over updating when we were talking, when we were making our layer cake earlier, and just how messages in the nervous system, biologically and in predictive coding, are passed forwards and backwards. So it might be helpful to back up and talk about error propagation, back propagation, and propagation of signal through time, just in a very basic way, because I think the Kalman filter is kind of an advanced way to do that. Okay. Can we go back? We're back at the triangle. And so now we've added variational as like a little badge to the Bayesian. And also we're going to add Kalman as a badge. Where have we visited on this journey? And one can imagine that they could take a Monte Carlo or a variational or an exact approach to Kalman filtering and so on. And then you mentioned a few more biologically inspired or compatible features like error propagation and so on. So what is an area in the paper or an idea that's relevant here? So I think that the signal propagation and bottom up prediction error, like how does error propagate through visual systems, through neural networks, I think that there's applications here. And is it the error that's fed forward, or is it the signal that's fed forward, or where does the error go in the system? So I think these are some confusing things that maybe they touched on in the beginning of the paper. So let me see if I can find exactly where there's some debate. Okay. While you're looking, so I've pulled out this right edge of the triangle. So we're going to just leave some of the baggage at home and take out to the table just this link between Bayesian statistics and probability and biological systems.
And then you raise this really important question about how do we think about error estimation. And some related terms there would be precision, ambiguity, risk, compression. So those are error estimates. And then how are error estimates propagated and communicated? How are they propagated and communicated in language? I'm not sure. But if I had to guess: how are estimates communicated in systems? And so we might be interested in mathematical systems. So then the way that you communicate the error is just, you multiply the two variables. You have a precision matrix. And then you have some sort of hidden state matrix. And then you basically just multiply this kind of pristine matrix by some error matrix. And so if your error on something is zero, and this is not exactly how the multiplication would work, because multiplying by zero, but if the error were zero on something, one would want their estimates to be passed through without being blurred. If the estimate of the uncertainty were super high, in the extreme approaching total noise and uncertainty in the variance estimate, then no matter where that pristine estimator were, you'd want it to be totally fuzzed over. And then one could imagine that in a per variable, ongoing way, you'd want to be updating these variance estimators. And that is exactly what happens in the Kalman filter, which is this unrolling and ongoing estimate of observations, latent states, and the variance that links them. And that's done through matrix multiplication. But how does it happen in biological systems is a different question. So I think some things come into play here, like redundancy and signal compression, and then error propagation, forward and backward. And so there is this in predictive coding slash predictive processing, because I'm still not 100% on the difference there.
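One way to make "weighting estimates by their precision" concrete: fusing two Gaussian estimates of the same quantity, where each is weighted by its precision (inverse variance). A precise estimate passes through nearly unblurred; a very uncertain one barely moves the result. The numbers are arbitrary:

```python
# Precision-weighted fusion sketch: combine two Gaussian estimates
# of the same quantity, each weighted by its precision (1/variance).
def fuse(mu1, var1, mu2, var2):
    p1, p2 = 1.0 / var1, 1.0 / var2
    mu = (p1 * mu1 + p2 * mu2) / (p1 + p2)  # precision-weighted mean
    return mu, 1.0 / (p1 + p2)              # fused variance shrinks

# A precise prior (20.0, variance 0.5) against a noisy measurement
# (26.0, variance 4.5): the fused estimate sits close to the precise one.
mu, var = fuse(20.0, 0.5, 26.0, 4.5)   # mu is about 20.6, var is 0.45
```

As the second variance grows toward infinity, its precision goes to zero and the fusion passes the first estimate through untouched, which matches the "zero error passes through, total noise gets fuzzed over" intuition above without ever literally multiplying by zero.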
But I do think this is a predictive, it seems like it's a predictive processing thing, but the authors do say predictive coding. There is a bottom-up construction of a model. So the authors say that, but it's not only that. The authors say that perception is not the result of an unbiased feedforward or bottom-up processing of sensory data, but is instead a process of using sensory data to update predictions generated internally by the brain. So there's the idea of a model comes in here. Yeah, generative model. Where do you see generative model in these discussions? I mean, it's kind of threaded throughout. So I think the idea of variational inference brings in a model, because the model constructs the estimate of the posterior, where there's a consistent effort to minimize the divergence between the actual posterior and the estimate of the posterior. So the estimate of the posterior is like trying to make the model match reality. So trying to stretch, bend, and fold the cat template to match the picture of the cat. Yes, awesome. So generative model, with no complications, is being put into the Bayesian statistics area. Because if you specify a model that's generating thermometer outputs given the temperature, it's quite literally the generative model. And we know that there's kind of two directions. There's the recognition model and the generative model. That's the tale of two densities, because a distribution can also be understood as a density. And so in the realm of statistics, generative model is quite literally the approach that is taking parameters of a model and using it to generate, for example, observations in this case. No need to complicate that. But then there's a few ways that we hear different people and different papers talking about the relationship between biological systems and generative models, as well as recognition models. But we're focused on the generative model here.
And sometimes these are even coherently or incoherently used differently in the same paper or the same sentence, which is our favorite. And we'll eagerly await the kind of automated detection systems that will enable us to do high throughput analysis. But here's just a few of the ways that people can talk about that. We hear sometimes that the biological organism or cell is a generative model, has a generative model, enacts a generative model, or, the more instrumentalist version, we can model that system with a generative model. And so this bottom one is kind of like saying, I'm just planning to use Bayesian statistics to model the biological system. And perhaps we could say that this is the one bringing the least assumptions to the table about what that system is doing. Because by saying we can model it using a generative model, we can use, fit, or derive a generative model for this behavioral cognitive system, we're just remaining purely with both feet in scientific instrumentalism and empiricism. We're making only claims about the map and not about the territory per se, which might be throwing out the whatever with the whatever. However, these other ones are when people are making claims that are about the territory. What is the brain? Is the Bayesian brain what the brain is doing? Well, it's not passing around screenshots of Bayesian equations like we are. So what is it that it is doing? Or again, is it on the instrumental side? It's just something that we can use to model the brain. So yeah. So going back to predictive coding and also linking to what is the brain doing. And really, this is why I wanted to stop here. I found the spot in the paper. But when we were talking about Kalman filtering and image processing and then error propagation in a system, they talk early in the paper about predictive coding as a means to remove redundancy, applied to signal processing to reduce transmission bandwidth for video.
And this, the authors say here, Barlow applied this principle to signaling in neural circuits, arguing that the brain faces considerable evolutionary pressure for information-theoretic efficiency, since neurons are energetically costly and redundant firing would be potentially wasteful and damaging to an organism's evolutionary fitness. Because of this, we should expect the brain to utilize a highly optimized code, which is minimally redundant. And they say predictive coding minimizes this redundancy by only transmitting errors or residuals of sensory input that cannot be explained by top-down predictions. And so they remove redundancy at every point in the layer. So it's like when you're watching a video, or even like the flip book, so we see, you know, the flip book of a little figure kicking a ball. The point that changes is the point that's prioritized, the point that's different in each frame, not the point that stays consistent, because that is where the action is, where the motion is happening. So that's what's prioritized here. And that totally connects to information theory: informative updates are the ones that update your prior to move it to a different posterior. That's the Bayesian information concept, as opposed to, for example, the Shannon entropy, which is based upon symbol entropy. So they're totally compatible. And if this seems like a technical detail, it is. But also this is the Bayesian concept of surprise and information. But that was a great section you pulled out. So here we're really highlighting, following Barlow, applying this principle of redundancy removal. But already we can caveat how much redundancy to remove, because a more redundancy-removed system is also potentially more fragile. Like if we only have one copy of each saved file, now our threat of losing it is very high. So redundancy is always in a trade-off with resilience and other system features.
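The flip-book intuition, transmitting only the residuals that top-down predictions can't explain, can be sketched as a toy delta encoder. This is illustrative only (frames as tuples of pixel values, not anything from the paper): the first frame is sent in full as the prior, and after that only the changed pixels go over the wire, with an empty residual playing the role of the "nothing changed" signal.

```python
def delta_encode(frames):
    """Transmit the first frame in full, then only the changed pixels.

    frames: list of equal-length tuples of pixel values.
    Returns [keyframe, residual, residual, ...] where each residual is a
    dict {pixel_index: new_value}; an empty dict means 'no change'."""
    keyframe = frames[0]              # the full-resolution prior
    stream = [keyframe]
    prev = keyframe
    for frame in frames[1:]:
        # Only prediction errors (pixels differing from the previous frame)
        # are transmitted; unchanged pixels are redundant and dropped.
        residual = {i: v for i, (p, v) in enumerate(zip(prev, frame)) if v != p}
        stream.append(residual)
        prev = frame
    return stream

def delta_decode(stream):
    """Reconstruct every frame by applying residuals to the keyframe."""
    frame = list(stream[0])
    frames = [tuple(frame)]
    for residual in stream[1:]:
        for i, v in residual.items():
            frame[i] = v
        frames.append(tuple(frame))
    return frames
```

The decode side reconstructs the video losslessly from the keyframe plus residuals, which is the sense in which redundancy removal costs nothing in content but, as noted, makes the stream fragile: lose one residual and every later frame is corrupted.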
So this is one principle, but it's not the only principle at play. So redundancy removal doesn't mean we're going to pare it down to nothing. It just means that for an organ like the brain, which is using like 20% or however much of an organism's energy budget, changes in its efficiency by several percent can be really important. So that principle of redundancy removal is being applied. And the argument is that the evolutionary pressure is for information-theoretic efficiency. So if you don't have anything to update, you don't want to spend the energy to do something extraneous, nor do you want to spend any more energy than you have to on updating. And that's again because of the wastefulness of potentially excess signaling. And then with evolution, there's always a twist: like, what if it is wasteful? And then that's just some second-level reason why, so it's not the whole story. It's just a short sentence: because of this, we should expect the brain to utilize a highly optimized neural signaling, neural firing code, which is minimally redundant. And then they connect it back, from what the brain ought to be doing normatively at a first pass, to the formalisms that were being described with predictive coding. So in the frame-differencing case, it was like, only tell me about the pixels that are changing. And then if nothing is changing, just send me like a little beep, and I'll just know, with just one piece of information, to keep it exactly the same, even if it's 4K video and it's a gigabyte per second or whatever. But we don't have to change anything for that video. And then as pixels are changing, update me on how to change them. And so that would be like the maximum efficiency in coding. And so you would give a full resolution of the first frame in the movie: that's the prior, that's D. And then all one has to do is think about how to encode how the subsequent frames change. And sometimes they introduce things like a keyframe.
So every second or every minute, it starts with a fresh prior. And so these are the kind of computational things that one can do when they're not bound by the constraints of biological systems. But this is an awesome connection. Any other thoughts on that theme, or like where does error propagation come into play? Or we'll turn back to the triangle. So I don't know, the error propagation is interesting. It would be great to have the input of somebody who is more versed in machine learning. Because I think the idea of back propagation, like, this is an error, and sending that back to the previous layer in the neural network, is super interesting. And part of what enhances the efficiency of neural networks, like, hey, fix this. And so I think, and I can't really find in the paper where they say that, but there's like a local correction for error within each layer of processing. So in the visual system and, yeah, in signal processing itself, it's like each layer prioritizes just fixing its own error. And then that contributes to the overall efficiency and accuracy of the system at large. Yes, back propagation is mentioned 45 times in the paper with a quick search. So it's definitely an important concept and also something the authors have worked on a lot. So good, error propagation and back propagation. But also that reminds me of when Professor Levin was here in 40.2 with the back propagation discussion and the imperatives for the different parts of the neural network and the kind of information that they want or expect or prefer, and then how they want to escape being trained and engage in learning, but not being trained but still updating. So this learning and being trained both involve updating priors, potentially appropriately. But in the case of training, there's a feeling like another entity or agent is imposing their will, versus learning can also be associated with behavioral and cognitive changes.
But it has a different sense potentially than being trained. Anyways, let's go back to the triangle, as we're sort of not quite landing the dot one plane, you know, the check has not arrived at our restaurant. But maybe we're working our way up to our first star. Where do we sit on where action is? And then we have a few terms that we want to place, which is message passing and then active inference. Oh yeah. So what is another term that we can add in, or something that we're seeing differently in a new light? Or where do you see message passing or active inference? So I think, yeah. So we have action. Are we happy with where action is sitting? Do we have enough arrows? I think I have used every feature of Google Slides sufficiently. I like action as sort of a fulcrum: in the center of the merry-go-round is action. And yes, more arrows could be drawn, but we'll keep it a little sparse. But go ahead. So I think maybe message passing sits between predictive processing and action. And that's exactly where I was going to put active inference, where you just typed in inference. Because, yeah, it's hard to say where active inference really goes, maybe not right where inference is, but somehow, if action and the Bayesian brain and biological systems could form a triad, maybe active inference would sit in the middle of those things. So I added message passing to Bayesian statistics and probability because it's something that we can do on a Bayesian graph. It's an implementational approach, but it's also leaning a little bit towards the biological, because people talk about, like, how does a given piece of information or signaling or stimulus in the toe end up having an effect on the other side of the body or on the brain?
Like, there's some kind of message passing there. So now it's being used in a technical Bayesian sense with message passing, whereas there's also a more conversational way of talking about biological message passing: whether it's a neural synapse, an immune synapse, or a neural-glial synapse, or it's mechanotransduction, or it's a hormone. Broadly, we can just consider these different biological mechanisms that convey information or updates or anything to be passing messages. So message passing is going to be a formal technique, but it does lean towards modeling the ways in which different biological nested entities are exchanging information. Yeah. And so why I would put it with predictive processing is because when I think about the things that we were talking about, like how are signals compressed? How is redundancy eliminated? How is error propagated through a system, or back propagated? Which is why I liked it there in the processing, because you can't legitimately process that information without knowing how the message is going to be passed. Like if you're only going to get every fourth letter of the message I pass, well, I'm going to encode my message in every fourth letter, not in every single letter. And that changes things, right? Okay, thanks. Yes. So we have message passing now as part of the philosophical concept of systems that engage in holistically considered predictive processing, because it connects what biological systems are doing with some idea about information sharing, without any formalism. But then message passing via Forney factor graphs is going to be on the Bayesian side, because that's the implementation. So maybe message passing here. Yeah, it's like on the sort of predictive coding, implementational side. Whereas we're not saying biological systems are doing Forney factor graphs and message passing in that technical sense. But for sure, we can think about it in this qualitative way. Okay. So we have action in the center.
And then we kind of bifurcated it to include action and inference, aka active inference, for a few reasons. First off, action is not just blind flailing action. We talk about planning as inference, and inference on action, and inference about future latent states and observations that aren't explicitly about action but that, we know in active inference, actually do condition on action. So inference and action, how we think and how we act, how the entity's cognitive model is and how the entity's behavior is through time, those two are like an inseparable dual. And that's why active inference says it all in the title. It's about action and inference and all of the ways that they are related to each other. Again, whether it's planning as inference about future action, or inference on the consequences of action, or it's inference about something that isn't explicitly action itself, like what is the temperature, but it requires or is conditioned upon policy selection in this framework. Not the only way to look at it. But that's how they come together in active inference. And let's peep over to 4.5, where they introduce action formally. The basic approach to including action within the predictive coding framework is to simply minimize the variational free energy with respect to action. Free energy is not explicitly a function of action up until equation 51, but it can be made so implicitly by noticing the dependence of sensory observations on action. So if your future sensory observations don't depend on action, or don't depend on that kind of action, this is like an extraneous calculation. Like if you have a coin flip and you're rolling a die, and they don't influence each other, then they're conditionally independent. And so there's no need to calculate the joint distribution if you're just interested in the coin flip, because it's not having a causal relationship with rolling the die.
However, if we're going to have agency, or model systems that appear to have agency, then there's some kind of dependence of future sensory observations on action. In the visual case, the visual input is conditioned upon the oculomotor decisions, and those are actions of a musculoskeletal system. And then the way that it gets tucked into the equations is that the change in action with respect to time is going to be related to a gradient, so like a partial differential equation, but then it's amenable to gradient descent. It's going to be like a partial derivative of free energy, how it changes with respect to action. So like, am I minimizing free energy by steering the wheel more to the left or by steering the wheel more to the right? That's like dF over da, change in free energy over change in action. And then free energy F here is a function not just of the joint distribution of o and mu, not just a joint distribution of the observations on the thermometer and the mean estimate of temperature in the room, that's kind of the pure inference take. But again, we're suggesting that sensory observations have a dependence, could be partial, could be complete, on action. And so now observations themselves are a function of action. And then that can be unpacked a little bit more in equation 51, and they discuss it further, of course, in the paper and elsewhere. But that's how this inferential framework gets built in predictive coding. And then action becomes introduced as something that inference can be about. And that's what they say allows predictive coding agents to undertake adaptive actions without any major change to their fundamental algorithms. It's not like there was the temperature thermometer module, and then now there's this totally ad hoc decision-making module that's bringing in all this stuff about what is the reward of different temperatures, what is the reward of different thermometer observations.
Rather, within this same parsimonious and first-principles inference framework, we can model active inference, which is inference including action. Let's kind of continue on this action theme, but then we'll end in a few minutes. So under such a scheme, the prediction error becomes the difference between the current observation and the target or set point. That raises the question of where these set points and targets come from. And that's where they just sort of have that road leading off into the distance, saying, well, in the evolutionary case, it's inherited. In the computational case, it's just simply there. It could be just a priori speculated, or it could have been a parametric empirical Bayes. So that's one really interesting point: when you do introduce action, you have to discuss not just the target or the set point and how it's generated, but you also have to make explicit in the forward model, as they write, the dependence of the observations upon action. And that has to be provided or learnt, because just that temperature thermometer model is not going to include a forward model of what happens when you turn on the heater or when you put on a jacket. So that's something that has to be learnt. And then a second interesting thing that starts to come out of their framing here is the costs of action. And that closes the loop really well with the discussion that you raised of Barlow and the information-efficiency argument for why brains ought to be doing something like predictive coding, efficient signal transfer, and hence why it's either what they ought to be, or are, doing, and how we ought to, or could, model it. So that's why the costs of this are important, the costs of action are important. And then just one last piece in 4.5 was active inference and PID control, which is related to the generalized coordinates of motion.
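The dF over da picture just sketched, minimizing free energy with respect to action toward a set point, can be shown numerically in a toy form. Everything here is an assumption for illustration, not the paper's implementation: observe() is a made-up linear forward model (a heater dial warming a room), free energy is just the squared prediction error against a preferred set point, and action follows gradient descent on it.

```python
def observe(action):
    """Hypothetical forward model o = g(a): the heater dial 'action'
    warms the room linearly above a 15-degree baseline (illustrative)."""
    return 15.0 + 2.0 * action

def free_energy(action, set_point=21.0):
    """Toy scalar free energy: squared prediction error between the
    observation the action produces and the preferred set point."""
    return 0.5 * (observe(action) - set_point) ** 2

def act_step(action, lr=0.1, eps=1e-4):
    """One gradient-descent step on action, da = -lr * dF/da, with dF/da
    estimated by central finite differences."""
    grad = (free_energy(action + eps) - free_energy(action - eps)) / (2 * eps)
    return action - lr * grad

action = 0.0
for _ in range(200):
    action = act_step(action)
# The dial settles where the predicted observation matches the set point.
```

The point of the sketch is the one noted above: nothing in the update rule is a new decision-making module. The only change from pure inference is that observations are a function of action, so the same gradient machinery that would update beliefs can also turn the dial.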
And we unpacked that a lot in live stream number 26 with the Bayesian mechanics paper of Da Costa et al. And this connects control in active inference systems to a very commonly used engineering framework for controlling processes, which is PID control, framing action through time as being related to these three terms which are described above. So that's how they brought action into this paper: by spending many of the early sections providing the single-layer predictive coding model and connecting that to variational Bayes. That was in section two. And then in section three, and also in section two, they explored all these different interesting generalizations, the spatial case, the hierarchical case. And then they also connected it to biological systems and reviewed some evidence for predictive coding of different kinds. And then in section four, they connect it to several other inference algorithms: predictive coding and the back propagation of error, linear predictive coding and the Kalman filter, normalization and normalizing flows, which we didn't talk about, predictive coding as biased competition, also something we didn't talk about. And then, after all of that, action gets tucked into the picture. And it's just a very clean and elegant way to think about it, and it helps us understand maybe even how active inference is similar or different to some of these other frameworks. So in the spirit of our layer cake, I think today, if this paper is a sandwich, we covered both slices of bread, the very beginning and the very end, but we left out all of the things that come in between. The meat and potatoes. Have you had potatoes on a sandwich ever? They do it with some french fries sometimes, but I know what you mean. Go to Subway. I'm going to go to Subway and be like, yeah, can I get some potatoes on my sandwich? Do you have a predictive coding special?
And then they go, what do you expect is going to be on the sandwich? It would be cool next week to really dive in more to error back propagation, predictions are sent, you know, up, and errors sent down, or is it the other way around, and really discuss these computational graphs that they start to talk about. So I think next week it would be really cool to dive into what's inside of the sandwich. Oh, awesome. Okay. So we'll definitely go into what's inside the sandwich: back propagation of error, biological evidence and examples. So I think there's a ton of exciting stuff that we'll be able to talk about next week. The different paradigms, like we could go over figure three, the paradigms of predictive coding, and the normalizing flows. So we definitely covered the bread today. Yes. Well, anyone who is listening, we really appreciate it. Look forward to your comments on the video, or joining us next week if that's going to work. And blue, this was super interesting and very helpful. So I think it's a great dot one. And it's funny because the dot one is the meat and potatoes of the zero-one-two sandwich. So it's like our middle, the bottom-of-the-bathtub phase, the dot one, where we're opening up all these ideas and just trying to give a second coat of paint on a few things, go into a few technicalities, re-represent some knowledge. And in a very paradoxical or delightful way, it was about not the potatoes themselves, but rather about the beginning and the end. Well, to be fair, for those of you who haven't yet read the paper, there's like 60 equations and 55-plus pages plus an appendix. So no surprise that we covered the bread, but it would be cool to tear apart a little bit of what's inside of the paper, because I think that these layers are cool and important. And the paradigms of predictive coding were super interesting to me. And I love the intersection between the brain and machine learning.
And I learned a lot reading this paper. Especially, like, I was taking neuroscience classes, I'm going to say how old I am right now, but I was taking neuro in college in like 2003, maybe 2004, learning about, you know, the work of Rao and Ballard and Hubel and Wiesel and all of these, you know, classics centered around visual experiments, with no idea how new they were. No idea how new of an idea it was, like the processing through the visual cortex. Really, I didn't know how cutting edge the work I was learning about was. So it's cool to kind of have that framed for me and put in perspective, and yeah, all the building upon that that's been done. It's nice. Excellent. Thank you very much, Blue, and everybody else who's participating, and see you next week.