Hello, everyone. Welcome to ActInfLab livestream number 43.2. It is May 11th, 2022. Welcome to the ActInfLab. We're a participatory online lab that is communicating, learning, and practicing applied active inference. You can find us at the links on this slide. This is a recorded and archived livestream, so please provide us with feedback so we can improve our work. All backgrounds and perspectives are welcome, and we'll follow good video etiquette for livestreams. If you want to learn more about what's happening at ActInfLab, head over to activeinference.org, where you can find out more about participating in livestreams or other modes of participation. Today we're in 43.2, where, as always, it's up to the last minute with what we actually do. Welcome also to Stephen. We're going to continue this discussion in the third part of the conversations we've been having on the paper Predictive Coding: A Theoretical and Experimental Review. We'll begin with introductions and then we can move around the paper. We brought up some things last week that we'd like to cover today, but we can also take any questions, areas, or sections that any of the team want to talk more about — we can absolutely go there. So we'll just start with introductions: say hi, and feel free to add anything you'd like, what brought you to the paper, or just one thing that you took away from it. I'm sure we'll have some fun today. I'm Daniel, and I'm a researcher in California, and I'll pass it to Dean. How goes it, Dean? Oh, sorry, I was muted in Jitsi, so it was only on the livestream. But welcome to Dean and Stephen, and I'll also share the slides with you here in the chat. We're live and in the game, so feel free to say hello and anything just to get us going with this discussion. I'm getting a bunch of feedback from somewhere, Daniel. It's going good. I'll open up another window so that you can see what slides I'm working on. But yeah, how goes it?
Or what was interesting to you about this paper? Do you hear me, Dean? That's me. All right. Okay, I can hear you now. Yeah, I could hear you; I couldn't hear you before. Something was weird. Sound fine, Dean? I can hear you, Daniel, but there's a long delay, and I could hear feedback, like a second audio loop going off while you were talking. Okay. Do you have the stream open or anything? No, I just have the picture of the three of us. Okay, I'll just reload to be sure. So it goes with livestreams. All right. Okay, is that better? We'll roll with it then — or we'll see what happens. So yeah, Stephen, how goes it? And what do you like about the paper as we start? Well, I'm just interested in the general scope, as per normal. So I'm going to go with the overall vibe of pushing the boundaries of active inference. I'm fairly relaxed with this one. Okay. Hey, Dean, work better? Yeah, my bad. All good. Thank you for telling me not to have both Jitsi and YouTube open at the same time — these little technical things I should have been more on top of. There's some reflex where, if your own audio is played back to you at a certain delay, it makes it basically impossible to speak. It's pretty interesting. Yeah. Well, apologies, guys, that's not the start I wanted. All good. So how goes it? Or what brought you to this paper, or what did you find interesting in what they wrote? Well, I'm not really totally up on what predictive coding is. I think I know what predictive processing is. And when you and Maria were talking in the dot zero and the dot one, the translation of that into operations that you can identify with a certain amount of confidence is the part that I'm really interested in. I'm not really sure what the applications are beyond the psychology realm; I'm sure there are quite a few.
And that's kind of why I showed up today: to piggyback on what people who obviously know a lot more about this than I do are saying about it, and where it's going, and what the potentials of it are. All right. Well, let's start with a few reminders and recalls. Is there any area or figure or formalism that either of you would like to suggest we jump to, to remind ourselves of anything critical about the paper? Otherwise, I can bring this to some. Is it okay to start with the backpropagation of error? Yeah. All right. So I'll bring in — this was in section four — so I'll be bringing in some, but we're going to ramp into it, just as a reminder about the roadmap. The first section of the paper is an introduction. The second section describes in quite a lot of detail the core kernel of the predictive coding architecture. This is a computational architecture, and it involves information transmission in this predictive coding way. So check out the paper, the dot zero, the dot one; that's where we go into a lot of detail on predictive coding as an information architecture. And then, as Maria was bringing out for us, predictive processing might be used to refer, perhaps more holistically, to systems engaging in predictive coding. And there's a bidirectionality to predictive processing, whereas predictive coding might be something that could be done in a bit more of a unidirectional way. That was section two on predictive coding: introducing the kernel and then going into a few specific generalizations of that kernel, and modifications that help make it more computationally powerful, as well as linking it closer to certain biological architectures of tissues, like in the retina and in the cortex of the mammal brain. In the section on predictive coding in the brain, there's some exploration of different paradigms of predictive coding. And these are referring to, kind of like, settings where the core kernel and its elaborations can be applied.
And that's in the supervised and unsupervised cases, as well as thinking about spatial and temporal predictive coding algorithms, and connections with modern machine learning, like with deep predictive coding. Then, after that kernel, the generalizations of the predictive coding kernel, and exploring some of the paradigms for where that kernel and its elaborations have been applied, in section four there are the relationships between and among algorithms. And that's where we get into this section on predictive coding and the backpropagation of error. So that's where we'll go, and I'll copy in some formalisms, or find out which ones would be the best to bring in. But here we are on slide 56, which I'll share in the chat here. What is interesting to you about backpropagation of error? And then we'll start there. Well, what I'm curious about is that bottom line of that slide: this recursion is identical to that of the backpropagation of error algorithm, which I've never actually encountered before this. So I was wondering — I know a bit about recursion, I think we all have sort of a basic understanding of what's involved in recursive loops — but I was wondering, where is this relationship between predictive coding and this other algorithm? I've heard the expression, or label, backpropagation of error, but I've never really looked at it from an operational standpoint, or how it's laid out mathematically. And I'm not sure if either of you have that familiarity, but I was hoping maybe you did. Great. So how does predictive coding relate to the backpropagation of error? Let's look at these sections, which we didn't annotate too heavily leading up to this, because there are many equations in the paper, many areas. Even before we introduce the formalism — and give me 15 seconds to read it — qualitatively, what would backpropagation of error mean to you? Or where is an applied setting where you think this could matter?
And again, this is me talking out loud, hanging my butt out there, because I don't really know what I'm talking about. But am I correct in assuming that it's anytime you find yourself in feedback and feedforward loops? I know that they're acyclical, and I know that that usually implies directionality, but isn't it anytime you're getting sort of a looping condition where you get an opportunity to sort of reset? Isn't that the situation that we're examining? Okay. We're in the arena of statistical Bayesian graphs. So we're talking about models here; we're not necessarily talking about what an actual system is doing, whether a computer system or some sort of biological system. We're talking about a modeling, statistical architecture. And it would be a secondary conversation about what systems are actually doing with backpropagation. So this isn't just like saltwater flowing from the ocean into the river. Okay, just to sort of forestall any preliminary connections we might want to draw. But we can see, once we have a bit more sense of what backpropagation of error is in this graphical setting, how it could apply. So they're introducing this concept of a computational graph, fancy G — not free energy related, just G for graph. And it consists of two kinds of elements, just like all graphs or networks do: vertices and edges. The vertices represent intermediate computation products, for example the activations at each level of this multi-level predictive processing architecture. So we have, like, nodes within the level. And then there are edges that represent influence, here as differentiable functions. So this is related to some of the graphical Bayesian approaches we've looked at previously, but it also shares a lot with some neural network representations, where the edges are not just like a correlation or mutual information, but are more like a differentiable function.
We're going to restrict to looking at a specific kind of graph here — computation graphs, which they're saying are directed acyclic graphs. So that means that it's directed: there's an arrow, such that the differentiable function is going from one way to the other; we can think of that as the forward direction. And the acyclicness is just ensuring that there's, like, one place where we could start, and we could just forward propagate everything out to the edges, and that wouldn't trap us into an infinite recursion by going forward. And we've talked about some of the differences between cyclic and acyclic graphs with Majid, when we were talking about the DCM and all the ambiguities of that approach. Okay. Stephen, anything? And then we'll continue. Just one quick question. Ontologically, how useful or problematic is the word coding, now that we've moved a bit further down with active inference? In a way, backpropagation of error needed to be distinguished out, even as a descriptive term, just so that the connotations of coding don't sort of confuse all the discourse. What are the connotations of coding? Well, that it's basically almost like a representational kind of script that's been written somewhere. It's kind of like the coding that you get in a computer program or something. And I kind of think that does sort of speak to the nature of it, because backpropagation of error kind of speaks, in some ways, to the nature of how the code is processed. And I may be off topic here, but I just kind of wonder, as we're going through this and trying to link it all together, how much that ontological term is also possibly something that could be thought of in another way, to make it easier to understand what's going on. Okay. We are dealing here with computational architectures.
So I don't think we need to run away from these questions, which were brought up more in a philosophical, non-computational version, into the anti-computationalist perspective; like, we're not fighting over what is happening in the brain here. We're using predictive coding to talk about information coding and transmission approaches. So that's one thing. And yeah, I mean, we can't just go around flipping different words. Predictive coding is what, for decades, has been the reference for a certain architecture, and then there's a lot of openness in how that architecture is applied. But I'm not 100% sure — like, this is about predictive coding architectures. Yeah, no, that's cool. I'm just putting that out there, seeing how that floats a little bit. I think that's a good response. Like, what else could be in this position of coding? It's about how signals are encoded and what is being transferred between different nodes. Is what is being transferred the raw value? Or, in the predictive coding architecture, we have two directions that signals are being passed in this compute graph — not in the brain. That would be, coming from one direction, the expectation — that's the predictive part — and then the error residual, that's the deviation from the expectation. And so that is what the predictive coding architecture is: it's just about having graphical systems that are passing expectations and differences, or errors, rather than passing just the plain values themselves. Yeah, I suppose there's also that question of how much is in the dynamics of what's going on, and the hyperparameters when looked at, and how much difference that makes to this story. Okay, so you've got the predictions and the prediction errors being passed down, but coding tends to be, like, the numbers of error.
But in some ways, the numbers can kind of evolve once the dynamics of the system — the dynamical sort of patterns of the system — start to fire up, so to speak. You know, it takes a little bit of time for the active inference model to start really purring along sometimes. So that kind of coding is, in a way, almost more live than traditional coding. Okay, I got you. We're not even in active inference yet — we'll see some of that come back later — but we're discussing predictive coding and backpropagation of error. Action does get brought into the picture in section four or five, but up until then there's no action, even in the model. So we're just looking at the architecture that does computation, and then understanding how action could be brought in. And that's kind of the fun surprise that we uncovered in the dot zero and dot one, and in the layout of the paper: when the architecture is sufficiently general and powerful, incorporating another variable that is about action, like policy selection, becomes fluent. And that's why these architectures are composable with different sensory modalities, with different scopes of action, with different affordances, all these things, because there's a level of generalizability that we can pull back to. Okay, so let's return to predictive coding and backpropagation of error. We have this graph; the vertices represent intermediate computational products, like a number or a variable, because a vertex is kind of where you're storing the intermediate computation product. And then the edges are differentiable functions — like y equals two x is a differentiable function; anything that has smoothness in its graph is differentiable. And as the orange part says, we're restricting this to directed acyclic graphs, where we're not going to get into an infinite regress just by going forward.
Now, what happens when we're trying to tune this architecture so that it has a minimized loss function? Whether we're dealing with least-squares error, like a regular linear regression, or some other loss function, the loss function is what is trying to be minimized. So when the ball is rolling downhill, we're minimizing, like, the loss or the energy function — but the energy function approach and wording comes more from the physics side, whereas the loss function is just a purely statistical claim: what is the function that we're minimizing? Okay, given the output vertex v out — this is the one that we actually want to be reducing the loss around — the loss function is going to be L; L is a function of v out. Backpropagation can be performed upon a computation graph. So we're going to be minimizing loss at a certain vertex by back propagating upstream a lot of what led to it being the way it is. And that could be done on just the last tip of the spear, or it could be done throughout the whole system. Backpropagation is an extremely straightforward algorithm, which simply uses the chain rule of multivariable calculus to recursively compute the derivatives of children nodes from the derivatives of their parents. Okay, so Formalism 28: this fancy d is the partial differential. So the change of the loss function with respect to the change in an intermediate variable v sub i is what we're looking at. We're looking at how we're going on that bowl: are we dropping on the loss function? Is the model getting better or worse with respect to changes of intermediate computational products, the vertices? And that derivative is a sum over the v sub j in P, the parents. So we're looking at v sub i, like the node that we're focused on, the node of interest or something, and we're going to sum over all of its parents.
And then for each of its parents, we compute something that's basically the change in the loss function over the change in that parent's value, v sub j, times how that parent changes with respect to a change in the child. So we're taking something that's like this graphical network, with nodes and edges and differentiability, and then figuring out which way is the overall downhill, based upon all of the backpropagation of the gradients, by starting with this output gradient — that's the out. And if all the gradients are known — so we're dealing with a model that we specified — then the derivatives of every vertex with respect to the loss can be recursively computed. Because all the edges are differentiable, the big composite function is also differentiable. So, like, 2x is differentiable; x squared is differentiable; x cubed and x to the 11th power are all differentiable functions. So you could also differentiate their concatenation through addition — that'd be, like, the derivative of 2x plus the derivative of the other. And the chain rule in calculus allows us to nest those. So if we can differentiate 2x, and if we can differentiate x squared, we can differentiate 2x, inside of parentheses, squared. That's basically the chain rule in calculus. And that is what allows this sort of big downhill computation, based upon the specification of the graph and then the backpropagation through all the relevant nodes, finding out how those need to be tuned so that the final output node that's being optimized can be going downhill — defined as going down on the loss function. Less loss is better here. Okay. Yes. Excellent. Thank you, because this was one big opaque Markov blanket to me when I read it; I just couldn't make any sense of it. But is the ability to differentiate — is what they're saying here that the ability to differentiate is an ability to build relationship?
That parent-child thing — is it the capacity to differentiate that allows for the relationship to be identified, or is it the other way around? Because when they're talking here and saying backpropagation is performed upon a computation graph, backpropagation is extremely straightforward and uses the chain rule of multivariate calculus — when it's saying that, it's pointing back to something that you don't necessarily have to see, but you know that there is some relationship, just like there is between 2x and x squared. Is your explanation helping me move closer to a better understanding of that? Because I understand derivatives, but I'm not sure how the derivatives fit into this explanation. Great. Yeah, I think we're solidly, solenoidally exploring. Now, we haven't even brought in prediction — these equations here, 28 and 29, are about backpropagation on computation graphs. So with these papers, it's really important to know what has not yet been brought into the picture. And there isn't error here: this is just about the nodes and edges — the intermediate computation values of the most abstract kind and their differentiable relationships. So now they're going to show how predictive coding relates to backpropagation of error. They write: predictive coding can also be straightforwardly extended to arbitrary computation graphs. To do this, we simply augment the standard computation graphs with additional error units for each vertex. So let's recall some of the graphs that we previously looked at, here in figure one of the paper. We have a graph where the nodes are intermediate computational products and the edges represent differentiable relationships. And importantly, some of these intermediate products have the semantics, or the interpretation, of being estimates about the mean, mu, at that level, or estimates about variance and error at that level, epsilon.
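Before moving to the augmented graphs, the chain-rule recursion from Formalisms 28 and 29 can be traced by hand on a tiny computation graph. This is a minimal sketch with illustrative values and edge functions (v1 = 3, f(x) = 2x, f(x) = x squared); none of the specific numbers come from the paper.

```python
# Minimal sketch of backpropagation on a tiny computation graph,
# in the spirit of Formalism 28. Values and edge functions are
# illustrative, not taken from the paper.

# Forward pass: v1 -> v2 -> v3, with the loss L read off at the output
v1 = 3.0
v2 = 2 * v1       # edge function f(x) = 2x, local derivative 2
v3 = v2 ** 2      # edge function f(x) = x^2, local derivative 2x
L = v3            # output vertex v_out; here L(v_out) is just the identity

# Reverse pass: the derivative at each vertex is a sum of chain-rule
# terms coming from the vertices between it and the output
dL_dv3 = 1.0                 # gradient at the output vertex
dL_dv2 = dL_dv3 * (2 * v2)   # through x^2
dL_dv1 = dL_dv2 * 2          # through 2x

print(dL_dv1)  # 24.0, i.e. d/dv1 of (2*v1)^2 = 8*v1
```

Each line of the reverse pass is one application of the recursion: the derivative at a vertex is the already-known derivative of the vertex it feeds, times the local edge derivative — exactly the nesting of 2x inside x squared described above.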
So this is kind of like a subtype of graph: the abstract kind is just intermediate computational products, and predictive coding networks are specific kinds of computational graphs where some of the nodes represent mean estimates and some of them are error estimates. In contrast — and hopefully not to confuse — here's something from 26.2, where we talked about integrator chains. Notice that all of these nodes can be interpreted as intermediate computational products, but there's no error term here. So this is an example of a computational graph that doesn't have error terms. This is not a predictive coding architecture — not that it's inconsistent with or somehow disjoint from one, but it has no epsilons, so it's not a predictive coding architecture — whereas this is sort of a canonical multi-level predictive coding architecture. Great. Hey, Jacks, good to see you in the chat. Okay, now we're going to return to the backprop. So we're adding in these additional error units, epsilon sub i, for each level of the graph. Formally, the augmented graph becomes G tilde. So previously fancy G was just vertices and edges, nodes and edges; now the vertices are being unpacked so that V can remain for the mean estimators — which are ultimately the ones that we would like to have the most accuracy on — while also accounting for variance estimators. So epsilon, this big E, has been pulled out; it's the set of all error neurons. We then adapt the core predictive coding dynamics equations from a hierarchy of layers to arbitrary graphs: the change in v, the nodes, with respect to time — prediction is happening through a sequence of observations in the predictive coding architecture; that's the nowcasting, Kalman filtering, frame differencing, a lot of these other sort of real-time algorithms for mean and variance estimation that we've talked about — equals the change in F, free energy, with respect to a change in the node.
So now the loss function through time is going to be related to the F that had been defined in earlier equations. And that is going to be defined as basically the error — this is in Formalism 30, the epsilon sub i, the error variance at that last level — minus a sum over C, where C are the children. And then there are just some more terms involving, basically, tuning the nodes, accounting for the fact that we want to tune the variance estimators and the computation nodes that represent the mean estimates. And so then the dynamics of the parameters on the vertices V and the edge functions theta — they slightly evolve the edge notation towards theta — can be derived as a gradient descent on F, where F is the sum of prediction errors of every node in the graph. So previously they had specified how the composability of the multi-level predictive coding architecture is such that the levels can be factorized from each other, which allows us to sum their free energy contributions into, like, a free energy total. And that is how a multi-level model can be fit: it's finding the free energy minimum for that whole multi-level system, which might entail one of the levels not being at its own perfect lowest free energy state. And in this section they've connected the way that backpropagation, in 28 and 29 — without any action, without any prediction — is able to make a big loss function that's composed, because of the tractability of the compute graph. All of these little loss functions can be summed into a big loss function that you use to, like, train a neural network, for example. And then they move that into the predictive coding and free energy minimizing space by saying, well, this is a special kind of compute graph that has the mean estimators mu and the variance estimators, the error terms.
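A toy version of that "gradient descent on F" dynamic, where F is a sum of squared prediction errors: a single latent mean estimate v sits between a prior from above and data below, and relaxing v along minus dF/dv settles it where the two errors balance. The model, the numbers, and all the variable names here are illustrative assumptions, not the paper's exact equations.

```python
# Hedged sketch: predictive coding inference as gradient descent on
# F = 0.5 * (eps_bottom**2 + eps_top**2), a sum of squared prediction
# errors across two levels. w, data, prior are illustrative values.

w = 2.0        # edge function parameter (theta): the prediction of data is w*v
data = 1.0     # observed value at the bottom layer
prior = 0.5    # prediction of v coming from the level above
v = 0.0        # latent mean estimate (a mu node), initialized arbitrarily

lr = 0.05
for _ in range(500):
    eps_bottom = data - w * v   # prediction error at the data layer
    eps_top = v - prior         # prediction error at the latent layer
    dF_dv = -w * eps_bottom + eps_top
    v -= lr * dF_dv             # dv/dt = -dF/dv, discretized

# At equilibrium the two errors balance: v = (w*data + prior) / (w**2 + 1)
print(round(v, 3))  # 0.5
```

The fixed point is not where either error alone is zero; it is the compromise that minimizes the summed errors, which is the "error spreads out through the network" behavior described in the paper.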
And then we're going to do something similar, where we compute what amounts to a loss function F — the sum of prediction errors of every node in the graph — based upon internal prediction errors being minimized. So in the non-predictive-coding frame, we're minimizing a big loss function that's composed of smaller loss functions; in the predictive coding frame, we're minimizing our divergence from expectations, composed of the divergences from all the smaller expectations. Okay, yeah, go for it. So, extrapolating hopefully again from this: predictive coding, by working with the errors — which are, and I know I'm making a little bit of a jump here, kind of implicit to whatever the system of interpretation is on the data — that's why you're going into this biological world, because it's so bottom-up in a way. It can start just with whatever the errors are in the nature of the system that's been set up for doing the sensing, even before getting to action; just the nature of how the sensing is set up will start to generate the errors at some implicit level. Would that be a fair assumption? So, anytime there's going to be a prediction of observations — and yeah, we don't need action here yet — there are two options: either you have 100% predictive power and your error is zero, or there was some error. So you're always in one of those two cases. And which one are we usually in? Not having the total, perfect, 100% accuracy down to, you know, the 100th decimal point — if for no reason other than that's beyond the point of diminishing returns, or that we're also trying to balance accuracy with model complexity.
Because we have these metrics, like the AIC, the BIC — all these metrics that find us a model that's a balance between accuracy and complexity, so it's not overfitting. The cost of not overfitting, the cost of having a simpler model, is literally increased error, because with every new parameter that you add into a model, you always explain more variance. The first principal component takes the most variance of the data. The second principal component takes the next most variance of the data. Every principal component on a data set is going to continue to eat more and more variance out of the data, leaving less and less in the error term. But it's a situational thing whether you need to explain 80% of the data, or 90%, or 95%; there's always going to be some difference between the prediction and the observation — again, unless one is in this super edge case where it's perfectly known. So yeah, go for it. Yeah, that's helpful. And it's also implicit there that wherever you set that complexity-accuracy balance, the nature of the model will change the nature of the errors, and basically the best it gets to — the most it can reduce — will vary a little bit as you change that model. And not to make too much of an extrapolation, but that is a nice way to be able to unify a lot of more complex and rarefied ideas about models and, you know, what is really happening: you can sort of drop back to that complexity-accuracy principle.
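The principal-component point — each added component eats more variance, leaving less in the error term — can be checked numerically. The sketch below uses random correlated data and illustrative names; the cumulative explained-variance curve only ever goes up as components are added, which is the accuracy side of the accuracy-complexity trade.

```python
# Numerical check: each added principal component explains more of the
# data, leaving less variance in the residual. Random correlated data;
# all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated columns
X = X - X.mean(axis=0)                                   # center the data

# Singular values give per-component variance shares
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

print(cumulative)  # nondecreasing, reaching 1.0 with all 5 components
```

With k components kept (the "complexity" dial), the residual error is exactly 1 minus cumulative[k-1]; choosing k is the 80%-vs-95% judgment call from the discussion.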
Here are some more sections from 4.1 — towards the end of 4.1, on page 30 in the paper. So yeah, hopefully we've been representing this at least mostly accurately, and for those who want to go deeper, that's why there's a paper, and why there's continued discussion and development on these ideas — because they drop many little crumbs of "we've held this to be zero for simplicity." But they write: we can think of this as a predictive coding network in which all the error is initially focused at the output of the network, where the loss is, and then, through dynamical minimization of prediction errors at multiple levels (layers), this error is slowly spread out through the network in parallel until, at the optimum distribution, the error at each vertex is precisely the credit that should be assigned to it for causing the error in the first place. That's pretty interesting. It's kind of like: let the chips fall where they may, with respect to error and variance attribution. Dean? This is awesome. So, are they also — the way the architecture is, and the way the graph is set up as kind of a mirror of the architecture, this layering thing — are they also there to kind of contain little fires of mistakes and errors? Is that why it's effective, because it contains something from suddenly causing a chain reaction? When we read Majid's paper — I think it was Majid's, or maybe it was Axel's paper — about how you can make predictions going right off of a cliff, and the result is you no longer exist: is this kind of an architecture that helps us avoid some of those really consequential prediction consequences?
That's an interesting idea. I think we should engage in some slide play and try to get a little bit of clarity on that, and hopefully we'll also hit on a lot of the ideas that we've been bringing up in a more abstract way, because I know it's helpful for everyone to see multiple coats of paint on this; it's the epitome of abstraction, or the austerity of those formalisms about compute graphs. So let's think about — we won't take it to the cafe level, even though I remember that being quite fun — we're going to be between the cafe and, you know, something else. So we'll be intermediate, because we're going to be talking about a specific graph, but we're not going to necessarily, for the first pass on it, go all the way in. Okay, so the node that we care about being most accurate on is seven. This is going to be our focal node. This is, like: we want to have an image classifier algorithm, and this is the classifier node; we want to have prediction on what the thermometer is going to say, that's what this is; we want to have prediction on the temperature in the room, that's what this is; we want to have inference on action, and think about the upstream parents of action, then that's what this node is. So seven is going to be the node that we're interested in minimizing our loss function on. Okay, this can be carried out on multiple variables, but we're going to just look at minimizing on this one. Okay, what is the backpropagation chain here? Well, let's think about what the forward and the reverse arrows are. So first, just to clarify: this is a directed acyclic graph. There are no loops in the graph, and the arrows are all directed; they reflect differentiable functions, causal influences amongst these variables, these intermediate computational products. So, like, variable one: there's some differentiable function that translates it into two. Two
influences four and three three influences five and six six influences seven okay we've seen this in the basing context where the sparsity of the connections allow for the factorization of this graph like it allows us to sort of hold one part um unchanging and change other parts through factorization that's a lot of the variational inference stuff that we've talked about okay so what are the parents of seven which nodes do you think that we should consider if we're going to back propagate so the forward model the actual like causal chain in the model not in the world is like one you can imagine some perturbation happens to one every node is going to be influenced that's the forward perturbation in the forward generative model what is going to be invoked if we start at seven and do a back propagation which nodes are not important five and four right yeah five and four will not be invoked in this optimization scheme so if we were so we'll make these ones gray that we're going to be back propagating on but um and I also hope I'm representing this accurately but um you could imagine that like changing the dials on five or the function between three and five will not reduce the loss on seven so it still might be an important thing in the forward model but it's not in the back propagation from seven it's in the forward propagation from one um okay so then um this is not a predictive coding architecture because in the predictive coding architecture some of these um that's what we saw in like for example in figure two is what we saw in figure one so there's the means and the variance estimators so let's bring in figure one and try to adapt it to our graph so our graph is sort of up to formalism 29 now we want to think about a graph that's not just nodes and edges of the abstract type but it's going to be nodes and edges that include these error neurons as they're calling them here okay so here's a reminder on what that architecture looked like jacks wrote how is the back 
propagation of error related to John Holland's bucket brigade, the bucket brigade in Signals and Boundaries? Explain a little bit: what is the bucket brigade? We'll look it up, but it sounds like a bunch of people helping out and sharing buckets, or something like that. Okay, so let's just say that the one we actually want to minimize is going to be a mu, a mean estimator. Now let's modify this graph so that this one is going to be an error. I'll just type the variables: this is a mean estimator, and then we're going to alternate, so that this one is an error, this one is a mean, this one's a mean again. This is not the exact same architecture that they laid out, and this might not match traditional or classical standards, but it's getting us towards the idea that, within a cortical column so to speak, we have the forward propagation of alternating layers of means and errors. We want this loss function to be minimized by diffusing the uncertainty across these error nodes; that is, letting the chips lie in terms of the variance attribution, and, as they write, precisely the credit that should be assigned to causing the error in the first place. So if node two in the forward causal model was causing 90% of the variance, so one was a very tight distribution and it gets blurred a ton on the way to three, and then from three it only gets blurred a little bit on the way to seven, then two is going to be big, a big error estimator, and six is a small, tiny variance estimator. It'd be like if we were in classical statistics doing brain imaging on a bunch of people from two different categories, with and without some diagnosis. You could imagine that in that two-level model, in one world there would be massive variance amongst people and very little difference between those two higher groups; in another world it could be the opposite: there could be a big difference between the two groups but very small differences amongst the people within a group. And so this is the parameterization, and finding the parameters that minimize the loss function means we want the variance to be partitioned appropriately across the error terms, just like we would like the mean estimators to be accurate. The equivalent in the error world is that we want the errors to be appropriate in how they're distributed. It's like doing a mean estimator on something: you don't just want the highest mean possible, you want the most accurate mean possible. And the most accurate error possible doesn't mean the lowest error possible; that would be the fallacy of "well, we're estimating means, so let's just make it bigger." We're estimating errors not because we want bigger errors or smaller errors, that's the thumb on the scale; estimating errors accurately entails appropriately distributing the error terms on this graphical architecture. And then they add one extra point, which is basically that the dynamics of predictive coding are purely local, requiring only prediction errors from the current vertex and its children. On the computational front that's a win for tractability and for the actual algorithm, because you don't need to have the whole model loaded into RAM, for example; you know which neighborhood of variables you're going to engage when you're doing a certain parameter update. And then, on the biological side, it starts leaning towards plausibility, because they even call it an error neuron. Well, neurons signal forward, let's just call that the direction the action potential travels down the axon, but there's also retrograde signaling at the synaptic junction. So it's plausible at the multiple-neuron architecture level, or even at the two-neuron biological system level, that even if there's a direction in which
signaling is assumed or perceived to be normatively happening, there's also retrograde signaling of multiple kinds. We won't go into the complexity of retrograde synaptic regulation, but at the circuit level, and even at the synapse level, the idea that signaling happens both ways is seen as a strength of the predictive coding architecture. Again, the locality makes it more computationally implementable, and this sort of bidirectionality in signaling is more reminiscent of biological signaling processes: synapses, conversations, brain regions, and so on. Okay, any thoughts on this? I think that's a pretty good explanation. Go for it, Stephen. Just checking: if you added up all the errors, are the errors kind of independent of each other, or is there a total error across the chain, so to speak, of how much error there can be? Or is it kind of contained, like is six relatively contained and two relatively contained? Okay, Dean. Can I take a guess, Stephen? Yeah, sure. I'm guessing, so just keep the "wild guess" part of this front of mind. I think what the graph is telling us, or at least when we formalize it in this way, is that you could always find the cumulative error, but the point of the system is to try and keep the error where it actually exists, within each layer. So sort of partnering the error at each level with whatever the prior, the hierarchical l-plus-one or l-minus-one layer, is. But again, I'm just guessing; I'm kind of making sure that I understand completely what Daniel laid out here, and that would be my guess. I mean, you can add it all up, but I think the point is that the architecture is set up so that, before it runs away from you, there will be some cumulative effect, but it shouldn't be as abrupt as if you didn't have the layers, and you didn't have the ability to differentiate, and you didn't have an understanding of what is happening, i.e., backpropagationally. Now, again, I'm just throwing that out there, and because I don't have a reputation around this, I could be completely wrong. There's a lot to say, and the authors do provide many citations, but I believe that's right. Just as with appropriate mean determination, we're thinking in a Gaussian world. There's a lot to say on Gaussian processes, but a Gaussian is a mean and an uncertainty: there's going to be somewhere where the hump of that distribution is, and then some width to it. Now, sometimes the process is actually Gaussian; other times we can use nested Gaussian processes, so that there's some process moving it and then wiggle around it, and we see the moving part as one Gaussian and the other as a ripple, a smaller Gaussian on top of it. And even in cases where there's a non-Gaussian distribution, that's where the Laplace approximation comes into play, which uses the second derivative: we have the Jacobian and then the Hessian matrix in equation 26. We talked about how, when you take the second derivative, you're basically making that bowl and then lying in it, so you're going to the bottom of the bowl that you created, and you know that it's going to be the sort of hump structure that's amenable to going downhill. Because once you go beyond quadratic, beyond the hump, anything that's a multi-hump distribution, or even just x cubed, you're going to have challenges getting globally optimal results from local optimization processes, because you won't know whether you're in a local energy well or not. And so many of these computational tools and approaches that we're discussing across these livestreams and papers can be understood as taking something that isn't like a bowl and making it so that it can go downhill, and finding what
is the appropriate level: how many layers have to be added, how many oven gloves do I have to put on, how many approximations should we carry out in our approximation science, so that there can be something like a parabola, where we know that we're on parabola territory? That's what allows us to minimize a loss function, because if one believed one were on a rugged landscape, then minimizing something locally, you'd have no way to know whether that was on the right track or not. Yes, Stephen. Could I say, then, that each of these areas, say number six, number two in here, would be like little mini bowls at different positions, and the further up the chain, the bigger the potential impact of the positioning in that bowl, or whatever the landscape is, out towards the mean that's being understood? Okay, let's look at Figure 1. This is a cortically inspired, so, inspired by the cortex, and Blue wrote: "Dean, I understand the architecture the same way; error correction is retained locally." So let's look at this semi-cortically-inspired predictive coding hierarchical system. Just like in the example of the two clinical populations with brain imaging, there's one world where the patients within a group vary a lot and the groups don't differ; there's another world where the groups differ a ton and the people within each group have hardly any variance amongst them. So depending on how it's parameterized, one could have a model where the highest level has a major influence or not. And that's the whole question. It's like asking, in a linear model, which is more important: the mx part, the slope, or the b, the intercept? But if the regime of attention is on linear modeling as an architecture, the question doesn't quite make sense, because they're both just parameters in a model. Now, for any given data set, it's an empirical question to what extent changes in the slope or the intercept are associated with changes in the loss function, and that's linear regression. Here's an architecture that gives us an approach to minimizing loss functions, but how do we know how many layers to use, how many of these side connections to make? There are a few ways to go about that. One could a priori model the structure of the graph around something else, like previous work or some inspiration from a biological system. One might also be interested in that structure being a parameter of the model, and that's the whole structure-learning and meta-Bayesian approach, but that's pulling back another level, and that's the challenge of structure learning: we can minimize the loss function within a given static graph, but that doesn't mean we have the best possible graph architecture. It's like if you have a hundred pieces of biometric data on people and you're building a multiple linear regression to find their risk for some condition: do you just use all hundred? That might be overfitting, and it might be too costly computationally. So how many of those variables do you use? There's a process for determining that. That's the trade-off between accuracy and complexity, where the model fits the outcome well, the loss function on the condition you're actually trying to predict, but not past the point of diminishing returns, and not using variables that are uninformative. For example, if there are two variables that are perfectly correlated with each other, that's called multicollinearity, and you wouldn't want to use both of those variables. So that's the challenge in the linear regression world, and this is taking it into another space, but it has more analogies than not, which is why SPM, Statistical Parametric Mapping, that textbook, is a vital prerequisite for those who want to engage, because it addresses a lot of these questions on model fitting, including on dynamical systems, and it puts it with both feet on the
ground in classical parametric and non-parametric statistics. Dean: this just raises a question in my mind, which I haven't answered, as to whether a mean is constructed, or actually sculpted, or some combination of the two. Being a min-two guy, I'm going to assume it's probably both. Like, you can have the figure, but it's your understanding of the difference between an edge and a directional link, which you have to kind of carve out from the overall image. But again, that's one of those things where, if you get into this predictive coding piece, I think the product looks like a construct, but in actuality there's got to be a huge element of being able to isolate the mistakes so that they don't overwhelm. And if you're isolating something, if you're removing some factor, you're actually sculpting as well. So maybe that's a bit of a philosophical overlay on the mathematical, operational side, but you know me, I'm always trying to turn it into: what's the practical function of this? Yeah, it's like statistics as world-building. Because if you just want the mean, then just take the grand average. If you want the mean and the variance, there's a method for that. Do you want a variance on the variance? How confident are you that it's 10 plus or minus 1? Is it 10 plus or minus 1, plus or minus 0.5? Or is it 10 plus or minus 1, plus or minus 0.01? That's a tight estimator on the error as a parameter. And so we're playing with which parameters are seen as the things in and of themselves, the means, and which ones, with respect to the means, have this interpretation of being error neurons or variance descriptors. Now, a properly fit model, which you could take to fifty stacked levels: 10 plus or minus 1; on that variance estimator, plus or minus 0.1; how sure are we about the 0.1? Plus or minus 0.01. Okay, how sure are we about the 0.01? Plus or minus 0.001. That would be a well-fit model, where as we kept adding uncertainty parameters we would be making appropriate estimators. Now, what would a mal-fit model look like? And I hope this is accurate: 10 plus or minus 1, and you ask, how sure are you about that 1? Plus or minus 50. The variance just exploded. You're saying the variance could be 0 or it could be 50; how much confidence should we then have in the mean estimator? And so that's what this appropriate distribution of uncertainty is. Now let's think about node 7, an uninformative node 7: we're predicting temperature, but it's just unchanging. We don't have a lot of information there to fit a world model, because it could just be a single-level estimate: it's 10 and it's not changing. But as we deal with progressively more nuanced and informative things that we want to minimize loss functions on, like natural language, images of natural scenes, or action, including the unknown consequences of real-time unfolding action, then because the uncertainty becomes higher, and the information content of what you're trying to minimize your loss function on is higher, it, not authorizes or licenses, but kind of moves us into potentially having higher levels of model complexity, as simple as possible and not simpler, with that principle being carried through. Again, if you're just trying to predict a constant number, you don't need a six-level predictive coding architecture, whereas if there was something that did actually have that many levels of depth, that's the kind of architecture that would do well to predict it. And we talked about this with Parr and Pezzulo on the architectures for homeostasis and allostasis: to just recognize when you're out of bounds and then come back in, that's one model, but then there's a different graph that's going to do something like be intermodal or anticipatory, or have memory. So all of those cognitive or computational features license or engender more
complex causal models. But these are being sculpted ab initio, from nothing, by the statistician, and then we're in sort of brackish or gray waters when we juxtapose the architecture of the graph with the architecture of some natural system. Okay. Yes, Stephen. Yeah, and you've got this excitatory and inhibitory dynamic going on, and even though the arrow is going from the error down to the mu, the mu is exciting the layer above, isn't it? It's almost like the bottom-up is exciting higher-up areas, and then things that are further up are also sort of saying, hey, calm down, calm down, I can reduce some of this, you're getting a bit out of hand here, you know, they've only scored two goals, let's relax. Again, I also hope this is not inaccurate, but it's almost like there's a suppression of error, a dampening being carried forward, with the ultimate dampening being: you predicted it perfectly, you have totally quelled all error. Yet in real settings, because of model simplification or uncertainty in the world, error is coming in from the observables, which is often what we're trying to reduce our loss function on. And that's this whole difference between reinforcement learning, finding oneself in reinforcing or rewarding states, and active inference, reducing surprise about outcomes. We're dampening error and carrying that forward in the forward generative model, and then also able to run it back in a tractable way. It's like you're dampening the vibration, but the vibration is still entering in, and you'd want that vibration to be appropriately distributed according to where it should be. Okay, Dean, then Stephen. Yeah, it seems to me that what this is saying is that, because of the way things are organized, two things can live simultaneously at once. One is sort of that local mean correction, and the other is the more generalized, global, other direction in the bidirectionality parallel of mean correction. One is reinforced through external evidence, while the other is reinforced through localized, well, it is distributed, but it's almost like the ability to localize and isolate if necessary. So, I know we've talked a lot about top-down and bottom-up, but this brings a lot more color, in my mind, to some of those conversations we've had before, in terms of arrows going in opposite directions. And it's not so much that they're going in opposite directions, but that you can be going two different directions, quite literally, at once, because of the way things are set up. So yeah, I think this has been really helpful for me, in terms of not only understanding what backpropagation is, but actually giving me a bit more of a foundational idea around how we turn these statistical ways of representing things into, so, I'm confused: should I go into the cafe or not? I mean, not all of us have to go to that level of detail to really understand what the final thing was that pushed us over the door threshold or not. But it appears that this has been around for a while, and being able to take that, if you really are stuck because you don't know what the parents are of your current action, because, as you say, we haven't gotten to action yet, but that's the whole point: there is an architecture there that's going to allow us to have choice. Thanks. This minimum-of-two comes up so much, so it's awesome to hear about that, and then to connect expectation-maximization, and the tale of two densities, and epistemic and pragmatic. It's like wherever we look there are partitions into two, into multiplicity, and unity through plurality. And then other times, even within the one road, there's movement in both directions, forward and backward in time, or the forward propagation of a generative model
which, from just one, if all you knew was one, you could simulate sevens with; and if all you knew was seven and the structure of the network, of course you could backpropagate the best possible mean and error estimates using backpropagation of error. Okay, Stephen. This is helpful for thinking about how we perceive things as well, for what Dean was saying about making choices. When we get zero error, or whatever is seen as negligible error, you want to just allow that to carry on and stay with your imagination. So I go to the bar, and a lot of things are the same, right? Maybe there's a happy hour sign, and if that's unusual, maybe I'm happy about that; I note it. But if a lot of things are exactly, or very nearly, how I expected, then in a way my perception is probably mostly my predictions just not being contradicted. Actually, I just don't see it; I think I've seen it; I just don't need to see it. And where there's something ambiguous, like you were saying, that number seven, if we're going to use this diagram, starts to have to be looked at. You start peering at things, straining your eyes, trying to see: well, what's that? Is it a dog? Is it a man running into the woods? You start to notice things. I think this is also quite useful from that perspective: you don't actually need to see a lot of stuff. It's quite an efficient model, right? If the error goes below a certain level, you just move on and carry on. Of course, that's where the challenge with autism is, they think: too many things are being looked at that are possibly negligibly relevant at the time. Okay, thank you. So just one last thing before, because I know you want to clean this up. One of the analogies I draw is with the emphasis on the word minimum in min-two. It doesn't mean there's always a hard bifurcation or a binary; it means the minimum number of things. Even when you take your shoelace and wind it through all the eyes of your shoe, you've got two ends that you eventually loop back together and entangle in some way to hold your shoe on. So even though it's one lace, I think what active inference reinforces, because it's got "active" and "inference," is that ability to see two things, both seemingly contradictory and both actually true. Now, it can be multiple things; you can have multiple surprises, some of which make you quite happy, and others where you walked in and said, well, why hasn't it been happy hour at this hour when I've been at this bar multiple other days? You could be disappointed in that as well as pleasantly surprised. So I think this again reinforces that the minimum of two allows for the integration, and the minimum of two recognizes the differentiation. And when we're talking about statistical issues, about errors, about energy, I think the way these folks have been able to discover what architecture would allow for that is what's really interesting to me now, because I wasn't even thinking in these terms of what the scaffold looks like, but it looks like there's been a lot of work done in this area. So let's take it to action as we head into our last little bit on this. And of course, whether or not you're familiar with the formalisms, just contribute to ActInfLab: help us annotate the papers and read them deeply, because there's so much to learn, and getting involved a few days or weeks before these streams is a leverage point if you want to help other people understand this area. So I've changed the colors all to gray, because they're all just random variables. Now we have the seven, so we are backpropagating error on temperature prediction, but also maybe
this system is predicting pressure, and so then three might be estimating, you know, is the water boiling. Again, we're getting one step closer to the cafe. If we were only trying to predict temperature, we backpropagate along the seven-six-three-two-one route; if we were only predicting pressure, we could backpropagate along eight-five-three-two-one. The fact that there's a parent upstream, the last common ancestor of pressure and temperature, means we might be able to reduce these errors a lot on the temperatures by having this higher-order variable, "is the water boiling." We can imagine that if the answer is yes, then it's expecting both of these to be higher, and if it's no, both to be lower. So the generative model can go from "the water's boiling" to "what pressure and temperatures do I expect," and you can add a third node. It allows you, conditioned on a cognitive structure, to go from observations of just temperature, around the horn, to pressure, or from pressure and temperature up to "the water is boiling." But the influence of these observations is limited through this choke point; it's actually a Markov blanket with respect to what's on either side of it. Let's bring action into the picture. Action can also be understood as parametric: what angle should my elbow be at? How should the eyes be oriented? Those types of things. Again, not that this is what the actual neural signaling in the brain is doing, but this is how we model action. In the car: how many miles per hour should I be going? Those kinds of questions are parametric modeling of action. And that's what was so fun and exciting: the first fifty equations in this paper scaffolded this tremendous, general node-computation architecture, multi-level means and variances, forward and backward, up, down, left, right, all of these interesting ways and connections, but not yet action. Not because we hadn't brought action in as a dancing partner, but because this was above any interpretation of the nodes as inference or action. And then it's as easy to condition on action as it is to condition on any other inference, because action is just a parameter for parametric active entities that are engaging in action as inference, planning as inference, and so on. And so one can think about the compute graph, whether it's composed of just nodes and edges, or whether we think of some of those nodes as variance estimators. And then, again, the connections are not going to rewire this whole graph, but just: is this a water-boiling situation? Okay, so we have a thermometer; it says it's 100 C, boiling temp. The pressure sensor broke, but we backpropagated, and we're high-confidence that the water is boiling. Then this is another variable, answering a different sort of question. But one can imagine that if this is supposed to be a water-boiling situation and the water is not boiling, there might be an error. So how can we reduce the loss function on the total compute graph, including observation and action? How can we reduce uncertainty about a cognitive model, including observation and action? There are a few ways to do it, and it's how they introduce it at the very beginning of the paper: the minimization of the loss function, the free energy, expected free energy, variational free energy, can be achieved in multiple ways. First, through immediate inference about the hidden states of the world, explaining perception. Second, through updating a global world model to make better predictions, slower parametric inference updates that have the interpretation of learning. And when action is a variable we have agency over, it introduces some complexities, like the need to have preferences, the need to specify a time horizon, and uncertainty about the consequences of action. Right? Wouldn't it be easy if we were just watching the movie and our actions didn't have any
influence upon the story that we saw? It'd be a different inference. And finally, through action, to sample sensory data from the world that conforms to the predictions, potentially providing this integrated account of adaptive behavior and control. So it took us fifty equations at the general level before we could jump into action, but it's a deflationary approach to action, because action is a parameter in a broader cognitive and computational framework. Okay, Stephen. Thanks, Daniel. Yeah, that's pretty helpful, actually, going back to the example of water as well. You can even bring in the higher level of affect, in the sense that one thing that's really hard with water, compared to a lot of other things, like when you heat up oil or other stuff, is that you can have the water boiling, and you can have a thermometer theoretically saying 100, but the steam can go above 100. It can trick you, right? I find it hard to get my head around steam being hotter than water, because you just think of water as being 100 degrees, right? And it's almost like, if you were to get into that situation, after I burnt myself one time, maybe I just start to have a feeling of fear if I have a sense that I'm somewhere where what might be coming out is high-temperature steam. Even that will override the fact that, perceptually, I haven't got a lot of ways to easily gauge that visually, unlike, say, a ring getting red, and other things. Okay. Here's what that makes me think of: sometimes we have indirect cues that reduce our uncertainty about something that's not directly observed, like the mood ring. If everybody was wearing a mood ring, and it was an accurate predictor, that would be this other, alternate world. And so there are certain kinds of variables that are more like the magical mood ring, or like the temperature of metal within a range, because it's only going to start glowing when it's super, super hot, so it doesn't help you differentiate whether it's at 0 C or 50 C, or probably even 200 C. But there's some range, for certain things in the world, where we can use indirect sensory cues. Affect is not quite here in this model, in this graph, but we've talked a lot about how uncertainty can be understood as anxiety and/or negative valence. In the world of reward maximization, less reward is bad; in the world of precision optimization, excess variance is bad. Not any variance; it's not a simple "destroy the variance estimators." It's about the appropriate application of variance. But excess variance, more variability than expected: you told me it was 10 plus or minus 1, but I just pulled 350 and negative 80. That's confusing; it's more confusing than I thought. So then, among these mean and error estimators in a computational model, one might be a mean estimator on valence, or one could choose to write a paper where an error estimator is framed as an anxiety parameter. And that's always that jump between what any parameter is in a graphical model and anything about the world, or attaching it to some word, because "oh, it's anxiety," well, then that makes me think of this other situation. You can send someone too easily down a road of associations, when the deflationary perspective is just stating what it is. Okay, fun. I didn't expect that we would go this way with the backprop, but Blue mentioned it last time, and it is important to bring up. It also connects to modern methods for training neural network architectures, which is what a lot of the authors' research has focused on. So let's continue to explore action a little bit. They reiterate that when we're thinking about active inference, about inference involving action, so cybernetics, control theory, any area where some parameters represent
observations or latent states and other parameters represent active things. The first way you can minimize prediction error is to update predictions to match sensory data, corresponding to perception; the second is to take action. Earlier, in the introduction of the paper, they even did a little minimum of two on this, splitting out perception as rapid inference and learning as slower updating. But we've talked about the continuity between perception and learning. If you're seeing the ball move across the screen, is that perceiving the movement of the ball, or is it learning a parameter representing the location of the ball? So we can call something more perceptual or more learning-oriented, and even abstract learning can be seen as perception in interior spaces, which has been the focus of a lot of work by Fields and Levin on competency in navigating these abstract spaces. But "perception" sometimes brings us a little too close, linguistically, to our experience of perception, when the phenomenology and the qualia are not what is in play here; it's about parameter updating in the model. Yeah, Dean?

Do you think, Daniel, it's about sample size? Sample size is related, maybe, to the time commitment: if I'm inferring something, that's a very short volume of samples, versus prediction, which is a relatively larger volume of time commitment, and then modeling, for example, which takes the perception and tries to build something material from it. Do you think that counts here? At least that's what I was telling participants we had to take into consideration: the time factor of the sampling process.

Sure. I think if we allow inference to be broad and include everything, then that's what it is. But in its connotation, inference using a model often has this quality: it's sort of
like the machine is ready, the model's ready, and we're going to do inference. A new patient comes in with their biometric data; we're going to do inference using this model and predict their risk of the diagnosis. That's the fast, perceptual time frame. Learning, unless it's one-shot learning, is taking in multiple patients and updating the parameters of the model. Now, those two stages can be alternated, as in expectation-maximization, or in "A Tale of Two Densities," but that doesn't get away from the fact that there's this one-data-point, plug-and-chug inference and perception mode, variational free energy, and then there's a bit more of updating the model parameters. Are you tuning the engine, or are you using the engine? Loosely, in the non-action space, given the ongoing stream of patients coming to your office, there are two ways to reduce your surprise within this first way of minimizing prediction error: that's the perceptual and learning continuum, which does come down to specifics like the sample size and the architecture of learning and perception. You could also reduce surprise by choosing what to sample, and/or by modifying the world. That's how they bring action in. It's very interesting to think how many pages of prose I've gone through wondering where action fits in, and there are still many more pages to go in terms of where Formalism 51 takes us. Have we cashed the check of pragmatism and enactivism, extended cognition, and so on? What is 51 with respect to all the qualitative insights of action, in all their depth? A short answer might be that it's like y = mx + b with respect to all the ways that linear models are used in the world: just a skeleton archetype, a trace or a hint, for someone who then wants to analyze specific actions over specific time frames for specific cognitive and computational models. But
that's where 51 takes us. Steven?

Yeah, one other piece that could be interesting to throw in there: as well as matching my sensory data with my predictions, I could reimagine my reality to explain my beliefs, which in a way would mean I create my own sensory data by proxy, if that makes sense. It's an interesting slight add-on, but it's the same idea, because you're taking actions to match sensory data, but not necessarily the real world's sensory data; it could be the imagined world's data. "I am seeing an angel coming over the top of that hill now," you know?

Yeah. One could engage in mental reverie, and it would either reduce their surprise in the long run, or that system would fail to persist in an entropic world. People have made all kinds of arguments along similar lines, like in evolutionary psychology, about the basis of why we perceive or believe different things. So that definitely makes sense. Let's look a little at 4.5.2, which we didn't previously get to; of course, anyone with other questions in the live chat is welcome to ask. Here, in livestream 26, we talked about PID control, about integrator chains, and about generalized coordinates of motion. Just a reminder of what 26 looked like: from left to right is time, so these are different time points, and at each time point there's a vector, like a stack, reflecting the observables and also the latent states. The latent states are not observables of the world, real or imagined; they're the derivatives of how the observables are changing. Those can be the position, which is like the zeroth derivative, but also velocity, acceleration, jerk, and so on. We looked at that in 26, but it wasn't connected formally to predictive processing or predictive coding. PID control is restated in 53, using the formalisms and the notation that is
going to prepare for the merge with predictive coding: a for action, epsilon for errors. The error term is included in all three levels. To obtain the equivalence to predictive coding, they utilize a linear identity generative model with three levels of dynamical orders. So the joint distribution of all the o's, the observables, and all the x's through time is a factorized expression: the distribution of the observation conditioned on the latent state, which is conditionally independent of the rest, so it can be factorized and calculated out separately; then the first and second derivatives, with o-double-prime and x-double-prime; then x conditioned on its own derivative, and the derivative of x conditioned on its derivative. The root node is the second derivative of x, because that one stands alone; that's where our model taps out. It's equivalent to node one here, where there's no variance term leading into it. There could be, but then that would be a different model. The equations in 54 restate this dynamical model and introduce omega, this w, which I believe is an error term, and also the mus, the desired set points for x at each order: at the first, second, or third derivative, what is desired there. Then that allows this: a graph where, instead of these being different things, we're kind of mixing architectures a little, because the generalized-coordinates chain doesn't have errors interleaved; it just has the state, the first derivative, the second derivative, and so on, rather than interleaving error terms. But we can imagine another similar graph where, as you go back, it's higher and higher derivatives; that's the integrator chain. And I guess in 55, 56, 57, and 58 they do something similar to what we had done with the generic computation graph, or to the predictive coding architecture, with the interleaving of the means
and the errors. This is just that, on a graph architecture where the nodes have the interpretation of being derivatives of one another. What is there to say? That's an interesting connection.

This is bringing back flashbacks of trying to get my head around 500-level economics courses, when somebody would walk in and start pounding stuff up on a whiteboard, second, third, and fourth derivatives, and you'd just feel like you'd been tossed into a blender with no way of avoiding the blades. So I'm glad there are people who can do this stuff, but I would have to dedicate way more time to parsing what the heck they're talking about, because it's a wall to me. It's really hard.

Yeah, I feel you. Maybe for the last few minutes we can pull back. As we swoop out of this bathtub, out of the dot-two, into the great beyond: what have we taken with us? Why was this lengthy journey through predictive coding, through the 60 formalisms, and through what Maria brought with the thousands of years of philosophy and history on perception? Where are we, and how are we different, now and going forward, from how we were before we started 43?

I'll say what it's done for me. First of all, bringing up the historical piece was really, really helpful, because this didn't just come up yesterday or in the last couple of years; it's been around for a while. I wouldn't have seen predictive coding as a sort of formalized way of retracing a navigational route, but that's what it screamed to me. After the point-one, hearing you and Blue talk about it, when the question around backpropagation came up, that's when I did a deeper dive on it. And again, I wasn't picking up on how to do that path retracement; the formalisms weren't really addressing what I thought was being said. I think what today helped me with is examining
this: if you don't have traces laid down that are material, that you can see, if you weren't actively part of the building of the path, there are still some graphical, statistical ways of organizing information. You didn't lay the breadcrumbs down, but as a result of being able to use this kind of formalism, we can still give you a better sense of where you came from. I mean, honestly, Daniel, 53? Come on. If you're one of the people who was able to write those 53 out and actually present them in paper form, props to you; you get many thank-yous. But holy mackerel. Beyond that, there is some evidence now that you can retrace, even though you don't necessarily have those markers, and even though you weren't aware of how you self-corrected; the predictive coding at least gives you an opportunity to trace that back, even though you didn't spend a lot of time organizing the evidence. The other thing it does, I think, is reinforce the way these livestreams are structured: just introduce the paper, open up the box. In the point-one you can actually go into one particular area of the unknown, which for me was the backpropagation, which you and Steven helped me with in the point-two, where we are still doing that part. And even if you didn't have a record of the point-one and the dot-zero, this paper told me that you would still have some way of knowing where that elusive dot-zero exists relative to this livestream today. Again, I could be overfitting, but I don't think I am; otherwise, why would you spring out 53 operations? A lot of interesting stuff there. Because what we're doing in the livestream and the levels that they're talking about affect each other in
exactly the same way it's diagrammed out here, as a stack. We tend to think of ours, because it's over multiple days, as a sort of horizontal continuation, but they're essentially one and the same.

Oh, that's very interesting. So there's the dot-zero, the dot-one, the dot-two, and they can go both ways. In the temporal sense, the dot-zero is a precursor, temporally, to the dot-one and the dot-two. For the YouTube channel subscriber, this is their observation: oh, 43.0 has been livestreamed, 43.1 has been livestreamed, 43.2 has been livestreamed. But what is the derivative that sets the initial conditions for the dot-two? It's the dot-one. And what, of course, sets them for the dot-one? The dot-zero. So there's almost a sense in which, as we dampen our uncertainty through action and inference, moving forward in a world that's always continually vibrating and transcending our model, it pushes the earlier actions that we take, that high school teacher, that book that you read, all the way back, even beyond one's own corporeal birth, as parameters that dampen, or specify, if one doesn't want to think about dampening, the position, velocity, acceleration, and so on. They're like higher and higher links in the chain, which means they're further and further from where the rubber hits the road for the YouTube subscriber on the outside, but they're absolutely part of the process by which something comes to be. And then you said the dot-two can exist without the dot-one, and the first time, that made me think of negative numbers. What is there before zero? What happens before zero? Is there a negative number? There's no positive number that's smaller, but what is it? Is it a letter, is it a shape, is it its own big bang? There's a sense in which it is, and a sense in which it isn't. That's part of the pre-reading. The other
thing that I think maybe doesn't register explicitly is that the error propagation, in the way we do the livestreams, remains. There are some corrections, and there is a sort of gradual movement toward a bit of minimization, but every time we come together and try to figure things out, we don't come from a position of "we figured it out"; we come from a position of "well, what will we do with this today?" So we retain that error propagation, and that's not convention. Convention is: why would we spend time on something where we leave a bunch of error baked in? But if we're going to differentiate, if we're going to truly integrate, if we're going to be able to say that you can have a point-two existing without a point-one and a point-zero, then of course you can, but how superficial is that relative to the method? Because of the way it's structured, there's how much deeper you can actually go, and how many times you can bump into a dead end and immediately find out how quickly you're going off the rails, so to speak. I can't remember which one of the livestreams had the train moving down the track, but you can find out whether or not something is a dead end quickly, and correct. That retracing part today, I think, is the takeaway from this set of livestreams, and I wouldn't have anticipated, based on the title of the paper, that the retracement piece was going to be the thing that percolated and made the most sense in terms of why the structure works: yes, there is an optimizing piece to it, but there's also a piece that allows for mistakes. Sometimes it's very consequential, but lots of times the consequence is what's minimized, not the error. And I think that's what the architecture affords, if you want to talk about affordances.

So, when you said
the consequence is minimized and not the error, that makes me think about high-reliability organizations. Of course, one way to minimize the consequences of error would be to never have errors, zero tolerance for anything, but that is challenged in a world that, again, is always throwing us challenges. So in that setting, it's almost like the observation we actually want to reduce our loss function on is not the so-called proxy of error occurring; we could reduce our loss function on the thing working, and just say we are minimizing the consequences of all this upstream stuff, some of which might have the interpretation of error and some of which might not. They're just nodes in this graph. And then, what is going to dampen surprise about the consequences? If you say, "anything that you say to me right now, I'm going to accept it and believe it," then in the trust space that opens up, there can be communication that goes deep, if it's honestly said, because of what's being minimized. "Don't tell me something surprising" is about my consequence; you can have precision on my consequence. That's one direction of the conversation, and that is what allows the second direction to really come into play, with the transmission of variability, rather than: "how am I going to reduce my uncertainty about action relative to what the person is going to tell me? I won't talk to them, or I'll tell them what to tell me, or I'll leave them in the dark about how I'm going to act," so that they'll have uncertainty about how I'll act if they say something even trivial. So I think this really spans many levels and ways here, because there are nested levels and deep chains. But I agree. On the more topical level, if such a thing could exist, some of the takeaways that I had: I really appreciated Maria providing a lot of the philosophical context, and then that one part where we're
like, wasn't this a meme thousands of years ago, and in many cultures in the world? What is it that still makes The Predictive Mind, Surfing Uncertainty, and Being You novel and surprising, and even somehow countercultural? What is happening there? What chain is leading to that, where the idea that we're generators with agency, rather than passive receivers, is such a hill to climb and die on? That's one piece I absolutely take away. Then it was awesome to work with Blue on this in the dot-one, where we had the qualitative philosophical insights on the bottom left, connecting to technical and formal insights on the bottom right, and biological systems. It's sort of like: where are we engaging deeply with one of these vertices, where are we engaging deeply with an edge, where are we trying to grab two vertices and the edge, or go around the horn, or really go for the triple play? Every base and home base, inference and action; in the bottom of the ninth of the dot-two, it's a full count and a power play. And then the last takeaway, I think, is just this: when the computational and cognitive frameworks are defined generally enough, at the level of a graphical network, which opens up message passing, variational inference, backpropagation of error, predictive processing architectures, and hybrids of all of these approaches, then action is deflated, or inflated, or whatever we even say; it becomes composable and tractable to discuss open-ended combinations of perception, cognition, action, and impact, which is home base in the niche. And if even that model has to be structurally revised, if we end up with another way of framing entity and niche interaction, it may fit within this framework or it may necessitate revision of the framework. But within the entity-niche interaction space, and the nested systems and the counterfactual cognitive
systems, there's almost less and less to action than it seems. It doesn't remove the challenge of action; it reduces something else that I'm not quite sure about. It doesn't make it easy, but something is being removed as we're learning and working through this. It makes the climb different, it makes you think differently, but it doesn't mean that fewer foot-pounds of force have to be applied. Still, there is something different about the navigation around action and inference via active inference, and I think it will continue to develop for a long time. Okay, Dean, and also Steven, thanks a lot. I wasn't sure there for a second, at the beginning, what would happen.

I've got to thank you again, Daniel, because you obviously have a reasonable understanding, far more than mine, of what that backpropagation implication was, in terms of marrying it to the predictive coding. I didn't mean to put you into the tutor's seat today, but I appreciate that you were able to pull it apart and then put it back together in a way that made a lot more sense than just reading it off the page. So thank you, sir.

Thank you, Dean. See you later.

All right, take care. Bye.
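The two routes to minimizing prediction error that run through the discussion, updating the belief (perception) versus changing what is sampled (action), can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the identity generative model, the names `mu` and `pi`, and the step sizes are all assumptions chosen for clarity.

```python
# Minimal sketch of one predictive-coding unit with an identity generative
# model, o_hat = mu. The belief mu, precision pi (inverse variance), and
# learning rates are illustrative assumptions, not values from the paper.

def free_energy(o, mu, pi=1.0):
    """Precision-weighted squared prediction error (up to constants)."""
    return 0.5 * pi * (o - mu) ** 2

def perceive(o, mu, pi=1.0, lr=0.1):
    """Route 1: update the belief mu toward the data (perception)."""
    eps = pi * (o - mu)           # precision-weighted prediction error
    return mu + lr * eps          # descend free energy with respect to mu

def act(o, mu, pi=1.0, lr=0.1):
    """Route 2: change the sampled data toward the prediction (action)."""
    eps = pi * (o - mu)
    return o - lr * eps           # descend free energy with respect to o

# Perception alone: the belief comes to match the world.
o, mu = 10.0, 0.0
f_start = free_energy(o, mu)
for _ in range(50):
    mu = perceive(o, mu)
f_perception = free_energy(o, mu)

# Action alone: the sampled world is pulled toward the belief.
o, mu = 10.0, 0.0
for _ in range(50):
    o = act(o, mu)
f_action = free_energy(o, mu)

assert f_perception < f_start and f_action < f_start
```

Both routes descend the same precision-weighted error; which variable is allowed to move is exactly the perception-versus-action split the conversation keeps returning to.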