Hello and welcome, everyone. It's April 28th, 2022, and we're here in ActInf Lab livestream number 43.0, discussing "Predictive Coding: A Theoretical and Experimental Review." Welcome to the ActInf Lab. We're a participatory online lab that is communicating, learning, and practicing applied active inference. You can find us at the links here on the slide. This is a recorded and archived livestream, so please provide us with feedback so we can improve our work. All backgrounds and perspectives are welcome, and we'll be following video etiquette for livestreams. Head over to activeinference.org if you want to learn more about how to participate in the livestreams or anything else happening in ActInf Lab. Okay, we're here today to learn and discuss the paper "Predictive Coding: A Theoretical and Experimental Review" by Beren Millidge, Anil Seth, and Christopher Buckley. This video is just an introduction and a contextualization for some of the ideas and some of the details of the paper in the broad sense. It's not a review or a final word. Rather, as with some of these other technically dense papers, it's an opening for those who have technical questions, whether on the learning side of things or on the more advanced research side, to come get involved, because this is a technical paper, but also, as we'll hopefully unpack, it has a lot of cool biological and philosophical implications. So in 43.0, we're going to say hello and introduce some big questions, then cover the aims and claims of the paper, the abstract, and the roadmap, and then pretty much just jump right in. We're going to focus a lot on the introduction, the context, and the single-layer predictive coding model, and then we'll go a little bit faster through the later sections of the paper, talking about some generalizations of predictive coding and some important points that the authors raise. So if you want to participate in 43.1 or .2 in the coming weeks, just let us know.
Okay, so we can say hello and give any information that we'd like, and also maybe mention something that we thought was exciting or that motivated us to get involved in this quite involved paper. So I'm Daniel. I'm a researcher in California, and I was curious to learn more about how predictive coding and predictive processing relate to active inference, and also just about how different models frame anticipatory systems. So over to you, Maria. Hello, Daniel, everyone. I have a bachelor's degree in psychology and I am a master's candidate in philosophy of science at the University of São Paulo, Brazil. I am researching the relationship between predictive processing and illusionist theories of consciousness. And I think what brought me here today was my wish to start learning about the formalisms, because I don't really see them in philosophy, and I don't actually need them for my dissertation, but eventually I want to continue my work on predictive processing, so eventually I had to start. And I just thought, well, maybe it's a good idea to come here and minimize my uncertainties. Awesome. Through inference and/or through action. And also thanks a lot, Brock, for helping in the dot-zero. So we'll just start with the big questions. These are the kind of questions that might motivate someone or interest them in this paper without even mentioning active inference per se, but they are some big questions that get touched upon. What is the formal basis of predictive coding? How is predictive coding used or useful? What are some areas of current and future development? And what is the relationship between predictive coding and active inference? And hopefully we'll draw out more questions. Anything else to add about this? No, not really. Cool. The paper is "Predictive Coding: A Theoretical and Experimental Review." I think the first version was from 2021, but the second, revised version was from 2022, by Millidge, Seth, and Buckley.
And just to review the core claim, and then some of the aims and directions they set out: they write that no comprehensive review of predictive coding theory, and especially of recent developments in the field, exists. That's their claim. And then they aim towards providing a comprehensive review, both of the core mathematical structure and logic of predictive coding. So there's a mathematical review and summarization aspect to the paper. And then they also review a wide range of classic and recent work within the framework, ranging from biologically realistic microcircuits that could implement predictive coding, to the close relationship between predictive coding and the widely used backpropagation of error algorithm, as well as surveying the close relationship between predictive coding and modern machine learning techniques. That's how they set out where they want to go. Would you like to read the first slide of the abstract? Okay. So predictive coding offers a potentially unifying account of cortical function, postulating that the core function of the brain is to minimize prediction errors with respect to a generative model of the world. The theory is closely related to the Bayesian brain framework, and over the last two decades has gained substantial influence in the fields of theoretical and cognitive neuroscience. A large body of research has arisen based on both empirically testing improved and extended theoretical and mathematical models of predictive coding, as well as evaluating their potential biological plausibility for implementation in the brain and the concrete neurophysiological and psychological predictions made by the theory. Despite this enduring popularity, however, no comprehensive review of predictive coding theory, and especially of recent developments in this field, exists.
They write: here we provide a comprehensive review both of the core mathematical structure and logic of predictive coding, thus complementing recent tutorials in the literature; and we also review a wide range of classic and recent work within the framework, which is what we read previously in the aims. So that's how the authors describe their own work. And now we're going to see their roadmap, see how they went about getting there, and then we're going to jump right into the introduction and then the formalisms. So here we see the roadmap, and the rough outline of the paper is as follows. Section 1 provides an introduction that contextualizes predictive coding, and we'll also be unpacking it from a philosophical and historical view. Section 2 provides an initial pass on the one-level predictive coding model, sort of the kernel or the archetype or the motif of predictive coding, and connects it, importantly, to variational inference. Then predictive coding is generalized, or extended, or elaborated in a few important directions. In section 2.2, it's generalized to the multi-level case. In section 2.3, the concept of generalized coordinates is introduced. In section 2.4, precision is introduced as a concept. And in section 2.5, it's connected back to the brain. Section 3 discusses paradigms of predictive coding, both unsupervised and supervised, and also connects it, through those sections and others, to some research in machine learning. Then in section 4, connections are drawn between predictive coding and some other algorithms, both those that are common in machine learning and computational statistics, like error backpropagation, linear predictive coding, and Kalman filtering, as well as normalizing flows and biased competition. And in section 4.5, an explicit connection is drawn between predictive coding and active inference, which brings in action as well as the idea of PID control.
The discussion section closes with a few other thoughts, summaries, directions, and challenges, and there are several appendices. It's a longish paper or monograph, but it's an excellent paper. We're just going to place more of an emphasis on the context and the beginning, and then skip over or trail off on a few things. But there's lots to unpack, so I hope that we can learn more in the coming weeks, if the authors want to join, or if anybody else wants to join. Anything to add on the roadmap, or ready for the introduction? Let's go. All right, awesome. So I'll just add some notes from this section of the introduction, and then, Maria, feel free to add anything. So how do they introduce predictive coding? They introduce predictive coding as an influential theory in computational and cognitive neuroscience, which proposes a potential unifying theory of cortical function (that's related to the cortex of the brain), namely that the core function of the brain is to minimize prediction error, where prediction errors signal mismatches between predicted input and the input actually received. Minimizing the divergence between expectations and observations is going to be a big theme. And that minimization can be achieved in multiple ways, and they're colored differently here. Through immediate inference about the hidden states of the world (like, I thought the ball was over here, I'm getting evidence it's over there, so I should just update how I think about the world), which can explain perception, like the perception of a ball moving across the visual field. Through updating a global world model to make better predictions, which could explain learning, a level up from just where the ball is in the visual field: where balls tend to be in the visual field, or how fast they tend to move around. And finally, through action, to sample sensory data from the world that conforms to the predictions.
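To make the first of those routes, perceptual inference, a bit more concrete before we get to the formalisms, here is a minimal toy sketch (our own illustration, not code from the paper): a belief about a hidden state is updated by gradient descent on the squared prediction error. The linear generative mapping g(x) = 2x is a made-up assumption for the example.

```python
# Toy sketch of perceptual inference as gradient descent on squared
# prediction error. The generative mapping g is a hypothetical assumption.

def g(x):
    return 2.0 * x              # assumed linear generative mapping

observation = 6.0               # sensory input actually received
x = 0.0                         # initial belief about the hidden state
lr = 0.05                       # inference step size

for _ in range(200):
    error = observation - g(x)  # prediction error
    x += lr * 4.0 * error       # descend d/dx of error**2 (factor 2 * g'(x) = 4)

print(round(x, 3))              # prints 3.0: the belief settles where g(x) = observation
```

The belief converges to x = 3, the hidden state whose prediction matches the observation, which is the "update how I think about the world" route in miniature.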
And that's where we'll introduce action into predictive coding, in the latter sections of the paper. And then, as a last point, they say that predictive coding boasts an extremely developed and principled mathematical framework in terms of variational inference, as well as many empirically tested computational models that have close links to machine learning. So that's how they introduce this: as a unifying theory that integrates inference of different kinds, more like perceptual and more like learning, as well as action. And they're saying that it has a very rich mathematical framework, and it's been connected to insights in biology and in machine learning. Anything to add? Nope. So they write that predictive coding as a theory also offers a single mechanism that accounts for diverse perceptual and neurobiological phenomena, such as end-stopping, bistable perception, illusions, repetition suppression, illusory motions, and attentional modulation of neural activity. So they're pointing to the integrative nature of this theory or framework, modeling all of these diverse phenomena, as being an advantage. We could have a special theory for just bistable perception, a special theory for just repetition suppression, and so on. But this is a framework within which those phenomena arise as outcomes of one underlying process, predictive coding, rather than needing special theories. So that kind of parsimony or consilience is pursued in areas like cognitive neuroscience. And they write that, as such, and perhaps uniquely among neuroscientific theories, predictive coding encompasses all three levels of Marr's hierarchy, by providing a well-characterized and, importantly, empirically supported view of what the brain is doing at the computational, algorithmic, and implementational levels. So, Marr's very influential hierarchy: it's a bit like Tinbergen's four questions, or a little bit like Maslow's hierarchy.
You can think of it as an organizing or sense-making framework, again for the question: what are brains doing? And so we're going to have three different kinds of answers to what brains are doing: an answer at the level of computation, of algorithm, and of implementation. And so here, with the example of a bird flying, the computational level is flight. It's the "why." What is the algorithm doing? It's a sorting algorithm. What is the brain doing in this situation? It's tracking a moving object. Then there's the flapping level, which is the algorithm, which is kind of like pseudocode. That's suggesting the "what" of how that "why" is realized. What is happening in this sorting algorithm? Well, it's doing this, then this, then this. What is the flight function being realized by? It's being realized by flapping of the wings. And then there's the implementation. It's not just the flapping of anything, it's the flapping of something specific, and so those are the implementational details. Those levels of analysis have been helpful and provocative in connecting computational cognitive theories to neurobiological theories. So it's been a very influential and discussion-provoking framework that has helped people connect, as well as challenge, some of the similarities and differences between, say, computers and brains. This is still in the introduction, and they're going to give us a qualitative perspective on predictive coding, and then we're going to be diving into a lot more historical and technical detail. But how does predictive coding work? The core intuition behind predictive coding is that the brain is composed of a hierarchy of layers, each of which makes predictions about the activity of the layer immediately below it in the hierarchy. So there are two directions in this multi-layer predictive coding scheme, which we're just introducing here as a heuristic.
And then we're going to zoom in to just what's happening at one level. And then we're going to pull back out to kind of recovering this multi-level formalism. The downward descending predictions at each level, that's the blue arrow, are compared with the activity and inputs of each level to form prediction errors. This is the information in each layer which could not be successfully predicted. These prediction errors are then fed upwards, that's the red arrow, to serve as inputs to higher levels, which can then be utilized to reduce their own prediction error. So that's kind of the bigger picture. Like the idea at the top level is like, I'm tracking a soccer ball that's moving from the left to the right. And then that gets transduced in certain ways. And that is in this case, resulting in kind of a visual perceptive level prediction about what one should expect to see. And this is again just the multi-level predictive coding scheme that is often used. But we're going to zoom back to just what's happening at a single level and then what is happening at multiple levels. Awesome. Here's a fun 2018 paper by Carl Friston, Does Predictive Coding Have a Future? And this is going to provide just a little bit of context and then we're going to jump more into though. Carl wrote, in the 20th century, we thought the brain extracted knowledge from sensation. That's like the recognition model. And that's sort of the incoming sensory processing model. The 21st century witnessed a strange inversion in which the brain became an organ of inference, actively constructing explanations for what's going on out there beyond its sensory epithelia. One paper played a key role in this paradigm shift. And Carl goes on to write that a 1999 paper by Rao and Ballard, which we're going to talk more about for him was one of those papers that's a once in a decade find. 
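The two-direction scheme just described, predictions flowing down and prediction errors flowing up, can be sketched numerically. This is a hypothetical two-level toy of our own (the weights, step size, and data are made-up assumptions, not the paper's model): each level does only a local update driven by the error below it and the prediction from above.

```python
import numpy as np

# Hypothetical two-level predictive coding sketch: level 2 predicts level 1
# via top-down weights W (blue arrow); errors are passed back up (red arrow).

W = np.array([[1.0, 0.5],
              [0.0, 1.0]])             # assumed top-down prediction weights

data = np.array([1.0, 2.0])           # sensory input at the bottom
x1 = np.zeros(2)                      # level-1 activity (predicts the data)
x2 = np.zeros(2)                      # level-2 activity (predicts level 1)
lr = 0.1

for _ in range(3000):
    e0 = data - x1                    # error at the sensory level
    e1 = x1 - W @ x2                  # error between level 1 and the top-down prediction
    # Local updates: each level balances the error below it against
    # the prediction from above (gradient descent on total squared error).
    x1 += lr * (e0 - e1)
    x2 += lr * (W.T @ e1)

print(np.round(x1, 3))                # level 1 settles onto the data: [1. 2.]
```

At equilibrium both errors vanish: level 1 matches the input, and level 2 has found an activity whose top-down prediction explains level 1. That is the "dynamic interplay of predictions and prediction errors" in its smallest possible form.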
So we're going to talk more about it, but this is just connecting it early to Karl Friston's line of research, the connection between active inference and predictive coding, and the idea that sometimes it's these early transdisciplinary papers that can stitch two different fields together, and that that can have long-term consequences for essentially perennial questions, like how do perception, learning, attention, and sensation all work together, and under what imperative might they work together. And then the authors provide a very preliminary discussion, which they're going to go into more later, about how predictive coding matters for machine learning. And so they write that predictive coding proposes that using a simple unsupervised loss function, such as simply attempting to predict incoming sensory data, is sufficient to develop complex, general, and hierarchically rich representations of the world. They're suggesting that this has found support in the successes of modern machine learning models that are trained on unsupervised predictive or autoregressive objectives. So here are some of the papers they cite, Brown et al. and Kaplan et al., and these both have to do with the training of what are increasingly becoming important modern machine learning methods, the large language models. In contrast to modern machine learning algorithms, which are trained end-to-end with a global loss at the output (given the total big data set, give me the lowest total error on the whole data set), in predictive coding prediction errors are computed at every layer, which means that each layer only has to focus on minimizing local errors rather than a global loss. This property potentially enables predictive coding to learn in a biologically plausible way, using only local and Hebbian learning rules. And this is going to connect to Kalman filtering and PID control.
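That "local and Hebbian" point can be illustrated with a tiny toy of our own (illustrative only; the numbers and setup are assumptions, not from the paper): the weight update at a layer uses only that layer's own prediction error and the local presynaptic activity, with no global loss propagated from anywhere else in the network.

```python
import numpy as np

# Hypothetical sketch of local, Hebbian-flavoured predictive coding learning:
# a weight is adjusted using only its local prediction error times the local
# presynaptic activity (a delta-rule / LMS-style update).

rng = np.random.default_rng(1)
true_w = 1.7                      # hidden generative weight to recover
w = 0.0                           # the model's estimate
lr = 0.01

for _ in range(2000):
    x = rng.normal()                          # local "cause" activity
    y = true_w * x + 0.01 * rng.normal()      # observed consequence (slightly noisy)
    error = y - w * x                         # local prediction error at this layer
    w += lr * error * x                       # Hebbian: error times presynaptic activity

print(round(w, 1))                # prints 1.7: the local rule recovers the weight
```

Nothing in the update refers to any other layer or to a network-wide objective, which is the contrast the authors draw with end-to-end training on a global loss.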
So they're laying out the land and showing where they're going to connect, in the introduction, very qualitatively. And then, again, we're going to zoom back down to find out what's happening at one level of predictive coding, and then try to rebuild some of the formalisms related to Kalman filtering and PID control. And the bigger questions that people might be curious about asking would be: what does it mean for computation to be biologically inspired? What is possible when we're thinking about some of the similarities, differences, and complementarities between biological computation and different kinds of conventional and unconventional computation? The next part, Maria, please take it away. I'll be changing slides. And thanks a lot for adding a lot of the historical context; looking forward to learning about what you have to share here. Okay. So this is actually the paragraph in the paper where they start talking about the history of predictive coding. While predictive coding as a neuroscientific theory originated in the 80s and 90s with Mumford, Rao and Ballard, and Srinivasan, Laughlin, and Dubs, and was first developed into its modern mathematical form and comprehensive theory of cortical responses in the mid-2000s with Friston, it has deep intellectual antecedents. These precursors include Helmholtz's notion of perception as unconscious inference and Kant's notion that a priori structure is needed to make sense of sensory data, as well as early ideas of compression and feedback control in cybernetics and information theory. And on the next slides, I will be talking a little bit more about the history and philosophy of predictive coding, and then some core formalisms and various generalizations and elaborations of predictive coding. If there's anything you want to say, just feel free to jump in. Okay. So the historical and philosophical background of perception is huge.
However, seen in the context of the predictive coding framework, authors usually come up with a similar line of reasoning about perception. We can start thinking with the Greek philosopher Plato and his Allegory of the Cave (Republic, 514a to 520a), in which he describes people trapped in a cave, forever watching shadows cast by objects moving near a fire behind them. They would be there forever, have children that would grow up like that inside the cave, and eventually these shadows of the objects near the fire behind them would be the real thing, the only thing, for these people. They wouldn't believe that there is anything outside the cave. The shadows would be everything. So what Plato claims here is that our own conscious perceptions are just like these shadows, meaning that they are indirect reflections of hidden causes that we can never directly encounter. A few centuries later, the medieval Arab scholar Alhazen (Ibn al-Haytham) wrote a lot of interesting notes on visual perception. Actually, he wrote six volumes, if I'm not mistaken, of his Book of Optics, where he explored the view that human perception often depends on mechanisms of judgment and inference, instead of providing straight access to the world. So you can see that very early in Western thinking, we have this notion that we do not have direct access to the world, that there is some inference going on. All right. Jumping to the 18th century, we find two important philosophers: the Scottish David Hume and the German Immanuel Kant. Hume, in 1739 to 1740, explored inductive inference and causation. The problem of induction that he raised came from an analysis of cause and effect in perception, and the conclusion he reached is that all our mental life can be traced back to the effects of sense experience.
Because, since we do not have direct contact with the world, we can never have a one-to-one relationship with the objects there. Therefore, it is possible to have one effect with many possible causes, and one cause with many effects. The solution here would then be extracting statistical regularities and imagining what happens when the world is intervened upon in a controlled manner, as the philosopher Jakob Hohwy developed in his book The Predictive Mind (2013). Equally important, in the Critique of Pure Reason (1781), Kant adds an important element to this story, among countless other elements: that the brain uses existing information regarding space and time to make sense of the chaotic sensory data it constantly receives, to provide organisms the perception they know. So all of these scholars tell us, right from the early ages, that we do not have direct access to the things in the world, and that we need pre-existing information to make sense of all the data we constantly receive. Lots of predictive coding stuff. It really makes me wonder why it's sometimes still so ineffable, or poorly understood, or even seen as controversial, when some of these perspectives have been, as you are describing, thought of and convergently arrived at over many thousands of years in different cultures. So, really interesting. Yes. And in the late 19th century, the German scientist Hermann von Helmholtz, in the 1860s, depicted perception as involving probabilistic inference. Yeah, he is one of the gods in our community, yes. And he was really inspired by Kant. With that inspiration, he developed the idea of the brain as a hypothesis tester, and of perception as a process of unconscious inference. More specifically, the idea he developed in experiments was that percepts have to be inferred by combining sensory signals with the brain's expectations and beliefs about their causes.
And such inferences happen without the subject's awareness. I mean, they have to happen without the subject's awareness, because imagine all the chaotic sensory data coming in: you open your eyes and you'd have to try to organize everything. It doesn't happen that way; you just open your eyes and everything is already stable. So it's unconscious, and it needs to keep track of the causes in the world by updating perceptual best guesses, or you can say hypotheses, as new sensory signals arrive. So here he built the idea that we use sensory signals together with the brain's expectations to update the hypotheses we make about the causes of the effects we receive from the world. And this means that he somehow turned Hume's and Kant's insights scientific, claiming that we can only infer things out there in the world from behind a sensory veil, which, if I'm not mistaken, would eventually be called a Markov blanket. Is that right, Daniel? Not sure. Maybe unveiled. Helmholtz was crucial to many different works in machine learning, one of which was even nicely named the Helmholtz machine. He was crucial in psychology, in neuroscience, in cognitive science, and so on. Particularly in psychology, Helmholtz's insights inspired Ulric Neisser in 1967, and later Richard Gregory in 1980 and Irvin Rock in 1983, to develop the analysis-by-synthesis approach, which is the process of analyzing a signal or image by reproducing it using a model. There's a picture of Neisser's analysis-by-synthesis approach, with a schema available in this link. And in fact, Gregory built a kind of neuro-hypothesis-testing model based on his Helmholtzian understanding of the brain as continually formulating perceptual hypotheses about the world and testing them by acquiring data from the sensory organs. Okay, so many things happened after the analysis-by-synthesis work, but I will just jump to what we have in philosophy today, because we will touch on the formalisms and the computational stuff afterwards.
Okay, so I'll just make the jump and go to what we have in philosophy today. Recent works in philosophy and cognitive neuroscience have contributed to the expansive predictive coding framework and the Bayesian paradigm of the brain. In many of these seminal works, the field addresses predictive coding as predictive processing, without any relevant difference in the first instance. For instance, the philosopher Jakob Hohwy, who is based at Monash University, Australia, has worked on the prediction error minimization mechanism and the concept of a winning hypothesis for conscious perception. He is also interested in what predictive processing (PP) studies can bring to people with mental disorders such as autism and schizophrenia. Actually, all three of the philosophers here are interested in what PP can bring to improvements in psychiatry and medicine in general. So in 2013, he published an important textbook on the matter, The Predictive Mind, the red book you can see, where he tries to solve the problem of perception, inspired by Hume's induction problem, with Bayesian inference and predictive processing. Later, in 2016, the philosopher Andy Clark, based at the University of Sussex, England, carefully developed what he called action-oriented predictive processing in his book Surfing Uncertainty, where he adds and advances the crucial role of action in our perception. He makes a lot of interesting connections between recent works in computational neuroscience and artificial intelligence and his eclectic philosophy of mind. And last year, in 2021, the neuroscientist Anil Seth, one of the authors of the paper we're discussing here, also based at the University of Sussex, made an amazing contribution to the field with his book Being You, where he claims that we are beast machines whose perception is a controlled hallucination.
And he improves descriptions of consciousness using the predictive processing framework, establishing the "real problem" of consciousness, which, in his own words, requires explaining why a particular pattern of brain activity, or other physical process, maps to a particular kind of conscious experience, not merely establishing that it does. That's something for philosophers and scientists to work on. And here they are. Okay. All right, so I was talking about predictive coding, and then I moved to predictive processing. Is there a difference? Well, many authors use the terms freely. But then I found Clark explaining in his book the difference as he understands it between predictive coding and predictive processing. He puts it as the fact that predictive processing is not simply the use of the data compression strategy known as predictive coding; rather, it is the use of that strategy in the very special context of hierarchical multi-level systems deploying probabilistic generative models. Such systems exhibit powerful forms of learning, deliver rich forms of context-sensitive processing, and are able flexibly to combine top-down and bottom-up flows of information within a multi-layer cascade. Predictive processing thus combines, within a multi-level bidirectional cascade of top-down probabilistic generative models, the core predictive coding strategy of efficient encoding and transmission. From this, and I'm not sure if you agree with me, Daniel, reading this paragraph makes me wonder: okay, predictive coding is a local bottom-up and top-down cascade and so on, but it also is multi-layer. It also has some hierarchy. So I'm not sure what he means; I'm not sure I understand the difference between predictive coding and predictive processing as he puts it. Yeah, very insightful.
And it made me think about the difference between coding, like programming, like Python coding for data, and data processing, which could involve multiple computer languages. And so you're really highlighting something important, which is that when the data are compressed according to a strategy where the errors are transmitted, rather than the estimate itself, that information channel works in a predictive coding way. And then this lets us use predictive processing for the bigger picture, for systems that implement predictive coding modules but where that's not the only module they're deploying. So that will be something cool to hear people's perspectives on; a really nice distinction here, though. Yep. And okay, so I wrote something here, just like you said: that maybe "coding" is better used for the formalisms and implementations, and "processing" for the philosophical understanding that prediction is the basis of signal interpretation, e.g. as opposed to description. Okay, so jumping back to formalisms: predictive coding in information theory and signal processing. Another deep intellectual influence on predictive coding comes from information theory, from Shannon in 1948, and especially the minimal redundancy principle of Barlow (1961, 1989). Information theory tells us that information is inseparable from a lack of predictability. If something is predictable before observing it, it cannot give us much information. I loved this sentence. Conversely, to maximize the rate of information transfer, the message must be minimally predictable and hence minimally redundant. Predictive coding as a means to remove redundancy in a signal was first applied in signal processing, where it was used to reduce transmission bandwidth for video transmission. Do you want to say something? I'll just describe this without reading it.
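That point, that predictability and information content pull against each other, has a standard quantitative form: the information carried by observing an event of probability p is its surprisal, -log2(p) bits. A quick numerical illustration (our own, just to make the sentence concrete):

```python
import math

# Shannon surprisal: information content of observing an event
# with probability p, measured in bits.
def surprisal_bits(p):
    return -math.log2(p)

# A nearly certain event carries almost no information...
print(round(surprisal_bits(0.99), 3))   # prints 0.014
# ...while a surprising one carries a lot.
print(round(surprisal_bits(0.01), 3))   # prints 6.644
```

So a maximally predictable message has near-zero surprisal per symbol, which is exactly why removing the predictable (redundant) part of a signal, as predictive coding does, concentrates the transmission on what is actually informative.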
A common and effective method for video encoding, although it can be improved on by other methods, is known as frame differencing, where basically one frame gets described, and then for subsequent frames only the pixels that change have to be described. So that is a lot like an analogy to predictive coding, in the sense that it encodes the differences between subsequent time steps and uses those to transmit only the informative pieces. Like: I have a 100-digit number, and then I'm just going to tell you that the third digit changed, and now it's an eight, instead of having to repeat the whole number. The other 99 digits are informative in the sense that they're good info, true, but they're not informative in the information-theoretic sense, because they were quite predictable before observing them. Okay. And so initial schemes used a simple approach of subtracting the new, to-be-transmitted frame from the old frame, in effect using the trivial prediction that the new frame is always the same as the old frame, which works well in reducing bandwidth in many settings where there are only a few objects moving in the video against a static background. More advanced methods often predict each new frame using a number of past frames weighted by some coefficients, an approach known as linear predictive coding. Okay. So now, predictive coding in the eye and the brain. The first concrete discussion of predictive coding in the nervous system arose as a model of the neural properties of the retina, with Srinivasan et al. in 1982, specifically as a model of center-surround cells, which fire when presented with either a light spot against a dark background (on-center, off-surround cells) or, alternatively, a dark spot against a light background (off-center, on-surround cells). It was argued that this coding scheme helps to minimize redundancy in the visual stream, specifically by removing the spatial redundancy in natural visual scenes.
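The frame-differencing idea described above can be sketched in a few lines. This is a deliberately toy illustration (real codecs add motion compensation and entropy coding): transmit the first frame in full, then only the per-pixel residual against the trivial "new frame equals old frame" prediction.

```python
import numpy as np

# Toy frame differencing: send frame0 once, then only the changes.
frame0 = np.zeros((4, 4), dtype=int)   # the reference frame
frame1 = frame0.copy()
frame1[2, 3] = 8                       # a single pixel changes between frames

diff = frame1 - frame0                 # residual against the trivial prediction
changed = np.count_nonzero(diff)
print(changed)                         # prints 1: only one value needs transmitting

# The receiver reconstructs the new frame from the old frame plus the residual.
reconstructed = frame0 + diff
print(np.array_equal(reconstructed, frame1))   # prints True
```

Out of 16 pixel values, only one non-zero residual has to cross the channel, which is the bandwidth saving the text describes: the predictable pixels carry no surprisal, so they are never re-sent.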
That is, the intensity of one pixel helps predict quite well the intensity of neighboring pixels. And then the first picture related to it. Do you want to talk about it? We're just going to look at two different tissues of the nervous system that have some predictive elements. And this is referring to some of the photosensitive cells that are on the retina. And that's work that had been done qualitatively and empirically for hundreds of years, but then was connected to predictive perspectives in the 80s. And then we're also going to look, as you'll now unpack, at what's happening in the cortex of the brain amongst different layers. Mumford in 1992 was perhaps the first to extend this theory of the retina and the LGN to a fully fledged general theory of cortical function. His theory was motivated by simple observations about the neurophysiology of cortico-cortical connections. Specifically, the existence of separate feedforward and feedback pathways, where the feedforward pathways originate in the superficial layers of the cortex and the feedback pathways originate primarily in the deep layers. Great. So we had this philosophical development happening over thousands of years. And then in the 1900s, people were starting to connect the neuroanatomy and the histology to some of these predictive ideas. And now let's close the loop and bring it back to the cortical work that Friston highlighted in his 2018 paper. Yep. And now predictive coding in the brain gets computational and mathematical. While Mumford's theory contained most aspects of classical predictive coding theory in the cortex, it was not accompanied by any simulations or empirical work, and so its potential as a framework for understanding the cortex was not fully appreciated. And so the seminal work of Rao and Ballard in 1999 had its impact precisely by doing this. And so they created a small predictive coding network according to the principles... Oh, sorry.
They created a small predictive coding network according to the principles identified by Mumford and empirically investigated its behavior, demonstrating that the complex and dynamic interplay of predictions and prediction errors could explain several otherwise perplexing neurophysiological phenomena, specifically extra-classical receptive field effects such as end-stopping neurons. Yeah, this is pretty interesting, because even the idea that the neurons that are active while a certain stimulus is being presented are neurons that are receptive to that stimulus, that's within the paradigm of signal reception. And there's classical signal reception, you know, the pitcher throws the ball, the catcher catches the ball, it's classic. But fortunately or not, there are all these so-called extra-classical effects. And so this paper proposed a simple architecture that was able to encompass the so-called classical as well as some of the important extra-classical effects under this predictive processing framework. And so that really led to a lot of developments in neuroscience in the last 20-plus years. And that's exactly what we're going to go into now. So thanks for providing that awesome context. It really helps, I think, situate some of the details that we're about to look into. Okay, so let's get a little technical. What happened after Rao and Ballard's 1999 synthesis? So remember that Karl Friston said that that was like one of those once-in-a-decade papers for him. So how did he change his action selection and publication policies after reading that paper? Well, anyone can guess. But what Friston did was cast the predictive coding algorithm as approximate Bayesian inference upon Gaussian generative models. And this is going to be connected. This is all in the paper for people to read.
But this connects basically all of the previous themes that we had been talking about: the information theory, minimum redundancy, and the Helmholtzian idea of perception as inference come together in the Bayesian perspective on the predictive coding architecture of Rao and Ballard. So the authors write that Friston's approach reformulates the mostly heuristic Rao and Ballard model in the language of variational Bayesian inference, which we're going to look more at. And Friston showed that the energy function in Rao and Ballard can be understood as a variational free energy of the kind that is minimized through variational inference. So that really connected the dots between predictive coding architectures and the empirical biological findings and all this work on variational Bayesian inference. And those are the 2003, 2005, and 2008 single-author papers by Karl. So what are variational, Bayesian, and variational Bayesian approaches? We're just going to look at the authors' words, and there are many, many awesome places to look for educational materials; also check out some of the livestreams like 26, 32, 34, 37, and 39. To be brief about it, the authors write that variational inference approximates an intractable inference problem with a tractable optimization problem. So a super hard problem to solve just by thinking about it and guessing at the right answer gets swapped for a tractable optimization problem. It's like twenty questions: it's going to be hard to guess on your first question, but if you take an iterative approach, maybe there's a way to actually resolve it. It's not exactly like that, but it's kind of like swapping out something that's hard to solve in one shot for an iterative, gradient-descent-type improvement that actually gets you to a good solution. And so this is where we're going to introduce some of the variables and the letters that we're going to use.
Following a lot of perceptual work in Bayesian statistics, we're going to use o for observations, which are like data, either the actual data that the sensor provides us or generated data of the kind that we would expect the sensor to give us. And then x are the latent or unobserved states of the system. So x would be the temperature in the room, and o is what the thermometer reading is. And that's a Bayesian approach that helps us have a generative model of the data-generating process, which is also known as the joint distribution because it is jointly over both the observations and the latent states. Where variational inference comes into play is hinted at in this paragraph on Bayes. In order to do exact Bayes, one would have to find this normalizing factor. However, that can be intractable, because it requires basically integrating or summing over all latent variable states, and that's not always easy or known. So the variational method aims to approximate the posterior using an auxiliary posterior with a different set of parameters. So the q is going to be the variational distribution, the distribution that we control, from a family of functions that is more amenable to optimization. So p could be a super messy function, but q is going to be constructed by the modeler to be a lot simpler. And so it is going to have its own parameters, and those parameters are mu and sigma. And so it's kind of like we would be interested in the mean and the variance of x, the actual temperature in the room, given the thermometer. But what if it were too hard to even get that? Well, what we might do is imagine that the temperature were normally distributed, like a Gaussian distribution with this mean mu and this variance sigma. So even if the temperature in the room weren't actually Gaussian distributed, maybe it's approximatable enough.
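To make the thermometer example concrete, here is a hedged sketch of a generative model as a joint distribution p(o, x) = p(o | x) p(x). All the specific numbers (prior mean of 21 degrees, the noise levels) are invented for illustration:

```python
import random

random.seed(0)

# Invented numbers for the running thermometer example:
#   prior:      x ~ Normal(21, 2^2)     latent room temperature (Celsius)
#   likelihood: o | x ~ Normal(x, 0.5^2)  noisy thermometer reading
PRIOR_MEAN, PRIOR_SD = 21.0, 2.0
NOISE_SD = 0.5

def sample_joint():
    """Ancestral sampling from the generative model p(o, x) = p(o|x) p(x)."""
    x = random.gauss(PRIOR_MEAN, PRIOR_SD)  # draw a latent state
    o = random.gauss(x, NOISE_SD)           # generate an observation from it
    return o, x

samples = [sample_joint() for _ in range(10000)]
mean_o = sum(o for o, _ in samples) / len(samples)
# mean_o sits near the prior mean, since each o is centered on its x.
```

The point of the sketch is just the direction of generation: the model first posits a latent state, then generates the kind of observation a sensor would give us, which is exactly the "generated data" reading of o above.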
And so the variational approach to the Bayesian method is to introduce this variational distribution q that's going to have a lot of good features moving forward. Here's how that q comes into play. Equation one expresses this using the variables that we've seen, and also introduces this divergence D. So first, just the definition of D, and then what this formalism says. So D of P double-line Q: the double line means "between this and that," and D is a function that measures the divergence between two distributions, for example P and Q. And here the divergence is going to be calculated as the KL divergence, although other divergences are possible. So if the divergence between the one that we simplified and control, Q, and that intractable true posterior were minimized to zero, we would be fitting P as well as we could with Q. And equation one is saying that Q star, the best Q of x, the latent states, given the data and the variational parameters (so the best possible room temperature prediction, given thermometer readings and variational parameters), is a minimization over all the variational parameters of the divergence between the Q distribution that we control, of exactly what we want the best answer for, and the true posterior P of exactly what would be best to compute, like the temperature given the observations on the thermometer. And so that goes a long way towards rewriting an inference problem as a divergence problem. But they write: however, merely writing the problem this way does not solve it, because the divergence that we need to optimize still contains the intractable true posterior. So it's rewritten as a divergence between something we control and something that we set out on this whole approach because we couldn't calculate it. So we have prepared it, but still this is intractable. So this is not usable alone.
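Since the divergence here is the KL divergence, a small numeric sketch may help. For univariate Gaussians the KL has a closed form, and minimizing it over a variational parameter recovers the target's mean. The posterior numbers (mean 20.5, sd 0.6) are invented; in the real problem the posterior would not be available in closed form, which is exactly the difficulty the transcript describes.

```python
import math

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    """KL(Q || P) between two univariate Gaussians, in closed form."""
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sd_p ** 2)
            - 0.5)

# Pretend the true posterior is Normal(20.5, 0.6^2). In practice it is
# intractable -- that's the whole reason for the variational detour.
POST_MEAN, POST_SD = 20.5, 0.6

# Grid-search the variational mean (Q's sd fixed at 1.0 for simplicity).
best_mu = min((kl_gauss(mu, 1.0, POST_MEAN, POST_SD), mu)
              for mu in [18.0 + 0.1 * i for i in range(60)])[1]
# The divergence-minimizing variational mean lands on the posterior mean.
```

Note that KL(Q || P) is zero exactly when the two distributions match, which is the "fitting P as well as we can with Q" intuition above.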
They write: the beauty of variational inference is that it instead optimizes a tractable upper bound on this divergence, called the variational free energy (VFE). To generate the bound, we apply Bayes' rule to the true posterior to rewrite it in the form of the generative model and the evidence. So here's that P of x given o, rewritten as now not x given o, but P of o comma x, the joint distribution, divided by P of o, the evidence for the observations. So by Bayes' rule, P of x given o is equivalent to this; that's why this is the first line here. They provide some rewriting, and then they write in the third line that the expectation around log P of o, the actual likelihood of the observations themselves (like how likely is the thermometer saying 21?), vanishes, since the expectation is over the variable x, which does not appear in P of o. It's like asking what is the expectation of this coin flip over the temperature tomorrow: because they're different variables, it's easy to separate them out if we're only interested in one of them. And then they write that this F, the free energy, is a tractable quantity, since it's a divergence between two quantities that we as the modeler know: the variational approximate posterior Q of x given o, the distribution that we control, and the generative joint distribution P of o comma x. So we traded out this intractable true posterior, where it's like if we knew it, we would just stop there, but we don't know it. We've traded it for a divergence between something that we totally control, Q of x conditioned on these other parameters, and a joint distribution, which we might have uncertainty over, but which at least can be modeled. So that's one of the key pieces of variational Bayes. And that's not specific to predictive coding; it's just an important way that the authors are introducing it here. Anything to add? So let's continue on this theme.
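The identity being described (F equals the divergence to the true posterior minus the log evidence) can be checked numerically on a tiny discrete example. The two latent states and all the probabilities are invented for illustration; the point is that F is computable from just the variational distribution and the joint, with no access to the true posterior.

```python
import math

# Tiny discrete toy: latent x in {0, 1}, one fixed observation o.
prior      = [0.5, 0.5]    # p(x)
likelihood = [0.9, 0.2]    # p(o | x) for the observed o
joint      = [prior[i] * likelihood[i] for i in range(2)]  # p(o, x)
evidence   = sum(joint)                                    # p(o)
posterior  = [j / evidence for j in joint]                 # p(x | o)

def kl(q, p):
    return sum(q[i] * math.log(q[i] / p[i]) for i in range(2))

def free_energy(q):
    """F = E_q[log q(x) - log p(o, x)]: needs only q and the joint."""
    return sum(q[i] * (math.log(q[i]) - math.log(joint[i])) for i in range(2))

q = [0.7, 0.3]             # an arbitrary variational distribution
F = free_energy(q)
# Identity from the text: F = KL(q || posterior) - log(evidence),
# so F always upper-bounds the negative log evidence.
```

Because the KL term is nonnegative, driving F down can only shrink the gap between q and the true posterior, which is the bound the authors rely on.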
Since F is an upper bound, by minimizing F we drive the variational distribution that we control closer to the true posterior. As an additional bonus, under certain conditions F can be used for model selection. So that means it can be used not just to fine-tune a model that we've already chosen, but to actually choose among parametrically or structurally different models. They write: we can gain an important intuition about F by showing it can be decomposed into a likelihood maximization term and a KL divergence term, which penalizes deviation from the Bayesian prior. These terms are often called accuracy and complexity, and this decomposition is often used in different machine learning algorithms. So that's one rewriting of free energy, as this divergence between the Q and the P, and then rewriting that as an accuracy and a complexity, which we'll maybe go into more another time. And they write: in many practical cases, we must relax the assumption that we know the generative model P of o and x, the joint distribution. Luckily, this is not fatal. Instead, it is possible to learn the generative model alongside the variational posterior, on the fly and in parallel, using expectation maximization. So this is basically the alternation shown by equations three: we hold one side of this divergence fixed and optimize the other side, and then go back and do it the other way. And through that back-and-forth expectation maximization, we're going to be reducing this divergence from both sides. So formalism three is saying that the variational parameters phi at the next time step, t plus one, are an optimization argument over those parameters holding theta constant. And then the second part of formalism three is the exact opposite: now the generative model parameters theta at t plus one are a minimization over literally the same thing, but holding the variational parameters constant.
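As a sketch of that alternation, here is a deliberately degenerate toy where each partial minimization has a closed form. The quadratic free energy and the unit variances are assumptions made up for illustration, not the paper's model:

```python
# Toy of the alternation in formalism (3): update the variational parameter
# mu with theta held fixed, then the generative-model parameter theta with
# mu held fixed. Invented free energy:
#   F(mu, theta) = 0.5 * (o - mu)**2 + 0.5 * (mu - theta)**2
o = 21.3                 # one thermometer reading
mu, theta = 0.0, 0.0     # poor initial guesses

for _ in range(100):
    mu = (o + theta) / 2.0   # argmin of F over mu, theta fixed
    theta = mu               # argmin of F over theta, mu fixed

# Both parameters converge to the observation in this degenerate toy; with
# more data and real variances the fixed point would balance prior and
# likelihood instead of collapsing onto one observation.
```

Each step can only lower F (or leave it unchanged), which is the "whittling away at both sides" picture in the transcript.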
So there's a lot more to say about expectation maximization, but this is like converging to a small divergence by whittling away at both sides and alternating there. So that's how expectation maximization can be used as a heuristic algorithm for variational Bayesian optimization. How do we go from variational inference, which all of those previous slides covered and which does not apply to predictive coding specifically, to predictive coding? They write: having reviewed the general principles of variational inference, we can see how they relate to predictive coding. First, to make any variational inference algorithm concrete, we must specify the forms of the variational posterior and the generative model. It's like if you want to do formalism three, those are the two pieces you need: you need the phi stuff and the theta stuff. And so they specify it here, where N means a normal distribution written in a (mean, variance) format. And so they define a Gaussian form for the generative model. The mean of this likelihood Gaussian is assumed to be some function f of the hidden states, which can be parameterized with theta, while the mean of the prior Gaussian distribution is going to be a g of mu. So we're going to have f of theta and g of mu. The variances of these two Gaussian distributions of the generative model are sigma one and sigma two. This is a slight technical detail, but they're going to assume that the variational posterior is a Dirac delta, which is like a spiking distribution centered at the mean; however, they explore that differently with the Laplace assumption. So this is basically taking what we discussed about variational inference and preparing it to be entered into a predictive coding setting. They're setting up the problem with the parameters you're going to want to do variational inference on in order to implement predictive coding.
And there's of course more to say, but we're just giving a first pass. Appendix A is where they describe the difference between using the Laplace approximation and the Dirac delta. So it's footnote eight, which suggests moving to appendix A and looking at Buckley et al. 2017 for a walkthrough. But this on the bottom right is kind of a summary of the difference between the Dirac delta approach and the Laplace method approach. So the Dirac delta is like we're trying to find the spike that's at the mean or the median (there are some other details in play), that is, where the bulk of the probability distribution mass is. Whereas the Laplace method tries to fit a polynomial, just a second-degree polynomial, a quadratic, over the probability distribution, also in a best-fitting way. So there are some similarities and some differences, and it's explored more in the Buckley et al. 2017 paper, the mathematical review of the FEP for action and perception. So let's continue with our exploration of how we take variational Bayes and make it a predictive coding model. At the top they write: we define the prediction errors epsilon o, that's the error on the observation, and epsilon x, that's the error on the latent states. So epsilon o is like o minus f of some function of the parameters, and then there's also epsilon x. So: how much prediction error do you have about the observation? How much prediction error do you have about the hidden state? The epsilons are the prediction errors, and they're weighted by these two variances. Given all of this, we can derive dynamics for all the variables of interest. So that's mu, the variational estimate of the actual underlying hidden state (the temperature in the room) that we want to be predicting, and then theta one and theta two, which are the variances; you can think of them as the variance of the room temperature and the variance of the thermometer.
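A minimal sketch of these dynamics, using the prediction errors epsilon o and epsilon x, may make the thermometer example concrete. It assumes identity mappings for f and g and fixed variances, so it is a drastic simplification of the paper's scheme, with invented numbers:

```python
# Minimal single-layer predictive coding loop (thermometer example).
# Assumptions: f and g are identity maps and the variances are fixed,
# so this is a sketch, not the paper's general scheme.
#   F = eps_o**2 / (2 * s1) + eps_x**2 / (2 * s2)
o          = 23.0     # thermometer reading
prior_mean = 20.0     # prior belief about the room temperature
s1, s2     = 1.0, 4.0 # variance of the likelihood and of the prior
mu         = prior_mean  # initial estimate of the latent temperature
lr         = 0.05

for _ in range(2000):
    eps_o = o - mu           # prediction error on the observation
    eps_x = mu - prior_mean  # prediction error on the latent state
    mu += lr * (eps_o / s1 - eps_x / s2)  # gradient descent on F

# mu settles at the precision-weighted compromise between data and prior:
# (o/s1 + prior_mean/s2) / (1/s1 + 1/s2) = (23 + 5) / 1.25 = 22.4
```

The fixed point balances the two prediction errors against their variances, which previews the role of precision discussed later in the paper.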
I hope that's not wrong or an oversimplification, but those are like the two variances, and we want the real temperature of the room. They write: we can derive dynamics for all these variables of interest by taking the derivatives of the variational free energy F. The update rules are as follows, and so this is the change in mu, theta one, and theta two over time: d mu / dt, d theta one / dt, and d theta two / dt. Each of these is set by a gradient of F, so we go from just the change in the mu estimate through time to something involving free energy, and then that's defined more formally. Importantly, these update rules are very similar to the ones derived in Rao and Ballard 1999, and therefore can be interpreted as recapitulating core predictive coding update rules. For instance, the mus are typically interpreted as rapidly changing neural firing rates, while the thetas are slowly changing synaptic weight values. So they're weaving together the biology with the formalisms here. And they write also that the dynamics of mu can be understood as the process of perception (like, how hot is it in the room?), since mu is meant to correspond to the estimate of the latent state of the environment generating the observations on the thermometer. By contrast, the dynamics of the thetas can be thought of as corresponding to learning, since the thetas effectively define the mapping between the latent state and the observations. So it's like: you think you know how the thermometer is related to the temperature in the room, and then you see your thermometer changing and you change how hot you think the room is; that happens over a shorter time scale. And then over a longer time scale you might come to learn the variability of the temperature, or how noisy the thermometer is, but that's more like learning than perception; however, they're on a continuum in this model. So up to formalism nine, this completes the core formalism of predictive coding, which is a variational Bayesian approach
to having this ongoing prediction-error-minimizing approach to latent state estimation. That is the heart of predictive coding, and now we're going to jump into a few different elaborations that we're going to be moving through a little bit more quickly. Anything to add, Maria? Great, awesome. The previous examples focused on predictive coding with a single level of latent variables. However, the expressiveness of such a scheme is limited. Deep neural networks in machine learning have demonstrated that having hierarchical sets of latent variables is key to enabling methods that learn powerful abstractions and handle the intrinsically hierarchical dynamics of the sort that humans intuitively perceive. The predictive coding scheme introduced can be straightforwardly extended to handle hierarchical dynamics of arbitrary depth, equivalent to deep neural networks in machine learning. This is done by postulating multiple layers of latent variables, x sub one through x sub L, and then defining the generative model as follows. So P, the generative model, is going to be basically over all the layers. Just as the generative model and the variational distribution had to be defined for the single-layer model, here they define the P distribution and now they need the Q: we define a separate variational posterior for each layer. So they define the Ps and the Qs just like they did in the single-layer case, and that allows them to calculate the variational free energy, which is a sum of the prediction errors at each level. So this is F, now not just over one layer, but over multiple levels. The variational posteriors that need to be calculated are partitioned across the layers, which allows them to be summed in a very straightforward way. Given that the free energy divides nicely into the sum of layer-wise prediction errors, it comes as no surprise that the dynamics of the mus and the thetas are similarly separable across layers. That allows different layers of this prediction
hierarchy to be precise or imprecise, and allows those movements to happen in a way that's uncorrelated. That matters not just because the real world presents settings where there's confidence at lower levels but not higher ones, and vice versa, but because it makes the calculation of the free energy of the whole system a simple sum. Whereas if there were really complex interactions, where the first layer depends on the third layer, and if the fifth layer is this way then such-and-such, doing the statistics would be a lot more challenging. So visually, here's what that looks like. We have the mu, which is the mean estimate of what's happening at that level, and then there's passing of these epsilon error terms. So they write that this, in figure one, is the architecture of a multi-layer predictive coding network, shown here with two value and error neurons in each layer. The value neurons project to the error neurons of the layer below, and the error neurons represent the current activity. So this is starting to walk us back towards that cortical layout, where there's a cortical column with some so-called upwards and downwards signaling, but also lateral signaling. So this is a graphical model that's reflecting the way that predictive coding can be arranged, almost like in series and in parallel, to have wide models and deep models. Just like you could have a neural network with four neurons, four neurons, four neurons, four neurons, or you could have 64 and then 64, which would be a shallower but wider model. And so that notion of shallowness and depth is also going to come into play with predictive coding. Here we're at formalisms 12 and 13, and now we're looking at the rate of change, the derivative, of mu sub l, the estimate of the mean at that layer, and theta sub l, the variance at that layer. So these are the update rules, the gradients, for the mean and the variance at different levels, and they're also written as free energy functionals of those layers. And the dynamics of
the variational means depend only on the prediction errors at their own layer and the prediction errors on the layer below. So again, we don't have this epsilon at l minus one connecting up all the way to some distant mu; it's only through these local connections within and between layers that we need to calculate anything. We can think of the mus as trying to compromise between causing error by deviating from the prediction from the layer above, and adjusting their own prediction to resolve error at the layer below. So it's kind of like a hierarchy of bosses or tasks: there are top-down expectations, and there's the bottom-up reality, which might be ahead of or behind schedule, and that puts each person in this compromise situation. Crucially for conceptual readings of predictive coding, and this is where there's a doorway to the philosophy and some of the broader discussions, this means that sensory data is not directly transmitted up through the hierarchy, as is assumed in much of perceptual neuroscience. And so it totally returns us to these questions: what is perception, what is cognition, what is action? What is coming from the eye to the brain, or what goes from the brain to the eye? What is the eye doing, what is the brain doing with respect to the eye, and how does it relate to predictive coding? Any quick answers on those? Not really. Cool. So after introducing that kernel of the predictive coding formalism and taking it into the multi-level context, now we can look at another elaboration or generalization. They write in 2.3: so far we have considered the modeling of just a single static stimulus o. However, most interesting data that the brain receives comes in temporal sequences, o-bar, which is o through time. To model such temporal sequences, it's often useful to split the latent variables into states, which can vary with time, and parameters, which cannot. In the case of sequences, instead of minimizing the variational free energy, we must instead minimize the
free action, which is the path integral of the variational free energy through time. We're not going to go into the formalisms related to the generalized coordinates, 14 through 17, but it's something that we can explore later. And just as a reminder, to excite somebody who might want to explore it: we explored the idea of the generalized coordinates of motion a lot in ActInf livestream number 26, on Da Costa et al. And so here was a model where we have, from left to right, different time steps, and then there are the observables of something, and each is part of this column of higher derivatives of its motion. So it's position, velocity, acceleration, and so on. And so at each time step moving forward you're computing the generalized coordinates of motion, like for the movement of a baseball, and that is what ties generalized coordinates of motion to PID control, which will also come back at the end, also known as integrator chains. And this is all happening within this discussion of Bayesian mechanics: just as if we had been talking about the baseball we would be talking about physical mechanics or statistical mechanics, well, here it's Bayesian, so it's Bayesian mechanics and the physics of action and control. That's what they explore in section 2.3. Section 2.4 introduces and goes into a little more detail on precision. One core aspect of predictive coding absent from the original Rao and Ballard 1999 formulation is the notion of precision, or inverse variance. So precision and variance are one over each other: if someone says it's very high precision, that's a very narrow, very sharp distribution, and one over the variance is another way to write that. Sometimes it's denoted with a beta or a gamma in other models. Precisions serve to multiplicatively modulate the importance of the prediction errors, and thus possess a significant influence on the overall dynamics of the model. And so we won't go into every detail of how the
precisions are themselves fit, but just wanted to note that this big sigma here is the summation symbol, the free energy being summed over the layers, while the smaller sigma inside is the precision matrix at a given level. So it's two different uses of sigma, but this is a way in which the variances and the prediction errors at a given level are being summed across levels, and that's helping fit these precision terms. A statistical detail, but it's important for making it work in real inference. Okay. In section 2.5 they pull back from elaborating on predictive coding and talk more about how it has some biologically plausible bases. So this was quite interesting: while technically predictive coding is simply a variational inference and filtering algorithm under Gaussian assumptions, from the beginning it has been claimed to be a biologically plausible theory of cortical computation, and also of other brain regions like the eye, and of other cognitive systems. The literature has consistently drawn close connections between the theory and potential computations that may be performed in brains. For example, Rao and Ballard explicitly claimed to model the early visual cortex, and Friston explicitly proposed predictive coding as a general theory of cortical computation. This is like the realism and instrumentalism discussion: are we just modeling these systems and the kind of data that they provide us, and it just fits really well as a model, but we're not saying that the architecture of the statistical model has anything to do with the architecture of the physical pieces of the brain? Or is it a little bit in the gray zone, where it's like, wow, that model fits so well and the anatomy looks so suggestive, maybe there's something to it? And so in this section they review empirical work that has attempted to validate or falsify key tenets of predictive coding in the cortex, as well as some issues with the approach. So it's mainly
focused on the mammalian brain and on neurons as the cell type, but as always we can think about what is adjacent here: what are the roles of other cell types in predictive coding in the mammalian brain, and how might the insect nervous system, or other nervous systems, or non-nervous systems perform predictive coding? And then they provide one example where they take what they had shown in figure one, that multi-level predictive coding scheme, as a way to think about cortical computation, and now they make that connection clear in figure two. So on the right side is their figure two, and that's the canonical microcircuit model of Bastos et al. mapped onto the connectivity of a cortical region. And so in red are those parameters that we've been discussing, the mus and the epsilons, the means and the prediction errors. And then here are just a few other representations of what others have written about that cortical column. So here's an evolutionary perspective that connects some of the mammalian cortex with the avian pallium brain region, and the way that people work out these kinds of circuit diagrams by looking at the actual tissue. And then here's the statistics: so how does the tissue relate to the statistical model? That's what they explore in figure two. Okay, any thoughts? Okay. In section three they continue to review empirical work, but with more of a focus on machine learning in the unsupervised and supervised settings, and they're also going to explore how the predictive coding architecture can be made more biologically plausible by relaxing certain assumptions implicit in the canonical model, and introduce action into the picture. In figure three they summarize a few different, what they call, paradigms of predictive coding: a summary of the input-output relationships for each paradigm of predictive coding. So it's the input and the prediction, the output,
of different paradigms. So here's the classical predictive coding: observation data are coming in, predictions about those observations are coming out; it's an instantaneous, real-time snapshot. The real-time anticipatory version is still getting data as input, but it's predicting the observation at the next time step, which is a similar problem to predicting at this time step, but a little bit different. That can happen not just through time but also in the context of space, with a spatial predictive coding. And then these two on the right, supervised predictive coding in the generative and in the discriminative direction, have to do with reading in labels and giving out observations, or vice versa: make me a cat; here's a picture, is it a cat? Yes, that's a cat. So these are some of the ways that predictive coding models have been applied. In 3.1 they explore the way that unsupervised training relates to predictive coding. That is machine learning on unlabeled data, like: here's 100,000 hours of human speech, learn how to speak; that's an unsupervised approach. Whereas the supervised approach would be: here's a thousand hours of human speech, here's a thousand hours of a river, here's a thousand hours of some other noise, and so via the labels the algorithm would learn what is speech versus non-speech. But there's also a lot of continuum and complexity there, and it's a whole topic. In some of the subsections of 3.1 they go a little deeper into the temporal predictive and the autoencoding aspects, but we're not going to go into those right now. In figure four, in section 3.2, they look at supervised predictive coding, in that forwards and backwards direction. And so here they're looking at the classic MNIST dataset for evaluation, where there are pictures of handwritten digits and the model is predicting. So here's the generative predictive coding and the discriminative
In the discriminative direction, the data come in, like the picture of the five with its pixel intensities, and that relates to the higher-level estimate of it being a five; so it is observation in, label out: that is a five. In the generative direction it is label in, pixels out. So that connects machine learning to predictive coding. In section 3.3 they explore relaxed predictive coding, which has to do with relaxing certain mathematical assumptions that were introduced earlier for simplicity and clarity. We are not going to go into it, but it is about making the scheme more useful in the real world by relaxing those assumptions. Then in 3.4 they explore deep predictive coding. They write that so far the review has only considered direct variations on the original Rao and Ballard models of predictive coding, which are relatively pure and use only local, biologically plausible learning rules, and that there also exists a small literature experimenting with hybrid models, which often achieve better performance on more challenging tasks than the pure models can. So pure here means using only local, biologically plausible learning rules, so only locally connected entities. These models, they suggest, provide a vital thread of evidence about the scaling properties and performance of deep and complex predictive coding networks. They write that the first major work in this area is PredNet, which uses multiple layers of recurrent convolutional LSTM neural networks to implement a deep predictive coding network. So here is the original PredNet architecture, with convolutional LSTMs passing information in a top-down and a bottom-up way; maybe we can look at it more later. I also found a 2019-2020 paper that explored whether PredNet actually is doing predictive coding, but again, that is a whole other rabbit hole. So section three reviewed, through figure three and some discussion in the subsections, a few of the different paradigms and the ways that this predictive coding architecture has been applied. In section four they connect it to other algorithms, and we are going to basically speed through all of the connections except for active inference. First, in section 4.1, they connect predictive coding to the backpropagation of error. This is really important for neural network training, but we are not going to talk about it here, though it sounds really interesting to discuss. One connection that we will introduce, but then skip over most of the details of, is the relationship between linear predictive coding and Kalman filtering. We can remember that linear predictive coding is not just frame differencing; there is a coefficient for how much you should forget each of the previous frames, and this connects in signal processing to the Kalman filter. The Kalman filter, as this slide shows, is an iterative mathematical process to quickly estimate true values, like position and velocity, so that is also getting towards the generalized coordinates of motion. The x's are our observations, that is the thermometer, and the red line is our Kalman-filtered temperature unfolding through time. The initial estimate gets quickly corrected, and even though there is a lot of noise in the observations, the Kalman filter gives a good prediction of the actual temperature. This has been implemented in a ton of Bayesian approaches; it is a totally standard Bayesian signal-processing method. In section 4.2 they give a lot of formalisms on the Kalman filter, and in appendix D they provide even more, so we are not going to go into either of those. In section 4.3 they introduce the idea of normalization and normalizing flows, writing that the deep link between predictive coding and normalization has been extended by Marino 2020 by situating predictive coding within a general recipe for building or representing a complex distribution from a simple and tractable one.
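The thermometer picture on the Kalman-filter slide can be reproduced in a few lines. This is a minimal one-dimensional Kalman filter with made-up constants of our own (a noisy sensor around a true temperature of 22, and a deliberately bad initial guess):

```python
import random

random.seed(1)

true_temp = 22.0
obs = [true_temp + random.gauss(0, 2.0) for _ in range(50)]  # noisy thermometer

# One-dimensional Kalman filter for a (near-)constant state.
x, P = 0.0, 100.0   # initial estimate, far off, with very high variance
Q, R = 1e-4, 4.0    # process noise; measurement noise (sensor sd = 2.0)

estimates = []
for z in obs:
    P += Q                 # predict: state ~constant, uncertainty grows a bit
    K = P / (P + R)        # Kalman gain: prior uncertainty vs sensor noise
    x += K * (z - x)       # correct the estimate with the innovation
    P *= (1 - K)           # uncertainty shrinks after each measurement
    estimates.append(x)

print(round(estimates[0], 1), round(estimates[-1], 1))
```

The very first update already jumps most of the way from the bad initial guess to the data, and later updates average out the sensor noise, which is exactly the quickly-corrected-then-robust behavior described on the slide.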
So we encourage people to look at that Marino 2020 paper if they want to learn more about predictive coding, variational autoencoders, and the biological connections, but we are not going to discuss it today. One other model that they connect predictive coding to is known as the biased competition model, which is unpacked in 4.4. I just wanted to pull up one example of it, from the Spratling paper of 2008: within each processing stage, nodes compete to be active in response to the current pattern of feedforward activity received from the sensory input or the previous processing stage. So there are competitions among the subunits to be activated by certain kinds of inputs, and then there are some technical details. Okay, now finally to active inference, as we near the end of the discussion; this will be fun to explore in the coming weeks too, of course. They write in 4.5 that predictive coding can also be extended to include action, allowing predictive coding agents to undertake adaptive actions without any major change to their fundamental algorithms. The key insight is that there are two ways of minimizing prediction errors. The first is to update predictions to match sensory data, which corresponds to classical perception: I thought the ball was going to be here, but I guess it is over here. The second is to take actions in the world to force the incoming sensory data to match the prediction: I am going to move the ball to where I expect it to be; or, I expected the ball to be in the left side of my visual field, so I am going to move my eyes to the right so that it is in the left side of my visual field. So action is not always reaching out and moving the ball; it can also be the action of choosing where to look, for example. This is really fascinating, because we spent so long in the earlier sections, in the first fifty equations, without action, so it is like, oh no, are there going to be another 250 equations once action comes into the picture? But actually it is quite a simple introduction. The basic approach to including action in predictive coding is to minimize the variational free energy with respect to action. Although the free energy is not explicitly a function of action, it can be made so implicitly by noticing the dependence of sensory observations on actions: what you see depends on where you look. We can make this implicit dependence explicit using the chain rule of calculus; active inference is inference about action. So the rate of change of action through time follows the gradient of the free energy functional, which previously was a function of observations and states, and now o is o(a): observations are partially or completely a function of action. By the chain rule, dF/da = (dF/do)(do/da), and the term ∂o(a)/∂a is a forward model, which makes explicit the dependence of the observation upon action and must be provided to, or learned by, the algorithm, because it answers: how do observations, which are a function of action, change as actions change? That is a layer that had not been brought up before. They are saying that because you do need that derivative, you need to learn it or be provided it, and it is a separate thing to learn in addition to the standard generative model for perception. So we took action as being a dependency of observations, which is also what connects active inference to perceptual control theory. That is how action is introduced into the picture, from a first pass. Let's go one more level in, and then any thoughts you have would be great. If we were talking about just the inference case, prediction error minimization would mean that the picture we are seeing is exactly the one that we are predicting, or vice versa.
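A minimal sketch of that chain-rule idea, with a made-up linear forward model entirely of our own: free energy is just a squared prediction error, F = 0.5 * (o - mu)**2, observations depend on action through o(a) = 2a, and action descends dF/da = (dF/do)(do/da):

```python
mu = 4.0              # the prediction / set point the agent expects to see

def observation(a):
    return 2.0 * a    # forward model: o as a function of action (our toy choice)

def dF_do(o):
    return o - mu     # dF/do for F = 0.5 * (o - mu)^2

do_da = 2.0           # derivative of the forward model, provided or learned

a, lr = 0.0, 0.1
for _ in range(100):
    o = observation(a)
    a -= lr * dF_do(o) * do_da   # chain rule: dF/da = (dF/do) * (do/da)

print(round(observation(a), 3))   # o(a) has been driven to the prediction mu
```

Gradient descent on action drives o(a) to the predicted value mu: acting so that the world matches the prediction, rather than updating the prediction to match the world.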
So what does that look like for action? The prediction error simply becomes the difference between the current observation and the target, or set point, which already brings in this homeostasis and allostasis perspective. However, this raises the question of where the set points come from, where the targets come from, and how they are computed: whence priors? The generic answer given is that set points can be inherited from evolutionary or ontogenetic (that is, developmental) imperatives, or supplied by other neural circuits involved in goal-directed behavior and planning; for present purposes they are taken as exogenously given variables. So what is not addressed in this review is the big question of priors and learning on preferences: how do you know what to prefer, how do you know how much to change what you prefer, how tight should your preferences be? Those are really important areas, but here they just open the door without going in. In 4.5.1, it is also straightforward to model the potential costs of action. In biological organisms there are different costs of action, so how can that be modeled mathematically? They say: by explicitly including action within the generative model, as follows, and so they add an extra term with a cost of action. It shows how, once action has been introduced as a dependency of the observation function, other action-related quantities can be calculated. Any thoughts on action, or, in your readings, where has action been integrated well, or not, with predictive coding? Well, from my perspective, predictive coding was not really about action before this text, so it was new to me, because I always relate action to predictive processing and active inference, not really to predictive coding. And from all of this that we have been through, I found it fascinating how the complexity grows and how many different applications predictive coding can be found in, especially in machine learning; all of this mathematical heaviness brings many different solutions to problems that we have faced for a long time. So, fascinating. Nice, yeah, agreed, it is really interesting. So just one more take on active inference, which we explored in livestream 26 as well. Predictive coding, like PID control, optimizes the system towards the set point and is ideal for simple regulatory systems like thermostats. Action a(t): we take that same notion of a as something that influences observations; that is how action enters the model, by making observations depend on it. The action is determined by three terms: a proportional term, which minimizes the distance between the current location and the set point (P); an integral term, which minimizes the integral of this error over time (I); and a derivative term, which minimizes the derivative of the error (D). The combination of the three terms produces a robust and simple control system that can be applied, with some tuning, to control almost any simple regulatory process. Higher-order generalized coordinates of motion could do even better, but as we explored in 26, often PID is sufficient, and it is a very common engineering method. So this connects the Bayesian mechanics of action, perception, and cognition of active inference, with its generalized coordinates, to PID control in the engineering setting. Then they provide a bunch of other formalisms. In appendix B they give some detail on the idea of natural gradients, which I think we could go into in the dot one and the dot two, so we will not talk about it here; appendix B is about precision as natural gradients. And appendix C presents some challenges for the neural implementation of backpropagation by predictive coding; we are not going to explore that here either, but those are both really interesting, short, and very topical appendices.
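That PID scheme can be sketched in a few lines; all the constants and the toy room dynamics here are ours, just to show the three terms at work regulating toward a thermostat-style set point:

```python
set_point = 21.0
temp = 15.0                 # current room temperature
kp, ki, kd = 0.5, 0.05, 0.1 # proportional, integral, derivative gains

integral = 0.0
prev_error = set_point - temp

for _ in range(500):
    error = set_point - temp          # P: distance to the set point
    integral += error                 # I: accumulated error over time
    derivative = error - prev_error   # D: rate of change of the error
    prev_error = error

    u = kp * error + ki * integral + kd * derivative
    # Toy plant: the control input u nudges the temperature, with a
    # little leakage toward a cold 10-degree ambient environment.
    temp += 0.1 * u + 0.01 * (10.0 - temp)

print(round(temp, 2))   # settles near the 21-degree set point
```

The integral term is what removes the steady-state offset that the leakage toward the cold ambient would otherwise leave: with P alone, the room would settle a bit below the set point.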
Section five is of quite good length, and we are not going to cover any of it today, because it has been a great and long enough dot zero, but in the dot one and the dot two and beyond we hope to unpack the discussion and the future directions. So we have a lot of room and space for questions, and I think we both, or all, walk out of this with more questions than answers. But what would be your closing thoughts, Maria? I am not sure what to think; I have to digest all the information we had here today. But I have this wish to continue with the understanding of the formalisms; I want to understand better what you said here. And I am really excited about the biological part of the predictive coding and predictive processing implementations. As it is not really developed in this paper, and I think the authors said it is not really explored, actually, I have this idea of looking into it, to see what has been done in the past years. Awesome, yeah, I think there will be some fun upcoming discussions. Thanks so much for all the help with this dot zero, and to Brock as well. So, see you around. Thanks a lot, see you, bye.