Hello and welcome. It's Active Inference GuestStream number 51.1 on July 28th, 2023. We are here with Tommaso Salvatori, and we will be having a presentation and a discussion on the recent work "Causal Inference via Predictive Coding". So thanks so much for joining. For those who are watching live, feel free to write questions in the live chat. And off to you. Thank you.
Thank you very much, Daniel, for inviting me. I've always been a big fan of the channel and I've been watching a lot of videos, so I'm quite excited to be here and be the one speaking this time. So I'm going to talk about this recent preprint that I put out, which has been the work of the last couple of months. It's a collaboration with Luca Pinchetti, Amine M'Charrak, Beren Millidge and Thomas Lukasiewicz. And it's basically a joint work between VERSES, which is the company I work for, the University of Oxford and TU Wien.
So this is basically the outline of the talk. I will start by talking about what predictive coding is and give an introduction to it, a brief historical introduction, and say why I think it's important to study predictive coding, even, for example, from the machine learning perspective. I will then provide a small intro to what causal inference is. And once we have all that information together, I will discuss why I wrote this paper, what was basically the research question that inspired me and the other collaborators, and present the main results, which are how to perform inference, so interventional and counterfactual inference, and how to learn the causal structure from a given dataset using predictive coding. And then I will of course conclude with a small summary and some discussion on why I believe this work can be impactful, and some future directions.
So what is predictive coding? Predictive coding is in general famous for being a neuroscience-inspired learning method, so a theory of how information processing in the brain works.
More formally speaking, the theory of predictive coding can be described as basically having a hierarchical structure of neurons in the brain, and you have two different families of neurons. The first family is the one in charge of sending prediction information, so neurons in a specific level of the hierarchy send information and predict the activity of the level below. And the second family of neurons is that of error neurons. The error neurons send prediction error information up the hierarchy. So one level predicts the activity of the level below. This prediction has some mismatch with what's actually going on in the level below. And the information about the prediction error gets sent up the hierarchy.
However, predictive coding was actually not born as a theory from the neurosciences. It was initially developed as a method for signal processing and compression back in the 50s. So in the work of Oliver and Elias, who were actually contemporaries of Shannon, they realized that once we have a predictor, a model that works well in predicting data, sending messages about the error in those predictions is actually much cheaper than sending the entire message every time. And this is how predictive coding was born: as a signal processing and compression mechanism in information theory back in the 50s.
It was actually in the 80s that exactly the same model was used in neuroscience, with the work from Mumford or other works that, for example, explain how the retina processes information. So we get signals from the outside world, and we need to compress this representation and have this internal representation in our neurons. And the method is very similar, if not equivalent, to the one that was developed by Elias and Oliver in the 50s.
Maybe the biggest paradigm shift happened in 1999, thanks to the work of Rao and Ballard, in which they introduced this concept that I mentioned earlier about hierarchical structures in the brain, where prediction information is top-down and error information is bottom-up. And something that they did that wasn't done before is that they explained and developed this theory about not only inference, but also about how learning works in the brain. So it's also a theory of how our synapses get updated.
And the last big breakthrough that I'm going to talk about in this brief historical introduction is from 2003, but then it kept going in the years after, thanks to Karl Friston, in which basically he took the theory of Rao and Ballard and he extended it and generalized it to the theory of generative models. So basically the main claim that Karl Friston made is that predictive coding is an evidence maximization scheme of a specific kind of generative model, which I'm going to introduce later as well.
So to make a brief summary, the first two kinds of predictive coding that I described, so signal processing and compression, and information processing in the retina and in the brain in general, are inference methods. And the biggest change, the biggest revolution that we had from 1999, so let's say in the 21st century, is that predictive coding was seen as a learning algorithm. So we can first compress information and then update all the synapses, or all the latent variables that we have in our generative model, to improve our generative model itself.
So let's give some definitions that are a little bit more formal. Predictive coding can be seen as a hierarchical Gaussian generative model. So here is a very simple figure in which we have this hierarchical structure, which can be as deep as we want. And prediction signals go from one latent variable, Xn, to the following one, and they get transformed every time via a function gn, or gi.
And this is a generative model, as I said, and what's the marginal probability of this generative model? Well, it's simply the probability of the last... can you see my cursor? Yes, right? Yes, perfect. So it's the distribution of the last vertex times the probability distribution of every other vertex conditioned on the activity of the vertex before, or the latent variable before. I said earlier that it's a Gaussian generative model, which means that those probabilities are in Gaussian form. And those functions g in general, for example in the Rao and Ballard paper and in all the papers that came afterwards, also because of the deep learning revolution, are simply linear maps, or nonlinear maps with activation functions, or nonlinear maps with an activation function and an additive bias.
So we can give a formal definition of predictive coding, and we can say that predictive coding is an inversion scheme for such a generative model, where its model evidence is maximized by minimizing a quantity that is called the variational free energy. In general, the goal of every generative model is to maximize model evidence, but this quantity is always intractable, and we have some techniques that allow us to approximate the solution. And the one that we use in predictive coding is that of minimizing the variational free energy, which is a lower bound of the model evidence.
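As a reading aid, the factorization and the bound described here can be written out as follows. This is a sketch under assumptions that are not on the slides as transcribed: the levels are indexed so that x_n is the top of the hierarchy and x_0 = o is the observed level, g_{i+1} maps level i+1 to a prediction of level i, and q is the approximate posterior over the latent levels x = (x_1, ..., x_n).

```latex
\begin{align}
  p(x_0, \dots, x_n) &= p(x_n) \prod_{i=0}^{n-1} p(x_i \mid x_{i+1}),
  \qquad
  p(x_i \mid x_{i+1}) = \mathcal{N}\!\big(x_i ;\, g_{i+1}(x_{i+1}),\, \Sigma_i\big), \\
  \mathcal{F} &= \mathbb{E}_{q(x)}\!\big[\ln q(x) - \ln p(o, x)\big]
  \;\ge\; -\ln p(o).
\end{align}
```

Read this way, minimizing the free energy F pushes up a lower bound (namely -F) on the log model evidence ln p(o), which is the sense in which the talk calls it a lower bound of the model evidence.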
In this work, and actually in a lot of other ones, so it's the standard way of doing it, this minimization is performed via gradient descent. There are actually other methods, such as expectation maximization, which is often equivalent, or you can use some other message passing algorithms such as belief propagation, for example.
And going a little bit back, so forgetting a little bit about the statistical generative models, we can see predictive coding, as I said already a couple of times, as a hierarchical model with neural activities, so with latent variables that represent neural activities and send their signal down the hierarchy, and with error nodes or error neurons that send their signal up the hierarchy, so they send the error information back. What's the variational free energy of this class of predictive coding models? It's simply the sum of the mean square error of all the error neurons, so it's the sum of the total error squared. And this representation is going to be useful in the later slides, in how I'm going to explain how to use predictive coding to model causal inference, for example.
Why do I think predictive coding is important and is a nice algorithm to study? Well, first of all, as I said earlier, it optimizes the correct objective, which is the model evidence or marginal likelihood, and it does so by optimizing a lower bound, which is called the variational free energy, as I said. And the variational free energy is interesting because it can be written as a sum of two different terms, and optimizing each of those terms has important impacts, for example in machine learning tasks or in learning tasks in general. So one of those terms forces memorization: the second term basically forces the model to fit a specific dataset. And the first term forces the model to minimize the complexity.
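The "sum of squared errors" view of the free energy, and its minimization by gradient descent on the neural activities, can be sketched as below. This is a minimal illustration and not the code from the paper: it assumes linear maps between levels, unit variances, no prior on the top level, and the names `errors`, `free_energy` and `infer` are made up for this example.

```python
import numpy as np

def errors(xs, Ws):
    """Prediction errors eps_i = x_i - W_i @ x_{i+1}; xs[0] is the observed (bottom) level."""
    return [xs[i] - Ws[i] @ xs[i + 1] for i in range(len(Ws))]

def free_energy(xs, Ws):
    """Half the total squared prediction error (assumes unit variances)."""
    return 0.5 * sum(float(e @ e) for e in errors(xs, Ws))

def infer(xs, Ws, steps=200, lr=0.05):
    """Gradient descent on the latent levels xs[1:], with the data xs[0] kept fixed."""
    xs = [x.copy() for x in xs]
    for _ in range(steps):
        eps = errors(xs, Ws)
        for i in range(1, len(xs)):
            # Error from the level below, sent up through the transposed weights.
            grad = -Ws[i - 1].T @ eps[i - 1]
            # The level's own error, if some level above predicts it.
            if i < len(Ws):
                grad = grad + eps[i]
            xs[i] -= lr * grad
    return xs

rng = np.random.default_rng(0)
Ws = [0.5 * rng.standard_normal((4, 3)), 0.5 * rng.standard_normal((3, 2))]
xs0 = [rng.standard_normal(4), np.zeros(3), np.zeros(2)]
xs1 = infer(xs0, Ws)
print(free_energy(xs0, Ws), "->", free_energy(xs1, Ws))
```

Only the activities are updated here; in a learning setting the weights would be updated on the same energy once the activities have settled.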
And as we know, for example, from the Occam's razor theory, if we have two different models that perform similarly on a specific training set, the one that we should pick, and the one that is expected to generalize the most, is the less complex one. So updating a generative model via variational free energy allows us to basically converge to the optimal Occam's razor model, which both memorizes a dataset but is also able to generalize very well on unseen data points.
A second reason why predictive coding is important is that it actually doesn't have to be defined on a hierarchical structure, but it can be modeled on more complex and flexible architectures, such as directed graphical models with any shape, or generalized even more to networks with a lot of cycles that resemble brain regions. And the underlying reason is that you're not learning and predicting with a forward pass and then backpropagating the error, but you're minimizing an energy function. And this basically allows you to go beyond hierarchies and to learn cycles. This is actually quite important because the brain is full of cycles, as we know from some recent papers that have managed to completely map the brain of some animals, such as the fruit fly. The brain is full of cycles, so it makes sense to train our machine learning models, or our models in general, with an algorithm that allows us to train using cyclic structures.
Another reason why predictive coding is interesting is that it has been formally proven that it is more robust than standard neural networks trained with backpropagation. So if you have a neural network and you want to perform classification tasks, predictive coding is more robust. And this is interesting in tasks such as online learning, training on small datasets, or continual learning tasks.
And the theory basically comes from the fact that predictive coding has been proved to approximate implicit gradient descent, which is a different version of explicit gradient descent, the standard gradient descent used in basically every single model. And it's a variation that is more robust.
Okay, I did quite a long intro to predictive coding. I'm now moving to the second topic, which is causal inference. And what's causal inference? Causal inference is a very general theory that has been formalized the most by Judea Pearl. He's definitely the most important person in the field of causal inference. He wrote some very nice books. For example, The Book of Why is highly recommended if you want to learn more about this topic. And it basically tackles the following problem. So let's assume we have a joint probability distribution, which is associated with a Bayesian network. This is going to be a little bit the running example throughout the whole paper, especially Bayesian networks of this shape. The variables inside those Bayesian networks can represent different quantities. So for example, a Bayesian network with this shape can represent the quantities on the right. It can represent the socioeconomic status of an individual, their education level, their intelligence, and their income level.
Something that classical statistics is very good at, and it's its most-used application, is to model observations or correlations. A correlation basically answers the question: what is D if we observe another variable C? So for example, in this case, what's the expected income level of an individual if I observe their education level? And of course, if that person has a higher degree of education, for example a master's or a PhD, I'm expecting in general that person to have a higher income level. And this is a correlation.
However, sometimes there are things that are very hard to observe, but they play a huge role in determining those quantities. So for example, it could be that the income level is much, much more determined by the intelligence of a specific person. And if a person is intelligent, they are also more likely to have a higher education level. But still, the real reason why the income is high is the IQ. And this cannot be studied by simple correlations, and has to be studied by a more advanced technique, which is called an intervention.
An intervention basically answers the question: what is D if we change C to a specific value? So for example, we can take an individual, check their income level, and then change their education level. So we intervene on this world and change the education level without touching the intelligence, and see how much the income changes. For example, if the income changes a lot, it means that the intelligence doesn't play a big role in this, but the education level does. If the income level doesn't change much, it means that maybe there's a hidden variable, in this case the intelligence, that determines the income level of a person.
The third quantity important in causal inference is that of counterfactuals. A counterfactual answers the question: what would D be, had we changed C to a different value in the past? So the difference between interventions and counterfactuals is that interventions act in the future. I'm intervening in the world now to observe a change in the future. Counterfactuals, instead, allow us to go back in time, change a variable back in time, and see how the change would have influenced the world we live in now. And those are defined by Judea Pearl as the three levels of causal inference. Correlation is the first level, intervention is the second level, and counterfactual is the third level. What are interventions?
I'm going to define them more formally now that I gave an intuitive definition. And I'm using this notation here, which is actually the same throughout the presentation. So X is always going to be a latent variable. Si is always going to be a data point or an observation. And Vi is always going to be a vertex. So every time you see Vi, we are only interested in the structure of the graph, for example. So let's assume we have a Bayesian model which has the same structure as the Bayesian model we saw in the previous slide. Given that X3 is equal to S3, this is the observation we make, statistics allows us to compute the probability or the expectation of X4, which is the latent variable related to this vertex, given that X3 is equal to S3.
To perform an intervention, we need a new kind of notation, which is called the do operation. So in this case, we want to compute the probability of X4 given the fact that we intervene in the world and change X3 to S3. And how do we do this? To perform an intervention, Judea Pearl tells us that we have to have an intermediate step before computing a correlation. First, we have to remove all the incoming edges to V3. So we have to study not this Bayesian network, but this second one. And then at this point, we are allowed to compute a correlation, as we normally do. And this is an intervention.
A counterfactual is a generalization of this that, as I said, lives in the past. And counterfactuals are computed using structural causal models. A structural causal model is a tuple which is conceptually similar to a Bayesian network. But basically, we have this new class of variables on top, which are the unobservable variables, the Us. So we have the Bayesian network that we had before, X1, X2, X3, X4. But we also have those unobservable variables that depend on the environment. You cannot control them. You can infer them, but they are there. And F is a set of functions.
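The difference between conditioning on X3 = S3 and intervening with do(X3 = S3) can be illustrated on a small linear model. This is only a sketch under loudly stated assumptions: the edge set (X1 -> X3, X2 -> X3, X2 -> X4, X3 -> X4), the coefficient values and the unit-variance Gaussian noise are all hypothetical choices made for this example, not values from the talk.

```python
import numpy as np

# Hypothetical edge weights for the assumed graph x1 -> x3 <- x2, x2 -> x4 <- x3.
W13, W23, W24, W34 = 1.0, 1.0, 2.0, 0.5

def sample(n, rng, do_x3=None):
    """Draw n samples; if do_x3 is given, the incoming edges to x3 are cut."""
    x1 = rng.standard_normal(n)
    x2 = rng.standard_normal(n)
    if do_x3 is None:
        x3 = W13 * x1 + W23 * x2 + rng.standard_normal(n)
    else:
        x3 = np.full(n, float(do_x3))  # x3 no longer listens to x1 and x2
    x4 = W24 * x2 + W34 * x3 + rng.standard_normal(n)
    return x1, x2, x3, x4

rng = np.random.default_rng(0)

# Observational (correlation) view: slope of E[x4 | x3] estimated by regression.
_, _, x3, x4 = sample(200_000, rng)
obs_slope = np.cov(x3, x4)[0, 1] / np.var(x3, ddof=1)

# Interventional view: mean of x4 after setting x3 = 1 in the mutilated graph.
_, _, _, x4_do = sample(200_000, rng, do_x3=1.0)

print("observational slope :", obs_slope)      # includes the path through x2
print("E[x4 | do(x3 = 1)]  :", x4_do.mean())   # only the direct effect W34
```

In this toy model the observational slope is larger than the direct effect because X2 influences both X3 and X4, which is the role intelligence plays in the income example.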
Basically, F of X3 depends on X1 because you have an arrow, on X2 because you have an arrow, and on the unobservable variable that also influences X3. So, intuitively, you can think of a structural causal model as a Bayesian network with those unobservable variables on top. And each unobservable variable only influences its own related variable X. So, for example, U3 will never touch X1. U3 will only touch X3. U1 will only influence X1, and so forth and so on.
So, performing counterfactual inference answers the following question: what would X4 be, had X3 been equal to another value in a past situation U? And computing this counterfactual requires three different steps. Abduction is the computation of all the background variables. So, in this step, we want to go back in time and understand how the unobservable environment was in that specific moment in time. And we do this by fixing all the latent variables X to some specific data that we already have and performing this inference on the Us. Then we keep the Us that we have learned and perform an intervention. So, a counterfactual can also be seen as an intervention back in time in which we know the environment variables U1, U2 and U4 in that specific moment. And what's the missing step? So, what would X4 be, had X3 been equal to another data point in that specific situation? Now we can compute a correlation. And we do it on the graph in which we have already performed an intervention, using the environment variables that we have learned in the abduction step. And this is counterfactual inference.
This is the last slide of the causal inference introduction, and it is about structure learning. Basically, everything I've said so far relies on the fact that we know the causal dependencies among the data points. So, we know the structure of the graph. We know which variable influences which one. We know the arrows in general.
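The three counterfactual steps described in this part of the talk (abduction, action, prediction) can be sketched on a toy linear structural causal model. The edge set, the coefficients and the additive noise terms below are hypothetical choices for illustration; the function name `counterfactual_x4` is likewise made up.

```python
# Hypothetical linear SCM on the assumed graph x1 -> x3 <- x2, x2 -> x4 <- x3:
#   x1 = u1, x2 = u2, x3 = W13*x1 + W23*x2 + u3, x4 = W24*x2 + W34*x3 + u4
W13, W23, W24, W34 = 1.0, 1.0, 2.0, 0.5

def counterfactual_x4(observed, x3_new):
    """What would x4 have been, had x3 been x3_new in the observed situation?"""
    x1, x2, x3, x4 = observed

    # 1) Abduction: recover the unobservable variables from the observed data.
    u1, u2 = x1, x2
    u4 = x4 - (W24 * x2 + W34 * x3)
    # (u3 = x3 - W13*x1 - W23*x2 could be recovered too, but the intervention discards it.)

    # 2) Action: intervene, do(x3 = x3_new), cutting the incoming edges of x3.
    x3_cf = x3_new

    # 3) Prediction: recompute the downstream variable with the abducted noise.
    x1_cf, x2_cf = u1, u2
    return W24 * x2_cf + W34 * x3_cf + u4

observed = (0.3, -1.0, 0.2, 1.5)
print(counterfactual_x4(observed, x3_new=2.2))
```

Because the environment variables are kept from the observed situation, the answer differs from a plain intervention, which would average over them instead.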
But in practice, this is actually not always possible. We don't have access to the causal graph most of the time, and actually learning the best causal graph from data is still an open problem. We are improving in this. We're getting better. But how to perform this task exactly is still an open problem. So, as I said, basically the goal is to infer causal relationships from observational data. Given a dataset, we want to infer the directed acyclic graph that describes the connectivity between the variables of the dataset. So, for example, here we have an example that I guess we are all familiar with because of the pandemic. We have those four variables: age, vaccine, hospitalization, and CT. And we want to infer the causal dependencies among those variables. So, for example, we want to learn directly from data that the probability of a person being hospitalized depends on their age and on whether they are vaccinated or not, and so forth and so on.
So, this is the end of the long introduction. I hope it was clear enough, and I hope that I gave the basics to understand the results of the paper. And now we can go to the research questions. So, the research questions are the following. First, I want to see whether predictive coding can be used to perform causal inference. Predictive coding so far has only been used to compute correlations in Bayesian networks. And the big question is: can we go beyond correlation and model interventions and counterfactuals in a biologically plausible way? So, in a way that is, for example, simple, intuitive, and allows us to only play with the neurons and not touch, for example, the structure of the graph. And more specifically, the question becomes: can we define a predictive coding-based structural causal model to perform interventions and counterfactuals?
The second question is this: as I said, having a structural causal model assumes that we know the structure of the Bayesian network. So, it assumes that we have the arrows. Can we go beyond this and use predictive coding networks to learn the causal structure of the graph? Basically, giving positive answers to both those questions would allow us to use predictive coding as an end-to-end causal inference method, which basically takes a dataset and allows us to test interventional and counterfactual predictions directly from this dataset.
So, let's tackle the first problem: causal inference via predictive coding, which is also the section that gives the title to the paper, basically. And here I will show how to perform correlations with predictive coding, which is already known, and how to perform interventional queries, which I think is the real question of the paper. So, here is a causal graph, which is the usual graph that we had. And here is the corresponding predictive coding model. So, the Xs are the latent variables and correspond to the neurons in a neural network model. And the black arrow passes prediction information from one neuron to the one down the hierarchy. And every vertex also has an error neuron, which passes information up the hierarchy. So, the information of every error goes to the value node up the hierarchy and basically tells it to correct itself, to change the prediction.
So, to perform a correlation using predictive coding, what you have to do is take an observation and simply fix the value of a specific neuron. So, if you want to compute the probability of X4 given X3 equal to S3, you simply have to take X3 and fix it to S3 in a way that it doesn't change anymore, and run an energy minimization. And by updating the Xs via a minimization of the variational free energy, the model converges to a solution to this question.
So, the probability or the expected value of X4 given X3 equal to S3. But how do I perform an intervention now without acting on the structure of the graph? Well, this is basically the first idea of the paper. This is still how to perform a correlation: fixing X3 equal to S3 is the first step in the algorithm, and the second one is to update the Xs by minimizing the variational free energy. An intervention, which in theory corresponds to removing those arrows and answers the question of the probability of X4 given do(X3 equal to S3), can be performed in predictive coding as follows. So, I'm going to write the algorithm here. First, as in a correlation, you fix X3 equal to the observation that you get. Then this is the important step. You have to intervene not on the graph anymore, but on the prediction error, and fix it equal to 0. Having a prediction error equal to 0 basically sends meaningless information up the hierarchy, or actually sends no information up the hierarchy, because it basically tells you that the prediction is always correct. And the third step is, as we did before, to update the unconstrained Xs, so X1, X2, X4, by minimizing the variational free energy. As I will show now experimentally, simply doing this little trick of setting a prediction error equal to 0 spares us from actually acting on the structure of the graph, as the theory of do-calculus does, and lets us infer the variables after an intervention by simply performing a variational free energy minimization.
What about counterfactual inference? Counterfactual inference is actually easy once we have defined how to do an intervention. And this is because, as we saw earlier, performing a counterfactual is similar to performing an intervention in a past situation after you have inferred the unobservable variables.
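The two query types described here (fix X3 and minimize the energy, versus fix X3, zero its prediction error, and minimize the energy) can be sketched on a toy linear predictive coding graph. This is an illustration under assumptions, not the implementation from the paper: the edge set and weights are hypothetical, variances are set to one, root nodes are predicted to be zero, and `pc_query` is a made-up name.

```python
import numpy as np

# W[j, i] is the weight of the edge x_{j+1} -> x_{i+1} (assumed graph and weights).
W = np.zeros((4, 4))
W[0, 2], W[1, 2] = 1.0, 1.0   # x1 -> x3, x2 -> x3
W[1, 3], W[2, 3] = 2.0, 0.5   # x2 -> x4, x3 -> x4

def pc_query(W, node, value, intervene, steps=2000, lr=0.05):
    """Clamp one node and settle the others by gradient descent on the energy.

    intervene=False: conditional (correlation) query.
    intervene=True : the clamped node's prediction error is additionally set to zero.
    """
    x = np.zeros(W.shape[0])
    x[node] = value
    free = np.arange(len(x)) != node      # every node except the clamped one
    for _ in range(steps):
        eps = x - W.T @ x                 # error = value - prediction from parents
        if intervene:
            eps[node] = 0.0               # nothing is sent up to the parents
        grad = eps - W @ eps              # own error minus errors of the children
        x[free] -= lr * grad[free]
    return x

cond = pc_query(W, node=2, value=1.0, intervene=False)
do = pc_query(W, node=2, value=1.0, intervene=True)
print("x4 given x3 = 1     :", cond[3])
print("x4 given do(x3 = 1) :", do[3])
```

In this toy setting the conditional query moves X1 and X2 to explain the clamped value, while the interventional query leaves them at their prior and only the direct effect of X3 on X4 remains, without touching any edge of the graph.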
So, as you can see in the plot I showed earlier about the abduction, action and prediction steps, the action and prediction steps did not have those two arrows. They were removed. Predictive coding allows us to keep the arrows in the graph and perform counterfactuals by simply performing an abduction step, as was done earlier, and an action step in which we simply perform an intervention on the single node. So, we fix the value node, we set the error to 0, and we run the energy minimization, so minimizing the variational free energy, to compute the prediction. So, I think this is an easy and elegant method to perform interventions and counterfactuals. And the thing we have to show now is whether it works in practice or not.
And we have a couple of experiments. I'm going to show you now two different experiments. The first one is merely a proof-of-concept experiment that shows that predictive coding is able to perform interventions and counterfactuals. And the second one actually shows a simple application of how interventional queries can be used to improve the performance of classification tasks on a specific kind of predictive coding network, which is that of a fully connected model. Let's start from the first one. So, how do we do this task? Given a structural causal model, we generate training data and we use it to learn the weights, so to learn the functions of the structural causal model. And then we generate test data for both interventional and counterfactual queries. And we show whether we are able to converge to the correct test data using predictive coding. And, for example, here, those two plots represent the interventional and counterfactual queries of this specific graph, which is the butterfly bias graph, a graph that is often used in testing whether interventional and counterfactual techniques work. It's as simple as that. But in the paper, you can find a lot of different graphs.
But in general, those two plots show that the method works. They show that the interventional and counterfactual quantities we compute and the interventional and counterfactual quantities from the original graph are close to each other, so the absolute error is quite small.
The second experiment is basically an extension of an experiment I proposed in an earlier paper, which is the Learning on Arbitrary Graph Topologies paper that I wrote last year. In that paper, I basically proposed this kind of network as a proof of concept, which is a fully connected network, which is, in general, the worst neural network you can have to perform machine learning experiments. Because, given a fixed set of neurons, basically every pair of neurons is connected by two different synapses, so it's the model with the highest complexity possible in general. The good thing is that, since you have a lot of cycles, the model is extremely flexible, in the sense that you can train it, for example, on an MNIST image, so on a data point and on its label. But then, thanks to the information going back, you can query it in a lot of different ways. So you can perform classification tasks, in which you provide an image, run the energy minimization, and get the label. But you can also, for example, perform generation tasks, in which you give the label, run the energy minimization, and get the image. You can perform, for example, image completion, in which you give half the image and let the model converge to the second half, and so forth and so on. So it's basically a model that learns the statistics of the dataset in its entirety, without being focused on classification or generation in particular. So this flexibility is great. The problem is that, because of this, every single task doesn't work well. So you can do a lot of different things, but none of them is done well.
And here I want to show how using interventional queries instead of standard correlation queries, or conditional queries, slightly improves the results of those classification tasks. So what are the conjectured reasons for the test accuracy on those tasks not being so high? The first reason is that the model is distracted in correcting every single error. So basically you present an image and you would like to get a label, but the model is actually updating itself to also predict the error in the images. And the second reason, which is the one I said, is that the structure is far too complex. So again, from an Occam's razor argument, this is the worst model you can have. Every time you have another model that fits a dataset, that model is going to be less complex than this one, and so it is going to be preferred.
But in general, just to study it, the idea is: can querying this model via interventions be used to improve the performance of those fully connected models? Well, the answer is yes. So here is how I perform interventional queries. I present an image to the network. I fix the error of the pixels to be equal to zero, so this error doesn't get propagated in the network. And then I compute the label. And as you can see, the accuracy improves. For example, from 89, using the standard query method of predictive coding networks, to 92, which is the accuracy after the intervention, and the same happens for FashionMNIST. And I think that a very legit critique that probably everyone would think of when seeing those plots is that, okay, you improve on MNIST from 89 to 92, it still sucks, basically. And yeah, it's true. In the later slides, I'm going to show how acting on the structure of this fully connected model will improve the results even more, up to the point where they reach a performance that is not even close to state-of-the-art performance, of course, but is still at a level that becomes basically acceptable and worth investigating.
So yes, this was the part about causal inference using predictive coding. And I guess to summarize, I can say that the interesting part of the results I just showed is that predictive coding is able to perform interventions in a very easy and intuitive way, because you don't have to act on the structure of the whole graph anymore. Sometimes those functions are not available, so forth and so on, but you simply have to intervene on a single neuron, set its prediction error to zero, and perform an energy minimization process. And this extension allowed us to define predictive coding-based structural causal models.
Now we move to the second part of the work, which is about structure learning. So structure learning, as I said, deals with the problem of learning the causal structure of the model from observational data. This is actually a problem that has been around for decades and has always been, until a couple of years ago, tackled using combinatorial search methods. The problem with those combinatorial search methods is that their complexity grows super-exponentially. So as soon as the data becomes multidimensional and the Bayesian graph that you want to learn grows in size, learning it is incredibly slow. The new solution, which came out a couple of years ago in a NeurIPS paper from 2018, showed that it's possible to learn this structure not using a combinatorial search method, but by using a gradient-based method. And this basically addressed the scaling problem in general, because now you can simply have a prior on the parameters, which is the prior that they proposed and that I'm going to define a little bit better in this slide, and run gradient descent. And even if you have a model that is double or triple the size, the algorithm is still incredibly fast. And for this reason, this paper is... Yeah, I think it's kind of new, and I think it already has around 600 citations or something like that.
And every paper that I'm seeing now about causal inference and learning the causal structure of a graph uses their method. It just changes a little bit: they find faster or slightly better inference methods, but they all still use the prior this paper defined. And I do as well, and we do as well. So here we define a new quantity, which is the adjacency matrix. The adjacency matrix is simply a matrix that encodes the connections of the model. In general, it's a binary matrix. Then of course, when you do gradient-based optimization, you make it continuous, and then at some point you have a threshold that basically kills an edge or sets it to one. And the entry ij is equal to one if the Bayesian graph has an edge from vertex i to vertex j, and zero otherwise. So for example, this adjacency matrix here represents the connectivity structure of this Bayesian network. And basically this method tackles two problems that we care about when learning the structure of a Bayesian network. The idea is that we start from a fully connected model, which conceptually is similar, actually equivalent, to the predictive coding network I defined earlier, which is fully connected. So you have a lot of vertices, and every pair of vertices is connected by two different edges. And you simply want to prune the ones that are not needed. So it can be seen as a method that performs model reduction: you start from a big model, and you want to make it small. So what's the first ingredient to reduce models? Well, it's of course sparsity. And the prior that everyone uses to make a model more sparse is the Laplace prior, which in machine learning is simply known as the L1 norm, which is defined here. The solution that the paper I mentioned earlier proposed is to add a second prior on top, which enforces what's probably the most important characteristic of Bayesian networks on which you want to perform causal inference: you want them to be acyclic.
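As a small illustration of the adjacency matrix, the sparsity term and the thresholding step described here, the following sketch uses made-up numbers; the helper names `l1_penalty` and `prune`, the weight `lam` and the threshold value are assumptions for illustration, not values from the paper.

```python
import numpy as np

# Continuous adjacency matrix as it might look after gradient-based training.
# Entry [i, j] is the strength of the edge i -> j (values are made up).
A = np.array([
    [0.00, 0.85, 0.03],
    [0.02, 0.00, 0.60],
    [0.01, 0.04, 0.00],
])

def l1_penalty(A, lam=0.1):
    """Sparsity (Laplace prior / L1 norm) term added to the objective."""
    return lam * np.abs(A).sum()

def prune(A, threshold=0.1):
    """Binarize: an edge is kept (1) if its weight exceeds the threshold, else 0."""
    return (np.abs(A) > threshold).astype(int)

print(l1_penalty(A))
print(prune(A))
```

After pruning, only the edges 0 -> 1 and 1 -> 2 survive, which is the binary adjacency matrix of a three-node chain.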
And basically they show that acyclicity can be imposed on an adjacency matrix as a prior, and it has this shape here. So it's the trace of the matrix that is the exponential of A times A, where A is the adjacency matrix again. And basically this quantity here is equal to zero if and only if the Bayesian network, or whatever graph you're considering, is acyclic. So I'm going to use this in some experiments: I enforce those two priors on different kinds of Bayesian networks, and I try to merge them with the techniques we proposed earlier for performing causal inference via predictive coding. I'm going to present two different experiments. One is a proof of concept, which is the standard experiment shown in all the structure learning papers, which is the inference of the correct Bayesian network from data. And then I'm going to build on top of the classification experiments I showed earlier and show how those priors allow us to improve the classification accuracy, the test accuracy, of fully connected predictive coding models. So let's move to the first experiment, which is to infer the structure of the graph. The experiments all follow basically the same pipeline in all the papers in the field. The first step is to generate a Bayesian network from a random graph. Normally the two kinds of random graphs that everyone tests are Erdős–Rényi graphs and scale-free graphs. So you generate those big graphs that normally have 20, 40, or 80 nodes and some edges that you sample randomly. And you use this graph to generate a dataset. So you sample, for example, N data points. And then you take the graph that you generated earlier and you throw it away; you only keep the dataset. And the task you want to solve now is to have a training algorithm that allows you to retrieve the structure of the graph you have thrown away.
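The acyclicity quantity and the first steps of the benchmark pipeline can be sketched as follows. This is an illustration under assumptions, not the paper's code: "A times A" is taken to be the element-wise product and the number of nodes d is subtracted from the trace, as I understand the 2018 formulation; the matrix exponential is replaced by a truncated power series to keep the example NumPy-only; and the function names, edge probability and weight range are hypothetical.

```python
import numpy as np
from math import factorial

def acyclicity(A, terms=None):
    """Truncated series for tr(exp(A * A)) - d, with * the element-wise product.

    Each term tr(M^k) / k! is non-negative and sums weighted cycles of
    length k, so the total is zero exactly when the graph has no cycles.
    """
    d = A.shape[0]
    M = A * A
    terms = terms or d          # simple cycles in a d-node graph have length <= d
    total, P = 0.0, np.eye(d)
    for k in range(1, terms + 1):
        P = P @ M
        total += np.trace(P) / factorial(k)
    return total

def random_dag(d, p, rng):
    """Erdős–Rényi style DAG: sample edges, keep only i -> j with i < j."""
    mask = np.triu(rng.random((d, d)) < p, k=1)
    weights = rng.uniform(0.5, 2.0, size=(d, d))
    return mask * weights

def sample(A, n, rng):
    """Linear-Gaussian data: each node is a weighted sum of its parents plus noise."""
    d = A.shape[0]
    X = np.zeros((n, d))
    for j in range(d):          # nodes are already in topological order
        X[:, j] = X @ A[:, j] + rng.normal(size=n)
    return X

rng = np.random.default_rng(0)
A_true = random_dag(d=5, p=0.4, rng=rng)   # ground-truth graph
X = sample(A_true, n=1000, rng=rng)        # the dataset that is kept

A_cyclic = A_true.copy()
A_cyclic[4, 0] = 1.0
A_cyclic[0, 4] = 1.0                       # guarantees a 2-cycle between nodes 0 and 4
print(acyclicity(A_true), acyclicity(A_cyclic), X.shape)
```

In the benchmark described above, `A_true` would then be discarded and a structure learning method would try to recover it from `X` alone.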
So the way we do it here is that we train a fully connected predictive coding model on this dataset D using both the sparse and the acyclic priors we defined earlier, and see whether the graph that we converge to, after pruning away the entries of the adjacency matrix that are smaller than a certain threshold, is similar to the initial graph. And the results show that this is actually the case. So this is an example, and I show many different parametrizations and dimensions and things like that in the paper. But I think those two are the most representative examples, with an Erdős–Rényi graph and a scale-free graph with 20 nodes. Here on the left, you can see the ground truth graph, which is the one sampled randomly. And on the right, you can see the graph that the predictive coding model has learned from the dataset. And as you can see, they are quite similar. It's still not perfect, so there are some errors, but in general it works quite well. We also have some quantitative experiments that I don't show here, because they're just huge tables with a lot of numbers and I thought it was maybe a little bit too much for the presentation. But the results show that they perform similarly to contemporary methods, also because, I have to say, most of the quality comes from the acyclic prior that was introduced in 2018. The second class of experiments are classification experiments which, as I said, are the extension of the ones I showed earlier. And the idea is to use structure learning to improve the classification results on the MNIST and FashionMNIST datasets, starting from a fully connected graph. So what I did is that I divided the fully connected graph into clusters of neurons. So one big cluster is the one related to the input. Then we have a specific number of hidden clusters. And then we have the label cluster, which is the cluster of neurons that are supposed to give me the label predictions.
And I trained them, the first time, using the sparse prior only. So the idea is, what if I prune the connections I don't need from a model and learn a sparser model? Does this work? Well, the answer is no. It doesn't work. And the reason is that the graph you converge to at the end is actually degenerate. So basically, the model learns to predict the label based on the label itself. It discards all the information from the input and only keeps the label. And as you can see here, the label Y predicts itself, or in other experiments, when you change the parameters, you have that Y predicts X0, which predicts X1, which predicts Y again. So what's the solution to this problem? Well, the solution is that we have to converge to an acyclic graph. And so we have to add something that prevents cycles. And what is that? One option is, of course, the one I already proposed, and then I show a second technique. So the first one uses the acyclic prior defined earlier. And the second one is a novel technique that makes use of negative examples. A negative example in this case is simply a data point in which you have an image, but the label is wrong. So here, for example, you have an image of a seven, but the label that I'm giving the model is a two. And the idea is very simple and has been used in a lot of works already. Every time the model sees a positive example, it has to minimize the variational free energy. And every time it sees a negative example, it has to increase it. So this is the quantity we want to minimize. And actually, with a lot of experiments, we saw that the two techniques, first, lead to the same results and, second, lead to the same graph as well. So here are the new results on MNIST and FashionMNIST using the two techniques that I just proposed. And now we move to test accuracies that are still not great, but definitely more reasonable.
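The negative-example idea mentioned above (lower the energy on correctly labelled pairs, raise it on wrongly labelled ones) can be sketched on a toy model. This is only an illustration: a squared prediction error of a linear map stands in for the variational free energy, and the names `energy`, `grad_energy` and `contrastive_step` and the learning rate are hypothetical.

```python
import numpy as np

def energy(W, x, y):
    """Squared prediction error of label y given input x (stand-in for free energy)."""
    return 0.5 * np.sum((y - W @ x) ** 2)

def grad_energy(W, x, y):
    """Gradient of the energy with respect to the weights W."""
    return -np.outer(y - W @ x, x)

def contrastive_step(W, x, y_pos, y_neg, lr=0.1):
    """Descend the energy of the correct pair, ascend the energy of the wrong pair."""
    return W - lr * (grad_energy(W, x, y_pos) - grad_energy(W, x, y_neg))

x = np.array([1.0, 0.0])        # toy "image"
y_pos = np.array([1.0, 0.0])    # correct one-hot label
y_neg = np.array([0.0, 1.0])    # wrong label: the negative example

W0 = np.zeros((2, 2))
W1 = contrastive_step(W0, x, y_pos, y_neg)
print(energy(W0, x, y_pos), energy(W1, x, y_pos))  # goes down
print(energy(W0, x, y_neg), energy(W1, x, y_neg))  # goes up
```

A single step is shown on purpose: repeated ascent on the negative term is unbounded in this toy setting, so a practical version would need some form of balancing that is not specified in the talk.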
So here we have a test error of 3.17 for MNIST and a test error of 13.98 for FashionMNIST. And actually, those results can be much improved by learning the structure of the graph on MNIST, then fixing the structure of the graph and doing some form of fine-tuning. So if you fine-tune the model on the correct hierarchical structure, at some point you reach the test accuracy you would expect from a hierarchical model. But those results are simply the ones the fully connected model has naturally converged to. So for example, starting from a test error of 18.32 for the fully connected model trained on FashionMNIST and queried via correlations or conditional queries, which is the standard way of querying a predictive coding model, adding interventions and the acyclic prior together makes this test error much lower. And we can observe the same for MNIST as well. I'm now going a little bit into the details of this last experiment and of how the acyclic prior acts on the structure of the graph. So I performed an experiment on a new dataset, although calling it a new dataset is maybe too much. I called it the two-MNIST dataset: the input point is formed of two different images, and the label only depends on the first image. So the idea here is: are the structure of the model, the acyclic prior, and things like that able to recognize that the second half of the image is actually meaningless for performing classification? And how does training behave in general? For example, we have input nodes and output nodes, all the nodes are fully connected, and the model converges to a hierarchical structure, which is the one that we know performs best on classification tasks. Well, here is an example of a training run. So at C0, which is the beginning of training, we have this model here. S0 corresponds to the seven, so to the first image.
S1 corresponds to the second image. We have the label Y and all the latent variables, X0, X1, X2. And the model is fully connected. So the adjacency matrix is full of ones. There are no zeros. We have self-loops and things like that. We train the model for a couple of epochs, and what we note immediately is that, for example, the model immediately understands that the four is not needed to perform classification. So every outgoing edge from the second input cluster is removed. And something else it understands is that this cluster is the one related to the output. So we have a linear map from S0 to Y directly, which is this part here. But we know that a linear map is not the best map for performing classification on MNIST. We need some hierarchy, we need some depth, to improve the results. And as you can see, this line here is the accuracy, which up to this point, so up to C2, is 91%, which is slightly better than linear classification. But once you go on with the training, the model understands that it needs some hierarchy to better fit the data. So you see that this arrow gets stronger and stronger over time, until the model understands that the linear map is not really needed and removes it. And so the model you converge to is a model that starts from S0, goes to a hidden node and then goes to the label, with a very weak linear map that gets removed if you set a threshold of, for example, 0.1 or 0.2. At some point the linear map gets forgotten, and what you end up with is a hierarchical network that has learned the correct structure to perform classification tasks, which is hierarchy, and that has also learned that the second image didn't play any role in determining the test accuracy. And all those jobs are simply performed by one free energy minimization process.
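The single objective referred to here, prediction error plus a sparse prior plus an acyclic prior, can be written down for a toy linear model. This is a sketch under assumptions rather than the paper's energy: the linear residual stands in for the predictive coding free energy, the acyclicity term uses a truncated series for tr(exp(A * A)) - d with an element-wise product, and the weights `lam_sparse` and `lam_dag` are made up.

```python
import numpy as np
from math import factorial

def acyclicity(A):
    """Truncated series for tr(exp(A * A)) - d (element-wise product)."""
    d = A.shape[0]
    M, P, total = A * A, np.eye(d), 0.0
    for k in range(1, d + 1):
        P = P @ M
        total += np.trace(P) / factorial(k)
    return total

def objective(X, A, lam_sparse=0.1, lam_dag=1.0):
    """Prediction error + sparse prior + acyclic prior, for a linear model."""
    residual = X - X @ A                       # each node predicted from its parents
    fit = 0.5 * np.mean(np.sum(residual ** 2, axis=1))
    return fit + lam_sparse * np.abs(A).sum() + lam_dag * acyclicity(A)

rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(size=n)
X = np.stack([x0, x1, x2], axis=1)

A_chain = np.zeros((3, 3))
A_chain[0, 1] = A_chain[1, 2] = 2.0            # the generating structure 0 -> 1 -> 2
A_empty = np.zeros((3, 3))
A_loop = A_chain.copy()
A_loop[2, 0] = 2.0                             # adds a cycle 0 -> 1 -> 2 -> 0

for name, A in [("chain", A_chain), ("empty", A_empty), ("loop", A_loop)]:
    print(name, objective(X, A))
```

The generating chain scores lower than both the empty graph, which fits the data poorly, and the graph with an extra edge closing a cycle, which is penalized by both the fit and the acyclic term.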
So you initialize the model, you define the free energy, you define the priors, so the sparse and the acyclic prior, you run the energy minimization, and you converge to a hierarchical model which is well able to perform classification on MNIST. And if you then perform some fine-tuning, you reach very competitive results, as you do with feedforward networks trained with backpropagation, but I think that's not the interesting bit. The interesting bit is that this whole process of intervention and acyclicity together allows you to take a fully connected network and converge to a hierarchical one that is able to perform classification with good results. And yeah, that's basically it. Oh yeah, wow, I've talked a lot. And this is the conclusion of the talk, where I'm basically doing a small summary. I think the important takeaway, if I have to give you one sentence about this paper, is that predictive coding is a belief updating method that is able to perform end-to-end causal learning. So it's able to learn a structure from data and then perform interventions and counterfactuals. In other words, it can efficiently model interventions by simply setting the prediction error to zero. So it's a very easy technique to perform interventions: you only have to touch one neuron, and you don't have to act on the structure of the graph. You can use it to create structural causal models that are biologically plausible. It is able to learn the structure from data, as I've said maybe a lot of times already. And a couple of sentences on future work: something that would be nice to do is to improve the performance of the model we have defined, because I think it performs reasonably well on a lot of tasks. So it performs reasonably well on structure learning, on performing interventions and counterfactuals.
But actually, if you look at state-of-the-art models, there's always a very specific method that performs better on the single task. So it would be interesting to see if we can reach those levels of performance on specific tasks by adding some tricks or some new optimization methods, and to generalize it to dynamical systems, which are actually much more interesting than static systems. So, for example, dynamical causal models and other techniques that allow you to perform causal inference in systems that move: an action taken at a specific time step influences another node at a later time step, which is basically Granger causality. That's it, and thank you very much.
Thank you. Awesome and very comprehensive presentation.
I think you're muted.
Sorry, muted on Zoom. But yes, thanks for the awesome and very comprehensive presentation. There was really a lot there, and there were also a lot of great questions in the live chat. So maybe to warm into the questions, how did you come to study this topic? Were you studying causality and found predictive coding to be useful, or vice versa, or how did you come to this intersection?
I actually have to say that the first person that came up with this idea was Beren. So, I think a year and a half ago or even more, he wrote a page with this idea, and then it got forgotten and no one picked it up. And last summer I started getting curious about causality, and I read, for example, The Book of Why, I started listening to podcasts, the standard way in which you get interested in a topic. And I remembered this idea from Beren and proposed to him, why don't we expand it and actually make it a paper? So I involved some people to work on the experiments, and this is the final result.
Awesome. Cool. Yeah. A lot to say. I'm just going to go to the live chat first and address a bunch of different questions, and if anybody else wants to add more, feel free.
I'm going to turn the light on first, because I think I'm getting more and more in the dark.
Yes. The conference can't solve the dark room issue.
Oh, yes. Here we are.
So would you say the light switch caused it to be lighter?
Yeah, I think so.
No issues here. ML Don wrote: since in predictive coding all distributions are usually Gaussian, the bottom-up messages are precision-weighted prediction errors, where precision is the inverse of the Gaussian covariance. What if non-Gaussian distributions are used?
Basically, the general method stays the same. The main difference is that you don't have prediction errors, which, as was correctly pointed out, are basically the derivative of the variational free energy if you have Gaussian assumptions. You no longer have that single quantity to set to zero, so you will probably have to act on the structure of the graph to perform interventions.
And also, you and colleagues had a paper in 2022, Predictive Coding beyond Gaussian Distributions, that looked at some of these issues, right?
Yes, yes, exactly. So the idea behind that paper is: can we model transformers using predictive coding? That was the biggest motivation. And the answer is no, because the attention mechanism has a softmax at the end, and the softmax corresponds not to a Gaussian distribution but to a softmax distribution; I don't recall the name now, but yes. And so yes, that's a generalization. Once you remove the Gaussian assumption, it's a little bit tricky to still call it predictive coding. So, for example, talking to Karl Friston, it's predictive coding only if you have Gaussian assumptions. But yes, that's more of a philosophical debate than a practically interesting one.
Another topic that I think is definitely of great interest is the similarities and differences between the attention apparatus in transformers and the way that attention is described from a neurocognitive perspective and from a predictive processing, precision-weighting perspective. What do you think about that?
Well, the way I think about it is that, from a predictive processing and also a variational inference perspective, attention can be seen as a kind of structure learning problem. I think there's a recent paper from Chris Buckley's group in which basically they show that the attention mechanism is simply learning the precision on the weight parameters specific to a data point. So this precision is not a parameter that is in the structure of the model. It's not a model-specific parameter but a fast-changing parameter, like the value nodes, that gets updated while minimizing the variational free energy, and once you've minimized it and computed it, you throw it away, and for the next data point you have to recompute it from scratch. So yes, I think the analogy, computation-wise, is that the attention mechanism can be seen as a kind of structure learning, but a structure learning that is data-point specific and not model specific. And if you want to generalize a little bit and go from the attention mechanism in transformers to the attention mechanism in cognitive science, I feel they're probably too different to draw similarities, and I think the structure learning analogy, so how important one connection is with respect to another one, probably does the job much better.
Cool, great answer. Okay, ML Don asks: in counterfactuals, what is the difference between hidden variables X and unobserved variables U?
I think the main difference is that you cannot observe the U's. You can use them, because you can compute them and fix them, but the idea is that you have no control over them.
The U's should be seen as environment-specific variables. They are there, and they influence your process. For example, when you go back in time, the environment is different. Going back to the earlier example of the expected income of a person with a specific intelligence and education degree, the idea is that how much I would earn today with a master's degree is different from how much I would have earned 20 years ago with a master's degree, and it is different, for example, here in Italy with respect to other countries. All those variables are not under your control. You cannot model them using your Bayesian network, but they are there, and you cannot ignore them when you want to draw conclusions. It's basically everything that you cannot control. You can infer them, so you can perform a counterfactual inference back in time and say, 20 years ago I would have earned this much if I was this intelligent with this degree, on average of course, but it's not that I can change the government policies towards jobs or things like that.
It's a deeper counterfactual.
Yes, exactly. Those are the U's.
Awesome. All right. Have you implemented generalized coordinates in predictive coding?
No. I've never done it. I've studied it, but I've never implemented it. I know they tend to be unstable and it's very hard to make them stable. I think that's the takeaway that I got from talking to people that have implemented them. But yeah, I'm aware of some papers that came out recently about them that tested on some threshold encoder style. Actually, I think there's a paper, still from Beren, that came out last summer. But no, I've never played with them myself.
Cool. From Burt: does adding more levels in the hierarchy reduce the distraction problem of predicting input?
Adding more levels in which sense? Because the distraction problem is given by cycles.
You provide an image, and you have edges going out of the image into the neurons and then other edges coming back. Those incoming edges to the pixels of the image create some prediction errors, so you have prediction errors that get spread inside the model. Yeah, and I think this problem is general to cycles and is probably not related to hierarchy. It's related to incoming edges to the pixels. If you don't have incoming edges, you have no distraction problem anymore.
Cool. That's a very interesting technique, and when was that brought into play?
As far as I know, I think it came out with the paper I cited from 2018. At least in the causal inference literature, I'm not aware of any previous methods. I would say there are none, because that's the highly cited paper, so I would say they came up with that idea. But yeah, it's quite nice that you can do gradient descent and learn the structure. That's a very powerful technique.
Yeah, sometimes when you look at when different features of Bayesian inference and causal inference became available, it's really remarkable. Why hasn't this been done under a Bayesian causal modeling framework? It's only been 5 to 25 years of this happening, so that's very, very short, and it's also relatively technical, so there are relatively few research groups engaging in it, and it's just really cool what it's enabling.
Yes, exactly.
That's also, I think, the exciting part of this field a little bit. I mean, there are definitely breakthroughs out there that still have to be discovered. For example, as much of a breakthrough as that paper was, they simply found the right prior for acyclic structures. I don't know exactly, but it may be an idea that you have in one afternoon. I don't know the story of how the authors came up with it, but it could potentially be that they were there at the whiteboard and, oh, that actually works. That's a huge breakthrough, and they simply defined a prior.
And also, a lot of these breakthroughs don't just stack. It's not like a tower of blocks; they layer and they compose. So then something will be generalized to generalized coordinates, or generalized synchrony, or arbitrarily large graphs, or fusion with multimodal inputs, and those all blend in really satisfying and effective ways. So even little things that, again, someone can just come up with in a moment can really have impact. OK, ML Don says thanks a lot for asking my questions and thanks a million to Tomasso for the inspiring presentation. So nice.
Thank you very much.
And then Burt asks: how would language models using predictive coding differ from those using transformers?
OK. I think that if I had to build a language model using predictive coding today, I would still use transformers. So the idea is that, for example, if you have, let's say, the hierarchical graphical model or hierarchical Bayesian network I defined in the very first slides, one arrow encoded a function, which was the linear map. So one arrow was simply the multiplication of the vector encoded in the latent variables by a weight matrix, which you can then make non-linear and things like that. But that can be something much more complex: the function encoded in the arrow can be a convolution, or it can be an attention mechanism. So how I would do it, which is actually the way we did it in the Oxford group last year, is that we had exactly this structure, where every arrow is a transformer now, so one is the attention mechanism and the next one is the feedforward network, as in transformers. And basically the only difference is that for those variables you want to compute the posterior, and you make those posteriors independent via a mean-field approximation, so all the steps that allow you to converge to the variational free energy of predictive coding. But the way you compute predictions and the way you send signals back is done via a transformer. So I would still use transformers in general. I mean, they work so well that I don't think we can be arrogant and say, oh no, I'm going to do it better in a purely predictive coding way. Structure learning is a way to do it, but it will still approximate transformers anyway.
Sorry, you said structure learning would approximate the transformer approach?
Yes, the structure learning I mentioned earlier, when someone asked about the similarities between predictive coding and the attention mechanism.
Very interesting. One thing I am wondering, from MLBog: I could not see the concept of depth in the predictive coding networks you mentioned; most likely I missed it. The definition provided for predictive coding involved the
concept of depth. What did you mean by depth?
No, yes, it's true, because the standard definition, as I said multiple times, is hierarchical: you have predictions going in one direction and prediction errors going in the opposite direction. Basically, what we did in this paper, and also in the previous one, which is called Learning on Arbitrary Graph Topologies via Predictive Coding, is that we can consider depth as an independent unit, basically a pair of latent variables and an arrow, where you have predictions going in one direction and prediction errors going in the other. But then you can compose these in a lot of ways, so this composition doesn't have to be hierarchical in the end; it can have cycles. So then you can, for example, plug another latent variable into the first one and then connect the other two, and you can have a structure that is as entangled as you want. So for example, in the other paper we train a network that has the shape of a brain structure, so we have a lot of brain regions that are sparsely connected inside and sparsely connected among each other, and there's nothing hierarchical there in the end, but you can still train it by minimizing the variational free energy and by minimizing the total prediction error of the network.
Hmm, so for a given motif in an entangled graph, you might see three successive layers that, when you looked at them alone, you'd say, oh, that's a three-story building, or a model that has a depth of three, but then when you take a bigger picture, there isn't an explicit top or an explicit bottom to that network.
Yes, exactly. And this is basically given by the fact that every operation in predictive coding networks is strictly local. So every message passing, every prediction and every prediction error that you send, you only send it to the very nearby neurons, OK, and whether the global structure is actually hierarchical or not, the single message passing doesn't even see that.
I guess that's sort of the hope
for learning new model architectures: the space of what is designed top-down is very small, and a lot of models in use today, albeit super effective models (although you could ask whether they are effective per unit of compute or not, that's a second-level question), do not have some of these properties of predictive coding networks, like their capacity to use only local computations, which gives biological realism or just spatiotemporal realism, but may also provide a lot of advantages in distributed computing settings.
No, yes, exactly, I completely agree. I don't know if that's going to be an advantage, but I think it's very promising, exactly for the reasons you said. And the reason is that a model trained with backpropagation can basically be summarized as a function, because you have a map from input to output, and backpropagation spreads information back through its computational graph. So every neural network model used today is a function, while predictive coding, and not only predictive coding, but the whole class of methods that train using local computations and work by minimizing a global energy function, are not limited to modeling functions from input to output. They actually model something that kind of resembles physical systems. So you have a physical system, you fix some values to whatever input you have, you let the system converge, and then you read the values of some other neurons or variables that are supposed to be the output. But this physical system doesn't have to be a feedforward map; it doesn't have to be a function that has an input space and an output space and that's it. So in terms of the class of models that you can learn, you can see it as feedforward models and functions, and then a much bigger class, which is that of physical systems. Whether there's something interesting out
here, I don't know yet, because the functions are working extremely well. We are seeing these days that with backpropagation they work crazy well. So yeah, I don't know if there's anything interesting in the big part, but the big part is quite big: there are a lot of models that you cannot train with backpropagation and that you can train with predictive coding or equilibrium propagation or other methods.
That is super interesting. Certainly biological systems and physical systems solve all kinds of interesting problems, but there's still no free lunch: an ant species that does really well in this environment might not do very well in another environment. And so out there in the hinterlands there might be some really unique, special algorithms that are not well described by being a function, yet still provide a procedural way to implement heuristics, which might be extremely effective.
No, yes, exactly. And I think this has been most of the focus of my research during my PhD, for example, finding an application that is out here and not inside the functions.
Cool. Well, where does this work go from here? What directions are you excited about, and how do you see people in the active inference ecosystem getting involved in this type of work?
I think probably the most promising direction, which is something I would maybe like to explore a little bit, is, as I said, to go beyond static models. Everything I've shown so far is about static data, so the data don't change over time; there's no time inside the definition of predictive coding as I presented it here. However, you can, for example, generalize predictive coding to work with temporal data using generalized coordinates, as you mentioned earlier, or by presenting it as a Kalman filter generative model. And that's where, for example, the causal inference direction could be very useful, because at that point maybe you can model Granger causality and more complex and useful dynamical causal
models, basically. Because in general the do-calculus and the interventional and counterfactual branch of science is mostly developed on small models, so you don't do interventions on gigantic models in general. If you look at medical data, they use relatively small Bayesian networks. But of course, if you want to have a dynamical causal model that models a specific environment or a specific reality, you have a lot of neurons inside, you have a lot of latent variables, they change over time, and an intervention at some moment creates an effect at a different time step, so maybe in the next time step or ten time steps later. And I think it would be very interesting to develop a biologically plausible way of passing information that is also able to model Granger causality, basically.
Where do you see action in these models?
Where do I see action? I didn't think of that. I think I see actions in those models maybe in the same way as you see them in other models, because predictive coding is basically a model of perception, so an action can be seen as a consequence of what you are experiencing. By changing the way you're experiencing something, maybe you can simply perform a smarter action, now that you have more information. But yeah, I don't see any explicit consequence for actions, besides the fact that this can allow you to simply draw better conclusions to then perform actions in the future.
I'll add on to that a few ways that people have talked about predictive coding and action. First off, internal action or covert action is attention, so we can think about perception as an internal action; that's one approach. Another approach, pretty micro, is the outputs of a given node: we can understand that node as a particular thing with its own sensory, cognitive and action states, and so in that sense the output of a node is an action. And then lastly, which we explored a little bit in live stream
43, on the theoretical review on predictive coding: we were reading all the way through, and it was all about perception, all about perception, and then in something like section 5.3, if you have expectations about action, then action is just another variable in this architecture. And that's really aligned with active inference, where instead of having a reward or utility function that we maximize, we select action based upon it being the likeliest course of action, the path of least action, that's Bayesian mechanics. And so it's actually very natural to bring in an action variable and utilize it essentially as if it were a prediction about something else exteroceptively in the world, because we're also expecting action.
No, yes, exactly. I like that way of defining actions a lot, actually. And I still think there are, for example, not so many papers that apply this method. I think there are a couple from Alexander Ororbia, who does something similar, but in practice pure active inference, applying predictive coding and actions to solve practical problems, hasn't been explored a lot.
Well, thank you for this excellent presentation and discussion. Is there anything else that you want to say or point people towards?
No, just a big thank you for inviting me. It was really fun, and I hope to come back at some point for some future works.
Cool, anytime. Thank you, Tomasso. See you. Bye.