Hello and welcome. It's September 18th, 2023, and this is ActInf GuestStream 57.1 with Andy Keller. We're going to be talking about natural neural structure for artificial intelligence. There will be a presentation followed by a discussion, so if you're watching live, please feel free to write questions in the live chat. Otherwise, thank you, Andy — really looking forward to it. And over to you for the presentation. Yeah, thanks so much, thanks for having me. I'm super excited to be able to present this stuff with the active inference group. I'm a fan and very interested, so hopefully we get to have a good discussion and see what you guys think about it. So my name is Andy. I'm finishing up my PhD supervised by Max Welling at the University of Amsterdam, and starting a postdoc at Harvard after this. I'll start out by talking about the goal of my work in general, which is to try to bring modern artificial intelligence closer to more human-like generalization. What we mean by this is some sort of structured generalization, or, maybe more familiar to the active inference community, a structured world model, which we believe humans have. And the way we propose to do this is by integrating natural neural structure into artificial intelligence. So first, let's define what we mean by structured generalization. I think it's fairly uncontroversial to say that modern machine learning generalizes beyond its training set in the traditional sense. For example, even the earliest artificial neural networks, multi-layer perceptrons, could be trained on datasets of images like this and achieve high accuracy. Then, when they're presented with a held-out test set of images that they've never seen before, they can still classify them relatively easily, with the same level of accuracy. This is what we typically call generalization. However, even fairly early on it was noticed that these systems really struggle with small shifts or deformations applied to the images. For example, if we shift the images by just a few pixels, the model fails. So why is this surprising? I argue it's really due to our innate ability to perform this type of structured generalization that this example strikes us as a failure. The shift is nearly imperceptible to us, and we handle it automatically, whereas for the system it's very clearly a major problem. So in words, we can say that structured generalization is generalization to symmetry transformations of the input — in this case, the symmetry transformation is a small shift that leaves the digit class unchanged. The obvious question then is: what precisely do we mean by this natural structure, and why do we think it would help us in these settings? So first let's talk about what we mean by natural neural structure. One way to talk about structure, or any type of bias in a system, is inductive bias. An inductive bias can loosely be defined as an a priori restriction of the set of realizable hypotheses when you're doing model selection. More colloquially, we can describe it as a restriction, before seeing any data, of what and how you can learn. Very broadly, this can include anything from model class to optimization procedures or even hyperparameters. In some sense, inductive biases really define what is possible to learn, and they define generalization, in that you actually can't generalize beyond a training set without having some inductive bias, as is explained more thoroughly in the work of David Wolpert.
What we mean by natural inductive biases, then, is biases that stem from the restrictions and limitations that are faced by natural systems by the nature of having to live in the real world. For example, the brain has many efficiency constraints and physical constraints by nature of its construction. Following this logic, these constraints are really playing some role in our generalization abilities, which currently exceed modern artificial intelligence, as we'll go into next. In this talk, I'll be focusing specifically on two types of structure which my work has studied: topographic organization and spatiotemporal dynamics. Before I go into my work, I'll give a short example of why I believe natural structure may be useful for achieving the structured generalization I was talking about before. The first example comes from Fukushima's Neocognitron architecture from the 1980s, which was actually built to directly address the problem of robustness to these small shifts and deformations. In the paper, he writes about inspiration from Hubel and Wiesel's measurements of hierarchy and pooling in order to achieve robustness to these distortions. If you look at the figure, the layers are labeled US1 and UC1, where S and C stand for simple and complex cells. This was a fairly radical approach at the time, but it really served to improve robustness to the shifts that were plaguing these early artificial neural networks. Over time, these ideas were simplified and abstracted, and ultimately yielded the convolutional neural networks that we know today, which drove the success of the deep learning revolution. So this is really an example of a natural inductive bias which achieved structured generalization. For our research, it's of utmost interest to try to understand what makes these models work so well, and to see if this principle can potentially be generalized to cover more abstract transformations and symmetries. So what makes a convolution achieve this structured generalization? Intuitively, you can see it's done by applying the same filter, or feature extractor, at various spatial locations. Here we see a single convolutional filter being applied at all locations of an image. This means that no matter where your input is — whether it's in the middle of the image or on the right — you'll have the exact same features, with one difference: they'll be equivalently shifted. Mathematically, this type of mapping is called a homomorphism: it preserves the algebraic structure of the input space in the output space — in this case, with respect to translation. At a simple level, something that will be important to remember for the rest of this talk is that we can verify that our feature extractor is a homomorphism if we can see that it commutes with the transformation, as in this commutative diagram. We can also write this algebraically, by showing that the feature extractor f commutes with the transformation operator T. Basically, what we want is for there to be no difference between first extracting the features and then performing the transformation, or performing the transformation and then extracting the features. The challenge to date is that we don't really know how to construct homomorphisms with respect to the more complex transformations that we see in the real world. For example, our brain is able to handle changes in lighting and season naturally.
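To make that commutation concrete in code, here is a minimal sketch — my own example, not from the talk — where the periodic ("wrap") boundary condition is my simplification so that the symmetry holds exactly rather than approximately at the edges:

```python
# Checking the homomorphism property f(T(x)) == T'(f(x)) for
# convolution (f) and cyclic shift (T).
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
x = rng.random((28, 28))   # a toy "image"
k = rng.random((3, 3))     # a single convolutional filter

def f(img):
    # feature extractor: one convolution with periodic boundaries
    return convolve(img, k, mode="wrap")

def T(img, shift=(0, 3)):
    # symmetry transformation: a cyclic shift of 3 pixels to the right
    return np.roll(img, shift, axis=(0, 1))

# shift-then-extract equals extract-then-shift:
# the feature extractor is equivariant to translation
assert np.allclose(f(T(x)), T(f(x)))
```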
Here we see lighting changing on a person's face, or the change of seasons. We can tell it's the same face or the same road, but we don't know how to build models which respect these transformations, and so it's hard to build systems which handle them in a robust and predictable way. To give an even more abstract example of what I mean by this, and the potential negative repercussions of models which don't handle symmetry transformations, consider modern text-to-image generation programs. In this example, I asked DALL-E to generate an image of a teddy bear on the moon, and it does this incredibly well, right? Probably better than I could — the fur texture is incredibly detailed. However, if I ask it to do something which I see as conceptually simpler, such as draw a blue cube on top of a red cube, it fails. To me, this seems unintuitive, since the second task seems significantly easier. But what I'm arguing is that the reason this is surprising is precisely the same reason the MNIST translation example was surprising. There is a symmetry transformation happening here — namely, the transformation between these complex objects, a teddy bear on the moon, and these simple objects, cubes — which we intuitively expect the network to be able to handle and respect, and we see that it doesn't. So just like how Fukushima's work showed that the natural structures of hierarchy and pooling in our visual system are effective for generalizing to small transformations, I argue that potentially higher-level structure may be necessary to fix these abstract generalization problems. The question I'm studying, then, is: what might this structure be, and how do we implement it in artificial neural network architectures that can actually be used for performing computation? To begin to answer that, I'll jump into my first line of work, on topographic organization. Topographic organization is observed widely throughout the brain, from primary visual cortex to higher-level areas. It can very loosely be described as the property that neurons which are close to one another tend to respond to similar things. For example, on the left we show the color-coded preference of each neuron in the macaque primary visual cortex in response to oriented lines, and we see this smoothly varying set of selectivities. Another type of organization is known as retinotopic organization, where nearby neurons in the visual cortex tend to respond to nearby receptive fields. However, this organization isn't limited to these low-level features; it extends to more complex features, such as those present in faces or objects or places. This relates to the so-called functionally specific areas of the brain, such as the fusiform face area (FFA) and the parahippocampal place area (PPA). So in this work, the main idea, again, is that perhaps we can generalize the benefits of this topographic organization — which is intimately related to the convolution operation and Fukushima's architecture — to more abstract transformations. In other words, learn how to build more complex homomorphisms that we can't construct analytically right now. Just to show that we're not completely insane with this idea: there is some prior work in this domain from people such as Kohonen and Aapo Hyvärinen in the early '90s and 2000s, who studied how topographic organization may be useful for learning invariances, mostly in linear models.
So the question for us when we entered the space was: what is the most scalable, abstract mechanism that can be leveraged from these approaches, which we can integrate into modern deep neural network architectures? Ultimately we settled on a generative modeling approach, which I think might be interesting to the people in this community, and which allows us to relate it more closely to topographic independent component analysis. The basic idea is that we can learn a topographic feature space by imposing a topographic prior distribution over our latent variables. Just to give a brief background — I assume most people are already familiar with this — the general assumption is that the brain is a generative model, and this idea in some sense can be attributed to Helmholtz in the 19th century, who said that what we see is the solution to a computational problem: our brains compute the most likely causes from the photon absorptions within our eyes. As an example, if I show you this image, you immediately recognize it as a globe with some curvature, even though it could just as equally be a flat disk with a distorted perspective painted on it. This is how we get optical illusions, like this one: your brain infers that there is a cube here because of the structure, but really it's just a flat piece of paper. You can think of this generative-model view as something like an inverse graphics program. In the graphics program, the abstract properties of the sphere are known — the position, the size, the lighting — and these are used to project the sphere to create the 2D image that is rendered. In effect, what Helmholtz and others are saying is that, as a generative model, the brain is trying to invert this generative process — doing inference to infer the underlying causes of our sensations. The reason I'm belaboring this point is that there's a lot of talk about generative models today, and I'm not just talking about generating images or pretty pictures; I really mean a framework for unsupervised learning. So, to get a little more into the details: what do I mean by a topographic prior? Generative models are typically described as a joint distribution over observations X and latent variables, which we'll call Z. One way this is typically factorized is in terms of a prior p(z) and the conditional generative model p(x|z). One way we can think about this is that the prior encodes relative penalties for each type of code that is produced when we invert our generative model — this inversion is called computing the posterior p(z|x). To develop a topographic latent space, we want to introduce some sort of topographic prior, which the topographic ICA work showed is equivalent to something like a group-sparsity penalty. People might be familiar with typical sparsity penalties from independent component analysis: you want your activations to be sparse, meaning many of them are zero. That could look something like this — you have a few blue squares that are active, but most are not. Specifically with a group-sparsity penalty, though, we want the prior to assign lower probability to these distributed sparse activations, and higher probability to these grouped, densely packed representations. You can also think of this as a higher penalty when active units are spread out, and a lower penalty when they're closer together.
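As a toy illustration of that penalty — my own example, not from the talk — here is an L2,1-style group penalty that scores grouped activity lower than distributed activity, even when both have the same number of active units:

```python
# Group-sparsity penalty: sum over local groups of the L2 norm of each group.
import numpy as np

def group_sparsity_penalty(z, group_size=4):
    # z: activations arranged on a 1D topology for simplicity
    groups = z.reshape(-1, group_size)
    return np.sqrt((groups ** 2).sum(axis=1)).sum()

grouped = np.zeros(16);     grouped[0:4] = 1.0      # 4 active units, together
distributed = np.zeros(16); distributed[::4] = 1.0  # 4 active units, spread out

print(group_sparsity_penalty(grouped))      # 2.0 -> low penalty (preferred)
print(group_sparsity_penalty(distributed))  # 4.0 -> high penalty
```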
Again, this can be written abstractly like this, but I want to make clear that each one of these squares represents a neuron in our model, and they're organized in this 2D grid. So when we're talking about grouping, we really mean grouping in that 2D topology. One thing that's really interesting, and kind of important, is that these priors don't just give us topographic organization: it's also been noted, by people like Eero Simoncelli and Bruno Olshausen, that they actually fit the statistics of natural data better — specifically natural images. They've shown that using this type of prior, you actually get a sparser set of activations, meaning that the prior fits the true generative process a little better. And as we're aware, the brain has a high degree of sparsity, and this is believed to be very relevant for efficiency. To get a little more into the details: to implement this type of group-sparse prior, we use a hierarchical generative model, which was basically introduced by some of the topographic ICA work. The idea is that you have a higher-level latent variable U which simultaneously regulates the variance of multiple lower-level variables T — this is how we get group sparsity. Then, to get topographic organization, you have multiple of these latent variables U with slightly overlapping fields of influence — their neighborhoods, we can call them — and this gives you the smooth correlation structure you're after. To get the intuition for this: you see that this variable T down at the bottom here is not getting any input from this U at the top, but it is sharing a U variable with this T in the middle. It's like they're sharing variance — they're sharing some components with their neighbors, but not all components — and that's due to the local connectivity of these higher-level variables U. To keep it simple about how we use this generative model, let's go back to a single U variable. The challenge in this type of architecture, which made it difficult for many years, is: how do you infer the approximate posterior over these intermediate latent variables in this hierarchical architecture? This is not super straightforward. Prior works used heuristics developed for linear models, and in our work we found that these really didn't extend to modern neural network architectures. Our insight is to leverage a factorization — a specific reparameterization of this distribution. This reparameterization is achieved by defining the prior to be what's known as a Gaussian scale mixture, meaning that our conditional distribution of T given U is actually a normal distribution whose variance is defined by this variable U. For certain choices of U, this distribution is indeed sparse, and it encompasses a range of distributions such as Laplacians and Student's t-distributions. Importantly, a Gaussian scale mixture admits a particular reparameterization in terms of independent Gaussian random variables Z and U. Specifically, we see that this T variable, which was originally fairly complex, is actually just a simple combination of a bunch of Gaussian random variables, which we now know how to work with much more efficiently in generative models.
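Written out, the construction looks roughly like this — my notation, following the general recipe of the Topographic VAE paper, where NN(i) denotes the topographic neighborhood of neuron i:

```latex
% Gaussian scale mixture: T given U is Gaussian with a U-dependent variance,
% and T can be reparameterized via independent standard Gaussians Z and U.
p(t_i \mid u) = \mathcal{N}\!\big(t_i;\, 0,\, \sigma_i^2(u)\big),
\qquad
t_i = \frac{z_i}{\sqrt{\sum_{j \in \mathrm{NN}(i)} u_j^2}},
\qquad z_i,\, u_j \sim \mathcal{N}(0, 1).
% A Gaussian divided by the root of a sum of squared Gaussians is
% Student-t distributed; overlapping neighborhoods NN(i) make neighboring
% scales, and hence neighboring activations, correlated -- the topography.
```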
Specifically, what we're going to do is obtain approximate posteriors for U and Z separately, and then do a deterministic combination of them in order to compute our topographic variable T — and this is much easier to do. Without going into too many details, the method we decided to use is known as a variational autoencoder, which leverages techniques from variational inference to derive a lower bound on the likelihood, allowing us to parameterize these approximate posteriors with powerful nonlinear deep neural networks and optimize them with gradient descent. This is going to be familiar to the active inference community. Really, what we've done is, instead of having a single encoder and decoder as in typical VAEs, we now have two encoders — one for U and one for Z — and then we combine them in this deterministic manner to construct our topographic variable T. You can see that this is actually the construction of a Student's t-distribution from Gaussians. We do this before decoding, and then maximize the likelihood of the data altogether. This is the ELBO, the evidence lower bound — a bound on the likelihood of the data — and it's very similar to the variational free energy that is used in the active inference community. With these details out of the way, what's really interesting is what happens when we train this generative model, which has a relatively simple group-sparsity penalty in its latent space, and look at what it's learning in terms of its organization of features. First we start with the simplest possible dataset: a black background with white squares at random (x, y) locations. If we train our autoencoder with this group-sparsity penalty on it, and then look at the weight vectors of our decoder — which we're plotting in blue here, again organized in the 2D grid — we see that they indeed learn to be organized according to spatial location. This can be seen as similar to convolutional receptive fields, where the receptive field of each neuron is given by the inputs at its location. And this makes sense intuitively from the group-sparsity perspective, since for any given region — highlighted in yellow here — the filters in a given group are much more highly correlated, with their overlapping receptive fields, than those at other random locations. Essentially, we see that our model is learning to cluster activities together on a sort of simulated cortical sheet, according to the correlations in the dataset. So instead of convolution, where you're actually doing weight tying and manually specifying "I want to copy this weight everywhere," you can maybe think of this as approximate weight tying, where we're learning it from the correlation structure of the dataset itself. And just to give a little more biological inspiration for this: we know that retinotopy is present in the brain. This is an example of retinotopy in the macaque visual cortex — if you show the macaque an image like this, it gets projected into a topology-preserving space on the surface of the cortex. So the idea is that topographic organization — even learned topographic organization — is preserving the input correlations of our dataset, and potentially this may be beneficial for generalizing these ideas a little bit further.
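As a condensed sketch of that two-encoder construction — my simplification under assumptions, not the authors' exact code; the layer sizes and the neighborhood-summing matrix W are placeholders:

```python
# Topographic-VAE-style forward pass: two encoders give Gaussian posteriors
# for z and u, combined deterministically into the topographic variable t.
import torch
import torch.nn as nn

enc_z, enc_u = nn.Linear(784, 2 * 64), nn.Linear(784, 2 * 64)  # toy encoders
dec = nn.Linear(64, 784)                                       # toy decoder
W = torch.eye(64)  # neighborhood-summing matrix; identity here for brevity

def elbo(x):
    mu_z, logvar_z = enc_z(x).chunk(2, dim=-1)
    mu_u, logvar_u = enc_u(x).chunk(2, dim=-1)
    z = mu_z + torch.randn_like(mu_z) * (0.5 * logvar_z).exp()  # reparameterize
    u = mu_u + torch.randn_like(mu_u) * (0.5 * logvar_u).exp()
    t = z / torch.sqrt((u ** 2) @ W.T + 1e-6)  # Student's-t style combination
    x_hat = dec(t)
    # ELBO = reconstruction - KL(q(z|x)||p(z)) - KL(q(u|x)||p(u)),
    # i.e. the (negative) variational free energy
    kl = lambda mu, lv: 0.5 * (mu ** 2 + lv.exp() - 1 - lv).sum(-1)
    recon = -((x - x_hat) ** 2).sum(-1)  # Gaussian likelihood up to a constant
    return recon - kl(mu_z, logvar_z) - kl(mu_u, logvar_u)
```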
Like I said at the beginning, it would be even better if we could learn something more than just convolution — maybe more complicated equivariances. So how do we do that? One thing that's clear about natural intelligence is that we don't exist in a world of IID frames; we exist in a world of continuous sequences of transformations. So maybe we can extend our model to this setting to learn observed transformations — this is the idea of temporal coherence. What would happen if we simply extended our previous framework over the time dimension? Instead of saying we want our neurons to be group-sparse in terms of spatial extent on the cortex, we want them to be group-sparse over time, meaning that if one set of neurons is active now, we want that same set of neurons to be active into the future as well. If we think about this intuitively, we see that this is actually encouraging invariance rather than equivariance. The way to understand this is: we're saying we want the same neurons to be active constantly, but the input transformation is changing — the feet of this little fox are moving. So if the same neurons are coding for the same thing over and over again, but the feet are moving, those neurons are going to learn to be invariant to the motion of that leg. Our insight instead was that this group sparsity could be shifted with respect to time. This means that sequentially shifted sets of activations are encouraged to activate together, and then our latent space is really structured with respect to the observed transformations. You can see here that rather than the same set of neurons being active at all time steps, there's a sequentially permuted set of neurons that we're grouping together in this sparse way. This allows us to model different observations over time, while they're still connected in terms of learning a transformation and preserving the correlation structure of the input dataset. If we put this together into our Topographic VAE architecture, you get something that looks like this: we have an input sequence, we encode a Z variable and multiple U variables in the denominator, and each one of these U variables is shifted, like we were showing before, in order to achieve this shift-equivariant structure that we're looking for. When we combine these in the Student's t product, we get a single latent variable — this is now our topographic latent variable T. And now that we have this known structure in our latent space — you can think of it like a structured world model — we know how to transform this latent space: in this case, by permuting the activations around these circles, doing a cyclic roll, a cyclic shift. We know that this is going to correspond to our learned input transformations, and we can verify that by saying, okay, what if I continue this input transformation — the true transformation in the dataset, which is a rotation — and compare that with rolling my latent space, moving the activations around in my "brain," and then decoding? We see that we get the exact same thing. This demonstrates the commutativity property I was talking about before for verifying a homomorphism. To measure this a little more quantitatively, we can measure what's called an equivariance loss. This is really the quantification of the difference between our rolled capsule activations — the rolling in our head — versus watching the transformation unfold before us.
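As a sketch of how such an equivariance loss could be computed — assumptions mine; `encode` and `decode` stand in for the trained TVAE networks:

```python
# Roll the capsule activations in latent space and compare the decoded
# prediction with the truly transformed next frame.
import torch

def equivariance_error(encode, decode, x_t, x_next, shift=1):
    t = encode(x_t)                                  # capsule activations (B, D)
    t_rolled = torch.roll(t, shifts=shift, dims=-1)  # "rolling in our head"
    x_pred = decode(t_rolled)                        # imagined next frame
    return ((x_pred - x_next) ** 2).mean()           # vs. the observed frame
```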
We see that the Topographic VAE achieves significantly lower equivariance error. The "bubble" VAE is what I was talking about before, where it's learning invariance — it doesn't have the shift operation — and the traditional VAE has no notion of organization or temporal components, so it performs very poorly. In addition to this, we see that the model is a better generative model of sequences: it gets a lower negative log-likelihood on the dataset. It's better able to model this data because it has a notion of the structure of the transformations. We can test this on multiple different transformation types. On the top row we're showing the true transformation — we held out these grayed-out images — and on the bottom row we encode, and then we just roll our activations around and keep decoding, to see what the model has learned as the current transformation being observed. We see that it can basically perfectly reconstruct these elements of the sequence that it's never seen before, even with images from the test set that it's never seen before, simply because it knows what transformation it's currently encoding and can generalize that to new examples. The takeaway from this part is really that topographic organization, which we showed preserves the input structure, can potentially also improve efficiency and generalization, as we would hope a structured world model would. Finally, something that surprised us, and that I thought was potentially the most interesting, is that the transformations learned by our model actually generalize to combinations of transformations that were not seen during training. For example, despite only training on color and rotation transformations in isolation, when the model is presented with a combined color-rotation transformation at test time, we see that it's able to model and complete these transformations perfectly through the capsule roll. This implies that it has learned a factorized representation with respect to these different transformations, and it can flexibly combine them at inference time. So maybe we don't just get efficiency and generalization; we also get some basic compositionality. Now let's talk about the limitations, and what we could do next. The main limitation is that there's a predefined transformation that we're imposing in both space and time. Although we freed ourselves from hard-coded group transformations — specifically things like translation or rotation, as is currently done in the machine learning world — we still have this hard-coded latent roll in our heads for everything we see. To make this more flexible, with a greater diversity of transformations, we think maybe we can take inspiration from the more structured spatiotemporal dynamics that are observed in the brain. And that takes us to the second part of this talk: spatiotemporal dynamics that we're going to try to integrate into artificial neural networks. One example of these is traveling waves, like I show here. So what do we mean by that? Here's a very recent paper imaging a single slice of a rat brain under anesthesia at 36-millisecond resolution, and what we see is very clearly structured spatiotemporal activity and correlations. The authors of the paper go on to analyze this activity in terms of its principal modes, as depicted on the right. Our hypothesis is that perhaps some sort of correlation structure like this may be beneficial for structuring the representations in a much more flexible way than the simple cyclic shift we were doing before.
And let me say that this is not just observed in anesthetized rats — you can see these traveling waves in the MT cortex of awake, behaving primates. For example, on the left here, they show traveling waves that actually change how likely it is to see a low-contrast stimulus, based on the phase of the wave. Furthermore, they show, on the right, that a high-contrast stimulus can induce traveling wave activity that propagates outwards in primary visual cortex. So these are really ubiquitous throughout the brain, at multiple levels, and it would be interesting to study what their implications are for structured representation learning, in our case, or generally. There is prior work which has studied these types of dynamics and built models. On the top, these are the equations describing a spiking neural network, where they show that if you implement transmission time delays between neurons, you do get these structured dynamics of traveling waves, as long as your network size is large enough. However, as many people probably know, it's relatively challenging to train spiking neural networks compared to deep neural networks. Similarly, on the bottom, another system which is significantly simpler — but perhaps too simple — is a network of coupled oscillators. These are known to exhibit synchrony, spatiotemporal dynamics, and complex patterns, but this is what's called a phase-reduced system, and it doesn't quite capture the full complexity we're interested in. What we settled on in this work is to parameterize a network of coupled oscillators slightly more flexibly than a Kuramoto model. This is really built on the coupled oscillatory recurrent neural network (coRNN) of Konstantin Rusch and Siddhartha Mishra, where they basically took the equation which describes the simple harmonic oscillator — a second-order differential equation: the acceleration of a ball is proportional to its displacement. You can add additional terms, such as damping, so that the oscillations slowly die out over time; you can drive the oscillator with an external input, to counteract this damping or to give slightly more complexity to the dynamics; and furthermore, if you have many of these oscillators, you can couple them together with coupling matrices W. You can really think of this network as a bunch of balls on springs, which may also be connected to each other by springs or elastic bands. This is the coupled oscillatory recurrent neural network of Rusch and Mishra, with these various terms, and it has been shown to be very powerful for modeling long sequences. They also mention they were inspired by the brain in building it, and there's a lot of good analysis in that paper of what actually happens in these recurrent neural networks. But if we want to look at spatiotemporal dynamics in this type of model, it's slightly challenging, because these coupling matrices — the W's that connect each oscillator's position to the others — are densely connected, as I've tried to depict on the left. So if you try to visualize the dynamics of this network, the latent space of this model — you can think of it like our previous example: a neuron is connected to a potentially arbitrary set of other neurons, and those neurons are connected to another arbitrary set of neurons — you'll certainly get oscillatory dynamics, but fluctuations that don't make a lot of structured sense.
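For reference, the coRNN dynamics just described are roughly of this form — quoted from memory, so check Rusch & Mishra's paper for the exact equation:

```latex
% Driven, damped, coupled oscillator network (coRNN, approximately):
% y are the oscillator positions, u is the external input, W, W', V are
% coupling and input matrices, and gamma and epsilon control the
% restoring force and the damping.
y'' = \tanh\big(W y + W' y' + V u + b\big) \;-\; \gamma\, y \;-\; \epsilon\, y'
```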
So in our work we thought, okay, how can we make this more structured? The way to do that is to have a more structured connectivity matrix W, which we found is easily and efficiently implemented through a convolution operation — you can think of it like a locally connected layer. Instead of having every neuron connected to every neuron, neurons are just connected to their nearby neighbors. Then the second-order differential equation we were describing before gets discretized into two first-order equations — you can think of this as numerically integrating the ODE: we now have a velocity, and then we update the positions with this velocity. We can train this model as something like an autoencoder or an autoregressive model: we take an input and encode it into our latent space — the input is, in effect, driving these oscillators from the bottom — and then they have their own dynamics, defined by these local coupling terms. At each time step, we take this latent state, this wave state, and decode it to try to reconstruct the input, be it at the current time step or a future time step. We can do some analysis of these models to see what happens in the latent space before and after training. At the beginning of training there are no waves in our model, but after training — after 50 epochs — we see smooth, structured activity propagating downwards, in service of the sequence modeling task that we're doing, like rotating objects. So what's the benefit of this? The whole reason I motivated this was to say we wanted more flexibly learned structure — are we actually getting that? What we show in the paper is that we really are learning some sort of useful structure, and the way we showed it is again with something like this commutative diagram: if you take an input and encode it, you get a wave state; if you then propagate waves artificially in that wave state and decode, you can observe that it's exactly the same as if you had performed that transformation in the input space originally. In our case, the operator in the latent space is now the traveling wave of activity, and so in a sense it is structuring our latent space with respect to observed transformations — that structure just comes in the form of waves. So natural spatiotemporal structure yields preserved input structure, again, as we were looking for. One of the benefits, as opposed to before, is that this is more flexible, and we can see this just by showing images of different transformations — a lot of different digits, different features — where we get different types of wave activity in each case, in order to model each different transformation. If we train it on different datasets, we similarly see more complex dynamics — in some cases maybe not even traveling waves, but standing waves, which can be thought of as two traveling waves moving in opposite directions. If we're modeling these orbital dynamics, or a pendulum, we similarly get complex oscillatory activity. So it's preserved input structure, but with additionally more flexibility than we had before, which is our ultimate goal.
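Here is a minimal sketch of that locally coupled update — my own reconstruction under assumptions (kernel size, integration scheme, and constants are illustrative), not the paper's code:

```python
# The dense coupling matrix is replaced by a convolution, and the
# second-order ODE is discretized into two first-order updates.
import torch
import torch.nn.functional as F

def wave_step(pos, vel, drive, kernel, dt=0.1, gamma=1.0, eps=0.01):
    # pos, vel: (B, 1, H, W) oscillator positions/velocities on a 2D sheet
    # drive: encoded input at this time step; kernel: (1, 1, k, k) local filter
    coupling = F.conv2d(pos, kernel, padding=kernel.shape[-1] // 2)
    acc = torch.tanh(coupling + drive) - gamma * pos - eps * vel
    vel = vel + dt * acc   # first first-order update: velocity
    pos = pos + dt * vel   # second: position (semi-implicit Euler)
    return pos, vel
```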
So finally, I want to talk a bit about how I think the outcome of this research may not only improve artificial intelligence, but also help us understand why our measurements in the brain look the way they do. As an example of what I mean by this: I talked before about these localized areas that respond to faces and places. In this fantastic work with Qinghe Gao, we studied whether the simple topographic priors we discussed may be able to reproduce these same effects. Specifically, we plot the value of a selectivity metric for each of our neurons with respect to a dataset of images containing, say, faces, objects, or bodies: for every neuron, we measure whether it is more likely to respond to faces or to the rest of the images. We see that we get spatially localized clusters that share many of the properties we actually see in the visual system of humans, primates, and many other animals. One of these shared properties, beyond just the fact that we have spatially localized clusters, is the relative placement of the face and body clusters: the face cluster sits right next to the body cluster, which makes a lot of sense and is also seen in humans — faces and bodies are most often seen together. And this isn't just a single fluke or a cherry-picked example: if we rerun this model many times, you virtually always get face and body clusters which are overlapping or right next to one another. To be clear, I'm not suggesting that this is exactly how it works or emerges in the brain, but I do think it tells us that the relative organization of selectivity may at least be partially attributable to correlation statistics in the data, after being passed through a highly nonlinear feature extractor such as a deep neural network. In a similar vein, something that's interesting: there's a known, so-called tripartite organization of the visual stream, where selectivity with respect to objects is organized by more abstract properties, such as animacy — is this thing alive or inanimate? — versus real-world object size — what is the size of a teapot versus a car? What we see in humans is that this selectivity is organized in a tripartite structure: you typically have small objects in between the animate and inanimate objects in terms of their selectivity. With the same set of neurons in our model, but with respect to these different sets of stimuli, we see that the small-object cluster lies in between the animate and inanimate clusters — and again, this happens across multiple different initializations. This is something I hope we can explore a bit further. For this community, I think it's interesting because it's really a way of showing that we built a structured world model, and potentially this world model is beneficial for better representing real-world data — and lowering free energy, in that sense. So I think by developing models like the ones we showed here, we may get insights into new mechanisms for how this structure emerges, including topographic organization, that we never thought of before. As an example: in developing this Neural Wave Machine model, I was looking at the orientation selectivity of neurons. I wasn't particularly expecting anything to happen, but I was watching these waves propagate over the simulated cortical surface, and I thought, okay, I'm showing rotated images — maybe this has some effect on the orientation selectivity. And if you go in and measure the selectivity of each neuron with respect to differently oriented lines, what you see is surprisingly reminiscent of the orientation columns and pinwheels that are seen in primary visual cortex.
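One common way to score this kind of per-neuron selectivity — my assumption of the metric, which may differ from the papers' exact choice — is a d'-style contrast of mean responses, which you can then plot over the 2D neuron grid to reveal the localized clusters:

```python
# d'-style selectivity: mean response difference normalized by pooled variance.
import numpy as np

def selectivity(resp_target, resp_other):
    # resp_target, resp_other: (n_images, n_neurons) activations per category,
    # e.g. responses to faces vs. responses to all other images
    mu_t, mu_o = resp_target.mean(0), resp_other.mean(0)
    var = 0.5 * (resp_target.var(0) + resp_other.var(0))
    return (mu_t - mu_o) / np.sqrt(var + 1e-8)  # > 0 means target-preferring
```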
These structures just kind of came out of the model, from the fact that it has spatiotemporal structure with respect to transformations. Of course, this is a really coarse analogy, but I think it's an example of how building these types of models can help us think about how the brain builds representational structure, and the way it's organized, in ways that maybe we haven't thought about before. And I'm not the only one doing this type of work. I've been talking about this equivariant structure; people such as James Whittington, Tim Behrens, and Surya Ganguli have shown recently that by introducing algebraic constraints into a learning process — in this case, the motion of an agent in an environment, saying you need to preserve the algebraic structure that if I move in a circle, north, east, south, west, I end up back at the same point — you get the emergence of grid-cell-like representations. So I'd be interested to see how this idea of representational structure can help us explain more of our neuroscientific findings as well, and how it relates to generative models as a whole. Finally, I think there's something to be said about the cognitive plausibility of these models — from a neuroscience perspective, but also from a cognitive science perspective. For example, there are these Raven's progressive matrices on the left, where you have to say which one of these images is most likely to fit the pattern, or, for example, how likely is it that this Jenga tower falls over when you pull out a specific block, or with a given structure. These types of tests are really testing whether the world models that we're building are similar to the types of models that we innately have — our own common sense as humans, or as beings living in a natural world. I've done some preliminary work in this direction — very preliminary, and not nearly this complicated — where I'm trying to model visual illusions. If you take a really simple dataset — a moving bar stimulus, or a static bar that you flash — you can see that the model will actually infer the missing frame, and then also infer continued motion: it overshoots the trajectory of what the actual stimulus is providing before correcting itself again. So I think modeling illusions is certainly an interesting way to study whether our models' world models are similar to the ones we have ourselves. In conclusion: with topographic priors, we could show that we effectively get world models structured with respect to transformations; this learned structure is flexible and adaptable to arbitrary transformations, unlike traditional equivariance; and topographic priors can be induced statistically, as we did in the Topographic VAE, or through dynamics, as we showed in these Neural Wave Machine type models. To conclude, I'll leave you with this quote that I found in Fukushima's paper from 1980: "If we could make a neural network model which has the same capability for pattern recognition as a human being, it would give us a powerful clue to the understanding of the neural mechanism in the brain." That's, I think, some of what we're going for here. So I'll say thanks to my advisor Max, my co-authors Patrick, Yue, Emil, Qinghe, and Jorn, and I'm interested in discussion. Thanks. Alright, thank you — a great, very interesting presentation. A lot of places to start. Maybe just: what brought you to this work? A little context on how you came into this work, for your PhD direction?
Yeah. The group that I'm in at the university has been studying structured representations from a mathematical point of view for a while. I guess something that had always been somewhat challenging to me is that these structured representations were fairly rigidly defined in terms of group structure — mathematical group structure. For example, I think we can build a model that respects 2D rotations perfectly well, but if we want to do 3D rotations, we can't, because that's not a group in terms of a projection onto a 2D plane — you're losing information when the thing rotates around. Or just any sort of natural transformation, like I was trying to point out at the beginning. I was trying to think about how the brain models natural transformations that these frameworks couldn't really explain. My advisor Max had worked on this topographic stuff a long time ago, during his postdoc, so he had this intuition that maybe topographic organization has a relationship here. Then COVID happened, and I got really deep into the neuroscience literature, and got into this stuff — and yeah, I haven't left since. Cool. Is there a role for variational autoencoder models that include not just external patterns but also the consequences of action — world models structured with action? Right, yeah, it's a good question, and I think active inference is effectively the answer to that — I think it's a good answer to that. I know there are reinforcement learning frameworks that use externally trained world models — you train a VAE or something and then use that representation in your reinforcement learning system — but I think having a fully integrated system, with a single objective and with action as part of the likelihood of the data, is much more elegant, and I'm a big proponent of that. I have not gone so far as to study these structured world models in an active setting — I haven't worked on that at all — but I think it would certainly be very interesting to see whether having a more structured world model in a variational autoencoder would be beneficial in an active setting as well. I think that would be awesome. Some of the examples I was using before, like the emergence of grid cells, maybe point in that direction: the brain really obviously has a lot of structure, and this clearly has to be useful for performing actions in some way. Yeah. A really nice parallel that you brought in with the talk was that the locally connected units enabled your models to structurally embody constraint and pattern, and that led to these arising patterns. Analogously, there was the Dorrell et al. work, where they had the path exploration constraint. So it's interesting to think about these action or policy heuristics or sparsities. Like a joint: through babbling and motor exploration, it eventually becomes understood that there are two mutually opposing ways to move a joint, and then the compositionality across joints can be learned at higher levels once it's locked in at lower levels. It's a very appealing and niche-relevant way to generalize, because it's both based upon the actual constraints of the world and, especially through action, potentially embeds something that's quite simple. Right, yeah, I think that's definitely true — that's a really good point. If you do have constraints coming from your actions themselves, then that would be hugely beneficial for helping to structure your latent space.
And one thing I wanted to mention — this made me think of Stefano Fusi's work on representational geometry, and how that determines how generalizable a given understanding of a system is. If you can understand your representations in terms of geometry — these sets of activities are separable, or highly parallel, essentially separable with a linear classifier — then you're going to be able to generalize. And I think if there are constraints imposed by action, something like this, you are inducing a better representational geometry, and this has all sorts of benefits, like compositionality or generalization. So it's a great point. Cool, yeah, very interesting area. Alright, I'll read some questions from the live chat. Love Evolve wrote: practical or observed limitations on modeling illusions? Yeah, it's really hard. One of the challenges is that most models that we use — that I use, that deep learning uses — are not foveated: you don't have a center of gaze. And you also don't really have time — I mean, I'm using these kinds of recurrent neural networks, but time is not as clearly defined in these models as it is in the continuous-time setting of a human undergoing an illusion trial. And I think it's the combination of these two: as a human, for most things, you're shifting the location of your gaze, and a lot of these illusions depend on you looking at a particular area at particular times. So I think it would be really helpful if we had models that learn — you can think of this as a type of action — where to move their gaze, one of the simplest possible actions. That would help a lot for being able to model illusions. For me it's like: I read a paper about some cognitive science experiment, or about some illusion, and I think, okay, can I put this dataset into my model and test it? And I can't, because I don't have a model that looks around, or has a restricted field of view, something like that. So that's one of the limitations. Another one is training: you have to think about what you train your model on before you test it on the illusion, because that has a huge impact. Do I train this on MNIST digits, or do I train it on ImageNet, or do I train it on natural video? The ideal thing would be to train it on a huge dataset of natural video, so you could say it's now learning something like what a human sees, and then test it on these illusion datasets. But obviously then you're going to need a huge model, and it makes the experiment much more complicated. So that's one of the practical limitations. Well, great answer — makes me think of a paper with letters rotating on a table, the digit rotation task. And great points about foveation and the dynamics of the illusion. I think you actually did mention an illusion, in the generalization context: rotation on a two-dimensional screen doesn't generalize to three dimensions, and that dimensional collapse, or reduction, is the basis of the cube projection illusions, and the cube and figure rotation illusions. On your screen there's a silhouette, or some ambiguous stimulus, that sits near a criticality or bifurcation in the generative model, so it could represent it one way or another.
And so a lot of the switching illusions are just based upon the flatness of images, and the limitations in generalization that are revealed by that. Right, yeah. I think there's even some work — sorry, there's some work where they argue people have a kind of three-dimensional image in their heads; even Nancy Kanwisher's lab had a paper on this recently — showing, yeah, I don't know, do our models have that? That's not a super complete answer, but it's pretty interesting. Alright, from Upcycle Club in the chat, they wrote: kudos — is it true that by inducing sparsity beyond a certain threshold, runaway behavior in artificial neural networks is triggered, depending on the task, architecture, and sparsification method? I'm not sure what they mean by runaway behavior, but yeah, I think too much sparsity is a problem, and there could be a point where the model is no longer able to learn nearly as effectively. If you imagine you only want a single neuron to be active for every example, your model is going to be trying to memorize the dataset to some extent, and you're not going to have enough capacity. So tuning that level of sparsity is certainly an important factor. In a generative modeling framework, this is typically balanced automatically with the likelihood itself; if you're not doing generative modeling and you just have a sparsity penalty, you're going to want to tune that parameter. Okay, they added: just to clarify, runaway behavior in artificial neural networks can refer to phenomena where the network becomes unstable or chaotic due to various factors, such as feedback loops, noise, or adversarial inputs. Yeah, I haven't looked at this in a recurrent setting, where you would get feedback loops, but I could see adversarial examples being potentially affected by your level of sparsity. The interesting question is whether you would be more susceptible or less susceptible to adversarial examples — I don't know. Well, with sparsification, projecting from a fully connected model into a progressively smaller one, it's pretty well understood in general what the trade-offs are: it's easier computation, it's a smaller model, and the sparser the Bayesian graph, the clearer it is to represent; and then it will also have all the other trade-offs, with the false positives and negatives of generalizing. But that's why it's an iterative fitting process. So on balancing the sparsification approach: does it use AIC or BIC or some other model-fitting approach to determine the relevant sparsification for a given input? How do you determine — like in lasso regression — how do you threshold how sparse you want it to be? Right, yeah. There's a lot of good literature — some people like Demba Ba at Harvard, and some people I'm working with now, have done these unrolled iterative sparsification networks, where it's like a recurrent neural network that iteratively sparsifies, and you can show that this yields sparse or group-sparse activations like we're using here. In our setting, though, it's really just from the construction of this T variable, where we have Z on top, and then it's in effect gated by the sum of U variables in the bottom. W — maybe I wasn't super clear about this — is the matrix that defines the groups when I'm defining the group sparsity: it's what connects all of these U's together.
So the idea is that if all of the U's for a given T are active, that T variable is going to be very small, because the denominator is going to be very big, and that induces sparsity. It's like constraint satisfaction: if you have a set of U's that are all small, then that constraint is satisfied, and now Z is allowed to express itself, and that's what achieves this sparse activation. This is induced by these two KL divergence terms here, which are saying, essentially, how far is each U and each Z from a Gaussian; and then, through this construction of the Student's t variable, we're effectively constructing a sparse prior distribution just from these Gaussians. In terms of the actual objective, the terms we're optimizing are just these two KL terms, which push things toward sparsity to some extent, and this is balanced automatically against the likelihood term through the decoder. So we don't have extra terms whose weights we're tuning — we're just training the parameters of these different encoders and letting the KL divergences balance against the likelihood. Cool. Alright, another question, from Dave Douglas, who wrote: speaking of gaze and illusion, can the studies on constancies in infants be separated into lower-level, illusion-relevant neural activities versus perhaps higher-level conceptual constancy? Can you read it again? Sorry. Yeah: speaking of gaze and illusion — the two features that you highlighted as absent from the current architecture — might the studies on constancies in infants, cognitive constancies, be separated into lower-level, illusion-relevant neural activities versus perhaps higher-level conceptual constancy? Interesting. Yeah, probably. I'm not an expert on, or actually even very familiar with, object permanence studies in infants and the constancy stuff, but I think that would be incredibly interesting to study in neural network architectures, and that was some of the idea with this illusion I was trying to model. I don't know if I was super clear about this, but the top row is the input, and we're effectively blocking the input for a single frame, and I wanted to see: does the network encode that the thing is still there when that frame is gone? Can I still decode the presence of the object from the neural activity? And then, what is it inferring about the motion, given that it saw the bar at a slightly different location than before, while the bar is gone? So I think there are definitely multiple levels to it, where some would probably be much lower-level, and long-term object permanence, I would guess, would be significantly higher-level. It makes me think of those experiments with cats back in the day, where they raised them in darkness except for an hour a day, in a "vertical world" or "horizontal world" where they only saw horizontal lines or vertical lines, and you can see the organization of their cortex changes — they have less receptivity to horizontal lines if they've never seen horizontal lines before. Then you take a stick and wave it in front of their face: if the stick is horizontal, they do nothing; if it's vertical, they're swatting at it, trying to hit it. I think in that case this is evidence of a low-level deficiency in vision contributing to some sort of an illusion, so there could certainly be some aspect of that in infants as well.
One very curious point you brought up was the animate and inanimate manifold, with small things being intermediate. What does that represent? Is it because they're handleable, or might be an insect, or might be something that could move away just with the wind — what does that say? Right, yeah. So this is work by Talia Konkle — I think she was the one who discovered this organization — and they tried to figure that out. I might be getting this wrong, so I recommend people read her work on it; they call it tripartite organization. If I remember correctly, they did a lot of follow-up work on why there's this organization, and some evidence points to mid-level statistics, like the curvature of these objects, and the distance that you typically see objects from — animate objects are maybe more curvy. Regardless of what the actual answer is, there were a lot of different hypotheses stemming from properties of these objects — maybe mid-level or low-level properties, more so than high-level properties. I still don't know if it's exactly been solved: whether it's interaction with the objects, like you said, that causes the separation, or the shapes of these objects. I would bet, as with most things, it's some combination of all of the above. But I think the interesting thing from the modeling point of view is that this is only trained on correlation statistics from the image dataset itself: it has no interaction, it has no notion of animacy — this is really just training a model on ImageNet, on images of dogs and cats — and yet it still achieves this type of organization. So it could be semantic characteristics — we have a network that can classify boats versus dogs versus twenty other breeds of dogs — but it might also have some correspondence with low-level image statistics as well. So, I don't know, but I think it's nice. Also a very evocative analogy was the translational shift in the MNIST handwriting recognition setting. What are the translational shifts that exist today — what's the three-pixel example? Is it some prompt-engineered attack on an LLM, or a special character being inserted, or some overlay on an image that we can't even detect? What do you think those challenges are, and what are ways that we can pursue them? Yeah, absolutely. The way I was thinking about it is in terms of these symmetry transformations: if you're thinking about language models, you can imagine a symmetry transformation that's just replacing a word with a synonym, or something like that. The sentence means the exact same thing to us, but now suddenly, to the model, it looks completely different. Or translation between languages — this can be seen as a type of transformation that preserves the underlying meaning of the input to us, but to the model it looks completely different. We would like to have models which behave in a predictable way with respect to these types of transformations, because humans behave very predictably with these types of transformations, and when we're dealing with AI systems we expect them to behave that way too. I think that's part of what causes a lot of the challenges in interacting with these systems. I tried to do a rough, cheeky demonstration of that with the teddy bear and the cubes: we expect the model to be able to do something simple like this, because we think most humans could, and yet it doesn't. And if you expect this and it fails, then that's a big problem — how do we handle that?
I think that's kind of what I'm searching for. The direction I'm taking is looking for simpler, bottom-up building blocks of neural network architectures, or algorithms, that yield these properties emergently and in a much more generalizable way, rather than building something on top of what we already have. I think that's something that would scale much better and also matches more closely with what the brain does.

Very cool. One kind of implementational question: what are the computational requirements of just running this, and what's the day-to-day like of being a student or researcher running variants of these? Do they use terabytes of data and large compute, or your own laptops?

I think almost everything I presented today can be run locally; this stuff is super simple. You can run it on your laptop if you want to train and experiment with different things, it's just going to be pretty slow, so I'd recommend a consumer GPU. I run pretty much everything on Nvidia 1080s: pretty old, pretty cheap, but they have around 12 gigabytes of RAM or so, which is more than enough for these models; most of them take only a couple of gigabytes. One thing that some people think is weird is that I do most of my experiments on things like MNIST, 32-by-32-pixel images, because I can train small and locally. If you want to do stuff like this Hamiltonian dynamics suite here, you're getting into bigger models that run across multiple GPUs, and here I was using a cluster to run these types of models. But most of these ideas you can start to play with on a single machine where one GPU is more than enough, or even in a Colab notebook, something like that. If you want to train something on ImageNet it gets more complicated, and you need at least one GPU, ideally more. I don't do a whole lot of big-scale stuff yet; it's certainly interesting, and there's definitely a lot more you can do there, but for some of these simpler, or more fundamental, questions, whatever you want to call them, a smaller machine is nice and fast.

Cool, useful. Alright, I'll read a comment from Dave: recalling Bert de Vries's comment during the Applied Active Inference Symposium about the desirability of spending less effort, or ATP, on foraging or control in situations where we don't need much precision. I don't know if you listened to that, but Professor de Vries has mentioned variable-precision models and how they could be used to enable different kinds of generalization and structural coarse-graining, as well as reduced computational requirements. Do you have any suggestions on how to introduce this distinction into active inference theory? What kinds of experiments could tease this out?

Oh wow, yeah, that's something I don't think I have too much intelligent to say about, to be completely honest. It's a super interesting question, because the intuition makes a lot of sense to me, if I understand correctly: variable rates of precision when you're encoding, or when your model is doing computation in general, which somehow have an impact on your future performance in relation to some energy store. I think if you wanted to build this into an active inference system, you would really need an embodied system, where the agent has some notion of an internal energy store, something it is trying to conserve while it's performing its actions, and where running out of energy means something bad for the agent. Then maybe you could observe an emergent reduction in encoding precision, or something like this, as the agent learns to act more efficiently, and an ability to control its own precision. Like I said, definitely outside my own expertise, but those are just some thoughts.
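Purely as a toy rendering of that speculation, and nothing more: a loop where an agent's encoding precision is tied to a depleting energy store, so coding gets cheaper and noisier as energy runs out. All names and dynamics here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

energy = 1.0                 # internal energy store in [0, 1] (hypothetical)
base_precision = 50.0        # precision the agent can afford when fully fueled
true_signal = np.sin(np.linspace(0, 4 * np.pi, 200))

estimates = []
for x in true_signal:
    precision = base_precision * energy          # lower energy => noisier coding
    obs = x + rng.normal(scale=1.0 / np.sqrt(precision))
    estimates.append(obs)
    energy = max(0.05, energy - 0.004)           # acting and encoding cost energy

# Early estimates track the signal closely; later ones are visibly noisier,
# the kind of emergent precision reduction gestured at above.
err_early = np.mean((np.array(estimates[:50]) - true_signal[:50]) ** 2)
err_late = np.mean((np.array(estimates[-50:]) - true_signal[-50:]) ** 2)
print(err_early, err_late)
```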
Okay, on this slide right here: first, very cool image, kind of like a digital Jackson Pollock. If it were a simpler input, or reduced data size, or just reduced complexity of patterns, or if it were increased complexity, how would this image look different?

Yeah, so I did some experiments trying to change these orientation columns, and basically, by changing the parameters of the model, you can get the columns to be bigger, or to have less similar structure to what we see in humans, or to have more bands. And like you said, it also depends on the dataset you're using: if I use really simple sinusoidal gratings as input, I get something like this; if I use rotating MNIST digits, I get something that's a little more rotational, curvy, higher entropy. I think these are all interesting things if you want to study the emergence of this type of organization in a natural system: if you have a model that yields different organization for different settings, that's a great way to ask which settings best match our observed data. I can send those around if you're interested. One other interesting point is that different animals have different types of orientation selectivity and different numbers of pinwheels; some animals don't have them at all. Mice, if I'm correct, have a kind of salt-and-pepper selectivity, so basically random, without any sort of topographic orientation organization. So there is evidence that different systems do this differently, and it's interesting to figure out why.

Yeah, this is very cool. It reminds me of visual space, where there are different skin patterns and fur and bands and speckles, and also these islands of activity enable the multiplexing which you described, with the encapsulation through space and time. So it's actually possible that a region might show no apparent activity at a given granularity, say if it were being looked at at an fMRI spatial and temporal scale: first off, the pockets of activity might not even be reflected by what's being measured, but also, if the pockets of activity are slower or faster than that measurement, they're going to be indistinguishable from noise; they will all have been averaged out. So there might be datasets that actually have a lot of richness, but for one reason or another it just got averaged out, because it wasn't being collected at the right scale.

Absolutely. I think that is one of the main reasons that traveling waves in the brain have been difficult to study up to this point. There is a classic review by Sejnowski and Muller in Nature Reviews Neuroscience where they basically go through all the evidence for traveling waves, and they show that if you're averaging over trials, you're going to completely miss the traveling wave activity; it's going to look like some mean value or something like this. You really need to go to the single-trial level, and you need high enough spatial resolution that you satisfy the relevant Nyquist frequencies.
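A quick numerical illustration of that averaging point, under toy assumptions (a sinusoidal traveling wave with a random onset phase on each trial, not real data): the wave is obvious in any single trial but nearly vanishes in the trial average.

```python
import numpy as np

rng = np.random.default_rng(1)

n_trials, n_channels, n_time = 100, 32, 200
x = np.arange(n_channels)[:, None]          # electrode positions
t = np.arange(n_time)[None, :]              # time steps
k, w = 2 * np.pi / 16, 2 * np.pi / 40       # spatial and temporal frequency

# One traveling wave per trial, with a random phase offset each time.
trials = np.stack([
    np.sin(k * x - w * t + rng.uniform(0, 2 * np.pi))
    for _ in range(n_trials)
])

single_trial_amp = np.abs(trials[0]).mean()
trial_avg_amp = np.abs(trials.mean(axis=0)).mean()
print(single_trial_amp)   # ~0.64: clear wave in any one trial
print(trial_avg_amp)      # ~0.06: the wave almost disappears in the average
```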
And this is just something that people didn't do for a long time. Especially if you're doing single-electrode recordings, you're not going to see a traveling wave; you're going to see oscillations. So you need something like multi-electrode arrays, and basically they're saying: okay, now that we have the technology to do this, we're seeing this structure everywhere, and a lot of the noise that we were seeing before may really just be traveling waves. So yeah, I think there's a lot to be done in the future with increased recording abilities.

That's very cool. Well, any final thoughts or questions, or where are you going to take this work?

Yeah, no, thanks for having me; hopefully there will be more interaction with the active inference community, I would love that, it's going to be super fun. I'm not really sure exactly; I'm looking at maybe music right now, looking at other kind of crazy directions, and I don't want to sound too crazy, but I'll go down a lot of paths. One thing that's coming up, something we submitted to NeurIPS, is about waves; that paper just came out on arXiv today. Waves are really good at encoding long-term memories, which I think is super interesting, so I might go a little more in that direction.

Sounds good. And yes, it would be very exciting to see action come into play. When there were the neurons that stayed active even as the dog's feet were moving: there's a lot of action, like throwing a baseball, where the ball goes off and something about that action continues to influence things. So having a deep temporal representation of alternative actions, where the variational autoencoder is already basically the right Bayesian statistical architecture to discuss the variational and the expected free energy, I think is a very promising area.

Yeah, I'm super excited about that. I really hope to get over there, and if people have ideas, I would love to hear from you, via email or anything like that. Really appreciate it.

Alright, thank you. Till next time.

Thanks so much.