Good morning, everyone. This is the last lecture in this series, and it's on a slightly different topic: I'm going to talk about optimality theories. I'll motivate what that is and give a few examples, and we'll connect to the maximum entropy work, as you'll see, in an interesting way. But before doing that, even at the risk of saying something very obvious that you're all familiar with, I want to spend just five minutes at the board to refresh the terminology of Bayesian inference, because that's what we're going to do today, and it's not on the slides, so I would really not like anyone to be confused at that point. Apologies to those of you for whom this is all very familiar. So let me say a little bit and give one example that you'll all recognize.

Very quickly: we have a setup where we have some data D which, as in the previous lectures, will consist of pairs: some covariates x, as in regression, and a measurement y. These are my samples, and there are T of them. That's the data set I'll work with. In Bayesian inference I need to specify some probabilistic model, say the probability of observing y given the data x and some parameters theta. These parameters can be anything, and in a few lines I'll write down a concrete example. But first let me pose the task: I'm given the data and a probabilistic model parameterized by theta, which can be a vector of parameters, and my task is to find those parameters that best fit the data, that are most consistent with the data within the context of this probabilistic model.

The usual set of steps for inference is: after we have the model, we first write down the likelihood, or the log likelihood; let's do the log likelihood. I assume all these data are drawn i.i.d. from some underlying distribution, and then the log likelihood, which I write explicitly as a function of the parameters, is the log of the probability of seeing all this data under the model. Because of the i.i.d. assumption I can rearrange this into a more convenient form, a sum of logs over the data points. If you stopped here and did what's known as maximum likelihood (ML) inference, you would say: I've written down the log likelihood, and I'm looking for an estimate, a set of parameters called theta star, that maximizes this likelihood. That would be your estimate of the parameters given the data. Up to here, I hope it's all clear.
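On the board, this reads (my transcription of what was written):

$$
\mathcal{L}(\theta) \;=\; \log \prod_{t=1}^{T} p(y_t \mid x_t, \theta) \;=\; \sum_{t=1}^{T} \log p(y_t \mid x_t, \theta),
\qquad
\theta^{\star}_{\mathrm{ML}} \;=\; \arg\max_{\theta}\; \mathcal{L}(\theta).
$$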
Now, if you want to do Bayesian inference instead: in Bayesian inference, instead of just maximizing the log likelihood directly, you form the posterior. You write down the following quantity: the probability of the parameters given the data is the probability of the data given the parameters, times a term I'll call p zero, divided by some normalizing constant. I'm being a little sloppy with notation throughout, just to go through this quickly, so apologies for that. The first term is the likelihood, as before; up there we had the log likelihood, here it's just the likelihood. The p zero term is what's called a prior. The prior is understood as a distribution over parameters in the absence of any data: what would I assume about the parameters before I've seen any data item? And the object on the left, which is the object of interest in Bayesian inference, is called the posterior: the distribution over parameters after I've observed the data.

You can view this in many ways; it's a mathematical identity. You can break it down in a sequential formulation, where you say: this is my prior before I've seen any data item; then I observe one data point, say the first one, and use this rule to update my distribution over parameters. Then I can feed that back in as the prior and observe the second data point, and so on. So you can see it as an iterative scheme if you want, or as a single update that takes you from the prior to the posterior after observing all the data. There are many ways to slice and dice this, but the important thing is: from the prior and the data you form this probability distribution, which is supposed to summarize all the knowledge about the parameters that you have. And this can be a complicated object: it can be a multivariate distribution if theta is a high-dimensional vector. Often we wish to go from a distribution to a point estimate. With maximum likelihood we had a point estimate, one set of values for theta; here we have a full distribution.

Of course, in the Bayesian paradigm it's also possible to go from the posterior to a point estimate, but it requires one more step of reducing the posterior to an estimate. The framework of thinking is that you have to tell me, as another assumption or specification, how you want to reduce the distribution to one value. For instance, you could say: I want to choose the point estimate theta star that minimizes the squared error against values sampled from this distribution; that's the minimum mean-squared-error estimate under the posterior. Then you would derive, and it's a two-liner, that what you should do is take the mean of that distribution. So one possible estimate is the mean of the posterior, which gives you, in a sense, the least-squared-error estimate. You could also take another one, which is nothing but the arg max of the posterior.
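In symbols (again my reconstruction of the board):

$$
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\; p_0(\theta)}{p(D)},
\qquad
\hat{\theta}_{\text{mean}} \;=\; \mathbb{E}\!\left[\theta \mid D\right],
\qquad
\hat{\theta}_{\text{MAP}} \;=\; \arg\max_{\theta}\; p(\theta \mid D).
$$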
Right, the arg max is the value of the parameters at which the posterior is maximized. Depending on exactly what kind of error function you define, you can reduce the posterior to a point in various ways; there are various choices, and I won't say more about that.

But one thing that's nice to look at, because it connects to what I'm going to say next, is to consider this in the context of a slightly more concrete problem. Let me work this out for an example. I'll make a model that looks complicated but is actually something you all know: this distribution is just a Gaussian distribution in y, whose mean is x dotted into the vector of parameters theta, and whose variance is some constant sigma squared. This is the variance, this is the mean, and this is obviously a Gaussian; you all know this problem, it's a slightly roundabout way to write down linear regression. The dot product is my predicted value; there is no offset, but you can think of everything as zero-centered, and sigma is the assumed noise in the data, constant for all data points. If you want, you can write the Gaussian out explicitly: the exponent contains y minus theta dotted into x, squared.

So for this concrete example, which is very simple and we're all familiar with, I can plug the likelihood into Bayes' rule and write down the log of the posterior over the linear regression parameters. This is the log likelihood, which I have actually already written down, plus the log of the prior, plus some normalization constant that does not depend on the parameters, so I'll just write "const". If I write out the log likelihood as a sum of the logs of these Gaussian terms, the normalization of the Gaussian is a constant I can forget, because it doesn't depend on the parameters, and I'm left with: minus one over two sigma squared, which I can pull out front, times the sum over all data points of (y_t minus theta dotted into x_t) squared, plus log p of theta. That's my log posterior.

Now, just to make a very simple connection: if I wanted to infer theta star as a maximum a posteriori estimate, I would be maximizing this quantity, which obviously means minimizing the sum-of-squares part, because of the minus sign. And everyone recognizes that as the chi-squared error. So up to the prior term, maximizing the posterior is chi-squared fitting.
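Written out explicitly (my transcription of the board):

$$
p(y \mid x, \theta) = \mathcal{N}\!\left(y;\; \theta \cdot x,\; \sigma^2\right),
\qquad
\log p(\theta \mid D) = -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(y_t - \theta \cdot x_t\right)^2 + \log p_0(\theta) + \text{const}.
$$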
Okay. The one difference from chi-squared fitting is this extra term, the log of the prior, which also depends on the parameters. In Bayesian inference, this is how the prior sneaks its way into your regression. Now you could ask: what do we do with this, what does it mean? One very standard choice you could make is to say: let me assume that, without seeing any data, my prior over the parameters theta is a zero-centered Gaussian, mean zero, with some variance I'll write as omega squared. What would you get if you inserted this zero-centered normal distribution here? You take the log of the Gaussian, and the log of a Gaussian is of course just what's sitting in its exponent. So this term becomes nothing else but minus one over two omega squared times the squared norm of the parameter vector: some constant times the squared norm of the parameter vector.

Now forget about all the Bayesian stuff and ask yourself: what does this problem remind you of? It's a chi-squared minimization plus some constant times the squared norm of the parameter vector. That's nothing else but ridge regularization, for those of you who do regression. This is the ridge regularizer: it's trying to pull all the parameters towards zero. So in this context of regression, you can interpret a particular choice of prior. It's a simple prior, known as very non-committal: it just says I want my parameters to be as small as possible in squared norm. In the context of your usual linear regression, this is just the squared-norm, L2 regularization if you want. If, for a prior, I were to choose instead of a Gaussian a Laplace-type distribution, where the parameters are also zero-centered but decay in this exponential sense, then you insert it here and you get L1 regularization for the regression.

The reason I wanted to get to this point is that it sets the frame for what I'm going to talk about next: usually, when we do inference, either Bayesian inference or non-Bayesian inference with regularization, priors are objects whose role it is to regularize our inferences. They can have this simple statistical structure where you just shrink the norm, or they might be, how to say, a bit fancier, but in practical applications this is oftentimes what they do: they regularize stuff. There is a lot of theoretical discussion in Bayesian inference about what you should choose for a prior if you don't know anything about the data at all, so-called uninformative priors; there is old work in statistics on this, and some new work as well. But usually the priors are either uninformative, or they do this very naive, in some sense, but successful regularization. What I'm going to talk about today is a very different role for priors, where priors will have a lot of structure. Here they only encode the fact that the parameters should be small, or that they should be sparse, but nothing more. Today we'll discuss a type of prior that induces a lot of structure on the problem.
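Before moving on, here's a minimal numerical sketch of the prior-regularizer correspondence just described. The data and all numbers below are invented for illustration; the point is that the MAP estimate under the zero-centered Gaussian prior has the closed ridge-regression form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = theta_true . x + Gaussian noise
T, d = 50, 5
sigma, omega = 0.5, 1.0          # noise std (likelihood) and prior std
theta_true = rng.normal(size=d)
X = rng.normal(size=(T, d))
y = X @ theta_true + sigma * rng.normal(size=T)

# MAP estimate with a zero-centered Gaussian prior = ridge regression:
# maximize -1/(2 sigma^2) ||y - X theta||^2 - 1/(2 omega^2) ||theta||^2
lam = sigma**2 / omega**2        # effective ridge strength
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta_ml = np.linalg.lstsq(X, y, rcond=None)[0]   # no prior: plain least squares
print(np.round(theta_map, 3), np.round(theta_ml, 3))
```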
There are not that many examples of that, I believe, but I think there is something interesting to be said about it, so that's what I want to motivate and show you today. Okay, let me share the screen.

All right. Now I'll motivate a problem that seemingly has nothing to do with this, so don't be confused; very soon we'll come back to it. Specifically, I'm going to motivate the problem by discussing optimality theories. Everything that I'm going to say is actually in this paper; there was a little hint of it earlier in the series, although we didn't talk about it the way I will now.

So I'll try to introduce optimization approaches to biological systems, contrast optimization-type models with statistical models and inference, and then use the connection between the two to motivate a Bayesian inference framework with what we call optimization priors. Then there will be three examples. The first is a very simple toy model neuron, the simplest LN-type neuron you can imagine, just to illustrate the statistical points. The second is an application to published data. The third is something I don't know whether we'll have enough time for, but it's something we have been doing for a very long time: optimizing a biochemical network to make predictions about how the same network actually works in nature, and contrasting optimal networks with what is observed in nature.

So, the idea is that optimality theories in many ways pervade biology, and the reason for that, at least conceptually, is that the whole evolutionary process can be seen as a stochastic optimization problem. This is schematized here: various organisms, encoded by their genotypes, are adapting to their environmental niches by increasing their fitness through the processes of mutation, fixation, and so on. Of course that's a very simplified picture; there are many nuances, the environment can fluctuate, and so on. But the notion that there is this ongoing adaptation, where organisms are climbing fitness peaks, can actually be realized experimentally if you hold conditions fixed, say with an evolving population of bacteria. There is a very nice theory of evolution that can be put into mathematics, but only when you know how genotype changes map to phenotype, the genotype-phenotype map, and then how phenotype maps to fitness. Usually those mappings, although conceptually clear, are very hard to write down mathematically: fitness is a very complicated function of phenotypes, and phenotypes are generated by genotypes in ways that are very hard to understand if you actually want a concrete prediction. So what people have done, invoking this as a rationale, is to not look at the full fitness of the organism, but at something that can, in certain circumstances, be understood as a fitness proxy.
Okay: something else that can be optimized, that you can mathematically write down. The rationale is that this something else is so important for survival that it will be forced by evolutionary adaptation to be optimized. Here are some examples.

Perhaps one of the best-known examples is this one: the set of metabolic reactions in the E. coli bacterium. As crazy as it looks (of course you don't see much here), we do know the list of those chemical reactions, to the point where writing down and discovering genuinely new reactions is actually quite surprising. So the list of these chemical reactions is reasonably complete, including their stoichiometry: how many of A and how many of B come together to generate a molecule of C. We have the reactions and we have the stoichiometry. What we don't have is the fluxes through each of these reactions: how quickly each reaction turns over. The flux through each reaction is not known, so there are many unknowns.

What can be formulated, and this has been done quite successfully and is still being pursued by a number of researchers, is the following. You make a mathematical theory assuming that these fluxes are set in such a way that they are consistent with the stoichiometry, and that what is being optimized is the so-called growth flux. What you also know is how many of each of these components you need to make another E. coli from one E. coli: this and this many of these amino acids, that and that many lipids, and so on. So you want to set the fluxes through these reactions in such a way that, in the shortest time possible, you make exactly the ingredients for the next bacterium. The rationale is: they're growing exponentially, they're competing, so somehow fitness is connected with how quickly you can make a new organism from the components. This is something we know how to mathematically write down and optimize, so let's do that. You solve the optimization problem, and that gives you predictions for the biochemical fluxes, which you can then go and check. There are caveats, it doesn't work perfectly, but there is by now quite a successful collection of works claiming that this principle is really predictive: you have predicted something quite high-dimensional, because there are lots of these fluxes, about how the metabolism works, from an optimization standpoint; an ab initio prediction.

Here is another example; it's a little more crazy, and for those of you who know it, it's super cool. This is a picture of Physarum, a type of slime mold growing on a dish. What these guys did, semi for fun, is the following: this is the outline of the coastline around Tokyo, and they deposited food with a density that mimics the population centers around Tokyo. This organism has a nice feature: as it grows, it creates tubing by which it transports the food. It creates these tubes that you see here, it grows them, prunes them, and so on, until it has created basically a real physical transport network for nutrients.
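(A quick aside back to the metabolic example: here's a toy sketch of that kind of flux optimization. This is not the real E. coli model; the little network, its stoichiometric matrix, and the bounds are all invented for illustration. We maximize a designated "growth" flux subject to steady-state stoichiometry S v = 0 and capacity bounds.)

```python
import numpy as np
from scipy.optimize import linprog

# Toy network: 4 metabolites (rows), 6 reactions (columns), invented stoichiometry.
# Steady state requires S @ v = 0 (production balances consumption).
S = np.array([
    [ 1, -1,  0,  0,  0,  0],
    [ 0,  1, -1, -1,  0,  0],
    [ 0,  0,  1,  0, -1,  0],
    [ 0,  0,  0,  1,  0, -1],
])
growth = 5                         # index of the "growth" flux we maximize
c = np.zeros(6); c[growth] = -1.0  # linprog minimizes, so negate the objective
bounds = [(0, 10)] * 6             # capacity limit on each flux

res = linprog(c, A_eq=S, b_eq=np.zeros(4), bounds=bounds)
print("optimal fluxes:", res.x)    # predicted flux pattern, to compare with data
```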
Coming back to the slime mold: the claim is that there is some sort of an optimization principle at work, because what it ends up building is something that looks almost like the urban transport map of Tokyo, which of course was engineered by people to connect dense population centers in as short a time as possible, and you can compare the two. People have studied this; I put it here because it's a cool picture, but the same kind of optimization arguments have been made, say, for matter transport in living organisms: how do you organize the transport of matter, whether it's blood, or food, or something like that. There are many examples from neuroscience that I'll mention, and there is the work we have been doing for quite a long while, which I'll come to at the end.

So, coming to the last example: what we have studied is information transmission in biochemical networks. The concentration of some signaling molecule fluctuates, and that signaling molecule influences gene expression, which is another noisy variable. Somehow, by changing the concentration of this molecule, you're affecting the level of gene expression, and you can view that as an information channel that is noisy because of all the intrinsic stochasticity. Maybe a predictive principle is: how do you wire up the chemical reactions so that you can transmit information through this noisy system as well as possible? In the end I'll hopefully show a few of these examples.

All right, let me outline the rationale for this type of optimization approach that deals with information; some, as you saw, deal with biochemical networks, some with matter. If you want to talk about information transmission, you go back to Shannon, who proposed a theory to quantify how information can be transmitted from a source to a receiver, and how to optimize that transmission given some limitations on the process of information encoding. His abstraction was: some signals x are generated by the source; they can come from some alphabet, a distribution over vocalizations, English words, whatever. These inputs are then encoded and sent through a channel, which can be described by a conditional probability distribution of the output y given the input x, because that captures the noise. When you put in an input x, it's not guaranteed that at the end of the line you actually get x back: it can be corrupted by noise, and you can get something slightly different, which is y. Noise is the constraint that's actually acting on this channel; if the channel is completely random, so that whatever x you put in, it spits out a totally random y that's not connected to x, then it's useless for transmitting. The receiver receives the y's and wants to recover the x's, and the question is how well you can do that, whether you can do it without error, and so on.
This is a very abstract formulation in Shannon's work, and the key quantity I'm putting down here is the mutual information, which you might have seen: a non-negative number, in bits, that you can compute if you have the source distribution and the channel distribution, and that is basically a measure of how much you can squeeze through this channel, in some sense without loss, so that you can recover it at the end, at least approximately.

What's interesting is that even though Shannon was of course thinking about concrete electronic systems designed to transmit information, and his theory stands behind today's mobile phones, he already realized (that's a citation from his work) that the formulation is so abstract that it's not limited to engineered systems built out of electrical components. There's nothing that says the signal has to be voltage or radio waves; it can be any type of signal, as long as you can frame the problem in this probabilistic sense.

And so very soon after his work, applications in biology started. This is just a timeline. This was his theory, and then very early on, as you see, people started wondering how much information a single neuron can encode about its inputs; this was before we even knew about the structure of DNA. Then, once the DNA sequence hypothesis was out, people started wondering how much information is encoded in the DNA, whether you can quantify that in bits, and so on; a very interesting problem, still open. And then there's this gentleman right there, Horace Barlow, who in 1961 proposed what's known as the efficient coding hypothesis. This is a very influential hypothesis in neuroscience, which says that the sensory systems of animals, meaning sensory neurons in eyes and ears, have been (and note the link with evolution) pushed by evolutionary pressures so as to encode as much information as possible about natural stimuli, while their metabolic cost is constrained. There is a limit to how quickly neurons can spike, for all sorts of mechanistic reasons, but also because spikes are metabolically costly. So the proposal was: perhaps our sensory systems are organized such that they maximize information in the strict Shannon sense, in bits per second, something you can compute, or measure, or estimate from data. They try to maximize the information rate about natural stimuli, not any stimuli, natural stimuli, because we have adapted to our niche, while conserving metabolic resources. That was quite a while back, 1961. Then came the first tests and measurements of these ideas, actually estimating information rates and testing these efficiency arguments, by Simon Laughlin and later by many others. And these applications were not limited to neuroscience; light signals and the neurons that encode them are just one particular view.
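To make that key quantity concrete, here's a minimal sketch (the source and channel distributions below are invented toys) of computing the mutual information for a discrete channel:

```python
import numpy as np

# Toy channel: 3 input symbols, 2 output symbols (e.g. spike / no spike).
p_x = np.array([0.5, 0.3, 0.2])           # source distribution p(x)
p_y_given_x = np.array([[0.9, 0.1],       # each row is p(y | x) for one x
                        [0.5, 0.5],
                        [0.1, 0.9]])

p_xy = p_x[:, None] * p_y_given_x         # joint p(x, y)
p_y = p_xy.sum(axis=0)                    # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ]  (all entries > 0 here)
mi = np.sum(p_xy * np.log2(p_xy / (p_x[:, None] * p_y[None, :])))
print(f"I(X;Y) = {mi:.3f} bits")
```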
There the downstream brain is trying to reconstruct something about the world. But you can also think, in biochemistry, of the inputs being concentrations of various ligands or transcription factors, processed by gene regulatory elements like promoters, or by signaling networks, with the cells then making a cell-fate decision or something like that. Or you can think of bacterial chemotaxis, where the inputs are concentrations of chemoattractants, and cells need to sense them and transduce them to drive their motors, to climb up a gradient or away from something they don't like. I'm not saying this is a universal hammer, but some systems in biology can be seen as transmitting information; they obviously do other things as well, but they do that too, in part. And you can ask whether this transmission or encoding is efficiently organized, which is the efficient coding idea. Here it was applied to neuroscience, but the idea itself is broader than that.

If there is one showcase success of that thinking, one that people would put forward almost as the first success of the efficient coding hypothesis, it's the following. Think about a neuron in your retina (often people think even about neurons in the primary visual cortex) as a linear filter applying a linear filtering operation to the natural image. You can ask: what should that filter optimally be? Optimally in the sense that I take the natural image, filter it with some arbitrary filter, which then determines the firing rate of the neuron, and I want to maximize the information between the response of the neuron and the stimulus, where the stimulus is natural, subject to biophysical constraints, say a maximum firing rate that you cannot exceed. So this ends up being an optimization problem, and you predict the shape of the receptive field, the neural filter. What you find is a shape that looks like this: take all the pixels in the center region and add them up, and take the surrounding pixels and subtract them away. That's the optimal filter.

Why? Because natural scenes are highly redundant: there are large swaths of bright that don't change, and large swaths of dark. If you had a neuron that was stupidly designed and fired every time it saw a bright pixel, it would go fire fire fire fire even though nothing is changing. You only want a neuron that doesn't respond to large, uniform, redundant patches of either bright or dark; you want it to report differences, and this is what this filter does. If you put in a uniform stimulus, it subtracts the center from the surround, gets a zero response, and doesn't fire. If, however, it sees a bright patch on a dark background, a deviation on top of the redundant brightness, then it responds. So it's, if you want, a point detector, or a change detector.
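Here's a minimal sketch of that intuition, using a difference-of-Gaussians filter as a stand-in for the optimal center-surround filter (the sizes and widths are invented):

```python
import numpy as np

def center_surround(size=9, sigma_c=1.0, sigma_s=3.0):
    """Difference-of-Gaussians filter: excitatory center minus inhibitory surround."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    c = np.exp(-r2 / (2 * sigma_c**2)); c /= c.sum()
    s = np.exp(-r2 / (2 * sigma_s**2)); s /= s.sum()
    return c - s                           # sums to ~0: no response to uniform input

f = center_surround()
uniform = np.ones((9, 9))                  # redundant patch: large uniform region
spot = np.zeros((9, 9)); spot[4, 4] = 1.0  # local deviation on a dark background
print((f * uniform).sum())                 # ~0: filter ignores uniform brightness
print((f * spot).sum())                    # > 0: filter reports the local change
```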
And it's very similar to what was said earlier about first-layer neurons in neural networks, where you optimize for other purposes but also get similar filters.

There was a question from the audience: with vision disorders, do we see neurons that don't have this kind of filtering and actually fire even though it's not necessary? So, I don't know. This is the very early sensory periphery, say the retina, so I would doubt that you see any changes at the very early sensory layers like the retina; I don't know what happens in the cortex, though. And I don't know how much we can actually estimate this type of thing in humans, because these are invasive recordings, so you typically don't do them in the retina. I don't think so.

Just to finish the thought: this is what you would predict, and here is an actually measured receptive field, I think in this case from the LGN rather than the retina, so it might be the next relay station after the retina. And of course I chose the one whose colors match, to make it all look nice, but you definitely see this universal center-surround structure in recorded neurons, as predicted by the theory. This is the poster child of efficient coding, let's call it that, and it has since been extended to many other types of predictions in the sensory periphery, which I'm happy to talk about, but maybe it's not that important here.

What is important, I think, is the following. This looks like a success, so maybe we can close the book and we're done. But notice that when I tried to convince you that efficient coding works, I showed you one picture and another picture and said: see, they look similar, our theory predicted it. That is not really a rigorous test of whether the theory works or not. So what I became interested in, after actually working a lot on these optimality theories, is: how can we formulate a rigorous statistical test for optimality, for efficient coding? There is some data, the real receptive fields. How can I formulate a statistical question: given a finite amount of data and given this candidate theory, is the data consistent with the optimality theory or not? Interestingly enough, this is not something that has been done. We are very rigorous about deriving predictions from optimality theories and doing large-scale optimization; people are very rigorous about measuring receptive fields; but then many papers end up basically showing you two pictures and saying "see, it works". Sometimes they measure the center sizes and check that the center size matches. But what would be the proper way to actually do a hypothesis test? That's the kind of problem I mean. Here is another paper, quite beautiful: receptive fields of ON cells and OFF cells, radial functions, nonlinearities, all inferred from data, with error bars, everything; and these are the optimization predictions. This is from Yan Karklin and Eero Simoncelli, a very nice example of this kind of work: they optimized, in a synthetic retina, the receptive fields of many neurons, with their nonlinearities, and they get all this.
And then, in the end, you show the two pictures and say: I think the theory works. It's a very good piece of work, don't get me wrong, I really respect it, that's why I put it there; but this last step is somewhat arbitrary.

And that's not the only issue. Should we actually be asking a yes/no question at all? If you have an optimization theory, do we really believe that a biological system is fully optimized for an objective that you guessed? Maybe it's not a yes/no question; maybe what we should actually be defining is a measure of how close a biological system is to the optimum of some theory, a continuous measure, not a yes/no hypothesis test. So how would we define that measure of closeness to optimality?

Moreover, and for the practitioners of optimization this is actually where the pain is, although we don't like to talk about it and sweep it under the rug: many times the optimization theories we write down are complicated. You optimize many numbers under some objective, and guess what: often it's a nonlinear optimization problem, and there are multiple solutions, some global, some local. Which one do you compare to data?

Next: no optimization theory exists in a vacuum. For efficient coding I said: maximize the amount of information transmitted given some constraint, say on the firing rate of the neuron. In that case we're lucky, because the firing rate, which is the constraint, is something we know from experiment a priori, before doing any optimization: I know these neurons don't fire at more than, say, 100 hertz, or 50 hertz, so I can put that in as the constraint. But it often happens, in systems that we study, that we know there has to be a constraint, say in biochemical circuits: we know these circuits work with a limited number of molecules, and the fact that the number is limited means there is some floor to the noise; you cannot go below that noise floor if you only use 100 transcription factors. What we don't know is the value of the constraint: we know there is a limited number, but we don't know from data what that number is. If we knew it, we could put it into the theory, turn the crank, and get the prediction out; but it hasn't been measured, so you actually have to infer it from data jointly. So I would like to infer that constraint from data while optimizing everything else. And now we have this very complicated statistical problem where parts of the problem are being inferred while parts are being predicted, and you want to test that.

The way we do it now is we cheat, in many ways. We say: okay, here's an optimality theory for some biochemical system, and I know there is a limit to the number of molecules; let's say the limit is 100. I put in 100, I optimize all the other parameters, I look at what comes out as my prediction, I compare to data, and... it doesn't quite work.
So maybe the limit was 200 molecules; let me put in 200 and work out the optimality prediction for 200, and so on. You see what I'm doing statistically: I'm manually fitting the constraint while trying to predict everything else. And of course, what if it's not one constraint, what if there are more constraints? Now I'm doing manual fitting with no control over complexity, over overfitting, and so on. If we come to the last example, you will see an instance of exactly this.

And then the fourth thing: sometimes we just want to infer. We have a complicated model, and I'm sure you have been faced with this: a complicated, maybe nonlinear dynamical-systems model of some process in biology, many parameters, and the data is not sufficient; we cannot constrain all the parameters. The question is: if you know what the system is supposed to be doing, and you can formulate that as an optimization principle, can you use that to help constrain the parameters that are not constrained well by the data? Biological systems, as opposed to non-living things, have evolved for a function: there's this notion that they work, they do something, which we can usually describe in words. But if you can put it into a formula, some optimization function, and say that's the thing they are doing, can that be used to help your inference? Because, as you will see, you can put it in as a prior.

These issues, I claim, really become limiting for high-dimensional optimization problems. There are super nice successes of efficient coding when the few things you want to predict are simple. Like this beautiful work: in your eyes there are three types of color detectors, the red, green, and blue cones. There are 8% of the blue ones and 92% red plus green, and there can be large variability in red versus green, but if you want to see well you need 8% blue. Where do these magic numbers come from? You can do the optimization: given natural images and a limited budget of cones, how should I apportion them between red, green, and blue? That's an optimization problem you can write down and optimize, and you get 8% blue out. And you even get that the optimization function is rather flat for red and green, so there can be lots of variability in how much red and how much green you have while still seeing well. This is actually beautiful work by Vijay Balasubramanian, which I like to give as an example. That's a reasonably well-defined optimization problem, because you're optimizing three numbers. But when we go to complicated nonlinear things with many numbers, this starts to kill us. And also: how do we know we're getting something nontrivial, how do we confront it with data? So, that's a long motivation, but that's where I want to go.

To get started, and I don't care if we don't get to the end: the idea I want to propose is very simple, and it can address all of those questions in one consistent Bayesian framework. If I get the idea across, that's great; the examples we can then skip or not.
That's why I'm taking my time here to get to this idea. Let me contrast the bottom-up and top-down approaches. What is an optimization theory, a little more formally? I'm looking for a model that optimizes some notion of biological function. This notion of biological function I encapsulate mathematically by writing down a utility: a mathematical function that I think is being maximized with respect to the parameters of my model, which may be probabilistic, like here, with parameters theta. The utility function can evaluate the model at given parameter values and say: that is good information transmission, or that neuron encodes information well, or that transport network has a high rate of nutrient flow, or that metabolic circuit will generate a new E. coli fast. So that's my utility function, and then the normative theory takes the model and the utility function and finds the set of parameters theta star that maximize the utility, maybe subject to some constraints. This is agnostic about data: in this clean picture, data doesn't come in; it's an ab initio derivation. Maybe data helps inform what kind of model you choose, but it doesn't set the parameters; the parameters are set by optimizing whatever you think the model is doing, or should be doing, in biology.

Contrast this with statistical modeling, with fitting, which is very different. This is fully data-driven: you start with the model, like we wrote it down at the board, you have the data, you run maximum likelihood or Bayesian inference, and you go from the data to the best set of parameters theta star that fit the data. So this is data-driven, but it's agnostic about function: in this case U doesn't come in. So here data doesn't come in, and here U doesn't come in. Oftentimes these things are done in parallel, independently: on one hand you derive optimality predictions, on the other hand you fit models to data, and then maybe you ask: are they consistent? Our question was: can you combine these two things, as opposed to keeping them as two distinct approaches?

The idea is, of course, that both of these approaches make statements about the parameter space. Here is the normative theory, the optimization of the utility function, and here is inference from data. Inference is saying: given the data, I think my system is here, and that's the best estimate of the parameters; in this illustration, a two-parameter model with theta one and theta two. And the normative theory is saying: if this is the true utility, what the system is really trying to do, then the system should be here, because that maximizes utility. Both are making statements about something that lives in the parameter space. So I can think about a continuous axis that interpolates between these two approaches.
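Side by side, the two extremes look like this (in the notation from the board):

$$
\theta^{\star}_{\text{norm}} = \arg\max_{\theta}\, U(\theta) \quad \text{(utility only, no data)},
\qquad
\theta^{\star}_{\text{fit}} = \arg\max_{\theta}\, \log p(D \mid \theta) \quad \text{(data only, no utility)}.
$$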
At one end of this axis, which I call the data-rich regime, I only use the data to do inference; at the other end, the no-data regime, I only use optimization to predict the parameter values. But there can be intermediate regimes. For instance, I can write down an optimization theory that's not fully specified, because I need some data to break a degeneracy: which optimum is best, or which optimum is closest to the data. So I use a little bit of data to inform the optimization theory. Or, as you will see in this example, I do mainly inference but use the theory to regularize, to inform the fits from the data. So the question is: can you have something that moves between the two extreme points?

The idea is very simple: take your favorite optimality theory, a normative theory, and formulate it as a prior for Bayesian inference. How would this look? You give me an optimization theory, a utility that's a function of the parameters, so there is some set of parameters that maximizes utility, and I claim that's a good system, one that works well. And I embed that into a prior that looks like this. Here is the connection to maximum entropy: you see that this prior has a maximum entropy form, with one hyperparameter called beta. What does that mean? In the limit of beta going to zero, this prior flattens out: you ignore the optimization theory entirely and just get a flat prior. As beta gets larger and larger, the prior localizes the parameters around the points where optimal systems live. In the limit of beta going to infinity, if the utility has only one peak, all the weight is localized exactly at that optimal solution.

Why is this a maximum entropy prior? It means that, given beta, the probability spreads over those parts of parameter space that achieve a given, fixed average utility, but is otherwise as random as possible over the parameters. You can picture it as a sea-level picture, if you like: if this is my utility function, here is the best system; if beta is infinite, I sit exactly here; as beta goes lower, I'm allowed to spread over the hill in parameter space, such that on average I still have high utility; and if beta goes lower still, I'm allowed to spread even more, to even worse solutions, at a lower fixed average utility. But otherwise it's as random as possible, and in that sense it's maxent.

Then here is my standard likelihood, and of course I can put the two together into a posterior over the parameters. So this is the likelihood, this is my optimization prior, and now you can see that by choosing the value of beta I can interpolate between these two worlds: the red world of normative theory and the blue world of fitting. If I take beta to zero, I'm doing pure fitting; the small-beta limit is the fitting limit, which is equivalent to strong data, lots of data. And vice versa: large beta puts the weight here, so the data acts just as a small perturbation on top of the structure imposed by the prior. (I'll be right with you.)
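In formulas, the optimization prior and the resulting posterior (my transcription of the slide):

$$
p(\theta \mid \beta) = \frac{e^{\beta U(\theta)}}{Z(\beta)},
\qquad
p(\theta \mid D, \beta) \;\propto\; p(D \mid \theta)\; e^{\beta U(\theta)},
$$

so that $\beta \to 0$ recovers pure fitting and $\beta \to \infty$ pins the parameters to the maxima of $U$.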
Let me just finish this. This, however, is a very different use of priors than the statistical regularization priors: these priors potentially have a ton of structure, they embody the full optimization theory, and they can localize you at a particular point in parameter space.

Yes? A question: when we first saw maxent, the Lagrange multipliers were computed by matching the empirical values of the constraints; what about here, how do you fix beta? Great question. Let me show you how; it actually depends on which question you're asking. My claim will be that many of those statistical questions, how do you do hypothesis testing for optimality and so on, can be resolved within this framework, and exactly which statistical question you pursue determines what you do with beta. As you will see, you can view beta as a hyperparameter that you can also try to set using your data. And if you do that, it will have the kind of interpretation that answers one of our questions: how do you know how close to optimality you are? If you infer beta itself and find values that are very high, that's a sort of continuous support for your optimization theory, given the data. But let me show a few examples.

All right, how are we doing on time? Okay. I'll try to illustrate this on a very simple toy example of a single LN neuron. The stimulus is a set of scalar values x; I get samples one after the other, these numbers. The neuron takes the stimulus and maps it through a nonlinearity, a sigmoid with two parameters, k and x zero: the slope and the location. That gives me the probability of firing: if I'm here, the probability of firing is one. The response is binary: spiking or not spiking. That's the simplest type of model neuron.

Suppose I have some data; by data I mean the stimulus values, each one coming equipped with a label: whether the neuron fired (a black dot) or didn't fire. If I had a limited number of such samples, I could do maximum likelihood inference of these two parameters. Since there are only two parameters, which is convenient for illustration, I can think of them as living on this parameter plane, (x zero, k), and with a finite number of samples, on this plane there lives a likelihood. Because the data is limited, after seeing some number of samples it puts my system somewhere here, where it's dark; I cannot be more precise. If I had more samples, this would localize further and I would be more precise about where the system is, but with finite data I can only localize it to where the blue stuff is. That's the inference line of thinking.

Now, for such a neuron I could also write down an optimality theory. Say, for instance, the optimality theory is that this neuron should maximize the transmission of information, as in efficient coding. And this is all synthetic, just to illustrate.
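A minimal sketch of this toy setup (all parameter values invented): simulate an LN neuron with binary responses and evaluate the Bernoulli log likelihood on the (x0, k) plane.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_spike(x, k, x0):
    """LN neuron: sigmoid nonlinearity maps the stimulus to a spiking probability."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Synthetic data from a "true" neuron (parameters invented for illustration)
k_true, x0_true, T = 4.0, 0.5, 200
x = rng.normal(size=T)
r = rng.random(T) < p_spike(x, k_true, x0_true)   # binary responses (spike / no spike)

def log_lik(k, x0):
    """Bernoulli log likelihood of the observed spikes and silences."""
    p = np.clip(p_spike(x, k, x0), 1e-9, 1 - 1e-9)
    return np.sum(np.where(r, np.log(p), np.log(1 - p)))

# Evaluate on a grid over the (x0, k) parameter plane; the ML estimate is the peak
ks, x0s = np.linspace(0.1, 10, 80), np.linspace(-2, 2, 80)
L = np.array([[log_lik(k, x0) for x0 in x0s] for k in ks])
i, j = np.unravel_index(np.argmax(L), L.shape)
print(f"ML estimate: k = {ks[i]:.2f}, x0 = {x0s[j]:.2f}")
```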
Let me assume the stimulus values come from this type of distribution, one with three peaks, whatever that means. That's the stimulus. Given the stimulus distribution, I can write down the mutual information between the stimulus values and the neuron's binary response. This is the mutual information, in red, as a function of the two parameters. It has this weird shape: the utility is high here, here, here, and here; it's degenerate, it has four peaks, and it's very bad here. If you want to understand why four peaks, that's not hard. If you have a sigmoid neuron and this is your stimulus distribution, you can place the sigmoid so that it separates the first peak from the other two, or so that it separates the first two peaks from the last one; that's two positions. And the nonlinearity can go monotonically up or monotonically down, positive or negative k. The two positions times the two signs of the slope give the four peaks. Nothing mysterious, but it's a funny utility, because it's not single-peaked even in this simple example: there are multiple neurons that would work equally well.

Once you have that utility, you can embed it into the prior, which now depends on beta, and you see what I was trying to explain in words before: when beta equals zero you forget about the utility, the prior is uniform; as you increase beta, you localize the prior closer and closer to these optimal points. That can be summarized by a curve that plots the average utility under the prior: here the average utility is low, and here it's very high; this is the utility as a function, parametrically, of beta. And here is the entropy of that prior: here the entropy is the highest, because the prior is uniform, and here the entropy is low, because you're very localized. So as you increase beta, the entropy gets lower and lower, and the utility goes higher and higher, until here you have localized the system at these four points and the entropy is almost zero. Well, not exactly zero, because there are four points, so it's actually log of four.

With these two quantities in hand, we can address our first problem: how would you devise a hypothesis test to ask whether the data says my neuron is consistent with the optimality I posited, which is to transmit as much information as possible, or not? Consider three cases; this is all synthetic, so I can design them. Test case number one: the neuron is certainly not optimal; I put it here. Neuron two is somewhere in between: not at the top of the utility, but close to it. And neuron three sits exactly on the optimum, so for this one I know it's optimal by construction. For all three such neurons, I can now pretend I do not know where they are, and imagine that I observe a limited number of samples from each.
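Before going on, here's a minimal sketch of building this information utility and the optimization prior numerically (the three-peak stimulus distribution and all grids are invented stand-ins; the signed slope allows both up- and down-going sigmoids):

```python
import numpy as np

# Three-peak stimulus distribution, discretized on a grid (invented example)
xs = np.linspace(-3, 3, 301)
peaks = sum(np.exp(-(xs - m)**2 / 0.08) for m in (-2.0, 0.0, 2.0))
p_x = peaks / peaks.sum()

def utility(k, x0):
    """Mutual information (bits) between the stimulus and the binary response."""
    p1_given_x = 1.0 / (1.0 + np.exp(-k * (xs - x0)))   # P(spike | x)
    p1 = np.sum(p_x * p1_given_x)                        # P(spike)
    def h(p):                                            # binary entropy, safe at 0/1
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return h(p1) - np.sum(p_x * h(p1_given_x))           # I = H(R) - H(R|X)

ks = np.linspace(-10, 10, 81)          # signed slope: up- or down-going sigmoid
x0s = np.linspace(-3, 3, 81)
U = np.array([[utility(k, x0) for x0 in x0s] for k in ks])

beta = 20.0
prior = np.exp(beta * U); prior /= prior.sum()   # maxent optimization prior on the grid
print("max I:", U.max(), "bits; prior mass near the top 1% of U:",
      prior[U > np.quantile(U, 0.99)].sum())
```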
After a limited number of samples, the likelihoods for these three cases localize me to various regions in the parameter plane: for neuron one I know I'm somewhere here after a finite amount of data, for neuron three I'm here. Now you can see why a hypothesis test for optimality is not easy with finite data. Two problems. First, with finite data you don't know exactly where in the parameter plane your system is; your knowledge is spread out. Second, the utility itself is smooth: it's not that it's optimal here and terrible right next to it. It's optimal here, fine, but nearby is still pretty good. So what does it even mean to ask whether the data is consistent with optimality, given these two complications?

One proposal is to define something that's a function of beta only: the marginal likelihood of beta, given the data. Basically, you integrate the likelihood against the prior over the parameters, and you end up with a statistical quantity that is just a function of beta: how much support you have for a particular beta, given finite observations and given a particular optimization principle embedded in the utility.

So for each of the three cases you can construct this marginal likelihood and plot it as a function of beta: this is the first test case, this is number two, and this is number three. What does that mean? You can look at which value of beta has the most support: where does this marginal likelihood peak? You'll see that in the first case it peaks essentially at values close to zero; in the second case it peaks at a nontrivial beta value, away from zero; and in the third case you want to push beta as high as you can, which makes sense because neuron three sits on top of the optimum by design. So these curves are telling you, roughly, how far toward the optimum you are: for the neuron at the optimum the data suggests beta is large, here beta is finite, and here beta goes to zero.

But we are still not done, because we need to design a statistical test for whether these values of beta are significant or not. So I can define a test statistic, which is the likelihood ratio comparing the best beta larger than zero, which would give evidence for optimization, against beta equal to zero. Under the null hypothesis of no optimization, beta equals zero, I can construct the null distribution of this test statistic lambda; it has a non-standard form, but for this simple case you can just construct it, because we are living in a low-dimensional parameter space. Then you can put down a cut, say at 5% if you want that to be your significance level, and ask where the three values of the test statistic, lambda one, lambda two, lambda three, fall.
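In symbols, the marginal likelihood, and one natural form of the likelihood-ratio statistic (my reconstruction; the exact definition in the paper may differ in detail):

$$
p(D \mid \beta) = \int d\theta\; p(D \mid \theta)\, p(\theta \mid \beta),
\qquad
\lambda = 2 \log \frac{\max_{\beta \ge 0}\, p(D \mid \beta)}{p(D \mid \beta = 0)}.
$$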
Lambda one lives here, lambda two lives here, and lambda three is the only one that sits in the tail and would pass the significance cut. This is of course a very simple illustration, but it at least formalizes what the difficulties are when testing optimality from data, or when devising a statistical test for it: the smearing due to finite data, and the fact that utility functions can be smooth. And you can do the same thing, and I hope I get there, in an actual application with real data that is higher dimensional, not two-dimensional like here. The procedure properly takes into account both the uncertainty of the finite data and the shape of the utility function. So can we formulate a rigorous statistical test? Yes, one can; it has its drawbacks, which we can discuss at the end.

The next question is: can we estimate population-level optimality? If I do not want to frame a yes/no question, "is it optimal or not", but instead want to measure how close to optimal the system is, can I do that? I will just give a brief hint of how you go about it. Basically, you assume a hierarchical model. Imagine I measure many neurons in the eye, or many neurons in visual cortex, and I have some optimality hypothesis for how they work. Theta one up to theta n are the parameters of each of these measured neurons, say the shapes of their receptive fields, and data are observed for each of these subsystems. What I then assume in this setup is that the parameters of all these measured neurons are drawn from the maximum entropy distribution with the utility, with a common value of beta. In other words, all of these neurons have been optimized to the same extent, and I want to assess what that extent is by inferring beta from observations of many neurons. You can see that as a hierarchical model, where I infer not the individual parameters but the beta they all share, and once I have inferred beta I can make a statement about whether, and to what extent, the neurons are optimized.

Again, think of a synthetic case. This is the same utility as before, but now I consider many neurons; each red dot is one synthetic neuron. In the first case the neurons I observe would be, as you can see by eye (I made up this case), consistent in some sense with the optimality hypothesis, because all the dots live where the utility is high and I sampled them from there. In the second case they are not really very optimal, and in the third case I have drawn neurons from a point in between, which is actually really bad because it sits in the utility trough. And here there is no magic: you simply do hierarchical Bayesian inference, which means that from all of these sampled data you construct the inferred posterior over beta. The first set of neurons would be consistent with a beta drawn from a posterior where the value is quite high, the second with a lower value, and the third with a very low value.
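As a sketch, the population-level (hierarchical) version just multiplies the per-neuron marginal likelihoods at each candidate beta; a flat hyperprior on beta is assumed here for simplicity, and the function reuses log_marginal_likelihood from the previous sketch.

```python
import numpy as np
from scipy.special import logsumexp

def log_beta_posterior(logliks_per_neuron, U, betas):
    """Hierarchical model: all neurons share one beta.
    logliks_per_neuron is a list of per-neuron log-likelihood grids (same grid as U).
    Returns the log posterior over the beta grid, normalized on that grid."""
    logpost = np.array([
        sum(log_marginal_likelihood(L, U, b) for L in logliks_per_neuron)
        for b in betas
    ])
    return logpost - logsumexp(logpost)
```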
That part is standard. What is not standard is how to judge these numbers: is a beta of 12 high or not? Can we transform beta into a scale that makes more sense? You can, because, as you will remember, in the maximum entropy prior there is a nice mapping between beta and the average utility, a one-to-one, monotonic map; that was the curve from before. So the inferred beta can be converted onto that scale: given the inferred beta I can compute the average utility, which is high here, lower here, and lowest here, and I can measure that utility relative to the maximum utility I would get if every neuron sat exactly at the peak of the utility function. That gives a measure between zero and one. If you are at one, you sit exactly at the top of the utility function; zero can be defined in several ways, for instance as the expectation under the flat prior. Now you can compare: for this case, and of course it is synthetic so I can compare the inferred value against the ground truth, these neurons are at roughly 93% of the highest possible utility in this system. Again, this is a toy model, but I hope to show you that you can do the same on actual data. So if you have a hypothesis for your utility function, you are no longer doing a yes/no optimality test; you are saying that this biological system is at 90%, or at 60%, of what would be possible given your utility.

All right. So far we had a yes/no hypothesis test and a measure of closeness to optimality. Now we move to a very different application: what I will try to convince you of next is that you can use the same machinery to help your inferences. We are now interested in inferring a complicated model from limited data, and we are asking: if I have a utility, an optimality hypothesis, can I use it to improve my inference?

The setup is the classical neuroscience setup where you want to learn the receptive field of a neuron. A stimulus goes into the neuron, which acts on it via a linear filter; you can think of it as a linear filter just as in yesterday's deep neural network lecture. The output of the filter is passed through a sigmoid nonlinearity, again exactly the same, and that nonlinearity determines how many spikes, on average, the neuron will emit; the spiking itself is stochastic, Poisson. In the inference application I am given stimuli and spikes, and my task is to infer the linear filter. The issue is that this filter is no longer just two parameters: it is a full image patch, so a lot of parameters, something like 20 by 20, on the order of 400 parameters per receptive field. So for real neurons I am typically data-limited. What I will try to show you is that if you think these neurons are optimized for something for which you can write down the utility, you can use that to regularize the inference problem.
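Before moving on, here is a small sketch of the beta-to-average-utility map and the resulting zero-to-one optimality score just described; taking the flat-prior expectation as the zero point is one possible convention, as noted above.

```python
import numpy as np

def average_utility(U, beta):
    """<U> under the max-ent prior p ∝ exp(beta * U); monotonically increasing in beta."""
    w = np.exp(beta * (U - U.max()))
    w /= w.sum()
    return float(np.sum(w * U))

def optimality_score(beta_hat, U):
    """Map an inferred beta onto a 0-1 scale: 0 = average utility under the flat
    (beta = 0) prior, 1 = the utility of a system sitting exactly at the peak."""
    u0 = float(U.mean())   # flat-prior expectation, one choice of the 'zero' point
    return (average_utility(U, beta_hat) - u0) / (float(U.max()) - u0)
```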
So what is the example? These look like receptive fields of primary visual cortex; they look like Gabors, and we actually have a theory, an optimization theory, for what these filters should look like that we think is reasonable. That theory is called sparse coding, for those of you who have heard of it. The theory says that the highest efficiency of information transmission is reached when you take natural images and convolve them with filters such that the distribution of filter outputs is as sparse as possible. That is a statement about its kurtosis: given a fixed variance, you want the response distribution to be peaked. Mostly the neuron is silent, but when it does respond, it responds with a value that is typically larger than it would be under a Gaussian of the same variance. There are many reasons why this might be a good idea, but here we will simply assume that sparse coding is a good theory for these neurons, so I can write sparsity down as a utility function to be maximized; I am not even specifying its exact form here.

I can then use this sparsity utility to generate a bunch of test neurons, some optimized for it and some not. Each little picture is one such neuron, some of them optimized for sparse coding. I will use these filters as ground truth to simulate spikes, and then pretend I am doing the inference back: I will try to infer the receptive fields from the simulated data.

So how does that look? I pretend I have a limited number of stimulus-response pairs. These are two example neurons, that is one filter and that is the other, and that is the ground truth; I simulate spikes from them and infer the receptive fields back. The first thing I do is set beta equal to zero in my optimality prior, so the prior does nothing and my inference reduces to maximum likelihood. With this number of stimuli, these are the receptive fields I recover, and the number shown is the correlation coefficient between the recovered receptive field and the ground truth. I am doing reasonably well, you can see the structure, but it is not perfect, because I have finite data; with more data it would obviously get better and better. But now I can ask: what if I do the inference using my sparse coding optimality theory as a prior? I can control how much prior I put in through the beta variable: beta equals zero means no prior at all, and with larger and larger beta I mix in more and more of the optimality theory. And this is what happens: as you make the prior stronger and stronger, you improve the quality of your inference in both cases. And here I am giving a little bit away.
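Here is a minimal sketch of that regularized inference, assuming Poisson spiking for the linear-nonlinear model and taking excess kurtosis of filter responses to natural image patches as a stand-in for the sparse-coding utility; the exact utility and optimizer used for the actual figures may well differ.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis

def ln_rate(w, X, gain=1.0, offset=0.0):
    """Linear-nonlinear model: stimulus matrix X (T x D), filter w, sigmoid output rate."""
    return 1.0 / (1.0 + np.exp(-(gain * (X @ w) + offset)))

def poisson_loglik(w, X, spikes):
    """Log-likelihood of observed spike counts, assuming Poisson spiking per bin."""
    r = np.clip(ln_rate(w, X), 1e-6, None)
    return float(np.sum(spikes * np.log(r) - r))

def sparse_utility(w, nat_patches):
    """Assumed sparse-coding utility: excess kurtosis of the (norm-fixed) filter
    responses to a bank of natural image patches."""
    resp = nat_patches @ (w / (np.linalg.norm(w) + 1e-12))
    return float(kurtosis(resp))

def map_filter(X, spikes, nat_patches, beta, w0):
    """MAP receptive field with the optimality prior: maximize
    log-likelihood + beta * U_sparse; beta = 0 recovers maximum likelihood."""
    obj = lambda w: -(poisson_loglik(w, X, spikes) + beta * sparse_utility(w, nat_patches))
    return minimize(obj, w0, method="L-BFGS-B").x
```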
One of these receptive fields I actually generated to be consistent with the sparse coding utility, but the other was not; it was some junk filter that, I think, has too many wiggles and so on. And yet, even for that one, using the sparse coding optimality prior still improves the inference quality to some extent. And what happens if I put in too much prior? This goes by quickly, but you can guess what happens if beta is too high: the quality collapses. Why? Because this is now the regime where the prior is so strong that I effectively forget the data. I just start drawing optimal filters from the prior; they have nothing to do with the data whatsoever, they are solutions to the sparse coding optimization problem, and they look nothing like the data. So obviously there is some intermediate amount of optimality prior that helps my inference, and not more, so I have to set beta somehow; of course I can do that by various methods, fully Bayesian or by cross-validation, and that is all in the regime of things we know how to do.

There has been a lot of work in neuroscience on how to infer these objects, and people have handcrafted good priors before, by saying: we know that receptive fields are typically localized, so let us make a prior that makes them localized; we know that receptive fields are typically oriented and band-pass, so let us impose a prior on their frequency components to embody that. The point is that all of these insights (localized, smooth, band-pass, oriented) are encapsulated naturally in the sparse coding efficiency theory; in fact, that is where the ideas for those hand-crafted priors came from in the first place. So in some sense the best thing you can do is to take the full optimality theory and use it as the prior, because it simultaneously captures all the notions you were previously hand-crafting into the prior.

You can then of course ask which beta you should use to maximize the quality of your predictions, and again you can do that by cross-validation and so on. Here, summarized over a bunch of neurons, is the maximum likelihood performance versus the performance with nontrivial priors mixed in: performance is maximized at some intermediate value of beta, which I can set properly by cross-validation, but I am not going into that here.

Let me also say that this approach can be extended: you can have an optimality theory which itself has unknown parameters. Imagine an optimality theory that depends on some parameter whose value I do not know; I can set it, too, by cross-validation. Typically this happens when you say: I think the neurons are optimizing information transmission, but at limited cost, say a limited firing rate. Then the optimality theory has two terms, with a Lagrange multiplier balancing them, and you do not know the value of that multiplier; you know there is a trade-off between the two terms, but not how strong it is.
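A sketch of how beta could be set by cross-validation follows, reusing map_filter and poisson_loglik from the sketch above; the fold structure and grids are illustrative, and extra free parameters of the utility itself (such as a firing-rate-cost multiplier) could be scanned on the same grid.

```python
import numpy as np

def crossval_beta(X, spikes, nat_patches, betas, n_folds=5, seed=0):
    """Choose the prior strength beta by cross-validation: fit the MAP filter on
    the training folds and score the held-out Poisson log-likelihood."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(spikes)), n_folds)
    scores = []
    for beta in betas:
        total = 0.0
        for k in range(n_folds):
            test = folds[k]
            train = np.setdiff1d(np.arange(len(spikes)), test)
            w_hat = map_filter(X[train], spikes[train], nat_patches, beta,
                               w0=np.zeros(X.shape[1]))
            total += poisson_loglik(w_hat, X[test], spikes[test])
        scores.append(total)
    return betas[int(np.argmax(scores))]
```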
So you can leave that balancing term as a free parameter and then set it in a similar way, by cross-validation. Okay. I will now give basically one example on real data. But first, one more comment on that last use, where you use the optimality theory to regularize inference. Around the time we wrote the paper I am referencing, there was another paper (we did not know about each other) that was about exactly this type of use of optimality to regularize inference, and precisely for receptive fields. It was not just a toy example but a proper piece of research, coming I think out of Max Planck in Tübingen, and the ideas are quite powerful. What is interesting is that for regularizing inference, your optimality theory does not need to be exactly correct. You can view it simply as something that makes your receptive fields localized, oriented, and so on. If the prior is mismatched to the data because your guess of the optimality theory was wrong, that is not a catastrophe, because cross-validation will just push your beta closer and closer to zero, so you will start to disregard the prior. Of course then it will not help your inference much, but you will also get an indication that this was probably not a very good theory, because your prior got tuned down. As long as the theory captures some of the statistical regularity in your actual receptive fields, even if they are not purely optimal, you are already gaining, because it regularizes in a way that is quite powerful.

All right, so the one example I have time for is exactly what you saw, but now with real data. This is an example where we looked at the same idea, but with real measured receptive fields. These are from Dario Ringach's group and are publicly available: actual receptive fields of neurons in V1, 32 by 32 pixels, about 250 neurons. They are quite variable: this one looks like a really nice Gabor, this one looks more like a dark spot, and this one has some other structure extending across the receptive field, maybe noise, who knows. One can take these data and carry out all the analyses I was showing you. You can do an optimality test neuron by neuron, which is this picture, asking: is this receptive field, now with 32 by 32 parameters, consistent with the prediction of sparse coding theory? To judge that, you generate, much as I was showing before, a null distribution of the utility under the hypothesis of no optimization, which is this red distribution, define a 5% significance threshold, and then go neuron by neuron and evaluate whether the utility of that neuron falls on the left or the right of the significance cut. What you find is that 204 out of the 250 neurons pass this individual test, because they fall on this side, and 46 do not.
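A sketch of the neuron-by-neuron test just described, assuming you already have the utility values of the measured receptive fields and of a large ensemble of unoptimized (beta = 0) filters:

```python
import numpy as np

def per_neuron_optimality_test(measured_utilities, null_utilities, alpha=0.05):
    """Compare each measured receptive field's sparse-coding utility to the null
    distribution of utilities of unoptimized filters, with a one-sided
    (1 - alpha) significance cut."""
    cut = float(np.quantile(null_utilities, 1.0 - alpha))
    passed = np.asarray(measured_utilities) > cut
    return passed, cut

# measured_utilities: the utility evaluated on each of the ~250 measured RFs;
# null_utilities: the same utility evaluated on many randomly drawn filters.
```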
Here is an example of a neuron that passes, and it really does look like a nice Gabor; and here is one of the 46 out of 250 that would not pass the cut. You can see by eye that it is a very dispersed receptive field: it has no clear orientation, and it is even hard to say whether it has a center-surround structure. So that is what you get if you insist on a yes/no test for every neuron individually.

What is really interesting is the other, population-level measurement: I take all the neurons together and ask, as a collective, can I infer the beta value for this population, and can I measure, on a scale from zero to one, how close to optimal they are under this utility? So from all of them jointly I construct a posterior distribution over beta; this is the log posterior as a function of beta. For these data you should be looking at the purple curve. The point estimate from this posterior, which is actually quite spread out (note the log scale), tells you that the beta most consistent with the data is about 3.4, and now you want to judge that number. You can compare it to an ensemble of optimal solutions to my utility: I take the utility and computationally maximize it, generating many filters on the computer, and those have the red distribution of beta values, slightly better, around 4.5, compared to what the data give. And then there is the so-called anti-optimal case: I can make really bad neurons on purpose by anti-optimizing sparsity, minimizing it, and I get that distribution. Putting the utility of the real V1 neurons and of the optimized ensemble on the zero-to-one scale, I would conclude that these 250 neurons sit at about 70% of the utility I could reach if my computer found the optimal solutions to this problem. So that is kind of interesting.

Maybe I will skip this plot, although it is interesting: you can also take this theory and make new predictions, just like in the maximum entropy framework. What have we done? We have assumed an optimality theory and inferred the beta for it; I am saying that the receptive fields are drawn from e to the beta times the sparseness utility. That means that from this distribution I can now draw receptive fields that I claim are consistent with the data (remember, I only fitted beta, a single number), and I can compare those drawn receptive fields to the real ones on other statistical measures. In this particular case, with that fitted beta, what is plotted in the figure is not a receptive field but the spatial autocorrelation function of the model-drawn receptive fields, and this is the spatial autocorrelation function of the actual measured V1 neurons. That is a second-order statistic, predicted from optimality and compared to data.
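A sketch of how one might draw receptive fields from the fitted prior and compute the second-order statistic being compared; a simple Metropolis sampler is used here purely as an illustration, not necessarily the sampling scheme behind the published figures.

```python
import numpy as np

def sample_filters_from_prior(utility_fn, beta, dim, n_samples=200,
                              n_steps=5000, step=0.05, seed=0):
    """Metropolis sampling from p(w) ∝ exp(beta * utility_fn(w)): draws receptive
    fields that the fitted optimality prior claims are consistent with the data."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    logp = beta * utility_fn(w)
    samples, thin = [], max(1, n_steps // n_samples)
    for t in range(n_steps):
        w_prop = w + step * rng.standard_normal(dim)
        logp_prop = beta * utility_fn(w_prop)
        if np.log(rng.random()) < logp_prop - logp:
            w, logp = w_prop, logp_prop
        if t % thin == 0:
            samples.append(w.copy())
    return np.array(samples)

def spatial_autocorrelation(rf):
    """Spatial autocorrelation of a 2D receptive field: the second-order statistic
    used to compare model-drawn filters with the measured V1 receptive fields."""
    f = np.fft.fft2(rf - rf.mean())
    ac = np.fft.ifft2(f * np.conj(f)).real
    return np.fft.fftshift(ac) / ac.max()
```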
If I instead use the anti-optimal solutions I get a totally different autocorrelation, and if I use the really super-optimized neurons I get (maybe you cannot see it, but it is clearly there) a tighter autocorrelation function, a smaller dot here. And there are many other statistics: from this one-parameter fit, the optimality theory with a single beta value fitted to data, I can now predict things about receptive fields and compare each such observable to the data.

So I am ending here; these were a few examples. This was the one based on actual published data; in the paper there are others, not restricted to neuroscience. There is, for instance, an example about the wiring diagram of the C. elegans worm: there have been optimization theories saying that the worm wants to wire its neurons to its muscles using the minimum amount of wire, and you can ask whether the data are really consistent with that theory or not, and what we can learn. So there are other examples out there that I do not have time to go through.

Before I finish, I would like to point out two interesting extensions. Now that we have put all these optimality theories into a Bayesian framework, you can turn the crank of Bayesian machinery to ask all sorts of other questions. You can do model selection between competing optimality theories: say, for neurons in V1, one is sparse coding and one is slow feature analysis, and you can ask, with more rigor, can I do a proper model comparison between them? It is also possible to infer parameterized utility functions, the trick I mentioned, where the utility itself has parameters whose values you do not know, so you jointly infer those values and test whether the data are consistent with the theory's predictions; that is a joint testing-and-inference statistical problem. I think we have only scratched the surface of what can be done.

Let me skip these slides, because there are a lot of them, and put down the conclusions. I think this Bayesian inference with maximum entropy optimality priors is interesting. It smoothly interpolates between pure optimization and pure inference. You can use it to define various tests of optimality, or to measure distance from predicted optimal solutions. And you can use it to aid inference in high-dimensional problems, which I think is underappreciated and actually very important, because oftentimes, if you have a notion of function and a very complicated model with lots of parameters, there are huge swaths of parameter space where nothing happens at all: you choose those parameters and the system simply does not function. In my own work on biochemical networks and signal transmission, there is a lot of parameter space where the signal at the input just does not propagate to the output, or propagates extremely poorly. Now say I posit that the real system is maximizing information transmission.
If nothing comes through, that is presumably a non-functional parameter set, so if my prior kills that region, I do not waste statistical power ruling it out by inference using data samples. That is perfectly fine: it still does not mean the system has to be optimal, but it restricts the inference to a perhaps much smaller set of effective parameters. I think there can be a lot of gains in that direction.

The one thing I did not have time to show you, which would have been the last example, is a very complicated optimization theory that leads to many, many solutions that are nearly degenerate: with this parameter set you get a utility of four, and with a very different parameter set you also get a utility of four, and four is the best you can do, or close to it. How do you confront such a degenerate theory with data, especially when the data might be consistent with just one of the many peaks in the utility landscape? Is that trivial or not? If a theory predicts a thousand different possible solutions, is it surprising that one of them is close to the data, or not? And in a high-dimensional parameter space, how do you even find out whether any optimal solution is close to your data? Once you live in a thousand-dimensional parameter space, you run the optimization and find one solution, and the next time you find another one. How do you guide your search to check whether there is one close to the data? That is what we have been thinking about, and I did not have time to show it.

I think this is an interesting formal way of including the notion of biological function in our models. We are not assuming things are 100% optimal, and I think it is very productive to include it. When I present these results I often get into arguments, also with evolutionary theory people: why would we assume it is optimal? I think that is a very good question, and we should not; but even if a system is only 80% or 90% optimal, that is still a huge reduction in the space of possibilities.

All right, that is it; this last part I did not have time to show, so I will stop here, one minute over time. If anyone has any questions, I am happy to take them here or afterwards. Thank you.

Good. Thanks a lot for an inspiring lecture, and series of lectures. Have we got questions from the room or from Zoom? I think on Zoom it was just the question about what the last component in the equation was. Yeah, that had been there for a long time; that was partly why we started the transcript, so that people who are away can read along as well. So my question would be: obviously one of the big problems is how you can encode the optimality function in general, which you need, but another one is that most systems try to do more than one thing at the same time. Would you just use a vector of utilities?
Yeah, so I think that is a very good question. The only thing we have been thinking about is a subset of utility functions: compositions where you combine multiple objectives with multipliers, so that the full objective is a bit of this and a bit of that, and you can then use this framework to fix that ratio from the data. But that is a very restricted set, a small subset of possible utilities, not a proper multi-objective optimization problem. At the moment I do not know how to extend it to genuinely multiple simultaneous objectives.

I was also fantasizing, while I am on this: there are endless, very fascinating game-theory-type paradoxes where the utility of the individual collapses the community, and so on. I wonder whether something like that can come out of this framework.

So, I have not been thinking in this game-theoretic direction, but already with neurons there is something very interesting; there is a bit of a conflict, and one needs to think carefully about it. You can have a theory that predicts individual receptive fields, and sparse coding is of that sort: there is a utility function, you stick in a receptive field, and it tells you how optimal it is. But of course we do not really think, in neuroscience, that what is being optimized is each individual receptive field alone; what is being optimized is the whole ensemble, because neurons do population coding. And that is a little bit conflicting, because when you optimize the population it is no longer the case that every receptive field should have one particular shape; you want a spectrum of them, and in some sense they inhibit each other: you do not want the other neuron to do exactly what you do, even if each of you is perfectly optimal when assessed alone. So I think what is interesting, and maybe that goes in a similar direction, is how to formalize this notion at the level of ensembles of things that interact, where interaction rules or the utility function tie them together. It is possible, but it might just be technically difficult to actually carry out any computation; we have not done it, but we have thought about it in this neural context.

More questions from the room, even people behind? I know there is nobody behind the pillar.

I wanted to know more about the possible drawbacks of this method. Maybe you said something about that, but are there more?

Yeah, so I think there are several things. Technically, doing some of this may simply not be easy; that is one thing. The second thing is that, for example for the hypothesis testing, it really depends, and I think this is not a drawback of this particular method per se but of reasoning about high-dimensional probability distributions in unbounded spaces, because what you compare is
how localized your data are around the optimal solutions, relative to the size of the entire parameter space; but in practice that parameter space is bounded within some box, your parameters can go from, I do not know, minus ten to plus ten, or whatever. So the tests and so on depend on the size of that box, and that is obviously problematic. It is not a drawback of only this approach; it is a drawback of any model where you integrate over the volume of a parameter space that is not intrinsically bounded. There is the question of how you regularize that. In many cases, though not all, you will have some external limit on where your parameters can go; say, for sigmoid functions, biophysical limits on where the midpoint can be and so on. But strictly speaking, some of the results here depend on the choice of the size of the parameter space, and I do not know how to deal with that in general. It is not a problem unique to this framework, but it is a problem. Apart from that, I think we do not yet fully know, because this has not been applied to that many cases; it is a rather new type of work, so we will still see what we bump into along the way.

Thank you. Actually, almost connecting, and not just because a student in my course had a lecture on likelihood-free inference today: this looks like it could almost call for that kind of approach, in the sense that the prior is intractable if your utility is complicated; how do you normalize it?

Yeah, right; all algorithms for Bayesian inference that I know of would not be applicable as-is in a scenario where the prior is not tractable. On a technical level, what algorithm did we use? For everything I was showing at the beginning, brute force was the algorithm, simply because things were low dimensional. For the application to real neurons it was a mix of rejection sampling and similar tricks; it becomes black magic how you actually do it for high-dimensional problems. So I still see the main selling point of this as formalizing these questions about what we do with optimality theories and data; that we can do. How to technically carry it out properly, there is a lot of work to be done. That is why, in the paper, we really focused on low-dimensional examples where you can just grid the whole parameter space and do everything exactly, because it is in fact low dimensional.

Any more questions? There is something in the chat, but I think that might be from before. No, I have it here. Alex?

So one doubt I have is: this prior you get from the optimality theory, the beta you choose comes from the data, because you have to look at how consistent you are with the data. So if you then use this prior for posterior inference, is not the fact that you are using data to set your prior somehow tricky?
I am not completely sure, but I do not think we do anything different from what you would do with a hyperparameter of a prior in any other context. Either you make a hierarchical model, put a prior over beta as well, and do the proper Bayesian thing all the way down, or you set beta, which is a hyperparameter of the prior, by cross-validation. I think those would be the standard things you would do for hyperparameters anyhow.

Yeah, actually yes.

Right, of course, and this is exactly how we would properly use this framework. I did not talk about that detail, but of course you use separate data sets for that, and so on.

More questions? Well, if not, thank you very much. Thank you for taking the trouble to come in person, and thanks for the less-than-optimal season joke. Yeah, we should stop the recording now.