for today of this wonderful workshop. For the first talk this morning, we're very privileged and fortunate to have Professor Manfred Opper visiting us. I don't think Professor Opper needs an introduction in this community; he's made seminal contributions to topics that span the range of what this workshop is about, and this morning he'll tell us a bit about the replica method for approximate inference. So please. Thank you very much, I'm really happy to be here and would like to thank the organizers for inviting me. When I was invited I thought, hey, this is great, I can learn a lot of new stuff, and I'm not disappointed, there's a lot of new stuff going on. But then on second thought: goodness, I haven't really worked in recent years on structured data, I did a few other things, so what shall I talk about? And so I got the idea to speak actually about older work, if I'm allowed to do that. This is older work with Dörthe Malzahn; it came out of some work that we did on approximate resampling methods, but possibly got a little bit overlooked. It was briefly mentioned on the first day by Professor Takahashi, and I hope that I can give a little bit of a complementary point of view on the question of Gaussianity. So let's see if that could be useful. But this doesn't work... oh, it does work, yes.
Okay, so my motivation came of course from earlier work on statistical physics using the replica method, and I still believe it's a very powerful method. I know that by many of you the results have been made rigorous in many cases, not the technique itself, but I still believe it's a very nice tool that can give you answers, and in this talk I will combine it with an approximate inference method. The first typical ingredient, the way the replica method is used very often, is that we need a tractable input distribution that somehow leads to Gaussian densities of certain random variables by Gaussian equivalence, and at the end that allows us to do explicit expectations using Gaussian integrals. The second typical ingredient, which always made me a bit worried when it comes to comparing to real data, is that we also need a tractable target model, usually called the teacher. In the past of course there have been so many teacher-student scenarios discussed, mainly in simple neural networks, and we saw yesterday that it can be done in a much broader way, but it seems we have to define a teacher as well, and that, together with the input distribution, leads to some tractable random variables.
And so if you want to do a comparison on real data, as we also saw yesterday, this is nicely possible in the unsupervised learning case, but maybe it's a bit non-trivial in the supervised case where we need this teacher. I read a very nice paper on learning curves for generic feature maps, which was discussed yesterday as well, for realistic data sets within a teacher-student model. There it was shown that defining an appropriate teacher is, I think, okayish in the regression case, but already on real data for some classification cases there were significant deviations between computed learning curves and the ones on real data. So I felt maybe this older work could give some alternative picture on that. In this talk we try to avoid these analytical averages over the data that usually lead us to these Gaussian random variables, and instead we use an approximation to the replicated and averaged measure, and the approximation is Gaussian. I mean, there will be Gaussians of course, they have been so successful, but they come in from a different point of view. So we use an approximate inference method, the variational approach, to approximate the replicated and averaged measure over the random variables of the students, if you want, and we always leave the expectations over data and the task within the formalism and work on that average at a final stage. As I said, we focus on the Gaussian approximation of that measure. And I hope nobody is really shocked that this work is from 2005, still in the new century, not in the old one. We haven't done essentially anything new on it since, but we present the application to older machine learning models like Gaussian process regression and support vector machines, essentially kernel things, the kernel world.
Okay, so I just give a brief review of the standard way, which many of you possibly know: statistical mechanics applied to single-layer neural networks. Usually we work with a statistical ensemble of networks W that is defined by some Hamiltonian H, which is the sum of a loss function over the training data plus a regularization term; the regularization can also be viewed as a Gaussian prior measure over the couplings of the neural network, and the lambda_mu are the activation functions. That formalism can be used to model a Bayesian approach: if we set the inverse temperature parameter beta equal to one and interpret the loss function as a negative log-likelihood, and we're interested in a Bayesian framework, then this gives us the posterior of the network parameters W, and a student who wants to make predictions would use the conditional expectation of the weight vector given the data under this distribution. But, as is also important for applications, very often we don't want an ensemble but rather the minimum of the cost function; then we take the beta parameter to infinity, and the measure P of W given the data concentrates on that point. So this is the standard thing, and very often we use two simple assumptions. Assumption one: the inputs x_mu to the network are D-dimensional vectors, and we assume they're independently generated at random. In the simplest case the components of each vector are again IID random variables with zero mean and unit variance, and of course the very simplest case is that they're all Gaussian, but even in the old days we believed that maybe the IID assumption alone is sufficient to show Gaussianity for the activation functions.
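As a quick illustrative sketch (my addition, not from the talk): this Gaussianity of the activations under non-Gaussian IID inputs can be checked numerically. Here the input components are plus/minus-one coin flips, so decidedly non-Gaussian, yet the activation lambda = w.x / sqrt(D) still comes out close to Gaussian by the central limit theorem argument above.

```python
import numpy as np

# Illustrative sketch (not from the talk): check numerically that the
# activation lambda = w.x / sqrt(D) is close to Gaussian even when the input
# components are non-Gaussian IID variables -- here +/-1 coin flips with
# zero mean and unit variance -- as the central limit theorem suggests.

rng = np.random.default_rng(0)
D, n_samples = 500, 20_000

w = rng.standard_normal(D)
w *= np.sqrt(D) / np.linalg.norm(w)        # normalize so that |w|^2 = D

X = rng.integers(0, 2, size=(n_samples, D)) * 2.0 - 1.0   # IID +/-1 inputs
lam = X @ w / np.sqrt(D)                   # one activation per input vector

# For a standard Gaussian: mean 0, variance 1, excess kurtosis 0.
mean, var = lam.mean(), lam.var()
excess_kurtosis = ((lam - mean) ** 4).mean() / var**2 - 3.0
print(mean, var, excess_kurtosis)          # all close to (0, 1, 0)
```

The same check with other zero-mean, unit-variance component distributions gives a similar picture, which is the intuition behind the Gaussian equivalence assumption.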
Assumption two is that the training labels are generated by some kind of teacher or target rule: this is given by a weight vector W_T and the probability distribution of the labels given the teacher activation lambda_T. Under these assumptions we are interested in computing the log partition function, the free energy, which can then be used as a generating function for certain expectations: expectations over the ensemble, but also expectations over the data, when we compute the data expectation of the log partition function. We use the replica trick in order to compute that: the expectation of log Z can be written as a derivative of the log of the expectation of Z to the power of R, where, treating R as a real variable, we can take the limit R to zero, take this derivative, and we're happy. In practice we are only able to compute this expectation for integer R, and then we try to bring our computations into such a form that we can easily do an analytical continuation, in the simplest case in the so-called replica-symmetric framework, and in the result take the limit R to zero. Of course, in doing the expectation of the replicated partition function we can already use something very simple where we did not make any Gaussian assumption: the fact that all the data and the labels are independent of each other. We have N data points, so by moving the expectation over data inside, we get the expectation over a single data point to the power of N, and the outer thing is the Gaussian prior measure, which I usually denote by P_0. In the applications where we have this nice Gaussian equivalence, we can perform the inner expectation by saying, okay, everything depends only on the lambda_a, which are linear combinations of the IID random variables x, and if we have a central limit theorem, or if the x's are Gaussian themselves,
then essentially the high-dimensional integral over the x becomes a finite-dimensional integral over jointly Gaussian random variables lambda_1 up to lambda_R, where R is again the number of replicas and lambda_T is the teacher. They are jointly Gaussian, and here is the sum over the labels, conditioned on the teacher lambda. This is something that we usually do, and we write it as the exponential of G_1 of Q, where Q is the covariance matrix of that multivariate Gaussian; I think most of you, or many of you, have seen that calculation before. The only thing that remains is of course to integrate over the weights conditioned on these values Q, which are again jointly Gaussian with mean zero and covariance Q_ab. Then we can express the whole expectation of the replicated partition function as an integral over these matrices Q, where inside we have an integral over all replicated weight vectors with those inner products of the W's fixed. So the W's here are still coupled to each other, but only in a macroscopic way, through the conditions that the Q_ab are fixed; it's really a kind of mean-field model with very weakly coupled W's, so essentially, if you take a finite set of the couplings, they are essentially independent, they would even be Gaussian. Resulting from this Gaussian average, you have a Gaussian distribution over the W's in this case, and so in practice we can decouple the whole thing using what I would call large-deviation techniques, represent this whole integral as something that is exponential in the dimensionality of the weight vectors, and finally use saddle-point methods to find the most likely Q. The only thing we still need to do is work out what's going to
happen in the limit R to zero; we have to say something about replica symmetry or more complicated structures. Essentially, replica symmetry means saying: if two replicas, two students, are different, all these matrix elements in the matrix Q_ab are equal to a common value; if a is equal to b we have the self-overlap; and there is also an overlap between a student and the teacher. We plug that in, and the high symmetry of the joint Gaussian distribution allows us to rewrite the whole Gaussian average: we can decompose the Gaussians into a bunch of independent ones plus the ones that couple them. The important thing is that we then have an inner Gaussian integral raised to the power of R, and now it seems we can analytically continue R to real values and live happily ever after. So that is essentially the very old story of how this is done if we first do the average over the input data X.
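Written out compactly (my reconstruction, in standard notation, of the identities just described; D is the input dimension and a, b label replicas):

```latex
\mathbb{E}_{\mathcal{D}}\big[\log Z\big]
  = \lim_{R\to 0}\frac{\partial}{\partial R}
    \log \mathbb{E}_{\mathcal{D}}\big[Z^{R}\big],
\qquad
Q_{ab} = \frac{1}{D}\,\mathbf{w}_a\cdot\mathbf{w}_b,
\qquad
\text{RS ansatz:}\;\; Q_{aa}=q_0,\;\; Q_{ab}=q\ (a\neq b),\;\; Q_{aT}=r,
```

where r is the student-teacher overlap; continuing the resulting expression in R and taking R to zero is exactly the replica-symmetric step described above.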
Okay, so now what we want to do is postpone that average a little and not think too much about Gaussianity or Gaussian equivalence of inputs. Let's be a little bit motivated by this result: what comes out is essentially weakly coupled W's, and maybe there are many more cases where the W's are weakly coupled and become essentially independent Gaussians. So we say, okay, we approximate these replicated posteriors by independent Gaussians to start with; maybe we can use more complicated Gaussians later. The idea is then: again we have training data; start fresh, and I now call the total number of them M, for a reason. We assume these are drawn independently at random from some joint distribution of inputs and labels, and we work on that later, but maybe there is also a goal, when we want to make comparisons with real data, of saying: maybe we can sample them from some real data by splitting the entire data set into training and test and then doing cross-validation. So this distribution is something we take as an input to the computation. Okay, so then we have this Hamiltonian for M data; here again is the regularization, the prior, and to make things look a little nicer in the computation I add a Poissonization trick: I don't want to talk about a thermodynamic limit, but let's say for a sufficiently large number of data it doesn't make a big difference to assume that M is not fixed but is also a random variable with a bit of fluctuation. So we assume it's a Poisson random variable, and we fix N, the number of data on average; that makes the end result a little nicer. Remember, this is what we get if we replicate the partition function: we have an inner average that comes from the independence of the data, so there's something to the power of M, and here is the average, the
expectation over essentially a single data point. So we have something to the power of M, but we would like to have something nice in the exponential that we can use for the variational approximation, so we use that artificial Poissonization of the number of data points and write this replicated, averaged partition function in a new way, where we have an effective Hamiltonian for the R replicated variables. This H is now given essentially by an expectation over a single data point, where we have the loss function; so it's a relatively simple way of writing things down, and it's nice to have this expectation in the exponential, as you'll see in a minute. Now, okay, at some point we have to go further and do averages, but in general, let's say in finite-dimensional cases for general non-Gaussian x's, we cannot solve or simplify this. So, in a sense, we have this replicated model that defines an intractable distribution over these weight vectors W, and we treat it as a variational inference problem: we have this intractable problem, defined by the prior part and this intractable Hamiltonian, and we would like to approximate it by a tractable density; of course Gaussians are usually a good choice in the case where the W's are continuous random variables. Q would have a theta in it, and theta would be the so-called variational parameters. Before, we had a problem where order parameters came out of the calculation and we found their most likely macroscopic values; in this case we introduce from outside certain variational parameters theta, which we have to optimize when we want to optimize the approximation, so they replace, in this formalism, the old order parameters.
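Concretely, the Poisson average can be done in closed form (my reconstruction from the description above; ell is the loss, beta the inverse temperature, P_0 the Gaussian prior): averaging t^M over M drawn from a Poisson distribution with mean N gives e^{N(t-1)}, so

```latex
\mathbb{E}_{\mathcal{D}}\big[Z^{R}\big]
  = \mathbb{E}_{P_0}\!\Big[e^{-\mathcal{H}_{\mathrm{eff}}(\mathbf{w}_1,\dots,\mathbf{w}_R)}\Big],
\qquad
\mathcal{H}_{\mathrm{eff}}
  = N\Big(1-\mathbb{E}_{(x,y)}\Big[e^{-\beta\sum_{a=1}^{R}\ell\big(y,\lambda_a(x)\big)}\Big]\Big),
```

which is exactly the single-data-point expectation sitting in the exponent that is described here.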
So then, for fixed R, we would optimize these variational parameters by minimizing the Kullback-Leibler divergence between the measures Q and P, where P is the intractable one and Q is the tractable one: essentially the integral of Q times log Q over P, over all W's. This means we get a negative entropy term for the approximating distribution, and we get something nice because we have to work with log P: by taking the log of this thing, we bring the intractable Hamiltonian down, and we have to calculate the expectation of this object. There is also the expectation over data, but now we first do the expectation over the tractable approximating distribution and keep the average over the data for a second step. Of course, if Q is a Gaussian, then we get Gaussian integrals, which replace the Gaussian integrals that we had in the standard formulation. So as an exercise, go back to this single-layer network, and we start with a factorizing Gaussian approximation: all weight vector components are independent, but there is a dependency between the different replicas of each individual variable. We say, okay, they have a mean, of course, but all these means are the same for all the replicas; that is the replica-symmetric assumption, or the simplest version of it. We also work with an R by R covariance matrix that has the replica-symmetric structure we used in the other approach, essentially copy and paste. And now, if we do the computation of the expectation of that term, where before we had an expectation over data, we now have the expectation over the approximate distribution, and with a little bit of work you see: oh, there's the same structure, a bunch of Gaussian integrals, essentially a Gaussian integral to the power of R, and we can also later take the replica limit R to zero.
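As a toy numerical illustration of this variational step (my sketch, not from the talk): fitting a factorized Gaussian q to a correlated Gaussian target p by minimizing KL(q || p) has a well-known closed-form answer, with the optimal mean matching p and each variance given by the inverse diagonal entry of p's precision matrix, so the mean-field fit underestimates the marginal variances whenever p has correlations.

```python
import numpy as np

# Illustrative sketch (not from the talk): minimize KL(q || p) over a
# factorized Gaussian q when the target p is itself a correlated Gaussian.
# Standard mean-field result: the optimal q matches the mean of p, and each
# variance equals the inverse of the corresponding diagonal entry of the
# precision matrix of p.

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])          # correlated target covariance
Lam = np.linalg.inv(Sigma)              # precision matrix of p

m_opt = mu                              # optimal variational mean
s_opt = 1.0 / np.diag(Lam)              # optimal variational variances

def kl_factorized(m, s):
    """KL( N(m, diag(s)) || N(mu, Sigma) ) for the toy target above."""
    d = len(mu)
    diff = m - mu
    return 0.5 * (np.trace(Lam @ np.diag(s)) + diff @ Lam @ diff
                  - d + np.log(np.linalg.det(Sigma)) - np.sum(np.log(s)))

# The optimal variances sit below the true marginal variances diag(Sigma):
print(s_opt, np.diag(Sigma))
```

Perturbing either the mean or the variances away from this optimum increases the KL divergence, which is one way to sanity-check the closed form.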
And I haven't talked about any thermodynamic limit here, but we see there is a little bit of a dependency on these variational parameters, rho, which has the m_i squared and xi_i squared in it, and then we can also take the data average. If we take the standard teacher-student scenario from before, we can carry out the average over the data, and we get agreement with the previous calculation, say if we simply take the x's to be plus or minus one; then all the nasty x dependency goes away, and if we take all the teacher components to be equal to one and argue a bit with equivalence, possibly that's okay, so we essentially get the same result as before. I think if you used other distributions you would have to discuss again why this might work. So in some sense you can recover the old result by assuming that there are actually very weak couplings, or no couplings, between the weights after the replica average, and so on and so forth; so maybe this is a good variational approximation. Now, to do this for something slightly more complicated, we looked into so-called Gaussian process models, which, if you want, are nothing but single-layer networks in a sense. You work again in the Gaussian prior world, where you have a Gaussian distribution over the weights, and if we define functions of the inputs as linear combinations of the weights with certain given features phi_i, then this induces a Gaussian distribution over functions. Usually this is written as: f is a realization of a Gaussian process with zero mean and kernel K, and the kernel K is essentially this weighted sum over features. Then you can show that in practice all the calculations depend essentially only on the kernel and not on the individual features, and you go into a function representation where you say: I have a posterior distribution in function space, conditioned on the data,
which is given by a prior over functions that has this covariance kernel K, and there is a likelihood term. Lots of people have done work on this; I think it's still a popular method used in all kinds of applications, and we wanted to play with this slightly more complicated model. So this is a bit of an illustration: these would be samples from the Gaussian process prior, and this blue thing is essentially the prior variance of the function. If we now have one, two, three, four data points, we see the posterior concentrates more, especially where we have data, but there is still some posterior uncertainty, and those would be samples conditioned on the data. So this can fit functions, and these functions can be non-linear, so we are not restricted to simple linear cases. Now, if we do this program of a Gaussian variational approximation for the replicated Gaussian process model, we have again that measure and this effective replica Hamiltonian; it's just the same story as before, we have this expectation over x and y of e to the minus sum over replicas of the loss, very similar to before. But since we are now working in function space, there are of course dependencies in the prior between two different input points, so it makes sense now to work with correlated Gaussians. That's what we did: we assume there is a Gaussian process measure that approximates this measure over replicas, where essentially we have to define a mean function and a covariance kernel for the approximating model that has the replica-symmetric ansatz in it. The interpretation of these variational parameter functions: r of x would be the data-averaged posterior mean of the function at an input point x, and the other two would be the correlation of the expected function at two sites, or a kind of self-
overlap, so they have an interpretation. Of course in practice it's a bit horrible, right: we now have to deal with functional variational parameters, so in practice we make a few further approximations; there will be operator inverses, but we make approximations that are justified for a larger number of data, while still not assuming explicitly any thermodynamic limit. So we work with that formalism. The nice thing about working with these replicated distributions is that we can easily represent not only expected quantities, like the expected loss here, which we can evaluate in terms of the variational parameters, but also fluctuations of losses, easily, in the replica picture. If we are for instance interested in the expected loss squared, averaged over data sets, we need four replicas and take the replica limit to zero, but since all averages are Gaussian, they can be evaluated approximately in this framework. So here is just an evaluation on simple one-dimensional toy data; that means the input to the whole system is one-dimensional, with a uniform distribution over input variables, so very far away from the original Gaussian setting. The loss, as I said, is Gaussian process regression, so it measures essentially the square loss between the predictor and the y, and the kernel is not a linear kernel, so it's not really one of the old problems we had; it's a periodic radial basis function kernel that can fit nonlinear functions. Here are two comparisons: we simulated functions from the prior, added a bit of noise to them, then did Gaussian process regression and compared the empirical training error and generalization error with the predicted results that we got out of the replica theory. It's a very simple toy scenario where, because we have a uniform input distribution, a lot of things don't actually depend on x, but we can also get decent results for sample
fluctuations of measures, for instance of the training error: how does it fluctuate between data samples, and also the uncertainty of the generalization error. So the replica picture is very nice; it doesn't assume that everything is self-averaging, and you can get fluctuations decently in this framework. Okay, so now: is there anything we can do with real data? There's an interesting thing: if you go into this computational framework with these approximating Gaussians, you might be interested in evaluating certain functions that depend on posterior averages in a complicated way, and you want to evaluate them on training data; this sum (sorry, that should be a mu) runs over all the training data. So we want to know: if we have an average over training data, how can that be converted into something over test data? If we go into the formalism, it's a bit complicated, but essentially we can transform it into something that contains the average over the data-generating mechanism. We haven't specified what that is, but within the Gaussian approximation we can relate an empirical average to an average over (x, y) drawn at random from the data generator. This is a complicated thing involving a bunch of Gaussians, but for certain quantities it can be evaluated, and there's one example that we tested it on. This is the Bayesian uncertainty of the prediction of the function at the data points in the training set, which should usually be small, and here is something that relates it to the variance, the uncertainty, of the prediction at arbitrary input points x, averaged over the density of input points; that would be the prediction that this method gives. So we can simply take a data set, split it into test and training data, and check whether this relation is fulfilled, and you see it seems to be fulfilled very well, at least for the data sets that you would use in the early 2000s,
so they're not really huge, but it seems possibly if they're bigger it even gets better. So it means this method leads us to relations between certain things evaluated on data in the training set, and they can be represented, as I said, with this expectation over the data-generating mechanism as a part of the formalism; we don't perform it analytically, but we use data splits for doing that. Okay, how much time is left? Okay, fine. As a relative of Gaussian process methods, we also looked at support vector machines, which can be viewed as a sort of Gaussian process model. We have the problem of a classifier with binary labels plus or minus one; there is also an implicit feature representation in terms of weights, and support vector training can be formulated as minimizing the length of the weight vector such that the margin terms are always greater than one, so it's also known as a maximal margin classifier. The nice thing is that, based on the kernel trick, everything can be expressed in terms of kernels; in many cases that would be a sum over infinitely many features, but we can do everything with kernels. At the end of the day, to relate this to what we did before with Gaussian processes, we can work with a pseudo-posterior P over weights, using a pseudo-likelihood and a prior that essentially makes the prior over weights smaller and smaller, which enforces the minimization in the limit epsilon to zero. So we work with this pseudo-posterior, do all the things that we did before, then take the limit epsilon to zero to reach the support vector limit, and finally we rewrite everything in terms of measures over function space. Then we have again the replicated measure over functions, with the prior measure and, as before, something that has this expectation in front of it, so we can essentially apply the results that we developed before, similar to Gaussian process regression,
and so some of the predicted relations would be, for instance, for the number of support vectors, the data points that end up on the margin: this can be represented by an expectation over the data-generating mechanism, where Phi is the cumulative Gaussian and the expression involves a bunch of what I would call the order parameters, or the variational parameters, which can also be computed from the data sets. Or there would be the zero-one loss, the number of misclassifications; you can do more complicated things, but they would fill the whole page, and that's why I didn't show them. So again, we can try this on real data: there was a data set which is nine-dimensional here, not infinite-dimensional, and we get decent approximations. Here this one is not so nice, but for a small number of example data the approximation is possibly not so good. So yeah, essentially this is all I wanted to say. I tried to explain a framework where we approximate the replica measure for a given data distribution, without performing the average over the data explicitly and without making an explicit Gaussian assumption; the Gaussian assumption comes in, of course, through the variational approximation. What I also liked was that we did not have to define explicit teachers in the supervised learning case, and the approach predicts certain relations between expectations on training and test data that could be of interest, and it can be applied to kernel machines. And of course: can this be useful? I don't know. First of all, can it be extended to interesting things? We all have different interests; I like dynamics, and I always said, you know, why can't we apply this to dynamics to get something that we can compare on real data? That's what I want to do. Well, neural nets, who knows?
I mean, we have Bayesian neural nets, essentially, where people approximate posteriors over network weights using variational techniques and make all kinds of approximations. Why not put replicas on top of that approximation and do something with it? Maybe it's possible. And of course, maybe there's something for the mathematically minded among you: I don't know when this is good or bad, or whether it might even become exact under certain scenarios in the thermodynamic limit; is there something that can be said in those cases? And of course there will be hidden assumptions on the data distributions: when does this lead to effective measures over these weights W? I sort of circumvented that problem, but what I got was something that might be more easily computable, without assuming specific teachers. Of course, it's easy to construct counterexamples. Essentially, if you want to do it in a practical situation, the idea would be to sample data from the generating mechanism, or relations between training data and held-out data; but let's say we have a relatively small data set, then essentially you sample from a discrete distribution. We had done work on that before, which was also mentioned in Professor Takahashi's talk: we would have to introduce extra random variables saying, oh, is this guy inside the training set or not? And then it turns out the simple Gaussian approximation is not good; we have to do something a bit more advanced, for instance using expectation propagation, so we have to go beyond the Gaussian approximation. So it's not something that always works, but yeah. And then at the end of the day I thought, oh, maybe what I'm doing here is a little bit related to somehow averaging over TAP equations, but I don't know the answer. And I think now I'm done with my talk. Thank you very much. Thank you very much. Very good talk.
Yes, one note: a good point of the variational approach is a guarantee on the approximation, for example we can have a bound or something. Yeah. Yes, yes, this is my question: at R equals zero it does not hold, okay? So we cannot expect it to give bounds at R equals zero. Okay, thank you. Yeah. Thank you a lot for the talk, it was very interesting. First, let me say: sorry if we have overlooked this work. I think it's very interesting, I personally didn't know it, and I think some of the approximations you make are related to some of the stuff that we, and some colleagues, have been doing with kernels, so we should definitely try as a community to fix this overlooking. I have two small, maybe more technical questions. The first one is: if I understood correctly, the average over the teacher, you just defer it to the end. Yeah, and we never do it explicitly. Yeah. But when you want to plot these learning curves, somehow you need to take this average to be able to plot the theoretical curves, so how do you do that? Do you approximate it on the whole data set, as a population average over the whole data set? Okay, thanks. And the other small question was, I was really curious about this grand-canonical approximation, this Poisson trick, where you let the number of samples fluctuate. Do you have any intuition for why this is important, why it makes things easier? Okay, I'm going to take a look; looks like a nice trick, thanks. Sorry, I had the same question, but that was his second question. No, it's fine. I guess I have maybe a naive question: is there an example of a system where you understand this approach to be exact, but maybe it's not exact if I don't consider the variational approximation? Is there a simple example of a model where you understand this to be asymptotically exact?
Then it reproduces the replica calculation, right? Okay, if there are no more questions, let's thank Professor Opper again.