Thank you very much for your interest and for inviting me. This is joint work with my postdoc Burak Çakmak, and as you see it is about analyzing an algorithm. It is not about analyzing deep learning, well, maybe in the future, I don't know, but something much simpler: it is related to probabilistic approximate inference.

Just to remind you of the background: in data science you have a probabilistic model, y are observed quantities, x are things you don't know but for which you have a prior distribution, and you compute posteriors. The usual problems are to marginalize out the unobserved quantities, which gives you some kind of partition function or the likelihood of the data, or to integrate out all variables except variable i, which gives you the marginal distribution of variable i. If the model is not Gaussian or completely factorizing, you cannot do these computations exactly, and people have come up with many methods for computing them approximately. That is essentially the background.

The problem I am going to tackle is to analyze a toy model for approximate inference. It goes back to the TAP equations, which have been mentioned a couple of times this morning, also by Mark, and the goal is to understand the typical dynamical properties of the algorithm and, maybe, to understand what it does. So the outline is: I concentrate on a toy example, the simple Ising model, and I introduce as approximate inference method the so-called TAP equations. In order to make the data a little more structured, not completely random, I work with a model of coupling matrices that come from a random orthogonal ensemble. Then I define an algorithm for solving these TAP equations and analyze it with a tool that has been used for many years in spin glass physics, the method of dynamical functionals. It turns out we can get a nice analytical solution, and I think we can also get an understanding of why the algorithm works well and converges nicely. As I said, the toy example is not an example with or for real data, but we hope it makes sense to look at the simplest case first.

As for related work, my motivation comes from a talk given a long time ago by Yoshiyuki Kabashima on approximate message passing algorithms: you come up with a belief propagation algorithm, you take the dense limit, and then you can analyze the dynamics of this belief propagation. More recently I was introduced to an algorithm by Bolthausen for solving the TAP equations for the SK model, the Sherrington-Kirkpatrick model. Later there was rigorous work on so-called approximate message passing algorithms for random i.i.d. matrices, and a generalization to more correlated matrices by Rangan and others. All of these results show that there is some kind of density evolution: there are time-dependent order parameters, and they typically satisfy a simple one-step update. My question, already back in 2003, was: how on earth can that be? There is some random matrix, we do nonlinear operations, we define a dynamics of a variable involving this random matrix, and we all know there must be a lot of metastable states, there must be glassiness or things like that, it must be horrible.
But things seem to converge nicely, so what makes it converge, what makes it simple? That was one of the questions I tried to understand. So back to the simplest model: the Ising model with couplings J_ij, binary spins ±1, and external fields h_i. I even set the h_i equal to a constant h, which means I can look up results from the replica calculations in the literature and compare to the simplest case. So there is no real data here; there is a coupling matrix J_ij, and the inference problem we try to solve is essentially computing the magnetization of each spin, at least approximately, for a large system.

That brings us to the TAP equations, named after Thouless, Anderson and Palmer, which give a set of nonlinear self-consistent equations for the magnetizations. This is valid, in fact you get the exact result in the limit n to infinity, for the Sherrington-Kirkpatrick model, that is, random couplings whose strength scales like 1 over the square root of n, in the paramagnetic phase. These are the kinds of things I would now call inference: computing marginal statistics, the moments, of this Ising model. So this is our goal: we want an algorithm that converges to the solution of this kind of fixed-point equation. People tell me that if you just do a simple iteration, putting time t on the right-hand side and t+1 on the left-hand side, that isn't a good idea, it will never converge, and there are more interesting ways of doing it.

In order to go beyond the simple random matrix J_ij with completely independent couplings, and to make things a bit more interesting by covering correlated couplings, I introduce a more structured ensemble of random matrices, known as the rotationally invariant ensemble. That means I write the matrix J as an orthogonal matrix O times a diagonal matrix times O transpose, and assume that O is a random orthogonal matrix, a random rotation, a Haar matrix. We assume that the diagonal elements, the eigenvalues, converge, i.e. the density of eigenvalues converges to some nice density in the limit n to infinity. In this case the J_ij can be genuinely dependent random variables. Of course this doesn't work for sparse matrices, so we are in the realm of dense matrices. The completely independent couplings of the Sherrington-Kirkpatrick model are a specific case of this.

A specific example was actually introduced this morning in the second part of Mark's talk: you can generate such an ensemble by using an outer-product structure in the coupling matrix and assuming that the variable X has a layered structure, so you multiply different random matrices X with each other and assume that these X's have i.i.d. entries. I think the major difference from what Mark explained this morning is that I explicitly assume that all these X's are Gaussian and not binary, but at the end of the day, if you analyze the corresponding TAP equations, you get the same result, so it doesn't matter whether they are discrete or continuous. So this is one example of such a rotationally invariant matrix ensemble.
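As a small illustration, here is how one could sample such a rotationally invariant coupling matrix numerically. This is my own minimal sketch, not code from the talk: the helper name haar_orthogonal, the system size, and the placeholder spectrum (a semicircle, which also recovers the SK case) are assumptions for illustration only.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Haar-random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))  # sign fix so the distribution is exactly Haar

def rotationally_invariant_J(eigenvalues, rng):
    """J = O diag(eigenvalues) O^T with O a random rotation (Haar orthogonal)."""
    n = len(eigenvalues)
    O = haar_orthogonal(n, rng)
    return O @ np.diag(eigenvalues) @ O.T

rng = np.random.default_rng(0)
n = 512

# Placeholder spectrum: eigenvalues of a GOE matrix (semicircle on [-2, 2]).
# Any fixed eigenvalue density with a well-defined n -> infinity limit would do.
X = rng.standard_normal((n, n))
eigs = np.linalg.eigvalsh((X + X.T) / np.sqrt(2 * n))

J = rotationally_invariant_J(eigs, rng)   # dependent, dense, rotationally invariant couplings
```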
So what are the TAP equations for this case? They have been studied for quite a long time; in 1995 the result was given by Parisi and Potters. They showed that the magnetization is the hyperbolic tangent of a field gamma_i, where gamma_i is the simple, naive mean-field term plus an Onsager reaction term, that is, a reaction of all the other spins caused by their interaction with spin i. The constant in front of that term is the only place where the specific ensemble enters: it is R_J, known as the R-transform of the random matrix ensemble. It is just a number that tells you which specific ensemble enters here. To briefly explain this R-transform: we know the Green's function of the matrix J, which depends only on the spectrum of the ensemble, and the R-transform is essentially the functional inverse of the Green's function with 1/z subtracted. So you can get it from the spectrum by computing the Green's function and doing an inversion. That's essentially it: just one extra number you have to compute, the number that makes one matrix ensemble different from another. This is an old result from 1995.

So here is our algorithm. Of course it didn't fall from the sky; it is related to the so-called VAMP, vector approximate message passing, algorithm, but since that was not originally applied to TAP equations we modified it a little, and it looks like this. Coming back to the TAP equations: there is this field gamma_i, and once you have gamma_i you can easily compute m_i, so what we need is a dynamics that computes these gamma_i. What is the dynamics? We start from a random configuration, a small technicality: the u_i are Gaussian, so it is a Gaussian initialization. And here is the essential iteration: gamma is computed by multiplying a specific matrix with gamma tilde, passing it through some nonlinearity, and getting a new value. This looks a little like a recurrent neural network: you have a coupling matrix A, you multiply a vector by A, then you pass it pointwise through a nonlinearity, and the nonlinearity is essentially this tanh minus gamma_i. The matrix A has a somewhat strange definition; I will come back to why it should be like that. A few things have to be precomputed before you run the algorithm: you need these R-transforms, but you can compute an approximation from a concrete matrix. That is essentially the algorithm that should do good things.

So why this, and what could we expect? First of all, we want to analyze it using tools of statistical mechanics in the limit n to infinity. Remember that the definition of the algorithm contains the coupling matrix J; J defines the matrix A, and if J is rotationally invariant then A is also rotationally invariant. In order to analyze this kind of thing we use a technique that has been used for many years in the spin glass world: the so-called generating functional technique introduced by Martin, Siggia and Rose.
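To make the "compute an approximation from a concrete matrix" remark concrete, here is a minimal sketch of how one could estimate the R-transform numerically from an empirical spectrum, using the definitions just given: G(z) = (1/n) sum_k 1/(z - lambda_k) and R(s) = G^{-1}(s) - 1/s. This is my own illustration; the function names, the bracketing strategy for the functional inverse, and the restriction to real arguments to the right of the spectrum are assumptions.

```python
import numpy as np
from scipy.optimize import brentq

def green_function(z, eigs):
    """Empirical Green's function (Stieltjes transform) G(z) = (1/n) sum_k 1/(z - lambda_k)."""
    return np.mean(1.0 / (z - eigs))

def r_transform(s, eigs):
    """R-transform R(s) = G^{-1}(s) - 1/s, with the functional inverse found numerically.

    Assumes s > 0 and searches for the inverse on the real axis to the right of the
    spectrum, where G decreases monotonically from +infinity to 0.
    """
    lo = eigs.max() + 1e-9
    hi = eigs.max() + 1.0
    while green_function(hi, eigs) > s:   # expand the bracket until G(hi) < s
        hi *= 2.0
    z_inv = brentq(lambda z: green_function(z, eigs) - s, lo, hi)
    return z_inv - 1.0 / s

# Sanity check: for a GOE (semicircle, radius 2) spectrum, the R-transform is R(s) = s.
rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, n))
eigs = np.linalg.eigvalsh((X + X.T) / np.sqrt(2 * n))
print(r_transform(0.3, eigs))   # close to 0.3, up to finite-size effects
```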
I think the next applications were by Sompolinsky and Zippelius for the SK model, and there was related work by John Hertz, Hajo Sommers, Crisanti and many others; it was something people did a lot in the 80s and early 90s. The idea is essentially this: we have a dynamical system, which defines a trajectory of length T over the dynamical variables, and they are coupled by a random matrix A. In order to compute properties in the limit n to infinity, you define a kind of partition function where you integrate over all these variables and put the dynamics into delta functions. From such a generating functional you can then compute the properties by averaging over the random couplings. The nice thing is that since this partition function is normalized to one, you don't even need replicas, and it all works nicely when you take a finite-length trajectory and then the limit n to infinity. The technicalities are that you replace each delta function by its Fourier representation, then you have to take certain expectations of expressions like that, and with a bit of random matrix theory you can do this exactly. At the end of the day you get a resulting non-random model in which the couplings between variables i and j are replaced by couplings of a single variable with itself over time, so you can calculate asymptotically the marginal distribution over single-variable trajectories. That is the idea.

With a bit of work, what you get is something like this. Looking at a single variable, I no longer write gamma_i but just gamma or gamma tilde: you sit at a specific site, gamma tilde is a nonlinear function of gamma, and gamma interacts with the past of gamma tilde through a memory kernel, with a Gaussian noise acting on top. So the interactions with the neighbouring spins are replaced by a self-interaction that extends over time, which is understandable: when I interact with my neighbour, the neighbour will in the next steps interact back with me, so there is an influence of my own past on the future of my field. Of course this comes from the symmetry of the coupling matrix; you would expect such a thing to appear. Forgetting all the technicalities, this memory term G hat is derived from a linear-response kind of argument: in the large-n limit the interactions are so weak that my influence on the neighbour can be treated by linear response, and that is how you get this linear-response relation. This memory term is related to an R-transform in a matrix space, really ugly stuff, and the Gaussian process that drives the whole system also has a horribly complicated correlation function.

And that could be the end of it, right? This is exact, possibly not rigorous, and we could work on making it rigorous, but it doesn't help you: you might as well just simulate the whole system, since it is as complicated as the analytical theory. We could stop here, because there are only very few models that you can solve exactly analytically. Of course, if the nonlinear function were not nonlinear but linear, you could do it.
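As a schematic summary of the structure just described (my own shorthand; the precise kernels are the complicated objects mentioned above, not reproduced here):

\[
\tilde\gamma(t) = f\bigl(\gamma(t)\bigr), \qquad
\gamma(t+1) = \sum_{s \le t} \hat G(t,s)\,\tilde\gamma(s) + \phi(t),
\]

where \(\hat G(t,s)\) is the memory kernel coming out of the linear-response relation (an R-transform over matrices indexed by time) and \(\phi(t)\) is a Gaussian process with a nontrivial correlation function \(\langle\phi(t)\phi(s)\rangle = C_\phi(t,s)\).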
If the random matrix were not symmetric, so that A_ij and A_ji were uncorrelated, then of course the memory terms would vanish, because the interaction in one direction is independent of the reaction coming back from the other side; things simplify, there is no memory, and you can solve the equations. This has been done for certain neural network models by Crisanti, Sompolinsky and others. The other solvable case is the spherical p-spin model, where you get closed-form integral equations for all these correlation and response functions. But that is not our case: we have just p equal to two, and in our case it is not even spherical. So what can we do? We simply go through the calculation.

Here is an example of how these things look if you try to solve such equations with memory using a Monte Carlo method, which we did many years ago. This is a response function: how does a spin at time 100 react to a small perturbation at earlier times? And you see, oh my goodness, there is even a response to the earliest time, which leads to a remanent magnetization, ugly things, and such an algorithm will not converge to anything nice. This is for the zero-temperature relaxation of an SK model.

Having said that, things are actually much nicer for this very specific function f, and what we get is, surprise surprise: the memory vanishes completely, there are no memory terms, and we get a very simple dynamics. The variable gamma tilde is driven pointwise by gamma, and gamma itself is a Gaussian whose covariance we can compute. So things are much, much simpler: a Gaussian fed into a nonlinear function completely describes the dynamics of a single variable.

Does it work? Here is a comparison between a simulated system, where the covariance at two time steps is calculated from a simulation, and the result of solving these nonlinear equations forward; the plot shows the squared difference on a log10 scale, and it is really small, for a two-layer Hopfield model with 10^4 spins. So it seems to work. And actually this is a comparison of a single simulation of the system, no averaging, a single trajectory, against the theory. The nice thing is that you can also calculate exactly the asymptotics of the algorithm: you can show that the convergence rate of the field gamma towards its converged value is given by a certain quantity, and interestingly enough, for convergence this quantity has to be less than one; the condition that it equals one is precisely the de Almeida-Thouless line. If you are on the other side of that line you get divergence, and the analysis is no longer valid. Comparing with simulations, it does a fairly decent job: the asymptotics computed from the theory are the red lines, and these are the simulations on a single trajectory; this value of 0.35 is very close to the critical point that we have, so it gets slower and slower in converging. So you see, this works, and I don't know, it's magic: why does it work? In order to get an understanding of why this may work, let me introduce one little thing that I learned about random matrix theory: the notion of asymptotic freeness.
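Schematically, the memory-free result reads (again my shorthand for what is shown on the slides, not the exact expressions):

\[
\hat G(t,s) = 0 \;\;\text{for } s < t, \qquad
\gamma(t)\ \text{Gaussian with covariance } C_\gamma(t,s), \qquad
\tilde\gamma(t) = f\bigl(\gamma(t)\bigr),
\]

and the distance of \(\gamma\) from its fixed point decays like \(\rho^{\tau}\) after \(\tau\) iterations; convergence requires \(\rho < 1\), and \(\rho = 1\) marks the de Almeida-Thouless line. The explicit form of \(\rho\) in terms of the matrices A and E follows from the freeness argument below.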
Mathematicians introduce the normalized trace in the limit n to infinity: you take the trace of an n-by-n matrix, normalize it by 1/n, and take the limit. You then define asymptotic freeness between A and B as follows: take a bunch of polynomials p1, q1, p2, q2, and so on, and form an alternating product, always one polynomial in B between two polynomials in A and one in A between two in B; the normalized trace of this alternating product is zero whenever the individual traces are zero in the limit. This has the flavour of statistical independence: if A and B were not matrices but ordinary random variables x and y, and you replaced this functional phi by an expectation, then the expectation of such a product being zero whenever the individual factors have zero mean is essentially the notion of independence. But here, instead of the expectation, we have the normalized trace, and we define a similar operation: it smells like independence, but it is not statistical independence; it is called freeness. In particular, since the matrices do not commute, we have to keep this specific alternating order. If the vanishing of phi holds whenever the individual factors have zero trace, we call A and B free. In particular, A and B are asymptotically free if their two eigenspaces are in generic position, for instance obtained by a random rotation, so this fits very well into the framework of rotationally invariant random matrix ensembles. One important result: take two, say deterministic, diagonal matrices Lambda and Lambda prime, and consider a random rotation; then the matrix A generated by applying a random rotation to Lambda prime and the matrix Lambda are typically asymptotically free. This is a result you can find in the literature.

Now let me come back to analyzing this specific algorithm, and you will see which ingredients we need. This is the dynamics of the original algorithm: again, multiplying a field by a random matrix and passing the result pointwise into a nonlinear function. I mentioned that the origin of the complications, the origin of the memory into the past, which is possibly also detrimental for the convergence of the algorithm and certainly very bad for analyzing it, is the fact that if I interact with my neighbour, my neighbour interacts back with me at a later time, and this back-interaction is related to linear response: how does gamma_i change if I make a small perturbation of gamma_j at a previous time step? So let us linearize the dynamics. By the chain rule, the linearized dynamics is composed of the matrix A and a diagonal matrix E at some previous time step: you differentiate the equations, you get an A, and you have to differentiate the hyperbolic tangent, and so on. The interesting thing is that by construction the normalized trace of the matrix A is zero, and it turns out that in the limit the normalized trace of the diagonal matrix E is also zero; they are constructed in such a way that both are trace-zero in the limit. Now suppose that A and E were actually free: A is a member of the rotationally invariant ensemble, and E is a diagonal matrix.
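Here is a small numerical illustration (my own sketch, not from the talk) of the two consequences of freeness used in the argument: for a trace-zero rotationally invariant A and a trace-zero diagonal E in generic position, the alternating trace phi(AEAE) vanishes, while the mixed trace factorizes, phi(A^2 E^2) ≈ phi(A^2) phi(E^2). The spectra chosen below are arbitrary placeholders.

```python
import numpy as np

def ntr(M):
    """Normalized trace phi(M) = (1/n) Tr M."""
    return np.trace(M) / M.shape[0]

rng = np.random.default_rng(1)
n = 1000

# Two fixed spectra, both centred so that phi(A) = phi(E) = 0.
lam_A = rng.standard_normal(n); lam_A -= lam_A.mean()
lam_E = rng.uniform(-1, 1, n);  lam_E -= lam_E.mean()

# Put them in "generic position" via a Haar-random rotation.
Q, R = np.linalg.qr(rng.standard_normal((n, n)))
O = Q * np.sign(np.diag(R))
A = O @ np.diag(lam_A) @ O.T        # rotationally invariant, trace zero
E = np.diag(lam_E)                  # diagonal, trace zero

print(ntr(A @ E @ A @ E))                              # close to 0 (alternating product)
print(ntr(A @ A @ E @ E), ntr(A @ A) * ntr(E @ E))     # approximately equal (factorization)
```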
Of course there is a little snag here: this is not proved, it is more of a heuristic, since E also depends on A, possibly in a very weak sense. We sweep that dependence a little bit under the carpet and assume, something that possibly cannot yet be proved, that A and E are free. Then we can simplify traces of products of A and E. If we look at the dynamical response function, we essentially go through the chain rule and multiply matrices A and E, always alternating A and E, and at the end of the day we take the normalized trace; with the freeness argument just given, this is zero. So we believe that the essential ingredients of the algorithm, the reason we can analyze it and the reason it becomes so simple, are the trace-zero property of A and E together with their assumed freeness; then the memory terms should be zero.

As a second result, you can also look at the asymptotics: how fast does the algorithm converge? With a similar argument you could say: if I kick the system a little in a random direction away from the fixed point and look at the linearized dynamics, again you have to multiply a bunch of A's and E's and take a trace. Assuming the initial perturbation was random, at the end of the day you have a big trace to evaluate, and if you evaluate it using the freeness assumption, what you get is essentially the normalized trace of A squared times the normalized trace of E squared, raised to the power tau, where tau is the number of time steps. This is precisely the decay we computed from the dynamical functional theory. So using the simple assumption of freeness, we can argue why there should be no memory terms in the theory, we can obtain the speed of convergence, and we recover the AT line. That is essentially my main message.

Here is again a comparison: this is the convergence, this is what is predicted by the asymptotics I showed, and this is an older algorithm, the one first given by Bolthausen, for the decay of the error in the case of the Sherrington-Kirkpatrick model. Now, having said that it is possibly the freeness of these matrices that plays the role, we tried to check the robustness a little by going away from the random rotations that we used explicitly in the dynamical functional approach, using an ensemble with much, much less randomness in it, given by coupling matrices formed from so-called Hadamard matrices. These Hadamard matrices are constructed deterministically, and there is only a little randomness left: we have this O tilde, which turns out to be an orthogonal matrix, but not a random orthogonal one; it is computed from a Hadamard matrix multiplied by random signs, so the matrix elements of the orthogonal matrix form just a (rescaled) binary matrix. Looking at the literature, one can still establish freeness for two matrices from that ensemble.
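A minimal sketch of such a construction (my own, with the exact recipe from the slides replaced by a plausible variant): take a deterministic Hadamard matrix H, multiply by a random diagonal sign matrix D, and rescale, so that O tilde = HD/sqrt(n) is orthogonal with entries ±1/sqrt(n); then build J = O tilde Λ O tilde^T from any fixed spectrum. The spectrum below is a placeholder.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n = 1024                                        # power of two, as required by the construction

H = hadamard(n)                                 # deterministic +/-1 matrix with H @ H.T = n * I
D = np.diag(rng.choice([-1.0, 1.0], size=n))    # the only randomness: random signs
O_tilde = (H @ D) / np.sqrt(n)                  # orthogonal, entries +/- 1/sqrt(n)

assert np.allclose(O_tilde @ O_tilde.T, np.eye(n))

# Couple it to a fixed (placeholder) spectrum to get a structured coupling matrix J.
lam = rng.standard_normal(n)
lam -= lam.mean()
J = O_tilde @ np.diag(lam) @ O_tilde.T
```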
So we would expect that if we go away from the completely random case of random rotations to this Hadamard ensemble, which has much less explicit randomness in it, we should still get decent agreement between the theory and the simulations. It is the same kind of plot: the predictions of the correlation functions against what you get from a single simulation of the algorithm, just plugging in these Hadamard matrices and computing the empirical R-transform, which depends only on the spectrum and nothing else. And it seems to work well, also in the prediction of the asymptotic decay. So it seems that maybe there is something behind this freeness assumption, the idea that the matrices we are using point in a sort of random direction relative to the eigenspace of the diagonal matrix, which is just the simple Euclidean one. Of course, I explained most of this only for the simple Ising model, but you can extend it to other algorithms that solve problems in probabilistic modeling, for instance a simple Bayesian classifier, and again you get a decay of the error based on the formula that comes from this random matrix assumption.

That brings me to the end of my talk, and I promised nothing deep here, nothing deep: it is just a simple solution of TAP equations for somewhat correlated coupling matrices. This is not the real world, and of course one should ask whether this is enough to treat real data, or whether we have to think about more interesting ensembles; hopefully we can carry some of what we learned over to the real case. In the meantime there are things we can still do with the theory. We will try to analyze sequential dynamics, which is interesting for practical reasons, maybe sequential dynamics in the sense of mini-batches or something like that. There is also something I don't understand at the moment: this is not a gradient descent algorithm. You could say, look, there is some TAP free energy, and for any free energy there should be a gradient descent algorithm, which would look completely different and possibly not converge. So what is the relation between these types of algorithms, which are handcrafted or motivated by belief propagation, and gradient descent dynamics? Are there energies, free energies, that we could analyze to understand why these things go so smoothly through their dynamics? The next step will be to note that these coupling matrices, which I introduced as coming from some ensemble, might in practice be learned by some learning algorithm, so on top of everything there will be something telling you how these J_ij change over time, and is there anything we could say about that? And finally, yes, of course, this is exact but not rigorous. As a physicist I wouldn't care too much, but I am working in a computer science department, and from time to time, if possible, maybe we can do some rigorous work, getting these things as mathematically clean as possible; I would be really interested in that. That is really the end of my talk, thank you very much.