Perfect. So this next talk will be given by Jean Tranquier, who is a PhD student at Ecole Normale, and it will also be about generative models and sampling, this time in a slightly different context.

Okay, so hello everyone, thanks a lot for the invitation. I'm going to talk about how we can use machine learning to assist Monte Carlo and do efficient sampling, and I will also show that it fails on some very complex problems.

The motivation is that we want to generate equilibrium samples from a given Boltzmann distribution that is high-dimensional and has a very complex landscape, for example in order to compute thermodynamic properties of the system such as the free energy, the entropy or the energy. This also has applications in many other domains; you can think of combinatorial optimization, for example, when you take the zero-temperature limit.

The universal strategy is to do local MCMC, for example the Metropolis algorithm: you have an acceptance rule, you start from a configuration, and you move from configuration to configuration using this acceptance rule. Usually what you do are local moves, so the new configuration you propose only differs from the old one by a single spin flip, and the hope is that by this local exploration you will explore the whole landscape of your probability distribution.

The problem is that there are hard-to-sample problems. Typically you have high-probability regions separated by low-probability regions, and if you do something local you are going to spend a lot of time going from one high-probability region to another. It is known that in some models the sampling time scales exponentially in the system size, so if you try to use local MCMC it is completely hopeless: you will never equilibrate.

Luckily there exist system-specific solutions: for example, you can try to identify sets of correlated variables and update them all together. But by definition these are system-specific, so you need to work a lot on each problem to find the right move, and it is possible that no such solution exists to speed up the sampling for the problem you are interested in. So the idea is to do something universal, in the same way that local MCMC is universal: you want to be able to apply it to every problem.

A recent line of research is to use machine learning to assist MCMC. The idea is that if you can learn a generative model P_AR(sigma), you use it to propose new moves: the transition kernel that tells you how to go from the old configuration to the new one is replaced by the probability of generating the new configuration under this generative model. This probability is independent of the old configuration, so it is a global move: the old and the new configuration are independent, the move is non-local, you can jump anywhere in the landscape. And the idea is that if you are able to get high acceptance rates, you get very fast decorrelation, because you explore the landscape very efficiently.
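To make the two kinds of moves concrete, here is a minimal Python sketch (not from the talk): it assumes ±1 spins, a user-supplied `energy(sigma)` function, and a generative model object exposing `sample()` and `log_prob()`. These names are illustrative assumptions, not an actual API from the work presented.

```python
import numpy as np

def local_metropolis_step(sigma, energy, beta, rng):
    """Single-spin-flip Metropolis move (sigma is a +/-1 numpy array)."""
    i = rng.integers(len(sigma))
    proposal = sigma.copy()
    proposal[i] *= -1
    dE = energy(proposal) - energy(sigma)
    if rng.random() < np.exp(-beta * dE):
        return proposal
    return sigma

def global_metropolis_step(sigma, energy, beta, model, rng):
    """Metropolis-Hastings move with an independent proposal drawn from a
    generative model exposing sample() and log_prob() (hypothetical interface)."""
    proposal = model.sample(rng)
    # log acceptance ratio: -beta*(E(new)-E(old)) + log P(old) - log P(new)
    log_acc = (-beta * (energy(proposal) - energy(sigma))
               + model.log_prob(sigma) - model.log_prob(proposal))
    if np.log(rng.random()) < log_acc:
        return proposal          # accepted: a fully non-local move
    return sigma
```

The second acceptance ratio is just the Metropolis-Hastings rule for an independent proposal; it goes to 1 whenever the model reproduces the Boltzmann weights exactly, which is why a good generative model decorrelates in essentially one step.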
More precisely, suppose we can find such a generative distribution over configurations. You want it to cover the support of the Boltzmann distribution well, because you want to explore the whole landscape and not just a part of it, and you also want it to be very close to the Boltzmann distribution. We can summarize those two conditions by saying that the Kullback-Leibler divergence between the Boltzmann distribution and your generative model should be small.

This is the first requirement. And of course, since you want to use this generative distribution to propose moves, you also need it to be easy to sample. One way of achieving this is the autoregressive decomposition: every joint probability can be written as a product of conditional probabilities, and you can learn each conditional probability with a neural network. With this decomposition, sampling becomes trivial: you sample the first spin from its marginal distribution, you inject it into the conditional distribution of the second spin given the first one, and you iterate until you have sampled the last spin.

So it is very easy to sample, and now you actually need to learn this distribution. As I said, you want it to cover the whole landscape and to be very close to the true distribution, so what you would like to do is minimize the Kullback-Leibler divergence between the Boltzmann distribution and the autoregressive distribution. This is also the standard approach when you learn a generative model from data: you do maximum likelihood, which is exactly the same as minimizing the Kullback-Leibler divergence between the distribution of your data points and the generative model.
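As an illustration (again a sketch, not the architecture used in the talk), here is the simplest shallow autoregressive parameterization for ±1 spins, where each conditional probability is a logistic function of the previously sampled spins; training by maximum likelihood would simply maximize `log_prob` averaged over the data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ShallowAR:
    """Minimal shallow autoregressive model for +/-1 spins:
    p(sigma_i = +1 | sigma_<i) = sigmoid(b_i + sum_{j<i} W_ij * sigma_j)."""

    def __init__(self, n, rng):
        self.n = n
        # strictly lower-triangular weights enforce the autoregressive ordering
        self.W = np.tril(rng.normal(scale=0.01, size=(n, n)), k=-1)
        self.b = np.zeros(n)

    def sample(self, rng):
        sigma = np.zeros(self.n)
        for i in range(self.n):                  # sample the spins one by one
            p_plus = sigmoid(self.b[i] + self.W[i] @ sigma)
            sigma[i] = 1.0 if rng.random() < p_plus else -1.0
        return sigma

    def log_prob(self, sigma):
        logp = 0.0
        for i in range(self.n):
            p_plus = sigmoid(self.b[i] + self.W[i, :i] @ sigma[:i])
            logp += np.log(p_plus if sigma[i] > 0 else 1.0 - p_plus)
        return logp
```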
But the problem is that to compute and minimize this divergence you actually need to sample from the Boltzmann distribution, because the average is taken over the Boltzmann distribution, and that is exactly what you want to avoid: you want to be able to obtain equilibrium samples in regimes where local MCMC cannot provide them. In other words, you would like to be data-free. So this maximum-likelihood approach is only possible at high temperature, where it is easy to obtain a good sample from the Boltzmann distribution, but at least it allows you to check the model expressivity there. It is not data-free, it is only possible when you have data; but if you are able to get a good sample, with many independent configurations that are representative of the landscape, the model is forced to learn all the regions of the landscape represented in the data set.

One way of solving this data problem is the variational approach: instead of minimizing the Kullback-Leibler divergence between the Boltzmann and the autoregressive distribution, you do exactly the reverse, you minimize the Kullback-Leibler divergence between the autoregressive and the Boltzmann distribution. In this case the average is over the autoregressive distribution, which is easy to sample, so you do not need data. You can write this Kullback-Leibler divergence as a difference of free energies, and you can compute its gradient. To do the learning, you start from a first guess for your autoregressive model, you draw a first sample from it, you estimate the divergence and its gradient, you update the model, and you stop the learning when the variance of the quantity Q shown here is below a certain threshold, where Q is beta times the energy plus the log-probability of your model. The idea is that if the two distributions, the Boltzmann one and the autoregressive one, are the same, then Q is a constant and its variance is zero; reciprocally, if the variance is zero, the two distributions agree on the configurations you sample with your autoregressive model, but that does not mean they agree on all configurations. The reason is that your autoregressive model can be completely missing some regions of the landscape, putting zero probability on some configurations, as you can see on this plot. This is possible because it does not hurt the Kullback-Leibler divergence: configurations with zero probability give zero contribution to it. This problem is known as mode collapse, and it was not present in the maximum-likelihood approach, because there, putting zero probability on some regions sends the divergence to infinity. It can be partially solved using simulated annealing, where you start at high temperature and decrease it while learning, but still, and I will show an example at the end of the talk, if you are trying to learn a very complex model it is very likely that you are going to miss a lot of the landscape. So this approach is data-free, but you might miss parts of the landscape due to mode collapse.
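Here is a rough PyTorch sketch of one step of this variational (reverse-KL) training, including the variance-of-Q stopping criterion. The `model.sample` / `model.log_prob` interface and the `energy` function are assumed, and the baseline subtraction is just the standard variance reduction for the score-function gradient, not necessarily what was used in the work presented.

```python
import torch

def reverse_kl_step(model, energy, beta, optimizer, batch_size=1000):
    """One variational (reverse-KL) update: samples come from the model itself,
    so no data from the Boltzmann distribution is needed."""
    with torch.no_grad():
        sigma = model.sample(batch_size)          # (batch, n) configurations
    log_p = model.log_prob(sigma)                 # differentiable in the parameters
    with torch.no_grad():
        q = beta * energy(sigma) + log_p          # Q = beta*E + log P_AR (detached)
        baseline = q.mean()                       # variance-reduction baseline
    # score-function (REINFORCE) estimator of the gradient of E_PAR[Q]
    loss = ((q - baseline) * log_p).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # stopping criterion: Var(Q) close to zero means P_AR matches the Boltzmann
    # weights on the configurations it generates -- but says nothing about the
    # regions it never visits (mode collapse).
    return q.mean().item(), q.var().item()
```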
One solution that tries to take the advantages of both approaches is to combine local MCMC and global MCMC. There are many ways of doing this; I put two examples on the slide. What we did is called sequential tempering. It is not really data-free, but what you do is the following: you generate a sample with local MCMC at high temperature, where sampling is easy and fast, so you have a first good sample, and you learn a first autoregressive model by maximizing the likelihood, so you do not have the problem of mode collapse. Then you generate a new sample at a slightly lower temperature using the global MCMC update; if you get a good acceptance rate, you obtain a new sample at this slightly lower temperature, and you iterate the procedure: you learn a new autoregressive model, you decrease the temperature, you create a new sample, and you do this until you reach the temperature of interest.

Then we need to choose an architecture. You can do something very simple, a very shallow model with just one layer, and of course if you need a more expressive model you can go deeper. In these autoregressive models the number of parameters scales as the system size squared, which can be a problem if you are trying to generate configurations for a large system, so there are also architectures that reduce the number of parameters with a shared-weights strategy. It can also be a good idea to include some of the geometry of the problem inside the autoregressive model: for example, if you are working on a lattice it can be a good idea to use convolutions, and if you are working on a graph it can be a good idea to use graph neural networks combined with the autoregressive structure.

Okay, so now I will first show some models where this works very well and you can speed up the sampling time by orders of magnitude, and then I will show a more complex model where all methods fail. First we reproduced the results of this paper on the 2D Edwards-Anderson model: a square lattice with Gaussian couplings. What is interesting in this model is that the relaxation time of local MCMC grows very fast when the temperature is decreased; you can see it on the last plot, the blue curve, which is the number of sweeps you need to decorrelate as a function of beta. For beta equal to 1.5 you already need more than 10,000 sweeps to decorrelate, and it is not possible to go to higher beta, it becomes too slow. So local MCMC does not work at low temperature in this model.

We applied this sequential tempering procedure: we start at T equal to 1 and we slowly decrease the temperature. To check that the model was good, we checked that we were able to reproduce the energy and entropy of the true model; the equilibrium results are in green. Of course, below a certain temperature we cannot compare with the true values anymore, because we cannot run local MCMC, but we also found that we were able to reach the ground state. And if you look at the acceptance rate on the graph, you see that it goes to 1, meaning that you are basically accepting every move, so you are more or less decorrelating in a single step, whereas local MCMC probably needs millions of steps to decorrelate in this regime. So there is a real gain in computational time, and you can actually sample in regions where you cannot sample using local MCMC.
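As a sketch of the tempering loop described above (reusing the hypothetical helpers from the earlier snippets, so function and argument names are illustrative, not the authors' code):

```python
def sequential_tempering(energy, betas, init_sample, fit_model, n_global_steps, rng):
    """init_sample: equilibrium configurations obtained with local MCMC at the
    highest temperature betas[0].  fit_model(sample) is assumed to return an
    autoregressive model trained by maximum likelihood (hence no mode collapse)
    exposing sample()/log_prob() as in the earlier sketches."""
    sample = [s.copy() for s in init_sample]
    model = None
    for beta in betas[1:]:                       # slowly decrease the temperature
        model = fit_model(sample)                # retrain on the current sample
        new_sample = []
        for sigma in sample:
            for _ in range(n_global_steps):      # global Metropolis-Hastings moves
                sigma = global_metropolis_step(sigma, energy, beta, model, rng)
            new_sample.append(sigma)
        sample = new_sample                      # (approximate) equilibrium sample at beta
    return sample, model
```

If the acceptance rate of the global moves stays high at each temperature step, a few moves per configuration are enough to decorrelate, which is where the speed-up over local MCMC comes from.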
It is also the case for the 3D Edwards-Anderson model: we did exactly the same analysis, and despite the fact that this model has a phase transition, we found similar results: high acceptance rates, we could reproduce the energy and entropy, we could reach the ground state, while local MCMC becomes extremely slow. Of course we would need something more systematic as a function of the system size, and maybe these results do not hold for much larger systems, but it still shows that for models where local MCMC is not possible, this strategy allows you to actually sample equilibrium configurations.

Then we wanted to see whether this still works for a really complex model. To test this, we looked at a class of models with a critical temperature below which the sampling time becomes exponential in the system size, so that local MCMC is not possible at all. We took the coloring problem, which is an antiferromagnetic Potts model at finite temperature: you have a random graph with connectivity 40 and a certain number of colors, and you penalize configurations in which two neighboring nodes share the same color. We also chose this problem because we can do what is called quiet planting, that is, we can prepare one equilibrium configuration at any temperature. This is useful because below the critical temperature, since the sampling time is exponential in the system size, it is not possible to obtain equilibrium configurations; with quiet planting we have at least one, and we can compare the configurations we sample with this equilibrium configuration. At the beginning we were expecting that it would probably not be possible to sample below T_d, but we were hoping to be able to sample very close to it, in a zone that is already impossible to reach with local MCMC. Now I will show that it was a failure, and we think this kind of model is a good benchmark for future work to test new models.

What we did first is model selection at a high temperature, well above the critical temperature, which is around 0.2 in this case. We tested a lot of models from the literature to select a good one. The first two plots show the energy and the entropy of many different models, and the goal is to be as close as possible to the yellow line, which is the equilibrium value. At this temperature many models are quite close to the true value, but it was already a bit surprising that the best models were the shallow ones: we also tested deeper models and models with smarter architectures like graph neural networks, and still the best models were the shallow ones. We saw that increasing the model expressivity leads to overfitting, which you can see because the entropy becomes too low, and that regularization leads to underfitting. So at least at this temperature it was working: we had good acceptance rates and a good reproduction of the thermodynamic observables.
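For concreteness, the coloring energy used as a benchmark above just counts monochromatic edges; here is a minimal sketch (the tiny graph in the usage example is made up for illustration, not taken from the talk).

```python
import numpy as np

def coloring_energy(colors, edges):
    """Antiferromagnetic Potts energy of the coloring problem: each edge whose
    two endpoints share the same color pays a unit penalty, so the zero-energy
    configurations are exactly the proper colorings."""
    colors = np.asarray(colors)
    i, j = np.asarray(edges).T
    return int(np.sum(colors[i] == colors[j]))

# illustrative usage on a tiny made-up graph
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
print(coloring_energy([0, 1, 2, 2], edges))   # -> 1: only the edge (2, 3) is monochromatic
```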
Then we tried to lower the temperature, and it failed. To check whether your model is good you can do many things, but one thing you can do is the following: take one configuration generated by local MCMC, which is a true equilibrium configuration, run some local MCMC dynamics starting from it, then do the same thing starting from a configuration generated by your model, and check that the two exhibit the same kind of dynamics. In the plot, the dashed lines are the dynamics started from the equilibrium configurations and the solid lines are the dynamics started from the generated configurations. At high temperature the dynamics are very similar, so the model is good, but when you decrease the temperature a little, they become very different. This is for the maximum-likelihood approach; the variational one is a bit better, but it is the same story: at low temperature the dynamics are completely different.

We explain this as follows. The maximum-likelihood model keeps a good entropy, because it has to reproduce the diversity of the data set, but its energy is too high: it is the yellow line here, while the green one is the line it should reproduce. On the contrary, the variational model is mode-collapsing: its entropy is too low, it focuses on only some regions of the landscape, but it is more or less able to reproduce the energy, and that is why it is a bit better, the samples from the variational model look a little closer to the equilibrium configurations.

So maximum likelihood fails: it gives a good model with acceptable acceptance rates at high temperature, but as soon as you decrease the temperature the acceptance rate goes to zero, basically for every kind of model. We tried larger training sets and more expressive models, and everything fails; in all cases the energy of the proposed configurations is too high compared with the equilibrium energy. The variational approach also fails, but for different reasons: there you do get good acceptance rates in the global MCMC dynamics, but you do not decorrelate. You can see it in this plot: the red line is at high temperature, where you are able to decorrelate as a function of the number of global MCMC steps, but for lower temperatures the correlation basically stays at one, so you do not decorrelate at all. That is because the model has learned only a very small portion of the landscape, and you just stay in that small region. You might think that even if we cannot explore the landscape, at least we can sample equilibrium configurations inside this small portion of it; but this temperature is still well above the critical temperature, and when we decreased the temperature further, the model was not even able to reproduce the energy. So it is mode-collapsing and it cannot sample good configurations at low temperature.
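The consistency check described at the start of this passage can be sketched very simply: run the same local dynamics from an equilibrium configuration and from a model-generated one, and compare the resulting curves (again using the hypothetical helpers from the earlier snippets).

```python
import numpy as np

def relaxation_curve(sigma0, energy, beta, n_steps, local_step, rng):
    """Energy along a local-MCMC trajectory started from sigma0.  If the generative
    model really produces equilibrium configurations, the curve started from a
    generated configuration should be statistically indistinguishable from the one
    started from a true equilibrium configuration."""
    sigma = sigma0.copy()
    energies = [energy(sigma)]
    for _ in range(n_steps):
        sigma = local_step(sigma, energy, beta, rng)   # e.g. local_metropolis_step above
        energies.append(energy(sigma))
    return np.array(energies)
```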
So I think these very complex models with a very complex landscape are a good benchmark for future work. For the moment, autoregressive models do not seem to provide a good enough auxiliary solution for sampling complex models, and there is no clear path for improvement: we tried more expressive models, we tried to increase the training set size, and maybe it is an intrinsic limitation of the autoregressive structure, which can perhaps learn a few peaks but not an exponential number of them. Still, there are some directions, I think. First, we can try to simplify the problem: instead of learning to generate the whole configuration, we could try to learn to generate a subset of spins given its boundary. Maybe it is a simpler problem, and it is also interesting if you want to go to very large system sizes, because you could generate a subset of spins in one region, then do the same thing in another region, and in this way scale to very large systems. It may also be a good idea to explore more complex architectures such as transformers, which we are also trying: they are very powerful for language and also for biological data such as proteins, but it is not so obvious that they also work well for spin data. Then there is also this paper that I find interesting: they reformulate the Boltzmann distribution in an autoregressive form, and they show that doing this exactly requires a number of parameters that is exponential in the system size; but they also show that for the Curie-Weiss model and the SK model they were able to reduce the number of parameters, and that this outperforms the naive architectures you would otherwise use in an autoregressive model. Again it is system-specific, you need to derive it for every problem, but maybe you can try to use the architecture they found for, say, the SK model on other problems, and maybe it works better than the other architectures.

For the examples on the SK model, do you know how your method compares to something like the Machta-Chayes class of replica cluster methods, or simulated annealing, or ordinary parallel tempering, something like this?

So for this model, as you said, there are system-specific solutions that are very effective, so it is not really useful for this model; I think it is more or less comparable, but we did not actually compare. But for models where there is no system-specific solution it is still useful, I think.

I see that evaluating the direct Kullback-Leibler divergence is impossible, no? So you invert it, if I got it right?

Not really. In the variational approach yes, but in this approach we are actually using the direct Kullback-Leibler divergence, because we start at high temperature with a true sample that we are able to generate with local MCMC, and then we decrease the temperature, but at each step we are minimizing the direct Kullback-Leibler divergence.

Okay, I see, thank you very much. Any other questions? Otherwise I was going to ask: do you have an idea of a good continuous-variable benchmark? I mean, I'm sure we can think of plenty, but a good toy model to also test this type of algorithm in the continuous case? In the continuous case, not in the discrete one? Yes.

Thanks, very nice talk. You mentioned that you were playing with transformers; I was wondering whether this was a perspective or whether it is something you have already tried, and if you have started playing with transformers, can you tell us a little about it?

Yes, we are trying to do this. For now the results are a bit disappointing, because shallow models, or slightly deeper models with the autoregressive structure, are for the moment working better than transformers, but we are still trying. What is also useful with transformers is that the input and output size can be changed: for example, if you try to learn a subset of spins in the graph, you can change the number of neighbors on the boundary, which you cannot do with
a shallow model. So I think it is still a good direction, but for the moment we do not have very good results.