And it's my distinct pleasure to welcome our speaker today, who is Phiala Shanahan. She obtained her PhD in 2015 from the University of Adelaide, then held a postdoctoral position at MIT and worked as a senior research scientist at the Thomas Jefferson National Laboratory. She then returned to MIT, where she is now a professor of physics. She received a CAREER award from the National Science Foundation, was listed as one of the 30 under 30 in science by Forbes magazine in 2017, and became an Emmy Noether Fellow in 2018. Her research revolves around theoretical nuclear and particle physics: she is working to understand how the fundamental degrees of freedom of quantum chromodynamics eventually give rise to hadrons and to entire atomic nuclei. Making this connection explicit is an extraordinarily computationally challenging problem, and that's why today she will talk to us about how machine learning can help us make progress in that field. Phiala, please, whenever you're ready.

Thank you very much, Phillip, and thanks to the organizers for the invitation. It's an absolute pleasure to be here. So in the spirit of this series at the intersection of physics and machine learning, my plan for this talk is to outline a particular computational challenge, a sampling problem, as the title suggests, that stands between us and the ability to do some really exciting theoretical physics calculations, and then dig a little bit into the details of some new approaches to meet this challenge with machine learning, especially when we build the symmetries of our theory into our architectures from the ground up. In particular, I'm talking about the challenge of studying the structure of matter directly from our understanding of the most fundamental quark and gluon degrees of freedom in nature, which are encoded in the Standard Model of nuclear and particle physics. This is something we want to do both for its own sake, to understand the emergence of nuclear complexity, to see how everything from the structure of a proton to the structure, interactions, and reactions of nuclei comes from our fundamental theory, but also to be able to constrain backgrounds and provide benchmarks for searches for new physics. And just to give one small piece of additional physics context here, one place where Standard Model studies of nuclear physics are extremely important in new-physics searches is at the intensity frontier. I'm talking here about experiments that have the sensitivity to probe the rarer Standard Model interactions to search for beyond-Standard-Model effects: things like dark matter direct detection experiments; neutrino physics experiments, like DUNE, the new long-baseline neutrino experiment being constructed in the United States, looking to constrain the neutrino mixing parameters and the mass hierarchy; and searches for double beta decay. All of these have in common that they use nuclei, in some sense, as targets. A big challenge is that lattice field theory approaches, which as I'll describe are the only first-principles way of studying the Standard Model that is systematically controlled and improvable, become more expensive with the atomic number of the nucleus being studied, by compounding exponential and factorial factors. So these sorts of studies are severely limited by available computation, which of course makes them an area where accelerated algorithms can have particular impact.
And this is just one class of many examples where first-principles Standard Model studies are limited by available computation. Okay, so to talk about first-principles Standard Model calculations is to talk about the strong force, quantum chromodynamics, which is responsible for binding the quarks and gluons into protons and neutrons, and also responsible for binding the protons and neutrons into nuclei. It's then of course relevant to questions like the structure of the proton and the structure and interactions of nuclei, which we would like to understand from the Standard Model, from first principles. And the challenge with the strong force, the challenge with QCD, is of course that it's non-perturbative. The interaction strength depends on energy: here's a plot of the strong coupling constant with the energy scale on the horizontal axis. At large energies the coupling is small, and moving to the left, to small energies, the coupling becomes large. What that means is that you can't do a perturbative expansion if you're trying to study QCD at low energy, which is what's relevant to, say, the structure of a nucleus at rest in a detector. You have to do something different, something that takes the non-perturbative physics into account. And the only first-principles approach that we have to study QCD non-perturbatively is numerical: it's called lattice field theory, or lattice QCD. The idea is quite straightforward. You discretize the equations of QCD onto a four-dimensional space-time lattice, or grid, and then the QCD equations, the path integral that you need to solve to compute the value of some observable, correspond to integrals over the values of the quark and gluon fields on each site or each link of this four-dimensional lattice. So essentially this is an integration problem in something like 10^12 to 10^15 variables for state-of-the-art calculations. We evaluate these integrals by importance sampling, and this is again something you can understand quite simply just by thinking about quantum mechanics: you know that paths near the classical path, where the action is stationary, are dominant and other paths are less important, so you can sample the dominant paths correspondingly more and the others correspondingly less. Then, on your set of samples of the background quark and gluon fields, you can compute your observables. To go into just a little more detail: in this discretization we have a Euclidean space-time lattice with some finite lattice spacing and some finite lattice volume, and of course, if you want to recover continuum QCD, you have to take the limit as the lattice spacing becomes small and the limit as the lattice volume becomes large to match on to continuum physics. Now, the path integral that I described that we want to compute, the expectation value of some observable, is the QCD path integral in Euclidean space, schematically ⟨O⟩ = (1/Z) ∫ DU O[U] e^{-S[U]}; that's why there's an exponential of just the action, without an i. We have an integration over our quark and gluon fields here. Once you have sampled those quark and gluon fields with this weight factored in, this becomes just the mean and the standard deviation of the quantity you're trying to compute over the samples you've taken. And that's roughly how these sorts of calculations proceed. Okay, so one thing I'd like to emphasize about this approach is that it really is a first-principles approach. The three parameters in lattice QCD are the same free parameters as in QCD itself.
So that's the quark masses, and that's the strong coupling. Once you fix those free parameters, so you fix the quark masses by matching to some measured hadron masses, say the pion and the kaon to fix the light and strange quark masses, and you use one additional experimental input to fix the lattice spacing, or equivalently the strong coupling, then everything else you calculate, from details of the three-dimensional structure of the proton up to nuclear reactions, is a prediction of the theory itself. There's no additional modeling, no sense in which a nucleus is composed of protons and neutrons; a nucleus is composed of the quark and gluon degrees of freedom of the Standard Model. So everything else is a prediction. Now, these calculations work extremely well, and they're also extremely expensive. To give you an idea of the cost, here is a rough sketch of the workflow of a lattice QCD calculation. And I should say, as I mentioned at the start, calculations especially of nuclear physics are extremely computationally limited, despite using more than about 10% of open-science supercomputing, at least in the United States. The first step of the workflow is the generation of these field configurations via some sampling, and this is something we'll come back to for most of this talk. This is something that requires leadership-class computing: you need to imagine thousands of GPUs, hundreds of thousands of cores, tens of teraflop-years, in order to create just a few thousand samples, each of which is something like 100 gigabytes in size. Then on each of these samples you need to do some work. You need to compute propagators, which requires large sparse matrix inversions on a few hundred GPUs. You need to contract those propagators into correlation functions, so tensor contractions, which are smaller in scale again, but which you need to do far more of. And what we see is that the computational cost grows exponentially and factorially with the size of the nuclear system, which is why we're so limited here. There are efforts to apply machine learning to all aspects of this workflow, and I'm just going to focus on one particular problem for the talk today, but I wanted to be clear that there's a lot of work in this area, most of it still at a very early developmental stage, a lot of it using toy theories instead of QCD itself, and so there's a lot of potential. There's also a lot of related work in the condensed matter context that's not on my slide. But some of the most sophisticated and comprehensive efforts to turn machine learning tools to lattice field theory tasks have been in this first step, field configuration generation, which is what I'll focus on. I also want to emphasize one more time that since we are doing a first-principles theory calculation here, in a systematically improvable framework, in applying machine learning to this task one has to be extremely careful to only consider approaches which rigorously preserve the fact that what you're doing is a first-principles theory calculation that is systematically improvable, and which recovers the quantum field theory we're trying to study in all of the right limits. That means there's no room for inexactness: everything needs to be provably exact, with any modeling uncertainties propagated rigorously, in order to preserve the rigor of this approach.
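As a brief concrete aside on the final step of that workflow: once configurations have been sampled with weight proportional to exp(-S), computing an observable really does reduce to a mean and a standard error over the ensemble, as mentioned a moment ago. Here is a minimal sketch of that step; the function names and the toy observable are hypothetical, and a real analysis would also account for autocorrelations, for example by binning.

```python
import numpy as np

def estimate_observable(configs, observable):
    """Mean and naive standard error of an observable over sampled configurations.

    The configurations are assumed to have been drawn with probability
    proportional to exp(-S), so no extra weighting is applied here.
    """
    values = np.array([observable(cfg) for cfg in configs])
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))

# Toy usage: "configurations" are random scalar fields, the observable is <phi^2>.
rng = np.random.default_rng(0)
fake_configs = rng.normal(size=(100, 8, 8))        # 100 samples of an 8x8 field
mean, err = estimate_observable(fake_configs, lambda phi: (phi ** 2).mean())
print(f"<phi^2> = {mean:.3f} +/- {err:.3f}")
```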
Okay, so as I said, I'll talk mostly about just one class of applications, so we can dig into some of the machine learning details, and that's the first part of the workflow, which is sampling. Now, let me explain the sampling problem. We want to generate these field configurations with a known probability distribution. The probability distribution of a field configuration, which you can think of as a four-dimensional grid of links on a lattice, is given by the exponential of the action, where the action is of course the function that defines the quark and gluon dynamics we're trying to study. In QCD, a gauge field configuration is represented by something like 10^10 links of the lattice, and each of these links is an SU(3) matrix, a three-by-three complex unitary matrix with unit determinant. So what we want to do is sample, in this space of something like 10^12 double-precision numbers, configurations distributed according to a known probability distribution. If you have those samples, then your averages over the configurations give you your physical observables of interest, and ultimately we want of order thousands of these samples. The way the sampling has been done, very successfully, for the last 30 years or so is via a Markov chain Monte Carlo approach called hybrid Monte Carlo, or Hamiltonian Monte Carlo, which is a coupled position-and-momentum walk through the space using Hamiltonian dynamics. Essentially, you can use classical motion, you can use molecular dynamics, but if you try to integrate along too long a path, you have energy non-conservation from your numerical integration. So to guarantee exactness in the limit of a large number of samples, you do a Metropolis accept-reject step, a Metropolis-Hastings step: you propose an update in your chain using your integrated molecular-dynamics trajectory, then accept or reject with some probability, which corrects for any numerical error, and then continue on. Unfortunately, you need fairly short integration trajectories to have high acceptance. Just to give some more intuition on what this looks like, you can really think about a walk in this space that allows you both to explore your level set and to jump to nearby sets by injecting random momentum into your Hamiltonian dynamics and then refreshing. So it's a very efficient way of sampling in such a high-dimensional space. And, as I suggested, the fact that you have fairly short integration trajectories to maintain a high acceptance rate gives you correlations in your chain. The cost of the sampling then depends not only on how much it costs to generate each sample in the chain but, since you need independent samples to do your calculations with, on how correlated the samples in that chain are. The measure of that correlation is the autocorrelation time of the chain. So, for a given cost per sample, a shorter autocorrelation time implies less computational cost in your calculation. A particular challenge is that this hybrid Monte Carlo, this Hamiltonian Monte Carlo formulation, is in some sense a close-to-local, or diffusive, updating approach, which means that if you're trying to generate independent configurations on some fixed length scale, say a physical scale like the size of a proton, we care about correlations on that scale.
So as this lattice spacing becomes finer, which is of course a limit you need to take to recover QCD, the number of updates you need to change the physics on that scale diverges. If you have a coarse lattice spacing it takes fewer updates to decorrelate on this length scale than if you have a fine lattice spacing. This divergence is a manifestation of critical slowing down, in this context a critical slowing down in the generation of uncorrelated samples, and this critical slowing down is a particularly severe computational challenge that we face. Just to show you a figure of what this critical slowing down looks like: here's a more formal definition of the autocorrelation measure, which you can just think of as how correlated the configurations in your chain are, and here's a figure showing this integrated autocorrelation time for various observables. Moving from right to left on the horizontal axis is moving towards the critical limit, and you can see that for these different physical quantities, for example the topological charge, the autocorrelation time diverges exponentially badly as you move towards the critical limit, which means your calculations diverge in cost exponentially badly in that limit. Okay, so this is the problem, or one of the problems, that my group has been working on addressing with machine learning. I just want to flash up that there's a fantastic group of young people working on this in my collaboration, and we also have some fantastic industry partners at Google DeepMind. Okay, so coming back to this sampling problem, we can introduce the machine learning approaches that we're exploring in a nice test case which is easy to visualize. Let's think about a scalar lattice field theory. This is a theory where our samples, our configurations, just look like a real number per site on a two-dimensional lattice, so you can visualize them like this. Then for the action, which in the QCD case describes our quark and gluon dynamics, we can take a simple phi-to-the-fourth (φ⁴) theory, like you'd look at in any introductory field theory course, with just kinetic terms and a quartic coupling. The sampling problem for this field theory is to generate these field configurations with a known probability distribution given by the exponential of the action. What this looks like is that we want configurations like this, which have correlations on some characteristic scale, and far less often we want configurations like this, which is just random noise; you can see the actual log probabilities in this theory given here. This has a lot of parallels with an image generation problem, which has received a lot of attention and has also found a lot of success with machine learning applications from the machine learning community. We want samples like this, and far less often samples like this, and there's been a lot of success in, for example, generating these images of faces. These aren't photographs, they're machine-generated images. You can go to www.thispersondoesnotexist.com and refresh, and every time you refresh you'll get a new fake image of a person. So if you can design a machine learning algorithm to give you this, you'd think maybe you can design a machine learning algorithm to give you this.
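To ground both the φ⁴ action just introduced and the conventional hybrid Monte Carlo sampling described a moment ago, here is a minimal sketch in Python. It is only a toy: the lattice size, the action convention, and the parameter values are illustrative, not those used in the studies discussed in this talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi4_action(phi, m2=1.0, lam=1.0):
    """Euclidean lattice phi^4 action with periodic boundaries (one common convention):
    S = sum_x [ 1/2 (grad phi)^2 + 1/2 m^2 phi^2 + lambda phi^4 ]."""
    kinetic = sum(0.5 * ((np.roll(phi, -1, axis=mu) - phi) ** 2).sum()
                  for mu in range(phi.ndim))
    return kinetic + (0.5 * m2 * phi ** 2 + lam * phi ** 4).sum()

def phi4_force(phi, m2=1.0, lam=1.0):
    """Minus the gradient of the action, -dS/dphi, the 'force' driving the HMC dynamics."""
    laplacian = sum(np.roll(phi, -1, axis=mu) + np.roll(phi, 1, axis=mu) - 2.0 * phi
                    for mu in range(phi.ndim))
    return laplacian - m2 * phi - 4.0 * lam * phi ** 3

def hmc_update(phi, n_steps=10, eps=0.1):
    """One hybrid Monte Carlo update: refresh momenta, integrate Hamiltonian dynamics
    with leapfrog, then apply a Metropolis accept-reject step to correct the
    numerical energy non-conservation."""
    pi = rng.normal(size=phi.shape)
    h_old = 0.5 * (pi ** 2).sum() + phi4_action(phi)
    phi_new, pi_new = phi.copy(), pi.copy()
    pi_new += 0.5 * eps * phi4_force(phi_new)          # half step in momentum
    for _ in range(n_steps - 1):
        phi_new += eps * pi_new                        # full step in the field
        pi_new += eps * phi4_force(phi_new)            # full step in momentum
    phi_new += eps * pi_new
    pi_new += 0.5 * eps * phi4_force(phi_new)          # final half step
    h_new = 0.5 * (pi_new ** 2).sum() + phi4_action(phi_new)
    if np.log(rng.random()) < h_old - h_new:           # accept with prob min(1, e^{-dH})
        return phi_new, True
    return phi, False

# Toy usage: run a short chain on a 16x16 lattice and report the acceptance rate.
phi, n_acc = np.zeros((16, 16)), 0
for _ in range(200):
    phi, acc = hmc_update(phi)
    n_acc += acc
print("acceptance rate:", n_acc / 200)
```

Successive elements of such a chain are correlated, and it is exactly the growth of that correlation towards the critical limit that the flow-based approach described next is designed to avoid.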
So, of course, unfortunately, there are a number of challenges that we face in our problem that are different from the image generation case, and as we'll see, we can't just take out-of-the-box machine learning tools and hope they work. Some of the challenges that we'll come back to are, of course, the symmetries of our theory. Firstly, our gauge field configurations are symmetric under rotations and translations in four dimensions, with the boundary conditions, so this gauge field configuration and this gauge field configuration encode exactly the same physics, they have exactly the same probability: these are equivalent. And for our gauge field theories we also have more complicated symmetries, like gauge transformations. For QCD, remember, we have a three-by-three complex matrix on each link of our four-dimensional lattice, and our probability distribution is invariant under this gauge transformation here, where we transform every link differently but in a correlated way across the lattice. So this field configuration and this one encode the same physics. This is of course quite different from the case of images. If we compare our problem to an image generation problem, you see several things. The first is the symmetries. But then you also notice that we have quite a different data hierarchy: each sample of our lattice QCD gauge field is described by something like 10^9 to 10^12 numbers, but ultimately we only want, or we only have, thousands of samples, compared to an image generation problem where you might have samples described by 3,000 numbers and 60,000 of them to train on. So we have a severely inverted data hierarchy, which means that any method involving training on an existing data set won't work for our case. Also, it's the ensemble of gauge fields that has meaning, rather than each individual image having meaning, which means that we do actually want the configurations that look like random noise, just very rarely, whereas in the image generation case you always want your images to look like faces and never want a random-noise image. And it's long-distance correlations that describe our physics, whereas in an image you have local structures that are important, like an eye or a nose in a picture of a face. So what we'll find is that for our applications to QCD, out-of-the-box tools are really not appropriate, and we've had to develop custom machine learning for this physics problem from the ground up. Okay, so just to summarize the sampling problem before we get to a solution: we have a known probability density that can be computed, up to normalization, for a given sample; we have precise symmetries that we need to encode; and our data hierarchies are challenging, as I said, 10^9 to 10^12 variables per configuration but only of order thousands of samples available. We can't really use any training paradigm that relies on existing samples from our distribution; if we had such a training data set, we would be done, we could just use those samples for our calculation. So, one approach to this problem is via flow-based models, and the idea of flow-based models is quite straightforward: it's essentially just a change of variables. You learn, you optimize, a complicated function that gives you a change of variables between a probability distribution that's easy to sample and the distribution of interest.
And of course, if you have a function like that, then you're done: you take samples from the easy distribution, you put them through your function, and out come samples from the complicated, hard-to-sample distribution. So you can learn, or optimize, such a function that's invertible and has a tractable Jacobian. Then you can also make the sampling exact, via a Metropolis-Hastings accept-reject step or via reweighting methods; you can guarantee exactness of this algorithm. In flow-based models, you build up such a function, a nice expressive complicated function with very many free parameters to optimize, that is still invertible and with a tractable Jacobian, by composing many simple layers in a specific way. For the case of the scalar field theory, this example that we'll come to, you can almost use something out of the box from the machine learning community for image generation: you can use something called real NVP, real-valued non-volume-preserving flows. Just to unpack the terminology, "non-volume-preserving" just means that the density can be squished or stretched by the change of variables. And the way you can think about this is that in each layer of this function (you compose your function out of many smaller layers), you split your variables in half. That's what's happening here: if this is your image, or your scalar field, you can split it into the green variables and the black variables. You leave the green variables alone through that layer, but you update the black variables via a scaling, here by an exponential, and via a translation. That scaling and that translation can be arbitrarily complicated, they can be parameterized by arbitrary neural networks, for example, as long as they depend only on the untransformed variables, only on the green variables. Then what you end up with is a function that has a very nice lower-triangular Jacobian, which gives you a simple inverse and a simple Jacobian of your complete function when you compose all of these components with lower-triangular Jacobians together. So you can build a complicated function that you can still invert and that still has a tractable Jacobian. And you can optimize this function by minimizing a loss function: you can minimize this function here, which is a shifted Kullback-Leibler (KL) divergence. The shift is just to remove the unknown normalization that we don't want to compute, and you write down this particular direction of the divergence to give you this function here; schematically, the loss is the expectation, over samples x drawn from the model, of log p̃_f(x) + S(x). What you have is a loss function whose minimum value is minus log Z, where Z is the unknown normalization constant, and which you can optimize purely by sampling from your model distribution: you can estimate this loss stochastically by drawing batches of samples from the model. Here p̃_f is our model probability distribution, given by the function f, our change of variables from a simple prior to our complicated model distribution, and we would like p̃_f to be as close to p as possible; this loss is minimized when that's the case. And in this expression you can see that if you draw samples from the probability distribution p̃_f, all you need to do is compute p̃_f on your sample and compute the action on your sample, and then you have everything you need to minimize this loss. And then, as I alluded to, you can guarantee exactness of your generated distribution by forming a Markov chain.
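Before coming back to that Markov chain step, here is a minimal self-contained sketch of the ingredients just described for the scalar theory: checkerboard variable splitting, affine (scale-and-translate) coupling layers whose parameters come from small neural networks, and the shifted reverse-KL loss estimated from model samples. The lattice size, network sizes, action parameters, and training length are all illustrative only, not those of the published studies.

```python
import math
import torch
import torch.nn as nn

L = 8                                            # toy 2D lattice, L x L sites

def checkerboard(parity):
    """0/1 mask over sites; 1 marks the 'frozen' half for a given layer."""
    idx = torch.arange(L)
    return ((idx[:, None] + idx[None, :]) % 2 == parity).float()

class AffineCoupling(nn.Module):
    """One real NVP coupling layer: frozen sites pass through unchanged, the other
    sites are scaled by exp(s) and shifted by t, where s and t are produced by a
    small network that sees only the frozen sites, so the Jacobian stays triangular."""
    def __init__(self, mask, hidden=64):
        super().__init__()
        self.register_buffer("mask", mask)
        self.net = nn.Sequential(nn.Linear(L * L, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * L * L))
    def forward(self, phi):
        frozen = phi * self.mask
        s, t = self.net(frozen.flatten(1)).view(-1, 2, L, L).unbind(dim=1)
        s = s * (1 - self.mask)
        t = t * (1 - self.mask)
        phi = frozen + (1 - self.mask) * phi * torch.exp(s) + t
        return phi, s.flatten(1).sum(1)          # transformed field, log|det Jacobian|

def phi4_action(phi, m2=1.0, lam=1.0):
    """Same toy phi^4 action as before, batched over the first index."""
    kin = sum(0.5 * ((torch.roll(phi, -1, dims=d) - phi) ** 2).flatten(1).sum(1)
              for d in (1, 2))
    return kin + (0.5 * m2 * phi ** 2 + lam * phi ** 4).flatten(1).sum(1)

layers = nn.ModuleList([AffineCoupling(checkerboard(i % 2)) for i in range(8)])
optimizer = torch.optim.Adam(layers.parameters(), lr=1e-3)

def draw(batch):
    """Sample from the Gaussian prior and push through the flow, tracking log q(phi)."""
    z = torch.randn(batch, L, L)
    logq = (-0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)).flatten(1).sum(1)
    phi = z
    for layer in layers:
        phi, logdet = layer(phi)
        logq = logq - logdet
    return phi, logq

for step in range(100):                           # toy training loop
    phi, logq = draw(batch=64)
    loss = (logq + phi4_action(phi)).mean()       # shifted reverse KL; minimum is -log Z
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```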
So although your samples are generated completely independently, you're drawing them out of a bucket of independent samples, you can then compose them into a chain, and you can accept or reject each sample based on the previous sample in the chain. In exactly the same way as usual hybrid Monte Carlo, Hamiltonian Monte Carlo, or any Markov chain Monte Carlo with an accept-reject step, you can guarantee exactness of your probability distribution in the limit of a large number of samples. Just as a reminder, p̃ here is the model distribution, p is the true distribution, and you can compute both of those for any given sample from your model. So then the workflow looks like this: you parameterize this flow, again a change of variables from an easy-to-sample prior distribution to the distribution of interest, and you can parameterize it using arbitrary neural networks whose parameters you optimize. You then train by drawing samples from the model, computing this loss function, and doing gradient descent. Then, after saving your trained model, you can sample in an embarrassingly parallel way by drawing samples from the model and composing them into a Markov chain. We can see how this works for this example of scalar lattice field theory, so again one real number per site on this lattice. Here's our simple action with just kinetic terms and a quartic coupling, and we can look at this for a number of different lattice sizes where we've tuned the parameters for an analysis of critical slowing down. The parameter choices are here for the experts in the room, but ultimately we've just chosen a number of ensembles along a critical line where sampling becomes exponentially more expensive using conventional approaches. In this example, for the initial configuration we can just choose uncorrelated Gaussians at each site of the lattice, and you pass them through these real NVP coupling layers, eight to twelve of these layers, using an alternating checkerboard pattern like I showed for the variable splitting, with neural networks in each layer. You can train using whatever optimizer you like, for example an Adam optimizer, and then you can stop and sample from this model. And what you see is that this in fact works. What I'm showing on this slide: each of these figures shows the integrated autocorrelation time, so a measure of the correlation of the samples in your chain, and on the horizontal axis we're moving from left to right along a critical line. This panel is generated with hybrid Monte Carlo, Hamiltonian Monte Carlo, and this one with a local Metropolis algorithm, so these two figures show conventional approaches; each of the different types of points is a different observable. You see that as you move from left to right, your sampling becomes exponentially more expensive, the autocorrelation time diverges, no matter which observable you're looking at. In the rightmost plot we see the same sort of figure for the flow-based sampling approach, and what you see is that it's flat: at the cost of the upfront training of the model, you don't have a divergent cost in sampling. And that's of course by construction, by design, that's not surprising, because your samples are independent, and you're only getting correlations from rejections when you compose them into a chain.
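That chain-building step, often called an independence Metropolis chain, is simple enough to sketch explicitly. This is a generic illustration of the accept-reject logic rather than the specific implementation used in the work being described; the unnormalized target log-density is just minus the action.

```python
import numpy as np

def flow_to_markov_chain(samples, logq, logp, seed=0):
    """Compose independent flow proposals into an exact Markov chain via a
    Metropolis accept-reject step. `samples` are draws from the trained model,
    `logq` their model log-densities, and `logp` the unnormalized target
    log-densities, e.g. -S(phi). A rejection repeats the previous chain element,
    which is the only source of autocorrelation in the resulting chain."""
    rng = np.random.default_rng(seed)
    chain, last, n_accept = [samples[0]], 0, 0
    for i in range(1, len(samples)):
        # Accept with probability min(1, [p(x_i)/q(x_i)] / [p(x_last)/q(x_last)]).
        log_acc = (logp[i] - logq[i]) - (logp[last] - logq[last])
        if np.log(rng.random()) < log_acc:
            last, n_accept = i, n_accept + 1
        chain.append(samples[last])
    return chain, n_accept / (len(samples) - 1)
```

Drawing the proposals themselves is embarrassingly parallel; only this cheap accept-reject pass over the pre-drawn batch is sequential.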
So the autocorrelation time depends entirely on your acceptance rate: if you train your models to 50% acceptance you have a higher autocorrelation than if you train them to 70% acceptance, but for a fixed level of training you have no divergence in sampling along the critical line. That comes at the cost of the upfront training of the model. So this looks like a nice proof-of-principle success. But of course, ultimately we don't want to sample a scalar field theory, we want to do lattice QCD, we want to study the Standard Model. Some of the challenges are at least conceptually straightforward: we have to scale the number of dimensions to four, and we have to scale the number of degrees of freedom to something like 48³ × 96, the scale of state-of-the-art calculations. That scaling is a sort of engineering challenge, but it's conceptually at least straightforward. More importantly, we have to develop methods to deal with our gauge field theories, as opposed to just a scalar theory. I should say that we're working on these scaling problems in an Aurora Early Science project; Aurora will be the new large supercomputer to be built in the United States over the next years, and so there's a targeted effort to try to scale these sorts of machine learning approaches for physics to work on new exascale hardware. So let's come back to methods for gauge theories, which is where some of the interesting developments are. Just a reminder of what the distribution ultimately looks like that we want to sample: the field configuration here is composed of links, and each link is an SU(3) matrix. The first thing to note is that we have group-valued fields, so they don't live on the real line, they live on compact connected manifolds. So first we have to deal with how to build flow architectures for compact connected manifolds. And secondly, our action is invariant under group transformations of the gauge field, including gauge transformations, so how do we incorporate these symmetries into such architectures, how do we make gauge-equivariant flows? Previously we talked about real non-volume-preserving flows, which you can think of as a map on the real line that takes you from something like this distribution to something like this distribution. But now we really have group-valued variables, so let's deal first with just compact connected manifolds. First you have to design flows that take you from something like this, a distribution on a circle, to a different distribution on the circle, without any sort of discontinuities, without any sort of instabilities in how you define a function like that. So let's think about the circle. Well, that's exactly what you want for a U(1) field theory. Of course, you can also have many other interesting variables described by circles, like the joints on a robot arm: the angles live on circles, so designing flows for these sorts of problems has applications not just in field theory but elsewhere. What you want is a diffeomorphism of the smooth manifold, an invertible function that maps one differentiable manifold to another such that both the function and its inverse are smooth. You have that under a couple of conditions: we want the transformation to be monotonic, so that it's invertible, and the Jacobians should agree at zero and 2π, so that the probability density is continuous.
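Those two conditions are easy to check numerically for a concrete map. Here is a toy circle diffeomorphism, just for illustration and not one of the families discussed in the talk, chosen because it satisfies both conditions while keeping 0 and 2π as fixed points, the restriction addressed next.

```python
import numpy as np

def circle_map(theta, a=0.5):
    """A toy diffeomorphism of the circle: theta -> theta + a*sin(theta), |a| < 1.
    Its derivative 1 + a*cos(theta) is strictly positive (monotonic, hence invertible)
    and takes the same value at theta = 0 and theta = 2*pi, so a density transformed
    by p'(theta') = p(theta) / (d theta'/d theta) stays continuous across the seam.
    Note that 0 and 2*pi are fixed points of this particular map."""
    return theta + a * np.sin(theta), 1.0 + a * np.cos(theta)

theta = np.linspace(0.0, 2.0 * np.pi, 1001)
theta_new, jacobian = circle_map(theta)
print(jacobian.min() > 0.0)                                   # monotonic everywhere
print(np.isclose(jacobian[0], jacobian[-1]))                  # Jacobians agree at 0 and 2*pi
print(np.isclose(theta_new[0], 0.0), np.isclose(theta_new[-1], 2.0 * np.pi))  # fixed endpoints
```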
You can compose with a phase translation to remove this restriction of zero and 2π being fixed points, and the phase translation is volume-preserving, so you don't get an extra volume factor in computing the probability density. And if you have some of these diffeomorphisms, you can then make them more complicated, more expressive, by composing them or by taking convex combinations of them. So the very first step in this process was to go back and write some papers on how you design these sorts of flow architectures with diffeomorphisms for circles and spheres. We're not the first to talk about these sorts of diffeomorphisms, but here's a collection of some of those we've explored. For example, you can have a Möbius transformation, mapping z here to h_ω(z); this transformation is parameterized by ω, and what it does is expand the part of the sphere close to ω and contract the rest. You have splines that you can define. You can define a non-compact projection, so you project to the real line and then back; there you have to be very careful with numerical instabilities at the endpoints, so to implement something like this in practice you have to Taylor expand near the endpoints and analytically continue the function to make it well defined there. But you can solve this problem, essentially, and I won't go into too much detail here: you can write down how to define normalizing flows on a circle, a torus, a sphere, and you can take Cartesian products of tori and intervals and recursively build flows on higher-dimensional spheres, for example. More interesting again is: even if we can write down a flow for a compact connected manifold like a torus or a circle, what about symmetries? The first thought might be that you don't have to worry about the symmetries, because incorporating them is not actually essential for the correctness of the machine-learned ensembles; you get guaranteed correctness through the accept-reject step anyway. But what we have is a very high-dimensional symmetry group, the gauge symmetry, and if you try to learn that through training, the symmetry group is so high-dimensional that you would need a very, very large number of samples to learn it even approximately; ultimately, you can't do this efficiently. So what you need to do, to have any sort of efficient training, is basically factor your space: you remove the gauge degrees of freedom and only try to learn the other features of the distribution, rather than spending all of your effort in training to reproduce the fact that you have a flat direction in your probability distribution along that high-dimensional symmetry in the gauge space. To define a flow whose distribution is invariant under the symmetries, you need just two conditions: if your prior distribution is symmetric, and if every layer is equivariant, then your flow as a whole will give a distribution that is invariant under your symmetry. By equivariant I just mean that the symmetry transformations commute with applications of the coupling layer. So, a reminder of the symmetry that we're trying to build in: we have our field defined by a group element on each link of this lattice.
And the symmetry that we want to be equivariant under is this one here, where you left- and right-multiply each link by matrices in your group, say U(1) or SU(N), and where the element you left-multiply this link by is tied to the element you right-multiply the neighboring link by, so you're transforming each link differently, but in a correlated way across the lattice. You can define a gauge-equivariant coupling layer in the following way. Just as a reminder, this coupling layer is a map on the manifold given by our group, U(1) or SU(N), raised to the power of the space-time dimension times the lattice volume; that's because each group-valued variable lives on a link of the lattice, so on a two-dimensional lattice each site comes with a link in this direction and a link in this direction. We want this layer to again act on just a subset of the variables in each layer, remembering that the variable splitting was what gave us a nice lower-triangular Jacobian. So we want to define a layer that updates some half, or some fraction, of the links, call them A, and leaves another fraction, B, alone. We define this invertible, equivariant coupling layer via a kernel from the group to itself (I'll show a cartoon of this on the next slide), where the link is updated via a kernel that takes in a group element; the group element we feed it is the link, but composed into a gauge-invariant object, a loop that starts and ends at the same point. And the update is then conditioned only on gauge-invariant quantities constructed from elements of the frozen set of links. If you build your layers from kernels like this, then your coupling layer is equivariant, under a simple condition on the kernel. So just to make this clearer, here's a figure. We'd like to update this link here of our lattice. What we can do is compose our link into this object, the plaquette, by multiplying it by the other links that form this unit square here. So we update this link via updating the plaquette: we update the plaquette here via this kernel, conditioned only on gauge-invariant objects, plaquettes, constructed from links that are frozen in this application of the layer. Then, to get the update on the link itself, we multiply by the daggers of the other links that we composed it with: we basically compose into a plaquette, we transform, then we uncompose again to get our updated link. And of course, in doing this you also passively update this plaquette here. So you end up with a pattern where you update, for example here, all of these links, that was the updated link here in blue; then you have your frozen plaquettes, which provide context for your transformation; and you can repeat the pattern. Here we're updating all these blue links, and all these blue links again, moving across the lattice; then in the next layer you update a different set of links, and in the next layer a different set again, until you've updated all of the links in your lattice. And here's just a cartoon of another way of thinking about this: you go from one layer to the next, and you can take context from the links and the loops surrounding the variable that you're trying to update and push all of that information forward into the next layer in this gauge-covariant way.
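To make the fold-transform-unfold mechanics concrete, here is a deliberately simplified sketch for a two-dimensional U(1) theory, where links are just angles and plaquettes are gauge invariant. The masking pattern, the context choice, and the toy kernel are illustrative only and much simpler than the published construction; the final lines numerically check the key property, that the update commutes with gauge transformations.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 8                                               # toy 2D U(1) lattice of link angles

def plaquettes(theta):
    """Plaquette angles P(x) = th0(x) + th1(x+0^) - th0(x+1^) - th1(x), mod 2*pi."""
    t0, t1 = theta
    return (t0 + np.roll(t1, -1, axis=0) - np.roll(t0, -1, axis=1) - t1) % (2 * np.pi)

def toy_kernel(p, context):
    """An invertible map of the plaquette angle, conditioned on gauge-invariant context
    (a toy stand-in for the spline / Mobius kernels discussed in the talk)."""
    return (p + 0.9 * np.tanh(context) * np.sin(p)) % (2 * np.pi)

def gauge_equivariant_layer(theta):
    """One simplified gauge-equivariant coupling layer: a sparse set of mu=0 links is
    updated by folding each into its plaquette, transforming that plaquette with a
    kernel conditioned only on frozen plaquettes, and unfolding back to the link."""
    theta = [t.copy() for t in theta]
    p = plaquettes(theta)
    active = np.zeros((L, L), dtype=bool)
    active[::4, ::2] = True                         # each transformed plaquette holds one active link
    context = np.roll(p, 1, axis=0)                 # plaquettes built only from frozen links
    p_new = toy_kernel(p, context)
    # th0(x) enters P(x) additively, so shifting the plaquette shifts the link by the same amount.
    theta[0] = np.where(active, (theta[0] + (p_new - p)) % (2 * np.pi), theta[0])
    return theta

def gauge_transform(theta, omega):
    """U(1) gauge transformation: th_mu(x) -> th_mu(x) + omega(x) - omega(x + mu^)."""
    return [(theta[mu] + omega - np.roll(omega, -1, axis=mu)) % (2 * np.pi) for mu in (0, 1)]

theta = [rng.uniform(0, 2 * np.pi, size=(L, L)) for _ in range(2)]
omega = rng.uniform(0, 2 * np.pi, size=(L, L))
a = plaquettes(gauge_equivariant_layer(gauge_transform(theta, omega)))
b = plaquettes(gauge_equivariant_layer(theta))
print(np.allclose(np.exp(1j * a), np.exp(1j * b)))  # True: transforming then updating matches
                                                    # updating then transforming (plaquettes agree)
```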
Okay, so I'm not going to go into too much technical detail on how you extend this to SU(N). We've gone into some gory detail for U(1), but you can extend exactly this idea to SU(N) matrices on each link; you just need an extra step via an eigendecomposition of the SU(N) link matrices. The way to read this onion figure here is that the outer gray pieces are exactly what I described in the U(1) case: we take this link, we compose it into a plaquette, we transform, we uncompose, and we get our updated link. But to transform an SU(N) variable we need this inner piece here, which we call a spectral flow. The intuition is that each conjugacy class in SU(N), or in U(N), is the set of all matrices with some particular spectrum, so for example all matrices with some given eigenvalues, and intuitively this kernel, this transformation, should move density between possible N-tuples of eigenvalues while preserving the eigenvector structure. In one of our papers we showed exactly that: you can generically define a kernel as an invertible map that acts on the list of eigenvalues of the input matrix, is equivariant under permutations of the eigenvalues, and leaves the eigenvectors unchanged. Here are just some nice pictures, for SU(2), SU(3), and SU(4), of what these eigenvalue spaces look like, and you can go up to large N: here's a picture for SU(9) flows, and we went up to something like SU(100) or SU(150) and did the same thing. Okay, but let's come back to seeing how this works in practice, the application of these sorts of ideas, and look at the application to a U(1) field theory. This was our simple gauge field theory example, where we have just one complex number, U = e^{iθ}, per link on a two-dimensional lattice, instead of a three-by-three complex SU(3) matrix per link on a four-dimensional lattice. Then we can write down an action expressed in terms of plaquettes, so products of links around closed loops, with just a single coupling here; the plaquette is just this product of links that we saw before. And again, we can explore this for a set of lattice sizes, or a fixed lattice size with some set of couplings, where again we've just chosen the parameters of these ensembles to give us a critical line. Then the picture looks very familiar: exactly the same sort of procedure as before, except now instead of a real number on each site we have a complex number on each link of this lattice. We can choose our prior distribution to be uniform, and we can build our flow out of these gauge-equivariant coupling layers, something like 24 coupling layers, and the kernels in each layer you can make out of the same mixtures of non-compact projections, or you could have splines; any of those examples that I gave before for circles is exactly what you want. And you can parameterize those transformations via arbitrarily complicated neural networks, in particular convolutional neural networks. What I mean by parameterizing those transformations by the neural networks is that the output of the neural networks gives you the parameters of your spline, or your non-compact projection, of your transformation. Then again, you can optimize in exactly the same way: you can train using your shifted KL loss with some optimizer, and once you reach your stopping criterion you can sample. So, as a summary here: we parameterize our flow using, not the real non-volume-preserving coupling layers like we had before,
but the gauge-equivariant coupling layers that we designed. Again, these coupling layers are parameterized using arbitrary neural networks, so you have complicated, expressive functions of many free parameters that you can optimize. Then you train; it's not really just a neural network, it's an architecture including some neural networks, but you optimize the parameters of your model by drawing samples from the model itself. So again, you don't need any training data from the target distribution at all: you train purely by sampling from the model, you compute the loss function, you do gradient descent, and at some point you save the trained model. Then, exactly as before, you have a nice embarrassingly parallel way of sampling: drawing samples from your trained model, composing them into a chain, and doing an accept-reject. And what you can see is that, just as we saw for the φ⁴ theory, this in fact works. You again have a significant reduction of critical slowing down, in this context, at the cost of upfront training of the model: once you have trained the model, there is no more critical slowing down in sampling. What I'm showing here is a figure showing the sampling of the topological charge, which you can think of as the winding number of the gauge field. The Atiyah-Singer index theorem tells you that it's quantized; it's a global quantity, an integral over all of space-time, so it includes long-distance correlations. This charge is on the vertical axis, and on the horizontal axis you have the Markov chain step. For the blue line, the hybrid Monte Carlo or Hamiltonian Monte Carlo approach that I described, what you see is that although you're updating 40,000 times, the chain remains essentially stuck in topological sectors. This black line here is a heat-bath approach, another conventional algorithm, which does better than the Hamiltonian Monte Carlo but also remains very stuck. And in orange you see the flow-based model, which is exploring the space very efficiently. We can rephrase this and ask about the integrated autocorrelation time, the measure of cost. Here's that autocorrelation time, that cost measure, on the vertical axis, and on the horizontal axis we have the coupling, so moving from left to right is moving towards topological freezing. And you see that for hybrid Monte Carlo and heat bath, the blue and the gray, the cost again diverges exponentially as we move from left to right, while the flow model remains fairly flat. The slight increase here comes from it being harder to train the models as you move to larger beta; the training does become more difficult, but you see that you can nevertheless train the model, and you still have orders of magnitude more efficient sampling at these large couplings where conventional approaches are struggling with frozen topology. And this is actually a fair cost comparison: the cost of generating one sample from hybrid Monte Carlo and one sample from the flow model is approximately equal. The heat bath is cheaper by about a factor of two, but that of course doesn't make up for the orders-of-magnitude gap here in cost per sample. So this is really a success. It's a proof of principle of an efficient and exact machine learning algorithm for sampling U(N) or SU(N) lattice gauge field theories.
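For orientation, both quantities in that comparison have compact closed forms in the two-dimensional U(1) theory. The sketch below uses one common convention for the Wilson plaquette action and the geometric definition of the topological charge; the normalizations are illustrative and not necessarily those used in the quoted results.

```python
import numpy as np

def plaquette_angles(theta):
    """Plaquette angles of a 2D U(1) gauge field given as link angles theta[mu, x, y]."""
    t0, t1 = theta
    return t0 + np.roll(t1, -1, axis=0) - np.roll(t0, -1, axis=1) - t1

def wilson_action(theta, beta):
    """Wilson plaquette action for U(1) in 2D: S = -beta * sum_x cos(P(x))."""
    return -beta * np.cos(plaquette_angles(theta)).sum()

def topological_charge(theta):
    """Geometric topological charge Q = (1 / 2*pi) * sum_x arg(P(x)), with each
    plaquette angle wrapped to (-pi, pi]; Q is an integer winding number."""
    wrapped = np.angle(np.exp(1j * plaquette_angles(theta)))
    return wrapped.sum() / (2 * np.pi)

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=(2, 16, 16))          # a random link configuration
print("S =", wilson_action(theta, beta=2.0), " Q =", round(topological_charge(theta), 6))
```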
But of course there's still significant work required, particularly on the engineering front, on how to scale up from these small examples. I showed you a two-dimensional lattice, 16 by 16, and ultimately we'd like a four-dimensional lattice, something like 48³ × 96, so there's a big scaling to be done, and moving these sorts of models onto exascale hardware is a big challenge. But it's a very exciting proof-of-principle success. I should also say that although for my particular application we still have a long way to go on scaling, a lot of the tools that we're using and developing here have other applications where the scaling is not so much of a challenge. In robotics, as I said, for a robot arm with many segments the motions of the joints are defined by angles, which live on compact connected manifolds, so being able to design flows on those spaces has applications in robotics. The same sorts of ideas have also been used in molecular dynamics and drug design to great success. So the outlook here is really that machine-learning-accelerated algorithms have huge potential, so far largely untapped, to enable first-principles nuclear physics studies that are currently just computationally impractical. In order to be able to get into some of the machine learning details, I gave just one example, which was flow-based generation of QCD gauge fields. If we could do this at scale with the same success, then what you would have is fast, embarrassingly parallel sampling, which would enable high-statistics calculations; but there's also a host of other advantages. We've found that you can easily retune trained models to nearby parameters. Normally you need to create an entire hybrid Monte Carlo Markov chain at any set of parameters you'd like to explore, so parameter-space exploration is extremely expensive; if you have a trained model that enables very efficient sampling, and you can easily retune it to nearby parameters, then parameter-space exploration becomes much more efficient. You can also reduce storage challenges: you could imagine storing only the trained model, not all of these samples, which are hundreds of gigabytes each. So these would be very exciting possibilities if we can extend this to scale. But of course, as I said, implementations at scale, although conceptually straightforward, need a lot more work to move to exascale hardware. Okay, thank you very much.

Fantastic, thank you very much, it was a great talk. Excellent. Okay, so we have a lot of time for questions. If anybody wants to ask a question, please raise your hand on Zoom, or type it in the chat window in case you cannot. Okay, if nobody speaks up, let me go ahead first. Let's talk about the upfront training. In principle, I could imagine you can pick certain trade-offs between when you stop the training and when you start the sampling. You'll get the correct distribution in any case, but maybe you could save a bit on the training and then lose a bit on the sampling efficiency. Is it a non-trivial effort that you would have to put into making this compromise and doing well there, or do you just always train until you reach your training convergence criterion and then move on to sampling?
Yeah, so you're exactly right that there's a trade-off there, and ultimately it depends on how many samples you want to generate, of course, where your cost curves will intersect. For our applications, the gauge field ensembles that we typically generate are shared in a national or international community because they're so expensive to generate. So I think ultimately the samples are such a limiting factor that, for our purposes, if we can train as far as we can, that's really where the biggest gains are going to come from, especially since we can retune these models for parameter-space exploration, so having just one well-trained model will be worth a lot. It's also very hard to estimate how many samples you ultimately want, because we're still using configurations generated 10 years ago in the community, just because of the cost. I think for us that's not a calculation we're going to worry about right now; at the moment the question is, can we even train anything to any sort of reasonable acceptance at scale.

Right, okay, that's fair enough. Okay, there's one question by Alex in the chat: could you please explain why you still need MCMC sampling after you have trained a flow model; can't it sample by itself?

Yes. So the flow model gives us samples from our probability distribution p̃, which is close to the probability distribution we want, but since we're trying to do a rigorous first-principles theory calculation, no matter how close p̃ is to p, that's not good enough: we want to be sampling from the distribution p itself, so we need to correct in some way. You could do reweighting instead of an accept-reject step, that's perfectly fine. But since the sampling is only the first part of our calculation, and on each sample you need to do a lot of work, it's typically more efficient to do an accept-reject and throw out correlated samples up front rather than doing all of the computational work on closely correlated samples. That's why accept-reject instead of reweighting, but you could do reweighting.

Okay, thanks a lot. Then there's another question by Will, please go ahead.

I was wondering if you've done any experiments on extensions of the lattice itself. As in, could you extend the lattice to more than a 64³ grid without losing generalizability of the model?

Yeah. So actually that's one of the ways we're training these models. We find that for fixed parameters you can use hierarchical model architectures, and so you can train on a smaller volume and extend to a larger volume. That's actually a direction where the model transfers pretty efficiently. What's more challenging is of course changing the parameters in the direction of criticality; that's where we're still trying to learn how to train most efficiently. But you're quite right that using hierarchical architectures you can transfer models from smaller to larger volumes.

Okay, thanks very much. Thanks a lot, Will. Any other questions? Could I ask a question, please? Of course, please go ahead. So regarding the importance reweighting step, I was under the impression that importance weighting schemes scale quite poorly with dimensionality. So I'm just wondering about the kind of challenges faced when you have this trained model and you're trying to get the guarantees of correctness.
Do you face a lot more challenges with acceptance rates and importance weights as you go up to these very high dimensionalities at scale?

Yes, you certainly do. So, naively, I said that transferring works very efficiently; even if it works perfectly, you still expect an exponentially bad scaling of your acceptance rate. What you find is that you need to train your model better as you go to higher dimensionalities, which is exactly one of the things that becomes challenging.

So you have found that it is possible to train models well enough to get decent results?

Yeah, that's right. I mean, it's no accident that I showed you an example of a case where hybrid Monte Carlo and heat bath are extremely stuck in topological sectors; if we chose a place where they work efficiently, then we're not going to win, because we face that challenge also. We have found that in cases where you really have severe critical slowing down, it is bad enough that, even facing the volume-scaling challenges, we can win by orders of magnitude.

Thanks very much. Thanks a lot, Andrew. Okay, I have one other question, which is fairly broad and mainly for my own education, if I may. You spoke a lot about gauge fields and how to sample gauge field configurations and so forth. Is there anything that adds to the challenge as soon as you start to include fermionic fields, dynamical ones?

Yes, very much so. That's still something that's work in progress. There are ways to deal with it, but it's certainly a non-trivial extension of what I've shown you. I can't tell you details of how we're solving this because of a nondisclosure agreement with some of our collaborators, but I can tell you that you're exactly right, that is a challenge, and other than just scale, that is the remaining piece of conceptual difficulty beyond what I've shown you. And we should have something coming out fairly soon on that.

I see, wonderful, looking forward to that. Will, do you have another question? Please go ahead.

Yeah, I was just wondering if you have any plans to look at the explainability of these models at any scale, whether you think you can recover exact physics from what is learned in these models.

So the explainability is an interesting question. We do know one example of a very nicely defined sort of flow that does this: Martin Lüscher's trivializing maps are an example of a map that takes you from a trivial distribution to your gauge theory of interest. So we have a limiting case of a nice, explainable example. But that particular example is not very efficient to implement; it's expensive. What we're essentially trying to do is approximate that sort of trivializing map via our flow architectures, which are more general. In terms of learning something from the flow, it's challenging for a range of reasons. We've done experiments; you can think about doing something like making the flow go along the coupling, so starting from a trivial theory in a limit of zero or infinite coupling, and actually guiding your flow via loss functions on intermediate layers to train along a physically motivated trajectory, or other similar ways of trying to force interpretability. And we found that typically that's less efficient, or it just doesn't really help. We've also tried to diagnose what trajectory exactly the flow is following.
And we can talk about how the correlation structures emerge, but that doesn't really tell you much about the physics, since you're evolving in a space where, at any intermediate point, it's hard to match on to the parameters of a theory corresponding to a simulation at that intermediate point. There's also earlier work that I've done on how, if you just take samples from a theory, you can use neural networks to go backwards and ask: what are the parameters of the closest QCD-like theory, or extensions of it, that these samples most likely came from, and try to map backwards and understand, like a flow in model space, how things are moving. And that's challenging as well, because if you think about coarsening and refining, doing a renormalization group flow, in a very large number of dimensions, that adds an infinite number of terms to your action, and if you're trying to determine the parameters of all of those terms, it gets complicated in various ways. So I guess the answer is: yes, we're thinking about it; it's not clear what you can learn, if anything, from the interpretability of these models, but it's a really interesting question.

Okay, thanks very much. Thanks indeed. Okay, if there are no more questions, then we thank you very much, on behalf of the entire audience and on behalf of myself, of course, for this wonderful talk; I definitely learned a lot. And I'm looking forward to your first-principles computation of the proton radius, I guess. Thank you very much.