Our speaker today is Michael Kagan. Michael received his PhD in experimental particle physics from Harvard in 2012. Ever since, he has been affiliated with the SLAC National Accelerator Laboratory at Stanford. He is also a member of the ATLAS collaboration, which operates the experiment of the same name at the Large Hadron Collider at CERN. With his work, Michael harnesses the power of neural networks to extend the physics capabilities of ATLAS, but also teaches the wider particle physics community how to use these techniques well. Many of these applications have more general cousins that lie beyond the boundaries of high-energy physics. In today's seminar, he will cover two such examples, namely optimal experiment design and unfolding. With that, Michael, over to you.

Thank you very much for the kind introduction, for inviting me here today, and to everybody for coming. Let me share my screen. Can you see that okay? Yep, that comes up nicely. Okay, great. So yeah, thanks very much for the intro. Today I'll be talking about some of the work we've recently been doing on using these very powerful tools that have come about in the last few years, generative models, to help with some of our inference tasks, mainly optimization and unfolding in high-energy physics. But as was mentioned, these are methods that can be used elsewhere, and one of the places we're really going to focus is the boundary between machine learning and the simulators that we use in high-energy physics and other sciences. I should mention that many collaborators worked on this — great work — including some from here at Oxford, and I want to mention them and their nice work up front.

So we're going to be talking a lot about using generative models in different inference pipelines and analysis settings. In many senses, generative models are approximators: they approximate and simulate a data-generation process, whether that's in high-energy physics, in cosmology, or in climate modeling. "Generative model" covers a very wide array of such processes: scientific simulators, differential-equation models of physical processes, and, more recently, deep generative models. Now, deep generative models and scientific simulators certainly have different properties, but in some sense both are trying to model a data-generation process. When we compare and contrast the two: scientific simulators are computer codes built from our scientific knowledge, from foundational principles, by taking a complex physical phenomenon or a complex observation and detector, like ATLAS at the LHC, and breaking it down into components so we can simulate it from the ground up. They tend to have relatively few parameters — for instance, the parameters of the Standard Model in high-energy physics, or the parameters of the geometry of your experiment — and these are often very interpretable. Because of this ground-up, physics-based approach, scientific simulators can be very high fidelity, but often computationally costly. On the other hand, when we think about machine learning for generating data, we're thinking about models that can have millions, billions, and these days even a push towards trillions of parameters.
They're not necessarily interpretable, and instead of using physics knowledge to build up the simulator, we use data to fit all these parameters. We might use some of our knowledge about the physical processes to design the model and the optimization procedure, but in general it's a large function fit with some structure added in. This can be slow to train when you're fitting millions or billions of parameters, but afterwards it's just the evaluation of a neural network, the evaluation of a function, which can be relatively fast. So on one side there's learning from data, and on the other building up from foundational knowledge, and today we're going to be talking about trying to put these two together.

One of the reasons we might want to put them together is that when you look at simulators in something like high-energy physics, and in many other domains, we have a reductionist approach to understanding how data is generated. We have a theory like the Standard Model; we can generate plausible interactions from this model, in this case producing top quarks. There is then, for instance, a radiative process that's roughly independent of the previous step and set at a different energy scale; we can produce the particles that we might see in a detector, and then we can simulate their interaction with the detector. This goes from a model that has something like 19 parameters to observations in a detector with 100 million sensor readouts, 100 million points or parts of that observation. So while we know how to simulate and build up these 100-million-element detector observations, we cannot compute the probability of seeing any given observation: we don't know what the probability of that configuration is. We can simulate, we have a mechanistic understanding of the interactions, we can generate plausible samples and say what data could look like given these parameters, but we cannot say how likely anything is. That turns out to put constraints on our analyses when we want to do inference. So while it's straightforward to run the simulator in the forward direction — this black-box simulator here on the bottom, given a set of parameters, producing observations — it's very hard to take a set of observations and invert the process. The reason we don't have likelihoods is that in the middle of this simulation, going from parameters through the intermediate steps of the physical simulation to the eventual observations, there are many random processes going on, and to compute the likelihood of any observation we would have to integrate over all these millions and millions of random variables, the latent information that occurs in the course of the simulation. In the collider setting, that might be particles radiating energy and particles interacting with detector elements. So we want to think about how we can take these simulations and use machine learning to help do inference: learning about parameters, or learning about this latent information in the simulation chain.
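To put that point in one line (the notation here is editorial shorthand, not taken from the slides): the likelihood of an observation is a marginalization over the simulator's latent variables, and it is this integral that is intractable, not the forward sampling.

```latex
% x: the observation (e.g. ~10^8 detector readouts), \theta: model/detector parameters,
% z: the latent randomness of the simulation chain (shower histories, detector interactions, ...)
p(x \mid \theta) \;=\; \int p(x \mid z)\, p(z \mid \theta)\,\mathrm{d}z
\qquad \text{(easy to sample forward, intractable to evaluate)}
```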
The kinds of common questions that come up have names in particle physics, like reconstruction: given an observation, what were the properties of the particle that produced it? So given an electron's energy that we observe in a detector, what was the true energy that entered the detector? Or, if we have a distribution of observed electron energies and we want to know the distribution that was input to the detector, that's unfolding, which we'll talk about a little later. We may want to infer the parameters of the Standard Model: what's the Z mass, given a bunch of di-electron events? And going further, supposing we wanted to build a detector that was even better at measuring the Z mass, we may want to optimally design that experiment, tuning the parameters of the experiment's structure, of the detector geometry. These are the kinds of questions we may want to ask, and they're challenging because of the nature of the simulator, the fact that we can't evaluate the likelihood of anything we simulate.

When we write this down in terms of common inference questions: we might want to do maximum likelihood — given some observations, find the parameters that would make those observations most likely. We may want to compute posteriors, using Bayes' rule to ask, given the observations, what is the distribution of parameters that could plausibly have produced them. Or, when we think about optimization or simulator tuning, we pose it not as maximum likelihood but as maximizing or minimizing some cost function, some objective. For instance, if we want particles to be swept away from a detector, we want to minimize a cost for particles hitting the detector, as we'll see in an example later.

So today we're going to be talking about some techniques in simulation-based inference, this combination of machine learning and simulators, where we combine the ability of simulators to produce high-fidelity data with generative models. One of the tricks along the way is that we'll use the simulator to produce observations, and generative models for sampling and for density estimation — and, very importantly when we want to do optimization, to approximate gradients. This is the kind of thing I mean by deep generative models. You've likely seen these in other seminars, so I won't go through them in too much detail, and I'll come back to them in different sections of the talk. What I'm talking about here are things like generative adversarial networks, variational autoencoders, or normalizing flows. These are different kinds of generative models, and in each of them we have something like a generator, or a decoder, a neural network that can go backwards and forwards. The point I want to focus on is that these models are all built around what's called a latent variable model: we have a latent random variable with a distribution that we know and can easily sample from — most often something like a multi-dimensional Gaussian — and we apply some deterministic function to it whose output is, for instance, a plausible sample of the data set.
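As a minimal illustration of that latent-variable construction (a generic sketch, not code from the work being described): the generator below is a deterministic network applied to Gaussian noise, and, as the speaker emphasizes next, its samples are differentiable with respect to the network parameters.

```python
import torch

# A stand-in generator g_theta(z): any differentiable network mapping latent noise to "data".
gen = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)

z = torch.randn(128, 4)                  # latent variable with a known, easy-to-sample density
x = gen(z)                               # plausible "data" samples, one per noise vector

# Because x is a deterministic, continuous function of the parameters, gradients of any
# differentiable objective on the samples flow back to the generator weights.
objective = (x ** 2).sum(dim=1).mean()   # placeholder objective, just to demonstrate autograd
objective.backward()
print(gen[0].weight.grad.shape)          # torch.Size([64, 4])
```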
So these deep generative models are parameterizing the transformation from a latent noise variable, z, into something that looks like real data. The transformation is typically a neural network that depends on some parameters, and, most importantly, it's continuous: we can take the derivative of this neural network, which means that for any sample we produce from the generative model, we know how the features of that generated data point depend on the latent random variable and on the parameters of the model. So we can sample data points, take their gradients, and use that for optimization, and we'll see that in two forms throughout the rest of the talk.

So we can dive right into the first aspect, which is black-box optimization, or design optimization, with what we call local generative models. To give you the problem setup, we can think about this in a physics setting. We have a magnet here in green that has some parameters; these parameters define the geometry of the magnet. Muons are going to be sent into this magnet and swept to the left and right by the magnet itself, and our goal is to optimize the magnet design so that the muons are swept away from the detector. You can see detector observations here on the bottom, and we want, for instance, to sweep the muons outside of this purple box where the sensitive detector elements are. To do that, one thing we can do is simulate the whole process: we have a simulator, something like Geant4, where we can simulate the impact of the magnet on an incoming muon, and with that simulator we can generate lots of data. We can also define an objective function. A very simple one: the loss increases if muons land inside the detector region and decreases if they land outside. That's not particularly differentiable as stated, but some smooth version of a loss like that is what we as users would define in order to say what is better or worse. Because this is a stochastic process, our objective is the expected value of this loss over a bunch of muons, and our goal is to minimize this objective R over the set of magnet geometry parameters.

Now, this is a bit challenging. Often when we think about optimization, one of the first things that comes to mind is gradient-based optimization, but our simulator here is a computer code that's been developed over many years, and we can't necessarily compute gradients through it. To set up a flowchart of the process: we've formalized the minimization problem for design optimization; we're sampling data points and computing their loss; given some inputs and parameters, we can run the simulator to produce outputs and evaluate the objective. But we don't know how to compute the gradient of the simulator with respect to the parameters. That's why I call this black-box optimization: we don't have access to the internal gradients of the simulation code.
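Written compactly (again editorial notation rather than the slides'), the design problem just set up is an expectation that can only be estimated by running the simulator forward:

```latex
% \psi: magnet geometry parameters, x: simulated detector observations, \ell: user-defined cost
\psi^{*} \;=\; \arg\min_{\psi}\; R(\psi),
\qquad
R(\psi) \;=\; \mathbb{E}_{x \sim p(x \mid \psi)}\!\left[\,\ell(x)\,\right]
% p(x \mid \psi) is defined only implicitly by the (non-differentiable) simulator,
% so \nabla_{\psi} R cannot be obtained by automatic differentiation of the simulator itself.
```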
So how would we normally do this? Well, in a physics setting, where there have been decades of research on detectors, one approach is expert intuition: people with many years of experience can have a very good idea of how to start optimizing new devices. In a more general sense, if we want to optimize some generic cost function, we might rely on numerical derivatives, finite differences. These can have numerical instabilities when the step sizes are very small or when the function is stochastic. We can think about other methods, such as the score-function estimator — basically the REINFORCE gradient used in reinforcement learning — which can be applied here but tends to have high variance. And there are techniques like Bayesian optimization, but these can be costly because they can involve large matrix inversions. So, approaching this problem where we have a very costly simulator — in this case Geant4, to optimize a magnet design — we wanted to think about ways in which we could reliably compute gradients and perform gradient descent.

If we look back at our optimization problem: although we can't take gradients of the simulator, one thing we can do, as we discussed a bit earlier, is approximate the simulator. This black dotted box is the process we're going to approximate with a generative model, which we call a surrogate. The surrogate is a generative model trained by taking our simulator, generating a bunch of data points, and learning, given those data points and some random latent variable z whose distribution we know, to output a plausible sample from the true simulator. So we're building a generative model to mimic this dotted box. In this work we typically rely on GANs, which tend to be very good at generating plausible, reasonable-looking data, and I'll describe them only very briefly. In a generative adversarial network, we start with some random noise z and apply a function that outputs a sample with the right dimensionality of the data — in this case an N-by-N image. The output of this generator is fed into a discriminator D, whose job is to tell the difference between real and fake images. In this example it's not hard to tell the difference, but as they are trained it becomes harder and harder, so it becomes an adversarial game, a competition between the generator trying to produce more plausible images and the discriminator trying to tell real from fake. This is set up as a minimax optimization problem, where the classification loss of the discriminator is the objective: the discriminator maximizes its discrimination power whilst the generator tries to minimize it. We won't talk much about the training dynamics — they are a little tricky to deal with — but nonetheless we can go ahead and train them on our dataset with some tricks of the trade.
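For concreteness, here is a generic GAN training step of the kind described — a minimal, self-contained sketch rather than the actual surrogate used in this work (architecture, losses, and the "tricks of the trade" he alludes to are all simplified away).

```python
import torch

latent_dim, data_dim = 8, 2
G = torch.nn.Sequential(torch.nn.Linear(latent_dim, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, data_dim))
D = torch.nn.Sequential(torch.nn.Linear(data_dim, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, 1), torch.nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = torch.nn.BCELoss()

def gan_step(real):
    n = real.shape[0]
    # Discriminator update: push D(real) -> 1 and D(G(z)) -> 0
    fake = G(torch.randn(n, latent_dim)).detach()
    d_loss = bce(D(real), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # Generator update (non-saturating form): push D(G(z)) -> 1
    g_loss = bce(D(G(torch.randn(n, latent_dim))), torch.ones(n, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

gan_step(torch.randn(64, data_dim))   # one step on a dummy "real" batch
```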
The result, in a very simple example: we're trying to simulate plausible muon hits on a detector when we send muons through a magnet. We give the GAN the input vector of a muon, its input direction, and it outputs a hit on the detector. We can then adjust the length of the magnet and have the generative model learn how to generate plausible hit patterns as a function of that length. Here, as the length of the magnet increases, the muons hitting the detector are separated, and if you look at the left and right, this is a comparison between the generative model and the true simulator — it's a relatively good match. So we can generate reasonably plausible-looking data. That's good: at least the step of emulating our simulator is something we think we can do, and that's not too surprising, since these models are very powerful. But we need to go one step further. We don't just want to generate data, we also want to optimize the parameters we're going to use — what are the best parameters of this magnet?

When we look at this loss function, we're looking at the expected value of the objective: we're adding up the loss of the objective function over every simulated data point. Instead of evaluating this loss on the simulated outputs, we can evaluate it using the generative model, which we call the surrogate S here. It takes as input the parameters of the geometry, it takes as input, for instance, the incoming muons, and it takes a random variable so that we can stochastically generate potential observations. And nicely, if we compute the gradient of this approximate objective here on the bottom, we can use the chain rule and pass the gradient right through. This requires the objective itself to be differentiable, but that's typically not the challenge — it's usually the simulator that's the non-differentiable part. So we can pass the gradients directly through the objective function to compute gradients of the generative model with respect to the parameters we want to optimize. And because we can compute those gradients with respect to the parameters, we can now do gradient descent: we iteratively compute the gradients of the objective, update the parameters, and repeat until convergence — a basic gradient descent algorithm.
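The chain-rule step he describes looks roughly like this in code — a sketch with made-up shapes and a toy objective, where `surrogate` stands in for the trained GAN and only the design parameters `psi` are being optimized.

```python
import torch

n_psi, n_muon, n_latent = 2, 3, 4
# Stand-in for the trained surrogate S(psi, muon, z) -> detector hit (2-D position)
surrogate = torch.nn.Sequential(torch.nn.Linear(n_psi + n_muon + n_latent, 64),
                                torch.nn.ReLU(), torch.nn.Linear(64, 2))

def objective(hits):
    # Toy cost: large when hits land near the sensitive detector region at the origin
    return torch.exp(-(hits ** 2).sum(dim=1)).mean()

psi = torch.tensor([0.5, 1.0], requires_grad=True)       # magnet design parameters
opt = torch.optim.SGD([psi], lr=1e-2)

muons = torch.randn(256, n_muon)                          # incoming muon kinematics
z = torch.randn(256, n_latent)                            # surrogate's latent noise
hits = surrogate(torch.cat([psi.expand(256, n_psi), muons, z], dim=1))
loss = objective(hits)                                    # Monte Carlo estimate of R(psi)
opt.zero_grad(); loss.backward(); opt.step()              # gradient step on the design only
```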
So we can go ahead and do this on various toy examples. In this case we look at the expected loss as a function of the parameter values in a two-dimensional toy problem, with the loss function shown here by the different colors. Blue is the minimum, and it's basically a ring that slopes upward away from the center — down and then up. When we look at this true loss function, we could start at any point in the parameter space, psi1 and psi2, run gradient descent, and it would take us directly down to somewhere on this blue minimum ring. When we approximate it with a generative model, it's a fair bit noisier: we've evaluated the generative model at various points in parameter space, and you can see it's not perfect, but it's quite reasonable. At different starting points, represented by these triangles, we can run gradient descent using this approximate simulator, and the gradient descent works reasonably well — we find our way to the ring.

Okay, so this works. It's a nice example that shows we can do gradient descent with generative-model-based gradients in two dimensions, but it turns out that scaling this to high dimensions is quite challenging. The way to think about the reason is that when you have many, many parameters, it's very hard to learn, simultaneously over all parameters, not only a very good generative model but also very good gradients. If we don't sample enough points in the important regions of the parameter space, we may be relying entirely on the generative model to interpolate there, and that's not going to be great. So learning in high dimensions all at once didn't really work well for us, and we started to think more about gradient descent as an algorithm. At each point in gradient descent, you're computing the gradient at the current parameter point — you don't need to know what's happening at some far-off point in the loss surface. With that in mind, instead of trying to learn over all parameters at once, what we ended up realizing is that we can just learn a generative model around the current parameter point. If we're running gradient descent and we have some value of the parameters psi, we can learn a generative model in some box, some region right around where we are in parameter space, approximate the loss function locally, and that's enough to learn a gradient over the loss surface and take a gradient step. So the algorithm gets extended: given some set of parameters, we sample different values of the parameters in a local region, use those to run our simulator and generate training data, train a generative model, use that generative model to take a gradient step and update the parameters, and then start over with a new local region and a new local approximation of the loss function. We're training a generative model at every step of gradient descent, but we can do that because we can always sample the parameters locally. Of course, if the simulator is very fast, training a generative model at every step may be very costly in time, but in many cases the simulator is very, very slow, and the additional time spent training a generative model is not the dominant cost in the process.

If we look at a cartoon of this happening: on the top you can see a complex loss function of two parameters, with two minima, top and bottom, in blue. We're trying to make our way to the red star, and at each step of gradient descent we retrain a generative model and approximate its gradients. On the bottom you can see the comparison between the true gradients in this toy problem, where we know what they are, and those estimated by the generative model, the GAN — and they match reasonably well. They're certainly correlated, and they allow us to take good gradient steps down the loss function. In black is the region where we're actually sampling data points and approximating the loss function; outside of it is an extrapolation, where the gradients tend not to be as informative.
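Putting the loop together in pseudocode (the helper names `run_simulator`, `train_surrogate`, and `objective` are placeholders for Geant4, the GAN fit, and the user-defined cost, so this sketch is not runnable as-is; it only shows the control flow just described).

```python
import torch

def local_surrogate_descent(psi0, run_simulator, train_surrogate, objective,
                            steps=50, n_local=16, n_events=1000, box=0.1, lr=1e-2):
    psi = psi0.clone().requires_grad_(True)
    opt = torch.optim.Adam([psi], lr=lr)
    for _ in range(steps):
        # 1. Sample design parameters in a small box around the current point
        local = psi.detach() + box * (2 * torch.rand(n_local, psi.numel()) - 1)
        # 2. Run the slow, non-differentiable simulator only at those local designs
        data = [(p, run_simulator(p, n_events)) for p in local]
        # 3. Fit a generative surrogate to this local neighbourhood
        surrogate = train_surrogate(data)
        # 4. Differentiate the Monte Carlo estimate of the objective through the surrogate
        z = torch.randn(n_events, surrogate.latent_dim)
        loss = objective(surrogate(psi, z)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return psi.detach()
```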
Okay, so with this local model in hand, we can test it on a variety of toy problems — there are various toy problems in the optimization literature that people often use. One of them, for instance, is the ten-dimensional Rosenbrock problem, which has multiple minima. We also tested this on optimizing the weights of a neural network: given a neural network output and some loss function, we can try to run gradient descent directly by approximating the neural network itself and outputting updates of its weights — that was a fun example. What we do is compare this optimization approach with things like numerical optimization with finite differences, Bayesian optimization, and the reinforcement-learning-type approaches, which are in blue, for instance. What you see in red and green are the generative-model-based approaches, these local generative model approaches, and in terms of the number of function calls needed to reach a minimum they tend to be very competitive with the other algorithms — even faster than many. This depends on the complexity of the problem, but we can see that, counted in simulator calls, this method can optimize reasonably quickly, which is good: we can try to minimize how many times we actually have to generate data from a slow simulator. On the right, for the neural network optimization, we can optimize relatively quickly, and even somewhat better than some of the recent reinforcement-learning-based approaches, although at very large numbers of function calls they reach very similar minima. So we can optimize these objective functions quite well. One thing we also noticed is that when there's some degeneracy in the parameters — when two parameters are controlling the same part of the simulator — we can still optimize, and that tends to be a very difficult setting for a lot of other methods, especially numerical optimization.
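For readers who want the flavour of these benchmarks, a noisy Rosenbrock objective in ten dimensions looks like the following (a generic variant; the exact stochastic form used in the study may differ).

```python
import torch

def noisy_rosenbrock(psi, sigma=1.0):
    # Standard Rosenbrock valley in len(psi) dimensions, plus additive noise so that
    # the optimizer only ever sees stochastic black-box evaluations.
    core = (100.0 * (psi[1:] - psi[:-1] ** 2) ** 2 + (1.0 - psi[:-1]) ** 2).sum()
    return core + sigma * torch.randn(())

print(noisy_rosenbrock(torch.zeros(10)))   # one noisy evaluation at a candidate design
```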
So we can move on from toy problems and look more closely at a physics problem. We can go back to the magnet problem, but instead of optimizing the one-dimensional length of a magnet, we can have a multi-stage magnet. This is a simplified version of the SHiP experiment: basically a magnet with 42 degrees of freedom that control the length, the height at the beginning and end of each magnet stage, and the gaps between the magnets. As I mentioned, this is simplified, but it's certainly much more realistic than our one-dimensional magnet system. This experiment is looking for dark-sector mediators, and one of the goals is to sweep muons away from the detector apparatus, away from the sensitive detection area, with such a magnet.

So we can run this approach, iteratively running Geant4, generating simulations, and training the generative model for gradient descent, and the loss nicely drops over time. In fact, you can look at how the magnet shapes change — this is the gap at the end of the magnets for each of the five magnet stages — and you can see these gaps shrinking and expanding as we perform gradient descent. It's very nice to be able to look back and see, interpretably — because we're taking gradients with respect to interpretable parameters — why the loss is changing with respect to changes in the input parameter structure. You can see a cartoon of the optimization taking place: this is one of the 2D projections of the 3D picture I showed on the last slide, z is along the length of the magnet, and these are the x and y cross sections on top and bottom. Across the entire optimization it was pushing the magnet to be longer and shaping the various gaps between the magnet stages. When compared with a recent approach that does this with Bayesian optimization, we were able not only to get a better loss — to sweep more muons away — but to do it with a shorter and lighter magnet. This is still a simplified simulation, but I think it's exciting for application in more realistic settings.

Just a few technical details, pros and cons. When we dig into the details of these generative models and look at the gradients, we nicely see that they don't tend to be biased: they tend to point approximately along the direction of the true gradient. This is the average difference between the true gradient and the approximated gradient in toy problems where we know the true gradients, and it's consistent with zero bias, so that's very good — we're consistently taking gradient steps in the right direction. Unfortunately, one of the cons is that we have to train a generative model at each gradient-update step. We can reuse simulations from previous steps, but it still tends to be relatively costly in terms of generation time and training time for the generators. The other challenge is that gradient descent tends to find the closest minimum, and on a very complex loss surface, making your way to the closest minimum may not be ideal. So thinking about exploration-versus-exploitation trade-offs, and combining these methods with some of the exploration capabilities of Bayesian optimization, is, I think, a fun direction for the future. But nonetheless, that's where we were able to get to with optimization so far. In the next part of the talk I'll be talking not just about finding a single optimal point, but about finding a distribution of parameters, a distribution of latent variables, that could have generated a set of observations. I don't know if it makes sense to pause for questions now, or to finish up and take questions at the end — Philip, do you have a preference?

I don't have a preference, but let's just take a roll call: does anybody have any urgent questions now?

Well, can I just ask a sort of basic question? You're getting the gradient to do gradient descent from the generative models, and they appear to be much better than trying to get them just from simulation. Is it just because with the generative model, with a GAN, you get many more samples?
So, in terms of just doing it with the simulator, with something like finite-difference methods: one of the problems, especially if you need to take small steps, is that it can be quite noisy, and you can be sensitive even to numerical approximation errors. What we're doing with the generative model is fitting a local model to the structure of the loss function, so we're able to interpolate even in relatively complicated loss-function landscapes, and then the gradients are better defined. Finite differences can be quite noisy, and they don't tend to work well in stochastic optimization problems — when there's a lot of random noise coming from the stochastic nature of the simulator, they can be even noisier. So we find that we're able to interpolate very well and approximate the loss surface, and that's where the benefit comes from: the approximation capabilities of the generative model.

Okay, thanks for that. Duncan, you also have a question, please go ahead.

Thanks, Michael. I just had a query about the training of the GANs. If I understood this right, you're able to train these GANs on only tens of simulations, is that right? My feeling would be that you need many, many thousands of training data samples to train a GAN — I guess it depends how big your GAN is — but I'd be interested to hear about those kinds of numbers.

Yeah, absolutely. So there are various numbers here that are important, which I didn't go through in detail so as not to obscure the goal too much. In this process we have both the parameters and the observations we're trying to generate. At any given value of the parameters, we may simulate a large number of potential observations, so we're getting a good set of training data for the generative model, for the GAN, to understand how to plausibly generate observations stochastically for a fixed parameter set. What we then do is simulate lots of data at lots of parameter values and train on it all at the same time. So although in a two-dimensional parameter space we may only sample three, four, five different values of the parameters, for each value we may simulate five or ten thousand points.

Is that just because that's cheap for your simulator? Sampling that latent space is cheap in your forward model, is it?

You mean in the actual simulator? No, that can be very costly, and that's why, if you look, we don't talk much about wall-clock time for training — we talk almost exclusively about the number of function calls. In these toy examples the simulation time is trivial, and the GAN training time — well, these aren't very big problems, so it's not very long — but in something like running Geant4, where simulating a thousand events can take an hour, that trade-off changes quite a bit. So we're not so worried about wall-clock time: we're basically assuming that, in the kinds of examples where you'd want to use this, the simulator is going to be quite slow, and you're willing to spend a little time getting a better gradient estimate, because running more simulator calls is going to be much more costly.
Yeah, okay. So it's not necessarily cheap to generate the data, but you have to do that no matter what. Just as a matter of curiosity, do you have some nice interface for automatically setting up these runs? I guess you spent a bit of time creating an automated way of spinning off these Geant4 runs, because otherwise it's pretty tedious for the full example.

Yeah, this was all containerized and set up to run on a cluster — I'm not sure if it was a Kubernetes cluster, but at the very least it was a cluster that could handle Docker and Singularity containers in which you can run the simulator.

Awesome, thank you.

I should note that it was really the students who put that together, not me — the containerized SHiP simulation setup was theirs.

Okay, so why don't we move on to the next section. Thanks for the questions.

So in the last section we were setting up these objectives and trying to find one value of the parameters that optimized the objective. When we think about unfolding, we're thinking more about finding distributions: given some set of observations, what is the source distribution — the distribution of inputs that went through some smearing process and generated those observations? For example, we have some distribution of electrons produced from the decay of Z bosons; they go through a detector, and we observe their energy in the detector, which is some smeared-out version of the source distribution. Our goal is, from the output distribution, to find the input distribution. You can write that down probabilistically: the probability distribution of the observations is an integral over the unobserved variables, the true values as we'll call them — a true distribution integrated against the likelihood, which encodes, for a given true input, the distribution of outputs the detector would generate. So from the observed distribution we want to find the source distribution, but, especially when we can only run the simulator in the forward direction, we can't compute these integrals and we can't estimate these densities, so solving this problem with continuous functions over many variables is challenging. In fact, one of the ways this is frequently done is a histogram-based approach: we take our output distribution and bin it into a histogram; we run the simulator with a bunch of different potential input values, look at the output values that are observed, and bin those into a 2D histogram. This turns the problem into a linear inverse problem: the source histogram times this 2D likelihood matrix gives the observation histogram, and we can try to do the inversion, with various techniques needed to make it stable. One thing I would note is that this has a lot of problems with scaling: it doesn't scale to high dimensions or many bins, because trying to populate a multi-dimensional histogram with sufficient statistics in the bins can take enormous amounts of data — it can grow exponentially.
So this doesn't really scale to multi-dimensional unfolding with high resolution. What we instead want to focus on is how we can do this continuously, how we can use generative models in this process. So, the goal: we have this observed distribution, related to the source distribution through this integral, potentially over a large number of dimensions, and our goal is to find the p(x), the source distribution, that maximizes the likelihood of the observations. Now, this is an ill-posed problem — there are actually many source distributions that could do this — but our goal at this point is to find one of them. The approach we're going to take is to use generative models to approximate both the likelihood p(y|x), the probability of observed given true, and the source distribution; I'll show how we do that in the upcoming slides. Then, to evaluate the likelihood of the observations, we use a Monte Carlo approximation: the integral is approximated by a sum, where we sample data points from the source distribution we're trying to learn, evaluate the likelihood of a given observation at those different source points, and sum them up, giving an estimate of the likelihood of the observation under the assumed source distribution. Nicely, this is a continuous objective: we know how to relate p(y) to p(x), and we'll be able to use gradient descent through this objective function, which is a sum over evaluations of a likelihood.

That's the general approach, and we can show a cartoon of how it works. We start with some observed variables y — an observation in a detector, for instance. Then we take a generative model and use it to draw samples. It doesn't matter at this point whether it draws plausible samples; they just need the right dimensionality. So we have a generative model where we input random variables and it outputs potential truth values, values you might draw from the source distribution. That's a neural network with some parameters theta; we don't know what they are yet, we initialize them, and we run the model forward to generate a set of possible source samples. These are possible true values; they may not be very likely, but that's okay. In fact, that's the important question: how do we evaluate how likely they are? Much like in the previous example with the simulator, we don't actually need to know the distribution of source values to learn the likelihood: we can just simulate over a bunch of different true values, look at the observed values, and learn a generative model that maps true values to observed values — basically a generative model that's also conditioned on the value of x. We could do that with lots of different kinds of generative models if we just wanted to sample data points, but one of the keys here is that we don't just want to sample: we also want to evaluate this value of p(y|x). Nonetheless, we use the same approach with a different kind of generative model. So we use Geant4 or some simulator to simulate a bunch of true and observed pairs, then train a neural-network-based generative model to approximate that process. Once that's done, we can evaluate it for any given observation: we know y, and we know the source values because we've sampled them from our initial source distribution, so we can evaluate this likelihood, and we get these points along this two-dimensional curve — this is x and y, and the color corresponds to p(y|x), the likelihood. We evaluate all these values and sum them up, and that gives an approximation of the marginal likelihood. And nicely, because this is just a sum over likelihood evaluations, we can take the gradient of the whole sum and pass it all the way back to the parameters of the generative model. The reason we can do that is exactly the same as in the previous example: this likelihood is a neural network, so we can take its gradient with respect to its input x, and x is a sample from a generative model that depends continuously on the parameters of that model — it's a deterministic transformation of noise — so we can take the gradients of these samples with respect to the model parameters and perform gradient descent. We can now backpropagate through our likelihood and through our samples to update the parameters of our source distribution.

Now, we don't necessarily have just one observation, we may have many, so we take a bunch of observations and sum the likelihoods over all of them, so that we learn a source distribution consistent with all the observations we see. To make this numerically stable we actually compute the log-likelihood, and we end up with the objective you see in the middle: we sum over the different observations y, computing the log of the marginal likelihood, which internally is this sum. It's a bit of a gross objective, but it is fully differentiable, and although the gradients are biased, they're not very biased — I won't talk too much about that today — and we can get a good enough approximation of both the objective and its gradients to run gradient descent through this whole loss function. So we compute gradients and repeat: sample data points, evaluate the likelihood, sum them up, take a gradient step, and repeat. Putting it all together: we want to maximize the log-likelihood of the observed data; we compute it through this integral over the unseen source distribution of x; we approximate that integral with Monte Carlo integration; and then we perform gradient descent to maximize the likelihood over the parameters.
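As a sketch of that objective (the `sample` / `log_prob(..., context=...)` interface is assumed here for the two learned models, and the names are mine, not the paper's): the marginal likelihood of each observation is a log-mean-exp over likelihood evaluations at source samples, and autograd carries the gradient back through those samples to the source model's parameters.

```python
import torch

def marginal_nll(source_model, likelihood_model, y_obs, n=512):
    # x_i ~ p_theta(x): sampling must stay differentiable w.r.t. the source parameters
    x = source_model.sample(n)
    total = 0.0
    for y in y_obs:                                  # sum over the observed data points
        # log p(y | x_i) for every source sample, from the learned likelihood model
        log_p = likelihood_model.log_prob(y.expand(n, -1), context=x)
        # log[(1/n) * sum_i p(y | x_i)], evaluated stably in log space
        total = total + torch.logsumexp(log_p, dim=0) - torch.log(torch.tensor(float(n)))
    return -total                                    # negative log-likelihood to minimize
```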
So the key now is that we need to understand how we're going to evaluate the likelihood and how we're going to sample data points from the source distribution, and here again we come back to using generative models to approximate these pieces of the loss function. Specifically, we're going to be using what's called a normalizing flow. A normalizing flow is a form of invertible neural network, but maybe the best way to think about it is through the well-known change-of-variables formula: if I have some function that takes a random variable z and outputs a random variable x through a transformation phi, then the probability distribution of x and the probability distribution of z are related by p(x) equals p(z) times the determinant of the Jacobian of the transformation. This is saying that however phi squeezes and stretches the volume, we need to take that into account in order to compute the probability of x, given that we know the probability of z.

So we're going to be looking for a transformation from z to x — from some simple noise distribution that we can easily sample from — and we're going to try to learn a transformation that gives us samples from the distribution we're looking for, which may be much more complicated. There are a few key things we need to ensure about this transformation. First, it needs to be invertible: that makes it a one-to-one map, so we don't have to sum over different potential points in either the z or the x space, and it lets us evaluate the likelihood, as we'll see. Second, we need its Jacobian to be tractable: if we had to compute the determinant of an arbitrary dense matrix, that could be a very costly computational operation, but there are matrices whose determinants are easy to compute, so we focus on transformations with easy-to-compute determinants. We don't do this all in one step; it's usually many layers of transformations coupled together, each one invertible with a tractable Jacobian, and since the determinant of a product of invertible matrices is the product of the determinants, we can chain them together and still compute everything. Each of these small transformations g is parameterized by a neural network; I'll show an example of what one looks like in the coming slides. Nicely, if we have some data point x and we want to evaluate its likelihood, we can invert the transformation to find where that point lies in z space, then evaluate the likelihood as shown in the top right corner: given this value of z we can compute p(z) — if it's a Gaussian we know how — and we can compute the Jacobian determinant, as we'll see on the next slide. So we can evaluate the likelihood of a data point under the current transformation and run gradient descent to optimize the normalizing flow's parameters to better match the distribution of the data. When we're done with this process, we have a neural network where we can input a sample of noise and output a plausible sample of the distribution we care about; we can also invert that process, starting with a data point and finding where it lives in noise space; and, importantly, given an observation, we can evaluate the likelihood of that observation.
That's where we're going to use these models: as our likelihood model in the empirical Bayes setup. So, just a quick example of a normalizing flow. One of these is called RealNVP — one of the early models, one of the simpler ones, but still quite powerful. Imagine our data is two-dimensional, x1 and x2, and we're looking for a transformation from noise z to samples that might look like x. The transformation is done component-wise: for the first component we just map z1 to x1, doing nothing to it; for the second component we take z2 and multiply it by the output of a neural network that depends only on z1, and then add the output of another neural network that also depends only on z1. Nicely, this is algebraically invertible — you can subtract and divide to get the transformation from x back to z — and, very importantly, because the first part of the transformation does not depend on z2, we end up with a Jacobian that's lower triangular, and lower-triangular Jacobians have determinants that are just the product of the diagonal entries, which is very fast to compute. In this case the product of the diagonal entries is just the scale factor multiplying z2. This is how you design these transformations so that you always get a determinant that's easy to compute, and we chain many of them together, flipping which variable is kept static — in the next layer z1 gets the transformation while z2 stays fixed — mapping back and forth to build up a very complex transformation. Here's an example of two Gaussian noise variables mapped through a learned transformation that learns how to sample from this double ring; and not only can it sample from the double ring, we can evaluate the likelihood at any point (x1, x2). Those are exactly the properties we need.

So when we look back at our objective function, which is this sum over likelihood values: we train a normalizing flow up front to approximate our simulator — given a value of the input x it can generate values of the observations y, and we can evaluate how likely an observation is, giving an approximation of the likelihood. And since we evaluate that on samples from our source distribution, we use the fact that a normalizing flow can also easily sample data points: we parameterize our source distribution with a normalizing flow and use it as a sampler to generate the points that enter the summation, the approximation of the likelihood. That's all the pieces together. We can run gradient descent all the way back through this likelihood, back to the values of x, and back to the parameters of the source transformation, and once we optimize those parameters, we basically know what the source distribution is.
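Here is what one such coupling layer looks like in code — a sketch of the RealNVP-style construction just described, written with the usual exponential parameterization of the scale (real implementations stack many of these layers and alternate which component is left untouched).

```python
import torch

class AffineCoupling(torch.nn.Module):
    """One RealNVP-style coupling layer in 2-D:
    x1 = z1,  x2 = z2 * exp(s(z1)) + t(z1)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.s = torch.nn.Sequential(torch.nn.Linear(1, hidden), torch.nn.ReLU(),
                                     torch.nn.Linear(hidden, 1))
        self.t = torch.nn.Sequential(torch.nn.Linear(1, hidden), torch.nn.ReLU(),
                                     torch.nn.Linear(hidden, 1))

    def forward(self, z):                       # z -> x, plus log|det J|
        z1, z2 = z[:, :1], z[:, 1:]
        s, t = self.s(z1), self.t(z1)
        x = torch.cat([z1, z2 * torch.exp(s) + t], dim=1)
        return x, s.squeeze(1)                  # Jacobian is lower triangular: log-det = s

    def inverse(self, x):                       # x -> z, algebraically invertible
        x1, x2 = x[:, :1], x[:, 1:]
        s, t = self.s(x1), self.t(x1)
        return torch.cat([x1, (x2 - t) * torch.exp(-s)], dim=1), -s.squeeze(1)
```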
Okay, so does it work? There are many steps to this approximation, chaining together two generative models and using their gradients, so we can look at some toy problems. Here's a fun one: a robot-arm inverse kinematics problem. If we start with some observation of where the robot arm ends up, what we want to be able to do is understand the distribution of joint positions that it started from: x1 is the height, and x2, x3, and x4 are the arm's joint angles. So from an observation we want to estimate the distribution of joint positions and angles that could lead to the arm ending at that position. You can see here that, for one data point, we can learn the distribution of all the different positions and joint angles that end at that particular position, and if we sample a bunch of observations, we can learn a source distribution that's consistent with all of them. In this case, for the four variables of the robot arm, blue shows the true distributions and black what we learn when we run gradient descent through this likelihood approximation. Not only do we get reasonably good marginals — the 1D distributions of each variable along the diagonal — we also learn the 2D correlations. It's not perfect: this was actually an adapted deterministic problem that we added noise to, so the likelihoods are actually quite sharp, and we weren't able to approximate the very sharp peaks of the distribution, but we did a pretty reasonable job, especially looking at the 2D correlations.

We can also apply this in a physics setting — eventually we want to apply it to physics measurements. In this case we were looking at events with a Z boson recoiling against a jet. A jet is a stream of particles produced by a high-energy quark or gluon: you can see the charged particles in a jet as solid lines, and then they smash into the detector and leave energy in a calorimeter — different jets in different colors on the left here. There are various properties we may want to measure and compare with theoretical predictions, one being the jet mass. Using somewhat simplified simulations — this is not a full simulation of ATLAS or CMS — we can generate observations of the jet mass distribution, in black, and compare with the true distribution, in blue, that was then smeared out by the detector. Our goal is, from the black distribution, to estimate the blue distribution, and we can do that. We're unfolding — estimating the source distribution — in four variables here: the jet mass along with some substructure variables, features computed from all the particles within the jet. So this is a four-dimensional unfolding. We'll look in more detail at the comparisons of the 1D marginal distributions on the next slide, but you can also see comparisons of the 2D correlations, so we're able to unfold continuously in four dimensions. Here black is the learned distribution and blue is the true source distribution. We can also compare with some of the other methods on the market these days: here black is the, quote unquote, observed data, green is the true values, and blue is what we've learned with this empirical-Bayes-based unfolding method; in gray is an iterative histogram-based unfolding method, and in red an iterative reweighting method. Comparatively, we're able to do quite a good job, especially at estimating the mass distribution — and all the distributions simultaneously.
I should mention that the reweighting approach is also multi-dimensional, unlike the histogram-based approach, which basically does one distribution at a time. There are some goodness-of-fit tests at the bottom which show that the two multi-dimensional approaches are both reasonably good at this unfolding, in comparison with the true distribution, and they're comparable with each other.

One thing that's quite interesting, going beyond unfolding source distributions, is that we can actually compute posteriors. Once we've learned a likelihood and a source distribution, we can draw samples from the posterior given some new observation, because the posterior is proportional to the likelihood times the source distribution. We might want this, for instance, to go from an observed electron energy to the distribution of true electron energies that could plausibly have generated that observation — p(true | observed). That's essentially reconstruction where we get an entire distribution out: some estimate of the uncertainty of the reconstruction. The way we do this is, given some observation, we sample a bunch of points from the source distribution and do rejection sampling, where we accept them with probability proportional to the likelihood. That might sound really inefficient, but remember these are all normalizing flows: we can sample a million data points in a few seconds — I don't remember the exact number, and it depends on the model, but it's extremely fast. So even if the rejection sampling is highly inefficient, it doesn't take long to sample enough data points to populate a posterior distribution. Here you can see our four-dimensional example: given an observation where we knew the true value — the stars and lines in red — you can see the estimated posterior distribution, basically the reconstructed distribution, in blue, with black just outlining the blue distribution. So that's quite nice: we can do reconstruction with uncertainty estimates. And to test whether this is actually working we can check what's called posterior calibration: if you look at which quantile of the estimated posterior the true values fall in, the fraction of events falling into a given quantile should be proportional to that quantile — it should be a straight line. For each of the four variables that we learned a source distribution over and computed posteriors for, the calibration curves are reasonably close to a straight line with slope equal to one. They're not perfect, but they're reasonably good, giving us some confidence that we would get reasonable uncertainty estimates out of a reconstruction approach that works this way. So we can do reasonable posterior estimation as well.
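The posterior-sampling step he describes can be sketched as follows (same assumed `sample` / `log_prob(..., context=...)` interface as before; taking the acceptance bound as the maximum likelihood over the proposed batch is only an approximation).

```python
import torch

def posterior_samples(source_model, likelihood_model, y, n_propose=1_000_000):
    x = source_model.sample(n_propose)               # cheap proposals from the learned source p(x)
    log_w = likelihood_model.log_prob(y.expand(n_propose, -1), context=x)   # log p(y | x_i)
    # Accept each proposal with probability p(y|x)/M, giving draws from p(x|y) ∝ p(y|x) p(x)
    accept = torch.rand(n_propose) < torch.exp(log_w - log_w.max())
    return x[accept]
```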
So I think that brings me to the end of the talk. We've talked about various ways in which we can use generative models, relying not only on their approximation capabilities but also on their gradient capabilities. Now, it's not always easy to set up loss functions, and it's not always easy to optimize them, so there's certainly work to be done there, but I think it's showing some promising directions for how we can put together generative models and simulators to approach the data analysis challenges we have in high-energy physics and in other sciences as well, and to extend them beyond, for instance, problems where we can only operate in one dimension to multi-dimensional inference challenges. So thank you very much.

Nice, thank you very much, Michael, that was a great talk, and I think we've got some more minutes for questions, so if people have any more questions, please raise your hands. Okay, Louis, I see you have your hand up, go ahead.

Yeah, okay, hey, very nice talk, a lot of material covered there. I just wondered, with your unfolding example: traditional unfolding methods have to introduce regularization in order to get a nice-looking answer. Have you got the regularization built in automatically?

Yeah, so there are a few points to that. There are a few levels of regularization here, depending on what you want to do. The first level of regularization actually comes from the entire idea of a normalizing flow: we're only able to learn bijective continuous transformations, which all have computable gradients. So these are continuous functions with computable gradients, and that's already a regularization on the kinds of functions we could get out of this learning process; we've thrown out everything that has any sort of discontinuity. That's already quite a big regularization, which might be akin to the smoothness regularization you would put into a histogram-based unfolding approach. So we've got smoothness built in. And if we want to add more regularization, we can also add it in the form of the structure of the model. We could ensure, for instance, that the model always outputs values between zero and one, if we think the output should always be constrained to that range, or we could ensure that it's symmetric about zero, so that the probability for a point at minus x and at plus x is the same. So we can add this kind of structural regularization as well, to help make the learning process work better.

Oh, incidentally, on this slide, you were saying that the correlations are sort of nicely reproduced, but actually it looks as if all the correlations are about zero? Yeah, I should say the shapes, the level sets. Okay, all right, the shapes, I'll grant you that. Okay.

Just one other question: in the unfolding example you said your gradients were biased, whereas in the first half of the talk your gradients were unbiased. That's right. What's the difference?

Right, so the difference here is this logarithm. When you have an integral and you use a Monte Carlo approximation of that integral, it's typically unbiased, so when you take the gradient of that approximation, you might have a noisy approximation, but on average it's unbiased. However, the log of a sum is not an unbiased estimate of the log of the integral. So there's a small bias there, and that bias drops as one over N: the more samples we draw, the smaller the bias will be, and it drops with one over the number of samples. Now, the reason I said it didn't really matter is because, once we're talking about these generative models which can draw samples very easily, the constraint on reducing the bias was the memory of the computer and not the speed: it was very fast to draw samples, but to do this summation we might have to generate thousands of points to compute the sum. My last comment on this is that you can actually debias these; these are called nested Monte Carlo approximations, this log of a sum, and it turns out you can debias them by introducing variance. Basically, you introduce a random number for where you stop the sum, and in the end you get a debiased estimate, an estimate with zero bias but which is noisier. We found that it was better to just draw more samples than to introduce these debiasing techniques; it was more work than was necessary.
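To make the point about the log-of-a-sum bias concrete, here is a small numerical sketch of the effect; it is my own toy example rather than anything from the talk. The Monte Carlo average itself is unbiased for the integral, but its logarithm picks up a bias that shrinks roughly like one over the number of samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy integral: I = E_{z ~ N(0,1)}[exp(-z^2)], which equals 1/sqrt(3) in closed form.
def f(z):
    return np.exp(-z ** 2)

true_I = 1.0 / np.sqrt(3.0)

n_repeats = 100_000  # independent Monte Carlo estimates per sample size
for n in (4, 16, 64, 256):
    z = rng.standard_normal((n_repeats, n))
    mc_mean = f(z).mean(axis=1)                 # unbiased estimate of I
    bias_of_mean = mc_mean.mean() - true_I      # ~ 0 up to noise
    bias_of_log = np.log(mc_mean).mean() - np.log(true_I)  # negative, shrinks ~ 1/n
    print(f"n={n:4d}   bias of mean: {bias_of_mean:+.5f}   bias of log(mean): {bias_of_log:+.5f}")

# As mentioned in the talk, randomized-truncation ("nested Monte Carlo") estimators can
# remove this bias at the cost of extra variance; drawing more samples is often simpler.
```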
Okay, thanks. Thanks for the question. Duncan? Hi, can I ask my first question again with respect to this setup, and then I also have a follow-up, if that's okay? So I'm just curious, again, about the kind of training data that's required.

Right. In the toy experiments, there were a few of them that we looked at and I only showed one here, I think we used something like 10,000 simulated data points to train the likelihood, and then something like 10,000 observations to do the source distribution estimation. Those toy examples were relatively uncomplicated; they were mostly there to make sure we understood what we were doing. In the physics setup, I think we had something like 1.5 million events that we split. There's a subtlety here: what we did was train the likelihood using a sample of, I think, about a million Pythia events, and then we actually unfolded from a Herwig set of observations, which was, I think, on the order of another 800,000.

So I use similar approaches, but, well, okay, I've been using ABC, approximate Bayesian computation, to solve inverse problems for climate models, and there we use a Gaussian process surrogate to get around the fact that we can't generate anywhere near that number of climate model simulations. Is there a way of combining this with a surrogate? I guess it's just introducing another set of uncertainties, right? What's the benefit of the normalizing flow over something like GP-enhanced ABC?

Well, maybe we'll separate the ABC part from the GP part of the question. For the GP part: in principle, if you can reasonably sample observations from your GP, then you can use that as an estimate for your source distribution; basically, you need some sort of relatively fast sampler. And anything that can approximate the likelihood, that can give you a likelihood evaluation, you can use for the likelihood problem. So GPs are compatible; it's a question of how fast you can get them to train. If you have fewer data and your GP model is easier to train to estimate the likelihood, then that's totally compatible, and they're also differentiable, so that would completely work.
Now, for comparing with ABC, that's a good question. In this sense, we weren't trying to do an exhaustive search over the kind of observed variables that we might need to define in order to do the source distribution estimation. In ABC you have some summary statistics, depending on how you're doing it, whereas we didn't have to focus so much on defining summary statistics in order to define a loss function in a lower-dimensional space. So we can, in principle, unfold up to larger and larger dimensions, although of course that requires more and more data to approximate the likelihood. We did a little bit of tuning, we played a little with the observed variables we tried to unfold in this physics example, but we didn't have to spend a lot of time doing that. So I guess that's probably the benefit here: we don't need to do a search over summary statistics.

And you don't have the curse of dimensionality that you have with ABC, and you don't have to define an epsilon. Right, we don't have to define an epsilon, but we do have to train a generative model, and that can be a little bit data hungry, so that's probably one of the trade-offs there. Interesting, thanks a lot. Thank you. Thanks.

Actually, I've also got a small question on the first part, if I may. From slide 27 or thereabouts, when you spoke about training your local generative model for the estimate of the local gradient: from what I understand, you get new samples as you go along the optimization, but your new region, let's say, might overlap partially with ones that you've already visited before. Did you have to play any tricks to pick new data points wisely, so that they're maximally informative about the gradient estimated by your local generative model?

That's a very good question. We did not try to play any tricks to maximize some sort of information gain about the gradient; that would actually be quite interesting. No, we would basically resample new points, and if there were overlaps, we stored a replay buffer, like you might in reinforcement learning: we stored a bunch of simulations from previous values of the parameters, and we might use them in later gradient descent steps. Now, we actually found that once you do that, it would sometimes introduce a bias in the gradient estimation that we didn't totally understand, so often we wouldn't use this replay buffer, basically leaving the understanding of where that bias comes from for future work.

I see, curious. I think it was shaping, right: if you sample everything on one side of this kind of local region, it may bias how much emphasis the generative model places on learning one side of the loss function versus another, and that may affect the approximation quality. That would be an argument in favour of doing something smarter than just having a replay buffer and sampling new points, right? Yeah, absolutely, sampling them in a way that makes sense, exactly. It's a good question, but we didn't get to that point. Yeah, thanks.
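Since the replay-buffer idea came up here, the following is a very rough sketch of what that pattern can look like in an optimization loop. It is my own illustration, not the speaker's code: the simulator, the local surrogate fit, the gradient step, and all names (`simulate`, `fit_local_surrogate`, `loss_gradient`, the window of 0.5) are placeholders; the only point is to show storing simulations from earlier parameter values and optionally mixing them into later fits.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta, n):
    """Placeholder for an expensive stochastic simulator run at parameters theta."""
    return theta + rng.normal(scale=1.0, size=n)

def fit_local_surrogate(samples):
    """Placeholder for fitting a local generative model to the samples;
    here it just returns their mean, standing in for a learned summary."""
    return np.mean(samples)

def loss_gradient(surrogate_mean, target=3.0):
    """Placeholder gradient of the loss (E[x | theta] - target)^2, taken through
    the surrogate (d surrogate_mean / d theta is ~1 in this toy simulator)."""
    return 2.0 * (surrogate_mean - target)

theta = 0.0
replay_buffer = []   # (theta, samples) pairs from earlier optimization steps
USE_REPLAY = True    # the talk notes that reuse sometimes biased the gradient estimate
for step in range(50):
    fresh = simulate(theta, n=200)
    replay_buffer.append((theta, fresh))

    # Optionally reuse simulations from nearby, previously visited parameter values.
    train = [fresh]
    if USE_REPLAY:
        train += [s for (t, s) in replay_buffer[:-1] if abs(t - theta) < 0.5]
    surrogate = fit_local_surrogate(np.concatenate(train))

    theta -= 0.1 * loss_gradient(surrogate)

print(f"final theta ~ {theta:.2f} (target mean was 3.0)")
```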
Okay, any other questions? Duncan, you still have your hand up; is that from before, or do you have a new one? Sorry, no, it's from before. Okay, nice. If there's nothing else, then I think we can wrap up for today. Thanks a lot for joining, everybody, thanks a lot to the people asking questions, and most of all, thanks to Michael for this wonderful talk. That's it, speak to you next week.