And our first speaker today is Güneş Baydin. Güneş is a senior researcher at the Department of Engineering Science here in Oxford, and he's also affiliated with the Computer Science Department. He's a research consultant for Microsoft Research in Cambridge and a member of the European Lab for Learning and Intelligent Systems. I think that already makes it clear that Güneş is a person who wears many hats. On the one hand, he's an expert on methods for automatic differentiation and probabilistic programming. On the other hand, he's also very interested in applying these techniques and others to concrete problems in the physical sciences, covering topics ranging from the physics of the sun's magnetic fields all the way to the fundamental interactions of the Higgs boson. And of course, evaluating the huge benefits that machine learning can bring to all of these fields is central to this series of seminars; it's central to his work and it's also central to this talk. His contribution today is titled Probabilistic Programming for Inverse Problems in Physical Sciences. So please, Güneş, take it away when you're ready.

Thank you, thank you very much. That was brilliant — this is the best introduction I've gotten. So I'm really happy to talk to you today. The subject is probabilistic programming, which you could call a subfield of machine learning that handles generative models where you specify the generative model, instead of learning one from data as people do with GANs, flow models, VAEs and so on, if you know these. Probabilistic programming is a field where you specify a generative model as a general-purpose program: you write down your generative model, and by doing that you define a prior distribution; you then do inference by conditioning on data observations, and you get posterior distributions over the random variables that you define in your program.
So that's the regular way people do that. The subject of this talk is bringing that probabilistic programming subfield to work with existing simulators in the physical sciences; we have a method to combine these two things together. So that's the subject of the talk. A little bit about me — I think I don't need this anymore, this was all covered. One thing I can mention is that there is a nice effort for bringing machine learning and physical science people together. I don't know if people have heard about this; I see among the names joined here some people who already participate in it. It's called the Frontier Development Lab, the thing on the bottom left. If you haven't heard about it, I think you'd be interested in checking it out. It's something like a summer school; it runs for two months every year, there are US and European versions, and the European version is hosted in Oxford, officially by the university. It's specifically about bringing machine learning people and physical science people together to work on well-defined projects, mostly about the space sciences. Other than that, everything was covered by Philip Abril, so I'm going to skip this. So the subject is simulation. I think it's well known to the people on this call that simulation is essential in many subfields of the physical sciences. This ranges in scale from particle physics to material design to climate science all the way to cosmology; at all scales, people do many things with simulation. The interesting thing in this work is that we have infrastructure that allows you to take existing simulators and treat them as probabilistic programs, so that you can bring the tools of probabilistic programming and probabilistic machine learning to bear on your existing simulators in the physical sciences, if you have such simulators.
So I'm going to introduce some high-level concepts before telling you how this thing works in practice. A simulator is a piece of software: you give it some parameters or inputs and you get some simulated data out of it. In many cases, especially in the physical sciences, these things model a system and the forward evolution of that system in time, so the arrow of running the simulator most of the time coincides with the arrow of time. You can run it forward, simulate the forward evolution of a system, and generate samples of simulated system states or simulated data. This is useful by itself in some fields: in climate science, maybe you want to do climate projections, flood models, economic models and things like that. So there are fields where forward runs of a simulator are already useful. And there's also the other possibility. In many scientific settings, what people want to do is go in the other direction, and this is what we call inference, or inverting the simulator. Somebody gives you instances of data corresponding to the output of the simulator, and you are supposed to find the parameters that can produce, or explain, your observed data. It's an inverse problem, and inverse problems are most of the time difficult; they require manual work and lots of computational resources to solve. There are some examples here. In genomics, people have models of gene networks that can simulate gene expression data; you get the actual gene expression data from some instrument, and you try to infer what network causes the type of gene expression data that you've observed. Or in seismology, there are fairly good models that simulate the propagation of waves through the crust of the planet.
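To make the forward direction concrete, here is a minimal sketch of such a stochastic simulator in Python; the model, the `decay_rate` parameter, and the dynamics are all invented for illustration, not taken from any real code base:

```python
import random

def simulate(seed=None):
    """Toy forward simulator: draw a latent parameter from its prior,
    then evolve a system forward in time and emit noisy observations."""
    rng = random.Random(seed)
    decay_rate = rng.uniform(0.1, 1.0)       # latent parameter, drawn from its prior
    state = 100.0
    observations = []
    for _ in range(10):                       # the forward arrow of time
        state *= 1.0 - decay_rate * 0.1       # deterministic dynamics
        observations.append(state + rng.gauss(0.0, 1.0))  # noisy instrument readout
    return decay_rate, observations

params, data = simulate(seed=0)
```

Running it forward is easy; the inverse problem is going from `data` back to a distribution over `decay_rate`.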
Given those forward simulations and some seismometer readings, you are supposed to determine earthquake characteristics, location, and the other things people do in seismology and Earth sciences. And the last example — the actual example that brought me into applying probabilistic programming to science — is particle physics. In particle physics, people have fairly large-scale simulators that simulate many of the things going on in particle accelerators and detectors. You get some particle detector readings, for example, and you want to perform event analysis, or even new discoveries, or confirmations of theoretical expectations, and things like that. Most of these classify as an inverse problem given some existing forward simulator. So that's the simulation part. Coming to probabilistic programming: it's basically a machine learning framework that allows you to write programs that define probability models. You might be familiar with probabilistic graphical models, where you construct a graph structure that defines random variables and their conditional dependency structure, the causal relationships, and where you can do Bayesian inference, if you've heard about that. Probabilistic programming is a generalization of that. It takes a regular general-purpose programming language — say C or Python — and adds features, statements that allow you to define probabilistic relationships. Using these programming languages, you can write probabilistic models. The other part is that these systems include automated Bayesian inference engines: once you define your model, the system has the capability of automatically doing inference for you. That's a very important thing — this distinction between model definition and inference.
The aspiration of the probabilistic programming community is that if somebody defines a model in such a programming language, we will be able to do automated Bayesian inference in it. You are just supposed to say: these are my random variables, this one is observable, the rest are latents; given an instance of my observed variable, what is the posterior distribution over the rest of the latents? And the system is supposed to give that to you without you even thinking about the inference algorithm — it's supposed to be automatic. You might have heard of examples of such systems: here below there's Pyro, there's Edward, and Stan. I think Stan is fairly well known in some scientific communities, and Pyro is currently gaining traction — Pyro is PyTorch-based, if you've heard of that. The situation in the probabilistic programming community is that applications have been limited to toy and small-scale problems, and there are some reasons for this. One reason is that probabilistic programming is most of the time very computation-intensive. The other is that using a probabilistic programming system requires you to implement a model from scratch in the chosen language. Say you want to solve a problem with Pyro or Stan: you are supposed to learn Pyro or Stan and implement your model from scratch in that language in order to do Bayesian inference on your problem. That's fine for an exploratory or prototyping stage, but you cannot expect, for example, the particle physics community to take a fairly large simulator like Sherpa and implement it from scratch in one of these languages in order to do Bayesian inference with particle physics data. So that's one limiting factor.
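As a toy illustration of what these inference engines automate, here is conditioning done by hand in plain Python, via likelihood weighting (importance sampling from the prior); the model and numbers are hypothetical, and real systems like Pyro or Stan wrap this kind of machinery behind their own APIs:

```python
import math
import random

rng = random.Random(1)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Generative model: mu ~ Normal(0, 1), y ~ Normal(mu, 0.5).
# Conditioning on an observed y means weighting prior samples by the likelihood.
y_obs = 1.2
samples, weights = [], []
for _ in range(200_000):
    mu = rng.gauss(0.0, 1.0)                      # sample the latent from its prior
    samples.append(mu)
    weights.append(normal_pdf(y_obs, mu, 0.5))    # likelihood of the observation

posterior_mean = sum(m * w for m, w in zip(samples, weights)) / sum(weights)
# For this conjugate pair the exact posterior mean is y_obs * 4 / 5 = 0.96.
```

The point of a probabilistic programming system is that you declare the model and it supplies this conditioning machinery for you.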
Not everybody has the resources, interest, or time to implement everything from scratch in a probabilistic programming system. The idea in this work is that you don't need to do that. The insight is this: many simulators are stochastic, and they already define probability models, even when the people who wrote the simulator didn't think about it that way. If a simulator is stochastic — if it's doing random number sampling inside — it's already fairly similar to a probabilistic program. There is random sampling happening, and you can treat that as something that defines a prior distribution, by sampling randomness through the execution of the simulator. And in most instances in the science community, we have access to the source code of the simulators. The simulator is not really a black box: we know what's going on inside, and we know the places where these random samples are taken. So simulators can be seen as probabilistic programs, if you have the machinery and the infrastructure to treat them as probabilistic programs. I'm going to explain how we do that. We just need the infrastructure to execute existing simulators as probabilistic programs. We've been developing a new Python library called pyprob, specifically designed for working with existing simulators in any language. The system — pyprob, and the machine learning tensor framework it's built on, PyTorch — is in Python, but it has the capability of being connected to any simulator in any existing programming language. There is a separation between the probabilistic programming and inference part on one side and the model part on the other; I'm going to explain how that works. The summary is: you can take pyprob, and you can take an existing simulator in your field.
You need to do a bit of work to combine these two pieces, by redirecting the random number sampling in your simulator to pyprob. Other than that, you can keep your code base completely untouched, and you can treat your simulator as a probabilistic program. So that was the high-level description; in practice, this is what's happening. Let's say you have your simulator. The thing works by running the simulator forward — that's the only thing you can do with a simulator, by the way, you can just run it forward. As you run it forward, as a side effect, you are supposed to catch all the random choices in the simulator. You can also call this hijacking all the calls to the random number generator in the system. Say you have a C++ simulator: there are places where it gets a random number from the continuous uniform distribution, a uniform random number between zero and one. We are supposed to catch that — and by catch, I mean that on the C++ side, when somebody wants this random number, we are supposed to know that this request happened, and we are supposed to return the random number. But in addition to providing the random number, we are supposed to let the probabilistic programming system know that this happened at this point in execution. We need to be able to record all the random number calls during one forward execution of the simulator, and this is what we call an execution trace. Here I'm trying to show that as a sequence of these nodes: you start from this location, and this is the end. And with this orange color I'm going to represent, in the upcoming slides, the data you are simulating.
In most cases, the data you are simulating is either the last thing or very close to the last part of the simulation, because most simulators are written so that you simulate some data at the end. For recording these execution traces, there's an execution protocol. We call it PPX, the probabilistic programming execution protocol, and it's based on Google FlatBuffers — if people have heard of Protobuf and similar serialization libraries. There are existing low-level libraries that allow you to exchange data between different programming languages and different runtime environments; currently this thing supports all the languages you can see here, and many more. So we provide that. In addition, we have the capability — actually it's more than a capability, it's a requirement — that for recording these execution traces, we need to be able to uniquely label each of these random number choices at runtime. These labels are called addresses. You can think of them as unique labels identifying random variables inside the execution of the program. It's a bit more complicated than just locations in the source code — this source file, this line number, there's a random number sampling statement there. It's more related to the runtime behavior. For example — I'm going into too much detail here — the same random number call in the code can have different addresses according to where the actual sampling originates from. You can have a random number call inside a function that is itself called from other functions; according to which function ended up calling it, you get different addresses. So this is actually difficult and not so straightforward to explain.
These are some of the ugly details behind the scenes of probabilistic programming. Just remember: there is a unique label assigned to each random number draw in the stochastic simulator, and these correspond to prior distributions and random variables, implicitly defined by the people who wrote the code base. Once you have this capability of running your simulator forward and recording the execution traces as a side effect, the only remaining bit needed to do probabilistic programming, or Bayesian inference, is conditioning. Conditioning, in the simplest terms, is a comparison of your simulated data and the observed data given to you. At runtime — at test time — somebody gives you an instance of observed data, just that data. And you have a simulation pipeline that is capable of simulating data in the same domain — the same shape and all the characteristics of your observed data — but simulated. You need to be able to compare simulated data instances with your observed data and make a judgment of how similar they are; that's the high-level definition. By doing this, you are able to approximate the distribution of parameters that can produce, or explain, the observed data. In universal probabilistic programming — this general-purpose programming setting — one way of doing that is with Markov chain Monte Carlo: sampling-based inference algorithms where you run the simulator many times, compute likelihoods (we are going to cover that), and get the posterior distribution basically by sampling and running the simulator many times. But there is one problem that comes from this. Up to here, things work well and everything's good; it's a good idea.
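As a sketch of the sampling-based approach just mentioned, here is random-walk Metropolis on a toy one-latent model; the model and numbers are invented, and in the real setting every likelihood evaluation would mean re-running the simulator, which is exactly the cost problem discussed next:

```python
import math
import random

rng = random.Random(2)
y_obs = 1.2                                    # the observed data instance

def log_prior(mu):
    return -0.5 * mu ** 2                      # Normal(0, 1) prior, up to a constant

def log_likelihood(mu):
    return -0.5 * ((y_obs - mu) / 0.5) ** 2    # Normal(mu, 0.5), up to a constant

# Random-walk Metropolis: propose a move, accept with the usual ratio.
mu, chain = 0.0, []
for _ in range(50_000):
    proposal = mu + rng.gauss(0.0, 0.5)
    log_accept = (log_prior(proposal) + log_likelihood(proposal)
                  - log_prior(mu) - log_likelihood(mu))
    if math.log(rng.random()) < log_accept:
        mu = proposal
    chain.append(mu)

burned = chain[5_000:]                         # discard burn-in
posterior_mean = sum(burned) / len(burned)     # exact answer here is 0.96
```

Note the sequential nature of the chain and the burn-in that must be discarded — both of which become prohibitive when each step is a full simulator run.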
The problem is this: if you have a large-scale, significant simulator — for example Sherpa, which we use in this initial work, as I'll show you — these simulators are so big that if you treat them as probabilistic models, you end up with probability models that nobody in the probabilistic programming community has even attempted to run. They correspond to giant probability models. You need to run the simulator up to millions of times; simulator execution is sequential, inference is sequential within each MCMC chain, and there are things like burn-in and autocorrelation. So these are very, very difficult to handle as probability models. Thankfully, there are techniques that people in the probabilistic programming community, including myself and collaborators, have worked on, called amortized inference. There is a way of using some deep learning machinery to make inference go faster. Here's a high-level summary of amortized inference. The name comes from doing something very costly before inference — that something is training a deep learning model on your simulator's data. Once you finish that very costly phase, you can use its result to do very, very fast, affordable inference at runtime. You amortize the cost of training that infrastructure; that's where the name comes from. The following slides basically explain some details of this. There are two phases to this amortized inference idea, and this is actually what we did with the physics simulator. The first one is training. In the training phase, you do data generation: you have many instances of your simulator, these boxes here, and you can run them in parallel.
Every time you run it, it stochastically simulates a separate event, and you record these execution traces. So you can create a dataset of execution traces from the simulator, and these can be millions of executions. This is basically a recording of your simulator's behavior, and the more you record, the more you capture of the stochastic behavior, the variance, and the range of things you can expect from the simulator code base. Once you have the dataset, you can do the training phase. I don't have time to go into detail about the network architecture and the deep learning details here, but you can see that I use colors to represent things — and I know there are colorblind people, so please let me know if these look the same to you; I hope not. In the execution trace, the nodes colored orange represent data, and the gray ones are the latents, let's say. In the training phase, you might notice that I switched the order: you take the last part, which is the data, and make that the input to the neural network, and you expect the neural network to predict the rest of the random variables conditioned on data. This is where the inversion happens in the amortized inference part. The training can also be distributed, using the distributed training methods people have developed in the deep learning community. Basically, you train a recurrent neural network that, for each execution trace, sees the final output of the execution and tries to predict the rest of the trace from this final output. By the way, this is extremely costly for a significant, large-scale simulator.
Once you are finished with that, you can discard everything else and just keep your trained neural network. And — oh, this is where I notice I already had these things in the bullet points, so I think I covered the same thing. The neural network learns all the random choices; this is what I was saying. There are also interesting details: this is a dynamic neural network, because the thing I forgot to mention is that with a real-world simulator, every time you run it, it can make different choices and create execution traces of different lengths. This is completely open-ended. One time you run the simulator, it might sample, say, a thousand latent variables to simulate an event; the next time, it might do just ten samples; the next time, a million. All these traces can be of different lengths, and they are in practice. The neural network needs to adapt to this, and it also needs to account for the labels, the identities, of the random numbers. So the neural network has dynamic components: it has an LSTM, a recurrent neural network, backbone, and to this we attach different pieces that are created on the fly during training, as the network sees some simulator piece or behavior for the first time. I'm going into too much detail — this is not the main part of this work, but if people are interested in those details, I'd be happy to talk about them. So this is a neural network that grows in size as you keep training it; it's that type of interesting neural network, not a fixed architecture. It's extremely costly, but you have to do it only once for a given simulator. When you've done that, you are left with this trained neural network; you can discard all the rest, and you can do inference in a fast, reliable way.
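To illustrate the two phases on something trivially small: here a plain linear regressor stands in for the recurrent neural network, trained once on simulated pairs and then reused for fast per-observation estimates. The forward model and all numbers are invented for illustration:

```python
import random

rng = random.Random(3)

# Hypothetical forward model: theta ~ Uniform(0, 2), y = 3 * theta + noise.
def simulate():
    theta = rng.uniform(0.0, 2.0)
    return theta, 3.0 * theta + rng.gauss(0.0, 0.1)

# Costly phase, done once: fit an inverse map y -> theta on simulator output.
pairs = [simulate() for _ in range(20_000)]
mean_t = sum(t for t, _ in pairs) / len(pairs)
mean_y = sum(y for _, y in pairs) / len(pairs)
slope = (sum((y - mean_y) * (t - mean_t) for t, y in pairs)
         / sum((y - mean_y) ** 2 for _, y in pairs))
intercept = mean_t - slope * mean_y

# Cheap phase, repeated per observation: one evaluation instead of a long chain.
def amortized_estimate(y_obs):
    return intercept + slope * y_obs

theta_hat = amortized_estimate(4.5)            # true inverse is 4.5 / 3 = 1.5
```

The real system predicts full proposal distributions over open-ended traces rather than a point estimate, but the economics are the same: pay the training cost once, then every new observation is cheap.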
That's the whole idea of this amortized inference setting. You have your trained neural network and your observed data — somebody gives you a new data instance that was not in your training data. Inference works by running many instances of this neural network, giving them the same new observed data as input. The networks have their ideas about what needs to happen inside the simulator, in the places where we do random sampling, to make it produce traces whose outputs look similar to your observed data. Basically, the trained neural network learns to guide the random choices in the simulator in order to make it produce data that looks similar to your observed data. And the similarity comparison I keep referring to is done by a likelihood function — I didn't explain that in technical terms, for people interested in Bayesian inference. So basically you have your network, you have your observed data... I hear background noise; maybe that was the intention.

Can I ask — the traces are the random numbers and their locations, is that right?

Yes, they are, yeah.

And what you're doing is reproducing the traces. But does that guarantee that you're reproducing the parameters?

So actually it's not reproducing. At inference time, we are running the simulator from scratch. You start running the simulator as you normally would, but this time, instead of using the prior distributions encoded in the simulator, you go to the neural network. Every time the simulator asks for a random number, we go to the neural network and ask it: what do you want to give to the simulator? So the neural network makes the decision about which random number to return. This is a bit interesting.
In machine learning terms, this is called the proposal distribution: instead of the prior — the prior distributions in the original simulator source code — we have a proposal defined by the neural network. So you don't reproduce a trace; you run a completely new trace from scratch, guided by the knowledge of this network having seen the data. I don't know if that answered the question.

Nearly. I think what Louis wanted to know is: who has set the parameters of the simulator? Because your neural network hasn't touched the parameters, and we are after the parameters in the end.

Okay, so these things will be the parameters. You make that definition: in the beginning, when you set up the training, you define what your latent variables are. Any parameters that you would like to infer with this framework need to be treated as latent variables. Yeah. And you get posterior distributions over these latent variables in the end, and it happens by a sampling-based inference algorithm.

So the neural network tells the simulator the starting parameters — the latent variables?

Yes.

And the simulator asks the neural network, every time it has a new call stack which ends up in a random number call?

Exactly — it says: here is my call stack, which random number should I use next?

Yes. Yes. Well — no, not each random number, actually. I think I should just quickly show you something more informative, because there is a figure that answers this question very well. We can do that very quickly — I know I'm in my account — it's in this paper. The thing is, this talk was prepared in a very high-level way, without going into the actual details. But maybe this is not the community for high level; maybe I should go low level.
So this is what happens — I don't know if you can see this figure now.

Can you make it a bit bigger?

Yeah — can you see it? Here we go. Okay. So this is the simulator part. The simulator part just doesn't change; it's the way people coded it. It always starts from the same location — there is no choice; when you run the program, it always has a starting point, an entry point. You keep running it until you hit stochasticity; this is the deterministic part. Anything can happen in the meantime; we don't know it, we don't have access to the deterministic parts. When the simulator comes to a stochastic part, when it needs a random number — this location is the point where the simulator asks for a random number in the code base. Say you're in C++: it goes to the random number generator and wants a random number. There, we masquerade — we present ourselves as the random number generator. The simulator thinks that we are the random number generator it asked for a random number from, through the function we define. Then we return the random number — actually, we don't return it immediately; let me rephrase that. We communicate to the machine learning side. We say: okay, we are at this address, we need this type of random number from this type of distribution, these are the parameters of the distribution, give us the random number. The machine learning side produces that random number, or anything it wants to produce, and returns it; we give it to the simulator, the simulator is satisfied, and it keeps doing this until the end, okay? The thing that happens on the machine learning side can be one of two things. If we are in a prior execution, it can just sample from the prior distribution contained in the request. Say somebody asks for a number from the standard normal distribution — it just returns that.
But if we are in the inference part, with amortized inference and a neural network on the machine learning side, at this point we go to the neural network. We give it the data as input, and we say: there is a random number request, the prior was this, and this is the address — based on your training experience, what should we return to the other side? The neural network creates the random number and returns it over the protocol. From the simulator's point of view, nothing changes. In effect, by doing this we are intelligently controlling the numbers returned to the simulator. And I think I'm coming to the part where I'm actually going to answer your question: the decision of which random number to sample next is always in the simulator. The simulator makes the decision; the simulator is stochastic. In reality, according to what you ended up sampling here, you can make different choices inside the simulator — there can be an if statement, there can be function calls, choices that depend on the actual sampled value. These things stay the same. What we do is just influence the numbers returned to the simulator. Everything else stays in the simulator, including the decision of which order these things go in; the structure remains the same. I hope that answers it.

Yes, that was very clear. Thank you.

Thank you very much. The thing is, there are so many details in this work, it's really difficult to fit everything into one talk; there are a lot of things going on underneath. Thank you for the question. So yeah, that's basically it. There is another high-level description of this — this was actually suggested by the particle physics collaborators to explain what's going on. It's like a puppet master.
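The request/response flow just described can be sketched as follows, with controller objects standing in for the machine learning side of the protocol (the class names and addresses are hypothetical, not the real PPX interface):

```python
import random

class PriorController:
    """Masquerades as the simulator's RNG, serving draws from the requested prior."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def sample(self, address, low, high):
        return self._rng.uniform(low, high)

class GuidedController(PriorController):
    """Stand-in for a trained proposal: injects chosen values at chosen addresses."""
    def __init__(self, guided, seed=0):
        super().__init__(seed)
        self._guided = guided                  # address -> value to feed the simulator

    def sample(self, address, low, high):
        if address in self._guided:
            return self._guided[address]
        return super().sample(address, low, high)

def simulator(controller):
    # The simulator's control flow is untouched; only the source of its random
    # numbers changes.  A branch on a sampled value stays inside the simulator.
    rate = controller.sample("entry/rate", 0.0, 1.0)
    steps = 3 if rate > 0.5 else 1
    total = sum(controller.sample(f"loop/{i}/noise", 0.0, 1.0) for i in range(steps))
    return steps, total

prior_steps, _ = simulator(PriorController(seed=0))
guided_steps, _ = simulator(GuidedController({"entry/rate": 0.9}))
```

Swapping `PriorController` for `GuidedController` changes which numbers the simulator receives, but the simulator itself still makes every structural decision, exactly as described above.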
This figure represents that: the part on the top is the machine learning, or probabilistic programming, system, and the thing on the bottom is the simulator. The machine learning system has full control of the stochastic choices inside the simulator, and basically we are puppeteering the simulator to make it do the things we'd like it to do. And if you have this capability, you can do Bayesian inference, with these proposal distributions and inference engines and a lot of things that I don't have time to cover here. There are some technical details of how this works — I already mentioned the execution protocol; this is the piece that provides the connection between the machine learning part and the domain simulation part. And yeah, basically that's all. All this work started as a collaboration — a fairly large-scale collaboration by machine learning standards, though I know it's a fairly small collaboration by particle physics standards. In this group we have three types of people. We have machine learning people like myself. We have high-performance computing people at Lawrence Berkeley Lab; they provide the compute capacity. We had industry represented, because the simulator we were using was running so slowly that we needed to make it faster, so Intel was involved in that. And we had our particle physics collaborators; they set up the particle physics problem in the simulator code base, and we worked with it to develop all this infrastructure that allowed us to treat the simulator as a probabilistic program. In doing that, we actually built the tools that allow you to treat other simulators in the same way and do the same for other domains and other applications. The collaboration is called Etalumis — it's "simulate" spelled backwards.
So we are inverting simulators; it's an interesting name, and I hope you find it interesting. That's where the name comes from; sometimes I forget to mention it. So I think this is the first time in this talk I'm actually showing some Bayesian notation. This is a recap of what we did for the particle physics case. The horizontal line here divides the simulation world on top from the real world on the bottom, let's say. In the simulation part we have a forward simulation: we have the Sherpa simulator, and you can have Geant4. These simulate the Standard Model to some degree, as far as I know, and the particle physics detector. So you have a simulator pipeline, and every time you run it, it generates a random event as realistically as possible and gives you simulated particle physics detector data. This pipeline corresponds to a prior distribution, or a joint prior. In the probabilistic programming community the convention is to use y to refer to data and x to refer to latents. So you can think of these as multiple latents together, and even multiple data. You have a prior and you have a likelihood. The prior is over your latents, which can be the parameters you are asking about, or other things; what people call a parameter actually changes a bit from field to field. You can also have nuisance parameters or nuisance variables, whose posteriors you don't really care about, or where it's not so straightforward what you want to infer. But at the end of the day, anything you want to infer, anything you are interested in having a posterior over, needs to be defined. Okay.
So you say: I care about these variables. You have prior distributions over them, defined by the simulator, and you have a likelihood function. You have this pipeline, and the whole intention is that at test time somebody gives you observed data. You have an observe statement, you evaluate the likelihood of the observed data under the whole simulation pipeline, and this likelihood evaluation gives you a way of approximating the posterior distribution using automated inference engines. I don't have time to cover how these inference engines work, but there is a variety of them from different families, such as MCMC, importance sampling, and variational inference, that do this for you depending on the system. In the particle physics case, in this paper, we were working on a problem about tau lepton decay. This is a particle that can decay in 38 different ways; decay channels, I think they are called. So this, for example, is the prior distribution over the decay channels of the tau lepton. And here you see some real examples of the unique labels I was talking about before. A1, for example, is the short name we give to the random variable that has this label; the labels are very long and complicated, and since we cannot pronounce them or talk about them that way, we assign them shorthand names in the order they appear. When we took Sherpa and ran it for some time, we discovered around 25,000 such addresses, that is, 25,000 unique random samples that can occur in the simulator. And I say at least 25,000, because this is completely open-ended: the simulator is in C++, so we don't really know the number of latent variables it contains.
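The prior-plus-likelihood picture can be made concrete with a tiny runnable sketch. This is plain importance sampling with the prior as the proposal, the simplest member of the engine families just mentioned, applied to an invented one-latent Gaussian "simulator"; none of it is the actual pipeline from the talk.

```python
import random
import math

def simulate(rng):
    # Forward simulation: sample a latent from its prior, map it to data space.
    x = rng.gauss(0.0, 1.0)       # latent with a standard normal prior
    y_mean = 2.0 * x              # the "simulator" output before observation noise
    return x, y_mean

def likelihood(y_obs, y_mean, sigma=1.0):
    # Gaussian observation model: how likely is the observed data under this run?
    return math.exp(-0.5 * ((y_obs - y_mean) / sigma) ** 2)

rng = random.Random(0)
y_obs = 1.0                       # the "observed data" handed to us at test time

# Run the pipeline forward many times (samples from the joint prior),
# then weight each run by the likelihood of the observation.
draws = [simulate(rng) for _ in range(20000)]
weights = [likelihood(y_obs, ym) for _, ym in draws]
total = sum(weights)
post_mean = sum(x * w for (x, _), w in zip(draws, weights)) / total
```

For this conjugate toy model the exact posterior mean is 2·y_obs/5 = 0.4, so the weighted estimate should land close to that; the point is only that conditioning on data turns forward runs of a simulator into an approximate posterior.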
This is what we encountered after running it millions and millions of times, and the number keeps increasing. Maybe it increases asymptotically towards some upper limit, but theoretically it's unbounded; there can be an infinite number of latents in a simulator implemented in a Turing-complete language. But that is a detail. In practice, the traces I was referring to look like this. For example, the most frequent trace is this sequence of random draws; it corresponds to one path you can take through the simulator to simulate an event, and it is the most likely thing that happens to a tau lepton as it is simulated here. The next most frequent sequence is this one. So traces are sequences of random variable addresses inside the execution. One thing to note is that this is just the sequence of locations; the sampled values at each location would be different. All the traces of one type, of one family, will differ from each other, because these are continuous distributions, except one which I think is categorical, a discrete distribution. Since we have continuous distributions in the simulator, each instance of a trace, even with the same address sequence, will have different values. I'm just trying to explain how things work in practice. So let's talk a bit about the inference results; I think I'm running slightly out of time. When we do this work, we need to show several things. One of them is: does this amortized inference idea work? Another is: does the system work? The way you do this in probabilistic programming is that most of the time you have some gold-standard inference that is very costly. So we did MCMC inference; the results here are from MCMC, and it is extremely costly.
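The notion of a trace "type" (same address sequence, different sampled values) can be shown with a toy stochastic simulator whose control flow depends on its own draws. The addresses and branch structure here are invented for illustration.

```python
import random
from collections import Counter

def run_once(rng):
    # One execution trace: a list of (address, value) pairs.
    trace = []
    channel = rng.randint(0, 2)            # a discrete choice steers control flow
    trace.append(("A1_channel", channel))
    if channel == 0:                       # the branch changes which addresses appear
        trace.append(("A2_energy", rng.uniform(0.0, 10.0)))
    else:
        trace.append(("A3_angle", rng.uniform(0.0, 3.14)))
        trace.append(("A4_energy", rng.uniform(0.0, 1.0)))
    return trace

rng = random.Random(0)
traces = [run_once(rng) for _ in range(1000)]

# A trace "type" is just its address sequence; the values differ every run.
type_counts = Counter(tuple(a for a, _ in t) for t in traces)
most_common_type, count = type_counts.most_common(1)[0]
```

Here the three-address type occurs roughly twice as often as the two-address one, which is the toy analogue of ranking trace types by frequency.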
We got the result on this slide, for example, by running it for about 150 hours. This is the posterior distribution over some of the latents that the physics partners in the collaboration cared about: the momentum of the tau particle, final-state particle energies, decay channels, and other things. These are posterior distributions over the latents for a test case that we also took from the simulator. We run the simulator once and record an execution trace; we take the simulated data out of it, and we know the ground-truth values that were sampled to produce that data. These are shown as the vertical lines in this plot. We feed the synthetic data into the pipeline, run our MCMC inference, and get these posterior distributions. This is a way of testing this type of setup with synthetic data. And you get interesting things like multi-modality in some cases, things we discussed with the particle physicists in the collaboration, who can interpret and explain why they make sense. In practice we also need to do things like convergence diagnostics for the MCMC, and you see that it needed to execute about a million times before anything converged. We have very large autocorrelation numbers: to get uncorrelated samples you need to run the simulator on the order of 10^5 times. These are very big numbers for any practical use. If you are doing a significant event analysis, you can afford to wait a few days, but it's not something you can deploy in real time, for example. So I will go faster. Actually, wait, what happened here? Ah, the thing I forgot to mention: this amortized inference, labeled IC, is inference compilation, the name of a particular type of amortized inference.
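The autocorrelation point can be illustrated generically (this is not the diagnostic code from the work): for a chain with strong lag-one correlation, the integrated autocorrelation time tells you how many raw MCMC steps one effectively independent sample costs.

```python
import random

def ar1_chain(n, rho, rng):
    # A stand-in for a correlated MCMC chain: AR(1) with coefficient rho.
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def autocorr(xs, lag):
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    cov = sum((xs[i] - mu) * (xs[i + lag] - mu) for i in range(n - lag)) / n
    return cov / var

def integrated_time(xs, max_lag=200):
    # tau = 1 + 2 * sum of autocorrelations, truncated when the
    # empirical estimate first turns negative (a common simple rule).
    tau = 1.0
    for lag in range(1, max_lag):
        r = autocorr(xs, lag)
        if r < 0:
            break
        tau += 2.0 * r
    return tau

rng = random.Random(0)
chain = ar1_chain(20000, 0.95, rng)
tau = integrated_time(chain)
ess = len(chain) / tau   # effective number of independent samples
```

For rho = 0.95 the theoretical integrated time is (1 + rho)/(1 - rho) = 39, so a 20,000-step chain yields only a few hundred effectively independent samples; scaled up, this is why 10^5-scale run counts appear in the talk's diagnostics.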
With amortized inference we can get in about 30 minutes the same result that took more than 100 hours with MCMC. And these plots show something I forgot to mention: the histograms here show two things, IC and RMH. RMH, random-walk Metropolis-Hastings, is the MCMC one. You see that they agree: you get the same posterior for the same observation and the same prior simulation. This is how you convince people, and yourself, that the amortized inference idea works. So that is a nice result. An interesting thing about the amortized inference part is that it is embarrassingly parallel. This is very important: according to the computational resources you have, you can run many instances of this in parallel, and every time you run the neural network together with the simulator, you get a completely uncorrelated, independent sample from the posterior. That is a very powerful thing, and it is what allows this to be practical in the real world. Okay, I have some last slides about other aspects of this type of setup. We looked at these histograms of just some of the latents in the simulator, the ones that are interesting for the problem setup. But when you do this type of thing, you actually get the posterior distribution over the whole latent space of the simulator, and in practice that means thousands and thousands of latent variables. If you choose to look at them, you get these fairly large grids of histograms, and there are many ways you can think about interpreting these results. One thing we always say is that this is interpretable.
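The "embarrassingly parallel" contrast can be sketched as follows. Everything here is illustrative: one call to `amortized_draw` stands in for one run of (neural network proposal + simulator), each call gets its own seed, and because no call depends on any other, the calls could be farmed out to as many workers as you have.

```python
import random

def amortized_draw(seed):
    # Independent random stream per "worker"; the Gaussian draw stands in
    # for one posterior sample produced by one network+simulator execution.
    rng = random.Random(seed)
    return rng.gauss(0.0, 1.0)

# No chain, no burn-in, no autocorrelation between draws: throughput
# scales with the number of workers, unlike a sequential MCMC chain.
samples = [amortized_draw(seed) for seed in range(1000)]
mean = sum(samples) / len(samples)
```

In a real deployment the list comprehension would be replaced by a pool of processes or machines, but the statistical point (every draw independent) is the same.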
It is interpretable because you never discard your simulator; you always have an actual model of your system, which is very different from what people do in deep learning. You don't train a neural network as a black box that does inference for you over, say, the momentum of the tau particle. You could train such a thing from the same dataset, a neural network that takes the observation and gives you predictions of the momentum, but you wouldn't have interpretability, and people have a hard time trusting that kind of thing. In our case, all our results always come from the simulator; we just guide the simulator and never discard it. And as a side effect, we have the whole picture of what happens inside the simulator, which is very useful once we develop the tools to exploit it. We are also trying things like the figure on the right-hand side. The nodes in this graph are random variables, and you can see these looping structures in the graph; they actually correspond to loops in the simulator code base, which is an interesting correspondence between the two domains. For example, in this loop you are sampling two things over and over: sample, sample, then decide, is it good enough? No; sample again, is it good enough? No. This is a rejection sampling loop, for people who are familiar with the idea: you keep sampling something until a condition is satisfied, going around the loop, and when you accept the sample you continue. The central part of this crazy figure is the main code path, and I can see a tiny blue node here, which is the data. So this is a really complicated thing to look at, okay?
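A rejection-sampling loop of the kind just described can be written as a few lines: the same address is hit on every iteration until the accept condition holds, which is exactly what shows up as a self-loop node in the trace graphs. The addresses and the accept condition (points in the unit disk) are invented for illustration.

```python
import random

def rejection_sample(rng):
    # Keep proposing until the point lands inside the unit disk; every
    # iteration records the same address, hence a self-loop in the graph.
    trace = []
    while True:
        trace.append("A7_proposal")
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:     # accept condition
            trace.append("A8_accept")
            return (x, y), trace

rng = random.Random(0)
(x, y), trace = rejection_sample(rng)
```

Loops like this are what make such code tricky for inference engines: the number of latent draws per execution is itself random.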
And yeah, we have some running jokes, because we show thousands of histograms together and people sometimes make jokes about that. But there are ways of simplifying these pictures. I don't have time to go into detail, but the same picture can be looked at in a simplified way by omitting some information. For example, the same address can be hit many times in a for loop; say you have a for loop that draws the same random number a thousand times. We have a way of determining that from the address label, and we use a single node in the graph for that sample instead of creating a thousand nodes. When we do this, we end up with more interpretable graphs. For example, this graph comes from taking the 10 most frequent types of simulation execution. Everything starts the same; every execution samples the same sequence of random numbers until this point at A26, where multiple things can happen: you can go on to the next address, A9, or you can come back to yourself and end up sampling this address again. You can also measure, or record, the frequencies of these transitions by running the simulator many times and collecting the statistics. So you get a structure that tells you the stochastic structure of the probabilistic model defined by the simulator code base. The blue nodes at the end refer to the parts where we have the likelihood distribution; you can think of the last blue one as the simulated data. And an interesting thing: this is how the addresses actually look. The thing we call A1 is actually this, and someone who knows Sherpa from the particle physics community can understand it and say: okay, this is event generation.
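The simplified graph construction (one node per address, edges weighted by how often one address follows another across many traces) can be sketched directly; the traces and address names below are made up, but a self-edge like A26 to A26 is exactly how a loop shows up.

```python
from collections import Counter

def transition_counts(traces):
    # Collapse each address to one node and count consecutive-address
    # transitions over all traces; a loop appears as a self-edge A -> A.
    edges = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges

traces = [
    ["A1", "A26", "A26", "A26", "A9"],   # the A26 loop taken twice before moving on
    ["A1", "A26", "A9"],                 # the loop not taken at all
    ["A1", "A26", "A26", "A9"],          # the loop taken once
]
edges = transition_counts(traces)
```

Dividing each edge count by the out-degree total of its source node would turn these counts into the empirical transition frequencies mentioned in the talk.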
I generate this type of event, I have a hadronic decay event, and I ended up asking for a uniform random number. So there is a story here that you can interpret, and we actually have people in the collaboration who want to create tools that work with these explanations. These are interpretable: once you know how to read them, you get a picture of, okay, this is where we start, this is where we sample the values of the momentum, this is where we define the decay channel, this is where we do some rejection sampling, this is the calorimeter of the detector, for example. And the interesting thing is that if we take the 25 most frequent trace types into account instead of 10, we keep refining the resolution of the detail in this plot, because we keep discovering less likely behaviors that fill in the picture. You get more and more detail about the internals of the simulator's structure as you construct these graphs with more data. So this is what it looks like for the 25 most frequent trace types, this is what it looks like with 100, and this is what it looks like with 250. We have this capability of really looking inside the simulator and seeing the structure inside it. And this is actually happening for the first time, because the community who wrote the simulator didn't have this intention in mind. They created a probability model, but they never even looked at it; it's an implicitly defined probability model that we are discovering thanks to this software infrastructure. So I think this is the last slide. This type of thing can lead to a lot of research directions, and this is supposed to be a list of a few of them. I'm interested in automatic differentiation, and you can incorporate automatic differentiation into this type of protocol. You can use this structure for surrogate model learning.
For example, if you have this probabilistic structure, you can construct an interesting neural architecture that replicates the structure and learn a structured deep learning model; that is very interesting. The rejection sampling loops I briefly mentioned are very interesting and challenging: they require their own research on the probabilistic programming side to be handled efficiently and correctly. There are also distributed training and inference aspects. For example, we initially ran this on a supercomputer at Lawrence Berkeley Lab called Cori. It was the largest-scale PyTorch execution on CPUs at the time, if that is interesting. So yeah, there are a lot of things. And the other interesting thing is that you can do the same for other domains and other simulators; this is the last slide. Currently, for example, I have a collaboration with the European Space Agency where we do something similar for spacecraft collision avoidance. They have simulators for spacecraft in orbit, mainly satellites, and you have to predict collision events. There are students in Oxford working on epidemiology models, and this was happening even before the coronavirus pandemic; we had some work on malaria simulation. And there are other people working on the simulation of materials and things like that. So the idea is: when you have a simulator, don't just assume that running it forward is all you can do with it. You can do interesting, cool things with your simulator in machine learning and probabilistic machine learning. That's basically it. And I think this slide is not supposed to be here: I co-organize this workshop at NeurIPS, which took place in December, so this was an advertisement for that, and I hope we can have the same workshop next year.
So if the workshop happens next year and you have papers on physics and machine learning, we would be really happy to get your papers at this workshop. That's all, thank you.

Wonderful, thanks a lot, Grünesh. That was really nice.

Okay, thank you. I'm glad you could hear me.

We heard you perfectly.

All right, okay, thank you. Okay, questions. Does anyone have questions? Please raise your hands. Okay, Lui. Oh, there are raised hands, okay. There you go.

So in the example, the amortized inference was about 200 times faster than MCMC. And you were using MCMC to validate the amortized inference. Now, if other people were using it on other problems, what do you think is the generalization to other cases? Does everybody have to go back to MCMC to check it, or do you have some general belief that it's going to work for pretty much anything?

Yeah, that's a very nice question. We tested this MCMC inference engine in the system quite extensively. There are unit tests with standard test problems, and it seems to do the job; there are problems where you know the answer analytically, where there is an analytical formula for the posterior of a given model. What I'm trying to say is that you can test these inference engines, and after a while you can trust them and say, okay, it has been shown that this inference engine works in practice. Maybe you can then skip the MCMC inference and go directly for the amortized one, if you trust what we did here. But from the machine learning perspective, I think you would do the MCMC first. That's actually what we do with students, to make sure we really have a good model for which the posterior can be found and makes sense. Once we have convinced ourselves it's working, we go for the faster one after that initial exercise.

All right, thanks.
And the output from the tau problem was a posterior probability distribution over all the variables of interest.

Exactly, yes.

And the width of that distribution tells you the statistical uncertainties on the parameters.

Yes, yes.

And the systematics you have to worry about separately.

Actually, I don't quite have the background to talk about systematics; I hear the physics people talking about this a lot. But for example, there is this multi-modality in this example, and when we talked to the physicists, they said, yes, this completely makes sense: there is a symmetry in this direction of the momentum, and it could be either this or that. So these results give you uncertainty that tells you this type of data can be produced in many ways, and this is the level to which we can be certain about what happened. Beyond that, I would need to know which systematics you mean to answer the question.

Okay, thank you. That was really a very interesting talk, and it does indeed have a thousand and one possible applications. I'm asking about something that has nothing to do with inference or machine learning at all: your technology for instrumenting the executables to generate your call stacks, and in particular to then generate these more compact representations of the execution stacks. In principle, because everything between your nodes is deterministic, this representation completely captures the entire execution paths of any program. So if we wanted to analyze programs that have nothing to do with simulation, how would we learn to instrument our code so that it can in the end give us diagrams like the one you are currently showing? What is it that I need to do in order to instrument a piece of code?
It is simulator code, but it has nothing to do with particle physics in this case.

Yeah, so, is it a stochastic code, for example? That's very important.

Yeah, it has random number calls in between.

Okay, because if it's deterministic, you will end up with just a...

Yeah, one end and one result.

So for your case, I think you already know the answer. In practice, this is what happened. In the Sherpa code base there was one single C++ file, something like random.cpp. The people who designed the simulator code base made that choice: they unified all the functions that return random numbers in a single file. So our task was just to go to that file, keep all the API, all the function signatures, exactly the same without touching them, and inside each function redirect the random calls. For example, where it was calling the C++ standard library for, I don't know, a Poisson distribution or a uniform or something, we redirected that call to our C++ library, which connects to the machine learning system over the protocol I mentioned. That was all we needed to do. The simulator stays the same; the simulator is happy; we return the random number. And as a side effect, you get these plots and everything I'm showing in the talk.

Okay, so the task would be to consolidate all of the calls to random numbers into one piece of code that you can then instrument.

Yes. And even if you don't consolidate them into one source file, you can still do it: you need to find the places of randomness in your code base and redirect all of them to the protocol. That's all.

No, thank you. That's quite straightforward.

Yeah, thank you. Thanks, Armin. Hannah, please go ahead.

Great, thank you so much. Thank you for a very interesting talk. Can you hear me okay?

Yeah, I can hear you.

Yeah, thank you.
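What the answer describes (keep the public signatures, swap the internals) looks roughly like this in Python. The real work did this in C++ inside the simulator's random-number file; the class and function names here are invented for the sketch, and the Recorder stands in for the protocol connection to the machine learning system.

```python
import random

class Recorder:
    """Stand-in for the protocol endpoint: logs every draw before returning it."""
    def __init__(self):
        self.log = []   # (distribution name, value) pairs

    def sample(self, name, draw):
        value = draw()
        self.log.append((name, value))
        return value

class RandomModule:
    """Stands in for the single file that owns all of the simulator's randomness."""
    def uniform(self, a, b):
        return random.uniform(a, b)

class InstrumentedRandomModule:
    """Same signatures as RandomModule; every draw is routed via the recorder."""
    def __init__(self, recorder):
        self._rec = recorder

    def uniform(self, a, b):
        return self._rec.sample("uniform", lambda: random.uniform(a, b))

def simulator(rnd):
    # Simulator code is unchanged: it calls rnd.uniform exactly as before.
    return rnd.uniform(0.0, 1.0) + rnd.uniform(0.0, 1.0)

rec = Recorder()
total = simulator(InstrumentedRandomModule(rec))
```

Because both modules expose the same `uniform(a, b)` signature, the simulator cannot tell which one it was handed; the trace in `rec.log` is purely a side effect.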
Okay, I'm over in atmospheric physics, and I work on developing stochastic simulators for weather forecasting in particular. Obviously an important thing there is the input state; it's an initial value problem. I didn't really notice in your slide about training the neural network where an initial condition would come in. How does that fit?

I get the question, because I did some oversimplification in these slides. If you are after determining the initial conditions, then in this type of framework you need to treat them as latent variables and define prior distributions over those initial conditions.

So I suppose the idea is that I have initial conditions, which I know; I have an observed future state, which I know; and I have my simulator, which includes draws of random numbers. What I want is to better constrain the parameters that define these sequences of random numbers. There are uncertain parameters in the code which I would like to constrain.

Right, or, sorry: if you want to infer initial conditions based on your data, you need prior distributions over the initial conditions, and if there are other things you want to infer, you need prior distributions over those too, because it is the Bayesian inference idea that underpins everything, and it doesn't work with things over which you cannot define priors. I don't know if that explains it; some people find this difficult. For example, in the particle physics collaboration, the physicists were initially not comfortable with selecting somewhat arbitrary priors over physical parameters. Yeah, there are these kinds of things. I fear I didn't completely answer the question.
I can try it again. I can get in touch with you offline. Yeah, the quick question is to follow up is whether in climate models are written in FORTRAN? Yeah. With that, does that sink in? Yes, yeah, yeah, yeah, I can imagine that, yeah. Actually even maybe easier than some other languages we've worked with, yeah. Well, there you go, thanks so much. All right, okay. Nice, thanks a lot. Okay, Tim, your question. Yes, I was interested how you make the connection between actually on the slide you've got up here, slide 44, how you make the connection between the signatures that you have and the variables that you've got. In particular, I'm just concerned that you might have in some simulations where one signature can actually be used for different variables or different parts of the model, depending upon the interaction between the other variables. So how do you take that into account? Yes, okay, this is important. And actually this is like really difficult to explain in the beginning. So these labels that this, I think when you say signature, you refer to these labels here. And they actually like they are, they have to be unique, otherwise your probabilistic programming doesn't work. Like they have to be, like they have a lot of components here. For example, this number in the end, like this underscore one distinguishes between different times, the same thing got called like in the exact, like a sequence of function calls. It's the first time this thing was called. The second time we'll get another number here. It will get two, if you end up calling the same thing again. And these things will be treated as completely separate, the random variables by the probabilistic programming system. So we don't like otherwise the inference engine doesn't work. We need to be deterministically able to like reproduce a whole execution trace from scratch by just like having a dictionary of like unique labels and the value sample for those. 
So I think the answer is that we ensure one name refers to only one unique thing in the execution, and there is never the degeneracy you were referring to.

Okay, I think I understand that. So how do you then link these to a particular variable?

We basically start with these labels, and we make the choice of calling them random variables, because this is all we have. We treat each unique label discovered in the simulator's execution as a random variable. Does that answer the question? I guess maybe not.

Well, later on you showed that you could point at one and say: that's px, and that's py, and that's pz. How do you know which one is px, and so on?

Okay, maybe I can explain it this way. Everything starts with forward; that is the entry point. The label is actually a series of stack frames in C++ that tells you the history all the way from the beginning of the execution of the program until the point where the random sample was requested. It tells you the story of the series of function calls that happened: the main forward function called another function, which called another function, and so on, and in the end you asked for a categorical. Sorry, this is blocking that. It is a categorical distribution with 38 categories, a discrete distribution, and this is the first time we call it. And the 38 gives you the idea that, oh, this is maybe the point where we are sampling the decay channel. And when you work with a collaborator who is really familiar with the code base, they can read these things and say: I know this part of the simulator; this is where we construct the decay table.
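The "series of stack frames" idea can be mimicked in Python with the standard traceback module. The real system inspects C++ frames; the function names here (forward, decay, sample_channel) are invented to mirror the story told in the answer.

```python
import traceback

def current_address():
    # Build an address from the chain of function names between the entry
    # point ("forward") and the point where the sample was requested.
    names = [f.name for f in traceback.extract_stack()[:-1]]
    start = names.index("forward")   # cut off everything above the entry point
    return "/".join(names[start:])

def forward():
    # Entry point of the toy "simulator".
    return decay()

def decay():
    return sample_channel()

def sample_channel():
    # In the real system this is where a distribution (for example a
    # categorical over 38 decay channels) would be requested; here we
    # just return the label that identifies the request site.
    return current_address()

addr = forward()
```

The address encodes the whole call history, which is why someone who knows the code base can read the real labels and point at the part of the simulator responsible for each one.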
So you can look at these things, interpret them, and actually locate the place in the code base that was responsible for this label at runtime. Basically we have a way of understanding what these things are doing just by looking at them; not for all of them, but for the main parts, yes. Another thing is that for this problem we set it up such that when you run the C++ simulator, the first things that happen are always sampling the momentum of the particle, the x, y, z components, then some other things, and then the decay channel. So all simulations start with these choices of momentum and decay channel, and the rest follows from that; the stochasticity starts after. So we know the problem structure. I hope that answers it a bit.

Yeah, thanks. So it's basically a deep knowledge of the simulator and how it works that allows you to tease out this information.

Exactly. And you need access to someone who really knows the simulator.

Yes. Okay, thanks.

All right, nice. Okay, thanks Tim. Thanks Grünesh again. If there are no more questions, then that is the end for today. Thanks to everybody, and see you next week.

All right, thank you. Bye everybody.

Bye. Thank you very much.