and let me welcome our speaker this week, who is Rachel Prudden. Rachel is a senior scientist at the Informatics Lab of the Met Office, which is the UK's National Weather Service. Her career at the Met Office started in 2012. Since then she has worked on a great variety of topics, ranging from animations of weather data, and the necessary backend data pipelines that go with them, all the way to scalable cloud computing for meteorological data analysis. Rachel is passionate about the union of machine learning, applied mathematics and statistics. In her current work she applies these tools to a diverse set of problems involving very different time scales, ranging from improved precipitation nowcasting on the one hand to causal analysis of climate indices on the other. In all of these applications, and in many others, it is very important to reason properly about the uncertainties in one's analysis and in one's predictions. And that's going to be the topic of her talk today, which is on probabilistic models in atmospheric science. So Rachel, please, your turn.

Thank you very much. Yes, as you mentioned, I am part of the Met Office Informatics Lab over in Exeter. I'm also part of the University of Exeter, both through the joint centre that's just started up, and also doing a part-time PhD there at the moment. A bit about the current projects I'm working on in these places: in my PhD work I'm looking at probabilistic models, especially spatial analysis, at the convective scales where things get very unpredictable, and again this uncertainty starts to become important. At the Met Office I'm currently involved with a project on a slightly different topic, looking at using machine learning to help emulate the gravity wave drag parameterizations, and using that to improve the models in seasonal climate predictions of the QBO and the MJO. To move on to the topic of this talk, I'll be talking about probabilistic reasoning. What do I mean by probabilistic reasoning? It's broad, actually, but pretty much anything where we're using Bayes' theorem, or something inspired by Bayes' theorem, to handle conditional probabilities. A conditional probability is the probability that some event A is true given what we know about some other event B, and Bayes' theorem tells us that we can understand this conditional probability by knowing something about the reverse, the probability of B given what we know about A, and also the probabilities of A and B separately. So, why is this important in atmospheric science? One reason it's really important to use probabilistic reasoning and understand this probabilistic aspect is to cope with high impact, low probability events: situations that might be quite uncertain or quite low probability, but have a large impact, such as flooding events, drought, or other kinds of extremes. By just focusing on a deterministic forecast, these will tend to be missed; you might not see that there's a potential for this kind of event happening. One way that a lot of met services cope with this is known as ensemble forecasting. Rather than having a single deterministic forecast, you vary the initial conditions and run a whole ensemble of models, and you can use the properties of this ensemble; this gives you a range of different predictions instead of just one.
And you can use this to start to understand: okay, what is the possible range of outcomes that could happen, and what in the tails of this range of outcomes do we need to potentially be concerned about? So that's one reason why I think it's important to think about probabilistic reasoning. And although this is a very good reason to be thinking about probabilities and uncertainty, I think this viewpoint can be perhaps slightly limiting in the way we think about probabilistic reasoning. So I'd like to think about why else you might choose to use probabilistic methods and what else they can give you. Essentially, in my view, probabilistic reasoning is about how you can leverage information by combining observations with a data generating process, or what you know about how the data was generated. Using these two sources of information together will often let you infer non-obvious information that you couldn't have got from either individually. By combining observations with some kind of model of how you believe the system operates, you can start to understand what's really happening and how you should interpret those observations. As a result, it often turns out that a task which at first appears to need a lot of training data, a sort of big supervised learning approach, can sometimes be tackled in a different way, without using any training data at all, by following through this chain of probabilistic reasoning. Depending on your background that might be extremely obvious or it might be quite non-obvious, but I'm going to go through some examples to hopefully convince you that it's the case. As a takeaway from the talk as a whole, I want to advocate that understanding how your data is generated can be a superpower in some situations. So for the rest of the talk I'm going to go through a couple of examples of this way of working. They're slightly limited in the sense that I'm just going to be talking about work that I've done, so it will be focused on areas where I've done a bit of work, the first of which is spatial statistics. These examples are going to be mostly on the statistical end of the spectrum, looking more like traditional statistics. On the other end of the spectrum, in the more deep learning inspired or simulation inspired approach, I'm going to show some examples of probabilistic programming, which is a newer approach, and look at using it for event detection. But I'm going to start with the spatial statistics examples. The first thing I'm going to talk about is something I've been looking at in my PhD research, which is super resolution, or downscaling. I've mostly switched to calling it super resolution, because downscaling can really confuse people; people really disagree about whether downscaling means that you're going to higher resolutions or lower resolutions. For the purposes of my talk, downscaling means that you start with something that's coarse, that's low resolution, and you want to get to something that's higher resolution, so super resolution. The motivation behind this work is about the models that we use in weather prediction, these simulations that run on a supercomputer. What they try to do is capture the physics of the atmosphere, and they do this by discretizing the atmosphere into little boxes and understanding how the fluid flows and other processes
interact over these boxes. The smaller the boxes are, the better we can understand what's going on at high resolutions, at very local, small scales, but also the more expensive the simulations become. At some point it becomes infeasible to run them, certainly on operational timescales that users could benefit from. So something that could be interesting is to look at: if we take a forecast that's been run at low resolution in these models, can we get some idea of what's happening at higher resolutions without going and running the whole model? Can we statistically understand what we might have seen had we run that model? I'm specifically focusing on the convective scales, so the scales below about 10 kilometres, around the one to two kilometre scales. This is the scale at which you tend to see features like clouds and convection, things that can be very noisy and quite unpredictable, whereas the global models that we run usually tend to be on the 10 kilometre scale. The data that I'm using for this experiment is real data from the high resolution model. It's wet bulb potential temperature data, which is a kind of measure of the heat content that's quite correlated with the moisture and with things like the cloud fields. The low resolution data is actually synthetically generated from this data: I've taken the data from the high resolution model and then coarse-grained it by doing a block averaging. The one that's shown is an eight by eight block averaging, which takes it roughly back to the 10 kilometre scale. The goal is to approximate the distribution of what might have been happening at high resolution, given a particular thing that you see at low resolution. And this is just to say that this distribution can be quite broad; there's a lot of variability here. It's not that you're looking at a deterministic kind of supervised learning approach where you take the low resolution field and you know exactly what's happening at high resolution. It's more that you want to infer the structure of what's going on, even though the locations of these convective events and so on are probably going to be quite unpredictable. The model that I'm using for this work is Gaussian random fields, which are based on normal, Gaussian distributions. You probably know this, but I'll just go over it: Gaussian or normal distributions are defined by their mean value, but also their covariance. In one dimension this is just a variance, and it tells you what amount of variability you see around the expected mean value. For two dimensional distributions, as shown here, you still have the mean value and you still have this covariance, but it's now a matrix. What this matrix tells you is the variance of each of the dimensions, each of the variables, along the diagonal of the matrix, and then the covariances on the off-diagonals tell you about the amount of correlation between the different variables. In the case on the right there's a fairly high correlation between them, so you'd see higher covariance values. In the case of a Gaussian random field, it's essentially exactly the same thing but taken to an extremely high dimension. In theory it can be infinite dimensional, but in practice this isn't something I'll need to worry about.
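To make that two-dimensional mean-and-covariance picture concrete, here is a minimal toy sketch (the numbers are made up for illustration, they are not the talk's data): sampling from a two-dimensional Gaussian whose covariance matrix has the variances on the diagonal and a fairly high correlation on the off-diagonal.

```python
import numpy as np

mean = np.array([0.0, 0.0])
# diagonal entries: the variance of each variable;
# off-diagonal entries: the covariance, here a fairly high positive correlation
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=1000)
print(np.corrcoef(samples.T))   # empirical correlation comes out close to 0.8
```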
So there are two names for what is essentially the same model: it's usually called a Gaussian process if you have one time dimension or abstract dimensions, and Gaussian random field is the name usually used when there's some sort of spatial interpretation. I've shown some examples here, firstly of a one dimensional Gaussian process, and on the right hand side I've shown what the covariance matrix looks like. What you'll see is that the covariance matrix in some sense encodes the structure of the field, of the samples that you're seeing. You'll see the highest values along the diagonal, which is saying that each point is most highly correlated with itself. This then drops off as you go away from the diagonal, so that the correlation becomes less and less for more distant points. If you sample from this Gaussian process, what you're actually doing is sampling a function. Depending on what covariance matrix you use, what parameters you give to this Gaussian process, the functions that you sample will have different properties. If you think of the speed at which this covariance value drops as you move from one point to a more distant point, that actually encodes something about the amount of variability of the field, how sharp or smooth it is. The samples that I've shown start with shorter length scales, where the correlation drops off more quickly, and go to longer length scales. Below, on the bottom row, I've shown the same thing but for a Gaussian random field in two dimensions. The idea here is extremely similar. The covariance matrix looks a little stranger, but that's because you have two dimensions of space compressed into a single dimension in the matrix. What it still shows is that each value is most highly correlated with itself and with its nearest neighbours, and this covariance drops off as you move further away. The same as in the one dimensional case, you can have more quickly varying fields with shorter length scales, and smoother fields with longer length scales. This model has some nice properties, some good things you can do with it, since it's essentially a statistical model. One of these, going back to the probabilistic reasoning idea that we're interested in conditional probabilities, is that you can condition these models on observations. So if you have some data at particular points and you want to know what's going on between the points or further away, there's some linear algebra that you can run through, and you get out a posterior distribution that's been conditioned on this data at the observation points, and that can be constrained to exactly meet those points. Besides conditioning, we can also sample from the distribution, and the way we do this is by taking the distribution mean and then adding some random noise that's convolved with this covariance matrix, or in fact its square root; in practice we usually use the Cholesky decomposition instead of the true square root, just because it's more efficient. Taking this mean and then this extra noise term gives us a Gaussian distribution over functions: we have a mean function and then this extra noise that gives us some uncertainty in the function that we're sampling. So, I've described what a Gaussian random field is.
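As a rough illustration of those sampling and conditioning steps, here is a small one-dimensional sketch. It assumes a squared-exponential covariance and a zero mean (the exact kernel isn't specified here, so treat those as illustrative choices), builds the covariance matrix from a length scale, samples functions via the Cholesky factor, and then conditions on a few point observations.

```python
import numpy as np

def sq_exp_cov(x1, x2, length_scale):
    # squared-exponential covariance: correlation drops off with distance,
    # at a rate controlled by the length scale
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / length_scale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
K = sq_exp_cov(x, x, length_scale=0.1)
jitter = 1e-8 * np.eye(len(x))        # keeps the Cholesky factorisation stable

# sampling: mean (zero here) plus the Cholesky factor of the covariance times white noise
L = np.linalg.cholesky(K + jitter)
prior_samples = L @ rng.standard_normal((len(x), 3))   # three sampled functions

# conditioning on a few (noise-free) point observations
obs_idx = np.array([20, 90, 150])
y = np.array([0.5, -1.0, 0.3])
K_oo = K[np.ix_(obs_idx, obs_idx)] + 1e-8 * np.eye(len(obs_idx))
K_xo = K[:, obs_idx]
mean_post = K_xo @ np.linalg.solve(K_oo, y)
cov_post = K - K_xo @ np.linalg.solve(K_oo, K_xo.T)

# posterior samples pass (essentially) exactly through the observed points
L_post = np.linalg.cholesky(cov_post + jitter)
post_samples = mean_post[:, None] + L_post @ rng.standard_normal((len(x), 3))
```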
But it may not be immediately obvious how this relates to super resolution, because what I've just shown you is a way of conditioning on observations at particular points, whereas for super resolution what we really need is to handle observations that are taken as a spatial average. Fortunately, it turns out that this also works perfectly well. The reason it works is that taking a spatial average is a linear operation, and this means that in exactly the same way as we can condition on a point value, we can condition on a spatial average instead. All the maths turns out to follow through pretty much exactly the same; you just have this extra matrix, this linear operator, that describes the coarse-graining that's happening when you take these spatial averages. So in exactly the same way as we can observe the value at one point and then get a posterior distribution over functions that pass through that point, we can condition on an observation that's a spatial average and get a posterior that describes functions that obey that spatial average. I'll show you some examples in one dimension, because it's easier to follow what's going on than in two dimensions. I'll show you two different Gaussian random fields that have been conditioned on exactly the same data, the same sequence of points. In the example on the left, I'm treating these as point observations, so constraining the functions to go through those points, whereas on the right I'm using the interpretation that those observations are spatial averages, as we need for the super resolution task. On the left, the red dots show the point observations, and you can see that the variance of this posterior distribution drops right to zero when it goes through these points, so any samples that you draw from this distribution are constrained to go exactly through the points. And you can see just below that there are some samples drawn from this distribution, and indeed they all go through the points that we observed, as we'd expect. On the right, I've shown the observations as lines, since they represent averages. You can see that there's no longer anywhere where the variance drops down to zero, so you get a much more uniform distribution of variance everywhere in this distribution over functions. Likewise, if you look at the samples, they're no longer constrained to go exactly through the red points, where they would have been if we'd done the point constraints. They're often quite close, but they can go somewhere else entirely. Okay, so if this spatial average conditioning doesn't constrain the functions to go through the points, what is the new constraint that they're following? It's nice to see that if you then apply the same coarse-graining that we did for our observations, the same interpretation, but to the samples drawn from this distribution, what you get in the spatial conditioning case is this dark green line, or in fact several dark green lines drawn on top of each other, because in each case it agrees exactly with the observations that we made. So we've actually been able to draw samples that have the correct spatial averages, even though they can do what they want at smaller scales.
By contrast, if we do the same spatial coarse-graining operation on the samples from the other distribution, the one where we did the point conditioning, this doesn't happen; this is where we get these light green lines. It shows that the samples that have been conditioned on the point values can actually vary quite a lot when we take these spatial averages. They're not constrained to follow the spatial averages that we observed, as we'd expect. And it's actually quite interesting, I think, to look at what happens to the covariance matrices after you do these conditioning operations; these are the posterior covariance matrices after we've conditioned on the observations. If you look at the point conditioning example first, on the left: whereas the original prior distribution, before we did any conditioning, would have shown this strong diagonal, as you saw a couple of slides ago, with all positive values, each point most strongly correlated with itself and the covariance dropping off as you go further away, we now have some negative values involved. What I think those are doing is this: you get this sort of positive-negative dipole around the location of each of these point observations, and what we see this doing is imposing smoothness, a continuity, through the observations. Since you've got a smooth function, you know that if it's lower on one side of where you've got this observation, it should be higher on the other side, so that there's no sharp change when you get to this constrained point. In some sense the positive and negative values in the covariance are forcing this to be true. Likewise, if you look at the spatial averaging case, you also see some negative values, but in this case they look a bit different. It turns out that these negative values appear off the diagonal, over the area on which we've done this coarse-graining. So within one of these blocks, we have some positive covariance locally to one point, and then some negative covariance in the rest of the block that we've conditioned over. What this is saying is that if we've observed quite a high value at one point of the block, we know what the average value should be overall, so we know that the other values are going to have to be lower to compensate and keep this average as we've observed it. I think it's quite nice that this actually drops out, because these aren't things that we've had to put into the model; they've dropped out from the underlying mathematics of the conditioning. We've just applied these linear algebra conditioning relations and had these quite nice properties come out in the covariance that force the distribution to follow our constraints. I'm not really going to go over the super resolution application in a lot of detail, but there is a preprint on arXiv currently if you're interested in that side of it. Just to mention briefly, we did apply this in two dimensions to these wet bulb potential temperatures. You can see some examples here: the target field on the left hand side, followed by the coarse-grained observation. Then, and I should say this isn't actually the newest version of this figure, there's a benchmark that's the bicubic interpolation, followed by the mean of the distribution, the Gaussian random field that we have.
And then there are some examples drawn from the distribution. What we did for the super resolution was actually slightly more complicated than what I went through for the one dimensional case, because we were also trying to infer something about the amount of variability in the field, so the length scale. Rather than keeping that as a static parameter, we allowed it to change with time. This is trying to capture some of the different weather conditions that could be happening: are we in a situation where there's a lot of convection going on, and maybe the length scales are a bit shorter, or is it all very smooth, with not much convection happening, and then maybe you see a longer length scale. There's actually slightly more going on in the paper, but the essential idea is the same, and the mechanism of how you do the super resolution is the same as in the one dimensional case. Just to bring this back to where I started the talk, in terms of a data generating model, because I said that understanding how the data is generated is what allows you to sometimes tackle problems without training data that you might think would need training data: in this case, the data generating model is that we've assumed the high resolution field is in some sense a Gaussian random field, so we've assumed it's been drawn from one of these distributions with a particular length scale, and the coarse-grained field, which is actually the one we're observing, is obtained by taking these spatial averages over the Gaussian random field. That's the model we're using, and this model is what enables us to do this inference and to get these high resolution fields from having observed the low resolution field. Using this model together with the observations of the coarse-grained values lets us refine our distribution and get a distribution of what's happening at high resolution. And I found this quite interesting, because there's also the possibility to use a completely different approach to this super resolution problem. People have definitely looked at using a purely data-driven, supervised learning, possibly deterministic approach, where you look at past data and say: if I've seen this at the coarse scales, that should be correlated with what's happening at the finer scales. What I found kind of interesting about this work is that it shows you don't necessarily need to do that; there is information that you can get from just the model and the coarse-grained fields. They don't do exactly the same thing, so depending on what you're most interested in, one might be more suitable than the other. At these convective scales, and in fairly non-orographic areas, everything tends to be quite stochastic, so there isn't really much point in doing a regression kind of approach; in different cases, maybe a regression would be a better choice than this approach. But it's interesting that you do have this option of an approach that doesn't rely on data. The extension to this work was actually suggested by one of my colleagues, who was asking: well, what if I have these observations of the coarse-grained fields, but I also have a point observation? Is there any way I can combine these to understand what's happening away from the point observation, in different parts of the field? What would you expect to see? Thinking about the case of rainfall for a moment:
if you know what's happening over the full grid box, where you've got this coarse-grained observation, and you also have a point observation showing quite a high value, you'd expect the values elsewhere to need to be lower, to account for the fact that you know what the average is overall, so it must average out somewhere. And that turns out to be what we see as well. If we again take, this time, some synthetically generated data from a Gaussian random field, and also a coarse-grained version of that field, we can look at what happens if we put in one extra observation, our point observation, indicated with the red arrow. We can compare what happens if we put in that point observation with quite a high value, so I've put in a value of two there on the left hand side, versus the right hand side where there isn't any point observation and we've just got the coarse-grained observations. As you'd expect, if we take a sample from each of these, where we've got this higher observation there are some higher values around it, which would seem to make sense. That's kind of the minimum you'd hope for. To see the rest of what's going on, it helps to break it down slightly and look separately at the mean field and the variance field. Looking first at the mean of the two distributions: if you do the point conditioning, you can still see this kind of raised area around where we've got the high value observed, but you can also see that elsewhere there's actually a lower value in the rest of the box over which we've taken this spatial average. This is maybe clearest if you look on the right hand side, where you have the difference of the two fields, for point and no point observation. So like in the cartoon a couple of slides ago, you've got this high observation and then you've got this raised area around it, but then this compensating low area elsewhere in the grid box, and going slightly outside it. Looking at the variance, this one turned out to be slightly more complicated, but taking the difference again helps. The reason it's a little more complicated is that it's quite hard to see what's going on when you look at the variance fields of the distributions themselves, because of this checkerboard pattern. This wasn't something I'd initially expected, although thinking about it, I think it does make sense that you see this checkerboard pattern when you do this spatial conditioning. What's happening is that you have the lowest variance, these blue areas, in the centre of where you've taken the spatial average; that's where it's most confident. It's less confident when you're on the border between two different spatial averages, where maybe you've got two different values and it's not quite so certain what's happening along the border. And then you've got the most uncertainty at the corners, where you've got four different values of the spatial averages coming into play, which I suppose is the largest distance to the centre of any particular spatial average. So yes, it's kind of hard to tell what's going on in those left two figures, but if you look at the difference field on the right hand side, where they've been subtracted, it's easier to see what's happening. What you see is similar to what was in the mean: you see this drop in the variance, this area of higher certainty, surrounding the point observation that we've added, and you also see a drop in variance at the other side of the grid box.
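To spell out the conditioning being used here, this is a one-dimensional toy sketch (grid size, length scale, block size and observed values are all made up): a Gaussian random field is conditioned on block averages plus one extra point observation through a single linear observation operator, using the standard Gaussian conditioning identities. The two-dimensional case discussed above works the same way, just with a 2D block-averaging operator.

```python
import numpy as np

# 1D toy version: condition a Gaussian random field on block averages,
# plus one extra point observation, via a single linear observation operator H
n, block = 100, 10
x = np.linspace(0.0, 1.0, n)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.05) ** 2)   # prior covariance

H_avg = np.zeros((n // block, n))
for i in range(n // block):
    H_avg[i, i * block:(i + 1) * block] = 1.0 / block        # each row averages one block

H_point = np.zeros((1, n))
H_point[0, 42] = 1.0                                         # one point observation

H = np.vstack([H_avg, H_point])
y = np.concatenate([np.zeros(n // block), [2.0]])            # flat block averages, high point value

# Gaussian conditioning on y = H f (tiny observation noise added for numerical stability):
#   posterior mean = K H^T (H K H^T)^{-1} y
#   posterior cov  = K - K H^T (H K H^T)^{-1} H K
S = H @ K @ H.T + 1e-8 * np.eye(H.shape[0])
G = K @ H.T @ np.linalg.inv(S)
mean_post = G @ y          # rises near the point observation and dips over the rest
cov_post = K - G @ H @ K   # of its block, so the observed block average is preserved
```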
So this is kind of interesting: we've got these two different areas where we're getting what could be seen as improved predictability from this point observation. This is quite nice, because if we just took the point observation by itself, it would give us some lower variance, some more predictability, around the point observation itself. By combining it with the spatial observation, we actually get this extra area of higher predictability that we wouldn't get from just the point observation, and we wouldn't get from just the spatial observation, but we get from the combination of the two. And yes, that's just comparing the mean and the variance fields. In this case, taking it back again to the data generating model, there isn't too much more to add, because the data generating model is essentially the same as in the previous case, the super resolution. This was actually quite a nice bit of work to do, because it was possible to reuse pretty much everything from the first case and just change it slightly to incorporate this new kind of observation. All of the reasoning just followed through in the same way, so it's nice that it gave these extra emergent effects without having to do a huge amount more to the model itself. So that was the spatial statistics part of the talk, and that was using fairly traditional statistical models. But what if we're not using these very nice Gaussian traditional stats models, and we don't have a closed form solution for this kind of reasoning? What if your model for the data generation doesn't even look like a statistical model? Because in those two examples that I've just explained, there's actually quite a bit of leverage that you get from the fact that you have all these linear relationships, these Gaussian rules for bringing in observations. When you don't have these advantages, there's still stuff that you can do with a probabilistic reasoning approach. So the next and final thing I'm going to talk about is probabilistic programming for event detection. This is actually at a bit of an earlier stage, in that I've done some experiments but it's not written up into a full experiment with real data yet. For this part of the talk I'm drawing on a couple of really great resources, so if you're interested in learning about probabilistic programming I definitely recommend these two: there's an introduction to probabilistic programming, a sort of review paper that's on arXiv, and also some lecture notes on advanced topics in machine learning; you can find them referenced there. They do a really good job of explaining this kind of mindset and how probabilistic programming can help. So where does probabilistic programming fit? We've got this kind of statistical reasoning approach, which is shown on the right hand side of this figure. There you have some kind of model, you have these parameters, and you have some kind of observation, and what you're trying to do is move back upwards through the model: you've got your observations and you want to infer something about the hidden variables. On the other hand, the other aspect that probabilistic programming is drawing from is the kind of computer science view. It's also shared by a lot of science that uses simulation based approaches: you start off with some parameters, and these parameters define a program that's specified in code.
You run the program and then you get an output, maybe a prediction or a result or whatever. Probabilistic programming sits somewhere in between the two. It uses the same kind of setup as you do with programming, where you've got these parameters defining a program and the program producing observations, but in a sense you're thinking of going backwards through this process: you have some observations, and you're running back through the program to try and do some kind of inference about what your parameters originally should have been. So it's a sort of statistical inference, but within programming. The philosophy of probabilistic programming languages generally is to decouple the specification of the simulation from the inference. At the top level, which is what the users see, you have some kind of simulator that you're writing; that's your model for how the data is produced, and it's the user's responsibility. It's the kind of thing that scientists generally will be able to do: they know a lot about a system, and they can write some code that describes how the system works, whether it's fluid dynamics or anything else. On the other hand, what the probabilistic programming system is responsible for is the inference engines that are going to run on these simulations. These work at quite an abstract level, trying to do inference on these kinds of programs. And the hope is that, while any traditional statistical model can be expressed within probabilistic programming languages, a lot of quite different types of model can also be expressed, in a way that's relatively easy, especially for users who are more on the science side rather than being statisticians per se. Once you have this framework where you can express any of these quite varied models, and then do inference on them, you've kind of liberated a lot of people from worrying about the inference for their models immediately. Here are a couple of nice examples that I think show how this can be different from traditional statistical inference. The first is captcha generation, or captcha inference I should say. The idea is that you have an image captcha, and you want to know what was the string of letters and numbers lying behind that image. You could have a program that takes a string of letters and numbers and then generates this captcha, maybe first printing them and then doing some strange convolutions to make them more difficult to recognize. You can write this down as a model, and you can presumably code it up, since this is how captchas are generated anyway. The idea is then to use a probabilistic programming framework to do inference, to say: okay, I've got this captcha, can I recognize the string that went into it? And this is quite nice because, again, this is a problem that could also be tackled as a supervised learning type of thing: you could have a lot of data showing a lot of different captchas, with labels of what string produced each one. But you can instead use more of a reverse simulation approach, using the fact that you know how the data should be generated to then do this inference. Another example is constrained simulation, so this is just having some kind of procedurally generated trees, but then constraining them to avoid some areas or go through some areas.
This shows that it can be used on some quite different kinds of model that you can easily code, but where it's less immediately obvious that you can use them for statistical inference. So what does a probabilistic programming language give you over and above what any programming language gives you? It just has a couple of extra features. It allows you to sample from a distribution, a regular statistical distribution, and use that, for example, as a parameter for subsequent distributions. You can transform distributions, do deterministic transformations. All of these dependencies, all of these different variables depending on each other, are tracked by an inference engine. And you also have an extra rule that lets you condition on some target data, or some observations that you've seen. The specific one that I'm using in this example is Pyro, which is a probabilistic programming engine built on top of the PyTorch framework. For the underlying inference algorithms it's got a couple of different options, but basically you can either use stochastic variational inference, which is similar to what underlies variational autoencoders if you've come across those: you're treating your inference as an optimization, you've got some approximation to your distribution and you're trying to get the best approximation that you can. Or you can use Markov chain Monte Carlo, which is maybe a more traditional statistical idea of how to do this inference, and which lets you draw approximate samples. I'm going to be using the first, stochastic variational inference, here. And the example that I'm going to show for probabilistic programming is, as I mentioned, event detection. The goal is: can you detect event signatures in a time series? That hopefully shouldn't be too hard, but what if this time series contains correlated noise that makes the events quite difficult to distinguish from the noise? So can we use the knowledge of how the data was generated to do a good job of this event detection, which would usually be quite challenging? The way I'm setting this up, I'll just run through all of the steps. The first step is to generate the events, the nice clean time series that shows the events we're looking for. So I've first got a time series that just shows a binary blip where we've got the events that we're looking for. Then I do a convolution with our event signature, which is just a very simple filter that has a peak and then a decay, and this gives us a nice clean series in which you would easily be able to pick out the events. So that's what the series looks like without any extra noise. To make the problem more challenging, I'm generating some correlated noise. Essentially this is a Gaussian process, actually, but generating the noise with quite a short length scale, and using this just to produce a noisy process to make things more difficult to see. I'm generating the input data by adding this event series to our noise series, to get this kind of noisy signal. Our eventual goal is to be able to pull out where the events happened, given this noisy input. To get this into Pyro, to do the probabilistic programming approach, the first step is to define the Pyro model; this is the model that Pyro uses to do its inference on. And actually the model that we have here quite closely matches the original data generating process, as you'd expect.
So you're drawing a sample of where these events happened, doing the convolution, and then adding this Matérn-generated noise process, combining those and then using that to get a sample. This is explaining to Pyro how we think our data was generated. A small quirk of stochastic variational inference is that, as well as your model, you also need a guide function. What this does is describe the thing that you're optimizing to try and approximate your final distribution. Here it's just a couple of parameters that explain the event series, and we'll use those in our model so it can do its stochastic variational inference. And then the most important bit: conditioning the model on the input data. We'll use the field that we generated earlier, this events plus noise field, and just use Pyro's conditioning function to do the conditioning step, so that we have our conditioned model. Then, doing the variational inference, we can retrieve these parameters. On the left we have what it thinks was the internal p0, which has been put through the same sigmoid function as we used for the original data. And actually, on the right, it's matching our original event series really quite well: it's definitely detecting signals where we had events, and quite low detection elsewhere. So it does seem that, given this specification of the model, it can do a very good job of detecting where these events are in really quite noisy data. Just to remind you, on the right is what the original data looks like. I mean, I would struggle to tell where these signals are in all of that noise, but it is managing to pull them out. In this case, just to go back to the original point about the data generating model, it's quite easy to see where that comes in, because it's our function called model, and essentially that's the function that describes to the probabilistic programming framework how we think the data was generated. In this case, because it's synthetic data, you might have noticed we're at a bit of an unfair advantage, because we actually know how the data was generated originally. We're getting some extra advantage there, since we have perfect knowledge of what the system would be like, whereas in more scientific tasks we might have good knowledge but not perfect knowledge of what's going on. To try and understand this, I've tried changing the length scale of the correlated noise to be different. On the left hand side we have the correct specification of the correlated noise. In the middle we have a longer length scale: I've actually doubled what the model believes the length scale of the correlated noise is, and you can see that it's still doing a very good job there. Then on the right I've multiplied it by 20, so it thinks that the noise we're adding has a length scale 20 times longer than it is in the actual data generating process. And it's starting to get a lot noisier; we've definitely lost some ability, although it's still not terrible: it is still picking up the events, but it's making more mistakes, thinking there are events where there actually aren't. But this does give some confidence that it's not just working because this is synthetic data; it does seem that even if you have a mis-specified model, you can still have a good chance of pulling out the events that you're looking for.
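To make the pipeline concrete, here is a minimal end-to-end sketch along the same lines; it is not the actual code behind the results shown, and all parameter values are made up. It generates a toy version of the synthetic series (sparse binary blips, a peak-then-decay signature, correlated Gaussian process noise with a squared-exponential kernel standing in for the Matérn noise), defines a simplified Pyro model and guide, conditions on the noisy series with pyro.condition, and runs stochastic variational inference. For brevity the latent event series is treated as a continuous intensity rather than sampled binary events, and the likelihood uses white rather than correlated noise.

```python
import numpy as np
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

rng = np.random.default_rng(0)
T = 400

# synthetic data: sparse binary blips, convolved with a peak-then-decay signature,
# plus correlated (short length scale) Gaussian process noise
true_events = (rng.random(T) < 0.02).astype(float)
signature = np.exp(-np.arange(10) / 3.0)
clean = np.convolve(true_events, signature)[:T]

t = np.arange(T, dtype=float)
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 5.0) ** 2)
noise = np.linalg.cholesky(K + 1e-8 * np.eye(T)) @ rng.standard_normal(T)
observed = torch.tensor(clean + noise, dtype=torch.float)

sig = torch.tensor(signature, dtype=torch.float)

def convolve(p, k):
    # causal 1D convolution of the event series with the signature
    out = torch.nn.functional.conv1d(p.view(1, 1, -1), k.flip(0).view(1, 1, -1),
                                     padding=k.numel() - 1)
    return out.view(-1)[:p.numel()]

def model():
    # prior: events are rare, so a small probability at each time step
    p = pyro.sample("p", dist.Beta(1.0, 20.0).expand([T]).to_event(1))
    pyro.sample("obs", dist.Normal(convolve(p, sig), 1.0).to_event(1))

def guide():
    # variational parameters: one logit per time step ("p0" through a sigmoid)
    logits = pyro.param("p_logits", torch.full((T,), -3.0))
    pyro.sample("p", dist.Delta(torch.sigmoid(logits)).to_event(1))

# condition the model on the noisy input series, then run SVI
conditioned_model = pyro.condition(model, data={"obs": observed})
svi = SVI(conditioned_model, guide, Adam({"lr": 0.02}), loss=Trace_ELBO())
for _ in range(2000):
    svi.step()

inferred = torch.sigmoid(pyro.param("p_logits")).detach()  # should peak near true_events
```

The recovered per-time-step values, the logits pushed through a sigmoid, play the role of the inferred event series shown in the figures.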
Okay, I have no idea how I'm doing on time at all, unfortunately, because I can't see my clock; I think potentially I'm wrapping up a bit early. But just to go back to the beginning, when I was asking why you would want to use probabilistic reasoning in atmospheric science, or in science generally, besides capturing uncertainty. I was saying that what probabilistic reasoning can give you is that you can leverage information by combining your idea of a data generating process with the observations that you have, and this can let you infer some non-obvious information. Sometimes this allows you to tackle tasks which might appear to need a lot of training data, but without using the supervised approach that requires training data. And hopefully the examples have been enough to give you an idea that maybe this can be the case in some situations. There are no silver bullets, and I wouldn't recommend that everybody now use probabilistic reasoning for everything. In particular, it's not really suitable if you don't know how the data was generated; you're not going to get very far trying to use this approach. Slightly more subtly, a limitation of this approach is that even if you believe you know roughly how the data was generated, there are still situations where data driven approaches will have an advantage, because they'll be able to tell you things that are in the data but aren't in your mental model, whereas this approach can't do that: if there's no training data, you'll only get back what was in your mental model. Also, just because you don't need training data, it doesn't necessarily mean that you'll get a faster model. In fact, sometimes the models that you get out of this approach might be slower than what you'd finally get from having done a bunch of training. But you do get some corresponding advantages: you get a nice interpretable model, and potentially quite a good amount of skill in some cases, without this training step. I'll finish up with some open questions. I guess quite an interesting direction is how to combine this with data driven learning. I've mentioned that there is skill you can get in some problems this way, but that data driven learning can have some advantages in being able to improve upon your mental model of the situation, so it could be quite interesting to combine the two. And of course there are many things that somehow sit between the statistical view and the data driven view. Two examples are Bayesian deep learning, which is a whole approach that people use to incorporate uncertainty into deep learning models, and variational autoencoder type approaches, which are somehow related to what goes on in probabilistic programming but using a more deep learning approach. But I think there's potentially more to explore here; I don't think we've necessarily had the last word on how to combine these ideas. For example, a way that this kind of probabilistic reasoning might help with data driven learning is that it can essentially remove all of the aspects of the problem that really don't need any training data, that don't need that kind of learning based approach. If you then have, say, a deep learning model that's just handling the rest, combined with the probabilistic reasoning handling the bits that it can handle, maybe that gives you a best of both worlds thing. But that's all for the future. There are also maybe some interesting questions about the approach of using these sort of mental models, putting them into code, and then using them for inference.
I showed in the first example that it's sometimes possible to use pretty much the same model in different situations where you have different kinds of observations. So are there other ways that it could be useful to share these models between scientists, and can they maybe be used to support learning? I mean, a lot of the time when scientists have models, they're huge simulation based approaches, and while they might be very useful to share for people to run, they're not really a good tool for learning if you're a new scientist and want to understand how the systems work. So could some of these simpler models that are useful for inference also be useful for scientists entering a new area? And also, just: where else can this approach be helpful? I've mentioned super resolution, combining spatial statistics observations, and event detection; are there other situations where there's this kind of sweet spot for probabilistic reasoning? We shall see. Anyway, that's all I have, so I'm happy to take questions.

Fantastic, thanks very much Rachel for that excellent talk, I really enjoyed it. Okay, questions: if anybody wants to ask, I would ask you to please raise your hand, and then you can take it from there. I don't see any so far, so maybe I'll go ahead; I've actually written down a couple of questions. Starting from the end of your talk, I really enjoyed this introduction that you gave to probabilistic programming, and I actually have a very small, maybe technical question on one of the plots that you showed, when you changed the length scale for the inference step. On the last plot that you had, where we had this 20 times change, it seems like it's doing worse at the beginning; there are a couple of very tall spikes at the beginning. Is this some boundary effect that comes into play because you've got this convolution there?

That's an interesting question. I didn't check that, but possibly, because I didn't see anything obvious about the input data that would lead to worse performance at the beginning, so maybe there is something odd going on with the convolution. That's an interesting thing, I should look into that.

Okay, another question that I had concerns the first part of the talk, where you did this Gaussian process super resolution. Maybe this question really just shows my ignorance about the domain science that's at play here, but is there any, let's say, justification or argument that you could give as to why a Gaussian process is a good model to use there? Is that a generic statement, or is there some central limit theorem at work in the background that always makes things turn out Gaussian, or is it specific to this wet bulb temperature data that you looked at, or something else?

So I guess there's probably an argument you can make around temperature being a kind of diffusive field, which could be used to argue that a Gaussian shouldn't be a terrible model. I think there are certainly cases where it wouldn't be the right model: using it for, say, cloud or precipitation, it's not going to be the right model at all, and it would be interesting to extend it to those. So I am looking at what happens if you have a transformed variable: can you still use similar arguments but without this Gaussianity assumption? I think there's potential for that to work. The other way of looking at this, given that I'm assuming it works reasonably well for this wet bulb potential temperature, is to treat it as
a latent variable underlying these other non-Gaussian variables, and sort of leverage the inference on the wet bulb potential temperature to understand what the moisture is doing, and then apply that to the other variables. So I think there are a couple of ways to extend it to non-Gaussian variables, but yeah, that's the future.

That would be nice. And my last question actually also concerns the future a bit. I can talk a bit about particle physics, which I happen to know a bit about, and people have actually demonstrated the use of probabilistic programming, and the inference technique that you mentioned, at the scale of the actual simulators that are being used in production. So this is very large scale stuff. It's not actually being used for production purposes, but it's been demonstrated that it can be made to work at that scale. So what level of impossible are we talking about, if you thought about rolling this out to, you know, a mainstream weather simulation? Is there any chance whatsoever that it could potentially work, even on a part of that simulation?

So I think an interesting thing about weather simulations is that there's also a really well developed field of data assimilation that applies to weather forecast models. In a sense, this all makes complete sense if you apply it to a big numerical weather model in theory; in practice it's going to get very tricky. There are reasons why that's quite hard, mostly to do with the scale, just the number of variables you have across these spatial dimensions and all of the different physical variables, and also some of the non-linearity. So I wouldn't apply probabilistic programming to it, purely because there's a sort of decades-long literature of people developing techniques specifically to handle this exact problem, actually inspired by very similar ideas. In a sense, if I gave this talk to a purely atmospheric science audience, I might argue it the other way: I might say, okay, this kind of probabilistic reasoning is already implemented for big numerical weather simulations, and it's kind of already working there, but maybe there's more that we can do applying it to much smaller scale models, slightly more heuristic or more mental-model types of approaches, where maybe it's underutilized compared to the huge models.

Yeah, I see. Very interesting. Okay, does anybody else have a question? It seems that's not the case, so let me thank you very much again, Rachel, for this great talk. I look forward to seeing you again, and some of you hopefully next week.