Good morning everybody. Welcome to this webinar. My name is Matt Fry from the UK Centre for Ecology and Hydrology. I've been hosting this series of webinars in the NERC Constructing a Digital Environment Programme. I think this is the seventh series of webinars, and this is the seventh and final webinar in the series, which is on AI and environmental science. As I said, it's part of the Constructing a Digital Environment Programme, a programme aimed at developing a digitally enabled environment benefiting researchers, policymakers, businesses, communities and individuals alike. It's been running since 2019 and has had a series of projects with the aim of envisaging and developing approaches to creating the future digital environment, exploiting advances in technology and increasingly diverse data sets to improve our understanding and management of the environment. It runs an expert network and has held a series of events, including a very successful conference very recently, which is worth pointing out. There's an event coming up in a few weeks at the Royal Society in London on the 5th of September, and registration is open now. It's called Digital: the Key to Unlocking Environmental Challenges, and I believe this is the finale event of the programme, so it will be summarizing lots of the projects and work that have been undertaken as part of the programme. Like I said, this is the seventh series of webinars the programme has run; we invite a presentation from a leading expert in the field, followed by a chance for some Q&A. Today, the final one in the series is from Tom Andersson of the British Antarctic Survey, who is going to be talking about tackling diverse environmental prediction tasks with neural processes. Tom is a machine learning research scientist at the British Antarctic Survey AI Lab, where he researches and develops machine learning systems for monitoring and adapting to climate change. His work currently focuses on the application and implementation of neural processes in environmental sciences, which he's going to explain a bit about today. Tom has used uncertainty quantification, interpretability and active learning methods to build decision support tools, and his previous work includes IceNet, a sea ice forecasting AI system. And with that, I'll hand over to Tom. Thanks very much. Fantastic. Thank you, Matt. So yeah, I'm assuming you're all seeing my screen. Thanks so much for the introduction and for the opportunity to present today. I've been following the webinar series and it's been amazing, so the previous speakers have set a high bar and I'll do my best to try and meet it. I'm going to be talking about a fairly new class of machine learning models called neural processes, and about how these models have a range of flexible modeling capabilities that enable them to tackle a range of different environmental prediction problems, or prediction tasks. I just want to start off by acknowledging that I don't exist in a vacuum, but rather in an atmosphere that I share with many other people. Some of these people I have the pleasure and privilege of collaborating with, such as the people on the screen here, and a lot of the work that I'll be sharing in this webinar was informed by working with them. Indeed, I'll be sharing some direct research from some of the people on the screen as well.
And we hail from a range of different institutions, including the British Antarctic Survey, or BAS, the University of Cambridge, Microsoft Research, and also the Alan Turing Institute, who fund my research. Okay, so in terms of the outline of the talk, I'm going to start off at a high level, thinking about the state of play of AI in environmental science, just to set the scene a bit. We're then going to look at the challenges of modeling environmental observations and how convolutional neural processes can address some of those challenges. I'm then going to present an array of experiments on different environmental prediction tasks that use ConvNPs, tackling different kinds of data fusion problems. I'm then going to introduce, for the first time publicly, a new open-source software package that I've been developing called DeepSensor, for modeling environmental data with neural processes. And then we'll wrap up with some closing thoughts, and if you have any questions I can take them then. Cool. So, starting off, as I said, setting the scene a bit. I really wanted to highlight how much deep learning and machine learning has been seeping into the environmental sciences. This chart shows the number of journal articles with the phrase "deep learning" either in the title or the abstract. Although deep learning started kicking off in the early 2010s, it took until around 2016 or 2017 before it started being deployed for the first time in environmental science papers. Now, I've been in the game for a while, I'd say: I joined BAS in 2019. I promise I'm not as vain as putting this photo of myself here makes me look, but at the time it felt like deep learning and machine learning methods had a lot of potential. There was a lot of excitement, but there was also a lot of uncertainty and, dare I say it, a healthy dose of skepticism about how these methods could be used. It was the stuff of hushed conversations in the corridor, to put it in a slightly melodramatic way. And since then, we're now here; if we project the rate of paper publications out to the end of the year, we're going to be around up here in terms of the number of papers. On a field-wise view, it feels like machine learning and AI has moved from the fringes of environmental science to what now feels like the frontiers, with the most advanced technology having some really exciting examples of use in the science. At conferences now there are enormous volumes of posters deploying different machine learning and deep learning methods. I was at AGU, the world's largest Earth sciences conference, in Chicago in December, and if I had to guess, it felt like 5% or 10% of talks and posters had some kind of machine learning element to them. And of course now we have webinar series like this one as well. I want to highlight one specific area that feels like it's having its eureka moment right now. I think this is the most exciting first example of deep learning achieving its ideal application and use case in environmental sciences, and that's weather forecasting. I don't know if everyone's aware, but there's been a bit of a revolution in machine-learning-based weather forecasting in the last year alone. I'll just run through it: on Christmas Eve, December 2022,
Google DeepMind gifted us with GraphCast, a graph neural network based model for medium-range global weather forecasting. I think it took everyone in the field by surprise. The method outperformed the leading physics-based system, the state-of-the-art IFS system from ECMWF (the European Centre for Medium-Range Weather Forecasts), across a whole range of different variables and pressure levels. Then Microsoft entered the game a month later with their ClimaX model, trying to build a foundation model for weather and climate. This takes the idea of a foundation model from large language modeling, the idea behind things like ChatGPT, where you have a huge base model trained on an abundance of simulation data and other types of data, which can then be fine-tuned and deployed on downstream tasks. And then, just running through the last ones, there's now been a flurry of increasingly rapid papers one after the other: a group of academics from Shanghai published the FengWu paper, which looked very promising; a different Google DeepMind team released their MetNet-3 model for shorter-range forecasting; and Huawei published their Pangu-Weather model in Nature, using 3D neural networks and a transformer-type architecture. All this has led to ECMWF themselves releasing a preprint just a couple of weeks ago, "The rise of data-driven weather forecasting", in which they declare that a new numerical weather prediction paradigm is emerging, relying on inference from machine learning models. So I just want to highlight that this is really exciting; this is an exciting time to be in the game. I'm not going to be talking about weather forecasting in this talk, but I wanted to set the scene and show where things are going a little bit. Okay. Most of the approaches in those papers I just showed are trained on reanalysis data: not direct observations of the environment, but rather state estimates that come from numerical simulators that try to fit as closely as possible to observations, producing a nice, complete, gridded product. Now that's all well and good, and these state estimates are known to be the most accurate guess we have of what the Earth is doing across space. But what if our data isn't on a nice neat grid? Take real-world environmental observations, for example. What if we have missing data, like this land surface temperature satellite observation from MODIS over Antarctica, which has cloud-cover gaps? Or what about irregularly sampled data, such as in situ temperature stations across the Antarctic continent? If we were back in 2019, when only the most basic methods were infiltrating the field, we might say: well, this data is gridded, let's chuck it into a CNN, a convolutional neural network, which is well suited to image data. There's an issue with this, which is that a CNN can't handle NaNs. So what do we put in for the missing values? We can't just put minus one, because the temperature could be minus one degrees Celsius, right? So we need a different approach there. Then, thinking of the temperature stations, maybe we'd model those with a Gaussian process; that would be a reasonable choice to make. You'd fit a spatial 2D Gaussian process to these temperature observations and try to fill the gaps, or something like that.
Well, GPs have some issues: they can't ingest multiple data streams in a simple way that I'm aware of, so we wouldn't be able to condition on this land surface temperature either. They're also pretty computationally expensive when implemented in a naive way, and not very expressive, which can be an issue when we have an abundance of data that we want to learn from. Alternatively, we could try chucking this into a multi-layer perceptron, just a kind of vanilla neural network that operates on vectors. So we'll put this station as station one and chuck it in there; put this station as station two, chuck it in that element there; take this station, which is recording a balmy negative 88.2 degrees Celsius, and put that in the third element; and so on, until we've put all of these station observations into a vector. Okay, fair enough, you could do that, but that's also not what we want, because this model also can't handle NaNs if one of these stations breaks on a certain day. We're also not going to be able to leverage new stations: if a new station comes online, well, this model only operates on a fixed-dimensional vector, so it can't leverage the new station. And it depends on the exact ordering of stations that we chose: the one we chose to label station one, the one we chose to label station two, and so on. So if we give our model weights to a friend and say, here you go, there's the data, try and model it, and they choose a different ordering, then the model is going to be in a completely different part of input space that it's never seen before, and it's going to break. That was a bit labored, but it's motivation for why the most basic kinds of machine learning approaches aren't going to work with environmental data. So I'm going to cut to the chase and say that in order to model this data appropriately, we need to model these observations as sets: sets of observations where the ordering doesn't matter, and where we can have arbitrary-length sequences of observations. Okay, so now we're finally going to have the convolutional neural process come onto the stage, and I'm going to try to set the theoretical, mathematical, notational background here. We're going to call our observational data sets "context sets", C, and we're going to have a range of different context sets, like you saw on the previous slide; let's index them with the index i. Each context set is going to be a set of location-observation pairs. The first entry is the pair of the first location x, which could be latitude and longitude, together with all of the observations y associated with that point, which could be temperature; we could have multiple observations at this specific location as well. Then we have the next pair, and so on, until we've reached the end of all the observations for this context set. You can think of these as our input data streams. Then we're also going to have target sets, where we want to predict; we'll label these T. We'll have a set of locations that we want to predict at, these x_t's, and we may or may not have observations there.
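For reference, here is one way to write the notation just described in symbols; this is a sketch consistent with the neural process literature, and the exact indexing conventions on the slides may differ:

```latex
\begin{aligned}
C_i &= \big\{ \big(x^{(i)}_j,\; y^{(i)}_j\big) \big\}_{j=1}^{N_i},
  \qquad C = \{ C_1, \ldots, C_{N_C} \}, \\
T &= \big\{ \big(x^{(t)}_j,\; y^{(t)}_j\big) \big\}_{j=1}^{N_T},
  \quad \text{with the } y^{(t)}_j \text{ available only during training.}
\end{aligned}
```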
At prediction time the target values are unknown, but during training we might also have observations there that we can use; by the way, this is meant to be a y here on the slide. Okay, one of the final bits of nomenclature is that we're going to call the combination of all these context sets and target sets a "task". We want to build a model that can take in this task and predict the target data based on the context data. So now we're finally at the convolutional neural process. This is definitely the most complicated slide, so I'll try and go through it fairly slowly. The convolutional neural process outputs a distribution over the target observation values, conditioned on the target locations we want to predict at, as well as all of our context sets put together, which we can just call C. I'm going to show an example where it outputs a Gaussian distribution, so it's a normal distribution over the target values. In a convolutional neural process, the mean and variance at all of these values is a function parameterized by a neural network, in this case a convolutional neural network; that's where the "conv" term comes from. These mean and variance functions take in the target locations we want to predict at, as well as an encoding of all of our context data. And this encoding is constructed in such a way that it doesn't care about the ordering of observations; in other words, it's permutation invariant. And because we're using a CNN, the whole model is also translation equivariant with respect to the context data. Cool, that's most of the complicated stuff out of the way; apologies if I lost you for a minute. We're going to see lots of cool examples where you will hopefully build more intuition for the model. I'll add a disclaimer here that this is the conditional neural process variant, where we're not modeling correlations between target variables; we're modeling them as independent. This is also a variant with a Gaussian output distribution, but we could parameterize any distribution with a neural process. I'm just going to use the term ConvNP in a variant- and distribution-agnostic way, to refer to the superset of all possible models you can construct in this way. Okay. I really want to highlight how the encoder works, because I think it's really cool, and it builds a lot of intuition. The encoder in ConvNPs that we typically use is called a SetConv. Let's say we have two context sets: station observations at these black crosses, and some gridded data set containing auxiliary information like elevation and land mask. When we input those to a SetConv, out the other side we get a 3D tensor. We have density channels for each context set, which say where we have observations: you can see the sparse station locations here, and then a flat channel because this was a gridded variable. And then we have data channels, which are the actual values that were observed. The reason we need the density channel is that if you observe a zero, there's going to be nothing in the data channel, so we need to tell the model that there was an observation there.
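As a rough illustration of the density- and data-channel idea just described, here is a minimal 1D numpy sketch of encoding off-grid observations onto an internal grid with Gaussian basis functions. This is a hypothetical helper, not the actual implementation used in the talk; the length scale, grid and normalization choices are all made up:

```python
import numpy as np

def set_conv_encode(x_obs, y_obs, x_grid, length_scale=0.1):
    """Encode off-grid observations (x_obs, y_obs) onto an internal 1D
    grid x_grid as a density channel plus a data channel.

    Toy 1D sketch: real models use 2D internal grids, one pair of
    channels per context set.
    """
    # Gaussian basis function between each grid point and each observation
    dists = x_grid[:, None] - x_obs[None, :]             # (n_grid, n_obs)
    weights = np.exp(-0.5 * (dists / length_scale) ** 2)

    density = weights.sum(axis=1)                        # where data lives
    data = weights @ y_obs                               # weighted values
    # Normalize the data channel by the density where there is support
    data = np.where(density > 0, data / np.maximum(density, 1e-8), 0.0)
    return np.stack([density, data])                     # (2, n_grid)

# Example: three station-like observations encoded onto 100 grid points
x_obs = np.array([0.2, 0.5, 0.9])
y_obs = np.array([1.0, -0.5, 2.0])
x_grid = np.linspace(0.0, 1.0, 100)
encoding = set_conv_encode(x_obs, y_obs, x_grid)
print(encoding.shape)  # (2, 100): density channel + data channel
```

In 2D the same idea applies per context set, which is what gives the stacked density and data channels shown on the slide.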
What then happens is that we process all of our context sets with a SetConv and input the result to a U-Net, just a type of convolutional neural network, and then interpolate the outputs of the U-Net at our target locations. Optionally, we can then inject some extra data with an output multi-layer perceptron; that's going to appear in some of our models, as it's a variant we've worked with sometimes. In any case, we then parameterize the distribution over the targets at the target locations. And the parameters of our distribution, like the mean and variance for example, now depend on the context data itself, the actual observations that were observed. Because of this, this is called meta-learning: it's like we've trained one model, but it can then "train", in inverted commas, when you give it new data, learning how to set the parameters of the distribution. I'm not going to lean too heavily on that interpretation, but I thought I'd mention it. Now, this allows a bunch of key modeling capabilities that help us with spatiotemporal modeling. We can fuse multiple gridded data streams. We can fuse off-the-grid data with on-the-grid data. We can handle multiple resolutions, by passing them through a SetConv and interpolating them onto an internal grid. We can handle missing data, because of the density channels. We can predict at arbitrary target locations, so we're not fixed to predicting only at the locations we trained on. And because we're parameterizing distributions, we're quantifying our uncertainty. In most of the variants of ConvNPs that you will see, the inference cost is linear in the number of context points and the number of target points, so this type of model is very computationally cheap to run. Cool. So that's the ConvNP. I'm going to pause just for a moment, because that's maybe a lot to take in.
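To recap the model just described in symbols: a sketch of the conditional, Gaussian variant, where E(C) denotes the SetConv encoding of the context sets and theta the CNN parameters (notation assumed, consistent with the ConvCNP literature):

```latex
p\big(y_T \mid X_T, C\big)
  = \prod_{j=1}^{N_T}
    \mathcal{N}\!\Big( y^{(t)}_j \;\Big|\;
      \mu_\theta\big(x^{(t)}_j,\, E(C)\big),\;
      \sigma^2_\theta\big(x^{(t)}_j,\, E(C)\big) \Big)
```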
Now we're going to move on to an array of different experiments using this model. I'm going to start off with, well, actually the first example of using convolutional neural processes in environmental sciences. Let's start from a high level: we're going to be thinking about statistical downscaling here, to spoil it a little. Let's say I give you some gridded reanalysis data like this. I also give you some very high resolution auxiliary background information; this is Germany, by the way. And then also a bunch of in situ stations measuring hyper-local environmental conditions. How can we combine these three data sets in order to do something useful, to produce some kind of useful environmental predictions? It's an interesting challenge: we've got different modalities and different resolutions. The way this has been done in the literature is an approach led by my collaborator Anna Vaughan, who published this paper in 2022; the stuff on the screen here is basically a very stripped-down version of her paper, so go and check out the paper if you're keen for more. In this downscaling approach we have our input reanalysis data; in this case I've really coarsened the ERA5 data, to about 120 kilometers in resolution, to exemplify the approach. I'm also going to input some medium-resolution elevation data at eight kilometers, still over ten times finer resolution than the ERA5 data, but not as fine as it gets. We're going to chuck this into a convolutional neural process. And as I showed before, there's that option of injecting new information right before the prediction, and that's what we do in downscaling: we inject 500-meter-resolution auxiliary data, which is static topographic information, just before the prediction is made. And we train the model to predict what temperature stations observed over Germany. In Anna Vaughan's paper, she showed that this architecture can outperform a whole suite of different statistical downscaling methods; so yes, a very impressive and robust architecture. Here are some training results: I trained this model on 2006 to 2017 data and then validated on 2018. It's quite lazy, as you can see; I probably could have kept training, but I got impatient. So now let's look at a prediction on a test date that the model has never seen before. Again, we pass in our ERA5 data, and when we make the model predict at very fine resolution, we get this nice detailed map of predictions, which picks out all of the high-frequency variation in the auxiliary data. And of course we get a mean and also a standard deviation capturing our uncertainty. If we expand a little region here, we see a bit more of the spatial scale of detail in the valleys and mountains and whatnot. And if I zoom in to the site of an actual station here, what I'm plotting on this time series is the ConvNP's prediction in blue, the actual station observation in green, and the ERA5 grid-cell average at that location in orange. The model was trained on this station before; I have to caveat that I was a bit lazy and didn't hold this station out. But it hasn't seen this date before, this two-month period of data. And it's striking how it's non-trivially predicting what the station observes, always capturing the true value within its 95% confidence interval. I say non-trivial because it's not just offsetting the ERA5 value by a constant amount: you see here it agrees with ERA5, and here it offsets it by quite a wide margin. So that's very interesting. Now, one caveat with this approach is that the gridded data we use needs to correlate sufficiently with our station observations; you need a sufficiently strong relationship. If it does, we can train a model and then deploy it on simulation data that runs out to 2100, for example, and get local-scale downscaled predictions. I'm not a downscaling person, so I find that cool but it's not what I work on day to day; still, I think that's a very neat aspect of what you can do with downscaling, looking at local climate change impacts. That was just an aside comment. Another caveat concerns the auxiliary data: the availability of this data is really important, and it needs to reveal the station microclimate. In the case of temperature, elevation is a pretty strong predictor, but what about urban density, what about fields, or the density of trees and whatnot, which could also affect things? In terms of the stations, we need sufficient coverage in this auxiliary space so the model can learn how to leverage these auxiliary variables to predict what would actually be observed.
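To make the "inject high-resolution auxiliary data just before the prediction" step concrete, here is a minimal PyTorch sketch of such a decoder head. This is an illustration under assumptions: the class name, layer sizes and the softplus link are all made up, not the actual implementation from Vaughan et al.:

```python
import torch
import torch.nn as nn

class AuxMLPDecoder(nn.Module):
    """Toy decoder head that fuses U-Net features, interpolated at the
    target locations, with high-resolution auxiliary data just before
    the prediction. Hypothetical names and sizes."""

    def __init__(self, n_unet_feats: int, n_aux_feats: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_unet_feats + n_aux_feats, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs: mean and pre-softplus std
        )

    def forward(self, unet_feats: torch.Tensor, aux_feats: torch.Tensor):
        # unet_feats: (n_targets, n_unet_feats), U-Net output interpolated
        #             at the target locations
        # aux_feats:  (n_targets, n_aux_feats), e.g. 500 m topography
        out = self.mlp(torch.cat([unet_feats, aux_feats], dim=-1))
        mean = out[..., 0]
        std = nn.functional.softplus(out[..., 1])  # keep the std positive
        return mean, std

# Example: 10 target points, 32 U-Net channels, 4 auxiliary channels
decoder = AuxMLPDecoder(n_unet_feats=32, n_aux_feats=4)
mean, std = decoder(torch.randn(10, 32), torch.randn(10, 4))
```

Because the auxiliary data only enters through this pointwise head, the model can be queried at any resolution the auxiliary data supports.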
So, as I mentioned, if we're missing auxiliary channels then we might not be doing the best job that we could do. Here's an example from one of my colleagues that didn't work amazingly in our initial attempts. This was trying to downscale soil moisture over the UK: taking in ERA5 soil moisture, taking in topographic data, and trying to predict station observations; I've overlaid the station locations and then the model prediction. You can see from the black crosses that the stations are quite sparse, and they're mostly concentrated in the south of Britain, not so much in the highlands, in the topographically varying areas. And the spatial patterns just don't appear that realistic. So it didn't work that well in this case, but I don't think all hope is lost. It's an open question whether we could do transfer learning from different regions: in the US, for example, I believe there's a greater volume and density of soil moisture stations, so it would be interesting to transfer across space. We could also try multi-task learning: if we're trying to downscale different environmental variables, some of which might have a greater abundance of observations, then the model might be able to learn a more generic representation of the environment. Say we're trying to downscale precipitation at the same time; there might be some shared features that would be useful for this task. So those are open questions to explore. Now, this is a slight red herring, but pretty much yesterday I ran this experiment, and I don't quite know what to call it; "high-resolution station interpolation" is the best name I could come up with. In this setup we have stations on the context side, and our slightly coarse-scale elevation data as well. The station context points are shown as black circles; I've been lazy, so both of these plots show both the context and target scatter points. On the target side we're trying to predict station observations as well, and, as before, we inject high-resolution information just before the prediction. So in this case it's not downscaling a gridded variable; it's interpolating other station observations while also leveraging very high resolution auxiliary information. When we look at a prediction on a test date and condition the model on 20 randomly located real station observations, you can see the model's prediction for the temperature across space here, and the model's uncertainty. There are all sorts of interesting high-frequency features going on, but we also have a probabilistic model here that's actually conditioning on what was observed to inform the prediction. As we add more and more observations (I think this is 100 observations here), the model updates its estimate of what's going on. And I think it's interesting to see that, well, you can't actually see it here because the color bar goes down lower, but the uncertainty never goes exactly to zero. That's because station data has noise on it, and we might not be capturing all the auxiliary information that we need for this task, so the model isn't going to fit all of the observations exactly. But you can see how the station observations are being used to inform the bigger picture of what the temperature is in the surrounding area.
Now, in an extreme case, giving the model all 500 or so German temperature stations, you can see it can build up quite a detailed picture across the country, with quite low uncertainty everywhere apart from the high-altitude regions. You can see it's now realizing that it's quite a chilly day across the country: around four degrees to the north and around minus four degrees in the higher-altitude alpine regions to the south. I feel a bit like a weather presenter now. Anyway, if we zoom into this region in the Alps, we can exemplify the high-frequency information we're seeing; I've chopped the color bar off at zero so that it doesn't saturate. But we can see lots of fine-scale information, and the most uncertain regions are definitely at the peaks of the mountains here. So that was a bit of a random experiment that might be useful for future work. I'm now going to switch tack towards sensor placement. This is some work that I've recently been doing using active learning; I'm going to introduce active learning and then the research project that we did. Okay, so what is active learning? You need two ingredients. You need a probabilistic model that produces a distribution over the target values given some data; that's ingredient one. Ingredient two is an acquisition function that you define: this acquisition function, alpha of x, says how useful a new observation at point x would be for the model. You combine these two ingredients and iteratively propose new observation locations in a so-called greedy algorithm, which optimizes one step at a time, producing a vector of proposed placement locations for new observations. Here's a little toy animation of this in progress. We have a probabilistic model in green being fit to this hillscape-like, sinusoidal curve, and it's iteratively proposing new observation locations in red, given these initial black observations, these black circles here. You can see how it gets iteratively closer and closer to the true function it's trying to predict, while also avoiding the existing observation locations (there's a minimal code sketch of this greedy loop below). That's a very toy example, but we're going to use basically the same philosophy in this project. So this is a project that I led that has very recently been published in Environmental Data Science, as in just yesterday for those watching online right now. And I believe, to the best of my knowledge, that this is the first study to look into the active learning capabilities of a convolutional neural process type model. What we're trying to do is use active learning to work out where we need to go and place new temperature observations over Antarctica; we're going to find the most informative locations. I'm adding QR codes for the paper here; please go and give it some love, given it just came out yesterday. I haven't even announced it on Twitter yet, so you are the first to hear about it. I'll also mention that I'm not going to go into too much detail here, because there is a 15-minute conference talk that I gave specifically on this study if you're interested in more, but I will show some results here.
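Here is a minimal, self-contained sketch of the greedy active learning loop just described. Everything here is an illustrative stand-in: the `model` interface, the candidate grid and the simple variance-based acquisition function are made up, not a real package API:

```python
import numpy as np

def greedy_placements(model, x_candidates, n_new):
    """Greedily propose n_new observation locations.

    `model` is a stand-in for any probabilistic model exposing:
      - predict_mean(x): predictive mean at locations x
      - predict_var(x): predictive variance at locations x
      - condition_on(x, y): a new model conditioned on an observation
    This interface is hypothetical, for illustration only.
    """
    placements = []
    for _ in range(n_new):
        # Acquisition alpha(x): here simply the predictive variance,
        # i.e. "propose the site where the model is most uncertain"
        alpha = model.predict_var(x_candidates)
        best = x_candidates[np.argmax(alpha)]
        placements.append(best)
        # Condition on a hypothetical observation at the chosen site
        # (e.g. the model's own mean prediction) and repeat greedily
        model = model.condition_on(best, model.predict_mean(best))
    return np.array(placements)
```

The real experiments use the predictive uncertainty of a trained neural process in place of this toy acquisition function, as described next.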
The model we used is actually a convolutional Gaussian neural process, a ConvGNP. This variant of a ConvNP outputs a joint Gaussian distribution over the target points: not a bunch of independent distributions, but a correlated Gaussian. So we output a mean function, but also a covariance function, or, when you sample at target locations, a covariance matrix. This means we can draw coherent samples of our unknown target variable, conditioned on our observations. We take in temperature on the left and also a gridded auxiliary context set, in order to allow the model to break translation equivariance with respect to the temperature stations. This allows the model to learn non-stationary covariance, but that's a technical detail we don't need to go into now. Okay, we have a bit of a chicken-and-egg problem with sensor placement and these models. The issue is that to train a model to learn where we need new observations, we need a lot of data, to train a big, flexible model that will be really accurate at finding great locations for new observations. But if we have lots of observations to do that, do we really need new observations in the first place? That was the head-scratcher for me, because Antarctica is not known for its abundance of observations. The way I got around this in our project was to train the model on reanalysis data. We train the ConvGNP to spatially interpolate ERA5 daily-average two-meter temperature anomalies, like the animation you see on the right. So what do we mean by spatially interpolate? We randomly sample grid cells to be the context data on the left-hand side, we randomly sample grid cells to be the target points on the right-hand side, and then we train the ConvGNP to maximize the probability of the target points. It's pretty simple, I would say. And this operates on a day-by-day basis: the model only sees observations from a given day and tries to predict what's going on elsewhere for that same day, just to clarify. Now, when we train this model and run a sensor placement algorithm, we get what you see here: similar to the hillscape, but a much larger-scale, more complex problem. On the right side, for good measure, we have a vanilla Gaussian process baseline model. What I'm showing as the heat map is the acquisition function that we use, which is the predicted drop in variance when the model sees a new observation at that location; in other words, how much does this observation decrease the model's uncertainty? We could stare at this all day, but I'm going to cut to the chase and show you some results. When we look at the root mean squared error of predictions across space versus the number of sensors added to the model at these proposed locations (ERA5 "sensors" in this case), and compare the ConvGNP's performance with the Gaussian process model's performance, you can see the ConvGNP starts off with better RMSE, and despite that it is able to reduce its error faster, and at the end of the day it finds better sensor placements, in inverted commas, according to these results. The paper, with the QR code again in the top right, has much more detailed experiments with different probabilistic metrics and also a non-stationary Gaussian process baseline.
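For reference, one way to write the "predicted drop in variance" acquisition function just described, where y-hat* is an imputed value at the candidate site x* (a sketch; the paper's exact definition may differ):

```latex
\alpha(x^{*}) \;=\;
  \sum_{j=1}^{N_T} \sigma^2_\theta\big(x^{(t)}_j \mid C\big)
  \;-\;
  \sum_{j=1}^{N_T} \sigma^2_\theta\big(x^{(t)}_j \mid C \cup \{(x^{*}, \hat{y}^{*})\}\big)
```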
And GPs, by the way, have been used in the field of environmental sciences for decades under the term kriging, so this is a very classical geostatistics approach; but I think we are starting to find more advanced methods that can do more interesting things. Okay, a limitation of this approach is that it's trained on reanalysis, not observations, so it's not measuring the informativeness of real sensors: these are idealized, zero-noise observations, because reanalysis doesn't have any noise or uncertainty on it. We've been exploring tackling this. Myself and Professor Rich Turner have been working with a master's student, Jonas Scholz, who has done a fantastic job looking at alleviating this using sim-to-real for temperature interpolation in Germany. In this case, the "sim" phase is ERA5 pre-training, and then we transfer the model to the "real" phase, where we fine-tune on real station observations. So we randomly sample ERA5 observations during pre-training, and then randomly split the station data into context and target during fine-tuning. The idea is that the model hopefully leverages all the information contained in the ERA5 data set, but also learns interesting new characteristics from the station data, like the noise on the data and whatnot. If we look at the model's uncertainty when it's just pre-trained on ERA5, where yellow means very certain and confident and darker means more uncertain, what we see as we fine-tune on more and more stations is that the model becomes increasingly uncertain around station locations; we see this by the yellow blobs disappearing. And that's a good thing, because a station doesn't tell you exactly what's going on in the surrounding area for tens of kilometers, so we don't want the model to pass exactly through the observations. It's behaving as we would hope. We also ran the sensor placement approach that I showed you earlier, but with the fine-tuned model: ERA5 on the left and fine-tuned on the right. You can see that the acquisition function becomes a bit more diffuse. I won't go into the details, but it does seem to be doing something reasonable, and this is now measuring the informativeness of real station observations. What we're finding on held-out results is that pre-training on ERA5 does help with predicting observational station data, as opposed to not pre-training at all, but it becomes less important and useful to pre-train on ERA5 as more and more observational data becomes available, which is intuitive. Another cool experiment, from collaborator Paolo Pelucchi, who did a secondment with the Alan Turing Institute over the summer, is looking into aerosol sensor placement, with a target variable of black carbon, specifically black carbon aerosol optical depth. Black carbon has a highly uncertain effect on warming and is quite sparsely observed, so it's quite an important variable to get better measurements of. He used the same kind of reanalysis training phase, using the CAMS reanalysis (we haven't looked into fine-tuning just yet), and used the AERONET sensor network to initialize the context set for the sensor placement. Comparing the reanalysis with the model's prediction after initializing the context set with observations from the sensor network,
it's making reasonable predictions for the large-scale aerosol phenomena. And here are the results of the sensor placement algorithm. There are a fair few in Russia, which isn't a great place for sensor placements at the moment, but there are others on the North African coastline and some in Norway and whatnot. So this is another exciting application of this work. Okay, what if I told you that all of the experiments you've just seen in this talk used the same Python package? I apologize for the slightly crass meme, but I do share a name with Neo from The Matrix, so I guess I kind of had to. But yes, everything you've seen used the same Python package to run the experiments, and now we're going to look at it. This is the Python package that I've been developing, called DeepSensor, and this slide is an information-dump, high-level overview of the design philosophy, which I'll walk you through. DeepSensor is a Python package for modeling environmental data with neural processes. You can pip install it now, and there's a QR code up there to the GitHub repository. But I do want to highlight a user warning: this is a work in progress, so if you're interested in using it, it's probably best to contact me first. That might not be true in a few months' time; it might be in a more stable place then. Looking at the design at a high level, everything above this line lives in xarray and pandas world: the user only sees xarray and pandas data, which we all know and love; for example, gridded data here and time series data here. That goes into a DataProcessor, which normalizes and standardizes the data. We then have a TaskLoader, which generates tasks for training a neural process model. And then we have this ConvNP model class, which inherits from a base class, ProbabilisticModel, which is itself subclassed by DeepSensorModel. These classes live partially in the world of tensors, so TensorFlow, PyTorch and NumPy, which the user doesn't have to interact with if they don't want to. The ConvNP is implemented using a fantastic Python package called neuralprocesses, authored by my collaborator Wessel Bruinsma. It's a fantastic package, really flexible, with a really nice design philosophy, but it does live in tensor world, which can be a bit inconvenient when all of your data has the rich metadata and functionality of xarray and pandas. So what I've tried to do is combine those two worlds. The ConvNP can directly output unnormalized gridded predictions, or off-the-grid predictions, and then we also have our active learning functionality as well. Another thing I'd like to highlight is that the package is designed so that the models are extensible: you could put your own model here and leverage all the same nice functionality. That's still to be tried and tested, but I'm very keen for that feature to work, because the ConvNP isn't necessarily the be-all and end-all, and I would like this package to be able to be updated with the latest models. So my hope for this package is that it lowers the barrier to entry, both for environmental scientists and machine learners.
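To give a flavor of the workflow just described, here is a hedged sketch of what using DeepSensor looks like, based on the design above. The package was a work in progress at the time of this talk, so treat the exact import paths, method names, arguments and file paths as assumptions and check the GitHub repository for the current interface:

```python
import xarray as xr
import pandas as pd

import deepsensor.torch  # choose the PyTorch backend (assumed)
from deepsensor.data import DataProcessor, TaskLoader
from deepsensor.model import ConvNP

# Hypothetical raw inputs: a gridded xarray Dataset (e.g. reanalysis
# temperature) and a pandas DataFrame of station observations
era5_raw = xr.open_dataset("era5_temperature.nc")  # placeholder path
station_raw = pd.read_csv("stations.csv", index_col=["time", "lat", "lon"])

# Normalize/standardize the data, mapping into model coordinates
data_processor = DataProcessor(x1_name="lat", x2_name="lon")
era5 = data_processor(era5_raw)
stations = data_processor(station_raw)

# Generate tasks: reanalysis as a context set, stations as the target set
task_loader = TaskLoader(context=era5, target=stations)
task = task_loader("2018-06-01", context_sampling="all", target_sampling="all")

# Construct a ConvNP and make probabilistic predictions at the stations
model = ConvNP(data_processor, task_loader)
pred = model.predict(task, X_t=station_raw)
```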
So I hope that it will galvanize research progress, building a community around this open-source software and leading to a positive feedback loop where the software improves research and research improves the software. The blue-sky thinking is that this could hopefully become a leading piece of software for the latest environmental ML paradigms. I'll push on to the conclusions, but first, don't take my word for it that DeepSensor is worth checking out; check out this testimonial from a comment on one of our GitHub issues, from Zeel Patel: "DeepSensor has an easy-to-use interface similar to scikit-learn, and its seamless integration with xarray saves a lot of time and energy in pre-processing and post-processing the data." So thank you, Zeel, for the kind comments. As I said, some closing thoughts, but I'm running towards the end of my time, so I'm just going to zip through this. I'd say we're entering a new phase of environmental machine learning, with more and more versatile modeling capabilities. Convolutional neural processes are one such kind of model, and they can tackle a range of prediction tasks, such as those you've seen and more: forecasting, filling in missing satellite data, and so on. DeepSensor is a package that you can use if you want. And I just want to clarify that ConvNPs are not a panacea; this is an active area of research, and I'm sure we're going to discover limitations as well as improvements over time. There are also other novel machine learning architectures out there, like graph neural networks and transformers, which have a lot of exciting potential too. One challenge is that these models have to learn how to condition on data, so they start from square one, and this can make them quite data-hungry. You can't give the model an entirely new data stream that bears no resemblance to anything it was trained on and expect it to produce good predictions conditioned on that data. We had this chicken-and-egg problem with sensor placement, but I showed how we could try to get around it using pre-training. One remaining challenge is how to scale to high dimensions: adding depth or altitude as a dimension, for example, or hundreds or thousands of observational variables; that would be a very interesting challenge going forward. And then, finally, future work: transfer learning, as I mentioned earlier, from densely to sparsely monitored regions; and filling in missing satellite data, such as this preliminary work from Darwin et al. (I think that should maybe say 2022), quite a cool paper trying this out to fill in scan lines. And sensor trajectories, rather than fixed placements, would be a cool thing to explore. Okay, here's a neural process timeline for those who are watching this on YouTube and want to pause and have a look at some references. Thanks so much for your time and attention; I've got my two QR codes for the things I want to promote here, and feel free to reach out via my contact details. Thanks. Fantastic. Thanks very much, Tom. That was really interesting, and really good to have a talk that does a deep dive into the method; not too deep, I'd say, but it gives a really nice description of the pros and cons.
Yeah, it outlines the method and how it deals with the different modalities and different missing data and stuff, and really shows why it's a beneficial method in environmental science, where missing data in particular is quite a big issue. So we've got a few questions in the chat; I'll try to take them in an order that makes sense. There's some deep digging into the methods: when would you choose to use a ConvNP versus a ConvGNP? Can you explain the difference again? Yeah, I'll try and be quick. So wait, this isn't the right slide; this one. Yeah, so a ConvGNP outputs a joint distribution: you get an n-dimensional Gaussian vector with a covariance matrix, whereas a ConvCNP, the conditional neural process which was the model in most of the experiments I've shown you, just produces independent distributions at each location. You should use a ConvGNP if you care about spatial patterns. For example, you can run that model in an active learning way to minimize uncertainty about spatial patterns, by minimizing the overall uncertainty of this big joint distribution over predictions. Whereas if you only care about the predictions at pointwise locations, and you don't care about how they're related to create patterns, then a ConvCNP is fine, and probably a better idea, because it's going to be more aligned with what you care about. Yeah. Thanks. While you're on that slide, one of the other questions is: why do you need the U-Net? Well, it's a very tried-and-tested deep learning architecture. This component injects a huge amount of flexibility into the model: it means the model has a lot of capability to learn the arbitrary mean functions and covariance functions that get spat out on the output side of this model. And a U-Net goes from gridded inputs to gridded predictions, so it operates on this internal grid of the ConvGNP that I talk about more in the appendix of our paper. You need to go from grid to grid, so you need a deep learning architecture that goes from on-the-grid to on-the-grid, which you can then interpolate to your target locations. Hopefully that makes sense. Thanks. And on the density input, which is on another slide: is that just a binary map showing where there's data and where there's no data? Good question. It's not binary, actually: we place a tiny Gaussian blob at each observation location, which makes a function across space made up of these blobs at the observation locations, and we then discretize it at the internal grid locations of the model, which gives us a tensor. You could do this in other ways, right? You could have this internal grid and do nearest-neighbor binning, like a binary mask, as you say; well, you'd need to account for the case where there are multiple observations in one internal discretization grid cell. But yeah, the Gaussian blob approach isn't the only way of doing this. That's great, thanks. So, probably the last one on the technical side of things: you mentioned updating, which sounds kind of Bayesian. Are ConvNPs linked to a Bayesian approach of updating priors? Yeah, that's an interesting technical question.
When you train a neural process, if we assume that our data is generated from some probabilistic model, like say a Gaussian process, then training the neural process trains the model to target the predictive distribution, or rather the predictive stochastic process, that you get when you condition on observations. I might not have explained that very well. But essentially, it's not doing Bayesian conditioning in the sense where you have your prior, that's all the information your model has, you condition it on observations, and then you have this built-in Bayesian update step using Bayes' rule. A ConvNP doesn't use Bayes' rule; it kind of breaks Bayes' rule, and that might make Bayesian people turn in their graves. But I think it makes more sense when you know that your distribution is misspecified. We know our data isn't generated from a GP, so our model is much more flexible when it's trying to target an arbitrary Gaussian process predictive, like in the case of the ConvGNP; it's not just a fixed prior that you then condition using Bayes' rule. However, the model does learn a prior, because if you just don't give the model any data, then what it outputs is a prior. As you start conditioning on more observations, it will start to deviate from what you would get if you were conditioning that prior using Bayes' rule, because it's going through a neural network. Essentially, there's more discussion of that on the last page of our 30-page appendix in the sensor placement paper, if you're interested. It's good to point people to those again; maybe stick those QR codes up again, just in case people didn't get a chance to look. So, away from the technical side, people are asking: what type of users would you like or hope to see using your DeepSensor package, which looks amazing by the way? Or in other words, in which domains do you see potential for crossover of these methods? That's a nice question, thanks for asking. I think these models are really flexible and can be used in a wide range of different environmental prediction problems and domains. As I said, these models are quite data-hungry, so I'd say there's a lot of promise in using this approach and DeepSensor if you have spatiotemporal data which is maybe quite messy, but of which you have quite large volumes in some sense of the word. In terms of users that I'm interested in, I'm very open-minded. I'm happy to discuss a research idea and maybe have some collaboration, using DeepSensor as a gateway. I'd say there are lots of open questions as to when this approach works and when it doesn't, and what I want is to build up a community around this package that we can use to save everyone's time and really understand when these models shine and when they may unfortunately fall down. So I'm very open in terms of different kinds of users. That's great, thanks Tom. So maybe as a final comment and question: I think these sorts of packages are going to be really useful for the wider uptake of these types of methods, for testing them and then uptake.
What would you say are the challenges of maintaining that type of package, or that sort of infrastructure around this type of modeling, and do you have any thoughts on how that could be better supported? For example, I know the Software Sustainability Institute does a little bit of work on scientific package maintenance, but have you got any thoughts on that? Yeah, really good question. I think you need a whole community involved. You need people with different kinds of expertise: people from the more research end of the spectrum, who can drive the engine of usefulness in the software, and also people on the software development side of things, getting the thing to actually work and be extensible and maintainable. I'm increasingly moving to somewhere in the middle, as a kind of research-engineering type of science person. I'm still learning, and we haven't reached version 1.0 of this thing yet, so the interface could change. But I think it could do with a review process; if I had the time, I'd love to submit this to a peer-reviewed software journal and just benefit from that peer review process. And I would love to, and am making some steps towards, formalizing getting more people involved and committed, because if you want maintainers to be involved, you need them to be working on the code base on at least a semi-regular basis. The Alan Turing Institute is the UK's national research centre for data science and AI, and I think there's going to be a lot of potential for collaborations there. Fantastic, well, thanks very much, Tom; we'll wrap it up on that. So thanks very much again for the talk, it was really interesting, and thanks everyone for attending. Like I said, this is the final one in this webinar series, so thanks if you came to some of them, and check them out on the YouTube channel, like I said before, including the DeepMind talk on GraphCast from earlier in the series. And yeah, thanks very much for coming along, and hopefully we'll see some of you at the digital environment event on the 5th of September. Thanks for having me.