My name is Matt Fry from the UK Centre for Ecology & Hydrology. I'm hosting this as the third webinar in our series on AI for environmental science, which is supported by NERC and the Constructing a Digital Environment programme, a programme aiming to develop a digitally enabled environment benefiting researchers, policymakers, businesses, communities and individuals alike. The programme has been running since 2019, I think, with the aim of envisaging and developing approaches to creating future digital environments, exploiting advances in technology and increasingly diverse datasets to improve our understanding and management of the environment. It's been doing this through a number of funded projects and a range of other activities, and also, despite the pandemic, by building a community in the area of digital environments through an expert network, a programme of events, and a successful conference last year. And on that note, it's a great pleasure to announce that the NERC Digital Gathering 2023 conference is now open for registrations and for submission of abstracts. The full details are linked from the home page, which has just been posted in the chat. The event is free to attend, and it takes place from July 10th to 11th at the British Antarctic Survey offices in Cambridge. So feel free to book your place now, or look at the range of themes being addressed and submit an abstract describing some of the work you've been doing. Last year's event was a fantastic opportunity. I don't know if that's just because it was the first in-person event I'd attended since the pandemic, but it was a really great chance to get together and talk to people about these areas of digital technologies in the environment.

So, back to the seminar series. This is the seventh of these webinar series that the Constructing a Digital Environment programme has run, and there's been a fantastic range of subjects across them, from environmental sensors, data management, and legal and ethical aspects of the digital environment and decision making, to seminars showcasing some of the projects within the programme. This series is around AI and environmental science, considering the role and opportunities, as well as some of the pitfalls, of AI in environmental science. The format of the webinars is a presentation from a leading expert in the field, followed by a chance for Q&A afterwards. We get a lot of views after the event, so I strongly recommend catching up with previous recordings and subscribing to the YouTube channel, which again I think has just been posted in the chat.

So I'm very excited today to introduce a presentation from Peter Battaglia of Google DeepMind on GraphCast, learning skillful medium-range global weather forecasting. Peter is a research scientist and director at Google DeepMind. His work focuses on approaches for modeling and reasoning about complex systems by combining richly structured knowledge with flexible learning algorithms. In this webinar Peter will present his recent work on GraphCast, a new machine-learning-based weather simulator, talking about how the results from the system represent a key step forward in complementing and improving weather modeling with machine learning, and opening new opportunities for fast and accurate forecasting. So feel free to post any questions; we'll have around 40 minutes of talk, followed by Q&A.
Please post any questions in the Q&A box at the bottom of the Zoom window rather than the chat, and we'll collate those at the end and put them to Peter. And just to confirm, we are recording this. With that, I'll hand over to you, Peter, thank you very much.

Okay, cool. Right, so thanks a lot for inviting me. I really appreciate the opportunity to talk to you all today. Just to start off: I should have time in my talk to take some small clarification questions as we proceed. I'm not sure if the format is going to allow that, but the moderators should feel free to interrupt me with small questions, and I can usher us along if I think it's going to drag out too long and we need to move on to the rest of the content. The other thing I want to start by emphasizing is that I'm presenting this work on behalf of my team; quite a few people have contributed in various ways. Most of us don't have much training in weather or atmospheric physics, or a lot of the topics we're going to cover today, and we've been learning this as we go. So if I use non-standard lingo or say things that make you raise your eyebrows, feel free to educate me; that's how we learn.

All right, so let me get into it. I'll start with a "too long, didn't read" slide, which should capture most of the setup for the talk and what we're doing. The goal of the work I'm going to present was to use machine learning to learn global medium-range weather forecasting. The motivation is that current numerical weather prediction (NWP) relies on complex and expensive hard-coded simulators, and our goal is to learn more accurate and more efficient methods by exploiting available historical data. These animated videos you're seeing are just examples of some of the fields we're modeling: in total we have 227 weather variables across lots of vertical levels, but these are three examples, surface wind, surface temperature, and temperature at 500 hectopascals. The reason we're doing this, and what we hope this work can contribute to downstream, along with the rest of the body of work coming out of the machine learning literature on weather forecasting right now, is to provide faster and better day-to-day weather predictions for everyone, and to improve prediction and planning for natural disasters and extreme events, which are disproportionately impactful and important for human activity. Okay, so that's the top level. The hope is that you'll take away four basic things today: my argument for the motivation for using machine learning to model weather; an overview of how our machine-learning-based GraphCast model works; how GraphCast's forecasts compare to the top operational and ML-based systems; and some limitations and next steps we think we can take. So I want to start by emphasizing that NWP is really a major success story in science and engineering.
There have been decades of research culminating in this point where we have extremely detailed and faithful models of the atmosphere; by numerically solving these model equations on supercomputers, we can get extremely accurate forecasts of what's going to happen many days, even weeks, in advance. NWP scales really well with compute resources: the higher the resolution you can allocate, the better your solutions are going to be. And NWP also scales well with better current observations: if you have more accurate observations, then your data assimilation gives you good initial conditions, and that leads to more accurate forecasts and solutions.

What's the downside? One problem is that traditional NWP methods, the models and the solution methods, are very costly to innovate. It takes experts with significant training to manually design and update the equations and come up with better, more accurate, more efficient numerical solvers. Another problem is that while traditional NWP scales in the ways I mentioned, you can't actually improve the forecast model itself based on historical observations. Just because you have really good observations in big data archives doesn't naturally translate into more accurate forecasts for the future. So it seems like we're leaving things on the table: we've got all this data and observation about weather, but how do we turn that into better forecasts?

ML-based weather prediction has the potential to learn solvers directly from data, which can be more accurate and less costly to develop than traditional NWP. It also has the potential to be more efficient, by learning solution functions at coarser time and space resolutions which are still accurate. The idea is that you can imagine effective-level descriptions of weather phenomena: rather than extremely high-resolution, fine-grained discretizations, you learn the updates at a coarser granularity. One really rough example: when I look at a video of a tropical cyclone moving across the globe, I can anticipate where it's going to go forward in time. I'm not perfect, but I have an idea it's going to carry forward almost as if it's an object with some momentum. That's just not how traditional NWP models cyclones. I use this example to raise the point that there might be other ways to model cyclones or other weather phenomena than taking the Navier-Stokes equations with parameterizations and running them on a supercomputer. What I'll explain today is actually a lot closer to traditional NWP than that; it's not like we're going so far as to model cyclones as objects hundreds of kilometers across and things like that. So that's the motivation.

One other piece of background, and I'm sure you're all familiar with this and probably know more than me: what we're treating as our baseline, the key thing we're aiming to target, is ECMWF's HRES forecast, which is the deterministic forecast within their IFS, the Integrated Forecasting System.
It's considered by many, arguably, to be the most accurate operational global medium-range forecast in the world. As a reminder of HRES's representation: it operates at approximately 0.1 degree latitude/longitude resolution, with one surface level carrying hundreds of fields, many of which are diagnostic fields. And you can get it either on 25 vertical pressure levels with 11 fields, or on 137 vertical model levels with 16 fields; these are transformed between each other, as I understand it. HRES produces four daily forecasts: two of them are 10-day forecasts initialized at 0 and 12 Zulu, and then there are two other forecasts initialized at 6 and 18 that run to 3.75 days. So that's the background on HRES, which is what we're targeting with our methods.

The scientific goal, which I summarized at the top, is to outperform ECMWF's HRES. What we mean by this: we're going to do this globally at quarter-degree resolution rather than 0.1 degree resolution, basically because the data we're going to train on is at quarter degree, which I can explain later. We chose four surface variables and then five atmospheric variables at each of 37 vertical pressure levels. We feel this is a pretty comprehensive and complete representation of the state of the weather, with the exception of precipitation. I won't talk much about precipitation. We do actually model it, but we didn't even bother evaluating it, because we just don't really trust the precipitation data from the ERA5 reanalysis and HRES, so we set it aside and are thinking about it as a separate thing. The variables we do model are temperature at the surface, wind at the surface, and sea level pressure, and then the atmospheric variables are temperature, wind, geopotential, and specific humidity. So that's the spatial representation and the variables, and we model the forecast out to 10 days at six-hour increments, initialized four times daily. You can think about the forecast as this diagram shows: if it's initialized at 1800 Zulu, you take your six-hour step and keep iterating forward until you get to 240 hours, at six-hour increments. Okay, that's the goal, and I'll talk about how we evaluate, what the verification is, a little later.

So, how does our model work? As background, there's been a lot of progress recently in learning simulation with ML. Some of the work my group has done uses what are called graph neural networks (GNNs), a category of deep learning architectures which can operate on graph- or mesh-based representations. We've shown that you can learn to simulate very complex fluids, like the particle-based fluid systems on the left, and on the right we've shown we can model complex meshes, like cloth dynamics and incompressible fluids. So we think these graph-neural-network-based learned simulators are a very appropriate and powerful generic technique for modeling very complex physical dynamics. Before we started working on this weather stuff, we had mostly been training and testing these types of machine learning models
using increasingly sophisticated and detailed simulated data; we hadn't really tried to train these types of models on data from real observations. So the weather work here is one of the first examples where we're really putting these models to the test, in situations where they're trained not on simulated data but on real data. I should also point out that recently there's been a lot of excitement and advance in machine-learning-based weather forecasting. When we started this project last year, we were seeing models starting to become competitive with IFS at one degree and even quarter-degree resolution, usually for just a handful of variables and usually at shorter lead times, like five or seven days. Some of these were based on convolutional neural networks; there's FourCastNet from some of the NVIDIA folks, based on Fourier neural operators. Along the way, as we were doing our own work, we also started to see other papers using graph neural networks, by Ryan Keisler, which was very similar to some of the work we had done (top right there, with the plots), and also transformers, which are a very closely related architecture to graph neural networks; we've seen nice work come out there in the past few months too. So in machine-learning-based weather forecasting, graph neural networks and transformers seem to be the dominant and most powerful architectures right now. With some of the experts and innovators of GNN-based simulators in my group, we went to our trusty graph-neural-network-based simulator and set out to see if we could apply it to the problem of weather forecasting.

So here's what we built, here's what GraphCast is. GraphCast is the name of our model, and it's a learned graph-neural-network-based weather simulator. The way it works is it takes an input weather state, depicted with this image where you've got the globe and this little shell around it. The shell is meant to indicate the latitude/longitude grid, but extending out vertically into the atmosphere, to represent variables everywhere on the surface and up into the atmosphere. In this little blow-up here, the yellow boxes represent surface variables, and the blue boxes, if you squint you'll see they repeat the little color pattern, represent the atmospheric variables at each of the vertical levels. Because the input state is at quarter-degree resolution, it's roughly a million points on the globe; there are 37 vertical levels, and with the surface and atmospheric variables I mentioned, there are 227 variables per location on the globe. GraphCast takes that input and, just like a normal NWP system, predicts the next state, and then you apply that iteratively to roll out a forecast. So you apply this 40 times, at six-hour increments, for a 10-day forecast, which gives you a forecast every six hours (a minimal sketch of this rollout loop follows below).
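To make the iterative rollout concrete, here is a minimal sketch in JAX of rolling out a learned one-step simulator autoregressively. This is an illustration under stated assumptions, not GraphCast's actual code: `predict_delta` is a hypothetical stand-in for the trained network, and the shapes are rounded.

```python
import jax
import jax.numpy as jnp

NUM_POINTS = 1_038_240  # ~1M grid points at 0.25 degrees (721 x 1440)
NUM_VARS = 227          # surface + atmospheric variables per point

def predict_delta(params, state):
    """Hypothetical stand-in for the trained GNN: maps the current
    weather state to the predicted change over the next 6 hours."""
    del params  # a real model would use its learned parameters here
    return jnp.zeros_like(state)  # placeholder dynamics

def rollout(params, initial_state, num_steps=40):
    """Autoregressive forecast: 40 steps of 6 hours = 10 days.
    Each predicted state is fed back in as the next input."""
    def step(state, _):
        next_state = state + predict_delta(params, state)
        return next_state, next_state  # (carry, per-step output)
    _, forecast = jax.lax.scan(step, initial_state, xs=None, length=num_steps)
    return forecast  # shape: (num_steps, NUM_POINTS, NUM_VARS)
```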
Inside GraphCast there are really three components. There's an encoder, which maps the inputs from the latitude/longitude grid up to what we call the multi-mesh internal representation, which I'll expand on in a second. The way you can think about this multi-mesh is that it has both high local resolution and long-range connections, so the model can capture interactions among weather phenomena at longer ranges and larger scales if it wants to, while also modeling high-resolution detail. Once the input is encoded into this multi-mesh representation, which is the blue one (it's very hard to see, because you're overlaying all of these different meshes on top of one another, as I'll show in a second), the processor updates each location. The representation of the weather at some point in space is informed by the weather both locally and far away; these big thick arrows are meant to show that conditions far away can also inform the state of the weather locally. And again, this is like how NWP works: if you have a finite element method or some kind of discretized solver, what it's doing is basically communicating information in local spatial neighborhoods to inform what's going on at a particular point. We run this processor 16 times, applying it iteratively; you can think of these as like sub-steps within a solver. Then the final multi-mesh representation of the weather at each location is decoded: it's projected back down onto the latitude/longitude grid, and the model predicts the change in the weather at that point compared to the input.

About this multi-mesh representation: what we've done is take an icosahedral mesh, the very coarse M0, and refine it iteratively six times. That means we take each triangle, split its edges in half, insert a new mesh point at each midway point, and then connect up all those new midpoints. As you proceed from left to right, you get an increasingly high-resolution mesh, and we take the union of all of those nodes and edges, overlaying them one on top of the other. So you can think of those original M0 nodes as hub nodes that are connected both to very far-away nodes and to their local neighbors, whereas the nodes introduced in later refinement steps are only connected to neighbors at their own or further refinement levels. This takes us from the million-vertex input grid down to roughly 41,000 mesh points; we've reduced the resolution in a way, added lots of edges, and allowed points in the grid to talk to things that are much farther away. A rough sketch of the node and edge counts at each refinement level follows below.
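As an aid for the refinement arithmetic just described, here is a small sketch, plain Python rather than anything from GraphCast itself, that tracks node and edge counts when every triangle edge of an icosahedron is bisected at each level:

```python
def refine_counts(levels=6):
    """Icosahedral refinement: bisecting every edge adds one new
    vertex per edge and splits each triangle into four, so
    V' = V + E,  E' = 2E + 3F,  F' = 4F."""
    v, e, f = 12, 30, 20  # the base icosahedron, M0
    counts = [(0, v, e)]
    for level in range(1, levels + 1):
        v, e, f = v + e, 2 * e + 3 * f, 4 * f
        counts.append((level, v, e))
    return counts

for level, v, e in refine_counts():
    print(f"M{level}: {v:6d} nodes, {e:7d} edges")
# M6 ends up with 40,962 nodes, the "roughly 41,000" above; the
# multi-mesh keeps M6's nodes plus the union of edges from M0..M6.
```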
So that's the architecture of the model, but I haven't told you how it's trained yet. This is a neural network, so the GraphCast box up here has parameters that we train using standard deep learning techniques, and here's how that works. The training dataset was based on ECMWF's ERA5 reanalysis archive, about 40 years of historical global weather data for the earth. ERA5 is actually at hourly increments; we only use six-hour increments. And the reason we chose 37 pressure levels for GraphCast is that ERA5 has 37 pressure levels. The training objective was to minimize the mean squared error between GraphCast's predictions and the ERA5 targets; that's what GraphCast's parameters are optimized to do, to predict ERA5. So we give it an ERA5 input at one time step, and it needs to predict the state at the next time step, six hours ahead. And because it makes predictions for all 227 variables, we weighted the loss in a way that favors variables closer to the surface. On this plot here, the x-axis is the pressure level, where as you move rightwards you move toward the surface of the earth, and you can see that the loss weight is higher there, meaning we penalize the model more for making errors close to the surface than far from it. Part of the motivation is just that the air density is also higher near the surface, so there's more happening there; that's basically the motivation. (A hedged sketch of this weighted objective appears below.)

The final thing about training: the error gradients are handled in a very standard deep learning way; basically every neural network you hear about is trained like this. You backpropagate the error gradient: you compute the error between GraphCast's predictions and the ERA5 targets, take the gradient of that error with respect to the model parameters, and then move the parameters around to try to get the error a little lower, over and over again. That iteratively optimizes the parameters so that at some point GraphCast's predictions start to very accurately predict ERA5.

Also, instead of just training the model to make a one-step prediction, we actually trained it to make a sequence of 12 predictions: we have the model generate a 12-step forecast, which amounts to about three days, take all the error from every step, compute the gradient of that with respect to the parameters, and update. The idea is to encourage the model to be better at making longer-range predictions.

The way we train this is a bit of a detail, but I think it's interesting. There are three pieces. First we ramp up the learning rate, meaning how much we move the parameters around in each update. Then we have this cosine decay phase, in orange: as the model starts to converge to a good, accurate solution, we taper off the learning rate. And then the purple shows that at the very end of training, in the last roughly 10,000 iterations, we switch from training on one step to training on two steps, three steps, four, five, six, up to the 12 I mentioned.
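As a hedged sketch of the training objective and schedule described above: the per-level weights and hyperparameters here are illustrative assumptions, not the published values; only the qualitative shape, more loss weight nearer the surface plus a warmup-then-cosine-decay learning rate, follows the talk.

```python
import jax.numpy as jnp
import optax  # common JAX optimization library

# Illustrative subset of the 37 ERA5 pressure levels, in hPa.
PRESSURE_LEVELS = jnp.array([50.0, 300.0, 500.0, 850.0, 1000.0])

def weighted_mse(predictions, targets):
    """MSE against ERA5 targets, weighted so that levels nearer the
    surface (higher pressure) are penalized more heavily.
    Assumes the trailing axis indexes the pressure levels."""
    weights = PRESSURE_LEVELS / PRESSURE_LEVELS.sum()
    return jnp.mean(((predictions - targets) ** 2) * weights)

# Warmup ramp followed by cosine decay, matching the schedule shape
# in the talk; the step counts and peak value are made-up numbers.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=1e-3,
    warmup_steps=1_000, decay_steps=300_000, end_value=0.0)
optimizer = optax.adamw(learning_rate=schedule)
```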
So there's a curriculum: it's very expensive to train these models, so we start by training on just one step, which is cheaper, and do the 12 steps only at the end.

Okay, there are a few other things I should tell you, which this community especially will hopefully appreciate. We were very, very careful about how we held our evaluation set apart from our training and development phases. We used what we call a causal data split (a sketch of the protocol follows below). Our test data was from 2018, and in later work we've been testing on later years too, though I won't talk about that today. While we were building and training the model, going through lots and lots of rounds of research steps, changing the model, figuring out different tricks and techniques and seeing how they worked, we never looked at the 2018 data. We held that 2018 data out so we could evaluate on it knowing with 100% certainty that nothing in the evaluation set could have informed our choices about how to build the model during the development phase. That's represented here: the blue is all the ERA5 data we used to train the model, and we held out 2016 and 2017 as validation. That's where we trained the model, changed the architecture, trained again, and evaluated and verified on 2016-2017 to decide what the best architecture and training technique was. Once we were happy, we froze this protocol and made no more changes to the model, the parameters, or the architecture; then we retrained the model with data up to 2018 and tested on 2018. So for all the results I'll show you on 2018, there's no way they could have been informed by any of that data; the model couldn't have been meta-optimized to do well on the test set. I should also say this is what one would do in a normal deployment scenario, where you don't see the future, because if you're trying to build a model that's good for the future, the future hasn't happened yet; we tried to be consistent with that.

Training GraphCast takes about three weeks on 32 Cloud TPU v4 devices, which are very powerful deep learning hardware. Once trained, generating a 10-day forecast with GraphCast takes only about one minute on a single Cloud TPU v4 device. By comparison, HRES takes about an hour on ECMWF's roughly 11,000-core HPC cluster, although HRES is higher resolution, at 0.1 degree, and has more model levels than we have pressure levels, so it's not really a direct comparison; it's just a point I want to note.

I see a question in the Q&A: did you consider constant parameters for each grid? I'm not really sure what that means; feel free to clarify. We have only one input grid, and I'm not sure which parameters you're referring to, but clarify and I'll try to answer.
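To spell out that causal split protocol, here is a minimal sketch; the year boundaries follow the talk, and the function itself is purely illustrative:

```python
def causal_split(all_years):
    """Development: train on years up to 2015 and select the
    architecture/training recipe on 2016-2017 validation, never
    touching 2018. Final run: freeze every choice, retrain on all
    data before 2018, then evaluate exactly once on 2018."""
    dev_train   = [y for y in all_years if y <= 2015]
    validation  = [2016, 2017]
    final_train = [y for y in all_years if y <= 2017]
    test        = [2018]
    return dev_train, validation, final_train, test
```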
And that's the training. Here's what we get. On the top you see some forecast videos from HRES, again surface wind, surface temperature, and temperature at 500 hectopascals, and on the bottom are GraphCast's forecasts of the same fields. I put this here to show that the forecasts look like forecasts: in the middle column, the GraphCast forecasts show the diurnal cycle, where the surface temperature changes as the earth changes its orientation with respect to the sun. The rows labeled "error" show the error of HRES with respect to its ground truth and the error of GraphCast with respect to its ground truth, which I'll explain in a second. If you look really closely and kind of spatially integrate with your eyes, you'll notice that GraphCast's error is actually slightly lower than HRES's, which I'll get to in a second. The idea here is that the forecasts look pretty good, and the error is comparable to, or actually a little lower than, HRES.

That was the output after training; now let me tell you how we evaluate, because we had to make some key choices that are important to understand. First, GraphCast and HRES use different initial conditions. HRES uses its own input analysis; GraphCast uses ERA5 as input, since it was trained on it, so it's optimized to take ERA5 and map to ERA5. ERA5 is a reanalysis dataset, and the NWP used to produce it was the HRES version in operation for most of 2016 (I think it's IFS cycle 41r2, as given in the paper), but since 2016 HRES has been upgraded and improved, and we're testing on 2018. So it wouldn't really be fair to use ERA5 as ground truth for HRES, because HRES is not trying to predict ERA5; ideally, it's trying to predict its own input analysis. If we used ERA5 as ground truth for HRES, that would mean HRES's error at a lead time of zero would not be zero, and that doesn't make sense; you wouldn't want a model that can't even predict its own input. So to be fair to HRES, we use as its ground truth what we call HRES-fc0, which is basically its input analysis, downsampled to quarter-degree resolution. Think of HRES as a forecast initialized like this: the orange represents data assimilation, where observations are assimilated to generate an input to HRES, and then HRES rolls out a forecast, which is the blue; at each six hours we pick the state, the blue boxes. Six hours later there's another data assimilation process, that input is given to HRES, HRES rolls out a forecast, and so on. What we do is take these red inputs to HRES and say that's HRES's ground truth: if we want to measure HRES's performance at 12 hours, we compare it to the assimilated state at the corresponding step. So HRES-fc0 is the red boxes: we take the future inputs to HRES and treat those as its ground truth.
One really important thing, which is not in the arXiv paper but which we've updated in a journal submission, and these are the results I'm going to present because we feel it's a better control: ERA5's assimilation windows are 12 hours long, done twice daily, from 21 to 9 and from 9 to 21, while HRES has four daily assimilation windows which are only six hours long. What this means is that ERA5's 0 and 12 initializations incorporate information from up to nine hours ahead, while HRES's only incorporate information from three hours ahead. So what we decided to do, and again this is different from the arXiv paper, is to only evaluate on the 06 and 18 initializations. In the schematic here, the red is the look-ahead from each of the initializations; the 06 and 18 initializations for ERA5 in the top row and HRES in the bottom row both have the same look-ahead. We thought this was the most fair way to compare them. If we used ERA5 initialized at 0 or 12, we'd be giving the model information informed by nine hours into the future, while HRES would only be informed by three hours into the future. The only problem is that the HRES forecasts initialized at 06 and 18 are only 3.75 days long, so in the results I'm about to show, from four days onward we do end up going back to HRES's 0 and 12 initializations. We feel this is very conservative, because it turns out HRES is slightly more accurate on those initializations, so our results will look slightly worse: they're still better than HRES's, but you'll see a slight decrement in the relative difference in a second. Similarly, we only evaluate on targets at the 06 and 18 validity times, for the same reason: we only want to evaluate at validity times informed by the same amount of look-ahead in the assimilation window.

Okay, so here are the results. This is an example of how we do the analysis. What I've got here are three curves showing skill based on RMSE, the skill score, which is the relative RMSE, and ACC as well. The x-axis is lead time, out to 10 days at the 12-hour increments I noted; the y-axis is the skill or skill score. The black curves are always HRES and the blue curves are always GraphCast. RMSE is an error score, so lower numbers are better; for ACC, higher numbers are better. What you can see is that on RMSE, GraphCast is always more accurate than HRES over the full 10-day window. The vertical dashed line marks the last point where we use HRES's 06 and 18 initializations, and you can see what I was talking about: there's a little jump, because HRES gets a little better on its 0 and 12 initializations while we're still using the 06 and 18 initializations for GraphCast. The skill score is the key thing to look at here (a sketch of these metrics follows below).
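For readers who want the three metrics pinned down, here is a minimal sketch; real verification also weights grid cells by latitude and averages over many initializations, which this omits:

```python
import jax.numpy as jnp

def rmse(forecast, truth):
    """Root-mean-square error: lower is better."""
    return jnp.sqrt(jnp.mean((forecast - truth) ** 2))

def skill_score(rmse_model, rmse_baseline):
    """Relative RMSE against a baseline (HRES here): negative means
    the model beats the baseline, e.g. -0.10 is 10% lower RMSE."""
    return (rmse_model - rmse_baseline) / rmse_baseline

def acc(forecast, truth, climatology):
    """Anomaly correlation coefficient: correlation of departures
    from climatology; higher is better."""
    fa, ta = forecast - climatology, truth - climatology
    return jnp.sum(fa * ta) / jnp.sqrt(jnp.sum(fa ** 2) * jnp.sum(ta ** 2))
```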
It's basically showing that GraphCast is better; lower is better here, and we're somewhere around 10%, tapering to maybe 8% over lead time, better than HRES. Here's how we summarize our results: we take the RMSE skill scores and make an analog of the ECMWF scorecard that they use to compare their own models against each other. Blue indicates a better skill score for GraphCast, red a better skill score for HRES. In these plots the x-axis is again lead time, and the y-axis is now the vertical pressure level; each of these large rectangular boxes is a single variable. You can see that on all fields except the 50 hectopascal pressure level, which it turns out is in the stratosphere, GraphCast has the better skill score. We've now done some post-hoc analysis and found that we see a lot more error in the stratosphere specifically, and we think this might not actually be a problem with our model but perhaps with the stratosphere data. So basically, GraphCast always has a better skill score except at the very highest vertical levels in the atmosphere. These are some of the surface variables, and except for surface temperature at the first 12-hour lead time, I think we're always better than HRES; overall we're better than HRES on 90.3% of these 1,380 targets.

This I'll go through quickly: we also outperform all previous ML baselines. The best of them is Huawei's Pangu-Weather; across all the targets they reported, GraphCast outperforms it on 99% of those targets. These are some examples: geopotential, specific humidity, surface temperature.

One more thing I wanted to show is the effect of autoregressive training. I made a big deal earlier about how we train the model with not just one step but 12 autoregressive steps, and here you can see why; this was one of the decisions we made during model development. I'll call your attention to the bottom row, the skill scores with respect to GraphCast 12AR, which is the model I just showed you, trained with 12 autoregressive steps; it's our main model. The lighter blue curves are the skill scores of other models trained with fewer autoregressive steps, say four instead of twelve. There's a pretty systematic pattern: across lead times, the relative performance of models trained with fewer autoregressive steps is not as good. We get better and better as we train on more autoregressive steps, and we'd probably get even better if we trained on longer rollouts. We felt 12 was good enough and we were happy with that choice. There's also something I'm not going to get into, but we can discuss if anyone's interested: we do notice that the model blurs, we see spatial blurring, which is to be expected when you have an RMSE training objective.
What's probably happening is that as you train on longer rollouts you get better skill but also blurrier forecasts, which is a known way to game the RMSE metric; again, we can talk about that if anyone's interested, but we felt 12 was a good choice, and it shows that autoregressive training has a really big impact.

The last thing I'll show from the results before wrapping up: here are some big pictures of specific humidity, which is one of the highest-frequency fields we model. This is specific humidity; I don't know which level, I think it might be 700 hectopascals, but I don't think I labeled it here, sorry. This is from November 21, 2018, the 12 Zulu initialization, and these are, actually I'm sorry, I think these are two days, six days, and ten days into the forecast. You can see you still have a lot of high-resolution detail. I can get into the details: we've since done a lot of analysis on the spectra, and all kinds of optimal-blurring comparisons, to make sure we're not giving HRES an unfair disadvantage, and our results still hold up just fine. But you can see that we do preserve a lot of the structure, even at some depth into the forecast, in specific humidity, which I think has the highest-frequency structure.

The last thing is some limitations, and then I'll switch to questions; I'll go through these real quick. One big limitation is that we're lower resolution than HRES: 0.1 degree horizontal for HRES versus our quarter degree. This choice was mostly due to ERA5 being at quarter-degree resolution, and to the fact that operating at 0.1 degree would take about six times more memory; our 10-day forecasts are already 35 gigabytes, and multiplying that by six takes you over 200 gigabytes, which starts to really stress even the highest-end hardware right now. And we're not running this on specialized hardware, just standard deep learning hardware. The MSE training objective, as I just mentioned, encourages spatial blurring. One way to think about this is in terms of predictive uncertainty: GraphCast expresses uncertainty by blurring. There's a really nice ECMWF blog post on this from January this year, introducing the fractions skill score metric; I should put the link here, and I definitely recommend you check it out if you're interested. I feel the way to adapt to or overcome this is to use perturbed ensembles, probabilistic predictions, which would allow the model to express its uncertainty in different ways and preserve the high-frequency structure. I haven't shown any performance on downstream applications; we're working on this, and we actually have pretty nice results that we're not quite ready to show, on cyclone tracking, atmospheric rivers, and extreme heat forecasting, so stay tuned. As for longer-range forecasts: these are next steps; there's no reason we can't make longer-range forecasts, the data is there, it's just something we haven't gotten around to yet, but we're interested in it.
And then using real observations: this is a really obvious one. We're always using an assimilated observation analysis; why not use real observations as well? Some of the reasons are that it's a completely different kind of data: not a dense, standardized format but sparse, and it's also hard to deal with observations whose coverage changes over time. But this could be very helpful for ML-based forecasting.

Okay, so my conclusions. ML-based weather prediction is now competitive with top operational systems: it's accurate and efficient compared to HRES, and we outperform all the ML baselines we've seen so far. And going back to my initial points about the advantages we're hoping ML can offer: we can capitalize on rich historical observations using these learned simulation methods, and this avoids the need for costly manual design of equations and solvers; I don't know much about atmospheric physics, but because I can get the data and I know machine learning, I can still build these models. And that's it, thanks very much.

Thank you very much, Peter, that was fascinating. A really interesting talk, with a great level of detail as well as a fantastic overview. We've got some questions in already, thanks for those. I'm going to be cheeky and jump in with one of my own interests. You're training and testing against the ERA5 data, so are you basically saying there's a reliance on there being good-quality NWP anyway, and that to improve this you're going to need improvements in NWP? Do you see that going forward?

Yeah, so I definitely don't want to give the impression that we don't need NWP anymore. I actually really like that, in some sense, this changes the dependence on NWP: it's saying we need NWP, especially for the data assimilation, but maybe not quite as much for the forecasting. That said, we're using ERA5 now; at first I thought you were going to ask why we didn't train on HRES, which we're definitely thinking about. The thing with HRES is that it's constantly being upgraded, so it's not really one dataset, and before 2016 it wasn't even at 0.1 degree resolution. One question that does get raised is: can you use these types of MLWP models for data assimilation, or could you bootstrap off the models and get better observations that way? We haven't really thought much about this; it's something we think and talk about over beers or whatever. And then, to the last point I made about using raw observations: some approaches might try to sidestep the need for data assimilation entirely by doing observation-to-observation prediction. But again, observations are hard to get for many reasons, and harder to use, so I think that's the kind of thing that will take time to work out.
One thing I'll add is what I kind of hope happens: as these ML-based approaches advance, maybe, if we don't need to invest as much in the forecast models because we have ML systems, we can invest more in observations. So I'm hoping we get even better meteorological observations, because that should just feed the ML models. I'll stop there.

Great, thanks very much. I'm going to run through some of the questions online. So, you mentioned, well, it asks: can this approach be applied to seasonal forecasting?

Yeah, there's no reason in principle that we can't do that; it's something to test, for seasonal and sub-seasonal. There are questions about the Lyapunov time of the actual weather dynamics, and how precise the observations would have to be to actually get much benefit at longer ranges. But just to answer the question: there's no principled reason why we can't. Our model can roll out a year-long forecast if we want; we just haven't done this because we haven't been evaluating it, but we could.

Okay. There are a couple around the laws of physics, how well the model obeys them; is it constrained to conserve things like total water?

It's incentivized to through its training, but there are no guarantees. We haven't looked at this, and I'd love to talk about it more if you have ideas about key conserved quantities you'd want to measure; I'd be really interested in that. For some of these idealized turbulent systems, it's very clear there are certain things you can measure very carefully. But here you've got radiative forcing and all this kind of stuff, so there's a lot that's either not conserved or not easy to measure well enough to know whether it's conserved. But if you have ideas of what to do, I'm all ears.

Great, thanks. And there's one about the highest altitude the model goes to, and a related one: you showed where it outperformed; how high does its skill hold when predicting targets in the upper atmosphere, and are there specific reasons for this beyond the weighted loss?

The highest field the model has is actually at one hectopascal, and I don't even know where that is. What I showed, the highest level we evaluated on, was 50 hectopascals. The tropopause, which separates the troposphere from the stratosphere, is not uniform over the globe, but my understanding is that it's typically at roughly 100 to 250 hectopascals. I don't know how many kilometers that is, you probably know better than me, I just think in pressure levels, but basically the very top of what we're evaluating is in the stratosphere. One thing I should also add, which I didn't mention before:
we do weight the loss as a function of vertical level, so up at these very highest levels less than a percent of the total loss is being applied. So the model really isn't being incentivized at all to make good predictions at one hectopascal, partly because people don't really evaluate there; it doesn't show up in the ECMWF scorecard, and it doesn't seem like something people really think about.

Thanks. There's one about the blurring, and there's quite a lot of interest in that aspect: does that tendency to blur inhibit forecasting of rapidly evolving, smaller-scale cyclones, for example, which may be the most damaging? And I want to follow up on that: are there other techniques you could introduce to help with generating the kinds of fine-scale structure you imagine?

So this is a great question. First of all, my understanding is that 10-day cyclone forecasts aren't as much of a thing, because cyclones often don't even last 10 days. But in the five-day horizon, remember, we see much less blurring; the blurring grows over time, so at shorter leads there's less of it. But there's a really important question here, which is: is it better to have a blurry forecast with better RMSE, or a sharp forecast that's kind of wrong? The answer that perturbed ensembles give is that it's better to have a distribution of sharp forecasts, where some are wrong but some are right. And I think that's a really good answer; if you really want to model the posterior over the future given the present, that's the best way we have to do it numerically. But this gets into another question of how downstream we should be thinking about our applications; maybe we should be optimizing the model to support the downstream application directly. So maybe you start with GraphCast and then append another machine learning model on the end, or even fine-tune GraphCast itself, so it supports the tasks you care about. I'm not an expert in cyclones, and I don't know how important the smaller-scale, high-resolution structure is or what role it plays, but I'd be really excited to chat more about this and understand the problem, because maybe there are ways we can support these use cases without having to generally improve the resolution or the frequency distribution of GraphCast. And maybe one more point on other techniques: sure, you could do things like penalize errors in the high frequencies more heavily. But I really think the appropriate approach is to use probabilistic methods, so the model can give sharp forecasts and get credit if the distribution of those sharp forecasts is faithful, in the sense that it provides high likelihood for the actual observations.

Great, thank you very much. A couple more questions. Can you compare GraphCast against the same ground truth as HRES? Is it the resolution difference?
Well, we have to downsample HRES to make that comparison, or learn an upsampler for GraphCast, and we've done versions of all of these things in preliminary work. If you take GraphCast just trained on ERA5, it's pretty good, but you see a definite decrement in performance: if you compare GraphCast against HRES's own analysis as ground truth, the relative improvement of GraphCast over HRES slips a lot, and in many fields and levels, especially at short lead times, GraphCast is worse than HRES. But what you can do is fine-tune GraphCast on HRES, so after pre-training it on ERA5 you later train it further to do better against HRES, and that improves things a lot. Again, this is preliminary work we're starting to explore, so I can't really say anything firm about whether the performance will end up the same, but I don't think this is going to be the issue. It should be kind of unsurprising, because ERA5 is just based on a slightly older HRES NWP.

And one about why precipitation isn't one of your evaluated variables.

My understanding, and I'm not an expert in this, concerns the quality of the ECMWF precipitation data. Remember that about half of our team, or a large fraction, were involved in DeepMind's previous precipitation nowcasting work with the UK Met Office, which was a Nature paper about two years ago, so those people are experts in precipitation; I'm not. The general consensus among our group, and I think among other folks, is that the ECMWF precipitation analysis is not that good; ECMWF folks have sort of indicated that this is the best you can do, but still. Precipitation is generally regarded as being disproportionately poor in ERA5 and in HRES. I think that was the main reason. It is funny: we weren't modeling it at one point, then we decided to just throw it into the model, and we've done almost no evaluation on it, partly just for time. We probably should, but I'm not sure what conclusion we would draw: even if it looks good, we don't really believe it, and if it looks bad, we don't really care.

This one's a thought of mine: could you summarize what it is about GraphCast's structure that gets more from the data than you can get from NWP, or from the other machine learning approaches? Is it integrating the coarse, long-distance spatial information with the high resolution?

We've done very preliminary work to study the effective receptive fields, basically: if you're trying to make a prediction at a particular point, what part of the globe around you is informing that? That sort of gets at your question.
And what we do find is that, first of all, it's asymmetric, tilted toward the upwind direction: the advection is driving it, right, the weather from over there is being carried over to here, and the model seems to know that. Beyond that, it's hard to say. What I would say is that when you look at the equations of many of these fluid solvers, which we've worked with quite a bit over the years, we know that neural networks have higher expressive capacity: these equations are sophisticated, but the functions neural networks can learn are far more sophisticated; their capacity to learn complex functions is far greater. And at the end of the day, NWP methods are making errors; they're not modeling the weather perfectly, and some of that error is not strictly due to uncertainty in the initial conditions. So it must be that these models are imperfect: we know we'd need cloud-resolving models and things like that, and even at HRES's native 0.1 degree resolution, it's still orders of magnitude away from the resolution the physics tells us we need to really get it right. So I think that's the answer: you've got your dynamical core and your physical parameterizations, and the parameterizations are just not correcting the errors introduced by the granularity of the core well enough. Again, weather forecasting really is, among the fields we've looked at, a really good example of science and engineering coming together to make incredibly accurate forecasts, but the weather is even more complicated than that.

Great. And then the final one: will the data be publicly available? And I'd also add, is the code likely to be available?

Yeah, that's our plan. For the forecast data, we're doing a bit of a staged rollout right now. DeepMind and Google Brain have now merged, as you may have heard, but until then DeepMind had operated somewhat independently from Google in certain ways, so right now we're rolling out our forecasts to Google groups to get feedback, and increasingly to ECMWF as well. The plan is to put all the forecasts online soon. Since the arXiv paper was published, we've done a huge amount of verification analysis, months of it; we just submitted a journal paper with about 100 pages of appendix that's basically further verification. That's also what led us to cut the number of targets in half and change the initializations we evaluate on, because we're trying to be really conservative.
Basically, we felt we wanted to really understand our forecasts first and make sure that what we put out is something we understand. But yeah, the plan hopefully is within a month or a few months, and the more you bother me with emails, the more I'll probably push to get it done.

That's great. Well, just to wrap up today: thanks again, that was fantastic, really fascinating, and it's great to have had you with us; thanks for your time. To summarize for the audience: thanks for attending. Again, please subscribe to the YouTube channel, the link is back in the chat, and note the NERC digital environment conference, please have a look. The next webinar will be held on the 2nd of June at 11 o'clock, presented by Tom August of UKCEH, who always gives a fantastic and interesting talk with loads of great content and images, on biodiversity monitoring. So please make a note in your diaries and book your place via the link. With that, thanks very much for attending the session, and thanks again to the speaker.