It's my pleasure to invite our next plenary speaker, David John Gagne. David John is a machine learning scientist and the head of the Analytics and Integrative Machine Learning group at NCAR here in Boulder. Thanks again, David John, for accepting our invite, and we're looking forward to your talk.

And here we go. So today I'll be talking about comparing and contrasting physics, robustness, and explainable AI for subseasonal-to-seasonal (S2S) forecasting. If you attended the trustworthy AI summer school last week, a lot of the material here will be fairly similar, but I'm going to try to have an S2S focus and highlight how physics, robustness, and explainable AI play key roles, across all of weather and climate but especially in the S2S domain. I also want to give a large shout-out and credit to Maria Molina, who was my co-presenter last week and contributed a lot of the slides on explainable AI.

So why are we interested in trustworthy AI and all these different components of AI for S2S? Based on a lot of research in recent years, S2S forecasting provides a lot of opportunities for AI to potentially improve the prediction of different quantities at longer lead times, especially because our NWP models struggle to produce skillful predictions in the S2S regime: at those ranges, chaos overwhelms the initial-condition signal. There are large-scale forcings that offer predictability, but how well they're captured is still an active area of research. And there is a lot of data available: observations, climate model output, and reanalyses that we can bring to bear on the problem.

But it's also a regime that offers many dangers for effective AI. The biggest problem is that the number of events in our S2S forecasting record is far smaller than the dimensionality of the event data. Especially if you're predicting things on roughly an annual basis, like ENSO phases, there are only a few events in the observed record, so it can be easy to fit a multitude of models, many of which may overfit to that signal. Also, a lot of the interesting signals beyond the dominant ones like ENSO and the MJO, the other teleconnections, are fairly weak and may not have strong predictive power, especially compared with some of the predictors you might use for short- to medium-range weather forecasting. This means that, as tempting as it is to apply giant deep learning models to S2S, and there have certainly been some good examples of how this can work, there's also a lot of potential for overfitting and getting not-so-great forecasts.

So what's the path forward for S2S machine learning forecasting? I'll argue that if we can incorporate physics, robustness, and explainability into system development, we can build trustworthy AI systems for S2S that are effective and robust, and that will hopefully provide improved forecasts and understanding across different weather and climate domains.

There's been a lot of interest in machine learning recently, driven by the promise machine learning has shown in other domains like image recognition or language translation.
In those cases you have big data: giant data sets of millions of examples that you can feed into a giant black-box AI. The reality, though, is that real-world data sets are often very noisy, and the AI model may not be able to distinguish some of the noisier, biased signals from whatever the true signal is. This results in biased, inconsistent predictions and poor model performance, especially when you try to apply the model in a more real-time setting. So to get around this problem, we can't just fit our machine learning model on the data itself; we need to bring in other assumptions and have ways of checking the model to ensure that the predictions are well behaved and work across a wide range of regimes.

The first area we're going to focus on is how to incorporate physics ideas into an AI/machine learning training system. The reason we want to do this is that our machine learning models are built to optimize a fit to whatever data they're given. In doing so, it's often possible to fit many different machine learning models, if you spend enough time tuning them, that achieve similar accuracy on a given data set. This is called the Rashomon effect: the models have very different assumptions, but on average they produce about the same error in their predictions. However, among all these models there's no guarantee that a more accurate model, or any one of them, will obey physical properties better; some of them might, but without any constraints they probably won't. So if we can add some kind of physical guarantees into the training process, we can constrain the set of acceptable models to ones that not only have low error but also conserve mass or energy, or use non-dimensional inputs, or other things like that. In the process we may sometimes even find simpler models.

So how do we actually incorporate physics into machine learning? You can do it throughout the machine learning pipeline. It can be as simple as how you select your input and output features. This is often underappreciated but very important: defining the problem so you have the right units and the right kinds of inputs, thinking about the causal relationships between the inputs and the outputs, and avoiding variables that may be non-causal or that won't be available when running in real time. You can also apply input and output transformations, things as simple as taking a log scale, or dividing by some kind of baseline value so you have a relative input instead of an absolute input. That can make your model more robust in some cases.

You can also tune your machine learning architecture to take advantage of structure in the data. This is where a lot of deep learning gets its power: using things like convolutions, recurrent networks, or graph neural networks can reduce the dimensionality of the connections inside the model and also make for more robust models that can look for spatial or temporal patterns on different scales.

Finally, you can constrain your loss function, the objective at the end of the model. If you have certain outputs that need to conserve mass or energy, you can make that a penalty, either a soft penalty or a hard constraint. You can also compare latent states to make sure that spatial structure is properly enforced. And there are ways to train the model that don't just assume independent, identically distributed examples: you may want to group your examples as time series and forward- and back-propagate through time, which can sometimes result in a more robust model.
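To make the loss-constraint idea concrete, here is a minimal sketch in PyTorch of a soft conservation penalty. The budget it enforces (the predicted terms summing to a prescribed column total) is a made-up illustration, not any particular parameterization's actual constraint, and `lambda_phys` is a hypothetical weighting you would have to tune.

```python
import torch

def physics_informed_loss(pred, target, column_total, lambda_phys=0.1):
    """MSE plus a soft conservation penalty (illustrative only).

    pred, target : (batch, n_outputs) predicted / true budget terms
    column_total : (batch,) total each row's outputs should sum to
    lambda_phys  : weight on the penalty; a tunable hyperparameter
    """
    mse = torch.mean((pred - target) ** 2)
    # residual of the hypothetical budget: how far the predicted terms
    # are from summing to the prescribed total
    residual = pred.sum(dim=1) - column_total
    penalty = torch.mean(residual ** 2)
    # soft constraint: conservation errors are discouraged, not eliminated
    return mse + lambda_phys * penalty
```

Because the penalty only enters the average loss, a trained model will still violate the budget a bit; that's the soft-versus-hard distinction I'll come back to.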
Some examples of how this has been used in practice: one is a climate-invariant machine learning parameterization from Beucler et al. In their case they used a mix of soft constraints on the loss function as well as setting up the model itself so that its inputs are more climate-invariant, by using things like relative humidity instead of mixing ratio, and also enforcing mass and energy conservation to high precision. As a result, their convective parameterization was much more accurate and much closer to the truth, without the large deviations you can otherwise see in the convective heating profile.

Another example is from Yuval et al. They did further tuning of the inputs and outputs and of how they connected the whole system together to make it physically consistent and completely stable. In this case they were also building a convective parameterization, but made sure to predict fluxes instead of absolute values of these quantities. They also diagnosed certain quantities from the things they predict rather than trying to predict everything directly, and as a result ensured all the quantities were consistent with each other. This resulted in a stable model, and they've gotten it working with both regular neural networks and random forests.

In our own group, one project we've been working on is emulating an atmospheric chemistry model. In this case we initially used a regular multilayer perceptron, a fully connected neural network. After a lot of tuning we were able to get a pretty good fit to the time series of our different chemical species, like our precursor gases and aerosols. There's a lot of spread: given random initial weights, we trained an ensemble of neural networks, and the spread captures the signal, but we're not capturing the diurnal pattern in the data. However, if we train with a recurrent neural network, we're able to reduce the uncertainty considerably and capture more of the dynamics. So this is a good argument for looking beyond fully connected models to recurrent models that take advantage of the time series structure to get better dynamics.
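As a rough illustration of that last point, here is a toy recurrent emulator in PyTorch. Everything here (the GRU size, the species vector, the one-step-ahead setup) is a hypothetical stand-in for our actual chemistry emulator, just to show how a recurrent model carries temporal state that a fully connected network has to do without.

```python
import torch
import torch.nn as nn

class ToyChemEmulator(nn.Module):
    """Toy recurrent emulator of a chemical system (illustrative).

    The GRU hidden state carries information across time steps, which
    helps capture dynamics like a diurnal cycle; a fully connected
    network sees each time step in isolation.
    """
    def __init__(self, n_species, n_hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_species, n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, n_species)

    def forward(self, x):
        # x: (batch, time, n_species) concentrations
        h, _ = self.rnn(x)
        return self.head(h)  # next-step concentrations at every step

# usage sketch: one-step-ahead training targets
model = ToyChemEmulator(n_species=8)
x = torch.randn(4, 24, 8)          # 4 sequences, 24 hourly steps
pred = model(x[:, :-1])            # predict steps 1..23 from 0..22
loss = torch.mean((pred - x[:, 1:]) ** 2)
```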
There are a lot of ongoing issues in physics-based ML that still need to be addressed. One is that soft constraints on your loss function do not guarantee conservation, or that the property will be absolutely obeyed. A soft penalty will reduce the errors in mass conservation, but it's still not a guarantee, and if you truly need mass conservation this can be a problem. Also, while physics-based ML does increase robustness, it doesn't necessarily guarantee generalization outside of your training data range; some kinds of physics-based AI do, but with others you'll still run into issues, so this is very problem-dependent. It can also sometimes require more inputs and outputs to the model, which increases the model complexity: if you're trying to conserve mass or energy, you may need to predict all the outputs that go into the conservation equation, instead of just the one you care about. And we've run into a number of cases where the physics-based models or observational data sets themselves contain non-physical artifacts: they don't actually conserve mass or energy, the observations may not capture the right high-frequency information, or the observations may have quality control issues. If you're using sensors from different locations, the observations may be collected at different heights, or you may not have the same sets of variables at every location. All of this can limit how much of this physics-based work you can do, or it requires a lot more preprocessing.

Next we're going to jump into robustness and talk about some of the ways you can ensure your model is more robust. There's obviously a lot of overlap between physics-based AI, explainable AI, and ensuring robustness, but I want to highlight a few things that fall more directly in the robustness category. The problem we often run into is that machine learning models that perform well on a static training and testing data set can fail to perform well in a real-world setting. In some ways this is a kind of transfer learning problem, in that the real world changes over time; it's not static like our data sets. Research questions in this area include: how well does a machine learning model trained on one domain transfer to a similar but different one? The domain shift can be as simple as training on one model and applying to another, training on observations and applying to a model, or train today, test tomorrow, where tomorrow could be very different. As part of this, we can try to measure how well our predictions are handling real-world data by quantifying the model uncertainty, and there are a lot of ways to do that, with varying degrees of accuracy. We also run into the problem that the data itself, especially in real-time settings, may contain adversarial inputs, either accidentally or intentionally adversarial, and those can cause our models to produce very wrong predictions unless we can control for them somehow.

A major robustness problem shows up in what are called data cascades. A lot of machine learning projects look really promising early on, but then, when they reach the model evaluation or model deployment stage, suddenly the system is not performing nearly as well as our initial assessments suggested. This is often a consequence of things like brittleness in interacting with the physical world; not having enough application domain expertise on the development team; and sometimes conflicting reward systems, where what the machine learning team is optimizing is different from what the practitioners who use the model actually care about.
Poor documentation between the developers and the users or domain experts about how the model should and shouldn't be used is another cause, and it can make projects run aground and fail completely in some cases. There's a great paper if you want to learn more about this, called "Everyone wants to do the model work, not the data work," by Sambasivan et al., so I encourage you to check that out.

On a brighter note, there are examples where utilizing multiple data sets can actually lead to better predictions. There's a Ham et al. paper from 2019 that got a lot of attention; it was in Nature, on ENSO prediction. Their big result was that they first trained a convolutional neural network on climate model output and then refined it on reanalysis data. In the process they found that their model, the orange curve up here, outperformed all the comparison models in their study, which I believe were all more physically based, and retained higher correlation skill out to much longer lead times. In some cases it was able to overcome the spring predictability barrier to some extent. That promising result also kicked off a lot more interest in S2S machine learning research.

Another example is from my group. We were interested in different ways of quantifying uncertainty with neural networks. This problem is more of an analysis or assimilation type problem, where we want to estimate ocean mixed layer depth from surface satellite and/or model fields of the ocean. We compared a bunch of different uncertainty quantification methods and found that some of the better ones are those that assume a certain distribution for the predictions, like a Gaussian, and predict the parameters of that distribution. This seemed to work better than sampling methods, which are more flexible in the kinds of distributions they can represent but tend to be more underdispersive. We have an example here comparing a linear model and a convolutional network estimating mixed layer depth anomalies, and while the network's output is not a perfect recreation, it does recreate the broader-scale patterns in the anomaly field quite well.
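To show what that distribution-parameter approach can look like, here is a minimal PyTorch sketch of a network that predicts a Gaussian mean and log-variance and trains with the negative log-likelihood. The layer sizes and names are hypothetical; this is the general parametric-UQ pattern, not our actual mixed layer depth model.

```python
import torch
import torch.nn as nn

class GaussianRegressor(nn.Module):
    """Predict the mean and log-variance of a Gaussian for each example,
    instead of a single point value (a parametric UQ approach)."""
    def __init__(self, n_inputs, n_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        self.mean = nn.Linear(n_hidden, 1)
        self.log_var = nn.Linear(n_hidden, 1)  # log-variance keeps sigma > 0

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, y):
    """Negative log-likelihood of y under N(mean, exp(log_var)),
    dropping the constant term."""
    return torch.mean(0.5 * (log_var + (y - mean) ** 2 / torch.exp(log_var)))
```

The predicted variance then gives a per-example uncertainty estimate that you can check against observed errors, for example with spread-skill diagrams.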
We also run into two other data problems that can limit robustness. One is distribution shift: weather and climate data contain non-stationary processes and artifacts, due to both natural and human causes. These can be instrument issues, like someone relocating a weather station so its climatology changes, or satellite instruments degrading over time and becoming uncalibrated. Numerical models get upgraded on a fairly regular basis, and those upgrades sometimes change the systematic biases or how the model represents, say, temperature or precipitation, so any machine learning model that is bias-correcting them needs new data to learn the new biases. And the ongoing signal of climate change is causing our systems to drift over time, so a model trained only on the past climate is probably not going to handle the future climate very well unless you can account for that somehow.

The other problem is weather observation and forecasting systems receiving noisy or corrupted inputs and outputs. We have things like animals nesting on instruments. The example over here is falsified crowd-sourced weather reports, where someone spoofed the locations of high wind reports to draw the shape of Alabama, and did some other pretty awful things with the data; the mPING crowd-sourced weather report collection system had to be shut down for a while until they could address some of these issues and do better quality control. So for any data set, data processing, quality control, and knowing and visualizing your data are all really important for avoiding and accounting for these kinds of problems.

Finally, we're going to touch on explainable AI. The idea behind explainable AI is that prediction error alone doesn't tell you enough about why the model is making good predictions. There's a bunch of methods that can provide post hoc explainability, where you apply wrapper methods around the model and, by perturbing the inputs, see how they affect the outputs. There are also inherently interpretable models that are relatively simple but can highlight the essential features of a data set.

One example of a post hoc method is the partial dependence plot. This lets you see the sensitivity to an individual input by varying that input and seeing how it affects the average prediction from the model, usually shown as a single curve that tells you where the sensitive ranges are. As an explainability method it's accurate to what the particular model is saying, but similar models trained on the same data may give different results. Here's an example from a surface layer parameterization model: we changed only the activation function and otherwise trained on the exact same data, and we found significant differences in how the models behave, even within the range of the training data, and very different extrapolation patterns outside it, depending on the choice of activation function.

Other ways to examine our models include permutation feature importance, where we permute, or shuffle, an input to take away that field's predictive information and see how much the model error increases. You can do this in fairly robust ways and get rankings of features, so it's popular for that, but it doesn't necessarily tell you why those features are important. So we can then apply things like partial dependence plots on tabular data, or saliency maps and other heat-map-type methods, to see where in an image or time series the key features are; this is an example from a severe storm data set. On the S2S time scale, we can use methods like layer-wise relevance propagation to identify important areas of the globe that contribute to S2S predictions of things like precipitation. These methods all have their own assumptions and sensitivities built in, so it's important to compare multiple methods and see whether they provide similar or very different results. As you can see from this intercomparison by Mamalakis et al., knowing what goes into your method is very important, because it will affect what kind of result you get out of it.
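For the tabular case, both of the methods above fit in a few lines. Here is a rough NumPy sketch, assuming a scikit-learn-style model with a `.predict` method; the function names are my own for illustration, not from any particular XAI library.

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Average prediction as one input is swept over a grid of values,
    with all other inputs left at their observed values."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value       # pin the chosen input everywhere
        curve.append(model.predict(X_mod).mean())
    return np.asarray(curve)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Increase in mean squared error when each input column is
    shuffled, destroying its relationship with the target."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((model.predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle feature j only
            mse = np.mean((model.predict(X_perm) - y) ** 2)
            increases.append(mse - base_mse)
        scores[j] = np.mean(increases)
    return scores                        # bigger = more important
```

Permutation importance tells you which inputs matter; partial dependence tells you how the prediction responds to one of them.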
If you want to learn more about explainable AI and a lot of these different methods, our AI institute recently put together an explainable AI short course led by Ryan Lagerquist. If you go to our AI2ES website, you can find out more about the course and watch the videos and work through the Jupyter notebooks. So with that, I'm happy to take a couple of questions, and if we need to move on I can also answer questions in the chat.

Thanks a lot, David John. There's a great survey there of machine learning methods and applications for S2S as well as weather. Any questions for David John? I had one. Oh, go ahead.

So this is just a comment. When you mentioned non-stationary data input, you used the word adversarial. I just want to point out that non-stationary data input will increase with the deployment of uncrewed observing systems. An example is the Argo floats, and there will be more in the future. So I hope that wouldn't be treated as adversarial in machine learning anymore.

I think uncrewed observing systems offer a lot of opportunities for machine learning, because we can gather a much greater volume of data. For those kinds of systems, knowing the potential for different biases, like different ages of floats or different kinds of drones collecting whatever information you have, having that expert knowledge of how the data is collected and processed, making sure it's processed in a consistent way, and having standards and good data tracking are all very important to make sure that any machine learning system built on top of those observations will perform well. And all the methods I mentioned in this talk should apply to them. We used the Argo floats in the mixed layer depth use case, so we're definitely already finding them useful, but also dealing with the challenges of matching Argo floats to satellite observations and all the processing that goes in between.

Thanks, Chirang, and thanks, David John. Andrea, you had a comment, or rather you posted a link to the S2S AI Challenge, which was recently announced; I think some of us on the call today are also participating in that. Thanks for posting the link, Andrea.

So I had a question, David John. It's related to one of the last slides you showed, the work from Mamalakis, I think, where you look at many different interpretability methods to see if there's agreement in what they find. Oftentimes, I feel like when the methods we use don't agree, that's when we learn something. For instance, in one of the tutorials we had at the summer school, on machine learning for interpreting predictions of two-meter temperature over the US, Will Chapman led the tutorial, and we found that ENSO is one of the main drivers; I think Libby Barnes's work also showed the same, like in the plot you were showing. Learning that isn't new: we already know ENSO is a main driver. What we would like to learn are new modes and new sources of predictability, right? So would you have comments on how explainable AI, even when the signals differ, can help us learn new sources of predictability on S2S time scales?

Yeah, that's a good question. I've been kind of wondering that myself.
Theoretically, with the right collection of heat maps, you should ideally spot new areas, ones that weren't found to be important before. But in my experience so far, most of the time the machine learning models tend to pick up on the most obvious signals, and because those obvious signals are often so strong, the model is liable not to pay much attention to other signals unless you take the really strong ones out. And even then, the weaker signals are weaker by definition, so they're more likely to have a lower signal-to-noise ratio and be harder to pick out.

There's also a problem of confirmation bias when you look at a new heat map: "Of course I can explain this particular heat map; there's some weather phenomenon there that is the cause." But you could generate a random heat map with similar spatial correlation, show it to someone, and it's quite possible they could make up a story for that map too. I don't have the solution to this problem yet, but I suspect it's some form of pre-registering what you already know, or having some way to robustly test the hypothesis that this is where most of the signal should be coming from, so you can account for it, eliminate the already known, and hopefully reveal the novel parts. I will say there are also people trying other kinds of novelty detection techniques, where they compare a baseline model, like a linear model, with the neural network outputs, and where they differ may be the new signal. So there's still a lot of work to be done to robustly find new signals.

Great. Andrea had another question in the chat: can you comment on the use of machine learning for calibration of dynamical forecasts or in-line model bias correction?

Machine learning has been used for this for quite some time, and it often follows the pattern of classic statistical correction techniques like model output statistics. The main advantage of using a machine learning model like a neural network over a bunch of linear regressions is that you may be able to use fewer models to do the bias correction, which can simplify some of the pipelines, because machine learning models can learn regimes implicitly within their structure, especially something like a decision tree or a neural network. That can hopefully result in a more robust bias correction system. There are certainly issues if you have a very limited training set: you're going to struggle to fit a more complex model if you don't have enough data, so that's where things like reforecasts come in really handy. I'm also working with different groups to look at a related question: you've already fit a model to a large data set of output from one model, and then a new model upgrade comes along, or there's another modeling system that's similar but not completely the same. How can you move your trained model over to the new system without having to completely retrain it?
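One common way to approach that, shown here as a rough PyTorch sketch rather than anything we've settled on operationally, is transfer learning: keep the weights learned on the old model's output, freeze the early layers, and fine-tune the last layer on the limited sample from the upgraded system. The network layout here is hypothetical.

```python
import torch
import torch.nn as nn

# hypothetical bias-correction network, already trained on the old model's output
net = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),            # corrected forecast value
)
# ... load the weights trained on the old system here ...

# freeze everything except the final layer
for param in net.parameters():
    param.requires_grad = False
for param in net[-1].parameters():
    param.requires_grad = True

# fine-tune on the (much smaller) sample from the upgraded model,
# with a small learning rate so the old knowledge isn't wiped out
optimizer = torch.optim.Adam(net[-1].parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def fine_tune_step(x_new, y_new):
    optimizer.zero_grad()
    loss = loss_fn(net(x_new), y_new)
    loss.backward()
    optimizer.step()
    return loss.item()
```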
This is a big question for, I think, a lot of the operational weather agencies that may not want to invest all the computing time in constantly retraining a bunch of machine learning models. So again, it's an open area of research, and we're not the only group looking into it, so it's definitely a big open problem.

Thanks again, Andrea, for the question. Thank you, David John, for a great talk and for the discussion afterward as well. Thank you. Thank you, Anish, and thank you everyone for participating.