Well, thank you for coming. This is going to be, I think, a good follow-up on Evan's nice talk this morning. We had a really nice discussion last year here at our first meeting of the AI and ML initiative within the cyberinfrastructure group. So as a follow-up, we decided to bring together some knowledgeable folks who have been working on this, mostly from academia, but we do have one representative from industry to give us a little bit of insight, and then we'll open the floor for panel questions. I will just introduce the panelists real fast. We'll have a couple of very quick presentations from them, then we're going to open the floor to questions, with a little bit of follow-up at the end. Going down the row here from my right to left: at the end we have Evan Goldstein from UNC Greensboro, whom you heard from this morning. We have Daniel Buscombe from Northern Arizona University, a last-minute, exciting addition to our panel. We have Sophie Giffard-Roisin from CU Boulder, a postdoc here. And Steve Sain from Jupiter Intelligence, a local startup using AI. So without further ado — Sophie, would you like to take it away?

Okay, do you hear me? Yeah. Okay, so I just have two slides about two projects I am working on now, but mainly what I do is apply machine learning techniques — and develop new machine learning techniques — for specific atmospheric or climate applications. For this first work, I collaborated with Claire Monteleoni, who is also a professor here, so you might know her. The goal was to forecast hurricanes — forecast their tracks, specifically — from a database rather than from physical models; we actually used reanalysis data. We came up with a neural network that is a new type of neural network, because we had to fuse different types of data.
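The data-fusion idea described here — separate network branches for each data type, merged into one prediction — can be sketched roughly as follows. This is a toy illustration in plain NumPy, not the actual architecture from the talk; the input types, layer sizes, and weights are all invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Two heterogeneous inputs for one storm at one time step (hypothetical):
# a small "image" patch of reanalysis fields, and a short track history.
image_patch = rng.normal(size=(8, 8))   # e.g. a wind field around the storm
track_hist = rng.normal(size=(6,))      # e.g. past lat/lon displacements

# Branch 1: embed the flattened reanalysis patch.
W_img = rng.normal(size=(16, 64)) * 0.1
h_img = relu(W_img @ image_patch.ravel())

# Branch 2: embed the track history.
W_trk = rng.normal(size=(16, 6)) * 0.1
h_trk = relu(W_trk @ track_hist)

# Fusion: concatenate the two embeddings, then predict the next
# displacement (d_lat, d_lon) with a shared output head.
W_out = rng.normal(size=(2, 32)) * 0.1
fused = np.concatenate([h_img, h_trk])
prediction = W_out @ fused
print(prediction.shape)
```

In a real fusion network the branches would be convolutional and recurrent layers trained jointly, but the structural point is the same: each data type gets its own encoder, and the problem-specific design is in how and where the streams are merged.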
So this is what I see as combining specific machine learning techniques — neural networks — with a specific problem: we have several different types of data, so how do we combine them? That was the first work. I am now also working on detecting avalanche deposits — I hope "deposits" is the right word in English — from SAR data; I guess you might know this kind of satellite data. The goal here is to adapt machine learning techniques to data that is actually kind of blurry, and where we also don't have a lot of ground-truth labeling — that is, we don't know where an avalanche has occurred or not. I'm not going to go into details. Then I have just one slide about what I think collaborations between machine learning people and domain experts should look like. Machine learning people, when they try to apply their methods, usually just look for data — whatever data — to test their new cool algorithm. They don't really care about what they're solving, and usually they're solving completely useless problems. On the other hand, if you're an expert in some field — I previously worked on medical images, which is where this example comes from — usually you have a cool problem and you really try to solve it with machine learning methods, but because you don't know a lot about them, you often end up using the wrong solutions. I don't blame you. I think if we combine the two — and combining really means going back and forth between the two worlds — then we can end up doing some good research. Okay, so we're going to hold questions until after both presentations, and then we'll open the floor entirely. So next will be Steve from Jupiter. Yeah, sorry, just making sure it's on. Hi, my name is Steve Sain. I head up the data science team at Jupiter Intelligence.
We're a two-year-old startup with offices here in Boulder, in San Mateo, and in New York City. My background is maybe a little bit interesting: I'm actually a statistician by training. I started off as a traditional academic and spent a bunch of years at NCAR heading up something called the Geophysical Statistics Project, which was in the math institute there for a number of years. Then four or five years ago I decided to go be a data scientist, whatever that means. I ended up working for a company based out of San Francisco, and then the opportunity to join Jupiter arose roughly a year and a half ago, and here I am. So what is Jupiter? We get a lot of good jokes about Jupiter — Jupyter notebooks, the planet, all these other things — but really what we're trying to do is provide asset-level, probabilistic climate-risk information about different hazards that might be affected by extreme weather and climate change. We currently have products in the flood space. We also have products that focus on extreme heat. We're developing a product on wildfire that should be out, depending on how you count, sometime this year, and we're exploring a number of other opportunities. Our clients span the gamut — anybody who has an interest in an asset on the ground. It could be the property owner itself. It could be the people who financed the loan to buy that property. It could be the insurance companies or the reinsurance companies trying to handle the risk for that property, as well as different municipalities, basically around the globe. If we want to dive a little deeper into what we're trying to do, take the example of coastal flooding. There are probably a lot more experts on this in this room than me, but you might think of it as coming from three different sources that we typically try to model: extreme rainfall, tropical cyclones, and then the growing interest in sunny-day flooding.
You just happen to have a high tide, the wind is blowing the wrong direction, you get enough water, and you start closing down streets. It's becoming a bigger and bigger problem in many coastal areas. Our pipelines are built to do different kinds of dynamic weather modeling, coupled with coastal surge modeling and, of course, hydraulic modeling to wrap all of this up, and then push it into statistics modules that bring our risk estimates to our clients. And driving all of this are some underlying assumptions about changing rainfall and sea-level rise that go along with the different assumptions people want to make about climate change. Okay, so right now our focus on machine learning has a lot to do with figuring out where our pain points are, especially as we try to do things more efficiently — i.e., cheaper. We do everything in the cloud, so every penny counts. We are also trying to scale. When we talk about asset-level information, we mean results at, say, meter-type scales over tens to hundreds of millions of locations in a particular domain. So as we try to build those efficiencies and try to scale, pain points have come up, and machine learning is one tool we're trying to use to address some of them. Many of them are probably not unfamiliar to you. The simple fact is that the historical record is incomplete; we have to deal with missing data all the time, and machine learning is something we can use to infill those gaps. There's also not a lot of data, especially when we start talking about extremes — the data quantity is pretty small when you want to talk about, say, hundred-year flood risk. So the concept of weather generators also comes up. And downscaling — I already mentioned that we're trying to do things at roughly meter-scale resolution.
So downscaling our dynamical models becomes even more important as well. Something I'm just getting involved with right now is hydro-conditioning DEMs to drive our hydraulic models — lots of problems there. Machine learning can play a role in all of these things. My personal view is that I simply try to find the best tool for the job, okay? Whatever the problem at hand, that could be a linear regression or it could be a deep learning model. Frankly, I don't care; I just want something that's going to solve the problem, and that my engineers aren't going to scream at me about when they try to implement it. We use a lot of different things. Random forests are a very nice, all-purpose, very general, entry-level — I hope I don't insult anybody — kind of machine learning technique that we like a lot. We use them for downscaling, and we use them for a lot of infilling when we have missing data. So it's a nice all-purpose machine learning algorithm. We're looking at recurrent neural networks to deal with some of our sequence data, like hydrological flows, and also to address some of the missing-data problems we're facing there. For convolutional neural networks, we have a project spinning up to help out with some of our downscaling problems. And then, as a bit of an exploratory problem around the weather-generation idea — there are a lot of ways to do this within a traditional statistical framework, but we're also looking at generative models like GANs to think about how we might generate hundreds, if not thousands, of, say, precipitation fields to help drive our models as well. Very quick overview — hope that works. Thanks a lot, Steve. So now we're going to open the floor to questions and pass the microphone down the row of our panelists to weigh in on any pressing questions or concerns. So be loud, we'll repeat the question, and the floor is open.
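The random-forest infilling Steve describes — learning to fill gaps in one record from related records — can be sketched with scikit-learn. This is a toy example on synthetic data; the "gauge" setup and all numbers are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic records: predict one station's values from three neighbors
# (in reality the relationship would be learned from real observations).
n = 500
neighbors = rng.normal(size=(n, 3))
target = neighbors @ np.array([0.5, 0.3, 0.2]) + 0.1 * rng.normal(size=n)

# Pretend the last 50 observations of the target record are missing.
X_train, y_train = neighbors[:450], target[:450]
X_missing = neighbors[450:]

# Fit the forest on the complete period, then infill the gap.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
infilled = model.predict(X_missing)

# Because the data are synthetic, we can score the infill against truth.
rmse = np.sqrt(np.mean((infilled - target[450:]) ** 2))
print(round(float(rmse), 3))
```

The appeal of the forest here is exactly what the talk suggests: it needs no assumed functional form, handles nonlinear relationships between stations, and the same pattern works for downscaling by swapping the neighbor columns for coarse-resolution predictors.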
So the question was: is the pain point referring to the model in particular, or the facility? Yeah, I think I was speaking more about the modeling pipeline, whether that's in the historical data that we need to start driving the model, or in the entire simulation process. Again, you're trying to simulate something that's got 100 million grid boxes and you need thousands and thousands of simulations — you need to do something a little more efficiently. Is machine learning different than super-fancy curve fitting? No, I think it's a really good question — a totally valid question. What I was speaking about is just a simple example, and simple examples tend to look like curve-fitting examples. But as a sub-discipline, it's as diverse, if not more diverse, in the topics being studied than a lot of earth-science sub-disciplines. So there's a lot going on; I just had this one reductive viewpoint of it that I wanted to present to you. But I'm going to actually pass the microphone. Yeah, thanks, Evan. I guess my perspective on this, in general, is that a machine learning algorithm is something that's going to end up telling you what the rules are, whereas a statistical approach is more in line with: here is a set of prescribed rules that you need to follow — you need to find the roots of a polynomial, for example, in a certain way. A machine learning algorithm is much more data-agnostic; it doesn't assume so many things, and it can often even get you the same answer as a statistical model can. Yeah, I roughly agree with what was said. You can view some machine learning techniques as just curve fitting, but it goes way beyond that, I would say — that's just the simple example. Yeah. Being a statistician by training, I personally am not sure; I think in reality the lines are a lot more blurred.
And there are certainly places where statistics is more appropriate, and certainly places where machine learning is more appropriate. But I think the biggest thing here is that calling it curve fitting just doesn't do it justice, because the problems we're facing have so much structure, so much going on, that it's far beyond fitting a curve or even a surface. And the ability, as somebody alluded to, to infer that structure from the data directly is pretty powerful. So calling it curve fitting — I mean, I think that's a little weak, honestly. We have more questions right here. How does machine learning handle outliers compared to traditional statistical models? That was the question. Maybe I can start by saying that there are some machine learning methods that are quite robust to that. For example — I would call them simple now, though they were very fancy five years ago — support vector machines actually find support vectors: a few points of the data that define your decision boundary if, for example, you want to classify two classes. You don't depend at all on points far from that boundary, so in that case you largely avoid the outliers. I would also say that machine learning is, in any case, a data-driven approach. Of course, if you have too many outliers, you will have a problem, but because it is data-driven it can also be robust to very different types of data. So again, it depends on the method — it's not a general answer. What I mean by robust is that, because it starts from the data and you learn from the data, it can be robust to noise in the data. But it depends on the method and, of course, on the outliers. Maybe you have an example of that? Outliers, particularly with sparse datasets — with smaller datasets — are a giant problem, right?
And it doesn't really matter whether you're using a traditional statistical method or a machine learning method — you've got to be a little bit careful. Overfitting is always a problem. But there are methodologies that are robust, and when I say robust, I basically mean that the results are not unduly influenced by those outliers. It's an old concept in statistics, thinking about how I fit a regression line: do I use traditional least squares, or L1, which minimizes the impact of potential outliers? So it is a challenging problem. I don't think there's any great solution; it just requires a little bit of care and thinking about how you apply it, particularly with smaller datasets. I don't know — you guys got anything? Yeah, I don't have a specific example, but a couple of general points. One is that some machine learning algorithms are very data-hungry, and you'd want there to be similar outliers in your training, your testing, and your validation datasets. If you don't have that, then you have a problem straight off the bat. What was the other point I was thinking of? No, it's lost, sorry. Comments on the interpretability of machine learning results? So, there are lots of standard techniques out there for looking at the partial dependencies of your variables — things that have translated from statistics to machine learning. There are lots of data-visualization techniques out there that can definitely guide you in interpreting black-box models such as deep neural networks, for example. You end up simplifying everything and showing those dependencies — or with things like decision trees, you can see, okay, there was a split here, and there was a binary threshold that was reached, and so on. But there's a lot that needs to be done in that area, especially in our field.
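The least-squares-versus-L1 point raised a moment ago is easy to demonstrate. Below is a small NumPy sketch on synthetic data: a clean linear relationship with one wild outlier, fit once by ordinary least squares and once by an L1 fit (via iteratively reweighted least squares, one standard way to compute it):

```python
import numpy as np

rng = np.random.default_rng(1)

# A clean linear relationship y = 2x + 1, plus one wild outlier.
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + 0.05 * rng.normal(size=50)
y[-1] += 20.0  # the outlier

A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares (L2): every residual counts quadratically,
# so the single outlier drags the whole line.
slope_l2, intercept_l2 = np.linalg.lstsq(A, y, rcond=None)[0]

# L1 fit via iteratively reweighted least squares: large residuals
# are progressively down-weighted, so the outlier loses its pull.
w = np.ones_like(y)
for _ in range(50):
    Aw = A * w[:, None]
    beta = np.linalg.lstsq(Aw, y * w, rcond=None)[0]
    resid = np.abs(y - A @ beta)
    w = 1.0 / np.sqrt(np.maximum(resid, 1e-8))
slope_l1, intercept_l1 = beta

print(round(float(slope_l2), 2), round(float(slope_l1), 2))
```

The L2 slope comes out well above the true value of 2 because of the single corrupted point, while the L1 slope stays close to 2 — exactly the "not unduly influenced" behavior meant by robustness here.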
I mean, I don't think that many machine learning applications in the geosciences are really dealing with the problem of interpretability. I would just comment that sometimes that's also the good thing about machine learning: you can actually model something even if you don't know the underlying physics. Because if you have a model and it works, you don't need machine learning; but if you don't, it's because you actually cannot model it. In that case, having even a black box is already a good thing. I think the next level is to figure out how you can couple the two in a good way, so that you get the most out of the modeling and the most out of the machine learning. But that's another question. Yeah, I don't disagree with anything, but I think there are maybe a couple more points. Causality is a new and growing area, and it's being driven, I think, by a lot of the biomedical sciences and, oddly enough, digital advertising — I have a colleague who's been in digital advertising for a number of years, and they're very concerned about causal models and causality there. So it's being driven in a couple of areas, and it's starting to move into machine learning. There's also a lot of growing work on constraining your machine learning model so it doesn't give you non-physical results. So part of this area is not only the diagnostic side — understanding what your model is figuring out — but also helping the model along the way. Those are growing areas. But I would also say this is where that interaction, that collaboration you were mentioning, between the data scientist or machine learning expert and the domain scientist becomes even more important. That communication has to be there. I believe this works best when you have that good kind of collaboration.
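Partial dependence, mentioned earlier as one of the standard interpretability techniques, is simple enough to sketch directly: sweep one feature over a grid while leaving the others at their observed values, and average the model's predictions. Here the "black box" is just a known toy function standing in for a trained network or forest, so everything is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# A stand-in "black box" model: in practice this would be the
# predict function of a trained neural network or random forest.
def black_box(X):
    return np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]

X = rng.uniform(-1, 1, size=(200, 2))

def partial_dependence(model, X, feature, grid):
    """Average model output as one feature is swept over a grid,
    with all other features held at their observed values."""
    pd = []
    for value in grid:
        Xg = X.copy()
        Xg[:, feature] = value
        pd.append(model(Xg).mean())
    return np.array(pd)

grid = np.linspace(-1, 1, 5)
pd0 = partial_dependence(black_box, X, 0, grid)
print(np.round(pd0, 2))
```

Plotting `pd0` against `grid` recovers the sinusoidal effect of the first feature even though we never looked inside the model — which is the whole point: it summarizes what the black box has learned, one variable at a time.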
You put in the effort to be speaking the same language. It's amazing how many times you're talking about the same thing but using completely different words, so nobody knows what the hell is going on. But that's also where it's important to say, hey, the model says put the diapers next to the beer, and you're like, huh — or maybe that's totally logical. I don't know, but there are a lot of those moments, and I think that conversation is what's so important. Yeah, absolutely. But that's also where that interaction matters. You're right — when you move out of the geophysical world, whether I'm looking at customer churn, or a market-basket analysis, or just trying to classify cats, whatever you want — you could argue that machine learning models aren't constrained. They're just trying to find the structure that exists in the data, and that's why they're so powerful, because sometimes that structure is really complicated. But again, that's where the conversation has to happen. And ultimately, it depends. I worked at a company looking at customer churn, and honestly, all they wanted was classifications of people so that the sales staff could go talk to the ones we thought were at risk. They didn't really care why; they just wanted to know. So it depends on what you're after. If you're after the learning, you might choose a different approach; if you just want pure yes-or-no predictions, then maybe you choose a different approach. For the sake of time, we're going to have to conclude, but I'd just like to have everybody thank our panel again for coming. Thanks, Chris, and thanks again to all of our panelists. How many of you thought this discussion was too short? All right, me too. So here's where you can go from here. First, this is part of the new CSDMS machine learning initiative — AI and ML — on the webpage here.
If you Google "CSDMS machine learning," this will come up. You can talk to us and to each other on Twitter, if you're feeling graphical and maybe as sarcastic as our Twitter is, and on our Google Group, CSDMS AI and ML, where you can share datasets, ask technical questions, and generally start the kinds of conversations we've been talking about — the conversations that bridge boundaries, that bring machine learning and mechanistic models together — on this webpage, with each other, and with any other experts you can find. And then finally: Twitter is a little bit exciting, Google Groups sometimes — beer, usually. So we'll also be at Backcountry Pizza and Brew Pub this evening, 5:30 to 6:30. You can see us there if you want to continue this conversation. Thank you.