Okay, well, I'm really excited to be here. This is my first CSDMS meeting, and I'm grateful to talk to you all about something I find really interesting: working with machine learning at the interface of coastal models and data, using these really interesting computational tools. Before I begin, I just want to acknowledge all the collaborators involved in the work I'm presenting. There's a longer list of people who have been involved in the machine learning and coastal work I've done, but these are the people who have been key to the work I'm going to present, led that work, or deeply influenced the way I think about machine learning.

I also want to acknowledge that in this room there may be some of you who know nothing about machine learning and some of you who know a lot. I want to try to bridge that space a little bit, and hopefully convince you that if you're working with a numerical model that has parameterizations or closures associated with it, then machine learning can provide a useful — or at least potentially useful and interesting — way to deal with that issue.

When I speak about machine learning, it falls under the broad heading of data-driven science, which has been in the news at full volume for quite a long time. Here's Chris Anderson's article from Wired in 2008, "The End of Theory." On the right is The Fourth Paradigm, the book from Microsoft Research. This is Pedro Domingos, a machine learning researcher, and his popular-press book The Master Algorithm. And one of NSF's current 10 Big Ideas is HDR: Harnessing the Data Revolution. There's a lot of work going on in data-driven science. Science has always been based on data, so using data for science is not new, but the velocity and volume of data we're dealing with now make these issues interesting to grapple with. And machine learning, in my mind, is a small subset of this data-driven science world, where we're trying to make sense of the data — to pull information, knowledge, and insight out of the data stream using more rigorous computational techniques.

My own introduction to machine learning, which I think gives one example of how somebody trained as a geoscientist can come to this: I was walking to my office as a graduate student, wondering what the hell I was going to do that day, and listening to an episode of the podcast Radiolab on the limits of science. I strongly recommend listening to it.
It's really interesting. In it, the host interviews two researchers who are working on simple mechanical systems like the double pendulum, or systems of multiple masses and springs. They use motion-tracking software, take the data streams that come from those systems, and run them through an algorithm they built to try to detect physical laws — conservation laws or invariances. On the right — if you can read this — is the Hamiltonian of the double pendulum system. This really got me interested, because it's a way of dealing with data streams and pulling out potential rules, or other expressions of the data, that we could use in interesting ways.

I don't want to spend time rigorously defining machine learning, but the operational definition I'm going to use right now is: some sort of algorithm that takes data and develops a generalization — makes something generalizable that we can test, that we can do science on, that we can use as a hypothesis and try to reject. And I'm primarily going to talk about regression tasks, where we're trying to predict continuous variables, as opposed to classification tasks, where we're trying to predict discrete variables.

Okay, so that ends the introduction. Now I want to springboard into a step-by-step example where I develop a parameterization and then deploy it in a model in a way that was exciting for us. Then, as the last part of the talk, I want to leave you with five ideas — things I think are interesting and could be useful to this community if you're thinking about how machine learning could integrate with models in general, or earth-surface models in particular. So bear with me as I step through this example, and then I'll leave you with some ideas.

So, the basic machine learning parameterization. Machine learning takes data and tries to pull something interesting out of it, and the parameterization I'm going to develop is a simple machine learning parameterization for wave run-up. This is a cross-shore profile; you can see the waves breaking on the beach and the dune. Wave run-up is just the maximum vertical extent of the wave uprush on the shoreline above the still water line — that's denoted by the blue dot. As coastal scientists we're interested in predicting wave run-up because it sets whether some place is going to be flooded, the erosion of the dune, and other coastal processes that move sediment around. So run-up is an interesting and very well-studied problem.

It's usually parameterized as run-up R being a function of offshore wave height H, wave period T, and beta, the beach slope the waves are breaking on. Fortunately, there's a lot of available open data here — I'll get to this data issue further on in the talk — tens to hundreds of data points, mostly provided by the USGS as part of a nice experiment. The typical way parameterizations are built for run-up is to take these three parameters, non-dimensionalize them or regress them against each other in some creative way, and through that linear or nonlinear regression pick the functional form of the line that's going to be fit through the data cloud.
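To make that concrete, here's a minimal sketch of that traditional regression step, assuming a hypothetical dimensionally motivated functional form and synthetic stand-in data — this is an illustration of fitting a chosen functional form, not the actual operational formula:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for paired field observations:
H = rng.uniform(0.5, 4.0, n)       # offshore wave height (m)
T = rng.uniform(5.0, 15.0, n)      # wave period (s)
beta = rng.uniform(0.01, 0.2, n)   # beach slope (-)
L0 = 9.81 * T**2 / (2 * np.pi)     # deep-water wavelength (m)
R = 0.7 * beta * np.sqrt(H * L0) + rng.normal(0, 0.1, n)  # "observed" run-up (m)

def runup_form(X, a, b):
    # The modeler picks this functional form up front; the regression
    # only finds the coefficients a and b that best fit the data cloud.
    H, T, beta = X
    L0 = 9.81 * T**2 / (2 * np.pi)
    return a * beta * np.sqrt(H * L0) + b

coeffs, _ = curve_fit(runup_form, (H, T, beta), R)
print("fitted a, b:", coeffs)
```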
That traditional approach has been very successful, and it's used in operational forecasting models. But when you have a lot of paired data like this — run-up paired with the wave parameters and the morphologic parameters — it opens things up to more computationally intensive search strategies. So I'm first going to introduce the machine learning technique of genetic programming, and show how we can use it to build a predictor for wave run-up.

Genetic programming falls into the category of evolutionary machine learning techniques. It's a population-based search strategy where we're trying to find the optimal equation that fits run-up using these three input parameters. We specify some set of variables and some mathematical tools that relate these variables to each other. This is a symbolic regression problem: we're not defining the functional form, just the toolkit that can be used. What we're doing is dumping these variables and these tools into a bag, shaking them around, and seeing what comes out of this stochastic, population-based search — and I'll show you how that's done. So we set this stage, and then we give the algorithm some training data. With that training data, the variables, and the functions, we shake the bag around and pull out a population of different predictors we could use as run-up equations. The population members are in fact equations themselves, expressed in a nice graph form with nodes and edges.

How big the population is can be set as a parameter of the algorithm, but in this case I'm going to pick a population of two to show you how this works. The first predictor is just run-up equals four times the wave height, and the second is run-up equals two plus the wave period times the beach slope. We operate on this population of predictors using vaguely evolutionary rules. This is the first generation. To get to the second generation, we test all of the predictors against the training data we've provided, and we retain the best predictors — we keep the most fit equations. The ones that are bad, we toss out or we operate on them with these evolutionary rules: they either get crossed over, mixing with one another, or they mutate, so a T might get switched to another variable. Here's an example of a crossover, and that's the second generation. Then we evaluate that second generation against the training data, and we rinse and repeat, and rinse and repeat, and eventually we get a whole family of predictors — a whole population of predictors — that we hope can solve this run-up problem reasonably well.
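For the genetic programming step, here's a minimal sketch using gplearn, an open-source symbolic-regression library (not necessarily the tool used in this work); the population size, number of generations, and function set are illustrative choices, and it reuses the synthetic H, T, beta, and R arrays from the sketch above:

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

# Reuse the synthetic H, T, beta, R arrays from the earlier sketch.
X = np.column_stack([H, T, beta])  # the three input variables
y = R                              # observed run-up (the training target)

est = SymbolicRegressor(
    population_size=1000,                # candidate equations per generation
    generations=20,                      # rounds of selection/crossover/mutation
    function_set=('add', 'sub', 'mul', 'div', 'sqrt'),  # the "toolkit" in the bag
    parsimony_coefficient=0.01,          # pressure against overly complex equations
    random_state=0,
)
est.fit(X, y)
# The fittest evolved equation, expressed as a tree of nodes and edges,
# e.g. something like mul(X2, sqrt(mul(X0, X1))):
print(est._program)
```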
After a huge number of formulas are tested — on the order of 10^11 predictors evaluated against the training data — you end up with a plot like this. On the x-axis is complexity, a measure of how many nodes are in the graph describing each individual member of the population, and on the y-axis is an error metric — here I'm using mean squared error. Each dot is the best predictor at a given size, the one with the lowest error. This is referred to as the Pareto front, and what you need to do is select from it the predictors you eventually want to retain. There are a bunch of rules and heuristics in the literature, and you can apply information-criteria measures; unfortunately I don't have time to get into it, but there are rules you can apply here to pick the predictor. You can imagine that a very complex predictor way out at the end, with very low error, is probably overfit, and something at the very top might be underfit. Here's an example of larger complexity, which you can see at the bottom right. The two predictors we ended up selecting sit at the bottom of a "cliff" in the Pareto front.

This is what the results look like. This is actually swash, which is one component of run-up. On the x-axis are the observed values of swash, and on the y-axis are the values predicted by the different predictors. The two machine learning predictors are in the left and middle panels, and the right panel is the traditional, non-machine-learning empirical predictor. The line is the one-to-one line. Importantly, this is new data that the predictors are being tested against. Do not test with training data — that is the single most important thing in my talk. This is testing data, new data that wasn't shown to the algorithm. And you can see that the algorithm develops two predictors that effectively predict the swash: they fall closer to the one-to-one line than the traditional prediction scheme. The problem is that there's still this irreducible scatter associated with it. So, you know, you should throw more data at it, right? See if the scatter resolves.
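Since "do not test with training data" is the most important point here, it's worth spelling out in code. A minimal sketch with scikit-learn, continuing from the arrays and estimator in the sketches above:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out data the algorithm never sees during training,
# and report skill only on that held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

est.fit(X_train, y_train)                              # train on training data only
mse = mean_squared_error(y_test, est.predict(X_test))  # score on new data only
print("test MSE:", mse)
```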
Fortunately, I've had the great fortune of working with Tom Beuzen and Kristin Splinter at the University of New South Wales, where they have lots of data. This is Narrabeen Beach in Australia, a very well-studied site. Like I said, you usually have tens to hundreds of data points for a run-up prediction experiment, at least openly and freely available in the wild for us to use. Kristin and Tom are part of the Water Research Laboratory, which has a lidar mounted on the top of a building that can measure run-up on a wave-by-wave basis. They graciously gave us 8,500 hourly measurements pairing run-up with offshore wave height, offshore wave period, and beach slope. So this is great — we get to use more data and see if the scatter resolves.

But as we all know, more data can make the problem worse. Well, yeah — way worse. This is another observed-versus-predicted experiment, and the scatter is still there. Data, in this case, is not a universal solvent that can solve this problem and produce a better predictor. We were sort of at an impasse with what to do here. And it's not surprising that more data doesn't work: we're trying to come up with a deterministic prediction scheme for a complex nonlinear process that's parameterized with three simple terms — wave height, wave period, and beach slope. The beach slope has morphodynamics wrapped up in it, the waves have directional spread associated with them, and we're not including offshore morphology like nearshore bars. So in a lot of ways it's not surprising that the scatter isn't getting any better.

So instead of thinking deterministically, we tried to think probabilistically. We're using a second machine learning technique on this data set, one that unfortunately has the same acronym — GP: Gaussian processes. For a moment, let's ignore the plot. Gaussian processes are a probabilistic machine learning technique where we're defining the probability of any given function — how it fits through a data cloud. It's a probability distribution over functions that can be used to fit the data. Imagine that we have this big data cloud, which is ostensibly behind this gray outline — just ignore what the gray outline says and imagine the data is behind it. We give the algorithm all of this data, and what's plotted is a single one-dimensional slice through the data space: on the x-axis is wave height, and on the y-axis — which has been cut off, because the speaker's no good — is R, the run-up. So if we give it all this data and run this probabilistic technique, it finds a probability distribution of functions that fit this data set. The black line through it, the mean value, is the most probable function that fits this data cloud.

The nice thing about this technique is that once we've tuned up this probabilistic machinery, which tells us the probability of any given function as it moves through the data cloud, we can draw from the probability distribution. We can take the green line, which is one draw through the data space — a probable function that could fit the data cloud — or the blue line, or the orange line, and all of these are reasonable run-up prediction schemes. So instead of trying to come up with one deterministic scheme — which would be the black line in this case, which we'd plot up with all the error associated with it — we can draw multiples.
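Here's a minimal sketch of that idea with scikit-learn's GaussianProcessRegressor, fitting a one-dimensional slice (run-up versus wave height) and drawing sample functions; the kernel choice and data are illustrative, not the actual setup from our work:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Reuse H and y (run-up) from the sketches above; fit a 1-D slice.
X_H = H.reshape(-1, 1)
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_H, y)

H_grid = np.linspace(0.5, 4.0, 100).reshape(-1, 1)
mean, std = gp.predict(H_grid, return_std=True)  # the "black line" and its spread
draws = gp.sample_y(H_grid, n_samples=100, random_state=0)
# Each column of `draws` is one plausible run-up function through the
# data cloud -- the green/blue/orange lines -- sampled from the fit.
```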
We can draw one, two, a hundred, a thousand different run-up prediction schemes. So now we have all these run-up prediction schemes, and we can feed them into a dune erosion model, making what Giovanni Coco and I call a hybrid model — one where some component inside a larger numerical model is built using data, using machine learning. So here's that cross-shore profile again. I was told when I started modeling that you should draw a picture of your model and then describe it in words. Here's the picture: the waves run up the beach and erode into the dune. And in words: the volumetric dune erosion is just proportional to the square of the excess run-up above the dune toe. That's the model — and I'll spare you further details, since clearly I described it perfectly.

The cool thing about using this dune erosion model is that it's simple, it's very fast, and run-up is a key component of it. So we can feed in that mean predictor, the black line; then we can feed in the green prediction scheme; then the orange prediction scheme; and we can feed in all the different prediction schemes and get a whole spread of dune erosion estimates. In a way, we're creating an ensemble for ourselves.
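A minimal sketch of this hybrid, ensemble step, assuming a simple impact-style erosion rule with a hypothetical coefficient Cs and dune-toe elevation z_toe, and treating each GP draw from the sketch above as one run-up scheme evaluated over a storm — this illustrates the workflow, not the actual model from the paper:

```python
import numpy as np

Cs = 1e-3     # hypothetical erosion coefficient (illustrative value)
z_toe = 2.0   # hypothetical dune-toe elevation (m)

def dune_erosion(runup_series):
    # Volumetric erosion accumulates as the square of the excess run-up
    # above the dune toe; no erosion while run-up stays below the toe.
    excess = np.maximum(runup_series - z_toe, 0.0)
    return Cs * np.sum(excess**2)

# Push one storm through every sampled run-up scheme: here each column of
# `draws` (from the GP sketch above) stands in for one scheme's run-up
# series over the storm, giving a self-made ensemble of erosion estimates.
ensemble = np.array([dune_erosion(draws[:, k]) for k in range(draws.shape[1])])
print("ensemble mean erosion:", ensemble.mean())
print("5th-95th percentile range:", np.percentile(ensemble, [5, 95]))
```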
This is what it looks like when you're done — when we've drawn a hundred different run-up prediction schemes and run a hundred different models for a storm that hit this area. On the x-axis are the profile lines, and on the y-axis is the dune erosion volume eroded from those profiles. If you look only at the left side of the plot — which I'd prefer you do — you'll see the observed values denoted in pink and the ensemble mean in black, and on the left it looks pretty good, right? We don't know what's going on at the right. The gray bands show the shaded percentile range of the ensemble, out to the 99th percentile — they express the range of the ensemble. The idea here is that we're able to provide not only the classic deterministic answer — this much sand is going to erode — but also some uncertainty estimates for managers who would be concerned about this sort of thing. This paper is currently in review in a special issue; I strongly urge you to go online and give it a great review.

Okay, so that ends how I'm presenting the machine learning work we've done that relates to models in this hybrid way — taking data and machine learning components, putting them inside models, and finding creative ways of dealing with parameterizations. Now I want to shift gears a little bit and talk about five things that I see as potential futures — five things I'm personally excited about related to machine learning that I think might be relevant to this audience.

First is open data. I can't do anything with machine learning without access to data, and I would love to be able to cite you for a paper and then cite you again for a data set. Open data is critically important. At the top you can see the data-sharing statement I see a lot: "Research data are available upon request from the first author." Or here's a recent GRL paper that says, "Please note: the publisher is not responsible for the content or functionality of any supporting information supplied by the authors." So that's how data and supporting material are being lovingly archived. I did want to read one thing: I just saw a Dear Colleague Letter from NSF that describes and encourages effective practices for data, including persistent identifiers like DOIs. So NSF is keen on this idea too — me and NSF, you know; they're following in great footsteps. If you make your data findable, accessible, interoperable, and reusable — FAIR — I would love to be able to find it using some of these great tools like DataCite or Google Dataset Search, and then use it, and collaborate with you, and do all of these great things. So I really encourage you to think about data-sharing practices that go beyond just supplementary material. And lastly on this point: we don't have earth-science-specific benchmarks for many of these machine learning algorithms. I'm very keen to see benchmarks created so we can test different machine learning algorithms and understand where they do well and where they don't with different earth science datasets.

Two is time series prediction. We work on systems that have solid phases in them, so autoregression and memory effects come into play. And there's been a real revolution in how to deal with time series prediction in machine learning, driven primarily by a specific deep neural network architecture called long short-term memory (LSTM). Here's an example from Frederik Kratzert: on the x-axis is time — this is a time series — and on the y-axis of the bottom panel is discharge. Kratzert led a group of people on this paper creating a rainfall-runoff model with a long short-term memory network that predicts very well, including in systems that have snow. Inside the neural network there's a space that accumulates information — accumulates memory — when the temperature is below some value, and then dumps it when the temperature rises above some value. There's a space inside this thing that the training tunes up that effectively knows what snow is, in the context of this model. This is a very cool paper, and his work is very cool. I think this sort of time series modeling can guide what we do with earth-surface processes. This was investigated with nearshore bars about ten years ago in coastal work, but the network architectures and the training methods weren't quite there yet. These new training techniques and architectures can, I think, do a really neat job on time series prediction.
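To make the architecture concrete, here's a generic PyTorch sketch of the idea — a sequence of forcings in, discharge out, with the LSTM's cell state free to learn snow-like storage. This is not Kratzert's actual model, and all the sizes are illustrative:

```python
import torch
import torch.nn as nn

class RainfallRunoffLSTM(nn.Module):
    """Map a sequence of daily forcings (precipitation, temperature, ...)
    to discharge at the end of the sequence."""
    def __init__(self, n_forcings=5, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_forcings, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                # x: (batch, days, n_forcings)
        out, _ = self.lstm(x)            # the cell state can learn to accumulate
        return self.head(out[:, -1, :])  # "memory" (e.g. precip while it's cold)

model = RainfallRunoffLSTM()
x = torch.randn(8, 365, 5)               # a year of stand-in forcings, batch of 8
q_pred = model(x)                         # predicted discharge, shape (8, 1)
```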
Okay. Three: images. Images are a weird, diverse set of data that I personally haven't dealt with much, but there's this great paper by Buscombe and Ritchie where they take images from coastal California and semantically segment them into regions of rock, land, and water. Dealing with image data sets — trying to pull out information or insight that could be used for modeling or understanding surface processes — I think is a frontier. I'm currently working in this space, trying to take volunteered social media data — images from social media streams — and detect weird behaviors that are going on, like humans using bulldozers in the middle of a storm event, either on the coastline to build back a dune that's actively eroding, or to dig out a river channel while it's actively in flood stage. Finding these weird human behaviors — pulling that information out and sifting it from the large set of image data — is particularly relevant.

Four: emergent variables. I've been looking at the plot on the left for close to twenty years now, since around when it was published. It's Brad Werner's plot of the different scales and the hierarchy of processes that occur. Down at the bottom are sand grains and fluid physics and the dynamics associated with them; then the dynamics associated with bedform shape and bedform crestline position; and at the top, pattern-scale variables. All of these scales might have dynamics associated with them — rules that control how they evolve through time. But picking out these emergent things is a drag. For my PhD, my advisor made me pick out ripple defects and ripple bifurcations — on the bottom right — from very long model simulations run on a cluster. Instead, now, you can train up algorithms to detect those things for you, pick them out, and track how they move and the dynamics associated with them. So there's this hope that we can not only detect emergent things, but also figure out the dynamics associated with these emergent variables. And I'm sure this isn't related just to bedforms — picking out river junctions, or whatever everybody else works on — picking out emergent dynamics, I think, is interesting.

And lastly, five: I want to talk about Jennifer Montaño's work. She's at the University of Auckland, and she's wrangled a big group of people to look at shoreline change prediction at one beach in New Zealand, at Tairua. On the bottom you'll see two plots: in black is a time series of cross-shore shoreline position as the shoreline moves forward and backward. The top panel shows the more traditional empirical models for it, and the bottom panel shows a bunch of different machine learning models. These sorts of competitions — where we compare physical or more traditional models and machine learning models against each other — can highlight in what spaces machine learning works well, in what spaces traditional physical models work well, how they interact with each other, and how we might fuse them together. This kind of friendly competition about whose technique is best is, I think, a definite future.
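In that competitive spirit — and anticipating the AutoML point in the Q&A below — here's a minimal scikit-learn sketch that pits several off-the-shelf regressors against each other on the same held-out data; the candidate models are arbitrary illustrative choices, and X and y are any paired predictor/target arrays like the run-up data above:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# X, y: any paired predictor/target data (e.g. the run-up arrays above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "k-nearest neighbors": KNeighborsRegressor(),
    "random forest": RandomForestRegressor(random_state=0),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.3f}")
```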
Okay, so I'll let you read the summary slide. I just want to thank you all for listening. If you need more coastal examples of a lot of this work, Giovanni Coco, Nathaniel Plant, and I just published a paper in Earth-Science Reviews, which existed for a long time as an EarthArXiv preprint — and as an EarthArXiv community member, I'd love to encourage you all to submit preprints there if you're at all interested. I'm happy to answer any questions later. So thanks very much.

[Moderator] We're actually going to have a panel discussion on machine learning later this afternoon, so we'll be able to follow up on some of these ideas while we change over speakers. Are there a couple of questions for Evan?

[Audience question, off mic]

Yes — I started just by being interested in one technique, and I think that leads you down the rabbit hole of which technique will work. I don't know a great answer to this question other than diving in and trying it out with a technique. I think most of them tend to work well for the problems we all deal with — relatively low-dimensional problems. There is this current trend of using AutoML, which I think is being integrated, or is integrated, with the scikit-learn library, the Python machine learning library. What it does is take the basic workflow — you split the data into training and testing — and then run it through different algorithms and figure out which ones work and which don't. So if you're interested in finding the best algorithm, a tool like that would work. But if you're interested in just diving in and trying something, just dive in and try something — it will probably work, and if it doesn't, you can try something else. You can also tune which algorithm you use based on what you want. We wanted the genetic programming one because we wanted a smooth prediction surface, an equation, as opposed to something that was more difficult to read. The hope was that eventually we could use that smooth predictor to do nonlinear dynamics with it — to deal with the equation in an analytical way. That's why we gravitated to it. But I'm happy to provide more advice; that's my quick answer.

Okay, so the question was: if you have a data set like the run-up one, with three input parameters, when do you know that you might just want to add an extra variable to make the prediction better? That's a great question. We're limited right now by datasets that link an additional variable to the run-up measurements, but we're currently working to add nearshore morphology to it, and wave directional spread. So I think you're right: as soon as you add that, everything should start to collapse a little bit, for sure.

Great. Thanks, Evan.