Okay, hey everybody. I'm Lucas Mandrake, and I'm the group supervisor of the Machine Learning and Instrument Autonomy group at JPL. I'll be giving you a very different kind of lecture today. Instead of going through the mathematics of how machine learning algorithms work, I'll be talking about some of the applications going on right now and giving you concrete advice on workflows that matter: things we've learned, lessons on what works, what doesn't, and why, especially as it pertains to advancing physical science by working with scientists. Before we actually start on that material, there will be some announcements about your homework that Jake Lee will provide. But while we're waiting for everyone to filter in, I wanted to give an early opportunity for Q&A, and if there aren't any questions, then I want to learn a little more about you, your preparation, and your interests.

So first I'll just open it up. Are there any initial questions people wanted to ask about JPL, science advancement, what we do, anything like that? The microphone doesn't appear to be working, unfortunately, so if you can't hear me in the back, tell me and I will project; I am capable of projecting that far, but you may need to occasionally remind me. Can you hear me now if I do this? Okay, good. People usually don't tell me I'm too quiet, so it should be okay. Any other questions or thoughts before we go?

Well then, tell me a little bit about yourselves, your background and your preparation. How many people here feel comfortable with statistics? Comfortable that if you were asked to do a hypothesis test, you know what that is; that if someone gave you data and said, "Is this actually supporting the claimed hypothesis?", you'd have an idea what to do. A show of hands: who feels confident enough to do that? And by the way, that is also a cutting-edge problem; if you really know what you're doing, you probably put your hand down, because that is a very hard thing to say. Yeah, people get more confident and then less confident depending on what they know.

How about machine learning? You've learned a few simple machine learning methods in here, right? How many people, besides the homework for this course, have actually downloaded some libraries and just played with them, seen what they can and can't do, and gotten a feel for that? Tell me a little about your reactions when you saw that. What did you learn about what they were doing? Were you kind of shocked at how badly they were doing? What did you see?

"Just how small a part of it actually building the model is."

That's a great observation. The "learning" part of machine learning takes almost no time at all, and all the decisions up front, your test/train splits and your annotation strategy and all of that, define the problem it's going to learn and solve for you. And it is almost impossible to do that right the first time, so you spend almost all your time iterating on it. That's a great observation. What else have you seen? Any observations?
And I would actually say: if you're not disappointed the first time you try to use machine learning, you probably didn't try to do something hard enough, because it isn't straightforward. It isn't "import a library, run it, wow, that was amazing." It is a tool that requires quite a lot of skill to set up and use properly. And depending on which model you try, you'll say, "Oh, that learned extremely quickly and took no time at all, but I don't like the results," or, "It took so long to run that I'm not sure I care anymore by the end." We'll talk a little about that. There are two other chances I have to talk with you where we'll really get into simple models versus complex models, and how there's a sweet spot: if you use a model that's too big and too complicated, you now take too long to train and test, and therefore you probably can't do good, rigorous validation. So why are you bothering to do it at all? That is a real challenge, and industry loves that problem; the problem they want to solve is that it takes too long to run. But that's not necessarily the problem science wants you to solve. We'll get to that.

And data access: an early, core problem, like "I tried to download an image and it took a week." Any of those open-source data sets you want to try, a lot of times it takes forever just to download them onto your computer before you can try anything. There are different industry solutions being tried; a lot of times you'll hear "bring the code to the data instead of the data to the code," or other buzzwords like that. But this is a constantly evolving space where people are still coming up with different solutions.

Time for a few more questions, or do you want to go? If you want to take more questions, all right, there's one thing I just wanted to bounce off you. It's not formally part of this lecture, but: when's the last time in the news you heard "scientist does something amazing with a decision tree and learns something new"? This happens all the time, and you've never seen it in the news, not once. But how many times have you heard "enormous neural network that takes hours to run and requires specialized hardware figures out how to do something slightly better than it did before"? This is everywhere, right? And ChatGPT is the latest example of this.
It is so big that it requires millions of dollars of electricity every time they want to retrain it, and the fact that you have to log in and use it is not just because they want to keep it for themselves, but because you don't have the hardware to run this thing. Be careful. There are companies out there who make a living by selling you hardware, and that hardware can be used to run large models. In order to grow that business, they pay people to make and use models that require that hardware, publish lots of papers, and get lots of publicity about how amazing it all is. There's funding flowing at levels science normally doesn't have access to, and all of that funding goes to say: you need my latest big model, and here it is for free, you can download it, it's cool, you can use it; but it requires hardware that happens to be rather expensive, and by the way, we sell that. So there's an enormous conflict of interest in who dominates what you see in the news about machine learning. Small is beautiful. Small is more understandable. You should use the simplest, smallest model that solves your problem; the latest model is not necessarily the right fit. So just be careful. You're working in a heavily hyped field right now, and it is fascinating; just don't think you need all of that to get work done.

On the question about hardware architectures: when I say specialized hardware, I usually mean GPUs, but companies like Google that routinely run giant models have made things like tensor processing units, which have a slightly different arrangement of how they specialize the caches and so on, to push the linear algebra as fast as possible for things like deep neural nets. So you can do some specialization, but it's nothing more than that. You will also hear about really radical hardware, neuromorphic circuits and things like that, which are supposed to accelerate this enormously, but that's bleeding edge, it's limited, and they're still wandering around asking "does anybody need this?" So we don't even know if that's a good way to go or not.

All right, a brief interruption just to cover some course logistics. Homework 2 is due tomorrow at 9 p.m., as you probably already know. Again, if you have issues, there is an office hour today at 6 held by Max; Max is doing awesome on Piazza right now answering programming questions. So if you have issues with your IPython notebooks, they're not running, you're getting overflows: there is something wrong, there's some math somewhere that's not working. You should not be getting numerical overflows in your code. If you have issues, go to that office hour. We also have a recitation today at 7; the locations are on Piazza. Hold on, let me pull up those locations right now. The office hour today at 6 is in the Annenberg conference room, and I will post the recitation location shortly; I think we have to confirm the reservation for that. That recitation is on linear algebra. So if you're kind of shaky on your linear algebra, if you're confused by matrix multiplication and dot products (and later on we're going to go on to things like gradients of matrix operations, and that gets real fun),
then I suggest you attend. Slides will also be posted online if you're not able to attend, but that's a good one to go to. What else? Homework 3 will be released sometime early Friday. We try to do it before noon, but sometimes we have to make changes and things like that, so we'll post another Piazza post when that happens.

By the way, if you're not using Piazza, and if you're not checking it daily, you are missing out. We answer a lot of questions; I think we give away half the homework, honestly, just answering questions on Piazza. Take full advantage of it. If you have issues with your code, we are getting private questions where people post a Colab link and say "help me debug this," and we're happy to do that. We're here to help you learn; we're not here to give you a bad grade, right? So check Piazza. We post errata, and we post PSAs: for example, numpy's dot does not do what you think it does if you give it two-dimensional arrays.
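To make that PSA concrete, here's a quick illustration (the array values are just for the example): for 2-D arrays, np.dot performs matrix multiplication, not a row-by-row dot product.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# For 2-D inputs, np.dot is matrix multiplication...
print(np.dot(a, b))           # [[19 22]
                              #  [43 50]]

# ...so if you wanted the dot product of each row of a with the
# matching row of b, you need an explicit elementwise sum instead.
print(np.sum(a * b, axis=1))  # [17 53]
```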
There are a lot of these things that show up on Piazza, so check it every day and keep up to date; make sure you're not missing out on a resource other students are taking full advantage of. Okay, let's see. I don't know if Dr. Revapragada mentioned the ombudsperson volunteering. An ombudsperson will interface with us lecturers to give feedback on the course; if there are issues we're not addressing, the ombudsperson can advocate on your behalf. So if you want to volunteer to be that person, please email the head TA, Emile, and we'll get that set up, and you can advocate for your fellow students. You also get to put another line on your resume, so if that's something you're interested in, please go ahead and do that.

And that's about it. By the way, if you're submitting a homework and you run into a technical issue, a personal issue, or some extenuating circumstances, just email us. We're not strict about the late-hour stuff; the 48 hours is just there to give you a bit of flexibility on the submission. If you submitted your PDF and your Colab link doesn't work or something, just email us; we'll give you an hour for free. Don't stress about the deadlines and things like that. Okay, that's all from me; back to Dr. Mandrake for the lecture itself.

All right, thank you. So what we're going to be talking about today is how to use machine learning to help physical scientists get done what they need to get done using today's data sets. One of the things I want to leave you with immediately is that the machine learning you've already learned, as simple as it is and as early as it is in the field, can immediately benefit physical scientists right now in their fields. You do not need the latest and most expressive models to make a really big difference. It's much more about the approach you use and the questions you ask in helping them. You will sometimes want those big, expressive models, when your problem is suited for them, and they are powerful and exciting. But the simple stuff is actually more powerful when it's appropriate. So keep in mind: every example I'm going to give you right now could technically be advanced by the machine learning you already know.

Here's the outline of what I'm going to talk about today. I'm going to describe what the physical science endeavor really is: what it's about, and how technology helps it go forward. I'm going to compare what's going on in industry, and the reason I'm comparing is because they're the ones generating almost all the libraries you're using and advancing this field very aggressively, with a lot of funding and availability; but the focus can be different from science applications. I'm going to talk about the difference between big data, which you hear about in the news all the time, and big complexity, which is what I call it when you're looking at data with extremely sophisticated contents, multiple overlapping data sets, and very high-dimensional problems; all of these are very common in science. I'm going to talk about how to incorporate physical knowledge, and this is the really big difference, because when you work in industry you are often studying something for which we do not know the underlying rules that govern the system; it's complicated, and it's often made of people. In science, though, we know a lot about the physical world, and you would not want to ignore that when you set your problem up. So how do you put that into a machine learning context? I'm going to talk about feature engineering, which is a very simple way to start explaining to the machine learning what you already know, so that it doesn't bother learning that; and then a more complicated way, where you actually coexist with a sophisticated first-principles physics model. We'll then go to the iterative discovery concept: how it's not about training a single piece of ML once and for all and then saying "science" at the end, but rather training multiple ML models that slowly reveal things about your data, with the purpose being discovery in that data set, because that's one of the things scientists need the most. I'm going to give you an example of what I call catalogue science, which all science fields are either in or have advanced through at one point or another; it's fairly universal. And finally, I'm going to give you an example of unethical AI use: not the typical unethical AI you hear about, where, say, a facial recognition system treats subpopulations differently, but about how you can accidentally lie to yourself while intending to do good. And because this is based on statistics, it's easier than you think.

Okay. So what's the point of advancing physical science with machine learning?
Why are scientists coming to us and saying "help"? It's all about the fact that we can now acquire data at such an enormous scale and complexity that it is almost impossible to even assimilate it into the existing physical models we have. What I mean by that is that we have elaborate, beautiful first-principles models, say, for the weather, and I will keep going back to the weather example because it's easy to understand. We have satellites that generate terabytes of data a day. You could assimilate all of that, and right now, if you do, the answers become worse than if you assimilate only part of it. So you start asking: what parts do I assimilate, and what don't I? The models are breaking down. You can't just take observations, throw them at physics equations, and hope for the best. You have to start understanding what's in that data that's challenging my model, and what's wrong with my model that it's not matching the data. These are not things physics helps you answer, and they are things data science can, if you set it up right. But the last part, the part in yellow on the slide, is really what I'm going to talk about: scientists are charged, at the end of the day, with understanding. That means they need a simplified description of what is governing what. What processes were not included in my model and are dominating a particular case in the data? I need that understanding; simple automation and simple prediction aren't enough. And that's not true in industry, where prediction is pretty much the name of the game.

So let's talk briefly about the history of technology and science. This is just a notional graph; I did not go through all of human history and actually record these bars. What I want to expose you to is that in the 1600s, a researcher, a physical scientist, would spend an enormous amount of time manually taking their data bit by bit. Then they would spend an enormous amount of time writing it down with ink, writing down the equations to fit, and doing the math to fit those equations to the data. At the very end of all that, if they had any time left, they might make some insight. So things were dominated by the process of science more than the act of scientific discovery. As we entered the technological era, we first gained the ability to compute with mechanisms: mechanical calculators, and the time spent actually doing the calculation went down, but more time was spent taking the data to feed those engines. Then electricity came along and gave us not only simple electric calculators but the ability to take data electrically, with electronic sensors. This suddenly flooded people with more data than they knew what to do with, and already at that time there were people saying: what's the point of an electronic sensor? We can't handle all the data.
So why take it at all? That problem was already there at the time. But fortunately, shortly after that we got computers, and computers were the answer: put it all on a computer and do... what, exactly? This is the point where computer science touched science and said, "I can help you. I can start writing simulators that use physics equations to predict what's going to happen, and you can compare that with your observations to make sense of large datasets." And then it said, "I can do one step better: I can assimilate that data, so that what's coming out of these models isn't just physics, but physics that has been negotiated with the observations, say in a Kalman filter, so that I'm making my best estimate of the true answer." This is how we predict the weather today, and it was so successful that it now defines what science is; science redefined itself around it. Today, when you ask what we know about the weather, you can read a book of equations that govern it, but no one sits down and solves those equations by hand to calculate the weather. They go to gigantic, million-line programs that take in all the observations, combine them with those equations, and make predictions. So we now codify our understanding of the universe as code in these models, and we call that progress. That's what it looks like.

The problem is that these codes have become so large that you can get a PhD improving one small module, and you don't even know if that was the module that mattered. What's missing? What's next? Where do I look? What's causing the problem? These are no longer trivial questions for these codes, and this is one of the places data science can help. So today we spend an enormous amount of time building models; people make entire careers out of a small piece of one model, and that's why that green bar is enormous. But it has increased science insight: these models make exceptional predictions, and where they're wrong is the interesting part. That's where all the science is. Run the models; they disagree over this island at this time; everyone study the heck out of this moment, because there's something to learn here. Why are we getting the wrong answer? It might be numerical, it might be physics, but you can't tell. What data science promises to do is reduce the effort required to make the models. You don't necessarily need an army of graduate students dedicating their lives to the next best physical model just to quickly find out whether there's something interesting here or not. Now, that alone doesn't make it science. Making a prediction is part of science; understanding the processes that control it is the second part, and machine learning actually has trouble with that. It is not an import statement in a code. And that's what we're going to talk about today: how to get the insights back out. Simply having an extremely fast model that tested a hypothesis and showed you, "Yep, you're on the right track, you can predict things," is useful as a hint, but it's not enough. That's what today's lecture is about.

I'm going to pause at the end of every slide to see if there are any questions or anything people would like to add. I did not pack this so full that I need to talk to you the entire time.
I can, trust me. But if you want to talk, if you want to ask questions, please, I invite you to do so. Yes, please?

Well, they're lying, right? It's 300 lines of code that call incredible libraries doing amazing things; it is not even remotely possible that those are simple lines of code. When I quote millions of lines of code, I mean elemental lines, not calls to enormous functions that do sophisticated things for you. However, I do want to emphasize: weather prediction isn't what you think it is. It was called the quiet revolution a few decades ago. In the 80s it was numerically impossible to predict the weather ten days out; no one on the planet could do it. We do it routinely now, and it was because of advances in assimilative techniques that allowed this transformation. It is a civilization-level challenge, in which we're assimilating terabytes of data every day, from space and from in-situ measurements on the ground, with coupled physical models all working together. It's unbelievable what they do, and no one cares, because it's just a product now; it's an app. But what happens there is so big that the United States has one model it maintains, because it takes that much funding, and the European Union has one model they maintain, and the two don't entirely agree. So it is a civilization-level challenge, truly amazing, and yet they're now up against the barrier I'm describing. That was a great question. Any others? Let's keep going. And by the way, you don't need to raise your hand; just talk out loud.

All right, so let's talk about what industry has done for us, which is tremendous, but how it's not quite the same as what we're doing here. Here are some common applications you've already bumped into in your life. Content recognition: "show me pictures of bicycles"; that's image data. "Alexa, play top hits"; that's recognition within sound, a time series of amplitudes. "This may be melanoma, go see your doctor"; that's spectral information: I'm seeing some bands in a certain ratio that I know puts you at risk. So content recognition is one of the most common applications industry has zoomed in on; your phone has it all over the place now. There's also profile similarity: "people like you did things like this." Really, this is all recommendation systems; the internet is filled with them now. You trip over them trying to get in your way and tell you what you actually want or what you actually mean, or they make billions of dollars by predicting how likely you are to click on an ad, which is an entire company by itself. Anomaly detection: there are a hundred hours of security footage, but you only need to look in four places; everything else was normal. That's a form of summarization: detecting things that are out of the ordinary and defining norms. Temporal prediction: what is the stock market going to do next? It certainly can't tell you the whole answer, but it can tell you so much of the answer that it guides you quickly to where you need to bring in your human intuition and do a little more analysis. This is also epidemic detection, in a really unusual way: just watch Twitter and monitor that feed, and you can predict an epidemic's rise and propagation faster than tracking hospital admissions, because it predicts as events unfold. People start posting "I'm sick," "this is really bad,"
"I think I need to go to the hospital." They do all of that before they go, so it turns out to be highly predictive. And you certainly don't know the physics equations to translate English text all the way to epidemic levels. Then there's sequence completion. I literally just typed this into Google: "I want a green," and it suggested card, light, bean. What do you want? It's predicting how people typically complete the phrase. That's a simple example; ChatGPT is a very complex example, trying to predict sentence after sentence after sentence, and it just keeps going, and then we're amazed that it looks realistic. Finally, there's style transfer, and this is the one that's a little weird: "I want the Mona Lisa, but in the style of Van Gogh" (and this is misspelled in the query). So there it is. I don't know why you want that, but we have whole systems for doing it nowadays, and this is the whole selfie market.

So this is where industry puts its money, because it either does something useful, so humans don't have to do it anymore, or it's fun. That's the dominant investment in this field: things that are useful or things that are fun. And because of that, we have beautiful libraries available for everyone to use, because it's in the companies' interest for all of you to get trained in those libraries; they need to hire you to keep this stuff going. That's why these free libraries are so powerful and why they're out there for your use.

And when we look at science applications, at first it looks like a perfect match. Let's use this for science! Content recognition on images: here's the entire surface of a planet; I want to find impact craters; here are some examples. That's great, and here's a map that would otherwise have required graduate students to weep for years as they slowly circled things, and that is still how it's done in many places. Here's a bunch of birdsong; predict how many species are currently in this forest, and from that predict biodiversity so we can track it; same thing as voice recognition. Produce a map of likely hematite composition on Mars: you look at spectral data, you learn to infer the composition of what you're looking at, and now you have a map of where water once was on Mars, because of the hematite locations. So this is looking great. Profile similarity: this earthquake you said was interesting kind of looks like these other events; go focus your attention there. Anomaly detection: here's an entire record of GPS data, but there are four seismic events for review that might be of interest. So that looks very promising. Temporal prediction: weather, everything about weather. Here's what happened before, here's what's coming next, here are the drivers you need to know today. So all of those are great matches, and they really make you feel like what industry wants is the same as what science wants.

Now, the last two I'm just going to briefly touch on; we're going to get much more into this toward the end. When you use these for science, you end up doing something I call deep-fake science data, and it's very dangerous.
First, filling in gaps in my data. I routinely see proposals come across my desk where people want to do this, and the disturbing thing is that the ML says "sure," and it makes a system that can do it. And the answer isn't right, but it looks right, because that's what it was trained to do. This is what ChatGPT does: if you treat its response to you as the gap, it's filling it in. That's exactly the problem it's solving. It's not answering you; it doesn't know what it's saying. It's filling in an empty text box by predicting what a human might have said. That's it. Second, style transfer, the weirdest one. Here's some seismic data; what would this have looked like if I'd had a camera on the surface watching the earthquake happen? People will do this too, and the ML will say, "Here's a very plausible-looking surface event and how it would have shaken around." It's completely wrong, but it's extremely believable. So that's the problem: both of these, when you bring them to science data, produce realistic but wrong deep-fake science data. If you then hand it to someone, they can't distinguish between it and reality.

"So what do you mean by wrong? Yes, it's not the actual footage of the earthquake shaking, but say it's a good representation; would that be, to you, just wrong?"

Let's take an example: your house from space. I cut your house out of the image, and then I say: using the rest of the Earth's data, predict what's here. It will put a house there, because there's probably a house. But when you look at it, you say, "But that's not my house." And everybody else says, "But that was a really realistic-looking house." That's what I mean by wrong, and the problem in science is that it needs to be right, not just look okay.

"So these systems aren't really..."

Then I will be more precise and say: what you fill into that box has a much higher uncertainty than the rest of the data. If you want to be honest, and you produce an uncertainty map that goes really big over that box, then I have no problem with this at all. Then we have no beef.

All right, so I'm going to give you an example of what industry does, because it's kind of fun, and it's also disturbing. There's this company out there called Geico, and what they really wanted for their business model is to give you instant insurance quotes. In those days it took days for them to pore over your data and decide how much to actually charge you for car insurance. So this is how they did it. How do we get this, given that we don't have access to any information, and we want to ask people as little about themselves as possible? We want to ask a few key questions and find out who you are. So let's make two assumptions, and here's where you can already feel the ethics go "wait a minute." The first is that your past performance predicts your future behavior. Now, for a human, that's not that scandalous; you should be held responsible for how you've behaved. But the second one is: if you know anybody who's risky, then you must be risky too. You have to assume something in order to proceed, so they made that assumption, and that's what it assumes about you. Now, they know people's history, and let's just pause there, because you don't upload your history. They just know it. They skim it from all available sources online.
There's a tremendous amount of information about you. They do this every day, on everything you post, every place you've been, every university you've attended; they map that and make a profile about you. Then you make a connectivity map: who have you been talking to, who do you know, who have you lived near? All these things make adjacency maps, from which we can infer trustworthiness from one person to another, because you have some data on who wasn't trustworthy. Pin that down at just a few places, and I can now compute, every night, by inverting that matrix, a trustworthiness score for every person in the United States, ready for the moment you ask for a quote. That is how Geico does instant insurance quotes, for all of you, all the time. There are obvious ethical implications here, and I doubt there's an oversight board ensuring they handle all of it well. But it is also very convenient that I can go to Geico and instantly get insured, so I don't want to diminish this. This isn't easy; it's an extraordinarily difficult data science problem. It's a very sparse matrix, and they are interested in custom hardware to be able to invert a matrix that contains a row and column for every person in the entire United States, with very sparse data. That's what they hire people like you to do. It's also worth about 32 billion dollars, and it is highly ethically encumbered. But this is what industry looks like. Fortunately, because they are desperate for a workforce, they make their libraries free and available, so we can leverage them. So why not? I'm not here to say this shouldn't happen; I'm saying it does.
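As a toy illustration of that kind of computation (my sketch, not Geico's actual algorithm; the five-person graph, the seed scores, and the damping factor are all invented), you can propagate a handful of known risk labels through a sparse adjacency matrix by solving a single linear system:

```python
import numpy as np
from scipy.sparse import csr_matrix, identity
from scipy.sparse.linalg import spsolve

# Toy "who knows whom" adjacency matrix for 5 people, row-normalized.
links = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
], dtype=float)
A = csr_matrix(links / links.sum(axis=1, keepdims=True))

# Known risk for a few "seed" people; zero means no direct evidence.
seed_risk = np.array([0.0, 0.0, 0.9, 0.0, 0.0])

# Propagated score: solve (I - alpha * A) x = seed_risk, so risk leaks
# through the social graph, attenuated by the damping factor alpha.
alpha = 0.5
scores = spsolve(identity(5, format="csr") - alpha * A, seed_risk)
print(scores)  # person 2's neighbors inherit elevated risk
```

The real problem is this same linear algebra with hundreds of millions of rows and columns, which is why the sparsity and the custom hardware matter.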
So now let's look at big data, because that's not really the issue. You see these exponential curves everywhere: the amount of data humanity is generating, and I don't mean generating in the sense of merely producing, but capturing and storing; we're actively doing that. You can make a similar graph for science, and it looks kind of the same. So, okay: same story in both places? No, because it's not the same data. This is where we start to diverge. On the industry side, the data is intuitive and understandable when you zoom in and reduce it: stock market transactions, entries in a medical file. They are one hundred percent understandable, because humans are generating them, and you're human, so you get it. However, the data collection is approximate, haphazard, and sparse; it's horrible; the noise is crazy; all the matrices are sparse. Those are the challenges industry struggles with. Science data is completely different. It's fractally complex: the more you zoom down and try to get at what's really happening, the worse it becomes. You are not going to understand that river delta by zooming in and zooming in until you reach some reducible simplicity and then zooming all the way back out; it's too many orders of magnitude in scale below you. There's more in the data than you are ever ready to fully embrace and understand. The second part is that, because what we're studying is so complicated, we lock down everything else: the data is dense, it is regular, it is carefully calibrated. We understand how we took the data and what the observations are; interpreting them becomes the challenge. This is a fundamentally different problem.

And I want you to imagine, just for a moment, the company out there right now using your data to predict whether you'll click on an ad. If they used the methods I'll describe for science on that problem, they would be trying to understand why you click on that ad, which is essentially trying to mine out human psychology by watching people click on ads. That's a really complicated problem, and absolutely not what they spend their time doing. They just want to make a prediction, because what they need is decision support. So this is where we diverge. It's not really about big data; it's about complexity. In a mere 50-megabyte data set you might have spectral information containing a lot of absorption lines that describe the material the data was taken from. Big data would look at this and go, "50 megabytes? I eat that for lunch; this is nothing interesting." But I also didn't take all the spectra in my data set independently: they were spatially sampled, they are near each other, and they inform each other. So treating each of them independently is wrong; you have to take into consideration how they're arranged in space. I also had repeat overpasses of the same place, so now I have a time series of spatially correlated spectra. This is fundamentally three-dimensional data, with multiple entries in the wavelength direction. This is high-dimensional data, and it's still only 50 megabytes.
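To make the shape of that concrete, here is what such a cube might look like (the sizes are invented for illustration): small in bytes, but high-dimensional and correlated along every axis.

```python
import numpy as np

# Hypothetical hyperspectral cube: 10 repeat overpasses over a
# 64 x 64 spatial patch, with 128 wavelength bands per spectrum.
cube = np.zeros((10, 64, 64, 128), dtype=np.float32)

print(cube.nbytes / 1e6, "MB")  # ~21 MB: "small" data by industry standards
print(cube.ndim, "correlated axes: time, y, x, wavelength")
```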
And then there's the goal. I'm not trying to predict something simple. I'm instead saying: given my current theoretical and empirical understanding, plus the physical models I already have, what in what I'm seeing is predicted by known and understood processes, what's unknown and unrecognized, and what do I know I don't know? That's what I'm hunting for: I want to break the problem into those three things. Can you help me do this? This is what science looks like. It's not an import statement. And at the end of all that, I'm hoping it helped me understand something I can publish a paper on.

So what we're talking about does include big data, because science has the big data problem: huge data is being taken right now, especially about the Earth (in planetary science it's less). But it also has this very complex data I'm describing: fine-scale structure, very high dimensionality, and spatiotemporal correlations that make it complicated, all of which you have to take into consideration. I'm also talking about the computation required to execute on this. Take protein folding: it is simple physics; we know 100% of the physics, there is no new physics to learn, but it's traditionally insoluble, because there are so many different scales of time you would have to model that getting one of them right would take the lifetime of the universe. You have to make an approximation to go forward, and that is something data science can help with. It's good for that; finite elements, not so much. Okay: so science can leverage industry's data science tools, but the approach is very different. Let's talk about how to incorporate that physics usefully. And before I do, let's pause. Questions, thoughts, things you've seen? Does this make sense? Have you seen data like this? Have you encountered a problem where you thought, "Boy, I'm trying to do a little more than predict"? By the way, how many people here are in the physical sciences, getting undergrad or graduate degrees?

All right, fantastic. Please, let's keep together, because we're trying to get physical scientists to pick up some of these tools and learn them, as well as to start collaborations with professional data scientists who help analyze these things; the more shared intuition there is between us, the better these collaborations go. Okay, so how do you bring the physics into your problem? I'm first going to show you quick and dirty ways. At first you're going to say, "But that's too simple"; most people don't do it, though, so let's start there. The first concept is what I call a model-rich environment, and before that, let's define what I mean by using machine learning at all. Here's the typical machine learning setup. You have your input data on one side; you have the target you want to predict; you have some metrics of merit, which you've already started learning about; and then you have a really rigorous validation plan. I'm going to keep saying that throughout this entire lecture, because if you're not rigorously validating, you might as well go home; you can lie to yourself too easily. And then you do import-and-run on your model. It runs for maybe a few hours, and then, ta-da, here's the algorithm that does it for you. You might not know exactly what's in the algorithm, but it is now running, and you can validate it and show that it works.
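In code, that naive import-and-run setup is something like this minimal sketch (the synthetic data, the model choice, and the metric are just placeholders for whatever your problem actually uses):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data so the sketch runs; substitute your real inputs/targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # input data
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # target to predict

# A held-out split: the start of (but nowhere near all of) a validation plan.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)  # "import and run"
print(accuracy_score(y_test, model.predict(X_test)))  # metric of merit
```

Everything that follows in this lecture is about what a sketch like this leaves out.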
This setup is useful when you don't know any physics. So if you set a problem up like this, I want you to be able to say, "I don't know any physics about this problem; I have nothing to add." Because if you do know some, you're pretending you don't, and that's very unlikely in the science domain. The second case is that your physical model exists, but it's so slow that it's computationally impossible to use, like the protein folding case; it's so implausible to run that I have to act like I don't know it, even though I do. And the third is that I'm assuming I have plentiful data, because I'm asking the machine learning to learn everything from scratch. It has never seen this universe before; it doesn't even know there's physics in the world, or that humans understand it. It's just trying to map data inputs to data outputs. That's a very hard problem: you're asking it to do everything, so the performance will be lower. So what if there are physical models already floating around that I have access to? There's no "insert physics here" slot on your machine learning library. What do you do to put that in? And when I say physics models, let's talk about what that means. There are the simple ones: finite-element, first-principles simulations; crunch time step to time step to time step and make your predictions. There are also climatology models, which can predict statistical envelopes on what will happen, but won't tell you mechanically how everything goes forward; you might have access to some of those. There are assimilative models, which take the forward simulation I described, finite elements, and fuse it in a Kalman-filter-like way with observations; that's even more skilled. And then, most likely, there are multiple competing models that get slightly different answers and were written with different assumptions. That's the most common case: you have a mixture of all of these. We don't want the machine learning to learn things we already know. So what can you do?

Let's go a little farther here: the input data in science is also nasty. You've already learned that the input features in all your problems are assumed to be independent and identically distributed. This is never true in science, because physics. If this is weather data, then pressure, temperature, humidity, albedo, and rainfall are linked, because there's physics; if they weren't linked, there wouldn't be any physics, and there wouldn't be any point in looking at them. But they also aren't redundant: you cannot predict each of these from the others; there are independent terms in them too. So this is what science data looks like: highly correlated, but not redundant, not a hundred percent. What do you do now? Fortunately, it turns out machine learning works just fine without iid. When you go forward, the predictions are still valid. The difference is that the model you get is no longer unique: there are multiple models that could have chosen different arrangements of the variables to get the same predictions. And while that doesn't sound bad at first, when we try to pull those models apart, to discover what they learned, it gets hard, because those correlations can hide which variables matter and which don't. So understand that you're probably working in a correlated regime.

Feature engineering is the first, simplest way to tell the machine learning what you already know about a problem, and I would challenge you with this: if you aren't feature engineering, then you are pretending you don't know anything, and that's usually not a good approach. Here's a simple example. A direct machine learning model: say we're trying to learn faces. I take a bunch of faces, I dump them in, I tell it what's a face and what's not, and I train the system. It's trying to learn from scratch what a face is. Very simple and easy to set up, and you will still do this, because it's a great feasibility test: it tells you whether this can be modeled at all. But it doesn't tell you how well it could do, because you could probably have helped it out, and it doesn't tell you how it did it. If instead you take the time to say, "I know a lot about faces: they have eyes, noses, mouths, eyebrows, and here's roughly how you find them," and you build in some features that help it zoom in on those, the benefits are almost unlimited. First, you've made the learning problem much simpler, so it will probably lock on, and because of that you need less training data and less validation data to prove it's working and help it converge. It will actually run faster when you train; this is a way to speed things up too. But it gets much better than that. You can now look through all the data flowing in and ask: how many eyes are in each of these images? And if the answer isn't a spike at two, you probably want to go look at your data. You have data triage all of a sudden: some of your input data is corrupted, or noisy, or messed up, or maybe some other assumption about it is wrong. You can also start debugging the model when it doesn't perform. You can say, "Show me the examples it's not performing well on." Here they are. Well, did the features lock on? Oh, it's having trouble finding eyes.
That's where it's having the issue; you can start understanding what's going on. And all of this comes almost for free, because when I ask you to write these features, I'm not saying each one should be a sophisticated ML model that finds eyes across all of humanity. Just do circle fits: is there something in there that's roughly circular, dark in the middle, light on the outside? Really simple statistics can go a huge way, because you are incorporating assumptions you know are valid for the thing you're trying to learn. So feature engineering is about teaching the system simple things and feeding them in as inputs to your machine learning. I'm not saying you necessarily replace the input data with the features; just put them alongside. In principle they shouldn't hurt you; they should only be able to help. Sometimes that isn't quite true, but in another lecture, the next time I get to talk with you, we'll cover explainability methods, where you get to ask the machine learning which of these variables matters most. And now this becomes hypothesis checking: how about I encode a bunch of features that might help, train the model, and, if it works, ask it which ones helped most and drop the others? I've now done hypothesis-based algorithm optimization, I've learned more about my problem, and I've learned how the ML is doing what it's doing, all because I was willing to help it out a little at the beginning. From an engineering point of view, this is irreplaceable.

The reason I'm harping on this so much, and why it's not obvious, is this: long ago, using the machine learning you know right now, you had to do this. If you wanted to look at image data or sound data, you had to engineer features; the models couldn't handle the raw data. Then deep learning came out, and deep learning can: its model is sophisticated enough that you can put the raw data in and let it learn, and it learns features for you, somewhere in the neural net. You don't know exactly what they are, but they're in there somewhere. That seemed like a huge advance, but then a lot of people lost the art of engineering features, and now they just take models, throw them at the raw data, and say, "This is the performance, and there's nothing to be done about it." There's a lot to be done about it: you forgot how to feature engineer, and for science this is very important.
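Here's a minimal sketch of what that looks like in code. Everything in it is invented for illustration: the random placeholder images, the crude "dark pixels up top" statistic standing in for a circle-fit eye detector, and the choice of a simple classifier. The point is the pattern: compute cheap statistics that encode what you already know, put them alongside the raw data, and compare.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data so the sketch runs; substitute real 64x64 face images.
rng = np.random.default_rng(0)
X_raw = rng.random((200, 64, 64))
y = rng.integers(0, 2, size=200)

def engineered_features(images):
    """Cheap statistics encoding what we already know about faces."""
    feats = []
    for img in images:
        upper = img[:32]  # eyes live in the upper half of a face
        feats.append([
            (upper < 0.3).mean(),               # dark, eye-like pixels up top
            img.mean(),                          # overall brightness
            img.std(),                           # contrast
            np.abs(img - img[:, ::-1]).mean(),   # left-right asymmetry
        ])
    return np.array(feats)

X_flat = X_raw.reshape(len(X_raw), -1)
X_plus = np.hstack([X_flat, engineered_features(X_raw)])  # features ride alongside

clf = LogisticRegression(max_iter=2000)
print("raw only :", cross_val_score(clf, X_flat, y, cv=5).mean())
print("raw+feats:", cross_val_score(clf, X_plus, y, cv=5).mean())
```

On random placeholder data both numbers hover near chance; on real data, the with-and-without comparison is exactly the ablation experiment discussed next.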
You forgot how to feature engineer and for science This is very important Is that clear on what feature engineering is and why it's useful even if you have deep learning you can still do this Um, I see how this is useful for facial recognition, but uh What happens if Because I mean I assume sometimes what you're doing when you're trying to predict stuff You make an assumption of what you think will be important or like What will like kind of help the model out and then when you get the result back that you expected Then you think oh, it didn't it did something correct But could it be the case that based on the features that you chose you basically just hold it Like this is what I want and I said, oh, I'll give you what you want And then you make like the assumption that you arrived, but really you just kind of bet it something What I love about your question is that you are skeptically inquiring that I set the problem up wrong to lie to myself And you should always be asking that What the answer is is you get to run as many experiments as you want Don't encode that feature and see how it does encode the feature does its performance improve Now you didn't tell it you better get a better answer So if the performance improves you have evidence that you have assisted the learner to get where you want to go You can also put other rival features in there that you know should not help it And then show with evidence that they do not So if you're nervous make some features that challenge what you're skeptical about and have evidence that it is not that For instance tanks happen to always be on cloudy days. Oh god. Did I make a cloud detector? Not a tank detector So make a cloud detector and put it in there and show that the cloud detector is skillful in finding clouds It does not correlate with finding tanks and it's not chosen in the features that matter So yes do that and keep asking those questions iterative skepticism is the only way to ensure you learn the right thing Especially in deep learning when you have giant models It has the uncanny ability to learn the right thing the wrong way because it's always trying to find the shortest path So you have to treat it skeptically and not trust your results. It's only when you beat down all the other Alternative interpretations you can start feeling confident That was a great question any others about feature engineering Sure, and it probably is and if you're using a very simple system I mean even lasso is a machine learning system you could use on this problem And if your features are powerful enough it'll actually work really well If it's a big deep learning model we hypothesize it's locked on eyes Prove it How do you know what it's actually using to do a face? It's very difficult There are methods where you can look at activation maps of where it's focusing on But when you do those you discover it has the weirdest answers I looked in this upper right corner and that there knows and that's how I determined it's a face And you're like well, I guess I can make an argument how that would help you But I'm not really sure what you're getting at. 
Sure, and it probably is, if you're using a very simple system; even lasso is a machine learning system you could use on this problem, and if your features are powerful enough, it'll actually work really well. But if it's a big deep learning model and we hypothesize it's locked onto eyes: prove it. How do you know what it's actually using to decide something is a face? It's very difficult. There are methods where you can look at activation maps of where it's focusing, but when you do, you discover it has the weirdest answers: "I looked in this upper-right corner, and at that nose, and that's how I determined it's a face." And you're like, well, I guess I can make an argument for how that would help, but I'm not really sure what you're getting at. It's really hard. Feature engineering makes it explicit. And what's really useful is that after you build some of these features, you skeptically falsify again: "I actually don't think the eyes are helping." Knock them out, show how much the performance drops, and then look at which faces it's getting right and wrong and go, "That makes sense, because these faces are kind of ambiguous; if it doesn't have the eyes, I can see why it's confused." Now you have evidence that it is learning the right thing: it is getting confused, and getting things right, in accordance with your understanding of the model, and that helps. And if you go to the logical extreme, which is almost never done: if you can encode beautiful features that fully capture your problem, your machine learning problem becomes trivially linear. You can do it with a linear regression: add up the values of all the features; above this threshold, it's a face; below, it's not. If you do that, you have entirely understood your problem with respect to those engineered features. From an engineering point of view, for communicating with a scientist, for getting them to use your model and understand it, this is gold, because no one's going to say "we don't know how that works." It's very clear, where a giant deep learning model might only tell us stories. I think you had a question?

Yes; in fact, your features often are not independent of the raw data. Higher dimensionality always slightly challenges the learner, but most machine learning methods are useful precisely because they handle high dimensionality fairly well. Some methods don't, and then you shouldn't use those on a problem with a lot of dimensions. If you're dumping in raw data, you're already in the extreme-dimensionality case for any learner, so adding a few more features costs you little. Using your prior knowledge to extract out some of the knowledge and pre-compute it lets the learner focus the rest of its parameters on learning the rest of the process, so the performance usually goes up. You're right that the information was already encoded in the original raw data, but the model would have had to spend more parameters extracting it, and you did it cheaply in preprocessing. So it usually improves performance.

"So in practice, is your feature detector basically just a really small machine learning problem?"

You're going to the next step already, and I love it. On a large system on board a spacecraft, making a lot of decisions about what's in the data and what to do about it, you can chain different machine learning systems into each other, and then it's almost academic which one is "a feature" and which is "the decider": it's a series of systems all communicating with each other, and maybe they have access to the raw data, and maybe they're fed the outputs of all the others. In general, however, that doesn't buy you much here, because if you make a machine learning thing that looks for eyes, well, now you want to understand how that one works. So usually these features are simple statistics; usually you want something very understandable, because that advances your understanding as much as possible. But in principle you could, and that's actually where we're going next: if these features can be as complex as I want, so long as they represent what I already know, why not put the physical models there? That's the next simple thing.
And that's the next simple thing I'm curious, because you mentioned, let's say I made a model that predicts eyes Like, yes, I would argue like you run into the problem of how is it to pick the eyes, but How's that different than like how are we able to tell with an eye? Like, you know, it's like it's a feature extraction that like our brains are doing For instance, like we honestly know everything about that process Like that was at what point I know that it's like I know this thing more like an art Which maybe like what point do you say like it's okay to just feed and oh this model like predicts eyes and like X percent and use that as a feature at that and try to like go that yourself It's always okay to try anything you want the question is what are you trying to understand So if you are completely confident that it's okay that this thing detects eyes and your question is not how that's happening No one in the field cares about it use it as a feature If you're trying to solve a problem, we don't understand yet putting another mystery box as an input to your system Didn't get you very far And by the way these usually aren't find an eyes find a nose find a mouth because that's hard There how many circular things are on the face? How many in those circular things do they have a dark region in the middle and a light region around the outside? And then we just say these statistics correlate with eyes. That's why this matters. That's good enough We don't need to get all the way to it has to 100 percent be correct as an eye or not And that's why maybe if it finds three in the image, that's okay, but not one All right, things like that All right, so what happens now if we say if the features are a nice way to put in data Let's put the physical models in there because we definitely understand what's in those physical models. We wrote them And that's what it looks like when you start coexisting with models So here's how scientists are doing things today. They take the observations They put it into a physical model now not a machine learned system It crunches through finite element or whatever it does and it makes predictions coming out That would be in a simulative system So what you can do is you can say well the output of that physical model is everything I don't want you to learn I already know that But you also have access to the observations too, which contain all the physics of the problem because they're just observations So the ml becomes a correct Its focus is to learn the bias or the error residual coming out of that model and predicting a better answer And if that's all you want, this is a quick and dirty way to get a better answer The machine learning just locks onto that correction signal and that's that But what's beautiful about this is if you combine it with feature engineering too and now you start getting to ask the question What else was in that observation that the physical model is getting on? You also aren't limited to use just one model You can add as many outputs of models as you want here And then you can ask questions like which of the models was right when? 
Here's another way to do it, and it's very different from feature engineering. Maybe you have your physical model making its predictions, but it's agonizingly slow. One example is in weather: propagating the sun's rays coming through the atmosphere, bouncing off the surface of the Earth, heating everything up, requires an enormous amount of computation. To compute the weather, you probably don't need all of that computation in principle, but if you do it with physics, you absolutely do; there's no other way forward. So if you can look in your physics model and discover that it's this one subroutine, the propagation of solar rays, that's taking 99% of the time, maybe we can take that piece of correct physics out, run it over all the parameter combinations you could imagine, and train a machine learning model to give the same answers. Oftentimes the answer is yes, and it's four to five orders of magnitude faster, because it turns out that computing over that full grid was sufficient but not necessary. There are shortcuts that can be learned, patterns that can be found, and that's exactly what machine learning is. So this is emulation for acceleration, and you might say, "Luke, you told me acceleration, scaling things up, is what industry cares about, not science." The reason you might want to do this is, say, that you care about uncertainty. You don't want to run your weather model once; you want to run it a hundred thousand times, generate a distribution of possible answers, and take statistics on it to see how confident you are. Good luck doing that with the full weather model and all the compute it takes. But when you make an approximate model like this, you can estimate uncertainty on the original model by speeding it up, even if it incurs a little bit of error. So this is an active area of research, and it's very powerful.
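A minimal sketch of that emulation idea (the "radiative transfer" function is a made-up smooth placeholder, not real physics, and the network size is arbitrary): sample the expensive routine over its parameter space once, train a cheap emulator, then spend the emulator on a Monte Carlo uncertainty estimate you could never afford with the original.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def slow_radiative_transfer(params):
    """Placeholder for the expensive subroutine: smooth but costly."""
    return np.sin(params[:, 0]) * np.exp(-params[:, 1]) + 0.1 * params[:, 2] ** 2

# One-time offline cost: sample the parameter space and train the emulator.
P_train = rng.uniform(0, 2, size=(5000, 3))
emulator = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
emulator.fit(P_train, slow_radiative_transfer(P_train))

# Now Monte Carlo uncertainty propagation is cheap: 100k perturbed inputs.
P_mc = rng.normal(loc=1.0, scale=0.05, size=(100_000, 3))
outputs = emulator.predict(P_mc)
print(f"output mean {outputs.mean():.3f} +/- {outputs.std():.3f}")
```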
And a constraint like that might be enough to suddenly bring your answers much closer to reality. The other thing you're doing, if it works, is locking down your potential space of solutions, so now you need much less training data to converge usefully to an answer. And this is enormous, because science data is trivial to get access to, but annotated science data is very rare. So you really want to do as much as you can to incorporate what you already know, to reduce the training data you need. This is called physics-informed machine learning, and it is active research.

Now here's the full dream, if you're going to go that far. I'm a scientist, and what we really want at the end of the day are new equations, because equations are the language of science. I want new terms in my equations that I can interpret, reason about, understand, and maybe even derive one day. So why not make a machine learning model that optimizes in the space of possible operators and terms and derivatives? Those are the parameters it's trying to learn. Let's seed it with the known physics we have now, and let it learn the corrective terms that the observations seem to need, relating their variables together. (I'll show a toy sketch of this at the end of this passage.) Now, this is so exciting that when I first read about it I thought: ah, we're finally there; why didn't we do this to begin with? And the answer is this. In order to train and validate such a model, you have to have observed almost everything about the problem. Every term in that equation needs observational data if you want to compute it. And when was the last time you saw a sensor that could tell you everything you could want to know about a cube of any space, at any time? It's very rare that you can actually take data to substantiate a model like this. But it is an active area of research, and it really is the dream. "This data is trivial and can be explained with current physics, thank you, I don't need to look at it anymore": that would be an amazing outcome. Or: "there's a third-order derivative term that, if you add it here, explains the data five percent better." What the heck is that? What does that mean? That would be a great way to help scientists. So this is bleeding-edge research; you usually don't converge, and you usually don't have the data to do it. But when you do, it could be the way forward, and making these methods stable and useful could be one of the problems that the people in this room actually solve. We need help here, because it's going to transform everything once we crack it.

Now I'm going to talk with you, and I've already touched on this a little, about the idea that this is never about setting up the right problem once, getting an answer, and publishing a paper. It is an iterative process of discovery and learning. Yes? Oh, I'm sorry, that's an excellent question: where are these actually used? That was a lot of ideas, so let me give you my thoughts. These are hypothetical approaches, and where I've mostly seen them used right now is on very simple systems like planetary dynamics, because those are so locked down, and the forces so well understood, that we even know some of the terms that are not currently captured.
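Here is that promised toy sketch of equation discovery, in the sparse-regression style some of these papers use. This is my illustration, not any particular published system: build a library of candidate terms, then let a sparsity-inducing fit pick the few that explain the data.

```python
# Toy equation discovery: recover dx/dt = -2*x + 0.5*x*y from data
# by sparse regression over a library of candidate terms.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 500)
y = rng.uniform(-2, 2, 500)
dxdt = -2.0 * x + 0.5 * x * y + 0.01 * rng.normal(size=500)  # "observed" rates

# Candidate terms the learned "equation" is allowed to use
library = np.column_stack([x, y, x * y, x**2, y**2, x**3])
names = ["x", "y", "x*y", "x^2", "y^2", "x^3"]

fit = Lasso(alpha=0.01).fit(library, dxdt)
for name, coef in zip(names, fit.coef_):
    if abs(coef) > 1e-2:
        print(f"{coef:+.2f} * {name}")   # ideally ~ -2.00*x and +0.50*x*y survive
```

Notice the catch right there in the setup: I needed measurements of x, y, and dx/dt everywhere. Every term in the library has to be observable before you can even write the regression down.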
We have ideas, and we can do some sanity checks on whether what comes out is right. Where they really want to use these methods is in what I would call in-situ measurements of 3D fields. A spacecraft is flying through the solar wind, and we have almost no data whatsoever to lock down an estimate of that whole 3D field: the magnetic field, the electric field, the electrons, the multiple species flying by. We need physical constraints to see what's reasonable or not. That's where they would really like to use this. Not so great for facial recognition. [Student question, partly inaudible: shouldn't it agree with what you already know?] Yes; with a mature version of this technology you would want to do exactly that: to say we believe we've locked onto a real signal, and my setup of the problem is sound, because it derived Navier-Stokes right out of the box; and then its other funky terms are where it can be tested. In practice, it actually means seeding most of what we know at the beginning, to get close enough to the actual shape of the answer, because of the insufficiency of the data. Imagine the data you would have to take to derive Navier-Stokes from scratch; with simulations, you might. I should also say the limitations of this technology are not yet known. There are numerical edges to what you can derive and what you can converge to, and regimes in which you're stable and unstable, and as with most of machine learning, these will eventually be known but are only partially known now. Why machine learning sometimes locks onto things and sometimes doesn't is an active area of applied mathematics right now, and they have made progress, but they're not finished. You usually hear things like: "we made enormous progress by assuming the neural net is infinitely wide and, you know, seven layers deep, and so now we can finally prove one thing about it." You'll see things like that; that's the stage we're at. So it's really quite early.

All right, so the iterative discovery loop. What is that? There are many different kinds of iteration here, so this is a very general thing to teach you. The traditional approach to doing science is what I'm going to show you first, and I'm going to call it confirmatory statistics. At no time should you interpret what I'm saying as an insult; this is an excellent way to do science that got us to the moon. It works. I just want to introduce it in a certain way. You start out with a scientist being exposed to their current physics models, some observations they hope contain new things, and the errors of those models with respect to the observations. It is then incumbent on the physicist, or whichever scientist is studying the system, to come up with a great idea: look through the data, find where it disagrees, and wonder what might be causing that, straight from your own brain. There's not a lot of help here, and in fact young scientists are often trained with exactly that instruction: "just have an idea, please." That's your job. That's a lot of effort and stress to put on one person, but good ideas can go really far, and it works because humans are very creative. You then enter into a small iterative loop with your data. You load the data up in your lab, you code up your hypothesis, and then you say: if this is true, then on this island, at this date, I should see this effect. Is it there?
Oh, that was a bad idea. And then you have another one, and another one, and you keep going. And if you get an idea that fails falsification repeatedly, you never know if it's right; you've just failed to show it's wrong, repeatedly. Then you publish it and see if other people can replicate the same thing. So this is pretty good, and at the end of it, hopefully, if everything works great, we have a new term in our equation, a new chapter in a book, a little more physics we know.

There's a different way, and it arose when we started having so much data that scientists said: I'm not even using 99 percent of my data in that falsification step up there. I am never posing questions that require consistency with all the data everywhere to show they're true. In fact, most science is done by zooming in on these giant data sets to look for clear examples where maybe you have a chance of falsifying or supporting something. So: are there other things you can find by looking at all the data together, rather than by zooming in? It's really amazing; in the science literature you will find the same island or the same storm studied again and again and again, because the more people study it, the clearer it becomes as a learning example for others to use, all while ignoring all the other storms. So how do we ask questions of the data directly? The new way is to use machine learning and ask a much higher-level question. Rather than "is this the answer?", you can say: I wonder if there is a trend relating some of these variables I observed over time. Or: I wonder if there are kinds of data that group into clusters, and how many are there? I'm just hypothesizing there are n groups; I don't even know what n is yet, but maybe they cluster together and act similarly. Or: maybe if I combine these terms in some interesting way, it predicts something useful. I wonder. You can ask these high-level questions. (There's a small sketch of the clustering version right after this passage.) You work with a data scientist, or maybe you already are one if you've learned these methods, and you code it up. The data scientist helps you by saying: here's how you set up the problem; here's how we have to condition the variables to feed them into the model; I suggest this model architecture based on the complexity of your problem, but let's try two others just in case, one a little simpler and one a little more complex; and here's my rigorous validation plan, so I don't lie to you or to myself. The scientist then has to sit down and, oddly enough, do exactly the opposite of wondering what's true. They have to go in and declare what's true on a very small scale: "I care about this, I don't care about this," tagging the data endlessly. And they will be frustrated and wonder: why is this useful? Why do I have to give you 10,000 annotations? Why aren't five enough? A human would have gotten it by now; why isn't the system getting it? This is the hard spot. You're not trading away the pain; you're converting it, into carefully annotating your data to explain what the machine learning should do. But if you do all that, then the machine learning will do what it does: it will find an algorithm for you that does what you explained to it. And this algorithm will not be useful. It will immediately fail. But it will be interesting how it fails: it got the right answer here, but not here. Oh, I need to give you more examples like this. So you go back and you fix your annotations, and then you really train the model.
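Here is that promised sketch of the "how many groups are there?" question, on synthetic data; in real life, the variable-conditioning step is where most of the work hides.

```python
# Ask the data "how many groups are you?" by scanning cluster counts
# and scoring each candidate with a silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=600, centers=4, random_state=0)  # pretend observations
X = StandardScaler().fit_transform(X)    # condition the variables first

scores = {}
for n in range(2, 9):
    labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(X)
    scores[n] = silhouette_score(X, labels)

best_n = max(scores, key=scores.get)
print(scores, "-> best n:", best_n)      # should peak near the true 4
```

Whether those clusters mean anything physically is still the scientist's problem; the silhouette score only says they're geometrically separable.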
Okay, so back to the loop: now it's failing here, but not there, and it is doing even better. If you keep doing this, purifying again and again and again, you make progress: your model's skill goes higher and higher until its predictions become interesting. It can make predictions nobody else can make. How is it doing that? If you ask that question of the model, and you can figure out how it's doing it, you have now created a hypothesis that is consistent with all your data, and it didn't necessarily come from you. It came from you indirectly, through asking a high-level question and providing annotations. All of a sudden you say: you know how it's working? It's using these three variables, and I don't actually understand why these three variables should help. There's something interesting here that shouldn't be true. That's the most exciting moment, because now the scientists can go back and think all the thoughts they want and keep creating, just like they always have, but now they have some leads. So this is one of the ways you can iterate, and at the end of it you throw your model away. You didn't train the model to give to anybody else, or to do anything with, other than to inspire you. That's pretty exciting. However, sometimes you happen to end up with a model that's useful, so you give it to the community and say: I now have a strangely good detector that I happened to build for another purpose; you can use it too, in case it helps your science. That artifact is all industry cares about; most of science is about trying to get the idea. So that's one way to use these systems, and it's one of the reasons why simple methods are useful: that "how is it doing that?" step is easier with a simple model, and very hard with an enormous, complex one. Let's just pause there for a moment. Is it clear how you might want to use these things, and why it doesn't actually matter what your model is, or what final accuracy it gets? You're using it as a discovery tool.

All right, now we're going to advance a little bit. I've shown you a few things, and that's not all there is to say on those topics, but I'm watching the time, and I want to talk to you about different science domains and what they need, because machine learning looks very different when you try to do what I just described in different areas. Let's take three examples for a moment: the earth's atmosphere, the surface of Mars, and maybe GPS time series. The one place all science starts is the continuum record: I'm just taking observations; I have a stream of them; I don't know what anything means yet, but I'm recording it, and recording it systematically. We have huge records of that; all science has been there. The next step is to ask: what are the discrete events? What are the things I can understand? What are storms and clouds and rain and precipitation and droughts? I can start chunking things together. I call this the stamp-collecting phase: I want to find all the kinds of things there are, then describe them and find examples of each kind. They might be all the surface features on the surface of the planet, all the weather events and their subunits, all the signatures inside GPS time series. What can the earth do? How does it move? What does it look like?
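Before moving on, one note on that "it's using these three variables, and I don't understand why" moment. A simple, model-agnostic way to get there is permutation importance: scramble one input at a time and watch how much the skill drops. A sketch on synthetic data, where only three of six variables actually matter:

```python
# Which inputs is the model actually leaning on?
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 6))
y = (X[:, 0] + 0.5 * X[:, 2] - X[:, 4] > 0).astype(int)   # only 3 of 6 matter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

imp = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
for i in np.argsort(imp.importances_mean)[::-1]:
    print(f"variable {i}: {imp.importances_mean[i]:.3f}")
```

When the top of that list surprises you, a variable you had no physical reason to include, that's the hypothesis-generating moment.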
After that catalog phase, we start building theoretical models that try to explain each kind of event. For this event, and only this kind of event, this slow earthquake or whatever you're studying: here's my model of how it goes, and let's compare it to the data and start pulling out understanding. And after that, you start taking all these models and putting them together. Well, I have good models for all the individual events, and they cover pretty much everything that happens, but do they integrate? Do they make a coherent story? Do global patterns, trends, and yearly behavior now make sense? Does the energy budget close? And then, if you're really good at that, and weather is about the only place we've gotten past this point, you start building this thing called a digital twin. A digital twin means we have parametrized the earth to the point that we have a digital model, assimilating data from all around the world at incredible fidelity, and it is our current best estimate of the state of that system and everything you might want to know about it. So if you want to measure something on the earth, you might be able to make that measurement on the digital twin instead, because it's lock-stepped with the real thing. And the places where your digital twin disagrees with incoming sensor data are exactly where you should go as a scientist, because there's something interesting there that we haven't captured yet. We're really only starting to do this for weather now; we're just transitioning to it. You see it in Hollywood all the time, and every engineered system has a digital twin. The other thing digital twins are good for is that, if you believe in them, if they have earned your trust over time by faithfully reproducing data, you can do experiments on them. You can't know, on the real earth, what happens if we increase the sun's flux by a factor of two, or what happens if we increase CO2 levels, things like that.

But here's what I want you to remember if machine learning is anywhere in that system at all, and especially if you don't have physics guiding you: all machine learning is an interpolative method. It interpolates between the data it has seen, in a very complex space, in a simple way. But it does not extrapolate, ever. There are no guarantees that if you show a machine learning system something it's never seen before, or a parameter range it has no data for, it will do the right thing. Zero guarantees. It looks inward. So you want to be very careful when you use things like digital twins to predict, say, the climate in a regime where we have no training data, because who knows if that's right at all. It's not actually that scandalous, because how would you validate it anyway? You're pushing it into a regime you have no validation data for, so why would it be right? Physics can extrapolate; that's why we really like physics models.

All right, so how does data science look across these different stages? I'll just skim through this in the interest of time. At the very beginning, with the continuum record, there's just an endless need for discovery in the data. What's in the giant record? What are the individual events and units, anyway?
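That "interpolates, never extrapolates" claim is easy to demonstrate for yourself. Here's a two-minute check with a random forest, which by construction cannot predict outside the range of targets it was shown:

```python
# Tree ensembles flatline outside their training range: zero extrapolation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.linspace(0, 1, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()              # a perfectly learnable straight line

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

X_test = np.array([[0.5], [1.0], [2.0], [5.0]])
print(model.predict(X_test))
# roughly [1.0, 2.0, 2.0, 2.0]: beyond x=1 it just repeats the largest value it saw
```

Neural nets fail less visibly: they will produce something smooth out there, with no guarantee it tracks the physics, which is arguably worse.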
At that continuum-record stage: can I use clustering or something just to give me an idea of what I'm looking at? Let me find meaningful groups and distinctions. Then event detection, and helping with in-situ event capture: an event happened here, and I can zoom in and study the heck out of it, but I needed help finding it. When you get to the catalog stage, the discovery need is still there, but now it's about annotating what I currently know so that I can look for the things I don't. It's having a repeated discussion with your data: annotating it more and more, changing your annotations, changing your mind, splitting groups in two, so that you are slowly but surely encoding your knowledge in a larger and larger annotation set over that record, and breaking it up into more and more chunks of meaning. So that you can then pair those off to individual grad students and say: that's your dissertation. And then purification: whenever you're annotating, you're making mistakes, I promise you, because it's a very hard problem and humans make mistakes. So any discovery you ever make in your data is first and foremost finding the errors in your annotations. Embrace that, and you will live a happy life. Forget it, and it will bite you. Finally, knowledge capture. Annotations are not just for training machine learning; they are a way to communicate with other humans. You can check them into a GitHub; you can share them with the community. Others can make their own annotations, and you can argue viciously about why you disagree, and that actually is advancing knowledge. So this is a systematic way of codifying science understanding.

At the theoretical-model stage, you start being able to say: here's what the observations did, here's what the model said, here are the variables that might predict why they disagree and which ones matter, and you use that to improve your physical model. So it becomes a discovery tool for finding what's wrong in your model and how to make it better. And today this is where science is really hitting a bottleneck, because the traditional methods of advancing those models are hitting a bottleneck. They need help exploring the data in a collaborative sense, and also, once you get good at feature engineering, you can pose hypothesis checks: "I think this is true; see whether that's consistent with all my data for me," and you can set up that problem fairly easily. For global models, it's now all about: are they right or are they wrong? Where are they wrong? Can we characterize why they're wrong, and which variables are implicated? It becomes that kind of question because we're getting more mature; it's about explanatory completeness. And then all of a sudden this comes in: these models are so big that we can no longer compute them; it's too expensive or takes too long. Can we accelerate them? And suddenly new data is pouring through, and it's harmonization: I have 15 overlapping data sets all measuring the same place on the earth at the same time; how do we bring all this data together? And at the very end, this becomes formalized as uncertainty quantification. This is the idea that I'm no longer just trying to predict.
That's not good enough; I also have to estimate how confident I am in my prediction. In science it's been said that if you don't quantify your uncertainty, then your prediction has no value, because people need to understand how confident you are. I take issue with that statement, because actually, in the early stages of any science area, it's quite useful to make predictions even when you don't know how certain you are; you still learn a lot. But in the end, if you really want to demonstrate mastery, you have to be able to talk about uncertainty and quantify the ranges. And machine learning doesn't do that for you. There's no library you can call that is ml.uncertainty. Instead it's ensembles, and it's Bayesian statistics, and it's all those things again; it never goes away. But to enable those, you probably need emulators and everything else machine learning was helping with.

So let's pause there for a moment. Is it clear how machine learning's help to science evolves as the science itself evolves? Every science domain is in a different place along that path: planetary science is all the way over here, with hardly any data, while the earth we live on is way over there. Let's keep going; I see now that I was overly ambitious rather than under, which is great, because it means we're having a good discussion. I want to give you a very brief catalog example. (Where's my cursor?) So, the catalog example. In this case, we have a continuous record and we're trying to break it down into discrete events, to advance our understanding in a collaborative way that informs the community and works with everybody. Let's zoom in on that for a moment. This is all about assembling your initial objects and events, and here are some examples of what I mean. Fresh impacts on Mars: not just impacts, but the ones that occurred recently look funny, so you can find them and recognize them, and lots of scientists are interested in those because they probe the subsurface. Could you please go find those for me? Particular spatial regions of change: this is huge on the earth. What's changing on the earth? How fast is it changing? Where is it going? All to understand the processes explaining that change, especially human-driven versus not. Unusual outliers: you don't really want your system thinking the InSight lander is a normal surface feature on Mars, so it's a great test of anomaly detection, because it had better show you all the human intrusion, or it's not really working very well. Lastly, sequential, predictable events. This is great for earthquake detection: I learn all the trends and seasonal behaviors, the rise and fall of the earth as water flows into the ground, and then suddenly something happens that was fundamentally not predictable. That's interesting; let's make a catalog of those.
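For the outlier-hunting flavor of this, "show me what doesn't belong in my catalog," an isolation forest is a common first tool. A sketch, with synthetic feature vectors standing in for, say, descriptors of surface patches:

```python
# Flag the strangest items in an unlabeled catalog.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 8))    # ordinary surface patches
weird = rng.normal(6, 1, size=(5, 8))        # a lander-shaped surprise
X = np.vstack([normal, weird])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = iso.score_samples(X)                 # lower = more anomalous
top = np.argsort(scores)[:10]
print("most anomalous indices:", top)         # the 5 "weird" rows should rank high
```

The InSight-lander point is the sanity check: if a detector like this can't find the one obviously human-made thing in your scene, don't trust it on the subtle stuff.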
Then there are different kinds of terrain: regions that are texturally different from everything around them, which means their geologic interpretation is also different. And what geologists really love is the contact points between them. So if you can automatically classify that there are 17 kinds of terrain on a surface, with typical examples of each and the contact points between them, you've actually helped a geologist a lot in knowing where to focus their attention. And then of course storms, which are very clear as discrete events.

So let's walk through how this goes; this is a narrative now, so walk with me for a minute. Say you have decided to do a dissertation in this area. You're working with physical scientists; you're the data scientist; you're trying to help them out. You start with a few cataloged examples, like the ones I just showed, a few things you know are helping you make progress. You look at this enormous, vast, uncataloged source data pouring in daily, and you ask: how can I help? The first thing you do is ask: what's most similar to these? I want more like this. The machine learning helps you, and at first it's horrible, but you keep refining your annotations until it gets better and better, and you end up with some single-use models that help you focus in on things of interest. You can also ask the opposite: what is least similar to these? What scores lowest against the things I know about? And that focuses attention on things you didn't ask about. It might show you noise, but now you know there's noise in your data. Did you know that some of the images were all noise? There's always something to learn. You can also ask: what's most confusing? What has the least confidence in my prediction? Not the highest or lowest score, but "I'm just baffled; I don't know what this is." That can be a great way to discover new things you don't currently understand.

After you do this for a while, you've built up a whole pile of things to analyze, think about, and maybe annotate. So the next step is to ask: what are the emergent groups I see here? How would I like to segregate this out in a meaningful way? You can also ask: what are the outliers, without me telling you anything? What stands out? And you can subtract out all the examples you currently understand and ask: now what stands out? You can keep doing this and keep finding the weird ones. This is, for instance, one of the ways we found the most hematite-rich rock on the surface of Mars: by looking for outliers, we discovered one that was ridiculously rich, and that had very important scientific meaning. You can also find subclasses: I thought this was one class, but the model keeps getting confused; it's actually two classes; I did a bad job. So you go back and re-annotate, and now you've learned something, and you go forward. You can also now compare with theoretical models: which of these are explained by current physics, and which are totally baffling? That's another annotation you can add to each of these events: how well or poorly are they modeled by physics?
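The "what is my model most baffled by?" query is nearly a one-liner once you have class probabilities: rank the unlabeled items by the confidence of their top prediction and read the bottom of the list. A sketch, with synthetic stand-ins for annotated and unannotated events:

```python
# Triage an unlabeled archive: confident items vs. baffling items.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(300, 10))                 # annotated events
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unlab = rng.normal(size=(5000, 10))              # the daily flood of new data

clf = RandomForestClassifier(random_state=0).fit(X_lab, y_lab)
confidence = clf.predict_proba(X_unlab).max(axis=1)

confident = np.argsort(confidence)[::-1][:20]      # looks like things we know
baffling = np.argsort(confidence)[:20]             # "I don't know what this is"
print("worth a human's eyes next:", baffling)
```

Annotating the baffling twenty, retraining, and asking again is the whole loop; each pass either sharpens the model or surfaces a genuinely new kind of thing.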
That physics-agreement score becomes another kind of annotation you can use to go make more discoveries. And you are always doing quality control and purifying; no matter how hard you try, you'll be doing that to the end of your project, and on your final slide you'll talk about what errors you suspect are still in there, because that's just what happens. Then you characterize and you capture: you can ask the machine learning to generate simple explanations of the differences between the groups you've identified, to help you get ideas about what physics might be driving those differences. And finally, you publish all of that and make it a community record. You should always look at your problem as though that's the goal. The data set you're preparing, the annotations you're painstakingly making and remaking and remaking: all of it needs to be published so someone else can take it and keep going. Eventually it will get big enough that it will become a website, and someone will pay for it, and all the scientists will be logging into it and contributing to it, and then it will be called JMARS or one of these other things, which is exactly what has happened, and someone will write a custom web viewer for you that overlays all the different annotations people have made so far. That is what science is starting to look like now: community-based science, because the problems we're solving aren't really things one person can do.

All right, there are a few other topics before we leave today, and we'll go through them rather quickly: data fusion and decision support. One of the things machine learning is really useful for is that you don't have to know the physics for it to work. You can simply dump data in and say: can you make the prediction? In science, you wouldn't think this would be particularly important, but it actually is, because we want to do science applications, not just science prediction and science advancement. We want to ask: does climate science inform us on whether we should build a dam? And to get at that, you need to start bringing in socio-economic information, a bunch of other information for which there are no terms in your physics equations. So we want to be able to fuse data and then ask: was it helpful? Is it useful? Is it informative? (I'll show a tiny sketch of that check in a moment.) That is exactly what explainability methods are for, which we'll talk about the next time I get to chat with you. You can propose a bunch of information that might be useful, and it tells you: these were not, this one was, and here's the percentage it helped by. Now you know where some of the information lies, and you can start reasoning about it if you want to understand it. But you can also help decision makers make decisions that are related to science understanding. This is how people track pandemics through phone use and things like that; nobody writes equations with terms for that.

The last thing I want to leave you with today is what not to do. I am going to show you a gun, and I am going to ask you not to pick it up, but I want you to know what a gun looks like, because when you don't, you pick it up. The reason this matters is that statistics are trivial to lie with, most importantly to yourself. If you do not validate, and cross-validate, and then challenge your validation strategy, and disbelieve it, and skeptically test it, and then ask someone else how they would challenge it, you have probably lied to yourself.
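On that "is the fused data informative?" question: one cheap, honest check is to cross-validate the same model with and without the extra source and compare. A sketch, with fabricated stand-ins for the physical and socio-economic feature blocks:

```python
# Did the fused data source actually help? Compare CV skill with and without it.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_phys = rng.normal(size=(800, 5))     # physical observables
X_socio = rng.normal(size=(800, 3))    # e.g., socio-economic indicators
y = X_phys[:, 0] + 0.8 * X_socio[:, 1] + 0.1 * rng.normal(size=800)

model = RandomForestRegressor(random_state=0)
base = cross_val_score(model, X_phys, y, cv=5).mean()
fused = cross_val_score(model, np.hstack([X_phys, X_socio]), y, cv=5).mean()
print(f"R^2 physics-only: {base:.2f}   fused: {fused:.2f}")
```

If "fused" doesn't beat "physics-only" under honest cross-validation, the new source wasn't informative for this question, no matter how exciting it sounded.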
And it really is that easy to fool yourself. Here's what it looks like, and I want to show you the right way first. You take in gappy data. Your sensors turn off and on, your spacecraft is rotating so you're not measuring things all the time, however it happened, and you have the idea: could we use physics and observations and statistics to fill in those gaps and make a continuous record? Because none of the downstream algorithms that take in this data can handle gappy data; they just segfault. And the gappy data is obviously wrong, so this must be better. If informative co-observations are speaking to what's in that gap, if they're carrying related information, it might not be exactly what you wanted, but it's related, and this is a totally valid thing to do. I don't know what was there, but I have the shadow of what was there, so I can get pretty close. That's okay, and if you cross-validate, you're all right. Another example: I take low-resolution observations from one satellite that covers everywhere, and I learn to predict the very high-resolution data that exists only in some places. So now I have high-resolution data everywhere. This works, and as long as you have other high-resolution data telling you something about the places where you didn't have it, this is a valid problem again. If you cross-validate and really challenge yourself, it's okay.

But here's the terrible thing. If you turn those validation boxes off, your system still works, and your answers look great. When people examine them, they will be amazed at how beautiful it is, if you chose the right algorithm and had enough data and did all those other things that take a lot of time. But it is plausible, not right. And then people get excited, and they hand these products to other scientists, and those scientists run off and draw conclusions from them, start learning from them, and fit power laws to the relationships in them. All of those products derived from what you gave them are wrong, but plausible. They're also not completely wrong; if they were completely wrong, people would say "this is ridiculous, I don't believe it." They're not. Humans use sniff tests, standard checks, and quick examinations to see if something seems right, because that usually tells you when something's wrong. Machine learning systems shortcut that: they give you something that looks reasonable, very quickly, in a way you as a human can't necessarily catch.

So there are two ways to protect yourself from this outcome. The first is rigorous cross-validation. If you really do test/train splits, shuffle your data all around, and honestly ask "how correct are you when you fill in this data?", you'll discover very quickly that it's doing a reasonable job, but not a right job, and you'll be able to estimate by how much. It also won't be completely wrong; it'll be somewhere in the middle. The second way is to really zoom in, study it in great detail, and then produce uncertainty maps. If you produce an uncertainty map that says exactly how confident anyone can be that each value is right or wrong, increasing and decreasing depending on what was filled in, you're being completely honest. That is completely legitimate, and you'll be horrified at how big that uncertainty gets when it has to fill in big regions. But that is the honest answer: you hand your product to a scientist along with the uncertainty map and say: don't ignore this; don't pretend it isn't there.
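The honest version of that cross-validation looks something like this: punch artificial gaps into data you do have, fill them, and score yourself only on the hidden values, masking whole contiguous blocks because real gaps aren't isolated points. A sketch, with plain linear interpolation standing in for your real gap-filler:

```python
# Score a gap-filler only on values it never saw, hidden in contiguous blocks.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2000)
series = np.sin(2 * np.pi * t / 200) + 0.1 * rng.normal(size=t.size)

def fill_gaps(y, mask):
    # Stand-in for your fancy ML gap-filler: linear interpolation.
    out = y.copy()
    out[mask] = np.interp(t[mask], t[~mask], y[~mask])
    return out

errors = []
for start in range(0, 2000, 400):        # hide five contiguous 100-sample blocks
    mask = np.zeros(t.size, dtype=bool)
    mask[start:start + 100] = True
    filled = fill_gaps(series, mask)
    errors.append(np.abs(filled[mask] - series[mask]).mean())

print("per-block fill error:", np.round(errors, 3))
# The spread across blocks is the seed of your uncertainty map.
```

Swap the interpolator for your real model; the discipline, never scoring yourself on points the model saw, is the part that matters.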
Okay, so that's where we're going to close for today, with one last idea. Sometimes, in, say, a colorization example, the very thing that would have advanced science is the thing your generative model will never produce. It will make a very realistic-looking human, and you will miss the science discovery, and you'll tell people it doesn't exist because it wasn't in your training data. That's a big problem. This summarizes everything we've covered today, but it's really the bottom line I want you to take away: machine learning is only as powerful as your validation. If you take out one test set at the end and then test yourself on that and call it your generalization error, good luck with that. I'm sure you did a perfect job selecting that test set to be exactly representative of everything in your entire problem, because that's what you're claiming. Be careful. But if you are careful, this is a great way to make discoveries, and that's what I love to do with our research. Our group does this a lot, hand in hand with scientists: not automating them away, but trying to help them find things, boil it down, and understand. Thanks.